Tracking and summary of the deadlock and ZooKeeper communication interruption caused by HBase asynchronous queries [non-technical]

Machine room T and machine room Y have ten front-end machines in total. Machine room Y receives twice as many requests as T, and the machines are used mainly for data queries.

1) Steps taken for the hung (zombie) Tomcat processes

a. Reviewed the code and found that after a read-through, the DB data was not written back to the cache, so write-back code was added. However, a single machine handles only a few dozen requests per second and the load on HBase is very small, so this change turned out to have no effect.

b. Reviewed the code again and concluded that its use of HBase is identical to the dynamic code that had been running for several months, so the business code layer was considered fine. After printing the stack information, we concluded that the HBase client had deadlocked while waiting on a resource.

c. Downloaded the 0.94.2 patch, confirmed by analysis that it addresses the deadlock problem, and deployed the updated jar package.
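The write-back fix from step a can be sketched as follows. This is a minimal illustration, not the original code: the class `ReadThroughCache`, the in-memory map, and the `store` function standing in for the DB/HBase lookup are all hypothetical names introduced here.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical read-through cache; names and types are illustrative only.
final class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    // Stands in for the DB/HBase lookup (an assumption, not the original code).
    private final Function<K, Optional<V>> store;

    ReadThroughCache(Function<K, Optional<V>> store) {
        this.store = store;
    }

    Optional<V> get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);            // cache hit: no backend call
        }
        Optional<V> loaded = store.apply(key);     // cache miss: read through to the backend
        loaded.ifPresent(v -> cache.put(key, v));  // the write-back step that was missing
        return loaded;
    }
}
```

Without the `cache.put` line every miss goes to the backend again, which matters only when the backend is under real load; at a few dozen requests per second it would not explain a hang.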

In the second week, the Tomcat log began reporting Interrupted errors. The process was not dead, but a large number of queries timed out, taking up to 100 seconds, and firelog recorded 5000+ slow queries every 3 minutes.

2) Steps taken for the timeouts

a. We concluded that 0.94.2 does not fully solve the problem: it avoids the deadlock but causes the Interrupted exception. The 0.94.2 patch package prepared by liwei was then put online, but startup failed (the jar package lacked version information and could not be started).

b. Comparing the two machine rooms, we suspected a network problem in machine room Y. Pinging the HBase resources did not reveal any problems. That night, three servers in machine room T were stopped so that all of its load fell on the remaining two, making the request load comparable to that of Y. Machine room T then also showed anomalies and a large number of timeouts, so network problems were ruled out.

c. The next day, under pressure from the product side, the developers and DBAs were pulled together to work on the problem in a closed session. A tcpcopy environment was started for testing so that the problem could be reproduced as quickly as possible. Four plans were drawn up:

  1. 0.94.0 online patch

  2. Use tcpcopy to test the 0.94.2 Interrupted problem

  3. Remove the timeout from the thread pool, that is, stop using asynchronous queries; use a background thread to check the ZooKeeper watcher of the HBase client once every 2 minutes to see whether data can be obtained, and reset the ZooKeeper connection if there is a problem; set the retry count to 3, to avoid the default 10 retries whose wait doubles each time and leads to long queries

  4. Upgrade the zookeeper jar version
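The two mechanical parts of plan 3, the periodic health check and the lower retry count, can be sketched as below. This is an illustration under stated assumptions, not the code that was deployed: `ZkConnection`, `ZkWatchdog`, and `RetryMath` are hypothetical names, the probe and reset are left abstract, and the backoff is modelled as a pure doubling as the plan describes (the real client's backoff schedule may differ). In the HBase client the retry count is normally controlled by the `hbase.client.retries.number` setting.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical abstraction over the HBase client's ZooKeeper connection.
interface ZkConnection {
    boolean canReadData(); // e.g. try to read data through the watcher
    void reset();          // e.g. close and recreate the connection
}

final class ZkWatchdog {
    private final ZkConnection conn;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "zk-watchdog");
                t.setDaemon(true); // do not keep the JVM alive on shutdown
                return t;
            });

    ZkWatchdog(ZkConnection conn) {
        this.conn = conn;
    }

    // One health check; returns true if a reset was triggered.
    boolean checkOnce() {
        boolean healthy;
        try {
            healthy = conn.canReadData();
        } catch (RuntimeException e) {
            healthy = false; // a failing probe counts as an unhealthy connection
        }
        if (!healthy) {
            conn.reset();
        }
        return !healthy;
    }

    // Run the check every 2 minutes, as in plan 3.
    void start() {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow so that one failed reset does not cancel later checks.
            }
        }, 2, 2, TimeUnit.MINUTES);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}

final class RetryMath {
    // Total wait before giving up, assuming the pause doubles after every retry.
    static long totalWaitMillis(int retries, long basePauseMillis) {
        long total = 0;
        long pause = basePauseMillis;
        for (int i = 0; i < retries; i++) {
            total += pause;
            pause *= 2;
        }
        return total;
    }
}
```

Under this simplified doubling model, `RetryMath.totalWaitMillis(10, p)` is 1023 times the base pause `p`, while `RetryMath.totalWaitMillis(3, p)` is only 7 times `p`, which is why cutting the retry count bounds the worst-case query time so sharply.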

After trying the third plan, things finally returned to normal. It went online at 10 o'clock, and no problems had appeared by 11 o'clock. Department staff kept watching until 2 o'clock and saw no problems. The next day's statistics showed that 99.92% of requests completed in under 200 ms. Avoiding the asynchronous timeout task removed the conflict with the HBase client's default asynchronous calls, which solved the problem. Fundamental research is still needed to thoroughly understand the underlying principle.
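The difference between the two call styles can be sketched as follows. The original code is not shown in this write-up, so this assumes the asynchronous version submitted each query to a thread pool, waited with a timeout, and cancelled the task on timeout; `QueryStyles` and its method names are hypothetical. What is standard Java behaviour is that `Future.cancel(true)` interrupts the worker thread if the task is still running, so any blocking call inside the query sees an `InterruptedException`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class QueryStyles {
    // Asynchronous style: run the query on a pool thread and give up after timeoutMillis.
    static <T> T queryWithTimeout(ExecutorService pool, Callable<T> query,
                                  long timeoutMillis, T fallback) throws Exception {
        Future<T> future = pool.submit(query);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread mid-query
            return fallback;
        } catch (ExecutionException e) {
            throw new Exception("query failed", e.getCause());
        }
    }

    // Synchronous style (plan 3): call on the request thread and rely on the
    // client's own bounded retries instead of an outer timeout.
    static <T> T querySync(Callable<T> query) throws Exception {
        return query.call();
    }
}
```

With the synchronous style nothing outside the client interrupts a query in flight; the upper bound on latency then comes from the client's retry settings rather than from a pool timeout.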

To sum up, there were problems in four areas that need to be improved.

1. Network problems: traffic stress tests and tcpcopy tests in the different machine rooms were not done as early as possible.

2. Code logic problems: because the dynamic code had run for several months without problems and the read path of the new code did not differ from the old code, the business code was wrongly ruled out and the problem was attributed to the HBase client code.

3. Problem assessment: the severity of the problem and the timeout rate were not evaluated, which allowed the service to keep deteriorating.

4. Manpower: people should have been assigned to analysis and handling early, instead of calling for help only when there was no support left and complaints had come from senior management.

Reference: https://cloud.tencent.com/developer/article/1067411 HBase asynchronous query deadlock and zookeeper communication interruption problem tracking and summary [non-technical]-Cloud + Community-Tencent Cloud