Availability as high as 99.999%! High-availability architecture design of a payment system

Author: Pingzhong flag

1. Background

Internet applications and large enterprise applications are generally expected to run uninterrupted 7*24 hours, but truly uninterrupted operation is, as the saying goes, "harder than climbing to the sky". For this reason, application availability is usually measured in "nines", from three nines (99.9%) to five nines (99.999%).

Availability index    Unavailable time per year (minutes)
99.9%                 525.6
99.99%                52.56
99.999%               5.256
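
The downtime that each availability level allows follows from simple arithmetic. A minimal sketch, assuming the budget is measured over a 365-day year (the period and the function name are assumptions for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes(availability_percent: float,
                     period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the maximum unavailable time allowed by an availability target."""
    return (1 - availability_percent / 100) * period_minutes


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes(target):.3f} minutes per year")
```

Passing a different `period_minutes` (for example `30 * 24 * 60`) gives the budget for a month instead of a year.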

For an application whose functionality and data volume keep growing, maintaining high availability is not easy. To achieve it, the CreditEase payment system has explored and practiced a great deal in three areas: avoiding single points of failure, ensuring the high availability of the application itself, and coping with growth in transaction volume.

Excluding sudden failures of external dependencies, such as network problems or large-scale outages at third-party payment providers and banks, the service availability of the CreditEase payment system reaches 99.999%.

This article focuses on how to improve the availability of the application itself. Avoiding single points of failure and coping with transaction volume growth will be discussed in other articles of the series.

To improve application availability, the first thing to do is to avoid application failures as far as possible, but avoiding them completely is impossible. The Internet is a place where the "butterfly effect" occurs easily: any seemingly minor accident with a near-zero probability may happen and then be amplified without limit.

Everyone knows that RabbitMQ itself is very stable and reliable. The CreditEase payment system initially ran RabbitMQ on a single node, and it never failed, so everyone assumed it was unlikely to cause problems.

Then one day the physical host running this node suffered a hardware failure after a long period without maintenance. RabbitMQ could no longer provide service, and the system became unavailable instantly.

A fault in itself is not terrible. What matters most is finding and resolving it in time. The CreditEase payment system requires itself to detect faults within seconds and to diagnose and resolve them quickly, thereby reducing their negative impact.

2. The problem

Take history as a mirror. First, let us briefly review some of the problems the CreditEase payment system has encountered:

(1) When integrating a new third-party channel, a new developer, lacking experience, overlooked the importance of setting a timeout. This small detail caused all transactions in that third party's queue to block and also affected transactions on other channels;

(2) The CreditEase payment system is deployed in a distributed manner and supports grayscale release, so its environments and deployed modules are numerous and complex. At one point a new module was added. Because there are multiple environments and each environment runs two nodes, the database ran out of connections once the new module went online, which affected the functions of other modules;

(3) Another timeout problem: a third-party timeout exhausted all of the configured worker threads, leaving no threads to process other transactions;

(4) Third party A provides authentication, payment, and other interfaces. A sudden increase in the CreditEase payment system's transaction volume on one of these interfaces triggered DDoS protection on the side of third party A's network operator. The egress IP of a data center is usually fixed, so the network operator mistook the transactions from this egress IP for a traffic attack, and as a result both the authentication and payment interfaces of third party A became unavailable at the same time;

(5) Another database problem, also caused by a sudden increase in transaction volume. The colleague who created a sequence set its upper limit to 999,999,999, but the database field that stores the generated value is 32 characters long. While transaction volume was small, the generated value fit the 32-character field because the sequence had not yet gained extra digits. As transaction volume increased, the sequence quietly grew by a digit, and 32 characters were no longer enough to store the value.

Problems like these are very common in Internet systems and are well hidden, so knowing how to avoid them is very important.

3. The solution

Below we look at the changes the CreditEase payment system has made, from three aspects.

3.1 Avoid failures as much as possible

3.1.1 Design a fault-tolerant system

One example is rerouting. When paying, a user does not care which channel the money goes through, only whether the payment succeeds. The CreditEase payment system is connected to more than 30 channels. If a payment through channel A fails, the system dynamically reroutes it to channel B or C, so that the user's payment does not fail and the payment process is fault tolerant.
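
Rerouting can be sketched roughly as follows. This is an illustration only, not the CreditEase implementation: the `Outcome` values, the `send` callback, and the returned status strings are all hypothetical names. The one rule taken from the article (see A6 in the Q&A) is that a transaction whose result is unknown, such as a timeout, must not be rerouted, because the first attempt may still succeed.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional


class Outcome(Enum):
    SUCCESS = "success"
    DEFINITE_FAILURE = "definite_failure"  # channel confirmed that no money moved
    UNKNOWN = "unknown"                    # e.g. timeout: result cannot be determined


def pay_with_rerouting(order_id: str,
                       channels: List[str],
                       send: Callable[[str, str], Outcome]) -> Dict[str, Optional[str]]:
    """Try channels in order; move to the next one only after a definite failure."""
    for channel in channels:
        outcome = send(channel, order_id)
        if outcome is Outcome.SUCCESS:
            return {"status": "SUCCESS", "channel": channel}
        if outcome is Outcome.UNKNOWN:
            # Never reroute here: a second attempt could pay twice.
            return {"status": "PROCESSING", "channel": channel}
    return {"status": "FAILED", "channel": None}


def fake_send(channel: str, order_id: str) -> Outcome:
    """Stand-in for a real channel call: channel A always fails cleanly."""
    return Outcome.DEFINITE_FAILURE if channel == "A" else Outcome.SUCCESS


print(pay_with_rerouting("order-1", ["A", "B", "C"], fake_send))
```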

There is also fault tolerance for OOM, in the way Tomcat does it. Sooner or later the system will run out of memory. If some memory is reserved for the application itself at startup, then when an OOM occurs the application can catch the error and fall back on the reserve instead of going down.

3.1.2 "Fail fast" in certain links

The fail-fast principle says that when any step of the main process runs into a problem, the whole process should be terminated quickly and cleanly, rather than waiting until the negative impact appears.

A few examples:

(1) When the payment system starts, some queue information and configuration information must be loaded into the cache. If loading fails or the queue configuration is incorrect, request processing will fail later. The best way to handle this is to let the JVM exit immediately when loading fails, so that the system does not start in an unusable state;

(2) The maximum response time for real-time transaction processing in the payment system is 40s. Beyond 40s the system stops waiting, releases the thread, and tells the merchant that the transaction is being processed. The final result is delivered later by notification, or the business line actively queries for it;

(3) The CreditEase payment system uses redis as a cache database for functions such as real-time alarm instrumentation and duplicate checking. If a redis connection takes longer than 50ms, the redis operation is abandoned automatically. In the worst case, this operation adds 50ms to a payment, which is within the range the system allows.
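
The third example, giving a non-critical call a fixed time budget, can be sketched as below. This is a simplified illustration: `call_with_budget` and `slow_cache_lookup` are hypothetical names, and a sleeping function stands in for a stalled redis call. A real client would more likely rely on its own connection and socket timeouts.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_cache_pool = ThreadPoolExecutor(max_workers=4)  # shared, bounded pool for cache calls


def call_with_budget(operation, budget_seconds, fallback=None):
    """Run operation, but stop waiting after budget_seconds and return fallback."""
    future = _cache_pool.submit(operation)
    try:
        return future.result(timeout=budget_seconds)
    except FutureTimeout:
        # Best effort only: a call that is already running continues in the background.
        future.cancel()
        return fallback


def slow_cache_lookup():
    time.sleep(0.5)  # stands in for a cache call that has stalled
    return "cached-value"


start = time.monotonic()
result = call_with_budget(slow_cache_lookup, budget_seconds=0.05)
elapsed = time.monotonic() - start
print(result, f"waited {elapsed:.3f}s")
```

The caller gives up after roughly 50ms and continues without the cache result, so the payment path is delayed by at most the budget.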

3.1.3 Design a system with self-protection capabilities

Systems generally have third-party dependencies, such as databases and third-party interfaces. When developing a system, you need to be suspicious of third parties, so that a problem on their side does not set off a chain reaction and cause downtime.

(1) Split message queue

The CreditEase payment system provides merchants with many payment interfaces. Commonly used ones include quick payment, personal Internet banking, corporate Internet banking, refund, cancellation, bulk payment, bulk deduction, single payment, single deduction, voice payment, balance inquiry, ID card authentication, bank card authentication, and card secret authentication. There are more than 30 corresponding payment channels, including WeChat Pay, ApplePay, and Alipay, and hundreds of merchants are connected. To ensure across these three dimensions that different businesses, third parties, merchants, and payment types do not affect one another, the CreditEase payment system splits its message queues.

Figure: split diagram of part of the business message queues.
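
One way to picture the split is a separate queue per combination of payment type, channel, and merchant. The sketch below is an assumption about how such a split could look, not the article's actual queue layout: the naming scheme `pay.<type>.<channel>.<merchant>` is hypothetical, and in-memory deques stand in for broker queues.

```python
from collections import defaultdict, deque


def queue_name(payment_type: str, channel: str, merchant_id: str) -> str:
    """Build a queue name from the three isolation dimensions (hypothetical scheme)."""
    return f"pay.{payment_type}.{channel}.{merchant_id}"


queues = defaultdict(deque)  # in-memory stand-in for separate broker queues


def publish(payment_type: str, channel: str, merchant_id: str, message: dict) -> None:
    queues[queue_name(payment_type, channel, merchant_id)].append(message)


publish("quick", "wechat", "m001", {"order": "1"})
publish("refund", "alipay", "m001", {"order": "2"})
publish("quick", "wechat", "m002", {"order": "3"})

for name, backlog in sorted(queues.items()):
    print(name, len(backlog))
```

Because each combination has its own backlog, a blocked consumer on one queue does not hold up messages on the others.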

(2) Restrict the use of resources

Limiting resource usage is the most important point in designing a high-availability system, and also one that is easily overlooked. Resources are finite, and overusing them will naturally bring the application down. For this reason, the CreditEase payment system has done the following homework:

•Limit the number of connections

As a distributed system scales out horizontally, the number of database connections must be considered rather than maximized without limit. Database connections are finite, so all modules need to be considered together, especially the increase that horizontal scaling brings.
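
A quick worst-case calculation shows how one extra module can exhaust the database. All numbers below are made up for illustration (3 environments, 2 nodes each, pools of 15 connections, a database limit of 950):

```python
def total_connections(environments: int, nodes_per_env: int,
                      modules: int, pool_size: int) -> int:
    """Worst-case connection count if every module instance fills its pool."""
    return environments * nodes_per_env * modules * pool_size


def fits_database(db_max_connections: int, **deployment: int) -> bool:
    """Check whether the worst case stays within the database limit."""
    return total_connections(**deployment) <= db_max_connections


before = dict(environments=3, nodes_per_env=2, modules=10, pool_size=15)
after = dict(before, modules=11)  # one new module goes online

print(total_connections(**before), fits_database(950, **before))
print(total_connections(**after), fits_database(950, **after))
```

Running a check like this before each release makes the global connection budget explicit instead of leaving it to each module's own pool setting.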

• Limit the use of memory

Excessive memory usage causes frequent GC and OOM. It mainly comes from the following two sources:

A: Collections that grow too large;

B: Objects that are no longer needed but are never released. For example, objects placed in a ThreadLocal are not reclaimed until the thread exits.

• Limit thread creation

Creating threads without limit eventually makes the system uncontrollable, especially when thread creation is hidden deep in the code.

When the system's SY value (the share of CPU time spent in the kernel) is too high, Linux is spending more time on thread switching. In Java, the main cause is that too many threads have been created and they keep moving between blocked (waiting for locks or IO) and running states, which generates a large number of context switches.

In addition, a Java application uses physical memory outside the JVM heap when it creates threads, so too many threads also consume too much physical memory.

Threads are best created through a thread pool, which avoids the context switching caused by too many threads.
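
The idea carries over to any language with a thread pool. A minimal sketch in Python (the production system is Java; `MAX_WORKERS` and `handle` are hypothetical) that pushes 100 tasks through a pool capped at 8 threads:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # hypothetical cap; tune to the host and workload

seen_threads = set()


def handle(transaction_id: int) -> int:
    """Record which pool thread ran the task, then simulate blocking I/O."""
    seen_threads.add(threading.current_thread().name)
    time.sleep(0.01)
    return transaction_id


with ThreadPoolExecutor(max_workers=MAX_WORKERS, thread_name_prefix="pay") as pool:
    results = list(pool.map(handle, range(100)))

print(len(results), "tasks handled by", len(seen_threads), "threads")
```

However many tasks are submitted, the number of threads never exceeds the cap, so context switching and off-heap memory stay bounded.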

•Limit concurrency

Anyone who has built a payment system knows that some third-party payment companies limit each merchant's concurrency. The concurrency a third party grants is assessed from actual transaction volume. If concurrency is not controlled and all transactions are sent to the third party at once, the third party will simply reply "please reduce the frequency of submission".

Therefore, special attention is needed in both the system design phase and the code review phase to keep concurrency within the range the third party allows.
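
A semaphore is one common way to enforce such a cap. The sketch below assumes a limit of 3 concurrent requests (`THIRD_PARTY_LIMIT` is a made-up figure) and uses a short sleep in place of the remote call:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

THIRD_PARTY_LIMIT = 3  # hypothetical concurrency agreed with the third party
_slots = threading.BoundedSemaphore(THIRD_PARTY_LIMIT)
_lock = threading.Lock()
in_flight = 0
peak_in_flight = 0


def send_to_third_party(order_id: int) -> int:
    """Send one request, waiting for a free slot if the limit is reached."""
    global in_flight, peak_in_flight
    with _slots:  # blocks while THIRD_PARTY_LIMIT requests are already in flight
        with _lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.01)  # stands in for the remote call
        with _lock:
            in_flight -= 1
    return order_id


with ThreadPoolExecutor(max_workers=10) as pool:
    done = list(pool.map(send_to_third_party, range(30)))

print("peak concurrent requests:", peak_in_flight)
```

Even with 10 worker threads, the recorded peak never exceeds the agreed limit.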

As mentioned, the CreditEase payment system has made changes in three areas to achieve availability. The first, avoiding failures as much as possible, is covered above. The next two follow.

3.2 Find faults in time

Failures arrive like raiders entering a village: without warning. When the first line of defense, prevention, is breached, how do we bring up a second line in time and discover failures so that availability is preserved? This is where the alarm and monitoring system comes in. In a car without a dashboard, you cannot know the speed, the fuel level, or whether the turn signal is on, and however skilled the "old driver" is, that is quite dangerous. Likewise, a system needs to be monitored, and ideally it raises an alarm before danger arrives, so that the fault can be resolved before it causes real risk.

3.2.1 Real-time alarm system

Without real-time alarms, uncertainty about the system's running state can lead to disasters that cannot be quantified. The CreditEase payment system's requirements for its monitoring system are as follows:

• Real-time: monitoring at second-level granularity;
• Comprehensive: all system services are covered, leaving no blind spots;
• Practical: early warnings are divided into multiple levels, so monitoring staff can easily make accurate decisions based on severity;
• Diverse: early warnings are delivered in both push and pull modes, including text messages, emails, and a visual interface, so monitoring staff can find problems in time.

Alarms are mainly divided into stand-alone alarms and cluster alarms, and the CreditEase payment system is a cluster deployment. Real-time early warning relies mainly on real-time statistical analysis of the instrumentation data from the business systems, so the main difficulty lies in the data instrumentation and the analysis system.

3.2.2 Instrumentation data

To achieve real-time analysis without affecting the response time of the transaction system, each module of the CreditEase payment system writes instrumentation data to redis in real time. The data is then aggregated into the analysis system, which analyzes it and raises alarms according to rules.

3.2.3 Analysis system

The hardest part of the analysis system is deciding the business alarm points: for example, which alarms must be raised the moment they appear and which only need attention. The analysis system is described in detail below:

(1) System operation architecture

(2) System operation process

(3) System business monitoring point

The business monitoring points of the CreditEase payment system have been accumulated bit by bit during daily operation, and fall into two major categories: the alarm type and the attention type.

A: Alarm type

• Early warning of network abnormalities;
• Early warning of single orders not completed within the time limit;
• Early warning of real-time transaction success rate;
• Early warning of abnormal status;
• Early warning of non-return;
• Early warning of failure notification;
• Early warning of check inconsistencies;
• Early warning of special status;

B: Attention type

• Early warning of abnormal transaction volume;
• Early warning of transaction volume exceeding 5 million;
• Early warning of SMS backfill timeout;

3.2.4 Non-business monitoring points

Non-business monitoring points mainly refer to monitoring from the operations and maintenance perspective, including networks, hosts, storage, and logs. The details are as follows:

(1) Service availability monitoring

Use the JVM to collect Young GC/Full GC counts and durations, heap memory, the stacks of the top 10 most time-consuming threads, and other information, including the length of the cache buffer.

(2) Flow monitoring

The Agent monitoring agent is deployed on each server to collect traffic conditions in real time.

(3) External system monitoring

Observe whether third parties or the network are stable through periodic probing.

(4) Middleware monitoring

•For the MQ consumption queue, use RabbitMQ script detection to analyze the queue depth in real time;
•For the database part, install the plug-in xdb to monitor database performance in real time.

(5) Real-time log monitoring

Collect the distributed logs through rsyslog, then monitor and analyze them in real time in the analysis system. Finally, present the results to users on a purpose-built visualization page.

(6) System resource monitoring

Use Zabbix to monitor the host's CPU load, memory usage, upstream and downstream traffic of each network card, read and write rates of each disk, read and write times (IOPS) of each disk, and disk space utilization.

The above is what the real-time monitoring system of the CreditEase payment system does. It is divided into two parts: business point monitoring and operations monitoring. Although the system is deployed in a distributed manner, every early warning point responds within seconds. There is a further difficulty with business alarm points: some alarms are not necessarily a problem when only a few are reported, but are a problem when many are reported. In other words, quantitative change leads to qualitative change.

Take network abnormalities as an example. A single occurrence may be due to network jitter, but if many transactions are affected, it is time to check whether the network really has a problem. Examples of network abnormality alarms in the CreditEase payment system are as follows:

•Single-channel network anomaly early warning: 12 consecutive network anomalies on channel A within 1 minute trigger the early warning threshold;
•Multi-channel network anomaly early warning 1: within 10 minutes, 3 network anomalies per minute in consecutive minutes, involving 3 channels, trigger the early warning threshold;
•Multi-channel network anomaly early warning 2: within 10 minutes, a total of 25 network anomalies involving 3 channels trigger the early warning threshold.
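
Threshold rules like these can be evaluated with a sliding window over recent anomaly events. The sketch below covers the single-channel rule and multi-channel rule 2 only, and makes some simplifying assumptions: events arrive in time order, "12 consecutive" is treated as "12 within the last minute", and "involving 3 channels" is read as "at least 3 distinct channels". The class and rule names are hypothetical.

```python
from collections import deque


class NetworkAnomalyMonitor:
    """Sliding-window counters for the single-channel rule and multi-channel rule 2."""

    def __init__(self):
        self.events = deque()  # (timestamp_seconds, channel), oldest first

    def record(self, timestamp: float, channel: str) -> list:
        """Store one anomaly and return the names of the rules that now fire."""
        self.events.append((timestamp, channel))
        # Drop anything older than the longest window (10 minutes).
        while self.events and self.events[0][0] <= timestamp - 600:
            self.events.popleft()

        alarms = []
        last_minute = [c for t, c in self.events if t > timestamp - 60]
        if last_minute.count(channel) >= 12:
            alarms.append(f"single-channel:{channel}")
        channels = {c for _, c in self.events}
        if len(self.events) >= 25 and len(channels) >= 3:
            alarms.append("multi-channel-2")
        return alarms


monitor = NetworkAnomalyMonitor()
fired = []
for i in range(12):
    fired = monitor.record(i * 4, "A")  # 12 anomalies on channel A within 44 seconds
print(fired)
```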

3.2.5 Log Recording and Analysis System

For a large-scale system, recording and analyzing a large volume of logs every day is difficult. The CreditEase payment system handles an average of 2 million orders per day, and each transaction flows through more than a dozen modules. Assuming each order produces 30 log entries, the daily log volume is enormous.

Log analysis in the CreditEase payment system serves two purposes: real-time warning of abnormal logs, and providing order trajectories for operations staff.

(1) Real-time log warning

Real-time log warning covers all real-time transaction logs: it captures lines containing the keywords Exception or Error in real time and raises an alarm. The advantage is that any abnormal behavior in the code is discovered immediately. The CreditEase payment system handles real-time log warnings by first collecting logs with rsyslog, then capturing them in real time in the analysis system, and finally issuing real-time warnings.
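
The keyword capture step can be sketched with a regular expression. This is only an illustration of the matching logic: the sample lines are invented, and in the setup described above the input would come from the logs that rsyslog has collected.

```python
import re

# Matches words that end in "Exception" or "Error", e.g. SocketTimeoutException.
ALERT_PATTERN = re.compile(r"\b\w*(?:Exception|Error)\b")


def scan(lines):
    """Yield (line_number, matched_keyword) for every line that should raise an alarm."""
    for number, line in enumerate(lines, start=1):
        match = ALERT_PATTERN.search(line)
        if match:
            yield number, match.group(0)


sample = [
    "2016-07-22 18:15:00.512 INFO order accepted",
    "2016-07-22 18:15:01.020 WARN java.net.SocketTimeoutException: Read timed out",
    "2016-07-22 18:15:01.300 ERROR database Error while saving order",
]
print(list(scan(sample)))
```

The match is case-sensitive, so a log level written as `ERROR` does not trigger an alarm by itself; only the keywords as spelled in the article do.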

(2) Order trajectory

For a trading system, it is essential to follow the status flow of an order in real time. The CreditEase payment system originally recorded order trajectories in a database, but after running for some time, order volume rose sharply and the database table grew too large to maintain.

Today, each module of the CreditEase payment system prints the order trajectory to its log, in a format that mirrors the database table structure. Once the logs are printed, rsyslog collects them, and the analysis system captures the formatted logs in real time, parses them, stores them in the database on a daily basis, and displays them on the operators' visual interface.

The log printing specifications are as follows:

2016-07-22 18:15:00.512||pool-73-thread-4||channel adapter||channel adapter-after sending three parties||CEX16XXXXXXX5751||16201XXXX337||||||04||9000||【 Settlement platform message] Processing||0000105||98XX543210||GHT||03||11||2016-07-22 18:15:00.512||Zhang Zhang||||01||tunnelQuery||true|| ||Pending||||||8cff785d-0d01-4ed4-b771-cb0b1faa7f95||10.999.140.101||O001||||0.01||||||||http://10.100.444.59: 8080/regression/notice||||240||2016-07-20 19:06:13.000xxxxxxx||2016-07-22 18:15:00.170||2016-07-22 18:15:00.496xxxxxxxxxxxxxxxxxxxx|| 2016-07-2019:06:13.000||||||||01||0103||111xxxxxxxxxxxxxxxxxxxxxxxxx||8fb64154bbea060afec5cd2bb0c36a752be734f3e9424ba7xxxxxxxxxxxxxxxxxxxx|211|2298818812424|242d|22xxxxxxxxxx||9bc195aedd35a47|21 ||||||6xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx010||||||||||
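
Because the columns are separated by `||`, a line in this format can be split back into fields with very little code. In the sketch below, the names given to the first four columns (`timestamp`, `thread`, `module`, `step`) are guesses for illustration, since the article does not list the column names, and the sample is a shortened copy of the line above.

```python
FIELD_SEPARATOR = "||"
# Names for the leading columns are guesses; the article does not list them.
LEADING_FIELDS = ("timestamp", "thread", "module", "step")


def parse_track_line(line: str) -> dict:
    """Split one order-track line into named leading fields plus the remaining columns."""
    columns = [column.strip() for column in line.split(FIELD_SEPARATOR)]
    if len(columns) < len(LEADING_FIELDS):
        raise ValueError("line has fewer columns than expected")
    record = dict(zip(LEADING_FIELDS, columns))
    record["rest"] = columns[len(LEADING_FIELDS):]
    return record


line = ("2016-07-22 18:15:00.512||pool-73-thread-4||channel adapter||"
        "channel adapter-after sending three parties||CEX16XXXXXXX5751||16201XXXX337||||||04")
record = parse_track_line(line)
print(record["timestamp"], record["thread"], record["module"], len(record["rest"]))
```

Empty columns are kept as empty strings, so each value stays at the same position as in the table structure.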

The brief log visualization trace is as follows:

In addition to the above two points, the log recording and analysis system also provides downloading and viewing of transactions and response messages.

3.2.6 7*24 hours monitoring room

The alarm items described above reach operators in two ways, push and pull: one is push via SMS and email, the other is report display. In addition, because a payment system is more critical than most other Internet systems, the CreditEase payment system staffs a 7*24-hour monitoring room to ensure the safety and stability of the system.

3.3 Deal with faults in time

After a failure occurs, especially in production, the first thing to do is not to find its cause but to handle it as quickly as possible to keep the system available. The common faults of the CreditEase payment system and their handling measures are as follows:

3.3.1 Automatic repair

Most common failures of the CreditEase payment system are caused by instability at third parties. In this case, the system reroutes automatically, as described above.

3.3.2 Service degradation

Service degradation means shutting down certain functions when a failure occurs and cannot be repaired quickly, so that core functions remain usable. When a merchant runs a promotion and its transaction volume becomes too large, the CreditEase payment system adjusts that merchant's traffic in real time, degrading service for that merchant so that other merchants are not affected. There are many scenarios like this. The specific service degradation functions will be introduced in a later article in the series.

4. Q&A

Q1: Can you share the details of the RabbitMQ outage and how it was solved?

A1: The RabbitMQ outage is what prompted us to think about system availability. RabbitMQ itself did not go down (RabbitMQ is still very stable). It was the physical machine hosting RabbitMQ that went down. The problem was that RabbitMQ was deployed as a single point at the time, and everyone assumed RabbitMQ would not go down, overlooking the machine it ran on. Our conclusion is that no part of the business may be a single point, including application servers, middleware, and network equipment. Nor is it enough to consider each single point in isolation: for example, the entire service is deployed in duplicate and then A/B tested, and of course there are also dual data centers.

Q2: Are development and operations a single team at your company?

A2: Our development and operations teams are separate. Today's sharing is about overall system availability, so it leans toward development, with some operations content as well. I have witnessed the journey of the CreditEase payment system first-hand.

Q3: Do you use Java for the entire backend? Have you considered other languages?

A3: Most of our current systems are written in Java, with a few in Python, PHP, and C++. It depends on the type of business. Java suits us best at this stage, and we may consider other languages as the business expands.

Q4: On being suspicious of third-party dependencies: can you give a concrete example of how to do it? What if a third party is completely unavailable?

A4: Systems generally have third-party dependencies, such as databases and third-party interfaces. When developing a system, you need to be suspicious of third parties, so that a problem on their side does not set off a chain reaction and cause downtime. Everyone knows that once a problem occurs in a system, it snowballs. For example, if we had only one scan-code channel, there would be nothing we could do when that channel had a problem. So we are suspicious from the start. First, we connect to multiple channels, and if an abnormality occurs, the real-time monitoring system raises an alarm and then switches the routing channel automatically to keep the service available. Second, we split asynchronous messages by payment type, merchant, and transaction type, so that an unpredictable abnormality in one type of transaction does not affect the others. It is like a highway with multiple lanes, where the fast and slow lanes do not affect each other. The overall idea is fault tolerance + splitting + isolation, applied case by case.

Q5: After a payment times out, or when there are network problems, can money be paid while the order is lost? How do you handle disaster recovery and data consistency? Do you replay logs and repair data?

A5: The most important thing in payments is safety, so we handle order status conservatively. Orders that hit a network abnormality are set to a processing status, and eventual consistency with the bank or third party is reached by actively querying or passively receiving notifications. Besides order status, there is also the matter of response codes: banks and third parties reply with response codes, and the translation from response code to order status must be equally conservative, so that problems such as overpayment or underpayment cannot occur. In short, the safety of funds comes first, and all strategies follow the whitelist principle.
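
The whitelist principle for response codes can be sketched as a mapping in which only explicitly listed codes change the order status. The codes below are invented for illustration; real codes differ for every bank and third party.

```python
# Hypothetical response codes; real codes differ for every bank and third party.
KNOWN_SUCCESS = {"0000"}
KNOWN_FAILURE = {"1001", "1002"}  # codes confirmed to mean "no money moved"


def order_status(response_code):
    """Map a response code to an order status; anything not whitelisted stays PROCESSING."""
    if response_code in KNOWN_SUCCESS:
        return "SUCCESS"
    if response_code in KNOWN_FAILURE:
        return "FAILED"
    # Timeouts (None) and unrecognised codes are resolved later by query or notification.
    return "PROCESSING"


for code in ("0000", "1001", "9999", None):
    print(code, "->", order_status(code))
```

An unknown code never marks an order as failed, so the system cannot resend a payment that may in fact have gone through.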

Q6: As mentioned earlier, if a payment channel times out, the routing strategy sends the transaction to another channel. From the channel diagram, the channels are all different payment methods, such as Alipay or WeChat Pay. If I only want to pay with WeChat Pay, why not retry instead of switching to another channel? Or does "channel" actually mean a request node?

A6: First, rerouting is not allowed after a timeout, because with a socket timeout we cannot tell whether the transaction reached the third party or whether it succeeded or failed. If it did succeed and we retried and succeeded again, we would have paid twice, and the company cannot accept that loss of funds. Second, routing depends on the business type. For a single collection or payment transaction, the user does not care which channel the money goes through, so it can be routed. For a scan-code channel, a user who scans with WeChat will certainly pay through WeChat. But we have many intermediary channels through which WeChat payments are sent, so we can route among different intermediary channels while the user still pays with WeChat.

Q7: Can you give an example of the automatic repair process? What are the details, from detecting instability to rerouting?

A7: Automatic repair means fault tolerance through rerouting. This is a very good question: once instability is detected, the system decides whether to reroute. Before rerouting, it must be certain that the current transaction did not succeed, otherwise funds would be overpaid or overcharged. Our system currently reroutes in two ways: after the fact and during the transaction. After the fact: if the real-time early warning system finds a channel unstable within 5 minutes, transactions from then on are routed to other channels. During the transaction: we analyze the failure response code returned for each order. The response codes have been sorted out in advance, and a transaction is rerouted only when its code clearly indicates it can be resent. I have listed only these two points. There are many other business points that I will not elaborate on for reasons of space, but the overall idea is that there must be a real-time in-memory analysis system that makes decisions in seconds. This system must be fast, and it combines real-time analysis with offline analysis for decision support. Our real-time second-level early warning system does exactly that.

Q8: Are merchant promotions regular? How much do peaks during promotions differ from normal? Do you run technical drills? What are the priorities for degradation?

A8: For merchant promotions, we usually keep in contact with merchants in advance to learn the time and volume of the promotion, and then prepare specifically for it. The promotion peak is very different from normal times: promotional traffic is usually concentrated within about 2 hours. For example, sales of some wealth management products are concentrated within one hour, so the peak is very high. For technical drills, we learn the merchant's expected volume, estimate the system's processing capacity, and then rehearse in advance. Degradation priority is mainly per merchant. Merchants connect to us for many payment scenarios, including wealth management, collection and payment, quick payment, and scan code, so our overall principle is that different merchants must not affect one another: one merchant's promotion must not affect other merchants.

Q9: How are the logs collected by rsyslog stored?

A9: This is a good question. At the beginning, our logs, that is, the order trajectory logs, were recorded in a database table. We found that an order passes through many modules, so one order produces about 10 trajectory records. At 4 million transactions a day, the database table becomes a problem: even if it is split, it affects database performance, and since this is an auxiliary function it should not do so. We then found that writing logs is cheaper than writing to the database, so we print the real-time log in tabular form to disk. Because this contains only the real-time log, its volume is not large, and it sits in a fixed directory on the log server. Since the logs are produced on distributed machines, they are collected to a central location, which is stored via a mount. A program written by a dedicated operations team then parses these tabular logs in real time, and the results are displayed on the operations page, so the order trajectory that operations staff see is almost real-time. The storage you are asking about is actually not a problem, because we separate real-time logs from offline logs, and offline logs are rotated after a certain period and eventually deleted.

Q10: How do system monitoring and performance monitoring work together?

A10: As I understand it, system monitoring includes system performance monitoring. Performance monitoring is one part of overall system monitoring, so there is no coordination problem. System performance monitoring has multiple dimensions, such as the application level, middleware, and containers.

Reference: https://cloud.tencent.com/developer/article/1447245