1. The origin of the problem
Taobao search blog http://www.searchtb.com/2011/01/zookeeper-research.html mentions that Paxos is the soul of zookeeper
There is an article titled " Zookeeper full analysis-Paxos as the soul " as the title, which is considered to be the basis of zookeeper:
Google's Chubby and Apache's Zookeeper are all based on its theory. Paxos is considered to be the only distributed consensus algorithm so far, and other algorithms are improvements or simplifications of Paxos. There is a question to mention. Paxos has a premise: there is no Byzantine general problem. That is to say, Paxos can only be established in a trusted computing environment, and this environment will not be destroyed by intrusion.
The Paxos algorithm is a message-passing consensus algorithm proposed by Leslie Lamport in 1990. The Paxos algorithm solves the problem of how a distributed system can agree on a certain value (resolution).
The part-time parliament Paxos Made Simple describes the execution process of the Paxos algorithm as follows:
There are two stages on the wiki, but three stages in the paper, and by default there is a proposer equivalent to the leader. The heroes listed the third stage (http://www.wuzesheng.com/?p=2724):
3. Learn stage:
Zookeeper uses org.apache.zookeeper.server.quorum.FastLeaderElection as its default election algorithm
This article http://csrd.aliapp.com/?p=162&replytocom=782 directly uses the paxos implementation as the title, mentioning that zookeeper uses the paxos algorithm when electing the leader (mainly fast paxos)
I happened to see someone rebutting below:
Wei Jiangwen: "
FastLeaderElection is not Paxos at all, nor is it an implementation of Fast Paxos. The FastLeaderElection source code is far from Paxos's paper.
There is also a leader election problem in Paxos and FastPaxos algorithms.
For zookeeper, FastLeaderElection is just equivalent to leader election in Paxos.
2. Data verification
Okay, check the information, analyze the source code and start the investigation
The first is Wei Jiangwen's rebuttal :
There is a very common misunderstanding that the leader election algorithm in zookeeper is paxos or fast paxos. The leader election algorithm is not paxos or fast paxos, please consider the following facts:
Then the author's three comparisons
1) Mailing list
Our protocol instead, has only two phases, just like a two-phase commit protocol. Of course, for Paxos, we can ignore the first phase in runs in which we have a single proposer as we can run phase 1 for multiple instances at a time, which is what Ben called previously Multi-Paxos, I believe. The trick with skipping phase 1 is to deal with leader switching.
We made a few interesting observations about Paxos when contrasting it to Zab, like problems you could run into if you just implemented Paxos alone. Not that Paxos is broken or anything, just that in our setting, there were some properties it was not giving us . Some people still like to map Zab to Paxos, and they are not completely off, but the way we see it, Zab matches a service like ZooKeeper well.
Zk's distributed consensus algorithm has a name called Zab
We use an algorithm that shares some of the character- istics of Paxos, but that combines transaction logging needed for consensus with write-ahead logging needed for data tree recovery to enable an efficient implementa- tion.
3. leader election analysis
In my understanding, first of all, the paxos algorithm cannot be used in elections. Whether the president is selected in paxos or the leader is selected by zk, there is a difference between making a proposal for most people to agree. Only by selecting the master can there be no concurrent proposal conflicts with multiple proposers
Who is going to make a proposal as the proposer? It is the premise for the paxos algorithm to proceed. When a proposal is made to allow most followers to agree, an algorithm similar to paxos can be used to achieve consistency. Zookeeper uses the Zab algorithm to achieve consistency.
Zk's master selection strategy:
there can be at most one leader (proposer) at any time, and we guarantee this by making sure that a quorum of replicas recognize the leader as a leader by committing to an epoch change. This change in epoch also allows us to get unique zxids since the epoch forms part of the zxid.
Each server has an id, and when a submitted transaction is received, it has a zxid. As the updated data changes, the transaction number increases, and the server id is different. First choose the leader with the largest zxid. If the zxid cannot be compared, choose the leader with the largest server id.
The zxid contains an epoch number. The epoch indicates the period when a server is the leader, and it increases with the birth of a new leader.
Look at the code again:
4. analysis of zookeeper data update principle
After understanding the method of master selection, let's understand the method of synchronizing data. The synchronization data uses the Zab protocol: Zookeeper Atomic broadcast protocol, which is a protocol similar to two-phase submission:
The difference with paxos is that leaer is sent to all followers, not most, and all followers must confirm and notify the leader, not most.
Guarantee mechanism: Two transactions broadcast in sequence, T and Tʹ, T must be submitted before Tʹ takes effect. If one server submits T and Tʹ, all other servers must also submit T before Tʹ.
5. the leader's exploration
In order to solve the problem of leader crash and avoid multiple leaders causing transaction chaos, the Zab algorithm guarantees:
1. When a new transaction is opened, the leader must submit all transactions submitted during the last epoch
2. At no time will two leaders gain enough supporters at the same time.
The starting state of a new leader requires the agreement of most servers
The third role in zk, the difference between observer and follower is that they have no right to vote. It is mainly 1. Existence for the read request scalability of the system 2. To meet the deployment requirements of multiple computer rooms, the central computer room deploys leader and follower, and other computer rooms deploy observers, and the read configuration is first read locally.
I think zookeeper can only be said to be affected by the paxos algorithm, the role division is similar, the proposal is similar in the way of passing, and the implementation is simpler and more intuitive. The selection of the leader is based on voteid (server-id) and zxid for size priority sorting, and information synchronization uses two-phase commit. The leader submits the transaction and updates the status after obtaining all the consent of the follower. The observer role is to increase system throughput and satisfy cross-machine room deployment.
 Reed, B., & Junqueira, FP (2008). A simple totally ordered broadcast protocol. 2.Workshop on Large-Scale Distributed Systems and Middleware (LADIS 2008). Yorktown Heights, NY: ACM. ISBN: 978-1 -60558-296-2.  Lamport, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 1825.  F. Junqueira, Y. Mao, and K. Marzullo. Classic paxos vs. fast paxos: caveat emptor. In Proceedings of the 3rd USENIX/IEEE/IFIP Workshop on Hot Topics in System Dependability (HotDep.07). Citeseer, 2007.
O'Reilly.ZooKeeper.Distributed process coordination.2013
 http://agapple.iteye.com/blog/1184023 zookeeper project use some summary