[Shipped] Dynamic committee size

dungtran · 17 March 2021 17:47

Dynamic committee size

1. What problem are you solving?

Problem 1. Incognito currently employs a fixed committee set up. In practice, the number of validators will change over time, so should the size of the committee.

Some considerations:

The relationship between committee size and validators on the waiting list. If the committee size is much larger than number of candidates, there will not be enough backup validators. If the committee size is too small, candidates may have to wait a long time to be selected.
Number of committee members are rotated in each epoch. If it’s small then the candidate validators may wait too long, if it’s too big then the new committee at the beginning of each epoch may not be stable.

Problem 2. Issues arise if members of a shard committee suddenly go offline or disconnect from the majority group. The committee will not be able to collect ⅔+1 votes to commit new blocks or a new committee when the epoch is over. We need a mechanism which allows the system to automatically replace offline validators, or reduce committee size to ensure the liveness of the chain.

Problem 3. Validator availability may vary. Contributions towards building a durable chain may vary. Some validators be running out dated software, broken hardware, or drop out of their own intent, which may harm the stability of the system. Therefore, slashing is put in place to ensure balance, integrity and durability of the system.

2. What is our solution?

Preliminaries

Preparation node: Staking done → Assigned to a Shard → Start downloading shard database.

Candidate node: Finish downloading shard database → ready to join the committee.

Initial approaches

Problem 1.

The proposed protocol should respect the following properties:

Safety: maintain a balance between committee size and the number of candidates in the waiting list.
Stability: when adjusting the committee size, increase/decrease a reasonable number of committee members to ensure the stability of the system.

Case 1. Fewer than 30 committee members

Assign all candidates to committees, i.e allow empty waiting list
No lower bound of committee members

Case 2. 30-300 committee members

The size of the waiting list is 100-120% of the current committee size
1. Committee size is unchanged, 10% of committee members are replaced at each epoch.
If number of candidates in waiting list > 120% of the current committee size
1. Increase committee size by up to 15% of the current committee size. Keep the size of the waiting list at least 100% of the committee size.
2. While the committee size increases, don’t replace any current committee members.
If the number of candidates in waiting list < 100% of the current committee size
1. Reduce the committee size by up to 10%, until the size of the waiting list is equal to the size of the current committee.
2. While committee size reduces, replace 10% of current committee members (after reduced).

Case 3. Greater than 300 committee members

If the size of the waiting list is 120%-180% of the current committee size
1. Committee size is unchanged, 10% of committee members are replaced at each epoch.
If the number of candidates in the waiting list > 180% current committee size
1. Increase committee size by up to 10% of current committee size. Keep the size of the waiting list at least 150% of the committee size.
2. While the committee size increases, don’t replace any current committee members.
If the number of candidates in the waiting list < 120% of the current committee size
1. Reduce the committee size by up to 8% until the waiting list is equal to the size of the current committee.
2. While reducing the committee size, replace 10% of current committee members.
Note: In a situation where increasing the committee size causes a system overload and ⅔ +1 signatures cannot be obtained, reverse to the previous committee size, and stop increasing the committee size for 100 epochs (~ 2 weeks).

Problem 2.

If a significant number of current committee members are disconnected, the beacon is responsible for changing the committee of any shard with the proof of failure from shard nodes. In order to do that, beacon nodes have to keep track of the shard nodes during the time where they failed to produce new blocks.

The new committee flow (if not enough signatures are collected) is depicted in fig 1.

Fig 1. Update committee size when >=⅓ members are disconnected.

Problem 3.

When it comes to slashing, we prefer a light punitive approach – i.e. a slashing policy that does not deduct any PRV from the stake/reward of the validators. However the policy must effectively prevent misbehavior and unreliability.

% of missing votes when in the committee	Minimum # of epochs: node has to wait before rejoining the candidate list
11-20	2
21-30	4
31-40	8
41-50	16
51-60	32
61-70	64
> 70%	128

3. Which solutions do people resort to because this doesn’t exist yet

NA

4. Who are you?

Hi, i’m @dungtran, part of the research & development team at Incognito. We believe decentralization and autonomous operation are two of the most beautiful properties of blockchain technology. We work to help Incognito achieve those targets.

5. Why do you care?

The Incognito chain currently fixes parameters such as: committee size, conditions to be a candidate, and no slashing policy. We need a mechanism that allows the chain to automatically find the appropriate parameters for any given situation. This is important, if a system is to maintain stability and balance by itself.

6. What’s your plan? What’s your schedule?

The solutions detailed above will be complete in 1 month.

1st week	Literature review, find different approaches around these problems
2nd week	Propose appropriate approaches for the Incognito chain
3rd week	Criticism
4th week	Finalize the solution
Future weeks	Implementation if approved

7. What’s your budget?

1 researcher 2000 PRV/mo

8. Is there an existing conversation around this idea?

Proof-of-stake (PoS) is a consensus algorithm for blockchain networks. It is based on a randomly selected state of validators who “stake” the native network tokens by locking them into the blockchain to produce and approve blocks. However, validator misbehavior (whether intentional or not) is unavoidable. PoS blockchains take different approaches in handling misbehavior. These approaches can be classified into two groups:

Non punitive projects: Aion [1], Algorand [2], and Tezos [3] stand out as they do not punish delegators at any time, although Tezos does include a variant of slashing for validators in order to protect the network. Tezos’ non-punitive approach to delegating is to require an initial 21-day period before generating rewards (with Tezos offering a high stake).

Punitive projects: Cosmos [4], Livepeer [5] and ICON [6] have chosen a higher degree of severity to make both delegators and validators accountable for protocol violations made by the validator. Cosmos operates in a similar rewards environment as Tezos, despite having one of the highest double-signing slashing parameters.

References

[1] https://theoan.com/

[2] https://www.algorand.com/what-we-do/technology/algorand-protocol

[3] https://tezos.com/static/white_paper-2dc8c02267a8fb86bd67a108199441bf.pdf

[4] https://cosmos.network/cosmos-whitepaper.pdf

[5] https://github.com/livepeer/wiki/blob/master/WHITEPAPER.md

[6] http://docs.icon.foundation/ICON-Whitepaper-EN-Draft.pdf

raz · 15 April 2020 07:33

Very interesting proposal!

Josh_Hamon · 15 April 2020 17:50

I think this is a great place to extend some resources

semi · 16 April 2020 02:03

@dungtran COOLL proposal!

0x123 · 16 April 2020 04:14

Great read

According to this article on interoperability with Ethereum, the committee must be kept on an Ethereum smart contract. Will this proposal affect the operation/performance of the current Ethereum bridge (e.g., gas cost, storage, speed, etc)?

Also, can you give some intuition behind the numbers in your proposal (why split between < 300 and > 300 members)? What if the waiting list is too big (like 10x committee size)?

dungtran · 16 April 2020 10:54

Thank you for raising tough issues
We are working on the proposed solution such that the chain could keep a reasonable cost of using Ethereum bridge while strengthening the security. The staking amount parameter might be taken into account. Because the chain’s security is proportional to the stake amount, while the number of validators are not.

When committee size is of 300, and if the attacker has 25% of the global staking pool, the chance that any given random seed will lead to a sample favoring the attacker is extremely small (~ 10^(-26)).
If the committee size is about 150, this probability is ~ 4*10^(-11), it’s also very small indeed. With this committee size, the cost for withdrawal on Ethereum bridge is around 2mi gas, it’s still acceptable.

The proposed solution is trying to keep the waiting list in range 1x-2x of the committee size. If the waiting list is 10x of committee size, the system again could increase the staking amount to relax the incentive to be validators.

0xkumi · 20 April 2020 02:23

I have two questions @dungtran

For problem 2: it seems that Beacon is a single point of failure. Beacon nodes decide which validator in/out shard committee member. Who will decide for changing beacon committees?
For problem 3: how do we decide if a node misses a vote in a block? These vote numbers will be calculated in beacon? or shard?
And for both problem above, collecting votes from block of all members in p2p network is not trivial. How do we solve if we mistake to punish a node because of their votes not come on time?

ruler · 20 April 2020 03:54

Great post @dungtran. There are two point in my mind about Dynamic Committee Size:

Committee Auto Adjustment between epoch. After each epoch the algorithm automatically adjust the committee size of all shards and beacon, so the can balanced the number of waiting candidate as well as committee in shard. I think fixed committee size in one epoch time is still reasonable at early stage of development. Because Incognito’s epoch is relatively short, in fact it’s really short (just 3 hour), compared to 24 hours of Harmony.one and others (maybe days). So if we got an honest validator from candidate list. The probability that validator still honest in next three hour is high. It’s too little time to corrupt that validator during an epoch.
Dynamic committee size during an epoch: I think this would help slashing become more effectively. Because Incognito Chain is using PBFT, so committee is very important to identify how many signatures proposer need to get so as to commit a block. So when nodes try to kick out some validator out of committee, we need to adjust the committee size in order to continue producing block with PBFT. In my opinion, this problem is very rare because of the assumption: we got 2f + 1 honest node in our system.

So I think Point 1 should be solved first then Point 2 next. The reason why Point 1 should be solved first because:

Incognito Chain currently has a lot of waiting candidate in list, it’s very much unbalanced 176 node in committee and more than 1000 node in waiting list. If we balance these two number then we greatly enhance the network security. Of course this also push a lot of pressure on network bandwidth and synchronization.
Beside committee size is algorithmically changed in a secure way. Not just waiting for an external force.

I think Point 1 match @dungtran Problem 1 then Point 2 match Problem 2. I haven’t thought much about Problem 3.

ruler · 20 April 2020 04:14

In Problem 2, i have a question. Why don’t we let shard change committee itself then update to beacon later? This enhance shard independent ability also avoiding beacon single point of failure. Is there any other problem?

dungtran · 20 April 2020 04:20

Thank you for interesting questions.

The committee size is actually determined by the protocol, not by Beacon committee. The Beacon chain only stores the aggregated signature on its chain as a proof of missing-vote nodes.
If the beacon chain failed to commit new blocks for a while, it can base on this rule to auto replace the disconnected nodes.

In practice, the aggregated signature might be mistaken to include the signatures from some validators. However, we are doing statistical analysis to find the error rates. For example if some validators missed their votes in more than 30% of committed blocks while these rates are less than 10% for all other validators. If it’s the case, it’s safe to say that some validators don’t have fairly contribute to the robustness of the chain. The chain could somehow limit their appearance in the committee until they improve the performance. The protocol could base on the beacon on-chain db to make the decisions.

dungtran · 20 April 2020 04:52

If a shard auto updates its committees, then it has to wait until the beacon updates this new committee on the beacon chain, and other shards can acknowledge the update, i.e the system (beacon/shards) has the same view. Therefore, if the beacon chain auto updates the shard committee by the protocol, then we should save one step of submitting a new committee by a shard to the beacon chain.

ruler · 22 April 2020 03:19

However, on the other hands if we let beacon chain auto updates shard committee by protocol, shard has to wait for beacon to acknowledge there’s some problem in shard. Then beacon chain has to committed new block or a kind of message that beacon chain agree to change shard committee.
If we let’s shard change committee by itself, shard could continue to work more quickly. So there’s a tradeoff here, I think we should choose what benefit to maximize.

Shard update committee itself: Shard can continue to work quickly, but it will delay beacon and other to receive new committee shard list. In this case, how about, shard auto adjust his committee size, produce new block with slashing proof and committee size adjustment
Beacon update shard committee: Shard will have to wait for beacon update, this would delay block creation. But help network to acknowledge new shard committee quickly

dungtran · 23 April 2020 03:46

Cross-shard tx won’t be accepted by Beacon or other shards, since they have no idea about the new committee.
Per security perspective, beacon itself cannot process any tx. However, a shard committee can process tx, the system may introduce more risk when allowing committee able to change by themselves then they process their favor tx. Even updating new committee must respect the defined protocol, but if there is a chance to update to attacker’s “friendly” committee, they are willing create the trigger for committee change happen.

ruler · 24 April 2020 02:24

We can add constraint in block like this. Said, no cross-shard-tx in this kind of block or block producer will consider byzantine.

I totally agree with this point. There will be a risk if we think shard change committee via some kind of tx.
However, if we take a look back at the figure 1 above. Shard node some how send the best collected signatures to beacon node to demonstrate that block can’t be finalized and committee size must be shrink.
No matter shard update committee by itself or beacon. Shard nodes must provide some kind of proof that verifiable by beacon. Otherwise, a malicious validator with enough power to change committees anytime they want. So I think the point here is, what kind of proof that shard present to beacon. This proof must be verifiable by beacon, or even it will be challenged by other nodes in committee

dungtran · 24 April 2020 10:03

Each node submits to the beacon its best signature collection, beacon will choose the biggest one.

dungtran · 24 April 2020 10:22

Update to April 24th:

Let n be the total number of validators (candidate + committee)
Let m be the committee size
Let k be the number of validators owned by attacker

Assume that the probability of a validator to be selected into committee is uniform and independent, i.e this probability = 1/n.

Let P be the probability of k validators of the attacker will be in committee at the same time.
P = C(m-k, n-k)/C(m, n)
where C(x, y) is the total possible combinations of x items from the set of y items.

Assume that given some fixed n, k, we are working on finding the best m (committee size) such that P is minimized.

mesquka · 25 April 2020 04:29

So the node members submit their signature collections to the beacon committee to check? What’s the contents of the batch, just the signatures? Is there a window for a node to challenge? If so how? By submitting their partial signature? Is it possible to verify this post-hoc?

ruler · 27 April 2020 02:46

I think signatures collection might reflect about one’s availability. One problem is, this collection totally depends on the node which create it. Byzantine node can easily delete one’s signatures in its collection. In this case, Byzantine node somehow can manage to kick other node out. Then signature collection doesn’t reflect one’s availability any more. Do you think about some bad situation we might fall in? There are some I think of:

What if we have two signature collection of the same size, but the are not entirely the same?
What if there are two collections, a bigger (B) one and a smaller (S) one, but S is not subset of B? Some signatures in S didn’t appear in B.
What if there are two disjoint set of signatures?

Moreover, should we have challenge time? During this period other node can submit and challenge their signature collect. The purpose is to figure out byzantine node or get the best available signature collection.

dungtran · 27 April 2020 04:26

In the normal case, a node collects enough signatures (> 2/3n) then commits and forward block header + aggregated signatures to the beacon committee. Nodes which failed to collect enough signatures won’t commit or send anything to the beacon committee. However, after multiple consecutive failed to commit blocks (~25-30 blocktimes), node starts to send whatever it can collect to the beacon. Beacon based on this information to update the new committee in case.

ruler · 2 May 2020 02:52

@dungtran after researching for a while, I have some more issues.

Balance shards’ committee size
We need to consider the number of committee between shard, we should try to balance them or one shard will get much more committee than other shard. Which mean one shard is more secure than other shard. Random assignment validator into shard might get equal distribution. But with dynamic committee size approach, one shard can kick off many validators than other shard. This leads to an unbalance in shards committee size.
Slowly adaptive:

If we keep adding without swapping any node out. Byzantine nodes can stay long enough then slowly adaptive and corrupt this shard.

Imagine we have many nodes which enough to corrupt one shard but not the entire blockchain, and we want to put them into one shard.

Step 1: We stake them into our network. Thanks to random assignment, these nodes will be distributed to shard randomly.
Step 2: Then we choose a target shard, which happens to have most of my nodes assigned in (this could be equal, so we can choose any target shard)
Step 3: For nodes, which fall into other shard, will try to get out then re-stake again. Until nodes get into a particular target shard.

I can shut down this node, so this will be kick out of committee. Because of most of my stake are safe. Re-staking would be an easy job until our node get into a target shard.

In case, kick out of committee mean slashing the stake amount. Node can waiting for one epoch the re-staking. This could make slowly adaptive harder to achieve but still possible
The more we tried to get in target shard, the more likely the waiting list will be > 120% committee size. So none of my node in that committee won’t be kicked out.

In step 3: if nodes in our target shard can stay long enough then we just can wait for our nodes in other shard to finish their job and get out then re-stake.
Also notice that, unbalanced in shard committee-size would make this kind of attack more practical. Shard with smaller committee can be chosen as target easily.

Notice: the more stake at the beginning one have, the more reward he/she will get. Which make attackers have more PRV to conduct their attack.

In summary, we could encounter these problem:

Balancing shard committee size
How to fight against slowly adaptive attacker?
A proper slashing mechanism
A supportive validator assignment and committee swap