A Rational View on Decentralized Computing Power Networks
TL;DR
1. Distributed Computing Power—Large Model Training
When we discuss applying distributed computing power to training, we generally focus on the training of large language models. The main reason is that training small models does not require much computing power; going distributed just to handle data privacy and a pile of engineering problems is not cost-effective, and those needs are better solved directly in a centralized way. Large language models, by contrast, have a huge demand for computing power and are now in the initial stage of an explosion. From 2012 to 2018, the compute demand of AI doubled roughly every 4 months, and our judgment is that the next 5-8 years will still see huge incremental demand.
While the opportunity is huge, the problems also need to be seen clearly. Everyone knows the market is enormous, but where are the specific challenges? Identifying who can target these problems, rather than blindly entering the game, is the core of judging the outstanding projects in this track.
(NVIDIA NeMo Megatron Framework)
1. Overall training process
Take training a large model with 175 billion parameters as an example. Due to the huge size of the model, it needs to be trained in parallel on many GPU devices. Suppose there is a centralized computer room with 100 GPUs and each device has 32GB of memory.
This process involves a large amount of data transfer and synchronization, which may become a bottleneck for training efficiency. Therefore, optimizing network bandwidth and latency, and using efficient parallel and synchronization strategies are very important for large-scale model training.
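To make the scale concrete, here is a rough back-of-the-envelope calculation in Python of why a 175-billion-parameter model cannot fit on a single 32GB GPU and must be sharded. The assumptions (fp32 weights, an Adam-style optimizer adding roughly three extra parameter-sized copies) are illustrative, not from the original text:

```python
# Back-of-the-envelope memory math for a 175B-parameter model (illustrative
# assumptions: fp32 weights, Adam-style optimizer states ~3 extra copies).
PARAMS = 175e9           # 175 billion parameters
BYTES_PER_PARAM = 4      # single-precision (fp32)
GPU_MEMORY_GB = 32
NUM_GPUS = 100

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")                      # ~700 GB

# Gradients + optimizer states typically add several more parameter-sized
# copies (assumed 3x here for illustration).
training_state_gb = weights_gb * 4
print(f"weights + gradients + optimizer states: ~{training_state_gb:.0f} GB")

print(f"GPUs needed just to hold the weights: ~{weights_gb / GPU_MEMORY_GB:.0f}")
print(f"per-GPU share if sharded over {NUM_GPUS} GPUs: "
      f"~{training_state_gb / NUM_GPUS:.0f} GB (before activations)")
```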
2. Bottleneck of communication overhead:
It should be noted that the communication bottleneck is also the reason why the current distributed computing power network cannot do large language model training.
Each node needs to exchange information frequently to work together, which creates communication overhead. For large language models, this problem is especially serious due to the large number of parameters of the model. The communication overhead is divided into these aspects:
Although there are some methods to reduce communication overhead, such as compression of parameters and gradients, efficient parallel strategies, etc., these methods may introduce additional computational burden or negatively affect the training effect of the model. Also, these methods cannot completely solve the communication overhead problem, especially in the case of poor network conditions or large distances between computing nodes.
As an example:
Decentralized distributed computing power network
The GPT-3 model has 175 billion parameters, and if we represent these parameters using single-precision floating point numbers (4 bytes per parameter), then storing these parameters requires ~700GB of memory. In distributed training, these parameters need to be frequently transmitted and updated between computing nodes.
Assuming that there are 100 computing nodes, each node needs to update all parameters in each step, then each step needs to transfer about 70TB (700GB*100) of data. If we assume that a step takes 1s (very optimistic assumption), then 70TB of data needs to be transferred every second. This demand for bandwidth already far exceeds that of most networks and is also a matter of feasibility.
In reality, due to communication delays and network congestion, the data transmission time may be much longer than 1s. This means computing nodes may spend most of their time waiting for data transfers instead of performing actual calculations. This greatly reduces training efficiency, and the loss is not something that can simply be waited out: it is the difference between feasible and infeasible, and it can make the entire training process unworkable.
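The feasibility arithmetic above can be checked with a few lines of Python. The assumptions follow the text: 175 billion fp32 parameters, 100 nodes, every node exchanging the full parameter set each step, and the optimistic one second per step:

```python
# Feasibility check for the decentralized case described above.
PARAM_BYTES = 175e9 * 4          # ~700 GB of fp32 parameters
NUM_NODES = 100
STEP_TIME_S = 1.0                # the "very optimistic" 1 s per step

per_step_bytes = PARAM_BYTES * NUM_NODES            # data moved per step
required_gbps = per_step_bytes * 8 / STEP_TIME_S / 1e9

print(f"data moved per step: ~{per_step_bytes / 1e12:.0f} TB")       # ~70 TB
print(f"aggregate bandwidth to hide it within one step: ~{required_gbps:,.0f} Gbps")
# ~560,000 Gbps in aggregate, far beyond what geographically scattered
# nodes on ordinary links can provide.
```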
Centralized computer room
Even in a centralized computer room environment, the training of large models still requires heavy communication optimization.
In a centralized computer room environment, high-performance computing devices are used as a cluster, connected through a high-speed network to share computing tasks. However, even when training a model with an extremely large number of parameters in such a high-speed network environment, the communication overhead is still a bottleneck, because the parameters and gradients of the model need to be frequently transmitted and updated between various computing devices.
As mentioned at the beginning, suppose there are 100 computing nodes, and each server has a network bandwidth of 25Gbps. If each server needs to update all parameters in each training step, then each training step needs to transfer about 700GB of data and it takes ~224 seconds. By taking advantage of the centralized computer room, developers can optimize the network topology inside the data center and use technologies such as model parallelism to significantly reduce this time.
In contrast, if the same training is performed in a distributed environment, assuming there are still 100 computing nodes distributed all over the world, the average network bandwidth of each node is only 1Gbps. In this case, it takes ~5600 seconds to transfer the same 700GB of data, which is much longer than in the centralized computer room. Also, due to network delays and congestion, the actual time required may be longer.
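The two transfer-time figures quoted above can likewise be verified with a quick sketch. This counts pure serialization time only; latency, congestion, and protocol overhead are ignored:

```python
# Serialization time for ~700 GB of parameters over the two links mentioned
# above (latency, congestion and protocol overhead ignored).
def transfer_seconds(data_gb: float, bandwidth_gbps: float) -> float:
    """Time to push `data_gb` gigabytes through a `bandwidth_gbps` link."""
    return data_gb * 8 / bandwidth_gbps

DATA_GB = 700
print(f"centralized, 25 Gbps link: ~{transfer_seconds(DATA_GB, 25):.0f} s")  # ~224 s
print(f"distributed,  1 Gbps link: ~{transfer_seconds(DATA_GB, 1):.0f} s")   # ~5600 s
```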
However, compared to the situation in a distributed computing power network, it is relatively easy to optimize the communication overhead in a centralized computer room environment. Because in a centralized computer room environment, computing devices are usually connected to the same high-speed network, and the bandwidth and delay of the network are relatively good. In a distributed computing power network, computing nodes may be distributed all over the world, and the network conditions may be relatively poor, which makes the problem of communication overhead more serious.
In the process of training GPT-3, OpenAI uses a model parallel framework called Megatron to solve the problem of communication overhead. Megatron divides the parameters of the model and processes them in parallel among multiple GPUs, and each device is only responsible for storing and updating a part of the parameters, thereby reducing the amount of parameters that each device needs to process and reducing communication overhead. At the same time, a high-speed interconnection network is also used during training, and the length of the communication path is reduced by optimizing the network topology.
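The following is a toy, single-process sketch of the tensor-parallel idea described above: a layer's weight matrix is split column-wise so each "device" stores and updates only its own shard, and only the much smaller activations would cross the network. It is a NumPy illustration of the principle, not the real Megatron-LM implementation (which uses torch.distributed collectives):

```python
# Toy, single-process illustration of column-wise (tensor) model parallelism:
# each "device" stores one shard of the weight matrix and computes its slice
# of the output; only activations would cross the network, not the weights.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, num_devices = 1024, 4096, 4

W = rng.standard_normal((d_in, d_out)).astype(np.float32)   # full layer weight
shards = np.split(W, num_devices, axis=1)                   # one shard per device

x = rng.standard_normal((8, d_in)).astype(np.float32)       # a batch of activations

# Each device multiplies against only its own shard...
partial_outputs = [x @ shard for shard in shards]
# ...and the slices are gathered along the feature dimension.
y = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y, x @ W, atol=1e-3)
print("parameters per device:", shards[0].size, "vs full layer:", W.size)
```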
(Data used to train LLM models)
3. Why can’t the distributed computing power network do these optimizations
It can be done, but compared with the centralized computer room, the effect of these optimizations is very limited.
4. Data security and privacy challenges
Almost every stage involving data processing and transmission can affect data security and privacy:
**What solutions are available for data privacy concerns?**
Summary
Each of the above methods has its applicable scenarios and limitations, and none of the methods can completely solve the data privacy problem in the large model training of distributed computing power network.
**Can ZK, on which high hopes are pinned, solve the data privacy problem in large model training?**
In theory, ZKP can be used to ensure data privacy in distributed computing: it allows a node to prove that it has performed the computation as specified without disclosing the actual input and output data.
In practice, however, using ZKP to train large models on a large-scale distributed computing power network faces the following bottlenecks:
Summary
Using ZKP for large model training on a large-scale distributed computing power network will take several more years of research and development, and it will also require the academic community to devote more energy and resources to this direction.
2. Distributed Computing Power—Model Inference
Another relatively large scenario for distributed computing power is model inference. Our judgment on the development path of large models is that demand for model training will gradually slow down after passing a peak as large models mature, while inference demand will grow exponentially with the maturity of large models and AIGC.
Compared with training tasks, inference tasks usually have lower computational complexity and weaker data interaction, and are more suitable for distributed environments.
(Power LLM inference with NVIDIA Triton)
1. Challenges
Communication Delay:
In a distributed environment, communication between nodes is essential. In a decentralized distributed computing power network, nodes may be spread all over the world, so network latency can be a problem, especially for inference tasks that require real-time responses.
Model Deployment and Update:
The model needs to be deployed to each node. If the model is updated, each node needs to update its model, which consumes a lot of network bandwidth and time.
Data Privacy:
Although inference tasks usually only require input data and models, and do not need to return a large amount of intermediate data and parameters, the input data may still contain sensitive information, such as users' personal information.
Model Security:
In a decentralized network, the model needs to be deployed on untrusted nodes, which can lead to model leakage and raises problems of model ownership and abuse. It also raises security and privacy concerns: if a model is used to process sensitive data, nodes may be able to infer sensitive information by analyzing the model's behavior.
Quality Control:
Each node in a decentralized distributed computing power network may have different computing capabilities and resources, which may make it difficult to guarantee the performance and quality of inference tasks.
2. Feasibility
Computational complexity:
In the training phase, the model needs to iterate repeatedly. During the training process, it is necessary to calculate the forward propagation and back propagation of each layer, including the calculation of the activation function, the calculation of the loss function, the calculation of the gradient and the update of the weight. Therefore, the computational complexity of model training is high.
In the inference phase, only one forward pass is required to compute the prediction. For example, in GPT-3, it is necessary to convert the input text into a vector, and then perform forward propagation through each layer of the model (usually the Transformer layer), and finally obtain the output probability distribution, and generate the next word according to this distribution. In GANs, the model needs to generate an image based on the input noise vector. These operations only involve the forward propagation of the model, do not need to calculate gradients or update parameters, and have low computational complexity.
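The contrast can be illustrated with a minimal PyTorch sketch; the tiny model here is a stand-in for a real Transformer, chosen only for illustration. A training step requires a forward pass, a backward pass, and a parameter update, while an inference step is a single forward pass with no gradient bookkeeping:

```python
# Training step vs. inference step on a tiny stand-in model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
x, target = torch.randn(4, 512), torch.randn(4, 512)

# --- One training step: forward + backward + weight update ---
out = model(x)
loss = nn.functional.mse_loss(out, target)
loss.backward()        # backward pass: gradients for every parameter
optimizer.step()       # optimizer update for every parameter
optimizer.zero_grad()

# --- One inference step: a single forward pass, no gradient bookkeeping ---
with torch.no_grad():
    prediction = model(x)   # roughly a third of the compute of the step above
print(prediction.shape)
```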
Data Interactivity:
During the inference phase, the model usually processes a single input rather than the large batch of data during training. The result of each inference only depends on the current input, not on other input or output, so there is no need for a large amount of data interaction, and the communication pressure is less.
Taking the generative image model as an example, assuming we use GANs to generate images, we only need to input a noise vector to the model, and then the model will generate a corresponding image. In this process, each input will only generate one output, and there is no dependency between outputs, so there is no need for data interaction.
Taking GPT-3 as an example, each generation of the next word only requires the current text input and the state of the model, and does not need to interact with other inputs or outputs, so the requirement for data interactivity is also weak.
Summary
Whether for large language models or generative image models, inference tasks have relatively low computational complexity and data interactivity, making them better suited to decentralized distributed computing power networks. This is why most of the projects we see today are focusing their efforts in this direction.
3. Projects
The technical threshold and breadth required for a decentralized distributed computing power network are very high, and it also needs hardware resources to back it up, so we have not seen many attempts so far. Take Together and Gensyn.ai as examples:
1. Together
(RedPajama from Together)
Together is a company focused on open-source large models and committed to decentralized AI computing power solutions, with the hope that anyone, anywhere can access and use AI. Together recently closed a USD 20m seed round led by Lux Capital.
Together was co-founded by Chris, Percy, and Ce. The founding motivation was that large-scale model training requires large clusters of high-end GPUs and expensive expenditure, while these resources and model training capabilities are concentrated in a few large companies.
From my point of view, a more reasonable entrepreneurial plan for distributed computing power is:
Step 1. Open-source models
To implement model inference on a decentralized distributed computing power network, the prerequisite is that nodes must be able to obtain the model at low cost; in other words, the models used on a decentralized computing power network need to be open source (if a model has to be licensed before it can be used, the complexity and cost of implementation increase). For example, ChatGPT, as a non-open-source model, is not suitable for execution on a decentralized computing power network.
Therefore, it is reasonable to infer that the hidden moat of a company providing a decentralized computing power network is strong large-model development and maintenance capability. Developing and open-sourcing a powerful base model in-house frees the network, to some extent, from dependence on third-party open-source models and solves the most basic problem of a decentralized computing power network. It also goes a long way toward proving that the computing power network can effectively handle the training and inference of large models.
This is exactly what Together has done. The recently released LLaMA-based RedPajama was jointly launched by Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research, with the goal of developing a series of fully open-source large language models.
Step 2. Applying distributed computing power to model inference
As mentioned in the above two sections, compared with model training, model inference has lower computational complexity and data interaction, and is more suitable for a decentralized distributed environment.
Building on the open-source model, Together's R&D team has made a series of updates to the RedPajama-INCITE-3B model, such as using LoRA for low-cost fine-tuning and making the model run more smoothly on CPUs (especially MacBook Pros with the M2 Pro chip). Although this model is small, its capabilities exceed other models of the same scale, and it has been put to practical use in legal, social and other scenarios.
Step 3. Applying distributed computing power to model training
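As an illustration of why LoRA makes fine-tuning cheap, here is a hand-rolled sketch of the idea: freeze the pretrained weight W and learn only a low-rank update B·A, so the trainable (and communicated) parameters shrink to a fraction of a percent. The layer sizes and rank below are illustrative, not RedPajama-INCITE-3B's actual configuration:

```python
# Hand-rolled LoRA layer: the frozen base weight W is augmented with a trainable
# low-rank update (B @ A), scaled by alpha / rank. Sizes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus the low-rank correction x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(2560, 2560), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable:,} of {total:,} "
      f"({100 * trainable / total:.2f}%)")        # only the small A and B matrices
```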
(Schematic diagram of the computing power network of Overcoming Communication Bottlenecks for Decentralized Training)
In the medium to long term, despite the great challenges and technical bottlenecks, serving the computing power demand of AI large-model training must be the most attractive prize. Together began laying out work on overcoming the communication bottleneck in decentralized training from its founding, and published a related paper at NeurIPS 2022: Overcoming Communication Bottlenecks for Decentralized Training. The main directions can be summarized as follows:
Scheduling Optimization
Because the connections between nodes have different latencies and bandwidths, training in a decentralized environment requires assigning communication-heavy tasks to devices with faster connections. Together builds a cost model to describe a given scheduling strategy and then optimizes that strategy to minimize communication cost and maximize training throughput. The Together team found that even when the network was 100 times slower, end-to-end training throughput was only 1.7 to 2.3 times slower. It is therefore promising to close the gap between distributed networks and centralized clusters through scheduling optimization.
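A toy greedy scheduler conveys the intuition. This is not Together's actual cost model; the traffic volumes and link bandwidths below are made up for illustration:

```python
# Toy greedy scheduling: put the heaviest stage-to-stage traffic on the fastest
# links. Traffic volumes (GB per step) and link bandwidths (Gbps) are made up.
stage_traffic_gb = [12.0, 3.0, 9.0, 1.5]     # hypothetical inter-stage traffic
link_bandwidth_gbps = [1.0, 10.0, 0.5, 5.0]  # hypothetical inter-node links

def total_comm_seconds(traffic, links):
    """Sum of per-link transfer times for a given traffic-to-link assignment."""
    return sum(gb * 8 / bw for gb, bw in zip(traffic, links))

naive = total_comm_seconds(stage_traffic_gb, link_bandwidth_gbps)
greedy = total_comm_seconds(sorted(stage_traffic_gb, reverse=True),
                            sorted(link_bandwidth_gbps, reverse=True))

print(f"naive placement:  ~{naive:.0f} s of communication per step")   # ~245 s
print(f"greedy placement: ~{greedy:.0f} s of communication per step")  # ~72 s
```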
Communication compression optimization
Together proposes communication compression for forward activations and backward gradients and introduces the AQ-SGD algorithm, which comes with rigorous convergence guarantees for stochastic gradient descent. AQ-SGD can fine-tune large base models over slow networks (e.g. 500 Mbps) while being only 31% slower than uncompressed end-to-end training on centralized networks (e.g. 10 Gbps). In addition, AQ-SGD can be combined with state-of-the-art gradient compression techniques such as QuantizedAdam to achieve a 10% end-to-end speedup.
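For intuition, here is a minimal sketch of the communication-compression principle: uniform 8-bit quantization of a gradient tensor before it is sent over the network. This only illustrates the general idea; AQ-SGD itself is a more careful activation-quantization scheme with convergence guarantees:

```python
# Uniform 8-bit quantization of a gradient tensor before sending it: a 4x cut
# in bytes on the wire at the cost of a small, bounded quantization error.
import numpy as np

def quantize_int8(t: np.ndarray):
    """Map a float32 tensor to int8 values plus one per-tensor scale factor."""
    scale = np.abs(t).max() / 127.0 + 1e-12
    return np.round(t / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

grad = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
q, scale = quantize_int8(grad)

print(f"bytes on the wire: {q.nbytes:,} vs {grad.nbytes:,} uncompressed")
print(f"mean abs quantization error: {np.abs(dequantize(q, scale) - grad).mean():.4f}")
```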
Project Summary
The Together team is very well rounded: its members have strong academic backgrounds, and everything from large-scale model development and cloud computing to hardware optimization is covered by industry experts. Together has also shown a long-term, patient posture in its path planning, from developing open-source large models, to testing model inference on idle computing power (such as Macs) in a distributed computing power network, and then to laying the groundwork for distributed computing power in large model training. It gives the sense of quietly building up strength before a breakthrough :)
So far, however, I have not seen much research output from Together on the incentive layer. I think this is just as important as the technical R&D, and it is a key factor in ensuring the development of a decentralized computing power network.
2. Gensyn.ai
(Gensyn.ai)
From Together's technical path, we can roughly understand the implementation process of a decentralized computing power network for model training and inference, as well as the corresponding R&D priorities.
Another important point that cannot be ignored is the design of the incentive layer/consensus algorithm of the computing power network. For example, an excellent network needs to have:
……
See how Gensyn.ai does it:
First, solvers in the computing power network bid for the right to process tasks submitted by users, and depending on the scale of the task and the risk of being caught cheating, a solver must stake a certain amount.
While updating parameters, the solver generates multiple checkpoints (to ensure the transparency and traceability of its work) and periodically produces cryptographic proofs about the task (proofs of work progress);
When the solver completes the work and produces a portion of the computation results, the protocol selects a verifier. The verifier also stakes a certain amount (to ensure it performs the verification honestly) and decides, based on the proofs above, which portion of the computation results needs to be verified.
A Merkle tree-based data structure is used to pinpoint the exact location where the computation results diverge. The entire verification operation happens on-chain, and cheaters have their stake slashed.
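Here is a simplified sketch of how a Merkle tree over chunked results lets a verifier pinpoint the first divergent chunk without re-checking everything. It is illustrative only; Gensyn's actual on-chain protocol is more involved:

```python
# Locate the first chunk where a solver's results diverge from the honest ones
# by comparing Merkle roots of sub-ranges, instead of re-checking every chunk.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:                   # duplicate the last node if odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def find_divergence(honest_chunks, solver_chunks):
    """Binary search over sub-range roots to locate the first differing chunk."""
    lo, hi = 0, len(honest_chunks)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if merkle_root(honest_chunks[lo:mid]) != merkle_root(solver_chunks[lo:mid]):
            hi = mid                         # divergence is in the left half
        else:
            lo = mid                         # left half matches; look right
    return lo

honest = [f"result-{i}".encode() for i in range(16)]
cheating = list(honest)
cheating[11] = b"tampered"                   # the solver faked one chunk

assert merkle_root(honest) != merkle_root(cheating)
print("divergent chunk index:", find_divergence(honest, cheating))  # -> 11
```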
Project Summary
The design of the incentive and verification algorithm means Gensyn.ai does not need to replay the entire computing task during verification; it only needs to copy and check a portion of the results according to the provided proofs, which greatly improves verification efficiency. At the same time, nodes only need to store part of the computation results, which also reduces storage and compute consumption. In addition, potential cheaters cannot predict which portions will be selected for verification, which further reduces the risk of cheating.
This method of locating discrepancies and catching cheaters can also quickly find errors in the computation without comparing the full results (by starting from the root of the Merkle tree and traversing downward step by step), which is very effective for large-scale computing tasks.
In short, the design goal of Gensyn.ai's incentive/verification layer is: concise and efficient. However, it is currently limited to the theoretical level, and the specific implementation may face the following challenges:
4. A Few Thoughts on the Future
The question of who actually needs a decentralized computing power network has not been verified. Applying idle computing power to large model training, which requires enormous computing resources, obviously makes the most sense and offers the most room for imagination. But in reality, bottlenecks such as communication and privacy force us to rethink:
Is there really hope for decentralized training of large models?
If we step outside this consensus of "the most reasonable landing scenario", is applying decentralized computing power to the training of small AI models also a big opportunity? From a technical point of view, the current limiting factors are largely resolved by the smaller size and simpler structure of such models. And from a market point of view, we have always felt that large model training will be huge from now into the future, but does that mean the market for small AI models is no longer attractive?
I don't think so. Compared with large models, small AI models are easier to deploy and manage, and are more efficient in processing speed and memory usage. In a large number of application scenarios, users or companies do not need the general-purpose reasoning capabilities of large language models and only care about a very specific prediction target. Therefore, in most scenarios, small AI models are still the more viable option and should not be prematurely overlooked in the tide of FOMO around large models.