One graphics card plus a few lines of code: 40% faster large-model training!
To put large models in more people's hands, the tech community has really pulled out all the stops.
Model not open enough? Someone builds a free, open-source reproduction on their own.
Recent examples include DALL·E Mini, which has gone viral across the internet, and OPT-175B (Open Pre-trained Transformer), open-sourced by Meta.
Through such reproductions, large models that were once closed off have become available to everyone.

Others find that even open models are too expensive for individual players to train.
So techniques such as heterogeneous memory and parallel computing have been proposed to speed up large-model training and cut its cost.
For example, the open-source project Colossal-AI recently enabled a single NVIDIA RTX 3090 to train a model with 18 billion parameters.

In the past few days, they have rolled out a new wave of features:
seamless support for Hugging Face community models, so that low-cost training and fine-tuning of large models takes only a few extra lines of code.

Bear in mind that Hugging Face, one of today's most popular AI libraries, provides implementations of more than 50,000 AI models and is the first stop for many practitioners training large models.
Colossal-AI's move makes training and fine-tuning these openly available models far more practical.
Training performance improves as well.
On a single GPU, Colossal-AI's automatic optimization strategy achieves up to 40% speedup over Microsoft's DeepSpeed, while traditional deep learning frameworks such as PyTorch can no longer run models of this size on a single GPU at all.
For parallel training on 8 GPUs, it only takes adding -nprocs 8 to the launch command.

With this release, the cost, efficiency, and practicality concerns of individual AI players are all addressed at once ~
And no changes to the code logic are required.
Talk is cheap, though.
Let's take OPT as a concrete example and walk through, in detail, how to use Colossal-AI's new features.
OPT stands for Open Pre-trained Transformer.
Released by Meta AI as a counterpart to GPT-3, it scales up to 175 billion parameters.
Its biggest distinction is that, whereas GPT-3 never released its model weights, OPT open-sources all of its code and weights.
As a result, every developer can build personalized downstream tasks on top of it.
The example below fine-tunes a causal language modeling task starting from the pre-trained weights provided by OPT.
The process takes two steps:
1. Add a configuration file
2. Launch training
Step one is to add a configuration file for the task you want to run.
For instance, for heterogeneous training on a single GPU, you only need to add the relevant items to the configuration file, without touching the training logic in the code.
For example, tensor_placement_policy determines the heterogeneous training strategy, and its value can be cuda, cpu, or auto.
Each strategy has its own advantages and suits different situations:
cuda: all model parameters stay on the GPU, suitable for traditional scenarios where training still fits without offloading.
cpu: all model parameters are kept in CPU memory, with only the weights currently involved in computation held in GPU memory; suitable for training extremely large models.
auto: automatically decides how many parameters to keep in GPU memory based on real-time memory information, maximizing GPU memory utilization while minimizing data transfer between CPU and GPU.
For ordinary users, the auto strategy is the most convenient choice.
With it, Colossal-AI automatically and dynamically selects the best heterogeneous strategy in real time to maximize computational efficiency.
from colossalai.zero.shard_utils import TensorShardStrategy

zero = dict(
    model_config=dict(
        shard_strategy=TensorShardStrategy(),
        tensor_placement_policy="auto"
    ),
    optimizer_config=dict(gpu_margin_mem_ratio=0.8)
)
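To build intuition for what the auto policy decides, here is a toy sketch in plain Python. This is not Colossal-AI's actual implementation; the function name, margin heuristic, and numbers are made up for illustration. Given the model's parameter footprint and the currently free GPU memory, it decides how many bytes of parameters to keep on the GPU and how many to offload to the CPU.

```python
def plan_placement(param_bytes, free_gpu_bytes, margin_ratio=0.2):
    """Toy version of an 'auto'-style placement decision.

    Keep as many parameter bytes on the GPU as fit after reserving
    a safety margin for activations; offload the rest to CPU memory.
    Returns (gpu_bytes, cpu_bytes).
    """
    budget = int(free_gpu_bytes * (1 - margin_ratio))  # reserve headroom
    gpu_bytes = min(param_bytes, max(budget, 0))
    cpu_bytes = param_bytes - gpu_bytes
    return gpu_bytes, cpu_bytes

# A 1.3B-parameter model in fp16 is roughly 2.6 GB of weights,
# which fits on a 6 GB card with room to spare:
print(plan_placement(2_600_000_000, 6_000_000_000))  # (2600000000, 0)
```

The real system makes this decision per step from live memory statistics rather than once up front, which is why it can adapt as activation sizes change during training.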
Step two: once the configuration file is ready, insert a few lines of code to enable the new features.
First, launch Colossal-AI with a single line of code that points to the configuration file.
Colossal-AI automatically initializes the distributed environment, reads the configuration, and injects the configured features into components such as the model and optimizer.
colossalai.launch_from_torch(config='./configs/colossalai_zero.py')
Next, define the dataset, model, optimizer, loss function, and so on as usual, for example directly in native PyTorch code. The only change when defining the model is to initialize it inside a ZeroInitContext.
Here we use the OPTForCausalLM model and pre-trained weights provided by Hugging Face, and fine-tune on the Wikitext dataset.
with ZeroInitContext(target_device=torch.cuda.current_device(),
                     shard_strategy=shard_strategy,
                     shard_param=True):
    model = OPTForCausalLM.from_pretrained('facebook/opt-1.3b',
                                           config=config)
Next, simply call colossalai.initialize to inject the heterogeneous memory features defined in the configuration file into the training engine and enable them.
engine, train_dataloader, eval_dataloader, lr_scheduler = colossalai.initialize(
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataloader=train_dataloader,
    test_dataloader=eval_dataloader,
    lr_scheduler=lr_scheduler)
It still comes down to GPU+CPU heterogeneity
The key to letting users run everything in such a "hands-off" way is that the AI system itself must be smart enough.
The core role is played by Gemini, Colossal-AI's efficient heterogeneous memory management subsystem.
It acts like a manager inside the system, dynamically allocating CPU and GPU memory after collecting the information the computation needs.
Concretely, Gemini warms up over the first steps, collecting memory-consumption information from PyTorch's dynamic computation graph.
After warm-up, before an operator is computed, Gemini uses the collected memory-usage records to reserve that operator's peak memory on the computing device, while moving some model tensors from GPU memory to CPU memory.

Gemini's built-in memory manager tags each tensor with state information, including HOLD, COMPUTE, and FREE.
Based on dynamically queried memory usage, it then changes tensor states and adjusts tensor placement on the fly.
The direct benefit: under severely limited hardware, model capacity is maximized while training speed stays balanced.
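To illustrate the state-tagging idea, here is a deliberately simplified sketch in plain Python. It is not Gemini's real code; only the state names follow the ones mentioned above, and the class, method names, and eviction policy are invented for illustration.

```python
from enum import Enum, auto as enum_auto

class TensorState(Enum):
    HOLD = enum_auto()     # materialized, but not needed right now
    COMPUTE = enum_auto()  # operand of the operator about to run
    FREE = enum_auto()     # released; memory can be reclaimed

class ToyMemoryManager:
    """Minimal sketch of per-tensor state tracking.

    Tensors needed by the next operator are marked COMPUTE and
    moved to the GPU; everything else sits in HOLD (possibly in
    CPU memory) until it is released as FREE.
    """
    def __init__(self):
        self.state = {}    # tensor name -> TensorState
        self.device = {}   # tensor name -> "cpu" | "gpu"

    def register(self, name):
        self.state[name] = TensorState.HOLD
        self.device[name] = "cpu"

    def before_op(self, needed):
        for name in needed:             # bring operands onto the GPU
            self.state[name] = TensorState.COMPUTE
            self.device[name] = "gpu"

    def after_op(self, needed, evict_to_cpu=True):
        for name in needed:             # done computing: hold, maybe offload
            self.state[name] = TensorState.HOLD
            if evict_to_cpu:
                self.device[name] = "cpu"

    def release(self, name):
        self.state[name] = TensorState.FREE
        self.device.pop(name, None)

mgr = ToyMemoryManager()
mgr.register("w1")
mgr.before_op(["w1"])
print(mgr.state["w1"].name, mgr.device["w1"])  # COMPUTE gpu
mgr.after_op(["w1"])
print(mgr.state["w1"].name, mgr.device["w1"])  # HOLD cpu
```

The real manager additionally consults the warm-up memory statistics to decide whether a HOLD tensor can stay on the GPU or must be evicted, which is where the dynamic advantage over static partitioning comes from.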
For comparison, ZeRO (Zero Redundancy Optimizer), the mainstream approach in the industry, also uses heterogeneous CPU+GPU memory, but its static partitioning can still cause problems such as system crashes and unnecessary communication.
Moreover, with dynamic heterogeneous CPU+GPU memory, capacity can be expanded simply by adding RAM sticks.
That is far more cost-effective than buying high-end graphics cards.

Today, with Colossal-AI, an ordinary gaming laptop with an RTX 2060 6GB can train a 1.5-billion-parameter model; a desktop with an RTX 3090 24GB can take on an 18-billion-parameter model; and a Tesla V100 32GB can even handle 24 billion parameters.
Beyond squeezing the most out of memory, Colossal-AI also uses distributed parallel methods to keep pushing training speed up.
It offers sophisticated parallel strategies such as data parallelism, pipeline parallelism, and 2.5-dimensional tensor parallelism.
Complex as these methods are, getting started remains "hands-off": a simple declaration is enough, with no need to intrude into the code and hand-craft complex low-level logic as in other systems and frameworks.
parallel = dict(
    pipeline=2,
    tensor=dict(mode='2.5d', depth=1, size=4)
)
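One thing worth checking with such a declaration is that the number of GPUs you launch matches the product of the parallel degrees. The helper below is a hypothetical sanity check, not part of Colossal-AI's API; the 2.5-D rule it encodes (tensor size = depth x q^2 for some integer q) is an assumption based on the general 2.5-D layout, so verify against the docs for your version.

```python
import math

def required_world_size(pipeline, tensor_size, data_parallel=1):
    """Total GPUs = data-parallel x pipeline x tensor-parallel degrees."""
    return data_parallel * pipeline * tensor_size

def valid_25d(tensor_size, depth):
    """Check tensor_size == depth * q**2 for some integer q
    (assumed shape constraint of the 2.5-D process grid)."""
    if tensor_size % depth:
        return False
    q = math.isqrt(tensor_size // depth)
    return q * q == tensor_size // depth

# The config above (pipeline=2, tensor size=4, depth=1) would need:
print(required_world_size(pipeline=2, tensor_size=4))  # 8
print(valid_25d(4, 1))  # True
```

So a launch of that configuration would target 8 GPUs in total.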
What else can Colossal-AI do?
In fact, since going open source, Colossal-AI has repeatedly topped the GitHub and Papers With Code trending lists, and it is well known in the community.
Besides training large models on a single GPU as described above, when scaled out to large parallel setups with dozens or even hundreds of GPUs, Colossal-AI can double performance or cut resource usage to under one tenth compared with existing systems such as NVIDIA's Megatron-LM.
In money terms, that can mean millions of yuan saved when pre-training super-large AI models such as GPT-3.

According to reports, Colossal-AI-based solutions are already used by well-known companies in autonomous driving, cloud computing, retail, medicine, chips, and other industries.
The team also puts great weight on building its open-source community, offering Chinese tutorials, running user forums, and iterating continuously on user feedback.
For example, we noticed a fan asking whether Colossal-AI could directly load models from Hugging Face.
Well, here is exactly that update.

So, what problems in large-model training do you think most urgently need to be solved?