In a significant advancement for the artificial intelligence (AI) industry, Huawei, in collaboration with a team of experts, has released a report titled “Best Practices for DeepSeek V3/R1 Inference Deployment on Huawei Ascend Servers.” This report details the optimal strategies for deploying the DeepSeek V3/R1 large language model on Ascend servers, offering solutions that can meet the demands of various inference scenarios.

DeepSeek V3/R1 is a leading open-source large language model that has demonstrated remarkable application value in multiple fields such as natural language processing, code generation, and knowledge reasoning. The model’s continuous updates, like DeepSeek-V3-0324 and DeepSeek-Prover-V2-671B, have expanded its capabilities while maintaining compatibility with existing deployment schemes.

The report focuses on two typical Ascend server models: the CloudMatrix 384 super node and the Atlas 800I A2 inference server. For the CloudMatrix 384 super node, its high-speed interconnection bus enables a unique large-scale Expert Parallel (EP) deployment strategy. With 144 cards configured as a Decode instance, it can serve high concurrency at low latency: under a 50ms latency constraint, it currently delivers a decode throughput of 1920 tokens/s per card, a remarkable figure given the complexity of large language model inference.

The Atlas 800I A2 server, on the other hand, adopts a small-scale EP deployment strategy. With a group of four A2 server nodes configured as a Decode instance, it offers flexible deployment with good throughput: under a 100ms latency constraint, it delivers 723-808 tokens/s per card.
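To put these figures in context, a rough back-of-the-envelope calculation can relate them, assuming (and this is an interpretation, not something stated in the passage above) that the latency constraint refers to the time per output token during decode:

```python
# Back-of-the-envelope sketch relating the reported per-card throughput to the
# implied per-card concurrency. Assumption: the 50 ms / 100 ms constraint is the
# time-per-output-token (TPOT) each request observes during decode.

def implied_concurrency(tokens_per_s_per_card: float, tpot_ms: float) -> float:
    """Concurrent requests a card must serve to reach the stated throughput."""
    per_request_rate = 1000.0 / tpot_ms        # tokens/s seen by a single request
    return tokens_per_s_per_card / per_request_rate

# CloudMatrix 384 figure: 1920 tokens/s per card under a 50 ms constraint.
print(implied_concurrency(1920, 50))   # -> 96.0 requests per card
# Aggregate over a 144-card Decode instance:
print(144 * 1920)                      # -> 276480 tokens/s per instance

# Atlas 800I A2 figure: 723-808 tokens/s per card under a 100 ms constraint.
print(implied_concurrency(723, 100), implied_concurrency(808, 100))  # -> ~72 to ~81
```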

The deployment framework is based on vLLM, which has been modified to support the hybrid EP / Data Parallel (DP) / Tensor Parallel (TP) strategy, enabling flexible scheduling and optimal performance. At the model level, A8W8 (INT8 activations and weights) dynamic quantization is employed, along with Multi-Token Prediction (MTP) for acceleration. The team has also re-examined the model’s inference process from a mathematical perspective, taking into account the characteristics of Ascend chips and server networking. By selecting appropriate parallel methods and computation logic, and leveraging the multi-stream concurrency capabilities of Ascend hardware, they maximize the mutual concealment of communication, computation, and data transfer, achieving optimal performance at the model level.
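As an illustration of what A8W8 dynamic quantization means in practice, here is a minimal NumPy sketch. It assumes per-output-channel weight scales and per-token activation scales, which is a common A8W8 recipe but not necessarily the exact scheme used in the report:

```python
import numpy as np

# Minimal A8W8 (INT8) dynamic quantization sketch for a linear layer:
# weights are quantized per output channel offline, activations are quantized
# per token at runtime, and the INT8 matmul result is rescaled back to float.

def quantize_per_channel(w: np.ndarray):
    # w: [out_features, in_features]; symmetric per-output-channel scales
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def quantize_per_token(x: np.ndarray):
    # x: [tokens, in_features]; scales computed dynamically for each token
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def a8w8_linear(x: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray) -> np.ndarray:
    x_q, x_scale = quantize_per_token(x)
    # INT8 x INT8 with INT32 accumulation, then dequantize with both scales
    acc = x_q.astype(np.int32) @ w_q.T.astype(np.int32)
    return acc.astype(np.float32) * x_scale * w_scale.T

w = np.random.randn(128, 64).astype(np.float32)
x = np.random.randn(4, 64).astype(np.float32)
w_q, w_scale = quantize_per_channel(w)
# Quantization error, small relative to the full-precision output:
print(np.max(np.abs(a8w8_linear(x, w_q, w_scale) - x @ w.T)))
```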

At the operator level, various optimization schemes for computing and communication operators have been proposed. These schemes combine mathematically equivalent transformations, fused operators, cache reuse, and pipeline concealment techniques, enabling the MLA (Multi-head Latent Attention), MoE (Mixture of Experts), and communication operators to reach the expected compute utilization, memory-access bandwidth, and communication bandwidth.
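To give a flavor of what a “mathematically equivalent transformation” looks like, here is a deliberately simple example: folding two consecutive linear projections into one offline, so the runtime does less work while producing numerically equivalent results. The report’s actual transformations (for example inside MLA) are considerably more involved; this only illustrates the idea.

```python
import numpy as np

# Two back-to-back linear projections can be folded into one matrix offline,
# so inference runs a single matmul with no intermediate activation.

x  = np.random.randn(8, 256)
w1 = np.random.randn(256, 128)
w2 = np.random.randn(128, 512)

y_two_step = (x @ w1) @ w2          # original: two kernels, extra intermediate
w_fused    = w1 @ w2                # precomputed once, offline
y_fused    = x @ w_fused            # deployed: one kernel

print(np.allclose(y_two_step, y_fused))  # True, up to floating-point rounding
```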

In terms of performance optimization, the report covers multiple aspects. On the framework side, techniques such as API Server scale-out and MoE model load balancing have been implemented. The API Server scale-out improves the system’s request-handling capacity and throughput, while the MoE load-balancing strategy addresses issues like “hot and cold experts” to enhance inference performance.
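The “hot and cold experts” problem can be illustrated with a toy rebalancing sketch: if routing statistics show a few experts receiving most of the tokens, a naive static placement overloads some devices, and experts can instead be re-placed by measured load. The expert IDs, loads, and greedy strategy below are invented for illustration; the report’s balancer is more sophisticated.

```python
# Toy MoE expert rebalancing: heaviest experts first, each assigned to the
# currently least-loaded device (greedy longest-processing-time placement).

def rebalance_experts(expert_load: dict[int, int], num_devices: int) -> list[list[int]]:
    placement = [[] for _ in range(num_devices)]
    device_load = [0] * num_devices
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        target = min(range(num_devices), key=lambda d: device_load[d])
        placement[target].append(expert)
        device_load[target] += load
    return placement

# Eight experts, two of them "hot", spread over four devices.
loads = {0: 900, 1: 850, 2: 120, 3: 110, 4: 100, 5: 95, 6: 90, 7: 85}
print(rebalance_experts(loads, 4))
# -> [[0], [1], [2, 5, 7], [3, 4, 6]]: each hot expert gets a device to itself
```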

On the model side, communication optimizations such as FlashComm and intra-layer parallel conversion have been introduced, reducing communication latency and improving inference performance. The model-side concurrency scheme exploits Ascend chip resources to overlap communication with communication, computation with communication, and weight prefetching with communication. In addition, the FusionSpec speculative inference framework has been proposed to further improve the performance of the MTP layer.
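FusionSpec builds on the draft-and-verify principle of MTP-style speculative decoding. The sketch below shows only that generic principle with greedy acceptance; it is not FusionSpec’s actual implementation, and the token values are made up.

```python
# Draft-and-verify in miniature: an MTP head drafts several tokens cheaply,
# the main model verifies them in one batched pass, and drafted tokens are
# accepted up to the first disagreement.

def speculative_step(draft_tokens: list[int], verify_tokens: list[int]) -> list[int]:
    """Greedy speculative decoding: accept matching drafts, then take the
    verifier's token at the first mismatch (or a bonus token if all match)."""
    accepted = []
    for d, v in zip(draft_tokens, verify_tokens):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)          # verifier wins at the first mismatch
            break
    else:
        accepted.append(verify_tokens[len(draft_tokens)])  # bonus token
    return accepted

draft  = [11, 42, 7]        # 3 tokens proposed by the draft (MTP) head
verify = [11, 42, 9, 13]    # main model's choice at each position
print(speculative_step(draft, verify))  # -> [11, 42, 9]: 2 accepted + 1 corrected
```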

Ascend operator performance has also been optimized. For the MLA operator, algorithms such as AMLA have been proposed and cache-related optimizations carried out, significantly improving the operator’s performance. For the MoE communication operators, new operators and algorithms, including Dispatch/Combine communication-computation fused operators and a fine-grained hierarchical pipeline algorithm, have been developed to reduce communication latency.
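The benefit of pipelining can be seen in a toy timeline: splitting a MoE micro-batch into chunks lets the Dispatch/Combine communication of one chunk hide behind the expert computation of another. The per-stage costs and chunk count below are invented purely to show the effect; the real algorithm also exploits the intra-node versus inter-node bandwidth hierarchy.

```python
# Toy timeline for a three-stage Dispatch -> expert compute -> Combine pipeline.

def serial_time(dispatch: float, compute: float, combine: float) -> float:
    return dispatch + compute + combine

def pipelined_time(dispatch: float, compute: float, combine: float, chunks: int) -> float:
    d, c, m = dispatch / chunks, compute / chunks, combine / chunks
    # The first chunk flows through all three stages; after that the pipeline
    # advances at the pace of its slowest stage.
    return (d + c + m) + (chunks - 1) * max(d, c, m)

# Example: 1.0 ms dispatch, 2.0 ms expert compute, 1.0 ms combine per micro-batch.
print(serial_time(1.0, 2.0, 1.0))        # 4.0 ms without overlap
print(pipelined_time(1.0, 2.0, 1.0, 4))  # 2.5 ms with 4 chunks
```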

The performance analysis in the report shows the effectiveness of these deployment and optimization strategies. Although there are still some factors affecting the actual throughput, such as latency constraints and bandwidth contention, the current results have already demonstrated the competitiveness of Ascend servers in large – language model inference.

Looking ahead, the team plans to continue improving the deployment scheme. They aim to optimize for low-latency scenarios, implement micro-batch optimization on the Atlas 800I A2 server, explore low-bit quantization schemes, support MLA layer operator quantization, study larger EP deployment schemes for the Atlas 800I A2, and optimize sequence load balancing. These efforts are expected to further enhance the performance of DeepSeek V3/R1 inference on Ascend servers and support more complex AI scenarios. This research not only benefits the development of large language models but also promotes the progress of the entire AI industry by providing more efficient and practical deployment solutions.
