GPT-3 Scared You? Meet Wu Dao 2.0: A Monster of 1.75 Trillion Parameters

Wu Dao 2.0 is 10x larger than GPT-3. Imagine what it can do.

By Alberto Romero | 6 June 2021

(Photo by GR Stocks on Unsplash)

We’re living in exciting times for AI. OpenAI shocked the world a year ago with GPT-3. Two weeks ago, Google presented LaMDA and MUM, two AIs that will revolutionize chatbots and search, respectively. And just a few days ago, on June 1st, the Beijing Academy of Artificial Intelligence (BAAI) presented Wu Dao 2.0 at its annual conference.

Wu Dao 2.0 is now the largest neural network ever created and probably the most powerful. Its potential and limits are yet to be fully disclosed, but the expectations are high and rightly so.

In this article, I’ll review the available information about Wu Dao 2.0: What it is, what it can do, and what are the promises of its creators for the future. Enjoy!

Wu Dao 2.0: Main Features compared to GPT-3

Parameters and data

Wu Dao – which means Enlightenment – is another GPT-like language model. Jack Clark, OpenAI’s policy director, calls this trend of building GPT-3-like models “model diffusion.” Yet, among all these models, Wu Dao 2.0 holds the record as the largest, with a striking 1.75 trillion parameters (10x GPT-3).

Coco Feng reported for South China Morning Post that Wu Dao 2.0 was trained on 4.9TB of high-quality text and image data, which makes GPT-3’s training dataset (570GB) pale in comparison. Yet, it’s worth noting that OpenAI researchers curated 45TB of data to extract those clean 570GB.

The training data is divided into:

  • 1.2TB Chinese text data in Wu Dao Corpora.
  • 2.5TB Chinese graphic data.
  • 1.2TB English text data in the Pile dataset.


Wu Dao 2.0 is multimodal. It can learn from text and images and tackle tasks that involve both types of data (something GPT-3 can’t do). We’ve seen a shift in recent years from AI systems specialized in a single mode of information towards multimodality.

It’s expected that computer vision and natural language processing, traditionally the two big branches within deep learning, will end up combined in every AI system in the future. The world is multimodal. Humans are multisensory. It’s reasonable to create AIs that mimic this feature.

Mixture of Experts

Wu Dao 2.0 was trained with FastMoE, a system similar to Google’s Mixture of Experts (MoE). The idea is to train different expert models within a larger model. A gating system lets the larger model select which experts to consult for each type of task.

FastMoE, in contrast with Google’s MoE, is open-source and doesn’t require specific hardware, which makes it more democratic. It allowed BAAI researchers to solve the training bottlenecks that had prevented models such as GPT-3 from reaching the 1-trillion-parameter milestone. They wrote in BAAI’s official WeChat blog that “[FastMoE] is simple to use, flexible, high-performance, and supports large-scale parallel training.” The future of large AI systems will certainly depend on training frameworks like these.
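To make the gating idea concrete, here is a minimal sketch of a Mixture-of-Experts forward pass. All the names, shapes, and numbers are hypothetical stand-ins of my own; the actual FastMoE implementation is a full-blown parallel training framework, not a few lines of numpy.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, E, K = 8, 16, 4, 2  # input dim, hidden dim, number of experts, top-k

# Each "expert" is a tiny two-layer MLP with its own weights.
experts = [
    (rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, D)) * 0.1)
    for _ in range(E)
]
gate_w = rng.standard_normal((D, E)) * 0.1  # learned gating weights

def moe_forward(x):
    """Route input x (shape [D]) to the top-k experts chosen by the gate."""
    scores = x @ gate_w                # one gating score per expert
    top = np.argsort(scores)[-K:]      # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()           # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w1, w2 = experts[i]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)  # weighted sum of expert outputs
    return out

y = moe_forward(rng.standard_normal(D))
print(y.shape)  # (8,)
```

The key property is that only K of the E experts run for any given input, which is how MoE-style models can grow to trillions of total parameters while keeping the compute per input manageable.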

Wu Dao 2.0’s fantastic capabilities


In an article for VentureBeat, Kyle Wiggers emphasized Wu Dao 2.0’s multimodal capabilities: It has “the ability to perform natural language processing, text generation, image recognition, and image generation tasks. […] as well as captioning images and creating nearly photorealistic artwork, given natural language descriptions.”

Andrew Tarantola writes for Engadget that Wu Dao 2.0 can “both generate alt text based off of a static image and generate nearly photorealistic images based on natural language descriptions. [It can also] predict the 3D structures of proteins, like DeepMind’s AlphaFold.”

Leading researcher Tang Jie highlighted Wu Dao 2.0’s skills in “poetry creation, couplets, text summaries, human setting questions and answers, painting” and even acknowledged that the system “ha[s] been close to breaking through the Turing test, and competing with humans.”

Wu Dao 2.0 has nothing to envy in GPT-3 or any other existing AI model. Its multitasking abilities and multimodal nature grant it the title of most versatile AI. These results suggest multi-AIs will dominate the future.

Benchmark achievements

Wu Dao 2.0 reached or surpassed state-of-the-art (SOTA) levels on nine benchmark tasks widely recognized by the AI community, as reported by BAAI (listed as benchmark: achievement).

  • ImageNet (zero-shot): SOTA, surpassing OpenAI CLIP.
  • LAMA (factual and commonsense knowledge): Surpassed AutoPrompt.
  • LAMBADA (cloze tasks): Surpassed Microsoft Turing NLG.
  • SuperGLUE (few-shot): SOTA, surpassing OpenAI GPT-3.
  • UC Merced Land Use (zero-shot): SOTA, surpassing OpenAI CLIP.
  • MS COCO (text-to-image generation): Surpassed OpenAI DALL·E.
  • MS COCO (English image retrieval): Surpassed OpenAI CLIP and Google ALIGN.
  • MS COCO (multilingual image retrieval): Surpassed UC² (the best multilingual and multimodal pre-trained model).
  • Multi 30K (multilingual image retrieval): Surpassed UC².

It’s undeniable that these results are impressive. Wu Dao 2.0 achieves excellent levels on key benchmarks across tasks and modalities. However, a quantitative comparison between Wu Dao 2.0 and the SOTA models on these benchmarks is missing. Until a paper is published, we’ll have to wait to see just how impressive Wu Dao 2.0 really is.

A virtual student

Hua Zhibing, Wu Dao 2.0’s virtual child, is the first Chinese virtual student. She can learn continuously, compose poetry, and draw pictures, and will learn to code in the future. In contrast with GPT-3, Wu Dao 2.0 can learn different tasks over time without forgetting what it has learned previously. This feature seems to bring AI closer still to human memory and learning mechanisms.

Tang Jie went as far as to claim that Hua Zhibing has “some ability in reasoning and emotional interaction.” People’s Daily Online reported that Peng Shuang, a member of Tang’s research group, “hoped that the virtual girl will have a higher EQ and be able to communicate like a human.”

When people started playing with GPT-3, many went crazy with the results. “Sentient”, “general intelligence,” and capable of “understanding” were some of the attributes people ascribed to GPT-3. So far, there’s no proof this is true. Now, the ball is in Wu Dao 2.0’s court to show the world it’s capable of “reasoning and emotional interaction.” For now, I’d be prudent before jumping to conclusions.

Final thoughts: Wu Dao 2.0 towards AGI

Some of BAAI’s most important members expressed their thoughts on Wu Dao 2.0’s role on the road towards AGI (artificial general intelligence):

“The way to artificial general intelligence is big models and big computer. […] What we are building is a power plant for the future of AI. With mega data, mega computing power, and mega models, we can transform data to fuel the AI applications of the future.”

– Dr. Zhang Hongjiang, chairman of BAAI

“These sophisticated models, trained on gigantic data sets, only require a small amount of new data when used for a specific feature because they can transfer knowledge already learned into new tasks, just like human beings. […] Large-scale pre-trained models are one of today’s best shortcuts to artificial general intelligence.”

– Blake Yan, AI researcher

“Wu Dao 2.0 aims to enable machines to think like humans and achieve cognitive abilities beyond the Turing test.”

– Tang Jie, lead researcher behind Wu Dao 2.0

They bet on GPT-like multimodal and multitasking models to reach AGI. Without a doubt, Wu Dao 2.0 – as GPT-3 before it – is an important step towards AGI. Yet, how much closer it will take us is debatable. Some experts argue we’ll need hybrid models to reach AGI. Others defend embodied AI, rejecting traditional bodiless paradigms, such as neural networks, entirely.

No one knows which is the right path. Even if larger pre-trained models are the logical trend today, we may be missing the forest for the trees, and we may end up hitting a less ambitious ceiling ahead. The only clear thing is that if getting there means the world has to suffer environmental damage, harmful biases, or high economic costs, not even reaching AGI would be worth it.

Reprinted with permission from the author.

Alberto Romero is a Spanish engineer, neuroscientist, and writer. Follow him at LinkedIn, Medium and Twitter.
