4 Things GPT-4 Will Improve From GPT-3

GPT-3 revolutionized AI. Will GPT-4 do the same?

By Alberto Romero | 28 May 2021

(Photo by Robynne Hu on Unsplash)

In May 2020, OpenAI presented GPT-3 in a paper titled Language Models are Few-Shot Learners. GPT-3, the largest neural network ever created, revolutionized the AI world. OpenAI released a beta API for people to play with the system and soon the hype started building up. People were finding crazy results. GPT-3 could transform a description of a web page into the corresponding code. It could emulate people and write customized poetry or songs. And it could wonder about the future or the meaning of life.

And it wasn’t trained for any of this. GPT-3 was brute-force trained on most of the Internet’s available text data, but it wasn’t explicitly taught to do these tasks. The system is so powerful that it became a meta-learner: it learned how to learn. And users could communicate with it in plain natural language; GPT-3 would read the description and recognize the task it had to do.

This was a year ago. For the last 3 years, OpenAI has been releasing GPT models yearly. In 2018 they presented GPT-1, then GPT-2 in 2019, and finally, GPT-3 arrived in 2020. Following this pattern, we could presumably be close to the creation of a hypothetical GPT-4. Given everything that GPT-3 can do and the degree to which it has changed some paradigms within AI, the question is: What can we expect from GPT-4? Let’s get into it!

Disclaimer: GPT-4 doesn’t exist (yet). What follows is a compilation of speculation and predictions based on my knowledge of GPT models in general and GPT-3 in particular, which I compiled in a long-form article for Towards Data Science.

GPT-3 is big. GPT-4 will be bigger. Here’s the reason

GPT-3 isn’t just big. The title of biggest neural network ever created is ambiguous on its own; a model could hold it while being only a tiny fraction bigger than the next one. To put GPT-3’s size into perspective: it is 100x bigger than its predecessor, GPT-2, which was already extremely large when it came out in 2019. GPT-3 has 175 billion parameters, roughly 10x the size of its closest competitor.
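The scale jump is easy to check with quick arithmetic over the publicly reported parameter counts (the 100x figure in the text is a round number):

```python
# Publicly reported parameter counts for the GPT family (approximate).
params = {
    "GPT-1": 117e6,   # 117 million
    "GPT-2": 1.5e9,   # 1.5 billion
    "GPT-3": 175e9,   # 175 billion
}

# Ratio between consecutive generations.
ratio = params["GPT-3"] / params["GPT-2"]
print(f"GPT-3 is about {ratio:.0f}x the size of GPT-2")  # ~117x
```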

Increasing the number of parameters 100-fold from GPT-2 to GPT-3 brought more than quantitative differences. GPT-3 isn’t just more powerful than GPT-2; there’s a qualitative leap between the two models. GPT-3 can do things GPT-2 simply can’t. From this fact, it’s reasonable to expect that OpenAI will continue the trend, making GPT-4 notably bigger than GPT-3 in the hope of finding new qualitative leaps. If GPT-3 can learn to learn, who knows what GPT-4 could bring. We might see the first neural network capable of true reasoning and understanding.

These results would further reinforce the notion that “bigger is better.” In the words of DeepMind’s researcher Richard Sutton, “the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.” He called this The Bitter Lesson in AI. We’ll see if it holds up in the future.

GPT-4 will perform better few-shot multitasking

GPT-3 was impressive at solving NLP tasks such as machine translation, question answering, or cloze tasks (fill-in-the-blank) in few-shot settings. In zero-shot settings, however, its performance wasn’t as good. Expecting GPT-3 to solve a task it hasn’t been trained on, without even seeing an example beforehand, may be too much to ask of it. Not even we humans can solve many tasks by pure intuition, and we have the advantage of living in the world, full of rich context that helps us navigate reality.

GPT-3’s strength is few-shot multitasking. OpenAI researchers acknowledged that few-shot results were notably better than zero-shot results. As Rohin Shah puts it, “few-shot performance increases as the number of parameters increases, and the rate of increase is faster than the corresponding rate for zero-shot performance.” This means that GPT-3 is a meta-learner, and that the bigger a model is, the better its meta-learning capabilities.

If we assume GPT-4 will have many more parameters, then we can expect it to be an even better meta-learner. One usual criticism of deep learning systems is that they need a lot of examples to learn a single task. GPT-4 could be proof that language models can learn to multitask from a few examples almost as well as we can. GPT-3 can “understand” that it has to continue a conversation without being explicitly told to do so. We can only imagine what GPT-4 could do.

GPT-4 won’t depend as much on good prompting

OpenAI released the beta API playground in July 2020 for external developers to play with GPT-3. One of the most powerful features of the system is that we can communicate with it using natural language. I could tell GPT-3: “The following is a story about the universe that a wise man is telling a young boy. The wise man is nice, helpful, and knowledgeable on cosmology and astronomy,” and the system would continue the story without me explicitly saying so.

But not just that. GPT-3 would also make the wise man tell the story nicely and make him appear knowledgeable about the universe. It’d also write the story in easy language because the man is talking to a kid. GPT-3 would get all this information with that simple sentence. Any person could infer all this, but an AI? It’s incredible. Tech blogger Gwern Branwen has a lot of similar examples in his blog.

Gwern calls this way of interacting with GPT-3 prompt programming. We can give GPT-3 a written input and it’ll know which task it has to perform. The prompt can include some explicit examples – a few-shot setting – to help the system. It’s striking that the system can perform different tasks it’s never been trained on just by telling it to do them in plain English.
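Under the hood, prompt programming is just careful string construction. Here is a minimal sketch of how a few-shot prompt might be assembled for a translation task (the template, examples, and helper name are illustrative, not taken from OpenAI’s documentation):

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: a task description, a handful
    of worked examples, then the new input left for the model
    to complete."""
    lines = [task_description, ""]
    for english, french in examples:
        lines.append(f"English: {english}")
        lines.append(f"French: {french}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model continues from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("dog", "chien")],
    "house",
)
print(prompt)
```

In a zero-shot setting the examples list would simply be empty, which is exactly the setting where GPT-3’s performance drops.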

However, Gwern warns that the results of these tests can vary in quality. The reason is that prompt programming involves sampling, and sampling can “prove the presence of knowledge but not the absence.” A bad prompt will produce a bad result, but who is to blame: GPT-3, or the human who wrote the bad prompt?

This is a problem because, as Gary Marcus has criticized, not knowing when a prompt will generate a bad result underscores a deep drawback of the system. We could always find a better prompt, Gwern would argue, but what is the point if we can’t be sure whether the result is correct? To find the limits of the system we shouldn’t give up when a prompt doesn’t work, only when no prompt works. The impracticality of this approach is obvious, however: we can’t test every possible prompt.

This is why GPT-4 needs to be more robust to bad prompting. Human error will always be there unless we standardize prompt programming (which will most likely happen in the coming years), and even then the limitations of these systems will go hand in hand with our inability to extract their true potential.

A true artificial intelligence shouldn’t depend so much on good prompting in the first place. We humans also depend on “prompting,” but we can self-assess to find issues. If I’m taking an exam and an exercise is badly worded, it may confuse me, but I can realize the wording is bad and ask the teacher. GPT-3 can’t do that; it would try to perform the task without noticing anything was wrong.

GPT-4 may implement a way of assessing the quality of a given prompt. This is more science-fiction than reality as of now, but we’ll need to keep this in mind in the future. A system that’s incapable of self-assessment can’t be called intelligent. If GPT-4 could express doubt and lack of understanding as in: “I don’t know,” “I’m not very convinced of my answer,” or “your prompt is not very clear” that would be a great step in this direction.
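One conceivable (and entirely speculative) mechanism for such self-assessment is to threshold the model’s own token probabilities, which the beta API already exposes. The helper and numbers below are made up for illustration only:

```python
import math

def flag_low_confidence(token_logprobs, threshold=0.5):
    """Return True if the average per-token probability of an
    answer falls below the threshold (a toy heuristic, not
    anything GPT-3 actually does)."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob) < threshold

# Made-up log-probabilities for two hypothetical answers.
confident_answer = [-0.05, -0.10, -0.02]   # avg prob ~0.94
uncertain_answer = [-1.2, -2.5, -0.9]      # avg prob ~0.22
print(flag_low_confidence(confident_answer))  # False
print(flag_low_confidence(uncertain_answer))  # True -> say "I'm not sure"
```

A system wired this way could prepend a hedge like “I’m not very convinced of my answer” whenever the flag fires, which is the kind of behavior the paragraph above is asking for.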

GPT-4 will have a larger context window

GPT-3 is very powerful, but its memory is quite limited; a person, in contrast, doesn’t forget things that happened yesterday. The beta API allows the user to input a text of 500-1,000 words (the context window) for GPT-3 to work with. This means the system can’t continue a half-written novel or complete the code for a large program.

GPT-3 simply has no clue what lies outside its context window. This limitation hits few-shot settings especially hard, because the user has to fit several examples into the prompt. For Q&A this may not be a problem, since questions and answers are short and repetitive, but for tasks such as translating an article it’s not doable. GPT-4 will likely share this limitation, but the degree to which it narrows the system’s usefulness should diminish.
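What “no clue outside the window” means in practice is that any input longer than the window must be cut before the model ever sees it. Here is a crude word-based sketch (real systems count tokens, not words; the 1,000-word limit mirrors the figure above):

```python
def fit_to_context(text, max_words=1000):
    """Keep only the last `max_words` words of the input.
    Everything earlier is dropped, which is why the model
    cannot 'remember' the start of a long document."""
    words = text.split()
    return " ".join(words[-max_words:])

# A 5,000-word document: only the final 1,000 words survive.
long_doc = " ".join(f"word{i}" for i in range(5000))
visible = fit_to_context(long_doc)
print(len(visible.split()))   # 1000
print(visible.split()[0])     # word4000
```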

Arguably, a more important issue is that even if we can fit our intentions into a few-hundred-word prompt, GPT-3 is a forgetful machine. It finds it difficult to keep coherence intact over long texts. If we start writing an article and ask GPT-3 to continue indefinitely, it’d end up repeating ideas or even drifting into unrelated topics.

GPT-3 and the previous transformer-based models all suffer from this restriction. Transformers are an “old” architecture, based only on attention (no convolution, no recurrence), that appeared in 2017. Gwern argues that there are already ways to improve the attention deficits of transformers. Plain transformers are one way to create super-powerful language models, but they may not be the only way, or even the best one.

These ideas could be implemented in GPT-4. It’d enjoy a larger context window, allowing users to feed the system books, long-form articles, images, video, or audio.


Here are my predictions of how GPT-4 would improve from GPT-3:

  • GPT-4 will have more parameters, and it’ll be trained with more data to make it qualitatively more powerful.
  • GPT-4 will be better at multitasking in few-shot settings. Its performance in these settings will be closer to that of humans.
  • GPT-4 will depend less on good prompting. It will be more robust to human-made errors.
  • GPT-4 will avoid the limitations of early transformer architectures. The context window will be larger, allowing the system to perform more complex tasks.

Reprinted with permission from the author.

Alberto Romero is a Spanish engineer, neuroscientist, and writer. Follow him at LinkedIn, Medium and Twitter.
