Deep Learning's Most Important Ideas - A Brief Historical Review

The goal of this post is to review well-adopted ideas that have stood the test of time. I will present a small set of techniques that cover a lot of basic knowledge necessary to understand modern Deep Learning research. If you're new to the field, these are a great starting point.

Authors
Denny Britz
Published
Jul 29, 2020

Deep Learning is an extremely fast-moving field and the huge number of research papers and ideas can be overwhelming. Even seasoned researchers have a hard time telling company PR from real breakthroughs. The goal of this post is to review those ideas that have stood the test of time, which is perhaps the only significance test one should rely on. These ideas, or improvements of them, have been used over and over again. They're known to work.

If you were to start in Deep Learning today, understanding and implementing each of these techniques would give you an excellent foundation for understanding recent research and working on your own projects. It's what I believe is the best way to get started. Working through papers in historical order is also a useful exercise to understand where the current techniques come from and why they were invented in the first place. Put another way, I will try to present a minimal set of ideas that covers most of the basic knowledge necessary to understand modern Deep Learning research.

A rather unique thing about Deep Learning is that its application domains (Vision, Natural Language, Speech, RL, etc.) share the majority of techniques. For example, someone who has worked in Deep Learning for Computer Vision their whole career could quickly become productive in NLP research. The specific network architectures may differ, but the concepts, approaches, and code are mostly the same. I will try to present ideas from various fields, but there are a few caveats about this list:

  • My goal is not to give in-depth explanations or code examples for these techniques. It's not really possible to summarize long, complex papers in a single paragraph. Instead, I will give a brief overview of each technique, its historical context, and links to papers and implementations. If you want to learn something, I highly recommend trying to reproduce some of these paper results from scratch in raw PyTorch without using existing code bases or high-level libraries.
  • The list is biased towards my own knowledge and the fields I am familiar with. There are many exciting subfields that I don't have experience with. I will stick to what most people would consider the popular mainstream domains of Vision, Natural Language, Speech, and Reinforcement Learning / Games.
  • I will only discuss research that has official or semi-official open source implementations that are known to work well. Some research isn't easily reproducible because it involves huge engineering challenges, for example DeepMind's AlphaGo or OpenAI's Dota 2 AI, so I won't highlight it here.
  • Some choices are arbitrary. Often, rather similar techniques are published at around the same time. The goal of this post is not to be a comprehensive review, but to expose someone new to the field to a cross-section of ideas that cover a lot of ground. For example, there may be hundreds of GAN variations, but to understand the general concept of GANs, it really doesn't matter which one you study.

2012 - Tackling ImageNet with AlexNet and Dropout

Papers

Implementations

Source: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

AlexNet is often considered the algorithm responsible for the recent boom in Deep Learning and Artificial Intelligence research. It is a Deep Convolutional Neural Network based on the earlier LeNet developed by Yann LeCun. AlexNet beat previous methods at classifying images from the ImageNet dataset by a significant margin through a combination of GPU power and algorithmic advances. It demonstrated that neural networks actually work! AlexNet was also one of the first times Dropout Hinton et al. (2012) was used, which has since become a crucial component for improving the generalization ability of all kinds of Deep Learning models.

The architecture used by AlexNet, a sequence of Convolutional layers, ReLU nonlinearity, and max-pooling, became the accepted standard that future Computer Vision architectures would extend and build upon. These days, software libraries such as PyTorch are so powerful, and compared to more recent architectures AlexNet is so simple, that it can be implemented in only a few lines of code. Note that many implementations of AlexNet, such as those linked above, use the slight variation of the network described in One weird trick for parallelizing convolutional neural networks Krizhevsky (2014).
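
To make this concrete, here is a minimal PyTorch sketch of an AlexNet-style network: stacked convolution, ReLU, and max-pooling layers followed by fully-connected layers with Dropout. The layer sizes roughly follow the "one weird trick" variant mentioned above; the class name and exact dimensions are illustrative, not a faithful reproduction of the original model.

```python
import torch
import torch.nn as nn

# An AlexNet-style network: Conv -> ReLU -> MaxPool blocks, then fully-connected
# layers regularized with Dropout. Channel sizes roughly follow the common
# "one weird trick" variant used by most modern implementations.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                      # (N, 256, 6, 6) for 224x224 inputs
        return self.classifier(torch.flatten(x, 1))

logits = AlexNetSketch()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```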

2013 - Playing Atari with Deep Reinforcement Learning

Papers

Implementations

Source: https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning

Building on top of the recent breakthroughs in image recognition and GPUs, a team at DeepMind managed to train a network to play Atari Games from raw pixel inputs. What's more, the same neural network architecture learned to play seven different games without being told any game-specific rules, demonstrating the generality of the approach.

Reinforcement Learning differs from Supervised Learning, such as image classification, in that an agent must learn to maximize the sum of rewards over multiple time steps, such as winning a game, instead of just predicting a label. Because the agent interacts directly with the environment and each action affects the next, the training data is not independent and identically distributed (iid), which makes the training of many Machine Learning models quite unstable. This was solved by using techniques such as experience replay Lin (1992).
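
As a rough illustration, here is a minimal sketch of an experience replay buffer. The class and variable names are hypothetical; the point is simply that transitions are stored as the agent plays and later sampled at random, which breaks up the correlations in the raw experience stream.

```python
import random
from collections import deque

# Experience replay: store (state, action, reward, next_state, done) transitions
# as the agent interacts with the environment, then sample random mini-batches so
# that the training data looks closer to i.i.d. than the raw, correlated stream.
class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Sketch of DQN-style usage: collect experience, then train on decorrelated batches.
# buffer.push(s, a, r, s_next, done)
# if len(buffer) > 1_000:
#     batch = buffer.sample(32)   # mini-batch for the Q-network update
```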

While there was no obvious algorithmic innovation that made this work, the research cleverly combined existing techniques, convolutional neural networks trained on GPUs and experience replay, with a few data processing tricks to achieve impressive results that most people would not have expected. This gave people confidence in extending Deep Reinforcement Learning techniques to tackle even more complex tasks such as Go, Dota 2, Starcraft 2, and others.

Atari Games Bellemare et al. (2013) have since become a standard benchmark in Reinforcement Learning research. The initial approach only solved (beat human baselines on) seven games, but over the coming years advances built on top of these ideas would start beating humans on an ever increasing number of games. One particular game, Montezuma’s Revenge, was famous for requiring long-term planning and was considered to be among the most difficult to solve. It was only recently Badia et al. (2020) Ecoffet et al. (2020) that techniques managed to beat human baselines on all 57 games.

2014 - Encoder-Decoder Networks with Attention

Papers

Implementations

Source: https://ai.googleblog.com/2017/04/introducing-tf-seq2seq-open-source.html

Deep Learning's most impressive results had largely been on vision-related tasks and were driven by Convolutional Neural Networks. While the NLP community had success with Language Modeling and Translation using LSTM networks Hochreiter and Schmidhuber (1997) and Encoder-Decoder architectures Sutskever, Vinyals, and Le (2014), it was not until the invention of the attention mechanism Bahdanau, Cho, and Bengio (2016) that things started to work spectacularly well.

When processing language, each token, which could be a character, a word, or something in between, is fed into a recurrent network, such as an LSTM, which maintains a kind of memory of previously processed inputs. In other words, a sentence is very similar to a time series with each token being a time step. These recurrent models often had difficulty dealing with dependencies over long time horizons. When processing a sequence, they would easily "forget" earlier inputs because their gradients had to propagate through many time steps. Optimizing these models with gradient descent was hard.

The new attention mechanism helped alleviate the problem. It gave the network an option to adaptively "look back" at earlier time steps by introducing shortcut connections. These connections allowed the network to decide which inputs are important when producing a specific output. The canonical example is translation: When producing an output word, it typically maps to one or more specific input words.
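
A minimal sketch of such an attention mechanism, in the additive style of Bahdanau, Cho, and Bengio (2016), might look as follows. The module and tensor names are illustrative: given all encoder states and the current decoder state, it computes a weight for each input position and returns their weighted sum as a context vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Additive (Bahdanau-style) attention sketch: score every encoder position against
# the current decoder state, normalize the scores with a softmax, and return the
# weighted sum of encoder states ("context vector") plus the attention weights.
class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int = 128):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states: torch.Tensor, dec_state: torch.Tensor):
        # enc_states: (batch, src_len, enc_dim), dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                                   # (batch, src_len)
        weights = F.softmax(scores, dim=-1)              # which inputs to "look back" at
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)  # (batch, enc_dim)
        return context, weights

enc = torch.randn(4, 10, 256)   # 10 encoder time steps
dec = torch.randn(4, 512)       # current decoder state
context, weights = AdditiveAttention(256, 512)(enc, dec)
```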

2014 - Adam Optimizer

Papers

Implementations

Source: http://arxiv.org/abs/1910.11758

Neural networks are trained by minimizing a loss function, such as the average classification error, using an optimizer. The optimizer is responsible for figuring out how to adjust the parameters of the network to make it learn the objective. Most optimizers are based on variations of Stochastic Gradient Descent (SGD). However, many of these optimizers themselves contain tunable hyperparameters, such as the learning rate. Finding the right settings for a specific problem not only reduces training time, but can also lead to better results due to finding a better local minimum of the loss function.

Big research labs often ran expensive hyperparameter searches that came up with complex learning rate schedules to get the best out of simple but hyperparameter-sensitive optimizers such as SGD. When they beat existing benchmarks, it was sometimes the result of spending a lot of money to tune the optimizer. Such details often went unmentioned in published research papers. Researchers who did not have the same budget to optimize their optimizer were stuck with worse results.

The Adam optimizer proposed to use the first and second moments of the gradients to automatically adapt the learning rate. The result turned out to be quite robust and less sensitive to hyperparameter choices. In other words, Adam often just works and does not require the same extensive tuning as other optimizers Sivaprasad et al. (2020). While an extremely well-tuned SGD could still get slightly better results, Adam made research more accessible because if something didn't work, you knew it was unlikely to be the fault of a badly tuned optimizer.
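
For reference, the core of the Adam update can be written in a few lines. This is a simplified sketch of the moment estimates described in the paper, applied to a single parameter tensor; in practice you would simply use torch.optim.Adam.

```python
import torch

# One Adam step for a single parameter tensor: keep exponential moving averages of
# the gradient (first moment m) and squared gradient (second moment v), correct
# their initialization bias, and scale the step size per parameter.
def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t = step count, >= 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

# In practice: torch.optim.Adam(model.parameters(), lr=1e-3) does this for you.
```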

2014/2015 - Generative Adversarial Networks (GANs)

Papers

Implementations

Source: https://developers.google.com/machine-learning/gan/gan_structure

The goal of generative models, such as variational autoencoders, is to create realistic-looking data samples, like these images of people's faces you've probably seen somewhere. Because they have to model the full data distribution (many pixels!), and not just classify cats or dogs as a discriminative model would, such models are often quite difficult to train. Generative Adversarial Networks, or GANs, are one such type of model.

The basic idea behind GANs is to train two networks in tandem - a generator and a discriminator. The generator's goal is to produce samples that fool the discriminator, which is trained to distinguish between real and generated images. Over time, the discriminator will become better at recognizing fakes, but the generator will also become better at fooling the discriminator and thus produce ever more realistic-looking samples. The very first iteration of GANs produced blurry low-resolution images and was quite unstable to train. But over time, variation and improvements such as DCGAN Radford, Metz, and Chintala (2016), Wasserstein GAN Arjovsky, Chintala, and Bottou (2017), CycleGAN Zhu et al. (2018), StyleGAN (v2) Karras et al. (2020), and many others have built upon this idea to produce high-resolution photorealistic images and videos.
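
A minimal sketch of this two-player training loop might look as follows, assuming a generator G and a discriminator D (which outputs a real-vs-fake logit) are defined elsewhere; the function name and details are illustrative.

```python
import torch
import torch.nn.functional as F

# One GAN training iteration: first update the discriminator to tell real images
# from generated ones, then update the generator to fool the discriminator.
def gan_step(G, D, opt_G, opt_D, real_images, latent_dim=100):
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator update: real images should score 1, generated images 0.
    fake_images = G(torch.randn(batch, latent_dim)).detach()   # don't backprop into G here
    d_loss = F.binary_cross_entropy_with_logits(D(real_images), ones) + \
             F.binary_cross_entropy_with_logits(D(fake_images), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Generator update: its samples should make D predict "real".
    fake_images = G(torch.randn(batch, latent_dim))
    g_loss = F.binary_cross_entropy_with_logits(D(fake_images), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```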

2015 - Residual Networks (ResNet)

Papers

Implementations

Researchers had been building on top of the AlexNet breakthrough for a while, inventing better-performing architectures based on Convolutional Neural Networks such as VGGNet Simonyan and Zisserman (2015), Inception Szegedy et al. (2014), and many others. ResNet was the next iteration in this rapid series of advances. To this day, ResNet variations are commonly used as a baseline model architecture for all kinds of tasks and as building blocks for more complex architectures.

What made ResNet special, apart from it winning first place in the ILSVRC 2015 classification challenge, was its depth compared to other network architectures. The deepest network presented in the paper had 1,000 layers and still performed well, though slightly worse than its 101 and 152 layer counterparts on the benchmark tasks. Training such deep networks was a challenging optimization problem due to the vanishing gradients, which also appeared in sequence models. Not many researchers believed that training such extremely deep networks could lead to good stable results.

ResNet used identity shortcut connections that help the gradient flow. One way to interpret these connections is that ResNet only needs to learn "deltas" from one layer to another, which is often easier than learning full transformations. Such identity connections were a special case of the connections proposed in Highway Networks Srivastava, Greff, and Schmidhuber (2015), which in turn were inspired by the gating mechanisms of LSTMs.
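
A basic residual block is only a few lines of PyTorch. This sketch (with illustrative layer sizes) shows the identity shortcut: the convolutions learn a residual that is simply added back onto the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A residual block: the conv layers only have to learn a "delta" F(x),
# because the identity shortcut adds the input x back onto the output.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + x)   # identity shortcut: gradients can flow straight through

y = ResidualBlock(64)(torch.randn(2, 64, 32, 32))  # same shape in, same shape out
```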

2017 - Transformers

Papers

Implementations

Source: https://arxiv.org/abs/1706.03762

Sequence-to-Sequence models with attention (described earlier in this post) worked quite well, but they had a few drawbacks due to their recurrent nature, which required sequential computation. They were difficult to parallelize because they processed the input one step at a time, with each time step depending on the previous one. This also made it difficult to scale them to very long sequences. Even with their attention mechanism, they still struggled with modeling complex long-range dependencies. Most of the "work" seemed to be done in the recurrent layers.

Transformers solved these issues by completely removing the recurrence and replacing it with multiple feed-forward self-attention layers, processing all inputs in parallel and producing relatively short (= easy to optimize with gradient descent) paths between inputs and outputs. This made them really fast to train, easy to scale, and able to process a lot more data. To tell the network about the order of the inputs, which was implicit in the recurrent model, Transformers used positional encodings Gehring et al. (2017). To learn more about how exactly transformers work, which can be a bit confusing at first, I recommend this illustrated guide.
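
The core of a Transformer layer is scaled dot-product self-attention, which can be sketched in a few lines. This is a simplified single-head version with illustrative dimensions; real implementations use multiple heads, learned nn.Linear projections, masking, residual connections, and so on.

```python
import math
import torch
import torch.nn.functional as F

# Single-head self-attention sketch: every position attends to every other
# position in parallel, so there is no recurrence and the path between any
# two tokens is a single attention step.
def self_attention(x, W_q, W_k, W_v):
    # x: (batch, seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                        # who attends to whom
    return weights @ v                                         # (batch, seq, d_k)

d_model, d_k = 64, 64
x = torch.randn(2, 10, d_model)                      # 10 tokens, processed in parallel
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)
```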

To say that Transformers worked better than almost anyone expected would be an understatement. Over the next few years, they would become the standard architecture for the vast majority of NLP and other sequence tasks, and even make their way into architectures for computer vision.

2018 - BERT and fine-tuned NLP Models

Papers

Implementations

Pre-training refers to training a model to perform some task and then using the learned parameters as an initialization to learn a related task. This makes intuitive sense - a model that has learned to classify images as cats or dogs should have learned something general about images and furry animals. When this model is fine-tuned to classify foxes, we would expect it to do better than a model that must learn from scratch. Similarly, a model that has learned to predict the next word in a sentence should have learned something general about human language patterns. We would expect it to be a good initialization for related tasks like translation or sentiment analysis.
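
In code, fine-tuning often amounts to loading pre-trained weights and swapping out the task-specific head. Here is a sketch using a torchvision ResNet as the pre-trained backbone; the two-class "fox" task is hypothetical, and the exact weights argument depends on your torchvision version.

```python
import torch.nn as nn
import torchvision

# Fine-tuning sketch: reuse ImageNet-pretrained weights as the initialization and
# replace only the final classification layer for a new 2-class task.
# (Assumes a recent torchvision; older versions use `pretrained=True` instead.)
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)   # new head for the target task

# Option A: fine-tune everything with a small learning rate.
# Option B: freeze the pretrained backbone and train only the new head:
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
```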

Pre-training and fine-tuning had been used successfully in both Computer Vision and NLP, but while it had been standard in vision for a long time, making it work well in NLP seemed more challenging. Most state-of-the-art results still came from fully supervised models. With the advent of transformers, researchers finally started to make pre-training work, resulting in approaches such as ELMo Peters et al. (2018), ULMFiT Howard and Ruder (2018) and OpenAI's GPT.

BERT was the latest of such developments and many consider it to have started a new era of NLP research. Instead of being pre-trained on predicting the next word, as most other models were, it was pre-trained on predicting masked (intentionally removed) words anywhere in the sentence, and on whether two sentences are likely to follow each other. Note that these tasks don't require labeled data: the model can be trained on any text, and there's a whole lot of it! This pre-trained model, which has likely learned some general properties of language, can then be fine-tuned to solve supervised tasks, such as question answering or sentiment prediction. BERT performed incredibly well across a wide variety of tasks. Companies such as HuggingFace made it easy to download and fine-tune BERT-like models for any NLP task. Since then, BERT has been built upon by advances such as XLNet Yang et al. (2020), RoBERTa Liu et al. (2019), and ALBERT Lan et al. (2020).
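
To make the masked-word objective concrete, here is a minimal sketch. The model is assumed to map token ids to per-position vocabulary logits; the masking rate and function name are illustrative, and real BERT pre-training adds a few more details (e.g. sometimes keeping or randomizing the masked token).

```python
import torch
import torch.nn.functional as F

# Masked language modeling sketch: randomly hide ~15% of the tokens, let the model
# predict them, and compute the loss only at the masked positions. No labels are
# needed beyond the text itself.
def masked_lm_loss(model, token_ids, mask_token_id, mask_prob=0.15):
    # token_ids: (batch, seq_len) integer tensor of input tokens
    mask = torch.rand(token_ids.shape) < mask_prob           # which positions to hide
    inputs = token_ids.masked_fill(mask, mask_token_id)      # replace them with [MASK]
    logits = model(inputs)                                    # (batch, seq_len, vocab)
    return F.cross_entropy(logits[mask], token_ids[mask])     # predict the original words
```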

2019/2020 and beyond - BIG Language Models, Self-Supervised Learning?

The clearest trend throughout the history of Deep Learning is perhaps that of the bitter lesson. Algorithmic advances for better parallelization (= more data) and more model parameters win over "smarter techniques" again and again. This trend seems to continue well into 2020 where GPT-3, a huge 175 billion parameter language model by OpenAI, shows unexpectedly good generalization abilities despite its simple training objective and standard architecture.

Playing into the same trend are approaches such as contrastive self-supervised learning, e.g. SimCLR, that make better use of unlabeled data. As models become bigger and faster to train, techniques that can make efficient use of the huge amount of unlabeled data on the web, and learn general-purpose knowledge that can be transferred to other tasks, are becoming more valuable and widely adopted.
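
As a rough illustration of the contrastive idea, here is a sketch of a SimCLR-style (NT-Xent) loss. The function name and temperature are illustrative, and z1 and z2 are assumed to be embeddings of two random augmentations of the same batch of unlabeled images.

```python
import torch
import torch.nn.functional as F

# Contrastive (NT-Xent) loss sketch: each example should be close to the embedding
# of its other augmented view and far from every other example in the batch.
def nt_xent_loss(z1, z2, temperature=0.5):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d), unit length
    sim = z @ z.t() / temperature                             # pairwise cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of the other view
    return F.cross_entropy(sim, targets)

loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```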

Honorary mentions

If you really need more papers to look at…

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. “Wasserstein GAN.” arXiv:1701.07875 [Cs, Stat], December. http://arxiv.org/abs/1701.07875.

Badia, Adrià Puigdomènech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, and Charles Blundell. 2020. “Agent57: Outperforming the Atari Human Benchmark.” arXiv:2003.13350 [Cs, Stat], March. http://arxiv.org/abs/2003.13350.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2016. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv:1409.0473 [Cs, Stat], May. http://arxiv.org/abs/1409.0473.

Bellemare, Marc G., Yavar Naddaf, Joel Veness, and Michael Bowling. 2013. “The Arcade Learning Environment: An Evaluation Platform for General Agents.” Journal of Artificial Intelligence Research 47 (June): 253–79. https://doi.org/10.1613/jair.3912.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805 [Cs], May. http://arxiv.org/abs/1810.04805.

Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. 2020. “First Return Then Explore.” arXiv:2004.12919 [Cs], May. http://arxiv.org/abs/2004.12919.

Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. “Convolutional Sequence to Sequence Learning.” arXiv:1705.03122 [Cs], July. http://arxiv.org/abs/1705.03122.

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Networks.” arXiv:1406.2661 [Cs, Stat], June. http://arxiv.org/abs/1406.2661.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” arXiv:1512.03385 [Cs], December. http://arxiv.org/abs/1512.03385.

Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors.” arXiv:1207.0580 [Cs], July. http://arxiv.org/abs/1207.0580.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

Howard, Jeremy, and Sebastian Ruder. 2018. “Universal Language Model Fine-Tuning for Text Classification.” arXiv:1801.06146 [Cs, Stat], May. http://arxiv.org/abs/1801.06146.

Karras, Tero, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. “Analyzing and Improving the Image Quality of StyleGAN.” arXiv:1912.04958 [Cs, Eess, Stat], March. http://arxiv.org/abs/1912.04958.

Kingma, Diederik P., and Jimmy Ba. 2017. “Adam: A Method for Stochastic Optimization.” arXiv:1412.6980 [Cs], January. http://arxiv.org/abs/1412.6980.

Krizhevsky, Alex. 2014. “One Weird Trick for Parallelizing Convolutional Neural Networks.” arXiv:1404.5997 [Cs], April. http://arxiv.org/abs/1404.5997.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105. Curran Associates, Inc. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. “ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations.” arXiv:1909.11942 [Cs], February. http://arxiv.org/abs/1909.11942.

Lin, Long-Ji. 1992. “Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching.” Machine Language 8 (3-4): 293–321. https://doi.org/10.1007/BF00992699.

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv:1907.11692 [Cs], July. http://arxiv.org/abs/1907.11692.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. “Playing Atari with Deep Reinforcement Learning.” arXiv:1312.5602 [Cs], December. http://arxiv.org/abs/1312.5602.

Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” arXiv:1802.05365 [Cs], March. http://arxiv.org/abs/1802.05365.

Radford, Alec, Luke Metz, and Soumith Chintala. 2016. “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.” arXiv:1511.06434 [Cs], January. http://arxiv.org/abs/1511.06434.

Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv:1409.1556 [Cs], April. http://arxiv.org/abs/1409.1556.

Sivaprasad, Prabhu Teja, Florian Mai, Thijs Vogels, Martin Jaggi, and François Fleuret. 2020. “Optimizer Benchmarking Needs to Account for Hyperparameter Tuning.” arXiv:1910.11758 [Cs, Stat], February. http://arxiv.org/abs/1910.11758.

Srivastava, Rupesh K, Klaus Greff, and Jürgen Schmidhuber. 2015. “Training Very Deep Networks.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2377–85. Curran Associates, Inc. http://papers.nips.cc/paper/5850-training-very-deep-networks.pdf.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. “Sequence to Sequence Learning with Neural Networks.” arXiv:1409.3215 [Cs], December. http://arxiv.org/abs/1409.3215.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. “Going Deeper with Convolutions.” arXiv:1409.4842 [Cs], September. http://arxiv.org/abs/1409.4842.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv:1706.03762 [Cs], December. http://arxiv.org/abs/1706.03762.

Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2020. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” arXiv:1906.08237 [Cs], January. http://arxiv.org/abs/1906.08237.

Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A. Efros. 2018. “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks.” arXiv:1703.10593 [Cs], November. http://arxiv.org/abs/1703.10593.

BibTeX Citation
@article{britz2020deeplearningmostimportantideas,
  author={Denny Britz},
  title={Deep Learning's Most Important Ideas - A Brief Historical Review},
  year={2020},
  url={http://dennybritz.com/blog/deep-learning-most-important-ideas},
}