Posts
Since being proposed by Jacot et al.1 in 2018, neural tangent kernels have become an important tool in recent literature2,3,4 studying the theoretical behavior of deep neural networks. In this post I’ll discuss one of the earlier works on convolutional neural tangent kernels, or CNTKs, by Arora et al.5 in NeurIPS 2019. In this paper, the authors derived an efficient algorithm that computes a CNTK kernel regression problem which, they claim, can estimate the training dynamics of ReLU CNNs trained with gradient descent. The CNTK algorithm achieves 77.4% accuracy on the CIFAR-10 dataset. Although this is poor compared to well-trained deep CNNs, it is nonetheless an impressive 10% higher than the previous state of the art for a purely kernel-based method. Since the main significance of NTKs lies in their being a powerful tool for studying the dynamics of training deep neural networks, one could argue that the absolute accuracy numbers are of less concern. In this post I’ll focus on how the CNTK works exactly, as well as what kinds of assumptions and limitations it might have for studying mainstream deep neural networks.
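As a rough illustration of the kernel-regression step only (not the CNTK kernel itself, which the paper computes layer by layer for convolutional architectures), here is a minimal NumPy sketch of exact kernel regression with a placeholder kernel. The function name `kernel_regression_predict` and the RBF stand-in kernel are illustrative assumptions, not from the paper.

```python
import numpy as np

def kernel_regression_predict(kernel_fn, X_train, y_train, X_test, reg=1e-6):
    """Exact kernel regression: f(x) = K(x, X) (K(X, X) + reg*I)^{-1} y.

    `kernel_fn` stands in for the (C)NTK kernel between two inputs; the
    paper's contribution is an efficient way to compute that kernel for CNNs.
    """
    n = len(X_train)
    K_train = np.array([[kernel_fn(a, b) for b in X_train] for a in X_train])
    K_test = np.array([[kernel_fn(a, b) for b in X_train] for a in X_test])
    alpha = np.linalg.solve(K_train + reg * np.eye(n), y_train)
    return K_test @ alpha

# e.g. with an RBF kernel as a placeholder for the CNTK:
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
```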
Contrastive learning has emerged as a leading paradigm for self-supervised learning of high-quality visual representations from unlabeled data. It is a manifestation of a broader trend in the deep learning community in recent years that seeks to reduce the need for large amounts of labeled data through unsupervised or self-supervised pretraining. In contrastive learning, the network is trained with a contrastive loss function that discriminates between “positive” and “negative” views of images. This post briefly introduces the contrastive framework along with some of the established baseline works.
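As a concrete, hedged sketch of what such a loss can look like, below is a minimal InfoNCE-style implementation in PyTorch, assuming two augmented views per image and cosine similarity as the critic; the exact formulation varies between the baseline works the post discusses.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss for a batch of paired views.

    z1, z2: (N, D) embeddings of two augmentations of the same N images.
    For each anchor, its paired view is the positive; the other 2N - 2
    embeddings in the batch act as negatives.
    """
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # (2N, 2N) similarities
    sim.fill_diagonal_(float('-inf'))                    # mask self-similarity
    # positive for row i is row i + N (and vice versa)
    targets = torch.cat([torch.arange(n, device=z1.device) + n,
                         torch.arange(n, device=z1.device)])
    return F.cross_entropy(sim, targets)
```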
This post briefly introduces the popular deep reinforcement learning algorithms TRPO and PPO, from the theoretical grounds leading up to their development, to the various approximations necessary for the agents to work in practice. As these methods become more mainstream, it is natural to raise a question central to all deep RL methods: to what extent do the algorithms in practice reflect the theoretical principles that led to their development? Recently, several interesting works have explored the empirical behaviors of these algorithms in depth. They suggest that the seemingly sound theoretical justifications for the algorithms often fail to manifest in practice, and consequently there is still much to be understood about why these algorithms perform well on certain benchmarks.
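For reference, PPO’s central practical approximation is its clipped surrogate objective, which stands in for TRPO’s explicit trust-region constraint. The short PyTorch sketch below is a generic illustration of that objective, not an excerpt from any particular implementation.

```python
import torch

def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping keeps the update from moving
    the policy too far from the one that collected the data, a practical
    substitute for TRPO's KL trust-region constraint.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```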
This blog post briefly reviews the paper “Generative Pretraining from Pixels” by Mark Chen et al., one of the ICML 2020 best paper award candidates. In this work, the authors constructed a new sequence pretraining task by resizing and reshaping a 2D image into a 1D sequence of pixels. Using this sequence as input, GPT-2, the highly successful pretrained language model consisting primarily of self-attention modules, is then pretrained in a self-supervised manner by predicting pixels. The paper largely follows widely used paradigms from modern natural language processing. In particular, the authors investigated the effects of both the auto-regressive and the denoising auto-encoding pretraining objectives. For evaluation, the paper reports results obtained through linear probing, treating the pretrained model as a feature extractor, as well as through full fine-tuning, the standard paradigm of transfer learning for image tasks.
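To make the preprocessing concrete, here is a hedged NumPy sketch of turning an image into a raster-order pixel sequence for autoregressive prediction. The nearest-neighbour resize and the function name `image_to_pixel_sequence` are illustrative assumptions, and the paper additionally reduces the color space with a small palette, which is omitted here.

```python
import numpy as np

def image_to_pixel_sequence(image, target_size=32):
    """Flatten an (H, W, 3) image into a 1D sequence of pixels.

    Mirrors the rough idea of the preprocessing: resize to a small
    resolution, then read pixels in raster order.
    """
    h, w = image.shape[:2]
    # naive nearest-neighbour resize to keep the sketch dependency-free
    rows = np.arange(target_size) * h // target_size
    cols = np.arange(target_size) * w // target_size
    small = image[rows][:, cols]      # (target_size, target_size, 3)
    return small.reshape(-1, 3)       # raster-order pixel sequence

# autoregressive pretraining then predicts seq[t] given seq[:t]
```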
This blog post briefly introduces the paper “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks” by Mingxing Tan & Quoc V. Le, ICML 2019. In the paper, the authors detailed a paradigm for designing convolutional neural networks based on an augmented form of neural architecture search, where the search space is reduced considerably by factorizing the search procedure. A baseline architecture is produced, which is subsequently modified by a compound scaling scheme proposed in the paper that simultaneously scales the model in the depth, width, and resolution dimensions in a holistic manner. The result is a set of networks termed EfficientNets, ranging widely in complexity, which achieve state-of-the-art or highly competitive results against mainstream benchmark models under comparable resource constraints in classification and transfer tasks.
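The compound scaling rule itself is compact enough to sketch: for a coefficient φ, depth, width, and resolution are grown as α^φ, β^φ, and γ^φ under the constraint α · β² · γ² ≈ 2, so each increment of φ roughly doubles the FLOPs. The snippet below is a hedged illustration using the constants reported for the EfficientNet-B0 baseline; treat it as a sketch of the scheme, not a drop-in recipe.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Compound scaling: grow depth, width, and resolution together.

    depth_mult      = alpha ** phi
    width_mult      = beta  ** phi
    resolution_mult = gamma ** phi
    with alpha * beta**2 * gamma**2 ~= 2, so each unit increase of phi
    roughly doubles FLOPs.
    """
    return alpha ** phi, beta ** phi, gamma ** phi

depth_mult, width_mult, res_mult = compound_scale(phi=1)  # roughly B0 -> B1 scaling
```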
Modules are the building blocks of deep learning. This post introduces some of the most widely used differentiable modules in modern networks, from basic parameter-free operations such as pooling and activations, to linear, attention, and the more complex recurrent modules. For each module introduced, either a set of mathematical formulations or a PyTorch/NumPy implementation of the module’s forward, backward, and, when applicable, parameter-gradient methods is provided. Regular updates will be made to include some of the more recent progress in the literature pertaining to the design, analysis, or integration of novel modules.
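As a flavor of the kind of entries the post contains, here is a minimal NumPy linear module with forward, backward, and parameter-gradient computations; it is a generic sketch rather than an excerpt from the post itself.

```python
import numpy as np

class Linear:
    """Minimal NumPy linear module with explicit forward/backward methods."""

    def __init__(self, in_features, out_features):
        self.W = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x                      # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # parameter gradients
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        # gradient w.r.t. the input, passed to the previous module
        return grad_out @ self.W.T
```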