NeurIPS 2019: Review of Computer Vision Papers

Author: Maria Dobko | Translator: ronghuaiyang


A review of selected papers from NeurIPS 2019

Conference website:

Complete collection of papers:

This is an overview (notes) of NeurIPS 2019, held in Vancouver from December 9th to 14th, 2019. There were more than 13,000 participants, with two days of workshops, one day of tutorials, and three days of main conference sessions. In this article I will briefly describe some papers that caught my attention. All of them are in computer vision, which is my research field.

Full-Gradient Representation for Neural Network Visualization

Suraj Srinivas, François Fleuret

Link to the paper:

The paper explores how saliency maps capture the importance of different parts of the input. The authors show that the output of any neural network can be decomposed into input-gradient terms and neuron-gradient terms, and prove that aggregating these gradient maps in a convolutional network improves the saliency map. The paper proposes FullGrad saliency, which combines the input gradient with feature-level bias gradients and therefore satisfies two important notions: local attribution (the model's sensitivity to the input) and global attribution (completeness of the saliency map).
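To make the decomposition concrete, here is a minimal NumPy sketch (not the authors' code) checking the full-gradient identity on a tiny two-layer ReLU network: because ReLU networks are piecewise linear, the output equals the input-gradient term plus the bias-gradient terms exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), rng.normal()

x = rng.normal(size=3)
pre = W1 @ x + b1
m = (pre > 0).astype(float)            # ReLU gating mask
f = w2 @ np.maximum(pre, 0.0) + b2     # network output

# full-gradient decomposition: f(x) = x . grad_x f + sum over biases b . grad_b f
grad_x = W1.T @ (w2 * m)               # input gradient
input_term = x @ grad_x
bias_term = b1 @ (w2 * m) + b2 * 1.0   # bias gradients are w2*m and 1
assert np.isclose(f, input_term + bias_term)
```

FullGrad aggregates such bias-gradient maps spatially across the layers of a convolutional network; this sketch only verifies the underlying identity.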

Emergence of Object Segmentation in Perturbed Generative Models

Adam Bielski, Paolo Favaro

Link to the paper:

The paper proposes a framework for learning object segmentation from a collection of images without manual annotation. The main idea rests on the observation that an object's position can be perturbed locally, relative to a given background, without affecting the realism of the scene. A generative model is trained to produce a layered image representation: background, mask, and foreground. The authors use small random shifts of the foreground to expose invalid segmentations: the model is trained so that composites with a shifted foreground still form valid scenes. They train a StyleGAN-based model with two generators, one for the background and one for the foreground and its mask. Two additional loss terms on the generated mask encourage binarization and minimal masks; both are added to the WGAN-GP generator loss. They also train an encoder against the fixed generators to segment real images. The method was tested on four object categories of the LSUN dataset: car, horse, chair, and bird.
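The layered composition with a shifted foreground can be sketched as follows (illustrative NumPy only; `compose` and its arguments are my naming, not the paper's):

```python
import numpy as np

def compose(background, foreground, mask, shift=0):
    """Layered generation: shift the foreground and its mask horizontally,
    then alpha-composite onto the background. If the mask truly covers an
    object, small random shifts still yield a realistic scene."""
    fg = np.roll(foreground, shift, axis=1)
    m = np.roll(mask, shift, axis=1)
    return m * fg + (1.0 - m) * background

bg = np.zeros((4, 4))
fg = np.ones((4, 4))
empty_mask = np.zeros((4, 4))
full_mask = np.ones((4, 4))
```

With an empty mask the composite is just the background; with a full mask it is the (shifted) foreground. The adversarial training signal comes from a discriminator judging such composites.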

GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen

To meet the need for efficient, task-independent model parallelism, the authors introduce GPipe, a scalable model-parallelism library for training giant neural networks that can be expressed as a sequence of layers. The algorithm parallelizes the model with synchronous gradient updates, achieving high hardware utilization and training stability. The main contributions are model scalability (almost linear speedup in throughput with model size and accelerator count, supporting very deep transformers with more than 1,000 layers and 90B parameters), flexibility (it extends to any sequential network), and a simple programming interface. GPipe provides a way to improve quality: larger models can leverage transfer learning or multi-task learning to improve results even on smaller datasets. Experiments suggest that deeper networks transfer better and wider models memorize better.
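GPipe's synchronous gradient update means micro-batching changes the execution schedule but not the mathematics: averaging the gradients of equal-size micro-batches reproduces the full-batch gradient. A small NumPy check on a linear model (illustrative names, not GPipe's API):

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = rng.normal(size=3)

def grad(Xb, yb, w):
    # mean-squared-error gradient for a linear model
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)
# GPipe-style micro-batching: each of 4 micro-batches computes its own
# gradient; one synchronous update applies their average
micro = np.mean([grad(X[i:i + 2], y[i:i + 2], w) for i in range(0, 8, 2)], axis=0)
assert np.allclose(full, micro)
```

The library's actual contribution is scheduling these micro-batches through pipeline stages with re-materialization; the equivalence above is why the result matches non-pipelined training.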

Learning Conditional Deformable Templates with Convolutional Networks

Adrian Dalca, Marianne Rakic, John Guttag, Mert Sabuncu

Link to the paper:


The authors developed a learning framework for building deformable templates, which play a fundamental role in many image analysis and computational anatomy tasks. In traditional approaches, the template is constructed by an iterative process of template estimation and alignment, which is usually very computationally expensive. The method introduced here comprises a probabilistic model and efficient learning strategies that produce universal or conditional templates, together with a neural network that efficiently aligns images to those templates. The framework jointly learns the template (atlas) and the registration network, and enables conditional template generation based on desired attributes, which is particularly useful for clinical applications.

Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

Link to the paper:


This method predicts convolution kernels from the semantic label map, generates intermediate feature maps from a noise map, and finally produces the image. The authors argue that the generator's convolution kernels should be aware of the different semantic labels at different positions, while the discriminator should enforce fine detail and semantic alignment between the generated image and the input semantic layout. The image generator therefore predicts conditional convolutions; to do this efficiently it predicts depthwise separable convolutions, where only the depthwise weights come from a global context-aware weight-prediction network. The introduced feature-pyramid semantics-embedding discriminator attends to details such as texture and edges and enforces semantic alignment with the layout map.
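A rough sketch of the idea, with hypothetical helper names (not the paper's code): a weight predictor maps the semantic label map to one kernel per channel, which is then applied as a depthwise convolution.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """x: (C, H, W); kernels: (C, 3, 3). Each channel is filtered with its
    own predicted kernel (zero padding), as in depthwise convolution."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.empty_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * kernels[c])
    return out

def predict_kernels(label_map, C):
    """Hypothetical global weight predictor: derive one 3x3 kernel per
    channel from label statistics (a stand-in for the paper's network)."""
    hist = np.bincount(label_map.ravel(), minlength=C).astype(float)
    hist /= hist.sum()
    base = np.zeros((3, 3))
    base[1, 1] = 1.0                     # identity kernel as a starting point
    return np.stack([base * (1.0 + p) for p in hist])

x = np.arange(32, dtype=float).reshape(2, 4, 4)
ident = np.zeros((2, 3, 3)); ident[:, 1, 1] = 1.0
```

Predicting only the depthwise weights keeps the number of generated parameters small, which is what makes per-position, label-conditioned kernels tractable.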

Saccader: Improving Accuracy of Hard Attention Models for Vision

Gamaleldin F. Elsayed, Simon Kornblith, Quoc V. Le

Link to the paper:


This work proposes improvements to hard attention models, which select salient regions of an image and use only those regions for prediction. The model introduced here, Saccader, has a pretraining step that requires only class labels, and uses policy-gradient optimization to provide initial attention locations. Saccader consists of: 1. a representation network (BagNet); 2. an attention network; 3. the Saccader cell (no RNN), which predicts one visual attention location at each time step. The best Saccader models narrow the gap to the plain ImageNet baseline, reaching 75% top-1 and 91% top-5 accuracy while attending to less than one-third of the image.
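The hard-attention prediction step can be sketched as follows (illustrative, with hypothetical names; the real model scores locations with its attention network over BagNet features):

```python
import numpy as np

def hard_attention_predict(patch_logits, attention_scores, k=3):
    """Hard attention in the Saccader spirit: keep only the k
    highest-scoring locations and average their class logits.
    patch_logits: (num_locations, num_classes)."""
    keep = np.argsort(attention_scores)[-k:]
    return patch_logits[keep].mean(axis=0)

logits = np.array([[0.0, 10.0], [10.0, 0.0], [10.0, 0.0], [10.0, 0.0]])
scores = np.array([0.9, 0.1, 0.2, 0.05])
pred = hard_attention_predict(logits, scores, k=1)
```

Here the attended prediction (class 1) differs from the average over all patches (class 0), illustrating why the attention policy, trained with policy gradients, matters.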

Unsupervised Object Segmentation by Redrawing

Mickaël Chen, Thierry Artières, Ludovic Denoyer

Link to the paper:

ReDO (ReDraw Objects) is an unsupervised, data-driven object segmentation method. The authors assume that natural images are generated compositionally, with each object generated independently. They cast object segmentation as finding regions that can be redrawn without seeing the rest of the image. The method is based on an adversarial architecture in which the generator is guided by an input sample: given an image, it extracts an object mask and then redraws a new object at the same location. The generator is controlled by a discriminator that keeps the distribution of generated images aligned with that of the original images. An additional learned function tries to reconstruct the noise vector from the generated image, tying outputs to inputs; redrawing only one region at a time while keeping the rest of the image unchanged further constrains the model.

The complete model for learning object segmentation; learned neural networks are shown as bold colored lines.

Approximate Feature Collisions in Neural Nets

Ke Li, Tianhao Zhang, Jitendra Malik

Link to the paper:

A feature collision occurs when two different samples share the same feature activations and therefore receive the same classification decision. This paper proposes a method for detecting such collisions. The authors show that neural networks can be surprisingly insensitive to large, adversarially chosen changes of the input. In their experiments they observe that this phenomenon can be caused by an intrinsic property of the ReLU activation function, which lets two very different samples share the same feature activations and thus the same classification decision. Possible applications include representative data collection, regularizer design, and identifying vulnerable training samples.
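A toy construction shows why ReLU permits collisions: any change confined to coordinates that are negative before the ReLU leaves the activations, and hence the decision, unchanged (an identity first layer is assumed here for clarity; it is my simplification, not the paper's setup).

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
w_out = np.array([1.0, 2.0])         # fixed downstream classifier weights

# two very different inputs whose negative coordinates are zeroed by ReLU
x1 = np.array([1.0, -1.0])
x2 = np.array([1.0, -50.0])

assert np.allclose(relu(x1), relu(x2))                    # feature collision
assert np.isclose(w_out @ relu(x1), w_out @ relu(x2))     # identical decision
```

The paper searches for such (approximately) colliding examples in real networks, where the insensitive directions are determined jointly by many ReLU layers rather than a single one.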

Grid Saliency for Context Explanations of Semantic Segmentation

Lukas Hoyer, Mauricio Munoz, Prateek Katiyar, Anna Khoreva, Volker Fischer

Link to the paper:

The main goal is to extend existing saliency methods to produce grid saliency, providing visual explanations for the predictions of (pixel-level) dense-prediction networks: spatially coherent visual explanations, and context explanations for semantic segmentation networks. The results show that grid saliency successfully produces easily interpretable context explanations and can be used to detect and localize context biases in data.

Fooling Neural Network Interpretations via Adversarial Model Manipulation

Juyeon Heo, Sunghwan Joo, Taesup Moon

Link to the paper:

Hypothesis: saliency-map-based interpreters can easily be fooled without a significant drop in accuracy. The paper demonstrates that state-of-the-art saliency-map-based interpreters, such as LRP, Grad-CAM, and SimpleGrad, are easily fooled by adversarial model manipulation. The article proposes two types of fooling, passive and active, along with a quantitative measure, the Fooling Success Rate (FSR), and discusses why adversarial model manipulation works as well as some of its limitations.

A Benchmark for Interpretability Methods in Deep Neural Networks

Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim

Link to the paper:

Incorrect estimates of what matters for a model's predictions can lead to decisions with adverse effects in sensitive areas (medicine, autonomous driving, etc.). The authors compare feature-importance estimators and explore whether ensembling them improves accuracy. To compare the methods, they remove from each image the small fraction of pixels estimated to contribute most to the model's prediction, then retrain the model without those pixels. The assumption is that the best interpretation method identifies the pixels whose removal degrades the retrained model's performance the most. This evaluation procedure is called ROAR: RemOve And Retrain. The tested methods include base estimators (gradient heatmaps, Integrated Gradients, Guided Backprop), ensembles of base estimators (SmoothGrad and VarGrad variants of Integrated Gradients, etc.), and controls (random assignment, a Sobel edge filter). The most effective methods are SmoothGrad-Squared and VarGrad.
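The degradation step of ROAR can be sketched as below (function name and the choice of the per-image mean as the uninformative replacement value are my simplifications):

```python
import numpy as np

def roar_degrade(images, importance, fraction):
    """ROAR's degradation step: replace the `fraction` most-important
    pixels of each image with an uninformative value (here the per-image
    mean) before retraining the model on the degraded data."""
    out = images.astype(float)
    k = int(round(fraction * images[0].size))
    for img, imp, o in zip(images, importance, out):
        top = np.argsort(imp.ravel())[-k:]      # top-k important pixels
        o.ravel()[top] = img.mean()
    return out

imgs = np.arange(16.0).reshape(1, 4, 4)
imp = np.arange(16.0).reshape(1, 4, 4)          # last pixels "most important"
deg = roar_degrade(imgs, imp, 0.25)             # degrade top 25% of pixels
```

The retraining step is what distinguishes ROAR from earlier deletion metrics: without it, accuracy drops may just reflect the distribution shift of masked images rather than lost information.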

HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

Sharon Zhou, Mitchell L. Gordon et al.

HYPE is a standardized, validated evaluation for generative models that measures their fidelity as perceived by the human eye. As the authors note, it is consistent, being grounded in psychophysics methods from perception research; reliable, producing separable model performance across different sets of outputs randomly sampled from a model; and time- and cost-efficient.

Region Mutual Information Loss for Semantic Segmentation

Shuai Zhao, Yang Wang, Zheng Yang, Deng Cai

Link to the paper:


Semantic segmentation is usually cast as per-pixel classification, but pixel-wise losses ignore the dependencies between pixels in an image. The authors represent each pixel jointly with its neighbors, converting an image into a multi-dimensional distribution; by maximizing the mutual information between the predicted and target distributions, predictions and targets become more consistent. The idea behind RMI is intuitive and easy to use: it only requires some additional memory during training and needs no change to the base segmentation model, yet it achieves substantial and consistent performance improvements. The method was tested on PASCAL VOC 2012.
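The core construction, representing each pixel by its neighbourhood, can be sketched as follows (illustrative NumPy; `region_representation` is my naming, not the paper's):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def region_representation(prob_map, r=3):
    """RMI's key step: represent each pixel by its r x r neighbourhood,
    turning an H x W map into (H-r+1)*(W-r+1) samples of an
    r*r-dimensional distribution. The loss then maximises the mutual
    information between this distribution for the prediction and for
    the ground-truth map."""
    windows = sliding_window_view(prob_map, (r, r))   # (H-r+1, W-r+1, r, r)
    return windows.reshape(-1, r * r)

m = np.arange(16.0).reshape(4, 4)
reps = region_representation(m, r=3)
```

In the paper the mutual information of these multivariate distributions is lower-bounded under a Gaussian assumption, which keeps the loss differentiable and cheap.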

Multi-source Domain Adaptation for Semantic Segmentation

Sicheng Zhao, Bo Li, Xiangyu Yue, Yang Gu, Pengfei Xu, Runbo Hu, Hua Chai, Kurt Keutzer

Link to the paper:

This work performs domain adaptation for semantic segmentation from multiple sources and proposes a new framework called MADAN. As the authors state, in addition to feature-level alignment, pixel-level alignment is considered by generating an adapted domain for each source with cycle-consistent translation, together with a novel dynamic semantic consistency loss. To improve consistency among the different adapted domains, two discriminators are proposed: a cross-domain cycle discriminator and a sub-domain aggregation discriminator. The model was tested on the synthetic datasets GTA and SYNTHIA, adapted to the real Cityscapes and BDDS datasets.


Original English:
