
Learning to Zoom: Saliency-Based Sampling for Neural Networks
Explore how a saliency-based sampling layer enhances downsampling in neural networks by adjusting sampling density based on image regions' saliency values. This differentiable module improves performance for tasks requiring small object detection or fine details.
Presentation Transcript
arXiv:1809.03355v1 [cs.CV] 10 Sep 2018
Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks
Adrià Recasens, Petr Kellnhofer, Simon Stent, Wojciech Matusik, and Antonio Torralba
Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Toyota Research Institute, Cambridge, MA 02139, USA
01 Introduction
02 Related Work
03 Saliency Sampler
04 Experiments
05 Discussion
06 Conclusion
01 Introduction Many modern neural network models used in computer vision have input size constraints. These constraints exist for various reasons. By restricting the input resolution, one can control the time and computation required during both training and testing, and benefit from efficient batch training on GPUs. On certain datasets, limiting the input feature dimensionality can also empirically increase performance by improving training sample coverage over the input space. When the target input size is smaller than the images in the original dataset, the standard approach is to uniformly downsample the input images. While uniform downsampling is simple and effective in many situations, it can be lossy for tasks which require information from different spatial resolutions and locations.
01 Introduction In this work, they introduce a saliency-based sampling layer: a simple plug-in module that can be appended to the start of any input-constrained network and used to improve downsampling in a task-specific manner. The layer consists of a saliency map estimator connected to a sampler which varies sampling density for image regions depending on their relative saliency values. Since the layer is designed to be fully differentiable, it can be inserted before any conventional network and trained end-to-end. They apply this approach to tasks where the discovery of small objects or fine-grained details is important, and consistently find that adding the layer results in performance improvements over baseline networks.
02 Related Work Spatial Transformer Network (STN) A layer that estimates a parametrized transformation from an input image in an effort to undo nuisance image variation (such as from object pose in the task of rigid object classification) and thereby improve model generalization. In their work, the authors proposed three types of transformation that could be learned: affine, projective, and thin plate spline (TPS). In this paper, they do not attempt to undo variation such as local translation or rotation. Rather, they try to vary the resolution dynamically to favor regions of the input image which are more task salient.
02 Related Work Spatial Transformer Network (STN) An STN consists of three modules: (1) a localisation net, which regresses the parameters of the transformation (a 2x3 matrix in the affine case); (2) a grid generator, which produces sampling coordinates (i, j) for each output pixel; (3) a sampler, which interpolates the input image at those coordinates.
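For reference, a minimal PyTorch sketch of an affine STN along these lines. F.affine_grid and F.grid_sample are standard PyTorch ops; the localisation-net architecture here is an illustrative assumption, not the one from the STN paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    def __init__(self):
        super().__init__()
        # (1) Localisation net: regresses the 6 parameters of a 2x3 affine matrix.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),
        )
        # Initialise to the identity transform so training starts from a no-op.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                          # (1) 2x3 transform
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # (2) grid generator
        return F.grid_sample(x, grid, align_corners=False)          # (3) sampler
```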
02 Related Work Deformable convolutional networks (DCNs) Convolutional layers that learn to dynamically adjust their receptive fields to adapt to the input features and improve invariance to nuisance factors. The proposal replaces any standard convolutional layer in a CNN with a deformable layer which learns to estimate offsets to the standard kernel sampling locations, conditioned on the input.
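A minimal sketch of this idea using torchvision's deform_conv2d op; the offset-predictor design and initialization are illustrative assumptions, not the DCN paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv(nn.Module):
    def __init__(self, cin, cout, k=3, padding=1):
        super().__init__()
        # A plain conv predicts per-location (x, y) offsets for every kernel tap,
        # conditioned on the input: 2 * k * k offset channels in total.
        self.offset = nn.Conv2d(cin, 2 * k * k, k, padding=padding)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)  # zero offsets: starts as a standard conv
        self.weight = nn.Parameter(torch.randn(cout, cin, k, k) * 0.01)
        self.padding = padding

    def forward(self, x):
        return deform_conv2d(x, self.offset(x), self.weight, padding=self.padding)
```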
02 Related Work Four main differences between DCNs and the saliency sampler:
1. DCNs sample from the same low-resolution input as the original CNN architecture; the saliency sampler is designed to sample from any available resolution.
2. The saliency sampler estimates the sample field through saliency maps, which have been shown to emerge naturally when training fully convolutional neural networks.
3. The saliency sampler can be applied to existing trained networks without modification.
4. The saliency sampler produces human-readable outputs in the form of the saliency map and the deformed image, which allow for easy visual inspection and debugging.
03 Saliency Sampler The sampling process can be divided into two stages. In the first stage, a CNN is used to produce a saliency map. This map is task specific, since different tasks may require focus on different image regions. In the second stage, the most important image regions are sampled according to the saliency map.
03 Saliency Sampler 3.1 Saliency Network The saliency network f_s produces a saliency map S from the low-resolution image: S = f_s(I_l). The choice of network for this stage is flexible and may be changed depending on the task. A softmax operation is applied in the final layer to normalize the output map.
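A minimal sketch of such a saliency network, assuming a small fully convolutional CNN; the layer sizes are illustrative, since the paper leaves this architecture flexible:

```python
import torch.nn as nn
import torch.nn.functional as F

class SaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 3, padding=1), nn.ReLU(),
            nn.Conv2d(24, 24, 3, padding=1), nn.ReLU(),
            nn.Conv2d(24, 1, 3, padding=1),  # one-channel saliency logits
        )

    def forward(self, x_low):                # x_low: low-resolution image I_l
        logits = self.features(x_low)        # (N, 1, h, w)
        n, _, h, w = logits.shape
        # Softmax over all spatial locations normalizes the map to sum to 1.
        return F.softmax(logits.view(n, -1), dim=1).view(n, 1, h, w)
```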
03 Saliency Sampler 3.2 Sampling Approach A sampler g takes as input the saliency map S along with the full-resolution image I and computes J = g(I, S): an image with the same dimensions as I_l that has been sampled from I such that highly weighted areas in S are represented by a larger image extent.
03 Saliency Sampler 3.2 Sampling Approach They compute a mapping between the sampled image and the original image and then use the grid sampler introduced in STN. This mapping can be written in the standard form as two functions u(x, y) and v(x, y) such that J(x, y) = I(u(x, y), v(x, y)). The main goal for the design of u and v is to map pixels proportionally to the normalized weight assigned to them by the saliency map. Assuming that u(x, y), v(x, y), x and y range from 0 to 1, an exact solution to this problem would be to find u and v such that the saliency mass accumulated up to (u(x, y), v(x, y)) matches the uniform mass xy:
∫_0^{u(x,y)} ∫_0^{v(x,y)} S(x', y') dx' dy' = xy
03 Saliency Sampler 3.2 Sampling Approach Finding u and v is equivalent to finding the change of variables that transforms the distribution set by S(x, y) to a uniform distribution. However, exact solutions to this problem are computationally very costly. Therefore, they take an alternative approach suitable for use in CNNs: u and v are computed as saliency-weighted averages of pixel coordinates under a distance kernel k((x, y), (x', y')) (a Gaussian in the paper):
u(x, y) = Σ_{x',y'} S(x', y') k((x, y), (x', y')) x' / Σ_{x',y'} S(x', y') k((x, y), (x', y'))
and analogously for v(x, y), with y' in place of x' in the numerator.
03 Saliency Sampler 3.2 Sampling Approach This formulation holds certain desirable properties for the functions u and v, notably: Sampled areas: areas of higher saliency are sampled more densely, since pixels with higher saliency mass attract other pixels to them. Convolutional form: this formulation allows u and v to be computed with simple convolutions, which is key for the efficiency of the full system. The layer can be easily added to a standard CNN while preserving the differentiability needed for training by backpropagation.
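A minimal sketch of this sampler, assuming a Gaussian distance kernel; the kernel size and sigma are illustrative choices, not values from the paper. It computes u and v by convolution and then reuses PyTorch's STN-style grid sampler:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize=15, sigma=5.0):
    ax = torch.arange(ksize, dtype=torch.float) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = g[:, None] * g[None, :]
    return (k / k.sum()).view(1, 1, ksize, ksize)

def saliency_sample(image, s, ksize=15, sigma=5.0):
    """image: (N, 3, H, W) full-resolution I; s: (N, 1, h, w) saliency map S."""
    n, _, h, w = s.shape
    k = gaussian_kernel(ksize, sigma).to(s.device)
    # Normalized pixel coordinates in [0, 1].
    xs = torch.linspace(0, 1, w, device=s.device).view(1, 1, 1, w).expand(n, 1, h, w)
    ys = torch.linspace(0, 1, h, device=s.device).view(1, 1, h, 1).expand(n, 1, h, w)
    p = ksize // 2
    pad = lambda t: F.pad(t, (p, p, p, p), mode="replicate")
    # Convolutional form of u and v: a saliency-weighted average of coordinates,
    # so high-saliency pixels "attract" the sampling grid toward themselves.
    denom = F.conv2d(pad(s), k)
    u = F.conv2d(pad(s * xs), k) / denom
    v = F.conv2d(pad(s * ys), k) / denom
    # grid_sample expects an (N, h, w, 2) grid in [-1, 1], ordered (x, y).
    grid = torch.stack([u.squeeze(1) * 2 - 1, v.squeeze(1) * 2 - 1], dim=-1)
    return F.grid_sample(image, grid, align_corners=False)
```

Note that the output has the spatial size of the saliency map, so J matches the low-resolution input expected by the task network, and every step is differentiable.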
03 Saliency Sampler 3.3 Training with the Saliency Sampler The complete pipeline consists of four steps: 1. Obtain a low-resolution version I_l of the image I. 2. This image is used by the saliency network f_s to compute a saliency map S = f_s(I_l), where task-relevant areas of the image are assigned higher weights. 3. The deterministic grid sampler g samples the high-resolution image I according to the saliency map, producing the resampled image J = g(I, S), which has the same resolution as I_l. 4. The original task network f_t is used to compute the final output y = f_t(J).
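A minimal sketch of the four-step pipeline, reusing the SaliencyNet and saliency_sample sketches above; the task network f_t is assumed to be any input-constrained model, e.g. a pretrained classifier:

```python
import torch.nn as nn
import torch.nn.functional as F

class SaliencySamplerModel(nn.Module):
    def __init__(self, task_net, low_res=(224, 224)):
        super().__init__()
        self.fs = SaliencyNet()   # saliency network (sketch above)
        self.ft = task_net        # original input-constrained task network
        self.low_res = low_res

    def forward(self, image):                                      # image: full-res I
        il = F.interpolate(image, size=self.low_res,
                           mode="bilinear", align_corners=False)   # 1. low-res I_l
        s = self.fs(il)                                            # 2. S = f_s(I_l)
        j = saliency_sample(image, s)                              # 3. J = g(I, S)
        return self.ft(j)                                          # 4. y = f_t(J)
```

Because g is differentiable, the task loss backpropagates through the sampler into f_s, so the whole model trains end-to-end with a standard optimizer.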
04 Experiments (The experimental results in this section were presented as figures in the original slides.)
05 Discussion The method proved to be easier to train than other approaches which modify spatial sampling, such as Spatial Transformer Networks or Deformable Convolutional Networks. Those methods performed closer to the baseline, as they failed to find suitable parameters for their sampling strategies. The non-uniform magnification introduced by the saliency map also allows the zoom to vary over the spatial domain. This, together with end-to-end optimization, results in a performance benefit over uniformly magnified area-of-interest crops.
05 Discussion Limitations: the original slide listed two limitations, illustrated with gaze-tracking figures.
06 Conclusion In this paper, they present the saliency sampler, a novel layer for CNNs that can adapt the image sampling strategy to improve task performance while preserving memory allocation and computational efficiency for a given image processing task. The method is simple to integrate into existing models and can be efficiently trained in an end-to-end fashion. Unlike some other image transformation techniques, the saliency sampler is not restricted to a predefined number or size of important regions, and it can redistribute sampling density across the entire image domain.