Recent advances in deep learning have pushed the performance of visual saliency models further than ever before. The literature presents new ways to design neural networks, to arrange gaze-pattern data, or to extract as many high- and low-level image features as possible in order to create the best saliency representation. However, one key part of a typical deep learning model is often neglected: the choice of the loss function.

In this work, we explore some of the most popular loss functions used in deep saliency models. We demonstrate that, on a fixed network architecture, modifying the loss function can significantly improve (or degrade) the results, hence emphasizing the importance of this choice when designing a model. We also introduce new loss functions that, to our knowledge, have never been used for saliency prediction. Finally, we show that a linear combination of several well-chosen loss functions leads to significant improvements in performance on different datasets as well as on a different network architecture, hence demonstrating the robustness of such a combination.

Figure 1 presents the overall architecture of the proposed model. The purpose of designing a new architecture is only to enable a comparison with existing architectures. Our architecture is based on an existing deep gaze network and on a multi-level deep network from the literature. A fixed pre-trained network is used for extracting deep features of an input image (400 × 300) from its conv3_pool, conv4_pool, and conv5_conv3 layers. The feature maps of layers conv4_pool and conv5_conv3 are rescaled so that the three maps share a similar spatial resolution, and they are concatenated into a single feature map with 1280 channels. This feature map is then fed into a shallow network composed of the following layers: a first convolutional layer reduces the number of channels by a factor of ten, and the result is processed by an ASPP (atrous spatial pyramid pooling) module with 4 levels. Each level has a 3 × 3 convolution kernel, and the dilation rates are 1, 3, 6, and 12. The benefit of the ASPP is to capture information in a coarse-to-fine manner while keeping the resolution of the input feature maps. The outputs of the four pyramid levels are then merged together, leading to 4 × 32 maps. A last 1 × 1 convolutional layer reduces the dimensionality to a single feature map, which is then smoothed by a 5 × 5 Gaussian filter. The activation function of these layers is ReLU.

Figure 1: Architecture of the proposed deep network.
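Read literally, the readout stage can be sketched as follows in PyTorch. This is a minimal sketch, not the authors' code: the kernel size of the channel-reduction layer, the Gaussian sigma, and the exact placement of the ReLU activations are assumptions, while the branch width of 32 channels follows the "4 × 32 maps" description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPReadout(nn.Module):
    """Sketch of the shallow readout: channel reduction, 4-level ASPP,
    1x1 projection to a single map, then 5x5 Gaussian smoothing."""
    def __init__(self, in_channels=1280, rates=(1, 3, 6, 12)):
        super().__init__()
        reduced = in_channels // 10  # "reduce by a factor ten" -> 128 channels
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)
        # One 3x3 dilated convolution per pyramid level, 32 maps each.
        self.branches = nn.ModuleList([
            nn.Conv2d(reduced, 32, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(32 * len(rates), 1, kernel_size=1)
        self.register_buffer("gauss", self._gaussian_kernel(5, sigma=1.0))

    @staticmethod
    def _gaussian_kernel(size, sigma):
        coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
        g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
        kernel = torch.outer(g, g)
        return (kernel / kernel.sum()).view(1, 1, size, size)

    def forward(self, x):
        x = F.relu(self.reduce(x))
        # Merge the four pyramid levels: 4 x 32 maps concatenated.
        x = torch.cat([F.relu(b(x)) for b in self.branches], dim=1)
        x = self.project(x)                        # down to 1 feature map
        return F.conv2d(x, self.gauss, padding=2)  # 5x5 Gaussian smoothing
```

Because each dilated convolution uses padding matching its dilation rate, all four branches preserve the spatial resolution of the input feature maps, which is the coarse-to-fine property mentioned above.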
The network was trained on the MIT dataset, composed of more than 1000 images. We split this dataset into 500 images for training, 200 images for validation, and the rest for testing. We use a batch size of 60 and stochastic gradient descent. To prevent over-fitting, a dropout layer with a rate of 0.25 is added on top of the network, and during training the network was evaluated against the validation set to monitor convergence. The number of trainable parameters is approximately 1.62 million. In the following section, we present the different loss functions tested during the training phase.

We propose two new loss functions for deep saliency prediction that have been applied with success to image style-transfer problems. The idea is to compare the representations of the ground-truth and predicted saliency maps extracted from different layers of a fixed pre-trained convolutional neural network. These losses take into account not only the saliency map itself, but also the deep hidden patterns that may exist within it, as well as the potential relationships between such patterns. Let ϕ_j(S) be the activation at the j-th layer of the VGG network when fed a saliency map S. ϕ_j(S) is then of size C_j × H_j × W_j, where C_j is the number of filters, and H_j and W_j are the height and width of the feature maps at layer j, respectively. We also denote by J the set of layers from which we extract the representations. In this work, we extract the outputs of the 5 pooling layers of a fixed VGG-16 network.
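This excerpt does not spell out the two loss formulas, but in image style transfer these two ideas usually correspond to a perceptual (feature) loss on the ϕ_j themselves and a style loss on Gram matrices of the ϕ_j. The sketch below follows that reading; the replication of the single-channel map to three channels and the omission of ImageNet normalization are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Indices of the 5 max-pooling layers in torchvision's VGG-16 feature stack.
POOL_LAYERS = {4, 9, 16, 23, 30}

class VGGFeatures(nn.Module):
    """Extracts phi_j(S) at the 5 pooling layers of a frozen VGG-16."""
    def __init__(self):
        super().__init__()
        self.features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, s):
        # Saliency maps are single-channel; replicating them to 3 channels
        # (and skipping ImageNet normalization) is an assumption of this sketch.
        x = s.repeat(1, 3, 1, 1)
        feats = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in POOL_LAYERS:
                feats.append(x)
        return feats

def gram_matrix(phi):
    # Correlations between filter responses: the "relationships between patterns".
    b, c, h, w = phi.shape
    f = phi.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_loss(pred_feats, gt_feats):
    # Compares the representations phi_j(S) themselves, layer by layer.
    return sum(F.mse_loss(p, g) for p, g in zip(pred_feats, gt_feats))

def style_loss(pred_feats, gt_feats):
    # Compares Gram matrices of the representations, as in style transfer.
    return sum(F.mse_loss(gram_matrix(p), gram_matrix(g))
               for p, g in zip(pred_feats, gt_feats))
```

A combined objective for a predicted map `pred` and ground truth `gt` would then be computed as `vgg = VGGFeatures()` followed by `perceptual_loss(vgg(pred), vgg(gt)) + style_loss(vgg(pred), vgg(gt))`, possibly weighted and mixed with a pixel-level saliency loss.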
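To make the training protocol concrete, here is a minimal end-to-end sketch. The split sizes, batch size, optimizer, and dropout rate come from the text; the stand-in dataset and model, the learning rate, the epoch count, and the placeholder criterion are assumptions, and the perceptual and style losses above would replace that criterion in practice.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

BATCH_SIZE, LR, EPOCHS = 60, 0.01, 10  # LR and EPOCHS are assumptions

# Stand-in for the MIT dataset: ~1000 image / fixation-map pairs.
# Resolution is reduced here to keep the sketch light; the paper uses 400 x 300.
dataset = TensorDataset(torch.rand(1003, 3, 96, 128), torch.rand(1003, 1, 96, 128))
train_set, val_set, test_set = random_split(dataset, [500, 200, len(dataset) - 700])
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_set, batch_size=BATCH_SIZE)

# Stand-in model; the paper's model would be the frozen feature extractor plus
# the ASPP readout, with the Dropout(p=0.25) layer added on top.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Dropout(0.25),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
criterion = nn.BCELoss()  # placeholder; the perceptual/style losses plug in here

for epoch in range(EPOCHS):
    model.train()
    for x, y in train_loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validate each epoch to monitor convergence and catch over-fitting.
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
    print(f"epoch {epoch}: validation loss {val_loss:.4f}")
```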