by Bab on August 25, 2020 under jekyll


Work in progress

DINO: Emerging Properties in Self-Supervised Vision Transformers

The name DINO comes from self-distillation with no labels. The authors use a teacher and a student model. Each image is augmented into two “global views” (crops covering e.g. more than 50% of the original image) and several (e.g. 6) “local views”, using multi-crop plus additional augmentations.

The teacher gets only the global views of an image (two of them), while the student gets the local views as well. Global means that the crop covers a large part of the original image.
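A minimal sketch of how the views could be paired in the loss, assuming the setup from the DINO paper: the teacher only processes the global views, the student processes all views, and each teacher view is compared with every other student view via cross-entropy. The function names, the temperatures, and the zero center are illustrative assumptions, not the authors' code.

```python
import numpy as np

def view_pairs(n_global, n_local):
    """All (teacher_view, student_view) index pairs; identical views are skipped."""
    n_views = n_global + n_local
    return [(t, s) for t in range(n_global) for s in range(n_views) if s != t]

def log_softmax(x, temp):
    z = x / temp
    z = z - z.max()                      # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum())

def dino_loss(teacher_out, student_out, center, t_temp=0.04, s_temp=0.1):
    """teacher_out: (n_global, dim); student_out: (n_views, dim), global views first.
    Temperatures are assumed values for illustration."""
    n_global = len(teacher_out)
    pairs = view_pairs(n_global, len(student_out) - n_global)
    total = 0.0
    for t, s in pairs:
        # Teacher output is centered and sharpened; it acts as the target distribution.
        p_teacher = np.exp(log_softmax(teacher_out[t] - center, t_temp))
        total -= (p_teacher * log_softmax(student_out[s], s_temp)).sum()
    return total / len(pairs)

rng = np.random.default_rng(0)
teacher_out = rng.normal(size=(2, 16))   # 2 global views
student_out = rng.normal(size=(8, 16))   # 2 global + 6 local views
loss = dino_loss(teacher_out, student_out, center=np.zeros(16))
```

With 2 global and 6 local views this gives 2 * 8 - 2 = 14 loss terms per image.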

Meta Pseudo Labels

Teacher and student network.
Normally the student is capped by the quality of the teacher -> confirmation bias.
Here the teacher learns from how well the student does, and the student learns on pseudo-labeled data annotated by the teacher.

Emerging Properties in Self-Supervised Vision Transformers

Looks good.

Example Code

End-to-End Object Detection with Transformers

End-to-end learnable object detection without post-processing.

Example Code

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Unsupervised pretraining that may give a strong baseline for fine-tuning.

Code

Self-Challenging Improves Cross-Domain Generalization


Destruction and Construction Learning for Fine-grained Image Recognition

Paper Link

First place on some fine-grained product retrieval tasks.

MagFace: A Universal Representation for Face Recognition and Quality Assessment

So far: about quality assessment and pushing low-quality examples away. Improves on the ArcFace loss.

Towards Open World Object Detection

Incremental learning. Energy based and memory replay based.

Shop The Look: Building a Large Scale Visual Shopping System at Pinterest

Hard-Aware Point-to-Set Deep Metric for Person Re-identification

In Defense of the Triplet Loss Again: Learning Robust Person Re-Identification with Fast Approximated Triplet Loss and Label Distillation

In Defense of the Triplet Loss for Person Re-Identification

Combination of Multiple Global Descriptors for Image Retrieval

Image similarity. Uses SPoC, MAC, and GeM to pool the features and combines the resulting descriptors.
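The combination of pooled descriptors can be sketched roughly as below. This is a simplified illustration, not the paper's exact pipeline: it assumes a single feature map of shape (channels, height, width), L2-normalizes each pooled descriptor, and concatenates them. The helper names and the GeM exponent p=3 are assumptions.

```python
import numpy as np

def spoc(fmap):
    # SPoC is sum pooling; the mean differs only by a constant that L2 normalization removes.
    return fmap.mean(axis=(1, 2))

def mac(fmap):
    # MAC: maximum activation per channel.
    return fmap.max(axis=(1, 2))

def gem(fmap, p=3.0, eps=1e-6):
    # GeM: generalized mean per channel (p is an assumed value).
    return (np.clip(fmap, eps, None) ** p).mean(axis=(1, 2)) ** (1.0 / p)

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def combined_descriptor(fmap):
    """Concatenate the L2-normalized SPoC, MAC and GeM descriptors and normalize again."""
    parts = [l2_normalize(pool(fmap)) for pool in (spoc, mac, gem)]
    return l2_normalize(np.concatenate(parts))

rng = np.random.default_rng(0)
fmap = rng.random((8, 4, 4))             # hypothetical CNN feature map: 8 channels, 4x4
descriptor = combined_descriptor(fmap)   # one vector of length 3 * 8
```

Two images can then be compared with the dot product of their descriptors, which equals the cosine similarity because the vectors have unit length.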

Code with additional ideas (Keras Tensorflow)

High-Performance Large-Scale Image Recognition Without Normalization

Getting rid of batch normalization. Better results than EfficientNet while being faster.

SoftPool: Refining activation downsampling with SoftPool

“Better” pooling (instead of Max, Mean, GeM)
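As a rough sketch of the idea: SoftPool weights each activation in a pooling region by its softmax weight and sums the weighted activations, so the result lies between mean and max pooling. The code below assumes a single 2D map and non-overlapping k x k regions; the function names are illustrative.

```python
import numpy as np

def softpool_region(a):
    """Softmax-weighted sum of the activations in one pooling region."""
    a = np.asarray(a, dtype=float)
    w = np.exp(a - a.max())              # subtract the max for numerical stability
    w /= w.sum()
    return float((w * a).sum())

def softpool2d(x, k=2):
    """Apply SoftPool over non-overlapping k x k regions of a 2D map."""
    h, w = x.shape
    out = np.empty((h // k, w // k))
    for i in range(h // k):
        for j in range(w // k):
            out[i, j] = softpool_region(x[i * k:(i + 1) * k, j * k:(j + 1) * k])
    return out

x = np.arange(16.0).reshape(4, 4)
pooled = softpool2d(x)                   # shape (2, 2)
```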

SAM: Sharpness-Aware Minimization for Efficiently Improving Generalization

Optimizes the network towards flat (non-sharp) minima. Improves generalization and robustness to label noise.
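A minimal sketch of one SAM update, assuming plain gradient descent as the base optimizer: first move to the (approximately) worst point within a small radius rho around the current weights, then apply the gradient computed there to the original weights. The toy quadratic loss, learning rate, and rho are made-up values for illustration.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One sharpness-aware update on weights w."""
    g = grad_fn(w)
    # Step of length rho in the direction of steepest ascent.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Gradient at the perturbed point, applied to the original weights.
    g_perturbed = grad_fn(w + eps)
    return w - lr * g_perturbed

# Toy loss f(w) = 0.5 * w^T A w with one sharp and one flat direction.
A = np.diag([10.0, 0.1])

def loss_fn(w):
    return 0.5 * w @ A @ w

def grad_fn(w):
    return A @ w

w = np.array([1.0, 1.0])
start_loss = loss_fn(w)
for _ in range(100):
    w = sam_step(w, grad_fn)
end_loss = loss_fn(w)
```

Each step needs two gradient evaluations, so it costs roughly twice as much as a plain gradient step.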

DELG: Unifying Deep Local and Global Features for Image Search

Image similarity and retrieval. Uses an image pyramid and GeM.

DELF: Large-Scale Image Retrieval with Attentive Deep Local Features

Predecessor of DELG. Image similarity based on learned local feature extraction.

GeM

Generalized-mean (GeM) pooling.
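GeM pooling raises the activations to the power p, averages them over the spatial positions, and takes the p-th root. A small sketch, assuming a feature map of shape (channels, height, width) with non-negative activations; the clipping constant eps is an assumption that keeps the power well defined.

```python
import numpy as np

def gem_pool(fmap, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the spatial dimensions of a (C, H, W) map."""
    x = np.clip(fmap, eps, None)
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)

fmap = np.array([[[1.0, 2.0], [3.0, 4.0]]])   # one channel, 2x2
mean_like = gem_pool(fmap, p=1.0)             # p = 1 is plain mean pooling
max_like = gem_pool(fmap, p=50.0)             # large p approaches max pooling
```

The exponent p therefore interpolates between mean pooling and max pooling, and it can also be learned during training.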

Mixed Pooling

Comparison of mean, max, and hybrid (both) pooling

Illustration of the semantic information captured by each feature map of the pool5 layer of a CNN -> mean pooling is better than just concatenating mean and max (hybrid)

Retrieving Similar E-Commerce Images Using Deep Learning

Siamese architecture (2 images?), angular loss, combination of low-level and top-level features, fractional distance matrix to calculate the distance between features.
The Manhattan distance metric provides the best discrimination in high-dimensional data spaces. Fractional distance: L_k norm with k smaller than 1.

Graph Neural Networks: A Review of Methods and Applications
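A small sketch of a fractional distance, assuming the usual L_k form (sum of |a_i - b_i|^k, raised to 1/k) with k below 1. The function names and k = 0.5 are illustrative choices.

```python
import numpy as np

def lk_distance(a, b, k=0.5):
    """L_k distance between two vectors; k < 1 gives a fractional distance."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float((diff ** k).sum() ** (1.0 / k))

def distance_matrix(feats, k=0.5):
    """Pairwise L_k distances between the rows of feats, shape (n, n)."""
    diff = np.abs(feats[:, None, :] - feats[None, :, :])
    return (diff ** k).sum(axis=-1) ** (1.0 / k)

feats = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
dists = distance_matrix(feats, k=0.5)
```

For k = 2 this is the Euclidean distance and for k = 1 the Manhattan distance. For k < 1 the triangle inequality no longer holds, so it is not a metric in the strict sense.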

Self-Correction for Human Parsing

YOLOv4: Optimal Speed and Accuracy of Object Detection

YOLO aims to reduce model size and inference time while keeping accuracy high.

Bag of freebies (only cost training time)

Augmentation, solving dataset bias (imbalance), better loss

Bag of specials (small inference cost, significant improvement in accuracy)

Enhancing the receptive field
Attention: channel-wise (Squeeze-and-Excitation (SE)) and point-wise (Spatial Attention Module (SAM)) -> SE is costly on GPU -> SAM improves accuracy and is efficient on GPU
Multi-scale prediction methods, activation functions
Post-processing methods like NMS (not needed in anchor-free (without x1, x2, y1, y2) methods)
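For reference, the NMS post-processing step mentioned above can be sketched as the classic greedy variant: keep the highest-scoring box and drop all remaining boxes that overlap it too much. This is not YOLOv4's specific implementation; boxes are assumed to be in (x1, y1, x2, y2) format and the threshold of 0.5 is an assumed value.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = np.argsort(scores)[::-1]     # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Keep only boxes that do not overlap the selected box too strongly.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of 81/119 and is suppressed, while the third box does not overlap and is kept.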

Color quantization using modified median cut

Coherent Semantic Attention for Image Inpainting

Parallel Multiscale Autoregressive Density Estimation

Eidetic 3D LSTM: A Model for Video Prediction and Beyond

For perceiving and memorizing both short-term and long-term representations in videos. This paper uses RNNs and 3D-CNNs as part of its architecture. Both are mechanisms for modeling spatio-temporal (image sequence) data. The authors found that they have to integrate the 3D convolutions inside the LSTM.

Heated-Up Softmax Embedding

GarmentGAN: Photo-realistic Adversarial Fashion Transfer

Main idea: cut out the region to be separated, then transform the new garment onto that region.

Immediately visible shortcoming: a dress gets shortened to fit a T-shirt region.

Ideas for improvement: keep the aspect ratio and place the garment over other parts too.

The problem with normal GANs is still blurry and unrealistic images.

Garment transfer can be broken down into two components: first separate the human body (pose, shape, color) from the clothes, then transfer the garment onto that body. This is done in segmentation map space, so the first goal is a realistic segmentation map of the person in an arbitrary pose wearing the desired garment. This phase is coarse. As the first step they train a shape transfer network to produce a segmentation map, given an already segmented but masked map, which is produced by an external masking network. They want the network to learn to segment only the arms, upper torso, and top clothes regions, and therefore mask these regions. To retain the hands, they use keypoints from a human body pose network. They create a line from the elbow to the wrist and append a box at the end of the line over the hand, so that the side of the box is perpendicular to the line. All pixels that belong to the hand region and lie inside the box are retained, so the segmentation network does not have to learn them, and complex poses and gestures of the hand are preserved. Everything the network is not supposed to predict (everything not masked) is overwritten by the input segmentation map. The identity of the person is preserved.
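The hand box construction can be sketched geometrically as follows. This is only an illustration of the description above: it assumes 2D keypoints, a square box, and a free box size parameter, none of which are specified in these notes. The function name is hypothetical.

```python
import numpy as np

def hand_box(elbow, wrist, size):
    """Corners of a square box attached at the wrist and extending away from the elbow.

    The side through the wrist is perpendicular to the elbow-wrist line.
    """
    elbow = np.asarray(elbow, dtype=float)
    wrist = np.asarray(wrist, dtype=float)
    d = wrist - elbow
    d = d / np.linalg.norm(d)            # unit vector along the forearm
    n = np.array([-d[1], d[0]])          # unit vector perpendicular to the forearm
    half = 0.5 * size * n
    near_a, near_b = wrist - half, wrist + half
    far_a, far_b = near_a + size * d, near_b + size * d
    return np.stack([near_a, near_b, far_b, far_a])

corners = hand_box(elbow=(0.0, 0.0), wrist=(10.0, 0.0), size=4.0)
```

Pixels that carry the hand label and fall inside this box would then be copied unchanged into the masked segmentation map.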

The second stage comprises the transformation of the garment.

Semantic Image Synthesis with Spatially-Adaptive Normalization

This is where the SPADE layer comes from. It's a layer for creating an image from a segmentation map.

Semantic image synthesis is about creating a photorealistic image from a segmentation map.

The Problem:

The Solution
