Papers and Content
Work in progress
DINO: Emerging Properties in Self-Supervised Vision Transformers
DINO comes from self-distillation with no labels. The authors use a teacher and a student model. A given image is augmented to two “global-views” (e.g. original image coverage >50%) and several (e.g. 6) “local-views” with multi-crop and additional augmentation?
The teacher gets a “global-views” of an image (actually 2 global views) and the student gets “local-views”. Global means
Meta Pseudo Labels
Teacher and student network.
Normally the student is capped by the teacher -> confirmation bias
Teacher learns on how well the student does, student learns on pseudolabelled data, annotated by teacher
Emerging Properties in Self-Supervised Vision Transformers
Looks good.
End-to-End Object Detection with Transformers
Learnable object detection without post processing.
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
Unsupervised pretraining, may gives a strong baseline for finetuning.
Self-Challenging Improves Cross-Domain Generalization
5db151f5b7dcb1ad69bf9aef97b127bc7772dd13
Destruction and Construction Learning for Fine-grained Image Recognition
First places on some fine grained product retrieval tasks
MagFace: A Universal Representation forFace Recognition and Quality Assessment
So far: About quality assessment and pushing low quality examples away. Improved Arcface loss
Towards Open World Object Detection
Incremental learning. Energy based and memory replay based.
Shop The Look: Building a Large Scale Visual Shopping Systemat Pinterest
Hard-Aware Point-to-Set Deep Metric forPerson Re-identification
In Defense of the Triplet Loss Again: Learning Robust Person Re-Identificationwith Fast Approximated Triplet Loss and Label Distillation
In Defense of the Triplet Loss for Person Re-Identification
Combination of Multiple Global Descriptors for Image Retrieval
Image similarity. Use SPoc Mac and GeM for pooling features and combines them.
Code with additional ideas (Keras Tensorflow)
High-Performance Large-Scale Image Recognition Without Normalization
Getting rid of Batchnormalization. Better results then efficientnet while beeing faster.
SoftPool: Refining activation downsampling with SoftPool
“Better” pooling (instead of Max, Mean, GeM)
SAM: Sharpness-Aware Minimization for Efficiently ImprovingGeneralization
Optimize the network for a flat (unsharp) minima. Improves generalization and robustness to label noise.
DELG: Unifying Deep Local and Global Features for Image Search
Image similarity and retrieval. Uses an image pyramid and GeM.
DELF: Large-Scale Image Retrieval with Attentive Deep Local Features
Predecessor of DELG Image similarity based on learned local feature extraction
GeM
Generalized-mean (GeM) pooling.
Mixed Pooling
Comparison of mean, max and hybrid(both) pooling
Illustration of semantic information captured by each feature map of pool5 layer using CNN -> Mean pooling is better then just concat mean and max (hybrid)
Retrieving Similar E-Commerce Images Using Deep Learning
-
Siamese architecture (2 images?), angular loss, combination of low and toplevel, fractional distance matrix to calc distance between features
-
Manhatten distance matrix provides the best discrimination in high dimensional data spaces. L_k norm with k smaller then 1
Graph Neural Networks:A Review of Methods and Applications
Self-Correction for Human Parsing
YOLOv4: Optimal Speed and Accuracy of Object Detection
Yolo wants to reduce model size and inference speed while keeping accuracy high.
Bag of freebies (only cost training time)
Augmentation, solving dataset bias (imbalance), better loss
Bag of specials (small inference cost, significantly improvement in accuracy)
Enhance receptive field, attention - channelwise (squeeze-and-excitiation (SE)) - pointwise (Spatial Attention Module (SAM)) -> SE ist costly on gpu -> SAM makes improvement and is efficient on gpu Multi-scale prediction methods, Activationfunctions, Post-processing methods like nms (not needed in anchor-free (without x1, x2, y1, y2) methods)
Color quantization using modified median cut
Coherent Semantic Attention for Image Inpainting
Parallel Multiscale Autoregressive Density Estimation
EIDETIC3D LSTM:A MODEL FORVIDEOPREDICTION ANDBEYOND
For perceiving and memorizin both short-term and long-term representations in videos. This paper uses RNNs and 3D-CNNs as part of their architecture. These are both mechanisms for modeling spatio (image) temporal data. They found out that they have to somehow integrate the 3D-Convs inside the LSTM
HEATED-UPSOFTMAXEMBEDDING
GarmentGAN: Photo-realistic Adversarial Fashion Transfer
Main Idea: Cutting out the to be seperated part, transfrom the new garment onto that part.
Directly seen shortcomings: A dress will be shortened onto a t-shirt part.
Ideas to be better: Keep the aspect ratio and place the garment over other parts too.
The Problem of normal GANs is still blurred images and unrealistic
Garment transfer can be broken down into two components. First seperate the human body(pose, shape, color) from his clothes then transfer the garment onto that body. This is done in a segmentation map space. So we first try to get a realistic segmentatin map from the person in arbitrary pose wearing the desired cloth. This phase is coarse. As the first step they train a shape transfer network to produce a segmantic map, given an already segmented but masked map, which is produced by a foreign masking network. They want the network to learn only to segment arms, upper torso and top clothes regions, therefore mask these regions. To retain the hands, they use keypoints from a human body pose network. The create a line from the elbow to the wrist and append a box at the end of the line over the hands, so that the side of the box is perpendicular to the line. All pixels, belonging to the hand region and inside the box are retained. The Segmentation network does not have to learn them and complext poses and gestures of the hand are retained. Everything that the network is not supposed to predict (everything not masked) is overwritten by the input segmentation map. The identity of the person is preserved.
The Second stage comprises the transformation of the garment
Semantic Image Synthesis with Spatially-Adaptive Normalization
- Where the SPADE Layer comes from. Its a Layer for creating an image from a segmentation map.
Semantic image synthesis is about creating a photorealistic image from a segmentation map.
The Problem:
The Solution