Enhancing the performance of image classification through features automatically learned from depth-maps
George Ciubotariu, Vlad-Ioan Tomescu, Gabriela Czibula
September 2021
Contents
- Original Contribution
- Introduction
- Computer Vision and Deep Learning
- Data Set
- Unsupervised Analysis
- Supervised Analysis
- Future Enhancements
Research Questions and Original Contributions
- RQ1: How relevant are depth maps in the context of indoor-outdoor image classification?
  - Unsupervised learning-based analysis on the DIODE dataset for indoor-outdoor classification
  - t-SNE clustering support for further supervised investigations
- RQ2: To what extent does aggregating visual features over more granular sub-images increase the performance of classifiers?
  - Supervised learning-based classification for supporting the unsupervised approach
  - Multilayer Perceptron (MLP) classifier tested to confirm the hypothesis
- RQ3: How correlated are the results of the unsupervised analysis and the performance of supervised models applied to indoor-outdoor image classification?
  - Comparative analysis of image feature aggregation
Introduction to the Approached Tasks
- Indoor-Outdoor Classification
  - motivation
- Semantic Segmentation
- Depth Estimation
Related Work
- Tong et al. [TSYW06] review indoor-outdoor scene classification, feature extraction methods, classifiers and data sets
  - multiple remarkable methods
  - mentions well-performing methods published between 1998 and 2017
  - features such as color, texture, edge etc.
  - multiple data sets are mentioned
- Cvetkovic et al. [CNI14]
  - color and texture descriptors and an SVM classifier
  - 93.71% and 92.36% accuracy on two public data sets
- Tahir et al. [TMR15]
  - computes the GIST descriptor as a feature vector
  - 90.8% accuracy on a public data set
- Raja et al. [RRDR13]
  - uses HSV instead of RGB color encoding
  - extracts color, texture and entropy features
  - features extracted from 100 sub-images
  - lightweight KNN classifier
Computer Vision (CV) and Deep Learning (DL)
Most recent work implements Convolutional Neural Networks (CNNs) in dense visual tasks such as Semantic Segmentation (SS) or Depth Estimation (DE).
- Dense Prediction Transformers (DPT) [LRSK19, RBK21]
  - model that leverages vision transformers instead of convolutions
  - robust architecture that serves as the backbone in our experiments
  - tested on both SS and DE tasks with strong results, which lets us build a comparative approach
Vision Transformers for Dense Prediction (DPT)
Model                   Image resolution   #extracted features after encoder   #extracted features after decoder
Depth Estimation        384×384            49152                               12582912
Semantic Segmentation
Table: Details of the DPT architectures
Figure: DPT architecture
DIODE (Dense Indoor and Outdoor DEpth)
- Data has been collected with a FARO Focus S350
- It consists of 27858 RGB-D images at a resolution of 1024×768
- Photos have been taken both at daytime and at night, over several seasons (summer, fall, winter)
Apart from RGB-D images, the DIODE dataset also provides normal maps, which could further enhance the learning of depth and vice versa
Figure: Sample images from the DIODE dataset
DIODE Structure
Figure: Histogram of depth value frequency (%) for the indoor train set
Figure: Histogram of depth value frequency (%) for the outdoor train set
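The frequency histograms in the two figures above express, for each depth bin, the percentage of depth values that fall into it. A minimal sketch of that computation is given below, assuming half-open bins and ignoring values outside the binned range; the function name `depth_frequency_percent`, the sample readings and the bin edges are all hypothetical and not taken from the original analysis.

```python
from bisect import bisect_right


def depth_frequency_percent(depths, bin_edges):
    """Return the share (%) of depth values that fall into each bin.

    bin_edges must be sorted; bin i covers [bin_edges[i], bin_edges[i + 1]).
    Values outside [bin_edges[0], bin_edges[-1]) are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    for d in depths:
        if bin_edges[0] <= d < bin_edges[-1]:
            counts[bisect_right(bin_edges, d) - 1] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [100.0 * c / total for c in counts]


# Hypothetical depth readings and hypothetical bin edges (illustration only).
sample_depths = [0.5, 1.2, 1.8, 2.5, 3.1, 3.9, 7.0, 12.0]
edges = [0, 2, 4, 8, 16]
percentages = depth_frequency_percent(sample_depths, edges)
```

Running the same function separately on the indoor and outdoor training depth maps would give two percentage vectors that can be plotted side by side.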
Methodology
- Feature extraction
  - manually engineered features
  - automatically learned features
- Unsupervised learning-based analysis
- Supervised learning-based analysis
  - depth-augmented images
Automatic Feature Extraction
1. aggregating RGB from sub-images
   - 3·k dimensional vector (k = 1, 4, 16)
   - average RGB values for each sub-image
2. aggregating RGBD from sub-images
   - 4·k dimensional vector (k = 1, 4, 16)
   - average RGBD values for each sub-image
Figure: Structure of image splits
3. features from DPT encoder/decoder
   - trained for SS
   - trained for DE
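The sub-image aggregation of items 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the image is split into a regular 2^n × 2^n grid, so that k = 4^n (matching k = 1, 4, 16 and the #Splits (n) = 0, 1, 2 column of the results table), and it represents an image as nested Python lists. The function name `aggregate_subimage_means` is hypothetical.

```python
def aggregate_subimage_means(image, n):
    """Average each channel over a 2^n x 2^n grid of sub-images.

    image: list of rows, each row a list of equal-length channel tuples
           (3 values for RGB, 4 values for RGBD).
    n:     split level; yields k = 4**n sub-images (assumed grid layout).
    Returns a flat feature list of length channels * k.
    """
    height, width = len(image), len(image[0])
    channels = len(image[0][0])
    side = 2 ** n
    features = []
    for gy in range(side):
        y0, y1 = gy * height // side, (gy + 1) * height // side
        for gx in range(side):
            x0, x1 = gx * width // side, (gx + 1) * width // side
            sums = [0.0] * channels
            for y in range(y0, y1):
                for x in range(x0, x1):
                    for c, value in enumerate(image[y][x]):
                        sums[c] += value
            area = (y1 - y0) * (x1 - x0)
            features.extend(s / area for s in sums)
    return features


# Toy 4x4 RGBD image: R grows with x, G grows with y, B is 0, depth is 1.
toy = [[(10 * x, 10 * y, 0, 1) for x in range(4)] for y in range(4)]
whole = aggregate_subimage_means(toy, 0)     # 4 * 1 = 4 values
quarters = aggregate_subimage_means(toy, 1)  # 4 * 4 = 16 values
```

Dropping the fourth channel from each pixel gives the 3·k RGB variant with the same code.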
Unsupervised Learning for Analysing the Data
- 3D t-SNE unsupervised clustering
  - used for non-linear dimensionality reduction
  - able to uncover more useful patterns in data
  - uses the Student t-distribution to better disperse the clusters
- data normalization with the inverse hyperbolic sine (asinh)
  - increased sensitivity to particularly small and large values
- parameters used
  - perplexity of 20
  - learning rate of 3.0
    - for a slower-converging but finer learning curve
  - 1000 iterations
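The configuration listed above could be reproduced roughly as in the sketch below. It assumes scikit-learn's `TSNE` and NumPy, which the slides do not name, and it assumes the asinh normalization is applied element-wise to the feature vectors before the embedding. The iteration count is left at scikit-learn's default of 1000. The helper `embed_tsne_3d` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_tsne_3d(features, seed=0):
    """asinh-normalize the features and embed them in 3D with t-SNE."""
    # Assumption: asinh is applied element-wise before the embedding.
    normalized = np.arcsinh(np.asarray(features, dtype=np.float64))
    tsne = TSNE(
        n_components=3,      # 3D embedding
        perplexity=20,       # value from the slides
        learning_rate=3.0,   # value from the slides
        random_state=seed,   # fixed seed, only for reproducibility
    )
    return tsne.fit_transform(normalized)


# Toy stand-in for two groups of feature vectors (not DIODE data).
rng = np.random.default_rng(0)
toy_features = np.vstack([
    rng.normal(0.0, 1.0, size=(40, 16)),
    rng.normal(50.0, 5.0, size=(40, 16)),
])
embedding = embed_tsne_3d(toy_features)
```

The result has one 3D point per input vector, which can then be coloured by the indoor/outdoor label to inspect how well the two classes separate.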
Measure   RGBD features (4 splits)   DPT DE learned features   DPT SS learned features   DPT SS depth-augmented features
Prec      0.769                      0.729                     0.945                     0.957
Table: Prec values for the t-SNE transformations depicted in Figures 6–9.
Features extracted aggregating RGB and RGBD values
- 4 splits
Figure: t-SNE for RGB with 4 splits
Figure: t-SNE for RGB-D with 4 splits
Features Extracted from DL models
- DPT trained for Semantic Segmentation
Figure: t-SNE of DPT encoder extracted features for SS
Figure: t-SNE of DPT encoder extracted features for DE
Supervised Learning Results
Features   #Splits (n)   Accuracy      AUC           Specificity   Sensitivity
RGB        0             0.692±0.077   0.525±0.056   0.980±0.028   0.070±0.121
RGB        1             0.688±0.064   0.517±0.022   0.989±0.014   0.046±0.049
RGB        2             0.669±0.049   0.545±0.048   0.912±0.068   0.163±0.136
RGBD       0             0.880±0.039   0.858±0.041   0.898±0.058   0.817±0.081
RGBD       1             0.876±0.043   0.862±0.044   0.894±0.046   0.829±0.063
RGBD       2             0.838±0.044   0.826±0.053   0.848±0.060   0.804±0.099
DPT-DE     0             0.823±0.131   0.831±0.076   0.812±0.185   0.850±0.069
DPT-SS     0             0.950±0.027   0.942±0.029   0.969±0.034   0.915±0.053
DPT-SS+D   0             0.961±0.015   0.956±0.021   0.970±0.019   0.941±0.041
Table: Results of supervised indoor-outdoor classification on the DIODE dataset. 95% confidence intervals were used in the analysis. Only the features extracted by the DPT encoder are used in the experiments.
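For readers who want to reproduce entries of this kind, the sketch below shows how accuracy, specificity and sensitivity follow from a binary confusion matrix, and how a mean ± 95% interval can be formed over repeated runs. Two points are assumptions, since the slides do not state them: which class counts as positive, and that the interval uses the normal approximation (1.96 standard errors). AUC is omitted because it needs classifier scores rather than hard labels. All names and numbers in the example are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from hard binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of positive samples that were recognised.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Specificity: share of negative samples that were recognised.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def mean_with_ci95(values):
    """Mean and half-width of a 95% interval (normal approximation, assumed)."""
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return mean(values), half_width


# Hypothetical labels and per-run accuracies, for illustration only.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
avg, half = mean_with_ci95([0.90, 0.94, 0.92, 0.96, 0.93])
```

Read this way, the RGB rows of the table (specificity near 1, sensitivity near 0) describe a classifier that assigns almost every image to the negative class, which the depth channel corrects.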
Comparative Results
Benefits of our method:
- lightweight
  - uses fewer features and parameters compared to other models
  - low memory and computational cost compared to other deep learning methods
- significant increase in performance when adding depth cues
- capable of being optimised using multi-threading
- displays the potential of depth cues for multiple visual tasks
Compared with the methods surveyed by Tong et al. [TSYW06], our approach, which uses features extracted with DPT-SS+D (96.1% accuracy), establishes a new state of the art in indoor-outdoor classification. The best performance presented in [TSYW06] is 93.8% accuracy.
Ongoing Experiments and Future Enhancements
- Identifying features that can be used in both SS and DE
- Identifying other problems that can be solved with adapted DL models