George Ciubotariu

(1)

Enhancing the performance of indoor-outdoor image classifications using features extracted from depth-maps

George Ciubotariu

Babe¸s-Bolyai University

WeADL 2021 Workshop

The workshop is organized under the umbrella of WeaMyL, project funded by the EEA and Norway Grants under the number RO-NO-2019-0133. Contract: No

26/2020.

Working together for agreen,competitiveandinclusiveEurope

(2)

Introduction

2

Original Contribution

3

Computer Vision and Deep Learning

4

Data Set

5

Unsupervised Analysis

6

Supervised Analysis

7

Future Enhancements

(3)

Introduction

Figure:A picture taken from space

(4)

Introduction

Figure: The same picture, but flipped upside down

(5)

Introduction

Figure:An illusion of depth

(6)

Research Questions and Original Contributions

RQ1: How relevant are depth maps in the context of indoor-outdoor image classification?

Unsupervised learning based analysis on DIODE dataset for indoor-outdoor classification

t-SNE clustering support for further supervised investigations

RQ2: To what extent does aggregating visual features into more granular sub-images increase the performance of classifiers?

Supervised learning based classification for supporting the unsupervised approach

Multilayer Perceptron (MLP) classifier tested to confirm hypothesis

RQ3: How correlated are the results of the unsupervised based analysis and the performance of supervised models applied for indoor-outdoor image classification?

Comparative analysis on image features aggregation

(7)

Computer Vision (CV) and Deep Learning (DL)

Most recent work implementConvolutional Neural Networks(CNNs) in dense visual tasks such asSemantic Segmentation(SS) orDepth Estimation(DE).

[ZWZ

⁺

20] Split-Attention Network (ResNeSt)

efficient network that outperformed other similar models in what regards both computational costs and performance

the model introduced a new split-attention block for dense task prediction.

[LRSK19, RBK21] Dense Prediction Transformers (DPT)

model that leverages visual transformers instead of convolutions.

its results outperform ResNeSt models that have previously been considered state-of-the-art.

(8)

Vision Transformers for Dense Prediction (DPT)

Model Image #extracted features #extracted features resolution after encoder after decoder Depth Estimation

384×384 49152 12582912

Semantic Segmentation

Table:DPT architectures details

Figure:DPT architecture

(9)

DIODE (Dense Indoor and Outdoor DEpth)

Data has been collected with a FARO Focus S350 It consists of 27858 1024

×

768 RGB-D images

Photos have been taken both at daytime and night, over several seasons (summer, fall, winter)

Apart from RGB-D images, DIODE dataset also provides us with normal maps that could further enhance the learning of depth and vice-versa

(10)

DIODE (Dense Indoor and Outdoor DEpth)

Figure:Sample images from DIODE dataset

(11)

DIODE Structure

Figure: Histogram of depth values frequency (%) for the whole train set

Figure:Histogram of depth values frequency (%) for the whole validation set

(12)

DIODE Structure

Figure: Histogram of depth values frequency (%) for indoor train set

Figure:Histogram of depth values frequency (%) for indoor validation set

(13)

DIODE Structure

Figure: Histogram of depth values frequency (%) for outdoor train set

Figure:Histogram of depth values frequency (%) for outdoor validation set

(14)

Unsupervised Learning Approach for Analysing the Data

3D t-SNE unsupervised clustering

used fornon-linear dimensionality reduction able to uncover more useful patterns in data

usesStudent t-distributionto better disperse the clusters

data normalization with the inverse hyperbolic sine (asinh)

increased sensitivity to particularly small and large values

parameters used

perplexityof 20 learning rateof 3.0

for a slower converging but finer learning curve 1000iterations

Relevance

Unsupervised learning-based analysis provide useful insight about data

organization and features’ importance.

(15)

Automatic Feature Extraction

1

aggregating RGB from sub-images

3·k dimensional vector (k = 1,4,16) average RGB values for each sub-image

2

aggregating RGBD from sub-images

4·k dimensional vector (k = 1,4,16) average RGBD values for each sub-image

Figure: Structure of image splits

3

features from DPT encoder/decoder

trained for SS

trained for DE

(16)

Deep Learning Tasks

Indoor-Outdoor Classification

Semantic Segmentation

Depth Estimation

(17)

Features Extracted from DL models

DPT trained for Semantic Segmentation

Figure: t-SNE of DPT encoder extracted features for SS

Figure:t-SNE of DPT decoder extracted features for SS

(18)

Features Extracted from DL models

DPT trained for Depth Estimation

Figure: t-SNE of DTP encoder extracted features for DE

Figure:t-SNE of DTP decoder extracted features for DE

(19)

Features extracted aggregating RGB and RGBD values

no splits

Figure:t-SNE for RGB without splits Figure:t-SNE for RGB-D without splits

(20)

Features extracted aggregating RGB and RGBD values

4 splits

Figure:t-SNE for RGB with 4 splits Figure: t-SNE for RGB-D with 4 splits

(21)

Features extracted aggregating RGB and RGBD values

16 splits

Figure:t-SNE for RGB with 16 splits Figure:t-SNE for RGB-D with 16 splits

(22)

Supervised Learning Results

Features #Splits Accuracy AUC Specificity Recall 1 0.692±0.077 0.525±0.056 0.980±0.028 0.070±0.121 MLP RGB 4 0.688±0.064 0.517±0.022 0.989±0.014 0.046±0.049 16 0.669±0.049 0.545±0.048 0.912±0.068 0.163±0.136 1 0.880±0.039 0.858±0.041 0.898±0.058 0.817±0.081 MLP RGBD 4 0.876±0.043 0.862±0.044 0.894±0.046 0.829±0.063 16 0.838±0.044 0.826±0.053 0.848±0.060 0.804±0.099 DPT encoder DE 1 0.823±0.131 0.831±0.076 0.812±0.185 0.850±0.069 DPT encoder SS 1 0.953±0.027 0.944±0.030 0.974±0.031 0.915±0.053

Table:Results of indoor-outdoor supervised classification on DIODE dataset

Best two performances (AUC)

1 DPT encoder SS.

2 RGBD with 4 splits.

(23)

Ongoing Experiments and Future Enhancements

Identifying features that can be used in both SS and DE

Identifying other problems that can be solved with adapted DL models

Architecture Transfer from SS towards DE

(24)

Thank you!

Questions?

(25)

Bibliography I

Katrin Lasinger, Ren´ e Ranftl, Konrad Schindler, and Vladlen Koltun.

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.

CoRR, abs/1907.01341, 2019.

Ren´ e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun.

Vision transformers for dense prediction.

CoRR, abs/2103.13413, 2021.

Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander J. Smola.

Resnest: Split-attention networks.

CoRR, abs/2004.08955:1–12, 2020.

George Ciubotariu