
Multi-modal deep feature learning for RGB-D object detection

Published: 01 December 2017

Highlights

• We present an approach for RGB-D object detection that exploits both modality-correlated and modality-specific relationships between RGB and depth images.
• A shared-weights strategy and a parameter-free correlation layer are introduced to extract modality-correlated representations.
• The proposed approach simultaneously generates RGB-D region proposals and performs region-wise RGB-D object recognition.

Abstract

We present a novel multi-modal deep feature learning architecture for RGB-D object detection. The current paradigm for object detection typically consists of two stages: objectness estimation and region-wise object recognition. Most existing RGB-D object detection approaches treat the two stages separately, extracting RGB and depth features individually and thus ignoring the correlated relationship between the two modalities. In contrast, our method is designed to take full advantage of both depth and color cues by exploiting modality-correlated as well as modality-specific features and by jointly performing RGB-D objectness estimation and region-wise object recognition. Specifically, a shared-weights strategy and a parameter-free correlation layer are exploited to carry out RGB-D-correlated objectness estimation and region-wise recognition in conjunction with the RGB-specific and depth-specific procedures. The parameters of these three networks are optimized simultaneously via end-to-end multi-task learning. Both the multi-modal RGB-D objectness estimation results and the RGB-D object recognition results are further boosted by a late-fusion ensemble. To validate the effectiveness of the proposed approach, we conduct extensive experiments on two challenging RGB-D benchmark datasets, NYU Depth v2 and SUN RGB-D. The experimental results show that, by introducing the modality-correlated feature representation, the proposed multi-modal RGB-D object detection approach is substantially superior to state-of-the-art competitors. Moreover, although our approach is built upon the more lightweight AlexNet architecture rather than the expensive VGG16 architecture preferred by the state-of-the-art methods, it still performs slightly better.
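To make the data flow described in the abstract concrete, the sketch below shows one way the three branches (RGB-specific, depth-specific, and shared-weights correlated) and the late-fusion step could be wired up. It is a minimal, illustrative PyTorch sketch, not the authors' implementation: the tiny stand-in backbone, the element-wise product used as the parameter-free correlation operation, the three-channel (e.g. HHA-style) depth encoding, and all class and variable names are assumptions, and the objectness-estimation stage and the end-to-end multi-task training losses are omitted.

import torch
import torch.nn as nn


class MultiModalRGBDNet(nn.Module):
    """Three-branch feature learner: RGB-specific, depth-specific, and a
    modality-correlated branch built from a shared-weights trunk plus a
    parameter-free correlation operation, followed by late score fusion."""

    def __init__(self, num_classes: int = 20, feat_dim: int = 64):
        super().__init__()

        def trunk() -> nn.Sequential:
            # Stand-in for an AlexNet-style backbone over a cropped region proposal.
            return nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=7, stride=2, padding=3),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )

        self.rgb_trunk = trunk()    # modality-specific RGB features
        self.depth_trunk = trunk()  # modality-specific depth features
        self.shared_trunk = trunk() # one trunk applied to BOTH modalities (shared weights)

        self.rgb_head = nn.Linear(feat_dim, num_classes)
        self.depth_head = nn.Linear(feat_dim, num_classes)
        self.corr_head = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_trunk(rgb)
        f_depth = self.depth_trunk(depth)
        # Parameter-free correlation: combine the shared-weight features of the two
        # modalities with an operation that adds no learnable parameters
        # (element-wise product here; the paper's exact operator may differ).
        f_corr = self.shared_trunk(rgb) * self.shared_trunk(depth)
        # Late-fusion ensemble: average the class scores of the three branches.
        scores = torch.stack(
            [self.rgb_head(f_rgb), self.depth_head(f_depth), self.corr_head(f_corr)]
        )
        return scores.mean(dim=0)


if __name__ == "__main__":
    net = MultiModalRGBDNet()
    rgb = torch.randn(2, 3, 227, 227)  # batch of cropped RGB proposals
    hha = torch.randn(2, 3, 227, 227)  # matching three-channel depth encodings
    print(net(rgb, hha).shape)         # torch.Size([2, 20])

In this sketch the correlated branch reuses a single trunk for both inputs, which is one simple way to realize weight sharing; in practice each branch would carry its own classification and objectness losses so that all three networks are trained jointly.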



    Published In

    Pattern Recognition, Volume 72, Issue C
    December 2017
    523 pages

    Publisher

    Elsevier Science Inc.

    United States

    Publication History

    Published: 01 December 2017

    Author Tags

    1. Convolutional neural networks
    2. Multi-modal learning
    3. RGB-D object detection
    4. RGB-D objectness estimation

    Qualifiers

    • Research-article


    Cited By

    • (2024) Industrial object detection with multi-modal SSD: closing the gap between synthetic and real images. Multimedia Tools and Applications 83(4), 12111-12138. DOI: 10.1007/s11042-023-15367-0. Online publication date: 1-Jan-2024.
    • (2023) Domain embedding transfer for unequal RGB-D image recognition. Pattern Recognition 143(C). DOI: 10.1016/j.patcog.2023.109771. Online publication date: 1-Nov-2023.
    • (2023) Geometry-guided multilevel RGBD fusion for surface normal estimation. Computer Communications 206(C), 73-84. DOI: 10.1016/j.comcom.2023.04.014. Online publication date: 1-Jun-2023.
    • (2022) Noise-tolerant RGB-D feature fusion network for outdoor fruit detection. Computers and Electronics in Agriculture 198(C). DOI: 10.1016/j.compag.2022.107034. Online publication date: 1-Jul-2022.
    • (2022) Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection. MultiMedia Modeling, 352-363. DOI: 10.1007/978-3-030-98358-1_28. Online publication date: 6-Jun-2022.
    • (2021) AFI-Net. Computational Intelligence and Neuroscience 2021. DOI: 10.1155/2021/8861446. Online publication date: 1-Jan-2021.
    • (2021) Weak segmentation supervised deep neural networks for pedestrian detection. Pattern Recognition 119(C). DOI: 10.1016/j.patcog.2021.108063. Online publication date: 1-Nov-2021.
    • (2021) Cross-Modal Pyramid Translation for RGB-D Scene Recognition. International Journal of Computer Vision 129(8), 2309-2327. DOI: 10.1007/s11263-021-01475-7. Online publication date: 1-Aug-2021.
    • (2021) CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse. International Journal of Computer Vision 129(7), 2076-2096. DOI: 10.1007/s11263-021-01452-0. Online publication date: 1-Jul-2021.
    • (2021) Military object detection in defense using multi-level capsule networks. Soft Computing - A Fusion of Foundations, Methodologies and Applications 27(2), 1045-1059. DOI: 10.1007/s00500-021-05912-0. Online publication date: 3-Jun-2021.
