Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision (2024)


Jinnyeong Kim Seung-Hwan Baek
POSTECH

Abstract

Integrating RGB and NIR stereo imaging provides complementary spectral information, potentially enhancing robotic 3D vision in challenging lighting conditions. However, existing datasets and imaging systems lack pixel-level alignment between RGB and NIR images, posing challenges for downstream vision tasks. In this paper, we introduce a robotic vision system equipped with pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor mounted on a mobile robot. The system simultaneously captures pixel-aligned pairs of RGB stereo images, NIR stereo images, and temporally synchronized LiDAR points. Utilizing the mobility of the robot, we present a dataset containing continuous video frames under diverse lighting conditions. We then introduce two methods that utilize the pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The first approach enables existing RGB-pretrained vision models to directly utilize RGB-NIR information without fine-tuning. The second approach fine-tunes existing vision models to more effectively utilize RGB-NIR information. Experimental results demonstrate the effectiveness of using pixel-aligned RGB-NIR images across diverse lighting conditions.

1 Introduction

RGB imaging captures visible light in wavelengths from 400 nm to 700 nm and serves as the primary data format for visual computing. Near-infrared (NIR) imaging with active illumination captures light in the range of 750 nm to 1000 nm, offering effective information in conditions where RGB imaging is inadequate, such as low-light environments at night or poorly lit indoor scenes. NIR light is imperceptible to human vision, allowing active NIR illumination to be used without disturbing people, which facilitates robust 3D imaging [32, 55, 5, 38] and eye tracking [3]. Additionally, NIR light offers supplementary scene information due to the different spectral reflectance between RGB and NIR wavelengths [26, 33, 17].

Combining these benefits, RGB-NIR imaging emerges as a promising modality for challenging environments requiring robust and efficient 2D and 3D vision capabilities. RGB-NIR imaging has been studied for applications including object detection [50, 62] and 3D reconstruction [5, 43, 20, 21]. However, existing RGB-NIR imaging systems and datasets face significant challenges compared to standard RGB vision systems. A major issue is the use of separate RGB and NIR cameras at different poses, leading to pixel misalignment between the NIR and RGB images. Previous methods address such misalignments using image registration, pose estimation, and other processing techniques, which can accumulate errors and limit effectiveness [64, 5, 48].

In this work, we develop a robotic vision system capable of capturing pixel-aligned RGB-NIR stereo images and synchronized LiDAR point clouds. Mounted on a mobile platform, the system enables exploration in both indoor and outdoor environments, facilitating data collection under diverse lighting conditions. Using this system, we collect and share a dataset of pixel-aligned RGB-NIR stereo images with synchronized LiDAR point clouds as ground-truth depth, comprising 80,000 frames from 43 real-world scenes. Additionally, we provide a synthetic RGB-NIR dataset with ground-truth dense depth labels for 2,238 synthetic scenes. To demonstrate the effectiveness of pixel-aligned RGB-NIR imaging and datasets, we propose an RGB-NIR image fusion method and an RGB-NIR feature fusion method. The first method fuses the pixel-aligned RGB-NIR images into a three-channel image that can be input to pretrained RGB vision models without finetuning. The second method develops a stereo depth estimation technique by finetuning a pretrained stereo network [36] to utilize pixel-aligned RGB-NIR features. Our experiments show improvements in various downstream tasks over using only RGB or NIR data, as well as over methods relying on pixel-misaligned RGB-NIR datasets, particularly in challenging scenarios.

In summary, our contributions are as follows.

  • We develop a robotic vision system with pixel-aligned RGB-NIR stereo cameras and LiDAR, capturing synchronized RGB-NIR stereo images and LiDAR point clouds under diverse lighting conditions.

  • We present a large-scale dataset of pixel-aligned RGB-NIR stereo images and LiDAR point clouds, collected in various indoor and outdoor environments.

  • We propose an RGB-NIR image fusion method and an RGB-NIR stereo depth estimation network using RGB-NIR feature fusion.

  • We evaluate the proposed methods across diverse downstream tasks and illumination conditions on synthetic and real-world datasets, outperforming baseline methods.

2 Related Work

RGB-NIR Imaging

Conventional RGB-NIR imaging systems employ separate RGB and NIR cameras positioned at different viewpoints, such as the Kinect sensor, the Intel RealSense D415, and various multi-view RGB-NIR systems [7, 64, 48, 10, 11, 57, 58, 28, 8]. However, the RGB and NIR images captured by these systems exhibit pixel misalignment, necessitating depth-dependent registration between the RGB and NIR images. This introduces a significant challenge for effectively utilizing RGB and NIR information, creating a chicken-and-egg problem between depth estimation using spectrally fused images and spectral pose alignment using estimated depth.

To address the pixel misalignment, sequential capture with interchangeable RGB and NIR filters has been used [4, 24]; however, this approach is limited to static scenes due to the time delay between captures. Designing custom color filter arrays that record four channels (RGB-NIR) on a single image sensor has also been explored [40]. However, using a single sensor with a fixed global exposure for all four channels makes it unsuitable for challenging environments where the dynamic ranges of the RGB and NIR channels differ significantly, such as dark indoor or outdoor conditions with active NIR illumination. In the domain of RGB-thermal imaging, beam splitters have been used to combine an RGB camera and a thermal camera, enabling pixel-aligned RGB-thermal imaging [27, 22, 61].

We capture pixel-aligned RGB-NIR stereo images alongside LiDAR point clouds using two prism-based RGB-NIR dual-sensor cameras. The system is mounted on a mobile robot, allowing scalable data acquisition of pixel-aligned RGB-NIR stereo images in both indoor and outdoor conditions.

RGB-NIR Fusion

To utilize information from both the RGB and NIR spectra, RGB-NIR image fusion has been extensively studied, with applications in image enhancement for long-distance visibility and fog penetration [1, 19, 9, 63], as well as in information visualization leveraging different RGB-NIR reflectance properties [34, 17, 47, 18]. Conventional methods apply simple arithmetic operations such as addition and subtraction in the RGB, HSL, HSV, and YUV color spaces to create images that contain both RGB and NIR information [53, 37, 19, 18, 14, 42]. Edge-aware filtering techniques have been employed to refine noisy, low-contrast RGB images using guidance from clean, high-contrast NIR images, showing promising performance in various imaging scenarios [31, 23, 34, 9]. More recently, learning-based approaches have attempted to enhance RGB images and their 3D vision applications using NIR guidance by fusing RGB-NIR features [49, 25, 24, 29, 56] or RGB-thermal features [22, 60, 13].

Leveraging our pixel-aligned RGB-NIR images, we develop an RGB-NIR image fusion method that effectively encodes information from both RGB and NIR images into a single three-channel image with learned spatially-varying weights. The resulting fused image can be directly used as input to neural networks pretrained on RGB images.

RGB-NIR Depth Estimation

Cross-spectral correspondence matching between RGB and NIR images using an RGB-NIR color transform network has demonstrated promising depth estimation results when the RGB and NIR reflectances are similar [64]. Liang et al. showed that a GAN-based architecture can enhance the RGB-NIR color transform for cross-spectral stereo matching [35]. Active stereo systems using structured NIR illumination, such as the Kinect and the Intel RealSense D415, employ multiple RGB and NIR cameras at different viewpoints. These systems often suffer from the aforementioned chicken-and-egg problem between depth estimation using spectrally fused images and spectral fusion using the estimated depth. Brucker et al. [5] fuse the red, clear, and blue channels of stereo RCCB cameras with gating-based time-of-flight cameras positioned differently, but still face a similar chicken-and-egg problem. In the domain of RGB-thermal imaging, Guo et al. [16] demonstrate effective depth estimation using pixel-aligned RGB-thermal stereo cameras with beam splitters.

We develop a depth estimation method that leverages our pixel-aligned RGB-NIR images for robust depth estimation, thereby bypassing the chicken-and-egg problem between depth estimation and spectral fusion.


3 Pixel-aligned RGB-NIR Stereo Imaging

Robotic Imaging System

Figure 1(a) illustrates our robotic imaging system, which integrates a stereo pair of pixel-aligned RGB-NIR cameras, NIR illumination, and a LiDAR mounted on a mobile robot (AgileX Ranger Mini 2.0). Each RGB-NIR camera (JAI FS-1600D-10GE), as shown in Figure 1(c), employs a dichroic prism to separate RGB and NIR light, independently captured by RGB and NIR CMOS sensors. To enhance NIR visibility under varying lighting conditions, we add active NIR illumination (Advanced Illumination AL295-150850IC), which does not disturb human vision. The spectral sensitivities of the camera and the illumination are shown in Figure 1(b). We install a LiDAR (Ouster OS1) to acquire ground-truth sparse depth maps, along with angular velocity and linear acceleration data from its inertial measurement unit. We use RJ-45 interfaces for data transmission from the cameras and LiDAR, and an AC power bank to supply the devices that require an AC adapter. These devices are connected to a laptop, which triggers the capture pipeline and manages real-time data storage. On the laptop, we manage separate threads for the left camera, right camera, and LiDAR to acquire synchronized data.
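The paper does not include acquisition code, so the following is a minimal sketch of the threaded, timestamped capture loop described above. The per-device grab functions are hypothetical stand-ins for the actual JAI and Ouster SDK calls, which are not specified here.

```python
# Hedged sketch of the synchronized multi-threaded capture loop. The device
# interfaces (the grab callables) are hypothetical placeholders, not real SDK APIs.
import threading
import queue
import time

def capture_worker(name, grab_fn, out_queue, stop_event):
    """Run one acquisition thread; tag every sample with a host timestamp."""
    while not stop_event.is_set():
        sample = grab_fn()                          # blocking read from the device
        out_queue.put((name, time.monotonic(), sample))

def run_capture(grabbers, duration_s=1.0):
    """grabbers: dict mapping stream name -> callable returning one sample."""
    out_queue, stop_event, threads = queue.Queue(), threading.Event(), []
    for name, fn in grabbers.items():
        t = threading.Thread(target=capture_worker,
                             args=(name, fn, out_queue, stop_event))
        t.start()
        threads.append(t)
    time.sleep(duration_s)                          # record for a fixed duration
    stop_event.set()
    for t in threads:
        t.join()
    return list(out_queue.queue)                    # (stream, timestamp, sample) tuples

# Toy usage with dummy grabbers standing in for the real devices.
samples = run_capture({
    "left_cam":  lambda: (time.sleep(0.1), "rgb+nir frame")[1],
    "right_cam": lambda: (time.sleep(0.1), "rgb+nir frame")[1],
    "lidar":     lambda: (time.sleep(0.1), "point cloud")[1],
})
```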

Image Formation

We model the image formation of the pixel-aligned RGB-NIR camera, which captures RGB and NIR images on separate CMOS sensors. This configuration results in the same exposure time and gain across the RGB channels, with a separate exposure time and gain for the NIR channel: $t_R = t_G = t_B \neq t_{\text{NIR}}$ and $g_R = g_G = g_B \neq g_{\text{NIR}}$. The image captured by camera $c \in \{\text{left}, \text{right}\}$ for channel $i \in \{\text{R}, \text{G}, \text{B}, \text{NIR}\}$ is given by

$$I_i^c(p^c) = \eta_1 + g_i\left(\eta_2 + t_i\left(R_i^c(p^c)\left(E_i^c(p^c) + L_i^c(p^c)\right)\right)\right), \tag{1}$$

where $p^c$ is a pixel of camera $c$, $R_i^c(p^c)$ is the reflectance at pixel $p^c$ for channel $i$, and $\eta_1$, $\eta_2$ are Gaussian noise in the post-gamma and pre-gamma stages, respectively. The intensity measured in each channel is influenced by the environmental illumination, denoted $E_i^c(p^c)$ for channel $i$. The active illumination $L_i^c(p^c)$ is present only in the NIR channel, satisfying $L_i^c(p^c) = 0$ for $i \in \{\text{R}, \text{G}, \text{B}\}$, since the active illumination is confined to the NIR spectrum (see Figure 1(c)).
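As a concrete illustration, below is a minimal NumPy sketch of Equation (1). Image sizes, reflectance values, exposure times, gains, and noise levels are illustrative assumptions, not the paper's calibrated values.

```python
# Minimal sketch of the image formation model in Eq. (1); all values are illustrative.
import numpy as np

def render_channel(R, E, L, t, g, sigma1=1e-3, sigma2=1e-3, rng=np.random):
    """I_i^c = eta1 + g_i * (eta2 + t_i * (R * (E + L))) for one channel."""
    eta2 = rng.normal(0.0, sigma2, R.shape)     # noise before the gain (eta_2)
    eta1 = rng.normal(0.0, sigma1, R.shape)     # noise after the gain (eta_1)
    return eta1 + g * (eta2 + t * (R * (E + L)))

H, W = 480, 640
R_rgb = np.random.rand(H, W, 3)                 # RGB reflectance map
R_nir = np.random.rand(H, W)                    # NIR reflectance map
E_rgb = 0.2 * np.ones((H, W, 3))                # environmental illumination (RGB)
E_nir = 0.05 * np.ones((H, W))                  # environmental illumination (NIR)
L_nir = 0.8 * np.ones((H, W))                   # active illumination, NIR channel only

# Shared exposure/gain for R, G, B; separate exposure/gain for the NIR sensor.
I_rgb = render_channel(R_rgb, E_rgb, 0.0, t=8e-3, g=2.0)
I_nir = render_channel(R_nir, E_nir, L_nir, t=2e-3, g=4.0)
```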

Calibration

One advantage of using pixel-aligned cameras is that calibration between RGB and NIR images is unnecessary. Thus, we only calibrate the stereo cameras using checkerboard calibration [2]. The cameras support the Precision Time Protocol, which keeps the temporal offset between the stereo cameras below 1 microsecond. We also obtain the LiDAR extrinsics using corresponding point pairs between left-camera images and LiDAR point clouds [44].
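The two calibration steps can be sketched with standard OpenCV routines, as below. This is a hedged outline under our own naming; the paper's exact checkerboard dimensions and solver settings are not specified here.

```python
# Sketch of checkerboard stereo calibration and PnP-based LiDAR extrinsics.
import cv2
import numpy as np

def calibrate_stereo(obj_pts, img_pts_left, img_pts_right, image_size):
    """Per-view lists of board corners (Nx3) and detections (Nx2) for both cameras."""
    # Calibrate each camera individually, then solve the stereo pose with
    # intrinsics held fixed (OpenCV's default behaviour for stereoCalibrate).
    _, K_l, d_l, _, _ = cv2.calibrateCamera(obj_pts, img_pts_left, image_size, None, None)
    _, K_r, d_r, _, _ = cv2.calibrateCamera(obj_pts, img_pts_right, image_size, None, None)
    _, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, img_pts_left, img_pts_right, K_l, d_l, K_r, d_r,
        image_size, flags=cv2.CALIB_FIX_INTRINSIC)
    return K_l, d_l, K_r, d_r, R, T                 # stereo intrinsics and relative pose

def lidar_extrinsics(lidar_points_xyz, image_pixels_xy, K, dist):
    """LiDAR-to-camera pose from 3D-2D point correspondences (PnP)."""
    ok, rvec, tvec = cv2.solvePnP(
        lidar_points_xyz.astype(np.float32),
        image_pixels_xy.astype(np.float32), K, dist)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```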

(a) Our dataset samples

| Dataset | Pixel-aligned RGB-NIR | Multi-view RGB-NIR | RGB stereo | NIR stereo | GT depth | Environment |
|---|---|---|---|---|---|---|
| [6] | O | O | X | X | O | Outdoor, day |
| [50] | X | O | X | X | X | Outdoor, day, night |
| [54] | X | O | O | X | O | Outdoor, day |
| [52] | O | X | X | X | X | Outdoor, day, night |
| [64] | X | O | X | X | X | Outdoor, sunny, dark |
| [7] | X | O | X | X | X | Outdoor, day, low-light |
| [48] | X | O | O | O | O | Outdoor, day, night, rain |
| [41] | O | O | O | X | O | Outdoor, off-road, day |
| Ours | O | O | O | O | O | Outdoor, day, night, indoor |
| Ours (synthetic) | O | O | O | O | O | Outdoor, day, night, indoor |

(b) RGB-NIR machine vision dataset comparison

(c) Statistics of our dataset

Real-world Dataset

Using the system, we acquire a dataset consisting of 39 videos totaling 73,000 frames under various lighting conditions for training and 4 videos totaling 7,000 frames for testing. For each frame, we provide pixel-aligned RGB-NIR stereo images, a sparse LiDAR point cloud, sensor timestamps, and exposure values for the RGB and NIR CMOS sensors. Each video was recorded at frame rates of 5–10 Hz, with durations ranging from five to over thirty minutes. Figure 3(a) shows samples from our real dataset and augmented synthetic dataset. We categorize our dataset based on lighting conditions: outdoor day, outdoor night, well-lit indoor, and dark indoor. Figure 3(c) shows the statistics of our dataset. We use auto exposure on both RGB and NIR sensors with fixed gain values, and record the exposure time for each image. The distribution of exposure times in our dataset varies depending on lighting conditions. For the indoor dataset, many frames have short exposure times for NIR images, as the active illumination is sufficient to illuminate nearby indoor scenes. In the daytime dataset, strong sunlight causes the exposure times to shift towards very short durations under direct sunlight and longer durations in shaded areas. For the night dataset, although many frames exhibit long exposure times for both RGB and NIR, the average exposure time for NIR frames is shorter than that for RGB frames.

Synthetic Dataset

We augment a synthetic RGB stereo-image dataset [39] to obtain an RGB-NIR stereo dataset with ground-truth dense depth. Figure 3(a) shows a sample of the synthetic scenes. We compute the RGB reflectance map $R_{\text{RGB}}$ and the NIR reflectance map $R_{\text{NIR}}$ using the RGB-to-NIR color synthesis method [15]. We then simulate diverse environmental lighting and obtain the RGB-NIR stereo images using Equation (1). Our augmented dataset comprises 2,238 short sequences (10 frames each) and four long sequences with a total of 2,200 frames. Our augmentation method is detailed in the supplemental material.

Dataset Comparison

Figure 3(b) compares our dataset with existing RGB-NIR stereo datasets in terms of alignment, environment diversity, and the availability of ground-truth depth. Unlike datasets requiring color transformation or material segmentation [7, 64], we capture stereo images directly in each spectral band, enhancing quality and reducing computational demands. While other systems require pose transformations due to differing camera positions [28, 48], our system uses pixel-aligned RGB-NIR cameras. By overcoming these challenges, our dataset offers high-quality RGB-NIR stereo images suitable for various applications, including autonomous vehicles and robotics.


4 Pixel-aligned RGB-NIR Fusion

We introduce two methods for fusing pixel-aligned RGB-NIR images to improve the performance of vision models.

4.1 Image Fusion with Learned Weights

First, we fuse pixel-aligned RGB-NIR images into a single three-channel image that can be fed to vision models pretrained on RGB images without finetuning. Figure 3 shows the overview of our image-fusion method.

Baseline

As a baseline, we use the HSV channel-blending method [14] to enhance performance on vision applications while preserving the photometric consistency of both RGB and NIR. Given RGB and NIR images for camera $c$, the baseline method converts the RGB image into an HSV representation:

$$[I_H^c, I_S^c, I_V^c]^\intercal = M\,[I_R^c, I_G^c, I_B^c]^\intercal, \tag{2}$$

where $M$ is the RGB-to-HSV conversion matrix, and $I_H^c$, $I_S^c$, $I_V^c$ are the hue, saturation, and brightness images. We then combine the brightness channel $I_V^c$ with the NIR image $I_{\text{NIR}}^c$ to include details from both, while the hue and saturation channels remain unchanged. The fused HSV image is converted back to the RGB domain, yielding the fused image $I_{\text{fusion}}^c$:

$$I_{\text{fusion}}^c = M^{-1}\,[I_H^c, I_S^c, \alpha I_V^c + \beta I_{\text{NIR}}^c]^\intercal, \tag{3}$$

where $\alpha$ and $\beta$ are fusion weights.

The baseline method is limited in that the weights $\alpha$ and $\beta$ are predefined. This makes the fusion process non-adaptive to varying environments and lighting conditions, under which the image details in the NIR and RGB images vary significantly.
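For concreteness, a minimal sketch of this HSV-blending baseline with fixed weights is shown below, using OpenCV's color conversion in place of the matrix $M$. The weight values and image contents are arbitrary examples.

```python
# Sketch of the HSV-blending baseline (Eqs. 2-3) with predefined weights.
import cv2
import numpy as np

def hsv_fusion(rgb, nir, alpha=0.7, beta=0.3):
    """rgb: HxWx3 float32 in [0,1]; nir: HxW float32 in [0,1]."""
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)              # hue, saturation, brightness
    hsv[..., 2] = np.clip(alpha * hsv[..., 2] + beta * nir, 0.0, 1.0)  # blend V with NIR
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)              # back to a 3-channel image

# Toy usage with random placeholder images.
rgb = np.random.rand(480, 640, 3).astype(np.float32)
nir = np.random.rand(480, 640).astype(np.float32)
fused = hsv_fusion(rgb, nir)
```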

Learned Spatially-varying Weights

Instead of using predefined constants $\alpha$ and $\beta$, we propose to learn spatially-varying $\alpha$ and $\beta$ using an attention-based MLP. We first encode each image into a 256-channel feature map:

$$f_{\text{enc}}(I_{\text{RGB}}^c) = F_{\text{RGB}}^c \quad \text{and} \quad f_{\text{enc}}(I_{\text{NIR}}^c) = F_{\text{NIR}}^c, \tag{4}$$

where $f_{\text{enc}}$ is a ResNet-based feature extractor. For the NIR feature encoding, we replicate the NIR channel three times to form the input to the pretrained three-channel feature extractor.

We then obtain the fused feature map $F_{\text{fusion}}^c$ by combining $F_{\text{RGB}}^c$ and $F_{\text{NIR}}^c$ using attentional fusion:

$$F_{\text{fusion}}^c = f_{\text{fusion}}(F_{\text{RGB}}^c, F_{\text{NIR}}^c), \tag{5}$$

where the feature fusion module $f_{\text{fusion}}$ computes a sum of the RGB and NIR feature maps weighted by feature attention [12]. Finally, the spatially varying weights $\alpha$ and $\beta$ are produced by a decoder $f_{\text{dec}}$ consisting of several convolutional layers:

$$f_{\text{dec}}(F_{\text{fusion}}^c) = (\alpha, \beta). \tag{6}$$

Once the spatially-varying, scene-dependent values of $\alpha$ and $\beta$ are obtained, we compute the fused image using Equation (3). We then filter $I_{\text{fusion}}^c$ using guided filtering with the NIR image $I_{\text{NIR}}^c$ as the guide [31].
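A simplified PyTorch sketch of this weight-prediction pipeline (Equations (4)–(6)) follows. The encoder, attention module, and decoder are small stand-ins for the ResNet extractor, the attentional feature fusion of [12], and the convolutional decoder; layer sizes are illustrative assumptions.

```python
# Minimal sketch of learning spatially-varying fusion weights (alpha, beta).
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    def __init__(self, feat_ch=256):
        super().__init__()
        self.enc = nn.Sequential(                      # stand-in for the ResNet f_enc
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU())
        self.attn = nn.Sequential(                     # stand-in for attentional fusion
            nn.Conv2d(feat_ch, feat_ch, 1), nn.Sigmoid())
        self.dec = nn.Sequential(                      # f_dec -> two weight maps
            nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1), nn.Sigmoid())

    def forward(self, rgb, nir):
        f_rgb = self.enc(rgb)
        f_nir = self.enc(nir.repeat(1, 3, 1, 1))       # replicate NIR to three channels
        a = self.attn(f_rgb + f_nir)                   # attention weights in [0, 1]
        f_fused = a * f_rgb + (1.0 - a) * f_nir        # attention-weighted feature sum
        alpha, beta = self.dec(f_fused).chunk(2, dim=1)
        return alpha, beta                             # spatially-varying weight maps

# Toy usage; the weights would then be applied via Eq. (3) and guided filtering.
rgb = torch.rand(1, 3, 256, 320)
nir = torch.rand(1, 1, 256, 320)
alpha, beta = WeightPredictor()(rgb, nir)
```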

We evaluate the proposed method on object detection, depth estimation, and structure from motion, as described in Section 5. We train only the image-fusion model using a photometric loss and a stereo depth reconstruction loss, without training the downstream vision models. Refer to the Supplemental Document for details.


4.2 Feature Fusion for Stereo Depth Estimation

While the image fusion method proposed in Section 4.1 allows the use of existing vision models without finetuning, it reduces the four input channels (RGB and NIR) to three output channels, which may result in information loss.

Here, we develop a method that fuses RGB and NIR features and finetunes a downstream model, using the RAFT-Stereo network [36] as an example. Figure 4 shows a diagram of our proposed stereo depth estimation network. We extract features from the RGB stereo images $I_{\text{RGB}}^c$ and NIR stereo images $I_{\text{NIR}}^c$, where $c \in \{\text{left}, \text{right}\}$, using the same ResNet-based feature extractor: $F_{\text{RGB}}^c = f_{\text{enc}}(I_{\text{RGB}}^c)$, $F_{\text{NIR}}^c = f_{\text{enc}}(I_{\text{NIR}}^c)$. We then apply the attention-based fusion method from Section 4.1: $F_{\text{fusion}}^c = f_{\text{fusion}}(F_{\text{RGB}}^c, F_{\text{NIR}}^c)$ [12].

We compute correlation volumes not only for the fused features but also for the NIR features:

$$V_s(x, y, k) = F_s^{\text{left}}(x, y) \cdot F_s^{\text{right}}(x + k, y), \quad s \in \{\text{fusion}, \text{NIR}\}, \tag{7}$$

where $(x, y)$ denotes the pixel location, $k$ is the disparity, and $\cdot$ represents the inner product. We find that using the two correlation volumes, one for the fused features and one for the NIR features, effectively utilizes cross-spectral information, resulting in better depth accuracy, as shown in Table 3.
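A minimal PyTorch sketch of Equation (7) is given below. It loops over disparities for clarity, whereas RAFT-Stereo constructs its correlation pyramid more efficiently; feature shapes and the normalization are illustrative assumptions.

```python
# Sketch of the two correlation volumes in Eq. (7) from fused and NIR features.
import torch

def correlation_volume(f_left, f_right, max_disp):
    """f_left, f_right: BxCxHxW feature maps; returns a BxHxWxD volume."""
    B, C, H, W = f_left.shape
    vol = f_left.new_zeros(B, H, W, max_disp)
    for k in range(max_disp):
        w = W - k
        # Inner product between left pixel (x, y) and right pixel (x + k, y), as in Eq. (7).
        vol[:, :, :w, k] = (f_left[..., :w] * f_right[..., k:]).sum(dim=1)
    return vol / C ** 0.5                              # scale by 1/sqrt(C) for stability

# Toy usage with random placeholder features at 1/4 resolution.
f_fused_l, f_fused_r = torch.rand(1, 256, 64, 80), torch.rand(1, 256, 64, 80)
f_nir_l, f_nir_r = torch.rand(1, 256, 64, 80), torch.rand(1, 256, 64, 80)
V_fusion = correlation_volume(f_fused_l, f_fused_r, max_disp=32)
V_nir = correlation_volume(f_nir_l, f_nir_r, max_disp=32)
```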

We estimate a series of disparity maps $\{d_1, \dots, d_N\}$ using the GRU structure of the RAFT-Stereo network [36]. Feeding $F_{\text{fusion}}^{\text{left}}$ as the context feature in RAFT-Stereo, we alternate between the fused and NIR correlation volumes, $V_{\text{fusion}}$ and $V_{\text{NIR}}$, as input to the GRU at each iteration, inspired by the cross-spectral time-of-flight imaging method [5].
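The alternation schedule can be sketched as follows; `gru_update` is a hypothetical placeholder for RAFT-Stereo's recurrent update operator (hidden-state update plus disparity refinement), not its actual API.

```python
# Sketch of alternating fusion/NIR correlation volumes across GRU iterations.
import torch

def gru_update(hidden, context, corr_volume, disp):
    # Placeholder recurrent step: a real implementation samples the correlation
    # volume around the current disparity and updates the GRU hidden state.
    return hidden, disp + 0.0 * corr_volume.mean()

def estimate_disparity(V_fusion, V_nir, context, num_iters=12):
    """Returns the series of disparity maps {d_1, ..., d_N}."""
    B, H, W, _ = V_fusion.shape
    disp = V_fusion.new_zeros(B, 1, H, W)
    hidden = torch.tanh(context)                       # context from the fused left features
    disparities = []
    for i in range(num_iters):
        V = V_fusion if i % 2 == 0 else V_nir          # alternate between the two volumes
        hidden, disp = gru_update(hidden, context, V, disp)
        disparities.append(disp)
    return disparities
```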

Our method leverages spectral information from both RGB and NIR features, placing more emphasis on NIR images that are robust to environmental lighting. We fine-tune our model on the synthetic and real training datasets using the disparity reconstruction loss and the LiDAR loss, respectively. For more details on the loss functions and optimization, we refer to the Supplemental Document.

| Methods | (a) Depth RMSE [m] ↓ | (b) Detection mAP ↑ |
|---|---|---|
| RGB | 8.943 | 0.756 |
| NIR | 9.646 | 0.703 |
| YCrCb [19] | 8.528 | 0.571 |
| Bayesian [63] | 9.516 | 0.745 |
| DarkVision [24] | 8.313 | 0.762 |
| Adaptive [1] | 7.830 | 0.773 |
| VGG-NIR [29] | 9.654 | 0.726 |
| HSV (our baseline) [14] | 7.692 | 0.744 |
| Ours | 7.567 | 0.809 |

5 Results

We present the evaluation of our proposed pixel-aligned RGB-NIR image fusion and feature fusion methods, along with comprehensive ablation studies.

5.1 Pixel-aligned RGB-NIR Image Fusion

To evaluate our pixel-aligned RGB-NIR image fusion method from Section 4.1, we conduct evaluations across three downstream tasks: depth estimation, object detection, and structure from motion.

Depth Estimation

Figure 5 shows that using our fused image as input to the pretrained stereo depth estimation network [36] yields improved results over single-modality predictions using only RGB or NIR images. Table 1(a) shows a quantitative comparison against those single-modality predictions as well as other RGB-NIR image fusion methods. Our proposed RGB-NIR image fusion outperforms all baselines; refer to the Supplemental Document for the corresponding qualitative results. Our method also outperforms the baselines on another stereo-depth network [30] and a monocular depth-estimation network [59], which can be found in the Supplemental Document.

Object Detection

We use the YOLO object detector [45] to test our RGB-NIR image fusion method. Figure 6 and Table 1(b) show that our image fusion method consistently outperforms single-modality predictions using either RGB or NIR, as well as other RGB-NIR image fusion methods.


Structure from Motion

Figure 7 shows that our RGB-NIR fused images enable robust reconstruction with a structure-from-motion method [46] under challenging lighting conditions, where RGB and NIR images provide different scene visibility. This suggests the potential utility of our fused RGB-NIR images for downstream applications, including scene reconstruction and view synthesis.


| Depth estimation model | Depth RMSE [m] ↓ |
|---|---|
| RAFT-Stereo (RGB) [36] | 8.943 |
| RAFT-Stereo (NIR) [36] | 9.646 |
| CS-Stereo [64] | 8.941 |
| CSPD [16] | 10.101 |
| DPSNet [51] | 7.633 |
| Image fusion (ours) [36] | 7.567 |
| Ours | 6.747 |

5.2 Pixel-aligned RGB-NIR Feature Fusion

Figure 8 shows that our feature-based RGB-NIR stereo method (Section 4.2) outperforms not only the single-modality predictions using either RGB or NIR channels, but also our RGB-NIR image-fusion method (Section 4.1). While single-modality RGB or NIR images easily struggle with saturated areas and shadows, our RGB-NIR fused images provide better depth estimation without finetuning. Our feature-based stereo depth estimation method further improves reconstruction quality by effectively exploiting the strengths of the RGB and NIR channels via finetuning. Table 2 shows a quantitative comparison against six baseline methods: RGB-only, NIR-only, our image-fusion method, and three multi-spectral fusion architectures [64, 16, 51]. For the multi-spectral fusion methods [16, 51], we use their fusion architectures and retrain them on our dataset for a fair comparison. Refer to the Supplemental Document for details.

| Correlation volumes for disparity estimation | Depth error [m] ↓ |
|---|---|
| Fusion correlation volumes only | 7.440 |
| Alternating RGB-NIR correlation volumes | 8.571 |
| Alternating Fusion-RGB-NIR correlation volumes | 7.426 |
| Alternating Fusion-NIR correlation volumes | 6.747 |

Alternating Correlation Volumes

For the feature-fusion method in Section 4.2, we investigate the effect of using different types of correlation volumes for stereo disparity matching. Specifically, we experiment with several strategies: using only the fusion correlation volume, alternating between RGB and NIR correlation volumes, alternating among fusion, RGB, and NIR correlation volumes, and alternating between fusion and NIR correlation volumes. Table 3 shows that alternating fusion-NIR correlation volumes yields the highest accuracy among these strategies.

6 Conclusion

In this paper, we have presented a pixel-aligned RGB-NIR stereo imaging system, collected real and synthetic datasets, and developed an RGB-NIR image fusion method for pretrained models as well as a feature fusion method that involves additional training. We demonstrate the effectiveness of pixel-aligned RGB-NIR images and our RGB-NIR fusion methods on three downstream tasks: depth estimation, object detection, and structure from motion. We hope our work shows the potential of pixel-aligned RGB-NIR imaging.

Limitations and Future Work

We have demonstrated 3D imaging applications of pixel-aligned RGB-NIR images. A potentially more interesting direction for future work is to reproduce the success of RGB-image generative models for pixel-aligned RGB-NIR images.

Another interesting research direction is to improve the accuracy of pixel-aligned RGB-NIR 3D imaging by exploiting the LiDAR measurements, which provide complementary scene information via time-of-flight imaging.

References

  • [1]Mohamed Awad, Ahmed Elliethy, and HusseinA Aly.Adaptive near-infrared and visible fusion for fast image enhancement.IEEE Transactions on Computational Imaging, 6:408–418, 2019.
  • [2]Jean-Yves Bouguet.Camera calibration toolbox for matlab.http://www.vision.caltech.edu/bouguetj/calib_doc/, 2004.
  • [3]DavidP Broadbent, Giorgia D’Innocenzo, TobyJ Ellmers, Justin Parsler, AndreJ Szameitat, and DanielT Bishop.Cognitive load, working memory capacity and driving performance: A preliminary fnirs and eye tracking study.Transportation research part F: traffic psychology and behaviour, 92:121–132, 2023.
  • [4]Matthew Brown and Sabine Süsstrunk.Multi-spectral sift for scene category recognition.In CVPR 2011, pages 177–184. IEEE, 2011.
  • [5]Samuel Brucker, Stefanie Walz, Mario Bijelic, and Felix Heide.Cross-spectral gated-rgb stereo depth estimation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21654–21665, 2024.
  • [6]Nived Chebrolu, Philipp Lottes, Alexander Schaefer, Wera Winterhalter, Wolfram Burgard, and Cyrill Stachniss.Agricultural robot dataset for plant classification, localization and mapping on sugar beet fields.The International Journal of Robotics Research, 36(10):1045–1052, 2017.
  • [7]Gyeongmin Choe, Seong-Heum Kim, Sunghoon Im, Joon-Young Lee, SrinivasaG Narasimhan, and InSo Kweon.Ranus: Rgb and nir urban scene dataset for deep scene parsing.IEEE Robotics and Automation Letters, 3(3):1808–1815, 2018.
  • [8]Yukyung Choi, Namil Kim, Soonmin Hwang, Kibaek Park, JaeShin Yoon, Kyounghwan An, and InSo Kweon.Kaist multi-spectral day/night data set for autonomous and assisted driving.IEEE Transactions on Intelligent Transportation Systems, 19(3):934–948, Mar. 2018.
  • [9]David Connah, MarkSamuel Drew, and GrahamDavid Finlayson.Spectral edge image fusion: Theory and applications.In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, page 65–80, Cham, 2014. Springer International Publishing.
  • [10]Camille Couprie, Clément Farabet, Laurent Najman, and Yann Lecun.Indoor semantic segmentation using depth information.In First International Conference on Learning Representations (ICLR 2013), pages 1–8, 2013.
  • [11]Angela Dai, AngelX Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner.Scannet: Richly-annotated 3d reconstructions of indoor scenes.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
  • [12]Yimian Dai, Fabian Gieseke, Stefan Oehmcke, Yiquan Wu, and Kobus Barnard.Attentional feature fusion.In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), page 3559–3568, Waikoloa, HI, USA, Jan. 2021. IEEE.
  • [13]SriAditya Deevi, Connor Lee, Lu Gan, Sushruth Nagesh, Gaurav Pandey, and Soon-Jo Chung.Rgb-x object detection via scene-specific fusion modules.In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7366–7375, 2024.
  • [14]Clément Fredembach and Sabine Süsstrunk.Colouring the near-infrared.In Color and imaging conference, volume16, pages 176–182. Society of Imaging Science and Technology, 2008.
  • [15]Tobias Gruber, Frank Julca-Aguilar, Mario Bijelic, and Felix Heide.Gated2depth: Real-time dense lidar from gated images.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1506–1516, 2019.
  • [16]Yubin Guo, Xinlei Qi, Jin Xie, Cheng-Zhong Xu, and Hui Kong.Unsupervised cross-spectrum depth estimation by visible-light and thermal cameras.IEEE Transactions on Intelligent Transportation Systems, 24(10):10937–10947, Oct. 2023.
  • [17]MarkF Hansen, GaryA Atkinson, LyndonN Smith, and MelvynL Smith.3d face reconstructions from photometric stereo using near infrared and visible light.Computer Vision and Image Understanding, 114(8):942–951, 2010.
  • [18]María Herrera-Arellano, Hayde Peregrina-Barreto, and Iván Terol-Villalobos.Visible-nir image fusion based on top-hat transform.IEEE Transactions on Image Processing, 30:4962–4972, 2021.
  • [19]MaríaA Herrera-Arellano, Hayde Peregrina-Barreto, and Iván Terol-Villalobos.Color outdoor image enhancement by v-nir fusion and weighted luminance.In 2019 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1–6. IEEE, 2019.
  • [20]Yuchen Hong, Youwei Lyu, Si Li, Gang Cao, and Boxin Shi.Reflection removal with nir and rgb image feature fusion.IEEE Transactions on Multimedia, 2022.
  • [21]Xuanlun Huang, Chenyang Wu, Xiaolan Xu, Baishun Wang, Sui Zhang, Chihchiang Shen, Chiennan Yu, Jiaxing Wang, Nan Chi, Shaohua Yu, etal.Polarization structured light 3d depth image sensor for scenes with reflective surfaces.Nature Communications, 14(1):6855, 2023.
  • [22]Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In SoKweon.Multispectral pedestrian detection: Benchmark dataset and baseline.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1037–1045, 2015.
  • [23]Dong-Won Jang and Rae-Hong Park.Colour image dehazing using near-infrared fusion.IET Image Processing, 11(8):587–594, 2017.
  • [24]Shuangping Jin, Bingbing Yu, Minhao Jing, Yi Zhou, Jiajun Liang, and Renhe Ji.Darkvisionnet: Low-light imaging via rgb-nir fusion with deep inconsistency prior.In Proceedings of the AAAI Conference on Artificial Intelligence, volume36, pages 1104–1112, 2022.
  • [25]Cheolkon Jung, Kailong Zhou, and Jiawei Feng.Fusionnet: Multispectral fusion of rgb and nir images using two stage convolutional neural networks.IEEE Access, 8:23912–23919, 2020.
  • [26]MinH Kim, ToddAlan Harvey, DavidS Kittle, Holly Rushmeier, Julie Dorsey, RichardO Prum, and DavidJ Brady.3d imaging spectroscopy for measuring hyperspectral patterns on solid objects.ACM Transactions on Graphics (TOG), 31(4):1–11, 2012.
  • [27]Namil Kim, Yukyung Choi, Soonmin Hwang, and InSo Kweon.Multispectral transfer network: Unsupervised depth estimation for all-day vision.In Proceedings of the AAAI Conference on Artificial Intelligence, volume32, 2018.
  • [28]AlexJunho Lee, Younggun Cho, Young-sik Shin, Ayoung Kim, and Hyun Myung.Vivid++ : Vision for visibility dataset.IEEE Robotics and Automation Letters, 7(3):6282–6289, July 2022.
  • [29]Hui Li, Xiao-Jun Wu, and Josef Kittler.Infrared and visible image fusion using a deep learning framework.In 2018 24th international conference on pattern recognition (ICPR), pages 2705–2710. IEEE, 2018.
  • [30]Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng Liu.Practical stereo matching via cascaded recurrent network with adaptive correlation.In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 16242–16251, New Orleans, LA, USA, June 2022. IEEE.
  • [31]Shutao Li, Xudong Kang, and Jianwen Hu.Image fusion with guided filtering.IEEE Transactions on Image Processing, 22(7):2864–2875, July 2013.
  • [32]StanZ Li, RuFeng Chu, ShengCai Liao, and Lun Zhang.Illumination invariant face recognition using near-infrared images.IEEE Transactions on pattern analysis and machine intelligence, 29(4):627–639, 2007.
  • [33]Xuan Li, Fei Liu, Pingli Han, Shichao Zhang, and Xiaopeng Shao.Near-infrared monocular 3d computational polarization imaging of surfaces exhibiting nonuniform reflectance.Optics Express, 29(10):15616–15630, 2021.
  • [34]Zhuo Li, Hai-Miao Hu, Wei Zhang, Shiliang Pu, and Bo Li.Spectrum characteristics preserved visible and near-infrared image fusion algorithm.IEEE Transactions on Multimedia, 23:306–319, 2021.
  • [35]Mingyang Liang, Xiaoyang Guo, Hongsheng Li, Xiaogang Wang, and You Song.Unsupervised cross-spectral stereo matching by learning to synthesize.In Proceedings of the AAAI Conference on Artificial Intelligence, volume33, pages 8706–8713, 2019.
  • [36]Lahav Lipson, Zachary Teed, and Jia Deng.Raft-stereo: Multilevel recurrent field transforms for stereo matching.In 2021 International Conference on 3D Vision (3DV), pages 218–227. IEEE, 2021.
  • [37]Gang Liu and Guohong Huang.Color fusion based on em algorithm for ir and visible image.In 2010 The 2nd International Conference on Computer and Automation Engineering (ICCAE), volume2, page 253–258, Feb. 2010.
  • [38]Jingjing Liu, Shaoting Zhang, Shu Wang, and DimitrisN Metaxas.Multispectral deep neural networks for pedestrian detection.In 27th British Machine Vision Conference, BMVC 2016, 2016.
  • [39]Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox.A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4040–4048, 2016.
  • [40]Yusuke Monno, Hayato Teranaka, Kazunori Yoshizaki, Masayuki Tanaka, and Masatoshi Okutomi.Single-sensor rgb-nir imaging: High-quality system design and prototype implementation.IEEE Sensors Journal, 19(2):497–507, 2018.
  • [41]Peter Mortimer, Raphael Hagmanns, Miguel Granero, Thorsten Luettel, Janko Petereit, and Hans-Joachim Wuensche.The goose dataset for perception in unstructured environments.2024.
  • [42]Chulhee Park and MoonGi Kang.Color restoration of rgbn multispectral filter array sensor images based on spectral decomposition.Sensors, 16(5):719, 2016.
  • [43]Matteo Poggi, PierluigiZama Ramirez, Fabio Tosi, Samuele Salti, Stefano Mattoccia, and Luigi DiStefano.Cross-spectral neural radiance fields.In 2022 International Conference on 3D Vision (3DV), pages 606–616. IEEE, 2022.
  • [44]Long Quan and Zhongdan Lan.Linear n-point camera pose determination.IEEE Transactions on pattern analysis and machine intelligence, 21(8):774–780, 1999.
  • [45]Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.You only look once: Unified, real-time object detection.In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
  • [46]JohannesL Schonberger and Jan-Michael Frahm.Structure-from-motion revisited.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  • [47]Takashi Shibata, Masayuki Tanaka, and Masatoshi Okutomi.Versatile visible and near-infrared image fusion based on high visibility area selection.Journal of Electronic Imaging, 25(1):013016–013016, 2016.
  • [48]Ukcheol Shin, Jinsun Park, and InSo Kweon.Deep depth estimation from thermal image.In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 1043–1053, Vancouver, BC, Canada, June 2023. IEEE.
  • [49]Haonan Su, Cheolkon Jung, and Long Yu.Multi-spectral fusion and denoising of color and near-infrared images using multi-scale wavelet analysis.Sensors, 21(11):3610, 2021.
  • [50]Karasawa Takumi, Kohei Watanabe, Qishen Ha, Antonio Tejero-De-Pablos, Yoshitaka Ushiku, and Tatsuya Harada.Multispectral object detection for autonomous vehicles.In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pages 35–43, 2017.
  • [51]Chaoran Tian, Weihong Pan, Zimo Wang, Mao Mao, Guofeng Zhang, Hujun Bao, Ping Tan, and Zhaopeng Cui.Dps-net: Deep polarimetric stereo depth estimation.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3569–3579, 2023.
  • [52]Alexander Toet, MaartenA Hogervorst, and AlanR Pinkus.The triclobs dynamic multi-band image data set for the development and evaluation of image fusion methods.PloS one, 11(12):e0165016, 2016.
  • [53]Alexander Toet and Jan Walraven.New false color mapping for image fusion.Optical engineering, 35(3):650–658, 1996.
  • [54]Abhinav Valada, GabrielL Oliveira, Thomas Brox, and Wolfram Burgard.Deep multispectral semantic scene understanding of forested environments using multimodal fusion.In 2016 International Symposium on Experimental Robotics, pages 465–477. Springer, 2017.
  • [55]Stefanie Walz, Mario Bijelic, Andrea Ramazzina, Amanpreet Walia, Fahim Mannan, and Felix Heide.Gated stereo: Joint depth estimation from gated and wide-baseline active stereo cues.In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 13252–13262, Vancouver, BC, Canada, June 2023. IEEE.
  • [56]Linbo Wang, Tao Wang, Deyun Yang, Xianyong Fang, and Shaohua Wan.Near-infrared fusion for deep lightness enhancement.International Journal of Machine Learning and Cybernetics, 14(5):1621–1633, May 2023.
  • [57]Fei Xia, AmirR Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese.Gibson env: Real-world perception for embodied agents.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9068–9079, 2018.
  • [58]Karmesh Yadav, Ram Ramrakhya, SanthoshKumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, AngelXuan Chang, Dhruv Batra, Manolis Savva, etal.Habitat-matterport 3d semantics dataset.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4927–4936, 2023.
  • [59]Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao.Depth anything: Unleashing the power of large-scale unlabeled data.In CVPR, 2024.
  • [60]Xue Zhang, Xiaohan Zhang, Jiangtao Wang, Jiacheng Ying, Zehua Sheng, Heng Yu, Chunguang Li, and Hui-Liang Shen.Tfdet: Target-aware fusion for rgb-t pedestrian detection.IEEE Transactions on Neural Networks and Learning Systems, page 1–15, 2024.
  • [61]Yigong Zhang, Yicheng Gao, Shuo Gu, Yubin Guo, Minghao Liu, Zezhou Sun, Zhixing Hou, Hang Yang, Ying Wang, Jian Yang, Jean Ponce, and Hui Kong.Build your own hybrid thermal/eo camera for autonomous vehicle.In 2019 International Conference on Robotics and Automation (ICRA), page 6555–6560, May 2019.
  • [62]Wenda Zhao, Shigeng Xie, Fan Zhao, You He, and Huchuan Lu.Metafusion: Infrared and visible image fusion via meta-feature embedding from object detection.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13955–13965, 2023.
  • [63]Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, and Jiangshe Zhang.Bayesian fusion for infrared and visible images.Signal Processing, 177:107734, 2020.
  • [64]Tiancheng Zhi, BernardoR Pires, Martial Hebert, and SrinivasaG Narasimhan.Deep material-aware cross-spectral stereo matching.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1916–1925, 2018.