Defocus deblurring is a challenging task due to the spatially varying nature of defocus blur. While deep learning approaches show great promise in solving image restoration problems, defocus deblurring demands accurate training data consisting of all-in-focus and defocused image pairs, which is difficult to collect. Naive two-shot capturing cannot achieve pixel-wise correspondence between the defocused and all-in-focus image pairs. Synthetic aperture imaging with light fields is suggested to be a more reliable way to generate accurate image pairs. However, the defocus blur generated from light field data differs from that of images captured with a conventional digital camera. In this paper, we propose a novel deep defocus deblurring network that leverages the strengths and overcomes the shortcomings of light fields. We first train the network on a light-field-generated dataset for its highly accurate image correspondence. We then fine-tune the network using a feature loss on another dataset collected with the two-shot method to alleviate the difference between the defocus blur in the two domains. This strategy proves highly effective and achieves state-of-the-art performance both quantitatively and qualitatively on multiple test sets. Extensive ablation studies have been conducted to analyze the effect of each network module on the final performance.
Intuitively, to capture defocused and all-in-focus image pairs for training a neural network, two images should be captured sequentially with different aperture sizes. However, it is hardly possible to capture defocused and all-in-focus pairs with accurate correspondence in two shots, especially for outdoor scenes, due to moving objects (e.g., plants, cars) and illuminance variation (see the figures below).
Mismatch when capturing with two-shot method
[More information coming soon!]
It has been suggested that light field techniques can be used to generate defocused and all-in-focus pairs with accurate pixel correspondence. As shown in the figure below, for a conventional digital camera (left), the rays emitted from a scene point on the focal plane are converged by the main lens onto a single pixel of the image sensor, while a point away from the focal plane projects onto a circular patch of pixels, the circle of confusion (CoC), causing defocus blur. For a light field camera (right), a micro-lens array is placed in front of the sensor, so the rays coming from the main lens are re-distributed to the pixels under the micro lenses. Each pixel therefore records not only the integrated illuminance but also the directional information of the rays. After capture, synthetic aperture imaging and post-refocusing can be achieved by integrating an appropriate subset of samples from multiple sub-aperture views and by integrating pixels along different directions on the epipolar plane image (EPI), respectively.
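The shift-and-sum view of synthetic aperture refocusing can be illustrated with a minimal sketch. This is not the actual data-generation pipeline; it assumes a 4D light field stored as an array of grayscale sub-aperture views, uses integer-pixel shifts via `np.roll` (a real implementation would interpolate at sub-pixel accuracy), and the `alpha` parameter is a free refocusing slope:

```python
import numpy as np

def refocus(light_field, alpha):
    """Shift-and-sum synthetic-aperture refocusing (simplified sketch).

    light_field: array of shape (U, V, H, W) holding sub-aperture views.
    alpha: slope selecting the synthetic focal plane; alpha = 0 simply
           averages the views without any shift.
    """
    U, V, H, W = light_field.shape
    uc, vc = (U - 1) / 2.0, (V - 1) / 2.0
    out = np.zeros((H, W), dtype=np.float64)
    for u in range(U):
        for v in range(V):
            # Shift each view proportionally to its offset from the
            # central view, then accumulate.
            dy = int(round(alpha * (u - uc)))
            dx = int(round(alpha * (v - vc)))
            out += np.roll(light_field[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (U * V)
```

Points whose per-view shifts match `alpha` are aligned and stay sharp, while all other points are smeared over the synthetic aperture, producing the defocus blur used to build the training pairs.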
However, there is a domain gap between defocused images generated from light field data and defocused images captured by a DSLR camera. Generally, the PSFs produced by a digital camera follow the diffraction pattern of a single Airy disk, while the PSFs produced by a light field camera resemble the patterns of multiple Airy disks, which can be explained by the synthetic nature of light-field-generated defocus blur.
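To make the PSF contrast concrete, here is an illustrative sketch (not a calibrated camera model): a single normalized Airy-disk PSF, `(2 J1(x)/x)^2`, versus a loose approximation of a synthetic-aperture PSF as a sum of shifted copies of that disk, one per sub-aperture sample. The `scale` and `offsets` parameters are free assumptions for visualization:

```python
import numpy as np
from scipy.special import j1  # first-order Bessel function of the first kind

def airy_psf(size, scale):
    """Normalized single-Airy-disk PSF on a size x size grid (illustrative)."""
    yy, xx = np.mgrid[:size, :size] - (size - 1) / 2.0
    r = np.hypot(xx, yy) * scale
    r = np.where(r == 0, 1e-12, r)          # avoid 0/0 at the center
    psf = (2.0 * j1(r) / r) ** 2            # Airy diffraction pattern
    return psf / psf.sum()

def multi_airy_psf(size, scale, offsets):
    """Loose stand-in for a light-field synthetic PSF: a normalized sum of
    shifted single-disk PSFs, one per sub-aperture sample (assumption for
    illustration only, not a measured light field PSF)."""
    psf = np.zeros((size, size))
    for dy, dx in offsets:
        psf += np.roll(airy_psf(size, scale), shift=(dy, dx), axis=(0, 1))
    return psf / psf.sum()
```

Plotting the two kernels side by side shows why a network trained purely on light-field-generated blur can struggle on DSLR-captured blur.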
In this paper, we therefore propose using both light-field-generated and DSLR-captured datasets for training: the light-field-generated dataset is used in the main network training for its accurate pixel correspondence, and the DSLR-captured dataset is used for network fine-tuning. Below is an interactive interface comparing the network trained with only DPDD, only LFDOF, and both.
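The two-stage schedule could be sketched roughly as follows. This is a hypothetical outline, not the released training code: the model, the data loaders, and the feature extractor (e.g., a pretrained VGG truncation) are stand-ins, and the pixel/feature loss choices follow the strategy described above:

```python
import torch
import torch.nn as nn

def feature_loss(feat_extractor, pred, target):
    # Perceptual (feature) loss: distance between deep features is tolerant
    # of the small misalignments present in two-shot DSLR pairs.
    return nn.functional.l1_loss(feat_extractor(pred), feat_extractor(target))

def train_two_stage(model, lfdof_loader, dpdd_loader, feat_extractor,
                    pretrain_epochs=1, finetune_epochs=1, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Stage 1: main training on light-field-generated pairs with a pixel
    # loss, exploiting their accurate pixel correspondence.
    for _ in range(pretrain_epochs):
        for blurred, sharp in lfdof_loader:
            opt.zero_grad()
            nn.functional.l1_loss(model(blurred), sharp).backward()
            opt.step()
    # Stage 2: fine-tuning on DSLR-captured pairs with a feature loss,
    # narrowing the PSF domain gap between the two datasets.
    for _ in range(finetune_epochs):
        for blurred, sharp in dpdd_loader:
            opt.zero_grad()
            feature_loss(feat_extractor, model(blurred), sharp).backward()
            opt.step()
    return model
```

The key design choice is that the misalignment-sensitive pixel loss is only applied to the accurately registered light field data, while the misalignment-tolerant feature loss handles the real captures.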