Defocus Map Estimation and Deblurring from a Single Dual-Pixel Image
Abstract
We present a method that takes as input a single dual-pixel image, and simultaneously estimates the image’s defocus map—the amount of defocus blur at each pixel—and recovers an all-in-focus image. Our method is inspired by recent works that leverage the dual-pixel sensors available in many consumer cameras to assist with autofocus, and use them for recovery of defocus maps or all-in-focus images. These prior works have solved the two recovery problems independently of each other, and often require large labeled datasets for supervised training. By contrast, we show that it is beneficial to treat these two closely-connected problems simultaneously. To this end, we set up an optimization problem that, by carefully modeling the optics of dual-pixel images, jointly solves both problems. We use data captured with a consumer smartphone camera to demonstrate that, after a one-time calibration step, our approach improves upon prior works for both defocus map estimation and blur removal, despite being entirely unsupervised.
1 Introduction
Modern DSLR and mirrorless cameras feature large-aperture lenses that allow collecting more light, but also introduce defocus blur, meaning that objects in images appear blurred by an amount proportional to their distance from the focal plane. A simple way to reduce defocus blur is to stop down, i.e., shrink the aperture. However, this also reduces the amount of light reaching the sensor, making the image noisier. Moreover, stopping down is impossible on fixed-aperture cameras, such as those in most smartphones. More sophisticated techniques fall into two categories. First are techniques that add extra hardware (e.g., coded apertures [46], specialized lenses [47, 15]), and thus are impractical to deploy at large scale or across already available cameras. Second are focus stacking techniques [76] that capture multiple images at different focus distances, and fuse them into an all-in-focus image. These techniques require long capture times, and thus are applicable only to static scenes.
Ideally, defocus blur removal should be done using data from a single capture. Unfortunately, in conventional cameras, this task is fundamentally ill-posed: a captured image may have no high-frequency content because either the latent all-in-focus image lacks such frequencies, or they are removed by defocus blur. Knowing the defocus map, i.e., the spatially-varying amount of defocus blur, can help simplify blur removal. However, determining the defocus map from a single image is closely-related to monocular depth estimation, which is a challenging problem in its own right. Even if the defocus map were known, recovering an all-in-focus image is still an ill-posed problem, as it requires hallucinating the missing high-frequency content.
Dual-pixel (DP) sensors are a recent innovation that makes it easier to solve both the defocus map estimation and defocus blur removal problems, with data from a single capture. Camera manufacturers have introduced such sensors to many DSLR and smartphone cameras to improve autofocus [2, 36]. Each pixel on a DP sensor is split into two halves, each capturing light from half of the main lens’ aperture, yielding two sub-images per exposure (Fig. 1). These can be thought of as a two-sample light field [61], and their sum is equivalent to the image captured by a regular sensor. The two sub-images have different half-aperture-shaped defocus blur kernels; these are additionally spatially-varying due to optical imperfections such as vignetting or field curvature in lenses, especially for cheap smartphone lenses.
We propose a method to simultaneously recover the defocus map and all-in-focus image from a single DP capture. Specifically, we perform a one-time calibration to determine the spatially-varying blur kernels for the left and right DP images. Then, given a single DP image, we optimize a multiplane image (MPI) representation [77, 91] to best explain the observed DP images using the calibrated blur kernels. An MPI is a layered representation that accurately models occlusions, and can be used to render both defocused and all-in-focus images, as well as produce a defocus map. As solving for the MPI from two DP images is underconstrained, we introduce additional priors and show their effectiveness via ablation studies. Further, we show that in the presence of image noise, standard optimization has a bias towards underestimating the amount of defocus blur, and we introduce a bias correction term. Our method does not require large amounts of training data, save for a one-time calibration, and outperforms prior art on both defocus map estimation and blur removal, when tested on images captured using a consumer smartphone camera. We make our implementation and data publicly available [85].
2 Related Work
Depth estimation. Multi-view depth estimation is a well-posed and extensively studied problem [30, 71]. By contrast, single-view, or monocular, depth estimation is ill-posed. Early techniques attempting to recover depth from a single image typically relied on additional cues, such as silhouettes, shading, texture, vanishing points, or data-driven supervision [5, 7, 10, 13, 29, 37, 38, 42, 44, 51, 67, 70, 72]. The use of deep neural networks trained on large RGB-D datasets [17, 22, 50, 52, 69, 74] significantly improved the performance of data-driven approaches, motivating approaches that use synthetic data [4, 28, 56, 60, 92], self-supervised training [23, 25, 26, 39, 54, 90], or multiple data sources [18, 66]. Despite these advances, producing high-quality depth from a single image remains difficult, due to the inherent ambiguities of monocular depth estimation.
Recent works have shown that DP data can improve monocular depth quality, by resolving some of these ambiguities. Wadhwa et al. [82] applied classical stereo matching methods to DP views to compute depth. Punnappurath et al. [64] showed that explicitly modeling defocus blur during stereo matching can improve depth quality. However, they assume that the defocus blur is spatially invariant and symmetric between the left and right DP images, which is not true in real smartphone cameras. Depth estimation with DP images has also been used as part of reflection removal algorithms [65]. Garg et al. [24] and Zhang et al. [87] trained neural networks to output depth from DP images, using a captured dataset of thousands of DP images and ground truth depth maps [3]. The resulting performance improvements come at a significant data collection cost.
Focus or defocus has been used as a cue for monocular depth estimation prior to these DP works. Depth from defocus techniques [19, 63, 78, 84] use two differently-focused images with the same viewpoint, whereas depth from focus techniques use a dense focal stack [27, 33, 76]. Other monocular depth estimation techniques use defocus cues as supervision for training depth estimation networks [75], use a coded aperture to estimate depth from one [46, 81, 89] or two captures [88], or estimate a defocus map using synthetic data [45]. Lastly, some binocular stereo approaches also explicitly account for defocus blur [12, 49]; compared to depth estimation from DP images, these approaches assume different focus distances for the two views.
Defocus deblurring. Besides depth estimation, measuring and removing defocus blur is often desirable to produce sharp all-in-focus images. Defocus deblurring techniques usually estimate either a depth map or an equivalent defocus map as a first processing stage [14, 40, 62, 73]. Some techniques modify the camera hardware to facilitate this stage. Examples include inserting patterned occluders in the camera aperture to make defocus scale selection easier [46, 81, 89, 88]; or sweeping through multiple focal settings within the exposure to make defocus blur spatially uniform [59]. Once a defocus map is available, a second deblurring stage employs non-blind deconvolution methods [46, 21, 43, 83, 57, 86] to remove the defocus blur.
Deep learning has been successfully used for defocus deblurring as well. Lee et al. [45] train neural networks to regress to defocus maps, which are then used to deblur. Abuolaim and Brown [1] extend this approach to DP data, and train a neural network to directly regress from DP images to all-in-focus images. Their method relies on a dataset of pairs of wide and narrow aperture images captured with a DSLR, and may not generalize to images captured on smartphone cameras, which have very different optical characteristics. Such a dataset is impossible to collect on smartphone cameras with fixed aperture lenses. In contrast to these prior works, our method does not require difficult-to-capture large datasets. Instead, it uses an accurate model of the defocus blur characteristics of DP data, and simultaneously solves for a defocus map and an all-in-focus image.
3 Dual-Pixel Image Formation
We begin by describing image formation for a regular and a dual-pixel (DP) sensor, to relate the defocus map and the all-in-focus image to the captured image. For this, we consider a camera imaging a scene with two points, only one of which is in focus (Fig. 3(b)). Rays emanating from the in-focus point (blue) converge on a single pixel, creating a sharp image. By contrast, rays from the out-of-focus point (brown) fail to converge, creating a blurred image.
If we consider a lens with an infinitesimally-small aperture (i.e., a pinhole camera), only rays that pass through its center strike the sensor, and create a sharp all-in-focus image $I$ (Fig. 3(c)). Under the thin lens model, the blurred image of the out-of-focus point equals $I$ blurred with a depth-dependent kernel, shaped as a scaled version of the aperture—typically a circular disc of radius $b = \alpha / z + \beta$, where $z$ is the point depth, and $\alpha$ and $\beta$ are lens-dependent constants [24]. Therefore, the per-pixel signed kernel radius $b$, termed the defocus map $D$, is a linear function of inverse depth, thus a proxy for the depth map. Given the defocus map $D$, and ignoring occlusions, the sharp image $I$ can be recovered from the captured image $\hat{I}$ using non-blind deconvolution. In practice, recovering either the defocus map or the sharp image from a single image is ill-posed, as multiple combinations of $I$ and $D$ produce the same image $\hat{I}$. Even when the defocus map is known, determining the sharp image is still ill-posed, as blurring irreversibly removes image frequencies.
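To make the thin-lens relationship concrete, here is a minimal sketch of the signed defocus radius as a linear (affine) function of inverse depth; the constants `alpha` and `beta` are hypothetical stand-ins for the lens-dependent values, not calibrated numbers.

```python
# Sketch of the thin-lens defocus model: the signed blur radius is affine
# in inverse depth 1/z. alpha and beta are hypothetical lens constants.
def defocus_radius(z, alpha=50.0, beta=-0.1):
    """Signed defocus kernel radius (pixels) for a point at depth z (mm)."""
    return alpha / z + beta

# The focal plane is where the radius crosses zero: z_f = -alpha / beta.
z_focus = -50.0 / -0.1  # 500 mm
print(defocus_radius(z_focus))  # ~0: in-focus points are sharp
```

Equal steps in inverse depth produce equal steps in blur radius, which is why the defocus map serves as a proxy for (inverse) depth.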
DP sensors make it easier to estimate the defocus map. In DP sensors (Fig. 3(a)), each pixel is split into two halves, each collecting light from the corresponding half of the lens aperture (Fig. 3(b)). Adding the two half-pixel, or DP, images $\hat{I}^l$ and $\hat{I}^r$ produces an image equivalent to that captured by a regular sensor, i.e., $\hat{I} = \hat{I}^l + \hat{I}^r$. Furthermore, DP images are identical for an in-focus scene point, and shifted versions of each other for an out-of-focus point. The amount of shift, termed DP disparity, is proportional to the blur size, and thus provides an alternative for defocus map estimation. In addition to facilitating the estimation of the defocus map $D$, having two DP images instead of a single image provides additional constraints for recovering the underlying sharp image $I$. Utilizing these constraints requires knowing the blur kernel shapes for the two DP images.
Blur kernel calibration. As real lenses have spatially-varying kernels, we calibrate a regular spatial grid of kernels. To do this, we fix the focus distance, capture a regular grid of circular discs on a monitor screen, and solve for blur kernels for left and right images independently using a method similar to Mannan and Langer [55]. When solving for kernels, we assume that they are normalized to sum to one, and calibrate separately for vignetting: we average left and right images from six captures of a white diffuser, using the same focus distance as above, to produce left and right vignetting patterns $v^l$ and $v^r$. We refer to the supplement for details.
4 Proposed Method
The inputs to our method are two single-channel DP images, and calibrated left and right blur kernels. We correct for vignetting using $v^l$ and $v^r$, and denote the two vignetting-corrected DP images as $\hat{I}^l$ and $\hat{I}^r$, and their corresponding blur kernels at a reference defocus size $b_0$ as $k^l_{b_0}$ and $k^r_{b_0}$, respectively. We assume that blur kernels at a defocus size $b$ can be obtained by scaling $k^l_{b_0}$ and $k^r_{b_0}$ by a factor $b / b_0$ [64, 88]. Our goal is to optimize for the multiplane image (MPI) representation that best explains the observed data, and use it to recover the latent all-in-focus image $I$ and defocus map $D$. We first introduce the MPI representation, and show how to render defocused images from it. We then formulate an MPI optimization problem, and detail its loss function.
4.1 Multiplane Image (MPI) Representation
We model the scene using the MPI representation, previously used primarily for view synthesis [80, 91]. MPIs discretize the 3D space into $n$ fronto-parallel planes at fixed depths (Fig. 5). We select depths corresponding to linearly-changing defocus blur sizes $b_1, \ldots, b_n$. Each MPI plane $i$ is an image of the in-focus scene that consists of an intensity channel $c_i$ and an alpha channel $\alpha_i$.
All-in-focus image compositing. Given an MPI, we composite the sharp image $I$ using the over operator [53]: we sum all layers, weighting each layer $i$ by its alpha $\alpha_i$ and its transmittance $t_i$,

$I = \sum_{i=1}^{n} t_i \, \alpha_i \, c_i, \qquad t_i = \prod_{j=1}^{i-1} \left(1 - \alpha_j\right), \qquad$ (1)

where layers are ordered front-to-back.
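The compositing step above can be sketched as follows; this is a minimal NumPy illustration assuming layers ordered front-to-back, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of over-compositing an MPI into a sharp image.
# Layer 0 is closest to the camera; c and a hold per-layer intensity and alpha.
def composite(c, a):
    """c, a: arrays of shape (n_layers, H, W). Returns the composited image."""
    image = np.zeros_like(c[0])
    transmittance = np.ones_like(c[0])
    for ci, ai in zip(c, a):
        image += transmittance * ai * ci   # layer contribution t_i * alpha_i * c_i
        transmittance *= (1.0 - ai)        # light blocked by this layer
    return image
```

A fully opaque front layer hides everything behind it, while a fully transparent one passes the back layers through unchanged.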
Defocused image rendering. Given the left and right blur kernels $k^l_{b_i}$ and $k^r_{b_i}$ for each layer $i$, we render defocused images by convolving each layer with its corresponding kernel, then compositing the blurred layers as in Eq. (1):

$I^v = \sum_{i=1}^{n} \left( \prod_{j=1}^{i-1} \left(1 - k^v_{b_j} * \alpha_j\right) \right) \left( k^v_{b_i} * \left(\alpha_i \, c_i\right) \right), \quad v \in \{l, r\}, \qquad$ (2)
where $*$ denotes convolution. In practice, we scale the calibrated spatially-varying left and right kernels by the defocus size $b_i$, and apply the scaled spatially-varying blur to each layer $(c_i, \alpha_i)$. We note that we render left and right views from a single MPI, but with different kernels.
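A sketch of the defocused rendering step, under the same front-to-back layer assumption as before; the plain-NumPy `blur` helper is a stand-in for the calibrated spatially-varying convolution (it computes correlation, which coincides with convolution for symmetric kernels).

```python
import numpy as np

def blur(x, k):
    """'Same'-size 2D correlation of image x with kernel k, zero-padded."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def render_defocused(c, a, kernels):
    """Blur each layer with its kernel, then composite front-to-back."""
    image = np.zeros_like(c[0])
    transmittance = np.ones_like(c[0])
    for ci, ai, ki in zip(c, a, kernels):
        image += transmittance * blur(ai * ci, ki)  # blurred premultiplied layer
        transmittance *= (1.0 - blur(ai, ki))       # blurred alpha occludes
    return image
```

With a 1x1 identity kernel this reduces to the sharp compositing of Eq. (1), which is a convenient sanity check.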
4.2 Effect of Gaussian Noise on Defocus Estimation
Using Eq. (2), we can optimize for the MPI that minimizes the error between rendered images $I^l, I^r$ and observed DP images $\hat{I}^l, \hat{I}^r$. Here we show that, in the presence of noise, this optimization is biased toward smaller defocus sizes, and we correct for this bias.
Assuming additive white Gaussian noise distributed as $\mathcal{N}(0, \sigma^2)$, we can model DP images as:

$\hat{I}^v = \tilde{I}^v + \eta^v, \quad \eta^v \sim \mathcal{N}(0, \sigma^2), \quad v \in \{l, r\}, \qquad$ (3)
where $\tilde{I}^l, \tilde{I}^r$ are the latent noise-free images. For simplicity, we assume for now that all scene content lies on a single fronto-parallel plane with ground truth defocus size $\hat{b}$. Then, using frequency domain analysis similar to Zhou et al. [88], we prove in the supplement that for a defocus size hypothesis $b$, the expected negative log-energy function corresponding to the MAP estimate of the MPI is:

$\mathbb{E}\left[E(b)\right] = \sum_{\xi} \frac{\left| K^l_b \hat{K}^r_{\hat{b}} - K^r_b \hat{K}^l_{\hat{b}} \right|^2}{\left|K^l_b\right|^2 + \left|K^r_b\right|^2 + g} + \sigma^2 \sum_{\xi} \frac{g + \left|K^l_b - \hat{K}^l_{\hat{b}}\right|^2 + \left|K^r_b - \hat{K}^r_{\hat{b}}\right|^2}{\left|K^l_b\right|^2 + \left|K^r_b\right|^2 + g}, \qquad$ (4)
where $K^l_b$ and $K^r_b$ are the Fourier transforms of kernels $k^l_b$ and $k^r_b$ respectively, $\hat{K}^l_{\hat{b}}$ and $\hat{K}^r_{\hat{b}}$ are those of the true kernels, $g$ is the inverse spectral power distribution of natural images, and the summation is over all frequencies $\xi$. We would expect the loss to be minimized when $b = \hat{b}$. The first term measures the inconsistency between the hypothesized blur kernels and the true kernels, and is indeed minimized when $b = \hat{b}$. However, the second term depends on the noise variance $\sigma^2$ and decreases as $b$ decreases. This is because, for a normalized blur kernel ($\sum_x k_b(x) = 1$), as the defocus kernel size decreases, its power spectrum increases. This suggests that white Gaussian noise in input images results in a bias towards smaller blur kernels. To account for this bias, we subtract an approximation of the second term, which we call the bias correction term, from the optimization loss:

$B(b) = \sigma^2 \sum_{\xi} \frac{g}{\left|K^l_b\right|^2 + \left|K^r_b\right|^2 + g}. \qquad$ (5)
We ignore the terms containing the ground truth kernels $\hat{K}^l_{\hat{b}}, \hat{K}^r_{\hat{b}}$, as they are significant only when $\hat{b}$ is itself small, i.e., the bias favors the true kernels in that case. In an MPI with multiple layers associated with defocus sizes $b_1, \ldots, b_n$, we subtract per-layer constants $B(b_i)$ computed using Eq. (5).
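The direction of the bias can be illustrated numerically. The sketch below assumes 1D normalized box kernels as stand-ins for the calibrated DP kernels, a constant inverse spectral power `g`, and one kernel shared by both views; it only demonstrates that the noise-dependent term shrinks as the kernel shrinks.

```python
import numpy as np

# Illustrative noise-bias term: sigma^2 * sum_xi g / (2|K|^2 + g), using a
# normalized 1D box kernel of the given width as a stand-in for both DP kernels.
def bias_term(width, n=256, sigma2=1e-3, g=1.0):
    k = np.zeros(n)
    k[:width] = 1.0 / width                # normalized box kernel (sums to 1)
    K2 = np.abs(np.fft.fft(k)) ** 2        # kernel power spectrum
    return sigma2 * np.sum(g / (2.0 * K2 + g))

# Smaller kernels concentrate more spectral power, so the term shrinks;
# without correction, the optimization therefore prefers small defocus sizes.
```

This matches the argument above: for a kernel normalized to sum to one, shrinking its support raises its power spectrum, which lowers the noise-dependent term.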
We note that we use a Gaussian noise model to make analysis tractable, but captured images have mixed Poisson-Gaussian noise [31]. In practice, we found it beneficial to additionally denoise the input images using burst denoising [32]. However, there is residual noise even after denoising, and we show in Sec. 5.1 that our bias correction term still improves performance. An interesting future research direction is using a more accurate noise model to derive a better bias estimate and remove the need for any denoising.
4.3 MPI Optimization
We seek to recover an MPI such that defocused images rendered from it using the calibrated blur kernels are close to the input images. But minimizing only a reconstruction loss is insufficient: this task is ill-posed, as there exists an infinite family of MPIs that all exactly reproduce the input images. As is common in defocus deblurring [46], we regularize our optimization:
$\min_{\{c_i, \alpha_i\}} \; \mathcal{L}_{\mathrm{data}} + \lambda_{\mathrm{aux}} \mathcal{L}_{\mathrm{aux}} + \lambda_{s} \mathcal{L}_{\mathrm{smooth}} + \lambda_{\alpha} \mathcal{L}_{\alpha} + \lambda_{H} \mathcal{L}_{H}, \qquad$ (6)

where $\mathcal{L}_{\mathrm{data}}$ is a bias-corrected data term that encourages rendered images to resemble input images, $\mathcal{L}_{\mathrm{aux}}$ is an auxiliary data term applied to each MPI layer, the remaining $\mathcal{L}$ are regularization terms, and the $\lambda$ are scalar weights. We discuss all terms below.
Bias-corrected data loss. We consider the Charbonnier [11] loss function $\rho(x) = \sqrt{x^2 + \gamma^2}$, and define a bias-corrected version as $\bar{\rho}(x, \beta) = \sqrt{\max(x^2 - \beta, 0) + \gamma^2}$, where we choose the scale parameter $\gamma$ as in Barron [6]. We use this loss function to form a data loss penalizing the difference between left and right input and rendered images as:

$\mathcal{L}_{\mathrm{data}} = \sum_{v \in \{l, r\}} \sum_{p} \bar{\rho}\left( I^v(p) - \hat{I}^v(p), \; \bar{B}^v(p) \right), \qquad$ (7)

$\bar{B}^v = \sum_{i=1}^{n} \left( \prod_{j=1}^{i-1} \left(1 - k^v_{b_j} * \alpha_j\right) \right) \left( k^v_{b_i} * \left(\alpha_i \, B(b_i)\right) \right). \qquad$ (8)
We compute the total bias correction $\bar{B}^v$ as the sum of the bias correction terms of all layers, each weighted by the corresponding defocused transmittance. Eq. (8) is equivalent to Eq. (2) where we replace each MPI layer’s intensity channel $c_i$ with a constant bias correction value $B(b_i)$. To compute $B(b_i)$ from Eq. (5), we empirically set the variance $\sigma^2$, and use a constant inverse spectral power distribution $g$, following previous work [79].
Auxiliary data loss. In most real-world scenes, a pixel’s scene content should be on a single layer. However, because the compositing operator of Eq. (2) forms a weighted sum of all layers, the data loss can be small even when scene content is smeared across multiple layers. To discourage this, we introduce a per-layer auxiliary data loss on each layer’s intensity, weighted by the layer’s blurred transmittance:

$\mathcal{L}_{\mathrm{aux}} = \sum_{v \in \{l, r\}} \sum_{i=1}^{n} \sum_{p} \left( t^v_i \odot \left(k^v_{b_i} * \alpha_i\right) \right) \odot \rho\left( k^v_{b_i} * c_i - \hat{I}^v \right), \quad t^v_i = \prod_{j=1}^{i-1} \left(1 - k^v_{b_j} * \alpha_j\right), \qquad$ (9)
where $\odot$ denotes element-wise multiplication. This auxiliary loss resembles the data synthesis loss of Eq. (7), except that it is applied to each MPI layer separately.
Intensity smoothness. Our first regularization term encourages smoothness for the all-in-focus image and the MPI intensity channels. For an image $x$ with corresponding edge map $e$, we define an edge-aware smoothness based on the total variation $\mathrm{TV}$, similar to Tucker and Snavely [80]:

$S(x, e) = \sum_{p} e(p) \, \rho\left( \mathrm{TV}(x)(p) \right), \qquad$ (10)
where $\rho$ is the Charbonnier loss. We refer to the supplement for details on $\mathrm{TV}$ and $e$. Our smoothness prior on the all-in-focus image $I$ and MPI intensity channels $c_i$ is:

$\mathcal{L}_{\mathrm{smooth}} = S(I, e) + \sum_{i=1}^{n} S(c_i, e). \qquad$ (11)
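An illustrative version of edge-aware smoothness, assuming forward differences for the total variation and an exponential edge weighting; the supplement's exact definitions of TV and the edge map may differ.

```python
import numpy as np

def charbonnier(x, gamma=1e-3):
    """Charbonnier penalty: a smooth, robust approximation of |x|."""
    return np.sqrt(x ** 2 + gamma ** 2)

def edge_aware_smoothness(x, guide, tau=10.0):
    """Penalize gradients of x, down-weighted where the guide image has edges."""
    dx = np.diff(x, axis=1)
    dy = np.diff(x, axis=0)
    ex = np.exp(-tau * np.abs(np.diff(guide, axis=1)))  # small weight at edges
    ey = np.exp(-tau * np.abs(np.diff(guide, axis=0)))
    return np.sum(ex * charbonnier(dx)) + np.sum(ey * charbonnier(dy))
```

A gradient in `x` that coincides with an edge in the guide image is penalized far less than the same gradient placed in a flat region of the guide, which is the behavior the prior relies on.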
Alpha and transmittance smoothness. We use an additional smoothness regularizer on all alpha channels and transmittances (sharpened by computing their square root), by encouraging edge-aware smoothness according to the total variation of the all-in-focus image:

$\mathcal{L}_{\alpha} = \sum_{i=1}^{n} \left[ S\left(\sqrt{\alpha_i}, e\right) + S\left(\sqrt{t_i}, e\right) \right]. \qquad$ (12)
Alpha and transmittance entropy. The last regularizer is a collision entropy penalty on alpha channels and transmittances. Collision entropy, defined for a normalized vector $x$ as $H_2(x) = -\log \sum_i x_i^2$, is a special case of Rényi entropy [68], and we empirically found it to be better than Shannon entropy for our problem. Minimizing collision entropy encourages sparsity: $H_2(x)$ is minimum when all but one elements of $x$ are $0$, which in our case encourages scene content to concentrate on a single MPI layer, rather than spread across multiple layers. Our entropy loss is:

$\mathcal{L}_{H} = \frac{1}{|P|} \sum_{p \in P} \left[ H_2\left(\bar{\alpha}(p)\right) + H_2\left(\bar{t}(p)\right) \right], \qquad$ (13)

where $\bar{\alpha}(p)$ and $\bar{t}(p)$ are the normalized square-rooted alpha and transmittance values of pixel $p$ across layers, and $P$ is the set of all pixels.
We extract the alpha channels and transmittances of each pixel from all MPI layers, compute their square root for sharpening, compute a per-pixel entropy, and average these entropies across all pixels. When computing entropy on alpha channels, we skip the farthest MPI layer, because we assume that all scene content ends at the farthest layer, and thus force this layer to be opaque ($\alpha_n = 1$).
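Collision entropy itself is easy to state in code; a minimal sketch for a nonnegative weight vector (normalized internally):

```python
import math

# Collision entropy: for a nonnegative vector normalized to sum to 1,
# H2(p) = -log(sum_i p_i^2). It is 0 for a one-hot vector and maximal,
# log(n), for a uniform one, so minimizing it concentrates mass on one layer.
def collision_entropy(weights):
    total = sum(weights)
    p = [w / total for w in weights]
    return -math.log(sum(pi * pi for pi in p))
```

This makes the sparsity-inducing behavior concrete: a one-hot distribution across layers achieves the minimum, which is exactly the "scene content on a single layer" configuration the regularizer rewards.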
5 Experiments
We capture a new dataset, and use it to perform qualitative and quantitative comparisons with other state-of-the-art defocus deblurring and defocus map estimation methods. The project website [85] includes an interactive HTML viewer [8] to facilitate comparisons across our full dataset.
Data collection. Even though DP sensors are common, to the best of our knowledge, only two camera manufacturers provide an API to read DP images—Google and Canon. However, Canon’s proprietary software applies an unknown scene-dependent transform to DP data. Unlike supervised learning-based methods [1] that can learn to account for this transform, our loss function requires raw sensor data. Hence, we collect data using the Google Pixel 4 smartphone, which allows access to the raw DP data [16].
The Pixel 4 captures DP data only in the green channel. To compute ground truth, we capture a focus stack with slices sampled uniformly in diopter space, where the closest focus distance corresponds to the distance we calibrate for, and the farthest to infinity. Following prior work [64], we use the commercial Helicon Focus software [35] to process the stacks and generate ground truth sharp images and defocus maps, and we manually correct holes in the generated defocus maps. Still, there are image regions that are difficult to manually inpaint, e.g., near occlusion boundaries or curved surfaces. We ignore such regions when computing quantitative metrics. We capture scenes both indoors and outdoors. Similar to Garg et al. [24], we centrally crop the DP images. We refer to the supplement for more details. Our dataset is available at the project website [85].
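Sampling uniformly in diopter space can be sketched as follows; `near_m` is a placeholder closest focus distance, not the calibrated value.

```python
# Focus distances sampled uniformly in diopter (inverse-meter) space.
# near_m is a hypothetical closest focus distance (meters); the farthest
# slice sits at infinity (0 diopters).
def focus_stack_distances(near_m, n_slices):
    near_diopter = 1.0 / near_m
    step = near_diopter / (n_slices - 1)
    diopters = [near_diopter - i * step for i in range(n_slices)]
    return [1.0 / d if d > 0 else float("inf") for d in diopters]

# Slices bunch up near the camera and spread out toward infinity, matching
# how defocus blur changes linearly with inverse depth.
```

Uniform diopter spacing is the natural choice here because, per Sec. 3, defocus blur is a linear function of inverse depth.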
5.1 Results
We evaluate our method on both defocus deblurring and depth-from-defocus tasks. We use the same number $n$ of MPI layers for all scenes in our dataset. We manually determine the kernel sizes of the front and back layers, and evenly distribute layers in diopter space. Each optimization runs for 10,000 iterations with Adam [41], and takes 2 hours on an Nvidia Titan RTX GPU. We gradually decrease the global learning rate from 0.3 to 0.1 with exponential decay. Our JAX [9] implementation is available at the project website [85].
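A hypothetical sketch of the described schedule, exponentially decaying the learning rate from 0.3 to 0.1 over 10,000 iterations (the exact decay parameterization used in the released code may differ):

```python
# Exponential learning-rate decay from lr_init to lr_final over n_steps,
# matching the 0.3 -> 0.1 schedule described in the text.
def learning_rate(step, lr_init=0.3, lr_final=0.1, n_steps=10_000):
    decay = (lr_final / lr_init) ** (step / n_steps)
    return lr_init * decay
```

At step 0 this gives 0.3, at step 10,000 it gives 0.1, and in between it decays geometrically rather than linearly.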
We compare to state-of-the-art methods for defocus deblurring (DPDNet [1], Wiener deconvolution [79, 88]) and defocus map estimation (DP stereo matching [82], supervised learning from DP views [24], DP defocus estimation based on kernel symmetry [64], Wiener deconvolution [79, 88], DMENet [45]). For methods that take a single image as input, we use the average of the left and right DP images. We also provide both the original and vignetting-corrected DP images as inputs, and report the best result. We show quantitative results in Tab. 1 and qualitative results in Figs. 6 and 7. For the defocus map, we use the affine-invariant metrics from Garg et al. [24]. Our method achieves the best quantitative results on both tasks.
Defocus deblurring results. Despite the large amount of blur in the input DP images, our method produces deblurred results with high-frequency details that are close to the ground truth (Fig. 6). DPDNet makes large errors as it is trained on Canon data and does not generalize. We improve the accuracy of DPDNet by providing vignetting-corrected images as input, but its accuracy is still lower than ours.
Defocus map estimation results. Our method produces defocus maps that are closest to the ground truth (Fig. 7), especially on textureless regions, such as the toy and clock in the first scene. Similar to [64], depth accuracy near edges can be improved by guided filtering [34] as shown in Fig. 14(d).
Ablation studies. We investigate the effect of each loss function term by removing them one at a time. Quantitative results are in Tab. 2, and qualitative comparisons in Fig. 8.
Our full pipeline has the best overall performance in recovering all-in-focus images and defocus maps. $\mathcal{L}_{\mathrm{smooth}}$ and $\mathcal{L}_{\alpha}$ strongly affect the smoothness of all-in-focus images and defocus maps, respectively. Without $\mathcal{L}_{\mathrm{aux}}$ or $\mathcal{L}_{H}$, even though recovered all-in-focus images are reasonable, scene content is smeared across multiple MPI layers, leading to incorrect defocus maps. Finally, without the bias correction term $B$, defocus maps are biased towards smaller blur radii, especially in textureless areas where noise is more apparent, e.g., the white clock area.
Results on Data from Abuolaim and Brown [1]. Even though Abuolaim and Brown [1] train their model on data from a Canon camera, they also capture Pixel 4 data for qualitative tests. We run our method on their Pixel 4 data, using the calibration from our device, and show that our recovered all-in-focus image has fewer artifacts (Fig. 9). This demonstrates that our method generalizes well across devices of the same model, even without recalibration.
6 Discussion and Conclusion
Table 1. Quantitative comparison on our dataset: all-in-focus image quality (higher PSNR/SSIM and lower MAE are better) and affine-invariant defocus map metrics from Garg et al. [24] (lower is better).

Method                      PSNR↑    SSIM↑   MAE↓    AIWE(1)↓  AIWE(2)↓  1−|ρs|↓
Wiener Deconv. [88]         25.806   0.704   0.032   0.156     0.197     0.665
DPDNet [1]                  25.591   0.777   0.034   -         -         -
DMENet [45]                 -        -       -       0.144     0.183     0.586
Punnappurath et al. [64]    -        -       -       0.124     0.161     0.444
Garg et al. [24]            -        -       -       0.079     0.102     0.208
Wadhwa et al. [82]          -        -       -       0.141     0.177     0.540
Ours                        26.692   0.804   0.027   0.047     0.076     0.178
Ours w/ guided filtering    26.692   0.804   0.027   0.059     0.083     0.193
We presented a method that optimizes an MPI scene representation to jointly recover a defocus map and all-in-focus image from a single dual-pixel capture. We showed that image noise introduces a bias in the optimization that, under suitable assumptions, can be quantified and corrected for. We also introduced additional priors to regularize the optimization, and showed their effectiveness via ablation studies. Our method improves upon past work on both defocus map estimation and blur removal, when evaluated on a new dataset we captured with a consumer smartphone camera.
Limitations and future directions. We discuss some limitations of our method, which suggest directions for future research. First, our method does not require a large dataset with ground truth to train on, but still relies on a one-time blur kernel calibration procedure. It would be interesting to explore blind deconvolution techniques [20, 48] that can simultaneously recover the all-in-focus image, defocus map, and unknown blur kernels, thus removing the need for kernel calibration. The development of parametric blur kernel models that can accurately reproduce the features we observed (i.e., spatial variation, lack of symmetry, lack of circularity) can facilitate this research direction. Second, the MPI representation discretizes the scene into a set of fronto-parallel depth layers. This can potentially result in discretization artifacts in scenes with continuous depth variation. In practice, we did not find this to be an issue, thanks to the use of the soft-blending operation to synthesize the all-in-focus image and defocus map. Nevertheless, it could be useful to replace the MPI representation with a continuous one, e.g., neural radiance fields [58], to help better model continuously-varying depth. Third, reconstructing an accurate all-in-focus image becomes more difficult as defocus blur increases (e.g., very distant scenes at non-infinity focus) and more high-frequency content is missing from the input image. This is a fundamental limitation shared among all deconvolution techniques. Using powerful data-driven priors to hallucinate the missing high-frequency content (e.g., deep-learning-based deconvolution techniques) can help alleviate this limitation. Fourth, the high computational complexity of our technique makes it impractical for real-time operation, especially on resource-constrained devices such as smartphones. Therefore, it is worth exploring optimized implementations.
Table 2. Ablation study: removing each loss term one at a time (same metrics as in Tab. 1).

Method                PSNR↑    SSIM↑   MAE↓    AIWE(1)↓  AIWE(2)↓  1−|ρs|↓
Full                  26.692   0.804   0.027   0.047     0.076     0.178
No L_smooth           14.882   0.158   0.136   0.047     0.078     0.185
No L_alpha            24.748   0.726   0.037   0.161     0.206     0.795
No L_aux              27.154   0.819   0.026   0.057     0.085     0.190
No L_H                26.211   0.768   0.030   0.148     0.190     0.610
No bias correction    26.265   0.790   0.028   0.063     0.092     0.214
Acknowledgments. We thank David Salesin and Samuel Hasinoff for helpful feedback. S.X. and I.G. were supported by NSF award 1730147 and a Sloan Research Fellowship.
References
 [1] Abdullah Abuolaim and Michael S. Brown. Defocus deblurring using dual-pixel data. European Conference on Computer Vision, 2020.
 [2] Abdullah Abuolaim, Abhijith Punnappurath, and Michael S. Brown. Revisiting autofocus for smartphone cameras. European Conference on Computer Vision, 2018.
 [3] Sameer Ansari, Neal Wadhwa, Rahul Garg, and Jiawen Chen. Wireless software synchronization of multiple distributed cameras. IEEE International Conference on Computational Photography, 2019.
 [4] Amir Atapour-Abarghouei and Toby P. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
 [5] Ruzena Bajcsy and Lawrence Lieberman. Texture gradient as a depth cue. Computer Graphics and Image Processing, 1976.
 [6] Jonathan T. Barron. A general and adaptive robust loss function. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
 [7] Jonathan T. Barron and Jitendra Malik. Shape, albedo, and illumination from a single image of an unknown object. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2012.
 [8] Benedikt Bitterli, Wenzel Jakob, Jan Novák, and Wojciech Jarosz. Reversible jump metropolis light transport using inverse mappings. ACM Transactions on Graphics, 2017.
 [9] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
 [10] Michael Brady and Alan Yuille. An extremum principle for shape from contour. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.
 [11] Pierre Charbonnier, Laure Blanc-Féraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. IEEE International Conference on Image Processing, 1994.
 [12] Ching-Hui Chen, Hui Zhou, and Timo Ahonen. Blur-aware disparity estimation from defocus stereo images. IEEE/CVF International Conference on Computer Vision, 2015.
 [13] Sunghwan Choi, Dongbo Min, Bumsub Ham, Youngjung Kim, Changjae Oh, and Kwanghoon Sohn. Depth analogy: Data-driven approach for single image depth estimation using gradient samples. IEEE Transactions on Image Processing, 2015.
 [14] Laurent D’Andrès, Jordi Salvador, Axel Kochale, and Sabine Süsstrunk. Nonparametric blur map regression for depth of field extension. IEEE Transactions on Image Processing, 2016.
 [15] Edward R. Dowski and W. Thomas Cathey. Extended depth of field through wavefront coding. Applied optics, 1995.
 [16] Dual pixel capture app. https://github.com/google-research/google-research/tree/master/dual_pixels.
 [17] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems, 2014.
 [18] Jose M. Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, and Javier Civera. CAM-Convs: camera-aware multi-scale convolutions for single-view depth. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
 [19] Paolo Favaro. Recovering thin structures via nonlocal-means regularization with application to depth from defocus. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2010.
 [20] Rob Fergus, Barun Singh, Aaron Hertzmann, Sam T. Roweis, and William T. Freeman. Removing camera shake from a single photograph. ACM Transactions on Graphics, 2006.
 [21] D.A. Fish, A.M. Brinicombe, E.R. Pike, and J.G. Walker. Blind deconvolution by means of the Richardson–Lucy algorithm. Journal of the Optical Society of America A, 1995.
 [22] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
 [23] Ravi Garg, Vijay Kumar B.G., Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. European Conference on Computer Vision, 2016.
 [24] Rahul Garg, Neal Wadhwa, Sameer Ansari, and Jonathan T. Barron. Learning single camera depth estimation using dualpixels. IEEE/CVF International Conference on Computer Vision, 2019.
 [25] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with leftright consistency. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
 [26] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Digging into selfsupervised monocular depth estimation. IEEE/CVF International Conference on Computer Vision, 2019.
 [27] Paul Grossmann. Depth from focus. Pattern Recognition Letters, 1987.
 [28] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks. European Conference on Computer Vision, 2018.
 [29] Christian Häne, L’ubor Ladický, and Marc Pollefeys. Direction matters: Depth estimation with a surface normal classifier. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.
 [30] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
 [31] Samuel W. Hasinoff, Frédo Durand, and William T. Freeman. Noise-optimal capture for high dynamic range photography. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2010.
 [32] Samuel W. Hasinoff, Dillon Sharlet, Ryan Geiss, Andrew Adams, Jonathan T. Barron, Florian Kainz, Jiawen Chen, and Marc Levoy. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics, 2016.
 [33] Caner Hazırbaş, Sebastian Georg Soyer, Maximilian Christian Staab, Laura Leal-Taixé, and Daniel Cremers. Deep depth from focus. Asian Conference on Computer Vision, 2018.
 [34] Kaiming He, Jian Sun, and Xiaoou Tang. Guided image filtering. European Conference on Computer Vision, 2010.
 [35] Helicon focus. https://www.heliconsoft.com/.
 [36] Charles Herrmann, Richard Strong Bowen, Neal Wadhwa, Rahul Garg, Qiurui He, Jonathan T. Barron, and Ramin Zabih. Learning to autofocus. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
 [37] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Automatic photo pop-up. ACM Transactions on Graphics, 2005.
 [38] Berthold K.P. Horn. Shape from shading: A method for obtaining the shape of a smooth opaque object from one view. Technical report, Massachusetts Institute of Technology, 1970.
 [39] Huaizu Jiang, Erik G. Learned-Miller, Gustav Larsson, Michael Maire, and Greg Shakhnarovich. Self-supervised depth learning for urban scene understanding. European Conference on Computer Vision, 2018.
 [40] Ali Karaali and Claudio Rosito Jung. Edge-based defocus blur estimation with adaptive scale selection. IEEE Transactions on Image Processing, 2017.
 [41] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
 [42] Janusz Konrad, Meng Wang, Prakash Ishwar, Chen Wu, and Debargha Mukherjee. Learning-based, automatic 2D-to-3D image and video conversion. IEEE Transactions on Image Processing, 2013.
 [43] Dilip Krishnan and Rob Fergus. Fast image deconvolution using hyper-Laplacian priors. Advances in Neural Information Processing Systems, 2009.
 [44] Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014.
 [45] Junyong Lee, Sungkil Lee, Sunghyun Cho, and Seungyong Lee. Deep defocus map estimation using domain adaptation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019.
 [46] Anat Levin, Rob Fergus, Frédo Durand, and William T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics, 2007.
 [47] Anat Levin, Samuel W. Hasinoff, Paul Green, Frédo Durand, and William T. Freeman. 4D frequency analysis of computational cameras for depth of field extension. ACM Transactions on Graphics, 2009.
 [48] Anat Levin, Yair Weiss, Frédo Durand, and William T. Freeman. Understanding blind deconvolution algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
 [49] Feng Li, Jian Sun, Jue Wang, and Jingyi Yu. Dual-focus stereo imaging. Journal of Electronic Imaging, 2010.
 [50] Jun Li, Reinhard Klein, and Angela Yao. A two-streamed network for estimating fine-scaled depth maps from single RGB images. IEEE/CVF International Conference on Computer Vision, 2017.
 [51] Xiu Li, Hongwei Qin, Yangang Wang, Yongbing Zhang, and Qionghai Dai. DEPT: depth estimation by parameter transfer for single still images. Asian Conference on Computer Vision, 2014.
 [52] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
 [53] Patric Ljung, Jens Krüger, Eduard Groller, Markus Hadwiger, Charles D. Hansen, and Anders Ynnerman. State of the art in transfer functions for direct volume rendering. Computer Graphics Forum, 2016.
 [54] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
 [55] Fahim Mannan and Michael S. Langer. Blur calibration for depth from defocus. Conference on Computer and Robot Vision, 2016.
 [56] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, 2018.
 [57] Tomer Michaeli and Michal Irani. Blind deblurring using internal patch recurrence. European Conference on Computer Vision, 2014.
 [58] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. European Conference on Computer Vision, 2020.
 [59] Hajime Nagahara, Sujit Kuthirummal, Changyin Zhou, and Shree K. Nayar. Flexible Depth of Field Photography. European Conference on Computer Vision, 2008.
 [60] Jogendra Nath Kundu, Phani Krishna Uppala, Anuj Pahuja, and R. Venkatesh Babu. AdaDepth: Unsupervised content congruent adaptation for depth estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
 [61] Ren Ng, Marc Levoy, Mathieu Brédif, Gene Duval, Mark Horowitz, and Pat Hanrahan. Light field photography with a handheld plenoptic camera. Technical report, Stanford University, 2005.
 [62] Jinsun Park, Yu-Wing Tai, Donghyeon Cho, and In So Kweon. A unified approach of multi-scale deep and hand-crafted features for defocus estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
 [63] Alex Paul Pentland. A new sense for depth of field. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1987.
 [64] Abhijith Punnappurath, Abdullah Abuolaim, Mahmoud Afifi, and Michael S. Brown. Modeling defocus-disparity in dual-pixel sensors. IEEE International Conference on Computational Photography, 2020.
 [65] Abhijith Punnappurath and Michael S. Brown. Reflection removal using a dual-pixel sensor. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
 [66] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
 [67] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen Koltun. Dense monocular depth estimation in complex dynamic scenes. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
 [68] Alfréd Rényi. On measures of entropy and information. Berkeley Symposium on Mathematical Statistics and Probability, 1961.
 [69] Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
 [70] Ashutosh Saxena, Sung H. Chung, and Andrew Y. Ng. Learning depth from single monocular images. Advances in Neural Information Processing Systems, 2006.
 [71] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 2002.
 [72] Jianping Shi, Xin Tao, Li Xu, and Jiaya Jia. Break ames room illusion: depth from general single images. ACM Transactions on Graphics, 2015.
 [73] Jianping Shi, Li Xu, and Jiaya Jia. Just noticeable defocus blur detection and estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.
 [74] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGB-D images. European Conference on Computer Vision, 2012.
 [75] Pratul P. Srinivasan, Rahul Garg, Neal Wadhwa, Ren Ng, and Jonathan T. Barron. Aperture supervision for monocular depth estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
 [76] Supasorn Suwajanakorn, Carlos Hernandez, and Steven M. Seitz. Depth from focus with your mobile phone. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.
 [77] Rick Szeliski and Polina Golland. Stereo matching with transparency and matting. International Journal of Computer Vision, 1999.
 [78] Huixuan Tang, Scott Cohen, Brian Price, Stephen Schiller, and Kiriakos N. Kutulakos. Depth from defocus in the wild. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
 [79] Huixuan Tang and Kiriakos N. Kutulakos. Utilizing optical aberrations for extended-depth-of-field panoramas. Asian Conference on Computer Vision, 2012.
 [80] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
 [81] Ashok Veeraraghavan, Ramesh Raskar, Amit Agrawal, Ankit Mohan, and Jack Tumblin. Dappled photography: Mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Transactions on Graphics, 2007.
 [82] Neal Wadhwa, Rahul Garg, David E. Jacobs, Bryan E. Feldman, Nori Kanazawa, Robert Carroll, Yair Movshovitz-Attias, Jonathan T. Barron, Yael Pritch, and Marc Levoy. Synthetic depth-of-field with a single-camera mobile phone. ACM Transactions on Graphics, 2018.
 [83] Yilun Wang, Junfeng Yang, Wotao Yin, and Yin Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 2008.
 [84] Masahiro Watanabe and Shree K. Nayar. Rational filters for passive depth from defocus. International Journal of Computer Vision, 1998.
 [85] Shumian Xin, Neal Wadhwa, Tianfan Xue, Jonathan T. Barron, Pratul P. Srinivasan, Jianwen Chen, Ioannis Gkioulekas, and Rahul Garg. Project website, 2021. https://imaging.cs.cmu.edu/dual_pixels.
 [86] Jiawei Zhang, Jinshan Pan, Wei-Sheng Lai, Rynson W.H. Lau, and Ming-Hsuan Yang. Learning fully convolutional networks for iterative non-blind deconvolution. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
 [87] Yinda Zhang, Neal Wadhwa, Sergio Orts-Escolano, Christian Häne, Sean Fanello, and Rahul Garg. Du2Net: Learning depth estimation from dual-cameras and dual-pixels. European Conference on Computer Vision, 2020.
 [88] Changyin Zhou, Stephen Lin, and Shree K. Nayar. Coded aperture pairs for depth from defocus. IEEE/CVF International Conference on Computer Vision, 2009.
 [89] Changyin Zhou and Shree K. Nayar. What are good apertures for defocus deblurring? IEEE International Conference on Computational Photography, 2009.
 [90] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
 [91] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics, 2018.
 [92] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. European Conference on Computer Vision, 2018.
Appendix A Introduction
In this supplementary material, we cover the following topics:

In Sec. B, we describe blur kernel calibration in more detail, and explore how blur kernels change with respect to scene depth and focus distance.

In Sec. C, we provide more technical details about our method. More specifically, we explain how we render defocus maps from the multiplane image (MPI) representation, provide the derivation of the bias correction term, and define the total variation function and the edge map used in the regularization terms.
Appendix B Blur Kernel Calibration
We provide more information about our calibration procedure for the left and right blur kernels used as input to our method. We use a method similar to the one proposed by Mannan and Langer [55], and calibrate blur kernels for the left and right dual-pixel (DP) images independently (Fig. 10) for a specific focus distance. Specifically, we image a regular grid of circular discs on a monitor screen placed at a fixed distance from the camera. We apply global thresholding to binarize the captured image, perform connected component analysis to identify the individual discs and their centers, and generate and align the binary sharp image $B$ with the known calibration pattern by solving for a homography between the calibration target disc centers and the detected centers. To apply radiometric correction, we also capture all-white and all-black images displayed on the same screen, and generate the grayscale latent sharp image as $I = B \odot I_w + (1 - B) \odot I_b$, where $\odot$ represents pixelwise multiplication, and $I_w$ and $I_b$ are the captured all-white and all-black images. Once we have the aligned latent image and the captured image, we can solve for spatially-varying blur kernels using the optimization proposed by Mannan and Langer [55]. Specifically, we solve for a grid of kernels covering the central field of view.
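As a concrete illustration, the radiometric correction step admits a compact numpy sketch. The snippet below covers only the latent-image generation, under the assumption that the calibration pattern has already been binarized and homography-aligned; all names and toy values are ours, not part of any released calibration code.

```python
import numpy as np

def latent_sharp_image(pattern, img_white, img_black):
    """Radiometric correction: pixels covered by the (aligned, binary)
    calibration pattern take the all-white response, the remaining
    pixels take the all-black response (pixelwise multiplication)."""
    return pattern * img_white + (1.0 - pattern) * img_black

# Toy example: 4x4 pattern with a bright blob in the center, and flat
# stand-ins for the captured all-white / all-black frames.
pattern = np.zeros((4, 4))
pattern[1:3, 1:3] = 1.0
white = np.full((4, 4), 0.9)
black = np.full((4, 4), 0.1)
latent = latent_sharp_image(pattern, white, black)  # 0.9 inside, 0.1 outside
```

In practice the all-white and all-black captures vary spatially (vignetting, screen non-uniformity), which is exactly why they are captured rather than assumed constant.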
In addition to the blur kernels, we calibrate for the different vignetting in the left and right DP images. Specifically, for the same focus distance as above, we capture six images of a white sheet through a diffuser. We then average all left and right images individually to obtain the left and right vignetting patterns $v^l$ and $v^r$, respectively.
Next, we explore how DP blur kernels change with respect to scene depth and focus distance (Fig. 11). As observed by Tang and Kutulakos [79], we find that kernels behave differently on the opposite sides of the focus plane. Therefore we choose focus settings such that all scene contents are at or behind the focus plane for all of our experiments, including this kernel analysis. We observe that DP blur kernels are approximately resized versions of each other as the scene depth or focus distance changes, similar to the expected behavior for blur kernels in a regular image sensor.
Appendix C Additional Method Details
In this section, we provide more technical details about our method. We explain how we render defocus maps from the MPI representation in Sec. C.1, provide the derivation of the bias correction term in Sec. C.2, and finally define the total variation function and the edge map used in the regularization terms in Sec. C.3.
C.1 Defocus Map from MPI
We have shown in the main paper that an all-in-focus image $I$ can be composited from an MPI representation with intensity-alpha layers $\{(c_i, \alpha_i)\}_{i=1}^{N}$, ordered back to front, as:

$I = \sum_{i=1}^{N} c_i \alpha_i \prod_{j=i+1}^{N} (1 - \alpha_j).$  (14)

We can synthesize a continuous-valued defocus map $D$ in a similar way, as discussed by Tucker and Snavely [80], by replacing all pixel intensities $c_i$ in Eq. (14) with the defocus blur size $b_i$ of that layer:

$D = \sum_{i=1}^{N} b_i \alpha_i \prod_{j=i+1}^{N} (1 - \alpha_j).$  (15)
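Eqs. (14) and (15) share the same back-to-front "over" compositing; only the per-layer quantity changes. Below is a minimal numpy sketch of that shared operation; the layer ordering convention and all names are our assumptions for illustration.

```python
import numpy as np

def composite_over(values, alphas):
    """Back-to-front 'over' compositing: sum of per-layer values, each
    weighted by its alpha and by the transmittance (1 - alpha_j) of
    every layer in front of it.

    values : (N, H, W) per-layer quantity (colors for Eq. 14,
             per-layer blur sizes for Eq. 15)
    alphas : (N, H, W) opacities in [0, 1]; layer N-1 is frontmost
    """
    out = np.zeros_like(values[0])
    transmittance = np.ones_like(values[0])
    # Iterate front-to-back so transmittance accumulates the product
    # of (1 - alpha_j) over all layers in front of the current one.
    for v, a in zip(values[::-1], alphas[::-1]):
        out += transmittance * a * v
        transmittance *= (1.0 - a)
    return out

# Two layers: opaque back layer with blur size 3, half-transparent
# sharp front layer. Composited defocus map = 0.5*0 + 0.5*1*3 = 1.5.
alphas = np.stack([np.ones((2, 2)), 0.5 * np.ones((2, 2))])
blur_sizes = np.stack([np.full((2, 2), 3.0), np.zeros((2, 2))])
defocus_map = composite_over(blur_sizes, alphas)
```

Swapping per-layer colors for per-layer blur sizes turns the same routine from an all-in-focus renderer into a defocus-map renderer, which is the point of Eq. (15).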
C.2 Proof of Eq. (4) of the Main Paper
In this section, we provide a detailed derivation of the bias correction term. To be self-contained, we restate our assumed image formation model. Given an MPI representation, its corresponding DP images can be expressed as:

$I^{l,r} = \bar{I}^{l,r} + n^{l,r},$  (16)

where $\bar{I}^{l}$ and $\bar{I}^{r}$ are the latent noise-free left and right defocused images, and $n^{l,r}$ is additive white Gaussian noise with entries independent and identically distributed with distribution $\mathcal{N}(0, \sigma^2)$. Our goal is to optimize for an MPI with intensity-alpha layers $\{(c_i, \alpha_i)\}_{i=1}^{N}$ and defocus sizes $\{b_i\}_{i=1}^{N}$, such that the loss $\|I^{l} - \bar{I}^{l}\|^2 + \|I^{r} - \bar{I}^{r}\|^2$ is minimized. We show that, in the presence of image noise, minimizing the above loss biases the estimated defocus map towards smaller blur values. Specifically, we quantify this bias and then correct for it in our optimization.
For simplicity, we assume for now that all scene contents lie on a single fronto-parallel plane with ground-truth defocus size $d$, and that our scene representation is an MPI with a single opaque layer (i.e., $\alpha = 1$) with a defocus size hypothesis $b$. Under this assumption, the defocused image rendering equation (Eq. (2) of the main paper)

$\bar{I}^{l,r} = \sum_{i=1}^{N} \left(k^{l,r}_{b_i} * (c_i \alpha_i)\right) \prod_{j=i+1}^{N} \left(1 - k^{l,r}_{b_j} * \alpha_j\right)$  (17)

reduces to

$\bar{I}^{l,r} = k^{l,r}_{b} * c,$  (18)

where $c$ is the latent all-in-focus image and $k^{l,r}_{b}$ are the left and right blur kernels of size $b$. Similarly, Eq. (16) becomes:

$I^{l,r} = k^{l,r}_{d} * c + n^{l,r}.$  (19)

We can express the above equations in the frequency domain as follows:

$\hat{I}^{l,r} = \hat{k}^{l,r}_{d}\, \hat{c} + \hat{n}^{l,r},$  (20)

where $\hat{I}^{l,r}$, $\hat{k}^{l,r}_{d}$, $\hat{c}$, and $\hat{n}^{l,r}$ are the Fourier transforms of $I^{l,r}$, $k^{l,r}_{d}$, $c$, and $n^{l,r}$, respectively. Note that the entries of $\hat{n}^{l,r}$ are also independent and identically distributed with the same Gaussian distribution as the entries of $n^{l,r}$.
We can obtain a maximum a posteriori (MAP) estimate of $\hat{c}$ and $b$ by solving the following optimization problem [88]:

$\max_{\hat{c},\, b}\; P(\hat{c}, b \mid \hat{I}^{l}, \hat{I}^{r}) \propto P(\hat{I}^{l}, \hat{I}^{r} \mid \hat{c}, b)\, P(\hat{c}).$  (21)

According to Eq. (20), we have

$P(\hat{I}^{l}, \hat{I}^{r} \mid \hat{c}, b) \propto \exp\left(-\frac{\|\hat{I}^{l} - \hat{k}^{l}_{b}\hat{c}\|^2 + \|\hat{I}^{r} - \hat{k}^{r}_{b}\hat{c}\|^2}{2\sigma^2}\right).$  (22)

We also follow Zhou et al. [88] in assuming a prior for the latent all-in-focus image such that:

$P(\hat{c}) \propto \exp\left(-\sum_{f} \frac{|\hat{c}(f)|^2}{2\sigma_c^2(f)}\right),$  (23)

where we define $\sigma_c^2(f)$ such that

$\sigma_c^2(f) = |\hat{c}(f)|^2,$  (24)

and $f$ is the frequency. As $c$ is the unknown variable, we approximate Eq. (24) by averaging the power spectrum over a set of natural images $\mathcal{D}$:

$\sigma_c^2(f) = \sum_{c \in \mathcal{D}} P(c)\, |\hat{c}(f)|^2,$  (25)

where $P(c)$ represents the probability distribution of $c$ in the image domain.
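As a concrete illustration, the averaging in Eq. (25) can be sketched in a few lines of numpy, assuming a small set of equally likely example images stands in for the distribution P(c); names and toy data are ours.

```python
import numpy as np

def average_power_spectrum(images):
    """Approximate the prior spectrum sigma_c^2(f) of Eq. (25) by
    averaging the power spectrum |FFT|^2 over a set of example
    images, assumed here to be equally likely."""
    spectra = [np.abs(np.fft.fft2(im)) ** 2 for im in images]
    return np.mean(spectra, axis=0)

# Toy stand-ins for natural images; real usage would average over a
# large natural-image dataset instead.
rng = np.random.default_rng(0)
imgs = [rng.standard_normal((8, 8)) for _ in range(4)]
sigma_c2 = average_power_spectrum(imgs)
```

With real natural images, the averaged spectrum typically exhibits the familiar 1/f-like falloff, which is what makes it useful as a frequency-dependent prior weight.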
Maximizing the log-likelihood of Eq. (21) is equivalent to minimizing the following loss:

$E(\hat{c}, b) = \|\hat{I}^{l} - \hat{k}^{l}_{b}\hat{c}\|^2 + \|\hat{I}^{r} - \hat{k}^{r}_{b}\hat{c}\|^2 + \sum_{f} \frac{\sigma^2}{\sigma_c^2(f)}\, |\hat{c}(f)|^2.$  (26)

$b$ can be estimated as the minimizer of the above energy function. Then, given $b$, setting $\partial E / \partial \hat{c} = 0$ yields the following solution for $\hat{c}$, known as a generalized Wiener deconvolution with two observations:

$\hat{c}^{*} = \frac{\hat{k}^{l*}_{b}\hat{I}^{l} + \hat{k}^{r*}_{b}\hat{I}^{r}}{|\hat{k}^{l}_{b}|^2 + |\hat{k}^{r}_{b}|^2 + \sigma^2/\sigma_c^2},$  (27)

where $\hat{k}^{*}$ is the complex conjugate of $\hat{k}$, and $|\hat{k}|^2 = \hat{k}\hat{k}^{*}$.
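The two-observation Wiener estimate of Eq. (27) maps directly onto a few lines of numpy. The sketch below assumes periodic boundaries (plain FFTs), known kernels, and a scalar prior spectrum; all names are illustrative rather than taken from any released implementation.

```python
import numpy as np

def wiener_two_obs(I_l, I_r, k_l, k_r, sigma2, sigma_c2):
    """Generalized Wiener deconvolution with two observations (Eq. 27):
    fuse left/right DP images, blurred by known kernels, into a single
    latent image estimate in the frequency domain."""
    Kl = np.fft.fft2(k_l, s=I_l.shape)
    Kr = np.fft.fft2(k_r, s=I_r.shape)
    Il, Ir = np.fft.fft2(I_l), np.fft.fft2(I_r)
    num = np.conj(Kl) * Il + np.conj(Kr) * Ir
    den = np.abs(Kl) ** 2 + np.abs(Kr) ** 2 + sigma2 / sigma_c2
    return np.real(np.fft.ifft2(num / den))

# Noise-free sanity check with identity (delta) kernels: as sigma2 -> 0
# the estimate should approach the input image itself.
img = np.ones((8, 8))
delta = np.zeros((8, 8))
delta[0, 0] = 1.0
est = wiener_two_obs(img, img, delta, delta, sigma2=1e-8, sigma_c2=1.0)
```

One motivation for using both DP views in a single estimate is that the two kernel spectra rarely share zero crossings, so the joint denominator is better conditioned than either single-view deconvolution.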
We then evaluate the defocus size hypothesis $b$ by computing the expected minimization loss given the latent ground-truth defocus size $d$ and the noise level $\sigma$, that is,

$L(b \mid d, \sigma) = \mathbb{E}_{n}\left[\min_{\hat{c}} E(\hat{c}, b)\right]$  (28)
$\qquad\qquad\;\; = \mathbb{E}_{n}\left[E(\hat{c}^{*}, b)\right],$  (29)

where $\mathbb{E}_{n}$ is the expectation over the noise. Substituting $\hat{c}^{*}$ with Eq. (27) gives us:

$L(b \mid d, \sigma) = \mathbb{E}_{n}\left[\sum_{f}\left(|\hat{I}^{l}|^2 + |\hat{I}^{r}|^2 - \frac{|\hat{k}^{l*}_{b}\hat{I}^{l} + \hat{k}^{r*}_{b}\hat{I}^{r}|^2}{|\hat{k}^{l}_{b}|^2 + |\hat{k}^{r}_{b}|^2 + \sigma^2/\sigma_c^2}\right)\right].$  (30)

Then, substituting $\hat{I}^{l,r}$ with Eq. (20), we get:

$L(b \mid d, \sigma) = \mathbb{E}_{n}\left[\sum_{f}\left(|\hat{k}^{l}_{d}\hat{c} + \hat{n}^{l}|^2 + |\hat{k}^{r}_{d}\hat{c} + \hat{n}^{r}|^2 - \frac{|\hat{k}^{l*}_{b}(\hat{k}^{l}_{d}\hat{c} + \hat{n}^{l}) + \hat{k}^{r*}_{b}(\hat{k}^{r}_{d}\hat{c} + \hat{n}^{r})|^2}{|\hat{k}^{l}_{b}|^2 + |\hat{k}^{r}_{b}|^2 + \sigma^2/\sigma_c^2}\right)\right].$  (31)

We now define $K_b = |\hat{k}^{l}_{b}|^2 + |\hat{k}^{r}_{b}|^2 + \sigma^2/\sigma_c^2$ and $\hat{m} = \hat{k}^{l*}_{b}\hat{k}^{l}_{d} + \hat{k}^{r*}_{b}\hat{k}^{r}_{d}$. We can rearrange the above equation as:

$L(b \mid d, \sigma) = \mathbb{E}_{n}\left[\sum_{f}\left((|\hat{k}^{l}_{d}|^2 + |\hat{k}^{r}_{d}|^2)|\hat{c}|^2 + |\hat{n}^{l}|^2 + |\hat{n}^{r}|^2 + 2\,\mathrm{Re}\!\left(\hat{k}^{l}_{d}\hat{c}\,\hat{n}^{l*} + \hat{k}^{r}_{d}\hat{c}\,\hat{n}^{r*}\right) - \frac{|\hat{m}\hat{c} + \hat{k}^{l*}_{b}\hat{n}^{l} + \hat{k}^{r*}_{b}\hat{n}^{r}|^2}{K_b}\right)\right].$  (32)

Given that the entries of $\hat{n}^{l,r}$ are independent and identically distributed with distribution $\mathcal{N}(0, \sigma^2)$, we have $\mathbb{E}[\hat{n}^{l,r}] = 0$ and $\mathbb{E}[|\hat{n}^{l,r}|^2] = \sigma^2$, and we can simplify the above equation as:

$L(b \mid d, \sigma) = \sum_{f}\left((|\hat{k}^{l}_{d}|^2 + |\hat{k}^{r}_{d}|^2)|\hat{c}|^2 + \mathbb{E}[|\hat{n}^{l}|^2] + \mathbb{E}[|\hat{n}^{r}|^2] - \frac{|\hat{m}|^2|\hat{c}|^2 + |\hat{k}^{l}_{b}|^2\,\mathbb{E}[|\hat{n}^{l}|^2] + |\hat{k}^{r}_{b}|^2\,\mathbb{E}[|\hat{n}^{r}|^2]}{K_b}\right)$  (33)
$\qquad\qquad\;\; = \sum_{f}\left((|\hat{k}^{l}_{d}|^2 + |\hat{k}^{r}_{d}|^2)|\hat{c}|^2 + 2\sigma^2 - \frac{|\hat{m}|^2|\hat{c}|^2 + (|\hat{k}^{l}_{b}|^2 + |\hat{k}^{r}_{b}|^2)\sigma^2}{K_b}\right).$  (34)
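The expectation algebra in Eqs. (33) and (34) can be sanity-checked numerically. The sketch below evaluates the closed-form expected loss at a single frequency with real-valued toy quantities (our simplification, not the paper's full setup) and compares it against a Monte Carlo average of the realized minimization loss.

```python
import numpy as np

def expected_loss(kl_b, kr_b, kl_d, kr_d, c0, sigma2, sigma_c2):
    """Closed-form expected loss of Eqs. (33)-(34) at one frequency,
    specialized to real-valued scalars."""
    K = kl_b**2 + kr_b**2 + sigma2 / sigma_c2
    m = kl_b * kl_d + kr_b * kr_d  # conj(k_b) * k_d in the real case
    return ((kl_d**2 + kr_d**2) * c0**2 + 2.0 * sigma2
            - (m**2 * c0**2 + (kl_b**2 + kr_b**2) * sigma2) / K)

# Monte Carlo estimate of the same expectation over the noise.
rng = np.random.default_rng(0)
kl_b, kr_b, kl_d, kr_d = 0.6, 0.7, 0.8, 0.5
c0, sigma2, sigma_c2 = 2.0, 0.09, 4.0
K = kl_b**2 + kr_b**2 + sigma2 / sigma_c2
nl = rng.normal(0.0, np.sqrt(sigma2), 200_000)
nr = rng.normal(0.0, np.sqrt(sigma2), 200_000)
Il, Ir = kl_d * c0 + nl, kr_d * c0 + nr  # observations, Eq. (20)
# Realized min-over-c loss of Eq. (30), per noise draw.
mc = np.mean(Il**2 + Ir**2 - (kl_b * Il + kr_b * Ir)**2 / K)
cf = expected_loss(kl_b, kr_b, kl_d, kr_d, c0, sigma2, sigma_c2)
```

The two values agree to within sampling error, confirming that the noise cross terms vanish in expectation and each noise channel contributes its variance, as used in the simplification from Eq. (32) to Eq. (34).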
Recall that, in Eq. (25), we defined $\sigma_c^2(f)$ such that $\sigma_c^2(f) = |\hat{c}(f)|^2$. Then we can further simplify $L(b \mid d, \sigma)$ as: