Results 1 - 17 of 17
1.
Article in English | MEDLINE | ID: mdl-39288047

ABSTRACT

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited: they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV-v2, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context, and weather conditions, and enables benchmarking of models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1) Some nuisance factors have a much stronger negative effect on performance than others, depending on the vision task. 2) Current approaches to enhance robustness have only marginal effects, and can even reduce robustness. 3) We do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich test bed to study robustness and will help push forward research in this area. Our dataset is publicly available online at https://genintel.mpi-inf.mpg.de/ood-cv-v2.html.
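
To make the per-nuisance analysis concrete, here is a minimal evaluation-loop sketch in the spirit of the benchmark protocol; `load_split` and `accuracy` are hypothetical stand-ins, not the dataset's actual API.

```python
def robustness_report(model, load_split, accuracy,
                      nuisances=("pose", "shape", "texture", "context", "weather")):
    """Compare i.i.d. accuracy with accuracy under each nuisance shift.
    `load_split` and `accuracy` are assumed, user-supplied callables."""
    base = accuracy(model, load_split("iid"))             # in-distribution baseline
    report = {}
    for n in nuisances:
        acc = accuracy(model, load_split(n))              # OOD split for one factor
        report[n] = {"accuracy": acc, "gap": base - acc}  # larger gap = less robust
    return report
```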

2.
Int J Comput Vis; 132(4): 1148-1166, 2024.
Article in English | MEDLINE | ID: mdl-38549787

ABSTRACT

Portrait viewpoint and illumination editing is an important problem with several applications in VR/AR, movies, and photography. Comprehensive knowledge of geometry and illumination is critical for obtaining photorealistic results. Current methods cannot explicitly model the scene in 3D while handling both viewpoint and illumination editing from a single image. In this paper, we propose VoRF, a novel approach that can take even a single portrait image as input and relight human heads under novel illuminations that can be viewed from arbitrary viewpoints. VoRF represents a human head as a continuous volumetric field and learns a prior model of human heads using a coordinate-based MLP with individual latent spaces for identity and illumination. The prior model is learned in an auto-decoder manner over a diverse class of head shapes and appearances, allowing VoRF to generalize to novel test identities from a single input image. Additionally, VoRF has a reflectance MLP that uses the intermediate features of the prior model for rendering One-Light-at-A-Time (OLAT) images under novel views. We synthesize novel illuminations by combining these OLAT images with target environment maps. Qualitative and quantitative evaluations demonstrate the effectiveness of VoRF for relighting and novel view synthesis, even when applied to unseen subjects under uncontrolled illumination. This work is an extension of Rao et al. (VoRF: Volumetric Relightable Faces, 2022). We provide extensive evaluations and ablation studies of our model, and also present an application in which any face can be relit using textual input.
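
The OLAT-based relighting step described above is, at its core, a weighted sum of per-light images. A minimal sketch, assuming the target environment map has already been integrated into per-light RGB weights:

```python
import numpy as np

def relight_from_olat(olat_images, env_weights):
    """Image-based relighting: rendering is linear in lighting, so a novel
    illumination is a weighted sum of the OLAT basis images.

    olat_images: (L, H, W, 3), one image per light direction
    env_weights: (L, 3) RGB intensity of the target environment map,
                 pre-integrated over each light's solid angle (assumed given)
    """
    return np.einsum("lhwc,lc->hwc", olat_images, env_weights)

# Toy usage: 4 lights, 2x2 image
out = relight_from_olat(np.random.rand(4, 2, 2, 3), np.random.rand(4, 3))
```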

3.
Article in English | MEDLINE | ID: mdl-38376960

ABSTRACT

The reconstruction and novel view synthesis of dynamic scenes have recently gained increased attention. As reconstruction from large-scale multi-view data involves immense memory and computational requirements, recent benchmark datasets provide collections of single monocular views per timestamp, sampled from multiple (virtual) cameras. We refer to this form of input as monocularized data. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is often limited in training speed and in the angular range over which novel views can be generated. This paper addresses these limitations and proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) an efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) a static module representing the canonical scene as a fast hash-encoded neural radiance field. In addition to existing synthetic monocularized data, we systematically analyze performance on real-world inward-facing scenes using a newly recorded challenging dataset sampled from a synchronized large-scale multi-view rig. In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining higher visual accuracy for generated novel views. Our code and dataset are available online: https://github.com/MoritzKappel/MoNeRF.
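
A toy sketch of the decoupling idea behind the deformation module, with a plain MLP standing in for the hash-encoded canonical field; all layer sizes and names are illustrative, not the paper's architecture:

```python
import torch, torch.nn as nn

class DecoupledDeformation(nn.Module):
    """Heavy spatial branch, cheap per-frame temporal code, fused by a small
    head that outputs a 3D offset into the canonical space (illustrative)."""
    def __init__(self, n_frames, t_dim=8):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
        self.temporal = nn.Embedding(n_frames, t_dim)   # one latent per timestamp
        self.head = nn.Sequential(nn.Linear(64 + t_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x, t):
        h = torch.cat([self.spatial(x), self.temporal(t)], dim=-1)
        return x + self.head(h)                 # point warped into canonical space

canonical_field = nn.Sequential(                # stand-in for the hash-encoded field
    nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))   # -> (r, g, b, density)

deform = DecoupledDeformation(n_frames=100)
x = torch.rand(1024, 3)                         # sample points along rays
t = torch.randint(0, 100, (1024,))              # per-sample timestamps
rgb_sigma = canonical_field(deform(x, t))
```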

4.
Article in English | MEDLINE | ID: mdl-37585333

ABSTRACT

We propose a new method for learning a generalized animatable neural human representation from a sparse set of multi-view images of multiple persons. The learned representation can be used to synthesize novel view images of an arbitrary person and further animate them under the user's pose control. While most existing methods can either generalize to new persons or synthesize animations with user control, none of them can achieve both at the same time. We attribute this capability to the use of a 3D proxy for a shared multi-person human model, and further to the warping of the spaces of different poses into a shared canonical pose space, in which we learn a neural field and predict person- and pose-dependent deformations, as well as appearance, from features extracted from the input images. To cope with the large variations in body shapes, poses, and clothing deformations, we design our neural human model with disentangled geometry and appearance. Furthermore, we utilize image features both at the spatial point and on the surface points of the 3D proxy for predicting person- and pose-dependent properties. Experiments show that our method significantly outperforms the state of the art on both tasks.
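
One common way to realize the pose-space-to-canonical warping mentioned above is an inverse linear-blend-skinning transform. The following toy sketch illustrates that idea only and is not the paper's actual warping module:

```python
import numpy as np

def warp_to_canonical(x, skin_weights, bone_transforms):
    """Toy inverse LBS warp: map a posed-space point back into a shared
    canonical pose space. All inputs are illustrative.

    x: (3,) posed point; skin_weights: (B,) blend weights summing to 1;
    bone_transforms: (B, 4, 4) canonical-to-posed bone transforms.
    """
    blended = np.tensordot(skin_weights, bone_transforms, axes=1)  # (4, 4)
    x_h = np.append(x, 1.0)                                        # homogeneous
    return (np.linalg.inv(blended) @ x_h)[:3]

x_canonical = warp_to_canonical(np.array([0.1, 1.2, 0.3]),
                                np.array([0.7, 0.3]),
                                np.stack([np.eye(4), np.eye(4)]))
```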

5.
IEEE Trans Pattern Anal Mach Intell; 45(12): 15098-15119, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37624713

ABSTRACT

As information exists in various modalities in the real world, effective interaction and fusion among multimodal information plays a key role in the creation and perception of multimodal data in computer vision and deep learning research. Because of its power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Instead of providing explicit guidance for network training, multimodal guidance offers intuitive and flexible means for image synthesis and editing. On the other hand, this field also faces several challenges, including the alignment of multimodal features, the synthesis of high-resolution images, and faithful evaluation metrics. In this survey, we comprehensively contextualize recent advances in multimodal image synthesis and editing and formulate taxonomies according to data modalities and model types. We start with an introduction to the different guidance modalities in image synthesis and editing, and then describe multimodal image synthesis and editing approaches extensively according to their model types. After that, we describe benchmark datasets and evaluation metrics as well as corresponding experimental results. Finally, we provide insights into the current research challenges and possible directions for future research.

6.
IEEE Trans Pattern Anal Mach Intell; 45(4): 4009-4022, 2023 Apr.
Article in English | MEDLINE | ID: mdl-34191722

ABSTRACT

Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation step and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness. This work is an extended version of [1], where we provide more detailed explanations, comparisons, and results, as well as applications.
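
The multi-view weak supervision can be pictured as a reprojection loss over calibrated views, so no 3D ground truth is needed. A minimal sketch under assumed names and shapes, not the paper's training code:

```python
import numpy as np

def multiview_reprojection_loss(joints3d, cameras, joints2d):
    """Project predicted 3D joints into every calibrated view and penalize
    deviation from 2D detections (illustrative shapes).

    joints3d: (J, 3); cameras: list of (3, 4) projection matrices;
    joints2d: (V, J, 2) per-view 2D joint detections.
    """
    J = joints3d.shape[0]
    pts_h = np.hstack([joints3d, np.ones((J, 1))])      # homogeneous (J, 4)
    loss = 0.0
    for P, gt in zip(cameras, joints2d):
        proj = pts_h @ P.T                              # (J, 3)
        proj = proj[:, :2] / proj[:, 2:3]               # perspective divide
        loss += np.mean(np.sum((proj - gt) ** 2, axis=-1))
    return loss / len(cameras)
```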

7.
IEEE Trans Pattern Anal Mach Intell; 44(12): 8962-8974, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34727024

ABSTRACT

3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. Existing methods addressing it directly regress hand meshes via 2D convolutional neural networks, which leads to artifacts due to perspective distortions in the images. To address the limitations of the existing methods, we develop HandVoxNet++, i.e., a voxel-based deep network with 3D and graph convolutions trained in a fully supervised manner. The input to our network is a 3D voxelized depth map based on the truncated signed distance function (TSDF). HandVoxNet++ relies on two hand shape representations. The first is the 3D voxelized grid of hand shape, which does not preserve the mesh topology and is the most accurate representation. The second is the hand surface, which preserves the mesh topology. We combine the advantages of both representations by aligning the hand surface to the voxelized hand shape either with a new neural Graph-Convolutions-based Mesh Registration (GCN-MeshReg) or with a classical segment-wise Non-Rigid Gravitational Approach (NRGA++), which does not rely on training data. In extensive evaluations on three public benchmarks, i.e., SynHand5M, the depth-based HANDS19 challenge, and HO-3D, the proposed HandVoxNet++ achieves state-of-the-art performance. In this journal extension of our previous approach presented at CVPR 2020, we gain 41.09% and 13.7% higher shape alignment accuracy on the SynHand5M and HANDS19 datasets, respectively. Our method was ranked first on the HANDS19 challenge dataset (Task 1: Depth-Based 3D Hand Pose Estimation) at the moment of the submission of our results to the portal in August 2020.


Subject(s)
Algorithms; Neural Networks, Computer; Hand/diagnostic imaging
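
For readers unfamiliar with TSDF inputs, here is a minimal single-view voxelization sketch; the camera parameters and truncation distance are illustrative, and this is not HandVoxNet++'s preprocessing code:

```python
import numpy as np

def depth_to_tsdf(depth, fx, fy, cx, cy, grid, trunc=0.03):
    """For each voxel center, compare its camera-space depth with the
    observed depth at its projected pixel, then truncate.

    depth: (H, W) depth map in meters; grid: (N, 3) voxel centers in
    camera coordinates; returns (N,) TSDF values in [-1, 1].
    """
    z = np.maximum(grid[:, 2], 1e-9)            # guard against division by zero
    u = np.round(grid[:, 0] * fx / z + cx).astype(int)
    v = np.round(grid[:, 1] * fy / z + cy).astype(int)
    h, w = depth.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (grid[:, 2] > 0)
    tsdf = np.full(grid.shape[0], 1.0)
    sdf = depth[v[valid], u[valid]] - grid[valid, 2]  # + in front, - behind surface
    tsdf[valid] = np.clip(sdf / trunc, -1.0, 1.0)
    return tsdf
```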
8.
IEEE Trans Vis Comput Graph; 27(10): 4009-4022, 2021 Oct.
Article in English | MEDLINE | ID: mdl-32746256

ABSTRACT

Synthesizing realistic videos of humans using neural networks has been a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail, such as pose-dependent wrinkles in clothing. In this article, we propose a novel human video synthesis method that addresses these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.


Subject(s)
Computer Graphics; Image Processing, Computer-Assisted/methods; Neural Networks, Computer; Video Recording/methods; Artifacts; Deep Learning; Humans; Virtual Reality
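
A toy two-stage pipeline echoing the pose-to-texture-to-video structure described above; the architectures are placeholders, not the paper's CNNs:

```python
import torch, torch.nn as nn

pose_dim, tex_res = 72, 64                    # illustrative sizes
tex_net = nn.Sequential(nn.Linear(pose_dim, 3 * tex_res * tex_res))
render_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(16, 3, 3, padding=1))

pose = torch.rand(1, pose_dim)
texture = tex_net(pose).view(1, 3, tex_res, tex_res)  # stage 1: dynamic texture
frame = render_net(texture)                           # stage 2: final image
```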
9.
IEEE Trans Pattern Anal Mach Intell; 42(2): 357-370, 2020 Feb.
Article in English | MEDLINE | ID: mdl-30334783

ABSTRACT

In this work, we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance, and scene illumination. Due to this new way of combining CNN-based and model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real-world datasets feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation. This work is an extended version of [1], where we additionally present a stochastic vertex sampling technique for faster training of our networks; moreover, we propose and evaluate analysis-by-synthesis and shape-from-shading refinement approaches to achieve a high-fidelity reconstruction.


Subject(s)
Face/anatomy & histology; Face/diagnostic imaging; Imaging, Three-Dimensional/methods; Unsupervised Machine Learning; Deep Learning; Female; Humans; Male; Neural Networks, Computer
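
The analysis-by-synthesis principle behind the differentiable decoder can be demonstrated with a toy linear morphable model, where gradients from a reconstruction loss flow back to a semantic code vector. Everything below is illustrative, not the paper's model:

```python
import torch

V = 500                                     # vertices of a toy face mesh
mean = torch.randn(V * 3)
basis = torch.randn(V * 3, 32) * 0.1        # shape basis (like PCA components)
target = mean + basis @ torch.randn(32)     # synthetic "observation"

code = torch.zeros(32, requires_grad=True)  # semantic parameter vector
opt = torch.optim.Adam([code], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    recon = mean + basis @ code             # differentiable decoding step
    loss = torch.mean((recon - target) ** 2)
    loss.backward()                         # gradients reach the code vector
    opt.step()
```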
10.
IEEE Trans Vis Comput Graph; 25(5): 2093-2101, 2019 May.
Article in English | MEDLINE | ID: mdl-30794176

ABSTRACT

We propose the first real-time system for egocentric estimation of 3D human body pose in a wide range of unconstrained everyday activities. This setting has a unique set of challenges, such as mobility of the hardware setup and robustness to long capture sessions with fast recovery from tracking failures. We tackle these challenges with a novel lightweight setup that converts a standard baseball cap into a device for high-quality pose estimation based on a single cap-mounted fisheye camera. From the captured egocentric live stream, our CNN-based 3D pose estimation approach runs at 60 Hz on a consumer-level GPU. In addition to the lightweight hardware setup, our other main contributions are: 1) a large ground truth training corpus of top-down fisheye images and 2) a disentangled 3D pose estimation approach that takes the unique properties of the egocentric viewpoint into account. As shown by our evaluation, we achieve lower 3D joint error as well as better 2D overlay than the existing baselines.


Subject(s)
Imaging, Three-Dimensional/methods; Smart Glasses; Databases, Factual; Deep Learning; Human Activities; Humans; Posture; Software; Video Recording
11.
IEEE Trans Vis Comput Graph; 23(11): 2447-2454, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28809688

ABSTRACT

We present a novel real-time approach for user-guided intrinsic decomposition of static scenes captured by an RGB-D sensor. In the first step, we acquire a three-dimensional representation of the scene using a dense volumetric reconstruction framework. The obtained reconstruction serves as a proxy to densely fuse reflectance estimates and to store user-provided constraints in three-dimensional space. User constraints, in the form of constant shading and reflectance strokes, can be placed directly on the real-world geometry using an intuitive touch-based interaction metaphor, or using interactive mouse strokes. Fusing the decomposition results and constraints in three-dimensional space allows for robust propagation of this information to novel views by re-projection. We leverage this information to improve the decomposition quality of existing intrinsic video decomposition techniques by further constraining the ill-posed decomposition problem. In addition to improved decomposition quality, we show a variety of live augmented reality applications, such as recoloring of objects, relighting of scenes, and editing of material appearance.


Subject(s)
Imaging, Three-Dimensional/methods; Video Recording/methods; Algorithms; Humans
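
The role of user strokes as extra constraints on the ill-posed decomposition can be illustrated on a toy 1D version of log I = r + s, solved as a single least-squares problem. The weights and energy below are illustrative, not the paper's formulation:

```python
import numpy as np

def decompose_1d(log_I, stroke, w_smooth=1.0, w_stroke=10.0):
    """log I = r + s with smooth shading s and a user stroke (a set of pixel
    indices) marking constant reflectance r. Unknowns are stacked as
    [r_0..r_{n-1}, s_0..s_{n-1}] and solved in one least-squares system."""
    n = len(log_I)
    i = np.arange(n)
    A = np.zeros((n, 2 * n)); A[i, i] = 1; A[i, n + i] = 1    # r_i + s_i = log I_i
    j = np.arange(n - 1)
    S = np.zeros((n - 1, 2 * n))                              # s_{i+1} - s_i ~ 0
    S[j, n + j + 1] = w_smooth; S[j, n + j] = -w_smooth
    pairs = [p for p in stroke if p + 1 in stroke]            # r constant in stroke
    R = np.zeros((len(pairs), 2 * n))
    for k, p in enumerate(pairs):
        R[k, p + 1], R[k, p] = w_stroke, -w_stroke
    M = np.vstack([A, S, R])
    rhs = np.concatenate([np.asarray(log_I, float), np.zeros(len(S) + len(R))])
    x, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return x[:n], x[n:]                       # log-reflectance, log-shading

r, s = decompose_1d(np.log([1.0, 1.1, 0.5, 0.55]), stroke={0, 1})
```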
12.
Int J Comput Vis; 124(1): 96-113, 2017.
Article in English | MEDLINE | ID: mdl-32025094

ABSTRACT

This paper presents a novel approach to recover true fine surface detail of deforming meshes reconstructed from multi-view video. Template-based methods for performance capture usually produce 4D surface reconstructions with coarse-to-medium scale detail that do not contain the real high-frequency geometric detail present in the original video footage. Fine-scale deformation is often incorporated in a second pass by using stereo constraints, features, or shading-based refinement. In this paper, we propose an alternative solution to this second stage by formulating dense dynamic surface reconstruction as a global optimization problem over the densely deforming surface. Our main contribution is an implicit representation of a deformable mesh that uses a set of Gaussian functions on the surface to represent the initial coarse mesh, and a set of Gaussians for the images to represent the original captured multi-view images. We effectively find the fine-scale deformations of all mesh vertices that maximize photo-temporal consistency by densely optimizing our model-to-image consistency energy over all vertex positions. Our formulation yields a smooth closed-form energy with implicit occlusion handling and analytic derivatives. Furthermore, it does not require error-prone correspondence finding or discrete sampling of surface displacement values. We demonstrate our approach on a variety of datasets of human subjects wearing loose clothing and performing different motions. We qualitatively and quantitatively demonstrate that our technique successfully reproduces finer detail than the input baseline geometry.
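
A minimal sketch of a sum-of-Gaussians implicit representation and its analytic gradient (isotropic Gaussians for simplicity; the paper's energy is richer):

```python
import numpy as np

def gaussian_field(x, centers, sigma):
    """Evaluate a sum-of-Gaussians implicit field and its analytic gradient,
    echoing the smooth closed-form energy with analytic derivatives above.

    x: (Q, 3) query points; centers: (G, 3) Gaussian centers on the surface.
    """
    diff = x[:, None, :] - centers[None, :, :]                   # (Q, G, 3)
    w = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))   # (Q, G)
    value = w.sum(axis=-1)                                       # (Q,)
    grad = -(w[..., None] * diff).sum(axis=1) / sigma ** 2       # (Q, 3), analytic
    return value, grad
```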

13.
IEEE Trans Pattern Anal Mach Intell; 37(9): 1792-1805, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26353127

ABSTRACT

Improving the quality of degraded images is a key problem in image processing, but the breadth of the problem leads to domain-specific approaches for tasks such as super-resolution and compression artifact removal. Recent approaches have shown that a general approach is possible by learning application-specific models from examples; however, learning models sophisticated enough to generate high-quality images is computationally expensive, and so specific per-application or per-dataset models are impractical. To solve this problem, we present an efficient semi-local approximation scheme to large-scale Gaussian processes. This allows efficient learning of task-specific image enhancements from example images without reducing quality. As such, our algorithm can be easily customized to specific applications and datasets, and we show the efficiency and effectiveness of our approach across five domains: single-image super-resolution for scene, human face, and text images, and artifact removal in JPEG- and JPEG 2000-encoded images.
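
The semi-local approximation can be pictured as conditioning each GP prediction only on its nearest training examples instead of the full set, avoiding the cubic cost of exact GP regression. A minimal sketch under assumed hyperparameters (a simplification, not the paper's scheme):

```python
import numpy as np

def semi_local_gp_predict(x_query, X, Y, k=64, length=1.0, noise=1e-2):
    """GP posterior mean conditioned only on the k nearest training examples.

    X: (N, D) training inputs (e.g., low-res patches); Y: (N, P) targets
    (e.g., high-res patches); x_query: (D,).
    """
    k = min(k, len(X))
    nn = np.argsort(np.sum((X - x_query) ** 2, axis=1))[:k]   # local neighborhood
    Xn, Yn = X[nn], Y[nn]
    def rbf(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None] - 2 * A @ B.T
        return np.exp(-sq / (2 * length ** 2))
    K = rbf(Xn, Xn) + noise * np.eye(k)
    k_star = rbf(x_query[None], Xn)                           # (1, k)
    return (k_star @ np.linalg.solve(K, Yn))[0]               # posterior mean
```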

14.
IEEE Trans Pattern Anal Mach Intell; 35(11): 2720-2735, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24051731

ABSTRACT

Capturing the skeleton motion and detailed time-varying surface geometry of multiple, closely interacting people is a very challenging task, even in a multicamera setup, due to frequent occlusions and ambiguities in feature-to-person assignments. To address this task, we propose a framework that exploits multiview image segmentation. To this end, a probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Given the articulated template models of each person and the labeled pixels, a combined optimization scheme, which splits the skeleton pose optimization problem into a local one and a lower-dimensional global one, is applied to each individual in turn, followed by surface estimation to capture detailed non-rigid deformations. We show on various sequences that our approach can capture the 3D motion of humans accurately even if they move rapidly, wear wide apparel, or are engaged in challenging multiperson motions, including dancing, wrestling, and hugging.


Subject(s)
Algorithms; Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Imaging, Three-Dimensional/methods; Movement/physiology; Pattern Recognition, Automated/methods; Whole Body Imaging/methods; Humans
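
The unique pixel-to-person assignment can be illustrated with per-person Gaussian appearance models and an argmax over log-likelihoods; the full method also uses shape information, so this is only the appearance half, with illustrative inputs:

```python
import numpy as np

def assign_pixels(colors, means, covs):
    """Label each pixel with the person whose Gaussian color model scores
    highest (log-likelihood up to a shared constant).

    colors: (P, 3) pixel colors; means: (K, 3); covs: (K, 3, 3).
    """
    scores = []
    for mu, C in zip(means, covs):
        d = colors - mu
        Cinv = np.linalg.inv(C)
        logdet = np.linalg.slogdet(C)[1]
        scores.append(-0.5 * (np.einsum("pi,ij,pj->p", d, Cinv, d) + logdet))
    return np.argmax(np.stack(scores), axis=0)   # (P,) person labels
```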
15.
IEEE Trans Cybern; 43(5): 1370-1382, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23893757

ABSTRACT

We present an algorithm for creating free-viewpoint video of interacting humans using three handheld Kinect cameras. Our method reconstructs deforming surface geometry and temporally varying texture of humans by estimating human poses and camera poses for every time step of the RGBZ video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem, which optimizes the alignment of RGBZ data from all cameras as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Finally, texture recovery is achieved through joint optimization over spatio-temporal RGB data using matrix completion. As opposed to previous methods, our algorithm succeeds on free-viewpoint video of human actors in general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.


Subject(s)
Actigraphy/methods; Artificial Intelligence; Computer Peripherals; Imaging, Three-Dimensional/methods; Pattern Recognition, Automated/methods; Video Recording/methods; Whole Body Imaging/methods; Actigraphy/instrumentation; Algorithms; Computer Simulation; Humans; Image Enhancement/instrumentation; Image Enhancement/methods; Transducers; Video Games; Whole Body Imaging/instrumentation
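
A toy low-rank matrix-completion step in the spirit of the texture recovery (hard-impute via truncated SVD); the actual data layout and solver may differ:

```python
import numpy as np

def complete_texture(M, observed, rank=5, iters=100):
    """Iteratively fill missing entries with a rank-r SVD approximation
    while keeping observed entries fixed.

    M: (rows, cols) matrix, e.g., texels x frames; observed: boolean mask.
    """
    X = np.where(observed, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(observed, M, low_rank)   # overwrite only missing entries
    return X
```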
16.
IEEE Trans Pattern Anal Mach Intell; 35(5): 1039-1050, 2013 May.
Article in English | MEDLINE | ID: mdl-23520250

ABSTRACT

We describe a method for 3D object scanning by aligning depth scans taken from around an object with a Time-of-Flight (ToF) camera. These ToF cameras can measure depth scans at video rate. Due to their comparatively simple technology, they bear potential for economical production in large volumes. Our easy-to-use, cost-effective scanning solution, which is based on such a sensor, could make 3D scanning technology more accessible to everyday users. The algorithmic challenge we face is that the sensor's level of random noise is substantial and there is a nontrivial systematic bias. In this paper, we show the surprising result that 3D scans of reasonable quality can be obtained even with a sensor of such low data quality. Established filtering and scan alignment techniques from the literature fail to achieve this goal. In contrast, our algorithm is based on a new combination of a 3D superresolution method with a probabilistic scan alignment approach that explicitly takes into account the sensor's noise characteristics.
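
A small building block for probabilistic scan alignment is a noise-weighted rigid fit. The sketch below assumes correspondences are given; the full method additionally handles superresolution and the sensor's systematic bias:

```python
import numpy as np

def weighted_rigid_align(src, dst, weights):
    """Weighted Kabsch alignment, where per-point weights can encode a
    sensor noise model (illustrative building block, not the full method).

    src, dst: (N, 3) corresponding points; weights: (N,) confidences.
    Returns R (3, 3) and t (3,) with dst ~ src @ R.T + t.
    """
    w = weights / weights.sum()
    mu_s, mu_d = w @ src, w @ dst
    H = (src - mu_s).T @ ((dst - mu_d) * w[:, None])   # weighted covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # avoid reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    return R, mu_d - R @ mu_s
```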

17.
IEEE Trans Vis Comput Graph; 13(4): 663-674, 2007.
Article in English | MEDLINE | ID: mdl-17495327

ABSTRACT

By means of passive optical motion capture, real people can be authentically animated and photo-realistically textured. To import real-world characters into virtual environments, however, surface reflectance properties must also be known. We describe a video-based modeling approach that captures human shape and motion as well as reflectance characteristics from a handful of synchronized video recordings. The presented method is able to recover spatially varying surface reflectance properties of clothes from multiview video footage. The resulting model description enables us to realistically reproduce the appearance of animated virtual actors under different lighting conditions, as well as to interchange surface attributes among different people, e.g., for virtual dressing. Our contribution can be used to create 3D renditions of real-world people under arbitrary novel lighting conditions on standard graphics hardware.


Subject(s)
Computer Graphics; Image Interpretation, Computer-Assisted/methods; Joints/anatomy & histology; Joints/physiology; Lighting/methods; Models, Biological; Movement/physiology; Computer Simulation; Image Enhancement/methods; Imaging, Three-Dimensional/methods; User-Computer Interface
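
The simplest instance of fitting reflectance from multiple lighting observations is a per-texel Lambertian albedo solve; the paper recovers richer spatially varying reflectance, so this only illustrates the principle:

```python
import numpy as np

def fit_lambertian_albedo(intensities, normals, light_dirs):
    """Recover a scalar Lambertian albedo by least squares from several
    observations under known lighting (illustrative, single texel).

    intensities: (K,) observed values; normals: (3,) surface normal;
    light_dirs: (K, 3) unit light directions.
    """
    shading = np.clip(light_dirs @ normals, 0.0, None)      # (K,) n·l terms
    return shading @ intensities / (shading @ shading + 1e-8)
```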