Deep learning and AI for improving the robustness of machine vision localization techniques

Achref Elouni

Abstract

The work carried out in this thesis is part of a collaborative project to develop an augmented reality headset. Operating such a device requires computing the position of an on-board camera in the user's working environment. Recently, two technologies, SLAM (Simultaneous Localization And Mapping) and SfM (Structure From Motion), have demonstrated undeniable performance for the 3D reconstruction of an environment from a collection of images. We turned to them to solve the delicate problem of initializing our device, or re-initializing it when real-time position tracking fails. Indeed, despite the research of recent years, several limitations prevent localization systems from estimating a perfect pose in all conditions. These conditions include slight changes in context, such as variations in brightness or viewpoint, and geometric changes such as the addition of objects. To address these limitations and to provide an easy-to-deploy solution, we investigated the possibility of incorporating invariant information into the localization process to increase the probability of obtaining a precise pose. Two types of invariant information (semantic and geometric) are exploited in this thesis to help the localization system find its position. The proposed solutions were validated on several internal and external datasets (Dubrovnik, Rome, Oxford, Museum), which allowed us to compare our results with the state of the art. Two types of query were studied: a single image, and a pair of images from a stereo device. The advantage of a stereo pair is that homologous points can be triangulated to extract their height, which is then exploited in the localization process. The other approach uses as an invariant the pixel labels produced by a semantic segmentation algorithm based on a convolutional neural network. In both cases, the results show a significant improvement in the precision of the estimated poses.

