WO2023017947A1

WO2023017947A1 - Method and system for processing visual information for estimating visual properties in autonomous driving

Info

Publication number: WO2023017947A1
Application number: PCT/KR2022/004580
Authority: WO
Inventors: 임진욱
Original assignee: (주)에이아이매틱스
Priority date: 2021-08-12
Filing date: 2022-03-31
Publication date: 2023-02-16
Also published as: KR102399047B1

Abstract

The present invention relates to a method and a system for processing visual information for estimating visual properties in autonomous driving, the method comprising the steps of: receiving learning data including image data from a client; building a learning model by learning the learning data, wherein the learning model is implemented to receive an image and estimate visual properties including depth, motion, and intrinsic parameters; deploying the learning model to the client; receiving, from the client, a result of a deep learning analysis performed by applying local data on the relevant client to the learning model; and evaluating the stability of the learning model on the basis of the result of the deep learning analysis.

Description

Visual information processing method and system for estimating visual attributes in autonomous driving

The present invention relates to visual information processing technology and, more particularly, to a system that trains an artificial intelligence model using video captured by a camera of an autonomous vehicle and estimates visual attributes using the output of that model.

Recently, both the types of sensors used in autonomous vehicles (for example, cameras, radars, lidars, and ultrasonic sensors) and the number of such sensors have been increasing. As a result, the cost of installing each sensor and the cost of developing and operating the software for that equipment increase exponentially.

In addition, although a machine learning model trained on a plurality of arbitrary images gains generality from being trained with images taken by various cameras, it cannot have the specificity suited to the environment of each camera sensor.

In general, image-based visual property estimation is a technique for determining visual properties from images captured by a camera. It is used in industrial fields such as robots, drones, self-driving vehicles, and smartphones. For example, in an autonomous vehicle, the depth of an image and the motion of the vehicle itself (ego-motion) can be estimated as visual attributes, and the results can be used in other applications (3D object detection, pseudo-lidar).

Meanwhile, examples of conventional techniques for estimating visual attributes are largely divided into two types. The first is a technique related to depth estimation, which estimates the depth of an input image. Korean Patent Publication No. 10-2017-0082794 (July 17, 2017) discloses a depth estimation method and device and a distance estimator learning method and device. The second is a technique related to motion estimation, which is a technique for estimating the motion of an input video. Korean Patent Registration No. 10-1758058 (July 10, 2017) discloses a method and apparatus for estimating camera motion using depth information, and an augmented reality system.

[Prior art literature]

[Patent Literature]

Korean Patent Publication No. 10-2017-0082794 (2017.07.17)

Korean Patent Registration No. 10-1758058 (2017.07.10)

An embodiment of the present invention provides a visual information processing method and system for estimating visual attributes in autonomous driving, which train an artificial intelligence model using video captured by a camera of an autonomous vehicle and estimate visual attributes using the output of that model.

An embodiment of the present invention provides a visual information processing method and system for estimating visual attributes in autonomous driving that resolve the visual properties required for autonomous driving (depth estimation, motion estimation, glass distortion resolution, etc.) using images alone, thereby solving the cost problem caused by the use of radar, lidar, ultrasonic sensors, and the like, and that use the advantages of an unsupervised learning-based machine learning model to give each camera specificity through additional online learning performed at the client starting from a general model.

Among the embodiments, a visual information processing method for estimating visual attributes in autonomous driving includes: receiving learning data including image data from a client; building a learning model by learning the learning data, wherein the learning model is implemented to receive an image and estimate visual properties including depth, motion, and intrinsic parameters; deploying the learning model to the client; receiving, from the client, a result of deep learning analysis performed by applying local data on the corresponding client to the learning model; and evaluating the stability of the learning model based on the deep learning analysis result.

Receiving the learning data may include: collecting, on the client, related data that is synchronized with the generation time of the image data and includes the intrinsic parameters, IMU, GPS, and vehicle speed; and packaging the related data together with the image data to generate the learning data on the client.

Building the learning model may include: independently estimating the depth and the motion based on the image; estimating the intrinsic parameters based on the image or extracting the intrinsic parameters from the learning data; extracting IMU, GPS, and vehicle speed from the learning data; and iteratively performing learning in a direction that minimizes a loss function based on the depth, the motion, the intrinsic parameters, the IMU, the GPS, and the vehicle speed.

Building the learning model may include defining the loss function as a sum of a warping loss and a depth smoothing loss; and redefining the loss function by selectively adding at least one of a velocity supervision loss and a motion regularization loss to the loss function.

Receiving the deep learning analysis result may include online-learning the learning model on the client based on training data composed of the local data.

The receiving of the deep learning analysis result may further include receiving, from the client, an application result obtained by applying the deep learning analysis result to an application on the corresponding client.

Evaluating the stability may include calculating accuracy of the application based on a result of the application and evaluating stability of the learning model according to the accuracy.

Among the embodiments, a visual information processing system for estimating visual attributes in autonomous driving includes: a client that generates learning data including image data and online-learns a learning model received from a server to produce a deep learning analysis result for local data; and a server that builds the learning model by learning the learning data received from the client, the learning model receiving an image and estimating visual properties including depth, motion, and intrinsic parameters, and that evaluates the stability of the learning model based on the deep learning analysis result received from the client.

The disclosed technology may have the following effects. However, it does not mean that a specific embodiment must include all of the following effects or only the following effects, so it should not be understood that the scope of rights of the disclosed technology is limited thereby.

A visual information processing method and system for estimating visual attributes in autonomous driving according to an embodiment of the present invention can train an artificial intelligence model using video captured by a camera of an autonomous vehicle and estimate visual attributes using the output of that model.

A visual information processing method and system for estimating visual attributes in autonomous driving according to an embodiment of the present invention can resolve the visual attributes required for autonomous driving (depth estimation, motion estimation, glass distortion resolution, etc.) using images alone, thereby solving the cost problem caused by the use of radar, lidar, and ultrasonic sensors.

A visual information processing method and system for estimating visual attributes in autonomous driving according to an embodiment of the present invention can use the advantages of an unsupervised learning-based machine learning model to give each camera specificity through additional online learning performed at the client starting from a general model.

FIG. 1 is a diagram illustrating a visual information processing system according to the present invention.

FIG. 2 is a diagram explaining the basic configuration of a visual information processing system according to the present invention.

FIG. 3 is a diagram explaining the functional configuration of the server shown in FIG. 1.

FIG. 4 is a flowchart illustrating an embodiment of an operation performed in the client shown in FIG. 1.

FIG. 5 is a flowchart illustrating the basic operation of the visual information processing system according to the present invention.

FIG. 6 is a flowchart illustrating an embodiment of an operation performed by the client and server shown in FIG. 1.

FIG. 7 is a flowchart illustrating an online learning operation performed by the client shown in FIG. 1.

FIG. 8 is a diagram for explaining an inference process performed by the client shown in FIG. 1.

FIG. 9 is a diagram explaining a learning process performed in the visual information processing system according to the present invention.

FIGS. 10A to 10C are diagrams for explaining various embodiments of a learning process performed in the visual information processing system according to the present invention.

Since the description of the present invention is only an embodiment for structural or functional description, the scope of the present invention should not be construed as being limited by the embodiments described in the text. That is, since the embodiment can be changed in various ways and can have various forms, it should be understood that the scope of the present invention includes equivalents capable of realizing the technical idea. In addition, since the object or effect presented in the present invention does not mean that a specific embodiment should include all of them or only such effects, the scope of the present invention should not be construed as being limited thereto.

Meanwhile, the meaning of terms described in this application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first element may be termed a second element, and similarly, a second element may be termed a first element.

It should be understood that when an element is referred to as being “connected” to another element, it may be directly connected to the other element, but other elements may exist in the middle. On the other hand, when an element is referred to as being "directly connected" to another element, it should be understood that no intervening elements exist. Meanwhile, other expressions describing the relationship between components, such as “between” and “immediately between” or “adjacent to” and “directly adjacent to” should be interpreted similarly.

Expressions in the singular should be understood to include the plural unless the context clearly dictates otherwise. Terms such as “comprise” or “having” indicate that a stated feature, number, step, operation, component, part, or combination thereof exists, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In each step, identification codes (e.g., a, b, c) are used for convenience of explanation and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one specified. That is, the steps may occur in the specified order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices that store data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices. In addition, the computer-readable recording medium may be distributed across computer systems connected through a network, so that the computer-readable code may be stored and executed in a distributed manner.

All terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs, unless defined otherwise. Terms defined in commonly used dictionaries should be interpreted as consistent with meanings in the context of the related art, and cannot be interpreted as having ideal or excessively formal meanings unless explicitly defined in the present application.

Referring to FIG. 1, the visual information processing system 100 may include a client 110 and a server 130. That is, the visual information processing system 100 may process the visual information necessary for autonomous driving through linkage between the client 110 and the server 130, and may execute a series of operations for estimating visual attributes in the process.

First, the client 110 may correspond to a device that generates learning data including image data and online-learns the learning model received from the server 130 to generate a deep learning analysis result for local data. The client 110 may be installed and operated in a vehicle capable of autonomous driving, and may include predetermined sensors for implementing functions necessary for autonomous driving. The client 110 may be connected to the server 130 through a network, and a plurality of clients 110 may perform independent operations while simultaneously connected to the server 130.

The server 130 may correspond to a device that builds a learning model by learning the learning data received from the client 110 and evaluates the stability of the learning model based on the deep learning analysis result received from the client 110. Here, the learning model may receive an image and estimate visual properties including depth, motion, and intrinsic parameters. In particular, the server 130 may perform pre-learning so that the learning model has general characteristics, and may transmit the learned model to the client 110 so that additional learning that gives the model the unique characteristics of each camera can take place on the client 110.

Meanwhile, the server 130 may operate in conjunction with a database (not shown in FIG. 1 ) for storing learning data received from various clients 110 . In this case, the database may be implemented as a device independent of the server 130 and connected to the server 130 through a network. For example, the database may be implemented as an instance operating in the cloud and connected to the server 130 through a network to provide a virtual storage space with unlimited capacity.

Referring to FIG. 2 , the visual information processing system 100 may be implemented by including various functional modules for visual attribute estimation in autonomous driving. Here, the functional modules may be included and implemented on the client 110 or the server 130, and may be selectively added to only one side or collectively applied to both sides as needed.

More specifically, the functional modules may include vision sensors 210, additional sensors 220, a network interface 230, an artificial intelligence processor 240, an image pre-processing processor 250, and a deep learning network 260.

The vision sensors 210 may correspond to sensors that generate images or visual information of videos, such as cameras. The vision sensors 210 may perform operations such as contour detection, pixel calculation, code reading, and 3D scanning. The additional sensors 220 may correspond to sensors other than the vision sensors 210 . For example, additional sensors 220 may include GPS, IMU, vehicle velocity, and the like.

In addition, the network interface 230 may perform network communication, and may include WIFI, cellular, and the like. The artificial intelligence (AI) processor 240 and the image pre-processing processor 250 may correspond to arithmetic processing devices that execute and process various operation procedures. The artificial intelligence (AI) processor 240 and the image pre-processing processor 250 may be implemented with a CPU, GPU, TPU, FPGA, DSP, or the like.

The deep learning network 260 may be built through deep learning and may process various prediction operations. For example, the deep learning network 260 may perform operations such as estimating depth, estimating pose, and estimating intrinsic parameters, and may be built from independent models that each perform one of these operations.

Referring to FIG. 3, the server 130 according to the present invention may include a plurality of functional components that perform independent operations. More specifically, the server 130 may include a learning data receiving unit 310, a learning model building unit 330, a learning model distribution unit 350, a deep learning analysis unit 370, a learning model evaluation unit 390, and a control unit (not shown in FIG. 3).

The training data receiver 310 may receive training data including image data from the client 110 . Here, the client 110 may be installed and operated in a vehicle capable of autonomous driving, and may generate various types of image data according to an installation location and function in the vehicle. The learning data receiving unit 310 may collect and store learning data for each vehicle and client 110, and may additionally add identification information about the vehicle and client 110 to the learning data in the storage process.

In one embodiment, the learning data received by the learning data receiving unit 310 may be generated on the client 110. More specifically, related data that is synchronized with the generation time of the image data and includes intrinsic parameters, IMU, GPS, and vehicle speed may be collected on the client 110, and the collected related data may be packaged together with the image data to generate the learning data.

That is, the client 110 may generate learning data by integrating the information collected on its own, and at this time, a generation rule for the learning data provided from the server 130 may be applied. The training data may be generated by basically including image data and various related data at the point in time when the corresponding image data is generated. The creation rule may include information about data type, data sorting order, timing, data size, and the like.

Also, the learning data may be generated so as not to include data unique to each client 110 . That is, the client 110 may intentionally remove training data containing characteristic information of the client 110 in order to provide general-purpose training data used for building a general model on the server 130 . The client 110 may remove data that meets a pre-set removal condition in the packaging process of the learning data from the learning data.
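
The packaging and removal steps described above can be sketched as follows. This is a minimal illustration only: the class names, field names, the `max_skew_s` synchronization tolerance, and the form of the removal condition are all assumptions introduced here, since the description does not specify a data layout or a concrete generation rule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RelatedData:
    """Sensor readings sampled near the image timestamp (field names are illustrative)."""
    timestamp: float
    intrinsics: Optional[tuple]  # e.g. (fx, fy, cx, cy); None if unknown
    imu: tuple                   # e.g. (ax, ay, az, gx, gy, gz)
    gps: tuple                   # (latitude, longitude)
    speed_mps: float             # vehicle speed in metres per second


@dataclass
class TrainingSample:
    image: bytes
    related: RelatedData
    client_tags: dict = field(default_factory=dict)


def package(image, image_ts, readings, max_skew_s=0.05):
    """Pair an image with the reading closest in time.

    Returns None when no reading lies within the assumed tolerance max_skew_s.
    """
    if not readings:
        return None
    nearest = min(readings, key=lambda r: abs(r.timestamp - image_ts))
    if abs(nearest.timestamp - image_ts) > max_skew_s:
        return None
    return TrainingSample(image=image, related=nearest)


def drop_client_specific(samples, removal_condition):
    """Keep only the samples that do not meet the pre-set removal condition."""
    return [s for s in samples if not removal_condition(s)]
```

In this sketch the removal condition is passed in as a function, so that a server-provided rule could be swapped without changing the packaging code.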

The learning model construction unit 330 may build a learning model by learning learning data. Here, the learning model may be implemented to receive an image and estimate visual properties including depth, motion, and intrinsic parameters. In the case of a self-driving vehicle, visual attributes must be effectively acquired for self-driving control, and the learning model builder 330 can build a generalized learning model for effectively estimating visual attributes using only images.

In particular, the learning model built by the learning model building unit 330 may be built through unsupervised learning based on unlabeled learning data, and this general learning model may be given specificity for each camera through additional learning on each client 110. The learning model building unit 330 may independently build learning models that perform operations such as depth estimation, motion estimation, and parameter estimation necessary for autonomous driving.

In one embodiment, the learning model building unit 330 may independently estimate depth and motion based on an image, estimate the intrinsic parameters based on the image or extract the intrinsic parameters from the learning data, and extract IMU, GPS, and vehicle speed from the learning data. The learning model building unit 330 may repeatedly perform learning in a direction that minimizes a loss function based on depth, motion, intrinsic parameters, IMU, GPS, and vehicle speed. The learning operation performed by the learning model building unit 330 can be used to build a generalized learning model on the server 130, and can also be used to build a specialized, client-specific learning model on the client 110. This is described in more detail with reference to FIG. 10.

In one embodiment, the learning model building unit 330 may define the loss function as the sum of a warping loss and a depth smoothing loss, and may redefine the loss function by selectively adding at least one of a velocity supervision loss and a motion regularization loss. This is described in more detail with reference to FIG. 10.
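
As a rough sketch of how such a loss could be assembled, the function below sums the two base terms and adds the optional terms only when they are supplied. The weights `w_velocity` and `w_motion` are assumptions, and the particular form of `velocity_supervision_loss` (comparing the magnitude of the predicted translation with the distance implied by the measured vehicle speed) is one commonly used formulation rather than something this passage defines.

```python
def total_loss(warping, depth_smoothing, velocity_supervision=None,
               motion_regularization=None, w_velocity=1.0, w_motion=1.0):
    """Base loss is warping + depth smoothing; optional terms are added if given.

    w_velocity and w_motion are assumed weights, not values from the description.
    """
    loss = warping + depth_smoothing
    if velocity_supervision is not None:
        loss += w_velocity * velocity_supervision
    if motion_regularization is not None:
        loss += w_motion * motion_regularization
    return loss


def velocity_supervision_loss(translation, speed_mps, dt_s):
    """Assumed form: gap between the predicted translation magnitude and the
    distance travelled according to the measured vehicle speed."""
    predicted_distance = sum(c * c for c in translation) ** 0.5
    return abs(predicted_distance - speed_mps * dt_s)
```

Keeping the optional terms behind `None` defaults mirrors the "selectively adding" wording: the same function covers the base loss and each extended variant.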

The learning model distribution unit 350 may deploy the learning model to the client 110 . The generalized learning model built by the learning model builder 330 may be distributed to each client 110 by the server 130 . The distribution process may proceed through a network connected between the client 110 and the server 130, and the learning model distribution unit 350 may control the overall distribution process. If the distribution process is interrupted due to disconnection of the network connection, the learning model distribution unit 350 may temporarily store distribution information, and resume the discontinued distribution process when the network connection is restored. The learning model distributing unit 350 may process a distributing operation according to the distributing rule, and may generate a distributing rule for each client 110 and apply it independently during the distributing process, if necessary.
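
The interruption and resumption behaviour described above could look roughly like the following. The `ResumableDeployment` class, the chunked transfer, and the `send_chunk` callback are hypothetical; the description only states that distribution information is stored temporarily and that distribution resumes once the connection is restored.

```python
class ResumableDeployment:
    """Tracks how much of a serialized model has been sent (hypothetical helper)."""

    def __init__(self, model_bytes, chunk_size=4):
        self.model_bytes = model_bytes
        self.chunk_size = chunk_size
        self.offset = 0  # stored distribution state; kept across interruptions

    @property
    def done(self):
        return self.offset >= len(self.model_bytes)

    def run(self, send_chunk):
        """Send chunks until finished or the link drops; return True when complete."""
        while not self.done:
            chunk = self.model_bytes[self.offset:self.offset + self.chunk_size]
            try:
                send_chunk(self.offset, chunk)
            except ConnectionError:
                return False  # offset is unchanged, so a later call resumes here
            self.offset += len(chunk)
        return True
```

Calling `run` again after a failure continues from the stored offset instead of restarting the transfer.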

The deep learning analyzer 370 may receive a deep learning analysis result obtained by applying local data on the corresponding client to the learning model from the client 110 . That is, the client 110 may perform deep learning analysis using local data collected on its own on the client based on the learning model received from the server 130 . Here, the local data may correspond to individual data collected by each client 110 and may include sensor data of sensors operating in the corresponding client 110 .

In one embodiment, the deep learning analysis unit 370 may online-learn a learning model based on training data composed of local data on the client 110 . The client 110 may perform online learning by itself, and the deep learning analysis unit 370 may control an online learning operation performed on the client 110 by interworking with the client 110 . For example, the deep learning analysis unit 370 may monitor a process of distributing a model to the client 110 and monitor a collection situation of local data on the client 110 . The deep learning analyzer 370 may deliver an online learning start command to the client 110 when model distribution in the specific client 110 is completed and local data is sufficiently collected. The corresponding client 110 may perform online learning on the learning model according to the start command of the deep learning analysis unit 370 .

Meanwhile, the client 110 may independently perform online learning according to its own control procedure, and may transmit information about the start and end of online learning to the deep learning analyzer 370 if necessary.

In one embodiment, the deep learning analysis unit 370 may apply the deep learning analysis result to an application on the corresponding client 110 and receive a result of the application from the client 110 . Here, the application may correspond to a unique application function operating on the client 110 . For example, applications may include 3D object detection, pseudo-lidar, and the like. Each client 110 may execute at least one application that performs a unique function, and may provide a deep learning analysis result performed through a learning model as part of information required for execution of the corresponding application. The deep learning analysis unit 370 may receive, from the client 110 , an execution result of an application to which the corresponding deep learning analysis result is applied, that is, an application result, together with a deep learning analysis result.

The learning model evaluation unit 390 may evaluate the stability of the learning model based on the deep learning analysis result. Based on the deep learning analysis results collected from each client 110, the learning model evaluation unit 390 may evaluate the learning process performed on the server 130, the learning algorithm used for learning, or the previously built learning model. Since there is no evaluation criterion for evaluating the stability of the learning model itself, the learning model evaluation unit 390 may use the performance result of the application executed in the client 110 as an evaluation criterion. That is, the learning model evaluation unit 390 may indirectly evaluate the stability of the learning model according to the execution result of the application, and thereby determine whether to continue using the learning model or to update it.

In an embodiment, the learning model evaluation unit 390 may calculate the accuracy of the application based on the application result and evaluate the stability of the learning model according to the accuracy. For example, the learning model evaluation unit 390 may receive a result of 3D object detection performed on the client 110 and evaluate the accuracy of object detection based thereon. Here, accuracy may be calculated through comparison between ground truth and prediction results.

More specifically, object detection based on image data may correspond to a process of identifying objects present in an image and classifying them. The learning model evaluation unit 390 may detect a bounding box of an object from the object information received as a result of object detection and determine whether the class of the object identified within the bounding box matches the actual class. If the classes do not match, the corresponding object detection may be classified as a failure. If the classes match, the learning model evaluation unit 390 may calculate object detection accuracy by comparing the bounding box of the identified object with the actual bounding box. In this case, Intersection over Union (IoU), Precision, Recall, Average Precision (AP), Mean Average Precision (mAP), and the like may be used to calculate the accuracy.
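
The class check and bounding-box comparison described above can be illustrated with a short sketch. The box layout `(x_min, y_min, x_max, y_max)`, the `iou_threshold` of 0.5, and the greedy one-to-one matching are assumptions chosen for illustration; AP and mAP are not computed here.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, truth, iou_threshold=0.5):
    """pred and truth are (class_name, box) pairs; a class mismatch is a failure."""
    if pred[0] != truth[0]:
        return False
    return iou(pred[1], truth[1]) >= iou_threshold


def precision_recall(preds, truths, iou_threshold=0.5):
    """Greedily match each prediction to at most one unused ground-truth object."""
    unmatched = list(truths)
    true_positives = 0
    for pred in preds:
        match = next((t for t in unmatched if is_correct(pred, t, iou_threshold)), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall
```

For example, two unit-offset 2x2 boxes overlap in a 1x1 square, so their IoU is 1/7, which falls below the assumed 0.5 threshold and would count as a missed detection.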

The control unit (not shown in FIG. 3) may control the overall operation of the server 130 and may manage the control flow or data flow among the learning data receiving unit 310, the learning model building unit 330, the learning model distribution unit 350, the deep learning analysis unit 370, and the learning model evaluation unit 390.

Referring to FIG. 4 , the client 110 may acquire image data through a vision sensor installed therein (S410). Also, the client 110 may acquire related data through additional sensors (S420). In this case, the related data may include a camera intrinsic parameter, an IMU, vehicle velocity, and the like. The client 110 may generate learning data by packaging image data and related data (S430). Thereafter, the training data packaged in the client 110 may be transmitted to the server 130 and used in a model learning process.

Meanwhile, the packaging operation for the learning data may be performed by the client 110, but is not necessarily limited thereto, and may be performed by the server 130. In this case, the server 130 may sequentially receive image data and related data from the client 110 and then package the learning data.

Referring to FIG. 5, the visual information processing system 100 may operate through interworking between a client 110 and a server 130. More specifically, the server 130 may prepare learning data based on data collected from the client 110 (S510). Thereafter, the server 130 may build a machine learning model by learning the learning data (S520). The learning operation performed on the server 130 may be performed through unsupervised learning using unlabeled data, and through this, a generalized model independent of the image sensor and operating environment of the client 110 may be built. The machine learning model built by the server 130 may be distributed to each client 110 and tuned to suit the image sensor and operating environment of the client 110.

That is, the client 110 to which the machine learning model is distributed may collect sensor data from internal sensors (S540) and apply the sensor data to the distributed machine learning model (S550). Then, the client 110 may perform a specific operation by transferring the result of the model to an application operating on the client 110 .

Referring to FIG. 6 , the client 110 may receive sensor data from sensors (S610) and may perform a pre-processing operation on the received sensor data (S620). For example, in the case of image data, preprocessing processes such as augmentation, normalization, and scaling may be applied. The client 110 may obtain data in a form applicable to the model through preprocessing of sensor data.
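
A minimal sketch of the scaling and normalization steps is shown below for a single-channel image stored as a nested list. The nearest-neighbor resampling, the 8-bit input range, and the `mean`/`std` values of 0.5 are assumptions for illustration; augmentation is omitted.

```python
def scale_nearest(image, out_h, out_w):
    """Resize a 2D list of pixel values with nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image, mean=0.5, std=0.5):
    """Map 8-bit pixel values to [0, 1] and standardize with an assumed mean and std."""
    return [[(p / 255.0 - mean) / std for p in row] for row in image]


def preprocess(image, out_h, out_w):
    """Scale first, then normalize, producing input in the form the model expects."""
    return normalize(scale_nearest(image, out_h, out_w))
```

With the assumed mean and standard deviation, pixel value 0 maps to -1.0 and pixel value 255 maps to 1.0.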

In addition, the client 110 may perform deep learning analysis by inputting the preprocessed data to a machine learning model (S630). Here, the machine learning model is a model learned on the server 130 and may correspond to a model distributed by the server 130 to the client 110. Deep learning analysis may include inferring visual properties of image data. For example, in the case of a client 110 installed and operating in an autonomous vehicle, estimates of depth, pose, and intrinsic parameters may be generated as the result of deep learning analysis on images.

The client 110 may deliver the deep learning analysis result to an internal application (S640), and obtain a result according to the operation of the application as an application result. Thereafter, the client 110 may transmit the deep learning analysis result and the application result to the server 130 (S650). The server 130 may evaluate the stability of the deep learning model using the data received from the client 110 (S660). That is, the deep learning result can be indirectly evaluated using the result of the application algorithm performed on the client 110, and thus the stability of the model can be confirmed.

Referring to FIG. 7, a model learned by the server 130 may be distributed to each client 110. The client 110 may use the distributed model as it is, or may perform additional learning to tune it to the corresponding image sensor and operating environment, if necessary. For example, in FIG. 7, the client 110 may prepare training data for the model using sensor data received from its own sensors (S710) and perform online learning based on that data (S720).

Here, online learning may correspond to a method of learning a model by sequentially applying data in mini batches. The client 110 may tune a model having generality into a model having unique characteristics of the client 110 through online learning in which data collected by itself is applied to the model distributed from the server 130 .
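The mini-batch procedure can be illustrated with a deliberately small example. The sketch below assumes a one-parameter model y = weight * x and a squared-error loss, neither of which comes from the embodiment. It only shows how a generic weight received from the server drifts toward a client-specific value as local data is applied batch by batch.

```python
def mini_batches(samples, size):
    """Yield consecutive mini batches of the given size."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def online_update(weight, batches, lr=0.1):
    """Fine-tune y = weight * x with one gradient step per mini batch."""
    for batch in batches:
        # Mean gradient of 0.5 * (weight * x - y)^2 over the mini batch.
        grad = sum((weight * x - y) * x for x, y in batch) / len(batch)
        weight -= lr * grad
    return weight


server_weight = 1.0  # generic model from the server (assumed value)
# Local data that follows y = 2x, i.e. this client's own characteristic.
local_samples = [(x, 2.0 * x) for x in (1.0, 2.0, 1.5, 0.5)] * 50
client_weight = online_update(server_weight, mini_batches(local_samples, 4))
```

After the 50 mini batches the weight is close to 2.0, the value implied by the local data.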

Thereafter, the client 110 may receive sensor data from the sensors (S730) and then apply the sensor data to a model finally built through online learning (S740). The client 110 may transmit the result of the model to the application as a deep learning analysis result and perform a unique application operation (S750).

Referring to FIG. 8, the client 110 may estimate visual properties by collecting images and related sensor data and applying the pre-processed data to each model. Specifically, the client 110 may estimate depth through a depth estimation model, estimate motion through a motion estimation model, and estimate camera intrinsic parameters through a camera intrinsic estimation model.

Thereafter, the client 110 may transfer the estimated values to the application as a deep learning analysis result to obtain an application output. The deep learning analysis results and application results generated through the client 110 may be transmitted to the server 130 and used for model evaluation.

Referring to FIG. 9 , the visual information processing system 100 may perform a learning process of building a model on the client 110 or the server 130 . That is, the server 130 may build a generalized machine learning model by learning data received from various clients 110, and may distribute it to each client 110 again. In addition, the client 110 may build a client-only model by online learning the distributed model according to each characteristic.

In the case of FIG. 9, a learning process performed on the client 110 or the server 130 is shown. The learning process may operate in the direction of minimizing the loss of a predefined loss function while inferring various visual properties from input images and generating them as output, and learning may be achieved by the optimizer repeatedly updating the parameters of the model.

The depth estimation model, the ego-motion estimation model, and the intrinsic estimation model may be constructed by unsupervised learning of the geometric relationship between two or more adjacent images (I_t, I_{t+1}) separated by a time difference Δt. To proceed with unsupervised learning, a loss must be defined using a loss function, and the optimizer may operate in the direction of reducing the loss. As a result, the parameters of each model may be cumulatively updated from their previous values as the loss is reduced.

Meanwhile, the most common example of an optimizer may be a gradient descent method (e.g., the Adam optimizer, RMSprop, etc.). In addition, the loss function may include a reprojection error loss function, an SSIM loss function, a depth smoothing loss function, a velocity supervision loss function, a motion regularization loss function, and the like. Here, the sum of the reprojection error loss and the SSIM loss may correspond to the warping loss. In the actual learning process, the inputs of each loss may be set differently, only some of the models to be learned may be used, and some operations may be modified. More specific embodiments will be described with reference to FIG. 10.

A learning process performed in the visual information processing system 100 may be performed in a direction of minimizing losses of a predefined loss function. Here, the loss function may be defined as the sum of various loss functions. For example, it can be defined as loss function = 1 + 2 + 3 + 4 + 5, and the loss function of each number is as follows.

1: Reprojection Error loss function, 2: SSIM loss function, 3: Depth smoothing loss function, 4: velocity Supervision loss function, 5: Motion Regularization loss function
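The summation of loss terms can be sketched as follows. The optional per-term weights are an assumption added for illustration (the description states a plain sum), and the numeric values are placeholders rather than measured losses.

```python
def total_loss(terms, weights=None):
    """Sum named loss terms; a term without a weight counts with weight 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


# Placeholder values for the warping loss (reprojection + SSIM) and depth smoothing.
base = {"reprojection": 0.20, "ssim": 0.10, "depth_smoothing": 0.05}
loss_basic = total_loss(base)

# Selectively adding the velocity supervision term to the same sum.
loss_with_velocity = total_loss({**base, "velocity_supervision": 0.02})
```

Keeping the terms in a dictionary makes it straightforward to add or drop individual losses depending on which learning method is in use.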

In addition, the learning process that minimizes the loss function may include a forward pass, which generates source images from a target image, and a backward pass, which generates a target image from source images. Since the forward pass and the backward pass are mutually symmetric, the following description is based on the forward pass.

In addition, variables used in the detailed description may be defined as follows.

Pinhole camera intrinsic: K ∈ R^{3×3}; Grid: G ∈ R^{Height×Width×3} (if G is in pixel coordinates, G ∈ R^{Height×Width×2}); Depth: D ∈ R^{Height×Width×1}; Rotation: R ∈ R^{3×3}; Translation: T ∈ R^{3×1}; Image: I ∈ R^{Height×Width×3}

Referring to FIG. 10A, the first learning method may correspond to the most basic learning method among the learning methods performed on the client 110 or the server 130. That is, the first learning method may be a method of learning a depth model and an ego-motion model. Accordingly, the two models can be learned using only the warping loss and the depth smoothing loss. Meanwhile, in the first learning method, absolute scale conversion may be performed using IMU, GPS, and velocity data. When the two models are learned from a single (monocular) camera, physical limitations mean that depth and ego-motion can only be obtained up to scale. To solve this, absolute scale conversion allows the model to be trained with an absolute scale for depth and ego-motion.

At this time, the reprojection error loss (Reprojection Error Loss) can be calculated as follows.

The reprojection error may correspond to the difference between the original target image and a synthesized target image generated by applying a rotation and translation (RT) to a source image. The reprojection error may be applied with the same meaning, subject to predetermined changes, in the second and third learning methods.
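As an illustration only, the following sketch uses the standard pinhole reprojection found in view-synthesis methods. It is an assumed form, not the exact expression of the embodiment. A target pixel is back-projected using its depth, moved by R and T, and projected into the source view. The synthesized target image is then compared with the original using a mean absolute difference. The function names and numeric values are hypothetical.

```python
def reproject(u, v, depth, fx, fy, cx, cy, R, T):
    """Map target pixel (u, v) with the given depth into the source view."""
    # Back-project: X = depth * K^-1 [u, v, 1]^T for a pinhole K.
    X = [depth * (u - cx) / fx, depth * (v - cy) / fy, depth]
    # Rigid transform into the source camera frame: X' = R X + T.
    Xs = [sum(R[i][j] * X[j] for j in range(3)) + T[i] for i in range(3)]
    # Project with K and divide by the transformed depth.
    return fx * Xs[0] / Xs[2] + cx, fy * Xs[1] / Xs[2] + cy


def photometric_l1(target, synthesized):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(a - b) for a, b in zip(target, synthesized)) / len(target)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A 1 m translation along x at 10 m depth shifts the pixel by fx * 1 / 10 = 50 px.
u_src, v_src = reproject(100.0, 50.0, 10.0, 500.0, 500.0, 320.0, 240.0,
                         IDENTITY, [1.0, 0.0, 0.0])
```

With the identity rotation and zero translation the pixel maps onto itself, which is a convenient sanity check for any implementation of this step.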

Structural Similarity Error (SSIM Error) may correspond to an error related to the difference between luminance, contrast, and structure of two input images, and may be calculated as follows. SSIM errors may equally be included in the second and third learning methods.

Here, x and y are pixel positions, and the SSIM error is composed of luminance, contrast, and structure terms.
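For reference, the widely used definition of SSIM as the product of luminance, contrast, and structure terms can be sketched as below. The constants (c1 and c2 for intensities in [0, 1], with c3 = c2 / 2), the use of a single global window, and the conversion to a loss as (1 - SSIM) / 2 are conventional choices assumed here, not values stated in the embodiment.

```python
from math import sqrt


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two equally sized patches given as flat lists in [0, 1]."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = sqrt(var_x), sqrt(var_y)
    c3 = c2 / 2
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure


def ssim_loss(x, y):
    """Map SSIM in [-1, 1] to a loss in [0, 1]."""
    return (1.0 - ssim(x, y)) / 2.0


patch = [0.1, 0.5, 0.9, 0.3]
inverted = [1.0 - p for p in patch]
```

Identical patches give an SSIM of 1 and therefore a loss of 0, while the inverted patch has a negative structure term and a correspondingly larger loss.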

The depth smoothing error may correspond to an error that moderates the gradient of the depth using the gradient of the image when the depth and the image are obtained from the same scene, and may be calculated as follows. The depth smoothing error may equally be included in the second and third learning methods.
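One common edge-aware form of this idea weights each depth gradient by exp(-|image gradient|), so that depth changes are penalized less where the image itself has an edge. The sketch below assumes this form and a single-channel image. It is an illustration rather than the expression used in the embodiment.

```python
from math import exp


def depth_smoothness(depth, image):
    """Edge-aware smoothness over 2D lists of equal shape (grayscale image)."""
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # horizontal neighbour
                d_grad = abs(depth[r][c + 1] - depth[r][c])
                i_grad = abs(image[r][c + 1] - image[r][c])
                total += d_grad * exp(-i_grad)
                count += 1
            if r + 1 < h:  # vertical neighbour
                d_grad = abs(depth[r + 1][c] - depth[r][c])
                i_grad = abs(image[r + 1][c] - image[r][c])
                total += d_grad * exp(-i_grad)
                count += 1
    return total / count


depth_step = [[1.0, 1.0, 5.0, 5.0] for _ in range(2)]
flat_image = [[0.0, 0.0, 0.0, 0.0] for _ in range(2)]
edge_image = [[0.0, 0.0, 3.0, 3.0] for _ in range(2)]
```

The same depth step costs far less when the image has an edge at the same location (edge_image) than when the image is uniform (flat_image).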

The velocity supervision error may correspond to an error related to the difference between the magnitude of the absolute translation and the scale of the ego-motion translation, and may be calculated as follows. The velocity supervision error may be selectively included in the second and third learning methods.
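A minimal sketch of this comparison, assuming the absolute translation is obtained as measured vehicle speed multiplied by the frame interval (an assumption; the embodiment may derive it from IMU or GPS instead):

```python
from math import sqrt


def velocity_supervision_loss(translation, speed_mps, dt):
    """|‖T‖ - speed * dt| for a predicted ego-motion translation T."""
    predicted_distance = sqrt(sum(t * t for t in translation))
    return abs(predicted_distance - speed_mps * dt)


# 10 m/s over a 0.05 s frame interval corresponds to 0.5 m of travel.
loss_consistent = velocity_supervision_loss([0.3, 0.0, 0.4], 10.0, 0.05)
loss_overscaled = velocity_supervision_loss([0.0, 0.0, 1.0], 10.0, 0.05)
```

The first prediction has a norm of 0.5 m and incurs essentially no loss, while the second is off by 0.5 m in scale.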

Referring to FIG. 10B , compared to the first learning method, the second learning method may add a motion-vector to the output of the ego-motion estimation model. In this case, the ego-motion is a rigid-motion, and may correspond to a motion related to the movement of the camera. In contrast, the motion-vector may correspond to motion not related to the motion of the camera as a non-rigid motion. For example, a motion of a building seen in an image may correspond to a rigid body motion, and a motion of a vehicle or a person may correspond to a non-rigid motion.

Therefore, whereas the ego-motion translation is T ∈ R^{3×1} in the first learning method, in the second learning method the translation covering both the ego-motion and the motion vector may be T ∈ R^{H×W×3}.

In the second learning method, the reprojection error loss can be calculated as follows.

Here, the terms denote, respectively, the rigid motion rotation (ego-motion), the rigid motion translation (ego-motion), and the non-rigid motion translation (motion vector).
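One way to write the reprojection of the second learning method, assuming that the non-rigid translation is simply added per pixel to the rigid translation, is given below. The symbol names R_rigid, T_rigid, and T_nonrigid are placeholders chosen for this sketch.

```latex
p_s \sim K \left( R_{\mathrm{rigid}} \, D(p_t) \, K^{-1} p_t
      + T_{\mathrm{rigid}} + T_{\mathrm{nonrigid}}(p_t) \right),
\qquad
T_{\mathrm{rigid}} \in \mathbb{R}^{3 \times 1},\quad
T_{\mathrm{nonrigid}} \in \mathbb{R}^{H \times W \times 3}
```

Here p_t is a target pixel in homogeneous coordinates, D(p_t) is its estimated depth, and p_s is the corresponding source pixel. Setting T_nonrigid to zero recovers the rigid-only case of the first learning method.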

The motion regularization loss may correspond to the sum of a group loss and a sparsity loss. The group loss may serve to minimize changes in the non-rigid motion, and the sparsity loss may serve to make the non-rigid motion have a nearly constant magnitude. The motion regularization loss may be calculated as follows and may be optionally included in the third learning method.

Here, T := motion translation map ∈ R^{H×W×3}, and <|T|> := mean of T.
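The two terms can be sketched as follows. The specific forms are assumptions for illustration: an L1 penalty on spatial differences for the group term, and a square-root penalty normalized by the mean magnitude <|T|> for the sparsity term. They are not taken from the embodiment, and the small eps is added only to avoid division by zero.

```python
from math import sqrt


def motion_regularization(T, eps=1e-12):
    """Group + sparsity penalty for an H x W grid of [tx, ty, tz] vectors."""
    h, w = len(T), len(T[0])
    # Group term: L1 norm of differences between neighbouring pixels.
    group = 0.0
    for r in range(h):
        for c in range(w):
            for k in range(3):
                if c + 1 < w:
                    group += abs(T[r][c + 1][k] - T[r][c][k])
                if r + 1 < h:
                    group += abs(T[r + 1][c][k] - T[r][c][k])
    # Sparsity term per channel: 2 * <|T|> * sum(sqrt(1 + |T| / <|T|>)).
    sparsity = 0.0
    for k in range(3):
        mags = [abs(T[r][c][k]) for r in range(h) for c in range(w)]
        mean = sum(mags) / len(mags) + eps
        sparsity += 2.0 * mean * sum(sqrt(1.0 + m / mean) for m in mags)
    return group + sparsity


static_scene = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
uniform_motion = [[[1.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
noisy_motion = [[[3.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
```

A static scene incurs practically no penalty, a spatially uniform motion field incurs only the sparsity term, and a field with an outlier pixel is penalized by both terms.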

Referring to FIG. 10C, the third learning method may use an intrinsic parameter estimation model and the camera intrinsic parameters it produces. In the first and second learning methods, information on the intrinsic parameters of the actual camera must be secured in advance, whereas when the intrinsic parameter estimation model is used, as in the third learning method, the intrinsic parameters can be estimated from the image alone.

In the third learning method, the reprojection error loss can be calculated as follows.

Here, K is a camera intrinsic parameter estimated by an intrinsic parameter estimation model.
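To make the role of the estimated K concrete, the sketch below assembles a pinhole intrinsic matrix from four scalar outputs (focal lengths and principal point) and uses it to project a 3D point. The parameterization by fx, fy, cx, and cy is the usual pinhole convention and is assumed here. The numeric values stand in for the outputs of the estimation model.

```python
def intrinsic_matrix(fx, fy, cx, cy):
    """Assemble a 3x3 pinhole intrinsic matrix K."""
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]


def project(K, point):
    """Project a camera-frame 3D point (x, y, z) to pixel coordinates."""
    x, y, z = point
    return K[0][0] * x / z + K[0][2], K[1][1] * y / z + K[1][2]


# Placeholder values standing in for the intrinsic estimation model's output.
K_estimated = intrinsic_matrix(500.0, 500.0, 320.0, 240.0)
pixel = project(K_estimated, (1.0, 0.5, 5.0))
```

Because K enters the reprojection directly, an error in the estimated focal length or principal point changes the synthesized image and is therefore reflected in the reprojection error loss.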

Although the above has been described with reference to preferred embodiments of the present invention, those skilled in the art will understand that the present invention can be variously modified and changed without departing from the spirit and scope of the present invention described in the claims below.

[Description of reference numerals]

100: visual information processing system

110: client 130: server

210: vision sensors 220: additional sensors

230: network interface 240: artificial intelligence processor

250: image pre-processing processor 260: deep learning network

310: learning data receiving unit 330: learning model building unit

350: learning model distribution unit 370: deep learning analysis unit

390: learning model evaluation unit

Claims

1. A visual information processing method for estimating visual attributes in autonomous driving, comprising:

receiving training data including image data from a client;

building a learning model by learning the training data, wherein the learning model is implemented to receive an image and estimate visual properties including depth, motion, and intrinsic parameters;

deploying the learning model to the client;

receiving, from the client, a result of deep learning analysis performed by applying local data on the corresponding client to the learning model; and

evaluating stability of the learning model based on the deep learning analysis result.

2. The method of claim 1, wherein receiving the training data comprises:

collecting, on the client, related data that is synchronized with the time of generation of the image data and includes the intrinsic parameters, IMU, GPS, and vehicle speed; and

packaging the related data together with the image data on the client to generate the training data.

3. The method of claim 1, wherein building the learning model comprises:

independently estimating the depth and the motion based on the image;

estimating the intrinsic parameters based on the image or extracting the intrinsic parameters from the training data;

extracting IMU, GPS, and vehicle speed from the training data; and

repeatedly performing learning in a direction that minimizes a loss function based on the depth, the motion, the intrinsic parameters, the IMU, the GPS, and the vehicle speed.

4. The method of claim 3, wherein building the learning model comprises:

defining the loss function as a sum of a warping loss and a depth smoothing loss; and

redefining the loss function by selectively adding at least one of a velocity supervision loss and a motion regularization loss to the loss function.

5. The method of claim 1, wherein receiving the deep learning analysis result comprises online-learning the learning model based on training data composed of the local data on the client.

6. The method of claim 1, wherein receiving the deep learning analysis result further comprises receiving, from the client, a result of an application performed by applying the result of the deep learning analysis to an application on the corresponding client.

7. The method of claim 6, wherein evaluating the stability comprises calculating accuracy of the application based on the result of the application and evaluating the stability of the learning model according to the accuracy.

8. A visual information processing system for estimating visual attributes in autonomous driving, comprising:

a client that generates training data including image data and online-learns a learning model received from a server to generate a deep learning analysis result for local data; and

the server, which builds the learning model by learning the training data received from the client, wherein the learning model is implemented to receive an image and estimate visual properties including depth, motion, and intrinsic parameters, and which evaluates the stability of the learning model based on the deep learning analysis result received from the client.