WO2022165809A1 - Method and apparatus for training deep learning model - Google Patents

Method and apparatus for training deep learning model

Info

Publication number
WO2022165809A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
model
target object
virtual camera
label
Prior art date
Application number
PCT/CN2021/075856
Other languages
French (fr)
Chinese (zh)
Inventor
Liu Yang (刘杨)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/075856
Priority to CN202180000179.9A
Publication of WO2022165809A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions

Definitions

  • the embodiment of the present application uses the three-dimensional cockpit model of the cockpit as the background of the target object model in the simulation process.
  • compared with existing methods that generate training data by simply rendering a human body model, this improves the authenticity of the synthetic data (that is, the first image and its label, and the second image and its label).
  • during the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are adjusted, so that diverse, cockpit-specific synthetic data can be obtained.
  • the whole process requires no manual participation, which achieves the effect of obtaining diverse annotated training data at low cost and with high efficiency for new application scenarios.
  • using such annotated training data to train the model can effectively improve the generalization ability of the deep learning model and optimize the accuracy and effectiveness of the deep learning model.
  • FIG. 5 is a schematic diagram of another apparatus for training a deep learning model provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the relative position of the target object model and the virtual camera
  • Machine vision systems can convert imperfect, blurry, and constantly changing images into semantic representations. Figure 1 is a schematic diagram of a vision system identifying abnormal behavior: after an image to be recognized is input to the vision system, the system outputs the corresponding semantic information identifying the abnormal behavior in the image, namely "smoking".
  • Vision systems can be implemented by deep learning methods.
  • ImageNet is a large-scale computer vision recognition project, and its associated competition is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
  • In the ILSVRC image classification track (1,000 classes), the AlexNet model of the winning team led by Hinton reduced the top-5 error rate to 15.3%, far ahead of the 26.2% achieved by the second-place entry using an SVM-based algorithm. Since then, deep learning has developed rapidly.
  • the sources of real data generally include: 1) public datasets; 2) data collection and labeling suppliers; 3) self-collection.
  • synthetic data is used to construct a training dataset for the deep learning model.
  • Synthetic data refers to data generated by a simulated environment of a computer system, rather than data measured and collected from a real-world environment. Such data is anonymous and raises no privacy concerns, and because it is created from user-specified parameters, it can be made to match the characteristics of real-world data as closely as possible.
  • existing synthetic data does not consider the cockpit scene, so when a deep learning model trained with such synthetic data recognizes a cockpit scene image, it is difficult for the model to distinguish the foreground (that is, the target object to be detected) from the background, resulting in poor model accuracy.
  • the embodiments of the present application provide a method and apparatus for training a deep learning model, so as to obtain diverse annotated training data for a new application scenario (taking the cockpit as an example) at low cost and with high efficiency, and to improve the generalization ability of the deep learning model.
  • the model training device may also be a device composed of multiple parts, such as a scene creation module, a simulation module, and a training module, which is not limited in this application.
  • Each component can be deployed in different systems or servers.
  • each part of the apparatus may run in any of three environments: a cloud computing device system, an edge computing device system, or a terminal computing device; it may also run across any two of these three environments.
  • the cloud computing device system, the edge computing device system, and the terminal computing device are connected by a communication channel, and can communicate and transmit data with each other.
  • the method for training a deep learning model provided by the embodiment of the present application is performed by the combined parts of the model training apparatus running in the three environments (or any two of the three environments).
  • the 3D cockpit model may be obtained from an external source; for example, the 3D cockpit model corresponding to a car can be obtained from the car's manufacturer. It may also be generated by direct modeling, for example using dedicated software such as a three-dimensional modeling tool, or produced in other ways, which is not limited in this application.
  • the specific modeling method can be Non-Uniform Rational B-Splines (NURBS), polygon meshes, or the like; this application imposes no restrictions.
  • the internal environment may include the internal structure of the cockpit, such as the material, texture, shape, structure, and color of interior items such as seats, steering wheels, storage boxes, ornaments, seat cushions, floor mats, car pendants, interior decorations, and daily necessities (such as paper towels, water cups, and mobile phones).
  • the interior environment may also include a light field (or illumination) inside the cabin, such as illumination brightness, number of light sources, color of light sources, location of light sources or orientation of light sources, and the like.
  • the external environment includes the geographic location, shape, texture, color, etc. of the external environment.
  • if the three-dimensional cockpit model is located in an urban environment, there may also be urban buildings (such as tall buildings) and lane signs (such as traffic lights) outside the cockpit.
  • if the three-dimensional cockpit model is located in a field environment, there may also be flowers, plants, and trees outside the cockpit.
  • the shooting content of the virtual camera can thereby be made closer to the shooting content of a real camera in practical applications, which helps to improve the realism of the synthetic data (the synthetic data here refers to the labeled image output in step S602, which may also be referred to as a sample image herein).
  • the virtual camera in the embodiment of the present application describes the position, attitude, and other shooting parameters of the camera in the real cockpit. Based on these parameters of the virtual camera and the environmental parameters of the three-dimensional cockpit model, a two-dimensional image of the interior of the three-dimensional cockpit model can be output from the perspective of the virtual camera, achieving the same effect as a real camera in the real world shooting a two-dimensional image of the real cockpit interior.
  • FIG. 7 is an interior view of the car model from one perspective (looking from the inside out).
  • a virtual camera is placed inside the cockpit of the three-dimensional cockpit model, and the virtual camera can photograph the scene in the cockpit (at least an image of the main driving position can be captured).
  • let the Y-axis of the coordinate system of the simulation space be the vertical direction;
  • the vertically upward direction is the direction in which the Y-axis coordinate increases;
  • the Z-axis and X-axis lie in the horizontal plane.
  • the virtual camera is set in the coordinate system of the simulation space.
  • the shooting direction of the virtual camera is the direction in which the Z-axis coordinate becomes larger.
  • if the 3D cockpit model comes with a virtual camera, that virtual camera can be used directly, provided that it can output images of the cockpit interior, that is, it is capable of capturing the scene in the cockpit (at least an image of the primary driver's position).
  • positions 4 and 8 (for example, the inside of the rear door glass) can photograph a person in the rear seat;
  • positions 5 and 7 (for example, the backs of the front and rear seats) can also photograph a person in the rear seat;
  • position 11 is, for example, the rearview mirror.
  • the shooting parameters of the virtual camera also need to be set.
  • the shooting parameters may include one or more of resolution, distortion parameters, focal length, field of view, aperture, exposure time, and the like. If the pose of the virtual camera is adjustable, the shooting parameters may further include the position and pose of the virtual camera, where the pose of the virtual camera may be represented by Euler angles. Of course, the shooting parameters may also include other parameters, as long as they affect the shooting effect of the virtual camera, which is not limited in this application.
  • the shooting parameters can be set to default values, which is simple to implement and low in cost.
  • alternatively, the shooting parameters of the real camera in the real cockpit can be collected, and the collected shooting parameters are placed in the virtual camera.
  • the internal parameters of the on-board driver monitoring system (Driver Monitoring System, DMS) and cockpit monitoring system (Cockpit Monitoring System, CMS) cameras can be obtained by Zhang's calibration method, and then the internal parameters are placed in the virtual camera. In this way, the shooting effect of the virtual camera is closer to the shooting effect of the real camera in practical applications, thereby helping to improve the realism of the synthetic data.
  • the above two manners of determining the shooting parameters may be implemented independently, or may be implemented in combination with each other.
  • for example, the resolution may take a default value (such as 256x256), while the distortion parameters, focal length, field of view, aperture, and exposure time take the parameter values collected from the real camera.
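  • To make the parameter handling above concrete, the following is a minimal Python sketch of a shooting-parameter container that keeps default values and overwrites selected fields with values measured from the real cockpit camera (e.g. intrinsics obtained via Zhang's calibration). The class and function names are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class VirtualCameraParams:
    """Hypothetical container for the virtual camera's shooting parameters."""
    resolution: tuple = (256, 256)                   # default value, as in the example above
    focal_length_mm: float = 4.0                     # placeholder; may be replaced by the real-camera value
    fov_deg: float = 90.0                            # field of view
    distortion: tuple = (0.0, 0.0, 0.0, 0.0, 0.0)    # k1, k2, p1, p2, k3
    aperture_f: float = 2.0
    exposure_ms: float = 10.0

def apply_real_camera_params(cam: VirtualCameraParams, real: dict) -> VirtualCameraParams:
    """Overwrite selected defaults with parameters collected from the real cockpit camera."""
    cam.focal_length_mm = real.get("focal_length_mm", cam.focal_length_mm)
    cam.fov_deg = real.get("fov_deg", cam.fov_deg)
    cam.distortion = real.get("distortion", cam.distortion)
    cam.aperture_f = real.get("aperture_f", cam.aperture_f)
    cam.exposure_ms = real.get("exposure_ms", cam.exposure_ms)
    return cam
```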
  • the type of the target object model needs to be determined according to the target object that the deep learning model needs to recognize when it is applied.
  • for example, if the deep learning model needs to recognize the depth information of a human face in the image, the target object model can be a face model; if the deep learning model needs to recognize the posture of the human body in the image, the target object model can be a human body model. It should be understood that the above is only an example and not a limitation; the present application does not limit the specific type of the target object model, which may also be, for example, an animal (such as a dog or a cat).
  • loading the target object model into the three-dimensional cockpit model includes: randomly selecting one or more target object models from the model library, and loading the selected target object model into the three-dimensional cockpit model. It should be understood that when there are multiple target object models, the method steps performed for each target object model in this embodiment of the present application are similar. Therefore, the following description will mainly take one target object model as an example.
  • FIGS. 9A-9B are schematic diagrams of a high-quality human face model, where FIG. 9A is a front view and FIG. 9B is a side view. Since the target object models in the model library are all generated by scanning real target objects, the target object model can provide more texture detail and diversity, which improves its realism and in turn helps to improve the diversity and realism of the synthetic data.
  • the environmental parameters of the three-dimensional cockpit model may include parameters of the internal environment, such as the brightness of the light inside the cockpit, the number of light sources, the color of the light sources, or the position of the light sources, as well as the type, quantity, position, shape, structure, and color of interior items.
  • the environment parameters of the three-dimensional cockpit model may also include parameters of the external environment, such as the geographic location, shape, texture, color, etc. of the external environment. In this way, by changing the environmental parameters of the three-dimensional cockpit model, more diverse sample images (ie, images with labels, or synthetic data) can be obtained.
  • the pose parameters of the target object model include the position coordinates of the target object model (for example, the coordinate values (x, y, z) in the coordinate system of the simulation space) and/or the Euler angles of the target object model (including pitch angle, yaw angle, and roll angle). In this way, more diverse sample images can be obtained by changing the pose parameters of the target object model.
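  • As an illustration only, the environment parameters and pose parameters described above might be represented as simple data structures like the following Python sketch; all field names and default values are assumptions made for readability, not values taken from the patent.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EnvironmentParams:
    """Environment parameters of the 3D cockpit model (interior light field and exterior scene)."""
    light_brightness: float = 1.0                              # relative illumination brightness
    num_light_sources: int = 1
    light_color_rgb: Tuple[int, int, int] = (255, 255, 255)
    light_position: Tuple[float, float, float] = (0.0, 1.2, 0.5)
    exterior_scene: str = "urban"                              # e.g. "urban" or "field"

@dataclass
class PoseParams:
    """Pose parameters of the target object model: position plus Euler angles."""
    position_xyz: Tuple[float, float, float] = (0.0, 0.0, 500.0)  # millimetres, simulation frame
    pitch_deg: float = 0.0
    yaw_deg: float = 0.0
    roll_deg: float = 0.0
```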
  • the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model are adjusted, and several training images carrying labels are output, including:
  • Step 1: set the environment parameter of the three-dimensional cockpit model to the first environment parameter, and set the pose parameter of the target object model to the first pose parameter; output the first image based on the virtual camera in the three-dimensional cockpit model, and generate the label of the first image according to the scene information when the virtual camera outputs the first image, thereby obtaining the first image carrying its label.
  • Step 2: adjust the environment parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, for example, setting the environment parameters to the second environment parameters and the pose parameters to the second pose parameters; output the second image based on the virtual camera, and generate a label of the second image according to the scene information when the virtual camera outputs the second image, thereby obtaining the second image carrying its label.
  • because the environment parameters and/or pose parameters differ, the first image output in step 1 is different from the second image output in step 2, and the labels carried by the two images are also different.
  • the simulation script is the program code used to execute the simulation. When the simulation script is run by a computer, the computer can implement the following functions: randomly select a face model from the model library and load it into the 3D cockpit model; control the X and Y coordinates of the face model to vary randomly within plus or minus 100 mm and the Z coordinate to vary randomly within 200 to 800 mm; and control the pitch angle (Pitch) of the face model.
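  • The following Python sketch illustrates the kind of randomization such a simulation script performs. render_virtual_camera and generate_label are hypothetical stand-ins for the simulator's rendering and labelling calls, and the pitch range is an assumption because the original text truncates it.

```python
import random

def run_face_simulation(face_models, num_samples, render_virtual_camera, generate_label):
    """Randomly perturb the face model pose, render an image, and generate its label."""
    samples = []
    for _ in range(num_samples):
        face = random.choice(face_models)              # randomly select a face model
        pose = {
            "x_mm": random.uniform(-100.0, 100.0),     # X varies within +/- 100 mm
            "y_mm": random.uniform(-100.0, 100.0),     # Y varies within +/- 100 mm
            "z_mm": random.uniform(200.0, 800.0),      # Z varies within 200 to 800 mm
            "pitch_deg": random.uniform(-30.0, 30.0),  # assumed range; truncated in the text
        }
        image = render_virtual_camera(face, pose)      # RGB image from the virtual camera
        label = generate_label(face, pose)             # label from the scene information
        samples.append((image, label))
    return samples
```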
  • the synthetic data is obtained based on the output image of the virtual camera and the output label based on the scene information.
  • the target object model loaded in the three-dimensional cockpit model may also be updated during the simulation. For example, after outputting a preset number of images with labels based on the current target object model, re-select one or more target object models from the model library, and then load the re-selected target object models in the 3D cockpit model.
  • steps 1 and 2 above are then performed based on the reselected target object model to output more labeled images. In this way, the diversity of the synthetic data can be further improved.
  • the shooting parameters of the virtual camera may also be updated during the simulation. For example, after setting the environment parameter of the three-dimensional cockpit model to the first environment parameter, setting the pose parameter of the target object model to the first pose parameter, and outputting the first image carrying its label, the shooting parameters of the virtual camera are updated (for example, the resolution or exposure time is adjusted); then, based on the same environment and pose parameters (that is, keeping the environment parameters of the 3D cockpit model at the first environment parameters and the pose parameters of the target object model at the first pose parameters), a third image different from the first image is output. In this way, the diversity of the synthetic data can be further improved.
  • when the target object model can be divided into multiple parts whose poses are independent of one another, during the simulation process, in addition to adjusting the overall pose parameters of the target object model (that is, keeping the relative poses of the parts unchanged), the pose parameters of each part of the target object model can also be adjusted separately.
  • for example, when the target object model is a human body model, the overall pose parameters of the human body model may be adjusted; or only the pose parameters of the head may be adjusted; or only the pose parameters of the hands may be adjusted; or the head pose parameters and the hand pose parameters may be adjusted simultaneously. In this way, the flexibility of the simulation can be improved, and the diversity of the synthetic data can be further increased.
  • the most important role of the virtual camera is to determine the rendering effect (for example, the position of the image plane, the resolution of the image, and the field of view of the image) according to the shooting parameters of the virtual camera, so that the display effect of the rendered 2D image is the same as that of an image captured by a real camera using those shooting parameters, improving the realism of the RGB image.
  • the label of the image is used to describe the identification information of the target object in the cockpit image.
  • the label type of the image needs to be determined according to the output data (namely the identification information) that the deep learning model needs to map. For example, if the trained deep learning model is used to determine the depth information of the target object in the image to be recognized, the label includes the depth information of the target object model; if the trained deep learning model is used to determine the attitude of the target object in the image to be recognized, the label includes the attitude information of the target object model; and if the trained deep learning model is used to determine both the depth information and the attitude information of the target object in the image to be recognized, the label includes both the depth information and the pose information of the target object model.
  • other types of labels may also exist in practical applications, which are not limited in this application.
  • the embodiments of the present application use different specific methods for generating labels based on scene information, depending on the label type.
  • the following takes labels containing depth information and pose information as examples to introduce two possible methods for generating labels based on scene information.
  • Example 1: the label includes depth information.
  • Depth estimation is a fundamental problem in the field of computer vision. Many devices, such as depth cameras and millimeter-wave radars, can acquire depth directly, but automotive-grade high-resolution depth cameras are expensive. Binocular cameras can also be used for depth estimation, but because binocular images require stereo matching for pixel correspondence and parallax calculation, the computational complexity is high, and the matching effect is poor for low-texture scenes.
  • Monocular depth estimation is comparatively cheap and easy to deploy. As the name implies, it uses a single RGB image to estimate the distance of each pixel in the image relative to the shooting source.
  • the synthetic data generated in the embodiment of the present application may be an image carrying a depth information label, and may be used as training data for a supervised monocular depth estimation algorithm based on deep learning.
  • the trained monocular depth estimation algorithm can be used to estimate depth information based on RGB images.
  • the method for generating a depth information label according to scene information includes: according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs a given image, determining the depth information of the target object model relative to the virtual camera at that moment, and using this depth information as the label of the image.
  • generating the label of the first image according to the scene information when the virtual camera outputs the first image includes: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, the first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image.
  • generating the label of the second image according to the scene information when the virtual camera outputs the second image includes: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, the second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
  • each feature point on the target object model has its own 3D coordinates (x, y, z), and the depth corresponding to each feature point is not the straight-line distance from the feature point to the center of the virtual camera aperture, but the perpendicular distance from that point to the plane of the virtual camera aperture. Therefore, obtaining the depth information of each feature point on the target object model essentially means obtaining the perpendicular distance between each feature point and the plane where the virtual camera is located, which can be calculated from the position coordinates of the virtual camera and the position coordinates of the target object model.
  • assume the camera aperture center S is located at the origin O of the coordinate system of the simulation space, that is, the coordinates of S are (0, 0, 0), the Y-axis is the vertical direction, and the vertically upward direction is the positive Y direction.
  • the depth value corresponding to each feature point is then equal to the perpendicular distance between the feature point and the plane where the virtual camera is located, which equals the Z coordinate of the feature point. As can be seen from Figure 10, points A, B, and C on the face are at different positions: the straight-line distance from point B to the center of the virtual camera aperture is greater than that from point A, yet the perpendicular distance from point C to the plane of the virtual camera aperture is smaller than the perpendicular distances from points A and B to that plane; that is, the Z coordinates of points A and B are greater than the Z coordinate of point C (z3 < z1, z2), so the depth values of points A and B are greater than the depth value of point C.
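  • A minimal numerical sketch of this depth-label rule, assuming the feature-point coordinates have already been expressed in the virtual camera frame (aperture at the origin, shooting direction along +Z, units in millimetres); the depth is the Z coordinate, not the Euclidean distance to the aperture center.

```python
import numpy as np

def depth_label(points_cam_xyz: np.ndarray) -> np.ndarray:
    """Depth of each surface feature point: the perpendicular distance to the camera plane,
    i.e. the Z coordinate, not the straight-line distance to the aperture center."""
    return points_cam_xyz[:, 2]

# Illustration: a point with a larger straight-line distance need not have a larger depth.
pts = np.array([
    [0.0,   0.0, 600.0],   # on the optical axis
    [120.0, 80.0, 600.0],  # larger Euclidean distance, same Z, hence same depth
    [30.0,  20.0, 450.0],  # smaller Z, hence smaller depth
])
print(depth_label(pts))                # [600. 600. 450.]
print(np.linalg.norm(pts, axis=1))     # Euclidean distances differ from the depths
```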
  • the depth information corresponding to each RGB image may be expressed in the form of a depth map.
  • the RGB image and/or the depth map can be in JPG, PNG or JPEG format, which is not limited in this application.
  • the depth of each pixel in the depth map can be stored in 16 bits (two bytes) in millimeters.
  • FIGS. 11A-11C show an example of a set of synthetic data: FIG. 11A is an RGB image, FIG. 11B is the depth map corresponding to the RGB image in FIG. 11A, and FIG. 11C is a schematic diagram displaying FIG. 11B more intuitively in Matplotlib.
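  • A small sketch, assuming OpenCV and Matplotlib are available, of storing the per-pixel depth as a 16-bit PNG in millimetres and displaying it, similar to the visualization in FIG. 11C; the function names are illustrative.

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

def save_depth_png(depth_mm: np.ndarray, path: str) -> None:
    """Store per-pixel depth in millimetres as a 16-bit (two bytes per pixel) PNG."""
    cv2.imwrite(path, np.clip(depth_mm, 0, 65535).astype(np.uint16))

def show_depth(path: str) -> None:
    """Read the 16-bit depth map back unchanged and display it with Matplotlib."""
    depth = cv2.imread(path, cv2.IMREAD_UNCHANGED)   # keep the 16-bit values
    plt.imshow(depth, cmap="plasma")
    plt.colorbar(label="depth (mm)")
    plt.show()
```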
  • Example 2: the label includes pose information.
  • The attitude (pose) estimation problem is to determine the orientation of a three-dimensional target object.
  • Pose estimation has applications in many fields such as robot vision, motion tracking, and single-camera calibration.
  • the synthetic data generated in the embodiment of the present application may be an image carrying a pose information label, and may be used as training data for a supervised pose estimation algorithm based on deep learning.
  • the trained pose estimation algorithm can be used to estimate the pose of the target object based on the RGB image.
  • the method for generating the pose information label according to the scene information includes: using the pose parameters of the target object model at the time the virtual camera outputs a given image as the label of that image.
  • generating the label of the first image according to the scene information when the virtual camera outputs the first image includes: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image.
  • generating the label of the second image according to the scene information when the virtual camera outputs the second image includes: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
  • the attitude information may be represented by Euler angles.
  • the Euler angle may specifically be the rotation angle of the head around the three coordinate axes (ie, X, Y, and Z axes) of the simulation space coordinate system.
  • the rotation angle around the X axis is the pitch angle (Pitch)
  • the rotation angle around the Y axis is the yaw angle (Yaw)
  • the rotation angle around the Z axis is the roll angle (Roll).
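  • A hedged sketch of how such an Euler-angle label might be assembled from the rotation the simulator applies to the target object model; SciPy's Rotation is used here only for convenience and is not mandated by the patent, and the axis convention is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_label(rotation_matrix: np.ndarray, position_xyz: np.ndarray) -> dict:
    """Euler-angle pose label: pitch about X, yaw about Y, roll about Z (degrees),
    plus the position coordinates in the simulation frame."""
    pitch, yaw, roll = Rotation.from_matrix(rotation_matrix).as_euler("xyz", degrees=True)
    return {
        "pitch": float(pitch),
        "yaw": float(yaw),
        "roll": float(roll),
        "position": [float(v) for v in position_xyz],
    }
```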
  • training the model: train the deep learning model based on the labeled images to obtain a trained deep learning model.
  • the RGB image in each group of synthetic data is used as the input of the deep learning model, and the label corresponding to the RGB image is used as the target output of the deep learning model; training the deep learning model in this way yields the trained deep learning model (i.e., a monocular depth estimation algorithm).
  • the deep learning model can output the depth information of the face in the cockpit image, thereby assisting in the realization of gaze tracking, eye positioning and other functions.
  • the deep learning model can output the posture information of the human body in the cockpit image (such as Euler angles of the head, Euler angles of the hands, etc.), thereby assisting in the realization of human body motion tracking, driving distraction detection, and other functions.
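  • The following is a minimal PyTorch-style training loop for the supervised setup described above (RGB image as input, label as the regression target); the dataset, model, loss, and hyperparameters are placeholders, not the patent's implementation.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model: nn.Module, dataset, epochs: int = 10, lr: float = 1e-4, device: str = "cpu"):
    """Each batch pairs a synthetic RGB cockpit image with its label (e.g. a depth map
    or Euler angles); the model is fitted to map the image to the label."""
    model = model.to(device).train()
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()                      # placeholder regression loss
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```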
  • the embodiment of the present application uses the 3D model of the cockpit (that is, the 3D cockpit model) as the background of the target object model (such as a human body model or a face model) during the simulation process, creating realistic background, light, and shadow effects. Compared with existing methods that generate training data by simply rendering a human body model, this improves the authenticity of the synthetic data. During the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are set randomly, so that diverse, cockpit-specific synthetic data is obtained, which achieves the effect of obtaining diverse annotated training data at low cost and with high efficiency for new application scenarios.
  • the embodiments of the present application can effectively improve the generalization ability of the deep learning model, and optimize the accuracy and effectiveness of the deep learning model.
  • the shooting parameters of the real camera in the real cockpit can be placed in the virtual camera, so as to improve the realism of the images captured by the virtual camera and further optimize the precision and effectiveness of the deep learning model.
  • the embodiment of the present application can also generate the target object model from real human body/face scan data, so that the target object model provides more texture detail and diversity, thereby further improving the generalization ability of the deep learning model and optimizing its accuracy and effectiveness.
  • a depth estimation model (i.e., a deep learning model for estimating depth information) can be built on a semantic segmentation network (SegNet).
  • Figure 12 is a schematic diagram of the architecture of the SegNet model.
  • color dithering can be performed on the input image of the model as data enhancement.
  • the loss function of the model can be set as the average error (in millimeters) of the per-pixel depth estimate from the true depth.
  • the model was optimized using the quasi-Newton method (L-BFGS), and it was observed that the model converged significantly on the synthesized training data.
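  • A sketch, under stated assumptions, of the training configuration described above: color jitter as input augmentation, a loss equal to the average per-pixel absolute depth error in millimetres, and an L-BFGS optimizer. The SegNet-style model itself is taken as given; this is an illustrative reconstruction, not the original code.

```python
import torch
from torch import nn
from torchvision import transforms

# Data augmentation: jitter the color of the input images.
color_jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)

def mean_mm_error(pred_depth_mm: torch.Tensor, true_depth_mm: torch.Tensor) -> torch.Tensor:
    """Average per-pixel absolute difference between estimated and true depth, in millimetres."""
    return (pred_depth_mm - true_depth_mm).abs().mean()

def fit_lbfgs(model: nn.Module, images: torch.Tensor, depths_mm: torch.Tensor, steps: int = 50):
    """Optimize the depth estimation model with the quasi-Newton L-BFGS method."""
    images = color_jitter(images)  # jitter once here for simplicity; in practice applied per batch
    optimizer = torch.optim.LBFGS(model.parameters(), lr=0.5, max_iter=20)
    for _ in range(steps):
        def closure():
            optimizer.zero_grad()
            loss = mean_mm_error(model(images), depths_mm)
            loss.backward()
            return loss
        optimizer.step(closure)
    return model
```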
  • the test includes two aspects: qualitative test and quantitative test:
  • FIG. 13A is a synthetic picture generated by using the method of the embodiment of the present application, and the synthetic picture shown in FIG. 13A is input into the depth estimation model to obtain the depth map shown in FIG. 13B . From a qualitative point of view, the effect is ideal.
  • RGB photos are processed to obtain the target area (that is, the area where the target object is located, such as the face area), and other areas are masked, as shown in Figure 14. After the estimated depth map is obtained, only the target area of the estimated depth map and of the ground-truth depth map is compared, which improves the reliability of the test.
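  • A small sketch of the masked quantitative comparison described above: only pixels inside the target (e.g. face) area contribute to the error between the estimated and ground-truth depth maps; the mask is assumed to be provided by the rendering pipeline.

```python
import numpy as np

def masked_depth_error(pred_mm: np.ndarray, true_mm: np.ndarray, target_mask: np.ndarray) -> float:
    """Mean absolute depth error (mm) computed only over the target area; pixels outside
    the mask (the masked-out regions, as in Figure 14) are ignored."""
    mask = target_mask.astype(bool)
    if not mask.any():
        return float("nan")
    return float(np.abs(pred_mm[mask] - true_mm[mask]).mean())
```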
  • an embodiment of the present application further provides a model training apparatus, which includes a module/unit for executing the above method steps.
  • the apparatus 150 includes a processing module 1501 and a training module 1502;
  • the processing module 1501 is used for: loading a virtual camera and a target object model in the three-dimensional cockpit model, where the target object model is located within the shooting range of the virtual camera; acquiring the first image output by the virtual camera; generating the label of the first image from the scene information when the virtual camera outputs the first image, where the label of the first image is used to describe the identification information of the target object model in the first image; adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model; acquiring the second image output by the virtual camera; and generating the label of the second image from the scene information when the virtual camera outputs the second image, where the label of the second image is used to describe the identification information of the target object model in the second image;
  • the training module 1502 is used for: training the deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to identify the identification information of the target object in a newly input cockpit image.
  • the target object model is a human body model or a face model.
  • the processing module 1501 is further configured to: scan at least one real target object before loading the target object model in the three-dimensional cockpit model, obtain at least one target object model, and save the at least one target object model to the model library;
  • the processing module 1501 is specifically used for: randomly selecting one or more target object models from the model library, and loading one or more target object models into the 3D cockpit model.
  • when the processing module 1501 loads the virtual camera in the three-dimensional cockpit model, it is specifically used to: place the shooting parameters of the real camera in the real cockpit into the virtual camera, where the shooting parameters include at least one of resolution, distortion parameters, focal length, field of view, aperture, or exposure time.
  • the environmental parameters of the three-dimensional cockpit model include at least one of the light field, the interior, or the external environment of the three-dimensional cockpit model, where the light field includes at least one of the brightness of the light, the number of light sources, the color of the light sources, or the position of the light sources.
  • the pose parameters of the target object model include position coordinates and/or Euler angles.
  • the trained deep learning model is used to identify the depth information of the target object in the newly input cockpit image
  • when the processing module 1501 generates the label of the first image from the scene information when the virtual camera outputs the first image, it is specifically used for: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, the first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image;
  • when the processing module 1501 generates the label of the second image from the scene information when the virtual camera outputs the second image, it is specifically used for: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, the second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
  • the trained deep learning model is used to identify the pose information of the target object in the newly input cockpit image
  • when generating the label of the first image from the scene information when the virtual camera outputs the first image, the processing module 1501 is specifically used to: determine the pose parameters of the target object model when the virtual camera outputs the first image, and use those pose parameters as the label of the first image;
  • when generating the label of the second image from the scene information when the virtual camera outputs the second image, the processing module 1501 is specifically used to: determine the pose parameters of the target object model when the virtual camera outputs the second image, and use those pose parameters as the label of the second image.
  • the apparatus 160 includes a scene creation module 1601 , a simulation module 1602 , and a training module 1603 .
  • the scene creation module 1601 is used to establish a three-dimensional cockpit model, and a virtual camera is installed in the three-dimensional cockpit model.
  • the type of the cockpit is not limited in the present application, such as the cockpit of a car, the cockpit of a ship, the cockpit of an airplane, and the like.
  • the shooting parameters of the virtual camera may be default values or the shooting parameters of the real camera, which are not limited here.
  • the simulation module 1602 is used to load the target object model in the 3D cockpit model; by continuously adjusting the environment parameters of the 3D cockpit model and/or the pose parameters of the target object model, output several training images with labels.
  • the type of the target object is not limited in this application, such as human body, human face, and the like.
  • the label carried by a training image needs to be determined according to the output data that the deep learning model needs to map. For example, if the trained deep learning model is used to determine the depth information of the target object in the input cockpit image to be recognized, the label is the depth information of the target object model in the sample image; if the trained deep learning model is used to determine the pose of the target object in the input cockpit image to be recognized, the label is the pose of the target object model in the sample image.
  • the training module 1603 is similar in function to the above-mentioned training module 1502, and is used to train the deep learning model based on the labeled images output by the simulation module 1602 to obtain the trained deep learning model.
  • the image is the input of the deep learning model
  • the label is the output of the deep learning model.
  • the apparatus 160 may further include a data acquisition module, configured to scan a real target object to generate a target object model.
  • the apparatus 160 may further include a storage module for storing the synthesized data (ie, the image and the label of the image) output by the simulation module.
  • the apparatus 160 may further include a verification module for verifying the detection accuracy of the trained deep learning model.
  • an embodiment of the present application further provides a computing device system, where the computing device system includes at least one computing device 170 as shown in FIG. 17 .
  • the computing device 170 includes a bus 1701 , a processor 1702 and a memory 1704 .
  • a communication interface 1703 may also be included. In FIG. 17, a dashed box indicates that the communication interface 1703 is optional.
  • the processor 1702 in the at least one computing device 170 executes the computer instructions stored in the memory 1704, and can execute the methods provided in the above method embodiments.
  • the communication between the processor 1702 , the memory 1704 and the communication interface 1703 is through the bus 1701 .
  • the multiple computing devices 170 communicate through a communication path.
  • the processor 1702 may be a CPU.
  • Memory 1704 may include volatile memory, such as random access memory.
  • Memory 1704 may also include non-volatile memory such as read only memory, flash memory, HDD or SSD.
  • the memory 1704 stores executable code that is executed by the processor 1702 to perform any part or all of the aforementioned methods.
  • the memory may also include other software modules required for running processes such as an operating system.
  • the operating system can be LINUX™, UNIX™, WINDOWS™, and so on.
  • the memory 1704 stores any one or more modules of the aforementioned apparatus 150 or apparatus 160 (FIG. 17 takes the storage of one or more modules of the apparatus 150 as an example), and may also include other software modules required for running processes, such as an operating system.
  • the operating system can be LINUX™, UNIX™, WINDOWS™, and so on.
  • the multiple computing devices 170 establish communication with each other through a communication network, and each computing device runs any one or any multiple modules of the foregoing apparatus 150 or apparatus 160.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores instructions, and when the instructions are executed, the methods provided in the above method embodiments can be implemented.
  • an embodiment of the present application further provides a chip, which is coupled to a memory, and is used to read and execute program instructions stored in the memory, so as to implement the method provided by the above method embodiments.
  • the embodiments of the present application also provide a computer program product containing instructions.
  • the computer program product stores instructions that, when run on a computer, cause the computer to execute the method provided by the above method embodiments.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, where the instruction means implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for training a deep learning model. The method comprises: first loading a virtual camera and a target object model in a three-dimensional cabin model; then obtaining a first image output by the virtual camera and generating a label of the first image according to scene information when the virtual camera outputs the first image; then adjusting environmental parameters of the three-dimensional cabin model and/or pose parameters of the target object model, and after the adjustment, obtaining a second image output by the virtual camera and generating a label of the second image according to scene information when the virtual camera outputs the second image; and finally, using the first image, the label of the first image, the second image, and the label of the second image to train the deep learning model. This effectively improves the generalization capability of the deep learning model and optimizes its precision and effectiveness.

Description

A method and apparatus for training a deep learning model
Technical Field
The present application relates to the field of artificial intelligence (AI), and in particular, to a method and apparatus for training a deep learning model.
Background
In recent years, deep learning has penetrated into many areas of machine vision and brought significant improvements over earlier methods. The most successful of these is supervised deep learning. Supervised deep learning trains a deep learning model with an annotated training dataset, and the resulting model can then perform detection on new data in both test sets and real-world scenarios. However, supervised deep learning has an obvious limitation: before starting any vision project, a large amount of annotated training data must be prepared. The real-world application areas of machine vision are constantly expanding, so whenever a new application scenario appears, such as the cockpit scenario, training data must be prepared anew and a new deep learning model must be trained.
At present, for a new application scenario, real data is generally collected and annotated manually to obtain training data. This approach is time-consuming, labor-intensive, and expensive, and some training data cannot be annotated manually at all, resulting in poor model training results.
Therefore, for new application scenarios, how to obtain diverse annotated training data at low cost, and thereby obtain a deep learning model with high accuracy, is the technical problem to be solved by this application.
Summary of the Invention
The present application provides a method and apparatus for training a deep learning model, which can improve the generalization ability of the deep learning model and optimize its accuracy and effectiveness.
In a first aspect, a method for training a deep learning model is provided, including: loading a virtual camera and a target object model in a three-dimensional cockpit model, where the target object model is located within the shooting range of the virtual camera; acquiring a first image output by the virtual camera; generating a label of the first image from the scene information when the virtual camera outputs the first image, where the label of the first image is used to describe the identification information of the target object model in the first image; adjusting environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model; acquiring a second image output by the virtual camera; generating a label of the second image from the scene information when the virtual camera outputs the second image, where the label of the second image is used to describe the identification information of the target object model in the second image; and training the deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to identify the identification information of a target object in a newly input cockpit image.
For the cockpit scenario, the embodiment of the present application uses the three-dimensional cockpit model of the cockpit as the background of the target object model during the simulation process. Compared with the prior-art method of generating training data by simply rendering a human body model, this improves the authenticity of the synthetic data (that is, the first image and its label, and the second image and its label). During the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are adjusted, so that diverse, cockpit-specific synthetic data can be obtained. The whole process requires no manual participation, which achieves the effect of obtaining diverse annotated training data at low cost and with high efficiency for new application scenarios. Finally, training the model with such annotated training data can effectively improve the generalization ability of the deep learning model and optimize its accuracy and effectiveness.
In one possible design, the target object model is a human body model or a face model.
In one possible design, before the target object model is loaded into the three-dimensional cockpit model, at least one real target object may be scanned to obtain at least one target object model, and the at least one target object model is saved to a model library. Correspondingly, loading the target object model in the three-dimensional cockpit model includes: randomly selecting one or more target object models from the model library, and loading the one or more target object models into the three-dimensional cockpit model.
In this design, the target object model is generated from real human body/face scan data, so the target object model can provide more texture detail and diversity, which further improves the generalization ability of the deep learning model and optimizes its accuracy and effectiveness.
In one possible design, loading the virtual camera in the three-dimensional cockpit model may include placing the shooting parameters of a real camera in a real cockpit into the virtual camera, where the shooting parameters include at least one of resolution, distortion parameters, focal length, field of view, aperture, or exposure time.
In this design, placing the shooting parameters of the real camera in the real cockpit into the virtual camera makes the images output by the virtual camera more realistic, which further improves the generalization ability of the deep learning model and optimizes its accuracy and effectiveness.
In one possible design, the environmental parameters of the three-dimensional cockpit model include at least one of the light field, the interior, or the external environment of the three-dimensional cockpit model, where the light field includes at least one of the brightness of the light, the number of light sources, the color of the light sources, or the position of the light sources.
In this design, the light field, interior, or external environment of the three-dimensional cockpit model can be adjusted to create rich and varied backgrounds and lighting effects, making the images output by the virtual camera more diverse, which further improves the generalization ability of the deep learning model and optimizes its accuracy and effectiveness.
In one possible design, the pose parameters of the target object model include position coordinates and/or Euler angles.
In the embodiments of the present application, there may be multiple types of labels; in other words, the trained deep learning model may have multiple uses.
In a first possible design, the identification information is depth information, and the trained deep learning model is used to identify the depth information of the target object in a newly input cockpit image. Correspondingly, generating the label of the first image from the scene information when the virtual camera outputs the first image may include: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, the first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image. Generating the label of the second image from the scene information when the virtual camera outputs the second image may include: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, the second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
With this design, a high-precision deep learning model for recognizing the depth information of a target object in an image can be obtained.
In a second possible design, the identification information is attitude information, and the trained deep learning model is used to identify the attitude information of the target object in a newly input cockpit image.
Generating the label of the first image from the scene information when the virtual camera outputs the first image includes: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image.
Generating the label of the second image from the scene information when the virtual camera outputs the second image includes: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
With this design, a high-precision deep learning model for recognizing the attitude information of a target object in an image can be obtained.
It should be understood that the above two kinds of labels are only examples rather than limitations; other implementations are also possible.
In a second aspect, a model training apparatus is provided, including modules/units for performing the method steps described in the first aspect or any possible design of the first aspect.
Exemplarily, the apparatus includes:
a processing module, configured to: load a virtual camera and a target object model into a three-dimensional cockpit model, where the target object model is located within the shooting range of the virtual camera; obtain a first image output by the virtual camera; generate a label of the first image from the scene information at the time the virtual camera outputs the first image, where the label of the first image describes identification information of the target object model in the first image; adjust environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model; obtain a second image output by the virtual camera; and generate a label of the second image from the scene information at the time the virtual camera outputs the second image, where the label of the second image describes identification information of the target object model in the second image; and
a training module, configured to train a deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to identify identification information of a target object in a newly input cockpit image.
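Purely as an illustrative sketch (the class and method names below are assumptions, not an interface defined by this application), the two modules could be organized as a data-generating front end and a model-fitting back end:

```python
# Illustrative skeleton (assumed names) of the apparatus described above: a
# processing module that drives the simulation and labels images, and a training
# module that fits the deep learning model on the resulting samples.
from typing import Any, Iterable, Tuple

class ProcessingModule:
    def __init__(self, simulator: Any):
        self.sim = simulator  # wraps the 3D cockpit model, virtual camera, target model

    def generate(self, num_samples: int) -> Iterable[Tuple[Any, Any]]:
        for _ in range(num_samples):
            self.sim.randomize_environment_and_pose()  # assumed simulator call
            image = self.sim.capture()                 # image from the virtual camera
            label = self.sim.scene_label()             # label from the scene information
            yield image, label

class TrainingModule:
    def __init__(self, model: Any):
        self.model = model

    def fit(self, samples: Iterable[Tuple[Any, Any]]) -> Any:
        for image, label in samples:
            self.model.train_step(image, label)        # assumed model interface
        return self.model
```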
In a third aspect, a computing device system is provided, including at least one computing device, where each computing device includes a memory and a processor, and the processor is configured to execute computer instructions stored in the memory, so as to perform the method described in the first aspect or any possible design of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when executed, cause the method described in the first aspect or any possible design of the first aspect to be implemented.
In a fifth aspect, a chip is provided. The chip may be coupled to a memory and is configured to read and execute program instructions stored in the memory, so as to implement the method described in the first aspect or any possible design of the first aspect.
In a sixth aspect, a computer program product containing instructions is provided. When the instructions are run on a computer, they cause the computer to perform the method described in the first aspect or any possible design of the first aspect.
For the beneficial effects of the second to sixth aspects, refer to the beneficial effects of the corresponding designs of the first aspect; details are not repeated here.
Description of the drawings
FIG. 1 is a schematic diagram of a vision system identifying abnormal behavior;
FIG. 2 is a schematic diagram of feature learning in a deep neural network;
FIG. 3 is a schematic diagram of a hand image rendered from a model;
FIG. 4 is a schematic diagram of a deep learning model apparatus provided by an embodiment of this application;
FIG. 5 is a schematic diagram of another deep learning model apparatus provided by an embodiment of this application;
FIG. 6 is a flowchart of a method for training a deep learning model provided by an embodiment of this application;
FIG. 7 is an interior view of a car model from one viewing angle (looking from the inside out);
FIG. 8 is a cutaway view of a real car;
FIG. 9A and FIG. 9B are schematic diagrams of a high-quality face model;
FIG. 10 is a schematic diagram of the relative positions of a target object model and a virtual camera;
FIG. 11A to FIG. 11C are an example of a set of synthetic data;
FIG. 12 is a schematic diagram of the architecture of a semantic segmentation (SegNet) model;
FIG. 13A is a schematic diagram of a synthetic picture generated by the method of an embodiment of this application;
FIG. 13B is a depth map obtained by inputting the synthetic picture shown in FIG. 13A into a depth estimation model;
FIG. 14 is an RGB image processed by a semantic segmentation algorithm;
FIG. 15 is a schematic structural diagram of a model training apparatus 150 provided by an embodiment of this application;
FIG. 16 is a schematic structural diagram of a model training apparatus 160 provided by an embodiment of this application;
FIG. 17 is a schematic structural diagram of a computing device 170 provided by an embodiment of this application.
Detailed description of embodiments
The embodiments of this application are described in detail below with reference to the accompanying drawings.
A machine vision system can convert imperfect, blurry, and constantly changing images into semantic representations. FIG. 1 is a schematic diagram of a vision system identifying abnormal behavior: after an image to be recognized is input into the vision system, the vision system can output the corresponding semantic information and identify the abnormal behavior in the image, namely "smoking".
A vision system can be implemented by deep learning methods. In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC; ImageNet is the name of a computer vision recognition project), the AlexNet model from Hinton's team, the competition winner, reduced the top-5 error rate on the image classification track (1,000 classes) to 15.3%, far ahead of the 26.2% achieved by the second-place entry using an SVM algorithm. Deep learning has flourished ever since.
In recent years, deep learning has penetrated every area of machine vision and brought significant improvements over previous methods. The most successful branch is supervised deep learning: given data and its one-to-one corresponding labels, an algorithm is trained to map input data to labels. Supervised deep learning trains a model on an annotated training dataset; the resulting deep learning model (hereinafter simply "model") can recognize and classify new data in test sets and real scenarios.
The success of supervised deep learning in machine vision in recent years is mainly due to:
1) Large amounts of data: for example, the ImageNet image classification track provides millions of training samples. Because of their very large model capacity, deep neural networks (DNNs) easily overfit on small datasets. With large amounts of data, the overfitting problem of DNNs is effectively alleviated.
2) Massive computing power: processing large amounts of data requires massive computing power. Before AlexNet, training a deep learning network on ImageNet with an ordinary central processing unit (CPU) took years and was therefore impractical. Alex Krizhevsky's pioneering use of the Compute Unified Device Architecture (CUDA) interface on Nvidia graphics processing units (GPUs) successfully trained a deep neural network with breakthrough performance, ushering in the GPU era of artificial intelligence.
Supervised deep learning relies on two assumptions: 1) the training data and the real data are independent and identically distributed (IID); in probability theory and statistics, IID means that every variable in a set of random variables has the same probability distribution and that the variables are mutually independent; 2) the training data has sufficient coverage of the real data. Therefore, a large amount of high-quality training data is especially important for supervised deep learning.
Supervised deep learning has obvious limitations: before starting any vision project, a large amount of training data must be prepared. The training data and the real data must be independent and identically distributed; for example, if a model has been trained for an ordinary camera and is to be extended to a fisheye camera, data must be collected and the model trained again. The probability distribution in the real world may keep evolving, which requires continually preparing training data, retraining, and updating the deep learning model.
Therefore, when the above two assumptions are not sufficiently satisfied, misjudgments and misclassifications are hard to avoid (even though the differences may look insignificant to humans). When an object that is normally recognized with high accuracy is placed into a scene that is negatively correlated with the scenes in which the object usually appears in the training set, the vision system is easily misled. For example, such a system may fail to recognize a cow standing on a beach, because a cow almost never appears in a beach scene.
Illustratively, FIG. 2 is a schematic diagram of feature learning in a deep neural network: what humans perceive as the same animal (a cat) is mapped by a convolutional neural network (CNN) to points far apart in the embedding space, resulting in recognition errors. Faced with this problem, one can only blame the "data", yet there is no guarantee of how much data is needed to completely eliminate such annoying "corner cases".
It can be seen that supervised deep learning requires a large amount of high-quality annotated data to train a deep learning model.
The two main sources of training data for deep learning models are introduced below.
In one possible implementation, real data is used to construct the training dataset for the deep learning model.
Real data generally comes from: 1) public datasets; 2) data collection and annotation vendors; 3) self-collection.
However, using real data has the following drawbacks:
1) It is expensive. Purchasing data from an annotation vendor costs tens of thousands to millions. Since training a deep learning model requires large amounts of data, the total price is often high.
2) It is laborious and time-consuming. Clear documentation must be written and the annotation format specified, which often takes several months.
3) Some training data cannot be annotated manually, for example 3D keypoint data of human hands and pixel-level semantic segmentation data.
In another possible implementation, synthetic data is used to construct the training dataset for the deep learning model.
Synthetic data is data generated by a simulation environment of a computer system, rather than data measured and collected in a real-world environment. Such data is anonymous and raises no privacy concerns. Because it is created according to user-specified parameters, it can be made to share, as far as possible, the same characteristics as real-world data.
Creating synthetic data is more efficient and cheaper than collecting real data. This enables enterprises to experiment with new scenarios quickly and respond better to rapidly changing visual environments. Synthetic data can also complement real data. Some "corner case" recognition errors occur precisely because images of that category are absent or rare in the real data. In that case, this type of data can be synthesized under parameter control so that the distribution of the dataset is closer to the usage scenario.
However, creating a simulation environment for high-quality data synthesis is not easy. If the synthetic data differs greatly from the real data, a deep learning model trained on the synthetic data is unusable.
Existing synthetic data does not take the cockpit scene into account, so when a deep learning model trained with such synthetic data recognizes cockpit scene images, it has difficulty distinguishing the foreground (the target object to be detected) from the background, resulting in poor model accuracy.
Moreover, existing synthetic data is generated entirely by rendering 3D models. For example, in building a finger-tracking vision system, images of the hand in different poses are generated from a parametric hand model, and the background is randomly replaced for training. FIG. 3 is a schematic diagram of a hand image rendered from such a model. As can be seen from FIG. 3, features such as color and texture produced purely by rendering a body model lack sufficient diversity and realism, so a deep learning model trained on such synthetic data generalizes very poorly and cannot reach the expected performance in real scenarios.
In view of this, the embodiments of this application provide a method and apparatus for training a deep learning model, so as to obtain diverse annotated training data at low cost and high efficiency for a new application scenario (taking the cockpit as an example) and to improve the generalization ability of the deep learning model.
Before the method for training a deep learning model provided by the embodiments of this application is introduced, the system architecture to which the embodiments of this application apply is introduced.
The method for training a deep learning model provided by the embodiments of this application may be performed by a model training apparatus, and the location where the model training apparatus is deployed is not limited in the embodiments of this application. Exemplarily, as shown in FIG. 4, the apparatus may run on a cloud computing device system (including at least one cloud computing device, for example a server), on an edge computing device system (including at least one edge computing device, for example a server or a desktop computer), or on various terminal computing devices, such as laptops and personal desktop computers.
The model training apparatus may also be composed of multiple parts, for example a scene creation module, a simulation module, and a training module; this application imposes no limitation. The components may be deployed in different systems or servers. Exemplarily, as shown in FIG. 5, the parts of the apparatus may run in the three environments of the cloud computing device system, the edge computing device system, and the terminal computing device, or in any two of these three environments. The cloud computing device system, the edge computing device system, and the terminal computing device are connected by communication channels and can communicate and transmit data with each other. The method for training a deep learning model provided by the embodiments of this application is performed cooperatively by the component parts of the model training apparatus running in the three environments (or any two of them).
The method for training a deep learning model provided by this application is described in detail below with reference to the accompanying drawings.
Referring to FIG. 6, a flowchart of a method for training a deep learning model provided by an embodiment of this application includes the following steps.
S601. Scene creation: load a three-dimensional cockpit model in the simulation space, and load a virtual camera in the three-dimensional cockpit model.
The simulation space is a 3D simulation environment provided by simulation software. Each three-dimensional model loaded into the 3D simulation environment can be displayed with a 3D effect in the simulation space. Based on the simulation software, the display effect of each three-dimensional model in the simulation space can be controlled, for example by adjusting the environmental parameters of the three-dimensional cockpit model and the pose parameters of the target object model described later. Simulation software includes but is not limited to CARLA, AirSim, and PreScan.
The three-dimensional cockpit model is a polygon mesh representation of a cockpit that can be displayed by a computer or other video device. The displayed cockpit may be a real-world physical cockpit or a fictitious one. The type of cockpit is not limited in this application; it may be, for example, the cockpit of a car, a ship, or an aircraft. The interior of the three-dimensional cockpit model can accommodate at least one target object. Taking a car model as the three-dimensional cockpit model and a human body as the target object, the cockpit of the car has a driver's seat and a front passenger seat and can accommodate two people.
Loading the three-dimensional cockpit model in the simulation space means placing the three-dimensional cockpit model into the simulation space. Loading the virtual camera in the three-dimensional cockpit model means that the virtual camera is also placed into the simulation space and is located inside the three-dimensional cockpit model.
In the embodiments of this application, the three-dimensional cockpit model may be obtained directly from elsewhere; for example, when the three-dimensional cockpit model is the cockpit of a car, the corresponding model can be obtained from the car's manufacturer. It may also be generated by modeling directly, for example with dedicated software such as a 3D modeling tool, although other methods may also be used; this application imposes no limitation. When a cockpit model with three-dimensional data is constructed in a virtual three-dimensional space using a 3D modeling tool, the specific modeling method may be non-uniform rational B-splines (NURBS), polygon meshes, or the like; this application imposes no limitation.
It should be emphasized that when creating the three-dimensional cockpit model in the embodiments of this application, in addition to constructing the shape, structure, and color of the cockpit model itself, the rendering of the interior and exterior environments of the model must also be considered, because when a real camera captures a real cockpit image in practical applications, it captures not only the image of the target object but also part of the interior environment, and possibly images outside the cockpit (for example, the environment outside the car seen through the windows).
Optionally, the interior environment may include the internal structure of the cockpit, for example the material, texture, shape, structure, and color of the interior trim; the interior trim specifically includes, for example, seats, the steering wheel, storage boxes, ornaments, car seat cushions, car floor mats, car pendants, interior decorations, and daily necessities (such as tissues, cups, and mobile phones). The interior environment may also include the light field (or illumination) inside the cockpit, for example the illumination brightness, the number of light sources, the color of the light sources, the positions of the light sources, or the orientations of the light sources.
Optionally, the external environment includes its geographic location, shape, texture, color, and so on. For example, if the three-dimensional cockpit model is located in an urban environment, there may be urban buildings (such as tall buildings), lane signs (such as traffic lights), and other vehicles outside the cockpit; if the three-dimensional cockpit model is located in a rural environment, there may be flowers, grass, and trees outside the cockpit.
Because the embodiments of this application take the interior and exterior environment characteristics of the three-dimensional cockpit model into account when creating it, the content captured by the virtual camera can be made closer to what a real camera captures in practical applications, which helps to improve the realism of the synthetic data (here, synthetic data refers to the labeled images output in step S602, which may also be referred to as sample images).
The virtual camera in the embodiments of this application can describe the position, attitude, and other shooting parameters of a camera in a real cockpit. Based on these parameters of the virtual camera and the environmental parameters of the three-dimensional cockpit model, a two-dimensional image of the interior of the three-dimensional cockpit model from the viewpoint of the virtual camera can be output, achieving the effect of a real camera in the real world capturing a two-dimensional image of the real cockpit interior.
In the embodiments of this application, after the three-dimensional cockpit model is obtained, if it does not include a virtual camera, or if the virtual camera it includes is not suitable for this application, a virtual camera needs to be loaded into the three-dimensional cockpit model.
Referring to FIG. 7, which is an interior view of a car model from one viewing angle (looking from the inside out), the virtual camera is placed inside the cockpit of the three-dimensional cockpit model and can capture the scene inside the cockpit (at least an image of the driver's seat). Let the Y axis of the coordinate system of the simulation space be the vertical direction, with vertically upward being the direction of increasing Y, and let the Z and X axes lie in the horizontal plane. As shown in FIG. 7, the virtual camera is placed at the origin of the coordinate system of the simulation space, and its shooting direction points toward increasing Z.
If the three-dimensional cockpit model comes with a virtual camera, that virtual camera can be used directly. In addition, if the three-dimensional cockpit model has multiple virtual cameras, the virtual camera required by this application needs to be selected from them; the selected virtual camera can output images of the cockpit interior, that is, it can capture the scene inside the cockpit (at least an image of the driver's seat).
Referring to FIG. 8, which is a cutaway view of a real car, 13 positions where cameras can be installed are shown. The functions served by the cameras at these positions are as follows: position 1, driver monitoring system (DMS); position 2, around view monitor (AVM); position 3, B-pillar exterior face-recognition door unlocking; position 4, backseat monitoring system (BMS); position 5, rear-seat entertainment; position 6, AVM; position 7, rear-seat entertainment; position 8, BMS; position 9, AVM; position 10, digital video recorder (DVR); position 11, in-car (rearview mirror) high-definition photography; position 12, AVM; position 13, B-pillar exterior face-recognition door unlocking. The cameras at positions 1, 4, 5, 7, 8, and 11 face the interior of the car, and the cameras at positions 2, 3, 6, 9, 10, 12, and 13 face the exterior.
The position of the virtual camera used in the embodiments of this application within the three-dimensional cockpit model can refer to the deployment positions of the inward-facing cameras in the real car shown in FIG. 8, for example: position 1 (such as the inside of the glass in front of the steering wheel, which can capture the person in the driver's seat); positions 4 and 8 (such as the inside of the rear door glass, which can capture people in the rear seats); positions 5 and 7 (such as the backs of the front seats, which can capture people in the rear seats); and position 11 (such as the rearview mirror).
In the embodiments of this application, after the virtual camera is loaded into the three-dimensional cockpit model, its shooting parameters also need to be set. The shooting parameters may include one or more of resolution, distortion parameters, focal length, field of view, aperture, exposure time, and so on. If the pose of the virtual camera is adjustable, the shooting parameters may further include the position and attitude of the virtual camera, where the attitude of the virtual camera can be expressed in Euler angles. Of course, the shooting parameters may also include other parameters, as long as they affect the shooting effect of the virtual camera; this application imposes no limitation.
Optionally, the shooting parameters may be set to default values; this approach is simple to implement and low in cost.
Optionally, the shooting parameters may be obtained by collecting the shooting parameters of a real camera in a real cockpit, that is, collecting the shooting parameters of the real camera and applying them to the virtual camera. In a specific example, the intrinsic parameters of the in-vehicle driver monitoring system (DMS) and cockpit monitoring system (CMS) cameras can be obtained by Zhang's calibration method and then applied to the virtual camera. In this way, the shooting effect of the virtual camera is closer to that of a real camera in practical applications, which helps to improve the realism of the synthetic data.
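As a minimal, hedged sketch of this step: OpenCV's calibration routine is one common implementation of Zhang's method, and the checkerboard pattern size, square size, and image folder below are assumptions for illustration only; the original disclosure does not prescribe a specific tool.

```python
# A minimal sketch of estimating real-camera intrinsics with Zhang's calibration
# method via OpenCV, assuming a set of checkerboard images captured by the DMS/CMS
# camera is available under "calib/" (hypothetical folder).
import glob
import cv2
import numpy as np

PATTERN = (9, 6)          # inner corners of the checkerboard (assumed)
SQUARE_MM = 25.0          # physical square size in millimetres (assumed)

# 3D coordinates of the checkerboard corners in the board's own frame.
obj = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points, size = [], [], None
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    ok, corners = cv2.findChessboardCorners(gray, PATTERN)
    if ok:
        obj_points.append(obj)
        img_points.append(corners)
        size = gray.shape[::-1]

# K is the 3x3 intrinsic matrix and dist the distortion coefficients; these are the
# values that would then be copied into the virtual camera's shooting parameters.
rms, K, dist, _, _ = cv2.calibrateCamera(obj_points, img_points, size, None, None)
print("reprojection RMS:", rms)
print("intrinsics K:\n", K)
print("distortion:", dist.ravel())
```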
It should be understood that the above two ways of determining the shooting parameters can be implemented separately or in combination. When combined, for example, the resolution may use a default value (such as 256x256), while the distortion parameters, focal length, field of view, aperture, and exposure time use the values collected from a real camera.
S602. Real-time simulation: load the target object model into the three-dimensional cockpit model, adjust the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model multiple times, and output a number of labeled (two-dimensional) images.
In the embodiments of this application, the type of the target object model needs to be determined according to the target object that the deep learning model needs to recognize when applied. For example, if the deep learning model needs to recognize the depth information of a face in an image, the target object model may be a face model; if the deep learning model needs to recognize the posture of a human body in an image, the target object model may be a human body model. It should be understood that the above are merely examples rather than limitations; this application does not limit the specific type of the target object model, which may also be, for example, an animal (such as a dog or a cat).
Optionally, loading the target object model into the three-dimensional cockpit model includes: randomly selecting one or more target object models from a model library, and loading the selected target object model(s) into the three-dimensional cockpit model. It should be understood that when there are multiple target object models, the method steps performed for each model in the embodiments of this application are similar; therefore, the following description mainly takes one target object model as an example.
The embodiments of this application do not limit the source of the target object models in the model library.
Optionally, before the target object model is loaded into the three-dimensional cockpit model, at least one real target object is scanned, at least one target object model is generated based on the scanned real data, and the at least one target object model is saved to the model library. For example, a data acquisition device using Intel RealSense technology may be used to scan a real human face. FIG. 9A and FIG. 9B are schematic diagrams of a high-quality face model, where FIG. 9A is a front view and FIG. 9B is a side view. Because the target object models in the model library are generated by scanning real target objects, they can provide more texture detail and diversity, which improves the realism of the target object models and in turn helps to improve the diversity and realism of the synthetic data.
Optionally, the environmental parameters of the three-dimensional cockpit model may include parameters of the interior environment, for example the illumination brightness inside the cockpit, the number of light sources, the color of the light sources, or the positions of the light sources, as well as the category, quantity, position, shape, structure, and color of the interior trim. The environmental parameters may also include parameters of the external environment, such as its geographic location, shape, texture, and color. In this way, more diverse sample images (that is, labeled images, or synthetic data) can be obtained by changing the environmental parameters of the three-dimensional cockpit model.
Optionally, the pose parameters of the target object model include the position coordinates of the target object model (for example, the coordinate values (x, y, z) in the coordinate system of the simulation space) and/or its Euler angles (including pitch, yaw, and roll). In this way, more diverse sample images can be obtained by changing the pose parameters of the target object model.
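As an illustrative sketch only (the structure and names are assumptions, not a format defined by this application), the adjustable scene parameters could be represented as follows:

```python
# Illustrative sketch (assumed structure) of the adjustable scene parameters: the
# cockpit's light field / exterior setting and the target object's pose.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LightSource:
    position: Tuple[float, float, float]                 # simulation-space coordinates (mm)
    color_rgb: Tuple[float, float, float] = (1.0, 1.0, 1.0)
    brightness: float = 1.0

@dataclass
class EnvironmentParams:
    lights: List[LightSource] = field(default_factory=list)
    exterior_scene: str = "urban"                         # e.g. "urban" or "countryside" (assumed)

@dataclass
class PoseParams:
    position: Tuple[float, float, float]                  # (x, y, z) in the simulation space
    euler_deg: Tuple[float, float, float]                 # (pitch, yaw, roll) in degrees
```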
In the embodiments of this application, adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model and outputting a number of labeled training images includes the following steps.
Step 1: set the environmental parameters of the three-dimensional cockpit model to first environmental parameters and the pose parameters of the target object model to first pose parameters; output a first image based on the virtual camera in the three-dimensional cockpit model, and generate a label of the first image from the scene information at the time the virtual camera outputs the first image, obtaining the first image carrying its label.
Step 2: adjust the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, for example set the environmental parameters of the three-dimensional cockpit model to second environmental parameters and the pose parameters of the target object model to second pose parameters; output a second image based on the virtual camera, and generate a label of the second image from the scene information at the time the virtual camera outputs the second image, obtaining the second image carrying its label.
Note that in the process of adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, it must be ensured that the contours of the target object model and the three-dimensional cockpit model do not occupy the same space (for example, a human body may only move within the interior space of the car cockpit and must not intersect the car's shell), so as to improve the realism of the synthetic data.
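One simple way to enforce this constraint, shown here only as an assumed sketch (the original disclosure does not specify how the check is done, and a full mesh-mesh collision test would be more precise), is to reject sampled poses whose axis-aligned bounding box leaves the free interior volume of the cockpit:

```python
# A minimal sketch (assumption): reject a sampled pose if the target object's
# axis-aligned bounding box leaves the free interior volume of the cockpit,
# as a cheap stand-in for a full mesh collision test.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AABB:
    min_xyz: Tuple[float, float, float]  # lower corner in simulation-space coordinates (mm)
    max_xyz: Tuple[float, float, float]  # upper corner

def inside(inner: AABB, outer: AABB) -> bool:
    """True if `inner` lies entirely within `outer` (the cockpit's free space)."""
    return all(outer.min_xyz[i] <= inner.min_xyz[i] and
               inner.max_xyz[i] <= outer.max_xyz[i] for i in range(3))

# Hypothetical bounds: the cabin's free interior and a candidate head placement.
cabin_free_space = AABB((-800, -500, 150), (800, 600, 900))
head_at_candidate_pose = AABB((-120, -150, 350), (120, 150, 650))

if not inside(head_at_candidate_pose, cabin_free_space):
    pass  # resample the pose parameters instead of rendering this frame
```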
Because the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model differ between Step 1 and Step 2, the first image output in Step 1 differs from the second image output in Step 2, and the labels they carry also differ. By repeatedly performing Step 1 and Step 2 (that is, continuously changing the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model), more varied labeled images can be obtained. When the number of labeled images reaches a certain value, they can be used as training samples to train the deep learning model, that is, S603 is performed.
To facilitate understanding of the simulation process in the embodiments of this application, a specific example is given here. A simulation script is program code used to run the simulation. When the simulation script is run by a computer, the computer can implement the following functions: randomly select a face model from the model library and load it into the three-dimensional cockpit model; let the X and Y coordinates of the face model vary randomly within plus or minus 100 millimetres and the Z coordinate vary randomly between 200 and 800 millimetres; let the pitch angle of the face model vary randomly within plus or minus 20 degrees and its yaw angle within plus or minus 40 degrees; let the illumination inside the cockpit vary randomly between 1 and 3; and, while the pose of the face model and the illumination brightness change, output images from the virtual camera and labels from the scene information, obtaining the synthetic data.
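A hedged sketch of such a script is given below. The simulator interface (`load_face_model`, `set_pose`, `set_light_intensity`, `render`, `scene_label`) and the sample count are assumptions chosen for readability; only the parameter ranges come from the example above.

```python
# Illustrative sketch of the simulation script described above; the simulator API
# is assumed, not defined by the original disclosure.
import random

NUM_SAMPLES = 10_000  # assumed dataset size

def generate_samples(sim, model_library):
    face = sim.load_face_model(random.choice(model_library))
    for _ in range(NUM_SAMPLES):
        # Pose ranges from the example above (millimetres / degrees).
        x = random.uniform(-100.0, 100.0)
        y = random.uniform(-100.0, 100.0)
        z = random.uniform(200.0, 800.0)
        pitch = random.uniform(-20.0, 20.0)
        yaw = random.uniform(-40.0, 40.0)
        sim.set_pose(face, position=(x, y, z), euler_deg=(pitch, yaw, 0.0))

        # Illumination inside the cockpit varies randomly between 1 and 3.
        sim.set_light_intensity(random.uniform(1.0, 3.0))

        image = sim.render()            # RGB image from the virtual camera
        label = sim.scene_label(face)   # e.g. a depth map or pose parameters
        yield image, label
```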
Optionally, the target object model loaded in the three-dimensional cockpit model may also be updated during the simulation. For example, after a preset number of labeled images have been output based on the current target object model, one or more target object models are reselected from the model library, the reselected target object model(s) are loaded into the three-dimensional cockpit model, and Step 1 and Step 2 are performed for them to output more labeled images. In this way, the diversity of the synthetic data can be further improved.
Optionally, the shooting parameters of the virtual camera may also be updated during the simulation. For example, after the environmental parameters of the three-dimensional cockpit model are set to the first environmental parameters, the pose parameters of the target object model are set to the first pose parameters, and the first image carrying its label is output, the shooting parameters of the virtual camera are updated (for example by adjusting the resolution or exposure time); then, based on the same environmental and pose parameters (that is, keeping the environmental parameters of the three-dimensional cockpit model at the first environmental parameters and the pose parameters of the target object model at the first pose parameters), a third image different from the first image is output. In this way, the diversity of the synthetic data can be further improved.
Optionally, when the target object model can be divided into multiple parts whose postures are independent of each other, during the simulation the pose parameters of the individual parts of the target object model may be adjusted separately, in addition to adjusting the pose parameters of the target object model as a whole (that is, keeping the relative poses between the parts unchanged). For example, when the target object model is a human body model, the pose parameters of the whole body may be adjusted, or only those of the head, or only those of the hands, or those of the head and hands simultaneously, and so on. In this way, the flexibility of the simulation can be increased and the diversity of the synthetic data further improved.
The principle of outputting an image based on the virtual camera is briefly introduced here: each feature point (or vertex) on the target object model is projected onto a pixel of the image plane of the virtual camera, yielding a color (Red Green Blue, RGB) image; this process can be implemented with computer graphics (CG) techniques.
It should be emphasized that when an RGB image is output based on the virtual camera in the embodiments of this application, the most important role of the virtual camera is to determine the rendering effect according to its shooting parameters (for example the position of the image plane, the resolution of the image, and the field of view of the image), so that the rendered 2D image looks the same as an image a real camera would capture with those shooting parameters, improving the realism of the RGB image.
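As a minimal sketch of the projection step only (assuming an ideal pinhole camera with no lens distortion, which is a simplification of what a full CG renderer does), a vertex expressed in the camera frame maps to a pixel as follows:

```python
# A minimal sketch (ideal pinhole model, no distortion) of how a vertex in the
# camera frame maps to a pixel on the virtual camera's image plane.
import numpy as np

def project_points(vertices_cam, K):
    """vertices_cam: (N, 3) points in the camera coordinate frame (Z forward, mm).
    K: 3x3 intrinsic matrix. Returns (N, 2) pixel coordinates."""
    v = np.asarray(vertices_cam, dtype=float)
    uvw = (K @ v.T).T                 # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide by the depth Z

# Example intrinsics for an assumed 256x256 virtual camera (fx = fy = 300 px).
K = np.array([[300.0,   0.0, 128.0],
              [  0.0, 300.0, 128.0],
              [  0.0,   0.0,   1.0]])

pixels = project_points([[50.0, -20.0, 600.0]], K)   # one vertex 600 mm in front
print(pixels)   # -> approximately [[153., 118.]]
```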
In the embodiments of this application, the label of an image is used to describe the identification information of the target object in the cockpit image. The label type needs to be determined according to the output data (that is, the identification information) to which the deep learning model is required to map. For example, if the trained deep learning model is used to determine the depth information of the target object in an image to be recognized, the label includes the depth information of the target object model; if the trained deep learning model is used to determine the pose information of the target object in a cockpit image to be recognized, the label includes the pose information of the target object model; if the trained deep learning model is used to determine both the depth information and the pose information of the target object in an image to be recognized, the label includes both. Of course, other label types are possible in practical applications; this application imposes no limitation.
It should be noted that the specific method of generating a label from the scene information differs for different label types. Taking depth information and pose information as examples, two possible methods of generating labels from scene information are described below.
Example 1: the label includes depth information.
Depth estimation is a fundamental problem in the field of computer vision. Many devices, such as depth cameras and millimetre-wave radars, can acquire depth directly, but automotive-grade high-resolution depth cameras are expensive. Depth can also be estimated with a binocular setup, but because binocular images require stereo matching for pixel correspondence and disparity computation, the computational complexity is high, and the matching quality is poor for low-texture scenes in particular. Monocular depth estimation, by contrast, is cheaper and easier to deploy widely. Monocular depth estimation, as the name implies, uses a single RGB image to estimate the distance of each pixel from the capturing source. The human visual system, thanks to a wealth of prior knowledge, can extract a large amount of depth information from the image obtained by a single eye. Monocular depth estimation therefore needs not only to learn direct depth cues from the two-dimensional image but also to extract indirect information related to the camera and the scene to assist more accurate estimation. The synthetic data generated in the embodiments of this application may be images carrying depth information labels, and can serve as training data for a supervised, deep-learning-based monocular depth estimation algorithm. The trained monocular depth estimation algorithm can be used to estimate depth information from RGB images.
Specifically, the method of generating a depth information label from the scene information includes: according to the relative positions of the feature points on the surface of the target object model and the virtual camera at the time the virtual camera outputs an image, determining the depth information of the target object model relative to the virtual camera at that time, and using that depth information as the label of the image.
Illustratively, take the first image and the second image described above as an example. Generating the label of the first image from the scene information at the time the virtual camera outputs the first image includes: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image. Generating the label of the second image from the scene information at the time the virtual camera outputs the second image includes: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
It should be noted that every feature point on the target object model has its own 3D coordinates (x, y, z), and the depth corresponding to a feature point is not the straight-line distance from that point to the centre of the virtual camera's aperture, but the perpendicular distance from the point to the plane in which the virtual camera's aperture lies. Obtaining the depth information of every feature point on the target object model therefore amounts to obtaining the perpendicular distance from each feature point to the plane of the virtual camera, which can be computed from the position coordinates of the virtual camera and the position coordinates of the target object model.
Illustratively, referring to FIG. 10, let the centre S of the camera aperture be at the origin O of the coordinate system of the simulation space, that is, S has coordinates (0, 0, 0); let the Y axis be vertical with upward being the direction of increasing Y, let the Z and X axes lie in the horizontal plane, let the shooting direction of the virtual camera point toward increasing Z, and let the plane of the virtual camera's aperture be the plane spanned by the X and Y axes through O. Then the depth value of each feature point equals its perpendicular distance to the camera plane, which equals its Z coordinate. As can be seen from FIG. 10, points A and B on the face are at different positions, and the straight-line distance from B to the aperture centre is greater than that from A, yet the perpendicular distance from A to the aperture plane equals that from B, that is, A and B have the same Z coordinate (z1 = z2), so their depth values are equal. The perpendicular distance from C to the aperture plane is smaller than those from A and B, that is, the Z coordinate of A and B is greater than that of C (z3 < z1, z2), so the depth values of A and B are greater than the depth value of C.
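A minimal sketch of this computation (an assumption for illustration, not taken from the original disclosure): with the camera looking along +Z, the depth label of each vertex is simply its Z coordinate after transforming it into the camera frame.

```python
# Depth of each vertex as the perpendicular distance to the camera plane (its Z
# coordinate in the camera frame), not the Euclidean distance to the aperture.
import numpy as np

def vertex_depths(vertices_world, cam_rotation, cam_position):
    """vertices_world: (N, 3) vertices in simulation-space coordinates.
    cam_rotation: 3x3 world-to-camera rotation; cam_position: camera origin (3,)."""
    v_cam = (np.asarray(vertices_world) - cam_position) @ cam_rotation.T
    return v_cam[:, 2]

# Points A, B, C as in FIG. 10 (camera at the origin, so the rotation is identity):
pts = np.array([[ 100.0,  50.0, 600.0],   # A
                [-150.0,  80.0, 600.0],   # B  (same Z as A -> same depth)
                [  20.0, -30.0, 400.0]])  # C  (smaller Z -> smaller depth)
print(vertex_depths(pts, np.eye(3), np.zeros(3)))   # -> [600. 600. 400.]
```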
In this embodiment of the present application, the depth information corresponding to each RGB image may be represented in the form of a depth map. The RGB image and/or the depth map may use a format such as JPG, PNG or JPEG, which is not limited in this application. The depth of each pixel in the depth map may be stored in 16 bits (two bytes), in units of millimeters. Illustratively, FIG. 11A and FIG. 11B are an example of one set of synthetic data, where FIG. 11A is an RGB image, FIG. 11B is the depth map corresponding to the RGB image shown in FIG. 11A, and FIG. 11C is a more intuitive visualization of FIG. 11B produced with Matplotlib.
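Illustratively, a minimal sketch of storing and visualizing such a 16-bit millimeter depth map is given below. It is not part of the original disclosure; the use of OpenCV for 16-bit PNG I/O and the random data are assumptions, while the Matplotlib rendering mirrors the kind of visualization referred to for FIG. 11C:

    import cv2                      # assumed available; cv2.imwrite supports 16-bit PNG
    import numpy as np
    import matplotlib.pyplot as plt

    # Depth in millimeters, one value per pixel, stored as 16-bit unsigned integers.
    depth_mm = np.random.uniform(400, 1200, size=(480, 640)).astype(np.uint16)

    cv2.imwrite("depth.png", depth_mm)                       # lossless 16-bit PNG
    loaded = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)   # keep the 16-bit values

    # A more intuitive rendering, similar in spirit to FIG. 11C.
    plt.imshow(loaded, cmap="viridis")
    plt.colorbar(label="depth (mm)")
    plt.savefig("depth_visualization.png")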
Example 2: the label includes pose information.
The pose estimation problem is to determine the orientation of a three-dimensional target object. Pose estimation is used in many fields such as robot vision, motion tracking and single-camera calibration. The synthetic data generated in the embodiments of the present application may be images carrying pose information labels, and may serve as training data for a supervised, deep-learning-based pose estimation algorithm. The trained pose estimation algorithm can be used to estimate the pose of a target object from an RGB image.
Specifically, the method for generating a pose information label according to the scene information includes: obtaining the pose parameters of the target object model at the time the virtual camera outputs a given image, and using those pose parameters as the label of that image.
Illustratively, take the first image and the second image described above as examples. Generating the label of the first image according to the scene information at the time the virtual camera outputs the first image includes: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image. Generating the label of the second image according to the scene information at the time the virtual camera outputs the second image includes: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
In this embodiment of the present application, the pose information may be represented by Euler angles. Specifically, the Euler angles may be the rotation angles of the head around the three coordinate axes (i.e. the X, Y and Z axes) of the simulation space coordinate system, where the rotation angle around the X axis is the pitch angle (Pitch), the rotation angle around the Y axis is the yaw angle (Yaw), and the rotation angle around the Z axis is the roll angle (Roll).
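Illustratively, one way such a pose label could be written out per frame is sketched below. This is not part of the original disclosure: the availability of SciPy, the assumption that the simulator exposes the head orientation as a rotation matrix, and the file layout are all illustrative choices.

    from scipy.spatial.transform import Rotation  # assumed available
    import json

    # R_head: 3x3 rotation matrix of the head in the simulation frame (an assumption;
    # the actual simulation API is not specified in the text).
    def pose_label(R_head, position_xyz):
        pitch, yaw, roll = Rotation.from_matrix(R_head).as_euler("XYZ", degrees=True)
        return {
            "position": list(position_xyz),   # (x, y, z) in the simulation frame
            "pitch": float(pitch),            # rotation around X
            "yaw": float(yaw),                # rotation around Y
            "roll": float(roll),              # rotation around Z
        }

    # Example: identity orientation at an assumed driver-seat position.
    label = pose_label([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0.4, 1.1, 0.6))
    json.dump(label, open("frame_0001_label.json", "w"))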
S603. Model training: train the deep learning model based on the labeled images to obtain a trained deep learning model.
Illustratively, taking depth information labels as an example, the RGB image in each set of synthetic data is used as the input of the deep learning model and the label corresponding to that RGB image as its target output, and the deep learning model is trained on these pairs. The trained deep learning model (i.e. a monocular depth estimation algorithm) can then be used to recognize depth information from RGB images. For example, after a real cockpit image to be recognized is input into the deep learning model, the model can output the depth information of the face in that cockpit image, thereby assisting functions such as gaze tracking and eye positioning.
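Illustratively, a minimal sketch of this supervised training step is given below. It is not the implementation of this application: the use of PyTorch, the network, the dataset yielding (RGB, depth-label) pairs and the hyperparameters are all placeholders.

    import torch
    from torch.utils.data import DataLoader

    def train_depth_model(model, dataset, epochs=10, lr=1e-4, device="cuda"):
        loader = DataLoader(dataset, batch_size=8, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        model.to(device).train()
        for _ in range(epochs):
            for rgb, depth_label in loader:
                rgb, depth_label = rgb.to(device), depth_label.to(device)
                pred = model(rgb)                                  # predicted depth map, in mm
                loss = torch.mean(torch.abs(pred - depth_label))   # mean per-pixel error
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model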
Illustratively, taking pose information labels as an example, the RGB image in each set of synthetic data is used as the input of the deep learning model and the label corresponding to that RGB image as its target output, and the deep learning model is trained on these pairs. The trained deep learning model (i.e. a pose estimation algorithm) can then be used to recognize pose information from RGB images. For example, after a cockpit image to be recognized is input into the deep learning model, the model can output the pose information of the human body in that cockpit image (for example the Euler angles of the head, the Euler angles of the hands, etc.), thereby assisting functions such as human motion tracking and driver distraction detection.
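Illustratively, for the pose-label variant the network regresses Euler angles rather than a depth map; a minimal loss sketch (not part of the original disclosure; names, shapes and values are assumptions) is:

    import torch

    # pred_angles, label_angles: tensors of shape (batch, 3) holding pitch, yaw, roll
    # in degrees; training minimizes their mean absolute error.
    def pose_loss(pred_angles, label_angles):
        return torch.mean(torch.abs(pred_angles - label_angles))

    pred = torch.tensor([[5.0, -10.0, 2.0]])
    label = torch.tensor([[4.0, -12.0, 0.0]])
    print(pose_loss(pred, label))   # tensor(1.6667)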
Based on the above description, the embodiments of the present application, for the cockpit scenario, use a 3D model of the cockpit (i.e. the three-dimensional cockpit model) as the background of the target object model (such as a human body model or a face model) during the simulation, creating a realistic background and realistic lighting effects. Compared with the prior-art approach of generating training data solely by rendering a human body model, this improves the realism of the synthetic data. During the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are set randomly, so that diverse synthetic data unique to the cockpit domain can be obtained, achieving low-cost and efficient generation of diverse annotated training data for new application scenarios. The embodiments of the present application can effectively improve the generalization ability of the deep learning model and optimize its accuracy and effectiveness.
Furthermore, when the virtual camera is set up in the simulation environment in the embodiments of the present application, the shooting parameters of the real camera in the real cockpit can be placed into the virtual camera, improving the realism of the images captured by the virtual camera and further optimizing the accuracy and effectiveness of the deep learning model.
Furthermore, the embodiments of the present application can also generate the target object model based on real human body/face scan data, so that the target object model provides more texture detail and diversity, thereby further improving the generalization ability of the deep learning model and optimizing its accuracy and effectiveness.
To test the usability of the synthetic data generated in the embodiments of the present application, a depth estimation model (i.e. a deep learning model for estimating depth information) can be designed on the basis of a semantic segmentation network (SegNet). FIG. 12 is a schematic diagram of the architecture of the SegNet model. During model training, color jitter can be applied to the model's input images as data augmentation. The loss function of the model can be set to the average error (in millimeters) between the per-pixel depth estimate and the true depth. When the model is optimized with the quasi-Newton method (L-BFGS), clear convergence on the synthetic training data can be observed.
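Illustratively, the training setup just described (color jitter as augmentation, a mean per-pixel millimeter error as loss, L-BFGS as optimizer) could be sketched as follows. This is not the implementation used in the application; the SegNet-based network itself is not reproduced, and `model`, `rgb_batch`, `depth_batch` and the hyperparameters are assumptions.

    import torch
    from torchvision import transforms

    augment = transforms.ColorJitter(brightness=0.3, contrast=0.3,
                                     saturation=0.3, hue=0.05)

    def fit_lbfgs(model, rgb_batch, depth_batch, steps=20):
        optimizer = torch.optim.LBFGS(model.parameters(), lr=0.5, max_iter=steps)

        def closure():
            optimizer.zero_grad()
            pred = model(augment(rgb_batch))                  # augmented RGB in, depth out
            loss = torch.mean(torch.abs(pred - depth_batch))  # average per-pixel error (mm)
            loss.backward()
            return loss

        return optimizer.step(closure)                        # returns the final loss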
The tests cover two aspects, a qualitative test and a quantitative test:
1) The detection performance of the depth estimation model is tested qualitatively with synthetic data generated by the embodiments of the present application. For example, FIG. 13A is a synthetic picture generated with the method of the embodiments of the present application; inputting the synthetic picture shown in FIG. 13A into the depth estimation model yields the depth map shown in FIG. 13B. From a qualitative point of view, the result is satisfactory.
2) The detection performance of the depth estimation model is tested quantitatively with real data: a depth camera is used to photograph a real target object, obtaining a real RGB photo and the target depth map corresponding to that photo; the real RGB photo is input into the depth estimation model to obtain an estimated depth map; and the estimated depth map is compared with the target depth map to obtain the quantitative test result.
It should be noted that directly comparing the estimated depth map with the target depth map may introduce considerable background interference. Therefore, before the real RGB photo is input into the depth estimation model, a semantic segmentation algorithm can first be used to process the RGB photo to obtain the target region (i.e. the region where the target object is located, such as the face region), with a mask applied to the other regions, as shown in FIG. 14. After the estimated depth map is obtained, only the target regions of the estimated depth map and the target depth map are compared, which improves the reliability of the test.
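Illustratively, the masked comparison can be expressed as averaging the error only over the segmented target region; a minimal sketch (not part of the original disclosure; arrays and values are assumptions, all depths in millimeters) is:

    import numpy as np

    def masked_mean_depth_error(estimated_mm, target_mm, mask):
        """mask: boolean array, True inside the target region (e.g. the face)."""
        diff = np.abs(estimated_mm.astype(np.float64) - target_mm.astype(np.float64))
        return diff[mask].mean()

    estimated = np.full((4, 4), 800, dtype=np.uint16)
    target = np.full((4, 4), 790, dtype=np.uint16)
    mask = np.zeros((4, 4), dtype=bool)
    mask[1:3, 1:3] = True                                      # pretend this is the face region
    print(masked_mean_depth_error(estimated, target, mask))    # 10.0 (mm)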
It should be understood that the above embodiments can be combined with one another to achieve different technical effects.
Based on the same technical concept, an embodiment of the present application further provides a model training apparatus, which includes modules/units for performing the steps of the methods described above.
Illustratively, FIG. 15 is a schematic structural diagram of a model training apparatus 150 provided by an embodiment of the present application. The apparatus 150 includes a processing module 1501 and a training module 1502.
The processing module 1501 is configured to: load a virtual camera and a target object model in the three-dimensional cockpit model, the target object model being located within the shooting range of the virtual camera; obtain a first image output by the virtual camera; generate a label of the first image from the scene information at the time the virtual camera outputs the first image, the label of the first image describing the identification information of the target object model in the first image; adjust the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model; obtain a second image output by the virtual camera; and generate a label of the second image from the scene information at the time the virtual camera outputs the second image, the label of the second image describing the identification information of the target object model in the second image.
The training module 1502 is configured to train the deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to recognize the identification information of a target object in a newly input cockpit image.
Optionally, the target object model is a human body model or a face model.
Optionally, the processing module 1501 is further configured to: before the target object model is loaded into the three-dimensional cockpit model, scan at least one real target object to obtain at least one target object model, and save the at least one target object model to a model library. When loading the target object model into the three-dimensional cockpit model, the processing module 1501 is specifically configured to: randomly select one or more target object models from the model library, and load the one or more target object models into the three-dimensional cockpit model.
Optionally, when loading the virtual camera in the three-dimensional cockpit model, the processing module 1501 is specifically configured to place the shooting parameters of a real camera in a real cockpit into the virtual camera, the shooting parameters including at least one of resolution, distortion parameters, focal length, field of view, aperture or exposure time.
Optionally, the environmental parameters of the three-dimensional cockpit model include at least one of the light field, the interior or the external environment of the three-dimensional cockpit model, where the light field includes at least one of the illumination brightness, the number of light sources, the color of the light sources or the position of the light sources.
Optionally, the pose parameters of the target object model include position coordinates and/or Euler angles.
Optionally, the trained deep learning model is used to recognize the depth information of a target object in a newly input cockpit image;
when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module 1501 is specifically configured to: determine, according to the positions of the feature points on the surface of the target object model relative to the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera at that moment, and use the first depth information as the label of the first image;
when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module 1501 is specifically configured to: determine, according to the positions of the feature points on the surface of the target object model relative to the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera at that moment, and use the second depth information as the label of the second image.
Optionally, the trained deep learning model is used to recognize the pose information of a target object in a newly input cockpit image;
when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module 1501 is specifically configured to: determine the pose parameters of the target object model when the virtual camera outputs the first image, and use those pose parameters as the label of the first image;
when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module 1501 is specifically configured to: determine the pose parameters of the target object model when the virtual camera outputs the second image, and use those pose parameters as the label of the second image.
For the specific implementation of the corresponding functions performed by each of the above modules, reference may be made to the specific implementation of the corresponding method steps in the method embodiments above, and details are not repeated here.
It should be understood that the division into modules in the embodiments of the present application is schematic and is merely a division by logical function; other divisions are possible in actual implementations. For example, the above processing module 1501 may be further subdivided:
Referring to FIG. 16, which is a schematic structural diagram of a model training apparatus 160 provided by an embodiment of the present application, the apparatus 160 includes a scene creation module 1601, a simulation module 1602 and a training module 1603.
The scene creation module 1601 is configured to build the three-dimensional cockpit model and to set up the virtual camera in the three-dimensional cockpit model. The type of cockpit is not limited in this application, for example the cockpit of a car, the cockpit of a ship or the cockpit of an aircraft. The shooting parameters of the virtual camera may be default values or the shooting parameters of a real camera, which is not limited here.
The simulation module 1602 is configured to load the target object model into the three-dimensional cockpit model and, by continuously adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, output a number of labeled training images. The type of target object is not limited in this application, for example a human body or a human face. The label carried by a training image is determined by the output the deep learning model is required to map to: for example, if the trained deep learning model is used to determine the depth information of the target object in an input cockpit image to be recognized, the label is the depth information of the target object model in the sample image; if the trained deep learning model is used to determine the pose of the target object in an input cockpit image to be recognized, the label is the pose of the target object model in the sample image.
The training module 1603 has a function similar to that of the training module 1502 described above, and is configured to train the deep learning model based on the labeled images output by the simulation module 1602 to obtain a trained deep learning model. During training, the images are the input of the deep learning model and the labels are its output.
Further optionally, the apparatus 160 may also include a data acquisition module configured to scan a real target object and generate the target object model.
Further optionally, the apparatus 160 may also include a storage module configured to store the synthetic data output by the simulation module (i.e. the images and the labels of the images).
Further optionally, the apparatus 160 may also include a verification module configured to verify the detection accuracy of the trained deep learning model.
For the specific implementation of the corresponding functions performed by each of the above modules, reference may be made to the specific implementation of the corresponding method steps in the method embodiments above, and details are not repeated here.
Based on the same technical concept, an embodiment of the present application further provides a computing device system, the computing device system including at least one computing device 170 as shown in FIG. 17. The computing device 170 includes a bus 1701, a processor 1702 and a memory 1704. Optionally, it may also include a communication interface 1703; the dashed box in FIG. 17 indicates that the communication interface 1703 is optional. The processor 1702 of the at least one computing device 170 executes the computer instructions stored in the memory 1704 to perform the methods provided in the method embodiments above.
The processor 1702, the memory 1704 and the communication interface 1703 communicate with one another through the bus 1701. When the computing device system includes multiple computing devices 170, the multiple computing devices 170 communicate through a communication path.
The processor 1702 may be a CPU. The memory 1704 may include volatile memory, such as random access memory. The memory 1704 may also include non-volatile memory, such as read-only memory, flash memory, an HDD or an SSD. The memory 1704 stores executable code, and the processor 1702 executes the executable code to perform any part or all of the foregoing methods. The memory may also include other software modules required by running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
Specifically, the memory 1704 stores any one or more of the modules of the aforementioned apparatus 150 or apparatus 160; FIG. 17 takes storing any one or more of the modules of the apparatus 150 as an example. In addition to any one or more of the above modules, the memory 1704 may also include other software modules required by running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
When the computing device system includes multiple computing devices 170, the multiple computing devices 170 establish communication with one another through a communication network, and any one or more of the modules of the aforementioned apparatus 150 or apparatus 160 run on each computing device.
Based on the same technical concept, an embodiment of the present application further provides a computer-readable storage medium storing instructions which, when executed, cause the methods provided in the method embodiments above to be implemented.
Based on the same technical concept, an embodiment of the present application further provides a chip, the chip being coupled to a memory and configured to read and execute program instructions stored in the memory to implement the methods provided in the method embodiments above.
Based on the same technical concept, an embodiment of the present application further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the methods provided in the method embodiments above.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the scope of protection of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass these changes and variations.

Claims (20)

  1. A method for training a deep learning model, comprising:
    loading a virtual camera and a target object model into the three-dimensional cockpit model, the target object model being located within the shooting range of the virtual camera;
    obtaining a first image output by the virtual camera, and generating a label of the first image from scene information at the time the virtual camera outputs the first image, the label of the first image being used to describe identification information of the target object model in the first image;
    adjusting environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model;
    obtaining a second image output by the virtual camera, and generating a label of the second image from scene information at the time the virtual camera outputs the second image, the label of the second image being used to describe identification information of the target object model in the second image; and
    training the deep learning model based on the first image and the label of the first image and the second image and the label of the second image, wherein the trained deep learning model is used to recognize identification information of a target object in a newly input cockpit image.
  2. The method according to claim 1, wherein the target object model is a human body model or a face model.
  3. The method according to claim 1, wherein before loading the target object model into the three-dimensional cockpit model, the method further comprises: scanning at least one real target object to obtain at least one target object model, and saving the at least one target object model to a model library;
    and loading the target object model into the three-dimensional cockpit model comprises: randomly selecting one or more target object models from the model library, and loading the one or more target object models into the three-dimensional cockpit model.
  4. The method according to claim 1, wherein loading the virtual camera into the three-dimensional cockpit model comprises:
    placing shooting parameters of a real camera in a real cockpit into the virtual camera, the shooting parameters comprising at least one of resolution, distortion parameters, focal length, field of view, aperture or exposure time.
  5. The method according to any one of claims 1-4, wherein the environmental parameters of the three-dimensional cockpit model comprise at least one of a light field, an interior or an external environment of the three-dimensional cockpit model;
    wherein the light field comprises at least one of illumination brightness, a number of light sources, a color of the light sources or a position of the light sources.
  6. The method according to any one of claims 1-4, wherein the pose parameters of the target object model comprise position coordinates and/or Euler angles.
  7. The method according to any one of claims 1-6, wherein the trained deep learning model is used to recognize depth information of the target object in the newly input cockpit image;
    generating the label of the first image from the scene information at the time the virtual camera outputs the first image comprises: determining, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera when the virtual camera outputs the first image, and using the first depth information as the label of the first image; and
    generating the label of the second image from the scene information at the time the virtual camera outputs the second image comprises: determining, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera when the virtual camera outputs the second image, and using the second depth information as the label of the second image.
  8. The method according to any one of claims 1-6, wherein the trained deep learning model is used to recognize pose information of the target object in the newly input cockpit image;
    generating the label of the first image from the scene information at the time the virtual camera outputs the first image comprises: determining pose parameters of the target object model when the virtual camera outputs the first image, and using the pose parameters of the target object model when the virtual camera outputs the first image as the label of the first image; and
    generating the label of the second image from the scene information at the time the virtual camera outputs the second image comprises: determining pose parameters of the target object model when the virtual camera outputs the second image, and using the pose parameters of the target object model when the virtual camera outputs the second image as the label of the second image.
  9. A model training apparatus, comprising:
    a processing module, configured to: load a virtual camera and a target object model into the three-dimensional cockpit model, the target object model being located within the shooting range of the virtual camera; obtain a first image output by the virtual camera; generate a label of the first image from scene information at the time the virtual camera outputs the first image, the label of the first image being used to describe identification information of the target object model in the first image; adjust environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model; obtain a second image output by the virtual camera; and generate a label of the second image from scene information at the time the virtual camera outputs the second image, the label of the second image being used to describe identification information of the target object model in the second image; and
    a training module, configured to train the deep learning model based on the first image and the label of the first image and the second image and the label of the second image, wherein the trained deep learning model is used to recognize identification information of a target object in a newly input cockpit image.
  10. The apparatus according to claim 9, wherein the target object model is a human body model or a face model.
  11. The apparatus according to claim 9, wherein the processing module is further configured to: before the target object model is loaded into the three-dimensional cockpit model, scan at least one real target object to obtain at least one target object model, and save the at least one target object model to a model library;
    and when loading the target object model into the three-dimensional cockpit model, the processing module is specifically configured to: randomly select one or more target object models from the model library, and load the one or more target object models into the three-dimensional cockpit model.
  12. The apparatus according to claim 9, wherein when loading the virtual camera into the three-dimensional cockpit model, the processing module is specifically configured to:
    place shooting parameters of a real camera in a real cockpit into the virtual camera, the shooting parameters comprising at least one of resolution, distortion parameters, focal length, field of view, aperture or exposure time.
  13. The apparatus according to any one of claims 9-12, wherein the environmental parameters of the three-dimensional cockpit model comprise at least one of a light field, an interior or an external environment of the three-dimensional cockpit model;
    wherein the light field comprises at least one of illumination brightness, a number of light sources, a color of the light sources or a position of the light sources.
  14. The apparatus according to any one of claims 9-12, wherein the pose parameters of the target object model comprise position coordinates and/or Euler angles.
  15. The apparatus according to any one of claims 9-14, wherein the trained deep learning model is used to recognize depth information of the target object in the newly input cockpit image;
    when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module is specifically configured to: determine, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera when the virtual camera outputs the first image, and use the first depth information as the label of the first image; and
    when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module is specifically configured to: determine, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera when the virtual camera outputs the second image, and use the second depth information as the label of the second image.
  16. The apparatus according to any one of claims 9-14, wherein the trained deep learning model is used to recognize pose information of the target object in the newly input cockpit image;
    when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module is specifically configured to: determine pose parameters of the target object model when the virtual camera outputs the first image, and use the pose parameters of the target object model when the virtual camera outputs the first image as the label of the first image; and
    when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module is specifically configured to: determine pose parameters of the target object model when the virtual camera outputs the second image, and use the pose parameters of the target object model when the virtual camera outputs the second image as the label of the second image.
  17. A computing device system, comprising at least one computing device, each computing device comprising a memory and a processor, the processor being configured to execute computer instructions stored in the memory to perform the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions which, when executed, cause the method according to any one of claims 1-8 to be implemented.
  19. A chip, wherein the chip is coupled to a memory and configured to read and execute program instructions stored in the memory to implement the method according to any one of claims 1-8.
  20. A computer program product comprising instructions, wherein the instructions, when run on a computer, cause the computer to perform the method according to any one of claims 1-8.
PCT/CN2021/075856 2021-02-07 2021-02-07 Method and apparatus for training deep learning model WO2022165809A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/075856 WO2022165809A1 (en) 2021-02-07 2021-02-07 Method and apparatus for training deep learning model
CN202180000179.9A CN112639846A (en) 2021-02-07 2021-02-07 Method and device for training deep learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/075856 WO2022165809A1 (en) 2021-02-07 2021-02-07 Method and apparatus for training deep learning model

Publications (1)

Publication Number Publication Date
WO2022165809A1 true WO2022165809A1 (en) 2022-08-11

Family

ID=75297673

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075856 WO2022165809A1 (en) 2021-02-07 2021-02-07 Method and apparatus for training deep learning model

Country Status (2)

Country Link
CN (1) CN112639846A (en)
WO (1) WO2022165809A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273577A (en) * 2022-09-26 2022-11-01 丽水学院 Photography teaching method and system
CN115331309A (en) * 2022-08-19 2022-11-11 北京字跳网络技术有限公司 Method, apparatus, device and medium for recognizing human body action
CN116563246A (en) * 2023-05-10 2023-08-08 之江实验室 Training sample generation method and device for medical image aided diagnosis

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516778A (en) * 2021-04-14 2021-10-19 武汉联影智融医疗科技有限公司 Model training data acquisition method and device, computer equipment and storage medium
CN113240611B (en) * 2021-05-28 2024-05-07 中建材信息技术股份有限公司 Foreign matter detection method based on picture sequence
CN113362388A (en) * 2021-06-03 2021-09-07 安徽芯纪元科技有限公司 Deep learning model for target positioning and attitude estimation
CN114332224A (en) * 2021-12-29 2022-04-12 北京字节跳动网络技术有限公司 Method, device and equipment for generating 3D target detection sample and storage medium
CN115223002B (en) * 2022-05-09 2024-01-09 广州汽车集团股份有限公司 Model training method, door opening motion detection device and computer equipment
CN117441190A (en) * 2022-05-17 2024-01-23 华为技术有限公司 Position positioning method and device
CN115578515B (en) * 2022-09-30 2023-08-11 北京百度网讯科技有限公司 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428388A (en) * 2019-07-11 2019-11-08 阿里巴巴集团控股有限公司 A kind of image-data generating method and device
US20200160114A1 (en) * 2017-07-25 2020-05-21 Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Method for generating training data, image semantic segmentation method and electronic device
CN112132213A (en) * 2020-09-23 2020-12-25 创新奇智(南京)科技有限公司 Sample image processing method and device, electronic equipment and storage medium
CN112232293A (en) * 2020-11-09 2021-01-15 腾讯科技(深圳)有限公司 Image processing model training method, image processing method and related equipment
CN112308103A (en) * 2019-08-02 2021-02-02 杭州海康威视数字技术股份有限公司 Method and device for generating training sample

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11257272B2 (en) * 2019-04-25 2022-02-22 Lucid VR, Inc. Generating synthetic image data for machine learning

Also Published As

Publication number Publication date
CN112639846A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2022165809A1 (en) Method and apparatus for training deep learning model
Sahu et al. Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review
US11373332B2 (en) Point-based object localization from images
Sakaridis et al. Semantic foggy scene understanding with synthetic data
US20180012411A1 (en) Augmented Reality Methods and Devices
US10019652B2 (en) Generating a virtual world to assess real-world video analysis performance
US11288857B2 (en) Neural rerendering from 3D models
DE112019002589T5 (en) DEPTH LEARNING SYSTEM
CN109407547A (en) Multi-cam assemblage on-orbit test method and system towards panoramic vision perception
CN114972617B (en) Scene illumination and reflection modeling method based on conductive rendering
CN101422035A (en) Image high-resolution upgrading device, image high-resolution upgrading method, image high-resolution upgrading program and image high-resolution upgrading system
Bešić et al. Dynamic object removal and spatio-temporal RGB-D inpainting via geometry-aware adversarial learning
CN112365604A (en) AR equipment depth of field information application method based on semantic segmentation and SLAM
CN115226406A (en) Image generation device, image generation method, recording medium generation method, learning model generation device, learning model generation method, learning model, data processing device, data processing method, estimation method, electronic device, generation method, program, and non-transitory computer-readable medium
Du et al. Video fields: fusing multiple surveillance videos into a dynamic virtual environment
WO2022052782A1 (en) Image processing method and related device
WO2022165722A1 (en) Monocular depth estimation method, apparatus and device
Goncalves et al. Deepdive: An end-to-end dehazing method using deep learning
CN112651881A (en) Image synthesis method, apparatus, device, storage medium, and program product
CN112308977A (en) Video processing method, video processing apparatus, and storage medium
US20220343639A1 (en) Object re-identification using pose part based models
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
CN116012805B (en) Target perception method, device, computer equipment and storage medium
Bai et al. Cyber mobility mirror for enabling cooperative driving automation: A co-simulation platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21923820

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21923820

Country of ref document: EP

Kind code of ref document: A1