WO2022165809A1 - Method and apparatus for training deep learning model - Google Patents

Method and apparatus for training deep learning model

Info

Publication number
WO2022165809A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
model
target object
virtual camera
label
Prior art date
Application number
PCT/CN2021/075856
Other languages
French (fr)
Chinese (zh)
Inventor
Liu Yang (刘杨)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/075856
Priority to CN202180000179.9A
Publication of WO2022165809A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions

Definitions

  • the embodiment of the present application uses the three-dimensional cockpit model of the cockpit as the background of the target object model in the simulation process.
  • compared with existing methods that generate training data by simply rendering a human body model, this improves the authenticity of the synthetic data (that is, the first image and its label, and the second image and its label).
  • during the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are adjusted, so that diverse, cockpit-specific synthetic data can be obtained.
  • the whole process requires no manual participation, which achieves the effect of obtaining diverse annotated training data at low cost and with high efficiency for new application scenarios.
  • using such annotated training data to train the model can effectively improve the generalization ability of the deep learning model and optimize the accuracy and effectiveness of the deep learning model.
  • FIG. 5 is a schematic diagram of another apparatus for training a deep learning model provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the relative position of the target object model and the virtual camera
  • Machine vision systems can convert imperfect, blurry, and constantly changing images into semantic representations. Figure 1 is a schematic diagram of a vision system identifying abnormal behavior: after an image to be recognized is input to the vision system, the system outputs the corresponding semantic information identifying the abnormal behavior in the image, namely "smoking".
  • Vision systems can be implemented by deep learning methods.
  • ImageNet is a large-scale computer vision recognition project, and its associated competition is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
  • In the ILSVRC image classification track (1,000 classes), the AlexNet model of the winning team led by Hinton reduced the top-5 error rate to 15.3%, far ahead of the 26.2% achieved by the second-place entry using an SVM-based algorithm. Since then, deep learning has developed rapidly.
  • the sources of real data generally include: 1) public datasets; 2) data collection and labeling suppliers; 3) self-collection.
  • synthetic data is used to construct a training dataset for the deep learning model.
  • Synthetic data refers to data generated by a simulated environment of a computer system, rather than data measured and collected from a real-world environment. Such data is anonymous and raises no privacy concerns, and because it is created from user-specified parameters, it can be made to match the characteristics of real-world data as closely as possible.
  • existing synthetic data does not consider the cockpit scene, so when a deep learning model trained with such synthetic data recognizes a cockpit scene image, it is difficult for the model to distinguish the foreground (that is, the target object to be detected) from the background, resulting in poor model accuracy.
  • the embodiments of the present application provide a method and apparatus for training a deep learning model, so as to obtain diverse annotated training data for a new application scenario (taking the cockpit as an example) at low cost and with high efficiency, and to improve the generalization ability of the deep learning model.
  • the model training device may also be a device composed of multiple parts, such as a scene creation module, a simulation module, and a training module, which is not limited in this application.
  • Each component can be deployed in different systems or servers.
  • each part of the apparatus may run in any of three environments: a cloud computing device system, an edge computing device system, or a terminal computing device; it may also run across any two of these three environments.
  • the cloud computing device system, the edge computing device system, and the terminal computing device are connected by a communication channel, and can communicate and transmit data with each other.
  • the method for training a deep learning model provided by the embodiment of the present application is performed by the combined parts of the model training apparatus running in the three environments (or any two of the three environments).
  • the 3D cockpit model may be obtained from an external source; for example, the 3D cockpit model corresponding to a car can be obtained from the car's manufacturer. It may also be generated by direct modeling, for example using dedicated software such as a three-dimensional modeling tool, or produced in other ways, which is not limited in this application.
  • the specific modeling method can be Non-Uniform Rational B-Splines (NURBS), polygon meshes, or the like; this application imposes no restrictions.
  • the internal environment may include the internal structure of the cockpit, such as the material, texture, shape, structure, and color of interior items such as seats, steering wheels, storage boxes, ornaments, seat cushions, floor mats, car pendants, interior decorations, and daily necessities (such as paper towels, water cups, and mobile phones).
  • the interior environment may also include a light field (or illumination) inside the cabin, such as illumination brightness, number of light sources, color of light sources, location of light sources or orientation of light sources, and the like.
  • the external environment includes the geographic location, shape, texture, color, etc. of the external environment.
  • if the three-dimensional cockpit model is located in an urban environment, there may also be urban buildings (such as tall buildings) and lane signs (such as traffic lights) outside the cockpit.
  • if the three-dimensional cockpit model is located in a field environment, there may also be flowers, plants, and trees outside the cockpit.
  • the shooting content of the virtual camera can thereby be made closer to the shooting content of a real camera in practical applications, which helps to improve the realism of the synthetic data (the synthetic data here refers to the labeled image output in step S602, which may also be referred to as a sample image herein).
  • the virtual camera in the embodiment of the present application describes the position, attitude, and other shooting parameters of the camera in the real cockpit. Based on these parameters of the virtual camera and the environmental parameters of the three-dimensional cockpit model, a two-dimensional image of the interior of the three-dimensional cockpit model can be output from the perspective of the virtual camera, achieving the same effect as a real camera in the real world shooting a two-dimensional image of the real cockpit interior.
  • FIG. 7 is an interior view of the car model from one perspective (looking from the inside out).
  • a virtual camera is placed inside the cockpit of the three-dimensional cockpit model, and the virtual camera can photograph the scene in the cockpit (at least an image of the main driving position can be captured).
  • let the Y-axis of the coordinate system of the simulation space be the vertical direction;
  • the vertically upward direction is the direction in which the Y-axis coordinate increases;
  • the Z-axis and X-axis lie in the horizontal plane.
  • the virtual camera is set in the coordinate system of the simulation space.
  • the shooting direction of the virtual camera is the direction in which the Z-axis coordinate becomes larger.
  • if the 3D cockpit model comes with a virtual camera, that virtual camera can be used directly, provided that it can output images of the cockpit interior, that is, it is capable of capturing the scene in the cockpit (at least an image of the primary driver's position).
  • positions 4 and 8 (for example, the inside of the rear door glass) can photograph a person in the rear seat;
  • positions 5 and 7 (for example, the backs of the front and rear seats) can also photograph a person in the rear seat;
  • position 11 is, for example, the rearview mirror.
  • the shooting parameters of the virtual camera also need to be set.
  • the shooting parameters may include one or more of resolution, distortion parameters, focal length, field of view, aperture, exposure time, and the like. If the pose of the virtual camera is adjustable, the shooting parameters may further include the position and pose of the virtual camera, where the pose of the virtual camera may be represented by Euler angles. Of course, the shooting parameters may also include other parameters, as long as they affect the shooting effect of the virtual camera, which is not limited in this application.
  • the shooting parameters can be set to default values, which is simple to implement and low in cost.
  • alternatively, the shooting parameters of the real camera in the real cockpit can be collected, and the collected shooting parameters are placed in the virtual camera.
  • the internal parameters of the on-board driver monitoring system (Driver Monitoring System, DMS) and cockpit monitoring system (Cockpit Monitoring System, CMS) cameras can be obtained by Zhang's calibration method, and then the internal parameters are placed in the virtual camera. In this way, the shooting effect of the virtual camera is closer to the shooting effect of the real camera in practical applications, thereby helping to improve the realism of the synthetic data.
  • the above two manners of determining the shooting parameters may be implemented independently, or may be implemented in combination with each other.
  • for example, the resolution may take a default value (such as 256x256), while the distortion parameters, focal length, field of view, aperture, and exposure time take the parameter values collected from the real camera.
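  • To make the parameter handling above concrete, the following is a minimal Python sketch of a shooting-parameter container that keeps default values and overwrites selected fields with values measured from the real cockpit camera (e.g. intrinsics obtained via Zhang's calibration). The class and function names are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class VirtualCameraParams:
    """Hypothetical container for the virtual camera's shooting parameters."""
    resolution: tuple = (256, 256)                   # default value, as in the example above
    focal_length_mm: float = 4.0                     # placeholder; may be replaced by the real-camera value
    fov_deg: float = 90.0                            # field of view
    distortion: tuple = (0.0, 0.0, 0.0, 0.0, 0.0)    # k1, k2, p1, p2, k3
    aperture_f: float = 2.0
    exposure_ms: float = 10.0

def apply_real_camera_params(cam: VirtualCameraParams, real: dict) -> VirtualCameraParams:
    """Overwrite selected defaults with parameters collected from the real cockpit camera."""
    cam.focal_length_mm = real.get("focal_length_mm", cam.focal_length_mm)
    cam.fov_deg = real.get("fov_deg", cam.fov_deg)
    cam.distortion = real.get("distortion", cam.distortion)
    cam.aperture_f = real.get("aperture_f", cam.aperture_f)
    cam.exposure_ms = real.get("exposure_ms", cam.exposure_ms)
    return cam
```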
  • the type of the target object model needs to be determined according to the target object that the deep learning model needs to recognize when it is applied.
  • for example, if the deep learning model needs to recognize the depth information of a human face in the image, the target object model can be a face model; if the deep learning model needs to recognize the posture of the human body in the image, the target object model can be a human body model. It should be understood that the above is only an example and not a limitation; the present application does not limit the specific type of the target object model, which may also be, for example, an animal (such as a dog or a cat).
  • loading the target object model into the three-dimensional cockpit model includes: randomly selecting one or more target object models from the model library, and loading the selected target object model into the three-dimensional cockpit model. It should be understood that when there are multiple target object models, the method steps performed for each target object model in this embodiment of the present application are similar. Therefore, the following description will mainly take one target object model as an example.
  • FIGS. 9A-9B are schematic diagrams of a high-quality human face model, where FIG. 9A is a front view and FIG. 9B is a side view. Since the target object models in the model library are all generated by scanning real target objects, the target object model can provide more texture detail and diversity, which improves its realism and in turn helps to improve the diversity and realism of the synthetic data.
  • the environmental parameters of the three-dimensional cockpit model may include parameters of the internal environment, such as the brightness of the light inside the cockpit, the number of light sources, the color of the light sources, or the position of the light sources, as well as the type, quantity, position, shape, structure, and color of interior items.
  • the environment parameters of the three-dimensional cockpit model may also include parameters of the external environment, such as the geographic location, shape, texture, color, etc. of the external environment. In this way, by changing the environmental parameters of the three-dimensional cockpit model, more diverse sample images (ie, images with labels, or synthetic data) can be obtained.
  • the pose parameters of the target object model include the position coordinates of the target object model (for example, the coordinate values (x, y, z) in the coordinate system of the simulation space) and/or the Euler angles of the target object model (including pitch angle, yaw angle, and roll angle). In this way, more diverse sample images can be obtained by changing the pose parameters of the target object model.
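  • As an illustration only, the environment parameters and pose parameters described above might be represented as simple data structures like the following Python sketch; all field names and default values are assumptions made for readability, not values taken from the patent.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EnvironmentParams:
    """Environment parameters of the 3D cockpit model (interior light field and exterior scene)."""
    light_brightness: float = 1.0                              # relative illumination brightness
    num_light_sources: int = 1
    light_color_rgb: Tuple[int, int, int] = (255, 255, 255)
    light_position: Tuple[float, float, float] = (0.0, 1.2, 0.5)
    exterior_scene: str = "urban"                              # e.g. "urban" or "field"

@dataclass
class PoseParams:
    """Pose parameters of the target object model: position plus Euler angles."""
    position_xyz: Tuple[float, float, float] = (0.0, 0.0, 500.0)  # millimetres, simulation frame
    pitch_deg: float = 0.0
    yaw_deg: float = 0.0
    roll_deg: float = 0.0
```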
  • the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model are adjusted, and several training images carrying labels are output, including:
  • Step 1: set the environment parameter of the three-dimensional cockpit model to the first environment parameter, and set the pose parameter of the target object model to the first pose parameter; output the first image based on the virtual camera in the three-dimensional cockpit model, and generate the label of the first image according to the scene information when the virtual camera outputs the first image, thereby obtaining the first image carrying its label.
  • Step 2: adjust the environment parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, for example, setting the environment parameters to the second environment parameters and the pose parameters to the second pose parameters; output the second image based on the virtual camera, and generate a label of the second image according to the scene information when the virtual camera outputs the second image, thereby obtaining the second image carrying its label.
  • because the environment parameters and/or pose parameters differ, the first image output in step 1 is different from the second image output in step 2, and the labels carried by the two images are also different.
  • the simulation script is the program code used to execute the simulation. When the simulation script is run by a computer, the computer can implement the following functions: randomly select a face model from the model library and load it into the 3D cockpit model; control the X and Y coordinates of the face model to vary randomly within plus or minus 100 mm and the Z coordinate to vary randomly within 200 to 800 mm; and control the pitch angle (Pitch) of the face model.
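  • The following Python sketch illustrates the kind of randomization such a simulation script performs. render_virtual_camera and generate_label are hypothetical stand-ins for the simulator's rendering and labelling calls, and the pitch range is an assumption because the original text truncates it.

```python
import random

def run_face_simulation(face_models, num_samples, render_virtual_camera, generate_label):
    """Randomly perturb the face model pose, render an image, and generate its label."""
    samples = []
    for _ in range(num_samples):
        face = random.choice(face_models)              # randomly select a face model
        pose = {
            "x_mm": random.uniform(-100.0, 100.0),     # X varies within +/- 100 mm
            "y_mm": random.uniform(-100.0, 100.0),     # Y varies within +/- 100 mm
            "z_mm": random.uniform(200.0, 800.0),      # Z varies within 200 to 800 mm
            "pitch_deg": random.uniform(-30.0, 30.0),  # assumed range; truncated in the text
        }
        image = render_virtual_camera(face, pose)      # RGB image from the virtual camera
        label = generate_label(face, pose)             # label from the scene information
        samples.append((image, label))
    return samples
```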
  • the synthetic data is obtained based on the output image of the virtual camera and the output label based on the scene information.
  • the target object model loaded in the three-dimensional cockpit model may also be updated during the simulation. For example, after outputting a preset number of images with labels based on the current target object model, re-select one or more target object models from the model library, and then load the re-selected target object models in the 3D cockpit model.
  • steps 1 and 2 above are then performed based on the reselected target object model to output more labeled images. In this way, the diversity of the synthetic data can be further improved.
  • the shooting parameters of the virtual camera may also be updated during the simulation. For example, after setting the environment parameter of the three-dimensional cockpit model to the first environment parameter, setting the pose parameter of the target object model to the first pose parameter, and outputting the first image carrying its label, the shooting parameters of the virtual camera are updated (for example, the resolution or exposure time is adjusted); then, based on the same environment and pose parameters (that is, keeping the environment parameters of the 3D cockpit model at the first environment parameters and the pose parameters of the target object model at the first pose parameters), a third image different from the first image is output. In this way, the diversity of the synthetic data can be further improved.
  • when the target object model can be divided into multiple parts whose poses are independent of one another, during the simulation process, in addition to adjusting the overall pose parameters of the target object model (that is, keeping the relative poses of the parts unchanged), the pose parameters of each part of the target object model can also be adjusted separately.
  • for example, when the target object model is a human body model, the overall pose parameters of the human body model may be adjusted; or only the pose parameters of the head may be adjusted; or only the pose parameters of the hands may be adjusted; or the head pose parameters and the hand pose parameters may be adjusted simultaneously. In this way, the flexibility of the simulation can be improved, and the diversity of the synthetic data can be further increased.
  • the most important role of the virtual camera is to determine the rendering effect (for example, the position of the image plane, the resolution of the image, and the field of view of the image) according to the shooting parameters of the virtual camera, so that the display effect of the rendered 2D image is the same as that of an image captured by a real camera using those shooting parameters, improving the realism of the RGB image.
  • the label of the image is used to describe the identification information of the target object in the cockpit image.
  • the label type of the image needs to be determined according to the output data (namely the identification information) that the deep learning model needs to map. For example, if the trained deep learning model is used to determine the depth information of the target object in the image to be recognized, the label includes the depth information of the target object model; if the trained deep learning model is used to determine the attitude of the target object in the image to be recognized, the label includes the attitude information of the target object model; and if the trained deep learning model is used to determine both the depth information and the attitude information of the target object in the image to be recognized, the label includes both the depth information and the pose information of the target object model.
  • other types of labels may also exist in practical applications, which are not limited in this application.
  • the embodiments of the present application use different specific methods for generating labels based on scene information, depending on the label type.
  • the following takes labels containing depth information and pose information as examples to introduce two possible methods for generating labels based on scene information.
  • Example 1: the label includes depth information.
  • Depth estimation is a fundamental problem in the field of computer vision. Many devices, such as depth cameras and millimeter-wave radars, can acquire depth directly, but automotive-grade high-resolution depth cameras are expensive. Binocular cameras can also be used for depth estimation, but because binocular images require stereo matching for pixel correspondence and parallax calculation, the computational complexity is high, and the matching effect is poor for low-texture scenes.
  • Monocular depth estimation is comparatively cheap and easy to deploy. As the name implies, it uses a single RGB image to estimate the distance of each pixel in the image relative to the shooting source.
  • the synthetic data generated in the embodiment of the present application may be an image carrying a depth information label, and may be used as training data for a supervised monocular depth estimation algorithm based on deep learning.
  • the trained monocular depth estimation algorithm can be used to estimate depth information based on RGB images.
  • the method for generating a depth information label according to scene information includes: according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs a given image, determining the depth information of the target object model relative to the virtual camera at that moment, and using this depth information as the label of the image.
  • generating the label of the first image according to the scene information when the virtual camera outputs the first image includes: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, the first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image.
  • generating the label of the second image according to the scene information when the virtual camera outputs the second image includes: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, the second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
  • each feature point on the target object model has its own 3D coordinates (x, y, z), and the depth corresponding to each feature point is not the straight-line distance from the feature point to the center of the virtual camera aperture, but the perpendicular distance from that point to the plane of the virtual camera aperture. Therefore, obtaining the depth information of each feature point on the target object model essentially means obtaining the perpendicular distance between each feature point and the plane where the virtual camera is located, which can be calculated from the position coordinates of the virtual camera and the position coordinates of the target object model.
  • assume the camera aperture center S is located at the origin O of the coordinate system of the simulation space, that is, the coordinates of S are (0, 0, 0), the Y-axis is the vertical direction, and the vertically upward direction is the positive Y direction.
  • the depth value corresponding to each feature point is then equal to the perpendicular distance between the feature point and the plane where the virtual camera is located, which equals the Z coordinate of the feature point. As can be seen from Figure 10, points A, B, and C on the face are at different positions: the straight-line distance from point B to the center of the virtual camera aperture is greater than that from point A, yet the perpendicular distance from point C to the plane of the virtual camera aperture is smaller than the perpendicular distances from points A and B to that plane; that is, the Z coordinates of points A and B are greater than the Z coordinate of point C (z3 < z1, z2), so the depth values of points A and B are greater than the depth value of point C.
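  • A minimal numerical sketch of this depth-label rule, assuming the feature-point coordinates have already been expressed in the virtual camera frame (aperture at the origin, shooting direction along +Z, units in millimetres); the depth is the Z coordinate, not the Euclidean distance to the aperture center.

```python
import numpy as np

def depth_label(points_cam_xyz: np.ndarray) -> np.ndarray:
    """Depth of each surface feature point: the perpendicular distance to the camera plane,
    i.e. the Z coordinate, not the straight-line distance to the aperture center."""
    return points_cam_xyz[:, 2]

# Illustration: a point with a larger straight-line distance need not have a larger depth.
pts = np.array([
    [0.0,   0.0, 600.0],   # on the optical axis
    [120.0, 80.0, 600.0],  # larger Euclidean distance, same Z, hence same depth
    [30.0,  20.0, 450.0],  # smaller Z, hence smaller depth
])
print(depth_label(pts))                # [600. 600. 450.]
print(np.linalg.norm(pts, axis=1))     # Euclidean distances differ from the depths
```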
  • the depth information corresponding to each RGB image may be expressed in the form of a depth map.
  • the RGB image and/or the depth map can be in JPG, PNG or JPEG format, which is not limited in this application.
  • the depth of each pixel in the depth map can be stored in 16 bits (two bytes) in millimeters.
  • FIGS. 11A-11C show an example of a set of synthetic data: FIG. 11A is an RGB image, FIG. 11B is the depth map corresponding to the RGB image in FIG. 11A, and FIG. 11C is a schematic diagram displaying FIG. 11B more intuitively in Matplotlib.
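  • A small sketch, assuming OpenCV and Matplotlib are available, of storing the per-pixel depth as a 16-bit PNG in millimetres and displaying it, similar to the visualization in FIG. 11C; the function names are illustrative.

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

def save_depth_png(depth_mm: np.ndarray, path: str) -> None:
    """Store per-pixel depth in millimetres as a 16-bit (two bytes per pixel) PNG."""
    cv2.imwrite(path, np.clip(depth_mm, 0, 65535).astype(np.uint16))

def show_depth(path: str) -> None:
    """Read the 16-bit depth map back unchanged and display it with Matplotlib."""
    depth = cv2.imread(path, cv2.IMREAD_UNCHANGED)   # keep the 16-bit values
    plt.imshow(depth, cmap="plasma")
    plt.colorbar(label="depth (mm)")
    plt.show()
```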
  • Example 2: the label includes pose information.
  • The attitude (pose) estimation problem is to determine the orientation of a three-dimensional target object.
  • Pose estimation has applications in many fields such as robot vision, motion tracking, and single-camera calibration.
  • the synthetic data generated in the embodiment of the present application may be an image carrying a pose information label, and may be used as training data for a supervised pose estimation algorithm based on deep learning.
  • the trained pose estimation algorithm can be used to estimate the pose of the target object based on the RGB image.
  • the method for generating the pose information label according to the scene information includes: using the pose parameters of the target object model at the time the virtual camera outputs a given image as the label of that image.
  • generating the label of the first image according to the scene information when the virtual camera outputs the first image includes: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image.
  • generating the label of the second image according to the scene information when the virtual camera outputs the second image includes: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
  • the attitude information may be represented by Euler angles.
  • the Euler angle may specifically be the rotation angle of the head around the three coordinate axes (ie, X, Y, and Z axes) of the simulation space coordinate system.
  • the rotation angle around the X axis is the pitch angle (Pitch)
  • the rotation angle around the Y axis is the yaw angle (Yaw)
  • the rotation angle around the Z axis is the roll angle (Roll).
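  • A hedged sketch of how such an Euler-angle label might be assembled from the rotation the simulator applies to the target object model; SciPy's Rotation is used here only for convenience and is not mandated by the patent, and the axis convention is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_label(rotation_matrix: np.ndarray, position_xyz: np.ndarray) -> dict:
    """Euler-angle pose label: pitch about X, yaw about Y, roll about Z (degrees),
    plus the position coordinates in the simulation frame."""
    pitch, yaw, roll = Rotation.from_matrix(rotation_matrix).as_euler("xyz", degrees=True)
    return {
        "pitch": float(pitch),
        "yaw": float(yaw),
        "roll": float(roll),
        "position": [float(v) for v in position_xyz],
    }
```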
  • training the model: train the deep learning model based on the labeled images to obtain a trained deep learning model.
  • the RGB image in each group of synthetic data is used as the input of the deep learning model, and the label corresponding to the RGB image is used as the target output of the deep learning model; training the deep learning model in this way yields the trained deep learning model (i.e., a monocular depth estimation algorithm).
  • the deep learning model can output the depth information of the face in the cockpit image, thereby assisting in the realization of gaze tracking, eye positioning and other functions.
  • the deep learning model can output the posture information of the human body in the cockpit image (such as Euler angles of the head, Euler angles of the hands, etc.), thereby assisting in the realization of human body motion tracking, driving distraction detection, and other functions.
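  • The following is a minimal PyTorch-style training loop for the supervised setup described above (RGB image as input, label as the regression target); the dataset, model, loss, and hyperparameters are placeholders, not the patent's implementation.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model: nn.Module, dataset, epochs: int = 10, lr: float = 1e-4, device: str = "cpu"):
    """Each batch pairs a synthetic RGB cockpit image with its label (e.g. a depth map
    or Euler angles); the model is fitted to map the image to the label."""
    model = model.to(device).train()
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()                      # placeholder regression loss
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```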
  • the embodiment of the present application uses the 3D model of the cockpit (that is, the 3D cockpit model) as the background of the target object model (such as a human body model or a face model) during the simulation process, creating realistic background, light, and shadow effects. Compared with existing methods that generate training data by simply rendering a human body model, this improves the authenticity of the synthetic data. During the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are set randomly, so that diverse, cockpit-specific synthetic data is obtained, which achieves the effect of obtaining diverse annotated training data at low cost and with high efficiency for new application scenarios.
  • the embodiments of the present application can effectively improve the generalization ability of the deep learning model, and optimize the accuracy and effectiveness of the deep learning model.
  • the shooting parameters of the real camera in the real cockpit can be placed in the virtual camera, so as to improve the realism of the images captured by the virtual camera and further optimize the precision and effectiveness of the deep learning model.
  • the embodiment of the present application can also generate the target object model from real human body/face scan data, so that the target object model provides more texture detail and diversity, thereby further improving the generalization ability of the deep learning model and optimizing its accuracy and effectiveness.
  • a depth estimation model (i.e., a deep learning model for estimating depth information) can be built on a semantic segmentation network (SegNet).
  • Figure 12 is a schematic diagram of the architecture of the SegNet model.
  • color dithering can be performed on the input image of the model as data enhancement.
  • the loss function of the model can be set as the average error (in millimeters) of the per-pixel depth estimate from the true depth.
  • the model was optimized using the quasi-Newton method (L-BFGS), and it was observed that the model converged significantly on the synthesized training data.
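  • A sketch, under stated assumptions, of the training configuration described above: color jitter as input augmentation, a loss equal to the average per-pixel absolute depth error in millimetres, and an L-BFGS optimizer. The SegNet-style model itself is taken as given; this is an illustrative reconstruction, not the original code.

```python
import torch
from torch import nn
from torchvision import transforms

# Data augmentation: jitter the color of the input images.
color_jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)

def mean_mm_error(pred_depth_mm: torch.Tensor, true_depth_mm: torch.Tensor) -> torch.Tensor:
    """Average per-pixel absolute difference between estimated and true depth, in millimetres."""
    return (pred_depth_mm - true_depth_mm).abs().mean()

def fit_lbfgs(model: nn.Module, images: torch.Tensor, depths_mm: torch.Tensor, steps: int = 50):
    """Optimize the depth estimation model with the quasi-Newton L-BFGS method."""
    images = color_jitter(images)  # jitter once here for simplicity; in practice applied per batch
    optimizer = torch.optim.LBFGS(model.parameters(), lr=0.5, max_iter=20)
    for _ in range(steps):
        def closure():
            optimizer.zero_grad()
            loss = mean_mm_error(model(images), depths_mm)
            loss.backward()
            return loss
        optimizer.step(closure)
    return model
```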
  • the test includes two aspects: qualitative test and quantitative test:
  • FIG. 13A is a synthetic picture generated by using the method of the embodiment of the present application, and the synthetic picture shown in FIG. 13A is input into the depth estimation model to obtain the depth map shown in FIG. 13B . From a qualitative point of view, the effect is ideal.
  • RGB photos are processed to obtain the target area (that is, the area where the target object is located, such as the face area), and other areas are masked, as shown in Figure 14. After the estimated depth map is obtained, only the target area of the estimated depth map and of the ground-truth depth map is compared, which improves the reliability of the test.
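  • A small sketch of the masked quantitative comparison described above: only pixels inside the target (e.g. face) area contribute to the error between the estimated and ground-truth depth maps; the mask is assumed to be provided by the rendering pipeline.

```python
import numpy as np

def masked_depth_error(pred_mm: np.ndarray, true_mm: np.ndarray, target_mask: np.ndarray) -> float:
    """Mean absolute depth error (mm) computed only over the target area; pixels outside
    the mask (the masked-out regions, as in Figure 14) are ignored."""
    mask = target_mask.astype(bool)
    if not mask.any():
        return float("nan")
    return float(np.abs(pred_mm[mask] - true_mm[mask]).mean())
```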
  • an embodiment of the present application further provides a model training apparatus, which includes a module/unit for executing the above method steps.
  • the apparatus 150 includes a processing module 1501 and a training module 1502;
  • the processing module 1501 is used for: loading a virtual camera and a target object model in the three-dimensional cockpit model, where the target object model is located within the shooting range of the virtual camera; acquiring the first image output by the virtual camera; generating the label of the first image from the scene information when the virtual camera outputs the first image, where the label of the first image is used to describe the identification information of the target object model in the first image; adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model; acquiring the second image output by the virtual camera; and generating the label of the second image from the scene information when the virtual camera outputs the second image, where the label of the second image is used to describe the identification information of the target object model in the second image;
  • the training module 1502 is used for: training the deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to identify the identification information of the target object in a newly input cockpit image.
  • the target object model is a human body model or a face model.
  • the processing module 1501 is further configured to: scan at least one real target object before loading the target object model in the three-dimensional cockpit model, obtain at least one target object model, and save the at least one target object model to the model library;
  • the processing module 1501 is specifically used for: randomly selecting one or more target object models from the model library, and loading one or more target object models into the 3D cockpit model.
  • when the processing module 1501 loads the virtual camera in the three-dimensional cockpit model, it is specifically used to: place the shooting parameters of the real camera in the real cockpit into the virtual camera, where the shooting parameters include at least one of resolution, distortion parameters, focal length, field of view, aperture, or exposure time.
  • the environmental parameters of the three-dimensional cockpit model include at least one of the light field, the interior, or the external environment of the three-dimensional cockpit model, where the light field includes at least one of the brightness of the light, the number of light sources, the color of the light sources, or the position of the light sources.
  • the pose parameters of the target object model include position coordinates and/or Euler angles.
  • the trained deep learning model is used to identify the depth information of the target object in the newly input cockpit image
  • when the processing module 1501 generates the label of the first image from the scene information when the virtual camera outputs the first image, it is specifically used for: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, the first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image;
  • when the processing module 1501 generates the label of the second image from the scene information when the virtual camera outputs the second image, it is specifically used for: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, the second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
  • the trained deep learning model is used to identify the pose information of the target object in the newly input cockpit image
  • when generating the label of the first image from the scene information when the virtual camera outputs the first image, the processing module 1501 is specifically used to: determine the pose parameters of the target object model when the virtual camera outputs the first image, and use those pose parameters as the label of the first image;
  • when generating the label of the second image from the scene information when the virtual camera outputs the second image, the processing module 1501 is specifically used to: determine the pose parameters of the target object model when the virtual camera outputs the second image, and use those pose parameters as the label of the second image.
  • the apparatus 160 includes a scene creation module 1601 , a simulation module 1602 , and a training module 1603 .
  • the scene creation module 1601 is used to establish a three-dimensional cockpit model, and a virtual camera is installed in the three-dimensional cockpit model.
  • the type of the cockpit is not limited in the present application, such as the cockpit of a car, the cockpit of a ship, the cockpit of an airplane, and the like.
  • the shooting parameters of the virtual camera may be default values or the shooting parameters of the real camera, which are not limited here.
  • the simulation module 1602 is used to load the target object model in the 3D cockpit model; by continuously adjusting the environment parameters of the 3D cockpit model and/or the pose parameters of the target object model, output several training images with labels.
  • the type of the target object is not limited in this application, such as human body, human face, and the like.
  • the label carried by a training image needs to be determined according to the output data that the deep learning model needs to map. For example, if the trained deep learning model is used to determine the depth information of the target object in the input cockpit image to be recognized, the label is the depth information of the target object model in the sample image; if the trained deep learning model is used to determine the pose of the target object in the input cockpit image to be recognized, the label is the pose of the target object model in the sample image.
  • the training module 1603 is similar in function to the above-mentioned training module 1502, and is used to train the deep learning model based on the labeled images output by the simulation module 1602 to obtain the trained deep learning model.
  • the image is the input of the deep learning model
  • the label is the output of the deep learning model.
  • the apparatus 160 may further include a data acquisition module, configured to scan a real target object to generate a target object model.
  • the apparatus 160 may further include a storage module for storing the synthesized data (ie, the image and the label of the image) output by the simulation module.
  • the apparatus 160 may further include a verification module for verifying the detection accuracy of the trained deep learning model.
  • an embodiment of the present application further provides a computing device system, where the computing device system includes at least one computing device 170 as shown in FIG. 17 .
  • the computing device 170 includes a bus 1701 , a processor 1702 and a memory 1704 .
  • a communication interface 1703 may also be included. In FIG. 17, a dashed box indicates that the communication interface 1703 is optional.
  • the processor 1702 in the at least one computing device 170 executes the computer instructions stored in the memory 1704, and can execute the methods provided in the above method embodiments.
  • the communication between the processor 1702 , the memory 1704 and the communication interface 1703 is through the bus 1701 .
  • the multiple computing devices 170 communicate through a communication path.
  • the processor 1702 may be a CPU.
  • Memory 1704 may include volatile memory, such as random access memory.
  • Memory 1704 may also include non-volatile memory such as read only memory, flash memory, HDD or SSD.
  • the memory 1704 stores executable code that is executed by the processor 1702 to perform any part or all of the aforementioned methods.
  • the memory may also include other software modules required for running processes such as an operating system.
  • the operating system can be LINUX™, UNIX™, WINDOWS™, and so on.
  • the memory 1704 stores any one or more modules of the aforementioned apparatus 150 or apparatus 160 (FIG. 17 takes the storage of one or more modules of the apparatus 150 as an example), and may also include other software modules required for running processes, such as an operating system.
  • the operating system can be LINUX™, UNIX™, WINDOWS™, and so on.
  • the multiple computing devices 170 establish communication with each other through a communication network, and each computing device runs any one or any multiple modules of the foregoing apparatus 150 or apparatus 160.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores instructions, and when the instructions are executed, the methods provided in the above method embodiments can be implemented.
  • an embodiment of the present application further provides a chip, which is coupled to a memory, and is used to read and execute program instructions stored in the memory, so as to implement the method provided by the above method embodiments.
  • the embodiments of the present application also provide a computer program product containing instructions.
  • the computer program product stores instructions that, when run on a computer, cause the computer to execute the method provided by the above method embodiments.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, where the instruction means implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for training a deep learning model. The method comprises: first loading a virtual camera and a target object model in a three-dimensional cabin model; then obtaining a first image output by the virtual camera and generating a label of the first image according to scene information when the virtual camera outputs the first image; then adjusting environmental parameters of the three-dimensional cabin model and/or pose parameters of the target object model, and after the adjustment, obtaining a second image output by the virtual camera and generating a label of the second image according to scene information when the virtual camera outputs the second image; and finally, using the first image, the label of the first image, the second image, and the label of the second image to train the deep learning model. This effectively improves the generalization capability of the deep learning model and optimizes its precision and effectiveness.

Description

A method and apparatus for training a deep learning model
Technical Field
The present application relates to the field of artificial intelligence (AI), and in particular, to a method and apparatus for training a deep learning model.
Background
In recent years, deep learning has penetrated into many areas of machine vision and brought significant improvements over earlier methods. The most successful of these is supervised deep learning. Supervised deep learning trains a deep learning model with an annotated training dataset, and the resulting model can then perform detection on new data in both test sets and real-world scenarios. However, supervised deep learning has an obvious limitation: before starting any vision project, a large amount of annotated training data must be prepared. The real-world application areas of machine vision are constantly expanding, so whenever a new application scenario appears, such as the cockpit scenario, training data must be prepared anew and a new deep learning model must be trained.
At present, for a new application scenario, real data is generally collected and annotated manually to obtain training data. This approach is time-consuming, labor-intensive, and expensive, and some training data cannot be annotated manually at all, resulting in poor model training results.
Therefore, for new application scenarios, how to obtain diverse annotated training data at low cost, and thereby obtain a deep learning model with high accuracy, is the technical problem to be solved by this application.
Summary of the Invention
The present application provides a method and apparatus for training a deep learning model, which can improve the generalization ability of the deep learning model and optimize its accuracy and effectiveness.
In a first aspect, a method for training a deep learning model is provided, including: loading a virtual camera and a target object model in a three-dimensional cockpit model, where the target object model is located within the shooting range of the virtual camera; acquiring a first image output by the virtual camera; generating a label of the first image from the scene information when the virtual camera outputs the first image, where the label of the first image is used to describe the identification information of the target object model in the first image; adjusting environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model; acquiring a second image output by the virtual camera; generating a label of the second image from the scene information when the virtual camera outputs the second image, where the label of the second image is used to describe the identification information of the target object model in the second image; and training the deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to identify the identification information of a target object in a newly input cockpit image.
For the cockpit scenario, the embodiment of the present application uses the three-dimensional cockpit model of the cockpit as the background of the target object model during the simulation process. Compared with the prior-art method of generating training data by simply rendering a human body model, this improves the authenticity of the synthetic data (that is, the first image and its label, and the second image and its label). During the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are adjusted, so that diverse, cockpit-specific synthetic data can be obtained. The whole process requires no manual participation, which achieves the effect of obtaining diverse annotated training data at low cost and with high efficiency for new application scenarios. Finally, training the model with such annotated training data can effectively improve the generalization ability of the deep learning model and optimize its accuracy and effectiveness.
In one possible design, the target object model is a human body model or a face model.
In one possible design, before the target object model is loaded into the three-dimensional cockpit model, at least one real target object may be scanned to obtain at least one target object model, and the at least one target object model is saved to a model library. Correspondingly, loading the target object model in the three-dimensional cockpit model includes: randomly selecting one or more target object models from the model library, and loading the one or more target object models into the three-dimensional cockpit model.
In this design, the target object model is generated from real human body/face scan data, so the target object model can provide more texture detail and diversity, which further improves the generalization ability of the deep learning model and optimizes its accuracy and effectiveness.
In one possible design, loading the virtual camera in the three-dimensional cockpit model may include placing the shooting parameters of a real camera in a real cockpit into the virtual camera, where the shooting parameters include at least one of resolution, distortion parameters, focal length, field of view, aperture, or exposure time.
In this design, placing the shooting parameters of the real camera in the real cockpit into the virtual camera makes the images output by the virtual camera more realistic, which further improves the generalization ability of the deep learning model and optimizes its accuracy and effectiveness.
In one possible design, the environmental parameters of the three-dimensional cockpit model include at least one of the light field, the interior, or the external environment of the three-dimensional cockpit model, where the light field includes at least one of the brightness of the light, the number of light sources, the color of the light sources, or the position of the light sources.
In this design, the light field, interior, or external environment of the three-dimensional cockpit model can be adjusted to create rich and varied backgrounds and lighting effects, making the images output by the virtual camera more diverse, which further improves the generalization ability of the deep learning model and optimizes its accuracy and effectiveness.
In one possible design, the pose parameters of the target object model include position coordinates and/or Euler angles.
In the embodiments of the present application, there may be multiple types of labels; in other words, the trained deep learning model may have multiple uses.
In a first possible design, the identification information is depth information, and the trained deep learning model is used to identify the depth information of the target object in a newly input cockpit image. Correspondingly, generating the label of the first image from the scene information when the virtual camera outputs the first image may include: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, the first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image. Generating the label of the second image from the scene information when the virtual camera outputs the second image may include: determining, according to the relative positions of each feature point on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, the second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
With this design, a high-precision deep learning model for recognizing the depth information of a target object in an image can be obtained.
In a second possible design, the identification information is attitude information, and the trained deep learning model is used to identify the attitude information of the target object in a newly input cockpit image.
Generating the label of the first image from the scene information when the virtual camera outputs the first image includes: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image.
Generating the label of the second image from the scene information when the virtual camera outputs the second image includes: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
With this design, a high-precision deep learning model for recognizing the attitude information of a target object in an image can be obtained.
It should be understood that the above two kinds of labels are only examples rather than limitations; other implementations are also possible.
In a second aspect, a model training apparatus is provided, including modules/units for performing the method steps described in the first aspect or any possible design of the first aspect.
Exemplarily, the apparatus includes:
a processing module, configured to: load a virtual camera and a target object model into a three-dimensional cockpit model, where the target object model is located within the shooting range of the virtual camera; obtain a first image output by the virtual camera; generate a label of the first image from the scene information at the time the virtual camera outputs the first image, where the label of the first image describes identification information of the target object model in the first image; adjust environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model; obtain a second image output by the virtual camera; and generate a label of the second image from the scene information at the time the virtual camera outputs the second image, where the label of the second image describes identification information of the target object model in the second image; and
a training module, configured to train a deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to identify identification information of a target object in a newly input cockpit image.
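Purely as an illustrative sketch (the class and method names below are assumptions, not an interface defined by this application), the two modules could be organized as a data-generating front end and a model-fitting back end:

```python
# Illustrative skeleton (assumed names) of the apparatus described above: a
# processing module that drives the simulation and labels images, and a training
# module that fits the deep learning model on the resulting samples.
from typing import Any, Iterable, Tuple

class ProcessingModule:
    def __init__(self, simulator: Any):
        self.sim = simulator  # wraps the 3D cockpit model, virtual camera, target model

    def generate(self, num_samples: int) -> Iterable[Tuple[Any, Any]]:
        for _ in range(num_samples):
            self.sim.randomize_environment_and_pose()  # assumed simulator call
            image = self.sim.capture()                 # image from the virtual camera
            label = self.sim.scene_label()             # label from the scene information
            yield image, label

class TrainingModule:
    def __init__(self, model: Any):
        self.model = model

    def fit(self, samples: Iterable[Tuple[Any, Any]]) -> Any:
        for image, label in samples:
            self.model.train_step(image, label)        # assumed model interface
        return self.model
```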
In a third aspect, a computing device system is provided, including at least one computing device, where each computing device includes a memory and a processor, and the processor is configured to execute computer instructions stored in the memory, so as to perform the method described in the first aspect or any possible design of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when executed, cause the method described in the first aspect or any possible design of the first aspect to be implemented.
In a fifth aspect, a chip is provided. The chip may be coupled to a memory and is configured to read and execute program instructions stored in the memory, so as to implement the method described in the first aspect or any possible design of the first aspect.
In a sixth aspect, a computer program product containing instructions is provided. When the instructions are run on a computer, they cause the computer to perform the method described in the first aspect or any possible design of the first aspect.
For the beneficial effects of the second to sixth aspects, refer to the beneficial effects of the corresponding designs of the first aspect; details are not repeated here.
Description of the drawings
FIG. 1 is a schematic diagram of a vision system identifying abnormal behavior;
FIG. 2 is a schematic diagram of feature learning in a deep neural network;
FIG. 3 is a schematic diagram of a hand image rendered from a model;
FIG. 4 is a schematic diagram of a deep learning model apparatus provided by an embodiment of this application;
FIG. 5 is a schematic diagram of another deep learning model apparatus provided by an embodiment of this application;
FIG. 6 is a flowchart of a method for training a deep learning model provided by an embodiment of this application;
FIG. 7 is an interior view of a car model from one viewing angle (looking from the inside out);
FIG. 8 is a cutaway view of a real car;
FIG. 9A and FIG. 9B are schematic diagrams of a high-quality face model;
FIG. 10 is a schematic diagram of the relative positions of a target object model and a virtual camera;
FIG. 11A to FIG. 11C are an example of a set of synthetic data;
FIG. 12 is a schematic diagram of the architecture of a semantic segmentation (SegNet) model;
FIG. 13A is a schematic diagram of a synthetic picture generated by the method of an embodiment of this application;
FIG. 13B is a depth map obtained by inputting the synthetic picture shown in FIG. 13A into a depth estimation model;
FIG. 14 is an RGB image processed by a semantic segmentation algorithm;
FIG. 15 is a schematic structural diagram of a model training apparatus 150 provided by an embodiment of this application;
FIG. 16 is a schematic structural diagram of a model training apparatus 160 provided by an embodiment of this application;
FIG. 17 is a schematic structural diagram of a computing device 170 provided by an embodiment of this application.
Detailed description of embodiments
The embodiments of this application are described in detail below with reference to the accompanying drawings.
A machine vision system can convert imperfect, blurry, and constantly changing images into semantic representations. FIG. 1 is a schematic diagram of a vision system identifying abnormal behavior: after an image to be recognized is input into the vision system, the vision system can output the corresponding semantic information and identify the abnormal behavior in the image, namely "smoking".
A vision system can be implemented by deep learning methods. In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC; ImageNet is the name of a computer vision recognition project), the AlexNet model from Hinton's team, the competition winner, reduced the top-5 error rate on the image classification track (1,000 classes) to 15.3%, far ahead of the 26.2% achieved by the second-place entry using an SVM algorithm. Deep learning has flourished ever since.
In recent years, deep learning has penetrated every area of machine vision and brought significant improvements over previous methods. The most successful branch is supervised deep learning: given data and its one-to-one corresponding labels, an algorithm is trained to map input data to labels. Supervised deep learning trains a model on an annotated training dataset; the resulting deep learning model (hereinafter simply "model") can recognize and classify new data in test sets and real scenarios.
The success of supervised deep learning in machine vision in recent years is mainly due to:
1) Large amounts of data: for example, the ImageNet image classification track provides millions of training samples. Because of their very large model capacity, deep neural networks (DNNs) easily overfit on small datasets. With large amounts of data, the overfitting problem of DNNs is effectively alleviated.
2) Massive computing power: processing large amounts of data requires massive computing power. Before AlexNet, training a deep learning network on ImageNet with an ordinary central processing unit (CPU) took years and was therefore impractical. Alex Krizhevsky's pioneering use of the Compute Unified Device Architecture (CUDA) interface on Nvidia graphics processing units (GPUs) successfully trained a deep neural network with breakthrough performance, ushering in the GPU era of artificial intelligence.
Supervised deep learning relies on two assumptions: 1) the training data and the real data are independent and identically distributed (IID); in probability theory and statistics, IID means that every variable in a set of random variables has the same probability distribution and that the variables are mutually independent; 2) the training data has sufficient coverage of the real data. Therefore, a large amount of high-quality training data is especially important for supervised deep learning.
Supervised deep learning has obvious limitations: before starting any vision project, a large amount of training data must be prepared. The training data and the real data must be independent and identically distributed; for example, if a model has been trained for an ordinary camera and is to be extended to a fisheye camera, data must be collected and the model trained again. The probability distribution in the real world may keep evolving, which requires continually preparing training data, retraining, and updating the deep learning model.
Therefore, when the above two assumptions are not sufficiently satisfied, misjudgments and misclassifications are hard to avoid (even though the differences may look insignificant to humans). When an object that is normally recognized with high accuracy is placed into a scene that is negatively correlated with the scenes in which the object usually appears in the training set, the vision system is easily misled. For example, such a system may fail to recognize a cow standing on a beach, because a cow almost never appears in a beach scene.
Illustratively, FIG. 2 is a schematic diagram of feature learning in a deep neural network: what humans perceive as the same animal (a cat) is mapped by a convolutional neural network (CNN) to points far apart in the embedding space, resulting in recognition errors. Faced with this problem, one can only blame the "data", yet there is no guarantee of how much data is needed to completely eliminate such annoying "corner cases".
It can be seen that supervised deep learning requires a large amount of high-quality annotated data to train a deep learning model.
The two main sources of training data for deep learning models are introduced below.
In one possible implementation, real data is used to construct the training dataset for the deep learning model.
Real data generally comes from: 1) public datasets; 2) data collection and annotation vendors; 3) self-collection.
However, using real data has the following drawbacks:
1) It is expensive. Purchasing data from an annotation vendor costs tens of thousands to millions. Since training a deep learning model requires large amounts of data, the total price is often high.
2) It is laborious and time-consuming. Clear documentation must be written and the annotation format specified, which often takes several months.
3) Some training data cannot be annotated manually, for example 3D keypoint data of human hands and pixel-level semantic segmentation data.
In another possible implementation, synthetic data is used to construct the training dataset for the deep learning model.
Synthetic data is data generated by a simulation environment of a computer system, rather than data measured and collected in a real-world environment. Such data is anonymous and raises no privacy concerns. Because it is created according to user-specified parameters, it can be made to share, as far as possible, the same characteristics as real-world data.
Creating synthetic data is more efficient and cheaper than collecting real data. This enables enterprises to experiment with new scenarios quickly and respond better to rapidly changing visual environments. Synthetic data can also complement real data. Some "corner case" recognition errors occur precisely because images of that category are absent or rare in the real data. In that case, this type of data can be synthesized under parameter control so that the distribution of the dataset is closer to the usage scenario.
However, creating a simulation environment for high-quality data synthesis is not easy. If the synthetic data differs greatly from the real data, a deep learning model trained on the synthetic data is unusable.
Existing synthetic data does not take the cockpit scene into account, so when a deep learning model trained with such synthetic data recognizes cockpit scene images, it has difficulty distinguishing the foreground (the target object to be detected) from the background, resulting in poor model accuracy.
Moreover, existing synthetic data is generated entirely by rendering 3D models. For example, in building a finger-tracking vision system, images of the hand in different poses are generated from a parametric hand model, and the background is randomly replaced for training. FIG. 3 is a schematic diagram of a hand image rendered from such a model. As can be seen from FIG. 3, features such as color and texture produced purely by rendering a body model lack sufficient diversity and realism, so a deep learning model trained on such synthetic data generalizes very poorly and cannot reach the expected performance in real scenarios.
In view of this, the embodiments of this application provide a method and apparatus for training a deep learning model, so as to obtain diverse annotated training data at low cost and high efficiency for a new application scenario (taking the cockpit as an example) and to improve the generalization ability of the deep learning model.
Before the method for training a deep learning model provided by the embodiments of this application is introduced, the system architecture to which the embodiments of this application apply is introduced.
The method for training a deep learning model provided by the embodiments of this application may be performed by a model training apparatus, and the location where the model training apparatus is deployed is not limited in the embodiments of this application. Exemplarily, as shown in FIG. 4, the apparatus may run on a cloud computing device system (including at least one cloud computing device, for example a server), on an edge computing device system (including at least one edge computing device, for example a server or a desktop computer), or on various terminal computing devices, such as laptops and personal desktop computers.
The model training apparatus may also be composed of multiple parts, for example a scene creation module, a simulation module, and a training module; this application imposes no limitation. The components may be deployed in different systems or servers. Exemplarily, as shown in FIG. 5, the parts of the apparatus may run in the three environments of the cloud computing device system, the edge computing device system, and the terminal computing device, or in any two of these three environments. The cloud computing device system, the edge computing device system, and the terminal computing device are connected by communication channels and can communicate and transmit data with each other. The method for training a deep learning model provided by the embodiments of this application is performed cooperatively by the component parts of the model training apparatus running in the three environments (or any two of them).
The method for training a deep learning model provided by this application is described in detail below with reference to the accompanying drawings.
Referring to FIG. 6, a flowchart of a method for training a deep learning model provided by an embodiment of this application includes the following steps.
S601. Scene creation: load a three-dimensional cockpit model in the simulation space, and load a virtual camera in the three-dimensional cockpit model.
The simulation space is a 3D simulation environment provided by simulation software. Each three-dimensional model loaded into the 3D simulation environment can be displayed with a 3D effect in the simulation space. Based on the simulation software, the display effect of each three-dimensional model in the simulation space can be controlled, for example by adjusting the environmental parameters of the three-dimensional cockpit model and the pose parameters of the target object model described later. Simulation software includes but is not limited to CARLA, AirSim, and PreScan.
The three-dimensional cockpit model is a polygon mesh representation of a cockpit that can be displayed by a computer or other video device. The displayed cockpit may be a real-world physical cockpit or a fictitious one. The type of cockpit is not limited in this application; it may be, for example, the cockpit of a car, a ship, or an aircraft. The interior of the three-dimensional cockpit model can accommodate at least one target object. Taking a car model as the three-dimensional cockpit model and a human body as the target object, the cockpit of the car has a driver's seat and a front passenger seat and can accommodate two people.
Loading the three-dimensional cockpit model in the simulation space means placing the three-dimensional cockpit model into the simulation space. Loading the virtual camera in the three-dimensional cockpit model means that the virtual camera is also placed into the simulation space and is located inside the three-dimensional cockpit model.
In the embodiments of this application, the three-dimensional cockpit model may be obtained directly from elsewhere; for example, when the three-dimensional cockpit model is the cockpit of a car, the corresponding model can be obtained from the car's manufacturer. It may also be generated by modeling directly, for example with dedicated software such as a 3D modeling tool, although other methods may also be used; this application imposes no limitation. When a cockpit model with three-dimensional data is constructed in a virtual three-dimensional space using a 3D modeling tool, the specific modeling method may be non-uniform rational B-splines (NURBS), polygon meshes, or the like; this application imposes no limitation.
It should be emphasized that when creating the three-dimensional cockpit model in the embodiments of this application, in addition to constructing the shape, structure, and color of the cockpit model itself, the rendering of the interior and exterior environments of the model must also be considered, because when a real camera captures a real cockpit image in practical applications, it captures not only the image of the target object but also part of the interior environment, and possibly images outside the cockpit (for example, the environment outside the car seen through the windows).
Optionally, the interior environment may include the internal structure of the cockpit, for example the material, texture, shape, structure, and color of the interior trim; the interior trim specifically includes, for example, seats, the steering wheel, storage boxes, ornaments, car seat cushions, car floor mats, car pendants, interior decorations, and daily necessities (such as tissues, cups, and mobile phones). The interior environment may also include the light field (or illumination) inside the cockpit, for example the illumination brightness, the number of light sources, the color of the light sources, the positions of the light sources, or the orientations of the light sources.
Optionally, the external environment includes its geographic location, shape, texture, color, and so on. For example, if the three-dimensional cockpit model is located in an urban environment, there may be urban buildings (such as tall buildings), lane signs (such as traffic lights), and other vehicles outside the cockpit; if the three-dimensional cockpit model is located in a rural environment, there may be flowers, grass, and trees outside the cockpit.
Because the embodiments of this application take the interior and exterior environment characteristics of the three-dimensional cockpit model into account when creating it, the content captured by the virtual camera can be made closer to what a real camera captures in practical applications, which helps to improve the realism of the synthetic data (here, synthetic data refers to the labeled images output in step S602, which may also be referred to as sample images).
The virtual camera in the embodiments of this application can describe the position, attitude, and other shooting parameters of a camera in a real cockpit. Based on these parameters of the virtual camera and the environmental parameters of the three-dimensional cockpit model, a two-dimensional image of the interior of the three-dimensional cockpit model from the viewpoint of the virtual camera can be output, achieving the effect of a real camera in the real world capturing a two-dimensional image of the real cockpit interior.
In the embodiments of this application, after the three-dimensional cockpit model is obtained, if it does not include a virtual camera, or if the virtual camera it includes is not suitable for this application, a virtual camera needs to be loaded into the three-dimensional cockpit model.
Referring to FIG. 7, which is an interior view of a car model from one viewing angle (looking from the inside out), the virtual camera is placed inside the cockpit of the three-dimensional cockpit model and can capture the scene inside the cockpit (at least an image of the driver's seat). Let the Y axis of the coordinate system of the simulation space be the vertical direction, with vertically upward being the direction of increasing Y, and let the Z and X axes lie in the horizontal plane. As shown in FIG. 7, the virtual camera is placed at the origin of the coordinate system of the simulation space, and its shooting direction points toward increasing Z.
If the three-dimensional cockpit model comes with a virtual camera, that virtual camera can be used directly. In addition, if the three-dimensional cockpit model has multiple virtual cameras, the virtual camera required by this application needs to be selected from them; the selected virtual camera can output images of the cockpit interior, that is, it can capture the scene inside the cockpit (at least an image of the driver's seat).
Referring to FIG. 8, which is a cutaway view of a real car, 13 positions where cameras can be installed are shown. The functions served by the cameras at these positions are as follows: position 1, driver monitoring system (DMS); position 2, around view monitor (AVM); position 3, B-pillar exterior face-recognition door unlocking; position 4, backseat monitoring system (BMS); position 5, rear-seat entertainment; position 6, AVM; position 7, rear-seat entertainment; position 8, BMS; position 9, AVM; position 10, digital video recorder (DVR); position 11, in-car (rearview mirror) high-definition photography; position 12, AVM; position 13, B-pillar exterior face-recognition door unlocking. The cameras at positions 1, 4, 5, 7, 8, and 11 face the interior of the car, and the cameras at positions 2, 3, 6, 9, 10, 12, and 13 face the exterior.
The position of the virtual camera used in the embodiments of this application within the three-dimensional cockpit model can refer to the deployment positions of the inward-facing cameras in the real car shown in FIG. 8, for example: position 1 (such as the inside of the glass in front of the steering wheel, which can capture the person in the driver's seat); positions 4 and 8 (such as the inside of the rear door glass, which can capture people in the rear seats); positions 5 and 7 (such as the backs of the front seats, which can capture people in the rear seats); and position 11 (such as the rearview mirror).
In the embodiments of this application, after the virtual camera is loaded into the three-dimensional cockpit model, its shooting parameters also need to be set. The shooting parameters may include one or more of resolution, distortion parameters, focal length, field of view, aperture, exposure time, and so on. If the pose of the virtual camera is adjustable, the shooting parameters may further include the position and attitude of the virtual camera, where the attitude of the virtual camera can be expressed in Euler angles. Of course, the shooting parameters may also include other parameters, as long as they affect the shooting effect of the virtual camera; this application imposes no limitation.
Optionally, the shooting parameters may be set to default values; this approach is simple to implement and low in cost.
Optionally, the shooting parameters may be obtained by collecting the shooting parameters of a real camera in a real cockpit, that is, collecting the shooting parameters of the real camera and applying them to the virtual camera. In a specific example, the intrinsic parameters of the in-vehicle driver monitoring system (DMS) and cockpit monitoring system (CMS) cameras can be obtained by Zhang's calibration method and then applied to the virtual camera. In this way, the shooting effect of the virtual camera is closer to that of a real camera in practical applications, which helps to improve the realism of the synthetic data.
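As a minimal, hedged sketch of this step: OpenCV's calibration routine is one common implementation of Zhang's method, and the checkerboard pattern size, square size, and image folder below are assumptions for illustration only; the original disclosure does not prescribe a specific tool.

```python
# A minimal sketch of estimating real-camera intrinsics with Zhang's calibration
# method via OpenCV, assuming a set of checkerboard images captured by the DMS/CMS
# camera is available under "calib/" (hypothetical folder).
import glob
import cv2
import numpy as np

PATTERN = (9, 6)          # inner corners of the checkerboard (assumed)
SQUARE_MM = 25.0          # physical square size in millimetres (assumed)

# 3D coordinates of the checkerboard corners in the board's own frame.
obj = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points, size = [], [], None
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    ok, corners = cv2.findChessboardCorners(gray, PATTERN)
    if ok:
        obj_points.append(obj)
        img_points.append(corners)
        size = gray.shape[::-1]

# K is the 3x3 intrinsic matrix and dist the distortion coefficients; these are the
# values that would then be copied into the virtual camera's shooting parameters.
rms, K, dist, _, _ = cv2.calibrateCamera(obj_points, img_points, size, None, None)
print("reprojection RMS:", rms)
print("intrinsics K:\n", K)
print("distortion:", dist.ravel())
```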
It should be understood that the above two ways of determining the shooting parameters can be implemented separately or in combination. When combined, for example, the resolution may use a default value (such as 256x256), while the distortion parameters, focal length, field of view, aperture, and exposure time use the values collected from a real camera.
S602. Real-time simulation: load the target object model into the three-dimensional cockpit model, adjust the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model multiple times, and output a number of labeled (two-dimensional) images.
In the embodiments of this application, the type of the target object model needs to be determined according to the target object that the deep learning model needs to recognize when applied. For example, if the deep learning model needs to recognize the depth information of a face in an image, the target object model may be a face model; if the deep learning model needs to recognize the posture of a human body in an image, the target object model may be a human body model. It should be understood that the above are merely examples rather than limitations; this application does not limit the specific type of the target object model, which may also be, for example, an animal (such as a dog or a cat).
Optionally, loading the target object model into the three-dimensional cockpit model includes: randomly selecting one or more target object models from a model library, and loading the selected target object model(s) into the three-dimensional cockpit model. It should be understood that when there are multiple target object models, the method steps performed for each model in the embodiments of this application are similar; therefore, the following description mainly takes one target object model as an example.
The embodiments of this application do not limit the source of the target object models in the model library.
Optionally, before the target object model is loaded into the three-dimensional cockpit model, at least one real target object is scanned, at least one target object model is generated based on the scanned real data, and the at least one target object model is saved to the model library. For example, a data acquisition device using Intel RealSense technology may be used to scan a real human face. FIG. 9A and FIG. 9B are schematic diagrams of a high-quality face model, where FIG. 9A is a front view and FIG. 9B is a side view. Because the target object models in the model library are generated by scanning real target objects, they can provide more texture detail and diversity, which improves the realism of the target object models and in turn helps to improve the diversity and realism of the synthetic data.
Optionally, the environmental parameters of the three-dimensional cockpit model may include parameters of the interior environment, for example the illumination brightness inside the cockpit, the number of light sources, the color of the light sources, or the positions of the light sources, as well as the category, quantity, position, shape, structure, and color of the interior trim. The environmental parameters may also include parameters of the external environment, such as its geographic location, shape, texture, and color. In this way, more diverse sample images (that is, labeled images, or synthetic data) can be obtained by changing the environmental parameters of the three-dimensional cockpit model.
Optionally, the pose parameters of the target object model include the position coordinates of the target object model (for example, the coordinate values (x, y, z) in the coordinate system of the simulation space) and/or its Euler angles (including pitch, yaw, and roll). In this way, more diverse sample images can be obtained by changing the pose parameters of the target object model.
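As an illustrative sketch only (the structure and names are assumptions, not a format defined by this application), the adjustable scene parameters could be represented as follows:

```python
# Illustrative sketch (assumed structure) of the adjustable scene parameters: the
# cockpit's light field / exterior setting and the target object's pose.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LightSource:
    position: Tuple[float, float, float]                 # simulation-space coordinates (mm)
    color_rgb: Tuple[float, float, float] = (1.0, 1.0, 1.0)
    brightness: float = 1.0

@dataclass
class EnvironmentParams:
    lights: List[LightSource] = field(default_factory=list)
    exterior_scene: str = "urban"                         # e.g. "urban" or "countryside" (assumed)

@dataclass
class PoseParams:
    position: Tuple[float, float, float]                  # (x, y, z) in the simulation space
    euler_deg: Tuple[float, float, float]                 # (pitch, yaw, roll) in degrees
```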
In the embodiments of this application, adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model and outputting a number of labeled training images includes the following steps.
Step 1: set the environmental parameters of the three-dimensional cockpit model to first environmental parameters and the pose parameters of the target object model to first pose parameters; output a first image based on the virtual camera in the three-dimensional cockpit model, and generate a label of the first image from the scene information at the time the virtual camera outputs the first image, obtaining the first image carrying its label.
Step 2: adjust the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, for example set the environmental parameters of the three-dimensional cockpit model to second environmental parameters and the pose parameters of the target object model to second pose parameters; output a second image based on the virtual camera, and generate a label of the second image from the scene information at the time the virtual camera outputs the second image, obtaining the second image carrying its label.
Note that in the process of adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, it must be ensured that the contours of the target object model and the three-dimensional cockpit model do not occupy the same space (for example, a human body may only move within the interior space of the car cockpit and must not intersect the car's shell), so as to improve the realism of the synthetic data.
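One simple way to enforce this constraint, shown here only as an assumed sketch (the original disclosure does not specify how the check is done, and a full mesh-mesh collision test would be more precise), is to reject sampled poses whose axis-aligned bounding box leaves the free interior volume of the cockpit:

```python
# A minimal sketch (assumption): reject a sampled pose if the target object's
# axis-aligned bounding box leaves the free interior volume of the cockpit,
# as a cheap stand-in for a full mesh collision test.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AABB:
    min_xyz: Tuple[float, float, float]  # lower corner in simulation-space coordinates (mm)
    max_xyz: Tuple[float, float, float]  # upper corner

def inside(inner: AABB, outer: AABB) -> bool:
    """True if `inner` lies entirely within `outer` (the cockpit's free space)."""
    return all(outer.min_xyz[i] <= inner.min_xyz[i] and
               inner.max_xyz[i] <= outer.max_xyz[i] for i in range(3))

# Hypothetical bounds: the cabin's free interior and a candidate head placement.
cabin_free_space = AABB((-800, -500, 150), (800, 600, 900))
head_at_candidate_pose = AABB((-120, -150, 350), (120, 150, 650))

if not inside(head_at_candidate_pose, cabin_free_space):
    pass  # resample the pose parameters instead of rendering this frame
```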
Because the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model differ between Step 1 and Step 2, the first image output in Step 1 differs from the second image output in Step 2, and the labels they carry also differ. By repeatedly performing Step 1 and Step 2 (that is, continuously changing the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model), more varied labeled images can be obtained. When the number of labeled images reaches a certain value, they can be used as training samples to train the deep learning model, that is, S603 is performed.
To facilitate understanding of the simulation process in the embodiments of this application, a specific example is given here. A simulation script is program code used to run the simulation. When the simulation script is run by a computer, the computer can implement the following functions: randomly select a face model from the model library and load it into the three-dimensional cockpit model; let the X and Y coordinates of the face model vary randomly within plus or minus 100 millimetres and the Z coordinate vary randomly between 200 and 800 millimetres; let the pitch angle of the face model vary randomly within plus or minus 20 degrees and its yaw angle within plus or minus 40 degrees; let the illumination inside the cockpit vary randomly between 1 and 3; and, while the pose of the face model and the illumination brightness change, output images from the virtual camera and labels from the scene information, obtaining the synthetic data.
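A hedged sketch of such a script is given below. The simulator interface (`load_face_model`, `set_pose`, `set_light_intensity`, `render`, `scene_label`) and the sample count are assumptions chosen for readability; only the parameter ranges come from the example above.

```python
# Illustrative sketch of the simulation script described above; the simulator API
# is assumed, not defined by the original disclosure.
import random

NUM_SAMPLES = 10_000  # assumed dataset size

def generate_samples(sim, model_library):
    face = sim.load_face_model(random.choice(model_library))
    for _ in range(NUM_SAMPLES):
        # Pose ranges from the example above (millimetres / degrees).
        x = random.uniform(-100.0, 100.0)
        y = random.uniform(-100.0, 100.0)
        z = random.uniform(200.0, 800.0)
        pitch = random.uniform(-20.0, 20.0)
        yaw = random.uniform(-40.0, 40.0)
        sim.set_pose(face, position=(x, y, z), euler_deg=(pitch, yaw, 0.0))

        # Illumination inside the cockpit varies randomly between 1 and 3.
        sim.set_light_intensity(random.uniform(1.0, 3.0))

        image = sim.render()            # RGB image from the virtual camera
        label = sim.scene_label(face)   # e.g. a depth map or pose parameters
        yield image, label
```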
Optionally, the target object model loaded in the three-dimensional cockpit model may also be updated during the simulation. For example, after a preset number of labeled images have been output based on the current target object model, one or more target object models are reselected from the model library, the reselected target object model(s) are loaded into the three-dimensional cockpit model, and Step 1 and Step 2 are performed for them to output more labeled images. In this way, the diversity of the synthetic data can be further improved.
Optionally, the shooting parameters of the virtual camera may also be updated during the simulation. For example, after the environmental parameters of the three-dimensional cockpit model are set to the first environmental parameters, the pose parameters of the target object model are set to the first pose parameters, and the first image carrying its label is output, the shooting parameters of the virtual camera are updated (for example by adjusting the resolution or exposure time); then, based on the same environmental and pose parameters (that is, keeping the environmental parameters of the three-dimensional cockpit model at the first environmental parameters and the pose parameters of the target object model at the first pose parameters), a third image different from the first image is output. In this way, the diversity of the synthetic data can be further improved.
Optionally, when the target object model can be divided into multiple parts whose postures are independent of each other, during the simulation the pose parameters of the individual parts of the target object model may be adjusted separately, in addition to adjusting the pose parameters of the target object model as a whole (that is, keeping the relative poses between the parts unchanged). For example, when the target object model is a human body model, the pose parameters of the whole body may be adjusted, or only those of the head, or only those of the hands, or those of the head and hands simultaneously, and so on. In this way, the flexibility of the simulation can be increased and the diversity of the synthetic data further improved.
The principle of outputting an image based on the virtual camera is briefly introduced here: each feature point (or vertex) on the target object model is projected onto a pixel of the image plane of the virtual camera, yielding a color (Red Green Blue, RGB) image; this process can be implemented with computer graphics (CG) techniques.
It should be emphasized that when an RGB image is output based on the virtual camera in the embodiments of this application, the most important role of the virtual camera is to determine the rendering effect according to its shooting parameters (for example the position of the image plane, the resolution of the image, and the field of view of the image), so that the rendered 2D image looks the same as an image a real camera would capture with those shooting parameters, improving the realism of the RGB image.
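As a minimal sketch of the projection step only (assuming an ideal pinhole camera with no lens distortion, which is a simplification of what a full CG renderer does), a vertex expressed in the camera frame maps to a pixel as follows:

```python
# A minimal sketch (ideal pinhole model, no distortion) of how a vertex in the
# camera frame maps to a pixel on the virtual camera's image plane.
import numpy as np

def project_points(vertices_cam, K):
    """vertices_cam: (N, 3) points in the camera coordinate frame (Z forward, mm).
    K: 3x3 intrinsic matrix. Returns (N, 2) pixel coordinates."""
    v = np.asarray(vertices_cam, dtype=float)
    uvw = (K @ v.T).T                 # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide by the depth Z

# Example intrinsics for an assumed 256x256 virtual camera (fx = fy = 300 px).
K = np.array([[300.0,   0.0, 128.0],
              [  0.0, 300.0, 128.0],
              [  0.0,   0.0,   1.0]])

pixels = project_points([[50.0, -20.0, 600.0]], K)   # one vertex 600 mm in front
print(pixels)   # -> approximately [[153., 118.]]
```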
In the embodiments of this application, the label of an image is used to describe the identification information of the target object in the cockpit image. The label type needs to be determined according to the output data (that is, the identification information) to which the deep learning model is required to map. For example, if the trained deep learning model is used to determine the depth information of the target object in an image to be recognized, the label includes the depth information of the target object model; if the trained deep learning model is used to determine the pose information of the target object in a cockpit image to be recognized, the label includes the pose information of the target object model; if the trained deep learning model is used to determine both the depth information and the pose information of the target object in an image to be recognized, the label includes both. Of course, other label types are possible in practical applications; this application imposes no limitation.
It should be noted that the specific method of generating a label from the scene information differs for different label types. Taking depth information and pose information as examples, two possible methods of generating labels from scene information are described below.
Example 1: the label includes depth information.
Depth estimation is a fundamental problem in the field of computer vision. Many devices, such as depth cameras and millimetre-wave radars, can acquire depth directly, but automotive-grade high-resolution depth cameras are expensive. Depth can also be estimated with a binocular setup, but because binocular images require stereo matching for pixel correspondence and disparity computation, the computational complexity is high, and the matching quality is poor for low-texture scenes in particular. Monocular depth estimation, by contrast, is cheaper and easier to deploy widely. Monocular depth estimation, as the name implies, uses a single RGB image to estimate the distance of each pixel from the capturing source. The human visual system, thanks to a wealth of prior knowledge, can extract a large amount of depth information from the image obtained by a single eye. Monocular depth estimation therefore needs not only to learn direct depth cues from the two-dimensional image but also to extract indirect information related to the camera and the scene to assist more accurate estimation. The synthetic data generated in the embodiments of this application may be images carrying depth information labels, and can serve as training data for a supervised, deep-learning-based monocular depth estimation algorithm. The trained monocular depth estimation algorithm can be used to estimate depth information from RGB images.
Specifically, the method of generating a depth information label from the scene information includes: according to the relative positions of the feature points on the surface of the target object model and the virtual camera at the time the virtual camera outputs an image, determining the depth information of the target object model relative to the virtual camera at that time, and using that depth information as the label of the image.
Illustratively, take the first image and the second image described above as an example. Generating the label of the first image from the scene information at the time the virtual camera outputs the first image includes: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera, and using the first depth information as the label of the first image. Generating the label of the second image from the scene information at the time the virtual camera outputs the second image includes: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera, and using the second depth information as the label of the second image.
It should be noted that every feature point on the target object model has its own 3D coordinates (x, y, z), and the depth corresponding to a feature point is not the straight-line distance from that point to the centre of the virtual camera's aperture, but the perpendicular distance from the point to the plane in which the virtual camera's aperture lies. Obtaining the depth information of every feature point on the target object model therefore amounts to obtaining the perpendicular distance from each feature point to the plane of the virtual camera, which can be computed from the position coordinates of the virtual camera and the position coordinates of the target object model.
Illustratively, referring to FIG. 10, let the centre S of the camera aperture be at the origin O of the coordinate system of the simulation space, that is, S has coordinates (0, 0, 0); let the Y axis be vertical with upward being the direction of increasing Y, let the Z and X axes lie in the horizontal plane, let the shooting direction of the virtual camera point toward increasing Z, and let the plane of the virtual camera's aperture be the plane spanned by the X and Y axes through O. Then the depth value of each feature point equals its perpendicular distance to the camera plane, which equals its Z coordinate. As can be seen from FIG. 10, points A and B on the face are at different positions, and the straight-line distance from B to the aperture centre is greater than that from A, yet the perpendicular distance from A to the aperture plane equals that from B, that is, A and B have the same Z coordinate (z1 = z2), so their depth values are equal. The perpendicular distance from C to the aperture plane is smaller than those from A and B, that is, the Z coordinate of A and B is greater than that of C (z3 < z1, z2), so the depth values of A and B are greater than the depth value of C.
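A minimal sketch of this computation (an assumption for illustration, not taken from the original disclosure): with the camera looking along +Z, the depth label of each vertex is simply its Z coordinate after transforming it into the camera frame.

```python
# Depth of each vertex as the perpendicular distance to the camera plane (its Z
# coordinate in the camera frame), not the Euclidean distance to the aperture.
import numpy as np

def vertex_depths(vertices_world, cam_rotation, cam_position):
    """vertices_world: (N, 3) vertices in simulation-space coordinates.
    cam_rotation: 3x3 world-to-camera rotation; cam_position: camera origin (3,)."""
    v_cam = (np.asarray(vertices_world) - cam_position) @ cam_rotation.T
    return v_cam[:, 2]

# Points A, B, C as in FIG. 10 (camera at the origin, so the rotation is identity):
pts = np.array([[ 100.0,  50.0, 600.0],   # A
                [-150.0,  80.0, 600.0],   # B  (same Z as A -> same depth)
                [  20.0, -30.0, 400.0]])  # C  (smaller Z -> smaller depth)
print(vertex_depths(pts, np.eye(3), np.zeros(3)))   # -> [600. 600. 400.]
```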
In this embodiment of the present application, the depth information corresponding to each RGB image may be represented in the form of a depth map. The RGB image and/or the depth map may use a format such as JPG, PNG or JPEG, which is not limited in this application. The depth of each pixel in the depth map may be stored in 16 bits (two bytes), in units of millimeters. Illustratively, FIG. 11A and FIG. 11B are an example of one set of synthetic data, where FIG. 11A is an RGB image, FIG. 11B is the depth map corresponding to the RGB image shown in FIG. 11A, and FIG. 11C is a more intuitive visualization of FIG. 11B produced with Matplotlib.
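Illustratively, a minimal sketch of storing and visualizing such a 16-bit millimeter depth map is given below. It is not part of the original disclosure; the use of OpenCV for 16-bit PNG I/O and the random data are assumptions, while the Matplotlib rendering mirrors the kind of visualization referred to for FIG. 11C:

    import cv2                      # assumed available; cv2.imwrite supports 16-bit PNG
    import numpy as np
    import matplotlib.pyplot as plt

    # Depth in millimeters, one value per pixel, stored as 16-bit unsigned integers.
    depth_mm = np.random.uniform(400, 1200, size=(480, 640)).astype(np.uint16)

    cv2.imwrite("depth.png", depth_mm)                       # lossless 16-bit PNG
    loaded = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)   # keep the 16-bit values

    # A more intuitive rendering, similar in spirit to FIG. 11C.
    plt.imshow(loaded, cmap="viridis")
    plt.colorbar(label="depth (mm)")
    plt.savefig("depth_visualization.png")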
Example 2: the label includes pose information.
The pose estimation problem is to determine the orientation of a three-dimensional target object. Pose estimation is used in many fields such as robot vision, motion tracking and single-camera calibration. The synthetic data generated in the embodiments of the present application may be images carrying pose information labels, and may serve as training data for a supervised, deep-learning-based pose estimation algorithm. The trained pose estimation algorithm can be used to estimate the pose of a target object from an RGB image.
Specifically, the method for generating a pose information label according to the scene information includes: obtaining the pose parameters of the target object model at the time the virtual camera outputs a given image, and using those pose parameters as the label of that image.
Illustratively, take the first image and the second image described above as examples. Generating the label of the first image according to the scene information at the time the virtual camera outputs the first image includes: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image. Generating the label of the second image according to the scene information at the time the virtual camera outputs the second image includes: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
In this embodiment of the present application, the pose information may be represented by Euler angles. Specifically, the Euler angles may be the rotation angles of the head around the three coordinate axes (i.e. the X, Y and Z axes) of the simulation space coordinate system, where the rotation angle around the X axis is the pitch angle (Pitch), the rotation angle around the Y axis is the yaw angle (Yaw), and the rotation angle around the Z axis is the roll angle (Roll).
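Illustratively, one way such a pose label could be written out per frame is sketched below. This is not part of the original disclosure: the availability of SciPy, the assumption that the simulator exposes the head orientation as a rotation matrix, and the file layout are all illustrative choices.

    from scipy.spatial.transform import Rotation  # assumed available
    import json

    # R_head: 3x3 rotation matrix of the head in the simulation frame (an assumption;
    # the actual simulation API is not specified in the text).
    def pose_label(R_head, position_xyz):
        pitch, yaw, roll = Rotation.from_matrix(R_head).as_euler("XYZ", degrees=True)
        return {
            "position": list(position_xyz),   # (x, y, z) in the simulation frame
            "pitch": float(pitch),            # rotation around X
            "yaw": float(yaw),                # rotation around Y
            "roll": float(roll),              # rotation around Z
        }

    # Example: identity orientation at an assumed driver-seat position.
    label = pose_label([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0.4, 1.1, 0.6))
    json.dump(label, open("frame_0001_label.json", "w"))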
S603. Model training: train the deep learning model based on the labeled images to obtain a trained deep learning model.
Illustratively, taking depth information labels as an example, the RGB image in each set of synthetic data is used as the input of the deep learning model and the label corresponding to that RGB image as its target output, and the deep learning model is trained on these pairs. The trained deep learning model (i.e. a monocular depth estimation algorithm) can then be used to recognize depth information from RGB images. For example, after a real cockpit image to be recognized is input into the deep learning model, the model can output the depth information of the face in that cockpit image, thereby assisting functions such as gaze tracking and eye positioning.
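Illustratively, a minimal sketch of this supervised training step is given below. It is not the implementation of this application: the use of PyTorch, the network, the dataset yielding (RGB, depth-label) pairs and the hyperparameters are all placeholders.

    import torch
    from torch.utils.data import DataLoader

    def train_depth_model(model, dataset, epochs=10, lr=1e-4, device="cuda"):
        loader = DataLoader(dataset, batch_size=8, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        model.to(device).train()
        for _ in range(epochs):
            for rgb, depth_label in loader:
                rgb, depth_label = rgb.to(device), depth_label.to(device)
                pred = model(rgb)                                  # predicted depth map, in mm
                loss = torch.mean(torch.abs(pred - depth_label))   # mean per-pixel error
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model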
Illustratively, taking pose information labels as an example, the RGB image in each set of synthetic data is used as the input of the deep learning model and the label corresponding to that RGB image as its target output, and the deep learning model is trained on these pairs. The trained deep learning model (i.e. a pose estimation algorithm) can then be used to recognize pose information from RGB images. For example, after a cockpit image to be recognized is input into the deep learning model, the model can output the pose information of the human body in that cockpit image (for example the Euler angles of the head, the Euler angles of the hands, etc.), thereby assisting functions such as human motion tracking and driver distraction detection.
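Illustratively, for the pose-label variant the network regresses Euler angles rather than a depth map; a minimal loss sketch (not part of the original disclosure; names, shapes and values are assumptions) is:

    import torch

    # pred_angles, label_angles: tensors of shape (batch, 3) holding pitch, yaw, roll
    # in degrees; training minimizes their mean absolute error.
    def pose_loss(pred_angles, label_angles):
        return torch.mean(torch.abs(pred_angles - label_angles))

    pred = torch.tensor([[5.0, -10.0, 2.0]])
    label = torch.tensor([[4.0, -12.0, 0.0]])
    print(pose_loss(pred, label))   # tensor(1.6667)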
Based on the above description, the embodiments of the present application, for the cockpit scenario, use a 3D model of the cockpit (i.e. the three-dimensional cockpit model) as the background of the target object model (such as a human body model or a face model) during the simulation, creating a realistic background and realistic lighting effects. Compared with the prior-art approach of generating training data solely by rendering a human body model, this improves the realism of the synthetic data. During the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are set randomly, so that diverse synthetic data unique to the cockpit domain can be obtained, achieving low-cost and efficient generation of diverse annotated training data for new application scenarios. The embodiments of the present application can effectively improve the generalization ability of the deep learning model and optimize its accuracy and effectiveness.
Furthermore, when the virtual camera is set up in the simulation environment in the embodiments of the present application, the shooting parameters of the real camera in the real cockpit can be placed into the virtual camera, improving the realism of the images captured by the virtual camera and further optimizing the accuracy and effectiveness of the deep learning model.
Furthermore, the embodiments of the present application can also generate the target object model based on real human body/face scan data, so that the target object model provides more texture detail and diversity, thereby further improving the generalization ability of the deep learning model and optimizing its accuracy and effectiveness.
To test the usability of the synthetic data generated in the embodiments of the present application, a depth estimation model (i.e. a deep learning model for estimating depth information) can be designed on the basis of a semantic segmentation network (SegNet). FIG. 12 is a schematic diagram of the architecture of the SegNet model. During model training, color jitter can be applied to the model's input images as data augmentation. The loss function of the model can be set to the average error (in millimeters) between the per-pixel depth estimate and the true depth. When the model is optimized with the quasi-Newton method (L-BFGS), clear convergence on the synthetic training data can be observed.
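Illustratively, the training setup just described (color jitter as augmentation, a mean per-pixel millimeter error as loss, L-BFGS as optimizer) could be sketched as follows. This is not the implementation used in the application; the SegNet-based network itself is not reproduced, and `model`, `rgb_batch`, `depth_batch` and the hyperparameters are assumptions.

    import torch
    from torchvision import transforms

    augment = transforms.ColorJitter(brightness=0.3, contrast=0.3,
                                     saturation=0.3, hue=0.05)

    def fit_lbfgs(model, rgb_batch, depth_batch, steps=20):
        optimizer = torch.optim.LBFGS(model.parameters(), lr=0.5, max_iter=steps)

        def closure():
            optimizer.zero_grad()
            pred = model(augment(rgb_batch))                  # augmented RGB in, depth out
            loss = torch.mean(torch.abs(pred - depth_batch))  # average per-pixel error (mm)
            loss.backward()
            return loss

        return optimizer.step(closure)                        # returns the final loss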
The tests cover two aspects, a qualitative test and a quantitative test:
1) The detection performance of the depth estimation model is tested qualitatively with synthetic data generated by the embodiments of the present application. For example, FIG. 13A is a synthetic picture generated with the method of the embodiments of the present application; inputting the synthetic picture shown in FIG. 13A into the depth estimation model yields the depth map shown in FIG. 13B. From a qualitative point of view, the result is satisfactory.
2) The detection performance of the depth estimation model is tested quantitatively with real data: a depth camera is used to photograph a real target object, obtaining a real RGB photo and the target depth map corresponding to that photo; the real RGB photo is input into the depth estimation model to obtain an estimated depth map; and the estimated depth map is compared with the target depth map to obtain the quantitative test result.
It should be noted that directly comparing the estimated depth map with the target depth map may introduce considerable background interference. Therefore, before the real RGB photo is input into the depth estimation model, a semantic segmentation algorithm can first be used to process the RGB photo to obtain the target region (i.e. the region where the target object is located, such as the face region), with a mask applied to the other regions, as shown in FIG. 14. After the estimated depth map is obtained, only the target regions of the estimated depth map and the target depth map are compared, which improves the reliability of the test.
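Illustratively, the masked comparison can be expressed as averaging the error only over the segmented target region; a minimal sketch (not part of the original disclosure; arrays and values are assumptions, all depths in millimeters) is:

    import numpy as np

    def masked_mean_depth_error(estimated_mm, target_mm, mask):
        """mask: boolean array, True inside the target region (e.g. the face)."""
        diff = np.abs(estimated_mm.astype(np.float64) - target_mm.astype(np.float64))
        return diff[mask].mean()

    estimated = np.full((4, 4), 800, dtype=np.uint16)
    target = np.full((4, 4), 790, dtype=np.uint16)
    mask = np.zeros((4, 4), dtype=bool)
    mask[1:3, 1:3] = True                                      # pretend this is the face region
    print(masked_mean_depth_error(estimated, target, mask))    # 10.0 (mm)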
It should be understood that the above embodiments can be combined with one another to achieve different technical effects.
Based on the same technical concept, an embodiment of the present application further provides a model training apparatus, which includes modules/units for performing the steps of the methods described above.
Illustratively, FIG. 15 is a schematic structural diagram of a model training apparatus 150 provided by an embodiment of the present application. The apparatus 150 includes a processing module 1501 and a training module 1502.
The processing module 1501 is configured to: load a virtual camera and a target object model in the three-dimensional cockpit model, the target object model being located within the shooting range of the virtual camera; obtain a first image output by the virtual camera; generate a label of the first image from the scene information at the time the virtual camera outputs the first image, the label of the first image describing the identification information of the target object model in the first image; adjust the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model; obtain a second image output by the virtual camera; and generate a label of the second image from the scene information at the time the virtual camera outputs the second image, the label of the second image describing the identification information of the target object model in the second image.
The training module 1502 is configured to train the deep learning model based on the first image and its label and the second image and its label, where the trained deep learning model is used to recognize the identification information of a target object in a newly input cockpit image.
Optionally, the target object model is a human body model or a face model.
Optionally, the processing module 1501 is further configured to: before the target object model is loaded into the three-dimensional cockpit model, scan at least one real target object to obtain at least one target object model, and save the at least one target object model to a model library. When loading the target object model into the three-dimensional cockpit model, the processing module 1501 is specifically configured to: randomly select one or more target object models from the model library, and load the one or more target object models into the three-dimensional cockpit model.
Optionally, when loading the virtual camera in the three-dimensional cockpit model, the processing module 1501 is specifically configured to place the shooting parameters of a real camera in a real cockpit into the virtual camera, the shooting parameters including at least one of resolution, distortion parameters, focal length, field of view, aperture or exposure time.
Optionally, the environmental parameters of the three-dimensional cockpit model include at least one of the light field, the interior or the external environment of the three-dimensional cockpit model, where the light field includes at least one of the illumination brightness, the number of light sources, the color of the light sources or the position of the light sources.
Optionally, the pose parameters of the target object model include position coordinates and/or Euler angles.
Optionally, the trained deep learning model is used to recognize the depth information of a target object in a newly input cockpit image;
when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module 1501 is specifically configured to: determine, according to the positions of the feature points on the surface of the target object model relative to the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera at that moment, and use the first depth information as the label of the first image;
when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module 1501 is specifically configured to: determine, according to the positions of the feature points on the surface of the target object model relative to the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera at that moment, and use the second depth information as the label of the second image.
Optionally, the trained deep learning model is used to recognize the pose information of a target object in a newly input cockpit image;
when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module 1501 is specifically configured to: determine the pose parameters of the target object model when the virtual camera outputs the first image, and use those pose parameters as the label of the first image;
when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module 1501 is specifically configured to: determine the pose parameters of the target object model when the virtual camera outputs the second image, and use those pose parameters as the label of the second image.
For the specific implementation of the corresponding functions performed by each of the above modules, reference may be made to the specific implementation of the corresponding method steps in the method embodiments above, and details are not repeated here.
It should be understood that the division into modules in the embodiments of the present application is schematic and is merely a division by logical function; other divisions are possible in actual implementations. For example, the above processing module 1501 may be further subdivided:
Referring to FIG. 16, which is a schematic structural diagram of a model training apparatus 160 provided by an embodiment of the present application, the apparatus 160 includes a scene creation module 1601, a simulation module 1602 and a training module 1603.
The scene creation module 1601 is configured to build the three-dimensional cockpit model and to set up the virtual camera in the three-dimensional cockpit model. The type of cockpit is not limited in this application, for example the cockpit of a car, the cockpit of a ship or the cockpit of an aircraft. The shooting parameters of the virtual camera may be default values or the shooting parameters of a real camera, which is not limited here.
The simulation module 1602 is configured to load the target object model into the three-dimensional cockpit model and, by continuously adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, output a number of labeled training images. The type of target object is not limited in this application, for example a human body or a human face. The label carried by a training image is determined by the output the deep learning model is required to map to: for example, if the trained deep learning model is used to determine the depth information of the target object in an input cockpit image to be recognized, the label is the depth information of the target object model in the sample image; if the trained deep learning model is used to determine the pose of the target object in an input cockpit image to be recognized, the label is the pose of the target object model in the sample image.
The training module 1603 has a function similar to that of the training module 1502 described above, and is configured to train the deep learning model based on the labeled images output by the simulation module 1602 to obtain a trained deep learning model. During training, the images are the input of the deep learning model and the labels are its output.
Further optionally, the apparatus 160 may also include a data acquisition module configured to scan a real target object and generate the target object model.
Further optionally, the apparatus 160 may also include a storage module configured to store the synthetic data output by the simulation module (i.e. the images and the labels of the images).
Further optionally, the apparatus 160 may also include a verification module configured to verify the detection accuracy of the trained deep learning model.
For the specific implementation of the corresponding functions performed by each of the above modules, reference may be made to the specific implementation of the corresponding method steps in the method embodiments above, and details are not repeated here.
Based on the same technical concept, an embodiment of the present application further provides a computing device system, the computing device system including at least one computing device 170 as shown in FIG. 17. The computing device 170 includes a bus 1701, a processor 1702 and a memory 1704. Optionally, it may also include a communication interface 1703; the dashed box in FIG. 17 indicates that the communication interface 1703 is optional. The processor 1702 of the at least one computing device 170 executes the computer instructions stored in the memory 1704 to perform the methods provided in the method embodiments above.
The processor 1702, the memory 1704 and the communication interface 1703 communicate with one another through the bus 1701. When the computing device system includes multiple computing devices 170, the multiple computing devices 170 communicate through a communication path.
The processor 1702 may be a CPU. The memory 1704 may include volatile memory, such as random access memory. The memory 1704 may also include non-volatile memory, such as read-only memory, flash memory, an HDD or an SSD. The memory 1704 stores executable code, and the processor 1702 executes the executable code to perform any part or all of the foregoing methods. The memory may also include other software modules required by running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
Specifically, the memory 1704 stores any one or more of the modules of the aforementioned apparatus 150 or apparatus 160; FIG. 17 takes storing any one or more of the modules of the apparatus 150 as an example. In addition to any one or more of the above modules, the memory 1704 may also include other software modules required by running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
When the computing device system includes multiple computing devices 170, the multiple computing devices 170 establish communication with one another through a communication network, and any one or more of the modules of the aforementioned apparatus 150 or apparatus 160 run on each computing device.
Based on the same technical concept, an embodiment of the present application further provides a computer-readable storage medium storing instructions which, when executed, cause the methods provided in the method embodiments above to be implemented.
Based on the same technical concept, an embodiment of the present application further provides a chip, the chip being coupled to a memory and configured to read and execute program instructions stored in the memory to implement the methods provided in the method embodiments above.
Based on the same technical concept, an embodiment of the present application further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the methods provided in the method embodiments above.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the scope of protection of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass these changes and variations.

Claims (20)

  1. A method for training a deep learning model, comprising:
    loading a virtual camera and a target object model into the three-dimensional cockpit model, the target object model being located within the shooting range of the virtual camera;
    obtaining a first image output by the virtual camera, and generating a label of the first image from scene information at the time the virtual camera outputs the first image, the label of the first image being used to describe identification information of the target object model in the first image;
    adjusting environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model;
    obtaining a second image output by the virtual camera, and generating a label of the second image from scene information at the time the virtual camera outputs the second image, the label of the second image being used to describe identification information of the target object model in the second image; and
    training the deep learning model based on the first image and the label of the first image and the second image and the label of the second image, wherein the trained deep learning model is used to recognize identification information of a target object in a newly input cockpit image.
  2. The method according to claim 1, wherein the target object model is a human body model or a face model.
  3. The method according to claim 1, wherein before loading the target object model into the three-dimensional cockpit model, the method further comprises: scanning at least one real target object to obtain at least one target object model, and saving the at least one target object model to a model library;
    and loading the target object model into the three-dimensional cockpit model comprises: randomly selecting one or more target object models from the model library, and loading the one or more target object models into the three-dimensional cockpit model.
  4. The method according to claim 1, wherein loading the virtual camera into the three-dimensional cockpit model comprises:
    placing shooting parameters of a real camera in a real cockpit into the virtual camera, the shooting parameters comprising at least one of resolution, distortion parameters, focal length, field of view, aperture or exposure time.
  5. The method according to any one of claims 1-4, wherein the environmental parameters of the three-dimensional cockpit model comprise at least one of a light field, an interior or an external environment of the three-dimensional cockpit model;
    wherein the light field comprises at least one of illumination brightness, a number of light sources, a color of the light sources or a position of the light sources.
  6. The method according to any one of claims 1-4, wherein the pose parameters of the target object model comprise position coordinates and/or Euler angles.
  7. The method according to any one of claims 1-6, wherein the trained deep learning model is used to recognize depth information of the target object in the newly input cockpit image;
    generating the label of the first image from the scene information at the time the virtual camera outputs the first image comprises: determining, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera when the virtual camera outputs the first image, and using the first depth information as the label of the first image; and
    generating the label of the second image from the scene information at the time the virtual camera outputs the second image comprises: determining, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera when the virtual camera outputs the second image, and using the second depth information as the label of the second image.
  8. The method according to any one of claims 1-6, wherein the trained deep learning model is used to recognize pose information of the target object in the newly input cockpit image;
    generating the label of the first image from the scene information at the time the virtual camera outputs the first image comprises: determining pose parameters of the target object model when the virtual camera outputs the first image, and using the pose parameters of the target object model when the virtual camera outputs the first image as the label of the first image; and
    generating the label of the second image from the scene information at the time the virtual camera outputs the second image comprises: determining pose parameters of the target object model when the virtual camera outputs the second image, and using the pose parameters of the target object model when the virtual camera outputs the second image as the label of the second image.
  9. A model training apparatus, comprising:
    a processing module, configured to: load a virtual camera and a target object model into the three-dimensional cockpit model, the target object model being located within the shooting range of the virtual camera; obtain a first image output by the virtual camera; generate a label of the first image from scene information at the time the virtual camera outputs the first image, the label of the first image being used to describe identification information of the target object model in the first image; adjust environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model; obtain a second image output by the virtual camera; and generate a label of the second image from scene information at the time the virtual camera outputs the second image, the label of the second image being used to describe identification information of the target object model in the second image; and
    a training module, configured to train the deep learning model based on the first image and the label of the first image and the second image and the label of the second image, wherein the trained deep learning model is used to recognize identification information of a target object in a newly input cockpit image.
  10. The apparatus according to claim 9, wherein the target object model is a human body model or a face model.
  11. The apparatus according to claim 9, wherein the processing module is further configured to: before the target object model is loaded into the three-dimensional cockpit model, scan at least one real target object to obtain at least one target object model, and save the at least one target object model to a model library;
    and when loading the target object model into the three-dimensional cockpit model, the processing module is specifically configured to: randomly select one or more target object models from the model library, and load the one or more target object models into the three-dimensional cockpit model.
  12. The apparatus according to claim 9, wherein when loading the virtual camera into the three-dimensional cockpit model, the processing module is specifically configured to:
    place shooting parameters of a real camera in a real cockpit into the virtual camera, the shooting parameters comprising at least one of resolution, distortion parameters, focal length, field of view, aperture or exposure time.
  13. The apparatus according to any one of claims 9-12, wherein the environmental parameters of the three-dimensional cockpit model comprise at least one of a light field, an interior or an external environment of the three-dimensional cockpit model;
    wherein the light field comprises at least one of illumination brightness, a number of light sources, a color of the light sources or a position of the light sources.
  14. The apparatus according to any one of claims 9-12, wherein the pose parameters of the target object model comprise position coordinates and/or Euler angles.
  15. The apparatus according to any one of claims 9-14, wherein the trained deep learning model is used to recognize depth information of the target object in the newly input cockpit image;
    when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module is specifically configured to: determine, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera when the virtual camera outputs the first image, and use the first depth information as the label of the first image; and
    when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module is specifically configured to: determine, according to relative positions of feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera when the virtual camera outputs the second image, and use the second depth information as the label of the second image.
  16. The apparatus according to any one of claims 9-14, wherein the trained deep learning model is used to recognize pose information of the target object in the newly input cockpit image;
    when generating the label of the first image from the scene information at the time the virtual camera outputs the first image, the processing module is specifically configured to: determine pose parameters of the target object model when the virtual camera outputs the first image, and use the pose parameters of the target object model when the virtual camera outputs the first image as the label of the first image; and
    when generating the label of the second image from the scene information at the time the virtual camera outputs the second image, the processing module is specifically configured to: determine pose parameters of the target object model when the virtual camera outputs the second image, and use the pose parameters of the target object model when the virtual camera outputs the second image as the label of the second image.
  17. A computing device system, comprising at least one computing device, each computing device comprising a memory and a processor, the processor being configured to execute computer instructions stored in the memory to perform the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions which, when executed, cause the method according to any one of claims 1-8 to be implemented.
  19. A chip, wherein the chip is coupled to a memory and configured to read and execute program instructions stored in the memory to implement the method according to any one of claims 1-8.
  20. A computer program product comprising instructions, wherein the instructions, when run on a computer, cause the computer to perform the method according to any one of claims 1-8.
PCT/CN2021/075856 2021-02-07 2021-02-07 Method and apparatus for training deep learning model WO2022165809A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/075856 WO2022165809A1 (en) 2021-02-07 2021-02-07 Method and apparatus for training deep learning model
CN202180000179.9A CN112639846A (en) 2021-02-07 2021-02-07 Method and device for training deep learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/075856 WO2022165809A1 (en) 2021-02-07 2021-02-07 Method and apparatus for training deep learning model

Publications (1)

Publication Number Publication Date
WO2022165809A1 true WO2022165809A1 (en) 2022-08-11

Family

ID=75297673

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075856 WO2022165809A1 (en) 2021-02-07 2021-02-07 Method and apparatus for training deep learning model

Country Status (2)

Country Link
CN (1) CN112639846A (en)
WO (1) WO2022165809A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273577A (en) * 2022-09-26 2022-11-01 丽水学院 Photography teaching method and system
CN115331309A (en) * 2022-08-19 2022-11-11 北京字跳网络技术有限公司 Method, apparatus, device and medium for recognizing human body action
CN116563246A (en) * 2023-05-10 2023-08-08 之江实验室 Training sample generation method and device for medical image aided diagnosis

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516778A (en) * 2021-04-14 2021-10-19 武汉联影智融医疗科技有限公司 Model training data acquisition method and device, computer equipment and storage medium
CN113240611B (en) * 2021-05-28 2024-05-07 中建材信息技术股份有限公司 Foreign matter detection method based on picture sequence
CN113362388A (en) * 2021-06-03 2021-09-07 安徽芯纪元科技有限公司 Deep learning model for target positioning and attitude estimation
CN114332224A (en) * 2021-12-29 2022-04-12 北京字节跳动网络技术有限公司 Method, device and equipment for generating 3D target detection sample and storage medium
CN115223002B (en) * 2022-05-09 2024-01-09 广州汽车集团股份有限公司 Model training method, door opening motion detection device and computer equipment
CN117441190A (en) * 2022-05-17 2024-01-23 华为技术有限公司 Position positioning method and device
CN115578515B (en) * 2022-09-30 2023-08-11 北京百度网讯科技有限公司 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428388A (en) * 2019-07-11 2019-11-08 阿里巴巴集团控股有限公司 A kind of image-data generating method and device
US20200160114A1 (en) * 2017-07-25 2020-05-21 Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Method for generating training data, image semantic segmentation method and electronic device
CN112132213A (en) * 2020-09-23 2020-12-25 创新奇智(南京)科技有限公司 Sample image processing method and device, electronic equipment and storage medium
CN112232293A (en) * 2020-11-09 2021-01-15 腾讯科技(深圳)有限公司 Image processing model training method, image processing method and related equipment
CN112308103A (en) * 2019-08-02 2021-02-02 杭州海康威视数字技术股份有限公司 Method and device for generating training sample

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11257272B2 (en) * 2019-04-25 2022-02-22 Lucid VR, Inc. Generating synthetic image data for machine learning

Also Published As

Publication number Publication date
CN112639846A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2022165809A1 (en) Method and apparatus for training deep learning model
Sahu et al. Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review
US11373332B2 (en) Point-based object localization from images
Sakaridis et al. Semantic foggy scene understanding with synthetic data
US20180012411A1 (en) Augmented Reality Methods and Devices
US10019652B2 (en) Generating a virtual world to assess real-world video analysis performance
US11288857B2 (en) Neural rerendering from 3D models
DE112019002589T5 (en) DEPTH LEARNING SYSTEM
CN109407547A (en) Multi-cam assemblage on-orbit test method and system towards panoramic vision perception
CN114972617B (en) Scene illumination and reflection modeling method based on conductive rendering
CN101422035A (en) Image high-resolution upgrading device, image high-resolution upgrading method, image high-resolution upgrading program and image high-resolution upgrading system
Bešić et al. Dynamic object removal and spatio-temporal RGB-D inpainting via geometry-aware adversarial learning
CN112365604A (en) AR equipment depth of field information application method based on semantic segmentation and SLAM
CN115226406A (en) Image generation device, image generation method, recording medium generation method, learning model generation device, learning model generation method, learning model, data processing device, data processing method, estimation method, electronic device, generation method, program, and non-transitory computer-readable medium
Du et al. Video fields: fusing multiple surveillance videos into a dynamic virtual environment
WO2022052782A1 (en) Image processing method and related device
WO2022165722A1 (en) Monocular depth estimation method, apparatus and device
Goncalves et al. Deepdive: An end-to-end dehazing method using deep learning
CN112651881A (en) Image synthesis method, apparatus, device, storage medium, and program product
CN112308977A (en) Video processing method, video processing apparatus, and storage medium
US20220343639A1 (en) Object re-identification using pose part based models
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
CN116012805B (en) Target perception method, device, computer equipment and storage medium
Bai et al. Cyber mobility mirror for enabling cooperative driving automation: A co-simulation platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21923820

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21923820

Country of ref document: EP

Kind code of ref document: A1