WO2022165809A1 - Method and apparatus for training a deep learning model - Google Patents

Method and apparatus for training a deep learning model

Info

Publication number
WO2022165809A1
WO2022165809A1 (PCT/CN2021/075856, CN2021075856W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
model
target object
virtual camera
label
Prior art date
Application number
PCT/CN2021/075856
Other languages
English (en)
Chinese (zh)
Inventor
刘杨
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN202180000179.9A priority Critical patent/CN112639846A/zh
Priority to PCT/CN2021/075856 priority patent/WO2022165809A1/fr
Publication of WO2022165809A1 publication Critical patent/WO2022165809A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions

Definitions

  • In the embodiments of the present application, the three-dimensional cockpit model is used as the background of the target object model during the simulation process, which improves the authenticity of the synthetic data (that is, the first image and the label of the first image, and the second image and the label of the second image).
  • During the simulation, the pose of the target object model and the environment of the three-dimensional cockpit model are adjusted, so that diverse, cockpit-specific synthetic data can be obtained without any manual participation in the whole process, achieving the effect of obtaining diverse annotated training data at low cost and high efficiency for a new application scenario.
  • Using such annotated training data to train the model can effectively improve the generalization ability of the deep learning model and optimize the accuracy and effectiveness of the deep learning model.
  • FIG. 5 is a schematic diagram of another deep learning model device provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the relative position of the target object model and the virtual camera
  • Machine vision systems can convert imperfect, blurry, and constantly changing images into semantic representations. FIG. 1 is a schematic diagram of a vision system identifying abnormal behavior: after an image to be recognized is input into the vision system, the system can output the corresponding semantic information identifying the abnormal behavior in the image, namely "smoking".
  • Vision systems can be implemented by deep learning methods.
  • ImageNet is the name of a computer vision recognition project, and the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is the large-scale benchmark competition associated with it.
  • In the ILSVRC image classification track (1000 classes), the AlexNet model of the winning team led by Hinton reduced the top-5 error rate to 15.3%, far ahead of the 26.2% achieved by the second-place entry based on an SVM algorithm. Since then, deep learning has been on a path of rapid development.
  • The sources of real data generally include: 1) public datasets; 2) data collection and labeling suppliers; and 3) self-collection.
  • synthetic data is used to construct a training dataset for the deep learning model.
  • Synthetic data refers to data generated by a computer-simulated environment rather than data measured and collected from a real-world environment. Such data is anonymous and raises no privacy concerns, and since it is created from user-specified parameters, it can be made to match the characteristics of real-world data as closely as possible.
  • However, existing synthetic data does not consider the cockpit scene, so when a deep learning model trained with such synthetic data recognizes a cockpit-scene image, it is difficult for the model to distinguish the foreground (that is, the target object to be detected) from the background, resulting in poor model accuracy.
  • In view of this, the embodiments of the present application provide a method and apparatus for training a deep learning model, so as to acquire diverse annotated training data at low cost and high efficiency for a new application scenario (taking the cockpit as an example) and improve the generalization ability of the deep learning model.
  • the model training device may also be a device composed of multiple parts, such as a scene creation module, a simulation module, and a training module, which is not limited in this application.
  • Each component can be deployed in different systems or servers.
  • Each part of the apparatus may run in any of three environments, namely a cloud computing device system, an edge computing device system, or a terminal computing device, or may run in any two of these three environments.
  • the cloud computing device system, the edge computing device system, and the terminal computing device are connected by a communication channel, and can communicate and transmit data with each other.
  • the method for training a deep learning model provided by the embodiment of the present application is performed by the combined parts of the model training apparatus running in the three environments (or any two of the three environments).
  • The 3D cockpit model may be obtained directly from an external source; for example, the 3D cockpit model corresponding to a car can be obtained from the manufacturer that produced the car. It may also be generated by direct modeling, for example using dedicated software such as a three-dimensional modeling tool, or generated by other methods, which is not limited in this application.
  • The specific modeling method may be Non-Uniform Rational B-Splines (NURBS), polygon meshes, or the like, which is not restricted here.
  • The internal environment may include the internal structure of the cockpit, such as the material, texture, shape, structure, and color of interior items such as seats, steering wheels, storage boxes, ornaments, seat cushions, floor mats, car pendants, interior decorations, and daily necessities (such as paper towels, water cups, and mobile phones).
  • the interior environment may also include a light field (or illumination) inside the cabin, such as illumination brightness, number of light sources, color of light sources, location of light sources or orientation of light sources, and the like.
  • the external environment includes the geographic location, shape, texture, color, etc. of the external environment.
  • If the three-dimensional cockpit model is located in an urban environment, there may also be urban buildings (such as tall buildings) and lane signs (such as traffic lights) outside the cockpit.
  • If the three-dimensional cockpit model is located in a field environment, there may also be flowers, plants, and trees outside the cockpit.
  • In this way, the content captured by the virtual camera can be made closer to the content captured by a real camera in practical applications, thereby helping to improve the realism of the synthetic data (the synthetic data here refers to the labeled image output in step S602, which may also be referred to as a sample image herein).
  • The virtual camera in the embodiments of the present application can describe the position, attitude, and other shooting parameters of the camera in the real cockpit. Based on these parameters of the virtual camera and the environmental parameters of the three-dimensional cockpit model, a two-dimensional image of the interior of the three-dimensional cockpit model can be output from the perspective of the virtual camera, which achieves the effect of a real camera in the real world photographing a two-dimensional image of the real cockpit interior scene.
  • FIG. 7 is an interior view of the car model from one perspective (looking from the inside out).
  • A virtual camera is placed inside the cockpit of the three-dimensional cockpit model, and the virtual camera can photograph the scene in the cockpit (at least an image of the main driving position can be captured).
  • Let the Y-axis of the coordinate system of the simulation space be the vertical direction, with the vertically upward direction being the direction in which the Y-axis coordinate increases, and the Z-axis and X-axis lying in the horizontal plane. The virtual camera is set in the coordinate system of the simulation space, and the shooting direction of the virtual camera is the direction in which the Z-axis coordinate increases.
  • If the 3D cockpit model already comes with a virtual camera, that virtual camera can be used directly, provided that the selected virtual camera can output images of the cockpit interior, that is, the selected camera is capable of capturing the scene in the cockpit (at least an image of the primary driver's position).
  • Positions 4 and 8 (for example, on the inside of the rear door glass) can photograph persons in the rear seats.
  • Positions 5 and 7 (for example, on the backs of the front seats) can photograph persons in the rear seats.
  • Position 11 (for example, at the rearview mirror).
  • the shooting parameters of the virtual camera also need to be set.
  • The shooting parameters may include one or more of resolution, distortion parameters, focal length, field of view, aperture, exposure time, and the like. If the pose of the virtual camera is adjustable, the shooting parameters may further include the position and pose of the virtual camera, where the pose of the virtual camera may be represented by Euler angles. Of course, the shooting parameters may also include other parameters, as long as they are parameters that affect the shooting effect of the virtual camera, which is not limited in this application.
  • the shooting parameters can be set to default values, which is simple to implement and low in cost.
  • Alternatively, the shooting parameters can be obtained by collecting the shooting parameters of the real camera in the real cockpit and applying the collected shooting parameters to the virtual camera.
  • For example, the intrinsic parameters of the on-board Driver Monitoring System (DMS) and Cockpit Monitoring System (CMS) cameras can be obtained by Zhang's calibration method, and these intrinsic parameters are then applied to the virtual camera. In this way, the shooting effect of the virtual camera is closer to that of a real camera in practical applications, thereby helping to improve the realism of the synthetic data.
  • the above two manners of determining the shooting parameters may be implemented independently, or may be implemented in combination with each other.
  • For example, the resolution adopts a default value (such as 256x256), while the values of the real camera are collected for the distortion parameters, focal length, field of view, aperture, and exposure time (see the sketch below).
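  • The following is a minimal sketch of this calibrate-and-transfer step, assuming checkerboard photos taken by the real DMS/CMS camera are available: the real camera is calibrated with Zhang's method via OpenCV, and the resulting intrinsics are copied onto the virtual camera. The VirtualCamera container and the function names are illustrative placeholders, not part of the embodiment.

      # Sketch: estimate the intrinsics of a real DMS/CMS camera with Zhang's method
      # (OpenCV) and copy them onto a virtual camera used in the simulation.
      from dataclasses import dataclass
      import glob
      import cv2
      import numpy as np

      @dataclass
      class VirtualCamera:                      # hypothetical container for shooting parameters
          resolution: tuple = (256, 256)        # default value, as in the example above
          camera_matrix: np.ndarray = None      # focal length and principal point
          dist_coeffs: np.ndarray = None        # distortion parameters
          position: tuple = (0.0, 0.0, 0.0)     # pose in the simulation coordinate system
          euler_angles: tuple = (0.0, 0.0, 0.0) # pitch, yaw, roll

      def calibrate_real_camera(image_glob, board_size=(9, 6), square_mm=25.0):
          """Zhang's calibration from checkerboard photos taken by the real camera."""
          objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
          objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square_mm
          obj_pts, img_pts, img_size = [], [], None
          for path in glob.glob(image_glob):
              gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
              img_size = gray.shape[::-1]
              found, corners = cv2.findChessboardCorners(gray, board_size)
              if found:
                  obj_pts.append(objp)
                  img_pts.append(corners)
          _, mtx, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, img_size, None, None)
          return mtx, dist, img_size

      def configure_virtual_camera(image_glob):
          mtx, dist, size = calibrate_real_camera(image_glob)
          return VirtualCamera(resolution=size, camera_matrix=mtx, dist_coeffs=dist)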
  • The type of the target object model needs to be determined according to the target object that the deep learning model needs to recognize when it is applied. For example, if the deep learning model needs to recognize the depth information of a human face in an image, the target object model can be a human face model; if the deep learning model needs to recognize the posture of a human body in an image, the target object model can be a human body model. It should be understood that the above is only an example and not a limitation, and the present application does not limit the specific type of the target object model; for example, it can also be an animal (such as a dog or a cat).
  • loading the target object model into the three-dimensional cockpit model includes: randomly selecting one or more target object models from the model library, and loading the selected target object model into the three-dimensional cockpit model. It should be understood that when there are multiple target object models, the method steps performed for each target object model in this embodiment of the present application are similar. Therefore, the following description will mainly take one target object model as an example.
  • FIGS. 9A-9B are schematic diagrams of a high-quality human face model, wherein FIG. 9A is a front view of the high-quality human face model and FIG. 9B is a side view of the high-quality human face model. Since the target object models in the model library are all generated based on scans of real target objects, the target object model can provide more texture details and diversity, which improves the realism of the target object model and in turn helps to improve the diversity and realism of the synthetic data.
  • The environmental parameters of the three-dimensional cockpit model may include parameters of the internal environment, such as the brightness of the light inside the cockpit, the number of light sources, the color of the light sources, or the positions of the light sources, as well as the type, quantity, position, shape, structure, and color of the interior items.
  • the environment parameters of the three-dimensional cockpit model may also include parameters of the external environment, such as the geographic location, shape, texture, color, etc. of the external environment. In this way, by changing the environmental parameters of the three-dimensional cockpit model, more diverse sample images (ie, images with labels, or synthetic data) can be obtained.
  • The pose parameters of the target object model include the position coordinates of the target object model (for example, the coordinate values (x, y, z) in the coordinate system of the simulation space) and/or the Euler angles of the target object model (including the pitch angle, yaw angle, and roll angle). In this way, more diverse sample images can be obtained by changing the pose parameters of the target object model.
  • Adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model and outputting several training images carrying labels includes the following steps.
  • Step 1: Set the environmental parameter of the three-dimensional cockpit model to a first environmental parameter and set the pose parameter of the target object model to a first pose parameter; output a first image based on the virtual camera in the three-dimensional cockpit model, and generate a label of the first image according to the scene information when the virtual camera outputs the first image, thereby obtaining the first image carrying its label.
  • Step 2: Adjust the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model, for example, setting the environmental parameter of the three-dimensional cockpit model to a second environmental parameter and the pose parameter of the target object model to a second pose parameter; output a second image based on the virtual camera, and generate a label of the second image according to the scene information when the virtual camera outputs the second image, thereby obtaining the second image carrying its label.
  • Because the parameters are changed, the first image output in Step 1 is different from the second image output in Step 2, and the two labeled images are therefore also different.
  • The simulation script is the program code used to execute the simulation. When the simulation script is run by a computer, the computer can realize the following functions: randomly select a face model from the model library and load it into the 3D cockpit model; control the X and Y coordinates of the face model to change randomly within a range of plus or minus 100 mm and the Z coordinate to change randomly within a range of 200 to 800 mm; and control the pitch angle (Pitch) of the face model, and so on. A possible sketch of such a simulation script is given below.
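  • The sketch below is illustrative only: the simulation-engine calls (load_face_model, randomize_environment, set_pose, render_rgb, scene_depth) are hypothetical placeholders for whatever API the 3D engine exposes, and the plus-or-minus 30 degree Euler-angle ranges are an assumption, since the embodiment only states that the pitch angle is controlled.

      # Sketch of the simulation script described above: randomize the face-model
      # pose (X/Y within +/-100 mm, Z within 200-800 mm) and the cockpit environment,
      # then save each rendered image together with its label.
      import random

      def random_pose():
          x = random.uniform(-100.0, 100.0)      # mm, plus or minus 100 mm
          y = random.uniform(-100.0, 100.0)      # mm
          z = random.uniform(200.0, 800.0)       # mm, 200 to 800 mm from the camera
          pitch = random.uniform(-30.0, 30.0)    # degrees (assumed range)
          yaw = random.uniform(-30.0, 30.0)      # degrees (assumed range)
          roll = random.uniform(-30.0, 30.0)     # degrees (assumed range)
          return (x, y, z), (pitch, yaw, roll)

      def save_sample(name, image, label):
          # placeholder: persist the RGB image and its label (e.g. PNG + 16-bit depth map)
          ...

      def run_simulation(cockpit, model_library, camera, num_samples=1000):
          face = cockpit.load_face_model(random.choice(model_library))
          for i in range(num_samples):
              cockpit.randomize_environment()        # light field, exterior, ...
              position, euler = random_pose()
              face.set_pose(position, euler)
              image = camera.render_rgb()            # the first/second/... image
              label = camera.scene_depth(face)       # or the face pose, depending on the task
              save_sample(f"sample_{i:05d}", image, label)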
  • the synthetic data is obtained based on the output image of the virtual camera and the output label based on the scene information.
  • The target object model loaded in the three-dimensional cockpit model may also be updated during the simulation. For example, after a preset number of labeled images have been output based on the current target object model, one or more target object models are re-selected from the model library, the re-selected target object models are loaded into the 3D cockpit model, and Steps 1 and 2 above are performed on them to output more labeled images. In this way, the diversity of the synthetic data can be further improved.
  • The shooting parameters of the virtual camera may also be updated during the simulation. For example, after the environmental parameter of the three-dimensional cockpit model has been set to the first environmental parameter, the pose parameter of the target object model has been set to the first pose parameter, and the first image carrying its label has been output, the shooting parameters of the virtual camera are updated (for example, the resolution or exposure time is adjusted); then, based on the same environmental and pose parameters (that is, keeping the environmental parameters of the 3D cockpit model at the first environmental parameters and the pose parameters of the target object model at the first pose parameters, unchanged), a third image different from the first image is output. In this way, the diversity of the synthetic data can be further improved.
  • When the target object model can be divided into multiple parts whose poses can be independent of each other, during the simulation process, in addition to adjusting the overall pose parameters of the target object model (that is, keeping the relative poses of the parts unchanged), the pose parameters of each part of the target object model can also be adjusted separately.
  • For example, if the target object model is a human body model, the overall pose parameters of the human body model may be adjusted, or only the pose parameters of the head may be adjusted, or only the pose parameters of the hands may be adjusted, or the pose parameters of the head and the hands may be adjusted simultaneously, and so on. In this way, the flexibility of the simulation can be improved and the diversity of the synthetic data can be further increased.
  • The most important role of the virtual camera is to determine the rendering effect (for example, the position of the image plane, the resolution of the image, the field of view of the image, and so on) according to the shooting parameters of the virtual camera, so that the display effect of the rendered 2D image is the same as that of an image captured by a real camera using the same shooting parameters, thereby improving the realism of the RGB image.
  • the label of the image is used to describe the identification information of the target object in the cockpit image.
  • The label type of the image needs to be determined according to the output data (namely, the identification information) that the deep learning model needs to map to. For example, if the trained deep learning model is used to determine the depth information of the target object in the image to be recognized, the label includes the depth information of the target object model; if the trained deep learning model is used to determine the attitude of the target object in the image to be recognized, the label includes the attitude information of the target object model; and if the trained deep learning model is used to determine both the depth information and the attitude information of the target object in the image to be recognized, the label includes both the depth information and the attitude information of the target object model.
  • other types of labels may also exist in practical applications, which are not limited in this application.
  • For different label types, the embodiments of the present application use different specific methods of generating labels based on the scene information.
  • The following takes labels containing depth information and attitude information as examples to introduce two possible methods of generating labels based on scene information.
  • Example 1: The label includes depth information.
  • Depth estimation is a fundamental problem in the field of computer vision. Many devices, such as depth cameras and millimeter-wave radars, can directly acquire depth, but automotive-grade high-resolution depth cameras are expensive. Binocular cameras can also be used for depth estimation, but because binocular images require stereo matching for pixel correspondence and disparity calculation, the computational complexity is high, and the matching effect is poor in low-texture scenes.
  • In contrast, monocular depth estimation is relatively cheap and easy to popularize. Monocular depth estimation, as the name implies, uses a single RGB image to estimate the distance of each pixel in the image relative to the camera.
  • the synthetic data generated in the embodiment of the present application may be an image carrying a depth information label, and may be used as training data for a supervised monocular depth estimation algorithm based on deep learning.
  • the trained monocular depth estimation algorithm can be used to estimate depth information based on RGB images.
  • The method of generating a depth information label according to scene information includes: according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs a certain image, determining the depth information of the target object model relative to the virtual camera at that moment, and using the depth information as the label of the image.
  • Generating the label of the first image according to the scene information when the virtual camera outputs the first image includes: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera at that moment, and using the first depth information as the label of the first image.
  • Generating the label of the second image according to the scene information when the virtual camera outputs the second image includes: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera at that moment, and using the second depth information as the label of the second image.
  • Each feature point on the target object model has its own 3D coordinates (x, y, z), and the depth corresponding to each feature point is not the straight-line distance from the feature point to the center of the virtual camera aperture, but the vertical distance from that point to the plane of the virtual camera aperture. Therefore, obtaining the depth information of each feature point on the target object model essentially means obtaining the vertical distance between each feature point and the plane where the virtual camera is located, which can be calculated from the position coordinates of the virtual camera and the position coordinates of the target object model.
  • Assume that the camera aperture center S is located at the origin O of the coordinate system of the simulation space, that is, the coordinates of the camera aperture center S are (0, 0, 0), the Y-axis is the vertical direction, and the vertically upward direction is the positive direction of the Y-axis.
  • The depth value corresponding to each feature point is equal to the vertical distance between the feature point and the plane where the virtual camera is located, which in this setup is equal to the Z coordinate of the feature point. It can be seen from FIG. 10 that points A and B are located at different positions on the face, and the straight-line distance from point B to the center of the virtual camera aperture is greater than that from point A; however, the vertical distance from point C to the plane where the virtual camera aperture is located is smaller than the vertical distances from points A and B to that plane, that is, the Z coordinates of points A and B are greater than the Z coordinate of point C (z3 < z1, z2), so the depth values of points A and B are greater than the depth value of point C.
  • the depth information corresponding to each RGB image may be expressed in the form of a depth map.
  • the RGB image and/or the depth map can be in JPG, PNG or JPEG format, which is not limited in this application.
  • the depth of each pixel in the depth map can be stored in 16 bits (two bytes) in millimeters.
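  • As a minimal sketch of this label computation under the coordinate convention above (aperture center at the origin, shooting direction along the positive Z-axis): the depth of a surface point is its Z coordinate rather than its straight-line distance to the aperture center, and the per-pixel depth map is stored as 16-bit values in millimeters. The pinhole intrinsics and sign conventions used for the projection below are illustrative assumptions, not taken from the embodiment.

      # Sketch: depth label = Z coordinate (vertical distance to the aperture plane),
      # packed into a 16-bit depth map in millimeters.
      import numpy as np

      def point_depths_mm(points_xyz):
          """Depth label for each feature point: its Z coordinate, in millimeters."""
          points_xyz = np.asarray(points_xyz, dtype=np.float64)
          return points_xyz[:, 2]                   # not np.linalg.norm(points_xyz, axis=1)

      def depth_map_16bit(points_xyz, fx=400.0, fy=400.0, cx=128.0, cy=128.0, size=(256, 256)):
          """Project surface points with a pinhole model and keep the nearest depth per pixel."""
          depth = np.full(size, np.iinfo(np.uint16).max, dtype=np.uint16)
          for x, y, z in np.asarray(points_xyz, dtype=np.float64):
              if z <= 0:
                  continue                           # behind the aperture plane
              u = int(round(fx * x / z + cx))        # sign conventions depend on the engine
              v = int(round(fy * y / z + cy))
              if 0 <= v < size[0] and 0 <= u < size[1]:
                  depth[v, u] = min(depth[v, u], np.uint16(round(z)))  # millimeters in 16 bits
          return depth

    This matches the FIG. 10 discussion: a point with a larger straight-line distance to the aperture center can still have a smaller depth value, because only the Z coordinate matters.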
  • FIGS. 11A and 11B are an example of a set of synthetic data, wherein FIG. 11A is an RGB image, FIG. 11B is the depth map corresponding to the RGB image shown in FIG. 11A, and FIG. 11C is a schematic diagram showing FIG. 11B more intuitively in Matplotlib.
  • Example 2: The label includes pose information.
  • The pose (attitude) estimation problem is to determine the orientation of a three-dimensional target object.
  • Pose estimation has applications in many fields such as robot vision, motion tracking, and single-camera calibration.
  • the synthetic data generated in the embodiment of the present application may be an image carrying a pose information label, and may be used as training data for a supervised pose estimation algorithm based on deep learning.
  • the trained pose estimation algorithm can be used to estimate the pose of the target object based on the RGB image.
  • The method of generating the pose information label according to the scene information includes: determining the pose parameters of the target object model when the virtual camera outputs a certain image, and using the pose parameters as the label of the image.
  • Generating the label of the first image according to the scene information when the virtual camera outputs the first image includes: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image.
  • Generating the label of the second image according to the scene information when the virtual camera outputs the second image includes: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
  • the attitude information may be represented by Euler angles.
  • the Euler angle may specifically be the rotation angle of the head around the three coordinate axes (ie, X, Y, and Z axes) of the simulation space coordinate system.
  • the rotation angle around the X axis is the pitch angle (Pitch)
  • the rotation angle around the Y axis is the yaw angle (Yaw)
  • the rotation angle around the Z axis is the roll angle (Roll).
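  • Below is a minimal sketch of turning such a pose into a label, assuming the simulation engine exposes the rotation of the target object (here, a head) as a rotation matrix in the simulation coordinate system; the "xyz" rotation order passed to SciPy is an assumption, since the embodiment only fixes which axis each angle refers to.

      # Sketch: Euler-angle pose label (pitch about X, yaw about Y, roll about Z).
      from scipy.spatial.transform import Rotation

      def pose_label(rotation_matrix, position_xyz=None):
          pitch, yaw, roll = Rotation.from_matrix(rotation_matrix).as_euler("xyz", degrees=True)
          label = {"pitch": float(pitch), "yaw": float(yaw), "roll": float(roll)}
          if position_xyz is not None:               # optionally include the position coordinates
              label["position"] = [float(v) for v in position_xyz]
          return label

      # Example: a head turned 20 degrees about the vertical Y axis, 500 mm from the camera.
      R_head = Rotation.from_euler("y", 20, degrees=True).as_matrix()
      print(pose_label(R_head, position_xyz=(0.0, 0.0, 500.0)))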
  • Model training: the deep learning model is trained based on the labeled images to obtain a trained deep learning model.
  • Specifically, the RGB image in each group of synthetic data is used as the input of the deep learning model, the label corresponding to the RGB image is used as the expected output of the deep learning model, and the deep learning model is trained in this way to obtain the trained deep learning model (for example, a monocular depth estimation algorithm).
  • the deep learning model can output the depth information of the face in the cockpit image, thereby assisting in the realization of gaze tracking, eye positioning and other functions.
  • The deep learning model can output the posture information of the human body in the cockpit image (such as the Euler angles of the head, the Euler angles of the hands, and so on), thereby assisting in the realization of human body motion tracking, driver distraction detection, and other functions.
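  • The following is a minimal sketch of such supervised training in PyTorch, where the RGB image is the network input and its label (a depth map or Euler angles) is the regression target; the dataset wrapper, optimizer, and hyperparameters are illustrative choices rather than part of the embodiment.

      # Sketch: supervised training on the synthetic data (image in, label out).
      import torch
      from torch import nn
      from torch.utils.data import DataLoader

      def train(model, dataset, epochs=10, lr=1e-4, device="cpu"):
          model = model.to(device)
          loader = DataLoader(dataset, batch_size=16, shuffle=True)
          criterion = nn.L1Loss()                       # mean absolute error per pixel / angle
          optimizer = torch.optim.Adam(model.parameters(), lr=lr)
          for _ in range(epochs):
              for rgb, label in loader:                 # rgb: Bx3xHxW, label: depth map or angles
                  rgb, label = rgb.to(device), label.to(device)
                  optimizer.zero_grad()
                  loss = criterion(model(rgb), label)
                  loss.backward()
                  optimizer.step()
          return model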
  • In the simulation process, the embodiments of the present application use the 3D model of the cockpit (that is, the 3D cockpit model) as the background of the target object model (such as a human body model or a face model), so as to create a realistic background and realistic light and shadow. Compared with the existing method of generating training data by simply rendering the human body model, this improves the authenticity of the synthetic data. In addition, the pose of the target object model and the environment of the three-dimensional cockpit model are randomly set during the simulation, so that diverse, cockpit-specific synthetic data is obtained, which realizes the effect of obtaining diverse annotated training data at low cost and high efficiency for new application scenarios.
  • the embodiments of the present application can effectively improve the generalization ability of the deep learning model, and optimize the accuracy and effectiveness of the deep learning model.
  • In addition, the shooting parameters of the real camera in the real cockpit can be applied to the virtual camera, so as to improve the realism of the images captured by the virtual camera and further optimize the accuracy and effectiveness of the deep learning model.
  • The embodiments of the present application can also generate the target object model based on real human body or face scan data, so that the target object model provides more texture details and diversity, thereby further improving the generalization ability of the deep learning model and optimizing the accuracy and effectiveness of the deep learning model.
  • For example, a depth estimation model (that is, a deep learning model for estimating depth information) can be built based on the semantic segmentation network SegNet. FIG. 12 is a schematic diagram of the architecture of the SegNet model.
  • color dithering can be performed on the input image of the model as data enhancement.
  • the loss function of the model can be set as the average error (in millimeters) of the per-pixel depth estimate from the true depth.
  • the model was optimized using the quasi-Newton method (L-BFGS), and it was observed that the model converged significantly on the synthesized training data.
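  • As a minimal sketch of the loss and optimization described above: the loss is the average per-pixel absolute error, in millimeters, between the estimated depth and the true depth, and the parameters are optimized with the quasi-Newton L-BFGS method (here via torch.optim.LBFGS). The learning rate and iteration count are illustrative, and the SegNet backbone itself is not reproduced; the color-dithering augmentation mentioned above can be added as a transform on the training images.

      # Sketch: per-pixel depth loss in millimeters and one L-BFGS optimization step.
      import torch

      def depth_loss_mm(pred_depth, true_depth):
          """Average per-pixel absolute depth error, both maps expressed in millimeters."""
          return torch.mean(torch.abs(pred_depth - true_depth))

      def lbfgs_step(model, rgb_batch, depth_batch):
          optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1, max_iter=20)

          def closure():
              optimizer.zero_grad()
              loss = depth_loss_mm(model(rgb_batch), depth_batch)
              loss.backward()
              return loss

          return optimizer.step(closure)      # returns the loss evaluated by the closure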
  • the test includes two aspects: qualitative test and quantitative test:
  • FIG. 13A is a synthetic picture generated by using the method of the embodiment of the present application, and the synthetic picture shown in FIG. 13A is input into the depth estimation model to obtain the depth map shown in FIG. 13B . From a qualitative point of view, the effect is ideal.
  • For the quantitative test, the RGB photos are processed to obtain the target area (that is, the area where the target object is located, such as the face area), and the other areas are masked, as shown in FIG. 14. After the estimated depth map is obtained, only the target areas of the estimated depth map and the ground-truth depth map are compared, which improves the reliability of the test.
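  • A minimal sketch of this masked comparison is given below, assuming the estimated depth map, the ground-truth depth map, and a boolean mask of the target area are available as arrays of the same size.

      # Sketch: compare only the target (e.g. face) area of the estimated and
      # ground-truth depth maps; masked-out pixels do not affect the score.
      import numpy as np

      def masked_depth_error_mm(pred_depth, true_depth, target_mask):
          """Mean absolute depth error (mm) over the target area only."""
          pred = np.asarray(pred_depth, dtype=np.float64)
          true = np.asarray(true_depth, dtype=np.float64)
          mask = np.asarray(target_mask, dtype=bool)       # True inside the face area
          if not mask.any():
              raise ValueError("target mask is empty")
          return float(np.mean(np.abs(pred[mask] - true[mask])))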
  • an embodiment of the present application further provides a model training apparatus, which includes a module/unit for executing the above method steps.
  • the apparatus 150 includes a processing module 1501 and a training module 1502;
  • The processing module 1501 is used for: loading a virtual camera and a target object model in the three-dimensional cockpit model, the target object model being located within the shooting range of the virtual camera; acquiring a first image output by the virtual camera; generating a label of the first image from the scene information when the virtual camera outputs the first image, the label of the first image being used to describe the identification information of the target object model in the first image; adjusting the environmental parameters of the three-dimensional cockpit model and/or the pose parameters of the target object model; acquiring a second image output by the virtual camera after the adjustment; and generating a label of the second image from the scene information when the virtual camera outputs the second image, the label of the second image being used to describe the identification information of the target object model in the second image.
  • the training module 1502 is used for: training the deep learning model based on the first image and the label of the first image, the second image and the label of the second image; wherein, the trained deep learning model is used to identify the newly input cockpit Identification information of the target object in the image.
  • the target object model is a human body model or a face model.
  • the processing module 1501 is further configured to: scan at least one real target object before loading the target object model in the three-dimensional cockpit model, obtain at least one target object model, and save the at least one target object model to the model library;
  • the processing module 1501 is specifically used for: randomly selecting one or more target object models from the model library, and loading one or more target object models into the 3D cockpit model.
  • When the processing module 1501 loads the virtual camera in the three-dimensional cockpit model, it is specifically used for: applying the shooting parameters of a real camera in a real cockpit to the virtual camera, where the shooting parameters include at least one of resolution, distortion parameters, focal length, field of view, aperture, or exposure time.
  • The environmental parameters of the three-dimensional cockpit model include at least one of the light field, the interior, or the external environment of the three-dimensional cockpit model, where the light field includes at least one of the brightness of the light, the number of light sources, the color of the light sources, or the positions of the light sources.
  • the pose parameters of the target object model include position coordinates and/or Euler angles.
  • If the trained deep learning model is used to identify the depth information of the target object in a newly input cockpit image:
  • When the processing module 1501 generates the label of the first image from the scene information when the virtual camera outputs the first image, it is specifically used for: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the first image, first depth information of the target object model relative to the virtual camera at that moment, and using the first depth information as the label of the first image.
  • When the processing module 1501 generates the label of the second image from the scene information when the virtual camera outputs the second image, it is specifically used for: determining, according to the relative positions of the feature points on the surface of the target object model and the virtual camera when the virtual camera outputs the second image, second depth information of the target object model relative to the virtual camera at that moment, and using the second depth information as the label of the second image.
  • If the trained deep learning model is used to identify the pose information of the target object in a newly input cockpit image:
  • When the processing module 1501 generates the label of the first image from the scene information when the virtual camera outputs the first image, it is specifically used for: determining the pose parameters of the target object model when the virtual camera outputs the first image, and using those pose parameters as the label of the first image.
  • When the processing module 1501 generates the label of the second image from the scene information when the virtual camera outputs the second image, it is specifically used for: determining the pose parameters of the target object model when the virtual camera outputs the second image, and using those pose parameters as the label of the second image.
  • the apparatus 160 includes a scene creation module 1601 , a simulation module 1602 , and a training module 1603 .
  • the scene creation module 1601 is used to establish a three-dimensional cockpit model, and a virtual camera is installed in the three-dimensional cockpit model.
  • the type of the cockpit is not limited in the present application, such as the cockpit of a car, the cockpit of a ship, the cockpit of an airplane, and the like.
  • the shooting parameters of the virtual camera may be default values or the shooting parameters of the real camera, which are not limited here.
  • the simulation module 1602 is used to load the target object model in the 3D cockpit model; by continuously adjusting the environment parameters of the 3D cockpit model and/or the pose parameters of the target object model, output several training images with labels.
  • the type of the target object is not limited in this application, such as human body, human face, and the like.
  • The label carried by a training image needs to be determined according to the output data that the deep learning model needs to map to. For example, if the trained deep learning model is used to determine the depth information of the target object in the input cockpit image to be recognized, the label is the depth information of the target object model in the sample image; if the trained deep learning model is used to determine the pose of the target object in the input cockpit image to be recognized, the label is the pose of the target object model in the sample image.
  • The training module 1603 is similar in function to the above-mentioned training module 1502 and is used to train the deep learning model based on the several labeled images output by the simulation module 1602 to obtain the trained deep learning model.
  • the image is the input of the deep learning model
  • the label is the output of the deep learning model.
  • the apparatus 160 may further include a data acquisition module, configured to scan a real target object to generate a target object model.
  • the apparatus 160 may further include a storage module for storing the synthesized data (ie, the image and the label of the image) output by the simulation module.
  • the apparatus 160 may further include a verification module for verifying the detection accuracy of the trained deep learning model.
  • an embodiment of the present application further provides a computing device system, where the computing device system includes at least one computing device 170 as shown in FIG. 17 .
  • the computing device 170 includes a bus 1701 , a processor 1702 and a memory 1704 .
  • A communication interface 1703 may also be included. In FIG. 17, the dashed box indicates that the communication interface 1703 is optional.
  • the processor 1702 in the at least one computing device 170 executes the computer instructions stored in the memory 1704, and can execute the methods provided in the above method embodiments.
  • the communication between the processor 1702 , the memory 1704 and the communication interface 1703 is through the bus 1701 .
  • the multiple computing devices 170 communicate through a communication path.
  • the processor 1702 may be a CPU.
  • Memory 1704 may include volatile memory, such as random access memory.
  • Memory 1704 may also include non-volatile memory such as read only memory, flash memory, HDD or SSD.
  • the memory 1704 stores executable code that is executed by the processor 1702 to perform any part or all of the aforementioned methods.
  • the memory may also include other software modules required for running processes such as an operating system.
  • the operating system can be LINUX TM , UNIX TM , WINDOWS TM and so on.
  • The memory 1704 may store any one or more modules of the aforementioned apparatus 150 or apparatus 160 (FIG. 17 takes the storage of one or more such modules as an example), and may also include other software modules required for running processes, such as an operating system.
  • the operating system can be LINUX TM , UNIX TM , WINDOWS TM and so on.
  • the multiple computing devices 170 establish communication with each other through a communication network, and each computing device runs any one or any multiple modules of the foregoing apparatus 150 or apparatus 160.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores instructions, and when the instructions are executed, the methods provided in the above method embodiments can be implemented.
  • an embodiment of the present application further provides a chip, which is coupled to a memory, and is used to read and execute program instructions stored in the memory, so as to implement the method provided by the above method embodiments.
  • the embodiments of the present application also provide a computer program product containing instructions.
  • the computer program product stores instructions that, when run on a computer, cause the computer to execute the method provided by the above method embodiments.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for training a deep learning model are provided. The method comprises: first loading a virtual camera and a target object model into a three-dimensional cockpit model, then obtaining a first image output by the virtual camera, and generating a label of the first image according to scene information when the virtual camera outputs the first image; then adjusting environmental parameters of the three-dimensional cockpit model and/or pose parameters of the target object model, obtaining, after the adjustment, a second image output by the virtual camera, and generating a label of the second image according to scene information when the virtual camera outputs the second image; and finally, using the first image, the label of the first image, the second image, and the label of the second image to train the deep learning model, so that the generalization ability of the deep learning model can be effectively improved and the accuracy and effectiveness of the deep learning model are optimized.
PCT/CN2021/075856 2021-02-07 2021-02-07 Procédé et appareil d'entraînement de modèle d'apprentissage profond WO2022165809A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180000179.9A CN112639846A (zh) 2021-02-07 2021-02-07 一种训练深度学习模型的方法和装置
PCT/CN2021/075856 WO2022165809A1 (fr) 2021-02-07 2021-02-07 Procédé et appareil d'entraînement de modèle d'apprentissage profond

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/075856 WO2022165809A1 (fr) 2021-02-07 2021-02-07 Procédé et appareil d'entraînement de modèle d'apprentissage profond

Publications (1)

Publication Number Publication Date
WO2022165809A1 true WO2022165809A1 (fr) 2022-08-11

Family

ID=75297673

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075856 WO2022165809A1 (fr) 2021-02-07 2021-02-07 Procédé et appareil d'entraînement de modèle d'apprentissage profond

Country Status (2)

Country Link
CN (1) CN112639846A (fr)
WO (1) WO2022165809A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273577A (zh) * 2022-09-26 2022-11-01 丽水学院 一种摄影教学的方法及系统
CN115331309A (zh) * 2022-08-19 2022-11-11 北京字跳网络技术有限公司 用于识别人体动作的方法、装置、设备和介质
CN116563246A (zh) * 2023-05-10 2023-08-08 之江实验室 一种用于医学影像辅助诊断的训练样本生成方法及装置

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516778A (zh) * 2021-04-14 2021-10-19 武汉联影智融医疗科技有限公司 模型训练数据获取方法、装置、计算机设备和存储介质
CN113240611B (zh) * 2021-05-28 2024-05-07 中建材信息技术股份有限公司 一种基于图片序列的异物检测方法
CN113362388A (zh) * 2021-06-03 2021-09-07 安徽芯纪元科技有限公司 一种用于目标定位和姿态估计的深度学习模型
CN114332224A (zh) * 2021-12-29 2022-04-12 北京字节跳动网络技术有限公司 3d目标检测样本的生成方法、装置、设备及存储介质
CN115223002B (zh) * 2022-05-09 2024-01-09 广州汽车集团股份有限公司 模型训练方法、开门动作检测方法、装置以及计算机设备
WO2023220916A1 (fr) * 2022-05-17 2023-11-23 华为技术有限公司 Procédé de positionnement de pièce et appareil
CN115578515B (zh) * 2022-09-30 2023-08-11 北京百度网讯科技有限公司 三维重建模型的训练方法、三维场景渲染方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428388A (zh) * 2019-07-11 2019-11-08 阿里巴巴集团控股有限公司 一种图像数据生成方法及装置
US20200160114A1 (en) * 2017-07-25 2020-05-21 Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Method for generating training data, image semantic segmentation method and electronic device
CN112132213A (zh) * 2020-09-23 2020-12-25 创新奇智(南京)科技有限公司 样本图像的处理方法及装置、电子设备、存储介质
CN112232293A (zh) * 2020-11-09 2021-01-15 腾讯科技(深圳)有限公司 图像处理模型训练、图像处理方法及相关设备
CN112308103A (zh) * 2019-08-02 2021-02-02 杭州海康威视数字技术股份有限公司 生成训练样本的方法和装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11257272B2 (en) * 2019-04-25 2022-02-22 Lucid VR, Inc. Generating synthetic image data for machine learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160114A1 (en) * 2017-07-25 2020-05-21 Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Method for generating training data, image semantic segmentation method and electronic device
CN110428388A (zh) * 2019-07-11 2019-11-08 阿里巴巴集团控股有限公司 一种图像数据生成方法及装置
CN112308103A (zh) * 2019-08-02 2021-02-02 杭州海康威视数字技术股份有限公司 生成训练样本的方法和装置
CN112132213A (zh) * 2020-09-23 2020-12-25 创新奇智(南京)科技有限公司 样本图像的处理方法及装置、电子设备、存储介质
CN112232293A (zh) * 2020-11-09 2021-01-15 腾讯科技(深圳)有限公司 图像处理模型训练、图像处理方法及相关设备

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331309A (zh) * 2022-08-19 2022-11-11 北京字跳网络技术有限公司 用于识别人体动作的方法、装置、设备和介质
CN115273577A (zh) * 2022-09-26 2022-11-01 丽水学院 一种摄影教学的方法及系统
CN116563246A (zh) * 2023-05-10 2023-08-08 之江实验室 一种用于医学影像辅助诊断的训练样本生成方法及装置
CN116563246B (zh) * 2023-05-10 2024-01-30 之江实验室 一种用于医学影像辅助诊断的训练样本生成方法及装置

Also Published As

Publication number Publication date
CN112639846A (zh) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2022165809A1 (fr) Procédé et appareil d'entraînement de modèle d'apprentissage profond
Sahu et al. Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review
US11373332B2 (en) Point-based object localization from images
Sakaridis et al. Semantic foggy scene understanding with synthetic data
US20180012411A1 (en) Augmented Reality Methods and Devices
US10019652B2 (en) Generating a virtual world to assess real-world video analysis performance
US11288857B2 (en) Neural rerendering from 3D models
CN114972617B (zh) 一种基于可导渲染的场景光照与反射建模方法
CN101422035A (zh) 光源推定装置、光源推定系统与光源推定方法以及图像高分辨率化装置与图像高分辨率化方法
Bešić et al. Dynamic object removal and spatio-temporal RGB-D inpainting via geometry-aware adversarial learning
CN112365604A (zh) 基于语义分割和slam的ar设备景深信息应用方法
CN115226406A (zh) 图像生成装置、图像生成方法、记录介质生成方法、学习模型生成装置、学习模型生成方法、学习模型、数据处理装置、数据处理方法、推断方法、电子装置、生成方法、程序和非暂时性计算机可读介质
WO2022100419A1 (fr) Procédé de traitement d'images et dispositif associé
Du et al. Video fields: fusing multiple surveillance videos into a dynamic virtual environment
CN112651881A (zh) 图像合成方法、装置、设备、存储介质以及程序产品
WO2022052782A1 (fr) Procédé de traitement d'image et dispositif associé
WO2022165722A1 (fr) Procédé, appareil et dispositif d'estimation de profondeur monoculaire
Goncalves et al. Deepdive: An end-to-end dehazing method using deep learning
CN112308977A (zh) 视频处理方法、视频处理装置和存储介质
US20220343639A1 (en) Object re-identification using pose part based models
CN115008454A (zh) 一种基于多帧伪标签数据增强的机器人在线手眼标定方法
CN114494395A (zh) 基于平面先验的深度图生成方法、装置、设备及存储介质
WO2021151380A1 (fr) Procédé de rendu d'objet virtuel sur la base d'une estimation d'éclairage, procédé de formation de réseau neuronal et produits associés
CN116012805B (zh) 目标感知方法、装置、计算机设备、存储介质
Bai et al. Cyber mobility mirror for enabling cooperative driving automation: A co-simulation platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21923820

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21923820

Country of ref document: EP

Kind code of ref document: A1