CN107909114B - Method and apparatus for training supervised machine learning models - Google Patents

Method and apparatus for training supervised machine learning models

Info

Publication number
CN107909114B
Authority
CN
China
Prior art keywords
target object
data
model
artificial
artificial image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711236484.XA
Other languages
Chinese (zh)
Other versions
CN107909114A (en)
Inventor
颜沁睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Horizon Robotics Science and Technology Co Ltd
Original Assignee
Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2017-11-30
Publication date: 2020-07-17
Application filed by Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority to CN201711236484.XA
Publication of CN107909114A
Application granted
Publication of CN107909114B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method and apparatus for training a supervised machine learning model are disclosed. The method comprises the following steps: generating an artificial image containing a target object; recording annotation data related to the target object in the process of generating the artificial image; performing an operation in the model using the artificial image as input data of the model to obtain derived data relating to the target object; and comparing the derived data with the annotation data to determine whether to adjust parameters of the model. The method avoids the large amount of manual labeling otherwise required when training the model.

Description

Method and apparatus for training supervised machine learning models
Technical Field
The present disclosure relates generally to the technical field of supervised machine learning models, and in particular to methods and apparatus for training supervised machine learning models.
Background
Supervised machine learning generally requires training a model with a large number of training samples; whether and how to adjust the parameters of the model is determined from a comparison between an expected result and the derivation result the model obtains from the training samples, so that the model can generalize well to data other than the training samples (for example, actual application data). Models for supervised machine learning may include, for example, artificial neural networks (e.g., convolutional neural networks), decision trees, and the like.
Many different training sample sets or training sample libraries have been provided. Before using such a sample set or library to train a supervised machine learning model, however, the model's designers must manually label the target objects in a large (or even massive) number of samples to provide labeling data related to those objects (e.g., their type, size, and location). As a result, training is costly, while labeling accuracy and efficiency remain low.
Disclosure of Invention
In one aspect, a method for training a supervised machine learning model is provided. The method can comprise the following steps: generating an artificial image containing the target object; recording annotation data related to the target object in the process of generating the artificial image; performing an operation in the model using the artificial image as input data of the model to obtain derived data relating to the target object; and comparing the derived data and the annotated data to determine whether to adjust parameters of the model.
In another aspect, an apparatus for training a supervised machine learning model is also provided. The apparatus may include: a rendering engine configured to generate an artificial image containing a target object and to record annotation data related to the target object in a process of generating the artificial image; an operator configured to perform an operation in the model using the artificial image as input data of the model to obtain derived data relating to the target object; and an adjuster configured to compare the derived data and the annotated data to determine whether to adjust a parameter of the model.
In another aspect, an apparatus for training a supervised machine learning model is also provided. The apparatus may include a processor configured to perform the method described above.
In another aspect, a non-transitory storage medium having stored thereon program instructions that, when executed by a computing device, perform the above-described method is also provided.
According to the methods and apparatus disclosed in embodiments of the present disclosure, the manual labeling otherwise required in supervised machine learning training can be omitted, which reduces cost, improves labeling accuracy, and improves training efficiency.
Drawings
Fig. 1 illustrates a flow diagram of an example method for training a model for supervised machine learning in accordance with an embodiment of the present disclosure.
Fig. 2 illustrates an example of training a supervised machine learning model in accordance with an embodiment of the present disclosure.
Fig. 3 illustrates a block diagram of an example apparatus for training a model for supervised machine learning in accordance with an embodiment of the present disclosure.
Fig. 4 illustrates a block diagram of an example apparatus for training a model for supervised machine learning in accordance with an embodiment of the present disclosure.
Detailed Description
Fig. 1 illustrates a flow diagram of an example method for training a model for supervised machine learning in accordance with an embodiment of the present disclosure. As shown in fig. 1, an example method 100 in accordance with embodiments of the present disclosure may include: step S101, generating an artificial image containing a target object; step S105, recording annotation data related to the target object in the process of generating the artificial image; a step S110 of performing an operation in the model using the artificial image as input data of the model to obtain derived data about the target object; and step S115, comparing the derived data and the labeled data to determine whether to adjust the parameters of the model.
The example method 100 is described in detail below in connection with the example of FIG. 2.
The example method 100 may begin at step S101 to generate an artificial image containing a target object.
In one embodiment, as shown in FIG. 2, a connection may be made to a repository 200, and one or more elements may be retrieved from the repository 200.
The repository 200 may include various elements for generating artificial images. For example, it may include images, pictures, or animations representing various shapes of the various parts of a "person", such as a head, arms, hands, fingers, a trunk, legs, feet, eyes, ears, a nose, a mouth, hair, a beard, eyebrows, clothes, gloves, a helmet, a hat, and so on. It may also include images, pictures, or animations representing various shapes of various tools, such as a sword, a wrench, or a stick, as well as images, pictures, or animations representing various entities, such as animals, plants, vehicles, buildings, natural landscapes, and celestial objects, and the various shapes of their parts. The images, pictures, or animations included in the repository 200 may be one-dimensional, two-dimensional, three-dimensional, and/or of higher dimensions. The repository 200 may also include audio, text, and other elements.
Methods according to embodiments of the present disclosure are not limited by the number, type, or organization (or storage) form of the elements included in the repository 200, nor by the form of the repository 200 or the manner in which it is connected to or accessed.
Then, in step S101, the one or more acquired elements may be combined, and the aggregate of the combined elements rendered (e.g., by 2D or 3D rendering), thereby generating an artificial scene 205 such as the one shown in Fig. 2.
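For illustration, the following minimal sketch shows how such a composition step might look in code. It is only a toy example under simplifying assumptions: the repository is a small dictionary of random RGBA patches, and names such as `repository`, `compose_scene`, and the placement coordinates are invented for the sketch rather than taken from the disclosure.

```python
# Toy sketch of step S101: composing repository elements into an artificial scene.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "repository": each element is a small RGBA patch (H, W, 4).
repository = {
    "head":  rng.integers(0, 255, (8, 8, 4), dtype=np.uint8),
    "torso": rng.integers(0, 255, (16, 8, 4), dtype=np.uint8),
    "sword": rng.integers(0, 255, (20, 3, 4), dtype=np.uint8),
}

def compose_scene(element_names, placements, size=(64, 64)):
    """Paste the selected elements onto a blank canvas at chosen positions.

    Because the positions are chosen here, they can be recorded as
    annotation data at the same time (step S105).
    """
    scene = np.zeros((*size, 3), dtype=np.uint8)
    annotations = []
    for name, (top, left) in zip(element_names, placements):
        patch = repository[name]
        h, w, _ = patch.shape
        alpha = patch[..., 3:4] / 255.0
        background = scene[top:top + h, left:left + w]
        scene[top:top + h, left:left + w] = (
            alpha * patch[..., :3] + (1.0 - alpha) * background
        ).astype(np.uint8)
        annotations.append({"element": name, "top": top, "left": left,
                            "height": h, "width": w})
    return scene, annotations

scene, annotations = compose_scene(
    ["torso", "head", "sword"], [(24, 28), (16, 28), (20, 40)])
print(annotations)   # placements are known exactly; no manual labeling needed
```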
Additionally, the sword-holding person in the artificial scene 205 may be taken as one target object 210. In further examples, any one or more entities in the scene may be treated as one or more target objects. For example, the sword in the person's hand may be used as a target object, a cloud or a wall in the background may each be used as a target object, or the set consisting of the cloud, the wall, the sword-holding person, and the sword may be treated as a single target object.
In the example of fig. 2, an artificial image 220 may also be generated by fish-eye lens projection of the generated artificial scene 205. In further examples, other types of projection may also be performed on the artificial scene 205, such as wide-angle lens projection, standard lens projection, telephoto lens projection, and so forth, and multiple types of projection may be used. Alternatively, the artificial scene 205 may be used directly as the artificial image 220 without projecting the artificial scene 205.
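The disclosure does not fix a particular projection model, so the sketch below uses the common equidistant fisheye mapping (r = f·θ) purely as an illustration; the focal length `f`, the principal point `(cx, cy)`, and the sample points are assumed values.

```python
# Illustrative equidistant fisheye projection of 3D scene points (assumed model).
import numpy as np

def fisheye_project(points_xyz, f=100.0, cx=128.0, cy=128.0):
    """Project Nx3 camera-frame points (z forward) to pixel coordinates."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    theta = np.arctan2(np.hypot(x, y), z)   # angle from the optical axis
    phi = np.arctan2(y, x)                  # azimuth around the axis
    r = f * theta                           # equidistant mapping r = f * theta
    return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=1)

# Corner points of a target object before projection ...
corners = np.array([[0.5, -0.2, 2.0], [0.5, 0.2, 2.0], [0.7, 0.0, 2.5]])
pixels = fisheye_project(corners)
# ... so the annotation (e.g. a bounding box) can be updated to the projected
# image coordinates, as described for step S105.
print(pixels.min(axis=0), pixels.max(axis=0))
```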
While the artificial scene 205 or the artificial image 220 is being generated in step S101, step S105 is performed simultaneously, so that the annotation data 215 relating to the target object 210 is recorded automatically during the generation. The annotation data defines various attribute values for the individual elements or element aggregates used to generate the artificial image, or for individual entities in the artificial image.
For example, in the example of Fig. 2, the elements selected from the repository 200 for generating the target object 210 may include elements 201 (a human head facing due east), 202 (outstretched human arms), 203 (human legs and feet), and 204 (a sword). While these elements and the other selected elements are combined and rendered into the artificial scene 205 (step S101), it may be determined from the attributes of elements 201 to 204 that the type of the target object 210 is "person", its orientation is "east", and the item held in its hand is a "sword" (step S105). In addition, the coordinates of the target object 210 may be determined from the position at which element 203 is placed in the artificial scene 205, the elevation angle of the target object 210 from the pose angle of element 201, 202, or 204, and so on, and these values may likewise be recorded in step S105. Similarly, while the artificial scene 205 is projected into the artificial image 220 (step S101), the changes in the individual annotation data caused by the projection, such as changes in size, angle, and position, can also be determined and recorded (step S105).
That is, the artificial scene 205 or the artificial image 220 (including the target object 210 in the artificial scene 205 or the artificial image 220) is generated by combining and rendering the selected elements according to the various annotation data of the selected elements. Meanwhile, annotation data associated with each element and/or each collection of elements used to generate the artificial scene 205 or artificial image 220 may be recorded and/or determined during the generation of the artificial scene 205 or artificial image 220.
In the example of FIG. 2, the annotation data 215 can include information about the target object 210 such as its type (human), location coordinates, orientation, elevation angle, state, and hand-held item. In further examples, the annotation data 215 may also include information such as the shape, color (e.g., the color of the clothing of the target object 210 in Fig. 2), and size (e.g., the height of the target object 210 in Fig. 2) of the target object. In addition, in step S105, annotation data may also be generated for each element and/or combination of elements used to generate the artificial scene 205 or the artificial image 220.
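As a rough illustration of how such annotation records might be held in code, the following sketch defines a simple data structure; the field names and example values are assumptions chosen for the sketch, not a format prescribed by the disclosure.

```python
# Illustrative annotation record (field names are assumptions).
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class Annotation:
    object_type: str                      # e.g. "person"
    position: Tuple[float, float]         # scene or image coordinates
    orientation: Optional[str] = None     # e.g. "east"
    elevation_deg: Optional[float] = None
    held_item: Optional[str] = None
    color: Optional[str] = None
    size: Optional[float] = None

# Recorded automatically while the scene is generated (step S105).
ann = Annotation(object_type="person", position=(120.0, 340.0),
                 orientation="east", elevation_deg=0.0, held_item="sword")
print(asdict(ann))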
Through steps S101 and S105, the artificial scene 205 or artificial image 220 containing the target object 210 and the annotation data 215 related to the target object 210 are obtained simultaneously, without any further manual annotation of the target object 210 in the generated artificial scene 205 or artificial image 220.
The example method 100 then continues to step S110. As shown in FIG. 2, the generated artificial image 220 may be used as an input to a supervised machine learning model 225 and provided to the model 225. The model 225 performs operations and outputs derived data 230 relating to the target object 210.
In one embodiment, the artificial image 220 may be provided directly to the model 225, or a data set capable of representing the artificial image 220 may be provided to the model 225 (e.g., in the case where the artificial image 220 is a 3D image, a set of 3D points may be provided to the model 225). In further embodiments, other information (e.g., audio, location coordinates, etc.) related to the artificial image 220 may also be provided to the model 225. The present disclosure is not limited to a particular type, implementation, and task (e.g., recognition, prediction, 3D reconstruction) of the model 225, nor to a particular format or form of data received by the model.
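The following toy sketch illustrates step S110 with a deliberately simple stand-in model: the artificial image is flattened into a feature vector and passed through random linear weights to produce a class distribution and an elevation estimate as the derived data. The model structure, names, and sizes are assumptions for illustration only; the disclosure permits any supervised model.

```python
# Toy stand-in for step S110: deriving data from the artificial image.
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((16, 16, 3))            # stand-in for artificial image 220
x = image.reshape(-1)                      # a simple input representation

n_classes = 3
W_cls = rng.normal(scale=0.01, size=(n_classes, x.size))   # toy classifier weights
w_elev = rng.normal(scale=0.01, size=x.size)               # toy regression weights

logits = W_cls @ x
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax over object types

derived_data = {
    "type_probs": probs,                   # e.g. person / vehicle / other
    "elevation_deg": float(w_elev @ x),
}
print(derived_data["type_probs"].round(3), derived_data["elevation_deg"])
```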
The example method may then continue to step S115 to compare the annotation data 215 and the derived data 230. In one embodiment, the annotation data 215 and the derived data 230 can be compared to determine whether the two are the same. For example, the "type" in the annotation data 215 may be compared with the "type" in the derived data 230. In another embodiment, the annotation data 215 and the derived data 230 can be compared to determine whether the difference between them exceeds a threshold. For example, it may be checked whether the difference between the "elevation angle" in the annotation data 215 and the "elevation angle" in the derived data 230 exceeds a threshold. The threshold may be specified by the designer of the supervised machine learning model 225 when designing the model 225.
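A minimal sketch of this comparison step, assuming one categorical field compared for equality and one numeric field compared against a designer-chosen threshold; the field names and the threshold value of 5 degrees are illustrative assumptions.

```python
# Sketch of step S115: decide whether the model parameters need adjustment.
def needs_adjustment(annotation, derived, elevation_threshold=5.0):
    type_ok = derived["type"] == annotation["type"]               # exact match
    elevation_ok = abs(derived["elevation_deg"]
                       - annotation["elevation_deg"]) <= elevation_threshold
    return not (type_ok and elevation_ok)

annotation = {"type": "person", "elevation_deg": 0.0}
derived = {"type": "person", "elevation_deg": 7.5}
print(needs_adjustment(annotation, derived))  # True: elevation error exceeds 5 deg
```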
In the case where it is determined from the comparison that the parameters of the model 225 need to be adjusted, the parameters of the model 225 may be adjusted and steps S110 and S115 repeated until the output of the model 225 meets the expected requirements. In one embodiment, different numbers of artificial images may be generated in step S101, and different error comparison methods, parameter adjustment methods, and expectation conditions may be employed in steps S110 and S115 depending on the type of model and the expected goal of training. For example, for a neural network, a plurality (e.g., a large number) of artificial images may be generated in step S101, and the parameters may be adjusted in steps S110 and S115 using, for example, a back-propagation algorithm, which moves the parameters along the negative gradient (the partial derivatives of the error function with respect to the parameters) so that the error function ultimately decreases to an acceptable range.
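The sketch below illustrates this adjust-and-repeat loop with a linear model trained by plain gradient descent on synthetic data; it stands in for the general idea only, since a real neural network would obtain its gradients via back-propagation. All data, sizes, the learning rate, and the acceptable-error value are assumptions.

```python
# Sketch of the adjust-and-repeat loop: update parameters until the error
# function falls into an acceptable range.
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((200, 8))                 # features of artificial images (toy)
true_w = rng.normal(size=8)
y = X @ true_w                           # annotation data (e.g. elevation values)

w = np.zeros(8)                          # model parameters to be trained
lr, acceptable_error = 0.1, 1e-4
for step in range(10_000):
    pred = X @ w                         # derived data (step S110)
    err = pred - y
    loss = float(np.mean(err ** 2))      # error function
    if loss < acceptable_error:          # expected requirement met
        break
    grad = 2 * X.T @ err / len(X)        # gradient of the error w.r.t. parameters
    w -= lr * grad                       # adjust parameters (step S115)
print(step, loss)
```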
In a training method (e.g., the example method 100) according to an embodiment of the present disclosure, annotation data is generated simultaneously during the generation of an artificial scene or an artificial image, so that additional manual annotation is not necessary, which is beneficial to reducing the training cost and improving the training efficiency.
In addition, the samples in a typical training sample set or training sample library are often actual acquisitions of typical data from typical applications, such as photographs, sounds, or text captured with a camera, recorder, or similar device for a specific group of people, a specific setting, or a specific application. Using such samples may limit the model, or its training, to that particular population, situation, or application, or to the particular training sample set or library used. Moreover, the accuracy and reliability of the training results also depend on how the samples in the training sample set or library were labeled, or on reference data provided by its provider. For example, a trained model may perform well on the training sample set or library used, but show large errors on cases outside those samples.
In contrast, in a training method according to an embodiment of the present disclosure, the generated artificial scene or artificial image is used for training, and the associated annotation data is necessarily accurate and reliable, since the artificial scene or artificial image is generated from that annotation data. A training method according to an embodiment of the disclosure therefore avoids having the training result constrained by the samples of a particular training sample set or library, which helps improve the accuracy and reliability of training.
Fig. 3 and 4 illustrate block diagrams of example apparatus for training a supervised machine learning model in accordance with embodiments of the present disclosure.
As shown in Fig. 3, an example apparatus 300 may include a rendering engine 301, an operator 305, and an adjuster 310.
The rendering engine 301 may be configured to generate an artificial scene or artificial image containing the target object and to record annotation data related to the target object while the artificial scene or artificial image is being generated. In one embodiment, the rendering engine 301 may include one or more graphics processing units (GPUs).
The rendering engine 301 may also be configured to generate an artificial scene including the target object by combining and rendering one or more elements from the repository, and to generate an artificial image by applying one or more projections to the artificial scene. In one embodiment, the rendering engine 301 may include one or more cameras that capture the artificial scene in one or more projection modes, for example wide-angle lens projection, standard lens projection, fisheye lens projection, or telephoto lens projection, to generate the artificial image. In further embodiments, the rendering engine 301 may directly transform the artificial scene, in hardware or software, into an artificial image corresponding to the result of projecting it with one or more projection modes.
Additionally, the rendering engine 301 may include an I/O interface (not shown) and a buffer memory to receive one or more elements for generating artificial scenes from the repository 200 and to buffer the received elements and/or the generated artificial images/artificial scenes and/or intermediate results.
In one embodiment, the rendering engine 301 may be configured to perform steps S101 and S105 of the example method 100 shown in Fig. 1, for example.
The operator 305 may be configured to perform operations in the model using the artificial image as input data to the model to obtain derived data relating to the target object. In one embodiment, the operator 305 may include a general-purpose central processing unit (CPU) or a model-specific hardware accelerator (e.g., a multiply-accumulator in the case of a convolutional neural network). In one embodiment, the operator 305 may be configured to perform step S110 of the example method 100 shown in Fig. 1, for example.
The adjuster 310 may be configured to compare the derived data and the annotation data to determine whether to adjust parameters of the model. In one embodiment, the adjuster 310 may include a general-purpose central processing unit (CPU) and/or a comparator (not shown). In addition, the adjuster 310 may also include an I/O interface (not shown) to receive the adjusted model parameters. In one embodiment, the adjuster 310 may be configured to perform step S115 of the example method 100 shown in Fig. 1, for example.
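To make the division of labor among the rendering engine, operator, and adjuster concrete, the following structural sketch wires three toy classes together around a one-parameter "model". The class and method names, the scalar "image", and the update rule are all illustrative assumptions and not the apparatus itself.

```python
# Structural sketch of the apparatus of Fig. 3 (all names are assumptions).
import random

class RenderingEngine:
    def generate(self):
        # Steps S101/S105: the "image" is a brightness value chosen here,
        # so its annotation (the same value) is known without manual labeling.
        brightness = random.uniform(0.0, 1.0)
        return brightness, brightness

class Operator:
    def __init__(self, model):
        self.model = model
    def derive(self, image):
        # Step S110: run the model on the artificial image.
        return self.model["gain"] * image

class Adjuster:
    def __init__(self, tolerance=1e-3, lr=0.5):
        self.tolerance, self.lr = tolerance, lr
    def step(self, model, image, derived, annotated):
        # Step S115: compare derived and annotation data; adjust if needed.
        error = derived - annotated
        if abs(error) <= self.tolerance:
            return False
        model["gain"] -= self.lr * error * image   # gradient-style update
        return True

model = {"gain": 0.0}
engine, operator, adjuster = RenderingEngine(), Operator(model), Adjuster()
for _ in range(1000):
    image, annotation = engine.generate()
    derived = operator.derive(image)
    adjuster.step(model, image, derived, annotated=annotation)
print(round(model["gain"], 3))   # approaches 1.0
```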
As shown in fig. 4, an example apparatus 400 may include one or more processors 401, memory 405, and I/O interfaces 410.
The processor 401 may be any form of processing unit having data processing capabilities and/or instruction execution capabilities, such as a general purpose CPU, GPU, dedicated accelerator, or the like. For example, the processor 401 may perform a method according to an embodiment of the present disclosure. In addition, the processor 401 may also control other components in the apparatus 400 to perform desired functions. Processor 401 may be connected to memory 405 and I/O interface 410 by a bus system and/or other form of connection mechanism (not shown).
The memory 405 may include various forms of computer readable and writable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disks, flash memory, and the like. The readable and writable storage media may include, for example, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any combination of the foregoing. For example, when used with a dedicated neural network processor, the memory 405 may be RAM on the chip carrying that processor. The memory 405 may include program instructions for instructing the apparatus 400 to perform methods according to embodiments of the present disclosure.
The I/O interface 410 may be used to provide parameters or data to the processor 401 and to output the data resulting from processing by the processor 401. In addition, the I/O interface 410 may also be coupled to the repository 200 to receive one or more elements for generating artificial scenes or artificial images.
It should be understood that the devices 300 and 400 shown in fig. 3 and 4 are exemplary only, and not limiting. Devices according to embodiments of the present disclosure may have other components and/or structures.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense, that is, in a sense of "including but not limited to". Additionally, the words "herein," "above," "below," and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of this application. Where the context permits, words in the above description using the singular or plural number may also include the plural or singular number respectively. With respect to the word "or" when referring to a list of two or more items, the word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
While certain embodiments of the present disclosure have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the present disclosure. Indeed, the methods and systems described herein may be embodied in a variety of other forms. In addition, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the scope of the disclosure.

Claims (12)

1. A method for training a supervised machine learning model, comprising:
generating an artificial image containing the target object;
recording annotation data relating to the target object in generating the artificial image, the artificial image being generated based on the annotation data;
performing an operation in the model using the artificial image as input data to the model to obtain derived data relating to the target object; and
comparing the derived data and the annotation data to determine whether to adjust parameters of the model,
wherein generating the artificial image comprises:
generating an artificial scene including the target object by combining and rendering one or more elements in a repository; and
generating the artificial image by performing one or more projections of the artificial scene.
2. The method of claim 1, wherein the one or more projections comprise one or more of wide-angle lens projections, standard lens projections, fisheye lens projections, and telephoto lens projections.
3. The method of claim 1, wherein the annotation data comprises one or more of a type of the target object, a shape of the target object, a color of the target object, a size of the target object, a location of the target object, and an orientation of the target object.
4. The method of claim 1, wherein comparing the derived data and the annotated data comprises:
determining whether the derived data and the annotation data are the same.
5. The method of claim 1, wherein comparing the derived data and the annotated data comprises:
determining whether a difference between the derived data and the annotation data exceeds a threshold.
6. An apparatus for training a supervised machine learning model, comprising:
a rendering engine configured to generate an artificial image containing a target object and to record annotation data relating to the target object in generating the artificial image, the artificial image being generated based on the annotation data;
an operator configured to perform an operation in the model using the artificial image as input data of the model to obtain derived data relating to the target object; and
an adjuster configured to compare the derived data and the annotation data to determine whether to adjust a parameter of the model,
wherein the rendering engine is further configured to generate an artificial scene comprising the target object by combining and rendering one or more elements in a repository, and to generate the artificial image by one or more projections of the artificial scene.
7. The apparatus of claim 6, wherein the one or more projections comprise one or more of a wide-angle lens projection, a standard lens projection, a fisheye lens projection, and a telephoto lens projection.
8. The apparatus of claim 6, wherein the annotation data comprises one or more of a type of the target object, a shape of the target object, a color of the target object, a size of the target object, a location of the target object, and an orientation of the target object.
9. The apparatus of claim 6, wherein the adjuster is configured to determine whether the derived data and the annotation data are the same.
10. The apparatus of claim 6, wherein the adjuster is configured to determine whether a difference between the derived data and the annotation data exceeds a specified threshold.
11. An apparatus for training a supervised machine learning model, comprising:
a processor configured to perform the method of any of claims 1 to 5.
12. A non-transitory storage medium having stored thereon program instructions that, when executed by a computing device, perform the method of any of claims 1-5.
CN201711236484.XA 2017-11-30 2017-11-30 Method and apparatus for training supervised machine learning models Active CN107909114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711236484.XA CN107909114B (en) 2017-11-30 2017-11-30 Method and apparatus for training supervised machine learning models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711236484.XA CN107909114B (en) 2017-11-30 2017-11-30 Method and apparatus for training supervised machine learning models

Publications (2)

Publication Number Publication Date
CN107909114A CN107909114A (en) 2018-04-13
CN107909114B (en) 2020-07-17

Family

ID=61848114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711236484.XA Active CN107909114B (en) 2017-11-30 2017-11-30 Method and apparatus for training supervised machine learning models

Country Status (1)

Country Link
CN (1) CN107909114B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063155B (en) * 2018-08-10 2020-08-04 广州锋网信息科技有限公司 Language model parameter determination method and device and computer equipment
CN110866138A (en) * 2018-08-17 2020-03-06 京东数字科技控股有限公司 Background generation method and system, computer system, and computer-readable storage medium
CN112640037A (en) * 2018-09-03 2021-04-09 首选网络株式会社 Learning device, inference device, learning model generation method, and inference method
CN109447240B (en) * 2018-09-28 2021-07-02 深兰科技(上海)有限公司 Training method of graphic image replication model, storage medium and computing device
CN110750694A (en) * 2019-09-29 2020-02-04 支付宝(杭州)信息技术有限公司 Data annotation implementation method and device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8918344B2 (en) * 2011-05-11 2014-12-23 Ari M. Frank Habituation-compensated library of affective response

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605667A (en) * 2013-10-28 2014-02-26 China Jiliang University Automatic image annotation algorithm
CN105631479A (en) * 2015-12-30 2016-06-01 Institute of Automation, Chinese Academy of Sciences Imbalance-learning-based depth convolution network image marking method and apparatus

Also Published As

Publication number Publication date
CN107909114A (en) 2018-04-13

Similar Documents

Publication Publication Date Title
CN107909114B (en) Method and apparatus for training supervised machine learning models
JP7286684B2 (en) Face-based special effects generation method, apparatus and electronics
US10559062B2 (en) Method for automatic facial impression transformation, recording medium and device for performing the method
JP7373554B2 (en) Cross-domain image transformation
US10949649B2 (en) Real-time tracking of facial features in unconstrained video
JP6798183B2 (en) Image analyzer, image analysis method and program
CN112884881B (en) Three-dimensional face model reconstruction method and device, electronic equipment and storage medium
US20180012411A1 (en) Augmented Reality Methods and Devices
CN112819944B (en) Three-dimensional human body model reconstruction method and device, electronic equipment and storage medium
CN107862387B (en) Method and apparatus for training supervised machine learning models
JP6074272B2 (en) Image processing apparatus and image processing method
US20190197709A1 (en) Graphical coordinate system transform for video frames
US10062013B2 (en) Method of image processing
JP2016099982A (en) Behavior recognition device, behaviour learning device, method, and program
CA3137297C (en) Adaptive convolutions in neural networks
US20200357157A1 (en) A method of generating training data
KR20220160066A (en) Image processing method and apparatus
KR20230162107A (en) Facial synthesis for head rotations in augmented reality content
CN115346262A (en) Method, device and equipment for determining expression driving parameters and storage medium
KR102160955B1 (en) Method and apparatus of generating 3d data based on deep learning
CN111192350A (en) Motion capture system and method based on 5G communication VR helmet
CN114612976A (en) Key point detection method and device, computer readable medium and electronic equipment
KR20200112191A (en) System and method for generating 3d object by mapping 3d texture to 2d object in video automatically
US20240135616A1 (en) Automated system for generation of facial animation rigs
US20230079478A1 (en) Face mesh deformation with detailed wrinkles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant