CN111783948A - Model training method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111783948A
Authority
CN
China
Prior art keywords
network model
target
output result
key point
loss value
Prior art date
Legal status
Withdrawn
Application number
CN202010587794.1A
Other languages
Chinese (zh)
Inventor
孟庆月
赵晨
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010587794.1A
Publication of CN111783948A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The application discloses a model training method and device, electronic equipment and a storage medium, and relates to the technical field of deep learning. The specific implementation scheme is as follows: acquiring a target image and annotation data of key points of the target image; inputting a target image into a first network model to obtain a first output result; inputting the first output result into a second network model to obtain a second output result; determining a target loss value based on the annotation data, the first output result and the second output result; and adjusting parameters of the first network model and the second network model according to the target loss value to obtain a target network model. And obtaining a target loss value according to the labeling information of the key points in the target image, the output result of the first network model and the output result of the second network model, and adjusting the parameters of the first network model and the second network model according to the target loss value, so that the accuracy of predicting the key points by the target network model can be improved.

Description

Model training method and device, electronic equipment and storage medium
Technical Field
The present application relates to deep learning technologies in the field of computer technologies, and in particular, to a model training method, a model training apparatus, and an electronic device.
Background
Human body key point deep learning technology refers to inputting a picture containing a human body and outputting the human body key points through the inference of a deep learning model. With the development of internet technology and the popularization of human-computer interaction applications, accurate and reasonable human body key point technology has increasing application value; for example, it is applied in fields such as motion sensing games, human behavior analysis, and virtual image driving, and has also made good application progress in areas such as children's education and live broadcast special effects. At present, human body key points in an image are obtained by finding all key points of the human body in the input picture through a deep learning model, and then combining the key points to obtain the key points of each person in the image.
Disclosure of Invention
The disclosure provides a model training method, a model training device, an electronic device and a storage medium.
According to a first aspect of the present disclosure, there is provided a model training method, comprising:
acquiring a target image and annotation data of key points of the target image;
inputting the target image into a first network model to obtain a first output result;
inputting the first output result into a second network model to obtain a second output result;
determining a target loss value based on the annotation data, the first output result, and the second output result;
and adjusting parameters of the first network model and the second network model according to a target loss value to obtain a target network model, wherein the target network model comprises the first network model and the second network model after the parameters are adjusted.
According to a second aspect of the present disclosure, there is provided a model training apparatus comprising:
the first acquisition module is used for acquiring a target image;
the second acquisition module is used for acquiring annotation data of key points of the target image;
the third acquisition module is used for inputting the target image into the first network model to obtain a first output result;
the fourth obtaining module is used for inputting the first output result into a second network model to obtain a second output result;
a determining module, configured to determine a target loss value based on the annotation data, the first output result, and the second output result;
and a fifth obtaining module, configured to adjust parameters of the first network model and the second network model according to a target loss value, so as to obtain a target network model, where the target network model includes the first network model and the second network model after the parameters are adjusted.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of the first aspects.
The method and the device solve the problem of low accuracy of predicted key point positions when obtaining key points from pictures: a target loss value is obtained from the annotation information of the key points in the target image, the output result of the first network model, and the output result of the second network model, and the parameters of the first network model and the second network model are adjusted according to the target loss value, which improves the accuracy with which the target network model predicts key points.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of a model training method provided by an embodiment of the present application;
FIG. 2 is another flow chart of a model training method provided by an embodiment of the present application;
FIG. 3 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an electronic device for implementing a model training method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Referring to fig. 1, fig. 1 is a flowchart of a model training method provided in an embodiment of the present application, and as shown in fig. 1, the embodiment provides a model training method applied to an electronic device, including the following steps:
step 101, acquiring a target image and annotation data of key points of the target image.
The target image may be a whole-body image of a human body, such as a front, back, or side whole-body image. Relevant key points are annotated on the target image, which may be done by manual annotation. The target image may also include an image in which a body part is occluded; for example, a whole-body image of a human body may be processed so that part of the body in the image is covered by an occlusion block. The target image may also be referred to as a training sample image.
And 102, inputting the target image into a first network model to obtain a first output result.
The first network model may adopt an existing neural network model, and inputs the target image into the first network model, and the first network model outputs a first output result.
And 103, inputting the first output result into a second network model to obtain a second output result.
The second network model may adopt an existing neural network model, and the first output result is input into the second network model, and the second network model outputs the second output result.
And 104, determining a target loss value based on the labeling data, the first output result and the second output result.
The annotation data can be regarded as a true value, the first output result and the second output result can be regarded as predicted values, and in the step, the true value and the predicted values are compared to determine the target loss value.
And 105, adjusting parameters of the first network model and the second network model according to a target loss value to obtain a target network model, wherein the target network model comprises the first network model and the second network model after the parameters are adjusted.
And adjusting parameters of the first network model and the second network model according to the target loss value to obtain a target network model, wherein the target network model is a trained model. Furthermore, a target network model can be adopted to predict key points of the obtained human body image, so that more accurate key point distribution is obtained.
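As a rough illustration of steps 101 through 105, the following sketch runs one parameter-adjustment step on two toy linear stand-ins for the first and second network models. The linear models, tensor sizes, and learning rate are assumptions for illustration only; the patent does not fix the architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-ins (assumptions): the first model maps an image vector to
# "prediction maps"; the second maps those to keypoint coordinates.
W1 = rng.normal(size=(4, 6)) * 0.1   # first network model parameters
W2 = rng.normal(size=(2, 4)) * 0.1   # second network model parameters

x = rng.normal(size=6)               # target image, flattened (step 101)
g_maps = rng.normal(size=4)          # truth heatmaps from the annotation data
g_coords = rng.normal(size=2)        # truth keypoint coordinates

o1 = W1 @ x                          # step 102: first output result
o2 = W2 @ o1                         # step 103: second output result

# Step 104: target loss from the annotation data and both outputs.
before = np.sum((o1 - g_maps) ** 2) + np.sum((o2 - g_coords) ** 2)

# Step 105: adjust parameters of both models by one gradient descent step
# (analytic gradients for this linear toy case).
d_o2 = 2 * (o2 - g_coords)
d_o1 = 2 * (o1 - g_maps) + W2.T @ d_o2
W2 -= 0.01 * np.outer(d_o2, o1)
W1 -= 0.01 * np.outer(d_o1, x)

o1 = W1 @ x
o2 = W2 @ o1
after = np.sum((o1 - g_maps) ** 2) + np.sum((o2 - g_coords) ** 2)
```

After the update, the target loss decreases, which is what repeating this step over the training samples relies on.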
In the embodiment, a target image and annotation data of key points of the target image are obtained; inputting the target image into a first network model to obtain a first output result; inputting the first output result into a second network model to obtain a second output result; determining a target loss value based on the annotation data, the first output result, and the second output result; and adjusting parameters of the first network model and the second network model according to a target loss value to obtain a target network model, wherein the target network model comprises the first network model and the second network model after the parameters are adjusted. And obtaining a target loss value according to the labeling information of the key points in the target image, the output result of the first network model and the output result of the second network model, and adjusting the parameters of the first network model and the second network model according to the target loss value, so that the accuracy of predicting the key points by the target network model can be improved.
In one embodiment of the present application, the first network model comprises cascaded M levels, M being a positive integer greater than 1;
the inputting the target image into a first network model to obtain a first output result includes:
inputting the target image into the first network model, and acquiring a feature map output by the first M-1 levels and a predictive thermodynamic map output by the Mth level;
and obtaining the first output result according to the feature diagram and the predictive thermodynamic diagram.
As shown in fig. 2, the first network model includes M cascaded levels (i.e., stages), each of which may have multiple branches. Feature maps are obtained through the first M-1 levels of the first network model, and a predictive thermodynamic diagram is output at the Mth level; the predictive thermodynamic diagram is determined based on the feature maps obtained by the first M-1 levels. In fig. 2, the feature maps output by the first M-1 levels and the predictive thermodynamic diagram are concatenated along the channel dimension to obtain a prediction map; for example, if the feature map has 3 channels and the predictive thermodynamic diagram also has 3 channels, the concatenated prediction map has 3 × 2 channels. The feature map, the predictive thermodynamic diagram, and the prediction map all have the same spatial size. The first output result includes the prediction map. As shown in fig. 2, the first network model may further include a convolution module that obtains an initial feature map from the target image; level 1 outputs a first feature map based on the initial feature map, level 2 outputs a second feature map based on the initial feature map and the first feature map, and so on, with level M outputting an Mth feature map (the predictive thermodynamic diagram) based on the initial feature map and the feature maps output by the levels preceding level M. The feature maps obtained by the convolution module and the first M-1 levels are concatenated with the predictive thermodynamic diagram to obtain the prediction map.
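The channel-wise concatenation of feature maps and predictive thermodynamic diagram can be sketched as follows. The 3-channel sizes follow the 3 + 3 = 3 × 2 example in the text; the spatial size is otherwise illustrative.

```python
import numpy as np

# Toy tensors in (channels, H, W) layout; values are random placeholders.
feature_map = np.random.rand(3, 16, 16)    # output of the first M-1 levels
pred_heatmap = np.random.rand(3, 16, 16)   # predictive thermodynamic diagram (level M)

# Channel concatenation requires the maps to share the same spatial size (H, W);
# the result is the prediction map with 3 x 2 = 6 channels.
prediction_map = np.concatenate([feature_map, pred_heatmap], axis=0)
```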
In this embodiment, since the first output result is obtained based on the feature map and the predictive thermodynamic diagram, when parameters of the first network model and the second network model are adjusted based on the first output result, the accuracy of feature map output can be improved, so that the accuracy of the predictive thermodynamic diagram is improved.
In an embodiment of the present application, acquiring annotation data of a key point of the target image includes:
performing key point labeling on the target image to obtain a key point coordinate set, wherein the key point coordinate set comprises N first key point coordinates, and N is a positive integer greater than 1;
and acquiring a thermodynamic diagram based on each first key point coordinate to obtain N true value diagrams, wherein the labeling data comprises N first key point coordinates and N true value diagrams corresponding to the N first key point coordinates.
Specifically, the target image may include one sample image or a plurality of sample images. Each sample image is annotated with key points, for example by manual annotation. After annotation, a sample image may include one or more key points, each with a key point coordinate; the key points carry corresponding serial numbers, and key points within the same sample image have different serial numbers, as do the key points across the multiple sample images. The key point coordinate set is the set of coordinates of the key points included in one sample image, and the number of key points annotated in each sample image is the same. To distinguish them from the predicted key point coordinates described below, the annotated key point coordinates of the target image are called first key point coordinates.
In this embodiment, if the target image is a sample image, performing key point labeling on the target image to obtain a key point coordinate set of the target image; and if the target image is a plurality of sample images, performing key point labeling on the plurality of sample images to obtain a key point coordinate set of the plurality of sample images.
For each key point coordinate in the key point coordinate set, a truth map of a predetermined size is generated, i.e., a thermodynamic diagram (heatmap), also referred to as a Gaussian distribution map. The annotation data comprises the N first key point coordinates and the N truth maps corresponding to the N first key point coordinates.
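A minimal sketch of generating a truth map (Gaussian thermodynamic diagram) for each first key point coordinate might look like this. The map size, the Gaussian sigma, and the coordinates are assumptions, since the text leaves the predetermined size unspecified.

```python
import numpy as np

def keypoint_heatmap(x, y, size=64, sigma=2.0):
    """Truth map: a 2-D Gaussian centred on one labelled keypoint coordinate."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

# N first key point coordinates -> N truth maps (coordinates are made up).
first_keypoints = [(10, 12), (30, 40), (50, 20)]
truth_maps = np.stack([keypoint_heatmap(x, y) for x, y in first_keypoints])
```

Each truth map peaks at 1.0 exactly at its key point's pixel and decays smoothly around it.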
In this embodiment, a key point labeling is performed on a target image to obtain a key point coordinate set, and a thermodynamic diagram is obtained according to each first key point coordinate in the key point coordinate set to obtain N true value diagrams. When the parameters of the first network model and the second network model are adjusted by comprehensively considering the first key point coordinate and the truth diagram, the accuracy of predicting the key points by the target network model can be improved, and thus, more accurate key point coordinate values can be obtained through the target network model.
In one embodiment of the present application, the first output result includes N prediction graphs, and the second output result includes N second keypoint coordinates;
determining a target loss value based on the annotation data, the first output result, and the second output result, comprising:
obtaining a first loss value according to the N true value graphs and the N prediction graphs;
obtaining a second loss value according to the N first key point coordinates and the N second key point coordinates;
determining the target loss value based on the first loss value and the second loss value.
Specifically, the first network model may obtain a first output result based on the target image, where the first output result includes N prediction maps; the second network model may obtain a second output result based on the first output result, where the second output result includes N second key point coordinates, which correspond one-to-one with the N prediction maps and also one-to-one with the N first key point coordinates. Since the number of channels of the truth map differs from that of the prediction map, a convolution operation can first adjust the number of channels of the prediction map to match the truth map, and the first loss value is then obtained between the channel-adjusted prediction map and the truth map. For example, if the prediction map has 3 × 2 channels and the truth map has 3 channels, a 1 × 1 convolution can reduce the prediction map to the same number of channels as the truth map before the first loss value between the two is computed.
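The 1 × 1 convolution channel adjustment before the first loss value can be sketched as below. A 1 × 1 convolution is just a per-pixel linear map over channels; the random weights here stand in for learned ones, and the shapes follow the 3 × 2 vs. 3 channel example.

```python
import numpy as np

rng = np.random.default_rng(1)

pred_map = rng.random((6, 16, 16))    # prediction map: 3 x 2 channels
truth_map = rng.random((3, 16, 16))   # truth maps have only 3 channels

# 1x1 convolution as a channel mixing matrix: out[o,h,w] = sum_c w[o,c] * in[c,h,w].
# Random weights for illustration; in training they are learned.
w = rng.normal(size=(3, 6)) * 0.1
pred_reduced = np.einsum('oc,chw->ohw', w, pred_map)

# First loss value: mean squared difference between the channel-matched
# prediction map and the truth map.
first_loss = np.mean((pred_reduced - truth_map) ** 2)
```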
The sequence of the N prediction graphs output by the first network model corresponds to the sequence of the N first key point coordinates, the sequence of the N second key point coordinates output by the second network model corresponds to the sequence of the N prediction graphs, and based on the sequence, the one-to-one correspondence among the N first key point coordinates, the N second key point coordinates, the N true value graphs and the N prediction graphs can be determined. For example, if N is 5, for a sample image, 5 first keypoint coordinates are obtained, the order of the 5 first keypoint coordinates may be preset, the sample image is input into the first network model, 5 prediction graphs are obtained, the order between the 5 prediction graphs is the same as the order between the 5 first keypoint coordinates, and thus, the prediction graph corresponding to each first keypoint coordinate can be obtained, and which truth graph each prediction graph in the 5 prediction graphs corresponds to can be obtained. Similarly, 5 prediction graphs are input into the second network model, 5 second keypoint coordinates can be obtained, the sequence between the 5 second keypoint coordinates is the same as that between the 5 prediction graphs, and therefore the first keypoint coordinates corresponding to each second keypoint coordinate can be obtained. And respectively calculating difference values between the corresponding thermodynamic diagrams in the N true value diagrams and the N prediction diagrams to obtain N first difference values. Further, an average value or a weighted average value is obtained for the N first difference values, so as to obtain a first loss value.
Similarly, second difference values (which may be a distance between two coordinate points) between the N first keypoint coordinates and corresponding keypoint coordinates in the N second keypoint coordinates are respectively obtained, so as to obtain N second difference values. Further, an average value or a weighted average value is obtained for the N second difference values, so as to obtain a second loss value.
Determining a target loss value according to the first loss value and the second loss value, and specifically obtaining the target loss value according to the following expression:
Loss = α·Loss_h + β·Loss_c
where Loss is the target loss value, Loss_h is the first loss value, and Loss_c is the second loss value; α and β can be selected according to the actual situation and are not limited here.
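A minimal sketch of the target loss computation follows, using mean squared error for the heatmap term and mean Euclidean distance for the coordinate term. The exact loss forms and the α, β values are assumptions, as the text notes they can be chosen according to the actual situation.

```python
import numpy as np

def target_loss(pred_maps, truth_maps, pred_coords, truth_coords,
                alpha=1.0, beta=1.0):
    """Target loss: alpha weights the heatmap (first) loss, beta the
    coordinate (second) loss. Default weights are illustrative."""
    loss_h = np.mean((pred_maps - truth_maps) ** 2)
    # Per-keypoint Euclidean distance, averaged over the N keypoints.
    loss_c = np.mean(np.sqrt(np.sum((pred_coords - truth_coords) ** 2, axis=-1)))
    return alpha * loss_h + beta * loss_c
```

With perfect predictions both terms vanish and the target loss is zero; each term grows with its own kind of error, so the gradient adjusts both models.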
In this embodiment, a first loss value is obtained according to the N true value maps and the N prediction maps; obtaining a second loss value according to the N first key point coordinates and the N second key point coordinates; determining the target loss value based on the first loss value and the second loss value. The loss value of the key point coordinates and the loss value of the thermodynamic diagram are comprehensively considered in the determination of the target loss value, namely the advantages of two learning modes of thermodynamic diagram learning and key point coordinate direct regression are combined, so that the target network model can obtain more accurate key points.
In an embodiment of the present application, the acquiring the target image includes:
acquiring a first type of image of a target object, wherein the first type of image is a whole body image;
and processing the first-class images to obtain second-class images, wherein the second-class images comprise partial images of the first-class images, and the target images comprise the first-class images and the second-class images.
The target object may be a human, an animal, or the like. The first type of image may be a whole-body image of the target subject, e.g., a front whole-body image, a back whole-body image, a side whole-body image, etc., of the target subject. The first type of image may be regarded as a basic image, and further, the first type of image is processed, for example, a part of a body part of the target object in the first type of image is blocked to obtain a second type of image, or a blocking block for blocking the body part of the target object is randomly added according to distribution of key points in the first type of image to obtain the second type of image.
In this embodiment, the target image includes both the first type of image and the second type of image, that is, both an unoccluded whole-body image of the target object and an image in which part of the target object's body is occluded. A target network model trained with such target images can therefore predict the key point coordinates of invisible limbs: those coordinates are estimated from the learned prior of human body posture key points together with the currently visible key points, yielding the most reasonable key point distribution and improving the accuracy of predicting the key point coordinates of invisible limbs.
The following exemplary description of the model training method provided in the present application refers to fig. 2.
Data preparation. The selected data is image data containing a complete human body (which can be regarded as basic data), so as to better capture the distribution probability of human body key points. Each piece of person image data has two corresponding truth labels: the first is the key point coordinates of the human body in the image, determined by annotation; the second is a Gaussian distribution map (i.e., a truth map) generated from those key point coordinates. Both truth labels participate in the subsequent training process.
After the basic data are selected, occlusion blocks covering human limbs are randomly added according to the distribution of the human body key points in each piece of person image data to obtain new person image data; the new data and the basic data together serve as training samples (i.e., the target images in fig. 2) for model training.
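The occlusion-block augmentation in this data-preparation step might be sketched as follows. The patch size, placement rule (centred on a random keypoint), and fill value are assumptions; the patent only requires that occlusion blocks be added according to the keypoint distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_occlusion(image, keypoints, block=8):
    """Occlude a square patch centred on one randomly chosen keypoint.
    Simplified sketch of the augmentation; real pipelines may vary size/colour."""
    occluded = image.copy()
    x, y = keypoints[rng.integers(len(keypoints))]
    h, w = image.shape[:2]
    y0, y1 = max(0, y - block // 2), min(h, y + block // 2)
    x0, x1 = max(0, x - block // 2), min(w, x + block // 2)
    occluded[y0:y1, x0:x1] = 0        # black occlusion block
    return occluded

base = np.ones((32, 32, 3))           # stand-in person image
sample = add_occlusion(base, [(10, 10), (20, 25)])
```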
Network structure design: learning the heatmap distribution of human body key points. A neural network structure (i.e., the first network model) is selected; the optional scope includes, but is not limited to, common network structures such as AlphaPose and OpenPose, or a self-designed network model that outputs heatmaps. As shown in fig. 2, the first network model includes M cascaded levels (i.e., stages), each of which may have multiple branches; for example, in fig. 2, level 1 includes branches a and b, level M-1 includes branches c and d, and level M includes branches e and f. The first network model may further include a convolution module that obtains an initial feature map from the target image; level 1 outputs a first feature map based on the initial feature map, level 2 outputs a second feature map based on the initial feature map and the first feature map, and so on, with level M outputting an Mth feature map (the predictive thermodynamic diagram) based on the initial feature map and the feature maps output by the levels preceding level M. As shown in fig. 2, the feature maps obtained by the convolution module and the first M-1 levels are concatenated with the predictive thermodynamic diagram along the channel dimension to obtain the prediction map; for example, if the feature map has 3 channels and the predictive thermodynamic diagram also has 3 channels, the concatenated prediction map has 3 × 2 channels. The feature map, the predictive thermodynamic diagram, and the prediction map all have the same spatial size.
The output O_h of the first network structure has the dimensions {B × N × H × W}, where B represents the number of data samples input each time, N represents the preset number of key points (consistent with the number of key points labeled at the annotation stage), and H and W represent the length and width of the prediction map, respectively. The loss value between the prediction map O_h and the truth thermodynamic diagram (i.e., true value map) G_h is calculated as:
Loss_h = ‖O_h − G_h‖_2
other loss or combinations of losses may be used depending on the application.
Learning the human body key point coordinate values. The prediction map obtained in the previous step is input into a subsequent key point acquisition module (which can be regarded as the second network model), which outputs the key point coordinate values O_c. The key point acquisition module may employ a combination of convolutional layers, fully connected layers, or other layers. The key point coordinate loss value is calculated from the output key point predicted values and the key point truth values (i.e., the key point coordinates annotated in the data preparation stage) according to the following expression:
Loss_c = ‖O_c − G_c‖_2
where G_c is the key point truth coordinate (i.e., the first key point coordinate). Depending on the actual situation, other losses or combinations of losses may also be used.
The loss values calculated in the above two steps are combined according to the following expression to obtain Loss:
Loss = α·Loss_h + β·Loss_c
where α and β can be determined according to the actual situation.
And finally, adjusting parameters of the neural network structure and the key point acquisition module according to the Loss value to obtain a trained target network model.
The method combines the advantages of thermodynamic diagram learning and direct regression of key point coordinates: it can obtain more accurate key point coordinate values and better predict the key points of invisible limbs. For invisible limbs, the key points are no longer determined by randomly generated numbers; instead, the key point distribution closest to the current posture is obtained from the existing key points and the learned prior values.
Referring to fig. 3, fig. 3 is a structural diagram of a model training apparatus according to an embodiment of the present application, and as shown in fig. 3, the present embodiment provides a model training apparatus 200, including:
a first obtaining module 201, configured to obtain a target image;
a second obtaining module 202, configured to obtain annotation data of a key point of the target image;
a third obtaining module 203, configured to input the target image to the first network model, and obtain a first output result;
a fourth obtaining module 204, configured to input the first output result to a second network model, so as to obtain a second output result;
a determining module 205, configured to determine a target loss value based on the annotation data, the first output result, and the second output result;
a fifth obtaining module 206, configured to adjust parameters of the first network model and the second network model according to a target loss value, so as to obtain a target network model, where the target network model includes the first network model and the second network model after the parameter adjustment.
Further, the first network model comprises M cascaded levels, wherein M is a positive integer greater than 1;
the third obtaining module 203 includes:
the first acquisition sub-module is used for inputting the target image into the first network model and acquiring the feature maps output by the first M-1 levels and the predicted thermodynamic map output by the Mth level;
and the second obtaining submodule is used for obtaining the first output result according to the feature diagram and the predictive thermodynamic diagram.
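The cascade described for the third obtaining module can be sketched as a pipeline: each of the first M-1 levels feeds its feature map to the next level, the M-th level emits the predicted thermodynamic (heat) maps, and the first output result bundles both. The dict layout and names are assumptions; the patent does not fix the tensor structure.

```python
import numpy as np

def first_network(image, stages):
    """Run an M-level cascade: the first M-1 stages produce feature
    maps (each consumed by the next stage), the last stage produces the
    predicted heat maps. Returns both, as the first output result."""
    feats, x = [], image
    for stage in stages[:-1]:      # first M-1 levels
        x = stage(x)
        feats.append(x)
    heatmaps = stages[-1](x)       # M-th level
    return {"feature_maps": feats, "pred_heatmaps": heatmaps}
```

Here `stages` stands in for trained sub-networks; any callables with matching shapes work for illustration.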
Further, the second obtaining module 202 includes:
the third obtaining submodule is used for carrying out key point labeling on the target image to obtain a key point coordinate set, wherein the key point coordinate set comprises N first key point coordinates, and N is a positive integer greater than 1;
and the fourth obtaining submodule is used for obtaining a thermodynamic diagram based on each first key point coordinate to obtain N true value diagrams, wherein the labeling data comprise N first key point coordinates and N true value diagrams corresponding to the N first key point coordinates.
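The N true value maps described above are commonly rendered as 2-D Gaussians centred on each annotated key point ("thermodynamic diagram" in this translation corresponds to a heat map). A minimal sketch, assuming a Gaussian truth map with peak 1 and an illustrative `sigma` (the patent does not specify the rendering):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render one ground-truth map for a key point at (cx, cy):
    a 2-D Gaussian peaking at 1 on the annotated coordinate.
    One such map per key point yields the N truth maps."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

The map's argmax recovers the annotated coordinate, which is what lets heat-map supervision and coordinate regression describe the same label.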
Further, the first output result comprises N prediction graphs, and the second output result comprises N second keypoint coordinates;
the determining module 205 includes:
a fifth obtaining sub-module, configured to obtain a first loss value according to the N true value maps and the N prediction maps;
a sixth obtaining submodule, configured to obtain a second loss value according to the N first keypoint coordinates and the N second keypoint coordinates;
a determination submodule configured to determine the target loss value based on the first loss value and the second loss value.
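One plausible form of the determination above: mean squared error between the N truth and prediction maps (first loss value) plus mean squared error between the N annotated and regressed coordinates (second loss value), combined with a weight. The patent only states that the target loss is determined from both; the MSE choice and the weight `alpha` are assumptions.

```python
import numpy as np

def target_loss(true_maps, pred_maps, true_xy, pred_xy, alpha=1.0):
    """Combine the heat-map loss and the coordinate-regression loss
    into a single target loss value (weighted sum, by assumption)."""
    l_map = np.mean((true_maps - pred_maps) ** 2)   # first loss value
    l_xy = np.mean((true_xy - pred_xy) ** 2)        # second loss value
    return l_map + alpha * l_xy
```

Both network models are then updated against this single scalar, which is what couples the two learning modes.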
Further, the first obtaining module 201 is configured to:
acquiring a first type of image of a target object, wherein the first type of image is a whole body image;
and processing the first-class images to obtain second-class images, wherein the second-class images comprise partial images of the first-class images, and the target images comprise the first-class images and the second-class images.
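The first-type/second-type image pair described above can be produced by cropping a partial region out of each whole-body image, so the training set contains both full and truncated views of the target object. The crop-size bounds here are illustrative; the patent does not specify how the partial images are derived.

```python
import numpy as np

def make_training_pair(full_image, rng=None):
    """From a first-type (whole-body) image, derive a second-type
    (partial) image by cropping a random sub-region covering at
    least half of each dimension (an illustrative choice)."""
    rng = rng or np.random.default_rng(0)
    h, w = full_image.shape[:2]
    ch, cw = rng.integers(h // 2, h), rng.integers(w // 2, w)
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    partial = full_image[y0:y0 + ch, x0:x0 + cw]
    return full_image, partial
```

Both images then enter the target-image set, so the model sees poses with limbs cropped out of frame during training.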
The model training apparatus 200 can implement each process implemented by the electronic device in the method embodiment shown in fig. 1; details are not repeated here to avoid redundancy.
The model training device 200 of the embodiment of the application acquires a target image and annotation data of key points of the target image; inputs the target image into a first network model to obtain a first output result; inputs the first output result into a second network model to obtain a second output result; determines a target loss value based on the annotation data, the first output result and the second output result; and adjusts parameters of the first network model and the second network model according to the target loss value to obtain a target network model, where the target network model includes the first network model and the second network model after parameter adjustment. Because the target loss value is obtained from the annotation information of the key points in the target image together with the outputs of both network models, adjusting the parameters of the two models according to this loss value improves the accuracy with which the target network model predicts key points.
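The five steps summarized above can be sketched as one training-step function. The callables and the structure of the results are hypothetical; the parameter adjustment itself (e.g., an optimizer step on the returned loss) is left to the surrounding framework.

```python
import numpy as np

def train_step(image, annots, first_model, second_model, loss_fn):
    """The claimed flow in order: run the first network model, feed
    its output to the second network model, and compute the target
    loss from the annotation data plus both output results."""
    first_out = first_model(image)          # e.g., predicted heat maps
    second_out = second_model(first_out)    # e.g., regressed coordinates
    return loss_fn(annots, first_out, second_out)
```

With stub models this just wires the data flow; in practice both models' parameters would be updated from the returned target loss value.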
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 4, the electronic apparatus includes: one or more processors 301, a memory 302, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 301 is taken as an example.
Memory 302 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of model training provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of model training provided herein.
The memory 302, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method of model training in the embodiments of the present application (e.g., the first obtaining module 201, the second obtaining module 202, the third obtaining module 203, the fourth obtaining module 204, the determining module 205, and the fifth obtaining module 206 shown in fig. 3). The processor 301 executes various functional applications of the server and performs data processing, that is, implements the method of model training in the above-described method embodiments, by executing the non-transitory software programs, instructions, and modules stored in the memory 302.
The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the model-trained electronic device, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, and these remote memories may be connected to a model training electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of model training may further comprise: an input device 303 and an output device 304. The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means; fig. 4 illustrates connection by a bus as an example.
The input device 303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the model-trained electronic device; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or other input devices. The output device 304 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS services.
According to the technical solution of the embodiment of the application, a target image and annotation data of key points of the target image are acquired; the target image is input into a first network model to obtain a first output result; the first output result is input into a second network model to obtain a second output result; a target loss value is determined based on the annotation data, the first output result and the second output result; and parameters of the first network model and the second network model are adjusted according to the target loss value to obtain a target network model, where the target network model includes the first network model and the second network model after parameter adjustment. Because the target loss value is obtained from the annotation information of the key points in the target image together with the outputs of the two network models, adjusting the parameters of both models according to this loss value improves the accuracy with which the target network model predicts key points.
Because the first output result is obtained based on the feature maps and the predictive thermodynamic diagram, adjusting the parameters of the first network model and the second network model based on the first output result improves the accuracy of both the feature map output and the predictive thermodynamic diagram.
Key point labeling is performed on the target image to obtain a key point coordinate set, and a thermodynamic diagram is obtained for each first key point coordinate in the set, yielding N true value maps. Because the first key point coordinates and the truth maps are considered together when adjusting the parameters of the first network model and the second network model, the accuracy with which the target network model predicts key points is improved, and more accurate key point coordinate values can be obtained through the target network model.
A first loss value is obtained from the N true value maps and the N prediction maps; a second loss value is obtained from the N first key point coordinates and the N second key point coordinates; and the target loss value is determined based on the first loss value and the second loss value. The target loss value thus jointly accounts for the coordinate loss and the thermodynamic diagram loss, combining the advantages of the two learning modes (thermodynamic diagram learning and direct regression of key point coordinates), so the target network model can obtain more accurate key points.
The target images include the first type of images and the second type of images, that is, both unoccluded whole-body images of the target object and images of partial body parts of the target object. A target network model trained on such target images can therefore predict the key point coordinates of invisible limbs: the coordinates are inferred from the key points of the pre-learned prior human posture together with the currently acquired key points, yielding the most reasonable key point distribution and improving the accuracy of predicting the key point coordinates of invisible limbs.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the present application is not limited in this respect as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method of model training, comprising:
acquiring a target image and annotation data of key points of the target image;
inputting the target image into a first network model to obtain a first output result;
inputting the first output result into a second network model to obtain a second output result;
determining a target loss value based on the annotation data, the first output result, and the second output result;
and adjusting parameters of the first network model and the second network model according to a target loss value to obtain a target network model, wherein the target network model comprises the first network model and the second network model after parameter adjustment.
2. The model training method of claim 1, wherein the first network model comprises a cascade of M levels, M being a positive integer greater than 1;
the inputting the target image into a first network model to obtain a first output result includes:
inputting the target image into the first network model, and acquiring a feature map output by the first M-1 levels and a predictive thermodynamic map output by the Mth level;
and obtaining the first output result according to the feature diagram and the predictive thermodynamic diagram.
3. The model training method of claim 2, wherein obtaining annotation data of the key points of the target image comprises:
performing key point labeling on the target image to obtain a key point coordinate set, wherein the key point coordinate set comprises N first key point coordinates, and N is a positive integer greater than 1;
and acquiring a thermodynamic diagram based on each first key point coordinate to obtain N true value diagrams, wherein the labeling data comprises N first key point coordinates and N true value diagrams corresponding to the N first key point coordinates.
4. The model training method of claim 3, wherein the first output result comprises N prediction maps and the second output result comprises N second keypoint coordinates;
determining a target loss value based on the annotation data, the first output result, and the second output result, comprising:
obtaining a first loss value according to the N true value graphs and the N prediction graphs;
obtaining a second loss value according to the N first key point coordinates and the N second key point coordinates;
determining the target loss value based on the first loss value and the second loss value.
5. The model training method of claim 1, wherein the acquiring a target image comprises:
acquiring a first type of image of a target object, wherein the first type of image is a whole body image;
and processing the first-class images to obtain second-class images, wherein the second-class images comprise partial images of the first-class images, and the target images comprise the first-class images and the second-class images.
6. A model training apparatus comprising:
the first acquisition module is used for acquiring a target image;
the second acquisition module is used for acquiring the annotation data of the key points of the target image;
the third acquisition module is used for inputting the target image into the first network model to obtain a first output result;
the fourth obtaining module is used for inputting the first output result into a second network model to obtain a second output result;
a determining module, configured to determine a target loss value based on the annotation data, the first output result, and the second output result;
and a fifth obtaining module, configured to adjust parameters of the first network model and the second network model according to a target loss value, so as to obtain a target network model, where the target network model includes the first network model and the second network model after parameter adjustment.
7. The model training apparatus of claim 6, wherein the first network model comprises a cascade of M levels, M being a positive integer greater than 1;
the third obtaining module includes:
the first acquisition sub-module is used for inputting the target image into the first network model and acquiring the feature maps output by the first M-1 levels and the predicted thermodynamic map output by the Mth level;
and the second obtaining submodule is used for obtaining the first output result according to the feature diagram and the predictive thermodynamic diagram.
8. The model training apparatus of claim 7, wherein the second obtaining module comprises:
the third obtaining submodule is used for carrying out key point labeling on the target image to obtain a key point coordinate set, wherein the key point coordinate set comprises N first key point coordinates, and N is a positive integer greater than 1;
and the fourth obtaining submodule is used for obtaining a thermodynamic diagram based on each first key point coordinate to obtain N true value diagrams, wherein the labeling data comprise N first key point coordinates and N true value diagrams corresponding to the N first key point coordinates.
9. The model training apparatus of claim 8, wherein the first output result comprises N prediction maps, and the second output result comprises N second keypoint coordinates;
the determining module includes:
a fifth obtaining sub-module, configured to obtain a first loss value according to the N true value maps and the N prediction maps;
a sixth obtaining submodule, configured to obtain a second loss value according to the N first keypoint coordinates and the N second keypoint coordinates;
a determination submodule configured to determine the target loss value based on the first loss value and the second loss value.
10. The model training apparatus of claim 6, wherein the first obtaining module is configured to:
acquiring a first type of image of a target object, wherein the first type of image is a whole body image;
and processing the first-class images to obtain second-class images, wherein the second-class images comprise partial images of the first-class images, and the target images comprise the first-class images and the second-class images.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202010587794.1A 2020-06-24 2020-06-24 Model training method and device, electronic equipment and storage medium Withdrawn CN111783948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010587794.1A CN111783948A (en) 2020-06-24 2020-06-24 Model training method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111783948A true CN111783948A (en) 2020-10-16

Family

ID=72759757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010587794.1A Withdrawn CN111783948A (en) 2020-06-24 2020-06-24 Model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111783948A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399373A (en) * 2018-02-06 2018-08-14 北京达佳互联信息技术有限公司 The model training and its detection method and device of face key point
CN109508681A (en) * 2018-11-20 2019-03-22 北京京东尚科信息技术有限公司 The method and apparatus for generating human body critical point detection model
CN110069985A (en) * 2019-03-12 2019-07-30 北京三快在线科技有限公司 Aiming spot detection method based on image, device, electronic equipment
CN110443222A (en) * 2019-08-14 2019-11-12 北京百度网讯科技有限公司 Method and apparatus for training face's critical point detection model
CN110532981A (en) * 2019-09-03 2019-12-03 北京字节跳动网络技术有限公司 Human body key point extracting method, device, readable storage medium storing program for executing and equipment
CN110909664A (en) * 2019-11-20 2020-03-24 北京奇艺世纪科技有限公司 Human body key point identification method and device and electronic equipment
CN110969100A (en) * 2019-11-20 2020-04-07 北京奇艺世纪科技有限公司 Human body key point identification method and device and electronic equipment
CN111079570A (en) * 2019-11-29 2020-04-28 北京奇艺世纪科技有限公司 Human body key point identification method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李锵;姚麟倩;关欣;: "基于级联卷积神经网络的服饰关键点定位算法", 天津大学学报(自然科学与工程技术版), no. 03 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270669A (en) * 2020-11-09 2021-01-26 北京百度网讯科技有限公司 Human body 3D key point detection method, model training method and related device
CN112270669B (en) * 2020-11-09 2024-03-01 北京百度网讯科技有限公司 Human body 3D key point detection method, model training method and related devices
CN112270711A (en) * 2020-11-17 2021-01-26 北京百度网讯科技有限公司 Model training and posture prediction method, device, equipment and storage medium
CN112270711B (en) * 2020-11-17 2023-08-04 北京百度网讯科技有限公司 Model training and posture prediction method, device, equipment and storage medium
CN112560933A (en) * 2020-12-10 2021-03-26 中邮信息科技(北京)有限公司 Model training method and device, electronic equipment and medium
CN112560996A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 User portrait recognition model training method, device, readable storage medium and product
CN112560996B (en) * 2020-12-24 2024-03-05 北京百度网讯科技有限公司 User portrait identification model training method, device, readable storage medium and product
CN112784739A (en) * 2021-01-21 2021-05-11 北京百度网讯科技有限公司 Model training method, key point positioning method, device, equipment and medium
WO2022160676A1 (en) * 2021-01-29 2022-08-04 北京百度网讯科技有限公司 Method and apparatus for training heat map generation model, and electronic device and storage medium
CN112785582B (en) * 2021-01-29 2024-03-22 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112785582A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112990144B (en) * 2021-04-30 2021-08-17 德鲁动力科技(成都)有限公司 Data enhancement method and system for pedestrian re-identification
CN112990144A (en) * 2021-04-30 2021-06-18 德鲁动力科技(成都)有限公司 Data enhancement method and system for pedestrian re-identification
CN113900519A (en) * 2021-09-30 2022-01-07 Oppo广东移动通信有限公司 Method and device for acquiring fixation point and electronic equipment
WO2023051215A1 (en) * 2021-09-30 2023-04-06 Oppo广东移动通信有限公司 Gaze point acquisition method and apparatus, electronic device and readable storage medium
CN116644781B (en) * 2023-07-27 2023-09-29 美智纵横科技有限责任公司 Model compression method, data processing device, storage medium and chip
CN116644781A (en) * 2023-07-27 2023-08-25 美智纵横科技有限责任公司 Model compression method, data processing device, storage medium and chip

Similar Documents

Publication Publication Date Title
CN111783948A (en) Model training method and device, electronic equipment and storage medium
CN111931591B (en) Method, device, electronic equipment and readable storage medium for constructing key point learning model
CN111708922A (en) Model generation method and device for representing heterogeneous graph nodes
CN112529073A (en) Model training method, attitude estimation method and apparatus, and electronic device
CN110795569B (en) Method, device and equipment for generating vector representation of knowledge graph
CN111832614A (en) Training method and device of target detection model, electronic equipment and storage medium
CN111753961A (en) Model training method and device, and prediction method and device
CN112270711B (en) Model training and posture prediction method, device, equipment and storage medium
CN111524166A (en) Video frame processing method and device
CN111695519B (en) Method, device, equipment and storage medium for positioning key point
CN110852379B (en) Training sample generation method and device for target object recognition
CN111753761B (en) Model generation method, device, electronic equipment and storage medium
CN112560499B (en) Pre-training method and device for semantic representation model, electronic equipment and storage medium
CN111967297A (en) Semantic segmentation method and device for image, electronic equipment and medium
CN112561056A (en) Neural network model training method and device, electronic equipment and storage medium
CN111539347A (en) Method and apparatus for detecting target
CN111914994A (en) Method and device for generating multilayer perceptron, electronic equipment and storage medium
CN111753964A (en) Neural network training method and device
CN111640103A (en) Image detection method, device, equipment and storage medium
CN112561059B (en) Method and apparatus for model distillation
CN112508964B (en) Image segmentation method, device, electronic equipment and storage medium
CN111539224B (en) Pruning method and device of semantic understanding model, electronic equipment and storage medium
CN111563541B (en) Training method and device of image detection model
CN111680599A (en) Face recognition model processing method, device, equipment and storage medium
CN116167426A (en) Training method of face key point positioning model and face key point positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201016