CN112861678B - Image recognition method and device - Google Patents


Info

Publication number
CN112861678B
CN112861678B (application CN202110122934.2A)
Authority
CN
China
Prior art keywords
key point
visibility
network
thermal
sample graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110122934.2A
Other languages
Chinese (zh)
Other versions
CN112861678A (en)
Inventor
田晓玮
王蔚
聂学成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yitu Technology Co ltd
Original Assignee
Shanghai Yitu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yitu Technology Co ltd filed Critical Shanghai Yitu Technology Co ltd
Priority to CN202110122934.2A priority Critical patent/CN112861678B/en
Publication of CN112861678A publication Critical patent/CN112861678A/en
Application granted granted Critical
Publication of CN112861678B publication Critical patent/CN112861678B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The application relates to the field of computer vision, and in particular to an image recognition method and device. Feature extraction is performed on an image to be recognized that contains a target object, yielding a feature map of the image. The feature map is input into a trained key point recognition model, which outputs the position information and the visibility information of each key point of the target object; the model is obtained by iterative training on at least one feature sample graph, on the key point thermal sample graphs converted from that feature sample graph, and on the visibility thermal sample graphs converted from it. The action category of the target object is then recognized from the position information and the visibility information of each key point. Because a single model recognizes both the position information and the visibility information of the key points at the same time, resource occupancy is reduced.

Description

Image recognition method and device
Technical Field
The present application relates to the field of computer vision, and in particular, to an image recognition method and apparatus.
Background
Key point detection is the basis of action recognition. In practice, because of occlusion by the environment, the shooting mode, the posture of the target object and the like, some key points in the image to be recognized are not visible at the visual level. If the positions of such invisible key points are nevertheless predicted, the guessed positions reduce the accuracy of recognizing the action category of the target object. How to predict the position information and the visibility information of the key points at the same time is therefore a problem to be solved.
In the related art, the position information and the visibility information of a key point are predicted at the same time by designing two independent neural network models, one predicting the position information of the key point and the other predicting its visibility information. Adding a second neural network model for prediction, however, occupies more resources.
Disclosure of Invention
The embodiments of the application provide an image recognition method and device that reduce resource occupancy when recognizing an image.
The specific technical scheme provided by the embodiment of the application is as follows:
an image recognition method, comprising:
Extracting features of an image to be identified to obtain a feature map of the image to be identified, wherein the image to be identified contains a target object;
Inputting the feature images into a trained key point identification model, and outputting position information of each key point of the target object and visibility information of each key point, wherein the key point identification model is obtained by performing iterative training on at least one feature sample image, each key point thermal sample image obtained by converting the at least one feature sample image and each visibility thermal sample image obtained by converting the at least one feature sample image, and the visibility information characterizes that the key point is visible or invisible;
and identifying the action category of the target object according to the position information of each key point and the visibility information of each key point.
Optionally, if the key point recognition model includes at least a location recognition network and a visibility recognition network, the feature map is input into a trained key point recognition model, and location information of each key point of the target object and visibility information of each key point are output, which specifically includes:
Carrying out regression processing on the feature map through a regression layer in the position identification network, determining a key point thermodynamic diagram of each key point, and determining the position information of each key point according to the key point thermodynamic diagram of each key point; and,
And carrying out regression processing on the feature map through a regression layer in the visibility identification network, determining the visibility thermodynamic diagram of each key point, and determining the visibility information of each key point according to the visibility thermodynamic diagram of each key point.
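The two parallel regression branches described above can be sketched with numpy. Each branch is reduced here to a 1×1-convolution-style linear projection of the shared feature map; all shapes and weights are illustrative assumptions, not the patent's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 32, 16, 16   # channels and spatial size of the shared feature map (assumed)
K = 8                  # number of key points to predict (assumed)

feature_map = rng.standard_normal((C, H, W))

# Each branch regresses one heat map per key point from the shared features,
# modeled as a 1x1-convolution-style linear projection over channels.
w_pos = rng.standard_normal((K, C)) * 0.1   # position-branch weights (hypothetical)
w_vis = rng.standard_normal((K, C)) * 0.1   # visibility-branch weights (hypothetical)

keypoint_heatmaps = np.einsum('kc,chw->khw', w_pos, feature_map)
visibility_heatmaps = np.einsum('kc,chw->khw', w_vis, feature_map)

print(keypoint_heatmaps.shape, visibility_heatmaps.shape)  # (8, 16, 16) (8, 16, 16)
```

Both branches consume the same feature map, which is what lets one model replace the two independent networks of the related art.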
Optionally, determining the location information of each key point according to the key point thermodynamic diagram of each key point specifically includes:
For each key point, acquiring the thermal value of each pixel point contained in the key point thermodynamic diagram corresponding to that key point, and selecting the pixel point with the largest thermal value as the key point contained in that key point thermodynamic diagram;
And obtaining the position information of each key point in the image to be identified.
Optionally, determining the visibility information of each key point according to the visibility thermodynamic diagram of each key point specifically includes:
For each key point, acquiring the thermal value of each pixel point contained in the visibility thermodynamic diagram corresponding to that key point; if the visibility thermodynamic diagram contains at least one pixel point whose thermal value is greater than or equal to a preset thermal value threshold, determining the visibility information of the key point contained in that diagram as visible, and if the thermal value of every pixel point is smaller than the threshold, determining the visibility information of the key point as invisible.
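The decision rule just described can be sketched in a few lines of numpy; the heat-map size and the threshold value 0.5 are assumptions for illustration (the patent only requires a preset thermal value threshold):

```python
import numpy as np

def is_visible(vis_heatmap, threshold=0.5):
    """A key point counts as visible if at least one pixel's thermal value
    reaches the preset threshold; otherwise it counts as invisible."""
    return bool((vis_heatmap >= threshold).any())

# Toy visibility heat maps (64x48 pixels, values in [0, 1]).
visible_map = np.full((64, 48), 0.1)
visible_map[10, 10] = 0.8          # a single pixel clearing the threshold suffices
invisible_map = np.full((64, 48), 0.1)

print(is_visible(visible_map), is_visible(invisible_map))  # True False
```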
Optionally, if the key point identification model at least includes a location identification network, a supervision network, and a visibility identification network, the training manner of the key point identification model is:
Training an initial position recognition network through the at least one characteristic sample graph, each key point thermal sample graph obtained after the at least one characteristic sample graph is converted and position information of key points contained in each key point thermal sample graph to obtain a trained position recognition network;
Fixing each parameter of the trained position identification network, training an initial supervision network, and adjusting each parameter of the initial supervision network to obtain a trained supervision network;
Fixing parameters of the trained position recognition network and the trained supervision network, training an initial visibility recognition network through the trained supervision network, the at least one characteristic sample graph, each visibility thermodynamic sample graph after the at least one characteristic sample graph is converted and the visibility information of key points contained in each visibility thermodynamic sample graph, and adjusting each parameter of the initial visibility recognition network to obtain the trained visibility recognition network;
And training the key point recognition model until the objective function of the key point recognition model converges, and obtaining the trained key point recognition model.
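The three-stage schedule above can be sketched in plain Python. The parameter-group names are hypothetical, and real training code would freeze parameters (e.g. by excluding them from the optimizer) rather than merely track group names:

```python
# Hypothetical parameter-group names for the three sub-networks.
STAGES = [
    ("train position network", {"position"}),       # stage 1: everything else untrained
    ("train supervision network", {"supervision"}), # stage 2: position parameters fixed
    ("train visibility network", {"visibility"}),   # stage 3: position + supervision fixed
]

def trainable_groups(stage: int) -> set:
    """Which parameter groups may be updated in the given training stage;
    all other groups stay frozen, as the patent's training order requires."""
    return STAGES[stage][1]

for name, groups in STAGES:
    print(name, "->", sorted(groups))
```

Training proceeds stage by stage until the objective function of the whole key point recognition model converges.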
Optionally, before training the initial location identification network, the method further includes:
Acquiring at least one characteristic sample graph, converting the at least one characteristic sample graph to obtain a thermal sample graph of each key point, and marking the position information of the key point contained in the thermal sample graph of each key point;
Converting the at least one characteristic sample graph to obtain each visibility thermodynamic sample graph, and marking the visibility information of key points contained in each visibility thermodynamic sample graph;
And carrying out sample expansion on each visibility thermal sample graph so that the number of each visibility thermal sample graph is the same as that of each key point thermal sample graph.
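One way to carry out the sample expansion described above is to oversample the visibility thermal sample graphs with replacement until the counts match; the patent does not specify the expansion method, so duplication with replacement is an assumption:

```python
import random

def expand_samples(vis_maps, target_count, seed=0):
    """Oversample the visibility heat-map list (with replacement) until it
    matches the number of key point heat maps."""
    rng = random.Random(seed)
    expanded = list(vis_maps)
    while len(expanded) < target_count:
        expanded.append(rng.choice(vis_maps))
    return expanded

keypoint_maps = ["kp0", "kp1", "kp2", "kp3", "kp4"]  # placeholder heat maps
visibility_maps = ["vis0", "vis1"]

balanced = expand_samples(visibility_maps, len(keypoint_maps))
print(len(balanced))  # 5
```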
Optionally, training an initial position recognition network through the at least one feature sample graph, each key point thermal sample graph obtained after the at least one feature sample graph is converted, and position information of key points included in each key point thermal sample graph, to obtain a trained position recognition network, which specifically includes:
And inputting any one characteristic sample graph into an initial position identification network for identification to obtain the predicted position information of each key point contained in the characteristic sample graph, and adjusting each parameter of the initial position identification network according to the error value between each predicted position information and the position information of the key point contained in the corresponding key point thermal sample graph until the error value is minimized to obtain the trained position identification network.
An image recognition apparatus comprising:
The feature extraction module is used for extracting features of the image to be identified to obtain a feature map of the image to be identified, wherein the image to be identified contains a target object;
The first recognition module is used for inputting the feature images into a trained key point recognition model and outputting position information of each key point of the target object and visibility information of each key point, wherein the key point recognition model is obtained by performing iterative training on at least one feature sample image, each key point thermal sample image obtained by converting the at least one feature sample image and each visibility thermal sample image obtained by converting the at least one feature sample image, and the visibility information characterizes whether the key point is visible or invisible;
And the second identification module is used for identifying the action category of the target object according to the position information of each key point and the visibility information of each key point.
Optionally, if the key point recognition model includes at least a location recognition network and a visibility recognition network, the feature map is input into a trained key point recognition model, and when the location information of each key point of the target object and the visibility information of each key point are output, the first recognition module is specifically configured to:
Carrying out regression processing on the feature map through a regression layer in the position identification network, determining a key point thermodynamic diagram of each key point, and determining the position information of each key point according to the key point thermodynamic diagram of each key point; and,
And carrying out regression processing on the feature map through a regression layer in the visibility identification network, determining the visibility thermodynamic diagram of each key point, and determining the visibility information of each key point according to the visibility thermodynamic diagram of each key point.
Optionally, when determining the location information of each key point according to the key point thermodynamic diagram of each key point, the first identification module is specifically configured to:
For each key point, acquiring the thermal value of each pixel point contained in the key point thermodynamic diagram corresponding to that key point, and selecting the pixel point with the largest thermal value as the key point contained in that key point thermodynamic diagram;
And obtaining the position information of each key point in the image to be identified.
Optionally, when determining the visibility information of each key point according to the visibility thermodynamic diagram of each key point, the first identification module is specifically configured to:
For each key point, acquiring the thermal value of each pixel point contained in the visibility thermodynamic diagram corresponding to that key point; if the visibility thermodynamic diagram contains at least one pixel point whose thermal value is greater than or equal to a preset thermal value threshold, determining the visibility information of the key point contained in that diagram as visible, and if the thermal value of every pixel point is smaller than the threshold, determining the visibility information of the key point as invisible.
Optionally, if the key point identification model includes at least a location identification network, a supervision network, and a visibility identification network, the method further includes a training module when training the key point identification model, where the training module is specifically configured to:
Training an initial position recognition network through the at least one characteristic sample graph, each key point thermal sample graph obtained after the at least one characteristic sample graph is converted and position information of key points contained in each key point thermal sample graph to obtain a trained position recognition network;
Fixing each parameter of the trained position identification network, training an initial supervision network, and adjusting each parameter of the initial supervision network to obtain a trained supervision network;
Fixing parameters of the trained position recognition network and the trained supervision network, training an initial visibility recognition network through the trained supervision network, the at least one characteristic sample graph, each visibility thermodynamic sample graph after the at least one characteristic sample graph is converted and the visibility information of key points contained in each visibility thermodynamic sample graph, and adjusting each parameter of the initial visibility recognition network to obtain the trained visibility recognition network;
And training the key point recognition model until the objective function of the key point recognition model converges, and obtaining the trained key point recognition model.
Optionally, before training the initial location identification network, the method further includes:
The first acquisition module is used for acquiring at least one characteristic sample graph, converting the at least one characteristic sample graph to obtain a thermal sample graph of each key point, and marking the position information of the key point contained in the thermal sample graph of each key point;
the second acquisition module is used for converting the at least one characteristic sample graph to obtain each visibility thermal sample graph and marking the visibility information of key points contained in each visibility thermal sample graph;
and the sample expansion module is used for carrying out sample expansion on the visibility thermal sample graphs so that the number of the visibility thermal sample graphs is the same as that of the key point thermal sample graphs.
Optionally, training the initial position recognition network through the at least one feature sample graph, each key point thermal sample graph obtained after the at least one feature sample graph is converted, and position information of key points included in each key point thermal sample graph, and when obtaining a trained position recognition network, the training module is specifically configured to:
And inputting any one characteristic sample graph into an initial position identification network for identification to obtain the predicted position information of each key point contained in the characteristic sample graph, and adjusting each parameter of the initial position identification network according to the error value between each predicted position information and the position information of the key point contained in the corresponding key point thermal sample graph until the error value is minimized to obtain the trained position identification network.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the image recognition method described above when the program is executed.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described image recognition method.
In the embodiment of the application, the feature extraction is carried out on the image to be identified, the feature image is obtained, the feature image is input into the trained key point identification model, the position information of each key point of the target object and the visibility information of each key point are output, and the action category of the target object is identified according to the position information of each key point and the visibility information of each key point.
Drawings
FIG. 1 is a flowchart of an image recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram showing the effect of an image to be identified according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing the effect of a key point thermodynamic diagram in an embodiment of the present application;
FIG. 4 is a schematic diagram of a location identification network according to an embodiment of the present application;
FIG. 5 is a training diagram of a supervisory network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a visibility recognition network in accordance with an embodiment of the present application;
FIG. 7 is a schematic diagram of a key point recognition model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an image recognition device according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The key point detection task is one of the classical tasks in the field of computer vision. The input of the task is an image of a target object, and the output is the position information of the skeletal key points of the target object contained in the image. Key point detection is the basis of many important computer vision tasks such as action classification and behavior recognition, and its application scenarios are wide.
With the continued development of deep learning, most key point detection algorithms use convolutional neural networks as their main component, and most existing solutions only attend to the spatial positions of key points without considering whether the key points are visible. In practice, because of occlusion by the environment, the shooting mode, the posture of the target object and the like, some key points in the image to be recognized are not visible at the visual level. If the position information of such key points is nevertheless predicted, a large ambiguity arises, and position information estimated without visual evidence seriously interferes with downstream tasks. How to predict the position information and the visibility information of the key points at the same time is therefore a problem to be solved.
In the related art, the position information and the visibility information of a key point are predicted at the same time by designing two independent neural network models, one predicting the position information of the key point and the other predicting its visibility information. Adding a second neural network model for prediction, however, increases resource occupancy.
In the embodiment of the application, feature extraction is performed on the image to be identified to obtain its feature map; the feature map is input into the trained key point identification model, which outputs the position information and the visibility information of each key point of the target object; and the action category of the target object is identified from that information. By adding a visibility branch to the key point identification model, a single model predicts the position information and the visibility information of the key points at the same time, which not only reduces resource occupancy but also better predicts the action category of the target object.
Based on the above embodiments, referring to fig. 1, a flowchart of an image recognition method in an embodiment of the present application specifically includes:
step 100: and extracting the characteristics of the image to be identified to obtain a characteristic diagram of the image to be identified.
The image to be identified contains a target object.
In the embodiment of the application, feature extraction is performed on the image to be identified that contains the target object, obtaining the feature map of the image to be identified.
For example, the target object may be a human body, and the key points may be human body key points: the left shoulder, left elbow, left hand, right elbow, left hip, left knee, right knee, and left foot, which is not limiting in the embodiment of the present application.
It should be noted that the image to be identified may be an image including only the target object. If the image to be identified is an image only containing the target object, the feature extraction can be directly performed on the image to be identified containing the target object, so as to obtain a feature map of the image to be identified.
The image to be identified may also contain other objects in addition to the target object. In that case the image to be identified needs to be detected first: the target object is marked in the image with a circumscribed rectangular frame, and the region containing only the target object is cropped out, so that the cropped image serves as the image to be identified from which features are finally extracted.
For example, referring to fig. 2, assume that the target object is a pedestrian and the application scenario is a camera filming the pedestrian crossing an intersection with a traffic light. The captured image then contains not only the pedestrian but also the traffic light, the road, and so on. To improve recognition accuracy, the pedestrian is detected in the image to be recognized, marked with a circumscribed rectangular frame, and cropped out, yielding an image to be recognized that contains only the pedestrian; human body key point recognition is then performed on that image.
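Cropping the detected pedestrian out of the full frame reduces, in numpy terms, to slicing the image array with the circumscribed rectangle. The box format (x_min, y_min, x_max, y_max) and the sizes below are assumptions for illustration:

```python
import numpy as np

def crop_to_box(image, box):
    """Crop the detected target object out of the full frame.
    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates (assumed format)."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for the captured RGB frame
pedestrian_box = (100, 50, 220, 400)             # hypothetical detector output

crop = crop_to_box(frame, pedestrian_box)
print(crop.shape)  # (350, 120, 3)
```

The cropped array is what would then be fed to feature extraction in place of the full frame.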
The image to be identified may be, for example, a Red Green Blue (RGB) image, which is not limited in the embodiment of the present application.
Step 110: and inputting the feature map into the trained key point identification model, and outputting the position information of each key point of the target object and the visibility information of each key point.
The key point identification model is obtained by iterative training according to at least one characteristic sample graph, each key point thermodynamic sample graph obtained by converting the at least one characteristic sample graph and each visibility thermodynamic sample graph obtained by converting the at least one characteristic sample graph, and the visibility information characterizes the key points as visible or invisible.
In the embodiment of the application, after the feature map of the image to be identified is obtained, the feature map is input into the trained key point identification model. The position identification network in the model converts the feature map into the key point thermodynamic diagram of each key point, and the position information of each key point is determined from those diagrams; meanwhile, the visibility identification network in the model converts the feature map into the visibility thermodynamic diagram of each key point, and the visibility information of each key point is determined from those diagrams.
The steps of obtaining the position information of each key point and obtaining the visibility information of each key point in the embodiment of the present application are described in detail below.
Firstly, the steps of obtaining the position information of each key point in the embodiment of the present application are described in detail, which specifically includes:
And carrying out regression processing on the feature map through a regression layer in the position identification network, determining a key point thermodynamic diagram of each key point, and determining the position information of each key point according to the key point thermodynamic diagram of each key point.
In the embodiment of the application, each key point to be identified is preset, and then coordinate regression processing is carried out on the feature map of the image to be identified through a coordinate regression layer in the position identification network, so as to determine the key point thermodynamic diagram corresponding to each key point.
The determining the position information of each key point according to the key point thermodynamic diagram of each key point specifically includes:
s1: and respectively aiming at each key point, acquiring the thermal value of each pixel point contained in the key point thermodynamic diagram corresponding to any key point, and selecting the pixel point corresponding to the thermal value with the largest value from the thermal values of each pixel point as the key point contained in the key point thermodynamic diagram.
In the embodiment of the application, each key point thermodynamic diagram is obtained by mapping each pixel point of the image to be identified to a thermal value, so every key point thermodynamic diagram contains the same pixel points, each with a corresponding thermal value. After the key point thermodynamic diagrams are obtained by regression, the thermal value of each pixel point in the diagram is obtained for each key point; the largest thermal value is selected, and the pixel point corresponding to it is taken as the key point in that diagram. Referring to fig. 3, an effect schematic diagram of a key point thermodynamic diagram in the embodiment of the application, the pixel point with the deepest color is the key point.
S2: obtain the position information of each key point in the image to be identified.
In the embodiment of the application, since the key point thermodynamic diagram is obtained through the image to be identified, the size of the key point thermodynamic diagram is the same as the size of the image to be identified. After the pixel points representing the key points in the thermodynamic diagram of each key point are obtained, the position information of each pixel point in the image to be identified is obtained, so that the position information of each key point in the image to be identified can be obtained.
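The two steps above amount to an arg-max over the heatmap: the hottest pixel is the key point, and because the heatmap has the same size as the image to be identified, that pixel's coordinates are already the key point's position in the image. A minimal pure-Python sketch (the function name and toy values are illustrative, not from the patent):

```python
def keypoint_from_heatmap(heatmap):
    """Step S1: scan every pixel's thermal value and return the position
    (row, col) of the pixel with the largest value, i.e. the key point.
    Since the heatmap is the same size as the input image, this is also
    the key point's position in the image to be identified."""
    best_val, best_pos = float("-inf"), None
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_val, best_pos = val, (r, c)
    return best_pos

heatmap = [[0.1, 0.2, 0.1],
           [0.3, 0.9, 0.2],   # 0.9 is the largest thermal value
           [0.1, 0.2, 0.1]]
print(keypoint_from_heatmap(heatmap))  # (1, 1)
```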
The step of obtaining the visibility information of each key point in the embodiment of the present application is described in detail below, and specifically includes:
And respectively aiming at each key point, acquiring the thermal value of each pixel point contained in the visibility thermodynamic diagram corresponding to any key point, if the visibility thermodynamic diagram contains at least one pixel point which is larger than or equal to a preset thermal value threshold, determining that the visibility information of the key point contained in the visibility thermodynamic diagram is visible, and if the thermal value of each pixel point is smaller than the thermal value threshold, determining that the visibility information of the key point is invisible.
In the embodiment of the application, firstly, each key point to be identified is preset, then, regression processing is carried out on the feature map of the image to be identified through a visibility regression layer in a visibility identification network, and the visibility thermodynamic diagram of each key point is determined.
Then, since each visibility thermodynamic diagram is obtained by mapping each pixel point in the image to be identified to a thermal value, each visibility thermodynamic diagram contains every pixel point, and each pixel point corresponds to one thermal value. After the visibility thermodynamic diagrams of the key points are obtained through regression, the thermal value of each pixel point contained in the visibility thermodynamic diagram of any key point is obtained for each key point in turn, and each thermal value is compared against a preset thermal value threshold. The following two situations can be distinguished:
first case: the thermodynamic value of each pixel point in the visibility thermodynamic diagram is smaller than a preset thermodynamic value threshold.
In the embodiment of the application, if the thermodynamic value of each pixel point in the visibility thermodynamic diagram is smaller than the preset thermodynamic value threshold, the visibility information of the key point corresponding to the visibility thermodynamic diagram is determined to be invisible.
For example, assuming that the preset thermal value threshold is 90, the thermal value of each pixel point included in the visibility thermodynamic diagram is obtained, and if it is determined that the thermal value of each pixel point is smaller than the preset thermal value threshold 90, it is determined that the visibility information of the key point corresponding to the visibility thermodynamic diagram is invisible.
Second case: the visibility thermodynamic diagram comprises at least one pixel point which is larger than or equal to a preset thermodynamic value threshold.
In the embodiment of the application, whether the thermal value of each pixel point contained in the visibility thermodynamic diagram is larger than or equal to a preset thermal value threshold is judged, and if at least one pixel point larger than or equal to the preset thermal value threshold is determined to be contained in each pixel point, the visibility information of the key point corresponding to the visibility thermodynamic diagram is determined to be visible.
For example, assuming that the preset thermal value threshold is 90, the thermal value of each pixel point included in the visibility thermodynamic diagram is obtained. If the thermal value of one of the pixel points is determined to be 96, which is greater than the preset thermal value threshold 90, while the thermal value of every other pixel point is less than 90, the visibility information of the key point corresponding to the visibility thermodynamic diagram is determined to be visible.
In the embodiment of the present application, as long as the thermal value of one pixel point is greater than or equal to the preset thermal value threshold, the visibility information of the key point corresponding to the visibility thermodynamic diagram is determined to be visible.
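This "at least one pixel above the threshold" rule can be sketched as follows; the helper name and toy heatmaps are illustrative, and 90 is simply the threshold value from the example above:

```python
def keypoint_visible(vis_heatmap, threshold=90):
    """A key point is visible iff at least one pixel of its visibility
    heatmap has a thermal value >= the preset threshold; it is invisible
    iff every pixel's thermal value is below the threshold."""
    return any(v >= threshold for row in vis_heatmap for v in row)

vis_a = [[50, 50], [96, 50]]   # one pixel reaches 96 >= 90  -> visible
vis_b = [[50, 50], [89, 50]]   # every pixel stays below 90  -> invisible
print(keypoint_visible(vis_a), keypoint_visible(vis_b))  # True False
```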
Before predicting the position information and the corresponding visibility information of the key points through the key point recognition model, the key point recognition model needs to be trained first, and the following details of the training mode of the key point recognition model in the embodiment of the application are described, which specifically includes:
S1: and training the initial position recognition network through at least one characteristic sample graph, each key point thermal sample graph obtained after the at least one characteristic sample graph is converted and the position information of the key points contained in each key point thermal sample graph to obtain a trained position recognition network.
In the embodiment of the present application, firstly, a training sample image set is obtained, and a detailed description is given below of a manner of obtaining the training sample image set in the embodiment of the present application, which specifically includes:
A1: and obtaining at least one characteristic sample graph, converting the at least one characteristic sample graph to obtain a thermal sample graph of each key point, and marking the position information of the key point contained in the thermal sample graph of each key point.
In the embodiment of the application, first, each sample image to be identified is obtained, and feature extraction is performed on each sample image to be identified, so as to obtain each corresponding feature sample image.
And then, obtaining each key point contained in each sample image to be identified and the position information of each key point in the sample image to be identified through Gaussian blur processing.
Finally, for any sample image to be identified, the feature sample image of that sample image is labeled according to the position information of each of its key points, so as to obtain each feature sample image and the corresponding position information, and the corresponding key point thermal sample image is determined according to the position information of each key point.
A2: and converting at least one characteristic sample graph to obtain each visibility thermal sample graph, and marking the visibility information of key points contained in each visibility thermal sample graph.
In the embodiment of the application, first, each sample image to be identified is obtained, and feature extraction is performed on each sample image to be identified, so as to obtain each corresponding feature sample image.
And then, each key point contained in each sample image to be identified is obtained through Gaussian blur processing, and the visibility information of each key point is marked. In this way, each feature sample graph and the visibility information of each corresponding key point are obtained, and the corresponding visibility thermal sample graph is determined according to the visibility information of each key point.
A3: sample expansion is performed on each visibility thermal sample graph so that the number of each visibility thermal sample graph is the same as the number of each key point thermal sample graph.
In the embodiment of the application, the number of the visibility information samples is small, so that the number of the visibility thermal sample images can be expanded to be the same as that of the thermal sample images of all the key points.
Further, when the target object is a human body, the action categories of the target object can be roughly divided into the following four types: upright without bending, upright with bending, non-upright, and occluded. Because the probability of occurrence and the acquisition difficulty of sample images to be identified differ greatly across these four categories, the training sample image set can suffer from sample imbalance; non-upright scenes in particular have fewer sample images to be identified. This imbalance causes the key point recognition model to perform poorly in non-upright scenes and reduces the accuracy of key point recognition there, so that the generalization of key point recognition is poor.
Therefore, in order to improve the accuracy of key point identification, the training samples are re-labeled before training, and the data labeled as non-upright is duplicated and appended to the training sample image set, so that the number of samples for each action category of the target object is the same. By balancing the samples of each action category offline in this way, the key point recognition model can be trained more thoroughly, and the accuracy of key point recognition is improved.
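The offline duplication described above is ordinary oversampling of the under-represented categories. A minimal sketch, assuming the training set is a list of (image, action-category) pairs; all names are illustrative:

```python
from collections import Counter

def rebalance(samples):
    """Offline oversampling: duplicate samples of under-represented action
    categories until every category matches the largest one, mirroring the
    copy-and-append strategy described above."""
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    balanced = list(samples)
    for label, n in counts.items():
        pool = [s for s in samples if s[1] == label]
        # cycle through the existing samples of this category to make copies
        balanced.extend(pool[i % n] for i in range(target - n))
    return balanced

samples = [("img%d" % i, "upright") for i in range(6)] + [("img6", "non-upright")]
counts = Counter(label for _, label in rebalance(samples))
print(counts["upright"], counts["non-upright"])  # 6 6
```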
The training sample image set may be, for example, the Microsoft Common Objects in Context dataset (MS COCO), the Max Planck Institute for Informatics dataset (MPII), the Artificial Intelligence Challenge dataset (AI Challenger), and the like, which is not limited in the embodiment of the present application.
It should be noted that the training of the key point recognition model can also be improved under unbalanced samples by setting different weights on the parameters of the model.
Then, after the training sample image set is obtained, training the initial position recognition network in the key point recognition model through the training sample image set to obtain a trained position recognition network.
The location recognition network may be, for example, a convolutional neural network (Convolutional Neural Networks, CNN); the CNN may be, for example, a residual network such as ResNet-18, or a visual geometry group network (Visual Geometry Group, VGG), which is not limited in the embodiment of the present application.
Further, after feature extraction is performed on the image to be identified to obtain the feature map, the feature map may be further enlarged by an upsampling layer. The upsampling layer in the embodiment of the application uses interpolation upsampling, which is well suited to running on a chip at identification time; of course, deconvolution may also be selected for a better effect, which is not limited in the embodiment of the present application.
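As a rough illustration of parameter-free interpolation upsampling (assuming nearest-neighbour mode, which the text does not fix), each feature value is simply repeated along both axes to enlarge the map:

```python
def upsample_nearest(feature_map, scale=2):
    """Nearest-neighbour interpolation upsampling of an H x W feature map:
    every value is repeated `scale` times along both axes. No learned
    parameters are involved, which is what makes it chip-friendly."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out

fm = [[1, 2],
      [3, 4]]
for row in upsample_nearest(fm):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```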
It should be noted that, when training the first stage of the key point recognition model, only training the position recognition network of the key point recognition model, so as to obtain a trained position recognition network, the following details the step of training the position recognition network in the embodiment of the present application specifically include:
For the at least one feature sample graph, input any one feature sample graph into the initial position identification network, identify and obtain the predicted position information of each key point contained in that feature sample graph, and adjust each parameter of the initial position identification network according to the error value between each piece of predicted position information and the position information of the key point contained in the corresponding key point thermal sample graph, until the error value is minimized, so as to obtain the trained position identification network.
For example, referring to fig. 4, which is a schematic structural diagram of a location recognition network in an embodiment of the present application, after each feature sample map is obtained, the following operations are performed for each feature sample map:
First, a convolution (conv) operation is performed on the feature sample map to obtain a first keypoint thermal sample map for each keypoint.
And then, combining the characteristic sample graph with each key point thermal sample graph to obtain a combined characteristic graph.
And then performing conv operation on the combined feature images to obtain a second key point thermal sample image.
And finally, performing conv operation on the second key point thermal sample graph to obtain the predicted position information of the finally identified key point.
And then, respectively adjusting each parameter of the initial position recognition network according to the error value between each predicted position information and the position information of the key point contained in the corresponding key point thermal sample graph until the error value is minimized, so as to obtain the trained position recognition network.
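The error-driven parameter adjustment in this step can be sketched abstractly with a single scalar parameter standing in for the whole position recognition network; everything below (the loss, the data, the learning rate) is a toy illustration, not the patent's actual architecture or objective:

```python
# A single parameter w plays the role of the network: predictions are
# compared with the labelled positions, and w is nudged against the
# gradient of the mean-squared error until the error is minimised.
xs = [1.0, 2.0, 3.0, 4.0]       # stand-ins for feature sample maps
ys = [2.0, 4.0, 6.0, 8.0]       # labelled positions (true relation: y = 2x)

w = 0.0                          # initial network parameter
for _ in range(200):
    # gradient of mean((w*x - y)^2) with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.05 * grad             # adjust the parameter against the error

print(round(w, 4))  # converges to 2.0
```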
S2: fixing each parameter of the position recognition network after the training is completed, training the initial supervision network, and adjusting each parameter of the initial supervision network to obtain the supervision network after the training is completed.
In the embodiment of the application, each parameter of the trained position identification network is fixed, the initial supervision network is trained through each characteristic sample image in the training sample image set and the visibility information of each corresponding key point, and the parameters of the initial supervision network are adjusted to obtain the trained supervision network.
The supervision network is used for supervising the visibility recognition network during training, and better compares the visibility information of each key point with the corresponding predicted visibility information.
For example, referring to fig. 5, which is a training diagram of a supervisory network according to an embodiment of the present application: the conv parameters in the location identification network are copied to the corresponding vis branches and fixed, and only the fully connected (fc) network is trained; this stage requires only one epoch of training.
S3: fixing parameters of a trained position recognition network and a trained supervision network, training an initial visibility recognition network through the trained supervision network, at least one characteristic sample graph, each visibility thermodynamic sample graph after converting the at least one characteristic sample graph and the visibility information of key points contained in each visibility thermodynamic sample graph, and adjusting each parameter of the initial visibility recognition network to obtain the trained visibility recognition network.
In the embodiment of the application, after the supervision network is trained, the parameters of the trained position identification network are fixed, and the parameters of the trained supervision network are fixed, so that the visibility identification network is trained. And then training the initial visibility recognition network through the trained supervision network, the feature sample graphs in the training sample image set and the visibility information of the key points, adjusting the parameters of the initial visibility recognition network, and finally obtaining the trained visibility recognition network.
In order to ensure the accuracy of the visibility recognition network, the feature sample images and the visibility information of each key point in the training sample image set are iterated over four times when training the visibility recognition network.
For example, referring to fig. 6, which is a schematic diagram of a visibility recognition network in an embodiment of the present application: the feature sample graph is processed through a conv layer to obtain each first visibility thermal sample graph (vis_heatmap), a conv layer convolves this result to obtain a second visibility thermal sample graph (heat2_vis), another conv layer convolves heat2_vis to obtain the final vis_heatmap, and finally an average pooling layer (avgpool) produces the finally recognized visibility information.
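The final avgpool step can be sketched as global average pooling over the visibility heatmap followed by a threshold; the 0.5 threshold and the function name are assumptions for illustration, not values from the patent:

```python
def avgpool_visibility(vis_heatmap, threshold=0.5):
    """Global average pooling (the avgpool step) collapses the H x W
    visibility heatmap to one score, which is then thresholded into
    visible / invisible."""
    n = sum(len(row) for row in vis_heatmap)
    score = sum(v for row in vis_heatmap for v in row) / n
    return "visible" if score >= threshold else "invisible"

strong = [[0.8] * 4 for _ in range(4)]   # strong response everywhere
weak = [[0.1] * 4 for _ in range(4)]     # weak response everywhere
print(avgpool_visibility(strong), avgpool_visibility(weak))  # visible invisible
```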
In addition, recognizing the visibility information of the key points is a simple task; if the visibility were predicted directly and simultaneously with the position information of the key points, the accuracy of recognizing the position information of each key point would be reduced considerably. For this reason, the position recognition network and the visibility recognition network are trained separately in the embodiment of the application.
S4: and training the key point recognition model until the objective function of the key point recognition model converges, and obtaining the trained key point recognition model.
In the embodiment of the application, the key point recognition model is trained through the training sample image set, all parameters of the key point recognition model are trained end to end, and the learning rate is reduced appropriately until the objective function of the key point recognition model converges, so as to obtain the trained key point recognition model.
Further, after the trained key point identification model is obtained, it may be used as a baseline model during neural architecture search (Neural Architecture Search, NAS). Because the speed and the precision of the trained key point identification model already reach the speed and the precision expected by the user, performing a NAS search based on this baseline model can further improve the speed and the precision of the key point identification model on top of an already fast and accurate baseline, so as to obtain a better network structure.
Furthermore, it should be noted that, in the embodiment of the present application, when feature extraction is performed on an image to be identified, feature extraction may also be implemented through a key point identification model.
Step 120: and identifying the action category of the target object according to the position information of each key point and the visibility information of each key point.
In the embodiment of the application, after the position information and the corresponding visibility information of each target object are determined, the action category of the target object contained in the image to be identified is identified according to the position information and the corresponding visibility information of each key point.
For example, in the prior art, whether a human hand is on the front side or the back side of the body cannot be judged from the position information of the key points alone.
In the embodiment of the application, the characteristic extraction is carried out on the image to be identified, the characteristic diagram of the image to be identified is obtained, the characteristic diagram is input into the trained key point identification model, the position information of each key point of the target object and the visibility information of each key point are output, and the action category of the target object is identified according to the position information of each key point and the visibility information of each key point, so that the resource occupancy rate during the identification of the position information and the visibility information of the key point can be reduced. In addition, the accuracy of the key point recognition model in predicting the visibility information can be improved by training the visibility information through more samples.
In addition, in the related art, when obtaining the position information and the visibility information of key points, two models are generally built separately: one model predicts the position information of the key points, and the other predicts their visibility information. In that approach, after the key points and the position information of each key point are obtained through the first model, the visibility information of the key points identified by the first model is predicted through the second model. The two models are thus used in series, that is, the input of the second model is the output of the first model, which reduces the efficiency of key point identification.
Based on the above embodiments, referring to fig. 7, a schematic structural diagram of a key point recognition model according to an embodiment of the present application specifically includes:
1. And extracting the characteristics of the image to be identified containing the target object through a ResNet network to obtain the characteristic diagram of the image to be identified.
2. The feature map is enlarged to a preset size by Deconv layers.
3. And carrying out convolution operation on the amplified characteristic diagram through a conv layer.
4. The feature map feat is obtained.
5. Position identification network:
(1) And carrying out convolution operation on feat through a conv layer to obtain each first key point thermodynamic diagram.
(2) And carrying out combination processing on feat and each first key point thermodynamic diagram to obtain a first combined image.
(3) And carrying out convolution operation on the first combined image through the conv layer to obtain feat1.
(4) And carrying out convolution operation on feat1 through the conv layer to obtain a key point thermodynamic diagram of each key point.
(5) And determining the position information of each key point of the target object through each key point thermodynamic diagram.
6. Visibility identification network:
(1) A convolution operation is performed on feat through the conv layer to obtain each first visibility thermodynamic diagram.
(2) And carrying out combination processing on feat and each first visibility thermodynamic diagram to obtain a second combined image.
(3) And carrying out convolution operation on the second combined image through the conv layer to obtain feat2_vis.
(4) And carrying out convolution operation on feat2_vis through the conv layer to obtain vis_heatmap.
(5) The visibility information of each key point is obtained through avgpool.
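The overall data flow of steps 1-6 above can be walked through with every conv layer and the combination step replaced by stand-in operations; the sketch below shows only the wiring of the two branches, not the learned behavior, and all numeric values are toys:

```python
# Structural walk-through of the model in fig. 7. conv() and combine() are
# placeholders for learned layers, so only the data flow is meaningful.
def conv(x):                 # stand-in for a learned conv layer
    return [[v * 0.5 for v in row] for row in x]

def combine(a, b):           # stand-in for the combination step
    return [[(u + v) / 2 for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

feat = [[1.0, 0.0],
        [0.0, 0.0]]                          # 4. shared feature map (feat)

# 5. position identification branch
first_kp = conv(feat)                        # (1) first key point heatmaps
feat1 = conv(combine(feat, first_kp))        # (2)+(3) combine, then conv -> feat1
kp_heatmap = conv(feat1)                     # (4) key point thermodynamic diagram
flat = [(v, r, c) for r, row in enumerate(kp_heatmap) for c, v in enumerate(row)]
_, row, col = max(flat)                      # (5) hottest pixel = key point

# 6. visibility identification branch
first_vis = conv(feat)                       # (1) first visibility heatmaps
feat2_vis = conv(combine(feat, first_vis))   # (2)+(3) -> feat2_vis
vis_heatmap = conv(feat2_vis)                # (4) vis_heatmap
avg = sum(sum(r) for r in vis_heatmap) / 4   # (5) avgpool
visible = avg >= 0.01                        # toy threshold on the pooled score

print((row, col), visible)
```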
Based on the same inventive concept, the embodiment of the application provides an image recognition device, which can be a hardware structure, a software module or a hardware structure plus a software module. Based on the above embodiments, referring to fig. 8, a schematic structural diagram of an image recognition device according to an embodiment of the present application specifically includes:
The feature extraction module 800 is configured to perform feature extraction on an image to be identified, so as to obtain a feature map of the image to be identified, where the image to be identified includes a target object;
The first recognition module 810 is configured to input the feature map into a trained keypoint recognition model, and output location information of each keypoint of the target object and visibility information of each keypoint, where the keypoint recognition model is obtained by performing iterative training on at least one feature sample map, each keypoint thermal sample map obtained by converting the at least one feature sample map, and each visibility thermal sample map obtained by converting the at least one feature sample map, and the visibility information characterizes whether a keypoint is visible or invisible;
and a second identifying module 820, configured to identify an action category of the target object according to the location information of each key point and the visibility information of each key point.
Optionally, if the keypoint identification model includes at least a location identification network and a visibility identification network, the feature map is input into the trained keypoint identification model, and when outputting the location information of each keypoint of the target object and the visibility information of each keypoint, the first identification module 810 is specifically configured to:
Carrying out regression processing on the feature map through a regression layer in the position identification network, determining a key point thermodynamic diagram of each key point, and determining the position information of each key point according to the key point thermodynamic diagram of each key point; and,
And carrying out regression processing on the feature map through a regression layer in the visibility identification network, determining the visibility thermodynamic diagram of each key point, and determining the visibility information of each key point according to the visibility thermodynamic diagram of each key point.
Optionally, when determining the location information of each key point according to the key point thermodynamic diagram of each key point, the first identifying module 810 is specifically configured to:
Respectively aiming at the key points, acquiring the thermal value of each pixel point contained in the key point thermodynamic diagram corresponding to any key point, and selecting the pixel point corresponding to the thermal value with the largest value from the thermal values of the pixel points as the key point contained in the key point thermodynamic diagram;
And obtaining the position information of each key point in the image to be identified.
Optionally, when determining the visibility information of each key point according to the visibility thermodynamic diagram of each key point, the first identifying module 810 is specifically configured to:
And respectively aiming at the key points, acquiring the thermal value of each pixel point contained in the visibility thermodynamic diagram corresponding to any key point, if at least one pixel point which is larger than or equal to a preset thermal value threshold is contained in the visibility thermodynamic diagram, determining the visibility information of the key point contained in the visibility thermodynamic diagram as visible, and if the thermal value of each pixel point is smaller than the thermal value threshold, determining the visibility information of the key point as invisible.
Optionally, if the key point identification model includes at least a location identification network, a supervision network, and a visibility identification network, when training the key point identification model, the training module 830 is specifically configured to:
Training an initial position recognition network through the at least one characteristic sample graph, each key point thermal sample graph obtained after the at least one characteristic sample graph is converted and position information of key points contained in each key point thermal sample graph to obtain a trained position recognition network;
Fixing each parameter of the trained position identification network, training an initial supervision network, and adjusting each parameter of the initial supervision network to obtain a trained supervision network;
Fixing parameters of the trained position recognition network and the trained supervision network, training an initial visibility recognition network through the trained supervision network, the at least one characteristic sample graph, each visibility thermodynamic sample graph after the at least one characteristic sample graph is converted and the visibility information of key points contained in each visibility thermodynamic sample graph, and adjusting each parameter of the initial visibility recognition network to obtain the trained visibility recognition network;
And training the key point recognition model until the objective function of the key point recognition model converges, and obtaining the trained key point recognition model.
Optionally, before training the initial location identification network, the method further includes:
The first obtaining module 840 is configured to obtain at least one feature sample graph, convert the at least one feature sample graph, obtain a thermal sample graph of each key point, and mark position information of the key point included in the thermal sample graph of each key point;
a second obtaining module 850, configured to convert the at least one feature sample graph to obtain each visibility thermal sample graph, and mark the visibility information of the key points included in each visibility thermal sample graph;
The sample expansion module 860 is configured to perform sample expansion on each of the visibility thermal sample graphs, so that the number of each of the visibility thermal sample graphs is the same as the number of each of the key point thermal sample graphs.
Optionally, training the initial position recognition network through the at least one feature sample graph, each key point thermal sample graph obtained after the at least one feature sample graph is converted, and position information of key points included in each key point thermal sample graph, so as to obtain a trained position recognition network, where the training module 830 is specifically configured to:
And inputting any one characteristic sample graph into an initial position identification network for identification to obtain the predicted position information of each key point contained in the characteristic sample graph, and adjusting each parameter of the initial position identification network according to the error value between each predicted position information and the position information of the key point contained in the corresponding key point thermal sample graph until the error value is minimized to obtain the trained position identification network.
Based on the above embodiments, referring to fig. 9, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown.
Embodiments of the present application provide an electronic device that may include a processor 910 (Central Processing Unit, CPU), a memory 920, an input device 930, an output device 940, and the like, where the input device 930 may include a keyboard, a mouse, a touch screen, and the like, and the output device 940 may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD), a cathode ray tube (Cathode Ray Tube, CRT), and the like.
Memory 920 may include Read Only Memory (ROM) and Random Access Memory (RAM) and provides processor 910 with program instructions and data stored in memory 920. In the embodiment of the present application, the memory 920 may be used to store a program of any one of the image recognition methods in the embodiment of the present application.
The processor 910 is configured to execute any one of the image recognition methods according to the embodiments of the present application according to the obtained program instructions by calling the program instructions stored in the memory 920.
Based on the above embodiments, in the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image recognition method in any of the above method embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. An image recognition method, comprising:
Extracting features of an image to be identified to obtain a feature map of the image to be identified, wherein the image to be identified contains a target object;
Inputting the feature map into a trained key point recognition model, and outputting position information of each key point of the target object and visibility information of each key point, wherein the key point recognition model is obtained by iterative training on at least one feature sample map, key point heatmap samples converted from the at least one feature sample map, and visibility heatmap samples converted from the at least one feature sample map, and the visibility information characterizes whether the key point is visible or invisible;
identifying the action category of the target object according to the position information of each key point and the visibility information of each key point;
If the key point recognition model comprises at least a position recognition network and a visibility recognition network, inputting the feature map into the trained key point recognition model and outputting the position information of each key point of the target object and the visibility information of each key point specifically comprises:
performing regression processing on the feature map through a regression layer in the position recognition network to determine a key point heatmap for each key point, and determining the position information of each key point according to the key point heatmap of each key point; and,
performing regression processing on the feature map through a regression layer in the visibility recognition network to determine a visibility heatmap for each key point, and determining the visibility information of each key point according to the visibility heatmap of each key point;
determining the position information of each key point according to the key point heatmap of each key point specifically comprises:
for each key point, acquiring the heat value of each pixel contained in the key point heatmap corresponding to that key point, and selecting the pixel with the largest heat value as the key point contained in that key point heatmap;
acquiring the position information of each key point in the image to be identified;
determining the visibility information of each key point according to the visibility heatmap of each key point specifically comprises:
for each key point, acquiring the heat value of each pixel contained in the visibility heatmap corresponding to that key point; if the visibility heatmap contains at least one pixel whose heat value is greater than or equal to a preset heat value threshold, determining the visibility information of the key point contained in that visibility heatmap as visible, and if the heat value of every pixel is smaller than the threshold, determining the visibility information of that key point as invisible.
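The heatmap decoding described in claim 1 can be sketched in a few lines of numpy: take the pixel with the largest heat value as the key point position, and mark the key point visible if any pixel of its visibility heatmap reaches the threshold. This is an illustrative sketch only; the array shapes, the threshold value 0.5, and the function name `decode_keypoints` are assumptions and are not part of the claimed method.

```python
import numpy as np

def decode_keypoints(kp_heatmaps, vis_heatmaps, vis_threshold=0.5):
    """Decode per-key-point position and visibility from heatmaps.

    kp_heatmaps, vis_heatmaps: arrays of shape (K, H, W), one map per key point.
    """
    positions, visibility = [], []
    for kp_map, vis_map in zip(kp_heatmaps, vis_heatmaps):
        # The pixel with the maximum heat value is taken as the key point location.
        y, x = np.unravel_index(np.argmax(kp_map), kp_map.shape)
        positions.append((int(x), int(y)))
        # Visible if at least one pixel meets or exceeds the preset heat threshold.
        visibility.append(bool((vis_map >= vis_threshold).any()))
    return positions, visibility
```

The two decisions are deliberately independent: the position is always an argmax over the key point heatmap, while visibility is a pure thresholding decision over the separate visibility heatmap, matching the two branches of the claim.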
2. The method of claim 1, wherein if the key point recognition model comprises at least a position recognition network, a supervision network, and a visibility recognition network, the key point recognition model is trained in the following manner:
training an initial position recognition network through the at least one feature sample map, the key point heatmap samples converted from the at least one feature sample map, and the position information of the key points contained in each key point heatmap sample, to obtain a trained position recognition network;
fixing the parameters of the trained position recognition network, training an initial supervision network, and adjusting the parameters of the initial supervision network to obtain a trained supervision network;
fixing the parameters of the trained position recognition network and the trained supervision network, training an initial visibility recognition network through the trained supervision network, the at least one feature sample map, the visibility heatmap samples converted from the at least one feature sample map, and the visibility information of the key points contained in each visibility heatmap sample, and adjusting the parameters of the initial visibility recognition network to obtain a trained visibility recognition network;
And training the key point recognition model until the objective function of the key point recognition model converges, and obtaining the trained key point recognition model.
3. The method of claim 2, further comprising, prior to training the initial position recognition network:
acquiring the at least one feature sample map, converting the at least one feature sample map to obtain the key point heatmap samples, and annotating the position information of the key point contained in each key point heatmap sample;
converting the at least one feature sample map to obtain the visibility heatmap samples, and annotating the visibility information of the key points contained in each visibility heatmap sample;
and performing sample expansion on the visibility heatmap samples so that the number of visibility heatmap samples is the same as the number of key point heatmap samples.
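A common way to realise the conversion in claim 3 is to render each annotated key point as a 2-D Gaussian centred on its position, and the sample-expansion step can then duplicate visibility heatmap samples until their count matches the key point heatmap samples. The sketch below is one such construction under stated assumptions — the Gaussian form, the value of `sigma`, and both function names are illustrative choices, not fixed by the patent.

```python
import numpy as np

def make_keypoint_heatmap(h, w, x, y, sigma=2.0):
    """Render one key point heatmap sample: a 2-D Gaussian whose peak (value
    1.0) sits on the annotated key point position (x, y)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def expand_visibility_samples(vis_maps, target_count):
    """Duplicate visibility heatmap samples (round-robin copies) until their
    number equals the number of key point heatmap samples, per claim 3."""
    expanded = list(vis_maps)
    while len(expanded) < target_count:
        expanded.append(expanded[len(expanded) % len(vis_maps)].copy())
    return expanded
```

With this construction the argmax of a generated heatmap recovers exactly the annotated position, which is what makes the argmax-based decoding of claim 1 consistent with the training targets.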
4. The method of claim 3, wherein training the initial position recognition network through the at least one feature sample map, the key point heatmap samples converted from the at least one feature sample map, and the position information of the key points contained in each key point heatmap sample, to obtain a trained position recognition network, specifically comprises:
inputting any one feature sample map into the initial position recognition network for recognition to obtain the predicted position information of each key point contained in that feature sample map, and adjusting the parameters of the initial position recognition network according to the error value between each piece of predicted position information and the position information of the key point contained in the corresponding key point heatmap sample, until the error value is minimized, to obtain the trained position recognition network.
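The staged schedule of claims 2–4 — train the position recognition network on its targets, freeze its parameters, fit the supervision network, then train the visibility recognition network with everything else frozen — can be illustrated with toy linear "networks" fitted by gradient descent. Everything here is an assumption for illustration (the data, the `LinearNet` class, the learning rate, the targets); the real networks are heatmap-regression networks, not linear models.

```python
import numpy as np

class LinearNet:
    """A toy stand-in for one sub-network of the key point recognition model."""
    def __init__(self, dim, rng):
        self.w = rng.normal(size=dim)
        self.frozen = False  # fixed parameters are not updated

    def predict(self, x):
        return x @ self.w

    def step(self, x, target, lr=0.1):
        if self.frozen:
            return  # claims 2 and 4: parameters of trained networks stay fixed
        # Gradient of the mean squared error between prediction and target.
        grad = 2 * x.T @ (self.predict(x) - target) / len(x)
        self.w -= lr * grad

def train(net, x, target, steps=200):
    for _ in range(steps):
        net.step(x, target)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))

# Stage 1: train the initial position recognition network until the error
# between predicted and annotated positions is minimized.
pos_net = LinearNet(3, rng)
train(pos_net, x, x @ np.array([1.0, -2.0, 0.5]))

# Stage 2: freeze it, then train the supervision network against its outputs.
pos_net.frozen = True
sup_net = LinearNet(3, rng)
train(sup_net, x, pos_net.predict(x))

# Stage 3: freeze both, then train the visibility recognition network.
sup_net.frozen = True
vis_net = LinearNet(3, rng)
train(vis_net, x, x @ np.array([0.5, 0.5, 0.0]))
```

The `frozen` flag is the toy analogue of "fixing each parameter of the trained network": once set, further training steps leave that sub-network's weights untouched while later stages train against its outputs.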
5. An image recognition apparatus, comprising:
The feature extraction module is used for extracting features of the image to be identified to obtain a feature map of the image to be identified, wherein the image to be identified contains a target object; the first recognition module is used for inputting the feature map into a trained key point recognition model and outputting the position information of each key point of the target object and the visibility information of each key point, wherein the key point recognition model is obtained by iterative training on at least one feature sample map, the key point heatmap samples converted from the at least one feature sample map, and the visibility heatmap samples converted from the at least one feature sample map, and the visibility information characterizes whether the key point is visible or invisible;
The second identification module is used for identifying the action category of the target object according to the position information of each key point and the visibility information of each key point;
If the key point recognition model comprises at least a position recognition network and a visibility recognition network, when inputting the feature map into the trained key point recognition model and outputting the position information of each key point of the target object and the visibility information of each key point, the first recognition module is specifically configured to:
perform regression processing on the feature map through a regression layer in the position recognition network to determine a key point heatmap for each key point, and determine the position information of each key point according to the key point heatmap of each key point; and,
perform regression processing on the feature map through a regression layer in the visibility recognition network to determine a visibility heatmap for each key point, and determine the visibility information of each key point according to the visibility heatmap of each key point;
when determining the position information of each key point according to the key point heatmap of each key point, the first recognition module is specifically configured to:
for each key point, acquire the heat value of each pixel contained in the key point heatmap corresponding to that key point, and select the pixel with the largest heat value as the key point contained in that key point heatmap;
acquire the position information of each key point in the image to be identified;
when determining the visibility information of each key point according to the visibility heatmap of each key point, the first recognition module is specifically configured to:
for each key point, acquire the heat value of each pixel contained in the visibility heatmap corresponding to that key point; if the visibility heatmap contains at least one pixel whose heat value is greater than or equal to a preset heat value threshold, determine the visibility information of the key point contained in that visibility heatmap as visible, and if the heat value of every pixel is smaller than the threshold, determine the visibility information of that key point as invisible.
6. The apparatus of claim 5, wherein if the key point recognition model comprises at least a position recognition network, a supervision network, and a visibility recognition network, the apparatus further comprises a training module specifically configured to:
train an initial position recognition network through the at least one feature sample map, the key point heatmap samples converted from the at least one feature sample map, and the position information of the key points contained in each key point heatmap sample, to obtain a trained position recognition network;
fix the parameters of the trained position recognition network, train an initial supervision network, and adjust the parameters of the initial supervision network to obtain a trained supervision network;
fix the parameters of the trained position recognition network and the trained supervision network, train an initial visibility recognition network through the trained supervision network, the at least one feature sample map, the visibility heatmap samples converted from the at least one feature sample map, and the visibility information of the key points contained in each visibility heatmap sample, and adjust the parameters of the initial visibility recognition network to obtain a trained visibility recognition network;
And training the key point recognition model until the objective function of the key point recognition model converges, and obtaining the trained key point recognition model.
7. The apparatus of claim 6, wherein prior to training the initial position recognition network, the apparatus further comprises:
a first acquisition module, used for acquiring the at least one feature sample map, converting the at least one feature sample map to obtain the key point heatmap samples, and annotating the position information of the key point contained in each key point heatmap sample;
a second acquisition module, used for converting the at least one feature sample map to obtain the visibility heatmap samples and annotating the visibility information of the key points contained in each visibility heatmap sample;
and a sample expansion module, used for performing sample expansion on the visibility heatmap samples so that the number of visibility heatmap samples is the same as the number of key point heatmap samples.
8. The apparatus of claim 7, wherein when training the initial position recognition network through the at least one feature sample map, the key point heatmap samples converted from the at least one feature sample map, and the position information of the key points contained in each key point heatmap sample, to obtain a trained position recognition network, the training module is specifically configured to:
input any one feature sample map into the initial position recognition network for recognition to obtain the predicted position information of each key point contained in that feature sample map, and adjust the parameters of the initial position recognition network according to the error value between each piece of predicted position information and the position information of the key point contained in the corresponding key point heatmap sample, until the error value is minimized, to obtain the trained position recognition network.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1-4 when the program is executed.
10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program implementing the steps of the method of any of claims 1-4 when executed by a processor.
CN202110122934.2A 2021-01-29 2021-01-29 Image recognition method and device Active CN112861678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110122934.2A CN112861678B (en) 2021-01-29 2021-01-29 Image recognition method and device


Publications (2)

Publication Number Publication Date
CN112861678A CN112861678A (en) 2021-05-28
CN112861678B true CN112861678B (en) 2024-04-19

Family

ID=75987858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110122934.2A Active CN112861678B (en) 2021-01-29 2021-01-29 Image recognition method and device

Country Status (1)

Country Link
CN (1) CN112861678B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537238B (en) * 2021-07-05 2022-08-05 上海闪马智能科技有限公司 Information processing method and image recognition device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508681A (en) * 2018-11-20 2019-03-22 北京京东尚科信息技术有限公司 The method and apparatus for generating human body critical point detection model
CN110348335A (en) * 2019-06-25 2019-10-18 平安科技(深圳)有限公司 Method, apparatus, terminal device and the storage medium of Activity recognition
CN110443144A (en) * 2019-07-09 2019-11-12 天津中科智能识别产业技术研究院有限公司 A kind of human body image key point Attitude estimation method
CN110532984A (en) * 2019-09-02 2019-12-03 北京旷视科技有限公司 Critical point detection method, gesture identification method, apparatus and system
CN111027504A (en) * 2019-12-18 2020-04-17 上海眼控科技股份有限公司 Face key point detection method, device, equipment and storage medium
WO2020088433A1 (en) * 2018-10-30 2020-05-07 腾讯科技(深圳)有限公司 Method and apparatus for recognizing postures of multiple persons, electronic device, and storage medium
CN112232194A (en) * 2020-10-15 2021-01-15 广州云从凯风科技有限公司 Single-target human body key point detection method, system, equipment and medium


Also Published As

Publication number Publication date
CN112861678A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
CN108805016B (en) Head and shoulder area detection method and device
CN109902548B (en) Object attribute identification method and device, computing equipment and system
CN111814902A (en) Target detection model training method, target identification method, device and medium
CN109800682B (en) Driver attribute identification method and related product
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN112336342B (en) Hand key point detection method and device and terminal equipment
CN109583364A (en) Image-recognizing method and equipment
CN112989947A (en) Method and device for estimating three-dimensional coordinates of human body key points
CN109271848A (en) A kind of method for detecting human face and human face detection device, storage medium
CN115223239A (en) Gesture recognition method and system, computer equipment and readable storage medium
CN112861678B (en) Image recognition method and device
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN113963333A (en) Traffic sign board detection method based on improved YOLOF model
CN112926595A (en) Training device for deep learning neural network model, target detection system and method
CN111178200A (en) Identification method of instrument panel indicator lamp and computing equipment
CN116091784A (en) Target tracking method, device and storage medium
Fan et al. Covered vehicle detection in autonomous driving based on faster rcnn
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
KR101972095B1 (en) Method and Apparatus of adding artificial object for improving performance in detecting object
CN112560853A (en) Image processing method, device and storage medium
CN112862002A (en) Training method of multi-scale target detection model, target detection method and device
CN114118303B (en) Face key point detection method and device based on prior constraint
CN113362372B (en) Single target tracking method and computer readable medium
CN113470001B (en) Target searching method for infrared image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant