CN115035367A - Picture identification method and device and electronic equipment
- Publication number
- CN115035367A CN115035367A CN202210725264.8A CN202210725264A CN115035367A CN 115035367 A CN115035367 A CN 115035367A CN 202210725264 A CN202210725264 A CN 202210725264A CN 115035367 A CN115035367 A CN 115035367A
- Authority
- CN
- China
- Prior art keywords
- picture
- target object
- target
- training
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
An embodiment of the present application provides a picture identification method, a picture identification apparatus, and an electronic device. The picture identification method comprises the following steps: acquiring a target picture to be identified, wherein the target picture comprises a target object; inputting the target picture into a pre-trained target object detection model for detection processing, and outputting position information of the target object in the target picture; inputting the target picture into a pre-trained scene recognition model for scene recognition processing, and outputting a scene recognition result of the scene where the target object in the target picture is located; inputting the position information and the target picture into a pre-trained picture recognition model for attribute recognition processing, and outputting an attribute recognition result of the target object in the target picture; and determining the category of the target picture according to the scene recognition result and the attribute recognition result.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a picture identification method and apparatus, and an electronic device.
Background
With the advent of the network information age, massive amounts of information data are generated and propagated over networks, and some abnormal pictures are mixed into this mass of data, so abnormal pictures need to be identified and filtered out.
In some scenarios, a multi-classification picture identification model is used to identify abnormal pictures: the model divides picture data into abnormal and normal categories and identifies pictures directly. However, as the technology matures, the boundary between abnormal and normal pictures becomes increasingly blurred and the classification standards are frequently updated, so the multi-classification picture identification model cannot flexibly cope with different image scenes and its flexibility is poor.
Disclosure of Invention
The embodiments of the present application aim to provide a picture identification method, a picture identification apparatus, and an electronic device, so as to solve the problem of the poor flexibility of picture identification models.
In a first aspect, an embodiment of the present application provides an image identification method, including: acquiring a target picture to be identified, wherein the target picture comprises a target object; inputting the target picture into a pre-trained target object detection model for detection processing, and outputting the position information of the target object in the target picture; inputting the target picture into a pre-trained scene recognition model for scene recognition processing, and outputting a scene recognition result of a scene where a target object in the target picture is located; inputting the position information and the target picture into a pre-trained picture recognition model for attribute recognition processing, and outputting an attribute recognition result of a target object in the target picture; and determining the category of the target picture according to the scene recognition result and the attribute recognition result.
In a second aspect, an embodiment of the present application provides a method for training a target object detection model, including:
acquiring a plurality of first sample pictures; marking positions of target objects in the plurality of first sample pictures in the corresponding first sample pictures to obtain first labels of the target objects in the sample pictures to form a first training sample set, wherein each first training sample in the first training sample set comprises a first sample picture carrying the first label, and the first label is used for representing real position information marked by the target objects in the corresponding first sample pictures; and sequentially inputting the first training samples in the first training sample set into an initial target object detection model for iterative training until a first loss function of the initial target object detection model is converged, so as to obtain the trained target object detection model, wherein the first loss function is used for representing an error between a predicted value of position information of a target object output by the initial target object detection model in a corresponding first sample picture and the real position information.
In a third aspect, an embodiment of the present application provides a training method for an image recognition model, including: acquiring a plurality of second sample pictures and position information of a target object in each second sample picture;
determining second labels of the target objects in the second sample pictures to form a second training sample set, wherein the second labels are used for representing real attribute information of the target objects, and each second training sample in the second training sample set comprises a second sample picture carrying the second labels and position information of the target objects in the second sample picture; and sequentially inputting second training samples in the second training sample set into a picture recognition model to be trained for iterative training, and obtaining the trained picture recognition model under the condition that a second loss function of the picture recognition model is converged, wherein the second loss function is used for expressing an error between a predicted value of an attribute recognition result of a target object in the second sample picture output by the picture recognition model and the real attribute information.
In a fourth aspect, an embodiment of the present application provides a training method for a scene recognition model, where the method includes: obtaining a plurality of third sample pictures; marking environmental information in the third sample pictures to obtain a third training sample set with environmental information labels, wherein each third training sample in the third training sample set comprises a third sample picture carrying the environmental label information; and sequentially inputting third training samples in the third training sample set into a scene recognition model to be trained for iterative training, and obtaining the trained scene recognition model under the condition that a loss function of the scene recognition model is converged, wherein the loss function of the scene recognition model is used for expressing an error between a predicted value of a scene recognition result of the third sample picture and a scene real value of the third sample picture.
In a fifth aspect, an embodiment of the present application provides an image recognition apparatus, including: the device comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a target picture to be recognized, and the target picture comprises a target object; the processing module is used for inputting the target picture into a pre-trained target object detection model for detection processing and outputting the position information of the target object in the target picture; the processing module is used for inputting the target picture into a pre-trained scene recognition model for scene recognition processing, and outputting a scene recognition result of a scene where a target object in the target picture is located; the processing module is used for inputting the position information and the target picture into a pre-trained picture recognition model for attribute recognition processing, and outputting an attribute recognition result of a target object in the target picture; and the determining module is used for determining the category of the target picture according to the scene recognition result and the attribute recognition result.
In a sixth aspect, an embodiment of the present application provides a training apparatus for a target object detection model, including: the acquisition module is used for acquiring a plurality of first sample pictures; a marking module, configured to mark positions of target objects in the plurality of first sample pictures in corresponding first sample pictures to obtain first labels of the target objects in the sample pictures, and form a first training sample set, where each first training sample in the first training sample set includes a first sample picture carrying the first label, and the first label is used to represent actual position information of the target object marked in the corresponding first sample picture; and the training module is used for sequentially inputting the first training samples in the first training sample set into a target object detection model to be trained for iterative training until a first loss function of the target object detection model is converged, so as to obtain the trained target object detection model, wherein the first loss function is used for representing an error between a predicted value of position information of a target object output by the initial target object detection model in a corresponding first sample picture and the real position information.
In a seventh aspect, an embodiment of the present application provides a training apparatus for a picture recognition model, including: the acquisition module is used for acquiring a plurality of second sample pictures and position information of a target object in each second sample picture; a determining module, configured to determine a second label of the target object in each second sample picture to form a second training sample set, where the second label is used to represent real attribute information of the target object, and each second training sample in the second training sample set includes a second sample picture carrying the second label and position information of the target object in the second sample picture; the training module is used for sequentially inputting second training samples in the second training sample set into a picture recognition model to be trained for iterative training until the picture recognition model is obtained after training under the condition that a second loss function of the picture recognition model is converged; the second loss function is used for representing an error between a predicted value of an attribute identification result of a target object in the second sample picture output by the picture identification model and the real attribute information.
In an eighth aspect, an embodiment of the present application provides a training apparatus for a scene recognition model, including: the acquisition module is used for acquiring a plurality of third sample pictures; the marking module is used for marking the environmental information in the third sample pictures to obtain a third training sample set with environmental information labels, wherein each third training sample in the third training sample set comprises a third sample picture carrying the environmental label information; and the training module is used for sequentially inputting third training samples in the third training sample set into a scene recognition model to be trained for iterative training until the trained scene recognition model is obtained under the condition that a loss function of the scene recognition model is converged, wherein the loss function of the scene recognition model is used for expressing an error between a predicted value of a scene recognition result of the third sample picture and a scene real value of the third sample picture.
In a ninth aspect, an embodiment of the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus; the processor, the communication interface and the memory complete mutual communication through a bus; the memory is used for storing a computer program; the processor is configured to execute the program stored in the memory to implement the method steps as mentioned in the first aspect, the second aspect, the third aspect, or the fourth aspect.
In a tenth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the picture identification method according to the first aspect, the second aspect, the third aspect, or the fourth aspect.
In an eleventh aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the steps of the picture recognition method according to the first aspect, the second aspect, the third aspect, or the fourth aspect.
According to the technical solution provided by the embodiment of the present application, a target picture to be identified is acquired, the target picture comprising a target object; the target picture is input into a pre-trained target object detection model for detection processing, and the position information of the target object in the target picture is output; the target picture is input into a pre-trained scene recognition model for scene recognition processing, and a scene recognition result of the scene where the target object in the target picture is located is output; the position information and the target picture are input into a pre-trained picture recognition model for attribute recognition processing, and an attribute recognition result of the target object in the target picture is output; and the category of the target picture is determined according to the scene recognition result and the attribute recognition result. In this way, picture recognition is performed from both the attribute perspective of the target object and the environment perspective of the picture, so different image scenes can be handled flexibly, and the flexibility and generalization capability of the picture recognition model for recognizing abnormal pictures are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a picture identification method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a training method for a target object detection model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a training process of a target detection model according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a training method for a scene recognition model according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a training method for a picture recognition model according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram illustrating functional modules of an image recognition apparatus according to an embodiment of the present disclosure;
FIG. 7 is a functional block diagram of a training apparatus for detecting a model of a target object according to an embodiment of the present disclosure;
FIG. 8 is a functional block diagram of an apparatus for training a picture recognition model according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiment of the application aims to provide a picture identification method, a picture identification device and electronic equipment, and flexibility of a picture identification model is improved.
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As mentioned above, in the related art a multi-classification picture recognition model is used to recognize abnormal pictures: the model divides picture data into abnormal and normal categories and recognizes pictures directly.
In addition, a single-task, single-model approach is adopted in the related art: each image recognition task must be trained separately to obtain its own image recognition model for recognizing a particular type of picture. Because each image recognition task uses its own single model, the relevance among the tasks is not considered, and the generalization capability of each image recognition model is poor.
In order to solve the above technical problems, embodiments of the present application provide a picture identification method, a picture identification device, and an electronic device, and the following describes the picture identification method, the picture identification device, and the electronic device provided in embodiments of the present application in detail with reference to the accompanying drawings.
As shown in fig. 1, an execution subject of the method may be a server, where the server may be an independent server or a server cluster composed of multiple servers. The picture identification method may specifically include the following steps S101 to S109:
in step S101, a target picture to be recognized is acquired.
Specifically, the target picture includes a target object, and the target picture to be recognized may be any picture containing a target object to be detected. Recognizing the target picture specifically means recognizing the attribute information of the target object in the picture and the environment in which the target object is located; that is, for a given picture, the embodiment of the present application identifies both the target information of the target object in the picture and the environmental information in the picture.
In step S103, the target picture is input to the target object detection model for detection processing, and position information of the target object in the target picture is output.
Specifically, the target object detection model detects the position information of the target object in the target picture, and the position information output by the target object detection model is the coordinate position of the target object in the target picture. The target object detection model may be a deep learning network comprising a plurality of network layers, each of which performs a stage of processing on the target picture, such as extracting the image features of the target picture, transferring the image features, and performing detection on the image features.
In one possible implementation, the target object detection model includes a backbone network layer, a neck network layer, and a head network layer. In the detection processing, the backbone network layer is used for extracting features from the target picture to obtain the image features of the target picture; the neck network layer is used for transmitting the image features to the head network layer; and the head network layer is used for performing detection on the image features to obtain the position information of the target object in the target picture.
Specifically, the target object detection model mainly detects the position coordinates of a target object in the picture, where the target object may be a human body, a target part of the human body (such as the head, arms, legs, and feet), or interference information (such as contact information). The position coordinates of the target object in the picture may be the coordinates of the upper left corner and the lower right corner of its bounding box, expressed as (x1, y1) and (x2, y2) respectively, and these two corner coordinates serve as the position information of the detected target object.
Further, the backbone network (backbone) is a convolutional neural network constructed from a number of convolutions, batch normalization, activation functions, pooling, and residual blocks, and is used to extract image features of different scales from a picture; the neck network (neck) is a hierarchical network structure formed by convolution, batch normalization, activation functions, upsampling, splicing, and the like, and transmits image features of different scales to the head network (head); the head network (head) is constructed from fully connected layers, sigmoid, and the like, and is used to generate bounding boxes and predict the category of the attributes in the picture.
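To make this three-part structure concrete, the following is a minimal sketch in PyTorch. The layer counts, channel widths, and the single-scale head are illustrative assumptions; the patent does not specify the actual network configuration of the target object detection model.

```python
# Minimal backbone/neck/head sketch, assuming PyTorch; all sizes are illustrative.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Convolution + batch-norm + activation stack extracting multi-scale features."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1),
                                    nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                                    nn.BatchNorm2d(64), nn.ReLU(inplace=True))

    def forward(self, x):
        f1 = self.stage1(x)   # higher-resolution features
        f2 = self.stage2(f1)  # lower-resolution features
        return f1, f2

class Neck(nn.Module):
    """Upsamples and splices features of different scales for the head."""
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(64 + 32, 64, 1)

    def forward(self, f1, f2):
        return self.fuse(torch.cat([self.up(f2), f1], dim=1))

class Head(nn.Module):
    """Predicts a bounding box (cx, cy, w, h) plus a confidence score per cell."""
    def __init__(self):
        super().__init__()
        self.pred = nn.Conv2d(64, 5, 1)  # 4 box values + 1 objectness score

    def forward(self, f):
        return torch.sigmoid(self.pred(f))

class TargetObjectDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone, self.neck, self.head = Backbone(), Neck(), Head()

    def forward(self, x):
        f1, f2 = self.backbone(x)
        return self.head(self.neck(f1, f2))
```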
In step S105, the target picture is input into the scene recognition model for performing scene recognition processing, and a scene recognition result of a scene in which the target object is located in the target picture is output.
Specifically, since the backgrounds of some pictures are places such as hotels and meeting places, such pictures need to be distinguished from other common scenes (such as scenic spots and squares), and the scene recognition model can recognize the environment where the person is located.
The scene recognition model can be a multi-classification model constructed from a number of convolutions, batch normalization, activation functions, pooling, residual blocks, a softmax function, and the like, and it can comprehensively judge whether a person is in a scene such as a hotel, restaurant, or club. By recognizing the scene where the person is located, the model can judge whether the person is in a high-risk scene, so different image scenes can be handled flexibly and the flexibility is high; moreover, judging whether the picture belongs to an abnormal picture in combination with the scene recognition model can improve the identification precision for such pictures.
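A minimal sketch of such a multi-classification scene model follows, again in PyTorch; the scene category list and layer sizes are assumptions for illustration, since the text only names a few example scenes.

```python
import torch.nn as nn

SCENES = ["hotel", "restaurant", "club", "scenic_spot", "square"]  # illustrative

scene_model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),        # pool the feature map to one vector
    nn.Flatten(),
    nn.Linear(64, len(SCENES)),     # one logit per scene category
)
# At inference time, softmax turns the logits into per-scene probabilities:
# probs = scene_model(img_batch).softmax(dim=1)
```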
In step S107, the position information and the target picture are input to the picture recognition model to perform attribute recognition processing, and the result of attribute recognition of the target object in the target picture is output.
Specifically, the position information and the target picture are input into the picture recognition model, the target object is cut out from the target picture according to the position information, and attribute recognition is performed on the cut out target object.
In a possible implementation manner, the image recognition model is configured to crop the target object according to the position information of the target object, and recognize the attribute information of the cropped target object to obtain an attribute recognition result of the target object. The image identification model can be a multi-task learning model which is constructed by a plurality of convolutions, batch normalization, activation functions, pooling, residual blocks, softmax functions and the like.
Specifically, the picture recognition model recognizes the attribute information of the target object detected by the target object detection model, for example, human attributes such as the sex, age, clothing, and posture of a person in the picture. The attribute recognition result output by the picture recognition model is the category of each piece of attribute information of the human body recognized in the picture.
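A minimal sketch of the cropping step, assuming the position information is given as pixel corner coordinates (x1, y1), (x2, y2) on a (C, H, W) image tensor; the helper name is illustrative.

```python
def crop_target(picture, box):
    """Cut the detected target object out of the target picture.

    picture: tensor/array of shape (C, H, W); box: (x1, y1, x2, y2) in pixels.
    """
    x1, y1, x2, y2 = (int(v) for v in box)
    return picture[:, y1:y2, x1:x2]

# The cropped patch is then passed to the attribute-recognition branch, e.g.:
# patch = crop_target(target_picture, position_info)
# attrs = picture_recognition_model(patch)   # hypothetical model call
```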
In step S109, the category of the target picture is determined from the scene recognition result and the attribute recognition result.
Specifically, the attribute information recognized by the picture recognition model is combined with the scene recognition result for the person in the picture output by the scene recognition model to determine whether the target picture is an abnormal picture; if so, the abnormal picture is filtered out.
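A minimal sketch of this final decision step. The concrete rule (an abnormal attribute observed in a high-risk scene) is an assumption; the text only states that the two results are combined to determine the category.

```python
HIGH_RISK_SCENES = {"hotel", "restaurant", "club"}  # illustrative

def classify_picture(scene_result, attribute_result):
    """Combine scene and attribute recognition results into a picture category."""
    risky_scene = scene_result in HIGH_RISK_SCENES
    abnormal_attr = (attribute_result.get("clothing") == "abnormal"
                     or attribute_result.get("posture") == "abnormal")
    return "abnormal" if (risky_scene and abnormal_attr) else "normal"

# e.g. classify_picture("hotel", {"clothing": "abnormal", "posture": "normal"})
# returns "abnormal", and such pictures would then be filtered out.
```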
According to the technical solution provided by the embodiment of the present application, the picture is recognized from both the attribute perspective of the target object and the environment perspective of the picture, so that different image scenes can be handled flexibly, and the flexibility and generalization capability of the picture recognition model for identifying pictures are improved.
As shown in fig. 2, an execution subject of the training method may be a server, where the server may be an independent server or a server cluster composed of multiple servers. The training method of the target object detection model may specifically include the following steps S201 to S205:
step S201: a plurality of first sample pictures are acquired.
Specifically, each first sample picture is a picture containing a target object, where the target object in a first sample picture is implemented in the same or a similar manner as the target object mentioned in the foregoing embodiment; the common points may be cross-referenced and are not repeated in this embodiment. The target object detection model mainly detects the position coordinates of a target object in a picture, where the target object may be a human body, a target part of the human body (such as the head, arms, legs, and feet), or interference information (such as contact information), and the like.
Step S203: and marking the positions of the target objects in the plurality of first sample pictures in the corresponding first sample pictures to obtain first labels of the target objects in each sample picture, and forming a first training sample set.
Step S205: and sequentially inputting the first training samples in the first training sample set into a target object detection model to be trained for iterative training until the first loss function of the target object detection model is converged, thereby obtaining the trained target object detection model.
And the first loss function is used for representing the error between the predicted value of the position information of the target object in the corresponding first sample picture and the real position information, which are output by the target object detection model. Each first training sample in the first training sample set comprises a first sample picture carrying a first label, and the first label is used for representing real position information marked by the target object in the corresponding first sample picture.
The detailed training process of the target object detection model is described below with reference to fig. 3. For the first sample pictures, a number of pictures related to the human body may be collected as a data set containing a plurality of first sample pictures, and the target objects in the pictures, such as the human body, target parts of the human body, and interference information, are then marked. Correspondingly, each marked target object in each picture Si (i = 1, …, N) has corresponding position coordinates, which may be the coordinates of the upper left corner and the lower right corner of the target object, denoted (x1, y1) and (x2, y2) respectively, and each target object carries a corresponding label after marking. The target objects may also be prioritized, with high-priority target objects marked first. Taking target parts of the human body (such as the head, arms, legs, and feet) and interference information (such as contact information) as the target objects, and sex, age, clothing, and posture as the attribute information,
the priority of the target objects from high to low is: target parts of the human body, then interference information. The labels corresponding to target parts include, but are not limited to, non-abnormal, medium, and high; the labels for interference information are non-interference information and interference information.
The marked upper-left and lower-right corner coordinates (x1, y1) and (x2, y2) are converted into the center point and the width and height of the corresponding target, and are then normalized by the width and height of the corresponding picture to obtain the final first label of the sample picture.
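A minimal sketch of this label conversion; it mirrors common detection-label formats in which corner coordinates become a normalized center point plus width and height.

```python
def corners_to_normalized_cxcywh(x1, y1, x2, y2, img_w, img_h):
    """Convert marked corner coordinates to a normalized (cx, cy, w, h) label."""
    cx = (x1 + x2) / 2 / img_w   # center x, normalized by picture width
    cy = (y1 + y2) / 2 / img_h   # center y, normalized by picture height
    w = (x2 - x1) / img_w        # box width, normalized
    h = (y2 - y1) / img_h        # box height, normalized
    return cx, cy, w, h
```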
The target object detection model to be trained comprises three parts, namely a backbone network (backbone), a neck network (hack) and a head network (head).
The backbone network (backbone) is a convolutional neural network constructed from a number of convolutions, batch normalization, activation functions, pooling, and residual blocks, and is used to extract image features of different scales from a sample picture; the neck network (neck) is a hierarchical network structure formed by convolution, batch normalization, activation functions, upsampling, splicing, and the like, and transmits image features of different scales to the head network (head); the head network (head) is constructed from fully connected layers, sigmoid, and the like, and is used to generate bounding boxes and predict the category of the attributes in the sample picture. During training, the sample pictures with first labels in the first training sample set are sequentially fed through the backbone network (backbone), neck network (neck), and head network (head) for forward propagation, the loss value is calculated according to the loss function, the back-propagation algorithm is executed, and the network parameters of the target object detection model to be trained are updated until the optimal model parameter values are obtained. The trained target object detection model can predict whether a target object exists in a picture and the position information (position coordinates) of the target object; as shown in fig. 3, the detection results are the human body, the target parts of the human body, and the interference information. In this way, the target objects of various persons can be detected by one model, the relevance of the information among the various target objects is taken into account, and the generalization capability of the model is good.
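A minimal sketch of this training loop in PyTorch; `detector`, `first_loss_fn`, and `train_loader` stand in for the model, the first loss function, and the first training sample set, and the optimizer choice and learning rate are assumptions.

```python
import torch

optimizer = torch.optim.SGD(detector.parameters(), lr=1e-3)  # detector defined elsewhere
num_epochs = 10  # illustrative; in practice train until the loss converges

for epoch in range(num_epochs):
    for imgs, true_boxes in train_loader:     # first training samples + first labels
        pred_boxes = detector(imgs)           # forward propagation
        loss = first_loss_fn(pred_boxes, true_boxes)  # error vs. real positions
        optimizer.zero_grad()
        loss.backward()                       # back-propagation algorithm
        optimizer.step()                      # update network parameters
```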
As shown in fig. 4, an execution subject of the training method may be a server, where the server may be an independent server or a server cluster composed of multiple servers. The training method of the scene recognition model may specifically include the following steps S401 to S405:
step S401: a plurality of third sample pictures is obtained.
Specifically, each third sample picture is a picture containing a target object, implemented in the same or a similar manner as the target object in the foregoing embodiments; the common points may be cross-referenced and are not repeated here. Since the backgrounds of some pictures are places such as hotels and meeting places, such pictures need to be distinguished from other common scenes (such as scenic spots and squares), and the scene recognition model can recognize the environment where the persons are located.
For the third sample pictures, pictures of human bodies in scenes such as hotels and clubs can be collected and marked to form sample pictures, and the third sample pictures form the third training sample set.
Step S403: and marking the environmental information in the third sample pictures to obtain a third training sample set.
Each third training sample in the third training sample set comprises a third sample picture carrying an environmental information label, and the environmental information label is used for representing real scene information where a target object is located in the third sample picture.
Step S405: and sequentially inputting the third training samples in the third training sample set into the scene recognition model to be trained for iterative training until the trained scene recognition model is obtained under the condition that the loss function of the scene recognition model is converged.
And the loss function of the scene recognition model is used for representing the error between the predicted value of the scene recognition result of the target object in the third sample picture output by the scene recognition model and the real scene information.
Specifically, the scene recognition model may be a multi-classification model constructed from a number of convolutions, batch normalization, activation functions, pooling, residual blocks, a softmax function, and the like. The training samples in the third training sample set are sequentially input into the multi-classification model to be trained, and its model parameters are updated until the optimal model parameters are obtained, yielding the trained scene recognition model, which can comprehensively judge whether a person is in a sensitive scene such as a hotel or a club. By recognizing the scene where the person is located, the model can judge whether the person is in a high-risk scene, so different image scenes can be handled flexibly and the flexibility is high; moreover, judging whether a picture belongs to an abnormal picture in combination with the scene recognition model can improve the identification accuracy for abnormal pictures.
As shown in fig. 5, an execution subject of the training method may be a server, where the server may be an independent server or a server cluster composed of multiple servers. The training method of the image recognition model specifically includes the following steps S501 to S505:
step S501: and acquiring a plurality of second sample pictures and position information of the target object in each second sample picture.
Specifically, the picture recognition model recognizes attribute information of the target object detected by the target object detection model, for example, a human attribute such as sex, age, clothing, and posture of a human body in a picture.
Step S503: and determining a second label of the target object in the second sample picture to form a second training sample set.
The second labels are used for representing real attribute information of the target object, and each second training sample in the second training sample set comprises a second sample picture carrying the second label and position information of the target object in the second sample picture.
Step S505: and sequentially inputting second training samples in the second training sample set into the picture recognition model to be trained for iterative training until a second loss function of the picture recognition model is converged, so as to obtain the trained picture recognition model.
The specific steps of each iterative training of the picture recognition model comprise: cropping the target object in the second sample picture according to the position information of the target object in the second sample picture; recognizing the attribute information of the cropped target object to obtain a predicted value of the attribute recognition result of the target object; and adjusting the model parameters of the picture recognition model according to the predicted value of the attribute recognition result, the real attribute information, and the second loss function, where the second loss function is used to represent the error between the predicted value of the attribute recognition result of the target object in the second sample picture output by the picture recognition model and the real attribute information.
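A minimal sketch of one such iteration, reusing the `crop_target` helper sketched earlier and assuming the model returns a dict of per-attribute logits; all names are illustrative.

```python
def train_step(model, picture, box, true_attrs, loss_fns, optimizer):
    """One iteration: crop by position info, predict attributes, adjust parameters."""
    patch = crop_target(picture, box)             # cut target object from sample
    preds = model(patch.unsqueeze(0))             # per-attribute predictions (dict)
    loss = sum(loss_fns[a](preds[a], true_attrs[a]) for a in preds)  # second loss
    optimizer.zero_grad()
    loss.backward()                               # adjust model parameters
    optimizer.step()
    return loss.item()
```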
Specifically, as described in the above embodiment, the target object may be a human body and the attribute information may be human-body attribute information. The attribute information may also be prioritized, with high-priority attribute information marked first; for example, the priority of the attribute information from high to low is gender, age, clothing, and posture. The gender labels are female and male; the age labels may be minor, young, and middle-aged; the clothing labels may be normal and abnormal; and the posture labels may be abnormal and normal.
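The label space above can be written down as a small schema; the vocabularies are the example values from the text and are otherwise illustrative.

```python
ATTRIBUTE_LABELS = {                 # keys ordered from high to low priority
    "gender":   ["female", "male"],
    "age":      ["minor", "young", "middle_aged"],
    "clothing": ["normal", "abnormal"],
    "posture":  ["normal", "abnormal"],
}
```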
The technical solution of the embodiment of the present application is described below by taking a human body as the target object. After pictures containing human attribute information are collected and the human attribute information is marked, the pictures are input into the target object detection model for detection processing to obtain the position information of each target object, and the target objects are cropped from the pictures according to the position information to obtain a data set containing the target objects. For the target objects in the data set, because labeling cost leaves some target objects unmarked, the resulting data set contains labels for only part of the target objects rather than for all of them.
In one possible implementation manner, determining the second label of the target object in each second sample picture includes: dividing the target object in the second sample picture into marked real attribute information and unmarked real attribute information, wherein the target object marked with the attribute information corresponds to a second label; according to the type of the attribute information, inputting each second sample picture marked with real attribute information into a label prediction model to be trained corresponding to the type of the attribute information of the target object marked with the real attribute information respectively for iterative training, and obtaining each trained label prediction model under the condition that a third loss function corresponding to each label prediction model is converged, wherein the third loss function is used for expressing an error between a predicted value of an attribute recognition result output by the label prediction model and the real attribute information; and sequentially inputting the second sample pictures of the unmarked real attribute information into each label prediction model for label prediction processing to obtain a second label corresponding to the target object of the unmarked real attribute information.
Specifically, the data in the data set are divided, according to the different target objects, into marked data (with marked real attribute information) and unmarked data (with unmarked real attribute information). Denote the data set as S and the attribute information of the target object as q (q = 1, …, Q); according to the marking condition of the data, the data set S can be divided into a marked data set Sq1 and an unmarked data set Sq2, where the data in Sq1 and Sq2 follow the same distribution. For each kind of attribute information, a single-task learning model (label prediction model) is trained on the marked data set Sq1 until its loss function converges. After the single-task model (label prediction model) corresponding to each kind of attribute information is obtained by training, label prediction is performed on the unmarked attribute information in the unmarked data set Sq2 by the trained label prediction model to obtain predicted labels (second labels) for the unmarked attribute information in Sq2; because the data in Sq1 and Sq2 are identically distributed, these predicted labels are reliable. In this way, every piece of attribute information of every picture in the data set has a corresponding second label, the trained picture recognition model gains learning capacity for previously unlabeled attribute information, and when the picture recognition model is applied to recognizing the attribute information of pictures, a label can be assigned to unlabeled attribute information, improving the recognition accuracy and reliability.
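A minimal sketch of this pseudo-labelling step: a single-task label prediction model trained on Sq1 assigns second labels to the unmarked samples in Sq2. Function and variable names are illustrative.

```python
import torch

def pseudo_label(label_predictor, unlabeled_patches):
    """Predict second labels for unmarked attribute information (data set Sq2)."""
    label_predictor.eval()
    predicted = []
    with torch.no_grad():
        for patch in unlabeled_patches:
            logits = label_predictor(patch.unsqueeze(0))
            predicted.append(logits.argmax(dim=1).item())  # predicted label index
    return predicted
```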
In one possible implementation, the second loss function of the picture identification model is determined by a third loss function of each label prediction model.
Further, after the second label of each piece of attribute information in the data set is obtained, the constructed sample pictures, their attribute information, and the second labels of the attribute information are input into a multi-task learning model (the picture recognition model to be trained) for joint training; all model parameters of the feature-sharing module of the multi-task learning model and of each label prediction model are updated in combination with the third loss functions of the label prediction models trained as described above, yielding a pre-trained model for multi-task learning. After the pre-trained multi-task model is obtained, the feature-sharing module in the multi-task learning model is frozen, the label prediction model corresponding to each kind of attribute information is trained separately on the real-label data of Sq1 for that attribute, and the network parameters of each label prediction model are updated according to its third loss function while the parameters of the feature-sharing module are not updated; after multiple iterations, when the loss function of the multi-task learning model converges, the optimal multi-task learning model (the trained picture recognition model) is obtained.
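A minimal sketch of the freezing stage described above, assuming the multi-task model exposes a shared feature module `shared` and a dict of per-attribute heads `heads`; both attribute names are assumptions.

```python
import torch

# Freeze the feature-sharing module so its parameters are no longer updated.
for p in multitask_model.shared.parameters():
    p.requires_grad = False

# Optimize only the label prediction models (per-attribute heads) on Sq1 labels.
head_params = [p for head in multitask_model.heads.values()
               for p in head.parameters()]
optimizer = torch.optim.SGD(head_params, lr=1e-4)
# ...then run the usual training loop until the multi-task loss converges.
```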
Based on the same technical concept, the embodiment of the present application further provides a picture recognition apparatus, and fig. 6 is a schematic diagram of module composition of the picture recognition apparatus provided in the embodiment of the present application, where the picture recognition apparatus is configured to execute the picture recognition method described in the embodiment, and as shown in fig. 6, the picture recognition apparatus 600 includes: an obtaining module 601, configured to obtain a target picture to be identified, where the target picture includes a target object; the processing module 602 is configured to input the target picture into a pre-trained target object detection model for detection processing, and output position information of the target object in the target picture; the processing module 602 is configured to input the target picture into a pre-trained scene recognition model for performing scene recognition processing, and output a scene recognition result of a scene in which the target object in the target picture is located; the processing module 602 is configured to input the position information and the target picture into a pre-trained picture recognition model for attribute recognition processing, and output an attribute recognition result of a target object in the target picture; the determining module 603 is configured to determine a category of the target picture according to the scene recognition result and the attribute recognition result.
According to the technical solution disclosed by the embodiment of the present application, picture recognition is performed from both the attribute perspective of the target object and the environment perspective of the picture, so that different image scenes can be handled flexibly, and the flexibility and generalization capability of the picture recognition model for recognizing abnormal pictures are improved.
In one possible implementation, the target object detection model includes a backbone network layer, a neck network layer, and a head network layer; in the detection processing, the backbone network layer is used for extracting the features of the target picture to obtain the image features of the target picture; the neck network layer is used for transmitting the picture characteristics to the head network layer; the head network layer is used for detecting the image characteristics to obtain the position information of the target object in the target picture.
In a possible implementation manner, in the attribute identification process, the picture identification model is configured to crop a target object in the target picture according to the position information, and identify attribute information of the cropped target object to obtain an attribute identification result of the target object.
The image recognition device provided in the embodiment of the present application can implement each process in the embodiment corresponding to the image recognition method, and is not repeated here to avoid repetition.
Corresponding to the training method of the target object detection model provided in the foregoing embodiment, based on the same technical concept, an embodiment of the present application further provides a training apparatus of the target object detection model, fig. 7 is a schematic diagram of module composition of the training apparatus of the target object detection model provided in the embodiment of the present application, the training apparatus of the target object detection model is used to execute the training method of the target object detection model described in the foregoing embodiment, as shown in fig. 7, the training apparatus 700 of the target object detection model includes: an obtaining module 701, configured to obtain a plurality of first sample pictures; a labeling module 702, configured to label positions of a target object in corresponding first sample pictures in the multiple first sample pictures to obtain first labels of the target object in each sample picture, so as to form a first training sample set, where each first training sample in the first training sample set includes a first sample picture carrying a first label, and the first label is used to indicate real position information of the target object labeled in the corresponding first sample picture; the training module 703 is configured to sequentially input the first training samples in the first training sample set into the target object detection model to be trained for iterative training, and obtain the trained target object detection model until a first loss function of the target object detection model converges, where the first loss function is used to indicate an error between a predicted value of position information of a target object in a corresponding first sample picture output by the target object detection model and real position information.
The training device for the target object detection model provided in the embodiment of the present application can implement each process in the embodiment corresponding to the training method for the target object detection model, and is not described here again to avoid repetition.
Based on the same technical concept, the embodiment of the present application further provides a training device for a picture recognition model, fig. 8 is a schematic diagram of modules of the training device for a picture recognition model provided in the embodiment of the present application, the training device for a picture recognition model is used to execute the training method for a picture recognition model described in the embodiment, and as shown in fig. 8, the training device for a picture recognition model includes: an obtaining module 801, configured to obtain a plurality of second sample pictures and position information of a target object in each of the second sample pictures; a determining module 802, configured to determine a second label of the target object in each second sample picture, to form a second training sample set, where the second label is used to represent real attribute information of the target object, and each second training sample in the second training sample set includes a second sample picture carrying the second label and position information of the target object in the second sample picture; the training module 803 is configured to sequentially input the second training samples in the second training sample set into a picture recognition model to be trained for iterative training, until a second loss function of the picture recognition model converges, to obtain the trained picture recognition model, where the second loss function is used to represent an error between a predicted value of an attribute recognition result of a target object in the second sample picture output by the picture recognition model and the real attribute information.
In a possible implementation manner, the specific steps of each iterative training of the picture recognition model include: and cutting the target object in the second sample picture according to the position information of the target object in the second sample picture, identifying the attribute information of the cut target object to obtain a predicted value of an attribute identification result of the target object, and adjusting the model parameter of the picture identification model according to the predicted value of the attribute identification result of the target object, the real attribute information and the second loss function.
In a possible implementation manner, the determining module 803 is further configured to divide the target object in the second sample picture into marked real attribute information and unmarked real attribute information, where a target object of the marked real attribute information corresponds to a second tag; according to the type of attribute information, inputting each second sample picture marked with real attribute information into a label prediction model to be trained corresponding to the type of attribute information of a target object marked with real attribute information respectively for iterative training, and obtaining each trained label prediction model under the condition that a third loss function corresponding to each label prediction model is converged, wherein the third loss function is used for expressing an error between a predicted value of an attribute recognition result output by the label prediction model and the real attribute information; and sequentially inputting the second sample pictures of the unmarked real attribute information into each label prediction model to perform label prediction processing, so as to obtain a second label corresponding to the target object of the unmarked real attribute information.
In a possible implementation manner, the obtaining module 801 is further configured to obtain position information of a target object in each second sample picture, which is output after a pre-trained target object detection model performs detection processing on a plurality of second sample pictures.
The training device for the picture recognition model provided by the embodiment of the application can realize each process in the embodiment corresponding to the training method for the picture recognition model, and is not repeated here for avoiding repetition.
Corresponding to the training method of the scene recognition model provided in the foregoing embodiment, based on the same technical concept, the embodiment of the present application further provides a training apparatus of the scene recognition model, fig. 8 is a schematic diagram of module composition of the training apparatus of the scene recognition model provided in the embodiment of the present application, the training apparatus of the scene recognition model is configured to execute the training method of the scene recognition model described in the foregoing embodiment, and as shown in fig. 8, the training apparatus 800 of the scene recognition model includes: an obtaining module 801, configured to obtain a plurality of third sample pictures; a labeling module 802, configured to label environmental information in multiple third sample pictures to obtain a third training sample set, where each third training sample in the third training sample set includes a third sample picture carrying an environmental information tag, and the environmental information tag is used to represent real scene information where a target object in the third sample picture is located; the training module 803 is configured to sequentially input third training samples in a third training sample set to a scene recognition model to be trained for iterative training, and obtain the trained scene recognition model until a loss function of the scene recognition model is converged, where the loss function of the scene recognition model is used to represent an error between a predicted value of a scene recognition result where a target object in the third sample picture output by the scene recognition model is located and the real scene information.
It should be noted that the training apparatus for the scene recognition model provided in the embodiment of the present application and the training method for the scene recognition model provided in the embodiment of the present application are based on the same inventive concept and have the same technical effect; therefore, for the specific implementation of this embodiment, reference may be made to the implementation of the training method for the scene recognition model, and repeated details are not described again.
Based on the same technical concept, an embodiment of the present application further provides an electronic device configured to execute the methods mentioned in the foregoing embodiments. Fig. 9 is a schematic structural diagram of an electronic device implementing each embodiment of the present application. As shown in fig. 9, electronic devices may vary widely in configuration or performance, and may include one or more processors 901 and a memory 902, where the memory 902 may store one or more application programs or data. The memory 902 may be transient storage or persistent storage. An application program stored in the memory 902 may include one or more modules (not shown), and each module may include a series of computer-executable instructions for the electronic device. Further, the processor 901 may be configured to communicate with the memory 902 and execute the series of computer-executable instructions in the memory 902 on the electronic device. The electronic device may also include one or more power supplies 903, one or more wired or wireless network interfaces 904, one or more input/output interfaces 905, and one or more keyboards 906.
In this embodiment, the electronic device includes a processor, a communication interface, a memory, and a communication bus; the processor, the communication interface, and the memory communicate with one another through the bus. The memory is used for storing a computer program, and the processor is used for executing the program stored in the memory to implement the steps described in the foregoing method embodiments.
It should be noted that the electronic device provided in the embodiment of the present application and the method embodiments provided in the embodiments of the present application are based on the same inventive concept and have the same technical effect; therefore, for the specific implementation of this embodiment, reference may be made to the implementation of the foregoing method embodiments, and repeated details are not described again.
The embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps described in the foregoing method embodiments are implemented.
It should be noted that the computer-readable storage medium provided in the embodiment of the present application and the method embodiments provided in the embodiments of the present application are based on the same inventive concept and have the same technical effect; therefore, for the specific implementation of this embodiment, reference may be made to the implementation of the foregoing method embodiments, and repeated details are not described again.
In a specific embodiment, the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the steps described in the foregoing method embodiments.
It should be noted that the chip provided in the embodiment of the present application and the method embodiments provided in the embodiments of the present application are based on the same inventive concept and have the same technical effect; therefore, for the specific implementation of this embodiment, reference may be made to the implementation of the foregoing method embodiments, and repeated details are not described again.
As will be appreciated by one skilled in the art, the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the embodiments of the present application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, an electronic device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (10)
1. A picture identification method is characterized by comprising the following steps:
acquiring a target picture to be identified, wherein the target picture comprises a target object;
inputting the target picture into a pre-trained target object detection model for detection processing, and outputting the position information of the target object in the target picture;
inputting the target picture into a pre-trained scene recognition model for scene recognition processing, and outputting a scene recognition result of a scene where a target object in the target picture is located;
inputting the position information and the target picture into a pre-trained picture recognition model for attribute recognition processing, and outputting an attribute recognition result of a target object in the target picture;
and determining the category of the target picture according to the scene recognition result and the attribute recognition result.
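Purely as an illustration, the four steps of claim 1 above can be read as the pipeline sketched below in Python/PyTorch. Every function is a stub standing in for a pre-trained model, and the final category rule is an assumed example; the claim does not fix the concrete form of any of them.

```python
# Illustrative pipeline for claim 1; every function below is a stub standing
# in for a pre-trained model, and the category rule is an assumed example.
import torch

def detect(picture):
    """Stub for the target object detection model: returns the position
    information of the target object as one (x1, y1, x2, y2) box."""
    return torch.tensor([8, 8, 24, 24])

def recognize_scene(picture):
    """Stub for the scene recognition model."""
    return "indoor"

def recognize_attributes(picture, box):
    """Stub for the picture recognition model: crops the target object by
    its position information, then recognizes attributes of the crop."""
    x1, y1, x2, y2 = box.tolist()
    crop = picture[:, y1:y2, x1:x2]  # attribute recognition would run on this
    return {"color": "red", "crop_size": crop.shape[-1]}

def classify_picture(picture):
    box = detect(picture)                       # step 1: position information
    scene = recognize_scene(picture)            # step 2: scene recognition
    attrs = recognize_attributes(picture, box)  # step 3: attribute recognition
    # step 4: combine both results into a picture category (assumed rule)
    return f"{scene}/{attrs['color']}"

print(classify_picture(torch.randn(3, 32, 32)))  # e.g. "indoor/red"
```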
2. The picture identification method of claim 1, wherein the target object detection model comprises a backbone network layer, a neck network layer, and a head network layer;
in the detection processing, the backbone network layer is used for extracting the features of the target picture to obtain the image features of the target picture;
the neck network layer is used for transferring the image features to the head network layer;
the head network layer is used for detecting the image features to obtain the position information of the target object in the target picture.
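A minimal toy detector matching the three stages named in claim 2 is sketched below in Python/PyTorch; the particular layers are assumptions, since the claim specifies only the stages and the flow of features between them.

```python
# Toy backbone/neck/head detector matching the three stages of claim 2;
# the specific layers are assumptions, since the claim names only the stages.
import torch
import torch.nn as nn

class Detector(nn.Module):
    def __init__(self):
        super().__init__()
        # backbone: extracts image features from the target picture
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # neck: transfers (here, lightly fuses) the features for the head
        self.neck = nn.Conv2d(32, 32, 1)
        # head: detects one box (x1, y1, x2, y2) from the pooled features
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4),
        )

    def forward(self, picture):
        features = self.backbone(picture)  # image features
        features = self.neck(features)     # passed on to the head
        return self.head(features)         # position information

print(Detector()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 4])
```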
3. The picture identification method according to claim 1, wherein in the attribute recognition processing, the picture recognition model is configured to crop the target object in the target picture according to the position information, and recognize attribute information of the cropped target object, so as to obtain the attribute recognition result of the target object.
4. A training method of a target object detection model is characterized by comprising the following steps:
acquiring a plurality of first sample pictures;
marking the position of the target object in each of the plurality of first sample pictures to obtain a first label of the target object in each first sample picture, so as to form a first training sample set, wherein each first training sample in the first training sample set comprises a first sample picture carrying the first label, and the first label is used for representing the marked real position information of the target object in the corresponding first sample picture;
and sequentially inputting the first training samples in the first training sample set into a target object detection model to be trained for iterative training, and obtaining the trained target object detection model under the condition that a first loss function of the target object detection model is converged, wherein the first loss function is used for representing an error between a predicted value of position information of a target object in a corresponding first sample picture output by the target object detection model and the real position information.
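For illustration, one gradient step against the "first loss" of claim 4 might look as follows; smooth L1 is an assumed choice of box regression loss, not one named by the claim, and the random tensors stand in for real predictions and labels.

```python
# One gradient step against the "first loss" of claim 4: the error between
# predicted position information and the labeled real positions. Smooth L1
# is an assumed choice of box regression loss, not one named by the claim.
import torch
import torch.nn.functional as F

pred_boxes = torch.randn(4, 4, requires_grad=True)  # predicted (x1, y1, x2, y2)
true_boxes = torch.randn(4, 4)                      # first labels (real positions)

first_loss = F.smooth_l1_loss(pred_boxes, true_boxes)
first_loss.backward()  # gradients drive the iterative training
print(first_loss.item())
```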
5. A training method of a picture recognition model is characterized by comprising the following steps:
acquiring a plurality of second sample pictures and position information of a target object in each second sample picture;
determining second labels of the target objects in the second sample pictures to form a second training sample set, wherein the second labels are used for representing real attribute information of the target objects, and each second training sample in the second training sample set comprises the second sample picture carrying the second labels and position information of the target objects in the second sample picture;
and sequentially inputting second training samples in the second training sample set into a picture recognition model to be trained for iterative training, and obtaining the trained picture recognition model under the condition that a second loss function of the picture recognition model is converged, wherein the second loss function is used for expressing an error between a predicted value of an attribute recognition result of a target object in the second sample picture output by the picture recognition model and the real attribute information.
6. The method for training the picture recognition model according to claim 5, wherein each iteration of training the picture recognition model specifically includes:
and cutting the target object in the second sample picture according to the position information of the target object in the second sample picture, identifying the attribute information of the cut target object to obtain a predicted value of an attribute identification result of the target object, and adjusting the model parameter of the picture identification model according to the predicted value of the attribute identification result of the target object, the real attribute information and the second loss function.
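One training step of claim 6 might be sketched as below: cut out the target object by its position information, recognize attributes on the crop, then adjust the model parameters with the second loss. The toy model, crop coordinates, and cross-entropy loss are all assumptions for the sketch.

```python
# One assumed training step for claim 6: cut out the target object by its
# position information, recognize attributes on the crop, then adjust the
# model parameters with the second loss (cross-entropy is an assumed choice).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 5))  # toy recognizer
opt = torch.optim.SGD(model.parameters(), lr=0.01)
second_loss_fn = nn.CrossEntropyLoss()

picture = torch.randn(3, 48, 48)
x1, y1, x2, y2 = 8, 8, 24, 24        # position information of the target object
real_attribute = torch.tensor([2])   # second label (real attribute information)

crop = picture[:, y1:y2, x1:x2].unsqueeze(0)      # cut out the target object
predicted = model(crop)                           # predicted attribute result
loss = second_loss_fn(predicted, real_attribute)  # second loss
opt.zero_grad()
loss.backward()
opt.step()  # adjust the model parameters
```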
7. The method for training the image recognition model according to claim 5 or 6, wherein the determining the second label of the target object in each of the second sample images comprises:
dividing the target objects in the second sample pictures into target objects with marked real attribute information and target objects with unmarked real attribute information, wherein each target object with marked real attribute information corresponds to a second label;
according to the type of the attribute information, inputting each second sample picture with marked real attribute information into the label prediction model to be trained corresponding to the attribute information type of its marked target object for iterative training, and obtaining each trained label prediction model under the condition that the third loss function corresponding to each label prediction model converges, wherein the third loss function is used for representing an error between a predicted value of the attribute recognition result output by the label prediction model and the real attribute information;
sequentially inputting each second sample picture with unmarked real attribute information into each trained label prediction model for label prediction processing, so as to obtain the second label corresponding to each target object with unmarked real attribute information;
wherein the second loss function is determined by the third loss functions of the label prediction models.
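One assumed reading of the last clause of claim 7, with summation as one possible way the second loss may be determined from the per-attribute third losses:

```python
# Assumed illustration of claim 7's last clause: the second loss determined
# from the per-attribute third losses, here by simple summation.
import torch

third_losses = [torch.tensor(0.31), torch.tensor(0.12), torch.tensor(0.07)]
second_loss = torch.stack(third_losses).sum()
print(second_loss)  # tensor(0.5000)
```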
8. A method for training a scene recognition model, the method comprising:
obtaining a plurality of third sample pictures;
marking environmental information in the third sample pictures to obtain a third training sample set, wherein each third training sample in the third training sample set comprises a third sample picture carrying an environmental information label, and the environmental information label is used for representing real scene information of a target object in the third sample picture;
and sequentially inputting third training samples in the third training sample set into a scene recognition model to be trained for iterative training, and obtaining the trained scene recognition model under the condition that a loss function of the scene recognition model is converged, wherein the loss function of the scene recognition model is used for expressing an error between a predicted value of a scene recognition result of a target object in the third sample picture output by the scene recognition model and the real scene information.
9. A picture identification apparatus, comprising:
the device comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring a target picture to be recognized, and the target picture comprises a target object;
the processing module is used for inputting the target picture into a pre-trained target object detection model for detection processing and outputting the position information of the target object in the target picture;
the processing module is used for inputting the target picture into a pre-trained scene recognition model for scene recognition processing, and outputting a scene recognition result of a scene where a target object in the target picture is located;
the processing module is used for inputting the position information and the target picture into a pre-trained picture recognition model for attribute recognition processing, and outputting an attribute recognition result of a target object in the target picture;
and the determining module is used for determining the category of the target picture according to the scene recognition result and the attribute recognition result.
10. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the bus; the memory is used for storing a computer program; and the processor is used for executing the program stored in the memory to implement the picture identification method according to any one of claims 1 to 3, the training method for the target object detection model according to claim 4, the training method for the picture recognition model according to any one of claims 5 to 7, or the training method for the scene recognition model according to claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210725264.8A CN115035367A (en) | 2022-06-24 | 2022-06-24 | Picture identification method and device and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210725264.8A CN115035367A (en) | 2022-06-24 | 2022-06-24 | Picture identification method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115035367A true CN115035367A (en) | 2022-09-09 |
Family
ID=83126897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210725264.8A Pending CN115035367A (en) | 2022-06-24 | 2022-06-24 | Picture identification method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035367A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116721159A (en) * | 2023-08-04 | 2023-09-08 | 北京智源人工智能研究院 | Ultrasonic carotid artery central point coordinate prediction method and carotid artery cross section tracking method |
CN116721159B (en) * | 2023-08-04 | 2023-11-03 | 北京智源人工智能研究院 | Ultrasonic carotid artery central point coordinate prediction method and carotid artery cross section tracking method |
CN116824381A (en) * | 2023-08-30 | 2023-09-29 | 环球数科集团有限公司 | Scene attribute labeling system based on AIGC |
CN116824381B (en) * | 2023-08-30 | 2023-10-27 | 环球数科集团有限公司 | Scene attribute labeling system based on AIGC |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106980868B (en) | Embedding space for images with multiple text labels | |
CN110555481B (en) | Portrait style recognition method, device and computer readable storage medium | |
AU2019201787B2 (en) | Compositing aware image search | |
CN107045618B (en) | Facial expression recognition method and device | |
CN110069129B (en) | Determination system and determination method | |
JP2018200685A (en) | Forming of data set for fully supervised learning | |
CN115035367A (en) | Picture identification method and device and electronic equipment | |
CN111652974B (en) | Method, device, equipment and storage medium for constructing three-dimensional face model | |
CN109919077B (en) | Gesture recognition method, device, medium and computing equipment | |
CN113762309B (en) | Object matching method, device and equipment | |
CN110930386B (en) | Image processing method, device, equipment and storage medium | |
CN114519881A (en) | Face pose estimation method and device, electronic equipment and storage medium | |
CN111507285A (en) | Face attribute recognition method and device, computer equipment and storage medium | |
CN107918767A (en) | Object detection method, device, electronic equipment and computer-readable medium | |
CN109190662A (en) | A kind of three-dimensional vehicle detection method, system, terminal and storage medium returned based on key point | |
CN113869371A (en) | Model training method, clothing fine-grained segmentation method and related device | |
CN110992404A (en) | Target tracking method, device and system and storage medium | |
CN115131604A (en) | Multi-label image classification method and device, electronic equipment and storage medium | |
KR102616028B1 (en) | Apparatus and method for performing visual localization effectively | |
CN117953581A (en) | Method and device for identifying actions, electronic equipment and readable storage medium | |
CN115861572B (en) | Three-dimensional modeling method, device, equipment and storage medium | |
KR20240115727A (en) | Apparatus and method for performing visual localization | |
Arunnehru et al. | Human pose estimation and activity classification using machine learning approach | |
CN117592421A (en) | Encapsulation library and three-dimensional model creation method and device | |
CN114972492A (en) | Position and pose determination method and device based on aerial view and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |