CN116958730A - Training method and device of image recognition model, storage medium and electronic equipment


Info

Publication number
CN116958730A
Authority
CN
China
Prior art keywords: image, recognition model, sample, training, segmented
Legal status: Pending
Application number
CN202310361289.9A
Other languages
Chinese (zh)
Inventor
朱城 (Zhu Cheng)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310361289.9A
Publication of CN116958730A

Classifications

    • G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/761: Proximity, similarity or dissimilarity measures
    • G06V10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/778: Active pattern-learning, e.g. online learning of image or video features
    • G06V10/82: Image or video recognition or understanding using neural networks
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method and apparatus for an image recognition model, a storage medium, and an electronic device. The method includes the following steps: acquiring a first recognition result obtained after a first image recognition model recognizes a first sample image; determining, from a segmented image set, a target segmented image corresponding to the first recognition result; recognizing, through a second image recognition model, a second sample image obtained by synthesizing the target segmented image with the first sample image, to obtain a second recognition result, where the second recognition result indicates the image category of the first sample image and the image category of the target segmented image; performing joint training on the first image recognition model and the second image recognition model according to a first training loss and a second training loss; and determining the trained second image recognition model as the target image recognition model. The method and apparatus solve the technical problem that image recognition models obtained by related training methods have low recognition accuracy.

Description

Training method and device of image recognition model, storage medium and electronic equipment
Technical Field
The present invention relates to the field of computers, and in particular to a training method and apparatus for an image recognition model, a storage medium, and an electronic device.
Background
Object recognition refers to assigning pictures to specific categories, such as tables and chairs. Existing object recognition models mainly adopt deep-learning classification, assisted by manual secondary review. In a practical Internet environment, the model is expected to recall as many pictures of objects requiring intervention as possible while reducing misjudgment of normal pictures. In actual business, the same picture may not always receive the same label, because annotators differ in time spent and in their understanding of the labels. To speed up annotation, a picture is often given only a single label. Under this setting, the model's recognition of a single label per picture tends to be too aggressive: it recalls pictures that do not belong to that category, causing misjudgments on single-label pictures. That is, training an image recognition model on single-label pictures leads to the technical problem of inaccurate image recognition results.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present invention provide a training method and apparatus for an image recognition model, a storage medium, and an electronic device, so as to at least solve the technical problem that image recognition models obtained by related training methods have low recognition accuracy.
According to an aspect of the embodiments of the present invention, a training method for an image recognition model is provided, including: acquiring a first recognition result obtained after a first image recognition model recognizes a first sample image, where the first recognition result indicates the image category of the first sample image; determining, from a segmented image set, a target segmented image corresponding to the first recognition result, where the segmented image set includes a plurality of segmented images obtained by performing image segmentation on a plurality of sample images, each sample image includes at least one image object, and each segmented image corresponds to one image object; recognizing, in a second image recognition model, a second sample image synthesized based on the target segmented image and the first sample image, to obtain a second recognition result, where the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmented image; and performing joint training on the first image recognition model and the second image recognition model according to a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result, and determining the trained second image recognition model as the target image recognition model when the first training loss and the second training loss satisfy a convergence condition.
According to another aspect of the embodiment of the present invention, there is also provided a training apparatus for an image recognition model, including:
a first recognition unit, configured to obtain a first recognition result obtained after a first image recognition model recognizes a first sample image in a sample image set, where the sample image set includes a plurality of sample images, each sample image includes at least one image object, and the first recognition result indicates the image category of the first sample image; a determining unit, configured to determine, from a segmented image set, a target segmented image corresponding to the first recognition result, where the segmented image set includes a plurality of segmented images obtained by performing image segmentation on the plurality of sample images in the sample image set, and each segmented image corresponds to one image object in a sample image; a second recognition unit, configured to recognize, in a second image recognition model, a second sample image synthesized based on the target segmented image and the first sample image, to obtain a second recognition result, where the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmented image; and a training unit, configured to perform joint training on the first image recognition model and the second image recognition model based on a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result, and determine the trained second image recognition model as the target image recognition model when the first training loss and the second training loss satisfy a convergence condition.
According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, where the computer program is configured to perform the above training method of the image recognition model when executed.
According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the training method of the image recognition model as above.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory, in which a computer program is stored, and a processor configured to execute the training method of the image recognition model described above by the computer program.
In the embodiments of the present application, a first recognition result obtained after a first image recognition model recognizes a first sample image is acquired; a target segmented image corresponding to the first recognition result is determined from the segmented image set; a second sample image obtained by synthesizing the target segmented image with the first sample image is recognized through a second image recognition model to obtain a second recognition result, where the second recognition result indicates the image category of the first sample image and the image category of the target segmented image; the first image recognition model and the second image recognition model are jointly trained according to the first training loss and the second training loss; and the trained second image recognition model is determined as the target image recognition model, thereby completing the training of the image recognition model.
In this training method, the first image recognition model first recognizes the image category of the sample; a segmented image matching the image category of the sample image is then found from the segmented image library according to the recognized category; the new sample image obtained by combining the segmented image with the sample image is then recognized through the second image recognition model; and the first and second image recognition models are jointly trained according to the training losses determined from the recognition results. This improves the first image recognition model's ability to recognize the overall category of an image while synchronously improving the second image recognition model's ability to output multi-label recognition results, thereby making it possible to train the image recognition model with single-label images. The resulting image recognition model has a stronger multi-label output capability, which solves the technical problem that existing training methods based on single-label images yield inaccurate image recognition results, and achieves the technical effect of improving the accuracy of the recognition results of the image recognition model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment of an alternative image recognition model training method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative training method for an image recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative image recognition model training method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another alternative image recognition model training method in accordance with an embodiment of the present invention;
FIG. 5 is a schematic illustration of a training method of yet another alternative image recognition model in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of a training method of yet another alternative image recognition model according to an embodiment of the present invention;
FIG. 7 is a schematic illustration of a training method of yet another alternative image recognition model in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of a training method of yet another alternative image recognition model, according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a training method of yet another alternative image recognition model, according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an alternative electronic device according to an embodiment of the invention;
FIG. 11 is a schematic diagram of an alternative training apparatus for image recognition models according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a training method for an image recognition model is provided, which may be applied, but is not limited, to the training system shown in fig. 1, composed of a terminal device 102, a server 104, and a network 110. As shown in fig. 1, the terminal device 102 is communicatively connected to the server 104 via the network 110, which may include, but is not limited to, a wired network or a wireless network, where the wired network includes local area networks, metropolitan area networks, and wide area networks, and the wireless network includes Bluetooth, Wi-Fi, and other networks enabling wireless communication. The terminal device may include, but is not limited to, at least one of: a mobile phone (e.g., an Android phone or an iOS phone), a notebook computer, a tablet computer, a palmtop computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart television, a vehicle-mounted device, etc. The terminal device 102 may be provided with a client for identifying image categories, for example, a search-by-image client or an instant messaging client (which may provide a search-by-image recognition function).
The terminal device 102 is further provided with a display, a processor, and a memory. The display may be used to display the program interface of an image recognition program; the processor may be used to recognize an acquired picture through the target image recognition model; and the memory is used to store the target image recognition model trained by the server 104. It may be appreciated that, once the server 104 has completed training the image recognition model, the terminal device 102 receives the image recognition model sent by the server 104 through the network 110; when the terminal device 102 then receives an image recognition request from a user, it recognizes the category of the image contained in the request through the received image recognition model.
The server 104 may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The server includes a database and a processing engine: the processing engine handles the model training process, and the database may be used to store the sample gallery for training the image recognition model.
According to an aspect of the embodiments of the present invention, the training system for an image recognition model may further perform the following steps. First, the server 104 executes steps S102 to S110: a sample image set is obtained, where the sample image set includes a first sample image; a first recognition result obtained after a first image recognition model recognizes the first sample image is acquired, where the first recognition result indicates the image category of the first sample image; a target segmented image corresponding to the first recognition result is determined from a segmented image set, where the segmented image set includes a plurality of segmented images obtained by performing image segmentation on a plurality of sample images, each sample image includes at least one image object, and each segmented image corresponds to one image object; in a second image recognition model, a second sample image synthesized based on the target segmented image and the first sample image is recognized to obtain a second recognition result, where the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmented image; and the first image recognition model and the second image recognition model are jointly trained according to a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result, and the trained second image recognition model is determined as the target image recognition model when the first training loss and the second training loss satisfy a convergence condition. Next, the server 104 executes step S112 to send the target image recognition model to the terminal device 102 via the network 110. Finally, the terminal device 102 performs step S114 to perform image recognition with the target image recognition model.
In the embodiments of the present invention, a first recognition result obtained after a first image recognition model recognizes a first sample image is acquired, where the first recognition result indicates the image category of the first sample image; a target segmented image corresponding to the first recognition result is determined from a segmented image set, where the segmented image set includes a plurality of segmented images obtained by performing image segmentation on a plurality of sample images, each sample image includes at least one image object, and each segmented image corresponds to one image object; in a second image recognition model, a second sample image synthesized based on the target segmented image and the first sample image is recognized to obtain a second recognition result, where the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmented image; and the first image recognition model and the second image recognition model are jointly trained according to the first training loss corresponding to the first recognition result and the second training loss corresponding to the second recognition result, and the trained second image recognition model is determined as the target image recognition model when the first training loss and the second training loss satisfy a convergence condition, thereby completing the training of the image recognition model.
In this training method, the first image recognition model first recognizes the image category of the sample; a segmented image matching the image category of the sample image is then found from the segmented image library according to the recognized category; the new sample image obtained by combining the segmented image with the sample image is then recognized through the second image recognition model; and the first and second image recognition models are jointly trained according to the training losses determined from the recognition results. This improves the first image recognition model's ability to recognize the overall category of an image while synchronously improving the second image recognition model's ability to output multi-label recognition results, thereby making it possible to train the image recognition model with single-label images. The resulting image recognition model has a stronger multi-label output capability, which solves the technical problem that existing training methods based on single-label images yield inaccurate image recognition results, and achieves the technical effect of improving the accuracy of the recognition results of the image recognition model.
The above is merely an example and does not limit the present embodiment in any way.
As an alternative embodiment, as shown in fig. 2, the training method of the image recognition model may include the following steps:
S202, acquiring a first recognition result obtained after a first sample image is recognized by a first image recognition model, where the first recognition result indicates the image category of the first sample image;
S204, determining, from a segmented image set, a target segmented image corresponding to the first recognition result, where the segmented image set includes a plurality of segmented images obtained by performing image segmentation on a plurality of sample images, each sample image includes at least one image object, and each segmented image corresponds to one image object;
S206, recognizing, in a second image recognition model, a second sample image synthesized based on the target segmented image and the first sample image to obtain a second recognition result, where the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmented image;
S208, performing joint training on the first image recognition model and the second image recognition model according to the first training loss corresponding to the first recognition result and the second training loss corresponding to the second recognition result, and determining the trained second image recognition model as the target image recognition model when the first training loss and the second training loss satisfy a convergence condition.
The first image recognition model in step S202 may include, but is not limited to, a neural network model for recognizing images. Specifically, the first image recognition model includes, but is not limited to, one of a LeNet network, an AlexNet network, an ImageNet network, a VGG network, a GoogLeNet network, or a ResNet. The first recognition result may include, but is not limited to, the image category of the first sample image. Further, each sample image in the sample set may be matched with an image label obtained by manual annotation, so as to train the first image recognition model.
Further, in step S204, the segmented images may be images included in a segmented image set determined from the sample image set. More specifically, a sample image may include at least one image object; for each image object in the sample image, a corresponding segmented image may be determined according to the image area occupied by that image object, and the segmented image set may thus be determined from the sample image set.
Optionally, the target segmented image corresponding to the first recognition result may be determined from the segmented image set by selecting, according to the sample image category indicated by the first recognition result, a segmented image matching that category. One matching method determines, from the segmented image set, a segmented image of the same category as the sample image. Another matching method determines, from the segmented image set, a segmented image of the same category as a reference image category associated with the sample image category, where the association may be a hierarchical (superordinate-subordinate) relationship between the sample image category and the reference image category, or a similarity relationship between them.
For example, in the first matching method, a segmented image of the same category as the sample image is determined from the segmented image set according to the sample image category: when the sample image category indicates that the sample image is of the "sports" category, a segmented image that is also of the "sports" category may be determined from the segmented image set.
As another example, in the method of determining a segmented image of the same category as a reference image category associated with the sample image category: if the sample image category indicates that the sample image is of the "sports" category, a segmented image of the "ball" category may be determined from the segmented image set, that is, the sample image category is a superordinate category of the segmented image category; alternatively, a segmented image of the "football" category may be determined from the segmented image set when the sample image category indicates that the sample image is "basketball", that is, the sample image category is a category similar to the segmented image category.
The above ways of determining the target segmented image corresponding to the first recognition result from the segmented image set are merely exemplary and do not limit how the segmented image is determined in specific embodiments.
It should be noted that the second image recognition model in step S206 may also include, but is not limited to, a neural network model for recognizing images; specifically, it includes, but is not limited to, one of a LeNet network, an AlexNet network, an ImageNet network, a VGG network, a GoogLeNet network, or a ResNet. The first image recognition model and the second image recognition model may be image recognition models of the same type or of different types. It will be appreciated that, where they are of the same type, the first and second image recognition models may also be the same image recognition model at different stages of the training process. The categories of the first and second image recognition models are not limited in this embodiment.
It may be understood that the sample image used for training the second image recognition model in step S206 is a second sample image obtained by synthesizing the target segmented image with the first sample image; the synthesizing operation may specifically be adding the target segmented image to the first sample image to obtain the second sample image. The second sample image obtained in this way simultaneously carries the image label of the original first sample image and the image label corresponding to the target segmented object, and thus may include at least two image labels for computing the image recognition loss.
Finally, in step S208, the first image recognition model and the second image recognition model may be trained through a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result. The first and second training losses may include, but are not limited to, losses computed from the probability values that the models assign to particular image categories.
For example, assume that the true image category (image category label) of the first sample image is "sports", and the true image category of the target segmented image is "football". The image label corresponding to the second sample image is then "sports, football". If the first image recognition model's recognition result for the first sample image gives a probability of 80% that the first sample image is "sports", the first training loss may correspondingly be determined as 1 - 80% = 20%, that is, 0.2. If the second image recognition model's recognition result for the second sample image gives a probability of 80% that the second sample image is "sports" and a probability of 60% that it is "football", the second training loss of the second image recognition model can be determined from the loss of 0.2 corresponding to "sports" and the loss of 0.4 corresponding to "football". Further, when the first and second training losses are obtained and do not satisfy the convergence condition, the first and second image recognition models are jointly trained (for example, the model parameters of the relevant modules in both models are adjusted at the same time).
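Restated numerically (the simple "1 - probability" form follows the worked example above; it stands in for whatever loss function an implementation actually uses):

```python
# First training loss: true category of the first sample image is "sports"
p_sports = 0.80
loss_1 = 1 - p_sports                            # 0.2

# Second training loss: second sample image is "sports" and "football"
p_sports_2, p_football_2 = 0.80, 0.60
loss_2 = (1 - p_sports_2) + (1 - p_football_2)   # 0.2 + 0.4 = 0.6
print(loss_1, loss_2)
```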
It can be appreciated that, since the second image recognition model can be used to recognize multi-label images, the above training process improves the accuracy with which the second image recognition model outputs multi-label results. During the joint training, the first image recognition model's capability to recognize the overall category of an image is improved, so that more relevant target segmented images can be determined and used to generate more suitable second sample images for training the second image recognition model. Finally, the trained second image recognition model is determined as the target image recognition model, which outputs recognition results indicating a plurality of category labels for an image.
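For orientation, the following is a minimal PyTorch-style sketch of one pass through steps S202 to S208. The model interfaces, the `paste` compositing helper, and the choice of a sigmoid (BCE) loss are illustrative assumptions rather than the patent's prescribed implementation, and selecting the target segmented image (S204) is abstracted into the function's inputs.

```python
import torch
import torch.nn.functional as F

def paste(img, seg, top=0, left=0):
    # Composite the segmented image into the sample image (placeholder
    # position logic; the patent selects a blank target position).
    out = img.clone()
    _, h, w = seg.shape
    out[:, top:top + h, left:left + w] = seg
    return out

def joint_train_step(model_a, model_b, optimizer, x1, y1, seg_img, seg_label):
    # S202: the first model recognizes the first sample image
    logits_a = model_a(x1.unsqueeze(0))
    loss_1 = F.binary_cross_entropy_with_logits(logits_a, y1.unsqueeze(0))

    # S204/S206: synthesize the second sample image and recognize it
    x2 = paste(x1, seg_img)
    y2 = torch.clamp(y1 + seg_label, max=1.0)   # categories of both images
    logits_b = model_b(x2.unsqueeze(0))
    loss_2 = F.binary_cross_entropy_with_logits(logits_b, y2.unsqueeze(0))

    # S208: joint training; the optimizer is assumed to hold the
    # parameters of both models so they are updated together
    loss = loss_1 + loss_2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```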
Optionally, the segmented image set in step S204 may be obtained through the following steps:
S1, acquiring first sample image features of the first sample image extracted by the first image recognition model;
S2, compressing the first sample image features through an activation function to obtain a feature thermal matrix, where the feature thermal matrix indicates the image position of an image object in the first sample image;
S3, acquiring a reference segmented image from the first sample image according to the feature thermal matrix;
S4, adding the reference segmented image to the segmented image set.
It can be appreciated that, in this embodiment, after the first sample image is recognized in the above manner, a segmented image corresponding to the image object in the first sample image is obtained and added to the segmented image set, so that the segmented image set is continuously expanded while single-label sample images are recognized and the image recognition model is trained. In addition, after the reference segmented image is acquired in this way, it is used to train the image recognition model, which improves the model's ability to recognize local images.
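A minimal sketch of steps S1 to S4 above, assuming the feature thermal matrix is a sigmoid-compressed activation heatmap and the reference segmented image is the bounding box of the high-activation region; the shapes and the threshold are illustrative.

```python
import torch
import torch.nn.functional as F

def extract_reference_segment(features, image, threshold=0.5):
    # S1/S2: compress backbone features into one heatmap via sigmoid
    heat = torch.sigmoid(features.mean(dim=0))                 # h x w
    heat = F.interpolate(heat[None, None], size=image.shape[1:],
                         mode="bilinear", align_corners=False)[0, 0]
    # S3: crop the bounding box of the activated region
    ys, xs = torch.nonzero(heat > threshold, as_tuple=True)
    if ys.numel() == 0:
        return None
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    return image[:, y0:y1, x0:x1]

segmented_image_set = []                       # S4: the growing set
feats = torch.randn(1024, 14, 14)              # dummy backbone features
img = torch.rand(3, 224, 224)                  # dummy first sample image
crop = extract_reference_segment(feats, img)
if crop is not None:
    segmented_image_set.append(crop)
```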
In the above embodiments of the present application, a first recognition result obtained after a first image recognition model recognizes a first sample image is acquired, where the sample image set includes a plurality of sample images, each sample image includes at least one image object, and the first recognition result indicates the image category of the first sample image; a target segmented image corresponding to the first recognition result is determined from a segmented image set, where the segmented image set includes a plurality of segmented images obtained by performing image segmentation on the sample images, and each segmented image corresponds to one image object in a sample image; in a second image recognition model, a second sample image synthesized based on the target segmented image and the first sample image is recognized to obtain a second recognition result, where the second recognition result indicates the image category of the first sample image and the image category of the target segmented image; and the first and second image recognition models are jointly trained according to the first training loss corresponding to the first recognition result and the second training loss corresponding to the second recognition result, and the trained second image recognition model is determined as the target image recognition model when the first and second training losses satisfy a convergence condition, thereby completing the training of the image recognition model.
In this training method, the first image recognition model first recognizes the image category of the sample; a segmented image matching the image category of the sample image is then found from the segmented image library according to the recognized category; the new sample image obtained by combining the segmented image with the sample image is then recognized through the second image recognition model; and the first and second image recognition models are jointly trained according to the recognition results. Improving the first image recognition model's recognition of an image's overall category while synchronously improving the second image recognition model's multi-label output makes it possible to train the image recognition model with single-label images, so that the resulting model has a stronger multi-label output capability. This solves the technical problem that existing training methods based on single-label images yield inaccurate image recognition results and achieves the technical effect of improving the accuracy of the recognition results of the image recognition model. The above is merely an example and does not limit the present embodiment in any way.
As an optional implementation, determining the target segmented image corresponding to the first recognition result from the segmented image set includes:
S1, acquiring an image relation matrix included in the first recognition result, where the image relation matrix indicates the degree of correlation between the image category corresponding to a first image object included in the first sample image and a plurality of segmented image categories, the segmented image categories being the image categories corresponding to the plurality of segmented images included in the segmented image set;
S2, determining, according to the image relation matrix, the target image category with the highest degree of correlation to the image category corresponding to the first image object;
S3, determining, from the segmented image set, a segmented image belonging to the target image category as the target segmented image.
In the above embodiment, in addition to the image category of the first sample image, the first recognition result may include a relation matrix indicating the correlation between the image category of each image object in the first sample image and each segmented image category.
The manner in which the target segmented image is determined is described below with reference to fig. 3, which shows an optional image relation matrix. The image relation matrix depicted in fig. 3 shows that the image objects included in the first sample image are "knife", "tomato", and "lemon", while the segmented image categories included in the segmented image set are "fork", "dining table", "football", "fruit", "basketball", and so on. From the relation matrix in fig. 3, it can be determined that the degree of correlation between the image object "knife" and the segmented image category "fork" is 90%; between "knife" and "dining table", 80%; between "knife" and "football", 2%; between "knife" and "fruit", 50%; and between "knife" and "basketball", 2%. The segmented image category with the highest degree of correlation to the image object "knife" can therefore be determined to be "fork".
Similarly, the category with the highest degree of correlation to the image object "tomato" can be determined to be "fruit", and the category with the highest relevance to "lemon" is likewise "fruit".
In one optional manner, based on the image relation matrix, the image objects "knife", "tomato", and "lemon" may all be treated as first image objects at the same time, in which case the target image categories with the highest degree of correlation to the image objects in the first sample image are determined to be "fork" and "fruit".
In yet another optional manner, based on the above image relation matrix, the image object closest to the image category of the first sample image is selected from "knife", "tomato", and "lemon" as the first image object. Assuming that this is "knife", the target image category with the highest degree of correlation to "knife" may be determined to be "fork".
In still another optional manner, when only one image object is recognized in the first sample image, for example only "knife", the category of the segmented image with the highest correlation to that unique image object may be directly determined as the target image category.
In the case of determining the target image category in one of the above ways, the target segmented image may be further determined from the segmented image set according to the target image category.
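Concretely, the selection in S2 and S3 reduces to a row-wise argmax over the relation matrix. In the sketch below, the "knife" row mirrors fig. 3, while the "tomato" row is invented for illustration:

```python
import torch

seg_categories = ["fork", "dining table", "football", "fruit", "basketball"]
relation = torch.tensor([
    [0.90, 0.80, 0.02, 0.50, 0.02],   # image object "knife" (values from fig. 3)
    [0.10, 0.30, 0.05, 0.95, 0.05],   # image object "tomato" (illustrative)
])

best = relation.argmax(dim=1).tolist()       # S2: highest correlation per object
print([seg_categories[i] for i in best])     # ['fork', 'fruit']
```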
According to this embodiment of the present application, the image relation matrix included in the first recognition result is acquired; the target image category with the highest degree of correlation to the image category of the first image object is determined according to the image relation matrix; and a segmented image belonging to the target image category is determined from the segmented image set as the target segmented image. In other words, based on the image relation matrix included in the first recognition result obtained by recognizing the first sample image, a segmented image similar in category to the image objects included in the first sample image is selected as the target segmented image and added to the first sample image. This adds related recognizable objects and object labels to the first sample image to produce the second sample image, so that training the image recognition model on the second sample image improves its ability to recognize multiple labels of an image.
As an optional implementation, after the segmented image belonging to the target image category is determined from the segmented image set as the target segmented image, the method further includes:
S1, adding the target segmented image to a target position in the first sample image to obtain a second sample image;
S2, recognizing the second sample image by using the second image recognition model to obtain a second recognition result;
S3, determining the first image label of the first sample image and the segmented image label of the target segmented image as the second image label of the second sample image, and determining the second training loss corresponding to the second recognition result according to the second recognition result and the second image label.
In the above embodiment of the present application, the target position may be selected by determining, according to the object positions of the image objects included in the first sample image, a blank area excluding the image objects as the target position; in another optional manner, where the first image object included in the first sample image has been determined, an image position excluding the first image object may be determined as the target position.
It will be appreciated that, after the target position is determined, the target segmented image may be added at that position, and the first sample image with the added segmented image is determined as the second sample image. Since the target segmented image is obtained from the segmented image set, it can be matched with a corresponding image label, and the image label of the second sample image may be updated to the combination of the image label of the first sample image and the image label of the segmented image. For example, where the first sample image includes the image object "knife" and has the image label "knife", after the segmented image "fork" is added to the first sample image, the image label of the resulting second sample image may be determined as "knife", "fork". Further, once the image label of the second sample image is obtained in this way, the second training loss may be determined according to the image recognition result of the second sample image; specifically, the probability that the second image recognition model assigns to the label "knife" and the probability it assigns to the label "fork" may be acquired, and the corresponding second training loss is determined from these two probability values.
With the above embodiment of the present application, the second sample image is obtained by adding the target segmented image to the target position in the first sample image; the second sample image is recognized by the second image recognition model to obtain the second recognition result; the first image label of the first sample image and the segmented image label of the target segmented image are determined as the second image label of the second sample image; and the second training loss corresponding to the second recognition result is determined according to the second recognition result and the second image label. Adding related recognizable objects and object labels to the first sample image to obtain the second sample image means that training the second image recognition model on the second sample image improves the model's ability to recognize multiple labels of an image.
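A short sketch of the "knife"/"fork" example above: the second image label is the union of the two labels, and the second training loss is a multi-label sigmoid loss; the class order and logits are illustrative.

```python
import torch
import torch.nn.functional as F

classes = ["knife", "fork", "football"]
y_first = torch.tensor([1., 0., 0.])              # first sample image: "knife"
y_seg = torch.tensor([0., 1., 0.])                # target segmented image: "fork"
y_second = torch.clamp(y_first + y_seg, max=1.0)  # second image label: knife + fork

logits = torch.tensor([[2.0, 0.4, -3.0]])         # stand-in for the model output
loss_2 = F.binary_cross_entropy_with_logits(logits, y_second.unsqueeze(0))
print(loss_2.item())
```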
As an optional implementation, acquiring the first recognition result obtained after the first image recognition model recognizes the first sample image includes:
S1, acquiring first sample image features of the first sample image through the feature extraction network of the first image recognition model;
S2, performing feature compression on the first sample image features according to the attention matrix to obtain an image relation matrix;
S3, performing tensor compression processing on the image relation matrix and inputting the result into the fully connected layer to obtain a category recognition result of the first sample image, where the category recognition result indicates the image category of the first sample image, and the first recognition result includes the image relation matrix and the category recognition result.
As an optional implementation, after the tensor compression processing is performed on the image relation matrix and the result is input into the fully connected layer to obtain the category recognition result of the first sample image, the method further includes:
S1, acquiring a first image label corresponding to the first sample image;
S2, determining the first training loss according to the first image label and the category recognition result.
In this embodiment, the first image recognition model may employ ResNet-50 as the feature extraction network; the model structure of ResNet-50 may be as shown in fig. 4. The above manner is described in detail below in connection with ResNet-50.
In this embodiment, a single-label dataset may first be used to train the backbone network into a model with strong overall image recognition; the model then needs to extract the local crops (mattes) of each category and the related relation matrix (i.e., the image relation matrix).
The relation matrix is obtained as follows: after the last block of the ResNet-50 shown in fig. 4, a relation matrix module is added, as shown in fig. 5, to perform feature compression on the image features, that is, to compress the w/16×h/16×1024 tensor to w/16×h/16×C (where C is the number of categories), and then to apply attention to the channel and spatial dimensions simultaneously, obtaining a final C×C tensor. Attention may be added to the channel and spatial dimensions simultaneously as follows: as shown in fig. 5, the obtained w/16×h/16×C features are first input into a channel attention module, which adds attention weight parameters over the channels to obtain a first reference tensor compression result; this first result is then input into a spatial attention module, which adds attention weight parameters over the spatial positions to obtain a second reference tensor compression result.
Through the above embodiment, the C×C image relation matrix tensor obtained after feature compression can be determined. After the C×C tensor is acquired, a tensor flattening operation (flatten) is performed, and a fully connected (fc) layer is attached to obtain the category recognition result of the whole image; the first training loss is then obtained through a sigmoid loss from the whole-image recognition result and the image label.
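The following is a hedged sketch of this relation-matrix module. The patent does not spell out how the w/16×h/16×C tensor collapses to C×C after the two attention steps; a Gram matrix over spatial positions is used here as one plausible reading, and the attention blocks follow the common CBAM-style pattern rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class RelationMatrixHead(nn.Module):
    def __init__(self, in_channels=1024, num_classes=20):
        super().__init__()
        # compress w/16 x h/16 x 1024 to w/16 x h/16 x C
        self.compress = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # channel attention (squeeze-and-excitation style, an assumption)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(num_classes, num_classes, kernel_size=1),
            nn.Sigmoid())
        # spatial attention (single-channel mask, an assumption)
        self.spatial_att = nn.Sequential(
            nn.Conv2d(num_classes, 1, kernel_size=7, padding=3),
            nn.Sigmoid())
        self.fc = nn.Linear(num_classes * num_classes, num_classes)

    def forward(self, feats):                        # feats: B x 1024 x h x w
        x = self.compress(feats)                     # B x C x h x w
        x = x * self.channel_att(x)                  # attention on channels
        x = x * self.spatial_att(x)                  # attention on space
        b, c, h, w = x.shape
        flat = x.reshape(b, c, h * w)
        relation = torch.bmm(flat, flat.transpose(1, 2)) / (h * w)  # B x C x C
        logits = self.fc(relation.flatten(1))        # flatten + fc layer
        return logits, relation

head = RelationMatrixHead()
logits, relation = head(torch.randn(2, 1024, 14, 14))
print(logits.shape, relation.shape)   # [2, 20] and [2, 20, 20]
```

A sigmoid loss over `logits` against the whole-image label then yields the first training loss described above.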
It will be appreciated that the first training loss may serve different purposes at different training stages. For example, in the initial training stage, obtaining the first training loss in the above manner allows the first image recognition model to be trained into an image recognition model with strong recognition of whole-image categories.
On the basis of a first image recognition model that already recognizes whole-image categories well, the first training loss can be further obtained from image samples in the above manner, so that the first and second image recognition models are jointly trained; this further improves the first model's whole-image recognition capability while synchronously improving the second model's ability to recognize multi-label images.
With the above embodiment of the present application, the first sample image features of the first sample image are acquired through the feature extraction network of the first image recognition model; feature compression is performed on the first sample image features according to the attention matrix to obtain the image relation matrix; and the image relation matrix is tensor-compressed and input into the fully connected layer to obtain the category recognition result of the first sample image. In this way, after the overall image features are obtained by the first image recognition model, the relation matrix indicating the association between the image objects and the other segmented image categories is obtained first, and the category recognition result is then obtained through tensor processing and the fully connected layer. Training the first image recognition model's category recognition capability while simultaneously training its ability to output the image relation matrix thus improves the efficiency of model training.
As an optional embodiment, performing joint training on the first image recognition model and the second image recognition model according to the first training loss corresponding to the first recognition result and the second training loss corresponding to the second recognition result includes:
S1, obtaining a third recognition result obtained after a third image recognition model recognizes a reference segmented image, where the reference segmented image is a segmented image corresponding to a first image object included in the first sample image, and the third recognition result indicates the image category of the reference segmented image;
S2, determining a third training loss according to the third recognition result and the image label corresponding to the reference segmented image;
S3, acquiring a target training loss according to the weighted sum of the first training loss, the second training loss, and the third training loss;
S3-1, respectively adjusting the model parameters of the first, second, and third image recognition models when the target training loss does not satisfy the convergence condition;
S3-2, determining the trained second image recognition model as the target image recognition model when the target training loss satisfies the convergence condition.
It may be appreciated that in this embodiment, a third image recognition model for recognizing the reference segmented image obtained by segmentation from the first sample image is further introduced, and a third training loss is determined according to the recognition result of the third image recognition model, so that the first, second and third image recognition models are jointly trained through the first, second and third training losses.
In the above embodiment, after the segmented image corresponding to the local image object is obtained from the first sample image, it is input into the third image recognition model, and the third image recognition model is trained according to the image label of the segmented image, so that the model acquires recognition capability for each local category. Because the category information obtained in this way is more accurate, interference from irrelevant information can be effectively avoided, further reducing misjudgments on irrelevant categories.
In this embodiment, the third image recognition model may be an image recognition model obtained by training based on the first model, or a new image recognition model having the same structure as the first image recognition model.
Alternatively, in the case where the third image recognition model has the same structure as the first image recognition model, the segmented image input into the third image recognition model may be resized to the same size as the original image from which it was segmented.
For example, when the size of the original image is 500px×500px and the size of the segmentation map containing the image object obtained from the original image is 50px×50px, the 50px×50px segmentation map may be expanded by image upsampling into a reference segmentation map of 500px×500px, which is then input into the third image recognition model for segmentation-map recognition. It can be understood that, since the third image recognition model has the same structure as the first image recognition model, it processes the pixel matrix of the segmentation map in the same way as the first image recognition model processes the pixel matrix of the original image; processing the segmentation map to the same image size as the original image in this preprocessing step therefore helps ensure the accuracy of feature extraction and of the recognition result.
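The resizing step is ordinary bilinear upsampling; a short sketch, assuming a PyTorch tensor layout:

```python
import torch.nn.functional as F

def resize_to_original(crop, original_hw=(500, 500)):
    """Upsample a cropped segmentation map (e.g. 50x50) to the original
    image size (e.g. 500x500) before feeding it to the third model.
    crop: (B, 3, h, w) pixel tensor."""
    return F.interpolate(crop, size=original_hw,
                         mode="bilinear", align_corners=False)
```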
The above embodiment is described below with reference to fig. 6. As shown in fig. 6, for a first sample image, graph A (corresponding to one image tag), graph A may be input into the first model to obtain feature A; a relation matrix is obtained through a tensor operation; a first recognition result for graph A is obtained through further processing of the relation matrix; and a first training loss is then determined according to the tag of graph A;
a target segmentation image, segmentation image C, is then determined from the segmentation image set according to the relation matrix obtained in the above step; graph A and segmentation image C are synthesized to obtain a second sample image B, the second model is trained with image B, and a second training loss is correspondingly determined according to the second recognition result and the labels of images A and C;
the third model may then be trained with segmentation map a (corresponding to an image object in graph A) segmented from graph A, and a corresponding third training loss is determined according to the third recognition result output by the third model. Finally, the three models are jointly trained according to the first, second and third training losses.
In an alternative embodiment, the first image recognition model may be trained first to obtain an image recognition model with better overall image category recognition capability;
then, the sample image is segmented according to the recognition result of the first image recognition model to obtain segmented images, and a third image recognition model is trained on the segmented images to obtain a model with better segmented-image recognition capability, where the third image recognition model may be an image recognition model obtained by training based on the first model, or a new image recognition model with the same structure as the first image recognition model;
the second image recognition model is then trained according to the recognition results of the first and third image recognition models on the sample images, yielding an image recognition model with strong multi-label recognition capability;
finally, the first, second and third image recognition models are jointly trained according to the above method, synchronously improving the whole-image recognition capability of the first image recognition model, the segmented-image recognition capability of the third image recognition model, and the multi-label output capability of the second image recognition model.
In the above embodiment of the present application, the third recognition result obtained by the third image recognition model recognizing the reference segmented image is acquired; a third training loss is determined according to the third recognition result and the image label corresponding to the reference segmented image; the target training loss is obtained according to the weighted summation result of the first, second and third training losses; the model parameters of the first, second and third image recognition models are respectively adjusted when the target training loss does not meet the convergence condition; and the trained second image recognition model is determined as the target image recognition model when the target training loss meets the convergence condition. By synchronously improving the capability of the first image recognition model to recognize whole-image categories, the capability of the third image recognition model to recognize local image categories, and the capability of the second image recognition model to recognize the multiple image categories of multiple image objects, the accuracy of the recognition result output by the image recognition model is improved.
As an optional embodiment, the obtaining the target training loss according to the weighted sum result of the first training loss, the second training loss and the third training loss includes:
s1, acquiring a first reference loss and a second reference loss, wherein the first reference loss is used for indicating the similarity between the image relation matrix and an annotation relation matrix, the annotation relation matrix being a relation matrix output by a semantic relation model for the image categories respectively corresponding to the plurality of segmented images; and the second reference loss is used for indicating the similarity between first sample image features output by the second image recognition model and joint image features, the joint image features being a weighted result of the second sample image features output by the second image recognition model according to the second sample image and the segmented image features output by the third image recognition model according to the reference segmented image;
s2, acquiring a first weight matched with the first reference loss and a second weight matched with the second reference loss;
and S3, determining the sum of the first training loss, the second training loss, the third training loss, the product of the first weight and the first reference loss and the product of the second weight and the second reference loss as a target training loss.
It will be appreciated that in this embodiment a first reference loss and a second reference loss are further introduced, and the three image recognition models are jointly trained by combining these with the first, second and third training losses.
The first reference loss indicates the accuracy of the image relation matrix included in the recognition result output by the first image recognition model for the first sample image. Training the feature extraction module and the tensor processing module of the first image recognition model through the first reference loss improves the accuracy of the image relation matrix output by the model while improving its whole-image recognition capability; an accurate image relation matrix in turn improves the sample quality of the synthesized second sample image.
In addition, the second reference loss indicates the similarity between the first sample image feature obtained by the second image recognition model recognizing the first sample image and the joint image feature, where the joint image feature is a weighted result of the segmented image feature obtained by the third image recognition model recognizing the reference segmented image and the second sample image feature obtained by the second image recognition model recognizing the synthesized second sample image.
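Assembling everything, the target training loss is a weighted sum; a minimal sketch, assuming the α = 0.2 and β = 0.5 weights given later in this description and purely illustrative variable names:

```python
def target_training_loss(l_ocls, l_gcls, l_lcls, l_rcls, l_tcls,
                         alpha=0.2, beta=0.5):
    """Sum of the first (whole-image), second (multi-label) and third
    (local) training losses plus the weighted first reference loss
    (relation-matrix constraint) and second reference loss (feature
    guidance). The three models are updated from this scalar until the
    convergence condition is met."""
    return l_ocls + l_gcls + l_lcls + alpha * l_rcls + beta * l_tcls
```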
As an alternative embodiment, the acquiring the first reference loss includes:
s1, acquiring a first character feature corresponding to a first image tag and reference character features respectively corresponding to image categories corresponding to a plurality of divided images;
s2, inputting the first character features and the reference character features into a semantic relation model after splicing to obtain an annotation relation matrix;
s3, determining a first reference loss according to the annotation relation matrix and the image relation matrix.
In this embodiment, the annotation relation matrix may be determined from the class relation features output by the BERT network as shown in fig. 7, and the first reference loss is obtained by computing a loss between the annotation relation matrix output by the BERT network and the image relation matrix. Specifically, BERT may be introduced during training to constrain the relation matrix output by the image recognition model: the loss measures the difference between each category relation matrix used in training and the obtained relation matrix, and optionally an MSE loss may be used.
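One way to realize this constraint is sketched below; the Hugging Face BERT encoder, the checkpoint name, the pooled-output embedding, and the cosine-similarity construction of the annotation relation matrix are all assumptions made for illustration:

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def annotation_relation_matrix(class_names):
    """Embed every class label text with BERT and build a C x C
    similarity matrix serving as the annotation relation matrix."""
    toks = tokenizer(class_names, padding=True, return_tensors="pt")
    emb = bert(**toks).pooler_output          # (C, hidden)
    emb = F.normalize(emb, dim=-1)
    return emb @ emb.T                        # (C, C)

def first_reference_loss(pred_relation, anno_relation):
    # MSE between the predicted image relation matrix and the annotation one
    return F.mse_loss(pred_relation, anno_relation)
```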
As an alternative embodiment, obtaining the second reference loss comprises:
s1, obtaining second sample image characteristics output by a second image recognition model according to a second sample image and segmented image characteristics output by a third image recognition model according to a reference segmented image;
S2, determining a joint image feature according to a weighted summation result of the second sample image feature and the segmentation image feature;
s3, acquiring first sample image characteristics output by the second image recognition model according to the first sample image;
s4, determining a second reference loss according to the characteristic difference between the first sample image characteristic and the joint image characteristic.
In this embodiment, since the original image has only a single label, the features obtained after joint training on the local matting and on the image generated using the category relation are better than the image features output by the original backbone network. Therefore, the image features of the local matting (i.e., the reference segmented image) and of the generated image (i.e., the second sample image) are summed by weight to obtain new features that guide the backbone network (i.e., the second image recognition model), where each dimension c of such an image feature indicates the predicted value that an image object included in the image belongs to the c-th category among the image categories in the segmented sample gallery.
Specifically, the above joint image feature may be acquired as

$$h^{c}(x) = \sum_{k=1}^{K} \pi_k \,\hat{y}_k^{c}(x)$$

where x denotes a picture, y is a label value, and K is the number of sub-models, here equal to 2, i.e., the third image recognition model recognizing the reference segmented image and the second image recognition model recognizing the second sample image in the above process. C is the number of categories among the pictures in the segmented image set. $\pi_k$ is the weight of each sub-model; the feature weight corresponding to the local model (i.e., the third image recognition model recognizing the reference segmented image) is taken as 0.2, and the feature weight corresponding to the generation model (i.e., the second image recognition model recognizing the second sample image) is taken as 0.8. $\hat{y}_k^{c}(x)$ is the single-value prediction of category c by the corresponding sub-model, and $h^{c}(x)$ is the weighted category-c prediction; obtaining all C such values in the same way yields the C-dimensional feature vector h.
The guiding loss between the finally constructed features may then, for example, take an MSE form:

$$L_{tcls} = \lVert h_1 - h_2 \rVert_2^2$$

where $h_1$ is the synthesized (joint) feature vector and $h_2$ is the feature vector obtained by the second image recognition model recognizing the first sample image.
Through the above embodiment of the present application, the first reference loss and the second reference loss are acquired; the first weight matched with the first reference loss and the second weight matched with the second reference loss are acquired; and the sum of the first training loss, the second training loss, the third training loss, the product of the first weight and the first reference loss, and the product of the second weight and the second reference loss is determined as the target training loss. By combining multiple training losses to jointly train the three image recognition models, the feature extraction capability and the multi-label recognition capability of the second image recognition model are improved as a whole.
As an optional implementation manner, after the acquiring the first recognition result obtained after the image recognition model recognizes the first sample image, the method further includes:
s1, acquiring first sample image features of a first sample image extracted according to a first image recognition model;
s2, compressing the image features of the first sample through an activation function to obtain a feature thermal matrix, wherein the feature thermal matrix is used for indicating the image position of an image object in the first sample image;
s3, acquiring a reference segmentation image from the first sample image according to the characteristic thermal matrix;
and S4, adding the reference segmented image to the segmented image set.
It can be appreciated that in this embodiment, after the first sample image is identified in the above manner, a segmented image corresponding to the image object in the first sample image is obtained, and the segmented image is added to the segmented image set, so that the segmented image set is continuously expanded in the process of identifying the sample image with a single label and training the image identification model. In addition, after the reference segmentation image is acquired in the mode, the image recognition model is trained by using the reference segmentation image, so that the recognition capability of the image recognition model on the local image is improved.
Specifically, the last-block output of the resnet50 image recognition model, i.e., a tensor of w/16×h/16×1024, is compressed along the channel dimension with a sigmoid to obtain a thermodynamic map of w/16×h/16×1, and the first 100 points above the threshold are taken to obtain the circumscribed matrix map. That is, after the probability value of each pixel in the image area corresponding to the thermodynamic map is obtained (the probability value indicating the probability that the pixel is a target pixel of an image object), the 100 pixels with the highest probability values are selected, the minimum circumscribed rectangle of the thermodynamic map covering these 100 pixels is obtained, and the pixel matrix corresponding to this minimum circumscribed rectangle is determined as the circumscribed matrix map.
The corresponding position on the original image is then recovered by bilinear interpolation to obtain the corresponding matting. If there are multiple circumscribed matrix maps, each is cut out of the original image separately. A segmentation gallery is opened up for each category, and the high-resolution segmentation maps of the 100 points taken from the original image are kept so that pixel-level pictures on the original image can be obtained later for the third-step operation; these segmentation maps need to be updated in real time with each training period (epoch).
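A rough sketch of this heatmap-to-matting step; compressing the 1024 channels by averaging before the sigmoid, as well as the exact tensor shapes, are assumptions made for illustration:

```python
import torch

def extract_matting(feat, image, top_k=100):
    """feat: (1, 1024, h, w) last-block output; image: (1, 3, H, W).
    Build an (h, w) heatmap via sigmoid, take the top-100 points, and
    cut their minimum circumscribed rectangle out of the original image."""
    heat = torch.sigmoid(feat.mean(dim=1))[0]            # (h, w) heatmap
    idx = heat.flatten().topk(top_k).indices
    ys, xs = idx // heat.shape[1], idx % heat.shape[1]
    y0, y1 = ys.min().item(), ys.max().item() + 1        # bounding rows
    x0, x1 = xs.min().item(), xs.max().item() + 1        # bounding cols
    sy = image.shape[2] // heat.shape[0]                 # stride back to
    sx = image.shape[3] // heat.shape[1]                 # full resolution
    return image[:, :, y0 * sy:y1 * sy, x0 * sx:x1 * sx]
```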
With the above-described embodiment of the present application, the first sample image features of the first sample image extracted according to the first image recognition model are acquired; compressing the first sample image features through an activation function to obtain a feature thermodynamic matrix; acquiring a reference segmentation image from the first sample image according to the characteristic thermal matrix; the reference segmented image is added to the segmented image set, thereby achieving segmentation of the reference segmented image and expansion of the segmented image set.
As an optional implementation manner, the acquiring the reference segmented image from the first sample image according to the feature thermal matrix includes:
in the case where a plurality of candidate segmented images are determined from the feature thermal matrix, the following operations are repeatedly performed until a reference segmented image is determined from the plurality of candidate segmented images:
acquiring a candidate segmented image from a plurality of candidate segmented images as a current candidate segmented image; obtaining a plurality of segmented image categories corresponding to the segmented images included in the segmented image set; inputting the image characteristics of the current candidate segmented image and the text characteristics of the segmented image categories into an image comparison model to obtain a plurality of candidate image categories matched with the current candidate segmented image; determining that the current candidate segmented image is a reference segmented image in the case that the plurality of candidate image categories include an image category corresponding to the first sample image; in the case where the image class corresponding to the first sample image is not included in the plurality of candidate image classes, the next candidate segmented image is acquired.
Specifically, in acquiring the local mattings in the above process, considering that the thermodynamic map obtained from one and the same picture may yield multiple circumscribed matrix maps, a category label needs to be assigned to each after it is cut out of the original image. At this time, the existing large-scale text-image model CLIP can be used: as shown in fig. 8, the different circumscribed matrix maps are input into CLIP respectively, and the top-5 output text labels are matched against the original label of the picture. If a corresponding label exists, the corresponding box is kept; otherwise the matting is discarded.
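A sketch of this matching step, assuming the open-source openai/CLIP package and an illustrative prompt template:

```python
import clip
import torch

model, preprocess = clip.load("ViT-B/32")

@torch.no_grad()
def matches_original_label(crop_pil, class_names, original_label, top_k=5):
    """Keep a candidate matting only if the picture's original label
    appears among the CLIP top-5 text matches for the crop."""
    image = preprocess(crop_pil).unsqueeze(0)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names])
    logits_per_image, _ = model(image, text)
    top = logits_per_image[0].topk(top_k).indices.tolist()
    return original_label in [class_names[i] for i in top]
```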
A complete training process of the present application is described below in conjunction with fig. 9.
In this embodiment, the first image recognition model may be trained first to obtain an image recognition model with better whole-image category recognition capability; specifically, a resnet50 network may be used as the backbone network and first trained alone on the single-label dataset, so that the backbone becomes a model with strong whole-image recognition capability;
then, the sample image is segmented according to the recognition result of the first image recognition model to obtain segmented images, and a third image recognition model is trained on the segmented images to obtain a model with better segmented-image recognition capability, where the third image recognition model may be an image recognition model obtained by training based on the first model, or a new image recognition model with the same structure as the first image recognition model;
the second image recognition model is then trained according to the recognition results of the first and third image recognition models on the sample images, yielding an image recognition model with strong multi-label recognition capability;
finally, the first, second and third image recognition models are jointly trained according to the above method, synchronously improving the whole-image recognition capability of the first image recognition model, the segmented-image recognition capability of the third image recognition model, and the multi-label output capability of the second image recognition model.
The above joint training process can be shown in fig. 9, in which the first sample image, graph A (corresponding to one image tag), is input into the first recognition model to obtain feature A; a relation matrix is obtained through the tensor operation of the relation matrix module; a first recognition result of graph A is then obtained through further processing of the relation matrix; and a first training loss $L_{ocls}$ is determined according to the tag of graph A.
Specifically, as shown in fig. 5, a relation matrix module is added after the last block of the resnet50: the tensor of w/16×h/16×1024 is compressed into w/16×h/16×C (where C is the number of categories), and attention is then applied to channels and spatial positions simultaneously to obtain the final C×C tensor; after the C×C tensor is obtained, a tensor flattening operation (tensor flatten) is performed, the fully connected layer (fc layer) is accessed, and the first training loss $L_{ocls}$ is determined using the activation-function (sigmoid) loss.
At the same time, a first reference loss $L_{rcls}$ is determined according to the relation matrix. Specifically, considering that the relation matrix map of an existing picture is not necessarily rich enough, BERT is introduced during training to constrain the relation matrix map; the specific loss $L_{rcls}$ corresponds to the difference between each category relation matrix map used in training with BERT and the obtained relation matrix map, and an MSE loss may be used.
Then, segmentation map a may be further cut out according to the extracted feature A, and the image recognition model is trained with segmentation map a; the model used for local feature recognition is named the third recognition model in the current training process. A corresponding third training loss $L_{lcls}$ is then obtained.
Specifically, segmentation map a may be obtained from graph A as follows: the last-block output of the resnet50, i.e., the picture tensor of w/16×h/16×1024, is compressed with a sigmoid along the channel dimension to obtain a thermodynamic map of w/16×h/16×1, the first 100 points above the threshold are taken, and the circumscribed matrix map is extracted. The corresponding position on the original image is then recovered by bilinear interpolation to obtain the corresponding matting. If there are multiple circumscribed matrix maps, each is cut out of the original image separately; a segmentation gallery is opened up for each category, and the high-resolution segmentation maps of the 100 points taken from the original image are kept for later pixel-level operations in the third step; these segmentation maps need to be updated in real time with each training period (epoch).
After the local matting is obtained in the first step, it is simultaneously input into the existing resnet50 model, giving the model recognition capability for each local category. Because the category information obtained in this way is more accurate, interference from irrelevant information can be effectively avoided, further reducing misjudgments on irrelevant categories. The loss function used is the same sigmoid loss as in whole-image recognition. At this point the whole image and each local category need to be trained simultaneously; the loss is the difference between the category randomly assigned to each matting and the predicted value, i.e., the cross-entropy loss $L_{lcls}$.
In this embodiment, when the third image recognition model has the same structure as the first image recognition model, the segmented image input into the third image recognition model may be resized to the same size as the original image from which it was segmented.
For example, in fig. 9, assuming the size of original image A is 500px×500px and the size of segmentation map a containing the image object (e.g., the image object included in segmentation map a in fig. 9 is a "fork") obtained from original image A is 50px×50px, the 50px×50px segmentation map a may be expanded by image upsampling into a reference segmentation map of 500px×500px, and the pixel matrix corresponding to the reference segmentation map is then input into the third image recognition model for segmentation-map recognition.
Further, once the relation matrix C is obtained and the local matting positions with their corresponding single categories exist, the multi-label output capability of the image recognition model can be trained. A target segmentation image, segmentation image C, is determined from the segmentation image set according to the relation matrix obtained in the above step; image A and segmentation image C are synthesized to obtain a second sample image B; image B is then input into the recognition model (named the second recognition model at this stage); and a second training loss is correspondingly determined according to the second recognition result and the labels of images A and C.
For example, when the label "fork" and its corresponding thermodynamic-map information have been acquired in the sample image: since the highest-threshold entry in the row of "fork" in the obtained C×C relation matrix is "knife", a segmentation map of that category may be taken at random from the segmentation gallery acquired by local matting in the above step and attached to the picture, and the label of the picture corrected to knife and fork. The training loss at this time is the multi-label sigmoid loss, i.e., the second training loss $L_{gcls}$.
In this embodiment, when the second image recognition model has the same structure as the first image recognition model, the composite image input into the second image recognition model may have the same size as the original image.
For example, in fig. 9, when the original image has a size of 500px×500px and segmented image C is obtained from the segmentation gallery, segmented image C may be appropriately up- or down-sampled to a suitable size and added to image A at a suitably chosen position, yielding an image B of the same size as the original image A; the pixel matrix corresponding to image B is then input into the second image recognition model.
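The synthesis of image B can be as simple as pasting the resized crop at a chosen position; a PIL-based sketch in which the paste scale and the random position are illustrative choices:

```python
import random
from PIL import Image

def synthesize_sample(image_a, seg_c, scale=0.3):
    """Paste segmented image C onto first sample image A, producing a
    second sample image B with the same size as A."""
    b = image_a.copy()
    w, h = b.size
    tw, th = int(w * scale), int(h * scale)
    seg = seg_c.resize((tw, th), Image.BILINEAR)
    x = random.randint(0, w - tw)                 # random paste position
    y = random.randint(0, h - th)
    b.paste(seg, (x, y))
    return b                                      # label: A's tag + C's tag
```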
Meanwhile, the model can be trained through the second reference loss. Considering that the original image has only a single label, the features obtained after joint training on the local matting and on the image generated using the category relation are better than those of the original backbone network; the features obtained from the local matting and from the generated image are therefore combined by weight into new features that guide the backbone network.
Specifically, the joint image feature may first be acquired as

$$h^{c}(x) = \sum_{k=1}^{K} \pi_k \,\hat{y}_k^{c}(x)$$

where x denotes the picture, y is the label value, and K is the number of sub-models, here equal to 2, i.e., the image recognition model recognizing the reference segmented image and the image recognition model recognizing the second sample image in the above process. C is the number of categories among the pictures in the segmented image set. $\pi_k$ is the weight of each sub-model; the feature weight corresponding to the local model (i.e., the image recognition model recognizing the reference segmented image) is taken as 0.2, and the feature weight corresponding to the generation model (i.e., the image recognition model recognizing the second sample image) is taken as 0.8. $\hat{y}_k^{c}(x)$ is the single-value prediction of category c by the corresponding sub-model, and $h^{c}(x)$ is the weighted category-c prediction; all C such values obtained in the same way form the C-dimensional feature vector h.
The guiding loss between the finally constructed features may then, for example, take an MSE form:

$$L_{tcls} = \lVert h_1 - h_2 \rVert_2^2$$

where $h_1$ is the synthesized feature vector and $h_2$ is the feature vector obtained by the image recognition model recognizing the first sample image.
In the case where each training loss is obtained in the above manner, the final training loss is as follows:
$$L_{total} = L_{lcls} + L_{ocls} + L_{gcls} + \alpha L_{rcls} + \beta L_{tcls}$$
where the weight values may be chosen as α = 0.2 and β = 0.5.
According to the above whole-to-local article recognition method, the single-label dataset information is first used to let the model learn overall recognition capability; then, on the basis of the existing model, thermodynamic-map information of each category is extracted from the local information in a targeted way for single-label classification, while the relations among labels are modeled through the overall information. On the basis of guaranteeing overall recognition capability, the model's recognition capability for each category on local information and the dependency relations among categories are fused, thereby improving the model's overall recognition capability.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
According to another aspect of the embodiment of the present invention, there is also provided a training device for an image recognition model for implementing the training method for an image recognition model. As shown in fig. 10, the apparatus includes:
a first recognition unit 1002, configured to obtain a first recognition result obtained after the first image recognition model recognizes the first sample image, where the first recognition result is used to indicate an image category of the first sample image;
a determining unit 1004, configured to determine a target segmented image corresponding to the first recognition result from a segmented image set, where the segmented image set includes a plurality of segmented images obtained by image segmentation of a plurality of sample images, each sample image includes at least one image object, and each segmented image corresponds to one image object;
A second recognition unit 1006, configured to recognize, in a second image recognition model, a second sample image synthesized based on the target segmented image and the first sample image, to obtain a second recognition result, where the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmented image;
and a training unit 1008, configured to perform joint training on the first image recognition model and the second image recognition model according to a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result, and determine the trained second image recognition model as the target image recognition model when the first training loss and the second training loss satisfy the convergence condition.
Alternatively, in this embodiment, the embodiments to be implemented by each unit module may refer to the embodiments of each method described above, which are not described herein again.
According to still another aspect of the embodiment of the present invention, there is further provided an electronic device for implementing the training method of the image recognition model, where the electronic device may be a terminal device or a server as shown in fig. 11. The present embodiment is described taking the electronic device as a terminal device as an example. As shown in fig. 11, the electronic device comprises a memory 1102 and a processor 1104, the memory 1102 having stored therein a computer program, the processor 1104 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, acquiring a first recognition result obtained after a first sample image is recognized by a first image recognition model, wherein the first recognition result is used for indicating the image category of the first sample image;
s2, determining a target segmentation image corresponding to a first recognition result from a segmentation image set, wherein the segmentation image set comprises a plurality of segmentation images obtained by carrying out image segmentation on a plurality of sample images, each sample image at least comprises an image object, and each segmentation image corresponds to one image object;
s3, in a second image recognition model, recognizing a second sample image synthesized based on the target segmentation image and the first sample image to obtain a second recognition result, wherein the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmentation image;
And S4, performing joint training on the first image recognition model and the second image recognition model according to the first training loss corresponding to the first recognition result and the second training loss corresponding to the second recognition result, and determining the trained second image recognition model as a target image recognition model under the condition that the first training loss and the second training loss meet convergence conditions.
Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 11 is only schematic, and the electronic device may also be a vehicle-mounted terminal, a smart phone (such as an Android mobile phone, an iOS mobile phone, etc.), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. The structure shown in fig. 11 does not limit the structure of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in fig. 11, or have a different configuration than shown in fig. 11.
The memory 1102 may be used to store software programs and modules, such as program instructions/modules corresponding to the training method and apparatus of the image recognition model in the embodiment of the present invention, and the processor 1104 executes the software programs and modules stored in the memory 1102 to perform various functional applications and data processing, that is, implement the training method of the image recognition model. Memory 1102 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 1102 may further include memory located remotely from processor 1104, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1102 may be, but is not limited to, storing file information such as a target logical file. As an example, as shown in fig. 11, the memory 1102 may include, but is not limited to, a first recognition unit 1002, a determination unit 1004, a second recognition unit 1006, and a training unit 1008 in a training apparatus including the image recognition model. In addition, other module units in the training device of the image recognition model may be further included, but are not limited to, and are not described in detail in this example.
Optionally, the transmission device 1106 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 1106 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1106 is a Radio Frequency (RF) module for communicating wirelessly with the internet.
In addition, the electronic device further includes: a display 1108, and a connection bus 1110 for connecting the various modular components of the electronic device described above.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. Among them, the nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the Peer-To-Peer network.
According to one aspect of the present application, there is provided a computer program product comprising a computer program/instructions containing program code for executing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. When executed by the central processing unit, the computer program performs the various functions provided by the embodiments of the present application.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
According to an aspect of the present application, there is provided a computer-readable storage medium, from which a processor of a computer device reads the computer instructions, the processor executing the computer instructions, causing the computer device to perform the training method of the image recognition model described above.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for performing the steps of:
s1, acquiring a first recognition result obtained after a first sample image is recognized by a first image recognition model, wherein the first recognition result is used for indicating the image category of the first sample image;
S2, determining a target segmentation image corresponding to a first recognition result from a segmentation image set, wherein the segmentation image set comprises a plurality of segmentation images obtained by carrying out image segmentation on a plurality of sample images, each sample image at least comprises an image object, and each segmentation image corresponds to one image object;
s3, in a second image recognition model, recognizing a second sample image synthesized based on the target segmentation image and the first sample image to obtain a second recognition result, wherein the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmentation image;
and S4, performing joint training on the first image recognition model and the second image recognition model according to the first training loss corresponding to the first recognition result and the second training loss corresponding to the second recognition result, and determining the trained second image recognition model as a target image recognition model under the condition that the first training loss and the second training loss meet convergence conditions.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the above-described method of the various embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and another division manner may be adopted in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (15)

1. A method for training an image recognition model, comprising:
acquiring a first recognition result obtained after a first sample image is recognized by a first image recognition model, wherein the first recognition result is used for indicating the image category of the first sample image;
Determining a target segmentation image corresponding to the first recognition result from a segmentation image set, wherein the segmentation image set comprises a plurality of segmentation images obtained by carrying out image segmentation on a plurality of sample images, each sample image at least comprises an image object, and each segmentation image corresponds to one image object;
in a second image recognition model, recognizing a second sample image synthesized based on the target segmentation image and the first sample image to obtain a second recognition result, wherein the second image recognition model is an image recognition model obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmentation image;
and performing joint training on the first image recognition model and the second image recognition model according to a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result, and determining the trained second image recognition model as a target image recognition model under the condition that the first training loss and the second training loss meet convergence conditions.
2. The method of claim 1, wherein determining a target segmented image from the set of segmented images that corresponds to the first recognition result comprises:
acquiring an image relation matrix included in the first recognition result, wherein the image relation matrix is used for indicating the correlation degree between the image category corresponding to a first image object included in the first sample image and a plurality of segmented image categories, and the segmented image categories are the image categories corresponding to the plurality of segmented images included in the segmented image set;
determining a target image category with highest image category correlation degree corresponding to the first image object according to the image relation matrix;
determining a segmented image belonging to the target image class from the segmented image set as the target segmented image.
3. The method of claim 2, wherein after determining from the set of segmented images that a segmented image belonging to the target image class is the target segmented image, further comprising:
adding the target segmentation image to a target position in the first sample image to obtain the second sample image;
Identifying the second sample image by using the second image identification model to obtain a second identification result;
and determining a first image label of the first sample image and a segmentation image label of the target segmentation image as a second image label of the second sample image, and determining the second training loss corresponding to the second recognition result according to the second recognition result and the second image label.
4. The method of claim 2, wherein the obtaining a first recognition result obtained by the first image recognition model after recognizing the first sample image includes:
acquiring first sample image features of the first sample image through a feature extraction network of the first image recognition model;
performing feature compression on the first sample image features according to the attention matrix to obtain the image relation matrix;
and inputting the image relation matrix into a full-connection layer after tensor compression processing to obtain a category identification result of the first sample image, wherein the category identification result is used for indicating the image category of the first sample image, and the first identification result comprises the image relation matrix and the category identification result.
5. The method of claim 4, wherein the inputting the image relation matrix into the full-connection layer after tensor compression processing, after obtaining the category identification result of the first sample image, further comprises:
acquiring a first image tag corresponding to the first sample image;
and determining the first training loss according to the first image tag and the category identification result.
6. The method of claim 2, wherein the jointly training the first image recognition model and the second image recognition model based on a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result comprises:
obtaining a third recognition result obtained by recognizing a reference segmented image by a third image recognition model, wherein the reference segmented image is a segmented image corresponding to the first image object included in the first sample image, and the third recognition result is used for indicating the image type of the reference segmented image;
determining a third training loss according to the third recognition result and the image label corresponding to the reference segmentation image;
Obtaining a target training loss according to the weighted summation result of the first training loss, the second training loss and the third training loss;
respectively adjusting model parameters of the first image recognition model, the second image recognition model and the third image recognition model under the condition that the target training loss does not meet a convergence condition;
and determining the second image recognition model after training as the target image recognition model under the condition that the target training loss meets a convergence condition.
7. The method of claim 6, wherein the obtaining a target training loss from a weighted sum of the first training loss, the second training loss, and the third training loss comprises:
acquiring a first reference loss and a second reference loss, wherein the first reference loss is used for indicating the similarity between the image relation matrix and an annotation relation matrix, the annotation relation matrix being a relation matrix output by a semantic relation model for the image categories respectively corresponding to the plurality of segmented images; and the second reference loss is used for indicating the similarity between first sample image features output by the second image recognition model and joint image features, the joint image features being a weighted result of the second sample image features output by the second image recognition model according to the second sample image and the segmented image features output by the third image recognition model according to the reference segmented image;
Acquiring a first weight matched with the first reference loss and a second weight matched with the second reference loss;
determining a sum of the first training loss, the second training loss, the third training loss, a product of the first weight and the first reference loss, and a product of the second weight and the second reference loss as the target training loss.
8. The method of claim 7, wherein the obtaining a first reference loss comprises:
acquiring a first character feature corresponding to the first image tag and reference character features respectively corresponding to the image categories corresponding to the plurality of segmented images;
inputting the first character features and the reference character features into a semantic relation model after splicing to obtain the annotation relation matrix;
and determining the first reference loss according to the annotation relation matrix and the image relation matrix.
9. The method of claim 7, wherein obtaining the second reference loss comprises:
acquiring the second sample image characteristics output by the second image recognition model according to the second sample image, and the segmentation image characteristics output by the third image recognition model according to the reference segmentation image;
Determining the joint image feature according to the weighted summation result of the second sample image feature and the segmentation image feature;
acquiring the first sample image characteristics output by the second image recognition model according to the first sample image;
the second reference loss is determined from a feature difference between the first sample image feature and the joint image feature.
10. The method according to any one of claims 1 to 9, further comprising, after obtaining a first recognition result obtained by the first image recognition model after recognizing the first sample image:
acquiring first sample image features of the first sample image extracted according to the first image recognition model;
compressing the first sample image features through an activation function to obtain a feature thermal matrix, wherein the feature thermal matrix is used for indicating the image position of an image object in the first sample image;
acquiring a reference segmentation image from the first sample image according to the characteristic thermal matrix;
the reference segmented image is added to the set of segmented images.
11. The method of claim 10, wherein the acquiring a reference segmented image from the first sample image based on the feature thermal matrix comprises:
In the case of determining a plurality of candidate segmented images from the feature thermal matrix, repeating the following operations until the reference segmented image is determined from the plurality of candidate segmented images:
acquiring a candidate segmented image from the plurality of candidate segmented images as a current candidate segmented image;
obtaining a plurality of segmented image categories corresponding to the segmented images included in the segmented image set;
inputting the image characteristics of the current candidate segmented image and the text characteristics of the segmented image categories into an image comparison model to obtain a plurality of candidate image categories matched with the current candidate segmented image;
determining that the current candidate segmented image is the reference segmented image if the plurality of candidate image categories include an image category corresponding to the first sample image;
and acquiring a next candidate segmented image under the condition that the image category corresponding to the first sample image is not included in the plurality of candidate image categories.
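The loop of claim 11 amounts to scanning candidates until one's matched categories contain the sample image's own category. A sketch, assuming the image comparison model behaves like a CLIP-style matcher returning one similarity score per category (the model's interface is not specified by the claim):

    def select_reference_segment(candidates,        # list of candidate segmented images
                                 candidate_feats,   # image features, one per candidate
                                 category_feats,    # text features of the segmented image categories
                                 categories,        # category names, aligned with category_feats
                                 sample_category,   # image category of the first sample image
                                 compare_model, top_k=5):
        for image, feat in zip(candidates, candidate_feats):
            # Assumed: compare_model returns a tensor of similarity scores,
            # one per segmented image category.
            scores = compare_model(feat, category_feats)
            top = [categories[i] for i in scores.argsort(descending=True)[:top_k]]
            if sample_category in top:
                return image       # reference segmented image found
        return None                # no candidate matched; caller may fall back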
12. A training device for an image recognition model, comprising:
a first recognition unit, configured to obtain a first recognition result obtained after a first sample image is recognized by a first image recognition model, wherein the first recognition result is used for indicating the image category of the first sample image;
a determining unit, configured to determine a target segmented image corresponding to the first recognition result from a segmented image set, wherein the segmented image set comprises a plurality of segmented images obtained by performing image segmentation on a plurality of sample images, each sample image comprises at least one image object, and each segmented image corresponds to one image object;
a second recognition unit, configured to recognize, in a second image recognition model, a second sample image synthesized from the target segmented image and the first sample image to obtain a second recognition result, wherein the second image recognition model is obtained by further training based on the first image recognition model, and the second recognition result indicates the image category of the first sample image and the image category of the target segmented image;
and a training unit, configured to perform joint training on the first image recognition model and the second image recognition model according to a first training loss corresponding to the first recognition result and a second training loss corresponding to the second recognition result, and to determine the trained second image recognition model as a target image recognition model in the case that the first training loss and the second training loss meet a convergence condition.
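How the four units of claim 12 might cooperate in a single training step, in a deliberately simplified PyTorch-style sketch; the two-headed output of the second model, the synthesize() helper, and the segment_set.match() lookup are hypothetical stand-ins for components the claim leaves abstract:

    def train_step(model_1, model_2, optimizer, sample, label,
                   segment_set, synthesize, criterion):
        # First recognition unit: model 1 classifies the original sample.
        logits_1 = model_1(sample)
        loss_1 = criterion(logits_1, label)
        # Determining unit: pick a segmented image matching the first result;
        # second recognition unit: classify the synthesized second sample.
        segment, seg_label = segment_set.match(logits_1.argmax(dim=-1))
        logits_main, logits_seg = model_2(synthesize(sample, segment))
        loss_2 = criterion(logits_main, label) + criterion(logits_seg, seg_label)
        # Training unit: joint training of both models on the combined loss.
        optimizer.zero_grad()
        (loss_1 + loss_2).backward()
        optimizer.step()
        return loss_1.item(), loss_2.item()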
13. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 11.
14. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 11.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any one of claims 1 to 11 by means of the computer program.
CN202310361289.9A 2023-03-29 2023-03-29 Training method and device of image recognition model, storage medium and electronic equipment Pending CN116958730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310361289.9A CN116958730A (en) 2023-03-29 2023-03-29 Training method and device of image recognition model, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116958730A 2023-10-27

Family

ID=88460743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310361289.9A Pending CN116958730A (en) 2023-03-29 2023-03-29 Training method and device of image recognition model, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116958730A (en)

Legal Events

Date Code Title Description
PB01 Publication