CN117011631A - Training method, device, equipment, medium and product of image recognition model


Info

Publication number
CN117011631A
Authority
CN
China
Prior art keywords
image
sample image
loss value
region
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211387010.6A
Other languages
Chinese (zh)
Inventor
朱城
鄢科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211387010.6A
Publication of CN117011631A
Legal status: Pending


Classifications

    • G06V 10/774: Image or video recognition or understanding using pattern recognition or machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/761: Image or video pattern matching; proximity, similarity or dissimilarity measures
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning; neural networks
    • Y02T 10/40: Climate change mitigation technologies related to transportation; internal combustion engine [ICE] based vehicles; engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method, device, equipment, medium, and product for an image recognition model, relating to the field of machine learning. The method comprises the following steps: acquiring a sample image pair comprising a positive sample image, which has specified type content and is labeled with a specified type label, and a negative sample image, which does not contain the specified type content; labeling the negative sample image with region labels; obtaining a first candidate region and a second candidate region through an image recognition model; acquiring a first loss value based on the region labels and the first candidate region; acquiring a second loss value based on the specified type label and the second candidate region; and training the image recognition model to obtain a target image recognition model. Through this method, the region-label labeling scheme improves the target image recognition model's ability to distinguish negative sample images from positive sample images, so that specified type content in an image can be recognized more accurately. The method can be applied to various scenes such as cloud technology, artificial intelligence, and intelligent traffic.

Description

Training method, device, equipment, medium and product of image recognition model
Technical Field
The embodiment of the application relates to the field of machine learning, in particular to a training method, device, equipment, medium and product of an image recognition model.
Background
With the continuous development of internet technology, social platforms play an increasingly important role in people's daily lives, and the massive volume of images, videos, live streams, and other data on these platforms has become a focus of attention for their users. To further build a safe, clear, and orderly network environment, specific images, videos, and the like containing specified content need to be screened in a timely manner.
In the related art, an image classification method is generally adopted: an image classification model is obtained by training on a large-scale data set, and specific images are distinguished from non-specific images by the image classification model, so that specific images can be conveniently screened out from the many images on a social platform.
However, some images contain specified content only in local areas. When such an image is identified as a whole, the local areas corresponding to the specified content cannot be well located, so the image is wrongly classified and specific images cannot be correctly and effectively screened.
Disclosure of Invention
The embodiment of the application provides a training method, device, equipment, medium, and product for an image recognition model, which can improve the model's ability to distinguish negative sample images from positive sample images by means of a region-label labeling scheme, so that specified type content in an image can be recognized more accurately by the target image recognition model. The technical scheme is as follows.
In one aspect, a training method of an image recognition model is provided, the method comprising:
acquiring a sample image pair, wherein the sample image pair comprises a positive sample image and a negative sample image, the positive sample image is an image with specified type content, and the region corresponding to the specified type content is pre-labeled with a specified type label; the negative sample image is an image that does not contain the specified type content, and the positive sample image and the negative sample image have a similar association relation;
labeling a plurality of image regions in the negative sample image with region labels, wherein the region labels are used for indicating that the plurality of image regions do not contain the specified type content;
performing region prediction of specified type content on the negative sample image and the positive sample image through an image recognition model to be trained, to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image;
acquiring a first loss value corresponding to the negative sample image based on the difference between the region labels and the first candidate region; acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type label and the second candidate region;
training the image recognition model based on the first loss value and the second loss value to obtain a target image recognition model, wherein the target image recognition model is used for recognizing specified type content in an image.
In another aspect, there is provided a training apparatus for an image recognition model, the apparatus comprising:
the first acquisition module is used for acquiring a sample image pair, wherein the sample image pair comprises a positive sample image and a negative sample image, the positive sample image is an image with specified type content, and the region corresponding to the specified type content is pre-labeled with a specified type label; the negative sample image is an image that does not contain the specified type content, and the positive sample image and the negative sample image have a similar association relation;
the label labeling module is used for labeling region labels for a plurality of image regions in the negative sample image, the region labels being used for indicating that the plurality of image regions do not contain the specified type content;
the region prediction module is used for performing region prediction of specified type content on the negative sample image and the positive sample image through the image recognition model to be trained, to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image;
the second acquisition module is used for acquiring a first loss value corresponding to the negative sample image based on the difference between the region labels and the first candidate region, and acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type label and the second candidate region;
the model training module is used for training the image recognition model based on the first loss value and the second loss value to obtain a target image recognition model, the target image recognition model being used for recognizing specified type content in an image.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, where the at least one instruction, the at least one program, the set of codes, or the set of instructions are loaded and executed by the processor to implement a training method for an image recognition model according to any one of the embodiments of the present application.
In another aspect, a computer readable storage medium is provided, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored, where the at least one instruction, the at least one program, the set of codes, or the set of instructions are loaded and executed by a processor to implement a training method for an image recognition model according to any one of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the training method of the image recognition model according to any one of the above embodiments.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
region labels are labeled for the plurality of image regions of the negative sample image in the sample image pair to indicate that the plurality of image regions do not contain specified type content; region prediction of specified type content is performed on the negative sample image and the positive sample image through the image recognition model to be trained to obtain a first candidate region and a second candidate region; a first loss value is acquired based on the difference between the region labels and the first candidate region; a second loss value is acquired based on the difference between the specified type label and the second candidate region; and the image recognition model is trained by means of the loss values to obtain a target image recognition model. The region-label labeling scheme improves the model's ability to distinguish negative sample images from positive sample images and reduces the interference of large numbers of negative sample images on training, so that the target image recognition model recognizes images to be recognized more accurately, identifies specified type content in images more precisely, and improves the accuracy of image filtering.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a training process for an image recognition model provided in an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a training method for an image recognition model provided by an exemplary embodiment of the present application;
FIG. 4 is a flowchart of a training method for an image recognition model provided in another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a Fast RCNN provided in an illustrative embodiment of the application;
FIG. 6 is a flowchart of a training method for an image recognition model provided in accordance with yet another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of an image recognition model provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic illustration of a target image output by a target image recognition model provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic illustration of a target image output by a target image recognition model provided by another exemplary embodiment of the present application;
FIG. 10 is a block diagram of a training apparatus for image recognition models provided in an exemplary embodiment of the present application;
FIG. 11 is a block diagram of a training apparatus for image recognition models provided in another exemplary embodiment of the present application;
fig. 12 is a block diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, a brief description will be given of terms involved in the embodiments of the present application.
Artificial intelligence (Artificial Intelligence, AI): a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML): a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
In the related art, an image classification method is generally adopted: an image classification model is obtained by training on a large-scale data set, and specific images containing specified content are distinguished from non-specific images by means of the image classification model, so that specific images can be conveniently screened out from the many images on a social platform. However, some images contain specified content only in local areas; when such an image is identified as a whole, the local areas corresponding to the specified content cannot be well located, so the image is wrongly classified and specific images cannot be correctly and effectively screened.
According to the training method of the image recognition model provided by the embodiment of the application, the model's ability to distinguish negative sample images from positive sample images can be improved by means of the region-label labeling scheme, and specified type content in an image can be recognized more accurately by means of the target image recognition model. Illustratively, when the specific image containing specified content is implemented as an image containing sensitive information, images containing sensitive information can be effectively screened by means of the trained target image recognition model.
Referring to fig. 1, an implementation environment according to an embodiment of the present application will be described, where a terminal 110 and a server 120 are involved, and the terminal 110 and the server 120 are connected through a communication network 130.
The training method of the image recognition model provided by the embodiment of the application can be implemented by the terminal 110 alone, or can be implemented by the server 120 alone, or can be implemented by the terminal 110 and the server 120 through data interaction, which is not limited in the embodiment of the application.
In some embodiments, the terminal 110 is configured to send sample images or sample image pairs to the server 120. Illustratively, an application program having an image acquisition function is installed in the terminal 110 to acquire a plurality of sample images.
Optionally, the terminal 110 sends the sample images to the server 120, which composes them into sample image pairs; alternatively, the terminal 110 composes the sample images into sample image pairs and then sends the sample image pairs to the server 120. Each sample image pair comprises a positive sample image and a negative sample image, where the positive sample image is an image with specified type content, the negative sample image is an image without the specified type content, and the positive sample image and the negative sample image have a similar association relation. In addition, the region corresponding to the specified type content in the positive sample image is labeled in advance with a specified type label. Illustratively, a positive sample image with specified type content and a negative sample image that does not contain the specified type content are combined into a sample image pair.
Optionally, taking the case where the sample image pair is analyzed by the server 120 as an example: the server 120 labels region labels on a plurality of image regions of the negative sample image in the sample image pair, performs region detection on the positive sample image and the negative sample image, and obtains a target image recognition model by means of a first loss value between the detected first candidate region corresponding to the negative sample image and the region labels and a second loss value between the second candidate region corresponding to the positive sample image and the specified type label, so that images to be subjected to specified type content recognition can be recognized through the target image recognition model.
Illustratively, when the terminal 110 needs to identify the target image to be identified by the specified type of content, the target image is sent to the server 120, the server 120 identifies the target image through the target image identification model, and the image identification result is fed back to the terminal 110. Optionally, the terminal 110 displays the image recognition result. For example: displaying the result of whether the image to be identified has the specified type of content or not; or when the specified type content exists in the image to be identified, labeling and displaying the region corresponding to the specified type content in the image to be identified.
It should be noted that the above-mentioned terminals include, but are not limited to, mobile terminals such as mobile phones, tablet computers, portable laptop computers, intelligent voice interaction devices, intelligent home appliances, and vehicle-mounted terminals, and may also be implemented as desktop computers and the like; the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
Cloud technology (Cloud technology) refers to a hosting technology that unifies serial resources such as hardware, application programs, networks and the like in a wide area network or a local area network to realize calculation, storage, processing and sharing of data. The cloud technology is based on the general names of network technology, information technology, integration technology, management platform technology, application technology and the like applied by the cloud computing business mode, can form a resource pool, and is flexible and convenient as required.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system.
FIG. 2 is a schematic diagram of a training process for an image recognition model according to an exemplary embodiment of the present application.
Optionally, after obtaining the plurality of sample images, the sample images are combined into a sample image pair, thereby obtaining a plurality of sample image pairs such as sample image pair 1, sample image pair 2, and the like. Wherein the sample image pair 1 is a sample image pair composed of a positive sample image (sample image 1) having a specified type of content and a negative sample image (sample image 2) not having a specified type of content; sample image pair 2 is a sample image pair composed of a positive sample image (sample image 3) having a specified type of content and a negative sample image (sample image 4) not having a specified type of content. Further, the negative sample image and the positive sample image have a similar association relationship therebetween.
For each positive sample image with specified type content, the region corresponding to the specified type content is labeled with a specified type label. For each negative sample image without specified type content, a plurality of image regions in the negative sample image are labeled with region labels, where the region labels are used for indicating that the plurality of image regions do not contain the specified type content.
Optionally, taking the analysis of sample image pair 1 as an example, region prediction of specified type content is performed on the negative sample image and the positive sample image in sample image pair 1 through the image recognition model to be trained, to obtain the first candidate region 210 corresponding to the negative sample image and the second candidate region 220 corresponding to the positive sample image.
Acquiring a first loss value corresponding to the negative sample image based on the difference between the region label marked for the negative sample image and the first candidate region 210; and acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type label corresponding to the region of the specified type content in the positive sample image and the second candidate region 220. The image recognition model 230 is trained based on the first loss value and the second loss value to obtain a target image recognition model 240.
The target image recognition model 240 is used to recognize specified type content in an image. For example: the server 120 inputs an image to be recognized into the trained target image recognition model to obtain an image recognition result, where the image recognition result is used for indicating whether specified type content exists in the image to be recognized. Optionally, when specified type content exists in the image to be recognized, the image recognition result may also be used for indicating information such as the region corresponding to the specified type content.
Application scenes of the target image recognition model obtained through training include at least one of an image screening scene, an extraction scene of specified type content, a content filtering scene, and the like. It should be noted that the above application scenes are merely illustrative examples, and the training method of the image recognition model provided in this embodiment may also be applied to other scenes, which is not limited in the embodiments of the present application.
It should be noted that, before and during collection of relevant user data, the present application may display a prompt interface or pop-up window, or output voice prompt information, to notify the user that relevant data is currently being collected. The relevant step of obtaining user data begins to be executed only after a confirmation operation by the user on the prompt interface or pop-up window is obtained; otherwise (i.e., when the confirmation operation by the user on the prompt interface or pop-up window is not obtained), the relevant step of obtaining user data ends, i.e., the relevant user data is not obtained. In other words, all user data collected by the present application is collected with the consent and authorization of the user, and the collection, use, and processing of relevant user data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
With reference to the above term introductions and application scenes, the training method of the image recognition model provided by the application is described below, taking application to a server as an example. As shown in fig. 3, the method includes the following steps 310 to 350.
Step 310, a sample image pair is acquired.
Illustratively, the sample image pair is an image pair composed of pre-collected sample images, wherein the sample image pair includes a positive sample image and a negative sample image.
A plurality of sample images are collected in advance, for example: a plurality of sample images are acquired from a pre-stored image dataset. Optionally, a positive sample image and a negative sample image of the plurality of sample images are determined.
Wherein the positive sample image is an image having a specified type of content, and the negative sample image is an image not containing the specified type of content. In addition, the positive sample image and the negative sample image have similar association relations.
The similar association relation is used for indicating that image content in the negative sample image has a similarity relation with the specified type content in the positive sample image. Illustratively, the collected negative sample image does not include the specified type content, but includes image content that is highly suspected of being the specified type content, and this high degree of suspicion easily causes an image recognition model to misjudge an image without specified type content as an image with specified type content.
For example: the specified type content included in the positive sample image is implemented as an exposed ankle region, and the negative sample image having a similar association relation with the positive sample image also includes an ankle region, but an ankle region covered by clothing such as socks, trousers, or a skirt; the covered ankle region (image content) is highly suspected of being the exposed ankle region (specified type content).
Optionally, the specified type content is predetermined content. Illustratively, a specific location area is taken as the specified type content, for example: the specific location area is implemented as a toilet area; alternatively, a sensitive human body area is taken as the specified type content, for example: the sensitive human body area is implemented as an armpit region, an ankle region, etc.; alternatively, certain traffic signs, vehicle types, and the like in the traffic field are taken as the specified type content, for example: a plurality of traffic specification signs are taken as the specified type content, or a certain type of vehicle is taken as the specified type content, and the like.
Illustratively, the specified type content is implemented as a sensitive human body region. The collected plurality of sample images includes: sample images with the specified type content and sample images without the specified type content.
For example: a sensitive human body region (e.g., an exposed armpit region) exists in sample image 1; a sensitive human body region (e.g., an exposed ankle region) exists in sample image 2; no sensitive human body region (e.g., no exposed armpit or ankle region) exists in sample image 3, but image content highly similar to a sensitive human body region (e.g., an unexposed ankle region) does; a sensitive human body region (e.g., an exposed armpit region) exists in sample image 4; no sensitive human body region exists in sample image 5, but image content highly similar to a sensitive human body region (e.g., an unexposed armpit region) does; and so on.
Optionally, a sample image pair is formed by taking a sample image with the specified type content as a positive sample image and a sample image without the specified type content as a negative sample image, where the negative sample image and the positive sample image have a similar association relation (for example, image content in the negative sample image is highly suspected of being the specified type content in the positive sample image). For example: sample image 1 is taken as a positive sample image and sample image 3 as a negative sample image, so that sample image 1 and sample image 3 form a sample image pair; alternatively, sample image 2 is taken as a positive sample image and sample image 3 as a negative sample image, so that sample image 2 and sample image 3 form another sample image pair.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
The region corresponding to the specified type content is pre-marked with a specified type label.
Illustratively, the positive sample image includes specified type content; the region corresponding to the specified type content in the positive sample image is labeled in advance with a specified type label, and the specified type label is used for indicating that the specified type content exists in the positive sample image and the position of the region of the specified type content in the sample image.
Schematically, the region corresponding to the specified type content is labeled in the form of a labeling box, and the labeling box serves as the specified type label; alternatively, the region corresponding to the specified type content is labeled in the form of a labeling layer, and the labeling layer serves as the specified type label, and the like.
Step 320, labeling a plurality of image areas in the negative sample image with area labels.
Illustratively, after the sample images are formed into a sample image pair, a region-label labeling process is performed on the negative sample image, which does not contain the specified type content. The region labels are used to indicate that the plurality of image regions do not contain the specified type content. Namely: by means of the region labels, even when the negative sample image and the positive sample image have a similar association relation, and even when image content in the negative sample image is highly suspected of being the specified type content in the positive sample image, it can be more clearly indicated that the negative sample image does not contain the specified type content.
In some embodiments, the negative sample image is segmented to obtain the plurality of image regions in the negative sample image.
Illustratively, after the negative sample image in the sample image pair is determined, image segmentation is performed on it to obtain a plurality of image regions. For example: random image segmentation is performed on the negative sample image, giving a plurality of image regions that may differ in area; alternatively, equal-area image segmentation is performed on the negative sample image, giving a plurality of image regions of equal area, and the like.
Optionally, after obtaining the plurality of image areas, labeling the plurality of image areas with an area label with a value of zero.
Illustratively, after obtaining a plurality of image areas in the negative sample image, labeling each image area so that the plurality of image areas have corresponding area labels, where the area labels are used to indicate that the image area does not have specified types of content.
Illustratively, a plurality of image areas are marked with area labels with values of zero. For example: in the process of model training based on the sample image pairs, a plurality of image areas corresponding to the negative sample images in the sample images are set to be in a labeling form with a true value of 0, and the true value of 0 is used for indicating that the negative sample images do not contain specified types of content.
Since the negative sample image is an image collected in advance that has no specified type content, the plurality of image regions obtained after region segmentation of the negative sample image likewise contain no specified type content.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
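As an illustration of step 320, the following is a minimal sketch of equal-area segmentation and zero-valued region labeling for a negative sample image; the function name, grid size, and box format are illustrative assumptions, not part of the described method.

```python
import numpy as np

def label_negative_regions(image: np.ndarray, grid: int = 4):
    """Split a negative sample image into equal-area image regions and
    assign each a zero-valued region label (no specified type content)."""
    h, w = image.shape[:2]
    step_h, step_w = h // grid, w // grid
    regions, labels = [], []
    for r in range(grid):
        for c in range(grid):
            # (x1, y1, x2, y2) region inside the negative sample image
            regions.append((c * step_w, r * step_h, (c + 1) * step_w, (r + 1) * step_h))
            # true value 0: this image region contains no specified type content
            labels.append(0)
    return regions, labels
```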
Step 330, performing region prediction of specified type content on the negative sample image and the positive sample image through the image recognition model to be trained, to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image.
Illustratively, after a sample image pair including a positive sample image and a negative sample image is obtained, an image recognition model to be trained is adopted to respectively predict areas of specified types of contents for the negative sample image and the positive sample image.
The region prediction of the specified type of content is used for analyzing whether the specified type of content exists in the sample image when the region prediction is performed on the sample image.
Optionally, although the negative sample image is an image that does not contain the specified type content, when region prediction is performed on it by the image recognition model to be trained, some image regions in the negative sample image may still be predicted to have the specified type content. For example: there may be an image region in the negative sample image that is highly suspected of being specified type content but does not actually contain it.
In some embodiments, the region prediction of the specified type of content is performed on the negative sample image through the image recognition model to be trained, so as to obtain a first candidate region corresponding to the negative sample image.
Illustratively, the region prediction of the specified type content is performed on the negative sample image by means of the image recognition model, whether the specified type content exists in the negative sample image or not and the image region corresponding to the specified type content are analyzed, so that the region prediction process is realized, and the result of the region prediction is output, so that the first candidate region corresponding to the negative sample image is obtained.
Optionally, performing region prediction of the specified type content on the positive sample image through an image recognition model to be trained to obtain a second candidate region corresponding to the positive sample image.
Illustratively, the image recognition model is used for carrying out region prediction of the specified type content on the positive sample image, analyzing whether the specified type content exists in the positive sample image and an image region corresponding to the specified type content, so that the region prediction process is realized, and outputting a result of the region prediction, so as to obtain a second candidate region corresponding to the positive sample image.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
Step 340, acquiring a first loss value corresponding to the negative sample image based on the difference between the region label and the first candidate region; and acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type label and the second candidate region.
Illustratively, after obtaining a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image, training the image recognition model by means of the first candidate region and the second candidate region.
Optionally, based on the region labels determined in step 320 for the plurality of image regions in the negative sample image, after the first candidate region predicted by the image recognition model for the negative sample image is obtained, the first candidate region (prediction result) is compared with the region labels (true-value result), and the difference between the region labels and the first candidate region is determined as the first loss value corresponding to the negative sample image.
Optionally, based on the specified type label pre-labeled for the region corresponding to the specified type content in the positive sample image, after the second candidate region predicted by the image recognition model for the positive sample image is obtained, the second candidate region (prediction result) is compared with the specified type label (true-value result of the positive sample image), and the difference between the specified type label and the second candidate region is determined as the second loss value corresponding to the positive sample image.
And step 350, training the image recognition model based on the first loss value and the second loss value to obtain a target image recognition model.
Illustratively, after the first loss value corresponding to the negative sample image and the second loss value corresponding to the positive sample image are obtained, the image recognition model to be trained is trained through the first loss value and the second loss value, and then the target image recognition model is obtained.
The target image recognition model is used for recognizing specified types of content in the image.
In an alternative embodiment, the image to be recognized is implemented as a target image, and recognition of the target image by the target image recognition model is described as an example. Illustratively, after the specified type content in the target image is recognized, the presence of the specified type content in the target image can be determined, for example: whether specified type content exists in the target image; and/or the position at which the specified type content exists in the target image.
Optionally, after the specified type of content in the target image is identified by the target image identification model, at least one of the following identification results is obtained.
1. Absence of specified type of content in target image
Illustratively, the target image is input into the target image recognition model, and after the target image is recognized by the target image recognition model, no specified type content is recognized from the target image. Optionally, the target image recognition model outputs a recognition result. For example: the target image recognition model outputs "no specified type content exists in the target image"; alternatively, the target image recognition model outputs the target image as-is, or the like.
2. The presence of specified types of content in a target image
Illustratively, the target image is input into a target image recognition model, and after the target image is recognized by the target image recognition model, the specified type of content is recognized from the target image. Optionally, the target image recognition model outputs a recognition result.
For example: the target image recognition model outputs "specified type content exists in the target image";
or, the target image recognition model outputs the target image marked with a mosaic (for example, the image region corresponding to the specified type content is blurred by the mosaic, or the whole target image is blurred by the mosaic, and the like);
or, the target image recognition model outputs the target image marked with an indication box (for example, the image region corresponding to the specified type content is surrounded by the indication box), and the like.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
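The two recognition results above can be sketched as follows; the model interface (a `predict` method returning a list of boxes and scores) and the use of OpenCV for the mosaic effect are assumptions for illustration only, not the patent's specified implementation.

```python
import cv2  # assumed here only to illustrate the mosaic (blurring) output

def review_image(model, image):
    """Run the trained target image recognition model on a target image and,
    if specified type content is found, blur (mosaic) the corresponding region."""
    detections = model.predict(image)  # assumed interface: list of ((x1, y1, x2, y2), score)
    if not detections:
        return image, "no specified type content exists in the target image"
    for (x1, y1, x2, y2), score in detections:
        roi = image[y1:y2, x1:x2]
        # downscale then upscale with nearest-neighbour to produce a mosaic effect
        small = cv2.resize(roi, (max(1, (x2 - x1) // 16), max(1, (y2 - y1) // 16)))
        image[y1:y2, x1:x2] = cv2.resize(small, (x2 - x1, y2 - y1),
                                         interpolation=cv2.INTER_NEAREST)
    return image, "specified type content exists in the target image"
```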
In summary, region labels are labeled for the plurality of image regions of the negative sample image in the sample image pair to indicate that the plurality of image regions do not contain specified type content; region prediction of specified type content is performed on the negative sample image and the positive sample image through the image recognition model to be trained to obtain a first candidate region and a second candidate region; a first loss value is acquired based on the difference between the region labels and the first candidate region; a second loss value is acquired based on the difference between the specified type label and the second candidate region; and the image recognition model is trained by means of the loss values to obtain a target image recognition model. The region-label labeling scheme improves the model's ability to distinguish negative sample images from positive sample images and reduces the interference of large numbers of negative sample images on training, so that the target image recognition model recognizes images to be recognized more accurately, identifies specified type content in images more precisely, and improves the accuracy of image filtering.
In an alternative embodiment, the target image recognition model is obtained by applying the above training method of the image recognition model to a pre-constructed image recognition model to be trained. Illustratively, as shown in fig. 4, step 350 in the embodiment shown in fig. 3 above may also be implemented as the following steps 410 to 440.
And 410, extracting the features of the first candidate region to obtain a first feature representation corresponding to the first candidate region.
Illustratively, after the region prediction of the specified type content is performed on the negative sample image through the image recognition model, a first candidate region corresponding to the negative sample image is obtained, and feature extraction is performed on the first candidate region, so that a first feature representation corresponding to the first candidate region is obtained.
For example: and projecting the first candidate region into a vector space, thereby realizing the process of extracting the characteristics of the first candidate region and obtaining a first characteristic representation.
Optionally, a predetermined feature extraction network is adopted to perform feature extraction on the first candidate region, so as to obtain a first feature representation corresponding to the first candidate region, and the like.
And step 420, extracting features of the second candidate region to obtain a second feature representation corresponding to the second candidate region.
Illustratively, after the region prediction of the specified type content is performed on the positive sample image through the image recognition model, a second candidate region corresponding to the positive sample image is obtained, and feature extraction is performed on the second candidate region, so that a second feature representation corresponding to the second candidate region is obtained.
For example: and projecting the second candidate region into a vector space, thereby realizing the process of extracting the characteristics of the second candidate region and obtaining a second characteristic representation.
Optionally, a predetermined feature extraction network is adopted to perform feature extraction on the second candidate region, so as to obtain a second feature representation corresponding to the second candidate region, and the like.
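A sketch of how candidate regions could be projected into a vector space to obtain feature representations, using RoI alignment over a backbone feature map; the module structure, channel count, and embedding size are illustrative assumptions rather than the patent's specified network.

```python
import torch.nn as nn
from torchvision.ops import roi_align

class RegionFeatureExtractor(nn.Module):
    """Project candidate regions into a vector space, yielding one feature
    representation per region (sizes here are illustrative)."""
    def __init__(self, in_channels: int = 256, embed_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(in_channels * 7 * 7, embed_dim)

    def forward(self, feature_map, boxes):
        # boxes: list of per-image tensors of (x1, y1, x2, y2) candidate regions
        pooled = roi_align(feature_map, boxes, output_size=(7, 7))
        # flatten the pooled region features and project them into the vector space
        return self.fc(pooled.flatten(start_dim=1))
```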
Step 430, determining a similarity loss value between the first feature representation and the second feature representation based on a preset similarity loss function.
Illustratively, after the first feature representation and the second feature representation are obtained, a similarity loss function is used to determine a similarity loss value between the first feature representation and the second feature representation.
Optionally, the preset similarity loss function is implemented as the following formula one.

Formula one:

$$L_{simloss}(X_1, X_2) = \frac{1}{K} \sum_{i=1}^{K} \bar{x}_1^i \cdot \bar{x}_2^i$$

wherein $L_{simloss}(X_1, X_2)$ is the similarity loss value indicating the similarity between the first feature representation and the second feature representation; $X_1$ indicates the first feature representations corresponding to the first candidate regions; $X_2$ indicates the second feature representations corresponding to the second candidate regions; $K$ indicates the number of first feature representations and of second feature representations, K first feature representations and K second feature representations being selected; $\bar{x}_1^i$ indicates the normalized first feature representation of the i-th first candidate region; $\bar{x}_2^i$ indicates the normalized second feature representation of the i-th second candidate region. The above $\bar{x}_1^i$ and $\bar{x}_2^i$ are determined using formula two as follows.

Formula two:

$$\bar{x}_1^i = \frac{x_1^i}{\left\| x_1^i \right\|}, \qquad \bar{x}_2^i = \frac{x_2^i}{\left\| x_2^i \right\|}$$

wherein $\left\| x_1^i \right\|$ indicates the norm result of $x_1^i$, and $\left\| x_2^i \right\|$ indicates the norm result of $x_2^i$.

Illustratively, the similarity loss value between the first feature representation and the second feature representation is determined using the similarity loss function represented by formula one and formula two, namely: determining $L_{simloss}(X_1, X_2)$.
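A minimal PyTorch sketch of formulas one and two as reconstructed above, assuming cosine similarity averaged over the K selected feature pairs; minimizing this value drives the negative sample features away from the positive sample features.

```python
import torch
import torch.nn.functional as F

def similarity_loss(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """x1: K first feature representations (negative image), shape (K, D);
    x2: K second feature representations (positive image), shape (K, D)."""
    x1_bar = F.normalize(x1, dim=1)  # formula two: x / ||x||
    x2_bar = F.normalize(x2, dim=1)
    # formula one: (1/K) * sum_i  x1_bar_i . x2_bar_i
    return (x1_bar * x2_bar).sum(dim=1).mean()
```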
Step 440, training the image recognition model based on the first loss value, the second loss value and the similarity loss value to obtain the target image recognition model.
Illustratively, after the first loss value corresponding to the negative sample image, the second loss value corresponding to the positive sample image, and the similarity loss value between the positive sample image and the negative sample image are obtained, the image recognition model is trained with these loss values, as sketched below.
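A sketch of the combined training objective of step 440; the equal weighting of the three loss values is an assumption, since the text does not specify how they are combined.

```python
def total_loss(first_loss, second_loss, sim_loss, w1=1.0, w2=1.0, w3=1.0):
    """Combine the first loss value (negative image), second loss value
    (positive image), and similarity loss value; weights are illustrative."""
    return w1 * first_loss + w2 * second_loss + w3 * sim_loss
```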
In an alternative embodiment, the image recognition model includes a region prediction network and a region detection network.
(one) regional prediction network
Optionally, the region prediction network is used to predict regions in an image that contain the specified type content and to determine the candidate regions.
For example: and (3) carrying out region prediction of the specified type of content on the negative sample image and the positive sample image in the sample image pair by adopting a region prediction network in the image recognition model, so as to predict the position condition of the specified type of content in the sample image (such as determining the position of a candidate region).
In some embodiments, the negative sample image and the positive sample image are subjected to region prediction of specified type content through an image recognition model to be trained, so that a plurality of first prediction regions corresponding to the negative sample image and a plurality of second prediction regions corresponding to the positive sample image are obtained.
Optionally, the negative sample image and the positive sample image are subjected to region extraction through an image recognition model to be trained, so that a plurality of first prediction regions corresponding to the negative sample image and a plurality of second prediction regions corresponding to the positive sample image are obtained.
Optionally, a region proposal network (Region Proposal Network, RPN) is used as the region prediction network in the image recognition model. When a sample image is analyzed with a generic RPN, multiple prediction regions are typically generated for one sample image, for example: 256 prediction regions are generated for one sample image.
Illustratively, after the sample image pair is input into the image recognition model, the positive sample image and the negative sample image in the sample image pair are respectively subjected to region prediction based on a region prediction network in the image recognition model, so as to obtain a plurality of first prediction regions corresponding to the negative sample image, a plurality of second prediction regions corresponding to the positive sample image, and the like.
Alternatively, the negative sample image corresponds to 256 first prediction regions, and the positive sample image corresponds to 256 second prediction regions. When one sample image pair is analyzed, 512 prediction regions (including 256 second prediction regions corresponding to positive sample images and 256 first prediction regions corresponding to negative sample images) are obtained in total.
Among these, the 512 prediction regions include a large number of image regions that do not contain the specified type content, that is, many meaningless prediction regions. In particular, many first prediction regions suspected of having the specified type content are generated for the negative sample image, and such suspected first prediction regions ordinarily have little opportunity to be included in feedback calculation.
In an alternative embodiment, a scoring condition indicating that the plurality of prediction regions contain the specified type content is determined, and candidate regions are selected from the plurality of prediction regions.
Optionally, a first scoring condition indicating that the plurality of first prediction regions contain the specified type content and a second scoring condition indicating that the plurality of second prediction regions contain the specified type content are determined.
Illustratively, after the first prediction regions corresponding to the negative sample image and the second prediction regions corresponding to the positive sample image are obtained, the score results (first score results and second score results) indicating that the plurality of prediction regions (first prediction regions and second prediction regions) contain the specified type content are respectively determined. For example: when calculating the prediction regions corresponding to the negative sample image and the positive sample image in the sample image pair, the region prediction network also determines the probability that each prediction region contains the specified type content.
In some embodiments, combining the first scoring condition and the second scoring condition, a first candidate region corresponding to the negative sample image is selected from the plurality of first prediction regions, and a second candidate region corresponding to the positive sample image is selected from the plurality of second prediction regions.
Alternatively, the plurality of score results are ranked based on score results corresponding to the plurality of prediction regions, respectively, and the candidate region is determined from the plurality of prediction regions based on the ranking.
Illustratively, the plurality of prediction regions includes the first prediction regions corresponding to the negative sample image and the second prediction regions corresponding to the positive sample image. The above 512 prediction regions are taken as an example implementation of the plurality of prediction regions. Based on the score results corresponding to the 512 prediction regions, the 512 probability results are sorted, and the prediction regions with higher score results are selected as the candidate regions.
For example: the first 256 probability results are selected from 512 probability results, a prediction region corresponding to the first 256 probability results is determined, and a loss value determination process is performed using the prediction region as a candidate region.
Optionally, since the positive sample image contains the specified type of content, the probability results corresponding to the second prediction regions are mostly higher than the probability results corresponding to the first prediction regions. For example: the prediction regions corresponding to the first 256 probability results include 56 first prediction regions corresponding to the negative sample image and 200 second prediction regions corresponding to the positive sample image; the 56 first prediction regions are used as the first candidate regions corresponding to the negative sample image, and the 200 second prediction regions are used as the second candidate regions corresponding to the positive sample image.
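To make the top-256 selection concrete, the following is a minimal sketch assuming PyTorch (the document does not name a framework); the function name and tensor shapes are illustrative only.

```python
import torch

def select_candidates(neg_scores, pos_scores, k=256):
    """Pool the 256 first prediction regions and 256 second prediction
    regions, then keep the k highest-scoring ones as candidate regions.

    neg_scores: (256,) scores of the negative image's prediction regions
    pos_scores: (256,) scores of the positive image's prediction regions
    Returns index tensors into each image's own proposal list.
    """
    scores = torch.cat([neg_scores, pos_scores])            # (512,)
    top_idx = scores.topk(k).indices                        # top-k of 512
    n = neg_scores.numel()
    neg_sel = top_idx[top_idx < n]                          # first candidate regions
    pos_sel = top_idx[top_idx >= n] - n                     # second candidate regions
    return neg_sel, pos_sel
```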
In some embodiments, a maximum likelihood loss function (Top likelihood Loss) is designed to calculate a first loss value corresponding to the negative sample image. Illustratively, the maximum likelihood loss function is implemented as shown in equation three below.
Formula three:

$$L_{tlloss} = \frac{1}{N_{cls}} \sum_{i} L_{cls}(p_i, p_i^*)$$

wherein $L_{tlloss}$ is used to indicate the maximum likelihood loss value corresponding to the negative sample image; $N_{cls}$ is used to indicate the number of candidate regions in the negative sample image; $p_i$ is used to indicate the prediction probability (score result) that the specified type of content exists in the i-th first candidate region in the negative sample image; $p_i^*$ is used to indicate the true value corresponding to the i-th first candidate region in the negative sample image, and $p_i^* = 0$ (namely: the region label marked for the image regions in the negative sample image, representing that the image regions do not contain the specified type of content); the true value for every one of the $N_{cls}$ candidate regions is 0, namely: no content of the specified type is contained in any of the first candidate regions.
In addition, $L_{cls}$ is used to indicate the logarithmic loss between $p_i$ and $p_i^*$, implemented in the form shown in formula four below.

Formula four:

$$L_{cls}(p_i, p_i^*) = -\log\left[\,p_i^* p_i + (1 - p_i^*)(1 - p_i)\,\right]$$
optionally, the first loss value corresponding to the negative sample image is determined using the above formula three and formula four.
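As a rough illustration of formulas three and four, the sketch below (PyTorch and the function name are assumptions) computes the maximum likelihood loss over the first candidate regions of the negative sample image; since every true value is 0, the logarithmic loss reduces to $-\log(1 - p_i)$.

```python
import torch

def top_likelihood_loss(p_neg):
    """Formulas three and four: every first candidate region of the negative
    sample image carries true value p* = 0, so the logarithmic loss reduces
    to -log(1 - p_i), averaged over the N_cls candidate regions.

    p_neg: (N_cls,) predicted probabilities that the specified type of
           content exists in each first candidate region
    """
    eps = 1e-7  # numerical guard, not part of the formula
    return -torch.log(1.0 - p_neg + eps).mean()
```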
Further, a second loss value corresponding to the positive sample image is determined by means of the second candidate region corresponding to the positive sample image and the specified type tag.
Optionally, a region classification loss value corresponding to the positive sample image is determined based on the second candidate region corresponding to the positive sample image and the specified type label, and the region classification loss value is used as the second loss value corresponding to the positive sample image; or, a region regression loss value corresponding to the positive sample image is determined based on the second candidate region corresponding to the positive sample image and the specified type label, and the region regression loss value is used as the second loss value corresponding to the positive sample image; or, the region classification loss value and the region regression loss value corresponding to the positive sample image are determined based on the second candidate region corresponding to the positive sample image and the specified type label, and the sum of the region classification loss value and the region regression loss value is used as the second loss value corresponding to the positive sample image.
Illustratively, the following formula five is employed to determine the region classification loss value corresponding to the positive sample image.
Formula five:

$$L_{pclsloss} = \frac{1}{N_{pcls}} \sum_{i} L_{pcls}(p_{pi}, p_{pi}^*)$$

wherein $L_{pclsloss}$ is used to indicate the region classification loss value corresponding to the positive sample image; $N_{pcls}$ is used to indicate the number of second candidate regions in the positive sample image; $i$ is used to indicate the i-th second candidate region corresponding to the positive sample image; $p_{pi}$ is used to indicate the probability that the i-th second candidate region corresponding to the positive sample image contains the specified type of content; $p_{pi}^*$ is used to indicate the true value that the i-th second candidate region corresponding to the positive sample image is a region containing the specified type of content, namely: the region is pre-labeled with the specified type label.
wherein $L_{pcls}$ is used to indicate the logarithmic loss between $p_{pi}$ and $p_{pi}^*$, implemented as formula six below.

Formula six:

$$L_{pcls}(p_{pi}, p_{pi}^*) = -\log\left[\,p_{pi}^* p_{pi} + (1 - p_{pi}^*)(1 - p_{pi})\,\right]$$
illustratively, the regional regression loss value corresponding to the positive sample image is determined using the following equation seven.
Formula seven:

$$L_{pregloss} = \lambda \frac{1}{N_{reg}} \sum_{i} p_{pi}^* L_{reg}(t_i, t_i^*)$$

wherein $L_{pregloss}$ is used to indicate the region regression loss value corresponding to the positive sample image; $\lambda$ is used to indicate a preset parameter; $N_{reg}$ is used to indicate the number of second candidate regions in the positive sample image participating in the region regression loss calculation; $t_i$ is used to indicate the prediction offset of the feature representation of the i-th second candidate region; $t_i^*$ is used to indicate the actual offset of the feature representation of the i-th second candidate region; $L_{reg}$ is implemented as formula eight below.
Formula eight:

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$$

wherein $R$ is used to indicate the smooth L1 function.
In an alternative embodiment, the sum of the regional classification loss value corresponding to the positive sample image and the regional regression loss value corresponding to the positive sample image is used as the second loss value corresponding to the positive sample image.
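A hedged sketch of this second loss value follows, assuming PyTorch and using `F.smooth_l1_loss` to stand in for the smooth L1 function $R$ of formula eight; the function signature and the default λ are illustrative, not values from the text.

```python
import torch
import torch.nn.functional as F

def positive_sample_loss(p_pos, t_pred, t_true, lam=1.0):
    """Second loss value sketch: region classification loss (formulas five
    and six; with p* = 1 the log loss reduces to -log(p)) plus the region
    regression loss (formulas seven and eight; R taken as smooth L1).

    p_pos:  (N_pcls,) probabilities that each second candidate region
            contains the specified type of content
    t_pred: (N_reg, 4) predicted box offsets
    t_true: (N_reg, 4) actual box offsets
    """
    eps = 1e-7
    cls_loss = -torch.log(p_pos + eps).mean()      # L_pclsloss, formula five
    reg_loss = F.smooth_l1_loss(t_pred, t_true)    # L_pregloss, formulas seven/eight
    return cls_loss + lam * reg_loss               # second loss value
```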
Optionally, training the region prediction network in the image recognition model with the first loss value and the second loss value to obtain a trained region prediction network.
Illustratively, after obtaining the first loss value and the second loss value, training the region prediction network in the image recognition model with the first loss value and the second loss value.
For example: the maximum likelihood loss value corresponding to the negative sample image, and the region classification loss value and the region regression loss value corresponding to the positive sample image, are adjusted by means of preset hyper-parameters so as to obtain the first loss value corresponding to the negative sample image and the second loss value corresponding to the positive sample image, and the region prediction network in the image recognition model is trained based on the sum of the first loss value and the second loss value, as shown in formula nine below.
Formula nine:

$$L_{rpnloss} = L_{ploss} + L_{nloss} = L_{pclsloss} + \lambda_1 L_{pregloss} + \lambda_2 L_{tlloss}$$

wherein $L_{rpnloss}$ is used to indicate the total loss value for training the region prediction network in the image recognition model (for example: when the region prediction network is implemented as an RPN, this is the RPN loss value); $L_{ploss}$ is used to indicate the second loss value corresponding to the positive sample image; $L_{nloss}$ is used to indicate the first loss value corresponding to the negative sample image; $L_{pclsloss}$ is used to indicate the region classification loss value corresponding to the positive sample image as shown in formula five above; $L_{pregloss}$ is used to indicate the region regression loss value corresponding to the positive sample image as shown in formula seven above; $\lambda_1$ is used to indicate a hyper-parameter for adjusting the region regression loss value; $L_{tlloss}$ is used to indicate the maximum likelihood loss value corresponding to the negative sample image as shown in formula three above; $\lambda_2$ is used to indicate a hyper-parameter for adjusting the maximum likelihood loss value.
Optionally, the region prediction network is trained with the total loss value $L_{rpnloss}$ corresponding to the region prediction network, so as to obtain the trained region prediction network.
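Formula nine is a plain weighted sum; a one-line sketch follows, with placeholder hyper-parameter defaults (the patent does not fix λ1 and λ2).

```python
def rpn_total_loss(pcls_loss, preg_loss, tl_loss, lam1=1.0, lam2=1.0):
    """Formula nine: L_rpnloss = L_pclsloss + λ1·L_pregloss + λ2·L_tlloss."""
    return pcls_loss + lam1 * preg_loss + lam2 * tl_loss
```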
(II) area detection network
Illustratively, the region detection network is configured to instruct classification prediction of the specified type of content in the candidate region in the image.
In some embodiments, the presence of the specified type of content in the candidate region is classified and predicted by the region detection network to obtain a classification prediction result. For example: the presence or absence of the specified type of content in the candidate area is predicted by the area detection network, and the predicted presence (presence or absence of the specified type of content) is used as a classification prediction result.
Optionally, a Fast Region-based Convolutional Network (Fast RCNN) is used as the region detection network in the image recognition model. When a generic Fast RCNN analyzes the candidate regions corresponding to a sample image, the value (classification result) of each candidate region is usually output through a fully connected layer in the Fast RCNN, for example: candidate region a contains the specified type of content.
Illustratively, Faster R-CNN is described. As shown in fig. 5, Faster R-CNN includes the following four components: (I) a convolution layer 510 (Convolutional Neural Network, Conv layers); (II) RPN 520; (III) a region of interest pooling layer 530 (Region of interest Pooling, RoI Pooling); (IV) a classification layer 540 (Classification).
(I) Convolution layer 510
Illustratively, as a CNN-based object detection method, Faster R-CNN first extracts feature maps of the sample image using a set of basic convolution (conv), ReLU activation, and pooling (pooling) layers; the feature maps are shared by the subsequent RPN layer and the fully connected layers.
(II) RPN520
Illustratively, the RPN network is used to generate candidate regions (region proposals); this layer determines, via the softmax activation function, whether the anchors (anchors) in the sample image are positive (positive) or negative (negative), namely: whether the specified type of content is included in the corresponding region of the sample image.
Optionally, the RPN then uses bounding box regression to correct the anchors to obtain precise candidate regions (proposals).
(III) region of interest pooling layer 530
Illustratively, the region of interest pooling layer collects the input feature maps and proposals, extracts proposal feature maps after integrating the above information, and sends the information to the subsequent full-connection layer to determine the image type of the sample image, namely: the sample image belongs to either a positive sample image or a negative sample image.
(IV) Classification layer 540
Illustratively, the category of each proposal is calculated using the proposal feature maps, and bounding box regression is applied again to obtain the final precise position of the detection box.
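For orientation only, the sketch below instantiates this generic four-stage pipeline (conv backbone, RPN, RoI pooling, classification head) via torchvision's off-the-shelf Faster R-CNN (torchvision ≥ 0.13 assumed); it is not the patent's modified model, whose RPN and Fast RCNN losses are replaced as described in this document.

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN, shown only to illustrate the
# conv backbone -> RPN -> RoI pooling -> classification pipeline.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=2)   # 2 classes: specified type content vs background
model.eval()

images = [torch.rand(3, 480, 640)]             # one dummy sample image
with torch.no_grad():
    outputs = model(images)                    # list of {boxes, labels, scores}
print(outputs[0]["boxes"].shape)               # predicted detection boxes
```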
In an alternative embodiment, the area detection network in the image recognition model is trained with a similarity loss value, and a trained area detection network is obtained.
In an alternative embodiment, the specified type of content in the first candidate region is subjected to classified prediction through the region detection network to obtain a first classified prediction result; and carrying out classified prediction on the appointed type content in the second candidate region through the region detection network to obtain a second classified prediction result.
Illustratively, after the sample image pair is input into the image recognition model, based on a region detection network in the image recognition model, classification prediction is performed on image categories corresponding to the positive sample image and the negative sample image in the sample image pair, so that classification prediction results corresponding to the negative sample image and the positive sample image are determined. For example: the classification prediction results include a first classification prediction result corresponding to the negative sample image and a second classification prediction result corresponding to the positive sample image.
The classification prediction result is used for indicating whether the sample image comprises specified type content or not; alternatively, the classification prediction result is used to indicate the region range of the specified type of content predicted in the sample image. For example: taking the classification prediction result as a classification result as an example, the first classification prediction result corresponding to the negative sample image is 0, and indicating that the negative sample image does not contain the specified type content; alternatively, the second classification prediction result corresponding to the positive sample image is 1, indicating that the positive sample image includes the specified type content or the like.
Optionally, based on the first classification prediction result and the region label corresponding to the negative sample image, a first classification loss value corresponding to the negative sample image is obtained.
Illustratively, the region label corresponding to the negative sample image is used to indicate that none of the plurality of image regions in the negative sample image includes the specified type of content.
Illustratively, formula five above indicates the region classification loss value corresponding to the positive sample image, namely: a classification loss value determined based on the specified type of content in the second candidate regions of the positive sample image. Formula five is modified so that, taking the negative sample image as a whole as the reference, the classification loss value corresponding to the negative sample image is determined using formula ten below.
Formula ten:

$$L_{nclsloss} = L_{cls}(p_n, p_n^*) = -\log(1 - p_n), \quad p_n^* = 0$$

wherein $L_{nclsloss}$ is used to indicate the classification loss value corresponding to the negative sample image; $p_n$ is used to indicate the probability that the negative sample image contains the specified type of content; $p_n^* = 0$ is used to indicate that the negative sample image does not contain the specified type of content, namely: the image regions in the negative sample image are pre-labeled with the region labels.
Optionally, based on the second classification prediction result and the specified type label corresponding to the positive sample image, a second classification loss value and a regression loss value corresponding to the positive sample image are obtained.
Illustratively, when the classification prediction result is used to determine the classification loss value and the regression loss value corresponding to the positive sample image, the second classification prediction result corresponding to the positive sample image is compared with the specified type label corresponding to the positive sample image, so as to determine the classification loss value and the regression loss value corresponding to the positive sample image.
Optionally, the classification loss value and the regression loss value corresponding to the positive sample image are determined according to the classification loss function. The specified type label corresponding to the positive sample image is used to indicate that the specified type of content is included in the positive sample image; when the second classification prediction result is compared against it, the specified type label may take the value 1, namely: the specified type of content is included in the positive sample image.
Illustratively, formula five above indicates the region classification loss value corresponding to the positive sample image, namely: a classification loss value determined based on the specified type of content in the second candidate regions of the positive sample image. Formula five is modified so that, taking the positive sample image as a whole as the reference, the classification loss value corresponding to the positive sample image is determined using formula eleven below.
Formula eleven:

$$L_{pclsloss} = L_{cls}(p_p, p_p^*) = -\log p_p, \quad p_p^* = 1$$

wherein $L_{pclsloss}$ is used to indicate the classification loss value corresponding to the positive sample image; $p_p$ is used to indicate the probability that the positive sample image contains the specified type of content; $p_p^* = 1$ is used to indicate that the positive sample image contains the specified type of content, namely: the region in the positive sample image is pre-labeled with the specified type label.
In some embodiments, the region classification loss value corresponding to the positive sample image is determined, and the region classification loss value is directly used as the classification loss value corresponding to the positive sample image.
Illustratively, formula seven above indicates the region regression loss value corresponding to the positive sample image, namely: a regression loss value determined based on the specified type of content in the second candidate regions of the positive sample image. Formula seven is modified so that, taking the positive sample image as a whole as the reference, the regression loss value corresponding to the positive sample image is determined using formula twelve below.
Formula twelve:

$$L_{pregloss} = p^* L_{reg}(t, t^*)$$

wherein $L_{pregloss}$ is used to indicate the regression loss value corresponding to the positive sample image; $p^* = 1$ is used to indicate that the positive sample image contains the specified type of content, namely: the region in the positive sample image is pre-labeled with the specified type label; $t$ is used to indicate the prediction offset of the second feature representation corresponding to the positive sample image; $t^*$ is used to indicate the actual offset of the second feature representation corresponding to the positive sample image; $L_{reg}$ is the smooth L1 form shown in formula eight.
In some embodiments, the region regression loss value corresponding to the positive sample image is determined, and the region regression loss value is directly used as the regression loss value corresponding to the positive sample image.
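Collecting formulas ten to twelve, the sketch below (PyTorch assumed; all names illustrative) evaluates the image-level classification losses for the negative and positive sample images and the image-level regression loss.

```python
import torch
import torch.nn.functional as F

def image_level_losses(p_n, p_p, t, t_star):
    """Formulas ten to twelve at whole-image granularity: the negative image
    has true value p* = 0 and the positive image has p* = 1.

    p_n, p_p: scalar tensors, probability that the negative / positive image
              contains the specified type of content
    t, t_star: predicted and actual offsets of the positive image's second
               feature representation
    """
    eps = 1e-7
    l_ncls = -torch.log(1.0 - p_n + eps)        # formula ten
    l_pcls = -torch.log(p_p + eps)              # formula eleven
    l_preg = F.smooth_l1_loss(t, t_star)        # formula twelve, p* = 1
    return l_ncls, l_pcls, l_preg
```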
In an alternative embodiment, the area detection network in the image recognition model is trained with a similarity loss value, a first classification loss value, a second classification loss value, and a regression loss value, to obtain a trained area detection network.
Illustratively, after obtaining the similarity loss value between the negative sample image and the positive sample image, the first classification loss value corresponding to the negative sample image, the second classification loss value corresponding to the positive sample image, and the regression loss value corresponding to the positive sample image, the area detection network is trained with the similarity loss value, the first classification loss value, the second classification loss value, and the regression loss value, as shown in the following formula.
$$L_{fastloss} = L_{clsloss} + \lambda_3 L_{regloss} + \lambda_4 L_{simloss}$$

wherein $L_{fastloss}$ is used to indicate the total loss value for training the area detection network in the image recognition model (for example: when the area detection network is implemented as Fast RCNN, this is the Fast RCNN loss value); $L_{clsloss}$ is used to indicate the classification loss value, namely: the first classification loss value corresponding to the negative sample image and the second classification loss value corresponding to the positive sample image, for example: $L_{clsloss}$ indicates the sum of the first classification loss value and the second classification loss value; $L_{regloss}$ is used to indicate the regression loss value corresponding to the positive sample image; $\lambda_3$ is used to indicate a hyper-parameter for adjusting the regression loss value; $L_{simloss}$ is used to indicate the similarity loss value as shown in formula one above; $\lambda_4$ is used to indicate a hyper-parameter for adjusting the similarity loss value.
Optionally, the area detection network is trained with the total loss value $L_{fastloss}$ corresponding to the area detection network, so as to obtain the trained area detection network.
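The combined objective is again a plain weighted sum; a minimal sketch follows, with placeholder hyper-parameter defaults.

```python
def fast_total_loss(ncls_loss, pcls_loss, reg_loss, sim_loss, lam3=1.0, lam4=1.0):
    """L_fastloss = L_clsloss + λ3·L_regloss + λ4·L_simloss, where L_clsloss
    is the sum of the negative- and positive-image classification losses."""
    cls_loss = ncls_loss + pcls_loss   # L_clsloss
    return cls_loss + lam3 * reg_loss + lam4 * sim_loss
```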
In an alternative embodiment, the trained region prediction network and the trained region detection network are combined into a target image recognition model.
Optionally, a simultaneous training mode is adopted: the region prediction network is trained with the total loss value $L_{rpnloss}$ corresponding to the region prediction network, while at the same time the area detection network is trained with the total loss value $L_{fastloss}$ corresponding to the area detection network.

Alternatively, a separate training mode can be adopted: after the region prediction network is trained with the total loss value $L_{rpnloss}$ corresponding to the region prediction network to obtain the trained region prediction network, the total loss value $L_{fastloss}$ corresponding to the area detection network is determined from the candidate regions output by the trained region prediction network, and the area detection network is trained with the total loss value $L_{fastloss}$, so as to obtain the trained area detection network, etc.
Optionally, the trained regional prediction network and the trained regional detection network form a target image recognition model, so that a target image to be recognized is recognized by means of the target image recognition model, and whether the target image has specified type content or not is judged; or, the region position of the specified type of content existing in the target image is judged.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
In summary, by adopting the region-label labeling mode, the model's ability to distinguish between the negative sample image and the positive sample image is improved, and the image recognition model is trained while reducing the interference of a large number of negative sample images with the model, so that the target image recognition model's ability to recognize the image to be recognized is improved, the specified type of content in the image is recognized more accurately, and the accuracy of image filtering is improved.
In an optional embodiment, the target image recognition model obtained by training is used to perform image recognition on the target image to be recognized, so as to determine the existence of the specified type of content in the target image, for example: whether the specified type of content exists in the target image; or, the position at which the specified type of content exists in the target image, and the like. Illustratively, as shown in fig. 6, the embodiment shown in fig. 2 described above may also be implemented as steps 610 through 682 below.
At step 610, a sample image pair is acquired.
Wherein the sample image pair comprises a positive sample image and a negative sample image. The positive sample image is an image having a specified type of content; the negative sample image is an image that does not contain a specified type of content.
Wherein the positive sample image and the negative sample image have similar association relation. The similarity association relationship is used for indicating that the image content in the negative sample image has a content similarity relationship with the specified type of content in the positive sample image. For example: the image content in the negative sample image is highly suspected of being of the specified type of content in the positive sample image. The region corresponding to the appointed type content in the positive sample image is pre-marked with the appointed type label. Alternatively, the following description is made taking, as an example, the implementation of the specified type of content as a sensitive human body region.
Optionally, the training method of the image recognition model is a training method implemented based on a contrast mechanism. After the sample images with classification information are obtained (namely, it is determined whether each sample image contains the specified type of content), on the basis of the existing classification information, the region corresponding to the specified type of content in each sample image containing the specified type of content is labeled in a manual labeling manner, for example: the labeled specified type bounding box is used as the specified type label.
Illustratively, the paired picture information (namely, the sample image pair) shares one feature network. For example: a sample image containing the specified type of content is acquired from the plurality of sample images as the positive sample image in the sample image pair, and a sample image that does not contain the specified type of content is acquired from the plurality of sample images as the negative sample image in the sample image pair, so that the sample image pair is obtained.
Step 620, labeling a plurality of image areas in the negative sample image with area labels.
Wherein the region tag is used to indicate that the plurality of image regions do not contain the specified type of content.
In some embodiments, region segmentation is performed on the negative sample image to obtain a plurality of image regions in the negative sample image; after a plurality of image areas are obtained, the plurality of image areas are marked with area labels with values of zero.
Illustratively, a plurality of image areas are marked with area labels with values of zero. For example: in the process of model training based on the sample image pairs, a plurality of image areas corresponding to the negative sample images in the sample images are set to be in a labeling form with a true value of 0, and the true value of 0 is used for indicating that the negative sample images do not contain specified types of content.
Since the negative sample image is an image collected in advance that does not contain the specified type of content, the specified type of content is likewise absent from the plurality of image regions obtained after region segmentation of the negative sample image.
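A minimal sketch of this zero-valued region labeling follows, assuming PyTorch tensors; the function name is illustrative.

```python
import torch

def label_negative_regions(num_regions):
    """Step 620 sketch: each image region segmented from the negative sample
    image is labeled with a region label whose true value is 0, indicating
    that it contains no specified type of content."""
    return torch.zeros(num_regions)  # one zero-valued region label per region
```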
And 630, carrying out region prediction of the appointed type content on the negative sample image and the positive sample image through the image recognition model to be trained to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image.
Illustratively, the negative sample image and the positive sample image are subjected to regional prediction of specified types of content by adopting an image recognition model to be trained.
The image recognition model to be trained is a recognition model with a certain region prediction function, and whether specified type content exists in the image can be roughly determined.
Alternatively, the framework of the image recognition model is derived from Faster R-CNN. Faster R-CNN comprises the RPN and Fast R-CNN; the classification loss of the RPN network is optimized at the loss-function end, and the contrastive loss is designed in the Fast RCNN part, so that the difference between the positive sample image and the negative sample image can be effectively distinguished while the recognition capability of the image recognition model on the sample image is ensured.
Illustratively, the region prediction is performed as follows: the sample image pair is input into the Faster R-CNN, and the first candidate region corresponding to the negative sample image and the second candidate region corresponding to the positive sample image are determined by means of the RPN.
In an alternative embodiment, after the first candidate region corresponding to the negative sample image is obtained, the above region tag is labeled for the first candidate region corresponding to the negative sample image to indicate that the first candidate region does not have the specified type of content, so that the first candidate region is highlighted to not include the specified type of content, so as to determine the first loss value corresponding to the negative sample image later.
Step 640, obtaining a first loss value corresponding to the negative sample image based on the difference between the region label and the first candidate region; and acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type label and the second candidate region.
In some embodiments, as shown in fig. 7, a sample image pair consisting of a positive sample image 710 and a negative sample image 720 is input into an image recognition model fast R-CNN, and an RPN730 in the fast R-CNN is used to determine a region classification loss value 711 corresponding to the positive sample image 710 and a region regression loss value 712 corresponding to the positive sample image 710; the maximum likelihood loss value 721 corresponding to the negative sample image 720 is determined by means of the RPN730 in the fast R-CNN.
Optionally, a first candidate region corresponding to the negative sample image 720 is determined by the RPN730 in the Faster R-CNN, and a second candidate region corresponding to the positive sample image 710 is determined by the RPN730 in the Faster R-CNN; and extracting features of the first candidate region and the second candidate region by Fast R-CNN740 in the Fast R-CNN, thereby determining a first feature representation corresponding to the first candidate region and a second feature representation corresponding to the second candidate region.
The first K second feature representations are selected from the plurality of second feature representations corresponding to the positive sample image, and the first K first feature representations are selected from the plurality of first feature representations corresponding to the negative sample image, so as to determine the similarity loss value 750 between the positive sample image and the negative sample image; in addition, based on the second candidate region of the positive sample image and the pre-labeled specified type label, the second classification loss value and the regression loss value corresponding to the positive sample image are acquired; and based on the first candidate region of the negative sample image and the set region label, the first classification loss value corresponding to the negative sample image is acquired.
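The sketch below illustrates the top-K feature selection feeding the similarity loss value 750. Since formula one is defined earlier in the document and not reproduced here, the cosine-similarity form below is an assumption standing in for it; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def topk_similarity_loss(neg_feats, neg_scores, pos_feats, pos_scores, k=8):
    """Top-K feature selection feeding the similarity loss 750.

    neg_feats: (N, D) first feature representations; neg_scores: (N,)
    pos_feats: (M, D) second feature representations; pos_scores: (M,)
    """
    nf = neg_feats[neg_scores.topk(k).indices]   # first K first feature representations
    pf = pos_feats[pos_scores.topk(k).indices]   # first K second feature representations
    # Pairwise cosine similarity between the two top-K sets; minimizing the
    # mean drives suspected-negative features away from positive ones.
    sim = F.cosine_similarity(nf.unsqueeze(1), pf.unsqueeze(0), dim=-1)
    return sim.mean()
```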
Step 650, training the image recognition model based on the first loss value and the second loss value to obtain the target image recognition model.
Illustratively, the loss value corresponding to the negative sample image is used as a first loss value, and the loss value corresponding to the positive sample image is used as a second loss value, so that the image recognition model is trained by means of the first loss value and the second loss value, and a target image recognition model is obtained.
The target image recognition model is used for recognizing the appointed type content in the target image to be recognized.
In step 660, the target image is acquired.
The target image is an image to be subjected to region detection of the specified type of content.
Illustratively, the target image to be recognized is acquired in a random acquisition manner; recognizing the target image means determining whether the specified type of content exists in the target image.
For example: the method comprises the steps of determining specified type content as a human body sensitive area in advance, and classifying a target image with the human body sensitive area into a specific image containing the specified content; the target image in which the human body sensitive area does not exist is classified as a nonspecific image or the like.
Step 670, the target image is input into the target image recognition model.
Illustratively, after obtaining the target image, inputting the target image into a trained target image recognition model, thereby determining whether specified type content exists in the target image by means of the target image recognition model; or determining the existence position of the specified type of content in the target image, etc.
Step 681, in response to detecting the specified type of content from the target image, the target image labeled with the prompt identifier is output.
In some embodiments, when the target image recognition model detects that the specified type of content exists in the target image, the target image marked with prompt information is output, and the prompt identifier is used for indicating an image area corresponding to the specified type of content.
Optionally, in response to detecting the specified type of content from the target image, outputting the target image marked with the prompt box. The prompt box is used for surrounding the specified type of content.
Alternatively, in response to detecting the specified type of content from the target image, the target image with the blurred identification is output. Wherein the obfuscation flag is used to override the specified type of content.
Illustratively, as shown in FIG. 8, for a target image 810, when the target image recognition model detects that a specified type of content (e.g., ankle region) exists in the target image 810, a prompt identifier is marked at the specified type of content in the target image 810 when the target image 810 is output. Wherein, the prompt identifier can comprise one or a plurality of prompt identifiers.
For example: the prompt identifier is implemented as a prompt box 820 as shown in fig. 8, where the prompt box 820 is used to enclose the specified type of content so as to indicate it; alternatively, the prompt identifier may be implemented as a blur area used to obscure the specified type of content (for example, in mosaic form), etc.
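A hedged OpenCV sketch of the two output styles (prompt box vs. mosaic blur) follows; OpenCV and the function name are assumptions, and box coordinates are taken as integer pixel corners.

```python
import cv2

def annotate_output(image, boxes, mode="box"):
    """Sketch of the output step: draw a prompt box around each detected
    region of the specified type, or cover it with a mosaic-style blur.
    `image` is an HxWx3 array; `boxes` holds integer (x1, y1, x2, y2) corners."""
    out = image.copy()
    for (x1, y1, x2, y2) in boxes:
        if mode == "box":
            cv2.rectangle(out, (x1, y1), (x2, y2), (0, 0, 255), 2)  # prompt box 820
        else:
            roi = out[y1:y2, x1:x2]
            small = cv2.resize(roi, (8, 8), interpolation=cv2.INTER_LINEAR)
            out[y1:y2, x1:x2] = cv2.resize(
                small, (x2 - x1, y2 - y1), interpolation=cv2.INTER_NEAREST)  # mosaic blur
    return out  # with an empty `boxes` list the image is returned unchanged (step 682)
```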
Step 682, in response to the specified type of content not being detected from the target image, the target image is output.
Alternatively, when the target image recognition model does not detect that the specified type of content exists in the target image, the target image is output. Schematically, as shown in fig. 9, a target image 910 is shown, and when the target image recognition model does not detect that the specified type of content exists in the target image 910, the target image 910 is output, etc.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
In summary, by adopting the region-label labeling mode, the model's ability to distinguish between the negative sample image and the positive sample image is improved, and the image recognition model is trained while reducing the interference of a large number of negative sample images with the model, so that the target image recognition model's ability to recognize the image to be recognized is improved, the specified type of content in the image is recognized more accurately, and the accuracy of image filtering is improved.
FIG. 10 is a block diagram of a training apparatus for image recognition models according to an exemplary embodiment of the present application, as shown in FIG. 10, the apparatus includes the following parts:
a first obtaining module 1010, configured to obtain a sample image pair, where the sample image pair includes a positive sample image and a negative sample image, the positive sample image is an image with a specified type of content, and a region corresponding to the specified type of content is pre-labeled with a specified type of tag; the negative sample image is an image which does not contain the specified type of content, and the positive sample image and the negative sample image have similar association relations;
A label labeling module 1020 configured to label a plurality of image areas in the negative sample image with area labels, where the area labels are used to indicate that the plurality of image areas do not contain specified types of content;
the region prediction module 1030 is configured to perform region prediction of specified types of content on the negative sample image and the positive sample image through an image recognition model to be trained, so as to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image;
a second obtaining module 1040, configured to obtain a first loss value corresponding to the negative sample image based on a difference between the region label and the first candidate region; acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type tag and the second candidate region;
the model training module 1050 is configured to train the image recognition model based on the first loss value and the second loss value, so as to obtain a target image recognition model, where the target image recognition model is used to recognize the content of the specified type in the image.
In an optional embodiment, the label labeling module 1020 is further configured to perform region segmentation on the negative sample image to obtain a plurality of image regions in the negative sample image; and labeling the plurality of image areas with the area labels with the values of zero.
In an optional embodiment, the model training module 1050 is further configured to perform feature extraction on the first candidate region to obtain a first feature representation corresponding to the first candidate region; extracting features of the second candidate region to obtain a second feature representation corresponding to the second candidate region; determining a similarity loss value between the first feature representation and the second feature representation based on a preset similarity loss function; and training the image recognition model based on the first loss value, the second loss value and the similarity loss value to obtain the target image recognition model.
In an alternative embodiment, the image recognition model includes a region prediction network and a region detection network;
the model training module 1050 is further configured to train the region prediction network in the image recognition model with the first loss value and the second loss value to obtain a trained region prediction network; training the area detection network in the image recognition model by using the similarity loss value to obtain a trained area detection network; and forming the target image recognition model by the trained regional prediction network and the trained regional detection network.
In an alternative embodiment, the model training module 1050 is further configured to perform classification prediction on the specified type of content in the first candidate region through the region detection network, to obtain a first classification prediction result; carrying out classification prediction on the appointed type content in the second candidate region through the region detection network to obtain a second classification prediction result; acquiring a first classification loss value corresponding to the negative sample image based on the first classification prediction result and the region label corresponding to the negative sample image; acquiring a second classification loss value and a regression loss value corresponding to the positive sample image based on the second classification prediction result and a specified type label corresponding to the positive sample image; and training the area detection network in the image recognition model according to the similarity loss value, the first classification loss value, the second classification loss value and the regression loss value to obtain a trained area detection network.
In an optional embodiment, the region prediction module 1030 is further configured to perform region prediction of the specified type of content on the negative sample image and the positive sample image through the image recognition model to be trained, so as to obtain a plurality of first prediction regions corresponding to the negative sample image and a plurality of second prediction regions corresponding to the positive sample image; determine a first scoring condition of the specified type of content being included in the plurality of first prediction regions and a second scoring condition of the specified type of content being included in the plurality of second prediction regions; and, combining the first scoring condition and the second scoring condition, select the first candidate region corresponding to the negative sample image from the plurality of first prediction regions and select the second candidate region corresponding to the positive sample image from the plurality of second prediction regions.
In an optional embodiment, the second obtaining module 1040 is further configured to obtain, based on a difference between the specified type tag and the second candidate region, a region classification loss value and a region regression loss value between the specified type tag and the second candidate region, and use the region classification loss value and the region regression loss value as the second loss value corresponding to the positive sample image.
In an alternative embodiment, as shown in fig. 11, the apparatus further comprises:
the model output module 1060 is configured to obtain a target image, where the target image is an image to be subjected to region detection of specified type content; inputting the target image into the target image recognition model; and in response to detection of the specified type of content from the target image, outputting the target image marked with a prompt identifier, wherein the prompt identifier is used for indicating an image area corresponding to the specified type of content.
In an alternative embodiment, the model output module 1060 is further configured to output the target image in response to the absence of detection of the specified type of content from the target image.
In an alternative embodiment, the model output module 1060 is further configured to output, in response to detecting the specified type of content from the target image, a target image labeled with a prompt box, where the prompt box is configured to enclose the specified type of content; alternatively, in response to detecting the specified type of content from the target image, a target image with a blur identification for overlaying the specified type of content is output.
In summary, by adopting the region-label labeling mode, the model's ability to distinguish between the negative sample image and the positive sample image is improved, and the image recognition model is trained while reducing the interference of a large number of negative sample images with the model, so that the target image recognition model's ability to recognize the image to be recognized is improved, the specified type of content in the image is recognized more accurately, and the accuracy of image filtering is improved.
It should be noted that: the training device for an image recognition model provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the training device of the image recognition model provided in the above embodiment and the training method embodiment of the image recognition model belong to the same concept, and detailed implementation processes of the training device and the training method embodiment of the image recognition model are detailed in the method embodiment, and are not repeated here.
Fig. 12 is a schematic diagram showing a structure of a server according to an exemplary embodiment of the present application. The server 1200 includes a central processing unit (Central Processing Unit, CPU) 1201, a system Memory 1204 including a random access Memory (Random Access Memory, RAM) 1202 and a Read Only Memory (ROM) 1203, and a system bus 1205 connecting the system Memory 1204 and the central processing unit 1201. The server 1200 also includes a mass storage device 1206 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The mass storage device 1206 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1206 and its associated computer-readable media provide non-volatile storage for the server 1200. That is, the mass storage device 1206 may include a computer readable medium (not shown) such as a hard disk or compact disk read only memory (Compact Disc Read Only Memory, CD-ROM) drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The system memory 1204 and mass storage device 1206 described above may be collectively referred to as memory.
According to various embodiments of the application, the server 1200 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the server 1200 may be connected to the network 1212 through a network interface unit 1211 coupled to the system bus 1205, or alternatively, the network interface unit 1211 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, one or more programs stored in the memory and configured to be executed by the CPU.
The embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein at least one instruction, at least one section of program, code set or instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or instruction set is loaded and executed by the processor to realize the training method of the image recognition model provided by each method embodiment.
Embodiments of the present application further provide a computer readable storage medium having at least one instruction, at least one program, a code set, or an instruction set stored thereon, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the training method of the image recognition model provided by the above method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the training method of the image recognition model according to any one of the above embodiments.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the present application; the scope of protection of the present application is defined by the appended claims.

Claims (14)

1. A method of training an image recognition model, the method comprising:
acquiring a sample image pair, wherein the sample image pair comprises a positive sample image and a negative sample image, the positive sample image is an image with specified type content, and a region corresponding to the specified type content is pre-marked with a specified type label; the negative sample image is an image which does not contain the specified type of content, and the positive sample image and the negative sample image have similar association relations;
labeling a plurality of image areas in the negative sample image with area labels, wherein the area labels are used for indicating that the plurality of image areas do not contain specified types of content;
performing region prediction of appointed type contents on the negative sample image and the positive sample image through an image recognition model to be trained to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image;
Acquiring a first loss value corresponding to the negative sample image based on the difference between the region label and the first candidate region; acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type tag and the second candidate region;
training the image recognition model based on the first loss value and the second loss value to obtain a target image recognition model, wherein the target image recognition model is used for recognizing specified types of content in an image.
2. The method of claim 1, wherein labeling the plurality of image regions in the negative sample image with region labels comprises:
performing region segmentation on the negative sample image to obtain a plurality of image regions in the negative sample image;
and labeling the plurality of image areas with the area labels with the values of zero.
3. The method according to claim 1 or 2, wherein the training the image recognition model based on the first loss value and the second loss value to obtain a target image recognition model comprises:
extracting features of the first candidate region to obtain a first feature representation corresponding to the first candidate region;
Extracting features of the second candidate region to obtain a second feature representation corresponding to the second candidate region;
determining a similarity loss value between the first feature representation and the second feature representation based on a preset similarity loss function;
and training the image recognition model based on the first loss value, the second loss value and the similarity loss value to obtain the target image recognition model.
4. A method according to claim 3, wherein the image recognition model comprises a region prediction network and a region detection network;
the training the image recognition model based on the first loss value, the second loss value and the similarity loss value to obtain the target image recognition model includes:
training the regional prediction network in the image recognition model according to the first loss value and the second loss value to obtain a trained regional prediction network;
training the area detection network in the image recognition model by using the similarity loss value to obtain a trained area detection network;
and forming the target image recognition model by the trained regional prediction network and the trained regional detection network.
5. The method of claim 4, wherein training the region detection network in the image recognition model with the similarity loss value results in a trained region detection network, comprising:
carrying out classification prediction on the specified type content in the first candidate region through the region detection network to obtain a first classification prediction result; carrying out classification prediction on the appointed type content in the second candidate region through the region detection network to obtain a second classification prediction result;
acquiring a first classification loss value corresponding to the negative sample image based on the first classification prediction result and the region label corresponding to the negative sample image;
acquiring a second classification loss value and a regression loss value corresponding to the positive sample image based on the second classification prediction result and a specified type label corresponding to the positive sample image;
and training the area detection network in the image recognition model according to the similarity loss value, the first classification loss value, the second classification loss value and the regression loss value to obtain a trained area detection network.
6. The method according to claim 1 or 2, wherein the performing, by the image recognition model to be trained, the region prediction of the negative sample image and the positive sample image with the specified type of content, to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image, includes:
Performing region prediction of appointed type contents on the negative sample image and the positive sample image through an image recognition model to be trained to obtain a plurality of first prediction regions corresponding to the negative sample image and a plurality of second prediction regions corresponding to the positive sample image;
determining a first scoring condition including the specified type of content in the plurality of first prediction regions and a second scoring condition including the specified type of content in the plurality of second prediction regions;
and combining the first scoring condition and the second scoring condition, selecting a first candidate region corresponding to the negative sample image from the plurality of first prediction regions, and selecting a second candidate region corresponding to the positive sample image from the plurality of second prediction regions.
7. The method according to claim 1 or 2, wherein the obtaining a second loss value corresponding to the positive sample image based on a difference between the specified type of tag and the second candidate region includes:
and based on the difference between the specified type tag and the second candidate region, acquiring a region classification loss value and a region regression loss value between the specified type tag and the second candidate region, and taking the region classification loss value and the region regression loss value as a second loss value corresponding to the positive sample image.
8. The method according to claim 1 or 2, wherein training the image recognition model based on the first loss value and the second loss value, after obtaining a target image recognition model, further comprises:
acquiring a target image, wherein the target image is an image to be subjected to region detection of specified type content;
inputting the target image into the target image recognition model;
and in response to detection of the specified type of content from the target image, outputting the target image marked with a prompt identifier, wherein the prompt identifier is used for indicating an image area corresponding to the specified type of content.
9. The method of claim 8, wherein the method further comprises:
the target image is output in response to the specified type of content not being detected from the target image.
10. The method of claim 8, wherein outputting the target image marked with the prompt identifier in response to detecting the specified type of content in the target image comprises:
in response to detecting the specified type of content in the target image, outputting the target image marked with a prompt box, wherein the prompt box is used for enclosing the specified type of content;
or,
in response to detecting the specified type of content in the target image, outputting the target image marked with a blur identifier, wherein the blur identifier is used for covering the specified type of content.
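The two marking modes might be realized, for example, with Pillow; the (left, top, right, bottom) box format, function names, and blur radius below are illustrative assumptions:

```python
from PIL import Image, ImageDraw, ImageFilter

def mark_regions(image: Image.Image, boxes, mode: str = "box") -> Image.Image:
    # mode="box"  draws a prompt box enclosing each detected region;
    # mode="blur" overlays each detected region with a blurred patch.
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for box in boxes:
        box = tuple(int(v) for v in box)  # (left, top, right, bottom)
        if mode == "box":
            draw.rectangle(box, outline="red", width=3)           # prompt box
        else:
            patch = out.crop(box).filter(ImageFilter.GaussianBlur(12))
            out.paste(patch, box[:2])                             # blur identifier
    return out
```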
11. A training device for an image recognition model, the device comprising:
a first acquisition module, used for acquiring a sample image pair, wherein the sample image pair comprises a positive sample image and a negative sample image, the positive sample image is an image containing specified type content, a region corresponding to the specified type content being pre-labeled with a specified type label, the negative sample image is an image that does not contain the specified type of content, and the positive sample image and the negative sample image have a similarity association relationship;
a label annotation module, used for labeling region labels for a plurality of image regions in the negative sample image, the region labels being used for indicating that the plurality of image regions do not contain the specified type of content;
a region prediction module, used for performing region prediction of the specified type of content on the negative sample image and the positive sample image through an image recognition model to be trained, to obtain a first candidate region corresponding to the negative sample image and a second candidate region corresponding to the positive sample image;
a second acquisition module, used for acquiring a first loss value corresponding to the negative sample image based on the difference between the region labels and the first candidate region, and acquiring a second loss value corresponding to the positive sample image based on the difference between the specified type label and the second candidate region;
and a model training module, used for training the image recognition model based on the first loss value and the second loss value to obtain a target image recognition model, the target image recognition model being used for recognizing the specified type of content in an image.
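For illustration, the five modules of this claim might be composed as follows; the class name and callable interfaces are hypothetical, not part of the claimed device:

```python
class ImageRecognitionModelTrainer:
    """Hypothetical composition of the five modules named in this claim."""

    def __init__(self, first_acquisition, label_annotation,
                 region_prediction, loss_acquisition, model_training):
        self.first_acquisition = first_acquisition  # builds sample image pairs
        self.label_annotation = label_annotation    # labels negative-sample regions
        self.region_prediction = region_prediction  # predicts candidate regions
        self.loss_acquisition = loss_acquisition    # computes the two loss values
        self.model_training = model_training        # updates the model

    def train_once(self):
        positive, negative = self.first_acquisition()
        region_labels = self.label_annotation(negative)
        first_cand, second_cand = self.region_prediction(negative, positive)
        first_loss, second_loss = self.loss_acquisition(
            region_labels, first_cand, second_cand)
        return self.model_training(first_loss, second_loss)
```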
12. A computer device, comprising a processor and a memory, wherein at least one program is stored in the memory, and the at least one program is loaded and executed by the processor to implement the training method of the image recognition model according to any one of claims 1 to 10.
13. A computer readable storage medium, wherein at least one program is stored in the storage medium, and the at least one program is loaded and executed by a processor to implement the training method of the image recognition model according to any one of claims 1 to 10.
14. A computer program product, comprising a computer program or instructions which, when executed by a processor, implement the training method of the image recognition model according to any one of claims 1 to 10.
CN202211387010.6A 2022-11-07 2022-11-07 Training method, device, equipment, medium and product of image recognition model Pending CN117011631A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211387010.6A CN117011631A (en) 2022-11-07 2022-11-07 Training method, device, equipment, medium and product of image recognition model

Publications (1)

Publication Number Publication Date
CN117011631A true CN117011631A (en) 2023-11-07

Family

ID=88575058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211387010.6A Pending CN117011631A (en) 2022-11-07 2022-11-07 Training method, device, equipment, medium and product of image recognition model

Country Status (1)

Country Link
CN (1) CN117011631A (en)


Legal Events

Date Code Title Description
PB01 Publication