CN117173501A - Training method of image detection model, image detection method and related device - Google Patents

Info

Publication number
CN117173501A
Authority
CN
China
Prior art keywords
image
sample
text
subset
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310974673.6A
Other languages
Chinese (zh)
Inventor
茅心悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Original Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Application filed by China Telecom Technology Innovation Center and China Telecom Corp Ltd
Priority claimed from CN202310974673.6A
Publication of CN117173501A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a training method of an image detection model, an image detection method and a related device, which are used for improving the detection precision of the image detection model and reducing the labor cost. The method comprises the following steps: acquiring a training sample set, wherein the training sample set comprises sample images of a plurality of categories; generating a plurality of texts according to the plurality of categories, and extracting text features of each text; for each image in the training sample set, determining a loss parameter of each image: inputting each image into a network model to be trained, and extracting a first image feature of a region of each image containing a target object; determining the similarity of the first image feature to the text feature of each of the plurality of texts; determining the loss parameter of each image based on the determined multiple similarities corresponding to the first image feature; and adjusting parameters of the network model to be trained based on the loss parameters of all images in the training sample set and a preset loss function.

Description

Training method of image detection model, image detection method and related device
Technical Field
The application relates to the field of target detection, and discloses a training method of an image detection model, an image detection method and a related device.
Background
In the related art, a pre-trained network model is obtained by training a network with pre-training samples, and the real training samples are then input into the pre-trained network model for fine-tuning to obtain the final model. The number of pre-training samples is generally large and their image features are rich, for example in color, shape and texture, while the real training samples are typically few in number and their image features are relatively poor. Fine-tuning the pre-trained network model with the real training samples therefore works poorly, so the detection precision is limited.
Disclosure of Invention
The application provides a training method of an image detection model, an image detection method and a related device, which are used for improving the detection precision of the image detection model and reducing the labor cost.
In a first aspect, an embodiment of the present application provides a training method for an image detection model, including:
acquiring a training sample set, wherein the training sample set comprises sample images of a plurality of categories, one sample image corresponds to one category, and the label of the sample image is the category corresponding to the sample image;
generating a plurality of texts according to the plurality of categories, and extracting text features of each text, wherein the plurality of texts are in one-to-one correspondence with the plurality of categories, and each text comprises the category corresponding to that text;
for each image in the training sample set, performing the following operation to determine a loss parameter for the each image: inputting each image into a network model to be trained, and extracting first image features of a region of each image containing a target object; determining a similarity of the first image feature to a text feature of each of the plurality of texts; determining a loss parameter of each image based on the determined multiple similarities corresponding to the first image features;
and adjusting parameters of the network model to be trained based on the loss parameters of all images in the training sample set and a preset loss function.
In a possible implementation manner, in the training method of the image detection model provided by the embodiment of the present application, the generating a plurality of texts according to the plurality of categories includes:
generating each text according to at least one expansion field and the category corresponding to the text, wherein the expansion field comprises one or more of the following:
The acquisition mode of the images in the training sample set and the scene to which the images in the training sample set belong.
In a possible implementation manner, in the training method of the image detection model provided by the embodiment of the present application, the training sample set includes: a plurality of sample subsets; the plurality of sample subsets includes a first sample subset and at least one second sample subset;
the sample images in the first sample subset are acquired in a different manner from the sample images in the second sample subset;
and the total number of sample images in the first sample subset is greater than the total number of sample images in each of the at least one second sample subset.
In a possible implementation manner, in the training method of an image detection model provided by the embodiment of the present application, in the loss function, a weight of a loss sub-function corresponding to the first sample subset is smaller than a weight of a loss sub-function corresponding to the at least one second sample subset.
For example, the loss function may be $L = \alpha_M \cdot L_{pre} + \alpha_Q \cdot L_{real}$, wherein $L_{pre}$ characterizes the loss sub-function corresponding to the first sample subset, and $\alpha_M$ characterizes the weight of the loss sub-function corresponding to the first sample subset; $L_{real}$ characterizes the loss sub-function corresponding to the at least one second sample subset, and $\alpha_Q$ characterizes the weight of the loss sub-function corresponding to the at least one second sample subset.
In some examples, $L_{pre} = \sum_{a} -\log p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ and $L_{real} = \sum_{N=1}^{NT} \sum_{b} -\log p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$, wherein $F^{pre}_{im\_a}$ characterizes the image feature of the a-th sample image in the first sample subset pre; $real\_N$ characterizes the N-th second sample subset of the at least one second sample subset, $NT$ characterizes the total number of the at least one second sample subset, and $F^{real\_N}_{im\_b}$ characterizes the image feature of the b-th sample image in the N-th second sample subset.
$F^{pre}_{tem\_a}$ characterizes the target text feature corresponding to the image feature of the a-th sample image in the first sample subset pre; among the similarities between the image feature of the a-th sample image in the first sample subset pre and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the a-th sample image in the first sample subset pre is the largest.
$F^{real\_N}_{tem\_b}$ characterizes the target text feature corresponding to the image feature of the b-th sample image in the N-th second sample subset; among the similarities between the image feature of the b-th sample image in the N-th second sample subset and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the b-th sample image in the N-th second sample subset is the largest.
$p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ characterizes the loss parameter of $F^{pre}_{im\_a}$, and $p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$ characterizes the loss parameter of $F^{real\_N}_{im\_b}$; wherein $p\left(F_{im}, F_{tem}\right) = \dfrac{\exp\left(S\left(F_{im}, F_{tem}\right)/\tau\right)}{\sum_{k}\exp\left(S\left(F_{im}, F_{te\_k}\right)/\tau\right)}$, $F_{te\_k}$ characterizes the k-th text feature among the text features of the plurality of texts, $S\left(F_{im}, F_{tem}\right)$ characterizes the similarity between $F_{im}$ and $F_{tem}$, $S\left(F_{im}, F_{te\_k}\right)$ characterizes the similarity between $F_{im}$ and $F_{te\_k}$, and $\tau$ is a constant.
In a possible implementation manner, in the training method of the image detection model provided by the embodiment of the present application, the plurality of sample subsets includes a plurality of second sample subsets, where the scene to which the images in any one of the second sample subsets belong is different from the scene to which the images in any other one of the plurality of second sample subsets belong.
In a possible implementation manner, in the training method of the image detection model provided by the embodiment of the present application, the loss function includes a loss sub-function corresponding to the first sample subset, and a loss sub-function corresponding to each second sample subset in the plurality of second sample subsets;
wherein the weight of the loss sub-function corresponding to the first sample subset is less than the minimum of the weights of the loss sub-functions corresponding to the plurality of second sample subsets.
For example, in the training method of the image detection model provided by the embodiment of the present application, the loss function is:
$L = \alpha_M \cdot L_{pre} + \sum_{N=1}^{NT} \alpha_{real\_N} \cdot L_{real\_N}$
wherein $L_{pre}$ characterizes the loss sub-function corresponding to the first sample subset, and $\alpha_M$ characterizes the weight of the loss sub-function corresponding to the first sample subset; $NT$ characterizes the total number of the plurality of second sample subsets, $L_{real\_N}$ characterizes the loss sub-function corresponding to the N-th second sample subset of the plurality of second sample subsets, and $\alpha_{real\_N}$ characterizes the weight of the loss sub-function corresponding to the N-th second sample subset.
In some examples, $L_{pre} = \sum_{a} -\log p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ and $L_{real\_N} = \sum_{b} -\log p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$, wherein $F^{pre}_{im\_a}$ characterizes the image feature of the a-th sample image in the first sample subset pre, and $F^{real\_N}_{im\_b}$ characterizes the image feature of the b-th sample image in the N-th second sample subset.
$F^{pre}_{tem\_a}$ characterizes the target text feature corresponding to the image feature of the a-th sample image in the first sample subset pre; among the similarities between the image feature of the a-th sample image in the first sample subset pre and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the a-th sample image in the first sample subset pre is the largest.
$F^{real\_N}_{tem\_b}$ characterizes the target text feature corresponding to the image feature of the b-th sample image in the N-th second sample subset; among the similarities between the image feature of the b-th sample image in the N-th second sample subset and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the b-th sample image in the N-th second sample subset is the largest.
$p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ characterizes the loss parameter of $F^{pre}_{im\_a}$, and $p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$ characterizes the loss parameter of $F^{real\_N}_{im\_b}$; wherein $p\left(F_{im}, F_{tem}\right) = \dfrac{\exp\left(S\left(F_{im}, F_{tem}\right)/\tau\right)}{\sum_{k}\exp\left(S\left(F_{im}, F_{te\_k}\right)/\tau\right)}$, $F_{te\_k}$ characterizes the k-th text feature among the text features of the plurality of texts, $S\left(F_{im}, F_{tem}\right)$ characterizes the similarity between $F_{im}$ and $F_{tem}$, $S\left(F_{im}, F_{te\_k}\right)$ characterizes the similarity between $F_{im}$ and $F_{te\_k}$, and $\tau$ is a constant.
In a second aspect, an embodiment of the present application provides an image detection method, which may include:
acquiring an image to be detected;
if the image to be detected comprises a target object, extracting image features of a region comprising the target object;
determining the similarity of the image features and text features of each text in a plurality of texts, wherein the texts are pre-generated and correspond to a plurality of object categories one by one, and each text comprises a corresponding object category;
And determining the object category included in the target text as the object category of the target object, wherein the similarity between the image feature and the text feature of the target text is the largest of the determined multiple similarities corresponding to the image feature.
In a possible implementation manner, an embodiment of the present application provides an image detection method, where the number of characters of each text is greater than a preset threshold number of characters.
In a possible implementation manner, an embodiment of the present application provides an image detection method, where the image to be detected is acquired by an image acquisition device in a kitchen scene;
the plurality of object categories includes:
wearing a chef's hat and wearing a mask, wearing a chef's hat and not wearing a mask, not wearing a chef's hat and wearing a mask, and not wearing a chef's hat and not wearing a mask.
In a third aspect, an embodiment of the present application provides an electronic device that may include a memory and a processor;
the memory is used for storing program instructions;
the processor is configured to execute the program instructions to implement the method according to the first aspect and any possible implementation manner thereof, or to execute the method according to the second aspect and any possible implementation manner thereof.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium, which may comprise computer program instructions which, when executed by a computer, perform a method as described in the first aspect and any of the possible embodiments thereof, or perform a method as described in the second aspect and any of the possible embodiments thereof.
In a fifth aspect, an embodiment of the present application further provides a training apparatus, including:
the sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises sample images of a plurality of categories, one sample image corresponds to one category, and the label of the sample image is the category corresponding to the sample image;
the text feature generation module is used for generating a plurality of texts according to the plurality of categories and extracting text features of each text, wherein the plurality of texts are in one-to-one correspondence with the plurality of categories, and each text comprises the category corresponding to each text;
the model training module is used for executing the following operation on each image in the training sample set, and determining the loss parameter of each image: inputting each image into a network model to be trained, and extracting first image features of a region of each image containing a target object; determining a similarity of the first image feature to a text feature of each of the plurality of texts; determining a loss parameter of each image based on the determined multiple similarities corresponding to the first image features; and adjusting parameters of the network model to be trained based on the loss parameters of all images in the training sample set and a preset loss function.
In a sixth aspect, an embodiment of the present application further provides an image detection apparatus, including:
the image acquisition module is used for acquiring an image to be detected;
the image detection module is used for extracting image features of a region comprising the target object if the image to be detected comprises the target object; determining the similarity of the image features and text features of each text in a plurality of texts, wherein the texts are pre-generated and correspond to a plurality of object categories one by one, and each text comprises a corresponding object category; and determining the object category included in the target text as the object category of the target object, wherein the similarity between the image feature and the text feature of the target text is the largest of the determined multiple similarities corresponding to the image feature.
The embodiment of the application has the following beneficial effects:
the application provides a training method of an image detection model, an image detection method and a related device, which can generate an expanded text of each category by utilizing the category of a sample image in a training sample set. The network model is trained by using the text features of the expanded text and the image features of the sample image, so that the data volume in the training process can be increased, the robustness of the model is improved, the features of the sample image are enriched, the detection precision is good, a large number of real training samples are not needed, and the manual labeling cost can be reduced. The situation that the effect of training the network model is poor is caused because the feature distribution of the pre-training sample is different from that of the real training sample does not exist.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a training process of the related art;
FIG. 2 is a schematic flow chart of a training method of an image detection model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training process according to an embodiment of the present application;
FIG. 4 is a schematic illustration of a pre-training sample image;
FIG. 5 is a schematic illustration of a sample image of a real scene;
FIG. 6 is a schematic diagram of a training process provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart of an image detection method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a detection device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order of illustration is depicted in the flowchart, in some cases the steps shown or described may be performed in a different order than presented.
The terms first and second in the description and claims of the application and in the above-mentioned figures are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The term "plurality" in the present application may mean at least two, for example, two, three or more, and embodiments of the present application are not limited thereto.
In the technical scheme of the application, the acquisition, the transmission, the use and the like of the image data all meet the requirements of national relevant laws and regulations.
Before describing a training method of an image detection model provided by the embodiment of the present application, for convenience of understanding, the following detailed description is first provided for the technical background of the embodiment of the present application.
In the related art, as shown in fig. 1, a network model is trained by using a pre-training sample set to obtain a pre-trained network model: each pre-training sample is input into the network, and the network is trained with the goal of outputting the label of that sample. The pre-trained network model is then fine-tuned by using the real training samples: each real training sample is input into the pre-trained network model, and the pre-trained network model is adjusted with the goal of outputting the label of that sample, thereby obtaining the final detection model.
The pre-training sample set usually consists of pre-acquired annotated images, and the image features in the pre-training sample set are rich. The real training sample set consists of images from the actual application scene of the model; because the labor cost required for image annotation is high, the number of samples in the real training sample set is small, and the image features in the real training sample set are relatively poor. Fine-tuning the pre-trained network model on the real sample images therefore has an unsatisfactory effect, and the detection precision is limited.
In view of this, the embodiment of the application provides a training method of an image detection model, an image detection method and a related device. The label of each sample image is expanded into a text, and text features are extracted; the network model is then trained by combining the text features and the image features, so that the training of the image detection model combines semantic information and image information, and the detection accuracy can be improved.
FIG. 2 illustrates a training method for an image detection model, according to an exemplary embodiment. The method may be performed by an electronic device, the method may comprise the steps of:
s201, acquiring a training sample set, wherein the training sample set comprises sample images of a plurality of categories, one sample image corresponds to one category, and the label of the sample image is the category corresponding to the sample image.
In specific implementation, as shown in fig. 3, when the electronic device performs image detection model training, a training sample set may be input into the network model, and a result is output, and parameters of the network model are adjusted according to the output result, so as to implement a process of training the network model. In the training method of the image detection model provided by the embodiment of the application, a pre-training network model is not generated.
The training sample set acquired by the electronic device may include sample images of multiple sources, and may include the pre-training sample images and the real-scene sample images. For convenience of introducing the method provided by the application, a kitchen is taken as the application scene, and detecting whether a chef in the kitchen wears a chef's hat and a mask is taken as the detection purpose as an example. It should be noted that the training method provided by the present application may also be applied to other scenes and other detection purposes.
As shown in fig. 4, a pre-training sample image may be an image obtained through image processing and has rich image features. A real-scene sample image is generally collected in the application scene and, as shown in fig. 5, its image features are relatively poor.
The training sample set may include a plurality of sample images of a plurality of categories, wherein one sample image corresponds to one category, and the category corresponding to the sample image is used as the label of the sample image. For example, the plurality of categories may be respectively recorded as: wearing a chef's hat and wearing a mask, wearing a chef's hat and not wearing a mask, not wearing a chef's hat and wearing a mask, and not wearing a chef's hat and not wearing a mask. For another example, the plurality of categories may be respectively recorded as: chef's hat+mask, chef's hat+nomask, no chef's hat+mask, no chef's hat+nomask. The embodiment of the application does not limit the language or the form of the image label, as long as the label can represent the meaning corresponding to each category.
S202, generating a plurality of texts according to the plurality of categories, and extracting text characteristics of each text, wherein the plurality of texts are in one-to-one correspondence with the plurality of categories, and each text comprises the category corresponding to each text.
In specific implementation, the electronic device may expand each category to generate a corresponding text. Optionally, the electronic device may generate each text according to at least one extended field and the category corresponding to that text, where the extended field includes one or more of the following: the acquisition mode of the images in the training sample set and the scene to which the images in the training sample set belong. The acquisition mode of the image may characterize the source of the image. The scene to which the image belongs may characterize the application scene of the network model.
In some examples, the text generated by the electronic device expanding each category may include a manner of acquisition of the sample image. For example, each category is expanded, and the generated text is respectively:
text 1: an online image of a person wearing a chef's hat and wearing a mask;
text 2: an online image of a person not wearing a chef's hat and wearing a mask;
text 3: A real photo of a person wearing a chef's hat and not wearing a mask;
text 4: A real photo of a person not wearing a chef's hat and not wearing a mask.
Wherein, "online" characterizes one way of acquiring an image, and "real" characterizes one way of acquiring an image.
In other examples, the text generated by the electronic device expanding each category may include a sample image acquisition mode and a scene to which the image belongs. For example, each category is expanded, and the generated text is respectively:
text 1: an online image of a person wearing a chef's hat and wearing a mask in sceneX;
text 2: an online image of a person not wearing a chef's hat and wearing a mask in sceneX;
text 3: a real photo of a person wearing a chef's hat and not wearing a mask in sceneX;
text 4: A real photo of a person not wearing a chef's hat and not wearing a mask in sceneX.
Wherein, "online" characterizes one way of acquiring an image, and "real" characterizes one way of acquiring an image. "sceneX" characterizes the scene X to which the image belongs, i.e., the image is acquired (or captured) in scene X. For example, "scene1" may characterize scene1 to which the image belongs, and "scene2" may characterize scene2 to which the image belongs. The image acquisition modes in the application can include but are not limited to: and obtaining, generating, collecting and the like on the network. In the foregoing example, "online" may characterize acquisition on the web. "real" may characterize the acquisition.
The electronic device may extract text features of the generated text, may obtain text feature 1 of text 1, text feature 2 of text 2, text feature 3 of text 3, and text feature 4 of text 4.
Optionally, the number of characters of the text generated by the electronic device is greater than or equal to a preset character threshold. For example, the preset character threshold may be 16 characters. Alternatively, the number of characters of the text generated by the electronic device may be greater than or equal to 128 characters.
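The following sketch is only an illustration of how the text expansion and text feature extraction in S202 could be implemented; the category strings, the prompt template and the use of the open-source CLIP text encoder are assumptions made for this example and are not specified by the application.

```python
# Illustrative sketch of S202 (assumptions: category strings, prompt template,
# and the open-source CLIP package https://github.com/openai/CLIP as text encoder).
import torch
import clip  # assumed available

categories = [
    "wearing a chef's hat and wearing a mask",
    "wearing a chef's hat and not wearing a mask",
    "not wearing a chef's hat and wearing a mask",
    "not wearing a chef's hat and not wearing a mask",
]

def build_text(category: str, acquisition: str = "online", scene: str = "scene1") -> str:
    # One text per category; acquisition mode and scene are the optional extension fields.
    return f"An {acquisition} image of a person {category} in {scene}"

texts = [build_text(c) for c in categories]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)              # text encoder (assumption)
tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)                # (num_categories, d)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```

Because each generated text contains its category, the one-to-one correspondence between texts and categories described above is preserved.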
S203, for each image in the training sample set, executing the following operation, and determining a loss parameter of each image: inputting each image into a network model to be trained, and extracting first image features of a region of each image containing a target object; determining a similarity of the first image feature to a text feature of each of the plurality of texts; and determining a loss parameter of each image based on the determined multiple similarities corresponding to the first image feature.
S204, adjusting parameters of the network model to be trained based on loss parameters of all images in the training sample set and a preset loss function.
In a possible implementation manner, the electronic device may not distinguish the acquisition modes of the sample images during steps 203 and 204. The electronic device may be configured with a first loss function: $L = \sum_{i \in \text{training sample set}} -\log p\left(F_{im\_i}, F_{tem\_i}\right)$;
wherein $F_{im\_i}$ characterizes the image feature of the i-th sample image in the training sample set, and $F_{te\_k}$ characterizes the k-th text feature among the text features of the plurality of texts; $F_{tem\_i}$ characterizes the target text feature corresponding to the i-th sample image, that is, among the similarities between the image feature of the i-th sample image and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the i-th sample image is the largest;
$p\left(F_{im\_i}, F_{tem\_i}\right)$ characterizes the loss parameter of $F_{im\_i}$, where $p\left(F_{im\_i}, F_{tem\_i}\right) = \dfrac{\exp\left(S\left(F_{im\_i}, F_{tem\_i}\right)/\tau\right)}{\sum_{k}\exp\left(S\left(F_{im\_i}, F_{te\_k}\right)/\tau\right)}$, $S\left(F_{im\_i}, F_{tem\_i}\right)$ characterizes the similarity between $F_{im\_i}$ and $F_{tem\_i}$, $S\left(F_{im\_i}, F_{te\_k}\right)$ characterizes the similarity between $F_{im\_i}$ and $F_{te\_k}$, and $\tau$ is a constant.
For each sample image, the electronic device may determine the similarity between the image feature of the sample image and each text feature, e.g., $S\left(F_{im}, F_{te}\right) = \dfrac{F_{im} \cdot F_{te}}{\left\|F_{im}\right\| \left\|F_{te}\right\|}$, wherein $F_{im}$ characterizes any one image feature and $F_{te}$ characterizes any one text feature.
For each sample image, the electronic device may determine the text feature corresponding to the maximum value among the similarities between the image feature of the sample image and each text feature as the target text feature corresponding to the image feature of the sample image, and determine the loss parameter of the sample image based on the similarities between the image feature of the sample image and each text feature and the target text feature corresponding to the image feature of the sample image. For example, the loss parameter of $F_{im\_i}$ is $p\left(F_{im\_i}, F_{tem\_i}\right)$ as given above.
the electronic device may determine a loss value of the current training based on the first loss function and the loss parameter of each sample image, and perform the foregoing training process again after adjusting the model based on the loss value. Optionally, the electronic device may end training after determining that the loss value is smaller than the preset loss value threshold, and use the trained model as the trained image detection model.
In another possible implementation manner, the electronic device may distinguish the acquisition mode of each sample image and the scene to which the image belongs, through the loss function configured in the electronic device, in the processes of step 203 and step 204. For example, the training sample set may include: a plurality of sample subsets. The plurality of sample subsets may include a first sample subset and at least one second sample subset. The sample images in the first sample subset are acquired in a different manner than the sample images in the second sample subset. As an example, the sample images in the first sample subset may be acquired or generated on the internet, and the sample images in the second sample subset may be collected on site. The total number of sample images in the first sample subset is greater than the total number of sample images in each of the at least one second sample subset.
In some examples, the plurality of sample subsets includes at least one second sample subset; in a case where the at least one second sample subset is a plurality of second sample subsets, the scene to which the images in any one second sample subset belong is different from the scene to which the images in any other second sample subset belong. For example, the second sample subsets may include a second sample subset 1 and a second sample subset 2. The scene to which the images in the second sample subset 1 belong may be scene 1, and the scene to which the images in the second sample subset 2 belong may be scene 2.
The electronic device may be configured with a second loss function:
$L = \alpha_M \cdot L_{pre} + \alpha_Q \cdot L_{real}$
wherein $L_{pre}$ characterizes the loss sub-function corresponding to the first sample subset, and $\alpha_M$ characterizes the weight of the loss sub-function corresponding to the first sample subset; $L_{real}$ characterizes the loss sub-function corresponding to the at least one second sample subset, and $\alpha_Q$ characterizes the weight of the loss sub-function corresponding to the at least one second sample subset.
Optionally, $L_{pre} = \sum_{a} -\log p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ and $L_{real} = \sum_{N=1}^{NT} \sum_{b} -\log p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$, wherein $F^{pre}_{im\_a}$ characterizes the image feature of the a-th sample image in the first sample subset pre; $real\_N$ characterizes the N-th second sample subset of the at least one second sample subset, $NT$ characterizes the total number of the at least one second sample subset, and $F^{real\_N}_{im\_b}$ characterizes the image feature of the b-th sample image in the N-th second sample subset.
$F^{pre}_{tem\_a}$ characterizes the target text feature corresponding to the image feature of the a-th sample image in the first sample subset pre; among the similarities between the image feature of the a-th sample image in the first sample subset pre and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the a-th sample image in the first sample subset pre is the largest.
$F^{real\_N}_{tem\_b}$ characterizes the target text feature corresponding to the image feature of the b-th sample image in the N-th second sample subset; among the similarities between the image feature of the b-th sample image in the N-th second sample subset and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the b-th sample image in the N-th second sample subset is the largest.
$p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ characterizes the loss parameter of $F^{pre}_{im\_a}$, and $p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$ characterizes the loss parameter of $F^{real\_N}_{im\_b}$; wherein $p\left(F_{im}, F_{tem}\right) = \dfrac{\exp\left(S\left(F_{im}, F_{tem}\right)/\tau\right)}{\sum_{k}\exp\left(S\left(F_{im}, F_{te\_k}\right)/\tau\right)}$, $F_{te\_k}$ characterizes the k-th text feature among the text features of the plurality of texts, $S\left(F_{im}, F_{tem}\right)$ characterizes the similarity between $F_{im}$ and $F_{tem}$, $S\left(F_{im}, F_{te\_k}\right)$ characterizes the similarity between $F_{im}$ and $F_{te\_k}$, and $\tau$ is a constant.
For each sample image in each sample subset, the electronic device may determine a text feature corresponding to a maximum of the similarity of the image feature of the sample image to each text feature as a target text feature corresponding to the image feature of the sample image. And determining a loss parameter of the sample image based on the similarity of the image feature of the sample image to each text feature and the target text feature corresponding to the image feature of the sample image.
For example, the loss parameter of $F^{pre}_{im\_a}$ is $p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$, and the loss parameter of $F^{real\_N}_{im\_b}$ is $p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$.
The electronic device may determine a loss value of the current training based on the second loss function and the loss parameter of each sample image in each sample subset, and perform the foregoing training process again after adjusting the model based on the loss value. Optionally, the electronic device may end training after determining that the loss value is smaller than the preset loss value threshold, and use the trained model as the trained image detection model.
In one possible design, in the second loss function, the weight $\alpha_M$ of the loss sub-function $L_{pre}$ corresponding to the first sample subset is smaller than the weight $\alpha_Q$ of the loss sub-function $L_{real}$ corresponding to the at least one second sample subset. In such a design, the detection effect of the image detection model on images obtained by the acquisition mode corresponding to the second sample subset can be improved.
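A minimal sketch of the second loss function, reusing first_loss from the sketch above; the concrete weights 0.3 and 0.7 only echo the later numeric example and are not prescribed.

```python
def second_loss(pre_image_features, real_image_features, text_features,
                alpha_m: float = 0.3, alpha_q: float = 0.7, tau: float = 0.07):
    # L = alpha_M * Lpre + alpha_Q * Lreal, with alpha_M < alpha_Q as described above.
    # pre_image_features: image features of the first sample subset;
    # real_image_features: image features of all second sample subsets concatenated.
    l_pre = first_loss(pre_image_features, text_features, tau)
    l_real = first_loss(real_image_features, text_features, tau)
    return alpha_m * l_pre + alpha_q * l_real
```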
In other examples, the plurality of sample subsets includes a plurality of second sample subsets, wherein the scene to which the images in any one second sample subset belong is different from the scene to which the images in any other one of the plurality of second sample subsets belong. For example, the second sample subsets may include a second sample subset 1 and a second sample subset 2. The scene to which the images in the second sample subset 1 belong may be scene 1, and the scene to which the images in the second sample subset 2 belong may be scene 2.
In the electronic device, a third loss function may be configured:
$L = \alpha_M \cdot L_{pre} + \sum_{N=1}^{NT} \alpha_{real\_N} \cdot L_{real\_N}$
wherein $L_{pre}$ characterizes the loss sub-function corresponding to the first sample subset, and $\alpha_M$ characterizes the weight of the loss sub-function corresponding to the first sample subset; $NT$ characterizes the total number of the plurality of second sample subsets, $L_{real\_N}$ characterizes the loss sub-function corresponding to the N-th second sample subset of the plurality of second sample subsets, and $\alpha_{real\_N}$ characterizes the weight of the loss sub-function corresponding to the N-th second sample subset.
Optionally, $L_{pre} = \sum_{a} -\log p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ and $L_{real\_N} = \sum_{b} -\log p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$, wherein $F^{pre}_{im\_a}$ characterizes the image feature of the a-th sample image in the first sample subset pre, and $F^{real\_N}_{im\_b}$ characterizes the image feature of the b-th sample image in the N-th second sample subset.
$F^{pre}_{tem\_a}$ characterizes the target text feature corresponding to the image feature of the a-th sample image in the first sample subset pre; among the similarities between the image feature of the a-th sample image in the first sample subset pre and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the a-th sample image in the first sample subset pre is the largest.
$F^{real\_N}_{tem\_b}$ characterizes the target text feature corresponding to the image feature of the b-th sample image in the N-th second sample subset; among the similarities between the image feature of the b-th sample image in the N-th second sample subset and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the b-th sample image in the N-th second sample subset is the largest.
$p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ characterizes the loss parameter of $F^{pre}_{im\_a}$, and $p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$ characterizes the loss parameter of $F^{real\_N}_{im\_b}$; wherein $p\left(F_{im}, F_{tem}\right) = \dfrac{\exp\left(S\left(F_{im}, F_{tem}\right)/\tau\right)}{\sum_{k}\exp\left(S\left(F_{im}, F_{te\_k}\right)/\tau\right)}$, $F_{te\_k}$ characterizes the k-th text feature among the text features of the plurality of texts, $S\left(F_{im}, F_{tem}\right)$ characterizes the similarity between $F_{im}$ and $F_{tem}$, $S\left(F_{im}, F_{te\_k}\right)$ characterizes the similarity between $F_{im}$ and $F_{te\_k}$, and $\tau$ is a constant.
For each sample image in each sample subset, the electronic device may determine a text feature corresponding to a maximum of the similarity of the image feature of the sample image to each text feature as a target text feature corresponding to the image feature of the sample image. And determining a loss parameter of the sample image based on the similarity of the image feature of the sample image to each text feature and the target text feature corresponding to the image feature of the sample image.
For example, the loss parameter of $F^{pre}_{im\_a}$ is $p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$, and the loss parameter of $F^{real\_N}_{im\_b}$ is $p\left(F^{real\_N}_{im\_b}, F^{real\_N}_{tem\_b}\right)$.
The electronic device may determine a loss value of the current training based on the third loss function and the loss parameter of each sample image in each sample subset, and perform the foregoing training process again after adjusting the model based on the loss value. Optionally, the electronic device may end training after determining that the loss value is smaller than the preset loss value threshold, and use the trained model as the trained image detection model.
In a possible design, in the third loss function, the weight $\alpha_M$ of the loss sub-function $L_{pre}$ corresponding to the first sample subset is smaller than the weight $\alpha_{real\_N}$ of the loss sub-function corresponding to each of the plurality of second sample subsets (for N from 1 to NT), i.e. smaller than the minimum of these weights. For example, the plurality of second sample subsets may include second sample subset 1 and second sample subset 2. The weight of the loss sub-function $L_{real\_1}$ corresponding to second sample subset 1 is denoted as $\alpha_{real\_1}$, and the weight of the loss sub-function $L_{real\_2}$ corresponding to second sample subset 2 is denoted as $\alpha_{real\_2}$; both $\alpha_{real\_1}$ and $\alpha_{real\_2}$ are greater than the weight $\alpha_M$ of the loss sub-function $L_{pre}$ corresponding to the first sample subset.
Optionally, in the third loss function, the weight of the loss sub-function corresponding to each second sample subset may be configured in combination with the actual application scenario. In some examples, the weight of the loss sub-function corresponding to each second sample subset may be determined based on the number of samples in that second sample subset; for example, the weight of the loss sub-function corresponding to a second sample subset with a large number of samples is smaller than the weight of the loss sub-function corresponding to a second sample subset with a small number of samples.
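The paragraph above leaves the per-subset weights of the third loss function to be configured; the sketch below shows one possible scheme in which a second sample subset with more samples receives a smaller weight, again reusing first_loss. The inverse-count scaling is an assumption, not something specified by the application.

```python
def third_loss(pre_image_features, real_subsets, text_features,
               alpha_m: float = 0.2, tau: float = 0.07):
    # real_subsets: list of (B_N, d) tensors, one per second sample subset.
    counts = torch.tensor([s.size(0) for s in real_subsets], dtype=torch.float32)
    # Weight each second subset inversely to its sample count (larger subset -> smaller
    # weight), scaled so that the weights of all second subsets sum to 1 - alpha_m.
    weights = 1.0 / counts
    weights = weights / weights.sum() * (1.0 - alpha_m)
    loss = alpha_m * first_loss(pre_image_features, text_features, tau)
    for w, subset in zip(weights, real_subsets):
        loss = loss + w * first_loss(subset, text_features, tau)
    return loss
```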
Fig. 6 illustrates an exemplary training process for an image detection model. The electronic device may obtain a training sample set, wherein the training sample set may include a first sample subset and a second sample subset. Alternatively, the first sample subset may be the pre-training sample set $I_{pre}$ in the related art, and the second sample subset may be the real training set $I_{real}$ in the related art.
The electronic device may find the target object using a region proposal network (Region Proposal Network, RPN). For example, in the aforementioned kitchen scenario, the target object may be a person. The region of interest found by the RPN is a region containing the target object. The electronic device may generate image features of the region of interest of each sample image using the image encoder.
The electronic device can generate an expanded text for each label based on the labels of the samples in the training sample set. Optionally, the electronic device may further generate the expanded text of each label by combining the acquisition mode of the samples in the first sample subset, the acquisition mode of the samples in the second sample subset, and the scene to which the samples in the second sample subset belong. The electronic device may generate text features for each text using a text encoder.
For the image features of each sample image, the electronic device may calculate the similarity of each image feature to each text feature, respectively, and determine a loss value based on a preset loss function, so as to adjust the network parameters of the training model.
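Putting the components of FIG. 6 together, one training step could look roughly as follows; the rpn, image_encoder and optimizer objects are placeholders whose interfaces are assumptions, and the step simply combines the region features with the pre-computed text features through second_loss from the earlier sketch.

```python
def train_step(images, subset_ids, text_features, rpn, image_encoder, optimizer,
               alpha_m: float = 0.3, alpha_q: float = 0.7, tau: float = 0.07):
    # subset_ids: tensor with 0 for the first sample subset (I_pre) and non-zero
    # values for the second sample subset(s) (I_real).
    region_features = []
    for img in images:
        rois = rpn(img)                                   # region(s) likely to contain the target object
        region_features.append(image_encoder(img, rois))  # one region feature per sample image here
    region_features = torch.stack(region_features)        # (B, d)

    pre_feats = region_features[subset_ids == 0]
    real_feats = region_features[subset_ids != 0]
    loss = second_loss(pre_feats, real_feats, text_features, alpha_m, alpha_q, tau)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```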
In some examples, the training sample set may include a first sample subset pre and a second sample subset real_1. The electronic device may be configured with a fourth loss function: $L = 0.3 \cdot L_{pre} + 0.7 \cdot L_{real}$.
In the fourth loss function, $L_{pre} = \sum_{a} -\log p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ and $L_{real} = \sum_{b} -\log p\left(F^{real\_1}_{im\_b}, F^{real\_1}_{tem\_b}\right)$. The weight of the loss sub-function corresponding to the first sample subset is 0.3, and the weight of the loss sub-function corresponding to the second sample subset real_1 is 0.7.
Wherein $F^{pre}_{im\_a}$ characterizes the image feature of the a-th sample image in the first sample subset pre, and $F^{real\_1}_{im\_b}$ characterizes the image feature of the b-th sample image in the second sample subset real_1. $F^{pre}_{tem\_a}$ characterizes the target text feature corresponding to the image feature of the a-th sample image in the first sample subset pre, namely, among the similarities between the image feature of the a-th sample image in the first sample subset pre and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the a-th sample image in the first sample subset pre is the largest.
$F^{real\_1}_{tem\_b}$ characterizes the target text feature corresponding to the image feature of the b-th sample image in the second sample subset real_1, namely, among the similarities between the image feature of the b-th sample image in the second sample subset real_1 and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the b-th sample image in the second sample subset real_1 is the largest.
$p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$ characterizes the loss parameter of $F^{pre}_{im\_a}$, and $p\left(F^{real\_1}_{im\_b}, F^{real\_1}_{tem\_b}\right)$ characterizes the loss parameter of $F^{real\_1}_{im\_b}$; wherein $p\left(F_{im}, F_{tem}\right) = \dfrac{\exp\left(S\left(F_{im}, F_{tem}\right)/\tau\right)}{\sum_{k}\exp\left(S\left(F_{im}, F_{te\_k}\right)/\tau\right)}$, $F_{te\_k}$ characterizes the k-th text feature among the text features of the plurality of texts, $S$ characterizes the similarity, and $\tau$ is a constant.
In some examples, the training sample set may include a first sample subset pre, a second sample subset real_1, and a second sample subset real_2. The electronic device may be configured with a fifth loss function:
$L = 0.2 \cdot L_{pre} + 0.4 \cdot L_{real\_1} + 0.4 \cdot L_{real\_2}$
In the fifth loss function, $L_{pre} = \sum_{a} -\log p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$, $L_{real\_1} = \sum_{b} -\log p\left(F^{real\_1}_{im\_b}, F^{real\_1}_{tem\_b}\right)$, $L_{real\_2} = \sum_{c} -\log p\left(F^{real\_2}_{im\_c}, F^{real\_2}_{tem\_c}\right)$, and $NT = 2$. The weight of the loss sub-function corresponding to the first sample subset is 0.2, and the weight of the loss sub-function corresponding to each second sample subset is 0.4.
Wherein $F^{pre}_{im\_a}$ characterizes the image feature of the a-th sample image in the first sample subset pre, $F^{real\_1}_{im\_b}$ characterizes the image feature of the b-th sample image in the second sample subset real_1, and $F^{real\_2}_{im\_c}$ characterizes the image feature of the c-th sample image in the second sample subset real_2.
$F^{pre}_{tem\_a}$ characterizes the target text feature corresponding to the image feature of the a-th sample image in the first sample subset pre, namely, among the similarities between the image feature of the a-th sample image in the first sample subset pre and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the a-th sample image in the first sample subset pre is the largest.
$F^{real\_1}_{tem\_b}$ characterizes the target text feature corresponding to the image feature of the b-th sample image in the second sample subset real_1, namely, among the similarities between the image feature of the b-th sample image in the second sample subset real_1 and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the b-th sample image in the second sample subset real_1 is the largest.
$F^{real\_2}_{tem\_c}$ characterizes the target text feature corresponding to the image feature of the c-th sample image in the second sample subset real_2, namely, among the similarities between the image feature of the c-th sample image in the second sample subset real_2 and the text features of the plurality of texts, the similarity between this target text feature and the image feature of the c-th sample image in the second sample subset real_2 is the largest.
$p\left(F^{pre}_{im\_a}, F^{pre}_{tem\_a}\right)$, $p\left(F^{real\_1}_{im\_b}, F^{real\_1}_{tem\_b}\right)$ and $p\left(F^{real\_2}_{im\_c}, F^{real\_2}_{tem\_c}\right)$ characterize the loss parameters of $F^{pre}_{im\_a}$, $F^{real\_1}_{im\_b}$ and $F^{real\_2}_{im\_c}$, respectively; wherein $p\left(F_{im}, F_{tem}\right) = \dfrac{\exp\left(S\left(F_{im}, F_{tem}\right)/\tau\right)}{\sum_{k}\exp\left(S\left(F_{im}, F_{te\_k}\right)/\tau\right)}$, $F_{te\_k}$ characterizes the k-th text feature among the text features of the plurality of texts, $S$ characterizes the similarity, and $\tau$ is a constant.
Fig. 7 illustrates an exemplary image detection method that may be performed by a processor or an electronic device. The method may include:
s701, acquiring an image to be detected.
The image detection method provided by the application is described by taking execution by a processor as an example. In some possible cases, a pre-trained image detection model is configured in the processor. Alternatively, the training process of the image detection model may be performed in another electronic device, and the processor stores the trained image detection model. Or the training process of the image detection model is performed in the processor, and after training the image detection model, the processor may perform detection tasks by using the image detection model.
S702, if the image to be detected comprises a target object, extracting image features of a region comprising the target object.
In specific implementation, the processor may acquire the region of interest, i.e., the region comprising the target object, using the RPN technique, and take the features of the region as the image features of the image to be detected.
S703, determining the similarity of the image feature and the text feature of each text in a plurality of texts, wherein the texts are generated in advance and correspond to the object categories one by one, and each text comprises the corresponding object category.
In the implementation, the plurality of texts pre-generated by the electronic device are respectively expanded texts of a plurality of object categories. Each text may include a corresponding object category. The electronic device may extract text features for each text.
And S704, determining the object category included in the target text as the object category of the target object, wherein the similarity between the image feature and the text feature of the target text is the largest of the determined multiple similarities corresponding to the image feature.
In specific implementation, the electronic device may determine, as the target text, the text corresponding to the text feature having the maximum similarity to the image feature of the image to be detected. The electronic device then determines the object category contained in the target text as the object category of the image to be detected, so as to realize detection of the object category corresponding to the image to be detected.
In some examples, the number of characters of the text generated by the electronic device is greater than or equal to a preset character threshold. For example, the preset character threshold may be 16 characters. Alternatively, the number of characters of the text generated by the electronic device may be greater than or equal to 128 characters.
In some examples, the image to be detected is acquired by an image acquisition device in a kitchen scene, and the target object may be a person. The plurality of object categories includes: wearing a chef's hat and wearing a mask, wearing a chef's hat and not wearing a mask, not wearing a chef's hat and wearing a mask, and not wearing a chef's hat and not wearing a mask.
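For illustration, the detection flow S701–S704 can be sketched as follows, reusing the imports from the training sketches; the trained rpn and image_encoder, the pre-computed text_features and the categories list are assumed to come from the training stage, and their interfaces are hypothetical.

```python
@torch.no_grad()
def detect(image, rpn, image_encoder, text_features, categories):
    rois = rpn(image)                                        # region(s) containing the target object
    if len(rois) == 0:
        return None                                          # no target object in the image to be detected
    feats = F.normalize(image_encoder(image, rois), dim=-1)  # image features of the region(s)
    sims = feats @ F.normalize(text_features, dim=-1).t()    # similarity to every text feature
    target = sims.argmax(dim=-1)                             # target text: the one with the greatest similarity
    return [categories[k] for k in target.tolist()]          # object category included in the target text
```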
Based on the same technical conception, the embodiment of the application provides a training apparatus, which can achieve the same technical effects as the training method, which are not repeated herein. Referring to fig. 8, the apparatus includes a sample set acquisition module, a text feature generation module, and a model training module. Wherein:
the sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises sample images of a plurality of categories, one sample image corresponds to one category, and the label of the sample image is the category corresponding to the sample image;
the text feature generation module is used for generating a plurality of texts according to the plurality of categories and extracting text features of each text, wherein the plurality of texts are in one-to-one correspondence with the plurality of categories, and each text comprises the category corresponding to each text;
the model training module is used for executing the following operation on each image in the training sample set, and determining the loss parameter of each image: inputting each image into a network model to be trained, and extracting first image features of a region of each image containing a target object; determining a similarity of the first image feature to a text feature of each of the plurality of texts; determining a loss parameter of each image based on the determined multiple similarities corresponding to the first image features; and adjusting parameters of the network model to be trained based on the loss parameters of all images in the training sample set and a preset loss function.
In a possible implementation manner, the text feature generating module is specifically configured to generate each text according to at least one extended field and a category corresponding to each text, where the extended field includes one or more of the following:
the acquisition mode of the images in the training sample set and the scene to which the images in the training sample set belong.
In a possible embodiment, the training sample set includes: a plurality of sample subsets; the plurality of sample subsets includes a first sample subset and at least one second sample subset;
the sample images in the first sample subset are acquired in a different manner from the sample images in the second sample subset;
and the total number of sample images in the first sample subset is greater than the total number of sample images in each of the at least one second sample subset.
In a possible implementation manner, in the training method of an image detection model provided by the embodiment of the present application, in the loss function, a weight of a loss sub-function corresponding to the first sample subset is smaller than a weight of a loss sub-function corresponding to the at least one second sample subset.
In a possible implementation manner, in the training method of an image detection model provided by the embodiment of the present application, the plurality of sample subsets includes a plurality of second sample subsets, where the scene to which the images in any one of the second sample subsets belong is different from the scene to which the images in any other one of the plurality of second sample subsets belong.
In a possible implementation manner, in the training method of the image detection model provided by the embodiment of the present application, the loss function includes a loss sub-function corresponding to the first sample subset, and a loss sub-function corresponding to each second sample subset in the plurality of second sample subsets;
wherein the weight of the loss sub-function corresponding to the first sample subset is less than the minimum of the weights of the loss sub-functions corresponding to the plurality of second sample subsets.
Based on the same technical conception, the embodiment of the application provides an image detection apparatus, which can achieve the same technical effects as the image detection method, which are not repeated herein. Referring to fig. 9, the apparatus includes an image acquisition module and an image detection module. Wherein:
the image acquisition module is used for acquiring an image to be detected;
the image detection module is used for extracting image features of a region comprising the target object if the image to be detected comprises the target object; determining the similarity of the image features and text features of each text in a plurality of texts, wherein the texts are pre-generated and correspond to a plurality of object categories one by one, and each text comprises a corresponding object category; and determining the object category included in the target text as the object category of the target object, wherein the similarity between the image feature and the text feature of the target text is the largest of the determined multiple similarities corresponding to the image feature.
In a possible implementation manner, the number of characters of each text is greater than a preset threshold number of characters.
In a possible implementation manner, the image to be detected is acquired by an image acquisition device in a kitchen scene;
the plurality of object categories includes:
chef hat and mask, chef hat and mask.
Based on the same technical concept, the embodiment of the present application provides a first electronic device, which can execute the training method provided in the above embodiment and can achieve the same technical effects, and is not described herein again.
Referring to fig. 10, the first electronic device comprises a processor 1001, a memory 1002 and a communication interface 1003, where the processor 1001, the memory 1002 and the communication interface 1003 are connected through a bus 1004. The communication interface 1003 is used for communicating with other electronic devices, including but not limited to exchanging sample images; the memory 1002 stores a computer program, and the processor 1001 executes the steps of the training method in the above embodiment according to the computer program.
The processor referred to in fig. 10 of the embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), a general purpose processor, a graphics processing unit (Graphics Processing Unit, GPU), a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
Based on the same technical concept, the embodiment of the present application provides a second electronic device, which can execute the training method provided in the above embodiment and can achieve the same technical effects, and will not be described herein again.
Referring to fig. 11, the second electronic device includes a processor 1101, a memory 1102, and a communication interface 1103, where the processor 1101, the memory 1102, and the communication interface 1103 are connected through a bus 1104. The communication interface 1103 is used for communicating with other electronic devices, including but not limited to exchanging images to be detected; the memory 1102 stores a computer program, and the processor 1101 executes the steps of the detection method in the above embodiment according to the computer program.
Optionally, the second electronic device and the first electronic device are the same device. Alternatively, the second electronic device may be a different device from the first electronic device; in this case, the second electronic device may store the image detection model trained by the first electronic device, and may interact with the first electronic device to obtain relevant parameters of the image detection model.
The processor referred to in fig. 11 of the embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), a general purpose processor, a graphics processing unit (Graphics Processing Unit, GPU), a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
Furthermore, the present application provides a computer-readable storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the training method or the image detection method provided by any one of the embodiments of the present application.
An embodiment of the present application provides a computer program product, which includes a computer program, where the computer program when executed by a computer implements the steps of the training method or the image detection method provided in any one of the foregoing embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (13)

1. A method of training an image detection model, the method comprising:
acquiring a training sample set, wherein the training sample set comprises sample images of a plurality of categories, one sample image corresponds to one category, and the label of the sample image is the category corresponding to the sample image;
generating a plurality of texts according to the plurality of categories, and extracting text features of each text, wherein the plurality of texts are in one-to-one correspondence with the plurality of categories, and each text comprises the category corresponding to each text;
for each image in the training sample set, performing the following operations to determine a loss parameter for each image: inputting each image into a network model to be trained, and extracting first image features of a region of each image containing a target object; determining a similarity of the first image feature to a text feature of each of the plurality of texts; determining a loss parameter of each image based on the determined multiple similarities corresponding to the first image features;
and adjusting parameters of the network model to be trained based on the loss parameters of all images in the training sample set and a preset loss function.
2. The method of claim 1, wherein the generating a plurality of text from the plurality of categories comprises:
generating each text according to at least one extended field and the category corresponding to each text, wherein the extended field comprises one or more of the following:
the acquisition mode of the images in the training sample set and the scene to which the images in the training sample set belong.
3. The method of claim 1 or 2, wherein the training sample set comprises: a plurality of sample subsets; the plurality of sample subsets includes a first sample subset and at least one second sample subset;
the sample images in the first sample subset are acquired in a different manner from the sample images in the second sample subset;
and the total number of sample images in the first sample subset is greater than the total number of sample images in each of the at least one second sample subset.
4. The method according to claim 3, wherein, in the loss function, the weight of the loss sub-function corresponding to the first sample subset is smaller than the weight of the loss sub-function corresponding to the at least one second sample subset.
5. The method of claim 3, wherein the plurality of sample subsets comprises a plurality of second sample subsets, wherein the scene to which the images in any one of the second sample subsets belong is different from the scene to which the images in any other of the plurality of second sample subsets belong.
6. The method of claim 5, wherein the loss function comprises a loss sub-function corresponding to the first subset of samples and a loss sub-function corresponding to each of the second subset of samples;
wherein the weight of the loss sub-function corresponding to the first sample subset is less than the minimum of the weights of the loss sub-functions corresponding to the plurality of second sample subsets.
7. An image detection method, the method comprising:
acquiring an image to be detected;
if the image to be detected comprises a target object, extracting image features of a region comprising the target object;
determining the similarity of the image features and text features of each text in a plurality of texts, wherein the texts are pre-generated and correspond one-to-one to a plurality of object categories, and each text comprises its corresponding object category;
and determining the object category included in the target text as the object category of the target object, wherein the similarity between the image feature and the text feature of the target text is the largest of the determined multiple similarities corresponding to the image feature.
8. The method of claim 7, wherein the number of characters of each text is greater than a preset threshold number of characters.
9. The method according to claim 7 or 8, wherein the image to be detected is acquired by an image acquisition device in a kitchen scene;
the plurality of object categories includes:
chef hat and mask, chef hat and mask.
10. An electronic device comprising a memory and a processor;
the memory is used for storing program instructions;
the processor is configured to execute the program instructions to implement the method of any one of claims 1-9.
11. A computer readable storage medium comprising computer program instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-9.
12. A training device, comprising:
the sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises sample images of a plurality of categories, one sample image corresponds to one category, and the label of the sample image is the category corresponding to the sample image;
the text feature generation module is used for generating a plurality of texts according to the plurality of categories and extracting text features of each text, wherein the plurality of texts are in one-to-one correspondence with the plurality of categories, and each text comprises the category corresponding to each text;
the model training module is used for performing the following operations for each image in the training sample set to determine the loss parameter of each image: inputting each image into a network model to be trained, and extracting first image features of a region of each image containing a target object; determining a similarity of the first image feature to a text feature of each of the plurality of texts; determining a loss parameter of each image based on the determined multiple similarities corresponding to the first image features; and adjusting parameters of the network model to be trained based on the loss parameters of all images in the training sample set and a preset loss function.
13. An image detection apparatus, comprising:
the image acquisition module is used for acquiring an image to be detected;
the image detection module is used for extracting image features of a region comprising the target object if the image to be detected comprises the target object; determining the similarity of the image features and text features of each text in a plurality of texts, wherein the texts are pre-generated and correspond one-to-one to a plurality of object categories, and each text comprises its corresponding object category; and determining the object category included in the target text as the object category of the target object, wherein the similarity between the image feature and the text feature of the target text is the largest of the determined multiple similarities corresponding to the image feature.
CN202310974673.6A 2023-08-03 2023-08-03 Training method of image detection model, image detection method and related device Pending CN117173501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310974673.6A CN117173501A (en) 2023-08-03 2023-08-03 Training method of image detection model, image detection method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310974673.6A CN117173501A (en) 2023-08-03 2023-08-03 Training method of image detection model, image detection method and related device

Publications (1)

Publication Number Publication Date
CN117173501A true CN117173501A (en) 2023-12-05

Family

ID=88936581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310974673.6A Pending CN117173501A (en) 2023-08-03 2023-08-03 Training method of image detection model, image detection method and related device

Country Status (1)

Country Link
CN (1) CN117173501A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726908A (en) * 2024-02-07 2024-03-19 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device
CN117726908B (en) * 2024-02-07 2024-05-24 青岛海尔科技有限公司 Training method and device for picture generation model, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination