WO2023273570A1 - Target detection model training method and target detection method, and related device therefor - Google Patents

Info

Publication number
WO2023273570A1
WO2023273570A1 (PCT/CN2022/089194, CN2022089194W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
sample image
target detection
detection model
Prior art date
Application number
PCT/CN2022/089194
Other languages
French (fr)
Chinese (zh)
Inventor
江毅
杨朔
孙培泽
袁泽寰
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023273570A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present application relates to the technical field of image processing, and in particular to a target detection model training method, a target detection method and related equipment.
  • Target detection, also known as target extraction, is an image segmentation technology based on target geometric statistics and features. Target detection has a wide range of applications (for example, it can be applied to fields such as robotics or automatic driving).
  • the present application provides a target detection model training method, a target detection method and related equipment, which can effectively improve the accuracy of target detection.
  • An embodiment of the present application provides a method for training a target detection model, the method comprising:
  • the performing text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image includes:
  • the method further includes:
  • after acquiring the added image, the actual target text identifier of the added image, and the actual target position of the added image, perform text feature extraction on the actual target text identifier of the added image to obtain the target text feature of the added image; the actual target text identifier of the added image is different from the actual target text identifier of the sample image;
  • update the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image; and continue to execute the step of inputting the historical sample image and the newly added image into the target detection model until a second stop condition is reached.
  • the process of determining the historical sample image includes:
  • determine, from the training-used images corresponding to the target detection model, the training-used images belonging to each historical target category;
  • extract the historical sample images corresponding to each historical target category from the training-used images belonging to that category.
  • updating the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image includes:
  • weighting a historical image loss value and a newly added image loss value, wherein the weight corresponding to the historical image loss value is higher than the weight corresponding to the newly added image loss value;
  • the target detection model is updated according to the detection loss value of the target detection model.
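The weighting described above can be sketched as follows. The 0.7/0.3 split is an illustrative assumption; the text only requires that the historical-image loss carry the higher weight, which biases updates toward preserving previously learned categories:

```python
def detection_loss(historical_loss, new_image_loss, w_hist=0.7, w_new=0.3):
    """Combine the historical-image loss and the newly-added-image loss.

    Per the claim, the weight on the historical loss exceeds the weight on
    the newly added image loss (the 0.7/0.3 values are only an assumption).
    """
    assert w_hist > w_new, "historical weight must dominate"
    return w_hist * historical_loss + w_new * new_image_loss
```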
  • the inputting the sample image into a target detection model, and obtaining the image features of the sample image output by the target detection model and the predicted target position of the sample image include:
  • update the target detection model according to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image.
  • the embodiment of the present application also provides a target detection method, the method comprising:
  • wherein the target detection model is trained by using any implementation of the target detection model training method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a target detection model training device, the device comprising:
  • a first acquiring unit configured to acquire a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image;
  • a first extraction unit configured to perform text feature extraction on the actual target text identifier of the sample image to obtain the target text features of the sample image;
  • a first prediction unit configured to input the sample image into a target detection model, and obtain the image features of the sample image output by the target detection model and the predicted target position of the sample image;
  • a first update unit configured to update the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and to return to the first prediction unit to execute the inputting of the sample image into the target detection model until a first stop condition is reached.
  • the embodiment of the present application also provides a target detection device, the device comprising:
  • a second acquiring unit configured to acquire an image to be detected
  • a target detection unit configured to input the image to be detected into a pre-trained target detection model, and obtain the target detection result of the image to be detected output by the target detection model; wherein the target detection model is trained by using any implementation of the target detection model training method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a device, the device includes a processor and a memory:
  • the memory is used to store a computer program;
  • the processor is configured to execute any implementation of the target detection model training method provided in the embodiments of the present application according to the computer program, or execute any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer-readable storage medium, which is used to store a computer program, and the computer program is used to execute any implementation of the target detection model training method provided in the embodiments of the present application, or to execute any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product.
  • the terminal device executes any implementation of the target detection model training method provided in the embodiments of the present application, or executes any implementation of the target detection method provided in the embodiments of this application.
  • the embodiment of the present application has at least the following advantages:
  • text feature extraction is first performed on the actual target text identifier of the sample image to obtain the target text feature of the sample image; then the sample image, the target text feature of the sample image, and the actual target position of the sample image are used to train the target detection model, so that the target detection model can perform target detection learning under the constraints of the target text features of the sample image and the actual target position of the sample image. The trained target detection model thereby has better target detection performance, and can be used to perform more accurate target detection on the image to be detected, obtaining and outputting a more accurate target detection result for the image to be detected, which is conducive to improving the accuracy of target detection.
  • FIG. 1 is a flow chart of a method for training a target detection model provided in an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a target detection model provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a target detection method provided in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a target detection model training device provided in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a target detection device provided by an embodiment of the present application.
  • the following first introduces the training process of the target detection model (that is, the target detection model training method), and then introduces the application process of the target detection model (that is, the target detection method).
  • this figure is a flow chart of a method for training a target detection model provided by an embodiment of the present application.
  • the target detection model training method provided in the embodiment of the present application includes S101-S105:
  • S101 Acquire a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image.
  • the sample image refers to the image used for training the target detection model.
  • the embodiment of the present application does not limit the number of sample images, for example, the number of sample images may be N (that is, use N sample images to train the target detection model).
  • the actual target text identifier of the sample image is used to uniquely represent the target object in the sample image.
  • the embodiment of the present application does not limit the actual target text identifier of the sample image; for example, it may be an object category (or an object name, etc.). For instance, if the sample image includes a cat, the actual target text identifier of the sample image may be "cat".
  • the actual target position of the sample image is used to represent the area actually occupied by the target object in the sample image in the sample image.
  • the present application does not limit the representation of the actual target position of the sample image, and any existing or future representation that can represent the area occupied by an object in the image can be used for implementation.
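The position representation is left open above. As an illustrative sketch (an assumption, not mandated by the text), an axis-aligned bounding box is a common choice, and intersection-over-union (IoU) is a common way to compare a predicted position against an actual one:

```python
def make_box(x_min, y_min, x_max, y_max):
    # Axis-aligned bounding box; a common (assumed) position representation.
    assert x_min < x_max and y_min < y_max
    return {"x_min": x_min, "y_min": y_min, "x_max": x_max, "y_max": y_max}

def box_iou(a, b):
    # Intersection-over-union: 1.0 for identical boxes, 0.0 for disjoint ones.
    ix = max(0.0, min(a["x_max"], b["x_max"]) - max(a["x_min"], b["x_min"]))
    iy = max(0.0, min(a["y_max"], b["y_max"]) - max(a["y_min"], b["y_min"]))
    inter = ix * iy
    def area(t):
        return (t["x_max"] - t["x_min"]) * (t["y_max"] - t["y_min"])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```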
  • S102 Perform text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image.
  • the target text feature of the sample image is used to describe the text information (such as semantic information) carried by the actual target text identifier of the sample image, so that the target text feature of the sample image can represent the features that the target object in the sample image actually presents in the sample image.
  • the embodiment of the present application does not limit the method of extracting the target text features of the sample image (that is, the implementation of S102), and any existing or future method that can perform feature extraction on a text can be used for implementation.
  • the following description will be given in combination with examples.
  • S102 may specifically include: inputting the actual target text identifier of the sample image into a pre-trained language model, and obtaining the target text feature of the sample image output by the language model.
  • the language model is used for text feature extraction; and the embodiment of the present application does not limit the language model, and any existing or future language model can be used for implementation.
  • the language model can be trained in advance according to the sample text and the actual text features of the sample text.
  • the sample text refers to the text required for training the language model; and the actual text features of the sample text are used to describe the text information actually carried by the sample text (such as semantic information, etc.).
  • the embodiment of the present application does not limit the training process of the language model, and any existing or future method that can train the language model according to the sample text and the actual text features of the sample text can be used for implementation.
  • the pre-trained language model can be used to perform text feature extraction on the actual target text identifier of the i-th sample image, obtaining and outputting the target text feature of the i-th sample image, so that this target text feature can accurately represent the text information carried by the actual target text identifier of the i-th sample image and can then be used to constrain the training update process of the target detection model.
  • because the pre-trained language model can accurately extract the text information (especially semantic information) carried by a text, and the number of texts the language model can describe is unlimited, the text features the language model outputs for different texts are highly separable from one another. This effectively ensures that the text features of any two texts (for example, any two of the target text features of the N sample images) do not overlap, which can effectively improve the detection accuracy of the target detection model.
  • the language model can learn the semantic correlation between different texts during the training process (for example, the semantic correlation between "cat" and "tiger" is higher than that between "cat" and "car"), so that the trained language model can better extract text features, which can effectively improve the detection accuracy of the target detection model.
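This separability claim can be illustrated with cosine similarity over toy text features. The three vectors below are hypothetical stand-ins for language-model outputs, chosen only to mimic the "cat"/"tiger"/"car" example:

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings standing in for language-model text features.
features = {
    "cat":   [0.9, 0.8, 0.1],
    "tiger": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.1, 0.95],
}
# Semantically related texts end up closer than unrelated ones.
cat_tiger = cosine_similarity(features["cat"], features["tiger"])
cat_car = cosine_similarity(features["cat"], features["car"])
```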
  • S103 Input the sample image into the target detection model, and obtain the image features of the sample image and the predicted target position of the sample image output by the target detection model.
  • the image feature of the sample image is used to represent the feature that the target object in the sample image is predicted to appear in the sample image.
  • the predicted target position of the sample image is used to represent the predicted area occupied by the target object in the sample image in the sample image.
  • the target detection model is used for target detection (for example, to detect the category of the target object and the image position of the target object).
  • the embodiment of the present application does not limit the target detection model. For example, as shown in FIG. 2, the target detection model 200 may include an image feature extraction layer 201, a target category prediction layer 202, and a target position prediction layer 203.
  • the input data of the target category prediction layer 202 includes the output data of the image feature extraction layer 201;
  • the input data of the target position prediction layer 203 includes the output data of the image feature extraction layer 201.
  • the working process of the target detection model 200 may include step 11-step 13:
  • Step 11 Input the sample image into the image feature extraction layer 201, and obtain the image features of the sample image output by the image feature extraction layer 201.
  • the image feature extraction layer 201 is used for performing image feature extraction on the input data of the image feature extraction layer 201 .
  • the embodiment of the present application does not limit the implementation manner of the image feature extraction layer 201, and any existing or future solution capable of image feature extraction can be used for implementation.
  • Step 12 Input the image features of the sample image into the target category prediction layer 202 to obtain the predicted target text identifier of the sample image output by the target category prediction layer 202 .
  • the object type prediction layer 202 is used for performing object type prediction on the input data of the object type prediction layer 202 .
  • the embodiment of the present application does not limit the implementation manner of the object category prediction layer 202, and any existing or future solution capable of performing object category prediction can be used for implementation.
  • the predicted target text identifier of the sample image is used to represent the predicted identifier (eg, predicted category) of the target object in the sample image.
  • Step 13 Input the image features of the sample image into the target position prediction layer 203 to obtain the predicted target position of the sample image output by the target position prediction layer 203 .
  • the target position prediction layer 203 is used for performing object position prediction on the input data of the target position prediction layer 203 .
  • the embodiment of the present application does not limit the implementation of the object position prediction layer 203, and any existing or future solution capable of predicting object positions can be used for implementation.
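Steps 11 to 13 can be sketched as a shared image feature extraction layer (201) whose output feeds both a category prediction head (202) and a position prediction head (203). The random linear maps and the dimensions below are illustrative assumptions; the real layers would be learned deep networks:

```python
import random

class TinyDetector:
    """Sketch of the FIG. 2 structure: shared feature layer, two heads."""

    def __init__(self, in_dim=8, feat_dim=4, n_categories=3, seed=0):
        rng = random.Random(seed)
        def mk(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
        self.w_feat = mk(feat_dim, in_dim)        # image feature extraction layer 201
        self.w_cat = mk(n_categories, feat_dim)   # target category prediction layer 202
        self.w_box = mk(4, feat_dim)              # target position prediction layer 203

    @staticmethod
    def _matvec(w, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    def forward(self, image_vec):
        feat = self._matvec(self.w_feat, image_vec)    # step 11: image features
        scores = self._matvec(self.w_cat, feat)        # step 12: category scores
        box = self._matvec(self.w_box, feat)           # step 13: predicted position
        return feat, scores, box
```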
  • based on the relevant content of steps 11 to 13 above, it can be seen that for the target detection model 200 shown in FIG. 2, after the sample image is input, the image feature extraction layer 201, the target category prediction layer 202, and the target position prediction layer 203 respectively generate and output the image features of the sample image, the predicted target text identifier of the sample image, and the predicted target position of the sample image, so that the target detection performance of the target detection model 200 can subsequently be determined based on this prediction information.
  • the data dimension of the image feature of the sample image output by the image feature extraction layer 201 may be inconsistent with the data dimension of the target text feature of the sample image. Therefore, to ensure that the similarity between the image features of the sample image and the target text features of the sample image can be successfully calculated, a data dimension transformation layer can be added to the target detection model 200 shown in FIG. 2.
  • the input data of the data dimension transformation layer includes the output data of the image feature extraction layer 201, so that the data dimension transformation layer can perform data dimension transformation on the output data of the image feature extraction layer 201 (such as the image features of the sample image). The output data of the data dimension transformation layer is then consistent with the data dimension of the target text feature of the sample image, which is beneficial to improving the accuracy of the similarity calculation between the image feature of the sample image and the target text feature of the sample image.
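A sketch of such a data dimension transformation layer. The dimensions (256-d image features projected into a 64-d text-feature space) are hypothetical; the point is only that a single linear map makes the two feature dimensions comparable:

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    # Random linear map standing in for a learned dimension transformation layer.
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    def project(x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return project

# Hypothetical dimensions: 256-d image feature -> 64-d text-feature space.
project = make_projection(256, 64)
image_feature = [0.5] * 256
text_aligned = project(image_feature)
```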
  • the i-th sample image can be input into the target detection model, so that the target detection model performs target detection processing on the i-th sample image, obtaining and outputting the image features of the i-th sample image and the predicted target position of the i-th sample image, so that the target detection performance of the target detection model can subsequently be determined based on the image features of each sample image and its predicted target position.
  • S104 Determine whether the first stop condition is met, if yes, perform a preset action; if not, perform S105.
  • the first stop condition may be preset, and the embodiment of the present application does not limit it. For example, the first stop condition may be that the predicted loss value of the target detection model is lower than a first preset loss threshold, that the rate of change of the predicted loss value of the target detection model is lower than a first rate-of-change threshold, or that the number of updates of the target detection model reaches a first threshold.
  • the predicted loss value of the target detection model is used to represent the target detection performance of the target detection model for the above N sample images; the embodiment of the present application does not limit its calculation method, and any existing or future model prediction loss value calculation method can be used for implementation.
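The three example stop conditions can be folded into one check. All threshold values below are illustrative assumptions, not values from the text:

```python
def first_stop_condition(loss, prev_loss, n_updates,
                         loss_threshold=0.01,
                         rate_threshold=1e-4,
                         max_updates=10000):
    """True if any of the three example conditions from the text holds:
    low loss, a stalled (slowly changing) loss, or an update budget hit."""
    if loss < loss_threshold:
        return True
    if prev_loss is not None:
        rate = abs(prev_loss - loss) / max(abs(prev_loss), 1e-12)
        if rate < rate_threshold:
            return True
    return n_updates >= max_updates
```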
  • Preset actions can be preset.
  • the preset action may be to end the training process of the target detection model (that is, to end the target detection learning process of the target detection model for N sample images).
  • the preset actions may include the following S106-S109.
  • for the target detection model of the current round, it can be judged whether the target detection model of the current round meets the first stop condition;
  • if the first stop condition is met, the target detection model of the current round has better target detection performance for the N sample images, so the target detection model of the current round can be saved, so that subsequent work can be performed using the saved target detection model (for example, performing target detection work, or adding a new object detection function to the target detection model);
  • if the first stop condition is not met, the target detection performance of the current round of the target detection model for the above N sample images is still relatively poor, so the target detection model can be updated according to the label information corresponding to the N sample images and the prediction information output by the current round of the target detection model for the N sample images.
  • S105 Update the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and return to execute S103.
  • the similarity between the image feature of the sample image and the target text feature of the sample image is used to represent how close the image feature of the sample image is to the target text feature of the sample image.
  • the embodiment of the present application does not limit the calculation method of the similarity between the image feature of the sample image and the target text feature of the sample image, for example, the Euclidean distance may be used for calculation.
  • the training objectives of the target detection model may include that the predicted target position of the sample image is as close as possible to the actual target position of the sample image, and the image features of the sample image are as close as possible to the target text features of the sample image (also That is, the similarity between the image feature of the sample image and the target text feature of the sample image is as large as possible).
  • the target detection model of the current round can first be updated according to the gap between the predicted target position of the i-th sample image and the actual target position of the i-th sample image, and the similarity between the image features of the i-th sample image and the target text features of the i-th sample image, so that the updated target detection model has better target detection performance, after which S103 and its subsequent steps can continue to be performed.
  • wherein i is a positive integer, i ≤ N, and N is a positive integer.
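As a heavily simplified illustration of the S103-S105 loop, the sketch below treats the "model" as a single scalar weight and optimizes only the position term (the full method also maximizes the image-feature/text-feature similarity); it shows the predict / check-stop / update / return-to-S103 shape of the procedure:

```python
def train(samples, lr=0.1, loss_threshold=1e-6, max_rounds=1000):
    # samples: list of (image_value, actual_target_position) pairs.
    w = 0.0  # stand-in for all model parameters
    for _ in range(max_rounds):
        # S103: predict a target position for every sample image.
        preds = [w * x for x, _ in samples]
        loss = sum((p - t) ** 2 for p, (_, t) in zip(preds, samples))
        # S104: first stop condition (here: loss below a threshold).
        if loss < loss_threshold:
            break
        # S105: update the model from the position gap, then return to S103.
        grad = sum(2 * (w * x - t) * x for x, t in samples)
        w -= lr * grad / len(samples)
    return w
```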
  • text feature extraction can be performed on the actual target text identifier of the sample image to obtain the target text feature of the sample image; then the sample image, the target text feature of the sample image, and the actual target position of the sample image are used to train the target detection model to obtain a trained target detection model.
  • because the target text feature of the sample image can more accurately represent the actual target text identifier of the sample image, the target detection model trained under the constraints of the target text feature of the sample image has a better target detection function, which is beneficial to improving target detection performance.
  • the trained target detection model has better target detection performance for the target objects it has learned, so in order to further improve the prediction performance of the target detection model, the trained target detection model can further learn target objects beyond those it has already learned (that is, category incremental learning can be performed for the target detection model).
  • the embodiment of the present application also provides a possible implementation of the target detection model training method.
  • the target detection model training method includes S106-S109 in addition to the above S101-S105:
  • the newly-added image refers to the image required for category incremental learning for the trained target detection model.
  • the embodiment of the present application does not limit the number of added images, for example, the number of added images is M; wherein, M is a positive integer.
  • S106-S109 can be used so that the target detection model further learns how to perform target detection on the M newly added images on the premise of retaining the target objects it has already learned.
  • the actual target text identifier of the added image, the actual target position of the added image, and the target text feature of the added image are analogous to the actual target text identifier of the sample image and the actual target position of the sample image in S101 above, and to the target text feature of the sample image in S102 above; it is only necessary to replace "sample image" with "newly added image" in the relevant content.
  • the trained target detection model may be, for example, a target detection model trained by using the training process shown in S101-S105 above, or a target detection model obtained by carrying out category incremental learning at least once using the training process shown in S106-S109 after the training process shown in S101-S105 is completed.
  • after the newly added image, the actual target text identifier of the newly added image, and the actual target position of the newly added image are acquired, it can be determined that a category incremental learning is needed for the trained target detection model, so text feature extraction can be performed on the actual target text identifier of the newly added image to obtain the target text feature of the newly added image. The target text feature of the newly added image can then be used to constrain the incremental learning process of the target detection model, so that the retrained target detection model can further learn how to perform object detection on the newly added images on the premise of retaining the learned target objects.
  • S107 Input the historical sample image and the newly added image into the target detection model, and obtain the image features of the historical sample image output by the target detection model, the predicted target position of the historical sample image, the image features of the newly added image and The predicted target location for this added image.
  • the historical sample images may include all or part of the images used in the historical training process of the target detection model.
  • the historical training process of the target detection model refers to the category learning process that the target detection model has gone through before the current category incremental learning process. For example, if the trained target detection model has only gone through the category learning process shown in S101-S105 above, the historical training process of the target detection model refers to the training process shown in S101-S105 above. As another example, if the trained target detection model has gone through the category learning process shown in S101-S105 above and the category incremental learning process shown in S106-S109 Q times, then the historical training process of the target detection model may include the training process shown in S101-S105 above and the first through the Q-th training processes shown in S106-S109.
  • the determination process of the historical sample image may include Step 21-Step 24:
  • Step 21 According to the sample image, determine the image used for training corresponding to the target detection model.
  • the training used images corresponding to the target detection model refer to images that have been used in the historical training process of the target detection model.
  • two examples are used for description below.
  • Example 1: if the historical training process of the target detection model includes the training process shown in S101-S105 above, the training-used images corresponding to the target detection model may include the above N sample images.
  • Example 2: if the historical training process of the target detection model includes the training process shown in S101-S105 above and the training processes shown in the first round of S106-S109 through the Qth round of S106-S109, where the qth round of the training process shown in S106-S109 uses G q newly added images for category incremental learning (q is a positive integer, and q ≤ Q), then the training used images corresponding to the target detection model may include the above N sample images, the G 1 newly added images, the G 2 newly added images, ..., and the G Q newly added images.
  • when the trained target detection model needs category incremental learning, the training used images corresponding to the target detection model can first be determined based on the images involved in the historical training process of the target detection model, so that the training used images can accurately represent the images that have been used in the historical learning process of the target detection model.
  • Step 22 Determine at least one historical target category according to the actual target text identifiers of the images used for training.
  • the historical target category refers to the object category that the target detection model has learned during the historical training process of the target detection model.
  • two examples are used for description below.
  • Example 1: if the historical training process of the target detection model includes the training process shown in S101-S105 above, and the N sample images in that training process correspond to R 0 object categories, then the R 0 object categories are all determined as historical object categories.
  • Example 2: if the historical training process of the target detection model includes the training process shown in S101-S105 above and the training processes shown in the first round of S106-S109 through the Qth round of S106-S109, where the N sample images in the training process shown in S101-S105 correspond to R 0 object categories and the G q newly added images in the qth round of the training process shown in S106-S109 correspond to R q object categories (q is a positive integer, and q ≤ Q), then the R 0 object categories, R 1 object categories, R 2 object categories, ..., and R Q object categories can all be determined as historical object categories.
  • there are no repeated object categories among the R 0 object categories, R 1 object categories, R 2 object categories, ..., and R Q object categories; that is, any two of these object categories are different.
  • the actual target text identifiers of the training used images can be used to determine the historical object categories corresponding to the target detection model, so that the historical object categories can accurately represent the object categories that have been learned during the historical learning process of the target detection model.
  • Step 23 According to the actual target text identification of the training used images, determine the training used images belonging to each historical target category from the training used images corresponding to the target detection model.
  • step 23 may specifically include: determining the Y 1 images belonging to the first historical target category among the training used images corresponding to the target detection model as the training used images belonging to the first historical target category; determining the Y 2 images belonging to the second historical target category among the training used images corresponding to the target detection model as the training used images belonging to the second historical target category; ...; and, by analogy, determining the Y M images belonging to the Mth historical target category among the training used images corresponding to the target detection model as the training used images belonging to the Mth historical target category.
  • Step 24 Extract historical sample images corresponding to each historical object category from training images that belong to each historical object category.
  • the extraction may be performed with reference to a preset extraction ratio (or number of extractions, etc.).
  • step 24 may specifically include: randomly extracting, at an extraction ratio of 10%, from the training used images belonging to the first historical target category to obtain the historical sample images corresponding to the first historical target category, so that the actual target text identifiers of the historical sample images corresponding to the first historical target category are all the first historical target category; randomly extracting, at an extraction ratio of 10%, from the training used images belonging to the second historical target category to obtain the historical sample images corresponding to the second historical target category, so that the actual target text identifiers of the historical sample images corresponding to the second historical target category are all the second historical target category; ...; and randomly extracting, at an extraction ratio of 10%, from the training used images belonging to the Mth historical target category to obtain the historical sample images corresponding to the Mth historical target category, so that the actual target text identifiers of the historical sample images corresponding to the Mth historical target category are all the Mth historical target category.
  • in this way, some historical sample images can be extracted from the images involved in the historical training process of the target detection model, so that these historical sample images can represent the object categories that have been learned during the historical learning process of the target detection model.
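Step 21-Step 24 above can be sketched as follows. The data layout (a list of image/category pairs) and the default 10% extraction ratio are illustrative assumptions: the ratio follows the example in step 24, but the embodiment allows any preset ratio or count.

```python
import random
from collections import defaultdict

def select_historical_samples(used_images, extraction_ratio=0.1, seed=0):
    """Group the training used images by their actual target text identifier
    (historical target category), then randomly extract a fixed ratio from
    each category as historical sample images (steps 21-24)."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for image, category in used_images:           # steps 22-23: group by category
        by_category[category].append(image)
    historical_samples = []
    for category, images in by_category.items():  # step 24: per-category extraction
        k = max(1, round(len(images) * extraction_ratio))
        for image in rng.sample(images, k):
            historical_samples.append((image, category))
    return historical_samples
```

Extracting per category (rather than from the pooled set) keeps every historical object category represented, which is the point of steps 23-24.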
  • for the image features of the historical sample images and the predicted target positions of the historical sample images, please refer to the related content of "image features of the sample image" and "predicted target position of the sample image" in S103 above; it is only necessary to replace "sample image" with "historical sample image" in that related content.
  • for the image features of the newly added image and the predicted target position of the newly added image, please refer to the related content of "image features of the sample image" and "predicted target position of the sample image" in S103 above; it is only necessary to replace "sample image" with "newly added image" in that related content.
  • the historical sample images and the newly added images can be respectively input into the target detection model, so that the target detection model can perform target detection on the historical sample images and the newly added images, and obtain and output the image features and predicted target positions of the historical sample images and the image features and predicted target positions of the newly added images, so that the target detection model can subsequently be updated based on this predicted information.
  • the second stop condition may be preset, and this embodiment of the present application does not limit the second stop condition; for example, the second stop condition may be that the detection loss value of the target detection model is lower than a second preset loss threshold, that the rate of change of the detection loss value of the target detection model is lower than a second rate-of-change threshold, or that the number of updates of the target detection model reaches a second threshold.
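The three alternative forms of the second stop condition listed above can be sketched as a single check. All threshold values here are illustrative assumptions, since the embodiment explicitly does not limit them:

```python
def reached_second_stop_condition(loss_history, update_count,
                                  loss_threshold=0.05,
                                  change_rate_threshold=0.01,
                                  max_updates=10000):
    """Return True if any of the three example forms of the second stop
    condition holds: loss below a threshold, loss change rate below a
    threshold, or the update count reaching a cap."""
    if not loss_history:
        return False
    current = loss_history[-1]
    if current < loss_threshold:                      # detection loss low enough
        return True
    if len(loss_history) >= 2 and loss_history[-2] > 0:
        change_rate = abs(loss_history[-2] - current) / loss_history[-2]
        if change_rate < change_rate_threshold:       # loss has plateaued
            return True
    return update_count >= max_updates                # update budget exhausted
```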
  • the detection loss value of the target detection model is used to represent the target detection performance of the target detection model for the historical sample images and the newly added images; this embodiment of the present application does not limit the calculation method of the detection loss value, which can be implemented by using any existing or future model detection loss value calculation method.
  • in one possible implementation, the embodiment of the present application also provides a calculation method for the detection loss value of the target detection model, which may specifically include step 31-step 33:
  • Step 31 Determine the historical image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image feature of the historical sample image and the target text feature of the historical sample image.
  • the historical image loss value refers to the loss value generated when the target detection model performs target detection on the historical sample images, so that the historical image loss value is used to represent the target detection performance of the target detection model on the historical sample images.
  • the embodiment of the present application does not limit the calculation method of the historical image loss value, and any existing or future prediction loss value calculation method may be used for implementation.
  • Step 32 Determine the newly added image loss value according to the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image feature of the newly added image and the target text feature of the newly added image.
  • the newly added image loss value refers to the loss value generated when the target detection model performs target detection for the newly added image, so that the newly added image loss value is used to represent the target detection performance of the target detection model for the newly added image.
  • the embodiment of the present application does not limit the calculation method of the newly added image loss value, and any existing or future prediction loss value calculation method may be used for implementation.
  • Step 33 Perform a weighted summation of the historical image loss value and the newly added image loss value to obtain the detection loss value of the target detection model; wherein, the weighting weight corresponding to the historical image loss value is higher than the weighting weight corresponding to the newly added image loss value.
  • the weighted weight corresponding to the historical image loss value refers to the weight value to be multiplied by the historical image loss value in the "weighted summation" in step 33 .
  • the weighting weights corresponding to the historical image loss values may be preset.
  • the weighted weight corresponding to the newly added image loss value refers to the weight value to be multiplied by the newly added image loss value in the "weighted sum" in step 33 .
  • the weighting weights corresponding to the newly added image loss values may be preset.
  • in this way, the target detection model trained with a higher weighting weight for the historical image loss value can not only achieve accurate target detection for the newly added images corresponding to the target detection model, but also still achieve accurate target detection for the training used images corresponding to the target detection model, which is conducive to improving the accuracy of category incremental learning.
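Step 33 and the weighting constraint above can be sketched as follows. The concrete 0.7/0.3 split is an illustrative assumption: the text only requires that the historical weight exceed the newly added image weight, and the actual weights may be preset to any values satisfying that constraint.

```python
def detection_loss(historical_image_loss, new_image_loss,
                   historical_weight=0.7, new_weight=0.3):
    """Weighted summation of step 33: combine the historical image loss value
    and the newly added image loss value, with the historical weight set
    higher so the model retains its previously learned detection ability."""
    assert historical_weight > new_weight, \
        "the historical weight must exceed the newly added image weight"
    return historical_weight * historical_image_loss + new_weight * new_image_loss
```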
  • the preset steps can be preset.
  • the preset step may be to end the current category incremental learning process of the target detection model.
  • the preset steps may include the above S106-S109 .
  • for the target detection model of the current round, it can be judged whether the target detection model of the current round meets the second stop condition; if the second stop condition is reached, it means that the target detection model of the current round has good target detection performance for both the historical sample images and the newly added images, so the target detection model of the current round can be saved so that the saved target detection model can be used later to perform follow-up work (such as performing target detection or adding a new object detection function to the target detection model); if the second stop condition is not reached, it means that the target detection performance of the target detection model of the current round for the above historical sample images and newly added images is still relatively poor, so the target detection model can be updated based on the label information corresponding to the historical sample images, the label information corresponding to the newly added images, and the prediction information output by the target detection model of the current round for the historical sample images and the newly added images.
  • S109 According to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image feature of the newly added image and the target text feature of the newly added image, update the target detection model, and return to execute S107.
  • the training targets of the target detection model may include: the predicted target position of the historical sample image being as close as possible to the actual target position of the historical sample image; the image feature of the historical sample image being as close as possible to the target text feature of the historical sample image (that is, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image being as large as possible); the predicted target position of the newly added image being as close as possible to the actual target position of the newly added image; and the image feature of the newly added image being as close as possible to the target text feature of the newly added image (that is, the similarity between the image feature of the newly added image and the target text feature of the newly added image being as large as possible).
  • S109 may specifically include S1091-S1094:
  • S1091 Determine the historical image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image feature of the historical sample image and the target text feature of the historical sample image.
  • S1092 Determine the added image loss value according to the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image.
  • S1093 Perform a weighted summation of the historical image loss value and the newly added image loss value to obtain the detection loss value of the target detection model; wherein, the weighting weight corresponding to the historical image loss value is higher than the weighting weight corresponding to the newly added image loss value.
  • S1094 Update the target detection model according to the detection loss value of the target detection model.
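One hedged way to realize the per-image loss values of S1091/S1092 is an L1 position error plus one minus the cosine similarity between the image feature and the target text feature. The text does not fix the concrete loss terms (any existing or future prediction loss calculation may be used), so the function name and both terms below are assumptions for illustration:

```python
import math

def image_loss(predicted_box, actual_box, image_feature, text_feature):
    """Combine a position error (predicted vs. actual target box, L1) with a
    similarity term that shrinks as the image feature approaches the target
    text feature, mirroring the training targets described above."""
    position_loss = sum(abs(p - a) for p, a in zip(predicted_box, actual_box))
    dot = sum(x * y for x, y in zip(image_feature, text_feature))
    norm = math.sqrt(sum(x * x for x in image_feature)) * \
           math.sqrt(sum(y * y for y in text_feature))
    cosine = dot / norm if norm else 0.0
    similarity_loss = 1.0 - cosine  # smaller when features are more similar
    return position_loss + similarity_loss
```

The resulting historical and newly added image loss values would then be combined by the weighted summation of S1093.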
  • it can be seen that, with the target detection model training method provided in the embodiment of the present application, for a trained target detection model, if it is necessary to add a new object detection function to the target detection model, the newly added images and their label information can be used to carry out category incremental learning for the target detection model, so that the learned target detection model can add the target detection function for the newly added images while maintaining the original target detection function, which is conducive to continuously improving the target detection performance of the target detection model.
  • the embodiment of the present application also provides a possible implementation of the target detection model training method, which specifically includes steps 41-45:
  • Step 41 Obtain a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image.
  • Step 42 Perform text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image.
  • for the relevant content of step 41-step 42, please refer to S101-S102 above, respectively.
  • Step 43 Input the sample image into the target detection model, and obtain the image features of the sample image, the predicted target text identifier of the sample image, and the predicted target position of the sample image output by the target detection model.
  • the predicted target text identifier of the sample image is used to represent the predicted identifier (eg, predicted category) of the target object in the sample image.
  • step 43 can be implemented using any of the implementations of S103 above; it is only necessary to replace the output data of the target detection model in S103 above from "the image features of the sample image and the predicted target position of the sample image" with "the image features of the sample image, the predicted target text identifier of the sample image, and the predicted target position of the sample image".
  • Step 44 Judging whether the first stop condition is met, if yes, execute a preset action; if not, execute step 45.
  • for the relevant content of step 44, please refer to the relevant content of S104 above.
  • it should be noted that the "prediction loss value of the target detection model" in step 44 is calculated based on the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image.
  • Step 45 According to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image, update the target detection model, and return to step 43.
  • step 45 can be implemented using any of the implementations of S105 above; it is only necessary to replace "the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image" in any implementation of S105 above with "the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image".
  • that is, the update process of the target detection model in step 45 is based on the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image.
  • in this way, text feature extraction can first be performed on the actual target text identifier of the sample image to obtain the target text feature of the sample image; then the sample image, the target text feature of the sample image, the actual target text identifier of the sample image, and the actual target position of the sample image are used to train the target detection model to obtain the trained target detection model.
  • because the target detection model is trained under the constraints of the target text features, the actual target text identifier, and the actual target position of the sample image, the trained target detection model has a better target detection function, which is beneficial to improving the target detection performance.
  • the embodiment of the present application also provides a possible implementation of the target detection model training method.
  • on the basis that the target detection model training method includes the above steps 41-45, steps 46-49 are also included:
  • Step 46 After acquiring the newly added image, the actual target text identifier of the newly added image, and the actual target position of the newly added image, perform text feature extraction on the actual target text identifier of the newly added image to obtain the target text feature of the newly added image.
  • for the relevant content of step 46, please refer to S106 above.
  • Step 47 Input the historical sample image and the newly added image into the target detection model, and obtain the image features of the historical sample image, the predicted target text identifier of the historical sample image, the predicted target position of the historical sample image, the image features of the newly added image, the predicted target text identifier of the newly added image, and the predicted target position of the newly added image output by the target detection model.
  • the predicted target text identifier of the historical sample image is used to represent the predicted identifier (eg, predicted category) of the target object in the historical sample image.
  • the predicted target text identifier of the added image is used to represent the predicted identifier (eg, predicted category) of the target object in the added image.
  • step 47 can be implemented using any of the implementations of S107 above; it is only necessary to replace the output data of the target detection model in S107 above from "the image features of the historical sample image, the predicted target position of the historical sample image, the image features of the newly added image, and the predicted target position of the newly added image" with "the image features of the historical sample image, the predicted target text identifier of the historical sample image, the predicted target position of the historical sample image, the image features of the newly added image, the predicted target text identifier of the newly added image, and the predicted target position of the newly added image".
  • Step 48 Judging whether the second stop condition is met, if yes, execute the preset step; if not, execute step 49.
  • it should be noted that the "detection loss value of the target detection model" in step 48 is calculated based on the predicted target text identifier of the historical sample image, the actual target text identifier of the historical sample image, the predicted target position of the historical sample image, the actual target position of the historical sample image, the predicted target text identifier of the newly added image, the actual target text identifier of the newly added image, the predicted target position of the newly added image, the actual target position of the newly added image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, and the similarity between the image feature of the newly added image and the target text feature of the newly added image.
  • Step 49 According to the predicted target text identifier of the historical sample image, the actual target text identifier of the historical sample image, the predicted target position of the historical sample image, the actual target position of the historical sample image, the predicted target text identifier of the newly added image, the actual target text identifier of the newly added image, the predicted target position of the newly added image, the actual target position of the newly added image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, and the similarity between the image features of the newly added image and the target text features of the newly added image, update the target detection model, and return to step 47.
  • step 49 can be implemented using any of the implementations of S109 above; it is only necessary to replace "the predicted target position of the historical sample image, the actual target position of the historical sample image, the predicted target position of the newly added image, the actual target position of the newly added image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, and the similarity between the image features of the newly added image and the target text features of the newly added image" in any implementation of S109 above with the data listed in step 49.
  • it can be seen that, with the target detection model training method provided in the embodiment of the present application, for a trained target detection model, if it is necessary to add a new object detection function to the target detection model, the target detection model can be incrementally learned by using the newly added images and their three kinds of label information (that is, target text features, actual target text identifiers, and actual target positions), so that the learned target detection model can add the target detection function for the newly added images while maintaining the original target detection function, which is conducive to continuously improving the target detection performance of the target detection model.
  • after the target detection model is trained, the target detection model can be used for target detection. Based on this, an embodiment of the present application further provides a target detection method, which will be described below with reference to the accompanying drawings.
  • this figure is a flow chart of a target detection method provided by an embodiment of the present application.
  • the target detection method provided in the embodiment of this application includes S301-S302:
  • S301 Acquire an image to be detected.
  • the image to be detected refers to an image that needs to be subjected to target detection processing.
  • S302 Input the image to be detected into a pre-trained target detection model, and obtain a target detection result of the image to be detected output by the target detection model.
  • the target detection model is trained by using any implementation of the target detection model training method provided in the embodiment of the present application.
  • the object detection result of the image to be detected is obtained by the object detection model performing object detection on the image to be detected.
  • this embodiment of the present application does not limit the target detection result of the image to be detected.
  • for example, the target detection result of the image to be detected may include the predicted target text identifier (for example, the predicted target category) of the target object in the image to be detected and/or the area occupied by the target object in the image to be detected.
  • the target detection model that has been trained can be used to perform target detection on the image to be detected, and the target detection result of the image to be detected can be obtained and output, so that The target detection result of the image to be detected can accurately represent the relevant information of the target object in the image to be detected (eg, target category information and target position information, etc.).
  • the target detection result of the image to be detected determined by using the target detection model is more accurate, which is beneficial to improve the accuracy of target detection.
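S301-S302 can be sketched as a thin inference wrapper. The callable `model` interface and the result keys below are assumptions for illustration, since the embodiment does not limit the form of the target detection result:

```python
def detect(model, image_to_detect):
    """S301-S302: feed the image to be detected into the trained target
    detection model and return its target detection result, here assumed to
    contain a predicted target text identifier and a predicted target region."""
    result = model(image_to_detect)  # model is assumed callable on one image
    return {
        "predicted_target_text_identifier": result["category"],
        "predicted_target_region": result["box"],
    }
```

A caller would pass the trained model and the acquired image to be detected, then read the category and region from the returned result.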
  • the embodiment of the present application also provides a target detection model training device, which will be explained and described below with reference to the accompanying drawings.
  • this figure is a schematic structural diagram of a target detection model training device provided by an embodiment of the present application.
  • the target detection model training device 400 provided in the embodiment of the present application includes:
  • a first acquiring unit 401 configured to acquire a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image;
  • the first extraction unit 402 is configured to perform text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image;
  • the first prediction unit 403 is configured to input the sample image into the target detection model, and obtain the image features of the sample image output by the target detection model and the predicted target position of the sample image;
  • the first updating unit 404 is configured to, according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, Update the target detection model, and return to the first prediction unit 403 to execute the input of the sample image into the target detection model until the first stop condition is reached.
  • the first extraction unit 402 is specifically configured to:
  • the target detection model training device 400 further includes:
  • the second extraction unit is configured to, after the first stop condition is reached and the newly added image, the actual target text identifier of the newly added image, and the actual target position of the newly added image are acquired, perform text feature extraction on the actual target text identifier of the newly added image to obtain the target text feature of the newly added image;
  • the second prediction unit is configured to input the historical sample image and the newly added image into the target detection model, and obtain the image features of the historical sample image, the predicted target position of the historical sample image, the image features of the newly added image, and the predicted target position of the newly added image output by the target detection model; wherein, the historical sample image is determined according to the sample image;
  • the second updating unit is configured to update the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image feature of the newly added image and the target text feature of the newly added image, and return to the second prediction unit to execute the inputting of the historical sample image and the newly added image into the target detection model until the second stop condition is reached.
  • the process of determining the historical sample image includes:
  • determining, according to the sample images, the training used images corresponding to the target detection model; determining at least one historical target category according to the actual target text identifiers of the training used images; determining, according to the actual target text identifiers of the training used images, the training used images belonging to each historical target category from the training used images corresponding to the target detection model;
  • the historical sample images corresponding to the respective historical object categories are respectively extracted from the training used images belonging to the various historical object categories.
  • the second updating unit includes:
  • the first determination subunit is configured to determine the historical image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image features of the historical sample image and the target text features of the historical sample image;
  • the second determination subunit is configured to determine the newly added image loss value according to the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image features of the newly added image and the target text features of the newly added image;
  • the third determining subunit is configured to perform weighted summation of the historical image loss value and the newly added image loss value to obtain the detection loss value of the target detection model; wherein, the weight corresponding to the historical image loss value The weight is higher than the weighted weight corresponding to the added image loss value;
  • the model update subunit is configured to update the target detection model according to the detection loss value of the target detection model.
  • the first prediction unit 403 is specifically configured to:
  • the first updating unit 404 is specifically used for:
  • updating the target detection model according to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image, and returning to the first prediction unit 403 to execute the inputting of the sample image into the target detection model until the first stop condition is reached.
  • text feature extraction is first performed on the actual target text identifier of the sample image to obtain the target text feature of the sample image; then the sample image, the target text feature of the sample image, and the actual target position of the sample image are used to train the target detection model, yielding a trained target detection model.
  • the target text feature of the sample image can more accurately represent the actual target text identifier of the sample image
  • the target detection model trained based on the target text feature of the sample image has a better target detection capability, which helps improve target detection performance.
  • the embodiment of the present application also provides a target detection device, which will be explained and described below with reference to the accompanying drawings.
  • this figure is a schematic structural diagram of a target detection device provided by an embodiment of the present application.
  • the target detection device 500 provided in the embodiment of the present application includes:
  • a second acquiring unit 501 configured to acquire an image to be detected
  • the target detection unit 502 is configured to input the image to be detected into a pre-trained target detection model, and obtain the target detection result of the image to be detected output by the target detection model; wherein the target detection model is trained using any implementation of the target detection model training method provided in the embodiments of the present application.
  • after acquiring the image to be detected, the target detection device 500 can use the trained target detection model to perform target detection on it, obtaining and outputting the target detection result of the image to be detected, so that this result accurately represents the relevant information of the target object in the image to be detected (e.g., target category information and target position information). Since the trained target detection model has better target detection performance, the target detection result determined with it is more accurate, which helps improve target detection accuracy.
  • the embodiment of the present application also provides a device, the device includes a processor and a memory:
  • the memory is used to store a computer program;
  • the processor is configured to execute any implementation of the target detection model training method provided in the embodiments of the present application according to the computer program, or execute any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer-readable storage medium for storing a computer program, the computer program being used to execute any implementation of the target detection model training method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product which, when running on a terminal device, enables the terminal device to execute any implementation of the target detection model training method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
  • "At least one (item)" means one or more, and "multiple" means two or more.
  • "And/or" is used to describe the association relationship of associated objects, indicating that three kinds of relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the contextual objects are in an "or" relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • "At least one item (piece) of a, b, or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the present application are a target detection model training method, a target detection method, and related devices. First, text feature extraction is performed on an actual target text identifier of a sample image to obtain a target text feature of the sample image. Then, a target detection model is trained using the sample image, the target text feature of the sample image, and an actual target position of the sample image, so that the target detection model performs target detection learning under the constraints of the target text feature and the actual target position of the sample image. The trained target detection model therefore has better target detection performance, so that it can subsequently perform more accurate target detection on an image under test, obtaining and outputting a more accurate target detection result, thereby facilitating an improvement in target detection accuracy.

Description

A target detection model training method, target detection method and related equipment
This application claims priority to the Chinese patent application No. 202110723057.4, filed with the State Intellectual Property Office of China on June 28, 2021 and entitled "A target detection model training method, target detection method and related equipment", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of image processing, and in particular to a target detection model training method, a target detection method, and related equipment.
Background
Target detection (also called target extraction) is an image segmentation technique based on target geometric statistics and features, and it has a wide range of applications (for example, in robotics and autonomous driving).
However, because existing target detection technology still has some defects, how to improve target detection accuracy remains an urgent technical problem.
Summary
To solve the above technical problems in the prior art, the present application provides a target detection model training method, a target detection method, and related equipment, which can effectively improve target detection accuracy.
To achieve the above objective, the technical solutions provided in the embodiments of the present application are as follows:
An embodiment of the present application provides a target detection model training method, the method comprising:
acquiring a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image;
performing text feature extraction on the actual target text identifier of the sample image to obtain a target text feature of the sample image;
inputting the sample image into a target detection model to obtain an image feature of the sample image and a predicted target position of the sample image output by the target detection model;
updating the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and continuing to execute the step of inputting the sample image into the target detection model until a first stop condition is reached.
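For illustration only (such a sketch is not part of the claimed subject matter), the update step above could look as follows in Python. The L1 position loss and the cosine-similarity alignment term are assumptions; the application only states that the predicted/actual positions and the image-text feature similarity jointly constrain the update.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def training_loss(predicted_position, actual_position, image_feature, target_text_feature):
    # Position term: L1 distance between predicted and actual box coordinates (assumed form).
    position_loss = sum(abs(p - a) for p, a in zip(predicted_position, actual_position))
    # Alignment term: shrinks as the image feature approaches the target text feature.
    alignment_loss = 1.0 - cosine_similarity(image_feature, target_text_feature)
    return position_loss + alignment_loss

# A perfect position prediction with perfectly aligned features gives zero loss.
print(training_loss([10, 10, 50, 50], [10, 10, 50, 50], [3.0, 4.0], [3.0, 4.0]))  # 0.0
```

Minimizing such a loss pushes the image features of the sample toward the target text features extracted in the previous step, which is how the text features constrain target detection learning.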
In a possible implementation, performing text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image includes:
inputting the actual target text identifier of the sample image into a pre-trained language model to obtain the target text feature of the sample image output by the language model; wherein the language model is trained on sample text and the actual text features of the sample text.
In a possible implementation, after the first stop condition is reached, the method further includes:
after acquiring a newly added image, an actual target text identifier of the newly added image, and an actual target position of the newly added image, performing text feature extraction on the actual target text identifier of the newly added image to obtain a target text feature of the newly added image; the actual target text identifier of the newly added image differs from the actual target text identifier of the sample image;
inputting a historical sample image and the newly added image into the target detection model to obtain the image feature of the historical sample image, the predicted target position of the historical sample image, the image feature of the newly added image, and the predicted target position of the newly added image output by the target detection model; wherein the historical sample image is determined according to the sample image;
updating the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image feature of the newly added image and the target text feature of the newly added image, and continuing to execute the step of inputting the historical sample image and the newly added image into the target detection model until a second stop condition is reached.
In a possible implementation, the process of determining the historical sample image includes:
determining, according to the sample image, the training-used images corresponding to the target detection model;
determining at least one historical target category according to the actual target text identifiers of the training-used images;
determining, according to the actual target text identifiers of the training-used images, the training-used images belonging to each historical target category from the training-used images corresponding to the target detection model;
extracting, from the training-used images belonging to each historical target category, the historical sample images corresponding to that category.
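As an illustrative sketch (not part of the claims), the per-category exemplar extraction above might look like this; the field name `target_text`, the per-category count, and random sampling are all assumptions:

```python
import random

def determine_historical_samples(training_used_images, per_category=2, seed=0):
    # Group the training-used images by historical target category, i.e. by
    # their actual target text identifier ("target_text" is an assumed field name).
    by_category = {}
    for image in training_used_images:
        by_category.setdefault(image["target_text"], []).append(image)
    # Extract a fixed number of exemplar images from each historical category.
    rng = random.Random(seed)
    historical_samples = []
    for category in sorted(by_category):
        candidates = by_category[category]
        historical_samples.extend(rng.sample(candidates, min(per_category, len(candidates))))
    return historical_samples

used = [{"id": i, "target_text": t} for i, t in enumerate(["cat", "cat", "cat", "dog"])]
print(len(determine_historical_samples(used)))  # 2 exemplars for "cat" + 1 for "dog" = 3
```

Sampling per category (rather than uniformly over all used images) ensures every previously learned category is represented when the model is later updated with newly added images.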
In a possible implementation, updating the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image feature of the newly added image and the target text feature of the newly added image includes:
determining a historical-image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image feature of the historical sample image and the target text feature of the historical sample image;
determining a newly-added-image loss value according to the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image feature of the newly added image and the target text feature of the newly added image;
performing a weighted summation of the historical-image loss value and the newly-added-image loss value to obtain a detection loss value of the target detection model, wherein the weight corresponding to the historical-image loss value is higher than the weight corresponding to the newly-added-image loss value;
updating the target detection model according to the detection loss value of the target detection model.
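A minimal sketch of the weighted summation follows. The concrete weights 0.7/0.3 are illustrative assumptions; the application only requires the historical weight to be the higher one:

```python
def detection_loss(historical_image_loss, new_image_loss, w_historical=0.7, w_new=0.3):
    # The weight on the historical-image loss must exceed the weight on the
    # newly-added-image loss, biasing the update toward retaining the
    # categories learned in earlier training rounds.
    if w_historical <= w_new:
        raise ValueError("historical weight must be higher than the new-image weight")
    return w_historical * historical_image_loss + w_new * new_image_loss

print(detection_loss(0.5, 1.0))  # 0.7 * 0.5 + 0.3 * 1.0 ≈ 0.65
```

Weighting the historical exemplars more heavily counteracts forgetting: even though the newly added images dominate the fresh training signal, the replayed historical images contribute more per-sample gradient.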
In a possible implementation, inputting the sample image into the target detection model to obtain the image feature of the sample image and the predicted target position of the sample image output by the target detection model includes:
inputting the sample image into the target detection model to obtain the image feature of the sample image, a predicted target text identifier of the sample image, and the predicted target position of the sample image output by the target detection model;
and updating the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image includes:
updating the target detection model according to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image.
An embodiment of the present application further provides a target detection method, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a pre-trained target detection model to obtain a target detection result of the image to be detected output by the target detection model; wherein the target detection model is trained using any implementation of the target detection model training method provided in the embodiments of the present application.
An embodiment of the present application further provides a target detection model training device, the device comprising:
a first acquisition unit, configured to acquire a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image;
a first extraction unit, configured to perform text feature extraction on the actual target text identifier of the sample image to obtain a target text feature of the sample image;
a first prediction unit, configured to input the sample image into a target detection model and obtain the image feature of the sample image and the predicted target position of the sample image output by the target detection model;
a first updating unit, configured to update the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and to return to the first prediction unit to execute the inputting of the sample image into the target detection model until a first stop condition is reached.
An embodiment of the present application further provides a target detection device, the device comprising:
a second acquisition unit, configured to acquire an image to be detected;
a target detection unit, configured to input the image to be detected into a pre-trained target detection model and obtain the target detection result of the image to be detected output by the target detection model; wherein the target detection model is trained using any implementation of the target detection model training method provided in the embodiments of the present application.
An embodiment of the present application further provides a device, the device comprising a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to execute, according to the computer program, any implementation of the target detection model training method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium for storing a computer program, the computer program being used to execute any implementation of the target detection model training method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the target detection model training method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
Compared with the prior art, the embodiments of the present application have at least the following advantages:
In the technical solutions provided by the embodiments of the present application, text feature extraction is first performed on the actual target text identifier of a sample image to obtain the target text feature of the sample image; then the sample image, the target text feature of the sample image, and the actual target position of the sample image are used to train a target detection model, so that the target detection model performs target detection learning under the constraints of the target text feature and the actual target position of the sample image. The trained target detection model therefore has better target detection performance, so that it can subsequently perform more accurate target detection on an image to be detected, obtaining and outputting a more accurate target detection result for that image, which helps improve target detection accuracy.
Description of Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a target detection model training method provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a target detection model provided by an embodiment of the present application;
FIG. 3 is a flow chart of a target detection method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a target detection model training device provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a target detection device provided by an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the scope of protection of this application.
To facilitate understanding of the technical solutions of the present application, the training process of the target detection model (that is, the target detection model training method) is introduced first, followed by the application process of the target detection model (that is, the target detection method).
Method Embodiment One
Referring to FIG. 1, this figure is a flow chart of a target detection model training method provided by an embodiment of the present application.
The target detection model training method provided in the embodiment of the present application includes S101-S105:
S101: Acquire a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image.
The sample image refers to an image used for training the target detection model. In addition, the embodiment of the present application does not limit the number of sample images; for example, the number of sample images may be N (that is, N sample images are used to train the target detection model).
The actual target text identifier of the sample image is used to uniquely represent the target object in the sample image. The embodiment of the present application does not limit the actual target text identifier; for example, it may be an object category (or an object name, etc.). For instance, if the sample image includes a cat, the actual target text identifier of the sample image may be "cat".
The actual target position of the sample image is used to represent the area actually occupied by the target object within the sample image. The present application does not limit how the actual target position is represented; any existing or future representation capable of expressing the area an object occupies in an image may be used.
S102:对样本图像的实际目标文本标识进行文本特征提取,得到该样本图像的目标文本特征。S102: Perform text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image.
其中,样本图像的目标文本特征用于描述该样本图像的实际目标文本标识所携带的文本信息(如,语义信息等),以使该样本图像的目标文本特征能够表示出该样本图像中目标物体在该样本图像中实际呈现的特征。Among them, the target text feature of the sample image is used to describe the text information (such as semantic information, etc.) carried by the actual target text mark of the sample image, so that the target text feature of the sample image can represent the target object in the sample image The features actually present in this sample image.
另外,本申请实施例不限定样本图像的目标文本特征的提取方式(也就是,S102的实施方式),可以采用现有的或者未来出现的任一种能够针对一个文本进行特征提取的方法进行实施。为了便于理解,下面结合示例进行说明。In addition, the embodiment of the present application does not limit the method of extracting the target text features of the sample image (that is, the implementation of S102), and any existing or future method that can perform feature extraction for a text can be used for implementation. . For ease of understanding, the following description will be given in combination with examples.
作为示例,S102具体可以包括:将样本图像的实际目标文本标识输入预先训练的语言模型,得到该语言模型输出的该样本图像的目标文本特征。As an example, S102 may specifically include: inputting the actual target text identifier of the sample image into a pre-trained language model, and obtaining the target text feature of the sample image output by the language model.
其中,语言模型用于进行文本特征提取;而且本申请实施例不限定语言模型,可以采用现有的或者未来出现的任一种语言模型进行实施。Wherein, the language model is used for text feature extraction; and the embodiment of the present application does not limit the language model, and any existing or future language model can be used for implementation.
另外,语言模型可以预先根据样本文本和该样本文本的实际文本特征进行训练。其中,样本文本是指训练语言模型所需使用的文本;而且该样本文本的实际文本特征用于描述该样本文本实际携带的文本信息(如,语义信息等)。In addition, the language model can be trained in advance according to the sample text and the actual text features of the sample text. Wherein, the sample text refers to the text required for training the language model; and the actual text features of the sample text are used to describe the text information actually carried by the sample text (such as semantic information, etc.).
此外,本申请实施例不限定语言模型的训练过程,可以采用现有的或者未来出现的任一种能够依据样本文本和该样本文本的实际文本特征对语言模型进行训练的方法进行实施。In addition, the embodiment of the present application does not limit the training process of the language model, and any existing or future method that can train the language model according to the sample text and the actual text features of the sample text can be used for implementation.
基于上述S102的相关内容可知,若样本图像的个数为N,则在获取到第i个样本图像的实际目标文本标识之后,可以利用预先训练的语言模型针对该第i个样本图像的实际目标文本标识进行文本特征提取,得到并输出该第i个样本图像的目标文本特征,以使该第i个样本图像的目标文本特征能够准确地表征出该第i个样本图像的实际目标文本标识所携带的文本信息,以便后续利用该第i个样本图像的目标文本特征约束目标检测模型的训练更新过程。其中,i为正整数,i≤N,N为正整数。Based on the relevant content of S102 above, if the number of sample images is N, after the actual target text identifier of the i-th sample image is obtained, the pre-trained language model can be used to target the actual target text of the i-th sample image The text mark is used for text feature extraction, and the target text feature of the i-th sample image is obtained and output, so that the target text feature of the i-th sample image can accurately represent the actual target text mark of the i-th sample image The text information carried by , so that the target text features of the i-th sample image can be used to constrain the training update process of the target detection model. Wherein, i is a positive integer, i≤N, and N is a positive integer.
It can be seen that, because a pre-trained language model can accurately extract the text information (especially the semantic information) carried by a text, the number of texts the model can describe is unbounded, and the text features it outputs for different texts are highly separable from one another. This effectively guarantees that the text features of any two texts (e.g., any two of the target text features of the N sample images) do not overlap, which effectively improves the detection accuracy of the target detection model. Moreover, because the language model learns semantic correlations between texts during training (for example, "cat" is semantically closer to "tiger" than to "car"), the trained language model extracts better text features, which further improves the detection accuracy of the target detection model.
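The text-feature pipeline of S102 can be sketched as follows. The function below is a deterministic stand-in for the pre-trained language model (whose choice the embodiments deliberately leave open); it is an assumption made only to illustrate the contract described above: each actual target text identifier maps to one fixed-dimension feature vector, with distinct labels yielding distinct, separable vectors.

```python
import hashlib
import math

def embed_label(text, dim=8):
    # Stand-in "language model": maps a class label to a fixed-length
    # unit vector. A real system would use a trained text encoder;
    # this hash-based stub only shows the interface (text in,
    # fixed-dimension target text feature out, deterministically).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Distinct identifiers produce distinct (separable) feature vectors.
cat_feat = embed_label("cat")
car_feat = embed_label("car")
```

A real encoder would additionally place semantically related labels ("cat", "tiger") closer together than unrelated ones; the stub cannot do that, which is exactly why the embodiments rely on a trained language model.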
S103: Input the sample image into the target detection model to obtain the image feature of the sample image and the predicted target position of the sample image output by the model.
The image feature of a sample image represents the features that the target object in the sample image is predicted to exhibit in that image.
The predicted target position of a sample image represents the region that the target object is predicted to occupy within the image.
The target detection model is used for target detection (e.g., detecting the category of the target object and the image position of the target object). The embodiments of the present application do not limit the target detection model. For example, as shown in Fig. 2, the target detection model 200 may include an image feature extraction layer 201, a target category prediction layer 202, and a target position prediction layer 203, where the input data of the target category prediction layer 202 includes the output data of the image feature extraction layer 201, and the input data of the target position prediction layer 203 likewise includes the output data of the image feature extraction layer 201.
To make the working principle of the target detection model 200 easier to understand, it is described below with reference to a sample image.
As an example, after a sample image is input into the target detection model 200, the working process of the model may include steps 11 to 13:
Step 11: Input the sample image into the image feature extraction layer 201 to obtain the image feature of the sample image output by that layer.
The image feature extraction layer 201 performs image feature extraction on its input data. The embodiments of the present application do not limit its implementation: any existing or future image feature extraction scheme may be used.
Step 12: Input the image feature of the sample image into the target category prediction layer 202 to obtain the predicted target text identifier of the sample image output by that layer.
The target category prediction layer 202 performs object category prediction on its input data. The embodiments of the present application do not limit its implementation: any existing or future object category prediction scheme may be used.
The predicted target text identifier of a sample image represents the predicted identifier (e.g., predicted category) of the target object in that image.
Step 13: Input the image feature of the sample image into the target position prediction layer 203 to obtain the predicted target position of the sample image output by that layer.
The target position prediction layer 203 performs object position prediction on its input data. The embodiments of the present application do not limit its implementation: any existing or future object position prediction scheme may be used.
Based on steps 11 to 13 above, for the target detection model 200 shown in Fig. 2, after a sample image is input into the model, the image feature extraction layer 201, the target category prediction layer 202, and the target position prediction layer 203 respectively generate and output the image feature of the sample image, its predicted target text identifier, and its predicted target position, so that the target detection performance of the model 200 can subsequently be determined from this prediction information.
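The three-layer structure of Fig. 2 and steps 11 to 13 can be sketched as follows. Each sub-network is reduced to a single randomly initialized linear map, which is purely an assumption for illustration: real layers 201 to 203 would be deep networks, and the dimensions chosen here are arbitrary.

```python
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

def linear(x, w):
    # One fully connected layer (no bias): y_j = sum_i x_i * w[i][j].
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

class ToyDetector:
    # Mirrors Fig. 2: a shared image-feature layer feeding a class
    # head and a box head, each standing in for layers 201/202/203.
    def __init__(self, in_dim=16, feat_dim=8, num_classes=3):
        self.w_feat = rand_matrix(in_dim, feat_dim)      # layer 201
        self.w_cls = rand_matrix(feat_dim, num_classes)  # layer 202
        self.w_box = rand_matrix(feat_dim, 4)            # layer 203

    def forward(self, pixels):
        feat = linear(pixels, self.w_feat)     # step 11: image feature
        cls_scores = linear(feat, self.w_cls)  # step 12: category scores
        box = linear(feat, self.w_box)         # step 13: (x, y, w, h)
        return feat, cls_scores, box

model = ToyDetector()
feat, cls_scores, box = model.forward([0.5] * 16)
```

Note that both prediction heads consume the same image feature, matching the statement that the inputs of layers 202 and 203 both include the output of layer 201.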
It should be noted that, for the target detection model 200 shown in Fig. 2, the data dimension of the image feature output by the image feature extraction layer 201 may in some cases differ from the data dimension of the target text feature of the sample image. To ensure that the similarity between the image feature of the sample image and its target text feature can subsequently be computed, a data dimension transformation layer can be added to the target detection model 200 shown in Fig. 2. The input data of this layer includes the output data of the image feature extraction layer 201, so that the layer performs a data dimension transformation on that output (e.g., the image feature of the sample image) to make its output match the data dimension of the target text feature of the sample image. This helps improve the accuracy of the similarity computation between the image feature of the sample image and the target text feature of the sample image.
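A minimal sketch of such a data dimension transformation layer: a linear map taking an image feature to the text-feature dimensionality so that the two can be compared. The concrete dimensions and weights below are illustrative assumptions.

```python
def project(feat, w):
    # Data dimension transformation layer: linear map from
    # len(feat) dimensions to the text-feature dimensionality.
    # w has len(feat) rows and target_dim columns.
    return [sum(f * wij for f, wij in zip(feat, col)) for col in zip(*w)]

img_feat = [0.2, -0.1, 0.4]    # image feature, dimension 3
w = [[1, 0],                   # maps dimension 3 -> dimension 2,
     [0, 1],                   # the assumed text-feature dimension
     [1, 1]]
aligned = project(img_feat, w)  # now comparable to a text feature
```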
Based on the description of S103 above, if there are N sample images, then after the i-th sample image is obtained (or after one update of the target detection model is completed), the i-th sample image can be input into the target detection model, which performs target detection on it and outputs the image feature and predicted target position of the i-th sample image, so that the target detection performance of the model can subsequently be determined from them. Here, i is a positive integer with i ≤ N, and N is a positive integer.
S104: Determine whether the first stop condition is met; if so, perform the preset action; if not, execute S105.
The first stop condition can be set in advance, and the embodiments of the present application do not limit it. For example, the first stop condition may be that the prediction loss value of the target detection model falls below a first preset loss threshold, that the rate of change of the prediction loss value falls below a first change-rate threshold, or that the number of updates of the target detection model reaches a first count threshold.
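The three example criteria can be combined as a simple check, any one of which suffices. The thresholds below are illustrative assumptions; the embodiments leave their values open.

```python
def reached_first_stop(loss, prev_loss, updates,
                       loss_thresh=0.01, rate_thresh=1e-4,
                       max_updates=10000):
    # First stop condition: loss below a threshold, OR loss
    # change-rate below a threshold, OR update count at the limit.
    rate = abs(prev_loss - loss) / max(abs(prev_loss), 1e-12)
    return loss < loss_thresh or rate < rate_thresh or updates >= max_updates
```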
It should be noted that the prediction loss value of the target detection model represents the model's target detection performance on the N sample images above. The embodiments of the present application do not limit how this loss value is computed: any existing or future method for computing a model's prediction loss value may be used.
The preset action can be set in advance. For example, the preset action may be to end the training process of the target detection model (that is, to end the model's target detection learning on the N sample images). As another example, when a new object detection capability needs to be added to an already trained target detection model (that is, incremental learning is performed on the model), the preset action may include S106-S109 below.
Based on the description of S104 above, for the current round of the target detection model, it can be determined whether the model meets the first stop condition. If the first stop condition is met, the current model has good target detection performance on the N sample images, so the current model can be saved for subsequent work (e.g., performing target detection, or adding a new object detection capability to the model). If the first stop condition is not met, the current model's detection performance on the N sample images is still poor, so the model is updated according to the label information of the N sample images and the prediction information output by the current model for those images.
S105: Update the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, then return to S103.
The similarity between the image feature of a sample image and its target text feature represents how close the two are. The embodiments of the present application do not limit how this similarity is computed; for example, it may be computed using the Euclidean distance.
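A minimal sketch of a Euclidean-distance-based similarity, as the text suggests: smaller distance means larger similarity. Mapping the distance through 1/(1+d) is one possible choice (an assumption), keeping the similarity in (0, 1] with identical features scoring exactly 1.

```python
import math

def euclidean_similarity(img_feat, txt_feat):
    # Euclidean distance between the (dimension-aligned) image feature
    # and target text feature, converted to a similarity in (0, 1].
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(img_feat, txt_feat)))
    return 1.0 / (1.0 + d)

same = euclidean_similarity([1.0, 0.0], [1.0, 0.0])
near = euclidean_similarity([1.0, 0.0], [0.9, 0.1])
far = euclidean_similarity([1.0, 0.0], [0.0, 1.0])
```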
In addition, the training objectives of the target detection model may include making the predicted target position of the sample image as close as possible to its actual target position, and making the image feature of the sample image as close as possible to its target text feature (that is, making the similarity between the image feature of the sample image and its target text feature as large as possible).
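The two objectives can be combined into a single training loss. The sketch below uses an L1 box term plus a squared-Euclidean feature-alignment term with equal weighting; both the loss forms and the weighting are assumptions, since the embodiments leave the loss computation open.

```python
def detection_loss(pred_box, true_box, img_feat, txt_feat):
    # Objective 1: predicted position close to actual position (L1).
    box_loss = sum(abs(p - t) for p, t in zip(pred_box, true_box))
    # Objective 2: image feature close to target text feature
    # (squared Euclidean distance; minimizing it maximizes similarity).
    align_loss = sum((a - b) ** 2 for a, b in zip(img_feat, txt_feat))
    return box_loss + align_loss
```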
Based on the description of S105 above, if there are N sample images, then after determining that the current round of the target detection model does not meet the first stop condition, the model can be updated according to the gap between the predicted target position and the actual target position of the i-th sample image, and the similarity between the image feature and the target text feature of the i-th sample image, so that the updated model has better target detection performance; S103 and its subsequent steps are then executed again. Here, i is a positive integer with i ≤ N, and N is a positive integer.
Based on S101 to S105 above, in the target detection model training method provided by the embodiments of the present application, text feature extraction is first performed on the actual target text identifier of a sample image to obtain the target text feature of the sample image; the target detection model is then trained using the sample image, its target text feature, and its actual target position to obtain a trained model. Because the target text feature of the sample image represents its actual target text identifier more accurately, a target detection model trained under the constraint of this feature detects targets better, which helps improve target detection performance.
Method Embodiment Two
In practice, a trained target detection model has good detection performance on the target objects it has already learned. To further improve its prediction performance, the trained model can additionally learn target objects it has not yet learned (that is, class-incremental learning can be performed on the target detection model). Accordingly, the embodiments of the present application further provide a possible implementation of the target detection model training method that includes, in addition to S101-S105 above, S106-S109:
S106: After an added image, the actual target text identifier of the added image, and the actual target position of the added image are obtained, perform text feature extraction on the actual target text identifier of the added image to obtain the target text feature of the added image.
An added image is an image used for class-incremental learning on the already trained target detection model.
The embodiments of the present application do not limit the number of added images; for example, there may be M added images, where M is a positive integer. In this case, S106-S109 enable the target detection model to further learn how to perform target detection on the M added images while retaining the target objects it has already learned.
For the actual target text identifier of the added image, the actual target position of the added image, and the target text feature of the added image, refer respectively to the descriptions of the actual target text identifier and actual target position of the sample image in S101 above and the target text feature of the sample image in S102 above, replacing "sample image" with "added image" throughout.
Based on the description of S106 above, for an already trained target detection model (e.g., a model trained via the training process shown in S101-S105 above, or a model that, after such training, has undergone class-incremental learning via the training process shown in S106-S109 at least once), obtaining an added image together with its actual target text identifier and actual target position indicates that one round of class-incremental learning needs to be performed on the trained model. Text feature extraction can therefore be performed on the actual target text identifier of the added image to obtain its target text feature, which subsequently constrains the class-incremental learning process of the model, so that the retrained model learns how to perform target detection on the added images while retaining the target objects it has already learned.
S107: Input the historical sample images and the added images into the target detection model to obtain the image features and predicted target positions of the historical sample images and the image features and predicted target positions of the added images output by the model.
The historical sample images may include all or some of the images used in the historical training process of the target detection model.
The historical training process of the target detection model refers to the class learning process the model has already undergone before the current round of class-incremental learning. For example, if the trained model has only undergone the class learning process shown in S101-S105 above, its historical training process is the training process shown in S101-S105. As another example, if the trained model has undergone the class learning process shown in S101-S105 once and the class-incremental learning process shown in S106-S109 Q times, its historical training process may include the training process shown in S101-S105 and the first through Q-th rounds of the training process shown in S106-S109.
In addition, the embodiments of the present application do not limit how the historical sample images are determined. For example, in one possible implementation, the determination process may include steps 21 to 24:
Step 21: Determine, according to the sample images, the training-used images corresponding to the target detection model.
The training-used images corresponding to the target detection model are the images that have been used in the model's historical training process. For ease of understanding, two examples are described below.
Example 1: If the historical training process of the target detection model includes the training process shown in S101-S105 above, the training-used images corresponding to the model may include the N sample images above.
Example 2: If the historical training process of the target detection model includes the training process shown in S101-S105 above and the first through Q-th rounds of the training process shown in S106-S109, where the q-th round of S106-S109 used G_q added images for class-incremental learning (q is a positive integer, q ≤ Q), then the training-used images corresponding to the model may include the N sample images above, the G_1 added images, the G_2 added images, ..., and the G_Q added images.
Based on step 21 above, after it is determined that incremental learning needs to be performed on the trained target detection model, the training-used images corresponding to the model can first be determined from the images involved in its historical training process, so that the training-used images accurately represent the images that have already been used in the model's historical learning.
Step 22: Determine at least one historical target category according to the actual target text identifiers of the training-used images.
A historical target category is an object category that the target detection model has already learned during its historical training process. For ease of understanding, two examples are described below.
Example 1: If the historical training process of the target detection model includes the training process shown in S101-S105 above, and the N sample images in that process correspond to R_0 object categories, then all R_0 object categories can be determined as historical target categories.
Example 2: If the historical training process of the target detection model includes the training process shown in S101-S105 above and the first through Q-th rounds of the training process shown in S106-S109, where the N sample images of S101-S105 correspond to R_0 object categories and the G_q added images of the q-th round of S106-S109 correspond to R_q object categories (q is a positive integer, q ≤ Q), then the R_0 object categories, the R_1 object categories, the R_2 object categories, ..., and the R_Q object categories can all be determined as historical target categories.
It should be noted that no object category occurs more than once among the R_0, R_1, R_2, ..., R_Q object categories; that is, any two object categories among them are different.
Based on step 22 above, after the training-used images corresponding to the target detection model are obtained, the actual target text identifier of each training-used image can be used to determine the model's historical target categories, so that the historical target categories accurately represent the object categories the model has already learned during its historical learning.
Step 23: According to the actual target text identifiers of the training-used images, determine, from the training-used images corresponding to the target detection model, the training-used images belonging to each historical target category.
As an example, if there are M historical target categories, and among the training-used images corresponding to the target detection model Y_1 images belong to the first historical target category, Y_2 images belong to the second historical target category, ..., and Y_M images belong to the M-th historical target category, then step 23 may specifically include: determining the Y_1 images belonging to the first historical target category as the training-used images of the first historical target category; determining the Y_2 images belonging to the second historical target category as the training-used images of the second historical target category; and so on, up to determining the Y_M images belonging to the M-th historical target category as the training-used images of the M-th historical target category.
Step 24: Extract, from the training-used images belonging to each historical target category, the historical sample images corresponding to that category.
It should be noted that the embodiments of the present application do not limit how the "extraction" in step 24 is implemented; for example, it may be performed according to a preset extraction ratio (or a preset number of extractions, etc.).
For example, if the extraction ratio is 10% and there are M historical target categories, step 24 may specifically include: randomly extracting 10% of the training-used images belonging to the first historical target category to obtain the historical sample images corresponding to the first historical target category, so that the actual target text identifier of each of them is the first historical target category; randomly extracting 10% of the training-used images belonging to the second historical target category to obtain the historical sample images corresponding to the second historical target category, so that the actual target text identifier of each of them is the second historical target category; and so on, up to randomly extracting 10% of the training-used images belonging to the M-th historical target category to obtain the historical sample images corresponding to the M-th historical target category, so that the actual target text identifier of each of them is the M-th historical target category.
Based on steps 21 to 24 above, after it is determined that incremental learning needs to be performed on the trained target detection model, some historical sample images can be extracted from the images involved in the model's historical training process, so that these historical sample images represent the object categories the model has already learned during its historical learning.
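Steps 21 to 24 can be condensed into the following sketch: group the training-used images by their actual target text identifier (the historical target categories) and randomly keep a fixed ratio of each group as historical sample images. The 10% ratio matches the example in the text; keeping at least one image per category is an added assumption so that no learned category is lost.

```python
import random

def sample_exemplars(images, labels, ratio=0.1, seed=0):
    # Steps 21-22: the (image, label) pairs are the training-used
    # images with their actual target text identifiers.
    rng = random.Random(seed)
    # Step 23: group training-used images by historical target category.
    by_cls = {}
    for img, lab in zip(images, labels):
        by_cls.setdefault(lab, []).append(img)
    # Step 24: randomly extract `ratio` of each category's images
    # (at least one per category) as its historical sample images.
    exemplars = {}
    for lab, group in by_cls.items():
        k = max(1, int(len(group) * ratio))
        exemplars[lab] = rng.sample(group, k)
    return exemplars

used_images = [f"img{i}" for i in range(30)]
used_labels = ["cat"] * 20 + ["dog"] * 10
ex = sample_exemplars(used_images, used_labels)
```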
另外,历史样例图像的图像特征、历史样例图像的预测目标位置的相关内容请分别参见上文S103中“样本图像的图像特征”和“样本图像的预测目标位置”的相关内容,只需将上文S103中“样本图像的图像特征”和“样本图像的预测目标位置”的相关内容中“样本图像”替换为“历史样例图像”即可。In addition, for the image features of the historical sample images and the predicted target positions of the historical sample images, please refer to the related content of "Image Features of Sample Images" and "Predicted Target Positions of Sample Images" in S103 above. Just replace “sample image” with “historical sample image” in the related content of “image feature of sample image” and “predicted target position of sample image” in S103 above.
此外,新增图像的图像特征、新增图像的预测目标位置的相关内容请分别参见上文S103中“样本图像的图像特征”和“样本图像的预测目标位置”的相关内容,只需将上文S103中“样本图像的图像特征”和“样本图像的预测目标位置”的相关内容中“样本图像”替换为“新增图像”即可。In addition, for the image features of the newly added image and the predicted target position of the newly added image, please refer to the relevant content of the "image feature of the sample image" and "predicted target position of the sample image" in S103 above. In document S103, in the related content of "image features of sample image" and "predicted target position of sample image", "sample image" can be replaced with "new image".
基于上述S103的相关内容可知,在获取到历史样例图像和新增图像之后,可以将该历史样例图像和新增图像分别输入目标检测模型,以使该目标检测模型分别针对该历史样例图像和该新增图像进行目标检测,得到并输出该历史样例图像的图像特征以及预测目标位置、该新增图像的图像特征以及预测目标位置,以便后续能够基于这些预测信息确定目标检测模型的目标检测性能。Based on the relevant content of S103 above, after obtaining the historical sample image and the newly added image, the historical sample image and the newly added image can be respectively input into the target detection model, so that the target detection model can target the historical sample image and the newly added image for target detection, obtain and output the image features of the historical sample image and the predicted target position, the image features of the newly added image and the predicted target position, so that the target detection model can be determined based on these predicted information. Object detection performance.
S108: Determine whether the second stop condition is met; if so, perform the preset steps; if not, perform S109.
The second stop condition can be set in advance, and the embodiments of the present application do not limit it. For example, the second stop condition may be that the detection loss value of the target detection model falls below a second preset loss threshold, that the rate of change of the detection loss value falls below a second change-rate threshold, or that the number of updates of the target detection model reaches a second count threshold.
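The three example stop conditions above can be combined into one check, sketched below. The threshold values are illustrative assumptions, not values fixed by the embodiment, and the embodiment only requires one of the conditions, not all three.

```python
def second_stop_condition(loss_history, loss_threshold=0.05,
                          delta_threshold=1e-3, max_updates=1000):
    """Return True when any of the three example conditions holds:
    the latest loss is below a preset threshold, the loss change between
    consecutive updates is small, or the update count is exhausted."""
    if not loss_history:
        return False
    if loss_history[-1] < loss_threshold:
        return True
    if len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < delta_threshold:
        return True
    return len(loss_history) >= max_updates
```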
It should be noted that the detection loss value of the target detection model represents the model's target detection performance on the historical sample images and the newly added images. The embodiments of the present application do not limit how the detection loss value is computed; any existing or future method for computing a model's detection loss value may be used.
In practice, because the number of historical sample images per historical target category is usually small, in order to increase the influence of these historical sample images on the target detection model, an embodiment of the present application also provides a way of computing the detection loss value of the target detection model, which may specifically include steps 31 to 33:
Step 31: Determine a historical image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image features of the historical sample image and the target text features of the historical sample image.
The historical image loss value is the loss produced when the target detection model performs target detection on the historical sample images, so it represents the model's target detection performance on the historical sample images.
The embodiments of the present application do not limit how the historical image loss value is computed; any existing or future prediction loss computation method may be used.
Step 32: Determine a new image loss value according to the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image features of the newly added image and the target text features of the newly added image.
The new image loss value is the loss produced when the target detection model performs target detection on the newly added images, so it represents the model's target detection performance on the newly added images.
The embodiments of the present application do not limit how the new image loss value is computed; any existing or future prediction loss computation method may be used.
Step 33: Compute a weighted sum of the historical image loss value and the new image loss value to obtain the detection loss value of the target detection model, where the weight assigned to the historical image loss value is higher than the weight assigned to the new image loss value.
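Steps 31 to 33 can be sketched as below. The embodiment does not fix the per-image loss, so an L1 box term and a cosine-similarity feature term are used here purely for illustration, and the weights 0.7/0.3 are assumed values that merely satisfy the stated constraint that the historical weight is higher.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def image_loss(pred_box, true_box, image_feat, text_feat):
    """Steps 31/32 (illustrative): a box term plus a term that drives the
    image features toward the target text features."""
    box_loss = sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(pred_box)
    return box_loss + (1.0 - cosine_similarity(image_feat, text_feat))

def detection_loss(history_terms, new_terms, w_history=0.7, w_new=0.3):
    """Step 33: weighted sum, with the historical weight set higher."""
    assert w_history > w_new
    return w_history * image_loss(*history_terms) + w_new * image_loss(*new_terms)
```

A perfect prediction (matching boxes, aligned features) gives a zero loss, and raising `w_history` amplifies any error on the historical sample images, which is exactly the effect the embodiment seeks.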
The weight corresponding to the historical image loss value is the weight by which the historical image loss value is multiplied in the weighted sum of step 33; it can be set in advance.
The weight corresponding to the new image loss value is the weight by which the new image loss value is multiplied in the weighted sum of step 33; it can likewise be set in advance.
Based on steps 31 to 33 above, in order to strengthen the constraint that the small number of historical sample images and their label information exert on the update process of the target detection model, the weight corresponding to the historical image loss value can be raised when computing the detection loss value. A target detection model trained with this higher weight can not only accurately detect targets in the newly added images but also continue to accurately detect targets in the already-used training images, which helps improve the accuracy of category incremental learning.
The preset steps can be set in advance. For example, a preset step may be to end the current round of category incremental learning of the target detection model. As another example, when a new object detection capability needs to be added to the trained target detection model again (that is, the next round of category incremental learning is performed on the model), the preset steps may include S106 to S109 above.
Based on S108 above, for the current round of the target detection model, it can be determined whether the model meets the second stop condition. If it does, the model has good target detection performance on both the historical sample images and the newly added images, so the current model can be saved for subsequent use (for example, performing target detection, or adding yet another new object detection capability to the model). If it does not, the model's target detection performance on the historical sample images and the newly added images is still poor, so the model can be updated according to the label information of the historical sample images, the label information of the newly added images, and the prediction information output by the current model for those images.
S109: Update the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image features of the newly added image and the target text features of the newly added image, and return to S107.
The training objectives of the target detection model may include: the predicted target position of the historical sample image should be as close as possible to its actual target position; the image features of the historical sample image should be as close as possible to its target text features (that is, the similarity between them should be as large as possible); the predicted target position of the newly added image should be as close as possible to its actual target position; and the image features of the newly added image should be as close as possible to its target text features (that is, the similarity between them should be as large as possible).
The embodiments of the present application do not limit how S109 is implemented. For example, S109 may specifically include S1091 to S1094:
S1091: Determine a historical image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image features of the historical sample image and the target text features of the historical sample image.
S1092: Determine a new image loss value according to the predicted target position of the newly added image, the actual target position of the newly added image, and the similarity between the image features of the newly added image and the target text features of the newly added image.
S1093: Compute a weighted sum of the historical image loss value and the new image loss value to obtain the detection loss value of the target detection model, where the weight assigned to the historical image loss value is higher than the weight assigned to the new image loss value.
It should be noted that, for S1091 to S1093, refer to steps 31 to 33 above.
S1094: Update the target detection model according to its detection loss value.
It should be noted that the embodiments of the present application do not limit how S1094 is implemented; any existing method of updating a model according to a loss value may be used.
Based on S106 to S109 above, in the target detection model training method provided by the embodiments of the present application, if a new object detection capability needs to be added to a trained target detection model, the newly added images and their label information can be used to perform category incremental learning on the model, so that the learned model gains the ability to detect targets in the newly added images while retaining its original target detection capabilities. This helps continually improve the target detection performance of the model.
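The S107–S109 loop can be sketched as follows. `DummyModel` is a hypothetical stand-in (its `forward`, `detection_loss`, and `update` methods are assumed interfaces, not part of the embodiment), and the simple loss threshold stands in for the second stop condition.

```python
class DummyModel:
    """Minimal stand-in for the target detection model (hypothetical)."""
    def __init__(self):
        self.loss = 1.0
        self.updates = 0

    def forward(self, images):
        return images  # placeholder predictions

    def detection_loss(self, preds_hist, preds_new):
        return self.loss

    def update(self, loss):
        self.loss *= 0.5  # pretend each update halves the loss
        self.updates += 1

def incremental_learning(model, exemplars, new_images,
                         max_updates=1000, loss_threshold=0.1):
    """S107-S109 sketch: forward passes on historical sample images and
    newly added images, stop check (S108), then an update (S109)."""
    for _ in range(max_updates):
        preds_hist = model.forward(exemplars)
        preds_new = model.forward(new_images)
        loss = model.detection_loss(preds_hist, preds_new)
        if loss < loss_threshold:  # second stop condition (S108)
            break
        model.update(loss)         # S109, then return to S107
    return model
```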
Method Embodiment Three
To further improve the target detection performance of the target detection model, an embodiment of the present application also provides a possible implementation of the target detection model training method, which specifically includes steps 41 to 45:
Step 41: Obtain a sample image, the actual target text identifier of the sample image, and the actual target position of the sample image.
Step 42: Perform text feature extraction on the actual target text identifier of the sample image to obtain the target text features of the sample image.
It should be noted that, for steps 41 and 42, refer to S101 and S102 above.
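The text feature extraction of step 42 maps an identifier string to a fixed-length vector. The embodiment does not specify the encoder, so the hash-based placeholder below only illustrates the interface; a real system would use a trained text encoder.

```python
import hashlib

def extract_text_feature(identifier: str, dim: int = 8):
    """Deterministic placeholder text encoder: maps a target text
    identifier to a fixed-length feature vector in [0, 1]. Only the
    interface (string in, vector out) reflects step 42; the hashing
    itself is an assumption made for this sketch."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]
```

The encoder is deterministic, so the same identifier always yields the same target text features, and distinct identifiers yield distinct features.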
Step 43: Input the sample image into the target detection model to obtain the image features of the sample image, the predicted target text identifier of the sample image, and the predicted target position of the sample image output by the model.
The predicted target text identifier of the sample image represents the predicted identifier (for example, the predicted category) of the target object in the sample image.
It should be noted that step 43 may be implemented using any implementation of S103 above, with the output data of the target detection model in S103 changed from "the image features of the sample image and the predicted target position of the sample image" to "the image features of the sample image, the predicted target text identifier of the sample image, and the predicted target position of the sample image".
Step 44: Determine whether the first stop condition is met; if so, perform the preset action; if not, perform step 45.
It should be noted that, for step 44, refer to S104 above. In addition, the "prediction loss value of the target detection model" in step 44 is computed from the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image.
Step 45: Update the target detection model according to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image, and return to step 43.
It should be noted that step 45 may be implemented using any implementation of S105 above, with "the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image" in that implementation replaced by "the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image".
That is, the update process of the target detection model in step 45 is performed according to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image features of the sample image and the target text features of the sample image.
Based on steps 41 to 45 above, in the target detection model training method provided by this embodiment, text feature extraction can first be performed on the actual target text identifier of the sample image to obtain the target text features of the sample image; then the sample image, its target text features, its actual target text identifier, and its actual target position are used to train the target detection model. Because the model is trained under the constraints of three kinds of label information (the target text features, the actual target text identifier, and the actual target position), the trained model has a better target detection capability, which helps improve target detection performance.
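The three-term loss implied by steps 44 and 45 can be sketched as below. The embodiment does not fix the individual losses; softmax cross-entropy over candidate identifiers, an L1 box term, and a cosine-similarity feature term are illustrative choices made here.

```python
import math

def prediction_loss(pred_scores, true_idx, pred_box, true_box,
                    image_feat, text_feat):
    """Illustrative three-term loss: a classification term on the
    predicted target text identifier, a box-regression term, and an
    image-text similarity term (all choices are assumptions)."""
    # classification: negative log-likelihood of the actual identifier
    exp_scores = [math.exp(s) for s in pred_scores]
    cls_loss = -math.log(exp_scores[true_idx] / sum(exp_scores))
    # box term: mean absolute error between predicted and actual box
    box_loss = sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(pred_box)
    # feature term: 1 - cosine similarity of image and text features
    dot = sum(a * b for a, b in zip(image_feat, text_feat))
    norm = (math.sqrt(sum(a * a for a in image_feat))
            * math.sqrt(sum(b * b for b in text_feat)))
    sim_loss = 1.0 - dot / norm
    return cls_loss + box_loss + sim_loss
```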
Method Embodiment Four
To further improve the prediction performance of the target detection model, an embodiment of the present application also provides a possible implementation of the target detection model training method. In this implementation, the method includes, in addition to steps 41 to 45 above, steps 46 to 49:
Step 46: After obtaining a newly added image, the actual target text identifier of the newly added image, and the actual target position of the newly added image, perform text feature extraction on the actual target text identifier of the newly added image to obtain the target text features of the newly added image.
It should be noted that, for step 46, refer to S106 above.
Step 47: Input the historical sample image and the newly added image into the target detection model to obtain the image features, predicted target text identifier, and predicted target position of the historical sample image, and the image features, predicted target text identifier, and predicted target position of the newly added image, output by the model.
The predicted target text identifier of the historical sample image represents the predicted identifier (for example, the predicted category) of the target object in the historical sample image.
The predicted target text identifier of the newly added image represents the predicted identifier (for example, the predicted category) of the target object in the newly added image.
It should be noted that step 47 may be implemented using any implementation of S107 above, with the output data of the target detection model in S107 changed from "the image features and predicted target position of the historical sample image, and the image features and predicted target position of the newly added image" to "the image features, predicted target text identifier, and predicted target position of the historical sample image, and the image features, predicted target text identifier, and predicted target position of the newly added image".
Step 48: Determine whether the second stop condition is met; if so, perform the preset steps; if not, perform step 49.
It should be noted that, for step 48, refer to S108 above. In addition, the "detection loss value of the target detection model" in step 48 is computed from the predicted target text identifier of the historical sample image, the actual target text identifier of the historical sample image, the predicted target position of the historical sample image, the actual target position of the historical sample image, the predicted target text identifier of the newly added image, the actual target text identifier of the newly added image, the predicted target position of the newly added image, the actual target position of the newly added image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, and the similarity between the image features of the newly added image and the target text features of the newly added image.
Step 49: Update the target detection model according to the predicted target text identifier of the historical sample image, the actual target text identifier of the historical sample image, the predicted target position of the historical sample image, the actual target position of the historical sample image, the predicted target text identifier of the newly added image, the actual target text identifier of the newly added image, the predicted target position of the newly added image, the actual target position of the newly added image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, and the similarity between the image features of the newly added image and the target text features of the newly added image, and return to step 47.
It should be noted that step 49 may be implemented using any implementation of S109 above, with "the predicted target position of the historical sample image, the actual target position of the historical sample image, the predicted target position of the newly added image, the actual target position of the newly added image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, and the similarity between the image features of the newly added image and the target text features of the newly added image" in that implementation replaced by "the predicted target text identifier of the historical sample image, the actual target text identifier of the historical sample image, the predicted target position of the historical sample image, the actual target position of the historical sample image, the predicted target text identifier of the newly added image, the actual target text identifier of the newly added image, the predicted target position of the newly added image, the actual target position of the newly added image, the similarity between the image features of the historical sample image and the target text features of the historical sample image, and the similarity between the image features of the newly added image and the target text features of the newly added image".
Based on steps 46 to 49 above, in the target detection model training method provided by this embodiment, if a new object detection capability needs to be added to a trained target detection model, the newly added images and their three kinds of label information (that is, the target text features, the actual target text identifier, and the actual target position) can be used to perform incremental learning on the model, so that the learned model gains the ability to detect targets in the newly added images while retaining its original target detection capabilities. This helps continually improve the target detection performance of the model.
After the target detection model is trained, it can be used for target detection. Accordingly, an embodiment of the present application also provides a target detection method, which is described below with reference to the accompanying drawings.
Method Embodiment Five
Refer to Fig. 3, which is a flowchart of a target detection method provided by an embodiment of the present application.
The target detection method provided by the embodiment of the present application includes S301 and S302:
S301: Obtain an image to be detected.
The image to be detected is an image on which target detection processing needs to be performed.
S302: Input the image to be detected into a pre-trained target detection model to obtain the target detection result of the image to be detected output by the model.
The target detection model is trained using any implementation of the target detection model training method provided by the embodiments of the present application.
The target detection result of the image to be detected is obtained by the target detection model performing target detection on that image. The embodiments of the present application do not limit the form of the result; for example, it may include the predicted target text identifier (for example, the predicted target category) of the target object in the image and/or the region occupied by the target object within the image.
Based on S301 and S302 above, after the image to be detected is obtained, the trained target detection model can perform target detection on it and output a target detection result that accurately represents the relevant information of the target object in the image (for example, target category information and target position information). Because the trained target detection model has good target detection performance, the target detection result determined with it is more accurate, which helps improve target detection accuracy.
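The inference path of S301–S302 reduces to a single forward pass, sketched below. `TrainedDetector` and the result dictionary keys are hypothetical stand-ins for the pre-trained model and its output format, which the embodiment leaves open.

```python
class TrainedDetector:
    """Hypothetical stand-in for a pre-trained target detection model."""
    def forward(self, image):
        # pretend the model finds one object of category "cat"
        return ("cat", (10, 20, 50, 60))

def detect(model, image):
    """S301-S302: input the image to be detected into the pre-trained
    target detection model and return its target detection result
    (predicted target text identifier and/or occupied region)."""
    text_id, region = model.forward(image)
    return {"target_text_id": text_id, "target_region": region}
```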
基于上述方法实施例提供的目标检测模型训练方法,本申请实施例还提供了一种目标检测模型训练装置,下面结合附图进行解释和说明。Based on the target detection model training method provided by the above method embodiment, the embodiment of the present application also provides a target detection model training device, which will be explained and described below with reference to the accompanying drawings.
装置实施例一Device embodiment one
装置实施例一提供的目标检测模型训练装置的技术详情,请参照上述方法实施例。For the technical details of the target detection model training device provided in the first device embodiment, please refer to the above method embodiment.
参见图4,该图为本申请实施例提供的一种目标检测模型训练装置的结构示意图。Referring to FIG. 4 , this figure is a schematic structural diagram of a target detection model training device provided by an embodiment of the present application.
本申请实施例提供的目标检测模型训练装置400,包括:The target detection model training device 400 provided in the embodiment of the present application includes:
第一获取单元401,用于获取样本图像、所述样本图像的实际目标文本标识和所述样本图像的实际目标位置;A first acquiring unit 401, configured to acquire a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image;
第一提取单元402,用于对所述样本图像的实际目标文本标识进行文本特征提取,得到所述样本图像的目标文本特征;The first extraction unit 402 is configured to perform text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image;
第一预测单元403,用于将所述样本图像输入目标检测模型,得到所述目标检测模型输出的所述样本图像的图像特征和所述样本图像的预测目标位置;The first prediction unit 403 is configured to input the sample image into the target detection model, and obtain the image features of the sample image output by the target detection model and the predicted target position of the sample image;
第一更新单元404，用于根据所述样本图像的预测目标位置、所述样本图像的实际目标位置、以及所述样本图像的图像特征与所述样本图像的目标文本特征之间的相似度，更新所述目标检测模型，并返回所述第一预测单元403执行所述将所述样本图像输入目标检测模型，直至达到第一停止条件。The first updating unit 404 is configured to update the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and to return to the first prediction unit 403 to execute the inputting of the sample image into the target detection model until a first stop condition is reached.
在一种可能的实施方式中,所述第一提取单元402,具体用于:In a possible implementation manner, the first extraction unit 402 is specifically configured to:
将所述样本图像的实际目标文本标识输入预先训练的语言模型，得到所述语言模型输出的所述样本图像的目标文本特征；其中，所述语言模型是根据样本文本和所述样本文本的实际文本特征进行训练的。Inputting the actual target text identifier of the sample image into a pre-trained language model to obtain the target text feature of the sample image output by the language model, wherein the language model is trained on sample text and the actual text features of the sample text.
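The text-feature extraction performed by the first extraction unit can be sketched in Python. This is a minimal illustration only: `extract_text_feature` below is a hypothetical stand-in (a deterministic hash embedding) for the pre-trained language model the embodiment describes; the patent does not specify the encoder or the feature dimension.

```python
import hashlib

def extract_text_feature(label, dim=8):
    """Hypothetical stand-in for the pre-trained language model: maps a
    target text identifier (e.g. a category name) to a fixed-length
    feature vector. A real implementation would run a trained text
    encoder; a hash is used here only to keep the sketch self-contained."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    # Scale each byte into [0, 1) so the vector resembles a dense embedding.
    return [b / 256.0 for b in digest[:dim]]
```

Identical identifiers always map to identical features, which is the property the training step relies on when comparing image features against target text features.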
在一种可能的实施方式中,所述目标检测模型训练装置400还包括:In a possible implementation manner, the target detection model training device 400 further includes:
第二提取单元，用于在达到第一停止条件且获取到新增图像、所述新增图像的实际目标文本标识和所述新增图像的实际目标位置之后，对所述新增图像的实际目标文本标识进行文本特征提取，得到所述新增图像的目标文本特征；The second extraction unit is configured to, after the first stop condition is reached and the added image, the actual target text identifier of the added image, and the actual target position of the added image are acquired, perform text feature extraction on the actual target text identifier of the added image to obtain the target text feature of the added image;
第二预测单元，用于将历史样例图像和所述新增图像输入目标检测模型，得到所述目标检测模型输出的所述历史样例图像的图像特征、所述历史样例图像的预测目标位置、所述新增图像的图像特征和所述新增图像的预测目标位置；其中，所述历史样例图像是根据所述样本图像确定的；The second prediction unit is configured to input the historical sample image and the added image into the target detection model to obtain the image feature of the historical sample image, the predicted target position of the historical sample image, the image feature of the added image, and the predicted target position of the added image output by the target detection model, wherein the historical sample image is determined according to the sample image;
第二更新单元，用于根据所述历史样例图像的预测目标位置、所述历史样例图像的实际目标位置、所述历史样例图像的图像特征与所述历史样例图像的目标文本特征之间的相似度、所述新增图像的预测目标位置、所述新增图像的实际目标位置、以及所述新增图像的图像特征与所述新增图像的目标文本特征之间的相似度，更新所述目标检测模型，并返回所述第二预测单元执行所述将所述历史样例图像和所述新增图像输入目标检测模型，直至达到第二停止条件。The second updating unit is configured to update the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image, and to return to the second prediction unit to execute the inputting of the historical sample image and the added image into the target detection model until a second stop condition is reached.
在一种可能的实施方式中,所述历史样例图像的确定过程,包括:In a possible implementation manner, the process of determining the historical sample image includes:
根据所述样本图像,确定所述目标检测模型对应的训练已使用图像;According to the sample image, determine the training used image corresponding to the target detection model;
根据所述训练已使用图像的实际目标文本标识，确定至少一个历史目标类别；determining at least one historical target category according to the actual target text identifiers of the training used images;
根据所述训练已使用图像的实际目标文本标识,从所述目标检测模型对应的训练已使用图像中确定属于各个历史目标类别的训练已使用图像;According to the actual target text identification of the training used image, determine the training used image belonging to each historical target category from the training used image corresponding to the target detection model;
分别从所述属于各个历史目标类别的训练已使用图像中抽取所述各个历史目标类别对应的历史样例图像。The historical sample images corresponding to the respective historical object categories are respectively extracted from the training used images belonging to the various historical object categories.
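The four-step determination of historical sample images above can be sketched as follows. The names (`select_exemplars`, `per_class`) and the fixed random seed are illustrative assumptions; the embodiment does not specify how many exemplars are drawn per historical target category or how the sampling is performed.

```python
import random
from collections import defaultdict

def select_exemplars(used_images, labels, per_class=2, seed=0):
    """Sketch of the historical-sample determination: group the training-used
    images by their actual target text identifier (the historical target
    category), then draw a fixed number of exemplars from each category."""
    by_class = defaultdict(list)
    for image, label in zip(used_images, labels):
        by_class[label].append(image)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return {label: rng.sample(images, min(per_class, len(images)))
            for label, images in by_class.items()}
```

The returned mapping gives, for each historical target category, the exemplar images that are later replayed alongside the added images during the continued training.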
在一种可能的实施方式中,所述第二更新单元,包括:In a possible implementation manner, the second updating unit includes:
第一确定子单元，用于根据所述历史样例图像的预测目标位置、所述历史样例图像的实际目标位置、以及所述历史样例图像的图像特征与所述历史样例图像的目标文本特征之间的相似度，确定历史图像损失值；The first determining subunit is configured to determine a historical image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image feature of the historical sample image and the target text feature of the historical sample image;
第二确定子单元，用于根据所述新增图像的预测目标位置、所述新增图像的实际目标位置、以及所述新增图像的图像特征与所述新增图像的目标文本特征之间的相似度，确定新增图像损失值；The second determining subunit is configured to determine an added-image loss value according to the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image;
第三确定子单元，用于将所述历史图像损失值和所述新增图像损失值进行加权求和，得到所述目标检测模型的检测损失值；其中，所述历史图像损失值对应的加权权重高于所述新增图像损失值对应的加权权重；The third determining subunit is configured to perform a weighted summation of the historical image loss value and the added-image loss value to obtain the detection loss value of the target detection model, wherein the weight corresponding to the historical image loss value is higher than the weight corresponding to the added-image loss value;
模型更新子单元,用于根据所述目标检测模型的检测损失值,更新所述目标检测模型。The model update subunit is configured to update the target detection model according to the detection loss value of the target detection model.
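A minimal sketch of the weighted summation performed by the third determining subunit. The 0.7/0.3 split is an illustrative assumption; the embodiment only requires that the historical weight be the higher of the two, which biases each update toward retaining performance on previously learned categories.

```python
def detection_loss(historical_loss, added_loss, w_hist=0.7, w_new=0.3):
    """Weighted summation of the historical-image loss and the added-image
    loss; the historical weight must exceed the added-image weight per the
    embodiment. Returns the detection loss value used to update the model."""
    assert w_hist > w_new, "historical weight must be the higher one"
    return w_hist * historical_loss + w_new * added_loss
```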
在一种可能的实施方式中,所述第一预测单元403,具体用于:In a possible implementation manner, the first prediction unit 403 is specifically configured to:
将所述样本图像输入目标检测模型,得到所述目标检测模型输出的所述样本图像的图像特征、所述样本图像的预测目标文本标识和所述样本图像的预测目标位置;Inputting the sample image into the target detection model to obtain the image features of the sample image output by the target detection model, the predicted target text identifier of the sample image, and the predicted target position of the sample image;
所述第一更新单元404,具体用于:The first updating unit 404 is specifically used for:
根据所述样本图像的预测目标文本标识、所述样本图像的实际目标文本标识、所述样本图像的预测目标位置、所述样本图像的实际目标位置、以及所述样本图像的图像特征与所述样本图像的目标文本特征之间的相似度，更新所述目标检测模型，并返回所述第一预测单元403执行所述将所述样本图像输入目标检测模型，直至达到第一停止条件。updating the target detection model according to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and returning to the first prediction unit 403 to execute the inputting of the sample image into the target detection model until the first stop condition is reached.
基于上述目标检测模型训练装置400的相关内容可知，对于目标检测模型训练装置400来说，先对样本图像的实际目标文本标识进行文本特征提取，得到该样本图像的目标文本特征；再利用该样本图像、该样本图像的目标文本特征和该样本图像的实际目标位置对目标检测模型进行训练，得到训练好的目标检测模型。其中，因样本图像的目标文本特征能够更准确地表示出该样本图像的实际目标文本标识，使得基于该样本图像的目标文本特征训练好的目标检测模型具有更好的目标检测功能，如此有利于提高目标检测性能。Based on the above content of the target detection model training device 400, the device first performs text feature extraction on the actual target text identifier of the sample image to obtain the target text feature of the sample image, and then trains the target detection model using the sample image, the target text feature of the sample image, and the actual target position of the sample image to obtain a trained target detection model. Because the target text feature of the sample image can more accurately represent the actual target text identifier of the sample image, a target detection model trained on that feature has a better target detection function, which helps improve target detection performance.
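The training objective described above combines a position term with an image-text similarity term. The sketch below illustrates one way this could look; the concrete forms (an L1 position error, a cosine similarity, and the `alpha`/`beta` weights) are assumptions for illustration, since the patent does not fix either term.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def training_loss(pred_box, true_box, image_feature, text_feature,
                  alpha=1.0, beta=1.0):
    """Combines a position term (assumed L1 here) with a term that grows as
    the image feature drifts from the target text feature, so minimising the
    loss both improves localisation and aligns the two modalities."""
    position_term = sum(abs(p - t) for p, t in zip(pred_box, true_box))
    alignment_term = 1.0 - cosine_similarity(image_feature, text_feature)
    return alpha * position_term + beta * alignment_term
```

When the predicted box matches the actual box and the image feature points in the same direction as the target text feature, the loss is zero; either a localisation error or an image-text mismatch raises it.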
基于上述方法实施例提供的目标检测方法,本申请实施例还提供了一种目标检测装置,下面结合附图进行解释和说明。Based on the target detection method provided by the above method embodiment, the embodiment of the present application also provides a target detection device, which will be explained and described below with reference to the accompanying drawings.
装置实施例二Device embodiment two
装置实施例二提供的目标检测装置的技术详情,请参照上述方法实施例。For the technical details of the target detection device provided in the second embodiment of the device, please refer to the above method embodiment.
参见图5,该图为本申请实施例提供的一种目标检测装置的结构示意图。Referring to FIG. 5 , this figure is a schematic structural diagram of a target detection device provided by an embodiment of the present application.
本申请实施例提供的目标检测装置500,包括:The target detection device 500 provided in the embodiment of the present application includes:
第二获取单元501,用于获取待检测图像;A second acquiring unit 501, configured to acquire an image to be detected;
目标检测单元502，用于将所述待检测图像输入预先训练的目标检测模型，得到所述目标检测模型输出的所述待检测图像的目标检测结果；其中，所述目标检测模型是利用本申请实施例提供的目标检测模型训练方法的任一实施方式进行训练的。The target detection unit 502 is configured to input the image to be detected into a pre-trained target detection model to obtain the target detection result of the image to be detected output by the target detection model, wherein the target detection model is trained using any implementation of the target detection model training method provided in the embodiments of the present application.
基于上述目标检测装置500的相关内容可知，对于目标检测装置500来说，在获取到待检测图像之后，可以利用已训练好的目标检测模型针对该待检测图像进行目标检测，得到并输出该待检测图像的目标检测结果，以使该待检测图像的目标检测结果能够准确地表示出该待检测图像中目标物体的相关信息（如，目标类别信息以及目标位置信息等）。其中，因已训练好的目标检测模型具有较好的目标检测性能，使得利用该目标检测模型确定的待检测图像的目标检测结果更准确，如此有利于提高目标检测准确性。Based on the above content of the target detection device 500, after acquiring the image to be detected, the device can use the trained target detection model to perform target detection on it, and obtain and output the target detection result of the image, so that the result accurately represents the relevant information of the target object in the image (e.g., target category information and target position information). Since the trained target detection model has good detection performance, the target detection result determined with it is more accurate, which helps improve target detection accuracy.
进一步地,本申请实施例还提供了一种设备,所述设备包括处理器以及存储器:Further, the embodiment of the present application also provides a device, the device includes a processor and a memory:
所述存储器用于存储计算机程序;The memory is used to store computer programs;
所述处理器用于根据所述计算机程序执行本申请实施例提供的目标检测模型训练方法的任一实施方式,或者执行本申请实施例提供的目标检测方法的任一实施方式。The processor is configured to execute any implementation of the target detection model training method provided in the embodiments of the present application according to the computer program, or execute any implementation of the target detection method provided in the embodiments of the present application.
进一步地，本申请实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质用于存储计算机程序，所述计算机程序用于执行本申请实施例提供的目标检测模型训练方法的任一实施方式，或者执行本申请实施例提供的目标检测方法的任一实施方式。Further, an embodiment of the present application also provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute any implementation of the target detection model training method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
进一步地,本申请实施例还提供了一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行本申请实施例提供的目标检测模型训练方法的任一实施方式,或者执行本申请实施例提供的目标检测方法的任一实施方式。Furthermore, the embodiment of the present application also provides a computer program product, which, when running on the terminal device, enables the terminal device to execute any implementation manner of the target detection model training method provided in the embodiment of the present application , or execute any implementation of the target detection method provided in the embodiment of the present application.
应当理解，在本申请中，“至少一个(项)”是指一个或者多个，“多个”是指两个或两个以上。“和/或”，用于描述关联对象的关联关系，表示可以存在三种关系，例如，“A和/或B”可以表示：只存在A，只存在B以及同时存在A和B三种情况，其中A，B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达，是指这些项中的任意组合，包括单项(个)或复数项(个)的任意组合。例如，a，b或c中的至少一项(个)，可以表示：a，b，c，“a和b”，“a和c”，“b和c”，或“a和b和c”，其中a，b，c可以是单个，也可以是多个。It should be understood that in this application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or multiple.
以上所述，仅是本发明的较佳实施例而已，并非对本发明作任何形式上的限制。虽然本发明已以较佳实施例揭露如上，然而并非用以限定本发明。任何熟悉本领域的技术人员，在不脱离本发明技术方案范围情况下，都可利用上述揭示的方法和技术内容对本发明技术方案做出许多可能的变动和修饰，或修改为等同变化的等效实施例。因此，凡是未脱离本发明技术方案的内容，依据本发明的技术实质对以上实施例所做的任何简单修改、等同变化及修饰，均仍属于本发明技术方案保护的范围内。The above descriptions are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above with preferred embodiments, they are not intended to limit it. Without departing from the scope of the technical solution of the present invention, any person skilled in the art may use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments of equivalent changes. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (12)

  1. 一种目标检测模型训练方法,其特征在于,所述方法包括:A method for training a target detection model, characterized in that the method comprises:
    获取样本图像、所述样本图像的实际目标文本标识和所述样本图像的实际目标位置;acquiring a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image;
    对所述样本图像的实际目标文本标识进行文本特征提取,得到所述样本图像的目标文本特征;Carry out text feature extraction to the actual target text mark of described sample image, obtain the target text feature of described sample image;
    将所述样本图像输入目标检测模型,得到所述目标检测模型输出的所述样本图像的图像特征和所述样本图像的预测目标位置;Inputting the sample image into a target detection model to obtain the image features of the sample image output by the target detection model and the predicted target position of the sample image;
    根据所述样本图像的预测目标位置、所述样本图像的实际目标位置、以及所述样本图像的图像特征与所述样本图像的目标文本特征之间的相似度，更新所述目标检测模型，并继续执行所述将所述样本图像输入目标检测模型的步骤，直至达到第一停止条件。updating the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and continuing to execute the step of inputting the sample image into the target detection model until a first stop condition is reached.
  2. 根据权利要求1所述的方法,其特征在于,所述对所述样本图像的实际目标文本标识进行文本特征提取,得到所述样本图像的目标文本特征,包括:The method according to claim 1, wherein said extracting the text features of the actual target text identifier of the sample image to obtain the target text features of the sample image comprises:
    将所述样本图像的实际目标文本标识输入预先训练的语言模型,得到所述语言模型输出的所述样本图像的目标文本特征。Inputting the actual target text identifier of the sample image into the pre-trained language model to obtain the target text features of the sample image output by the language model.
  3. 根据权利要求1所述的方法,其特征在于,在达到第一停止条件之后,所述方法还包括:The method according to claim 1, wherein after reaching the first stop condition, the method further comprises:
    在获取到新增图像、所述新增图像的实际目标文本标识和所述新增图像的实际目标位置之后，对所述新增图像的实际目标文本标识进行文本特征提取，得到所述新增图像的目标文本特征；所述新增图像的实际目标文本标识不同于所述样本图像的实际目标文本标识；after acquiring the added image, the actual target text identifier of the added image, and the actual target position of the added image, performing text feature extraction on the actual target text identifier of the added image to obtain the target text feature of the added image, where the actual target text identifier of the added image is different from that of the sample image;
    将历史样例图像和所述新增图像输入目标检测模型，得到所述目标检测模型输出的所述历史样例图像的图像特征、所述历史样例图像的预测目标位置、所述新增图像的图像特征和所述新增图像的预测目标位置；其中，所述历史样例图像是根据所述样本图像确定的；inputting the historical sample image and the added image into the target detection model to obtain the image feature of the historical sample image, the predicted target position of the historical sample image, the image feature of the added image, and the predicted target position of the added image output by the target detection model, wherein the historical sample image is determined according to the sample image;
    根据所述历史样例图像的预测目标位置、所述历史样例图像的实际目标位置、所述历史样例图像的图像特征与所述历史样例图像的目标文本特征之间的相似度、所述新增图像的预测目标位置、所述新增图像的实际目标位置、以及所述新增图像的图像特征与所述新增图像的目标文本特征之间的相似度，更新所述目标检测模型，并继续执行所述将所述历史样例图像和所述新增图像输入目标检测模型的步骤，直至达到第二停止条件。updating the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image, and continuing to execute the step of inputting the historical sample image and the added image into the target detection model until a second stop condition is reached.
  4. 根据权利要求3所述的方法,其特征在于,所述历史样例图像的确定过程,包括:The method according to claim 3, wherein the determination process of the historical sample image comprises:
    根据所述样本图像,确定所述目标检测模型对应的训练已使用图像;According to the sample image, determine the training used image corresponding to the target detection model;
    根据所述训练已使用图像的实际目标文本标识，确定至少一个历史目标类别；determining at least one historical target category according to the actual target text identifiers of the training used images;
    根据所述训练已使用图像的实际目标文本标识,从所述目标检测模型对应的训练已使用图像中确定属于各个历史目标类别的训练已使用图像;According to the actual target text identification of the training used image, determine the training used image belonging to each historical target category from the training used image corresponding to the target detection model;
    分别从所述属于各个历史目标类别的训练已使用图像中抽取所述各个历史目标类别对应的历史样例图像。The historical sample images corresponding to the respective historical object categories are respectively extracted from the training used images belonging to the various historical object categories.
  5. 根据权利要求3所述的方法，其特征在于，所述根据所述历史样例图像的预测目标位置、所述历史样例图像的实际目标位置、所述历史样例图像的图像特征与所述历史样例图像的目标文本特征之间的相似度、所述新增图像的预测目标位置、所述新增图像的实际目标位置、以及所述新增图像的图像特征与所述新增图像的目标文本特征之间的相似度，更新所述目标检测模型，包括：The method according to claim 3, wherein the updating the target detection model according to the predicted target position of the historical sample image, the actual target position of the historical sample image, the similarity between the image feature of the historical sample image and the target text feature of the historical sample image, the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image comprises:
    根据所述历史样例图像的预测目标位置、所述历史样例图像的实际目标位置、以及所述历史样例图像的图像特征与所述历史样例图像的目标文本特征之间的相似度，确定历史图像损失值；determining a historical image loss value according to the predicted target position of the historical sample image, the actual target position of the historical sample image, and the similarity between the image feature of the historical sample image and the target text feature of the historical sample image;
    根据所述新增图像的预测目标位置、所述新增图像的实际目标位置、以及所述新增图像的图像特征与所述新增图像的目标文本特征之间的相似度，确定新增图像损失值；determining an added-image loss value according to the predicted target position of the added image, the actual target position of the added image, and the similarity between the image feature of the added image and the target text feature of the added image;
    将所述历史图像损失值和所述新增图像损失值进行加权求和，得到所述目标检测模型的检测损失值；其中，所述历史图像损失值对应的加权权重高于所述新增图像损失值对应的加权权重；performing a weighted summation of the historical image loss value and the added-image loss value to obtain the detection loss value of the target detection model, wherein the weight corresponding to the historical image loss value is higher than the weight corresponding to the added-image loss value;
    根据所述目标检测模型的检测损失值,更新所述目标检测模型。The target detection model is updated according to the detection loss value of the target detection model.
  6. 根据权利要求1所述的方法，其特征在于，所述将所述样本图像输入目标检测模型，得到所述目标检测模型输出的所述样本图像的图像特征和所述样本图像的预测目标位置，包括：The method according to claim 1, wherein the inputting the sample image into the target detection model to obtain the image feature of the sample image and the predicted target position of the sample image output by the target detection model comprises:
    将所述样本图像输入目标检测模型，得到所述目标检测模型输出的所述样本图像的图像特征、所述样本图像的预测目标文本标识和所述样本图像的预测目标位置；inputting the sample image into the target detection model to obtain the image feature of the sample image, the predicted target text identifier of the sample image, and the predicted target position of the sample image output by the target detection model;
    所述根据所述样本图像的预测目标位置、所述样本图像的实际目标位置、以及所述样本图像的图像特征与所述样本图像的目标文本特征之间的相似度，更新所述目标检测模型，包括：the updating the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image comprises:
    根据所述样本图像的预测目标文本标识、所述样本图像的实际目标文本标识、所述样本图像的预测目标位置、所述样本图像的实际目标位置、以及所述样本图像的图像特征与所述样本图像的目标文本特征之间的相似度，更新所述目标检测模型。updating the target detection model according to the predicted target text identifier of the sample image, the actual target text identifier of the sample image, the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image.
  7. 一种目标检测方法,其特征在于,所述方法包括:A target detection method, characterized in that the method comprises:
    获取待检测图像;Obtain the image to be detected;
    将所述待检测图像输入预先训练的目标检测模型，得到所述目标检测模型输出的所述待检测图像的目标检测结果；其中，所述目标检测模型是利用权利要求1-6任一项所述的目标检测模型训练方法进行训练的。inputting the image to be detected into a pre-trained target detection model to obtain the target detection result of the image to be detected output by the target detection model, wherein the target detection model is trained using the target detection model training method according to any one of claims 1-6.
  8. 一种目标检测模型训练装置,其特征在于,所述装置包括:A target detection model training device, characterized in that the device comprises:
    第一获取单元,用于获取样本图像、所述样本图像的实际目标文本标识和所述样本图像的实际目标位置;A first acquiring unit, configured to acquire a sample image, an actual target text identifier of the sample image, and an actual target position of the sample image;
    第一提取单元,用于对所述样本图像的实际目标文本标识进行文本特征提取,得到所述样本图像的目标文本特征;The first extraction unit is used to extract the text features of the actual target text identifier of the sample image to obtain the target text features of the sample image;
    第一预测单元,用于将所述样本图像输入目标检测模型,得到所述目标检测模型输出的所述样本图像的图像特征和所述样本图像的预测目标位置;a first prediction unit, configured to input the sample image into a target detection model, and obtain the image features of the sample image output by the target detection model and the predicted target position of the sample image;
    第一更新单元，用于根据所述样本图像的预测目标位置、所述样本图像的实际目标位置、以及所述样本图像的图像特征与所述样本图像的目标文本特征之间的相似度，更新所述目标检测模型，并返回所述第一预测单元执行所述将所述样本图像输入目标检测模型，直至达到第一停止条件。a first updating unit, configured to update the target detection model according to the predicted target position of the sample image, the actual target position of the sample image, and the similarity between the image feature of the sample image and the target text feature of the sample image, and to return to the first prediction unit to execute the inputting of the sample image into the target detection model until a first stop condition is reached.
  9. 一种目标检测装置,其特征在于,所述装置包括:A target detection device, characterized in that the device comprises:
    第二获取单元,用于获取待检测图像;a second acquiring unit, configured to acquire an image to be detected;
    目标检测单元，用于将所述待检测图像输入预先训练的目标检测模型，得到所述目标检测模型输出的所述待检测图像的目标检测结果；其中，所述目标检测模型是利用权利要求1-6任一项所述的目标检测模型训练方法进行训练的。a target detection unit, configured to input the image to be detected into a pre-trained target detection model to obtain the target detection result of the image to be detected output by the target detection model, wherein the target detection model is trained using the target detection model training method according to any one of claims 1-6.
  10. 一种设备,其特征在于,所述设备包括处理器以及存储器:A device, characterized in that the device includes a processor and a memory:
    所述存储器用于存储计算机程序;The memory is used to store computer programs;
    所述处理器用于根据所述计算机程序执行权利要求1-6中任一项所述的目标检测模型训练方法,或者执行权利要求7所述的目标检测方法。The processor is configured to execute the target detection model training method according to any one of claims 1-6, or execute the target detection method according to claim 7 according to the computer program.
  11. 一种计算机可读存储介质，其特征在于，所述计算机可读存储介质用于存储计算机程序，所述计算机程序用于执行权利要求1-6中任一项所述的目标检测模型训练方法，或者执行权利要求7所述的目标检测方法。A computer-readable storage medium, wherein the computer-readable storage medium is configured to store a computer program, and the computer program is used to execute the target detection model training method according to any one of claims 1-6, or to execute the target detection method according to claim 7.
  12. 一种计算机程序产品，其特征在于，所述计算机程序产品在终端设备上运行时，使得所述终端设备执行权利要求1-6中任一项所述的目标检测模型训练方法，或者执行权利要求7所述的目标检测方法。A computer program product, wherein, when the computer program product runs on a terminal device, the terminal device is caused to execute the target detection model training method according to any one of claims 1-6, or to execute the target detection method according to claim 7.
PCT/CN2022/089194 2021-06-28 2022-04-26 Target detection model training method and target detection method, and related device therefor WO2023273570A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110723057.4 2021-06-28
CN202110723057.4A CN113469176B (en) 2021-06-28 2021-06-28 Target detection model training method, target detection method and related equipment thereof

Publications (1)

Publication Number Publication Date
WO2023273570A1 true WO2023273570A1 (en) 2023-01-05

Family

ID=77873458

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089194 WO2023273570A1 (en) 2021-06-28 2022-04-26 Target detection model training method and target detection method, and related device therefor

Country Status (2)

Country Link
CN (1) CN113469176B (en)
WO (1) WO2023273570A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469176B (en) * 2021-06-28 2023-06-02 北京有竹居网络技术有限公司 Target detection model training method, target detection method and related equipment thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109191453A (en) * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for generating image category detection model
CN111860573A (en) * 2020-06-04 2020-10-30 北京迈格威科技有限公司 Model training method, image class detection method and device and electronic equipment
CN112560999A (en) * 2021-02-18 2021-03-26 成都睿沿科技有限公司 Target detection model training method and device, electronic equipment and storage medium
CN112926654A (en) * 2021-02-25 2021-06-08 平安银行股份有限公司 Pre-labeling model training and certificate pre-labeling method, device, equipment and medium
US20210192180A1 (en) * 2018-12-05 2021-06-24 Tencent Technology (Shenzhen) Company Limited Method for training object detection model and target object detection method
CN113469176A (en) * 2021-06-28 2021-10-01 北京有竹居网络技术有限公司 Target detection model training method, target detection method and related equipment thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837856B (en) * 2019-10-31 2023-05-30 深圳市商汤科技有限公司 Neural network training and target detection method, device, equipment and storage medium
CN112861917B (en) * 2021-01-14 2021-12-28 西北工业大学 Weak supervision target detection method based on image attribute learning
CN113033660B (en) * 2021-03-24 2022-08-02 支付宝(杭州)信息技术有限公司 Universal language detection method, device and equipment

Also Published As

Publication number Publication date
CN113469176A (en) 2021-10-01
CN113469176B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN109582793B (en) Model training method, customer service system, data labeling system and readable storage medium
TWI752455B (en) Image classification model training method, image processing method, data classification model training method, data processing method, computer device, and storage medium
CN109189767B (en) Data processing method and device, electronic equipment and storage medium
CN110046706B (en) Model generation method and device and server
CN111079847B (en) Remote sensing image automatic labeling method based on deep learning
WO2023115761A1 (en) Event detection method and apparatus based on temporal knowledge graph
CN109165309B (en) Negative example training sample acquisition method and device and model training method and device
WO2022048194A1 (en) Method, apparatus and device for optimizing event subject identification model, and readable storage medium
JP6892606B2 (en) Positioning device, position identification method and computer program
CN112149420A (en) Entity recognition model training method, threat information entity extraction method and device
CN111160959B (en) User click conversion prediction method and device
CN110458022B (en) Autonomous learning target detection method based on domain adaptation
CN110909784A (en) Training method and device of image recognition model and electronic equipment
WO2023273570A1 (en) Target detection model training method and target detection method, and related device therefor
WO2023273572A1 (en) Feature extraction model construction method and target detection method, and device therefor
WO2020135054A1 (en) Method, device and apparatus for video recommendation and storage medium
JP2019067299A (en) Label estimating apparatus and label estimating program
CN111539456A (en) Target identification method and device
CN108428234B (en) Interactive segmentation performance optimization method based on image segmentation result evaluation
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN111368792B (en) Feature point labeling model training method and device, electronic equipment and storage medium
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium
CN114021658A (en) Training method, application method and system of named entity recognition model
CN112990145B (en) Group-sparse-based age estimation method and electronic equipment
CN112069800A (en) Sentence tense recognition method and device based on dependency syntax and readable storage medium

Legal Events

Date Code Title Description

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 22831399
Country of ref document: EP
Kind code of ref document: A1