WO2023273572A1 - Feature extraction model construction method and target detection method, and device therefor - Google Patents

Feature extraction model construction method and target detection method, and device therefor

Info

Publication number
WO2023273572A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
image
model
feature extraction
similarity
Prior art date
Application number
PCT/CN2022/089230
Other languages
French (fr)
Chinese (zh)
Inventor
江毅
孙培泽
杨朔
袁泽寰
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023273572A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Definitions

  • the present application relates to the technical field of image processing, and in particular to a feature extraction model construction method, a target detection method, and devices thereof.
  • target detection, also known as target extraction, is an image segmentation technique based on the geometry and statistical features of targets; target detection has a wide range of applications (for example, it can be applied to robotics, autonomous driving, and other fields).
  • the present application provides a feature extraction model construction method, a target detection method and equipment thereof, which can improve the accuracy of target detection.
  • the embodiment of the present application provides a method for constructing a feature extraction model, the method comprising:
  • the sample pair includes a sample image and a sample object text identifier; the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
  • the sample pair is input into the model to be trained, and the extracted features of the sample pair output by the model to be trained are obtained; wherein, the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier;
  • the model to be trained is updated according to the actual information similarity of the sample pair and the predicted information similarity of the sample pair, and the step of inputting the sample pair into the model to be trained continues to be executed until a preset stop condition is reached, at which point the feature extraction model is determined according to the model to be trained.
  • the model to be trained includes a text feature extraction sub-model and an image feature extraction sub-model;
  • the process of determining the extracted features of the sample pair includes:
  • before inputting the sample pair into the model to be trained, the method further includes:
  • the text feature extraction sub-model is initialized, so that the similarity between the text features output by the initialized text feature extraction sub-model for any two objects is positively correlated with the degree of association between the two objects; wherein, preset prior knowledge is used to describe the degree of association between different objects.
  • the process of determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier includes:
  • the process of determining the actual information similarity of the sample pair includes:
  • if the sample object text identifier is used to uniquely identify the sample object, and the sample image includes the sample object, then the actual information similarity is determined according to the actual position of the sample object in the sample image.
  • the embodiment of the present application also provides a target detection method, the method comprising:
  • the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiments of the present application;
  • a target detection result corresponding to the image to be detected is determined according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the embodiment of the present application also provides a feature extraction model construction device, including:
  • a sample acquisition unit configured to acquire a sample pair and the actual information similarity of the sample pair; wherein, the sample pair includes a sample image and a sample object text identifier; the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
  • a feature prediction unit configured to input the sample pair into the model to be trained, and to obtain the extracted features of the sample pair output by the model to be trained; wherein, the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier;
  • a model updating unit configured to update the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and to continue executing the step of inputting the sample pair into the model to be trained until the preset stop condition is reached, and to determine a feature extraction model according to the model to be trained.
  • the embodiment of the present application also provides a target detection device, including:
  • an information acquisition unit configured to acquire the image to be detected and the text identifier of the object to be detected;
  • a feature extraction unit configured to input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model, and to obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model; wherein, the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiments of the present application;
  • the result determination unit is configured to determine the target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the embodiment of the present application also provides a device, which includes a processor and a memory:
  • the memory is used to store a computer program;
  • the processor is configured to execute, according to the computer program, any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application has at least the following advantages:
  • the feature extraction model is first constructed using the sample pair and the actual information similarity of the sample pair, so that the constructed feature extraction model has better feature extraction performance;
  • the constructed feature extraction model then performs feature extraction on the image to be detected and the text identifier of the object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; finally, the target detection result corresponding to the image to be detected is determined according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected, so that the target detection result determined based on this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether the image to be detected contains the target object uniquely identified by the text identifier of the object to be detected, and the position of that target object in the image to be detected), which is beneficial to improving the accuracy of target detection.
  • FIG. 1 is a flow chart of a method for constructing a feature extraction model provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the nth sample pair provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a sample image including multiple objects provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a model to be trained provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the relationship between different objects provided by the embodiment of the present application.
  • FIG. 6 is a flow chart of a target detection method provided in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a feature extraction model construction device provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an object detection device provided by an embodiment of the present application.
  • if an image contains a target object (such as a cat), the information carried by the image should be similar to the information carried by the object text identifier of the target object (for example, the information carried by each pixel in the region of the image where the target object is located should be the same as the information carried by the object text identifier of the target object).
  • the embodiment of the present application provides a method for constructing a feature extraction model, the method including: obtaining a sample pair and the actual information similarity of the sample pair, where the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe the similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier; inputting the sample pair into the model to be trained to obtain the extracted features of the sample pair output by the model to be trained, where the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier; and updating the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, continuing to execute the step of inputting the sample pair into the model to be trained until a preset stop condition is reached, and then determining the feature extraction model according to the trained model.
  • the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair can accurately represent the information carried by the sample image and the information carried by the sample object text identifier, so that the similarity between the two sets of extracted features is close to the actual information similarity of the sample pair. The trained model therefore has better feature extraction performance, and so does the feature extraction model built from it, which allows the subsequent target detection process to be performed more accurately and helps improve object detection accuracy.
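  • the training loop described above can be sketched numerically. In the toy sketch below, a linear "image branch" and a single text-feature vector are fit so that the predicted per-pixel similarity approaches a binary actual-similarity mask; the linear branches, the squared-error loss, and the gradient-descent update are all illustrative assumptions, not choices fixed by the present application.

```python
import numpy as np

# Toy sketch: W_img stands in for the image branch (RGB -> c-dim feature
# per pixel) and v_txt for the sample object text identifier's extracted
# feature. Loss and update rule are illustrative assumptions.
rng = np.random.default_rng(0)
h, w, c = 4, 4, 8
W_img = rng.standard_normal((3, c)) * 0.1   # toy image-branch weights
v_txt = rng.standard_normal(c) * 0.1        # toy text-identifier feature

image = rng.standard_normal((h, w, 3))               # one sample image
actual = np.zeros((h, w)); actual[1:3, 1:3] = 1.0    # actual info similarity

def loss_and_grads():
    img_feat = image @ W_img            # (h, w, c) extracted image features
    pred = img_feat @ v_txt             # (h, w) predicted info similarity
    err = pred - actual
    g_v = 2 * np.einsum('hw,hwk->k', err, img_feat) / (h * w)
    g_W = 2 * np.einsum('hw,hwd,k->dk', err, image, v_txt) / (h * w)
    return float((err ** 2).mean()), g_W, g_v

lr, losses = 0.1, []
for _ in range(500):
    loss, g_W, g_v = loss_and_grads()
    losses.append(loss)
    W_img -= lr * g_W                   # update both branches, then repeat
    v_txt -= lr * g_v

print(losses[0], '->', losses[-1])      # the similarity mismatch shrinks
```

In the real method the stop condition (S106) would end this loop; here a fixed iteration count is used for brevity.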
  • the embodiment of the present application does not limit the execution subject of the feature extraction model construction method.
  • the feature extraction model construction method provided in the embodiment of the present application can be applied to data processing devices such as terminal devices or servers.
  • the terminal device may be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), or a tablet computer.
  • the server can be an independent server, a cluster server or a cloud server.
  • the following first introduces the feature extraction model construction method (that is, the construction process of the feature extraction model), and then introduces the target detection method (that is, the application process of the feature extraction model).
  • this figure is a flow chart of a method for constructing a feature extraction model provided by an embodiment of the present application.
  • the feature extraction model construction method provided in the embodiment of the present application includes S101-S106:
  • S101: Obtain a sample pair and the actual information similarity of the sample pair.
  • the sample pair refers to the model input data that needs to be input to the model to be trained during the training process of the model to be trained; and the sample pair includes a sample image and a text identifier of a sample object.
  • the sample image refers to an image that needs to be subjected to target detection processing.
  • the sample object text identifier is used to uniquely identify the sample object.
  • the sample object text identifier may be an object category name (for example, "cat").
  • the embodiment of the present application does not limit the number of sample pairs, for example, the number of sample pairs may be N.
  • N is a positive integer. That is, the model to be trained can be trained using N sample pairs.
  • the embodiment of the present application does not limit the sample type of a sample pair. For example, when the nth sample pair includes the nth sample image and the nth sample object text identifier, and the nth sample object text identifier is used to uniquely identify the nth sample object: if the nth sample object exists in the nth sample image, the nth sample pair is a positive sample; if the nth sample object does not exist in the nth sample image, the nth sample pair is a negative sample.
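  • the positive/negative distinction above amounts to a membership test; a minimal sketch (the helper name is hypothetical):

```python
def pair_label(objects_in_image, sample_object_id):
    """A sample pair is a positive sample when the object uniquely identified
    by the sample object text identifier exists in the sample image, and a
    negative sample otherwise."""
    return "positive" if sample_object_id in objects_in_image else "negative"

print(pair_label({"cat", "dog"}, "cat"))    # positive
print(pair_label({"cat", "dog"}, "horse"))  # negative
```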
  • the actual information similarity of the sample pair is used to describe the similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier, so that it can accurately represent the relationship between the sample image and the sample object text identifier. Specifically, when the sample object text identifier is used to uniquely identify the sample object: the greater the actual information similarity of the sample pair, the greater the possibility that the sample object exists in the sample image; the smaller the actual information similarity, the smaller that possibility.
  • the information actually carried by the nth sample image should be as close as possible to the information actually carried by the nth sample object text identifier (for example, the information actually carried by each pixel in the area where the nth sample object is located in the nth sample image should be the same as the information actually carried by the nth sample object text identifier).
  • the embodiment of the present application provides a process of obtaining the actual information similarity of the sample pair, which may specifically include: if the sample object text identifier is used to uniquely identify the sample object, and the sample image includes the sample object, then determining the actual information similarity of the sample pair according to the actual position of the sample object in the sample image.
  • the embodiment of the present application does not limit the determination process of the actual information similarity of the sample pair.
  • it may specifically include: first, determining the image area of the sample object according to the actual position of the sample object in the sample image, so that the image area represents the area occupied by the sample object in the sample image; then determining the actual information similarity corresponding to each pixel in the image area of the sample object as a first preset similarity value (for example, 1), and determining the actual information similarity corresponding to each pixel in the sample image outside the image area of the sample object as a second preset similarity value (for example, 0).
  • if the nth sample pair includes the nth sample image and the nth sample object text identifier, and the nth sample image is an h × w × 3-dimensional image, then the actual information similarity of the nth sample pair can be an h × w-dimensional matrix Y^n determined according to formulas (1)-(2):
  • Y^n(i,j) = 1, if p^n(i,j) ∈ Z_n    (1)
  • Y^n(i,j) = 0, if p^n(i,j) ∉ Z_n    (2)
  • where Y^n denotes the actual information similarity of the nth sample pair; Y^n(i,j) denotes the similarity between the information actually carried by the pixel in the ith row and jth column of the nth sample image and the information actually carried by the nth sample object text identifier; p^n(i,j) denotes the position of the pixel in the ith row and jth column of the nth sample image; Z_n denotes the area where the nth sample object is located in the nth sample image; i, j, h, and w are positive integers with i ≤ h and j ≤ w. If p^n(i,j) ∈ Z_n, the area where the nth sample object is located includes that pixel, so the information it actually carries can be determined to be the same as the information actually carried by the nth sample object text identifier.
  • the determination process may specifically include: the actual information similarity of the nth sample pair includes the actual information similarity corresponding to each pixel in the nth sample image; if the pixel in the ith row and jth column of the nth sample image is located within the area where the nth sample object is located (such as within the object bounding box shown in FIG. 2), the actual information similarity corresponding to that pixel is 1; if the pixel is located outside that area (outside the object bounding box shown in FIG. 2), the actual information similarity corresponding to that pixel is 0.
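  • formulas (1)-(2) amount to building a binary mask from the object's bounding box; a minimal sketch (half-open row/column box coordinates are an assumption made for illustration):

```python
import numpy as np

def actual_similarity(h, w, box):
    """h x w actual-information-similarity matrix for one sample pair:
    1 for pixels inside the sample object's area Z_n (here a bounding
    box), 0 for pixels outside it."""
    r0, c0, r1, c1 = box                 # half-open (row, col) bounds
    y = np.zeros((h, w), dtype=np.float32)
    y[r0:r1, c0:c1] = 1.0
    return y

y = actual_similarity(4, 6, (1, 2, 3, 5))
print(int(y.sum()))   # 6 pixels fall inside the box
```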
  • if the nth sample pair includes the nth sample image and the nth sample object text identifier, there are Q (e.g., 3) objects in the nth sample image, and the nth sample object text identifier is used to uniquely identify the qth object in the nth sample image (such as the dog, person, or horse in FIG. 3), then the actual information similarity of the nth sample pair can be determined according to the area occupied by the qth object in the nth sample image. Specifically: the actual information similarity corresponding to each pixel within the area occupied by the qth object in the nth sample image is determined as the first preset similarity value (for example, 1), and the actual information similarity corresponding to each pixel outside that area is determined as the second preset similarity value (for example, 0). Here, q is a positive integer, Q is a positive integer, and q ≤ Q.
  • that is, each of the Q objects in the nth sample image can form a sample pair with its own object text identifier, and the area occupied by the qth object in the nth sample image is used to determine the actual information similarity of the corresponding sample pair.
  • in FIG. 3, "dog" refers to the object text identifier of a dog, "person" refers to the object text identifier of a person, and "horse" refers to the object text identifier of a horse.
  • after the sample image and the sample object text identifier are acquired, the similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier is determined according to the association between the sample image and the sample object text identifier (for example, whether the sample image contains the sample object uniquely identified by the sample object text identifier, and the location of that sample object in the sample image), so that this similarity can serve as the learning target during the subsequent training of the model to be trained.
  • S102: Input the sample pair into the model to be trained, and obtain the extracted features of the sample pair output by the model to be trained.
  • the extracted features of the sample pair are used to represent the information carried by the sample pair; and the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the text identifier of the sample object.
  • the extracted features of the sample image are used to represent the information that the sample image is predicted to carry.
  • the embodiment of the present application does not limit the representation of the extracted features of the sample image. For example, if a sample image is h × w × 3-dimensional, its extracted features can be represented by an h × w × c-dimensional feature map.
  • the extracted features of the sample object text identifier are used to represent the information that the sample object text identifier is predicted to carry.
  • the embodiment of the present application does not limit the representation manner of the extracted features of the sample object text identifier.
  • the extracted features of a sample object text identifier may be represented by a 1 ⁇ c-dimensional feature vector.
  • the model to be trained is used to perform feature extraction on input data of the model to be trained (for example, perform text feature extraction on text data, and/or perform image feature extraction on image data).
  • the embodiment of the present application does not limit the structure of the model to be trained.
  • the model 400 to be trained may include a text feature extraction sub-model 401 and an image feature extraction sub-model 402.
  • the process of using the model to be trained 400 to determine the extracted features of the sample pair may specifically include steps 11-12:
  • Step 11: Input the sample image into the image feature extraction sub-model 402 to obtain the extracted features of the sample image output by the image feature extraction sub-model 402.
  • the image feature extraction sub-model 402 is used for image feature extraction; moreover, the embodiment of the present application does not limit the implementation of the image feature extraction sub-model 402, which can be implemented using any existing or future model structure with an image feature extraction function.
  • Step 12: Input the sample object text identifier into the text feature extraction sub-model 401 to obtain the extracted features of the sample object text identifier output by the text feature extraction sub-model 401.
  • the text feature extraction sub-model 401 is used for text feature extraction; moreover, the embodiment of the present application does not limit the implementation of the text feature extraction sub-model 401, which can be implemented using any existing or future model structure with a text feature extraction function (such as BERT, GPT-3, or other language models).
  • in this way, the image feature extraction sub-model 402 in the model to be trained 400 performs image feature extraction on the sample image in the sample pair, obtaining and outputting the extracted features of the sample image, and the text feature extraction sub-model 401 performs text feature extraction on the sample object text identifier in the sample pair, obtaining and outputting the extracted features of the sample object text identifier, so that these extracted features can represent the information that the sample image and the sample object text identifier are respectively predicted to carry.
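  • the two-branch structure of the model to be trained 400 can be sketched with toy linear encoders; all class names, dimensions, and the token-id text representation below are illustrative assumptions, not the architecture prescribed by the present application:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 8  # shared feature dimension

class ImageEncoder:
    """Stands in for sub-model 402: maps an h x w x 3 image to an
    h x w x c feature map (one c-dim feature per pixel)."""
    def __init__(self):
        self.w = rng.standard_normal((3, c)) * 0.1
    def __call__(self, image):
        return image @ self.w

class TextEncoder:
    """Stands in for sub-model 401: maps an object text identifier
    (here reduced to a token id) to a 1 x c feature vector."""
    def __init__(self, vocab=100):
        self.emb = rng.standard_normal((vocab, c)) * 0.1
    def __call__(self, token_id):
        return self.emb[token_id][None, :]

img_feat = ImageEncoder()(rng.standard_normal((4, 6, 3)))
txt_feat = TextEncoder()(7)
print(img_feat.shape, txt_feat.shape)   # (4, 6, 8) (1, 8)
```

The point of the sketch is the output shapes: the image branch keeps spatial resolution (h × w × c) while the text branch yields a single 1 × c vector, which is what makes the later per-pixel similarity comparison possible.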
  • the embodiment of the present application also provides a possible implementation of the feature extraction model construction method.
  • the feature extraction model construction method includes S107 in addition to S101-S106:
  • the preset prior knowledge is used to describe the degree of association between different objects (for example, as shown in FIG. 5, cats and tigers both belong to the cat family, so the degree of association between cats and tigers is high; lions and lionesses are even more closely related, so the degree of association between lions and lionesses is higher still).
  • if the degree of association between two objects is 1, the two objects belong to the same type of object; if the degree of association between two objects is 0, the two objects have nothing in common (that is, there is no association between the two objects).
  • the embodiment of the present application does not limit the preset prior knowledge, for example, the preset prior knowledge may include a pre-built object knowledge graph.
  • the object knowledge graph can be used to describe the degree of association between different objects; and the object knowledge graph can be constructed in advance based on a large amount of knowledge information related to objects.
  • this embodiment of the present application does not limit the implementation manner of "initialization processing" in S107.
  • the "initialization processing" in S107 may refer to pre-training. That is, the text feature extraction sub-model 401 is pre-trained using the preset prior knowledge, so that the similarity between the text features output by the initialized text feature extraction sub-model 401 for any two objects is positively correlated with the degree of association between the two objects.
  • for the initialized text feature extraction sub-model 401: if the preset prior knowledge indicates that the degree of association between a first object (such as a cat) and a second object (such as a lion) is higher, then the similarity between the text features output by the initialized sub-model for the two objects (such as "v5" and "v3" in FIG. 5) is higher; if the preset prior knowledge indicates that the degree of association between the first object and the second object is lower, then the similarity between the text features output for them is lower.
  • "v1" in FIG. 5 indicates the text features output by the initialized text feature extraction sub-model 401 for "tiger"; "v2" indicates the text features output for "leopard"; ... (and so on); "v6" indicates the text features output for "lynx".
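  • the intended property, that text-feature similarity tracks the degree of association, can be checked with cosine similarity; the vectors below are toy stand-ins for the initialized text features, not values produced by sub-model 401:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings: a cat and a lion (both felines, high association) get
# nearer vectors than a cat and a weakly related object.
v_cat  = np.array([1.0, 0.9, 0.1])   # stand-in for "v5"
v_lion = np.array([1.0, 0.8, 0.2])   # stand-in for "v3"
v_misc = np.array([0.1, 0.2, 1.0])   # stand-in for a weakly related object

print(cosine(v_cat, v_lion) > cosine(v_cat, v_misc))  # True
```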
  • the embodiment of the present application does not limit the execution time of S107, and it only needs to be completed before S102 is executed (that is, S107 only needs to be completed before training the model to be trained).
  • by pre-training the text feature extraction sub-model 401, the text feature extraction sub-model 401 in the model to be trained 400 can learn to perform feature extraction according to the preset prior knowledge, and the training process of the model to be trained 400 then continues to optimize its text feature extraction performance. The text feature extraction sub-model 401 in the trained model 400 can therefore better perform feature extraction based on the preset prior knowledge, which helps improve the feature extraction performance of the model to be trained 400, and hence of the feature extraction model constructed from it, and further helps improve target detection performance when the feature extraction model is used for target detection.
  • the nth sample pair can be input into the model to be trained, so that the model to be trained can target the nth sample pair in the nth sample pair.
  • the nth sample image and the nth sample object text mark carry out feature extraction respectively, obtain and output the extraction feature of the nth sample image and the extraction feature of the nth sample object text mark (that is, the nth sample The extraction feature of the binary group), so that the extraction feature of the sample image and the extraction feature of the nth sample object text mark can respectively represent the information carried by the sample image prediction and the information carried by the sample object text mark prediction,
  • n is a positive integer
• n ≤ N; N is a positive integer
  • N represents the number of sample pairs.
• the predicted information similarity of the sample pair refers to the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, so that it can describe the degree of similarity between the information predicted to be carried by the sample image and the information predicted to be carried by the sample object text identifier.
  • the embodiment of the present application does not limit the method of determining the similarity of the prediction information of the sample pair (that is, the implementation of S103).
  • S103 may specifically include S1031-S1032:
  • S1031 Determine the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted feature of the sample object text identifier.
• the feature map of the sample image is used to represent the information carried by the sample image; moreover, the embodiment of the present application does not limit the feature map of the sample image. For example, if a sample image is h × w × 3 dimensional and the extracted feature of the sample object text identifier is 1 × c dimensional, the feature map of the sample image can be h × w × c dimensional.
  • h is a positive integer
  • w is a positive integer
  • c is a positive integer.
  • the embodiment of the present application does not limit the representation of the feature map of the sample image.
• the feature map of the sample image can be represented by h × w pixel-level extracted features, each of which is 1 × c dimensional.
• the pixel-level extracted feature located in the i-th row and j-th column of the feature map of the sample image is used to represent the information predicted to be carried by the pixel in the i-th row and j-th column of the sample image.
  • S1031 may be implemented using formula (3).
• b ij represents the similarity between the pixel-level extracted feature located in the i-th row and j-th column of the feature map of the sample image and the extracted feature of the sample object text identifier, so that b ij can describe the degree of similarity between the information predicted to be carried by the pixel in the i-th row and j-th column of the sample image and the information predicted to be carried by the sample object text identifier; the pixel-level extracted feature in the i-th row and j-th column of the feature map is a 1 × c-dimensional feature vector.
• H n represents the extracted feature of the sample object text identifier, so that H n describes the information predicted to be carried by the sample object text identifier; H n is a 1 × c-dimensional feature vector
• S(·) denotes the similarity calculation
  • i is a positive integer
• S1032 may be implemented using formula (4).
• the predicted information similarity of the nth sample pair can be calculated from the extracted features of the nth sample image and the extracted features of the nth sample object text identifier, so that the predicted information similarity of the nth sample pair can accurately describe the similarity between the information predicted to be carried by the nth sample image and the information predicted to be carried by the nth sample object text identifier. The feature extraction performance of the model to be trained can then subsequently be determined based on the predicted information similarity of the nth sample pair.
  • n is a positive integer
• n ≤ N; N is a positive integer
  • N represents the number of sample pairs.
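Formulas (3) and (4) themselves are not reproduced in this excerpt. A minimal sketch of the computation they describe is given below, assuming cosine similarity for S(·) and assuming (since the aggregation of formula (4) is not shown here) that the per-pixel similarities are aggregated by taking the maximum:

```python
import numpy as np

def pixelwise_similarity(feature_map, text_feature):
    """Formula (3) sketch: b_ij = S(pixel-level feature at (i, j), H_n).

    feature_map:  h x w x c feature map of the sample image
    text_feature: length-c extracted feature of the sample object text identifier
    Returns an h x w matrix of per-pixel cosine similarities.
    """
    fm = feature_map / np.linalg.norm(feature_map, axis=-1, keepdims=True)
    tf = text_feature / np.linalg.norm(text_feature)
    return fm @ tf

def pair_similarity(sim_matrix):
    # Formula (4) sketch (assumed aggregation): the best-matching pixel
    # stands in for the predicted information similarity of the sample pair.
    return float(sim_matrix.max())

h, w, c = 4, 4, 8
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(h, w, c))   # stand-in for the image branch output
text_feature = rng.normal(size=c)          # stand-in for the text branch output
b = pixelwise_similarity(feature_map, text_feature)
s = pair_similarity(b)                     # b is (4, 4); s is a scalar in [-1, 1]
```

The choice of maximum as the aggregation is an assumption for illustration; any reduction over the h × w similarity matrix would fit the description of S1032.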
• S104 Determine whether a preset stop condition is met; if yes, execute S106; if not, execute S105.
  • the preset stop condition can be set in advance.
• the preset stop condition may be that the loss value of the model to be trained is lower than a preset loss threshold, that the rate of change of the loss value of the model to be trained is lower than a preset rate-of-change threshold (that is, the model to be trained reaches convergence), or that the number of updates of the model to be trained reaches a preset number threshold.
• the loss value of the model to be trained is used to describe the feature extraction performance of the model to be trained; the embodiment of the present application does not limit the calculation method of this loss value, and any existing or future method capable of calculating the loss value of the model to be trained according to the predicted information similarity of the sample pair and the actual information similarity of the sample pair may be used.
  • S105 Update the model to be trained according to the predicted information similarity of the sample pair and the actual information similarity of the sample pair, and return to S102.
• the feature extraction performance of the current round of the model to be trained is still relatively poor, so the model to be trained can be updated based on the difference between the predicted information similarity of the sample pair and the actual information similarity of the sample pair, so that the updated model to be trained has better feature extraction performance; S102 and its subsequent steps are then executed again with the updated model to be trained.
• the feature extraction model can be constructed according to the current round of the model to be trained (for example, the current round of the model to be trained is directly determined as the feature extraction model; or the model structure and model parameters of the feature extraction model are determined according to the model structure and model parameters of the current round of the model to be trained, so that they remain the same as those of the current round of the model to be trained). In this way, the feature extraction performance of the constructed feature extraction model is consistent with that of the current round of the model to be trained, so the constructed feature extraction model also has better feature extraction performance.
• in the feature extraction model construction method, after the sample pair and the actual information similarity of the sample pair are obtained, the sample pair and its actual information similarity are first used to train the model to be trained, so that the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair is close to the actual information similarity of the sample pair. The trained model to be trained thus has better feature extraction performance, and the feature extraction model constructed from it also has better feature extraction performance. In this way, the subsequent target detection process can be performed more accurately based on the constructed feature extraction model, which is conducive to improving the accuracy of target detection.
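The training procedure of S102–S106 can be outlined as follows. The patent does not fix a concrete loss function, so a mean-squared error between predicted and actual information similarity, with a plain gradient step on a toy one-parameter-per-pair model, is used purely as an illustrative placeholder:

```python
import numpy as np

# Toy stand-in for the model to be trained: one learnable predicted
# similarity per sample pair, trained toward the actual similarity.
actual_sims = np.array([1.0, 0.0, 1.0])  # actual information similarities
params = np.zeros_like(actual_sims)      # predicted similarities (learnable)

def loss_value(params):
    # Placeholder loss (assumption): mean squared error between predicted
    # and actual information similarity of the sample pairs.
    return float(np.mean((params - actual_sims) ** 2))

loss_threshold = 1e-4   # preset loss threshold (one form of the S104 condition)
max_updates = 1000      # preset number-of-updates threshold (another form)
lr = 0.5

for step in range(max_updates):              # stop condition: update count
    if loss_value(params) < loss_threshold:  # stop condition: loss threshold
        break                                # S106: build the model from here
    grad = 2 * (params - actual_sims) / len(params)
    params -= lr * grad                      # S105: update, return to S102
```

With either stop condition triggered, the current round of the model (here, `params`) would be taken as the feature extraction model.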
• After the feature extraction model is constructed, it can be used for target detection. Based on this, an embodiment of the present application further provides a target detection method, which will be described below with reference to the accompanying drawings.
  • this figure is a flow chart of a target detection method provided by an embodiment of the present application.
  • the target detection method provided in the embodiment of this application includes S601-S603:
  • S601 Obtain an image to be detected and a text identification of an object to be detected.
  • the image to be detected refers to an image that needs to be subjected to target detection processing.
  • the text identifier of the object to be detected is used to uniquely identify the object to be detected. That is, S601-S603 may be used to determine whether there is an object to be detected that is uniquely identified by a text identifier of the object to be detected in the image to be detected.
  • the embodiment of the present application does not limit the text identification of the object to be detected.
• the text identifier of the object to be detected can be any sample object text identifier used in the process of building the feature extraction model, or any object text identifier other than the sample object text identifiers used in that construction process.
• for example, the text identifier of the object to be detected may be "tiger".
  • the object detection method provided in the embodiment of the present application is an open world-oriented object detection method.
  • S602 Input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model, and obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model.
• the feature extraction model is used to perform feature extraction on the input data of the feature extraction model; and the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiment of the present application. For details, please refer to method embodiment one above.
  • the extracted features of the image to be detected are used to represent the information carried by the image to be detected.
  • the extracted features of the text identifier of the object to be detected are used to represent the information carried by the text identifier of the object to be detected.
• the image to be detected and the text identifier of the object to be detected can be input into a pre-built feature extraction model, so that the feature extraction model performs feature extraction on the image to be detected and the text identifier of the object to be detected, and obtains and outputs the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected. In this way, the extracted features of the image to be detected can represent the information carried by the image to be detected, and the extracted features of the text identifier of the object to be detected can represent the information carried by the text identifier of the object to be detected.
  • S603 Determine a target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the target detection result corresponding to the image to be detected is used to describe the relationship between the image to be detected and the text identifier of the object to be detected.
  • this embodiment of the present application does not limit the representation of the target detection result corresponding to the image to be detected.
• the target detection result corresponding to the image to be detected may include the possibility that the object to be detected exists in the image to be detected (such as the possibility that each pixel in the image to be detected is located in the region where the object to be detected is located in the image to be detected), and/or the position of the object to be detected in the image to be detected.
  • the embodiment of the present application does not limit the method of determining the target detection result corresponding to the image to be detected.
  • the process of determining the target detection result corresponding to the image to be detected may include steps 21-22:
  • Step 21 Calculate the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected is used to describe the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected.
  • this embodiment of the present application does not limit the representation of the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
• it can be represented by an h × w-dimensional similarity matrix.
• the similarity value in the i-th row and j-th column of the h × w-dimensional similarity matrix can describe the degree of similarity between the information carried by the pixel in the i-th row and j-th column of the image to be detected and the information carried by the text identifier of the object to be detected, and can therefore indicate the possibility that the pixel in the i-th row and j-th column of the image to be detected is located in the region where the object to be detected is located in the image to be detected.
• for the relevant content of step 21, please refer to the relevant content of S103 above; it suffices to replace "sample image" in S103 with "image to be detected" and "sample object text identifier" with "text identifier of the object to be detected".
  • Step 22 Determine the target detection result corresponding to the image to be detected according to the preset similarity condition and the similarity between the extracted features of the image to be detected and the extracted features of the text mark of the object to be detected.
• the preset similarity condition can be set in advance. For example, if the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected is represented by an h × w-dimensional similarity matrix, then the preset similarity condition may be being greater than a preset similarity threshold (e.g., 0.5).
• step 22 may specifically include: judging whether the similarity value in the i-th row and j-th column of the above h × w-dimensional similarity matrix is greater than the preset similarity threshold. If it is greater than the preset similarity threshold, it is determined that the information carried by the pixel in the i-th row and j-th column of the image to be detected is similar to the information carried by the text identifier of the object to be detected, so it can be determined that this pixel is located in the region where the object to be detected is located in the image to be detected; if it is not greater than the preset similarity threshold, it can be determined that this pixel is not located in that region.
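The thresholding described in step 22 can be sketched as follows; the 0.5 threshold matches the example value given above, and the dictionary result format is purely illustrative:

```python
import numpy as np

def detect(similarity_matrix, threshold=0.5):
    """Step 22 sketch: pixels whose similarity value exceeds the preset
    similarity threshold are judged to lie in the region where the object
    to be detected is located in the image to be detected."""
    mask = similarity_matrix > threshold
    return {
        "object_present": bool(mask.any()),  # does the object appear at all?
        "object_mask": mask,                 # h x w map of its location
    }

# Example h x w similarity matrix between the image to be detected and the
# text identifier of the object to be detected (values are illustrative).
sim = np.array([[0.1, 0.7],
                [0.6, 0.2]])
result = detect(sim)
# Pixels (0, 1) and (1, 0) exceed the threshold and are flagged as the
# region where the object to be detected is located.
```

Both parts of the target detection result described above (presence and position) fall out of the same thresholded mask.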
• the constructed feature extraction model can be used to perform feature extraction on the image to be detected and the text identifier of the object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; the target detection result corresponding to the image to be detected is then determined according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
• the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected. Hence the target detection result determined for the image to be detected based on this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether the target object uniquely identified by the text identifier of the object to be detected exists in the image to be detected, and the position of that target object in the image to be detected), which is beneficial to improving the accuracy of target detection.
• the target detection method provided in the embodiment of the present application can perform target detection not only based on the sample object text identifiers used in the construction process of the feature extraction model, but also based on any object text identifier other than those sample object text identifiers. This is conducive to improving the target detection performance of the feature extraction model for non-sample objects, thereby helping to improve the target detection performance of the target detection method provided in the embodiment of the present application.
  • the embodiment of the present application does not limit the execution subject of the object detection method.
  • the object detection method provided in the embodiment of the present application can be applied to data processing devices such as terminal devices or servers.
  • the terminal device may be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), or a tablet computer.
  • the server can be an independent server, a cluster server or a cloud server.
  • an embodiment of the present application further provides a device for constructing a feature extraction model, which will be explained and described below with reference to the accompanying drawings.
  • FIG. 7 is a schematic structural diagram of a feature extraction model construction device provided in an embodiment of the present application.
  • the feature extraction model construction device 700 provided in the embodiment of the present application includes:
• the sample obtaining unit 701 is configured to obtain a sample pair and the actual information similarity of the sample pair; wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
  • a feature prediction unit 702 configured to input the sample pair into the model to be trained, and obtain the extracted features of the sample pair output by the model to be trained; wherein, the extracted features of the sample pair include the Extracting features of the sample image and extracting features of the sample object text identifier;
  • a model updating unit 703 configured to update the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier , and continue to execute the step of inputting the sample pair into the model to be trained until a preset stop condition is reached, and a feature extraction model is determined according to the model to be trained.
  • the model to be trained includes a text feature extraction sub-model and an image feature extraction sub-model;
• the process of determining the extracted features of the sample pair includes:
  • the feature extraction model building device 700 also includes:
  • the initialization unit is configured to use preset prior knowledge to initialize the text feature extraction sub-model; wherein the preset prior knowledge is used to describe the relationship between different objects.
• the process of determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier includes: determining that similarity according to the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted features of the sample object text identifier.
• the process of determining the actual information similarity of the sample pair includes: if the sample object text identifier is used to uniquely identify the sample object and the sample image includes the sample object, determining the actual information similarity according to the actual position of the sample object in the sample image.
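The excerpt only names this determination step. One plausible construction, consistent with the pixel-level predicted information similarity described earlier, is a ground-truth similarity map that equals 1 at pixels inside the sample object's actual region and 0 elsewhere; the bounding-box annotation format below is a hypothetical assumption:

```python
import numpy as np

def actual_similarity_map(h, w, bbox):
    """Hypothetical sketch: derive the actual information similarity from
    the actual position of the sample object in the sample image.

    bbox: (row_min, col_min, row_max, col_max), rows/cols half-open.
    Returns an h x w map: 1.0 inside the object's region, 0.0 outside.
    """
    gt = np.zeros((h, w))
    r0, c0, r1, c1 = bbox
    gt[r0:r1, c0:c1] = 1.0
    return gt

# A 2x2 object region centered in a 4x4 sample image.
gt = actual_similarity_map(4, 4, (1, 1, 3, 3))
```

Such a map can be compared entry-by-entry with the h × w predicted similarity matrix when computing the training loss.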
• the model to be trained is trained so that the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair is close to the actual information similarity of the sample pair. The trained model to be trained thus has better feature extraction performance, and the feature extraction model constructed from it also has better feature extraction performance, so that the subsequent target detection process can be performed more accurately based on the constructed feature extraction model, which is conducive to improving the accuracy of target detection.
  • the embodiment of the present application also provides a target detection device, which will be explained and described below with reference to the accompanying drawings.
• FIG. 8 is a schematic structural diagram of a target detection device provided by an embodiment of the present application.
  • the target detection device 800 provided in the embodiment of the present application includes:
  • An information acquisition unit 801 configured to acquire an image to be detected and a text identification of an object to be detected
  • a feature extraction unit 802 configured to input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model, and obtain the extracted features of the image to be detected and the text of the object to be detected output by the feature extraction model The extracted features of the identification; wherein, the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiment of the present application;
  • the result determination unit 803 is configured to determine a target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
• the constructed feature extraction model can be used to perform feature extraction on the image to be detected and the text identifier of the object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; the target detection result corresponding to the image to be detected is then determined from these extracted features.
• the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected. Hence the target detection result determined for the image to be detected based on this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether the target object uniquely identified by the text identifier of the object to be detected exists in the image to be detected, and the position of that target object in the image to be detected), which is beneficial to improving the accuracy of target detection.
• the target detection device provided in the embodiment of the present application can perform target detection not only based on the sample object text identifiers used in the construction process of the feature extraction model, but also based on any object text identifier other than those sample object text identifiers. This is conducive to improving the target detection performance of the feature extraction model for non-sample objects, thereby helping to improve the target detection performance of the target detection device 800 provided in the embodiment of the present application.
  • the embodiment of the present application also provides a device, the device includes a processor and a memory:
  • the memory is used to store computer programs
  • the processor is configured to execute any implementation of the feature extraction model construction method provided in the embodiment of the present application according to the computer program, or execute any implementation of the target detection method provided in the embodiment of the application.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium is used to store a computer program, and the computer program is used to execute the feature extraction model construction method provided in the embodiment of the present application. Any implementation manner, or execute any implementation manner of the target detection method provided in the embodiment of the present application.
  • the embodiment of the present application also provides a computer program product, which, when running on the terminal device, enables the terminal device to execute any implementation manner of the feature extraction model construction method provided in the embodiment of the present application , or execute any implementation of the target detection method provided in the embodiment of the present application.
• "At least one (item)" means one or more, and "multiple" means two or more.
• "And/or" describes the association relationship between associated objects and indicates that three relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
• the character "/" generally indicates that the contextual objects are in an "or" relationship.
• "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items.
• "At least one item (piece) of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can be single or multiple.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the present application are a feature extraction model construction method and a target detection method, and a device therefor. Firstly, a feature extraction model is constructed by using a sample binary group and an actual information similarity of the sample binary group, such that the constructed feature extraction model has a better feature extraction performance; then, by using the constructed feature extraction model, feature extraction is performed on an image under test and an object text identifier under test, so as to obtain and output an extracted feature of the image under test and an extracted feature of the object text identifier under test; and finally, according to a similarity between the extracted feature of the image under test and the extracted feature of the object text identifier under test, a target detection result corresponding to the image under test is determined, such that the target detection result can accurately represent an association relationship between the image under test and the object text identifier under test, thereby facilitating an improvement in the target detection accuracy.

Description

A feature extraction model construction method, a target detection method, and a device therefor

This application claims priority to the Chinese patent application with application number 202110723063.X, entitled "A feature extraction model construction method, target detection method and device therefor", filed with the China National Intellectual Property Administration on June 28, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the technical field of image processing, and in particular to a feature extraction model construction method, a target detection method, and a device therefor.

Background Art

Target detection (also known as target extraction) is an image segmentation technology based on target geometric statistics and features, and its applications are wide-ranging (for example, target detection can be applied to fields such as robotics or automatic driving).

However, because the existing target detection technology still has some defects, how to improve the accuracy of target detection remains a technical problem to be solved urgently.
发明内容Contents of the invention
为了解决现有技术中存在的以上技术问题,本申请提供一种特征提取模型构建方法、目标检测方法及其设备,能够提高目标检测准确性。In order to solve the above technical problems in the prior art, the present application provides a feature extraction model construction method, a target detection method and equipment thereof, which can improve the accuracy of target detection.
为了实现上述目的,本申请实施例提供的技术方案如下:In order to achieve the above objectives, the technical solutions provided in the embodiments of the present application are as follows:
本申请实施例提供一种特征提取模型构建方法,所述方法包括:The embodiment of the present application provides a method for constructing a feature extraction model, the method comprising:
obtaining a sample pair and an actual information similarity of the sample pair, wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair describes the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
inputting the sample pair into a model to be trained to obtain extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair include extracted features of the sample image and extracted features of the sample object text identifier;
determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier as a predicted information similarity of the sample pair;
updating the model to be trained according to the actual information similarity of the sample pair and the predicted information similarity of the sample pair, and continuing to perform the step of inputting the sample pair into the model to be trained, until a preset stop condition is reached and a feature extraction model is determined according to the model to be trained.
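As an illustration, the training loop above can be sketched as follows. This is a minimal sketch under stated assumptions: the two sub-models are stood in for by single linear layers, the predicted information similarity is a sigmoid of the feature dot product, and the update rule is gradient descent on a squared-error loss against the actual information similarity; none of these specific choices (nor the names `train_step`, `W_img`, `W_txt`) are prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
c, img_dim, txt_dim = 8, 16, 10

# Toy stand-ins for the image / text feature extraction sub-models
# (real sub-models would be deep networks; linear maps keep the sketch short).
W_img = rng.normal(scale=0.1, size=(img_dim, c))
W_txt = rng.normal(scale=0.1, size=(txt_dim, c))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(image_vec, text_vec, actual_sim, lr=0.5):
    """One update: extract both features, compute the predicted information
    similarity, and descend the squared error against the actual similarity."""
    global W_img, W_txt
    f_img = image_vec @ W_img            # extracted feature of the sample image
    f_txt = text_vec @ W_txt             # extracted feature of the text identifier
    pred = sigmoid(f_img @ f_txt)        # predicted information similarity
    loss = (pred - actual_sim) ** 2
    g = 2.0 * (pred - actual_sim) * pred * (1.0 - pred)  # dL/d(dot product)
    W_img -= lr * g * np.outer(image_vec, f_txt)
    W_txt -= lr * g * np.outer(text_vec, f_img)
    return loss

# One positive sample pair (actual similarity 1) and one negative pair (0).
pos = (rng.normal(size=img_dim), rng.normal(size=txt_dim), 1.0)
neg = (rng.normal(size=img_dim), rng.normal(size=txt_dim), 0.0)
history = []
for _ in range(200):                     # stands in for "preset stop condition"
    history.append(train_step(*pos) + train_step(*neg))
```

After training, the predicted information similarity of the positive pair approaches 1 and that of the negative pair approaches 0, mirroring how the model to be trained is driven toward the actual information similarities.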
In a possible implementation, the model to be trained includes a text feature extraction sub-model and an image feature extraction sub-model;
the process of determining the extracted features of the sample pair includes:
inputting the sample image into the image feature extraction sub-model to obtain the extracted features of the sample image output by the image feature extraction sub-model;
inputting the sample object text identifier into the text feature extraction sub-model to obtain the extracted features of the sample object text identifier output by the text feature extraction sub-model.
In a possible implementation, before inputting the sample pair into the model to be trained, the method further includes:
initializing the text feature extraction sub-model using preset prior knowledge, so that, for any two objects, the similarity between the text features output by the initialized text feature extraction sub-model is positively correlated with the degree of association between the two objects, wherein the preset prior knowledge describes the degree of association between different objects.
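One hypothetical way to realize such an initialization is sketched below: given a symmetric positive semi-definite matrix of pairwise association degrees (the preset prior knowledge), one text feature per object can be initialized via an eigendecomposition so that the dot-product similarity of any two features reproduces their association degree. The factorization approach, the helper name, and the example matrix are illustrative assumptions, not part of the original text.

```python
import numpy as np

def init_text_features(relatedness):
    """Initialize one text feature per object such that the dot-product
    similarity of any two features equals their prior association degree.
    Requires `relatedness` to be symmetric positive semi-definite."""
    vals, vecs = np.linalg.eigh(relatedness)
    vals = np.clip(vals, 0.0, None)   # guard against tiny negative eigenvalues
    return vecs * np.sqrt(vals)       # rows are per-object features

# Hypothetical prior knowledge for three objects (e.g. cat, tiger, car):
# cat and tiger are strongly associated, neither is associated with car.
R = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
E = init_text_features(R)             # E @ E.T reproduces R
```

By construction, the initialized feature similarities are positively correlated with the association degrees (here, sim(cat, tiger) = 0.8 exceeds sim(cat, car) = 0.1), which is the property the initialization is required to have.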
In a possible implementation, if the extracted features of the sample image include a feature map of the sample image, the process of determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier includes:
separately determining the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted features of the sample object text identifier;
determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier according to the similarities between the individual pixel-level extracted features in the feature map of the sample image and the extracted features of the sample object text identifier.
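As a minimal sketch of these two steps, the snippet below computes a cosine similarity between every pixel-level feature of an h×w×c feature map and a c-dimensional text feature, then aggregates the pixel-level similarities into an image-level similarity by max-pooling. Both the cosine measure and the max aggregation are assumptions; the text does not prescribe a particular similarity measure or aggregation rule.

```python
import numpy as np

def pixel_level_similarities(feature_map, text_feature, eps=1e-8):
    """Cosine similarity between each pixel-level extracted feature of an
    (h, w, c) feature map and a (c,) text feature; returns an (h, w) map."""
    fm = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    tf = text_feature / (np.linalg.norm(text_feature) + eps)
    return fm @ tf

def image_level_similarity(feature_map, text_feature):
    """Aggregate the pixel-level similarities into one image-level similarity
    (max-pooling here; other aggregations are possible)."""
    return float(pixel_level_similarities(feature_map, text_feature).max())

# Tiny example: a 2 x 2 x 3 feature map whose top-left pixel matches the text feature.
text = np.array([1.0, 0.0, 0.0])
fmap = np.zeros((2, 2, 3))
fmap[0, 0] = [2.0, 0.0, 0.0]   # same direction as `text` -> similarity ~1
fmap[1, 1] = [0.0, 3.0, 0.0]   # orthogonal to `text`     -> similarity ~0
```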
In a possible implementation, the process of determining the actual information similarity of the sample pair includes:
if the sample object text identifier uniquely identifies a sample object and the sample image includes the sample object, determining the actual information similarity of the sample pair according to the actual position of the sample object in the sample image.
An embodiment of the present application further provides a target detection method, the method comprising:
obtaining an image to be detected and a text identifier of an object to be detected;
inputting the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model to obtain extracted features of the image to be detected and extracted features of the text identifier of the object to be detected output by the feature extraction model, wherein the feature extraction model is built using any implementation of the feature extraction model construction method provided in the embodiments of the present application;
determining a target detection result corresponding to the image to be detected according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
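A minimal sketch of this determination step is given below: the pixel-wise similarity map between the extracted features of the image to be detected and the extracted feature of the object text identifier is thresholded to decide whether the object appears and, if so, to localize it with a bounding box over the high-similarity pixels. The cosine similarity, the threshold value, and the box-from-pixels rule are all illustrative assumptions.

```python
import numpy as np

def detect(image_feature_map, text_feature, threshold=0.5, eps=1e-8):
    """Return (present, box): whether the object identified by the text
    feature appears in the image, and a (top, left, bottom, right) box
    enclosing the pixels whose similarity exceeds the threshold."""
    fm = image_feature_map / (np.linalg.norm(image_feature_map, axis=-1, keepdims=True) + eps)
    tf = text_feature / (np.linalg.norm(text_feature) + eps)
    sim = fm @ tf                          # (h, w) similarity map
    hits = np.argwhere(sim >= threshold)
    if hits.size == 0:
        return False, None
    top, left = hits.min(axis=0)
    bottom, right = hits.max(axis=0) + 1   # exclusive lower-right corner
    return True, (int(top), int(left), int(bottom), int(right))

# Example: a 4 x 4 x 2 feature map in which a 2 x 2 block matches the text feature.
text = np.array([0.0, 1.0])
fmap = np.zeros((4, 4, 2))
fmap[1:3, 1:3] = text                      # the "object" region
present, box = detect(fmap, text)
```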
An embodiment of the present application further provides a feature extraction model construction apparatus, comprising:
a sample acquisition unit, configured to obtain a sample pair and an actual information similarity of the sample pair, wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair describes the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
a feature prediction unit, configured to input the sample pair into a model to be trained to obtain extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair include extracted features of the sample image and extracted features of the sample object text identifier;
a model updating unit, configured to update the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and to continue performing the step of inputting the sample pair into the model to be trained, until a preset stop condition is reached and a feature extraction model is determined according to the model to be trained.
An embodiment of the present application further provides a target detection apparatus, comprising:
an information acquisition unit, configured to obtain an image to be detected and a text identifier of an object to be detected;
a feature extraction unit, configured to input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model to obtain extracted features of the image to be detected and extracted features of the text identifier of the object to be detected output by the feature extraction model, wherein the feature extraction model is built using any implementation of the feature extraction model construction method provided in the embodiments of the present application;
a result determination unit, configured to determine a target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
An embodiment of the present application further provides a device, the device comprising a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to execute, according to the computer program, any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium, configured to store a computer program, the computer program being used to execute any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
Compared with the prior art, the embodiments of the present application have at least the following advantages:
In the technical solutions provided by the embodiments of the present application, a feature extraction model is first built using a sample pair and the actual information similarity of the sample pair, so that the built feature extraction model has good feature extraction performance; the built feature extraction model is then used to perform feature extraction on an image to be detected and a text identifier of an object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; finally, a target detection result corresponding to the image to be detected is determined according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
Because the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected, the target detection result determined based on this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether a target object uniquely identified by the text identifier of the object to be detected exists in the image to be detected, and the position of that target object in the image to be detected), which is beneficial to improving the accuracy of target detection.
Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a flow chart of a feature extraction model construction method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the n-th sample pair provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of a sample image including multiple objects provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a model to be trained provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of association relationships between different objects provided by an embodiment of the present application;
Fig. 6 is a flow chart of a target detection method provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a feature extraction model construction apparatus provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present application.
Detailed Description
In research on target detection, the inventors found that if a target object (for example, a cat) exists in an image, the information carried by the image should be similar to the information carried by the object text identifier of the target object (for example, the information carried by each pixel within the region of the image occupied by the target object should be the same as the information carried by the object text identifier of the target object).
Based on the above finding, an embodiment of the present application provides a feature extraction model construction method, the method comprising: obtaining a sample pair and an actual information similarity of the sample pair, wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair describes the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier; inputting the sample pair into a model to be trained to obtain extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair include extracted features of the sample image and extracted features of the sample object text identifier; and updating the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and continuing to perform the step of inputting the sample pair into the model to be trained, until a preset stop condition is reached and a feature extraction model is determined according to the model to be trained.
It can be seen that, because the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair can accurately represent the information carried by the sample image and the information carried by the sample object text identifier respectively, the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier is close to the actual information similarity of the sample pair. The trained model therefore has good feature extraction performance, and so does the feature extraction model built from it, enabling the subsequent target detection process based on the built feature extraction model to be carried out more accurately, which is beneficial to improving the accuracy of target detection.
In addition, the embodiments of the present application do not limit the execution subject of the feature extraction model construction method. For example, the feature extraction model construction method provided in the embodiments of the present application may be applied to a data processing device such as a terminal device or a server. The terminal device may be a smartphone, a computer, a personal digital assistant (PDA), a tablet computer, or the like. The server may be an independent server, a cluster server, or a cloud server.
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
To facilitate understanding of the technical solutions of the present application, the content related to the feature extraction model construction method (that is, the construction process of the feature extraction model) is introduced first, followed by the content related to the target detection method (that is, the application process of the feature extraction model).
Method Embodiment 1
Referring to Fig. 1, this figure is a flow chart of a feature extraction model construction method provided by an embodiment of the present application.
The feature extraction model construction method provided by this embodiment of the present application includes S101-S106:
S101: Obtain a sample pair and the actual information similarity of the sample pair.
The sample pair refers to the model input data that needs to be input into the model to be trained during the training process of the model to be trained; the sample pair includes a sample image and a sample object text identifier. The sample image refers to an image on which target detection processing is to be performed. The sample object text identifier uniquely identifies a sample object.
It should be noted that the embodiments of the present application do not limit the sample object text identifier; for example, the sample object text identifier may be an object category name (e.g., "cat").
In addition, the embodiments of the present application do not limit the number of sample pairs; for example, the number of sample pairs may be N, where N is a positive integer. That is, the model to be trained may be trained using N sample pairs.
Furthermore, the embodiments of the present application do not limit the sample type of a sample pair. For example, when the n-th sample pair includes the n-th sample image and the n-th sample object text identifier, and the n-th sample object text identifier uniquely identifies the n-th sample object: if the n-th sample object exists in the n-th sample image, the n-th sample pair can be determined to be a positive sample; if the n-th sample object does not exist in the n-th sample image, the n-th sample pair can be determined to be a negative sample.
The actual information similarity of a sample pair describes the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier, so that the actual information similarity can accurately represent the association between the sample image and the sample object text identifier. Specifically, when the sample object text identifier uniquely identifies a sample object, a larger actual information similarity of the sample pair indicates a greater possibility that the sample object exists in the sample image, and a smaller actual information similarity indicates a smaller possibility that the sample object exists in the sample image.
In theory, for the n-th sample pair (as shown in Fig. 2), if the n-th sample object exists in the n-th sample image, the information actually carried by the n-th sample image should be as close as possible to the information actually carried by the n-th sample object text identifier (for example, the information actually carried by each pixel within the region of the n-th sample image occupied by the n-th sample object should be the same as the information actually carried by the n-th sample object text identifier).
Based on the above theory, an embodiment of the present application provides a process for obtaining the actual information similarity of a sample pair, which may specifically include: if the sample object text identifier uniquely identifies a sample object and the sample image includes the sample object, determining the actual information similarity of the sample pair according to the actual position of the sample object in the sample image.
In addition, the embodiments of the present application do not limit the process of determining the actual information similarity of the sample pair. For example, in a possible implementation, the process may specifically include: first determining the image region of the sample object according to the actual position of the sample object in the sample image, so that the image region of the sample object represents the region occupied by the sample object in the sample image; then setting the actual information similarity corresponding to each pixel within the image region of the sample object to a first preset similarity value (for example, 1), and setting the actual information similarity corresponding to each pixel of the sample image outside the image region of the sample object to a second preset similarity value (for example, 0).
For ease of understanding, an example is given below.
As an example, if the n-th sample pair includes the n-th sample image and the n-th sample object text identifier, and the n-th sample image is an h×w×3-dimensional image, the actual information similarity of the n-th sample pair may be an h×w-dimensional matrix A_n determined according to formulas (1)-(2):
A_n = {a_ij}_(h×w)    (1)
a_ij = 1, if (i, j) ∈ Z_n;  a_ij = 0, if (i, j) ∉ Z_n    (2)
where A_n denotes the actual information similarity of the n-th sample pair; (i, j) denotes the position of the pixel in the i-th row and j-th column of the n-th sample image, where i and j are positive integers with i ≤ h and j ≤ w, and h and w are positive integers; Z_n denotes the region occupied by the n-th sample object in the n-th sample image; and a_ij denotes the similarity between the information actually carried by the pixel in the i-th row and j-th column of the n-th sample image and the information actually carried by the n-th sample object text identifier. If (i, j) ∈ Z_n, the region occupied by the n-th sample object in the n-th sample image includes the pixel in the i-th row and j-th column, so the information actually carried by that pixel is the same as the information actually carried by the n-th sample object text identifier, and a_ij = 1. If (i, j) ∉ Z_n, the region occupied by the n-th sample object does not include the pixel in the i-th row and j-th column, so the information actually carried by that pixel differs from the information actually carried by the n-th sample object text identifier, and a_ij = 0.
Based on the above formulas (1) and (2), for the n-th sample pair shown in Fig. 2, the actual information similarity of the n-th sample pair can be determined according to the position of the n-th sample object in the n-th sample image (that is, the position of the cat). Specifically, when the actual information similarity of the n-th sample pair includes the actual information similarity corresponding to each pixel in the n-th sample image: if the pixel in the i-th row and j-th column of the n-th sample image is located within the region occupied by the n-th sample object in the n-th sample image (inside the object bounding box shown in Fig. 2), the actual information similarity corresponding to that pixel can be determined to be 1; if the pixel is located outside that region (outside the object bounding box shown in Fig. 2), the actual information similarity corresponding to that pixel can be determined to be 0.
In addition, when the n-th sample pair includes the n-th sample image and the n-th sample object text identifier, Q (e.g., 3) objects exist in the n-th sample image (such as the image shown in Fig. 3), and the n-th sample object text identifier uniquely identifies the q-th object in the n-th sample image (e.g., the dog, person, or horse in Fig. 3), the actual information similarity of the n-th sample pair can be determined according to the region occupied by the q-th object in the n-th sample image. Specifically, the actual information similarity corresponding to each pixel within the region occupied by the q-th object in the n-th sample image is set to the first preset similarity value (for example, 1), and the actual information similarity corresponding to each pixel outside that region is set to the second preset similarity value (for example, 0), where q is a positive integer and q ≤ Q.
That is, to train the "model to be trained" described below using the n-th sample image and the q-th object in the n-th sample image, a sample pair needs to be constructed from the n-th sample image and the object text identifier of the q-th object, and the actual information similarity of that sample pair is determined using the region occupied by the q-th object in the n-th sample image.
It should be noted that, in Fig. 3, "dog" is the object text identifier of the dog, "person" is the object text identifier of the person, and "horse" is the object text identifier of the horse.
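As a concrete illustration of formulas (1)-(2), the snippet below builds the h×w actual-information-similarity matrix for one sample pair from the region occupied by the sample object, here represented as an axis-aligned bounding box (top, left, bottom, right). The box representation and the helper name are illustrative assumptions; the region Z_n could have any shape.

```python
import numpy as np

def actual_info_similarity(h, w, object_box):
    """Build the h x w matrix of formulas (1)-(2): a_ij = 1 for pixels
    inside the region Z_n occupied by the sample object, 0 elsewhere."""
    top, left, bottom, right = object_box   # exclusive bottom/right bounds
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask

# For a 6 x 8 sample image whose object occupies rows 1-3 and columns 2-5:
A_n = actual_info_similarity(6, 8, (1, 2, 4, 6))
```

For a sample image containing several objects (as in Fig. 3), one such matrix would be built per (image, object text identifier) sample pair, each from the corresponding object's region.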
Based on the above content of S101, after the sample image and the sample object text identifier are obtained, the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier can be determined according to the association between the sample image and the sample object text identifier (for example, whether the sample object uniquely identified by the sample object text identifier exists in the sample image, and the position of the sample object in the sample image), so that this degree of similarity can subsequently serve as the learning target during the training process of the model to be trained.
S102: Input the sample pair into the model to be trained, and obtain the extracted features of the sample pair output by the model to be trained.
The extracted features of the sample pair represent the information carried by the sample pair, and they include the extracted features of the sample image and the extracted features of the sample object text identifier.
The extracted features of the sample image represent the information the sample image is predicted to carry. The embodiments of the present application do not limit how these extracted features are represented; for example, if a sample image is h×w×3-dimensional, its extracted features may be represented by an h×w×c-dimensional feature map.
The extracted features of the sample object text identifier represent the information the sample object text identifier is predicted to carry. The embodiments of the present application do not limit how these extracted features are represented; for example, they may be represented by a 1×c-dimensional feature vector.
The model to be trained performs feature extraction on its input data (e.g., text feature extraction on text data and/or image feature extraction on image data). The embodiments of the present application do not limit the structure of the model to be trained; for example, in one possible implementation, as shown in Figure 4, the model 400 to be trained may include a text feature extraction sub-model 401 and an image feature extraction sub-model 402.
To facilitate understanding of how the model 400 to be trained works, the process of determining the extracted features of a sample pair is described below as an example.
As an example, the process of determining the extracted features of a sample pair with the model 400 to be trained may include steps 11 and 12:
Step 11: Input the sample image into the image feature extraction sub-model 402 to obtain the extracted features of the sample image output by the image feature extraction sub-model 402.
The image feature extraction sub-model 402 performs image feature extraction. The embodiments of the present application do not limit its implementation; any existing or future model structure with an image feature extraction function may be used.
Step 12: Input the sample object text identifier into the text feature extraction sub-model 401 to obtain the extracted features of the sample object text identifier output by the text feature extraction sub-model 401.
The text feature extraction sub-model 401 performs text feature extraction. The embodiments of the present application do not limit its implementation; any existing or future model structure with a text feature extraction function (e.g., a language model such as BERT or GPT-3) may be used.
Based on steps 11 and 12 above, after a sample pair is input into the model 400 to be trained, the image feature extraction sub-model 402 performs image feature extraction on the sample image in the sample pair and outputs the extracted features of the sample image, so that these features can represent the information the sample image is predicted to carry; and the text feature extraction sub-model 401 performs text feature extraction on the sample object text identifier in the sample pair and outputs the extracted features of the sample object text identifier, so that these features can represent the information the sample object text identifier is predicted to carry.
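Steps 11 and 12 above can be sketched as two independent branches. The classes below are toy stand-ins for sub-models 402 and 401 (a random per-pixel projection and a random embedding table); the real sub-models would be learned networks, so every detail here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class ImageFeatureExtractor:
    """Stand-in for sub-model 402: maps an h x w x 3 image to an h x w x c feature map."""
    def __init__(self, c):
        self.proj = rng.standard_normal((3, c))  # toy per-pixel linear projection

    def __call__(self, image):                   # image: (h, w, 3)
        return image @ self.proj                 # -> (h, w, c)

class TextFeatureExtractor:
    """Stand-in for sub-model 401: maps a text identifier to a c-dimensional vector."""
    def __init__(self, c, vocab):
        self.table = {w: rng.standard_normal(c) for w in vocab}  # toy embedding table

    def __call__(self, identifier):
        return self.table[identifier]            # -> (c,)

img_model = ImageFeatureExtractor(c=16)
txt_model = TextFeatureExtractor(c=16, vocab=["dog", "person", "horse"])
feat_map = img_model(rng.standard_normal((8, 10, 3)))  # extracted features of the sample image
text_feat = txt_model("dog")                           # extracted features of the text identifier
```

The two outputs have exactly the shapes the text describes: an h×w×c feature map and a 1×c feature vector.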
In addition, to further improve the feature extraction performance of the model 400 to be trained, before training it, some prior knowledge may first be used to initialize the text feature extraction sub-model 401, so that the sub-model can subsequently perform text feature extraction based on this prior knowledge. Accordingly, the embodiments of the present application further provide a possible implementation of the feature extraction model construction method in which, in addition to S101-S106, the method further includes S107:
S107: Initialize the text feature extraction sub-model 401 using preset prior knowledge.
The preset prior knowledge describes the degree of association between different objects (for example, as shown in Figure 5, a cat and a tiger both belong to the cat family, so the degree of association between them is relatively high; likewise, a lion and a lioness are both lions, so the degree of association between them is even higher).
It should be noted that if the degree of association between two objects is 1, the two objects belong to the same class of objects; if the degree of association between two objects is 0, the two objects have no similarity at all (that is, no association exists between them).
In addition, the embodiments of the present application do not limit the preset prior knowledge; for example, it may include a pre-built object knowledge graph. The object knowledge graph may describe the degree of association between different objects, and it may be constructed in advance from a large amount of object-related knowledge information.
Furthermore, the embodiments of the present application do not limit how the "initialization" in S107 is implemented; for example, it may refer to pre-training. That is, the text feature extraction sub-model 401 is pre-trained with the preset prior knowledge so that, after pre-training, it can perform feature extraction according to that knowledge: for any two objects (more precisely, their object identifiers), the similarity between the text features output by the initialized sub-model 401 is positively correlated with the degree of association between the two objects.
That is, for the initialized text feature extraction sub-model 401, the higher the degree of association between a first object and a second object in the preset prior knowledge, the higher the similarity between the text features (e.g., "v5" and "v3" in Figure 5) output by the sub-model for the first object (e.g., a cat) and the second object (e.g., a lion); conversely, the lower the degree of association between the two objects in the preset prior knowledge, the lower the similarity between the text features output for them.
It should be noted that in Figure 5, "v1" denotes the text feature output by the initialized text feature extraction sub-model 401 for the tiger; "v2" denotes the text feature output for the leopard; and so on, with "v6" denoting the text feature output for the lynx.
It should also be noted that the embodiments of the present application do not limit when S107 is executed, as long as it is completed before S102 is executed (that is, S107 only needs to be completed before the model to be trained is trained).
Based on the content of S107 above, before the model 400 to be trained is trained with the sample pairs and their actual information similarities, the preset prior knowledge may first be used to pre-train the text feature extraction sub-model 401, so that the sub-model learns to extract features according to the preset prior knowledge. Its text feature extraction performance then continues to be optimized during the training of the model 400, so that after training the sub-model can better extract features based on the preset prior knowledge. This helps improve the feature extraction performance of the model 400 to be trained, which in turn helps improve the feature extraction performance of the feature extraction model built from it, and ultimately the target detection performance when that feature extraction model is used for target detection.
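One way to read S107 is as minimizing the gap between text-feature similarity and the preset degree of association. The sketch below computes such a pre-training objective; the cosine measure and the squared-gap loss are assumptions chosen for illustration, not details from the original disclosure:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two feature vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def prior_knowledge_loss(embeddings, relatedness):
    """Mean squared gap between embedding similarity and knowledge-graph relatedness.

    embeddings: {object_id: (c,) vector}; relatedness: {(a, b): degree in [0, 1]}.
    Pre-training would minimize this so that text-feature similarity is
    positively correlated with the preset degree of association.
    """
    gaps = [(cosine(embeddings[a], embeddings[b]) - r) ** 2
            for (a, b), r in relatedness.items()]
    return sum(gaps) / len(gaps)

# Toy check: identical vectors for fully associated objects give zero loss
emb = {"lion": np.array([1.0, 0.0]),
       "lioness": np.array([1.0, 0.0]),
       "cat": np.array([0.0, 1.0])}
rel = {("lion", "lioness"): 1.0, ("lion", "cat"): 0.0}
loss = prior_knowledge_loss(emb, rel)
```

A gradient-based pre-trainer would adjust the embedding table (or the sub-model parameters) to drive this loss down before S102 begins.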
Based on the content of S102 above, after the nth sample pair is acquired, it can be input into the model to be trained, so that the model performs feature extraction on the nth sample image and the nth sample object text identifier in the pair and outputs the extracted features of the nth sample image and the extracted features of the nth sample object text identifier (that is, the extracted features of the nth sample pair). These extracted features can respectively represent the information the sample image is predicted to carry and the information the sample object text identifier is predicted to carry, so that the feature extraction performance of the model to be trained can subsequently be determined from them. Here, n is a positive integer, n≤N, N is a positive integer, and N denotes the number of sample pairs.
S103: Calculate the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier as the predicted information similarity of the sample pair.
The predicted information similarity of a sample pair is the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier; it describes the degree of similarity between the information the sample image is predicted to carry and the information the sample object text identifier is predicted to carry.
In addition, the embodiments of the present application do not limit how the predicted information similarity of a sample pair is determined (that is, how S103 is implemented). For example, in one possible implementation, if the extracted features of the sample image include a feature map of the sample image, S103 may specifically include S1031-S1032:
S1031: Determine the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted features of the sample object text identifier.
The feature map of the sample image represents the information carried by the sample image. The embodiments of the present application do not limit the feature map; for example, if a sample image is h×w×3-dimensional and the extracted features of the sample object text identifier are 1×c-dimensional, the feature map of the sample image may be h×w×c-dimensional, where h, w, and c are positive integers.
In addition, the embodiments of the present application do not limit how the feature map of the sample image is represented. For example, if the feature map is h×w×c-dimensional, it may be represented by h×w pixel-level extracted features, each of which is 1×c-dimensional. The pixel-level extracted feature located at row i, column j of the feature map represents the information that the pixel at row i, column j of the sample image is predicted to carry, where i is a positive integer with i≤h and j is a positive integer with j≤w.
Furthermore, the embodiments of the present application do not limit how S1031 is implemented; for example, S1031 may be implemented using formula (3):
$$b_{ij} = S\left(F^{n}_{ij},\, H_{n}\right) \tag{3}$$

In formula (3), $b_{ij}$ denotes the similarity between the pixel-level extracted feature at row $i$, column $j$ of the feature map of the sample image and the extracted features of the sample object text identifier, so that $b_{ij}$ describes the degree of similarity between the information the pixel at row $i$, column $j$ of the sample image is predicted to carry and the information the sample object text identifier is predicted to carry. $F^{n}_{ij}$ (the original renders this symbol as an image; it is written here as $F^{n}_{ij}$) denotes the pixel-level extracted feature at row $i$, column $j$ of the feature map of the sample image; it describes the information the pixel at row $i$, column $j$ of the sample image is predicted to carry and is a 1×c-dimensional feature vector. $H_{n}$ denotes the extracted features of the sample object text identifier; it describes the information the sample object text identifier is predicted to carry and is a 1×c-dimensional feature vector. $S(\cdot)$ denotes a similarity calculation. $i$ is a positive integer, $i \le h$, and $h$ is a positive integer; $j$ is a positive integer, $j \le w$, and $w$ is a positive integer.
It should be noted that the embodiments of the present application do not limit how S(·) is implemented; any existing similarity calculation method (e.g., Euclidean distance, cosine distance) may be used.
S1032: Determine the predicted information similarity of the sample pair according to the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted features of the sample object text identifier.
The embodiments of the present application do not limit how S1032 is implemented; for example, S1032 may be calculated using formula (4).
$$\hat{y}_{n} = f\left(\{\, b_{ij} \mid 1 \le i \le h,\ 1 \le j \le w \,\}\right) \tag{4}$$

In formula (4), $\hat{y}_{n}$ denotes the predicted information similarity of the sample pair (that is, the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier); the original renders this symbol and the right-hand side as images, and $f$ stands for the aggregation specified there. $b_{ij}$ denotes the similarity between the pixel-level extracted feature at row $i$, column $j$ of the feature map of the sample image and the extracted features of the sample object text identifier. $i$ is a positive integer, $i \le h$, and $h$ is a positive integer; $j$ is a positive integer, $j \le w$, and $w$ is a positive integer.
Based on the content of S103, for the nth sample pair, which includes the nth sample image and the nth sample object text identifier, after the extracted features of both are obtained, the predicted information similarity of the nth sample pair can be calculated from them, so that it accurately describes the degree of similarity between the information the nth sample image is predicted to carry and the information the nth sample object text identifier is predicted to carry; the feature extraction performance of the model to be trained can subsequently be determined based on this predicted information similarity. Here, n is a positive integer, n≤N, N is a positive integer, and N denotes the number of sample pairs.
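Formulas (3) and (4) can be sketched as follows, with cosine similarity standing in for S(·) (one of the admissible choices named above) and the mean standing in for the aggregation in formula (4), which the excerpt leaves unspecified; both choices are illustrative assumptions:

```python
import numpy as np

def pixel_similarities(feat_map, text_feat):
    """b_ij of formula (3): cosine similarity between each 1 x c pixel-level
    feature of the h x w x c feature map and the 1 x c text feature."""
    fm = feat_map / np.linalg.norm(feat_map, axis=-1, keepdims=True)  # (h, w, c)
    tf = text_feat / np.linalg.norm(text_feat)                        # (c,)
    return fm @ tf                                                    # (h, w)

def predicted_information_similarity(feat_map, text_feat, agg=np.mean):
    """Formula (4): aggregate the b_ij into one scalar; `agg` (mean by
    default) is an illustrative stand-in for the aggregation used there."""
    return float(agg(pixel_similarities(feat_map, text_feat)))

# Toy check: if every pixel feature equals the text feature, similarity is 1
text_feat = np.array([3.0, 4.0])
feat_map = np.tile(text_feat, (2, 2, 1))    # (2, 2, 2) map, all pixels match
score = predicted_information_similarity(feat_map, text_feat)
```

Each b_ij is then comparable against the per-pixel actual information similarity defined in S101, which is what the training loss evaluates.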
S104: Determine whether a preset stop condition is met; if so, execute S106; if not, execute S105.
The preset stop condition may be set in advance. For example, it may be that the loss value of the model to be trained falls below a preset loss threshold, that the rate of change of the loss value falls below a preset change-rate threshold (that is, the model to be trained has converged), or that the number of updates of the model to be trained reaches a preset count threshold.
It should be noted that the loss value of the model to be trained describes its feature extraction performance. The embodiments of the present application do not limit how this loss value is calculated; any existing or future method that can calculate the loss value of the model to be trained from the predicted information similarity of a sample pair and the actual information similarity of that sample pair may be used.
S105: Update the model to be trained according to the predicted information similarity of the sample pair and the actual information similarity of the sample pair, and return to S102.
In this embodiment of the present application, once it is determined that the current round of the model to be trained has not met the preset stop condition, it can be concluded that its feature extraction performance is still relatively poor. The model to be trained is therefore updated according to the difference between the predicted information similarity of the sample pair and the actual information similarity of the sample pair, so that the updated model has better feature extraction performance, and S102 and its subsequent steps are then executed with the updated model.
S106: Determine the feature extraction model according to the model to be trained.
In this embodiment of the present application, once it is determined that the current round of the model to be trained has met the preset stop condition, it can be concluded that the model has good feature extraction performance (in particular, it can ensure that the extracted features of a sample image containing a sample object are as close as possible to the extracted features of the sample object text identifier that uniquely identifies that object). The feature extraction model can therefore be built from the current round of the model to be trained (e.g., the current model may be determined directly as the feature extraction model; or the model structure and model parameters of the feature extraction model may be determined from the model structure and model parameters of the current model, so that they remain identical). In this way, the feature extraction performance of the constructed feature extraction model is consistent with that of the current round of the model to be trained, so the constructed feature extraction model also has good feature extraction performance.
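The overall S102-S106 loop can be sketched as a skeleton. The `forward`/`update` interface, the squared-error loss, and the toy model are all hypothetical; the sketch only shows the stop conditions (loss threshold and update-count cap) and the update-then-retry control flow:

```python
def build_feature_extraction_model(model, sample_pairs, actual_sims,
                                   max_updates=100, loss_threshold=1e-3):
    """Skeleton of S102-S106. `model` is assumed to expose
    forward(pair) -> predicted similarity and update(pred, target)."""
    for step in range(max_updates):                      # S104: update-count cap
        loss = 0.0
        for pair, target in zip(sample_pairs, actual_sims):
            pred = model.forward(pair)                   # S102 + S103
            loss += (pred - target) ** 2                 # illustrative loss
        loss /= len(sample_pairs)
        if loss < loss_threshold:                        # S104: loss threshold
            break
        for pair, target in zip(sample_pairs, actual_sims):
            model.update(model.forward(pair), target)    # S105
    return model                                         # S106

class ToyModel:
    """Hypothetical model: predicted similarity is w * pair_value."""
    def __init__(self):
        self.w = 0.0
    def forward(self, pair):
        return self.w * pair
    def update(self, pred, target):
        self.w -= 0.5 * (pred - target)                  # toy gradient step

trained = build_feature_extraction_model(ToyModel(), [1.0], [0.5])
```

Here the trained model itself is returned as the feature extraction model, which corresponds to the "determine the current model directly as the feature extraction model" option above.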
Based on the content of S101 to S106 above, in the feature extraction model construction method, after a sample pair and its actual information similarity are obtained, the model to be trained is first trained with them, so that the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair is close to the actual information similarity of the sample pair. The trained model therefore has good feature extraction performance, and the feature extraction model built from it also has good feature extraction performance, which enables the subsequent target detection process to be performed more accurately based on the constructed feature extraction model and helps improve target detection accuracy.
After the feature extraction model is constructed, it can be used for target detection. Accordingly, the embodiments of the present application further provide a target detection method, which is described below with reference to the accompanying drawings.
Method Embodiment 2
Referring to Figure 6, this figure is a flowchart of a target detection method provided by an embodiment of the present application.
The target detection method provided by the embodiments of the present application includes S601-S603:
S601: Acquire an image to be detected and a text identifier of an object to be detected.
The image to be detected is an image on which target detection processing needs to be performed.
The text identifier of the object to be detected uniquely identifies the object to be detected. That is, S601-S603 can be used to determine whether the image to be detected contains the object uniquely identified by the text identifier of the object to be detected.
It should be noted that the embodiments of the present application do not limit the text identifier of the object to be detected. For example, it may be any sample object text identifier used during construction of the feature extraction model, or any other object text identifier not used during that construction. For instance, even if the object text identifier "tiger" was never used in the process of building the feature extraction model, the text identifier of the object to be detected may still be "tiger". It can thus be seen that the target detection method provided by the embodiments of the present application is an open-world target detection method.
S602: Input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model, and obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model.
The feature extraction model performs feature extraction on its input data; it is built using any implementation of the feature extraction model construction method provided by the embodiments of the present application (see Method Embodiment 1 above for details).
The extracted features of the image to be detected represent the information carried by the image to be detected.
The extracted features of the text identifier of the object to be detected represent the information carried by that text identifier.
Based on the content of S602, after the image to be detected and the text identifier of the object to be detected are acquired, they can be input into the pre-built feature extraction model, so that the model performs feature extraction on each of them and outputs the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected, so that the extracted features of the image to be detected can represent the information carried by the image to be detected, and the extracted features of the text identifier can represent the information carried by the text identifier of the object to be detected.
S603: Determine the target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
The target detection result corresponding to the image to be detected describes the association between the image to be detected and the text identifier of the object to be detected. The embodiments of the present application do not limit how this result is represented; for example, if the text identifier uniquely identifies the object to be detected, the target detection result may include the likelihood that the object to be detected is present in the image to be detected (e.g., the likelihood that each pixel of the image to be detected lies within the region occupied by the object to be detected in that image), and/or the position of the object to be detected in the image to be detected.
In addition, the embodiments of the present application do not limit how the target detection result corresponding to the image to be detected is determined; for example, the process of determining it may include steps 21 and 22:
步骤21:计算该待检测图像的提取特征与该待检测物体文本标识的提取特征之间的相似度。Step 21: Calculate the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
其中,待检测图像的提取特征与该待检测物体文本标识的提取特征之间的相似度用于描述该待检测图像携带的信息与该待检测物体文本标识携带的信息之间的相似程度。Wherein, the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected is used to describe the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected.
另外,本申请实施例不限定待检测图像的提取特征与该待检测物体文本标识的提取特征之间的相似度的表示方式,例如,可以利用h×w维的相似度矩阵进行表示,此时,该h×w维的相似度矩阵中位于第i行第j列的相似度值可以描述出该待检测图像中第i行第j列像素点携带的信息与该待检测物体文本标识携带的信息之间的相似程度,从而可以用于表示该待检测图像中第i行第j列像素点位于待检测物体在该待检测图像中所处区域内的可能性。In addition, this embodiment of the present application does not limit the representation of the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected. For example, it can be represented by an h×w-dimensional similarity matrix. , the similarity value in the i-th row and j-column in the h×w-dimensional similarity matrix can describe the information carried by the pixel in the i-th row and j-column in the image to be detected and the information carried by the text mark of the object to be detected The degree of similarity between the information can be used to indicate the possibility that the pixel point in row i and column j in the image to be detected is located in the area where the object to be detected is located in the image to be detected.
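The embodiment above leaves the concrete similarity measure open. As one illustrative sketch only (assuming cosine similarity and a d-dimensional extracted feature per pixel, neither of which is fixed by the text), the h×w-dimensional similarity matrix could be computed as follows:

```python
import numpy as np

def similarity_matrix(image_features, text_feature):
    """Cosine similarity between each pixel-level extracted feature and the
    extracted feature of the object's text identifier.

    image_features: (h, w, d) per-pixel extracted features of the image.
    text_feature:   (d,) extracted feature of the text identifier.
    Returns an (h, w) similarity matrix with values in [-1, 1].
    """
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_feature / np.linalg.norm(text_feature)
    return img @ txt  # broadcasts to shape (h, w)
```

Here the entry at row i, column j is exactly the similarity value described above for the pixel at row i, column j.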
It should be noted that, for the relevant content of step 21, reference may be made to the relevant content of S103 above; it suffices to replace "sample image" in S103 with "image to be detected" and "sample object text identifier" with "text identifier of the object to be detected".
Step 22: Determine the target detection result corresponding to the image to be detected according to a preset similarity condition and the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
The preset similarity condition may be set in advance. For example, if the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected is represented by an h×w-dimensional similarity matrix, the preset similarity condition may be "greater than a preset similarity threshold (e.g., 0.5)".
It can be seen that, when the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected is represented by an h×w-dimensional similarity matrix and the preset similarity condition is "greater than a preset similarity threshold", step 22 may specifically include: judging whether the similarity value at row i, column j of the h×w-dimensional similarity matrix is greater than the preset similarity threshold. If it is greater, it is determined that the information carried by the pixel at row i, column j of the image to be detected is relatively similar to the information carried by the text identifier of the object to be detected, so that this pixel can be determined to lie within the region occupied by the object to be detected in the image to be detected. If it is not greater, it is determined that the information carried by this pixel is not very similar to the information carried by the text identifier of the object to be detected, so that this pixel can be determined to lie outside that region.
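The per-pixel judgment of step 22 can be sketched as below. `detection_mask` follows the thresholding rule just described; `mask_to_box` is an assumed helper (not specified in the text) for turning the passing pixels into a reported object position:

```python
import numpy as np

def detection_mask(sim_matrix, threshold=0.5):
    """Pixels whose similarity value exceeds the preset similarity threshold
    are judged to lie within the region occupied by the object."""
    return sim_matrix > threshold

def mask_to_box(mask):
    """Assumed helper: smallest (row_min, col_min, row_max, col_max) box
    covering all passing pixels, or None when no pixel passes."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())
```

For instance, applying `detection_mask` with threshold 0.5 to the h×w similarity matrix yields a binary region estimate, and `mask_to_box` then gives one possible encoding of "the position of the object to be detected in the image to be detected".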
Based on the content of S601 to S603 above, after the image to be detected and the text identifier of the object to be detected are obtained, the constructed feature extraction model may first be used to perform feature extraction on the image to be detected and on the text identifier of the object to be detected, obtaining and outputting the extracted features of each; the target detection result corresponding to the image to be detected is then determined according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
It can be seen that, because the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image and the information carried by the text identifier, the target detection result determined on the basis of this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether the image contains the target object uniquely identified by the text identifier, and the position of that target object in the image), which helps improve target detection accuracy.
Furthermore, because the constructed feature extraction model can perform text feature extraction on any object text identifier according to the associations between different objects, the target detection method provided in the embodiment of the present application can perform target detection not only with the sample object text identifiers used during construction of the feature extraction model, but also with any object text identifier other than those sample object text identifiers. This helps improve the model's target detection performance on non-sample objects, and thereby the target detection performance of the target detection method provided in the embodiment of the present application.
In addition, the embodiment of the present application does not limit the execution subject of the target detection method. For example, the target detection method provided in the embodiment of the present application may be applied to a data processing device such as a terminal device or a server. The terminal device may be a smartphone, a computer, a personal digital assistant (PDA), a tablet computer, or the like. The server may be a standalone server, a cluster server, or a cloud server.
Based on the feature extraction model construction method provided by the above method embodiment, an embodiment of the present application further provides a feature extraction model construction apparatus, which is explained and described below with reference to the accompanying drawings.
Apparatus Embodiment One
For the technical details of the feature extraction model construction apparatus provided in Apparatus Embodiment One, reference may be made to the above method embodiment.
Referring to FIG. 7, which is a schematic structural diagram of a feature extraction model construction apparatus provided by an embodiment of the present application.
The feature extraction model construction apparatus 700 provided in the embodiment of the present application includes:
a sample acquisition unit 701, configured to acquire a sample pair and the actual information similarity of the sample pair, wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
a feature prediction unit 702, configured to input the sample pair into a model to be trained and obtain the extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier; and
a model updating unit 703, configured to update the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and to continue performing the step of inputting the sample pair into the model to be trained until, when a preset stop condition is reached, a feature extraction model is determined according to the model to be trained.
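The loop implemented by units 701 to 703 can be illustrated with a deliberately simplified sketch. Everything specific here is an assumption for illustration: the image encoder is held fixed, only a single d-dimensional text-identifier embedding is trained, the predicted information similarity is a per-pixel dot product, the loss is mean squared error against the actual information similarity, and the preset stop condition is the loss falling below a tolerance; the patent fixes none of these choices.

```python
import numpy as np

def train_text_embedding(image_feats, actual_sim, lr=0.5, max_steps=200, tol=1e-4):
    """Toy sketch of the train/update loop (assumptions listed above).

    image_feats : (h, w, d) extracted features of the sample image (fixed here)
    actual_sim  : (h, w)    actual information similarity of the sample pair
    Returns the learned text-identifier embedding t of shape (d,).
    """
    h, w, d = image_feats.shape
    t = np.zeros(d)                       # text-identifier embedding to learn
    for _ in range(max_steps):
        pred = image_feats @ t            # predicted information similarity, (h, w)
        err = pred - actual_sim
        loss = np.mean(err ** 2)
        if loss < tol:                    # stand-in for the preset stop condition
            break
        # gradient of the mean-squared loss with respect to t
        grad = 2 * np.einsum('hw,hwd->d', err, image_feats) / (h * w)
        t -= lr * grad                    # one "update the model to be trained" step
    return t
```

The point of the sketch is only the control flow: extract features, compare the predicted similarity with the actual information similarity, update, and repeat until the stop condition holds.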
In a possible implementation, the model to be trained includes a text feature extraction sub-model and an image feature extraction sub-model;
the process of determining the extracted features of the sample pair includes:
inputting the sample image into the image feature extraction sub-model to obtain the extracted features of the sample image output by the image feature extraction sub-model; and
inputting the sample object text identifier into the text feature extraction sub-model to obtain the extracted features of the sample object text identifier output by the text feature extraction sub-model.
In a possible implementation, the feature extraction model construction apparatus 700 further includes:
an initialization unit, configured to initialize the text feature extraction sub-model with preset prior knowledge, wherein the preset prior knowledge is used to describe the associations between different objects.
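The text does not specify what form the preset prior knowledge takes. One plausible instantiation, sketched here purely as an assumption, is to seed each object's text embedding from pretrained word vectors, so that objects that are associated with each other start with similar text features; `word_vectors` is a hypothetical name-to-vector lookup standing in for whatever prior-knowledge source is actually used.

```python
import numpy as np

def init_text_submodel(object_names, word_vectors, dim=4):
    """Assumed initialization scheme: unit-normalized word vectors as the
    initial text-identifier embeddings. Unknown names fall back to zeros."""
    table = {}
    for name in object_names:
        v = np.asarray(word_vectors.get(name, np.zeros(dim)), dtype=float)
        n = np.linalg.norm(v)
        table[name] = v / n if n > 0 else v
    return table
```

Under this scheme, two related objects (for example, two animal names whose word vectors point in similar directions) begin training with positively correlated text features, which is the effect the initialization unit is intended to achieve.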
In a possible implementation, the process of determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier includes:
determining, respectively, the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted features of the sample object text identifier; and determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier according to the similarities between the pixel-level extracted features in the feature map of the sample image and the extracted features of the sample object text identifier.
In a possible implementation, the process of determining the actual information similarity of the sample pair includes:
if the sample object text identifier uniquely identifies a sample object and the sample image includes the sample object, determining the actual information similarity of the sample pair according to the actual position of the sample object in the sample image.
Based on the above content relating to the feature extraction model construction apparatus 700, after a sample pair and the actual information similarity of the sample pair are acquired, the sample pair and its actual information similarity are first used to train the model to be trained, so that the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair is close to the actual information similarity of the sample pair. The trained model thus has good feature extraction performance, and so does the feature extraction model constructed on the basis of it, which allows the subsequent target detection process based on the constructed feature extraction model to be carried out more accurately and helps improve target detection accuracy.
Based on the target detection method provided by the above method embodiment, an embodiment of the present application further provides a target detection apparatus, which is explained and described below with reference to the accompanying drawings.
Apparatus Embodiment Two
For the technical details of the target detection apparatus provided in Apparatus Embodiment Two, reference may be made to the above method embodiment.
Referring to FIG. 8, which is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present application.
The target detection apparatus 800 provided in the embodiment of the present application includes:
an information acquisition unit 801, configured to acquire an image to be detected and a text identifier of an object to be detected;
a feature extraction unit 802, configured to input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model and obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model, wherein the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiments of the present application; and
a result determination unit 803, configured to determine the target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
Based on the above content relating to the target detection apparatus 800, after the image to be detected and the text identifier of the object to be detected are obtained, the constructed feature extraction model may first be used to perform feature extraction on the image to be detected and on the text identifier of the object to be detected, obtaining and outputting the extracted features of each; the target detection result corresponding to the image to be detected is then determined according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
It can be seen that, because the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image and the information carried by the text identifier, the target detection result determined on the basis of this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether the image contains the target object uniquely identified by the text identifier, and the position of that target object in the image), which helps improve target detection accuracy.
Furthermore, because the constructed feature extraction model can perform text feature extraction on any object text identifier according to the associations between different objects, the target detection method provided in the embodiment of the present application can perform target detection not only with the sample object text identifiers used during construction of the feature extraction model, but also with any object text identifier other than those sample object text identifiers. This helps improve the model's target detection performance on non-sample objects, and thereby the target detection performance of the target detection apparatus 800 provided in the embodiment of the present application.
Further, an embodiment of the present application also provides a device, the device including a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to execute, according to the computer program, any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
Further, an embodiment of the present application also provides a computer-readable storage medium configured to store a computer program, the computer program being used to execute any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
It should be understood that, in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" is used to describe the association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following items" or a similar expression refers to any combination of those items, including any combination of single items or plural items. For example, "at least one of a, b, or c" may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments of equivalent variation. Therefore, any simple modifications, equivalent variations, and refinements made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still fall within the protection scope of the technical solution of the present invention.

Claims (12)

1. A feature extraction model construction method, characterized in that the method comprises:
    acquiring a sample pair and an actual information similarity of the sample pair, wherein the sample pair comprises a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe a degree of similarity between information actually carried by the sample image and information actually carried by the sample object text identifier;
    inputting the sample pair into a model to be trained to obtain extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair comprise extracted features of the sample image and extracted features of the sample object text identifier;
    determining a similarity between the extracted features of the sample image and the extracted features of the sample object text identifier as a predicted information similarity of the sample pair; and
    updating the model to be trained according to the actual information similarity of the sample pair and the predicted information similarity of the sample pair, and continuing to perform the step of inputting the sample pair into the model to be trained until, when a preset stop condition is reached, a feature extraction model is determined according to the model to be trained.
2. The method according to claim 1, characterized in that the model to be trained comprises a text feature extraction sub-model and an image feature extraction sub-model; and
    a process of determining the extracted features of the sample pair comprises:
    inputting the sample image into the image feature extraction sub-model to obtain the extracted features of the sample image output by the image feature extraction sub-model; and
    inputting the sample object text identifier into the text feature extraction sub-model to obtain the extracted features of the sample object text identifier output by the text feature extraction sub-model.
3. The method according to claim 2, characterized in that, before the inputting the sample pair into the model to be trained, the method further comprises:
    initializing the text feature extraction sub-model with preset prior knowledge, so that a similarity between text features output by the initialized text feature extraction sub-model for any two objects is positively correlated with a degree of association between the two objects, wherein the preset prior knowledge is used to describe degrees of association between different objects.
4. The method according to claim 1, characterized in that, if the extracted features of the sample image comprise a feature map of the sample image, a process of determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier comprises:
    determining, respectively, a similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted features of the sample object text identifier; and
    determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier according to the similarities between the pixel-level extracted features in the feature map of the sample image and the extracted features of the sample object text identifier.
5. The method according to claim 1, characterized in that a process of determining the actual information similarity of the sample pair comprises:
    if the sample object text identifier uniquely identifies a sample object and the sample image comprises the sample object, determining the actual information similarity of the sample pair according to an actual position of the sample object in the sample image.
6. The method according to claim 5, characterized in that, if the actual information similarity of the sample pair comprises an actual information similarity corresponding to each pixel in the sample image, the determining the actual information similarity of the sample pair according to the actual position of the sample object in the sample image comprises:
    determining an image region of the sample object according to the actual position of the sample object in the sample image;
    determining the actual information similarity corresponding to each pixel within the image region of the sample object as a first preset similarity value; and
    determining the actual information similarity corresponding to each pixel in the sample image other than the image region of the sample object as a second preset similarity value.
7. A target detection method, characterized in that the method comprises:
    acquiring an image to be detected and a text identifier of an object to be detected;
    inputting the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model to obtain extracted features of the image to be detected and extracted features of the text identifier of the object to be detected output by the feature extraction model, wherein the feature extraction model is constructed using the feature extraction model construction method according to any one of claims 1-6; and
    determining a target detection result corresponding to the image to be detected according to a similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  8. A feature extraction model construction apparatus, comprising:
    a sample acquisition unit, configured to acquire a sample pair and an actual information similarity of the sample pair; wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair describes the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
    a feature prediction unit, configured to input the sample pair into a model to be trained and obtain extracted features of the sample pair output by the model to be trained; wherein the extracted features of the sample pair include extracted features of the sample image and extracted features of the sample object text identifier;
    a model updating unit, configured to update the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and to repeat the step of inputting the sample pair into the model to be trained until a preset stop condition is reached, whereupon a feature extraction model is determined according to the model to be trained.
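The training behaviour these units describe — predict a similarity from the two extracted features, compare it with the actual information similarity, update the model, and repeat until a stop condition — can be sketched with toy linear encoders. Everything below (the linear image/text branches, dot-product similarity, squared-error loss, and learning rate) is an illustrative assumption, not the patent's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear encoders standing in for the image and text branches
# of the model to be trained (assumed architecture for illustration).
W_img = rng.normal(size=(4, 3)) * 0.1
W_txt = rng.normal(size=(4, 3)) * 0.1

def train_step(x_img, x_txt, label_sim, lr=0.1):
    """One update: pull the predicted similarity (dot product of the
    two extracted features) toward the actual information similarity."""
    global W_img, W_txt
    f_img = W_img @ x_img          # extracted feature of the sample image
    f_txt = W_txt @ x_txt          # extracted feature of the text identifier
    pred = f_img @ f_txt           # predicted similarity of the sample pair
    err = pred - label_sim         # gradient factor of the squared error
    W_img -= lr * err * np.outer(f_txt, x_img)
    W_txt -= lr * err * np.outer(f_img, x_txt)
    return 0.5 * err ** 2

x_img = np.array([1.0, 0.0, 0.5])   # stand-in for a sample image
x_txt = np.array([0.2, 1.0, 0.0])   # stand-in for a sample object text identifier
losses = [train_step(x_img, x_txt, label_sim=1.0) for _ in range(200)]
print(losses[-1] < losses[0])  # loss shrinks as training proceeds
```

A fixed iteration count plays the role of the "preset stop condition" here; a real implementation would typically stop on loss convergence or a validation metric, and would train over batches of many sample pairs rather than a single one.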
  9. A target detection apparatus, comprising:
    an information acquisition unit, configured to acquire an image to be detected and a text identifier of an object to be detected;
    a feature extraction unit, configured to input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model and obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model; wherein the feature extraction model is constructed using the feature extraction model construction method according to any one of claims 1-5;
    a result determination unit, configured to determine a target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  10. A device, wherein the device comprises a processor and a memory:
    the memory is configured to store a computer program;
    the processor is configured to execute, according to the computer program, the feature extraction model construction method according to any one of claims 1-6 or the target detection method according to claim 7.
  11. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store a computer program, and the computer program is used to execute the feature extraction model construction method according to any one of claims 1-6 or the target detection method according to claim 7.
  12. A computer program product, wherein, when the computer program product runs on a terminal device, the terminal device is caused to execute the feature extraction model construction method according to any one of claims 1-6 or the target detection method according to claim 7.
PCT/CN2022/089230 2021-06-28 2022-04-26 Feature extraction model construction method and target detection method, and device therefor WO2023273572A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110723063.XA CN113591839B (en) 2021-06-28 2021-06-28 Feature extraction model construction method, target detection method and device
CN202110723063.X 2021-06-28

Publications (1)

Publication Number Publication Date
WO2023273572A1 true WO2023273572A1 (en) 2023-01-05

Family

ID=78245050

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089230 WO2023273572A1 (en) 2021-06-28 2022-04-26 Feature extraction model construction method and target detection method, and device therefor

Country Status (2)

Country Link
CN (1) CN113591839B (en)
WO (1) WO2023273572A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591839B (en) * 2021-06-28 2023-05-09 北京有竹居网络技术有限公司 Feature extraction model construction method, target detection method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN110019889A (en) * 2017-12-01 2019-07-16 北京搜狗科技发展有限公司 Training characteristics extract model and calculate the method and relevant apparatus of picture and query word relative coefficient
CN111091597A (en) * 2019-11-18 2020-05-01 贝壳技术有限公司 Method, apparatus and storage medium for determining image pose transformation
US20200242197A1 (en) * 2019-01-30 2020-07-30 Adobe Inc. Generating summary content tuned to a target characteristic using a word generation model
CN111897950A (en) * 2020-07-29 2020-11-06 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN111985616A (en) * 2020-08-13 2020-11-24 沈阳东软智能医疗科技研究院有限公司 Image feature extraction method, image retrieval method, device and equipment
CN113591839A (en) * 2021-06-28 2021-11-02 北京有竹居网络技术有限公司 Feature extraction model construction method, target detection method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020592B (en) * 2019-02-03 2024-04-09 平安科技(深圳)有限公司 Object detection model training method, device, computer equipment and storage medium
CN111782921A (en) * 2020-03-25 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for searching target
CN112990297B (en) * 2021-03-10 2024-02-02 北京智源人工智能研究院 Training method, application method and device of multi-mode pre-training model
CN112990204B (en) * 2021-05-11 2021-08-24 北京世纪好未来教育科技有限公司 Target detection method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN113591839A (en) 2021-11-02
CN113591839B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN110162593B (en) Search result processing and similarity model training method and device
CN107239731B (en) Gesture detection and recognition method based on Faster R-CNN
WO2023087558A1 (en) Small sample remote sensing image scene classification method based on embedding smoothing graph neural network
Zhang et al. Real-time sow behavior detection based on deep learning
WO2020155518A1 (en) Object detection method and device, computer device and storage medium
CN109993102B (en) Similar face retrieval method, device and storage medium
CN109063719B (en) Image classification method combining structure similarity and class information
CN112668579A (en) Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution
CN109165309B (en) Negative example training sample acquisition method and device and model training method and device
CN110766041A (en) Deep learning-based pest detection method
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN111931859B (en) Multi-label image recognition method and device
CN109977253B (en) Semantic and content-based rapid image retrieval method and device
CN112528022A (en) Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
WO2023134402A1 (en) Calligraphy character recognition method based on siamese convolutional neural network
CN111523586B (en) Noise-aware-based full-network supervision target detection method
WO2023273572A1 (en) Feature extraction model construction method and target detection method, and device therefor
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
Wei et al. Food image classification and image retrieval based on visual features and machine learning
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
CN110083724A (en) A kind of method for retrieving similar images, apparatus and system
CN108428234B (en) Interactive segmentation performance optimization method based on image segmentation result evaluation
CN113723558A (en) Remote sensing image small sample ship detection method based on attention mechanism
CN105844299B (en) A kind of image classification method based on bag of words
CN115482436B (en) Training method and device for image screening model and image screening method

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE