WO2023273572A1 - Feature extraction model construction method and target detection method, and related device

Feature extraction model construction method and target detection method, and related device

Info

Publication number
WO2023273572A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
image
model
feature extraction
similarity
Prior art date
Application number
PCT/CN2022/089230
Other languages
English (en)
Chinese (zh)
Inventor
江毅
孙培泽
杨朔
袁泽寰
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023273572A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 - Matching criteria, e.g. proximity measures

Definitions

  • the present application relates to the technical field of image processing, and in particular to a feature extraction model construction method, a target detection method, and related devices.
  • target detection, also known as target extraction, is an image segmentation technology based on the geometry and statistical features of targets; it has a wide range of applications (for example, target detection can be applied to robotics, automatic driving, and other fields).
  • the present application provides a feature extraction model construction method, a target detection method, and related devices, which can improve the accuracy of target detection.
  • the embodiment of the present application provides a method for constructing a feature extraction model, the method comprising:
  • obtaining a sample pair and the actual information similarity of the sample pair, wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
  • inputting the sample pair into the model to be trained, and obtaining the extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier;
  • updating the model to be trained according to the actual information similarity of the sample pair and the predicted information similarity of the sample pair, and continuing to execute the step of inputting the sample pair into the model to be trained until the preset stop condition is reached, whereupon the feature extraction model is determined according to the model to be trained.
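  • As an illustration of the claimed training loop, the following minimal sketch assumes PyTorch, an image_encoder returning an h×w×c feature map, a text_encoder returning a 1×c feature vector, cosine similarity as the comparison, and a binary cross-entropy loss; none of these specific choices are mandated by the embodiment.

```python
import torch
import torch.nn.functional as F

def train_feature_extraction_model(image_encoder, text_encoder, sample_pairs,
                                   actual_similarities, max_steps=10000):
    """Hedged sketch of S101-S106: train until a preset stop condition holds.

    sample_pairs:        list of (sample_image, sample_object_text) inputs
    actual_similarities: one h x w tensor per sample pair (1 inside the
                         sample object's region, 0 outside)
    """
    params = list(image_encoder.parameters()) + list(text_encoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)

    for step in range(max_steps):
        for (image, object_text), y_actual in zip(sample_pairs, actual_similarities):
            feat_map = image_encoder(image)          # extracted features of the sample image (h x w x c)
            text_feat = text_encoder(object_text)    # extracted features of the text identifier (1 x c)

            # predicted information similarity of the sample pair (h x w)
            y_pred = F.cosine_similarity(feat_map, text_feat.view(1, 1, -1), dim=-1)

            # update the model to be trained from actual vs. predicted similarity
            loss = F.binary_cross_entropy_with_logits(y_pred, y_actual)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if loss.item() < 1e-3:                       # one example of a preset stop condition
            break
    return image_encoder, text_encoder               # together: the feature extraction model
```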
  • the model to be trained includes a text feature extraction sub-model and an image feature extraction sub-model;
  • the process of determining the extracted features of the sample pair includes:
  • in a possible implementation, before the sample pair is input into the model to be trained, the method further includes:
  • using preset prior knowledge, performing initialization processing on the text feature extraction sub-model, so that the similarity between the text features output by the initialized text feature extraction sub-model for any two objects is positively correlated with the degree of association between the two objects; wherein the preset prior knowledge is used to describe the degree of association between different objects.
  • the process of determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier includes:
  • the process of determining the actual information similarity of the sample pair includes:
  • if the sample object text identifier is used to uniquely identify the sample object, and the sample image includes the sample object, the actual information similarity is determined according to the actual position of the sample object in the sample image.
  • the embodiment of the present application also provides a target detection method, the method comprising:
  • the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiments of the present application;
  • a target detection result corresponding to the image to be detected is determined according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the embodiment of the present application also provides a feature extraction model construction device, including:
  • a sample acquisition unit configured to acquire a sample pair and the actual information similarity of the sample pair, wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
  • a feature prediction unit configured to input the sample pair into the model to be trained and obtain the extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier;
  • a model updating unit configured to update the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and to continue executing the step of inputting the sample pair into the model to be trained until a preset stop condition is reached, whereupon a feature extraction model is determined according to the model to be trained.
  • the embodiment of the present application also provides a target detection device, including:
  • an information acquisition unit configured to acquire an image to be detected and a text identifier of an object to be detected;
  • a feature extraction unit configured to input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model, and to obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model; wherein the feature extraction model is constructed using any implementation of the method for constructing a feature extraction model provided in the embodiments of the present application;
  • a result determination unit configured to determine the target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the embodiment of the present application also provides a device, which is characterized in that the device includes a processor and a memory:
  • the memory is used to store computer programs
  • the processor is configured to execute any implementation of the feature extraction model construction method provided in the embodiment of the present application according to the computer program, or execute any implementation of the target detection method provided in the embodiment of the application.
  • the embodiment of the present application also provides a computer-readable storage medium, which is characterized in that the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product, which is characterized in that, when the computer program product runs on a terminal device, the terminal device executes any implementation of the feature extraction model construction method provided in the embodiments of the present application, or any implementation of the target detection method provided in the embodiments of the present application.
  • the embodiment of the present application has at least the following advantages:
  • in the present application, the feature extraction model is first constructed using the sample pair and the actual information similarity of the sample pair, so that the constructed feature extraction model has better feature extraction performance;
  • then, the constructed feature extraction model performs feature extraction on the image to be detected and the text identifier of the object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; finally, the target detection result corresponding to the image to be detected is determined according to the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • because the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected, the target detection result determined based on this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether the image to be detected contains the target object uniquely identified by the text identifier of the object to be detected, and the position of that target object in the image to be detected), which is beneficial to improving the accuracy of target detection.
  • Fig. 1 is a flow chart of a method for constructing a feature extraction model provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the nth sample pair provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a sample image including multiple objects provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a model to be trained provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the relationship between different objects provided by the embodiment of the present application.
  • FIG. 6 is a flow chart of a target detection method provided in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a feature extraction model construction device provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an object detection device provided by an embodiment of the present application.
  • it can be understood that, when an image includes a target object (such as a cat), the information carried by the image should be similar to the information carried by the object text identifier of the target object (for example, the information carried by each pixel in the region of the image where the target object is located should be consistent with the information carried by the object text identifier of the target object).
  • in view of this, the embodiment of the present application provides a method for constructing a feature extraction model. The method includes: obtaining a sample pair and the actual information similarity of the sample pair, where the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe the similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier; inputting the sample pair into the model to be trained to obtain the extracted features of the sample pair output by the model to be trained, where the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier; and updating the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and continuing to execute the step of inputting the sample pair into the model to be trained until the preset stop condition is reached, whereupon the feature extraction model is determined according to the model to be trained.
  • in this way, the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair can accurately represent the information carried by the sample image and the information carried by the sample object text identifier, so that the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier closely approaches the actual information similarity of the sample pair. The trained model therefore has better feature extraction performance, so the feature extraction model built based on it also has better feature extraction performance, and the subsequent target detection process can be performed more accurately based on the built feature extraction model, which is conducive to improving object detection accuracy.
  • the embodiment of the present application does not limit the execution subject of the feature extraction model construction method.
  • the feature extraction model construction method provided in the embodiment of the present application can be applied to data processing devices such as terminal devices or servers.
  • the terminal device may be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), or a tablet computer.
  • the server can be an independent server, a cluster server or a cloud server.
  • for ease of understanding, the following first introduces the relevant content of the feature extraction model construction method (that is, the construction process of the feature extraction model), and then introduces the relevant content of the target detection method (that is, the application process of the feature extraction model).
  • referring to Fig. 1, this figure is a flow chart of a method for constructing a feature extraction model provided by an embodiment of the present application.
  • the feature extraction model construction method provided in the embodiment of the present application includes S101-S106:
  • S101: Obtain a sample pair and the actual information similarity of the sample pair.
  • the sample pair refers to the model input data that needs to be input to the model to be trained during the training process of the model to be trained; and the sample pair includes a sample image and a text identifier of a sample object.
  • the sample image refers to an image that needs to be subjected to target detection processing.
  • the sample object text identifier is used to uniquely identify the sample object.
  • for example, the sample object text identifier may be an object category name (for example, "cat").
  • the embodiment of the present application does not limit the number of sample pairs, for example, the number of sample pairs may be N.
  • N is a positive integer. That is, the model to be trained can be trained using N sample pairs.
  • the embodiment of the present application does not limit the sample type of a sample pair. For example, when the nth sample pair includes the nth sample image and the nth sample object text identifier, and the nth sample object text identifier is used to uniquely identify the nth sample object: if the nth sample object exists in the nth sample image, the nth sample pair can be determined to be a positive sample; if the nth sample object does not exist in the nth sample image, the nth sample pair can be determined to be a negative sample.
  • the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier, so that the actual information similarity of the sample pair can accurately represent the relationship between the sample image and the sample object text identifier. Specifically, when the sample object text identifier is used to uniquely identify the sample object, the greater the actual information similarity of the sample pair, the greater the possibility that the sample object exists in the sample image; the smaller the actual information similarity of the sample pair, the smaller the possibility that the sample object exists in the sample image.
  • for a positive sample, the information actually carried by the nth sample image should be as close as possible to the information actually carried by the nth sample object text identifier (for example, the information actually carried by each pixel in the area where the nth sample object is located in the nth sample image should be consistent with the information actually carried by the nth sample object text identifier).
  • to this end, the embodiment of the present application provides a process of obtaining the actual information similarity of the sample pair, which may specifically include: if the sample object text identifier is used to uniquely identify the sample object, and the sample image includes the sample object, determining the actual information similarity of the sample pair according to the actual position of the sample object in the sample image.
  • the embodiment of the present application does not limit the determination process of the actual information similarity of the sample pair.
  • specifically, the process may include: first, according to the actual position of the sample object in the sample image, determining the image area of the sample object, so that the image area of the sample object represents the region occupied by the sample object in the sample image; then, determining the actual information similarity corresponding to each pixel within the image area of the sample object as a first preset similarity value (for example, 1), and determining the actual information similarity corresponding to each pixel of the sample image outside the image area of the sample object as a second preset similarity value (for example, 0).
  • for example, if the nth sample pair includes the nth sample image and the nth sample object text identifier, and the nth sample image is an h×w×3-dimensional image, then the actual information similarity of the nth sample pair can be an h×w-dimensional matrix determined according to formulas (1)-(2):

    Y_n = [ y_ij^n ]_{h×w}    (1)

    y_ij^n = 1 if p_ij ∈ Z_n;  y_ij^n = 0 if p_ij ∉ Z_n    (2)

  • here, Y_n indicates the actual information similarity of the nth sample pair; p_ij indicates the position of the pixel in the ith row and jth column of the nth sample image, where i and j are positive integers with i ≤ h and j ≤ w, and h and w are positive integers; Z_n represents the area where the nth sample object is located in the nth sample image; and y_ij^n indicates the similarity between the information actually carried by the pixel in the ith row and jth column of the nth sample image and the information actually carried by the nth sample object text identifier. In particular, if p_ij ∈ Z_n, the area where the nth sample object is located includes the pixel in the ith row and jth column, so the information actually carried by that pixel can be determined to be consistent with the information actually carried by the nth sample object text identifier.
  • referring to Figure 2, the determination process may specifically be as follows: the actual information similarity of the nth sample pair includes the actual information similarity corresponding to each pixel in the nth sample image; if the pixel in the ith row and jth column of the nth sample image is located within the area where the nth sample object is located in the nth sample image (within the object bounding box shown in Figure 2), the actual information similarity corresponding to that pixel is 1; if the pixel in the ith row and jth column of the nth sample image is located outside that area (outside the object bounding box shown in Figure 2), the actual information similarity corresponding to that pixel is 0.
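  • A minimal sketch of this construction, assuming the area Z_n is given as an axis-aligned bounding box (top, left, bottom, right); the box format is an assumption for illustration only.

```python
import numpy as np

def actual_information_similarity(h, w, bbox):
    """Build the h x w matrix of formulas (1)-(2) for one sample pair."""
    top, left, bottom, right = bbox                # assumed encoding of the region Z_n
    y = np.zeros((h, w), dtype=np.float32)         # second preset similarity value: 0
    y[top:bottom, left:right] = 1.0                # first preset similarity value: 1
    return y

# e.g. a 5 x 8 sample image whose sample object occupies rows 1-3, columns 2-6
y_n = actual_information_similarity(5, 8, (1, 2, 4, 7))
```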
  • in addition, as shown in Figure 3, if the nth sample pair includes the nth sample image and the nth sample object text identifier, the nth sample image contains Q (e.g., 3) objects, and the nth sample object text identifier is used to uniquely identify the qth object in the nth sample image (such as the dog, person, or horse in Figure 3), then the actual information similarity of the nth sample pair can be determined according to the area occupied by the qth object in the nth sample image. Specifically, the actual information similarity corresponding to each pixel within the area occupied by the qth object in the nth sample image is determined as the first preset similarity value (for example, 1), and the actual information similarity corresponding to each pixel outside that area is determined as the second preset similarity value (for example, 0). Here, q and Q are positive integers, and q ≤ Q.
  • in this way, for a sample image including multiple objects, the object text identifier of each object can be combined with the sample image to construct a sample pair, and the area occupied by the qth object in the nth sample image is used to determine the actual information similarity of the corresponding sample pair.
  • in Figure 3, "dog" refers to the object text identifier of the dog, "person" refers to the object text identifier of the person, and "horse" refers to the object text identifier of the horse.
  • based on the above, after the sample image and the sample object text identifier are acquired, the similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier can be determined according to the association between the sample image and the sample object text identifier (for example, whether the sample image contains the sample object uniquely identified by the sample object text identifier, and the location of the sample object in the sample image), so that during the subsequent training of the model to be trained, this similarity can be taken as the learning goal.
  • S102: Input the sample pair into the model to be trained, and obtain the extracted features of the sample pair output by the model to be trained.
  • the extracted features of the sample pair are used to represent the information carried by the sample pair; and the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the text identifier of the sample object.
  • the extracted features of the sample image are used to represent the information that the sample image is predicted to carry.
  • the embodiment of the present application does not limit the representation of the extracted features of the sample image. For example, if a sample image is h×w×3-dimensional, the extracted features of the sample image can be represented by an h×w×c-dimensional feature map.
  • the extracted features of the sample object text identifier are used to represent the information that the sample object text identifier is predicted to carry.
  • the embodiment of the present application does not limit the representation manner of the extracted features of the sample object text identifier.
  • the extracted features of a sample object text identifier may be represented by a 1×c-dimensional feature vector.
  • the model to be trained is used to perform feature extraction on input data of the model to be trained (for example, perform text feature extraction on text data, and/or perform image feature extraction on image data).
  • the embodiment of the present application does not limit the structure of the model to be trained.
  • as shown in Fig. 4, the model to be trained 400 may include a text feature extraction sub-model 401 and an image feature extraction sub-model 402.
  • the process of using the model to be trained 400 to determine the extracted features of the sample pair may specifically include steps 11-12:
  • Step 11: Input the sample image into the image feature extraction sub-model 402 to obtain the extracted features of the sample image output by the image feature extraction sub-model 402.
  • the image feature extraction sub-model 402 is used for image feature extraction; moreover, the embodiment of the present application does not limit the implementation of the image feature extraction sub-model 402, which can be implemented using any existing or future model structure with an image feature extraction function.
  • Step 12: Input the sample object text identifier into the text feature extraction sub-model 401, and obtain the extracted features of the sample object text identifier output by the text feature extraction sub-model 401.
  • the text feature extraction sub-model 401 is used for text feature extraction; moreover, the embodiment of the present application does not limit the implementation of the text feature extraction sub-model 401, which can be implemented using any existing or future model structure with a text feature extraction function (such as BERT, GPT-3, or other language models).
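  • For illustration, the two sub-models could be realized as follows; the BERT text encoder (via the Hugging Face transformers library) and the single-convolution image encoder are assumptions, since the embodiment leaves both structures open.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextFeatureExtractor(nn.Module):
    """Sketch of text feature extraction sub-model 401 (BERT assumed)."""
    def __init__(self, c=256, name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.encoder = AutoModel.from_pretrained(name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, c)

    def forward(self, object_text):
        tokens = self.tokenizer(object_text, return_tensors="pt")
        hidden = self.encoder(**tokens).last_hidden_state[:, 0]  # [CLS] token embedding
        return self.proj(hidden)                                 # 1 x c text feature

class ImageFeatureExtractor(nn.Module):
    """Sketch of image feature extraction sub-model 402 (toy conv net assumed)."""
    def __init__(self, c=256):
        super().__init__()
        self.net = nn.Conv2d(3, c, kernel_size=3, padding=1)

    def forward(self, image):                       # image: 1 x 3 x h x w
        fmap = self.net(image)                      # 1 x c x h x w
        return fmap.squeeze(0).permute(1, 2, 0)     # h x w x c pixel-level features
```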
  • based on the above, the image feature extraction sub-model 402 in the model to be trained 400 can perform image feature extraction on the sample image in the sample pair, obtaining and outputting the extracted features of the sample image, and the text feature extraction sub-model 401 can perform text feature extraction on the sample object text identifier in the sample pair, obtaining and outputting the extracted features of the sample object text identifier, so that these extracted features can represent the information that the sample image and the sample object text identifier are respectively predicted to carry.
  • the embodiment of the present application also provides a possible implementation of the feature extraction model construction method.
  • in this implementation, in addition to S101-S106, the feature extraction model construction method further includes S107: using preset prior knowledge, perform initialization processing on the text feature extraction sub-model 401.
  • the preset prior knowledge is used to describe the degree of association between different objects (for example, as shown in Figure 5, cats and tigers both belong to the cat family, so the degree of association between cats and tigers is high; lions and lionesses are both lions, so the degree of association between lions and lionesses is even higher).
  • if the degree of association between two objects is 1, the two objects belong to the same type of object; if the degree of association between two objects is 0, the two objects have nothing in common (that is, there is no association relationship between the two objects).
  • the embodiment of the present application does not limit the preset prior knowledge, for example, the preset prior knowledge may include a pre-built object knowledge graph.
  • the object knowledge graph can be used to describe the degree of association between different objects, and it can be constructed in advance based on a large amount of object-related knowledge information.
  • this embodiment of the present application does not limit the implementation manner of "initialization processing" in S107.
  • for example, the "initialization processing" in S107 may refer to pre-training; that is, the text feature extraction sub-model 401 is pre-trained using the preset prior knowledge, so that the similarity between the text features output by the initialized text feature extraction sub-model 401 for any two objects is positively correlated with the degree of association between the two objects.
  • for the initialized text feature extraction sub-model 401: if the preset prior knowledge indicates that the degree of association between a first object and a second object is higher, the similarity between the text features (such as v_5 and v_3 in FIG. 5) output by the initialized text feature extraction sub-model 401 for the first object (such as a cat) and the second object (such as a lion) is higher; if the preset prior knowledge indicates that the degree of association between the first object and the second object is lower, the similarity between the text features output for them is lower.
  • in Fig. 5, v_1 indicates the text feature output by the initialized text feature extraction sub-model 401 for the tiger; v_2 indicates the text feature output for the leopard; ... (and so on); v_6 indicates the text feature output for the lynx.
  • the embodiment of the present application does not limit the execution time of S107, and it only needs to be completed before S102 is executed (that is, S107 only needs to be completed before training the model to be trained).
  • in this way, the text feature extraction sub-model 401 is pre-trained so that it can learn to perform feature extraction according to the preset prior knowledge; the subsequent training of the model to be trained 400 then continues to optimize the text feature extraction performance of the text feature extraction sub-model 401, so that the text feature extraction sub-model 401 in the trained model 400 can better perform feature extraction based on the preset prior knowledge. This is conducive to improving the feature extraction performance of the model to be trained 400 and hence of the feature extraction model constructed based on it, and further helps to improve target detection performance when the feature extraction model is used for target detection.
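  • One hedged realization of the pre-training in S107: nudge the cosine similarity of text features toward the association degree read from the object knowledge graph. The association mapping and its [0, 1] value range are assumed data formats, not specified by the embodiment.

```python
import torch
import torch.nn.functional as F

def pretrain_with_prior_knowledge(text_encoder, association, object_names, steps=100):
    """Sketch of S107: align text-feature similarity with prior association."""
    optimizer = torch.optim.Adam(text_encoder.parameters(), lr=1e-4)
    pairs = [(a, b) for a in object_names for b in object_names if a < b]

    for _ in range(steps):
        loss = 0.0
        for a, b in pairs:
            va, vb = text_encoder(a), text_encoder(b)      # 1 x c features (e.g. v_5, v_3)
            sim = F.cosine_similarity(va, vb, dim=-1)
            # push feature similarity toward the prior association degree in [0, 1]
            loss = loss + (sim - association[(a, b)]) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return text_encoder
```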
  • based on the above, the nth sample pair can be input into the model to be trained, so that the model to be trained performs feature extraction on the nth sample image and the nth sample object text identifier in the nth sample pair respectively, obtaining and outputting the extracted features of the nth sample image and the extracted features of the nth sample object text identifier (that is, the extracted features of the nth sample pair), so that these extracted features can respectively represent the information that the nth sample image and the nth sample object text identifier are predicted to carry. Here, n is a positive integer, n ≤ N, N is a positive integer, and N represents the number of sample pairs.
  • S103: Determine the predicted information similarity of the sample pair. Here, the predicted information similarity of the sample pair refers to the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, so that it can describe the degree of similarity between the information that the sample image is predicted to carry and the information that the sample object text identifier is predicted to carry.
  • the embodiment of the present application does not limit the method of determining the similarity of the prediction information of the sample pair (that is, the implementation of S103).
  • S103 may specifically include S1031-S1032:
  • S1031: Determine the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted feature of the sample object text identifier.
  • the feature map of the sample image is used to represent the information carried by the sample image; moreover, the embodiment of the present application does not limit the feature map of the sample image. For example, if a sample image is h×w×3-dimensional and the extracted feature of the sample object text identifier is 1×c-dimensional, the feature map of the sample image can be h×w×c-dimensional, where h, w, and c are positive integers.
  • the embodiment of the present application does not limit the representation of the feature map of the sample image. For example, the feature map of the sample image can be represented by h×w pixel-level extracted features, each of which is 1×c-dimensional; the pixel-level extracted feature located in the ith row and jth column of the feature map is used to represent the information that the pixel in the ith row and jth column of the sample image is predicted to carry.
  • for example, S1031 may be implemented using formula (3):

    b_ij = S(F_ij^n, H_n)    (3)

  • here, b_ij represents the similarity between the pixel-level extracted feature located in the ith row and jth column of the feature map of the sample image and the extracted feature of the sample object text identifier, so that b_ij can describe the degree of similarity between the information that the pixel in the ith row and jth column of the sample image is predicted to carry and the information that the sample object text identifier is predicted to carry; F_ij^n represents the pixel-level extracted feature located in the ith row and jth column of the feature map of the sample image, a 1×c-dimensional feature vector describing the information that the corresponding pixel is predicted to carry; H_n represents the extracted feature of the sample object text identifier, a 1×c-dimensional feature vector describing the information that the sample object text identifier is predicted to carry; S(·) denotes a similarity calculation; and i and j are positive integers with i ≤ h and j ≤ w.
  • S1032: Determine the predicted information similarity of the sample pair according to the similarities determined in S1031; this may be calculated using formula (4), for example by collecting the values b_ij into an h×w-dimensional predicted similarity matrix.
  • in this way, the predicted information similarity of the nth sample pair can be calculated according to the extracted features of the nth sample image and the extracted features of the nth sample object text identifier, so that the predicted information similarity of the nth sample pair can accurately describe the degree of similarity between the information that the nth sample image is predicted to carry and the information that the nth sample object text identifier is predicted to carry, and the feature extraction performance of the model to be trained can subsequently be evaluated based on the predicted information similarity of the nth sample pair. Here, n is a positive integer, n ≤ N, N is a positive integer, and N represents the number of sample pairs.
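  • A sketch of S1031-S1032 under the assumption that S(·) is cosine similarity and that formula (4) collects the per-pixel values b_ij into an h×w matrix:

```python
import torch.nn.functional as F

def predicted_information_similarity(feature_map, text_feature):
    """Per-pixel similarity between the sample-image feature map (h x w x c)
    and the text-identifier feature (1 x c); cosine similarity assumed."""
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c)           # one row per pixel-level feature
    b = F.cosine_similarity(flat, text_feature.expand(h * w, c), dim=-1)
    return b.reshape(h, w)                         # h x w predicted similarity matrix
```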
  • S104: Determine whether the preset stop condition is met; if yes, execute S106; if not, execute S105.
  • the preset stop condition can be set in advance.
  • for example, the preset stop condition can be that the loss value of the model to be trained is lower than a preset loss threshold, that the rate of change of the loss value of the model to be trained is lower than a preset rate-of-change threshold (that is, the model to be trained has converged), or that the number of updates of the model to be trained reaches a preset count threshold.
  • the loss value of the model to be trained is used to describe the feature extraction performance of the model to be trained; the embodiment of the present application does not limit how this loss value is calculated, and any existing or future method capable of calculating the loss value of the model to be trained according to the predicted information similarity of the sample pair and the actual information similarity of the sample pair can be used.
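  • Since the embodiment leaves the loss open, the sketch below shows one common choice (per-pixel binary cross-entropy between the predicted and actual information similarities) together with the three example stop conditions described above; all thresholds are illustrative assumptions.

```python
import torch.nn.functional as F

def model_loss(predicted_similarity, actual_similarity):
    """One possible loss value for the model to be trained (an assumption,
    not mandated by the text): per-pixel binary cross-entropy between the
    h x w predicted and actual information similarities."""
    return F.binary_cross_entropy_with_logits(predicted_similarity, actual_similarity)

def reached_stop_condition(loss_value, prev_loss, update_count,
                           loss_threshold=1e-3, change_threshold=1e-5,
                           max_updates=10000):
    """The three example stop conditions: loss below a threshold, loss change
    below a threshold (convergence), or update count reaching a limit."""
    return (loss_value < loss_threshold
            or abs(prev_loss - loss_value) < change_threshold
            or update_count >= max_updates)
```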
  • S105: Update the model to be trained according to the predicted information similarity of the sample pair and the actual information similarity of the sample pair, and return to S102.
  • it can be understood that if the preset stop condition has not been reached, the feature extraction performance of the current round of the model to be trained is still relatively poor, so the model to be trained can be updated based on the difference between the predicted information similarity of the sample pair and the actual information similarity of the sample pair, giving the updated model better feature extraction performance; S102 and its subsequent steps are then executed again with the updated model to be trained.
  • S106: If the preset stop condition has been reached, construct the feature extraction model according to the current round of the model to be trained (for example, directly determine the current round of the model to be trained as the feature extraction model; or determine the model structure and model parameters of the feature extraction model according to the model structure and model parameters of the current round of the model to be trained, so that they remain the same), so that the feature extraction performance of the constructed feature extraction model is consistent with that of the current round of the model to be trained, and the constructed feature extraction model also has better feature extraction performance.
  • based on the relevant content of S101 to S106 above, in the feature extraction model construction method, after the sample pair and the actual information similarity of the sample pair are obtained, the sample pair and its actual information similarity are first used to train the model to be trained, so that the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair closely approaches the actual information similarity of the sample pair. The trained model therefore has better feature extraction performance, the feature extraction model constructed based on it also has better feature extraction performance, and the subsequent target detection process can be performed more accurately based on the constructed feature extraction model, which is conducive to improving the accuracy of target detection.
  • after the feature extraction model is constructed, it can be used for target detection. Based on this, an embodiment of the present application further provides a target detection method, which will be described below with reference to the accompanying drawings.
  • referring to FIG. 6, this figure is a flow chart of a target detection method provided by an embodiment of the present application.
  • the target detection method provided in the embodiment of this application includes S601-S603:
  • S601: Obtain an image to be detected and a text identifier of an object to be detected.
  • the image to be detected refers to an image that needs to be subjected to target detection processing.
  • the text identifier of the object to be detected is used to uniquely identify the object to be detected; that is, S601-S603 may be used to determine whether the image to be detected contains the object to be detected that is uniquely identified by the text identifier of the object to be detected.
  • the embodiment of the present application does not limit the text identification of the object to be detected.
  • for example, the text identifier of the object to be detected can be any sample object text identifier used in the process of building the feature extraction model, or any object text identifier other than the sample object text identifiers used in that process. For instance, the text identifier of the object to be detected may be "tiger".
  • it can be seen that the target detection method provided in the embodiment of the present application is an open-world target detection method.
  • S602: Input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model, and obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model.
  • the feature extraction model is used to perform feature extraction on its input data; the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiments of the present application, and for details, please refer to the method embodiment above.
  • the extracted features of the image to be detected are used to represent the information carried by the image to be detected.
  • the extracted features of the text identifier of the object to be detected are used to represent the information carried by the text identifier of the object to be detected.
  • based on this, after the image to be detected and the text identifier of the object to be detected are acquired, they can be input into the pre-built feature extraction model, so that the feature extraction model performs feature extraction on the image to be detected and the text identifier of the object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; in this way, the extracted features of the image to be detected can represent the information carried by the image to be detected, and the extracted features of the text identifier of the object to be detected can represent the information carried by the text identifier of the object to be detected.
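  • A short sketch of S602, reusing the illustrative encoders from the earlier sketches (again, assumptions rather than the mandated model structure):

```python
import torch

def extract_features(image_encoder, text_encoder, image, object_text):
    """Sketch of S602: run the pre-built feature extraction model on the image
    to be detected and the text identifier of the object to be detected."""
    with torch.no_grad():                          # inference only, no model update
        image_features = image_encoder(image)      # h x w x c feature map
        text_features = text_encoder(object_text)  # 1 x c feature vector
    return image_features, text_features
```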
  • S603: Determine a target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the target detection result corresponding to the image to be detected is used to describe the relationship between the image to be detected and the text identifier of the object to be detected.
  • this embodiment of the present application does not limit the representation of the target detection result corresponding to the image to be detected.
  • for example, the target detection result corresponding to the image to be detected may include the possibility that the object to be detected exists in the image to be detected (such as the possibility that each pixel in the image to be detected is located in the area where the object to be detected is located in the image to be detected), and/or the position of the object to be detected in the image to be detected.
  • the embodiment of the present application does not limit the method of determining the target detection result corresponding to the image to be detected.
  • the process of determining the target detection result corresponding to the image to be detected may include steps 21-22:
  • Step 21: Calculate the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected is used to describe the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected.
  • this embodiment of the present application does not limit the representation of the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • for example, it can be represented by an h×w-dimensional similarity matrix, where the similarity value in the ith row and jth column describes the degree of similarity between the information carried by the pixel in the ith row and jth column of the image to be detected and the information carried by the text identifier of the object to be detected, and can therefore be used to indicate the possibility that the pixel in the ith row and jth column of the image to be detected is located in the area where the object to be detected is located in the image to be detected.
  • for the relevant content of step 21, please refer to the relevant content of S103 above, replacing "sample image" in S103 with "image to be detected" and "sample object text identifier" with "text identifier of the object to be detected".
  • Step 22: Determine the target detection result corresponding to the image to be detected according to the preset similarity condition and the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • the preset similarity condition can be set in advance. For example, if the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected is represented by an h×w-dimensional similarity matrix, the preset similarity condition may be that the similarity is greater than a preset similarity threshold (e.g., 0.5).
  • based on this, step 22 may specifically include: judging whether the similarity value in the ith row and jth column of the above h×w-dimensional similarity matrix is greater than the preset similarity threshold. If it is greater than the preset similarity threshold, the information carried by the pixel in the ith row and jth column of the image to be detected is determined to be similar to the information carried by the text identifier of the object to be detected, so it can be determined that this pixel is located in the area where the object to be detected is located in the image to be detected; if it is not greater than the preset similarity threshold, the information carried by that pixel is determined to be dissimilar to the information carried by the text identifier of the object to be detected, so it can be determined that this pixel lies outside the area where the object to be detected is located in the image to be detected.
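  • A sketch of step 22, thresholding the h×w similarity matrix of step 21 (the similarity computation itself mirrors formula (3); the 0.5 threshold is the example value above):

```python
import numpy as np

def target_detection_result(similarity_matrix, threshold=0.5):
    """Step 22 sketch: per-pixel decision on whether each pixel lies in the
    area where the object to be detected is located."""
    mask = similarity_matrix > threshold           # True inside the predicted region
    object_present = bool(mask.any())              # whether the object appears at all
    return object_present, mask
```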
  • based on the relevant content of S601 to S603 above, after the image to be detected and the text identifier of the object to be detected are acquired, the constructed feature extraction model can be used to perform feature extraction on the image to be detected and the text identifier of the object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; the target detection result corresponding to the image to be detected is then determined according to the similarity between these extracted features.
  • because the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the degree of similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected, the target detection result determined based on this similarity can accurately represent the association between the image to be detected and the text identifier of the object to be detected (for example, whether the image to be detected contains the target object uniquely identified by the text identifier of the object to be detected, and the position of that target object in the image to be detected), which is beneficial to improving the accuracy of target detection.
  • in addition, the target detection method provided in the embodiment of the present application can perform target detection not only based on the sample object text identifiers used in the construction process of the feature extraction model, but also based on any object text identifier other than those sample object text identifiers, which is conducive to improving the target detection performance of the feature extraction model for non-sample objects, thereby helping to improve the target detection performance of the target detection method provided in the embodiment of the present application.
  • the embodiment of the present application does not limit the execution subject of the object detection method.
  • the object detection method provided in the embodiment of the present application can be applied to data processing devices such as terminal devices or servers.
  • the terminal device may be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), or a tablet computer.
  • the server can be an independent server, a cluster server or a cloud server.
  • an embodiment of the present application further provides a device for constructing a feature extraction model, which will be explained and described below with reference to the accompanying drawings.
  • FIG. 7 is a schematic structural diagram of a feature extraction model construction device provided in an embodiment of the present application.
  • the feature extraction model construction device 700 provided in the embodiment of the present application includes:
  • a sample acquisition unit 701 configured to acquire a sample pair and the actual information similarity of the sample pair, wherein the sample pair includes a sample image and a sample object text identifier, and the actual information similarity of the sample pair is used to describe the degree of similarity between the information actually carried by the sample image and the information actually carried by the sample object text identifier;
  • a feature prediction unit 702 configured to input the sample pair into the model to be trained and obtain the extracted features of the sample pair output by the model to be trained, wherein the extracted features of the sample pair include the extracted features of the sample image and the extracted features of the sample object text identifier;
  • a model updating unit 703 configured to update the model to be trained according to the actual information similarity of the sample pair and the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier, and to continue executing the step of inputting the sample pair into the model to be trained until a preset stop condition is reached, whereupon a feature extraction model is determined according to the model to be trained.
  • the model to be trained includes a text feature extraction sub-model and an image feature extraction sub-model;
  • the process of determining the extracted features of the sample pair includes:
  • in a possible implementation, the feature extraction model construction device 700 further includes:
  • an initialization unit configured to use preset prior knowledge to perform initialization processing on the text feature extraction sub-model; wherein the preset prior knowledge is used to describe the degree of association between different objects.
  • the process of determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier includes: determining the similarity between each pixel-level extracted feature in the feature map of the sample image and the extracted feature of the sample object text identifier, and determining the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier according to these pixel-level similarities.
  • the process of determining the actual information similarity of the sample pair includes:
  • if the sample object text identifier is used to uniquely identify the sample object, and the sample image includes the sample object, the actual information similarity is determined according to the actual position of the sample object in the sample image.
  • based on the relevant content of the feature extraction model construction device 700 above, the device trains the model to be trained so that the similarity between the extracted features of the sample image and the extracted features of the sample object text identifier output by the trained model for the sample pair closely approaches the actual information similarity of the sample pair. The trained model therefore has better feature extraction performance, the feature extraction model constructed based on it also has better feature extraction performance, and the subsequent target detection process can be performed more accurately based on the constructed feature extraction model, which is conducive to improving the accuracy of target detection.
  • the embodiment of the present application also provides a target detection device, which will be explained and described below with reference to the accompanying drawings.
  • FIG. 8 this figure is a schematic structural diagram of a target detection device provided by an embodiment of the present application.
  • the target detection device 800 provided in the embodiment of the present application includes:
  • an information acquisition unit 801 configured to acquire an image to be detected and a text identifier of an object to be detected;
  • a feature extraction unit 802 configured to input the image to be detected and the text identifier of the object to be detected into a pre-built feature extraction model, and to obtain the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected output by the feature extraction model; wherein the feature extraction model is constructed using any implementation of the feature extraction model construction method provided in the embodiments of the present application;
  • the result determination unit 803 is configured to determine a target detection result corresponding to the image to be detected according to the degree of similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected.
  • based on the relevant content of the target detection device 800 above, after the image to be detected and the text identifier of the object to be detected are acquired, the constructed feature extraction model can be used to perform feature extraction on the image to be detected and the text identifier of the object to be detected, obtaining and outputting the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected; the target detection result corresponding to the image to be detected is then determined according to the similarity between these extracted features.
  • the similarity between the extracted features of the image to be detected and the extracted features of the text identifier of the object to be detected can accurately represent the similarity between the information carried by the image to be detected and the information carried by the text identifier of the object to be detected degree, so that the target detection result corresponding to the image to be detected based on the similarity can accurately represent the association between the image to be detected and the text mark of the object to be detected (for example, whether there is an object in the image to be detected by
  • the text of the object to be detected identifies the uniquely identified target object, and the position of the target object in the image to be detected, etc.), which is beneficial to improve the accuracy of target detection.
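As a hedged sketch of how units 801 to 803 might cooperate at inference time, the following assumes the feature extraction model is a callable returning both extracted features for a single image/identifier pair, and that a simple threshold on cosine similarity yields the detection decision; the threshold value and all names are assumptions, not part of the disclosed device.

    import torch.nn.functional as F

    def detect(feature_extraction_model, image, object_text_id, threshold=0.5):
        # Feature extraction unit 802: obtain the extracted features of the
        # image to be detected and of the object text identifier.
        image_feat, text_feat = feature_extraction_model(image, object_text_id)

        # Result determination unit 803: the degree of similarity between the
        # two extracted features decides the target detection result.
        similarity = F.cosine_similarity(image_feat, text_feat, dim=-1).item()
        return similarity >= threshold, similarity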
In addition, the target detection method provided in the embodiment of the present application can perform target detection not only based on the sample object text identifiers used in the construction process of the feature extraction model, but also based on any object text identifier other than those sample object text identifiers. This helps improve the target detection performance of the feature extraction model for non-sample objects, and thereby helps improve the target detection performance of the target detection device 800 provided in the embodiment of the present application.
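Because the device matches extracted features rather than fixed class indices, the sketch above could, under the same assumptions, be queried with an object text identifier that never appeared among the sample object text identifiers:

    # "zebra" stands for a hypothetical object text identifier that never
    # appeared among the sample object text identifiers used in training.
    contains_zebra, score = detect(feature_extraction_model, image, "zebra")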
Further, the embodiment of the present application also provides a device. The device includes a processor and a memory:

the memory is configured to store a computer program;

the processor is configured to execute, according to the computer program, any implementation of the feature extraction model construction method provided in the embodiment of the present application, or any implementation of the target detection method provided in the embodiment of the present application.
The embodiment of the present application also provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute any implementation of the feature extraction model construction method provided in the embodiment of the present application, or any implementation of the target detection method provided in the embodiment of the present application.

The embodiment of the present application also provides a computer program product which, when run on a terminal device, enables the terminal device to execute any implementation of the feature extraction model construction method provided in the embodiment of the present application, or any implementation of the target detection method provided in the embodiment of the present application.
It should be noted that in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the contextual objects are in an "or" relationship. "At least one of the following (items)" or similar expressions refers to any combination of these items, including any combination of single items or plural items. For example, "at least one (item) of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can be singular or plural.

Abstract

A feature extraction model construction method and a target detection method, and a related device are disclosed in the present application. First, a feature extraction model is constructed by means of a sample pair and the actual information similarity of the sample pair, so that the constructed feature extraction model has better feature extraction performance. Then, feature extraction is performed on an image to be detected and a text identifier of an object to be detected by means of the constructed feature extraction model, so as to obtain and output the extracted feature of the image to be detected and the extracted feature of the text identifier of the object to be detected. Finally, a target detection result corresponding to the image to be detected is determined according to the similarity between the extracted feature of the image to be detected and the extracted feature of the text identifier of the object to be detected, so that the target detection result can accurately represent the association between the image to be detected and the text identifier of the object to be detected, which facilitates improving the accuracy of target detection.
PCT/CN2022/089230 2021-06-28 2022-04-26 Feature extraction model construction method and target detection method, and related device WO2023273572A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110723063.X 2021-06-28
CN202110723063.XA CN113591839B (zh) Feature extraction model construction method, target detection method and device thereof

Publications (1)

Publication Number Publication Date
WO2023273572A1 (fr)

Family

ID=78245050

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089230 WO2023273572A1 (fr) Feature extraction model construction method and target detection method, and related device

Country Status (2)

Country Link
CN (1) CN113591839B (fr)
WO (1) WO2023273572A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591839B (zh) * 2021-06-28 2023-05-09 北京有竹居网络技术有限公司 Feature extraction model construction method, target detection method and device thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020592B (zh) * 2019-02-03 2024-04-09 平安科技(深圳)有限公司 Object detection model training method and apparatus, computer device and storage medium
CN111782921A (zh) * 2020-03-25 2020-10-16 北京沃东天骏信息技术有限公司 Method and apparatus for retrieving a target
CN112990297B (zh) * 2021-03-10 2024-02-02 北京智源人工智能研究院 Training method, application method and apparatus for a multi-modal pre-training model
CN112990204B (zh) * 2021-05-11 2021-08-24 北京世纪好未来教育科技有限公司 Target detection method and apparatus, electronic device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019889A (zh) * 2017-12-01 2019-07-16 北京搜狗科技发展有限公司 Method for training a feature extraction model and calculating the correlation coefficient between a picture and a query word, and related apparatus
CN108647350A (zh) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 Image-text association retrieval method based on a dual-channel network
US20200242197A1 (en) * 2019-01-30 2020-07-30 Adobe Inc. Generating summary content tuned to a target characteristic using a word generation model
CN111091597A (zh) * 2019-11-18 2020-05-01 贝壳技术有限公司 Method, apparatus and storage medium for determining image pose transformation
CN111897950A (zh) * 2020-07-29 2020-11-06 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN111985616A (zh) * 2020-08-13 2020-11-24 沈阳东软智能医疗科技研究院有限公司 Image feature extraction method, image retrieval method, apparatus and device
CN113591839A (zh) * 2021-06-28 2021-11-02 北京有竹居网络技术有限公司 Feature extraction model construction method, target detection method and device thereof

Also Published As

Publication number Publication date
CN113591839B (zh) 2023-05-09
CN113591839A (zh) 2021-11-02

Similar Documents

Publication Publication Date Title
CN110162593B (zh) Search result processing and similarity model training method and apparatus
CN107239731B (zh) Gesture detection and recognition method based on Faster R-CNN
Zhang et al. Real-time sow behavior detection based on deep learning
WO2023087558A1 (fr) Small-sample remote sensing image scene classification method based on an embedding-smoothing graph neural network
CN109993102B (zh) Similar face retrieval method, apparatus and storage medium
WO2020155518A1 (fr) Object detection method and device, computer device and storage medium
CN109063719B (zh) Image classification method combining structural similarity and class information
CN109165309B (zh) Negative training sample collection method and apparatus, and model training method and apparatus
CN111931859B (zh) Multi-label image recognition method and apparatus
CN110751027B (zh) Pedestrian re-identification method based on deep multiple-instance learning
CN109977253B (zh) Fast image retrieval method and apparatus based on semantics and content
WO2023134402A1 (fr) Calligraphy character recognition method based on a Siamese convolutional neural network
CN112528022A (zh) Method for extracting feature words corresponding to topic categories and identifying text topic categories
CN111523586B (zh) Webly-supervised target detection method based on noise awareness
WO2023273572A1 (fr) Feature extraction model construction method and target detection method, and related device
CN114187595A (zh) Document layout recognition method and system based on fusion of visual and semantic features
CN111444816A (zh) Multi-scale dense pedestrian detection method based on Faster RCNN
CN110083724A (zh) Similar image retrieval method, apparatus and system
CN108428234B (zh) Interactive segmentation performance optimization method based on evaluation of image segmentation results
CN113723558A (zh) Few-shot ship detection method for remote sensing images based on attention mechanism
CN105844299B (zh) Image classification method based on bag-of-words model
WO2024021321A1 (fr) Model generation method and apparatus, electronic device and storage medium
WO2023273570A1 (fr) Target detection model training method and target detection method, and related device
CN113780284B (zh) Logo detection method based on object detection and metric learning
CN115909403A (zh) Low-cost high-accuracy pig face recognition method based on deep learning

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE