WO2023173547A1 - Text image matching method and apparatus, device, and storage medium - Google Patents

Text image matching method and apparatus, device, and storage medium

Info

Publication number
WO2023173547A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate object
candidate
matched
type
similarity
Prior art date
Application number
PCT/CN2022/090161
Other languages
French (fr)
Chinese (zh)
Inventor
郑喜民
翟尤
周成昊
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023173547A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/35 Clustering; Classification
    • G06F16/50 Information retrieval of still image data
    • G06F16/53 Querying
    • G06F16/532 Query formulation, e.g. graphical querying
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a text image matching method, device, equipment and storage medium.
  • Text-image matching is a cross-modal retrieval method: given a piece of natural language text, retrieve images that match the text's description; or, given an image, retrieve text that matches the image's content.
  • the system needs to process the image and the natural language text separately, and then perform matching based on the processing results.
  • features of the image and the natural language text are first extracted through separate feature extraction networks, and the two extracted features are then matched. Because the gap between images and text is large, features from the two modalities are often difficult to match, resulting in low matching accuracy.
  • the main purpose of this application is to provide a text image matching method, device, equipment and storage medium, aiming to solve the technical problem that, in current text image matching, features of the image and the natural language text are extracted separately through feature extraction networks and the two extracted features are then matched, resulting in low matching accuracy.
  • this application proposes a text image matching method, which method includes:
  • a target matching result corresponding to the object to be matched is determined.
  • This application also proposes a text image matching device, which includes:
  • a data acquisition module, used to obtain the object to be matched;
  • a type recognition result determination module, used to perform type recognition on the object to be matched and obtain a type recognition result;
  • a candidate object set determination module, configured to determine a candidate object set from a preset candidate object library according to the type recognition result;
  • a fusion feature extraction module, configured to perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
  • a candidate object feature determination module, configured to perform feature extraction on each candidate object in the candidate object set to obtain candidate object features;
  • a single object similarity determination module, used to calculate the similarity between the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity;
  • a target matching result determination module, configured to determine a target matching result corresponding to the object to be matched based on each of the single object similarities and the candidate object set.
  • This application also proposes a computer device, including a memory and a processor.
  • the memory stores a computer program.
  • when the processor executes the computer program, it implements the above text image matching method.
  • the method includes:
  • a target matching result corresponding to the object to be matched is determined.
  • This application also proposes a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the above-mentioned text image matching method is implemented.
  • the method includes:
  • a target matching result corresponding to the object to be matched is determined.
  • In the text image matching method, device, equipment and storage medium of the present application, the method obtains a type recognition result by performing type recognition on the object to be matched; determines a candidate object set from a preset candidate object library according to the type recognition result; performs fusion feature extraction based on the object to be matched and each candidate object in the candidate object set; performs feature extraction on each candidate object in the candidate object set to obtain candidate object features; performs similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity; and determines, according to each of the single object similarities and the candidate object set, the target matching result corresponding to the object to be matched.
  • Figure 1 is a schematic flow chart of a text image matching method according to an embodiment of the present application
  • Figure 2 is a schematic structural block diagram of a text image matching device according to an embodiment of the present application.
  • Figure 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • an embodiment of the present application provides a text image matching method, which method includes:
  • S4 Perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
  • S7 Determine the target matching result corresponding to the object to be matched according to the similarity of each single object and the candidate object set.
  • a type recognition result is obtained by performing type recognition on the object to be matched; a candidate object set is determined from a preset candidate object library according to the type recognition result; fusion feature extraction is performed based on the object to be matched and each candidate object in the candidate object set; feature extraction is performed on each candidate object in the candidate object set to obtain candidate object features; similarity calculation is performed on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity; and based on each single object similarity and the candidate object set, a target matching result corresponding to the object to be matched is determined.
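  • The end-to-end flow described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `classify_type`, `fuse`, and `extract` are hypothetical stand-ins for the text image classification model, the fusion feature extraction model, and the single object feature extraction model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match(obj, classify_type, library, fuse, extract, threshold=0.5):
    """Skeleton of the flow: type recognition -> candidate set ->
    fusion features -> candidate features -> similarity -> thresholded argmax."""
    obj_type = classify_type(obj)                      # "text" or "image"
    # Cross-modal lookup: a text query searches the image sub-library and vice versa.
    candidates = library["image" if obj_type == "text" else "text"]
    sims = [cosine_similarity(fuse(obj, c), extract(c)) for c in candidates]
    best = max(range(len(sims)), key=lambda i: sims[i])
    if sims[best] > threshold:
        return candidates[best]                        # hit object
    return None                                        # matching failed
```

With vectors standing in for real objects, a perfect-match candidate is returned, while a candidate below the threshold yields a failed match.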
  • the object to be matched is the object for which text-image matching needs to be performed.
  • the object to be matched is a piece of text or an image.
  • the value range of type identification results includes: text type and image type.
  • the type recognition result is matched against the type identifiers in the candidate object library, and the sub-library corresponding to the sub-library identifier associated with the matched type identifier in the candidate object library is used as the candidate object set.
  • the candidate object library includes: type identifier and sub-library identifier.
  • feature extraction is performed based on the coding of the object to be matched and the coding of each candidate object in the candidate object set, and the extracted features are used as fusion features.
  • the number of fused features is the same as the number of candidate objects in the candidate object set.
  • candidate object features correspond to the candidate objects one-to-one.
  • the number of single-object similarities is the same as the number of candidate objects in the candidate object set.
  • the single-object similarity is cosine similarity
  • when the single object similarity is cosine similarity, the single object similarity with the largest value is found from among the single object similarities, and the candidate object corresponding to the found single object similarity in the candidate object set is used as the hit object of the target matching result corresponding to the object to be matched; when the single object similarity is the Euclidean distance, the single object similarity with the smallest value is found from among the single object similarities, and the candidate object corresponding to it in the candidate object set is used as the hit object of the target matching result corresponding to the object to be matched.
  • the above-mentioned step of performing type identification on the object to be matched and obtaining the type identification result includes:
  • This embodiment uses a text image classification model to perform classification prediction, thereby improving the accuracy of the classification prediction and, in turn, the accuracy of text image matching.
  • the object to be matched is input into a preset text image classification model for classification prediction, and the data obtained by classification prediction is used as the classification prediction result.
  • the text image classification model can use a binary classifier.
  • the classification prediction result is a vector with two elements, corresponding to the text label and the image label respectively; each element is a probability value.
  • when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label, the text-label element is the largest; in this case the object to be matched is a piece of text, so the type recognition result is determined to be a text type.
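  • Assuming the two-element prediction vector is ordered as [text probability, image probability] (an illustrative assumption; the patent does not fix an order), the decision rule reduces to a comparison:

```python
def type_from_prediction(pred):
    """pred: two-element vector of probabilities [text_prob, image_prob].
    The larger element decides the type recognition result."""
    text_prob, image_prob = pred
    return "text" if text_prob > image_prob else "image"
```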
  • the above-mentioned step of determining a candidate object set from a preset candidate object library based on the type recognition result includes:
  • when the type recognition result is a text type, the image sub-library is used as the candidate object set; when the type recognition result is an image type, the text sub-library is used as the candidate object set, thereby providing a basis for fusion feature generation and text-image matching.
  • when the type recognition result is a text type, the object to be matched is a piece of text, and the image sub-library corresponding to the sub-library identifier associated with the text type in the candidate object library is used as the candidate object set; in this case, each candidate object in the candidate object set is an image.
  • when the type recognition result is an image type, the object to be matched is an image, and the text sub-library corresponding to the sub-library identifier associated with the image type in the candidate object library is used as the candidate object set; in this case, each candidate object in the candidate object set is text.
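  • A minimal sketch of this cross-modal sub-library lookup, assuming the candidate object library is a plain mapping from sub-library identifiers to candidate lists (the identifier names are hypothetical):

```python
def select_candidate_set(type_result, library):
    """Cross-modal selection: a text query retrieves from the image
    sub-library, an image query retrieves from the text sub-library."""
    sub_id = "image_sublibrary" if type_result == "text" else "text_sublibrary"
    return library[sub_id]
```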
  • the above step of performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set includes:
  • S44 Splice the first code and the second code in dimensions to obtain a fusion code
  • S45 Input the fusion code into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
  • the object to be matched and the candidate object are first encoded and dimensionally spliced, and the result of the dimension splicing is then input into the fusion feature extraction model for feature extraction, thereby extracting intermediate features between the image and the text and providing a basis for matching the fusion features against the candidate object features.
  • when the type of the candidate object set is a text type, the target object is input into the coding model corresponding to the text type for encoding, and the encoded data is used as the first encoding; when the type of the candidate object set is an image type, the target object is input into the coding model corresponding to the image type for encoding, and the encoded data is used as the first encoding.
  • the coding model adopts a fully connected layer. Because such an encoding model performs only shallow encoding, a large amount of the original information in the target object is retained.
  • the encoding model can also adopt other encoding models, which are not limited here.
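  • A fully connected layer of the kind the coding model adopts computes y = Wx + b; the list-based representation below is purely illustrative:

```python
def fully_connected(x, weights, bias):
    """A single fully connected (dense) layer, sketched in plain Python.
    weights: list of rows (one per output unit); bias: one value per output.
    Such a shallow encoding preserves much of the input's information."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]
```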
  • when the type recognition result is a text type, the object to be matched is input into the coding model corresponding to the text type for encoding, and the encoded data is used as the second encoding; when the type recognition result is an image type, the object to be matched is input into the coding model corresponding to the image type for encoding, and the encoded data is used as the second encoding.
  • optionally, the first encoding and the second encoding are spliced along the dimension in the order of text first and then image, and the spliced data is used as the fusion code; in this case, the order of the fusion code along the dimension is: text encoding, image encoding. Alternatively, the first encoding and the second encoding are spliced along the dimension in the order of image first and then text, in which case the order of the fusion code along the dimension is: image encoding, text encoding.
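  • The fixed splicing order (here: text encoding first, then image encoding) can be sketched as follows, with encodings represented as flat lists of floats; `query_type` is a hypothetical parameter giving the modality of the object to be matched:

```python
def splice_codes(first_code, second_code, query_type):
    """Concatenate the candidate encoding (first_code) and the query
    encoding (second_code) along the feature dimension in a fixed
    modality order: text encoding first, then image encoding."""
    if query_type == "text":
        # query is text (second code), candidate is an image (first code)
        return list(second_code) + list(first_code)
    # query is an image (second code), candidate is text (first code)
    return list(first_code) + list(second_code)
```

Keeping the modality order fixed means the downstream fusion feature extraction model always sees text features and image features in the same positions.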
  • the fusion code is input into a preset fusion feature extraction model for feature extraction, and the extracted features are used as the fusion features corresponding to the target object.
  • the fusion feature extraction model is a model trained based on the ResNet50 network or the UNet network.
  • the ResNet50 network is a deep residual network.
  • the UNet network is a semantic segmentation network.
  • the above-mentioned step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes:
  • S51 Input each candidate object in the candidate object set into the single object feature extraction model corresponding to the type of the candidate object set for feature extraction, and obtain the candidate object feature corresponding to each candidate object.
  • This embodiment uses a single object feature extraction model corresponding to the type of the candidate object set for feature extraction, thereby improving the accuracy of the extracted features and improving the accuracy of text image matching.
  • each candidate object in the candidate object set is input into the single object feature extraction model corresponding to the type of the candidate object set for feature extraction, and the extracted features are used as the candidate object features.
  • when the type of the candidate object set is a text type, the single object feature extraction model corresponding to the type of the candidate object set is a model obtained by training an LSTM network using multiple text training samples; when the type of the candidate object set is an image type, it is a model obtained by training a ResNet50 network or a UNet network using multiple image training samples.
  • LSTM network refers to long short-term memory artificial neural network.
  • Text training samples include: text samples and text feature calibration data.
  • Image training samples include: image samples and image feature calibration data.
  • the above-mentioned step of calculating the similarity between the fusion features and the candidate object features corresponding to the same candidate object to obtain the similarity of a single object includes:
  • S64 Perform cosine similarity calculation on the first feature and the second feature to obtain the single object similarity corresponding to the object to be calculated.
  • This embodiment uses cosine similarity for the similarity calculation. Since cosine similarity compares the directions of feature vectors rather than their magnitudes, it tends to give more reliable comparisons, further improving the accuracy of text-image matching.
  • the first feature and the second feature are features corresponding to the same candidate object; therefore, cosine similarity calculation is performed on the first feature and the second feature, and the calculated cosine similarity is used as the single object similarity corresponding to the object to be calculated.
  • the single object similarity corresponding to each candidate object in the candidate object set can be determined.
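  • The cosine similarity used for the single object similarity can be computed directly from the two feature vectors; a minimal sketch:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|), for equal-length non-zero vectors.
    Ranges over [-1, 1]; 1 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))
```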
  • the above-mentioned single object similarity is cosine similarity
  • the step of determining the target matching result corresponding to the object to be matched according to each of the single object similarities and the candidate object set includes:
  • S71 Find the single-object similarity with the largest value from each of the single-object similarities, and use it as the target similarity;
  • This embodiment takes the largest single object similarity as a hit only when its value is greater than the preset similarity threshold, using the corresponding candidate object in the candidate object set as the hit object of the target matching result, thereby further improving the accuracy of the determined target matching result.
  • the single object similarity with the largest value is found from each of the single object similarities, and the found single object similarity is used as the target similarity.
  • when the target similarity is greater than the preset similarity threshold, there is a single object similarity that meets the requirement; it is then determined that the result of the target matching result is success, and the candidate object corresponding to the target similarity in the candidate object set is used as the hit object of the target matching result.
  • when the target similarity is less than or equal to the preset similarity threshold, no single object similarity meets the requirement, and the result of the target matching result is determined to be failure.
  • this application also proposes a text image matching device, which includes:
  • the data acquisition module 100 is used to acquire objects to be matched;
  • the type recognition result determination module 200 is used to perform type recognition on the object to be matched and obtain the type recognition result;
  • the candidate object set determination module 300 is used to determine the candidate object set from the preset candidate object library according to the type recognition result
  • the fusion feature extraction module 400 is configured to perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
  • the candidate object feature determination module 500 is used to perform feature extraction on each candidate object in the candidate object set to obtain candidate object features
  • the single object similarity determination module 600 is used to calculate the similarity between the fusion features corresponding to the same candidate object and the candidate object features to obtain the single object similarity;
  • the target matching result determination module 700 is configured to determine the target matching result corresponding to the object to be matched according to each of the single object similarities and the candidate object set.
  • a type recognition result is obtained by performing type recognition on the object to be matched; a candidate object set is determined from a preset candidate object library according to the type recognition result; fusion feature extraction is performed based on the object to be matched and each candidate object in the candidate object set; feature extraction is performed on each candidate object in the candidate object set to obtain candidate object features; similarity calculation is performed on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity; and based on each single object similarity and the candidate object set, a target matching result corresponding to the object to be matched is determined.
  • the above-mentioned type recognition result determination module 200 includes: a classification prediction result determination sub-module, a first result determination sub-module and a second result determination sub-module;
  • the classification prediction result determination sub-module is used to input the object to be matched into a preset text image classification model for classification prediction to obtain a classification prediction result;
  • the first result determination submodule is configured to determine that the type recognition result is a text type when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result;
  • the second result determination submodule is configured to determine that the type recognition result is an image type when the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result.
  • the above-mentioned candidate object set determination module 300 includes: a first candidate object set determination sub-module and a second candidate object set determination sub-module;
  • the first candidate object set determination sub-module is configured to use the image sub-library in the candidate object library as the candidate object set when the type recognition result is a text type;
  • the second candidate object set determination sub-module is configured to use the text sub-library in the candidate object library as the candidate object set when the type recognition result is an image type.
  • the above-mentioned fusion feature extraction module 400 includes: a fusion feature extraction sub-module;
  • the fusion feature extraction sub-module is used to take any candidate object in the candidate object set as the target object, input the target object into the coding model corresponding to the type of the candidate object set for encoding to obtain the first encoding, input the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain the second encoding, splice the first encoding and the second encoding along the dimension to obtain the fusion code, and input the fusion code into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
  • the above-mentioned candidate object feature determination module 500 includes: a candidate object feature determination sub-module;
  • the candidate object feature determination submodule is used to input each candidate object in the candidate object set into the single object feature extraction model corresponding to the type of the candidate object set for feature extraction, and obtain the candidate object feature corresponding to each candidate object.
  • the above-mentioned single object similarity determination module 600 includes: cosine similarity calculation sub-module;
  • the cosine similarity calculation sub-module is used to use any of the candidate objects in the candidate object set as an object to be calculated, the fusion feature corresponding to the object to be calculated as the first feature, and the The candidate object feature corresponding to the object to be calculated is used as the second feature, and cosine similarity calculation is performed on the first feature and the second feature to obtain the single object similarity corresponding to the object to be calculated.
  • the above-mentioned target matching result determination module 700 includes: a similarity screening sub-module and a target matching result determination sub-module;
  • the similarity screening sub-module is used to find the single object similarity with the largest value from each of the single object similarities as the target similarity;
  • the target matching result determination sub-module is used to determine whether the target similarity is greater than a preset similarity threshold; if so, the result of the target matching result is determined to be success, and the candidate object corresponding to the target similarity in the candidate object set is used as the hit object of the target matching result; if not, the result of the target matching result is determined to be failure.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3 .
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes non-volatile storage media and internal memory.
  • the non-volatile storage medium stores an operating system, computer programs and a database. The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store data involved in the text image matching method.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • the computer program when executed by the processor, implements the text image matching method in any of the above embodiments.
  • the text image matching method includes: obtaining an object to be matched; performing type recognition on the object to be matched to obtain a type recognition result; determining a candidate object set from a preset candidate object library according to the type recognition result; performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set; performing feature extraction on each candidate object in the candidate object set to obtain candidate object features; performing similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity; and determining a target matching result corresponding to the object to be matched based on each of the single object similarities and the candidate object set.
  • a type recognition result is obtained by performing type recognition on the object to be matched; a candidate object set is determined from a preset candidate object library according to the type recognition result; fusion feature extraction is performed based on the object to be matched and each candidate object in the candidate object set; feature extraction is performed on each candidate object in the candidate object set to obtain candidate object features; similarity calculation is performed on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity; and based on each single object similarity and the candidate object set, a target matching result corresponding to the object to be matched is determined.
  • the above step of performing type recognition on the object to be matched and obtaining the type recognition result includes: inputting the object to be matched into a preset text image classification model for classification prediction to obtain a classification prediction result; when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is a text type; when the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is an image type.
  • the above step of determining a candidate object set from a preset candidate object library based on the type recognition result includes: when the type recognition result is a text type, using the image sub-library in the candidate object library as the candidate object set; when the type recognition result is an image type, using the text sub-library in the candidate object library as the candidate object set.
  • the above step of performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set includes: taking any one of the candidate objects in the candidate object set as the target object; inputting the target object into the encoding model corresponding to the type of the candidate object set for encoding to obtain the first encoding; inputting the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain the second encoding; splicing the first encoding and the second encoding along the dimension to obtain the fusion code; and inputting the fusion code into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
  • the above step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes: inputting each candidate object in the candidate object set with the corresponding Feature extraction is performed in a single object feature extraction model corresponding to the type of the candidate object set, and each candidate object corresponding to the candidate object feature is obtained.
  • the above-mentioned step of calculating the similarity of the fusion features and the candidate object features corresponding to the same candidate object to obtain the similarity of a single object includes: calculating any one of the candidate objects in the set The candidate object is used as the object to be calculated; the fusion feature corresponding to the object to be calculated is used as the first feature; the candidate object feature corresponding to the object to be calculated is used as the second feature; the first feature and The second feature performs cosine similarity calculation to obtain the single object similarity corresponding to the object to be calculated.
  • the above-mentioned single object similarity is cosine similarity
  • the step of determining the target matching result corresponding to the object to be matched according to each of the single object similarities and the candidate object set includes: Find the single object similarity with the largest value from each of the single object similarities as the target similarity; determine whether the target similarity is greater than the preset similarity threshold; if so, determine that the target matches The result of the result is success, and the candidate object corresponding to the target similarity in the candidate object set is used as the hit object of the target matching result; if not, the result of the target matching result is determined to be failure.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • a computer program is stored thereon. When the computer program is executed by a processor, the above-mentioned steps are implemented.
  • the text image matching method of any of the above embodiments includes the steps of: obtaining an object to be matched; performing type recognition on the object to be matched to obtain a type recognition result; determining a candidate object set from a preset candidate object library according to the type recognition result; performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set; performing feature extraction on each candidate object in the candidate object set to obtain candidate object features; performing a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
  • the text image matching method executed above performs type recognition on the object to be matched to obtain a type recognition result; determines a candidate object set from a preset candidate object library according to the type recognition result; performs fusion feature extraction based on the object to be matched and each candidate object in the candidate object set; performs feature extraction on each candidate object in the candidate object set to obtain candidate object features; performs a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and determines a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
  • the above step of performing type recognition on the object to be matched to obtain a type recognition result includes: inputting the object to be matched into a preset text-image classification model for classification prediction to obtain a classification prediction result; when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is the text type; when the vector element corresponding to the text label is smaller than the vector element corresponding to the image label, determining that the type recognition result is the image type.
  • the above step of determining a candidate object set from a preset candidate object library according to the type recognition result includes: when the type recognition result is the text type, using the image sub-library in the candidate object library as the candidate object set; when the type recognition result is the image type, using the text sub-library in the candidate object library as the candidate object set.
  • the above step of performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set includes: taking any one candidate object in the candidate object set as a target object; inputting the target object into the encoding model corresponding to the type of the candidate object set for encoding to obtain a first encoding; inputting the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain a second encoding; concatenating the first encoding and the second encoding along the feature dimension to obtain a fusion encoding; and inputting the fusion encoding into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
  • the above step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes: inputting each candidate object in the candidate object set into a single-object feature extraction model corresponding to the type of the candidate object set for feature extraction, obtaining the candidate object feature corresponding to each candidate object.
  • the above step of performing a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity includes: taking any one candidate object in the candidate object set as an object to be calculated; taking the fusion feature corresponding to the object to be calculated as a first feature; taking the candidate object feature corresponding to the object to be calculated as a second feature; and performing a cosine similarity calculation on the first feature and the second feature to obtain the single-object similarity corresponding to the object to be calculated.
  • the above-mentioned single-object similarity is a cosine similarity.
  • the step of determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set includes: finding the single-object similarity with the largest value among the single-object similarities as a target similarity; determining whether the target similarity is greater than a preset similarity threshold; if so, determining that the target matching result is a success and taking the candidate object corresponding to the target similarity in the candidate object set as the hit object of the target matching result; if not, determining that the target matching result is a failure.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.


Abstract

The present disclosure relates to the technical field of artificial intelligence. Disclosed are a text image matching method and apparatus, a device, and a storage medium. The method comprises: identifying the type of an object to be matched to obtain a type identification result; determining a candidate object set from a preset candidate object library according to the type identification result; extracting a fusion feature according to the object to be matched and each candidate object in the candidate object set; performing feature extraction on each candidate object in the candidate object set to obtain candidate object features; performing similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and determining, according to each single-object similarity and the candidate object set, a target matching result corresponding to the object to be matched. A direct matching operation between image features and text features is avoided, and the matching precision can be improved by adopting fusion features for text image matching.

Description

Text image matching method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 16, 2022, with application number 202210256789.1 and entitled "Text image matching method, device, equipment and storage medium", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a text image matching method, device, equipment and storage medium.
Background
Text-image matching refers to a cross-modal matching and retrieval method: given a piece of natural language text, images that match the text description are retrieved; or, given an image, text consistent with the image content is retrieved.
As a cross-modal matching and retrieval method, the system needs to process two kinds of information, images and natural language text, separately, and then perform matching based on the processing results. Some data sets and algorithms already exist in this area, but the inventors found that in these algorithms the image and the natural language text are first passed through separate feature extraction networks, and the two extracted features are then matched against each other. Because the difference between images and text is huge, features from these two modalities are often difficult to match, resulting in low matching accuracy.
Technical Problem
The main purpose of this application is to provide a text image matching method, device, equipment and storage medium, aiming to solve the technical problem of low matching accuracy in current text-image matching, in which the image and the natural language text are first passed through feature extraction networks separately and the two extracted features are then matched against each other.
Technical Solution
In order to achieve the above object, this application proposes a text image matching method, the method including:
obtaining an object to be matched;
performing type recognition on the object to be matched to obtain a type recognition result;
determining a candidate object set from a preset candidate object library according to the type recognition result;
performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
performing feature extraction on each candidate object in the candidate object set to obtain candidate object features;
performing a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity;
determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
This application also proposes a text image matching device, the device including:
a data acquisition module, configured to obtain an object to be matched;
a type recognition result determination module, configured to perform type recognition on the object to be matched to obtain a type recognition result;
a candidate object set determination module, configured to determine a candidate object set from a preset candidate object library according to the type recognition result;
a fusion feature extraction module, configured to perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
a candidate object feature determination module, configured to perform feature extraction on each candidate object in the candidate object set to obtain candidate object features;
a single-object similarity determination module, configured to perform a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity;
a target matching result determination module, configured to determine a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
This application also proposes a computer device, including a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, the above text image matching method is implemented, the method including:
obtaining an object to be matched;
performing type recognition on the object to be matched to obtain a type recognition result;
determining a candidate object set from a preset candidate object library according to the type recognition result;
performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
performing feature extraction on each candidate object in the candidate object set to obtain candidate object features;
performing a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity;
determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
This application also proposes a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the above text image matching method is implemented, the method including:
obtaining an object to be matched;
performing type recognition on the object to be matched to obtain a type recognition result;
determining a candidate object set from a preset candidate object library according to the type recognition result;
performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
performing feature extraction on each candidate object in the candidate object set to obtain candidate object features;
performing a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity;
determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
Beneficial Effects
In the text image matching method, device, equipment and storage medium of this application, the method performs type recognition on the object to be matched to obtain a type recognition result; determines a candidate object set from a preset candidate object library according to the type recognition result; performs fusion feature extraction based on the object to be matched and each candidate object in the candidate object set; performs feature extraction on each candidate object in the candidate object set to obtain candidate object features; performs a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and determines a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set. By first extracting fusion features from the object to be matched and the candidate objects, and then matching the fusion features against the candidate object features, a direct matching operation between image features and text features is avoided; moreover, using fusion features for text-image matching increases the matching precision and improves the accuracy of text-image matching.
Description of the Drawings
Figure 1 is a schematic flow chart of a text image matching method according to an embodiment of the present application;
Figure 2 is a schematic structural block diagram of a text image matching device according to an embodiment of the present application;
Figure 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
The realization of the purpose, functional features and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Best Mode of Carrying Out the Invention
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Referring to Figure 1, an embodiment of the present application provides a text image matching method, the method including:
S1: obtaining an object to be matched;
S2: performing type recognition on the object to be matched to obtain a type recognition result;
S3: determining a candidate object set from a preset candidate object library according to the type recognition result;
S4: performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
S5: performing feature extraction on each candidate object in the candidate object set to obtain candidate object features;
S6: performing a similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity;
S7: determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
In this embodiment, type recognition is performed on the object to be matched to obtain a type recognition result; a candidate object set is determined from a preset candidate object library according to the type recognition result; fusion feature extraction is performed based on the object to be matched and each candidate object in the candidate object set; feature extraction is performed on each candidate object in the candidate object set to obtain candidate object features; a similarity calculation is performed on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and a target matching result corresponding to the object to be matched is determined according to each single-object similarity and the candidate object set. By first extracting fusion features from the object to be matched and the candidate objects, and then matching the fusion features against the candidate object features, a direct matching operation between image features and text features is avoided; moreover, using fusion features for text-image matching increases the matching precision and improves the accuracy of text-image matching.
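As a rough illustration of steps S1–S7, the flow can be sketched in Python. All function names and parameters here (`fuse`, `extract_single`, `similarity`, `threshold`) are hypothetical stand-ins for the patent's encoding, fusion feature extraction, and similarity models, not part of the disclosure:

```python
def text_image_match(query, is_text, image_library, text_library,
                     fuse, extract_single, similarity, threshold=0.5):
    """Sketch of S1-S7: a text query searches the image sub-library and
    vice versa; each candidate is scored by comparing the query-candidate
    fusion feature against the candidate's own feature."""
    candidates = image_library if is_text else text_library      # S2-S3
    scores = []
    for candidate in candidates:                                 # S4-S6
        fused = fuse(query, candidate)          # fusion feature
        single = extract_single(candidate)      # candidate object feature
        scores.append(similarity(fused, single))
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])      # S7
    return candidates[best] if scores[best] > threshold else None
```

The per-candidate loop makes explicit that one fusion feature is produced for each candidate, matching the one-to-one correspondence described below.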
For S1, the object to be matched may be obtained from user input, retrieved from a database, or obtained from a third-party application.
The object to be matched is the object on which text-image matching needs to be performed.
The object to be matched is a piece of text or an image.
For S2, type recognition is performed on the object to be matched in order to determine whether it is text or an image.
The type recognition result has a single value, and its value range includes the text type and the image type.
For S3, the type recognition result is matched against the type identifiers in the candidate object library, and the sub-library corresponding to the sub-library identifier associated with the matched type identifier in the candidate object library is used as the candidate object set.
The candidate object library includes type identifiers and sub-library identifiers.
For S4, intermediate features between text and image are extracted based on the object to be matched and each candidate object in the candidate object set, and the extracted intermediate features are used as fusion features.
Specifically, feature extraction is performed based on the encoding of the object to be matched and the encoding of each candidate object in the candidate object set, and the extracted features are used as the fusion features.
The number of fusion features is the same as the number of candidate objects in the candidate object set.
For S5, feature extraction is performed on each candidate object in the candidate object set, and the extracted features are used as candidate object features; it can be understood that candidate object features correspond one-to-one with candidate objects.
For S6, a cosine similarity or Euclidean distance calculation is performed on the fusion feature and the candidate object feature corresponding to the same candidate object, and the calculated value is used as a single-object similarity.
That is, the number of single-object similarities is the same as the number of candidate objects in the candidate object set.
For S7, when the single-object similarity is a cosine similarity, the single-object similarity with the largest value is found among all the single-object similarities, and the candidate object in the candidate object set corresponding to the found similarity is used as the hit object of the target matching result corresponding to the object to be matched; when the single-object similarity is a Euclidean distance, the single-object similarity with the smallest value is found, and the candidate object in the candidate object set corresponding to the found similarity is used as the hit object of the target matching result corresponding to the object to be matched.
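The two similarity measures mentioned for S6, and the corresponding hit selection in S7 (maximum for cosine similarity, minimum for Euclidean distance), can be sketched as follows. This is a minimal illustration on plain Python lists, not the patent's implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_hit(similarities, candidates, metric="cosine"):
    """Pick the hit object: the maximum score for cosine similarity,
    the minimum score for Euclidean distance."""
    pick = max if metric == "cosine" else min
    best = pick(range(len(similarities)), key=lambda i: similarities[i])
    return candidates[best]
```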
In one embodiment, the above step of performing type recognition on the object to be matched to obtain a type recognition result includes:
S21: inputting the object to be matched into a preset text-image classification model for classification prediction to obtain a classification prediction result;
S22: when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is the text type;
S23: when the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is the image type.
In this embodiment, classification prediction is performed through a text-image classification model, which improves the classification prediction result and thereby the accuracy of text-image matching.
For S21, the object to be matched is input into a preset text-image classification model for classification prediction, and the data obtained from the classification prediction is used as the classification prediction result.
The text-image classification model may use a binary classifier.
The classification prediction result is a vector with two elements, corresponding respectively to the text label and the image label; the vector elements are probability values.
For S22, when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label, the text-label element is the larger of the two, which means the object to be matched is a piece of text, so the type recognition result is determined to be the text type.
For S23, when the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label, the image-label element is the larger of the two, which means the object to be matched is an image, so the type recognition result is determined to be the image type.
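Since the classification prediction result is a two-element probability vector, the decision in S22/S23 reduces to a single comparison. A minimal sketch follows; the tie-breaking behavior for equal elements is not specified in this description and is chosen arbitrarily here:

```python
def decide_type(prediction):
    """prediction: [p_text, p_image] from the binary text-image classifier.
    Returns 'text' when the text-label element is larger, otherwise 'image'."""
    p_text, p_image = prediction
    return "text" if p_text > p_image else "image"
```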
In one embodiment, the above step of determining a candidate object set from a preset candidate object library according to the type recognition result includes:
S31: when the type recognition result is the text type, using the image sub-library in the candidate object library as the candidate object set;
S32: when the type recognition result is the image type, using the text sub-library in the candidate object library as the candidate object set.
In this embodiment, the image sub-library is used as the candidate object set when the type recognition result is the text type, and the text sub-library is used as the candidate object set when the type recognition result is the image type, thereby providing a basis for fusion feature generation and text-image matching.
For S31, when the type recognition result is the text type, the object to be matched is a piece of text; therefore the image sub-library identified by the sub-library identifier corresponding to the text type in the candidate object library is used as the candidate object set, in which case the candidate objects in the candidate object set are images.
For S32, when the type recognition result is the image type, the object to be matched is an image; therefore the text sub-library identified by the sub-library identifier corresponding to the image type in the candidate object library is used as the candidate object set, in which case the candidate objects in the candidate object set are texts.
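The cross-modal selection of S31/S32 amounts to a lookup keyed by the type recognition result. A small sketch, with hypothetical names and placeholder candidates:

```python
# Sketch of S31/S32 (names and contents are hypothetical): the type recognition
# result selects the sub-library of the opposite modality as the candidate object set.
candidate_object_library = {
    "image_sub_library": ["img_1", "img_2"],  # placeholder image candidates
    "text_sub_library": ["txt_1", "txt_2"],   # placeholder text candidates
}

def select_candidate_set(type_result, library):
    if type_result == "text":                 # S31: text query -> image candidates
        return library["image_sub_library"]
    return library["text_sub_library"]        # S32: image query -> text candidates

print(select_candidate_set("text", candidate_object_library))   # ['img_1', 'img_2']
print(select_candidate_set("image", candidate_object_library))  # ['txt_1', 'txt_2']
```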
In one embodiment, the above step of performing fusion feature extraction according to the object to be matched and each candidate object in the candidate object set includes:
S41: taking any candidate object in the candidate object set as a target object;
S42: inputting the target object into an encoding model corresponding to the type of the candidate object set for encoding, to obtain a first code;
S43: inputting the object to be matched into the encoding model corresponding to the type recognition result for encoding, to obtain a second code;
S44: concatenating the first code and the second code along the feature dimension, to obtain a fusion code;
S45: inputting the fusion code into a preset fusion feature extraction model for feature extraction, to obtain the fusion feature corresponding to the target object.
In this embodiment, the object to be matched and the candidate object are first encoded and concatenated along the feature dimension, and the concatenation result is then input into the fusion feature extraction model for feature extraction, so that intermediate features between the image and the text are extracted, providing a basis for the matching operation between the fusion features and the candidate object features.
For S42, when the type of the candidate object set is the text type, the target object is input into the encoding model corresponding to the text type for encoding, and the encoded data is used as the first code; when the type of the candidate object set is the image type, the target object is input into the encoding model corresponding to the image type for encoding, and the encoded data is used as the first code.
Optionally, the encoding model is a fully connected layer. Since such an encoding model performs only shallow encoding, a large amount of the raw information in the target object is retained.
It can be understood that the encoding model may also be any other model capable of encoding, which is not limited here.
For S43, when the type recognition result is the text type, the object to be matched is input into the encoding model corresponding to the text type for encoding, and the encoded data is used as the second code; when the type recognition result is the image type, the object to be matched is input into the encoding model corresponding to the image type for encoding, and the encoded data is used as the second code.
For S44, optionally, the first code and the second code are concatenated along the feature dimension in text-then-image order, and the concatenated data is used as the fusion code; in this case the fusion code consists of the text code followed by the image code.
Optionally, the first code and the second code are concatenated along the feature dimension in image-then-text order, and the concatenated data is used as the fusion code; in this case the fusion code consists of the image code followed by the text code.
For S45, the fusion code is input into the preset fusion feature extraction model for feature extraction, and the extracted features are used as the fusion feature corresponding to the target object.
The fusion feature extraction model is a model trained on the ResNet50 network or the U-Net network. ResNet50 is a deep residual network; U-Net is a semantic segmentation network.
It can be understood that by repeating steps S41 to S45, the fusion feature corresponding to each candidate object in the candidate object set can be determined.
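Steps S42 to S44 can be sketched with NumPy stand-ins. Here the shallow encoding models are plain fully connected (matrix) projections and the fusion feature extraction model is left out; all names and dimensions are illustrative assumptions, not the patent's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)
W_text = rng.standard_normal((128, 300))    # stand-in text encoding model (one fully connected layer)
W_image = rng.standard_normal((128, 2048))  # stand-in image encoding model (one fully connected layer)

def encode(x, weights):
    # A single fully connected layer: shallow encoding that keeps
    # much of the raw information (S42/S43).
    return weights @ x

text_to_match = rng.standard_normal(300)     # object to be matched (text type)
image_candidate = rng.standard_normal(2048)  # target object drawn from the image sub-library

first_code = encode(image_candidate, W_image)            # S42
second_code = encode(text_to_match, W_text)              # S43
fusion_code = np.concatenate([second_code, first_code])  # S44: text-then-image order

# The 256-dimensional fusion code would then be fed to the
# fusion feature extraction model (S45).
print(fusion_code.shape)  # (256,)
```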
In one embodiment, the above step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes:
S51: inputting each candidate object in the candidate object set into a single-object feature extraction model corresponding to the type of the candidate object set for feature extraction, to obtain the candidate object feature corresponding to each candidate object.
In this embodiment, a single-object feature extraction model corresponding to the type of the candidate object set is used for feature extraction, which improves the accuracy of the extracted features and hence the accuracy of text-image matching.
For S51, each candidate object in the candidate object set is input into the single-object feature extraction model corresponding to the type of the candidate object set for feature extraction, and the extracted features are used as one candidate object feature.
When the type of the candidate object set is the text type, the single-object feature extraction model corresponding to the type of the candidate object set is a model obtained by training an LSTM network on multiple text training samples; when the type of the candidate object set is the image type, the single-object feature extraction model corresponding to the type of the candidate object set is a model obtained by training the ResNet50 network or the U-Net network on multiple image training samples.
An LSTM network is a long short-term memory artificial neural network.
Each text training sample includes a text sample and text feature calibration data.
Each image training sample includes an image sample and image feature calibration data.
In one embodiment, the above step of performing similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity includes:
S61: taking any candidate object in the candidate object set as an object to be calculated;
S62: taking the fusion feature corresponding to the object to be calculated as a first feature;
S63: taking the candidate object feature corresponding to the object to be calculated as a second feature;
S64: performing cosine similarity calculation on the first feature and the second feature, to obtain the single-object similarity corresponding to the object to be calculated.
In this embodiment, cosine similarity is used for the similarity calculation; because cosine similarity tends to give a better solution, the accuracy of text-image matching is further improved.
For S64, the first feature and the second feature are features corresponding to the same candidate object; therefore, cosine similarity calculation is performed on the first feature and the second feature, and the calculated cosine similarity is used as the single-object similarity corresponding to the object to be calculated.
By repeating steps S61 to S64, the single-object similarity corresponding to each candidate object in the candidate object set can be determined.
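The cosine similarity of S64 can be written directly; the function name below is illustrative:

```python
import numpy as np

def cosine_similarity(first_feature, second_feature):
    # S64: cosine of the angle between the fusion feature and the
    # candidate object feature of the same candidate object.
    a = np.asarray(first_feature, dtype=float)
    b = np.asarray(second_feature, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0: identical directions
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal features
```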
In one embodiment, the above single-object similarity is a cosine similarity, and the step of determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set includes:
S71: finding, among the single-object similarities, the single-object similarity with the largest value as a target similarity;
S72: determining whether the target similarity is greater than a preset similarity threshold;
S73: if so, determining that the target matching result is a success, and taking the candidate object corresponding to the target similarity in the candidate object set as the hit object of the target matching result;
S74: if not, determining that the target matching result is a failure.
In this embodiment, the candidate object corresponding to the largest single-object similarity that is greater than the preset similarity threshold is taken as the hit object of the target matching result, which further improves the accuracy of the determined target matching result.
For S71, the single-object similarity with the largest value is found among the single-object similarities, and the found single-object similarity is used as the target similarity.
For S73, if so, that is, the target similarity is greater than the preset similarity threshold, there is a single-object similarity that meets the requirement, so the target matching result is determined to be a success, and the candidate object corresponding to the target similarity in the candidate object set is taken as the hit object of the target matching result.
For S74, if not, that is, the target similarity is less than or equal to the preset similarity threshold, there is no single-object similarity that meets the requirement, so the target matching result is determined to be a failure.
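Steps S71 to S74 can be sketched as an argmax followed by a threshold check; the function name, result keys, and threshold value are hypothetical:

```python
# Sketch of S71-S74 (names and threshold are hypothetical): find the largest
# single-object similarity, compare it with the preset similarity threshold,
# and report either the hit object or a failed match.
def match(similarities, candidates, threshold=0.5):
    target_index = max(range(len(similarities)), key=similarities.__getitem__)  # S71
    if similarities[target_index] > threshold:                                  # S72/S73
        return {"result": "success", "hit": candidates[target_index]}
    return {"result": "failure", "hit": None}                                   # S74

print(match([0.2, 0.8, 0.4], ["img_a", "img_b", "img_c"]))  # success, hit img_b
print(match([0.1, 0.3], ["img_a", "img_b"]))                # failure
```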
Referring to Figure 2, the present application further provides a text-image matching apparatus, the apparatus including:
a data acquisition module 100, configured to acquire an object to be matched;
a type recognition result determination module 200, configured to perform type recognition on the object to be matched to obtain a type recognition result;
a candidate object set determination module 300, configured to determine a candidate object set from a preset candidate object library according to the type recognition result;
a fusion feature extraction module 400, configured to perform fusion feature extraction according to the object to be matched and each candidate object in the candidate object set;
a candidate object feature determination module 500, configured to perform feature extraction on each candidate object in the candidate object set to obtain candidate object features;
a single-object similarity determination module 600, configured to perform similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity;
a target matching result determination module 700, configured to determine a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
In this embodiment, type recognition is performed on the object to be matched to obtain a type recognition result; a candidate object set is determined from a preset candidate object library according to the type recognition result; fusion feature extraction is performed according to the object to be matched and each candidate object in the candidate object set; feature extraction is performed on each candidate object in the candidate object set to obtain candidate object features; similarity calculation is performed on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and a target matching result corresponding to the object to be matched is determined according to each single-object similarity and the candidate object set. By first performing fusion feature extraction on the object to be matched and the candidate objects, and then performing the matching operation between the fusion features and the candidate object features, a direct matching operation between image features and text features is avoided; moreover, using fusion features for text-image matching can increase the matching precision, thereby improving the accuracy of text-image matching.
In one embodiment, the above type recognition result determination module 200 includes a classification prediction result determination sub-module, a first result determination sub-module, and a second result determination sub-module;
the classification prediction result determination sub-module is configured to input the object to be matched into a preset text image classification model for classification prediction to obtain a classification prediction result;
the first result determination sub-module is configured to determine that the type recognition result is the text type when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result;
the second result determination sub-module is configured to determine that the type recognition result is the image type when the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result.
In one embodiment, the above candidate object set determination module 300 includes a first candidate object set determination sub-module and a second candidate object set determination sub-module;
the first candidate object set determination sub-module is configured to use the image sub-library in the candidate object library as the candidate object set when the type recognition result is the text type;
the second candidate object set determination sub-module is configured to use the text sub-library in the candidate object library as the candidate object set when the type recognition result is the image type.
In one embodiment, the above fusion feature extraction module 400 includes a fusion feature extraction sub-module;
the fusion feature extraction sub-module is configured to take any candidate object in the candidate object set as a target object; input the target object into an encoding model corresponding to the type of the candidate object set for encoding to obtain a first code; input the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain a second code; concatenate the first code and the second code along the feature dimension to obtain a fusion code; and input the fusion code into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
In one embodiment, the above candidate object feature determination module 500 includes a candidate object feature determination sub-module;
the candidate object feature determination sub-module is configured to input each candidate object in the candidate object set into a single-object feature extraction model corresponding to the type of the candidate object set for feature extraction, to obtain the candidate object feature corresponding to each candidate object.
In one embodiment, the above single-object similarity determination module 600 includes a cosine similarity calculation sub-module;
the cosine similarity calculation sub-module is configured to take any candidate object in the candidate object set as an object to be calculated, take the fusion feature corresponding to the object to be calculated as a first feature, take the candidate object feature corresponding to the object to be calculated as a second feature, and perform cosine similarity calculation on the first feature and the second feature to obtain the single-object similarity corresponding to the object to be calculated.
In one embodiment, the above target matching result determination module 700 includes a similarity screening sub-module and a target matching result determination sub-module;
the similarity screening sub-module is configured to find, among the single-object similarities, the single-object similarity with the largest value as a target similarity;
the target matching result determination sub-module is configured to determine whether the target similarity is greater than a preset similarity threshold; if so, to determine that the target matching result is a success and take the candidate object corresponding to the target similarity in the candidate object set as the hit object of the target matching result; and if not, to determine that the target matching result is a failure.
Referring to Figure 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in Figure 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data for the text-image matching method. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements the text-image matching method in any of the above embodiments.
The text-image matching method includes: acquiring an object to be matched; performing type recognition on the object to be matched to obtain a type recognition result; determining a candidate object set from a preset candidate object library according to the type recognition result; performing fusion feature extraction according to the object to be matched and each candidate object in the candidate object set; performing feature extraction on each candidate object in the candidate object set to obtain candidate object features; performing similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set.
In this embodiment, type recognition is performed on the object to be matched to obtain a type recognition result; a candidate object set is determined from a preset candidate object library according to the type recognition result; fusion feature extraction is performed according to the object to be matched and each candidate object in the candidate object set; feature extraction is performed on each candidate object in the candidate object set to obtain candidate object features; similarity calculation is performed on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity; and a target matching result corresponding to the object to be matched is determined according to each single-object similarity and the candidate object set. By first performing fusion feature extraction on the object to be matched and the candidate objects, and then performing the matching operation between the fusion features and the candidate object features, a direct matching operation between image features and text features is avoided; moreover, using fusion features for text-image matching can increase the matching precision, thereby improving the accuracy of text-image matching.
In one embodiment, the above step of performing type recognition on the object to be matched to obtain a type recognition result includes: inputting the object to be matched into a preset text image classification model for classification prediction to obtain a classification prediction result; when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is the text type; and when the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is the image type.
In one embodiment, the above step of determining a candidate object set from a preset candidate object library according to the type recognition result includes: when the type recognition result is the text type, using the image sub-library in the candidate object library as the candidate object set; and when the type recognition result is the image type, using the text sub-library in the candidate object library as the candidate object set.
In one embodiment, the above step of performing fusion feature extraction according to the object to be matched and each candidate object in the candidate object set includes: taking any candidate object in the candidate object set as a target object; inputting the target object into an encoding model corresponding to the type of the candidate object set for encoding to obtain a first code; inputting the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain a second code; concatenating the first code and the second code along the feature dimension to obtain a fusion code; and inputting the fusion code into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
In one embodiment, the above step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes: inputting each candidate object in the candidate object set into a single-object feature extraction model corresponding to the type of the candidate object set for feature extraction, to obtain the candidate object feature corresponding to each candidate object.
In one embodiment, the above step of performing similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single-object similarity includes: taking any candidate object in the candidate object set as an object to be calculated; taking the fusion feature corresponding to the object to be calculated as a first feature; taking the candidate object feature corresponding to the object to be calculated as a second feature; and performing cosine similarity calculation on the first feature and the second feature to obtain the single-object similarity corresponding to the object to be calculated.
In one embodiment, the above single-object similarity is a cosine similarity, and the step of determining a target matching result corresponding to the object to be matched according to each single-object similarity and the candidate object set includes: finding, among the single-object similarities, the single-object similarity with the largest value as a target similarity; determining whether the target similarity is greater than a preset similarity threshold; if so, determining that the target matching result is a success and taking the candidate object corresponding to the target similarity in the candidate object set as the hit object of the target matching result; and if not, determining that the target matching result is a failure.
本申请一实施例还提供一种计算机可读存储介质,该计算机可读存储介质可以是非易失性,也可以是易失性,其上存储有计算机程序,计算机程序被处理器执行时实现上述任一实施例中的文本图像匹配方法,包括步骤:获取待匹配对象;对所述待匹配对象进行类型识别,得到类型识别结果;根据所述类型识别结果,从预设的候选对象库中确定候选对象集;根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取;对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征;对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度;根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果。An embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. A computer program is stored thereon. When the computer program is executed by a processor, the above-mentioned steps are implemented. The text image matching method in any embodiment includes the steps of: obtaining an object to be matched; performing type recognition on the object to be matched to obtain a type recognition result; and determining from a preset candidate object library according to the type recognition result. Candidate object set; perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set; perform feature extraction on each candidate object in the candidate object set to obtain candidate object features; extract the same The fusion feature corresponding to the candidate object and the candidate object feature are subjected to similarity calculation to obtain a single object similarity; according to the similarity of each single object and the candidate object set, the corresponding object to be matched is determined. target matching results.
上述执行的文本图像匹配方法,通过对所述待匹配对象进行类型识别,得到类型识别结果;根据所述类型识别结果,从预设的候选对象库中确定候选对象集;根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取;对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征;对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度;根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果。通过首先对待匹配对象和候选对象进行融合特征提取,然后对融合特征与候选对象特征进行匹配操作,避免图像特征和文本特征的直接匹配操作,而且采用融合特征进行文本图像匹配可以增加匹配的精度,提高了文本图像匹配的准确性。The text image matching method executed above performs type recognition on the object to be matched to obtain a type recognition result; determines a candidate object set from a preset candidate object library according to the type recognition result; performs fusion feature extraction according to the object to be matched and each candidate object in the candidate object set; performs feature extraction on each candidate object in the candidate object set to obtain candidate object features; performs similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity; and determines a target matching result corresponding to the object to be matched according to each of the single object similarities and the candidate object set. By first performing fusion feature extraction on the object to be matched and the candidate objects, and then matching the fusion features against the candidate object features, a direct matching operation between image features and text features is avoided; moreover, using fusion features for text-image matching increases matching precision and improves the accuracy of text-image matching.
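For illustration only, the steps summarized above can be sketched end to end in Python. This is a hedged, minimal sketch: the encoder, the similarity function, the 0.5 threshold, and the names `encode`, `cosine`, and `match` are all assumptions of this sketch, not details of the application's trained models.

```python
import math

# Minimal end-to-end sketch of the matching pipeline described above.
# Every model component is a deterministic stand-in, not a trained network.

def encode(obj, dim=4):
    # stand-in encoder: deterministic pseudo-embedding with strictly
    # positive components (so vector norms are never zero)
    return [((hash(obj) >> (8 * i)) & 0xFF) / 255.0 + 0.01 for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match(query, query_type, library, threshold=0.5):
    # 1. select the candidate set of the opposite modality
    candidates = library["image" if query_type == "text" else "text"]
    sims = []
    for cand in candidates:
        # 2. fusion feature: codes of candidate and query, concatenated
        fusion_feature = encode(cand) + encode(query)
        # 3. candidate feature of the same dimensionality
        candidate_feature = encode(cand) + encode(cand)
        sims.append(cosine(fusion_feature, candidate_feature))
    # 4. keep the best candidate only if it clears the threshold
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] > threshold:
        return {"result": "success", "hit": candidates[best]}
    return {"result": "failure", "hit": None}

library = {"image": ["img_a.png", "img_b.png"], "text": ["caption a", "caption b"]}
print(match("a photo of a cat", "text", library))
```

A text query is scored against every image candidate, and the single decision at the end mirrors the threshold step of the embodiment.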
在一个实施例中,上述对所述待匹配对象进行类型识别,得到类型识别结果的步骤,包括:将所述待匹配对象输入预设的文本图像分类模型进行分类预测,得到分类预测结果;当所述分类预测结果中的与文本标签对应的向量元素大于所述分类预测结果中的与图像标签对应的向量元素时,确定所述类型识别结果为文本类型;当所述分类预测结果中的与所述文本标签对应的向量元素小于所述分类预测结果中的与所述图像标签对应的向量元素时,确定所述类型识别结果为图像类型。In one embodiment, the above step of performing type recognition on the object to be matched to obtain a type recognition result includes: inputting the object to be matched into a preset text image classification model for classification prediction to obtain a classification prediction result; when the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is a text type; when the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result, determining that the type recognition result is an image type.
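The element-wise comparison in this embodiment can be shown in a few lines. A minimal sketch, assuming the classification model outputs a two-element vector; the model itself is stubbed out, and the name `identify_type` is hypothetical.

```python
# Sketch of the type-recognition decision: a text/image classification
# model yields a prediction vector [text_score, image_score]; whichever
# element is larger decides the type.

def identify_type(prediction):
    """prediction: [score for the text label, score for the image label]."""
    text_score, image_score = prediction
    if text_score > image_score:
        return "text"
    if text_score < image_score:
        return "image"
    return "undetermined"  # the embodiment leaves the equal-score case open

print(identify_type([0.8, 0.2]))  # -> text
print(identify_type([0.1, 0.9]))  # -> image
```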
在一个实施例中,上述根据所述类型识别结果,从预设的候选对象库中确定候选对象集的步骤,包括:当所述类型识别结果为文本类型时,将所述候选对象库中的图像子库作为所述候选对象集;当所述类型识别结果为图像类型时,将所述候选对象库中的文本子库作为所述候选对象集。In one embodiment, the above step of determining a candidate object set from a preset candidate object library according to the type recognition result includes: when the type recognition result is a text type, taking the image sub-library in the candidate object library as the candidate object set; when the type recognition result is an image type, taking the text sub-library in the candidate object library as the candidate object set.
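The cross-modal selection rule above is simple enough to sketch directly. This assumes a candidate object library stored as a plain dict; the keys `image_sublibrary` and `text_sublibrary` are hypothetical names introduced for this sketch.

```python
# A text query is matched against the image sub-library, and an image
# query against the text sub-library.

def select_candidate_set(type_result, candidate_library):
    if type_result == "text":
        return candidate_library["image_sublibrary"]
    if type_result == "image":
        return candidate_library["text_sublibrary"]
    raise ValueError("unexpected type recognition result: %r" % type_result)

library = {
    "image_sublibrary": ["img_a.png", "img_b.png"],
    "text_sublibrary": ["caption a", "caption b"],
}
print(select_candidate_set("text", library))   # -> ['img_a.png', 'img_b.png']
print(select_candidate_set("image", library))  # -> ['caption a', 'caption b']
```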
在一个实施例中,上述根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取的步骤,包括:将所述候选对象集中的任一个所述候选对象作为目标对象;将所述目标对象输入与所述候选对象集的类型对应的编码模型中进行编码,得到第一编码;将所述待匹配对象输入与所述类型识别结果对应的所述编码模型中进行编码,得到第二编码;将所述第一编码和所述第二编码,在维度上进行拼接,得到融合编码;将所述融合编码输入预设的融合特征提取模型进行特征提取,得到与所述目标对象对应的所述融合特征。In one embodiment, the above step of performing fusion feature extraction according to the object to be matched and each candidate object in the candidate object set includes: taking any candidate object in the candidate object set as a target object; inputting the target object into the encoding model corresponding to the type of the candidate object set for encoding to obtain a first code; inputting the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain a second code; concatenating the first code and the second code along the feature dimension to obtain a fusion code; inputting the fusion code into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
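The encode-concatenate-extract sequence above can be sketched as follows. This is a hedged illustration: the "models" below are simple stand-in functions with hypothetical names (`text_encoder`, `image_encoder`, `fusion_extractor`), not the embodiment's trained networks.

```python
# Hedged sketch of the fusion-encoding step: the candidate and the query
# are each encoded by the encoding model of their own modality, the two
# codes are concatenated along the feature dimension, and the result is
# fed to a fusion feature extraction model.

def text_encoder(text):
    # stand-in text encoding model: 3-dimensional code
    return [len(text) / 10.0, text.count(" ") / 5.0, 1.0]

def image_encoder(image_id):
    # stand-in image encoding model: 3-dimensional code
    return [len(image_id) / 10.0, 0.5, 1.0]

def fusion_extractor(code):
    # stand-in fusion feature extraction model (identity here)
    return code

def fuse(candidate, candidate_encoder, query, query_encoder):
    first_code = candidate_encoder(candidate)   # code of the candidate object
    second_code = query_encoder(query)          # code of the object to be matched
    fusion_code = first_code + second_code      # dimension-wise concatenation
    return fusion_extractor(fusion_code)

feature = fuse("img_001.png", image_encoder, "a red car", text_encoder)
print(len(feature))  # -> 6
```

The concatenation doubles the code dimension (3 + 3 = 6 here); in a real system, `fusion_extractor` would be a learned model that maps the fused code into the matching feature space.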
在一个实施例中,上述对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征的步骤,包括:将所述候选对象集中的每个所述候选对象分别输入与所述候选对象集的类型对应的单对象特征提取模型中进行特征提取,得到每个所述候选对象对应的所述候选对象特征。In one embodiment, the above step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes: inputting each candidate object in the candidate object set into the single object feature extraction model corresponding to the type of the candidate object set for feature extraction, to obtain the candidate object feature corresponding to each candidate object.
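The per-candidate extraction step is a straightforward map over the candidate set. A minimal sketch; `single_object_model` here is a hypothetical stand-in for the modality-specific feature extraction model.

```python
# Sketch of the per-candidate feature-extraction step: every candidate in
# the set is passed through the single-object feature extraction model
# matching the set's modality.

def extract_candidate_features(candidate_set, single_object_model):
    return {cand: single_object_model(cand) for cand in candidate_set}

single_object_model = lambda obj: [float(len(obj)), 1.0]  # hypothetical model
features = extract_candidate_features(["img_a.png", "img_b.png"], single_object_model)
print(features["img_a.png"])  # -> [9.0, 1.0]
```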
在一个实施例中,上述对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度的步骤,包括:将所述候选对象集中的任一个所述候选对象作为待计算对象;将所述待计算对象对应的所述融合特征作为第一特征;将所述待计算对象对应的所述候选对象特征作为第二特征;对所述第一特征与所述第二特征进行余弦相似度计算,得到所述待计算对象对应的所述单对象相似度。In one embodiment, the above step of performing similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity includes: taking any candidate object in the candidate object set as an object to be calculated; taking the fusion feature corresponding to the object to be calculated as a first feature; taking the candidate object feature corresponding to the object to be calculated as a second feature; performing cosine similarity calculation on the first feature and the second feature to obtain the single object similarity corresponding to the object to be calculated.
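The cosine similarity named in this embodiment can be written out explicitly: the dot product of the two feature vectors divided by the product of their norms. A self-contained sketch:

```python
import math

# Cosine similarity between the first feature (fusion feature) and the
# second feature (candidate object feature).

def cosine_similarity(first_feature, second_feature):
    dot = sum(a * b for a, b in zip(first_feature, second_feature))
    norm_first = math.sqrt(sum(a * a for a in first_feature))
    norm_second = math.sqrt(sum(b * b for b in second_feature))
    return dot / (norm_first * norm_second)

print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 4))  # -> 0.0
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 4))  # -> 1.0
```

Orthogonal features score 0, parallel features score 1, which is why a fixed threshold on this value (as in the following embodiment) gives a scale-invariant match decision.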
在一个实施例中,上述单对象相似度是余弦相似度,所述根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果的步骤,包括:从各个所述单对象相似度中找出值为最大的所述单对象相似度,作为目标相似度;判断所述目标相似度是否大于预设的相似度阈值;若是,则确定所述目标匹配结果的结果为成功,并且将所述目标相似度在所述候选对象集中对应的所述候选对象作为所述目标匹配结果的命中对象;若否,则确定所述目标匹配结果的结果为失败。In one embodiment, the above-mentioned single object similarity is a cosine similarity, and the step of determining the target matching result corresponding to the object to be matched according to each of the single object similarities and the candidate object set includes: finding, from the single object similarities, the single object similarity with the largest value as the target similarity; determining whether the target similarity is greater than a preset similarity threshold; if so, determining that the target matching result is success, and taking the candidate object corresponding to the target similarity in the candidate object set as the hit object of the target matching result; if not, determining that the target matching result is failure.
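The max-then-threshold decision described in this embodiment can be sketched as follows. The threshold value 0.5 is an illustrative assumption, not a value given by the application.

```python
# Sketch of the decision step: pick the largest single object similarity
# as the target similarity and compare it with a preset threshold.

def target_match(similarities, candidate_set, threshold=0.5):
    """similarities[i] is the single object similarity of candidate_set[i]."""
    target_index = max(range(len(similarities)), key=similarities.__getitem__)
    if similarities[target_index] > threshold:
        return {"result": "success", "hit": candidate_set[target_index]}
    return {"result": "failure", "hit": None}

print(target_match([0.2, 0.9, 0.4], ["img_a", "img_b", "img_c"]))
# -> {'result': 'success', 'hit': 'img_b'}
print(target_match([0.1, 0.3], ["img_a", "img_b"]))
# -> {'result': 'failure', 'hit': None}
```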
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的和实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可以包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双倍数据速率SDRAM(DDR SDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that, in this document, the terms "comprising", "including", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that includes that element.
以上所述仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. 一种文本图像匹配方法,其中,所述方法包括:A text image matching method, wherein the method includes:
    获取待匹配对象;Get the object to be matched;
    对所述待匹配对象进行类型识别,得到类型识别结果;Perform type identification on the object to be matched to obtain a type identification result;
    根据所述类型识别结果,从预设的候选对象库中确定候选对象集;According to the type recognition result, determine a candidate object set from a preset candidate object library;
    根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取;Perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
    对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征;Perform feature extraction on each candidate object in the candidate object set to obtain candidate object features;
    对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度;Perform similarity calculation on the fusion features corresponding to the same candidate object and the candidate object features to obtain single object similarity;
    根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果。According to each of the single object similarities and the candidate object set, a target matching result corresponding to the object to be matched is determined.
  2. 根据权利要求1所述的文本图像匹配方法,其中,所述对所述待匹配对象进行类型识别,得到类型识别结果的步骤,包括:The text image matching method according to claim 1, wherein the step of performing type recognition on the object to be matched and obtaining the type recognition result includes:
    将所述待匹配对象输入预设的文本图像分类模型进行分类预测,得到分类预测结果;Input the object to be matched into a preset text image classification model for classification prediction, and obtain a classification prediction result;
    当所述分类预测结果中的与文本标签对应的向量元素大于所述分类预测结果中的与图像标签对应的向量元素时,确定所述类型识别结果为文本类型;When the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, it is determined that the type identification result is a text type;
    当所述分类预测结果中的与所述文本标签对应的向量元素小于所述分类预测结果中的与所述图像标签对应的向量元素时,确定所述类型识别结果为图像类型。When the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result, it is determined that the type identification result is an image type.
  3. 根据权利要求1所述的文本图像匹配方法,其中,所述根据所述类型识别结果,从预设的候选对象库中确定候选对象集的步骤,包括:The text image matching method according to claim 1, wherein the step of determining a candidate object set from a preset candidate object library according to the type recognition result includes:
    当所述类型识别结果为文本类型时,将所述候选对象库中的图像子库作为所述候选对象集;When the type recognition result is a text type, use the image sub-library in the candidate object library as the candidate object set;
    当所述类型识别结果为图像类型时,将所述候选对象库中的文本子库作为所述候选对象集。When the type recognition result is an image type, the text sub-library in the candidate object library is used as the candidate object set.
  4. 根据权利要求1所述的文本图像匹配方法,其中,所述根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取的步骤,包括:The text image matching method according to claim 1, wherein the step of extracting fusion features based on the object to be matched and each candidate object in the candidate object set includes:
    将所述候选对象集中的任一个所述候选对象作为目标对象;Use any candidate object in the candidate object set as a target object;
    将所述目标对象输入与所述候选对象集的类型对应的编码模型中进行编码,得到第一编码;Enter the target object into a coding model corresponding to the type of the candidate object set for coding to obtain a first coding;
    将所述待匹配对象输入与所述类型识别结果对应的所述编码模型中进行编码,得到第二编码;Enter the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain a second encoding;
    将所述第一编码和所述第二编码,在维度上进行拼接,得到融合编码;Splice the first code and the second code in dimensions to obtain a fusion code;
    将所述融合编码输入预设的融合特征提取模型进行特征提取,得到与所述目标对象对应的所述融合特征。The fusion code is input into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
  5. 根据权利要求1所述的文本图像匹配方法,其中,所述对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征的步骤,包括:The text image matching method according to claim 1, wherein the step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes:
    将所述候选对象集中的每个所述候选对象分别输入与所述候选对象集的类型对应的单对象特征提取模型中进行特征提取,得到每个所述候选对象对应的所述候选对象特征。Each candidate object in the candidate object set is input into a single object feature extraction model corresponding to the type of the candidate object set for feature extraction, to obtain the candidate object feature corresponding to each candidate object.
  6. 根据权利要求1所述的文本图像匹配方法,其中,所述对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度的步骤,包括:The text image matching method according to claim 1, wherein the step of calculating the similarity of the fusion features corresponding to the same candidate object and the candidate object features to obtain the single object similarity includes:
    将所述候选对象集中的任一个所述候选对象作为待计算对象;Use any one of the candidate objects in the candidate object set as an object to be calculated;
    将所述待计算对象对应的所述融合特征作为第一特征;Use the fusion feature corresponding to the object to be calculated as the first feature;
    将所述待计算对象对应的所述候选对象特征作为第二特征;Use the candidate object feature corresponding to the object to be calculated as the second feature;
    对所述第一特征与所述第二特征进行余弦相似度计算,得到所述待计算对象对应的所述单对象相似度。Cosine similarity calculation is performed on the first feature and the second feature to obtain the single object similarity corresponding to the object to be calculated.
  7. 根据权利要求1所述的文本图像匹配方法,其中,所述单对象相似度是余弦相似度,所述根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果的步骤,包括:The text image matching method according to claim 1, wherein the single object similarity is a cosine similarity, and the step of determining the target matching result corresponding to the object to be matched according to each of the single object similarities and the candidate object set includes:
    从各个所述单对象相似度中找出值为最大的所述单对象相似度,作为目标相似度;Find the single object similarity with the largest value from each of the single object similarities as the target similarity;
    判断所述目标相似度是否大于预设的相似度阈值;Determine whether the target similarity is greater than a preset similarity threshold;
    若是,则确定所述目标匹配结果的结果为成功,并且将所述目标相似度在所述候选对象集中对应的所述候选对象作为所述目标匹配结果的命中对象;If so, determine that the result of the target matching result is successful, and use the candidate object corresponding to the target similarity in the candidate object set as the hit object of the target matching result;
    若否,则确定所述目标匹配结果的结果为失败。If not, the result of the target matching result is determined to be failure.
  8. 一种文本图像匹配装置,其中,所述装置包括:A text image matching device, wherein the device includes:
    数据获取模块,用于获取待匹配对象;Data acquisition module, used to obtain objects to be matched;
    类型识别结果确定模块,用于对所述待匹配对象进行类型识别,得到类型识别结果;A type recognition result determination module is used to perform type recognition on the object to be matched and obtain a type recognition result;
    候选对象集确定模块,用于根据所述类型识别结果,从预设的候选对象库中确定候选对象集;A candidate object set determination module, configured to determine a candidate object set from a preset candidate object library according to the type recognition result;
    融合特征提取模块,用于根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取;A fusion feature extraction module, configured to perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
    候选对象特征确定模块,用于对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征;A candidate object feature determination module, configured to perform feature extraction on each candidate object in the candidate object set to obtain candidate object features;
    单对象相似度确定模块,用于对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度;A single object similarity determination module, used to calculate the similarity between the fusion features corresponding to the same candidate object and the candidate object features to obtain the single object similarity;
    目标匹配结果确定模块,用于根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果。A target matching result determination module, configured to determine a target matching result corresponding to the object to be matched based on each of the single object similarities and the candidate object set.
  9. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,其中,所述处理器执行所述计算机程序时实现一种文本图像匹配方法,所述方法包括:A computer device, comprising a memory and a processor, the memory storing a computer program, wherein when the processor executes the computer program, a text image matching method is implemented, the method comprising:
    获取待匹配对象;Get the object to be matched;
    对所述待匹配对象进行类型识别,得到类型识别结果;Perform type identification on the object to be matched to obtain a type identification result;
    根据所述类型识别结果,从预设的候选对象库中确定候选对象集;According to the type recognition result, determine a candidate object set from a preset candidate object library;
    根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取;Perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
    对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征;Perform feature extraction on each candidate object in the candidate object set to obtain candidate object features;
    对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度;Perform similarity calculation on the fusion features corresponding to the same candidate object and the candidate object features to obtain single object similarity;
    根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果。According to each of the single object similarities and the candidate object set, a target matching result corresponding to the object to be matched is determined.
  10. 根据权利要求9所述的计算机设备,其中,所述对所述待匹配对象进行类型识别,得到类型识别结果的步骤,包括:The computer device according to claim 9, wherein the step of performing type identification on the object to be matched and obtaining a type identification result includes:
    将所述待匹配对象输入预设的文本图像分类模型进行分类预测,得到分类预测结果;Input the object to be matched into a preset text image classification model for classification prediction, and obtain a classification prediction result;
    当所述分类预测结果中的与文本标签对应的向量元素大于所述分类预测结果中的与图像标签对应的向量元素时,确定所述类型识别结果为文本类型;When the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, it is determined that the type identification result is a text type;
    当所述分类预测结果中的与所述文本标签对应的向量元素小于所述分类预测结果中的与所述图像标签对应的向量元素时,确定所述类型识别结果为图像类型。When the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result, it is determined that the type identification result is an image type.
  11. 根据权利要求9所述的计算机设备,其中,所述根据所述类型识别结果,从预设的候选对象库中确定候选对象集的步骤,包括:The computer device according to claim 9, wherein the step of determining a candidate object set from a preset candidate object library according to the type recognition result includes:
    当所述类型识别结果为文本类型时,将所述候选对象库中的图像子库作为所述候选对象集;When the type recognition result is a text type, use the image sub-library in the candidate object library as the candidate object set;
    当所述类型识别结果为图像类型时,将所述候选对象库中的文本子库作为所述候选对象集。When the type recognition result is an image type, the text sub-library in the candidate object library is used as the candidate object set.
  12. 根据权利要求9所述的计算机设备,其中,所述根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取的步骤,包括:The computer device according to claim 9, wherein the step of extracting fusion features based on the object to be matched and each candidate object in the candidate object set includes:
    将所述候选对象集中的任一个所述候选对象作为目标对象;Use any candidate object in the candidate object set as a target object;
    将所述目标对象输入与所述候选对象集的类型对应的编码模型中进行编码,得到第一编码;Enter the target object into a coding model corresponding to the type of the candidate object set for coding to obtain a first coding;
    将所述待匹配对象输入与所述类型识别结果对应的所述编码模型中进行编码,得到第二编码;Enter the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain a second encoding;
    将所述第一编码和所述第二编码,在维度上进行拼接,得到融合编码;Splice the first code and the second code in dimensions to obtain a fusion code;
    将所述融合编码输入预设的融合特征提取模型进行特征提取,得到与所述目标对象对应的所述融合特征。The fusion code is input into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
  13. 根据权利要求9所述的计算机设备,其中,所述对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征的步骤,包括:The computer device according to claim 9, wherein the step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes:
    将所述候选对象集中的每个所述候选对象分别输入与所述候选对象集的类型对应的单对象特征提取模型中进行特征提取,得到每个所述候选对象对应的所述候选对象特征。Each candidate object in the candidate object set is input into a single object feature extraction model corresponding to the type of the candidate object set for feature extraction, to obtain the candidate object feature corresponding to each candidate object.
  14. 根据权利要求9所述的计算机设备,其中,所述对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度的步骤,包括:The computer device according to claim 9, wherein the step of performing similarity calculation on the fusion feature corresponding to the same candidate object and the candidate object feature to obtain a single object similarity includes:
    将所述候选对象集中的任一个所述候选对象作为待计算对象;Use any one of the candidate objects in the candidate object set as an object to be calculated;
    将所述待计算对象对应的所述融合特征作为第一特征;Use the fusion feature corresponding to the object to be calculated as the first feature;
    将所述待计算对象对应的所述候选对象特征作为第二特征;Use the candidate object feature corresponding to the object to be calculated as the second feature;
    对所述第一特征与所述第二特征进行余弦相似度计算,得到所述待计算对象对应的所述单对象相似度。Cosine similarity calculation is performed on the first feature and the second feature to obtain the single object similarity corresponding to the object to be calculated.
  15. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现一种文本图像匹配方法,所述方法包括:A computer-readable storage medium with a computer program stored thereon, wherein when the computer program is executed by a processor, a text-image matching method is implemented, and the method includes:
    获取待匹配对象;Get the object to be matched;
    对所述待匹配对象进行类型识别,得到类型识别结果;Perform type identification on the object to be matched to obtain a type identification result;
    根据所述类型识别结果,从预设的候选对象库中确定候选对象集;According to the type recognition result, determine a candidate object set from a preset candidate object library;
    根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取;Perform fusion feature extraction based on the object to be matched and each candidate object in the candidate object set;
    对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征;Perform feature extraction on each candidate object in the candidate object set to obtain candidate object features;
    对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度;Perform similarity calculation on the fusion features corresponding to the same candidate object and the candidate object features to obtain single object similarity;
    根据各个所述单对象相似度和所述候选对象集,确定与所述待匹配对象对应的目标匹配结果。According to each of the single object similarities and the candidate object set, a target matching result corresponding to the object to be matched is determined.
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述对所述待匹配对象进行类型识别,得到类型识别结果的步骤,包括:The computer-readable storage medium according to claim 15, wherein the step of performing type identification on the object to be matched and obtaining the type identification result includes:
    将所述待匹配对象输入预设的文本图像分类模型进行分类预测,得到分类预测结果;Input the object to be matched into a preset text image classification model for classification prediction, and obtain a classification prediction result;
    当所述分类预测结果中的与文本标签对应的向量元素大于所述分类预测结果中的与图像标签对应的向量元素时,确定所述类型识别结果为文本类型;When the vector element corresponding to the text label in the classification prediction result is greater than the vector element corresponding to the image label in the classification prediction result, it is determined that the type identification result is a text type;
    当所述分类预测结果中的与所述文本标签对应的向量元素小于所述分类预测结果中的与所述图像标签对应的向量元素时,确定所述类型识别结果为图像类型。When the vector element corresponding to the text label in the classification prediction result is smaller than the vector element corresponding to the image label in the classification prediction result, it is determined that the type identification result is an image type.
  17. 根据权利要求15所述的计算机可读存储介质,其中,所述根据所述类型识别结果,从预设的候选对象库中确定候选对象集的步骤,包括:The computer-readable storage medium according to claim 15, wherein the step of determining a candidate object set from a preset candidate object library according to the type recognition result includes:
    当所述类型识别结果为文本类型时,将所述候选对象库中的图像子库作为所述候选对象集;When the type recognition result is a text type, use the image sub-library in the candidate object library as the candidate object set;
    当所述类型识别结果为图像类型时,将所述候选对象库中的文本子库作为所述候选对象集。When the type recognition result is an image type, the text sub-library in the candidate object library is used as the candidate object set.
  18. 根据权利要求15所述的计算机可读存储介质,其中,所述根据所述待匹配对象和所述候选对象集中的每个候选对象进行融合特征提取的步骤,包括:The computer-readable storage medium according to claim 15, wherein the step of performing fusion feature extraction based on the object to be matched and each candidate object in the candidate object set includes:
    将所述候选对象集中的任一个所述候选对象作为目标对象;Use any candidate object in the candidate object set as a target object;
    将所述目标对象输入与所述候选对象集的类型对应的编码模型中进行编码,得到第一编码;Enter the target object into a coding model corresponding to the type of the candidate object set for coding to obtain a first coding;
    将所述待匹配对象输入与所述类型识别结果对应的所述编码模型中进行编码,得到第二编码;Enter the object to be matched into the encoding model corresponding to the type recognition result for encoding to obtain a second encoding;
    将所述第一编码和所述第二编码,在维度上进行拼接,得到融合编码;Splice the first code and the second code in dimensions to obtain a fusion code;
    将所述融合编码输入预设的融合特征提取模型进行特征提取,得到与所述目标对象对应的所述融合特征。The fusion code is input into a preset fusion feature extraction model for feature extraction to obtain the fusion feature corresponding to the target object.
  19. 根据权利要求15所述的计算机可读存储介质,其中,所述对所述候选对象集中的每个所述候选对象进行特征提取,得到候选对象特征的步骤,包括:The computer-readable storage medium according to claim 15, wherein the step of performing feature extraction on each candidate object in the candidate object set to obtain candidate object features includes:
    将所述候选对象集中的每个所述候选对象分别输入与所述候选对象集的类型对应的单对象特征提取模型中进行特征提取,得到每个所述候选对象对应的所述候选对象特征。Each candidate object in the candidate object set is input into a single object feature extraction model corresponding to the type of the candidate object set for feature extraction, to obtain the candidate object feature corresponding to each candidate object.
  20. 根据权利要求15所述的计算机可读存储介质,其中,所述对同一所述候选对象对应的所述融合特征和所述候选对象特征进行相似度计算,得到单对象相似度的步骤,包括:The computer-readable storage medium according to claim 15, wherein the step of performing similarity calculation on the fusion feature and the candidate object feature corresponding to the same candidate object to obtain a single object similarity includes:
    将所述候选对象集中的任一个所述候选对象作为待计算对象;Use any one of the candidate objects in the candidate object set as an object to be calculated;
    将所述待计算对象对应的所述融合特征作为第一特征;Use the fusion feature corresponding to the object to be calculated as the first feature;
    将所述待计算对象对应的所述候选对象特征作为第二特征;Use the candidate object feature corresponding to the object to be calculated as the second feature;
    对所述第一特征与所述第二特征进行余弦相似度计算,得到所述待计算对象对应的所述单对象相似度。Cosine similarity calculation is performed on the first feature and the second feature to obtain the single object similarity corresponding to the object to be calculated.
PCT/CN2022/090161 2022-03-16 2022-04-29 Text image matching method and apparatus, device, and storage medium WO2023173547A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210256789.1A CN114723986A (en) 2022-03-16 2022-03-16 Text image matching method, device, equipment and storage medium
CN202210256789.1 2022-03-16

Publications (1)

Publication Number Publication Date
WO2023173547A1 true WO2023173547A1 (en) 2023-09-21

Family

ID=82238459

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090161 WO2023173547A1 (en) 2022-03-16 2022-04-29 Text image matching method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114723986A (en)
WO (1) WO2023173547A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115966061B (en) * 2022-12-28 2023-10-24 上海帜讯信息技术股份有限公司 Disaster early warning processing method, system and device based on 5G message

Citations (5)

Publication number Priority date Publication date Assignee Title
CN110096641A (en) * 2019-03-19 2019-08-06 深圳壹账通智能科技有限公司 Picture and text matching process, device, equipment and storage medium based on image analysis
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN112148839A (en) * 2020-09-29 2020-12-29 北京小米松果电子有限公司 Image-text matching method and device and storage medium
CN112818157A (en) * 2021-02-10 2021-05-18 浙江大学 Combined query image retrieval method based on multi-order confrontation characteristic learning

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
AU2008200301A1 (en) * 2008-01-22 2009-08-06 The University Of Western Australia Image recognition
CN113392341A (en) * 2020-09-30 2021-09-14 腾讯科技(深圳)有限公司 Cover selection method, model training method, device, equipment and storage medium
CN112598575B (en) * 2020-12-22 2022-05-03 电子科技大学 Image information fusion and super-resolution reconstruction method based on feature processing
CN113656660B (en) * 2021-10-14 2022-06-28 北京中科闻歌科技股份有限公司 Cross-modal data matching method, device, equipment and medium

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110096641A (en) * 2019-03-19 2019-08-06 深圳壹账通智能科技有限公司 Picture and text matching process, device, equipment and storage medium based on image analysis
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN112148839A (en) * 2020-09-29 2020-12-29 北京小米松果电子有限公司 Image-text matching method and device and storage medium
CN112818157A (en) * 2021-02-10 2021-05-18 浙江大学 Combined query image retrieval method based on multi-order confrontation characteristic learning

Also Published As

Publication number Publication date
CN114723986A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN111104495B (en) Information interaction method, device, equipment and storage medium based on intention recognition
CN111651992A (en) Named entity labeling method and device, computer equipment and storage medium
CN113704476B (en) Target event extraction data processing system
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN109344242B (en) Dialogue question-answering method, device, equipment and storage medium
CN114139551A (en) Method and device for training intention recognition model and method and device for recognizing intention
CN113722461B (en) Target event extraction data processing system
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN111859916B (en) Method, device, equipment and medium for extracting key words of ancient poems and generating poems
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN116450796A (en) Intelligent question-answering model construction method and device
CN111695053A (en) Sequence labeling method, data processing device and readable storage medium
CN111259113A (en) Text matching method and device, computer readable storage medium and computer equipment
CN115495553A (en) Query text ordering method and device, computer equipment and storage medium
CN113723070A (en) Text similarity model training method, text similarity detection method and text similarity detection device
CN113468433A (en) Target event extraction data processing system
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN113408287A (en) Entity identification method and device, electronic equipment and storage medium
CN112270184A (en) Natural language processing method, device and storage medium
CN115203372A (en) Text intention classification method and device, computer equipment and storage medium
CN113806646A (en) Sequence labeling system and training system of sequence labeling model
WO2023173547A1 (en) Text image matching method and apparatus, device, and storage medium
CN113254575B (en) Machine reading understanding method and system based on multi-step evidence reasoning
CN117093682A (en) Intention recognition method, device, computer equipment and storage medium
CN114048753B (en) Word sense recognition model training, word sense judging method, device, equipment and medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22931574

Country of ref document: EP

Kind code of ref document: A1