CN111666437A - Image-text retrieval method and device based on local matching

Image-text retrieval method and device based on local matching

Info

Publication number: CN111666437A
Application number: CN201910173421.7A
Authority: CN (China)
Prior art keywords: text, image, information, retrieval, data
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 卢禹锟, 田伟伟, 董健, 颜水成
Current Assignee: Beijing Qihoo Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2019-03-07 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2019-03-07
Publication date: 2020-09-15
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN201910173421.7A
Publication of CN111666437A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/53: Querying
    • G06F16/532: Query formulation, e.g. graphical querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an image-text retrieval method and device based on local matching, wherein the method comprises the following steps: acquiring text retrieval information input by a user; calling a pre-constructed image-text matching model, inputting the text retrieval information into the image-text matching model, obtaining at least one retrieval keyword from the text retrieval information by the image-text matching model, and retrieving at least one frame of image matched with the text retrieval information based on the retrieval keyword; and acquiring the at least one frame of image output by the image-text matching model after retrieval. According to the scheme provided by the invention, the retrieval keywords are extracted from the text retrieval information and the images matching those keywords are then retrieved, so that images meeting the user's requirements can be acquired efficiently and accurately.

Description

Image-text retrieval method and device based on local matching
Technical Field
The invention relates to the technical field of retrieval, in particular to a local matching-based image-text retrieval method and device.
Background
There is great demand for image-text similarity in advertising and recommendation. In practical applications, image information is complicated and plentiful: a single image often contains multiple subjects, only a few images carry category information, a large number of images cannot be classified, and salient-region annotation is impractical at scale. Moreover, image description text is usually obtained by parsing HTML and contains many irrelevant words and phrases, so the semantics of a picture and its accompanying text often fail to match, or the textual description of the picture is weak. As a result, when similar pictures are searched for on the basis of existing text, accurate pictures cannot be obtained.
Disclosure of Invention
The invention provides an image-text retrieval method and device based on local matching, so as to overcome the above problems or at least partially solve them.
According to one aspect of the invention, a local matching-based image-text retrieval method is provided, which comprises the following steps:
acquiring text retrieval information input by a user;
calling a pre-constructed image-text matching model, inputting the text retrieval information into the image-text matching model, obtaining at least one retrieval keyword from the text retrieval information by the image-text matching model, and retrieving at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and acquiring the at least one frame of image output by the image-text matching model after retrieval.
Optionally, before the calling of a pre-constructed image-text matching model and the inputting of the text retrieval information into the image-text matching model, the method further includes:
constructing the image-text matching model;
collecting various types of text information and corresponding image information, and establishing an image-text database comprising text-image data pairs;
training the image-text matching model based on the text-image data pairs in the image-text database.
Optionally, the training of the image-text matching model based on the text-image data pairs in the image-text database includes:
acquiring any one or more groups of text-image data pairs in the image-text database;
outputting text representation data of the text information and image representation data of the image information in the text-image data pair by using an attention mechanism;
learning the correlation between the text information and the image information according to the text representation data and the image representation data, and obtaining the correlation of the text-image data pair; taking the text-image data pairs with the correlation degrees larger than a first preset threshold value as positive sample data, and taking the text-image data pairs with the correlation degrees smaller than a second preset threshold value as negative sample data;
and training the image-text matching model based on the positive sample data and the negative sample data.
Optionally, after acquiring any one or more groups of text-image data pairs in the image-text database, the method further includes:
and acquiring a salient region of the image information in the text-image data pair through a detector, and cleaning the text information in the text-image data pair.
Optionally, the outputting, using an attention mechanism, text representation data of text information and image representation data of image information in the text-image data pair includes:
text representation data of the cleaned text information and image representation data of salient regions of the image information are output using an attention mechanism.
Optionally, the learning of the correlation between the text information and the image information based on the text representation data and the image representation data and the deriving of the correlation of the text-image data pair include:
and interactively aligning the image representation data and the text representation data, learning the correlation between the text representation data and the image representation data, and obtaining the correlation of the text-image data pair.
Optionally, after learning the correlation between the text information and the image information based on the text representation data and the image representation data and obtaining the correlation of the text-image data pair, the method further includes:
comparing the relevance of the text-image data pair with a third preset threshold;
and if the correlation degree of the text-image data pair is greater than the third preset threshold value, generating a label of the image information in the text-image data pair by using the text information in the text-image data pair.
According to another aspect of the present invention, there is also provided an image-text retrieval apparatus based on local matching, including:
the information acquisition module is configured to acquire text retrieval information input by a user;
the retrieval module is configured to call a pre-constructed image-text matching model, input the text retrieval information into the image-text matching model, obtain at least one retrieval keyword from the text retrieval information through the image-text matching model, and retrieve at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and the image output module is configured to acquire the at least one frame of image output by the image-text matching model after retrieval.
Optionally, the apparatus further comprises:
the model construction module is configured to construct the image-text matching model;
the data collection module is configured to collect various types of text information and corresponding image information and establish an image-text database comprising text-image data pairs;
a model training module configured to train the image-text matching model based on the text-image data pairs in the image-text database.
Optionally, the model training module is further configured to:
acquiring any one or more groups of text-image data pairs in the image-text database;
outputting text representation data of the text information and image representation data of the image information in the text-image data pair by using an attention mechanism;
learning the correlation between the text information and the image information according to the text representation data and the image representation data, and obtaining the correlation of the text-image data pair; taking the text-image data pairs with the correlation degrees larger than a first preset threshold value as positive sample data, and taking the text-image data pairs with the correlation degrees smaller than a second preset threshold value as negative sample data;
and training the image-text matching model based on the positive sample data and the negative sample data.
Optionally, the model training module is further configured to:
after any one or more groups of text-image data pairs in the image-text database are obtained, a detector is used for obtaining the salient region of the image information in the text-image data pairs, and the text information in the text-image data pairs is cleaned.
Optionally, the model training module is further configured to:
text representation data of the cleaned text information and image representation data of salient regions of the image information are output using an attention mechanism.
Optionally, the model training module is further configured to:
and interactively aligning the image representation data and the text representation data, learning the correlation between the text representation data and the image representation data, and obtaining the correlation of the text-image data pair.
Optionally, the apparatus further comprises:
a marking module configured to compare the relevance of the text-image data pair with a third preset threshold;
and if the correlation degree of the text-image data pair is greater than the third preset threshold value, generating a label of the image information in the text-image data pair by using the text information in the text-image data pair.
According to another aspect of the present invention, there is also provided a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute any one of the above local matching-based image-text retrieval methods.
According to another aspect of the present invention, there is also provided a computing device comprising:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform any of the local matching-based image-text retrieval methods described above.
The invention provides a more efficient image-text retrieval method and device: based on text retrieval information input by a user, at least one frame of image matched with the text retrieval information can be output through a pre-constructed image-text matching model. In the scheme provided by the embodiment of the invention, after the text retrieval information from the user is input into the image-text matching model, the retrieval keywords in the text retrieval information are extracted and the images matching those keywords are then retrieved, so that images meeting the user's requirements can be obtained efficiently and accurately.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flow chart of a local matching-based image-text retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for training a graph-text matching model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image-text retrieval apparatus based on local matching according to an embodiment of the invention;
fig. 4 is a schematic structural diagram of an image-text retrieval apparatus based on local matching according to a preferred embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
An embodiment of the present invention provides an image-text retrieval method based on local matching, and as shown in fig. 1, the image-text retrieval method based on local matching provided by the embodiment of the present invention may include:
step S102, acquiring text retrieval information input by a user;
step S104, calling a pre-constructed image-text matching model, inputting the text retrieval information into the image-text matching model, obtaining at least one retrieval keyword from the text retrieval information by the image-text matching model, and retrieving at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and step S106, acquiring the at least one frame of image output by the image-text matching model after retrieval.
The embodiment of the invention provides a more efficient image-text retrieval method: based on text retrieval information input by a user, at least one frame of image matched with the text retrieval information can be output through a pre-constructed image-text matching model. In the method provided by the embodiment of the invention, after the text retrieval information from the user is input into the image-text matching model, the retrieval keywords in the text retrieval information are extracted and the images matching those keywords are then retrieved, so that images meeting the user's requirements can be obtained efficiently and accurately.
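To make the flow concrete, the following minimal sketch walks through steps S102-S106 in Python. The class ImageTextMatchingModel, its methods and the keyword heuristic are illustrative assumptions for this document only; the patent does not define a concrete programming interface.

```python
# Hypothetical sketch of steps S102-S106; names and the keyword heuristic are
# assumptions, not an API defined by the patent.

class ImageTextMatchingModel:
    def __init__(self, index):
        self.index = index  # assumed mapping: keyword/label -> image paths

    def extract_keywords(self, query):
        # Stand-in for the model's keyword extraction in step S104.
        return [w for w in query.split() if len(w) > 2]

    def retrieve(self, keywords):
        # Return every indexed image whose label matches a retrieval keyword.
        hits = []
        for kw in keywords:
            hits.extend(self.index.get(kw, []))
        return hits

def search(model, query):                     # S102: user's text retrieval information
    keywords = model.extract_keywords(query)  # S104: keyword extraction
    return model.retrieve(keywords)           # S104/S106: matched images

model = ImageTextMatchingModel({"Hainan": ["img_001.jpg"]})
print(search(model, "landscape of Hainan"))   # -> ['img_001.jpg']
```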
In step S104, the retrieval of images matched with the text retrieval information may be implemented by a pre-constructed image-text matching model. Therefore, before step S104, the model may be constructed and trained, which specifically includes: constructing the image-text matching model; collecting various types of text information and corresponding image information, and establishing an image-text database of text-image data pairs; and training the model based on the text-image data pairs in the image-text database.
The image-text matching model in the embodiment of the invention is a network model; its construction can be realized through TensorFlow code or other neural-network frameworks. For a network model, sample data needs to be collected in advance to train the model, so a large amount of text information and corresponding image information can be collected, each piece of text information and its corresponding image information form a text-image data pair, and an image-text database is established from these text-image data pairs. When collecting text-image data pairs, they can be acquired from mass network information (e.g., via a web crawler), acquired from an existing database, manually entered for existing pictures, and so on. Further, after the text and image information has been collected to construct the image-text database, the image-text matching model can be trained based on the data in that database.
For example, during data collection a query may be input to a search engine, and text-image data pairs may be established from the plurality of images retrieved for that query.
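A minimal sketch of that collection step follows, assuming a hypothetical collect_images_for_query helper in place of a real crawler or search-engine API:

```python
# Sketch of assembling an image-text database from search-engine queries.
# collect_images_for_query is a hypothetical placeholder, not a real API.

def collect_images_for_query(query):
    # Assumption: returns paths/URLs of images retrieved for the query.
    return [f"{query.replace(' ', '_')}_{i}.jpg" for i in range(3)]

def build_database(queries):
    database = []
    for q in queries:
        for image in collect_images_for_query(q):
            database.append({"text": q, "image": image})  # one text-image pair
    return database

pairs = build_database(["Hainan landscape", "city night view"])
print(len(pairs), "text-image pairs collected")  # -> 6 text-image pairs collected
```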
Another embodiment of the present invention further provides a method for training the image-text matching model. As shown in fig. 2, the training method according to the embodiment of the present invention may include:
step S202, acquiring any one or more groups of text-image data pairs in the image-text database. When a network model is trained, a training set and a validation set are constructed from existing data. As mentioned above, the image-text database has been constructed based on the collected text information and image information; therefore, when training the image-text matching model, the text-image data pairs in the image-text database may first be obtained as initial training data.
In the above embodiment, the data in the image-text database may have been collected from network data: the image information is complicated, one image often contains multiple subjects, and the text information may also include content unrelated to its actual meaning. Therefore, after the text-image data pairs are obtained from the image-text database, the salient regions of the image information in each pair may be obtained by a detector, and the text information in each pair may be cleaned; every text-image data pair can be processed in this way. Cleaning the text information mainly screens out irrelevant text, i.e., semantically inconsistent text, so that credible data is obtained before model training, which improves the efficiency of training. For example, suppose a landscape image is obtained from the network whose accompanying text is "Hainan Island seven-day trip", a very weak description of the image; in this case the subject "Hainan Island" needs to be extracted from "Hainan Island seven-day trip", and the remaining irrelevant information is cleaned away.
In addition, by marking the salient regions of the image information, the meaning actually expressed by the image itself can be determined. When marking the salient regions, the subjects can be obtained in advance by a detector, and a saliency mask can be produced by a traditional saliency algorithm. When acquiring the salient regions of any image, the image can be cut into blocks, and people, scenery and the like can be detected through dedicated detectors for local matching. An image may contain one or more salient regions, which can be analyzed according to the image content.
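A rough sketch of this preprocessing is given below; the per-category detectors and the stop-token list are illustrative assumptions, since the patent does not prescribe particular detectors or cleaning rules.

```python
# Sketch of preprocessing: per-category detectors mark salient regions, and
# irrelevant tokens are cleaned out of the paired text.

STOP_TOKENS = {"seven", "day", "trip", "tour"}  # assumed irrelevant words

def detect_salient_regions(image, detectors):
    # Each detector returns bounding boxes for its category (person, scenery, ...).
    regions = []
    for category, detector in detectors.items():
        for box in detector(image):
            regions.append({"category": category, "box": box})
    return regions

def clean_text(text):
    # Keep only tokens that plausibly describe the image's subjects.
    return [tok for tok in text.lower().split() if tok not in STOP_TOKENS]

# Dummy detector that treats the whole frame as a single scenery region.
detectors = {"scenery": lambda img: [(0, 0, 100, 100)]}
print(detect_salient_regions("beach.jpg", detectors))
print(clean_text("Hainan Island seven day trip"))  # -> ['hainan', 'island']
```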
Step S204, outputting text representation data of the text information and image representation data of the image information in the text-image data pair by using an attention mechanism. As described above, after the text-image data pairs are acquired from the image-text database, the image information and text information of each group are processed accordingly; thus, when representing the text and image information in each group, an attention mechanism can be used to output text representation data of the cleaned text information and image representation data of the salient regions of the image information. The text information can be modeled with a pre-trained language model, with the sentence representation output through an attention mechanism; in particular, the text representation data can be generated with an encoder-decoder scheme.
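As one concrete reading of attention-based sentence representation, the sketch below pools token embeddings with learned attention weights, written in TensorFlow since the description names it as one option; all dimensions are arbitrary examples.

```python
# Sketch of attention-pooled sentence representation; sizes are illustrative.
import tensorflow as tf

token_emb = tf.random.normal([1, 12, 256])             # [batch, tokens, dim], e.g. from a pre-trained LM
scores = tf.keras.layers.Dense(1)(token_emb)           # one attention score per token
weights = tf.nn.softmax(scores, axis=1)                # normalize over the token axis
sentence_repr = tf.reduce_sum(weights * token_emb, 1)  # weighted sum -> [1, 256]
print(sentence_repr.shape)
```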
For the attention characterization of the salient regions of the image information, after a salient region is input, a description corresponding to that region can be output. The image representation data can likewise be generated by encoding and decoding: the encoder is a convolutional network that extracts the high-level features of the image and expresses them as a coding vector; the decoder is a recurrent neural-network language model whose initial input is the coding vector and which generates the description text of the image. Image description generation suffers from two problems, an encoding-capacity bottleneck and long-distance dependence, so an attention mechanism can be used to select information effectively: when generating each word of the description, the input to the recurrent network uses attention to select relevant information from the image in addition to the information of the previous word.
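A single decoding step under that encoder-decoder scheme might look like the sketch below, with additive attention over region features; the layer sizes and vocabulary size are assumptions.

```python
# Sketch of one decoder step attending over encoded salient-region features.
import tensorflow as tf

regions = tf.random.normal([1, 9, 512])   # CNN features of 9 salient regions
prev_word = tf.random.normal([1, 128])    # embedding of the previously generated word
state = tf.random.normal([1, 256])        # decoder RNN hidden state

# Additive attention: score each region against the current decoder state.
w_r, w_s = tf.keras.layers.Dense(256), tf.keras.layers.Dense(256)
v = tf.keras.layers.Dense(1)
scores = v(tf.nn.tanh(w_r(regions) + tf.expand_dims(w_s(state), 1)))  # [1, 9, 1]
attn = tf.nn.softmax(scores, axis=1)
context = tf.reduce_sum(attn * regions, axis=1)                       # [1, 512]

# The decoder consumes the previous word plus the attended image context.
cell = tf.keras.layers.GRUCell(256)
output, _ = cell(tf.concat([prev_word, context], axis=-1), [state])
next_word_logits = tf.keras.layers.Dense(10000)(output)  # assumed vocabulary size
print(next_word_logits.shape)  # (1, 10000)
```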
Step S206, learning the correlation between the text information and the image information according to the text representation data and the image representation data, and obtaining the correlation degree of the text-image data pair. Step S204 produced the image representation data and text representation data of each group of text-image data pairs; learning can now be performed on these attention-characterized representations (each typically a group of vectors) to determine their correlation degree. Optionally, when performing the correlation learning, the image representation data and the text representation data may be interactively aligned, the correlation between them learned, and the correlation degree of the text-image data pair obtained.
Further, text-image data pairs whose correlation degree is greater than a first preset threshold are taken as positive sample data, and text-image data pairs whose correlation degree is smaller than a second preset threshold are taken as negative sample data. The first and second preset thresholds may be set according to the application scenario of the image-text matching model and the user's requirements, which the present invention does not limit.
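The sketch below shows this split with cosine similarity standing in for the learned correlation degree; the two threshold values are arbitrary examples, since the patent leaves them to the application scenario.

```python
# Sketch of splitting text-image pairs into positive/negative training samples.
import numpy as np

def correlation(text_vec, image_vec):
    # Cosine similarity as a simple stand-in for the learned correlation degree.
    return float(np.dot(text_vec, image_vec) /
                 (np.linalg.norm(text_vec) * np.linalg.norm(image_vec)))

FIRST_THRESHOLD, SECOND_THRESHOLD = 0.7, 0.3  # assumed example values

def split_samples(pairs):
    positives, negatives = [], []
    for text_vec, image_vec, pair in pairs:
        c = correlation(text_vec, image_vec)
        if c > FIRST_THRESHOLD:
            positives.append(pair)   # strongly correlated: positive sample
        elif c < SECOND_THRESHOLD:
            negatives.append(pair)   # weakly correlated: negative sample
    return positives, negatives
```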
Step S208, training the image-text matching model based on the positive sample data and the negative sample data.
Optionally, in the process of training the image-text matching model, the rank loss can be modified to support ranking over a larger batch of randomly drawn negative samples, so that non-explicit negatives can be ordered. A loss function maps the value of a random event, or of its associated random variable, to a non-negative real number representing the "risk" or "loss" of that event. In machine learning, the loss function estimates the degree of inconsistency between the model's predicted value and the true value; it is a non-negative real-valued function, and the smaller the loss, the better the robustness of the model. By continuously training the image-text matching model and optimizing the training parameters, the efficiency and quality of the model's image-text matching can be improved.
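One common form such a modified rank loss can take is a bidirectional triplet ranking loss computed over all in-batch negatives, sketched below; the margin value and the exact formulation are assumptions, as the description does not spell out the modification.

```python
# Sketch of a bidirectional triplet ranking loss with in-batch random negatives.
import tensorflow as tf

def ranking_loss(text_emb, image_emb, margin=0.2):
    # Similarity of every text against every image in the batch; the diagonal
    # holds the true (positive) pairs. Embeddings are assumed L2-normalized,
    # so the dot product acts as a cosine similarity.
    sims = tf.matmul(text_emb, image_emb, transpose_b=True)
    pos = tf.linalg.diag_part(sims)
    cost_img = tf.nn.relu(margin + sims - pos[:, None])  # text -> wrong image
    cost_txt = tf.nn.relu(margin + sims - pos[None, :])  # image -> wrong text
    mask = 1.0 - tf.eye(tf.shape(sims)[0])               # zero out the positives
    return tf.reduce_sum((cost_img + cost_txt) * mask)

loss = ranking_loss(tf.math.l2_normalize(tf.random.normal([8, 256]), axis=1),
                    tf.math.l2_normalize(tf.random.normal([8, 256]), axis=1))
```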
According to the training method for the image-text matching model provided by the embodiment of the invention, marking the salient regions of the image information and cleaning the text information in every text-image data pair of the image-text database effectively improves the quality of the training data, and performing correlation learning on the attention-characterized image and text representation data to obtain positive and negative samples improves image-text correlation filtering and hence the quality of image-text recall.
In an optional embodiment of the present invention, the correlation degree of a text-image data pair may be compared with a third preset threshold, and if it is greater than the third preset threshold, the text information in the pair is used to generate a label for the image information in the pair. With the image information labeled in this way, when text retrieval information is input to the image-text matching model, labels identical or highly similar to the text retrieval information can be matched first, and the corresponding image information then acquired. Optionally, before labeling the image information with the text information, at least one piece of keyword information can be extracted from the text, and the extracted keywords then used as labels of the corresponding image information. A piece of image information may have one or more labels, set according to the local subjects it contains, which the present invention does not limit.
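A small sketch of that labeling pass follows; the extract_keywords heuristic and the threshold value are hypothetical, used only to illustrate the step.

```python
# Sketch of labeling images with keywords from highly correlated text.

THIRD_THRESHOLD = 0.8  # assumed example value

def extract_keywords(text):
    # Placeholder heuristic: treat capitalized tokens as subject keywords.
    return [tok for tok in text.split() if tok[:1].isupper()]

def label_images(scored_pairs):
    labels = {}
    for text, image, score in scored_pairs:
        if score > THIRD_THRESHOLD:
            # An image may receive several labels, one per extracted keyword.
            labels.setdefault(image, []).extend(extract_keywords(text))
    return labels

print(label_images([("Tianya Haijiao in Hainan", "img_7.jpg", 0.92)]))
# -> {'img_7.jpg': ['Tianya', 'Haijiao', 'Hainan']}
```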
Taking the matching of an advertiser's text information as an example, the image-text retrieval method based on local matching provided by the embodiment of the invention may include the following steps:
1. receiving the text retrieval information "Tianya Haijiao landscape in Hainan" input by a user;
2. calling a pre-constructed image-text matching model, inputting the text retrieval information "Tianya Haijiao landscape in Hainan" into the model, and acquiring the retrieval keywords "Hainan" and "Tianya Haijiao" of the text retrieval information through the model;
3. retrieving, by the image-text matching model, based on the keywords "Hainan" and "Tianya Haijiao", and acquiring and outputting at least one frame of landscape image that carries the same labels as, or is semantically similar to, the keywords;
4. obtaining the images output by the image-text matching model and screening out one image as the matching picture for the text "Tianya Haijiao landscape in Hainan".
In this embodiment of the invention, a landscape picture of Tianya Haijiao in Hainan that meets the user's requirements can thus be retrieved on the basis of the user's request. Whereas a traditional approach might return a picture containing a portrait, the scheme provided by the embodiment of the invention detects the subjects of pictures in advance and labels the images with high quality, so when images are output by the text matching model, high-quality images relevant to the text can be returned to meet the user's requirements.
Based on the same inventive concept, an embodiment of the present invention further provides a device for retrieving an image and text based on local matching, as shown in fig. 3, the device for retrieving an image and text based on local matching according to the embodiment of the present invention may include:
an information obtaining module 310 configured to obtain text retrieval information input by a user;
the retrieval module 320 is configured to call a pre-constructed image-text matching model, input the text retrieval information into the image-text matching model, obtain at least one retrieval keyword from the text retrieval information through the image-text matching model, and retrieve at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and the image output module 330 is configured to acquire the at least one frame of image output by the image-text matching model after retrieval.
In an alternative embodiment of the present invention, as shown in fig. 4, the apparatus may further include:
a model construction module 340 configured to construct the image-text matching model;
a data collection module 350 configured to collect various types of text information and corresponding image information, and establish an image-text database including text-image data pairs;
a model training module 360 configured to train the image-text matching model based on the text-image data pairs in the image-text database.
In an optional embodiment of the present invention, the model training module 360 may be further configured to:
acquiring any one or more groups of text-image data pairs in the image-text database;
outputting text representation data of the text information and image representation data of the image information in the text-image data pair by using an attention mechanism;
learning the correlation between the text information and the image information according to the text representation data and the image representation data, and obtaining the correlation of the text-image data pair; taking the text-image data pairs with the correlation degrees larger than a first preset threshold value as positive sample data, and taking the text-image data pairs with the correlation degrees smaller than a second preset threshold value as negative sample data;
and training an image-text matching model based on the positive sample data and the negative sample data.
In an optional embodiment of the present invention, the model training module 360 may be further configured to:
after any one or more groups of text-image data pairs in the image-text database are obtained, a detector is used for obtaining the salient region of the image information in the text-image data pairs, and the text information in the text-image data pairs is cleaned.
In an optional embodiment of the present invention, the model training module 360 may be further configured to:
text representation data of the cleaned text information and image representation data of salient regions of the image information are output using an attention mechanism.
In an optional embodiment of the present invention, the model training module 360 may be further configured to:
and interactively aligning the image representation data and the text representation data, learning the correlation between the text representation data and the image representation data, and obtaining the correlation of the text-image data pair.
In an alternative embodiment of the present invention, as shown in fig. 4, the apparatus may further include:
a labeling module 370 configured to compare the relevance of the text-image data pairs with a third preset threshold;
and if the correlation degree of the text-image data pair is greater than a third preset threshold value, generating a label of the image information in the text-image data pair by using the text information in the text-image data pair.
Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium, where computer program codes are stored, and when the computer program codes are run on a computing device, the computing device is caused to execute the image-text retrieval method based on local matching according to any of the above embodiments.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including:
a processor;
a memory storing computer program code;
the computer program code, when executed by a processor, causes a computing device to perform the local matching-based image-text retrieval method according to any one of the embodiments described above.
The embodiment of the invention provides a more efficient image-text retrieval method: based on text retrieval information input by a user, at least one frame of image matched with the text retrieval information can be output through a pre-constructed image-text matching model. In the method provided by the embodiment of the invention, after the text retrieval information from the user is input into the image-text matching model, the retrieval keywords in the text retrieval information are extracted and the images matching those keywords are then retrieved, so that images meeting the user's requirements can be obtained efficiently and accurately. In addition, when training the image-text matching model, marking the salient regions of the image information and cleaning the text information in every text-image data pair of the image-text database effectively improves the quality of the training data, and further improves the efficiency and quality of the images output by the image-text matching model.
It is clear to those skilled in the art that the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that: the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention when the instructions are executed. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), and a magnetic or optical disk.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) associated with program instructions, which may be stored in a computer-readable storage medium, and when the program instructions are executed by a processor of the computing device, the computing device executes all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.
A1. According to an aspect of the embodiments of the present invention, an image-text retrieval method based on local matching is provided, including:
acquiring text retrieval information input by a user;
calling a pre-constructed image-text matching model, inputting the text retrieval information into the image-text matching model, obtaining at least one retrieval keyword from the text retrieval information by the image-text matching model, and retrieving at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and acquiring the at least one frame of image output by the image-text matching model after retrieval.
A2. The method according to A1, wherein before the calling of a pre-constructed image-text matching model and the inputting of the text retrieval information into the image-text matching model, the method further comprises:
constructing the image-text matching model;
collecting various types of text information and corresponding image information, and establishing an image-text database comprising text-image data pairs;
training the image-text matching model based on the text-image data pairs in the image-text database.
A3. The method of A2, wherein the training of the image-text matching model based on the text-image data pairs in the image-text database comprises:
acquiring any one or more groups of text-image data pairs in the image-text database;
outputting text representation data of the text information and image representation data of the image information in the text-image data pair by using an attention mechanism;
learning the correlation between the text information and the image information according to the text representation data and the image representation data, and obtaining the correlation of the text-image data pair; taking the text-image data pairs with the correlation degrees larger than a first preset threshold value as positive sample data, and taking the text-image data pairs with the correlation degrees smaller than a second preset threshold value as negative sample data;
and training the image-text matching model based on the positive sample data and the negative sample data.
A4. The method according to A3, wherein, after acquiring any one or more groups of text-image data pairs in the image-text database, the method further comprises:
and acquiring a salient region of the image information in the text-image data pair through a detector, and cleaning the text information in the text-image data pair.
A5. The method of A4, wherein the outputting of text representation data of the text information and image representation data of the image information in the text-image data pair using an attention mechanism comprises:
text representation data of the cleaned text information and image representation data of salient regions of the image information are output using an attention mechanism.
A6. The method of any one of A3-A5, wherein the learning of the correlation between the text information and the image information based on the text representation data and the image representation data, and the deriving of the correlation degree of the text-image data pair, comprises:
and interactively aligning the image representation data and the text representation data, learning the correlation between the text representation data and the image representation data, and obtaining the correlation of the text-image data pair.
A7. The method according to A6, wherein after the learning of the correlation between the text information and the image information based on the text representation data and the image representation data and the deriving of the correlation degree of the text-image data pair, the method further comprises:
comparing the relevance of the text-image data pair with a third preset threshold;
and if the correlation degree of the text-image data pair is greater than the third preset threshold value, generating a label of the image information in the text-image data pair by using the text information in the text-image data pair.
According to another aspect of the embodiments of the present invention, there is also provided B8. An image-text retrieval apparatus based on local matching, including:
the information acquisition module is configured to acquire text retrieval information input by a user;
the retrieval module is configured to call a pre-constructed image-text matching model, input the text retrieval information into the image-text matching model, obtain at least one retrieval keyword from the text retrieval information through the image-text matching model, and retrieve at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and the image output module is configured to acquire the at least one frame of image output by the image-text matching model after retrieval.
B9. The apparatus of B8, further comprising:
the model construction module is configured to construct the image-text matching model;
the data collection module is configured to collect various types of text information and corresponding image information and establish an image-text database comprising text-image data pairs;
a model training module configured to train the image-text matching model based on the text-image data pairs in the image-text database.
B10. The apparatus of B9, wherein the model training module is further configured to:
acquiring any one or more groups of text-image data pairs in the image-text database;
outputting text representation data of the text information and image representation data of the image information in the text-image data pair by using an attention mechanism;
learning the correlation between the text information and the image information according to the text representation data and the image representation data, and obtaining the correlation of the text-image data pair; taking the text-image data pairs with the correlation degrees larger than a first preset threshold value as positive sample data, and taking the text-image data pairs with the correlation degrees smaller than a second preset threshold value as negative sample data;
and training the image-text matching model based on the positive sample data and the negative sample data.
B11. The apparatus of B10, wherein the model training module is further configured to:
after any one or more groups of text-image data pairs in the image-text database are obtained, a detector is used for obtaining the salient region of the image information in the text-image data pairs, and the text information in the text-image data pairs is cleaned.
B12. The apparatus of B11, wherein the model training module is further configured to:
text representation data of the cleaned text information and image representation data of salient regions of the image information are output using an attention mechanism.
B13. The apparatus of any one of B10-B12, wherein the model training module is further configured to:
and interactively aligning the image representation data and the text representation data, learning the correlation between the text representation data and the image representation data, and obtaining the correlation of the text-image data pair.
B14. The apparatus of B13, further comprising:
a marking module configured to compare the relevance of the text-image data pair with a third preset threshold;
and if the correlation degree of the text-image data pair is greater than the third preset threshold value, generating a label of the image information in the text-image data pair by using the text information in the text-image data pair.
There is also provided, in accordance with another aspect of an embodiment of the present invention, a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to perform the local matching-based image-text retrieval method according to any one of A1-A7.
There is also provided, in accordance with another aspect of an embodiment of the present invention, a computing device, including:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform the local matching-based image-text retrieval method according to any one of A1-A7.

Claims (10)

1. A local matching-based image-text retrieval method comprises the following steps:
acquiring text retrieval information input by a user;
calling a pre-constructed image-text matching model, inputting the text retrieval information into the image-text matching model, obtaining at least one retrieval keyword from the text retrieval information by the image-text matching model, and retrieving at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and acquiring the at least one frame of image output by the image-text matching model after retrieval.
2. The method of claim 1, wherein before the calling of a pre-constructed image-text matching model and the inputting of the text retrieval information into the image-text matching model, the method further comprises:
constructing the image-text matching model;
collecting various types of text information and corresponding image information, and establishing an image-text database comprising text-image data pairs;
training the image-text matching model based on the text-image data pairs in the image-text database.
3. The method of claim 2, wherein the training of the image-text matching model based on the text-image data pairs in the image-text database comprises:
acquiring any one or more groups of text-image data pairs in the image-text database;
outputting text representation data of the text information and image representation data of the image information in the text-image data pair by using an attention mechanism;
learning the correlation between the text information and the image information according to the text representation data and the image representation data, and obtaining the correlation of the text-image data pair; taking the text-image data pairs with the correlation degrees larger than a first preset threshold value as positive sample data, and taking the text-image data pairs with the correlation degrees smaller than a second preset threshold value as negative sample data;
and training the image-text matching model based on the positive sample data and the negative sample data.
4. The method of claim 3, wherein after the acquiring of any one or more groups of text-image data pairs in the image-text database, the method further comprises:
and acquiring a salient region of the image information in the text-image data pair through a detector, and cleaning the text information in the text-image data pair.
5. The method of claim 4, wherein said outputting text characterization data for text information and image characterization data for image information in the text-image data pair using an attention mechanism comprises:
text representation data of the cleaned text information and image representation data of salient regions of the image information are output using an attention mechanism.
6. The method of any of claims 3-5, wherein learning the relevance of the text information and image information based on the text characterization data and image characterization data, and deriving the relevance of the text-image data pairs, comprises:
and interactively aligning the image representation data and the text representation data, learning the correlation between the text representation data and the image representation data, and obtaining the correlation of the text-image data pair.
7. The method of claim 6, wherein after learning the relevance of the text information and the image information based on the text characterization data and the image characterization data and deriving the relevance of the text-image data pair, further comprising:
comparing the relevance of the text-image data pair with a third preset threshold;
and if the correlation degree of the text-image data pair is greater than the third preset threshold value, generating a label of the image information in the text-image data pair by using the text information in the text-image data pair.
8. A local matching-based image-text retrieval device comprises:
the information acquisition module is configured to acquire text retrieval information input by a user;
the retrieval module is configured to call a pre-constructed image-text matching model, input the text retrieval information into the image-text matching model, obtain at least one retrieval keyword from the text retrieval information through the image-text matching model, and retrieve at least one frame of image matched with the text retrieval information based on the retrieval keyword;
and the image output module is configured to acquire the at least one frame of image output by the image-text matching model after retrieval.
9. A computer storage medium having computer program code stored thereon which, when run on a computing device, causes the computing device to perform the local matching-based image-text retrieval method according to any one of claims 1-7.
10. A computing device, comprising:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform the local matching-based image-text retrieval method according to any one of claims 1-7.
CN201910173421.7A (filed 2019-03-07, priority 2019-03-07): Image-text retrieval method and device based on local matching. Status: Pending. Publication: CN111666437A (en).

Priority Applications (1)

Application Number: CN201910173421.7A
Priority Date: 2019-03-07
Filing Date: 2019-03-07
Title: Image-text retrieval method and device based on local matching

Applications Claiming Priority (1)

Application Number: CN201910173421.7A
Priority Date: 2019-03-07
Filing Date: 2019-03-07
Title: Image-text retrieval method and device based on local matching

Publications (1)

Publication Number: CN111666437A (en)
Publication Date: 2020-09-15

Family

ID: 72382142

Family Applications (1)

Application Number: CN201910173421.7A
Title: Image-text retrieval method and device based on local matching
Priority Date: 2019-03-07
Filing Date: 2019-03-07

Country Status (1)

Country: CN
Publication: CN111666437A (en)

Cited By (1)

* Cited by examiner, † Cited by third party

Publication Number: CN114528424A *
Priority Date: 2022-01-12
Publication Date: 2022-05-24
Assignee: 北京百度网讯科技有限公司
Title: Image-based information search method, device, equipment and storage medium


Similar Documents

Publication Title
CN106328147B (en) Speech recognition method and device
CN106649818B (en) Application search intention identification method and device, application search method and server
CN110781276A (en) Text extraction method, device, equipment and storage medium
CN110929038A (en) Entity linking method, device, equipment and storage medium based on knowledge graph
CN110083729B (en) Image searching method and system
CN108334489B (en) Text core word recognition method and device
CN109388743B (en) Language model determining method and device
CN106126619A (en) A kind of video retrieval method based on video content and system
CN107679070B (en) Intelligent reading recommendation method and device and electronic equipment
CN110851641A (en) Cross-modal retrieval method and device and readable storage medium
CN111090763A (en) Automatic picture labeling method and device
CN109492168B (en) Visual tourism interest recommendation information generation method based on tourism photos
CN111783712A (en) Video processing method, device, equipment and medium
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN112861540A (en) Broadcast television news keyword automatic extraction method based on deep learning
CN110659392B (en) Retrieval method and device, and storage medium
CN111538903A (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN111680190B (en) Video thumbnail recommendation method integrating visual semantic information
CN117591672A (en) False news identification method and device based on coarse and fine granularity data enhancement
CN117453859A (en) Agricultural pest and disease damage image-text retrieval method, system and electronic equipment
CN117149955A (en) Method, medium and system for automatically answering insurance clause consultation
CN111666437A (en) Image-text retrieval method and device based on local matching
CN112732908B (en) Test question novelty evaluation method and device, electronic equipment and storage medium
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN112035670B (en) Multi-modal rumor detection method based on image emotional tendency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination