CN112148831B - Image-text mixed retrieval method and device, storage medium and computer equipment - Google Patents

Image-text mixed retrieval method and device, storage medium and computer equipment

Info

Publication number
CN112148831B
CN112148831B (application CN202011343386.8A)
Authority
CN
China
Prior art keywords
candidate set
neural network
image data
network model
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011343386.8A
Other languages
Chinese (zh)
Other versions
CN112148831A (en)
Inventor
熊智
江琳
陈科第
张铨
王耿鑫
庞磊
郭志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huaduo Network Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd filed Critical Guangzhou Huaduo Network Technology Co Ltd
Priority to CN202011343386.8A priority Critical patent/CN112148831B/en
Publication of CN112148831A publication Critical patent/CN112148831A/en
Application granted granted Critical
Publication of CN112148831B publication Critical patent/CN112148831B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention relates to an image-text mixed retrieval method and device, a storage medium and computer equipment. The image-text mixed retrieval method comprises the following steps: acquiring information to be retrieved input by a user, and if the information to be retrieved contains both text data and image data, performing an inverted index query operation after word segmentation of the text data to obtain a first candidate set; inputting the image data into a pre-trained neural network model to obtain a feature vector corresponding to the image data, and performing an inverted index query operation after the feature vector is encoded to obtain a second candidate set; merging the documents in the first candidate set and the second candidate set, counting the number of hits of each merged document in the first candidate set and/or the second candidate set, sorting the merged documents according to the statistical result, and returning the sorted result to a display page. The method improves retrieval accuracy while effectively reducing the maintenance cost of the system.

Description

Image-text mixed retrieval method and device, storage medium and computer equipment
Technical Field
The invention relates to the technical field of information retrieval, in particular to a method and a device for retrieving images and texts in a mixed mode, a storage medium and computer equipment.
Background
The conventional retrieval system mainly comprises two modes of single-mode retrieval and cross-mode retrieval, and query data of the retrieval system only comprises data of one mode regardless of the single-mode retrieval or the cross-mode retrieval. However, in recent years, with the development of mobile internet, the query demand of people has changed, and new search demands have gradually arisen, and mixed-mode search is one of them. Mixed-modality retrieval refers to retrieval of data involving multiple modalities when query data is input to a retrieval system, for example, in the field of education, query data is input, which generally includes a text description of a topic and an associated graphic image, and in this case, the retrieval system needs to return a topic similar to both the text description and the graphic image.
At this stage, research and technology related to mixed-modality retrieval remain relatively scarce. The most intuitive approach is to build two separate retrieval models, one for text and one for images: for a given text-image query pair, the text and the image are each queried in their own single-modality retrieval system, and the intersection of the two result sets is returned as the final result. However, this approach has two problems: first, a separate index system must be maintained for text retrieval and for image retrieval, and when the data volume is large, the storage and computation costs of maintaining two index systems are very high; second, after candidate image-text pairs are obtained from the text index and the image index, it is difficult to find a ranking strategy that can merge the two result sets into a single ranked list and produce the final query result.
Disclosure of Invention
The invention aims to solve at least one of the above technical defects, in particular the defect that, in the prior art, mixed-modality retrieval is performed with multiple separate retrieval models, which not only increases maintenance cost but also reduces retrieval accuracy.
The invention provides a graph-text mixed retrieval method, which comprises the following steps:
acquiring information to be retrieved input by a user, and if the information to be retrieved contains both text data and image data, performing an inverted index query operation after word segmentation of the text data to obtain a first candidate set;
inputting the image data into a pre-trained neural network model to obtain a feature vector corresponding to the image data, and performing an inverted index query operation after the feature vector is encoded to obtain a second candidate set;
merging the documents in the first candidate set and the second candidate set, counting the number of hits of each merged document in the first candidate set and/or the second candidate set, sorting the merged documents according to the statistical result, and returning the sorted result to a display page.
Optionally, the step of performing an inverted index query operation after segmenting the text data to obtain a first candidate set includes:
segmenting the text data to obtain a plurality of terms, constructing an inverted index according to the terms, and performing query operation in the inverted index to obtain a first candidate set corresponding to the text data;
the first candidate set comprises a plurality of documents obtained by querying the terms in the inverted index.
Optionally, the step of performing an inverted index query operation on the encoded feature vector to obtain a second candidate set includes:
carrying out rounding operation on the feature vectors to obtain integer feature vectors, mapping the integer feature vectors into text vectors, and then carrying out block coding to obtain a plurality of feature codes corresponding to the image data;
constructing an inverted index according to the feature codes, and performing query operation in the inverted index to obtain a second candidate set corresponding to the image data;
and the second candidate set comprises a plurality of documents obtained by querying the feature codes in the inverted index.
Optionally, the step of counting the number of hits of the merged document in the first candidate set and/or the second candidate set includes:
if the merged document is the document in the first candidate set, counting the hit times of the document in the first candidate set;
if the merged document is the document in the second candidate set, counting the hit times of the document in the second candidate set;
and if the combined document is the document in the first candidate set and the second candidate set, counting the weighted hit times of the document in the first candidate set and the second candidate set.
Optionally, if the merged document is a document in the first candidate set and the second candidate set, the step of counting the number of weighted hits of the document in the first candidate set and the second candidate set includes:
if the merged document is a document in the first candidate set and the second candidate set, counting the first hit times of the document in the first candidate set and the second hit times of the document in the second candidate set;
and acquiring a weighting factor, and performing weighted fusion on the first hit times and the second hit times by using the weighting factor to obtain weighted hit times.
Optionally, the training method of the neural network model includes:
acquiring sample data in a database, and dividing the sample data into a training set and a test set, wherein the sample data comprises image data and corresponding label types;
training a neural network model by using the image data in the training set to obtain a feature vector corresponding to the image data, and adjusting parameters of the neural network model according to the feature vector and the label category until the adjusted neural network model converges;
and verifying the classification effect of the converged neural network model by using the sample data in the test set, and adjusting the hyper-parameters of the neural network model until the neural network model with the optimal classification effect is obtained.
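The sample split and hyper-parameter search described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `evaluate` function stands in for the full train-then-score cycle, and the candidate hyper-parameter values are hypothetical.

```python
import random

def split_samples(samples, test_ratio=0.2, seed=0):
    # samples: list of (image, label) pairs drawn from the database.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def tune_hyperparameter(train, test, candidates, evaluate):
    # Keep the hyper-parameter value whose trained model scores best on
    # the test set; `evaluate` abstracts the train-and-verify step.
    return max(candidates, key=lambda value: evaluate(train, test, value))

samples = [(f"img{i}", i % 3) for i in range(10)]
train_set, test_set = split_samples(samples)
best = tune_hyperparameter(train_set, test_set, [0.1, 0.5, 1.0],
                           evaluate=lambda tr, te, v: -abs(v - 0.5))
```

In practice `evaluate` would train the convolutional network on `train_set` with the candidate value and return its classification accuracy on `test_set`.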
Optionally, the neural network model comprises a convolutional layer, a pooling layer, a fully-connected layer, an integer feature layer, and a loss function;
training a neural network model by using the image data in the training set to obtain a feature vector corresponding to the image data, wherein the step comprises the following steps:
and sequentially inputting the image data in the training set into the neural network model, and performing feature extraction on the image data through the convolutional layer, the pooling layer, the full-link layer and the integer feature layer to obtain a feature vector corresponding to the image data.
Optionally, the loss function comprises a classification loss function and a rounding loss function;
adjusting parameters of the neural network model according to the feature vectors and the label categories until the adjusted neural network model converges, comprising:
inputting the feature vector corresponding to the image data and the label category into the classification loss function, calculating a difference value between the feature vector and the label category by using the classification loss function, and adjusting the parameters of the neural network model according to the difference value until the feature vector corresponding to the label category is obtained by training the neural network model;
inputting the feature vector corresponding to the image data into the rounding loss function, calculating a difference value between the feature vector and the rounded feature vector by using the rounding loss function, and adjusting the parameters of the neural network model according to the difference value until the neural network model is trained to output feature vectors within a preset difference interval.
Optionally, the hyper-parameter of the neural network model is a corresponding weighting factor when the classification loss function and the rounding loss function are weighted and fused.
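The two loss terms and their weighted fusion can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the patent only says each loss computes a "difference value", so cross-entropy and mean squared distance-to-nearest-integer are assumed formulations, and the weight value is illustrative.

```python
import math

def classification_loss(probs, label):
    # Assumed formulation: cross-entropy against the ground-truth label.
    return -math.log(probs[label])

def rounding_loss(features):
    # Mean squared distance between each feature value and its nearest
    # integer, pushing the network toward integer-valued outputs.
    return sum((f - round(f)) ** 2 for f in features) / len(features)

def total_loss(probs, label, features, weight=0.1):
    # `weight` is the hyper-parameter the patent tunes on the test set.
    return classification_loss(probs, label) + weight * rounding_loss(features)

loss = total_loss([0.1, 0.7, 0.2], 1, [-1.1, 3.05, 2.9, 4.0], weight=0.5)
```

As the rounding term shrinks, the network's output approaches an integer feature vector while the classification term preserves category information.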
The invention also provides a device for retrieving mixed pictures and texts, which comprises:
the first index module is used for acquiring information to be retrieved input by a user, and if the information to be retrieved simultaneously contains text data and image data, performing inverted index query operation after word segmentation on the text data to obtain a first candidate set;
the second indexing module is used for inputting the image data into a pre-trained neural network model to obtain a feature vector corresponding to the image data, and performing an inverted index query operation after encoding the feature vector to obtain a second candidate set;
and the comprehensive ranking module is used for merging the documents in the first candidate set and the second candidate set, counting the hit times of the merged documents in the first candidate set and/or the second candidate set, ranking the merged documents according to the statistical result, and returning the ranking result to the display page.
The invention also provides a storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to carry out the steps of the image-text mixed retrieval method according to any one of the preceding embodiments.
The invention also provides a computer device having computer-readable instructions stored thereon which, when executed by one or more processors, cause the one or more processors to perform the steps of the image-text mixed retrieval method according to any one of the preceding embodiments.
According to the technical scheme, the embodiment of the invention has the following advantages:
the invention relates to a method and a device for image-text mixed retrieval, a storage medium and computer equipment, wherein the image-text mixed retrieval method comprises the following steps: acquiring information to be retrieved input by a user, and if the information to be retrieved contains text data and image data at the same time, performing reverse index query operation after word segmentation on the text data to obtain a first candidate set; inputting the image data into a pre-trained neural network model to obtain a feature vector corresponding to the image data, and performing reverse index query operation after the feature vector is coded to obtain a second candidate set; merging the documents in the first candidate set and the second candidate set, counting the hit times of the merged documents in the first candidate set and/or the second candidate set, sorting the merged documents according to the counting result, and returning the sorting result to a display page; according to the image-text mixed retrieval method based on deep learning, text data and image data are considered in the retrieval process, so that the query accuracy including the image data and the text data is improved; moreover, the invention is compatible with text-to-text query and image-to-image query, so that the whole image-text mixed retrieval system can realize mixed retrieval of the images and the texts only by maintaining one set of inverted index results, thereby effectively reducing the maintenance cost of the system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic diagram of a first application environment according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a mixed retrieval method for images and texts according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an inverted index structure according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a second application environment according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a model training method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a mixed image-text retrieval device according to an embodiment of the present invention;
fig. 8 is a schematic internal structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Before further detailed description of the embodiments of the present invention, terms and terminology used in the embodiments of the present invention are described, and the related technical background of the present invention is briefly described as follows:
the traditional retrieval system mainly comprises two modes of single-mode retrieval and cross-mode retrieval, wherein the single-mode retrieval refers to retrieval in a mode; for example, for a text modality, a query word is input, and the retrieval system returns a webpage containing query text; for the picture modality, given a query image, the retrieval system returns an image that is similar to the query image. Cross-modality retrieval refers to retrieval between modalities; for example, given a piece of query text, the retrieval system returns the image described by the piece of text. Whether single modality or cross-modality retrieval, the data in its query contains data of only one modality.
However, in recent years, with the development of mobile internet, the query demand of people has changed, and new search demands have gradually arisen, and mixed-mode search is one of them. Mixed-modality retrieval refers to retrieval of data involving multiple modalities when query data is input to a retrieval system, for example, in the field of education, query data is input, which generally includes a text description of a topic and an associated graphic image, and in this case, the retrieval system needs to return a topic similar to both the text description and the graphic image.
At this stage, research and technology related to mixed-modality retrieval remain relatively scarce. The most intuitive approach is to build two separate retrieval models, one for text and one for images: for a given text-image query pair, the text and the image are each queried in their own single-modality retrieval system, and the intersection of the two result sets is returned as the final result. However, this approach has two problems: first, a separate index system must be maintained for text retrieval and for image retrieval, and when the data volume is large, the storage and computation costs of maintaining two index systems are very high; second, after candidate image-text pairs are obtained from the text index and the image index, it is difficult to find a ranking strategy that can merge the two result sets into a single ranked list and produce the final query result.
Therefore, the invention provides the following solutions to solve the technical defects that in the prior art, when mixed modal retrieval is performed, a plurality of sets of retrieval models are used for performing retrieval respectively, so that not only is the maintenance cost increased, but also the retrieval accuracy is reduced.
Referring to fig. 1, fig. 1 is a schematic diagram of a first application environment provided by an embodiment of the present invention. The technical solution of the present application can be implemented on the basis of the server 110; as shown in fig. 1, the server 110 responds to a query request from the client 120 and obtains the information to be retrieved input by the user to implement the relevant functions. In the embodiment of the application, a pre-trained neural network model is deployed in the server 110. After the server 110 obtains the information to be retrieved input by the user in the client 120, the text data in the information to be retrieved is word-segmented and then subjected to an inverted index query operation to obtain a first candidate set; the image data in the information to be retrieved is then input into the neural network model to obtain a feature vector corresponding to the image data, and an inverted index query operation is performed after the feature vector is encoded to obtain a second candidate set; finally, the documents in the first candidate set and the second candidate set are merged and sorted, and the sorted result is returned to a display page in the client 120, thereby realizing the image-text mixed retrieval function.
It should be noted that, the client 120 includes a smart phone, a tablet computer, an e-reader, a desktop computer or a notebook computer, etc., which is not limited herein; the client 120 in this embodiment is a product with a search function, and the product may be an educational direction matrix product for searching topic information, such as Quesgo.
The image-text mixed retrieval method in the present application will be described in detail through a plurality of embodiments, as follows.
In an embodiment, as shown in fig. 2, fig. 2 is a schematic flow chart of a method for retrieving a mixed image and text according to an embodiment of the present invention; the invention also provides a graph-text mixed retrieval method, which specifically comprises the following steps:
s110: the method comprises the steps of obtaining information to be retrieved input by a user, and if the information to be retrieved contains text data and image data at the same time, performing inverted index query operation after word segmentation on the text data to obtain a first candidate set.
In this step, before the client 120 requests the query service from the server 110 with the information to be retrieved input by the user, the query service needs to be set up in the back-end service of the client 120. Once built, the query service can receive the user's query request and forward it to the server 110, so that the server 110 can obtain the information to be retrieved according to the query request.
It can be understood that the information to be retrieved includes, but is not limited to, a text pair, an image pair, and a text image pair, and therefore, after the server 110 acquires the information to be retrieved corresponding to the query request, it is necessary to determine which information is included in the currently received information to be retrieved, and to adopt different processing manners for different information.
For example, when the received information to be retrieved contains both text data and image data, i.e., a text image pair, the text image pair can be retrieved using a set of retrieval models in the current hybrid retrieval system.
Specifically, when the search model searches text data, the text data may be segmented, then query operation is performed on the segmented text data in an inverted index manner, a plurality of documents corresponding to the segmented text data are obtained, and the plurality of documents are collected to obtain a first candidate set.
Before searching text data and/or image data, the hybrid search system needs to construct a set of image-text hybrid inverted index structure in advance, where the inverted index structure includes two parts, namely, a term and an inverted record table, where the term part can be obtained from data in a web library collected in advance, and the inverted record table records a plurality of documents corresponding to each term obtained by querying each term in the database through the inverted index structure.
In addition, the terms and the inverted record table in the inverted index structure can be updated in real time according to the collected data in the webpage library so as to improve the retrieval capability of the webpage library.
The following description uses a set of examples. Schematically, as shown in fig. 3, fig. 3 is a schematic diagram of an inverted index structure provided by an embodiment of the present invention. In fig. 3, the terms obtained after word segmentation of the text data include "apple tree", "blossom", "present", "upper", "here" and "pear tree". After an inverted index query operation is performed with the constructed inverted index structure, the resulting inverted record table contains document 1, document 2 and document 3 corresponding to these terms, where document 1 states that the apple tree has blossomed, document 2 states that the apple tree bears an apple, and document 3 states that there are an apple tree and a pear tree; the first candidate set is obtained by combining document 1, document 2 and document 3.
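The inverted index query of the fig. 3 example can be sketched in a few lines. The document contents are hypothetical stand-ins corresponding to the figure; a production index would of course store postings on disk rather than in a dict.

```python
from collections import defaultdict

# Toy corpus mirroring fig. 3 (documents given as lists of segmented terms).
docs = {
    1: ["apple tree", "blossom"],
    2: ["apple tree", "apple"],
    3: ["apple tree", "pear tree"],
}

def build_inverted_index(docs):
    index = defaultdict(set)   # term -> posting list (set of doc ids)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def query(index, terms):
    # The union of the posting lists of every query term forms the candidate set.
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits

index = build_inverted_index(docs)
first_candidate_set = query(index, ["apple tree", "pear tree"])  # {1, 2, 3}
```

The same structure serves the image side in S120: feature codes derived from images are simply inserted as additional terms, so one index covers both modalities.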
S120: inputting the image data into a neural network model to obtain a feature vector corresponding to the image data, and performing an inverted index query operation after the feature vector is encoded to obtain a second candidate set; the neural network model is obtained by training based on the model training method in any one of the above embodiments.
In this step, when it is determined in step S110 that the information to be retrieved contains both text data and image data, the text data and the image data can each be processed accordingly. For example, while the text data is word-segmented and the first candidate set is obtained through the inverted index query operation, the image data can be input into the neural network model, so that the feature vector of the image data is extracted by the neural network model; the feature vector is then processed accordingly and an inverted index query operation is performed to obtain the second candidate set.
Specifically, since the indexing method used in the present application is an inverted index, the retrieval information needs to be divided into a plurality of keywords before the inverted index is queried; after each keyword is matched to a term in the pre-constructed inverted index structure, the inverted index structure is used to search the database for documents related to that keyword. In addition, because image data and text data are of different types, image data cannot be directly word-segmented and then subjected to the inverted index query operation; instead, the feature vector of the image data is extracted by the neural network model, and a rounding operation, encoding processing and the like are performed on the feature vector, achieving an effect analogous to the word segmentation applied to text data.
After the image data is input into the neural network model, the feature information of the image data can be extracted through the multilayer network of the neural network model to obtain the corresponding feature vector. In particular, the last layer of the neural network model outputs an integer feature layer, and the parameters of the neural network model are adjusted by the classification loss function and the rounding loss function during training, so that after training the neural network model can output near-integer feature vectors that retain the image category information.
However, even though the feature vector of the image data obtained by the trained neural network model approaches an integer feature vector, in order to facilitate the inverted index query operation on the image data, a rounding operation may still be applied to the feature vector output by the neural network model before indexing, so that it becomes an integer feature vector in the strict sense; this integer feature vector retains its semantic information to a certain extent.
Subsequently, the resulting integer feature vector is character-encoded. For example, suppose the vector obtained after the rounding operation on the image data is [-1, 3, 3, 4]. Owing to the setting of the loss function, the value of each dimension of the feature vector ranges from -α to α (1 ≤ α ≤ 17); during encoding, the values are converted in turn into the letters corresponding to them, thereby realizing the character encoding operation.
Taking α = 17 as an example, the character "a" corresponds to the value -17, the character "b" corresponds to the value -16, and so on; the integer feature vector is thus mapped to a text vector, block encoding is performed according to a step length k, the mark "#" is added before each block code, and the inverted index corresponding to the image data is then constructed using the existing inverted index structure.
It should be noted that α is a hyper-parameter, and a specific value can be set according to specific situations; and, the maximum value of α here is 17, because when α is equal to 17, the value of each dimension of the feature vector ranges from-17 to 17, there are 25 possible values, and exactly 26 english letters can be used for one-to-one correspondence.
The step length k is selected manually and may be set to any length; the "#" marker distinguishes a term that is a feature code from an image from a term that comes from the text modality, and the marker may be replaced with a symbol of another form.
Then, according to the constructed inverted index, an inverted-index query operation is performed on the multiple terms obtained by encoding the image data, yielding the multiple documents corresponding to the feature-encoded image data; these documents are collected to obtain the second candidate set.
S130: merging the documents in the first candidate set and the second candidate set, counting the hit times of the merged documents in the first candidate set and/or the second candidate set, sorting the merged documents according to the counting result, and returning the sorting result to a display page.
In this step, after the first candidate set and the second candidate set are obtained through step S110 and step S120, the documents in the two candidate sets may be merged to be sorted uniformly.
Specifically, before the ranking, the number of hits of each merged document in the first candidate set and/or the second candidate set needs to be counted. For example, if a merged document appears only in the first candidate set, its hit count in the first candidate set is counted directly; if it appears only in the second candidate set, its hit count in the second candidate set is counted; and if it appears in both the first and second candidate sets, its weighted hit count over the two candidate sets is counted.
It is understood that the number of hits herein refers to the number of times the same document appears in the first candidate set or the second candidate set; the weighted hit number here refers to a weighted sum of the number of occurrences of the same document in the first candidate set and the second candidate set, respectively.
It should be noted that the weighting factor in the weighted hit count is a hyper-parameter set manually, and it can be adjusted according to the quality of the retrieval effect: the larger its value, the more importance the system attaches to image matching; the smaller its value, the more importance the system attaches to text matching. Moreover, the setting of the weighting factor can be adjusted manually for the specific application scenario.
As shown in fig. 2, after counting the hit times of the merged document in the first candidate set and/or the second candidate set, the documents may be sorted according to the statistical result, and the sorted result is returned to the presentation page.
It can be understood that after the user inputs the information to be retrieved, in order to obtain the document information matched with the information to be retrieved, the documents in the first candidate set and the second candidate set need to be merged and then sorted, so that a more effective retrieval result is provided for the user on the display page, and the user experience is improved.
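As a rough illustration of the merging and ranking step above, the following Python sketch counts hits and applies a weighted fusion for documents present in both candidate sets. The function name `rank_merged`, the fused score `n1 + weight * n2`, and the list-of-document-ids representation are illustrative assumptions, not the patent's actual implementation.

```python
from collections import Counter

def rank_merged(first_candidates, second_candidates, weight=1.0):
    # Hit counts: how often each document appears in each candidate set.
    n1 = Counter(first_candidates)
    n2 = Counter(second_candidates)
    scores = {}
    for doc in set(n1) | set(n2):
        if doc in n1 and doc in n2:
            # Document hit by both modalities: weighted fusion (assumed form).
            scores[doc] = n1[doc] + weight * n2[doc]
        else:
            # Document hit by only one modality: plain hit count.
            scores[doc] = n1.get(doc, 0) + n2.get(doc, 0)
    # Higher score first; this ordering is what the presentation page would show.
    return sorted(scores, key=scores.get, reverse=True)
```

With `weight` set larger, image hits dominate the ranking; set smaller, text hits dominate, matching the role of the weighting factor described above.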
In the deep-learning-based image-text mixed retrieval method provided by this embodiment, both text data and image data are considered in the retrieval process, which improves the accuracy of queries containing both image data and text data. Moreover, the invention is also compatible with text-to-text and image-to-image queries, so the whole image-text mixed retrieval system can realize mixed retrieval of images and text while maintaining only one set of inverted indexes, effectively reducing the maintenance cost of the system.
In an embodiment, the step of performing an inverted index query operation after segmenting the text data in step S110 to obtain a first candidate set may include:
s111: segmenting the text data to obtain a plurality of terms, constructing an inverted index according to the terms, and performing query operation in the inverted index to obtain a first candidate set corresponding to the text data;
s112: the first candidate set comprises a plurality of documents obtained by querying the terms in the inverted index.
In this embodiment, as shown in fig. 3, the inverted index structure may be divided into two parts, a term part and an inverted record table, where the terms of the text data are obtained by word segmentation. For example, for the text "three corners of an equilateral triangle are equal as shown", a term list can be obtained after word segmentation; an inverted index is constructed from this vocabulary, and a query operation is performed in the database through the constructed inverted index to obtain the corresponding first inverted record table.
The first inverted record table records the documents obtained after the inverted-index query of each term in the database; collecting these documents yields the first candidate set for the text data.
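The term-to-inverted-record-table structure described above can be sketched in a few lines of Python. Whitespace splitting stands in for a real word segmenter, and all names here are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; the index maps each term to the set of doc ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # stand-in for real word segmentation
            index[term].add(doc_id)
    return index

def query(index, terms):
    # Returns one entry per (term, document) hit; duplicates encode the hit count.
    candidates = []
    for term in terms:
        candidates.extend(sorted(index.get(term, ())))
    return candidates
```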
In an embodiment, the step of performing an inverted index query operation on the feature vector after the encoding processing in step S120 to obtain the second candidate set may include:
s121: carrying out rounding operation on the feature vectors to obtain integer feature vectors, mapping the integer feature vectors into text vectors, and then carrying out block coding to obtain a plurality of feature codes corresponding to the image data;
s122: constructing an inverted index according to the feature codes, and performing query operation in the inverted index to obtain a second candidate set corresponding to the image data;
s123: and the second candidate set comprises a plurality of documents obtained by querying the feature codes in the inverted index.
In this embodiment, after the image data is input to the neural network model, the feature information of the image data may be extracted through the model's multilayer network to obtain the corresponding feature vector. In particular, the last layer of the neural network model is an integer feature layer, and the parameters of the model are adjusted by a classification loss function and a rounding loss function during early training, so that after training the model can produce integer feature vectors that retain the image category information.
However, in the present application, even though the feature vector of the image data obtained by the neural network model only approaches an integer vector after model training, the feature vector output by the model may be further rounded before the inverted index corresponding to the image data is constructed with the pre-constructed inverted index structure, in order to facilitate the inverted-index query operation on the image data; the feature vector thus becomes an integer feature vector in the true sense while retaining its semantic information to a certain extent.
In particular, the rounding operation may be performed with an integer-taking function round(·). For example, suppose the feature vector obtained by extracting features from the image data x is [1.10, -2.01, 3.07, 4.09, 2.09, 9.07]; with the hyper-parameter α = 10, the round(·) operation yields the integer feature vector [1, -2, 3, 4, 2, 9].
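The element-wise rounding can be sketched as follows; Python's built-in `round` plays the role of the integer-taking function here, and the function name is illustrative.

```python
def round_features(vec):
    # Element-wise integer-taking on a real-valued feature vector.
    return [int(round(v)) for v in vec]
```

Applied to the example vector [1.10, -2.01, 3.07, 4.09, 2.09, 9.07], this reproduces the integer feature vector [1, -2, 3, 4, 2, 9].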
After the integer feature vector is obtained, it can be converted into a text vector. Continuing the example above, since the hyper-parameter α = 10, the mapping from numbers to characters is: -10 corresponds to character "a", -9 to character "b", -8 to character "c", and so on, and the resulting character vector is [l, i, n, o, m, t].
After the character vector is obtained, it may be block-coded. For example, a block operation with step size 2 yields the blocks ["li", "no", "mt"]; after the marker "#" is added, the resulting block codes are "#li", "#no", and "#mt".
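The number-to-letter mapping and block coding just described can be sketched as follows, assuming the mapping value → chr(ord('a') + value + α) implied by the α = 10 example; the function name is illustrative.

```python
def encode_feature(int_vec, alpha=10, k=2):
    # Map each integer in [-alpha, alpha] to a letter: -alpha -> 'a', -alpha+1 -> 'b', ...
    chars = ''.join(chr(ord('a') + v + alpha) for v in int_vec)
    # Block coding with step length k, each block prefixed with the '#' marker.
    return ['#' + chars[i:i + k] for i in range(0, len(chars), k)]
```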
And then, constructing an inverted index through the block codes, performing query operation in the database through the constructed inverted index, finding a second inverted record table corresponding to each block code, and generating a second candidate set according to a plurality of documents in the second inverted record table.
In one embodiment, the step of counting the number of hits of the merged document in the first candidate set and/or the second candidate set in step S130 may include:
s131: if the merged document is the document in the first candidate set, counting the hit times of the document in the first candidate set;
s132: if the merged document is the document in the second candidate set, counting the hit times of the document in the second candidate set;
s133: and if the combined document is the document in the first candidate set and the second candidate set, counting the weighted hit times of the document in the first candidate set and the second candidate set.
In this embodiment, before the ranking, the number of hits of the merged document in the first candidate set and/or the second candidate set needs to be counted.
For example, if the merged document is a document in the first candidate set, directly counting the hit times of the document in the first candidate set; if the merged document is the document in the second candidate set, counting the hit times of the document in the second candidate set; and if the combined document is the document in the first candidate set and the second candidate set, counting the weighted hit times of the document in the first candidate set and the second candidate set.
Therefore, the documents can be ranked according to the hit times of all the documents, the documents with high hit times are ranked at the front, and the documents with low hit times are ranked at the back, so that a user can visually see the retrieval result with high relevance.
In one embodiment, if the merged document is a document in the first candidate set and the second candidate set in step S133, the step of counting the number of weighted hits of the document in the first candidate set and the second candidate set includes:
s331: if the merged document is a document in the first candidate set and the second candidate set, counting the first hit times of the document in the first candidate set and the second hit times of the document in the second candidate set;
s332: and acquiring a weighting factor, and performing weighted fusion on the first hit times and the second hit times by using the weighting factor to obtain weighted hit times.
In this embodiment, if the merged document is a document in the first candidate set and the second candidate set, the first hit frequency of the document in the first candidate set and the second hit frequency in the second candidate set are respectively counted, and then the first hit frequency and the second hit frequency are weighted and fused by using the weighting factor to obtain the weighted hit frequency.
Specifically, if the merged document is a document in both the first candidate set C1 and the second candidate set C2, the first hit count n1 of the document in C1 and the second hit count n2 of the document in C2 are counted; then the currently set weighting factor λ is obtained, and the weighted hit count of a document lying in the intersection of C1 and C2 is computed, for example as n = n1 + λ·n2, giving the final weighted hit count for the document.
It will be appreciated that the weighting factor λ in the weighted hit count is a hyper-parameter, set manually and adjustable according to the quality of the retrieval effect: the larger the value of λ, the more importance the system attaches to image matching; the smaller the value of λ, the more importance the system attaches to text matching. Moreover, the setting of the weighting factor can be adjusted manually for the specific application scenario.
A specific training process of the neural network model will be described with reference to a specific embodiment. Before that, referring schematically to fig. 4, fig. 4 is a schematic diagram of a second application environment provided by an embodiment of the present invention. The technical scheme of the application can be implemented on the basis of the server 110: as shown in fig. 1, the server 110 acquires sample data in the database 111 and divides it into a training set and a test set, so that the sample data in the training set and the test set can be input into the model training unit 112 to implement the related functions. In this embodiment of the application, a neural network model to be trained is deployed in the server 110 through the model training unit 112. When the neural network model needs to be trained, the server 110 obtains the sample data in the database 111, divides it into a training set and a test set, trains the neural network model in the model training unit 112 with the image data in the training set, and adjusts the parameters of the neural network model until the adjusted model converges. The classification effect of the converged neural network model is then verified with the sample data in the test set, and the hyper-parameters of the model are adjusted until the neural network model with the best classification effect is obtained, so that feature extraction and classification prediction of image data can be realized with the trained model.
The server 110 here refers to a server capable of data processing and model training, which may be implemented as a stand-alone server or a server cluster composed of multiple servers; the database 111 refers to software capable of storing massive data information, which may be topic information in a topic library or stored information in other forms.
In an embodiment, as shown in fig. 5, fig. 5 is a schematic flow chart of a model training method provided in an embodiment of the present invention, and an embodiment of the present invention provides a training method of a neural network model, which specifically includes the following steps:
s210: the method comprises the steps of obtaining sample data in a database, and dividing the sample data into a training set and a testing set, wherein the sample data comprises image data and corresponding label types.
The mixed-mode retrieval in the invention relies mainly on a single retrieval model, which can retrieve with a text-image pair simultaneously, or with text or images individually; therefore, the related models used in the retrieval model need to be trained in advance of retrieval so that they are available at retrieval time.
In this step, when training the model, the training may be performed based on the server 110, as shown in fig. 4, the server 110 may obtain sample data stored in the database 111, and train the relevant model deployed in the model training unit 112 by using the sample data; database 111 may be deployed in server 110, so that server 110 collects and processes data.
Specifically, before the correlation model is trained by using the sample data, for example, before the convolutional neural network is trained, the sample data obtained from the database 111 may be preprocessed in advance and then input into the convolutional neural network for training.
The preprocessing process may include dividing the sample data into a training set and a test set in a certain ratio, or dividing it into a training set, a validation set and a test set in a certain ratio. The training set provides the data samples for model fitting; the validation set is a sample set held out during model training, which can be used to adjust the model's hyper-parameters and make a preliminary evaluation of the model's capability; the test set is used to evaluate the generalization ability of the final model. The ratios of the training, validation and test sets can be set manually, but the training set must contain enough samples that feature data with good effect can be obtained through model training and learning.
For the present application, the sample data acquired from the database 111 may be divided into a training set and a test set, where the test set serves both the validation-set role of adjusting the model's hyper-parameters and the role of evaluating the generalization ability of the final model.
In addition, the sample data includes image data and a corresponding label type, so that when model training is performed using the sample data, the training model can learn and store feature information corresponding to the label type of the image data.
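A minimal sketch of the train/test split described in this step; the 80/20 ratio and the random shuffling are illustrative choices, not values fixed by the patent.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    # samples: list of (image_data, label) pairs; returns (training set, test set).
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```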
S220: and training a neural network model by using the image data in the training set to obtain a feature vector corresponding to the image data, and adjusting parameters of the neural network model according to the feature vector and the label category until the adjusted neural network model converges.
In this step, after the sample data obtained in the database 111 is divided into the training set and the test set in step S210, the sample data in the training set may be input into the corresponding model for training, and if the trained model is a convolutional neural network, the sample data may be input into the convolutional neural network, so as to train each layer of network in the convolutional neural network through the sample data.
For example, Convolutional Neural Networks (CNN) are a class of feed-forward neural networks that include convolutional computation and have deep structure, and are among the representative algorithms of deep learning. A typical convolutional neural network consists of three parts: convolutional layers, pooling layers, and fully-connected layers. The convolutional layer is responsible for extracting local features in the image; the pooling layer drastically reduces the parameter magnitude (dimensionality reduction); the fully-connected layer resembles part of a conventional neural network and is used to output the desired results.
However, a typical CNN is not limited to the three-layer structure above but has a multilayer structure, as shown in fig. 6; fig. 6 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present invention. The multilayer structure of the convolutional neural network in fig. 6 is: convolutional layer - pooling layer - convolutional layer - pooling layer - … - fully-connected layer - integer feature layer, where the integer feature layer is the last fully-connected layer and is also an activation layer, whose activation function is as follows:
f(z) = α · tanh(z)
where z is the output of the fully-connected layer, tanh(·) is the hyperbolic tangent function, and α is a hyper-parameter greater than zero used to control the range of the output values. Because the tanh function outputs values in the range (-1, 1), multiplying by the hyper-parameter α makes the range of the output (-α, α).
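A sketch of the integer feature layer's activation, assuming the α·tanh form described above, which keeps every output dimension within the range (-α, α).

```python
import math

def integer_feature_activation(z, alpha=10.0):
    # alpha * tanh(z): squashes the fully-connected output into the range (-alpha, alpha).
    return [alpha * math.tanh(v) for v in z]
```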
After the image data is subjected to feature extraction by utilizing each layer of network of the convolutional neural network to obtain the corresponding feature vector, the parameters of the neural network model can be adjusted according to the feature vector of the image data and the corresponding label category until the adjusted neural network model converges.
For example, when the output of the training sample is calculated using a convolutional neural network, a loss function may be used to measure the loss between the calculated output of the training sample and the label class corresponding to the real training sample.
In particular, parameters in the convolutional neural network may be updated by a back-propagation algorithm until the model converges. The back-propagation algorithm iteratively optimizes the loss function by gradient descent to obtain a minimum value, finding suitable linear coefficient matrices and bias vectors for the hidden layers and the fully-connected layer, so that the outputs computed from all training sample inputs are equal to, or as close as possible to, the label categories of the sample data.
In addition, the convolution neural network performs feature recognition on the image data and outputs a real-valued feature vector, but in the present application, it is necessary to use an integer feature vector in order to perform a better hybrid search. Therefore, when the loss function is used to measure the loss between the output calculated by the training sample and the label class corresponding to the real training sample, two sets of loss functions can be used, for example, one set is the classification loss function, and the other set is the rounding loss function, so that after the model is trained, the image features can be extracted for the image and mapped into the integer feature vector capable of retaining the image class information.
For example, in the training process, image data is input into the convolutional neural network, the training mode of the whole network is to perform classification tasks on images, and in the calculation of a loss function, rounding loss and classification loss are calculated at the same time, so that in the model training process, parameters of a convolutional neural network model can be adjusted through a back propagation algorithm until the adjusted convolutional neural network can retain integer feature vectors of image category information.
Of course, the integer feature vector obtained here is only a feature vector that approaches integer values, not a strictly integer feature vector. In addition, the structure of the neural network model is not limited to a network composed of convolutional, pooling and fully-connected layers; any neural network model capable of generating feature vectors for images can be used in the invention to generate feature vectors for images.
S230: and verifying the classification effect of the converged neural network model by using the sample data in the test set, and adjusting the hyper-parameters of the neural network model until the neural network model with the optimal classification effect is obtained.
In this step, when the neural network model is trained by using the image data in the training set in step S220 to obtain the feature vector corresponding to the image data, then the parameters of the neural network model are adjusted according to the feature vector and the label category of the image data until the adjusted neural network model converges, the classification effect of the converged neural network model can be verified by using the sample data in the test set, and the hyper-parameter of the neural network model is adjusted until the neural network model with the optimal classification effect is obtained.
For example, when the loss function is used to measure the loss between the output calculated by the training sample and the label class corresponding to the real training sample, two sets of loss functions may be used, such as one set being a classification loss function and the other set being a rounding loss function, so that after the model is trained, the image features may be extracted for the image and mapped to the integer feature vectors capable of retaining the image class information.
However, some hyper-parameters exist in the loss function; they are set manually, and their values influence the effect of network training. Therefore, to optimize the classification effect of the neural network model, when a group of hyper-parameters is selected manually, the neural network model is trained with them; after the network iteration finishes, the quality of the manually selected hyper-parameters is evaluated with the sample data in the test set, the hyper-parameters are then adjusted and the network trained again, until, after some round of training, the effect on the test set is the best. The hyper-parameters and network parameters at that point are the ones finally needed in the neural network model.
In the above embodiment, the neural network model is trained with existing image data in the database. Before training, the image data is divided into a training set and a test set, so that through the training-set data the neural network model learns feature vectors capable of storing semantic information; class prediction is then performed with the test-set sample data, the current classification accuracy of the network is determined, and the hyper-parameters in the neural network model are adjusted continuously; the optimal neural network model and the relevant hyper-parameters are saved for subsequent use.
In one embodiment, the neural network model may include a convolutional layer, a pooling layer, a fully-connected layer, an integer feature layer, and a loss function; in step S220, the step of training a neural network model by using the image data in the training set to obtain a feature vector corresponding to the image data includes:
and sequentially inputting the image data in the training set into the neural network model, and performing feature extraction on the image data through the convolutional layer, the pooling layer, the full-link layer and the integer feature layer to obtain a feature vector corresponding to the image data.
In this embodiment, as shown in fig. 6, the multilayer structure in the convolutional neural network in fig. 6 is: convolutional layer-pooling layer-convolutional layer-pooling layer … … fully-connected layer-integer feature layer, where integer feature layer is the last fully-connected layer, which is also the active layer.
After the image data in the training set is input into the convolutional neural network, the convolutional layer extracts local features of the image data through the filtering of convolution kernels; the pooling layer effectively reduces the data dimensionality, which greatly reduces the amount of computation and helps avoid overfitting; and the data processed by the convolutional and pooling layers is input into the fully-connected layer to obtain the final feature vector.
In one embodiment, the loss function may include a classification loss function and a rounding loss function;
in step S220, the step of adjusting parameters of the neural network model according to the feature vector and the label category until the adjusted neural network model converges includes:
s221: inputting the feature vector corresponding to the image data and the label category into the classification loss function, calculating the loss between the prediction derived from the feature vector and the label category with the classification loss function, and adjusting the parameters of the neural network model according to this loss until the neural network model is trained to produce feature vectors corresponding to the label categories;
s222: inputting the feature vector corresponding to the image data into the rounding loss function, calculating the difference between the feature vector and its rounded version with the rounding loss function, and adjusting the parameters of the neural network model according to this difference until the neural network model is trained to produce feature vectors within a preset difference interval.
In this embodiment, when the model is trained, the image data is input into the convolutional neural network, the training mode of the whole network is to perform a classification task on the image, and when the loss function is calculated, the loss function includes a classification loss function and a rounding loss function, so that the rounding loss and the classification loss are calculated at the same time. Therefore, in the model training process, the parameters of the convolutional neural network model can be adjusted through a back propagation algorithm until the adjusted convolutional neural network can keep the integer feature vector of the image category information.
Specifically, through the calculation of the classification loss function, the network can generate, during training, a feature vector that preserves the label category of the image. The expression of the classification loss function can be written (in the standard softmax cross-entropy form) as:
L_cls = -(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij · log( exp(w_j^T x_i) / Σ_{k=1}^{C} exp(w_k^T x_i) )
where L_cls is the classification loss function, x_i is the feature vector generated from the i-th image data input to the neural network model, w_j is the prediction matrix for the j-th class, N is the total number of image data, C is the total number of label categories, and y_ij is the supervision vector: if the i-th image data belongs to class j, then y_ij = 1, otherwise y_ij = 0.
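Assuming the classification loss takes the standard softmax cross-entropy form just described, a dependency-free sketch is:

```python
import math

def classification_loss(features, weights, labels):
    # features: N feature vectors x_i; weights: C class weight vectors w_j;
    # labels: the class index of each sample. Returns the mean cross-entropy.
    loss = 0.0
    for x, label in zip(features, labels):
        logits = [sum(wj * xj for wj, xj in zip(w, x)) for w in weights]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[label] - log_sum
    return loss / len(features)
```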
Then, through the calculation of the rounding loss function, the values of the feature vectors generated by the network for the image data can be made to approach integer values, for the construction of the subsequent inverted index. The expression of the rounding loss function can be written as:
L_round = (1/N) · Σ_{i=1}^{N} ‖ x_i - round(x_i) ‖
where L_round is the rounding loss function, x_i is the feature vector generated from the i-th image data input to the neural network model, round(·) is the rounding function, and ‖·‖ is the vector norm. By training with the rounding loss function, the network learns feature vectors of approximately integer values for the images.
In addition, after the classification loss and the rounding loss are calculated, the two losses are fused by weighting. The weighted total loss function is:

$$L = L_{\mathrm{cls}} + \lambda\, L_{\mathrm{round}}$$

where $\lambda$ is the hyper-parameter: $L_{\mathrm{cls}}$ denotes the classification loss and $L_{\mathrm{round}}$ the rounding loss. The quality of each manually selected value of $\lambda$ is evaluated on the sample data in the test set; the manually selected hyper-parameter is then adjusted and the network trained further, until the performance on the test set is best after some training run. The hyper-parameter at that point is the one used in the final neural network model.
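The weighted two-part loss described above can be sketched in plain Python as follows. This is a minimal illustrative sketch, not the patented implementation: the softmax cross-entropy form, the ℓ2 rounding penalty, and all function and parameter names (including the default weighting factor `lam`) are assumptions made for illustration.

```python
import math

def classification_loss(features, weights, biases, labels):
    """Softmax cross-entropy: -1/N * sum_i log softmax(W f_i + b)[y_i].

    features: N feature vectors (each length D); weights: C class weight
    vectors; biases: C class bias terms; labels: N integer class ids.
    """
    total = 0.0
    for f, y in zip(features, labels):
        logits = [sum(wd * fd for wd, fd in zip(w, f)) + b
                  for w, b in zip(weights, biases)]
        m = max(logits)  # log-sum-exp trick for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[y]  # equals -log softmax(logits)[y]
    return total / len(features)

def rounding_loss(features):
    """Mean l2 distance between each feature vector and its element-wise round."""
    return sum(math.sqrt(sum((x - round(x)) ** 2 for x in f))
               for f in features) / len(features)

def total_loss(features, weights, biases, labels, lam=0.1):
    """Weighted fusion of the two losses; lam is the hyper-parameter."""
    return (classification_loss(features, weights, biases, labels)
            + lam * rounding_loss(features))
```

In a real training loop these values would be computed on network outputs and differentiated by back-propagation; the sketch only shows the arithmetic of the two terms and their weighted fusion.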
In one embodiment, the hyper-parameter of the neural network model is the weighting factor used when the classification loss function and the rounding loss function are fused by weighting.
In one embodiment, as shown in fig. 7, which is a schematic structural diagram of an image-text mixed retrieval apparatus according to an embodiment of the present invention, the present invention further provides an image-text mixed retrieval apparatus comprising a first indexing module 210, a second indexing module 220, and a comprehensive ranking module 230, as follows:
the first indexing module 210 is configured to obtain information to be retrieved, which is input by a user, and if the information to be retrieved includes text data and image data at the same time, perform an inverted index query operation after segmenting words of the text data to obtain a first candidate set;
the second indexing module 220 is configured to input the image data into a pre-trained neural network model to obtain a feature vector corresponding to the image data, perform an inverted index query operation after encoding the feature vector, and obtain a second candidate set;
The comprehensive ranking module 230 is configured to merge the documents in the first candidate set and the second candidate set, count the hit times of the merged documents in the first candidate set and/or the second candidate set, rank the merged documents according to the statistical result, and return the ranking result to the display page.
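The merging, hit counting, and weighted ranking performed by the comprehensive ranking module 230 can be sketched as follows. This is an illustrative sketch under assumptions: the function name, the default weighting factors `w_text`/`w_image`, and the tie-breaking behaviour are not fixed by the embodiment.

```python
from collections import Counter

def rank_merged(first_candidates, second_candidates, w_text=1.0, w_image=1.0):
    """Merge the two candidate sets, count per-document hits in each set,
    apply weighted fusion to documents that appear in both, and return the
    document ids sorted by descending (weighted) hit count.
    """
    text_hits = Counter(first_candidates)    # doc id -> hits in first set
    image_hits = Counter(second_candidates)  # doc id -> hits in second set
    scores = {}
    for doc in set(text_hits) | set(image_hits):
        if doc in text_hits and doc in image_hits:
            # hit in both candidate sets: weighted fusion of the two counts
            scores[doc] = w_text * text_hits[doc] + w_image * image_hits[doc]
        else:
            # hit in only one candidate set: use the plain hit count
            scores[doc] = text_hits.get(doc, 0) + image_hits.get(doc, 0)
    return sorted(scores, key=scores.get, reverse=True)
```

The weighting factors would be tuned manually per application scenario, as the claims describe.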
The deep-learning-based image-text mixed retrieval apparatus provided in this embodiment considers both the text data and the image data in the retrieval process, so the accuracy of queries containing both image data and text data is improved. Moreover, the invention remains compatible with text-to-text queries and image-to-image queries, so the whole image-text mixed retrieval system needs to maintain only one set of inverted index results to achieve mixed retrieval of images and text, which effectively reduces the maintenance cost of the system.
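The feature-vector encoding performed by the second indexing module 220 (rounding the feature vector to integers, mapping it to text, and block-coding it into index terms) can be sketched as follows. The token format and the default block size are illustrative assumptions; the embodiment does not fix a concrete encoding.

```python
def encode_feature_vector(feature, block_size=4):
    """Round a real-valued feature vector element-wise to integers, map each
    integer to a text token, and group the tokens into fixed-size blocks
    ("block coding") so the blocks can serve as terms in an ordinary
    inverted index alongside text terms.
    """
    ints = [round(x) for x in feature]                  # integer feature vector
    tokens = [f"{i}:{v}" for i, v in enumerate(ints)]   # map integers to text
    # block coding: one index term per block of consecutive tokens
    return ["_".join(tokens[i:i + block_size])
            for i in range(0, len(tokens), block_size)]
```

Each returned string can then be looked up in the same inverted index used for text terms, which is what lets one index structure serve both query modalities.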
For the specific limitations of the image-text mixed retrieval apparatus, reference may be made to the limitations of the image-text mixed retrieval method above, which are not repeated here. All or part of the modules in the image-text mixed retrieval apparatus can be implemented by software, by hardware, or by a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor in the terminal device, or can be stored in software form in a memory in the terminal device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, the present invention also provides a storage medium having computer-readable instructions stored therein, which, when executed by one or more processors, cause the one or more processors to perform the steps of the image-text mixed retrieval method according to any one of the preceding embodiments.
In one embodiment, the present invention further provides a computer device having computer-readable instructions stored therein, which, when executed by one or more processors, cause the one or more processors to perform the steps of the image-text mixed retrieval method according to any one of the preceding embodiments.
Fig. 8 is a schematic diagram illustrating the internal structure of a computer device according to an embodiment of the present invention; the computer device 300 may be provided as a server. Referring to fig. 8, the computer device 300 includes a processing component 302, which further includes one or more processors, and memory resources, represented by memory 301, for storing instructions, such as application programs, that are executable by the processing component 302. The application programs stored in memory 301 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 302 is configured to execute the instructions to perform the image-text mixed retrieval method of any of the embodiments described above.
The computer device 300 may also include a power component 303 configured to perform power management of the computer device 300, a wired or wireless network interface 304 configured to connect the computer device 300 to a network, and an input/output (I/O) interface 305. The computer device 300 may operate based on an operating system stored in memory 301, such as Windows Server, Mac OS X™, Unix, Linux, FreeBSD™, or the like.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A method for mixed retrieval of images and texts is characterized by comprising the following steps:
acquiring information to be retrieved input by a user, and if the information to be retrieved contains text data and image data at the same time, performing an inverted index query operation on the text data through a pre-constructed inverted index structure after word segmentation to obtain a first candidate set;
the inverted index structure comprises terms and an inverted record table, wherein the terms are obtained from data in a pre-collected webpage library, and a plurality of documents corresponding to the terms are recorded in the inverted record table and are obtained after the terms are queried in the database through the inverted index structure;
inputting the image data into a pre-trained neural network model to obtain a feature vector corresponding to the image data, coding the feature vector to obtain corresponding terms, and performing an inverted index query operation on the terms through the inverted index structure to obtain a second candidate set;
merging the documents in the first candidate set and the second candidate set, counting the hit times of the merged documents in the first candidate set or the second candidate set and the weighted hit times in the first candidate set and the second candidate set, sorting the merged documents according to the counting result, and returning the sorting result to a display page;
and the setting of the weighting factor in the weighted hit times is manually adjusted according to the specific application scenario.
2. The image-text mixed retrieval method according to claim 1, wherein the step of performing an inverted index query operation on the text data after word segmentation through a pre-constructed inverted index structure to obtain a first candidate set comprises:
segmenting the text data to obtain a plurality of terms, constructing an inverted index according to the terms, and performing query operation in the inverted index to obtain a first candidate set corresponding to the text data;
the first candidate set comprises a plurality of documents obtained by querying the terms in the inverted index.
3. The image-text mixed retrieval method of claim 1, wherein the step of obtaining a second candidate set by performing an inverted index query operation on the terms through the inverted index structure after the feature vector is encoded to obtain corresponding terms comprises:
carrying out rounding operation on the feature vectors to obtain integer feature vectors, mapping the integer feature vectors into text vectors, and then carrying out block coding to obtain a plurality of feature codes corresponding to the image data;
constructing an inverted index according to the feature codes, and performing query operation in the inverted index to obtain a second candidate set corresponding to the image data;
and the second candidate set comprises a plurality of documents obtained by querying the feature codes in the inverted index.
4. The image-text mixed retrieval method according to claim 1, wherein the step of counting the number of hits of the merged document in the first candidate set or the second candidate set and the weighted number of hits in the first candidate set and the second candidate set comprises:
if the merged document is the document in the first candidate set, counting the hit times of the document in the first candidate set;
if the merged document is the document in the second candidate set, counting the hit times of the document in the second candidate set;
and if the combined document is the document in the first candidate set and the second candidate set, counting the weighted hit times of the document in the first candidate set and the second candidate set.
5. The image-text mixed retrieval method according to claim 4, wherein if the merged document is a document in both the first candidate set and the second candidate set, the step of counting the weighted hit times of the document in the first candidate set and the second candidate set comprises:
if the merged document is a document in the first candidate set and the second candidate set, counting the first hit times of the document in the first candidate set and the second hit times of the document in the second candidate set;
and acquiring a weighting factor, and performing weighted fusion on the first hit times and the second hit times by using the weighting factor to obtain weighted hit times.
6. The image-text mixed retrieval method according to claim 1, wherein the training method of the neural network model comprises:
acquiring sample data in a database, and dividing the sample data into a training set and a test set, wherein the sample data comprises image data and corresponding label types;
training a neural network model by using the image data in the training set to obtain a feature vector corresponding to the image data, and adjusting parameters of the neural network model according to the feature vector and the label category until the adjusted neural network model converges;
and verifying the classification effect of the converged neural network model by using the sample data in the test set, and adjusting the hyper-parameters of the neural network model until the neural network model with the optimal classification effect is obtained.
7. The image-text mixed retrieval method of claim 6, wherein the neural network model comprises a convolutional layer, a pooling layer, a fully-connected layer, an integer feature layer, and a loss function;
training a neural network model by using the image data in the training set to obtain a feature vector corresponding to the image data, wherein the step comprises the following steps:
and sequentially inputting the image data in the training set into the neural network model, and performing feature extraction on the image data through the convolutional layer, the pooling layer, the full-link layer and the integer feature layer to obtain a feature vector corresponding to the image data.
8. The image-text mixed retrieval method of claim 7, wherein the loss function comprises a classification loss function and a rounding loss function;
adjusting parameters of the neural network model according to the feature vectors and the label categories until the adjusted neural network model converges, comprising:
inputting the feature vector corresponding to the image data and the label category into the classification loss function, calculating a difference value between the feature vector and the label category by using the classification loss function, and adjusting the parameters of the neural network model according to the difference value until the feature vector corresponding to the label category is obtained by training the neural network model;
inputting the feature vector corresponding to the image data into the rounding loss function, calculating a difference value between the feature vector and the rounded feature vector by using the rounding loss function, and adjusting parameters of the neural network model according to the difference value until the feature vector in a preset difference value interval is obtained by training the neural network model.
9. The image-text mixed retrieval method according to claim 8, wherein the hyper-parameter of the neural network model is the weighting factor used when the classification loss function and the rounding loss function are fused by weighting.
10. A mixed retrieval apparatus for images and texts, comprising:
the system comprises a first indexing module, a second indexing module and a third indexing module, wherein the first indexing module is used for acquiring information to be retrieved input by a user, and if the information to be retrieved contains text data and image data at the same time, the text data is subjected to inverted index query operation through a pre-constructed inverted index structure after being segmented to obtain a first candidate set;
the inverted index structure comprises terms and an inverted record table, wherein the terms are obtained through data in a webpage library collected in advance, and a plurality of documents corresponding to the terms are recorded in the inverted record table and are obtained after the terms are inquired in the database through the inverted index structure;
the second indexing module is used for inputting the image data into a pre-trained neural network model to obtain a feature vector corresponding to the image data, coding the feature vector to obtain corresponding terms, and performing an inverted index query operation on the terms through the inverted index structure to obtain a second candidate set;
the comprehensive ranking module is used for merging the documents in the first candidate set and the second candidate set, counting the hit times of the merged documents in the first candidate set or the second candidate set and the weighted hit times in the first candidate set and the second candidate set, ranking the merged documents according to the counting result, and returning the ranking result to a display page;
and the setting of the weighting factor in the weighted hit times is manually adjusted according to the specific application scenario.
11. A storage medium, characterized in that: the storage medium has stored therein computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the image-text mixed retrieval method according to any one of claims 1-9.
12. A computer device, characterized in that: the computer device has stored therein computer-readable instructions which, when executed by one or more processors, cause the one or more processors to carry out the steps of the image-text mixed retrieval method according to any one of claims 1-9.
CN202011343386.8A 2020-11-26 2020-11-26 Image-text mixed retrieval method and device, storage medium and computer equipment Active CN112148831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011343386.8A CN112148831B (en) 2020-11-26 2020-11-26 Image-text mixed retrieval method and device, storage medium and computer equipment


Publications (2)

Publication Number Publication Date
CN112148831A CN112148831A (en) 2020-12-29
CN112148831B true CN112148831B (en) 2021-03-19

Family

ID=73887417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011343386.8A Active CN112148831B (en) 2020-11-26 2020-11-26 Image-text mixed retrieval method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN112148831B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868442A (en) * 2021-08-26 2021-12-31 北京中知智慧科技有限公司 Joint retrieval method and device
CN114707007B (en) * 2022-06-07 2022-08-30 苏州大学 Image text retrieval method and device and computer storage medium
CN115794984B (en) * 2022-11-14 2023-11-28 北京百度网讯科技有限公司 Data storage method, data retrieval method, device, equipment and medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1342942A (en) * 2000-09-08 2002-04-03 百度在线网络技术(北京)有限公司 Computer recognizing and indexing method of Chinese names
CN1952935A (en) * 2006-09-22 2007-04-25 南京搜拍信息技术有限公司 Search system and technique comprehensively using information of graphy and character
CN101441658A (en) * 2008-10-22 2009-05-27 中国科学院上海技术物理研究所 Search method and system facing to radiation image in PACS database based on content
CN101458695A (en) * 2008-12-18 2009-06-17 西交利物浦大学 Mixed picture index construct and enquiry method based on key word and content characteristic and use thereof
CN102201001A (en) * 2011-04-29 2011-09-28 西安交通大学 Fast retrieval method based on inverted technology
CN102662959A (en) * 2012-03-07 2012-09-12 南京邮电大学 Method for detecting phishing web pages with spatial mixed index mechanism
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN103955514A (en) * 2014-05-05 2014-07-30 陈浩 Image feature indexing method based on Lucene inverted index
CN105630940A (en) * 2015-12-21 2016-06-01 天津大学 Readability indicator based information retrieval method
CN105653738A (en) * 2016-03-01 2016-06-08 北京百度网讯科技有限公司 Search result broadcasting method and device based on artificial intelligence
CN106407450A (en) * 2016-09-30 2017-02-15 网易(杭州)网络有限公司 File searching method and apparatus
CN108509521A (en) * 2018-03-12 2018-09-07 华南理工大学 A kind of image search method automatically generating text index
CN111680173A (en) * 2020-05-31 2020-09-18 西南电子技术研究所(中国电子科技集团公司第十研究所) CMR model for uniformly retrieving cross-media information
CN111753060A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Information retrieval method, device, equipment and computer readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020002550A1 (en) * 2000-02-10 2002-01-03 Berman Andrew P. Process for enabling flexible and fast content-based retrieval
CN101694666B (en) * 2009-07-17 2011-03-30 刘二中 Method for inputting and processing characteristic words of file contents
CN101694657B (en) * 2009-09-18 2011-11-09 浙江大学 Picture retrieval clustering method facing to Web2.0 label picture shared space
CN110222160B (en) * 2019-05-06 2023-09-15 平安科技(深圳)有限公司 Intelligent semantic document recommendation method and device and computer readable storage medium



Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
US11334635B2 (en) Domain specific natural language understanding of customer intent in self-help
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN110162593B (en) Search result processing and similarity model training method and device
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
Wu et al. Chinese text classification based on character-level CNN and SVM
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN112000801A (en) Government affair text classification and hot spot problem mining method and system based on machine learning
KR20200010672A (en) Smart merchandise searching method and system using deep learning
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN114417058A (en) Video material screening method and device, computer equipment and storage medium
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN114239730A (en) Cross-modal retrieval method based on neighbor sorting relation
CN113535960A (en) Text classification method, device and equipment
CN113535949A (en) Multi-mode combined event detection method based on pictures and sentences
CN111859955A (en) Public opinion data analysis model based on deep learning
CN115080741A (en) Questionnaire survey analysis method, device, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant