CN110765286A - Cross-media retrieval method and device, computer equipment and storage medium


Info

Publication number
CN110765286A
Authority
CN
China
Prior art keywords
media information
text
similarity
global
entity
Prior art date
Legal status
Pending
Application number
CN201910848286.1A
Other languages
Chinese (zh)
Inventor
李涵
Current Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Original Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhuo Erzhi Lian Wuhan Research Institute Co Ltd filed Critical Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority to CN201910848286.1A
Publication of CN110765286A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a cross-media retrieval method, a cross-media retrieval apparatus, a computer device and a storage medium. The method obtains a retrieval target from a cross-media information database. It not only considers the global similarity between the media information to be retrieved and the cross-media information, but also adds weight to preset category words using the global features of the text media information, strengthening the semantic information of the text media information in the preset category words and using it as a local feature; the retrieval target is then obtained according to the weighted combination of the global similarity and the local similarity. The method integrates the global similarity and the local similarity of the media information to be retrieved and the cross-media information, in particular the similarity in the semantic dimension of the preset category words, which alleviates the semantic misalignment between text media information and other media information and improves the retrieval effect.

Description

Cross-media retrieval method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of information retrieval technologies, and in particular, to a cross-media retrieval method and apparatus, a computer device, and a storage medium.
Background
With the development of internet technology, the media forms used to record information have diversified: for example, text, images, video and audio can each convey information, and the same information content may be recorded in different media forms. This creates a need for cross-media retrieval; for example, a user enters the text "cat and mouse", wishing to retrieve a picture or video of a cat and a mouse.
In the traditional cross-media retrieval method, image features are extracted through a convolutional neural network, text features are extracted through Word2vec, and the association between the two heterogeneous features of image and text is found through a logistic-regression-based semantic matching method, so as to achieve cross-media retrieval.
In practical applications, information in different media forms carries different semantic information; that is, information in different media forms has a semantic alignment problem. For example, if the semantic information in a picture is much richer than that in the text, media features extracted only through a convolutional neural network and Word2vec cannot solve the semantic alignment problem among different media data, which in turn degrades the retrieval effect.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a cross-media retrieval method, apparatus, computer device and storage medium with high retrieval performance.
A cross-media retrieval method for obtaining a retrieval target from matching cross-media information in a cross-media information database, the method comprising:
receiving media information to be retrieved; the cross-media information is information that belongs to a different media type from the media information to be retrieved; the media type of one of the media information to be retrieved and the cross-media information is text;
acquiring global features of the text media information and global features of the non-text media information through a trained neural network; the text media information is, of the media information to be retrieved and the cross-media information, the information whose media type is text; the non-text media information is the information whose media type is not text;
acquiring entities of the non-text media information, obtaining local features of the non-text media information according to the entities, and adding weight to preset category words according to the global features of the text media information to obtain local features of the text media information;
calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information;
calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information;
and weighting the global similarity and the local similarity, and obtaining a retrieval target of the media information to be retrieved according to the similarity weighting result.
A cross-media retrieval device for obtaining retrieval targets from matching cross-media information in a cross-media information database, the device comprising:
the retrieval information acquisition module is used for receiving the media information to be retrieved; the media type of one of the media information to be retrieved and the cross-media information is text; the cross-media information is information that belongs to a different media type from the media information to be retrieved;
the global feature acquisition module is used for acquiring global features of the text media information and global features of the non-text media information through a trained neural network; the text media information is, of the media information to be retrieved and the cross-media information, the information whose media type is text; the non-text media information is the information whose media type is not text;
the local feature acquisition module is used for acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and adding weight to the preset category words according to the global features of the text media information to obtain the local features of the text media information;
the global similarity analysis module is used for calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information;
the local similarity analysis module is used for calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information;
and the retrieval module is used for weighting the global similarity and the local similarity and obtaining a retrieval target of the media information to be retrieved according to the similarity weighting result.
A computer device comprising a memory storing a computer program and a processor that implements the steps of any of the methods described above when executing the computer program.
A computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of any of the methods described above.
According to the cross-media retrieval method, apparatus, computer device and storage medium above, not only is the global similarity between the media information to be retrieved and the cross-media information considered, but weight is also added to the preset category words using the global features of the text media information, strengthening the semantic information of the text media information in the preset category words and using it as a local feature; the retrieval target is obtained according to the weighted combination of the global similarity and the local similarity. The method integrates the global similarity and the local similarity of the media information to be retrieved and the cross-media information, in particular the similarity in the semantic dimension of the preset category words, which alleviates the semantic misalignment between text media information and other media information and improves the retrieval effect.
Drawings
FIG. 1 is a diagram of an application environment of a cross-media retrieval method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a cross-media retrieval method in one embodiment;
FIG. 3 is a diagram illustrating an entity similarity analysis process when local features include entity features, according to an embodiment;
FIG. 4 is a diagram illustrating a relationship similarity analysis process when local features include relationship features, according to an embodiment;
FIG. 5 is a diagram illustrating a process for local similarity analysis when local features include entity features and relation features, according to an embodiment;
FIG. 6 is a diagram illustrating a global similarity analysis process in one embodiment;
FIG. 7 is a diagram of a cross-media retrieval analysis process in one embodiment;
FIG. 8 is a block diagram of the structure of a cross media retrieval device in one embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The cross-media retrieval method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The user inputs media information to be retrieved, such as text or pictures, through the terminal 102, and the terminal 102 sends the media information to be retrieved to the server 104. The server 104 implements the cross-media retrieval method to obtain retrieval targets from a cross-media information database. The server receives the media information to be retrieved, where the media type of one of the media information to be retrieved and the cross-media information is text, and the cross-media information is information that belongs to a different media type from the media information to be retrieved. The server then acquires global features of the text media information and global features of the non-text media information through a trained neural network; acquires entities of the non-text media information and obtains local features of the non-text media information according to the entities; adds weight to preset category words according to the global features of the text media information to obtain local features of the text media information; calculates the global similarity of the text media information and the non-text media information from their global features; calculates the local similarity from their local features; weights the global similarity and the local similarity; and obtains a retrieval target of the media information to be retrieved according to the similarity weighting result.
The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device, and the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a cross-media retrieval method is provided for obtaining a retrieval target from matching cross-media information in a cross-media information database. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
S202, receiving media information to be retrieved.
The media information to be retrieved is the retrieval object input by a user through a terminal, and its media form can be a picture, video, text, or audio. For example, a user inputs a picture, video, text, or audio at the client of a search engine through a terminal; what is input is the retrieval object, i.e. the media information to be retrieved. The user inputs the media information to be retrieved through the terminal, and the terminal sends it to the server through the network.
The cross-media information refers to information that belongs to a different media category from the media information to be retrieved. For example, if the media category of the media information to be retrieved is text, the media category of the cross-media information is image, video or audio. For another example, if the media category of the media information to be retrieved is image, the media category of the cross-media information is text, video or audio.
In this embodiment, the media information to be retrieved and the cross-media information belong to different media types, and one of them is text; that is, either the media type of the media information to be retrieved is text, or the media type of the cross-media information is text. The cross-media information database refers to a database storing media information of a type different from that of the media information to be retrieved. If the type of the media information to be retrieved is text, the cross-media information database is an image database or a video database; if the type of the media information to be retrieved is video, the cross-media information database is an image database or a text database. The cross-media retrieval method obtains the retrieval target from this cross-media information database.
S204, acquiring the global features of the text media information and the global features of the non-text media information through the trained neural network.
The text media information is, of the media information to be retrieved and the cross-media information, the information whose media type is text; the non-text media information is the information whose media type is not text. As noted above, the media type of one of the media information to be retrieved and the cross-media information is text. Therefore, in this embodiment, the text media information is one of the media information to be retrieved and the cross-media information: if the media type of the media information to be retrieved is text, the text media information is the media information to be retrieved and the non-text media information is the cross-media information; if the media type of the media information to be retrieved is not text, the text media information is the cross-media information and the non-text media information is the media information to be retrieved.
An artificial neural network (ANN), also referred to as a neural network (NN) or connection model, is an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Such a network processes information by adjusting the interconnections among a large number of internal nodes, depending on the complexity of the system.
Specifically, different neural networks are employed to extract global features for information of different media types. For text media information, a character-level CNN (Char-CNN) combined with an LSTM can be used to extract text features, obtaining the global features of the text. For image media information, a convolutional neural network (e.g., VGG19) can be used to extract image features, obtaining the global features of the image.
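As a rough illustration of these two extractors (the patent itself provides no code), the following PyTorch sketch pairs a VGG19-based image encoder with a Char-CNN plus LSTM text encoder; the layer sizes, character vocabulary size and feature dimension are assumptions made for the example:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageGlobalEncoder(nn.Module):
    """VGG19 backbone pooled and projected to a fixed-size global feature."""
    def __init__(self, feat_dim=512):
        super().__init__()
        vgg = models.vgg19(weights=None)  # load pretrained weights in practice
        self.backbone = vgg.features
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.proj = nn.Linear(512 * 7 * 7, feat_dim)

    def forward(self, images):            # images: (B, 3, H, W)
        x = self.pool(self.backbone(images)).flatten(1)
        return self.proj(x)               # (B, feat_dim) global image feature

class TextGlobalEncoder(nn.Module):
    """Character-level CNN followed by an LSTM, as described above."""
    def __init__(self, vocab_size=128, char_dim=32, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, 256, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(256, hidden, batch_first=True)

    def forward(self, char_ids):          # char_ids: (B, T) character indices
        x = self.embed(char_ids).transpose(1, 2)          # (B, char_dim, T)
        x = torch.relu(self.char_cnn(x)).transpose(1, 2)  # (B, T, 256)
        seq, (h, _) = self.lstm(x)
        return h[-1], seq   # global text feature and the full LSTM sequence
```

The full LSTM sequence returned by the text encoder is reused later by the attention modules that produce the local text features.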
S206, acquiring the entities of the non-text media information, obtaining local features of the non-text media information according to the entities, and adding weight to preset category words according to the global features of the text media information to obtain local features of the text media information.
Local features refer to features of a particular dimension that are considered with emphasis, relative to global features. Specifically, the local features of the non-text media information are obtained from the entities of the non-text media information; thus, for non-text media information, the local features are related to the entities. An entity refers to an object or substance present in the information; for example, a person or object appearing in an image is an entity in the image. For non-text media information, the manner of acquiring entities differs across media types. For example, entities of image media information may be acquired by performing target detection on the image with a trained convolutional neural network and extracting the entities in the image, such as a "person" or a "cat" in the image.
For text media information, the local features are related to the emphasized preset category words. If the preset category words are entity words, the local features are related to entities; if the preset category words are relation words, the local features are related to the relations between entities. In this embodiment, the global features of the text media information are used to add weight to the preset category words, so as to strengthen the semantic information of the text media information in those words.
S208, calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information.
The global similarity refers to the similarity of the global features of the text media information and the non-text media information. Taking as an example that the media type of the media information to be retrieved is text and the media type of the cross-media information is image: image features are extracted through a convolutional neural network, text features are extracted using Word2vec, and the global similarity is calculated from the image features and the text features. The global similarity can be calculated by a similarity calculation method such as the K-nearest-neighbor algorithm.
S210, calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information.
Specifically, local features are features of a particular dimension that are considered with emphasis relative to global features, and the global features of the text media information are used to add weight to the preset category words, strengthening the semantic information of the text media information in those words. Therefore, the local similarity represents how similar the text media information and the non-text media information are locally, especially in the semantic dimension of the preset category words.
S212, weighting the global similarity and the local similarity, and obtaining a retrieval target of the media information to be retrieved according to the similarity weighting calculation result.
Specifically, different weights are set for the global similarity and the local similarity; these weights can be obtained in advance through model training. The cross-media information in the cross-media information database is ranked according to the similarity weighting result, and the retrieval target is output according to the ranking. For example, if the media information to be retrieved is text media information, the ranking is computed from the weighted similarity between the text media information and each piece of cross-media information, and the N pieces of cross-media information with the highest similarity, such as images, are output as the retrieval targets of the media information to be retrieved.
The cross-media retrieval method above not only considers the global similarity between the media information to be retrieved and the cross-media information, but also adds weight to the preset category words using the global features of the text media information, strengthening the semantic information of the text media information in the preset category words and using it as a local feature; the retrieval target is obtained according to the weighted combination of the global similarity and the local similarity. The method integrates the global similarity and the local similarity of the media information to be retrieved and the cross-media information, in particular the similarity in the semantic dimension of the preset category words, which alleviates the semantic misalignment between text media information and other media information and improves the retrieval effect.
In another embodiment, the local features include entity features. Acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and adding weight to the preset category words according to the global features of the text media information to obtain the local features of the text media information includes: performing target detection on the non-text media information using a trained entity extraction model and extracting the entities in the non-text media information, obtaining the entity features of the non-text media information; and adding weight to the word vectors of the entity words in the text media information according to the global features of the text media information, obtaining the entity features of the text media information.
Specifically, for non-text media information, an entity extraction model is trained in advance to perform entity extraction, where the entity extraction model adopts a neural network structure. Taking the media information to be retrieved as image information and the cross-media information as text information as an example, a local similarity calculation model of one embodiment is shown in FIG. 3. The image is input into the trained entity extraction model for target detection, and the entities in the image are extracted. In this embodiment, the entity extraction model adopts a convolutional neural network structure, such as a Fast R-CNN structure, to extract the entities in the image, such as a "person" or a "cat" in the image. The entities extracted by the entity extraction model are the entity features of the non-text media information.
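The patent only names an R-CNN-style convolutional structure; as a hedged stand-in, the sketch below uses torchvision's pretrained Faster R-CNN detector, with an assumed confidence threshold, to extract image entities:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in entity extraction model (the patent does not specify the exact
# detector architecture or its weights).
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def extract_image_entities(image, score_thresh=0.7):
    """image: (3, H, W) float tensor scaled to [0, 1].
    score_thresh is an assumed confidence cutoff for keeping detections."""
    with torch.no_grad():
        pred = detector([image])[0]
    keep = pred["scores"] > score_thresh
    return pred["boxes"][keep], pred["labels"][keep]  # the detected entities
```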
For the text media information, weight is added to the word vectors of the entity words in the text according to the global features of the text media information, obtaining the entity features of the text media information. Specifically, in this embodiment, Char-CNN combined with LSTM is used to extract the text features, obtaining an LSTM sequence of the text media information. After the LSTM sequence, an attention module uses an attention mechanism to add weight to the word vectors of the entity words in the text media information, obtaining the entity features of the text media information. The entity words in the text media information are the nouns and pronouns in the text, such as the noun "cat" or a character's name in a paragraph of text.
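One plausible reading of this attention module, sketched below under assumed shapes: the global text feature scores every step of the LSTM sequence, positions flagged as entity words (e.g. by a part-of-speech tagger, which is an assumption here) receive an extra bonus, and the weighted sequence is pooled into the entity feature:

```python
import torch
import torch.nn as nn

class EntityAttention(nn.Module):
    """Scores each LSTM step against the global text feature and up-weights
    steps flagged as entity words before pooling into an entity feature."""
    def __init__(self, dim=512, entity_bonus=2.0):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.entity_bonus = entity_bonus  # assumed extra weight for entity words

    def forward(self, seq, global_feat, entity_mask):
        # seq: (B, T, D) LSTM outputs; global_feat: (B, D);
        # entity_mask: (B, T), 1.0 at noun/pronoun positions, else 0.0.
        q = self.query(global_feat).unsqueeze(2)        # (B, D, 1)
        logits = torch.bmm(seq, q).squeeze(2)           # (B, T) attention scores
        logits = logits + self.entity_bonus * entity_mask
        weights = torch.softmax(logits, dim=1)          # (B, T)
        return torch.bmm(weights.unsqueeze(1), seq).squeeze(1)  # (B, D)
```

A relation attention module would have the same structure, with the mask marking verbs instead of nouns and pronouns.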
For the entity features of the image and the entity features of the text, the entity-level similarity of the image and the text is obtained using the K-nearest-neighbor algorithm. In this embodiment, the local features are entity features: by extracting the entities of the non-text media information and adding weight to the word vectors of the entity words in the text media information, and taking these respectively as the objects of the local similarity comparison, the similarity of the non-text media information and the text media information at the entity level can be obtained. This similarity of the media information to be retrieved and the cross-media information in the entity dimension, i.e. the entity similarity, alleviates the semantic misalignment between the text media information and other media information and improves the retrieval effect.
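The patent does not spell out the K-nearest-neighbor computation; one plausible sketch is to average, over the text entity vectors, the cosine similarity to their k nearest image entity vectors:

```python
import torch
import torch.nn.functional as F

def entity_level_similarity(text_entities, image_entities, k=3):
    """text_entities: (Nt, D), image_entities: (Nv, D).
    Returns a scalar entity-level similarity (an assumed aggregation)."""
    t = F.normalize(text_entities, dim=-1)
    v = F.normalize(image_entities, dim=-1)
    sims = t @ v.T                                   # (Nt, Nv) cosine matrix
    k = min(k, v.shape[0])
    return sims.topk(k, dim=1).values.mean().item()  # mean over k nearest
```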
In another embodiment, the local features include relation features. Acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and adding weight to the preset category words according to the global features of the text media information to obtain the local features of the text media information includes: performing target detection on the non-text media information using a trained entity extraction model, extracting the entities in the non-text media information, and combining the extracted entities, obtaining the relation features of the non-text media information; and adding weight to the word vectors of the relation words in the text media information according to the global features of the text media information, obtaining the relation features of the text media information.
Specifically, for non-text media information, an entity extraction model is trained in advance to perform entity extraction, where the entity extraction model adopts a neural network structure. Taking the media information to be retrieved as image information and the cross-media information as text information as an example, a local similarity calculation model of one embodiment is shown in FIG. 4. The image is input into the trained entity extraction model for target detection, and the entities in the image are extracted. In this embodiment, the entity extraction model adopts a convolutional neural network structure, such as a Fast R-CNN structure, to extract the entities in the image, such as a "person" or a "cat" in the image. The entities extracted by the entity extraction model are then combined, for example by connecting the entities, to obtain the relation features of the non-text media information; that is, a relation feature of the non-text media comprises a plurality of entities.
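"Connecting" the entities is not formalized in the patent; the sketch below reads it as concatenating every pair of detected entity vectors into one relation feature:

```python
import itertools
import torch

def relation_features(entity_feats):
    """entity_feats: (N, D) features of the detected entities. Each pair of
    entity vectors is concatenated into one relation feature (an assumed
    combination scheme)."""
    n, d = entity_feats.shape
    pairs = list(itertools.combinations(range(n), 2))
    if not pairs:  # fewer than two entities: no relations to form
        return entity_feats.new_zeros((0, 2 * d))
    return torch.stack([torch.cat((entity_feats[i], entity_feats[j]))
                        for i, j in pairs])          # (N*(N-1)/2, 2*D)
```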
For the text media information, weight is added to the word vectors of the relation words in the text according to the global features of the text media information, obtaining the relation features of the text media information. Specifically, in this embodiment, Char-CNN combined with LSTM is used to extract the text features, obtaining an LSTM sequence of the text media information. After the LSTM sequence, an attention module uses an attention mechanism to add weight to the word vectors of the relation words in the text media information, obtaining the relation features of the text media information. The relation words in the text media information are the verbs in the text.
For the relation features of the image and the relation features of the text, the relation-level similarity of the image and the text is obtained using the K-nearest-neighbor algorithm. In this embodiment, the local features are relation features: by extracting and combining the entities of the non-text media information and adding weight to the word vectors of the relation words in the text media information, and taking these respectively as the objects of the local similarity comparison, the similarity of the non-text media information and the text media information at the relation level can be obtained. This similarity of the media information to be retrieved and the cross-media information in the relation dimension, i.e. the relation similarity, alleviates the semantic misalignment between the text media information and other media information and improves the retrieval effect.
In another embodiment, the local features include entity features and relation features. Obtaining the local features of the text media information and the local features of the non-text media information then includes: performing target detection on the non-text media information using a trained entity extraction model and extracting the entities in the non-text media information, obtaining the entity features of the non-text media information; combining the extracted entities, obtaining the relation features of the non-text media information; adding weight to the word vectors of the entity words in the text media information according to the global features of the text media information, obtaining the entity features of the text media information; and adding weight to the word vectors of the relation words in the text media information according to the global features of the text media information, obtaining the relation features of the text media information.
Specifically, for non-text media information, an entity extraction model is trained in advance to perform entity extraction, where the entity extraction model adopts a neural network structure. Taking the media information to be retrieved as image information and the cross-media information as text information as an example, a local similarity calculation model of one embodiment is shown in FIG. 5. The image is input into the trained entity extraction model for target detection, and the entities in the image are extracted. In this embodiment, the entity extraction model adopts a convolutional neural network structure, such as a Fast R-CNN structure, to extract the entities in the image, such as a "person" or a "cat" in the image. The entities extracted by the entity extraction model are the entity features of the non-text media information.
The extracted entities are then combined, for example by connecting the entities, to obtain the relation features of the non-text media information; that is, a relation feature of the non-text media comprises a plurality of entities.
For the text media information, weight is added to the word vectors of the entity words in the text through the entity attention module, according to the global features of the text media information, obtaining the entity features of the text media information. Specifically, in this embodiment, Char-CNN combined with LSTM is used to extract the text features, obtaining an LSTM sequence of the text media information. After the LSTM sequence, the entity attention module uses an attention mechanism to add weight to the word vectors of the entity words in the text media information, obtaining the entity features of the text media information. The entity words in the text media information are the nouns and pronouns in the text, such as the noun "cat" or a character's name in a paragraph of text.
Likewise, weight is added to the word vectors of the relation words in the text according to the global features of the text media information, obtaining the relation features of the text media information. After the LSTM sequence, the relation attention module uses the attention mechanism to add weight to the word vectors of the relation words in the text media information, obtaining the relation features of the text media information. The relation words in the text media information are the verbs in the text.
Calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information includes: calculating the entity similarity of the text media information and the non-text media information according to the entity features of the text media information and the entity features of the non-text media information; and calculating the relation similarity of the text media information and the non-text media information according to the relation features of the text media information and the relation features of the non-text media information.
Specifically, for the entity features of the image and the entity features of the text, the entity-level similarity of the image and the text is obtained using the K-nearest-neighbor algorithm; for the relation features of the image and the relation features of the text, the relation-level similarity is obtained in the same way. In this embodiment, by extracting the entities of the non-text media information and adding weight to the word vectors of the entity words in the text media information, and taking these respectively as the objects of the local similarity comparison, the similarity of the non-text media information and the text media information at the entity level can be obtained; by combining the extracted entities and adding weight to the word vectors of the relation words in the text media information, the similarity of the non-text media information and the text media information at the relation level can be obtained.
Compared with other cross-media retrieval methods, this cross-media retrieval method measures similarity between media types on three semantic levels: global, entity and relation, which improves the retrieval effect. For use cases with non-aligned semantics, a better retrieval result can be obtained. For example, a picture associated with the text "two children play on the lawn" may contain a large area of blue sky, clouds or houses; these regions, unrelated to the semantics of the text, interfere with similarity detection in a retrieval system that works on a single semantic level. Here, the entity similarity module can detect objects such as the children and the lawn, and the relation similarity module can extract the "playing" feature, which greatly reduces the interference of irrelevant semantics and yields a good retrieval result.
In another embodiment, the neural network used for global feature extraction of the text media information shares a fully connected layer with the neural network used for global feature extraction of the non-text media information. Calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information then includes: fusing the global features of the text media information and the global features of the non-text media information through the shared fully connected layer to generate a subspace; and obtaining the global similarity of the text media information and the non-text media information according to the distance between the text media information and the non-text media information in the subspace.
Specifically, as shown in FIG. 6, a convolutional neural network (e.g., VGG19) is used to extract the image features and Char-CNN combined with LSTM is used to extract the text features; feature fusion is realized through the shared fully connected layer, generating a subspace, and the global similarity of the image and the text is obtained according to the distance between the samples in the subspace.
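A minimal sketch of this fusion, assuming cosine similarity in the subspace and illustrative dimensions (the patent fixes neither):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSubspace(nn.Module):
    """One fully connected layer shared by both media types maps the image
    and text global features into a common subspace; global similarity is
    then measured between the projected samples."""
    def __init__(self, feat_dim=512, sub_dim=256):
        super().__init__()
        self.shared_fc = nn.Linear(feat_dim, sub_dim)  # the shared layer

    def forward(self, img_feat, txt_feat):
        zi = F.normalize(self.shared_fc(img_feat), dim=-1)
        zt = F.normalize(self.shared_fc(txt_feat), dim=-1)
        return (zi * zt).sum(dim=-1)  # (B,) global similarity in the subspace
```

Sharing the projection weights across media types is what forces heterogeneous features into one comparable space, in contrast to the logistic-regression matching discussed next.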
In traditional cross-media retrieval, image features are extracted through a convolutional neural network and text features through Word2vec, and the association between the two heterogeneous features of image and text is then found using a logistic-regression-based semantic matching method to realize cross-media retrieval. However, feature fusion realized by logistic regression cannot map media samples with complex semantics well into a shared subspace, which affects retrieval performance. In the present application, feature fusion is realized through the shared fully connected layer to generate the subspace, and the global similarity of the image and the text is obtained according to the distance between the samples in that subspace, so that media samples of different types with complex semantics can be mapped well into the shared subspace, improving retrieval performance.
In another embodiment, the cross-media retrieval method further includes acquiring the number N of retrieval targets. The number N of retrieval targets is the number of desired instances similar to the media information to be retrieved. It can be input by the user at the terminal together with the media information to be retrieved, or can be a default setting of the retrieval engine.
Specifically, the user inputs the media information to be retrieved, such as an image, and the desired number N of similar cross-media instances, such as texts. Using the cross-media retrieval method, the N media items in the database most relevant to the user input are output through similarity detection.
In another embodiment, weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the similarity weighting result, includes: weighting the global similarity, the entity similarity and the relation similarity to obtain the overall similarity; sorting by similarity; and outputting the N pieces of cross-media information with the highest overall similarity as the retrieval targets. That is, the overall similarity measures the similarity between media types on three semantic levels, global, entity and relation, which further improves the retrieval effect.
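A minimal sketch of this final fusion and ranking step; the weights below are placeholders, whereas the patent obtains the actual weights through model training:

```python
import torch

def retrieve_top_n(global_sim, entity_sim, relation_sim, n, w=(0.5, 0.3, 0.2)):
    """Each similarity argument is an (M,) tensor scoring the query against
    the M items in the cross-media database; w holds assumed weights.
    Returns the indices of the N most similar items, i.e. the targets."""
    overall = w[0] * global_sim + w[1] * entity_sim + w[2] * relation_sim
    return overall.topk(min(n, overall.numel())).indices
```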
The cross-media retrieval method is described in detail below, taking the media information to be retrieved as an image and the cross-media information as text. As shown in FIG. 7, for an input image, the global features of the image are extracted; specifically, a convolutional neural network (e.g., VGG19) can be used to extract the image features, obtaining the global features of the image. For each text in the text database, Char-CNN combined with LSTM is used to extract the text features, obtaining the global features of the text. The image features extracted by the convolutional neural network and the text features extracted by Char-CNN combined with LSTM are fused through the shared fully connected layer, generating a subspace, and the global similarity of the image and the text is obtained according to the distance between the samples in the subspace.
For the input image, the entity extraction model adopts a convolutional neural network structure, such as a Fast R-CNN structure, to extract the entities in the image, obtaining the entity features of the image. Char-CNN combined with LSTM is used to extract the text features, obtaining an LSTM sequence of the text media information. After the LSTM sequence, the entity attention module uses an attention mechanism to add weight to the word vectors of the entity words in the text media information, obtaining the entity features of the text media information. For the entity features of the image and the entity features of the text, the entity-level similarity of the image and the text is obtained using the K-nearest-neighbor algorithm.
The entities extracted by the entity extraction model are combined, for example by connecting the entities, to obtain the relation features of the non-text media information. For the text media information, Char-CNN combined with LSTM is used to extract the text features, obtaining an LSTM sequence of the text media information. After the LSTM sequence, the relation attention module uses the attention mechanism to add weight to the word vectors of the relation words in the text media information, obtaining the relation features of the text media information. For the relation features of the image and the relation features of the text, the relation-level similarity of the image and the text is obtained using the K-nearest-neighbor algorithm.
The global similarity, the entity similarity and the relation similarity are weighted to obtain the overall similarity; the candidates are sorted by similarity, and the N pieces of text information with the highest overall similarity to the image are output as the retrieval targets. That is, the overall similarity measures the similarity between media types on three semantic levels, global, entity and relation, which further improves the retrieval effect.
It should be understood that, although the steps in the flowchart of FIG. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 8, a cross-media retrieval apparatus for obtaining a retrieval target from matching cross-media information in a cross-media information database is provided, including: a retrieval information acquisition module, a global feature acquisition module, a local feature acquisition module, a global similarity analysis module, a local similarity analysis module and a retrieval module, wherein:
a retrieval information acquisition module 801, configured to receive media information to be retrieved; the media type of one of the media information to be retrieved and the cross-media information is text; the cross-media information is information that belongs to a different media type from the media information to be retrieved.
a global feature acquisition module 802, configured to acquire global features of the text media information and global features of the non-text media information through a trained neural network.
a local feature acquisition module 803, configured to acquire the entities of the non-text media information, obtain the local features of the non-text media information according to the entities, and add weight to the preset category words according to the global features of the text media information to obtain the local features of the text media information.
a global similarity analysis module 804, configured to calculate the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information.
a local similarity analysis module 805, configured to calculate the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information.
a retrieval module 806, configured to weight the global similarity and the local similarity and obtain a retrieval target of the media information to be retrieved according to the similarity weighting result.
The cross-media retrieval device above not only considers the global similarity between the media information to be retrieved and the cross-media information, but also adds weight to the preset category words using the global features of the text media information, strengthening the semantic information of the text media information in the preset category words and using it as a local feature; the retrieval target is obtained according to the weighted combination of the global similarity and the local similarity. The device integrates the global similarity and the local similarity of the media information to be retrieved and the cross-media information, in particular the similarity in the semantic dimension of the preset category words, which alleviates the semantic misalignment between text media information and other media information and improves the retrieval effect.
In another embodiment, the local features include entity features. The local feature acquisition module includes:
a first local acquisition module, configured to perform target detection on the non-text media information using a trained entity extraction model and extract the entities in the non-text media information, obtaining the entity features of the non-text media information.
a second local acquisition module, configured to add weight to the word vectors of the entity words in the text media information according to the global features of the text media information, obtaining the entity features of the text media information.
In another embodiment, the local features include relation features. The local feature acquisition module includes:
a first local acquisition module, configured to perform target detection on the non-text media information using a trained entity extraction model, extract the entities in the non-text media information, and combine the extracted entities, obtaining the relation features of the non-text media information;
a second local acquisition module, configured to add weight to the word vectors of the relation words in the text media information according to the global features of the text media information, obtaining the relation features of the text media information.
In another embodiment, the local features include entity features and relation features. The local feature acquisition module includes:
and the first entity characteristic acquisition module is used for carrying out target detection on the non-character media information by utilizing the trained entity extraction model, extracting the entity in the non-character media information and obtaining the entity characteristic of the non-character media information.
And the first relation characteristic acquisition module is used for combining the extracted entities to obtain the relation characteristics of the non-text media information.
And the second entity characteristic acquisition module is used for increasing the weight of the word vector of the entity word in the character media information according to the global characteristic of the character media information to obtain the entity characteristic of the character media information.
And the second relational feature acquisition module is used for adding weight to the word vector of the relational word in the character media according to the global feature of the character media information to obtain the relational feature of the character media.
The local similarity analysis module includes:
an entity similarity analysis module, configured to calculate the entity similarity of the text media information and the non-text media information according to the entity features of the text media information and the entity features of the non-text media information.
a relation similarity analysis module, configured to calculate the relation similarity of the text media information and the non-text media information according to the relation features of the text media information and the relation features of the non-text media information.
In another embodiment, the global similarity analysis module is configured to fuse the global features of the text media information and the global features of the non-text media information through the shared fully connected layer, generate a subspace, and obtain the global similarity of the text media information and the non-text media information according to the distance between them in the subspace.
In another embodiment, the cross-media retrieval device further comprises a target data acquisition module configured to acquire the number N of retrieval targets.
The retrieval module includes:
a similarity weighting module, configured to weight the global similarity and the local similarity to obtain the overall similarity.
an output module, configured to sort by similarity and output the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
In another embodiment, the retrieval module includes:
a similarity weighting module, configured to weight the global similarity, the entity similarity and the relation similarity to obtain the overall similarity.
an output module, configured to sort by similarity and output the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
For specific limitations of the cross-media retrieval device, reference may be made to the above limitations of the cross-media retrieval method, which is not described herein again. The modules in the cross-media retrieval device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store cross-media information. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a cross-media retrieval method.
Those skilled in the art will appreciate that the architecture shown in FIG. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, performs the following steps to obtain a retrieval target from a cross-media information database:
receiving media information to be retrieved; the cross-media information is information of a different media type from the media information to be retrieved; the media type of one of the media information to be retrieved and the cross-media information is text;
acquiring global features of the text media information and global features of the non-text media information through trained neural networks;
acquiring entities of the non-text media information, obtaining local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain local features of the text media information;
calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information;
calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information;
and weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result.
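The steps above specify the weighted fusion of global and local similarity only at the level of method steps. Purely as an illustration, and not as part of the claimed method, the following minimal Python sketch shows one way such a retrieval could be realized, assuming precomputed feature vectors and cosine similarity; the names retrieve, db and alpha, and the choice of cosine similarity, are assumptions made for the sketch rather than details from the patent.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query_global, query_local, db, alpha=0.6, n=5):
    """Rank cross-media items by a weighted sum of global and local similarity.

    db is a list of (item_id, global_feat, local_feat) tuples; alpha is the
    (assumed) weight of the global similarity against the local similarity.
    """
    scored = []
    for item_id, g, l in db:
        overall = alpha * cosine(query_global, g) + (1 - alpha) * cosine(query_local, l)
        scored.append((overall, item_id))
    scored.sort(reverse=True)                       # highest overall similarity first
    return [item_id for _, item_id in scored[:n]]   # top-N retrieval targets
```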
In one embodiment, the local features include entity features, and acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information includes:
performing target detection on the non-text media information using a trained entity extraction model, and extracting the entities in the non-text media information to obtain the entity features of the non-text media information;
and weighting the word vectors of the entity words in the text media information according to the global features of the text media information to obtain the entity features of the text media information.
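The embodiment does not fix how the weights derived from the global text features are computed. As a hedged sketch only, the following Python function weights the word vectors of the entity words by a softmax over their affinity to the global text feature; the softmax weighting and the function name weight_entity_vectors are assumptions, not the patent's prescribed mechanism.

```python
import numpy as np

def weight_entity_vectors(global_text_feat, entity_word_vecs):
    """Weight entity word vectors by their affinity to the global text feature.

    entity_word_vecs: (k, d) array, one word vector per entity word;
    global_text_feat: (d,) global feature of the text media information.
    The softmax scores act as the added weights; the weighted vectors are
    used as the entity features of the text media information.
    """
    scores = entity_word_vecs @ global_text_feat   # (k,) affinity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax normalization
    return entity_word_vecs * weights[:, None]     # weighted entity features
```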
In another embodiment, the local features include relationship features, and acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information includes:
performing target detection on the non-text media information using a trained entity extraction model, extracting the entities in the non-text media information, and combining the extracted entities to obtain the relationship features of the non-text media information;
and weighting the word vectors of the relation words in the text media information according to the global features of the text media information to obtain the relationship features of the text media information.
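The patent states only that the extracted entities are combined; it does not specify the combination. One plausible reading, shown below as an assumption, combines every unordered pair of detected entities by concatenating their feature vectors; both the pairwise scheme and concatenation are illustrative choices.

```python
import numpy as np
from itertools import combinations

def relation_features(entity_feats):
    """Combine detected entities pairwise into relationship features.

    entity_feats: (k, d) array of entity features from target detection.
    Each unordered pair is concatenated, giving (k*(k-1)/2, 2d) relation
    features; an empty array is returned when fewer than two entities exist.
    """
    pairs = [np.concatenate([entity_feats[i], entity_feats[j]])
             for i, j in combinations(range(len(entity_feats)), 2)]
    if not pairs:
        return np.empty((0, 2 * entity_feats.shape[1]))
    return np.stack(pairs)
```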
In another embodiment, the local features include entity features and relationship features; acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information includes:
performing target detection on the non-text media information using a trained entity extraction model, and extracting the entities in the non-text media information to obtain the entity features of the non-text media information;
combining the extracted entities to obtain the relationship features of the non-text media information;
weighting the word vectors of the entity words in the text media information according to the global features of the text media information to obtain the entity features of the text media information;
and weighting the word vectors of the relation words in the text media information according to the global features of the text media information to obtain the relationship features of the text media information.
Calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information includes:
calculating the entity similarity of the text media information and the non-text media information according to the entity features of the text media information and the entity features of the non-text media information;
and calculating the relationship similarity of the text media information and the non-text media information according to the relationship features of the text media information and the relationship features of the non-text media information; the local similarity includes the entity similarity and the relationship similarity.
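Entity similarity and relationship similarity each compare a set of local features on one side with a set on the other. How the set-to-set comparison is aggregated is not fixed by the embodiment; the sketch below, offered only as one assumption, takes the best cosine match for each feature and averages the best matches.

```python
import numpy as np

def set_similarity(feats_a, feats_b):
    """Similarity between two sets of local features (entities or relations).

    feats_a: (m, d) array, feats_b: (n, d) array. For each feature in set A,
    take its best cosine match in set B, then average the best matches. The
    same routine can serve for entity similarity and relationship similarity.
    """
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + 1e-8)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                           # (m, n) pairwise cosine similarities
    return float(sim.max(axis=1).mean())    # best match per row, averaged

# Illustrative use with hypothetical feature arrays:
# entity_sim = set_similarity(text_entity_feats, image_entity_feats)
# relation_sim = set_similarity(text_relation_feats, image_relation_feats)
```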
In another embodiment, the neural network for extracting global features of the text media information shares a fully connected layer with the neural network for extracting global features of the non-text media information;
calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information includes:
fusing the global features of the text media information and the global features of the non-text media information through the shared fully connected layer to generate a common subspace;
and obtaining the global similarity of the text media information and the non-text media information according to the distance between the text media information and the non-text media information in the subspace.
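A shared fully connected layer of this kind can be sketched concretely. The PyTorch module below is a minimal illustration under stated assumptions: the modality-specific input layers, the hidden dimension of 256, and the conversion of Euclidean distance to a similarity via 1/(1 + d) are all choices made for the sketch, not details fixed by the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSubspace(nn.Module):
    """Project text and image global features through one shared fully
    connected layer into a common subspace, then score by distance."""

    def __init__(self, text_dim, image_dim, shared_dim=256):
        super().__init__()
        self.text_in = nn.Linear(text_dim, shared_dim)      # modality-specific input
        self.image_in = nn.Linear(image_dim, shared_dim)
        self.shared_fc = nn.Linear(shared_dim, shared_dim)  # weights shared by both modalities

    def forward(self, text_feat, image_feat):
        t = self.shared_fc(F.relu(self.text_in(text_feat)))
        v = self.shared_fc(F.relu(self.image_in(image_feat)))
        # Closer pairs in the subspace are more similar, so map the
        # Euclidean distance to a similarity in (0, 1].
        dist = F.pairwise_distance(t, v)
        return 1.0 / (1.0 + dist)
```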
In another embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring the number N of retrieval targets;
Weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result, includes:
weighting the global similarity and the local similarity to obtain the overall similarity;
and sorting by overall similarity, and outputting the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
In another embodiment, weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result, includes:
weighting the global similarity, the entity similarity and the relationship similarity to obtain the overall similarity;
and sorting by overall similarity, and outputting the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
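Given per-item global, entity and relationship similarities over the database, the overall similarity and the top-N selection reduce to a weighted sum and a sort. The sketch below assumes the three similarities are already computed as arrays; the weights (0.5, 0.3, 0.2) are placeholders, since the patent leaves the weighting coefficients open.

```python
import numpy as np

def rank_targets(global_sim, entity_sim, relation_sim, n, weights=(0.5, 0.3, 0.2)):
    """Weighted overall similarity over the database, then the N best items.

    Each *_sim argument is a (num_items,) array of similarities between the
    media information to be retrieved and each cross-media item.
    """
    w_g, w_e, w_r = weights
    overall = (w_g * np.asarray(global_sim)
               + w_e * np.asarray(entity_sim)
               + w_r * np.asarray(relation_sim))
    order = np.argsort(overall)[::-1]   # indices sorted by descending overall similarity
    return order[:n].tolist()           # N retrieval targets with the highest similarity
```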
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, causes the processor to perform the following steps to obtain a retrieval target from a cross-media information database:
receiving media information to be retrieved; the cross-media information is information of a different media type from the media information to be retrieved; the media type of one of the media information to be retrieved and the cross-media information is text;
acquiring global features of the text media information and global features of the non-text media information through trained neural networks;
acquiring entities of the non-text media information, obtaining local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain local features of the text media information;
calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information;
calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information;
and weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result.
In one embodiment, the local features include entity features, and acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information includes:
performing target detection on the non-text media information using a trained entity extraction model, and extracting the entities in the non-text media information to obtain the entity features of the non-text media information;
and weighting the word vectors of the entity words in the text media information according to the global features of the text media information to obtain the entity features of the text media information.
In another embodiment, the local features include relationship features, and acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information includes:
performing target detection on the non-text media information using a trained entity extraction model, extracting the entities in the non-text media information, and combining the extracted entities to obtain the relationship features of the non-text media information;
and weighting the word vectors of the relation words in the text media information according to the global features of the text media information to obtain the relationship features of the text media information.
In another embodiment, the local features include entity features and relationship features; acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information includes:
performing target detection on the non-text media information using a trained entity extraction model, and extracting the entities in the non-text media information to obtain the entity features of the non-text media information;
combining the extracted entities to obtain the relationship features of the non-text media information;
weighting the word vectors of the entity words in the text media information according to the global features of the text media information to obtain the entity features of the text media information;
and weighting the word vectors of the relation words in the text media information according to the global features of the text media information to obtain the relationship features of the text media information.
Calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information includes:
calculating the entity similarity of the text media information and the non-text media information according to the entity features of the text media information and the entity features of the non-text media information;
and calculating the relationship similarity of the text media information and the non-text media information according to the relationship features of the text media information and the relationship features of the non-text media information; the local similarity includes the entity similarity and the relationship similarity.
In another embodiment, the neural network for extracting global features of the text media information shares a fully connected layer with the neural network for extracting global features of the non-text media information;
calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information includes:
fusing the global features of the text media information and the global features of the non-text media information through the shared fully connected layer to generate a common subspace;
and obtaining the global similarity of the text media information and the non-text media information according to the distance between the text media information and the non-text media information in the subspace.
In another embodiment, the computer program, when executed by the processor, further performs the steps of:
acquiring the number N of retrieval targets;
weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result, includes:
weighting the global similarity and the local similarity to obtain the overall similarity;
and sorting by overall similarity, and outputting the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
In another embodiment, weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result, includes:
weighting the global similarity, the entity similarity and the relationship similarity to obtain the overall similarity;
and sorting by overall similarity, and outputting the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
Those skilled in the art will understand that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus DRAM (RDRAM) and direct Rambus DRAM (DRDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features of the above embodiments is described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A cross-media retrieval method for obtaining a retrieval target from matching cross-media information in a cross-media information database, the method comprising:
receiving media information to be retrieved; the cross-media information is information of a different media type from the media information to be retrieved; the media type of one of the media information to be retrieved and the cross-media information is text;
acquiring global features of the text media information and global features of the non-text media information through trained neural networks; the text media information is the information, among the media information to be retrieved and the cross-media information, whose media type is text; the non-text media information is the information, among the media information to be retrieved and the cross-media information, whose media type is not text;
acquiring entities of the non-text media information, obtaining local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain local features of the text media information;
calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information;
calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information;
and weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result.
2. The method of claim 1, wherein the local features comprise entity features, and acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information comprises:
performing target detection on the non-text media information using a trained entity extraction model, and extracting the entities in the non-text media information to obtain the entity features of the non-text media information;
and weighting the word vectors of the entity words in the text media information according to the global features of the text media information to obtain the entity features of the text media information.
3. The method of claim 1, wherein the local features comprise relationship features, and acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information comprises:
performing target detection on the non-text media information using a trained entity extraction model, extracting the entities in the non-text media information, and combining the extracted entities to obtain the relationship features of the non-text media information;
and weighting the word vectors of the relation words in the text media information according to the global features of the text media information to obtain the relationship features of the text media information.
4. The method of claim 1, wherein the local features comprise entity features and relationship features; and acquiring the entities of the non-text media information, obtaining the local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain the local features of the text media information comprises:
performing target detection on the non-text media information using a trained entity extraction model, and extracting the entities in the non-text media information to obtain the entity features of the non-text media information;
combining the extracted entities to obtain the relationship features of the non-text media information;
weighting the word vectors of the entity words in the text media information according to the global features of the text media information to obtain the entity features of the text media information;
and weighting the word vectors of the relation words in the text media information according to the global features of the text media information to obtain the relationship features of the text media information;
wherein calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information comprises:
calculating the entity similarity of the text media information and the non-text media information according to the entity features of the text media information and the entity features of the non-text media information;
and calculating the relationship similarity of the text media information and the non-text media information according to the relationship features of the text media information and the relationship features of the non-text media information; the local similarity comprises the entity similarity and the relationship similarity.
5. The method of claim 1, wherein the neural network for extracting global features of the text media information shares a fully connected layer with the neural network for extracting global features of the non-text media information;
and calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information comprises:
fusing the global features of the text media information and the global features of the non-text media information through the shared fully connected layer to generate a common subspace;
and obtaining the global similarity of the text media information and the non-text media information according to the distance between the text media information and the non-text media information in the subspace.
6. The method of claim 1, further comprising:
acquiring the number N of retrieval targets;
weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result, comprises:
weighting the global similarity and the local similarity to obtain the overall similarity;
and sorting by overall similarity, and outputting the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
7. The method according to claim 4, wherein weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result, comprises:
weighting the global similarity, the entity similarity and the relationship similarity to obtain the overall similarity;
and sorting by overall similarity, and outputting the N pieces of cross-media information with the highest overall similarity as the retrieval targets.
8. A cross-media retrieval apparatus for obtaining a retrieval target from matching cross-media information in a cross-media information database, the apparatus comprising:
a retrieval information acquisition module, used for receiving the media information to be retrieved; the media type of one of the media information to be retrieved and the cross-media information is text; the cross-media information is information of a different media type from the media information to be retrieved;
a global feature acquisition module, used for acquiring global features of the text media information and global features of the non-text media information through trained neural networks; the text media information is the information, among the media information to be retrieved and the cross-media information, whose media type is text; the non-text media information is the information, among the media information to be retrieved and the cross-media information, whose media type is not text;
a local feature acquisition module, used for acquiring entities of the non-text media information, obtaining local features of the non-text media information according to the entities, and weighting preset category words according to the global features of the text media information to obtain local features of the text media information;
a global similarity analysis module, used for calculating the global similarity of the text media information and the non-text media information according to the global features of the text media information and the global features of the non-text media information;
a local similarity analysis module, used for calculating the local similarity of the text media information and the non-text media information according to the local features of the text media information and the local features of the non-text media information;
and a retrieval module, used for weighting the global similarity and the local similarity, and obtaining the retrieval target of the media information to be retrieved according to the weighted similarity result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201910848286.1A 2019-09-09 2019-09-09 Cross-media retrieval method and device, computer equipment and storage medium Pending CN110765286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910848286.1A CN110765286A (en) 2019-09-09 2019-09-09 Cross-media retrieval method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110765286A true CN110765286A (en) 2020-02-07

Family

ID=69329594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910848286.1A Pending CN110765286A (en) 2019-09-09 2019-09-09 Cross-media retrieval method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110765286A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783657A (en) * 2019-01-07 2019-05-21 北京大学深圳研究生院 Multistep based on limited text space is from attention cross-media retrieval method and system
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528127A (en) * 2020-05-30 2021-03-19 山东工商学院 Big data-based plane design work matching degree analysis system
CN111813967A (en) * 2020-07-14 2020-10-23 中国科学技术信息研究所 Retrieval method, retrieval device, computer equipment and storage medium
CN111813967B (en) * 2020-07-14 2024-01-30 中国科学技术信息研究所 Retrieval method, retrieval device, computer equipment and storage medium
CN111914921A (en) * 2020-07-24 2020-11-10 山东工商学院 Similarity image retrieval method and system based on multi-feature fusion
CN111783903A (en) * 2020-08-05 2020-10-16 腾讯科技(深圳)有限公司 Text processing method, text model processing method and device and computer equipment
CN111783903B (en) * 2020-08-05 2023-11-28 腾讯科技(深圳)有限公司 Text processing method, text model processing method and device and computer equipment
CN112000822A (en) * 2020-08-21 2020-11-27 北京达佳互联信息技术有限公司 Multimedia resource sequencing method and device, electronic equipment and storage medium
CN112000822B (en) * 2020-08-21 2024-05-14 北京达佳互联信息技术有限公司 Method and device for ordering multimedia resources, electronic equipment and storage medium
CN112487239A (en) * 2020-11-27 2021-03-12 北京百度网讯科技有限公司 Video retrieval method, model training method, device, equipment and storage medium
CN112487239B (en) * 2020-11-27 2024-04-05 北京百度网讯科技有限公司 Video retrieval method, model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN109783655B (en) Cross-modal retrieval method and device, computer equipment and storage medium
CN110765286A (en) Cross-media retrieval method and device, computer equipment and storage medium
WO2019136993A1 (en) Text similarity calculation method and device, computer apparatus, and storage medium
US8232996B2 (en) Image learning, automatic annotation, retrieval method, and device
US8499008B2 (en) Mixing knowledge sources with auto learning for improved entity extraction
CN110598048B (en) Video retrieval method and video retrieval mapping relation generation method and device
CN109446407A (en) Correlation recommendation method, apparatus, computer equipment and storage medium
CN109871490B (en) Media resource matching method and device, storage medium and computer equipment
CN112732883A (en) Fuzzy matching method and device based on knowledge graph and computer equipment
CN113590854B (en) Data processing method, data processing equipment and computer readable storage medium
CN109885773A (en) A kind of article personalized recommendation method, system, medium and equipment
CN110880006B (en) User classification method, apparatus, computer device and storage medium
CN112733645B (en) Handwritten signature verification method, handwritten signature verification device, computer equipment and storage medium
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
Wang et al. Aspect-ratio-preserving multi-patch image aesthetics score prediction
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN110895656A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN112434158A (en) Enterprise label acquisition method and device, storage medium and computer equipment
CN113761124A (en) Training method of text coding model, information retrieval method and equipment
Qi et al. Correlative multilabel video annotation with temporal kernels
CN110162689B (en) Information pushing method, device, computer equipment and storage medium
CN113569094A (en) Video recommendation method and device, electronic equipment and storage medium
CN111091198A (en) Data processing method and device
CN114511024B (en) Node classification method, apparatus, device, medium and computer program product

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200207)