CN113987119A - Data retrieval method, cross-modal data matching model processing method and device

Info

Publication number: CN113987119A
Application number: CN202111166923.0A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: features, video, data, image semantic, image
Legal status: Pending
Inventors: 方晟, 刘梦怡, 王树徽, 卓君宝, 黄庆明, 何源, 薛晖
Current and original assignee: Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd; priority to CN202111166923.0A; publication of CN113987119A

Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06F16/367 Creation of semantic tools, e.g. ontology or thesauri; Ontology
    • G06F16/7847 Retrieval of video data using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F16/7857 Retrieval of video data using low-level visual features of the video content, using texture
    • G06F18/24 Pattern recognition; Classification techniques
    • G06F18/253 Pattern recognition; Fusion techniques of extracted features
    • G06N3/044 Neural networks; Recurrent networks, e.g. Hopfield networks

Abstract

The embodiments of this application disclose a data retrieval method, a cross-modal data matching model processing method, and a corresponding device. In cross-modal retrieval, two parts of feature data are constructed for video data: one part is the image content features of the video frames, and the other part is image semantic features corresponding to the multiple classification labels of a classification prediction, so that the video features of the video data carry both content features representing global information and fine-grained semantic features, and the video data can be represented more accurately. To represent the video with semantic features of more dimensions, the method can also expand on the existing initial image semantic features and collect extended image semantic features that are semantically associated with them.

Description

Data retrieval method, cross-modal data matching model processing method and device
Technical Field
The application relates to the technical field of data processing, in particular to a data retrieval method and device, a processing method and device of a cross-modal data matching model, computer equipment and a computer readable storage medium.
Background
With the rapid development of social media, short videos have gradually become the mainstream information that the public browses, and how to effectively establish bidirectional retrieval between video and text has become an important research field.
Most recent mainstream video-text retrieval is based on a hidden-space (latent-space) scheme: the video and the text are each mapped into a shared common space so that their semantic features are aligned, and the feature similarity between the video and the text is then computed for matching and retrieval.
Video is a very rich modality; how to extract the information it carries is key to matching and retrieval, and the accuracy of current retrieval results still needs improvement.
Disclosure of Invention
In view of the above, the present application is proposed to provide a data retrieval method, a cross-modal data matching model processing method, and a computer device and computer-readable storage medium that overcome, or at least partially solve, the above problems.
According to an aspect of the present application, there is provided a data retrieval method including:
receiving a retrieval request based on video data;
extracting video features of the video data; the video features comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels used in classification prediction, the response values of the initial image semantic features are obtained after classification prediction is performed on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response value of at least one associated initial image semantic feature;
retrieving text data having text features matching video features of the video data;
providing the text data as a retrieval result.
According to another aspect of the present application, there is provided a data retrieval method including:
receiving a retrieval request based on first modality data;
searching second modality data with data characteristics matched with the data characteristics of the first modality data;
providing the second modality data as a retrieval result;
the first modality data or the second modality data comprise video data, the video features of the video data comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels of classification prediction, response values of the initial image semantic features are obtained after classification prediction is carried out on the image content features, and the response values corresponding to the extended image semantic features are determined on the basis of the response values of at least one initial image semantic feature having relevance.
According to another aspect of the present application, there is provided a data retrieval method including:
receiving a retrieval request based on text data;
retrieving video data with video features matched with the text features of the text data; the video features comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels used in classification prediction, the response values of the initial image semantic features are obtained after classification prediction is performed on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response value of at least one associated initial image semantic feature;
providing the video data as a retrieval result.
According to another aspect of the present application, there is provided a processing method for matching a model across modal data, including:
collecting a plurality of sample pairs, the sample pairs comprising video data samples and corresponding matched text data samples;
extracting video features of the video data samples and text features of the text data samples; the video features comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels used in classification prediction, the response values of the initial image semantic features are obtained after classification prediction is performed on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response value of at least one associated initial image semantic feature;
and training a cross-mode data matching model for searching matched text data based on the video data or searching matched video data based on the text data according to the video characteristics of the video data sample and the text characteristics of the text data sample.
According to another aspect of the present application, there is provided a video-based data processing method, comprising:
acquiring video data to be processed;
extracting video features of the video data; the video features comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels used in classification prediction, the response values of the initial image semantic features are obtained after classification prediction is performed on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response value of at least one associated initial image semantic feature;
and executing a data processing flow based on the video characteristics of the video data.
In accordance with another aspect of the present application, there is provided an electronic device, comprising: a processor; and
a memory having executable code stored thereon, which when executed, causes the processor to perform a method as in any one of the above.
According to another aspect of the application, there is provided one or more machine-readable media having stored thereon executable code that, when executed, causes a processor to perform a method as any one of the above.
According to the embodiments of this application, in cross-modal retrieval two parts of feature data are constructed for video data: one part is the image content features of the video frames, and the other part is image semantic features corresponding to the multiple classification labels of a classification prediction, so that the video features of the video data carry both content features representing global information and fine-grained semantic features, and the video data can be represented more accurately. To represent the video with semantic features of more dimensions, the method can also expand on the existing initial image semantic features and collect extended image semantic features that are semantically associated with them. By extracting comprehensive and rich features from the video data for video representation, the semantic gap between different modalities is effectively bridged, the performance of the cross-modal video retrieval system is remarkably improved, and the accuracy of video-text matching results can be improved, which helps obtain more accurate retrieval results; this effect has also been verified experimentally.
In addition, image content features and image semantic features are constructed for the video data, text semantic features are constructed for the text data, and the video data and the text data are then mapped into a common content space and a common semantic space, i.e., different multi-dimensional feature expressions are adopted. The feature encoding processes of the two modalities are thereby decoupled without affecting processing efficiency, which avoids the information loss and inaccurate feature representation caused by feature alignment in hidden-space schemes and further improves the accuracy of video-text matching.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a specific example of a data retrieval method of the present application;
FIG. 2 is a flow chart of a data retrieval method according to a first embodiment of the present application;
FIG. 3 is a flow chart of a data retrieval method according to the second embodiment of the present application;
FIG. 4 is a flow chart of a data retrieval method according to the third embodiment of the present application;
FIG. 5 is a flow chart of a processing method for cross-modal data matching model according to the fourth embodiment of the present application;
fig. 6 is a block diagram showing a data retrieval apparatus according to a fifth embodiment of the present application;
fig. 7 is a block diagram of a data retrieval apparatus according to a sixth embodiment of the present application;
fig. 8 is a block diagram showing a structure of a data retrieval apparatus according to a seventh embodiment of the present application;
fig. 9 is a block diagram illustrating a processing apparatus for matching a model across modal data according to an eighth embodiment of the present application;
fig. 10 illustrates an exemplary system that can be used to implement various embodiments described in this disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The scheme of the embodiment of the application is provided for more accurately performing characteristic characterization on videos and further improving the accuracy of retrieval results in the existing video text matching application.
A feature is feature information extracted from data or from content related to the data, and is used to characterize that data. In the embodiments of this application, two parts of feature data are constructed for video data. One part is the image content features of the video frames, which are extracted directly from the video frame images included in the video and represent the global information of the images. For example, SIFT (Scale-Invariant Feature Transform) extracts local features of an image by detecting key points in the image; the Histogram of Oriented Gradients (HOG) feature is constructed by computing and accumulating histograms of gradient directions over local regions of an image; and the Local Binary Pattern (LBP) feature describes the local texture of an image and has notable advantages such as rotation invariance and gray-scale invariance. In practical applications, one or more image feature extraction algorithms can be selected as required to extract multi-dimensional image content features. An applicable deep learning model related to image analysis can also be selected, and the feature data input to the stage where the model makes its decision from image features can be extracted as the image content features. For example, for an image classification model, the feature data input before its classification layer (e.g., the FC classification layer of a ResNet deep residual network) may be extracted.
Another part of the feature data of the video data is image semantic features, and the image semantic features comprise initial image semantic features and extended image semantic features.
The initial image semantic features correspond to the multiple classification labels of a classification prediction, i.e., the classification labels are extracted from a classification prediction model; a single dimension of the initial image semantic features corresponds to one classification label of the classification prediction, and the response value of that single dimension, i.e., the response value under that classification label, can be regarded as a classification probability that has not been normalized. The extended image semantic features are semantic features obtained through semantic expansion that are related to the initial image semantic features; words related to the initial image semantic features can be extracted from a semantic network to obtain the extended image semantic features.
Compared with the image content features, the image semantic features express the image at a finer granularity. In a specific implementation, the feature data input before the classification layer of the image classification model can be extracted as the image content features, and the image semantic features can be extracted from the classification result data output by the classification layer.
The image classification model may be any suitable model that includes an image classification stage; it may be, for example, a neural network model that performs various learning tasks based on image analysis, such as a ResNet deep residual network or a ResNeXt model. Image content features or image semantic features obtained from different networks can also be concatenated, respectively, to obtain video features represented in more dimensions.
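As an illustration of this two-part feature extraction, the following is a minimal sketch that uses a torchvision ResNet-50 as an assumed stand-in for the image classification model; the model choice, preprocessing, and dimensions are illustrative. The output of the layer before the FC classification layer serves as the image content feature, and the un-normalized outputs of the FC layer serve as the response values of the initial image semantic features.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Assumed stand-in for the image classification model: an ImageNet ResNet-50.
resnet = models.resnet50(pretrained=True).eval()  # or weights="IMAGENET1K_V1" on newer torchvision

# Everything up to (but not including) the FC classification layer yields the content feature.
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_frame_features(frame: Image.Image):
    x = preprocess(frame).unsqueeze(0)          # [1, 3, 224, 224]
    with torch.no_grad():
        content = backbone(x).flatten(1)        # [1, 2048] image content feature
        logits = resnet.fc(content)             # [1, 1000] un-normalized class responses
    # Each logit is the response value of one initial image semantic feature,
    # i.e. the un-normalized classification probability under one label.
    return content.squeeze(0), logits.squeeze(0)
```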
The image content features and the image semantic features together form the representation information of the video frames in the video, so that the video features of the video data include both content features representing global information and fine-grained semantic information; the video data can thus be represented more accurately, and a better matching result can be obtained when it is used for video-text matching.
In order to more accurately represent the video by adopting semantic features with more dimensions, the image semantic features extracted from the classification prediction model can be used as initial image semantic features, expansion is carried out based on the existing initial image semantic features, other features which are semantically related to the image semantic features are further searched as expanded image semantic features, and the feature dimensions of the video are enriched. By comprehensively and abundantly extracting the characteristics of the video data and performing video representation, semantic gaps among different modes are effectively bridged, the performance of a cross-mode video retrieval system is obviously improved, the accuracy of a video text matching result can be improved, and a more accurate retrieval result is obtained.
The extended image semantic features associated with the initial image semantic features can be extracted from a semantic network (such as ConceptNet) or a semantic database. A semantic network is a directed graph that expresses knowledge through entities and their semantic relations: nodes represent entities of various kinds such as things, concepts, conditions, attributes, states, events and actions, and the connecting lines between nodes represent the semantic relations between words.
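A sketch of this expansion step follows; the public ConceptNet endpoint and the simple English-only filter below are assumptions for illustration, and any semantic network or database with term-level relations could be substituted.

```python
import requests

def expand_concept(term: str, limit: int = 20) -> set:
    """Collect terms semantically related to one classification label from ConceptNet.

    The public endpoint and the relation filtering here are assumptions for
    illustration; any semantic network with term relations would do."""
    url = "http://api.conceptnet.io/c/en/" + term.lower().replace(" ", "_")
    edges = requests.get(url, params={"limit": limit}).json().get("edges", [])
    related = set()
    for edge in edges:
        for node in (edge.get("start", {}), edge.get("end", {})):
            label = node.get("label", "")
            if node.get("language") == "en" and label.lower() != term.lower():
                related.add(label)
    return related

# The extended image semantic features are the union of related terms over all
# initial classification labels, with duplicates and the labels themselves removed.
```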
And further, a response value needs to be given to the semantic features of the extended image, and the embodiment of the application innovatively provides that the response value of the semantic features of the extended image is derived based on the response value of at least one associated initial image semantic feature, so that the response value of the semantic features of the extended image has reliable theoretical support.
When determining the response values of the extended image semantic features based on the response values that the video data has for the initial image semantic features, a knowledge graph can be constructed from the initial image semantic features and the extended image semantic features, and the response values of the initial image semantic features are propagated over the knowledge graph by means of attention-based graph reasoning. In the knowledge graph, the image semantic features serve as nodes and the semantic relation data between image semantic features serve as the connecting edges between nodes; the initial image semantic features, the extended image semantic features and the semantic relation data between them are obtained from the semantic network.
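A minimal sketch of such a knowledge graph follows, assuming the relations have already been collected from the semantic network; the node and attribute names are hypothetical.

```python
import networkx as nx

def build_concept_graph(initial_labels, expanded_labels, relations):
    """Knowledge graph over image semantic features (a sketch; attribute names are
    illustrative). `relations` is assumed to be (term_a, term_b, relation) triples
    taken from the semantic network."""
    graph = nx.Graph()
    for label in initial_labels:
        graph.add_node(label, initial=True, response=0.0)   # response later filled from classifier logits
    for label in expanded_labels:
        graph.add_node(label, initial=False, response=0.0)  # extended nodes start at 0
    for a, b, rel in relations:
        if a in graph and b in graph:
            graph.add_edge(a, b, relation=rel)               # semantic relation as the connecting edge
    return graph
```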
Further, aiming at each image semantic feature in the knowledge graph, updating a response value of the image semantic feature according to a response value of at least one associated other image semantic feature, namely, aiming at the initial image semantic feature or the expanded image semantic feature, searching other image semantic features related to the image semantic feature, and further updating the response value of the image semantic feature according to the response values of the other image semantic features.
Since the initial image semantic features already carry response values while the extended image semantic features start from an initial value of 0, a single pass of propagation is insufficient, so the response values of the image semantic features in the knowledge graph can be updated iteratively. Considering that too many passes may introduce excessive noise, the number of iterations can be set according to actual requirements. When the response value of an initial image semantic feature is updated from the response values of the other image semantic features associated with it, the computed result differs from its original response value. Therefore, to retain part of the original response and to control the update through a gating mechanism, an update coefficient is set for the response value, so that the initial image semantic features of the video data sample keep part of their original response values and the degree of updating is controlled.
When the response value is controlled through a gating mechanism, the update coefficient of node i in the gate is denoted β_i, with

β_i = sigmoid(b^T [W_e e_i || W_f f])

(the gated update equation combining the old and new responses is given as a formula image in the original filing). Here b and W_e are parameters of the attention mechanism, f is the video-level feature, h'_i is the updated response value of node i, and h_i is the response value of node i before the update.
When the response value of the current image semantic feature is updated according to the response values of the other associated image semantic features, the response values of the other associated image semantic features may be weighted-averaged, may be directly added, or may be directly averaged. Wherein some weak supervision may be provided in order to improve the accuracy of the results.
Specifically, the image semantic features that appear in the textual description information associated with a video data sample can be found and denoted S_0; the nodes other than S_0 are grouped into sets S_1, S_2, …, which respectively contain the nodes at one-hop and two-hop distance from S_0. A binary cross-entropy (BCE) loss, denoted the propagated BCE loss, is configured in the iterative update of the graph (its formula is given as formula images in the original filing); in it, y_i = 1 if image semantic feature i appears in the corresponding text description and y_i = 0 otherwise, the response of node i after the sigmoid is used as the prediction, and γ is an attenuation coefficient.
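Because the propagated BCE loss itself is only given as formula images in the original filing, the following is one plausible reading under the description above, assuming the per-node BCE terms of nodes k hops away from S_0 are attenuated by γ^k; it is a sketch, not the exact loss.

```python
import torch
import torch.nn.functional as F

def propagated_bce_loss(responses, labels, hop_sets, gamma=0.5):
    """One plausible reading of the propagated BCE loss (an assumption, not the
    exact formula from the filing): nodes k hops away from the text-grounded
    set S_0 contribute their binary cross-entropy attenuated by gamma**k.

    responses: tensor [N] of pre-sigmoid node responses h_i
    labels:    tensor [N] with y_i = 1 if feature i appears in the text description
    hop_sets:  list of index tensors [S_0, S_1, S_2, ...] grouped by hop distance
    """
    loss = responses.new_zeros(())
    for k, idx in enumerate(hop_sets):
        if idx.numel() == 0:
            continue
        loss = loss + (gamma ** k) * F.binary_cross_entropy_with_logits(
            responses[idx], labels[idx].float())
    return loss
```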
When predicting the response values of the extended image semantic features from the response values of the initial image semantic features, a GAT (Graph Attention Network) model may be used to implement the transition matrix for the response values; GAT is one kind of graph convolutional network (GCN). The propagation and aggregation of information in the graph may also be performed with other GCNs, such as GraphSAGE (a network structure that performs convolution operations on a graph); that is, a node aggregates the information of its neighboring nodes through the graph convolutional network.
An example of predicting the response values of the extended image semantic features from the response values of the initial image semantic features is given as follows. Taking one image as an example, the image corresponds to 2048-dimensional image content features and 1000-dimensional initial image semantic features; after 2000 extended image semantic features associated with the initial image semantic features are found and duplicate features are removed, 2400-dimensional image semantic features are constructed. The response values of the image semantic features are expressed as r = {h_1, h_2, …, h_N}, h_i ∈ R, where h_i denotes the response value of the i-th node in the knowledge graph and N is the total number of initial and extended image semantic features; the extended image semantic features take a response value of 0 as their initial value, and the image semantic features themselves are expressed as feature vectors {e_1, e_2, …, e_N}.
The response value of a node is obtained from the transfer coefficients and the response values of the associated initial image semantic features. The transfer coefficient between nodes i and j takes the standard graph-attention form

α_ij = softmax_j(LeakyReLU(a^T [W e_i || W e_j])), j ∈ N_i,

where a and W are learnable parameters of the attention mechanism, || denotes the concatenation (splicing) operation, and N_i is the set of neighbor nodes of node i.
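Putting the pieces above together, the following is a minimal sketch of one attention-based propagation step with the gated update; how β mixes the old and aggregated responses, as well as the dimensions and layer shapes, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConceptPropagation(nn.Module):
    """One attention-based propagation step over the concept graph with a gated
    update (a sketch following the description above, not the exact layer from
    the filing; the way beta mixes old and aggregated responses is an assumption)."""
    def __init__(self, feat_dim, hidden_dim, video_dim):
        super().__init__()
        self.W = nn.Linear(feat_dim, hidden_dim, bias=False)     # shared projection W in the attention
        self.a = nn.Linear(2 * hidden_dim, 1, bias=False)        # attention vector a
        self.W_e = nn.Linear(feat_dim, hidden_dim, bias=False)   # gate branch for node features e_i
        self.W_f = nn.Linear(video_dim, hidden_dim, bias=False)  # gate branch for the video feature f
        self.b = nn.Linear(2 * hidden_dim, 1, bias=False)        # gate parameter b

    def forward(self, e, h, adj, f):
        # e: [N, feat_dim] node features, h: [N] response values,
        # adj: [N, N] adjacency mask, f: [video_dim] video-level feature.
        We = self.W(e)                                            # [N, hidden]
        n = We.size(0)
        pair = torch.cat([We.unsqueeze(1).expand(-1, n, -1),
                          We.unsqueeze(0).expand(n, -1, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pair).squeeze(-1))           # attention logits for every (i, j)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.nan_to_num(torch.softmax(scores, dim=-1))   # transfer coefficients alpha_ij
        h_agg = alpha @ h                                         # responses aggregated from neighbours
        # Gate beta_i = sigmoid(b^T [W_e e_i || W_f f]) keeps part of the original response.
        gate_in = torch.cat([self.W_e(e), self.W_f(f).expand(n, -1)], dim=-1)
        beta = torch.sigmoid(self.b(gate_in)).squeeze(-1)         # update coefficient beta_i
        return beta * h_agg + (1.0 - beta) * h                    # updated response h'_i
```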
After the image features of the video frames in the video data are extracted, the image features of the multiple video frames need to be aggregated, and the aggregation result is used as the video features of the video data. The aggregation may directly sum the same-dimension entries of the image features across the video frames, or perform a weighted sum, or average them, which this application does not limit. In an alternative embodiment, an attention mechanism can be adopted to assign different aggregation weights to different video frames, so that more useful information is attended to while other information is ignored; video frames related to the text description information are assigned higher weights, making the feature representation more accurate.
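A minimal sketch of the attention-based aggregation over frames follows; the dimensions are illustrative, and direct summation or averaging would simply replace the learned weights.

```python
import torch
import torch.nn as nn

class AttentionFramePooling(nn.Module):
    """Aggregate per-frame features into one video-level feature with learned
    attention weights (dimensions are illustrative)."""
    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1))

    def forward(self, frame_feats):                               # [num_frames, feat_dim]
        weights = torch.softmax(self.score(frame_feats), dim=0)   # [num_frames, 1]
        return (weights * frame_feats).sum(dim=0)                 # [feat_dim] video-level feature
```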
And further, text data with text characteristics matched with the video characteristics of the video data can be retrieved, the text data is provided as a retrieval result, and the retrieval result is fed back to a retrieval page or is used as a basis for next processing.
The text features of the text data include text semantic features, that is, features extracted based on the content of the text data, and represent global information of the text data. In addition, in the embodiment of the application, the image content features and the image semantic features are constructed for the video data, the text semantic features are constructed for the text data, and then the video data and the text data are mapped into the content public space and the semantic public space (wherein the image content features and the image semantic features of the video data are respectively mapped into the public content space and the public semantic space), that is, different multi-dimensional feature expressions are adopted, so that the problems of information loss and inaccurate feature representation caused by feature alignment in a hidden space scheme can be solved, and the accuracy of video text matching is further improved.
The extraction of text semantic features can be realized with a BiGRU (bidirectional gated recurrent unit); combined with an attention mechanism it forms a BiGRU-Attention model, which can be divided into three parts: a text vectorization input layer, a hidden layer, and an output layer. The GRU is a very effective variant of the LSTM (Long Short-Term Memory) network; both LSTM and GRU retain important features through their gate functions, ensuring that these features are not lost during long-range propagation. In addition, the GRU has one fewer gate function than the LSTM, so it has fewer parameters, and overall the GRU trains faster than the LSTM.
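A minimal sketch of such a BiGRU-Attention text branch follows; the vocabulary size, embedding and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiGRUTextEncoder(nn.Module):
    """Sketch of a BiGRU-Attention text branch: embed words, run a bidirectional
    GRU, then pool the hidden states with attention over words. The vocabulary
    size and dimensions are illustrative assumptions."""
    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids):                          # [batch, seq_len]
        states, _ = self.bigru(self.embed(token_ids))      # [batch, seq_len, 2*hidden]
        weights = torch.softmax(self.attn(states), dim=1)  # per-word attention weights
        return (weights * states).sum(dim=1)               # [batch, 2*hidden] text semantic feature
```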
The above aggregation of the image features of multiple video frames in the video data, and of the text semantic features of the words in the text data, can be realized with an attention mechanism that assigns different weights, or with a Transformer structure that employs the attention mechanism. For the extraction of content features, a network such as the two-stream Inflated 3D ConvNet (I3D) can also be adopted to extract video content features directly without operating on individual video frames; the I3D network inflates the convolution and pooling kernels of a very deep image classification network from 2D to 3D to learn spatio-temporal features seamlessly.
Further, when retrieving text data whose text features match the video features of the video data, the text semantic features of a plurality of text data can be extracted, and the text data whose text features match the video features of the video data is determined according to the similarity between the text semantic features of the text data and, respectively, the image content features and the image semantic features (with their response values) of the video data.
The image features can be correspondingly expressed as vector data, and the vector similarity is used as the similarity between the text data and the video data. The text feature extraction of the text data can be divided into two branches, the similarity is calculated respectively, and then the similarity is added to obtain the total similarity.
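A sketch of this two-branch similarity follows, assuming the video and text features have already been projected into the common content space and the common semantic space.

```python
import torch.nn.functional as F

def video_text_similarity(video_content, video_semantic, text_content, text_semantic):
    """Total similarity = cosine similarity in the common content space plus cosine
    similarity in the common semantic space (the projection into the two common
    spaces is assumed to have been applied already)."""
    sim_content = F.cosine_similarity(video_content, text_content, dim=-1)
    sim_semantic = F.cosine_similarity(video_semantic, text_semantic, dim=-1)
    return sim_content + sim_semantic
```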
In an optional embodiment, when extracting text semantic features of a plurality of text data, semantic features of words in the text data can be extracted; and aggregating the semantic features of the words to obtain the text features corresponding to the text data, wherein the words related to the video data can be configured with higher aggregation weight by executing an attention mechanism, so that the association of the text and the video data is better expressed.
In an optional embodiment, retrieving text data with text features matched with video features of video data is realized based on a cross-modal data matching model, and a plurality of sample pairs can be collected in advance, wherein the sample pairs comprise video data samples and corresponding matched text data samples, further extracting the video features of the video data samples and the text features of the text data samples, and then training the cross-modal data matching model for searching matched text data based on the video data or searching matched video data based on the text data according to the video features of the video data samples and the text features of the text data samples.
The iterative training of the cross-modal data matching model can use a triplet loss, and can specifically be based on loss functions corresponding to the following three groups of data: the sample data and the prediction result obtained using the image content features alone, the sample data and the prediction result obtained using the image semantic features alone, and the sample data and the prediction result obtained using the image content features together with the image semantic features. In a specific implementation, the loss function can also be realized with an Angular Loss or an N-pair Loss.
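A sketch of this three-part triplet-style objective follows, using a plain margin-based ranking loss over similarity scores; the margin value and the hinge form are assumptions, and the Angular Loss or N-pair Loss mentioned above could be substituted.

```python
import torch

def triplet_ranking_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge-style triplet loss over similarity scores: the matched pair should
    score at least `margin` higher than the mismatched pair (margin is illustrative)."""
    return torch.clamp(margin - sim_pos + sim_neg, min=0).mean()

def total_matching_loss(sims_pos, sims_neg, margin=0.2):
    """Sum the triplet loss over the three prediction variants described above;
    sims_pos / sims_neg are assumed to be dicts of similarity tensors keyed by variant."""
    return sum(triplet_ranking_loss(sims_pos[k], sims_neg[k], margin)
               for k in ("content", "semantic", "combined"))
```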
Similarly, the data retrieval concept using the video retrieval text can also be applied to a cross-modal retrieval scene between any two modal data, where one of the two modal data includes video data, and the remaining one may be any suitable data modality such as text data, image data, and audio data.
Accordingly, such a cross-modal data retrieval scheme may be provided to receive a retrieval request based on first-modality data, further search for second-modality data having data characteristics matching those of the first-modality data, and provide the second-modality data as a retrieval result. Wherein the first modality data or the second modality data includes video data, and data characteristics of the video data are obtained according to the similar ideas. Specifically, the video features of the video data include image content features and image semantic features of video frames, the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels of classification prediction, response values of the initial image semantic features are obtained after the image content features are subjected to classification prediction, and the response values corresponding to the extended image semantic features are determined based on response values of at least one initial image semantic feature having an association.
Similarly, the data retrieval concept using the video retrieval text may also be applied to a scene of retrieving video data based on text data, and the correspondingly provided data retrieval method may include: receiving a retrieval request based on text data; retrieving video data with video features matched with text features of the text data, wherein the video features comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels for classification prediction, response values of the initial image semantic features are obtained after classification prediction is carried out on the image content features, and response values corresponding to the extended image semantic features are determined based on response values of at least one initial image semantic feature with association. The video data may finally be provided as a result of the retrieval.
The embodiment of the application can also correspondingly provide the above training scheme of the cross-modal data matching model, and the training scheme further extracts the video features of the video data samples and the text features of the text data samples by collecting a plurality of sample pairs, wherein the sample pairs comprise the video data samples and the corresponding matched text data samples, the video features comprise the image content features and the image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels for classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response values corresponding to the extended image semantic features are determined based on the response values of at least one associated initial image semantic feature. And finally, training a cross-modal data matching model for searching matched text data based on the video data or searching matched video data based on the text data according to the video characteristics of the video data sample and the text characteristics of the text data sample.
The embodiment of the application can also correspondingly provide a data processing method based on the video, and firstly, the video data to be processed is obtained; extracting video features of the video data; the video features comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels for classification prediction, response values of the initial image semantic features are obtained after classification prediction is carried out on the image content features, and the response values corresponding to the extended image semantic features are determined on the basis of the response values of at least one associated initial image semantic feature.
It should be noted that the scheme of the application can be applied not only to video retrieval scenes (e.g., media information video search, risk video investigation), but also to other processing requirements based on matching results of video texts, and also to other video application scenes. When the method is applied to risk video investigation, risk keywords can be added to the extended image semantic features, and the weights of the risk keywords can be further improved so as to improve the accuracy of video investigation.
The method may correspondingly be implemented as a functional module in the form of an application, a service, an instance, software, a virtual machine (VM) or a container, or as a hardware device (such as a server or a terminal device) or a hardware chip (such as a CPU, a GPU, or an FPGA) having an image processing function. It may be implemented by a software party or by a platform party that provides computing or storage resources. Taking SaaS (Software-as-a-Service) provided by a platform as an example, the platform can use its own computing resources to provide functions such as training the classification prediction model and the cross-modal data matching model and building and storing the knowledge graph and even the semantic network, and a specific application architecture can be built according to service requirements. For example, the platform may provide building services based on the model, the network and the graph to a software party or an individual using the platform's resources, and then call the model, the network and the graph to implement the corresponding functions based on a retrieval request submitted by a client, a server or another device involved in the retrieval.
An example of a retrieval method of the present application is given with reference to fig. 1, taking video data as an example for retrieving text data. The method comprises the steps of taking video data as a retrieval basis, respectively extracting image features aiming at a plurality of video frames included in the video data, wherein the image features comprise image content features and image semantic features, the image semantic features comprise initial image semantic features and extended image semantic features, and carrying out attention concept propagation (namely carrying out response value propagation of the semantic features based on an attention mechanism) according to response values of the initial image semantic features so as to obtain response values of the extended image semantic features.
The semantic features of the expanded images can be collected in advance based on a semantic network, a knowledge graph formed by the initial image semantic features and the semantic features of the expanded images is further constructed, and response values are spread based on the knowledge graph.
And further performing aggregation processing respectively based on the image content characteristics and the image semantic characteristics of each image frame, wherein the video characteristics obtained by aggregation respectively correspond to a public semantic space and a public content space based on an attention mechanism during aggregation.
The video features of the video data are matched against the text features of the text data in a text database (taking "A soccer player scores a goal" as an example sentence). A BiGRU network is adopted to extract the text features of the text data, and the text semantic features corresponding to each word are aggregated with a self-attention mechanism. As can be seen in FIG. 1, two branches are used to extract the text semantic features, and feature similarity is computed separately in the common semantic space and the common content space corresponding to the video data, which in effect decouples the feature encoding processes of the two modalities.
And finally, determining the feature similarity as the similarity of the video data and the text data, and determining target text data matched with the video data according to the similarity as a retrieval result.
The propagation of the response value is shown in the lower half of fig. 1, where blue is an initial node, red is an extended node associated with the initial node, and node expansion may be performed multiple times, for example, there are three red extended nodes directly associated with the upper blue initial node, the red extended node on the left further performs secondary expansion, two red extended nodes are added, the blue node on the lower side performs primary expansion to obtain two red extended nodes, and the secondary expansion is continued to add one red extended node. The left side is a schematic diagram of response value propagation of a part of nodes intercepted by a dotted line, and a specific propagation process is as described above.
Referring to fig. 2, a flowchart of a data retrieval method according to a first embodiment of the present application is shown, where the method may specifically include the following steps:
step 101, receiving a retrieval request based on video data;
step 102, extracting video characteristics of the video data; the video features comprise image content features of video frames and image semantic features with response values obtained after classification prediction is carried out according to the image content features, the image semantic features comprise initial image semantic features and extended image semantic features, and the response values corresponding to the extended image semantic features are determined on the basis of the response values of at least one initial image semantic feature with association;
103, retrieving text data with text characteristics matched with the video characteristics of the video data;
and step 104, providing the text data as a retrieval result.
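As a compact illustration of steps 101 to 104, the sketch below ranks candidate texts against the query video by feature similarity and returns the best matches; the database layout and the use of cosine similarity here are assumptions for illustration.

```python
import torch

def retrieve_text(video_feature, text_database, top_k=5):
    """Sketch of steps 101-104: rank candidate texts against the query video by
    feature similarity and return the best matches. The (text, feature) database
    layout and the use of cosine similarity are assumptions for illustration."""
    scored = [(text, float(torch.cosine_similarity(video_feature, feat, dim=0)))
              for text, feat in text_database]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```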
In an optional embodiment of the present application, the extracting the video feature of the video data includes:
extracting image content characteristics of video frames in the video data;
inputting the image content characteristics into a classification prediction model to obtain a response value corresponding to the initial image semantic characteristics;
and determining a response value corresponding to the semantic features of the expanded image according to the response value corresponding to the semantic features of the initial image.
In an optional embodiment of the present application, before the extracting the video feature of the video data, the method further comprises:
extracting the semantic features of the initial image formed by the classification labels included in the initial classification prediction model;
and acquiring the extended image features which are semantically related to the semantic features of the initial image.
In an optional embodiment of the present application, the obtaining extended image features having semantic association with the initial image semantic features includes:
and extracting the extended image semantic features which are associated with the initial image semantic features from the semantic network.
In an optional embodiment of the present application, the method further comprises:
constructing a knowledge graph based on the initial image semantic features and the expanded image semantic features, wherein the knowledge graph takes the image semantic features as nodes and takes semantic relation data among the image semantic features as connecting edges among the nodes;
the determining the response value corresponding to the extended image semantic feature according to the response value corresponding to the initial image semantic feature includes:
and updating the response value of the image semantic feature according to the response value of at least one other associated image semantic feature aiming at each image semantic feature in the knowledge graph, wherein the image semantic features which appear in the text description information associated with the video data sample have higher influence weight by executing a self-attention mechanism.
In an optional embodiment of the present application, the response values of the image semantic features in the knowledge graph are updated iteratively, and an update coefficient is set for the response values so that the initial image semantic features of the video data sample retain part of their original response values.
In an optional embodiment of the present application, the extracting video features of the video data further includes:
and aggregating image content characteristics and image semantic characteristics of a plurality of video frames, wherein the video frames related to the associated text description information have higher aggregation weight by executing a self-attention mechanism.
In an optional embodiment of the present application, the text feature of the text data includes a text semantic feature, and retrieving the text data whose text feature matches with the video feature of the video data includes:
extracting text semantic features of a plurality of text data;
and determining, according to the similarity between the text semantic features of the text data and the image content features and image semantic features of the video data, the text data whose text features match the video features of the video data.
In an optional embodiment of the present application, the extracting text semantic features of the plurality of text data includes:
extracting semantic features of words in text data;
and aggregating the semantic features of the plurality of words to obtain the text features corresponding to the text data, wherein the words related to the video data have higher aggregation weight by executing an attention mechanism.
In an optional embodiment of the present application, the retrieving text data whose text features match the video features of the video data is implemented based on a cross-modal data matching model, and the method further includes:
collecting a plurality of sample pairs, the sample pairs comprising video data samples and corresponding matched text data samples;
extracting video characteristics of the video data samples and text characteristics of the text data samples;
and training a cross-mode data matching model for searching matched text data based on the video data or searching matched video data based on the text data according to the video characteristics of the video data sample and the text characteristics of the text data sample.
In an optional embodiment of the present application, the iterative training of the cross-modal data matching model is based on loss functions corresponding to the following three groups of data: the sample data and the prediction result obtained using the image content features alone, the sample data and the prediction result obtained using the image semantic features alone, and the sample data and the prediction result obtained using the image content features together with the image semantic features.
According to the embodiments of this application, in cross-modal retrieval two parts of feature data are constructed for video data: one part is the image content features of the video frames, and the other part is image semantic features corresponding to the multiple classification labels of a classification prediction, so that the video features of the video data carry both content features representing global information and fine-grained semantic features, and the video data can be represented more accurately. To represent the video with semantic features of more dimensions, the method can also expand on the existing initial image semantic features and collect extended image semantic features that are semantically associated with them. By extracting comprehensive and rich features from the video data for video representation, the semantic gap between different modalities is effectively bridged, the performance of the cross-modal video retrieval system is remarkably improved, and the accuracy of video-text matching results can be improved, which helps obtain more accurate retrieval results; this effect has also been verified experimentally.
In addition, according to the embodiments of this application, image content features and image semantic features are constructed for the video data, text semantic features are constructed for the text data, and the video data and the text data are then mapped into a common content space and a common semantic space, i.e., different multi-dimensional feature expressions are adopted. This avoids the information loss and inaccurate feature representation caused by feature alignment in hidden-space schemes and further improves the accuracy of video-text matching.
Referring to fig. 3, a flowchart of a data retrieval method according to the second embodiment of the present application is shown, where the method specifically includes the following steps:
step 201, receiving a retrieval request based on first modality data;
step 202, searching second modality data with data characteristics matched with the data characteristics of the first modality data;
step 203, providing the second modality data as a retrieval result;
the first modality data or the second modality data comprise video data, the video features of the video data comprise image content features and image semantic features of video frames, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels of classification prediction, response values of the initial image semantic features are obtained after classification prediction is carried out on the image content features, and the response values corresponding to the extended image semantic features are determined on the basis of the response values of at least one initial image semantic feature having relevance.
According to the embodiments of this application, in cross-modal retrieval two parts of feature data are constructed for video data: one part is the image content features of the video frames, and the other part is image semantic features corresponding to the multiple classification labels of a classification prediction, so that the video features of the video data carry both content features representing global information and fine-grained semantic features, and the video data can be represented more accurately. To represent the video with semantic features of more dimensions, the method can also expand on the existing initial image semantic features and collect extended image semantic features that are semantically associated with them. By extracting comprehensive and rich features from the video data for video representation, the semantic gap between different modalities is effectively bridged, the performance of the cross-modal video retrieval system is remarkably improved, and the accuracy of video-text matching results can be improved, which helps obtain more accurate retrieval results; this effect has also been verified experimentally.
In addition, according to the embodiments of this application, image content features and image semantic features are constructed for the video data, text semantic features are constructed for the text data, and the video data and the text data are then mapped into a common content space and a common semantic space, i.e., different multi-dimensional feature expressions are adopted. This avoids the information loss and inaccurate feature representation caused by feature alignment in hidden-space schemes and further improves the accuracy of video-text matching.
Referring to fig. 4, a flowchart of a data retrieval method according to a third embodiment of the present application is shown, where the method specifically includes the following steps:
step 301, receiving a retrieval request based on text data;
step 302, retrieving video data whose video features match the text features of the text data; the video features include image content features of video frames and image semantic features, the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to the classification labels of a classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
step 303, providing the video data as a retrieval result.
The advantages described above for the preceding embodiments, namely more accurate video representation through content features and expanded semantic features and the mapping of both modalities into the common content and semantic spaces, apply equally to this embodiment and likewise improve the accuracy of the retrieval results.
Referring to fig. 5, a flowchart of a processing method for a cross-modal data matching model according to a fourth embodiment of the present application is shown, where the method specifically includes the following steps:
step 401, collecting a plurality of sample pairs, wherein the sample pairs comprise video data samples and corresponding matched text data samples;
step 402, extracting video features of the video data samples and text features of the text data samples; the video features include image content features of video frames and image semantic features, the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to the classification labels of a classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
step 403, training, according to the video features of the video data samples and the text features of the text data samples, a cross-modal data matching model for retrieving matched text data based on video data or retrieving matched video data based on text data; a minimal training sketch is shown below.
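One way to realize step 403 is to project both modalities into a common content space and a common semantic space and train with a bidirectional ranking loss over matched pairs. The projection layers, the dimensions, the margin value, and the hinge-based loss in the sketch below are assumptions; the embodiment does not prescribe a particular architecture or loss.

```python
# Sketch under stated assumptions; not the embodiment's prescribed implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceProjector(nn.Module):
    """Maps video and text features into a shared content space and a shared semantic space."""
    def __init__(self, video_dim, sem_dim, text_dim, common_dim=512):
        super().__init__()
        self.vid_content = nn.Linear(video_dim, common_dim)
        self.vid_semantic = nn.Linear(sem_dim, common_dim)
        self.txt_content = nn.Linear(text_dim, common_dim)
        self.txt_semantic = nn.Linear(text_dim, common_dim)

    def forward(self, vid_content, vid_semantic, txt_feat):
        v_c = F.normalize(self.vid_content(vid_content), dim=-1)
        v_s = F.normalize(self.vid_semantic(vid_semantic), dim=-1)
        t_c = F.normalize(self.txt_content(txt_feat), dim=-1)
        t_s = F.normalize(self.txt_semantic(txt_feat), dim=-1)
        return v_c @ t_c.T + v_s @ t_s.T      # (B, B) combined video-text similarity matrix

def ranking_loss(sim, margin=0.2):
    """Bidirectional hinge ranking loss over a batch of matched video-text pairs."""
    pos = sim.diag().unsqueeze(1)             # similarity of each matched pair
    cost_v2t = (margin + sim - pos).clamp(min=0)
    cost_t2v = (margin + sim - pos.T).clamp(min=0)
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_v2t.masked_fill(eye, 0).mean() + cost_t2v.masked_fill(eye, 0).mean()
```

A training loop over the sample pairs would then extract the batch features, form the similarity matrix with the projector, and minimise the ranking loss with any standard optimiser.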
The advantages described above, namely the multi-granularity video features and the mapping of both modalities into the common content and semantic spaces, apply equally to the model trained in this embodiment.
Referring to fig. 6, a block diagram of a data retrieval apparatus according to a fifth embodiment of the present application is shown, where the apparatus may specifically include:
a retrieval request module 501, configured to receive a retrieval request based on video data;
a video feature extraction module 502, configured to extract video features of the video data; the video features include image content features of video frames and image semantic features, the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to the classification labels of a classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
a text retrieval module 503, configured to retrieve text data with text features matching video features of the video data;
a result providing module 504, configured to provide the text data as a retrieval result.
In an optional embodiment of the present application, the video feature extraction module includes:
a content feature extraction submodule, configured to extract the image content features of the video frames in the video data;
a model prediction submodule, configured to input the image content features into a classification prediction model to obtain the response values corresponding to the initial image semantic features;
and a response value determination submodule, configured to determine the response values corresponding to the extended image semantic features according to the response values corresponding to the initial image semantic features.
In an optional embodiment of the present application, the apparatus further comprises:
a feature extraction module, configured to extract, before the video features of the video data are extracted, the initial image semantic features formed by the classification labels included in the initial classification prediction model;
and an extended feature acquisition module, configured to acquire extended image semantic features that are semantically associated with the initial image semantic features.
In an optional embodiment of the present application, the extended feature acquisition module is specifically configured to extract, from a semantic network, extended image semantic features that are associated with the initial image semantic features; a sketch of one possible way to do this follows.
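As a hypothetical illustration only, the sketch below uses WordNet (through NLTK) as the semantic network and treats hypernyms and hyponyms as the associated terms; the embodiment does not name a specific semantic resource or relation type.

```python
# Hypothetical illustration: WordNet via NLTK stands in for "the semantic network".
from nltk.corpus import wordnet as wn

def collect_extended_labels(initial_labels, max_per_label=5):
    """For each initial classification label, gather semantically related terms
    (hypernyms and hyponyms) to serve as extended image semantic features."""
    extended = set()
    for label in initial_labels:
        for synset in wn.synsets(label.replace(" ", "_")):
            for related in (synset.hypernyms() + synset.hyponyms())[:max_per_label]:
                extended.update(name.replace("_", " ") for name in related.lemma_names())
    return sorted(extended - set(initial_labels))
```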
In an optional embodiment of the present application, the apparatus further comprises:
a knowledge graph construction module, configured to construct a knowledge graph based on the initial image semantic features and the extended image semantic features, with the image semantic features as nodes and the semantic relation data between the image semantic features as the edges connecting the nodes;
the response value determination module includes:
and the probability updating submodule is used for updating the response value of the image semantic feature according to the response value of at least one other associated image semantic feature aiming at each image semantic feature in the knowledge graph, wherein the image semantic feature which appears in the text description information associated with the video data sample has higher influence weight by executing a self-attention mechanism.
In an optional embodiment of the present application, the response value of each image semantic feature in the knowledge graph is updated iteratively, and an update coefficient is set for the response values so that the initial image semantic features of the video data sample retain part of their original response values; a sketch of this propagation is given below.
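The following is a minimal sketch of such an iterative, attention-weighted update on the label graph; the dot-product attention, the additive boost for labels seen in the text, and the value of the update coefficient are assumptions made for illustration.

```python
# Minimal sketch under stated assumptions; not a prescribed implementation.
import torch

def propagate_responses(resp, adj, label_emb, text_mask, alpha=0.5, steps=2):
    """resp: (N,) response values of the image semantic features (graph nodes);
    adj: (N, N) 0/1 adjacency matrix of the knowledge graph;
    label_emb: (N, D) label embeddings used to compute self-attention weights;
    text_mask: (N,) 1 for labels that appear in the associated text description
    (these receive a higher influence weight), 0 otherwise;
    alpha: update coefficient so initial responses retain part of their original value."""
    attn = label_emb @ label_emb.T / label_emb.size(1) ** 0.5   # dot-product attention logits
    attn = attn + text_mask.unsqueeze(0)                        # boost labels mentioned in the text
    attn = attn.masked_fill(adj == 0, float("-inf"))            # only propagate along graph edges
    weights = torch.softmax(attn, dim=-1)
    weights = torch.nan_to_num(weights)                         # nodes without neighbours get zero rows
    for _ in range(steps):
        resp = alpha * resp + (1 - alpha) * weights @ resp      # keep part of the original response
    return resp
```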
In an optional embodiment of the present application, the feature extraction module is further configured to aggregate the image content features and image semantic features of a plurality of video frames, wherein, by applying a self-attention mechanism, video frames related to the associated text description information are given a higher aggregation weight; a sketch of this aggregation follows.
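A minimal sketch of such attention-weighted frame aggregation, assuming a per-frame relevance score with respect to the associated text is already available; how that score is computed is not specified here.

```python
# Sketch only; the frame-to-text relevance scores are assumed to be given.
import torch

def aggregate_frames(frame_feats, frame_text_scores):
    """frame_feats: (T, D) per-frame features (content or semantic);
    frame_text_scores: (T,) relevance of each frame to the associated text
    description; frames related to the text receive higher aggregation weight."""
    weights = torch.softmax(frame_text_scores, dim=0)
    return (weights.unsqueeze(1) * frame_feats).sum(dim=0)      # (D,) video-level feature
```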
In an optional embodiment of the present application, the text features of the text data include text semantic features, and the text retrieval module includes:
a text feature extraction submodule, configured to extract the text semantic features of a plurality of pieces of text data;
and a feature similarity calculation submodule, configured to determine the text data whose text features match the video features of the video data according to the similarity between the image content features and image semantic features of the video data and the text semantic features of the text data.
In an optional embodiment of the present application, the text feature extraction submodule is specifically configured to extract the semantic features of the words in the text data, and to aggregate the semantic features of the words to obtain the text features corresponding to the text data, wherein, by applying an attention mechanism, words related to the video data are given a higher aggregation weight; a sketch of this matching is given below.
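The sketch below illustrates, under assumptions, how the word-level aggregation and the two-space similarity could be combined to rank candidate texts; cosine similarity and the simple sum of the two similarity terms are illustrative choices.

```python
# Illustrative sketch; the similarity measure and its combination are assumptions.
import torch
import torch.nn.functional as F

def text_semantic_feature(word_embs, video_relevance):
    """word_embs: (L, D) per-word semantic features; video_relevance: (L,)
    relevance of each word to the video, used as attention weights."""
    weights = torch.softmax(video_relevance, dim=0)
    return (weights.unsqueeze(1) * word_embs).sum(dim=0)

def match_score(video_content, video_semantic, text_feat):
    """Combine similarity in the content space and in the semantic space."""
    return (F.cosine_similarity(video_content, text_feat, dim=0)
            + F.cosine_similarity(video_semantic, text_feat, dim=0))

def rank_texts(video_content, video_semantic, text_feats, top_k=5):
    """Rank candidate text features by their combined similarity to the video."""
    scores = torch.stack([match_score(video_content, video_semantic, t) for t in text_feats])
    return torch.topk(scores, k=min(top_k, len(text_feats))).indices
```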
In an optional embodiment of the present application, the retrieval of text data whose text features match the video features of the video data is implemented based on a cross-modal data matching model, and the apparatus further includes:
a sample pair collection module for collecting a plurality of sample pairs, the sample pairs comprising video data samples and corresponding matched text data samples;
a sample feature extraction module, configured to extract the video features of the video data samples and the text features of the text data samples;
and a cross-modal data matching model training module, configured to train, according to the video features of the video data samples and the text features of the text data samples, a cross-modal data matching model for retrieving matched text data based on video data or retrieving matched video data based on text data.
In an optional embodiment of the present application, the cross-modal data matching model is trained iteratively based on the loss functions corresponding to the following three groups of data: the sample data and the prediction results obtained using the image content features alone; the sample data and the prediction results obtained using the image semantic features alone; and the sample data and the prediction results obtained using the image content features and the image semantic features together. A sketch of combining these three losses follows.
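Read this way, one straightforward combination is to sum the three loss terms; the equal weighting and the reuse of the hinge ranking loss sketched earlier are assumptions, not requirements of the embodiment.

```python
# Sketch only; ranking_loss is the bidirectional hinge loss sketched earlier.
def total_loss(sim_content_only, sim_semantic_only, sim_combined, margin=0.2):
    """Each argument is a batch similarity matrix computed with, respectively,
    the image content features alone, the image semantic features alone,
    and both feature types together."""
    return (ranking_loss(sim_content_only, margin)
            + ranking_loss(sim_semantic_only, margin)
            + ranking_loss(sim_combined, margin))
```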
The advantages described above for the method embodiments, namely the multi-granularity video features and the mapping of both modalities into the common content and semantic spaces, apply equally to this apparatus.
Referring to fig. 7, a block diagram of a data retrieval apparatus according to a sixth embodiment of the present application is shown, where the apparatus may specifically include:
a retrieval request module 601, configured to receive a retrieval request based on first modality data;
a data searching module 602, configured to search for second modality data whose data features match the data features of the first modality data;
a result providing module 603, configured to provide the second modality data as a retrieval result;
The first modality data or the second modality data include video data. The video features of the video data include image content features of video frames and image semantic features; the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to the classification labels of a classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature.
The advantages described above for the corresponding method embodiment apply equally to this apparatus.
Referring to fig. 8, a block diagram of a data retrieval apparatus according to a seventh embodiment of the present application is shown, where the apparatus may specifically include:
a retrieval request receiving module 701, configured to receive a retrieval request based on text data;
a video data retrieving module 702, configured to retrieve video data whose video features match the text features of the text data; the video features include image content features of video frames and image semantic features, the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to the classification labels of a classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
a retrieval result providing module 703, configured to provide the video data as a retrieval result.
The advantages described above for the corresponding method embodiment apply equally to this apparatus.
Referring to fig. 9, a block diagram of a processing apparatus for a cross-modal data matching model according to an eighth embodiment of the present application is shown, where the apparatus may specifically include:
a sample collection module 801 for collecting a plurality of sample pairs, the sample pairs including video data samples and corresponding matched text data samples;
a sample feature extraction module 802, configured to extract video features of the video data samples and text features of the text data samples; the video features include image content features of video frames and image semantic features, the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to the classification labels of a classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
and the model training module 803 is configured to train a cross-modal data matching model for searching for matched text data based on video data or searching for matched video data based on text data according to the video features of the video data samples and the text features of the text data samples.
The advantages described above for the corresponding method embodiment apply equally to this apparatus.
An embodiment of the present application further provides a data processing apparatus based on video, which may specifically include:
the video data acquisition module is used for acquiring video data to be processed;
a video feature extraction module, configured to extract video features of the video data; the video features include image content features of video frames and image semantic features, the image semantic features include initial image semantic features and extended image semantic features, the initial image semantic features correspond to the classification labels of a classification prediction, the response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and the response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
and the processing flow executing module is used for executing the data processing flow based on the video characteristics of the video data.
An embodiment of the present application further provides an electronic device, including: a processor; and
a memory having executable code stored thereon, which when executed, causes the processor to perform a method as in any one of the above embodiments.
Embodiments of the application also provide one or more machine-readable media having executable code stored thereon that, when executed, cause a processor to perform a method as described in any of the above embodiments.
The advantages described above, namely the multi-granularity video features constructed from image content features and expanded image semantic features, apply equally to the electronic device and machine-readable media embodiments.
As the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
Embodiments of the disclosure may be implemented as a system using any suitable hardware, firmware, software, or any combination thereof, in a desired configuration. Fig. 10 schematically illustrates an exemplary system (or apparatus) 900 that can be used to implement various embodiments described in this disclosure.
For one embodiment, fig. 10 illustrates an exemplary system 900 having one or more processors 902, a system control module (chipset) 904 coupled to at least one of the processor(s) 902, a system memory 906 coupled to the system control module 904, a non-volatile memory (NVM)/storage 908 coupled to the system control module 904, one or more input/output devices 910 coupled to the system control module 904, and a network interface 912 coupled to the system control module 904.
The processor 902 may include one or more single-core or multi-core processors, and the processor 902 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the system 900 can function as a browser as described in embodiments herein.
In some embodiments, system 900 may include one or more computer-readable media (e.g., system memory 906 or NVM/storage 908) having instructions and one or more processors 902 in combination with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described in this disclosure.
For one embodiment, the system control module 904 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 902 and/or any suitable device or component in communication with the system control module 904.
The system control module 904 may include a memory controller module to provide an interface to the system memory 906. The memory controller module may be a hardware module, a software module, and/or a firmware module.
System memory 906 may be used, for example, to load and store data and/or instructions for system 900. For one embodiment, the system memory 906 may comprise any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 906 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the system control module 904 may include one or more input/output controllers to provide an interface to the NVM/storage 908 and input/output device(s) 910.
For example, NVM/storage 908 may be used to store data and/or instructions. NVM/storage 908 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 908 may include storage resources that are physically part of the device on which system 900 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 908 may be accessible over a network via input/output device(s) 910.
Input/output device(s) 910 may provide an interface for system 900 to communicate with any other suitable device, and may include communication components, audio components, sensor components, and so forth. Network interface 912 may provide an interface for system 900 to communicate over one or more networks; system 900 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 902 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of the system control module 904. For one embodiment, at least one of the processor(s) 902 may be packaged together with logic for one or more controller(s) of the system control module 904 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with logic for one or more controller(s) of the system control module 904. For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with logic of one or more controllers of the system control module 904 to form a system on a chip (SoC).
In various embodiments, system 900 may be, but is not limited to being: a browser, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 900 may have more or fewer components and/or different architectures. For example, in some embodiments, system 900 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
If the display includes a touch panel, the display screen may be implemented as a touch screen display to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The present application further provides a non-volatile readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a terminal device, the one or more modules may cause the terminal device to execute instructions (instructions) of method steps in the present application.
In one example, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to the embodiments of the present application when executing the computer program.
There is also provided in one example a computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as one or more of the embodiments of the application.
Although certain examples have been illustrated and described for purposes of description, a wide variety of alternate and/or equivalent implementations calculated to achieve the same purposes may be substituted for the embodiments shown and described without departing from the scope of the present application. This application is intended to cover any adaptations or variations of the embodiments discussed herein. Therefore, it is manifestly intended that the embodiments described herein be limited only by the claims and the equivalents thereof.

Claims (14)

1. A method of data retrieval, comprising:
receiving a retrieval request based on video data;
extracting video features of the video data; the video features comprise image content features of video frames and image semantic features, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels of a classification prediction, response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and a response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
retrieving text data having text features matching video features of the video data;
providing the text data as a retrieval result.
2. The method of claim 1, wherein the extracting video features of the video data comprises:
extracting image content characteristics of video frames in the video data;
inputting the image content characteristics into a classification prediction model to obtain a response value corresponding to the initial image semantic characteristics;
and determining a response value corresponding to the semantic features of the expanded image according to the response value corresponding to the semantic features of the initial image.
3. The method of claim 2, wherein prior to said extracting video features of said video data, said method further comprises:
extracting the semantic features of the initial image formed by the classification labels included in the initial classification prediction model;
and acquiring extended image semantic features which are semantically associated with the initial image semantic features.
4. The method of claim 3, wherein obtaining extended image features having semantic associations with initial image semantic features comprises:
and extracting the extended image semantic features which are associated with the initial image semantic features from the semantic network.
5. The method of claim 2, further comprising:
constructing a knowledge graph based on the initial image semantic features and the expanded image semantic features, wherein the knowledge graph takes the image semantic features as nodes and takes semantic relation data among the image semantic features as connecting edges among the nodes;
the determining the response value corresponding to the extended image semantic feature according to the response value corresponding to the initial image semantic feature includes:
and updating, for each image semantic feature in the knowledge graph, the response value of the image semantic feature according to the response value of at least one other associated image semantic feature, wherein the image semantic features which appear in the text description information associated with the video data sample have a higher influence weight by executing a self-attention mechanism.
6. The method according to claim 5, wherein the response value of each image semantic feature in the knowledge-graph is updated iteratively, and an update coefficient is set for the response value, so that the initial image semantic feature of the video data sample retains a part of the original response value.
7. The method of claim 1, wherein the extracting video features of the video data further comprises:
and aggregating image content characteristics and image semantic characteristics of a plurality of video frames, wherein the video frames related to the associated text description information have higher aggregation weight by executing a self-attention mechanism.
8. The method of claim 1, wherein the text features of the text data comprise text semantic features, and wherein retrieving text data whose text features match video features of the video data comprises:
extracting text semantic features of a plurality of text data;
and determining the text data whose text features match the video features of the video data according to the similarity between the image content features and image semantic features of the video data and the text semantic features of the text data.
9. The method of claim 8, wherein extracting text semantic features of the plurality of text data comprises:
extracting semantic features of words in text data;
and aggregating the semantic features of the plurality of words to obtain the text features corresponding to the text data, wherein the words related to the video data have higher aggregation weight by executing an attention mechanism.
10. A method of data retrieval, comprising:
receiving a retrieval request based on first modality data;
searching second modality data with data characteristics matched with the data characteristics of the first modality data;
providing the second modality data as a retrieval result;
the first modality data or the second modality data comprise video data, the video features of the video data comprise image content features of video frames and image semantic features, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels of a classification prediction, response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and a response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature.
11. A method of data retrieval, comprising:
receiving a retrieval request based on text data;
retrieving video data whose video features match text features of the text data; the video features comprise image content features of video frames and image semantic features, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels of a classification prediction, response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and a response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
providing the video data as a retrieval result.
12. A method for video-based data processing, comprising:
acquiring video data to be processed;
extracting video features of the video data; the video features comprise image content features of video frames and image semantic features, the image semantic features comprise initial image semantic features and extended image semantic features, the initial image semantic features correspond to a plurality of classification labels of a classification prediction, response values of the initial image semantic features are obtained by performing classification prediction on the image content features, and a response value corresponding to each extended image semantic feature is determined based on the response values of at least one associated initial image semantic feature;
and executing a data processing flow based on the video characteristics of the video data.
13. An electronic device, comprising: a processor; and
a memory having executable code stored thereon that, when executed, causes the processor to perform the method of any of claims 1-12.
14. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the method of any of claims 1-12.
CN202111166923.0A 2021-09-30 2021-09-30 Data retrieval method, cross-modal data matching model processing method and device Pending CN113987119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111166923.0A CN113987119A (en) 2021-09-30 2021-09-30 Data retrieval method, cross-modal data matching model processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111166923.0A CN113987119A (en) 2021-09-30 2021-09-30 Data retrieval method, cross-modal data matching model processing method and device

Publications (1)

Publication Number Publication Date
CN113987119A true CN113987119A (en) 2022-01-28

Family

ID=79737607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111166923.0A Pending CN113987119A (en) 2021-09-30 2021-09-30 Data retrieval method, cross-modal data matching model processing method and device

Country Status (1)

Country Link
CN (1) CN113987119A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925238A (en) * 2022-07-20 2022-08-19 山东大学 Video clip retrieval method and system based on federal learning
CN114972910A (en) * 2022-05-20 2022-08-30 北京百度网讯科技有限公司 Image-text recognition model training method and device, electronic equipment and storage medium
CN115080871A (en) * 2022-07-07 2022-09-20 国家计算机网络与信息安全管理中心 Cross-social network social user alignment method
CN115359383A (en) * 2022-07-07 2022-11-18 北京百度网讯科技有限公司 Cross-modal feature extraction, retrieval and model training method, device and medium
WO2023168997A1 (en) * 2022-03-07 2023-09-14 腾讯科技(深圳)有限公司 Cross-modal retrieval method and related device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023168997A1 (en) * 2022-03-07 2023-09-14 腾讯科技(深圳)有限公司 Cross-modal retrieval method and related device
CN114972910A (en) * 2022-05-20 2022-08-30 北京百度网讯科技有限公司 Image-text recognition model training method and device, electronic equipment and storage medium
CN114972910B (en) * 2022-05-20 2023-05-23 北京百度网讯科技有限公司 Training method and device for image-text recognition model, electronic equipment and storage medium
CN115080871A (en) * 2022-07-07 2022-09-20 国家计算机网络与信息安全管理中心 Cross-social network social user alignment method
CN115359383A (en) * 2022-07-07 2022-11-18 北京百度网讯科技有限公司 Cross-modal feature extraction, retrieval and model training method, device and medium
CN114925238A (en) * 2022-07-20 2022-08-19 山东大学 Video clip retrieval method and system based on federal learning
CN114925238B (en) * 2022-07-20 2022-10-28 山东大学 Federal learning-based video clip retrieval method and system

Similar Documents

Publication Publication Date Title
CN113987119A (en) Data retrieval method, cross-modal data matching model processing method and device
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
US11741157B2 (en) Propagating multi-term contextual tags to digital content
TW202139183A (en) Method of detecting object based on artificial intelligence, device, equipment and computer-readable storage medium
US20130254191A1 (en) Systems and methods for mobile search using bag of hash bits and boundary reranking
CN109918513B (en) Image processing method, device, server and storage medium
US11875512B2 (en) Attributionally robust training for weakly supervised localization and segmentation
CN111291765A (en) Method and device for determining similar pictures
US10831818B2 (en) Digital image search training using aggregated digital images
WO2019029714A1 (en) Image content-based display object determination method, device, medium, and apparatus
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
WO2023185925A1 (en) Data processing method and related apparatus
CN114003758B (en) Training method and device of image retrieval model and retrieval method and device
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN111488479B (en) Hypergraph construction method and device, computer system and medium
Guo et al. Saliency detection on sampled images for tag ranking
JP2022541832A (en) Method and apparatus for retrieving images
CN116958724A (en) Training method and related device for product classification model
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium
US20220156312A1 (en) Personalized image recommendations for areas of interest
US11256736B2 (en) Personalized image recognition
CN114596435A (en) Semantic segmentation label generation method, device, equipment and storage medium
CN113849679A (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN110991543B (en) Image region of interest clustering method and device, computing device and storage medium
US11899754B2 (en) ROI-based data content graph for wide data management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination