CN111767726B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN111767726B
CN111767726B (application CN202010588592.9A)
Authority
CN
China
Prior art keywords
video
vector
word
information
fused
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010588592.9A
Other languages
Chinese (zh)
Other versions
CN111767726A (en)
Inventor
张轩玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010588592.9A
Publication of CN111767726A
Application granted
Publication of CN111767726B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

An embodiment of the invention provides a data processing method and apparatus. The method includes: acquiring video information for which content tags are to be generated and text information describing the video information; determining a word vector for each word segment in the text information; performing feature extraction on the video information to obtain video feature vectors corresponding to the video information; cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to obtain fused word vectors and fused video feature vectors; and obtaining the corresponding content tags from the fused word vectors and the fused video feature vectors. With this method, the features of the text information and the video information are cross-fused and their joint (cross) information is extracted, so that the resulting content tags are more accurate.

Description

Data processing method and device
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a data processing method and apparatus.
Background
Current data tagging methods are mainly text-based. However, video content carries many characteristics, and text alone can hardly capture all of the information in a video. When the text consists of only a few phrases, it provides limited information; without the underlying video content it may fail to capture the main information, and it may even be difficult to extract any useful information from it.
The prior art offers related solutions to this problem, but most existing image-text fusion methods simply concatenate the two kinds of features at the input. Such concatenation is only used in the encoder: more features are obtained, but the text features and the video features remain independent of each other, the effect is limited, and the video information cannot be fully exploited in the decoder.
No effective solution has yet been proposed for these technical problems in the related art.
Disclosure of Invention
An object of an embodiment of the present invention is to provide a data processing method and apparatus, so as to solve at least one technical problem in the related art. The specific technical scheme is as follows:
in a first aspect of the present invention, there is provided a data processing method, including:
acquiring video information for generating content tags and text information for describing the video information;
determining a word vector of each word segmentation in the text information;
extracting features of the video information to obtain video feature vectors corresponding to the video information;
cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
And obtaining the corresponding content tag according to the fused word vector and the fused video feature vector.
Optionally, in the foregoing method, the determining a word vector of each word segment in the text information includes:
performing word segmentation processing on the text information to obtain the word segmentation forming the text information;
obtaining a corresponding word list according to the word segmentation and the preset tag word;
and determining the word vector of each word according to a word vector model obtained through pre-training and the word list.
Optionally, in the foregoing method, the feature extracting the video information to obtain a video feature vector corresponding to the video information includes:
extracting the video information according to frames to obtain at least two video frame images;
respectively inputting the video frame images into a deep neural network;
and acquiring the video feature vector obtained after the feature extraction layer in the deep neural network performs feature extraction on the video frame image.
Optionally, in the foregoing method, the cross-fusing, by a mutual attention mechanism, the features of the word vector and the features of the video feature vector to obtain a fused word vector and a fused video feature vector, including:
Performing vector dimension adjustment on the word vector to obtain a dimension-adjusting word vector, and performing vector dimension adjustment on the video feature vector to obtain the dimension-adjusting video feature vector;
after the dimension-adjusting word vectors are spliced with the dimension-adjusting video feature vectors, initial word vector information of the dimension-adjusting word vectors and initial video vector information of the dimension-adjusting video feature vectors are obtained;
determining a hierarchical relationship of the mutual attention layers, the hierarchical relationship characterizing a connection relationship between different mutual attention layers;
inputting the word vector information and the video vector information into the mutual attention layer arranged on a first layer for feature cross fusion, obtaining each first video vector information fused with all the initial word vector information according to the first word vector influence weight of each initial word vector information on each initial video vector information, and obtaining each first word vector information fused with all the initial video vector information according to the first video vector influence weight of each initial video vector information on each initial word vector information;
the first word vector information and the first video vector information are input into the mutual attention layer of the next layer according to the hierarchical relation to perform feature cross fusion again, and second word vector information and second video vector influence weights are obtained respectively; according to the circulation, outputting the fused word vector information and the fused video feature vector information through the mutual attention layer of the last layer;
Decoding the fused word vector information to obtain the fused word vector;
and decoding the fused video feature vector information to obtain the fused video feature vector.
Optionally, in the foregoing method, the obtaining the corresponding content tag according to the fused word vector and the fused video feature vector includes:
determining candidate word vectors of all words in the word list;
determining the candidate word vectors closest to the first distance of each output vector, respectively, wherein the output vector comprises: the fused word vector and the fused video feature vector;
and taking the words corresponding to the candidate word vectors with the first nearest distance as the content labels corresponding to the output vectors.
Optionally, in the foregoing method, after obtaining the corresponding content tag according to the fused word vector and the fused video feature vector, the method further includes:
acquiring the total number of the content tags;
when the total number of the content tags is larger than a preset upper threshold value, acquiring a second distance between the candidate word vector and the output vector, which correspond to the same content tag;
determining a correspondence between the content tag and a second distance;
Arranging the content tags from small to large according to the second distance;
and deleting the content labels with the arrangement order larger than the upper threshold according to the corresponding relation.
Optionally, in the foregoing method, image extraction is performed on the video information according to frames to obtain at least two video frame images, including:
acquiring the total frame number of the image included in the video information;
determining a preset upper limit threshold of the number of images;
determining a preset extraction strategy for extracting the images of the video information according to the numerical relation between the total frame number and the upper limit threshold of the number of the images;
and carrying out image extraction on the video information according to frames according to the extraction strategy to obtain video frame images with the number smaller than or equal to the number corresponding to the upper limit threshold of the number of the images.
In a second aspect of the present invention, there is also provided a data processing apparatus, comprising:
the acquisition module is used for acquiring video information for generating content tags and text information for describing the video information;
the determining module is used for determining word vectors of each word segmentation in the text information;
the vector acquisition module is used for extracting the characteristics of the video information to obtain a video characteristic vector corresponding to the video information;
The feature fusion module is used for carrying out cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the label determining module is used for obtaining the corresponding content label according to the fused word vector and the fused video feature vector.
Optionally, in the foregoing apparatus, the determining module includes:
the word segmentation unit is used for carrying out word segmentation processing on the text information to obtain the word segmentation forming the text information;
the vocabulary unit is used for obtaining a corresponding vocabulary according to the word segmentation and the preset tag word;
and the word vector unit is used for determining the word vector of each word segmentation according to a word vector model obtained through pre-training and the word list.
Optionally, in the foregoing apparatus, the vector acquisition module includes:
the extraction unit is used for extracting the video information according to frames to obtain at least two video frame images;
the input unit is used for respectively inputting the video frame images into the deep neural network;
the acquisition unit is used for acquiring the video feature vector obtained after the feature extraction layer in the deep neural network performs feature extraction on the video frame image.
Optionally, in the foregoing apparatus, the feature fusion module includes:
the dimension adjusting unit is used for carrying out vector dimension adjustment on the word vectors to obtain dimension-adjusting word vectors, and carrying out vector dimension adjustment on the video feature vectors to obtain dimension-adjusting video feature vectors;
the splicing unit is used for splicing each dimension-adjusting word vector with the dimension-adjusting video feature vector to obtain initial word vector information of the dimension-adjusting word vector and initial video vector information of the dimension-adjusting video feature vector;
a relationship determination unit configured to determine a hierarchical relationship of the mutual attention layers, the hierarchical relationship characterizing a connection relationship between different mutual attention layers;
the fusion unit is used for inputting the word vector information and the video vector information into the mutual attention layer arranged on the first layer to perform cross fusion of features, obtaining first video vector information fused with all the initial word vector information according to first word vector influence weights of the initial word vector information on the initial video vector information, and obtaining first word vector information fused with all the initial video vector information according to first video vector influence weights of the initial video vector information on the initial word vector information;
The output unit is used for inputting the first word vector information and the first video vector information into the mutual attention layer of the next layer according to the hierarchical relation, carrying out feature cross fusion again, and respectively obtaining second word vector information and second video vector influence weight; according to the circulation, outputting the fused word vector information and the fused video feature vector information through the mutual attention layer of the last layer;
the first decoding unit is used for decoding the fused word vector information to obtain the fused word vector;
and the second decoding unit is used for decoding the fused video feature vector information to obtain the fused video feature vector.
Optionally, in the foregoing apparatus, the tag determining module includes:
a first determining unit, configured to determine candidate word vectors of each word in the vocabulary;
a second determining unit, configured to determine the candidate word vectors closest to a first distance of each output vector, where the output vector includes: the fused word vector and the fused video feature vector;
and the label determining unit is used for taking the word corresponding to the candidate word vector with the first nearest distance as the content label corresponding to the output vector.
Optionally, the apparatus as described above further comprises: a tag screening module; the tag screening module comprises:
a total number unit for acquiring the total number of the content tags;
a second distance unit, configured to obtain a second distance between the candidate word vector and the output vector corresponding to the same content tag when the total number of the content tags is greater than a preset upper threshold;
a third determining unit configured to determine a correspondence between the content tag and a second distance;
an arrangement unit configured to arrange the content tags from small to large according to the second distance;
and the screening unit is used for deleting the content labels with the arrangement order larger than the upper limit threshold according to the corresponding relation.
Optionally, as in the previous device, the extraction unit comprises:
a total frame number subunit, configured to obtain a total frame number of an image included in the video information;
a threshold subunit, configured to determine a preset upper limit threshold of the number of images;
a strategy subunit, configured to determine a preset extraction strategy for extracting an image from the video information according to a numerical relationship between the total frame number and an upper limit threshold of the image number;
And the image determining subunit is used for extracting the images of the video information according to the extraction strategy and obtaining video frame images with the number smaller than or equal to the number corresponding to the upper limit threshold of the number of the images.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the invention there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
The embodiment of the invention provides a data processing method and apparatus. The method includes: acquiring video information for which content tags are to be generated and text information describing the video information; determining a word vector for each word segment in the text information; performing feature extraction on the video information to obtain video feature vectors corresponding to the video information; cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to obtain fused word vectors and fused video feature vectors; and obtaining the corresponding content tags from the fused word vectors and the fused video feature vectors. With this method, the features of the text information and the video information are cross-fused and their joint (cross) information is extracted, so that the resulting content tags are more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the invention;
FIG. 2 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
As shown in fig. 1, a method for processing data in an embodiment of the present application includes steps S1 to S5 as follows:
S1, acquiring video information for generating content tags and text information for describing the video information.
Specifically, the video information is a sequence of consecutive pictures that can exhibit a smooth, continuous visual effect; in particular, video information is typically played at more than 24 frames per second. The text information may be one or more keywords, a long sentence, an article, and so on. The text information and the video information belong to, and describe, the same piece of data; for example, when the video information is a video clip, the text information may be text that summarizes the content of that clip.
S2, determining word vectors of each word segmentation in the text information.
Specifically, both machine learning and deep learning operate on numbers, and word vectors serve this purpose by mapping words into a vector space and representing each word as a vector. Conceptually, this is a mathematical embedding from a space with one dimension per word into a continuous vector space of much lower dimension. Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge-base methods, and explicit representations of the contexts in which words occur.
The word vector of each word segment in the text information may be determined with a language-model method such as word2vec, GloVe, ELMo, or BERT.
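For illustration only, the following minimal sketch shows one way to obtain such word vectors with a word2vec-style model; the use of the gensim library and its hyper-parameters are assumptions of this sketch rather than requirements of the method, and the 512-dimensional vector size follows the embodiment described later.

```python
# Sketch only: the gensim library and these hyper-parameters are assumptions, not part of the method.
from gensim.models import Word2Vec

# Each training sample is a list of word segments produced by the word segmentation step.
segmented_corpus = [
    ["sound", "effect", "actor", "play"],
    ["tragic", "actor", "match", "sound", "effect"],
]

w2v = Word2Vec(sentences=segmented_corpus, vector_size=512, window=5, min_count=1)

vec = w2v.wv["actor"]   # word vector of one word segment, shape (512,)
print(vec.shape)
```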
And S3, extracting features of the video information to obtain video feature vectors corresponding to the video information.
Specifically, feature extraction from the video information generally requires frame extraction first: individual frame images are extracted from the video and then analyzed to obtain the key information in the video. In some optional solutions, feature extraction may be performed on the extracted images by a neural network model such as a CNN to obtain the corresponding video feature information.
S4, cross fusion is carried out on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism, and the fused word vectors and the fused video feature vectors are obtained respectively.
Specifically, a mutual attention mechanism is adopted to find out the hidden dependency relationship between the word vector and the video feature vector, so that more supplementary information is provided for the word vector through the video feature vector, more supplementary information is provided for the video feature vector through the word vector, and the information fusion degree of the word vector and the video feature vector is higher.
And S5, obtaining a corresponding content tag according to the fused word vector and the fused video feature vector.
Specifically, obtaining the corresponding content tag from the fused word vector and the fused video feature vector may be done in either of two ways:
1) obtaining a tag from the fused word vector and a tag from the fused video feature vector separately, and deriving the content tag from the two; or
2) fusing the fused word vector and the fused video feature vector again, so that they influence each other and are fused once more, and then obtaining the content tag from the resulting mutually influenced vectors.
With the method of this embodiment, the video features can be used when the text information lacks comprehensive or key information, so that the information contained in the video is brought in as well, which ultimately improves both the recall and the precision of the generated tags.
In some embodiments, the method as described above, determining the word vector of each word segment in the text information includes the following steps A1 to A3:
step A1, performing word segmentation processing on the text information to obtain word segmentation forming the text information;
a2, obtaining a corresponding word list according to the word segmentation and the preset tag word;
and step A3, determining the word vector of each word according to the word vector model and the word list which are obtained through training in advance.
Specifically, word segmentation splits a piece of text into several words. For example, when the text information is a descriptive sentence (in the original example, a remark about a tragicomic performance set to sound effects), the word segments obtained after segmentation include: 'beautiful', 'sound effect', 'artist', 'tragic', 'actor', 'match', 'sound effect', 'also', 'is', 'spell', 'show', 'individual', 'play', 'easy', 'mock'.
The preset tag words may be a group of words selected in advance; the words in the vocabulary therefore include both the tag words and the word segments obtained by segmenting the text information.
The pre-trained word vector model may be a word2vec model (a tool for computing word vectors); the word vector of each word segment can be determined with the trained word2vec model.
Specifically, once the vocabulary and the model are determined, a word vector can be assigned to every word in the vocabulary. Further, each word in the vocabulary may be randomly initialized as a 512-dimensional vector, giving a word vector for each word segment and a tag vector (the word vector of a tag word), respectively.
Through the method in the embodiment, the relation among the segmented words in the text information can be obtained through the word vector, the semantics of each segmented word in the text information can be effectively obtained, and the accuracy of the label result can be effectively improved.
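For illustration, a minimal sketch of building the vocabulary from the word segments plus the preset tag words and randomly initializing 512-dimensional vectors follows; the jieba segmenter, the PyTorch embedding table, the example text and the example tag words are assumptions of the sketch, not requirements of the method.

```python
# Sketch only: jieba (a common Chinese word segmenter) and PyTorch are assumptions of this sketch;
# the text and tag words below are hypothetical. The 512-dim size follows the embodiment above.
import jieba
import torch
import torch.nn as nn

text_information = "这是一段描述视频内容的文本"   # hypothetical descriptive text
preset_tag_words = ["喜剧", "综艺"]               # hypothetical preset tag words

word_segments = jieba.lcut(text_information)                          # word segmentation
vocabulary = list(dict.fromkeys(word_segments + preset_tag_words))    # segments + tag words, de-duplicated
word_to_id = {w: i for i, w in enumerate(vocabulary)}

# Randomly initialized 512-dimensional vectors: word vectors for the segments, tag vectors for the tag words.
embedding = nn.Embedding(num_embeddings=len(vocabulary), embedding_dim=512)

ids = torch.tensor([word_to_id[w] for w in word_segments])
word_vectors = embedding(ids)                     # shape: (number of word segments, 512)
```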
As shown in fig. 2, in some embodiments, as in the foregoing method, the step S3 of extracting features from the video information to obtain the video feature vectors corresponding to the video information includes the following steps S31 to S33:
S31, extracting images from the video information by frame to obtain at least two video frame images;
S32, respectively inputting the video frame images into a deep neural network;
S33, obtaining the video feature vectors produced when the feature extraction layer of the deep neural network performs feature extraction on the video frame images.
Specifically, image extraction from the video information by frame may be performed frame by frame; in addition, since the images of adjacent frames tend to be similar, images may instead be extracted at a fixed interval, and the specific extraction strategy can be chosen according to the actual situation. The video frame images are the images obtained by this extraction from the video information.
Because the deep neural network has the capability of extracting the characteristics of the image information, the video frame image is input into the deep neural network to obtain the corresponding video characteristic vector.
One optional implementation is as follows: the video frame images are input into an Xception (depthwise separable convolution) model, and since the video features extracted by the penultimate layer of the Xception model are the richest, the 2048-dimensional vectors of that penultimate layer are taken as the video features.
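For illustration, the following minimal sketch extracts such 2048-dimensional penultimate-layer features with a pre-trained Xception model; the tf.keras implementation, the ImageNet weights and the global-average-pooling readout are assumptions of the sketch, since the embodiment names Xception but no specific framework.

```python
# Sketch only: tf.keras's pre-trained Xception as the backbone; with include_top=False and
# global average pooling, the output is a 2048-dim feature per frame (the penultimate-layer feature).
import numpy as np
import tensorflow as tf

backbone = tf.keras.applications.Xception(weights="imagenet", include_top=False, pooling="avg")
preprocess = tf.keras.applications.xception.preprocess_input

# frames: the extracted video frame images, resized to Xception's 299x299 input size (dummy data here).
frames = np.random.randint(0, 256, size=(8, 299, 299, 3)).astype("float32")

video_feature_vectors = backbone.predict(preprocess(frames))   # shape: (8, 2048)
```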
With the method of this embodiment, extracting images from the video by frame filters out highly repetitive images and avoids extracting the same video feature vector repeatedly, which effectively improves the efficiency of feature extraction; the feature extraction layer of the deep neural network then extracts rich video feature vectors from the frames, so that more of the information carried by the video is made available.
As shown in fig. 3, in some embodiments, as in the foregoing method, the step S4 of cross-fusing the features of the word vectors and the features of the video feature vectors through the mutual attention mechanism includes the following steps S41 to S47:
and S41, carrying out vector dimension adjustment on the word vector to obtain a dimension-adjusting word vector, and carrying out vector dimension adjustment on the video feature vector to obtain a dimension-adjusting video feature vector.
Specifically, following the foregoing embodiment, the word vector of a word segment may be 512-dimensional while the video feature vector is 2048-dimensional; their dimensions therefore differ and must be unified before the two can be concatenated and fused. Optionally, since the video feature vector has the higher dimension, dimension reduction can be applied to the video feature vector alone; typically the 2048-dimensional video feature vector is reduced through a fully connected network to obtain a 512-dimensional dimension-adjusted video feature vector.
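For illustration, a minimal sketch of the dimension adjustment follows, assuming a single PyTorch fully connected layer performs the 2048-to-512 reduction; the embodiment only requires some fully connected network.

```python
# Sketch only: one fully connected layer stands in for the "fully connected network" of the embodiment.
import torch
import torch.nn as nn

reduce_dim = nn.Linear(2048, 512)

video_feature_vectors = torch.randn(8, 2048)              # 8 frames, 2048-dim backbone features
dim_adjusted_video = reduce_dim(video_feature_vectors)    # shape: (8, 512)

word_vectors = torch.randn(12, 512)                       # 12 word segments, already 512-dim
# The word vectors need no adjustment; both modalities are now 512-dimensional.
```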
S42, splicing each dimension-adjusting word vector with the dimension-adjusting video feature vector to obtain initial word vector information of the dimension-adjusting word vector and initial video vector information of the dimension-adjusting video feature vector;
Specifically, the dimension-adjusted word vectors and the dimension-adjusted video feature vectors are concatenated and fused so as to form a context, which makes it easier to discover the global relationships between every dimension-adjusted word vector and every dimension-adjusted video feature vector. In addition, an encoder may be used to encode the input data into corresponding information. The encoder is usually a recurrent neural network, a network model that turns sequence modelling into temporal modelling: it takes sequence data as input, recurses along the direction in which the sequence evolves, and chains all of its nodes (recurrent units) together. The initial word vector information and the initial video vector information are the information obtained by feeding the dimension-adjusted word vectors and the dimension-adjusted video feature vectors into this encoder.
The concatenation and fusion may be done as follows: each dimension-adjusted video feature vector is treated as if it were a word vector of the same dimension as the dimension-adjusted word vectors, and the two sequences are then joined together.
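For illustration, the following minimal sketch concatenates the two 512-dimensional sequences and encodes them with a GRU; treating the encoder as a GRU is an assumption of the sketch, since the embodiment only states that the encoder is a recurrent neural network.

```python
# Sketch only: concatenate the two 512-dim sequences along the sequence axis and encode with a GRU.
import torch
import torch.nn as nn

dim_adjusted_words = torch.randn(1, 12, 512)   # (batch, number of word segments, 512)
dim_adjusted_video = torch.randn(1, 8, 512)    # (batch, number of frames, 512)

# Treat each video feature vector as one more "word" and join the sequences.
joint_sequence = torch.cat([dim_adjusted_words, dim_adjusted_video], dim=1)   # (1, 20, 512)

encoder = nn.GRU(input_size=512, hidden_size=512, batch_first=True)
encoded, _ = encoder(joint_sequence)            # (1, 20, 512)

initial_word_vector_info = encoded[:, :12, :]   # encodings at the word-segment positions
initial_video_vector_info = encoded[:, 12:, :]  # encodings at the video-frame positions
```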
And S43, determining the hierarchical relationship of the mutual attention layers, wherein the hierarchical relationship characterizes the connection relationship among different mutual attention layers.
Specifically, each mutual attention layer executes the mutual attention mechanism once, extracting and cross-fusing the deep features of the word vector information and of the video feature vector information.
The hierarchical relationship expresses how the different mutual attention layers are connected in sequence: the output of one mutual attention layer is fed into the next mutual attention layer, where deep features are extracted and cross-fused again.
S44, inputting word vector information and video vector information into a mutual attention layer arranged on a first layer to perform feature cross fusion, obtaining first video vector information fused with all initial word vector information according to first word vector influence weights of all initial word vector information on all initial video vector information, and obtaining first word vector information fused with all initial video vector information according to first video vector influence weights of all initial video vector information on all initial word vector information.
Specifically, take the cross fusion performed when the word vector information and the video vector information are input into the first mutual attention layer as an example. The first word-vector influence weight of each piece of initial word vector information on a piece of initial video vector information means that, when the mutual attention mechanism fuses features into that piece of initial video vector information, the influence weight of every initial word vector on it is determined. For example, suppose the initial video vector information comprises a1, b1 and c1, and the initial word vectors are a2, b2 and c2. When the influence weights of a2, b2 and c2 on a1 are n1, m1 and t1 respectively, the first video vector information corresponding to a1 is: a1+a2×n1+b2×m1+c2×t1. When the influence weights of a1, b1 and c1 on a2 are n2, m2 and t2 respectively, the first word vector information corresponding to a2 is: a2+a1×n2+b1×m2+c1×t2. The global vector information for b1, c1 and for b2, c2 is obtained in the same way.
S45, inputting the first word vector information and the first video vector information into a mutual attention layer of a next layer according to a hierarchical relationship, performing feature cross fusion again, and respectively obtaining second word vector information and second video vector influence weights; and circulating until the fused word vector information and the fused video feature vector information are obtained through the mutual attention layer output of the last layer.
Specifically, after the first word vector information and the first video vector information are obtained, they are input into the next mutual attention layer for another round of feature cross fusion, performed in the same way as the example in step S44; this is repeated until the last mutual attention layer, after all mutual attention layers have cross-fused the features, outputs the fused word vector information and the fused video feature vector information.
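For illustration, a minimal sketch of the mutual attention layers of steps S43 to S45 follows; computing the influence weights by scaled dot-product attention, adding the weighted sum back residually (which reproduces the a1+a2×n1+b2×m1+c2×t1 form of the example above) and using three stacked layers are all assumptions of the sketch, since the embodiment does not fix how the influence weights are computed.

```python
# Sketch only: one possible realisation of the mutual attention cross-fusion of steps S43-S45.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualAttentionLayer(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.scale = dim ** 0.5

    def cross(self, queries, keys_values):
        # Influence weights of every keys_values vector on every queries vector.
        weights = F.softmax(queries @ keys_values.transpose(-2, -1) / self.scale, dim=-1)
        return queries + weights @ keys_values   # residual add of the weighted sum

    def forward(self, word_info, video_info):
        fused_video = self.cross(video_info, word_info)   # video info fused with all word info
        fused_word = self.cross(word_info, video_info)    # word info fused with all video info
        return fused_word, fused_video

# Hierarchical relationship: the mutual attention layers are simply chained one after another.
layers = nn.ModuleList([MutualAttentionLayer(512) for _ in range(3)])

word_info = torch.randn(1, 12, 512)    # initial word vector information
video_info = torch.randn(1, 8, 512)    # initial video vector information
for layer in layers:                   # first layer, next layer, ..., last layer
    word_info, video_info = layer(word_info, video_info)

fused_word_vector_info, fused_video_vector_info = word_info, video_info
```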
S46, decoding the fused word vector information to obtain a fused word vector;
and S47, decoding the fused video feature vector information to obtain the fused video feature vector.
Specifically, the decoder may be used to decode the vector information to obtain a corresponding vector, and output the vector. Typically, the decoder is also a recurrent neural network.
By the method in the embodiment, the word vector can be influenced by the video feature vector, and the video feature vector can be influenced by the word vector; the finally obtained fused word vector can be fused to obtain the characteristics of the video characteristic vector, and the fused video characteristic vector can be fused to obtain the characteristics of the word vector, so that the finally obtained fused word vector and the fused video characteristic vector can more accurately represent the characteristics of text information and video information.
In some embodiments, as in the foregoing method, the step S5 of obtaining the corresponding content tag from the fused word vector and the fused video feature vector includes the following steps S51 to S53:
step S51, determining the candidate word vectors of all words in the vocabulary;
step S52, determining the candidate word vector closest in first distance to each output vector, where the output vectors include: the fused word vectors and the fused video feature vectors;
step S53, taking the word corresponding to the candidate word vector closest in first distance as the content tag corresponding to that output vector.
Specifically, the candidate word vector corresponding to each word in the vocabulary is determined first; then the first distance (typically a cosine distance) between each output vector and every candidate word vector in the vocabulary is computed, and the candidate word vector closest to each output vector is identified; finally, the word corresponding to the candidate word vector closest to each output vector is taken as the content tag for that output vector.
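For illustration, a minimal sketch of the nearest-candidate matching follows, assuming the first distance is a cosine distance (the embodiment says it typically may be) and using a small hypothetical vocabulary.

```python
# Sketch only: cosine similarity stands in for the "first distance"; vocabulary words are hypothetical.
import torch
import torch.nn.functional as F

vocabulary = ["comedy", "variety show", "dubbing", "actor"]    # hypothetical vocabulary
candidate_word_vectors = torch.randn(len(vocabulary), 512)     # one candidate vector per word

# Output vectors: the fused word vectors and the fused video feature vectors.
output_vectors = torch.randn(20, 512)

similarity = F.cosine_similarity(
    output_vectors.unsqueeze(1), candidate_word_vectors.unsqueeze(0), dim=-1
)                                    # shape: (20, vocabulary size); higher means smaller cosine distance
nearest = similarity.argmax(dim=1)   # index of the closest candidate for each output vector

content_tags = {vocabulary[i] for i in nearest.tolist()}   # words used as content tags
```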
By the method in the embodiment, each feature can be quickly matched to obtain the nearest content label so as to obtain the label capable of accurately representing the video information.
In some embodiments, as in the foregoing method, after obtaining the corresponding content tag according to the fused word vector and the fused video feature vector, the method further includes the following steps B1 to B5:
Step B1, acquiring the total number of the content tags.
Specifically, this step determines the total number of all content tags obtained in step S5.
And B2, when the total number of the content labels is larger than a preset upper limit threshold value, acquiring a second distance between the candidate word vector and the output vector corresponding to the same content label.
Specifically, the upper threshold can be set according to the actual situation; when the total number of content tags exceeds it, some content tags must be discarded to keep the tag set concise. As described in the foregoing embodiments, the word corresponding to the candidate word vector closest to each output vector is taken as the content tag for that output vector, so each content tag corresponds uniquely to one candidate word vector and one output vector.
b3, determining the corresponding relation between the content tag and the second distance;
Specifically, since the content tag, the candidate word vector and the output vector correspond to one another uniquely, and both the candidate word vector and the output vector are determined, the second distance between them is also determined, so the correspondence between each content tag and its second distance can be obtained.
Step B4, arranging the content tags in ascending order of the second distance;
Step B5, deleting, according to the correspondence, the content tags whose position in this order exceeds the upper threshold.
Specifically, once the content tags are arranged in ascending order of the second distance, their ranking is determined; the larger the distance, the lower the correlation between the two words, so only the content tags ranked within the upper threshold are kept, which ensures the accuracy of the semantics expressed by the content tags.
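For illustration, a minimal sketch of the tag screening of steps B1 to B5 follows; the use of a cosine distance as the second distance and the example tag list are assumptions of the sketch.

```python
# Sketch only: the "second distance" is assumed to be a cosine distance between each content tag's
# candidate word vector and its output vector; the tag list is hypothetical.
import torch
import torch.nn.functional as F

upper_threshold = 5                                   # preset upper limit on the number of tags

# One (candidate word vector, output vector) pair per content tag, per the unique correspondence.
tags = ["comedy", "variety show", "dubbing", "actor", "music", "stage", "host"]
candidate_vecs = torch.randn(len(tags), 512)
output_vecs = torch.randn(len(tags), 512)

if len(tags) > upper_threshold:
    second_distance = 1.0 - F.cosine_similarity(candidate_vecs, output_vecs, dim=-1)
    order = second_distance.argsort()                 # ascending: closest (most relevant) first
    kept_tags = [tags[i] for i in order[:upper_threshold].tolist()]
else:
    kept_tags = tags
```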
In some embodiments, as in the foregoing method, step S31 performs image extraction on the video information in frames to obtain at least two video frame images, including steps S311 to S314 as follows:
and S311, acquiring the total frame number of the image included in the video information.
Specifically, the number of frames per second of each video information is generally fixed, and thus, the total number of frames of the image included in the video information can be determined as long as the duration of the video information and the number of frames per second are acquired.
Step S312, determining a preset upper limit threshold of the number of images.
Specifically, extracting features from images consumes a large amount of computing resources, while the images of adjacent frames, especially transition frames, generally do not suddenly show new information; extracting features from all of them would therefore waste computation. Setting an upper threshold on the number of images in advance effectively bounds the amount of feature-extraction computation.
S313, determining a preset extraction strategy for extracting images from the video information according to the numerical relationship between the total frame number and the upper threshold on the number of images.
Specifically, the numerical relation between the total frame number and the upper threshold of the image number can be a ratio relation or a difference relation; different numerical relationships may correspond to different extraction strategies, for example: when the numerical relation is a ratio relation and the total frame number is 10 times of the upper limit threshold of the image number, the extraction strategy may be: one image is extracted every 10 frames.
And S314, carrying out image extraction on the video information according to the extraction strategy and frames to obtain video frame images with the number smaller than or equal to the number corresponding to the upper limit threshold of the number of the images.
Specifically, the matched extraction strategy guarantees that the number of video frame images finally obtained from the video information is less than or equal to the upper threshold on the number of images. Once the extraction strategy is determined, images can be extracted from the video information according to its rule, so that the number of final video frame images meets the requirement; this avoids excessive processing time and wasted computing resources caused by too many video frame images, and effectively improves processing efficiency.
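For illustration, a minimal sketch of steps S311 to S314 follows, assuming OpenCV reads the video and the ratio-based strategy mentioned above (sampling one image every ⌈total frames / upper threshold⌉ frames) is used; the file name is hypothetical.

```python
# Sketch only: OpenCV reads the video; the ratio-based sampling strategy is one of the options above.
import math
import cv2

upper_threshold = 32                                    # preset upper limit on the number of images

cap = cv2.VideoCapture("video.mp4")                     # hypothetical input file
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # total frame number of the video

step = max(1, math.ceil(total_frames / upper_threshold))   # extraction strategy from the ratio

video_frame_images = []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:                               # keep one image every `step` frames
        video_frame_images.append(frame)
    index += 1
cap.release()
# At most roughly upper_threshold video frame images are kept.
```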
Application example:
The method of any of the above embodiments was tested and compared with an existing model from the related art. The test set consists of 6,000 pieces of data labelled by two annotators, i.e. for each piece of data the reference tags include the labelling results of both annotators. The results are shown in Table 1:
TABLE 1
As can be seen, the model obtained with the method of this embodiment shows a clear improvement on all three evaluation criteria: recall, precision and F-score.
As shown in fig. 4, according to an embodiment of another aspect of the present invention, there is also provided a data processing apparatus including:
an acquisition module 1 for acquiring text information and video information for generating content tags;
a determining module 2, configured to determine a word vector of each word in the text information;
the vector acquisition module 3 is used for extracting the characteristics of the video information to obtain video characteristic vectors corresponding to the video information;
the feature fusion module 4 is used for carrying out cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the tag determining module 5 is used for obtaining a corresponding content tag according to the fused word vector and the fused video feature vector.
In some embodiments, as in the previous apparatus, the determining module 2 comprises:
the word segmentation unit is used for carrying out word segmentation processing on the text information to obtain word segmentation forming the text information;
the vocabulary unit is used for obtaining a corresponding vocabulary according to the word segmentation and the preset tag word;
and the word vector unit is used for determining the word vector of each word according to the word vector model and the word list which are obtained through training in advance.
In some embodiments, as in the previous apparatus, the vector acquisition module 3 comprises:
the extraction unit is used for extracting the video information according to the frames to obtain at least two video frame images;
the input unit is used for inputting the video frame images into the deep neural network respectively;
the acquisition unit is used for acquiring video feature vectors obtained after feature extraction of the video frame images by the feature extraction layer in the deep neural network.
In some embodiments, as in the previous apparatus, the feature fusion module 4 comprises:
the dimension adjusting unit is used for carrying out vector dimension adjustment on the word vectors to obtain dimension adjusting word vectors, and carrying out vector dimension adjustment on the video feature vectors to obtain dimension adjusting video feature vectors;
the splicing unit is used for splicing each dimension-adjusting word vector with the dimension-adjusting video feature vector to obtain initial word vector information of the dimension-adjusting word vector and initial video vector information of the dimension-adjusting video feature vector;
The relationship determination unit is used for determining the hierarchical relationship of the mutual attention layers, wherein the hierarchical relationship characterizes the connection relationship among different mutual attention layers;
the fusion unit is used for inputting word vector information and video vector information into a mutual attention layer arranged on a first layer to perform cross fusion of features, obtaining first video vector information fused with all initial word vector information according to first word vector influence weights of the initial word vector information on the initial video vector information, and obtaining first word vector information fused with all initial video vector information according to first video vector influence weights of the initial video vector information on the initial word vector information;
the output unit is used for inputting the first word vector information and the first video vector information into the mutual attention layer of the next layer according to the hierarchical relationship, carrying out feature cross fusion again, and respectively obtaining second word vector information and second video vector influence weight; according to the circulation, outputting the fused word vector information and the fused video feature vector information through the mutual attention layer of the last layer;
the first decoding unit is used for decoding the fused word vector information to obtain a fused word vector;
And the second decoding unit is used for decoding the fused video feature vector information to obtain the fused video feature vector.
In some embodiments, as in the previous apparatus, the tag determination module 5 comprises:
a first determining unit, configured to determine candidate word vectors of each word in the vocabulary;
a second determining unit, configured to determine candidate word vectors closest to the first distance of each output vector, where the output vector includes: the fused word vector and the fused video feature vector;
and the label determining unit is used for taking the word corresponding to the candidate word vector with the nearest first distance as the content label corresponding to the output vector.
In some embodiments, an apparatus as previously described, further comprising: a tag screening module; the tag screening module comprises:
a total number unit for acquiring the total number of the content tags;
the second distance unit is used for acquiring a second distance between the candidate word vector and the output vector corresponding to the same content label when the total number of the content labels is larger than a preset upper limit threshold value;
a third determining unit configured to determine a correspondence between the content tag and the second distance;
an arrangement unit for arranging the content tags from small to large according to a second distance;
And the screening unit is used for deleting the content labels with the arrangement order larger than the upper limit threshold according to the corresponding relation.
In some embodiments, such as the foregoing apparatus, the extraction unit comprises:
a total frame number subunit, configured to obtain a total frame number of an image included in the video information;
a threshold subunit, configured to determine a preset upper limit threshold of the number of images;
the strategy subunit is used for determining a preset extraction strategy for extracting the images of the video information according to the numerical relation between the total frame number and the upper limit threshold value of the number of the images;
and the image determining subunit is used for extracting the images of the video information according to the extraction strategy and obtaining video frame images with the number smaller than or equal to the number corresponding to the upper limit threshold value of the number of the images.
The embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, where the processor 1501, the communication interface 1502 and the memory 1503 complete communication with each other through the communication bus 1504,
a memory 1503 for storing a computer program;
the processor 1501, when executing the program stored in the memory 1503, performs the following steps:
the communication bus mentioned by the above terminal may be a peripheral component interconnect standard (Peripheral Component Interconnect, abbreviated as PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated as EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the figures are shown with only one bold line, but not with only one bus or one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processing, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which instructions are stored, which when executed on a computer, cause the computer to perform the data processing method for generating content tags according to any one of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data processing method of generating content tags as described in any of the above embodiments.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A data processing method, comprising:
acquiring video information for which content tags are to be generated, and text information describing the video information;
determining a word vector for each word segment in the text information, comprising: performing word segmentation on the text information to obtain the word segments constituting the text information; obtaining a corresponding word list according to the word segments and preset tag words; and determining the word vector of each word segment according to the word list and a word vector model obtained through pre-training (see the illustrative sketches following this claim);
performing feature extraction on the video information to obtain video feature vectors corresponding to the video information;
cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to obtain fused word vectors and fused video feature vectors respectively, wherein the cross fusion comprises the following steps: performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the video feature vectors to obtain dimension-adjusted video feature vectors; concatenating the dimension-adjusted word vectors with the dimension-adjusted video feature vectors to obtain initial word vector information of the dimension-adjusted word vectors and initial video vector information of the dimension-adjusted video feature vectors; determining a hierarchical relationship of the mutual attention layers, the hierarchical relationship characterizing the connection relationship between different mutual attention layers; inputting the word vector information and the video vector information into the mutual attention layer arranged at the first layer for feature cross fusion, obtaining each piece of first video vector information fused with all the initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all the initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information; inputting the first word vector information and the first video vector information into the mutual attention layer of the next layer according to the hierarchical relationship for another round of feature cross fusion, so as to obtain second word vector information and second video vector information respectively; continuing in this manner until the fused word vector information and the fused video feature vector information are output by the mutual attention layer of the last layer; decoding the fused word vector information to obtain the fused word vectors; and decoding the fused video feature vector information to obtain the fused video feature vectors (also illustrated in the sketches following this claim);
and obtaining the corresponding content tags according to the fused word vectors and the fused video feature vectors.
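For illustration only, the word-segmentation and word-list step recited in claim 1 can be pictured as a segmentation pass followed by a lookup in a pre-trained word-vector model. The sketch below is a minimal reading under assumptions: the jieba segmenter, the gensim KeyedVectors model, and the PRESET_TAG_WORDS placeholder are illustrative choices, not what the patent specifies.

```python
# Illustrative sketch only: jieba, gensim, and all names here are assumptions,
# not the segmenter or word-vector model specified by the patent.
import jieba
from gensim.models import KeyedVectors

PRESET_TAG_WORDS = ["variety_show", "celebrity", "interview"]  # hypothetical preset tag words


def build_word_list(text: str) -> list:
    """Segment the descriptive text and append the preset tag words."""
    segments = [w for w in jieba.cut(text) if w.strip()]
    return segments + PRESET_TAG_WORDS


def lookup_word_vectors(word_list, kv: KeyedVectors) -> dict:
    """Look up each word of the word list in a pre-trained word-vector model."""
    return {w: kv[w] for w in word_list if w in kv.key_to_index}
```

In use, the video title or description would pass through build_word_list and then lookup_word_vectors before the fusion stage.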
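Likewise for illustration only, the stacked mutual-attention cross fusion of claim 1 can be approximated with off-the-shelf cross-attention layers: in each layer the video vector information attends over the word vector information (yielding video information fused with all word information) and vice versa, and the outputs of one layer feed the next according to the hierarchical relationship. Everything below, including the use of PyTorch's nn.MultiheadAttention, the common dimension, the layer count, and the linear "decoding" heads, is an assumed sketch rather than the patented model, and it omits the concatenation detail recited in the claim.

```python
# Minimal cross-fusion sketch (assumptions throughout); not the patented model.
import torch
import torch.nn as nn


class MutualAttentionLayer(nn.Module):
    """One mutual attention layer: each modality attends over the other."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.word_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_word = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, word_info, video_info):
        # video vectors fused with all word vectors (word-vector influence weights)
        fused_video, _ = self.word_to_video(video_info, word_info, word_info)
        # word vectors fused with all video vectors (video-vector influence weights)
        fused_word, _ = self.video_to_word(word_info, video_info, video_info)
        return fused_word, fused_video


class CrossFusion(nn.Module):
    """Stack of mutual attention layers connected in sequence (the hierarchical relationship)."""
    def __init__(self, word_dim: int, video_dim: int, dim: int = 256, num_layers: int = 2):
        super().__init__()
        # "dimension adjustment": project both modalities to a common size
        self.word_proj = nn.Linear(word_dim, dim)
        self.video_proj = nn.Linear(video_dim, dim)
        self.layers = nn.ModuleList(MutualAttentionLayer(dim) for _ in range(num_layers))
        # "decoding" of the fused information, sketched here as linear heads
        self.word_decoder = nn.Linear(dim, dim)
        self.video_decoder = nn.Linear(dim, dim)

    def forward(self, word_vecs, video_vecs):
        # word_vecs: (batch, num_words, word_dim); video_vecs: (batch, num_frames, video_dim)
        word_info = self.word_proj(word_vecs)
        video_info = self.video_proj(video_vecs)
        for layer in self.layers:  # first layer, next layer, ..., last layer
            word_info, video_info = layer(word_info, video_info)
        return self.word_decoder(word_info), self.video_decoder(video_info)
```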
2. The method according to claim 1, wherein performing feature extraction on the video information to obtain the video feature vectors corresponding to the video information comprises:
performing image extraction on the video information frame by frame to obtain at least two video frame images;
respectively inputting the video frame images into a deep neural network;
and acquiring the video feature vectors obtained after a feature extraction layer in the deep neural network performs feature extraction on the video frame images.
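As a hedged illustration of claim 2, the "feature extraction layer in the deep neural network" can be read as the penultimate layer of any pre-trained image backbone; the torchvision ResNet-50 used below is an assumption, not the network named by the patent.

```python
# Illustrative only: any deep network with a feature-extraction layer would do.
import torch
import torchvision.models as models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
feature_extractor.eval()


@torch.no_grad()
def frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 224, 224), already normalised -> (num_frames, 2048)."""
    return feature_extractor(frames).flatten(1)
```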
3. The method according to claim 1, wherein obtaining the corresponding content tags according to the fused word vectors and the fused video feature vectors comprises:
determining candidate word vectors of all words in the word list;
determining, for each output vector, the candidate word vector with the smallest first distance to that output vector, wherein the output vectors comprise: the fused word vectors and the fused video feature vectors;
and taking the word corresponding to the nearest candidate word vector as the content tag corresponding to that output vector.
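Claim 3 is, in effect, a nearest-neighbour lookup from each output vector into the candidate word vectors of the word list. A minimal sketch, assuming Euclidean distance as the "first distance" (the claim does not fix the metric):

```python
# Nearest-neighbour tag lookup (sketch; the distance metric is an assumption).
import numpy as np


def tags_from_outputs(output_vecs, candidate_vecs, candidate_words):
    """For each output vector, return the word whose candidate vector is closest."""
    tags = []
    for out in output_vecs:
        dists = np.linalg.norm(candidate_vecs - out, axis=1)  # the "first distance"
        tags.append(candidate_words[int(np.argmin(dists))])
    return tags
```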
4. The method according to claim 3, further comprising, after obtaining the corresponding content tags according to the fused word vectors and the fused video feature vectors:
acquiring the total number of the content tags;
when the total number of the content tags is larger than a preset upper threshold, acquiring the second distance between the candidate word vector and the output vector that correspond to the same content tag;
determining a correspondence between each content tag and its second distance;
sorting the content tags in ascending order of the second distance;
and deleting, according to the correspondence, the content tags whose position in the sorted order exceeds the upper threshold.
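Claim 4 prunes the tag set by sorting on the second distance and keeping at most the preset upper threshold of tags. A possible reading, with invented names:

```python
# Sketch of the tag-pruning step in claim 4 (names and threshold are invented).
def prune_tags(tag_to_distance: dict, upper_threshold: int) -> list:
    """Keep at most `upper_threshold` tags, preferring the smallest second distance."""
    if len(tag_to_distance) <= upper_threshold:
        return list(tag_to_distance)
    ranked = sorted(tag_to_distance, key=tag_to_distance.get)  # ascending second distance
    return ranked[:upper_threshold]                            # delete tags ranked beyond the cap
```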
5. The method according to claim 3, wherein performing image extraction on the video information frame by frame to obtain at least two video frame images comprises:
acquiring the total number of image frames included in the video information;
determining a preset upper threshold on the number of images;
determining a preset extraction strategy for extracting images from the video information according to the numerical relationship between the total number of frames and the upper threshold on the number of images;
and performing frame-by-frame image extraction on the video information according to the extraction strategy to obtain a number of video frame images less than or equal to the upper threshold on the number of images.
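Claim 5 leaves the extraction strategy open; one plausible strategy (an assumption, not the patent's) keeps every frame when the total frame count is at or below the image-number cap and otherwise samples frames uniformly:

```python
# One possible extraction strategy (uniform sub-sampling); the patent does not fix it.
def select_frame_indices(total_frames: int, max_images: int) -> list:
    if total_frames <= max_images:
        return list(range(total_frames))                    # keep every frame
    step = total_frames / max_images
    return [int(i * step) for i in range(max_images)]       # uniform sub-sampling
```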
6. A data processing apparatus, comprising:
the acquisition module is used for acquiring video information for which content tags are to be generated, and text information describing the video information;
the determining module is used for determining a word vector for each word segment in the text information;
the determining module comprises:
the word segmentation unit is used for performing word segmentation on the text information to obtain the word segments constituting the text information;
the word list unit is used for obtaining a corresponding word list according to the word segments and preset tag words;
the word vector unit is used for determining the word vector of each word segment according to the word list and a word vector model obtained through pre-training;
the vector acquisition module is used for performing feature extraction on the video information to obtain video feature vectors corresponding to the video information;
the feature fusion module is used for cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to obtain fused word vectors and fused video feature vectors respectively;
the feature fusion module comprises:
the dimension adjusting unit is used for performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the video feature vectors to obtain dimension-adjusted video feature vectors;
the splicing unit is used for concatenating each dimension-adjusted word vector with the dimension-adjusted video feature vectors to obtain initial word vector information of the dimension-adjusted word vectors and initial video vector information of the dimension-adjusted video feature vectors;
the relationship determining unit is used for determining the hierarchical relationship of the mutual attention layers, wherein the hierarchical relationship characterizes the connection relationship between different mutual attention layers;
the fusion unit is used for inputting the word vector information and the video vector information into the mutual attention layer arranged at the first layer for feature cross fusion, obtaining each piece of first video vector information fused with all the initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all the initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information;
the output unit is used for inputting the first word vector information and the first video vector information into the mutual attention layer of the next layer according to the hierarchical relationship for another round of feature cross fusion, so as to obtain second word vector information and second video vector information respectively, and for continuing in this manner until the fused word vector information and the fused video feature vector information are output by the mutual attention layer of the last layer;
the first decoding unit is used for decoding the fused word vector information to obtain the fused word vectors;
the second decoding unit is used for decoding the fused video feature vector information to obtain the fused video feature vectors;
and the tag determining module is used for obtaining the corresponding content tags according to the fused word vectors and the fused video feature vectors.
7. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
and the processor is used for implementing the method steps of any one of claims 1-5 when executing the program stored on the memory.
8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-5.
CN202010588592.9A 2020-06-24 2020-06-24 Data processing method and device Active CN111767726B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010588592.9A CN111767726B (en) 2020-06-24 2020-06-24 Data processing method and device

Publications (2)

Publication Number Publication Date
CN111767726A CN111767726A (en) 2020-10-13
CN111767726B true CN111767726B (en) 2024-02-06

Family

ID=72722324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010588592.9A Active CN111767726B (en) 2020-06-24 2020-06-24 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111767726B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580599A (en) * 2020-12-30 2021-03-30 北京达佳互联信息技术有限公司 Video identification method and device and computer readable storage medium
CN113010740B (en) * 2021-03-09 2023-05-30 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012167568A1 (en) * 2011-11-23 2012-12-13 华为技术有限公司 Video advertisement broadcasting method, device and system
CN107239801A (en) * 2017-06-28 2017-10-10 安徽大学 Video attribute represents that learning method and video text describe automatic generation method
WO2018124309A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Corporation Method and system for multi-modal fusion model
WO2019024704A1 (en) * 2017-08-03 2019-02-07 阿里巴巴集团控股有限公司 Entity annotation method, intention recognition method and corresponding devices, and computer storage medium
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102325259A (en) * 2011-09-09 2012-01-18 青岛海信数字多媒体技术国家重点实验室有限公司 Method and device for synthesizing virtual viewpoints in multi-viewpoint video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant