CN111767726A - Data processing method and device - Google Patents
- Publication number
- CN111767726A (application CN202010588592.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- vector
- word
- information
- fused
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The embodiment of the invention provides a data processing method and device, wherein the method includes the following steps: acquiring video information for generating a content label and text information describing the video information; determining a word vector for each word segment in the text information; performing feature extraction on the video information to obtain a corresponding video feature vector; cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors, respectively; and obtaining the corresponding content label from the fused word vectors and the fused video feature vectors. With the method of this embodiment, the features of the text information and the video information are cross-fused and their cross-modal information is extracted, making the resulting content labels more accurate.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a data processing method and device.
Background
At present, data-labeling methods are mainly text-based. However, video content carries many expressible features, and it is difficult to represent all the information in a video completely through text. When the text content consists of only a few phrases, the information those phrases provide is limited; without the accompanying video content, the text may fail to capture the main information, and analysis may yield little that is useful.
The prior art offers related solutions to these problems, but most existing image-text fusion methods simply concatenate the features of the two modalities at the input. Although this yields more features at the encoder, the text and video features remain independent of each other, the effect is limited, and the video information cannot be fully exploited at the decoder.
No effective solution to these technical problems in the related art has yet been proposed.
Disclosure of Invention
An embodiment of the present invention provides a data processing method and apparatus, so as to solve at least one of the technical problems in the related art. The specific technical solution is as follows:
in a first aspect of the present invention, there is provided a data processing method, including:
acquiring video information for generating a content label and text information describing the video information;
determining a word vector for each word segment in the text information;
performing feature extraction on the video information to obtain a corresponding video feature vector;
cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors, respectively;
and obtaining the corresponding content label from the fused word vectors and the fused video feature vectors.
Optionally, as in the foregoing method, determining a word vector for each word segment in the text information includes:
performing word segmentation on the text information to obtain the word segments that make up the text information;
obtaining a corresponding word list from the word segments and preset tag words;
and determining the word vector of each word segment from a pre-trained word vector model and the word list.
Optionally, as in the foregoing method, the performing feature extraction on the video information to obtain a video feature vector corresponding to the video information includes:
performing image extraction on the video information according to frames to obtain at least two video frame images;
respectively inputting the video frame images into a deep neural network;
and obtaining the video feature vector obtained by performing feature extraction on the video frame image by a feature extraction layer in the deep neural network.
Optionally, as in the foregoing method, cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors includes:
performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the video feature vectors to obtain dimension-adjusted video feature vectors;
splicing each dimension-adjusted word vector with the dimension-adjusted video feature vectors to obtain initial word vector information for the dimension-adjusted word vectors and initial video vector information for the dimension-adjusted video feature vectors;
determining a hierarchical relationship of the mutual-attention layers, where the hierarchical relationship represents the connection relationship between different mutual-attention layers;
inputting the word vector information and the video vector information into the mutual-attention layer arranged as the first layer for cross fusion of features: according to the first word-vector influence weight of each piece of initial word vector information on each piece of initial video vector information, obtaining each piece of first video vector information into which all initial word vector information is fused; and according to the first video-vector influence weight of each piece of initial video vector information on each piece of initial word vector information, obtaining each piece of first word vector information into which all initial video vector information is fused;
inputting the first word vector information and the first video vector information into the next mutual-attention layer according to the hierarchical relationship for another round of cross fusion, obtaining second word vector information and second video vector information, respectively, and repeating these steps until the last mutual-attention layer outputs fused word vector information and fused video feature vector information;
decoding the fused word vector information to obtain the fused word vectors;
and decoding the fused video feature vector information to obtain the fused video feature vectors.
Optionally, as in the foregoing method, obtaining the corresponding content label from the fused word vector and the fused video feature vector includes:
determining candidate word vectors for all words in the word list;
determining, for each output vector, the candidate word vector at the smallest first distance from it, where the output vectors include the fused word vector and the fused video feature vector;
and taking the word corresponding to that closest candidate word vector as the content label for the output vector.
Optionally, as in the foregoing method, after obtaining the corresponding content tags from the fused word vector and the fused video feature vector, the method further includes:
acquiring the total number of the content tags;
when the total number of content tags is greater than a preset upper-limit threshold, acquiring the second distance between the candidate word vector and the output vector corresponding to the same content tag;
determining the correspondence between each content tag and its second distance;
sorting the content tags by the second distance in ascending order;
and deleting, according to the correspondence, the content tags whose rank exceeds the upper-limit threshold.
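The screening steps above can be sketched as follows; this is a minimal illustrative implementation, and the function name, tag/distance pairing, and list layout are assumptions not taken from the patent text.

```python
# Hypothetical sketch of the tag-screening step: keep at most `upper_limit`
# content tags, preferring smaller second distances; tags ranked beyond the
# upper-limit threshold are deleted.
def screen_tags(tag_distances, upper_limit):
    if len(tag_distances) <= upper_limit:
        return list(tag_distances)          # total number within the threshold
    # sort the tags by their second distance, ascending (small to large)
    ranked = sorted(tag_distances, key=lambda t: t[1])
    # delete the tags whose rank exceeds the upper-limit threshold
    return ranked[:upper_limit]

tags = [("actor", 0.42), ("comedy", 0.13), ("music", 0.31), ("news", 0.77)]
print(screen_tags(tags, 2))   # the two closest tags survive
```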
Optionally, as in the foregoing method, performing image extraction on the video information by frame to obtain at least two video frame images includes:
acquiring the total number of frames in the video information;
determining a preset upper-limit threshold on the number of images;
determining, from the numerical relationship between the total frame count and the image-count threshold, a preset extraction strategy for extracting images from the video information;
and performing image extraction on the video information by frame according to that strategy, obtaining a number of video frame images no greater than the image-count threshold.
In a second aspect of the present invention, there is also provided a data processing apparatus comprising:
an acquisition module, configured to acquire video information for generating a content label and text information describing the video information;
the determining module is used for determining a word vector of each word segmentation in the text information;
the vector acquisition module is used for extracting the characteristics of the video information to obtain a video characteristic vector corresponding to the video information;
the feature fusion module is used for performing cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the label determining module is used for obtaining the corresponding content label according to the fused word vector and the fused video feature vector.
Optionally, as in the foregoing apparatus, the determining module includes:
a word segmentation unit, configured to perform word segmentation on the text information to obtain the word segments that make up the text information;
a word list unit, configured to obtain a corresponding word list from the word segments and preset tag words;
and a word vector unit, configured to determine the word vector of each word segment from a pre-trained word vector model and the word list.
Optionally, as in the foregoing apparatus, the vector obtaining module includes:
the extraction unit is used for carrying out image extraction on the video information according to frames to obtain at least two video frame images;
the input unit is used for respectively inputting the video frame images into the deep neural network;
and the obtaining unit is used for obtaining the video feature vector obtained after the feature extraction layer in the deep neural network extracts the features of the video frame image.
Optionally, in the foregoing apparatus, the feature fusion module includes:
the dimension adjusting unit is used for carrying out vector dimension adjustment on the word vectors to obtain dimension adjusting word vectors and carrying out vector dimension adjustment on the video feature vectors to obtain the dimension adjusting video feature vectors;
the splicing unit is used for splicing each dimensionality-adjusting word vector with the dimensionality-adjusting video feature vector to obtain initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
the relationship determination unit is used for determining the hierarchical relationship of the mutual attention layers, and the hierarchical relationship represents the connection relationship between different mutual attention layers;
a fusion unit, configured to input the word vector information and the video vector information into the mutual attention layer disposed on a first layer to perform cross fusion of features, obtain, according to a first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, each piece of first video vector information in which all pieces of initial word vector information are fused, and obtain, according to a first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information, each piece of first word vector information in which all pieces of initial video vector information are fused;
an output unit, configured to input the first word vector information and the first video vector information into the next mutual-attention layer according to the hierarchical relationship for another round of cross fusion, obtaining second word vector information and second video vector information, respectively, and to repeat these steps until the last mutual-attention layer outputs fused word vector information and fused video feature vector information;
the first decoding unit is used for decoding the fused word vector information to obtain the fused word vector;
and the second decoding unit is used for decoding the fused video feature vector information to obtain the fused video feature vector.
Optionally, in the foregoing apparatus, the tag determining module includes:
the first determining unit is used for determining candidate word vectors of all words in the word list;
a second determining unit, configured to determine, for each output vector, the candidate word vector at the smallest first distance from it, where the output vectors include the fused word vector and the fused video feature vector;
and a label determining unit, configured to take the word corresponding to that closest candidate word vector as the content label for the output vector.
Optionally, the apparatus as described above, further comprising: a tag screening module; the tag screening module comprises:
a total number unit for acquiring a total number of the content tags;
a second distance unit, configured to obtain a second distance between the candidate word vector and the output vector corresponding to the same content tag when the total number of the content tags is greater than a preset upper threshold;
a third determining unit configured to determine a correspondence between the content tag and the second distance;
the arrangement unit is used for arranging the content labels according to the second distance from small to large;
and the screening unit is used for deleting the content tags of which the arrangement order is greater than the upper limit threshold value according to the corresponding relation.
Optionally, as in the foregoing apparatus, the extracting unit includes:
a total frame number subunit, configured to obtain a total frame number of an image included in the video information;
the threshold subunit is used for determining a preset upper limit threshold of the number of images;
a strategy subunit, configured to determine, from the numerical relationship between the total frame count and the image-count threshold, a preset extraction strategy for extracting images from the video information;
and the image determining subunit is used for performing image extraction on the video information by frames according to the extraction strategy to obtain video frame images with the number less than or equal to the number corresponding to the upper limit threshold of the number of the images.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
The embodiment of the invention provides a data processing method and device, wherein the method includes: acquiring video information for generating a content label and text information describing the video information; determining a word vector for each word segment in the text information; performing feature extraction on the video information to obtain a corresponding video feature vector; cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors, respectively; and obtaining the corresponding content label from the fused word vectors and the fused video feature vectors. With the method of this embodiment, the features of the text information and the video information are cross-fused and their cross-modal information is extracted, making the resulting content labels more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
Fig. 1 shows a data processing method according to an embodiment of the present application, which includes the following steps S1 to S5:
s1, video information used for generating a content label and text information used for describing the video information are obtained.
Specifically, the video information is a sequence of consecutive pictures that exhibits a smooth, continuous visual effect; video typically runs at more than 24 frames per second. The text information may be one or more keywords, a long sentence, an article, and so on. The method extracts keywords from, and labels, data that contains both video information and text information, so the text information and video information belong to the same piece of data. For example, when the video information is a video clip, the text information may be text summarizing that clip.
And S2, determining a word vector of each word in the text information.
Specifically, both machine learning and deep learning essentially operate on numbers; the task of a word vector is to map a word into a vector space and represent it as a vector. Conceptually, this is a mathematical embedding from a one-dimensional space per word into a continuous vector space of lower dimension. Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge-base methods, and explicit representations in terms of the contexts in which words occur.
Determining a word vector for each word segment in the text information may be implemented with language-model methods such as word2vec, GloVe, ELMo, or BERT.
And S3, extracting the characteristics of the video information to obtain a video characteristic vector corresponding to the video information.
Specifically, performing feature extraction on the video information generally means extracting images from the video, that is, extracting individual frames and then recognizing them to obtain the key information in the video.
And S4, performing cross fusion on the characteristics of the word vectors and the characteristics of the video characteristic vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video characteristic vectors.
Specifically, a mutual-attention mechanism can be used to discover the hidden dependencies between the word vectors and the video feature vectors, so that the video feature vectors supply complementary information to the word vectors and the word vectors supply complementary information to the video feature vectors, giving a higher degree of information fusion between the two.
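The mutual attention of step S4 can be sketched as a pair of cross-attention passes, one per direction. This is a hedged illustration, not the patent's exact formulation: the 512-dimensional size, the scaled-dot-product softmax weights, and the residual addition are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_attention(words, frames):
    """One illustrative mutual-attention layer.
    words: (num_words, d); frames: (num_frames, d), same dimension d.
    Returns (fused_words, fused_frames) with unchanged shapes."""
    d = words.shape[1]
    # influence weights of the video vectors on each word vector
    w2v = softmax(words @ frames.T / np.sqrt(d), axis=1)  # (num_words, num_frames)
    # influence weights of the word vectors on each video vector
    v2w = softmax(frames @ words.T / np.sqrt(d), axis=1)  # (num_frames, num_words)
    fused_words = words + w2v @ frames    # each word vector fuses all video vectors
    fused_frames = frames + v2w @ words   # each video vector fuses all word vectors
    return fused_words, fused_frames

rng = np.random.default_rng(0)
fw, ff = mutual_attention(rng.normal(size=(5, 512)), rng.normal(size=(8, 512)))
print(fw.shape, ff.shape)   # shapes preserved: (5, 512) (8, 512)
```

Stacking several such layers, each fed the previous layer's outputs, corresponds to the hierarchical relationship of mutual-attention layers described earlier.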
And S5, obtaining a corresponding content label according to the fused word vector and the fused video feature vector.
Specifically, the content label may be obtained from the fused word vector and the fused video feature vector in either of two ways:
1) obtain labels separately from the fused word vector and from the fused video feature vector, and then combine those labels into the content labels;
2) fuse the fused word vector and the fused video feature vector again, so that the two influence each other once more, and obtain the content labels from the resulting vectors.
With the method of this embodiment, when the text information lacks comprehensive or key information, the video features can be used and the information contained in the video information brought to bear, ultimately improving both the recall and the accuracy of the labels.
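The nearest-candidate lookup used when mapping output vectors to content labels can be sketched as follows. Everything here is illustrative: the function name, the Euclidean "first distance", and the toy two-dimensional vectors are assumptions made for the example.

```python
import numpy as np

def tags_from_outputs(output_vectors, candidate_vectors, words):
    """Map each output vector (a fused word or video vector) to the word
    whose candidate word vector lies at the smallest distance from it."""
    tags = []
    for out in output_vectors:
        dists = np.linalg.norm(candidate_vectors - out, axis=1)  # first distances
        tags.append(words[int(dists.argmin())])                  # closest candidate wins
    return tags

words = ["cat", "dog", "car"]
cands = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
outs = np.array([[0.9, 0.1], [0.9, 1.1]])
print(tags_from_outputs(outs, cands, words))   # ['cat', 'car']
```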
In some embodiments, as in the foregoing method, determining a word vector for each word segment in the text information includes the following steps A1 to A3:
A1, performing word segmentation on the text information to obtain the word segments that make up the text information;
A2, obtaining a corresponding word list from the word segments and the preset tag words;
and A3, determining the word vector of each word segment from the pre-trained word vector model and the word list.
Specifically, word segmentation splits a piece of text into individual words. For example, when the text information is a short film-review sentence, the resulting word segments are its component words: the nouns, verbs, modifiers, and particles that make up the sentence.
The preset tag words may be phrases selected in advance; the words in the word list then comprise both the tag words and the word segments obtained by segmenting the text information.
The pre-trained word vector model may be a word2vec model (a tool for computing word vectors); the word vector for each word segment can thus be determined with the trained word2vec model.
Specifically, once the word list and the model are determined, the word vector of each word segment in the word list can be determined. Further, the words in the word list may each be randomly initialized as 512-dimensional vectors, serving respectively as the word vector of each word segment and the tag vector (the word vector of a tag word).
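Building the word list and randomly initializing a 512-dimensional vector per entry, as just described, can be sketched as below. The de-duplication rule, the RNG seed, and the lookup index are illustrative assumptions; in practice these vectors would be trained rather than left random.

```python
import numpy as np

def build_embeddings(segments, tag_words, dim=512, seed=0):
    """Word list = word segments + preset tag words, each mapped to a
    randomly initialized `dim`-dimensional vector."""
    vocab = list(dict.fromkeys(segments + tag_words))  # ordered, de-duplicated
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(len(vocab), dim))         # one random row per word
    index = {w: i for i, w in enumerate(vocab)}        # word -> row lookup
    return vocab, table, index

vocab, table, index = build_embeddings(["video", "sound", "video"], ["comedy"])
print(len(vocab), table.shape)   # 3 (3, 512)
```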
With the method of this embodiment, the relationships among the word segments in the text information can be captured by the word vectors, the semantics of each word segment can be obtained effectively, and the accuracy of the labeling results can be improved.
As shown in fig. 2, in some embodiments, as in the foregoing method, step S3 of performing feature extraction on the video information to obtain a corresponding video feature vector includes the following steps S31 to S33:
S31, performing image extraction on the video information by frame to obtain at least two video frame images;
S32, inputting the video frame images into a deep neural network;
and S33, obtaining the video feature vector produced by the feature extraction layer of the deep neural network from the video frame images.
Specifically, images may be extracted from the video information frame by frame; in addition, since the images of adjacent frames may be nearly identical, images may instead be extracted at fixed intervals, with the specific extraction strategy chosen according to the actual situation. A video frame image is an image obtained by this extraction from the video information.
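The frame-selection strategy above can be sketched as follows. The uniform-interval rule is an assumption about the "preset extraction strategy"; the patent leaves the concrete rule open.

```python
def select_frame_indices(total_frames, max_images):
    """If the video has no more frames than the upper-limit threshold,
    take every frame; otherwise sample at a fixed interval so that at
    most `max_images` frames are kept."""
    if total_frames <= max_images:
        return list(range(total_frames))   # keep every frame
    step = total_frames / max_images       # fixed sampling interval
    return [int(i * step) for i in range(max_images)]

print(select_frame_indices(6, 10))   # [0, 1, 2, 3, 4, 5]
print(select_frame_indices(100, 4))  # [0, 25, 50, 75]
```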
Because the deep neural network has the capability of extracting the features of the image information, the corresponding video feature vector can be obtained by inputting the video frame image into the deep neural network.
One optional implementation is as follows: the video frame images are input into an Xception (depthwise-separable convolution) model, and the 2048-dimensional vector of the model's penultimate layer is extracted as the video feature, because the features extracted at the penultimate layer are the richest.
With the method of this embodiment, extracting images by frame filters out highly repetitive images, avoids repeatedly extracting the same video feature vector, and effectively improves extraction efficiency; performing the extraction through the feature extraction layer of a deep neural network yields rich video feature vectors and captures more of the information the video provides.
As shown in fig. 3, in some embodiments, as in the foregoing method, step S4 of cross-fusing the features of the word vectors and the features of the video feature vectors includes the following steps S41 to S47:
and S41, carrying out vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and carrying out vector dimension adjustment on the video feature vectors to obtain dimension-adjusted video feature vectors.
Specifically, on the basis of the foregoing embodiment, the word vector of each segmented word may be 512-dimensional while the video feature vector may be 2048-dimensional; since the two dimensionalities differ, they must be unified before the word vectors and video feature vectors can be concatenated and fused. Optionally, because the video feature vector has the higher dimensionality, only it needs to be reduced: the 2048-dimensional video feature vector is generally passed through a fully-connected network to obtain a 512-dimensional dimension-adjusted video feature vector.
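A minimal sketch of that fully-connected dimensionality reduction, assuming a single 2048-dimensional video feature vector and randomly initialized (untrained) layer weights:

```python
import numpy as np

rng = np.random.default_rng(0)
video_vec = rng.standard_normal(2048)        # raw 2048-dim video feature vector
W = rng.standard_normal((2048, 512)) * 0.02  # fully-connected layer weights (untrained here)
b = np.zeros(512)                            # bias

# One linear (fully-connected) layer maps 2048 dims down to 512 dims,
# matching the dimensionality of the word vectors.
adjusted_video_vec = video_vec @ W + b
```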
S42, splicing each dimensionality-adjusting word vector and the dimensionality-adjusting video feature vector to obtain initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
Specifically, the dimension-adjusted word vectors and dimension-adjusted video feature vectors are concatenated and fused in order to form a context, so that the global relationship between each dimension-adjusted word vector and each dimension-adjusted video feature vector can be found. In addition, an encoder may be used to encode the input data to obtain the corresponding information; generally the encoder is a recurrent neural network, i.e., a network model that takes sequence data as input, recurses along the evolution direction of the sequence, and connects all its nodes (recurrent units) in a chain. The initial word vector information and the initial video vector information are the information obtained by inputting the dimension-adjusted word vectors and dimension-adjusted video feature vectors into the encoder for encoding.
The concatenation and fusion may proceed as follows: treat each dimension-adjusted video feature vector as one more word vector of the same rank as the dimension-adjusted word vectors, and connect them all into a single sequence.
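Under that reading, the splicing step is a plain sequence concatenation; a small sketch with assumed counts (5 segmented words and 8 extracted frames, all adjusted to 512 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
word_vecs = rng.standard_normal((5, 512))   # dimension-adjusted word vectors
frame_vecs = rng.standard_normal((8, 512))  # dimension-adjusted video feature vectors

# Each frame vector is treated as one more "word", so the encoder receives
# a single mixed sequence of 5 + 8 = 13 vectors.
mixed_sequence = np.concatenate([word_vecs, frame_vecs], axis=0)
```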
And S43, determining the hierarchical relationship of the mutual attention layers, wherein the hierarchical relationship represents the connection relationship between different mutual attention layers.
Specifically, each mutual attention layer performs one pass of the mutual attention mechanism, extracting the respective deep features of the word vector information and the video feature vector information and cross-fusing them.
The hierarchical relationship is used for representing the front-back connection relationship between different mutual attention layers, and the output of the mutual attention layer of the previous layer enters the next mutual attention layer to be subjected to deep feature extraction and cross fusion again.
And S44, inputting the word vector information and the video vector information into a mutual attention layer arranged on a first layer for cross fusion of features, obtaining each piece of first video vector information fused with all pieces of initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all pieces of initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information.
Specifically, take the feature fusion of one piece of initial video vector information under the mutual attention mechanism as an example: the influence weight of each initial word vector on that piece of initial video vector information is determined first. Suppose the initial video vector information comprises a1, b1 and c1, and the initial word vectors are a2, b2 and c2. If the influence weights of a2, b2 and c2 on a1 are n1, m1 and t1 respectively, the first video vector information corresponding to a1 is: a1 + a2×n1 + b2×m1 + c2×t1. Likewise, if the influence weights of a1, b1 and c1 on a2 are n2, m2 and t2 respectively, the first word vector information corresponding to a2 is: a2 + a1×n2 + b1×m2 + c1×t2. The global vector information for b1, c1 and b2, c2 is obtained in the same way.
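The worked example can be reproduced numerically; the vectors and weights below are made-up toy values (in a real model the influence weights come from learned attention scores):

```python
import numpy as np

# Initial video vector a1 and initial word vectors a2, b2, c2
# (arbitrary 2-dimensional toy values).
a1 = np.array([1.0, 0.0])
a2 = np.array([0.5, 0.5])
b2 = np.array([0.2, 0.8])
c2 = np.array([0.9, 0.1])

# Assumed influence weights of a2, b2, c2 on a1.
n1, m1, t1 = 0.6, 0.3, 0.1

# First video vector information for a1, exactly as in the text:
# a1 + a2*n1 + b2*m1 + c2*t1.
first_video_a1 = a1 + n1 * a2 + m1 * b2 + t1 * c2
```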
S45, inputting the first word vector information and the first video vector information into the next mutual attention layer according to the hierarchical relationship, and performing cross fusion of features again to obtain second word vector information and second video vector information respectively; this cycle continues until the last mutual attention layer outputs the fused word vector information and the fused video feature vector information.
Specifically, after the first word vector information and the first video vector information are obtained, the first video vector information and the first word vector information are input into the mutual attention layer of the next layer for feature cross-fusion again, where the cross-fusion method may be performed with reference to the example in step S44, and the process is repeated until the mutual attention layer of the last layer outputs the fused word vector information and the fused video feature vector information that are cross-fused by features through all the mutual attention layers.
S46, decoding the fused word vector information to obtain a fused word vector;
and S47, decoding the fused video feature vector information to obtain a fused video feature vector.
Specifically, the decoder is configured to decode the vector information to obtain a corresponding vector, and output the vector. Typically, the decoder is also a recurrent neural network.
By the method in this embodiment, the word vectors are influenced by the video feature vectors and the video feature vectors are influenced by the word vectors; the resulting fused word vector incorporates the characteristics of the video feature vector and the fused video feature vector incorporates the characteristics of the word vector, so the final fused word vector and fused video feature vector represent the characteristics of the text information and the video information more accurately.
In some embodiments of the method described above, the step S5 of obtaining the corresponding content tag according to the fused word vector and the fused video feature vector includes the following steps S51 to S53:
S51, determining candidate word vectors of all words in the word list;
S52, determining the candidate word vector closest in first distance to each output vector respectively, where the output vectors include: the fused word vectors and the fused video feature vectors;
S53, taking the word corresponding to the candidate word vector with the closest first distance as the content tag corresponding to the output vector.
Specifically, a candidate word vector corresponding to each word in a word list is determined; then determining a first distance (generally, the first distance may be a cosine distance) between each output vector and each candidate word vector in the word list, and determining a candidate word vector closest to the first distance of each output vector; and finally, taking the word corresponding to the candidate word vector closest to the first distance of each output vector as the content label corresponding to the output vector.
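A minimal sketch of this nearest-candidate lookup, assuming the first distance is the cosine distance (as the text suggests) and using a tiny made-up word list:

```python
import numpy as np

def nearest_tag(output_vec, vocab):
    """Return the word whose candidate vector has the smallest cosine
    distance (1 - cosine similarity) to the given output vector."""
    best_word, best_dist = None, float("inf")
    for word, vec in vocab.items():
        cos_sim = vec @ output_vec / (np.linalg.norm(vec) * np.linalg.norm(output_vec))
        dist = 1.0 - cos_sim
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Toy word list: "cat" points roughly along the x-axis, "dog" along y.
vocab = {"cat": np.array([1.0, 0.1]), "dog": np.array([0.1, 1.0])}
tag = nearest_tag(np.array([0.9, 0.2]), vocab)  # an output vector near "cat"
```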
By the method in the embodiment, each feature can be quickly matched to obtain the closest content label so as to obtain the label capable of accurately representing the video information.
In some embodiments, as the method described above, after obtaining the corresponding content tag according to the fused word vector and the fused video feature vector, the method further includes the following steps B1 to B5:
and B1, acquiring the total number of the content tags.
Specifically, this step determines the total number of all content tags obtained in step S5.
And B2, when the total number of the content labels is greater than a preset upper limit threshold value, acquiring a second distance between the candidate word vector corresponding to the same content label and the output vector.
Specifically, the upper threshold may be set according to the actual situation. When the total number of content tags exceeds it, some content tags need to be discarded, so that an excessive number of tags does not harm conciseness. According to the steps in the foregoing embodiment, the word corresponding to the candidate word vector closest in first distance to each output vector is taken as that output vector's content tag; therefore each content tag, its candidate word vector and its output vector are in one-to-one correspondence.
b3, determining the corresponding relation between the content label and the second distance;
Specifically, once the one-to-one correspondence among content tag, candidate word vector and output vector has been established, both the candidate word vector and the output vector are determined, so the second distance between them is also determined; the correspondence between the content tag and the second distance follows directly.
And B4, arranging the content tags in ascending order of the second distance.
And B5, deleting, according to the correspondence, the content tags whose rank exceeds the upper threshold.
Specifically, after the content tags are arranged in ascending order of the second distance, their rank order is determined; since a larger distance means a lower correlation between the two words, only the content tags ranked within the upper threshold are retained, which preserves the accuracy of the tags' semantic expression.
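Steps B1 to B5 amount to a sort-and-truncate over (tag, second distance) pairs; a small sketch with made-up tags and distances:

```python
def screen_tags(tag_distances, upper_limit):
    """Keep at most `upper_limit` content tags, preferring the smallest
    second distance (smaller distance = stronger semantic relevance)."""
    if len(tag_distances) <= upper_limit:                     # B2: nothing to discard
        return [tag for tag, _ in tag_distances]
    ranked = sorted(tag_distances, key=lambda pair: pair[1])  # B4: ascending order
    return [tag for tag, _ in ranked[:upper_limit]]           # B5: drop the rest

# Hypothetical tags with their second distances, cap of 3 tags.
kept = screen_tags([("travel", 0.40), ("food", 0.10), ("vlog", 0.25), ("pets", 0.55)], 3)
```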
In some embodiments, as in the foregoing method, the step S31 performs image extraction on the video information by frames to obtain at least two video frame images, including the following steps S311 to S314:
step S311, the total frame number of the images included in the video information is obtained.
Specifically, the number of frames per second of each piece of video information is generally fixed, and therefore, as long as the duration of the video information and the number of frames per second are obtained, the total number of frames of images included in the video information can be determined.
And S312, determining a preset upper limit threshold of the number of images.
Specifically, feature extraction on images consumes substantial computing resources, while the information displayed by images of adjacent frames, particularly transition frames, generally does not change suddenly; extracting features from every such image therefore wastes computing resources. Predetermining an upper threshold on the number of images effectively bounds the amount of computation spent on feature extraction.
And S313, determining a preset extraction strategy for extracting images of the video information according to the numerical relationship between the total frame number and the upper limit threshold of the image number.
Specifically, the numerical relationship between the total frame number and the upper threshold of the image number may be a ratio relationship or a difference relationship; different numerical relationships may correspond to different extraction strategies, for example: when the numerical relationship is a ratio relationship, and the total frame number is 10 times of the upper threshold of the image number, the extraction strategy may be: one image is taken every 10 frames.
And S314, carrying out image extraction on the video information by frames according to the extraction strategy to obtain video frame images with the number less than or equal to the number corresponding to the upper limit threshold of the number of the images.
Specifically, the matched extraction strategy generally guarantees that, when image extraction is performed on the video information, the number of video frame images finally obtained is less than or equal to the upper limit threshold of the image number; once the strategy is obtained, image extraction proceeds according to the corresponding rule, so the final number of video frame images meets the requirement. This avoids the overly long processing times and wasted computing resources that too many video frame images would cause, effectively improving processing efficiency.
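Steps S311 to S314 can be sketched as a stride computed from the ratio of the total frame count to the image-count cap; this is one possible strategy, matching the "one image every 10 frames" example above:

```python
import math

def sample_frame_indices(total_frames, max_images):
    """Choose which frames to extract: sample every k-th frame, where k is
    derived from the ratio of total frames to the image-count upper
    threshold, so the extracted count never exceeds max_images."""
    stride = max(1, math.ceil(total_frames / max_images))  # S313: pick the strategy
    return list(range(0, total_frames, stride))            # S314: extract by frames

# 100 frames with a cap of 10 images -> one frame every 10 frames.
indices = sample_frame_indices(total_frames=100, max_images=10)
```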
Application example:
The method in any one of the above embodiments was tested and compared with an existing model in the related art; the effects on a test set of 6000 items of data annotated by two persons are shown in Table 1, where "data annotated by two persons" means that, for the same piece of data, the label comprises the annotation results of both persons:
TABLE 1
Therefore, the model obtained by the method provided by the embodiment has remarkable improvements in three judgment standards of recall rate, accuracy and F value.
As shown in fig. 4, according to an embodiment of another aspect of the present invention, there is also provided a data processing apparatus including:
the acquisition module 1 is used for acquiring text information and video information for generating a content label;
the determining module 2 is used for determining a word vector of each word segmentation in the text information;
the vector acquisition module 3 is used for extracting the characteristics of the video information to obtain video characteristic vectors corresponding to the video information;
the feature fusion module 4 is used for performing cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the label determining module 5 is used for obtaining a corresponding content label according to the fused word vector and the fused video feature vector.
In some embodiments, as in the foregoing apparatus, the determining module 2 includes:
the word segmentation unit is used for carrying out word segmentation processing on the text information to obtain word segments forming the text information;
the word list unit is used for obtaining a corresponding word list according to the word segmentation and the preset label words;
and the word vector unit is used for determining the word vector of each participle according to the word vector model obtained by pre-training and the word list.
In some embodiments, as in the foregoing apparatus, the vector obtaining module 3 includes:
the extraction unit is used for carrying out image extraction on the video information according to frames to obtain at least two video frame images;
the input unit is used for respectively inputting the video frame images into the deep neural network;
and the acquisition unit is used for acquiring a video feature vector obtained by performing feature extraction on the video frame image by a feature extraction layer in the deep neural network.
In some embodiments, such as the aforementioned apparatus, the feature fusion module 4 comprises:
the dimension adjusting unit is used for carrying out vector dimension adjustment on the word vectors to obtain dimension adjusting word vectors and carrying out vector dimension adjustment on the video feature vectors to obtain dimension adjusting video feature vectors;
the splicing unit is used for splicing each dimensionality-adjusting word vector and the dimensionality-adjusting video feature vector to obtain initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
the relationship determination unit is used for determining the hierarchical relationship of the mutual attention layers, and the hierarchical relationship represents the connection relationship between different mutual attention layers;
the fusion unit is used for inputting the word vector information and the video vector information into a mutual attention layer arranged on a first layer for cross fusion of features, obtaining each piece of first video vector information fused with all pieces of initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all pieces of initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information;
the output unit is used for inputting the first word vector information and the first video vector information into the next mutual attention layer according to the hierarchical relationship to perform cross fusion of features again, obtaining second word vector information and second video vector information respectively; this cycle continues until the last mutual attention layer outputs the fused word vector information and the fused video feature vector information;
the first decoding unit is used for decoding the fused word vector information to obtain a fused word vector;
and the second decoding unit is used for decoding the fused video feature vector information to obtain a fused video feature vector.
In some embodiments, as in the foregoing apparatus, the tag determination module 5 comprises:
the first determining unit is used for determining candidate word vectors of all words in the word list;
a second determining unit, configured to determine candidate word vectors closest to the first distance of each output vector, respectively, where the output vectors include: fused word vectors and fused video feature vectors;
and the label determining unit is used for taking the word corresponding to the candidate word vector with the closest first distance as the content label corresponding to the output vector.
In some embodiments, the apparatus as in the previous paragraph, further comprising: a tag screening module; the label screening module comprises:
a total number unit for acquiring a total number of the content tags;
the second distance unit is used for acquiring a second distance between the candidate word vector corresponding to the same content label and the output vector when the total number of the content labels is greater than a preset upper limit threshold;
a third determining unit configured to determine a correspondence between the content tag and the second distance;
the arrangement unit is used for arranging the content labels according to the second distance from small to large;
and the screening unit is used for deleting the content tags of which the arrangement order is greater than the upper limit threshold value according to the corresponding relation.
In some embodiments, as in the foregoing apparatus, the extraction unit comprises:
a total frame number subunit, configured to obtain a total frame number of an image included in the video information;
the threshold subunit is used for determining a preset upper limit threshold of the number of images;
the strategy subunit is used for determining a preset extraction strategy for extracting images of the video information according to the numerical relationship between the total frame number and the upper limit threshold of the image number;
and the image determining subunit is used for performing image extraction on the video information by frames according to the extraction strategy to obtain video frame images with the number less than or equal to the number corresponding to the upper limit threshold of the number of the images.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 1501, a communication interface 1502, a memory 1503, and a communication bus 1504, where the processor 1501, the communication interface 1502, and the memory 1503 complete mutual communication through the communication bus 1504,
a memory 1503 for storing a computer program;
the processor 1501, when executing the program stored in the memory 1503, implements the following steps:
the communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to execute the data processing method for generating a content tag according to any one of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data processing method for generating a content tag as described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A data processing method, comprising:
acquiring video information for generating a content label and text information for describing the video information;
determining a word vector of each word segmentation in the text information;
extracting the features of the video information to obtain a video feature vector corresponding to the video information;
performing cross fusion on the characteristics of the word vectors and the characteristics of the video characteristic vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video characteristic vectors;
and obtaining the corresponding content label according to the fused word vector and the fused video feature vector.
2. The method of claim 1, wherein determining a word vector for each word in the textual information comprises:
performing word segmentation processing on the text information to obtain the word segments forming the text information;
obtaining a corresponding word list according to the word segmentation and a preset label word;
and determining the word vector of each word segmentation according to a word vector model obtained by pre-training and the word list.
3. The method according to claim 1, wherein the extracting the features of the video information to obtain a video feature vector corresponding to the video information comprises:
performing image extraction on the video information according to frames to obtain at least two video frame images;
respectively inputting the video frame images into a deep neural network;
and obtaining the video feature vector obtained by performing feature extraction on the video frame image by a feature extraction layer in the deep neural network.
4. The method according to claim 2, wherein the cross-fusing the features of the word vector and the features of the video feature vector through a mutual attention mechanism to obtain a fused word vector and a fused video feature vector, comprises:
carrying out vector dimension adjustment on the word vectors to obtain dimension-adjusting word vectors, and carrying out vector dimension adjustment on the video feature vectors to obtain dimension-adjusting video feature vectors;
after splicing each dimensionality-adjusting word vector and the dimensionality-adjusting video feature vector, obtaining initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
determining a hierarchical relationship of the mutual attention layers, wherein the hierarchical relationship represents a connection relationship between different mutual attention layers;
inputting the word vector information and the video vector information into the mutual attention layer arranged on a first layer for cross fusion of features, obtaining each piece of first video vector information fused with all pieces of initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all pieces of initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information;
inputting the first word vector information and the first video vector information into the mutual attention layer of the next layer according to the hierarchical relationship to perform cross fusion of features again, and respectively obtaining second word vector information and second video vector information; circulating in this way until fused word vector information and fused video feature vector information are obtained from the output of the last layer of the mutual attention layer;
decoding the fused word vector information to obtain the fused word vector;
and decoding the fused video feature vector information to obtain the fused video feature vector.
5. The method according to claim 2, wherein obtaining the corresponding content tag according to the fused word vector and the fused video feature vector comprises:
determining candidate word vectors of all words in the word list;
determining the candidate word vector closest to a first distance of each output vector, respectively, the output vectors comprising: the fused word vector and the fused video feature vector;
and taking the word corresponding to the candidate word vector with the closest first distance as the content label corresponding to the output vector.
6. The method of claim 5, further comprising, after obtaining the corresponding content tag according to the fused word vector and the fused video feature vector:
acquiring the total number of the content tags;
when the total number of the content labels is larger than a preset upper limit threshold value, acquiring a second distance between the candidate word vector corresponding to the same content label and the output vector;
determining a correspondence between the content tag and a second distance;
arranging the content tags according to the second distance from small to large;
and deleting the content tags with the arrangement order larger than the upper threshold value according to the corresponding relation.
7. The method of claim 3, wherein performing image extraction on the video information on a frame-by-frame basis to obtain at least two video frame images comprises:
acquiring the total frame number of images included in the video information;
determining a preset upper limit threshold of the number of images;
determining a preset extraction strategy for extracting images from the video information according to the numerical relationship between the total frame number and the upper limit threshold of the image quantity;
and performing frame-by-frame image extraction on the video information according to the extraction strategy to obtain a number of video frame images less than or equal to the upper limit threshold of the image quantity.
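One plausible extraction strategy satisfying claim 7 is uniform sampling when the total frame count exceeds the upper limit. The claim only requires that the strategy depend on the numerical relationship between the two values, so the sampling rule below is an assumption for illustration.

```python
def frame_indices(total_frames, max_images):
    """Choose which frame indices to extract so at most `max_images` are kept.

    total_frames: total number of frames in the video information
    max_images:   preset upper limit threshold of the number of images
    """
    if total_frames <= max_images:
        # Few enough frames: keep every frame.
        return list(range(total_frames))
    # Otherwise sample uniformly across the whole video.
    step = total_frames / max_images
    return [int(i * step) for i in range(max_images)]
```

For example, a 100-frame video with an upper limit of 10 yields frames 0, 10, ..., 90, while a 5-frame video is kept whole.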
8. A data processing apparatus, comprising:
an acquisition module, configured to acquire video information used for generating a content label and text information used for describing the video information;
the determining module is used for determining a word vector of each word segmentation in the text information;
the vector acquisition module is used for extracting the characteristics of the video information to obtain a video characteristic vector corresponding to the video information;
the feature fusion module is used for performing cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the label determining module is used for obtaining the corresponding content label according to the fused word vector and the fused video feature vector.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010588592.9A CN111767726B (en) | 2020-06-24 | 2020-06-24 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111767726A true CN111767726A (en) | 2020-10-13 |
CN111767726B CN111767726B (en) | 2024-02-06 |
Family
ID=72722324
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010588592.9A Active CN111767726B (en) | 2020-06-24 | 2020-06-24 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111767726B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012167568A1 (en) * | 2011-11-23 | 2012-12-13 | 华为技术有限公司 | Video advertisement broadcasting method, device and system |
US20150221126A1 (en) * | 2011-09-09 | 2015-08-06 | Hisense Co., Ltd. | Method And Apparatus For Virtual Viewpoint Synthesis In Multi-Viewpoint Video |
CN107239801A (en) * | 2017-06-28 | 2017-10-10 | 安徽大学 | Video attribute represents that learning method and video text describe automatic generation method |
WO2018124309A1 (en) * | 2016-12-30 | 2018-07-05 | Mitsubishi Electric Corporation | Method and system for multi-modal fusion model |
WO2019024704A1 (en) * | 2017-08-03 | 2019-02-07 | 阿里巴巴集团控股有限公司 | Entity annotation method, intention recognition method and corresponding devices, and computer storage medium |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN110781347A (en) * | 2019-10-23 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and readable storage medium |
CN110837579A (en) * | 2019-11-05 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Video classification method, device, computer and readable storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112580599A (en) * | 2020-12-30 | 2021-03-30 | 北京达佳互联信息技术有限公司 | Video identification method and device and computer readable storage medium |
CN113010740A (en) * | 2021-03-09 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Word weight generation method, device, equipment and medium |
CN113010740B (en) * | 2021-03-09 | 2023-05-30 | 腾讯科技(深圳)有限公司 | Word weight generation method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN111767726B (en) | 2024-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111767461B (en) | Data processing method and device | |
CN110737801B (en) | Content classification method, apparatus, computer device, and storage medium | |
CN111488931B (en) | Article quality evaluation method, article recommendation method and corresponding devices | |
CN110795657A (en) | Article pushing and model training method and device, storage medium and computer equipment | |
CN111160031A (en) | Social media named entity identification method based on affix perception | |
CN111263238B (en) | Method and equipment for generating video comments based on artificial intelligence | |
CN112464100B (en) | Information recommendation model training method, information recommendation method, device and equipment | |
CN111767726B (en) | Data processing method and device | |
CN113094549A (en) | Video classification method and device, electronic equipment and storage medium | |
CN114996511A (en) | Training method and device for cross-modal video retrieval model | |
CN114332679A (en) | Video processing method, device, equipment, storage medium and computer program product | |
CN110852071B (en) | Knowledge point detection method, device, equipment and readable storage medium | |
CN113961666A (en) | Keyword recognition method, apparatus, device, medium, and computer program product | |
CN113204956B (en) | Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device | |
CN114743029A (en) | Image text matching method | |
CN114357167A (en) | Bi-LSTM-GCN-based multi-label text classification method and system | |
CN115269781A (en) | Modal association degree prediction method, device, equipment, storage medium and program product | |
CN113408282B (en) | Method, device, equipment and storage medium for topic model training and topic prediction | |
CN114579876A (en) | False information detection method, device, equipment and medium | |
CN111767727B (en) | Data processing method and device | |
CN114443916A (en) | Supply and demand matching method and system for test data | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN112287184B (en) | Migration labeling method, device, equipment and storage medium based on neural network | |
CN116610871B (en) | Media data recommendation method, device, computer equipment and storage medium | |
CN116957097A (en) | Typesetting evaluation model training method, device, equipment, storage medium and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||