CN111767727A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN111767727A
CN111767727A (application CN202010589941.9A)
Authority
CN
China
Prior art keywords
vector
word
global
information
image feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010589941.9A
Other languages
Chinese (zh)
Other versions
CN111767727B (en)
Inventor
张轩玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010589941.9A priority Critical patent/CN111767727B/en
Publication of CN111767727A publication Critical patent/CN111767727A/en
Application granted granted Critical
Publication of CN111767727B publication Critical patent/CN111767727B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a data processing method and device. The method includes: acquiring multimedia data for which a content tag is to be generated, together with text information describing the multimedia data; determining a word vector for each segmented word in the text information; performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data; obtaining global relations between the word vectors and the image feature vector through a self-attention mechanism, and deriving global vector information from those relations; taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, to obtain output vectors after each piece of global vector information is decoded under the guidance of the image feature vector; and determining the content tag corresponding to the output vectors. Because the content tags are generated with the help of the image features, information contained in the multimedia data can be exploited even when the text information lacks comprehensive or key information, which improves tag accuracy.

Description

Data processing method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data processing method and device.
Background
At present, methods for tagging data are mainly text-based. However, video content can exhibit many characteristics, and it is difficult for text alone to represent all of the information in a video. When the text consists of only a few phrases, the information those phrases provide is limited; without reference to the actual video content, the text may fail to represent the main information, and analyzing it may yield little useful information.
In view of the above problems, the prior art does provide related solutions, but most of the existing image-text fusion methods simply concatenate the two kinds of features at the input end. Such fusion acts only in the encoder and merely supplies more features: the text features and the video features remain independent of each other, the effect is limited, and the content of multimedia data such as video cannot be fully utilized in the decoder.
For the problem in the related art that accurate tags cannot be obtained from multimedia data, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention aims to provide a data processing method and a data processing device, so as to solve the problem in the related art that accurate tags cannot be obtained from multimedia data. The specific technical scheme is as follows:
in a first aspect of the present invention, there is provided a data processing method, including:
acquiring multimedia data for generating a content tag and text information for describing the multimedia data; wherein the multimedia data comprises: video or images;
determining a word vector for each segmented word in the text information;
performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
acquiring global relations between the word vectors and the image feature vector through a self-attention mechanism, and obtaining, according to the global relations, global vector information corresponding to the word vectors and to the image feature vector respectively;
taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, to obtain output vectors after each piece of global vector information is decoded under the guidance of the image feature vector;
determining the content tag corresponding to the output vectors.
Optionally, as in the foregoing method, the determining a word vector for each segmented word in the text information includes:
performing word segmentation processing on the text information to obtain the segmented words forming the text information;
obtaining a corresponding word list according to the segmented words and preset tag words;
and determining the word vector of each segmented word according to the word list and a word vector model obtained by pre-training.
Optionally, as in the foregoing method, the performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data includes:
inputting the multimedia data into a preset deep neural network;
and obtaining the image feature vector produced after a feature extraction layer in the deep neural network performs feature extraction on the multimedia data.
Optionally, as in the foregoing method, the obtaining, through the self-attention mechanism and according to the global relations between the word vectors and the image feature vector, the global vector information corresponding to the word vectors and the image feature vector includes:
performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector;
inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for concatenation and fusion, to obtain the vector information corresponding to each dimension-adjusted word vector and to the dimension-adjusted image feature vector;
obtaining the global relations among the pieces of vector information through the self-attention mechanism;
and adjusting the vector information according to the global relations to obtain the global vector information.
Optionally, as in the foregoing method, the taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, to obtain output vectors after each piece of global vector information is decoded under the guidance of the image feature vector, includes:
inputting the image feature vector into the decoder as the first input;
determining order information in which the global vector information is sequentially input into the decoder;
determining a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder;
adjusting the initial global vector according to the first influence weight and the image feature vector to obtain an adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information;
determining a second influence weight of the adjusted initial global vector on the next piece of global vector information in the order information, and obtaining an adjusted next global vector according to the adjusted initial global vector, the next piece of global vector information and the second influence weight; and cycling in this way until all adjusted global vectors are obtained;
and obtaining the output vectors according to the adjusted global vectors.
Optionally, as in the foregoing method, the determining the content tag corresponding to the output vectors includes:
determining a candidate word vector for each word in the word list;
determining, for each output vector, the candidate word vector with the smallest first distance to that output vector;
and taking the word corresponding to that nearest candidate word vector as the content tag corresponding to the output vector.
Optionally, as in the foregoing method, after the corresponding content tags are obtained according to the word vectors and the image feature vector, the method further includes:
acquiring the total number of the content tags;
when the total number of the content tags is larger than a preset upper threshold, acquiring, for each content tag, a second distance between the candidate word vector corresponding to that content tag and the output vector;
determining a correspondence between the content tags and the second distances;
arranging the content tags in ascending order of the second distance;
and deleting, according to the correspondence, the content tags whose rank exceeds the upper threshold.
In a second aspect of the present invention, there is also provided a data processing apparatus comprising:
an acquisition module, used for acquiring multimedia data for generating a content tag and text information for describing the multimedia data; wherein the multimedia data comprises: video or images;
a determining module, used for determining a word vector for each segmented word in the text information;
a vector acquisition module, used for performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
a global module, used for obtaining, through a self-attention mechanism and according to global relations between the word vectors and the image feature vector, global vector information corresponding to the word vectors and to the image feature vector respectively;
a decoding module, used for taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, to obtain output vectors after each piece of global vector information is decoded under the guidance of the image feature vector;
and a tag determining module, used for determining the content tag corresponding to the output vectors.
Optionally, as in the foregoing apparatus, the determining module includes:
a word segmentation unit, used for performing word segmentation processing on the text information to obtain the segmented words forming the text information;
a word list unit, used for obtaining a corresponding word list according to the segmented words and preset tag words;
and a word vector unit, used for determining the word vector of each segmented word according to the word list and a word vector model obtained by pre-training.
Optionally, as in the foregoing apparatus, the vector obtaining module includes:
the first input unit is used for inputting the multimedia data into a preset deep neural network;
and the extraction unit is used for acquiring the image feature vector obtained by performing feature extraction on the multimedia data by a feature extraction layer in the deep neural network.
Optionally, as in the foregoing apparatus, the global module includes:
a dimension adjusting unit, used for performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the image feature vector to obtain the dimension-adjusted image feature vector;
a fusion unit, used for inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for concatenation and fusion, to obtain the vector information corresponding to each dimension-adjusted word vector and to the dimension-adjusted image feature vector;
a self-attention unit, used for obtaining the global relations among the pieces of vector information through the self-attention mechanism;
and an adjusting unit, used for adjusting the vector information according to the global relations to obtain the global vector information.
Optionally, as in the foregoing apparatus, the decoding module includes:
a second input unit, used for inputting the image feature vector into the decoder as the first input;
an order unit, used for determining the order information in which the global vector information is sequentially input into the decoder;
a first determining unit, used for determining a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder;
an influence unit, used for adjusting the initial global vector according to the first influence weight and the image feature vector to obtain an adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information; and for determining a second influence weight of the adjusted initial global vector on the next piece of global vector information in the order information, obtaining an adjusted next global vector according to the adjusted initial global vector, the next piece of global vector information and the second influence weight, and cycling in this way until all adjusted global vectors are obtained;
and an output vector unit, used for obtaining the output vectors according to the adjusted global vectors.
Optionally, as in the foregoing apparatus, the tag determining module includes:
a candidate word vector determining unit, used for determining a candidate word vector for each word in the word list;
a word vector screening unit, used for determining, for each output vector, the candidate word vector with the smallest first distance to that output vector;
and a tag determining unit, used for taking the word corresponding to that nearest candidate word vector as the content tag corresponding to the output vector.
Optionally, the apparatus as described above further includes a tag screening module; the tag screening module comprises:
a total number determining unit, used for acquiring the total number of the content tags;
a screening unit, used for acquiring, when the total number of the content tags is greater than a preset upper threshold, the second distance between the candidate word vector corresponding to each content tag and the output vector;
a correspondence unit, used for determining the correspondence between the content tags and the second distances;
an arranging unit, used for arranging the content tags in ascending order of the second distance;
and a deleting unit, used for deleting, according to the correspondence, the content tags whose rank exceeds the upper threshold.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
The embodiment of the invention provides a data processing method and a data processing device. The method includes: acquiring multimedia data for generating a content tag and text information for describing the multimedia data, wherein the multimedia data comprises video or images; determining a word vector for each segmented word in the text information; performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data; obtaining, through a self-attention mechanism and according to global relations between the word vectors and the image feature vector, global vector information corresponding to the word vectors and to the image feature vector respectively; taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, to obtain output vectors after each piece of global vector information is decoded under the guidance of the image feature vector; and determining the content tag corresponding to the output vectors. Because the content tags are obtained from both the word vectors and the image features, information contained in the multimedia data can be exploited during tag generation even when the text information lacks comprehensive or key information, which ultimately improves both the recall and the accuracy of the tags.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 4 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 5 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
Fig. 1 shows a data processing method according to an embodiment of the present application, which includes the following steps S1 to S6:
S1, acquiring multimedia data for generating a content tag and text information for describing the multimedia data; wherein the multimedia data includes: video or images.
Specifically, the multimedia data may include, but is not limited to, one or more of picture, video or animated-image files; the text information may be one or more keywords, long sentences, articles and the like. The method extracts keywords from data that contains both multimedia data and text information and uses them as tags, so the text information and the multimedia data belong to the same piece of data. For example, when the multimedia data is a video, the text information may be text that summarizes the content of that video.
And S2, determining a word vector for each segmented word in the text information.
Specifically, both machine learning and deep learning ultimately operate on numbers, and the task of word vectors is to map words into a vector space and represent each word by a vector. Conceptually, this is a mathematical embedding from a discrete space with one dimension per word into a continuous vector space of much lower dimension. Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge-base methods, and explicit representations built from the contexts in which words occur.
Determining a word vector for each segmented word in the text information may be implemented with language-model methods such as word2vec, GloVe, ELMo, BERT and the like.
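As a concrete illustration, such word vectors can be trained with the gensim implementation of word2vec. This is only a minimal sketch under assumed data and parameters: the 512-dimensional size follows the text below, while the corpus, window size and skip-gram choice are assumptions.

```python
# Minimal word2vec sketch with gensim; corpus and hyperparameters are
# illustrative assumptions, only the 512-dim size comes from the text.
from gensim.models import Word2Vec

segmented_sentences = [                      # pre-segmented text information
    ["sound", "effect", "artist", "performance"],
    ["actor", "performance", "video", "tag"],
]

model = Word2Vec(sentences=segmented_sentences, vector_size=512,
                 window=5, min_count=1, sg=1)

word_vec = model.wv["performance"]           # 512-dim word vector
print(word_vec.shape)                        # (512,)
```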
And S3, extracting the characteristics of the multimedia data to obtain an image characteristic vector corresponding to the multimedia data.
Specifically, feature extraction is performed on the multimedia data in order to identify and obtain the key information in it; the extraction may be carried out by neural network models such as CNNs, yielding the corresponding image feature information.
And S4, acquiring global relations between the word vectors and the image feature vector through a self-attention mechanism, and obtaining, according to the global relations, global vector information corresponding to the word vectors and to the image feature vector respectively.
The global relations obtained through the self-attention mechanism capture the internal relevance of each feature, so the resulting global vector information represents the intended meaning more accurately, which effectively improves the accuracy of the tagging result.
And S5, after the image feature vector is used as the first input of the decoder, sequentially inputting the global vector information into the decoder to obtain output vectors after decoding all the global vectors under the guidance of the image feature vector.
Specifically, in the prior art, when a decoder decodes it refers to the information already decoded when processing the information currently to be decoded. For the information decoded first, however, there is no previously decoded information, so the preceding input to the decoder is simply marked as 0, and nothing influences that first piece of information. Here, the image feature vector is supplied as that first input instead, so that every decoding step is guided by it.
The decoder may be configured to decode an output result of the encoder, and output the decoded output result to obtain an output vector. Typically, the decoder is a recurrent neural network.
And S6, determining the content tag corresponding to the output vector.
Although each output vector is derived, through the above steps, from the dimension-adjusted word vectors and the dimension-adjusted image feature vector fed into the encoder, after that processing the output vector no longer equals any of those input vectors. The corresponding word therefore cannot be read off directly as a content tag; instead, the word must be selected from the word list by means of the output vector.
By adopting the method in this embodiment, when the text information lacks comprehensive or key information, the information contained in the multimedia data can still be drawn on through the image features, which ultimately improves both the recall and the accuracy of the tags.
In some embodiments, as in the foregoing method, determining a word vector for each segmented word in the text information includes steps A1 to A3 as follows:
A1, performing word segmentation processing on the text information to obtain the segmented words forming the text information;
A2, obtaining a corresponding word list according to the segmented words and preset tag words;
and A3, determining the word vector of each segmented word according to the word list and a word vector model obtained by pre-training.
Specifically, the word segmentation processing splits the text information into a number of words. For example, for a Chinese sentence remarking, roughly, that an eccentric sound-effects artist and the actors went all out in a dubbed performance, the resulting segments include: "wonderful flower" (slang for "eccentric"), "sound effect", "teacher", "tragic", "actor", "do", "fit", "also", "is", "spell", "played", "performance", "play", "easy" and so on.
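The patent does not name a particular segmenter; as an assumed example, the jieba library is a common choice for segmenting Chinese text of this kind.

```python
# Hypothetical segmentation example; the input sentence is illustrative,
# not the patent's example sentence.
import jieba

text_info = "音效师和演员都拼了演技"            # illustrative text information
segments = list(jieba.cut(text_info))          # e.g. ['音效', '师', '和', ...]
print(segments)
```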
The preset tag words may be phrases selected in advance; the words in the word list then comprise these tag words together with the segments obtained by segmenting the text information.
The word vector model obtained by pre-training may be a word2vec model (a tool for computing word vectors), so the word vector of each segmented word can be determined through the trained word2vec model.
Specifically, once the word list and the model are determined, the word vector of every word in the word list can be determined. Further, the words in the word list may each be randomly initialized as a 512-dimensional vector, serving as the word vectors of the segmented words and the tag vectors (the word vectors of the tag words) respectively.
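A minimal sketch of steps A2 and A3 under these assumptions: the preset tag words and names are invented for illustration, and only the 512-dimensional random initialization follows the text.

```python
# Build the word list from the segments plus preset tag words, then randomly
# initialise one 512-dim vector per entry, as described above.
import numpy as np

segments = ["sound", "effect", "actor", "performance"]
preset_tag_words = ["comedy", "variety_show"]                 # hypothetical

word_list = list(dict.fromkeys(segments + preset_tag_words))  # de-duplicated
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(512).astype(np.float32)
              for w in word_list}
```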
By the method in this embodiment, the relations among the segmented words in the text information can be obtained through the word vectors, the semantics of each segmented word can be captured effectively, and the accuracy of the tagging result can be effectively improved.
As shown in fig. 2, in some embodiments, as the foregoing method, the step S3 performs feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data, including the following steps S31 and S32:
S31, inputting the multimedia data into a preset deep neural network;
and S32, obtaining an image feature vector obtained by performing feature extraction on the multimedia data by a feature extraction layer in the deep neural network.
Specifically, the deep neural network has a capability of extracting features of the multimedia data, so that corresponding image feature vectors can be obtained by inputting the multimedia data into the deep neural network.
One optional implementation is as follows: the multimedia data is input into an Xception (depthwise separable convolution) model, and the 2048-dimensional vector of the penultimate layer of the Xception model is extracted as the image feature, because the image features extracted at the penultimate layer are the richest.
By adopting the method in this embodiment, the feature extraction layer in the deep neural network extracts features from the video information to obtain rich video feature vectors, so that more of the information provided by the video is obtained.
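One way to realise S31/S32 is the Keras Xception model: with the classification head removed and global average pooling, the network emits the 2048-dimensional penultimate-layer feature referred to above. The preprocessing and the single-frame input are assumptions of this sketch, not requirements of the patent.

```python
import numpy as np
import tensorflow as tf

# Headless Xception with global average pooling -> (None, 2048) features.
extractor = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, pooling="avg")

frame = np.random.rand(1, 299, 299, 3).astype("float32")   # stand-in frame
frame = tf.keras.applications.xception.preprocess_input(frame * 255.0)
image_feature = extractor(frame)                           # shape (1, 2048)
```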
As shown in fig. 3, in some embodiments, as the aforementioned method, the step S4 obtains global vector information corresponding to the word vector and the image feature vector respectively according to global relations between the word vector and the image feature vector through the self-attention mechanism, and includes the following steps S41 to S44:
and S41, carrying out vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and carrying out vector dimension adjustment on the image characteristic vectors to obtain dimension-adjusted image characteristic vectors.
Specifically, on the basis of the foregoing embodiment, the word vectors of the segmented words are 512-dimensional while the image feature vector is 2048-dimensional. Because the dimensions differ, the two cannot be concatenated and fused directly, so their dimensions must be unified. Optionally, since the image feature vector has the higher dimension, dimension reduction may be performed on it: the dimensionality is reduced through a fully connected network, yielding a 512-dimensional dimension-adjusted image feature vector.
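In code, this dimension adjustment is a single fully connected layer; a minimal sketch under the 2048 → 512 sizes given above (PyTorch is an assumed choice of framework):

```python
import torch
import torch.nn as nn

reduce_dim = nn.Linear(2048, 512)      # fully connected dimension reduction
image_feature = torch.randn(1, 2048)   # 2048-dim image feature vector
adjusted_image_feature = reduce_dim(image_feature)   # (1, 512) adjusted vector
```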
S42, inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for concatenation and fusion, to obtain the vector information corresponding to each dimension-adjusted word vector and to the dimension-adjusted image feature vector.
Specifically, the encoder encodes the input data; typically the encoder is a recurrent neural network. Inputting the concatenation and fusion of the dimension-adjusted word vectors and the dimension-adjusted image feature vector into the encoder therefore places them in one context, so that the global relations between every dimension-adjusted word vector and the dimension-adjusted image feature vector can be discovered. One implementation is to treat the dimension-adjusted image feature vector as another word vector, at the same level as each dimension-adjusted word vector; the vector information then consists of the dimension-adjusted word vectors and the dimension-adjusted image feature vector, and feeding it into the encoder quickly accomplishes the concatenation and fusion.
And S43, obtaining the global relations among the pieces of vector information through a self-attention mechanism.
Specifically, the attention mechanism mimics the internal process of biological observation: it aligns internal experience with external perception to increase the fineness with which a partial region is observed. An attention mechanism can quickly extract important features from sparse data and is therefore widely used in natural language processing tasks, especially machine translation. The self-attention mechanism is an improvement on the attention mechanism that reduces the reliance on external information and is better at capturing the internal correlations of data or features. Through the self-attention mechanism, the global relations between the pieces of vector information can thus be obtained.
S44, adjusting the vector information according to the global relations to obtain the global vector information. For example, suppose there are vectors a, b and c, and the weights between a and b and between a and c are a1 and a2 respectively; then the global vector information corresponding to a is a1 × b + a2 × c, and b and c are handled similarly.
In summary, by the method in this embodiment, the dimension-adjusted word vectors and the dimension-adjusted image feature vector are concatenated and fused, after which the self-attention mechanism captures the internal correlations of each piece of vector information, so the specific meaning of the text information and the multimedia data can be analyzed more accurately and the accuracy of the tagging result is effectively improved.
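A hedged sketch of S42–S44: the dimension-adjusted word vectors and image feature vector are concatenated into one sequence and passed through scaled dot-product self-attention, so each position becomes a weighted sum of all positions, exactly the a → a1 × b + a2 × c adjustment described above. The single-head, projection-free formulation is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def self_attention(x):                          # x: (seq_len, d)
    d = x.size(-1)
    scores = x @ x.transpose(0, 1) / d ** 0.5   # pairwise global relations
    weights = F.softmax(scores, dim=-1)         # relation weights per position
    return weights @ x                          # global vector information

word_vecs = torch.randn(10, 512)    # 10 dimension-adjusted word vectors
img_vec = torch.randn(1, 512)       # dimension-adjusted image feature vector
fused = torch.cat([word_vecs, img_vec], dim=0)   # concatenation and fusion
global_info = self_attention(fused)              # shape (11, 512)
```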
As shown in fig. 4, in some embodiments, after the image feature vector is used as the first input of the decoder in step S5, the global vector information is sequentially input into the decoder to obtain output vectors after decoding each global vector under the guidance of the image feature vector, as in the foregoing method, including steps S51 to S56 as follows:
step S51, inputting the image feature vector into a decoder as a first input.
Specifically, in the prior art, when a decoder decodes it refers to the information already decoded when processing the information currently to be decoded; for the information decoded first there is no previously decoded information, so the preceding input to the decoder is marked as 0 and nothing else influences that first piece of information. Here, that first input is replaced by the image feature vector.
And S52, determining order information for inputting the global vector information to the decoder in sequence.
Specifically, the pieces of global vector information are generally input into the decoder one by one, and the order information can be obtained from the order of the segmented words in the text information. For example: since each piece of global vector information corresponds to a specific dimension-adjusted word vector, each dimension-adjusted word vector corresponds to a word vector, and each word vector corresponds to a segmented word, the order of the global vector information corresponding to the word vectors can be determined from the order of the segmented words; finally, only the position of the piece corresponding to the dimension-adjusted image feature vector (which may be placed at the head or at the tail) needs to be determined to obtain the order information.
And S53, determining a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder. The image is thereby fused into the global vector information: for example, if image a influences global vector b with weight a1 and global vector c with weight a2, the fused global information is a1 × b + a2 × c.
Specifically, the first influence weight of the image feature vector on the initial global vector information is generally determined by the decoder.
S54, adjusting the initial global vector according to the first influence weight and the image feature vector to obtain an adjusted initial global vector; the initial global vector is obtained by the decoder decoding the initial global vector information.
Specifically, adjusting the initial global vector according to the image feature vector may proceed as follows: once the first influence weight t has been obtained, if the image feature vector is M and the initial global vector is N, the adjusted initial global vector may be N(1 - t) + Mt.
S55, determining a second influence weight of the adjusted initial global vector on the next piece of global vector information in the order information, and obtaining an adjusted next global vector according to the adjusted initial global vector, the next piece of global vector information and the second influence weight; and cycling in this way until all the adjusted global vectors are obtained.
Specifically, by cycling in sequence according to the method in step S54, all the vectors generated under the guidance of the image feature vector (i.e., the adjusted global vectors) can be obtained; for the specific implementation, refer to the method in step S54, which is not repeated here.
And S56, obtaining an output vector according to the adjusted global vector.
Specifically, the global vector adjusted according to the foregoing steps may be directly output and used as an output vector.
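A compact sketch of S51–S56 under stated assumptions: decode() and influence_weight() stand in for the recurrent decoder step and its learned weights; only the N(1 - t) + Mt adjustment and the guidance loop follow the text.

```python
import torch

def decode(vec):                      # placeholder for one decoder step
    return vec

def influence_weight(prev, cur):      # assumed: a learned scalar in [0, 1]
    return torch.sigmoid((prev * cur).mean())

image_feature = torch.randn(512)                     # first decoder input (S51)
global_infos = [torch.randn(512) for _ in range(5)]  # ordered per S52

prev = image_feature
adjusted = []
for info in global_infos:                            # S53-S55 loop
    vec = decode(info)
    t = influence_weight(prev, vec)                  # influence weight
    vec = vec * (1 - t) + prev * t                   # N(1 - t) + Mt from S54
    adjusted.append(vec)
    prev = vec                                       # guides the next step

output_vectors = adjusted                            # S56: output as-is
```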
In summary, with the method in this embodiment, each piece of global vector information is decoded under the guidance of the image feature vector, so the features corresponding to the image feature vector are merged into the decoded output vectors; the output vectors therefore carry more of the features, and embody more of the effective information, contained in the multimedia data.
In some embodiments, as in the foregoing method, the step S6 of determining the content tag corresponding to the output vectors includes steps S61 to S63 as follows:
S61, determining a candidate word vector for each word in the word list;
S62, determining, for each output vector, the candidate word vector with the smallest first distance to that output vector;
and S63, taking the word corresponding to that nearest candidate word vector as the content tag corresponding to the output vector.
Specifically, the candidate word vector corresponding to each word in the word list is determined first; then the first distance (generally a cosine distance) between each output vector and every candidate word vector in the word list is determined, and the candidate word vector nearest to each output vector under the first distance is identified; finally, the word corresponding to that nearest candidate word vector is taken as the content tag corresponding to the output vector.
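A small sketch of S61–S63, assuming the first distance is the cosine distance mentioned above:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def tags_for(output_vectors, candidate_vectors):  # candidate_vectors: word -> vec
    tags = []
    for out in output_vectors:
        word = min(candidate_vectors,             # nearest candidate word vector
                   key=lambda w: cosine_distance(out, candidate_vectors[w]))
        tags.append(word)                         # its word is the content tag
    return tags
```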
In summary, with the method in this embodiment, the relevance between each candidate word vector and the output vector can be captured, so the specific meaning of the output vector obtained from the text information and the multimedia data is analyzed more accurately and the accuracy of the tagging result is effectively improved.
In some embodiments, as in the foregoing method, after the corresponding content tags are obtained according to the word vectors and the image feature vector, the method further includes steps B1 to B5 as follows:
and B1, acquiring the total number of the content tags.
Specifically, this step determines the total number of all the content tags obtained in step S6.
And B2, when the total number of the content tags is greater than a preset upper threshold, acquiring, for each content tag, the second distance between the candidate word vector corresponding to that content tag and the output vector.
Specifically, the upper threshold may be set according to the actual situation. When the total number of the content tags is larger than the upper threshold, some content tags need to be discarded, preventing the tags from becoming too numerous and impairing conciseness. According to the steps in the foregoing embodiment, the word corresponding to the candidate word vector nearest to each output vector is taken as the content tag of that output vector, so each content tag, its candidate word vector and its output vector are in one-to-one correspondence;
B3, determining the correspondence between the content tags and the second distances.
Specifically, given the unique correspondence among a content tag, its candidate word vector and its output vector, and since both the candidate word vector and the output vector are fixed, the second distance between them is fixed as well; the correspondence between each content tag and its second distance can therefore be obtained.
And B4, arranging the content tags in ascending order of the second distance;
and B5, deleting, according to the correspondence, the content tags whose rank exceeds the upper threshold.
Specifically, after the content tags are arranged in ascending order of the second distance, the rank of each tag is determined. Since a larger distance means lower correlation between the two words, only the content tags ranked within the upper threshold are retained, which guarantees the accuracy of the semantics the content tags express.
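A sketch of B1–B5; the tag names, distances and threshold are illustrative assumptions.

```python
def filter_tags(tag_to_second_distance, upper_limit):
    if len(tag_to_second_distance) <= upper_limit:       # B1/B2: check the total
        return list(tag_to_second_distance)
    ranked = sorted(tag_to_second_distance,              # B4: ascending distance
                    key=tag_to_second_distance.get)
    return ranked[:upper_limit]                          # B5: drop the rest

print(filter_tags({"comedy": 0.12, "variety": 0.30, "music": 0.45}, 2))
# ['comedy', 'variety']
```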
As shown in fig. 5, in a second aspect of the implementation of the present invention, there is further provided a data processing apparatus, comprising:
an acquisition module 1, used for acquiring multimedia data for generating a content tag and text information for describing the multimedia data; the multimedia data includes: video or images;
a determining module 2, used for determining a word vector for each segmented word in the text information;
a vector acquisition module 3, used for performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
a global module 4, used for obtaining, through a self-attention mechanism and according to global relations between the word vectors and the image feature vector, global vector information corresponding to the word vectors and to the image feature vector respectively;
a decoding module 5, used for taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, to obtain output vectors after each piece of global vector information is decoded under the guidance of the image feature vector;
and a tag determining module 6, used for determining the content tag corresponding to the output vectors.
In some embodiments, as in the foregoing apparatus, the determining module includes:
a word segmentation unit, used for performing word segmentation processing on the text information to obtain the segmented words forming the text information;
a word list unit, used for obtaining a corresponding word list according to the segmented words and preset tag words;
and a word vector unit, used for determining the word vector of each segmented word according to the word list and a word vector model obtained by pre-training.
In some embodiments, as in the foregoing apparatus, the vector obtaining module comprises:
the first input unit is used for inputting multimedia data into a preset deep neural network;
and the extraction unit is used for acquiring an image feature vector obtained by performing feature extraction on the multimedia data by a feature extraction layer in the deep neural network.
In some embodiments, as in the foregoing apparatus, the global module includes:
a dimension adjusting unit, used for performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector;
a fusion unit, used for inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into the encoder for concatenation and fusion, to obtain the vector information corresponding to each dimension-adjusted word vector and to the dimension-adjusted image feature vector;
a self-attention unit, used for obtaining the global relations among the pieces of vector information through the self-attention mechanism;
and an adjusting unit, used for adjusting the vector information according to the global relations to obtain the global vector information.
In some embodiments, as in the foregoing apparatus, the decoding module includes:
a second input unit, used for inputting the image feature vector into the decoder as the first input;
an order unit, used for determining the order information in which the global vector information is sequentially input into the decoder;
a first determining unit, used for determining a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder;
an influence unit, used for adjusting the initial global vector according to the first influence weight and the image feature vector to obtain an adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information; and for determining a second influence weight of the adjusted initial global vector on the next piece of global vector information in the order information, obtaining an adjusted next global vector according to the adjusted initial global vector, the next piece of global vector information and the second influence weight, and cycling in this way until all adjusted global vectors are obtained;
and an output vector unit, used for obtaining the output vectors according to the adjusted global vectors.
In some embodiments, as in the foregoing apparatus, the tag determining module includes:
a candidate word vector determining unit, used for determining a candidate word vector for each word in the word list;
a word vector screening unit, used for determining, for each output vector, the candidate word vector with the smallest first distance to that output vector;
and a tag determining unit, used for taking the word corresponding to that nearest candidate word vector as the content tag corresponding to the output vector.
In some embodiments, the apparatus described above further includes a tag screening module; the tag screening module comprises:
a total number determining unit, used for acquiring the total number of the content tags;
a screening unit, used for acquiring, when the total number of the content tags is greater than a preset upper threshold, the second distance between the candidate word vector corresponding to each content tag and the output vector;
a correspondence unit, used for determining the correspondence between the content tags and the second distances;
an arranging unit, used for arranging the content tags in ascending order of the second distance;
and a deleting unit, used for deleting, according to the correspondence, the content tags whose rank exceeds the upper threshold.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 1501, a communication interface 1502, a memory 1503, and a communication bus 1504, where the processor 1501, the communication interface 1502, and the memory 1503 complete mutual communication through the communication bus 1504,
a memory 1503 for storing a computer program;
the processor 1501, when executing the program stored in the memory 1503, implements the steps of the data processing method described in any one of the above embodiments.
the communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to execute the data processing method for generating a content tag according to any one of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data processing method for generating a content tag as described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A data processing method, comprising:
acquiring multimedia data for generating a content tag and text information for describing the multimedia data; wherein the multimedia data comprises: video or images;
determining a word vector for each segmented word in the text information;
performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
acquiring global relations between the word vectors and the image feature vector through a self-attention mechanism, and obtaining, according to the global relations, global vector information corresponding to the word vectors and to the image feature vector respectively;
taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, to obtain output vectors after each piece of global vector information is decoded under the guidance of the image feature vector;
determining the content tag corresponding to the output vectors.
2. The method of claim 1, wherein determining a word vector for each word in the textual information comprises:
performing word segmentation processing on the text information to obtain the word segments forming the text information;
obtaining a corresponding word list according to the word segmentation and a preset label word;
and determining the word vector of each word segmentation according to a word vector model obtained by pre-training and the word list.
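A minimal sketch of claim 2, assuming whitespace tokenization for readability; for Chinese titles a segmenter such as jieba (jieba.lcut) would stand in for the split. The preset label words and the random vector table are hypothetical placeholders for a pre-trained word-vector model such as word2vec.

```python
import numpy as np

rng = np.random.default_rng(0)

text = "funny cat video"                 # text describing the multimedia data
segments = text.split()                  # word segmentation (toy version)
preset_labels = ["funny", "pets"]        # preset label words (illustrative)
word_list = sorted(set(segments) | set(preset_labels))

# Stand-in for the pre-trained word-vector model: one row per list entry.
table = {w: rng.normal(size=128) for w in word_list}
word_vectors = [table[w] for w in segments]   # word vector per segmented word
```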
3. The method of claim 1, wherein performing feature extraction on the multimedia data to obtain the image feature vector corresponding to the multimedia data comprises:
inputting the multimedia data into a preset deep neural network;
and obtaining the image feature vector output after a feature extraction layer in the deep neural network performs feature extraction on the multimedia data.
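The claim only requires "a preset deep neural network" with a feature extraction layer; the sketch below assumes a torchvision ResNet-50 purely for illustration, taking the pooled activations before the classification head as the image feature vector.

```python
import torch
import torchvision.models as models

# Hypothetical backbone choice; the patent does not name one.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Drop the final fully connected layer, keeping the feature extraction layers.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)        # e.g. 8 sampled video frames
    feats = feature_extractor(frames)           # -> (8, 2048, 1, 1)
    image_feature = feats.flatten(1).mean(0)    # pool over frames -> (2048,)
```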
4. The method according to claim 1, wherein obtaining, through the self-attention mechanism, the global vector information corresponding to the word vector and the image feature vector according to the global relations between the word vector and the image feature vector comprises:
performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector;
inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for concatenation and fusion, to obtain vector information corresponding to each dimension-adjusted word vector and the dimension-adjusted image feature vector;
obtaining the global relations among the pieces of vector information through the self-attention mechanism;
and adjusting the vector information according to the global relations to obtain the global vector information.
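Claim 4's self-attention step can be written out with explicit query/key/value projections. In this sketch the dimension adjustment is a random linear map and the encoder's concatenation-and-fusion is plain stacking; both are assumptions, since the claim does not specify the operators.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    # Each row is re-expressed through its global relations to all rows.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    relations = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # global relations
    return relations @ v                                  # global vector info

rng = np.random.default_rng(0)
d = 256
adjust = rng.normal(size=(300, d)) / np.sqrt(300)  # dimension-adjustment map
word_vecs = rng.normal(size=(5, 300)) @ adjust     # dimension-adjusted words
image_vec = rng.normal(size=(1, 300)) @ adjust     # dimension-adjusted image
tokens = np.concatenate([word_vecs, image_vec])    # concatenation as "fusion"
wq, wk, wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
global_info = self_attention(tokens, wq, wk, wv)   # (6, d) global vector info
```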
5. The method according to claim 1, wherein, after the image feature vector is used as the first input of the decoder, sequentially inputting the global vector information into the decoder to obtain the output vectors after each global vector is decoded under the guidance of the image feature vector comprises:
inputting the image feature vector into the decoder as the first input;
determining order information in which the global vector information is sequentially input into the decoder;
determining a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder;
adjusting the initial global vector according to the first influence weight and the image feature vector to obtain an adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information;
determining a second influence weight of the adjusted initial global vector on the next global vector information in the order information, and obtaining an adjusted next global vector according to the adjusted initial global vector, the next global vector information, and the second influence weight; repeating these steps until all adjusted global vectors are obtained;
and obtaining the output vectors according to the adjusted global vectors.
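One literal reading of claim 5's decoding loop, sketched below: the image feature vector enters first, a scalar influence weight couples it to the first global vector, and each adjusted vector in turn weights the next one in the order information. The sigmoid-of-similarity weight is a hypothetical choice; the claim only requires that some influence weight be determined.

```python
import numpy as np

def influence_weight(a, b):
    # Hypothetical: cosine similarity squashed into (0, 1).
    sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return 1.0 / (1.0 + np.exp(-sim))

rng = np.random.default_rng(1)
d = 256
image_vec = rng.normal(size=d)           # first input to the decoder
global_infos = rng.normal(size=(6, d))   # global vector info, already ordered

outputs, guide = [], image_vec
for g in global_infos:                   # follow the order information
    w = influence_weight(guide, g)       # first / second / ... influence weight
    adjusted = g + w * guide             # adjust under the current guidance
    outputs.append(adjusted)
    guide = adjusted                     # the adjusted vector guides the next
output_vectors = np.stack(outputs)       # decoded, adjusted global vectors
```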
6. The method of claim 2, wherein determining the content tag corresponding to the output vector comprises:
determining candidate word vectors for all words in the word list;
for each output vector, determining the candidate word vector with the smallest first distance to that output vector;
and taking the word corresponding to the candidate word vector with the smallest first distance as the content tag corresponding to the output vector.
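Claim 6 is a nearest-neighbour lookup over the word list. Below is a sketch with Euclidean distance as the "first distance" (the claim does not name the metric) and a toy word list:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256
word_list = ["funny", "cat", "travel", "cooking"]   # hypothetical word list
candidates = rng.normal(size=(len(word_list), d))   # candidate word vectors
output_vectors = rng.normal(size=(3, d))            # from the decoder

tags = []
for out in output_vectors:
    first_dist = np.linalg.norm(candidates - out, axis=1)  # "first distance"
    i = int(first_dist.argmin())
    tags.append((word_list[i], float(first_dist[i])))      # tag + distance
print(tags)
```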
7. The method of claim 6, further comprising, after the corresponding content tags are obtained from the word vector and the image feature vector:
acquiring the total number of the content tags;
when the total number of the content tags is greater than a preset upper-limit threshold, acquiring the second distance between the candidate word vector corresponding to each content tag and its output vector;
determining a correspondence between each content tag and its second distance;
arranging the content tags in ascending order of the second distance;
and deleting, according to the correspondence, the content tags whose rank exceeds the upper-limit threshold.
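Claim 7 caps the number of tags by ranking them by their second distance and dropping the tail. A standalone sketch with illustrative values:

```python
# Hypothetical (content tag, second distance) pairs from the matching step.
tags = [("cat", 0.8), ("funny", 1.3), ("travel", 0.5), ("cooking", 2.1)]
upper_limit = 2            # preset upper-limit threshold (illustrative)

if len(tags) > upper_limit:
    tags.sort(key=lambda t: t[1])    # ascending second distance
    tags = tags[:upper_limit]        # delete tags ranked past the threshold
print([name for name, _ in tags])    # -> ['travel', 'cat']
```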
8. A data processing apparatus, comprising:
an acquisition module, configured to acquire multimedia data for generating a content tag and text information for describing the multimedia data; wherein the multimedia data comprises: video or images;
the determining module is used for determining a word vector of each word segmentation in the text information;
the vector acquisition module is used for performing feature extraction on the multimedia data to obtain image feature vectors corresponding to the multimedia data;
the global module is used for obtaining, through a self-attention mechanism, global vector information respectively corresponding to the word vectors and the image feature vectors according to the global relations between the word vectors and the image feature vectors;
the decoding module is used for inputting the global vector information into a decoder in sequence after the image feature vector is used as a first input of the decoder so as to obtain an output vector after each global vector is decoded under the guidance of the image feature vector;
a tag determination module to determine the content tag corresponding to the output vector.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202010589941.9A 2020-06-24 2020-06-24 Data processing method and device Active CN111767727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010589941.9A CN111767727B (en) 2020-06-24 2020-06-24 Data processing method and device

Publications (2)

Publication Number Publication Date
CN111767727A true CN111767727A (en) 2020-10-13
CN111767727B CN111767727B (en) 2024-02-06

Family

ID=72721795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589941.9A Active CN111767727B (en) 2020-06-24 2020-06-24 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111767727B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121785A1 (en) * 2016-11-03 2018-05-03 Nec Laboratories America, Inc. Context-aware attention-based neural network for interactive question answering
US20190377979A1 (en) * 2017-08-30 2019-12-12 Tencent Technology (Shenzhen) Company Limited Image description generation method, model training method, device and storage medium
US20190385004A1 (en) * 2017-09-11 2019-12-19 Tencent Technology (Shenzhen) Company Limited Image recognition method, terminal, and storage medium
US20190108411A1 (en) * 2017-10-11 2019-04-11 Alibaba Group Holding Limited Image processing method and processing device
CN108171268A (en) * 2018-01-02 2018-06-15 联想(北京)有限公司 A kind of image processing method and electronic equipment
CN109002852A (en) * 2018-07-11 2018-12-14 腾讯科技(深圳)有限公司 Image processing method, device, computer readable storage medium and computer equipment
CN110458242A (en) * 2019-08-16 2019-11-15 广东工业大学 A kind of iamge description generation method, device, equipment and readable storage medium storing program for executing
CN110309839A (en) * 2019-08-27 2019-10-08 北京金山数字娱乐科技有限公司 A kind of method and device of iamge description
CN110580292A (en) * 2019-08-28 2019-12-17 腾讯科技(深圳)有限公司 Text label generation method and device and computer readable storage medium
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN111125177A (en) * 2019-12-26 2020-05-08 北京奇艺世纪科技有限公司 Method and device for generating data label, electronic equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Xiaohu; YIN Liangfei; ZHAO Chenglong: "Image semantic description algorithm based on global-local features and an adaptive attention mechanism", Journal of Zhejiang University (Engineering Science), no. 01 *

Also Published As

Publication number Publication date
CN111767727B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN111767461B (en) Data processing method and device
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
KR102276728B1 (en) Multimodal content analysis system and method
CN109726274B (en) Question generation method, device and storage medium
CN106980664B (en) Bilingual comparable corpus mining method and device
CN110287375B (en) Method and device for determining video tag and server
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN110991149A (en) Multi-mode entity linking method and entity linking system
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN111708909A (en) Video tag adding method and device, electronic equipment and computer-readable storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
JP2023544925A (en) Data evaluation methods, training methods and devices, electronic equipment, storage media, computer programs
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
CN111767726B (en) Data processing method and device
CN111767727B (en) Data processing method and device
CN113822045B (en) Multi-mode data-based film evaluation quality identification method and related device
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN117077678B (en) Sensitive word recognition method, device, equipment and medium
CN117473119B (en) Text video retrieval method and device
CN115510872A (en) Training method of rumor recognition model and WeChat-push culture rumor recognition method
CN116881392A (en) Multimedia processing method, related device, storage medium and computer program product
CN117743638A (en) Model training method, video file retrieval method, device and electronic equipment
CN115757868A (en) Video title generation method and device, electronic equipment and readable storage medium
CN117009170A (en) Training sample generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant