CN111767727B - Data processing method and device

Data processing method and device

Info

Publication number
CN111767727B
Authority
CN
China
Prior art keywords
vector
global
word
information
image feature
Prior art date
Legal status
Active
Application number
CN202010589941.9A
Other languages
Chinese (zh)
Other versions
CN111767727A (en)
Inventor
张轩玮
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010589941.9A
Publication of CN111767727A
Application granted
Publication of CN111767727B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention provides a data processing method and device, the method comprising: acquiring multimedia data for which content tags are to be generated, together with text information describing the multimedia data; determining a word vector for each segmented word in the text information; performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data; obtaining, through a self-attention mechanism, the global relation between the word vectors and the image feature vector, and from it the global vector information; after taking the image feature vector as the first input of a decoder, sequentially inputting the global vector information into the decoder to obtain an output vector for each global vector decoded under the guidance of the image feature vector; and determining the content tag corresponding to each output vector. When the text information lacks comprehensive or key information, the image features allow the content tags to draw on the information contained in the multimedia data itself, which improves tag accuracy.

Description

Data processing method and device
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a data processing method and apparatus.
Background
Current data-tagging methods are mainly text-based, but because video content carries many characteristics, it is difficult for text alone to capture all the information in a video. When the text consists of only a few phrases, those phrases provide limited information; without the specific video content, the text is likely unable to capture the main information, and it may even be difficult to extract any useful information from it.
The prior art offers related solutions, but most existing image-text fusion methods simply concatenate the two kinds of features at the input end. Such fusion is applied only in the encoder: it yields more features, but the text and video features remain independent of each other, the effect is limited, and the decoder cannot make full use of the content of multimedia data such as video.
No effective solution has yet been proposed for the problem in the related art that accurate tags cannot be obtained from multimedia data.
Disclosure of Invention
Embodiments of the invention aim to provide a data processing method and device that solve the problem in the related art that accurate tags cannot be obtained from multimedia data. The specific technical scheme is as follows:
In a first aspect, the present invention provides a data processing method, comprising:
acquiring multimedia data for which content tags are to be generated, and text information describing the multimedia data, wherein the multimedia data comprises video or images;
determining a word vector for each segmented word in the text information;
performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
obtaining, through a self-attention mechanism, the global relation between the word vectors and the image feature vector, and obtaining, according to the global relation, the global vector information corresponding to the word vectors and the image feature vector respectively;
after taking the image feature vector as the first input of a decoder, sequentially inputting the global vector information into the decoder to obtain an output vector for each global vector decoded under the guidance of the image feature vector;
and determining the content tag corresponding to the output vector.
Optionally, in the foregoing method, determining the word vector of each segmented word in the text information comprises:
performing word-segmentation processing on the text information to obtain the segmented words that make up the text information;
obtaining a corresponding vocabulary from the segmented words and preset tag words;
and determining the word vector of each segmented word according to the vocabulary and a pre-trained word vector model.
Optionally, in the foregoing method, performing feature extraction on the multimedia data to obtain the image feature vector corresponding to the multimedia data comprises:
inputting the multimedia data into a preset deep neural network;
and acquiring the image feature vector produced after a feature extraction layer in the deep neural network performs feature extraction on the multimedia data.
Optionally, in the foregoing method, obtaining, through a self-attention mechanism, the global relation between the word vectors and the image feature vector, and obtaining the corresponding global vector information according to the global relation, comprises:
performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector;
inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for splicing and fusion, and obtaining the vector information corresponding to each dimension-adjusted word vector and the dimension-adjusted image feature vector;
obtaining the global relation among the pieces of vector information through the self-attention mechanism;
and adjusting the vector information according to the global relation to obtain the global vector information.
Optionally, in the foregoing method, taking the image feature vector as the first input of a decoder and then sequentially inputting the global vector information into the decoder, so as to obtain the output vector for each global vector decoded under the guidance of the image feature vector, comprises:
inputting the image feature vector into the decoder as the first input;
determining order information for sequentially inputting the global vector information into the decoder;
determining a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder;
adjusting the initial global vector according to the image feature vector and the first influence weight to obtain an adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information;
determining a second influence weight of the adjusted initial global vector on the next global vector information in the order information, and obtaining an adjusted next global vector according to the adjusted initial global vector, the next global vector information and the second influence weight; cycling in this way until all the adjusted global vectors are obtained;
and obtaining the output vector according to the adjusted global vectors.
Optionally, in the foregoing method, determining the content tag corresponding to the output vector comprises:
determining a candidate word vector for each word in the vocabulary;
determining, for each output vector, the candidate word vector with the nearest first distance;
and taking the word corresponding to the candidate word vector with the nearest first distance as the content tag corresponding to that output vector.
Optionally, the foregoing method further comprises, after obtaining the corresponding content tags from the word vectors and the image feature vector:
acquiring the total number of the content tags;
when the total number of content tags is greater than a preset upper threshold, acquiring the second distance between the candidate word vector and the output vector corresponding to the same content tag;
determining the correspondence between each content tag and its second distance;
arranging the content tags by second distance from smallest to largest;
and deleting, according to the correspondence, the content tags whose position in the arrangement exceeds the upper threshold.
In a second aspect, the present invention also provides a data processing apparatus, comprising:
an acquisition module, configured to acquire multimedia data for which content tags are to be generated, and text information describing the multimedia data, wherein the multimedia data comprises video or images;
a determining module, configured to determine a word vector for each segmented word in the text information;
a vector acquisition module, configured to perform feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
a global module, configured to obtain, through a self-attention mechanism, the global relation between the word vectors and the image feature vector, and to obtain the corresponding global vector information according to the global relation;
a decoding module, configured to take the image feature vector as the first input of a decoder and then sequentially input the global vector information into the decoder, so as to obtain the output vector for each global vector decoded under the guidance of the image feature vector;
and a tag determining module, configured to determine the content tag corresponding to the output vector.
Optionally, in the foregoing apparatus, the determining module comprises:
a word segmentation unit, configured to perform word-segmentation processing on the text information to obtain the segmented words that make up the text information;
a vocabulary unit, configured to obtain a corresponding vocabulary from the segmented words and preset tag words;
and a word vector unit, configured to determine the word vector of each segmented word according to the vocabulary and a pre-trained word vector model.
Optionally, in the foregoing apparatus, the vector acquisition module comprises:
a first input unit, configured to input the multimedia data into a preset deep neural network;
and an extraction unit, configured to acquire the image feature vector produced after a feature extraction layer in the deep neural network performs feature extraction on the multimedia data.
Optionally, in the foregoing apparatus, the global module comprises:
a dimension adjustment unit, configured to perform vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and to perform vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector;
a fusion unit, configured to input each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for splicing and fusion, and to obtain the vector information corresponding to each dimension-adjusted word vector and the dimension-adjusted image feature vector;
a self-attention unit, configured to obtain the global relation among the pieces of vector information through the self-attention mechanism;
and an adjustment unit, configured to adjust the vector information according to the global relation to obtain the global vector information.
Optionally, in the foregoing apparatus, the decoding module comprises:
a second input unit, configured to input the image feature vector into the decoder as the first input;
an order unit, configured to determine order information for sequentially inputting the global vector information into the decoder;
a first determining unit, configured to determine a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder;
an influence unit, configured to adjust the initial global vector according to the first influence weight and the image feature vector to obtain an adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information; to determine a second influence weight of the adjusted initial global vector on the next global vector information in the order information, and to obtain an adjusted next global vector according to the adjusted initial global vector, the next global vector information and the second influence weight; cycling in this way until all the adjusted global vectors are obtained;
and an output vector unit, configured to obtain the output vector according to the adjusted global vectors.
Optionally, in the foregoing apparatus, the tag determining module comprises:
a candidate word vector determining unit, configured to determine a candidate word vector for each word in the vocabulary;
a word vector screening unit, configured to determine, for each output vector, the candidate word vector with the nearest first distance;
and a tag determining unit, configured to take the word corresponding to the candidate word vector with the nearest first distance as the content tag corresponding to that output vector.
Optionally, the apparatus as described above further comprises a tag screening module, the tag screening module comprising:
a total number determining unit, configured to acquire the total number of the content tags;
a screening unit, configured to acquire, when the total number of content tags is greater than a preset upper threshold, the second distance between the candidate word vector and the output vector corresponding to the same content tag;
a correspondence unit, configured to determine the correspondence between each content tag and its second distance;
an arrangement unit, configured to arrange the content tags by second distance from smallest to largest;
and a deletion unit, configured to delete, according to the correspondence, the content tags whose position in the arrangement exceeds the upper threshold.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the invention there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
Embodiments of the invention provide a data processing method and device, the method comprising: acquiring multimedia data for which content tags are to be generated, and text information describing the multimedia data, wherein the multimedia data comprises video or images; determining a word vector for each segmented word in the text information; performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data; obtaining, through a self-attention mechanism, the global relation between the word vectors and the image feature vector, and obtaining the corresponding global vector information according to the global relation; after taking the image feature vector as the first input of a decoder, sequentially inputting the global vector information into the decoder to obtain an output vector for each global vector decoded under the guidance of the image feature vector; and determining the content tag corresponding to the output vector. Because the content tags are obtained from both the word vectors and the image features, the image features compensate when the text information lacks comprehensive or key information, so that tag generation draws on the information contained in the multimedia data itself, ultimately improving both the recall and the accuracy of the tags.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the invention;
FIG. 2 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 4 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 5 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings.
As shown in fig. 1, a method for processing data in an embodiment of the present application includes steps S1 to S6 as follows:
S1, acquiring multimedia data for which content tags are to be generated, and text information describing the multimedia data, wherein the multimedia data comprises video or images.
Specifically, the multimedia data may include, but is not limited to, one or more of pictures, videos, or video files, and the text information may be one or more keywords, long sentences, articles, or the like. The text information and the multimedia data belong to the same data item: keywords are extracted from data that contains both the multimedia data and the text information, and those keywords tag that data. For example, when the multimedia data is a piece of video, the text information may be text summarizing the content of that video.
S2, determining a word vector for each segmented word in the text information.
Specifically, both machine learning and deep learning operate on numbers, so word vectors work by mapping words into a vector space and representing each word as a vector. Conceptually, this is a mathematical embedding from a space with one dimension per word into a continuous vector space of much lower dimension. Methods of generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge-base methods, and explicit representations of the contexts in which words occur.
The word vector of each segmented word in the text information may be determined with a language-model method such as word2vec, GloVe, ELMo, or BERT.
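As a hedged illustration of step S2 (not the patented implementation), a word2vec-based lookup with the gensim library might look as follows; the toy corpus, the training parameters, and the 512-dimensional vector size are assumptions drawn from the embodiment described later.

```python
# A minimal sketch of step S2 using gensim's word2vec; the corpus,
# vector size, and training parameters are illustrative assumptions.
from gensim.models import Word2Vec

# Assumed corpus: each entry is one tokenized piece of descriptive text.
corpus = [["beautiful", "sound", "effect"], ["actor", "plays", "tragedy"]]

model = Word2Vec(sentences=corpus, vector_size=512, window=5, min_count=1)

def word_vector(word):
    """Return the 512-dimensional vector for one segmented word."""
    return model.wv[word]

print(word_vector("actor").shape)  # (512,)
```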
S3, performing feature extraction on the multimedia data to obtain the image feature vector corresponding to the multimedia data.
Specifically, feature extraction identifies the key information in the multimedia data; it may be performed with a neural network model such as a CNN to obtain the corresponding image feature information.
S4, obtaining, through a self-attention mechanism, the global relation between the word vectors and the image feature vector, and obtaining the corresponding global vector information according to the global relation.
The global relation obtained through the self-attention mechanism captures the internal correlations among the features, so the resulting global vector information represents the intended meaning more accurately, which in turn effectively improves the accuracy of the tagging result.
S5, after taking the image feature vector as the first input of the decoder, sequentially inputting the global vector information into the decoder to obtain the output vector for each global vector decoded under the guidance of the image feature vector.
Specifically, in the prior art the decoder processes the information currently being decoded by referring to the information decoded previously; however, when the first piece of information to be decoded enters the decoder, the record of previously decoded information is 0, so nothing influences that first piece. In this embodiment the image feature vector is taken as the first input of the decoder, so the image feature vector can guide the decoding of the global vector information and the multimedia data can further influence the generation of the final tags.
The decoder decodes the output of the encoder to produce the output vectors. In general, the decoder is a recurrent neural network.
S6, determining the content tag corresponding to the output vector.
Each output vector derives, through the above steps, from the dimension-adjusted word vectors and the dimension-adjusted image feature vector fed into the encoder; but after that processing, the output vector differs from those input vectors, so the corresponding word cannot be read off directly as a content tag. Instead, the corresponding word must be selected from the vocabulary using the output vector.
By the method of this embodiment, the image features can be used when the text information lacks comprehensive or key information, so that the information contained in the multimedia data is brought in, ultimately improving both the recall and the accuracy of the tags.
In some embodiments of the foregoing method, determining the word vector of each segmented word in the text information includes the following steps A1 to A3:
step A1, performing word-segmentation processing on the text information to obtain the segmented words that make up the text information;
step A2, obtaining a corresponding vocabulary from the segmented words and preset tag words;
and step A3, determining the word vector of each segmented word according to the vocabulary and a pre-trained word vector model.
Specifically, word-segmentation processing splits a text into individual words. For example, after word-segmentation processing of a descriptive sentence, the resulting segmented words might include: "beautiful", "sound effect", "tragic", "actor", "match", "play", "easy", and so on.
The preset tag words may be a set of words selected in advance; the vocabulary then contains both the tag words and the segmented words obtained by word-segmentation processing of the text information.
The pre-trained word vector model may be a word2vec model (a tool for computing word vectors); the trained word2vec model can determine the word vector of each segmented word.
Specifically, once the vocabulary and the model are determined, a word vector can be determined for each segmented word in the vocabulary. Further, the words in the vocabulary may be randomly initialized as 512-dimensional vectors, serving respectively as the word vector of each segmented word and the tag vector (the word vector of a tag word).
Through the method of this embodiment, the relations among the segmented words in the text information are obtained through the word vectors, the semantics of each segmented word are effectively captured, and the accuracy of the tagging result is effectively improved.
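A minimal sketch of steps A1 to A3, assuming jieba as the word segmenter (the disclosure does not name one) and using the random 512-dimensional initialization variant described above; the sample text and preset tag words are invented for illustration.

```python
# A sketch of steps A1-A3: segment the text, merge with preset tag
# words into a vocabulary, and randomly initialize 512-dim vectors.
# jieba is an assumed choice of segmenter; the tag words are invented.
import numpy as np
import jieba

text = "悲情演员配音效"          # descriptive text (illustrative)
preset_tags = ["喜剧", "配音"]   # assumed preset tag words

segments = jieba.lcut(text)                               # A1: segmentation
vocabulary = list(dict.fromkeys(segments + preset_tags))  # A2: vocabulary

rng = np.random.default_rng(0)
# A3 variant from the embodiment: random 512-dim initialization,
# one vector per vocabulary entry (word vectors and tag vectors alike).
vectors = {w: rng.standard_normal(512) for w in vocabulary}
```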
As shown in fig. 2, in some embodiments of the foregoing method, step S3, performing feature extraction on the multimedia data to obtain the image feature vector corresponding to the multimedia data, includes the following steps S31 and S32:
S31, inputting the multimedia data into a preset deep neural network;
S32, acquiring the image feature vector produced after a feature extraction layer in the deep neural network performs feature extraction on the multimedia data.
Specifically, a deep neural network can extract features from multimedia data, so inputting the multimedia data into the deep neural network yields the corresponding image feature vector.
One optional implementation is as follows: the multimedia data is input into an Xception (depthwise separable convolution) model. The features extracted by the penultimate layer of the Xception model are the richest, so the 2048-dimensional vector of the model's penultimate layer is taken as the image feature.
By the method of this embodiment, the feature extraction layer of the deep neural network extracts rich video feature vectors from the video information, capturing more of the information the video provides.
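A sketch of steps S31 and S32 under the Xception variant described above, using the Keras pretrained model; the frame sampling, input resizing and preprocessing choices here are assumptions, not part of the disclosure.

```python
# A sketch of steps S31-S32: feed media frames through a pretrained
# Xception backbone and take the 2048-dim pooled penultimate features.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.xception import Xception, preprocess_input

backbone = Xception(weights="imagenet", include_top=False, pooling="avg")

def image_feature_vector(frame):
    """frame: HxWx3 uint8 array (e.g. one sampled video frame)."""
    x = tf.image.resize(frame, (299, 299))        # Xception's input size
    x = preprocess_input(tf.cast(x, tf.float32))  # scale to [-1, 1]
    return backbone(x[None, ...]).numpy()[0]      # shape (2048,)

frame = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)
print(image_feature_vector(frame).shape)  # (2048,)
```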
As shown in fig. 3, in some embodiments of the foregoing method, step S4, obtaining the global relation between the word vectors and the image feature vector through a self-attention mechanism and obtaining the corresponding global vector information according to the global relation, includes the following steps S41 to S44:
S41, performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector.
Specifically, following the foregoing embodiment, the word vectors of the segmented words are 512-dimensional while the image feature vector is 2048-dimensional. Because the dimensions differ, the two cannot be spliced and fused, so their dimensions must first be unified. Optionally, since the image feature vector has the higher dimension, it can be reduced: a fully connected network brings it down to a 512-dimensional dimension-adjusted image feature vector.
S42, inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for splicing and fusion, and obtaining the vector information corresponding to each dimension-adjusted word vector and the dimension-adjusted image feature vector.
Specifically, the encoder encodes the input data; in general, the encoder is a recurrent neural network. The dimension-adjusted word vectors and the dimension-adjusted image feature vector are input into the encoder and spliced and fused so as to form a contextual relation, which makes it easy to find the global relation between each dimension-adjusted word vector and the dimension-adjusted image feature vector. One way to implement this is to treat the dimension-adjusted image feature vector as a word vector placed on the same level as each dimension-adjusted word vector; the vector information is then obtained by inputting the dimension-adjusted word vectors and the dimension-adjusted image feature vector into the encoder, which quickly achieves the goal of splicing and fusion.
S43, obtaining the global relation among the pieces of vector information through the self-attention mechanism.
Specifically, the attention mechanism mimics the internal process of biological observation: it aligns internal experience with external sensation to increase the fineness of observation in selected regions. Attention mechanisms can quickly extract the important features of sparse data and are therefore widely used in natural-language-processing tasks, especially machine translation. The self-attention mechanism is an improvement on the attention mechanism that reduces dependence on external information and is better at capturing the internal correlations of data or features. Through the self-attention mechanism, the global relation among the pieces of vector information can thus be obtained.
S44, adjusting the vector information according to the global relation to obtain the global vector information. For example, suppose there are vectors a, b and c, and the attention weights between a and b and between a and c are a1 and a2 respectively; then the global vector information corresponding to a is a1·b + a2·c, and the cases of b and c are similar.
In summary, with the method of this embodiment, the dimension-adjusted word vectors and the dimension-adjusted image feature vector are spliced and fused, and the self-attention mechanism then captures the internal correlations among the pieces of vector information, so the specific meaning of the text information and the multimedia data is analyzed more accurately, which in turn effectively improves the accuracy of the tagging result.
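As an illustrative sketch only (not the patented implementation), steps S41 to S44 can be pictured with plain numpy: a random projection stands in for the learned fully connected dimension-reduction layer, and a single self-attention head produces the global vector information. All weights and sizes below are invented placeholders.

```python
# A minimal numpy sketch of steps S41-S44: unify dimensions, splice the
# sequence, then let one self-attention head produce global vector info.
# The random projection weights stand in for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
d = 512
word_vecs = rng.standard_normal((6, d))  # six segmented words (assumed)
image_vec = rng.standard_normal(2048)    # one 2048-dim image feature

W_reduce = rng.standard_normal((2048, d)) / np.sqrt(2048)  # S41: 2048 -> 512
sequence = np.vstack([word_vecs, image_vec @ W_reduce])    # S42: splice/fuse

def self_attention(x):
    """S43-S44: each row becomes a weighted sum over all rows."""
    q, k, v = (x @ rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v  # global vector information, shape (7, 512)

global_info = self_attention(sequence)
```

Each output row is a weighted sum of the other vectors, exactly as in the a1·b + a2·c example above.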
As shown in fig. 4, in some embodiments, step S5, taking the image feature vector as the first input of the decoder and then sequentially inputting the global vector information into the decoder to obtain the output vectors of the global vectors decoded under the guidance of the image feature vector, includes the following steps S51 to S56:
Step S51, inputting the image feature vector into the decoder as the first input.
Specifically, in the prior art the decoder processes the information currently being decoded by referring to the information decoded previously; however, when the first piece of information to be decoded enters the decoder, the record of previously decoded information is 0, so nothing influences that first piece. In this embodiment the image feature vector is taken as the first input of the decoder and guides the subsequent decoding of the global vector information, so the multimedia data can further influence the generation of the final tags.
Step S52, determining the order information for sequentially inputting the global vector information into the decoder.
Specifically, the pieces of global vector information are generally input into the decoder one by one, and the order information can be obtained from the order of the words in the text information. For example: each piece of global vector information corresponds to a specific dimension-adjusted word vector, each dimension-adjusted word vector has a corresponding word vector, and each word vector corresponds to a segmented word, so the order of the pieces of global vector information corresponding to word vectors can be determined from the order of the segmented words; the order information is then complete once the position of the dimension-adjusted image feature vector (which may be placed at the beginning or the end) is determined.
S53, determining the first influence weight of the image feature vector on the initial global vector information, the initial global vector information being the global vector information first input into the decoder. The image is thereby fused into the global vector information; for example, if the influence of image a on global vector b is a1 and its influence on global vector c is a2, then the global information is a1·b + a2·c.
Specifically, the first influence weight of the image feature vector on the initial global vector information is typically determined by the decoder.
S54, adjusting the initial global vector according to the image feature vector and the first influence weight to obtain the adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information.
Specifically, adjusting the initial global vector according to the image feature vector may proceed as follows: once the first influence weight t is obtained, if the image feature vector information is M and the initial global vector information is N, the adjusted initial global vector information may be N·(1−t) + M·t.
S55, determining the second influence weight of the adjusted initial global vector on the next global vector information in the order information, and obtaining the adjusted next global vector according to the adjusted initial global vector, the next global vector information and the second influence weight; cycling in this way until all the adjusted global vectors are obtained.
Specifically, by cycling through the method of step S54 in order, all the vectors generated under the guidance of the image feature vector (i.e. the adjusted global vectors) can be obtained; the concrete implementation is as described in step S54 and is not repeated here.
Step S56, obtaining the output vectors according to the adjusted global vectors.
Specifically, the global vectors adjusted in the preceding steps may be output directly and used as the output vectors.
In summary, with the method of this embodiment, each piece of global vector information is decoded under the guidance of the image feature vector, so the features corresponding to the image feature vector are further blended into the decoded output vectors. The output vectors therefore carry more of the features of the multimedia data and reflect more of the useful information it contains.
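The guided decoding loop of steps S51 to S56 can be sketched as follows. The N·(1−t) + M·t blend is taken from the embodiment above; the sigmoid form of the influence weight is an assumption, since the disclosure only says the weight is determined by the decoder.

```python
# A sketch of steps S51-S56: the image feature guides decoding. Each
# global vector is blended with its guide using the N*(1-t) + M*t rule
# from the embodiment; the sigmoid weight function is an assumption.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def guided_decode(image_vec, global_infos):
    """image_vec: (d,); global_infos: list of (d,) in order (S52)."""
    guide = image_vec                    # S51: first input to the decoder
    outputs = []
    for g in global_infos:               # S53-S55: loop over order info
        t = sigmoid(guide @ g / len(g))  # influence weight (assumed form)
        adjusted = g * (1.0 - t) + guide * t  # N*(1-t) + M*t
        outputs.append(adjusted)         # S56: adjusted vector is output
        guide = adjusted                 # adjusted vector guides the next
    return outputs

d = 512
rng = np.random.default_rng(0)
outs = guided_decode(rng.standard_normal(d),
                     [rng.standard_normal(d) for _ in range(4)])
```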
In some embodiments of the foregoing method, step S6, determining the content tag corresponding to the output vector, includes the following steps S61 to S63:
S61, determining a candidate word vector for each word in the vocabulary;
S62, determining, for each output vector, the candidate word vector with the nearest first distance;
S63, taking the word corresponding to the candidate word vector with the nearest first distance as the content tag corresponding to that output vector.
Specifically, the candidate word vector corresponding to each word in the vocabulary is determined first; then the first distance (typically a cosine distance) between each output vector and each candidate word vector in the vocabulary is determined, and the candidate word vector nearest to each output vector is identified; finally, the word corresponding to the candidate word vector nearest to each output vector is taken as the content tag corresponding to that output vector.
In summary, with the method of this embodiment, the correlation between each candidate word vector and each output vector is captured, so the output vectors obtained from the text information and the multimedia data are analyzed more accurately for their specific meaning, which in turn effectively improves the accuracy of the tagging result.
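A minimal sketch of steps S61 to S63, assuming cosine distance as the first distance (as the embodiment suggests); the function and variable names are illustrative, not from the disclosure.

```python
# A sketch of steps S61-S63: map each output vector to the vocabulary
# word whose candidate vector is nearest by cosine distance.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def content_tags(output_vectors, candidate_vectors):
    """candidate_vectors: dict word -> vector (the vocabulary)."""
    tags = []
    for out in output_vectors:
        word = min(candidate_vectors,
                   key=lambda w: cosine_distance(out, candidate_vectors[w]))
        tags.append(word)  # nearest "first distance" word becomes the tag
    return tags
```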
In some embodiments, the foregoing method further includes, after obtaining the corresponding content tags from the word vectors and the image feature vector, the following steps B1 to B5:
Step B1, acquiring the total number of the content tags.
Specifically, this step determines the total number of all content tags obtained in the preceding steps.
Step B2, when the total number of content tags is greater than a preset upper threshold, acquiring the second distance between the candidate word vector and the output vector corresponding to the same content tag.
Specifically, the upper threshold can be set according to the actual situation. When the total number of content tags exceeds the upper threshold, some content tags must be discarded so that an excess of tags does not harm conciseness. From the steps in the foregoing embodiments, the word corresponding to the candidate word vector nearest to each output vector is taken as that output vector's content tag, so there is a unique correspondence between content tag, candidate word vector and output vector.
Step B3, determining the correspondence between each content tag and its second distance.
Specifically, once the unique correspondence between content tag, candidate word vector and output vector is established, both the candidate word vector and the output vector are fixed, so the second distance between them is fixed as well, and the correspondence between each content tag and its second distance can be obtained.
Step B4, arranging the content tags by second distance from smallest to largest.
Step B5, deleting, according to the correspondence, the content tags whose position in the arrangement exceeds the upper threshold.
Specifically, after the content tags are arranged by second distance from smallest to largest, their rank order is determined. The greater the distance, the lower the correlation between the two words, so only the content tags ranked within the upper threshold are kept, which ensures the accuracy of the semantic expression of the content tags.
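Steps B1 to B5 amount to a sort-and-truncate; a minimal sketch, with invented tag names and distances, might look like this.

```python
# A sketch of steps B1-B5: when more tags are produced than the preset
# upper threshold allows, keep only the nearest ones by second distance.
def filter_tags(tag_distances, upper_threshold):
    """tag_distances: list of (tag, second_distance) pairs (step B3)."""
    if len(tag_distances) <= upper_threshold:             # B1/B2 check
        return [tag for tag, _ in tag_distances]
    ranked = sorted(tag_distances, key=lambda td: td[1])  # B4: ascending
    return [tag for tag, _ in ranked[:upper_threshold]]   # B5: drop rest

print(filter_tags([("comedy", 0.12), ("dub", 0.30), ("actor", 0.08)], 2))
# ['actor', 'comedy']
```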
As shown in fig. 5, in the second aspect, an embodiment of the present invention also provides a data processing apparatus, comprising:
an acquisition module 1, configured to acquire multimedia data for which content tags are to be generated, and text information describing the multimedia data, the multimedia data comprising video or images;
a determining module 2, configured to determine a word vector for each segmented word in the text information;
a vector acquisition module 3, configured to perform feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
a global module 4, configured to obtain, through a self-attention mechanism, the global relation between the word vectors and the image feature vector, and to obtain the corresponding global vector information according to the global relation;
a decoding module 5, configured to take the image feature vector as the first input of a decoder and then sequentially input the global vector information into the decoder, so as to obtain the output vector for each global vector decoded under the guidance of the image feature vector;
and a tag determining module 6, configured to determine the content tag corresponding to the output vector.
In some embodiments of the foregoing apparatus, the determining module comprises:
a word segmentation unit, configured to perform word-segmentation processing on the text information to obtain the segmented words that make up the text information;
a vocabulary unit, configured to obtain a corresponding vocabulary from the segmented words and preset tag words;
and a word vector unit, configured to determine the word vector of each segmented word according to the vocabulary and a pre-trained word vector model.
In some embodiments of the foregoing apparatus, the vector acquisition module comprises:
a first input unit, configured to input the multimedia data into a preset deep neural network;
and an extraction unit, configured to acquire the image feature vector produced after a feature extraction layer in the deep neural network performs feature extraction on the multimedia data.
In some embodiments of the foregoing apparatus, the global module comprises:
a dimension adjustment unit, configured to perform vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and to perform vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector;
a fusion unit, configured to input each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for splicing and fusion, and to obtain the vector information corresponding to each dimension-adjusted word vector and the dimension-adjusted image feature vector;
a self-attention unit, configured to obtain the global relation among the pieces of vector information through the self-attention mechanism;
and an adjustment unit, configured to adjust the vector information according to the global relation to obtain the global vector information.
In some embodiments of the foregoing apparatus, the decoding module comprises:
a second input unit, configured to input the image feature vector into the decoder as the first input;
an order unit, configured to determine order information for sequentially inputting the global vector information into the decoder;
a first determining unit, configured to determine a first influence weight of the image feature vector on initial global vector information, the initial global vector information being the global vector information first input into the decoder;
an influence unit, configured to adjust the initial global vector according to the first influence weight and the image feature vector to obtain an adjusted initial global vector, the initial global vector being obtained after the decoder decodes the initial global vector information; to determine a second influence weight of the adjusted initial global vector on the next global vector information in the order information, and to obtain an adjusted next global vector according to the adjusted initial global vector, the next global vector information and the second influence weight; cycling in this way until all the adjusted global vectors are obtained;
and an output vector unit, configured to obtain the output vectors according to the adjusted global vectors.
In some embodiments of the foregoing apparatus, the tag determining module comprises:
a candidate word vector determining unit, configured to determine a candidate word vector for each word in the vocabulary;
a word vector screening unit, configured to determine, for each output vector, the candidate word vector with the nearest first distance;
and a tag determining unit, configured to take the word corresponding to the candidate word vector with the nearest first distance as the content tag corresponding to that output vector.
In some embodiments, the apparatus as described above further comprises a tag screening module, the tag screening module comprising:
a total number determining unit, configured to acquire the total number of the content tags;
a screening unit, configured to acquire, when the total number of content tags is greater than a preset upper threshold, the second distance between the candidate word vector and the output vector corresponding to the same content tag;
a correspondence unit, configured to determine the correspondence between each content tag and its second distance;
an arrangement unit, configured to arrange the content tags by second distance from smallest to largest;
and a deletion unit, configured to delete, according to the correspondence, the content tags whose position in the arrangement exceeds the upper threshold.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, comprising a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, wherein the processor 1501, the communication interface 1502 and the memory 1503 communicate with one another through the communication bus 1504;
the memory 1503 is configured to store a computer program;
the processor 1501 is configured, when executing the program stored in the memory 1503, to implement the steps of the data processing method described in any of the foregoing embodiments.
The communication bus of the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which instructions are stored, which when executed on a computer, cause the computer to perform the data processing method for generating content tags according to any one of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data processing method of generating content tags as described in any of the above embodiments.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that includes the element.
In this specification, the embodiments are described in a progressive and related manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for the relevant parts, reference may be made to the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (9)

1. A method of data processing, comprising:
acquiring multimedia data for generating content tags and text information for describing the multimedia data; wherein the multimedia data comprises: video or image;
determining a word vector of each word segmentation in the text information;
extracting the characteristics of the multimedia data to obtain image characteristic vectors corresponding to the multimedia data;
acquiring global relations between the word vectors and the image feature vectors through a self-attention mechanism, and respectively acquiring global vector information corresponding to the word vectors and the image feature vectors according to the global relations;
after the image feature vector is used as a first input of a decoder, sequentially inputting the global vector information into the decoder to obtain output vectors after decoding each global vector under the guidance of the image feature vector, wherein the method comprises the following steps of: inputting the image feature vector into the decoder as a first input; determining order information for sequentially inputting the global vector information to the decoder; determining first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is global vector information input into the decoder for the first time; according to the first influence weight, the initial global vector is adjusted according to the image feature vector, and an adjusted initial global vector is obtained; the initial global vector is obtained after the decoder decodes the initial global vector information; determining a second influence weight of the adjusted initial global vector on next global vector information in the order information, and obtaining an adjusted next global vector according to the adjusted initial global vector, the next global vector information and the second influence weight; cycling until all the adjusted global vectors are obtained; obtaining the output vector according to the adjusted global vector;
and determining the content tag corresponding to the output vector.
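By way of illustration of claim 1 only, the following minimal Python sketch mimics the recited decoding loop: the image feature vector guides the first decoding step, and each adjusted global vector then guides the next. The gating form of the influence weights (a sigmoid over a dot product), the blending rule, and all names are assumptions made for readability, not the patented implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def guided_decode(image_vec, global_infos, decode):
        # image_vec: (d,) image feature vector, the decoder's first input
        # global_infos: list of (d,) global vector information items, in order
        # decode: decoder step mapping a (d,) vector to a (d,) global vector
        outputs = []
        guide = image_vec                      # first input to the decoder
        for info in global_infos:              # follows the order information
            g = decode(info)                   # decode the global vector info
            w = sigmoid(guide @ info)          # influence weight of the guide
            g = (1.0 - w) * g + w * guide      # adjust under the guide
            outputs.append(g)                  # output vector for this step
            guide = g                          # adjusted vector guides the next
        return outputs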
2. The method of claim 1, wherein said determining a word vector for each word segment in the text information comprises:
performing word segmentation on the text information to obtain the word segments that form the text information;
obtaining a corresponding word list according to the word segments and preset tag words;
and determining the word vector of each word segment according to a pre-trained word vector model and the word list.
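As a hedged illustration of claim 2, the sketch below builds a word list from the segments plus preset tag words and looks each entry up in a pre-trained embedding table. The whitespace tokenizer and the toy table are stand-ins, since the claim names neither a segmenter nor a particular word vector model.

    import numpy as np

    def word_vectors(text, preset_tags, embedding, dim=4):
        segments = text.split()              # stand-in for a real segmenter
        word_list = segments + [t for t in preset_tags if t not in segments]
        # unknown words fall back to a zero vector in this toy sketch
        return {w: embedding.get(w, np.zeros(dim)) for w in word_list}

    emb = {"cat": np.ones(4), "video": np.arange(4.0)}   # assumed table
    vectors = word_vectors("cat video", ["pet"], emb)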
3. The method according to claim 1, wherein said performing feature extraction on the multimedia data to obtain the image feature vector corresponding to the multimedia data comprises:
inputting the multimedia data into a preset deep neural network;
and acquiring the image feature vector obtained after a feature extraction layer in the deep neural network performs feature extraction on the multimedia data.
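For instance, assuming a recent torchvision is available, the "preset deep neural network" could be approximated by a ResNet-50 whose classification head is dropped, so the forward pass stops at the feature extraction layers; the patent does not name a specific architecture.

    import torch
    import torchvision.models as models

    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    resnet.eval()
    # keep everything up to (and including) the global average pool
    backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

    with torch.no_grad():
        frame = torch.randn(1, 3, 224, 224)                # one frame or image
        image_feature_vector = backbone(frame).flatten(1)  # shape (1, 2048)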
4. The method according to claim 1, wherein said acquiring, through the self-attention mechanism and according to the global relations between the word vectors and the image feature vector, the global vector information corresponding to the word vectors and to the image feature vector comprises:
performing vector dimension adjustment on each word vector to obtain a dimension-adjusted word vector, and performing vector dimension adjustment on the image feature vector to obtain a dimension-adjusted image feature vector;
inputting each dimension-adjusted word vector and the dimension-adjusted image feature vector into an encoder for splicing and fusion, to obtain vector information corresponding to each dimension-adjusted word vector and to the dimension-adjusted image feature vector;
obtaining the global relations among the pieces of vector information through the self-attention mechanism;
and adjusting the vector information according to the global relations to obtain the global vector information.
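A minimal numpy rendering of this step is given below: the dimension-adjusted word vectors and image feature vector are stacked, pairwise global relations are computed with scaled dot-product self-attention, and each vector is reweighted accordingly. Single-head attention with identity Q/K/V projections is an assumption made for brevity.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(vectors):
        # rows: dimension-adjusted word vectors plus the image feature vector
        q = k = v = vectors                    # identity projections (assumed)
        relations = softmax(q @ k.T / np.sqrt(vectors.shape[1]))
        return relations @ v                   # global vector information

    tokens = np.random.randn(5, 8)             # 4 word vectors + 1 image vector
    global_info = self_attention(tokens)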
5. The method of claim 2, wherein the determining the content tag corresponding to the output vector comprises:
determining candidate word vectors of all words in the word list;
determining, for each output vector, the candidate word vector with the smallest first distance to the output vector;
and taking the word corresponding to that nearest candidate word vector as the content tag corresponding to the output vector.
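In other words, each output vector is snapped to its nearest vocabulary entry. The sketch below assumes the "first distance" is Euclidean, a choice the claim leaves open.

    import numpy as np

    def tags_from_outputs(output_vecs, word_list, candidate_vecs):
        # pairwise Euclidean distances: (num_outputs, num_candidates)
        dists = np.linalg.norm(
            output_vecs[:, None, :] - candidate_vecs[None, :, :], axis=-1)
        return [word_list[i] for i in dists.argmin(axis=1)]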
6. The method of claim 5, further comprising, after obtaining the corresponding content tags from the word vectors and the image feature vector:
acquiring the total number of content tags;
when the total number of content tags is greater than a preset upper threshold, acquiring, for each content tag, a second distance between the candidate word vector and the output vector that correspond to that tag;
determining the correspondence between each content tag and its second distance;
sorting the content tags in ascending order of the second distance;
and deleting, according to the correspondence, the content tags whose rank exceeds the upper threshold.
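A compact sketch of this pruning rule follows; it keeps at most `upper` tags, preferring those whose candidate word vector lies closest to its output vector. The stable ordering of the kept tags is an assumption.

    import numpy as np

    def prune_tags(tags, second_distances, upper):
        if len(tags) <= upper:
            return tags
        order = np.argsort(second_distances)   # ascending second distance
        keep = sorted(order[:upper])           # drop ranks beyond the threshold
        return [tags[i] for i in keep]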
7. A data processing apparatus, comprising:
the acquisition module is used for acquiring multimedia data for which content tags are to be generated, and text information describing the multimedia data, wherein the multimedia data comprises video or images;
the determining module is used for determining a word vector for each word segment in the text information;
the vector acquisition module is used for performing feature extraction on the multimedia data to obtain an image feature vector corresponding to the multimedia data;
the global module is used for respectively obtaining, through a self-attention mechanism and according to global relations between the word vectors and the image feature vector, global vector information corresponding to the word vectors and to the image feature vector;
the decoding module is used for taking the image feature vector as a first input of a decoder and then sequentially inputting the global vector information into the decoder, so as to obtain an output vector for each global vector decoded under the guidance of the image feature vector, which comprises: inputting the image feature vector into the decoder as the first input; determining order information for sequentially inputting the global vector information into the decoder; determining a first influence weight of the image feature vector on initial global vector information, wherein the initial global vector information is the global vector information first input into the decoder; adjusting the initial global vector by the image feature vector according to the first influence weight to obtain an adjusted initial global vector, wherein the initial global vector is obtained after the decoder decodes the initial global vector information; determining a second influence weight of the adjusted initial global vector on the next global vector information in the order information, and obtaining an adjusted next global vector according to the adjusted initial global vector, the next global vector information, and the second influence weight; repeating this adjustment until all adjusted global vectors are obtained; and obtaining the output vectors according to the adjusted global vectors;
and the tag determining module is used for determining the content tag corresponding to the output vector.
8. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the method steps of any one of claims 1-6 when executing the program stored on the memory.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202010589941.9A 2020-06-24 2020-06-24 Data processing method and device Active CN111767727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010589941.9A CN111767727B (en) 2020-06-24 2020-06-24 Data processing method and device

Publications (2)

Publication Number Publication Date
CN111767727A (en) 2020-10-13
CN111767727B (en) 2024-02-06

Family

ID=72721795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589941.9A Active CN111767727B (en) 2020-06-24 2020-06-24 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111767727B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171268A (en) * 2018-01-02 2018-06-15 联想(北京)有限公司 A kind of image processing method and electronic equipment
CN109002852A (en) * 2018-07-11 2018-12-14 腾讯科技(深圳)有限公司 Image processing method, device, computer readable storage medium and computer equipment
CN110309839A (en) * 2019-08-27 2019-10-08 北京金山数字娱乐科技有限公司 A kind of method and device of iamge description
CN110458242A (en) * 2019-08-16 2019-11-15 广东工业大学 A kind of iamge description generation method, device, equipment and readable storage medium storing program for executing
CN110580292A (en) * 2019-08-28 2019-12-17 腾讯科技(深圳)有限公司 Text label generation method and device and computer readable storage medium
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN111125177A (en) * 2019-12-26 2020-05-08 北京奇艺世纪科技有限公司 Method and device for generating data label, electronic equipment and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087199B2 (en) * 2016-11-03 2021-08-10 Nec Corporation Context-aware attention-based neural network for interactive question answering
CN108305296B (en) * 2017-08-30 2021-02-26 深圳市腾讯计算机系统有限公司 Image description generation method, model training method, device and storage medium
CN110490213B (en) * 2017-09-11 2021-10-29 腾讯科技(深圳)有限公司 Image recognition method, device and storage medium
CN109658455B (en) * 2017-10-11 2023-04-18 阿里巴巴集团控股有限公司 Image processing method and processing apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image semantic description algorithm based on global-local features and adaptive attention mechanism; Zhao Xiaohu; Yin Liangfei; Zhao Chenglong; Journal of Zhejiang University (Engineering Science), Issue 01; full text *

Also Published As

Publication number Publication date
CN111767727A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN111767461B (en) Data processing method and device
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN109726274B (en) Question generation method, device and storage medium
CN109034203B (en) Method, device, equipment and medium for training expression recommendation model and recommending expression
CN109635157B (en) Model generation method, video search method, device, terminal and storage medium
US11120268B2 (en) Automatically evaluating caption quality of rich media using context learning
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN112668333A (en) Named entity recognition method and device, and computer-readable storage medium
CN112163560A (en) Video information processing method and device, electronic equipment and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN114996511A (en) Training method and device for cross-modal video retrieval model
CN111767726B (en) Data processing method and device
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN111460224B (en) Comment data quality labeling method, comment data quality labeling device, comment data quality labeling equipment and storage medium
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
CN109145261B (en) Method and device for generating label
CN111767727B (en) Data processing method and device
CN113011162B (en) Reference digestion method, device, electronic equipment and medium
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN115019295A (en) Model training method, text line determination method and text line determination device
CN114610878A (en) Model training method, computer device and computer-readable storage medium
CN113407776A (en) Label recommendation method and device, training method and medium of label recommendation model
CN115878849B (en) Video tag association method and device and electronic equipment
CN117077678B (en) Sensitive word recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant