CN111767726A - Data processing method and device - Google Patents
- Publication number
- CN111767726A (application CN202010588592.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- vector
- word
- information
- fused
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The embodiment of the invention provides a data processing method and device, wherein the method includes the following steps: acquiring video information for generating a content label and text information describing the video information; determining a word vector for each word segment in the text information; performing feature extraction on the video information to obtain a corresponding video feature vector; cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors, respectively; and obtaining the corresponding content label from the fused word vectors and the fused video feature vectors. With the method of this embodiment, the features of the text information and the video information are cross-fused and their cross-modal information is extracted, making the resulting content labels more accurate.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a data processing method and device.
Background
At present, data-labeling methods are mainly text-based. However, video content carries many expressible features, and it is difficult to represent all the information in a video completely through text. When the text content consists of only a few phrases, the information those phrases provide is limited; without the accompanying video content, the text may fail to capture the main information, and analysis may yield little that is useful.
The prior art offers related solutions to these problems, but most existing image-text fusion methods simply concatenate the features of the two modalities at the input. Although this yields more features at the encoder, the text and video features remain independent of each other, the effect is limited, and the video information cannot be fully exploited at the decoder.
No effective solution to these technical problems in the related art has yet been proposed.
Disclosure of Invention
An embodiment of the present invention provides a data processing method and apparatus, so as to solve at least one of the technical problems in the related art. The specific technical solution is as follows:
in a first aspect of the present invention, there is provided a data processing method, including:
acquiring video information for generating a content label and text information describing the video information;
determining a word vector for each word segment in the text information;
performing feature extraction on the video information to obtain a corresponding video feature vector;
cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors, respectively;
and obtaining the corresponding content label from the fused word vectors and the fused video feature vectors.
Optionally, as in the foregoing method, determining a word vector for each word segment in the text information includes:
performing word segmentation on the text information to obtain the word segments that make up the text information;
obtaining a corresponding word list from the word segments and preset tag words;
and determining the word vector of each word segment from a pre-trained word vector model and the word list.
Optionally, as in the foregoing method, the performing feature extraction on the video information to obtain a video feature vector corresponding to the video information includes:
performing image extraction on the video information according to frames to obtain at least two video frame images;
respectively inputting the video frame images into a deep neural network;
and obtaining the video feature vector obtained by performing feature extraction on the video frame image by a feature extraction layer in the deep neural network.
Optionally, as in the foregoing method, cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors includes:
performing vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and performing vector dimension adjustment on the video feature vectors to obtain dimension-adjusted video feature vectors;
splicing each dimension-adjusted word vector with the dimension-adjusted video feature vectors to obtain initial word vector information for the dimension-adjusted word vectors and initial video vector information for the dimension-adjusted video feature vectors;
determining a hierarchical relationship of the mutual-attention layers, where the hierarchical relationship represents the connection relationship between different mutual-attention layers;
inputting the word vector information and the video vector information into the mutual-attention layer arranged as the first layer for cross fusion of features: according to the first word-vector influence weight of each piece of initial word vector information on each piece of initial video vector information, obtaining each piece of first video vector information into which all initial word vector information is fused; and according to the first video-vector influence weight of each piece of initial video vector information on each piece of initial word vector information, obtaining each piece of first word vector information into which all initial video vector information is fused;
inputting the first word vector information and the first video vector information into the next mutual-attention layer according to the hierarchical relationship for another round of cross fusion, obtaining second word vector information and second video vector information, respectively, and repeating these steps until the last mutual-attention layer outputs fused word vector information and fused video feature vector information;
decoding the fused word vector information to obtain the fused word vectors;
and decoding the fused video feature vector information to obtain the fused video feature vectors.
Optionally, as in the foregoing method, obtaining the corresponding content label from the fused word vector and the fused video feature vector includes:
determining candidate word vectors for all words in the word list;
determining, for each output vector, the candidate word vector at the smallest first distance from it, where the output vectors include the fused word vector and the fused video feature vector;
and taking the word corresponding to that closest candidate word vector as the content label for the output vector.
Optionally, as in the foregoing method, after obtaining the corresponding content tags from the fused word vector and the fused video feature vector, the method further includes:
acquiring the total number of the content tags;
when the total number of content tags is greater than a preset upper-limit threshold, acquiring the second distance between the candidate word vector and the output vector corresponding to the same content tag;
determining the correspondence between each content tag and its second distance;
sorting the content tags by the second distance in ascending order;
and deleting, according to the correspondence, the content tags whose rank exceeds the upper-limit threshold.
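The screening steps above can be sketched as follows; this is a minimal illustrative implementation, and the function name, tag/distance pairing, and list layout are assumptions not taken from the patent text.

```python
# Hypothetical sketch of the tag-screening step: keep at most `upper_limit`
# content tags, preferring smaller second distances; tags ranked beyond the
# upper-limit threshold are deleted.
def screen_tags(tag_distances, upper_limit):
    if len(tag_distances) <= upper_limit:
        return list(tag_distances)          # total number within the threshold
    # sort the tags by their second distance, ascending (small to large)
    ranked = sorted(tag_distances, key=lambda t: t[1])
    # delete the tags whose rank exceeds the upper-limit threshold
    return ranked[:upper_limit]

tags = [("actor", 0.42), ("comedy", 0.13), ("music", 0.31), ("news", 0.77)]
print(screen_tags(tags, 2))   # the two closest tags survive
```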
Optionally, as in the foregoing method, performing image extraction on the video information by frame to obtain at least two video frame images includes:
acquiring the total number of frames in the video information;
determining a preset upper-limit threshold on the number of images;
determining, from the numerical relationship between the total frame count and the image-count threshold, a preset extraction strategy for extracting images from the video information;
and performing image extraction on the video information by frame according to that strategy, obtaining a number of video frame images no greater than the image-count threshold.
In a second aspect of the present invention, there is also provided a data processing apparatus comprising:
an acquisition module, configured to acquire video information for generating a content label and text information describing the video information;
the determining module is used for determining a word vector of each word segmentation in the text information;
the vector acquisition module is used for extracting the characteristics of the video information to obtain a video characteristic vector corresponding to the video information;
the feature fusion module is used for performing cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the label determining module is used for obtaining the corresponding content label according to the fused word vector and the fused video feature vector.
Optionally, as in the foregoing apparatus, the determining module includes:
a word segmentation unit, configured to perform word segmentation on the text information to obtain the word segments that make up the text information;
a word list unit, configured to obtain a corresponding word list from the word segments and preset tag words;
and a word vector unit, configured to determine the word vector of each word segment from a pre-trained word vector model and the word list.
Optionally, as in the foregoing apparatus, the vector obtaining module includes:
the extraction unit is used for carrying out image extraction on the video information according to frames to obtain at least two video frame images;
the input unit is used for respectively inputting the video frame images into the deep neural network;
and the obtaining unit is used for obtaining the video feature vector obtained after the feature extraction layer in the deep neural network extracts the features of the video frame image.
Optionally, in the foregoing apparatus, the feature fusion module includes:
the dimension adjusting unit is used for carrying out vector dimension adjustment on the word vectors to obtain dimension adjusting word vectors and carrying out vector dimension adjustment on the video feature vectors to obtain the dimension adjusting video feature vectors;
the splicing unit is used for splicing each dimensionality-adjusting word vector with the dimensionality-adjusting video feature vector to obtain initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
the relationship determination unit is used for determining the hierarchical relationship of the mutual attention layers, and the hierarchical relationship represents the connection relationship between different mutual attention layers;
a fusion unit, configured to input the word vector information and the video vector information into the mutual attention layer disposed on a first layer to perform cross fusion of features, obtain, according to a first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, each piece of first video vector information in which all pieces of initial word vector information are fused, and obtain, according to a first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information, each piece of first word vector information in which all pieces of initial video vector information are fused;
an output unit, configured to input the first word vector information and the first video vector information into the next mutual-attention layer according to the hierarchical relationship for another round of cross fusion, obtaining second word vector information and second video vector information, respectively, and to repeat these steps until the last mutual-attention layer outputs fused word vector information and fused video feature vector information;
the first decoding unit is used for decoding the fused word vector information to obtain the fused word vector;
and the second decoding unit is used for decoding the fused video feature vector information to obtain the fused video feature vector.
Optionally, in the foregoing apparatus, the tag determining module includes:
the first determining unit is used for determining candidate word vectors of all words in the word list;
a second determining unit, configured to determine, for each output vector, the candidate word vector at the smallest first distance from it, where the output vectors include the fused word vector and the fused video feature vector;
and a label determining unit, configured to take the word corresponding to that closest candidate word vector as the content label for the output vector.
Optionally, the apparatus as described above, further comprising: a tag screening module; the tag screening module comprises:
a total number unit for acquiring a total number of the content tags;
a second distance unit, configured to obtain a second distance between the candidate word vector and the output vector corresponding to the same content tag when the total number of the content tags is greater than a preset upper threshold;
a third determining unit configured to determine a correspondence between the content tag and the second distance;
the arrangement unit is used for arranging the content labels according to the second distance from small to large;
and the screening unit is used for deleting the content tags of which the arrangement order is greater than the upper limit threshold value according to the corresponding relation.
Optionally, as in the foregoing apparatus, the extracting unit includes:
a total frame number subunit, configured to obtain a total frame number of an image included in the video information;
the threshold subunit is used for determining a preset upper limit threshold of the number of images;
a strategy subunit, configured to determine, from the numerical relationship between the total frame count and the image-count threshold, a preset extraction strategy for extracting images from the video information;
and the image determining subunit is used for performing image extraction on the video information by frames according to the extraction strategy to obtain video frame images with the number less than or equal to the number corresponding to the upper limit threshold of the number of the images.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
The embodiment of the invention provides a data processing method and device, wherein the method includes: acquiring video information for generating a content label and text information describing the video information; determining a word vector for each word segment in the text information; performing feature extraction on the video information to obtain a corresponding video feature vector; cross-fusing the features of the word vectors and the features of the video feature vectors through a mutual-attention mechanism to obtain fused word vectors and fused video feature vectors, respectively; and obtaining the corresponding content label from the fused word vectors and the fused video feature vectors. With the method of this embodiment, the features of the text information and the video information are cross-fused and their cross-modal information is extracted, making the resulting content labels more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
Fig. 1 shows a data processing method according to an embodiment of the present application, which includes the following steps S1 to S5:
s1, video information used for generating a content label and text information used for describing the video information are obtained.
Specifically, the video information is a sequence of consecutive pictures that exhibits a smooth, continuous visual effect; video typically runs at more than 24 frames per second. The text information may be one or more keywords, a long sentence, an article, and so on. The method extracts keywords from, and labels, data that contains both video information and text information, so the text information and video information belong to the same piece of data. For example, when the video information is a video clip, the text information may be text summarizing that clip.
And S2, determining a word vector of each word in the text information.
Specifically, both machine learning and deep learning essentially operate on numbers; the task of a word vector is to map a word into a vector space and represent it as a vector. Conceptually, this is a mathematical embedding from a one-dimensional space per word into a continuous vector space of lower dimension. Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge-base methods, and explicit representations in terms of the contexts in which words occur.
Determining a word vector for each word segment in the text information may be implemented with language-model methods such as word2vec, GloVe, ELMo, or BERT.
And S3, extracting the characteristics of the video information to obtain a video characteristic vector corresponding to the video information.
Specifically, performing feature extraction on the video information generally means extracting images from the video, that is, extracting individual frames and then recognizing them to obtain the key information in the video.
And S4, performing cross fusion on the characteristics of the word vectors and the characteristics of the video characteristic vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video characteristic vectors.
Specifically, a mutual-attention mechanism can be used to discover the hidden dependencies between the word vectors and the video feature vectors, so that the video feature vectors supply complementary information to the word vectors and the word vectors supply complementary information to the video feature vectors, giving a higher degree of information fusion between the two.
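The mutual attention of step S4 can be sketched as a pair of cross-attention passes, one per direction. This is a hedged illustration, not the patent's exact formulation: the 512-dimensional size, the scaled-dot-product softmax weights, and the residual addition are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_attention(words, frames):
    """One illustrative mutual-attention layer.
    words: (num_words, d); frames: (num_frames, d), same dimension d.
    Returns (fused_words, fused_frames) with unchanged shapes."""
    d = words.shape[1]
    # influence weights of the video vectors on each word vector
    w2v = softmax(words @ frames.T / np.sqrt(d), axis=1)  # (num_words, num_frames)
    # influence weights of the word vectors on each video vector
    v2w = softmax(frames @ words.T / np.sqrt(d), axis=1)  # (num_frames, num_words)
    fused_words = words + w2v @ frames    # each word vector fuses all video vectors
    fused_frames = frames + v2w @ words   # each video vector fuses all word vectors
    return fused_words, fused_frames

rng = np.random.default_rng(0)
fw, ff = mutual_attention(rng.normal(size=(5, 512)), rng.normal(size=(8, 512)))
print(fw.shape, ff.shape)   # shapes preserved: (5, 512) (8, 512)
```

Stacking several such layers, each fed the previous layer's outputs, corresponds to the hierarchical relationship of mutual-attention layers described earlier.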
And S5, obtaining a corresponding content label according to the fused word vector and the fused video feature vector.
Specifically, the content label may be obtained from the fused word vector and the fused video feature vector in either of two ways:
1) obtain labels separately from the fused word vector and from the fused video feature vector, and then combine those labels into the content labels;
2) fuse the fused word vector and the fused video feature vector again, so that the two influence each other once more, and obtain the content labels from the resulting vectors.
With the method of this embodiment, when the text information lacks comprehensive or key information, the video features can be used and the information contained in the video information brought to bear, ultimately improving both the recall and the accuracy of the labels.
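The nearest-candidate lookup used when mapping output vectors to content labels can be sketched as follows. Everything here is illustrative: the function name, the Euclidean "first distance", and the toy two-dimensional vectors are assumptions made for the example.

```python
import numpy as np

def tags_from_outputs(output_vectors, candidate_vectors, words):
    """Map each output vector (a fused word or video vector) to the word
    whose candidate word vector lies at the smallest distance from it."""
    tags = []
    for out in output_vectors:
        dists = np.linalg.norm(candidate_vectors - out, axis=1)  # first distances
        tags.append(words[int(dists.argmin())])                  # closest candidate wins
    return tags

words = ["cat", "dog", "car"]
cands = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
outs = np.array([[0.9, 0.1], [0.9, 1.1]])
print(tags_from_outputs(outs, cands, words))   # ['cat', 'car']
```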
In some embodiments, as in the foregoing method, determining a word vector for each word segment in the text information includes the following steps A1 to A3:
A1, performing word segmentation on the text information to obtain the word segments that make up the text information;
A2, obtaining a corresponding word list from the word segments and the preset tag words;
and A3, determining the word vector of each word segment from the pre-trained word vector model and the word list.
Specifically, word segmentation splits a piece of text into individual words. For example, when the text information is a short film-review sentence, the resulting word segments are its component words: the nouns, verbs, modifiers, and particles that make up the sentence.
The preset tag words may be phrases selected in advance; the words in the word list then comprise both the tag words and the word segments obtained by segmenting the text information.
The pre-trained word vector model may be a word2vec model (a tool for computing word vectors); the word vector for each word segment can thus be determined with the trained word2vec model.
Specifically, once the word list and the model are determined, the word vector of each word segment in the word list can be determined. Further, the words in the word list may each be randomly initialized as 512-dimensional vectors, serving respectively as the word vector of each word segment and the tag vector (the word vector of a tag word).
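Building the word list and randomly initializing a 512-dimensional vector per entry, as just described, can be sketched as below. The de-duplication rule, the RNG seed, and the lookup index are illustrative assumptions; in practice these vectors would be trained rather than left random.

```python
import numpy as np

def build_embeddings(segments, tag_words, dim=512, seed=0):
    """Word list = word segments + preset tag words, each mapped to a
    randomly initialized `dim`-dimensional vector."""
    vocab = list(dict.fromkeys(segments + tag_words))  # ordered, de-duplicated
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(len(vocab), dim))         # one random row per word
    index = {w: i for i, w in enumerate(vocab)}        # word -> row lookup
    return vocab, table, index

vocab, table, index = build_embeddings(["video", "sound", "video"], ["comedy"])
print(len(vocab), table.shape)   # 3 (3, 512)
```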
With the method of this embodiment, the relationships among the word segments in the text information can be captured by the word vectors, the semantics of each word segment can be obtained effectively, and the accuracy of the labeling results can be improved.
As shown in fig. 2, in some embodiments, as in the foregoing method, step S3 of performing feature extraction on the video information to obtain a corresponding video feature vector includes the following steps S31 to S33:
S31, performing image extraction on the video information by frame to obtain at least two video frame images;
S32, inputting the video frame images into a deep neural network;
and S33, obtaining the video feature vector produced by the feature extraction layer of the deep neural network from the video frame images.
Specifically, images may be extracted from the video information frame by frame; in addition, since the images of adjacent frames may be nearly identical, images may instead be extracted at fixed intervals, with the specific extraction strategy chosen according to the actual situation. A video frame image is an image obtained by this extraction from the video information.
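The frame-selection strategy above can be sketched as follows. The uniform-interval rule is an assumption about the "preset extraction strategy"; the patent leaves the concrete rule open.

```python
def select_frame_indices(total_frames, max_images):
    """If the video has no more frames than the upper-limit threshold,
    take every frame; otherwise sample at a fixed interval so that at
    most `max_images` frames are kept."""
    if total_frames <= max_images:
        return list(range(total_frames))   # keep every frame
    step = total_frames / max_images       # fixed sampling interval
    return [int(i * step) for i in range(max_images)]

print(select_frame_indices(6, 10))   # [0, 1, 2, 3, 4, 5]
print(select_frame_indices(100, 4))  # [0, 25, 50, 75]
```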
Because the deep neural network has the capability of extracting the features of the image information, the corresponding video feature vector can be obtained by inputting the video frame image into the deep neural network.
One optional implementation is as follows: the video frame images are input into an Xception (depthwise-separable convolution) model, and the 2048-dimensional vector of the model's penultimate layer is extracted as the video feature, because the features extracted at the penultimate layer are the richest.
With the method of this embodiment, extracting images by frame filters out highly repetitive images, avoids repeatedly extracting the same video feature vector, and effectively improves extraction efficiency; performing the extraction through the feature extraction layer of a deep neural network yields rich video feature vectors and captures more of the information the video provides.
As shown in fig. 3, in some embodiments, as in the foregoing method, step S4 of cross-fusing the features of the word vectors and the features of the video feature vectors includes the following steps S41 to S47:
and S41, carrying out vector dimension adjustment on the word vectors to obtain dimension-adjusted word vectors, and carrying out vector dimension adjustment on the video feature vectors to obtain dimension-adjusted video feature vectors.
Specifically, on the basis of the foregoing embodiment, the word vector of each segmented word may be 512-dimensional while the video feature vector may be 2048-dimensional; since the two dimensionalities differ, they must be unified before the word vectors and video feature vectors can be concatenated and fused. Optionally, because the video feature vector has the higher dimensionality, only it needs to be reduced: the 2048-dimensional video feature vector is generally passed through a fully-connected network to obtain a 512-dimensional dimension-adjusted video feature vector.
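A minimal sketch of that fully-connected dimensionality reduction, assuming a single 2048-dimensional video feature vector and randomly initialized (untrained) layer weights:

```python
import numpy as np

rng = np.random.default_rng(0)
video_vec = rng.standard_normal(2048)        # raw 2048-dim video feature vector
W = rng.standard_normal((2048, 512)) * 0.02  # fully-connected layer weights (untrained here)
b = np.zeros(512)                            # bias

# One linear (fully-connected) layer maps 2048 dims down to 512 dims,
# matching the dimensionality of the word vectors.
adjusted_video_vec = video_vec @ W + b
```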
S42, splicing each dimensionality-adjusting word vector and the dimensionality-adjusting video feature vector to obtain initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
Specifically, the dimension-adjusted word vectors and dimension-adjusted video feature vectors are concatenated and fused in order to form a context, so that the global relationship between each dimension-adjusted word vector and each dimension-adjusted video feature vector can be found. In addition, an encoder may be used to encode the input data to obtain the corresponding information; generally the encoder is a recurrent neural network, i.e., a network model that takes sequence data as input, recurses along the evolution direction of the sequence, and connects all its nodes (recurrent units) in a chain. The initial word vector information and the initial video vector information are the information obtained by inputting the dimension-adjusted word vectors and dimension-adjusted video feature vectors into the encoder for encoding.
The concatenation and fusion may proceed as follows: treat each dimension-adjusted video feature vector as one more word vector of the same rank as the dimension-adjusted word vectors, and connect them all into a single sequence.
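Under that reading, the splicing step is a plain sequence concatenation; a small sketch with assumed counts (5 segmented words and 8 extracted frames, all adjusted to 512 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
word_vecs = rng.standard_normal((5, 512))   # dimension-adjusted word vectors
frame_vecs = rng.standard_normal((8, 512))  # dimension-adjusted video feature vectors

# Each frame vector is treated as one more "word", so the encoder receives
# a single mixed sequence of 5 + 8 = 13 vectors.
mixed_sequence = np.concatenate([word_vecs, frame_vecs], axis=0)
```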
And S43, determining the hierarchical relationship of the mutual attention layers, wherein the hierarchical relationship represents the connection relationship between different mutual attention layers.
Specifically, each mutual attention layer performs one pass of the mutual attention mechanism, extracting the respective deep features of the word vector information and the video feature vector information and cross-fusing them.
The hierarchical relationship is used for representing the front-back connection relationship between different mutual attention layers, and the output of the mutual attention layer of the previous layer enters the next mutual attention layer to be subjected to deep feature extraction and cross fusion again.
And S44, inputting the word vector information and the video vector information into a mutual attention layer arranged on a first layer for cross fusion of features, obtaining each piece of first video vector information fused with all pieces of initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all pieces of initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information.
Specifically, take the feature fusion of one piece of initial video vector information under the mutual attention mechanism as an example: the influence weight of each initial word vector on that piece of initial video vector information is determined first. Suppose the initial video vector information comprises a1, b1 and c1, and the initial word vectors are a2, b2 and c2. If the influence weights of a2, b2 and c2 on a1 are n1, m1 and t1 respectively, the first video vector information corresponding to a1 is: a1 + a2×n1 + b2×m1 + c2×t1. Likewise, if the influence weights of a1, b1 and c1 on a2 are n2, m2 and t2 respectively, the first word vector information corresponding to a2 is: a2 + a1×n2 + b1×m2 + c1×t2. The global vector information for b1, c1 and b2, c2 is obtained in the same way.
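The worked example can be reproduced numerically; the vectors and weights below are made-up toy values (in a real model the influence weights come from learned attention scores):

```python
import numpy as np

# Initial video vector a1 and initial word vectors a2, b2, c2
# (arbitrary 2-dimensional toy values).
a1 = np.array([1.0, 0.0])
a2 = np.array([0.5, 0.5])
b2 = np.array([0.2, 0.8])
c2 = np.array([0.9, 0.1])

# Assumed influence weights of a2, b2, c2 on a1.
n1, m1, t1 = 0.6, 0.3, 0.1

# First video vector information for a1, exactly as in the text:
# a1 + a2*n1 + b2*m1 + c2*t1.
first_video_a1 = a1 + n1 * a2 + m1 * b2 + t1 * c2
```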
S45, inputting the first word vector information and the first video vector information into the next mutual attention layer according to the hierarchical relationship, and performing cross fusion of features again to obtain second word vector information and second video vector information respectively; this cycle continues until the last mutual attention layer outputs the fused word vector information and the fused video feature vector information.
Specifically, after the first word vector information and the first video vector information are obtained, the first video vector information and the first word vector information are input into the mutual attention layer of the next layer for feature cross-fusion again, where the cross-fusion method may be performed with reference to the example in step S44, and the process is repeated until the mutual attention layer of the last layer outputs the fused word vector information and the fused video feature vector information that are cross-fused by features through all the mutual attention layers.
S46, decoding the fused word vector information to obtain a fused word vector;
and S47, decoding the fused video feature vector information to obtain a fused video feature vector.
Specifically, the decoder is configured to decode the vector information to obtain a corresponding vector, and output the vector. Typically, the decoder is also a recurrent neural network.
By the method in this embodiment, the word vectors are influenced by the video feature vectors and the video feature vectors are influenced by the word vectors; the resulting fused word vector incorporates the characteristics of the video feature vector and the fused video feature vector incorporates the characteristics of the word vector, so the final fused word vector and fused video feature vector represent the characteristics of the text information and the video information more accurately.
In some embodiments of the method described above, the step S5 of obtaining the corresponding content tag according to the fused word vector and the fused video feature vector includes the following steps S51 to S53:
S51, determining candidate word vectors of all words in the word list;
S52, determining the candidate word vector closest in first distance to each output vector respectively, where the output vectors include: the fused word vectors and the fused video feature vectors;
S53, taking the word corresponding to the candidate word vector with the closest first distance as the content tag corresponding to the output vector.
Specifically, a candidate word vector corresponding to each word in a word list is determined; then determining a first distance (generally, the first distance may be a cosine distance) between each output vector and each candidate word vector in the word list, and determining a candidate word vector closest to the first distance of each output vector; and finally, taking the word corresponding to the candidate word vector closest to the first distance of each output vector as the content label corresponding to the output vector.
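A minimal sketch of this nearest-candidate lookup, assuming the first distance is the cosine distance (as the text suggests) and using a tiny made-up word list:

```python
import numpy as np

def nearest_tag(output_vec, vocab):
    """Return the word whose candidate vector has the smallest cosine
    distance (1 - cosine similarity) to the given output vector."""
    best_word, best_dist = None, float("inf")
    for word, vec in vocab.items():
        cos_sim = vec @ output_vec / (np.linalg.norm(vec) * np.linalg.norm(output_vec))
        dist = 1.0 - cos_sim
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Toy word list: "cat" points roughly along the x-axis, "dog" along y.
vocab = {"cat": np.array([1.0, 0.1]), "dog": np.array([0.1, 1.0])}
tag = nearest_tag(np.array([0.9, 0.2]), vocab)  # an output vector near "cat"
```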
By the method in the embodiment, each feature can be quickly matched to obtain the closest content label so as to obtain the label capable of accurately representing the video information.
In some embodiments, as the method described above, after obtaining the corresponding content tag according to the fused word vector and the fused video feature vector, the method further includes the following steps B1 to B5:
and B1, acquiring the total number of the content tags.
Specifically, this step determines the total number of all content tags obtained in step S5.
And B2, when the total number of the content labels is greater than a preset upper limit threshold value, acquiring a second distance between the candidate word vector corresponding to the same content label and the output vector.
Specifically, the upper threshold may be set according to the actual situation. When the total number of content tags exceeds it, some content tags need to be discarded, so that an excessive number of tags does not harm conciseness. According to the steps in the foregoing embodiment, the word corresponding to the candidate word vector closest in first distance to each output vector is taken as that output vector's content tag; therefore each content tag, its candidate word vector and its output vector are in one-to-one correspondence.
b3, determining the corresponding relation between the content label and the second distance;
Specifically, once the one-to-one correspondence among content tag, candidate word vector and output vector has been established, both the candidate word vector and the output vector are determined, so the second distance between them is also determined; the correspondence between the content tag and the second distance follows directly.
And B4, arranging the content tags in ascending order of the second distance.
And B5, deleting, according to the correspondence, the content tags whose rank exceeds the upper threshold.
Specifically, after the content tags are arranged in ascending order of the second distance, their rank order is determined; since a larger distance means a lower correlation between the two words, only the content tags ranked within the upper threshold are retained, which preserves the accuracy of the tags' semantic expression.
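Steps B1 to B5 amount to a sort-and-truncate over (tag, second distance) pairs; a small sketch with made-up tags and distances:

```python
def screen_tags(tag_distances, upper_limit):
    """Keep at most `upper_limit` content tags, preferring the smallest
    second distance (smaller distance = stronger semantic relevance)."""
    if len(tag_distances) <= upper_limit:                     # B2: nothing to discard
        return [tag for tag, _ in tag_distances]
    ranked = sorted(tag_distances, key=lambda pair: pair[1])  # B4: ascending order
    return [tag for tag, _ in ranked[:upper_limit]]           # B5: drop the rest

# Hypothetical tags with their second distances, cap of 3 tags.
kept = screen_tags([("travel", 0.40), ("food", 0.10), ("vlog", 0.25), ("pets", 0.55)], 3)
```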
In some embodiments, as in the foregoing method, the step S31 performs image extraction on the video information by frames to obtain at least two video frame images, including the following steps S311 to S314:
step S311, the total frame number of the images included in the video information is obtained.
Specifically, the number of frames per second of each piece of video information is generally fixed, and therefore, as long as the duration of the video information and the number of frames per second are obtained, the total number of frames of images included in the video information can be determined.
And S312, determining a preset upper limit threshold of the number of images.
Specifically, feature extraction on images consumes substantial computing resources, while the information displayed by images of adjacent frames, particularly transition frames, generally does not change suddenly; extracting features from every such image therefore wastes computing resources. Predetermining an upper threshold on the number of images effectively bounds the amount of computation spent on feature extraction.
And S313, determining a preset extraction strategy for extracting images of the video information according to the numerical relationship between the total frame number and the upper limit threshold of the image number.
Specifically, the numerical relationship between the total frame number and the upper threshold of the image number may be a ratio relationship or a difference relationship; different numerical relationships may correspond to different extraction strategies, for example: when the numerical relationship is a ratio relationship, and the total frame number is 10 times of the upper threshold of the image number, the extraction strategy may be: one image is taken every 10 frames.
And S314, carrying out image extraction on the video information by frames according to the extraction strategy to obtain video frame images with the number less than or equal to the number corresponding to the upper limit threshold of the number of the images.
Specifically, the matched extraction strategy generally guarantees that, when image extraction is performed on the video information, the number of video frame images finally obtained is less than or equal to the upper limit threshold of the image number; once the strategy is obtained, image extraction proceeds according to the corresponding rule, so the final number of video frame images meets the requirement. This avoids the overly long processing times and wasted computing resources that too many video frame images would cause, effectively improving processing efficiency.
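Steps S311 to S314 can be sketched as a stride computed from the ratio of the total frame count to the image-count cap; this is one possible strategy, matching the "one image every 10 frames" example above:

```python
import math

def sample_frame_indices(total_frames, max_images):
    """Choose which frames to extract: sample every k-th frame, where k is
    derived from the ratio of total frames to the image-count upper
    threshold, so the extracted count never exceeds max_images."""
    stride = max(1, math.ceil(total_frames / max_images))  # S313: pick the strategy
    return list(range(0, total_frames, stride))            # S314: extract by frames

# 100 frames with a cap of 10 images -> one frame every 10 frames.
indices = sample_frame_indices(total_frames=100, max_images=10)
```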
Application example:
The method in any one of the above embodiments was tested and compared with an existing model in the related art; the effects on a test set of 6000 items of data annotated by two persons are shown in Table 1, where "data annotated by two persons" means that, for the same piece of data, the label comprises the annotation results of both persons:
TABLE 1
Therefore, the model obtained by the method provided by the embodiment has remarkable improvements in three judgment standards of recall rate, accuracy and F value.
As shown in fig. 4, according to an embodiment of another aspect of the present invention, there is also provided a data processing apparatus including:
the acquisition module 1 is used for acquiring text information and video information for generating a content label;
the determining module 2 is used for determining a word vector of each word segmentation in the text information;
the vector acquisition module 3 is used for extracting the characteristics of the video information to obtain video characteristic vectors corresponding to the video information;
the feature fusion module 4 is used for performing cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the label determining module 5 is used for obtaining a corresponding content label according to the fused word vector and the fused video feature vector.
In some embodiments, as in the foregoing apparatus, the determining module 2 includes:
the word segmentation unit is used for carrying out word segmentation processing on the text information to obtain word segments forming the text information;
the word list unit is used for obtaining a corresponding word list according to the word segmentation and the preset label words;
and the word vector unit is used for determining the word vector of each participle according to the word vector model obtained by pre-training and the word list.
In some embodiments, as in the foregoing apparatus, the vector obtaining module 3 includes:
the extraction unit is used for carrying out image extraction on the video information according to frames to obtain at least two video frame images;
the input unit is used for respectively inputting the video frame images into the deep neural network;
and the acquisition unit is used for acquiring a video feature vector obtained by performing feature extraction on the video frame image by a feature extraction layer in the deep neural network.
In some embodiments, such as the aforementioned apparatus, the feature fusion module 4 comprises:
the dimension adjusting unit is used for carrying out vector dimension adjustment on the word vectors to obtain dimension adjusting word vectors and carrying out vector dimension adjustment on the video feature vectors to obtain dimension adjusting video feature vectors;
the splicing unit is used for splicing each dimensionality-adjusting word vector and the dimensionality-adjusting video feature vector to obtain initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
the relationship determination unit is used for determining the hierarchical relationship of the mutual attention layers, and the hierarchical relationship represents the connection relationship between different mutual attention layers;
the fusion unit is used for inputting the word vector information and the video vector information into a mutual attention layer arranged on a first layer for cross fusion of features, obtaining each piece of first video vector information fused with all pieces of initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all pieces of initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information;
the output unit is used for inputting the first word vector information and the first video vector information into the next mutual attention layer according to the hierarchical relationship to perform cross fusion of features again, obtaining second word vector information and second video vector information respectively; this cycle continues until the last mutual attention layer outputs the fused word vector information and the fused video feature vector information;
the first decoding unit is used for decoding the fused word vector information to obtain a fused word vector;
and the second decoding unit is used for decoding the fused video feature vector information to obtain a fused video feature vector.
In some embodiments, as in the foregoing apparatus, the tag determination module 5 comprises:
the first determining unit is used for determining candidate word vectors of all words in the word list;
a second determining unit, configured to determine candidate word vectors closest to the first distance of each output vector, respectively, where the output vectors include: fused word vectors and fused video feature vectors;
and the label determining unit is used for taking the word corresponding to the candidate word vector with the closest first distance as the content label corresponding to the output vector.
In some embodiments, the apparatus as in the previous paragraph, further comprising: a tag screening module; the label screening module comprises:
a total number unit for acquiring a total number of the content tags;
the second distance unit is used for acquiring a second distance between the candidate word vector corresponding to the same content label and the output vector when the total number of the content labels is greater than a preset upper limit threshold;
a third determining unit configured to determine a correspondence between the content tag and the second distance;
the arrangement unit is used for arranging the content labels according to the second distance from small to large;
and the screening unit is used for deleting the content tags of which the arrangement order is greater than the upper limit threshold value according to the corresponding relation.
In some embodiments, as in the foregoing apparatus, the extraction unit comprises:
a total frame number subunit, configured to obtain a total frame number of an image included in the video information;
the threshold subunit is used for determining a preset upper limit threshold of the number of images;
the strategy subunit is used for determining a preset extraction strategy for extracting images of the video information according to the numerical relationship between the total frame number and the upper limit threshold of the image number;
and the image determining subunit is used for performing image extraction on the video information by frames according to the extraction strategy to obtain video frame images with the number less than or equal to the number corresponding to the upper limit threshold of the number of the images.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 1501, a communication interface 1502, a memory 1503, and a communication bus 1504, where the processor 1501, the communication interface 1502, and the memory 1503 complete mutual communication through the communication bus 1504,
a memory 1503 for storing a computer program;
the processor 1501, when executing the program stored in the memory 1503, implements the following steps:
the communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to execute the data processing method for generating a content tag according to any one of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data processing method for generating a content tag as described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A data processing method, comprising:
acquiring video information for generating a content label and text information for describing the video information;
determining a word vector of each word segmentation in the text information;
extracting the features of the video information to obtain a video feature vector corresponding to the video information;
performing cross fusion on the characteristics of the word vectors and the characteristics of the video characteristic vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video characteristic vectors;
and obtaining the corresponding content label according to the fused word vector and the fused video feature vector.
2. The method of claim 1, wherein determining a word vector for each word in the textual information comprises:
performing word segmentation processing on the text information to obtain the word segments forming the text information;
obtaining a corresponding word list according to the word segmentation and a preset label word;
and determining the word vector of each word segmentation according to a word vector model obtained by pre-training and the word list.
3. The method according to claim 1, wherein the extracting the features of the video information to obtain a video feature vector corresponding to the video information comprises:
performing image extraction on the video information according to frames to obtain at least two video frame images;
respectively inputting the video frame images into a deep neural network;
and obtaining the video feature vector obtained by performing feature extraction on the video frame image by a feature extraction layer in the deep neural network.
4. The method according to claim 2, wherein the cross-fusing the features of the word vector and the features of the video feature vector through a mutual attention mechanism to obtain a fused word vector and a fused video feature vector, comprises:
carrying out vector dimension adjustment on the word vectors to obtain dimension-adjusting word vectors, and carrying out vector dimension adjustment on the video feature vectors to obtain dimension-adjusting video feature vectors;
after splicing each dimensionality-adjusting word vector and the dimensionality-adjusting video feature vector, obtaining initial word vector information of the dimensionality-adjusting word vector and initial video vector information of the dimensionality-adjusting video feature vector;
determining a hierarchical relationship of the mutual attention layers, wherein the hierarchical relationship represents a connection relationship between different mutual attention layers;
inputting the word vector information and the video vector information into the mutual attention layer arranged on a first layer for cross fusion of features, obtaining each piece of first video vector information fused with all pieces of initial word vector information according to the first word vector influence weight of each piece of initial word vector information on each piece of initial video vector information, and obtaining each piece of first word vector information fused with all pieces of initial video vector information according to the first video vector influence weight of each piece of initial video vector information on each piece of initial word vector information;
inputting the first word vector information and the first video vector information into the mutual attention layer of the next layer according to the hierarchical relationship to perform cross fusion of features again, and respectively obtaining second word vector information and second video vector information; circulating in this way until fused word vector information and fused video feature vector information are obtained from the output of the last layer of the mutual attention layer;
decoding the fused word vector information to obtain the fused word vector;
and decoding the fused video feature vector information to obtain the fused video feature vector.
5. The method according to claim 2, wherein obtaining the corresponding content tag according to the fused word vector and the fused video feature vector comprises:
determining candidate word vectors of all words in the word list;
determining the candidate word vector closest to a first distance of each output vector, respectively, the output vectors comprising: the fused word vector and the fused video feature vector;
and taking the word corresponding to the candidate word vector with the closest first distance as the content label corresponding to the output vector.
6. The method of claim 5, further comprising, after obtaining the corresponding content tag according to the fused word vector and the fused video feature vector:
acquiring the total number of the content tags;
when the total number of the content labels is larger than a preset upper limit threshold value, acquiring a second distance between the candidate word vector corresponding to the same content label and the output vector;
determining a correspondence between the content tag and a second distance;
arranging the content tags according to the second distance from small to large;
and deleting the content tags with the arrangement order larger than the upper threshold value according to the corresponding relation.
7. The method of claim 3, wherein performing image extraction on the video information on a frame-by-frame basis to obtain at least two video frame images comprises:
acquiring the total frame number of images included in the video information;
determining a preset upper limit threshold of the number of images;
determining a preset extraction strategy for extracting images from the video information according to the numerical relationship between the total frame number and the upper limit threshold of the image quantity;
and performing frame-by-frame image extraction on the video information according to the extraction strategy to obtain a number of video frame images less than or equal to the upper limit threshold of the image quantity.
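One plausible extraction strategy satisfying claim 7 is uniform sampling when the total frame count exceeds the upper limit. The claim only requires that the strategy depend on the numerical relationship between the two values, so the sampling rule below is an assumption for illustration.

```python
def frame_indices(total_frames, max_images):
    """Choose which frame indices to extract so at most `max_images` are kept.

    total_frames: total number of frames in the video information
    max_images:   preset upper limit threshold of the number of images
    """
    if total_frames <= max_images:
        # Few enough frames: keep every frame.
        return list(range(total_frames))
    # Otherwise sample uniformly across the whole video.
    step = total_frames / max_images
    return [int(i * step) for i in range(max_images)]
```

For example, a 100-frame video with an upper limit of 10 yields frames 0, 10, ..., 90, while a 5-frame video is kept whole.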
8. A data processing apparatus, comprising:
an acquisition module, configured to acquire video information used for generating a content label and text information used for describing the video information;
the determining module is used for determining a word vector of each word segmentation in the text information;
the vector acquisition module is used for extracting the characteristics of the video information to obtain a video characteristic vector corresponding to the video information;
the feature fusion module is used for performing cross fusion on the features of the word vectors and the features of the video feature vectors through a mutual attention mechanism to respectively obtain fused word vectors and fused video feature vectors;
and the label determining module is used for obtaining the corresponding content label according to the fused word vector and the fused video feature vector.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010588592.9A CN111767726B (en) | 2020-06-24 | 2020-06-24 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111767726A true CN111767726A (en) | 2020-10-13 |
CN111767726B CN111767726B (en) | 2024-02-06 |
Family
ID=72722324
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010588592.9A Active CN111767726B (en) | 2020-06-24 | 2020-06-24 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111767726B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012167568A1 (en) * | 2011-11-23 | 2012-12-13 | 华为技术有限公司 | Video advertisement broadcasting method, device and system |
US20150221126A1 (en) * | 2011-09-09 | 2015-08-06 | Hisense Co., Ltd. | Method And Apparatus For Virtual Viewpoint Synthesis In Multi-Viewpoint Video |
CN107239801A (en) * | 2017-06-28 | 2017-10-10 | 安徽大学 | Video attribute represents that learning method and video text describe automatic generation method |
WO2018124309A1 (en) * | 2016-12-30 | 2018-07-05 | Mitsubishi Electric Corporation | Method and system for multi-modal fusion model |
WO2019024704A1 (en) * | 2017-08-03 | 2019-02-07 | 阿里巴巴集团控股有限公司 | Entity annotation method, intention recognition method and corresponding devices, and computer storage medium |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN110781347A (en) * | 2019-10-23 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and readable storage medium |
CN110837579A (en) * | 2019-11-05 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Video classification method, device, computer and readable storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112580599A (en) * | 2020-12-30 | 2021-03-30 | 北京达佳互联信息技术有限公司 | Video identification method and device and computer readable storage medium |
CN113010740A (en) * | 2021-03-09 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Word weight generation method, device, equipment and medium |
CN113010740B (en) * | 2021-03-09 | 2023-05-30 | 腾讯科技(深圳)有限公司 | Word weight generation method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN111767726B (en) | 2024-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111767461B (en) | Data processing method and device | |
CN110737801B (en) | Content classification method, apparatus, computer device, and storage medium | |
CN111488931B (en) | Article quality evaluation method, article recommendation method and corresponding devices | |
CN110795657A (en) | Article pushing and model training method and device, storage medium and computer equipment | |
CN111160031A (en) | Social media named entity identification method based on affix perception | |
CN111263238B (en) | Method and equipment for generating video comments based on artificial intelligence | |
CN112464100B (en) | Information recommendation model training method, information recommendation method, device and equipment | |
CN111767726B (en) | Data processing method and device | |
CN113094549A (en) | Video classification method and device, electronic equipment and storage medium | |
CN114996511A (en) | Training method and device for cross-modal video retrieval model | |
CN114332679A (en) | Video processing method, device, equipment, storage medium and computer program product | |
CN110852071B (en) | Knowledge point detection method, device, equipment and readable storage medium | |
CN113961666A (en) | Keyword recognition method, apparatus, device, medium, and computer program product | |
CN113204956B (en) | Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device | |
CN114743029A (en) | Image text matching method | |
CN114357167A (en) | Bi-LSTM-GCN-based multi-label text classification method and system | |
CN115269781A (en) | Modal association degree prediction method, device, equipment, storage medium and program product | |
CN113408282B (en) | Method, device, equipment and storage medium for topic model training and topic prediction | |
CN114579876A (en) | False information detection method, device, equipment and medium | |
CN111767727B (en) | Data processing method and device | |
CN114443916A (en) | Supply and demand matching method and system for test data | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN112287184B (en) | Migration labeling method, device, equipment and storage medium based on neural network | |
CN116610871B (en) | Media data recommendation method, device, computer equipment and storage medium | |
CN116957097A (en) | Typesetting evaluation model training method, device, equipment, storage medium and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||