CN112883234A - Label data generation method and device, storage medium and electronic equipment

Label data generation method and device, storage medium and electronic equipment

Info

Publication number
CN112883234A
CN112883234A (application CN202110190294.9A)
Authority
CN
China
Prior art keywords
video
label
video data
feature
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110190294.9A
Other languages
Chinese (zh)
Inventor
丁仁杰
闫峰
卫海天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Minglue Zhaohui Technology Co Ltd filed Critical Beijing Minglue Zhaohui Technology Co Ltd
Priority to CN202110190294.9A priority Critical patent/CN112883234A/en
Publication of CN112883234A publication Critical patent/CN112883234A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Abstract

The invention discloses a method and device for generating label data, a storage medium and electronic equipment, and belongs to the field of artificial intelligence. The method comprises the following steps: acquiring video data, wherein the video data comprises data content of a plurality of modalities; extracting video content and description text from the video data; constructing a first feature tag of the video data based on the video content and a second feature tag of the video data based on the description text; and aggregating the first feature tag and the second feature tag to generate a video tag of the video data. The invention solves the technical problem in the related art that tags generated for video data are incomplete, and enriches the tag system of video data by combining the complementary characteristics of the two modalities of video content and text.

Description

Label data generation method and device, storage medium and electronic equipment
Technical Field
The invention relates to the field of artificial intelligence, in particular to a tag data generation method and device, a storage medium and electronic equipment.
Background
In the related art, information dissemination that uses short videos as the medium occupies an extremely important position on the internet. The quantity and content richness of videos are increasing day by day; efficient management and operation of video data, and the high-quality recommendation and search services provided to users, all depend on a complete video tag classification system. Against the background of extremely fast information updates, the hot spots and video topics that people pay attention to are constantly changing and evolving.
Manual tagging is extremely inefficient, and tagging videos with a fixed, unchanging tag system can hardly keep up with evolving requirements. The related art starts from a single video modality and roughly comprises the following steps: video acquisition, sampling (segmentation), video feature extraction, and training according to annotated categories; the videos are trained and labeled with a fixed tag system while only a single modality is used. This has at least the following drawbacks. First, the modality is too simple and the information lacks complementarity: current video media usually contain not only the video modality, but also text information such as video titles and introductions. Second, the tag system is solidified: because only the video modality is used, such a scheme depends heavily on a complete set of video annotations, so when important content outside the tag system appears in a video, the requirements are often difficult to meet. Yet video content and hot spots keep moving forward, so the influence of a solidified tag system needs to be eliminated; otherwise the generated video tags are inaccurate and incomplete, which brings unstable factors to model training and other processes based on the tag data.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
The embodiment of the invention provides a method and a device for generating label data, a storage medium and electronic equipment.
According to an aspect of an embodiment of the present application, there is provided a tag data generation method, including: acquiring video data, wherein the video data comprises data contents of a plurality of modalities; extracting video content and description text of the video data; constructing a first feature tag of the video data based on the video content and a second feature tag of the video data based on the description text; and aggregating and generating the video label of the video data according to the first characteristic label and the second characteristic label.
Further, constructing a first feature tag of the video data based on the video content comprises: sampling a frame sequence from the video content according to a preset period; extracting frame-level features of the frame sequence frame by frame by adopting a preset convolution model to obtain a corresponding feature sequence; and constructing a video content label and a video action label of the video data according to the feature sequence.
Further, constructing the video content tag of the video data according to the feature sequence comprises: for each frame-level feature x_t in the feature sequence, calculating the probability α_k of any cluster center k in a preset cluster-center set by adopting the following formula:

α_k(x_t) = softmax(w_k^T · x_t + b_k)

wherein w and b are preset parameters; for each frame-level feature x_t in the feature sequence, using the α_k of each cluster center in the preset cluster-center set as the weight, calculating the video content tag z_k of x_t in a weighted manner by the following formula:

z_k = Σ_t α_k(x_t) · (x_t − μ_k)

wherein μ_k is the cluster feature of cluster center k.
Further, constructing the video action tag of the video data according to the feature sequence comprises: calculating the hidden-layer feature h_{t-1} before step t-1 in the feature sequence, and obtaining the input feature x_{t-1} of step t-1 from the feature sequence; and based on h_{t-1} and x_{t-1}, calculating the video action tag of step t-1 by adopting a preset long short-term memory (LSTM) network model, wherein step t-1 is the last step.
Further, constructing a second feature tag of the video data based on the descriptive text comprises: extracting a first topic keyword set from the description text by adopting a preset text format as an extraction template, and extracting a second topic keyword set from the description text by adopting a preset named entity recognition NER model and a part-of-speech tagging model, wherein the preset NER model and the part-of-speech tagging model are used for extracting keywords of a target part-of-speech from an input text; and carrying out duplication elimination on the first topic keyword set and the second topic keyword set to obtain a second characteristic label of the video data.
Further, aggregating and generating the video tag of the video data according to the first feature tag and the second feature tag comprises: clustering the first feature tags according to a preset parent-level tag to obtain sub-tags of multiple categories, wherein the first feature tags comprise a plurality of content tags and a plurality of action tags; calculating the confidence coefficient of the parent label by adopting the child label, and selecting the appointed parent label with the maximum confidence coefficient; calculating the relevance between the label text of each sub-label under the appointed parent label and the keywords, and selecting M appointed keywords the relevance of which meets a preset condition, wherein the second characteristic label comprises N keywords, N is not less than M, and M and N are positive integers; and aggregating the first feature tag and the M specified keywords into a video tag of the video data.
Further, selecting M designated keywords whose relevance meets a preset condition includes: and selecting a first keyword set with the correlation larger than a first threshold value and a second keyword set with the correlation smaller than a second threshold value to obtain M appointed keywords.
According to another aspect of the embodiments of the present application, there is also provided a tag data generation apparatus, including: an acquisition module, configured to acquire video data, wherein the video data comprises data content of a plurality of modalities; an extraction module, configured to extract the video content and description text of the video data; a construction module, configured to construct a first feature tag of the video data based on the video content and a second feature tag of the video data based on the description text; and a generation module, configured to aggregate the first feature tag and the second feature tag to generate a video tag of the video data.
Further, the building module comprises: a sampling unit for sampling a sequence of frames from the video content according to a preset period; an extraction unit for extracting the frame-level features of the frame sequence frame by frame by adopting a preset convolution model to obtain a corresponding feature sequence; and a construction unit for constructing the video content label and the video action label of the video data according to the feature sequence.
Further, the construction unit includes: a first calculation subunit, configured to, for each frame-level feature x_t in the feature sequence, calculate the probability α_k of any cluster center k in a preset cluster-center set by adopting the following formula:

α_k(x_t) = softmax(w_k^T · x_t + b_k)

wherein w and b are preset parameters; and a second calculation subunit, configured to, for each frame-level feature x_t in the feature sequence, use the α_k of each cluster center in the preset cluster-center set as the weight and calculate the video content tag z_k of x_t in a weighted manner by the following formula:

z_k = Σ_t α_k(x_t) · (x_t − μ_k)

wherein μ_k is the cluster feature of cluster center k.
Further, the construction unit includes: a third calculation subunit, configured to calculate the hidden-layer feature h_{t-1} before step t-1 in the feature sequence and obtain the input feature x_{t-1} of step t-1 from the feature sequence; and a fourth calculation subunit, configured to, based on h_{t-1} and x_{t-1}, calculate the video action tag of step t-1 by adopting a preset long short-term memory (LSTM) network model, wherein step t-1 is the last step.
Further, the building module comprises: the processing unit is used for extracting a first topic keyword set from the description text by adopting a preset text format as an extraction template, and extracting a second topic keyword set from the description text by adopting a preset named entity recognition NER model and a part-of-speech tagging model, wherein the preset NER model and the part-of-speech tagging model are used for extracting keywords of a target part-of-speech from an input text; and the duplication removing unit is used for carrying out duplication removal on the first topic keyword set and the second topic keyword set to obtain a second feature tag of the video data.
Further, the generating module includes: the classification unit is used for clustering the first feature tags according to preset parent tags to obtain sub-tags of multiple categories, wherein the first feature tags comprise a plurality of content tags and a plurality of action tags; the computing unit is used for computing the confidence coefficient of the parent label by adopting the child label and selecting the appointed parent label with the maximum confidence coefficient; the selection unit is used for calculating the correlation between the label text of each sub-label under the appointed parent label and the keywords, and selecting M appointed keywords of which the correlation meets a preset condition, wherein the second characteristic label comprises N keywords, N is more than or equal to M, and M and N are positive integers; and the aggregation unit is used for aggregating the first feature tag and the M specified keywords into the video tag of the video data.
Further, the selection unit includes: and the selecting subunit is used for selecting a first keyword set with the correlation larger than a first threshold value and selecting a second keyword set with the correlation smaller than a second threshold value to obtain M designated keywords.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program that executes the above steps when the program is executed.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; wherein: a memory for storing a computer program; a processor for executing the steps of the method by running the program stored in the memory.
Embodiments of the present application also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the steps of the above method.
According to the invention, the video data is obtained, the video content and the description text of the video data are extracted, the first feature tag of the video data is constructed based on the video content, the second feature tag of the video data is constructed based on the description text, and finally the video tag of the video data is generated by aggregation according to the first feature tag and the second feature tag.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a server according to an embodiment of the present invention;
fig. 2 is a flowchart of a tag data generation method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of an embodiment of the present invention;
fig. 4 is a block diagram of a tag data generation apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device implementing an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
The method provided by the first embodiment of the present application can be executed on a server, a computer, a mobile phone, a video device, or a similar computing device. Taking running on a server as an example, fig. 1 is a block diagram of the hardware structure of a server according to an embodiment of the present invention. As shown in fig. 1, the server 10 may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the server. For example, the server 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a server program, for example, a software program and a module of application software, such as a server program corresponding to a tag data generation method in an embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the server program stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 104 may further include memory located remotely from processor 102, which may be connected to server 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the server 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a method for generating tag data is provided, and fig. 2 is a flowchart of a method for generating tag data according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, video data is obtained, wherein the video data comprises data contents of a plurality of modals;
in this embodiment, video data, such as short video data, motion picture data, etc., usually includes information of multiple modalities, for example: video, cover pictures, text, audio and the like, and information of different modes can play a complementary role.
Step S204, extracting video content and description text of the video data;
in this embodiment, the descriptive text may provide the video modality with information supplementary effective beyond the existing tag system, such as new product name, new network popular language, and the like. Alternatively, the description text may be video data other than video content such as audio.
Step S206, constructing a first characteristic label of the video data based on the video content and constructing a second characteristic label of the video data based on the description text;
and step S208, aggregating and generating a video label of the video data according to the first characteristic label and the second characteristic label.
By extracting effective information from different modalities and aggregating, accurate and comprehensive labels are automatically and efficiently constructed for videos.
Through the above steps, the video data is obtained, the video content and the description text of the video data are extracted, a first feature tag of the video data is constructed based on the video content, a second feature tag is constructed based on the description text, and finally the video tag of the video data is generated by aggregating the first feature tag and the second feature tag. By constructing feature tags of the video data from multiple modalities, the tags generated for the video data are more accurate and complete, which solves the technical problem in the related art that tags generated for video data are incomplete and enriches the tag system of video data by combining the complementary characteristics of the two modalities of video content and text.
In an implementation manner of this embodiment, constructing the first feature tag of the video data based on the video content includes:
s11, sampling a frame sequence from the video content according to a preset period;
since a video is composed of consecutive picture Frames, a video usually includes a large number of Frames, especially when the video length is large and FPS (Frames Per Second transmission) is high. The time and hardware cost of training and classifying with all frames of video is significant. Therefore, in the identification of video data, it is necessary to sample frames within a video. In one example, a frame sequence of seconds long is taken for each video sample, with a strategy of 1 frame per second.
Optionally, the frame sequence may be further preprocessed, with operations such as resolution resetting, noise reduction, noise addition, and normalization performed on the frames so that they meet the input requirements of the feature extraction model. The preprocessed sequence of frames 1 to n is: {f_1, f_2, …, f_n}.
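As a minimal illustration of the sampling and preprocessing step described above, the following sketch samples roughly one frame per second with OpenCV and normalizes each frame; the 224x224 target resolution and the ImageNet normalization constants are illustrative assumptions rather than values fixed by this disclosure.

```python
import cv2
import numpy as np

def sample_and_preprocess(video_path, fps_sample=1, size=(224, 224)):
    """Sample roughly one frame per second and preprocess it for the feature extractor."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / fps_sample)), 1)  # frames to skip between samples

    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed ImageNet stats
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame = cv2.resize(frame, size)                    # resolution resetting
            frame = frame.astype(np.float32) / 255.0
            frame = (frame - mean) / std                       # normalization
            frames.append(frame)
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0, *size, 3), np.float32)

# {f_1, ..., f_n}: one preprocessed frame per second of video
# frames = sample_and_preprocess("example.mp4")
```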
S12, extracting frame-level features of the frame sequence frame by adopting a preset convolution model to obtain a corresponding feature sequence;
A model fully pre-trained on ImageNet (for example, ResNet-152) is used to extract the depth features of each frame, and the model is defined as the following mapping:
x_n = d_conv(f_n)

where f_n is the n-th picture frame of the video, d_conv denotes the convolution calculation, and x_n is the result of passing f_n through the convolution model, called the feature of the picture frame after convolution mapping.

After the frame sequence is mapped, the feature sequence of frame-level features is obtained:

{f_1, f_2, …, f_n} → {x_1, x_2, …, x_n}

That is, a video has a sequence of n picture frames f, and after mapping by the convolution model a sequence of n features x is obtained.
And S13, constructing a video content label and a video action label of the video data according to the characteristic sequence.
In one aspect of this embodiment, constructing the video content tag of the video data from the feature sequence comprises:

For each frame-level feature x_t in the feature sequence, the probability α_k that it belongs to any cluster center k in a preset cluster-center set is calculated by the following formula:

α_k(x_t) = softmax(w_k^T · x_t + b_k)

wherein w and b are preset parameters, and each cluster center has its own pair of parameters, namely w_k and b_k.

For each frame-level feature x_t in the feature sequence, the α_k of each cluster center in the preset cluster-center set is used as the weight, and the video content tag component z_k of x_t is calculated in a weighted manner by the following formula:

z_k = Σ_t α_k(x_t) · (x_t − μ_k)

wherein μ_k is the cluster feature of cluster center k.
Although the duration of short video data is limited, its content richness within that short time is high. This embodiment uses NetVLAD (a trainable Vector of Locally Aggregated Descriptors layer) to cluster and aggregate the frame-level feature sequence, setting K cluster centers for all frame-level features in the feature sequence, where K is a hyper-parameter.
The feature x_t is mapped to K dimensions and activated with a Softmax function to obtain the probability that each frame-level feature belongs to each cluster center, where T denotes the matrix transpose. The calculation process is as follows:

α_k(x_t) = exp(w_k^T · x_t + b_k) / Σ_{k'} exp(w_{k'}^T · x_t + b_{k'})

Then the difference vector between each frame-level feature and each cluster center is calculated, where the cluster feature of cluster center k is μ_k and the difference vector is x_t − μ_k, and the difference vectors are summed using the above probabilities as weights:

z_k = Σ_t α_k(x_t) · (x_t − μ_k)
thus, finally the total feature set for each video is as follows:
{z_1, z_2, …, z_K};
and performing classifier processing by using the set as a feature to obtain a classification category of the video content, thereby obtaining the text label.
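The two formulas above can be combined into a simple NetVLAD-style aggregation routine; the following NumPy sketch assumes learned parameters w_k, b_k and cluster features μ_k are given, and omits the normalization steps of a full NetVLAD layer.

```python
import numpy as np

def netvlad_aggregate(x, w, b, mu):
    """x:  (n, d)  frame-level features {x_1..x_n}
    w:  (K, d)  per-cluster projection weights w_k
    b:  (K,)    per-cluster biases b_k
    mu: (K, d)  cluster features mu_k
    Returns z of shape (K, d), the video-level descriptor {z_1..z_K}."""
    logits = x @ w.T + b                              # (n, K): w_k^T x_t + b_k
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)         # softmax over the K clusters
    # z_k = sum_t alpha_k(x_t) * (x_t - mu_k)
    diff = x[:, None, :] - mu[None, :, :]             # (n, K, d) difference vectors
    z = (alpha[:, :, None] * diff).sum(axis=0)        # (K, d)
    return z

# Example with n=8 frames, d=2048 features, K=4 clusters (all sizes assumed):
# rng = np.random.default_rng(0)
# z = netvlad_aggregate(rng.normal(size=(8, 2048)),
#                       rng.normal(size=(4, 2048)), np.zeros(4),
#                       rng.normal(size=(4, 2048)))
```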
In another aspect of this embodiment, constructing the video action tag of the video data from the feature sequence comprises: calculating the hidden-layer feature h_{t-1} before step t-1 in the feature sequence, and obtaining the input feature x_{t-1} of step t-1 from the feature sequence; and based on h_{t-1} and x_{t-1}, calculating the video action tag of step t-1 by adopting a preset long short-term memory (LSTM) network model, where step t-1 is the last step of the video action and t is a positive integer.
In this embodiment, the video action has t steps in total (from 0 to t-1), the last step being step t-1. x_{t-1} is the input of step t-1 of the preset LSTM model, and h_{t-1} is the hidden-layer feature state before step t-1. The output of the last time step t-1 is obtained from these and used as the feature representing all t steps of the video data. Because the feature of the last step contains the feature information of all preceding time steps, classification is performed with the feature of the last step to obtain the video action tag of the video data.
In short videos, the action types are limited because of the limited video duration, and there is strong continuity between frames. Therefore, in action classification, this embodiment uses an LSTM to aggregate the frame-level features, as shown below; each step of the LSTM is based on the state of the previous step, and the output of the last step is taken as the aggregated feature for classification training, outputting the label text of the video action tag:
h_t, y_{t-1} = f_lstm(h_{t-1}, x_{t-1})
the video motion classifier is a deep learning model for classification, and uses a cross entropy loss function and gradient descent training, and the output of the model is a classification result of video motion.
In an implementation manner of this embodiment, constructing the second feature tag of the video data based on the description text includes: extracting a first topic keyword set from a description text by adopting a preset text format as an extraction template, and extracting a second topic keyword set from the description text by adopting a preset Named Entity Recognition (NER) model and a part of speech tagging model, wherein the preset NER model and the part of speech tagging model are used for extracting keywords of a target part of speech from an input text; and carrying out duplication removal on the first topic keyword set and the second topic keyword set to obtain a second characteristic label of the video data. In the NER model and part-of-speech tagging model, NER is used to extract explicit entities such as place names, person names, trade names, etc., and part-of-speech tagging is used for nouns, verbs, etc., to perform combined new word discovery.
The short video title text is processed to filter out emoticons, special symbols, garbled characters and other such information.
On the one hand, short video titles often contain specific topics (such as microblog topics and similar platform topics) marked with the # symbol, for example "#makeup evaluation"; the format differs slightly across videos, and these words are key topics of the videos.
On the other hand, organization names and person names are extracted from the text using a pre-trained BERT-based Chinese NER model; these are often key topics of the video. New words are then discovered based on a part-of-speech tagging scheme: words tagged with the parts of speech NR and NN are extracted. Since short video titles are short, the nouns appearing in them are considered to play an important role. If two consecutive words form the part-of-speech combination NR (proper noun) + NN (other noun) or NN + NN, they are combined into a new phrase. There are three noun tags: NR, NT (temporal noun) and NN. A proper noun is a subset of nouns: it may be the specific name of a person, a politically or geographically defined place (city, country, river, mountain, etc.), or an organization (a business, government, or other organizational entity), and a proper noun is typically unique. Other nouns (NN) include all remaining nouns.
The topic keyword sets obtained by the two schemes are likely to have repeated keywords, or a certain keyword is a subset of another keyword, and in order to prevent result redundancy, deduplication needs to be performed.
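The two keyword-extraction routes and the deduplication step can be sketched as follows; the regular expression for #-marked topics and the substring-based deduplication rule are assumptions for illustration, and the NER/part-of-speech extractor is shown only as a placeholder hook since the disclosure does not fix a specific model.

```python
import re

def extract_hashtag_topics(title):
    """Route 1: use the '#...#' topic format as the extraction template."""
    cleaned = re.sub(r"[\U0001F300-\U0001FAFF]", "", title)   # drop emoji-like symbols
    paired = re.findall(r"#([^#]+)#", cleaned)                # microblog-style '#topic#'
    return {t.strip() for t in paired if t.strip()}

def extract_entity_topics(title, ner_and_pos_extractor):
    """Route 2: keywords from a pre-trained NER model plus NR/NN part-of-speech
    combinations; the extractor callable is a placeholder for such a model."""
    return set(ner_and_pos_extractor(title))

def deduplicate(set_a, set_b):
    """Merge both keyword sets and drop any keyword contained in a longer one."""
    merged = sorted(set_a | set_b, key=len, reverse=True)
    kept = []
    for kw in merged:
        if not any(kw in longer for longer in kept):
            kept.append(kw)
    return kept

# Example with a hypothetical title and a stub extractor:
# topics = extract_hashtag_topics("#makeup evaluation# new lipstick try-on")
# keywords = deduplicate(topics, {"lipstick", "makeup evaluation"})
```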
In an implementation manner of this embodiment, the generating a video tag of the video data according to the aggregation of the first feature tag and the second feature tag includes: clustering the first feature tags according to a preset parent-level tag to obtain sub-tags of multiple categories, wherein the first feature tags comprise a plurality of content tags and a plurality of action tags; calculating the confidence coefficient of the parent label by adopting the child label, and selecting the appointed parent label with the maximum confidence coefficient; calculating the relevance between the label text of each sub-label under the appointed parent label and the key words, and selecting M appointed key words of which the relevance meets a preset condition, wherein the second characteristic label comprises N key words, N is not less than M, and M and N are positive integers; and aggregating the first characteristic label and the M specified keywords into a video label of the video data.
Optionally, selecting M designated keywords whose correlations meet the preset conditions includes: and selecting a first keyword set with the correlation larger than a first threshold value and a second keyword set with the correlation smaller than a second threshold value to obtain M appointed keywords.
A parent-tag system is established for the content and action tags of the video. The parent tags summarize the broad themes of the tags, such as food, sports, make-up, travel, and the like. The video content tags and action tags are subject-corrected through the parent tags: the predicted tags are grouped according to their parent tags, the confidence sums of the child tags under different parent tags are counted and sorted, the top-N parent tags with the highest confidence are selected as the theme of the video clip, and the child tags under the remaining parent tags are excluded.
The video content and the video actions may correspond to the same parent tag. In one example, a video may be about food, so its content tags from the model may be hamburger, cola, steak, etc., and its action tags may be dining, preparing vegetables, stirring, etc. The parent tag of all of these is Food & Drink. The parent tags reflect the rough subject of the video, while the content and action tags reflect its specific content.
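A sketch of this parent-tag subject correction is shown below; the parent-tag mapping, the confidence values, and the top-N value are assumed for illustration, while the grouping, summing, sorting, and filtering steps follow the description above.

```python
from collections import defaultdict

def correct_by_parent(tags, parent_of, top_n=1):
    """tags: list of (child_tag, confidence) from the content/action classifiers.
    parent_of: dict mapping each child tag to its parent tag.
    Keeps only child tags whose parent is among the top_n parents by summed confidence."""
    parent_conf = defaultdict(float)
    for tag, conf in tags:
        parent_conf[parent_of[tag]] += conf          # sum child confidences per parent
    top_parents = {p for p, _ in sorted(parent_conf.items(),
                                        key=lambda kv: kv[1], reverse=True)[:top_n]}
    return [(tag, conf) for tag, conf in tags if parent_of[tag] in top_parents]

# Example (values assumed): the food-related tags share the parent "Food & Drink",
# so the unrelated "sprint" tag is filtered out.
# parent_of = {"hamburger": "Food & Drink", "cola": "Food & Drink",
#              "stirring": "Food & Drink", "sprint": "Sports"}
# kept = correct_by_parent([("hamburger", 0.8), ("cola", 0.6),
#                           ("stirring", 0.7), ("sprint", 0.2)], parent_of)
```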
For the keywords extracted from the text, the correlation between each keyword and the classification label text obtained from the video can be calculated: the words are mapped by a Word2Vec embedding model fully pre-trained on a large amount of text, and the cosine similarity is then calculated.
sim(a, b) = (a · b) / (|a| · |b|)
Here a and b are the text vectors of the first feature tag and the second feature tag, respectively. If the correlation is extremely high, the keyword is very close to the theme and can be retained, although attention must be paid to duplication. If the correlation is extremely low, the keyword may be a newly emerging hot word, name, or the like; by setting thresholds, such keywords are retained and expanded into the tag system.
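The cosine-similarity screening with the two thresholds can be sketched as follows; the Word2Vec lookup is represented by a plain embedding dictionary, and the threshold values are assumptions for illustration only.

```python
import numpy as np

def cosine(a, b):
    """sim(a, b) = (a . b) / (|a| |b|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_keywords(keywords, tag_texts, embed, high=0.8, low=0.2):
    """Keep keywords highly related to the video's tag texts (close to the theme)
    and keywords with very low relation (possible new hot words / names).
    embed: dict-like mapping a word to its pre-trained Word2Vec vector."""
    selected = []
    for kw in keywords:
        if kw not in embed:
            continue
        sims = [cosine(embed[kw], embed[t]) for t in tag_texts if t in embed]
        if not sims:
            continue
        best = max(sims)
        if best > high or best < low:        # first / second threshold
            selected.append(kw)
    return selected

# Example (embedding table and thresholds assumed):
# embed = load_word2vec_vectors()   # hypothetical helper returning {word: np.ndarray}
# picked = select_keywords(["steak", "new-brand-x"], ["hamburger", "cola"], embed)
```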
Fig. 3 is a flowchart of a scheme according to an embodiment of the present invention. Based on the two modalities of video content and description text, a multi-modal video classification and labeling technique is constructed for short video data from the aspects of video content classification, video action recognition, and text content extraction, so that the tags generated for a short video are more accurate and complete and are not limited to a fixed tag system. For the video modality, the short video content and the human actions in the short video are identified separately, a video content tag and a video action tag are constructed, and the video is labeled as comprehensively as possible. For the text modality, key entity extraction and new word discovery are performed on the text, and the acquired content is supplemented as tags to obtain the final tags.
With the scheme of this embodiment, the tags obtained from the two modalities are effectively aggregated and useless content is filtered out, enriching the tag system. By fully combining the complementary characteristics of the video and text modalities and constructing separate classification and extraction pipelines, rich and comprehensive tags are provided for videos, overcoming the shortcomings of a fixed tag system; emerging keywords are extracted by exploiting the flexibility of text, so that while the videos are labeled comprehensively, the tag system is further updated and enriched.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a tag data generating device is further provided, which is used to implement the foregoing embodiments and preferred embodiments, and the description that has been already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of a tag data generation apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus includes: an acquisition module 40, an extraction module 42, a construction module 44, a generation module 46, wherein,
an obtaining module 40, configured to obtain video data, where the video data includes data content of multiple modalities;
an extracting module 42, configured to extract video content and description text of the video data;
a construction module 44, configured to construct a first feature tag of the video data based on the video content, and construct a second feature tag of the video data based on the description text;
a generating module 46, configured to generate a video tag of the video data according to the first feature tag and the second feature tag in an aggregation manner.
Optionally, the building module includes: a sampling unit for sampling a sequence of frames from the video content according to a preset period; an extraction unit for extracting the frame-level features of the frame sequence frame by frame by adopting a preset convolution model to obtain a corresponding feature sequence; and a construction unit for constructing the video content label and the video action label of the video data according to the feature sequence.
Optionally, the building unit includes: a first calculation subunit, configured to, for each frame-level feature x_t in the feature sequence, calculate the probability α_k of any cluster center k in a preset cluster-center set by adopting the following formula:

α_k(x_t) = softmax(w_k^T · x_t + b_k)

wherein w and b are preset parameters; and a second calculation subunit, configured to, for each frame-level feature x_t in the feature sequence, use the α_k of each cluster center in the preset cluster-center set as the weight and calculate the video content tag z_k of x_t in a weighted manner by the following formula:

z_k = Σ_t α_k(x_t) · (x_t − μ_k)

wherein μ_k is the cluster feature of cluster center k.
Optionally, the building unit includes: a third computing subunit, configured to compute the hidden-layer feature h_{t-1} before step t-1 in the feature sequence and obtain the input feature x_{t-1} of step t-1 from the feature sequence; and a fourth calculation subunit, configured to, based on h_{t-1} and x_{t-1}, calculate the video action tag of step t-1 by adopting a preset long short-term memory (LSTM) network model, wherein step t-1 is the last step.
Optionally, the building module includes: the processing unit is used for extracting a first topic keyword set from the description text by adopting a preset text format as an extraction template, and extracting a second topic keyword set from the description text by adopting a preset named entity recognition NER model and a part-of-speech tagging model, wherein the preset NER model and the part-of-speech tagging model are used for extracting keywords of a target part-of-speech from an input text; and the duplication removing unit is used for carrying out duplication removal on the first topic keyword set and the second topic keyword set to obtain a second feature tag of the video data.
Optionally, the generating module includes: the classification unit is used for clustering the first feature tags according to preset parent tags to obtain sub-tags of multiple categories, wherein the first feature tags comprise a plurality of content tags and a plurality of action tags; the computing unit is used for computing the confidence coefficient of the parent label by adopting the child label and selecting the appointed parent label with the maximum confidence coefficient; the selection unit is used for calculating the correlation between the label text of each sub-label under the appointed parent label and the keywords, and selecting M appointed keywords of which the correlation meets a preset condition, wherein the second characteristic label comprises N keywords, N is more than or equal to M, and M and N are positive integers; and the aggregation unit is used for aggregating the first feature tag and the M specified keywords into the video tag of the video data.
Optionally, the selecting unit includes: and the selecting subunit is used for selecting a first keyword set with the correlation larger than a first threshold value and selecting a second keyword set with the correlation smaller than a second threshold value to obtain M designated keywords.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring video data, wherein the video data comprises data contents of a plurality of modalities;
s2, extracting the video content and the description text of the video data;
s3, constructing a first characteristic label of the video data based on the video content and constructing a second characteristic label of the video data based on the description text;
and S4, aggregating and generating the video label of the video data according to the first characteristic label and the second characteristic label.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring video data, wherein the video data comprises data contents of a plurality of modalities;
s2, extracting the video content and the description text of the video data;
s3, constructing a first characteristic label of the video data based on the video content and constructing a second characteristic label of the video data based on the description text;
and S4, aggregating and generating the video label of the video data according to the first characteristic label and the second characteristic label.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 5, including a processor 51, a communication interface 52, a memory 53 and a communication bus 54, where the processor 51, the communication interface 52, and the memory 53 complete communication with each other through the communication bus 54, and the memory 53 is used for storing computer programs; and a processor 51 for executing the program stored in the memory 53.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A method for generating tag data, comprising:
acquiring video data, wherein the video data comprises data contents of a plurality of modalities;
extracting video content and description text of the video data;
constructing a first feature tag of the video data based on the video content and a second feature tag of the video data based on the description text;
and aggregating and generating the video label of the video data according to the first characteristic label and the second characteristic label.
2. The method of claim 1, wherein constructing the first feature tag for the video data based on the video content comprises:
sampling a frame sequence from the video content according to a preset period;
extracting frame-level features of the frame sequence frame by frame by adopting a preset convolution model to obtain a corresponding feature sequence;
and constructing a video content label and a video action label of the video data according to the characteristic sequence.
3. The method of claim 2, wherein constructing the video content tag for the video data from the feature sequence comprises:
for each frame-level feature x_t in the feature sequence, calculating the probability α_k of any cluster center k in a preset cluster-center set by adopting the following formula:
α_k(x_t) = softmax(w_k^T · x_t + b_k)
wherein w and b are preset parameters;
for each frame-level feature x_t in the feature sequence, using the α_k of each cluster center in the preset cluster-center set as the weight, calculating the video content tag z_k of x_t in a weighted manner by the following formula:
z_k = Σ_t α_k(x_t) · (x_t − μ_k)
wherein μ_k is the cluster feature of cluster center k.
4. The method of claim 2, wherein constructing the video action tag for the video data from the sequence of features comprises:
calculating the hidden-layer feature h_{t-1} before step t-1 in the feature sequence, and obtaining the input feature x_{t-1} of step t-1 from the feature sequence;
based on h_{t-1} and x_{t-1}, calculating the video action tag of step t-1 by adopting a preset long short-term memory (LSTM) network model, wherein step t-1 is the last step.
5. The method of claim 1, wherein constructing the second feature label for the video data based on the descriptive text comprises:
extracting a first topic keyword set from the description text by adopting a preset text format as an extraction template, and extracting a second topic keyword set from the description text by adopting a preset named entity recognition NER model and a part-of-speech tagging model, wherein the preset NER model and the part-of-speech tagging model are used for extracting keywords of a target part-of-speech from an input text;
and carrying out duplication elimination on the first topic keyword set and the second topic keyword set to obtain a second characteristic label of the video data.
6. The method of claim 1, wherein aggregating the video tags of the video data according to the first feature tag and the second feature tag comprises:
clustering the first feature tags according to a preset parent-level tag to obtain sub-tags of multiple categories, wherein the first feature tags comprise a plurality of content tags and a plurality of action tags;
calculating the confidence coefficient of the parent label by adopting the child label, and selecting the appointed parent label with the maximum confidence coefficient;
calculating the relevance between the label text of each sub-label under the appointed parent label and the keywords, and selecting M appointed keywords the relevance of which meets a preset condition, wherein the second characteristic label comprises N keywords, N is not less than M, and M and N are positive integers;
and aggregating the first feature tag and the M specified keywords into a video tag of the video data.
7. The method of claim 6, wherein selecting M designated keywords having a relevance meeting a preset condition comprises:
and selecting a first keyword set with the correlation larger than a first threshold value and a second keyword set with the correlation smaller than a second threshold value to obtain M appointed keywords.
8. An apparatus for generating tag data, comprising:
an acquisition module, configured to acquire video data, wherein the video data comprises data content of a plurality of modalities;
the extraction module is used for extracting the video content and the description text of the video data;
the construction module is used for constructing a first characteristic label of the video data based on the video content and constructing a second characteristic label of the video data based on the description text;
and the generating module is used for aggregating and generating the video label of the video data according to the first characteristic label and the second characteristic label.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program is operative to perform the method steps of any of the preceding claims 1 to 7.
10. An electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; wherein:
a memory for storing a computer program;
a processor for performing the method steps of any of claims 1 to 7 by executing a program stored on a memory.
CN202110190294.9A 2021-02-18 2021-02-18 Label data generation method and device, storage medium and electronic equipment Pending CN112883234A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110190294.9A CN112883234A (en) 2021-02-18 2021-02-18 Label data generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110190294.9A CN112883234A (en) 2021-02-18 2021-02-18 Label data generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112883234A true CN112883234A (en) 2021-06-01

Family

ID=76057593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110190294.9A Pending CN112883234A (en) 2021-02-18 2021-02-18 Label data generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112883234A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590804A (en) * 2021-06-23 2021-11-02 北京百度网讯科技有限公司 Video theme generation method and device and electronic equipment
CN113590804B (en) * 2021-06-23 2023-08-04 北京百度网讯科技有限公司 Video theme generation method and device and electronic equipment
CN113918713A (en) * 2021-09-22 2022-01-11 南京复保科技有限公司 Data annotation method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108009228B (en) Method and device for setting content label and storage medium
CN108009293B (en) Video tag generation method and device, computer equipment and storage medium
CN106328147B (en) Speech recognition method and device
CN110442841B (en) Resume identification method and device, computer equipment and storage medium
CN110598011A (en) Data processing method, data processing device, computer equipment and readable storage medium
CN110019794B (en) Text resource classification method and device, storage medium and electronic device
CN110083729B (en) Image searching method and system
CN110909549B (en) Method, device and storage medium for punctuating ancient Chinese
CN113283238B (en) Text data processing method and device, electronic equipment and storage medium
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN112883234A (en) Label data generation method and device, storage medium and electronic equipment
CN110019776B (en) Article classification method and device and storage medium
CN113392641A (en) Text processing method, device, storage medium and equipment
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN112015928A (en) Information extraction method and device of multimedia resource, electronic equipment and storage medium
CN113987274A (en) Video semantic representation method and device, electronic equipment and storage medium
CN112328833A (en) Label processing method and device and computer readable storage medium
CN108536676A (en) Data processing method, device, electronic equipment and storage medium
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN113239668B (en) Keyword intelligent extraction method and device, computer equipment and storage medium
CN113407775B (en) Video searching method and device and electronic equipment
CN111797247B (en) Case pushing method and device based on artificial intelligence, electronic equipment and medium
CN116051192A (en) Method and device for processing data
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
US20220284501A1 (en) Probabilistic determination of compatible content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination