CN111291688A - Video tag obtaining method and device - Google Patents

Video tag obtaining method and device

Info

Publication number
CN111291688A
CN111291688A (application CN202010088404.6A; granted publication CN111291688B)
Authority
CN
China
Prior art keywords
tag
tags
video
label
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010088404.6A
Other languages
Chinese (zh)
Other versions
CN111291688B (en)
Inventor
徐鸣谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202010088404.6A priority Critical patent/CN111291688B/en
Publication of CN111291688A publication Critical patent/CN111291688A/en
Application granted granted Critical
Publication of CN111291688B publication Critical patent/CN111291688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47 - Detecting features for summarising video content
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to the technical field of computers, and discloses a method and a device for acquiring a video tag. The video tag obtaining method comprises the following steps: in response to an input video image, acquiring feature information corresponding to the video image; respectively inputting the feature information into a plurality of different classification models to obtain a plurality of tag sets output by the classification models, each tag set comprising at least one tag; and selecting the video tags of the video image from the plurality of tag sets according to the probability values of the tags in each tag set. According to the invention, a plurality of different classification models are fused to obtain the video tags of a video image, so that classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatic video tagging is improved.

Description

Video tag obtaining method and device
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a method and a device for acquiring a video tag.
Background
With the rapid development of the internet, users can acquire various types of videos, such as movies, television shows, and the like, through video applications or websites of various terminals. For videos, the video tags can well show the types, characteristics and the like of the videos, so that a user can select favorite video types according to the video tags, and the use experience of the user is greatly improved.
At present, videos are generally labeled automatically by means of artificial intelligence, and two automatic labeling approaches are common: one is based on automatic recognition of text related to the video; the other is based on automatic recognition of images in the video. Because a video plays for a long time, it involves too much image information, the processing is slow and the labor cost is high, so the approach based on automatic recognition of video-related text is generally chosen to label videos.
The inventor finds that the prior art has at least the following problem: when videos are labeled based on automatic recognition of video-related text, the labels are obtained from a single model, so the labeling effect differs greatly across different types of videos, which leads to inaccurate video tags and degrades the user experience.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for acquiring video tags, in which the video tags of a video image are obtained by fusing a plurality of different classification models, so that classification models of multiple dimension types can be selected to adapt to different types of video images; the acquired video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
In order to solve the above technical problem, an embodiment of the present invention provides a method for acquiring a video tag, including: responding to an input video image, and acquiring characteristic information corresponding to the video image; respectively inputting the characteristic information into a plurality of different classification models to obtain a plurality of label sets output by the classification models, wherein each label set comprises at least one label; and selecting the video labels of the video images from the plurality of label sets according to the probability values of the labels in the label sets.
The embodiment of the invention also provides a device for acquiring a video tag, which comprises at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the video tag acquisition method.
Compared with the prior art, in the embodiments of the invention, in response to an input video image, the corresponding feature information is first obtained and input into a plurality of different classification models to obtain a plurality of tag sets output by those models, each tag set comprising at least one tag; the video tags of the video image are then determined from the plurality of tag sets according to the probability values of the tags in each tag set. Because the video tags are obtained by fusing a plurality of different classification models, classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
In addition, the method for obtaining a plurality of label sets output by a plurality of classification models by respectively inputting the characteristic information into a plurality of different classification models comprises the following steps: respectively inputting the characteristic information into each classification model to obtain the probability value of each label generated by each classification model; and for each classification model, selecting the labels meeting the first preset condition according to the generated probability value of each label, and adding the labels meeting the first preset condition into the label set of the classification model. The embodiment provides a specific implementation mode for respectively inputting the characteristic information into a plurality of different classification models to obtain a plurality of label sets output by the plurality of classification models.
In addition, selecting the labels meeting a first preset condition according to the generated probability value of each label includes: forming a first queue, ordered by probability value from large to small, of the labels whose probability values under the classification model are greater than or equal to a second preset threshold; and traversing the tags in the first queue in sequence until the probability value of the current tag minus the probability value of the next tag is greater than a first preset threshold, taking the tags traversed so far as the tags meeting the first preset condition. This embodiment provides a specific implementation of selecting the tags meeting the first preset condition according to the generated probability values, so that tags closely related to the video image can be selected to form a tag set.
In addition, selecting the video tags of the video image from the plurality of tag sets according to the probability values of the tags in each tag set includes: for each tag in each tag set, calculating an evaluation value of the tag according to its probability values in the tag sets; and selecting, according to the evaluation values of the tags in the plurality of tag sets, the tags meeting a second preset condition, and taking them as the video tags. This embodiment provides a specific implementation of selecting the video tags of the video image from the plurality of tag sets according to the probability values of the tags: the evaluation value of each tag is calculated based on the number of tag sets that contain it, and the more tag sets a tag appears in, the more votes it receives; the video tags of the video image are thus selected by voting, which makes the acquired video tags more accurate.
In addition, selecting the tags satisfying a second preset condition from the plurality of tag sets according to the evaluation values of the tags includes: forming a second queue of the tags in the plurality of tag sets, ordered by evaluation value from large to small; and traversing the tags in the second queue in sequence until the evaluation value of the current tag minus the evaluation value of the next tag is greater than a third preset threshold, taking the tags traversed so far as the tags meeting the second preset condition. This embodiment provides a specific implementation of selecting the tags satisfying the second preset condition from the plurality of tag sets according to their evaluation values.
In addition, forming a second queue by the tags in the plurality of tag sets according to the evaluation values from large to small, specifically: and forming a second queue by the tags with the evaluation values larger than a fourth preset threshold value in the plurality of tag sets according to the evaluation values from large to small. In this embodiment, the tags whose evaluation values are less than or equal to the fourth preset threshold are removed when the second queue is formed, so that the tags whose evaluation values are less than or equal to the fourth preset threshold are prevented from being selected as video tags, and the subsequent calculation amount is reduced.
In addition, the evaluation value of a tag is calculated as:

S = A_1 + A_2 + ... + A_T

where S denotes the evaluation value of the tag, T denotes the number of tag sets containing the tag, and A_l (l an integer, 1 ≤ l ≤ T) denotes the probability value of the tag in the l-th of those T tag sets. The present embodiment provides a calculation formula for the evaluation value of the tag.
In addition, the plurality of classification models are acquired in the following manner: M classifiers are respectively trained according to N training parameter sets and a plurality of dimension types each corresponding to at least one label, to obtain the plurality of classification models, wherein each training parameter set comprises a learning rate, a sample library and an iteration count, and N and M are integers greater than or equal to 1. This embodiment provides a specific implementation of obtaining a plurality of classification models that fuse multiple training parameters, multiple dimension types and different classifiers.
In addition, the manner of training any classifier with any training parameter set is: according to the sample labels of the plurality of sample video images in the sample library of the training parameter set, counting the number of sample video images corresponding to each label under the current dimension type, taking the maximum count as a reference value, and supplementing sample video images for each label up to that reference value; acquiring sample feature information corresponding to each sample video image; and training the classifier with the sample feature information according to the iteration count and learning rate in the training parameter set, to obtain the classification model corresponding to the classifier under the current dimension type. This embodiment provides a specific implementation of training a classifier with any training parameter set under one dimension type.
In addition, the M classifiers include at least one linear classifier and at least one nonlinear classifier. In this embodiment, the characteristics of the linear classifier suit the case where the feature information is highly similar to the label, while the characteristics of the nonlinear classifier suit feature information that is comparatively abstract.
In addition, the acquiring of the feature information corresponding to the video image includes: acquiring video text information corresponding to the video image, and removing preset information in the video text information; and converting a plurality of word segments in the video text information without the preset information into word segment vectors as characteristic information. The embodiment provides a specific implementation mode for acquiring the characteristic information corresponding to the video image.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a specific flow of a method for acquiring a video tag according to a first embodiment of the present invention;
fig. 2 is a detailed flowchart of a method for acquiring a video tag according to a second embodiment of the present invention;
FIG. 3 is a detailed flowchart of the training method of the classifier according to the third embodiment of the present invention;
fig. 4 is a schematic diagram of a participled text in a third embodiment according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application; however, the technical solution claimed in the present application can be implemented without these technical details, and various changes and modifications may be made based on the following embodiments.
A first embodiment of the present invention relates to a video tag acquisition method for automatically adding a video tag to a video image, such as a movie video, a television video, or the like.
Fig. 1 shows a specific flow of the method for acquiring a video tag according to this embodiment.
Step 101, responding to an input video image, and acquiring feature information corresponding to the video image.
Specifically, when a video image to which a video tag is to be added is input, video text information corresponding to the video image is acquired. The video text information may be preset information or information acquired in real time from other accessible websites or databases, and may include a brief introduction to the video, user comments, and the like. The video text information is the original corpus and is generally a long passage of text containing punctuation marks, so the preset information in the original corpus is removed first; the preset information consists of preset Chinese stop words, such as punctuation marks, modal particles, exclamations, transition words and other words unrelated to the text semantics. The original corpus with the preset information removed is then segmented into words, yielding a word-segmented text comprising a plurality of word segments, and each word segment is converted into a word-segment vector; the feature information corresponding to the video image comprises these word-segment vectors. The word segments in the word-segmented text can be converted into word-segment vectors by the Term Frequency-Inverse Document Frequency (TF-IDF) method, and a word-segment vector can represent the importance of each word segment to the word-segmented text.
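As an illustrative sketch only (not taken from the patent), the feature-extraction step described above could look roughly as follows in Python, assuming jieba for Chinese word segmentation and scikit-learn's TfidfVectorizer; the stop-word list and sample texts are placeholder assumptions.

```python
# Illustrative sketch: word segmentation with jieba and TF-IDF vectorization with
# scikit-learn; the stop-word list and sample texts are placeholder assumptions.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"的", "了", "啊", "，", "。", "！"}  # assumed preset Chinese stop words

def to_feature_text(raw_corpus: str) -> str:
    """Segment the raw corpus and drop the preset stop words / punctuation."""
    tokens = [w for w in jieba.lcut(raw_corpus) if w.strip() and w not in STOP_WORDS]
    return " ".join(tokens)

# Fit TF-IDF over the whole corpus of video texts so that IDF reflects how
# discriminative each word segment is; each video becomes one TF-IDF row vector.
corpus_texts = ["这是一部经典的警匪电影……", "用户评论：剧情紧凑，非常热血！"]  # placeholder texts
vectorizer = TfidfVectorizer()
feature_matrix = vectorizer.fit_transform([to_feature_text(t) for t in corpus_texts])
```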
And 102, respectively inputting the characteristic information into a plurality of different classification models to obtain a label set output by each classification model, wherein the label set comprises at least one label.
Specifically, each classification model corresponds to a plurality of labels, and the labels corresponding to different classification models may be the same or different. Taking any classification model as an example, when the feature information is input into the classification model, a probability value is obtained for each label; the higher the probability value, the more likely that label is a video tag of the video image. At least one label can therefore be selected from the corresponding labels to form a tag set, which is output.
And 103, selecting video tags of the video images from the plurality of tag sets according to the probability values of the tags in the tag sets.
Specifically, a tag may appear in several tag sets, so its probability values across those tag sets can be combined to determine whether it should serve as a video tag of the video image; in this way at least one tag can be selected from the plurality of tag sets and added as a video tag of the input video image.
Compared with the prior art, in this embodiment, in response to an input video image, the corresponding feature information is first obtained and input into a plurality of different classification models to obtain a plurality of tag sets output by those models, each tag set comprising at least one tag; the video tags of the video image are then determined from the plurality of tag sets according to the probability values of the tags in each tag set. Because the video tags are obtained by fusing a plurality of different classification models, classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
A second embodiment of the present invention relates to a method for acquiring a video tag, and the present embodiment is mainly different from the first embodiment in that: specific implementation modes of acquiring the tag set and selecting the video tags are provided.
A specific flow of the training process of the video tag acquisition method according to the present embodiment is shown in fig. 2.
Step 201, responding to an input video image, and acquiring feature information corresponding to the video image. This step is substantially the same as step 101 in the first embodiment, and will not be described herein again.
Step 202, comprising the following sub-steps:
substep 2021, inputting the feature information into each classification model, respectively, to obtain probability values of each label generated by each classification model.
In the substep 2022, for each classification model, according to the probability value of each generated label, a label meeting a first preset condition is selected, and the label meeting the first preset condition is added to the label set of the classification model.
Specifically, each classification model corresponds to a dimension type, the dimension type includes a plurality of labels, and the dimension type corresponding to each classification model may be the same or different. Any classification model is taken as an example for explanation:
when the characteristic information is input into the classification model, the probability value of each label can be obtained, then a plurality of labels corresponding to the classification model form a first queue according to the probability value from large to small, then the labels in the first queue are sequentially traversed until the difference value of the probability value of the current label minus the probability value of the next label is larger than a first preset threshold value, and the traversed label is used as the label meeting a first preset condition. For example, a plurality of labels corresponding to the classification model are arranged from large to small according to probability values to form a first queue, taking the example that the dimension type corresponding to the classification model includes X labels, aiRepresenting the probability value of the ith label, i is more than or equal to 1 and less than or equal to X, the first queue is (A)1,A2,A3,…,AX) (ii) a The method comprises the following steps that labels with probability values larger than or equal to a second preset threshold value O can be selected from X labels to form a first queue according to the probability values from large to small; and Y represents the number of the labels with the probability value larger than or equal to a second preset threshold value O in the X labels, namely the final first queue meets the following formula:
AY=max(A1,A2,...,AX)
st.AY≥O
and then traversing the tags in the first queue from the beginning until the difference value of the probability value of the current tag minus the probability value of the next tag is greater than a first preset threshold, and taking the traversed tags as the tags meeting a first preset condition. In particular, from the first queue AYBegins with the first tag in (1), calculates Ai-Ai+1If A is greater than a first predetermined threshold valuei-Ai+1Is less than or equal to the first threshold value, the ith label is compared with the ith label+1 label is taken as a label meeting a first preset condition until Ai-Ai+1Is greater than a first preset threshold value, only the ith label is taken as satisfying the ithA label with a preset condition is marked, and traversal is stopped; for example, when i is 5, a5-A6When the difference is greater than a first preset threshold, the 1 st to 5 th labels are all labels meeting a first preset condition, that is, the label set output by the classification model includes the 1 st to 5 th labels in the first queue.
In summary, for a plurality of classification models, the above process is repeated, and then the label set output by each classification model can be obtained.
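A minimal sketch of this first-queue selection for one classification model is given below, assuming the model's output is available as a dict of tag-to-probability values; the two threshold values are illustrative assumptions, not values given in the patent.

```python
# Sketch of selecting the tags meeting the first preset condition from one
# classification model's output; threshold values are illustrative assumptions.
from typing import Dict, List

def select_first_condition(probs: Dict[str, float],
                           first_threshold: float = 0.10,
                           second_threshold_o: float = 0.30) -> List[str]:
    # First queue: tags with probability value >= O, ordered from large to small.
    queue = sorted(((p, tag) for tag, p in probs.items() if p >= second_threshold_o),
                   reverse=True)
    selected: List[str] = []
    for i, (p, tag) in enumerate(queue):
        selected.append(tag)
        # Stop once the drop to the next tag exceeds the first preset threshold.
        if i + 1 < len(queue) and p - queue[i + 1][0] > first_threshold:
            break
    return selected

# Example: the output tag set is ['gunfight', 'police bandit'] because the gap
# after 'police bandit' (0.65 - 0.31) exceeds the first preset threshold.
tag_set = select_first_condition({"gunfight": 0.72, "police bandit": 0.65,
                                  "campus": 0.31, "idol": 0.05})
```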
Step 203, comprising the following substeps:
in sub-step 2031, for each tag in each tag set, an evaluation value of the tag is calculated according to the probability value of the tag in each tag set.
Specifically, taking the number of classification models in step 202 as Z, Z tag sets can be obtained:

(B_1, B_2, ..., B_Z)

where the number of tags contained in each tag set B_q is less than the number X of tags contained in the dimension type corresponding to that tag set.

Taking one tag m in any tag set as an example, the probability values of tag m in the Z tag sets are obtained, and T (1 ≤ T ≤ Z) denotes the number of tag sets containing tag m. The evaluation value S_m of tag m is then calculated as:

S_m = A_1^m + A_2^m + ... + A_T^m

where A_l^m denotes the probability value of tag m in the l-th of the T tag sets containing tag m.
In summary, the evaluation values of all tags in the Z tag set may be acquired, where the same tag only needs to be calculated once.
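A rough sketch of sub-step 2031 follows, assuming the tag sets are given as dicts mapping tags to probability values; the evaluation value is computed here as the sum of a tag's probability values over the tag sets containing it, which is one reading of the formula above.

```python
# Sketch of sub-step 2031: accumulate each tag's evaluation value over the Z tag
# sets, so tags that appear in more tag sets (more "votes") score higher.
from typing import Dict, List

def evaluation_values(tag_sets: List[Dict[str, float]]) -> Dict[str, float]:
    scores: Dict[str, float] = {}
    for tag_set in tag_sets:                  # the Z tag sets, one per classification model
        for tag, prob in tag_set.items():
            # S_m accumulates A_l^m over the T tag sets that contain tag m.
            scores[tag] = scores.get(tag, 0.0) + prob
    return scores
```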
Substep 2032, selecting tags meeting a second preset condition from the plurality of tag sets according to the evaluation values of the tags in the tag sets, and using the tags meeting the second preset condition as video tags.
Specifically, all tags in the plurality of tag sets may be arranged by evaluation value from large to small to form the second queue; in one example, only the tags whose evaluation values are greater than a fourth preset threshold are arranged from large to small. The tags in the second queue are then traversed in sequence until the evaluation value of the current tag minus the evaluation value of the next tag is greater than a third preset threshold, and the tags traversed so far are taken as the tags satisfying the second preset condition. For example, suppose the plurality of tag sets contain P different tags whose evaluation values, arranged from large to small, are (S_1, S_2, S_3, ..., S_P). The tags whose evaluation value is greater than or equal to the fourth preset threshold H are selected, and U denotes the number of such tags among the P tags; a larger evaluation value means the tag occupies more tag sets and therefore receives more votes. The final second queue S_U satisfies:

S_U = max(S_1, S_2, ..., S_P), subject to S_U ≥ H

The tags in the second queue are then traversed from the beginning until the evaluation value of the current tag minus the evaluation value of the next tag is greater than the third preset threshold, and the tags traversed so far are taken as the tags meeting the second preset condition. Specifically, starting from the first tag in the second queue S_U, the difference S_i - S_(i+1) is calculated and compared with the third preset threshold: if S_i - S_(i+1) is less than or equal to the third preset threshold, both the i-th and the (i+1)-th tags are taken as tags meeting the second preset condition; once S_i - S_(i+1) is greater than the third preset threshold, only the tags up to the i-th are kept and traversal stops. For example, when i = 3 and S_3 - S_4 is greater than the third preset threshold, the 1st to 3rd tags are all tags meeting the second preset condition, that is, the video tags of the video image include the 1st to 3rd tags in the second queue.
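Continuing the sketch for sub-step 2032 with illustrative threshold values standing in for the third and fourth preset thresholds:

```python
# Sketch of sub-step 2032: order tags by evaluation value, keep only those above
# the fourth preset threshold H, and cut the queue at the first gap larger than
# the third preset threshold. Threshold values are illustrative assumptions.
from typing import Dict, List

def select_video_tags(scores: Dict[str, float],
                      third_threshold: float = 0.20,
                      fourth_threshold_h: float = 0.50) -> List[str]:
    # Second queue: tags whose evaluation value exceeds H, from large to small.
    queue = sorted(((s, tag) for tag, s in scores.items() if s > fourth_threshold_h),
                   reverse=True)
    video_tags: List[str] = []
    for i, (s, tag) in enumerate(queue):
        video_tags.append(tag)
        if i + 1 < len(queue) and s - queue[i + 1][0] > third_threshold:
            break
    return video_tags
```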
Compared with the first embodiment, this embodiment provides a specific implementation of acquiring the tag sets and selecting the video tags. The evaluation value of each tag is calculated based on the number of tag sets that contain it; the more tag sets a tag appears in, the more votes it receives, so the video tags of the video image are selected by voting and the acquired video tags are more accurate.
A third embodiment of the present invention relates to a method for acquiring a video tag, and the present embodiment is mainly different from the first embodiment in that: a plurality of training parameters, a plurality of dimensionality types and different classifiers are fused in the plurality of classification models.
In this embodiment, the plurality of classification models are obtained as follows: M classifiers are respectively trained according to N training parameter sets and a plurality of dimension types each corresponding to at least one label, to obtain the plurality of classification models, wherein each training parameter set comprises a learning rate, a sample library and an iteration count, and N and M are integers greater than or equal to 1. That is, there are N training parameter sets; when a classifier is trained, it is trained with the N training parameter sets under each dimension type, so N classification models can be obtained for that classifier. For M classifiers, M x N classification models are obtained per dimension type, and with K dimension types, K x M x N classification models are obtained after training. Each training parameter set comprises a learning rate, a sample library and an iteration count, and the N training parameter sets differ in at least one of these.
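A self-contained sketch of this K x M x N combination is given below, using scikit-learn's SVC and MLPClassifier purely as stand-ins for the patent's M classifiers; the dimension types, training parameter values and the sample-loading interface are assumptions, with example dimension types listed after the code.

```python
# Sketch of the K x M x N training grid; SVC and MLPClassifier stand in for the
# patent's classifiers, and the sample-loading interface is an assumption.
from itertools import product
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def train_all_models(samples, dimension_types, training_params):
    """samples[(dim, library)] -> (X, y); returns the K * M * N fitted models."""
    factories = {
        "svm": lambda p: SVC(kernel="rbf", probability=True),
        "nn": lambda p: MLPClassifier(learning_rate_init=p["learning_rate"],
                                      max_iter=p["iterations"]),
    }
    models = {}
    for dim, (name, make_clf), (j, params) in product(
            dimension_types, factories.items(), enumerate(training_params)):
        X, y = samples[(dim, params["sample_library"])]
        models[(dim, name, j)] = make_clf(params).fit(X, y)
    return models

# Example training-parameter sets (assumed values): each fixes a learning rate,
# a sample library and an iteration count, as described above.
training_params = [
    {"learning_rate": 1e-3, "sample_library": "library_a", "iterations": 50},
    {"learning_rate": 1e-4, "sample_library": "library_b", "iterations": 100},
]
```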
For example, the dimension types are 3, respectively: topics, content and features; each dimension type corresponds to a plurality of tags, as follows:
the story dimension types correspond to 32 tags, including: classic, art, animation, family, tragedy, laugh, inspirational, magic, mythology, female, antique, child, youth, police bandit, black slope, gunfight, military, reasoning, campus, idol, youth, psychology, history, poetry, fairy tale, animal, highway, theme, sports (sports), super nature, traversing, ethics, politics, little girl video, ethnicity.
The content dimension type contains 39 tags, including: growth, hommization, dream, kill, revenge, high-intelligence quotient, racing car, death, first love, dark love, handful, space, Beijing, Shanghai, familiarity, friendship, brother, derailment, multi-angular love, triangular love, off-site love, homosexual love, Master and student love, sibling, newspaper, end, change, workplace, law, medical treatment, uterine fighting, combat, spy, father and daughter, mother and daughter, sister and sister.
The feature dimension type contains 16 tags in total, including: handwriting, temperature (feeling), lacrimation, hot blood, anepithymia, basic emotion, romance, black humor, pure love, abuse, temperament, heavy taste, violence aesthetics, gothic wind, healing, and brain burning.
With reference to fig. 3, in any dimension type, the way of training any classifier by using any training parameter is as follows:
step 301, according to the sample labels of the plurality of sample video images in the sample library of the training parameters, counting the number value of the sample video image corresponding to each label under the current dimension type, taking the maximum number value as a reference value, and adding the sample video image corresponding to each label.
Specifically, the sample library comprises a plurality of sample video images added with sample labels, and for each label corresponding to the current dimension type, the sample video image corresponding to the sample label matched with the label is searched, so that the quantity value of the sample video image corresponding to the label can be obtained; repeating the process to obtain the number value of the sample video images corresponding to each label under the current dimension type, and searching the sample video images corresponding to each label from other sample libraries or accessed websites by taking the maximum number value as a reference value to add the sample video images into the sample libraries until the number value of the sample video images corresponding to each label reaches the reference value.
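A simplified sketch of this balancing step follows; `labels_of` and `fetch_extra_samples` are hypothetical helpers standing in for the sample library lookup and for searching other sample libraries or accessed websites.

```python
# Sketch of step 301: count samples per label, take the maximum count as the
# reference value, and top up under-represented labels. The helper functions are
# hypothetical placeholders, not part of the patent.
from collections import Counter

def balance_sample_library(samples, labels_of, fetch_extra_samples):
    """samples: list of sample video ids; labels_of(s): sample labels of sample s."""
    counts = Counter(label for s in samples for label in labels_of(s))
    reference = max(counts.values())          # largest per-label count = reference value
    for label, count in counts.items():
        if count < reference:
            # Top the label up until its sample count reaches the reference value.
            samples.extend(fetch_extra_samples(label, reference - count))
    return samples
```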
Step 302, sample characteristic information corresponding to each sample video image is obtained.
Specifically, video text information corresponding to the sample video image is acquired. The video text information may be preset information or information acquired in real time from other accessible websites or databases, and may include a brief introduction to the video, user comments, and the like. The video text information is the original corpus and is generally a long passage of text containing punctuation marks, so the preset information in the original corpus is removed first; the preset information consists of preset Chinese stop words, such as punctuation marks, modal particles, exclamations, transition words and other words unrelated to the text semantics. The original corpus with the preset information removed is then segmented into words; referring to fig. 4, a word-segmented text comprising a plurality of word segments is obtained, and each word segment is converted into a word-segment vector, the sample feature information comprising these word-segment vectors. The word segments in the word-segmented text can be converted into word-segment vectors by the Term Frequency-Inverse Document Frequency (TF-IDF) method, and a word-segment vector can represent the importance of each word segment to the word-segmented text.
And 303, training the classifier by using the characteristic information of each sample according to the iteration times and the learning rate in the training parameters to obtain a classification model corresponding to the classifier under the current dimension type.
Specifically, parameters of the classifier are set according to the number of iterations in the training parameters and the learning rate, and then a plurality of sample feature information are input into the classifier to train the classifier, so that a classification model corresponding to the classifier under the current dimension type can be obtained.
In one example, the M classifiers include at least one linear classifier and at least one nonlinear classifier, so that the characteristics of the linear classifier can handle the case where the word segments are highly similar to the labels, while the characteristics of the nonlinear classifier can handle word segments that are comparatively abstract. The linear classifier is, for example, an SVM classifier, and the nonlinear classifier is, for example, a CNN classifier. For the SVM classifier, an RBF kernel and a one-versus-one multi-classification method can be adopted, and a grid search can be used during training to optimize the penalty parameter and the bias parameter so as to obtain a better classification model; for the CNN classifier, a softmax layer can be used to predict the label and the cross-entropy method can be used to calculate the loss function, so as to obtain a better classification model.
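For the SVM branch, a grid search of this kind might look as follows with scikit-learn, where the parameter grid and the reading of the bias parameter as the RBF kernel coefficient gamma are assumptions.

```python
# Possible shape of the SVM branch: RBF kernel, one-vs-one multi-classification,
# and a grid search over the penalty parameter C and the kernel coefficient gamma
# (taken here as the "bias" parameter); the parameter grid is an assumption.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(X, y):
    grid = GridSearchCV(
        SVC(kernel="rbf", decision_function_shape="ovo", probability=True),
        param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
        cv=3,
    )
    grid.fit(X, y)
    return grid.best_estimator_   # the classification model for the current dimension type
```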
Compared with the first embodiment, the present embodiment provides a specific implementation manner for acquiring multiple classification models that merge multiple training parameters, multiple dimension types, and different classifiers.
A fourth embodiment of the present invention relates to an apparatus for acquiring a video tag, which is applied to an electronic device, such as a computer or a server. The video tag acquisition device is used for automatically adding a video tag to a video image, such as a movie video, a television video and the like.
In this embodiment, the apparatus for acquiring a video tag includes at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the video tag acquisition method as in the first to third embodiments.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
Compared with the prior art, in this embodiment, in response to an input video image, the corresponding feature information is first obtained and input into a plurality of different classification models to obtain a plurality of tag sets output by those models, each tag set comprising at least one tag; the video tags of the video image are then determined from the plurality of tag sets according to the probability values of the tags in each tag set. Because the video tags are obtained by fusing a plurality of different classification models, classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method for acquiring a video tag is characterized by comprising the following steps:
responding to an input video image, and acquiring characteristic information corresponding to the video image;
respectively inputting the characteristic information into a plurality of different classification models to obtain a plurality of label sets output by the classification models, wherein each label set comprises at least one label;
and selecting the video label of the video image from the plurality of label sets according to the probability value of the label in each label set.
2. The method for acquiring video tags according to claim 1, wherein said inputting the feature information into a plurality of different classification models respectively to obtain a plurality of tag sets output by the plurality of classification models comprises:
inputting the characteristic information into each classification model respectively to obtain probability values of the labels generated by each classification model;
and for each classification model, selecting the labels meeting a first preset condition according to the generated probability value of each label, and adding the labels meeting the first preset condition into a label set of the classification model.
3. The method for acquiring video tags according to claim 2, wherein the selecting the tags meeting a first preset condition according to the generated probability values of the tags comprises:
forming, from large to small by probability value, a first queue of the labels whose probability values corresponding to the classification model are greater than or equal to a second preset threshold;
and traversing the tags in the first queue in sequence until the difference value of the probability value of the current tag minus the probability value of the next tag is greater than a first preset threshold value, and taking the traversed tags as the tags meeting a first preset condition.
4. The method as claimed in claim 1, wherein the selecting the video tag of the video image from the plurality of tag sets according to the probability value of the tag in each tag set comprises:
for each label in each label set, calculating to obtain an evaluation value of the label according to the probability value of the label in each label set;
and selecting the tags meeting a second preset condition from the plurality of tag sets according to the evaluation values of the tags in the plurality of tag sets, and taking the tags meeting the second preset condition as the video tags.
5. The method according to claim 4, wherein the selecting, from the plurality of tag sets, the tags satisfying a second preset condition according to the evaluation values of the tags in the plurality of tag sets includes:
forming a second queue by the tags in the plurality of tag sets according to the evaluation values from large to small;
and traversing the tags in the second queue in sequence until the difference value of the evaluation value of the current tag minus the evaluation value of the next tag is greater than a third preset threshold, and taking the traversed tags as the tags meeting a second preset condition.
6. The method for acquiring a video tag according to claim 5, wherein the forming a second queue of the tags in the plurality of tag sets according to evaluation values from large to small specifically includes:
and forming a second queue by the tags with the tag centralized evaluation values larger than a fourth preset threshold value according to the evaluation values from large to small.
7. The method of claim 5, wherein the evaluation value of the tag is calculated by the formula:
S = A_1 + A_2 + ... + A_T
wherein S represents the evaluation value of the tag, T represents the number of the tag sets containing the tag, and A_l represents the probability value of the tag in the l-th of the T tag sets, l being an integer with 1 ≤ l ≤ T.
8. The method for acquiring video tags according to claim 1, wherein the plurality of classification models are acquired in a manner that:
and respectively training the M classifiers according to the N training parameters and a plurality of dimension types corresponding to at least one label to obtain a plurality of classification models, wherein each training parameter comprises a learning rate, a sample library and iteration times, and N, M is an integer greater than or equal to 1.
9. The method of claim 8, wherein the M classifiers comprise: at least one linear classifier and at least one non-linear classifier.
10. An apparatus for acquiring a video tag, comprising: at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of video tag acquisition of any one of claims 1 to 9.
CN202010088404.6A 2020-02-12 2020-02-12 Video tag acquisition method and device Active CN111291688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010088404.6A CN111291688B (en) 2020-02-12 2020-02-12 Video tag acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010088404.6A CN111291688B (en) 2020-02-12 2020-02-12 Video tag acquisition method and device

Publications (2)

Publication Number Publication Date
CN111291688A true CN111291688A (en) 2020-06-16
CN111291688B CN111291688B (en) 2023-07-14

Family

ID=71021331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088404.6A Active CN111291688B (en) 2020-02-12 2020-02-12 Video tag acquisition method and device

Country Status (1)

Country Link
CN (1) CN111291688B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708913A (en) * 2020-08-19 2020-09-25 腾讯科技(深圳)有限公司 Label generation method and device and computer readable storage medium
CN111741330A (en) * 2020-07-17 2020-10-02 腾讯科技(深圳)有限公司 Video content evaluation method and device, storage medium and computer equipment
CN111950360A (en) * 2020-07-06 2020-11-17 北京奇艺世纪科技有限公司 Method and device for identifying infringing user
CN112699945A (en) * 2020-12-31 2021-04-23 青岛海尔科技有限公司 Data labeling method and device, storage medium and electronic device
CN114466251A (en) * 2022-04-08 2022-05-10 深圳市致尚信息技术有限公司 Video-based classification label mark processing method and system

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809218A (en) * 2015-04-30 2015-07-29 北京奇艺世纪科技有限公司 UGC (User Generated Content) video classification method and device
CN104992184A (en) * 2015-07-02 2015-10-21 东南大学 Multiclass image classification method based on semi-supervised extreme learning machine
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
US20180181843A1 (en) * 2016-12-28 2018-06-28 Ancestry.Com Operations Inc. Clustering historical images using a convolutional neural net and labeled data bootstrapping
CN108664989A (en) * 2018-03-27 2018-10-16 北京达佳互联信息技术有限公司 Image tag determines method, apparatus and terminal
CN108694217A (en) * 2017-04-12 2018-10-23 合信息技术(北京)有限公司 The label of video determines method and device
CN108764371A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109271521A (en) * 2018-11-16 2019-01-25 北京九狐时代智能科技有限公司 A kind of file classification method and device
CN109409414A (en) * 2018-09-28 2019-03-01 北京达佳互联信息技术有限公司 Sample image determines method and apparatus, electronic equipment and storage medium
CN109740018A (en) * 2019-01-29 2019-05-10 北京字节跳动网络技术有限公司 Method and apparatus for generating video tab model
CN109815365A (en) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN109947989A (en) * 2019-03-18 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN110287372A (en) * 2019-06-26 2019-09-27 广州市百果园信息技术有限公司 Label for negative-feedback determines method, video recommendation method and its device
CN110348367A (en) * 2019-07-08 2019-10-18 北京字节跳动网络技术有限公司 Video classification methods, method for processing video frequency, device, mobile terminal and medium
CN110458245A (en) * 2019-08-20 2019-11-15 图谱未来(南京)人工智能研究院有限公司 A kind of multi-tag disaggregated model training method, data processing method and device
CN110598011A (en) * 2019-09-27 2019-12-20 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and readable storage medium

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809218A (en) * 2015-04-30 2015-07-29 北京奇艺世纪科技有限公司 UGC (User Generated Content) video classification method and device
CN104992184A (en) * 2015-07-02 2015-10-21 东南大学 Multiclass image classification method based on semi-supervised extreme learning machine
US20180181843A1 (en) * 2016-12-28 2018-06-28 Ancestry.Com Operations Inc. Clustering historical images using a convolutional neural net and labeled data bootstrapping
CN108694217A (en) * 2017-04-12 2018-10-23 合信息技术(北京)有限公司 The label of video determines method and device
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
CN108664989A (en) * 2018-03-27 2018-10-16 北京达佳互联信息技术有限公司 Image tag determines method, apparatus and terminal
CN108764371A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109409414A (en) * 2018-09-28 2019-03-01 北京达佳互联信息技术有限公司 Sample image determines method and apparatus, electronic equipment and storage medium
CN109271521A (en) * 2018-11-16 2019-01-25 北京九狐时代智能科技有限公司 A kind of file classification method and device
CN109740018A (en) * 2019-01-29 2019-05-10 北京字节跳动网络技术有限公司 Method and apparatus for generating video tab model
CN109815365A (en) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN109947989A (en) * 2019-03-18 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN110287372A (en) * 2019-06-26 2019-09-27 广州市百果园信息技术有限公司 Label for negative-feedback determines method, video recommendation method and its device
CN110348367A (en) * 2019-07-08 2019-10-18 北京字节跳动网络技术有限公司 Video classification methods, method for processing video frequency, device, mobile terminal and medium
CN110458245A (en) * 2019-08-20 2019-11-15 图谱未来(南京)人工智能研究院有限公司 A kind of multi-tag disaggregated model training method, data processing method and device
CN110598011A (en) * 2019-09-27 2019-12-20 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
TIMOTHY RUBIN et al.: "Statistical Topic Models for Multi-Label Document Classification", arXiv:1107.2462v2 *
吴雨希: "Research on video tag generation and video classification based on text mining", China Master's Theses Full-text Database, Information Science and Technology *
肖谦: "Research on human action recognition in depth video based on spatio-temporal interest points", China Master's Theses Full-text Database, Information Science and Technology *
赵丽: "Research and implementation of a multi-semantic nonlinear agricultural consulting video retrieval system", China Master's Theses Full-text Database, Information Science and Technology *
钟岑岑: "Research on context-based audio and video annotation", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950360A (en) * 2020-07-06 2020-11-17 北京奇艺世纪科技有限公司 Method and device for identifying infringing user
CN111950360B (en) * 2020-07-06 2023-08-18 北京奇艺世纪科技有限公司 Method and device for identifying infringement user
CN111741330A (en) * 2020-07-17 2020-10-02 腾讯科技(深圳)有限公司 Video content evaluation method and device, storage medium and computer equipment
CN111741330B (en) * 2020-07-17 2024-01-30 腾讯科技(深圳)有限公司 Video content evaluation method and device, storage medium and computer equipment
CN111708913A (en) * 2020-08-19 2020-09-25 腾讯科技(深圳)有限公司 Label generation method and device and computer readable storage medium
CN112699945A (en) * 2020-12-31 2021-04-23 青岛海尔科技有限公司 Data labeling method and device, storage medium and electronic device
CN112699945B (en) * 2020-12-31 2023-10-27 青岛海尔科技有限公司 Data labeling method and device, storage medium and electronic device
CN114466251A (en) * 2022-04-08 2022-05-10 深圳市致尚信息技术有限公司 Video-based classification label mark processing method and system

Also Published As

Publication number Publication date
CN111291688B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
Wu et al. Cycle-consistent deep generative hashing for cross-modal retrieval
CN111291688A (en) Video tag obtaining method and device
CN106973244B (en) Method and system for automatically generating image captions using weak supervision data
CN108875074B (en) Answer selection method and device based on cross attention neural network and electronic equipment
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
Wu et al. Learning of multimodal representations with random walks on the click graph
CN111741330A (en) Video content evaluation method and device, storage medium and computer equipment
CN112307351A (en) Model training and recommending method, device and equipment for user behavior
CN113627447A (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN112464100B (en) Information recommendation model training method, information recommendation method, device and equipment
Zhou A novel movies recommendation algorithm based on reinforcement learning with DDPG policy
Ma et al. Topic-based algorithm for multilabel learning with missing labels
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
US11983183B2 (en) Techniques for training machine learning models using actor data
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection
Das A multimodal approach to sarcasm detection on social media
CN113065027A (en) Video recommendation method and device, electronic equipment and storage medium
Peng et al. Quintuple-media joint correlation learning with deep compression and regularization
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN113822065A (en) Keyword recall method and device, electronic equipment and storage medium
CN112269877A (en) Data labeling method and device
Xiong et al. An intelligent film recommender system based on emotional analysis
CN113743050B (en) Article layout evaluation method, apparatus, electronic device and storage medium
Nurhasanah et al. Fine-grained object recognition using a combination model of navigator–teacher–scrutinizer and spinal networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant