CN111291688A - Video tag obtaining method and device - Google Patents

Video tag obtaining method and device

Info

Publication number
CN111291688A
CN111291688A (application CN202010088404.6A; granted publication CN111291688B)
Authority
CN
China
Prior art keywords
tag
tags
video
label
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010088404.6A
Other languages
Chinese (zh)
Other versions
CN111291688B (en)
Inventor
徐鸣谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202010088404.6A priority Critical patent/CN111291688B/en
Publication of CN111291688A publication Critical patent/CN111291688A/en
Application granted granted Critical
Publication of CN111291688B publication Critical patent/CN111291688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47 - Detecting features for summarising video content
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to the technical field of computers, and discloses a method and a device for acquiring a video tag. The video tag obtaining method comprises the following steps: in response to an input video image, acquiring feature information corresponding to the video image; respectively inputting the feature information into a plurality of different classification models to obtain a plurality of tag sets output by the classification models, each tag set comprising at least one tag; and selecting the video tags of the video image from the plurality of tag sets according to the probability values of the tags in each tag set. According to the invention, a plurality of different classification models are fused to obtain the video tags of a video image, so that classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatic video tagging is improved.

Description

Video tag obtaining method and device
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a method and a device for acquiring a video tag.
Background
With the rapid development of the internet, users can acquire various types of videos, such as movies, television shows, and the like, through video applications or websites of various terminals. For videos, the video tags can well show the types, characteristics and the like of the videos, so that a user can select favorite video types according to the video tags, and the use experience of the user is greatly improved.
At present, videos are generally labeled automatically by means of artificial intelligence, and two automatic labeling approaches are common: one is based on automatic recognition of text related to the video; the other is based on automatic recognition of images in the video. Because a video plays for a long time, it involves too much image information, the processing is slow and the labor cost is high, so the approach based on automatic recognition of video-related text is generally chosen to label videos.
The inventor finds that the prior art has at least the following problem: when videos are labeled based on automatic recognition of video-related text, the labels are obtained from a single model, so the labeling effect differs greatly across different types of videos, which leads to inaccurate video tags and degrades the user experience.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for acquiring video tags, in which the video tags of a video image are obtained by fusing a plurality of different classification models, so that classification models of multiple dimension types can be selected to adapt to different types of video images; the acquired video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
In order to solve the above technical problem, an embodiment of the present invention provides a method for acquiring a video tag, including: responding to an input video image, and acquiring characteristic information corresponding to the video image; respectively inputting the characteristic information into a plurality of different classification models to obtain a plurality of label sets output by the classification models, wherein each label set comprises at least one label; and selecting the video labels of the video images from the plurality of label sets according to the probability values of the labels in the label sets.
The embodiment of the invention also provides a device for acquiring a video tag, which comprises at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the video tag acquisition method.
Compared with the prior art, in the embodiments of the invention, in response to an input video image, the corresponding feature information is first obtained and input into a plurality of different classification models to obtain a plurality of tag sets output by those models, each tag set comprising at least one tag; the video tags of the video image are then determined from the plurality of tag sets according to the probability values of the tags in each tag set. Because the video tags are obtained by fusing a plurality of different classification models, classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
In addition, the method for obtaining a plurality of label sets output by a plurality of classification models by respectively inputting the characteristic information into a plurality of different classification models comprises the following steps: respectively inputting the characteristic information into each classification model to obtain the probability value of each label generated by each classification model; and for each classification model, selecting the labels meeting the first preset condition according to the generated probability value of each label, and adding the labels meeting the first preset condition into the label set of the classification model. The embodiment provides a specific implementation mode for respectively inputting the characteristic information into a plurality of different classification models to obtain a plurality of label sets output by the plurality of classification models.
In addition, selecting the labels meeting a first preset condition according to the generated probability value of each label includes: forming a first queue, ordered by probability value from large to small, of the labels whose probability values under the classification model are greater than or equal to a second preset threshold; and traversing the tags in the first queue in sequence until the probability value of the current tag minus the probability value of the next tag is greater than a first preset threshold, taking the tags traversed so far as the tags meeting the first preset condition. This embodiment provides a specific implementation of selecting the tags meeting the first preset condition according to the generated probability values, so that tags closely related to the video image can be selected to form a tag set.
In addition, selecting the video tags of the video image from the plurality of tag sets according to the probability values of the tags in each tag set includes: for each tag in each tag set, calculating an evaluation value of the tag according to its probability values in the tag sets; and selecting, according to the evaluation values of the tags in the plurality of tag sets, the tags meeting a second preset condition, and taking them as the video tags. This embodiment provides a specific implementation of selecting the video tags of the video image from the plurality of tag sets according to the probability values of the tags: the evaluation value of each tag is calculated based on the number of tag sets that contain it, and the more tag sets a tag appears in, the more votes it receives; the video tags of the video image are thus selected by voting, which makes the acquired video tags more accurate.
In addition, selecting the tags satisfying a second preset condition from the plurality of tag sets according to the evaluation values of the tags includes: forming a second queue of the tags in the plurality of tag sets, ordered by evaluation value from large to small; and traversing the tags in the second queue in sequence until the evaluation value of the current tag minus the evaluation value of the next tag is greater than a third preset threshold, taking the tags traversed so far as the tags meeting the second preset condition. This embodiment provides a specific implementation of selecting the tags satisfying the second preset condition from the plurality of tag sets according to their evaluation values.
In addition, forming a second queue by the tags in the plurality of tag sets according to the evaluation values from large to small, specifically: and forming a second queue by the tags with the evaluation values larger than a fourth preset threshold value in the plurality of tag sets according to the evaluation values from large to small. In this embodiment, the tags whose evaluation values are less than or equal to the fourth preset threshold are removed when the second queue is formed, so that the tags whose evaluation values are less than or equal to the fourth preset threshold are prevented from being selected as video tags, and the subsequent calculation amount is reduced.
In addition, the evaluation value of a tag is calculated as:

S = A_1 + A_2 + ... + A_T

where S denotes the evaluation value of the tag, T denotes the number of tag sets containing the tag, and A_l (l an integer, 1 ≤ l ≤ T) denotes the probability value of the tag in the l-th of those T tag sets. The present embodiment provides a calculation formula for the evaluation value of the tag.
In addition, the plurality of classification models are acquired in the following manner: M classifiers are respectively trained according to N training parameter sets and a plurality of dimension types each corresponding to at least one label, to obtain the plurality of classification models, wherein each training parameter set comprises a learning rate, a sample library and an iteration count, and N and M are integers greater than or equal to 1. This embodiment provides a specific implementation of obtaining a plurality of classification models that fuse multiple training parameters, multiple dimension types and different classifiers.
In addition, the manner of training any classifier with any training parameter set is: according to the sample labels of the plurality of sample video images in the sample library of the training parameter set, counting the number of sample video images corresponding to each label under the current dimension type, taking the maximum count as a reference value, and supplementing sample video images for each label up to that reference value; acquiring sample feature information corresponding to each sample video image; and training the classifier with the sample feature information according to the iteration count and learning rate in the training parameter set, to obtain the classification model corresponding to the classifier under the current dimension type. This embodiment provides a specific implementation of training a classifier with any training parameter set under one dimension type.
In addition, the M classifiers include at least one linear classifier and at least one nonlinear classifier. In this embodiment, the characteristics of the linear classifier suit the case where the feature information is highly similar to the label, while the characteristics of the nonlinear classifier suit feature information that is comparatively abstract.
In addition, the acquiring of the feature information corresponding to the video image includes: acquiring video text information corresponding to the video image, and removing preset information in the video text information; and converting a plurality of word segments in the video text information without the preset information into word segment vectors as characteristic information. The embodiment provides a specific implementation mode for acquiring the characteristic information corresponding to the video image.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a specific flow of a method for acquiring a video tag according to a first embodiment of the present invention;
fig. 2 is a detailed flowchart of a method for acquiring a video tag according to a second embodiment of the present invention;
FIG. 3 is a detailed flowchart of the training method of the classifier according to the third embodiment of the present invention;
fig. 4 is a schematic diagram of a participled text in a third embodiment according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application; however, the technical solution claimed in the present application can be implemented without these technical details, and various changes and modifications may be made based on the following embodiments.
A first embodiment of the present invention relates to a video tag acquisition method for automatically adding a video tag to a video image, such as a movie video, a television video, or the like.
Fig. 1 shows a specific flow of the method for acquiring a video tag according to this embodiment.
Step 101, responding to an input video image, and acquiring feature information corresponding to the video image.
Specifically, when a video image to which a video tag is to be added is input, video text information corresponding to the video image is acquired. The video text information may be preset information or information acquired in real time from other accessible websites or databases, and may include a brief introduction to the video, user comments, and the like. The video text information is the original corpus and is generally a long passage of text containing punctuation marks, so the preset information in the original corpus is removed first; the preset information consists of preset Chinese stop words, such as punctuation marks, modal particles, exclamations, transition words and other words unrelated to the text semantics. The original corpus with the preset information removed is then segmented into words, yielding a word-segmented text comprising a plurality of word segments, and each word segment is converted into a word-segment vector; the feature information corresponding to the video image comprises these word-segment vectors. The word segments in the word-segmented text can be converted into word-segment vectors by the Term Frequency-Inverse Document Frequency (TF-IDF) method, and a word-segment vector can represent the importance of each word segment to the word-segmented text.
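As an illustrative sketch only (not taken from the patent), the feature-extraction step described above could look roughly as follows in Python, assuming jieba for Chinese word segmentation and scikit-learn's TfidfVectorizer; the stop-word list and sample texts are placeholder assumptions.

```python
# Illustrative sketch: word segmentation with jieba and TF-IDF vectorization with
# scikit-learn; the stop-word list and sample texts are placeholder assumptions.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"的", "了", "啊", "，", "。", "！"}  # assumed preset Chinese stop words

def to_feature_text(raw_corpus: str) -> str:
    """Segment the raw corpus and drop the preset stop words / punctuation."""
    tokens = [w for w in jieba.lcut(raw_corpus) if w.strip() and w not in STOP_WORDS]
    return " ".join(tokens)

# Fit TF-IDF over the whole corpus of video texts so that IDF reflects how
# discriminative each word segment is; each video becomes one TF-IDF row vector.
corpus_texts = ["这是一部经典的警匪电影……", "用户评论：剧情紧凑，非常热血！"]  # placeholder texts
vectorizer = TfidfVectorizer()
feature_matrix = vectorizer.fit_transform([to_feature_text(t) for t in corpus_texts])
```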
And 102, respectively inputting the characteristic information into a plurality of different classification models to obtain a label set output by each classification model, wherein the label set comprises at least one label.
Specifically, each classification model corresponds to a plurality of labels, and the labels corresponding to different classification models may be the same or different. Taking any classification model as an example, when the feature information is input into the classification model, a probability value is obtained for each label; the higher the probability value, the more likely that label is a video tag of the video image. At least one label can therefore be selected from the corresponding labels to form a tag set, which is output.
And 103, selecting video tags of the video images from the plurality of tag sets according to the probability values of the tags in the tag sets.
Specifically, a tag may appear in several tag sets, so its probability values across those tag sets can be combined to determine whether it should serve as a video tag of the video image; in this way at least one tag can be selected from the plurality of tag sets and added as a video tag of the input video image.
Compared with the prior art, in this embodiment, in response to an input video image, the corresponding feature information is first obtained and input into a plurality of different classification models to obtain a plurality of tag sets output by those models, each tag set comprising at least one tag; the video tags of the video image are then determined from the plurality of tag sets according to the probability values of the tags in each tag set. Because the video tags are obtained by fusing a plurality of different classification models, classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
A second embodiment of the present invention relates to a method for acquiring a video tag, and the present embodiment is mainly different from the first embodiment in that: specific implementation modes of acquiring the tag set and selecting the video tags are provided.
A specific flow of the training process of the video tag acquisition method according to the present embodiment is shown in fig. 2.
Step 201, responding to an input video image, and acquiring feature information corresponding to the video image. This step is substantially the same as step 101 in the first embodiment, and will not be described herein again.
Step 202, comprising the following sub-steps:
substep 2021, inputting the feature information into each classification model, respectively, to obtain probability values of each label generated by each classification model.
In the substep 2022, for each classification model, according to the probability value of each generated label, a label meeting a first preset condition is selected, and the label meeting the first preset condition is added to the label set of the classification model.
Specifically, each classification model corresponds to a dimension type, the dimension type includes a plurality of labels, and the dimension type corresponding to each classification model may be the same or different. Any classification model is taken as an example for explanation:
when the characteristic information is input into the classification model, the probability value of each label can be obtained, then a plurality of labels corresponding to the classification model form a first queue according to the probability value from large to small, then the labels in the first queue are sequentially traversed until the difference value of the probability value of the current label minus the probability value of the next label is larger than a first preset threshold value, and the traversed label is used as the label meeting a first preset condition. For example, a plurality of labels corresponding to the classification model are arranged from large to small according to probability values to form a first queue, taking the example that the dimension type corresponding to the classification model includes X labels, aiRepresenting the probability value of the ith label, i is more than or equal to 1 and less than or equal to X, the first queue is (A)1,A2,A3,…,AX) (ii) a The method comprises the following steps that labels with probability values larger than or equal to a second preset threshold value O can be selected from X labels to form a first queue according to the probability values from large to small; and Y represents the number of the labels with the probability value larger than or equal to a second preset threshold value O in the X labels, namely the final first queue meets the following formula:
AY=max(A1,A2,...,AX)
st.AY≥O
and then traversing the tags in the first queue from the beginning until the difference value of the probability value of the current tag minus the probability value of the next tag is greater than a first preset threshold, and taking the traversed tags as the tags meeting a first preset condition. In particular, from the first queue AYBegins with the first tag in (1), calculates Ai-Ai+1If A is greater than a first predetermined threshold valuei-Ai+1Is less than or equal to the first threshold value, the ith label is compared with the ith label+1 label is taken as a label meeting a first preset condition until Ai-Ai+1Is greater than a first preset threshold value, only the ith label is taken as satisfying the ithA label with a preset condition is marked, and traversal is stopped; for example, when i is 5, a5-A6When the difference is greater than a first preset threshold, the 1 st to 5 th labels are all labels meeting a first preset condition, that is, the label set output by the classification model includes the 1 st to 5 th labels in the first queue.
In summary, for a plurality of classification models, the above process is repeated, and then the label set output by each classification model can be obtained.
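A minimal sketch of this first-queue selection for one classification model is given below, assuming the model's output is available as a dict of tag-to-probability values; the two threshold values are illustrative assumptions, not values given in the patent.

```python
# Sketch of selecting the tags meeting the first preset condition from one
# classification model's output; threshold values are illustrative assumptions.
from typing import Dict, List

def select_first_condition(probs: Dict[str, float],
                           first_threshold: float = 0.10,
                           second_threshold_o: float = 0.30) -> List[str]:
    # First queue: tags with probability value >= O, ordered from large to small.
    queue = sorted(((p, tag) for tag, p in probs.items() if p >= second_threshold_o),
                   reverse=True)
    selected: List[str] = []
    for i, (p, tag) in enumerate(queue):
        selected.append(tag)
        # Stop once the drop to the next tag exceeds the first preset threshold.
        if i + 1 < len(queue) and p - queue[i + 1][0] > first_threshold:
            break
    return selected

# Example: the output tag set is ['gunfight', 'police bandit'] because the gap
# after 'police bandit' (0.65 - 0.31) exceeds the first preset threshold.
tag_set = select_first_condition({"gunfight": 0.72, "police bandit": 0.65,
                                  "campus": 0.31, "idol": 0.05})
```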
Step 203, comprising the following substeps:
in sub-step 2031, for each tag in each tag set, an evaluation value of the tag is calculated according to the probability value of the tag in each tag set.
Specifically, taking the number of classification models in step 202 as Z, Z tag sets can be obtained:

(B_1, B_2, ..., B_Z)

where the number of tags contained in each tag set B_q is less than the number X of tags contained in the dimension type corresponding to that tag set.

Taking one tag m in any tag set as an example, the probability values of tag m in the Z tag sets are obtained, and T (1 ≤ T ≤ Z) denotes the number of tag sets containing tag m. The evaluation value S_m of tag m is then calculated as:

S_m = A_1^m + A_2^m + ... + A_T^m

where A_l^m denotes the probability value of tag m in the l-th of the T tag sets containing tag m.
In summary, the evaluation values of all tags in the Z tag set may be acquired, where the same tag only needs to be calculated once.
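A rough sketch of sub-step 2031 follows, assuming the tag sets are given as dicts mapping tags to probability values; the evaluation value is computed here as the sum of a tag's probability values over the tag sets containing it, which is one reading of the formula above.

```python
# Sketch of sub-step 2031: accumulate each tag's evaluation value over the Z tag
# sets, so tags that appear in more tag sets (more "votes") score higher.
from typing import Dict, List

def evaluation_values(tag_sets: List[Dict[str, float]]) -> Dict[str, float]:
    scores: Dict[str, float] = {}
    for tag_set in tag_sets:                  # the Z tag sets, one per classification model
        for tag, prob in tag_set.items():
            # S_m accumulates A_l^m over the T tag sets that contain tag m.
            scores[tag] = scores.get(tag, 0.0) + prob
    return scores
```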
Substep 2032, selecting tags meeting a second preset condition from the plurality of tag sets according to the evaluation values of the tags in the tag sets, and using the tags meeting the second preset condition as video tags.
Specifically, all tags in the plurality of tag sets may be arranged by evaluation value from large to small to form the second queue; in one example, only the tags whose evaluation values are greater than a fourth preset threshold are arranged from large to small. The tags in the second queue are then traversed in sequence until the evaluation value of the current tag minus the evaluation value of the next tag is greater than a third preset threshold, and the tags traversed so far are taken as the tags satisfying the second preset condition. For example, suppose the plurality of tag sets contain P different tags whose evaluation values, arranged from large to small, are (S_1, S_2, S_3, ..., S_P). The tags whose evaluation value is greater than or equal to the fourth preset threshold H are selected, and U denotes the number of such tags among the P tags; a larger evaluation value means the tag occupies more tag sets and therefore receives more votes. The final second queue S_U satisfies:

S_U = max(S_1, S_2, ..., S_P), subject to S_U ≥ H

The tags in the second queue are then traversed from the beginning until the evaluation value of the current tag minus the evaluation value of the next tag is greater than the third preset threshold, and the tags traversed so far are taken as the tags meeting the second preset condition. Specifically, starting from the first tag in the second queue S_U, the difference S_i - S_(i+1) is calculated and compared with the third preset threshold: if S_i - S_(i+1) is less than or equal to the third preset threshold, both the i-th and the (i+1)-th tags are taken as tags meeting the second preset condition; once S_i - S_(i+1) is greater than the third preset threshold, only the tags up to the i-th are kept and traversal stops. For example, when i = 3 and S_3 - S_4 is greater than the third preset threshold, the 1st to 3rd tags are all tags meeting the second preset condition, that is, the video tags of the video image include the 1st to 3rd tags in the second queue.
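Continuing the sketch for sub-step 2032 with illustrative threshold values standing in for the third and fourth preset thresholds:

```python
# Sketch of sub-step 2032: order tags by evaluation value, keep only those above
# the fourth preset threshold H, and cut the queue at the first gap larger than
# the third preset threshold. Threshold values are illustrative assumptions.
from typing import Dict, List

def select_video_tags(scores: Dict[str, float],
                      third_threshold: float = 0.20,
                      fourth_threshold_h: float = 0.50) -> List[str]:
    # Second queue: tags whose evaluation value exceeds H, from large to small.
    queue = sorted(((s, tag) for tag, s in scores.items() if s > fourth_threshold_h),
                   reverse=True)
    video_tags: List[str] = []
    for i, (s, tag) in enumerate(queue):
        video_tags.append(tag)
        if i + 1 < len(queue) and s - queue[i + 1][0] > third_threshold:
            break
    return video_tags
```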
Compared with the first embodiment, this embodiment provides a specific implementation of acquiring the tag sets and selecting the video tags. The evaluation value of each tag is calculated based on the number of tag sets that contain it; the more tag sets a tag appears in, the more votes it receives, so the video tags of the video image are selected by voting and the acquired video tags are more accurate.
A third embodiment of the present invention relates to a method for acquiring a video tag, and the present embodiment is mainly different from the first embodiment in that: a plurality of training parameters, a plurality of dimensionality types and different classifiers are fused in the plurality of classification models.
In this embodiment, the plurality of classification models are obtained as follows: M classifiers are respectively trained according to N training parameter sets and a plurality of dimension types each corresponding to at least one label, to obtain the plurality of classification models, wherein each training parameter set comprises a learning rate, a sample library and an iteration count, and N and M are integers greater than or equal to 1. That is, there are N training parameter sets; when a classifier is trained, it is trained with the N training parameter sets under each dimension type, so N classification models can be obtained for that classifier. For M classifiers, M x N classification models are obtained per dimension type, and with K dimension types, K x M x N classification models are obtained after training. Each training parameter set comprises a learning rate, a sample library and an iteration count, and the N training parameter sets differ in at least one of these.
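A self-contained sketch of this K x M x N combination is given below, using scikit-learn's SVC and MLPClassifier purely as stand-ins for the patent's M classifiers; the dimension types, training parameter values and the sample-loading interface are assumptions, with example dimension types listed after the code.

```python
# Sketch of the K x M x N training grid; SVC and MLPClassifier stand in for the
# patent's classifiers, and the sample-loading interface is an assumption.
from itertools import product
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def train_all_models(samples, dimension_types, training_params):
    """samples[(dim, library)] -> (X, y); returns the K * M * N fitted models."""
    factories = {
        "svm": lambda p: SVC(kernel="rbf", probability=True),
        "nn": lambda p: MLPClassifier(learning_rate_init=p["learning_rate"],
                                      max_iter=p["iterations"]),
    }
    models = {}
    for dim, (name, make_clf), (j, params) in product(
            dimension_types, factories.items(), enumerate(training_params)):
        X, y = samples[(dim, params["sample_library"])]
        models[(dim, name, j)] = make_clf(params).fit(X, y)
    return models

# Example training-parameter sets (assumed values): each fixes a learning rate,
# a sample library and an iteration count, as described above.
training_params = [
    {"learning_rate": 1e-3, "sample_library": "library_a", "iterations": 50},
    {"learning_rate": 1e-4, "sample_library": "library_b", "iterations": 100},
]
```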
For example, the dimension types are 3, respectively: topics, content and features; each dimension type corresponds to a plurality of tags, as follows:
the story dimension types correspond to 32 tags, including: classic, art, animation, family, tragedy, laugh, inspirational, magic, mythology, female, antique, child, youth, police bandit, black slope, gunfight, military, reasoning, campus, idol, youth, psychology, history, poetry, fairy tale, animal, highway, theme, sports (sports), super nature, traversing, ethics, politics, little girl video, ethnicity.
The content dimension type contains 39 tags, including: growth, hommization, dream, kill, revenge, high-intelligence quotient, racing car, death, first love, dark love, handful, space, Beijing, Shanghai, familiarity, friendship, brother, derailment, multi-angular love, triangular love, off-site love, homosexual love, Master and student love, sibling, newspaper, end, change, workplace, law, medical treatment, uterine fighting, combat, spy, father and daughter, mother and daughter, sister and sister.
The feature dimension type contains 16 tags in total, including: handwriting, temperature (feeling), lacrimation, hot blood, anepithymia, basic emotion, romance, black humor, pure love, abuse, temperament, heavy taste, violence aesthetics, gothic wind, healing, and brain burning.
With reference to fig. 3, in any dimension type, the way of training any classifier by using any training parameter is as follows:
step 301, according to the sample labels of the plurality of sample video images in the sample library of the training parameters, counting the number value of the sample video image corresponding to each label under the current dimension type, taking the maximum number value as a reference value, and adding the sample video image corresponding to each label.
Specifically, the sample library comprises a plurality of sample video images added with sample labels, and for each label corresponding to the current dimension type, the sample video image corresponding to the sample label matched with the label is searched, so that the quantity value of the sample video image corresponding to the label can be obtained; repeating the process to obtain the number value of the sample video images corresponding to each label under the current dimension type, and searching the sample video images corresponding to each label from other sample libraries or accessed websites by taking the maximum number value as a reference value to add the sample video images into the sample libraries until the number value of the sample video images corresponding to each label reaches the reference value.
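A simplified sketch of this balancing step follows; `labels_of` and `fetch_extra_samples` are hypothetical helpers standing in for the sample library lookup and for searching other sample libraries or accessed websites.

```python
# Sketch of step 301: count samples per label, take the maximum count as the
# reference value, and top up under-represented labels. The helper functions are
# hypothetical placeholders, not part of the patent.
from collections import Counter

def balance_sample_library(samples, labels_of, fetch_extra_samples):
    """samples: list of sample video ids; labels_of(s): sample labels of sample s."""
    counts = Counter(label for s in samples for label in labels_of(s))
    reference = max(counts.values())          # largest per-label count = reference value
    for label, count in counts.items():
        if count < reference:
            # Top the label up until its sample count reaches the reference value.
            samples.extend(fetch_extra_samples(label, reference - count))
    return samples
```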
Step 302, sample characteristic information corresponding to each sample video image is obtained.
Specifically, video text information corresponding to the sample video image is acquired. The video text information may be preset information or information acquired in real time from other accessible websites or databases, and may include a brief introduction to the video, user comments, and the like. The video text information is the original corpus and is generally a long passage of text containing punctuation marks, so the preset information in the original corpus is removed first; the preset information consists of preset Chinese stop words, such as punctuation marks, modal particles, exclamations, transition words and other words unrelated to the text semantics. The original corpus with the preset information removed is then segmented into words; referring to fig. 4, a word-segmented text comprising a plurality of word segments is obtained, and each word segment is converted into a word-segment vector, the sample feature information comprising these word-segment vectors. The word segments in the word-segmented text can be converted into word-segment vectors by the Term Frequency-Inverse Document Frequency (TF-IDF) method, and a word-segment vector can represent the importance of each word segment to the word-segmented text.
And 303, training the classifier by using the characteristic information of each sample according to the iteration times and the learning rate in the training parameters to obtain a classification model corresponding to the classifier under the current dimension type.
Specifically, parameters of the classifier are set according to the number of iterations in the training parameters and the learning rate, and then a plurality of sample feature information are input into the classifier to train the classifier, so that a classification model corresponding to the classifier under the current dimension type can be obtained.
In one example, the M classifiers include at least one linear classifier and at least one nonlinear classifier, so that the characteristics of the linear classifier can handle the case where the word segments are highly similar to the labels, while the characteristics of the nonlinear classifier can handle word segments that are comparatively abstract. The linear classifier is, for example, an SVM classifier, and the nonlinear classifier is, for example, a CNN classifier. For the SVM classifier, an RBF kernel and a one-versus-one multi-classification method can be adopted, and a grid search can be used during training to optimize the penalty parameter and the bias parameter so as to obtain a better classification model; for the CNN classifier, a softmax layer can be used to predict the label and the cross-entropy method can be used to calculate the loss function, so as to obtain a better classification model.
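For the SVM branch, a grid search of this kind might look as follows with scikit-learn, where the parameter grid and the reading of the bias parameter as the RBF kernel coefficient gamma are assumptions.

```python
# Possible shape of the SVM branch: RBF kernel, one-vs-one multi-classification,
# and a grid search over the penalty parameter C and the kernel coefficient gamma
# (taken here as the "bias" parameter); the parameter grid is an assumption.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(X, y):
    grid = GridSearchCV(
        SVC(kernel="rbf", decision_function_shape="ovo", probability=True),
        param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
        cv=3,
    )
    grid.fit(X, y)
    return grid.best_estimator_   # the classification model for the current dimension type
```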
Compared with the first embodiment, the present embodiment provides a specific implementation manner for acquiring multiple classification models that merge multiple training parameters, multiple dimension types, and different classifiers.
A fourth embodiment of the present invention relates to an apparatus for acquiring a video tag, which is applied to an electronic device, such as a computer or a server. The video tag acquisition device is used for automatically adding a video tag to a video image, such as a movie video, a television video and the like.
In this embodiment, the apparatus for acquiring a video tag includes at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the video tag acquisition method as in the first to third embodiments.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
Compared with the prior art, in this embodiment, in response to an input video image, the corresponding feature information is first obtained and input into a plurality of different classification models to obtain a plurality of tag sets output by those models, each tag set comprising at least one tag; the video tags of the video image are then determined from the plurality of tag sets according to the probability values of the tags in each tag set. Because the video tags are obtained by fusing a plurality of different classification models, classification models of multiple dimension types can be selected to adapt to different types of video images; the obtained video tags are therefore more accurate, and the robustness of automatically adding video tags is improved.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method for acquiring a video tag is characterized by comprising the following steps:
responding to an input video image, and acquiring characteristic information corresponding to the video image;
respectively inputting the characteristic information into a plurality of different classification models to obtain a plurality of label sets output by the classification models, wherein each label set comprises at least one label;
and selecting the video label of the video image from the plurality of label sets according to the probability value of the label in each label set.
2. The method for acquiring video tags according to claim 1, wherein said inputting the feature information into a plurality of different classification models respectively to obtain a plurality of tag sets output by the plurality of classification models comprises:
inputting the characteristic information into each classification model respectively to obtain probability values of the labels generated by each classification model;
and for each classification model, selecting the labels meeting a first preset condition according to the generated probability value of each label, and adding the labels meeting the first preset condition into a label set of the classification model.
3. The method for acquiring video tags according to claim 2, wherein the selecting the tags meeting a first preset condition according to the generated probability values of the tags comprises:
forming, from large to small by probability value, a first queue of the labels whose probability values corresponding to the classification model are greater than or equal to a second preset threshold;
and traversing the tags in the first queue in sequence until the difference value of the probability value of the current tag minus the probability value of the next tag is greater than a first preset threshold value, and taking the traversed tags as the tags meeting a first preset condition.
4. The method as claimed in claim 1, wherein the selecting the video tag of the video image from the plurality of tag sets according to the probability value of the tag in each tag set comprises:
for each label in each label set, calculating to obtain an evaluation value of the label according to the probability value of the label in each label set;
and selecting the tags meeting a second preset condition from the plurality of tag sets according to the evaluation values of the tags in the plurality of tag sets, and taking the tags meeting the second preset condition as the video tags.
5. The method according to claim 4, wherein the selecting, from the plurality of tag sets, the tags satisfying a second preset condition according to the evaluation values of the tags in the plurality of tag sets includes:
forming a second queue by the tags in the plurality of tag sets according to the evaluation values from large to small;
and traversing the tags in the second queue in sequence until the difference value of the evaluation value of the current tag minus the evaluation value of the next tag is greater than a third preset threshold, and taking the traversed tags as the tags meeting a second preset condition.
6. The method for acquiring a video tag according to claim 5, wherein the forming a second queue of the tags in the plurality of tag sets according to evaluation values from large to small specifically includes:
and forming a second queue by the tags with the tag centralized evaluation values larger than a fourth preset threshold value according to the evaluation values from large to small.
7. The method of claim 5, wherein the evaluation value of the tag is calculated by the formula:
S = A_1 + A_2 + ... + A_T
wherein S represents the evaluation value of the tag, T represents the number of the tag sets containing the tag, and A_l represents the probability value of the tag in the l-th of the T tag sets, l being an integer with 1 ≤ l ≤ T.
8. The method for acquiring video tags according to claim 1, wherein the plurality of classification models are acquired in a manner that:
and respectively training the M classifiers according to the N training parameters and a plurality of dimension types corresponding to at least one label to obtain a plurality of classification models, wherein each training parameter comprises a learning rate, a sample library and iteration times, and N, M is an integer greater than or equal to 1.
9. The method of claim 8, wherein the M classifiers comprise: at least one linear classifier and at least one non-linear classifier.
10. An apparatus for acquiring a video tag, comprising: at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of video tag acquisition of any one of claims 1 to 9.
CN202010088404.6A 2020-02-12 2020-02-12 Video tag acquisition method and device Active CN111291688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010088404.6A CN111291688B (en) 2020-02-12 2020-02-12 Video tag acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010088404.6A CN111291688B (en) 2020-02-12 2020-02-12 Video tag acquisition method and device

Publications (2)

Publication Number Publication Date
CN111291688A true CN111291688A (en) 2020-06-16
CN111291688B CN111291688B (en) 2023-07-14

Family

ID=71021331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088404.6A Active CN111291688B (en) 2020-02-12 2020-02-12 Video tag acquisition method and device

Country Status (1)

Country Link
CN (1) CN111291688B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708913A (en) * 2020-08-19 2020-09-25 腾讯科技(深圳)有限公司 Label generation method and device and computer readable storage medium
CN111741330A (en) * 2020-07-17 2020-10-02 腾讯科技(深圳)有限公司 Video content evaluation method and device, storage medium and computer equipment
CN111950360A (en) * 2020-07-06 2020-11-17 北京奇艺世纪科技有限公司 Method and device for identifying infringing user
CN112699945A (en) * 2020-12-31 2021-04-23 青岛海尔科技有限公司 Data labeling method and device, storage medium and electronic device
CN114466251A (en) * 2022-04-08 2022-05-10 深圳市致尚信息技术有限公司 Video-based classification label mark processing method and system

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809218A (en) * 2015-04-30 2015-07-29 北京奇艺世纪科技有限公司 UGC (User Generated Content) video classification method and device
CN104992184A (en) * 2015-07-02 2015-10-21 东南大学 Multiclass image classification method based on semi-supervised extreme learning machine
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
US20180181843A1 (en) * 2016-12-28 2018-06-28 Ancestry.Com Operations Inc. Clustering historical images using a convolutional neural net and labeled data bootstrapping
CN108664989A (en) * 2018-03-27 2018-10-16 北京达佳互联信息技术有限公司 Image tag determines method, apparatus and terminal
CN108694217A (en) * 2017-04-12 2018-10-23 合信息技术(北京)有限公司 The label of video determines method and device
CN108764371A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109271521A (en) * 2018-11-16 2019-01-25 北京九狐时代智能科技有限公司 A kind of file classification method and device
CN109409414A (en) * 2018-09-28 2019-03-01 北京达佳互联信息技术有限公司 Sample image determines method and apparatus, electronic equipment and storage medium
CN109740018A (en) * 2019-01-29 2019-05-10 北京字节跳动网络技术有限公司 Method and apparatus for generating video tab model
CN109815365A (en) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN109947989A (en) * 2019-03-18 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN110287372A (en) * 2019-06-26 2019-09-27 广州市百果园信息技术有限公司 Label for negative-feedback determines method, video recommendation method and its device
CN110348367A (en) * 2019-07-08 2019-10-18 北京字节跳动网络技术有限公司 Video classification methods, method for processing video frequency, device, mobile terminal and medium
CN110458245A (en) * 2019-08-20 2019-11-15 图谱未来(南京)人工智能研究院有限公司 A kind of multi-tag disaggregated model training method, data processing method and device
CN110598011A (en) * 2019-09-27 2019-12-20 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and readable storage medium

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809218A (en) * 2015-04-30 2015-07-29 北京奇艺世纪科技有限公司 UGC (User Generated Content) video classification method and device
CN104992184A (en) * 2015-07-02 2015-10-21 东南大学 Multiclass image classification method based on semi-supervised extreme learning machine
US20180181843A1 (en) * 2016-12-28 2018-06-28 Ancestry.Com Operations Inc. Clustering historical images using a convolutional neural net and labeled data bootstrapping
CN108694217A (en) * 2017-04-12 2018-10-23 合信息技术(北京)有限公司 The label of video determines method and device
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
CN108664989A (en) * 2018-03-27 2018-10-16 北京达佳互联信息技术有限公司 Image tag determines method, apparatus and terminal
CN108764371A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109409414A (en) * 2018-09-28 2019-03-01 北京达佳互联信息技术有限公司 Sample image determines method and apparatus, electronic equipment and storage medium
CN109271521A (en) * 2018-11-16 2019-01-25 北京九狐时代智能科技有限公司 A kind of file classification method and device
CN109740018A (en) * 2019-01-29 2019-05-10 北京字节跳动网络技术有限公司 Method and apparatus for generating video tab model
CN109815365A (en) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN109947989A (en) * 2019-03-18 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN110287372A (en) * 2019-06-26 2019-09-27 广州市百果园信息技术有限公司 Label for negative-feedback determines method, video recommendation method and its device
CN110348367A (en) * 2019-07-08 2019-10-18 北京字节跳动网络技术有限公司 Video classification methods, method for processing video frequency, device, mobile terminal and medium
CN110458245A (en) * 2019-08-20 2019-11-15 图谱未来(南京)人工智能研究院有限公司 A kind of multi-tag disaggregated model training method, data processing method and device
CN110598011A (en) * 2019-09-27 2019-12-20 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
TIMOTHY RUBIN et al.: "Statistical Topic Models for Multi-Label Document Classification", arXiv:1107.2462v2 *
吴雨希: "Research on video tag generation and video classification based on text mining", China Master's Theses Full-text Database, Information Science and Technology *
肖谦: "Research on human action recognition in depth video based on spatio-temporal interest points", China Master's Theses Full-text Database, Information Science and Technology *
赵丽: "Research and implementation of a multi-semantic nonlinear agricultural consulting video retrieval system", China Master's Theses Full-text Database, Information Science and Technology *
钟岑岑: "Research on context-based audio and video annotation", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950360A (en) * 2020-07-06 2020-11-17 北京奇艺世纪科技有限公司 Method and device for identifying infringing user
CN111950360B (en) * 2020-07-06 2023-08-18 北京奇艺世纪科技有限公司 Method and device for identifying infringement user
CN111741330A (en) * 2020-07-17 2020-10-02 腾讯科技(深圳)有限公司 Video content evaluation method and device, storage medium and computer equipment
CN111741330B (en) * 2020-07-17 2024-01-30 腾讯科技(深圳)有限公司 Video content evaluation method and device, storage medium and computer equipment
CN111708913A (en) * 2020-08-19 2020-09-25 腾讯科技(深圳)有限公司 Label generation method and device and computer readable storage medium
CN112699945A (en) * 2020-12-31 2021-04-23 青岛海尔科技有限公司 Data labeling method and device, storage medium and electronic device
CN112699945B (en) * 2020-12-31 2023-10-27 青岛海尔科技有限公司 Data labeling method and device, storage medium and electronic device
CN114466251A (en) * 2022-04-08 2022-05-10 深圳市致尚信息技术有限公司 Video-based classification label mark processing method and system

Also Published As

Publication number Publication date
CN111291688B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
Wu et al. Cycle-consistent deep generative hashing for cross-modal retrieval
CN111291688A (en) Video tag obtaining method and device
CN106973244B (en) Method and system for automatically generating image captions using weak supervision data
CN108875074B (en) Answer selection method and device based on cross attention neural network and electronic equipment
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
Wu et al. Learning of multimodal representations with random walks on the click graph
CN111741330A (en) Video content evaluation method and device, storage medium and computer equipment
CN112307351A (en) Model training and recommending method, device and equipment for user behavior
CN113627447A (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN112464100B (en) Information recommendation model training method, information recommendation method, device and equipment
Zhou A novel movies recommendation algorithm based on reinforcement learning with DDPG policy
Ma et al. Topic-based algorithm for multilabel learning with missing labels
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
US11983183B2 (en) Techniques for training machine learning models using actor data
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection
Das A multimodal approach to sarcasm detection on social media
CN113065027A (en) Video recommendation method and device, electronic equipment and storage medium
Peng et al. Quintuple-media joint correlation learning with deep compression and regularization
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN113822065A (en) Keyword recall method and device, electronic equipment and storage medium
CN112269877A (en) Data labeling method and device
Xiong et al. An intelligent film recommender system based on emotional analysis
CN113743050B (en) Article layout evaluation method, apparatus, electronic device and storage medium
Nurhasanah et al. Fine-grained object recognition using a combination model of navigator–teacher–scrutinizer and spinal networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant