CN115359402A - Video labeling method and device, equipment, medium and product thereof - Google Patents

Video labeling method and device, equipment, medium and product thereof

Info

Publication number
CN115359402A
Authority
CN
China
Prior art keywords
label
video
category label
category
word frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211028339.3A
Other languages
Chinese (zh)
Inventor
李益永
杨晓斌
陈德健
项伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd filed Critical Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202211028339.3A
Publication of CN115359402A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a video labeling method, and a device, equipment, medium and product thereof. The method comprises the following steps: acquiring media information of a video to be labeled, wherein the media information comprises image data and a description text of the video to be labeled; determining, with a video classification model, the confidence with which the image data maps to each category label in a category label set, and determining category labels whose confidence exceeds a preset confidence threshold as target category labels; taking category labels whose confidence does not reach the confidence threshold as pending category labels, and checking, based on the description text, whether a pending category label is a target category label; and labeling the video to be labeled with the target category labels. Because the category labels identified with low confidence by the video classification model are verified against the description text of the video to be labeled, the accuracy of video labeling can be improved, the long-tail effect caused by insufficient training samples for some category labels can be overcome, and the utilization efficiency of video information resources can be improved.

Description

Video labeling method, device, equipment, medium and product
Technical Field
The present application relates to video information processing technologies, and in particular, to a video annotation method, and an apparatus, a device, a medium, and a product thereof.
Background
Users and the videos they upload are the two basic elements of a short-video service. Users generate behaviors such as watching, commenting, liking and sharing on videos, and they also generate behaviors such as following and privately messaging other users. The goals of a short-video platform are: 1. to make the associations between users tighter so that users feel more engaged; 2. to let users obtain the video content they are interested in, satisfying their needs and further producing follow behaviors between users, thereby improving engagement.
Which videos are distributed to a user matters greatly for user retention, so the interest categories of both users and videos need to be identified. Since the interest categories of users are essentially derived from the labels of the videos they upload, watch, comment on, like and share, labeling the interest categories of videos is the foundation.
Because of the requirements of fine-grained operation, videos are usually labeled with multi-level labels, and the number of labels runs into the hundreds. In practice, labeling videos raises several problems, for example:
Firstly, relying only on a video classification model makes labeling extremely difficult. Training a video classification model typically requires training samples on the order of millions just to obtain a classification capability of a few hundred categories; if the labels need to be extended, the training samples must be annotated again at very high cost. Moreover, the accuracy of the video classification model is limited, and the classification accuracy drops sharply as the number of categories keeps growing.
Secondly, video classification exhibits a long-tail effect: some categories have few training samples, and the lack of sufficient training samples during training prevents the video classification model from accurately recognizing videos of those categories, so accurate classification cannot be achieved.
Finally, videos uploaded by users carry a large amount of associated information of uneven quality. This information is usually skipped by conventional labeling approaches, and no effective data mining is performed on it, so useful information resources are wasted.
Disclosure of Invention
The present application aims to solve the above problems and provide a video annotation method, and corresponding apparatus, device, non-volatile readable storage medium, and computer program product.
According to an aspect of the present application, there is provided a video annotation method, including the steps of:
acquiring media information of a video to be labeled, wherein the media information comprises image data and a description text of the video to be labeled;
determining, with a video classification model, the confidence with which the image data maps to each category label in a category label set, and determining category labels whose confidence exceeds a preset confidence threshold as target category labels;
taking category labels whose confidence does not reach the confidence threshold as pending category labels, and checking, based on the description text, whether a pending category label is a target category label;
and labeling the video to be labeled with the target category labels.
According to another aspect of the present application, there is provided a video annotation apparatus, including:
an information acquisition module configured to acquire media information of a video to be labeled, wherein the media information comprises image data and a description text of the video to be labeled;
a default classification module configured to determine, with a video classification model, the confidence with which the image data maps to each category label in a category label set, and to determine category labels whose confidence exceeds a preset confidence threshold as target category labels;
an auxiliary checking module configured to take category labels whose confidence does not reach the confidence threshold as pending category labels and to check, based on the description text, whether a pending category label is a target category label;
and a video labeling module configured to label the video to be labeled with the target category labels.
According to another aspect of the present application, there is provided a video annotation device, comprising a central processing unit and a memory, wherein the central processing unit is configured to invoke and run a computer program stored in the memory to perform the steps of the video annotation method described herein.
According to another aspect of the present application, a non-transitory readable storage medium is provided, which stores, in the form of computer-readable instructions, a computer program implementing the video annotation method; when the computer program is invoked by a computer, it executes the steps included in the method.
According to another aspect of the present application, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method described in any one of the embodiments of the present application.
Compared with the prior art, the application has various technical advantages, including but not limited to:
Firstly, after the video classification model predicts the category labels corresponding to the video to be labeled, the category labels with higher confidence are directly determined as target category labels of the video to be labeled. For the category labels with lower confidence, the description text in the media information of the video to be labeled is used as reference information to check them and determine whether they belong to the target category labels, so that the accuracy of video labeling based on the video classification model is improved without incurring high additional training cost.
Secondly, because the description text in the media information is usually used to introduce the content of the video, verifying the low-confidence category labels predicted by the video classification model against the description text can effectively overcome the long-tail effect: for category labels that the video classification model cannot distinguish well due to a lack of sufficient training samples, the recognition capability is improved with the assistance of the description text.
Finally, the description text of a video often contains information such as the content, value and characteristics of the video. By mining this description text in the process of labeling the video, the utilization value of information resources can be improved and waste of information resources avoided.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic diagram of a network architecture corresponding to a video service applied in the present application;
FIG. 2 is a flowchart illustrating a video annotation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary network architecture of a video classification model employed in the present application;
FIG. 4 is a schematic diagram of another exemplary network architecture for a video classification model employed in the present application;
FIG. 5 is a schematic diagram of yet another exemplary network architecture for a video classification model employed in the present application;
FIG. 6 is a schematic flowchart illustrating a process of determining a target class label according to image data by a video classification model according to an embodiment of the present application;
fig. 7 is a schematic flow chart illustrating the process of verifying a pending category label based on word frequency statistics in an embodiment of the present application;
fig. 8 is a schematic flowchart of constructing the word frequency statistical table corresponding to each category label in an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating the process of verifying a pending category label using a text classification model according to an embodiment of the present application;
FIG. 10 is a schematic flowchart illustrating further applying the target category label of the video to be annotated according to the embodiment of the present application;
FIG. 11 is a schematic block diagram of a video annotation device of the present application;
fig. 12 is a schematic structural diagram of a video annotation apparatus used in the present application.
Detailed Description
The models referred to, or possibly referred to, in this application include traditional machine learning models and deep learning models. Unless explicitly stated otherwise, they can be deployed on a remote server and called remotely by a client, or deployed on a client with sufficient device capability for direct invocation.
Referring to fig. 1, the network architecture adopted by an exemplary application scenario of the present application may be used to deploy a video service in which users upload various videos and videos carrying corresponding category tags are pushed to users based on the category tags representing their interests. The videos may be short videos (also called small videos) with short playing times, longer videos such as movies or television series, or live videos of various durations generated by users' live broadcasts. The application server 81 shown in fig. 1 may be used to support the operation of the video service, while the media server 82 may be used to store or forward users' videos. Terminal devices such as the computer 83 and the mobile phone 84, acting as clients, are generally provided for end users and may be used to upload or download videos. The method or apparatus of the present application may be executed in the application server 81, the media server 82, or any other computer device that can access the videos, so as to implement the retrieval and annotation of the videos.
Referring to fig. 2, in an embodiment of a video annotation method according to the present application, the method includes the following steps:
step S1100, media information of a video to be annotated is obtained, wherein the media information comprises image data and a description text of the video to be annotated;
videos in a video service are usually stored in a server in the form of video files, and each video file is associated with one another, and a description text for describing information such as content, value, characteristics, and the like of the video file is also stored, and the description text is generally set by a user in an associated manner. Each video is usually provided with a unique feature identifier, so that the corresponding video is called through the unique feature information, including calling a video file and a description text thereof and the like. For each video uploaded by the user, it is also typically associated with a corresponding user identification to identify the user identity of each video.
The video marking method is suitable for the video marking system, the category label set can be set according to a certain classification standard, and the category labels in the category label set are adopted to mark the video, namely marking. When marking the videos, determining one video as the video to be marked, reasoning and determining corresponding target category labels in the category label set according to the media information of the video to be marked, and associating the target category labels with the video to be marked so as to realize marking operation of the video to be marked.
In one embodiment, the media information of the video to be annotated at least includes image data and description text of the video to be annotated. The image data of the video to be annotated can be corresponding image data extracted from a video file of the video to be annotated, such as a plurality of image frames therein. And the description text of the video to be marked is associated with the corresponding description text of the corresponding video file. In other embodiments that are not described in detail, the media information may further include other associated information of the video to be labeled, for example, a feature identifier of a user who uploads the video to be labeled, a playing time of the video to be labeled, and the like, which may provide reference value for an inference process of the category label.
The timing for acquiring the media information of the video to be marked can be automatically triggered after the user finishes uploading the video to be marked in one embodiment, and can be triggered by a timing task preset by a video service or an active request of the user for uploading the video to be marked in another embodiment.
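For concreteness, the media information prepared in step S1100 can be pictured as a simple record like the following Python sketch. This is purely illustrative and not part of the patent text; the field names (video_id, image_frames, description_text and the optional extras) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class MediaInfo:
    """Media information gathered for one video awaiting labeling (illustrative)."""
    video_id: str                       # unique feature identifier of the video
    image_frames: List[np.ndarray]      # image frames decoded from the video file
    description_text: str               # user-supplied description associated with the video
    uploader_id: Optional[str] = None   # optional associated information
    duration_seconds: Optional[float] = None
```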
Step S1200, determining confidence degrees of the image data mapped to the various category labels of the category label set by adopting a video classification model, and determining the category labels with the confidence degrees exceeding a preset confidence degree threshold value as target category labels;
after the media information of the video to be labeled is obtained, a video classification model can be adopted, part or all of the media information is selected to construct input of the video classification model according to the reference requirement of the video classification model, the video classification model carries out reasoning according to the input information, and the confidence degree corresponding to each class label of the information mapped to a preset class label set is predicted, so that the class label to be labeled of the video to be labeled can be identified according to the confidence degree of each class label.
The video classification model is usually implemented by a convolution-based neural network model, deep semantic information of input information is extracted, then the deep semantic information is mapped to a classification space corresponding to the category label set in a full-connection mode, the categories set in the classification space correspond to the category labels in the category label set one by one, and accordingly confidence degrees of mapping the deep semantic information to the category labels are obtained. It is understood that the video classification model is trained to a convergence state by using enough corresponding training samples in advance, so that the video classification model learns the capability of reasoning and determining the confidence degrees of all the class labels according to corresponding input, and then is used in the technical scheme of the application.
In one embodiment, the video classification model is configured as a single-modal video classification model that relies only on the image data in the media information of the video to be labeled to infer category labels. As shown in fig. 3, a convolutional neural network extracts deep semantic information from the image data, specifically from a number of image frames, and a classifier then performs classification mapping to determine the confidence of each category label as the classification result. A video classification model implemented on a single modality has a relatively low computational load, controllable training cost and high training efficiency.
In another embodiment, as shown in fig. 4, the video classification model is configured as a multi-modal video classification model. It receives the image data in the media information as one input path and extracts its image feature information with a convolutional neural network, receives the description text in the media information as another input path and extracts its text feature information with a text feature extractor, splices the image feature information and the text feature information into deep semantic information through a splicing layer, and finally performs classification mapping with a classifier to obtain the classification result, determining the confidence of each category label. Because the multiple information paths relied on for inference reference one another, a multi-modal video classification model can be expected to achieve higher prediction accuracy.
In yet another embodiment, when the media information includes other associated information besides the description text, that associated information can also be used as an extension of the description text: it is merged into the description text, encoded uniformly to construct the input of the multi-modal video classification model, and processed by that model accordingly.
In a further embodiment, considering that the image data of the video to be labeled contains a number of image frames that follow a temporal order and carry contextual associations, a recurrent neural network can be attached after the convolutional neural network used in the above embodiments in order to exploit this context. The recurrent neural network performs contextual modeling on the image feature information output by the convolutional neural network, or on the combined feature information obtained after splicing it with the description-text features output by the text feature extractor, to obtain the corresponding deep semantic information. As shown in fig. 5, this deep semantic information is then input into the classifier for classification mapping, so that the context in the image data and the description text is further taken into account and the prediction accuracy of the video classification model is improved.
In the above embodiments, the convolutional neural network (CNN) used to extract features from the image data may be any neural network model suitable for processing image features that is developed on the basis of convolutional neural networks, such as a ResNet (residual network) model. The text feature extractor used to extract features from the description text may be implemented on the basis of a recurrent neural network (RNN); the recurrent neural network described in the present application may likewise be any neural network suitable for extracting text features that is developed on the basis of recurrent neural networks, such as an LSTM (Long Short-Term Memory) network or a Transformer-series encoder.
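As a rough illustration of the two-branch architecture of fig. 4, the following PyTorch sketch combines a CNN image branch with an LSTM text branch and a concatenation (splicing) layer before the classifier. It is a minimal sketch under assumed choices: ResNet-18 as the backbone, simple frame averaging, an LSTM text extractor, sigmoid outputs for multi-label confidences, and all dimensions are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class MultiModalVideoClassifier(nn.Module):
    """Sketch of a two-branch video classifier: CNN over frames + LSTM over text."""

    def __init__(self, vocab_size: int, num_labels: int,
                 text_embed_dim: int = 128, text_hidden_dim: int = 256):
        super().__init__()
        # Image branch: a ResNet backbone with its classification head removed.
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # (B*T, 512, 1, 1)
        # Text branch: embedding + LSTM as a simple text feature extractor.
        self.embedding = nn.Embedding(vocab_size, text_embed_dim)
        self.lstm = nn.LSTM(text_embed_dim, text_hidden_dim, batch_first=True)
        # Classifier over the concatenated (spliced) feature vector.
        self.classifier = nn.Linear(512 + text_hidden_dim, num_labels)

    def forward(self, frames: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W); token_ids: (batch, seq_len)
        b, t = frames.shape[:2]
        frame_feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 512)
        image_feat = frame_feats.view(b, t, -1).mean(dim=1)       # average over frames
        _, (h_n, _) = self.lstm(self.embedding(token_ids))
        text_feat = h_n[-1]                                       # last hidden state
        fused = torch.cat([image_feat, text_feat], dim=1)         # splicing layer
        return torch.sigmoid(self.classifier(fused))              # per-label confidence
```

The recurrent-network variant of fig. 5 would, for example, replace the plain frame averaging with a recurrent layer over the per-frame features before classification.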
As can be understood from the above description of the different embodiments, the video classification models implemented in the different embodiments can all serve for implementation of the present application, and are used for determining the confidence of the video to be labeled mapped to each class label in the class label set according to the image data of the video to be labeled.
Whichever of the above video classification models is adopted, after the input information containing the image data of the video to be labeled has been classification-mapped and the confidence of each category label in the category label set determined, it is easy to understand that the category labels with higher confidence are the more accurate and reliable ones. A confidence threshold can therefore be preset, the confidence of each category label compared against it, and the category labels whose confidence exceeds the confidence threshold determined as target category labels suitable for labeling the video to be labeled. Category labels whose confidence does not exceed the confidence threshold need not be processed in this step. The confidence threshold is an empirical or experimental value and can be set flexibly as required.
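As a small illustration (not from the patent, and with an arbitrary example threshold of 0.95), the split between target and pending category labels might look like this:

```python
from typing import Dict, List, Tuple

CONF_THRESHOLD = 0.95  # example value only; the patent leaves the threshold to be set empirically


def split_by_confidence(confidences: Dict[str, float],
                        threshold: float = CONF_THRESHOLD) -> Tuple[List[str], List[str]]:
    """Separate target category labels from pending ones by confidence."""
    target_labels = [label for label, c in confidences.items() if c > threshold]
    pending_labels = [label for label, c in confidences.items() if c <= threshold]
    return target_labels, pending_labels


# Example: confidences as produced by the video classification model.
scores = {"dance": 0.97, "pets": 0.41, "cooking": 0.12}
targets, pending = split_by_confidence(scores)
print(targets, pending)  # ['dance'] ['pets', 'cooking']
```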
It should be noted that each category label in the category label set of the present application may, in one embodiment, be a single-level category label. In another embodiment there may be multiple levels of category labels, that is, each level contains multiple category labels and each upper-level category label may contain multiple lower-level category labels; the levels can be divided according to different standards such as interest, function or domain. The path obtained by tracing back from a last-level category label towards its upper levels forms a vertical category label structure, and each vertical structure can be represented either by the feature identifier of its last-level category label or by a vector composed of the feature identifiers of the category labels at every level along the path.
Step S1300, taking the class label with the confidence coefficient not reaching the confidence coefficient threshold value as an undetermined class label, and checking whether the undetermined class label is a target class label or not based on the description text;
Category labels whose confidence, as inferred by the video classification model, does not reach the confidence threshold can be determined as pending category labels, indicating that, because their confidence is below the confidence threshold, it cannot yet be confirmed whether they belong to the target category labels and a further verification process is required. The pending category labels need further verification mainly because the video classification model may judge low-confidence category labels inaccurately, either because its own training or inference capability is insufficient or because the long-tail effect leaves it with insufficient ability to recognize those labels; the verification therefore assists the video classification model and improves the accuracy of labeling the video to be labeled.
As mentioned above, the media information prepared for labeling also includes the description text of the video to be labeled. The description text is usually a brief introduction to the content, characteristics and value of the video, so it has reference value for determining the category labels of the video. The description text can therefore be used to help verify whether a pending category label predicted by the video classification model is suitable as a target category label.
In one embodiment, a rule-matching approach may be used to query whether text corresponding to the pending category label exists in the description text; if so, the pending category label can be directly determined to be a target category label, and otherwise it is not. Checking pending category labels in this way is efficient and fast.
In another embodiment, a semantic-matching approach may be used to compute a data distance between the text vector of the pending category label and the text vector of the description text; when the data distance exceeds a preset threshold, the pending category label is determined to be a target category label, and otherwise it is not. Checking pending category labels in this way strikes a balance between efficiency and accuracy, and is economical and practical.
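The two lightweight checks above might look like the following sketch. The embedding function, the use of cosine similarity as the matching measure and the 0.6 threshold are assumptions made here for illustration, since the patent only speaks of text vectors and a data distance.

```python
from typing import Callable

import numpy as np


def verify_by_rule_matching(pending_label: str, description_text: str) -> bool:
    """Rule-matching check: confirm the pending label if its text literally
    appears in the description text."""
    return pending_label in description_text


def verify_by_semantic_matching(pending_label: str, description_text: str,
                                embed: Callable[[str], np.ndarray],
                                threshold: float = 0.6) -> bool:
    """Semantic-matching check: compare text vectors of the label and the
    description; cosine similarity stands in for the 'data distance' here."""
    a, b = embed(pending_label), embed(description_text)
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sim > threshold
```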
In still another embodiment, the following may be provided: when the confidence of a pending category label reaches a confidence floor value, hit category labels of the video to be labeled within the category label set are determined according to the word frequency statistical characteristics of the description text, and when the hit category labels include the pending category label, the pending category label is determined to be a target category label; the confidence floor value is lower than the confidence threshold.
Specifically, the word frequency of each participle in a participle set can first be determined, per category label in the category label set, by way of word frequency statistics. The description text of the video to be labeled is then segmented into participles, and it is determined for which category labels the participles of the description text have relatively high word frequencies; those category labels can be taken as hit category labels, and it is then checked whether the hit category labels include one or more of the pending category labels. Checking pending category labels on the basis of word frequency statistics exploits the relatively low computational cost of word-level statistics, tolerates problematic wording in individual texts, and, once the word frequency statistical tables have been obtained by a single pass of statistics, allows them to be queried and reused any number of times, so pending category labels can be screened conveniently, quickly and efficiently.
In this embodiment it should be noted that if the confidence of a pending category label is too low, the label is not trusted at all; therefore only pending category labels whose confidence, as identified by the video classification model, is lower than the confidence threshold but not lower than the preset confidence floor value are processed in this embodiment, the confidence floor value being lower than the confidence threshold.
In yet another embodiment, the following may be provided: when the confidence of a pending category label reaches a confidence median value, a text classification model is used to determine the predicted category labels to which the description text maps within the category label set, and when the predicted category labels include the pending category label, the pending category label is determined to be a target category label; the confidence median value is lower than the confidence threshold.
Specifically, a text classification model trained in advance to a convergence state may be adopted. It performs classification mapping on the deep semantic information of the encoded vector of the description text, mapping it to each category label in the category label set to determine the corresponding classification probability, and the category labels whose classification probability reaches a preset probability threshold are then selected as predicted category labels. If the predicted category labels include a given pending category label, that pending category label can be taken as a target category label; otherwise, if the pending category label is not recognized as any of the predicted category labels, it is not a target category label. By separately performing classification mapping on the deep semantic information extracted from the description text, the deep semantics of the description text are exploited and imperfect wording in individual texts can be tolerated, so the predicted category labels can be expected to be determined with higher reliability, providing a more reliable decision reference for screening the pending category labels predicted by the video classification model.
In this embodiment it should likewise be noted that if the confidence obtained by a pending category label from the video classification model is too low, the label is not trusted; therefore only pending category labels whose confidence is lower than the confidence threshold but not lower than the preset confidence median value may be processed, the confidence median value being lower than the confidence threshold.
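The text-classification check might be sketched as follows. The toy EmbeddingBag architecture, the 0.5 probability threshold and the function signatures are assumptions for illustration, not the patent's model.

```python
from typing import List

import torch
import torch.nn as nn


class DescriptionTextClassifier(nn.Module):
    """Toy text classifier mapping a description text to the category label set."""

    def __init__(self, vocab_size: int, num_labels: int, embed_dim: int = 128):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # mean-pooled token embeddings
        self.fc = nn.Linear(embed_dim, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(self.embedding(token_ids)))


def verify_with_text_model(model: nn.Module, token_ids: torch.Tensor,
                           label_names: List[str], pending_labels: List[str],
                           prob_threshold: float = 0.5) -> List[str]:
    """Keep a pending label only if the text model also predicts it."""
    with torch.no_grad():
        probs = model(token_ids)[0]
    predicted = {name for name, p in zip(label_names, probs.tolist()) if p >= prob_threshold}
    return [label for label in pending_labels if label in predicted]
```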
In a further embodiment, the specific ways of checking pending category labels given above can be combined flexibly, with any two or more of them applied progressively: when a pending category label is verified to be a target category label by the current check, it no longer needs to be checked by the subsequent ones; when the current check cannot determine that the pending category label is a target category label, the next check is applied, and so on. Processing in this way implements a more comprehensive fallback verification of the pending category labels, reduces the rate of missed labels in video labeling, and improves its accuracy.
When the check based on word frequency statistics and the check based on a text classification model are combined, they can, according to their respective characteristics, be used to process pending category labels falling into different confidence intervals, and can be combined and applied flexibly. In an improved embodiment, the word-frequency-statistics check can additionally be followed by manual confirmation; since manual identification is more accurate and reliable, this check can process pending category labels whose confidence falls within a first confidence interval defined by the confidence floor value and the confidence threshold. Because the text classification model relies on the inference capability of a neural network model and its interpretability is limited, only pending category labels whose confidence falls within a second confidence interval defined by the confidence median value and the confidence threshold are processed by the text classification model. Here the confidence median value is higher than the confidence floor value, that is, the second confidence interval is a sub-interval of the first confidence interval.
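One possible way of wiring these checks together, with purely illustrative interval bounds, is sketched below; the routing by confidence interval follows the description above, while the concrete numbers and function signatures are assumptions.

```python
from typing import Callable, List, Tuple

# Illustrative interval bounds only (floor < median < threshold).
CONF_FLOOR, CONF_MEDIAN, CONF_THRESHOLD = 0.3, 0.6, 0.95


def verify_pending_labels(pending: List[Tuple[str, float]],
                          check_by_word_freq: Callable[[str], bool],
                          check_by_text_model: Callable[[str], bool]) -> List[str]:
    """Progressively verify pending labels, routing by confidence interval:
    word-frequency check for [floor, threshold), text-model check for
    [median, threshold) when the first check fails."""
    confirmed = []
    for label, conf in pending:
        if conf < CONF_FLOOR:
            continue  # too unreliable to be worth verifying
        if check_by_word_freq(label):
            confirmed.append(label)
            continue  # already confirmed; later checks are skipped
        if conf >= CONF_MEDIAN and check_by_text_model(label):
            confirmed.append(label)
    return confirmed
```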
When the present application is applied to checking pending category labels, the pending category labels can be examined one by one in order to determine, for each of them, whether it is a target category label.
And step S1400, labeling the video to be labeled with the target category labels.
After the above processes, it is understood that the video to be annotated may obtain one or more target category labels, and the target category labels are used as category labels for annotating the video to be annotated, and a mapping relationship is established between the target category labels and the video to be annotated, so that the annotation of the video to be annotated can be realized.
After the labeling of the video to be labeled is finished, the video to be labeled can be subsequently and correspondingly called according to the target category label so as to push the corresponding video to a corresponding user. Of course, the target category tag may also be regarded as a category tag that is interested by a user who uploads a video to be labeled, and the video carrying the target category tag is called for the target category tag and pushed to the uploading user of the video to be labeled.
As can be seen from the above embodiments, the present application achieves various technical advantages, including but not limited to:
Firstly, after the video classification model predicts the category labels corresponding to the video to be labeled, the category labels with higher confidence are directly determined as target category labels of the video to be labeled. For the category labels with lower confidence, the description text in the media information of the video to be labeled is used as reference information to check them and determine whether they belong to the target category labels, so that the accuracy of video labeling based on the video classification model is improved without incurring high additional training cost.
Secondly, because the description text in the media information is usually used to introduce the content of the video, verifying the low-confidence category labels predicted by the video classification model against the description text can effectively overcome the long-tail effect: for category labels that the video classification model cannot distinguish well due to a lack of sufficient training samples, the recognition capability is improved with the assistance of the description text.
Finally, the description text of a video often contains information such as the content, value and characteristics of the video. By mining this description text in the process of labeling the video, the utilization value of information resources can be improved and waste of information resources avoided.
On the basis of any embodiment of the present application, please refer to fig. 6, determining a confidence level of each class label mapped to the class label set by using a video classification model, and determining a class label with the confidence level exceeding a preset confidence level threshold as a target class label, includes:
step 1210, extracting a plurality of image frames from the image data of the video to be annotated;
the image data of the video to be annotated is encapsulated in the video file of the video to be annotated, so that the video file can be decoded to read the corresponding image data. In one embodiment, after decoding the video file into the image space, a number of image frames are read from the image space at equal intervals for input information corresponding to image data required for constructing the video classification model.
The interval between the image frames can be determined in a fixed or dynamic manner by taking the number of frames or time as a reference, for example, the interval can be determined according to the following formula:
Span=Length/(N-1)
wherein Span represents an interval based on time, length represents the playing time Length of the video to be marked, and N represents the total number of image frames required to be acquired.
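For illustration only, the time-based interval above can be turned into sampling timestamps as in the following sketch; the function name and the choice of returning timestamps rather than frame indices are assumptions.

```python
def sample_timestamps(length_seconds: float, n_frames: int) -> list:
    """Pick n_frames timestamps at equal intervals, Span = Length / (N - 1),
    so the first frame is at 0 and the last at the end of the video."""
    if n_frames == 1:
        return [0.0]
    span = length_seconds / (n_frames - 1)
    return [i * span for i in range(n_frames)]


print(sample_timestamps(30.0, 4))  # [0.0, 10.0, 20.0, 30.0]
```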
Step S1220, extracting image feature information of the image frames by using the video classification model, and classifying and mapping the image feature information into classification spaces corresponding to the category label sets to obtain confidence levels corresponding to the category labels;
the video classification model adopted in the present application has been trained to a convergence state in advance, please refer to any one of examples in fig. 3 to 5, where there is an input depending on image data, and for a plurality of image frames obtained corresponding to the image data, image formatting preprocessing, such as adjusting the size of the image frames, may be performed on the image frames according to the input constraint requirement of the video classification model, and then the image feature information is input into the video classification model, and the image feature information is extracted by a convolutional neural network part therein, and then the image feature information or deep semantic information obtained on the basis of the image feature information is mapped to a classification space corresponding to the class label set by a classifier part therein, so as to obtain the confidence of each class label in the class label set correspondingly.
Step S1230, determining the category label with the confidence level exceeding the confidence level threshold in the classification space, as the target category label of the video to be labeled.
Among the confidences obtained for the category labels in the classification space, the category labels with higher confidence, for example those whose confidence exceeds a confidence threshold of 0.95, are trustworthy, so the category labels whose confidence exceeds the confidence threshold can be used directly as target category labels of the video to be labeled. It is understood that the confidence threshold may be an empirical or experimental value and can be flexibly determined in advance.
According to the embodiment, the main information depended on by the video classification model for marking the video to be marked is the plurality of image frames of the video to be marked, the image frames effectively represent the content of the video to be marked, and the high-confidence class label obtained by predicting the video to be marked by utilizing the semantic comprehension capability of the video classification model on the image is a more reliable target class label.
On the basis of any embodiment of the present application, referring to fig. 7, determining the hit category labels of the video to be annotated in the category label set according to the word frequency statistical characteristics of the description text, and, when the hit category labels include the pending category label, determining that the pending category label is a target category label, includes:
step S2100, performing word segmentation processing on the description text to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of word segments;
The description text may be segmented in any feasible manner to obtain the corresponding participle set containing a number of participles. The specific segmentation approach may be statistics-based, such as an N-gram segmentation algorithm, may be based on a bag-of-words model (BOW), or may be any other known approach.
In one embodiment, all the participles obtained from each description text can be configured as a participle set in a set manner, that is, the same participle appearing multiple times in the description text is embodied as a single element in the participle set, so that repeated subsequent processing is avoided for the same participle appearing multiple times, and the computation amount is saved.
Step S2200, inquiring the word frequency of each participle in the word frequency statistical table corresponding to each category label of the category label set, and constructing the word frequency of each participle corresponding to each category label as a word frequency vector corresponding to the participle;
each category label in the category label set is pre-provided with a word frequency statistical table for storing word frequencies of various participles obtained through statistics in a description text of a video carrying the corresponding category label, and when the word frequency of a certain participle corresponding to a certain category label needs to be determined, the word frequency statistical table corresponding to the category label can be inquired to determine the word frequency. Each word frequency statistical table can be expressed as a vector, and each word frequency statistical table corresponding to all category labels can also be expressed as a two-dimensional table in a combined manner, so that the method can be flexibly implemented.
For each participle in the participle set of the description text, the word frequency statistical table corresponding to each category label in the category label set is queried to determine the word frequency x_i of that participle for category label i. In this way, every participle in the participle set of the description text obtains its word frequency mapped to every category label, and summing these word frequencies also gives the total word frequency sum_n of each participle n:

sum_n = x_1 + x_2 + … + x_total

where total is the total number of category labels in the category label set. On this basis, a word frequency vector can be constructed for each participle.
In one embodiment, before constructing the word frequency vector of each participle, and considering that participles with a low total word frequency have limited reference value, the participles are filtered to remove those whose overall word frequency is low. By way of example, a participle n in the participle set of the description text may be filtered out as long as it satisfies either of the following exemplary conditions:
sum_n < Threshold_sum
max(x_1 / sum_n , x_2 / sum_n , … , x_total / sum_n) < t
among them, threshold sum The preset minimum threshold corresponding to the word frequency can be set as required, when the total word frequency of a participle is lower than the threshold, the participle is a low-frequency invalid word, and the participle can be filtered from a participle set, and then the word frequency vector does not need to be constructed for the participle.
And t is a preset threshold value for checking a corresponding normalization result after normalizing the word frequency distribution of each participle in the whole category label set, and can be set as required.
After filtering, the number of participles in the participle set is greatly reduced, and word frequency vectors are then constructed only for the few remaining valid participles. The word frequency vector of a participle n may, for example, be composed of its word frequency for each category label together with its total word frequency:

X_n = (x_1, x_2, …, x_total, sum_n)
Of course, in another embodiment, each word frequency in the word frequency vector may additionally be divided by the total word frequency, converting it into a normalized value.
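Putting the table lookup, the two filters and the vector construction together might look like the sketch below. The table layout, the default thresholds and, in particular, the exact form of the second (normalized) filter are assumptions for illustration.

```python
from typing import Dict, List

# word_freq_tables[label][token] -> word frequency of the token under that label.
WordFreqTables = Dict[str, Dict[str, int]]


def build_word_freq_vectors(tokens: List[str], tables: WordFreqTables,
                            threshold_sum: int = 5, t: float = 0.2) -> Dict[str, List[float]]:
    """For each distinct participle, look up its frequency under every category
    label, append the total, and drop low-value participles per the two filters."""
    labels = list(tables.keys())
    vectors = {}
    for token in set(tokens):                       # set(): duplicates handled once
        freqs = [tables[label].get(token, 0) for label in labels]
        total = sum(freqs)
        if total == 0 or total < threshold_sum:     # filter 1: low overall frequency
            continue
        if max(freqs) / total < t:                  # filter 2: no label dominates (assumed form)
            continue
        vectors[token] = freqs + [total]            # X_n = (x_1, ..., x_total, sum_n)
    return vectors
```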
Step S2300, determining a category label corresponding to the maximum word frequency as a hit category label according to the word frequency vector of each word in the word segmentation set;
Based on the word frequency vectors of the valid participles of the description text, it can be further determined whether they are sufficient to confirm the pending category labels.
Taking a word frequency vector built from the raw word frequency values as an example, it can be judged whether the maximum word frequency is greater than a preset value. When it is, the category label corresponding to the maximum word frequency can be confirmed as a category label inferred from the word frequency statistical characteristics of the description text and taken as a hit category label, while the participle corresponding to that word frequency vector is taken as a candidate lower-level label of the hit category label, so that after subsequent manual confirmation the participle can serve as a lower-level category label for labeling the video to be labeled. The same applies when the word frequency vector is expressed as normalized word frequency values per category label.
It is easy to understand that several hit category labels, each with a corresponding lower-level candidate, may be determined from the word frequency statistical characteristics of the description text. The hit category labels are members of the category label set; they are suspected category labels inferred for the video to be labeled from the word frequency statistics of the participles of its description text, and can be used to verify the pending category labels predicted by the video classification model.
Step S2400, judging whether the hit category labels include the pending category label, and when they do, determining that the pending category label is a target category label.
Each hit category label determined from the word frequency statistical characteristics of the participles of the description text can be used to check each pending category label predicted by the video classification model. Specifically, for each pending category label it is judged whether some hit category label is identical to it: when a hit category label is identical to the pending category label, the pending category label can be determined to be a target category label; when none of the hit category labels is identical to it, the pending category label is not a target category label. In this way it can be checked, one by one, whether each pending category label belongs to the target category labels.
According to this embodiment, the word frequency distribution of each participle of the description text of the video to be labeled is determined from the word frequency statistical tables of the category labels, the category label in which a participle has the highest word frequency is taken as a hit category label determined from the word frequency statistical characteristics, and the hit category labels are then used to check the pending category labels one by one; a pending category label that passes the check is in fact a hit category label, so the word frequency statistical characteristics of the description text are put to use. After accurate manual confirmation, the participles corresponding to the hit category labels can serve as lower-level labels of those hit category labels, realizing a vertical, fine-grained extension of the category label set and a finer labeling of the video to be labeled.
In addition, inferring the hit category labels of the video to be labeled on the basis of word frequency statistics also effectively filters out invalid words that users deliberately add to description texts to attract traffic, avoiding their interference. Data actually measured for this embodiment show that, for the pending category labels predicted by the video classification model, the embodiment can recall 10% of otherwise unlabeled videos with an accuracy of 70%, which is a considerable improvement.
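A compact sketch of steps S2300 and S2400, reusing the word frequency vector layout from the previous sketch, is given below; the minimum-frequency value and the function names are again illustrative assumptions.

```python
from typing import Dict, List, Tuple


def hit_labels_from_vectors(vectors: Dict[str, List[float]], labels: List[str],
                            min_freq: float = 10.0) -> List[Tuple[str, str]]:
    """For each participle's word frequency vector, take the label with the
    maximum frequency as a hit category label (the participle is kept as a
    candidate lower-level label), provided the maximum exceeds a preset value."""
    hits = []
    for token, vec in vectors.items():
        freqs = vec[:-1]                      # drop the trailing sum_n entry
        best = max(range(len(freqs)), key=lambda i: freqs[i])
        if freqs[best] > min_freq:
            hits.append((labels[best], token))
    return hits


def confirm_pending(pending_labels: List[str], hits: List[Tuple[str, str]]) -> List[str]:
    """A pending label becomes a target label if it appears among the hit labels."""
    hit_label_set = {label for label, _ in hits}
    return [label for label in pending_labels if label in hit_label_set]
```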
On the basis of any embodiment of the present application, please refer to fig. 8; before querying the word frequency of each participle in the word frequency statistical table corresponding to each category label of the category label set and constructing the corresponding word frequency vector, the method includes:
step S3100, obtaining a description text of the video carrying the category label as a description sample based on each category label in the category label set, and forming a sample set corresponding to the category label;
in one embodiment, the description text of the marked video stored in correspondence with the video service may be used as a description sample for generating the word frequency statistical table of the present application.
For each category label in the category label set, the full amount of videos carrying the category label can be recalled as video samples, and then description texts of all the video samples are obtained as description samples to form a sample set corresponding to the category label. When determining a corresponding word frequency statistical table for each category label, the corresponding word frequency statistical table may be constructed based on word frequency statistics for a sample set of the category labels.
In another embodiment, the word frequency statistical table may be generated by using video samples and description samples which are obtained from other sources but labeled by using the same category label set.
Step S3200, based on the sample set corresponding to each category label, performing word segmentation on each description sample to determine a word segmentation sequence of each description sample;
To facilitate the word frequency statistics, each description sample in the sample set corresponding to each category label may be segmented in any feasible manner to obtain the corresponding participle sequence. As an alternative, within each participle sequence identical participles can be kept at the text positions where they occur, rather than merged into a single participle, so as to truly reflect how often they appear. The segmentation approach can be chosen flexibly, for example a statistics-based approach such as the N-gram algorithm, or a bag-of-words model.
S3300, performing word frequency statistics by taking the participles as a unit based on each participle sequence of the sample set corresponding to each category label, and determining the word frequency corresponding to each participle;
To represent the word frequency statistics conveniently, a dictionary can be built over all participles of all category labels, storing, for each category label, the mapping from each participle to its word frequency. Specifically, for the full set of participles obtained from all description samples in the sample set of one category label, the total number of times each participle appears across the corresponding word segmentation sequences is counted as its word frequency, and the participle and its word frequency are stored in the dictionary as mapping data under that category label.
And step S3400, constructing a mapping relation between each participle in the sample set corresponding to each category label and the corresponding word frequency of the participle into a word frequency statistical table corresponding to the category label.
Repeating the above counting process for every participle under every category label finally yields word frequency data for all participles under each category label. For each category label, the resulting mapping between participles and word frequencies constitutes that label's word frequency statistical table, which can later be used to determine the word frequency vectors of the description text of a video to be labeled and hence to deduce its hit category labels.
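As an illustrative, non-limiting sketch, the per-label word frequency statistical tables might be built as below; the function names and the character-bigram segmentation are assumptions chosen for the example, since the embodiment allows any feasible word segmentation scheme.

```python
from collections import Counter

def build_word_freq_tables(sample_sets, segment):
    """sample_sets: dict mapping category label -> list of description samples.
    Returns a dict mapping category label -> {participle: total occurrences},
    i.e. one word frequency statistical table per category label."""
    tables = {}
    for label, samples in sample_sets.items():
        counter = Counter()
        for sample in samples:
            # Every occurrence is counted; identical participles are not merged
            # before counting, so the true occurrence frequency is preserved.
            counter.update(segment(sample))
        tables[label] = dict(counter)
    return tables

def char_bigrams(text):
    """Illustrative segmentation: character bigrams, one simple n-gram choice."""
    text = text.strip()
    return [text[i:i + 2] for i in range(len(text) - 1)] if len(text) > 1 else [text]
```

Tables built this way can be passed straight to the word frequency check sketched earlier, e.g. `hit_category_labels(text, build_word_freq_tables(sample_sets, char_bigrams), char_bigrams)`.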
According to this embodiment, word frequency statistics are computed over the description texts of video samples stored historically by the video service to build a word frequency statistical table for each category label in the category label set; combined with the description text of a video to be labeled, these tables allow its hit category labels to be deduced and thus correct the low-confidence pending category labels predicted by the video classification model. Because each word frequency statistical table is derived from the description samples of historical video samples whose category labels have already proven correct in practice, its word frequency information has high reference value and characterizes the word frequency distribution of its category label, so the hit category labels judged from the participles of the description text, and the pending category labels corrected by them, are highly credible.
On the basis of any embodiment of the present application, after determining, according to the word frequency vector of each participle in the participle set, a category tag corresponding to the maximum word frequency therein as a hit category tag, the method includes:
Step S2500, taking the participles that determined the hit category labels as lower-level candidate labels of those hit category labels and storing them as data to be confirmed;
As described above, when the hit category labels of the video to be annotated are determined from its description text by the word frequency statistical method, the category label corresponding to the maximum word frequency in each participle's word frequency vector is taken as the hit category label for that participle. The hit category label can then serve as the superior category label of the participle, and the participle becomes a subordinate category label of the hit category label. The hit category label and its subordinate category label are associated with the feature identifier of the video to be annotated to form data to be confirmed, which is stored in a database for later manual retrieval and confirmation.
Step S2600, in response to a user confirmation instruction, labeling the video to be labeled with the hit category label specified by the instruction and its lower-level candidate labels.
A management user or authorized user of the video service can retrieve the data to be confirmed from the database, review each record, and trigger a corresponding user confirmation instruction once a record is confirmed. In response to that instruction, the confirmed record takes effect: the specified video to be labeled is labeled with the specified hit category label and its corresponding subordinate category labels. It is easy to understand that the hit category label is a member of the category label set, so once it is manually confirmed and applied to the video it effectively becomes a manually confirmed target category label, while the subordinate category label is a participle taken from the description text according to the word frequency statistical characteristics; semantically the participle narrows the scope of the hit category label and forms its subordinate vertical classification label.
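One possible shape for the data to be confirmed and the confirmation step is sketched below; the record fields and function names are assumptions for illustration, not the disclosed storage format.

```python
from dataclasses import dataclass, field

@dataclass
class PendingConfirmation:
    video_id: str                                   # feature identifier of the video
    hit_label: str                                  # member of the category label set
    candidate_sublabels: list = field(default_factory=list)  # participles from the text

def apply_confirmation(record, label_store):
    """Act on a user confirmation instruction: label the specified video with the
    confirmed hit category label and its lower-level candidate labels."""
    labels = label_store.setdefault(record.video_id, set())
    labels.add(record.hit_label)
    labels.update(record.candidate_sublabels)
    return labels
```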
In this embodiment, a user confirmation step is added so that the validity of each hit category label and its subordinate category labels is confirmed manually, which effectively avoids mislabeling and improves labeling accuracy. Representative, valid participles are associated with the corresponding hit category labels as subordinate category labels for labeling the video, saving the effort of manually extending the vertical classification hierarchy and speeding up the labeling service. Because this fallback labeling of the video relies on word frequency statistics plus manual review, words that a user deliberately inserts but that do not match the video content can be filtered out, giving strong resistance to interference.
On the basis of any embodiment of the present application, please refer to fig. 9, determining, by using a text classification model, that the description text is mapped to a prediction category label in the category label set, and when the prediction category label includes the pending category label, determining that the pending category label is a target category label, includes:
Step S4100, performing word embedding on the description text to obtain a corresponding coding vector;
When a text classification model is to be applied to classify the description text of the video to be labeled, the model is trained to convergence in advance with a sufficient number of corresponding training samples, so that it can extract deep semantic information from the text and then perform classification mapping.
To facilitate semantic understanding by the text classification model, word embedding is first applied to the description text of the video to be labeled: the description text is segmented into a word segmentation sequence, the coding feature of each participle is looked up in the reference vocabulary used by the text classification model, and the coding features are assembled, in participle order, into a coding vector that serves as input to the model.
In one embodiment, on the basis of the coding vector, the position codes corresponding to the participles are superimposed according to the positions of the participles appearing in the participle sequence of the description text, so that the basic semantic information required by the text classification model is enriched, the text classification model is easier to train, and an accurate prediction result can be obtained.
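A minimal sketch of this encoding step, written with PyTorch and assuming a learned token embedding plus a learned position embedding; the vocabulary size, dimensions and class name are placeholders rather than values from the embodiment.

```python
import torch
import torch.nn as nn

class DescriptionEncoderInput(nn.Module):
    """Build the coding vector: word embeddings plus superimposed position codes."""
    def __init__(self, vocab_size=30000, dim=256, max_len=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)   # reference vocabulary lookup
        self.pos_emb = nn.Embedding(max_len, dim)        # learned position codes

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Superimpose the position code of each participle on its word embedding.
        return self.token_emb(token_ids) + self.pos_emb(positions)
```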
Step S4200, extracting text feature information from the coding vector with the text classification model, and mapping the text feature information into the classification space corresponding to the category label set to obtain the classification probability corresponding to each category label;
An exemplary text classification model may be implemented as a sequence-model text feature extractor followed by a classifier, where the extractor may be, for example, an RNN, an LSTM, BERT (Bidirectional Encoder Representations from Transformers), or a Transformer encoder; such extractors reason over context information and therefore understand semantics more accurately and produce more accurate predictions.
After the coding vector is input into the text classification model, its text feature extractor mines deep semantic information to obtain the corresponding text feature information; a subsequent fully connected classifier then maps the text feature information to the classification space corresponding to the category label set and computes the classification probability of each category label in that space.
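For illustration, one possible arrangement of the text feature extractor and the fully connected classifier is sketched below, with a small Transformer encoder standing in for any of the extractors mentioned above; the dimensions, pooling choice and sigmoid multi-label output are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class DescriptionTextClassifier(nn.Module):
    """Text feature extractor (a small Transformer encoder here) plus a fully
    connected classifier mapping into the classification space."""
    def __init__(self, dim=256, num_labels=1000, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.extractor = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, coding_vectors):                   # (batch, seq_len, dim)
        features = self.extractor(coding_vectors)        # text feature information
        pooled = features.mean(dim=1)                    # simple pooling over the sequence
        # Independent probability per category label (multi-label classification).
        return torch.sigmoid(self.classifier(pooled))
```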
Step S4300, determining a category label of which the classification probability exceeds a preset probability threshold value in the classification space as a prediction category label, and determining the pending category label as a target category label of the video to be labeled when the prediction category label comprises the pending category label.
The classification probability of each category label in the classification space is in effect the confidence that the coding vector maps to that category label. However, the text classification model relies on the description text alone, and the description text may be inconsistent with the video content, so its predictions cannot be fully trusted and need to be screened.
To implement this screening, a probability threshold may be preset, and every category label whose classification probability exceeds the threshold is taken from the classification space as a prediction category label. These prediction category labels are the category labels to which, according to the text classification model and the description text, the video to be labeled may map, and they can be used to correct the pending category labels predicted by the video classification model. The probability threshold may be set as desired.
Specifically, for each pending category label, it is judged whether any prediction category label is identical to it: if a prediction category label matches the pending category label, the pending category label is determined to be a target category label; if none of the prediction category labels matches it, the pending category label is not a target category label. In this way, whether each pending category label is a target category label can be checked one by one.
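A minimal sketch of this screening and check, assuming the classification probabilities and category labels are given as parallel sequences; the function names and the 0.5 default threshold are illustrative only.

```python
def prediction_category_labels(probabilities, labels, probability_threshold=0.5):
    """Keep every category label whose classification probability exceeds the threshold."""
    return {label for label, p in zip(labels, probabilities) if p > probability_threshold}

def check_pending_against_predictions(pending_labels, predicted):
    """A pending category label becomes a target category label only if it was predicted."""
    return [label for label in pending_labels if label in predicted]
```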
In one embodiment, when the text classification model and the word frequency statistics are used together for fallback verification of the pending category labels, the word frequency statistics can verify the pending category labels first; pending category labels that already pass this verification need not be verified again by the text classification model, which only verifies the pending category labels that failed the first check.
In this embodiment, the text classification model's ability to perform deep semantic understanding of the description text is used to infer the corresponding category labels; the probability threshold selects only high-confidence category labels as the prediction category labels of the text classification model, and these prediction category labels provide a fallback check on each pending category label produced by the video classification model, giving the video to be labeled a fallback labeling path and making its labeling more complete.
Similarly, in the above embodiment, because the text classification model attends to the overall semantics of the text, the influence of invalid vocabulary that a user deliberately adds when filling in the description text but that does not match the content of the video to be labeled is weakened, and the labeling coverage is effectively enlarged.
On the basis of any embodiment of the present application, as an exemplary application, please refer to fig. 10, where after the target category tag is used to label the video to be labeled, the method includes:
s1500, acquiring media information of a target video carrying the target category label, and constructing a video recommendation list according to the media information;
After the video to be labeled has been processed as in the above embodiments, its labeled target category labels also represent, to a certain extent, the category labels that the uploading user of that video is interested in, so videos can be recommended to the uploading user according to the labeled target category labels of the video to be labeled.
To recommend videos to the uploading user, the target category labels may be used as query labels: target videos carrying any of the target category labels are queried and recalled from the video library of the video service, and their media information is obtained, including the image data and description text of each target video file. The image data may, for example, be rendered as a cover of the target video in a preset format, and a video recommendation list is then built from the media information of the target videos, each of which is listed in it.
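The recall and list construction might look like the following sketch, assuming the video library exposes per-video label sets and media fields; the record layout is an assumption for illustration.

```python
def build_recommendation_list(target_labels, video_library):
    """video_library: iterable of records such as
    {"video_id": ..., "labels": set(...), "cover": ..., "description": ...}.
    Returns the media information of every target video carrying any target label."""
    wanted = set(target_labels)
    return [
        {"video_id": rec["video_id"], "cover": rec["cover"], "description": rec["description"]}
        for rec in video_library
        if rec["labels"] & wanted
    ]
```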
And S1600, pushing the video recommendation list to a terminal user uploading the video to be annotated.
After the video recommendation list is determined, it can be pushed to the terminal of the user who uploaded the video to be labeled, that is, the uploading user of the video to be labeled. On receiving the video recommendation list, the user's terminal parses it and displays each target video in a graphical user interface, where the user can click any target video for corresponding access.
From these embodiments it can be understood that, after the user uploads the video to be annotated and its target category labels are determined by annotation, those target category labels can serve as category labels of interest to the user for video recommendation to the uploading user of the video to be annotated, allowing the uploading user to expand an interest boundary and receive recommended videos close or related to the content of the video to be annotated.
Referring to fig. 11, a video annotation apparatus according to an aspect of the present application includes an information obtaining module 1100, a default classification module 1200, an auxiliary verification module 1300, and a video annotation module 1400, where the information obtaining module 1100 is configured to obtain media information of a video to be annotated, where the media information includes image data and a description text of the video to be annotated; the default classification module 1200 is configured to determine a confidence level of the image data mapped to each category label of the category label set by using a video classification model, and determine a category label of which the confidence level exceeds a preset confidence level threshold as a target category label; the auxiliary checking module 1300 is configured to check whether the category label to be checked is a target category label based on the description text, where the category label whose confidence level does not reach the confidence level threshold is used as the category label to be checked; the video tagging module 1400 is configured to tag the video to be tagged with the target category tag.
On the basis of any embodiment of the present application, the default classification module 1200 includes: the image extraction unit is used for extracting a plurality of image frames from the image data of the video to be marked; the video classification unit is used for extracting image characteristic information of the image frames by adopting the video classification model, and classifying and mapping the image characteristic information into a classification space corresponding to the class label set to obtain a confidence coefficient corresponding to each class label; and the target preliminary screening unit is configured to determine a category label with the confidence coefficient exceeding the confidence coefficient threshold in the classification space, and the category label is used as a target category label of the video to be labeled.
On the basis of any embodiment of the present application, the auxiliary verification module 1300 includes: the word frequency check module, configured to, when the confidence of a pending category label reaches a confidence lower limit, determine the hit category labels of the video to be annotated in the category label set according to the word frequency statistical characteristics of the description text, and to determine the pending category label as a target category label when the hit category labels include it, wherein the confidence lower limit is lower than the confidence threshold.
On the basis of any embodiment of the present application, the auxiliary verification module 1300 includes: the semantic check module, configured to, when the confidence of a pending category label reaches a confidence median, determine with a text classification model the prediction category labels in the category label set to which the description text maps, and to determine the pending category label as a target category label when the prediction category labels include it, wherein the confidence median is lower than the confidence threshold.
In the two embodiments disclosed above, the word frequency check module and the semantic check module can be flexibly combined; according to an improved embodiment, the confidence median adopted by the semantic check module may be greater than the confidence lower limit adopted by the word frequency check module, while both remain below the confidence threshold used with the video classification model.
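A possible way to combine the two checks under these thresholds is sketched below; the numeric values, the ordering of the checks and the function name are assumptions, since the embodiments allow the modules to be combined flexibly.

```python
def verify_pending_label(label, confidence, hits, predicted,
                         lower_limit=0.3, median=0.5):
    """hits: hit category labels from the word frequency statistics;
    predicted: prediction category labels from the text classification model.
    The numeric thresholds and the ordering lower_limit <= median are assumptions."""
    if confidence >= lower_limit and label in hits:
        return True                     # passes the word frequency check
    if confidence >= median and label in predicted:
        return True                     # fallback: passes the semantic check
    return False
```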
On the basis of any embodiment of the present application, the word frequency check module includes: the word segmentation processing unit, configured to perform word segmentation processing on the description text to obtain a word segmentation set containing a plurality of participles; the word frequency query unit, configured to query the word frequency of each participle in the word frequency statistical table corresponding to each category label of the category label set, and to construct the word frequencies of each participle over the category labels into the participle's corresponding word frequency vector; the superior determining unit, configured to determine, from the word frequency vector of each participle in the word segmentation set, the category label corresponding to the maximum word frequency as a hit category label; and the judging and checking unit, configured to judge whether the hit category labels include the pending category label and, when the judgment is affirmative, to determine the pending category label as a target category label.
On the basis of any embodiment of the present application, a word frequency query unit in the word frequency check module includes: the sample acquisition unit is arranged for acquiring a description text of the video carrying the category label as a description sample based on each category label in the category label set to form a sample set corresponding to the category label; the sample word segmentation unit is set to perform word segmentation on each description sample based on a sample set corresponding to each category label to determine a word segmentation sequence of each description sample; the word frequency statistical unit is set to perform word frequency statistics by taking the word as a unit based on each word segmentation sequence of the sample set corresponding to each category label, and determines the word frequency corresponding to each word segmentation; and the statistical table building unit is set to construct a mapping relation between each participle in the sample set corresponding to each category label and the corresponding word frequency of the participle into a word frequency statistical table corresponding to the category label.
On the basis of any embodiment of the present application, the word frequency check module further includes, subsequent to the superior determining unit: the data storage unit, configured to take the participles that determined the hit category labels as lower-level candidate labels of those hit category labels and store them as data to be confirmed; and the manual labeling unit, configured to respond to a user confirmation instruction by labeling the video to be labeled with the hit category label specified by the instruction and its lower-level candidate labels.
On the basis of any embodiment of the present application, the semantic checking module includes: the text coding unit is used for embedding words into the description text to obtain a corresponding coding vector; the text classification unit is used for extracting text characteristic information from the coding vector by adopting the text classification model, and classifying and mapping the text characteristic information to a classification space corresponding to the class label set to obtain the classification probability corresponding to each class label; and the probability checking unit is set to determine a category label of which the classification probability exceeds a preset probability threshold value in the classification space as a prediction category label, and when the prediction category label comprises the pending category label, the pending category label is determined as a target category label of the video to be labeled.
On the basis of any embodiment of the present application, the video annotation module 1400 includes: the recommendation construction module is used for acquiring media information of a target video carrying the target category label and constructing a video recommendation list according to the media information; and the recommendation execution module is used for pushing the video recommendation list to a terminal user uploading the video to be marked.
Another embodiment of the present application further provides a video annotation device. FIG. 12 schematically shows the internal structure of the video annotation device. The video annotation device includes a processor, a non-transitory computer-readable storage medium, a memory, and a network interface connected by a system bus. The non-transitory computer-readable storage medium of the video annotation device stores an operating system, a database, and computer-readable instructions; the database stores information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement the video annotation method.
The processor of the video annotation device provides computing and control capabilities and supports the operation of the entire device. The memory of the video annotation device can store computer-readable instructions which, when executed by the processor, cause the processor to perform the video annotation method. The network interface of the video annotation device is used to connect and communicate with terminals.
It will be understood by those skilled in the art that the structure shown in fig. 12 is a block diagram of only the part of the structure related to the present application and does not limit the video annotation device to which the present application is applied; a specific video annotation device may include more or fewer components than shown in the drawings, combine certain components, or arrange the components differently.
In this embodiment, the processor is configured to execute the specific functions of each module in fig. 11, and the memory stores the program code and the various data required to execute those modules or sub-modules. The network interface is used for data transmission between user terminals or servers. The non-volatile readable storage medium in this embodiment stores the program code and data required to execute all modules of the video annotation device of the present application, and the server can call them to perform the functions of all modules.
The present application also provides a non-transitory readable storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the video annotation method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or another computer-readable storage medium.
To sum up, the present application labels videos more comprehensively and systematically, avoids missed labels and mislabels, and improves labeling accuracy; on the basis of accurately labeled video category labels, it comprehensively improves the service quality of the video service.

Claims (11)

1. A method for video annotation, comprising:
acquiring media information of a video to be marked, wherein the media information comprises image data and a description text of the video to be marked;
determining the confidence coefficient of each category label mapped to the category label set by the image data by adopting a video classification model, and determining the category label of which the confidence coefficient exceeds a preset confidence coefficient threshold value as a target category label;
the category label with the confidence coefficient not reaching the confidence coefficient threshold value is used as a pending category label, and whether the pending category label is a target category label is checked based on the description text;
and labeling the video to be labeled by adopting the target category label.
2. The method for video annotation according to claim 1, wherein determining the confidence level of the image data mapped to each class label of the class label set by using a video classification model, and determining the class label with the confidence level exceeding a preset confidence level threshold as a target class label comprises:
extracting a plurality of image frames from the image data of the video to be annotated;
extracting image characteristic information of the image frames by adopting the video classification model, and classifying and mapping the image characteristic information into a classification space corresponding to the class label set to obtain a confidence coefficient corresponding to each class label;
and determining the class label with the confidence coefficient exceeding the confidence coefficient threshold value in the classification space as a target class label of the video to be labeled.
3. The video annotation method of claim 1, wherein verifying whether the pending category label is a target category label based on the descriptive text comprises:
when the confidence of the undetermined category label reaches a confidence lower limit, determining a hit category label of the video to be annotated in the category label set according to the word frequency statistical characteristics of the description text, and when the hit category label comprises the undetermined category label, determining the undetermined category label as a target category label, wherein the confidence lower limit is lower than the confidence threshold;
and/or,
when the confidence of the undetermined class label reaches a confidence median, determining that the description text is mapped to the prediction class label in the class label set by adopting a text classification model, and when the prediction class label comprises the undetermined class label, determining that the undetermined class label is a target class label, wherein the confidence median is lower than the confidence threshold.
4. The method of claim 3, wherein determining a hit category label of the to-be-annotated video in the category label set according to the word frequency statistical characteristics of the descriptive text, and when the hit category label includes the to-be-annotated category label, determining that the to-be-annotated category label is a target category label comprises:
performing word segmentation processing on the description text to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of word segments;
the word frequency of each participle is inquired from a word frequency statistical table corresponding to each category label of the category label set, and the word frequency of each participle corresponding to each category label is constructed into a word frequency vector corresponding to the participle;
determining a category label corresponding to the maximum word frequency as a hit category label according to the word frequency vector of each word in the word segmentation set;
and judging whether the hit category label comprises the undetermined category label, and when the judgment is affirmative, determining the undetermined category label as a target category label.
5. The method of claim 4, wherein before querying the word frequency of each participle in the word frequency statistical table corresponding to each category label of the category label set and constructing the corresponding word frequency vector, the method comprises:
acquiring a description text of the video carrying the category label as a description sample based on each category label in the category label set to form a sample set corresponding to the category label;
based on the sample set corresponding to each category label, performing word segmentation on each description sample to determine a word segmentation sequence of each description sample;
performing word frequency statistics by taking the participles as a unit based on each participle sequence of the sample set corresponding to each category label, and determining the word frequency corresponding to each participle;
and constructing a mapping relation between each participle in the sample set corresponding to each category label and the corresponding word frequency thereof into a word frequency statistical table corresponding to the category label.
6. The method of claim 4, wherein after determining the category label corresponding to the maximum word frequency as the hit category label according to the word frequency vector of each word in the word segmentation set, the method comprises:
taking the participles that determined the hit category labels as subordinate candidate labels of the determined hit category labels and storing them as data to be confirmed;
and responding to a user confirmation instruction by labeling the video to be labeled with the hit category label specified by the instruction and its lower-level candidate labels.
7. The method of claim 3, wherein determining that the description text is mapped to a prediction category label in the category label set using a text classification model, and when the prediction category label comprises the pending category label, determining that the pending category label is a target category label comprises:
performing word embedding on the description text to obtain a corresponding coding vector;
extracting text characteristic information from the coding vector by adopting the text classification model, and classifying and mapping the text characteristic information into a classification space corresponding to the class label set to obtain a classification probability corresponding to each class label;
and determining a category label of which the classification probability exceeds a preset probability threshold value in the classification space as a prediction category label, and determining the pending category label as a target category label of the video to be labeled when the prediction category label comprises the pending category label.
8. A video annotation apparatus, comprising:
the information acquisition module is used for acquiring media information of the video to be marked, wherein the media information comprises image data and description text of the video to be marked;
a default classification module configured to determine a confidence level of each class label mapped to the class label set by the image data using a video classification model, and determine a class label having a confidence level exceeding a preset confidence level threshold as a target class label;
the auxiliary checking module is set to check whether the class label to be checked is a target class label or not based on the description text by taking the class label of which the confidence coefficient does not reach the confidence coefficient threshold value as the class label to be checked;
and the video labeling module is set to label the video to be labeled by adopting the target category label.
9. A video annotation device comprising a central processor and a memory, characterized in that said central processor is adapted to invoke and execute a computer program stored in said memory to perform the steps of the method according to any one of claims 1 to 7.
10. A non-transitory readable storage medium, storing a computer program implemented according to the method of any one of claims 1 to 7 in the form of computer readable instructions, the computer program, when invoked by a computer, performing the steps comprised by the corresponding method.
11. A computer program product comprising computer programs/instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 7.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination