CN109684506B - Video tagging processing method and device and computing equipment - Google Patents

Video tagging processing method and device and computing equipment

Info

Publication number
CN109684506B
CN109684506B (application number CN201811400848.8A)
Authority
CN
China
Prior art keywords
image feature
video
original
keywords
clustering
Prior art date
Legal status
Active
Application number
CN201811400848.8A
Other languages
Chinese (zh)
Other versions
CN109684506A (en)
Inventor
罗玄
张好
黄君实
陈强
Current Assignee
3600 Technology Group Co ltd
Original Assignee
3600 Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by 3600 Technology Group Co ltd
Priority to CN201811400848.8A
Publication of CN109684506A
Application granted
Publication of CN109684506B
Current legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06F 18/24 Classification techniques
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/12 Classification; Matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video tagging processing method and device. The method comprises the following steps: acquiring original video data; inputting the original video data into a feature extraction network for image feature extraction to obtain an image feature vector of a first preset dimension for the original video; performing cluster analysis on the image feature vectors to obtain different classifications of the original videos corresponding to the image feature vectors; and, for the different classifications of the original videos corresponding to the image feature vectors, extracting keywords from the titles of the original videos in the same class and selecting one or more keywords from them according to a first predetermined rule as labels of the original videos of that class. The scheme of the embodiments of the invention achieves efficient and accurate video classification and accurate, comprehensive video tagging, thereby improving the search hit rate and recommendation accuracy of videos.

Description

Video tagging processing method and device and computing equipment
Technical Field
The invention relates to the technical field of video processing, in particular to a video tagging method, a video tagging device, a computer storage medium and computing equipment.
Background
With the development and popularization of network technology, a large number of platforms that aggregate and distribute video have emerged, providing personalized video services to network users, including uploading, searching, recommendation, playing, downloading and other services. To make it easy for users to search for and use videos, and to recommend videos according to users' interests and needs, the massive number of videos on a platform must be accurately classified and each video must be assigned comprehensive, reasonable labels. Existing video labeling methods generally extract keywords as labels through manual annotation or simply from the title and descriptive text of a single video, and suffer from low efficiency, low accuracy and narrow label coverage. Therefore, a highly efficient, highly accurate and comprehensive video tagging technique is needed.
Disclosure of Invention
In view of the above problems, the present invention has been made to provide a video tagging processing method, a video tagging processing apparatus, a computer storage medium, and a computing device that overcome or at least partially solve the above problems.
According to an aspect of the embodiment of the present invention, there is provided a video tagging method, including:
Acquiring original video data;
inputting the original video data into a feature extraction network to extract image features to obtain an image feature vector of a first preset dimension of the original video;
performing cluster analysis on the image feature vectors to obtain different classifications of the original video corresponding to the image feature vectors;
and extracting keywords in titles corresponding to the original videos in the same category aiming at different categories of the original videos corresponding to the image feature vectors, and selecting one or more keywords from the keywords according to a first preset rule to serve as labels of the original videos in the category.
Optionally, inputting the original video data into a feature extraction network to perform image feature extraction to obtain an image feature vector of a first preset dimension of the original video, including:
decoding the original video data to obtain a plurality of video frames;
inputting each video frame in the plurality of video frames into a convolutional neural network CNN for image feature extraction to obtain an image feature vector of a second preset dimension of each video frame;
and carrying out operation processing on the image feature vectors of the plurality of video frames by using a preset algorithm to obtain the image feature vector of the first preset dimension of the original video.
Optionally, after decoding the original video data to obtain a plurality of video frames, the method further includes:
extracting n video frames from the plurality of video frames at predetermined time intervals;
inputting each video frame in the plurality of video frames into a convolutional neural network CNN for image feature extraction to obtain an image feature vector of a second preset dimension of each video frame, wherein the method comprises the following steps:
inputting each video frame in the n video frames into CNN for image feature extraction to obtain m-dimensional image feature vectors corresponding to each video frame;
performing operation processing on the image feature vectors of the plurality of video frames by using a preset algorithm to obtain an image feature vector of a first preset dimension of the original video, wherein the operation processing comprises the following steps:
connecting n m-dimensional image feature vectors in time sequence to obtain an n multiplied by m-dimensional image feature vector;
and performing dimension reduction analysis on the n multiplied by m dimension image feature vectors to obtain final first preset dimension image feature vectors of the original video, wherein n and m are natural numbers larger than 1.
Optionally, performing a dimension reduction analysis on the n×m-dimensional image feature vector includes:
and carrying out average pooling on the n multiplied by m dimension image feature vectors.
Optionally, the cluster analysis comprises K-means clustering.
Optionally, extracting keywords in the title corresponding to the original video in the same class includes:
the title corresponding to each original video in the same class is segmented to obtain a plurality of segmented words;
and selecting one or more segmented words from the plurality of segmented words according to a preset screening strategy to serve as keywords of the original video.
Optionally, selecting one or more keywords from the keywords according to a first predetermined rule as tags of the original video of the category, including:
counting the occurrence times of each keyword;
and selecting keywords with the occurrence times greater than or equal to a preset threshold value as labels of the original videos of the category.
Optionally, selecting one or more keywords from the keywords according to a first predetermined rule as tags of the original video of the category, including:
counting the occurrence times of each keyword;
sorting the keywords according to the occurrence times;
the designated number of keywords ranked first are selected as labels of the original video of the category.
Optionally, after acquiring the original video data, the method further comprises:
separating an audio signal from the original video data;
Performing cluster analysis on the audio signals to obtain different classifications of the audio signals;
and extracting keywords in titles of original videos corresponding to the audio signals in the same category aiming at different categories of the audio signals, and selecting one or more keywords from the keywords according to a second preset rule to serve as labels of the original videos corresponding to the audio signals in the category.
Optionally, performing cluster analysis on the audio signal to obtain different classifications of the audio signal, including:
discretizing the audio signal to obtain target audio;
extracting the characteristics of the target audio through a time sequence convolutional neural network CNN to obtain a voice characteristic vector of the target audio;
and carrying out cluster analysis on the voice feature vector to obtain different classifications of the audio signals corresponding to the voice feature vector.
According to another aspect of the embodiment of the present invention, there is also provided a video tagging apparatus, including:
the video data acquisition module is suitable for acquiring original video data;
the image feature extraction module is suitable for inputting the original video data into a feature extraction network to extract image features, so as to obtain an image feature vector of a first preset dimension of the original video;
The first cluster analysis module is suitable for carrying out cluster analysis on the image feature vectors to obtain different classifications of the original video corresponding to the image feature vectors; and
the first labeling module is suitable for extracting keywords in titles corresponding to the original videos in the same category according to different categories of the original videos corresponding to the image feature vectors, and selecting one or more keywords from the keywords as labels of the original videos in the category according to a first preset rule.
Optionally, the image feature extraction module is further adapted to:
decoding the original video data to obtain a plurality of video frames;
inputting each video frame in the plurality of video frames into a convolutional neural network CNN for image feature extraction to obtain an image feature vector of a second preset dimension of each video frame;
and carrying out operation processing on the image feature vectors of the plurality of video frames by using a preset algorithm to obtain the image feature vector of the first preset dimension of the original video.
Optionally, the image feature extraction module is further adapted to:
after decoding the original video data to obtain a plurality of video frames, extracting n video frames from the plurality of video frames according to a preset time interval;
Inputting each video frame in the n video frames into CNN for image feature extraction to obtain m-dimensional image feature vectors corresponding to each video frame;
connecting n m-dimensional image feature vectors in time sequence to obtain an n multiplied by m-dimensional image feature vector;
and performing dimension reduction analysis on the n multiplied by m dimension image feature vectors to obtain final first preset dimension image feature vectors of the original video, wherein n and m are natural numbers larger than 1.
Optionally, the image feature extraction module is further adapted to:
and carrying out average pooling on the n multiplied by m dimension image feature vectors to realize dimension reduction.
Optionally, the cluster analysis comprises K-means clustering.
Optionally, the first labeling module is further adapted to:
the title corresponding to each original video in the same class is segmented to obtain a plurality of segmented words;
and selecting one or more segmented words from the plurality of segmented words according to a preset screening strategy to serve as keywords of the original video.
Optionally, the first labeling module is further adapted to:
counting the occurrence times of each keyword;
and selecting keywords with the occurrence times greater than or equal to a preset threshold value as labels of the original videos of the category.
Optionally, the first labeling module is further adapted to:
counting the occurrence times of each keyword;
sorting the keywords according to the occurrence times;
the designated number of keywords ranked first are selected as labels of the original video of the category.
Optionally, the apparatus further comprises:
an audio signal separation module adapted to separate an audio signal from the original video data;
the second clustering analysis module is suitable for carrying out clustering analysis on the audio signals to obtain different classifications of the audio signals; and
the second labeling module is suitable for extracting keywords in titles of original videos corresponding to the audio signals in the same category according to different categories of the audio signals, and selecting one or more keywords from the keywords according to a second preset rule to serve as labels of the original videos corresponding to the audio signals in the category.
Optionally, the second cluster analysis module includes:
the audio discretization unit is suitable for discretizing the audio signal to obtain target audio; the voice feature extraction unit is suitable for extracting features of the target audio through a time sequence convolutional neural network CNN to obtain a voice feature vector of the target audio;
And the audio cluster analysis unit is suitable for carrying out cluster analysis on the voice feature vector to obtain different classifications of audio signals corresponding to the voice feature vector.
According to yet another aspect of embodiments of the present invention, there is also provided a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to perform a method of tagging video according to any of the preceding claims.
According to yet another aspect of an embodiment of the present invention, there is also provided a computing device including:
a processor; and
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform the video tagging method according to any one of the preceding claims.
According to the method and the device for labeling the video, after the original video data are acquired, the image feature vectors of the original video are extracted through the feature extraction network, then the image feature vectors are subjected to cluster analysis to obtain different classifications of the original video corresponding to the image feature vectors, and one or more keywords are selected from keywords of titles corresponding to the original video in the same category according to a preset rule and serve as labels of the original video in the category according to the different classifications of the original video. By adopting the feature extraction network to extract the image feature vectors of the video and performing cluster analysis on the image feature vectors, the efficient and accurate video classification is realized. Furthermore, by screening one or more keywords from keywords contained in the titles of all videos in the same category according to a preset rule, compared with manual labeling and single video labeling modes, high-accuracy and comprehensive video labeling is achieved, and therefore the search hit rate and recommendation accuracy of the videos can be improved.
Further, after the original video data is obtained, the audio signals can be separated from the original video data, then the separated audio signals are subjected to cluster analysis to obtain different classifications of the audio signals, and one or more keywords are selected from keywords of the title of the original video corresponding to the audio signals in the same category according to a preset rule and used as labels of the original video corresponding to the audio signals in the category according to the different classifications of the audio signals. By further acquiring the labels related to the voice characteristics of the video, the accuracy and the comprehensiveness of the finally generated video labels are further improved.
The foregoing is merely an overview of the technical solution of the invention. In order that the technical means of the invention may be more clearly understood and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the invention more readily apparent, specific embodiments of the invention are set forth below.
The above, as well as additional objectives, advantages, and features of the present invention will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present invention when read in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a flow chart of a method of tagging video according to an embodiment of the present invention;
FIG. 2 shows a flow chart of a clustering algorithm according to an embodiment of the invention;
FIG. 3 shows a flow chart of a method of tagging video according to another embodiment of the present invention;
fig. 4 is a schematic structural view showing a video labeling apparatus according to an embodiment of the present invention; and
fig. 5 shows a schematic structural diagram of a video labeling apparatus according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The inventors have found that for video distribution platforms (e.g., fast video, short-sounded video APP, cool video network, etc.), the user's use experience is largely dependent on the search hit rate and recommendation accuracy of the video. In order to ensure the search hit rate and the recommendation accuracy, the massive videos on the platform need to be accurately classified and comprehensive and reasonable labels are allocated to each video. The existing video labeling method generally extracts keywords as labels through manual labeling or simply through the titles and the explanatory characters of single videos, and has low operation efficiency, low accuracy and small label coverage.
In order to solve the technical problems, an embodiment of the invention provides a video labeling processing method. Fig. 1 shows a flowchart of a video tagging processing method according to an embodiment of the present invention. Referring to fig. 1, the method may include at least the following steps S102 to S108.
Step S102, original video data is acquired.
Step S104, inputting the original video data into a feature extraction network to extract image features, and obtaining an image feature vector of a first preset dimension of the original video.
And S106, performing cluster analysis on the image feature vectors to obtain different classifications of the original video corresponding to the image feature vectors.
Step S108, for the different classifications of the original videos corresponding to the image feature vectors, keywords are extracted from the titles of the original videos in the same class, and one or more keywords are selected from them according to a first predetermined rule as labels of the original videos of that class.
According to the video labeling processing method, after original video data are acquired, image feature vectors of the original video are extracted through a feature extraction network, then clustering analysis is carried out on the image feature vectors, different classifications of the original video corresponding to the image feature vectors are obtained, and one or more keywords are selected from keywords of titles corresponding to the original video in the same category according to preset rules and used as labels of the original video in the category according to the different classifications of the original video. By adopting the feature extraction network to extract the image feature vectors of the video and performing cluster analysis on the image feature vectors, the efficient and accurate video classification is realized. Furthermore, by screening one or more keywords from keywords contained in the titles of all videos in the same category according to a preset rule, compared with manual labeling and single video labeling modes, high-accuracy and comprehensive video labeling is achieved, and therefore the search hit rate and recommendation accuracy of the videos can be improved.
In the above step S102, the original video data may be acquired in various manners, for example, the video uploaded by the user may be directly acquired, the video may be recorded by the image capturing apparatus, etc., which is not limited by the present invention.
In step S104, the image feature vector of the first preset dimension of the original video is extracted through the feature extraction network.
The feature extraction network mentioned here may employ a deep-learning artificial neural network, such as a convolutional neural network (CNN), a back-propagation (BP) neural network, or a learning vector quantization (LVQ) neural network. A CNN is preferred. A CNN is a supervised machine learning model, one of the representative methods of deep learning, and can automatically extract image features.
The image features extracted by the feature extraction network may mainly include color features, texture features, shape features, spatial relationship features, and the like.
The numerical value of the first preset dimension can be obtained through experimental verification according to practical application occasions. In a specific embodiment, the value of the first preset dimension may be 1024 dimensions, that is, 1024 dimensions of image feature vectors may be extracted by the feature extraction network, where the 1024 dimensions of image feature vectors can more comprehensively embody content information of the original video.
In an alternative embodiment, step S104 may be implemented as the following steps:
the first step, decoding the original video data to obtain a plurality of video frames.
Video data is composed of individual video frames. In order to accurately extract the image features of the video data, the original video data needs to be decoded first to obtain a plurality of video frames.
And a second step of inputting each video frame in the plurality of video frames into a convolutional neural network CNN for image feature extraction to obtain an image feature vector of a second preset dimension of each video frame.
In this step, the second predetermined dimension may be the same as or different from the first predetermined dimension.
And thirdly, carrying out operation processing on the image feature vectors of the plurality of video frames by using a preset algorithm to obtain the image feature vector of the first preset dimension of the original video.
When the second preset dimension is the same as the first preset dimension, for example when both are 1024 dimensions, the image feature vectors of the plurality of video frames can be averaged dimension by dimension, so that the representation of the original video composed of the per-frame image feature vectors is directly reduced to an image feature vector of the first preset dimension.
When the second preset dimension differs from the first preset dimension, the image feature vector of the first preset dimension of the original video can be obtained by performing a dimension-reduction analysis on the representation of the original video composed of the per-frame image feature vectors. The dimension-reduction algorithm may be, for example, principal component analysis (PCA).
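By way of a non-limiting illustration only, the per-frame feature extraction described above may be sketched as follows in Python. The sketch assumes OpenCV for decoding and a pretrained torchvision ResNet as the feature extraction network; these library and model choices, and the 2048-dimensional output, are assumptions of the illustration rather than requirements of the embodiment, and the frame-sampling refinement described below is omitted here.

```python
import cv2
import numpy as np
import torch
from torchvision import models, transforms

# Pretrained CNN with its classification head removed, so the pooled
# activations act as the per-frame image feature vector (second preset dim).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def per_frame_features(video_path):
    """Decode the original video and return an (n, m) array holding one
    m-dimensional CNN feature vector per decoded frame."""
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()          # frames come out in time order
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            vec = backbone(preprocess(rgb).unsqueeze(0)).squeeze(0)
        feats.append(vec.numpy())
    cap.release()
    return np.stack(feats)              # n frames x m features (m = 2048 here)
```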
Further, after decoding the original video data in the first step to obtain a plurality of video frames, the method further includes the following steps:
n video frames are extracted from the plurality of video frames at predetermined time intervals.
The predetermined time interval may be set according to actual requirements, for example, may be set to 2s.
At this time, the second step may be further implemented as:
and inputting each video frame in the n video frames into CNN for image feature extraction to obtain m-dimensional image feature vectors corresponding to each video frame.
The m dimensions here refer to the second preset dimension mentioned above. The m-dimensional image feature vector may be expressed as Vi = {T1, T2, …, Tm}, where i denotes the i-th video frame of the n video frames, and T1, T2, …, Tm respectively denote the m image features extracted from the i-th video frame.
Meanwhile, the third step may be further implemented as:
first, n m-dimensional image feature vectors are connected in time sequence to obtain an n×m-dimensional image feature vector. Specifically, the n×m-dimensional image feature vector may be expressed as { V1, V2, …, vi, …, vn }, for example.
And then, performing dimension reduction analysis on the n multiplied by m dimension image feature vector to obtain a final first preset dimension image feature vector of the original video. Both n and m mentioned above are natural numbers greater than 1.
Further, the dimension reduction of the n×m-dimensional image feature vector may also be achieved by average pooling the n×m-dimensional image feature vector.
Average pooling averages the feature points within a neighborhood; its goal is to aggregate features, reduce the number of parameters, and preserve translational invariance. Average pooling reduces the error caused by the increased variance of the estimates that results from a limited neighborhood size, and emphasizes downsampling of the feature information as a whole, so its contribution to reducing the parameter dimension is larger.
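Continuing the illustration, once per-frame vectors are available, the sampling, time-ordered connection and average pooling described above may be sketched as follows; the 2 s interval, the 1024-dimensional target and the use of scikit-learn PCA for the case where m differs from the first preset dimension are example choices, not those of the embodiment.

```python
import numpy as np
from sklearn.decomposition import PCA

def sample_and_pool(frame_feats, fps, interval_s=2.0):
    """frame_feats: (num_frames, m) per-frame vectors in time order.
    Keep one frame every `interval_s` seconds, connect the n sampled vectors
    in time order into an n x m representation, then average-pool over the
    time axis to obtain a single m-dimensional video vector."""
    step = max(int(round(fps * interval_s)), 1)
    sampled = frame_feats[::step]            # n x m, still in time order
    return sampled.mean(axis=0)              # average pooling -> (m,)

def reduce_corpus(video_vectors, first_preset_dim=1024):
    """If m differs from the first preset dimension, PCA fitted over a corpus
    of pooled video vectors can map them down to the target dimension."""
    pca = PCA(n_components=first_preset_dim)
    return pca.fit_transform(np.stack(video_vectors))
```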
In step S106 above, efficient and accurate video classification is achieved by performing cluster analysis on the image feature vectors extracted from the original video. The cluster analysis algorithm can adopt K-means clustering, hierarchical clustering and the like. Preferably, a K-means clustering algorithm is used.
Traditional clustering methods have some problems when clustering data: one is updating the data, and the other is that the cluster centers are uncontrollable, i.e., once the clustering process has iterated to a certain degree it cannot be judged whether the final clustering result meets the requirements or whether the centers are accurate, which affects the accuracy of the final clustering result. To solve these problems, the embodiment of the invention introduces a purity calculation into the clustering process to monitor the clustering result, so that the accuracy of the clustering result can be improved while the clustering of the data to be processed is optimized. Referring to fig. 2, the clustering algorithm of the inventive scheme may include the following steps S1-S4.
Step S1, obtaining to-be-processed data comprising a plurality of clustering objects and the specified target class number of the to-be-processed data.
In this step, the clustering objects are a plurality of original videos, and the data to be processed of the clustering objects are image feature vectors extracted from the original video data.
And S2, classifying each clustering object in the data to be processed according to the class attribute of each clustering object to obtain the clustering class of the specified target class number.
In this step, the K-means clustering algorithm may be preferentially used to classify each clustered object. The specific process comprises the steps of S2-1 to S2-5.
S2-1, randomly initializing each clustering center of the data to be processed based on the specified target class number.
S2-2, calculating the distance between each clustering object in the data to be processed and each clustering center, and classifying each clustering object into the clustering category where the corresponding clustering center is located by the minimum distance.
Each clustering object in the data to be processed can be regarded as a data point in a multidimensional space. In the initial clustering, the specified number of target classes, say k (a natural number set according to different requirements), is known, i.e. the data to be processed needs to be divided into k classes. The cluster centers of the data to be processed can therefore be randomly initialized based on the specified number of target classes: k clustering objects are selected as initial cluster centers, the distance from every other clustering object to each selected cluster center is calculated, and each clustering object is assigned to the cluster center closest to it.
In general, when clustering a plurality of clustered objects, a plurality of iterative processes are required to achieve the optimal effect, and therefore, after the step S2-2, the method may further include:
S2-3, calculating new cluster centers of all cluster categories;
s2-4, obtaining the distance from each clustering object to the new clustering center, and classifying each clustering object into the clustering category to which the corresponding new clustering center belongs by the minimum distance;
s2-5, iteratively calculating the designated times of the new cluster centers of the cluster types until the change distance of the new cluster centers of the cluster types is within a preset range.
When the new cluster center of each cluster category is calculated in step S2-3, the clustering objects have already been grouped in step S2-2 into the specified number of cluster categories. For any selected cluster category, the mean of that category is computed, i.e. a vector of the same length as the clustering objects is calculated and taken as the new cluster center; the other cluster categories are processed in the same way.
After the new cluster centers of the specified number of target classes are confirmed, the distance from each clustering object to each new cluster center is calculated, and each clustering object is assigned by minimum distance to the cluster category to which the corresponding new cluster center belongs. Steps S2-3 to S2-4 are repeated, iteratively recalculating the new cluster centers of each cluster category for the specified number of times, until the change distance of the new cluster centers of each cluster category falls within a preset range; this preset range can be set according to different application requirements.
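As a plain illustration of steps S2-1 to S2-5, a minimal K-means loop may be written as follows; the tolerance and iteration limit are example parameters, and an off-the-shelf K-means implementation could equally be used.

```python
import numpy as np

def kmeans(data, k, max_iter=100, tol=1e-4, seed=0):
    """data: (N, d) image feature vectors; k: specified number of target
    classes. Returns per-object labels and the final cluster centers."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]     # S2-1
    for _ in range(max_iter):
        # S2-2 / S2-4: assign every object to its nearest cluster center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S2-3: recompute each cluster center as the mean of its members
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # S2-5: stop once every center has moved less than the preset range
        if np.linalg.norm(new_centers - centers, axis=1).max() < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```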
In the embodiment of the invention, when the selection of the new cluster center is completed once, the purity of the new cluster category can be calculated, or the purity of the cluster category can be calculated after the designated times of clustering.
And S3, calculating the purity of each cluster type.
In this step, in the process of calculating the purity of each cluster category, for any cluster category, a first cluster object of a specified proportion may be first screened out based on all the cluster objects of the cluster category. Then, a preset number of second cluster objects adjacent to each first cluster object are respectively acquired. Finally, the purity of the cluster class is calculated based on the class attribute of the second cluster object.
In practice, the purity of each cluster category can be calculated by combining the k-nearest-neighbor (KNN) method with the following formula:
purity_i = NUM(x ∈ class_i) / NUM(x)
In this formula, purity_i denotes the purity of cluster category i; class_i denotes cluster category i; knn_y denotes the k nearest neighbors of a sample y; NUM(x) denotes the total number of clustering objects x obtained by taking the k nearest neighbors of all clustering objects in cluster category i; and NUM(x ∈ class_i) denotes the number of those clustering objects that belong to cluster category i.
And S4, confirming the final clustering category of the data to be processed according to the purity of each clustering category.
In this step, the final clustering category of the data to be processed may be confirmed in combination with the purity of each clustering category, and the clustering center of each final clustering category may be output, and in a preferred embodiment of the present invention, step S4 may specifically include the following steps:
s4-1, judging whether the iteration calculation times of the new cluster centers of each cluster category reach the maximum iteration times;
s4-2, screening out a first cluster category with purity greater than preset initial screening purity if the iteration calculation times of the new cluster centers of the cluster categories do not reach the maximum iteration times;
s4-3, storing and inputting the clustering centers of the first clustering category.
When the clustering algorithm provided by the embodiment of the invention is used to process video data, the clustering is performed on the extracted image features of the videos, treated as data points in the same multidimensional vector space, thereby achieving efficient and accurate video classification.
In step S108, for the different classifications of the original videos corresponding to the image feature vectors, keywords are extracted from the titles of the original videos in each class, and some of these keywords are selected as labels of the original videos of that class.
In an alternative embodiment, extracting keywords in titles corresponding to original videos in the same class may be implemented as:
The title corresponding to each original video in the same class is segmented to obtain a plurality of segmented words; then, one or more segmentations are selected from the plurality of segmentations as keywords for the original video according to a predetermined filtering strategy.
In one embodiment, the selection may be based on the lexical category of each word (e.g., nouns or trending words), so as to extract one or more keywords from the video title that are more relevant to the characteristics of the video.
It should be noted that, since some words, such as prepositions, conjunctions, auxiliary words and other function words, have no real meaning, these words may be removed after the word segmentation process.
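A hedged sketch of this keyword-extraction step is given below; it assumes a Chinese word segmenter with part-of-speech tags, such as jieba, which the embodiment does not prescribe, and the kept and dropped part-of-speech tags are example choices of the screening strategy.

```python
import jieba.posseg as pseg   # assumption: jieba, or any segmenter with POS tags

DROP_FLAGS = ("p", "c", "u")            # prepositions, conjunctions, auxiliaries

def title_keywords(title, keep_flags=("n", "nr", "ns", "nz", "vn")):
    """Segment one video title and keep noun-like segments as keywords,
    discarding function words that carry no real meaning."""
    keywords = []
    for seg in pseg.cut(title):
        word, flag = seg.word.strip(), seg.flag
        if not word or flag.startswith(DROP_FLAGS):
            continue
        if flag.startswith(keep_flags):
            keywords.append(word)
    return keywords
```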
In the embodiment of the invention, after the keywords in the titles corresponding to the original videos in the same category are extracted, one or more keywords are selected from the extracted keywords according to the first preset rule to serve as labels of the original videos in the category, so that the video is labeled. The selection of keywords as video tags may include the following two ways.
Mode one
For each type of clustered original video, first, the number of occurrences of each keyword extracted from the title of the type of original video may be counted. Then, a keyword whose number of occurrences is greater than or equal to a predetermined threshold is selected as a tag of the original video of the category. The predetermined threshold mentioned herein may be set according to the actual requirements of the application.
Mode two
For each type of clustered original video, first, the number of occurrences of each keyword extracted from the title of the type of original video may be counted. And then, sorting the keywords according to the counted occurrence times of the keywords. Finally, the appointed number of keywords which are ranked in front are selected as labels of the original video of the category. For example, the top 10 keywords (i.e., the keywords ranked 10 top) may be selected as labels for the original video of the category.
By screening one or more keywords with higher occurrence frequency from keywords contained in the titles of all videos in the same category as the labels of the videos, compared with manual labeling and single video labeling modes, the video labeling with high accuracy and comprehensiveness is realized, and therefore the search hit rate and the recommendation accuracy of the videos can be improved.
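The two selection modes may be sketched as follows; the threshold and top-N values are example parameters rather than values fixed by the embodiment.

```python
from collections import Counter

def cluster_tags(title_keyword_lists, min_count=None, top_n=10):
    """title_keyword_lists: the keywords extracted from every title of one
    clustered category of original videos.  Mode one keeps keywords whose
    occurrence count reaches `min_count`; mode two keeps the `top_n` most
    frequent keywords as the labels of that category."""
    counts = Counter(w for kws in title_keyword_lists for w in kws)
    if min_count is not None:                         # mode one
        return [w for w, c in counts.items() if c >= min_count]
    return [w for w, _ in counts.most_common(top_n)]  # mode two
```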
Video typically includes a picture (i.e., image) and corresponding sound, and in some cases, the sound content of the video characterizes features that cannot be characterized by the video picture, so that by separating an audio signal from the video for analysis, feature information of the video can be further obtained.
In a preferred embodiment, as shown in fig. 3, in addition to the above-mentioned steps S102 to S108, the tagging processing method of the video may further include the following steps S110 to S114 after the step S102 is performed to acquire the original video data.
Step S110, separating the audio signal from the original video data.
Step S112, performing cluster analysis on the separated audio signals to obtain different classifications of the audio signals.
Step S114, extracting keywords in the titles of the original videos corresponding to the audio signals in the same category according to different categories of the audio signals, and selecting one or more keywords from the keywords according to a second predetermined rule as labels of the original videos corresponding to the audio signals in the category.
The accuracy and the comprehensiveness of the finally generated video label are further improved by further acquiring the label related to the voice characteristic of the video on the basis of acquiring the label related to the image characteristic of the video.
In a more preferred embodiment, the above step S112 may be further implemented as the following steps:
first, discretizing the separated audio signal to obtain a target audio.
The discretization referred to herein means sampling and quantizing an analog audio signal, i.e., discretizing sound in both time axis and amplitude, thereby converting into a digital signal. The sampling frequency is typically not less than twice the highest frequency of the sound signal to achieve lossless digitization. Quantization refers to the fact that the amplitude value for each sampling point in the sampling process is represented by a digital quantity. If the partitions of amplitude are equally spaced, this is called linear quantization, otherwise nonlinear quantization. The larger the number of quantization levels, the larger the dynamic range of the acoustic wave amplitude that can be represented, and the smaller the quantization noise.
Optionally, discretizing the separated audio signal may be further implemented as:
the audio signal is sampled at a specified sampling frequency, the sampled amplitude values are quantized, and the result is encoded into a pulse code modulation (PCM) signal.
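An illustrative sketch of this discretization is given below; it assumes the separated audio is available as a file, uses the soundfile library for reading, and applies a 16 kHz rate, 16-bit depth and naive interpolation-based resampling purely as example simplifications.

```python
import numpy as np
import soundfile as sf    # assumption: any audio I/O library would do

def to_pcm16(audio_path, target_sr=16000):
    """Sample the separated audio signal at a fixed rate and quantise the
    amplitudes linearly to 16-bit integers, i.e. a linear PCM encoding."""
    signal, sr = sf.read(audio_path, dtype="float32")
    if signal.ndim > 1:
        signal = signal.mean(axis=1)                  # down-mix to mono
    if sr != target_sr:                               # naive resampling
        t_old = np.arange(len(signal)) / sr
        t_new = np.arange(int(len(signal) * target_sr / sr)) / target_sr
        signal = np.interp(t_new, t_old, signal)
    pcm = np.clip(signal, -1.0, 1.0)
    return (pcm * 32767).astype(np.int16), target_sr  # 16-bit quantisation
```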
Then, extracting the characteristics of the target audio through a time sequence convolutional neural network CNN to obtain the voice characteristic vector of the target audio.
During feature extraction, the convolution kernel of the time-sequence CNN is kept consistent with the extracted features in the feature dimension and moves only along the time dimension, thereby achieving temporal convolution. Because audio is sequential in time, features are extracted over time windows by the time-sequence CNN and one-dimensional convolution is applied by the neural network to obtain the audio representation, which improves the efficiency and accuracy of audio feature extraction.
The extracted audio features may include, for example, fundamental frequency, formants, Mel-frequency cepstral coefficients (MFCC), short-time power spectral density, and the like.
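The idea of a convolution kernel that spans the whole feature dimension and slides only along time can be pictured with the minimal PyTorch sketch below; the layer sizes and the 64-dimensional input features are assumptions of the illustration, not the network of the embodiment.

```python
import torch
import torch.nn as nn

# Input: (batch, q, p) - q feature channels over p time steps, so each kernel
# covers the whole feature dimension and slides only along the time axis.
temporal_cnn = nn.Sequential(
    nn.Conv1d(in_channels=64, out_channels=128, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=5, padding=2),
    nn.ReLU(),
)

x = torch.randn(1, 64, 300)             # e.g. 300 frames of 64-d features
frame_embeddings = temporal_cnn(x)      # (1, 128, 300): one vector per frame
```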
And finally, carrying out cluster analysis on the voice feature vectors to obtain different classifications of the audio signals corresponding to the voice feature vectors. The clustering algorithm employed for the clustering analysis may be similar to the clustering algorithm employed for the clustering analysis of the image feature vectors extracted from the original video data. At this time, the clustered objects are a plurality of audio signals separated from the original video data, and the data to be processed of the clustered objects are speech feature vectors extracted after the separated audio signals are converted.
In a specific embodiment, the step of extracting the feature of the target audio through the time-series convolutional neural network CNN to obtain the speech feature vector of the target audio may be implemented in the following manner:
(1) And framing the target audio according to the time window to obtain p audio frames.
Specifically, for example, an audio segment with a length of 25ms may be taken every 10ms, that is, frames are divided by frame length of 25ms and frame shift of 10ms, so as to obtain p audio frames with frame length of 25ms and 15ms overlap between every two adjacent frames.
(2) A short-time Fourier transform (STFT) is performed on each audio frame, transforming the audio frame signal from the time domain to the frequency domain to obtain a spectrogram of the audio frame.
(3) And carrying out log-mel transformation on the spectrogram of each audio frame to obtain a log-mel spectrogram of the audio frame.
Specifically, the spectrogram of each audio frame may be passed through a Mel-scale filter bank to perform the log-Mel transformation, obtaining the corresponding log-Mel spectrogram (also referred to as the Mel spectrum). The log-Mel transformation follows the mapping Mel(f) = 2595 × log10(1 + f/700), where f denotes the ordinary frequency.
(4) And inputting the log-mel spectrogram of each audio frame into a time sequence CNN for feature extraction to obtain q-dimensional voice feature vectors corresponding to each audio frame.
In particular, the extracted features may characterize speech features of each audio frame (audio clip), such as human voice, instrument voice, vehicle engine voice, animal voice, and the like.
The q-dimensional speech feature vector may be expressed as Ai = {C1, C2, …, Cq}, where i denotes the i-th audio frame of the p audio frames, and C1, C2, …, Cq respectively denote the q features extracted from the i-th audio frame.
(5) And connecting p q-dimensional voice feature vectors in time sequence to obtain a p multiplied by q-dimensional voice feature vector.
Specifically, the p×q-dimensional speech feature vector may be expressed as {A1, A2, …, Ai, …, Ap}.
(6) And performing dimension reduction analysis on the p multiplied by q dimension voice feature vector to obtain the final voice feature vector with the appointed dimension of the target audio. Wherein p and q mentioned above are natural numbers greater than 1.
The dimension reduction algorithm may employ an algorithm such as principal component analysis (PCA). The specified dimension of the final speech feature vector of the target audio can be obtained through experimental verification according to the practical application. In one embodiment, the specified dimension of the final speech feature vector of the target audio may be 640 dimensions, which both ensures adequate characterization of the audio features and reduces the computational effort for subsequent processing.
More preferably, the dimension reduction of the p×q-dimensional speech feature vector may also be achieved by average pooling the p×q-dimensional speech feature vector.
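A compact illustrative sketch of steps (1) to (6) is given below, assuming librosa for the STFT and Mel filter bank; the 64 Mel bands and the use of simple average pooling in place of a learned reduction are example choices.

```python
import numpy as np
import librosa   # assumption: librosa supplies the STFT and Mel filter bank

def log_mel_frames(pcm, sr=16000, n_mels=64):
    """Steps (1)-(3): 25 ms frames with a 10 ms shift, an STFT per frame,
    then a Mel filter bank and a log, giving one log-Mel vector per frame."""
    win = int(0.025 * sr)            # 25 ms frame length
    hop = int(0.010 * sr)            # 10 ms frame shift
    mel = librosa.feature.melspectrogram(
        y=pcm.astype(np.float32) / 32768.0, sr=sr,
        n_fft=win, hop_length=hop, win_length=win, n_mels=n_mels)
    return librosa.power_to_db(mel).T          # shape (p, n_mels)

def audio_vector(per_frame_feats):
    """Steps (5)-(6): connect the p per-frame q-dimensional feature vectors
    in time order and average-pool over the time axis; PCA could then map
    the pooled vector to the specified dimension if q differs from it."""
    stacked = np.stack(per_frame_feats)        # p x q, in time order
    return stacked.mean(axis=0)                # average pooling over time
```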
The keyword in the title is extracted in step S114 and the keyword is selected in a similar or identical manner to step S108. At this time, the different categories targeted are categories of audio signals separated from the original video data.
The labels related to the image features and the voice features of the video are respectively obtained to be used as the final labels of the video, so that the finally generated video labels cover more comprehensive and wider information, and the search hit rate and the recommendation accuracy rate of the video are further improved.
Based on the same inventive concept, the embodiment of the invention also provides a video labeling processing device, which is used for supporting the video labeling processing method provided by any one embodiment or combination thereof. Fig. 4 shows a schematic structural diagram of a video labeling apparatus according to an embodiment of the present invention. Referring to fig. 4, the apparatus may include at least: a video data acquisition module 410, an image feature extraction module 420, a first cluster analysis module 430, and a first tagging module 440.
The functions of each component or device of the video labeling processing device according to the embodiment of the invention and the connection relation between each part are described:
the video data acquisition module 410 is adapted to acquire raw video data.
The image feature extraction module 420 is connected to the video data acquisition module 410, and is adapted to input the original video data into the feature extraction network to perform image feature extraction, so as to obtain an image feature vector of a first preset dimension of the original video.
The first cluster analysis module 430 is connected to the image feature extraction module 420, and is adapted to perform cluster analysis on the image feature vectors, so as to obtain different classifications of the original video corresponding to the image feature vectors.
The first labeling module 440 is connected to the first cluster analysis module 430, and is adapted to extract keywords in titles corresponding to the original videos in the same category for different classifications of the original videos corresponding to the image feature vectors, and select one or more keywords from the extracted keywords as labels of the original videos in the category according to a first predetermined rule.
In an alternative embodiment, the image feature extraction module 420 is further adapted to:
decoding the original video data to obtain a plurality of video frames;
Inputting each video frame in the plurality of video frames into a convolutional neural network CNN for image feature extraction to obtain an image feature vector of a second preset dimension of each video frame;
and carrying out operation processing on the image feature vectors of the plurality of video frames by using a preset algorithm to obtain the image feature vector of the first preset dimension of the original video.
Further, the image feature extraction module 420 is further adapted to:
after decoding the original video data to obtain a plurality of video frames, extracting n video frames from the plurality of video frames according to a preset time interval;
inputting each video frame in the n video frames into CNN for image feature extraction to obtain m-dimensional image feature vectors corresponding to each video frame;
connecting n m-dimensional image feature vectors in time sequence to obtain an n multiplied by m-dimensional image feature vector;
and performing dimension reduction analysis on the n multiplied by m dimension image feature vectors to obtain final first preset dimension image feature vectors of the original video, wherein n and m are natural numbers larger than 1.
Still further, the image feature extraction module 420 is further adapted to:
and carrying out average pooling on the n multiplied by m dimension image feature vectors to realize dimension reduction.
In an alternative embodiment, the cluster analysis may include K-means clustering.
In an alternative embodiment, the first labeling module 440 is further adapted to:
the title corresponding to each original video in the same class is segmented to obtain a plurality of segmented words;
one or more segmentations are selected from the plurality of segmentations as keywords for the original video according to a predetermined filtering strategy.
In an alternative embodiment, the first labeling module 440 is further adapted to:
counting the occurrence times of each keyword aiming at the keywords extracted from the titles of the original video of the same class;
and selecting keywords with the occurrence times greater than or equal to a preset threshold value as labels of the original videos of the category.
In an alternative embodiment, the first labeling module 440 is further adapted to:
counting the occurrence times of each keyword aiming at the keywords extracted from the titles of the original video of the same class;
the keywords are ordered according to the occurrence times;
the designated number of keywords ranked first are selected as labels of the original video of the category.
In an alternative embodiment, as shown in fig. 5, the video labeling apparatus may further include an audio signal separation module 450, a second cluster analysis module 460, and a second labeling module 470.
The audio signal separation module 450 is connected to the video data acquisition module 410 and is adapted to separate an audio signal from the original video data after the video data acquisition module 410 acquires the original video data.
The second cluster analysis module 460 is connected to the audio signal separation module 450, and is adapted to perform cluster analysis on the separated audio signals, so as to obtain different classifications of the audio signals.
The second labeling module 470 is connected to the second cluster analysis module 460, and is adapted to extract keywords in the title of the original video corresponding to the audio signals in the same class according to different classifications of the audio signals, and select one or more keywords from the extracted keywords as labels of the original video corresponding to the audio signals in the class according to a second predetermined rule.
Further, referring to fig. 5, the second cluster analysis module 460 may include the following units:
an audio discretization unit 461 adapted to discretize the audio signal to obtain a target audio;
a voice feature extraction unit 462 adapted to extract features of the target audio through the time sequence CNN to obtain a voice feature vector of the target audio; and
the audio cluster analysis unit 463 is adapted to perform cluster analysis on the speech feature vector, so as to obtain different classifications of the audio signals corresponding to the speech feature vector.
Based on the same inventive concept, the embodiment of the invention also provides a computer storage medium. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to perform the video tagging method according to any one or combination of the above embodiments.
Based on the same inventive concept, the embodiment of the invention also provides a computing device. The computing device may include:
a processor; and
a memory storing computer program code;
the computer program code, when executed by a processor, causes the computing device to perform the video tagging method according to any one or combination of the embodiments described above.
According to any one of the optional embodiments or the combination of multiple optional embodiments, the following beneficial effects can be achieved according to the embodiment of the invention:
according to the method and the device for labeling the video, after the original video data are acquired, the image feature vectors of the original video are extracted through the feature extraction network, then the image feature vectors are subjected to cluster analysis to obtain different classifications of the original video corresponding to the image feature vectors, and one or more keywords are selected from keywords of titles corresponding to the original video in the same category according to a preset rule and serve as labels of the original video in the category according to the different classifications of the original video. By adopting the feature extraction network to extract the image feature vectors of the video and performing cluster analysis on the image feature vectors, the efficient and accurate video classification is realized. Furthermore, by screening one or more keywords from keywords contained in the titles of all videos in the same category according to a preset rule, compared with manual labeling and single video labeling modes, high-accuracy and comprehensive video labeling is achieved, and therefore the search hit rate and recommendation accuracy of the videos can be improved.
Further, after the original video data are acquired, audio signals can be separated from the original video data; the separated audio signals are cluster-analyzed to obtain different classifications of the audio signals; and, for each classification, one or more keywords are selected according to a preset rule from the keywords of the titles of the original videos corresponding to the audio signals in the same class and used as labels of those videos. Acquiring labels related to the voice characteristics of the videos in this way further improves the accuracy and comprehensiveness of the finally generated video labels.
It will be clear to those skilled in the art that, for the specific working processes of the systems, devices and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; they are not repeated here for brevity.
In addition, each functional unit in the embodiments of the present invention may be physically independent, two or more functional units may be integrated together, or all functional units may be integrated in one processing unit. The integrated functional units may be implemented in hardware or in software or firmware.
Those of ordinary skill in the art will appreciate that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present invention, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes instructions for causing a computing device (for example, a personal computer, a server or a network device) to perform all or part of the steps of the methods described in the embodiments of the present invention when the instructions are executed. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk or an optical disk.
Alternatively, all or part of the steps of the foregoing method embodiments may be implemented by hardware (such as a personal computer, a server, a network device or another computing device) associated with program instructions; the program instructions may be stored in a computer-readable storage medium, and, when executed by a processor of the computing device, cause all or part of the steps of the methods according to the embodiments of the present invention to be performed.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, within the spirit and principle of the present invention; such modifications and substitutions do not depart from the scope of the invention.

Claims (20)

1. A video tagging method, comprising:
acquiring original video data;
inputting the original video data into a feature extraction network to extract image features to obtain an image feature vector of a first preset dimension of the original video;
performing cluster analysis on the image feature vectors to obtain different classifications of the original video corresponding to the image feature vectors;
for the different classifications of the original videos corresponding to the image feature vectors, extracting keywords in the titles corresponding to the original videos in the same category, and selecting one or more keywords from the keywords according to a first preset rule to serve as labels of the original videos in the category;
wherein, after the original video data is acquired, the method further comprises:
separating an audio signal from the original video data;
performing cluster analysis on the audio signals to obtain different classifications of the audio signals;
for the different classifications of the audio signals, extracting keywords in the titles of the original videos corresponding to the audio signals in the same category, and selecting one or more keywords from the keywords according to a second preset rule to serve as labels of the original videos corresponding to the audio signals in the category;
wherein performing cluster analysis on the image feature vectors to obtain the different classifications of the original videos corresponding to the image feature vectors comprises:
acquiring to-be-processed data comprising a plurality of clustering objects and a specified target class number of the to-be-processed data, wherein the clustering objects are the original videos, and the to-be-processed data of the clustering objects are the image feature vectors extracted from the original videos;
classifying each clustering object in the to-be-processed data according to the class attribute of each clustering object to obtain clustering categories of the specified target class number;
calculating the purity of each clustering category;
determining the final clustering categories of the to-be-processed data according to the purity of each clustering category, so as to obtain the different classifications of the original videos;
wherein calculating the purity of each clustering category comprises:
screening out first clustering objects in a specified proportion from all the clustering objects of the clustering category;
acquiring a predetermined number of second clustering objects adjacent to each first clustering object; and
calculating the purity of the clustering category according to the class attributes of the second clustering objects.
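A rough sketch of the purity computation recited at the end of claim 1 might look as follows. The function and parameter names, the nearest-neighbour search from scikit-learn, and the concrete purity formula (the share of neighbouring objects whose class attribute matches the cluster) are assumptions; the claim fixes only the overall procedure, not an implementation.

```python
# Hypothetical purity estimate for one clustering category; names and the exact
# formula are assumptions made for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cluster_purity(features, labels, cluster_id, sample_ratio=0.1, n_neighbors=5, seed=0):
    """features: (N, d) image feature vectors; labels: (N,) clustering categories."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    members = np.flatnonzero(labels == cluster_id)
    # Screen out "first clustering objects": a specified proportion of the category.
    n_first = max(1, int(len(members) * sample_ratio))
    first = rng.choice(members, size=n_first, replace=False)
    # Acquire a predetermined number of adjacent "second clustering objects".
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(features)
    _, idx = nn.kneighbors(features[first])        # column 0 is the query point itself
    second = idx[:, 1:].ravel()
    # Purity: fraction of second objects whose class attribute matches this category.
    return float(np.mean(labels[second] == cluster_id))
```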
2. The method of claim 1, wherein inputting the original video data into a feature extraction network for image feature extraction to obtain an image feature vector of a first preset dimension of the original video, comprises:
decoding the original video data to obtain a plurality of video frames;
inputting each video frame in the plurality of video frames into a convolutional neural network CNN for image feature extraction to obtain an image feature vector of a second preset dimension of each video frame;
and processing the image feature vectors of the plurality of video frames with a preset algorithm to obtain the image feature vector of the first preset dimension of the original video.
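For illustration, the frame decoding and per-frame feature extraction of claim 2 could be sketched with OpenCV and a generic pretrained image CNN. ResNet-18 (512-dimensional output) merely stands in for a CNN producing vectors of the "second preset dimension"; the backbone, the sampling stride and the preprocessing are assumptions, and the weights enum used below requires torchvision 0.13 or later.

```python
# Hypothetical sketch of claim 2: decode video frames and extract one feature
# vector per frame with an off-the-shelf CNN (ResNet-18 is only an example).
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

_preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def decode_frames(video_path, every_n=30):
    """Decode the original video; keep one frame every `every_n` frames
    (claim 3 similarly samples frames at predetermined intervals)."""
    cap, frames, i = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()
    return frames

def frame_feature_vectors(frames):
    """Return an (n_frames, 512) array: one 512-dimensional vector per frame."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()          # drop the classifier head
    backbone.eval()
    with torch.no_grad():
        batch = torch.stack([_preprocess(f) for f in frames])
        return backbone(batch).numpy()
```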
3. The method of claim 2, wherein after decoding the original video data to obtain the plurality of video frames, the method further comprises:
extracting n video frames from the plurality of video frames at predetermined time intervals;
wherein inputting each video frame of the plurality of video frames into the convolutional neural network CNN for image feature extraction to obtain the image feature vector of the second preset dimension of each video frame comprises:
inputting each of the n video frames into the CNN for image feature extraction to obtain an m-dimensional image feature vector corresponding to each video frame;
and wherein processing the image feature vectors of the plurality of video frames with the preset algorithm to obtain the image feature vector of the first preset dimension of the original video comprises:
connecting the n m-dimensional image feature vectors in time order to obtain an n×m-dimensional image feature vector;
and performing dimension-reduction analysis on the n×m-dimensional image feature vector to obtain the final image feature vector of the first preset dimension of the original video, wherein n and m are natural numbers greater than 1.
4. The method according to claim 3, wherein performing the dimension-reduction analysis on the n×m-dimensional image feature vector comprises:
performing average pooling on the n×m-dimensional image feature vector.
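A minimal numpy sketch of claims 3 and 4, connecting the n per-frame m-dimensional vectors in time order and average-pooling them into the final video-level vector, could read as follows (the names and the choice of numpy are assumptions):

```python
# Toy version of claims 3-4: temporal concatenation followed by average pooling.
import numpy as np

def video_feature_vector(frame_features):
    """frame_features: (n, m) per-frame vectors in temporal order -> (m,) video vector."""
    stacked = np.asarray(frame_features)   # the n x m feature "connected" in time order
    return stacked.mean(axis=0)            # average pooling over time reduces to m dimensions
```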
5. The method of claim 1, wherein the cluster analysis comprises K-means clustering.
6. The method of claim 1, wherein extracting keywords in titles corresponding to original videos in the same category comprises:
performing word segmentation on the title corresponding to each original video in the same category to obtain a plurality of segmented words;
and selecting one or more of the segmented words as keywords of the original video according to a preset screening strategy.
7. The method of claim 1, wherein selecting one or more keywords from the keywords according to the first preset rule as labels of the original videos of the category comprises:
counting the occurrence times of each keyword;
and selecting keywords with the occurrence times greater than or equal to a preset threshold value as labels of the original videos of the category.
8. The method of claim 1, wherein selecting one or more keywords from the keywords according to the first preset rule as labels of the original videos of the category comprises:
counting the occurrence times of each keyword;
sorting the keywords according to the occurrence times;
and selecting a designated number of top-ranked keywords as labels of the original videos of the category.
9. The method of claim 1, wherein performing cluster analysis on the audio signal results in different classifications of the audio signal, comprising:
discretizing the audio signal to obtain a target audio;
extracting the characteristics of the target audio through a time sequence convolutional neural network CNN to obtain a voice characteristic vector of the target audio;
and carrying out cluster analysis on the voice feature vector to obtain different classifications of the audio signals corresponding to the voice feature vector.
10. A video tagging processing apparatus comprising:
the video data acquisition module is suitable for acquiring original video data;
the image feature extraction module is suitable for inputting the original video data into a feature extraction network to extract image features, so as to obtain an image feature vector of a first preset dimension of the original video;
the first cluster analysis module is suitable for carrying out cluster analysis on the image feature vectors to obtain different classifications of the original video corresponding to the image feature vectors; and
the first labeling module is suitable for extracting keywords in titles corresponding to the original videos in the same category aiming at different categories of the original videos corresponding to the image feature vectors, and selecting one or more keywords from the keywords as labels of the original videos in the category according to a first preset rule;
wherein the apparatus further comprises:
an audio signal separation module adapted to separate an audio signal from the original video data;
the second clustering analysis module is suitable for carrying out clustering analysis on the audio signals to obtain different classifications of the audio signals; and
the second labeling module is suitable for extracting keywords in titles of original videos corresponding to the audio signals in the same category aiming at different categories of the audio signals, and selecting one or more keywords from the keywords to serve as labels of the original videos corresponding to the audio signals in the category according to a second preset rule;
wherein the first cluster analysis module is further adapted to acquire to-be-processed data comprising a plurality of clustering objects and a specified target class number of the to-be-processed data, wherein the clustering objects are the original videos, and the to-be-processed data of the clustering objects are the image feature vectors extracted from the original videos;
classify each clustering object in the to-be-processed data according to the class attribute of each clustering object to obtain clustering categories of the specified target class number;
calculate the purity of each clustering category;
and determine the final clustering categories of the to-be-processed data according to the purity of each clustering category, so as to obtain the different classifications of the original videos;
wherein the first cluster analysis module is further adapted to screen out first clustering objects in a specified proportion from all the clustering objects of the clustering category;
acquire a predetermined number of second clustering objects adjacent to each first clustering object;
and calculate the purity of the clustering category according to the class attributes of the second clustering objects.
11. The apparatus of claim 10, wherein the image feature extraction module is further adapted to:
decoding the original video data to obtain a plurality of video frames;
inputting each video frame in the plurality of video frames into a convolutional neural network CNN for image feature extraction to obtain an image feature vector of a second preset dimension of each video frame;
and processing the image feature vectors of the plurality of video frames with a preset algorithm to obtain the image feature vector of the first preset dimension of the original video.
12. The apparatus of claim 11, wherein the image feature extraction module is further adapted to:
extracting n video frames from the plurality of video frames at predetermined time intervals;
wherein inputting each video frame of the plurality of video frames into the convolutional neural network CNN for image feature extraction to obtain the image feature vector of the second preset dimension of each video frame comprises:
inputting each of the n video frames into the CNN for image feature extraction to obtain an m-dimensional image feature vector corresponding to each video frame;
and wherein processing the image feature vectors of the plurality of video frames with the preset algorithm to obtain the image feature vector of the first preset dimension of the original video comprises:
connecting the n m-dimensional image feature vectors in time order to obtain an n×m-dimensional image feature vector;
and performing dimension-reduction analysis on the n×m-dimensional image feature vector to obtain the final image feature vector of the first preset dimension of the original video, wherein n and m are natural numbers greater than 1.
13. The apparatus of claim 12, wherein the image feature extraction module is further adapted to:
performing average pooling on the n×m-dimensional image feature vector to realize the dimension reduction.
14. The apparatus of claim 10, wherein the cluster analysis comprises K-means clustering.
15. The apparatus of claim 10, wherein the first tagging module is further adapted to:
performing word segmentation on the title corresponding to each original video in the same category to obtain a plurality of segmented words;
and selecting one or more of the segmented words as keywords of the original video according to a preset screening strategy.
16. The apparatus of claim 10, wherein the first tagging module is further adapted to:
counting the occurrence times of each keyword;
and selecting keywords with the occurrence times greater than or equal to a preset threshold value as labels of the original videos of the category.
17. The apparatus of claim 10, wherein the first tagging module is further adapted to:
counting the occurrence times of each keyword;
sorting the keywords according to the occurrence times;
and selecting a designated number of top-ranked keywords as labels of the original videos of the category.
18. The apparatus of claim 10, wherein the second clustering analysis module comprises:
an audio discretization unit, suitable for discretizing the audio signal to obtain a target audio;
a voice feature extraction unit, suitable for extracting features of the target audio through a time sequence convolutional neural network CNN to obtain a voice feature vector of the target audio;
and an audio cluster analysis unit, suitable for performing cluster analysis on the voice feature vector to obtain the different classifications of the audio signals corresponding to the voice feature vector.
19. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to perform the video tagging method according to any one of claims 1 to 9.
20. A computing device, comprising:
a processor; and
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform the video tagging method according to any one of claims 1 to 9.
CN201811400848.8A 2018-11-22 2018-11-22 Video tagging processing method and device and computing equipment Active CN109684506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811400848.8A CN109684506B (en) 2018-11-22 2018-11-22 Video tagging processing method and device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811400848.8A CN109684506B (en) 2018-11-22 2018-11-22 Video tagging processing method and device and computing equipment

Publications (2)

Publication Number Publication Date
CN109684506A CN109684506A (en) 2019-04-26
CN109684506B true CN109684506B (en) 2023-10-20

Family

ID=66184897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811400848.8A Active CN109684506B (en) 2018-11-22 2018-11-22 Video tagging processing method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN109684506B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110099302B (en) * 2019-04-29 2020-11-24 北京达佳互联信息技术有限公司 Video grading method, device, equipment and storage medium
CN110300329B (en) * 2019-06-26 2022-08-12 北京字节跳动网络技术有限公司 Video pushing method and device based on discrete features and electronic equipment
CN110267097A (en) * 2019-06-26 2019-09-20 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment based on characteristic of division
CN110278447B (en) * 2019-06-26 2021-07-20 北京字节跳动网络技术有限公司 Video pushing method and device based on continuous features and electronic equipment
CN113365102B (en) * 2020-03-04 2022-08-16 阿里巴巴集团控股有限公司 Video processing method and device and label processing method and device
CN111444388B (en) * 2020-03-30 2023-06-30 杭州小影创新科技股份有限公司 Video tag ordering method based on random walk
CN111738107A (en) * 2020-06-08 2020-10-02 Oppo(重庆)智能科技有限公司 Video generation method, video generation device, storage medium, and electronic apparatus
CN112100438A (en) * 2020-09-21 2020-12-18 腾讯科技(深圳)有限公司 Label extraction method and device and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6751354B2 (en) * 1999-03-11 2004-06-15 Fuji Xerox Co., Ltd Methods and apparatuses for video segmentation, classification, and retrieval using image class statistical models

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654054A (en) * 2015-12-30 2016-06-08 上海颐本信息科技有限公司 Semi-supervised neighbor propagation learning and multi-visual dictionary model-based intelligent video analysis method
CN107656958A (en) * 2017-06-09 2018-02-02 平安科技(深圳)有限公司 A kind of classifying method and server of multi-data source data
CN107665261A (en) * 2017-10-25 2018-02-06 北京奇虎科技有限公司 Video duplicate checking method and device
CN107784293A (en) * 2017-11-13 2018-03-09 中国矿业大学(北京) A kind of Human bodys' response method classified based on global characteristics and rarefaction representation
CN108122562A (en) * 2018-01-16 2018-06-05 四川大学 A kind of audio frequency classification method based on convolutional neural networks and random forest

Also Published As

Publication number Publication date
CN109684506A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN109684506B (en) Video tagging processing method and device and computing equipment
CN109493881B (en) Method and device for labeling audio and computing equipment
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
US7460994B2 (en) Method and apparatus for producing a fingerprint, and method and apparatus for identifying an audio signal
Baluja et al. Content fingerprinting using wavelets
CN109065071B (en) Song clustering method based on iterative k-means algorithm
JP2003015684A (en) Method for extracting feature from acoustic signal generated from one sound source and method for extracting feature from acoustic signal generated from a plurality of sound sources
CN109902289B (en) News video theme segmentation method oriented to fuzzy text mining
CN113327626A (en) Voice noise reduction method, device, equipment and storage medium
Gunasekaran et al. Content-based classification and retrieval of wild animal sounds using feature selection algorithm
Huang et al. Large-scale weakly-supervised content embeddings for music recommendation and tagging
Radha Video retrieval using speech and text in video
JP5345783B2 (en) How to generate a footprint for an audio signal
Dong et al. Similarity-based birdcall retrieval from environmental audio
US11615132B2 (en) Feature amount generation method, feature amount generation device, and feature amount generation program
Ebrahimpour et al. End-to-end auditory object recognition via inception nucleus
CN112581980B (en) Method and network for time-frequency channel attention weight calculation and vectorization
JP4447602B2 (en) Signal detection method, signal detection system, signal detection processing program, and recording medium recording the program
CN114510564A (en) Video knowledge graph generation method and device
Sowmyayani et al. STHARNet: Spatio-temporal human action recognition network in content based video retrieval
JP2002169592A (en) Device and method for classifying and sectioning information, device and method for retrieving and extracting information, recording medium, and information retrieval system
Therese et al. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system
Aung et al. M-Diarization: A Myanmar Speaker Diarization using Multi-scale dynamic weights
Chuan et al. Using wavelets and gaussian mixture models for audio classification
Ashurov et al. Classification of Environmental Sounds Through Spectrogram-Like Images Using Dilation-Based CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230921

Address after: Room 03, 2nd Floor, Building A, No. 20 Haitai Avenue, Huayuan Industrial Zone (Huanwai), Binhai New Area, Tianjin, 300450

Applicant after: 3600 Technology Group Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Applicant before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant