CN110516086B - Method for automatically acquiring movie label based on deep neural network - Google Patents

Method for automatically acquiring movie label based on deep neural network

Info

Publication number
CN110516086B
CN110516086B (application CN201910627545.8A)
Authority
CN
China
Prior art keywords
film
constructing
model
neural network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910627545.8A
Other languages
Chinese (zh)
Other versions
CN110516086A (en
Inventor
宣琦
王冠华
俞山青
孙佳慧
韩忙
孙翊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201910627545.8A priority Critical patent/CN110516086B/en
Publication of CN110516086A publication Critical patent/CN110516086A/en
Application granted granted Critical
Publication of CN110516086B publication Critical patent/CN110516086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method for automatically acquiring film and television labels based on a deep neural network comprises the following steps. Step 1: collect the lines of the film and construct a line data set. Step 2: collect the original sound of the film and construct a sound data set. Step 3: collect the labels already generated on the relevant film and television platforms and construct a film label data set. Step 4: construct an automatic labeling model based on the film lines. Step 5: construct an automatic labeling model based on the original sound of the film using a shared-node CNN-LSTM algorithm. Step 6: fuse the two models from steps 4 and 5. The invention adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at films, and uses the temporal correlation of a film to extract high-level abstract attributes from its original information, such as the line text and the audio signal.

Description

Method for automatically acquiring movie label based on deep neural network
Technical Field
The invention relates to data mining, network science and deep neural networks, and in particular to a method for automatically acquiring film and television labels based on a deep neural network.
Background
With the deepening of the information society and the digital society, the film and television industry increasingly favors digital distribution through streaming-media services and online film and television stores. The 2018 global film and television report shows that the global film and television market grew by 5.9% in 2017, with digital film and television revenue growing by as much as 17.7%. In 2017, digital revenue accounted for more than half of total film and television revenue for the first time, a shift clearly worth attention. The report indicates a 60.4% surge in streaming-service revenue against the backdrop of shrinking download and physical-media markets, showing that streaming services are the most prominent driver of growth in the digital film and television market. The total number of subscribers to paid online film and television services worldwide has exceeded 100 million, an important milestone indicating that streaming-media services have become a significant component of the digital film and television market. Meanwhile, the film and television industry in China is considered to have great development potential: according to the 2017 figures, the Chinese film and television market expanded by 20.3% and streaming-service revenue grew by 30.6%, outpacing the market as a whole. The major Chinese streaming-media service providers operate numerous video platforms with more than 15 million paying users for video entertainment, so the Chinese digital film and television market can be expected to grow into one of the world's important digital film and television markets.
The popularity of high-speed mobile networks and intelligent devices has shifted consumers' film consumption habits from physical media and downloads to streaming. Against this new background, digital film and television markets around the world are fiercely competitive, and streaming-media service providers continuously develop and expand their products and services to offer consumers more diversified and personalized experiences. Faced with huge online film libraries holding massive digital film and television resources, how to organize them more efficiently, how to provide higher-quality subscriptions, and how to recommend content more accurately have become important technical hotspots for every large streaming-media service provider.
Against the background of this market transformation, the concept of the film and television label has become increasingly important as a structured way of organizing film and television information, and improving film and television labeling through advanced technologies is becoming a popular direction in film and television information retrieval. Film and television labels are phrases that accurately describe high-level film and television semantics. Because of the particular nature of film and television, lines and sounds are difficult to manage and search in conventional ways, and labels representing the characteristics of a work greatly help its classification, organization and retrieval. Natural-language labels can help users find films with specific attributes through keywords, lists and label clouds. On this basis, streaming-media service providers can also use label information for personalized recommendation; such content- and feature-based approaches can help overcome the cold-start problem of the collaborative-filtering recommendation algorithms widely used in the current market.
At present, three mainstream methods are used to complete the film and television labeling task: expert labeling, social labeling and automatic algorithmic labeling. Expert labeling means that professionals in the film and television industry label works based on professional knowledge and their own judgment; the labels given by experts are accurate, but this mode is expensive and the resulting content is not rich enough. Social labeling means that users are encouraged, in a manner similar to crowdsourced tasks, to label works freely or semi-freely according to their personal understanding and feelings, and the labeling data of a large number of users are collected, processed and aggregated to generate labels. This method is cheap and the content is rich, but because different users understand and feel works subjectively, the labeling results are uneven, and labels with completely opposite semantics may even appear for the same work, so the result is very noisy. Automatic labeling means that, on an existing small-scale film label data set, a classification model is trained on features extracted from the works themselves and from related information such as audio signals, line text, related comments and posters, and label results are then generated automatically for large-scale film data. Content-based automatic labeling with algorithms addresses both the cost and time problems and the generality problem of labeling methods. The accuracy and the range of application of existing automatic labeling methods still leave considerable room for improvement, so research on automatic film and television labeling using film and television content is receiving more and more attention.
In summary, conventional labeling algorithms still have many problems to be solved urgently, including noise introduced during feature design and the limitations of shallow classifier structures, and no effective solution exists yet.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for automatically acquiring a film and television label based on a deep neural network.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for automatically acquiring a film and television label based on a deep neural network comprises the following steps:
step 1: collecting the lines of the film and constructing a line data set;
step 2: collecting original sound of a film, and constructing a sound data set;
step 3: collecting the labels already generated on film and television platforms, and constructing a film label data set;
step 4: constructing a label automatic acquisition model based on film lines;
step 5: constructing a label automatic acquisition model based on the film sound by adopting a shared-node CNN-LSTM algorithm;
step 6: fusing the two models mentioned in step 4 and step 5.
Further, in step 1, the lines of the film are collected; the line data do not include the post-credits (easter-egg) scene at the end of the film.
Still further, in step 2, the corresponding film sound is collected to match the film lines collected in step 1; the sound data likewise do not include the post-credits scene.
Still further, in step 3, the film and television platforms include iQIYI, Tencent Video, Youku, Maoyan Movie and Douban Movie, and the process of constructing the film label data set includes:
3.1) merging all the labels collected from the 5 platforms to ensure that no repeated labels exist;
3.2) carrying out format standardization on all labels, including uniform character encoding and uniform label separators;
3.3) matching the films from steps 1 and 2 with the collected labels.
In step 4, the automatic labeling model based on the film lines is constructed through the following process:
4.1) Use the WordPiece tool to perform word segmentation and insert the special classification token [CLS] (used to separate samples) and the separator [SEP] (used to separate the different sentences within a sample). Each sentence corresponds to a matrix X = (x_1, x_2, ..., x_t), where x_i is the word vector (a row vector) of the i-th word with dimension d, so X ∈ R^(n×d). The encoding is performed using the following formula:

y_t = softmax(x_t A^T / √d) B        (1)

where A and B are additionally introduced sequences (matrices); they are introduced so that x_t can be compared with each word, thereby obtaining y_t;
4.2) Input the result of the previous step into a model for pre-training. The model calculation formulas are:

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_1, t_2, ..., t_{k-1})        (2)

and

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)        (3)

where t_1, t_2, ..., t_N are consecutive tokens and t_1, t_2, ..., t_k are also consecutive tokens. Further, writing log p(t_k) as r_k, a bidirectional model that is convenient for training on large-scale text is established, whose calculation formula is:

Σ_{k=1..N} [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_LSTM(forward), Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_LSTM(backward), Θ_s) ]        (4)

where t_1, t_2, ..., t_N are consecutive tokens, Θ_x is the input (the initial word vectors), Θ_s is the parameter of the normalization layer, Θ_LSTM(forward) is the forward LSTM model and Θ_LSTM(backward) is the backward LSTM model; on this basis, fifteen percent of the word vectors produced from the words are randomly masked;
4.3) Perform the embedding operation on the vectors pre-trained with the improved Masked Language Model. The embedding operations are Token Embedding (the embedding of the current word), Segment Embedding (the index embedding of the sentence in which the current word is located) and Position Embedding (the index embedding of the position of the current word). In order to represent both single sentences and sentence pairs, multiple sentences are concatenated into a single sequence and distinguished by the segment embeddings and the [SEP] separator; the three embeddings are summed to obtain the input vector;
4.4) Take the vector generated in the previous step as input to a Transformer model with 12 layers and a hidden dimension of 768;
4.5) Adapt the model by fine-tuning, and take the output of the [CLS] token as the input of the softmax layer, thereby obtaining the film label prediction result.
In step 5, the automatic labeling model based on the original film sound is constructed with the shared-node CNN-LSTM algorithm through the following process:
5.1) Obtain the power spectrum of the sound data set corresponding to step 4 through the Fast Fourier Transform (FFT), and then map the spectrum onto the Mel scale using a triangular window function, calculated as:

m = 2595 · log10(1 + f / 700)        (5)

where f is the frequency in hertz. Let E(b), 0 ≤ b < B, denote the Mel-scale power spectral coefficient of the b-th subband, where B is the total number of filters used in pre-processing. The MFCCN value is the discrete cosine transform of the logarithm of E(b); writing log E(b) as H(b), it is calculated as:

MFCCN(n) = Σ_{b=0..B-1} H(b) · cos(πn(b + 0.5) / B),  n = 0, 1, ..., L-1        (6)

where L is the dimension of the MFCCN. The MFCCN feature vector is obtained as:

x_MFCCN = [MFCCN(0), MFCCN(1), ... MFCCN(L-1)]^T        (7)
5.2) Perform short-time Fourier transforms on equal-length audio segments using overlapping short windows; each Fourier transform produces one frame, and consecutive frames are stacked into a matrix to form a spectrogram. Finally, the linear frequency axis is mapped to the Mel scale and the amplitudes, which are unevenly distributed along the frequency axis, are logarithmically scaled; the result is used as the feature representation of the audio signal;
5.3) feed the features generated in the previous step into a convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.4) feed the output of the previous step into a max-pooling layer with a pooling window of length 4;
5.5) feed the output of the previous step into a second convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.6) feed the output of the previous step into a second max-pooling layer with a pooling window of length 4;
5.7) construct several such models and use shared nodes so that each outputs a deep feature sequence;
5.8) because films differ in length, the number of input segments also varies; the temporal-correlation characteristics of the variable-length deep feature sequences output by the three conventional CNN models are captured by a recurrent neural network with an LSTM structure, which finally outputs the predicted label values.
In step 6, fusing the two models mentioned in step 4 and step 5 comprises the following process:
6.1) a convolutional neural network structure is selected; the audio representation and the text representation are passed through their respective base networks, batch-normalized, concatenated, and finally scale-transformed to obtain the output;
6.2) the output of the concatenated model is fed into two fully connected layers with 1024 and 512 output nodes respectively, and the predicted label values are finally output.
The invention has the following beneficial effects: the method adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at films, and uses the temporal correlation of a film to extract high-level abstract attributes from its original information, such as the line text and the audio signal.
Drawings
Fig. 1 is a flowchart of constructing a movie-based automatic tag acquisition model according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for constructing a label automatic acquisition model of a movie sound by using a shared node CNN-LSTM algorithm according to an embodiment of the present invention;
fig. 3 is a flowchart of a method for fusing a tag automatic acquisition model based on movie & television lines and two models mentioned in the tag automatic acquisition model of movie & television original sounds according to an embodiment of the present invention;
fig. 4 is a block diagram of a structure of a method for automatically acquiring a video tag based on a deep neural network according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 4, a method for automatically acquiring film and television labels based on a deep neural network specifically includes the following six steps:
step 1: collecting the lines of the film and constructing a line data set;
step 2: collecting original sound of a film, and constructing a sound data set;
step 3: collecting the labels already generated on film and television platforms, and constructing a film label data set;
step 4: constructing a label automatic acquisition model based on film lines;
step 5: constructing a label automatic acquisition model based on the film sound by adopting a shared-node CNN-LSTM algorithm;
step 6: fusing the two models mentioned in step 4 and step 5.
Further, in step 1, the lines of the film are collected; the line data do not include the post-credits (easter-egg) scene at the end of the film.
Still further, in step 2, the corresponding film sound is collected to match the film lines collected in step 1; the sound data likewise do not include the post-credits scene.
Still further, in step 3, the film and television platforms include iQIYI, Tencent Video, Youku, Maoyan Movie and Douban Movie, and the process of constructing the film label data set includes:
3.1) merging all the labels collected from the 5 platforms to ensure that no repeated labels exist;
3.2) carrying out format standardization on all labels, including uniform character encoding and uniform label separators;
3.3) matching the films from steps 1 and 2 with the collected labels, as in the data-handling sketch below.
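As an illustration of steps 3.1)-3.3), the following Python sketch merges and normalizes the platform labels, assuming the scraped tags are held in per-platform dictionaries keyed by a shared movie identifier (the data layout and helper names are illustrative, not specified by the patent):

import unicodedata

def build_tag_dataset(platform_tags, movie_ids):
    # platform_tags: {platform: {movie_id: [raw_tag_string, ...]}} (assumed layout)
    dataset = {}
    for movie_id in movie_ids:
        merged = set()
        for tags_by_movie in platform_tags.values():
            for raw in tags_by_movie.get(movie_id, []):
                # 3.2) unify character encoding and the label separator
                normalized = unicodedata.normalize("NFKC", raw).replace(";", ",")
                # 3.1) merge labels across the 5 platforms without duplicates
                merged.update(t.strip() for t in normalized.split(",") if t.strip())
        # 3.3) associate the film from steps 1 and 2 with its collected labels
        dataset[movie_id] = sorted(merged)
    return dataset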
In step 4, the automatic labeling model based on the film lines is constructed through the following process:
4.1) Use the WordPiece tool to perform word segmentation and insert the special classification token [CLS] (used to separate samples) and the separator [SEP] (used to separate the different sentences within a sample). Each sentence corresponds to a matrix X = (x_1, x_2, ..., x_t), where x_i is the word vector (a row vector) of the i-th word with dimension d, so X ∈ R^(n×d). The encoding is performed using the following formula:

y_t = softmax(x_t A^T / √d) B        (1)

where A and B are additionally introduced sequences (matrices); they are introduced so that x_t can be compared with each word, thereby obtaining y_t;
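A minimal PyTorch sketch of the comparison described in 4.1), under the assumption that formula (1) is scaled dot-product attention with A acting as the keys and B as the values (the scaling by √d is part of that assumption):

import torch
import torch.nn.functional as F

def encode(X, A, B):
    # compare every x_t with each word of A and aggregate the rows of B,
    # producing one y_t per input word
    d = X.size(-1)
    scores = X @ A.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ B

# toy shapes: n = 6 words, d = 8 dimensions; A = B = X gives the self-attention case
X = torch.randn(6, 8)
Y = encode(X, X, X)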
4.2) Input the result of the previous step into a model for pre-training. The model calculation formulas are:

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_1, t_2, ..., t_{k-1})        (2)

and

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)        (3)

where t_1, t_2, ..., t_N are consecutive tokens and t_1, t_2, ..., t_k are also consecutive tokens. Further, writing log p(t_k) as r_k, a bidirectional model that is convenient for training on large-scale text is established, whose calculation formula is:

Σ_{k=1..N} [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_LSTM(forward), Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_LSTM(backward), Θ_s) ]        (4)

where t_1, t_2, ..., t_N are consecutive tokens, Θ_x is the input (the initial word vectors), Θ_s is the parameter of the normalization layer, Θ_LSTM(forward) is the forward LSTM model and Θ_LSTM(backward) is the backward LSTM model; on this basis, fifteen percent of the word vectors produced from the words are randomly masked;
4.3) Perform the embedding operation on the vectors pre-trained with the improved Masked Language Model. The embedding operations are Token Embedding (the embedding of the current word), Segment Embedding (the index embedding of the sentence in which the current word is located) and Position Embedding (the index embedding of the position of the current word). In order to represent both single sentences and sentence pairs, multiple sentences are concatenated into a single sequence and distinguished by the segment embeddings and the [SEP] separator; the three embeddings are summed to obtain the input vector;
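A small PyTorch sketch of the summation described in 4.3); the vocabulary size, maximum sequence length and hidden dimension are illustrative values rather than requirements of the patent:

import torch
import torch.nn as nn

class SumEmbedding(nn.Module):
    # Token + Segment + Position embeddings, summed to form the input vector of 4.3)
    def __init__(self, vocab_size=21128, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)
        self.segment = nn.Embedding(num_segments, dim)
        self.position = nn.Embedding(max_len, dim)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions).unsqueeze(0))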
4.4) Take the vector generated in the previous step as input to a Transformer model with 12 layers and a hidden dimension of 768;
4.5) Adapt the model by fine-tuning, and take the output of the [CLS] token as the input of the softmax layer, thereby obtaining the film label prediction result.
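One possible realization of steps 4.4)-4.5) with the Hugging Face transformers library; the pretrained Chinese checkpoint, the tag count and the example text are assumptions of this sketch, not requirements of the patent:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_TAGS = 20                     # illustrative size of the label set
subtitle_text = "..."             # one film's line text from the data set of step 1

# a 12-layer, 768-dimensional Transformer encoder with a classification head
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_TAGS)

inputs = tokenizer(subtitle_text, truncation=True, max_length=512,
                   return_tensors="pt")       # inserts [CLS] and [SEP] automatically
logits = model(**inputs).logits               # classifier on top of the [CLS] output
probs = torch.softmax(logits, dim=-1)         # label prediction as in 4.5)

Fine-tuning, as in 4.5), would train this classification head on the film label data set of step 3.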
In step 5, the automatic labeling model based on the original film sound is constructed with the shared-node CNN-LSTM algorithm through the following process:
5.1) Obtain the power spectrum of the sound data set corresponding to step 4 through the Fast Fourier Transform (FFT), and then map the spectrum onto the Mel scale using a triangular window function, calculated as:

m = 2595 · log10(1 + f / 700)        (5)

where f is the frequency in hertz. Let E(b), 0 ≤ b < B, denote the Mel-scale power spectral coefficient of the b-th subband, where B is the total number of filters used in pre-processing. The MFCCN value is the discrete cosine transform of the logarithm of E(b); writing log E(b) as H(b), it is calculated as:

MFCCN(n) = Σ_{b=0..B-1} H(b) · cos(πn(b + 0.5) / B),  n = 0, 1, ..., L-1        (6)

where L is the dimension of the MFCCN. The MFCCN feature vector is obtained as:

x_MFCCN = [MFCCN(0), MFCCN(1), ... MFCCN(L-1)]^T        (7)
5.2) Perform short-time Fourier transforms on equal-length audio segments using overlapping short windows; each Fourier transform produces one frame, and consecutive frames are stacked into a matrix to form a spectrogram. Finally, the linear frequency axis is mapped to the Mel scale and the amplitudes, which are unevenly distributed along the frequency axis, are logarithmically scaled; the result is used as the feature representation of the audio signal (a feature-extraction sketch follows);
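A feature-extraction sketch for 5.1)-5.2) using the librosa library; the sampling rate, FFT size, hop length and filter-bank size are illustrative choices, not values fixed by the patent:

import librosa

def audio_features(path, sr=22050, n_mels=128, n_mfcc=20):
    # log-Mel spectrogram (step 5.2) and MFCC features (step 5.1) of one audio segment
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)  # Mel mapping, Eq. (5)
    log_mel = librosa.power_to_db(mel)        # logarithmic amplitude scaling, H(b) = log E(b)
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)  # DCT of the log energies, Eqs. (6)-(7)
    return log_mel.T, mfcc.T                  # frames along the first axis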
5.3) feed the features generated in the previous step into a convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.4) feed the output of the previous step into a max-pooling layer with a pooling window of length 4;
5.5) feed the output of the previous step into a second convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.6) feed the output of the previous step into a second max-pooling layer with a pooling window of length 4;
5.7) construct several such models and use shared nodes so that each outputs a deep feature sequence;
5.8) because films differ in length, the number of input segments also varies; the temporal-correlation characteristics of the variable-length deep feature sequences output by the three conventional CNN models are captured by a recurrent neural network with an LSTM structure, which finally outputs the predicted label values, as sketched below.
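A PyTorch sketch of steps 5.3)-5.8), interpreting the shared nodes as one CNN whose weights are reused for every audio segment; the LSTM width, the tag count and the pooling of each segment to a single vector are assumptions of this sketch:

import torch
import torch.nn as nn

class SharedCNNLSTM(nn.Module):
    def __init__(self, feat_dim=128, num_tags=20):
        super().__init__()
        self.cnn = nn.Sequential(                 # steps 5.3)-5.6)
            nn.Conv1d(feat_dim, 32, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 32, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveMaxPool1d(1),              # one depth-feature vector per segment
        )
        self.lstm = nn.LSTM(32, 128, batch_first=True)   # step 5.8)
        self.out = nn.Linear(128, num_tags)

    def forward(self, segments):
        # segments: (batch, n_segments, feat_dim, n_frames); n_segments varies per film
        b, s, d, t = segments.shape
        feats = self.cnn(segments.reshape(b * s, d, t)).squeeze(-1)
        feats = feats.reshape(b, s, -1)           # variable-length depth-feature sequence
        _, (h, _) = self.lstm(feats)              # temporal correlation across segments
        return self.out(h[-1])                    # predicted label values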
In step 6, the two models mentioned in step 4 and step 5 are fused through the following process:
6.1) a convolutional neural network structure is selected; the audio representation and the text representation are passed through their respective base networks, batch-normalized, concatenated, and finally scale-transformed to obtain the output;
6.2) the output of the concatenated model is fed into two fully connected layers with 1024 and 512 output nodes respectively, and the predicted label values are finally output.
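A PyTorch sketch of the fusion in 6.1)-6.2); the representation dimensions and the final label layer are assumptions of this sketch (the patent specifies the batch normalization, the concatenation and the 1024- and 512-node fully connected layers):

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, audio_dim, text_dim, num_tags):
        super().__init__()
        self.bn_audio = nn.BatchNorm1d(audio_dim)   # 6.1) batch normalization
        self.bn_text = nn.BatchNorm1d(text_dim)
        self.fc = nn.Sequential(                    # 6.2) two fully connected layers
            nn.Linear(audio_dim + text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_tags),               # predicted label values
        )

    def forward(self, audio_repr, text_repr):
        x = torch.cat([self.bn_audio(audio_repr), self.bn_text(text_repr)], dim=1)
        return self.fc(x)

# usage: audio_repr and text_repr are the outputs of the step-5 and step-4 models
head = FusionHead(audio_dim=128, text_dim=768, num_tags=20)
scores = head(torch.randn(4, 128), torch.randn(4, 768))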
In the method for automatically acquiring a video tag based on a deep neural network provided by the embodiment of the present invention, the instruction included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, and details are not described herein.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (3)

1. A method for automatically acquiring a film and television label based on a deep neural network is characterized by comprising the following steps:
step 1: collecting the lines of the film and constructing a line data set;
step 2: collecting original sound of a film, and constructing a sound data set;
step 3: collecting the labels already generated on film and television platforms, and constructing a film label data set;
step 4: constructing an automatic labeling model based on film lines;
step 5: constructing an automatic labeling model based on the original sound of the film by adopting a shared-node CNN-LSTM algorithm;
step 6: fusing the two models mentioned in step 4 and step 5;
in step 4, the automatic labeling model based on the film lines is constructed through the following process:
4.1) using the WordPiece tool to perform word segmentation and insert the special classification token [CLS], used to separate samples, and the separator [SEP], used to separate the different sentences within a sample, each sentence corresponding to a matrix X = (x_1, x_2, ..., x_t), where x_i is the word vector of the i-th word with dimension d, so X ∈ R^(n×d), the encoding being performed using the following formula:

y_t = softmax(x_t A^T / √d) B        (1)

where A and B are additionally introduced sequences, introduced so that x_t can be compared with each word, thereby obtaining y_t;
4.2) inputting the result of the previous step into a model for pre-training, the model calculation formulas being:

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_1, t_2, ..., t_{k-1})        (2)

and

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)        (3)

where t_1, t_2, ..., t_N are consecutive tokens and t_1, t_2, ..., t_k are also consecutive tokens; further, writing log p(t_k) as r_k, a bidirectional model that is convenient for training on large-scale text is established, whose calculation formula is:

Σ_{k=1..N} [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_LSTM(forward), Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_LSTM(backward), Θ_s) ]        (4)

where t_1, t_2, ..., t_N are consecutive tokens, Θ_x is the input (the initial word vectors), Θ_s is the parameter of the normalization layer, Θ_LSTM(forward) is the forward LSTM model and Θ_LSTM(backward) is the backward LSTM model; on this basis, fifteen percent of the word vectors produced from the words are randomly masked;
4.3) performing the embedding operation on the vectors after model pre-training, wherein the embedding operations are Token Embedding (the embedding of the current word), Segment Embedding (the index embedding of the sentence in which the current word is located) and Position Embedding (the index embedding of the position of the current word); in order to represent both single sentences and sentence pairs, multiple sentences are concatenated into a single sequence and distinguished by the segment embeddings and the [SEP] separator; the three embeddings are summed to obtain the input vector;
4.4) taking the vector generated in the previous step as input to a Transformer model with 12 layers and a hidden dimension of 768;
4.5) adapting the model by fine-tuning, and taking the output of the [CLS] token as the input of the softmax normalization layer, thereby obtaining the film label prediction result;
in step 5, constructing the automatic labeling model based on the film sound by adopting the shared-node CNN-LSTM algorithm comprises the following process:
5.1) obtaining the power spectrum of the sound data set corresponding to step 4 through the Fast Fourier Transform (FFT), and then mapping the spectrum onto the Mel scale m using a triangular window function, calculated as:

m = 2595 · log10(1 + f / 700)        (5)

where f is the frequency in hertz; letting E(b), 0 ≤ b < B, denote the Mel-scale power spectral coefficient of the b-th subband, where B is the total number of filters used in pre-processing, the MFCCN value is the discrete cosine transform of the logarithm of E(b); writing log E(b) as H(b), it is calculated as:

MFCCN(n) = Σ_{b=0..B-1} H(b) · cos(πn(b + 0.5) / B),  n = 0, 1, ..., L-1        (6)

where L is the dimension of the MFCCN, the MFCCN feature vector x_MFCCN being obtained as:

x_MFCCN = [MFCCN(0), MFCCN(1), ... MFCCN(L-1)]^T        (7)
5.2) performing short-time Fourier transforms on equal-length audio segments using overlapping short windows, wherein each Fourier transform produces one frame and consecutive frames are stacked into a matrix to form a spectrogram; finally, the linear frequency axis is mapped to the Mel scale and the amplitudes, which are unevenly distributed along the frequency axis, are logarithmically scaled, the result being used as the feature representation of the audio signal;
5.3) feeding the features generated in the previous step into a convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.4) feeding the output of the previous step into a max-pooling layer with a pooling window of length 4;
5.5) feeding the output of the previous step into a second convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.6) feeding the output of the previous step into a second max-pooling layer with a pooling window of length 4;
5.7) constructing three CNN models that adopt shared nodes and each output a deep feature sequence;
5.8) because films differ in length, the number of input segments also varies; the temporal-correlation characteristics of the variable-length deep feature sequences output by the three CNN models are captured by a recurrent neural network with an LSTM structure, which finally outputs the predicted label values.
2. The method for automatically acquiring film and television labels based on a deep neural network according to claim 1, wherein: in step 3, the film and television platforms comprise iQIYI, Tencent Video, Youku, Maoyan Movie and Douban Movie; constructing the film label data set includes the following process:
3.1) merging all the labels collected from the 5 platforms to ensure that no repeated labels exist;
3.2) carrying out format standardization on all labels, including uniform character encoding and uniform label separators;
3.3) matching the films from step 1 and step 2 with the collected labels.
3. The method for automatically acquiring film and television labels based on a deep neural network according to claim 1, wherein in step 6, fusing the two models mentioned in step 4 and step 5 comprises the following process:
6.1) selecting a convolutional neural network structure, passing the audio representation and the text representation through their respective base networks, batch-normalizing and concatenating them, and finally applying a scale transformation to obtain the output;
6.2) feeding the output of the concatenated model into two fully connected layers with 1024 and 512 output nodes respectively, and finally outputting the predicted label values.
CN201910627545.8A 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network Active CN110516086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910627545.8A CN110516086B (en) 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910627545.8A CN110516086B (en) 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network

Publications (2)

Publication Number Publication Date
CN110516086A CN110516086A (en) 2019-11-29
CN110516086B true CN110516086B (en) 2022-05-03

Family

ID=68623048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910627545.8A Active CN110516086B (en) 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network

Country Status (1)

Country Link
CN (1) CN110516086B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460820B (en) * 2020-03-06 2022-06-17 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT
CN112084371B (en) * 2020-07-21 2024-04-16 中国科学院深圳先进技术研究院 Movie multi-label classification method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294797A (en) * 2016-08-15 2017-01-04 北京聚爱聊网络科技有限公司 A kind of generation method and apparatus of video gene
CN108965920A (en) * 2018-08-08 2018-12-07 北京未来媒体科技股份有限公司 A kind of video content demolition method and device
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server
CN109710800A (en) * 2018-11-08 2019-05-03 北京奇艺世纪科技有限公司 Model generating method, video classification methods, device, terminal and storage medium
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9946933B2 (en) * 2016-08-18 2018-04-17 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294797A (en) * 2016-08-15 2017-01-04 北京聚爱聊网络科技有限公司 A kind of generation method and apparatus of video gene
CN108965920A (en) * 2018-08-08 2018-12-07 北京未来媒体科技股份有限公司 A kind of video content demolition method and device
CN109710800A (en) * 2018-11-08 2019-05-03 北京奇艺世纪科技有限公司 Model generating method, video classification methods, device, terminal and storage medium
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Jacob Devlin et al.; arXiv; 2019-05-24; full text *

Also Published As

Publication number Publication date
CN110516086A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
JP7142737B2 (en) Multimodal theme classification method, device, device and storage medium
CN112749608B (en) Video auditing method, device, computer equipment and storage medium
CN106328147B (en) Speech recognition method and device
CN110704674B (en) Video playing integrity prediction method and device
CN112104919B (en) Content title generation method, device, equipment and computer readable storage medium based on neural network
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
US20240212706A1 (en) Audio data processing
CN111046225B (en) Audio resource processing method, device, equipment and storage medium
CN112418011A (en) Method, device and equipment for identifying integrity of video content and storage medium
CN109299277A (en) The analysis of public opinion method, server and computer readable storage medium
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN117033558A (en) BERT-WWM and multi-feature fused film evaluation emotion analysis method
CN116977701A (en) Video classification model training method, video classification method and device
CN115909390B (en) Method, device, computer equipment and storage medium for identifying low-custom content
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
CN113704541A (en) Training data acquisition method, video push method, device, medium and electronic equipment
CN114547435A (en) Content quality identification method, device, equipment and readable storage medium
CN117009574B (en) Hot spot video template generation method, system, equipment and storage medium
CN114328990B (en) Image integrity recognition method, device, computer equipment and storage medium
CN115905584B (en) Video splitting method and device
CN116610804A (en) Text recall method and system for improving recognition of small sample category
CN114372139A (en) Data processing method, abstract display method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant