CN110516086B - Method for automatically acquiring movie label based on deep neural network - Google Patents

Method for automatically acquiring movie label based on deep neural network

Info

Publication number
CN110516086B
CN110516086B (application CN201910627545.8A)
Authority
CN
China
Prior art keywords
film
constructing
model
neural network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910627545.8A
Other languages
Chinese (zh)
Other versions
CN110516086A (en
Inventor
宣琦
王冠华
俞山青
孙佳慧
韩忙
孙翊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201910627545.8A priority Critical patent/CN110516086B/en
Publication of CN110516086A publication Critical patent/CN110516086A/en
Application granted granted Critical
Publication of CN110516086B publication Critical patent/CN110516086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method for automatically acquiring film and television labels based on a deep neural network comprises the following steps. Step 1: collect the lines of the film and construct a line data set. Step 2: collect the original sound of the film and construct a sound data set. Step 3: collect the labels already generated on the relevant film and television platforms and construct a film label data set. Step 4: construct an automatic labeling model based on the film lines. Step 5: construct an automatic labeling model based on the original sound of the film using a shared-node CNN-LSTM algorithm. Step 6: fuse the two models from steps 4 and 5. The invention adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at films, and uses the temporal correlation of a film to extract high-level abstract attributes from its original information, such as the line text and the audio signal.

Description

Method for automatically acquiring movie label based on deep neural network
Technical Field
The invention relates to data mining, network science and deep neural networks, and in particular to a method for automatically acquiring film and television labels based on a deep neural network.
Background
With the deepening of the information society and the digital society, the film and television industry increasingly favors digital distribution through streaming-media services and online film and television stores. The 2018 global film and television report shows that the global film and television market grew by 5.9% in 2017, with digital film and television revenue growing by as much as 17.7%. In 2017, digital revenue accounted for more than half of total film and television revenue for the first time, a shift clearly worth attention. The report indicates a 60.4% surge in streaming-service revenue against the backdrop of shrinking download and physical-media markets, showing that streaming services are the most prominent driver of growth in the digital film and television market. The total number of subscribers to paid online film and television services worldwide has exceeded 100 million, an important milestone indicating that streaming-media services have become a significant component of the digital film and television market. Meanwhile, the film and television industry in China is considered to have great development potential: according to the 2017 figures, the Chinese film and television market expanded by 20.3% and streaming-service revenue grew by 30.6%, outpacing the market as a whole. The major Chinese streaming-media service providers operate numerous video platforms with more than 15 million paying users for video entertainment, so the Chinese digital film and television market can be expected to grow into one of the world's important digital film and television markets.
The popularity of high-speed mobile networks and intelligent devices has shifted consumers' film consumption habits from physical media and downloads to streaming. Against this new background, digital film and television markets around the world are fiercely competitive, and streaming-media service providers continuously develop and expand their products and services to offer consumers more diversified and personalized experiences. Faced with huge online film libraries holding massive digital film and television resources, how to organize them more efficiently, how to provide higher-quality subscriptions, and how to recommend content more accurately have become important technical hotspots for every large streaming-media service provider.
Against the background of this market transformation, the concept of the film and television label has become increasingly important as a structured way of organizing film and television information, and improving film and television labeling through advanced technologies is becoming a popular direction in film and television information retrieval. Film and television labels are phrases that accurately describe high-level film and television semantics. Because of the particular nature of film and television, lines and sounds are difficult to manage and search in conventional ways, and labels representing the characteristics of a work greatly help its classification, organization and retrieval. Natural-language labels can help users find films with specific attributes through keywords, lists and label clouds. On this basis, streaming-media service providers can also use label information for personalized recommendation; such content- and feature-based approaches can help overcome the cold-start problem of the collaborative-filtering recommendation algorithms widely used in the current market.
At present, three mainstream methods are used to complete the film and television labeling task: expert labeling, social labeling and automatic algorithmic labeling. Expert labeling means that professionals in the film and television industry label works based on professional knowledge and their own judgment; the labels given by experts are accurate, but this mode is expensive and the resulting content is not rich enough. Social labeling means that users are encouraged, in a manner similar to crowdsourced tasks, to label works freely or semi-freely according to their personal understanding and feelings, and the labeling data of a large number of users are collected, processed and aggregated to generate labels. This method is cheap and the content is rich, but because different users understand and feel works subjectively, the labeling results are uneven, and labels with completely opposite semantics may even appear for the same work, so the result is very noisy. Automatic labeling means that, on an existing small-scale film label data set, a classification model is trained on features extracted from the works themselves and from related information such as audio signals, line text, related comments and posters, and label results are then generated automatically for large-scale film data. Content-based automatic labeling with algorithms addresses both the cost and time problems and the generality problem of labeling methods. The accuracy and the range of application of existing automatic labeling methods still leave considerable room for improvement, so research on automatic film and television labeling using film and television content is receiving more and more attention.
In summary, conventional labeling algorithms still have many problems to be solved urgently, including noise introduced during feature design and the limitations of shallow classifier structures, and no effective solution exists yet.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for automatically acquiring a film and television label based on a deep neural network.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for automatically acquiring a film and television label based on a deep neural network comprises the following steps:
step 1: collecting the lines of the film and constructing a line data set;
step 2: collecting original sound of a film, and constructing a sound data set;
step 3: collecting the labels already generated on film and television platforms, and constructing a film label data set;
step 4: constructing a label automatic acquisition model based on film lines;
step 5: constructing a label automatic acquisition model based on the film sound by adopting a shared-node CNN-LSTM algorithm;
step 6: fusing the two models mentioned in step 4 and step 5.
Further, in step 1, the lines of the film are collected; the line data do not include the post-credits (easter-egg) scene at the end of the film.
Still further, in step 2, the corresponding film sound is collected to match the film lines collected in step 1; the sound data likewise do not include the post-credits scene.
Still further, in step 3, the film and television platforms include iQIYI, Tencent Video, Youku, Maoyan Movie and Douban Movie, and the process of constructing the film label data set includes:
3.1) merging all the labels collected from the 5 platforms to ensure that no repeated labels exist;
3.2) carrying out format standardization on all labels, including uniform character encoding and uniform label separators;
3.3) matching the films from steps 1 and 2 with the collected labels.
In step 4, the automatic labeling model based on the film lines is constructed through the following process:
4.1) Use the WordPiece tool to perform word segmentation and insert the special classification token [CLS] (used to separate samples) and the separator [SEP] (used to separate the different sentences within a sample). Each sentence corresponds to a matrix X = (x_1, x_2, ..., x_t), where x_i is the word vector (a row vector) of the i-th word with dimension d, so X ∈ R^(n×d). The encoding is performed using the following formula:

y_t = softmax(x_t A^T / √d) B        (1)

where A and B are additionally introduced sequences (matrices); they are introduced so that x_t can be compared with each word, thereby obtaining y_t;
4.2) Input the result of the previous step into a model for pre-training. The model calculation formulas are:

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_1, t_2, ..., t_{k-1})        (2)

and

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)        (3)

where t_1, t_2, ..., t_N are consecutive tokens and t_1, t_2, ..., t_k are also consecutive tokens. Further, writing log p(t_k) as r_k, a bidirectional model that is convenient for training on large-scale text is established, whose calculation formula is:

Σ_{k=1..N} [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_LSTM(forward), Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_LSTM(backward), Θ_s) ]        (4)

where t_1, t_2, ..., t_N are consecutive tokens, Θ_x is the input (the initial word vectors), Θ_s is the parameter of the normalization layer, Θ_LSTM(forward) is the forward LSTM model and Θ_LSTM(backward) is the backward LSTM model; on this basis, fifteen percent of the word vectors produced from the words are randomly masked;
4.3) Perform the embedding operation on the vectors pre-trained with the improved Masked Language Model. The embedding operations are Token Embedding (the embedding of the current word), Segment Embedding (the index embedding of the sentence in which the current word is located) and Position Embedding (the index embedding of the position of the current word). In order to represent both single sentences and sentence pairs, multiple sentences are concatenated into a single sequence and distinguished by the segment embeddings and the [SEP] separator; the three embeddings are summed to obtain the input vector;
4.4) Take the vector generated in the previous step as input to a Transformer model with 12 layers and a hidden dimension of 768;
4.5) Adapt the model by fine-tuning, and take the output of the [CLS] token as the input of the softmax layer, thereby obtaining the film label prediction result.
In step 5, the automatic labeling model based on the original film sound is constructed with the shared-node CNN-LSTM algorithm through the following process:
5.1) Obtain the power spectrum of the sound data set corresponding to step 4 through the Fast Fourier Transform (FFT), and then map the spectrum onto the Mel scale using a triangular window function, calculated as:

m = 2595 · log10(1 + f / 700)        (5)

where f is the frequency in hertz. Let E(b), 0 ≤ b < B, denote the Mel-scale power spectral coefficient of the b-th subband, where B is the total number of filters used in pre-processing. The MFCCN value is the discrete cosine transform of the logarithm of E(b); writing log E(b) as H(b), it is calculated as:

MFCCN(n) = Σ_{b=0..B-1} H(b) · cos(πn(b + 0.5) / B),  n = 0, 1, ..., L-1        (6)

where L is the dimension of the MFCCN. The MFCCN feature vector is obtained as:

x_MFCCN = [MFCCN(0), MFCCN(1), ... MFCCN(L-1)]^T        (7)
5.2) Perform short-time Fourier transforms on equal-length audio segments using overlapping short windows; each Fourier transform produces one frame, and consecutive frames are stacked into a matrix to form a spectrogram. Finally, the linear frequency axis is mapped to the Mel scale and the amplitudes, which are unevenly distributed along the frequency axis, are logarithmically scaled; the result is used as the feature representation of the audio signal;
5.3) feed the features generated in the previous step into a convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.4) feed the output of the previous step into a max-pooling layer with a pooling window of length 4;
5.5) feed the output of the previous step into a second convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.6) feed the output of the previous step into a second max-pooling layer with a pooling window of length 4;
5.7) construct several such models and use shared nodes so that each outputs a deep feature sequence;
5.8) because films differ in length, the number of input segments also varies; the temporal-correlation characteristics of the variable-length deep feature sequences output by the three conventional CNN models are captured by a recurrent neural network with an LSTM structure, which finally outputs the predicted label values.
In step 6, fusing the two models mentioned in step 4 and step 5 comprises the following process:
6.1) a convolutional neural network structure is selected; the audio representation and the text representation are passed through their respective base networks, batch-normalized, concatenated, and finally scale-transformed to obtain the output;
6.2) the output of the concatenated model is fed into two fully connected layers with 1024 and 512 output nodes respectively, and the predicted label values are finally output.
The invention has the following beneficial effects: the method adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at films, and uses the temporal correlation of a film to extract high-level abstract attributes from its original information, such as the line text and the audio signal.
Drawings
Fig. 1 is a flowchart of constructing a movie-based automatic tag acquisition model according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for constructing a label automatic acquisition model of a movie sound by using a shared node CNN-LSTM algorithm according to an embodiment of the present invention;
fig. 3 is a flowchart of a method for fusing a tag automatic acquisition model based on movie & television lines and two models mentioned in the tag automatic acquisition model of movie & television original sounds according to an embodiment of the present invention;
fig. 4 is a block diagram of a structure of a method for automatically acquiring a video tag based on a deep neural network according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 4, a method for automatically acquiring film and television labels based on a deep neural network specifically includes the following six steps:
step 1: collecting the lines of the film and constructing a line data set;
step 2: collecting original sound of a film, and constructing a sound data set;
step 3: collecting the labels already generated on film and television platforms, and constructing a film label data set;
step 4: constructing a label automatic acquisition model based on film lines;
step 5: constructing a label automatic acquisition model based on the film sound by adopting a shared-node CNN-LSTM algorithm;
step 6: fusing the two models mentioned in step 4 and step 5.
Further, in step 1, the lines of the film are collected; the line data do not include the post-credits (easter-egg) scene at the end of the film.
Still further, in step 2, the corresponding film sound is collected to match the film lines collected in step 1; the sound data likewise do not include the post-credits scene.
Still further, in step 3, the film and television platforms include iQIYI, Tencent Video, Youku, Maoyan Movie and Douban Movie, and the process of constructing the film label data set includes:
3.1) merging all the labels collected from the 5 platforms to ensure that no repeated labels exist;
3.2) carrying out format standardization on all labels, including uniform character encoding and uniform label separators;
3.3) matching the films from steps 1 and 2 with the collected labels, as in the data-handling sketch below.
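As an illustration of steps 3.1)-3.3), the following Python sketch merges and normalizes the platform labels, assuming the scraped tags are held in per-platform dictionaries keyed by a shared movie identifier (the data layout and helper names are illustrative, not specified by the patent):

import unicodedata

def build_tag_dataset(platform_tags, movie_ids):
    # platform_tags: {platform: {movie_id: [raw_tag_string, ...]}} (assumed layout)
    dataset = {}
    for movie_id in movie_ids:
        merged = set()
        for tags_by_movie in platform_tags.values():
            for raw in tags_by_movie.get(movie_id, []):
                # 3.2) unify character encoding and the label separator
                normalized = unicodedata.normalize("NFKC", raw).replace(";", ",")
                # 3.1) merge labels across the 5 platforms without duplicates
                merged.update(t.strip() for t in normalized.split(",") if t.strip())
        # 3.3) associate the film from steps 1 and 2 with its collected labels
        dataset[movie_id] = sorted(merged)
    return dataset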
In step 4, the automatic labeling model based on the film lines is constructed through the following process:
4.1) Use the WordPiece tool to perform word segmentation and insert the special classification token [CLS] (used to separate samples) and the separator [SEP] (used to separate the different sentences within a sample). Each sentence corresponds to a matrix X = (x_1, x_2, ..., x_t), where x_i is the word vector (a row vector) of the i-th word with dimension d, so X ∈ R^(n×d). The encoding is performed using the following formula:

y_t = softmax(x_t A^T / √d) B        (1)

where A and B are additionally introduced sequences (matrices); they are introduced so that x_t can be compared with each word, thereby obtaining y_t;
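A minimal PyTorch sketch of the comparison described in 4.1), under the assumption that formula (1) is scaled dot-product attention with A acting as the keys and B as the values (the scaling by √d is part of that assumption):

import torch
import torch.nn.functional as F

def encode(X, A, B):
    # compare every x_t with each word of A and aggregate the rows of B,
    # producing one y_t per input word
    d = X.size(-1)
    scores = X @ A.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ B

# toy shapes: n = 6 words, d = 8 dimensions; A = B = X gives the self-attention case
X = torch.randn(6, 8)
Y = encode(X, X, X)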
4.2) Input the result of the previous step into a model for pre-training. The model calculation formulas are:

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_1, t_2, ..., t_{k-1})        (2)

and

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)        (3)

where t_1, t_2, ..., t_N are consecutive tokens and t_1, t_2, ..., t_k are also consecutive tokens. Further, writing log p(t_k) as r_k, a bidirectional model that is convenient for training on large-scale text is established, whose calculation formula is:

Σ_{k=1..N} [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_LSTM(forward), Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_LSTM(backward), Θ_s) ]        (4)

where t_1, t_2, ..., t_N are consecutive tokens, Θ_x is the input (the initial word vectors), Θ_s is the parameter of the normalization layer, Θ_LSTM(forward) is the forward LSTM model and Θ_LSTM(backward) is the backward LSTM model; on this basis, fifteen percent of the word vectors produced from the words are randomly masked;
4.3) Perform the embedding operation on the vectors pre-trained with the improved Masked Language Model. The embedding operations are Token Embedding (the embedding of the current word), Segment Embedding (the index embedding of the sentence in which the current word is located) and Position Embedding (the index embedding of the position of the current word). In order to represent both single sentences and sentence pairs, multiple sentences are concatenated into a single sequence and distinguished by the segment embeddings and the [SEP] separator; the three embeddings are summed to obtain the input vector;
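A small PyTorch sketch of the summation described in 4.3); the vocabulary size, maximum sequence length and hidden dimension are illustrative values rather than requirements of the patent:

import torch
import torch.nn as nn

class SumEmbedding(nn.Module):
    # Token + Segment + Position embeddings, summed to form the input vector of 4.3)
    def __init__(self, vocab_size=21128, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)
        self.segment = nn.Embedding(num_segments, dim)
        self.position = nn.Embedding(max_len, dim)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions).unsqueeze(0))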
4.4) Take the vector generated in the previous step as input to a Transformer model with 12 layers and a hidden dimension of 768;
4.5) Adapt the model by fine-tuning, and take the output of the [CLS] token as the input of the softmax layer, thereby obtaining the film label prediction result.
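One possible realization of steps 4.4)-4.5) with the Hugging Face transformers library; the pretrained Chinese checkpoint, the tag count and the example text are assumptions of this sketch, not requirements of the patent:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_TAGS = 20                     # illustrative size of the label set
subtitle_text = "..."             # one film's line text from the data set of step 1

# a 12-layer, 768-dimensional Transformer encoder with a classification head
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_TAGS)

inputs = tokenizer(subtitle_text, truncation=True, max_length=512,
                   return_tensors="pt")       # inserts [CLS] and [SEP] automatically
logits = model(**inputs).logits               # classifier on top of the [CLS] output
probs = torch.softmax(logits, dim=-1)         # label prediction as in 4.5)

Fine-tuning, as in 4.5), would train this classification head on the film label data set of step 3.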
In step 5, the automatic labeling model based on the original film sound is constructed with the shared-node CNN-LSTM algorithm through the following process:
5.1) Obtain the power spectrum of the sound data set corresponding to step 4 through the Fast Fourier Transform (FFT), and then map the spectrum onto the Mel scale using a triangular window function, calculated as:

m = 2595 · log10(1 + f / 700)        (5)

where f is the frequency in hertz. Let E(b), 0 ≤ b < B, denote the Mel-scale power spectral coefficient of the b-th subband, where B is the total number of filters used in pre-processing. The MFCCN value is the discrete cosine transform of the logarithm of E(b); writing log E(b) as H(b), it is calculated as:

MFCCN(n) = Σ_{b=0..B-1} H(b) · cos(πn(b + 0.5) / B),  n = 0, 1, ..., L-1        (6)

where L is the dimension of the MFCCN. The MFCCN feature vector is obtained as:

x_MFCCN = [MFCCN(0), MFCCN(1), ... MFCCN(L-1)]^T        (7)
5.2) Perform short-time Fourier transforms on equal-length audio segments using overlapping short windows; each Fourier transform produces one frame, and consecutive frames are stacked into a matrix to form a spectrogram. Finally, the linear frequency axis is mapped to the Mel scale and the amplitudes, which are unevenly distributed along the frequency axis, are logarithmically scaled; the result is used as the feature representation of the audio signal (a feature-extraction sketch follows);
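A feature-extraction sketch for 5.1)-5.2) using the librosa library; the sampling rate, FFT size, hop length and filter-bank size are illustrative choices, not values fixed by the patent:

import librosa

def audio_features(path, sr=22050, n_mels=128, n_mfcc=20):
    # log-Mel spectrogram (step 5.2) and MFCC features (step 5.1) of one audio segment
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)  # Mel mapping, Eq. (5)
    log_mel = librosa.power_to_db(mel)        # logarithmic amplitude scaling, H(b) = log E(b)
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)  # DCT of the log energies, Eqs. (6)-(7)
    return log_mel.T, mfcc.T                  # frames along the first axis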
5.3) feed the features generated in the previous step into a convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.4) feed the output of the previous step into a max-pooling layer with a pooling window of length 4;
5.5) feed the output of the previous step into a second convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.6) feed the output of the previous step into a second max-pooling layer with a pooling window of length 4;
5.7) construct several such models and use shared nodes so that each outputs a deep feature sequence;
5.8) because films differ in length, the number of input segments also varies; the temporal-correlation characteristics of the variable-length deep feature sequences output by the three conventional CNN models are captured by a recurrent neural network with an LSTM structure, which finally outputs the predicted label values, as sketched below.
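A PyTorch sketch of steps 5.3)-5.8), interpreting the shared nodes as one CNN whose weights are reused for every audio segment; the LSTM width, the tag count and the pooling of each segment to a single vector are assumptions of this sketch:

import torch
import torch.nn as nn

class SharedCNNLSTM(nn.Module):
    def __init__(self, feat_dim=128, num_tags=20):
        super().__init__()
        self.cnn = nn.Sequential(                 # steps 5.3)-5.6)
            nn.Conv1d(feat_dim, 32, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 32, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveMaxPool1d(1),              # one depth-feature vector per segment
        )
        self.lstm = nn.LSTM(32, 128, batch_first=True)   # step 5.8)
        self.out = nn.Linear(128, num_tags)

    def forward(self, segments):
        # segments: (batch, n_segments, feat_dim, n_frames); n_segments varies per film
        b, s, d, t = segments.shape
        feats = self.cnn(segments.reshape(b * s, d, t)).squeeze(-1)
        feats = feats.reshape(b, s, -1)           # variable-length depth-feature sequence
        _, (h, _) = self.lstm(feats)              # temporal correlation across segments
        return self.out(h[-1])                    # predicted label values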
In step 6, the two models mentioned in step 4 and step 5 are fused through the following process:
6.1) a convolutional neural network structure is selected; the audio representation and the text representation are passed through their respective base networks, batch-normalized, concatenated, and finally scale-transformed to obtain the output;
6.2) the output of the concatenated model is fed into two fully connected layers with 1024 and 512 output nodes respectively, and the predicted label values are finally output.
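A PyTorch sketch of the fusion in 6.1)-6.2); the representation dimensions and the final label layer are assumptions of this sketch (the patent specifies the batch normalization, the concatenation and the 1024- and 512-node fully connected layers):

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, audio_dim, text_dim, num_tags):
        super().__init__()
        self.bn_audio = nn.BatchNorm1d(audio_dim)   # 6.1) batch normalization
        self.bn_text = nn.BatchNorm1d(text_dim)
        self.fc = nn.Sequential(                    # 6.2) two fully connected layers
            nn.Linear(audio_dim + text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_tags),               # predicted label values
        )

    def forward(self, audio_repr, text_repr):
        x = torch.cat([self.bn_audio(audio_repr), self.bn_text(text_repr)], dim=1)
        return self.fc(x)

# usage: audio_repr and text_repr are the outputs of the step-5 and step-4 models
head = FusionHead(audio_dim=128, text_dim=768, num_tags=20)
scores = head(torch.randn(4, 128), torch.randn(4, 768))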
In the method for automatically acquiring a video tag based on a deep neural network provided by the embodiment of the present invention, the instruction included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, and details are not described herein.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (3)

1. A method for automatically acquiring a film and television label based on a deep neural network is characterized by comprising the following steps:
step 1: collecting the lines of the film and constructing a line data set;
step 2: collecting original sound of a film, and constructing a sound data set;
step 3: collecting the labels already generated on film and television platforms, and constructing a film label data set;
step 4: constructing an automatic labeling model based on film lines;
step 5: constructing an automatic labeling model based on the original sound of the film by adopting a shared-node CNN-LSTM algorithm;
step 6: fusing the two models mentioned in step 4 and step 5;
in step 4, the automatic labeling model based on the film lines is constructed through the following process:
4.1) using the WordPiece tool to perform word segmentation and insert the special classification token [CLS], used to separate samples, and the separator [SEP], used to separate the different sentences within a sample, each sentence corresponding to a matrix X = (x_1, x_2, ..., x_t), where x_i is the word vector of the i-th word with dimension d, so X ∈ R^(n×d), the encoding being performed using the following formula:

y_t = softmax(x_t A^T / √d) B        (1)

where A and B are additionally introduced sequences, introduced so that x_t can be compared with each word, thereby obtaining y_t;
4.2) inputting the result of the previous step into a model for pre-training, the model calculation formulas being:

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_1, t_2, ..., t_{k-1})        (2)

and

p(t_1, t_2, ..., t_N) = Π_{k=1..N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)        (3)

where t_1, t_2, ..., t_N are consecutive tokens and t_1, t_2, ..., t_k are also consecutive tokens; further, writing log p(t_k) as r_k, a bidirectional model that is convenient for training on large-scale text is established, whose calculation formula is:

Σ_{k=1..N} [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_LSTM(forward), Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_LSTM(backward), Θ_s) ]        (4)

where t_1, t_2, ..., t_N are consecutive tokens, Θ_x is the input (the initial word vectors), Θ_s is the parameter of the normalization layer, Θ_LSTM(forward) is the forward LSTM model and Θ_LSTM(backward) is the backward LSTM model; on this basis, fifteen percent of the word vectors produced from the words are randomly masked;
4.3) performing the embedding operation on the vectors after model pre-training, wherein the embedding operations are Token Embedding (the embedding of the current word), Segment Embedding (the index embedding of the sentence in which the current word is located) and Position Embedding (the index embedding of the position of the current word); in order to represent both single sentences and sentence pairs, multiple sentences are concatenated into a single sequence and distinguished by the segment embeddings and the [SEP] separator; the three embeddings are summed to obtain the input vector;
4.4) taking the vector generated in the previous step as input to a Transformer model with 12 layers and a hidden dimension of 768;
4.5) adapting the model by fine-tuning, and taking the output of the [CLS] token as the input of the softmax normalization layer, thereby obtaining the film label prediction result;
in step 5, constructing the automatic labeling model based on the film sound by adopting the shared-node CNN-LSTM algorithm comprises the following process:
5.1) obtaining the power spectrum of the sound data set corresponding to step 4 through the Fast Fourier Transform (FFT), and then mapping the spectrum onto the Mel scale m using a triangular window function, calculated as:

m = 2595 · log10(1 + f / 700)        (5)

where f is the frequency in hertz; letting E(b), 0 ≤ b < B, denote the Mel-scale power spectral coefficient of the b-th subband, where B is the total number of filters used in pre-processing, the MFCCN value is the discrete cosine transform of the logarithm of E(b); writing log E(b) as H(b), it is calculated as:

MFCCN(n) = Σ_{b=0..B-1} H(b) · cos(πn(b + 0.5) / B),  n = 0, 1, ..., L-1        (6)

where L is the dimension of the MFCCN, the MFCCN feature vector x_MFCCN being obtained as:

x_MFCCN = [MFCCN(0), MFCCN(1), ... MFCCN(L-1)]^T        (7)
5.2) performing short-time Fourier transforms on equal-length audio segments using overlapping short windows, wherein each Fourier transform produces one frame and consecutive frames are stacked into a matrix to form a spectrogram; finally, the linear frequency axis is mapped to the Mel scale and the amplitudes, which are unevenly distributed along the frequency axis, are logarithmically scaled, the result being used as the feature representation of the audio signal;
5.3) feeding the features generated in the previous step into a convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.4) feeding the output of the previous step into a max-pooling layer with a pooling window of length 4;
5.5) feeding the output of the previous step into a second convolutional layer containing 32 one-dimensional filters of length 8 (window size 8);
5.6) feeding the output of the previous step into a second max-pooling layer with a pooling window of length 4;
5.7) constructing three CNN models that adopt shared nodes and each output a deep feature sequence;
5.8) because films differ in length, the number of input segments also varies; the temporal-correlation characteristics of the variable-length deep feature sequences output by the three CNN models are captured by a recurrent neural network with an LSTM structure, which finally outputs the predicted label values.
2. The method for automatically acquiring film and television labels based on a deep neural network according to claim 1, wherein: in step 3, the film and television platforms comprise iQIYI, Tencent Video, Youku, Maoyan Movie and Douban Movie; constructing the film label data set includes the following process:
3.1) merging all the labels collected from the 5 platforms to ensure that no repeated labels exist;
3.2) carrying out format standardization on all labels, including uniform character encoding and uniform label separators;
3.3) matching the films from step 1 and step 2 with the collected labels.
3. The method for automatically acquiring film and television labels based on a deep neural network according to claim 1, wherein in step 6, fusing the two models mentioned in step 4 and step 5 comprises the following process:
6.1) selecting a convolutional neural network structure, passing the audio representation and the text representation through their respective base networks, batch-normalizing and concatenating them, and finally applying a scale transformation to obtain the output;
6.2) feeding the output of the concatenated model into two fully connected layers with 1024 and 512 output nodes respectively, and finally outputting the predicted label values.
CN201910627545.8A 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network Active CN110516086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910627545.8A CN110516086B (en) 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910627545.8A CN110516086B (en) 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network

Publications (2)

Publication Number Publication Date
CN110516086A CN110516086A (en) 2019-11-29
CN110516086B true CN110516086B (en) 2022-05-03

Family

ID=68623048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910627545.8A Active CN110516086B (en) 2019-07-12 2019-07-12 Method for automatically acquiring movie label based on deep neural network

Country Status (1)

Country Link
CN (1) CN110516086B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460820B (en) * 2020-03-06 2022-06-17 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT
CN112084371B (en) * 2020-07-21 2024-04-16 中国科学院深圳先进技术研究院 Movie multi-label classification method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294797A (en) * 2016-08-15 2017-01-04 北京聚爱聊网络科技有限公司 A kind of generation method and apparatus of video gene
CN108965920A (en) * 2018-08-08 2018-12-07 北京未来媒体科技股份有限公司 A kind of video content demolition method and device
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server
CN109710800A (en) * 2018-11-08 2019-05-03 北京奇艺世纪科技有限公司 Model generating method, video classification methods, device, terminal and storage medium
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9946933B2 (en) * 2016-08-18 2018-04-17 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294797A (en) * 2016-08-15 2017-01-04 北京聚爱聊网络科技有限公司 A kind of generation method and apparatus of video gene
CN108965920A (en) * 2018-08-08 2018-12-07 北京未来媒体科技股份有限公司 A kind of video content demolition method and device
CN109710800A (en) * 2018-11-08 2019-05-03 北京奇艺世纪科技有限公司 Model generating method, video classification methods, device, terminal and storage medium
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Jacob Devlin et al.; arXiv; 2019-05-24; full text *

Also Published As

Publication number Publication date
CN110516086A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
JP7142737B2 (en) Multimodal theme classification method, device, device and storage medium
CN112749608B (en) Video auditing method, device, computer equipment and storage medium
CN106328147B (en) Speech recognition method and device
CN110704674B (en) Video playing integrity prediction method and device
CN112104919B (en) Content title generation method, device, equipment and computer readable storage medium based on neural network
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
US20240212706A1 (en) Audio data processing
CN111046225B (en) Audio resource processing method, device, equipment and storage medium
CN112418011A (en) Method, device and equipment for identifying integrity of video content and storage medium
CN109299277A (en) The analysis of public opinion method, server and computer readable storage medium
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN117033558A (en) BERT-WWM and multi-feature fused film evaluation emotion analysis method
CN116977701A (en) Video classification model training method, video classification method and device
CN115909390B (en) Method, device, computer equipment and storage medium for identifying low-custom content
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
CN113704541A (en) Training data acquisition method, video push method, device, medium and electronic equipment
CN114547435A (en) Content quality identification method, device, equipment and readable storage medium
CN117009574B (en) Hot spot video template generation method, system, equipment and storage medium
CN114328990B (en) Image integrity recognition method, device, computer equipment and storage medium
CN115905584B (en) Video splitting method and device
CN116610804A (en) Text recall method and system for improving recognition of small sample category
CN114372139A (en) Data processing method, abstract display method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant