CN112084371A - Film multi-label classification method and device, electronic equipment and storage medium - Google Patents
Info
- Publication number
- CN112084371A (application CN202010708014.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- sequence
- attention
- label
- movie
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a movie multi-label classification method and device, electronic equipment and a computer-readable storage medium, wherein the movie multi-label classification method comprises the following steps: acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments; acquiring video segment features of the video frame sequence based on a preset neural network model; calculating an attention matrix based on the video segment features; and traversing the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme increases the attention paid to important content in the video and improves the accuracy of multi-label classification.
Description
Technical Field
The present application relates to the field of computer vision application technologies, and in particular, to a movie multi-label classification method, an apparatus, an electronic device, and a storage medium.
Background
The category label of a movie (e.g., war, comedy, animation) is highly condensed information about the movie's content: it is not only an important criterion by which people select movies but also a basis for building movie databases. However, with the development of the movie industry, the number of movie categories keeps growing. Constructing an efficient movie label classification system to update the labels of older movies therefore has significant practical value.
Currently, existing movie classification algorithms are mainly based on movie trailers or on posters. The poster-based methods are limited because movie posters vary widely in style and may not fully convey the category information, so their prediction accuracy is limited. The main problems of the trailer-based methods are:
(1) they assume a movie belongs to only a single category;
(2) they classify using only the low-level visual features of the trailer;
(3) video frames with a fixed pattern that carry no useful classification features (e.g., the opening and closing credits) are not distinguished from other video frames, which may mislead the classification.
In short, current classification methods neither model the temporal information in the video effectively nor avoid picking invalid frames (such as the opening and closing credits of a movie) when selecting key frames.
Disclosure of Invention
The application at least provides a movie multi-label classification method, a movie multi-label classification device, an electronic device and a computer-readable storage medium.
The application provides a method for classifying multiple labels of a movie, which comprises the following steps:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments;
acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
computing an attention matrix based on the video segment features;
traversing the sequence of video frames according to the attention matrix to output a label category of the sequence of video frames.
Wherein the step of traversing the sequence of video frames according to the attention matrix to output label categories of the sequence of video frames comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video segment features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a label category of the sequence of video frames in the movie category space.
Wherein the step of calculating an attention matrix based on the video segment features comprises:
calculating a forward hidden state and a backward hidden state of the video segment features based on a BiLSTM (bidirectional LSTM);
and adopting a self-attention mechanism to calculate an attention matrix of forward hidden states and backward hidden states of all the video segment features.
Wherein, the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video segment features by using a self-attention mechanism comprises:
calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on hidden elements of all the video segment characteristics and the number of the hidden layer nodes.
Wherein the step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features comprises:
acquiring a hidden element set of all the video segment features;
carrying out normalization processing on the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
Wherein, after the step of outputting the label categories of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the label category based on the cross entropy loss function;
wherein the output layer of the neural network model is the fully-connected layer fc7.
Wherein, after the step of obtaining a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the differences between adjacent frames at each gray level in the continuous video frame sequence;
when the accumulated sum is greater than a preset threshold, superimposing the accumulated sum on the color histogram of the later of the two adjacent frames;
dividing the video frame sequence after the superposition processing into a plurality of video segments in time order, and extracting a preset number of video frames from each video segment to form a new video segment sequence.
A second aspect of the present application provides a movie multi-label classification apparatus, including:
an obtaining module, configured to obtain a continuous sequence of video frames;
the characteristic extraction module is used for acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
an attention calculation module for calculating an attention matrix based on the video segment features;
and the label classification module is used for traversing the video frame sequence according to the attention matrix to output the label category of the video frame sequence.
A third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the foregoing multi-label classification method for movies in the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a processor, implement the movie multi-label classification method of the first aspect.
According to the scheme, the movie multi-label classification apparatus obtains a continuous video frame sequence, where the video frame sequence includes a plurality of video segments; obtains video segment features of the video frame sequence based on a preset neural network model; calculates an attention matrix based on the video segment features; and traverses the video frame sequence according to the attention matrix to output the label category of the video frame sequence. This scheme increases the attention paid to important content in the video and improves the accuracy of multi-label classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flowchart of an embodiment of the movie multi-label classification method provided by the present application;
FIG. 2 is a schematic framework diagram of the movie multi-label classification model provided by the present application;
FIG. 3 is a detailed flowchart of step S103 in the movie multi-label classification method shown in FIG. 1;
FIG. 4 is a detailed flowchart of step S104 in the movie multi-label classification method shown in FIG. 1;
FIG. 5 is a schematic framework diagram of an embodiment of the movie multi-label classification apparatus provided by the present application;
FIG. 6 is a schematic framework diagram of an embodiment of the electronic device provided by the present application;
FIG. 7 is a schematic framework diagram of an embodiment of the computer-readable storage medium provided by the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic flowchart of an embodiment of the movie multi-label classification method provided by the present application, and FIG. 2 is a schematic framework diagram of the movie multi-label classification model provided by the present application. The movie multi-label classification method can be applied to assigning multiple labels of different types to the feature film or trailer of a movie, making it convenient for audiences to learn the basic information of the movie.
The execution body of the movie multi-label classification method of the present application may be a movie multi-label classification apparatus; for example, the method may be executed by a terminal device, a server, or another processing device, where the apparatus may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a wireless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the movie multi-label classification method may be implemented by a processor calling computer-readable instructions stored in a memory.
Specifically, the movie multi-label classification method of the embodiment of the present disclosure may include the following steps:
step S101: a sequence of consecutive video frames is acquired, wherein the sequence of video frames comprises a number of video segments.
The classification apparatus obtains a continuous sequence of video frames, which may be part or all of a trailer or the feature of a movie. The classification apparatus may pre-process the continuous video frame sequence before segment generation, so that the raw video frame sequence complies with the input rules of the subsequent neural network model.
The preprocessing flow can effectively reduce the risk of network overfitting and eliminate noise unrelated to the category information in the raw data. For example, black borders may be added around the images in a video to maintain the video's aspect ratio and size; such black borders not only do not help the classification result, but may also lead the neural network model to mistake them for useful information, thereby affecting the prediction result.
Taking the C3D network (Convolutional 3D network) as an example, the standard input is a 4-dimensional matrix (channel number × frame number × frame height × frame width). For a given sequence of video frames U = {u_e}, e ∈ {1, 2, ..., T−1}, the preprocessing stage mainly processes the frame height and frame width of each video frame, where the original frame height in a video frame is Height and the original frame width is Width.
The specific preprocessing flow is as follows: first, the classification apparatus removes the black borders in the images and resizes the video frames to a preset video frame size while keeping the original aspect ratio; for example, the classification apparatus may resize each video frame to 196 (frame width) × 128 (frame height). Then, during training, the classification apparatus feeds the network 112 (frame width) × 112 (frame height) jittered random crops to improve the robustness of the system, yielding the preprocessed video frame sequence.
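As a concrete illustration, the resize-and-crop step can be expressed with standard image transforms; the following is a minimal Python sketch under the sizes stated above, assuming frames arrive as PIL images and leaving out black-border removal, whose algorithm the text does not detail:

```python
from torchvision import transforms

# Minimal sketch of the training-time frame preprocessing described above
# (assumed: resize to 196x128 after border removal, jittered crop to 112x112).
train_preprocess = transforms.Compose([
    transforms.Resize((128, 196)),      # target (frame height, frame width)
    transforms.RandomCrop((112, 112)),  # random crop with jitter, training only
    transforms.ToTensor(),              # HWC uint8 -> CHW float in [0, 1]
])
```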
For the embodiment of the present disclosure, given the preprocessed video frame sequence, the classification apparatus calculates the difference at each gray level between every video frame u_e and the next video frame u_{e+1}:

D_e = Σ_{j=1}^{n} |H_e(j) − H_{e+1}(j)|

where H_e(j) and H_{e+1}(j) are the values of the color histograms of video frames u_e and u_{e+1} at gray level j, respectively, and n is the number of gray levels in the color histogram.
When the difference D_e is greater than a given preset threshold, a shot change is considered to occur in the video frame sequence, and the classification apparatus superimposes the difference D_e onto the color histogram of the next video frame u_{e+1}. After shot detection, the classification apparatus obtains a new video frame sequence S = {s_t^{(r)}}, t ∈ {1, 2, ..., k}, r ∈ {1, 2, ..., m_t}, where the subscript t denotes the t-th segment, k denotes that the video frame sequence consists of k segments, the superscript r denotes the r-th video frame within a segment, and m_t denotes that the t-th video segment contains m_t video frames.
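To make the shot-detection step concrete, below is a minimal Python sketch. It is a simplification under stated assumptions: it thresholds each adjacent-frame histogram difference D_e directly and cuts there, rather than accumulating differences and superimposing them on the next frame's histogram as the claims describe; `threshold` and the grayscale frame format are assumptions.

```python
import numpy as np

def split_into_segments(frames, threshold, n_levels=256):
    """Split grayscale frames into shots where the gray-level histogram
    difference D_e between adjacent frames exceeds the preset threshold."""
    cuts = [0]
    for e in range(len(frames) - 1):
        h_e, _ = np.histogram(frames[e], bins=n_levels, range=(0, n_levels))
        h_n, _ = np.histogram(frames[e + 1], bins=n_levels, range=(0, n_levels))
        # D_e = sum_j |H_e(j) - H_{e+1}(j)| over the n gray levels
        d_e = np.abs(h_e.astype(np.int64) - h_n.astype(np.int64)).sum()
        if d_e > threshold:
            cuts.append(e + 1)          # a new segment starts at frame e+1
    cuts.append(len(frames))
    # segments[t] contains the m_t frames of the t-th video segment
    return [frames[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
```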
Then, in order to satisfy the input requirements of the subsequent neural network model such as the C3D network (see the Candidate Clip Generation part of FIG. 2), the classification apparatus further extracts 16 video frames from each video segment, in a predetermined order or randomly, to form a new video clip. For example, for a given video segment s_t, the classification apparatus performs equidistant extraction at an interval δ derived from Frame_rate, the frame rate of the current video, so that the new video segments form a new video frame sequence F = {f_t^{(j)}}, t ∈ {1, 2, ..., k}, j ∈ {1, 1+δ, ..., 1+15δ}.
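A sketch of this candidate clip generation follows. The fixed clip length of 16 matches the text; the interval δ = ⌊m_t / 16⌋ is an assumption, since the exact dependence on Frame_rate is not spelled out:

```python
def sample_clip(segment, num_frames=16):
    """Equidistantly extract num_frames frames from one video segment so
    every clip matches the fixed-length input expected by the C3D network."""
    m_t = len(segment)
    if m_t <= num_frames:                     # short segment: pad by repeating
        return list(segment) + [segment[-1]] * (num_frames - m_t)
    delta = m_t // num_frames                 # assumed extraction interval
    return [segment[j * delta] for j in range(num_frames)]
```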
Step S102: and acquiring the video segment characteristics of the video frame sequence based on a preset neural network model.
The classification device inputs the video frame sequence into a preset neural network model, taking a C3D network as an example, so as to obtain the video segment characteristics of the video frame sequence.
In the embodiment of the present disclosure, the classification apparatus extracts the video segment features from the video frame sequence through the C3D network: x_t = f(f_t^{(1)} : f_t^{(1+15δ)}); please refer to the Spatio-temporal Descriptor part of FIG. 2. Backed by large supervised training datasets, the C3D network has achieved good performance on many video analysis tasks and can successfully learn and model spatio-temporal features in videos. However, one problem with using the C3D network directly is that the task dataset lacks the action annotation data that would let the C3D network learn dynamic features. Pre-training can effectively solve this problem: it is widely used in the field of computer vision, where it can markedly improve application performance, and it has also achieved notable success in natural language processing in recent years.
Generally, the pre-training process creates a training task and obtains trained model parameters, and the embodiment of the present disclosure loads these trained model parameters into the C3D network, thereby initializing the model weights of the C3D network.
There are two main ways to load the model weights: in one, the loaded model parameters remain unchanged while training on the task of the disclosed embodiments, referred to as the "Frozen" approach; in the other, the model weights of the C3D network are initialized from the pre-trained parameters but still change with the training process during the task, referred to as the "Fine-Tuning" approach.
It should be noted that, in the embodiment of the present disclosure, the classification apparatus initializes the model weight of the C3D network in a "Frozen" manner.
Specifically, the classification apparatus pre-trains the C3D network on the Sports-1M dataset, applies the trained C3D network to the task dataset of the embodiment of the present disclosure, and takes the output of the penultimate fully-connected layer of the C3D network, namely fc7, as the final output value of the network.
It should be noted that, because features taken from the final output of the C3D network are tied to the pre-training task rather than the task at hand, and in order to maintain the generality of the features, the embodiment of the present disclosure selects the output of fc7 as the feature vector of the video segment features. Deleting the processing layers after fc7 helps enhance the transfer capability of the C3D network and meets the multi-label category classification requirement of the present application.
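The "Frozen" use of the pre-trained backbone can be sketched as follows; `pretrained_c3d` stands in for a Sports-1M pre-trained C3D model already truncated after fc7 (how the checkpoint is loaded is an assumption, as the text does not specify it):

```python
import torch
import torch.nn as nn

class FrozenC3DExtractor(nn.Module):
    """Wraps a pre-trained C3D backbone truncated after fc7 and freezes it,
    so the extracted segment features x_t stay fixed during training."""

    def __init__(self, pretrained_c3d: nn.Module):
        super().__init__()
        self.backbone = pretrained_c3d
        for p in self.backbone.parameters():
            p.requires_grad = False        # "Frozen": weights never updated

    @torch.no_grad()
    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3 channels, 16 frames, 112, 112) -> fc7 feature x_t
        return self.backbone(clip)
```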
Step S103: an attention matrix is calculated based on the video segment characteristics.
The specific process by which the classification apparatus calculates the attention matrix from the obtained video segment features is shown in FIG. 2 and FIG. 3, where FIG. 3 is a detailed flowchart of step S103 in the movie multi-label classification method shown in FIG. 1. Specifically, the method comprises the following steps:
step S201: and calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic.
The classification apparatus extracts each video segment feature X = (x_1, x_2, ..., x_k) ∈ R^{k×D} from the video frame sequence, and the dependencies between the video segment features then need to be modeled. In the embodiment of the present disclosure, the classification apparatus uses a BiLSTM to process the video segment features obtained in the above steps, computing a forward hidden state and a backward hidden state for each segment.
step S202: and acquiring the number of hidden layer nodes of the BilSTM.
The classification apparatus concatenates the forward hidden state and the backward hidden state of each video segment feature to obtain h_t. With the number of hidden nodes of the LSTM set to u, H ∈ R^{k×2u} is the set of all hidden states:

H = (h_1, h_2, ..., h_k)

where each element h_i of the hidden state set describes the global information around the i-th video segment in the video frame sequence.
Step S203: and obtaining an attention matrix based on hidden elements of all video features and the number of hidden nodes.
Here, an LSTM generally pays the same attention to the content at all positions, whereas the disclosed embodiments want the network to focus only on the important content in the video. To achieve this, the disclosed embodiments add a self-attention mechanism after the LSTM, whose input is the hidden state set H and whose output is an attention matrix V:
V = softmax(W_b tanh(W_a H^T))
where W_a ∈ R^{D_a×2u} and W_b ∈ R^{D_b×D_a} are two coefficient weight matrices and D_a and D_b are hyper-parameters, so the final V has shape D_b × k. Because the softmax function is used for normalization, each dimension of a row vector of V can be regarded as the attention paid to the corresponding position in the video, and each row vector is a representation of one specific kind of content in the video. Since a movie often carries several different category labels, different category labels are usually expressed by different contents in the video, and even the same category may be expressed by different contents; the embodiment of the present disclosure therefore treats D_a and D_b as hyper-parameters so that the network can learn the different content parts of the video.
Further, after the classification apparatus obtains the attention matrix of the video, it needs to compute the corresponding video feature matrix B; the specific calculation is:
B=VH
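Putting steps S201–S203 and the feature matrix together, a minimal PyTorch sketch of the BiLSTM plus self-attention block might look as follows; the reconstructed shapes of W_a and W_b are assumptions consistent with the formula above:

```python
import torch
import torch.nn as nn

class SegmentSelfAttention(nn.Module):
    """BiLSTM over segment features X in R^{k x D}, then the structured
    self-attention V = softmax(W_b tanh(W_a H^T)) and feature matrix B = V H."""

    def __init__(self, feat_dim: int, u: int, d_a: int, d_b: int):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, u, batch_first=True, bidirectional=True)
        self.w_a = nn.Linear(2 * u, d_a, bias=False)   # W_a in R^{D_a x 2u}
        self.w_b = nn.Linear(d_a, d_b, bias=False)     # W_b in R^{D_b x D_a}

    def forward(self, x: torch.Tensor):
        h, _ = self.bilstm(x)                          # H: (batch, k, 2u)
        scores = self.w_b(torch.tanh(self.w_a(h)))     # (batch, k, D_b)
        v = torch.softmax(scores.transpose(1, 2), -1)  # V: (batch, D_b, k)
        b = v @ h                                      # B = V H: (batch, D_b, 2u)
        return b, v
```

Each of the D_b rows of V is one attention distribution over the k segments, which is what lets the model attend to several different contents of the same movie.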
step S104: and traversing the video frame sequence according to the attention moment array to output the label category of the video frame sequence.
The classification device extracts video features in the video frame sequence based on the attention matrix and the video feature matrix obtained in the above steps, and outputs a plurality of different label categories of the video frame sequence according to the video features. Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a specific process of step S104 in the movie multi-tag classification method shown in fig. 1. Specifically, the method comprises the following steps:
step S301: and acquiring a corresponding video feature matrix based on the attention matrix and the video segment features.
Step S302: and forming a two-layer perceptron by the attention matrix and the video feature matrix.
The classification apparatus stacks the attention matrix and the video feature matrix in sequence to form a two-layer perceptron.
Step S303: and converting the space where the video frame sequence is located into the movie category space according to a two-layer perceptron.
The classification apparatus converts the original space of the video frame sequence into the movie category space by using the two-layer perceptron.
Step S304: the label category of the video frame sequence is output in a movie category space.
The classification device extracts video features of the video frame sequence in a movie category space and outputs a plurality of label categories corresponding to the video frame sequence according to the categories of the video features.
After outputting the label categories of the video frame sequence, the classification apparatus further constructs a cross-entropy loss function L for the network according to the multi-label learning task; in its standard multi-label form,

L = −Σ_{i=1}^{C} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

where C is the number of label categories, y_i indicates whether the movie carries label i, and ŷ_i is the predicted probability for label i.
The classification apparatus can evaluate the score of the multi-label learning task, i.e., the accuracy of the output label categories, according to the cross-entropy loss function L, and can optimize the network according to L.
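For completeness, here is a sketch of the two-layer perceptron head and the multi-label cross-entropy it is trained with; the sizes below are placeholders, not values given by the text:

```python
import torch
import torch.nn as nn

d_b, u, hidden_dim, num_classes = 8, 256, 512, 20    # assumed sizes

head = nn.Sequential(
    nn.Flatten(),                        # B: (batch, D_b, 2u) -> (batch, D_b*2u)
    nn.Linear(d_b * 2 * u, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),  # one score per movie category label
)
criterion = nn.BCEWithLogitsLoss()       # multi-label cross-entropy loss L

b = torch.randn(4, d_b, 2 * u)                        # stand-in for B
y = torch.randint(0, 2, (4, num_classes)).float()     # multi-hot labels
loss = criterion(head(b), y)             # score used to optimize the network
```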
In this embodiment, the classification apparatus obtains a continuous video frame sequence, where the video frame sequence includes a plurality of video segments; obtains video segment features of the video frame sequence based on a preset neural network model; calculates an attention matrix based on the video segment features; and traverses the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme increases the attention paid to important content in the video and improves the accuracy of multi-label classification.
Specifically, compared with current movie classification methods, the movie classification method of the present application has the following advantages: (1) it extracts bottom-layer features with the C3D network, effectively retaining the temporal features in the video; (2) it introduces an attention mechanism, computing the response at a certain position in the video frame sequence by attending to all positions and taking their weighted average in an embedding space, which on the one hand raises the attention paid to important content and on the other hand reduces the influence of invalid segments (such as the opening and closing credits) on the classification result; (3) considering that a movie often belongs to multiple categories, it extends the movie classification task into a multi-label learning task.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
Referring to FIG. 5, FIG. 5 is a schematic framework diagram of an embodiment of the movie multi-label classification apparatus provided by the present application. The movie multi-label classification apparatus 50 includes:
an obtaining module 51 for obtaining a sequence of consecutive video frames.
And the feature extraction module 52 is configured to obtain video segment features of the video frame sequence based on a preset neural network model.
And an attention calculation module 53 for calculating an attention matrix based on the video segment characteristics.
And a label classification module 54 for traversing the video frame sequence according to the attention moment matrix to output a label class of the video frame sequence.
Referring to fig. 6, fig. 6 is a schematic diagram of a frame of an embodiment of an electronic device provided in the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the processor 62 being configured to execute program instructions stored in the memory 61 to implement the steps in any of the above-described embodiments of the movie multi-label classification method. In one particular implementation scenario, electronic device 60 may include, but is not limited to: a microcomputer, a server, and in addition, the electronic device 60 may also include a mobile device such as a notebook computer, a tablet computer, and the like, which is not limited herein.
In particular, the processor 62 is configured to control itself and the memory 61 to implement the steps in any of the above-described embodiments of the movie multi-label classification method. The processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be implemented jointly by multiple integrated circuit chips.
Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application. The computer readable storage medium 70 stores program instructions 701 executable by a processor, the program instructions 701 being for implementing the steps in any of the above-described embodiments of the movie multi-label classification method.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Claims (10)
1. A multi-label classification method for a movie is characterized by comprising the following steps:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments;
acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
computing an attention matrix based on the video segment features;
traversing the sequence of video frames according to the attention matrix to output a label category of the sequence of video frames.
2. The movie multi-label classification method according to claim 1,
the step of traversing the sequence of video frames according to the attention matrix to output label categories of the sequence of video frames comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video segment features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a label category of the sequence of video frames in the movie category space.
3. The movie multi-label classification method according to claim 2,
the step of calculating an attention matrix based on the video segment features comprises:
calculating a forward hidden state and a backward hidden state of the video segment features based on a BiLSTM;
and adopting a self-attention mechanism to calculate an attention matrix of forward hidden states and backward hidden states of all the video segment features.
4. The movie multi-label classification method according to claim 3,
the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video segment features by using a self-attention mechanism comprises the following steps:
calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on hidden elements of all the video segment characteristics and the number of the hidden layer nodes.
5. The movie multi-label classification method according to claim 4,
the step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features includes:
acquiring a hidden element set of all the video segment features;
carrying out normalization processing on the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
6. The movie multi-label classification method according to claim 1,
after the step of outputting the label categories of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the label category based on the cross entropy loss function;
wherein the output layer of the neural network model is the fully-connected layer fc7.
7. The movie multi-label classification method according to claim 1,
after the step of acquiring a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the differences between adjacent frames at each gray level in the continuous video frame sequence;
when the accumulated sum is greater than a preset threshold, superimposing the accumulated sum on the color histogram of the later of the two adjacent frames;
dividing the video frame sequence after the superposition processing into a plurality of video segments in time order, and extracting a preset number of video frames from each video segment to form a new video segment sequence.
8. A movie multi-label classification apparatus, comprising:
an obtaining module, configured to obtain a continuous sequence of video frames;
the characteristic extraction module is used for acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
an attention calculation module for calculating an attention matrix based on the video segment features;
and the label classification module is used for traversing the video frame sequence according to the attention moment array so as to output the label category of the video frame sequence.
9. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the movie multi-label classification method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon program instructions, which when executed by a processor, implement the movie multi-label classification method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010708014.4A CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010708014.4A CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112084371A true CN112084371A (en) | 2020-12-15 |
CN112084371B CN112084371B (en) | 2024-04-16 |
Family
ID=73735152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010708014.4A Active CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112084371B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515997A (en) * | 2020-12-28 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Video data processing method and device and readable storage medium |
CN114329060A (en) * | 2021-12-24 | 2022-04-12 | 空间视创(重庆)科技股份有限公司 | Method and system for automatically generating multiple labels of video frame based on neural network model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170262995A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
CN108763326A (en) * | 2018-05-04 | 2018-11-06 | 南京邮电大学 | A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based |
CN110209823A (en) * | 2019-06-12 | 2019-09-06 | 齐鲁工业大学 | A kind of multi-tag file classification method and system |
CN110516086A (en) * | 2019-07-12 | 2019-11-29 | 浙江工业大学 | One kind being based on deep neural network video display label automatic obtaining method |
US20190379628A1 (en) * | 2018-06-07 | 2019-12-12 | Arizona Board Of Regents On Behalf Of Arizona State University | Method and apparatus for detecting fake news in a social media network |
CN111026915A (en) * | 2019-11-25 | 2020-04-17 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
- 2020-07-21 CN CN202010708014.4A patent/CN112084371B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170262995A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
CN108763326A (en) * | 2018-05-04 | 2018-11-06 | 南京邮电大学 | A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based |
US20190379628A1 (en) * | 2018-06-07 | 2019-12-12 | Arizona Board Of Regents On Behalf Of Arizona State University | Method and apparatus for detecting fake news in a social media network |
CN110209823A (en) * | 2019-06-12 | 2019-09-06 | 齐鲁工业大学 | A kind of multi-tag file classification method and system |
CN110516086A (en) * | 2019-07-12 | 2019-11-29 | 浙江工业大学 | One kind being based on deep neural network video display label automatic obtaining method |
CN111026915A (en) * | 2019-11-25 | 2020-04-17 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
SANG Haifeng et al.: "Design of a video action recognition network based on recurrent region attention and video frame attention", Acta Electronica Sinica (电子学报), 15 June 2020 (2020-06-15) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515997A (en) * | 2020-12-28 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Video data processing method and device and readable storage medium |
CN113515997B (en) * | 2020-12-28 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Video data processing method and device and readable storage medium |
CN114329060A (en) * | 2021-12-24 | 2022-04-12 | 空间视创(重庆)科技股份有限公司 | Method and system for automatically generating multiple labels of video frame based on neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN112084371B (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230196117A1 (en) | Training method for semi-supervised learning model, image processing method, and device | |
CN109711481B (en) | Neural networks for drawing multi-label recognition, related methods, media and devices | |
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
Bouwmans et al. | Scene background initialization: A taxonomy | |
Fu et al. | Fast crowd density estimation with convolutional neural networks | |
US11222211B2 (en) | Method and apparatus for segmenting video object, electronic device, and storage medium | |
CN110765860B (en) | Tumble judging method, tumble judging device, computer equipment and storage medium | |
US10339421B2 (en) | RGB-D scene labeling with multimodal recurrent neural networks | |
CN108921225B (en) | Image processing method and device, computer equipment and storage medium | |
CN109033107B (en) | Image retrieval method and apparatus, computer device, and storage medium | |
CN113066017B (en) | Image enhancement method, model training method and equipment | |
CN112487207A (en) | Image multi-label classification method and device, computer equipment and storage medium | |
CN114549913B (en) | Semantic segmentation method and device, computer equipment and storage medium | |
CN112132145B (en) | Image classification method and system based on model extended convolutional neural network | |
JP2010157118A (en) | Pattern identification device and learning method for the same and computer program | |
WO2022072199A1 (en) | Sparse optical flow estimation | |
CN111079864A (en) | Short video classification method and system based on optimized video key frame extraction | |
WO2024041108A1 (en) | Image correction model training method and apparatus, image correction method and apparatus, and computer device | |
CN112084371B (en) | Movie multi-label classification method and device, electronic equipment and storage medium | |
CN114494699B (en) | Image semantic segmentation method and system based on semantic propagation and front background perception | |
Aldhaheri et al. | MACC Net: Multi-task attention crowd counting network | |
CN111027472A (en) | Video identification method based on fusion of video optical flow and image space feature weight | |
CN113869234A (en) | Facial expression recognition method, device, equipment and storage medium | |
CN111079900B (en) | Image processing method and device based on self-adaptive connection neural network | |
CN111126177B (en) | Method and device for counting number of people |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |