CN112084371B - Movie multi-label classification method and device, electronic equipment and storage medium

Movie multi-label classification method and device, electronic equipment and storage medium

Info

Publication number
CN112084371B
CN112084371B (application CN202010708014.4A)
Authority
CN
China
Prior art keywords
video
video frame
frame sequence
attention
hidden
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010708014.4A
Other languages
Chinese (zh)
Other versions
CN112084371A (en)
Inventor
吕子钰
禹一童
杨敏
李成明
姜青山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010708014.4A
Publication of CN112084371A
Application granted
Publication of CN112084371B
Legal status: Active
Anticipated expiration


Classifications

    • G06F16/75 Information retrieval of video data; Clustering; Classification
    • G06F18/24 Pattern recognition; Classification techniques
    • G06N3/044 Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/08 Neural networks; Learning methods


Abstract

The application discloses a movie multi-label classification method, a movie multi-label classification device, an electronic device and a computer readable storage medium. The movie multi-label classification method comprises the following steps: acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips; acquiring video clip features of the video frame sequence based on a preset neural network model; calculating an attention matrix based on the video clip features; and traversing the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme can raise the attention paid to important content in the video and improve the accuracy of multi-label classification.

Description

Movie multi-label classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision applications, and in particular, to a movie multi-label classification method, a movie multi-label classification apparatus, an electronic device, and a storage medium.
Background
Category labels for movies (e.g., war, comedy, animation) are highly condensed information about a movie's content: they are not only an important criterion when people choose movies but also a basis for constructing movie databases. However, as the movie industry develops, the variety of movie categories keeps growing. Building an efficient movie label classification system to update the labels of older films therefore has significant practical value.
Currently, existing movie classification algorithms are mainly based either on movie trailers or on posters. The poster-based methods are limited because movie posters vary widely and may not fully contain a movie's category information, so their prediction accuracy is constrained. The main problems of trailer-based movie classification methods are as follows:
(1) They assume that a movie belongs to only a single category;
(2) They classify using only low-level visual features of the trailer;
(3) They do not distinguish video frames that follow a fixed pattern and carry no useful classification features (e.g., the opening and the ending) from other video frames, which can mislead classification.
In short, current classification methods neither model the temporal information in the video effectively nor avoid picking invalid frames (e.g., the beginning and end of a film) when selecting key frames.
Disclosure of Invention
The application provides at least a method, a device, an electronic device and a computer readable storage medium for classifying multiple labels of films.
The first aspect of the present application provides a method for classifying multiple tags of a movie, where the method for classifying multiple tags of a movie includes:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips;
acquiring video fragment characteristics of the video frame sequence based on a preset neural network model;
calculating an attention matrix based on the video clip features;
traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence.
Wherein the step of traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video clip features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
and outputting the label category of the video frame sequence in the film category space.
Wherein the step of calculating an attention matrix based on the video clip features comprises:
calculating a forward hidden state and a backward hidden state of the video clip features based on a BiLSTM;
and calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism.
The step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism comprises the following steps:
calculating the hidden elements of the forward hidden state and the backward hidden state of each video clip feature;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on all hidden elements of the video clip features and the hidden layer node number.
The step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features includes:
acquiring a hidden element set of all video segment characteristics;
normalizing the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
Wherein after the step of outputting the tag class of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the tag class based on the cross entropy loss function;
the output layer of the neural network model is a full connection layer fc7.
Wherein after the step of obtaining a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the difference values of adjacent frames in each gray level in the continuous video frame sequence;
if the accumulated sum is greater than a preset threshold value, superimposing the accumulated sum on the color histogram of the next frame of the adjacent frames;
dividing the superimposed video frame sequence into a plurality of video segments in temporal order, and extracting a preset number of video frames from each video segment to form a new video segment sequence.
A second aspect of the present application provides a movie multi-tag classification apparatus, the movie multi-tag classification apparatus comprising:
the acquisition module is used for acquiring a continuous video frame sequence;
the feature extraction module is used for acquiring video fragment features of the video frame sequence based on a preset neural network model;
an attention calculating module for calculating an attention matrix based on the video clip features;
and the label classification module is used for traversing the video frame sequence according to the attention matrix so as to output the label class of the video frame sequence.
A third aspect of the present application provides an electronic device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory, so as to implement the method for classifying multiple tags of a movie in the first aspect.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the method for multi-label classification of movies of the first aspect described above.
According to the scheme, the movie multi-label classification device acquires a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips; acquiring video fragment characteristics of a video frame sequence based on a preset neural network model; calculating an attention matrix based on the video clip features; traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence. According to the scheme, the attention degree of important contents in the video can be improved, and the accuracy of multi-label classification is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.
FIG. 1 is a flowchart illustrating an embodiment of a method for classifying multiple tags of a movie according to the present application;
FIG. 2 is a schematic diagram of a framework of a movie multi-label classification model provided herein;
FIG. 3 is a schematic flowchart of step S103 in the multi-label classification method of the movie shown in FIG. 1;
FIG. 4 is a flowchart illustrating a step S104 in the multi-label classification method of the movie shown in FIG. 1;
FIG. 5 is a schematic diagram of an embodiment of a multi-label film sorting apparatus according to the present application;
FIG. 6 is a schematic diagram of a frame of an embodiment of an electronic device provided herein;
FIG. 7 is a schematic diagram of a framework of one embodiment of a computer readable storage medium provided herein.
Detailed Description
The following describes the embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Referring to fig. 1 and fig. 2, fig. 1 is a flowchart illustrating an embodiment of a method for classifying multiple labels of a movie according to the present application, and fig. 2 is a schematic diagram illustrating a framework of a model for classifying multiple labels of a movie according to the present application. The movie multi-label classification method provided by the application can be applied to classifying various different types of labels for movie feature films or movie trailers, so that a viewer can know basic information of a movie conveniently.
The main execution body of the movie multi-tag classification method of the present application may be a movie multi-tag classification apparatus, for example, the movie multi-tag classification method may be executed by a terminal device or a server or other processing device, where the movie multi-tag classification apparatus may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a wireless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the movie multi-label classification method may be implemented by a processor invoking computer readable instructions stored in a memory.
Specifically, the movie multi-label classification method according to the embodiment of the present disclosure may include the steps of:
step S101: a sequence of consecutive video frames is acquired, wherein the sequence of video frames comprises a number of video segments.
Wherein, the classification device obtains a continuous video frame sequence, which may be part or all of a movie trailer or feature film. Before segment generation, the classification device may pre-process the continuous video frame sequence so that the raw video frames conform to the input requirements of the subsequent neural network model.
The preprocessing flow effectively reduces the risk of network overfitting and removes noise in the raw data that is irrelevant to category information. For example, black borders may be used in a video to pad around the image so as to keep the aspect ratio and size of the video; however, such black borders not only contribute nothing to the classification result, but may also be mistaken by the neural network model for useful information and thus affect the prediction.
Taking a C3D (Convolutional 3D) network as an example, the standard input is a 4-dimensional tensor (number of channels × number of frames × frame height × frame width). For a given video frame sequence U = {u_c}, c ∈ {1, 2, ...}, the preprocessing stage mainly operates on the height and width of each video frame, whose original values are Height and Width respectively.
The specific preprocessing flow is as follows. First, the classification device removes the black borders in the images and resizes the video frames to a preset size while keeping the aspect ratio of the original images; for example, it may scale the frames to 196 (frame width) × 128 (frame height). Then, during training, the classification device feeds random jittered crops of 112 (frame width) × 112 (frame height) as input to improve the robustness of the system, yielding the preprocessed video frame sequence.
For the embodiment of the present disclosure, after preprocessing, the classification device calculates, for each video frame in the sequence, the difference between its color histogram and that of the next video frame, accumulated over every gray level. The difference is calculated as:
D = Σ_{j=1}^{n} | H_e(j) − H_{e+1}(j) |
where H_e(j) and H_{e+1}(j) are the values of the j-th gray level in the color histograms of the e-th and (e+1)-th video frames respectively, and n is the number of gray levels in the color histogram.
When the difference D is larger than the given preset threshold, a shot change is considered to occur at that point in the video frame sequence, and the classification device superimposes the difference D on the color histogram of the next video frame. After shot detection, the classification device obtains a new video frame sequence divided into video segments, where the subscript t denotes the t-th segment, k indicates that the video frame sequence consists of k segments in total, the superscript r denotes the r-th video frame within a video segment, and m_t indicates that the t-th video segment contains m_t video frames in total.
Then, in order to meet the input requirements of the subsequent neural network model, for example the C3D network, and referring to the Candidate Clip Generation part of fig. 2, the classification device further extracts 16 video frames from each video segment, either in a specified order or randomly, to form a new video segment. For example, for a given video segment the classification device samples frames equidistantly at an extraction interval δ determined by the segment length and frame_rate, the frame rate of the current video, so that the new video segments form a new video frame sequence F = {f_t(j)}, t ∈ {1, 2, ..., k}, j ∈ {1, 1+δ, ..., 1+15δ}.
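As a concrete illustration of the shot detection and candidate clip generation described above, the following Python sketch computes the gray-level histogram difference D between adjacent frames, marks a shot boundary where D exceeds a preset threshold, and samples 16 frames equidistantly from each resulting segment. It assumes OpenCV and NumPy are available; the function names, the threshold value and the sampling helper are illustrative and not taken from the patent.

```python
import cv2
import numpy as np

def gray_histogram(frame, n_levels=256):
    """Histogram of one frame over n gray levels (H_e in the notation above)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [n_levels], [0, 256])
    return hist.flatten()

def detect_shot_boundaries(frames, threshold):
    """Mark a shot change where D = sum_j |H_e(j) - H_{e+1}(j)| exceeds the threshold."""
    boundaries = [0]
    for e in range(len(frames) - 1):
        d = np.abs(gray_histogram(frames[e]) - gray_histogram(frames[e + 1])).sum()
        if d > threshold:
            boundaries.append(e + 1)
    boundaries.append(len(frames))
    return boundaries

def sample_clip(frames, start, end, num_frames=16):
    """Equidistantly pick num_frames frames from the segment [start, end)."""
    idx = np.linspace(start, end - 1, num_frames).astype(int)
    return [frames[i] for i in idx]

def candidate_clips(frames, threshold=1.5e5, num_frames=16):
    """Split the frame sequence at shot boundaries and sample one 16-frame clip per segment."""
    bounds = detect_shot_boundaries(frames, threshold)
    return [sample_clip(frames, s, e, num_frames)
            for s, e in zip(bounds[:-1], bounds[1:]) if e - s >= num_frames]
```

The threshold here is arbitrary; the patent only states that the accumulated histogram difference is compared against a preset threshold, and the equidistant sampling with np.linspace stands in for the extraction interval δ computed from the frame rate.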
Step S102: and acquiring video segment characteristics of the video frame sequence based on a preset neural network model.
The classification device inputs the video frame sequence into a preset neural network model, taking a C3D network as an example, so as to acquire the video segment characteristics of the video frame sequence.
In the embodiment of the disclosure, the classification device extracts the video clip features in the video frame sequence through the C3D network: x_t = f(f_t(1) : f_t(1+15δ)); see, in particular, the Spatio-temporal Descriptor section of fig. 2. With large-scale supervised training datasets, C3D networks have achieved good performance on many video analysis tasks and can successfully learn and model spatio-temporal features in video. However, one problem with applying a C3D network directly is that the task dataset here lacks the action annotation data needed for the C3D network to learn dynamic characteristics. This problem can be effectively addressed by pre-training, which is widely used in computer vision, clearly improves downstream performance, and has in recent years also been applied successfully in natural language processing.
Generally, pre-training sets up a training task, obtains trained model parameters, and loads those parameters into the C3D network of the embodiments of the present disclosure, thereby initializing the model weights of the C3D network.
There are mainly two ways of loading the model weights: in the first, the loaded parameters remain unchanged while training on the task of the disclosed embodiments, referred to as the "Frozen" approach; in the second, the model weights of the C3D network, although initialized from the pre-trained parameters, continue to change during training on the task, referred to as the "Fine-Tuning" approach.
It should be noted that, in the embodiment of the present disclosure, the classification device initializes the model weights of the C3D network in a "Frozen" manner.
Specifically, the classification device pre-trains the C3D network on the Sports-1M dataset, applies the trained C3D network to the task dataset of the embodiment of the disclosure, and takes the output of the penultimate fully connected layer, fc7, as the final output of the network.
It should be noted that directly using the final output of the C3D network would yield features tied to the pre-training task rather than general ones; to preserve feature generality, the embodiment of the disclosure therefore selects the output of fc7 as the feature vector of each video clip. Removing the layers after fc7 also strengthens the transfer capability of the C3D network and meets the multi-label classification requirement of this application.
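A minimal PyTorch-style sketch of this "Frozen" feature-extraction step is given below. The layer names and sizes only loosely follow common open-source C3D implementations and are not taken from the patent; the convolutional stack is heavily simplified, and loading of real Sports-1M weights is assumed to happen elsewhere.

```python
import torch
import torch.nn as nn

class C3DBackbone(nn.Module):
    """Truncated C3D: the layers after fc7 are removed so fc7 provides the clip feature."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        # Simplified stand-in for the conv1a..conv5b stack of C3D.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d((1, 4, 4)),
        )
        self.fc6 = nn.Linear(128 * 4 * 4, feat_dim)
        self.fc7 = nn.Linear(feat_dim, feat_dim)  # fc7 output is used as the feature vector

    def forward(self, clip):  # clip: (batch, 3 channels, 16 frames, 112, 112)
        h = self.conv(clip).flatten(1)
        return torch.relu(self.fc7(torch.relu(self.fc6(h))))

def extract_clip_features(backbone, clips):
    """'Frozen' mode: pre-trained weights are loaded once and never updated."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False
    with torch.no_grad():
        # One D-dimensional feature x_t per 16-frame clip, stacked into X of shape (k, D).
        return torch.stack([backbone(c.unsqueeze(0)).squeeze(0) for c in clips])
```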
Step S103: an attention matrix is calculated based on the video clip features.
The classifying device calculates the Attention matrix according to the obtained video clip features, and the detailed process is shown in the Attention-based Sequential Module (Attention-based serialization module) section of fig. 2 and fig. 3, and fig. 3 is a specific flowchart of step S103 in the multi-label classification method of the movie shown in fig. 1. Specifically, the method comprises the following steps:
step S201: and calculating the hidden elements of the forward hidden state and the backward hidden state of each video segment characteristic.
Wherein, the classification device extracts the video clip features X = (x_1, x_2, ..., x_k) ∈ R^{k×D}, after which the dependencies between the video clip features still need to be modeled. In the embodiment of the disclosure, the classification device uses a BiLSTM to process the video clip features acquired in the above steps.
step S202: and obtaining the number of hidden nodes of the BiLSTM.
The classification device then concatenates the forward hidden state and the backward hidden state of each video clip feature to obtain h_t. With the number of hidden nodes of the LSTM set to u, H ∈ R^{k×2u} denotes the set of all hidden states:
H = (h_1, h_2, ..., h_k)
where each element h_i of the hidden state set describes the overall information around the i-th video clip in the video frame sequence.
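The following PyTorch sketch shows one way to realize this step: the k clip features X ∈ R^{k×D} are fed through a bidirectional LSTM with u hidden nodes per direction, and the forward and backward hidden states are concatenated into H ∈ R^{k×2u}. Names and sizes are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class ClipSequenceEncoder(nn.Module):
    """BiLSTM over the sequence of clip features."""
    def __init__(self, feat_dim=4096, hidden_size=256):  # hidden_size plays the role of u
        super().__init__()
        # bidirectional=True concatenates forward and backward states, so outputs have size 2u.
        self.bilstm = nn.LSTM(feat_dim, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, x):       # x: (batch, k, D) clip features
        h, _ = self.bilstm(x)   # h: (batch, k, 2u); h[:, i] summarizes the context around clip i
        return h
```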
Step S203: an attention matrix is derived based on the number of hidden elements and hidden nodes of all video features.
Whereas an LSTM generally gives the same attention to content at every position, the embodiments of the present disclosure want the network to focus on the important content in the video. To achieve this, the disclosed embodiment adds a self-attention mechanism after the LSTM, whose input is the hidden state set H and whose output is an attention matrix V:
V = softmax(W_b · tanh(W_a · H^T))
where W_a ∈ R^{D_a×2u} and W_b ∈ R^{D_b×D_a} are two coefficient weight matrices, D_a and D_b are hyperparameters, and the resulting V has shape D_b × k. Because of the normalization by the softmax function, each dimension of a row vector of V can be regarded as the attention paid to the corresponding position in the video, while each row vector is a representation of one particular kind of content in the video. Since a movie often carries multiple different category labels, which are typically embodied by different content in the video, and the same category may also be expressed by different content, the embodiments of the present disclosure treat D_a and D_b as hyperparameters so that the network can learn the different content parts of the video.
Further, after the classification device obtains the attention matrix of the video, a corresponding video feature matrix B needs to be obtained, calculated as follows:
B=VH
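Continuing the same sketch, the structured self-attention step V = softmax(W_b · tanh(W_a · H^T)) and the weighted video feature matrix B = V·H can be written as below; D_a and D_b are the hyperparameters discussed above, and the concrete values are only placeholders.

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    def __init__(self, hidden_size=256, d_a=128, d_b=8):
        super().__init__()
        self.W_a = nn.Linear(2 * hidden_size, d_a, bias=False)  # maps 2u -> D_a
        self.W_b = nn.Linear(d_a, d_b, bias=False)               # maps D_a -> D_b

    def forward(self, h):  # h: (batch, k, 2u) BiLSTM hidden states
        scores = self.W_b(torch.tanh(self.W_a(h)))          # (batch, k, D_b)
        v = torch.softmax(scores.transpose(1, 2), dim=-1)    # V: (batch, D_b, k), rows sum to 1 over clips
        b = torch.bmm(v, h)                                   # B = V·H: (batch, D_b, 2u)
        return v, b
```

Each of the D_b rows of V attends to a different kind of content across the k clips, which matches the motivation for treating D_a and D_b as tunable hyperparameters.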
step S104: traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence.
The classifying device extracts video features in the video frame sequence based on the attention matrix and the video feature matrix obtained in the steps, and outputs a plurality of different label categories of the video frame sequence according to the video features. Referring to fig. 4, fig. 4 is a schematic flow chart of step S104 in the method for classifying multiple labels of a movie shown in fig. 1. Specifically, the method comprises the following steps:
step S301: and acquiring a corresponding video feature matrix based on the attention matrix and the video clip features.
Step S302: and forming a two-layer perceptron through the attention matrix and the video feature matrix.
The classification device superimposes the attention matrix and the video feature matrix in sequence to form a two-layer perceptron.
Step S303: and converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron.
Wherein, the classification device uses the two-layer perceptron to convert the original space where the video frame sequence is located into a film category space.
Step S304: the tag class of the sequence of video frames is output in the movie class space.
The classification device extracts video features of the video frame sequences in a film category space, and outputs a plurality of tag categories corresponding to the video frame sequences according to the categories of the video features.
After outputting the label categories of the video frame sequence, the classification device further constructs a cross entropy loss function L for the C3D network according to the multi-label learning task.
the classification device may evaluate the score of the multi-tag learning task, that is, the accuracy of the output tag class, according to the cross entropy loss function L, and may optimize the C3D network of the embodiments of the disclosure according to the cross entropy loss function L.
In this embodiment, the classification device acquires a continuous video frame sequence, where the video frame sequence includes a plurality of video clips; acquiring video fragment characteristics of a video frame sequence based on a preset neural network model; calculating an attention matrix based on the video clip features; traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence. According to the scheme, the attention degree of important contents in the video can be improved, and the accuracy of multi-label classification is improved.
In particular, compared with current movie classification methods, the movie classification method of the present application has the following advantages: (1) It uses a C3D network to extract low-level features, effectively preserving the temporal information in the video; (2) It introduces an attention mechanism: the response at a position in the video frame sequence is computed by attending to all positions and taking their weighted average in an embedding space, which on the one hand raises the attention paid to important content and on the other hand reduces the influence of invalid segments (such as the opening and ending) on the classification result; (3) Considering that a movie often belongs to multiple categories, the movie classification task is extended into a multi-label learning task.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
With continued reference to fig. 5, fig. 5 is a schematic diagram of a frame of an embodiment of a multi-label classification apparatus for movies provided in the present application. The movie multi-label classification device 50 includes:
an acquisition module 51 for acquiring a sequence of consecutive video frames.
The feature extraction module 52 is configured to obtain a video clip feature of the video frame sequence based on a preset neural network model.
An attention calculation module 53 for calculating an attention matrix based on the video clip features.
The tag classification module 54 is configured to traverse the video frame sequence according to the attention matrix to output a tag class of the video frame sequence.
Referring to fig. 6, fig. 6 is a schematic frame diagram of an embodiment of an electronic device provided in the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the processor 62 being adapted to execute program instructions stored in the memory 61 to implement the steps of any of the movie multi-label classification method embodiments described above. In one particular implementation scenario, the electronic device 60 may include, but is not limited to, a microcomputer and a server; the electronic device 60 may also be a mobile device such as a notebook computer or a tablet computer, which is not limited herein.
In particular, the processor 62 is adapted to control itself and the memory 61 to implement the steps in any of the movie multi-label classification method embodiments described above. The processor 62 may also be referred to as a CPU (Central Processing Unit ). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a Field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 62 may be commonly implemented by an integrated circuit chip.
Referring to fig. 7, fig. 7 is a schematic diagram of a frame of an embodiment of a computer readable storage medium provided in the present application. The computer readable storage medium 70 stores program instructions 701 capable of being executed by a processor, the program instructions 701 for implementing the steps in any of the movie multi-label classification method embodiments described above.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

Claims (6)

1. A movie multi-label classification method, the method comprising:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips;
acquiring video fragment characteristics of the video frame sequence based on a preset neural network model;
calculating an attention matrix based on the video clip features;
traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence;
after the step of obtaining a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the difference values of adjacent frames in each gray level in the continuous video frame sequence;
if the accumulated sum is greater than a preset threshold value, superimposing the accumulated sum on the color histogram of the next frame of the adjacent frames;
dividing the superimposed video frame sequence into a plurality of video segments in temporal order, and extracting a preset number of video frames from each video segment to form a new video segment sequence;
the step of traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video clip features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a tag class of the video frame sequence in the movie class space;
the step of calculating an attention matrix based on the video clip features includes:
calculating a forward hidden state and a backward hidden state of the video clip features based on BiLSTM;
calculating the attention matrix of the forward hidden states and the backward hidden states of all the video clip features by adopting a self-attention mechanism;
the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism comprises the following steps:
calculating the hidden elements of the forward hidden state and the backward hidden state of each video clip feature;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on all hidden elements of the video clip features and the hidden layer node number.
2. The movie multi-label classification method of claim 1, wherein
the step of acquiring the corresponding video feature matrix based on the attention matrix and the video clip features includes:
acquiring a hidden element set of all video segment characteristics;
normalizing the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
3. The movie multi-label classification method of claim 1, wherein
after the step of outputting the tag class of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the tag class based on the cross entropy loss function;
the output layer of the neural network model is a full connection layer fc7.
4. A movie multi-label classification device, characterized in that the movie multi-label classification device comprises:
the acquisition module is used for acquiring a continuous video frame sequence;
the acquisition module is further used for calculating the accumulated sum of the difference values of adjacent frames in each gray level in the continuous video frame sequence; if the accumulated sum is greater than a preset threshold value, superimposing the accumulated sum on the color histogram of the next frame of the adjacent frames; and dividing the superimposed video frame sequence into a plurality of video segments in temporal order and extracting a preset number of video frames from each video segment to form a new video segment sequence;
the feature extraction module is used for acquiring video fragment features of the video frame sequence based on a preset neural network model;
an attention calculating module for calculating an attention matrix based on the video clip features;
the tag classification module is used for traversing the video frame sequence according to the attention matrix so as to output tag types of the video frame sequence;
the tag classification module is further configured to obtain a corresponding video feature matrix based on the attention matrix and the video segment features; forming a two-layer perceptron through the attention matrix and the video feature matrix; converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron; outputting a tag class of the video frame sequence in the movie class space;
the label classification module is also used for calculating the forward hiding state and the backward hiding state of the video clip features based on BiLSTM; calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism;
the tag classification module is further used for calculating a front hidden state and a hidden element of a rear hidden state of each video segment characteristic; acquiring the number of hidden layer nodes of the BiLSTM; and obtaining the attention matrix based on all hidden elements of the video clip features and the hidden layer node number.
5. An electronic device comprising a memory and a processor coupled to each other, the processor configured to execute program instructions stored in the memory to implement the method of multi-label classification of movies of any one of claims 1 to 3.
6. A computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the movie multi-label classification method of any one of claims 1 to 3.
CN202010708014.4A 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium Active CN112084371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010708014.4A CN112084371B (en) 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010708014.4A CN112084371B (en) 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112084371A CN112084371A (en) 2020-12-15
CN112084371B (en) 2024-04-16

Family

ID=73735152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010708014.4A Active CN112084371B (en) 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112084371B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515997B (en) * 2020-12-28 2024-01-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN114329060A (en) * 2021-12-24 2022-04-12 空间视创(重庆)科技股份有限公司 Method and system for automatically generating multiple labels of video frame based on neural network model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763326A (en) * 2018-05-04 2018-11-06 南京邮电大学 A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based
CN110209823A (en) * 2019-06-12 2019-09-06 齐鲁工业大学 A kind of multi-tag file classification method and system
CN110516086A (en) * 2019-07-12 2019-11-29 浙江工业大学 One kind being based on deep neural network video display label automatic obtaining method
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830709B2 (en) * 2016-03-11 2017-11-28 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks
US11418476B2 (en) * 2018-06-07 2022-08-16 Arizona Board Of Regents On Behalf Of Arizona State University Method and apparatus for detecting fake news in a social media network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763326A (en) * 2018-05-04 2018-11-06 南京邮电大学 A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based
CN110209823A (en) * 2019-06-12 2019-09-06 齐鲁工业大学 A kind of multi-tag file classification method and system
CN110516086A (en) * 2019-07-12 2019-11-29 浙江工业大学 One kind being based on deep neural network video display label automatic obtaining method
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
桑海峰 et al. "Design of a video action recognition network based on recurrent region attention and video frame attention". 电子学报 (Acta Electronica Sinica), 2020, full text. *

Also Published As

Publication number Publication date
CN112084371A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
Bouwmans et al. Scene background initialization: A taxonomy
CN110765860B (en) Tumble judging method, tumble judging device, computer equipment and storage medium
CN111444878B (en) Video classification method, device and computer readable storage medium
CN109754015B (en) Neural networks for drawing multi-label recognition and related methods, media and devices
CN112597941B (en) Face recognition method and device and electronic equipment
CN111738357B (en) Junk picture identification method, device and equipment
CN113379627B (en) Training method of image enhancement model and method for enhancing image
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
CN112966646A (en) Video segmentation method, device, equipment and medium based on two-way model fusion
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN112614110B (en) Method and device for evaluating image quality and terminal equipment
CN111539289A (en) Method and device for identifying action in video, electronic equipment and storage medium
CN110958469A (en) Video processing method and device, electronic equipment and storage medium
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
WO2024041108A1 (en) Image correction model training method and apparatus, image correction method and apparatus, and computer device
Zhao et al. Gradient-based conditional generative adversarial network for non-uniform blind deblurring via DenseResNet
JP2024508867A (en) Image clustering method, device, computer equipment and computer program
CN112102200A (en) Image completion model initialization method, training method and image completion method
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN111027472A (en) Video identification method based on fusion of video optical flow and image space feature weight
CN110135428A (en) Image segmentation processing method and device
CN116682141A (en) Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception
CN116110005A (en) Crowd behavior attribute counting method, system and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant