CN112084371B - Movie multi-label classification method and device, electronic equipment and storage medium
- Publication number: CN112084371B (application number CN202010708014.4A)
- Authority: CN (China)
- Prior art keywords: video, video frame, frame sequence, attention, hidden
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/75: Information retrieval of video data; Clustering; Classification
- G06F18/24: Pattern recognition; Classification techniques
- G06N3/044: Neural networks; Recurrent networks, e.g. Hopfield networks
- G06N3/045: Neural networks; Combinations of networks
- G06N3/08: Neural networks; Learning methods
Abstract
The application discloses a movie multi-label classification method and device, an electronic device, and a computer readable storage medium. The movie multi-label classification method comprises the following steps: acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips; acquiring video clip features of the video frame sequence based on a preset neural network model; calculating an attention matrix based on the video clip features; and traversing the video frame sequence according to the attention matrix to output the tag classes of the video frame sequence. With this scheme, the attention paid to important content in the video can be increased, and the accuracy of multi-label classification can be improved.
Description
Technical Field
The present disclosure relates to the field of computer vision applications, and in particular, to a movie multi-label classification method, a movie multi-label classification device, an electronic device, and a storage medium.
Background
Category labels of movies (e.g. war, comedy, animation) are highly condensed descriptions of a movie's content; they are not only an important criterion for people choosing movies but also a basis for building movie databases. As the movie industry develops, the number of movie categories keeps growing. Building an efficient movie label classification system to update the labels of older films therefore has very significant practical and application value.
Currently, existing movie classification algorithms are mainly based either on movie trailers or on posters. Poster-based methods are limited because movie posters vary widely and may not fully convey a movie's category information, so their prediction accuracy is limited. The main problems of trailer-based movie classification methods are as follows:
(1) A movie is assumed to belong to only a single category;
(2) Classification relies only on low-level visual features in the movie trailer;
(3) Video frames that follow a fixed pattern and contain no useful classification features (e.g., the opening and closing of the trailer) are not distinguished from other video frames, which can mislead the classification.
In addition, current classification methods neither model the temporal information in the video effectively nor avoid picking invalid frames (e.g., the beginning and end of a film) when selecting key frames.
Disclosure of Invention
The application provides at least a method, a device, an electronic device and a computer readable storage medium for classifying multiple labels of films.
The first aspect of the present application provides a method for classifying multiple tags of a movie, where the method for classifying multiple tags of a movie includes:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips;
acquiring video fragment characteristics of the video frame sequence based on a preset neural network model;
calculating an attention matrix based on the video clip features;
traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence.
Wherein the step of traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video clip features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
and outputting the label category of the video frame sequence in the film category space.
Wherein the step of calculating an attention matrix based on the video clip features comprises:
calculating a forward hiding state and a backward hiding state of the video clip features based on BiLSTM;
and calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism.
The step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism comprises the following steps:
calculating the hidden elements of the forward hidden state and the backward hidden state of each video clip feature;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on all hidden elements of the video clip features and the hidden layer node number.
The step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features includes:
acquiring a hidden element set of all video segment characteristics;
normalizing the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
Wherein after the step of outputting the tag class of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the tag class based on the cross entropy loss function;
the output layer of the neural network model is a full connection layer fc7.
Wherein after the step of obtaining a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the difference values of adjacent frames in each gray level in the continuous video frame sequence;
if the accumulated sum is greater than a preset threshold value, the accumulated sum is overlapped on a color histogram of a video frame of a next frame in the adjacent frames;
dividing the superimposed video frame sequence into a plurality of video segments according to time sequence, and extracting a preset frame number video frame from each video segment to form a new video segment sequence.
A second aspect of the present application provides a movie multi-tag classification apparatus, the movie multi-tag classification apparatus comprising:
the acquisition module is used for acquiring a continuous video frame sequence;
the feature extraction module is used for acquiring video fragment features of the video frame sequence based on a preset neural network model;
an attention calculating module for calculating an attention matrix based on the video clip features;
and the label classification module is used for traversing the video frame sequence according to the attention matrix so as to output the label class of the video frame sequence.
A third aspect of the present application provides an electronic device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory, so as to implement the method for classifying multiple tags of a movie in the first aspect.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the method for multi-label classification of movies of the first aspect described above.
According to the scheme, the movie multi-label classification device acquires a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips; acquiring video fragment characteristics of a video frame sequence based on a preset neural network model; calculating an attention matrix based on the video clip features; traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence. According to the scheme, the attention degree of important contents in the video can be improved, and the accuracy of multi-label classification is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.
FIG. 1 is a flowchart illustrating an embodiment of a method for classifying multiple tags of a movie according to the present application;
FIG. 2 is a schematic diagram of a framework of a movie multi-label classification model provided herein;
FIG. 3 is a schematic flowchart of step S103 in the multi-label classification method of the movie shown in FIG. 1;
FIG. 4 is a flowchart illustrating a step S104 in the multi-label classification method of the movie shown in FIG. 1;
FIG. 5 is a schematic diagram of an embodiment of a multi-label film sorting apparatus according to the present application;
FIG. 6 is a schematic diagram of a frame of an embodiment of an electronic device provided herein;
FIG. 7 is a schematic diagram of a framework of one embodiment of a computer readable storage medium provided herein.
Detailed Description
The following describes the embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Referring to fig. 1 and fig. 2, fig. 1 is a flowchart illustrating an embodiment of a method for classifying multiple labels of a movie according to the present application, and fig. 2 is a schematic diagram illustrating a framework of a model for classifying multiple labels of a movie according to the present application. The movie multi-label classification method provided by the application can be applied to classifying various different types of labels for movie feature films or movie trailers, so that a viewer can know basic information of a movie conveniently.
The main execution body of the movie multi-tag classification method of the present application may be a movie multi-tag classification apparatus, for example, the movie multi-tag classification method may be executed by a terminal device or a server or other processing device, where the movie multi-tag classification apparatus may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a wireless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the movie multi-label classification method may be implemented by a processor invoking computer readable instructions stored in a memory.
Specifically, the movie multi-label classification method according to the embodiment of the present disclosure may include the steps of:
step S101: a sequence of consecutive video frames is acquired, wherein the sequence of video frames comprises a number of video segments.
The classification device acquires a continuous video frame sequence, which may be part or all of a movie trailer or feature film. Before generating segments from the continuous video frame sequence, the classification device may pre-process it so that the raw video frame sequence conforms to the input requirements of the subsequent neural network model.
The preprocessing flow can effectively reduce the risk of network overfitting and eliminate noise in the raw data that is irrelevant to category information. For example, black borders may be used in a video to pad around the image so as to keep the aspect ratio and frame size; however, these black borders are not only unhelpful to the classification result but may also be mistaken by the neural network model for useful information, thereby affecting the prediction result.
Taking a C3D (Convolutional 3D) network as an example, the standard input is a 4-dimensional matrix (number of channels x number of frames x frame height x frame width). For a given video frame sequence U = {u_e}, e ∈ {1, 2, ...}, the preprocessing stage mainly operates on the frame height and frame width of each video frame, whose original values are Height and Width respectively.
The specific preprocessing flow is as follows. First, the classification device removes the black borders in the images and resizes the video frames to a preset size while keeping the aspect ratio of the original images; for example, the classification device may resize the video frames to 196 (frame width) x 128 (frame height). Then, during training, the classification device feeds in jittered random crops of 112 (frame width) x 112 (frame height) to improve the robustness of the system.
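A minimal preprocessing sketch is given below, assuming NumPy/OpenCV; the helper names, the black-border threshold, and the exact jitter policy are assumptions for illustration, and only the target sizes (196x128 resize, 112x112 random crop during training) come from the description above.

```python
import numpy as np
import cv2

def crop_black_borders(frame: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Drop rows/columns that are (almost) entirely black padding."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    rows = np.where(gray.max(axis=1) > thresh)[0]
    cols = np.where(gray.max(axis=0) > thresh)[0]
    if rows.size == 0 or cols.size == 0:          # frame is entirely black: keep as-is
        return frame
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def preprocess_frame(frame: np.ndarray, train: bool = True) -> np.ndarray:
    frame = crop_black_borders(frame)
    frame = cv2.resize(frame, (196, 128))         # target size: 196 (width) x 128 (height)
    if train:                                     # jittered random 112x112 crop for robustness
        y = np.random.randint(0, 128 - 112 + 1)
        x = np.random.randint(0, 196 - 112 + 1)
    else:                                         # deterministic center crop otherwise
        y, x = (128 - 112) // 2, (196 - 112) // 2
    return frame[y:y + 112, x:x + 112]
```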
For the embodiment of the present disclosure, denote the preprocessed video frame sequence by U′ = {u′_e}. The classification device calculates, for each video frame in the sequence, the difference between its color histogram and that of the next frame, accumulated over all gray levels. The difference is calculated as follows:

D = Σ_{j=1}^{n} | H_e(j) − H_{e+1}(j) |

where H_e(j) and H_{e+1}(j) are the color histogram values of video frames u′_e and u′_{e+1} at gray level j, and n is the number of gray levels in the color histogram.
When the difference D is larger than a given preset threshold, a shot change is considered to have occurred in the video frame sequence, and the classification device superimposes the difference D onto the color histogram of the next frame. After shot detection, the classification device obtains a new video frame sequence S = {s_t^r}, t ∈ {1, 2, ..., k}, r ∈ {1, 2, ..., m_t}, where the subscript t denotes the t-th segment, k means the video frame sequence consists of k segments, the superscript r denotes the r-th video frame within a segment, and m_t means the t-th video segment contains m_t video frames in total.
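The following sketch illustrates histogram-difference shot detection under the formula above; it simplifies the "superimpose onto the next frame's histogram" step into a plain cut at the shot boundary, so the accumulation policy and the threshold value are assumptions.

```python
import numpy as np

def gray_histogram(frame: np.ndarray, n_levels: int = 256) -> np.ndarray:
    """Histogram over gray levels for one frame (given as a 2-D grayscale image)."""
    hist, _ = np.histogram(frame.ravel(), bins=n_levels, range=(0, 256))
    return hist.astype(np.float64)

def split_into_shots(gray_frames, threshold: float, n_levels: int = 256):
    """Cut the frame sequence into segments wherever D = sum_j |H_e(j) - H_{e+1}(j)| > threshold."""
    shots, current = [], [gray_frames[0]]
    prev_hist = gray_histogram(gray_frames[0], n_levels)
    for frame in gray_frames[1:]:
        hist = gray_histogram(frame, n_levels)
        d = np.abs(hist - prev_hist).sum()        # accumulated per-gray-level difference D
        if d > threshold:                         # shot boundary: start a new segment
            shots.append(current)
            current = []
        current.append(frame)
        prev_hist = hist
    shots.append(current)
    return shots                                  # list of segments s_t, each a list of frames
```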
Then, in order to meet the input requirements of the subsequent neural network model (for example, the C3D network), referring specifically to the Candidate Clip Generation part of fig. 2, the classification device further extracts 16 video frames from each video segment, in a specified order or randomly, to form a new video clip. For example, for a given video segment s_t, the classification device performs equidistant extraction with an extraction interval δ that depends on frame_rate, the frame rate of the current video, so that the new video clips form a new video frame sequence F = {F_t(j)}, t ∈ {1, 2, ..., k}, j ∈ {1, 1+δ, ..., 1+15·δ}.
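A short sketch of the per-segment sampling follows; because the exact formula for δ is not reproduced here, a simple even spacing of 16 frames over each segment is assumed.

```python
import numpy as np

def sample_clip(shot_frames, n_frames: int = 16):
    """Pick n_frames frames at roughly equal intervals from one shot (indices repeat if the shot is short)."""
    idx = np.linspace(0, len(shot_frames) - 1, num=n_frames).astype(int)
    return [shot_frames[i] for i in idx]
```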
Step S102: and acquiring video segment characteristics of the video frame sequence based on a preset neural network model.
The classification device inputs the video frame sequence into a preset neural network model, taking a C3D network as an example, so as to acquire the video segment characteristics of the video frame sequence.
In an embodiment of the disclosure, the classification device extracts the video clip features of the video frame sequence through the C3D network: x_t = f(F_t(1) : F_t(1+15·δ)), t ∈ {1, 2, ..., k}; see in particular the Spatio-temporal Descriptor part of fig. 2. Given large-scale supervised training datasets, C3D networks have shown good performance on many video analysis tasks and can successfully learn and model spatio-temporal features in video. However, one problem with using a C3D network directly is that the task dataset lacks the action annotation data needed for the C3D network to learn dynamic features. This problem can be effectively addressed by pre-training, which is widely used in computer vision, can significantly improve application performance, and has also been applied successfully in natural language processing in recent years.
Generally, the pre-training process sets up a training task, obtains trained model parameters, and then loads these parameters onto the C3D network of the embodiments of the present disclosure, thereby initializing the model weights of the C3D network.
There are mainly two ways to load the model weights: in one, the loaded model parameters remain unchanged while training on the task of the disclosed embodiments, which is referred to as the "Frozen" approach; in the other, the model weights of the C3D network, although initialized from the pre-trained parameters, keep changing with the training process on the task of the disclosed embodiments, which is referred to as the "Fine-Tuning" approach.
It should be noted that, in the embodiment of the present disclosure, the classification device initializes the model weights of the C3D network in a "Frozen" manner.
Specifically, the classification device pre-trains the C3D network on the Sports-1M dataset, applies the trained C3D network to the task dataset of the embodiment of the disclosure, and takes the output of the second-to-last fully connected layer, namely fc7, of the C3D network as the final output value of the network.
It should be noted that, since directly applying the final output of the C3D network to the task has the problem that the extracted features are tied to the pre-training task rather than remaining general, and in order to maintain feature generality, the embodiment of the disclosure selects the output of fc7 as the feature vector of the video clip features. Removing the processing layers after fc7 helps enhance the transfer capability of the C3D network and meets the multi-label classification requirement of the present application.
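A minimal PyTorch sketch of the "Frozen" usage is given below; the C3D backbone itself (pre-trained on Sports-1M and truncated so that its forward pass ends at fc7) is assumed to be provided elsewhere, so only the freezing pattern is shown.

```python
import torch
import torch.nn as nn

class FrozenFc7Extractor(nn.Module):
    """Wraps a C3D-style backbone whose forward() already ends at the fc7 layer."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():   # "Frozen": pre-trained weights never update
            p.requires_grad = False

    @torch.no_grad()
    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, 16, 112, 112) -> clip features x_t of dimension D (fc7 output)
        return self.backbone(clips)
```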
Step S103: an attention matrix is calculated based on the video clip features.
The classifying device calculates the Attention matrix according to the obtained video clip features, and the detailed process is shown in the Attention-based Sequential Module (Attention-based serialization module) section of fig. 2 and fig. 3, and fig. 3 is a specific flowchart of step S103 in the multi-label classification method of the movie shown in fig. 1. Specifically, the method comprises the following steps:
step S201: and calculating the hidden elements of the forward hidden state and the backward hidden state of each video segment characteristic.
The classification device extracts the video clip features X = (x_1, x_2, ..., x_k) ∈ R^(k×D) from the video frame sequence, after which the dependencies between the video clip features still need to be modeled. In the embodiment of the disclosure, the classification device processes the video clip features obtained in the above steps with a BiLSTM, computing for each clip feature x_t a forward hidden state and a backward hidden state, e.g. of the form h_t^fwd = LSTM_fwd(x_t, h_{t-1}^fwd) and h_t^bwd = LSTM_bwd(x_t, h_{t+1}^bwd).
step S202: and obtaining the number of hidden nodes of the BiLSTM.
The classification device then concatenates the forward hidden state and the backward hidden state of each video clip feature to obtain h_t. With the number of hidden nodes of the LSTM set to u, let H ∈ R^(k×2u) denote the set of all hidden states:

H = (h_1, h_2, ..., h_k)

where each element h_i of the hidden state set describes the overall information around the i-th video clip in the video frame sequence.
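A minimal PyTorch sketch of this step follows; the sizes k, D and u are placeholders for illustration, and only the mapping from clip features X ∈ R^(k×D) to hidden states H ∈ R^(k×2u) follows the text.

```python
import torch
import torch.nn as nn

k, D, u = 20, 4096, 256                    # assumed sizes: k clips, feature dim D, hidden size u
bilstm = nn.LSTM(input_size=D, hidden_size=u, batch_first=True, bidirectional=True)

X = torch.randn(1, k, D)                   # one video: k clip features of dimension D
H, _ = bilstm(X)                           # H: (1, k, 2u), forward/backward states concatenated
```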
Step S203: an attention matrix is derived based on the number of hidden elements and hidden nodes of all video features.
Whereas an LSTM generally gives the same attention to the content at every position, the embodiments of the present disclosure want the network to focus only on the important content in the video. To achieve this, the disclosed embodiments add a self-attention mechanism after the LSTM, whose input is the hidden state set H and whose output is an attention matrix V:

V = softmax(W_b · tanh(W_a · H^T))

where W_a ∈ R^(D_a×2u) and W_b ∈ R^(D_b×D_a) are two coefficient weight matrices, D_a and D_b are hyperparameters, and the resulting V has shape D_b × k. Because of the normalization performed by the softmax function, each dimension of a row vector of V can be regarded as the attention paid to the corresponding position in the video, while each row vector is a representation of one particular kind of content in the video. Since a movie often contains multiple different category labels, which are typically embodied by different content in the video, and since the same category may also be expressed by different content, the embodiments of the present disclosure set D_a and D_b as hyperparameters so that the C3D network can learn different content parts of the video.
Further, after the classification device obtains the attention matrix of the video, it needs to further obtain the corresponding video feature matrix B, which is calculated as follows:
B=VH
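The following PyTorch sketch implements the two formulas above; the concrete values of D_a, D_b, k and 2u are assumptions for illustration, and the Linear layers hold the weight matrices W_a and W_b.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, hidden: int, d_a: int, d_b: int):   # hidden = 2u
        super().__init__()
        self.W_a = nn.Linear(hidden, d_a, bias=False)       # W_a in R^(D_a x 2u)
        self.W_b = nn.Linear(d_a, d_b, bias=False)          # W_b in R^(D_b x D_a)

    def forward(self, H: torch.Tensor):
        # H: (k, 2u) -> V: (D_b, k), B: (D_b, 2u)
        scores = self.W_b(torch.tanh(self.W_a(H)))           # equals (W_b tanh(W_a H^T))^T, shape (k, D_b)
        V = torch.softmax(scores.t(), dim=-1)                # softmax over the k clip positions
        B = V @ H                                            # B = V H
        return V, B

H = torch.randn(20, 512)                                     # e.g. k = 20 clips, 2u = 512
attn = SelfAttention(hidden=512, d_a=128, d_b=8)             # D_a, D_b chosen as hyperparameters
V, B = attn(H)
```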
step S104: traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence.
The classifying device extracts video features in the video frame sequence based on the attention matrix and the video feature matrix obtained in the steps, and outputs a plurality of different label categories of the video frame sequence according to the video features. Referring to fig. 4, fig. 4 is a schematic flow chart of step S104 in the method for classifying multiple labels of a movie shown in fig. 1. Specifically, the method comprises the following steps:
step S301: and acquiring a corresponding video feature matrix based on the attention matrix and the video clip features.
Step S302: and forming a two-layer perceptron through the attention matrix and the video feature matrix.
The classification device stacks the attention matrix and the video feature matrix in sequence to form a two-layer perceptron.
Step S303: and converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron.
The classification device uses the two-layer perceptron to convert the original space in which the video frame sequence lies into the movie category space.
Step S304: the tag class of the sequence of video frames is output in the movie class space.
The classification device extracts video features of the video frame sequences in a film category space, and outputs a plurality of tag categories corresponding to the video frame sequences according to the categories of the video features.
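A hedged sketch of the classification head is given below: the video feature matrix B is flattened and mapped by a two-layer perceptron into the movie category space, with a sigmoid producing one score per label. The flattening, the hidden width and the sigmoid output are assumptions consistent with multi-label prediction, not details confirmed by the text.

```python
import torch
import torch.nn as nn

class LabelHead(nn.Module):
    def __init__(self, d_b: int, hidden2u: int, mlp_hidden: int, n_labels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_b * hidden2u, mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, n_labels),          # movie category space
        )

    def forward(self, B: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(B.flatten(start_dim=-2))    # flatten (D_b, 2u) into one vector
        return torch.sigmoid(logits)                  # one independent score per label

head = LabelHead(d_b=8, hidden2u=512, mlp_hidden=256, n_labels=21)
scores = head(torch.randn(8, 512))                    # B from the attention step -> per-label scores
```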
After outputting the tag class of the sequence of video frames, the classification means further constructs a cross entropy loss function L of the C3D network according to the multi-tag learning task:
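The loss formula itself is not reproduced in the text above; a standard multi-label binary cross-entropy of the following form is assumed here:

L = -\frac{1}{C}\sum_{c=1}^{C}\left[\, y_c \log p_c + (1 - y_c)\log(1 - p_c) \,\right]

where C is the number of label categories, y_c ∈ {0, 1} indicates whether the movie carries label c, and p_c is the predicted score for label c.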
the classification device may evaluate the score of the multi-tag learning task, that is, the accuracy of the output tag class, according to the cross entropy loss function L, and may optimize the C3D network of the embodiments of the disclosure according to the cross entropy loss function L.
In this embodiment, the classification device acquires a continuous video frame sequence, where the video frame sequence includes a plurality of video clips; acquiring video fragment characteristics of a video frame sequence based on a preset neural network model; calculating an attention matrix based on the video clip features; traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence. According to the scheme, the attention degree of important contents in the video can be improved, and the accuracy of multi-label classification is improved.
In particular, compared with current movie classification methods, the movie classification method of the present application has the following advantages: (1) a C3D network is used to extract low-level features, which effectively preserves the temporal information in the video; (2) an attention mechanism is introduced, i.e., the response at a certain position in the video frame sequence is calculated by attending to all positions and taking a weighted average over all positions in an embedding space, which on the one hand increases the attention paid to important content and on the other hand reduces the influence of uninformative segments (such as the opening and closing credits) on the classification result; (3) considering that a movie often belongs to multiple categories, the movie classification task is extended into a multi-label learning task.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
With continued reference to fig. 5, fig. 5 is a schematic diagram of a frame of an embodiment of a multi-label classification apparatus for movies provided in the present application. The movie multi-label classification device 50 includes:
an acquisition module 51 for acquiring a sequence of consecutive video frames.
The feature extraction module 52 is configured to obtain a video clip feature of the video frame sequence based on a preset neural network model.
An attention calculation module 53 for calculating an attention matrix based on the video clip features.
The tag classification module 54 is configured to traverse the video frame sequence according to the attention matrix to output a tag class of the video frame sequence.
Referring to fig. 6, fig. 6 is a schematic frame diagram of an embodiment of an electronic device provided in the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the processor 62 being adapted to execute program instructions stored in the memory 61 to implement the steps of any of the movie multi-label classification method embodiments described above. In one particular implementation scenario, the electronic device 60 may include, but is not limited to, a microcomputer or a server; the electronic device 60 may also be a mobile device such as a notebook computer or a tablet computer, which is not limited herein.
In particular, the processor 62 is adapted to control itself and the memory 61 to implement the steps in any of the movie multi-label classification method embodiments described above. The processor 62 may also be referred to as a CPU (Central Processing Unit ). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a Field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 62 may be commonly implemented by an integrated circuit chip.
Referring to fig. 7, fig. 7 is a schematic diagram of a frame of an embodiment of a computer readable storage medium provided in the present application. The computer readable storage medium 70 stores program instructions 701 capable of being executed by a processor, the program instructions 701 for implementing the steps in any of the movie multi-label classification method embodiments described above.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing descriptions of the various embodiments emphasize the differences between them; for parts that are the same or similar, the embodiments may refer to one another, and such parts are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Claims (6)
1. A movie multi-label classification method, the classification method comprising:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video clips;
acquiring video fragment characteristics of the video frame sequence based on a preset neural network model;
calculating an attention matrix based on the video clip features;
traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence;
after the step of obtaining a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the difference values of adjacent frames in each gray level in the continuous video frame sequence;
if the accumulated sum is greater than a preset threshold value, the accumulated sum is overlapped on a color histogram of a video frame of a next frame in the adjacent frames;
dividing the superimposed video frame sequence into a plurality of video segments according to time sequence, and extracting a preset frame number video frame from each video segment to form a new video segment sequence;
the step of traversing the video frame sequence according to the attention matrix to output a tag class of the video frame sequence comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video clip features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a tag class of the video frame sequence in the movie class space;
the step of calculating an attention matrix based on the video clip features includes:
calculating a forward hiding state and a backward hiding state of the video clip features based on BiLSTM;
calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism;
the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism comprises the following steps:
calculating the hidden elements of the forward hidden state and the backward hidden state of each video clip feature;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on all hidden elements of the video clip features and the hidden layer node number.
2. The method of multi-label classification of motion pictures of claim 1,
the step of acquiring the corresponding video feature matrix based on the attention matrix and the video clip features includes:
acquiring a hidden element set of all video segment characteristics;
normalizing the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
3. The method of multi-label classification of motion pictures of claim 1,
after the step of outputting the tag class of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the tag class based on the cross entropy loss function;
the output layer of the neural network model is a full connection layer fc7.
4. A movie multi-label classification device, characterized in that the movie multi-label classification device comprises:
the acquisition module is used for acquiring a continuous video frame sequence;
the acquisition module is further used for calculating the accumulated sum of the difference values of adjacent frames in each gray level in the continuous video frame sequence; if the accumulated sum is greater than a preset threshold value, the accumulated sum is overlapped on a color histogram of a video frame of a next frame in the adjacent frames; dividing the superimposed video frame sequence into a plurality of video segments according to time sequence, and extracting a preset frame number video frame from each video segment to form a new video segment sequence;
the feature extraction module is used for acquiring video fragment features of the video frame sequence based on a preset neural network model;
an attention calculating module for calculating an attention matrix based on the video clip features;
the tag classification module is used for traversing the video frame sequence according to the attention matrix so as to output tag types of the video frame sequence;
the tag classification module is further configured to obtain a corresponding video feature matrix based on the attention matrix and the video segment features; forming a two-layer perceptron through the attention matrix and the video feature matrix; converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron; outputting a tag class of the video frame sequence in the movie class space;
the label classification module is also used for calculating the forward hiding state and the backward hiding state of the video clip features based on BiLSTM; calculating the attention matrix of the forward hidden state and the backward hidden state of all the video clip features by adopting a self-attention mechanism;
the tag classification module is further used for calculating a front hidden state and a hidden element of a rear hidden state of each video segment characteristic; acquiring the number of hidden layer nodes of the BiLSTM; and obtaining the attention matrix based on all hidden elements of the video clip features and the hidden layer node number.
5. An electronic device comprising a memory and a processor coupled to each other, the processor configured to execute program instructions stored in the memory to implement the method of multi-label classification of movies of any one of claims 1 to 3.
6. A computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the movie multi-label classification method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010708014.4A CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010708014.4A CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112084371A CN112084371A (en) | 2020-12-15 |
CN112084371B true CN112084371B (en) | 2024-04-16 |
Family
ID=73735152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010708014.4A Active CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112084371B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515997B (en) * | 2020-12-28 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Video data processing method and device and readable storage medium |
CN114329060A (en) * | 2021-12-24 | 2022-04-12 | 空间视创(重庆)科技股份有限公司 | Method and system for automatically generating multiple labels of video frame based on neural network model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763326A (en) * | 2018-05-04 | 2018-11-06 | 南京邮电大学 | A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based |
CN110209823A (en) * | 2019-06-12 | 2019-09-06 | 齐鲁工业大学 | A kind of multi-tag file classification method and system |
CN110516086A (en) * | 2019-07-12 | 2019-11-29 | 浙江工业大学 | One kind being based on deep neural network video display label automatic obtaining method |
CN111026915A (en) * | 2019-11-25 | 2020-04-17 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9830709B2 (en) * | 2016-03-11 | 2017-11-28 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
US11418476B2 (en) * | 2018-06-07 | 2022-08-16 | Arizona Board Of Regents On Behalf Of Arizona State University | Method and apparatus for detecting fake news in a social media network |
- 2020-07-21: Application CN202010708014.4A filed; granted as patent CN112084371B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763326A (en) * | 2018-05-04 | 2018-11-06 | 南京邮电大学 | A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based |
CN110209823A (en) * | 2019-06-12 | 2019-09-06 | 齐鲁工业大学 | A kind of multi-tag file classification method and system |
CN110516086A (en) * | 2019-07-12 | 2019-11-29 | 浙江工业大学 | One kind being based on deep neural network video display label automatic obtaining method |
CN111026915A (en) * | 2019-11-25 | 2020-04-17 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
Sang Haifeng et al., "Design of a video action recognition network based on recurrent region attention and video frame attention," Acta Electronica Sinica, 2020, full text. *
Also Published As
Publication number | Publication date |
---|---|
CN112084371A (en) | 2020-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
Bouwmans et al. | Scene background initialization: A taxonomy | |
CN110765860B (en) | Tumble judging method, tumble judging device, computer equipment and storage medium | |
CN111444878B (en) | Video classification method, device and computer readable storage medium | |
CN109754015B (en) | Neural networks for drawing multi-label recognition and related methods, media and devices | |
CN112597941B (en) | Face recognition method and device and electronic equipment | |
CN111738357B (en) | Junk picture identification method, device and equipment | |
CN113379627B (en) | Training method of image enhancement model and method for enhancing image | |
CN110826596A (en) | Semantic segmentation method based on multi-scale deformable convolution | |
CN112966646A (en) | Video segmentation method, device, equipment and medium based on two-way model fusion | |
CN112487207A (en) | Image multi-label classification method and device, computer equipment and storage medium | |
CN112614110B (en) | Method and device for evaluating image quality and terminal equipment | |
CN111539289A (en) | Method and device for identifying action in video, electronic equipment and storage medium | |
CN110958469A (en) | Video processing method and device, electronic equipment and storage medium | |
CN112084371B (en) | Movie multi-label classification method and device, electronic equipment and storage medium | |
CN111079864A (en) | Short video classification method and system based on optimized video key frame extraction | |
WO2024041108A1 (en) | Image correction model training method and apparatus, image correction method and apparatus, and computer device | |
Zhao et al. | Gradient-based conditional generative adversarial network for non-uniform blind deblurring via DenseResNet | |
JP2024508867A (en) | Image clustering method, device, computer equipment and computer program | |
CN112102200A (en) | Image completion model initialization method, training method and image completion method | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN111027472A (en) | Video identification method based on fusion of video optical flow and image space feature weight | |
CN110135428A (en) | Image segmentation processing method and device | |
CN116682141A (en) | Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception | |
CN116110005A (en) | Crowd behavior attribute counting method, system and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |