CN112084371A - Film multi-label classification method and device, electronic equipment and storage medium - Google Patents
Info
- Publication number
- CN112084371A (application CN202010708014.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- sequence
- attention
- label
- movie
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a movie multi-label classification method and device, electronic equipment and a computer-readable storage medium, wherein the movie multi-label classification method comprises the following steps: acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments; acquiring video segment features of the video frame sequence based on a preset neural network model; calculating an attention matrix based on the video segment features; and traversing the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme increases the attention paid to important content in the video and improves the accuracy of multi-label classification.
Description
Technical Field
The present application relates to the field of computer vision application technologies, and in particular, to a movie multi-label classification method, an apparatus, an electronic device, and a storage medium.
Background
The category label of a movie (e.g., war, comedy, animation) is highly condensed information about the movie's content: it is not only an important criterion by which people select movies but also a basis for building movie databases. However, with the development of the movie industry, the number of movie categories keeps growing. Constructing an efficient movie label classification system to update the labels of older movies therefore has significant practical value.
Currently, existing movie classification algorithms are mainly based on movie trailers or on posters. The poster-based methods are limited because movie posters vary widely in style and may not fully convey the category information, so their prediction accuracy is limited. The main problems of the trailer-based methods are:
(1) they assume a movie belongs to only a single category;
(2) they classify using only the low-level visual features of the trailer;
(3) video frames with a fixed pattern that carry no useful classification features (e.g., the opening and closing credits) are not distinguished from other video frames, which may mislead the classification.
In short, current classification methods neither model the temporal information in the video effectively nor avoid picking invalid frames (such as the opening and closing credits of a movie) when selecting key frames.
Disclosure of Invention
The application at least provides a movie multi-label classification method, a movie multi-label classification device, an electronic device and a computer-readable storage medium.
The application provides a method for classifying multiple labels of a movie, which comprises the following steps:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments;
acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
computing an attention matrix based on the video segment features;
traversing the sequence of video frames according to the attention matrix to output a label category of the sequence of video frames.
Wherein the step of traversing the sequence of video frames according to the attention matrix to output label categories of the sequence of video frames comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video segment features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a label category of the sequence of video frames in the movie category space.
Wherein the step of calculating an attention matrix based on the video segment features comprises:
calculating a forward hidden state and a backward hidden state of the video segment features based on a BiLSTM (bidirectional LSTM);
and adopting a self-attention mechanism to calculate an attention matrix of forward hidden states and backward hidden states of all the video segment features.
Wherein, the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video segment features by using a self-attention mechanism comprises:
calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on hidden elements of all the video segment characteristics and the number of the hidden layer nodes.
Wherein the step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features comprises:
acquiring a hidden element set of all the video segment features;
carrying out normalization processing on the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
Wherein, after the step of outputting the label categories of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the label category based on the cross entropy loss function;
wherein the output layer of the neural network model is the fully-connected layer fc7.
Wherein, after the step of obtaining a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the differences between adjacent frames at each gray level in the continuous video frame sequence;
when the accumulated sum is greater than a preset threshold, superimposing the accumulated sum on the color histogram of the later of the two adjacent frames;
dividing the video frame sequence after the superposition processing into a plurality of video segments in time order, and extracting a preset number of video frames from each video segment to form a new video segment sequence.
A second aspect of the present application provides a movie multi-label classification apparatus, including:
an obtaining module, configured to obtain a continuous sequence of video frames;
the characteristic extraction module is used for acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
an attention calculation module for calculating an attention matrix based on the video segment features;
and the label classification module is used for traversing the video frame sequence according to the attention matrix to output the label category of the video frame sequence.
A third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the foregoing multi-label classification method for movies in the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a processor, implement the movie multi-label classification method of the first aspect.
According to the scheme, the movie multi-label classification apparatus obtains a continuous video frame sequence, where the video frame sequence includes a plurality of video segments; obtains video segment features of the video frame sequence based on a preset neural network model; calculates an attention matrix based on the video segment features; and traverses the video frame sequence according to the attention matrix to output the label category of the video frame sequence. This scheme increases the attention paid to important content in the video and improves the accuracy of multi-label classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flowchart of an embodiment of the movie multi-label classification method provided by the present application;
FIG. 2 is a schematic framework diagram of the movie multi-label classification model provided by the present application;
FIG. 3 is a detailed flowchart of step S103 in the movie multi-label classification method shown in FIG. 1;
FIG. 4 is a detailed flowchart of step S104 in the movie multi-label classification method shown in FIG. 1;
FIG. 5 is a schematic framework diagram of an embodiment of the movie multi-label classification apparatus provided by the present application;
FIG. 6 is a schematic framework diagram of an embodiment of the electronic device provided by the present application;
FIG. 7 is a schematic framework diagram of an embodiment of the computer-readable storage medium provided by the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic flowchart of an embodiment of the movie multi-label classification method provided by the present application, and FIG. 2 is a schematic framework diagram of the movie multi-label classification model provided by the present application. The movie multi-label classification method can be applied to assigning multiple labels of different types to the feature film or trailer of a movie, making it convenient for audiences to learn the basic information of the movie.
The execution body of the movie multi-label classification method of the present application may be a movie multi-label classification apparatus; for example, the method may be executed by a terminal device, a server, or another processing device, where the apparatus may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a wireless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the movie multi-label classification method may be implemented by a processor calling computer-readable instructions stored in a memory.
Specifically, the movie multi-label classification method of the embodiment of the present disclosure may include the following steps:
step S101: a sequence of consecutive video frames is acquired, wherein the sequence of video frames comprises a number of video segments.
The classification apparatus obtains a continuous sequence of video frames, which may be part or all of a trailer or the feature of a movie. The classification apparatus may pre-process the continuous video frame sequence before segment generation, so that the raw video frame sequence complies with the input rules of the subsequent neural network model.
The preprocessing flow can effectively reduce the risk of network overfitting and eliminate noise unrelated to the category information in the raw data. For example, black borders may be added around the images in a video to maintain the video's aspect ratio and size; such black borders not only do not help the classification result, but may also lead the neural network model to mistake them for useful information, thereby affecting the prediction result.
Taking the C3D network (Convolutional 3D network) as an example, the standard input is a 4-dimensional matrix (channel number × frame number × frame height × frame width). For a given sequence of video frames U = {u_e}, e ∈ {1, 2, ..., T−1}, the preprocessing stage mainly processes the frame height and frame width of each video frame, where the original frame height in a video frame is Height and the original frame width is Width.
The specific preprocessing flow is as follows: first, the classification apparatus removes the black borders in the images and resizes the video frames to a preset video frame size while keeping the original aspect ratio; for example, the classification apparatus may resize each video frame to 196 (frame width) × 128 (frame height). Then, during training, the classification apparatus feeds the network 112 (frame width) × 112 (frame height) jittered random crops to improve the robustness of the system, yielding the preprocessed video frame sequence.
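As a concrete illustration, the resize-and-crop step can be expressed with standard image transforms; the following is a minimal Python sketch under the sizes stated above, assuming frames arrive as PIL images and leaving out black-border removal, whose algorithm the text does not detail:

```python
from torchvision import transforms

# Minimal sketch of the training-time frame preprocessing described above
# (assumed: resize to 196x128 after border removal, jittered crop to 112x112).
train_preprocess = transforms.Compose([
    transforms.Resize((128, 196)),      # target (frame height, frame width)
    transforms.RandomCrop((112, 112)),  # random crop with jitter, training only
    transforms.ToTensor(),              # HWC uint8 -> CHW float in [0, 1]
])
```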
For the embodiment of the present disclosure, given the preprocessed video frame sequence, the classification apparatus calculates the difference at each gray level between every video frame u_e and the next video frame u_{e+1}:

D_e = Σ_{j=1}^{n} |H_e(j) − H_{e+1}(j)|

where H_e(j) and H_{e+1}(j) are the values of the color histograms of video frames u_e and u_{e+1} at gray level j, respectively, and n is the number of gray levels in the color histogram.
When the difference D_e is greater than a given preset threshold, a shot change is considered to occur in the video frame sequence, and the classification apparatus superimposes the difference D_e onto the color histogram of the next video frame u_{e+1}. After shot detection, the classification apparatus obtains a new video frame sequence S = {s_t^{(r)}}, t ∈ {1, 2, ..., k}, r ∈ {1, 2, ..., m_t}, where the subscript t denotes the t-th segment, k denotes that the video frame sequence consists of k segments, the superscript r denotes the r-th video frame within a segment, and m_t denotes that the t-th video segment contains m_t video frames.
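To make the shot-detection step concrete, below is a minimal Python sketch. It is a simplification under stated assumptions: it thresholds each adjacent-frame histogram difference D_e directly and cuts there, rather than accumulating differences and superimposing them on the next frame's histogram as the claims describe; `threshold` and the grayscale frame format are assumptions.

```python
import numpy as np

def split_into_segments(frames, threshold, n_levels=256):
    """Split grayscale frames into shots where the gray-level histogram
    difference D_e between adjacent frames exceeds the preset threshold."""
    cuts = [0]
    for e in range(len(frames) - 1):
        h_e, _ = np.histogram(frames[e], bins=n_levels, range=(0, n_levels))
        h_n, _ = np.histogram(frames[e + 1], bins=n_levels, range=(0, n_levels))
        # D_e = sum_j |H_e(j) - H_{e+1}(j)| over the n gray levels
        d_e = np.abs(h_e.astype(np.int64) - h_n.astype(np.int64)).sum()
        if d_e > threshold:
            cuts.append(e + 1)          # a new segment starts at frame e+1
    cuts.append(len(frames))
    # segments[t] contains the m_t frames of the t-th video segment
    return [frames[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
```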
Then, in order to satisfy the input requirements of the subsequent neural network model such as the C3D network (see the Candidate Clip Generation part of FIG. 2), the classification apparatus further extracts 16 video frames from each video segment, in a predetermined order or randomly, to form a new video clip. For example, for a given video segment s_t, the classification apparatus performs equidistant extraction at an interval δ derived from Frame_rate, the frame rate of the current video, so that the new video segments form a new video frame sequence F = {f_t^{(j)}}, t ∈ {1, 2, ..., k}, j ∈ {1, 1+δ, ..., 1+15δ}.
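A sketch of this candidate clip generation follows. The fixed clip length of 16 matches the text; the interval δ = ⌊m_t / 16⌋ is an assumption, since the exact dependence on Frame_rate is not spelled out:

```python
def sample_clip(segment, num_frames=16):
    """Equidistantly extract num_frames frames from one video segment so
    every clip matches the fixed-length input expected by the C3D network."""
    m_t = len(segment)
    if m_t <= num_frames:                     # short segment: pad by repeating
        return list(segment) + [segment[-1]] * (num_frames - m_t)
    delta = m_t // num_frames                 # assumed extraction interval
    return [segment[j * delta] for j in range(num_frames)]
```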
Step S102: and acquiring the video segment characteristics of the video frame sequence based on a preset neural network model.
The classification device inputs the video frame sequence into a preset neural network model, taking a C3D network as an example, so as to obtain the video segment characteristics of the video frame sequence.
In the embodiment of the present disclosure, the classification apparatus extracts the video segment features from the video frame sequence through the C3D network: x_t = f(f_t^{(1)} : f_t^{(1+15δ)}); please refer to the Spatio-temporal Descriptor part of FIG. 2. Backed by large supervised training datasets, the C3D network has achieved good performance on many video analysis tasks and can successfully learn and model spatio-temporal features in videos. However, one problem with using the C3D network directly is that the task dataset lacks the action annotation data that would let the C3D network learn dynamic features. Pre-training can effectively solve this problem: it is widely used in the field of computer vision, where it can markedly improve application performance, and it has also achieved notable success in natural language processing in recent years.
Generally, the pre-training process creates a training task and obtains trained model parameters, and the embodiment of the present disclosure loads these trained model parameters into the C3D network, thereby initializing the model weights of the C3D network.
There are two main ways to load the model weights: in one, the loaded model parameters remain unchanged while training on the task of the disclosed embodiments, referred to as the "Frozen" approach; in the other, the model weights of the C3D network are initialized from the pre-trained parameters but still change with the training process during the task, referred to as the "Fine-Tuning" approach.
It should be noted that, in the embodiment of the present disclosure, the classification apparatus initializes the model weight of the C3D network in a "Frozen" manner.
Specifically, the classification apparatus pre-trains the C3D network on the Sports-1M dataset, applies the trained C3D network to the task dataset of the embodiment of the present disclosure, and takes the output of the penultimate fully-connected layer of the C3D network, namely fc7, as the final output value of the network.
It should be noted that, because features taken from the final output of the C3D network are tied to the pre-training task rather than the task at hand, and in order to maintain the generality of the features, the embodiment of the present disclosure selects the output of fc7 as the feature vector of the video segment features. Deleting the processing layers after fc7 helps enhance the transfer capability of the C3D network and meets the multi-label category classification requirement of the present application.
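The "Frozen" use of the pre-trained backbone can be sketched as follows; `pretrained_c3d` stands in for a Sports-1M pre-trained C3D model already truncated after fc7 (how the checkpoint is loaded is an assumption, as the text does not specify it):

```python
import torch
import torch.nn as nn

class FrozenC3DExtractor(nn.Module):
    """Wraps a pre-trained C3D backbone truncated after fc7 and freezes it,
    so the extracted segment features x_t stay fixed during training."""

    def __init__(self, pretrained_c3d: nn.Module):
        super().__init__()
        self.backbone = pretrained_c3d
        for p in self.backbone.parameters():
            p.requires_grad = False        # "Frozen": weights never updated

    @torch.no_grad()
    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3 channels, 16 frames, 112, 112) -> fc7 feature x_t
        return self.backbone(clip)
```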
Step S103: an attention matrix is calculated based on the video segment characteristics.
The specific process by which the classification apparatus calculates the attention matrix from the obtained video segment features is shown in FIG. 2 and FIG. 3, where FIG. 3 is a detailed flowchart of step S103 in the movie multi-label classification method shown in FIG. 1. Specifically, the method comprises the following steps:
step S201: and calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic.
The classification apparatus extracts each video segment feature X = (x_1, x_2, ..., x_k) ∈ R^{k×D} from the video frame sequence, and the dependencies between the video segment features then need to be modeled. In the embodiment of the present disclosure, the classification apparatus uses a BiLSTM to process the video segment features obtained in the above steps, computing a forward hidden state and a backward hidden state for each segment.
step S202: and acquiring the number of hidden layer nodes of the BilSTM.
The classification apparatus concatenates the forward hidden state and the backward hidden state of each video segment feature to obtain h_t. With the number of hidden nodes of the LSTM set to u, H ∈ R^{k×2u} is the set of all hidden states:

H = (h_1, h_2, ..., h_k)

where each element h_i of the hidden state set describes the global information around the i-th video segment in the video frame sequence.
Step S203: and obtaining an attention matrix based on hidden elements of all video features and the number of hidden nodes.
Here, an LSTM generally pays the same attention to the content at all positions, whereas the disclosed embodiments want the network to focus only on the important content in the video. To achieve this, the disclosed embodiments add a self-attention mechanism after the LSTM, whose input is the hidden state set H and whose output is an attention matrix V:
V = softmax(W_b tanh(W_a H^T))
where W_a ∈ R^{D_a×2u} and W_b ∈ R^{D_b×D_a} are two coefficient weight matrices and D_a and D_b are hyper-parameters, so the final V has shape D_b × k. Because the softmax function is used for normalization, each dimension of a row vector of V can be regarded as the attention paid to the corresponding position in the video, and each row vector is a representation of one specific kind of content in the video. Since a movie often carries several different category labels, different category labels are usually expressed by different contents in the video, and even the same category may be expressed by different contents; the embodiment of the present disclosure therefore treats D_a and D_b as hyper-parameters so that the network can learn the different content parts of the video.
Further, after the classification apparatus obtains the attention matrix of the video, it needs to compute the corresponding video feature matrix B; the specific calculation is:
B=VH
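Putting steps S201–S203 and the feature matrix together, a minimal PyTorch sketch of the BiLSTM plus self-attention block might look as follows; the reconstructed shapes of W_a and W_b are assumptions consistent with the formula above:

```python
import torch
import torch.nn as nn

class SegmentSelfAttention(nn.Module):
    """BiLSTM over segment features X in R^{k x D}, then the structured
    self-attention V = softmax(W_b tanh(W_a H^T)) and feature matrix B = V H."""

    def __init__(self, feat_dim: int, u: int, d_a: int, d_b: int):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, u, batch_first=True, bidirectional=True)
        self.w_a = nn.Linear(2 * u, d_a, bias=False)   # W_a in R^{D_a x 2u}
        self.w_b = nn.Linear(d_a, d_b, bias=False)     # W_b in R^{D_b x D_a}

    def forward(self, x: torch.Tensor):
        h, _ = self.bilstm(x)                          # H: (batch, k, 2u)
        scores = self.w_b(torch.tanh(self.w_a(h)))     # (batch, k, D_b)
        v = torch.softmax(scores.transpose(1, 2), -1)  # V: (batch, D_b, k)
        b = v @ h                                      # B = V H: (batch, D_b, 2u)
        return b, v
```

Each of the D_b rows of V is one attention distribution over the k segments, which is what lets the model attend to several different contents of the same movie.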
step S104: and traversing the video frame sequence according to the attention moment array to output the label category of the video frame sequence.
The classification device extracts video features in the video frame sequence based on the attention matrix and the video feature matrix obtained in the above steps, and outputs a plurality of different label categories of the video frame sequence according to the video features. Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a specific process of step S104 in the movie multi-tag classification method shown in fig. 1. Specifically, the method comprises the following steps:
step S301: and acquiring a corresponding video feature matrix based on the attention matrix and the video segment features.
Step S302: and forming a two-layer perceptron by the attention matrix and the video feature matrix.
The classification apparatus stacks the attention matrix and the video feature matrix in sequence to form a two-layer perceptron.
Step S303: and converting the space where the video frame sequence is located into the movie category space according to a two-layer perceptron.
The classification apparatus converts the original space of the video frame sequence into the movie category space by using the two-layer perceptron.
Step S304: the label category of the video frame sequence is output in a movie category space.
The classification device extracts video features of the video frame sequence in a movie category space and outputs a plurality of label categories corresponding to the video frame sequence according to the categories of the video features.
After outputting the label categories of the video frame sequence, the classification apparatus further constructs a cross-entropy loss function L for the network according to the multi-label learning task; in its standard multi-label form,

L = −Σ_{i=1}^{C} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

where C is the number of label categories, y_i indicates whether the movie carries label i, and ŷ_i is the predicted probability for label i.
The classification apparatus can evaluate the score of the multi-label learning task, i.e., the accuracy of the output label categories, according to the cross-entropy loss function L, and can optimize the network according to L.
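For completeness, here is a sketch of the two-layer perceptron head and the multi-label cross-entropy it is trained with; the sizes below are placeholders, not values given by the text:

```python
import torch
import torch.nn as nn

d_b, u, hidden_dim, num_classes = 8, 256, 512, 20    # assumed sizes

head = nn.Sequential(
    nn.Flatten(),                        # B: (batch, D_b, 2u) -> (batch, D_b*2u)
    nn.Linear(d_b * 2 * u, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),  # one score per movie category label
)
criterion = nn.BCEWithLogitsLoss()       # multi-label cross-entropy loss L

b = torch.randn(4, d_b, 2 * u)                        # stand-in for B
y = torch.randint(0, 2, (4, num_classes)).float()     # multi-hot labels
loss = criterion(head(b), y)             # score used to optimize the network
```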
In this embodiment, the classification apparatus obtains a continuous video frame sequence, where the video frame sequence includes a plurality of video segments; obtains video segment features of the video frame sequence based on a preset neural network model; calculates an attention matrix based on the video segment features; and traverses the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme increases the attention paid to important content in the video and improves the accuracy of multi-label classification.
Specifically, compared with current movie classification methods, the movie classification method of the present application has the following advantages: (1) it extracts bottom-layer features with the C3D network, effectively retaining the temporal features in the video; (2) it introduces an attention mechanism, computing the response at a certain position in the video frame sequence by attending to all positions and taking their weighted average in an embedding space, which on the one hand raises the attention paid to important content and on the other hand reduces the influence of invalid segments (such as the opening and closing credits) on the classification result; (3) considering that a movie often belongs to multiple categories, it extends the movie classification task into a multi-label learning task.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
Referring to FIG. 5, FIG. 5 is a schematic framework diagram of an embodiment of the movie multi-label classification apparatus provided by the present application. The movie multi-label classification apparatus 50 includes:
an obtaining module 51 for obtaining a sequence of consecutive video frames.
And the feature extraction module 52 is configured to obtain video segment features of the video frame sequence based on a preset neural network model.
And an attention calculation module 53 for calculating an attention matrix based on the video segment characteristics.
And a label classification module 54 for traversing the video frame sequence according to the attention moment matrix to output a label class of the video frame sequence.
Referring to fig. 6, fig. 6 is a schematic diagram of a frame of an embodiment of an electronic device provided in the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the processor 62 being configured to execute program instructions stored in the memory 61 to implement the steps in any of the above-described embodiments of the movie multi-label classification method. In one particular implementation scenario, electronic device 60 may include, but is not limited to: a microcomputer, a server, and in addition, the electronic device 60 may also include a mobile device such as a notebook computer, a tablet computer, and the like, which is not limited herein.
In particular, the processor 62 is configured to control itself and the memory 61 to implement the steps in any of the above-described embodiments of the movie multi-label classification method. The processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be implemented jointly by multiple integrated circuit chips.
Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application. The computer readable storage medium 70 stores program instructions 701 executable by a processor, the program instructions 701 being for implementing the steps in any of the above-described embodiments of the movie multi-label classification method.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Claims (10)
1. A multi-label classification method for a movie is characterized by comprising the following steps:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments;
acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
computing an attention matrix based on the video segment features;
traversing the sequence of video frames according to the attention matrix to output a label category of the sequence of video frames.
2. The movie multi-label classification method according to claim 1,
the step of traversing the sequence of video frames according to the attention matrix to output label categories of the sequence of video frames comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video segment features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a label category of the sequence of video frames in the movie category space.
3. The movie multi-label classification method according to claim 2,
the step of calculating an attention matrix based on the video segment features comprises:
calculating a forward hidden state and a backward hidden state of the video segment features based on a BiLSTM;
and adopting a self-attention mechanism to calculate an attention matrix of forward hidden states and backward hidden states of all the video segment features.
4. The movie multi-label classification method according to claim 3,
the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video segment features by using a self-attention mechanism comprises the following steps:
calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on hidden elements of all the video segment characteristics and the number of the hidden layer nodes.
5. The movie multi-label classification method according to claim 4,
the step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features includes:
acquiring a hidden element set of all the video segment features;
carrying out normalization processing on the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
6. The movie multi-label classification method according to claim 1,
after the step of outputting the label categories of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the label category based on the cross entropy loss function;
wherein the output layer of the neural network model is the fully-connected layer fc7.
7. The movie multi-label classification method according to claim 1,
after the step of acquiring a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of the differences between adjacent frames at each gray level in the continuous video frame sequence;
when the accumulated sum is greater than a preset threshold, superimposing the accumulated sum on the color histogram of the later of the two adjacent frames;
dividing the video frame sequence after the superposition processing into a plurality of video segments in time order, and extracting a preset number of video frames from each video segment to form a new video segment sequence.
8. A movie multi-label classification apparatus, comprising:
an obtaining module, configured to obtain a continuous sequence of video frames;
the characteristic extraction module is used for acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
an attention calculation module for calculating an attention matrix based on the video segment features;
and the label classification module is used for traversing the video frame sequence according to the attention moment array so as to output the label category of the video frame sequence.
9. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the movie multi-label classification method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon program instructions, which when executed by a processor, implement the movie multi-label classification method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010708014.4A CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010708014.4A CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112084371A true CN112084371A (en) | 2020-12-15 |
CN112084371B CN112084371B (en) | 2024-04-16 |
Family
ID=73735152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010708014.4A Active CN112084371B (en) | 2020-07-21 | 2020-07-21 | Movie multi-label classification method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112084371B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515997A (en) * | 2020-12-28 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Video data processing method and device and readable storage medium |
CN114329060A (en) * | 2021-12-24 | 2022-04-12 | 空间视创(重庆)科技股份有限公司 | Method and system for automatically generating multiple labels of video frame based on neural network model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170262995A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
CN108763326A (en) * | 2018-05-04 | 2018-11-06 | 南京邮电大学 | A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based |
CN110209823A (en) * | 2019-06-12 | 2019-09-06 | 齐鲁工业大学 | A kind of multi-tag file classification method and system |
CN110516086A (en) * | 2019-07-12 | 2019-11-29 | 浙江工业大学 | One kind being based on deep neural network video display label automatic obtaining method |
US20190379628A1 (en) * | 2018-06-07 | 2019-12-12 | Arizona Board Of Regents On Behalf Of Arizona State University | Method and apparatus for detecting fake news in a social media network |
CN111026915A (en) * | 2019-11-25 | 2020-04-17 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
- 2020-07-21 CN CN202010708014.4A patent/CN112084371B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170262995A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
CN108763326A (en) * | 2018-05-04 | 2018-11-06 | 南京邮电大学 | A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based |
US20190379628A1 (en) * | 2018-06-07 | 2019-12-12 | Arizona Board Of Regents On Behalf Of Arizona State University | Method and apparatus for detecting fake news in a social media network |
CN110209823A (en) * | 2019-06-12 | 2019-09-06 | 齐鲁工业大学 | A kind of multi-tag file classification method and system |
CN110516086A (en) * | 2019-07-12 | 2019-11-29 | 浙江工业大学 | One kind being based on deep neural network video display label automatic obtaining method |
CN111026915A (en) * | 2019-11-25 | 2020-04-17 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
SANG Haifeng et al.: "Design of a video action recognition network based on recurrent region attention and video frame attention", Acta Electronica Sinica (电子学报), 15 June 2020 (2020-06-15) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515997A (en) * | 2020-12-28 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Video data processing method and device and readable storage medium |
CN113515997B (en) * | 2020-12-28 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Video data processing method and device and readable storage medium |
CN114329060A (en) * | 2021-12-24 | 2022-04-12 | 空间视创(重庆)科技股份有限公司 | Method and system for automatically generating multiple labels of video frame based on neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN112084371B (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230196117A1 (en) | Training method for semi-supervised learning model, image processing method, and device | |
CN109711481B (en) | Neural networks for drawing multi-label recognition, related methods, media and devices | |
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
Bouwmans et al. | Scene background initialization: A taxonomy | |
Fu et al. | Fast crowd density estimation with convolutional neural networks | |
US11222211B2 (en) | Method and apparatus for segmenting video object, electronic device, and storage medium | |
CN110765860B (en) | Tumble judging method, tumble judging device, computer equipment and storage medium | |
US10339421B2 (en) | RGB-D scene labeling with multimodal recurrent neural networks | |
CN108921225B (en) | Image processing method and device, computer equipment and storage medium | |
CN109033107B (en) | Image retrieval method and apparatus, computer device, and storage medium | |
CN113066017B (en) | Image enhancement method, model training method and equipment | |
CN112487207A (en) | Image multi-label classification method and device, computer equipment and storage medium | |
CN114549913B (en) | Semantic segmentation method and device, computer equipment and storage medium | |
CN112132145B (en) | Image classification method and system based on model extended convolutional neural network | |
JP2010157118A (en) | Pattern identification device and learning method for the same and computer program | |
WO2022072199A1 (en) | Sparse optical flow estimation | |
CN111079864A (en) | Short video classification method and system based on optimized video key frame extraction | |
WO2024041108A1 (en) | Image correction model training method and apparatus, image correction method and apparatus, and computer device | |
CN112084371B (en) | Movie multi-label classification method and device, electronic equipment and storage medium | |
CN114494699B (en) | Image semantic segmentation method and system based on semantic propagation and front background perception | |
Aldhaheri et al. | MACC Net: Multi-task attention crowd counting network | |
CN111027472A (en) | Video identification method based on fusion of video optical flow and image space feature weight | |
CN113869234A (en) | Facial expression recognition method, device, equipment and storage medium | |
CN111079900B (en) | Image processing method and device based on self-adaptive connection neural network | |
CN111126177B (en) | Method and device for counting number of people |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |