CN112084371A - Film multi-label classification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112084371A
CN112084371A
Authority
CN
China
Prior art keywords
video
sequence
attention
label
movie
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010708014.4A
Other languages
Chinese (zh)
Other versions
CN112084371B (en)
Inventor
吕子钰
禹一童
杨敏
李成明
姜青山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010708014.4A
Publication of CN112084371A
Application granted
Publication of CN112084371B
Active legal status (Current)
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 — Information retrieval of video data
    • G06F 16/75 — Clustering; Classification
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a movie multi-label classification method and apparatus, an electronic device, and a computer-readable storage medium. The movie multi-label classification method comprises the following steps: acquiring a continuous video frame sequence, wherein the video frame sequence comprises several video segments; acquiring video segment features of the video frame sequence based on a preset neural network model; calculating an attention matrix based on the video segment features; and traversing the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme can increase the attention paid to important content in the video and improve the accuracy of multi-label classification.

Description

Film multi-label classification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision application technologies, and in particular, to a movie multi-label classification method, an apparatus, an electronic device, and a storage medium.
Background
The category label of a movie (e.g., war, comedy, animation, etc.), as highly condensed information about the movie's content, is not only an important criterion by which people select movies but also a basis for building movie databases. However, with the development of the movie industry, the number of movie category types keeps increasing. Constructing an efficient movie label classification system to update the labels of older movies therefore has very important practical significance and application value.
Currently, existing movie classification algorithms are mainly based on either movie trailers or posters. The limitation of poster-based methods is that movie posters vary widely in style and may not fully contain the category information, so the prediction accuracy of such methods is limited. The main problems of trailer-based movie classification methods are:
(1) a movie is assumed to belong to only a single category;
(2) only low-level visual features in the trailer are used for classification;
(3) video frames that follow a fixed pattern and contain no useful classification features (e.g., opening and closing credits) are not distinguished from other video frames, which may mislead the classification.
In short, current classification methods not only fail to effectively model the temporal information in the video, but may also pick invalid frames (such as the opening and closing credits of a movie) when selecting key frames.
Disclosure of Invention
The application at least provides a movie multi-label classification method, a movie multi-label classification device, an electronic device and a computer-readable storage medium.
The application provides a method for classifying multiple labels of a movie, which comprises the following steps:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments;
acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
computing an attention matrix based on the video segment features;
traversing the sequence of video frames according to the attention matrix to output the label categories of the sequence of video frames.
Wherein the step of traversing the sequence of video frames according to the attention matrix to output the label categories of the sequence of video frames comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video segment features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a label category of the sequence of video frames in the movie category space.
Wherein the step of calculating an attention matrix based on the video segment features comprises:
calculating a forward hidden state and a backward hidden state of the video segment features based on a BiLSTM;
and adopting a self-attention mechanism to calculate an attention matrix of forward hidden states and backward hidden states of all the video segment features.
Wherein, the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video segment features by using a self-attention mechanism comprises:
calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on hidden elements of all the video segment characteristics and the number of the hidden layer nodes.
Wherein the step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features comprises:
acquiring a hidden element set of all the video clip characteristics;
carrying out normalization processing on the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
Wherein, after the step of outputting the label categories of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the label category based on the cross entropy loss function;
wherein the output layer of the neural network model is a fully connected layer fc7.
Wherein, after the step of obtaining a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of differences at each gray level between adjacent frames in the continuous video frame sequence;
when the accumulated sum is greater than a preset threshold, superimposing the accumulated sum onto the color histogram of the later of the two adjacent frames;
dividing the superimposed video frame sequence into several video segments in temporal order, and extracting a preset number of video frames from each video segment to form a new video segment sequence.
A second aspect of the present application provides a movie multi-label classification apparatus, including:
an obtaining module, configured to obtain a continuous video frame sequence;
a feature extraction module, configured to obtain video segment features of the video frame sequence based on a preset neural network model;
an attention calculation module, configured to calculate an attention matrix based on the video segment features;
and a label classification module, configured to traverse the video frame sequence according to the attention matrix to output the label categories of the video frame sequence.
A third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the foregoing multi-label classification method for movies in the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a processor, implement the movie multi-label classification method of the first aspect.
According to the above scheme, the movie multi-label classification apparatus acquires a continuous video frame sequence, wherein the video frame sequence comprises several video segments; acquires video segment features of the video frame sequence based on a preset neural network model; calculates an attention matrix based on the video segment features; and traverses the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme can increase the attention paid to important content in the video and improve the accuracy of multi-label classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart illustrating an embodiment of the movie multi-label classification method provided by the present application;
FIG. 2 is a block diagram of a multi-label classification model for movies provided by the present application;
FIG. 3 is a detailed flowchart of step S103 in the movie multi-label classification method shown in FIG. 1;
FIG. 4 is a detailed flowchart of step S104 in the movie multi-label classification method shown in FIG. 1;
FIG. 5 is a block diagram of an embodiment of the movie multi-label classification apparatus provided herein;
FIG. 6 is a block diagram of an embodiment of an electronic device provided herein;
FIG. 7 is a block diagram of an embodiment of a computer-readable storage medium provided herein.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic flowchart of an embodiment of the movie multi-label classification method provided by the present application, and fig. 2 is a schematic frame diagram of the movie multi-label classification model provided by the present application. The movie multi-label classification method can be applied to assigning multiple labels of different types to the feature film or trailer of a movie, which makes it convenient for audiences to learn the basic information of the movie.
The execution body of the movie multi-label classification method of the present application may be a movie multi-label classification apparatus. For example, the method may be executed by a terminal device, a server, or another processing device, where the movie multi-label classification apparatus may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the movie multi-label classification method may be implemented by a processor calling computer-readable instructions stored in a memory.
Specifically, the movie multi-label classification method of the embodiment of the present disclosure may include the following steps:
step S101: a sequence of consecutive video frames is acquired, wherein the sequence of video frames comprises a number of video segments.
Wherein the classification apparatus obtains a continuous video frame sequence, which may be part or all of a trailer or the feature film of a movie. Before generating segments from the continuous video frame sequence, the classification apparatus may preprocess it so that the raw video frame sequence complies with the input rules of the subsequent neural network model.
This preprocessing flow can effectively reduce the risk of network overfitting and eliminate noise unrelated to the category information in the raw data. For example, black borders may be padded around the images in a video to maintain its aspect ratio and frame size; these borders not only contribute nothing to the classification result, but may also cause the neural network model to mistake them for useful information, thereby affecting the prediction result.
Taking the C3D network (Convolutional 3D network) as an example, its standard input is a 4-dimensional matrix (channel number × frame number × frame height × frame width). For a given video frame sequence U = {u_e}, e ∈ {1, 2, ..., T−1}, the preprocessing stage mainly processes the frame height and frame width of each video frame, where the original frame height is Height and the original frame width is Width.
The specific preprocessing flow is as follows: first, the classification apparatus removes the black borders in the images and resizes the video frames to a preset size while keeping the original aspect ratio. For example, the classification apparatus may resize each video frame to 196 (frame width) × 128 (frame height). Then, during training, the classification apparatus applies jittered random cropping to 112 (frame width) × 112 (frame height) to improve the robustness of the system. For the embodiment of the present disclosure, the video frame sequence after this preprocessing flow is denoted U′ = {u′_e}, e ∈ {1, 2, ..., T−1}.
The classification apparatus calculates the difference at each gray level between each video frame u′_e and the next video frame u′_{e+1} in the sequence, denoted D(u′_e, u′_{e+1}). The difference is calculated as:

D(u′_e, u′_{e+1}) = Σ_{j=1..n} |H_e(j) − H_{e+1}(j)|

where H_e(j) and H_{e+1}(j) are the values of the color histograms of video frames u′_e and u′_{e+1} at gray level j, and n is the number of gray levels in the color histogram.
When the difference D is greater than a given preset threshold, a shot change is considered to occur in the video frame sequence, and the classification apparatus superimposes the difference D onto the color histogram of the next frame u′_{e+1}. After shot detection, the classification apparatus obtains a new video frame sequence S = {s_t^(r)}, t ∈ {1, 2, ..., k}, r ∈ {1, 2, ..., m_t}, where the subscript t denotes the t-th segment, k denotes that the video frame sequence consists of k segments, the superscript r denotes the r-th video frame in a video segment, and m_t indicates that the t-th video segment contains m_t video frames.
Then, to satisfy the input requirements of the subsequent neural network model, such as the C3D network (see the Candidate Clip Generation part in fig. 2), the classification apparatus further extracts 16 video frames from each video segment, in a predetermined order or randomly, to form a new video segment. For example, for a given video segment s_t = {s_t^(1), ..., s_t^(m_t)}, the classification apparatus performs equidistant extraction at an interval δ determined by the segment length and by Frame_rate, the frame rate of the current video, so that the new video segments form a new video frame sequence F = {f_t^(j)}, t ∈ {1, 2, ..., k}, j ∈ {1, 1+δ, ..., 1+15δ}.
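The shot detection and candidate clip generation described above can be sketched in a few lines of Python. This is purely illustrative and not part of the disclosure; the function names, the 64-bin histogram, and the threshold value are assumptions made for the example:

import numpy as np

def gray_histogram(frame, n_levels=64):
    # Histogram of one grayscale frame (H x W, uint8) over n gray levels.
    hist, _ = np.histogram(frame, bins=n_levels, range=(0, 256))
    return hist.astype(np.float64)

def split_into_shots(frames, n_levels=64, threshold=1e4):
    # Cut the sequence where D = sum_j |H_e(j) - H_{e+1}(j)| exceeds the threshold.
    boundaries = [0]
    for e in range(len(frames) - 1):
        d = np.abs(gray_histogram(frames[e], n_levels)
                   - gray_histogram(frames[e + 1], n_levels)).sum()
        if d > threshold:
            boundaries.append(e + 1)
    boundaries.append(len(frames))
    return [frames[s:t] for s, t in zip(boundaries, boundaries[1:]) if t > s]

def sample_16_frames(segment):
    # Equidistant extraction of 16 frames f_t^(1), f_t^(1+d), ..., f_t^(1+15d).
    idx = np.linspace(0, len(segment) - 1, num=16).round().astype(int)
    return [segment[i] for i in idx]

A segment shorter than 16 frames would repeat indices under this sampling; the disclosure does not specify how such segments are handled.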
Step S102: acquiring the video segment features of the video frame sequence based on a preset neural network model.
The classification apparatus inputs the video frame sequence into a preset neural network model, taking the C3D network as an example, so as to obtain the video segment features of the video frame sequence.
In the embodiment of the present disclosure, the classification apparatus extracts the video segment features from the video frame sequence through the C3D network: x_t = f(f_t^(1) : f_t^(1+15δ)); please refer to the Spatio-temporal Descriptor part of fig. 2. Backed by large supervised training datasets, the C3D network has achieved good performance on many video analysis tasks and can successfully learn and model spatio-temporal features in videos. However, one problem with using the C3D network directly is that the task dataset lacks the action annotation data needed to let the C3D network learn dynamic features. Pre-training can effectively solve this problem; it is widely applied in the field of computer vision, where it can markedly boost performance, and has also achieved notable success in natural language processing in recent years.
Generally, the pre-training process creates a training task and obtains trained model parameters, which are then loaded onto the C3D network of the embodiment of the present disclosure to initialize its model weights.
There are mainly two methods for loading model weights: in one, the loaded model parameters remain unchanged while training on the task of the disclosed embodiment, referred to as the "Frozen" approach; in the other, the model weights of the C3D network are initialized from the loaded parameters but continue to change during training on the task, referred to as the "Fine-Tuning" approach.
It should be noted that, in the embodiment of the present disclosure, the classification apparatus initializes the model weight of the C3D network in a "Frozen" manner.
Specifically, the classification apparatus pre-trains the C3D network on the Sports-1M dataset, applies the trained C3D network to the task dataset of the embodiment of the present disclosure, and takes the output of the penultimate fully connected layer in the C3D network, namely fc7, as the final output value of the network.
It should be noted that directly applying the output of the C3D network to a task suffers from the problem that the extracted features are unrelated to the specific task. Therefore, to maintain the generality of the features, the embodiment of the present disclosure selects the output of fc7 as the feature vector of the video segment features. Deleting the processing layers after fc7 helps enhance the transfer capability of the C3D network and meets the multi-label category classification requirements of the present application.
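As a rough sketch of this "Frozen" feature-extraction setup (not the disclosure's own implementation): the block layout below loosely follows the published C3D architecture but is simplified to one convolution per block, and the checkpoint path is hypothetical:

import torch
import torch.nn as nn

class C3DFeatureExtractor(nn.Module):
    # Truncated C3D-style backbone: everything after fc7 is dropped, so the
    # forward pass returns the 4096-d fc7 activation as the clip feature.
    def __init__(self):
        super().__init__()
        def block(cin, cout, pool):
            return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True), nn.MaxPool3d(pool))
        self.features = nn.Sequential(
            block(3, 64, (1, 2, 2)),
            block(64, 128, (2, 2, 2)),
            block(128, 256, (2, 2, 2)),
            block(256, 512, (2, 2, 2)),
            block(512, 512, (2, 2, 2)),
        )
        self.fc6 = nn.Linear(512 * 1 * 3 * 3, 4096)
        self.fc7 = nn.Linear(4096, 4096)

    def forward(self, clip):                    # clip: (N, 3, 16, 112, 112)
        x = self.features(clip).flatten(1)
        x = torch.relu(self.fc6(x))
        return torch.relu(self.fc7(x))          # (N, 4096) segment feature x_t

extractor = C3DFeatureExtractor()
# "Frozen" loading: hypothetical Sports-1M checkpoint; weights stay fixed afterwards.
# extractor.load_state_dict(torch.load("c3d_sports1m.pth"))
for p in extractor.parameters():
    p.requires_grad = False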
Step S103: an attention matrix is calculated based on the video segment features.
The specific process by which the classification apparatus calculates the attention matrix from the obtained video segment features is shown in fig. 2 and fig. 3, where fig. 3 is a detailed flowchart of step S103 in the movie multi-label classification method shown in fig. 1. Specifically, the method comprises the following steps:
step S201: and calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic.
After the classification apparatus extracts the video segment features X = (x_1, x_2, ..., x_k) ∈ R^(k×D) from the video frame sequence, the dependencies between the video segment features still need to be modeled. In the embodiment of the present disclosure, the classification apparatus uses a BiLSTM to process the video segment features obtained in the above steps:

h_t^f = LSTM_f(x_t, h_{t−1}^f)
h_t^b = LSTM_b(x_t, h_{t+1}^b)
step S202: and acquiring the number of hidden layer nodes of the BilSTM.
The classification apparatus then concatenates the forward hidden state and the backward hidden state of each video segment feature to obtain h_t = [h_t^f; h_t^b]. When the number of hidden nodes of the LSTM is set to u, H ∈ R^(k×2u) denotes the set of all hidden states:

H = (h_1, h_2, ..., h_k)

where each element h_i of the hidden state set describes the global information centered around the i-th video segment in the video frame sequence.
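A minimal PyTorch sketch of this BiLSTM step, assuming example values for the segment count k, the feature dimension D, and the hidden node count u (none of which are fixed by the disclosure):

import torch
import torch.nn as nn

k, D, u = 12, 4096, 256          # k segments, C3D feature dim D, hidden nodes u
X = torch.randn(1, k, D)         # segment features X = (x_1, ..., x_k)

bilstm = nn.LSTM(input_size=D, hidden_size=u,
                 batch_first=True, bidirectional=True)
H, _ = bilstm(X)                 # H: (1, k, 2u); row t concatenates the forward
                                 # and backward hidden states h_t^f and h_t^b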
Step S203: obtaining the attention matrix based on the hidden elements of all video segment features and the number of hidden layer nodes.
Here, an LSTM generally pays the same attention to the content at every position, whereas the disclosed embodiment wants the network to focus only on the important content in the video. To achieve this, the disclosed embodiment adds a self-attention mechanism after the LSTM, whose input is the hidden state set H and whose output is an attention matrix V:

V = softmax(W_b tanh(W_a H^T))

where W_a ∈ R^(Da×2u) and W_b ∈ R^(Db×Da) are two coefficient weight matrices, D_a and D_b are hyperparameters, and the final shape of V is R^(Db×k).
Because the softmax function is used for normalization, each dimension of a row vector of V can be regarded as the attention paid to the corresponding position in the video, and each row vector is a representation of one specific kind of content in the video. Since a movie often carries several different category labels, different category labels are usually expressed by different content in the video, and even the same category may be expressed by different content, the disclosed embodiment sets D_a and D_b as hyperparameters so that the network can learn the different content parts in the video.
Further, after the classification apparatus obtains the attention matrix of the video, it needs to obtain the corresponding video feature matrix B, calculated as follows:
B=VH
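The attention module can be sketched in PyTorch directly from the two formulas above; the concrete values of d_a and d_b below are placeholders, since the disclosure leaves D_a and D_b as hyperparameters:

import torch
import torch.nn as nn

class SegmentSelfAttention(nn.Module):
    # Computes V = softmax(W_b tanh(W_a H^T)) and B = V H.
    def __init__(self, u, d_a=128, d_b=8):
        super().__init__()
        self.W_a = nn.Linear(2 * u, d_a, bias=False)
        self.W_b = nn.Linear(d_a, d_b, bias=False)

    def forward(self, H):                                # H: (N, k, 2u)
        scores = self.W_b(torch.tanh(self.W_a(H)))       # (N, k, D_b)
        V = torch.softmax(scores.transpose(1, 2), dim=-1)  # (N, D_b, k)
        B = V @ H                                        # (N, D_b, 2u)
        return V, B

Each of the D_b rows of V is normalized over the k segments, so a row can be read as one attention distribution over positions in the video, matching the description above.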
step S104: and traversing the video frame sequence according to the attention moment array to output the label category of the video frame sequence.
The classification apparatus extracts the video features of the video frame sequence based on the attention matrix and the video feature matrix obtained in the above steps, and outputs several different label categories of the video frame sequence according to those video features. Referring to fig. 4, fig. 4 is a detailed flowchart of step S104 in the movie multi-label classification method shown in fig. 1. Specifically, the method comprises the following steps:
step S301: and acquiring a corresponding video feature matrix based on the attention matrix and the video segment features.
Step S302: forming a two-layer perceptron from the attention matrix and the video feature matrix.
The classification apparatus stacks the attention matrix and the video feature matrix in sequence to form a two-layer perceptron.
Step S303: converting the space where the video frame sequence is located into the movie category space according to the two-layer perceptron.
Wherein the classification apparatus uses the two-layer perceptron to convert the original space of the video frame sequence into the movie category space.
Step S304: outputting the label categories of the video frame sequence in the movie category space.
The classification apparatus extracts the video features of the video frame sequence in the movie category space and outputs the several label categories corresponding to the video frame sequence according to the categories of those video features.
After outputting the label categories of the video frame sequence, the classification apparatus further constructs a cross-entropy loss function L for the C3D network according to the multi-label learning task. With ŷ_i denoting the predicted score for category i and y_i ∈ {0, 1} the ground-truth label over C categories, the multi-label cross-entropy loss takes the standard form:

L = − Σ_{i=1..C} [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]
The classification apparatus can evaluate the score of the multi-label learning task, that is, the accuracy of the output label categories, according to the cross-entropy loss function L, and can optimize the C3D network according to this loss.
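A short sketch of the classification head and loss (all sizes below are assumed values; the disclosure specifies only a two-layer perceptron and a multi-label cross-entropy loss):

import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    # Two-layer perceptron mapping the video feature matrix B into the
    # movie category space (one score per category label).
    def __init__(self, d_b=8, u=256, hidden=512, n_categories=20):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_b * 2 * u, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_categories),
        )

    def forward(self, B):                 # B: (N, D_b, 2u)
        return self.mlp(B.flatten(1))     # raw category scores

head = MultiLabelHead()
criterion = nn.BCEWithLogitsLoss()        # cross entropy applied per label
logits = head(torch.randn(4, 8, 512))     # batch of 4 video feature matrices
targets = torch.randint(0, 2, (4, 20)).float()  # a movie may carry several labels
loss = criterion(logits, targets)

nn.BCEWithLogitsLoss applies a sigmoid internally and scores each category independently, which is what allows a single movie to receive multiple labels.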
In this embodiment, the classification apparatus acquires a continuous video frame sequence, wherein the video frame sequence comprises several video segments; acquires video segment features of the video frame sequence based on a preset neural network model; calculates an attention matrix based on the video segment features; and traverses the video frame sequence according to the attention matrix to output the label categories of the video frame sequence. This scheme can increase the attention paid to important content in the video and improve the accuracy of multi-label classification.
Specifically, compared with current movie classification methods, the movie classification method of the present application has the following advantages: (1) it uses the C3D network to extract low-level features, effectively retaining the temporal features in the video; (2) it introduces an attention mechanism that computes the response at a position in the video frame sequence by attending to all positions and taking their weighted average in an embedding space, which on one hand increases the attention paid to important content and on the other hand reduces the influence of uninformative segments (such as opening and closing credits) on the classification result; (3) considering that a movie often belongs to multiple categories, it extends the movie classification task into a multi-label learning task.
It will be understood by those skilled in the art that, in the method of the present invention, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
Referring to fig. 5, fig. 5 is a schematic diagram of a framework of an embodiment of the movie multi-label classification apparatus provided by the present application. The movie multi-label classification apparatus 50 includes:
an obtaining module 51 for obtaining a sequence of consecutive video frames.
A feature extraction module 52, configured to obtain video segment features of the video frame sequence based on a preset neural network model.
An attention calculation module 53, configured to calculate an attention matrix based on the video segment features.
A label classification module 54, configured to traverse the video frame sequence according to the attention matrix to output the label categories of the video frame sequence.
Referring to fig. 6, fig. 6 is a schematic diagram of a frame of an embodiment of an electronic device provided in the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the processor 62 being configured to execute program instructions stored in the memory 61 to implement the steps in any of the above-described movie multi-label classification method embodiments. In one specific implementation scenario, the electronic device 60 may include, but is not limited to, a microcomputer or a server; in addition, the electronic device 60 may also be a mobile device such as a notebook computer or a tablet computer, which is not limited herein.
In particular, the processor 62 is configured to control itself and the memory 61 to implement the steps in any of the above-described movie multi-label classification method embodiments. The processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be implemented collectively by a plurality of integrated circuit chips.
Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application. The computer readable storage medium 70 stores program instructions 701 executable by a processor, the program instructions 701 being for implementing the steps in any of the above-described embodiments of the movie multi-label classification method.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (10)

1. A multi-label classification method for a movie is characterized by comprising the following steps:
acquiring a continuous video frame sequence, wherein the video frame sequence comprises a plurality of video segments;
acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
computing an attention matrix based on the video segment features;
traversing the sequence of video frames according to the attention matrix to output the label categories of the sequence of video frames.
2. The movie multi-label classification method according to claim 1,
the step of traversing the sequence of video frames according to the attention matrix to output the label categories of the sequence of video frames comprises:
acquiring a corresponding video feature matrix based on the attention matrix and the video segment features;
forming a two-layer perceptron through the attention matrix and the video feature matrix;
converting the space where the video frame sequence is located into a film category space according to the two-layer perceptron;
outputting a label category of the sequence of video frames in the movie category space.
3. The movie multi-label classification method according to claim 2,
the step of calculating an attention matrix based on the video segment features comprises:
calculating a forward hidden state and a backward hidden state of the video segment features based on a BiLSTM;
and adopting a self-attention mechanism to calculate an attention matrix of forward hidden states and backward hidden states of all the video segment features.
4. The movie multi-label classification method according to claim 3,
the step of calculating the attention matrix of the forward hidden state and the backward hidden state of all the video segment features by using a self-attention mechanism comprises the following steps:
calculating hidden elements of a forward hidden state and a backward hidden state of each video segment characteristic;
acquiring the number of hidden layer nodes of the BiLSTM;
and obtaining the attention matrix based on hidden elements of all the video segment characteristics and the number of the hidden layer nodes.
5. The movie multi-label classification method according to claim 4,
the step of obtaining a corresponding video feature matrix based on the attention matrix and the video segment features includes:
acquiring a hidden element set of all the video clip characteristics;
carrying out normalization processing on the hidden element set by adopting the self-attention mechanism to obtain the attention matrix;
and obtaining the video feature matrix through the product of the attention matrix and the hidden element set.
6. The movie multi-label classification method according to claim 1,
after the step of outputting the label categories of the sequence of video frames, the classification method further comprises:
acquiring a cross entropy loss function of the neural network model;
evaluating a score for the label category based on the cross entropy loss function;
wherein the output layer of the neural network model is a fully connected layer fc7.
7. The movie multi-label classification method according to claim 1,
after the step of acquiring a sequence of consecutive video frames, the classification method further comprises:
calculating the accumulated sum of differences at each gray level between adjacent frames in the continuous video frame sequence;
when the accumulated sum is greater than a preset threshold, superimposing the accumulated sum onto the color histogram of the later of the two adjacent frames;
dividing the superimposed video frame sequence into several video segments in temporal order, and extracting a preset number of video frames from each video segment to form a new video segment sequence.
8. A movie multi-label classification apparatus, comprising:
an obtaining module, configured to obtain a continuous sequence of video frames;
the characteristic extraction module is used for acquiring video segment characteristics of the video frame sequence based on a preset neural network model;
an attention calculation module for calculating an attention matrix based on the video segment features;
and the label classification module is used for traversing the video frame sequence according to the attention matrix to output the label categories of the video frame sequence.
9. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the movie multi-label classification method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon program instructions, which when executed by a processor, implement the movie multi-label classification method according to any one of claims 1 to 7.
CN202010708014.4A 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium Active CN112084371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010708014.4A CN112084371B (en) 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010708014.4A CN112084371B (en) 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112084371A true CN112084371A (en) 2020-12-15
CN112084371B CN112084371B (en) 2024-04-16

Family

ID=73735152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010708014.4A Active CN112084371B (en) 2020-07-21 2020-07-21 Movie multi-label classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112084371B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515997A (en) * 2020-12-28 2021-10-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN114329060A (en) * 2021-12-24 2022-04-12 空间视创(重庆)科技股份有限公司 Method and system for automatically generating multiple labels of video frame based on neural network model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170262995A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks
CN108763326A (en) * 2018-05-04 2018-11-06 南京邮电大学 A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based
CN110209823A (en) * 2019-06-12 2019-09-06 齐鲁工业大学 A kind of multi-tag file classification method and system
CN110516086A (en) * 2019-07-12 2019-11-29 浙江工业大学 One kind being based on deep neural network video display label automatic obtaining method
US20190379628A1 (en) * 2018-06-07 2019-12-12 Arizona Board Of Regents On Behalf Of Arizona State University Method and apparatus for detecting fake news in a social media network
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170262995A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks
CN108763326A (en) * 2018-05-04 2018-11-06 南京邮电大学 A kind of sentiment analysis model building method of the diversified convolutional neural networks of feature based
US20190379628A1 (en) * 2018-06-07 2019-12-12 Arizona Board Of Regents On Behalf Of Arizona State University Method and apparatus for detecting fake news in a social media network
CN110209823A (en) * 2019-06-12 2019-09-06 齐鲁工业大学 A kind of multi-tag file classification method and system
CN110516086A (en) * 2019-07-12 2019-11-29 浙江工业大学 One kind being based on deep neural network video display label automatic obtaining method
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sang Haifeng et al., "Design of a video action recognition network based on recurrent region attention and video frame attention", Acta Electronica Sinica, 15 June 2020 (2020-06-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515997A (en) * 2020-12-28 2021-10-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN113515997B (en) * 2020-12-28 2024-01-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN114329060A (en) * 2021-12-24 2022-04-12 空间视创(重庆)科技股份有限公司 Method and system for automatically generating multiple labels of video frame based on neural network model

Also Published As

Publication number Publication date
CN112084371B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
US20230196117A1 (en) Training method for semi-supervised learning model, image processing method, and device
CN109711481B (en) Neural networks for drawing multi-label recognition, related methods, media and devices
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
Bouwmans et al. Scene background initialization: A taxonomy
Fu et al. Fast crowd density estimation with convolutional neural networks
US11222211B2 (en) Method and apparatus for segmenting video object, electronic device, and storage medium
CN110765860B (en) Tumble judging method, tumble judging device, computer equipment and storage medium
US10339421B2 (en) RGB-D scene labeling with multimodal recurrent neural networks
CN108921225B (en) Image processing method and device, computer equipment and storage medium
CN109033107B (en) Image retrieval method and apparatus, computer device, and storage medium
CN113066017B (en) Image enhancement method, model training method and equipment
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN112132145B (en) Image classification method and system based on model extended convolutional neural network
JP2010157118A (en) Pattern identification device and learning method for the same and computer program
WO2022072199A1 (en) Sparse optical flow estimation
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
WO2024041108A1 (en) Image correction model training method and apparatus, image correction method and apparatus, and computer device
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN114494699B (en) Image semantic segmentation method and system based on semantic propagation and front background perception
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN111027472A (en) Video identification method based on fusion of video optical flow and image space feature weight
CN113869234A (en) Facial expression recognition method, device, equipment and storage medium
CN111079900B (en) Image processing method and device based on self-adaptive connection neural network
CN111126177B (en) Method and device for counting number of people

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant