CN116645630A - Video classification method, device, equipment and medium based on feature fusion - Google Patents
Video classification method, device, equipment and medium based on feature fusion
- Publication number
- CN116645630A (application CN202310611223.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- video
- feature
- normalized
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06V10/40—Extraction of image or video features
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
- G06V30/413—Classification of content, e.g. text, photographs or tables
Abstract
The invention relates to the field of machine learning and the field of financial video, and discloses a video classification method based on feature fusion, which comprises the following steps: calculating the normalized image attention weight of each image in a video to be classified and generating weighted image feature vectors; splicing the weighted image feature vectors to obtain a video feature matrix, calculating the normalized video attention weight of the video feature matrix, and generating a weighted video feature vector; calculating the normalized text attention weight of the text feature matrix of the video to be classified and generating a weighted text feature vector; superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalized comprehensive weight of the fusion feature matrix, and generating a fusion feature vector; and classifying the video to be classified according to the fusion feature vector by using a classifier trained in advance. The invention also provides a video classification device, equipment and medium based on feature fusion. The invention can improve the accuracy of financial video classification.
Description
Technical Field
The present invention relates to the field of machine learning and the field of financial video, and in particular to a video classification method, apparatus, electronic device and computer-readable storage medium based on feature fusion.
Background
Generally, a financial video mainly comprises trend analysis content on funds, futures, the stock market and the like, as well as financial chart content. The related charts are highly similar and the related professional vocabulary overlaps heavily, so the degree of distinction between financial videos is small.
In view of the above, when classifying a financial video it is generally necessary to extract features of multiple modalities of the financial video, for example video features and text features, fuse the multi-modal features, and then classify the financial video according to the fused features by using a machine learning model.
When multi-modal feature fusion is carried out, the features differ in scale, so a pooling operation usually needs to be carried out on the features of each modality. The average pooling or maximum pooling algorithms mainly used for feature pooling force the features of every scale to be spliced with equal weight or with the maximum weight before participating in feature fusion, which limits the accuracy of the fused features.
Disclosure of Invention
The invention provides a video classification method, a video classification device, electronic equipment and a computer readable storage medium based on feature fusion, and aims to improve the accuracy of financial video classification.
In order to achieve the above object, the present invention provides a video classification method based on feature fusion, including:
generating an image sequence of a video to be classified, sequentially calculating normalized image attention weights of each image in the image sequence, and generating weighted image feature vectors corresponding to each image by using the normalized image attention weights;
splicing all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculating normalized video attention weights of the video feature matrix, and generating the video feature vector of the video to be classified by using the normalized video attention weights;
generating a text feature matrix corresponding to the video to be classified, calculating a normalized text attention weight of the text feature matrix, and generating a weighted text feature vector corresponding to the text content by using the normalized text attention weight;
superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalized comprehensive weight of the fusion feature matrix, and generating the fusion feature vector of the video to be classified by using the normalized comprehensive weight;
and classifying the videos to be classified according to the fusion feature vector by using a classifier which is trained in advance.
Optionally, the calculating the normalized image attention weight of each image in the image sequence sequentially includes:
extracting image characteristics of each image in the image sequence to obtain an image characteristic matrix of each image;
sorting the image feature matrix according to the pixel value of each column of the image feature matrix;
converting the ordered image feature matrix into a one-dimensional image feature vector by utilizing a pre-trained full-connection layer;
and activating and normalizing the one-dimensional image feature vector by using a preset activation function to obtain normalized image attention weight of each image.
Optionally, the activating and normalizing the one-dimensional image feature vector by using a preset activation function to obtain a normalized image attention weight of each image includes:
taking the sum of pixels of the one-dimensional image feature vector of each image as a feature weight coefficient of the corresponding image;
performing nonlinear activation on characteristic weight coefficients of each image;
and carrying out linear normalization on the characteristic weight coefficient after nonlinear activation to obtain the normalized image attention weight corresponding to each image.
Optionally, the generating, by using the normalized image attention weight, a weighted image feature vector corresponding to each image includes:
multiplying the normalized image attention weight with the ordered image feature matrix of the corresponding image to obtain a weighted image feature matrix;
and summing the numerical values of each column in the weighted image feature matrix to obtain a weighted image feature vector.
Optionally, the generating a text feature matrix corresponding to the video to be classified includes:
identifying the text content of the video to be classified, and splitting the text content into clauses to obtain a text clause set;
extracting the clause text characteristics of each clause in the text clause set to obtain a clause text characteristic matrix of each clause;
calculating the normalized sentence attention weight of each clause text feature matrix in turn, and generating a weighted sentence feature vector of each clause by using the normalized sentence attention weight;
and splicing all the weighted sentence feature vectors to obtain a text feature matrix of the video to be classified.
Optionally, the generating the image sequence of the video to be classified includes:
framing the video to be classified to obtain a video frame set;
and selecting video frames from the video frame set according to the preset video frame extraction frequency and time sequence to form the image sequence.
In order to solve the above problems, the present invention further provides a video classification device based on feature fusion, the device comprising:
the image feature weight normalization module is used for generating an image sequence of the video to be classified, sequentially calculating normalized image attention weights of each image in the image sequence, and generating weighted image feature vectors corresponding to each image by utilizing the normalized image attention weights;
the video feature weight normalization module is used for splicing all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculating the normalized video attention weight of the video feature matrix, and generating the video feature vector of the video to be classified by using the normalized video attention weight;
the text feature weight normalization module is used for generating a text feature matrix corresponding to the video to be classified, calculating normalized text attention weight of the text feature matrix, and generating a weighted text feature vector corresponding to the text content by utilizing the normalized text attention weight;
the fusion feature weight normalization module is used for superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalization comprehensive weight of the fusion feature matrix, and generating the fusion feature vector of the video to be classified by using the normalization comprehensive weight;
and the video classification module is used for classifying the videos to be classified according to the fusion feature vector by utilizing a classifier which is trained in advance.
Optionally, the image feature weight normalization module calculates a normalized image attention weight of each image in the image sequence by:
extracting image characteristics of each image in the image sequence to obtain an image characteristic matrix of each image;
sorting the image feature matrix according to the pixel value of each column of the image feature matrix;
converting the ordered image feature matrix into a one-dimensional image feature vector by utilizing a pre-trained full-connection layer;
and activating and normalizing the one-dimensional image feature vector by using a preset activation function to obtain normalized image attention weight of each image.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one computer program; and
a processor that executes the program stored in the memory to implement the video classification method based on feature fusion described above.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the above-mentioned feature fusion-based video classification method.
According to the embodiment of the invention, before the video feature vector and the text feature vector of the video to be classified are fused, a corresponding attention weight normalization operation is carried out on the image feature vectors and the video feature matrix that form the video feature vector, and an attention weight normalization operation is carried out on the text feature matrix that forms the text feature vector, which keeps the internal differences of the features of the two modalities stable while they are formed. When the feature vectors of the two modalities are fused, a weight normalization operation is carried out once more on the fused feature matrix. In this way the differences between the features of the two modalities are preserved, while the normalization prevents those differences from jumping abruptly, which improves the accuracy of the final fusion feature matrix and thus benefits the accuracy of financial video classification.
Drawings
Fig. 1 is a flow chart of a video classification method based on feature fusion according to an embodiment of the application;
FIG. 2 is a flowchart illustrating a detailed implementation of one of the steps in a feature fusion-based video classification method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another step in a video classification method based on feature fusion according to an embodiment of the present application;
FIG. 4 is a functional block diagram of a video classification device based on feature fusion according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device for implementing the video classification method based on feature fusion according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application provides a video classification method based on feature fusion. The execution subject of the video classification method based on feature fusion includes, but is not limited to, at least one of a server, a terminal, and the like that can be configured to execute the method provided by the embodiment of the application. In other words, the feature fusion-based video classification method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to fig. 1, a flow chart of a video classification method based on feature fusion according to an embodiment of the invention is shown. In this embodiment, the method for classifying video based on feature fusion includes:
s1, generating an image sequence of a video to be classified, sequentially calculating normalized image attention weights of each image in the image sequence, and generating weighted image feature vectors corresponding to each image by using the normalized image attention weights;
In the embodiment of the invention, the video to be classified can be a video in the financial field, such as a fund or futures trend analysis video, a financial hot-event video, a financial expertise training video and the like.
Illustratively, a financial management platform improves user stickiness by continuously uploading videos on newly launched funds or on financial topics that closely follow current events. The financial videos can be classified according to actual business needs, for example according to the financial products involved, including futures trading videos, stock trading videos and the like; according to the region a financial video affects or where its events occur, including European financial videos, Pearl River Delta financial videos and the like; or according to the professional knowledge field involved, including financial risk management and control videos, financial tax knowledge videos and the like. Classifying the financial videos that the platform pushes or cites can, on the one hand, guarantee the coverage of the financial videos and meet the needs of different users, and on the other hand avoid repetition of financial videos, eliminate redundant videos and save platform storage resources; classifying the financial videos also improves the efficiency with which users search for target videos.
It will be appreciated that a video is composed of a number of video frames, each of which may be understood as an image, and therefore an image sequence of the video to be classified may be composed using the video frames of the video to be classified.
In detail, the generating the image sequence of the video to be classified includes:
framing the video to be classified to obtain a video frame set;
and selecting video frames from the video frame set according to the preset video frame extraction frequency and time sequence to form the image sequence.
It will be appreciated that the video to be classified is composed of a series of video frames, and the image sequence may be formed by extracting a predetermined number of video frames from the video to be classified at a certain frequency. For example, if the duration of the video to be classified is A seconds and 5 video frames are extracted per second, the generated image sequence includes 5×A video frames.
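To make this framing-and-sampling step concrete, a minimal sketch in Python is given below; the use of OpenCV, the function name sample_frames and the default rate of 5 frames per second are illustrative assumptions rather than part of the claimed method.

```python
import cv2

def sample_frames(video_path, frames_per_second=5):
    """Frame the video and keep roughly `frames_per_second` frames per second, in time order."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or frames_per_second  # fall back if FPS is unknown
    step = max(int(round(native_fps / frames_per_second)), 1)    # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # one image of the image sequence (BGR array)
        index += 1
    cap.release()
    return frames

# For an A-second video this returns about 5*A frames, matching the example above.
```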
In the embodiment of the invention, since each image contains features of different scales, the attention weight obtained for each image during convolutional-layer extraction with a deep learning model based on a convolutional neural network (Convolutional Neural Networks, CNN) also differs. Normalizing the image attention weight of each image prevents the differences between the larger-valued weights from being further amplified too quickly during the training of image feature fusion.
In detail, referring to fig. 2, the sequentially calculating the normalized image attention weight of each image in the image sequence includes:
s11, extracting image features of each image in the image sequence to obtain an image feature matrix of each image;
s12, sorting the image feature matrix according to the pixel value of each column of the image feature matrix;
s13, converting the ordered image feature matrix into a one-dimensional image feature vector by utilizing a pre-trained full-connection layer;
s14, activating and normalizing the one-dimensional image feature vector by using a preset activation function to obtain normalized image attention weight of each image.
In the embodiment of the invention, the image characteristics of each image in the image sequence can be sequentially extracted by utilizing a pre-trained ViT (Vision Transformer) model.
The image feature matrix of an image is an (n×m)×768 matrix, i.e. a matrix with n×m rows and 768 columns. For subsequent calculation, the columns of the image feature matrix are ordered from large to small; the ordered image feature matrix is still (n×m)×768.
Preferably, the pre-trained full connection layer may be located at the end of the pre-trained ViT model, and the full connection layer may convert the image feature matrix from a two-dimensional matrix to a one-dimensional image feature vector.
In the embodiment of the present invention, the preset activation function may be a nonlinear activation function commonly used in deep learning, including, but not limited to, a tanh activation function, a sigmoid activation function, and a relu activation function.
In detail, the activating and normalizing the one-dimensional image feature vector by using a preset activation function to obtain a normalized image attention weight of each image includes:
taking the sum of pixels of the one-dimensional image feature vector of each image as a feature weight coefficient of the corresponding image;
performing nonlinear activation on characteristic weight coefficients of each image;
and carrying out linear normalization on the characteristic weight coefficient after nonlinear activation to obtain the normalized image attention weight corresponding to each image.
According to the invention, nonlinear activation and linear normalization are applied to the image features of each image, so that the weight coefficients of the image features at every scale all lie between 0 and 1 and sum to 1. In particular, the saturation region of the nonlinear activation function prevents the differences between the larger weight coefficients from being amplified too quickly, which would otherwise cause severe oscillation during image feature fusion training, and the linear normalization reduces the amount of computation, improving both the stability and the efficiency of the weight coefficient calculation.
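A minimal NumPy sketch of steps S11 to S14 is given below, assuming the ViT features have already been extracted; the stand-in fc_weight vector for the pre-trained fully connected layer and the choice of the sigmoid (one of the activation functions listed above) are illustrative assumptions.

```python
import numpy as np

def normalized_image_attention_weights(feature_matrices, fc_weight):
    """feature_matrices: one (n*m, 768) image feature matrix per image in the sequence.
    fc_weight: (n*m,) stand-in for the pre-trained fully connected layer."""
    coefficients = []
    for features in feature_matrices:
        ordered = -np.sort(-features, axis=0)    # order each column from large to small
        one_dim = fc_weight @ ordered            # collapse to a one-dimensional 768-element vector
        coefficients.append(one_dim.sum())       # feature weight coefficient of this image
    activated = 1.0 / (1.0 + np.exp(-np.asarray(coefficients)))  # saturating (sigmoid) activation
    return activated / activated.sum()           # linear normalization: weights lie in (0, 1) and sum to 1
```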
Further, the generating, by using the normalized image attention weights, a weighted image feature vector corresponding to each image includes:
multiplying the normalized image attention weight with the ordered image feature matrix of the corresponding image to obtain a weighted image feature matrix;
and summing the numerical values of each column in the weighted image feature matrix to obtain a weighted image feature vector.
In the embodiment of the invention, the normalized image attention weight is calculated for each image, i.e. for the minimum unit corresponding to the video to be classified, which on the one hand preserves the feature scale difference of each image, and on the other hand reduces the risk that the feature scale differences of the images are further enlarged in the subsequent feature fusion.
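Continuing the sketch under the same assumptions, a single image's weighted image feature vector could then be formed as follows; the inputs are assumed to be NumPy arrays produced by the previous steps.

```python
def weighted_image_feature_vector(sorted_features, weight):
    """sorted_features: the (n*m, 768) column-ordered image feature matrix of one image (NumPy array).
    weight: that image's normalized image attention weight (a scalar in (0, 1))."""
    weighted_matrix = weight * sorted_features   # weighted image feature matrix
    return weighted_matrix.sum(axis=0)           # sum each column -> weighted image feature vector of length 768
```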
S2, splicing all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculating normalized video attention weights of the video feature matrix, and generating the video feature vector of the video to be classified by using the normalized video attention weights;
For example, if the duration of the video to be classified is A seconds and 5 video frames are extracted per second, the generated image sequence includes 5×A video frames. Accordingly, each video frame corresponds to one weighted image feature vector, and each weighted image feature vector is a one-dimensional vector; if that vector is a 1×768 vector, the video feature matrix is a two-dimensional (5×A)×768 feature matrix.
In the embodiment of the present invention, the method for calculating the normalized video attention weight of the video feature matrix and generating the video feature vector of the video to be classified by using the normalized video attention weight is the same as the method for calculating the normalized image attention weight of each image in the image sequence and generating the weighted image feature vector corresponding to each image by using the normalized image attention weight, and is not described here again.
In the embodiment of the invention, after the normalization weighting has been performed on the image features of the minimum units corresponding to the video to be classified, the normalization weighting operation is further performed on the overall video feature matrix, which facilitates the subsequent fusion with the features of the other modalities of the video to be classified.
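Step S2 can be sketched in the same spirit by stacking the weighted image feature vectors and reusing the weight-and-collapse procedure above; because there is only one video feature matrix, the sketch applies the activation output directly as the normalized video attention weight, which is an assumption about how the normalization behaves for a single matrix.

```python
import numpy as np

def video_feature_vector(weighted_image_vectors, fc_weight):
    """weighted_image_vectors: list of (768,) weighted image feature vectors, one per sampled frame.
    fc_weight: (num_frames,) stand-in for the fully connected layer at the video level."""
    video_matrix = np.stack(weighted_image_vectors, axis=0)   # (num_frames, 768) video feature matrix
    ordered = -np.sort(-video_matrix, axis=0)                 # same column-wise ordering as for images
    coefficient = (fc_weight @ ordered).sum()                 # same projection and summation
    weight = 1.0 / (1.0 + np.exp(-coefficient))               # normalized video attention weight (assumed sigmoid)
    return (weight * video_matrix).sum(axis=0)                # (768,) video feature vector
```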
S3, generating a text feature matrix corresponding to the video to be classified, calculating normalized text attention weight of the text feature matrix, and generating a weighted text feature vector corresponding to the text content by using the normalized text attention weight.
It can be appreciated that, when classifying a video, the video features of the video itself and the text features corresponding to the video are the ones most often considered, so the video to be classified is classified based on the features of these two modalities, the video features and the text features.
In detail, referring to fig. 3, the generating a text feature matrix corresponding to the video to be classified includes:
S31, identifying the text content of the video to be classified, and splitting the text content into clauses to obtain a text clause set;
S32, extracting the clause text features of each clause in the text clause set to obtain a clause text feature matrix of each clause;
S33, sequentially calculating the normalized sentence attention weight of each clause text feature matrix, and generating a weighted sentence feature vector of each clause by using the normalized sentence attention weight;
and S34, splicing all the weighted sentence feature vectors to obtain a text feature matrix of the video to be classified.
In the embodiment of the invention, the text content corresponding to the video to be classified can be identified by utilizing ASR (Automatic Speech Recognition) or OCR (Optical Character Recognition) technology.
It can be understood that, in general, the text content may be divided into different clauses, each clause may be further divided into different word segments, and text features corresponding to each word segment may be extracted by using a pre-trained BERT model, so as to obtain text features of each clause and the entire text content.
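A sketch of the text branch (steps S31 to S34) up to the clause text feature matrices is given below, assuming the Hugging Face transformers library with the bert-base-chinese checkpoint and a simple punctuation-based clause split; both choices are illustrative assumptions, and each returned matrix would then be weighted and collapsed exactly as in the image branch.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def clause_feature_matrices(text_content):
    """Split the recognized text content into clauses and extract one (num_tokens, 768)
    clause text feature matrix per clause."""
    clauses = [c.strip() for c in re.split(r"[。！？；!?.;]", text_content) if c.strip()]
    matrices = []
    for clause in clauses:
        inputs = tokenizer(clause, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state[0]   # token-level features for this clause
        matrices.append(hidden.numpy())
    return matrices
```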
In the embodiment of the present invention, the method for calculating the normalized sentence attention weight of each clause text feature matrix and generating the weighted sentence feature vector of each clause by using the normalized sentence attention weight is the same as the method for calculating the normalized image attention weight of each image in the image sequence and generating the weighted image feature vector corresponding to each image by using the normalized image attention weight, and is not described here again.
Likewise, the method for calculating the normalized text attention weight corresponding to the text feature matrix and generating the weighted text feature vector corresponding to the text content by using the normalized text attention weight is the same as the method used for the images, and is not described here again.
In the embodiment of the invention, the text feature matrix follows the same principle as the video feature matrix of the video to be classified: the text feature matrix is constructed from the clause text feature vectors, each clause text feature vector is normalized with the attention weight of its clause, and the text feature matrix is finally normalized with the text attention weight, which ensures the stability of the internal differences of the text feature matrix.
S4, superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalized comprehensive weight of the fusion feature matrix, generating the fusion feature vector of the video to be classified by using the normalized comprehensive weight, and classifying the video to be classified according to the fusion feature vector by using a classifier which is trained in advance.
In the embodiment of the invention, the video feature vector and the text feature vector are one-dimensional vectors, and they are superposed to obtain a two-dimensional fusion feature matrix. The normalized comprehensive weight of the fusion feature matrix is generated by the same method as that used for calculating the normalized image attention weight of each image. Further, after the fusion feature matrix is multiplied by the normalized comprehensive weight, the values in each column of the multiplied fusion feature matrix are summed to obtain the fusion feature vector.
In the embodiment of the invention, the fusion feature vector thus preserves the differences between the features of the two modalities, while the normalization operation prevents those feature differences from jumping abruptly, thereby improving the accuracy of the final fusion feature matrix.
In the embodiment of the invention, the classifier which is trained in advance can be an MLP model, and the MLP model is utilized to classify the videos to be classified based on the fusion feature vector.
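Finally, the fusion step and the classifier can be sketched together under the same assumptions; the two-row fusion matrix, the sigmoid stand-in for the normalized comprehensive weight and the small two-layer MLP (whose weights stand in for the pre-trained classifier) are all illustrative.

```python
import numpy as np

def fusion_feature_vector(video_vec, text_vec, fc_weight):
    """video_vec, text_vec: (768,) vectors; fc_weight: (2,) stand-in fully connected layer."""
    fused_matrix = np.stack([video_vec, text_vec], axis=0)   # (2, 768) fusion feature matrix
    ordered = -np.sort(-fused_matrix, axis=0)
    coefficient = (fc_weight @ ordered).sum()
    weight = 1.0 / (1.0 + np.exp(-coefficient))              # normalized comprehensive weight (assumed sigmoid)
    return (weight * fused_matrix).sum(axis=0)               # (768,) fusion feature vector

def mlp_classify(fusion_vec, w1, b1, w2, b2):
    """Two-layer MLP stand-in for the pre-trained classifier; returns the predicted class index."""
    hidden = np.maximum(0.0, fusion_vec @ w1 + b1)           # ReLU hidden layer
    logits = hidden @ w2 + b2
    return int(np.argmax(logits))
```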
According to the embodiment of the invention, before the video feature vector and the text feature vector of the video to be classified are fused, a corresponding attention weight normalization operation is carried out on the image feature vectors and the video feature matrix that form the video feature vector, and an attention weight normalization operation is carried out on the text feature matrix that forms the text feature vector, which keeps the internal differences of the features of the two modalities stable while they are formed. When the feature vectors of the two modalities are fused, a weight normalization operation is carried out once more on the fused feature matrix. In this way the differences between the features of the two modalities are preserved, while the normalization prevents those differences from jumping abruptly, which improves the accuracy of the final fusion feature matrix and thus benefits the accuracy of financial video classification.
Fig. 4 is a functional block diagram of a video classification device based on feature fusion according to an embodiment of the present invention.
The video classification device 100 based on feature fusion according to the present invention may be installed in an electronic device. According to the implemented functions, the video classification device 100 based on feature fusion includes: an image feature weight normalization module 101, a video feature weight normalization module 102, a text feature weight normalization module 103, a fusion feature weight normalization module 104, and a video classification module 105. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the image feature weight normalization module 101 is configured to generate an image sequence of a video to be classified, sequentially calculate a normalized image attention weight of each image in the image sequence, and generate a weighted image feature vector corresponding to each image by using the normalized image attention weight;
the video feature weight normalization module 102 is configured to splice all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculate a normalized video attention weight of the video feature matrix, and generate a video feature vector of the video to be classified by using the normalized video attention weight;
The text feature weight normalization module 103 is configured to generate a text feature matrix corresponding to the video to be classified, calculate a normalized text attention weight of the text feature matrix, and generate a weighted text feature vector corresponding to the text content by using the normalized text attention weight;
the fusion feature weight normalization module 104 is configured to superimpose the video feature vector and the text feature vector to obtain a fusion feature matrix, calculate a normalized comprehensive weight of the fusion feature matrix, and generate a fusion feature vector of the video to be classified by using the normalized comprehensive weight;
the video classification module 105 is configured to classify the video to be classified according to the fusion feature vector by using a classifier that completes training in advance.
In detail, each module in the video classification device 100 based on feature fusion in the embodiment of the present invention adopts the same technical means as the video classification method based on feature fusion described in fig. 1 to 3, and can produce the same technical effects, which are not described herein.
Fig. 5 is a schematic structural diagram of an electronic device for implementing a feature fusion-based video classification method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a video classification program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of a video classification program, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects respective components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (e.g., video classification programs, etc.) stored in the memory 11, and calling data stored in the memory 11.
The bus may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 5 shows only an electronic device with certain components; it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, which may comprise fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and do not limit the scope of the patent application to this configuration.
The video classification program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
generating an image sequence of a video to be classified, sequentially calculating normalized image attention weights of each image in the image sequence, and generating weighted image feature vectors corresponding to each image by using the normalized image attention weights;
splicing all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculating normalized video attention weights of the video feature matrix, and generating the video feature vector of the video to be classified by using the normalized video attention weights;
generating a text feature matrix corresponding to the video to be classified, calculating a normalized text attention weight of the text feature matrix, and generating a weighted text feature vector corresponding to the text content by using the normalized text attention weight;
superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalized comprehensive weight of the fusion feature matrix, and generating the fusion feature vector of the video to be classified by using the normalized comprehensive weight;
and classifying the videos to be classified according to the fusion feature vector by using a classifier which is trained in advance.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
generating an image sequence of a video to be classified, sequentially calculating normalized image attention weights of each image in the image sequence, and generating weighted image feature vectors corresponding to each image by using the normalized image attention weights;
splicing all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculating normalized video attention weights of the video feature matrix, and generating the video feature vector of the video to be classified by using the normalized video attention weights;
generating a text feature matrix corresponding to the video to be classified, calculating a normalized text attention weight of the text feature matrix, and generating a weighted text feature vector corresponding to the text content by using the normalized text attention weight;
superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalized comprehensive weight of the fusion feature matrix, and generating the fusion feature vector of the video to be classified by using the normalized comprehensive weight;
and classifying the videos to be classified according to the fusion feature vector by using a classifier which is trained in advance.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiment of the application can acquire and process the related data based on holographic projection technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. Terms such as first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application and not for limiting the same, and although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present application without departing from the spirit and scope of the technical solution of the present application.
Claims (10)
1. A video classification method based on feature fusion, the method comprising:
Generating an image sequence of a video to be classified, sequentially calculating normalized image attention weights of each image in the image sequence, and generating weighted image feature vectors corresponding to each image by using the normalized image attention weights;
splicing all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculating normalized video attention weights of the video feature matrix, and generating the video feature vector of the video to be classified by using the normalized video attention weights;
generating a text feature matrix corresponding to the video to be classified, calculating a normalized text attention weight of the text feature matrix, and generating a weighted text feature vector corresponding to the text content by using the normalized text attention weight;
superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalized comprehensive weight of the fusion feature matrix, and generating the fusion feature vector of the video to be classified by using the normalized comprehensive weight;
and classifying the videos to be classified according to the fusion feature vector by using a classifier which is trained in advance.
2. The feature fusion-based video classification method of claim 1, wherein said sequentially calculating normalized image attention weights for each image in said sequence of images comprises:
extracting image characteristics of each image in the image sequence to obtain an image characteristic matrix of each image;
sorting the image feature matrix according to the pixel value of each column of the image feature matrix;
converting the ordered image feature matrix into a one-dimensional image feature vector by utilizing a pre-trained full-connection layer;
and activating and normalizing the one-dimensional image feature vector by using a preset activation function to obtain normalized image attention weight of each image.
3. The method for classifying video based on feature fusion according to claim 2, wherein said activating and normalizing the feature vectors of the one-dimensional image by using a preset activation function to obtain a normalized image attention weight of each image comprises:
taking the sum of pixels of the one-dimensional image feature vector of each image as a feature weight coefficient of the corresponding image;
performing nonlinear activation on characteristic weight coefficients of each image;
and carrying out linear normalization on the characteristic weight coefficient after nonlinear activation to obtain the normalized image attention weight corresponding to each image.
4. The feature fusion-based video classification method of claim 2, wherein said generating a weighted image feature vector for each of said images using said normalized image attention weights comprises:
multiplying the normalized image attention weight with the ordered image feature matrix of the corresponding image to obtain a weighted image feature matrix;
and summing the numerical values of each column in the weighted image feature matrix to obtain a weighted image feature vector.
5. The method for classifying video based on feature fusion according to claim 1, wherein the generating a text feature matrix corresponding to the video to be classified comprises:
identifying the text content of the video to be classified, and splitting the text content into clauses to obtain a text clause set;
extracting the clause text characteristics of each clause in the text clause set to obtain a clause text characteristic matrix of each clause;
calculating the normalized sentence attention weight of each clause text feature matrix in turn, and generating a weighted sentence feature vector of each clause by using the normalized sentence attention weight;
and splicing all the weighted sentence feature vectors to obtain a text feature matrix of the video to be classified.
6. The feature fusion-based video classification method of claim 1, wherein the generating an image sequence of the video to be classified comprises:
framing the video to be classified to obtain a video frame set;
and selecting video frames from the video frame set according to the preset video frame extraction frequency and time sequence to form the image sequence.
7. A video classification device based on feature fusion, the device comprising:
the image feature weight normalization module is used for generating an image sequence of the video to be classified, sequentially calculating normalized image attention weights of each image in the image sequence, and generating weighted image feature vectors corresponding to each image by utilizing the normalized image attention weights;
the video feature weight normalization module is used for splicing all weighted image feature vectors to obtain a video feature matrix of the video to be classified, calculating the normalized video attention weight of the video feature matrix, and generating the video feature vector of the video to be classified by using the normalized video attention weight;
the text feature weight normalization module is used for generating a text feature matrix corresponding to the video to be classified, calculating normalized text attention weight of the text feature matrix, and generating a weighted text feature vector corresponding to the text content by utilizing the normalized text attention weight;
the fusion feature weight normalization module is used for superposing the video feature vector and the text feature vector to obtain a fusion feature matrix, calculating the normalization comprehensive weight of the fusion feature matrix, and generating the fusion feature vector of the video to be classified by using the normalization comprehensive weight;
and the video classification module is used for classifying the videos to be classified according to the fusion feature vector by utilizing a classifier which is trained in advance.
8. The feature fusion-based video classification apparatus of claim 7, wherein the image feature weight normalization module calculates a normalized image attention weight for each image in the sequence of images by:
extracting image characteristics of each image in the image sequence to obtain an image characteristic matrix of each image;
sorting the image feature matrix according to the pixel value of each column of the image feature matrix;
converting the ordered image feature matrix into a one-dimensional image feature vector by utilizing a pre-trained full-connection layer;
and activating and normalizing the one-dimensional image feature vector by using a preset activation function to obtain normalized image attention weight of each image.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the feature fusion-based video classification method of any one of claims 1 to 6.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the feature fusion based video classification method of any of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310611223.0A | 2023-05-25 | 2023-05-25 | Video classification method, device, equipment and medium based on feature fusion |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310611223.0A | 2023-05-25 | 2023-05-25 | Video classification method, device, equipment and medium based on feature fusion |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116645630A | 2023-08-25 |

Family

ID=87639363

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310611223.0A (Pending) | Video classification method, device, equipment and medium based on feature fusion | 2023-05-25 | 2023-05-25 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN116645630A (en) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |