CN111489378A - Video frame feature extraction method and device, computer equipment and storage medium - Google Patents

Video frame feature extraction method and device, computer equipment and storage medium

Info

Publication number
CN111489378A
CN111489378A (application CN202010596100.0A)
Authority
CN
China
Prior art keywords
video frame
feature
information
sample
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010596100.0A
Other languages
Chinese (zh)
Other versions
CN111489378B (en)
Inventor
姜博源
罗栋豪
翁俊武
王亚彪
丁鹏
汪铖杰
李季檩
黄飞跃
吴永坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010596100.0A priority Critical patent/CN111489378B/en
Publication of CN111489378A publication Critical patent/CN111489378A/en
Application granted granted Critical
Publication of CN111489378B publication Critical patent/CN111489378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/246 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/24 — Pattern recognition; classification techniques
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06T 2207/10016 — Image acquisition modality: video; image sequence
    • G06T 2207/20221 — Special algorithmic details: image fusion; image merging

Abstract

The embodiments of this application disclose a video frame feature extraction method and apparatus, a computer device, and a storage medium, and belong to the field of computer technology. The method comprises: acquiring a plurality of video frames; performing feature extraction on each video frame to obtain initial feature information of each video frame; performing motion recognition according to the initial feature information of the plurality of video frames to obtain motion feature information of the plurality of video frames; comparing the motion feature information of the plurality of video frames to obtain weight information of each video frame; and fusing the initial feature information of each video frame with the corresponding weight information to obtain target feature information of each video frame. Information unrelated to motion features in each video frame is weakened, improving the accuracy of the motion feature information of the plurality of video frames, and the motion feature information within the target feature information of each video frame is enhanced, thereby improving the accuracy of the target feature information and enabling data computation on the video frames.

Description

Video frame feature extraction method and device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a video frame feature extraction method and device, computer equipment and a storage medium.
Background
With the development of computer technology, video data has become increasingly abundant, and the ways of processing it have become increasingly diverse, such as video data classification and video data segmentation. When video data is classified or segmented, processing is generally performed according to feature information of the video data, so it is important to extract that feature information accurately.
In the related art, a plurality of video frames in the video data are usually acquired, and feature extraction is performed on each video frame to obtain feature information of each video frame. Because this approach extracts features from each video frame independently, the accuracy of the obtained feature information is poor.
Disclosure of Invention
The embodiments of this application provide a video frame feature extraction method and apparatus, a computer device, and a storage medium, which can improve the accuracy of feature information. The technical solution includes the following.
In one aspect, a method for extracting features of a video frame is provided, where the method includes:
acquiring a plurality of video frames in the same video data;
respectively extracting features of each video frame to obtain initial feature information of each video frame, wherein the initial feature information comprises initial features corresponding to a plurality of feature dimensions;
performing motion identification according to the initial feature information of the plurality of video frames to obtain motion feature information of the plurality of video frames, wherein the motion feature information comprises motion features corresponding to the plurality of feature dimensions;
comparing the motion characteristic information of the plurality of video frames to obtain weight information of each video frame, wherein the weight information comprises weights corresponding to the plurality of characteristic dimensions, and the weights represent the influence degree of the characteristic dimensions on the motion characteristics of the video frames;
and respectively carrying out fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame.
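To make the claimed flow concrete, the following is a minimal PyTorch sketch of the four stages: per-frame feature extraction, motion recognition across frames, comparison into per-dimension weights, and fusion with the initial features. The module choices (a single convolution as the backbone, global average pooling for dimension reduction, a 1-D convolution plus sigmoid as the comparison) and all names are illustrative assumptions, not the patent's required implementation.

```python
import torch
import torch.nn as nn

class MotionAttentionSketch(nn.Module):
    """Illustrative sketch of the claimed pipeline; module choices are assumptions."""
    def __init__(self, channels=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, kernel_size=3, padding=1)        # feature extraction
        self.reduce = nn.AdaptiveAvgPool2d(1)                                   # dimension reduction
        self.compare = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # temporal comparison

    def forward(self, frames):                       # frames: (T, 3, H, W)
        init = self.backbone(frames)                 # initial feature information, (T, C, H, W)
        pooled = self.reduce(init).flatten(1)        # (T, C): one value per feature dimension
        # motion features: difference between adjacent frames; last frame gets a preset (zero) vector
        motion = torch.cat([pooled[1:] - pooled[:-1], torch.zeros_like(pooled[:1])], dim=0)
        # compare motion features across frames and map them to per-dimension weights
        weights = torch.sigmoid(self.compare(motion.t().unsqueeze(0))).squeeze(0).t()  # (T, C)
        # fusion: re-weight each frame's initial features dimension by dimension
        return init * weights.unsqueeze(-1).unsqueeze(-1)                       # target feature information
```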
In one possible implementation manner, the training of the feature extraction model, the motion recognition model, the weight obtaining model, and the attention fusion model according to the target sample feature information of the plurality of sample video frames includes:
calling a classification model, and classifying the target sample characteristic information of each sample video frame to obtain the category characteristic information of each sample video frame, wherein the category characteristic information comprises characteristic values corresponding to a plurality of action categories;
fusing the category characteristic information of the plurality of sample video frames to obtain fused category characteristic information;
determining the action category to which the maximum characteristic value in the fusion category characteristic information belongs as a target action category of the sample video;
determining a second loss value of the feature extraction model according to a difference between a target action category of the sample video and a sample action category of the sample video, wherein the second loss value is in a positive correlation with the difference;
and training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the second loss value.
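A hedged sketch of this training branch follows: each frame's target sample features are classified, the per-frame class features are fused, and a loss is computed that grows with the mismatch between the predicted and labeled action category. Cross-entropy and mean fusion are assumptions; the patent only requires a second loss positively correlated with the difference.

```python
import torch
import torch.nn.functional as F

def classification_loss_sketch(target_feats, label, classifier):
    """target_feats: (T, C) per-frame target sample features; label: scalar long tensor.
    classifier: any callable mapping (T, C) -> (T, num_actions). All names are illustrative."""
    class_feats = classifier(target_feats)           # category feature information per frame
    fused = class_feats.mean(dim=0, keepdim=True)    # fuse class features over the T frames
    predicted = fused.argmax(dim=1)                  # target action category of the sample video
    # the patent only requires the loss to grow with the label mismatch;
    # cross-entropy on the fused class features is one common choice
    loss = F.cross_entropy(fused, label.view(1))
    return loss, predicted
```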
In another aspect, a method for extracting features of a video frame is provided, where the method includes:
acquiring a plurality of sample video frames in the same sample video data;
calling a feature extraction model, and respectively extracting features of each sample video frame to obtain target sample feature information of each sample video frame;
for any target feature dimension, determining the similarity of the target feature dimension according to the sum of the similarities between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames;
determining a first loss value of the feature extraction model according to the similarity of a preset number of target feature dimensions, wherein the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation relationship;
training the feature extraction model according to the first loss value;
and calling the trained feature extraction model to extract the features of any video frame to obtain the feature information of any video frame.
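The "first loss" described above can be sketched as follows. Treating each feature dimension's sample feature as a flattened vector, using cosine similarity as the per-dimension similarity measure, and selecting the top-k dimensions as the "preset number of target feature dimensions" are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def first_loss_sketch(target_feats, k):
    """target_feats: (T, C, D) -- T sample frames, C feature dimensions, each dimension's
    sample feature flattened to a D-vector. The loss grows with the similarity of the k
    most similar feature dimensions, summed over every pair of sample frames."""
    T, C, _ = target_feats.shape
    dim_similarity = target_feats.new_zeros(C)
    for i in range(T):
        for j in range(i + 1, T):                    # every two sample video frames
            dim_similarity += F.cosine_similarity(target_feats[i], target_feats[j], dim=-1)
    top_k, _ = dim_similarity.topk(k)                # preset number of target feature dimensions
    return top_k.mean()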
In one possible implementation, the method further includes:
calling a classification model, and classifying the target sample characteristics of each sample video frame to obtain the category characteristic information of each sample video frame, wherein the category characteristic information comprises characteristic values corresponding to a plurality of categories;
fusing the category characteristic information of the plurality of sample video frames to obtain fused category characteristic information;
determining the category to which the maximum characteristic value in the fusion category characteristic information belongs as a target category of the sample video;
determining a second loss value of the feature extraction model according to a difference between a target class of the sample video and a sample class of the sample video, wherein the second loss value is in a positive correlation with the difference;
the training the feature extraction model according to the first loss value includes:
and training the feature extraction model according to the first loss value and the second loss value.
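Building on the two loss sketches above, a joint training step might combine them as follows; equal weighting of the two losses is an assumption, since the patent does not specify how they are combined.

```python
def train_step_sketch(optimizer, flat_feats, pooled_feats, label, classifier, k=16):
    """flat_feats: (T, C, D) features for the similarity loss; pooled_feats: (T, C) for classification.
    Reuses first_loss_sketch and classification_loss_sketch from the sketches above."""
    loss = first_loss_sketch(flat_feats, k) + classification_loss_sketch(pooled_feats, label, classifier)[0]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```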
In another aspect, an apparatus for extracting features of a video frame is provided, the apparatus including:
the video frame acquisition module is used for acquiring a plurality of video frames in the same video data;
the feature extraction module is used for respectively extracting features of each video frame to obtain initial feature information of each video frame, wherein the initial feature information comprises initial features corresponding to a plurality of feature dimensions;
the motion identification module is used for carrying out motion identification according to the initial characteristic information of the plurality of video frames to obtain the motion characteristic information of the plurality of video frames, wherein the motion characteristic information comprises motion characteristics corresponding to the plurality of characteristic dimensions;
the comparison processing module is used for performing comparison processing on the motion characteristic information of the plurality of video frames to obtain weight information of each video frame, wherein the weight information comprises weights corresponding to the plurality of characteristic dimensions, and the weights represent the influence degree of the characteristic dimensions on the motion characteristics of the video frames;
and the first fusion processing module is used for respectively carrying out fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame.
In a possible implementation manner, the motion recognition module is configured to compare initial feature information of any two adjacent video frames in the plurality of video frames to obtain motion feature information of a first video frame in the any two video frames.
In another possible implementation manner, the motion recognition module includes:
the dimension reduction processing unit is used for carrying out dimension reduction processing on each initial feature in the initial feature information of the first video frame and the second video frame in any two video frames;
and the characteristic information determining unit is used for determining the difference characteristic information between the characteristic information of the first video frame after the dimension reduction processing and the characteristic information of the second video frame after the dimension reduction processing as the motion characteristic information of the first video frame.
In another possible implementation manner, the apparatus further includes:
and the characteristic information determining module is used for determining preset characteristic information as the motion characteristic information of the last video frame in the plurality of video frames.
In another possible implementation manner, the comparison processing module includes:
the fusion processing unit is used for performing fusion processing on the motion characteristic information of the video frame and the motion characteristic information of at least one video frame before the video frame to obtain fusion motion characteristic information of the video frame, wherein the fusion motion characteristic information comprises fusion motion characteristics corresponding to the multiple characteristic dimensions;
and the normalization processing unit is used for performing normalization processing on the plurality of fusion motion characteristics in the fusion motion characteristic information and taking the fusion motion characteristic information after the normalization processing as the weight information.
In another possible implementation manner, the apparatus further includes:
and the second fusion processing module is used for performing fusion processing on the motion characteristic information of the video frame and the motion characteristic information of the last video frame in the plurality of video frames in response to the fact that the video frame is the first video frame in the plurality of video frames, so as to obtain the fusion motion characteristic information of the video frames.
In another possible implementation manner, the apparatus further includes:
the classification processing module is used for classifying the target characteristic information of each video frame to obtain the category characteristic information of each video frame, wherein the category characteristic information comprises characteristic values corresponding to a plurality of action categories;
the information fusion module is used for fusing the category characteristic information of the video frames to obtain fusion category characteristic information;
and the category determining module is used for determining the action category to which the maximum characteristic value in the fusion category characteristic information belongs as the action category of the video data.
In another possible implementation manner, the feature extraction module is further configured to invoke a feature extraction model, and perform feature extraction on each video frame respectively to obtain initial feature information of each video frame;
the motion recognition module is further used for calling a motion recognition model, and performing motion recognition according to the initial characteristic information of the plurality of video frames to obtain the motion characteristic information of the plurality of video frames;
the comparison processing module is further configured to invoke a weight obtaining model, and compare the motion characteristic information of the plurality of video frames to obtain weight information of each video frame;
the first fusion processing module is further configured to invoke an attention fusion model, and perform fusion processing on the initial feature information of each video frame and the corresponding weight information respectively to obtain target feature information of each video frame.
In another possible implementation, the apparatus further includes:
the video frame acquisition module is further used for acquiring a plurality of sample video frames in the same sample video data;
the feature extraction module is further configured to invoke the feature extraction model, and perform feature extraction on each sample video frame respectively to obtain initial sample feature information of each sample video frame, where the initial sample feature information includes initial sample features corresponding to multiple feature dimensions;
the motion recognition module is further configured to invoke the motion recognition model, perform motion recognition according to initial sample feature information of the plurality of sample video frames, and obtain motion sample feature information of the plurality of sample video frames, where the motion sample feature information includes motion sample features corresponding to the plurality of feature dimensions;
the comparison processing module is further configured to invoke the weight obtaining model, perform comparison processing on the motion sample feature information of the plurality of sample video frames, and obtain sample weight information of each sample video frame, where the sample weight information includes weights corresponding to the plurality of feature dimensions;
the first fusion processing module is further configured to invoke the attention fusion model, and perform fusion processing on the initial sample feature information of each sample video frame and the corresponding sample weight information respectively to obtain target sample feature information of each sample video frame;
and the model training module is used for training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the target sample feature information of the plurality of sample video frames.
In another possible implementation manner, the model training module includes:
the similarity determining unit is used for determining the similarity of the target feature dimension according to the similarity between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames for any target feature dimension;
the first loss value determining unit is used for determining a first loss value of the feature extraction model according to the similarity of a preset number of target feature dimensions, wherein the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation relationship;
and the first model training unit is used for training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the first loss value.
In another possible implementation manner, the model training module includes:
the classification processing unit is used for calling a classification model and classifying the target sample feature information of each sample video frame to obtain the category feature information of each sample video frame, wherein the category feature information comprises feature values corresponding to a plurality of action categories;
the information fusion unit is used for fusing the category characteristic information of the plurality of sample video frames to obtain fusion category characteristic information;
a category determining unit, configured to determine an action category to which a maximum feature value in the fusion category feature information belongs as a target action category of the sample video;
a second loss value determination unit, configured to determine a second loss value of the feature extraction model according to a difference between a target motion category of the sample video and a sample motion category of the sample video, where the second loss value and the difference have a positive correlation;
and the second model training unit is used for training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the second loss value.
In another aspect, an apparatus for extracting features of a video frame is provided, the apparatus including:
the video frame acquisition module is used for acquiring a plurality of sample video frames in the same sample video data;
the characteristic extraction module is used for calling a characteristic extraction model and respectively extracting the characteristics of each sample video frame to obtain the target sample characteristic information of each sample video frame;
the similarity determining module is used for determining the similarity of the target feature dimension according to the sum of the similarities between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames for any target feature dimension;
the first loss value determining module is used for determining a first loss value of the feature extraction model according to the similarity of a preset number of target feature dimensions, wherein the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation relationship;
the model training module is used for training the feature extraction model according to the first loss value;
and the feature extraction module is used for calling the trained feature extraction model to extract features of any video frame to obtain feature information of any video frame.
In one possible implementation, the apparatus further includes:
the classification processing module is used for calling a classification model and classifying the target sample characteristics of each sample video frame to obtain the category characteristic information of each sample video frame, wherein the category characteristic information comprises characteristic values corresponding to a plurality of action categories;
the information fusion module is used for fusing the category characteristic information of the plurality of sample video frames to obtain fusion category characteristic information;
the category determination module is used for determining the action category to which the maximum characteristic value in the fusion category characteristic information belongs as the target action category of the sample video;
a second loss value determining module, configured to determine a second loss value of the feature extraction model according to a difference between a target motion category of the sample video and a sample motion category of the sample video, where the second loss value and the difference have a positive correlation;
the model training module comprises:
and the model training unit is used for training the feature extraction model according to the first loss value and the second loss value.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the video frame feature extraction method according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the video frame feature extraction method according to the above aspect.
In yet another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the operations of the video frame feature extraction method according to the above aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
according to the method, the device and the storage medium provided by the embodiment of the application, after the initial feature information of a plurality of video frames in the same video frame data is obtained, the initial feature information of the plurality of video frames is subjected to motion recognition, information irrelevant to motion features in each video frame is weakened, and the accuracy of the motion feature information of the plurality of video frames is improved. And then comparing the motion characteristic information of the plurality of video frames to analyze the relevance of the motion characteristic information of the plurality of video frames, determining the influence degree of each characteristic dimension of the video frames on the motion characteristic of the video frames, thereby obtaining the weight information of each video frame, improving the accuracy of the weight information of each video frame, respectively fusing the initial characteristic information of each video frame and the corresponding weight information, enhancing the motion characteristic information in the target characteristic information of each video frame, weakening the information irrelevant to the motion characteristic, and improving the accuracy of the target characteristic information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of a video frame feature extraction method provided in an embodiment of the present application;
fig. 3 is a flowchart of a video frame feature extraction method provided in an embodiment of the present application;
fig. 4 is a flowchart for acquiring target feature information of each video frame according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a video frame compared with a target feature according to an embodiment of the present application;
FIG. 6 is a flowchart of a model training method for video frame feature extraction according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram for obtaining similarity of target feature dimensions according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a network model provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a video frame compared with a target feature according to an embodiment of the present application;
fig. 10 is a flowchart of a video frame feature extraction method provided in an embodiment of the present application;
fig. 11 is a schematic structural diagram of a video frame feature extraction apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a video frame feature extraction apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a video frame feature extraction apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a video frame feature extraction apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
The terms "first," "second," and the like as used herein may be used herein to describe various concepts that are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image, without departing from the scope of the present application.
As used herein, "at least one" includes one, two, or more than two; "a plurality" includes two or more than two; "each" refers to each one of the corresponding plurality; and "any" refers to any one of the plurality. For example, if the plurality of video frames includes 3 video frames, "each" refers to each of the 3 video frames, and "any" refers to any one of them, which may be the first video frame, the second video frame, or the third video frame.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) technology is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and performing further image processing so that the result is an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and more. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve its performance.
Cloud technology refers to a hosting technology for unifying serial resources such as hardware, software, network and the like in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied in a cloud computing business model; it can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites, and other web portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark that needs to be transmitted to a background system for logic processing; data at different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
Cloud computing (cloud computing) is a computing model that distributes computing tasks over a pool of resources formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.
As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to select and use.
According to the logic function division, a PaaS (Platform as a Service) layer can be deployed on an IaaS (Infrastructure as a Service) layer, a SaaS (Software as a Service) layer is deployed on the PaaS layer, and the SaaS can be directly deployed on the IaaS. PaaS is a platform on which software runs, such as a database, a Web (World Wide Web) container, and the like. SaaS is a variety of business software, such as web portal, sms, and mass texting. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
The scheme provided by the embodiment of the application is based on artificial intelligence and cloud technology, the model for extracting the video frame features can be obtained through training, the feature information of the video frame can be obtained through calling the trained model, and data calculation of the video frame is achieved.
The video frame feature extraction method provided by the embodiment of the application can be used in computer equipment, and the computer equipment comprises a terminal or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, CDN (Content Delivery Network), big data and an artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like.
Fig. 1 is a schematic structural diagram of an implementation environment provided in an embodiment of the present application, and as shown in fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 establishes a communication connection with the server 102, and performs interaction through the established communication connection.
The terminal 101 captures video data and sends a plurality of video frames of the video data to the server 102. The server 102 receives the plurality of video frames, performs feature extraction on each video frame to obtain initial feature information of each video frame, performs motion recognition according to the initial feature information of the plurality of video frames to obtain motion feature information of the plurality of video frames, compares the motion feature information of the plurality of video frames to obtain weight information of each video frame, and fuses the initial feature information of each video frame with the corresponding weight information to obtain target feature information of each video frame.
The method provided by the embodiment of the application can be used in a plurality of scenes.
For example, in a video data localization scenario.
After the computer device acquires video data, it uses the video frame feature extraction method provided by the embodiments of this application to obtain target feature information of a plurality of video frames in the video data, identifies the target video frames containing a target action according to that target feature information, and then clips out the video segment containing the target action, so that the video data can be localized accurately and efficiently.
As another example, in the context of video data classification.
After the computer device obtains video data, it uses the video frame feature extraction method provided by the embodiments of this application to obtain target feature information of a plurality of video frames in the video data, determines classification feature information of each video frame according to the target feature information of the plurality of video frames, and determines the action category to which the video data belongs according to the classification feature information of the plurality of video frames, thereby realizing action recognition of the video data; the video data is then stored in the database corresponding to that action category.
Fig. 2 is a flowchart of a video frame feature extraction method provided in an embodiment of the present application, and is applied to a computer device, as shown in fig. 2, the method includes the following steps.
201. A computer device obtains multiple video frames in the same video data.
The video data may be any type of video data, such as an outdoor running video, a cell surveillance video, a dance teaching video, and the like. The video data is a continuous sequence of video frames, and the plurality of video frames are different video frames in the video data. After the computer equipment acquires the video data, the video data is subjected to frame extraction processing to obtain a plurality of video frames; or, the other devices acquire a plurality of video frames and send the plurality of video frames to the computer device, and then the computer device acquires the plurality of video frames.
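As one illustration of the frame extraction step, the following OpenCV sketch samples a fixed number of frames uniformly from a video file; uniform sampling and the frame count are assumptions rather than requirements of this embodiment.

```python
import cv2

def sample_frames_sketch(video_path, num_frames=8):
    """Uniformly sample num_frames frames from the video at video_path (illustrative only)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = {int(i * total / num_frames) for i in range(num_frames)}
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in wanted:
            frames.append(frame)      # keep only the sampled video frames
        idx += 1
    cap.release()
    return frames
```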
202. And respectively extracting the characteristics of each video frame by the computer equipment to obtain the initial characteristic information of each video frame.
The initial feature information comprises initial features corresponding to a plurality of feature dimensions, and the initial features of each feature dimension are used for describing information of different dimensions in the video frame. For example, in a plurality of feature dimensions, an initial feature of one feature dimension is used to describe the number of people in a video frame, an initial feature of one feature dimension is used to describe color values of pixels in the video frame, and an initial feature of one feature dimension is used to describe the size of the video frame.
203. And the computer equipment performs motion recognition according to the initial characteristic information of the plurality of video frames to obtain the motion characteristic information of the plurality of video frames.
In the embodiments of this application, the motion feature information is dynamic information describing a video frame. Each video frame contains dynamic information and static information, where the static information is information unrelated to motion features. Since the plurality of video frames belong to the same video data, the static information contained in different video frames may be the same. Motion recognition is therefore performed according to the initial feature information of the plurality of video frames to identify the dynamic information of each video frame, weakening the information unrelated to motion features in each video frame and yielding the motion feature information of each video frame.
204. And the computer equipment compares the motion characteristic information of the plurality of video frames to obtain the weight information of each video frame.
The weight information comprises weights corresponding to a plurality of feature dimensions, the weights represent the degree of influence of the corresponding feature dimensions on the motion features of the video frame, if the degree of influence of the features on the feature dimensions on the motion features of the video frame is large, the weights corresponding to the feature dimensions are large, and if the degree of influence of the features on the feature dimensions on the motion features of the video frame is small, the weights corresponding to the feature dimensions are small. The weight information can be represented by a vector, a matrix or other forms.
The plurality of video frames belong to the same video data, so their motion feature information is correlated. For example, if the plurality of video frames contain a person running, a preceding video frame may show the person stepping with the left leg while the following video frame shows the person stepping with the right leg. Therefore, to improve the accuracy of the feature information of the video frames, the motion feature information contained in each video frame is enhanced and the background information in the video frames is weakened: by comparing the motion feature information of the plurality of video frames, the feature dimensions that influence the motion features in each video frame can be determined, and a corresponding weight is determined for each feature dimension of the video frame, yielding the weight information.
205. And the computer equipment respectively carries out fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame.
The target feature information includes target features corresponding to a plurality of feature dimensions, which are the same feature dimensions as in the initial feature information. After the weight information of each video frame is obtained, for any video frame, the initial features and the weights belonging to the same feature dimension are fused to obtain the fused target feature of each feature dimension, thereby obtaining the target feature information of the video frame.
In this way, the initial feature information of each video frame is fused with its weights, so that the target feature information of each video frame is obtained.
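A minimal sketch of this per-frame fusion, assuming element-wise multiplication of each initial feature by the weight of the same feature dimension (the embodiment does not fix the fusion operator):

```python
import torch

def fuse_frame_sketch(initial_info, weights):
    """initial_info: (C, H, W) initial features of one video frame; weights: (C,) weight information.
    Each feature dimension is re-weighted by its own weight to give the target feature information."""
    return initial_info * weights.view(-1, 1, 1)
```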
According to the method provided by the embodiments of this application, after the initial feature information of a plurality of video frames in the same video data is obtained, motion recognition is performed on the initial feature information of the plurality of video frames, which weakens the information unrelated to motion features in each video frame and improves the accuracy of the motion feature information of the plurality of video frames. The motion feature information of the plurality of video frames is then compared to analyze its correlation across frames and to determine the degree to which each feature dimension influences the motion features of the video frames, thereby obtaining weight information for each video frame and improving the accuracy of that weight information. Finally, the initial feature information of each video frame is fused with the corresponding weight information, which enhances the motion feature information in the target feature information of each video frame, weakens information unrelated to motion features, and improves the accuracy of the target feature information.
In one possible implementation manner, performing motion recognition according to initial feature information of a plurality of video frames to obtain motion feature information of the plurality of video frames includes:
and comparing the initial characteristic information of any two adjacent video frames in the plurality of video frames to obtain the motion characteristic information of the first video frame in any two video frames.
In another possible implementation manner, the comparing the initial feature information of any two adjacent video frames in the plurality of video frames to obtain the motion feature information of the first video frame in any two video frames includes:
performing dimensionality reduction processing on each initial feature in the initial feature information of the first video frame and the second video frame in any two video frames;
and determining the difference characteristic information between the characteristic information of the first video frame after the dimension reduction processing and the characteristic information of the second video frame after the dimension reduction processing as the motion characteristic information of the first video frame.
In another possible implementation, the method further includes:
and determining the preset characteristic information as the motion characteristic information of the last video frame in the plurality of video frames.
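The adjacent-frame comparison and the preset motion features for the last video frame can be sketched as follows; using a zero vector as the preset feature information is an assumption made for illustration.

```python
import torch

def motion_features_sketch(pooled):
    """pooled: (T, C) dimension-reduced initial features of T video frames.
    The motion feature of frame t is its difference to frame t+1; the last frame
    receives a preset (here zero) vector as its motion feature information."""
    diffs = pooled[1:] - pooled[:-1]            # compare each frame with the next adjacent frame
    preset = torch.zeros_like(pooled[:1])       # preset motion features for the last video frame
    return torch.cat([diffs, preset], dim=0)    # (T, C)
```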
In another possible implementation manner, comparing the motion characteristic information of a plurality of video frames to obtain weight information of each video frame includes:
for each video frame, carrying out fusion processing on the motion characteristic information of the video frame and the motion characteristic information of at least one video frame before the video frame to obtain fusion motion characteristic information of the video frame, wherein the fusion motion characteristic information comprises fusion motion characteristics corresponding to a plurality of characteristic dimensions;
and normalizing the plurality of fusion motion characteristics in the fusion motion characteristic information, and taking the fusion motion characteristic information after normalization as weight information.
In another possible implementation, the method further includes:
and responding to the video frame as the first video frame in the plurality of video frames, and fusing the motion characteristic information of the video frame and the motion characteristic information of the last video frame in the plurality of video frames to obtain the fused motion characteristic information of the video frame.
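A sketch of the weight computation under these two implementations, assuming the fusion averages a frame's motion features with those of the preceding frame (with the first frame wrapping around to the last one) and that the normalization is a softmax over feature dimensions:

```python
import torch
import torch.nn.functional as F

def weights_sketch(motion):
    """motion: (T, C) motion feature information of T video frames.
    Averaging and softmax are assumptions; the embodiment only requires fusion
    with preceding frames' motion features followed by normalization."""
    previous = torch.roll(motion, shifts=1, dims=0)   # frame t-1; frame 0 takes the last frame
    fused = 0.5 * (motion + previous)                 # fused motion feature information
    return F.softmax(fused, dim=1)                    # weight per feature dimension, per frame
```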
In another possible implementation manner, after the initial feature information of each video frame and the corresponding weight information are respectively fused to obtain the target feature information of each video frame, the method further includes:
classifying the target characteristic information of each video frame to obtain category characteristic information of each video frame, wherein the category characteristic information comprises characteristic values corresponding to a plurality of action categories;
fusing the category characteristic information of a plurality of video frames to obtain fused category characteristic information;
and determining the action type to which the maximum characteristic value in the fusion type characteristic information belongs as the action type of the video data.
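A sketch of this video-level action recognition, assuming the per-frame class features are fused by averaging before taking the arg-max category:

```python
import torch

def classify_video_sketch(target_feats, classifier, action_names):
    """target_feats: (T, C) per-frame target feature information; classifier maps (T, C) -> (T, num_actions).
    action_names: list of action category names. Averaging as the fusion is an assumption."""
    class_feats = classifier(target_feats)            # category feature information per frame
    fused = class_feats.mean(dim=0)                   # fusion category feature information
    return action_names[fused.argmax().item()]        # action category of the video data
```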
In another possible implementation manner, feature extraction is respectively carried out on each video frame, and the step of obtaining the initial feature information of each video frame is realized by calling a feature extraction model;
the step of carrying out motion recognition according to the initial characteristic information of the plurality of video frames to obtain the motion characteristic information of the plurality of video frames is realized by calling a motion recognition model;
the motion characteristic information of a plurality of video frames is compared to obtain the weight information of each video frame by calling a weight acquisition model;
and respectively carrying out fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame by calling an attention fusion model.
In another possible implementation, the method further includes:
acquiring a plurality of sample video frames in the same sample video data;
calling a feature extraction model, and respectively extracting features of each sample video frame to obtain initial sample feature information of each sample video frame, wherein the initial sample feature information comprises initial sample features corresponding to a plurality of feature dimensions;
calling a motion recognition model, and performing motion recognition according to initial sample feature information of a plurality of sample video frames to obtain motion sample feature information of the plurality of sample video frames, wherein the motion sample feature information comprises motion sample features corresponding to a plurality of feature dimensions;
calling a weight obtaining model, and comparing the motion sample characteristic information of a plurality of sample video frames to obtain sample weight information of each sample video frame, wherein the sample weight information comprises weights corresponding to a plurality of characteristic dimensions;
calling an attention fusion model, and respectively carrying out fusion processing on the initial sample characteristic information and the corresponding sample weight information of each sample video frame to obtain target sample characteristic information of each sample video frame;
and training a feature extraction model, a motion recognition model, a weight acquisition model and an attention fusion model according to the target sample feature information of the plurality of sample video frames.
In another possible implementation manner, training a feature extraction model, a motion recognition model, a weight acquisition model, and an attention fusion model according to target sample feature information of a plurality of sample video frames includes:
for any target feature dimension, determining the similarity of the target feature dimension according to the similarity between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames;
determining a first loss value of the feature extraction model according to the similarity of the preset number of target feature dimensions, wherein the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation relationship;
and training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the first loss value.
In another possible implementation manner, training a feature extraction model, a motion recognition model, a weight acquisition model, and an attention fusion model according to target sample feature information of a plurality of sample video frames includes:
calling a classification model, and classifying the target sample characteristic information of each sample video frame to obtain the category characteristic information of each sample video frame, wherein the category characteristic information comprises characteristic values corresponding to a plurality of action categories;
fusing the category characteristic information of a plurality of sample video frames to obtain fused category characteristic information;
determining the action category to which the maximum characteristic value in the fusion category characteristic information belongs as a target action category of the sample video;
determining a second loss value of the feature extraction model according to the difference between the target action category of the sample video and the sample action category of the sample video, wherein the second loss value and the difference are in positive correlation;
and training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the second loss value.
Fig. 3 is a flowchart of a video frame feature extraction method provided in an embodiment of the present application, and is applied to a computer device, as shown in fig. 3, the method includes the following steps.
301. A computer device obtains multiple video frames in the same video data.
In one possible implementation, this step 301 may include: the computer equipment acquires video data, and performs frame extraction processing on the video data to obtain a plurality of video frames in the video data. When the computer device obtains the video data, the computer device can shoot through the camera to obtain the video data, or receive the video data sent by other devices.
In one possible implementation, the step 301 may further include: the computer device establishes a communication connection with the other device, and receives the plurality of video frames sent by the other device through the communication connection.
302. And calling a feature extraction model by the computer equipment, and respectively extracting the features of each video frame to obtain the initial feature information of each video frame.
The feature extraction model is a model for extracting feature information of a video frame, and may be a convolution model including a plurality of convolution layers or other network models.
In the embodiment of the application, a feature extraction model is called, feature extraction is respectively carried out on each video frame in video data, and extracted feature information is used as initial feature information.
After the computer equipment acquires a plurality of video frames, inputting each video frame into the feature extraction model, calling the feature extraction model, and performing feature extraction on each video frame to obtain initial feature information of each video frame.
In one possible implementation, this step 302 may include: calling a feature extraction model, and extracting features of each video frame in turn according to the arrangement order of the plurality of video frames to obtain the initial feature information of each video frame. The arrangement order of the plurality of video frames may be determined according to their order in time in the video data. For example, if the first video frame is the video frame at the 1st minute of the video data, the second video frame is the video frame at the 2nd minute, and the third video frame is the video frame at the 3rd minute, the arrangement order of the three video frames is: the first video frame, the second video frame and the third video frame.
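A minimal sketch of calling a feature extraction model on each frame in arrangement order is given below; the small convolutional trunk, the tensor shapes and the frame count are illustrative assumptions, not the specific feature extraction model of this application.

import torch
import torch.nn as nn

# a toy convolutional feature extractor standing in for the feature extraction model
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

frames = torch.randn(8, 3, 224, 224)      # T video frames in arrangement order
initial_features = backbone(frames)       # (T, C, H, W) = (8, 256, 56, 56) initial feature information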
303. And calling the motion recognition model by the computer equipment, and comparing the initial characteristic information of any two adjacent video frames in the plurality of video frames to obtain the motion characteristic information of the first video frame in any two video frames.
The motion recognition model is used for obtaining motion feature information, the motion feature information comprises motion features corresponding to a plurality of feature dimensions, and the feature dimensions are the same as those in the initial feature information.
Because the plurality of video frames belong to the same video data, the information contained in adjacent video frames that is irrelevant to the motion features has similarity. For example, if a first video frame contains a person opening a door and a second video frame contains the person closing the door, the background information in the two video frames other than the door-opening and door-closing motions may be similar. Therefore, the initial features of two adjacent video frames are compared, and the obtained motion feature information is used as the motion feature information of the first of the two video frames. In this way, the similar information irrelevant to the motion features in adjacent video frames is weakened, the motion features in the video frames are highlighted, and the accuracy of the motion feature information is improved.
In a possible implementation manner, the motion recognition model includes a dimension reduction processing sub-model and a motion feature obtaining sub-model, then the step 303 may include the following steps 3031-3032.
3031. And calling a dimension reduction processing sub-model to perform dimension reduction processing on each initial feature in the initial feature information of the first video frame and the second video frame in any two video frames.
The dimension reduction processing submodel is a model for performing dimension reduction processing on each initial feature, and optionally, the dimension reduction processing submodel may include a global average pooling layer, through which dimension reduction processing may be performed on the feature information.
In the embodiment of the application, each initial feature in the initial feature information is a multi-dimensional feature, and in order to reduce the calculation amount subsequently, each initial feature in the initial feature information is subjected to dimension reduction processing, so that when the initial feature information after the dimension reduction processing is subsequently processed, the calculation amount is reduced, and the efficiency of extracting the features is improved.
When the dimension reduction processing is performed on the initial feature information of a video frame, the initial feature of each feature dimension in the initial feature information is reduced in dimension, so that the dimension of each initial feature after the dimension reduction processing is lower. The initial feature information after the dimension reduction processing comprises the dimension-reduced initial features of the plurality of feature dimensions.
For example, the initial feature information of the video frame includes C feature dimensions, where C is a positive integer not less than 1, the initial feature of each feature dimension is a matrix of 100 rows and 100 columns, that is, the initial feature information may be represented by a matrix of C × 100 × 100, and each initial feature in the initial feature information is subjected to dimension reduction processing by a dimension reduction processing sub-model to obtain a matrix of 1 row and 1 column of the initial feature of each feature dimension, that is, the initial feature information after dimension reduction processing may be represented by a matrix of C × 1 × 1.
In addition, when the initial feature of each feature dimension is a multi-dimensional feature matrix, and the multi-dimensional feature matrix of any feature dimension is subjected to dimension reduction processing, the feature values in the multi-dimensional feature matrix of the feature dimension are summed and averaged to obtain the initial feature after the dimension reduction processing of the feature dimension, or the feature values in the multi-dimensional feature matrix of the feature dimension are summed to obtain the initial feature after the dimension reduction processing of the feature dimension.
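The dimension reduction by global average pooling can be sketched as follows, assuming PyTorch tensors of shape T x C x H x W; the shapes are illustrative.

import torch

initial_features = torch.randn(8, 256, 56, 56)             # T x C x H x W initial features
# average the H x W values of every feature dimension, giving a C x 1 x 1 result per frame
reduced = initial_features.mean(dim=(2, 3), keepdim=True)  # T x C x 1 x 1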
3032. And calling a motion characteristic acquisition sub-model, and determining the difference characteristic information between the characteristic information of the first video frame after the dimension reduction processing and the characteristic information of the second video frame after the dimension reduction processing as the motion characteristic information of the first video frame.
The difference feature information is used for representing the difference between feature information of two video frames after dimension reduction processing, and is obtained by performing difference operation on the feature information of a first video frame after dimension reduction processing and the feature information of a second video frame after dimension reduction processing.
Because the information which is contained in the two adjacent video frames and is irrelevant to the motion characteristic has similarity, the similar information in the two adjacent video frames is counteracted by acquiring the difference characteristic information in the two adjacent video frames, the background information in the video frames is weakened, the motion characteristic information in the video frames is highlighted, and therefore the difference characteristic information is used as the motion characteristic information of the first video frame.
For any video frame, the difference feature information between the dimension-reduced feature information of the video frame and the dimension-reduced feature information of the next video frame is taken as the motion feature information of that video frame; in this way, the motion feature information of each video frame is obtained.
In addition, when the motion characteristic information of each video frame is acquired, the motion characteristic information of other video frames except the last video frame in the plurality of video frames can be acquired according to the above manner. In one possible implementation, the obtaining motion feature information of a last video frame of the plurality of video frames may include: and calling the motion characteristic acquisition sub-model, and determining the preset characteristic information as the motion characteristic information of the last video frame in the plurality of video frames.
The preset feature information includes features of a plurality of feature dimensions, the feature dimensions are the same as the feature dimensions in the initial feature information, and the preset feature information may be arbitrarily set feature information, for example, the preset feature is a zero vector or a zero matrix.
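A sketch of obtaining the motion feature information by a difference operation on the dimension-reduced features of adjacent frames, with preset zero features for the last frame, is given below; the sign convention of the difference and the zero preset are assumptions for the example.

import torch

reduced = torch.randn(8, 256, 1, 1)            # T x C x 1 x 1 dimension-reduced features
motion = torch.zeros_like(reduced)             # preset (zero) features for the last frame
# motion feature of frame t is the difference of the pooled features of adjacent frames
motion[:-1] = reduced[1:] - reduced[:-1]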
It should be noted that, in the embodiment of the present application, the motion feature information of the first of two adjacent video frames is obtained by comparing the initial feature information of the two adjacent video frames. In another embodiment, step 303 is not required to be executed; a motion recognition model may be invoked in another manner, and motion recognition is performed according to the initial feature information of the plurality of video frames to obtain the motion feature information of the plurality of video frames.
304. And for each video frame, the computer equipment calls a feature fusion layer, and performs fusion processing on the motion feature information of the video frame and the motion feature information of at least one video frame before the video frame to obtain the fusion motion feature information of the video frame.
In the embodiment of the application, the weight obtaining model is a model for obtaining weight information, the weight obtaining model comprises a feature fusion layer and a normalization layer, and the motion feature information is processed through the feature fusion layer and the normalization layer to obtain the weight information of each video frame.
The fusion motion characteristic information comprises fusion motion characteristics corresponding to a plurality of characteristic dimensions.
When the motion feature information of a plurality of video frames is subjected to fusion processing, the motion features belonging to the same feature dimension in the motion feature information of the plurality of video frames are fused for each feature dimension to obtain the fused motion feature of that feature dimension, and the fused motion features of the plurality of feature dimensions form the fusion feature information of the video frame. For example, if the motion feature information of one video frame is [0, 9, 2, 8, 16] and the motion feature information of another video frame is [2, 3, 4, 5, 1], where each numerical value represents one feature dimension, fusing the motion features belonging to the same feature dimension in the motion feature information of the two video frames yields the fused feature information [2, 12, 6, 13, 17].
Because the motion characteristics in the video frames have relevance, and the motion characteristics belonging to the same characteristic dimension in the motion characteristic information of any video frame and at least one video frame before the video frame have relevance, the motion characteristic information of the video frame is fused with the motion characteristic information of the video frame before the video frame, the motion characteristic information of the video frame is enriched, and the accuracy of the fusion characteristic information is improved.
In addition, the fusion feature information of the video frames other than the first video frame may be obtained according to the above manner. In one possible implementation, obtaining the fusion feature information of the first video frame may include: in response to the video frame being the first video frame of the plurality of video frames, calling the feature fusion layer, and fusing the motion feature information of the video frame with the motion feature information of the last video frame of the plurality of video frames to obtain the fusion motion feature information of the video frame.
In one possible implementation, this step 304 may include: sequentially obtaining the fusion feature information of each video frame according to the arrangement order of the plurality of video frames. In response to the current video frame being the first video frame, the feature fusion layer is called to perform fusion processing on the motion feature information of the current video frame and the motion feature information of the last video frame, obtaining the fusion motion feature information of the current video frame. In response to the current video frame not being the first video frame, the feature fusion layer is called to fuse the fusion feature information of the previous video frame with the motion feature information of the current video frame, obtaining the fusion feature information of the current video frame.
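The progressive fusion of step 304 can be sketched as follows, assuming element-wise addition as the fusion operation (consistent with the numerical example above):

import torch

motion = torch.randn(8, 256, 1, 1)         # T x C x 1 x 1 motion feature information
fused = torch.empty_like(motion)
fused[0] = motion[0] + motion[-1]          # first frame is fused with the last frame
for t in range(1, motion.shape[0]):
    fused[t] = fused[t - 1] + motion[t]    # later frames accumulate the previous fused result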
305. And calling a normalization layer by the computer equipment, normalizing the plurality of fusion motion characteristics in the fusion motion characteristic information, and taking the fusion motion characteristic information after normalization as weight information.
Because the fusion motion feature information of each video frame is obtained from the motion feature information of a plurality of video frames, in the fusion motion feature information the fused motion features of the feature dimensions that strongly influence the motion features are enhanced, and the fused motion features of the feature dimensions that weakly influence the motion features are weakened. Therefore, normalizing the fusion motion feature information reflects the degree of influence of each feature dimension on the motion features of the video frame, and the weight information of the video frame is obtained. For example, the weight information is [0.5, 0.2, 0.1, 0.2].
In one possible implementation, the fused motion feature includes fused feature values of a plurality of feature dimensions, and this step 305 may include: for the fusion characteristic information of any video frame, determining the fusion characteristic value sum of a plurality of characteristic dimensions, determining the ratio of the fusion characteristic value of each characteristic dimension to the fusion characteristic value sum as the weight of the corresponding characteristic dimension, and then forming the weight information of the video frame by the weights of the plurality of characteristic dimensions. In the weight information obtained in this way, each weight falls between 0 and 1, and the sum of the weights of the plurality of feature dimensions is 1.
In one possible implementation, the fused motion feature includes fused feature values of a plurality of feature dimensions, and this step 305 may include: for the fusion characteristic information of any video frame, determining a difference value between a maximum fusion characteristic value and a minimum fusion characteristic value in the fusion characteristic information, determining a ratio of the fusion characteristic value of each characteristic dimension to the difference value as the weight of the corresponding characteristic dimension, and then forming the weight information of the video frame by the weights of a plurality of characteristic dimensions. In the weight information obtained in this way, each weight falls within the range of 0 to 1.
It should be noted that, in the embodiment of the present application, the weight information is obtained through the feature fusion layer and the normalization layer, but in another embodiment, the step 304 and the step 305 need not be executed, and a weight obtaining model may be invoked in other ways to compare the motion feature information of a plurality of video frames, so as to obtain the weight information of each video frame.
306. And calling an attention fusion model by the computer equipment, and respectively fusing the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame.
Here, the attention fusion model is a model for fusing initial feature information and weight information. When any video frame and corresponding weight information are subjected to fusion processing, the initial features and weights belonging to the same feature dimension are fused to obtain the target feature of each feature dimension, and the obtained target features of a plurality of feature dimensions form the target feature information of the video frame. And respectively fusing the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame.
In one possible implementation manner, if the initial feature of each feature dimension in the initial feature information is an initial feature matrix, where the initial feature matrix includes a plurality of initial feature values, then step 306 may include: and regarding each video frame and each feature dimension, taking the product of each initial feature value in the initial feature matrix of the feature dimension and the weight corresponding to the feature dimension in the initial feature information of the video frame as the target feature value of the feature dimension respectively, wherein the obtained plurality of target feature values form the target feature matrix of the feature dimension, and the plurality of target feature matrices of the feature dimension form the target feature information of the video frame.
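The fusion of initial feature information and weight information can be sketched as a channel-wise multiplication, as below; broadcasting the per-dimension weights over the spatial positions is an assumption consistent with the description of step 306.

import torch

initial_features = torch.randn(8, 256, 56, 56)   # T x C x H x W initial feature information
weights = torch.rand(8, 256, 1, 1)               # weight information from step 305
# every initial feature value of a feature dimension is multiplied by that dimension's weight
target_features = initial_features * weights     # target feature information of each frame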
307. And calling a classification model by the computer equipment, and classifying the target characteristic information of each video frame to obtain the category characteristic information of each video frame.
The classification model is a model for classifying the video frame, and it may include a fully connected layer through which the category feature information of the video frame can be acquired. The category feature information includes feature values corresponding to a plurality of action categories, and the feature value corresponding to an action category may indicate the probability that the corresponding video frame belongs to the action category, or may indicate the similarity between the target feature information of the corresponding video frame and the feature vector of the action category.
In the embodiment of the application, a plurality of action categories are preset and used for representing a plurality of action categories to which the motion features in the video data belong, and the action categories can be a dance action category, a cycling action category, a running action category, a normal action category and the like.
In one possible implementation, this step 307 may include: for any video frame, determining the similarity between the target feature information of the video frame and the motion category feature vector of each motion category, taking each obtained similarity as the feature value of the corresponding motion category, and obtaining the feature values of a plurality of motion categories to form the category feature information of the video frame.
It should be noted that, in the embodiment of the present application, the target feature information of each video frame is obtained through the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model, but in another embodiment, when steps 302 to 307 are executed, the corresponding operations may be performed directly by the computer device without invoking the models.
308. And the computer equipment fuses the category characteristic information of the video frames to obtain fused category characteristic information.
The fusion type feature information includes fused feature values of a plurality of action types. When the category feature information of a plurality of video frames is fused, feature values belonging to the same action category in the category feature information of the plurality of video frames are fused to obtain fused feature values of a plurality of action categories, so that the fused category feature information is obtained.
Since the plurality of video frames all belong to the same video data, in order to improve the accuracy of the action category of the video data, the category feature information of the plurality of video frames is fused, so that the action category of the video data can be determined according to the obtained fusion category feature information.
309. And the computer equipment determines the action category to which the maximum characteristic value in the fusion category characteristic information belongs as the action category of the video data.
The larger the feature value corresponding to the motion category is, the higher the probability that the video belongs to the motion category is, and the smaller the feature value corresponding to the motion category is, the lower the probability that the video belongs to the motion category is. And after the fusion type feature information is obtained, determining the action type to which the maximum feature value belongs in the fusion type feature information as the action type of the video data.
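Steps 307 to 309 can be sketched as follows, assuming a fully connected classifier over globally pooled target features and summation as the fusion of the per-frame category feature information; the number and names of the action categories are illustrative.

import torch
import torch.nn as nn

num_classes = 4                                   # e.g. dance, cycling, running, other
classifier = nn.Linear(256, num_classes)          # fully connected classification layer

target_features = torch.randn(8, 256, 56, 56)     # T x C x H x W target feature information
pooled = target_features.mean(dim=(2, 3))         # T x C
class_scores = classifier(pooled)                 # T x num_classes category feature information
fused_scores = class_scores.sum(dim=0)            # fuse feature values of the same action category
action_category = fused_scores.argmax().item()    # action category with the maximum feature value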
According to the method provided by the embodiment of the application, after the initial feature information of a plurality of video frames in the same video data is obtained, motion recognition is performed on the initial feature information of the plurality of video frames, which weakens the information irrelevant to the motion features in each video frame and improves the accuracy of the motion feature information of the plurality of video frames. The motion feature information of the plurality of video frames is then compared to analyze its relevance and to determine the degree of influence of each feature dimension on the motion features of the video frames, so that the weight information of each video frame is obtained and its accuracy is improved. The initial feature information of each video frame is then fused with the corresponding weight information, which enhances the motion feature information in the target feature information of each video frame, weakens the information irrelevant to the motion features, and improves the accuracy of the target feature information.
And when the weight information of each video frame is acquired, the motion characteristic information of the video frame and at least one video frame before the video frame is fused according to the arrangement time sequence of the video frames, so that the motion characteristic of the previous video frame is fused into each video frame, the time sequence characteristic among the video frames is enhanced, the relevance of the motion characteristic information of the video frames is enhanced, and the accuracy of the target characteristic information is improved.
In addition, as the motion characteristics in the target characteristic information are enhanced, the information irrelevant to the motion characteristics is weakened, the diversity of the target characteristics is enriched, the action category of the video data is determined according to the target characteristic information of each video frame, and the classification accuracy is improved.
And after dimension reduction processing is carried out on each initial feature in the initial feature information, the motion feature information is obtained through the feature information after the dimension reduction processing, the calculation amount is reduced, and the efficiency of obtaining the feature information is improved.
Fig. 4 is a flowchart of acquiring target feature information of each video frame according to an embodiment of the present application, and as shown in fig. 4, the method includes the following steps.
1. When the target characteristic information of each video frame is obtained, the initial characteristic information of each video frame is obtained, and global average pooling is carried out on each initial characteristic in the initial characteristic information of each video frame through a pooling layer to obtain the initial characteristic information of each video frame after dimension reduction processing.
2. And performing difference operation on the initial characteristic information after the dimension reduction processing of every two adjacent video frames to obtain the motion characteristic information of the first video frame in the two video frames, so as to obtain the motion characteristic information of other video frames except the last video frame in the video frames, and taking the preset characteristic information as the motion characteristic information of the last video frame.
3. According to the arrangement order of the plurality of video frames, the motion feature information of the last video frame is fused with the motion feature information of the first video frame to obtain the fusion motion feature information of the first video frame; the fusion motion feature information of the first video frame is then fused with the motion feature information of the second video frame to obtain the fusion motion feature information of the second video frame. Proceeding in this manner, the current motion feature information is fused each time with the fusion motion feature information of the previous video frame, so that the fusion motion feature information of each video frame is obtained.
4. And performing normalization processing on the fusion motion characteristic information of each video frame to obtain weight information of each video frame, and performing fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain target characteristic information of each video frame.
As shown in fig. 5, the PEM (Progressive Enhancement Module) formed by the motion recognition model, the weight acquisition model and the attention fusion model provided in the embodiment of the present application enhances the motion feature information in the video frames and weakens the static background information in the video frames by determining the weight information of a plurality of feature dimensions. The left image and the right image in fig. 5 are two sets of comparison images: the first row of images shows the video frames of each set; the second row shows the initial feature information of each video frame obtained through the feature extraction model; the third row shows the target feature of the feature dimension corresponding to the maximum weight in the weight information of each video frame; and the fourth row shows the target feature of the feature dimension corresponding to the minimum weight in the weight information of each video frame.
By contrast, the target feature with the feature dimension corresponding to the maximum weight has a higher response value at the moving subject, and the target feature with the feature dimension corresponding to the minimum weight has a larger response value at the background unrelated to the moving subject. Through the weight information of each video frame, the characteristics of each characteristic dimension are multiplied by the weight, the information related to a moving body in the characteristics of the video frame is enhanced, the interference of redundant background information in the video frame is weakened, and the accuracy of the target characteristic information of each video frame is improved.
On the basis of the embodiment shown in fig. 3, before the feature extraction model, the motion recognition model, the weight acquisition model, and the attention fusion model are called, the feature extraction model, the motion recognition model, the weight acquisition model, and the attention fusion model need to be trained, and the training process is described in the following embodiments.
Fig. 6 is a model training method for video frame feature extraction according to an embodiment of the present application, applied to a computer device, and as shown in fig. 6, the method includes the following steps.
601. A computer device obtains a plurality of sample video frames in the same sample video data.
602. And calling a feature extraction model by the computer equipment, and respectively extracting the features of each sample video frame to obtain the initial sample feature information of each sample video frame.
The initial sample feature information comprises initial sample features corresponding to a plurality of feature dimensions.
603. And the computer equipment calls the motion recognition model, performs motion recognition according to the initial sample characteristic information of the plurality of sample video frames, and obtains the motion sample characteristic information of the plurality of sample video frames.
The motion sample feature information comprises motion sample features corresponding to a plurality of feature dimensions.
604. And calling a weight obtaining model by the computer equipment, and comparing the motion sample characteristic information of the plurality of sample video frames to obtain the sample weight information of each sample video frame.
The sample weight information comprises weights corresponding to a plurality of characteristic dimensions.
605. And calling an attention fusion model by the computer equipment, and respectively carrying out fusion processing on the initial sample characteristic information and the corresponding sample weight information of each sample video frame to obtain the target sample characteristic information of each sample video frame.
606. And training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model by the computer equipment according to the target sample feature information of the plurality of sample video frames.
The target sample feature information is jointly acquired through the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model, so that the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model can be trained subsequently according to whether the target sample feature information is accurate or not, the accuracy of the models is improved, and the trained feature extraction model, the trained motion recognition model, the trained weight acquisition model and the trained attention fusion model can be obtained subsequently.
In one possible implementation, this step 606 may include the following two ways.
The first method includes the following steps 6601-6603.
6601. And for any target feature dimension, determining the similarity of the target feature dimension according to the similarity between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames.
In the embodiment of the application, because each target sample feature information comprises a plurality of feature dimensions, in order to reduce the calculation amount and improve the efficiency of training the model, a preset number of target feature dimensions are selected from the plurality of feature dimensions, and the model is trained. The preset number may be any number, such as 10, 8, etc.
The similarity between the sample features of the target feature dimension is used to represent the similarity of the sample features of different video frames in the target feature dimension, and the greater the similarity, the more similar the sample features representing two video frames, that is, the more similar the information contained in the two sample features in the feature dimension, the greater the possibility that the information contained in the two sample features in the feature dimension is the background information.
For any target feature dimension, the target sample feature information of each sample video frame includes a target sample feature belonging to that target feature dimension, so a plurality of target sample features can be determined. The similarity between every two of the plurality of target sample features is determined to obtain a plurality of similarities, and these similarities are aggregated, for example summed, to obtain the similarity of the target feature dimension. Each target feature dimension is processed in this way, and the similarity of each target feature dimension can be obtained. As shown in fig. 7, when determining the similarity of any target feature dimension, the plurality of target sample features belonging to the target feature dimension in the target sample feature information of the plurality of video frames are determined, and the pairwise similarities between them are determined, so as to obtain the similarity of the target feature dimension.
6602. And determining a first loss value of the feature extraction model according to the similarity of the preset number of target feature dimensions.
The first loss value is in positive correlation with the similarity of the preset number of target feature dimensions, the greater the similarity of the preset number of target feature dimensions is, the greater the first loss value is, the smaller the similarity of the preset number of target feature dimensions is, and the smaller the first loss value is.
In one possible implementation, this step 6602 may include: and taking the sum of the similarity of the preset number of target feature dimensions as a first loss value of the feature extraction model.
The sum of the similarity of the preset number of target feature dimensions is used as a first loss value of a training model, so that the training model can avoid background information of video frames, and motion feature information of videos can be enhanced.
In one possible implementation, the first loss value L_{1} of the feature extraction model may satisfy the following relationship:

L_{1} = \sum_{k=1}^{K} \frac{1}{N_{k}} \sum_{i \neq j} \cos\left( f_{i}^{k}, f_{j}^{k} \right)

wherein k denotes the sequence number of a target feature dimension and K is the preset number of target feature dimensions; N_{k} denotes the number of pairwise similarities obtained between every two video frames in the k-th target feature dimension; i and j denote the sequence numbers of the plurality of video frames, with i ≠ j; f_{i}^{k} and f_{j}^{k} denote the target features of the i-th and j-th video frames in the k-th target feature dimension; and \cos(\cdot, \cdot) denotes the cosine similarity function.
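Under the reconstruction above, the first loss value can be sketched as follows; the tensor layout and the choice of selected feature dimensions are assumptions for the example.

import torch
import torch.nn.functional as F

def diversity_loss(target_features, selected_dims):
    # target_features: T x C x H x W target sample features of T sample video frames
    T = target_features.shape[0]
    feats = target_features.flatten(2)          # T x C x (H*W)
    loss = target_features.new_zeros(())
    for k in selected_dims:                     # preset number of target feature dimensions
        pair_sims = []
        for i in range(T):
            for j in range(T):
                if i != j:
                    pair_sims.append(F.cosine_similarity(feats[i, k], feats[j, k], dim=0))
        loss = loss + torch.stack(pair_sims).mean()   # (1 / N_k) * sum of pairwise similarities
    return loss

first_loss = diversity_loss(torch.randn(8, 256, 7, 7), selected_dims=[0, 1, 2, 3])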
6603. And training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the first loss value.
And training the model according to the first loss value so as to reduce the similarity between the target characteristic information of different video frames, weaken the similar redundant information between different video frames and enhance the motion characteristic information of each video frame, thereby improving the accuracy of the characteristic extraction model, the motion recognition model, the weight acquisition model and the attention fusion model. In the subsequent target feature information in different video frames obtained through the trained feature extraction model, the trained motion recognition model, the trained weight acquisition model and the trained attention fusion model, the similarity between the target features belonging to the same feature dimension is low, so that each video frame is ensured to have unique motion features, the similar redundant information between different video frames is weakened, and the accuracy of the target feature information is improved.
The second method includes the following steps 6604-6608.
6604. And calling a classification model, and classifying the target sample characteristic information of each sample video frame to obtain the class characteristic information of each sample video frame.
The category feature information includes feature values corresponding to a plurality of action categories.
This step is similar to step 307 described above and will not be described further herein.
6605. And fusing the category characteristic information of the plurality of sample video frames to obtain fused category characteristic information.
This step is similar to step 308 described above and will not be described further herein.
6606. And determining the action category to which the maximum characteristic value in the fusion category characteristic information belongs as the target action category of the sample video.
This step is similar to step 309 described above and will not be described further herein.
6607. And determining a second loss value of the feature extraction model according to the difference between the target action category of the sample video and the sample action category of the sample video.
The second loss value and the difference are in a positive correlation relationship, the sample motion category is a real motion category to which the sample video belongs, the target motion category is a predicted motion category of the sample video, the larger the difference between the target motion category and the sample motion category of the sample video is, the larger the second loss value is, the smaller the difference between the target motion category and the sample motion category of the sample video is, and the smaller the second loss value is.
And determining a second loss value of the feature extraction model according to the difference between the target action category and the sample action category of the sample video, so that the model is trained subsequently according to the second loss value to reduce the loss value, namely, the difference between the target action category and the sample action category of the sample video is reduced, and the accuracy of the model is improved.
6608. And training the feature extraction model, the motion recognition model, the weight acquisition model and the attention model according to the second loss value.
And training the feature extraction model, the motion recognition model, the weight acquisition model and the attention model through the second loss value to reduce the loss value, so that the target action type predicted by the trained model for the video data is the same as the actual action type of the video data, and the accuracy of the model is improved.
In one possible implementation, this step 6608 may include: and training the feature extraction model, the motion recognition model, the weight acquisition model, the attention model and the classification model according to the second loss value.
In addition, the two modes can be combined, and the feature extraction model, the motion recognition model, the weight acquisition model and the attention model are trained according to the obtained first loss value and the second loss value.
In a possible implementation manner, the first loss value and the second loss value are subjected to weighted fusion to obtain a total loss value, and the feature extraction model, the motion recognition model, the weight acquisition model and the attention model are trained according to the total loss value.
Optionally, the feature extraction model, the motion recognition model, the weight obtaining model, the attention model and the classification model are trained according to the total loss value.
Optionally, the total loss value L satisfies the following relationship:

L = L_{CE} + \lambda L_{1}, \quad L_{CE} = -\sum_{j=1}^{C} y_{j} \log \hat{y}_{j}

wherein L_{CE} denotes the second loss value; L_{1} denotes the first loss value; \lambda is the weighting factor and may be any constant; C denotes the total number of the plurality of sample action categories; j denotes the sequence number of a sample action category; y_{j} denotes the category vector of each sample action category; and \hat{y}_{j} denotes the target action category of the sample video data.
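A sketch of combining the two loss values into the total loss is given below, assuming a standard cross-entropy for the second loss value and an arbitrary constant weighting factor; the numbers are illustrative.

import torch
import torch.nn.functional as F

fused_scores = torch.randn(1, 4)              # fused category feature information of the sample video
sample_label = torch.tensor([2])              # index of the sample action category
lam = 0.1                                     # illustrative weighting factor

second_loss = F.cross_entropy(fused_scores, sample_label)
first_loss = torch.tensor(0.5)                # e.g. from the diversity-loss sketch above
total_loss = second_loss + lam * first_loss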
According to the method provided by the embodiment of the application, the feature extraction model, the motion recognition model, the weight acquisition model and the attention model are jointly trained through the feature information of the target sample of the video frame acquired by the feature extraction model, the motion recognition model, the weight acquisition model and the attention model, so that the accuracy of the multiple models is improved.
In addition, in the process of training the model, a first loss value of the training model is determined through the similarity between the sample features of the video frames in the same target feature dimension, so that the similarity between the sample features of different video frames in the same target feature dimension is reduced, and similar redundant information in different video frames is weakened, so that the motion features in the video frames are enhanced, the difference of the target feature information between different video frames is enhanced, the diversity of the target sample features is enriched, and the accuracy of the model is improved. And a preset number of feature dimensions are selected from the plurality of feature dimensions to train the model, so that the calculated amount is reduced, and the efficiency of training the model is improved.
In the process of training the model, the target action type of the sample video data is determined according to the target sample characteristic information of the plurality of video frames, and the model is trained according to the difference between the target action type of the sample video data and the sample action type, so that the difference between the target action type of the sample video data and the sample action type is reduced, and the accuracy of the model is improved.
As shown in table 1, comparison results between the model training method of the present application and the model training methods in the related art are given on different data. Comparing the basic network models and the network model structures shows that the model training method of the present application has a smaller calculation amount than the model training methods in the related art, and the accuracy of the trained model is improved.
TABLE 1
Fig. 8 is a schematic structural diagram of a network model used to classify video data. The network model includes a plurality of modules, and modules 1, 2, 3 and 4 have similar structures. Taking module 3 as an example, module 3 includes a plurality of sub-modules 1 and 2. Sub-module 1 is a module with a memory attention mechanism (DRL-A) and includes the motion recognition model, the weight acquisition model and the attention fusion model of the present embodiment. Sub-module 2 is a module with a memory attention mechanism and a time sequence diversity constraint; it includes a motion enhancement unit, a temporal modeling unit, a convolutional layer and the time sequence diversity constraint, and the motion enhancement unit includes the motion recognition model, the weight acquisition model and the attention fusion model.
By connecting the sub-modules 1, the target characteristic information of a plurality of video frames can be repeatedly updated, so that the motion characteristic information of the video frames is enhanced, and information irrelevant to motion, namely background information, in the video frames is weakened. By connecting a plurality of modules in the network model, the target characteristic information of a plurality of video frames can be repeatedly updated, so that the motion characteristic information of the video frames is enhanced, and the information irrelevant to motion, namely background information, in the video frames is weakened.
As shown in fig. 9, a TD (Temporal Diversity) constraint module, which applies the time sequence diversity constraint, is used when training the model and plays a role in enriching the features. Comparing the two dotted boxes in the image shows that, compared with the target features obtained without the time sequence diversity constraint, the target features obtained after adding the time sequence diversity constraint also contain part of the adjacent-frame feature information, thereby enriching the diversity of the features and improving the accuracy of the target feature information.
Fig. 10 is a video frame feature extraction method provided in an embodiment of the present application, and is applied to a computer device, as shown in fig. 10, the method includes the following steps.
1001. A computer device obtains a plurality of sample video frames in the same sample video data.
1002. And calling a feature extraction model by the computer equipment, and respectively extracting the features of each sample video frame to obtain the target sample feature information of each sample video frame.
1003. And for any target feature dimension, the computer equipment determines the similarity of the target feature dimension according to the sum of the similarities between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames.
1004. And the computer equipment determines a first loss value of the feature extraction model according to the similarity of the preset number of target feature dimensions.
The first loss value and the similarity of the preset number of target feature dimensions are in positive correlation.
1005. And calling a classification model by the computer equipment, and classifying the target sample characteristics of each sample video frame to obtain the class characteristic information of each sample video frame.
The category feature information includes feature values corresponding to a plurality of action categories.
1006. And the computer equipment fuses the category characteristic information of the plurality of sample video frames to obtain fused category characteristic information.
1007. And the computer equipment determines the action category to which the maximum characteristic value in the fusion category characteristic information belongs as the target action category of the sample video.
1008. The computer device determines a second loss value of the feature extraction model according to a difference between the target motion category of the sample video and the sample motion category of the sample video.
Wherein the second loss value is positively correlated with the difference.
1009. The computer device trains the feature extraction model according to the first loss value and the second loss value.
1010. And calling the trained feature extraction model by the computer equipment, and extracting the features of any video frame to obtain the feature information of any video frame.
Through the trained feature extraction model, when the feature extraction is carried out on the video frame, the motion feature in the video frame can be enhanced, and the background information in the video frame is weakened, so that the accurate feature information of the video frame can be obtained.
It should be noted that, in the embodiment of the present application, the feature extraction model is trained according to the first loss value and the second loss value; in another embodiment, steps 1005 to 1009 need not be executed, and the feature extraction model is trained according to the first loss value only.
The method provided by the embodiment of the application determines a first loss value for training the model through the similarity between the sample features of a plurality of video frames in the same target feature dimension, so as to reduce the similarity between the sample features of different video frames in the same target feature dimension and weaken the similar background information in different video frames. This enhances the motion features in the video frames, enhances the difference of the target feature information between different video frames, and enriches the diversity of the target sample features. The method further determines the target action category of the sample video data through the acquired target sample feature information of the plurality of video frames, and trains the model through the difference between the target action category and the sample action category of the sample video data so as to reduce that difference, thereby improving the accuracy of the target sample feature information and the accuracy of the model.
And a preset number of feature dimensions are selected from the plurality of feature dimensions to train the model, so that the calculated amount is reduced, and the efficiency of training the model is improved.
In addition, with the method provided by the embodiment of the application, the trained feature extraction model integrates the time sequence information between different video frames when acquiring the target feature information of the video frames, so that the accuracy of the acquired target feature information is improved. Moreover, the target feature information of each video frame can be acquired by the feature extraction model alone, which simplifies the model and improves the efficiency of acquiring the target feature information of the video frames.
Fig. 11 is a schematic structural diagram of a video frame feature extraction apparatus according to an embodiment of the present application, and as shown in fig. 11, the apparatus includes:
a video frame acquiring module 1101, configured to acquire multiple video frames in the same video data;
the feature extraction module 1102 is configured to perform feature extraction on each video frame to obtain initial feature information of each video frame, where the initial feature information includes initial features corresponding to multiple feature dimensions;
a motion identification module 1103, configured to perform motion identification according to initial feature information of multiple video frames to obtain motion feature information of the multiple video frames, where the motion feature information includes motion features corresponding to multiple feature dimensions;
the comparison processing module 1104 is configured to perform comparison processing on the motion feature information of the plurality of video frames to obtain weight information of each video frame, where the weight information includes weights corresponding to the plurality of feature dimensions, and the weights indicate degrees of influence of the feature dimensions on the motion features of the video frames;
a first fusion processing module 1105, configured to perform fusion processing on the initial feature information of each video frame and the corresponding weight information, respectively, to obtain target feature information of each video frame.
In a possible implementation manner, the motion recognition module 1103 is configured to compare initial feature information of any two adjacent video frames in the plurality of video frames to obtain motion feature information of a first video frame in any two video frames.
In another possible implementation, as shown in fig. 12, the motion recognition module 1103 includes:
a dimension reduction processing unit 1131, configured to perform dimension reduction processing on each initial feature in the initial feature information of the first video frame and the second video frame in any two video frames;
the feature information determining unit 1132 is configured to determine, as the motion feature information of the first video frame, difference feature information between the feature information of the first video frame after the dimension reduction processing and the feature information of the second video frame after the dimension reduction processing.
In another possible implementation, as shown in fig. 12, the apparatus further includes:
a feature information determining module 1106, configured to determine preset feature information as motion feature information of a last video frame of the plurality of video frames.
In another possible implementation, as shown in fig. 12, the comparison processing module 1104 includes:
the fusion processing unit 1141 is configured to perform fusion processing on the motion feature information of the video frame and the motion feature information of at least one video frame before the video frame for each video frame to obtain fusion motion feature information of the video frame, where the fusion motion feature information includes fusion motion features corresponding to multiple feature dimensions;
a normalization processing unit 1142, configured to perform normalization processing on the multiple fusion motion features in the fusion motion feature information, and use the fusion motion feature information after the normalization processing as weight information.
In another possible implementation, as shown in fig. 12, the apparatus further includes:
the second fusion processing module 1107 is configured to, in response to that the video frame is a first video frame in the multiple video frames, perform fusion processing on the motion feature information of the video frame and the motion feature information of a last video frame in the multiple video frames to obtain fusion motion feature information of the video frames.
In another possible implementation, as shown in fig. 12, the apparatus further includes:
a classification processing module 1108, configured to perform classification processing on the target feature information of each video frame to obtain category feature information of each video frame, where the category feature information includes feature values corresponding to multiple action categories;
the information fusion module 1109 is configured to fuse category feature information of the plurality of video frames to obtain fusion category feature information;
the category determining module 1110 is configured to determine an action category to which the maximum feature value in the fusion category feature information belongs as an action category of the video data.
In another possible implementation manner, the feature extraction module 1102 is further configured to invoke a feature extraction model, and perform feature extraction on each video frame respectively to obtain initial feature information of each video frame;
the motion recognition module 1103 is further configured to invoke a motion recognition model, perform motion recognition according to the initial feature information of the plurality of video frames, and obtain motion feature information of the plurality of video frames;
the comparison processing module 1104 is further configured to invoke a weight obtaining model, and perform comparison processing on the motion characteristic information of the plurality of video frames to obtain weight information of each video frame;
the first fusion processing module 1105 is further configured to invoke an attention fusion model, and perform fusion processing on the initial feature information and the corresponding weight information of each video frame, respectively, to obtain target feature information of each video frame.
In another possible implementation, as shown in fig. 12, the apparatus further includes:
the video frame acquiring module 1101 is further configured to acquire a plurality of sample video frames in the same sample video data;
the feature extraction module 1102 is further configured to invoke a feature extraction model, and perform feature extraction on each sample video frame respectively to obtain initial sample feature information of each sample video frame, where the initial sample feature information includes initial sample features corresponding to multiple feature dimensions;
the motion recognition module 1103 is further configured to invoke a motion recognition model, perform motion recognition according to initial sample feature information of the multiple sample video frames, and obtain motion sample feature information of the multiple sample video frames, where the motion sample feature information includes motion sample features corresponding to multiple feature dimensions;
the comparison processing module 1104 is further configured to invoke a weight obtaining model, compare the motion sample feature information of the plurality of sample video frames, and obtain sample weight information of each sample video frame, where the sample weight information includes weights corresponding to a plurality of feature dimensions;
the first fusion processing module 1105 is further configured to invoke an attention fusion model, and perform fusion processing on the initial sample feature information and the corresponding sample weight information of each sample video frame respectively to obtain target sample feature information of each sample video frame;
the model training module 1111 is configured to train the feature extraction model, the motion recognition model, the weight obtaining model, and the attention fusion model according to the target sample feature information of the plurality of sample video frames.
In another possible implementation, as shown in fig. 12, the model training module 1111 includes:
a similarity determining unit 1112, configured to determine, for any target feature dimension, a similarity of the target feature dimension according to a similarity between sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the multiple sample video frames;
a first loss value determining unit 1113, configured to determine a first loss value of the feature extraction model according to the similarity of the preset number of target feature dimensions, where the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation;
the first model training unit 1114 is configured to train the feature extraction model, the motion recognition model, the weight obtaining model, and the attention fusion model according to the first loss value.
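The first loss value can be sketched as follows; the Gaussian similarity measure, the choice of the first num_dims dimensions as the "preset number" of target feature dimensions, and the [T, C] feature shape are illustrative assumptions. The loss grows with the summed pairwise similarity, matching the positive correlation described above.

```python
import torch

def first_loss(target_feats, num_dims=64):
    """Sketch of the first loss value. target_feats: [T, C] target sample
    feature information of T sample video frames; the first num_dims feature
    dimensions stand in for the 'preset number' of target feature dimensions."""
    T = target_feats.shape[0]
    dims = target_feats[:, :num_dims]
    loss = target_feats.new_zeros(())
    for i in range(T):
        for j in range(i + 1, T):
            # Similarity of each target feature dimension between two frames;
            # a Gaussian similarity is used here as an assumed measure.
            sim = torch.exp(-(dims[i] - dims[j]) ** 2)
            loss = loss + sim.sum()
    return loss  # grows with the summed similarities (positive correlation)

print(first_loss(torch.randn(8, 256)))
```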
In another possible implementation, as shown in fig. 12, the model training module 1111 includes:
a classification processing unit 1115, configured to invoke a classification model, and perform classification processing on target sample feature information of each sample video frame to obtain category feature information of each sample video frame, where the category feature information includes feature values corresponding to multiple action categories;
an information fusion unit 1116, configured to fuse category feature information of multiple sample video frames to obtain fusion category feature information;
a category determining unit 1117, configured to determine an action category to which a maximum feature value in the fusion category feature information belongs as a target action category of the sample video;
a second loss value determining unit 1118, configured to determine a second loss value of the feature extraction model according to a difference between the target action category of the sample video and the sample action category of the sample video, where the second loss value and the difference have a positive correlation;
and a second model training unit 1119, configured to train the feature extraction model, the motion recognition model, the weight obtaining model, and the attention fusion model according to the second loss value.
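A minimal sketch of the second loss value follows; cross entropy is used here as an assumed measure of the difference between the target action category and the labelled sample action category, and the averaging fusion of per-frame category feature information is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def second_loss(class_feats, sample_label):
    """Sketch of the second loss value. class_feats: [T, num_classes] category
    feature information per sample video frame; sample_label: labelled sample
    action category index."""
    fused = class_feats.mean(dim=0, keepdim=True)   # fusion category feature information
    # Cross entropy as an assumed 'difference' between predicted and labelled category.
    return F.cross_entropy(fused, torch.tensor([sample_label]))

print(second_loss(torch.randn(8, 10), sample_label=3))
```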
Fig. 13 is a schematic structural diagram of a video frame feature extraction apparatus according to an embodiment of the present application, and as shown in fig. 13, the apparatus includes:
a video frame obtaining module 1301, configured to obtain multiple sample video frames in the same sample video data;
the feature extraction module 1302 is configured to invoke a feature extraction model, and perform feature extraction on each sample video frame respectively to obtain target sample feature information of each sample video frame;
a similarity determining module 1303, configured to determine, for any target feature dimension, a similarity of the target feature dimension according to a sum of similarities between sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the multiple sample video frames;
a first loss value determining module 1304, configured to determine a first loss value of the feature extraction model according to the similarity of the preset number of target feature dimensions, where the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation;
a model training module 1305, configured to train the feature extraction model according to the first loss value;
the feature extraction module 1306 is configured to invoke the trained feature extraction model, perform feature extraction on any video frame, and obtain feature information of any video frame.
In one possible implementation, as shown in fig. 14, the apparatus further includes:
a classification processing module 1307, configured to invoke a classification model, and perform classification processing on the target sample features of each sample video frame to obtain category feature information of each sample video frame, where the category feature information includes feature values corresponding to multiple action categories;
an information fusion module 1308, configured to fuse category feature information of multiple sample video frames to obtain fusion category feature information;
a category determining module 1309, configured to determine, as a target action category of the sample video, an action category to which a maximum feature value in the fusion category feature information belongs;
a second loss value determining module 1310, configured to determine a second loss value of the feature extraction model according to a difference between the target action category of the sample video and the sample action category of the sample video, where the second loss value and the difference have a positive correlation;
model training module 1305, comprising:
the model training unit 1351 is configured to train the feature extraction model according to the first loss value and the second loss value.
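Training with both loss values can be sketched as a single optimization step that combines the two; the stand-in linear models, the 0.1 weighting factor, and the SGD optimizer are illustrative assumptions, and first_loss and second_loss refer to the sketches given earlier.

```python
import torch
import torch.nn as nn

# Stand-ins for the feature extraction model and the classification model (assumed).
feature_model = nn.Linear(512, 256)
classifier = nn.Linear(256, 10)
optimizer = torch.optim.SGD(
    list(feature_model.parameters()) + list(classifier.parameters()), lr=0.01)

sample_frames = torch.randn(8, 512)          # pre-pooled sample frame inputs (assumed)
sample_label = 3                             # labelled sample action category

target_feats = feature_model(sample_frames)  # target sample feature information
class_feats = classifier(target_feats)       # category feature information per frame

# Combine the two losses; the 0.1 weighting factor is an illustrative assumption.
loss = 0.1 * first_loss(target_feats) + second_loss(class_feats, sample_label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```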
Fig. 15 shows a block diagram of an electronic device 1500 provided in an exemplary embodiment of the present application. The electronic device 1500 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, MPEG Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, MPEG Audio Layer 4), a notebook computer, or a desktop computer.
In general, electronic device 1500 includes: a processor 1501 and memory 1502.
The processor 1501 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1501 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1501 may also include a main processor and a coprocessor: the main processor is a processor for processing data in a wake-up state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1501 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 1502 may include one or more computer-readable storage media, which may be non-transitory. The memory 1502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1502 is used to store at least one instruction for execution by processor 1501 to implement the video frame feature extraction methods provided by method embodiments herein.
In some embodiments, the electronic device 1500 may further include: a peripheral interface 1503 and at least one peripheral. The processor 1501, memory 1502, and peripheral interface 1503 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1503 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1504, a display 1505, a camera assembly 1506, an audio circuit 1507, a positioning assembly 1508, and a power supply 1509.
The peripheral interface 1503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1501 and the memory 1502. In some embodiments, the processor 1501, memory 1502, and peripheral interface 1503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1501, the memory 1502, and the peripheral interface 1503 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1504 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuitry 1504 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1504 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1504 can communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1504 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 1505 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display 1505 is a touch display, the display 1505 also has the ability to collect touch signals on or above the surface of the display 1505. The touch signals may be input to the processor 1501 as control signals for processing. In this case, the display 1505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard.
The camera assembly 1506 is used to capture images or video. Optionally, the camera assembly 1506 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1506 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1507 may include a microphone and speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1501 for processing or inputting the electric signals to the radio frequency circuit 1504 to realize voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of the electronic device 1500. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1501 or the radio frequency circuit 1504 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1507 may also include a headphone jack.
The positioning component 1508 is used to locate the current geographic location of the electronic device 1500 to implement navigation or LBS (Location Based Service). The positioning component 1508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 1509 is used to supply power to the various components in the electronic device 1500. The power supply 1509 may be alternating current, direct current, disposable or rechargeable. When the power supply 1509 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the electronic device 1500 also includes one or more sensors 1510. The one or more sensors 1510 include, but are not limited to: acceleration sensor 1511, gyro sensor 1512, pressure sensor 1513, fingerprint sensor 1514, optical sensor 1515, and proximity sensor 1516.
The acceleration sensor 1511 may detect the magnitude of acceleration on three coordinate axes of a coordinate system established with the electronic apparatus 1500. For example, the acceleration sensor 1511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1501 may control the display screen 1505 to display the user interface in a landscape view or a portrait view based on the gravitational acceleration signal collected by the acceleration sensor 1511. The acceleration sensor 1511 may also be used for acquisition of motion data of a game or a user.
The gyroscope sensor 1512 may detect a body direction and a rotation angle of the electronic device 1500, and the gyroscope sensor 1512 and the acceleration sensor 1511 may cooperate to collect a 3D motion of the user on the electronic device 1500. The processor 1501 may implement the following functions according to the data collected by the gyro sensor 1512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 1513 may be disposed on a side bezel of the electronic device 1500 and/or underneath the display 1505. When the pressure sensor 1513 is disposed on the side frame of the electronic device 1500, the holding signal of the user to the electronic device 1500 may be detected, and the processor 1501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1513. When the pressure sensor 1513 is disposed at a lower layer of the display screen 1505, the processor 1501 controls the operability control on the UI interface in accordance with the pressure operation of the user on the display screen 1505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1514 is used to collect a user's fingerprint. The processor 1501 identifies the user's identity based on the fingerprint collected by the fingerprint sensor 1514, or the fingerprint sensor 1514 identifies the user's identity based on the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 1501 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like.
The optical sensor 1515 is used to collect ambient light intensity. In one embodiment, processor 1501 may control the brightness of display screen 1505 based on the intensity of ambient light collected by optical sensor 1515. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1505 is increased; when the ambient light intensity is low, the display brightness of the display screen 1505 is adjusted down. In another embodiment, the processor 1501 may also dynamically adjust the shooting parameters of the camera assembly 1506 based on the ambient light intensity collected by the optical sensor 1515.
A proximity sensor 1516, also referred to as a distance sensor, is typically provided on the front panel of the electronic device 1500. The proximity sensor 1516 is used to collect the distance between the user and the front of the electronic device 1500. In one embodiment, when the proximity sensor 1516 detects that the distance between the user and the front of the electronic device 1500 gradually decreases, the processor 1501 controls the display 1505 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 1516 detects that the distance between the user and the front of the electronic device 1500 gradually increases, the processor 1501 controls the display 1505 to switch from the dark-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in FIG. 15 is not intended to be limiting of electronic device 1500, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 16 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1600 may vary considerably due to differences in configuration or performance, and may include one or more processors (CPUs) 1601 and one or more memories 1602, where the memory 1602 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1601 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which are not described herein again.
The server 1600 may be configured to perform the video frame feature extraction method described above.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor, so as to implement the video frame feature extraction method in the foregoing embodiment.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is loaded and executed by a processor to implement the video frame feature extraction method according to the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, so that the computer device performs the operations in the video frame feature extraction method of the above aspect.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an alternative embodiment of the present application and should not be construed as limiting the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method for extracting features of video frames, the method comprising:
acquiring a plurality of video frames in the same video data;
respectively extracting features of each video frame to obtain initial feature information of each video frame, wherein the initial feature information comprises initial features corresponding to a plurality of feature dimensions;
performing motion identification according to the initial feature information of the plurality of video frames to obtain motion feature information of the plurality of video frames, wherein the motion feature information comprises motion features corresponding to the plurality of feature dimensions;
comparing the motion characteristic information of the plurality of video frames to obtain weight information of each video frame, wherein the weight information comprises weights corresponding to the plurality of characteristic dimensions, and the weights represent the influence degree of the characteristic dimensions on the motion characteristics of the video frames;
and respectively carrying out fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame.
2. The method according to claim 1, wherein the performing motion recognition based on the initial feature information of the plurality of video frames to obtain the motion feature information of the plurality of video frames comprises:
and comparing the initial characteristic information of any two adjacent video frames in the plurality of video frames to obtain the motion characteristic information of the first video frame in any two video frames.
3. The method according to claim 2, wherein the comparing the initial feature information of any two adjacent video frames in the plurality of video frames to obtain the motion feature information of the first video frame in the any two video frames comprises:
performing dimensionality reduction processing on each initial feature in the initial feature information of the first video frame and the second video frame in any two video frames;
and determining the difference characteristic information between the characteristic information of the first video frame after the dimension reduction processing and the characteristic information of the second video frame after the dimension reduction processing as the motion characteristic information of the first video frame.
4. The method of claim 3, further comprising:
determining preset feature information as motion feature information of a last video frame of the plurality of video frames.
5. The method according to claim 1, wherein the comparing the motion characteristic information of the plurality of video frames to obtain the weight information of each video frame comprises:
for each video frame, performing fusion processing on the motion characteristic information of the video frame and the motion characteristic information of at least one video frame before the video frame to obtain fusion motion characteristic information of the video frame, wherein the fusion motion characteristic information comprises fusion motion characteristics corresponding to the multiple characteristic dimensions;
and normalizing the plurality of fusion motion characteristics in the fusion motion characteristic information, and taking the fusion motion characteristic information after normalization as the weight information.
6. The method of claim 5, further comprising:
and in response to the video frame being the first video frame of the plurality of video frames, performing fusion processing on the motion characteristic information of the video frame and the motion characteristic information of the last video frame of the plurality of video frames to obtain fusion motion characteristic information of the video frame.
7. The method according to claim 1, wherein after the initial feature information of each video frame and the corresponding weight information are respectively fused to obtain the target feature information of each video frame, the method further comprises:
classifying the target characteristic information of each video frame to obtain category characteristic information of each video frame, wherein the category characteristic information comprises characteristic values corresponding to a plurality of action categories;
fusing the category characteristic information of the plurality of video frames to obtain fused category characteristic information;
and determining the action type to which the maximum characteristic value in the fusion type characteristic information belongs as the action type of the video data.
8. The method according to claim 1, wherein the step of separately performing feature extraction on each video frame to obtain initial feature information of each video frame is implemented by calling a feature extraction model;
the step of carrying out motion recognition according to the initial characteristic information of the plurality of video frames to obtain the motion characteristic information of the plurality of video frames is realized by calling a motion recognition model;
the step of comparing the motion characteristic information of the plurality of video frames to obtain the weight information of each video frame is realized by calling a weight acquisition model;
and the step of respectively carrying out fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame is realized by calling an attention fusion model.
9. The method of claim 8, further comprising:
acquiring a plurality of sample video frames in the same sample video data;
calling the feature extraction model, and respectively performing feature extraction on each sample video frame to obtain initial sample feature information of each sample video frame, wherein the initial sample feature information comprises initial sample features corresponding to the multiple feature dimensions;
calling the motion recognition model, and performing motion recognition according to initial sample feature information of the plurality of sample video frames to obtain motion sample feature information of the plurality of sample video frames, wherein the motion sample feature information comprises motion sample features corresponding to the plurality of feature dimensions;
calling the weight obtaining model, and comparing the motion sample characteristic information of the plurality of sample video frames to obtain sample weight information of each sample video frame, wherein the sample weight information comprises weights corresponding to the plurality of characteristic dimensions;
calling the attention fusion model, and respectively carrying out fusion processing on the initial sample characteristic information and the corresponding sample weight information of each sample video frame to obtain target sample characteristic information of each sample video frame;
and training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the target sample feature information of the plurality of sample video frames.
10. The method according to claim 9, wherein the training the feature extraction model, the motion recognition model, the weight obtaining model and the attention fusion model according to the target sample feature information of the plurality of sample video frames comprises:
for any target feature dimension, determining the similarity of the target feature dimension according to the similarity between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames;
determining a first loss value of the feature extraction model according to the similarity of a preset number of target feature dimensions, wherein the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation relationship;
and training the feature extraction model, the motion recognition model, the weight acquisition model and the attention fusion model according to the first loss value.
11. A method for extracting features of video frames, the method comprising:
acquiring a plurality of sample video frames in the same sample video data;
calling a feature extraction model, and respectively extracting features of each sample video frame to obtain target sample feature information of each sample video frame;
for any target feature dimension, determining the similarity of the target feature dimension according to the sum of the similarities between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames;
determining a first loss value of the feature extraction model according to the similarity of a preset number of target feature dimensions, wherein the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation relationship;
training the feature extraction model according to the first loss value;
and calling the trained feature extraction model to extract the features of any video frame to obtain the feature information of any video frame.
12. An apparatus for extracting features of a video frame, the apparatus comprising:
the video frame acquisition module is used for acquiring a plurality of video frames in the same video data;
the feature extraction module is used for respectively extracting features of each video frame to obtain initial feature information of each video frame, wherein the initial feature information comprises initial features corresponding to a plurality of feature dimensions;
the motion identification module is used for carrying out motion identification according to the initial characteristic information of the plurality of video frames to obtain the motion characteristic information of the plurality of video frames, wherein the motion characteristic information comprises motion characteristics corresponding to the plurality of characteristic dimensions;
the comparison processing module is used for performing comparison processing on the motion characteristic information of the plurality of video frames to obtain weight information of each video frame, wherein the weight information comprises weights corresponding to the plurality of characteristic dimensions, and the weights represent the influence degree of the characteristic dimensions on the motion characteristics of the video frames;
and the first fusion processing module is used for respectively carrying out fusion processing on the initial characteristic information of each video frame and the corresponding weight information to obtain the target characteristic information of each video frame.
13. An apparatus for extracting features of a video frame, the apparatus comprising:
the video frame acquisition module is used for acquiring a plurality of sample video frames in the same sample video data;
the characteristic extraction module is used for calling a characteristic extraction model and respectively extracting the characteristics of each sample video frame to obtain the target sample characteristic information of each sample video frame;
the similarity determining module is used for determining the similarity of the target feature dimension according to the sum of the similarities between the sample features belonging to the target feature dimension in the target sample feature information of every two sample video frames of the plurality of sample video frames for any target feature dimension;
the first loss value determining module is used for determining a first loss value of the feature extraction model according to the similarity of a preset number of target feature dimensions, wherein the first loss value and the similarity of the preset number of target feature dimensions are in a positive correlation relationship;
the model training module is used for training the feature extraction model according to the first loss value;
and the feature extraction module is used for calling the trained feature extraction model to extract features of any video frame to obtain feature information of any video frame.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to implement the video frame feature extraction method of any of claims 1 to 10; or to implement the video frame feature extraction method of claim 11.
15. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor, to implement the video frame feature extraction method according to any one of claims 1 to 10; or to implement the video frame feature extraction method of claim 11.
CN202010596100.0A 2020-06-28 2020-06-28 Video frame feature extraction method and device, computer equipment and storage medium Active CN111489378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010596100.0A CN111489378B (en) 2020-06-28 2020-06-28 Video frame feature extraction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010596100.0A CN111489378B (en) 2020-06-28 2020-06-28 Video frame feature extraction method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111489378A true CN111489378A (en) 2020-08-04
CN111489378B CN111489378B (en) 2020-10-16

Family

ID=71793785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010596100.0A Active CN111489378B (en) 2020-06-28 2020-06-28 Video frame feature extraction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111489378B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101540898A (en) * 2009-04-24 2009-09-23 西安电子科技大学 AVS video digital watermark method based on nonnegative matrix decomposition
CN103514608A (en) * 2013-06-24 2014-01-15 西安理工大学 Movement target detection and extraction method based on movement attention fusion model
CN109508584A (en) * 2017-09-15 2019-03-22 腾讯科技(深圳)有限公司 The method of visual classification, the method for information processing and server
CN108090203A (en) * 2017-12-25 2018-05-29 上海七牛信息技术有限公司 Video classification methods, device, storage medium and electronic equipment
CN109886130A (en) * 2019-01-24 2019-06-14 上海媒智科技有限公司 Determination method, apparatus, storage medium and the processor of target object
CN110766724A (en) * 2019-10-31 2020-02-07 北京市商汤科技开发有限公司 Target tracking network training and tracking method and device, electronic equipment and medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699832A (en) * 2021-01-12 2021-04-23 腾讯科技(深圳)有限公司 Target detection method, device, equipment and storage medium
CN112699832B (en) * 2021-01-12 2023-07-04 腾讯科技(深圳)有限公司 Target detection method, device, equipment and storage medium
WO2022161302A1 (en) * 2021-01-29 2022-08-04 腾讯科技(深圳)有限公司 Action recognition method and apparatus, device, storage medium, and computer program product
CN113763296A (en) * 2021-04-28 2021-12-07 腾讯云计算(北京)有限责任公司 Image processing method, apparatus and medium
CN113298141A (en) * 2021-05-24 2021-08-24 北京环境特性研究所 Detection method and device based on multi-source information fusion and storage medium
CN113298141B (en) * 2021-05-24 2023-09-15 北京环境特性研究所 Detection method, device and storage medium based on multi-source information fusion
CN113065533B (en) * 2021-06-01 2021-11-02 北京达佳互联信息技术有限公司 Feature extraction model generation method and device, electronic equipment and storage medium
CN113065533A (en) * 2021-06-01 2021-07-02 北京达佳互联信息技术有限公司 Feature extraction model generation method and device, electronic equipment and storage medium
CN113673557A (en) * 2021-07-12 2021-11-19 浙江大华技术股份有限公司 Feature processing method, action positioning method and related equipment
CN113742585A (en) * 2021-08-31 2021-12-03 深圳Tcl新技术有限公司 Content search method, content search device, electronic equipment and computer-readable storage medium
CN114245206A (en) * 2022-02-23 2022-03-25 阿里巴巴达摩院(杭州)科技有限公司 Video processing method and device
CN114245206B (en) * 2022-02-23 2022-07-15 阿里巴巴达摩院(杭州)科技有限公司 Video processing method and device
CN116862952A (en) * 2023-07-26 2023-10-10 合肥工业大学 Video tracking method for substation operators under similar background conditions
CN116862952B (en) * 2023-07-26 2024-02-27 合肥工业大学 Video tracking method for substation operators under similar background conditions

Also Published As

Publication number Publication date
CN111489378B (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111489378B (en) Video frame feature extraction method and device, computer equipment and storage medium
US11244170B2 (en) Scene segmentation method and device, and storage medium
CN111091132B (en) Image recognition method and device based on artificial intelligence, computer equipment and medium
CN110097019B (en) Character recognition method, character recognition device, computer equipment and storage medium
CN111476306B (en) Object detection method, device, equipment and storage medium based on artificial intelligence
CN110083791B (en) Target group detection method and device, computer equipment and storage medium
CN111931877B (en) Target detection method, device, equipment and storage medium
CN113395542B (en) Video generation method and device based on artificial intelligence, computer equipment and medium
CN110807361A (en) Human body recognition method and device, computer equipment and storage medium
CN111368116B (en) Image classification method and device, computer equipment and storage medium
CN112966124B (en) Training method, alignment method, device and equipment of knowledge graph alignment model
CN112733970B (en) Image classification model processing method, image classification method and device
CN111178343A (en) Multimedia resource detection method, device, equipment and medium based on artificial intelligence
CN111738365B (en) Image classification model training method and device, computer equipment and storage medium
CN110837858A (en) Network model training method and device, computer equipment and storage medium
CN111068323B (en) Intelligent speed detection method, intelligent speed detection device, computer equipment and storage medium
CN113515987A (en) Palm print recognition method and device, computer equipment and storage medium
CN112053360B (en) Image segmentation method, device, computer equipment and storage medium
CN113570510A (en) Image processing method, device, equipment and storage medium
CN113763931A (en) Waveform feature extraction method and device, computer equipment and storage medium
CN111753813A (en) Image processing method, device, equipment and storage medium
CN113343709B (en) Method for training intention recognition model, method, device and equipment for intention recognition
CN113762054A (en) Image recognition method, device, equipment and readable storage medium
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium
CN113763932A (en) Voice processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40027302

Country of ref document: HK