CN113158710B - Video classification method, device, terminal and storage medium

Info

Publication number
CN113158710B
CN113158710B (application CN202010441124.9A)
Authority
CN
China
Prior art keywords
image
human body
frame image
video
features
Prior art date
Legal status
Active
Application number
CN202010441124.9A
Other languages
Chinese (zh)
Other versions
CN113158710A (en)
Inventor
董强
李雪
孙芯彤
Current Assignee
Xi'an Tianhe Defense Technology Co ltd
Original Assignee
Xi'an Tianhe Defense Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xi'an Tianhe Defense Technology Co ltd
Priority to CN202010441124.9A
Publication of CN113158710A
Application granted
Publication of CN113158710B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/40 — Scenes; scene-specific elements in video content
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 — Pattern recognition: non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/24 — Pattern recognition: classification techniques
    • G06N 20/00 — Machine learning
    • G06N 3/044 — Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Neural networks: combinations of networks
    • G06V 20/46 — Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application belongs to the technical field of computers and provides a video classification method, a device, a terminal and a storage medium. The method comprises the following steps: acquiring a plurality of single-frame images corresponding to a video to be processed; inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain, for each single-frame image, a first target image marked with global features; generating a first target video based on at least one first target image; and inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video. In this way, the global features corresponding to each single-frame image are extracted by the global feature extraction model and the target video is generated from images carrying these global features, so that the semantic features extracted when the target video is processed by the video classification model are richer and more accurate; classification based on these semantic features therefore yields a more accurate result, which improves the accuracy of video classification.

Description

Video classification method, device, terminal and storage medium
Technical Field
The application belongs to the technical field of computers, and particularly relates to a video classification method, a video classification device, a video classification terminal and a video classification storage medium.
Background
Video classification refers to classifying videos into their corresponding categories by analyzing the video information. Video classification plays an important role in real-world applications; for example, it may be applied in video search, video recommendation and the like. However, conventional video classification methods often produce inaccurate classification results, so the accuracy of video classification needs to be improved.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a method, an apparatus, a terminal, and a storage medium for video classification, so as to solve the problem of inaccurate classification results in the conventional video classification method.
A first aspect of an embodiment of the present application provides a method for video classification, including:
Acquiring a plurality of single-frame images corresponding to a video to be processed;
inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features comprise human body joint point features, personnel attribute features and scene features;
Generating a first target video based on at least one of the first target images;
and inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
Optionally, the global feature extraction model comprises a human body posture joint point extraction model, a personnel attribute identification model and a scene identification model; inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image, wherein the method comprises the following steps:
Inputting the single-frame images into the human body posture joint point extraction model for processing aiming at each single-frame image to obtain a first image marked with human body joint point characteristics;
inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute characteristics;
Inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene characteristics;
and for the first image, the second image and the third image, fusing, on the basis of any one of the three images, the features marked in the other two images into that image to obtain the first target image.
Optionally, for each single frame image, inputting the single frame image into the human body posture joint point extraction model for processing to obtain a first image marked with human body joint point characteristics, including:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
And generating a first image corresponding to each single frame image based on each single frame image and each human body joint point characteristic corresponding to each single frame image.
Optionally, for each single frame image, inputting the single frame image into the human body posture joint point extraction model for processing to obtain a first image marked with human body joint point characteristics, including:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
Distributing weight values to the characteristics of the joints of the human body corresponding to each single frame image through the trained spatial attention network;
And generating a first image corresponding to each single frame image based on the human body joint point characteristics of which the weight value corresponding to each single frame image is larger than a first preset threshold value and each single frame image.
Optionally, the generating a first target video based on at least one first target image includes:
Assigning a weight value to each of the first target images through a trained temporal attention network;
And generating the first target video based on the first target image with the weight value larger than a second preset threshold value.
Optionally, the step of inputting the single frame image into the personnel attribute identification model to obtain a second image marked with personnel attribute features includes:
Acquiring a human body image corresponding to the single-frame image through the personnel attribute identification model;
identifying and marking personnel attribute features in the human body image;
the second image is generated based on the single frame image and the person attribute feature.
Optionally, the inputting the single frame image into the scene recognition model for processing to obtain a third image marked with scene features includes:
Extracting scene characteristics in the single-frame image through the scene recognition model;
Determining a scene category corresponding to the single-frame image based on the scene feature;
the third image is generated based on the single frame image and the scene category.
Optionally, the inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video includes:
Acquiring semantic features corresponding to the first target video through the video classification model;
and classifying the semantic features to obtain the classification result.
A second aspect of an embodiment of the present invention provides an apparatus for video classification, the apparatus comprising:
the acquisition unit is used for acquiring a plurality of single-frame images corresponding to the video to be processed;
The first processing unit is used for inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features comprise human body joint point features, personnel attribute features and scene features;
a generating unit, configured to generate a first target video based on at least one of the first target images;
and the second processing unit is used for inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
Optionally, the global feature extraction model comprises a human body posture joint point extraction model, a personnel attribute identification model and a scene identification model; the first processing unit includes:
the first image generation unit is used for inputting the single-frame images into the human body posture joint point extraction model for processing aiming at each single-frame image to obtain a first image marked with human body joint point characteristics;
the second image generating unit is used for inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute characteristics;
A third image generating unit, configured to input the single-frame image into the scene recognition model for processing, to obtain a third image marked with scene features;
and the first target image generation unit is used for fusing, on the basis of any one of the first image, the second image and the third image, the features marked in the other two images into that image to obtain the first target image.
Optionally, the first image generating unit is specifically configured to:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
And generating a first image corresponding to each single frame image based on each single frame image and each human body joint point characteristic corresponding to each single frame image.
Optionally, the first image generating unit is specifically configured to:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
Distributing weight values to the characteristics of the joints of the human body corresponding to each single frame image through the trained spatial attention network;
And generating a first image corresponding to each single frame image based on the human body joint point characteristics of which the weight value corresponding to each single frame image is larger than a first preset threshold value and each single frame image.
Optionally, the generating unit is specifically configured to:
Assigning a weight value to each of the first target images through a trained temporal attention network;
And generating the first target video based on the first target image with the weight value larger than a second preset threshold value.
Optionally, the second image generating unit is specifically configured to:
Acquiring a human body image corresponding to the single-frame image through the personnel attribute identification model;
identifying and marking personnel attribute features in the human body image;
the second image is generated based on the single frame image and the person attribute feature.
Optionally, the third image generating unit is specifically configured to:
Extracting scene characteristics in the single-frame image through the scene recognition model;
Determining a scene category corresponding to the single-frame image based on the scene feature;
the third image is generated based on the single frame image and the scene category.
Optionally, the second processing unit is specifically configured to:
Acquiring semantic features corresponding to the first target video through the video classification model;
and classifying the semantic features to obtain the classification result.
A third aspect of an embodiment of the present invention provides a terminal for video classification, including a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, and the memory is configured to store a computer program supporting the terminal to perform the above method, where the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the following steps:
Acquiring a plurality of single-frame images corresponding to a video to be processed;
inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features comprise human body joint point features, personnel attribute features and scene features;
Generating a first target video based on at least one of the first target images;
and inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
A fourth aspect of embodiments of the present invention provides a computer readable storage medium storing a computer program which when executed by a processor performs the steps of:
Acquiring a plurality of single-frame images corresponding to a video to be processed;
inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features comprise human body joint point features, personnel attribute features and scene features;
Generating a first target video based on at least one of the first target images;
and inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
The video classification method, the video classification device, the video classification terminal and the video classification storage medium provided by the embodiment of the application have the following beneficial effects:
According to the embodiment of the application, a plurality of single-frame images corresponding to the video to be processed are acquired; inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; generating a first target video based on the at least one first target image; and inputting the first target video into the trained video classification model for processing to obtain a classification result corresponding to the first target video. According to the application, the single-frame image is processed through the trained global feature extraction model, so that the global features corresponding to the single-frame image, namely the human body joint features, the personnel attribute features and the scene features corresponding to the single-frame image are extracted, and the extracted features corresponding to the single-frame image are very comprehensive and rich. Generating a target video based on the image with the global feature, wherein the extracted semantic features are richer and more accurate when the target video is processed through the trained video classification model; therefore, when the video classification is performed based on the semantic features, the classification result is more accurate, and the accuracy of video classification is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart illustrating an implementation of a method for video classification according to an embodiment of the present application;
FIG. 2 is a flow chart of an implementation of a method for video classification according to another embodiment of the application;
FIG. 3 is a schematic diagram of an apparatus for video classification according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a terminal for video classification according to another embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for classifying video according to an embodiment of the present invention. The execution subject of the video classification method in this embodiment is a terminal, which includes, but is not limited to, mobile terminals such as smart phones, tablet computers and personal digital assistants (PDAs), and may also include terminals such as desktop computers. The method of video classification as shown in fig. 1 may include:
S101: and acquiring a plurality of single-frame images corresponding to the video to be processed.
The video to be processed refers to the video that needs to be classified. To obtain the plurality of single-frame images corresponding to the video to be processed, the terminal may, after acquiring the video, divide it with the single-frame image as the minimum division unit, thereby obtaining the plurality of single-frame images corresponding to the video to be processed. For example, if the video to be processed consists of 64 video frames, each video frame corresponds to one single-frame image, and segmenting the video yields 64 single-frame images.
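As an illustrative sketch of this splitting step (not part of the patent text), the following Python code uses OpenCV to decode a video into its single-frame images; the file name and the choice of OpenCV are assumptions made for the example.

```python
import cv2  # OpenCV for video decoding; an illustrative choice, not mandated by the patent


def split_video_into_frames(video_path):
    """Read a video and return its single-frame images as a list of BGR arrays."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        success, frame = capture.read()
        if not success:       # no more frames to decode
            break
        frames.append(frame)  # each element corresponds to one single-frame image
    capture.release()
    return frames


# Example: a 64-frame video yields 64 single-frame images.
# frames = split_video_into_frames("video_to_process.mp4")
```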
Alternatively, the terminal or another terminal may divide the video to be processed in advance to obtain the plurality of single-frame images, and the terminal then acquires these single-frame images. For example, the terminal may obtain the plurality of single-frame images corresponding to the video to be processed from a database, or another terminal may send them to the terminal and the terminal receives them. This is merely illustrative and is not limiting.
S102: inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features include human body joint point features, personnel attribute features and scene features.
The global features corresponding to each single frame image may include human body node features, personnel attribute features, scene features, and the like corresponding to the single frame image. The human body joint feature may be a feature of each joint point of the human body in a single frame image, for example, a feature corresponding to the joints such as the head, shoulder, elbow, wrist, hand, knee, crotch, ankle, and the like of the human body. The person attribute features may be respective attributes corresponding to the person in the single frame image, for example, attribute features of a coat type, a pants type, a skirt type, a shoe type, a carried article type, a hair length, and the like of the person. The scene features may be scene types corresponding to a single frame image, for example, scene features of a study room scene, an office scene, a playground, a traffic intersection, in a bus, in a subway, and the like.
The first target image is an image obtained by processing a single-frame image through the global feature extraction model, and the first target image is marked with global features, namely human body joint point features, personnel attribute features and scene features. The human body joint point features may mark the positions of the joint points and the corresponding joint point names, the personnel attribute features may mark the types of the corresponding attributes, and the scene features may be marked at positions in the first target image where no human body is present.
The plurality of single-frame images can be sequentially input into a trained global feature extraction model for processing, and the global feature extraction model sequentially outputs a first target image corresponding to each single-frame image; the method can also input a plurality of single-frame images into a trained global feature extraction model in an out-of-order manner, and the global feature extraction model correspondingly outputs a first target image according to the input sequence of each single-frame image.
For example, the trained global feature extraction model may include a human body posture joint point extraction model, a person attribute recognition model, and a scene recognition model, and S102 may include S1021 to S1024, which are specifically as follows:
S1021: and inputting the single-frame image into the human body posture joint point extraction model for processing aiming at each single-frame image to obtain a first image marked with human body joint point characteristics.
The first image is an image obtained by processing a single frame image through a human body gesture joint point extraction model, and the first image is marked with human body joint point characteristics. For example, the human body feature map in the single frame image can be extracted through the human body posture joint point extraction model, and each human body joint point feature in the human body feature map is identified. Wherein, each human body joint point can be marked by means of dots, squares, bold drawing, red marking and the like, and the joint point names and the like corresponding to each human body joint point can be marked. And mapping the mark of each human body joint point into a single frame image to obtain a first image.
The human body posture joint point extraction model is obtained by training an initial human body posture joint point extraction network based on a first sample training set by using a machine learning algorithm. The first sample training set comprises a plurality of first sample images and first mark images which are marked with human joint point characteristics and correspond to each first sample image.
For example, when the terminal processes each single frame image through the trained human body posture node extraction model, S1021 may include S10211 to S10213, which are specifically as follows:
S10211: and acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model.
For each single-frame image input into the human body posture joint point extraction model, the position of the human body in the single-frame image is detected, and the image of the region where the human body is located is cropped out to obtain a human body image. A feature vector of the human body image is then extracted through a network layer in the human body posture joint point extraction model to obtain the corresponding human body feature map.
S10212: identifying and marking each human joint point characteristic in the human characteristic map.
The human body posture joint point extraction model can comprise a plurality of convolution layers, a plurality of sampling layers and a full connection layer. Illustratively, the first convolution layer convolves the human body feature map, extracts features corresponding to the human body feature map, and outputs a feature map based on the extracted features. The first convolution layer passes its output feature map to the first sampling layer, which performs feature selection on the feature map, namely selects the human body joint point features in the feature map, and reconstructs a new feature map based on the selected features. The first sampling layer transmits the new feature map to the second convolution layer, which performs a second round of feature extraction and again outputs a feature map based on the extracted features; the second convolution layer transmits this feature map to the second sampling layer, which performs a second round of feature selection, that is, selects the human body joint point features in the feature map again and reconstructs the feature map based on the newly selected features. And so on, until the last sampling layer in the human body posture joint point extraction model finishes processing the feature map, at which point the model has identified and extracted all human body joint point features in the human body feature map. The last sampling layer transmits the final sampling result to the full connection layer, which marks all human body joint point features in the human body feature map and outputs the map after all human body joint point features have been marked. Illustratively, each human body joint point may be marked by dots, squares, bold strokes, red marks and the like, and the joint point name corresponding to each human body joint point may also be marked.
S10213: and generating a first image corresponding to each single frame image based on each single frame image and each human body joint point characteristic corresponding to each single frame image.
And acquiring marked positions of each human body joint point characteristic in the human body characteristic diagram. And mapping the marked position of each human body joint point characteristic in the human body characteristic diagram into a single frame image, and generating a first image corresponding to each single frame image. The marked position of each human body joint point characteristic in the human body characteristic diagram and the joint point name can be mapped into the single-frame image at the same time, and a first image corresponding to each single-frame image is generated.
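A minimal sketch of S10211 to S10213, assuming a hypothetical pose_model callable that returns a (joint name, x, y) tuple for each detected joint point; the dot-plus-name drawing mirrors the marking options described above and is not the only possible style.

```python
import cv2


def mark_joint_points(single_frame, pose_model):
    """Return the first image: the single-frame image with each detected human body
    joint point marked by a red dot and labelled with its joint name."""
    first_image = single_frame.copy()
    # `pose_model` stands in for the trained human body posture joint point
    # extraction model; it is assumed to yield (name, x, y) tuples.
    for name, x, y in pose_model(single_frame):
        cv2.circle(first_image, (int(x), int(y)), 4, (0, 0, 255), -1)  # dot mark
        cv2.putText(first_image, name, (int(x) + 5, int(y) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 255), 1)     # joint name label
    return first_image
```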
For example, in another implementation, when the terminal processes each single-frame image through the trained human body posture joint point extraction model, S1021 may include S10214 to S10217. It should be noted that S10211 to S10213 and S10214 to S10217 are parallel alternatives; S10214 to S10217 are not executed after S10211 to S10213, and the specific execution mode is not limited here. S10214 to S10217 are specifically as follows:
S10214: and acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model.
S10215: identifying and marking each human joint point characteristic in the human characteristic map.
S10214 and S10215 in this embodiment are the same as the execution process in S10211 and S10212 described above, and will not be described here again.
S10216: and distributing weight values to the characteristics of the joints of the human body corresponding to each single frame image through the trained spatial attention network.
The trained spatial attention network is used for assigning weight values to the human body joint point features in a single-frame image, that is, the spatial attention network can assign different weight values to the human body joint points according to their different importance in the single-frame image. A trained spatial attention network may be obtained from the network and deployed on the terminal. The trained spatial attention network performs a preliminary detection on the single-frame image, predicts the action of the human body in the single-frame image, and assigns a weight value to each human body joint point feature according to the prediction result.
For example, suppose there are two persons in a single-frame image: the person standing on the right throws a punch at the person on the left, and the person on the left dodges the attack. The trained spatial attention network performs a preliminary detection on the single-frame image and predicts that the action of the human body on the right is a punch attack and the action of the human body on the left is dodging a punch. For this action, the joint points of the upper body are obviously more important, so the spatial attention network assigns relatively large weight values to the important human body joint point features, such as the hands, head and trunk centre, and relatively small weight values to the less important ones, such as the knees and feet. The specific weight values can be preset, with the constraint that the weight values of all human body joint points in one single-frame image sum to 1.
S10217: and generating a first image corresponding to each single frame image based on the human body joint point characteristics of which the weight value corresponding to each single frame image is larger than a first preset threshold value and each single frame image.
The first preset threshold value is used for comparison with the weight value corresponding to each human body joint point feature, so as to judge, according to the comparison result, which human body joint point features corresponding to each single-frame image are important. The first preset threshold value may be preset and adjusted, which is not limited.
For each single-frame image, after the trained spatial attention network has assigned different weight values to the human body joint point features, each weight value is compared with the first preset threshold value, and the human body joint point features whose weight value is greater than the first preset threshold value are given key marks. The specific marking mode may be to mark these features with large dots, or the weight value corresponding to each human body joint point feature may be marked directly in the image, which is not limited. Large dots and small dots are used relative to each other: human body joint point features whose weight value is greater than the first preset threshold value are marked with large dots, and those whose weight value is less than or equal to the first preset threshold value are marked with small dots.
The marked positions, in the human body feature map, of the human body joint point features that received key marks are acquired and mapped into the single-frame image to generate the first image corresponding to each single-frame image; the joint point names of these key-marked features may be mapped into the single-frame image at the same time.
Alternatively, the weight value corresponding to each marked human body joint point feature, together with its marked position in the human body feature map and its joint point name, may be mapped into the single-frame image to generate the first image corresponding to each single-frame image.
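The sketch below illustrates the weighting and thresholding logic of S10216 to S10217; the spatial attention network is abstracted as a callable returning one weight per joint point, and the first preset threshold value used here is a hypothetical constant.

```python
import cv2

FIRST_PRESET_THRESHOLD = 0.1  # hypothetical value; the patent leaves it configurable


def mark_weighted_joints(single_frame, joints, spatial_attention):
    """joints: list of (name, x, y) tuples; spatial_attention returns one weight per
    joint point, with the weights of one single-frame image summing to 1."""
    weights = spatial_attention(single_frame, joints)
    first_image = single_frame.copy()
    for (name, x, y), weight in zip(joints, weights):
        # Important joint points (weight above the first preset threshold) get a large
        # dot, the rest a small dot, matching the marking scheme described above.
        radius = 6 if weight > FIRST_PRESET_THRESHOLD else 2
        cv2.circle(first_image, (int(x), int(y)), radius, (0, 0, 255), -1)
    return first_image
```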
In this embodiment, a weight value is assigned to each human body joint feature corresponding to each single frame image through the trained spatial attention network, and then a first image corresponding to each single frame image is generated based on the human body joint feature corresponding to each single frame image and the weight value greater than a first preset threshold. By the method, important human body joint point characteristics in each single frame image are extracted, and the important human body joint point characteristics can be used as effective characteristics of subsequent video classification, so that the accuracy of video classification is improved; and because the important human body joint point characteristics are extracted, the terminal can only process the important human body joint point characteristics during the subsequent video classification, and all the human body joint point characteristics are not required to be processed, so that the video classification speed is further improved.
Before performing S1021, it may further include: training the initial human body posture joint point extraction network based on the first sample training set to obtain a human body posture joint point extraction model.
Specifically, the first sample training set includes a plurality of first sample images and a first marker image marked with human body joint features corresponding to each of the first sample images. Inputting the first sample images in the first sample training set into an initial human body posture joint point extraction network for processing to obtain first mark images marked with human body joint point characteristics corresponding to each first sample image. The network structure corresponding to the initial human body posture joint point extraction network in the training process is the same as the network structure corresponding to the human body posture joint point extraction model used in the practical application process. The processing procedure of the initial human body posture node extraction network for each first sample image is the same as the processing procedure of the human body posture node extraction model for each single frame image, and reference may be made to the descriptions in the above steps S10211 to S10213, which are not repeated here.
A first loss value is calculated, according to a first preset loss function, between the image marked with human body joint point features obtained after the initial human body posture joint point extraction network processes the first sample image and the first marker image corresponding to that first sample image in the first sample training set. In this example, an activation function (a sigmoid function) may be utilized as the loss function from which the first loss value is calculated. This is merely illustrative and is not limiting.
When the first loss value is calculated, judging whether the first loss value is larger than a first preset threshold value, and when the first loss value is larger than the first preset threshold value, adjusting parameters in the initial human body posture joint point extraction network, and returning to execute the steps of inputting the first sample images in the first sample training set into the initial human body posture joint point extraction network for processing to obtain first mark images marked with human body joint point characteristics corresponding to each first sample image. When the first loss value is smaller than or equal to a first preset threshold value, the current initial human body posture joint point extraction network is judged to meet the expected requirement, and training of the initial human body posture joint point extraction network is stopped. And taking the initial human body posture joint point extraction network at the moment as a trained human body posture joint point extraction model.
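A schematic PyTorch-style training loop for the procedure just described, assuming a hypothetical network and data loader; the stopping rule follows the text (training stops once the loss no longer exceeds the threshold), and the sigmoid-based loss is only an illustrative stand-in for the first preset loss function.

```python
import torch
import torch.nn as nn


def train_pose_joint_network(network, sample_loader, loss_threshold=0.05, lr=1e-3):
    """Train an initial human body posture joint point extraction network until the
    first loss value is no longer greater than the preset threshold."""
    criterion = nn.BCEWithLogitsLoss()  # sigmoid-based loss (an assumption)
    optimizer = torch.optim.SGD(network.parameters(), lr=lr)
    while True:
        for sample_image, marker_image in sample_loader:
            prediction = network(sample_image)          # image marked with joint features
            loss = criterion(prediction, marker_image)  # first loss value
            if loss.item() <= loss_threshold:           # expected requirement met
                return network                          # stop training
            optimizer.zero_grad()                       # otherwise adjust parameters
            loss.backward()
            optimizer.step()
```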
S1022: and inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute characteristics.
The second image is an image obtained by processing the single-frame image through the personnel attribute identification model, and personnel attribute features are marked in the second image. The personnel attribute recognition model is obtained by training the initial personnel attribute recognition network based on the second sample training set by using a machine learning algorithm. The second sample training set includes a plurality of second sample images and a second marker image labeled with a person attribute feature corresponding to each of the second sample images. The second sample image may be the same as or different from the first sample image, and is not limited thereto.
For example, when the terminal processes each single frame image through the trained personnel attribute recognition model, S1022 may include S10221 to S10223, which are specifically as follows:
S10221: and acquiring a human body image corresponding to the single frame image through the personnel attribute identification model.
For each single-frame image input into the personnel attribute identification model, the image is preprocessed and the human body image corresponding to the single-frame image is acquired. Specifically, the region where the human body is located in the single-frame image can be detected, and the part of the image corresponding to that region is extracted to obtain the human body image.
S10222: personnel attribute features in the human body image are identified and marked.
The personnel attribute identification model carries out convolution processing on the human body image to obtain a feature vector 1 corresponding to the human body image; and uniformly cutting the feature vector 1 into horizontal blocks, and respectively carrying out global average pooling on each cut horizontal block to obtain a feature vector 2. And carrying out attribute classification on the feature vector 2 to obtain personnel attribute features in the human body image. For example, the feature vector 2 is input to a plurality of different full-connection layers to classify the different attributes, so that classification results of the jacket type, the trousers type, the skirt type, the shoe type, the type of the carried article, the hair length and the like of the person can be obtained, and the classification results are used as the attribute features of the person in the human body image.
The personnel attribute identification model marks personnel attribute features in the human body image, namely, each position of the personnel body in the human body image corresponds to the marked personnel attribute features. For example, a cap type is marked at the head position of a person in a human body image, a coat type is marked at the upper body position of a person, a skirt type is marked at the lower body position of a person, and the like. This is merely illustrative and is not limiting.
S10223: the second image is generated based on the single frame image and the person attribute feature.
And acquiring the marked position and the marked type of the personnel attribute feature in the human body image, mapping the marked position and the marked type of the personnel attribute feature in the human body image into the single-frame image, and generating a second image corresponding to each single-frame image.
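A minimal PyTorch sketch of the attribute branch described in S10222: a backbone feature map is cut into horizontal blocks, each block is globally average-pooled, and separate fully connected heads classify different attributes. The channel count, number of blocks and attribute lists are hypothetical.

```python
import torch
import torch.nn as nn


class PersonAttributeHead(nn.Module):
    """Cut the backbone feature map into horizontal blocks, pool each block, and
    classify several person attributes with separate fully connected layers."""

    def __init__(self, channels=512, num_blocks=6,
                 attribute_classes=(("coat_type", 8), ("pants_type", 6), ("hair_length", 3))):
        super().__init__()
        self.num_blocks = num_blocks
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling of each horizontal block
        self.heads = nn.ModuleDict({
            name: nn.Linear(channels * num_blocks, n_classes)  # one FC head per attribute
            for name, n_classes in attribute_classes
        })

    def forward(self, feature_map):  # feature_map: (N, C, H, W), i.e. "feature vector 1"
        blocks = torch.chunk(feature_map, self.num_blocks, dim=2)   # uniform horizontal cut
        pooled = [self.pool(block).flatten(1) for block in blocks]  # (N, C) per block
        descriptor = torch.cat(pooled, dim=1)                       # "feature vector 2"
        return {name: head(descriptor) for name, head in self.heads.items()}
```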
Prior to performing S1022, may further include: training the initial personnel attribute recognition network based on the second sample training set to obtain a personnel attribute recognition model.
Specifically, the second sample training set includes a plurality of second sample images and a second marker image labeled with a person attribute feature corresponding to each of the second sample images. And inputting a second sample image in the second sample training set into the initial personnel attribute identification network for processing to obtain a second marked image which corresponds to the second sample image and is marked with personnel attribute characteristics. The network structure corresponding to the initial personnel attribute identification network in the training process is the same as the network structure corresponding to the personnel attribute identification model used in the actual application process. The processing procedure of the initial personnel attribute recognition network for each second sample image is the same as the processing procedure of the personnel attribute recognition model for each single frame image, and reference may be made to the descriptions in the above steps S10221 to S10223, and the details are not repeated here.
Calculating a second loss value between an image marked with personnel attribute characteristics and a second marked image corresponding to the second sample image in a second sample training set, which is obtained after the initial personnel attribute identification network processes the second sample image, according to a second preset loss function; in this example, an activation function may be utilized as the loss function from which the second loss value is calculated. This is merely illustrative and is not limiting.
And when the second loss value is calculated, judging whether the second loss value is larger than a second preset threshold value, and when the second loss value is larger than the second preset threshold value, adjusting parameters in the initial personnel attribute identification network, and returning to execute the step of inputting a second sample image in the second sample training set into the initial personnel attribute identification network for processing to obtain a second marked image marked with personnel attribute characteristics corresponding to the second sample image. And when the second loss value is smaller than or equal to a second preset threshold value, judging that the current initial personnel attribute identification network meets the expected requirement, and stopping training the initial personnel attribute identification network. And taking the initial personnel attribute recognition network at the moment as a trained personnel attribute recognition model.
S1023: and inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene characteristics.
The third image is an image obtained by processing the single frame image through the scene recognition model, and scene features are marked in the third image. The scene recognition model is obtained by training the initial scene recognition network based on a third sample training set by using a machine learning algorithm. The third sample training set includes a plurality of third sample images and a third marker image labeled with a scene feature corresponding to each of the third sample images. The third sample image may be the same as or different from the first sample image and the second sample image, and is not limited thereto.
For example, when the terminal processes each single frame image through the trained scene recognition model, S1023 may include S10231 to S10233, which are specifically as follows:
s10231: and extracting scene characteristics in the single-frame image through the scene recognition model.
For each single-frame image input into the scene recognition model, the image is divided to obtain a plurality of image blocks, and depth features of each image block are extracted through a network layer in the scene recognition model to obtain a depth feature vector corresponding to each image block. The depth feature vectors are then clustered through a clustering algorithm to obtain the scene features corresponding to the single-frame image. For example, each depth feature vector is substituted into the function corresponding to the adopted clustering algorithm for calculation to obtain the corresponding scene features in the single-frame image. The clustering algorithm may be a k-means clustering algorithm, a Vector of Locally Aggregated Descriptors (VLAD) algorithm, or the like, which is not limited.
S10232: and determining the scene category corresponding to the single-frame image based on the scene feature.
And inputting the scene characteristics in the single-frame image into a full-connection layer in the scene recognition model for classification, and obtaining the scene category corresponding to the scene characteristics. For example, a study room scenario, an office scenario, a playground, a traffic intersection, in a bus, in a subway, a meeting room, a basketball court, a football court, a waiting hall, and the like.
S10233: the third image is generated based on the single frame image and the scene category.
And obtaining the scene category corresponding to each single-frame image, marking the scene category in the corresponding single-frame image, and generating a third image corresponding to each single-frame image.
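The sketch below mirrors S10231 to S10233 at a high level: depth features are aggregated across image blocks with k-means clustering and then classified into a scene category by a fully connected layer. Treating the concatenated cluster centres as the scene feature, as well as the category list and classifier weights, are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative category list drawn from the examples above.
SCENE_CATEGORIES = ["study room", "office", "playground", "traffic intersection",
                    "bus", "subway"]


def scene_feature(block_descriptors, num_clusters=4):
    """Aggregate the depth feature vectors of one single-frame image's blocks into a
    fixed-length scene feature via k-means cluster centres."""
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(block_descriptors)
    return kmeans.cluster_centers_.flatten()


def classify_scene(block_descriptors, fc_weights, fc_bias):
    """Apply a fully connected layer (placeholder for the trained scene recognition
    model's classifier) and return the scene category."""
    feature = scene_feature(block_descriptors)
    scores = fc_weights @ feature + fc_bias
    return SCENE_CATEGORIES[int(np.argmax(scores))]
```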
Prior to performing S1023, may further include: training the initial scene recognition network based on the third sample training set to obtain a scene recognition model.
Specifically, the third sample training set includes a plurality of third sample images and a third marker image labeled with a scene feature corresponding to each of the third sample images. And inputting a third sample image in the third sample training set into the initial scene recognition network for processing to obtain a third marked image marked with scene characteristics corresponding to the third sample image. The network structure corresponding to the initial scene recognition network in the training process is the same as the network structure corresponding to the scene recognition model used in the actual application process. The processing procedure of the initial scene recognition network for each third sample image is the same as the processing procedure of the scene recognition model for each single frame image, and reference may be made to the descriptions in steps S10231 to S10233, which are not repeated here.
Calculating a third loss value between an image marked with scene characteristics and a third marked image corresponding to the third sample image in a third sample training set, which is obtained by processing the third sample image by the initial scene recognition network, according to a third preset loss function; in this example, an activation function may be utilized as the loss function from which the third loss value is calculated. This is merely illustrative and is not limiting.
And when the third loss value is calculated, judging whether the third loss value is larger than a third preset threshold value, and when the third loss value is larger than the third preset threshold value, adjusting parameters in the initial scene recognition network, and returning to execute the step of inputting the third sample image in the third sample training set into the initial scene recognition network for processing to obtain a third marked image marked with scene characteristics corresponding to the third sample image. And when the third loss value is smaller than or equal to a third preset threshold value, judging that the current initial scene recognition network meets the expected requirement, and stopping training the initial scene recognition network. And taking the initial scene recognition network at the moment as a trained scene recognition model.
S1024: and fusing the marked features in the other two images into any one image based on any one image to obtain the first target image for the first image, the second image and the third image.
And fusing the marked features in the other two images into any image serving as a basis on the basis of any one image of the first image, the second image and the third image to obtain a first target image. For example, based on the first image, acquiring the marked position and the marked type of the personnel attribute feature in the second image, and correspondingly adding the personnel attribute feature to the first image; and acquiring the marked position and the marked type of the scene feature in the third image, and correspondingly adding the marked position and the marked type into the first image to obtain a first target image marked with the global feature. Similarly, based on the second image, fusing the marked features in the first image and the third image into the second image; or based on the third image, fusing the marked features in the first image and the second image into the third image to obtain a first target image marked with global features.
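A simplified sketch of this fusion step, in which the marks of each image are represented as lightweight annotation records of (label, x, y); the fusion takes one image as the base and draws the annotations of the other two onto it. The record format is an assumption made for illustration.

```python
import cv2


def fuse_marked_features(base_image, *annotation_sets):
    """base_image: one of the first/second/third images; annotation_sets: the features
    marked in the other two images, each a list of (label, x, y) records."""
    first_target_image = base_image.copy()
    for annotations in annotation_sets:
        for label, x, y in annotations:
            cv2.putText(first_target_image, label, (int(x), int(y)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255, 0, 0), 1)
    return first_target_image


# Example: fuse person attribute and scene annotations into the first image.
# first_target = fuse_marked_features(first_image, person_attribute_marks, scene_marks)
```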
S103: a first target video is generated based on at least one of the first target images.
The first target video may be generated based on all the first target images, i.e. based on the first target images corresponding to all the single-frame images of the video to be processed, or it may be generated based on only a portion of the first target images.
Specifically, the time corresponding to each first target image is acquired, and the first target video is generated based on each first target image and its time order. The time, in the video to be processed, of the single-frame image corresponding to each first target image can be obtained as the time corresponding to that first target image; the first target images are then sorted in time order and combined to generate the first target video. Because the first target images are marked with global features, each video frame in the generated first target video is also marked with global features.
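A small sketch of assembling the first target video from time-ordered first target images; the codec, frame rate and output path are illustrative choices rather than requirements of the patent. In the temporal-attention variant of the claims, first target images whose weight value does not exceed the second preset threshold would simply be dropped before writing.

```python
import cv2


def generate_target_video(timed_images, output_path, fps=25.0):
    """timed_images: list of (timestamp, first_target_image) pairs. Frames are
    sorted by time and written out as the first target video."""
    ordered = [image for _, image in sorted(timed_images, key=lambda item: item[0])]
    height, width = ordered[0].shape[:2]
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    for frame in ordered:
        writer.write(frame)
    writer.release()
```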
S104: and inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
The video classification model is obtained by training the initial video classification network based on a fourth sample training set using a machine learning algorithm. The fourth sample training set includes a plurality of video samples and classification results corresponding to the video samples.
For example, when the terminal processes the first target video through the trained video classification model, S104 may include S1041 to S1042, which are specifically as follows:
S1041: and acquiring semantic features corresponding to the first target video through the video classification model.
The network layer in the video classification model extracts the global features of all video frames in the first target video and inputs all the global features to the full connection layer for processing to obtain the semantic features corresponding to the first target video. Alternatively, the global features of each video frame may be obtained, the semantic feature corresponding to each video frame determined based on those global features, and the semantic feature that occurs most frequently among the per-frame semantic features selected as the semantic feature corresponding to the first target video. For example, if the semantic features corresponding to the video frames in the first target video include "basketball players play basketball in the playground", "basketball players play basketball stands in the playground", "basketball players play basketball indoors" and the like, the semantic feature with the largest number of occurrences is selected as the semantic feature corresponding to the first target video.
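The frequency-based selection just described amounts to a majority vote over the per-frame semantic features, as in the hedged sketch below.

```python
from collections import Counter


def video_semantic_feature(frame_semantic_features):
    """Return the semantic feature that occurs most often among the per-frame
    semantic features of the first target video."""
    return Counter(frame_semantic_features).most_common(1)[0][0]


# Example from the text:
# video_semantic_feature(["basketball players play basketball in the playground",
#                         "basketball players play basketball in the playground",
#                         "basketball players play basketball indoors"])
# -> "basketball players play basketball in the playground"
```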
S1042: and classifying the semantic features to obtain the classification result.
The semantic features corresponding to the first target video are input into a classifier in the video classification model for classification, obtaining the classification result corresponding to the first target video, which is also the classification result corresponding to the video to be processed. Because the classifier was built from a large number of different semantic features and their corresponding classification results, inputting the semantic features of the first target video into the classifier for classification processing yields the corresponding classification result.
Prior to performing S104, the method may further include: training the initial video classification network based on the fourth sample training set to obtain the video classification model.
Specifically, the fourth sample training set includes a plurality of video samples and the classification result corresponding to each video sample. The video samples in the fourth sample training set are input into the initial video classification network for processing to obtain the actual classification result corresponding to each video sample. The network structure of the initial video classification network during training is the same as that of the video classification model used in actual application. The processing of each video sample by the initial video classification network is the same as the processing of the first target video by the video classification model; reference may be made to the descriptions of S1041 to S1042 above, which are not repeated here.
A fourth loss value is then calculated, according to a fourth preset loss function, between the actual classification result obtained after a video sample is processed by the initial video classification network and the classification result labelled for that video sample in the fourth sample training set. In this example, an activation function may be used as the loss function from which the fourth loss value is calculated; this is merely illustrative and is not limiting.
After the fourth loss value is calculated, it is compared with a fourth preset threshold. When the fourth loss value is larger than the fourth preset threshold, the parameters of the initial video classification network are adjusted, and the step of inputting the video samples in the fourth sample training set into the initial video classification network to obtain the actual classification result of each video sample is performed again. When the fourth loss value is smaller than or equal to the fourth preset threshold, the current initial video classification network is judged to meet the expected requirement, training is stopped, and the initial video classification network at this point is taken as the trained video classification model.
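A hedged sketch of this training loop, written in PyTorch for concreteness: the cross-entropy loss, the Adam optimizer and the epoch cap are assumptions standing in for the unspecified "fourth preset loss function" and training details; only the stop-when-the-loss-drops-to-the-threshold logic follows the text above.

```python
import torch
import torch.nn as nn

def train_video_classifier(model, train_loader, loss_threshold=0.05, max_epochs=100):
    """Train the initial video classification network until the loss reaches the threshold.

    model, train_loader and loss_threshold stand in for the initial video classification
    network, the fourth sample training set and the fourth preset threshold.
    """
    criterion = nn.CrossEntropyLoss()            # assumed stand-in for the fourth preset loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        for videos, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(videos), labels)   # fourth loss value
            loss.backward()
            optimizer.step()                          # adjust the network parameters
            if loss.item() <= loss_threshold:         # expected requirement met
                return model                          # trained video classification model
    return model
```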
In another implementation, after obtaining the first target image marked with global features for each single frame image, the terminal inputs the first target images into a Long Short-Term Memory (LSTM) network to obtain the semantic features corresponding to the video to be processed, and classifies these semantic features to obtain the classification result corresponding to the video to be processed. Specifically, the first target images are ordered by their corresponding times; the LSTM extracts the global features of the first of the first target images and passes them to the second. The second first target image fuses the global features of the previous image and passes the fused result features to the third first target image; the third fuses the fused result features of the second and passes its result to the fourth, and so on, until all first target images have been processed. The fusion feature corresponding to the last first target image therefore fuses the global features of all preceding first target images and can represent the semantic feature of the video to be processed. This semantic feature is then classified in the manner of S1042 to obtain the classification result corresponding to the video to be processed.
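A minimal PyTorch sketch of this LSTM alternative, in which the per-frame global features are fed in time order and the final hidden state plays the role of the fused semantic feature; the feature dimension, hidden size, class count and module names are illustrative assumptions, not the patent's network. Using the last hidden state mirrors the idea that the final fusion feature carries information from all earlier first target images.

```python
import torch
import torch.nn as nn

class LSTMVideoClassifier(nn.Module):
    """Sketch: per-frame global features are fed in time order; the last hidden
    state acts as the fused semantic feature, which is then classified."""

    def __init__(self, feature_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):            # (batch, num_frames, feature_dim)
        _, (h_n, _) = self.lstm(frame_features)   # each step fuses the previous frames
        semantic_feature = h_n[-1]                # fused feature of the whole sequence
        return self.classifier(semantic_feature)  # classification result

# Example: 20 first target images, each reduced to a 512-d global feature vector.
logits = LSTMVideoClassifier()(torch.randn(1, 20, 512))
```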
In this embodiment, the trained global feature extraction model is used to process each single frame image to extract its global features, that is, the human body joint point features, personnel attribute features and scene features of the single frame image, so that the features extracted for each single frame image are comprehensive and rich. A target video is generated from the images carrying these global features, so the semantic features extracted when the target video is processed by the trained video classification model are richer and more accurate; classification based on these semantic features therefore yields a more accurate result, improving the accuracy of video classification.
Referring to fig. 2, fig. 2 is a schematic flowchart of a method for classifying video according to another embodiment of the present invention. The execution subject of the video classification method in this embodiment is a terminal, and the terminal includes, but is not limited to, mobile terminals such as smart phones, tablet computers, personal digital assistants, and the like, and may also include terminals such as desktop computers, and the like.
The difference between the present embodiment and the embodiment corresponding to fig. 1 lies in S203 to S204, while S201, S202 and S205 are identical to S101, S102 and S104 in the previous embodiment; reference may be made to the related descriptions of S101, S102 and S104 in the previous embodiment, which are not repeated here.
S201: and acquiring a plurality of single-frame images corresponding to the video to be processed.
S202: inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features include human body joint point features, personnel attribute features and scene features.
S203: and assigning a weight value to each first target image through the trained time attention network.
The trained time attention network is used to assign a weight value to each first target image; that is, the time attention network may assign different weight values to the first target images according to their different importance. A trained time attention network can be obtained from the network and applied on the terminal. The trained time attention network predicts the action of the persons in all the first target images to obtain a predicted action, identifies the important images among all the first target images based on the predicted action and the duration of the video to be processed, and assigns large weight values to the important images, while the remaining, non-important first target images are assigned small weight values. For example, suppose there are 20 first target images, the time attention network predicts the action of the persons in the 20 first target images and the predicted action is a punch attack, and the duration of the video to be processed is 5 seconds; if it is judged that the first target images corresponding to the 2nd to 4th seconds are important and the 6th to 12th first target images fall within those seconds, then the 6th to 12th first target images are assigned large weight values, while the 1st to 5th and 13th to 20th first target images are assigned small weight values. The specific weight values may be preset, as long as the weight values of all the first target images sum to 1.
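A toy sketch of such a time attention head, assuming each first target image has already been reduced to a fixed-length feature vector: a learned score per image is normalised with softmax so that the weights sum to 1, as required above. The scoring layer, feature dimension and names are assumptions, not the patent's network.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Toy temporal-attention head: one weight per first target image, summing to 1."""

    def __init__(self, feature_dim=512):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, frame_features):                   # (num_frames, feature_dim)
        scores = self.score(frame_features).squeeze(-1)  # one score per image
        return torch.softmax(scores, dim=0)              # weights sum to 1

weights = TemporalAttention()(torch.randn(20, 512))      # e.g. 20 first target images
```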
S204: and generating the first target video based on the first target image with the weight value larger than a second preset threshold value.
The second preset threshold value is used for comparison with the weight values corresponding to the first target images, and which first target images are important is judged according to the comparison result. The second preset threshold value can be preset and adjusted, which is not limited here.
After the trained time attention network assigns a weight value to each first target image, the weight values are compared with the second preset threshold value, the first target images whose weight values are larger than the second preset threshold value are extracted, and the first target video is generated based on the extracted first target images. The specific method for generating the first target video is described in S103 and is not repeated here.
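The filtering step itself can be sketched as a simple comparison against the second preset threshold; the weights, threshold value and image placeholders below are illustrative values chosen only to mirror the 20-image example above, after which the kept images would be passed to the video-generation step of S103.

```python
import numpy as np

# Filtering step of S204: keep only the first target images whose weight exceeds
# the second preset threshold. All values below are illustrative.
first_target_images = [np.zeros((240, 320, 3), np.uint8) for _ in range(20)]
attention_weights = [0.02] * 5 + [0.10] * 7 + [0.025] * 8   # sums to 1.0
second_preset_threshold = 0.05
important_images = [img for img, w in zip(first_target_images, attention_weights)
                    if w > second_preset_threshold]          # keeps the 6th-12th images
```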
In this embodiment, the important images among all the first target images are extracted through the trained time attention network; during subsequent video classification, the terminal only needs to process the video generated from the important images rather than all the first target images, which further improves the speed of video classification.
S205: and inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
For ease of understanding, the implementation in the embodiment corresponding to fig. 1 and the embodiment corresponding to fig. 2 will be generally described. Illustratively, one implementation is: acquiring a plurality of single-frame images corresponding to a video to be processed; inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; generating a first target video based on a first target image corresponding to each single frame image; inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video, and obtaining a classification result corresponding to the video to be processed.
Inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image, wherein the method comprises the following steps: acquiring a human body feature map corresponding to each single frame image through a human body posture joint point extraction model; identifying and marking each human body joint point feature in the human body feature map; generating a first image corresponding to each single frame image based on each single frame image and each human body joint point feature corresponding to each single frame image; inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute features; inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene features; and taking any one of the first image, the second image and the third image as a basis, fusing the marked features in the other two images into that image to obtain a first target image.
In this implementation, each single-frame image is processed through the trained global feature extraction model to extract its global features, that is, the human body joint point features, personnel attribute features and scene features of the single-frame image, so that the extracted features of each single-frame image are comprehensive and rich. A target video is generated from the images carrying these global features, so the semantic features extracted when the target video is processed by the trained video classification model are richer and more accurate; classification based on these semantic features therefore yields a more accurate result, improving the accuracy of video classification.
Illustratively, another implementation is: acquiring a plurality of single-frame images corresponding to a video to be processed; inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; generating a first target video based on a first target image corresponding to each single frame image; inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video, and obtaining a classification result corresponding to the video to be processed.
Inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image, wherein the method comprises the following steps: acquiring a human body feature map corresponding to each single frame image through a human body posture joint point extraction model; identifying and marking each human body joint point feature in the human body feature map; assigning weight values to the human body joint point features corresponding to each single frame image through the trained spatial attention network; generating a first image corresponding to each single-frame image based on each single-frame image and the human body joint point features whose weight values corresponding to that single-frame image are larger than a first preset threshold; inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute features; inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene features; and taking any one of the first image, the second image and the third image as a basis, fusing the marked features in the other two images into that image to obtain a first target image.
This implementation introduces a trained spatial attention network on the basis of the previous implementation. The important human body joint point features in each single frame image are extracted through the spatial attention network and can serve as effective features for subsequent video classification, thereby improving the accuracy of video classification. Because only the important human body joint point features are extracted, the terminal only needs to process these features during subsequent video classification rather than all human body joint point features, so the speed of video classification is further improved on the basis of improved accuracy.
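A toy sketch of the spatial attention step described in this implementation, assuming each detected human body joint has been encoded as a fixed-length feature vector: joints whose attention weight exceeds the first preset threshold are kept. The dimensions, scoring layer and threshold value are all assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpatialJointAttention(nn.Module):
    """Toy spatial-attention head: weight each human body joint feature of one
    single-frame image and keep only the joints above the first preset threshold."""

    def __init__(self, joint_feature_dim=64):
        super().__init__()
        self.score = nn.Linear(joint_feature_dim, 1)

    def forward(self, joint_features, first_preset_threshold=0.05):
        # joint_features: (num_joints, joint_feature_dim) for one single-frame image
        weights = torch.softmax(self.score(joint_features).squeeze(-1), dim=0)
        keep = weights > first_preset_threshold
        return joint_features[keep], weights      # kept joints and all weights

kept_joints, joint_weights = SpatialJointAttention()(torch.randn(18, 64))  # e.g. 18 joints
```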
Illustratively, yet another implementation is: acquiring a plurality of single-frame images corresponding to a video to be processed; inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; assigning a weight value to each first target image through the trained time attention network; generating a first target video based on a first target image with a weight value larger than a second preset threshold value; inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video, and obtaining a classification result corresponding to the video to be processed.
Inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image, wherein the method comprises the following steps: acquiring a human body feature map corresponding to each single frame image through a human body posture joint point extraction model; identifying and marking each human body joint point feature in the human body feature map; generating a first image corresponding to each single frame image based on each single frame image and each human body joint point feature corresponding to each single frame image; inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute features; inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene features; and taking any one of the first image, the second image and the third image as a basis, fusing the marked features in the other two images into that image to obtain a first target image.
This implementation introduces a trained time attention network, through which the important images among all the first target images are extracted; during subsequent video classification, the terminal only needs to process the video generated from the important images rather than all the first target images, which further improves the speed of video classification.
Illustratively, yet another implementation is: acquiring a plurality of single-frame images corresponding to a video to be processed; inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; assigning a weight value to each first target image through the trained time attention network; generating a first target video based on a first target image with a weight value larger than a second preset threshold value; inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video, and obtaining a classification result corresponding to the video to be processed.
Inputting a plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image, wherein the method comprises the following steps: acquiring a human body feature map corresponding to each single frame image through a human body posture joint point extraction model; identifying and marking each human body joint point feature in the human body feature map; assigning weight values to the human body joint point features corresponding to each single frame image through the trained spatial attention network; generating a first image corresponding to each single-frame image based on each single-frame image and the human body joint point features whose weight values corresponding to that single-frame image are larger than a first preset threshold; inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute features; inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene features; and taking any one of the first image, the second image and the third image as a basis, fusing the marked features in the other two images into that image to obtain a first target image.
This implementation introduces both a trained spatial attention network and a trained time attention network: the trained spatial attention network is used to extract the important human body joint point features in each single frame image, and the trained time attention network is used to extract the important images among all the first target images. Through the cooperation of the two networks, the effective features and important images usable for video classification are obtained, the interference of non-important features and non-important images is reduced, and the accuracy of video classification is further improved; moreover, the terminal does not need to process all the human body joint point features and all the first target images, so the speed of video classification is further improved.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic and does not constitute any limitation on the implementation process of the embodiments of the present application.
Referring to fig. 3, fig. 3 is a schematic diagram of an apparatus for video classification according to an embodiment of the application. The apparatus for video classification comprises units for performing the steps in the embodiments corresponding to fig. 1 and fig. 2; refer specifically to the related descriptions in those embodiments. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 3, the apparatus comprises:
an acquiring unit 310, configured to acquire a plurality of single-frame images corresponding to a video to be processed;
A first processing unit 320, configured to input the plurality of single-frame images into a trained global feature extraction model for processing, so as to obtain a first target image marked with a global feature corresponding to each single-frame image; the global features comprise human body joint point features, personnel attribute features and scene features;
A generating unit 330, configured to generate a first target video based on at least one of the first target images;
The second processing unit 340 is configured to input the first target video into a trained video classification model for processing, so as to obtain a classification result corresponding to the first target video.
Optionally, the global feature extraction model comprises a human body posture joint point extraction model, a personnel attribute identification model and a scene identification model; the first processing unit 320 includes:
the first image generation unit is used for inputting the single-frame images into the human body posture joint point extraction model for processing aiming at each single-frame image to obtain a first image marked with human body joint point characteristics;
the second image generating unit is used for inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute characteristics;
A third image generating unit, configured to input the single-frame image into the scene recognition model for processing, to obtain a third image marked with scene features;
and the first target image generation unit is used for, with respect to the first image, the second image and the third image, taking any one of the images as a basis and fusing the marked features in the other two images into that image to obtain the first target image.
Optionally, the first image generating unit is specifically configured to:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
And generating a first image corresponding to each single frame image based on each single frame image and each human body joint point characteristic corresponding to each single frame image.
Optionally, the first image generating unit is specifically configured to:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
Distributing weight values to the characteristics of the joints of the human body corresponding to each single frame image through the trained spatial attention network;
And generating a first image corresponding to each single frame image based on the human body joint point characteristics of which the weight value corresponding to each single frame image is larger than a first preset threshold value and each single frame image.
Optionally, the generating unit 330 is specifically configured to:
Assigning a weight value to each of the first target images through a trained time attention network;
And generating the first target video based on the first target image with the weight value larger than a second preset threshold value.
Optionally, the second image generating unit is specifically configured to:
Acquiring a human body image corresponding to the single-frame image through the personnel attribute identification model;
identifying and marking personnel attribute features in the human body image;
the second image is generated based on the single frame image and the person attribute feature.
Optionally, the third image generating unit is specifically configured to:
Extracting scene characteristics in the single-frame image through the scene recognition model;
Determining a scene category corresponding to the single-frame image based on the scene feature;
the third image is generated based on the single frame image and the scene category.
Optionally, the second processing unit 340 is specifically configured to:
Acquiring semantic features corresponding to the first target video through the video classification model;
and classifying the semantic features to obtain the classification result.
Referring to fig. 4, fig. 4 is a schematic diagram of a video classification terminal according to another embodiment of the present application. As shown in fig. 4, the terminal 4 of this embodiment includes: a processor 40, a memory 41, and computer readable instructions 42 stored in the memory 41 and executable on the processor 40. The processor 40, when executing the computer readable instructions 42, implements the steps in each of the video classification method embodiments described above, such as S101 to S104 shown in fig. 1; alternatively, the processor 40, when executing the computer readable instructions 42, implements the functions of the units in the apparatus embodiments described above, such as the units 310 to 340 shown in fig. 3.
Illustratively, the computer readable instructions 42 may be partitioned into one or more units that are stored in the memory 41 and executed by the processor 40 to complete the present application. The one or more units may be a series of computer readable instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer readable instructions 42 in the terminal 4. For example, the computer readable instructions 42 may be divided into an acquisition unit, a first processing unit, a generation unit, and a second processing unit, each unit functioning specifically as described above.
The video classification terminal may include, but is not limited to, the processor 40 and the memory 41. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the terminal 4 and does not limit the terminal 4; the terminal 4 may include more or fewer components than shown, may combine certain components, or may have different components, e.g., the terminal may further include an input-output device, a network access device, a bus, etc.
The processor 40 may be a central processing unit (Central Processing Unit, CPU), another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 41 may be an internal storage unit of the terminal 4, such as a hard disk or a memory of the terminal 4. The memory 41 may also be an external storage device of the terminal 4, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) provided on the terminal 4. Further, the memory 41 may include both an internal storage unit and an external storage device of the terminal 4. The memory 41 is used for storing the computer readable instructions and other programs and data required by the terminal. The memory 41 may also be used for temporarily storing data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method of video classification, comprising:
Acquiring a plurality of single-frame images corresponding to a video to be processed;
inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features comprise human body joint point features, personnel attribute features and scene features; the global feature extraction model comprises a human body posture joint point extraction model, a personnel attribute identification model and a scene identification model; inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image, wherein the method comprises the following steps: inputting the single-frame images into the human body posture joint point extraction model for processing aiming at each single-frame image to obtain a first image marked with human body joint point characteristics; inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute characteristics; inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene characteristics; for the first image, the second image and the third image, based on any one of the images, fusing the marked features in the other two images into any one of the images to obtain the first target image;
Generating a first target video based on at least one of the first target images;
and inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
2. The method of claim 1, wherein said inputting the single frame image into the human body pose joint extraction model for each single frame image is performed to obtain a first image marked with human body joint features, comprising:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
And generating a first image corresponding to each single frame image based on each single frame image and each human body joint point characteristic corresponding to each single frame image.
3. The method of claim 1, wherein said inputting the single frame image into the human body pose joint extraction model for each single frame image is performed to obtain a first image marked with human body joint features, comprising:
Acquiring a human body characteristic map corresponding to each single frame image through the human body posture joint point extraction model;
identifying and marking each human joint point feature in the human feature map;
Distributing weight values to the characteristics of the joints of the human body corresponding to each single frame image through the trained spatial attention network;
And generating a first image corresponding to each single frame image based on the human body joint point characteristics of which the weight value corresponding to each single frame image is larger than a first preset threshold value and each single frame image.
4. A method according to any one of claims 1 to 3, wherein said generating a first target video based on at least one of said first target images comprises:
Assigning a weight value to each of the first target images through a trained time attention network;
And generating the first target video based on the first target image with the weight value larger than a second preset threshold value.
5. The method of claim 1, wherein said inputting said single frame image into said person attribute identification model for processing results in a second image labeled with person attribute features, comprising:
Acquiring a human body image corresponding to the single-frame image through the personnel attribute identification model;
identifying and marking personnel attribute features in the human body image;
the second image is generated based on the single frame image and the person attribute feature.
6. The method of claim 1, wherein said inputting the single frame image into the scene recognition model for processing results in a third image labeled with scene features, comprising:
Extracting scene characteristics in the single-frame image through the scene recognition model;
Determining a scene category corresponding to the single-frame image based on the scene feature;
the third image is generated based on the single frame image and the scene category.
7. The method of claim 1, wherein the inputting the first target video into the trained video classification model for processing results in the classification result corresponding to the first target video comprises:
Acquiring semantic features corresponding to the first target video through the video classification model;
and classifying the semantic features to obtain the classification result.
8. An apparatus for video classification, comprising:
the acquisition unit is used for acquiring a plurality of single-frame images corresponding to the video to be processed;
The first processing unit is used for inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image; the global features comprise human body joint point features, personnel attribute features and scene features; the global feature extraction model comprises a human body posture joint point extraction model, a personnel attribute identification model and a scene identification model; inputting the plurality of single-frame images into a trained global feature extraction model for processing to obtain a first target image marked with global features corresponding to each single-frame image, wherein the method comprises the following steps: inputting the single-frame images into the human body posture joint point extraction model for processing aiming at each single-frame image to obtain a first image marked with human body joint point characteristics; inputting the single-frame image into the personnel attribute identification model for processing to obtain a second image marked with personnel attribute characteristics; inputting the single-frame image into the scene recognition model for processing to obtain a third image marked with scene characteristics; for the first image, the second image and the third image, based on any one of the images, fusing the marked features in the other two images into any one of the images to obtain the first target image;
a generating unit, configured to generate a first target video based on at least one of the first target images;
and the second processing unit is used for inputting the first target video into a trained video classification model for processing to obtain a classification result corresponding to the first target video.
9. A terminal for video classification comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 7.
CN202010441124.9A 2020-05-22 2020-05-22 Video classification method, device, terminal and storage medium Active CN113158710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010441124.9A CN113158710B (en) 2020-05-22 2020-05-22 Video classification method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010441124.9A CN113158710B (en) 2020-05-22 2020-05-22 Video classification method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113158710A CN113158710A (en) 2021-07-23
CN113158710B true CN113158710B (en) 2024-05-31

Family

ID=76882147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010441124.9A Active CN113158710B (en) 2020-05-22 2020-05-22 Video classification method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113158710B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115665507B (en) * 2022-12-26 2023-03-21 海马云(天津)信息技术有限公司 Method, apparatus, medium, and device for generating video stream data including avatar

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019174439A1 (en) * 2018-03-13 2019-09-19 腾讯科技(深圳)有限公司 Image recognition method and apparatus, and terminal and storage medium
CN110147700A (en) * 2018-05-18 2019-08-20 腾讯科技(深圳)有限公司 Video classification methods, device, storage medium and equipment
CN110610154A (en) * 2019-09-10 2019-12-24 北京迈格威科技有限公司 Behavior recognition method and apparatus, computer device, and storage medium
CN110766096A (en) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-target video segmentation combined with recognition information; 黄叶珏; 褚一平; Computer Engineering (Issue 09); full text *

Also Published As

Publication number Publication date
CN113158710A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN110348335B (en) Behavior recognition method and device, terminal equipment and storage medium
US10936911B2 (en) Logo detection
CN112232293B (en) Image processing model training method, image processing method and related equipment
US9020250B2 (en) Methods and systems for building a universal dress style learner
US11663502B2 (en) Information processing apparatus and rule generation method
CN112183153A (en) Object behavior detection method and device based on video analysis
CN110348362B (en) Label generation method, video processing method, device, electronic equipment and storage medium
WO2012013711A2 (en) Semantic parsing of objects in video
CN111738120B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111160307A (en) Face recognition method and face recognition card punching system
CN111950321A (en) Gait recognition method and device, computer equipment and storage medium
CN113392741A (en) Video clip extraction method and device, electronic equipment and storage medium
CN113139415B (en) Video key frame extraction method, computer device and storage medium
KR20220098312A (en) Method, apparatus, device and recording medium for detecting related objects in an image
CN109902550A (en) The recognition methods of pedestrian's attribute and device
CN104794446A (en) Human body action recognition method and system based on synthetic descriptors
CN111159476B (en) Target object searching method and device, computer equipment and storage medium
CN109784295B (en) Video stream feature identification method, device, equipment and storage medium
CN112818995A (en) Image classification method and device, electronic equipment and storage medium
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN113557546B (en) Method, device, equipment and storage medium for detecting associated objects in image
CN113158710B (en) Video classification method, device, terminal and storage medium
CN112101154B (en) Video classification method, apparatus, computer device and storage medium
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN110795972A (en) Pedestrian identity recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant