CN111291692A

CN111291692A - Video scene recognition method and device, electronic equipment and storage medium

Info

Publication number: CN111291692A
Application number: CN202010096738.8A
Authority: CN
Inventors: 赵璐; 李琳
Original assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Current assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Priority date: 2020-02-17
Filing date: 2020-02-17
Publication date: 2020-06-16
Anticipated expiration: 2040-02-17
Also published as: CN111291692B

Abstract

The embodiment of the invention provides a video scene identification method, electronic equipment and a storage medium. The video scene identification method comprises the following steps: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key figure according to the target video, and the action of the key figure in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network. According to the video scene identification method provided by the embodiment of the invention, the target neural network can be combined with local characteristics and global information, so that the secondary scene of the video can be identified under the condition that the video comprises activities of multiple persons.

Description

Video scene recognition method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of video technologies, and in particular, to a method and an apparatus for identifying a video scene, an electronic device, and a storage medium.

Background

In the classification process of videos, a video identification method needs to be used to determine a video scene, which may represent the content expressed by the video to some extent, such as whether a current video is a video related to a nodding scene of a football or a video related to a corner-ball scene of the football, where the football scene is a primary scene of the video, and the nodding scene and the corner-ball scene in the football scene are secondary scenes of the video.

The identification method in the prior art focuses on primary scene identification of a video, that is, only a primary scene of the video can be identified, for example, a scene of a basketball game, a scene of a football game or a scene of a volleyball game is identified, but it is difficult to perform secondary scene identification on the video, for example, it is difficult to identify a secondary scene of a video scene of a certain basketball game, such as a basketball shooting scene, a snapping scene or a capping scene.

Disclosure of Invention

The embodiment of the invention provides a video scene recognition method and device, electronic equipment and a storage medium, which are used for solving the problem that the prior art is difficult to be used for recognizing a secondary scene including multi-person activities in a video.

In one aspect, an embodiment of the present invention provides a video scene identification method, including: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.

According to an embodiment of the invention, the first layer structure comprises a first sublayer structure and at least one second sublayer structure; the first sub-layer structure is used for determining a thermal image of the position of a key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video; or the second sub-layer structure is used for optimizing the thermal image of the optimized key person output by the previous second sub-layer structure again according to the target video.

According to one embodiment of the invention, the first layer structure and the second layer structure are obtained based on a loss function training.

According to one embodiment of the invention, the loss function is a formula

Wherein, Y₁、Y₂、Y₃Respectively representing the outputs of the first sublayer structure, the second sublayer structure and the second layer structure, G representing a sample label of an input thermal image of the position of a key figure, T representing a video scene, N representing the number of image frames in a video sample of the scene to be identified, W representing the image length in the video sample of the scene to be identified, H representing the image width in the video sample of the scene to be identified,

output Y characterizing the kth image sample₁Wherein the x coordinate is i, the y coordinate is j,

output Y characterizing the kth image sample₂The value of a pixel point at the position where the middle x coordinate is i and the y coordinate is j;

is that the x coordinate in the kth image sample in the input sample label G is i, the y coordinate is jThe value of the pixel at the location;

is the value of the s-th dimension data in T;

is Y₃The values of the s-th dimension data, p and m, are preset parameters.

According to one embodiment of the invention, the position of the pixel point where the key character is located in the thermal image of the position of the key character is marked as 1.

In another aspect, an embodiment of the present invention provides a video identification apparatus, including: the scene feature type identification unit is used for inputting a target video of a scene to be identified into a target neural network, the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and the video scene recognition unit is used for recognizing the scene of the video according to the target recognition scene type output by the target neural network.

According to an embodiment of the invention, the first layer structure and the second layer structure are based on a loss function training.

In another aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the video scene recognition method described above.

In yet another aspect, the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the video scene recognition method described above.

According to the video scene identification method, the video scene identification device, the electronic equipment and the storage medium, the target neural network is designed to extract the position characteristics of key characters in a time domain and a space domain, and the target neural network is combined with the local characteristics and the comprehensive judgment of global information, so that the secondary scene of the video can be identified when the video comprises activities of multiple persons.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a video scene recognition method according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a first layer structure and a second layer structure in a video scene identification method according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a video recognition apparatus according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The video scene recognition method according to the embodiment of the present invention is described below with reference to fig. 1 to 2.

It should be noted that a video scene may represent what is expressed by the video, such as whether the current video is a goal scene about a football or a corner ball scene about the football, where the football scene is a primary scene of the video, and the goal scene and the corner ball scene are secondary scenes of the video. The video scene identification method provided by the embodiment of the invention can identify the secondary scene corresponding to the video.

As shown in fig. 1, the video scene recognition method includes:

s100, inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate recognition scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal images of the key characters.

It can be understood that, in practice of applying the method, the target video input in the first layer structure is a video segment to be identified, and the video segment includes N frames, N position thermal images of key characters output by the first layer structure correspond to the N frames of the target video one by one, and each frame of the input target video corresponds to one position thermal image of a key character.

The position thermal images of the key characters are respectively used for representing the position information of the key characters in the video frames corresponding to the position thermal images of the key characters, for example, for a football video, the key characters are located on goalkeepers, the position thermal images of the key characters are used for representing the position information of the goalkeepers in the corresponding video frames, and the actions of the goalkeepers represent candidate recognition scene types of a target video, such as a point ball for rescuing, a goal for rescuing and the like.

The first layer structure is obtained by training with video sample data as a sample and with predetermined position thermal image sample data of a key person corresponding to the video sample data as a sample label.

In other words, the training samples for the first layer structure are: video sample data; the sample label is: thermal image sample data of the position of a key figure corresponding to the video sample data is determined in advance; for video sample data of N frames, there are N corresponding position thermodynamic image sample data.

The thermal image sample data of the positions of the key people is a thermal image (heatmap) predetermined based on video sample data, each video frame (image) sample data corresponds to the thermal image sample data of the positions of the key people, and the thermal image sample data of the positions of the key people is used for representing the position information of the key people in the corresponding video sample data.

The position thermodynamic image sample data of the key people serving as sample labels can be manually labeled or acquired one by one based on other single-frame image identification methods.

It should be noted that the first layer structure identifies local features in the video, that is, the step may extract time domain and space domain information in the video to obtain information of the key people in the time domain and the space domain.

The target video input in the second layer structure is the same as the target video input in the first layer structure.

The second layer structure is obtained by training with target video sample data and the thermal image of the position of the key person output by the first layer structure as samples and with a predetermined scene characteristic type corresponding to the video sample data as a sample label.

In other words, the training samples for the second layer structure are: target video sample data and the position thermal image of the key figure output in the first layer structure; the sample label is: and the scene feature type corresponding to the video sample data is determined in advance.

It should be noted that the scene feature type may be a secondary type of video, such as a shooting video of a soccer ball or a shooting video of a basketball.

The scene feature types as exemplar labels may be manually labeled.

It can be understood that, the second layer structure needs to input a target video and an output result of the first layer structure, wherein the output result of the first layer structure, namely a thermal image of the position of a key person, is used for representing the position information of the key person, which is local information, while the action of the key person in the target video represents a candidate recognition scene type of the target video, and the target video can be used for representing global information, so that the second layer structure can capture all environmental information while focusing on the local characteristics, so that the recognition result is more accurate.

And S200, carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.

Therefore, the method can be used for a multi-person scene by identifying the local features through the first layer structure, accurately identifies key characters (local features) in the multi-person scene, and simultaneously accurately identifies the real actions of the key characters by utilizing the second layer structure and considering the local features and the global features, thereby realizing the refined classification of the video.

According to the video scene identification method provided by the embodiment of the invention, the first layer structure can acquire the information of the local features in a time domain and a space domain, the second layer structure can combine the local features and the global information, and the method can identify the secondary scene of the video under the condition that the video comprises activities of multiple persons.

In an embodiment of the present invention, the position of the pixel point where the key person is located in the thermal image of the position of the key person in step S100 is marked as 1.

For example, the size of the input target video is N × W × H × 3, N represents the number of frames, W represents the length of the image (video frame), H represents the width of the image (video frame), and 3 represents RGB three-channel data; the size of the thermal image of the positions of the N key characters is N W H1, N represents the number of frames, W represents the length of the image, H represents the width of the image, the position of a pixel point where the local feature is located is marked as 1, and the positions of other positions are marked as 0. Therefore, the positions of the local features can be represented through the pixel point images.

In one embodiment of the invention, the first layer structure comprises a first sublayer structure and at least one second sublayer structure; the first sub-layer structure is used for determining a thermal image of the position of a key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video; or the second sub-layer structure is used for optimizing the thermal image of the position of the optimized key person output by the previous second sub-layer structure again according to the target video.

The first sub-layer structure is obtained by training with video sample data as a sample and thermal image sample data of the position of a key person corresponding to the video sample data as a sample label.

In other words, the training samples of the first sublayer structure are: video sample data; the sample label is: and thermal image sample data of the position of the key person corresponding to the video sample data.

The thermal image sample data of the positions of the key people is a thermal image (heatmap) predetermined based on video sample data, each video frame (image) sample data corresponds to the thermal image sample data of the positions of the key people, and the thermal image sample data of the positions of the key people is used for representing the position information of the key people in the corresponding video frame sample data.

The target video of the second sub-layer structure is input as the same as the target video of the first sub-layer structure.

And the second sub-layer structure connected with the first sub-layer structure is obtained by training by taking the target video sample data and the position thermal image of the key person output by the first sub-layer structure as samples and taking the position thermal image sample data of the key person corresponding to the video sample data as a sample label.

In other words, the training samples of the second sublayer structure are: video sample data and a thermal image of the position of a key figure output by the first sub-layer structure; the sample label is: and thermal image sample data of the position of the key person corresponding to the video sample data.

The second sub-layer structure connected with the second sub-layer structure is obtained by training with the target video sample data and the position thermal image of the key person output by the previous second sub-layer structure as samples and with the position thermal image sample data of the key person corresponding to the video sample data as a sample label.

It will be appreciated that in the second sub-layer structure, the target video (the video to be identified) and the output result of the first sub-layer structure or the previous second sub-layer structure need to be input.

In this embodiment, since the first layer structure is iteratively calculated at least twice to obtain the local features, the accuracy of local feature identification can be greatly improved. The number of iterations is not limited to two in this embodiment, and may be calculated iteratively more times, where the larger the number of iterations is, the more accurate the calculation is, and of course, the more calculation time and calculation resources are consumed.

It should be noted that, taking the first sublayer structure and the second sublayer structure of the first layer structure as a first-level neural network respectively, and taking the second layer structure as a first-level neural network as an example, this embodiment is implemented by a cascaded three-level neural network, where the first sublayer structure and the second sublayer structure are used to identify and capture local features, and the second layer structure is used to identify the scene of the video.

As shown in fig. 3, in this embodiment, the first layer structure and the second layer structure form a cascaded three-level neural network, in which the target video is input to the first sublayer structure of the first layer structure, the second sublayer structure of the first layer structure, and the second layer structure, an output of the first sublayer structure also serves as an input of the second sublayer structure, and an output of the second sublayer structure also serves as an input of the second layer structure.

In some embodiments of the invention, the first layer structure and the second layer structure are derived based on a loss function training, and the loss function is determined based on a first loss function of the first layer structure and a second loss function of the second layer structure.

The loss function comprises a first loss function and a second loss function, wherein the first loss function calculates the deviation of the output of the first two-level neural network and the key character feature identification for the first layer structure, namely the first loss function for the first sublayer structure and the second sublayer structure (the first two-level neural network); the second loss function calculates, for the second layer structure, a deviation of the video scene output by the second layer structure from the true result.

Because the output results of the first layer structure and the second layer structure have strong correlation, the first two levels of neural networks output local features (key characters) in the video, and the second layer structure (second layer structure) is a video scene corresponding to the key characters.

When the residual error between the first sublayer structure and the second sublayer structure is larger, the error of local feature identification is larger. Similarly, the larger the deviation of video scene identification is, the larger the identification result of the local features of the first sublayer structure and the second sublayer structure directly affects the classification result of the second sublayer structure. In the same way, the residual error of the second layer structure result acts on the first sublayer structure and the second sublayer structure, and the identification result of the local feature is optimized.

Therefore, in this embodiment, the loss function optimizes the first layer structure and the second layer structure simultaneously, and the more accurate the local feature identification of the first two levels of networks is, the better the classification result of the second layer structure is; and the residual error of the classification result of the second layer structure can also promote the increase of the recognition precision of the first two stages.

The solution can promote the rapid convergence of the network and improve the accuracy of the classification algorithm.

Specifically, the target video comprises N images with the size W H3, the size of the thermal image of the position of the key character is W H1, W represents the length of the image, H represents the width of the image, 3 represents RGB three-channel data, 1 represents a pixel point value, the output result of the second layer structure is a (m +1) -dimensional vector, and m represents the number of video scenes.

The loss function is a formula

Wherein, Y₁、Y₂、Y₃Respectively representing the output of the first sublayer structure, the second sublayer structure and the second layer structure, G representing a sample label of an input thermal image of the position of a key figure, T representing a video scene, N representing the number of image frames in a video sample of a scene to be identified, W representing the image length in the video sample of the scene to be identified, H representing the image width in the video sample of the scene to be identified,

the x coordinate of the kth image sample in the input sample label G is i, and the y coordinate is the value of a pixel point on the position of j;

is the value of the s-th dimension data in T;

is Y₃The values of the s-th dimension data, p and m, are preset parameters.

It can be understood that, in the training method, the first layer structure is trained first, and then the first layer structure and the second layer structure are trained in combination, so that the first layer structure with substantially accuracy is trained to a certain extent to accelerate the speed of step S320, and the first layer structure and the second layer structure are trained in combination in step S320, so that the recognition accuracy of the two recognition models can be rapidly improved simultaneously based on the strong correlation between the first layer structure and the second layer structure.

In the video scene recognition method according to the embodiment of the present invention, the training of the first layer structure and the second layer structure includes:

step S310, initializing a first layer structure, taking video sample data as a sample, taking predetermined position thermal image sample data of a key person corresponding to the video sample data as a sample label, and training the first layer structure.

Step S320, initializing the second layer structure, taking the video sample data and the thermal image of the position of the key character output by the first layer structure as samples, and taking a predetermined video scene corresponding to the video sample data as a sample label, and training the first layer structure and the second layer structure.

In an embodiment where the first layer structure comprises a first sublayer structure and a second sublayer structure, the training of the first layer structure and the second layer structure comprises:

step S301, initializing a first sublayer structure and a second sublayer structure by adopting a random initialization method, inputting video sample data extracted from a video, wherein the size of the video sample data is N x W x H3, and inputting position thermal image sample data of a key character with a label of N x W x H1; the first sub-layer structure and the second sub-layer structure are trained by adopting AdamaOptimizer, lr is initialized to 0.001, and iteration training is carried out for 10^4 times.

Step S302, setting training parameters of the first sublayer structure, the second sublayer structure, and the second sublayer structure, including a learning rate lr (0.001), a Loss function p ═ 0.5, an optimization algorithm, a maximum number of iterations, a learning rate decay parameter, and the like, and randomly initializing the second sublayer structure.

Step S303, inputting video sample data, the size of which is N × W × H3, inputting position thermodynamic image sample data of a key character with a label of N × W × H1, and inputting a scene T to which the video belongs.

Step S304, inputting video sample data into the target neural network, and calculating Loss function L ((G, Y) according to output results of the first level, the second level and the third level of the network₁)，(G，Y₂)，(T，Y₃) According to a back propagation algorithm, the residual errors of each stage are obtained.

And S305, updating the weight value in the target neural network by adopting a back propagation algorithm based on the calculated residual error.

Step S306, return to step S304.

In a specific embodiment, the video scene identification method can be used for identifying a multi-person motion video, namely the input target video expresses a multi-person motion scene.

The embodiment of the invention constructs a cascaded three-level neural network for processing the video classification problem of multi-person movement. Taking football recognition as an example, assuming that a scene to be recognized comprises m types of corner balls, point balls, shooting, plate lifting and the like, the input of the model is a target video, the frame size of each video is W x H x 3, the width of an image is W, the height is H, and 3 is RGB three-channel data. The size of the model input is then N x W x H3. The final output of the model is an (m +1) -dimensional vector (containing a background class, i.e., all scenes except the scene to be identified are referred to as background classes), each dimension represents the probability of belonging to the scene, and the sum is 1. Besides, in order to capture local features in the video, namely key characters, the first sub-layer structure and the second sub-layer structure of the model output N W H1 thermal images of the positions of the key characters, wherein pixel point values in the i (1 ≦ i ≦ N) frame represent the probability that the points belong to the key characters.

Model structure as shown in fig. 2, the model is composed of 3 cascaded neural networks, a first layer structure 410 (a first sub-layer structure 411 and a second sub-layer structure 412) is used for identifying and capturing key characters in a video, and a second layer structure 420 is used for identifying a secondary scene to which a target video (video clip) belongs.

Inputting a group of N, W, H and 3 images as model input, and marking the target video extracted from the video as X; meanwhile, in the training process, thermal image sample data of the positions of the key characters of NxW x H1 are still required to be input, namely, the images of the positions of the key characters are marked and marked as G, wherein the positions of the key characters are marked as 1, and otherwise, the positions of the key characters are 0; in order to identify the scene to which the target video belongs, a scene T needs to be marked. The i-th output of the model is recorded as Y_iWherein i is more than or equal to 1 and less than or equal to 3.

The first sublayer structure and the second sublayer structure can be a deduction of the structure of the Hourglass from 2D convolution to 3D convolution. I.e. by repeating the two-dimensional convolution kernel, thereby converting the two-dimensional convolution kernel into a 3D convolution kernel. The two-dimensional Hourglass convolution network is used for human body posture estimation, human face key point detection and the like by repeatedly monitoring from bottom to top and from top to bottom and combining intermediate results, and can well utilize different spatial positions of local features to optimize a recognition result. The 2D convolution is deduced to be the 3D convolution, so that the spatial position of the local information of the single-frame picture can be captured, the spatial position information of the target video can be further captured according to the continuity and the correlation of the images in the video, and the key character area of the target video can be identified.

According to the video scene identification method, the key character information captured by the first sublayer structure and the second sublayer structure is combined with the global information in the original video (target video) to serve as the input of the second layer structure, the local features are focused, meanwhile, the surrounding environment information is captured, and for the situation that the video comprises activities of multiple persons, the method can identify the secondary scene of the video.

It will be appreciated that the prior art identification methods, either having only local features, make it difficult to distinguish between sports with similar actions, such as nodding and shooting in soccer, with similar actions by players, but with different background conditions, one facing the goal and the other at the corner of the field.

Therefore, the video scene identification method provided by the embodiment of the invention can be used for correctly identifying the secondary classification of the video scene by combining the capture of global (background) and key local features aiming at the complexity of secondary classification labels in a multi-person identification scene.

For this cascaded three-level neural network, the loss function is designed as follows, where p (0< p <1), where we choose p ═ 0.5:

wherein, Y₁、Y₂、Y₃Respectively outputting a first sublayer structure, a second sublayer structure and a second layer structure, G is a sample label of an input key figure position thermal image, T is a video scene,

output Y for the k frame sample_l(1, 2) the value of a pixel at a position where the x coordinate is i and the y coordinate is j;

the x coordinate in the kth frame sample in the input label G is i, and the y coordinate is the value of the pixel point at the position of j;

is the value of the s-th dimension data in T;

is Y₃The value of the s-th dimension data.

In order to make the model convergence more effective, the implementation method of the present invention is described below with reference to a network structure diagram and football scene recognition, in which a pre-neural two-stage network is initially trained to obtain a result of approximate convergence, and then L ((G, Y) is used₁)，(G，Y₂)，(T，Y₃) Training a cascaded three-level neural network:

step S301, initializing a first sublayer structure and a second sublayer structure by adopting a random initialization method, inputting target video sample data extracted from a video, wherein the size of the target video sample data is N x W x H3, and inputting position thermal image sample data of a key character with a label of N x W x H1; the first sub-layer structure and the second sub-layer structure are trained by adopting AdamaOptimizer, lr is initialized to 0.001, and iteration training is carried out for 10^4 times.

Step S303, inputting target video sample data, wherein the size of the target video sample data is N x W x H3, inputting position thermal image sample data of a key character with a label of N x W x H1, and inputting a scene T to which the target video belongs. The scene label of the video is set according to the requirement, for example, the football can be set as: 0 represents a background class, 1 represents a corner ball, 2 represents an arbitrary ball, 3 represents a shot, and so on.

Step S304, inputting target video sample data into the three-level neural network, and calculating Loss function L ((G, Y) according to output results of the first level, the second level and the third level of the network₁)，(G，Y₂)，(T，Y₃) According to a back propagation algorithm, the residual errors of each stage are obtained.

And S305, updating the weight value in the cascade three-level neural network by adopting a back propagation algorithm based on the calculated residual error.

Step S306, return to step S304.

The following describes the video recognition device provided by the embodiment of the present invention, and the video recognition device described below and the video scene recognition method described above may be referred to correspondingly.

As shown in fig. 3, the video recognition apparatus according to the embodiment of the present invention includes: scene feature type identification unit 510, video scene identification unit 520.

The scene feature type identification unit 510 is configured to input a target video of a scene to be identified into a target neural network, where the target neural network includes a first layer structure and a second layer structure, the first layer structure is configured to determine, according to the target video, a position thermal image of a key person, and an action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person;

The second layer structure is obtained by training with the video sample data and the thermal image of the position of the key character output by the first layer structure as samples and a predetermined video scene corresponding to the video sample data as a sample label.

A video scene recognition unit 520, configured to perform scene recognition on the video according to the target recognition scene type output by the target neural network.

In some embodiments, for the scene feature type identification unit 510, the first layer structure includes a first sublayer structure and at least one second sublayer structure; the first sub-layer structure is used for determining a thermal image of the position of a key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video; or the second sub-layer structure is used for optimizing the thermal image of the position of the optimized key person output by the previous second sub-layer structure again according to the target video.

In some embodiments, for the scene feature type identification unit 510, the first layer structure and the second layer structure are trained based on a loss function.

Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor (processor)810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the following video scene recognition method: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key figure according to the target video, and the action of the key figure in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.

It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 810, the communication interface 820, the memory 830, and the communication bus 840 shown in fig. 4, where the processor 810, the communication interface 820, and the memory 830 complete mutual communication through the communication bus 840, and the processor 810 may call the logic instructions in the memory 830 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.

In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Further, an embodiment of the present invention discloses a computer program product, the computer program product includes a computer program stored on a non-transitory computer readable storage medium, the computer program includes program instructions, when the program instructions are executed by a computer, the computer can execute the video scene recognition method provided by the above-mentioned embodiments of the method, for example, the video scene recognition method includes: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key figure according to the target video, and the action of the key figure in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform the video scene recognition method provided in the foregoing embodiments when executed by a processor, for example, the video scene recognition method includes: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key figure according to the target video, and the action of the key figure in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for video scene recognition, comprising:

inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person;

and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.

2. The video scene recognition method of claim 1, wherein the first layer structure comprises a first sublayer structure and at least one second sublayer structure;

the first sub-layer structure is used for determining a thermal image of the position of a key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video; or the second sub-layer structure is used for optimizing the thermal image of the optimized key person output by the previous second sub-layer structure again according to the target video.

3. The method of claim 2, wherein the first layer structure and the second layer structure are obtained based on a loss function training.

4. The method of claim 3, wherein the loss function is a formula

is the value of the s-th dimension data in T;

is Y₃The values of the s-th dimension data, p and m, are preset parameters.

5. The video scene recognition method of any one of claims 1 to 4, wherein the position of the pixel point in the thermal image of the key person where the key person is located is marked as 1.

6. A video recognition device, comprising:

the scene feature type identification unit is used for inputting a target video of a scene to be identified into a target neural network, the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person;

and the video scene recognition unit is used for recognizing the scene of the video according to the target recognition scene type output by the target neural network.

7. The video recognition device of claim 6, wherein the first layer structure comprises a first sublayer structure and at least one second sublayer structure;

8. The video recognition device of claim 7, wherein the first layer structure and the second layer structure are trained based on a loss function.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the video scene recognition method according to any of claims 1 to 5 are implemented when the program is executed by the processor.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the video scene recognition method according to any one of claims 1 to 5.