CN111291692B - Video scene recognition method and device, electronic equipment and storage medium

Video scene recognition method and device, electronic equipment and storage medium

Info

Publication number
CN111291692B
CN111291692B (application CN202010096738.8A)
Authority
CN
China
Prior art keywords
video
layer structure
scene
target
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010096738.8A
Other languages
Chinese (zh)
Other versions
CN111291692A (en)
Inventor
赵璐 (Zhao Lu)
李琳 (Li Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202010096738.8A priority Critical patent/CN111291692B/en
Publication of CN111291692A publication Critical patent/CN111291692A/en
Application granted granted Critical
Publication of CN111291692B publication Critical patent/CN111291692B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a video scene recognition method and apparatus, an electronic device, and a storage medium. The video scene recognition method comprises the following steps: inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and scene recognition is performed on the video according to the target recognition scene type output by the target neural network. With the video scene recognition method provided by the embodiment of the invention, because the target neural network can combine local features with global information, the secondary scene of the video can be recognized even when the video contains multi-person activities.

Description

Video scene recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video technologies, and in particular, to a method and apparatus for identifying a video scene, an electronic device, and a storage medium.
Background
In the process of classifying videos, a video recognition method is needed to determine the video scene, which can to a certain extent represent the content expressed by the video, for example whether the current video is a shooting scene or a corner-kick scene in a football match; here the football scene is the primary scene of the video, while the shooting scene and the corner-kick scene are secondary scenes of the video.
Recognition methods in the prior art focus on primary scene recognition, i.e. they can only recognize the primary scene of a video, for example recognizing that a video is a basketball game scene, a football game scene, or a volleyball game scene; it is difficult for them to recognize the secondary scene of a video, for example, within a basketball game scene, a shooting scene, a steal scene, or a blocked-shot scene.
Disclosure of Invention
The embodiment of the invention provides a video scene recognition method and apparatus, an electronic device, and a storage medium, which are used to solve the problem that the prior art can hardly recognize secondary scenes in videos containing multi-person activities.
In one aspect, an embodiment of the present invention provides a video scene recognition method, including: inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and performing scene recognition on the video according to the target recognition scene type output by the target neural network.
According to one embodiment of the invention, the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; a second sub-layer structure is used for optimizing, according to the target video, the position thermal image of the key person output by the first sub-layer structure, or for further optimizing the optimized position thermal image of the key person output by the preceding second sub-layer structure.
According to one embodiment of the invention, the first layer structure and the second layer structure are trained based on a loss function.
According to one embodiment of the invention, the loss function takes the following form (the squared-error form shown here is reconstructed from the variable definitions that follow):

$$L\big((G,Y_1),(G,Y_2),(T,Y_3)\big)=\frac{p}{NWH}\sum_{k=1}^{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\Big[\big(Y_{1,ij}^{k}-G_{ij}^{k}\big)^{2}+\big(Y_{2,ij}^{k}-G_{ij}^{k}\big)^{2}\Big]+(1-p)\sum_{s=0}^{m}\big(Y_{3,s}-T_{s}\big)^{2}$$

wherein $Y_1$, $Y_2$, $Y_3$ respectively denote the outputs of the first sub-layer structure, the second sub-layer structure, and the second layer structure; $G$ denotes the sample label of the position thermal image of the input key person; $T$ denotes the video scene; $N$ denotes the number of image frames in the video sample of the scene to be recognized, $W$ the image width, and $H$ the image height; $Y_{1,ij}^{k}$ and $Y_{2,ij}^{k}$ denote the values of the pixel at x-coordinate $i$ and y-coordinate $j$ in the outputs $Y_1$ and $Y_2$ for the $k$-th image sample; $G_{ij}^{k}$ denotes the value of the pixel at x-coordinate $i$ and y-coordinate $j$ in the $k$-th image sample of the input sample label $G$; $T_s$ is the value of the $s$-th dimension of $T$; $Y_{3,s}$ is the value of the $s$-th dimension of $Y_3$; and $p$ and $m$ are preset parameters.
According to one embodiment of the present invention, in the position thermal image of the key person, the pixel points where the key person is located are marked as 1.
In another aspect, an embodiment of the present invention provides a video recognition apparatus, including: a scene feature type recognition unit, used for inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and a video scene recognition unit, used for performing scene recognition on the video according to the target recognition scene type output by the target neural network.
According to one embodiment of the invention, the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; a second sub-layer structure is used for optimizing, according to the target video, the position thermal image of the key person output by the first sub-layer structure, or for further optimizing the optimized position thermal image of the key person output by the preceding second sub-layer structure.
According to one embodiment of the invention, the first layer structure and the second layer structure are trained based on a loss function.
In yet another aspect, an embodiment of the present invention provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the video scene recognition method described above when the program is executed.
In yet another aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video scene recognition method described above.
With the video scene recognition method and apparatus, the electronic device, and the storage medium provided by the embodiments of the invention, the target neural network is designed to extract the position features of the key person in both the temporal and spatial domains and to judge comprehensively by combining local features with global information, so the secondary scene of a video can be recognized even when the video contains multi-person activities.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a video scene recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a first layer structure and a second layer structure in a video scene recognition method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video recognition device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes a video scene recognition method according to an embodiment of the present invention with reference to fig. 1 to 2.
It should be noted that a video scene can represent the content expressed by a video, for example whether the current video is a shooting scene or a corner-kick scene in a football match; here the football scene is the primary scene of the video, and the shooting scene and the corner-kick scene are secondary scenes of the video. The video scene recognition method provided by the embodiment of the invention can recognize the secondary scene corresponding to a video.
As shown in fig. 1, the video scene recognition method includes:
step S100, inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents the candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the thermal images of the positions of the key characters.
It can be understood that, when the method is applied in practice, the target video input into the first layer structure is a video segment to be recognized containing N frames, and the N position thermal images of the key person output by the first layer structure correspond one-to-one to the N frames of the target video, i.e. each frame of the input target video corresponds to one position thermal image of the key person.
Each position thermal image of the key person characterizes the position information of the key person in the corresponding video frame. For a football video, for example, the position thermal images characterize the position of the goalkeeper in the corresponding video frames, and the actions of the goalkeeper (such as saving a shot or conceding a goal) characterize candidate recognition scene types of the target video.
The first layer structure is obtained by training with video sample data as samples and with predetermined position thermal image sample data of the key person corresponding to the video sample data as sample labels.
In other words, the training samples of the first layer structure are the video sample data, and the sample labels are the predetermined position thermal image sample data of the key person corresponding to the video sample data; for video sample data of N frames, there are N pieces of position thermal image sample data.
The position thermal image sample data of the key person is a heatmap predetermined based on the video sample data; each video frame (image) of the sample data corresponds to one piece of position thermal image sample data, which characterizes the position information of the key person in the corresponding video frame.
The position thermal image sample data serving as sample labels may be annotated manually or obtained frame by frame by other single-frame image recognition methods.
It should be noted that the first layer structure recognizes local features in the video; that is, this step can extract temporal-domain and spatial-domain information from the video and obtain information about the key person in both the temporal and spatial domains.
The target video input to the second layer structure is the same as the target video input to the first layer structure.
The second layer structure is obtained by training by taking target video sample data and a position thermal image of a key person output by the first layer structure as samples and taking a predetermined scene feature type corresponding to the video sample data as a sample label.
In other words, the training samples of the second layer structure are: target video sample data and a position thermal image of a key person output in the first layer structure; the sample label is: a predetermined scene feature type corresponding to the video sample data.
It should be noted that the scene feature type may be a secondary type of the video, such as a football shooting video or a basketball shooting video.
The scene feature type as a sample tag may be manually annotated.
It can be understood that both the target video and the output result of the first layer structure need to be input into the second layer structure. The output result of the first layer structure, i.e. the position thermal image of the key person, characterizes the position information of the key person and is local information; the action of the key person in the target video characterizes the candidate recognition scene types of the target video; and the target video can characterize the global information. The second layer structure can therefore capture all the environmental information while focusing on the local features, so the recognition result is more accurate.
Step S200, performing scene recognition on the video according to the target recognition scene type output by the target neural network.
In this way, the method can accurately identify the key person (the local feature) in a multi-person scene through the local-feature recognition of the first layer structure, and can accurately identify the real action of the key person by taking both local and global features into account in the second layer structure, thereby realizing fine-grained classification of the video.
With the video scene recognition method described above, the first layer structure can acquire information about the local features in the temporal and spatial domains, and the second layer structure can combine the local features with global information; for videos containing multi-person activities, the method can therefore recognize the secondary scene of the video.
In one embodiment of the present invention, in the position thermal image of the key person in step S100, the pixel points where the key person is located are marked as 1.
For example, the size of the input target video is N×W×H×3, where N denotes the number of frames, W the width of each image (video frame), H the height, and 3 the RGB three-channel data; the N output position thermal images of the key person together have size N×W×H×1, where the pixel positions at which the local feature is located are marked as 1 and all other positions are marked as 0. The position of the local feature can thus be represented through the pixel image.
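As a purely illustrative sketch (not part of the patent text), the following snippet shows how such a binary N×W×H×1 label could be built; the function name `make_heatmap_labels` and the per-frame bounding-box input format are assumptions:

```python
import numpy as np

def make_heatmap_labels(num_frames, width, height, key_person_boxes):
    """Build N x W x H x 1 binary position labels for one clip.

    key_person_boxes: per-frame (x0, x1, y0, y1) pixel ranges covering the
    key person; pixels inside the range are marked 1, all others 0.
    """
    labels = np.zeros((num_frames, width, height, 1), dtype=np.float32)
    for k, (x0, x1, y0, y1) in enumerate(key_person_boxes):
        labels[k, x0:x1, y0:y1, 0] = 1.0  # key-person region marked with 1
    return labels

# A 4-frame clip of 8x6 frames with the key person in a fixed region:
G = make_heatmap_labels(4, 8, 6, [(2, 5, 1, 4)] * 4)
print(G.shape)  # (4, 8, 6, 1)
```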
In one embodiment of the invention, the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; a second sub-layer structure is used for optimizing, according to the target video, the position thermal image of the key person output by the first sub-layer structure, or for further optimizing the optimized position thermal image of the key person output by the preceding second sub-layer structure.
The first sub-layer structure is obtained by training with video sample data as samples and with the position thermal image sample data of the key person corresponding to the video sample data as sample labels.
In other words, the training samples of the first sub-layer structure are the video sample data, and the sample labels are the position thermal image sample data of the key person corresponding to the video sample data.
The position thermal image sample data of the key person is a heatmap predetermined based on the video sample data; each video frame (image) of the sample data corresponds to one piece of position thermal image sample data, which characterizes the position information of the key person in the corresponding video frame sample data.
The position thermal image sample data serving as sample labels may be annotated manually or obtained frame by frame by other single-frame image recognition methods.
The target video input to the second sub-layer structure is the same as the target video input to the first sub-layer structure.
A second sub-layer structure connected to the first sub-layer structure is obtained by training with the target video sample data and the position thermal images of the key person output by the first sub-layer structure as samples, and with the position thermal image sample data of the key person corresponding to the video sample data as sample labels.
In other words, the training samples of this second sub-layer structure are the video sample data together with the position thermal images of the key person output by the first sub-layer structure, and the sample labels are the position thermal image sample data of the key person corresponding to the video sample data.
A second sub-layer structure connected to another second sub-layer structure is obtained by training with the target video sample data and the position thermal images of the key person output by the preceding second sub-layer structure as samples, and with the position thermal image sample data of the key person corresponding to the video sample data as sample labels.
The position thermal image sample data serving as sample labels may be annotated manually or obtained frame by frame by other single-frame image recognition methods.
It will be appreciated that a second sub-layer structure takes as input both the target video (the video to be recognized) and the output result of the first sub-layer structure or of the preceding second sub-layer structure.
In this embodiment, since the first layer structure iterates at least twice to obtain the local features, the accuracy of local feature recognition can be greatly improved. The number of iterations is not limited to two; more iterations may be performed, and the more iterations, the more accurate the result, although of course more computation time and computing resources are consumed.
It should be noted that, taking the case where the first sub-layer structure and the second sub-layer structure of the first layer structure each form one level of the neural network and the second layer structure forms another level as an example, this embodiment is implemented by a cascaded three-level neural network: the first sub-layer structure and the second sub-layer structure identify and capture the local features, and the second layer structure identifies the scene of the video.
As shown in fig. 2, in this embodiment the first layer structure and the second layer structure form a cascaded three-level neural network, where the target video is input to the first sub-layer structure of the first layer structure, to the second sub-layer structure of the first layer structure, and to the second layer structure; the output of the first sub-layer structure is also an input of the second sub-layer structure, and the output of the second sub-layer structure is also an input of the second layer structure.
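Purely as an illustration of the cascade just described (a PyTorch-style sketch under assumptions: the module names `stage1`, `stage2`, and `classifier`, and the channel-wise concatenation of video and heatmap, are not details given by the patent):

```python
import torch
import torch.nn as nn

class CascadedSceneNet(nn.Module):
    """Cascaded three-level network: two heatmap stages plus a scene classifier.

    The classifier receives both the raw video (global information) and the
    refined key-person heatmap (local information).
    """
    def __init__(self, stage1, stage2, classifier):
        super().__init__()
        self.stage1 = stage1          # first sub-layer structure
        self.stage2 = stage2          # second sub-layer structure
        self.classifier = classifier  # second layer structure

    def forward(self, video):
        # video: (batch, 3, N, W, H) clip of N RGB frames
        y1 = self.stage1(video)                              # coarse heatmap
        y2 = self.stage2(torch.cat([video, y1], dim=1))      # refined heatmap
        y3 = self.classifier(torch.cat([video, y2], dim=1))  # (m+1)-dim scene scores
        return y1, y2, y3
```

More second sub-layer structures can be chained in the same way, each taking the video plus the previous stage's heatmap, matching the iterative refinement described above.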
In some embodiments of the invention, the first layer structure and the second layer structure are trained based on a loss function, and the loss function is determined based on the first loss function of the first layer structure and the second loss function of the second layer structure.
The loss function comprises a first loss function and a second loss function. The first loss function targets the first layer structure, i.e. the first sub-layer structure and the second sub-layer structure (the first two levels of the network), and computes the deviation between the outputs of these first two levels and the key-person feature labels; the second loss function targets the second layer structure and computes the deviation between the video scene output by the second layer structure and the real result.
The output results of the first layer structure and the second layer structure are strongly correlated: the first two levels of the network output the local features (the key person) in the video, and the second layer structure outputs the video scene corresponding to that key person.
The larger the residuals of the first and second sub-layer structures, the larger the error in local feature recognition. Similarly, the larger the deviation in video scene recognition, the less accurate the local-feature recognition results of the first and second sub-layer structures, which directly affects the classification result of the second layer structure. In the same way, the residual of the second layer structure's result can act back on the first and second sub-layer structures, thereby optimizing the recognition result of the local features.
Therefore, in this embodiment the loss function optimizes the first layer structure and the second layer structure simultaneously: the more accurate the local feature recognition of the first two levels of the network, the better the classification result of the second layer structure; and the residual of the classification result of the second layer structure in turn improves the recognition accuracy of the first two levels.
The solution can promote the rapid convergence of the network and improve the accuracy of the classification algorithm.
Specifically, the target video comprises N images of size W×H×3, and the position thermal image of the key person has size W×H×1, where W denotes the image width, H the image height, 3 the RGB three-channel data, and 1 the single-channel pixel value; the result output by the second layer structure is an (m+1)-dimensional vector, where m denotes the number of video scenes.
The loss function takes the following form (the squared-error form shown here is reconstructed from the variable definitions that follow):

$$L\big((G,Y_1),(G,Y_2),(T,Y_3)\big)=\frac{p}{NWH}\sum_{k=1}^{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\Big[\big(Y_{1,ij}^{k}-G_{ij}^{k}\big)^{2}+\big(Y_{2,ij}^{k}-G_{ij}^{k}\big)^{2}\Big]+(1-p)\sum_{s=0}^{m}\big(Y_{3,s}-T_{s}\big)^{2}$$

wherein $Y_1$, $Y_2$, $Y_3$ respectively denote the outputs of the first sub-layer structure, the second sub-layer structure, and the second layer structure; $G$ denotes the sample label of the position thermal image of the input key person; $T$ denotes the video scene; $N$ denotes the number of image frames in the video sample of the scene to be recognized, $W$ the image width, and $H$ the image height; $Y_{1,ij}^{k}$ and $Y_{2,ij}^{k}$ denote the values of the pixel at x-coordinate $i$ and y-coordinate $j$ in the outputs $Y_1$ and $Y_2$ for the $k$-th image sample; $G_{ij}^{k}$ denotes the value of the pixel at x-coordinate $i$ and y-coordinate $j$ in the $k$-th image sample of the input sample label $G$; $T_s$ is the value of the $s$-th dimension of $T$; $Y_{3,s}$ is the value of the $s$-th dimension of $Y_3$; and $p$ and $m$ are preset parameters.
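Under this reconstruction, a minimal sketch of the joint loss might look as follows (the squared-error form and the normalization are assumptions carried over from the reconstructed formula):

```python
import torch

def cascade_loss(y1, y2, y3, G, T, p=0.5):
    """Joint loss: p * heatmap regression error (stages 1 and 2)
    + (1 - p) * scene classification error (stage 3).

    y1, y2, G: N x W x H heatmap tensors; y3, T: (m+1)-dim vectors.
    """
    heatmap_term = ((y1 - G) ** 2 + (y2 - G) ** 2).mean()  # 1/(NWH) sum
    scene_term = ((y3 - T) ** 2).sum()
    return p * heatmap_term + (1 - p) * scene_term
```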
It can be understood that, in this training method, the first layer structure is trained first, and the first layer structure and the second layer structure are then trained jointly. Training the first layer structure to rough accuracy in advance speeds up step S320, and in step S320 the joint training of the first and second layer structures, based on their strong correlation, can rapidly improve the recognition accuracy of both recognition models at the same time.
Specifically, the training of the target neural network in the video scene recognition method according to the embodiment of the present invention includes:
step S310, initializing a first layer structure, taking video sample data as a sample, taking predetermined position thermal image sample data of a key person corresponding to the video sample data as a sample label, and training the first layer structure.
Step S320, initializing a second layer structure, taking video sample data and a position thermal image of a key person output by the first layer structure as samples, taking a predetermined video scene corresponding to the video sample data as a sample label, and training the first layer structure and the second layer structure.
In embodiments in which the first layer structure comprises a first sub-layer structure and a second sub-layer structure, the training of the first layer structure and the second layer structure comprises:
step S301, initializing a first sub-layer structure and a second sub-layer structure by adopting a random initialization method, inputting video sample data extracted from a video, wherein the size of the video sample data is N, W, H and 3, and inputting position thermal image sample data of a key character with the label of N, W, H and 1; training the first sub-layer structure and the second sub-layer structure by adopting an Adam Optimaizer, initializing lr to 0.001, and carrying out iterative training 10 times.
Step S302, setting the training parameters of the first sub-layer structure, the second sub-layer structure, and the second layer structure, including the learning rate lr (0.001), the loss-function parameter p = 0.5, the optimization algorithm, the maximum number of iterations, the learning-rate decay parameters, and so on, and randomly initializing the second layer structure.
Step S303, inputting the video sample data of size N×W×H×3, the position thermal image sample data of the key person as labels, of size N×W×H×1, and the scene T to which the video belongs.
Step S304, inputting the video sample data into the target neural network, computing the loss function $L((G,Y_1),(G,Y_2),(T,Y_3))$ from the output results of the first, second, and third levels of the network, and obtaining the residual of each level according to the back-propagation algorithm.
Step S305, updating the weight value in the target neural network by adopting a back propagation algorithm based on the calculated residual error.
Step S306, return to step S304.
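Steps S303–S306 amount to an ordinary gradient-descent loop; a condensed sketch using the `CascadedSceneNet` and `cascade_loss` sketches above (all names illustrative, not from the patent):

```python
import torch

def train(model, loader, epochs=10, lr=0.001, p=0.5):
    """loader yields (video, G, T) triples as described in step S303."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # S301/S302: Adam, lr=0.001
    for _ in range(epochs):
        for video, G, T in loader:
            y1, y2, y3 = model(video)                  # S304: three-level outputs
            loss = cascade_loss(y1, y2, y3, G, T, p)   # L((G,Y1),(G,Y2),(T,Y3))
            opt.zero_grad()
            loss.backward()                            # S304: residuals via backprop
            opt.step()                                 # S305: weight update
                                                       # S306: loop back to S304
```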
In a specific embodiment, the video scene recognition method can be used to recognize multi-person sports videos, i.e. the input target video expresses a multi-person sports scene.
The embodiment of the invention constructs a cascaded three-level neural network to handle the video classification problem for multi-person sports. Taking football recognition as an example, assume the scenes to be recognized comprise m classes such as corner kicks, penalty kicks, shots, bookings, and so on. The input of the model is the target video; each video frame has size W×H×3, with image width W, height H, and 3 denoting the RGB three-channel data, so the size of the model input is N×W×H×3. The final output of the model is an (m+1)-dimensional vector (including a background class, i.e. everything other than the scenes to be recognized is called the background class), with each dimension representing the probability of belonging to that scene and the probabilities summing to 1. In addition, in order to capture the local feature in the video, namely the key person, the first sub-layer structure and the second sub-layer structure of the model output position thermal images of the key person of size N×W×H×1, in which the value of a pixel point in the i-th frame (1 ≤ i ≤ N) represents the probability that the point belongs to the key person.
The model structure is shown in fig. 2. The model consists of 3 cascaded neural networks: the first layer structure 410 (the first sub-layer structure 411 and the second sub-layer structure 412) is used to identify and capture the key person in the video, and the second layer structure 420 is used to identify the secondary scene to which the target video (video clip) belongs.
The model input is a group of N×W×H×3 images, namely the target video extracted from the video, denoted X. During training, the position thermal image sample data of the key person, of size N×W×H×1, must also be input, i.e. images annotated with the position of the key person, denoted G, in which the position of the key person is marked 1 and all other positions 0; and in order to recognize the scene to which the target video belongs, the scene T must also be annotated. The output of the i-th level of the model is denoted $Y_i$, where 1 ≤ i ≤ 3.
The first sub-layer structure and the second sub-layer structure can be an extension of the Hourglass architecture from 2D convolution to 3D convolution, i.e. the two-dimensional convolution kernels are converted into 3D convolution kernels by repetition. By repeatedly processing bottom-up and top-down and combining supervision of intermediate results, the two-dimensional Hourglass convolutional network can be used for human pose estimation, face keypoint detection, and the like, so it can make good use of the different spatial positions of local features and optimize the recognition result. The present invention extends the 2D convolution to 3D convolution, so that not only can the spatial position of the local information in a single frame be captured, but, according to the continuity and correlation of the images in a video, the spatio-temporal position information of the target video can be further captured and the key-person region of the target video identified.
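The repetition-based 2D-to-3D conversion described here can be sketched as follows (the division by the temporal depth, used to keep the activation scale comparable, is an added assumption; the patent only states that the 2D kernels are repeated):

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d, time_depth=3):
    """Turn a 2D conv into a 3D conv by repeating its kernel over time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_depth, *conv2d.kernel_size),
        padding=(time_depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat the 2D kernel along the new temporal axis, then rescale so
        # the summed response over time matches the original 2D filter.
        w3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_depth, 1, 1)
        conv3d.weight.copy_(w3d / time_depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```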
In the video scene recognition method described above, the key-person information captured by the first sub-layer structure and the second sub-layer structure is combined with the global information of the original video (the target video) as the input of the second layer structure, so that the surrounding environment information is captured while focusing on the local features; for videos containing multi-person activities, the method can therefore recognize the secondary scene of the video.
It will be appreciated that recognition methods in the prior art rely only on local features, which makes it difficult to distinguish between similar actions, such as a shot and a corner kick in football: the motions of the players are similar, but the background environments are different, one taking place directly in front of the goal and the other at a corner of the pitch.
Therefore, facing the complexity of secondary classification labels in multi-person recognition scenes, the video scene recognition method of the embodiment of the invention combines the capture of global (background) information with that of key local features, and can accurately recognize the secondary classification of the video scene.
For this cascaded three-level neural network, the loss function is designed as follows, where $p$ ($0 < p < 1$); here $p = 0.5$ is chosen (the squared-error form below is reconstructed from the variable definitions that follow):

$$L\big((G,Y_1),(G,Y_2),(T,Y_3)\big)=\frac{p}{NWH}\sum_{k=1}^{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\Big[\big(Y_{1,ij}^{k}-G_{ij}^{k}\big)^{2}+\big(Y_{2,ij}^{k}-G_{ij}^{k}\big)^{2}\Big]+(1-p)\sum_{s=0}^{m}\big(Y_{3,s}-T_{s}\big)^{2}$$

wherein $Y_1$, $Y_2$, $Y_3$ are the outputs of the first sub-layer structure, the second sub-layer structure, and the second layer structure respectively; $G$ is the sample label of the input position thermal image of the key person; $T$ is the video scene; $Y_{l,ij}^{k}$ ($l = 1, 2$) is the value of the pixel at x-coordinate $i$ and y-coordinate $j$ in the output $Y_l$ for the $k$-th frame sample; $G_{ij}^{k}$ is the value of the pixel at x-coordinate $i$ and y-coordinate $j$ in the $k$-th frame sample of the input label $G$; $T_s$ is the value of the $s$-th dimension of $T$; and $Y_{3,s}$ is the value of the $s$-th dimension of $Y_3$.
The implementation of the invention is described below with reference to the network structure diagram and football scene recognition. To make the model converge better, the first two levels of the network are pre-trained to an approximate convergence result, and the cascaded three-level neural network is then trained with $L((G,Y_1),(G,Y_2),(T,Y_3))$:
step S301, initializing a first sub-layer structure and a second sub-layer structure by adopting a random initialization method, inputting target video sample data extracted from a video, wherein the size of the target video sample data is N, W, H and 3, and inputting position thermal image sample data of a key character with a label of N, W, H and 1; training the first sub-layer structure and the second sub-layer structure by adopting an Adam Optimaizer, initializing lr to 0.001, and carrying out iterative training 10 times.
Step S302, setting the training parameters of the first sub-layer structure, the second sub-layer structure, and the second layer structure, including the learning rate lr (0.001), the loss-function parameter p = 0.5, the optimization algorithm, the maximum number of iterations, the learning-rate decay parameters, and so on, and randomly initializing the second layer structure.
Step S303, inputting the target video sample data of size N×W×H×3, the position thermal image sample data of the key person as labels, of size N×W×H×1, and the scene T to which the target video belongs. The scene labels of the video are set as required; for football, for example, they can be set as follows: 0 for the background class, 1 for corner kicks, 2 for free kicks, 3 for shots, and so on.
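For example, the scene T can be encoded as a one-hot vector matching the (m+1)-dimensional classifier output (an illustrative encoding, not mandated by the patent):

```python
import numpy as np

SCENES = {0: "background", 1: "corner kick", 2: "free kick", 3: "shot"}

def one_hot_scene(label, m):
    """Encode a scene id as an (m+1)-dimensional target vector T."""
    T = np.zeros(m + 1, dtype=np.float32)
    T[label] = 1.0
    return T

print(one_hot_scene(2, 3))  # [0. 0. 1. 0.] -> free kick
```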
Step S304, inputting the target video sample data into the cascaded three-level neural network, computing the loss function $L((G,Y_1),(G,Y_2),(T,Y_3))$ from the output results of the first, second, and third levels of the network, and obtaining the residual of each level according to the back-propagation algorithm.
Step S305, updating the weight values in the cascaded three-level neural network with the back-propagation algorithm based on the calculated residuals.
Step S306, return to step S304.
The video recognition device provided by the embodiment of the invention is described below, and the video recognition device described below and the video scene recognition method described above can be referred to correspondingly.
As shown in fig. 3, the video recognition apparatus according to the embodiment of the present invention includes: scene feature type recognition unit 510, video scene recognition unit 520.
The scene feature type recognition unit 510 is configured to input a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person.
the first layer structure is obtained by training with video sample data as a sample and predetermined position thermal image sample data of a key person corresponding to the video sample data as a sample label.
The second layer structure is obtained by training by taking video sample data and the position thermal image of the key person output by the first layer structure as samples and taking a predetermined video scene corresponding to the video sample data as a sample label.
The video scene recognition unit 520 is configured to perform scene recognition on the video according to the target recognition scene type output by the target neural network.
In some embodiments, for the scene feature type recognition unit 510, the first layer structure includes a first sub-layer structure and at least one second sub-layer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; a second sub-layer structure is used for optimizing, according to the target video, the position thermal image of the key person output by the first sub-layer structure, or for further optimizing the optimized position thermal image of the key person output by the preceding second sub-layer structure.
In some embodiments, for the scene feature type identification unit 510, the first layer structure and the second layer structure are trained based on a loss function.
Fig. 4 illustrates a physical schematic diagram of an electronic device. As shown in fig. 4, the electronic device may include: a processor 810, a communication interface (Communications Interface) 820, a memory 830, and a communication bus 840, where the processor 810, the communication interface 820, and the memory 830 communicate with each other through the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following video scene recognition method: inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and performing scene recognition on the video according to the target recognition scene type output by the target neural network.
It should be noted that, in this embodiment, the electronic device may be a server, a PC, or other devices in the specific implementation, so long as the structure of the electronic device includes a processor 810, a communication interface 820, a memory 830, and a communication bus 840 as shown in fig. 4, where the processor 810, the communication interface 820, and the memory 830 complete communication with each other through the communication bus 840, and the processor 810 may call logic instructions in the memory 830 to execute the above method. The embodiment does not limit a specific implementation form of the electronic device.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the video scene recognition method provided by the above method embodiments, the method including: inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and performing scene recognition on the video according to the target recognition scene type output by the target neural network.
In another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the video scene recognition method provided by the above embodiments, the method including: inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and performing scene recognition on the video according to the target recognition scene type output by the target neural network.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for identifying video scenes, comprising:
inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video characterizes the candidate recognition scene types of the target video; the second layer structure is used for determining the target recognition scene type from the candidate recognition scene types according to the target video and the position thermal image of the key person;
performing scene recognition on the video according to the target recognition scene type output by the target neural network;
the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure;
the first sublayer structure is used for determining a position thermodynamic image of a key person according to the target video; the second sub-layer structure is used for optimizing the position thermodynamic image of the key person output by the first sub-layer structure according to the target video; or the second sub-layer structure is used for re-optimizing the optimized position thermal image of the key person output by the previous second sub-layer structure according to the target video;
the first layer structure and the second layer structure are obtained by training based on a loss function;
the loss function is a formula
Wherein Y is 1 、Y 2 、Y 3 Representing the output of the first sublayer structure, the second sublayer structure and the second layer structure respectively, and representing the bit of the input key character by GPlacing a sample tag of a thermal image, T representing a video scene, N representing the number of image frames in a video sample of the scene to be identified, W representing the length of the image in the video sample of the scene to be identified, H representing the width of the image in the video sample of the scene to be identified,output Y characterizing the kth image sample 1 The value of the pixel point at the position where the x-coordinate is i and the y-coordinate is j,/>Output Y characterizing the kth image sample 2 The value of the pixel point at the position where the x coordinate is i and the y coordinate is j; />The value of the pixel point at the j position with the x coordinate being i and the y coordinate being j in the kth image sample in the input sample label G; />Is the value of the s-th dimension data in T; />Is Y 3 The values of the s-th dimension data, p and m are preset parameters.
2. The video scene recognition method according to claim 1, wherein in the position thermal image of the key person, the pixel points where the key person is located are marked as 1.
3. A video recognition device, comprising:
the scene feature type recognition unit is used for inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermodynamic image of a key person according to the target video, and the action of the key person in the target video represents the candidate recognition scene type of the target video; the second layer structure is used for determining a target recognition scene characteristic type from the candidate recognition scene types according to the target video and the position thermal image of the key person;
the video scene recognition unit is used for recognizing the video according to the target recognition scene characteristic type output by the target neural network;
the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure;
the first sublayer structure is used for determining a position thermodynamic image of a key person according to the target video; the second sub-layer structure is used for optimizing the position thermodynamic image of the key person output by the first sub-layer structure according to the target video; or the second sub-layer structure is used for re-optimizing the optimized position thermal image of the key person output by the previous second sub-layer structure according to the target video;
the first layer structure and the second layer structure are obtained by training based on a loss function;
the loss function is a formula in which:

$Y_1$, $Y_2$ and $Y_3$ respectively denote the outputs of the first sub-layer structure, the second sub-layer structure and the second layer structure; $G$ denotes the sample label of the input position heat map of the key person; $T$ denotes the video scene; $N$ denotes the number of image frames in a video sample of the scene to be identified; $W$ denotes the length and $H$ the width of the images in the video sample of the scene to be identified; $Y_1^k(i,j)$ and $Y_2^k(i,j)$ denote, for the $k$-th image sample, the value of the pixel whose x-coordinate is $i$ and whose y-coordinate is $j$ in the outputs $Y_1$ and $Y_2$ respectively; $G^k(i,j)$ denotes the value of the pixel whose x-coordinate is $i$ and whose y-coordinate is $j$ in the $k$-th image sample of the input sample label $G$; $T_s$ is the value of the $s$-th dimension of $T$; $Y_3^s$ is the value of the $s$-th dimension of $Y_3$; and $p$ and $m$ are preset parameters.
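To make the layered structure of claims 1 and 3 easier to follow, here is a minimal PyTorch-style sketch of one possible arrangement: the first sub-layer structure produces an initial heat map, each second sub-layer structure refines the previous heat map conditioned on the frame, and the second layer structure classifies the scene from the frame plus the final heat map. Every module choice and size is an illustrative assumption, not the patented network.

import torch
import torch.nn as nn

class SceneNet(nn.Module):
    # Sketch of the claimed two-part network; all layer types,
    # channel counts, and the class name are assumptions.
    def __init__(self, num_scenes, refinements=2):
        super().__init__()
        # First sub-layer structure: frame -> initial heat map (Y1).
        self.first_sublayer = nn.Conv2d(3, 1, kernel_size=3, padding=1)
        # Second sub-layer structures: each re-optimizes the previous
        # heat map conditioned on the frame (4 = 3 RGB + 1 heat map).
        self.second_sublayers = nn.ModuleList(
            [nn.Conv2d(4, 1, kernel_size=3, padding=1) for _ in range(refinements)]
        )
        # Second layer structure: frame + final heat map -> scene scores (Y3).
        self.second_layer = nn.Sequential(
            nn.Conv2d(4, 8, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, num_scenes),
        )

    def forward(self, frame):
        y1 = self.first_sublayer(frame)
        heatmaps = [y1]
        for layer in self.second_sublayers:
            heatmaps.append(layer(torch.cat([frame, heatmaps[-1]], dim=1)))
        y3 = self.second_layer(torch.cat([frame, heatmaps[-1]], dim=1))
        return heatmaps, y3

net = SceneNet(num_scenes=10)
heatmaps, scores = net(torch.randn(2, 3, 64, 64))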
4. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the video scene recognition method according to any one of claims 1 to 2.
5. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video scene recognition method according to any one of claims 1 to 2.
CN202010096738.8A 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium Active CN111291692B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096738.8A CN111291692B (en) 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010096738.8A CN111291692B (en) 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111291692A CN111291692A (en) 2020-06-16
CN111291692B true CN111291692B (en) 2023-10-20

Family

ID=71023615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096738.8A Active CN111291692B (en) 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111291692B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10162308B2 (en) * 2016-08-01 2018-12-25 Integem Inc. Methods and systems for photorealistic human holographic augmented reality communication with interactive control in real-time
CA3072351A1 (en) * 2017-08-22 2019-02-28 Alarm.Com Incorporated Preserving privacy in surveillance

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845487A (en) * 2016-12-30 2017-06-13 佳都新太科技股份有限公司 A kind of licence plate recognition method end to end
CN107886069A (en) * 2017-11-10 2018-04-06 东北大学 A kind of multiple target human body 2D gesture real-time detection systems and detection method
CN110443969A (en) * 2018-05-03 2019-11-12 中移(苏州)软件技术有限公司 A kind of fire point detecting method, device, electronic equipment and storage medium
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
CN108648224A (en) * 2018-05-18 2018-10-12 杭州电子科技大学 A method of the real-time scene layout identification based on artificial neural network and reconstruction
CN108830208A (en) * 2018-06-08 2018-11-16 Oppo广东移动通信有限公司 Method for processing video frequency and device, electronic equipment, computer readable storage medium
CN109117703A (en) * 2018-06-13 2019-01-01 中山大学中山眼科中心 It is a kind of that cell category identification method is mixed based on fine granularity identification
CN109271854A (en) * 2018-08-07 2019-01-25 北京市商汤科技开发有限公司 Based on method for processing video frequency and device, video equipment and storage medium
CN109145840A (en) * 2018-08-29 2019-01-04 北京字节跳动网络技术有限公司 video scene classification method, device, equipment and storage medium
CN109508681A (en) * 2018-11-20 2019-03-22 北京京东尚科信息技术有限公司 The method and apparatus for generating human body critical point detection model
CN110166826A (en) * 2018-11-21 2019-08-23 腾讯科技(深圳)有限公司 Scene recognition method, device, storage medium and the computer equipment of video
CN109598234A (en) * 2018-12-04 2019-04-09 深圳美图创新科技有限公司 Critical point detection method and apparatus
CN109740522A (en) * 2018-12-29 2019-05-10 广东工业大学 A kind of personnel's detection method, device, equipment and medium
CN110348463A (en) * 2019-07-16 2019-10-18 北京百度网讯科技有限公司 The method and apparatus of vehicle for identification
CN110766096A (en) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device and electronic equipment

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Hao Guo et al. Human attribute recognition by refining attention heat map. ELSEVIER: Pattern Recognition Letters. 2017, 38-45. *
Mohammad Ashraf Russo et al. Sports Classification in Sequential Frames Using CNN and RNN. 2018 International Conference on Information and Communication Technology Robotics (ICT-ROBOT). 2018, full text. *
周永生. Research on human action recognition algorithms based on multi-scale CNN features. China Masters' Theses Full-text Database, Information Science and Technology. 2019, I138-3964. *
林露. Research on key technologies of perception and recognition in intelligent security. China Doctoral Dissertations Full-text Database, Engineering Science and Technology II. 2020, C038-89. *
王雨廷. Hierarchical LSTM action recognition with "individual-group" association description. China Masters' Theses Full-text Database, Information Science and Technology. 2020, I138-1457. *
站春儒. Research on image scene classification methods based on convolutional neural networks. China Masters' Theses Full-text Database, Information Science and Technology. 2019, I138-3185. *

Also Published As

Publication number Publication date
CN111291692A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN110176027B (en) Video target tracking method, device, equipment and storage medium
Kreiss et al. Pifpaf: Composite fields for human pose estimation
CN112597941B (en) Face recognition method and device and electronic equipment
Tran et al. Two-stream flow-guided convolutional attention networks for action recognition
CN111127308B (en) Mirror image feature rearrangement restoration method for single sample face recognition under partial shielding
CN111626925B (en) Method and device for generating counterwork patch
CN114331829A (en) Countermeasure sample generation method, device, equipment and readable storage medium
WO2021218671A1 (en) Target tracking method and device, and storage medium and computer program
CN111144314B (en) Method for detecting tampered face video
CN111814744A (en) Face detection method and device, electronic equipment and computer storage medium
CN111401192B (en) Model training method and related device based on artificial intelligence
CN113822254B (en) Model training method and related device
CN110633004A (en) Interaction method, device and system based on human body posture estimation
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN111105443A (en) Video group figure motion trajectory tracking method based on feature association
WO2021103474A1 (en) Image processing method and apparatus, storage medium and electronic apparatus
CN111353325A (en) Key point detection model training method and device
CN114694261A (en) Video three-dimensional human body posture estimation method and system based on multi-level supervision graph convolution
Ruiz-Santaquiteria et al. Improving handgun detection through a combination of visual features and body pose-based data
CN113435264A (en) Face recognition attack resisting method and device based on black box substitution model searching
CN111274985B (en) Video text recognition system, video text recognition device and electronic equipment
CN111291692B (en) Video scene recognition method and device, electronic equipment and storage medium
CN112150464A (en) Image detection method and device, electronic equipment and storage medium
CN113544701B (en) Method and device for detecting associated object, electronic equipment and storage medium
CN110751163A (en) Target positioning method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant