CN111291692A - Video scene recognition method and device, electronic equipment and storage medium - Google Patents

Video scene recognition method and device, electronic equipment and storage medium

Info

Publication number
CN111291692A
CN111291692A (application CN202010096738.8A)
Authority
CN
China
Prior art keywords
video
layer structure
scene
target
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010096738.8A
Other languages
Chinese (zh)
Other versions
CN111291692B (en)
Inventor
赵璐
李琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202010096738.8A
Publication of CN111291692A
Application granted
Publication of CN111291692B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the invention provide a video scene recognition method, electronic equipment and a storage medium. The video scene recognition method comprises the following steps: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate recognition scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and scene recognition is carried out on the video according to the target recognition scene type output by the target neural network. With the video scene recognition method provided by the embodiments of the invention, the target neural network can combine local features with global information, so that the secondary scene of the video can be identified even when the video includes activities of multiple people.

Description

Video scene recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video technologies, and in particular, to a method and an apparatus for identifying a video scene, an electronic device, and a storage medium.
Background
In the classification process of videos, a video identification method needs to be used to determine the video scene, which represents to some extent the content expressed by the video, such as whether a current video relates to a header scene in football or to a corner-ball scene in football, where the football scene is the primary scene of the video, and the header scene and the corner-ball scene within the football scene are secondary scenes of the video.
Identification methods in the prior art focus on primary scene recognition of a video, that is, they can only identify the primary scene of the video, for example a basketball game scene, a football game scene or a volleyball game scene, but it is difficult for them to perform secondary scene recognition, for example to identify the secondary scene of a given basketball game video, such as a shooting scene, a steal scene or a blocking (capping) scene.
Disclosure of Invention
The embodiments of the invention provide a video scene recognition method and device, electronic equipment and a storage medium, to solve the problem that the prior art can hardly recognize a secondary scene of a video that includes multi-person activities.
In one aspect, an embodiment of the present invention provides a video scene identification method, including: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.
According to an embodiment of the invention, the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video, or for optimizing again, according to the target video, the optimized position thermal image of the key person output by the previous second sub-layer structure.
According to one embodiment of the invention, the first layer structure and the second layer structure are obtained based on a loss function training.
According to one embodiment of the invention, the loss function is:

$$L\big((G,Y_1),(G,Y_2),(T,Y_3)\big)=p\sum_{k=1}^{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\Big[\big(Y_1^{k}(i,j)-G^{k}(i,j)\big)^{2}+\big(Y_2^{k}(i,j)-G^{k}(i,j)\big)^{2}\Big]+(1-p)\sum_{s=0}^{m}\big(Y_3(s)-T(s)\big)^{2}$$

wherein $Y_1$, $Y_2$ and $Y_3$ respectively represent the outputs of the first sublayer structure, the second sublayer structure and the second layer structure; G represents the sample label of the input position thermal image of the key person; T represents the video scene; N represents the number of image frames in a video sample of the scene to be identified; W represents the image length and H the image width in the video sample of the scene to be identified; $Y_1^{k}(i,j)$ and $Y_2^{k}(i,j)$ are the values of the pixel points at the position with x coordinate i and y coordinate j in the outputs $Y_1$ and $Y_2$ for the k-th image sample; $G^{k}(i,j)$ is the value of the pixel point at the position with x coordinate i and y coordinate j in the k-th image sample of the input sample label G; $T(s)$ is the value of the s-th dimension of T; $Y_3(s)$ is the value of the s-th dimension of $Y_3$; and p and m are preset parameters.
According to one embodiment of the invention, in the position thermal image of the key person, the pixel points where the key person is located are marked as 1.
In another aspect, an embodiment of the present invention provides a video identification apparatus, including: the scene feature type identification unit is used for inputting a target video of a scene to be identified into a target neural network, the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person; and the video scene recognition unit is used for recognizing the scene of the video according to the target recognition scene type output by the target neural network.
According to an embodiment of the invention, the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video, or for optimizing again, according to the target video, the optimized position thermal image of the key person output by the previous second sub-layer structure.
According to an embodiment of the invention, the first layer structure and the second layer structure are based on a loss function training.
In another aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the video scene recognition method described above.
In yet another aspect, the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the video scene recognition method described above.
According to the video scene identification method and device, the electronic equipment and the storage medium of the embodiments of the invention, the target neural network is designed to extract the position features of the key person in the time domain and the space domain, and it combines these local features with a comprehensive judgment over the global information, so that the secondary scene of the video can be identified when the video includes activities of multiple people.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a video scene recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a first layer structure and a second layer structure in a video scene identification method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The video scene recognition method according to the embodiment of the present invention is described below with reference to fig. 1 to 2.
It should be noted that a video scene may represent what is expressed by the video, such as whether the current video is a goal scene or a corner-ball scene in football, where the football scene is the primary scene of the video, and the goal scene and the corner-ball scene are secondary scenes of the video. The video scene identification method provided by the embodiment of the invention can identify the secondary scene corresponding to the video.
As shown in fig. 1, the video scene recognition method includes:
s100, inputting a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate recognition scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal images of the key characters.
It can be understood that, in practical application of the method, the target video input to the first layer structure is a video segment to be identified. If the video segment includes N frames, the N position thermal images of the key person output by the first layer structure correspond one-to-one to the N frames of the target video, that is, each frame of the input target video corresponds to one position thermal image of the key person.
The position thermal images of the key person respectively represent the position information of the key person in their corresponding video frames. For example, for a football video in which the key person is the goalkeeper, the position thermal images represent the goalkeeper's position in the corresponding video frames, and the goalkeeper's actions represent candidate recognition scene types of the target video, such as saving a penalty kick or saving a shot.
The first layer structure is obtained by training with video sample data as a sample and with predetermined position thermal image sample data of a key person corresponding to the video sample data as a sample label.
In other words, the training samples of the first layer structure are the video sample data, and the sample labels are the predetermined position thermal image sample data of the key person corresponding to the video sample data; for video sample data of N frames, there are N corresponding pieces of position thermal image sample data.
The position thermal image sample data of the key person is a heatmap predetermined based on the video sample data: each piece of video frame (image) sample data corresponds to one piece of position thermal image sample data of the key person, which represents the position information of the key person in the corresponding video frame sample data.
The position thermal image sample data of the key person serving as the sample labels can be labeled manually or acquired frame by frame with other single-frame image identification methods.
It should be noted that the first layer structure identifies local features in the video, that is, the step may extract time domain and space domain information in the video to obtain information of the key people in the time domain and the space domain.
The target video input in the second layer structure is the same as the target video input in the first layer structure.
The second layer structure is obtained by training with target video sample data and the thermal image of the position of the key person output by the first layer structure as samples and with a predetermined scene characteristic type corresponding to the video sample data as a sample label.
In other words, the training samples for the second layer structure are: target video sample data and the position thermal image of the key figure output in the first layer structure; the sample label is: and the scene feature type corresponding to the video sample data is determined in advance.
It should be noted that the scene feature type may be a secondary type of video, such as a shooting video of a soccer ball or a shooting video of a basketball.
The scene feature types as exemplar labels may be manually labeled.
It can be understood that the second layer structure takes as input both the target video and the output result of the first layer structure. The output result of the first layer structure, namely the position thermal image of the key person, represents the position information of the key person, which is local information; the action of the key person in the target video represents a candidate recognition scene type of the target video, and the target video itself carries the global information. The second layer structure can therefore capture the whole environmental information while focusing on the local features, which makes the recognition result more accurate.
And S200, carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.
Therefore, the method can be applied to multi-person scenes: the first layer structure identifies the local features and thus accurately locates the key person (the local feature) in a multi-person scene, while the second layer structure considers both the local features and the global features to accurately identify the real action of the key person, thereby realizing fine-grained classification of the video.
According to the video scene identification method provided by the embodiment of the invention, the first layer structure can acquire the information of the local features in a time domain and a space domain, the second layer structure can combine the local features and the global information, and the method can identify the secondary scene of the video under the condition that the video comprises activities of multiple persons.
In an embodiment of the present invention, the position of the pixel point where the key person is located in the thermal image of the position of the key person in step S100 is marked as 1.
For example, the size of the input target video is N × W × H × 3, where N represents the number of frames, W represents the length of the image (video frame), H represents the width of the image (video frame), and 3 represents the RGB three-channel data; the size of the N position thermal images of the key person is N × W × H × 1, where the pixel points at the position of the local feature are marked as 1 and all other positions are marked as 0. In this way, the positions of the local features can be represented through the pixel images.
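As an illustration of this labeling, the sketch below builds such a label tensor in Python/NumPy (the helper name `build_heatmap_labels` and the per-frame pixel-list annotation format are assumptions for the example, not part of the disclosure):

```python
import numpy as np

def build_heatmap_labels(frame_pixels, n_frames, width, height):
    """Builds an N x W x H x 1 position heatmap label tensor: pixels where
    the key person is located are marked 1, all other positions are 0."""
    labels = np.zeros((n_frames, width, height, 1), dtype=np.float32)
    for k, pixels in enumerate(frame_pixels):   # one list of (x, y) per frame
        for x, y in pixels:                     # pixels covered by the key person
            labels[k, x, y, 0] = 1.0
    return labels

# Example: a 16-frame clip at 224x224, key person annotated in frame 0 only.
labels = build_heatmap_labels([[(100, 120), (101, 120)]] + [[]] * 15,
                              n_frames=16, width=224, height=224)
```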
In one embodiment of the invention, the first layer structure comprises a first sublayer structure and at least one second sublayer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video, or for optimizing again, according to the target video, the optimized position thermal image of the key person output by the previous second sub-layer structure.
The first sub-layer structure is obtained by training with video sample data as a sample and thermal image sample data of the position of a key person corresponding to the video sample data as a sample label.
In other words, the training samples of the first sublayer structure are: video sample data; the sample label is: and thermal image sample data of the position of the key person corresponding to the video sample data.
The thermal image sample data of the positions of the key people is a thermal image (heatmap) predetermined based on video sample data, each video frame (image) sample data corresponds to the thermal image sample data of the positions of the key people, and the thermal image sample data of the positions of the key people is used for representing the position information of the key people in the corresponding video frame sample data.
The position thermal image sample data of the key person serving as the sample labels can be labeled manually or acquired frame by frame with other single-frame image identification methods.
The target video input to the second sub-layer structure is the same as the target video input to the first sub-layer structure.
And the second sub-layer structure connected with the first sub-layer structure is obtained by training by taking the target video sample data and the position thermal image of the key person output by the first sub-layer structure as samples and taking the position thermal image sample data of the key person corresponding to the video sample data as a sample label.
In other words, the training samples of the second sublayer structure are: video sample data and a thermal image of the position of a key figure output by the first sub-layer structure; the sample label is: and thermal image sample data of the position of the key person corresponding to the video sample data.
A second sub-layer structure connected to a previous second sub-layer structure is obtained by training with the target video sample data and the position thermal image of the key person output by the previous second sub-layer structure as samples, and with the position thermal image sample data of the key person corresponding to the video sample data as the sample label.
The position thermal image sample data of the key person serving as the sample labels can be labeled manually or acquired frame by frame with other single-frame image identification methods.
It will be appreciated that in the second sub-layer structure, the target video (the video to be identified) and the output result of the first sub-layer structure or the previous second sub-layer structure need to be input.
In this embodiment, since the first layer structure is iteratively calculated at least twice to obtain the local features, the accuracy of local feature identification can be greatly improved. The number of iterations is not limited to two in this embodiment, and may be calculated iteratively more times, where the larger the number of iterations is, the more accurate the calculation is, and of course, the more calculation time and calculation resources are consumed.
It should be noted that, taking the first sublayer structure and the second sublayer structure of the first layer structure each as one level of the neural network and the second layer structure as a further level, this embodiment is implemented by a cascaded three-level neural network, in which the first sublayer structure and the second sublayer structure are used to identify and capture the local features, and the second layer structure is used to identify the scene of the video.
As shown in fig. 2, in this embodiment the first layer structure and the second layer structure form a cascaded three-level neural network, in which the target video is input to the first sublayer structure of the first layer structure, the second sublayer structure of the first layer structure, and the second layer structure; the output of the first sublayer structure also serves as an input of the second sublayer structure, and the output of the second sublayer structure also serves as an input of the second layer structure.
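The wiring just described can be sketched as follows. The disclosure names no framework, so PyTorch is assumed here; the three sub-networks are opaque stand-ins, and the (batch, channel, frame, height, width) layout and channel-wise concatenation are illustrative choices:

```python
import torch
import torch.nn as nn

class CascadedSceneNet(nn.Module):
    """Cascaded three-level wiring: each later stage also sees the original
    video (global information) next to the previous stage's heatmaps."""
    def __init__(self, sub1: nn.Module, sub2: nn.Module, classifier: nn.Module):
        super().__init__()
        self.sub1 = sub1              # first sublayer structure
        self.sub2 = sub2              # second sublayer structure
        self.classifier = classifier  # second layer structure

    def forward(self, video: torch.Tensor):
        # video: (B, 3, N, H, W) RGB clip of N frames
        y1 = self.sub1(video)                                # (B, 1, N, H, W) heatmaps
        y2 = self.sub2(torch.cat([video, y1], dim=1))        # refined heatmaps
        y3 = self.classifier(torch.cat([video, y2], dim=1))  # (B, m+1) scene scores
        return y1, y2, y3
```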
In some embodiments of the invention, the first layer structure and the second layer structure are derived based on a loss function training, and the loss function is determined based on a first loss function of the first layer structure and a second loss function of the second layer structure.
The loss function comprises a first loss function and a second loss function. The first loss function, for the first layer structure, calculates the deviation between the outputs of the first two levels of the neural network (the first sublayer structure and the second sublayer structure) and the key-person labels; the second loss function, for the second layer structure, calculates the deviation between the video scene output by the second layer structure and the true result.
The output results of the first layer structure and the second layer structure are strongly correlated: the first two levels of the neural network output the local features (key persons) in the video, and the second layer structure outputs the video scene corresponding to those key persons.
The larger the residual error of the first sublayer structure and the second sublayer structure, the larger the error of the local feature identification, and this identification result directly affects the classification result of the second layer structure. Conversely, the residual error of the second layer structure acts back on the first sublayer structure and the second sublayer structure, optimizing the identification result of the local features.
Therefore, in this embodiment the loss function optimizes the first layer structure and the second layer structure simultaneously: the more accurate the local feature identification of the first two levels of the network, the better the classification result of the second layer structure; and the residual error of the classification result of the second layer structure in turn improves the recognition precision of the first two levels.
The solution can promote the rapid convergence of the network and improve the accuracy of the classification algorithm.
Specifically, the target video comprises N images of size W × H × 3, and the size of each position thermal image of the key person is W × H × 1, where W represents the length of the image, H represents the width of the image, 3 represents the RGB three-channel data, and 1 represents the pixel point value; the output result of the second layer structure is an (m+1)-dimensional vector, where m represents the number of video scenes.
The loss function is:

$$L\big((G,Y_1),(G,Y_2),(T,Y_3)\big)=p\sum_{k=1}^{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\Big[\big(Y_1^{k}(i,j)-G^{k}(i,j)\big)^{2}+\big(Y_2^{k}(i,j)-G^{k}(i,j)\big)^{2}\Big]+(1-p)\sum_{s=0}^{m}\big(Y_3(s)-T(s)\big)^{2}$$

wherein $Y_1$, $Y_2$ and $Y_3$ respectively represent the outputs of the first sublayer structure, the second sublayer structure and the second layer structure; G represents the sample label of the input position thermal image of the key person; T represents the video scene; N represents the number of image frames in a video sample of the scene to be identified; W represents the image length and H the image width in the video sample of the scene to be identified; $Y_1^{k}(i,j)$ and $Y_2^{k}(i,j)$ are the values of the pixel points at the position with x coordinate i and y coordinate j in the outputs $Y_1$ and $Y_2$ for the k-th image sample; $G^{k}(i,j)$ is the value of the pixel point at the position with x coordinate i and y coordinate j in the k-th image sample of the input sample label G; $T(s)$ is the value of the s-th dimension of T; $Y_3(s)$ is the value of the s-th dimension of $Y_3$; and p and m are preset parameters.
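Under one reading of the formula above (the weighting of the heatmap and scene terms by p and 1 - p is an interpretation, since the original equation is rendered as images in the patent), the loss could be sketched as:

```python
import torch

def cascade_loss(y1: torch.Tensor, y2: torch.Tensor, y3: torch.Tensor,
                 g: torch.Tensor, t: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Joint loss: squared heatmap residuals of the first two levels against
    the label G, plus the squared scene-vector residual of the third level.
    y1, y2, g: (N, W, H, 1) heatmaps; y3, t: (m+1,) scene vectors; 0 < p < 1."""
    heatmap_term = ((y1 - g) ** 2).sum() + ((y2 - g) ** 2).sum()
    scene_term = ((y3 - t) ** 2).sum()
    return p * heatmap_term + (1.0 - p) * scene_term
```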
It can be understood that in this training method the first layer structure is trained first, and then the first layer structure and the second layer structure are trained jointly. Pre-training yields a first layer structure of reasonable accuracy, which accelerates step S320; the joint training in step S320 then exploits the strong correlation between the first layer structure and the second layer structure to rapidly improve the recognition accuracy of both models simultaneously.
In the video scene recognition method according to the embodiment of the present invention, the training of the first layer structure and the second layer structure includes:
step S310, initializing a first layer structure, taking video sample data as a sample, taking predetermined position thermal image sample data of a key person corresponding to the video sample data as a sample label, and training the first layer structure.
Step S320, initializing the second layer structure, taking the video sample data and the thermal image of the position of the key character output by the first layer structure as samples, and taking a predetermined video scene corresponding to the video sample data as a sample label, and training the first layer structure and the second layer structure.
In an embodiment where the first layer structure comprises a first sublayer structure and a second sublayer structure, the training of the first layer structure and the second layer structure comprises:
step S301, initializing a first sublayer structure and a second sublayer structure by adopting a random initialization method, inputting video sample data extracted from a video, wherein the size of the video sample data is N x W x H3, and inputting position thermal image sample data of a key character with a label of N x W x H1; the first sub-layer structure and the second sub-layer structure are trained by adopting AdamaOptimizer, lr is initialized to 0.001, and iteration training is carried out for 10^4 times.
Step S302, setting training parameters of the first sublayer structure, the second sublayer structure, and the second sublayer structure, including a learning rate lr (0.001), a Loss function p ═ 0.5, an optimization algorithm, a maximum number of iterations, a learning rate decay parameter, and the like, and randomly initializing the second sublayer structure.
Step S303, inputting video sample data, the size of which is N × W × H3, inputting position thermodynamic image sample data of a key character with a label of N × W × H1, and inputting a scene T to which the video belongs.
Step S304, inputting video sample data into the target neural network, and calculating Loss function L ((G, Y) according to output results of the first level, the second level and the third level of the network1),(G,Y2),(T,Y3) According to a back propagation algorithm, the residual errors of each stage are obtained.
And S305, updating the weight value in the target neural network by adopting a back propagation algorithm based on the calculated residual error.
Step S306, return to step S304.
In a specific embodiment, the video scene identification method can be used for identifying a multi-person motion video, namely the input target video expresses a multi-person motion scene.
The embodiment of the invention constructs a cascaded three-level neural network for processing the video classification problem of multi-person movement. Taking football recognition as an example, assume that the scenes to be recognized comprise m types such as corner kicks, penalty kicks, shots, plate lifting and the like. The input of the model is the target video; the frame size of each video is W × H × 3, where W is the image length, H is the image width and 3 is the RGB three-channel data, so the size of the model input is N × W × H × 3. The final output of the model is an (m+1)-dimensional vector (containing a background class, i.e., all scenes other than the scenes to be identified are referred to as the background class); each dimension represents the probability of belonging to the corresponding scene, and the probabilities sum to 1. In addition, in order to capture the local features in the video, namely the key persons, the first sub-layer structure and the second sub-layer structure of the model output position thermal images of the key persons with size N × W × H × 1, in which the pixel values in the i-th frame (1 ≤ i ≤ N) represent the probability that each point belongs to the key person.
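For instance, the (m+1)-dimensional output can be realized as a softmax distribution (a small illustrative snippet; the disclosure only requires that the entries sum to 1):

```python
import torch

m = 4                                 # number of scene classes (example value)
logits = torch.randn(m + 1)           # raw scores from the second layer structure
probs = torch.softmax(logits, dim=0)  # (m+1)-dim vector whose entries sum to 1
scene = int(probs.argmax())           # index 0 is the background class
```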
The model structure is shown in fig. 2: the model is composed of 3 cascaded neural networks, in which the first layer structure 410 (comprising a first sub-layer structure 411 and a second sub-layer structure 412) is used for identifying and capturing the key persons in the video, and the second layer structure 420 is used for identifying the secondary scene to which the target video (video clip) belongs.
A group of N images of size W × H × 3 is input as the model input, and the target video extracted from the video is denoted as X; meanwhile, in the training process, position thermal image sample data of the key person with size N × W × H × 1 still needs to be input, that is, an image marking the position of the key person, denoted as G, in which the position of the key person is marked as 1 and other positions are 0; in order to identify the scene to which the target video belongs, the scene T also needs to be marked. The i-th output of the model is denoted as $Y_i$, where 1 ≤ i ≤ 3.
The first sublayer structure and the second sublayer structure can be derived from the Hourglass structure by extending 2D convolution to 3D convolution, i.e., by repeating the two-dimensional convolution kernel along the temporal axis so as to convert it into a 3D convolution kernel. The two-dimensional Hourglass convolution network, through repeated bottom-up and top-down processing combined with intermediate supervision, is used for human body posture estimation, face key point detection and the like, and can well exploit the different spatial positions of local features to optimize the recognition result. Extending the 2D convolution to 3D makes it possible to capture the spatial position of the local information in a single frame and, further, given the continuity and correlation of the images in a video, to capture the spatio-temporal position information of the target video and identify the key person region of the target video.
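One plausible realization of this 2D-to-3D derivation is kernel inflation: repeat a trained 2D kernel along the new temporal axis and rescale it. The helper below is a sketch of that idea in PyTorch (`inflate_conv2d` and the 1/T rescaling are standard practice from inflated-convolution work, not something the disclosure specifies):

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_depth: int) -> nn.Conv3d:
    """Converts a 2D convolution into a 3D convolution by repeating its
    kernel time_depth times along the new temporal axis."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_depth, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_depth // 2, *conv2d.padding))
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), rescaled so the 3D
        # response over a static clip matches the original 2D response.
        weight = conv2d.weight.unsqueeze(2).repeat(1, 1, time_depth, 1, 1)
        conv3d.weight.copy_(weight / time_depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate a 3x3 spatial convolution to a 3x3x3 spatio-temporal one.
conv3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=3, padding=1), time_depth=3)
```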
According to the video scene identification method, the key person information captured by the first sublayer structure and the second sublayer structure is combined with the global information in the original video (target video) as the input of the second layer structure, so that the surrounding environment information is captured while the local features are focused on; as a result, the method can identify the secondary scene of a video even when the video includes activities of multiple people.
It will be appreciated that prior art identification methods that rely only on local features have difficulty distinguishing scenes whose actions are similar, such as a header and a shot in football: the players' actions are similar, but the background conditions differ, with one facing the goal and the other taking place at the corner of the field.
Therefore, for the complexity of secondary classification labels in multi-person recognition scenes, the video scene identification method provided by the embodiment of the invention can correctly identify the secondary classification of a video scene by combining the capture of global (background) information with that of key local features.
For this cascaded three-level neural network, the loss function is designed as follows, with p (0 < p < 1); here we choose p = 0.5:

$$L\big((G,Y_1),(G,Y_2),(T,Y_3)\big)=p\sum_{k=1}^{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\Big[\big(Y_1^{k}(i,j)-G^{k}(i,j)\big)^{2}+\big(Y_2^{k}(i,j)-G^{k}(i,j)\big)^{2}\Big]+(1-p)\sum_{s=0}^{m}\big(Y_3(s)-T(s)\big)^{2}$$

wherein $Y_1$, $Y_2$ and $Y_3$ are respectively the outputs of the first sublayer structure, the second sublayer structure and the second layer structure; G is the sample label of the input position thermal image of the key person; T is the video scene; $Y_l^{k}(i,j)$ (l = 1, 2) is the value of the pixel point at the position with x coordinate i and y coordinate j in the output $Y_l$ for the k-th frame sample; $G^{k}(i,j)$ is the value of the pixel point at the position with x coordinate i and y coordinate j in the k-th frame sample of the input label G; $T(s)$ is the value of the s-th dimension of T; and $Y_3(s)$ is the value of the s-th dimension of $Y_3$.
In order to make the model converge more effectively, the implementation of the present invention is described below with reference to the network structure diagram and football scene recognition: the first two levels of the network are initially pre-trained to approximate convergence, and then the cascaded three-level neural network is trained with $L((G,Y_1),(G,Y_2),(T,Y_3))$, as in the following steps (a code sketch follows step S306):
step S301, initializing a first sublayer structure and a second sublayer structure by adopting a random initialization method, inputting target video sample data extracted from a video, wherein the size of the target video sample data is N x W x H3, and inputting position thermal image sample data of a key character with a label of N x W x H1; the first sub-layer structure and the second sub-layer structure are trained by adopting AdamaOptimizer, lr is initialized to 0.001, and iteration training is carried out for 10^4 times.
Step S302, setting training parameters of the first sublayer structure, the second sublayer structure, and the second sublayer structure, including a learning rate lr (0.001), a Loss function p ═ 0.5, an optimization algorithm, a maximum number of iterations, a learning rate decay parameter, and the like, and randomly initializing the second sublayer structure.
Step S303, inputting target video sample data, wherein the size of the target video sample data is N x W x H3, inputting position thermal image sample data of a key character with a label of N x W x H1, and inputting a scene T to which the target video belongs. The scene label of the video is set according to the requirement, for example, the football can be set as: 0 represents a background class, 1 represents a corner ball, 2 represents an arbitrary ball, 3 represents a shot, and so on.
Step S304, inputting target video sample data into the three-level neural network, and calculating Loss function L ((G, Y) according to output results of the first level, the second level and the third level of the network1),(G,Y2),(T,Y3) According to a back propagation algorithm, the residual errors of each stage are obtained.
And S305, updating the weight value in the cascade three-level neural network by adopting a back propagation algorithm based on the calculated residual error.
Step S306, return to step S304.
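Putting steps S301 to S306 together, a hedged training-loop sketch follows, reusing `CascadedSceneNet` and `cascade_loss` from the earlier sketches. The disclosure names "AdamOptimizer" and lr = 0.001 but no framework, so `torch.optim.Adam` stands in; the scene-label encoding and the data iterator `batches` are illustrative assumptions:

```python
import torch

SCENE_LABELS = {"background": 0, "corner_kick": 1, "free_kick": 2, "shot": 3}

def scene_to_vector(name: str, m: int = 3) -> torch.Tensor:
    """(m+1)-dimensional one-hot target T for a scene name (illustrative)."""
    t = torch.zeros(m + 1)
    t[SCENE_LABELS[name]] = 1.0
    return t

model = CascadedSceneNet(sub1, sub2, classifier)  # sub-networks assumed defined
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for step in range(10 ** 4):          # 10^4 iterations, as in step S301
    video, g, t = next(batches)      # hypothetical iterator: clip, heatmap label, scene T
    y1, y2, y3 = model(video)        # forward pass through all three levels
    loss = cascade_loss(y1, y2, y3, g, t, p=0.5)
    optimizer.zero_grad()
    loss.backward()                  # back-propagate the residuals (steps S304-S305)
    optimizer.step()                 # update the network weights
```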
The following describes the video recognition device provided by the embodiment of the present invention, and the video recognition device described below and the video scene recognition method described above may be referred to correspondingly.
As shown in fig. 3, the video recognition apparatus according to the embodiment of the present invention includes: scene feature type identification unit 510, video scene identification unit 520.
The scene feature type identification unit 510 is configured to input a target video of a scene to be identified into a target neural network, where the target neural network includes a first layer structure and a second layer structure, the first layer structure is configured to determine, according to the target video, a position thermal image of a key person, and an action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person;
the first layer structure is obtained by training with video sample data as a sample and with predetermined position thermal image sample data of a key person corresponding to the video sample data as a sample label.
The second layer structure is obtained by training with the video sample data and the thermal image of the position of the key character output by the first layer structure as samples and a predetermined video scene corresponding to the video sample data as a sample label.
A video scene recognition unit 520, configured to perform scene recognition on the video according to the target recognition scene type output by the target neural network.
In some embodiments, for the scene feature type identification unit 510, the first layer structure includes a first sublayer structure and at least one second sublayer structure; the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video, or for optimizing again, according to the target video, the optimized position thermal image of the key person output by the previous second sub-layer structure.
In some embodiments, for the scene feature type identification unit 510, the first layer structure and the second layer structure are trained based on a loss function.
Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor (processor)810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the following video scene recognition method: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key figure according to the target video, and the action of the key figure in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 810, the communication interface 820, the memory 830, and the communication bus 840 shown in fig. 4, where the processor 810, the communication interface 820, and the memory 830 complete mutual communication through the communication bus 840, and the processor 810 may call the logic instructions in the memory 830 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Further, an embodiment of the present invention discloses a computer program product, the computer program product includes a computer program stored on a non-transitory computer readable storage medium, the computer program includes program instructions, when the program instructions are executed by a computer, the computer can execute the video scene recognition method provided by the above-mentioned embodiments of the method, for example, the video scene recognition method includes: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key figure according to the target video, and the action of the key figure in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform the video scene recognition method provided in the foregoing embodiments when executed by a processor, for example, the video scene recognition method includes: inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key figure according to the target video, and the action of the key figure in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining the characteristic type of the target recognition scene from the candidate recognition scene types according to the target video and the position thermal image of the key person; and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for video scene recognition, comprising:
inputting a target video of a scene to be identified into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure, the first layer structure is used for determining a position thermal image of a key person according to the target video, and the action of the key person in the target video represents a candidate identification scene type of the target video; the second layer structure is used for determining a target recognition scene feature type from the candidate recognition scene types according to the target video and the position thermal image of the key person;
and carrying out scene recognition on the video according to the target recognition scene type output by the target neural network.
2. The video scene recognition method of claim 1, wherein the first layer structure comprises a first sublayer structure and at least one second sublayer structure;
the first sub-layer structure is used for determining the position thermal image of the key person according to the target video; the second sub-layer structure is used for optimizing the position thermal image of the key person output by the first sub-layer structure according to the target video, or for optimizing again, according to the target video, the optimized position thermal image of the key person output by the previous second sub-layer structure.
3. The method of claim 2, wherein the first layer structure and the second layer structure are obtained based on a loss function training.
4. The method of claim 3, wherein the loss function is:

$$L\big((G,Y_1),(G,Y_2),(T,Y_3)\big)=p\sum_{k=1}^{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\Big[\big(Y_1^{k}(i,j)-G^{k}(i,j)\big)^{2}+\big(Y_2^{k}(i,j)-G^{k}(i,j)\big)^{2}\Big]+(1-p)\sum_{s=0}^{m}\big(Y_3(s)-T(s)\big)^{2}$$

wherein $Y_1$, $Y_2$ and $Y_3$ respectively represent the outputs of the first sublayer structure, the second sublayer structure and the second layer structure; G represents the sample label of the input position thermal image of the key person; T represents the video scene; N represents the number of image frames in a video sample of the scene to be identified; W represents the image length and H the image width in the video sample of the scene to be identified; $Y_1^{k}(i,j)$ and $Y_2^{k}(i,j)$ are the values of the pixel points at the position with x coordinate i and y coordinate j in the outputs $Y_1$ and $Y_2$ for the k-th image sample; $G^{k}(i,j)$ is the value of the pixel point at the position with x coordinate i and y coordinate j in the k-th image sample of the input sample label G; $T(s)$ is the value of the s-th dimension of T; $Y_3(s)$ is the value of the s-th dimension of $Y_3$; and p and m are preset parameters.
5. The video scene recognition method of any one of claims 1 to 4, wherein, in the position thermal image of the key person, the pixel points where the key person is located are marked as 1.
6. A video recognition device, comprising:
a scene feature type recognition unit, configured to input a target video of a scene to be recognized into a target neural network, wherein the target neural network comprises a first layer structure and a second layer structure; the first layer structure is used for determining a position heat map of a key person from the target video, the action of the key person in the target video characterizing the candidate recognition scene types of the target video; the second layer structure is used for determining a target recognition scene type from the candidate recognition scene types according to the target video and the position heat map of the key person;
and a video scene recognition unit, configured to perform scene recognition on the target video according to the target recognition scene type output by the target neural network.
7. The video recognition device of claim 6, wherein the first layer structure comprises a first sub-layer structure and at least one second sub-layer structure;
the first sub-layer structure is used for determining the position heat map of the key person from the target video; each second sub-layer structure is used for optimizing, according to the target video, either the position heat map of the key person output by the first sub-layer structure or the optimized position heat map of the key person output by the preceding second sub-layer structure.
8. The video recognition device of claim 7, wherein the first layer structure and the second layer structure are obtained by training based on a loss function.
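Tying the earlier sketches together, a single illustrative training step in which both layer structures are optimized against the assumed joint loss; the frame count, image size, learning rate and one-hot scene label are all stand-in values:

    first_layer = CascadedFirstLayer(num_refinements=1)
    second_layer = SecondLayerStructure(num_scene_types=10)
    optimizer = torch.optim.Adam(
        list(first_layer.parameters()) + list(second_layer.parameters()), lr=1e-4)

    video = torch.rand(8, 3, 224, 224)          # N = 8 frames of one video sample
    g = torch.zeros(8, 1, 224, 224)             # key-person position labels
    g[:, :, 50, 100] = 1.0                      # assumed key-person pixel
    t = torch.zeros(10); t[3] = 1.0             # assumed one-hot scene label T

    heatmaps = first_layer(video)               # [initial Y1, refined Y2]
    scores = second_layer(video, heatmaps[-1])  # scene output Y3
    loss = joint_loss(heatmaps[0], heatmaps[-1], scores, g, t, p=1.0, m=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()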
9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the video scene recognition method according to any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video scene recognition method according to any one of claims 1 to 5.
CN202010096738.8A 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium Active CN111291692B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096738.8A CN111291692B (en) 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010096738.8A CN111291692B (en) 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111291692A true CN111291692A (en) 2020-06-16
CN111291692B CN111291692B (en) 2023-10-20

Family

ID=71023615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096738.8A Active CN111291692B (en) 2020-02-17 2020-02-17 Video scene recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111291692B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032031A1 * 2016-08-01 2018-02-01 Integem Inc. Methods and systems for photorealistic human holographic augmented reality communication with interactive control in real-time
CN106845487A * 2016-12-30 2017-06-13 佳都新太科技股份有限公司 End-to-end license plate recognition method
US20190068895A1 * 2017-08-22 2019-02-28 Alarm.Com Incorporated Preserving privacy in surveillance
CN107886069A * 2017-11-10 2018-04-06 东北大学 Multi-target human 2D pose real-time detection system and detection method
CN110443969A * 2018-05-03 2019-11-12 中移(苏州)软件技术有限公司 Fire point detection method, device, electronic device and storage medium
CN108710847A * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method and device, and electronic device
CN108648224A * 2018-05-18 2018-10-12 杭州电子科技大学 Real-time scene layout recognition and reconstruction method based on artificial neural networks
CN108830208A * 2018-06-08 2018-11-16 Oppo广东移动通信有限公司 Video processing method and device, electronic device, computer-readable storage medium
CN109117703A * 2018-06-13 2019-01-01 中山大学中山眼科中心 Mixed cell category recognition method based on fine-grained recognition
CN109271854A * 2018-08-07 2019-01-25 北京市商汤科技开发有限公司 Video processing method and device, video device and storage medium
CN109145840A * 2018-08-29 2019-01-04 北京字节跳动网络技术有限公司 Video scene classification method, device, equipment and storage medium
CN109508681A * 2018-11-20 2019-03-22 北京京东尚科信息技术有限公司 Method and apparatus for generating a human keypoint detection model
CN110166826A * 2018-11-21 2019-08-23 腾讯科技(深圳)有限公司 Video scene recognition method and device, storage medium and computer equipment
CN109598234A * 2018-12-04 2019-04-09 深圳美图创新科技有限公司 Keypoint detection method and apparatus
CN109740522A * 2018-12-29 2019-05-10 广东工业大学 Person detection method, device, equipment and medium
CN110348463A * 2019-07-16 2019-10-18 北京百度网讯科技有限公司 Method and apparatus for vehicle recognition
CN110766096A * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device, and electronic device

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HAO GUO et al.: "Human attribute recognition by refining attention heat map" *
MOHAMMAD ASHRAF RUSSO et al.: "Sports Classification in Sequential Frames Using CNN and RNN" *
ZHOU YONGSHENG: "Research on Human Action Recognition Algorithms Based on Multi-scale CNN Features" *
LIN LU: "Research on Key Technologies of Perception and Recognition for Intelligent Security" *
WANG YUTING: "Hierarchical LSTM Action Recognition Based on 'Individual-Group' Association Description" *
ZHAN CHUNRU: "Research on Image Scene Classification Methods Based on Convolutional Neural Networks" *

Also Published As

Publication number Publication date
CN111291692B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN111709409B (en) Face living body detection method, device, equipment and medium
CN110472554B (en) Table tennis action recognition method and system based on attitude segmentation and key point features
US9846845B2 (en) Hierarchical model for human activity recognition
Tran et al. Two-stream flow-guided convolutional attention networks for action recognition
Karlinsky et al. The chains model for detecting parts by their context
CN108205684B (en) Image disambiguation method, device, storage medium and electronic equipment
WO2021218671A1 (en) Target tracking method and device, and storage medium and computer program
CN109784148A Living body detection method and device
CN113822254B (en) Model training method and related device
CN113348465B (en) Method, device, equipment and storage medium for predicting relevance of objects in image
CN110633004A (en) Interaction method, device and system based on human body posture estimation
CN111967407B (en) Action evaluation method, electronic device, and computer-readable storage medium
CN111401192A (en) Model training method based on artificial intelligence and related device
KR20180054406A (en) Image processing apparatus and method
EP4145400A1 (en) Evaluating movements of a person
CN113435264A (en) Face recognition attack resisting method and device based on black box substitution model searching
US20240303848A1 (en) Electronic device and method for determining human height using neural networks
Shen et al. A competitive method to vipriors object detection challenge
Bhargavi et al. Knock, knock. Who's there?--Identifying football player jersey numbers with synthetic data
KR20200061747A (en) Apparatus and method for recognizing events in sports video
CN111291692B (en) Video scene recognition method and device, electronic equipment and storage medium
CN113544701B (en) Method and device for detecting associated object, electronic equipment and storage medium
CN113177462B (en) Target detection method suitable for court trial monitoring
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
Al Shami Generating Tennis Player by Predicting Movement Using 2D Pose Estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant