CN113573009A - Video processing method, video processing device, computer equipment and storage medium
- Publication number
- CN113573009A (application number CN202110189164.3A)
- Authority
- CN
- China
- Prior art keywords
- target
- video frame
- sequence
- interest region
- short
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/431—Generation of visual interfaces for content selection or interaction; Content or additional data rendering
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
The embodiment of the application discloses a video processing method, a video processing device, computer equipment and a storage medium. The video processing method comprises the following steps: acquiring a target video frame sequence, wherein the target video frame sequence comprises a target key frame, and the target key frame comprises a target object; identifying target position information of a target object in a target key frame, extracting sequence characteristics of a target video frame sequence, and extracting short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information; acquiring a short-time interest region feature set of K related video frame sequences; fusing the short-time interest region feature set into a long-time interest region feature; and determining the behavior category of the target object in the target key frame according to the long-time interest region characteristics and the short-time interest region characteristics of the target video frame sequence. By the method and the device, the video identification efficiency and the identification accuracy can be improved.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method and apparatus, a computer device, and a storage medium.
Background
Behavior detection refers to detecting the behavior of people in a video segment; the final output includes a detection frame for each human body in the video frames and one or more behavior labels corresponding to each human body. Behavior detection is widely used in fields such as video monitoring and motion analysis. For example, in video monitoring it can support target detection and abnormal event identification, helping to ensure the personal and property safety of people in a place; in sports and competition, it can provide accurate data analysis and support, improving fairness in competitive events.
At present, videos are mainly analyzed manually based on past experience and knowledge to determine the detection frames and behavior categories of the people in the videos. However, manual analysis is not only inefficient but also heavily influenced by subjective observation, so the analysis results are inaccurate.
Disclosure of Invention
The embodiment of the application provides a video processing method, a video processing device, computer equipment and a storage medium, which can improve video identification efficiency and identification accuracy.
An embodiment of the present application provides a video processing method, including:
acquiring a target video frame sequence, wherein the target video frame sequence comprises a target key frame, the target key frame comprises a target object, the target video frame sequence is any one of N video frame sequences contained in a target video, and N is a positive integer greater than 1;
identifying target position information of the target object in the target key frame, extracting sequence characteristics of the target video frame sequence, and extracting short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information;
acquiring a short-time interest region feature set of K associated video frame sequences, wherein the K associated video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, and K is a positive integer;
fusing the short-time interest region feature set into long-time interest region features;
and determining the behavior category of the target object in the target key frame according to the long-time interest region characteristics and the short-time interest region characteristics of the target video frame sequence.
An aspect of an embodiment of the present application provides a video processing apparatus, including:
a first obtaining module, configured to obtain a target video frame sequence, where the target video frame sequence includes a target key frame, where the target key frame includes a target object, the target video frame sequence is any one of N video frame sequences included in a target video, and N is a positive integer greater than 1;
the identification module is used for identifying target position information of the target object in the target key frame, extracting sequence characteristics of the target video frame sequence, and extracting short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information;
a second obtaining module, configured to obtain a feature set of a short-time interest region of K associated video frame sequences, where the K associated video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, and K is a positive integer;
the fusion module is used for fusing the short-time interest region feature set into a long-time interest region feature;
and the determining module is used for determining the behavior category of the target object in the target key frame according to the long-time interest region characteristics and the short-time interest region characteristics of the target video frame sequence.
An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the method in the foregoing embodiments.
An aspect of the embodiments of the present application provides a computer storage medium, in which a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the method in the foregoing embodiments is performed.
An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium, and when the computer instructions are executed by a processor of a computer device, the computer instructions perform the methods in the embodiments described above.
According to the method and the device, the position information and the behavior category of the target object in the video frame are automatically identified by the terminal device without manual participation, so that the interference of subjective factors caused by manual analysis is avoided, the video identification efficiency and the identification accuracy are improved, and the video identification mode is enriched; moreover, the long-term interest region feature and the short-term interest region feature of the video frame sequence are extracted, and the long-term interest region feature and the short-term interest region feature respectively represent the behavior feature of the target object in a long time period and a short time period, so that the feature expression modes of behavior categories can be enriched, and the identification accuracy of the behavior categories is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a system architecture diagram of a video processing system according to an embodiment of the present application;
Figs. 2a-2c are schematic diagrams of a video processing scene provided in an embodiment of the present application;
fig. 3 is a schematic flow chart of video processing provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a pooling process provided by an embodiment of the present application;
FIGS. 5a-5b are schematic diagrams of recognition results provided by embodiments of the present application;
fig. 6 is a schematic flow chart of video processing provided by an embodiment of the present application;
fig. 7 is a schematic diagram of long-time and short-time information decoupling provided in an embodiment of the present application;
FIG. 8 is a block diagram of atomic behavior detection provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of identifying behavior categories according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, text processing, natural language processing, and machine learning/deep learning.
The solution provided by this application belongs to the computer vision and machine learning/deep learning branches of artificial intelligence. In this application, a trained artificial intelligence model automatically identifies the position information of a target object in a video frame and the behavior category of the target object. Subsequently, the behavior quality of the target object can be evaluated based on the recognition result, or abnormal behavior of the target object can be discovered in time.
The method can be applied to video monitoring scenarios: a surveillance camera collects video of a certain area (for example, a street or a store entrance), and the solution of this application can identify the position information and behavior categories of the people in the video. When a person's behavior category is identified as an abnormal category (for example, fighting or theft), alarm information can be sent to alert security personnel, who can quickly locate the specific area where the abnormal behavior occurred based on the identified position information, protecting the personal and property safety of people in the monitored area.
The application can also be applied to sports scenarios: competition video of players on the field is recorded, and the solution of this application identifies the behavior categories of the players in the video. The identified behavior categories can be used to judge a player's competitive state, which can support subsequent player selection and improve fairness in competitive sports; the identified behavior categories can also reveal whether a player committed a violation during the competition, improving the accuracy of referees' rulings.
The application can also be applied to outdoor teaching scenes: the outdoor exercise video of the students is recorded, the behavior categories of the students in the video are identified by adopting the scheme, the participation enthusiasm of the students can be determined through the identified behavior categories, and data support is provided for subsequent teaching effect evaluation and adjustment of teaching modes.
Referring to fig. 1, fig. 1 is a system architecture diagram of video processing according to an embodiment of the present disclosure. The server 10f establishes connections with a user terminal cluster through the switch 10e and the communication bus 10d; the user terminal cluster may include user terminal 10a, user terminal 10b, and so on. Taking the user terminal 10a as an example, the user terminal 10a acquires a video to be recognized and sends the video to the server 10f. Upon receiving the video, the server 10f divides the video into a plurality of video frame sequences, each of which includes a key frame. For any video frame sequence (referred to as a target video frame sequence), the server 10f identifies the position information of a person in the key frame of the target video frame sequence, extracts the sequence features of the target video frame sequence, and extracts the short-time interest region features corresponding to the position information from the sequence features. The server 10f determines several video frame sequences adjacent to the target video frame sequence among the plurality of video frame sequences, determines the short-time interest region features of each of these adjacent video frame sequences in the same manner, and merges them into the long-time interest region features. The server 10f then determines the behavior category of the person in the key frame of the target video frame sequence according to the short-time interest region features and the long-time interest region features of the target video frame sequence.
The server 10f may identify the position information and behavior category of the person in the key frame in each video frame sequence, and issue the identified position information and behavior category of all the video frame sequences to the user terminal 10a, and the user terminal 10a may perform a downstream task (e.g., video content understanding) based on the received data, or directly display the received data on a screen.
The user terminal 10a, the user terminal 10b, the user terminal 10c shown in fig. 1 may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), a wearable device, or other intelligent devices with a video processing function. The user terminal cluster and the server 10f may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The following takes fig. 2 a-2 c as an example to specifically describe how the server identifies the position information and behavior category of the person in the video. Please refer to fig. 2 a-2 c, which are schematic views of a video processing scene according to an embodiment of the present application. As shown in fig. 2a, the server obtains a video 20a to be currently identified, selects a key frame at certain frame intervals (for example, 20 frames), and selects previous and subsequent video frames (for example, 16 frames before and after, and totally 32 frames) to form a video frame sequence with the key frame as the center, where the formed video frame sequence includes the key frame itself. In this manner, a plurality of sequences of video frames may be determined from the video 20 a. In the following description, one of the video frame sequences 20b is taken as an example, and assuming that the key frame in the video frame sequence 20b is an image 20c, the server invokes a trained human body position frame detection model to identify the position information of the person in the image 20c, and the human body position frame detection model can identify the position of the person in the image. The position of the person in the image 20c is identified by a dashed rectangle, as shown in image 20d in fig. 2 a.
The server inputs the video frame sequence 20b into the trained feature extraction model 20e. Since the video frame sequence 20b is formed by combining a plurality of video frames, the feature extraction model 20e may be a 3D feature extraction model, which can directly process the video frame sequence. The feature extraction model 20e includes a plurality of convolutional layers. When the video frame sequence 20b is input into the feature extraction model 20e, the model outputs a sequence feature, which is formed by combining a plurality of feature maps, and a certain proportional relationship exists between the feature maps and the image 20c (for example, the size of each feature map is 1/4 of the size of the image 20c).
The server extracts a partial sequence feature corresponding to the position from the sequence features output by the feature extraction model 20e according to the proportional relationship and the position of the person in the image 20c, and the extracted partial sequence feature is called a short-time interest region feature 20f of the video frame sequence 20b, in short, the short-time interest region feature 20f is the feature of the person in the video frame sequence 20b (the feature with most of the background removed).
As shown in fig. 2b, it can be seen from the foregoing that the video 20a is divided into a plurality of video frame sequences, and a preceding video frame sequence and a succeeding video frame sequence are selected from the plurality of video frame sequences with the video frame sequence 20b as the center, that is, a video frame sequence adjacent to the video frame sequence 20b is determined, and it is assumed that the video frame sequences adjacent to the video frame sequence 20b are the video frame sequence 20g, the video frame sequence 20h, the video frame sequence 20i, and the video frame sequence 20 j. It should be noted that the video frame sequence adjacent to the video frame sequence 20b does not include the video frame sequence 20b itself.
While FIG. 2a above details how the short-term interest region features 20f of the sequence of video frames 20b are extracted, the server may determine the short-term interest region features 20k of the sequence of video frames 20g, the short-term interest region features 20l of the sequence of video frames 20h, the short-term interest region features 20m of the sequence of video frames 20i, and the short-term interest region features 20n of the sequence of video frames 20j in the same manner.
The server fuses the short-time interest region feature 20k, the short-time interest region feature 20l, the short-time interest region feature 20m, and the short-time interest region feature 20n into a long-time interest region feature 20p. The fusion may directly superimpose the 4 short-time interest region features; alternatively, an attention mechanism may be used to determine the correlation degree between the short-time interest region feature 20f and each of the 4 short-time interest region features, which serves as the weight coefficient of that short-time interest region feature, and the 4 short-time interest region features are then weighted and summed based on these weight coefficients to obtain the long-time interest region feature 20p of the video frame sequence 20b.
The short-time interest region feature 20f and the long-time interest region feature 20p are both features extracted based on the position of the human body. The time period corresponding to the short-time interest region feature 20f is the time period of the video frame sequence 20b, while the time period corresponding to the long-time interest region feature 20p spans the video frame sequences 20g through 20j; therefore, one is a short-time feature and the other is a long-time feature.
At this point, the server has extracted the short-time interest region feature 20f and the long-time interest region feature 20p of the video frame sequence 20b. As shown in fig. 2c, inputting the short-time interest region feature 20f into the fully-connected layer 20q yields matching probabilities for a plurality of behavior types; similarly, inputting the long-time interest region feature 20p into the fully-connected layer 20r yields matching probabilities for the same behavior types. Of course, the behavior types in the fully-connected layer 20q and those in the fully-connected layer 20r are identical.
The matching probabilities output by the two fully-connected layers are superimposed and normalized to obtain the final matching probabilities between the person in the video frame sequence 20b and the plurality of behavior types, and the server can take the behavior type corresponding to the maximum matching probability as the behavior type of the video frame sequence 20b. Assume the identified behavior type of the video frame sequence 20b is: running.
The server may mark the position of the person with a dashed rectangular box in the key frame 20c of the sequence of video frames 20b and add the identified behavior type "running" and may result in the image 20s shown in fig. 2 c.
Subsequently, the server may determine the position and behavior type of the character in each video frame sequence in the same manner, and output the identified result.
The specific processes of obtaining the target video frame sequence (e.g., the video frame sequence 20b in the above embodiment), identifying the target position information (e.g., the position of the person in the above embodiment) of the target object in the target key frame, extracting the short-term interest region feature (e.g., the short-term interest region feature 20f in the above embodiment) of the target video frame sequence, and extracting the long-term interest region feature (e.g., the long-term interest region feature 20p in the above embodiment) may be referred to the following embodiments corresponding to fig. 3 to 9.
Referring to fig. 3, which is a schematic flowchart of a video processing method provided in an embodiment of the present application. Since the video processing method involves identifying behavior categories with an artificial intelligence model, the following steps are described with a server (which generally has higher computing performance) as the execution subject. The video processing method includes the following steps:
step S101, a target video frame sequence is obtained, wherein the target video frame sequence comprises a target key frame, the target key frame comprises a target object, the target video frame sequence is any one of N video frame sequences contained in a target video, and N is a positive integer greater than 1.
Specifically, the server obtains a video to be identified (referred to as a target video, for example, the video 20a in the embodiment corresponding to fig. 2a to 2c described above), where the target video includes a plurality of video frames, one key frame is selected at certain frame intervals (for example, 20 frames), and N key frames can be determined from the target video, where N is a positive integer. For 1 key frame, the front and back video frames (for example, the front and back 16 frames, total 32 frames) are selected to form 1 video frame sequence with the key frame as the center, and the video frame sequence includes the key frame.
The server may determine the video frame sequence corresponding to each key frame in the same manner, i.e. N video frame sequences may be obtained.
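For illustration, the following sketch (in Python, with hypothetical function and variable names; the 20-frame interval and 32-frame window simply reuse the example values above and are not fixed by this application) shows one possible way to assemble the N key-frame-centered video frame sequences from a decoded target video:

```python
from typing import List

def build_frame_sequences(frames: List, key_interval: int = 20,
                          window: int = 32) -> List[dict]:
    """Pick one key frame every `key_interval` frames and gather roughly
    `window` surrounding frames (clamped at the video boundaries)."""
    sequences = []
    half = window // 2
    for key_idx in range(0, len(frames), key_interval):
        start = max(0, key_idx - half)
        end = min(len(frames), key_idx + half)
        sequences.append({
            "key_frame_index": key_idx,
            "frames": frames[start:end],  # the sequence includes the key frame itself
        })
    return sequences
```

Each returned entry corresponds to one of the N video frame sequences, with its key frame recorded alongside the surrounding frames.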
Any video frame sequence (referred to as a target video frame sequence, such as the video frame sequence 20b in the corresponding embodiments of fig. 2 a-2 c) is selected from the N video frame sequences, and a key frame in the target video frame sequence is referred to as a target key frame, where the target key frame includes a target object, and the target object may be a character, may be an animal (e.g., a cat, a dog, etc.), or may be a game character, etc.
The following description uses the target video frame sequence as an example.
Step S102, identifying the target position information of the target object in the target key frame, extracting the sequence characteristics of the target video frame sequence, and extracting the short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information.
Specifically, the server inputs the target key frame into a pre-trained object position frame recognition model, which is used to recognize the position information of the target object in an image. The position information may be the position coordinates of the 4 vertices of a rectangular frame in the image, where the rectangular frame contains the target object. After the server inputs the target key frame into the pre-trained object position frame recognition model, the model outputs the position information of the target object in the target key frame (referred to as the target position information).
The server inputs the target video frame sequence into a trained 3D feature extraction model (such as the feature extraction model 20e in the embodiment corresponding to fig. 2 a-2 c described above), where the 3D feature extraction model may specifically be a 3D CNN (Convolutional Neural Network), the 3D feature extraction model may extract features of the sequence data, and after the target video frame sequence is input into the 3D feature extraction model, the sequence features of the target video frame sequence may be extracted. The sequence features of the target video frame sequence comprise a plurality of feature maps with the same size, and the size of each feature map and the size of the target key frame satisfy a preset proportional relation. For example, the length of each feature map is 1/4 the length of the target key frame, and the width of each feature map is 1/4 the width of the target key frame.
In other words, one pixel in the feature map may correspond to one image region of the target key frame.
The server performs scaling processing on the target position information according to a preset proportional relationship to obtain scaled target position information (called adjusted target position information), and extracts unit feature maps included in rectangular frames corresponding to the adjusted target position information from each feature map.
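A minimal sketch of this scaling step, assuming the preset proportional relationship is a single scale factor (for example 1/4) and that the target position information is given as rectangle corner coordinates (names are illustrative):

```python
def scale_box_to_feature_map(box, scale: float = 0.25):
    """box = (x1, y1, x2, y2) in key-frame pixel coordinates; returns the
    adjusted target position information in feature-map coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * scale, y1 * scale, x2 * scale, y2 * scale)
```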
Each unit feature map is subjected to alignment processing (or pooling) to obtain a plurality of aligned unit feature maps, where each aligned unit feature map contains a preset number of pixels (equivalently, has a preset size); for example, a unit feature map of 4 × 4 pixels may be aligned into 2 × 2 pixels.
The server combines the multiple aligned unit feature maps into a short-time region of interest feature (such as short-time region of interest feature 20f in the corresponding embodiment of fig. 2 a-2 c described above) of the sequence of target video frames. The reason why the server performs the alignment processing on the unit feature map is that: in a target key frame, a plurality of target position information of a plurality of target objects can be identified, the size of the rectangular frame corresponding to each target position information is different, and further the size of the unit feature map corresponding to each target position information is different, so that in order to make the feature dimensions consistent, the size of the unit feature map needs to be adjusted, so that the sizes of all the adjusted unit feature maps are consistent.
Referring to fig. 4, fig. 4 is a schematic diagram of a pooling process provided in an embodiment of the present application. After the region where the target object is located is identified in the key frame (the region marked by the detection frame in fig. 4), the unit feature map corresponding to the region is determined from the feature maps contained in the sequence features. As shown in fig. 4, assume the current unit feature map is a 3 × 3 grid with pixel values 1, 3, 2, 4, 7, 3, 2, 0, and 6, and that the size of the feature map after alignment or pooling is 2 × 2. For the pixel values 1, 3, 4, and 7 of the 4 pixels at the top left, a new pixel value may be determined by interpolation, or by average pooling (or maximum pooling); assuming average pooling is used here, the new pixel value is (1+3+4+7)/4 ≈ 4. For the pixel values 3, 2, 7, and 3 of the 4 pixels at the top right, average pooling gives a new pixel value of (3+2+7+3)/4 ≈ 4; for the pixel values 4, 7, 2, and 0 of the 4 pixels at the bottom left, average pooling gives (4+7+2+0)/4 ≈ 3; for the pixel values 7, 3, 0, and 6 of the 4 pixels at the bottom right, average pooling gives (7+3+0+6)/4 = 4. In summary, the pixel values of the aligned unit feature map are 4, 4, 3, 4, and these values can be combined into the short-time interest region feature.
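The average-pooling case of this example can be sketched in a few lines of NumPy. This is only an illustration of the alignment idea (real ROI alignment typically uses bilinear sampling; the overlapping-window scheme and the rounding below are assumptions made to reproduce the numbers above):

```python
import numpy as np

def align_unit_feature_map(unit_map: np.ndarray, out_size: int = 2) -> np.ndarray:
    """Average-pool a unit feature map down to out_size x out_size."""
    h, w = unit_map.shape
    win_h, win_w = h - out_size + 1, w - out_size + 1  # overlapping windows
    aligned = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            aligned[i, j] = unit_map[i:i + win_h, j:j + win_w].mean()
    return np.rint(aligned)

unit_map = np.array([[1, 3, 2],
                     [4, 7, 3],
                     [2, 0, 6]], dtype=float)
print(align_unit_feature_map(unit_map))  # [[4. 4.] [3. 4.]]
```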
Step S103, acquiring a short-time interest region feature set of K related video frame sequences, wherein the K related video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, and K is a positive integer.
Specifically, the server determines K video frame sequences (each referred to as an associated video frame sequence, such as video frame sequence 20g, video frame sequence 20h, video frame sequence 20i, and video frame sequence 20j in the corresponding embodiments of fig. 2 a-2 c described above) adjacent to the target video frame sequence among the plurality of video frame sequences, where K is a positive integer and is less than or equal to N.
It should be noted that the K associated video frame sequences do not include the target video frame sequence.
The server acquires the short-time interest region features (called to-be-fused short-time interest region features) of each associated video frame sequence, and combines the P to-be-fused short-time interest region features of the K associated video frame sequences into a short-time interest region feature set. The above step S102 describes in detail how to extract the short-term interest region feature of one video frame sequence (i.e. the target video frame sequence), and the server can extract the short-term interest region feature to be fused for each associated video frame sequence in the same manner. The specific process can be as follows:
as can be seen from the foregoing, each associated video frame sequence contains a key frame (referred to as an associated key frame), and each associated key frame contains a target object, and the server invokes an object location box identification model to identify location information (referred to as associated location information) of the target object in each associated key frame respectively. And calling a 3D feature extraction model to respectively extract sequence features (called correlation sequence features) of each correlated video frame sequence. Similarly, the server scales the associated position information of each associated key frame according to a preset proportion, respectively extracts a unit feature map corresponding to the scaled associated position information from each associated sequence feature, and after alignment processing, the short-time interest region feature to be fused of each associated video frame sequence can be obtained.
It should be noted that the number P of the to-be-fused short-time interest region features included in the short-time interest region feature set is not less than the number K of the associated video frame sequences, because at least one piece of associated position information can be identified in one associated key frame, and certainly, if a certain associated key frame includes a plurality of target objects, a plurality of pieces of associated position information can be identified in the associated key frame, and then a plurality of to-be-fused short-time interest region features can be determined based on one associated video frame sequence.
In other words, the short-time interest region features to be fused in the short-time interest region feature set are the short-time interest region features of the video frame sequences adjacent to the target video frame sequence, and the set does not include the short-time interest region feature of the target video frame sequence itself.
And step S104, fusing the short-time interest region feature set into a long-time interest region feature.
Specifically, the server may directly superimpose the short-time interest region features to be fused in the short-time interest region feature set to obtain the long-time interest region features of the target video frame sequence (such as the long-time interest region features 20p in the corresponding embodiments of fig. 2a to fig. 2c described above).
The server may also determine, based on an attention mechanism, the degree of correlation between the short-time interest region feature of the target video frame sequence and each short-time interest region feature to be fused, where the degree of correlation for each short-time interest region feature to be fused may be determined by computing the vector inner product between the short-time interest region feature of the target video frame sequence and that short-time interest region feature to be fused. The degree of correlation may represent the effect of each short-time interest region feature to be fused on the short-time interest region feature of the target video frame sequence.
And the server takes the correlation degree of each short-time interest region feature to be fused as a weighting coefficient, and performs weighted summation operation on the P short-time interest region features to be fused to obtain the long-time interest region feature of the target video frame sequence.
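A sketch of this attention-style fusion, assuming each interest region feature has been flattened into a vector; normalizing the inner products with a softmax is an assumption made here for a concrete example, since the application only requires that the correlation degrees serve as weight coefficients:

```python
import numpy as np

def fuse_long_time_feature(target_feat: np.ndarray,
                           neighbor_feats: np.ndarray) -> np.ndarray:
    """target_feat: (D,) short-time interest region feature of the target sequence.
    neighbor_feats: (P, D) short-time interest region features to be fused.
    Returns the (D,) long-time interest region feature."""
    scores = neighbor_feats @ target_feat   # correlation degrees (vector inner products)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # weight coefficients
    return weights @ neighbor_feats         # weighted sum over the P features
```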
Step S105, determining the behavior category of the target object in the target key frame according to the long-time interest region characteristics and the short-time interest region characteristics of the target video frame sequence.
Having obtained the long-time interest region feature and the short-time interest region feature of the target video frame sequence, the server may determine a behavior category from each of the two features separately, and then determine the final behavior category by combining the two results.
The server may combine the target position information of the identified target object in the target key frame and the behavior category of the target object in the target key frame into the identification result of the target video frame sequence, and output the identification result of the target video frame sequence.
Subsequently, the server may store the short-time interest region feature and the short-time interest region feature set of the target video frame sequence for use in subsequently determining recognition results of other video frame sequences of the target video.
Optionally, as can be seen from the foregoing, the target location information corresponds to a rectangular frame, and the rectangular frame includes the target object, and the server may mark the rectangular frame in the target key frame, that is, mark an area of the target object in the target key frame. The server may further add the identified final behavior category to the target key frame, where the behavior category may be added to the target key frame in a text form, and the server outputs the region where the marked target object is located and the target key frame to which the behavior category is added (such as the image 20s in the corresponding embodiment of fig. 2 a-2 c).
Referring to fig. 5 a-5 b, which are schematic diagrams illustrating an identification result according to an embodiment of the present application, as shown in fig. 5a, a position of a target object is marked by a dashed rectangle in a key frame, and an identified behavior category is marked, where the behavior category identified in fig. 5a is: standing and singing. As shown in fig. 5b, similarly, the position of the target object is marked by a dashed rectangle in another key frame, and since fig. 5b includes 2 target objects, there are two rectangle frames in the key frame to mark the positions of the 2 target objects, respectively, and the behavior categories of the two target objects identified in fig. 5b are: standing, playing, therefore, the behavior categories of the two target objects need to be marked in fig. 5 b.
In this method, the prediction of short-time information and long-time information is decoupled, and the two kinds of information are extracted and computed independently. Because the short-time features and the long-time features are completely decoupled and computed separately, the ways of identifying behavior categories are enriched and the identification accuracy of behavior categories is improved.
Referring to fig. 6, fig. 6 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure, where the video processing method includes the following steps:
step S201, obtaining a target video frame sequence, where the target video frame sequence includes a target key frame, the target key frame includes a target object, the target video frame sequence is any one of N video frame sequences included in a target video, and N is a positive integer greater than 1.
Step S202, identifying the target position information of the target object in the target key frame, extracting the sequence characteristics of the target video frame sequence, and extracting the short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information.
Step S203, acquiring a short-time interest region feature set of K related video frame sequences, wherein the K related video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, K is a positive integer, and fusing the short-time interest region feature set into long-time interest region features.
The specific processes of step S201 to step S203 may refer to step S101 to step S104 in the embodiment corresponding to fig. 3.
Step S204, identifying the characteristics of the short-time interest region of the target video frame sequence to obtain first behavior prediction information of a target object in the target key frame.
Specifically, the server inputs the short-time interest region feature of the target video frame sequence into a first feature recognition model (such as the fully-connected layer 20q in the corresponding embodiment of fig. 2a-2c) to obtain the first behavior prediction information, where the first feature recognition model may be a combination of a fully-connected layer and a softmax function. The first behavior prediction information may include matching probabilities (referred to as first matching probabilities) for the M behavior classes covered by the first feature recognition model, and the sum of all the first matching probabilities is equal to 1.
Step S205, the long-term interest region features are identified to obtain second behavior prediction information of the target object in the target key frame.
Specifically, the server inputs the long-time interest region feature into a second feature recognition model (such as the fully-connected layer 20r in the corresponding embodiment of fig. 2a-2c) to obtain the second behavior prediction information, where the second feature recognition model may also be a combination of a fully-connected layer and a softmax function.
It should be noted that the first feature recognition model and the second feature recognition model may both be a combination of a fully-connected layer + softmax function, but coefficients of the fully-connected layer are different.
The second behavior prediction information may include matching probabilities (referred to as second matching probabilities) of M behavior classes included in the second feature recognition model, and a sum of all the second matching probabilities is equal to 1, and the M behavior classes of the first feature recognition model and the M behavior classes of the second feature recognition model are the same.
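The two recognition models can be sketched as follows in PyTorch-style Python (a hedged illustration only: the feature dimension, the class count of 80, and the module names are assumptions, not values fixed by this application):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehaviorHead(nn.Module):
    """Fully-connected layer + softmax over the same M behavior classes."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, roi_feature: torch.Tensor) -> torch.Tensor:
        # Returns matching probabilities that sum to 1 over the M classes.
        return F.softmax(self.fc(roi_feature), dim=-1)

short_head = BehaviorHead(feat_dim=512, num_classes=80)  # first feature recognition model
long_head = BehaviorHead(feat_dim=512, num_classes=80)   # second feature recognition model
```

The two heads share the same class set but keep separate fully-connected weights, matching the note above that only the layer coefficients differ.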
Step S206, determining the behavior category of the target object in the target key frame according to the first behavior prediction information and the second behavior prediction information.
Specifically, the server adds the M first matching probabilities and the M second matching probabilities element-wise to obtain M target matching probabilities, and takes the behavior category corresponding to the maximum target matching probability among the M target matching probabilities as the behavior category of the target object in the target key frame.
Optionally, the server may instead add the M first matching probabilities and the M second matching probabilities element-wise and then normalize the result to obtain M probability values, and take the behavior categories corresponding to the probability values greater than a preset threshold as the behavior categories of the target object in the target key frame.
For example, the first feature recognition model outputs M first matching probabilities: the probability that the target object belongs to behavior class A is 0.3, to behavior class B is 0.5, and to behavior class C is 0.2. The second feature recognition model outputs M second matching probabilities: the probability that the target object belongs to behavior class A is 0.1, to behavior class B is 0.8, and to behavior class C is 0.1. The result of superimposing the M first matching probabilities and the M second matching probabilities is therefore: the probability that the target object belongs to behavior class A is 0.4, to behavior class B is 1.3, and to behavior class C is 0.3, so it can be determined that behavior class B, corresponding to the maximum value of 1.3, is the behavior category of the target object in the target key frame.
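The numeric example above can be reproduced with a few lines of Python (the class names A, B, and C are of course placeholders):

```python
first = {"A": 0.3, "B": 0.5, "C": 0.2}    # first matching probabilities (short-time head)
second = {"A": 0.1, "B": 0.8, "C": 0.1}   # second matching probabilities (long-time head)

target = {c: first[c] + second[c] for c in first}  # {"A": 0.4, "B": 1.3, "C": 0.3}
behavior_category = max(target, key=target.get)    # -> "B"
print(behavior_category, target[behavior_category])
```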
Optionally, as can be seen from the foregoing description, determining the behavior category of the target object in the target key frame involves a 3D feature extraction model, a first feature recognition model, and a second feature recognition model: the 3D feature extraction model extracts the sequence features of a video frame sequence, the first feature recognition model classifies the short-time interest region features, and the second feature recognition model classifies the long-time interest region features. The training of these three models is described in detail below:
A sample video for model training is obtained, and the video frames contained in the sample video are divided into a plurality of video frame sequences (each referred to as a sample video frame sequence), where each sample video frame sequence also contains a key frame (referred to as a sample key frame). Similarly, the position information of the target object in each sample key frame is identified first, and then the short-time interest region feature and the long-time interest region feature of each sample video frame sequence are extracted in the manner described above (this extraction involves the 3D feature extraction model, the first feature recognition model, and the second feature recognition model). The behavior category of the target object in each sample video frame sequence is predicted from its short-time interest region feature and long-time interest region feature, a prediction error is determined from the predicted behavior category and the real behavior category, and the model parameters of the 3D feature extraction model, the first feature recognition model, and the second feature recognition model are adjusted in reverse based on the prediction error.
Model parameters of the 3D feature extraction model, the first feature recognition model and the second feature recognition model can be adjusted for multiple times based on multiple sample videos, and when the adjusted 3 models are all converged, it is indicated that training of the current 3D feature extraction model, the first feature recognition model and the second feature recognition model is completed.
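A hedged sketch of one training step consistent with the description above; the use of cross-entropy on the fused pre-softmax outputs, the SGD optimizer, and the dimensions are assumptions, and the ROI features are assumed to be produced by the 3D feature extraction model inside the same autograd graph so that its parameters can also be adjusted by the same backward pass once they are added to the optimizer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_classes = 512, 80              # assumed dimensions
short_fc = nn.Linear(feat_dim, num_classes)  # first feature recognition model (pre-softmax)
long_fc = nn.Linear(feat_dim, num_classes)   # second feature recognition model (pre-softmax)
optimizer = torch.optim.SGD(
    list(short_fc.parameters()) + list(long_fc.parameters()), lr=0.01)

def train_step(short_roi_feat, long_roi_feat, true_label):
    optimizer.zero_grad()
    fused_logits = short_fc(short_roi_feat) + long_fc(long_roi_feat)
    loss = F.cross_entropy(fused_logits, true_label)  # prediction error vs. real behavior class
    loss.backward()                                   # reverse adjustment of model parameters
    optimizer.step()
    return loss.item()
```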
The modeling process of the 3D feature extraction model, the first feature recognition model, and the second feature recognition model is shown as the following formula (1):
P(C, Zl, Zs | V) = P(C | Zl, Zs) · P(Zl | V) · P(Zs | V)    (1)
Features are extracted from the video frame sequence by the 3D feature extraction model, and an ROI (Region Of Interest) operation is performed on the feature map using the human body detection frame to obtain the features corresponding to each human body, denoted as V; the feature V corresponds to the short-time interest region feature of the video frame sequence in this application. The variables required for classification can be split into a long-time information variable Zl and a short-time information variable Zs, where the long-time information variable Zl corresponds to the second behavior prediction information and the short-time information variable Zs corresponds to the first behavior prediction information in this application. Letting C be the class to be predicted, and noting that C is not directly related to the input video feature V, the joint distribution of C, Zl, and Zs is modeled as formula (1) above.
In formula (1), recognizing the behavior class is decomposed into three sub-problems: P(Zl | V) and P(Zs | V) model the long-time and short-time features corresponding to the input video feature, respectively, and P(C | Zl, Zs) gives the discriminant function of the final behavior category from the long-time and short-time information. The whole process can be represented as shown in fig. 7, which is a schematic diagram of long-time and short-time information decoupling provided in an embodiment of the present application. According to the joint modeling process shown in fig. 7, behavior category identification is decoupled into a long-time context dependency and a short-time context dependency, whose variable spaces are independent of each other, so the behavior category can be identified and predicted from each separately.
Referring to fig. 8, fig. 8 is a schematic diagram of the framework of atomic behavior detection according to an embodiment of the present application. Atomic behavior detection refers to detecting the behaviors of people in a video; the final output includes a detection frame for each human body and one or more behavior labels corresponding to each human body. The specific process of atomic behavior detection is as follows: first, the video is split into video frames, key frames are selected at certain frame intervals (for example, 20 frames), and a video frame sequence composed of the surrounding video frames (for example, 16 frames before and after, 32 frames in total) is selected with each key frame as the center and serves as the input of the detection framework. For the key frame, a pre-trained human body detector (which may correspond to the object position frame recognition model in this application) is used to obtain the human body detection frames. The video frame sequence is then input into a 3D CNN for feature extraction to obtain a feature map X, the previously obtained human body detection frames are used to perform an ROI operation on the feature map X (i.e., the features inside the human body detection frames are obtained), the obtained features are processed by a short-time context module and a long-time context module to obtain the output results Zs and Zl, respectively, and Zs and Zl are normalized to obtain the final classification result C.
The long-time context module extracts and fuses the features of the video frame sequences before and after the current video frame sequence. This requires that the features of the video frame sequences corresponding to all key frames of the whole video be extracted in advance, and that the features within all detection frames be stored in a feature storage module, with the detection frame of each person as the basic storage unit.
Through this framework design, the prediction of short-time information and long-time information is decoupled, and their features are extracted and computed independently: the short-time features come from the ROI operation on the features of the current video frame sequence, while the long-time features come from fixed time windows before and after the current video frame sequence, with the features of the current video frame sequence itself excluded from the window.
Referring to fig. 9, fig. 9 is a schematic diagram of identifying behavior categories according to an embodiment of the present application. As shown in fig. 9, a plurality of video frames before and after the current video frame may be combined into a video frame sequence, and short-time context cue inference based on this video frame sequence can yield behaviors such as "sit" and "write"; long-time context cue inference over a plurality of video frame sequences before and after this video frame sequence can yield the category "speak with person". The above "sit", "write", and "speak with person" are combined into the behavior categories identified based on the video frame sequence. In this way, this application designs an atomic behavior detection algorithm based on decoupling long-time and short-time information: behavior classification is decoupled into a long-time prediction problem and a short-time prediction problem, the two are predicted separately, and the results are finally fused at the probability level, which effectively improves the detection accuracy of the algorithm.
According to the method and the device, the long-term information and the short-term information can be finely modeled, so that the long-term information and the short-term information are decoupled, the long-term information and the short-term information are respectively subjected to category prediction, and finally higher behavior detection precision can be obtained, and efficient and reliable technical support can be provided for applications such as video content understanding and analysis.
Further, please refer to fig. 10, which is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application. As shown in fig. 10, the video processing apparatus 1 may be applied to the server in the above-described embodiments corresponding to fig. 3 to 9. Specifically, the video processing apparatus 1 may be a computer program (including program code) running in a computer device, for example, the video processing apparatus 1 is an application software; the video processing apparatus 1 may be configured to perform corresponding steps in the method provided by the embodiment of the present application.
The video processing apparatus 1 may include: a first obtaining module 11, a recognition module 12, a second obtaining module 13, a fusion module 14 and a determination module 15.
A first obtaining module 11, configured to obtain a target video frame sequence, where the target video frame sequence includes a target key frame, where the target key frame includes a target object, the target video frame sequence is any one of N video frame sequences included in a target video, and N is a positive integer greater than 1;
the identification module 12 is configured to identify target position information of the target object in the target key frame, extract sequence features of the target video frame sequence, and extract short-time interest region features of the target video frame sequence from the sequence features according to the target position information;
a second obtaining module 13, configured to obtain a feature set of a short-time interest region of K associated video frame sequences, where the K associated video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, and K is a positive integer;
a fusion module 14, configured to fuse the short-time interest region feature set into a long-time interest region feature;
a determining module 15, configured to determine a behavior category of the target object in the target key frame according to the long-time interest region feature and the short-time interest region feature of the target video frame sequence.
In a possible implementation, the video processing apparatus 1 may further include: a combining module 16.
The combining module 16 is configured to combine the target position information of the target object in the target key frame and the behavior category of the target object in the target key frame into the recognition result of the target video frame sequence, and output the recognition result of the target video frame sequence.
In a possible implementation, the determining module 15, when configured to determine the behavior category of the target object in the target key frame according to the long-time interest region feature and the short-time interest region feature of the target video frame sequence, is specifically configured to:
performing recognition processing on the short-time interest region feature of the target video frame sequence to obtain first behavior prediction information of the target object in the target key frame;
performing recognition processing on the long-time interest region feature to obtain second behavior prediction information of the target object in the target key frame;
and determining the behavior category of the target object in the target key frame according to the first behavior prediction information and the second behavior prediction information.
In one possible embodiment, the first behavior prediction information includes first matching probabilities for M behavior categories, the second behavior prediction information includes second matching probabilities for the M behavior categories, and M is a positive integer;
when the determining module 15 is configured to determine the behavior category of the target object in the target key frame according to the first behavior prediction information and the second behavior prediction information, specifically, to:
superposing the M first matching probabilities and the M second matching probabilities into M target matching probabilities;
and determining the behavior category of the target object in the target key frame according to the M target matching probabilities and the behavior category corresponding to each target matching probability.
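For illustration, the following Python sketch shows one way the probability-level superposition described above could be realized: the M first matching probabilities and the M second matching probabilities are averaged into M target matching probabilities, and the behavior categories whose target matching probability passes a threshold are returned. The equal-weight averaging, the threshold value and the function name are assumptions made for the sketch, not requirements of this embodiment.

```python
import numpy as np

def fuse_and_classify(first_probs, second_probs, class_names, threshold=0.5):
    """Superpose short-time and long-time matching probabilities and return
    the behavior categories whose fused probability passes the threshold
    (multiple behavior labels per target object are allowed)."""
    first_probs = np.asarray(first_probs, dtype=np.float32)
    second_probs = np.asarray(second_probs, dtype=np.float32)
    # Superpose the M first and M second matching probabilities into M target
    # matching probabilities (assumption: simple averaging; a learned or
    # tuned weighting could be used instead).
    target_probs = (first_probs + second_probs) / 2.0
    return [(name, float(p)) for name, p in zip(class_names, target_probs)
            if p >= threshold]

# Usage: three behavior categories, probabilities from the two branches.
print(fuse_and_classify([0.9, 0.2, 0.7], [0.8, 0.1, 0.4],
                        ["sit", "write", "speak with a person"]))
```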
In a possible embodiment, the identification module 12, when configured to extract the sequence features of the target video frame sequence, is specifically configured to:
calling a 3D feature extraction model to extract sequence features of the target video frame sequence;
the sequence features of the target video frame sequence comprise a plurality of feature maps with the same size, and the size of each feature map and the size of the target key frame satisfy a preset proportional relation.
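As a sketch of this step, the snippet below uses an off-the-shelf R3D-18 backbone from torchvision as a stand-in for the 3D feature extraction model; the embodiment does not prescribe a particular network, only that the output is a set of equally sized feature maps whose size is a fixed proportion of the key-frame size.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Assumption: an R3D-18 backbone (classification head removed) stands in for
# the 3D feature extraction model.
backbone = nn.Sequential(*list(r3d_18().children())[:-2])
backbone.eval()

# A target video frame sequence: batch of 1, 3 channels, 8 frames, 224x224.
clip = torch.randn(1, 3, 8, 224, 224)
with torch.no_grad():
    sequence_features = backbone(clip)        # (1, 512, T', H', W')
print(sequence_features.shape)                # torch.Size([1, 512, 1, 14, 14])
# For this particular backbone the preset proportional relation between each
# feature map and the 224x224 key frame is 1/16 (224 -> 14).
```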
In a possible embodiment, the identification module 12, when configured to extract the short-time interest region feature of the target video frame sequence from the sequence features according to the target position information, is specifically configured to:
scaling the target position information according to the preset proportional relation to obtain adjusted target position information, and extracting a plurality of unit feature maps corresponding to the adjusted target position information from the plurality of feature maps;
and respectively aligning each unit feature map to obtain a plurality of aligned unit feature maps, and combining the plurality of aligned unit feature maps into the short-time interest region feature of the target video frame sequence.
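The following sketch illustrates scaling the target position by the preset proportional relation and aligning the corresponding region on each unit feature map, using torchvision's roi_align as the alignment operation; the 1/16 ratio, the 7x7 output size and the example box coordinates are assumptions.

```python
import torch
from torchvision.ops import roi_align

def extract_short_time_roi(sequence_features, box, spatial_scale=1.0 / 16,
                           output_size=(7, 7)):
    """sequence_features: (1, C, T, H, W) feature maps of one video frame
    sequence; box: (x1, y1, x2, y2) of the target object in the key frame,
    in input-image coordinates.  Returns a (T, C, 7, 7) tensor obtained by
    aligning the scaled region on every unit feature map and combining the
    aligned unit feature maps."""
    _, c, t, h, w = sequence_features.shape
    rois = torch.tensor([[0.0, *box]])           # batch index 0 + box
    aligned = []
    for i in range(t):
        fmap = sequence_features[:, :, i]        # (1, C, H, W) unit feature map
        # roi_align multiplies the box by spatial_scale (the preset
        # proportional relation) and samples a fixed-size aligned feature.
        aligned.append(roi_align(fmap, rois, output_size,
                                 spatial_scale=spatial_scale, aligned=True))
    return torch.cat(aligned, dim=0)             # (T, C, 7, 7)

# Usage with assumed feature-map and box shapes.
feats = torch.randn(1, 512, 4, 14, 14)
roi_feat = extract_short_time_roi(feats, box=(40.0, 30.0, 120.0, 200.0))
print(roi_feat.shape)                            # torch.Size([4, 512, 7, 7])
```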
In one possible embodiment, each associated video frame sequence comprises an associated key frame, and each associated key frame comprises the target object;
the second obtaining module 13, when configured to obtain the short-time interest region feature sets of the K associated video frame sequences, is specifically configured to:
identifying the associated position information of the target object in each associated key frame, calling the 3D feature extraction model, and extracting the associated sequence features of each associated video frame sequence;
extracting the short-time interest region feature to be fused of each associated video frame sequence from each associated sequence feature according to the associated position information of each associated key frame;
and combining the short-time interest region features to be fused of the K associated video frame sequences into the short-time interest region feature set.
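A minimal sketch of assembling the short-time interest region feature set from the sequences adjacent to the target sequence is given below; the FrameSequence record, the window size and the roi_feature_fn callable (standing in for detection, 3D feature extraction and ROI alignment) are hypothetical.

```python
import torch
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FrameSequence:
    frames: torch.Tensor      # (3, T, H, W) clip
    key_frame: torch.Tensor   # (3, H, W) key frame

def gather_associated_roi_features(sequences: List[FrameSequence],
                                   target_index: int,
                                   window: int,
                                   roi_feature_fn: Callable) -> List[torch.Tensor]:
    """Collect the to-be-fused short-time interest region features of the
    sequences adjacent to the target sequence; the target sequence itself is
    excluded, matching the long-time window described earlier."""
    feature_set = []
    for offset in range(-window, window + 1):
        idx = target_index + offset
        if offset == 0 or not 0 <= idx < len(sequences):
            continue                 # skip the target sequence / out-of-range indices
        feature_set.append(roi_feature_fn(sequences[idx]))
    return feature_set

# Usage with a stub feature function (real code would run detection, the 3D
# backbone and ROI alignment as in the earlier sketches).
seqs = [FrameSequence(torch.randn(3, 8, 224, 224), torch.randn(3, 224, 224))
        for _ in range(5)]
feats = gather_associated_roi_features(seqs, target_index=2, window=1,
                                       roi_feature_fn=lambda s: torch.randn(8, 512, 7, 7))
print(len(feats))   # K = 2 associated sequences
```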
In a possible implementation manner, the short-time interest region feature set comprises P short-time interest region features to be fused, wherein P is not less than K;
the fusion module 14, when configured to fuse the short-time interest region feature set into a long-time interest region feature, is specifically configured to:
determining, based on an attention mechanism, a degree of correlation between the short-time interest region feature of the target video frame sequence and each short-time interest region feature to be fused;
and according to the correlation degree corresponding to each short-time interest region feature to be fused, performing weighted summation operation on the P short-time interest region features to be fused to obtain the long-time interest region feature.
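One possible realization of this attention-based fusion is sketched below: each feature is pooled to a vector, a scaled dot product with the target sequence's pooled feature gives the correlation degrees, and a softmax-weighted sum of the P to-be-fused features yields the long-time interest region feature. The pooling scheme and dot-product scoring are assumptions; the embodiment only requires a correlation measure followed by weighted summation.

```python
import torch
import torch.nn.functional as F

def fuse_long_time_feature(target_feat, candidate_feats):
    """target_feat: (T, C, 7, 7) short-time interest region feature of the
    target sequence; candidate_feats: list of P features of the same shape.
    Returns the long-time interest region feature as an attention-weighted
    sum of the P to-be-fused features."""
    # Pool each feature to a single C-dimensional vector for scoring
    # (assumption: global average pooling over time and space).
    query = target_feat.mean(dim=(0, 2, 3))                               # (C,)
    keys = torch.stack([f.mean(dim=(0, 2, 3)) for f in candidate_feats])  # (P, C)
    scores = keys @ query / query.numel() ** 0.5                          # (P,) correlation degrees
    weights = F.softmax(scores, dim=0)
    stacked = torch.stack(candidate_feats)                                # (P, T, C, 7, 7)
    # Weighted summation of the P short-time interest region features to be fused.
    return (weights.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)            # (T, C, 7, 7)

short_feat = torch.randn(8, 512, 7, 7)
long_feat = fuse_long_time_feature(short_feat,
                                   [torch.randn(8, 512, 7, 7) for _ in range(4)])
print(long_feat.shape)
```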
In a possible implementation, the first obtaining module 11, when configured to obtain the sequence of target video frames, is specifically configured to:
acquiring the target video, wherein the target video comprises a plurality of video frames;
selecting N key frames from the plurality of video frames, N being a positive integer;
combining each key frame and video frames adjacent to each key frame in a plurality of video frames contained in the target video into a video frame sequence respectively, wherein the number of the video frames contained in each video frame sequence is the same;
selecting a target video frame sequence from the N video frame sequences, and taking a key frame in the target video frame sequence as a target key frame.
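The sketch below illustrates this selection: key frames are sampled at a fixed stride and each is grouped with the same number of neighbouring frames into a video frame sequence; the stride and half-window values are illustrative only.

```python
def build_frame_sequences(num_frames, stride=16, half_window=4):
    """Select key frames every `stride` frames and group each with its
    neighbouring frames into a fixed-length video frame sequence.
    Returns a list of (key_frame_index, [frame indices]) pairs."""
    sequences = []
    for key in range(half_window, num_frames - half_window, stride):
        window = list(range(key - half_window, key + half_window + 1))
        sequences.append((key, window))   # every sequence has 2*half_window+1 frames
    return sequences

# Usage: a 100-frame target video yields N sequences of 9 frames each.
for key, frames in build_frame_sequences(100)[:3]:
    print(key, frames)
```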
In a possible embodiment, the video processing apparatus 1 further comprises: an adding module 17.
The adding module 17 is configured to mark, according to the target position information, the region where the target object is located in the target key frame, add the behavior category to the target key frame, and output the target key frame with the marked region of the target object and the added behavior category.
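As an illustration of this output step, the following sketch draws the marked region and the added behavior categories onto the key frame with OpenCV; the drawing library, colours and file name are assumptions, since the embodiment does not restrict how the annotated key frame is rendered or output.

```python
import cv2
import numpy as np

def annotate_key_frame(key_frame, box, behavior_labels):
    """Draw the target object's region on the key frame and add the
    recognised behavior categories as text above the box.
    key_frame: HxWx3 BGR image; box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = map(int, box)
    annotated = key_frame.copy()
    cv2.rectangle(annotated, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(annotated, ", ".join(behavior_labels), (x1, max(y1 - 8, 12)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1, cv2.LINE_AA)
    return annotated

# Usage on a dummy frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
out = annotate_key_frame(frame, (100, 80, 220, 400), ["sit", "write"])
cv2.imwrite("annotated_key_frame.jpg", out)
```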
According to an embodiment of the present application, the steps involved in the methods shown in fig. 3 to fig. 9 may be performed by the modules in the video processing apparatus shown in fig. 10. For example, steps S101-S105 shown in fig. 3 may be performed by the first obtaining module 11, the identification module 12, the second obtaining module 13, the fusion module 14, and the determining module 15 shown in fig. 10, respectively; as another example, steps S204-S206 shown in fig. 6 may be performed by the determining module 15 shown in fig. 10.
According to the method and the device, the position information and the behavior category of the target object in a video frame are identified automatically by the terminal device without manual participation, which avoids the interference of subjective factors introduced by manual analysis, improves the efficiency and accuracy of video recognition, and enriches video recognition modes. Moreover, the long-time interest region feature and the short-time interest region feature of the video frame sequence are extracted, and they respectively represent the behavior features of the target object over a long time period and a short time period, so that the feature expression of behavior categories is enriched and the accuracy of behavior category recognition is further improved.
Further, please refer to fig. 11, which is a schematic structural diagram of a computer device according to an embodiment of the present application. The server in the embodiments corresponding to fig. 3 to fig. 9 described above may be the computer device 1000. As shown in fig. 11, the computer device 1000 may include: a user interface 1002, a processor 1004, an encoder 1006, and a memory 1008. A signal receiver 1016 is used to receive or transmit data via a cellular interface 1010 or a WIFI interface 1012. The encoder 1006 encodes received data into a computer-processable data format. The memory 1008 stores a computer program, and the processor 1004 is arranged to execute the computer program to perform the steps in any one of the method embodiments described above. The memory 1008 may include volatile memory (e.g., dynamic random access memory, DRAM) and may also include non-volatile memory (e.g., one-time programmable read-only memory, OTPROM). In some instances, the memory 1008 may further include memory located remotely from the processor 1004, which may be connected to the computer device 1000 via a network. The user interface 1002 may include: a keyboard 1018 and a display 1020.
In the computer device 1000 shown in fig. 11, the processor 1004 may be configured to call the computer program stored in the memory 1008 to implement:
acquiring a target video frame sequence, wherein the target video frame sequence comprises a target key frame, the target key frame comprises a target object, the target video frame sequence is any one of N video frame sequences contained in a target video, and N is a positive integer greater than 1;
identifying target position information of the target object in the target key frame, extracting sequence characteristics of the target video frame sequence, and extracting short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information;
acquiring a short-time interest region feature set of K associated video frame sequences, wherein the K associated video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, and K is a positive integer;
fusing the short-time interest region feature set into long-time interest region features;
and determining the behavior category of the target object in the target key frame according to the long-time interest region characteristics and the short-time interest region characteristics of the target video frame sequence.
In one embodiment, the processor 1004 further performs the following steps:
combining target position information of the target object in the target key frame and behavior categories of the target object in the target key frame into a recognition result of the target video frame sequence;
and outputting the identification result of the target video frame sequence.
In one embodiment, when determining the behavior category of the target object in the target key frame according to the long-time interest region feature and the short-time interest region feature of the target video frame sequence, the processor 1004 specifically performs the following steps:
performing recognition processing on the short-time interest region feature of the target video frame sequence to obtain first behavior prediction information of the target object in the target key frame;
performing recognition processing on the long-time interest region feature to obtain second behavior prediction information of the target object in the target key frame;
and determining the behavior category of the target object in the target key frame according to the first behavior prediction information and the second behavior prediction information.
In one embodiment, the first behavior prediction information comprises first matching probabilities for M behavior categories, the second behavior prediction information comprises second matching probabilities for the M behavior categories, and M is a positive integer;
when determining the behavior category of the target object in the target key frame according to the first behavior prediction information and the second behavior prediction information, the processor 1004 specifically performs the following steps:
superposing the M first matching probabilities and the M second matching probabilities into M target matching probabilities;
and determining the behavior category of the target object in the target key frame according to the M target matching probabilities and the behavior category corresponding to each target matching probability.
In one embodiment, the processor 1004, when executing the step of extracting the sequence features of the target video frame sequence, specifically executes the following steps:
calling a 3D feature extraction model to extract sequence features of the target video frame sequence;
the sequence features of the target video frame sequence comprise a plurality of feature maps with the same size, and the size of each feature map and the size of the target key frame satisfy a preset proportional relation.
In one embodiment, the processor 1004, when executing the extracting the short-time interest region feature of the target video frame sequence from the sequence feature according to the target position information, specifically executes the following steps:
scaling the target position information according to the preset proportional relation to obtain adjusted target position information, and extracting a plurality of unit feature maps corresponding to the adjusted target position information from the plurality of feature maps;
and respectively aligning each unit feature map to obtain a plurality of aligned unit feature maps, and combining the plurality of aligned unit feature maps into the short-time interest region feature of the target video frame sequence.
In one embodiment, each associated sequence of video frames comprises an associated key frame, and each associated key frame comprises the target object;
the processor 1004, when executing the step of obtaining the set of short-time interest region features of the K sequences of associated video frames, specifically executes the following steps:
identifying the associated position information of the target object in each associated key frame, calling the 3D feature extraction model, and extracting the associated sequence features of each associated video frame sequence;
extracting the short-time interest region feature to be fused of each associated video frame sequence from each associated sequence feature according to the associated position information of each associated key frame;
and combining the short-time interest region features to be fused of the K associated video frame sequences into the short-time interest region feature set.
In one embodiment, the short-time interest region feature set comprises P short-time interest region features to be fused, where P is not less than K;
the processor 1004 specifically performs the following steps when performing fusion of the short-time interest region feature set into a long-time interest region feature:
determining, based on an attention mechanism, a degree of correlation between the short-time interest region feature of the target video frame sequence and each short-time interest region feature to be fused;
and according to the correlation degree corresponding to each short-time interest region feature to be fused, performing weighted summation operation on the P short-time interest region features to be fused to obtain the long-time interest region feature.
In one embodiment, the processor 1004, when executing the step of obtaining the sequence of target video frames, specifically performs the following steps:
acquiring the target video, wherein the target video comprises a plurality of video frames;
selecting N key frames from the plurality of video frames, N being a positive integer;
combining each key frame and video frames adjacent to each key frame in a plurality of video frames contained in the target video into a video frame sequence respectively, wherein the number of the video frames contained in each video frame sequence is the same;
selecting a target video frame sequence from the N video frame sequences, and taking a key frame in the target video frame sequence as a target key frame.
In one embodiment, the processor 1004 further performs the following steps:
according to the target position information, marking the area where the target object is located in the target key frame, and adding the behavior category in the target key frame;
and outputting the area marked with the target object and the target key frame added with the behavior category.
It should be understood that the computer device 1000 described in this embodiment of the present application may execute the video processing method described in the embodiments corresponding to fig. 3 to fig. 9, and may also implement the video processing apparatus 1 described in the embodiment corresponding to fig. 10, which is not repeated here. Likewise, the beneficial effects of the same method are not described again.
Further, it should be noted that an embodiment of the present application also provides a computer storage medium, in which the computer program executed by the aforementioned video processing apparatus 1 is stored. The computer program includes program instructions which, when executed by a processor, perform the video processing method described in the embodiments corresponding to fig. 3 to fig. 9, which is therefore not repeated here; nor are the beneficial effects of the same method described again. For technical details not disclosed in the computer storage medium embodiments of the present application, reference is made to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computer device, or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; the multiple computer devices distributed across multiple sites and interconnected by a communication network may form a blockchain network.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device can execute the method in the embodiment corresponding to fig. 3 to 9, and therefore, the detailed description thereof will not be repeated here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present application and is of course not intended to limit the scope of the claims of the present application; equivalent variations and modifications made in accordance with the present application therefore still fall within the scope covered by the present application.
Claims (13)
1. A video processing method, comprising:
acquiring a target video frame sequence, wherein the target video frame sequence comprises a target key frame, the target key frame comprises a target object, the target video frame sequence is any one of N video frame sequences contained in a target video, and N is a positive integer greater than 1;
identifying target position information of the target object in the target key frame, extracting sequence characteristics of the target video frame sequence, and extracting short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information;
acquiring a short-time interest region feature set of K associated video frame sequences, wherein the K associated video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, and K is a positive integer;
fusing the short-time interest region feature set into long-time interest region features;
and determining the behavior category of the target object in the target key frame according to the long-time interest region characteristics and the short-time interest region characteristics of the target video frame sequence.
2. The method of claim 1, further comprising:
combining target position information of the target object in the target key frame and behavior categories of the target object in the target key frame into a recognition result of the target video frame sequence;
and outputting the identification result of the target video frame sequence.
3. The method of claim 1, wherein determining the behavior category of the target object in the target key frame according to the long-time interest region feature and the short-time interest region feature of the target video frame sequence comprises:
performing recognition processing on the short-time interest region feature of the target video frame sequence to obtain first behavior prediction information of the target object in the target key frame;
performing recognition processing on the long-time interest region feature to obtain second behavior prediction information of the target object in the target key frame;
and determining the behavior category of the target object in the target key frame according to the first behavior prediction information and the second behavior prediction information.
4. The method of claim 3, wherein the first behavior prediction information comprises first matching probabilities for M behavior categories, the second behavior prediction information comprises second matching probabilities for the M behavior categories, and M is a positive integer;
the determining the behavior category of the target object in the target key frame according to the first behavior prediction information and the second behavior prediction information includes:
superposing the M first matching probabilities and the M second matching probabilities into M target matching probabilities;
and determining the behavior category of the target object in the target key frame according to the M target matching probabilities and the behavior category corresponding to each target matching probability.
5. The method of claim 1, wherein the extracting the sequence features of the sequence of target video frames comprises:
calling a 3D feature extraction model to extract sequence features of the target video frame sequence;
the sequence features of the target video frame sequence comprise a plurality of feature maps with the same size, and the size of each feature map and the size of the target key frame satisfy a preset proportional relation.
6. The method of claim 5, wherein the extracting the short-time interest region feature of the target video frame sequence from the sequence features according to the target position information comprises:
scaling the target position information according to the preset proportional relation to obtain adjusted target position information, and extracting a plurality of unit feature maps corresponding to the adjusted target position information from the plurality of feature maps;
and respectively aligning each unit feature map to obtain a plurality of aligned unit feature maps, and combining the plurality of aligned unit feature maps into the short-time interest region feature of the target video frame sequence.
7. The method of claim 5, wherein each associated sequence of video frames comprises an associated key frame, and each associated key frame comprises the target object;
the obtaining of the feature set of the short-time interest region of the K associated video frame sequences includes:
identifying the associated position information of the target object in each associated key frame, calling the 3D feature extraction model, and extracting the associated sequence features of each associated video frame sequence;
extracting the short-time interest region feature to be fused of each associated video frame sequence from each associated sequence feature according to the associated position information of each associated key frame;
and combining the short-time interest region features to be fused of the K associated video frame sequences into the short-time interest region feature set.
8. The method according to claim 1, wherein the short-time region of interest feature set comprises P short-time region of interest features to be fused, P being not less than K;
the fusing the short-time interest region feature set into a long-time interest region feature comprises:
determining, based on an attention mechanism, a degree of correlation between the short-time interest region feature of the target video frame sequence and each short-time interest region feature to be fused;
and according to the correlation degree corresponding to each short-time interest region feature to be fused, performing weighted summation operation on the P short-time interest region features to be fused to obtain the long-time interest region feature.
9. The method of claim 1, wherein the obtaining the sequence of target video frames comprises:
acquiring the target video, wherein the target video comprises a plurality of video frames;
selecting N key frames from the plurality of video frames, N being a positive integer;
combining each key frame and video frames adjacent to each key frame in a plurality of video frames contained in the target video into a video frame sequence respectively, wherein the number of the video frames contained in each video frame sequence is the same;
selecting a target video frame sequence from the N video frame sequences, and taking a key frame in the target video frame sequence as a target key frame.
10. The method according to any one of claims 1-9, further comprising:
according to the target position information, marking the area where the target object is located in the target key frame, and adding the behavior category in the target key frame;
and outputting the area marked with the target object and the target key frame added with the behavior category.
11. A video processing apparatus, comprising:
a first obtaining module, configured to obtain a target video frame sequence, where the target video frame sequence includes a target key frame, where the target key frame includes a target object, the target video frame sequence is any one of N video frame sequences included in a target video, and N is a positive integer greater than 1;
the identification module is used for identifying target position information of the target object in the target key frame, extracting sequence characteristics of the target video frame sequence, and extracting short-time interest region characteristics of the target video frame sequence from the sequence characteristics according to the target position information;
a second obtaining module, configured to obtain a feature set of a short-time interest region of K associated video frame sequences, where the K associated video frame sequences are video frame sequences adjacent to the target video frame sequence in the N video frame sequences, and K is a positive integer;
the fusion module is used for fusing the short-time interest region feature set into a long-time interest region feature;
and the determining module is used for determining the behavior category of the target object in the target key frame according to the long-time interest region characteristics and the short-time interest region characteristics of the target video frame sequence.
12. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1-10.
13. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause a computer device having the processor to perform the method of any one of claims 1-10.