CN115578668A - Target behavior recognition method, electronic device, and storage medium - Google Patents
- Publication number
- CN115578668A
- Authority
- CN
- China
- Prior art keywords
- target
- detection result
- detection
- video frame
- video
- Prior art date
- 2022-09-15
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V2201/07—Target detection
Abstract
The application discloses a target behavior recognition method, an electronic device, and a storage medium. The target behavior recognition method comprises the following steps: acquiring a video to be recognized; performing target detection on the video to be recognized to obtain a target detection result corresponding to each video frame; performing target form detection, target attribute detection, and target behavior detection on each video frame based on the target detection result to obtain a form detection result, an attribute detection result, and a behavior detection result corresponding to each video frame; and obtaining a target behavior recognition result corresponding to the video to be recognized based on the form detection result, the attribute detection result, and the behavior detection result. In this way, the behavior of a target can be recognized accurately.
Description
Technical Field
The present application relates to the field of target recognition technologies, and in particular, to a target behavior recognition method, an electronic device, and a storage medium.
Background
Against the background of artificial intelligence, target behavior recognition has changed markedly in data acquisition scale, sample data form, behavior analysis methods, and other respects, becoming progressively automated, informatized, and intelligent.
However, existing target behavior recognition methods still cannot accurately recognize the behavior of a target.
Disclosure of Invention
The present application mainly addresses the technical problem of providing a target behavior recognition method, an electronic device, and a storage medium that can accurately recognize the behavior of a target.
In order to solve the above technical problem, the present application adopts a technical solution of providing a target behavior recognition method, comprising: acquiring a video to be recognized, wherein the video to be recognized comprises continuous video frames; performing target detection on the video to be recognized to obtain a target detection result corresponding to each video frame; performing target form detection, target attribute detection, and target behavior detection on each video frame based on the target detection result to obtain a form detection result, an attribute detection result, and a behavior detection result corresponding to each video frame; and obtaining a target behavior recognition result corresponding to the video to be recognized based on the form detection result, the attribute detection result, and the behavior detection result.
Wherein performing target form detection, target attribute detection, and target behavior detection on each video frame based on the target detection result includes: performing target tracking on each video frame based on the target detection result to obtain a target tracking result corresponding to each video frame; performing key point detection on the video frames to obtain a key point detection result corresponding to each video frame; performing target form detection on the video frames based on the key point detection result and the target detection result to obtain a form detection result corresponding to each video frame; performing target attribute detection on the video frames based on the target tracking result and the key point detection result to obtain an attribute detection result corresponding to each video frame; and performing target behavior detection on all video frames based on the target tracking result and the target detection result to obtain a behavior detection result corresponding to each video frame.
The target tracking result includes track information, and target tracking is performed on each video frame based on the target detection result to obtain a target tracking result corresponding to each video frame, including: determining a target video frame from all video frames based on the target detection result; wherein, the target video frame at least comprises a target object; and forming track information of the target object based on the target object in the target video frame and the target objects in the rest video frames.
The method for detecting the target morphology of the video frames based on the key point detection result and the target detection result to obtain the morphology detection result corresponding to each video frame includes: determining a target video frame with a target object based on a target detection result; connecting key points in the key point detection result corresponding to each target video frame according to a preset mode to form a key point image of a target object; and carrying out target form detection on the key point image to obtain a form detection result corresponding to each target video frame.
The target tracking result includes track information, and performing target attribute detection on the video frames based on the target tracking result and the key point detection result to obtain an attribute detection result corresponding to each video frame includes: determining a target video frame having a target object based on the target tracking result; and performing target attribute detection on the target video frames based on the track information corresponding to each target video frame and the key points in the key point detection result to determine the attribute detection result of the target object, wherein the attribute detection result comprises at least one of a backpack, a hat, and a water bottle.
The method for detecting the target behaviors of all video frames based on the target tracking result and the target detection result to obtain the behavior detection result corresponding to each video frame includes the following steps: performing event analysis on all video frames based on target tracking results and target detection results to obtain target events corresponding to each video frame; and classifying the target events to obtain a behavior detection result.
The target tracking result comprises track information, event analysis is carried out on all video frames based on the target tracking result and the target detection result, and a target event corresponding to each video frame is obtained, and the method comprises the following steps: and if the track information of the target object in the video to be identified is abnormal in two adjacent video frames, taking the target event corresponding to the target object as a key event.
Before performing key point detection on video frames and obtaining a key point detection result corresponding to each video frame, the method includes: screening all video frames based on the target detection result and the target tracking result, and screening out the video frames meeting the preset conditions; performing key point detection on the video frames to obtain a key point detection result corresponding to each video frame, wherein the key point detection result comprises the following steps: and performing key point detection on the video frames meeting the preset conditions to obtain a key point detection result corresponding to each video frame.
The target detection result comprises at least one of head information, shoulder information, upper body information, front information, side information and back information of a target object in each video frame, and the preset condition is that the score of the head information, the shoulder information, the upper body information, the front information, the side information or the back information is larger than the preset score.
Wherein, after the video frames satisfying the preset condition are screened out, the method includes: selecting video frames satisfying the preset condition according to a preset proportion to obtain selected video frames; and performing key point detection on the video frames to obtain a key point detection result corresponding to each video frame includes: performing key point detection on the selected video frames to obtain a key point detection result corresponding to each video frame.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided an electronic device comprising a memory for storing program data and a processor for executing the program data to implement the target behaviour recognition method as described above.
In order to solve the technical problem, the other technical scheme adopted by the application is as follows: there is provided a computer-readable storage medium storing program data for implementing the target behavior recognition method as described above when executed by a processor.
The beneficial effects of the present application are as follows: unlike the prior art, the present application performs target form detection, target attribute detection, and target behavior detection on the video frames after target detection, and obtains the target behavior recognition result corresponding to the video to be recognized from the form detection result, the attribute detection result, and the behavior detection result. Adding detection dimensions such as target form detection and target attribute detection can improve the accuracy of target behavior recognition.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
fig. 1 is a schematic flow chart of a first embodiment of a target behavior identification method provided in the present application;
FIG. 2 is a schematic diagram of a structure of a keypoint image provided by the present application;
FIG. 3 is a schematic flowchart of a complete embodiment of a target behavior recognition method provided in the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an electronic device provided in the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, the target behavior recognition method described in the present application may include: Step 100: Acquiring a video to be recognized. Step 200: Performing target detection on the video to be recognized to obtain a target detection result corresponding to each video frame. Step 300: Performing target form detection, target attribute detection, and target behavior detection on each video frame based on the target detection result to obtain a form detection result, an attribute detection result, and a behavior detection result corresponding to each video frame. Step 400: Obtaining a target behavior recognition result corresponding to the video to be recognized based on the form detection result, the attribute detection result, and the behavior detection result.
That is to say, the present application performs target detection on the video to be recognized to obtain the target detection result corresponding to each video frame, then performs the more comprehensive target form detection, target attribute detection, and target behavior detection on the detected video frames, and obtains the target behavior recognition result corresponding to the video to be recognized from the form, attribute, and behavior detection results. Adding detection dimensions such as target form detection and target attribute detection improves the accuracy of target behavior recognition.
The following describes in detail a first embodiment of the target behavior recognition method of the present application.
Step 100: and acquiring a video to be identified.
The video to be identified comprises continuous video frames.
In some embodiments, the video to be recognized may be captured by a monocular camera or a binocular camera.
Step 200: and carrying out target detection on the video to be identified to obtain a target detection result corresponding to each video frame.
Optionally, a target detection algorithm may be used to detect the video to be recognized. The target detection algorithm may be based on a deep learning model, and specifically which type is adopted, which is not limited herein.
The object detection is to detect all objects of interest in the image, for example, the objects of interest may be human, animals, or other living things. The category of the object and its position in the image or in world coordinates can then be determined.
In some embodiments, object detection may be to detect the size of an object or various different shapes of an object in addition to determining the class and location of the object.
For example, assuming that the detected target object is a human, the obtained target detection result may include information of a head, a shoulder, an upper half, a front, a side, a back, and the like of the human.
Optionally, to facilitate subsequent operations that use the target detection result, the obtained target detection results may first be recorded and saved as a detection data set Ω_odj = {odj_1, odj_2, …, odj_n}.
Each data element in the detection data set represents the target detection result corresponding to one video frame; for example, odj_n represents the target detection result corresponding to the nth video frame.
In some embodiments, there may be multiple targets in the video to be recognized. Therefore, in step 200, the positions of the multiple targets may be determined by performing target detection on the video to be recognized; for example, the resulting target detection result may include the positional relationship of each target, such as a human body and an article.
In some embodiments, an edge detection algorithm may be employed to determine the positions of objects in the video to be recognized.
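To make step 200 and the Ω_odj bookkeeping concrete, the following is a minimal sketch in Python that runs a detector over the video frames and records one detection result per frame. The `Detection` fields and the `detect_targets` stub are illustrative assumptions; the patent does not prescribe a particular detector or data layout.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected target in a frame: class label, confidence, bounding box."""
    label: str                       # e.g. "person"
    score: float                     # detector confidence in [0, 1]
    box: tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates

def detect_targets(frame) -> list[Detection]:
    """Stand-in for any deep-learning detector; the patent does not fix one."""
    raise NotImplementedError

def build_detection_dataset(frames) -> list[list[Detection]]:
    """Record one target detection result per video frame (the set Ω_odj)."""
    return [detect_targets(frame) for frame in frames]   # element n is odj_n
```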
Step 300: and respectively carrying out target form detection, target attribute detection and target behavior detection on each video frame based on the target detection result to obtain a form detection result, an attribute detection result and a behavior detection result corresponding to each video frame.
Alternatively, a multitask model can be adopted to perform target morphology detection, target attribute detection and target behavior detection on each video frame respectively.
For example, assuming the detected target object is a person, target form detection may detect the form of the human body. The form may be the limb motion information of the body, understood as a static representation of motion: for example, whether the human body in the video to be recognized is holding its head with both hands, raising a hand, waving, pointing, falling down, sitting, or lying prone.
Similarly, if the detected target object is a person, target attribute detection may detect whether the person in the video to be recognized is carrying a bag or wearing a hat; or detect the person's posture, for example whether the arms are crossed over the chest at the front or the back is hunched; or detect the person's clothing, for example the color of the jacket, the style of the trousers, or the hairstyle. The person's gender may also be inferred from clothing, hairstyle, posture, and the like.
Generally, the behavior of a target can be roughly judged from its form, but the form only represents the state each video frame presents independently, so the behavior of the target cannot be determined accurately from it alone. Therefore, the target's actual behavior needs to be detected from the target's track information or from the correlations between different targets to obtain a behavior detection result.
Taking the human form as an example, the form represents the external state the body currently presents and does not necessarily reveal the underlying behavior. For example, target form detection may show the state information that a person is lying prone; target behavior detection, because it works from the target's track information or the correlations between different targets, can also detect the states before and after the person lay prone. If the person was standing both before and after, it can be inferred that the person probably fell down and then stood up again.
Step 400: and obtaining a target behavior identification result corresponding to the video to be identified based on the form detection result, the attribute detection result and the behavior detection result.
It should be noted that the target behavior recognition result may differ from the behavior detection result mentioned in the previous step: the target behavior recognition result may also include the human body form and human body attributes. For example, if the detected behavior is that the target falls and then stands up, and the attribute result is a backpack, the target behavior recognition result corresponding to the video to be recognized may be that a woman carrying a backpack fell down and then stood up.
The target behavior recognition result can also be regarded as a conscious activity comprising a behavior subject, a behavior object, a behavior environment, and a behavior means. For example, in the scene of students in class, the target behavior recognition result may be that a student is lying on a desk, listening attentively, or playing with a mobile phone.
Taking playing with a mobile phone as an example, the student is the behavior subject, the mobile phone is the behavior object, playing is the means by which the subject (the student) acts on the object (the mobile phone), and the classroom can be the objective environment in which the student plays with the phone.
However, due to factors such as the spatio-temporal information of the surrounding environment, target detection alone cannot accurately determine the position of the target. Therefore, to determine the target's position more efficiently, some embodiments further track the target on the basis of target detection and predict the target's track information.
The specific method can comprise the following steps:
step 1: performing target tracking on each video frame based on the target detection result to obtain a target tracking result corresponding to each video frame;
for example, the target detection result of a certain video frame may be that one person stands on the playground, and after the target tracking is performed on the person, the target tracking result may be that two persons play a ball on the playground.
Optionally, the target tracking result includes track information, amplitude information, correlation information between a plurality of targets, and the like.
Similarly, to facilitate subsequent operations that use the target tracking result, the obtained target tracking results may first be recorded and saved as a tracking data set Ω_otj = {otj_1, otj_2, …, otj_n}.
Each data element in the tracking data set represents the target tracking result corresponding to one video frame; for example, otj_n represents the target tracking result corresponding to the nth video frame.
In some embodiments, when the target tracking result includes track information, performing target tracking on each video frame based on the target detection result to obtain a target tracking result corresponding to each video frame, which may include the following sub-steps:
step 11: and determining a target video frame from all the video frames based on the target detection result.
The target video frame at least comprises a target object.
Step 12: and forming track information of the target object based on the target object in the target video frame and the target objects in the rest video frames.
For example, the target object may be at different positions in different video frames, or the relationship between multiple target objects in different video frames.
Specifically, suppose target detection detects two vehicles in a certain video frame, say vehicle A and vehicle B; target tracking can then determine the respective track information of the two vehicles from their positional relationship across the remaining video frames, accurately distinguishing which vehicle is A and which is B.
By tracking the target, the inter-frame information between target video frames, the environmental information around the target, and the like can be fully utilized to obtain the target's track information, so the target can be recognized more efficiently and accurately.
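The patent does not fix a particular tracking algorithm. As one hedged sketch of how track information can be formed, detections may be associated across frames greedily by bounding-box overlap (IoU); the threshold and the greedy matching below are illustrative assumptions, not the patent's method.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def update_tracks(tracks, boxes, iou_thresh=0.3):
    """Extend each track (track id -> list of boxes) with the best-matching
    detection from the current frame; leftovers start new tracks."""
    unmatched = list(boxes)
    for traj in tracks.values():
        if not unmatched:
            break
        best = max(unmatched, key=lambda b: iou(traj[-1], b))
        if iou(traj[-1], best) >= iou_thresh:
            traj.append(best)             # the growing list is the track info
            unmatched.remove(best)
    for box in unmatched:
        tracks[max(tracks, default=0) + 1] = [box]   # new target object
    return tracks
```

Calling `update_tracks` once per frame yields, per target, the track information used by the attribute and behavior detection steps below.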
Because the video to be recognized comprises a number of continuous video frames, even after target detection and target tracking, the number of video frames containing the target object is still large, and the subsequent target form detection, target attribute detection, and target behavior detection are performed on each video frame. Therefore, to improve detection speed, some embodiments preliminarily screen the video frames containing the target object after target detection and target tracking. The specific screening process may be to screen all video frames based on the target detection result and the target tracking result, screening out the video frames that satisfy a preset condition. The preset condition may be determined from the target detection result and the target tracking result. For example, all video frames may be filtered according to whether target detection results and target tracking results exist: if the target detection result corresponding to a video frame is that there is no target, the video frame is determined not to satisfy the preset condition; if the target detection result corresponding to a video frame is that there is a target, the video frame is considered to satisfy the preset condition.
In some embodiments, the target detection result may include at least one of head information, shoulder information, upper body information, front information, side information, and back information of the target object in each video frame.
The preset condition may be that a score of the head information, the shoulder information, the upper body information, the front information, the side information, or the back information is greater than a preset score.
For example, a human body preference algorithm may be adopted to score the human body parts in the target detection result according to conditions such as clarity, occlusion extent, posture, and angle, or to score the track information in the target tracking result according to track completeness and track clarity, and frames are then preferred according to the obtained scores to obtain a target preference result. The preferred video frames are the video frames satisfying the preset condition.
Optionally, for the convenience of subsequent operations that use the target preference result, the obtained target preference results may first be recorded and saved as a preference data set Ω_qej = {qej_1, qej_2, …, qej_n}.
Each data element in the preference data set represents the target preference result corresponding to one video frame; for example, qej_n represents the target preference result corresponding to the nth video frame.
In some embodiments, the scoring mechanism may be a network trained in advance on a standard comparison library or on weights for the information of each body part, with scoring then performed by the trained network.
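A minimal sketch of the screening described above: keep a frame whenever one of its scored part observations exceeds the preset score. The per-part scores are assumed to come from whichever quality network is used, and the 0.5 threshold is an illustrative assumption.

```python
def filter_frames(part_scores_per_frame, preset_score=0.5):
    """Screen video frames by the preset condition: some body-part score
    (head, shoulder, upper body, front, side, or back) beats preset_score.

    part_scores_per_frame: one dict per frame, e.g.
        {"head": 0.9, "shoulder": 0.7, "upper_body": 0.4}
    Returns the indices of frames that satisfy the preset condition.
    """
    return [i for i, scores in enumerate(part_scores_per_frame)
            if scores and max(scores.values()) > preset_score]
```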
In addition, when the video to be recognized contains a very large number of video frames, many video frames may still remain even after the preliminary screening of the video frames containing the target object that follows target detection and target tracking. Therefore, in some embodiments, after the video frames satisfying the preset condition are screened out, a further selection may be performed, specifically:
and selecting the video frames meeting the preset conditions according to a preset proportion to obtain the selected video frames.
For example, the target tracking result and the target preference result may be merged and then analyzed and selected with a frame selection algorithm to obtain a target selection result. The target preference result comprises the video frames satisfying the preset condition, and the target selection result may be obtained by further selecting from the target preference result according to a preset ratio.
Optionally, to facilitate subsequent operations that use the target selection result, the obtained target selection results may be recorded and saved as a selection data set Ω_spi = {spi_1, spi_2, …, spi_n}.
Each data element in the selection data set represents the target selection result corresponding to one video frame; for example, spi_n represents the target selection result corresponding to the nth video frame.
After each video frame has undergone the corresponding target detection, target tracking, and target preference, every preferred video frame in effect has a corresponding target detection result and target tracking result.
The preset ratio may be 1:100 or 1:1000, and frame-rate-based selection may also be adopted: for example, selecting according to a ratio n/m, where n and m are natural numbers greater than 1, n is less than m, and m represents the frame rate, such as 30, 60, 90, or 120. The specific proportional relationship is not limited herein.
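As sketched below, the ratio-based selection can be as simple as keeping n frames out of every group of m; the evenly spaced picks are an assumption, since the patent leaves the selection rule open.

```python
def select_by_ratio(frame_indices, n=1, m=30):
    """Pick n of every m frames, e.g. 1 frame per second of 30 fps video."""
    assert 0 < n < m
    picked = []
    for start in range(0, len(frame_indices), m):
        group = frame_indices[start:start + m]
        step = max(1, len(group) // n)
        picked.extend(group[::step][:n])   # evenly spaced picks in the group
    return picked
```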
In addition, to better recognize the form and attributes of the target, key point analysis may be performed on the detected and tracked target; that is, after step 1 (performing target tracking on each video frame based on the target detection result to obtain the target tracking result corresponding to each video frame), the operation of step 2 is performed.
In step 2, the video frames may be subjected to keypoint detection, so as to obtain a keypoint detection result corresponding to each video frame.
Wherein, the key points can be extracted by a top-down or bottom-up method.
Assuming that the detection target is a human, the detected key points may be various parts and joints of the human body, such as a nose, a right eye, a left eye, a right ear, a left ear, a right shoulder, a left shoulder, a right elbow, a left elbow, a right wrist, a left wrist, a right knee, a left knee, a right ankle, a left ankle, a neck, and the like.
In some embodiments, the key point detection may also be performed on the video frames that satisfy the preset condition, so as to obtain a key point detection result corresponding to each video frame.
Or, in some embodiments, the keypoint detection may be performed on the selected video frames to obtain a keypoint detection result corresponding to each video frame.
Illustratively, key point algorithm analysis may be performed according to the target detection result, the target tracking result, and the target selection result to obtain the key point detection results; for convenience of subsequent operations, the obtained key point detection results may be recorded and saved as a key point data set Ω_kpi = {kpi_1, kpi_2, …, kpi_n}.
Each data element in the key point data set represents the key point detection result corresponding to one video frame; for example, kpi_n represents the key point detection result corresponding to the nth video frame.
After the above-mentioned key point detection of the target, the form detection, the attribute detection and the behavior detection of the target can be performed, and the specific detection method can refer to the contents of the following step 3, step 4 and step 5.
For example, step 3 may be to perform target morphology detection on the video frames based on the key point detection result and the target detection result to obtain a morphology detection result corresponding to each video frame.
For example, a preliminary morphological image may be obtained by connecting all the key points, and then the preliminary morphological image is compared with a preset human morphology, and a final target morphology is determined by combining a target detection result.
Optionally, for convenience of subsequent operations that use the form detection result, the obtained form detection results may first be recorded and saved as a form data set Ω_bai = {bai_1, bai_2, …, bai_n}.
Each data element in the form data set represents the form detection result corresponding to one video frame; for example, bai_n represents the form detection result corresponding to the nth video frame.
Since key point detection can be implemented with a trained network model, for example a deep learning model, information corresponding to each key point can be obtained once the key points are detected; for example, key point 0 may indicate that the corresponding part is the nose. Therefore, in some embodiments, all the key points are connected in a preset manner to obtain the target form. Specifically, this may be as follows:
1) Determining a target video frame having a target object based on the target detection result.
2) Connecting the key points in the key point detection result corresponding to each target video frame in a preset manner to form a key point image of the target object.
The preset manner may be a combination according to the characteristics of the human body structure. For example, assume the detected key point results are as follows: "0" to "13" respectively correspond to the nose, right eye, left eye, right ear, left ear, right shoulder, left shoulder, right elbow, left elbow, right wrist, left wrist, right knee, left knee, and neck of the human body. Connecting the numbers corresponding to the respective parts or joints according to the structure of the human body yields the key point image shown in fig. 2.
3) Performing target form detection on the key point image to obtain a form detection result corresponding to each target video frame.
For example, performing target form detection on the key point image shown in fig. 2 and comparing it with preset forms can yield the hand-raising form corresponding to fig. 2.
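Following the "0" to "13" mapping above, the sketch below connects key points in one possible preset manner and reads off a toy form. The edge list and the raised-hand rule are illustrative assumptions (image y coordinates grow downward, and undetected points are None).

```python
# Assumed mapping: 0 nose, 1 right eye, 2 left eye, 3 right ear, 4 left ear,
# 5 right shoulder, 6 left shoulder, 7 right elbow, 8 left elbow,
# 9 right wrist, 10 left wrist, 11 right knee, 12 left knee, 13 neck.
SKELETON_EDGES = [
    (0, 13), (13, 5), (13, 6),          # nose-neck, neck-shoulders
    (5, 7), (7, 9), (6, 8), (8, 10),    # arms: shoulder-elbow-wrist
    (13, 11), (13, 12),                 # torso to knees (simplified)
]

def keypoint_image_edges(kps):
    """kps: list of (x, y) points indexed 0..13, None if undetected.
    Returns the drawable segments of the key point image."""
    return [(kps[a], kps[b]) for a, b in SKELETON_EDGES
            if kps[a] is not None and kps[b] is not None]

def detect_form(kps):
    """Toy form rule: a wrist above its shoulder (smaller y) reads as a
    raised hand; assumes the relevant key points were detected."""
    right = kps[9] is not None and kps[5] is not None and kps[9][1] < kps[5][1]
    left = kps[10] is not None and kps[6] is not None and kps[10][1] < kps[6][1]
    return "hand_raised" if (right or left) else "other"
```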
Step 4: Performing target attribute detection on the video frames based on the target tracking result and the key point detection result to obtain an attribute detection result corresponding to each video frame.
Target attributes such as a backpack, a hat, or glasses can be judged from the target tracking result and the positions of the key points. For example, if target tracking of a video frame finds that a backpack in the frame has no track of its own and is still, and the backpack is located at the position corresponding to a human body key point such as the arm, it can be determined that the target human body is carrying the backpack.
Optionally, for convenience of subsequent operations that use the attribute detection result, the obtained attribute detection results may first be recorded and saved as an attribute data set Ω_ped = {ped_1, ped_2, …, ped_n}.
Each data element in the attribute data set represents the attribute detection result corresponding to one video frame; for example, ped_n represents the attribute detection result corresponding to the nth video frame.
Optionally, in some embodiments, the target attribute detection may be performed on the target video frame according to the track information and the key point information of the target, specifically as follows:
(1) Determining a target video frame having a target object based on the target tracking result.
(2) Performing target attribute detection on the target video frames based on the track information corresponding to each target video frame and the key points in the key point detection result, to determine the attribute detection result of the target object, where the attribute detection result comprises at least one of a backpack, a hat, and a water bottle.
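As a hedged sketch of the backpack example above: an object box that shows no movement of its own across the frame span and sits at an arm key point is attributed to the tracked person. The pixel margin and the stillness test are assumptions.

```python
def near(point, box, margin=20):
    """True if a key point lies inside the box expanded by `margin` pixels."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 - margin <= x <= x2 + margin and y1 - margin <= y <= y2 + margin

def carries_backpack(object_box, object_track, arm_keypoints):
    """Attribute test: the object has no track of its own (it stays still)
    and is located at a human body key point such as the arm."""
    moves = len(object_track) > 1 and object_track[0] != object_track[-1]
    return (not moves) and any(near(kp, object_box) for kp in arm_keypoints)
```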
Step 5: Performing target behavior detection on all video frames based on the target tracking result and the target detection result to obtain a behavior detection result corresponding to each video frame.
Optionally, in some embodiments, event analysis may be performed on all video frames based on the target tracking result and the target detection result; after the target event corresponding to each video frame is obtained, the target events are classified to obtain the behavior detection result.
The target events may be classified by whether they are accidental or inevitable, or according to the environment; for example, a target event may be an emergency such as a natural disaster, a public health event, or a social security event.
The present application is not limited herein as to how the target event is specifically classified.
Because resources are limited, different event types receive different degrees of attention.
In some embodiments, some target events may be flagged as events of varying degrees of interest.
For example, key events may be analyzed according to the target detection result and the target tracking result to obtain a key event result. Similarly, for convenience of subsequent operations that use the key event result, the obtained key event results may first be recorded and saved as a key event data set Ω_iej = {iej_1, iej_2, …, iej_n}.
Each data element in the key event data set represents the key event result corresponding to one video frame; for example, iej_n represents the key event result corresponding to the nth video frame.
Key events may include accidental injury, illness, and fainting; for example, iej_1 may represent an accidental injury, iej_2 may represent illness, and so on.
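One hedged reading of "abnormal track information in two adjacent video frames" is a jump in box center larger than a threshold, as sketched below; the displacement metric and the 80-pixel threshold are assumptions.

```python
import math

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def key_event_frames(track, max_jump=80.0):
    """Indices where the target's track jumps abnormally between two
    adjacent frames; each such target event is marked as a key event."""
    events = []
    for i in range(1, len(track)):
        (px, py), (cx, cy) = center(track[i - 1]), center(track[i])
        if math.hypot(cx - px, cy - py) > max_jump:
            events.append(i)
    return events
```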
Similarly, to increase detection speed, in some embodiments, after target detection and target tracking are performed on the video to be recognized, or before event analysis is performed on all video frames, the obtained video frames containing the target object may be screened and selected, and the key event analysis and the behavior classification algorithm are then applied to the screened and selected results to obtain the target's behavior detection result.
For convenience of subsequent operations that use the behavior detection result, the obtained behavior detection results may be recorded and saved as a behavior data set Ω_scj = {scj_1, scj_2, …, scj_n}.
Each data element in the behavior data set represents the behavior detection result corresponding to one video frame; for example, scj_n represents the behavior detection result corresponding to the nth video frame.
For the target screening and selection involved in the behavior detection process, reference may be made to the description of the above steps; details are not repeated here.
Finally, the target behavior recognition result corresponding to the video to be recognized is obtained based on the form detection result of step 3, the attribute detection result of step 4, and the behavior detection result of step 5. Likewise, the target behavior recognition result may be represented by a recognition data set Ω_sci = {sci_1, sci_2, …, sci_n}.
For example, taking students as the detected targets, sci_1 may represent lying on a desk, sci_2 listening attentively in class, sci_3 playing with a mobile phone, sci_4 falling down, sci_5 sitting, and so on.
That is, by adding detection dimensions such as target form detection and target attribute detection, the accuracy of target behavior recognition can be improved.
With reference to the above embodiments, a more complete embodiment of the present application is described below. Referring to fig. 3, which is a schematic flow chart of the complete embodiment, the flow specifically includes the following steps (a pipeline skeleton is sketched after the list):
(1) Firstly, acquiring a video to be identified;
(2) Carrying out target detection on the acquired video to be identified to obtain a target detection result corresponding to each video frame;
(3) Tracking the video frame after target detection through a target tracking algorithm to obtain a target tracking result;
(4) Analyzing the target detection result and the target tracking result with a target preference algorithm to obtain a target preference result;
(5) Merging the target tracking result and the target preference result, and selecting with a frame selection algorithm to obtain a target selection result;
(6) Performing key point analysis on the video frame according to the target detection result and the target selection result to obtain a key point detection result;
(7) According to the target detection result and the key point detection result, performing human body shape analysis on the video frame to obtain a shape detection result;
(8) According to the target tracking result and the key point detection result, performing human body attribute analysis on the video frame to obtain an attribute detection result;
(9) Performing key event recognition on the preferred target detection result and target tracking result to obtain a key event result, and analyzing the key event result with a behavior classification algorithm to obtain a behavior detection result;
(10) Finally, performing comprehensive analysis according to the form detection result, the attribute detection result, and the behavior detection result to obtain the target behavior recognition result.
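Pulling the ten steps together, the skeleton below mirrors the flow of fig. 3. Every callable supplied through `modules` is a hypothetical placeholder for the corresponding stage (detector, tracker, preference, selection, key points, form, attribute, event, classification, and fusion), not an API defined by the patent.

```python
def recognize_target_behavior(frames, modules):
    """End-to-end sketch of the fig. 3 flow; `modules` maps stage names to
    callables for the pluggable models, all of them assumed placeholders."""
    det = [modules["detect"](f) for f in frames]              # (2) Ω_odj
    trk = modules["track"](frames, det)                       # (3) Ω_otj
    pref = modules["prefer"](det, trk)                        # (4) Ω_qej
    sel = modules["select"](trk, pref)                        # (5) Ω_spi
    kps = {i: modules["keypoints"](frames[i]) for i in sel}   # (6) Ω_kpi
    forms = modules["form"](det, kps)                         # (7) Ω_bai
    attrs = modules["attributes"](trk, kps)                   # (8) Ω_ped
    events = modules["events"](det, trk)                      # (9) key events
    behaviors = modules["classify"](events)                   # (9) Ω_scj
    return modules["fuse"](forms, attrs, behaviors)           # (10) Ω_sci
```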
Referring to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of an electronic device 130 provided in the present application, where the electronic device 130 includes a memory 131 and a processor 132, the memory 131 is used for storing program data, and the processor 132 is used for executing the program data to implement the following method:
acquiring a video to be identified; the video to be identified comprises continuous video frames; carrying out target detection on a video to be identified to obtain a target detection result corresponding to each video frame; respectively carrying out target form detection, target attribute detection and target behavior detection on each video frame based on the target detection result to obtain a form detection result, an attribute detection result and a behavior detection result corresponding to each video frame; and obtaining a target behavior identification result corresponding to the video to be identified based on the form detection result, the attribute detection result and the behavior detection result.
It is to be understood that the processor 132 is also configured to execute the program data to implement the method of any of the above embodiments.
Optionally, in an embodiment, the electronic device 130 may be a chip, a field-programmable gate array (FPGA), a single-chip microcomputer, or the like, where the chip may be a processing chip such as a CPU, GPU, or MCU, or a memory chip such as a DRAM or SRAM.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a computer-readable storage medium 140 provided in the present application, where the computer-readable storage medium 140 stores program data 141, and when the program data 141 is executed by a processor, the method is implemented as follows:
acquiring a video to be identified; the video to be identified comprises continuous video frames; carrying out target detection on a video to be identified to obtain a target detection result corresponding to each video frame; respectively carrying out target form detection, target attribute detection and target behavior detection on each video frame based on the target detection result to obtain a form detection result, an attribute detection result and a behavior detection result corresponding to each video frame; and obtaining a target behavior recognition result corresponding to the video to be recognized based on the form detection result, the attribute detection result and the behavior detection result.
It will be appreciated that program data 141, when executed by a processor, is also used to implement the method of any of the embodiments described above.
Embodiments of the present application may be implemented in software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solutions of the present application, which are essential or contributing to the prior art, or all or part of the technical solutions may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.
Claims (12)
1. A method for identifying a target behavior, the method comprising:
acquiring a video to be identified; wherein the video to be identified comprises continuous video frames;
performing target detection on the video to be identified to obtain a target detection result corresponding to each video frame;
respectively performing target form detection, target attribute detection and target behavior detection on each video frame based on the target detection result to obtain a form detection result, an attribute detection result and a behavior detection result corresponding to each video frame;
and obtaining a target behavior recognition result corresponding to the video to be recognized based on the form detection result, the attribute detection result and the behavior detection result.
2. The method of claim 1, wherein the performing the target form detection, the target attribute detection and the target behavior detection on each video frame based on the target detection result to obtain the form detection result, the attribute detection result and the behavior detection result corresponding to each video frame comprises:
performing target tracking on each video frame based on the target detection result to obtain a target tracking result corresponding to each video frame;
performing key point detection on the video frames to obtain a key point detection result corresponding to each video frame;
performing target form detection on the video frames based on the key point detection result and the target detection result to obtain a form detection result corresponding to each video frame;
performing target attribute detection on the video frames based on the target tracking result and the key point detection result to obtain an attribute detection result corresponding to each video frame;
and performing target behavior detection on all the video frames based on the target tracking result and the target detection result to obtain the behavior detection result corresponding to each video frame.
3. The method according to claim 2, wherein the target tracking result includes track information, and the performing target tracking on each video frame based on the target detection result to obtain the target tracking result corresponding to each video frame includes:
determining a target video frame from all the video frames based on the target detection result; wherein the target video frame at least comprises a target object;
forming the trajectory information of the target object based on the target object in the target video frame and the target objects in the remaining video frames.
4. The method according to claim 2, wherein the performing target form detection on the video frames based on the keypoint detection result and the target detection result to obtain the form detection result corresponding to each of the video frames comprises:
determining a target video frame with a target object based on the target detection result;
connecting key points in the key point detection result corresponding to each target video frame according to a preset mode to form a key point image of a target object;
and carrying out target form detection on the key point images to obtain a form detection result corresponding to each target video frame.
5. The method according to claim 2, wherein the target tracking result includes track information, and the performing target attribute detection on the video frames based on the target tracking result and the key point detection result to obtain the attribute detection result corresponding to each of the video frames includes:
determining a target video frame with a target object based on the target tracking result;
and performing target attribute detection on the target video frames based on the track information corresponding to each target video frame and key points in the key point detection results, and determining the attribute detection results of the target objects, wherein the attribute detection results comprise at least one of a backpack, a hat and a water bottle.
6. The method according to claim 2, wherein the performing target behavior detection on all the video frames based on the target tracking result and the target detection result to obtain the behavior detection result corresponding to each of the video frames comprises:
performing event analysis on all the video frames based on the target tracking result and the target detection result to obtain a target event corresponding to each video frame;
and classifying the target event to obtain the behavior detection result.
7. The method according to claim 6, wherein the target tracking result includes track information, and performing event analysis on all the video frames based on the target tracking result and the target detection result to obtain a target event corresponding to each video frame includes:
and if the track information of the target object in the video to be identified is abnormal in two adjacent video frames, taking the target event corresponding to the target object as a key event.
8. The method according to claim 2, wherein before the performing the keypoint detection on the video frames to obtain the keypoint detection result corresponding to each of the video frames, the method comprises:
screening all the video frames based on the target detection result and the target tracking result to screen out the video frames meeting the preset condition;
the performing the key point detection on the video frames to obtain a key point detection result corresponding to each video frame includes:
and carrying out key point detection on the video frames meeting the preset conditions to obtain a key point detection result corresponding to each video frame.
9. The method according to claim 8, wherein the target detection result comprises at least one of head information, shoulder information, upper body information, front information, side information, and back information of a target object in each video frame, and the preset condition is that a score of the head information, the shoulder information, the upper body information, the front information, the side information, or the back information is greater than a preset score.
10. The method according to claim 8, wherein after the screening out of the video frames satisfying the preset condition, the method comprises:
selecting video frames meeting preset conditions according to a preset proportion to obtain the selected video frames;
the performing the key point detection on the video frames to obtain a key point detection result corresponding to each video frame includes:
and carrying out key point detection on the selected video frames to obtain a key point detection result corresponding to each video frame.
11. An electronic device, characterized in that the electronic device comprises a memory for storing program data and a processor for executing the program data to implement the target behavior recognition method according to any of claims 1-10.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program data for implementing the target behavior recognition method according to any one of claims 1 to 10 when the program data is executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202211124689.XA | 2022-09-15 | 2022-09-15 | Target behavior recognition method, electronic device, and storage medium
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202211124689.XA | 2022-09-15 | 2022-09-15 | Target behavior recognition method, electronic device, and storage medium
Publications (1)
Publication Number | Publication Date
---|---
CN115578668A (en) | 2023-01-06
Family
ID=84580741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202211124689.XA (Pending) | Target behavior recognition method, electronic device, and storage medium | 2022-09-15 | 2022-09-15
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115578668A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115830515A (en) * | 2023-01-31 | 2023-03-21 | 中国电子科技集团公司第二十八研究所 | Video target behavior identification method based on spatial grid |
- 2022-09-15: CN application CN202211124689.XA filed; patent CN115578668A active, status Pending
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 