CN114783060A - Standing behavior identification method and device - Google Patents

Standing behavior identification method and device

Info

Publication number
CN114783060A
Authority
CN
China
Prior art keywords: target, head object, head, frame, image frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210440719.1A
Other languages
Chinese (zh)
Inventor
王高升
赵玉瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Eswin Computing Technology Co Ltd
Original Assignee
Beijing Eswin Computing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Eswin Computing Technology Co Ltd filed Critical Beijing Eswin Computing Technology Co Ltd
Priority to CN202210440719.1A
Publication of CN114783060A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The application discloses a standing behavior identification method and device, relates to the technical field of behavior recognition processing, and can accurately identify whether a human behavior is a standing-up behavior. The main technical scheme of the application is as follows: acquiring video data; performing human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame; tracking the target human head object across the image frames according to the human head frame information to obtain tracking information corresponding to the target human head object, wherein the tracking information comprises the human head frame information corresponding to the target human head object in the image frames; and inputting the tracking information into a preset standing behavior recognition model, which outputs a human behavior recognition result for the target human head object. The method and device are mainly applied to identifying whether a human behavior is a standing-up behavior.

Description

Standing behavior identification method and device
Technical Field
The present application relates to the field of behavior recognition processing technologies, and in particular, to a method and an apparatus for recognizing a standing behavior.
Background
Human behavior recognition is widely applied in video surveillance, security, video retrieval, and similar fields. Standing-up/sitting-down recognition, as one kind of human behavior recognition, studies the dynamic transitions of the human body between sitting and standing and has direct applications in scenarios such as smart classrooms and smart recording and broadcasting.
At present, in closed scenes such as smart classrooms and smart recording and broadcasting, motion detection of standing behavior is mainly performed by processing consecutive captured video frames with optical flow or inter-frame difference methods and judging by the height change of the human body: if the height changes from low to high, a standing-up behavior is indicated; if it changes from high to low, a non-standing behavior (such as sitting down) is indicated. However, as the number of people in the scene increases and factors such as lighting keep varying, body regions are occluded in the captured images and the height difference becomes unclear, so it is difficult to accurately identify when a human body performs a standing-up behavior.
Disclosure of Invention
The application provides a standing behavior identification method and a standing behavior identification device, which are mainly used to avoid the influence of objective factors such as body occlusion and lighting changes caused by an increase in the number of people in a scene, so that whether a human behavior is a standing-up behavior can be identified accurately.
In order to achieve the above purpose, the present application mainly provides the following technical solutions:
a first aspect of the present application provides a method for identifying a standing behavior, including:
acquiring video data;
performing human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame;
tracking a target human head object from the image frame according to the human head frame information to obtain tracking information corresponding to the target human head object, wherein the tracking information comprises human head frame information corresponding to the target human head object in the image frame;
inputting the tracking information into a preset standing behavior recognition model, and outputting a human behavior recognition result of the target human head object, wherein the preset standing behavior recognition model is a model obtained by training in advance based on the tracking information of a human head object sample and the standing behavior recognition result labeled on the tracking information.
A second aspect of the present application provides a standing behavior recognition apparatus, including:
an acquisition unit configured to acquire video data;
the human head detection processing unit is used for carrying out human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame;
the tracking processing unit is used for tracking a target human head object from the image frame according to the human head frame information to obtain tracking information corresponding to the target human head object, wherein the tracking information comprises human head frame information corresponding to the target human head object in the image frame;
and the model processing unit is used for inputting the tracking information into a preset standing behavior recognition model and outputting a human behavior recognition result of the target human head object, wherein the preset standing behavior recognition model is a model obtained by pre-training based on the tracking information of a human head object sample and the standing behavior recognition result labeled to the tracking information.
A third aspect of the application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of standing up behavior recognition as described above.
The present application provides, in a fourth aspect, an electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of standing behavior recognition as described above when executing the computer program.
By means of the technical scheme, the technical scheme provided by the application at least has the following advantages:
the application provides a standing behavior identification method and a standing behavior identification device, video data can be collected in real time in the process of personnel movement in a shooting scene, head detection processing is carried out through image frames included in the video data, head frame information corresponding to a head object included in the image frames is obtained, tracking processing is further carried out on a target head object, head frame information corresponding to the target head object is obtained from the image frames and is used as tracking information of the target head object, and finally a preset standing behavior identification model is used for processing, so that whether the human behavior corresponding to each target head object is a standing behavior or not is identified. The method is a specific implementation method realized by utilizing human head detection and tracking processing and a preset standing behavior model, and cannot be influenced by objective factors such as human body shielding and light change caused by the increase of the number of people in a scene. Compared with the prior art, the method and the device solve the problem that when the human body is in the standing behavior is difficult to identify due to the influence of objective factors, and can identify whether the human body behavior is the standing behavior more accurately.
The above description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented in accordance with the content of the description, and in order that the above and other objects, features, and advantages of the present application may become more apparent, the detailed description of the present application is given below.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart of a method for identifying a standing behavior according to an embodiment of the present application;
fig. 2 is a flowchart of another method for identifying a standing behavior according to an embodiment of the present application;
fig. 3 is a flowchart of a specific implementation method for training a human head detection model according to an embodiment of the present application;
fig. 4 is a schematic diagram of human head frame information corresponding to a human head object detected in an image frame according to an embodiment of the present application;
fig. 5 is a schematic flowchart of real-time human head detection and tracking processing performed on each image frame according to an embodiment of the present application;
fig. 6 is a schematic flowchart of batch human head detection and tracking processing performed on multiple image frames according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a supplementary solution for batch human head detection and tracking processing according to an embodiment of the present application;
fig. 8 is a flowchart of a specific implementation method for training a standing behavior recognition model according to an embodiment of the present application;
fig. 9 is a block diagram of an apparatus for recognizing a standing behavior according to an embodiment of the present application;
fig. 10 is a block diagram of another apparatus for recognizing a standing behavior according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the application provides a standing behavior identification method: human head objects are detected and tracked in the captured video data, and the resulting tracking information is processed by a preset standing behavior recognition model, thereby identifying whether the human behavior corresponding to each target human head object is a standing-up behavior. As shown in fig. 1, the method comprises the following specific steps:
101. video data is acquired.
In the embodiment of the application, video data is collected continuously while the activities of people in the scene are captured; the video data may consist of one image frame or a plurality of consecutive image frames.
For example, a camera-equipped device may capture the activities of people in the scene and transmit the frames synchronously to a connected server, so that the server side continuously receives video data for subsequent image processing operations.
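For illustration only, a minimal sketch of such continuous frame acquisition with OpenCV follows; the video source and the frame-handling callback are assumptions for the example, not details disclosed by the application.

```python
import cv2

def stream_frames(source, on_frame):
    """Read frames continuously from a camera or video stream and hand
    each one to a processing callback (e.g. the head detection step)."""
    cap = cv2.VideoCapture(source)   # 0 for a local camera, or a stream URL
    try:
        while True:
            ok, frame = cap.read()
            if not ok:               # stream ended or dropped
                break
            on_frame(frame)          # per-frame head detection / tracking
    finally:
        cap.release()

# stream_frames("rtsp://classroom-cam/stream", handle_frame)  # illustrative URL
```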
102. And carrying out human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame.
In the embodiment of the present application, for one image frame, head objects are identified instead of whole human bodies. This choice more effectively avoids the following problem: in a scene with many people, body occlusion means that recognizing whole human bodies often captures only part of a body, making it difficult to accurately identify the people present in the image frame.
For example, for an image frame, each head object identified by the head detection processing may be displayed as a rectangular frame, so that each head object corresponds to a piece of frame information, for example, but not limited to, the position of the rectangular frame in the image frame.
103. And tracking the target head object from the image frame according to the head frame information corresponding to at least one head object contained in the image frame to obtain the tracking information corresponding to the target head object.
The target human head object is used for referring to the personnel needing to be tracked, and the tracking information comprises human head frame information corresponding to the target human head object in the image frame.
While the activities of people in the scene are being captured, the continuously received video data consists of consecutive image frames; the shorter the interval between frames (i.e., the higher the frame rate), the more clearly the consecutive image frames show the trajectory of each person's activity, and the same person will appear in every image frame unless he or she leaves the scene. Therefore, in the embodiment of the present application, once the target person to be tracked is identified, the corresponding head object of that person should appear in the following consecutive image frames.
Accordingly, in the embodiment of the present application, the tracking information corresponding to the target human head object refers to the following: when the target person appears in consecutive image frames, the head objects corresponding to that person in each image frame (the person is represented in this embodiment by the target human head object) are obtained. These head objects constitute the behavior trajectory of the target person and thus have a matching relationship with the target human head object, and the human head frame information corresponding to them constitutes the tracking information of the target human head object.
In the embodiment of the present application, the tracking processing of the target human head object may be performed by, but is not limited to, methods such as mean shift, particle filtering, Kalman filtering, deep-learning-based multi-object tracking (MOT), and single-object tracking (SOT).
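Of the listed methods, a Kalman filter is the most compact to illustrate. The following NumPy sketch implements a constant-velocity predict/update step for the center of a tracked human head frame; the state layout and the noise magnitudes Q and R are illustrative assumptions, not values from the application.

```python
import numpy as np

# State x = [cx, cy, vx, vy]: head-frame center plus its velocity
# (constant-velocity motion model).
F = np.array([[1, 0, 1, 0],    # transition: position += velocity per frame
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # only the detected center (cx, cy) is observed
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2           # process noise (assumed magnitude)
R = np.eye(2) * 1.0            # measurement noise (assumed magnitude)

def predict(x, P):
    """Propagate the state and its covariance one frame forward."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a detected head-frame center z."""
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P
```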
With respect to step 102 and step 103, the human head detection process and the human head tracking process provided in the embodiment of the present application may include, but are not limited to, the following two specific implementation processes, which are specifically explained as follows:
Exemplary implementation procedure 1: human head detection processing and human head tracking processing are performed in real time for each image frame while video data is continuously received; that is, whenever an image frame is received, step 102 is executed and then step 103 follows in real time.
Exemplary implementation procedure 2: while video data is continuously received, a number of consecutive image frames are collected and processed in batch on the server side; that is, step 102 and then step 103 are executed in batch for the plurality of image frames.
104. Inputting the tracking information into a preset standing behavior recognition model, and outputting a human behavior recognition result of the target human head object.
The preset standing behavior recognition model is a model obtained by pre-training based on tracking information of the human head object sample and a standing behavior recognition result labeled to the tracking information.
In the embodiment of the application, the human head frame information of the head objects matched with the target human head object in different image frames can be obtained from the tracking information. Because the different image frames are received continuously in time order, the matched head objects are also time-ordered, so the human body behavior trajectory of the target person can be judged from them, that is, the transition of the target person between squatting/sitting and standing; the recognized change of human behavior is then one of sitting down, standing up, or maintaining a certain behavior state.
For example, the embodiment of the present application may use a pre-trained recurrent neural network model as the preset standing behavior recognition model. A Recurrent Neural Network (RNN) is a neural network that takes sequence data as input and recurses in the evolution direction of the sequence, with all nodes connected in a chain; it is characterized by memory, parameter sharing, and Turing completeness, and therefore has certain advantages when learning the nonlinear characteristics of a sequence. The embodiment of the application uses a pre-trained recurrent neural network model to process the time-ordered head objects matched with the target human head object; based on the different positions of these time-ordered head objects in the consecutive image frames, the position change trajectory of the target human head object can be clearly tracked, and the behavior trajectory of the target person is thereby obtained indirectly.
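The application does not disclose the exact architecture of this model. Purely as an assumed illustration, a minimal GRU-based classifier over per-frame head frame coordinates could look like the following PyTorch sketch; the hidden size is arbitrary, and the three output classes mirror the behavior states named later (standing up, sitting down, state maintenance).

```python
import torch
import torch.nn as nn

class StandUpRNN(nn.Module):
    """Map a sequence of per-frame head frames [l, a, r, b] to one of
    three behavior classes (stand up / sit down / state maintenance)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=4, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)

    def forward(self, boxes):        # boxes: (batch, frames, 4)
        _, h = self.rnn(boxes)       # h: (1, batch, hidden), final step state
        return self.head(h[-1])      # class logits, shape (batch, 3)

# logits = StandUpRNN()(track_tensor); behavior = logits.argmax(-1)
```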
The embodiment of the application provides a standing behavior identification method. Video data can be collected in real time while people move in the captured scene; human head detection processing is performed on the image frames included in the video data to obtain the human head frame information corresponding to each human head object in the frames; the target human head object is then tracked, and the human head frame information corresponding to it in the image frames is taken as its tracking information; finally, a preset standing behavior recognition model processes this tracking information to identify whether the human behavior corresponding to each target human head object is a standing-up behavior. Because this implementation relies on human head detection and tracking processing and a preset standing behavior model, it is not affected by objective factors such as body occlusion and lighting changes caused by an increase in the number of people in the scene. Compared with the prior art, it solves the problem that it is difficult to identify when a human body performs a standing-up behavior under the influence of such objective factors, and can identify whether a human behavior is a standing-up behavior more accurately.
In order to explain the above embodiments in more detail, the embodiment of the present application further provides another method for identifying a standing behavior, and as shown in fig. 2, the embodiment of the present application provides the following specific steps:
201. video data is acquired.
In the embodiment of the present application, for the explanation of this step, refer to step 101, which is not described herein again.
202. And processing the image frame by using a preset human head detection model, and outputting the coordinate information of the target frame detected from the image frame and the confidence coefficient of the target frame.
The preset human head detection model is a model pre-trained by using a convolutional neural network in the embodiment of the present application, and a specific implementation method of model training is shown in fig. 3, and the embodiment of the present application provides the following steps:
s301, a plurality of image frame samples are obtained, and the human head objects are labeled in the image frame samples.
For the embodiments of the present application, the human head object in an image frame sample may be, but is not limited to, labeled in the form of a rectangular frame. For example, if three people are present in the scene, then for a captured image frame, three rectangular frames may be drawn, each marking one person's head object.
S302, performing data enhancement processing operation on the image frame sample to obtain a processed sample.
For the embodiment of the present application, the data enhancement processing operation may be, but is not limited to, random cropping, random scaling, blurring, and the like; the purpose of the enhancement is to increase the diversity of the samples participating in model training and thereby improve the generalization of the trained model.
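A minimal sketch of one such enhancement recipe with OpenCV follows; the crop ratios and blur kernel are assumed values. Note that in detection training the labeled head frames would have to be transformed together with the image, which this sketch omits.

```python
import random
import cv2

def augment(img):
    """Randomly crop, rescale, and blur one image frame sample."""
    h, w = img.shape[:2]
    # random crop, keeping at least 80% of each side (assumed ratio)
    ch = int(h * random.uniform(0.8, 1.0))
    cw = int(w * random.uniform(0.8, 1.0))
    y, x = random.randint(0, h - ch), random.randint(0, w - cw)
    img = img[y:y + ch, x:x + cw]
    # random scaling back to the original training size
    img = cv2.resize(img, (w, h), interpolation=cv2.INTER_LINEAR)
    # random blurring
    if random.random() < 0.5:
        img = cv2.GaussianBlur(img, (5, 5), 0)
    return img
```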
And S303, inputting the sample into a convolutional neural network for training to obtain a human head detection model.
A Convolutional Neural Network (CNN) is a class of feedforward neural networks that includes convolution calculations and has a deep structure; it is one of the representative algorithms of deep learning. Convolutional neural networks have a feature-learning capability and can perform shift-invariant classification of input information according to their hierarchical structure, for which they are also called "shift-invariant artificial neural networks". The embodiment of the application trains the model with a convolutional neural network so that the model has the human head detection function, namely detecting the human head objects in an image frame.
The coordinate information of a target frame and the confidence of a target frame are detected from the image frame through the processing of the preset human head detection model.
A target frame refers to a candidate frame similar in size to a human head object appearing in the image frame. The confidence of a target frame denotes how credible it is that the target frame is a human head object, and the coordinate information of a target frame denotes its position in the image frame, consisting of the upper-left corner coordinates and the lower-right corner coordinates of the target frame in the image frame.
Illustratively, the embodiment of the present application provides a schematic diagram of a target frame, as shown in fig. 4, with the horizontal direction of the image frame as the X-axis, the vertical direction as the Y-axis, and the upper-left corner at (0, 0). For a target frame detected from an image frame through the processing of the preset human head detection model, with upper-left corner coordinates (X0, Y0), lower-right corner coordinates (X1, Y1), and confidence conf1, the target frame, identified as H1, is expressed as H1 = [X0, Y0, X1, Y1, conf1].
Further, assuming that n target frames can be detected from one image frame, the expression of the n target frames is: H = {[l1, a1, r1, b1, conf1], …, [lk, ak, rk, bk, confk], …, [ln, an, rn, bn, confn]}, where l denotes the upper-left X-axis coordinate of a target frame, a the upper-left Y-axis coordinate, r the lower-right X-axis coordinate, b the lower-right Y-axis coordinate, conf the confidence of the target frame, and k any one target frame between the first and the nth.
203. And judging whether the confidence coefficient of the target frame reaches a preset confidence coefficient threshold value.
In the embodiment of the application, a preset confidence threshold is set according to practical experience and is used to measure how credible it is that a target frame is a human head object.
204a, if the confidence coefficient of the target frame is judged to reach a preset confidence coefficient threshold value, determining that the target frame is a human head object.
204b, if the confidence coefficient of the target frame does not reach the preset confidence coefficient threshold value, determining that the target frame is not the human head object.
205a, after the target frame is determined to be the human head object, the coordinate information corresponding to the target frame is used as the human head frame information corresponding to the human head object detected from the image frame.
Illustratively, for the target frame H1 = [X0, Y0, X1, Y1, conf1], after it is determined to be a human head object, the human head frame information corresponding to the head object is [X0, Y0, X1, Y1], where (X0, Y0) are the upper-left corner coordinates and (X1, Y1) the lower-right corner coordinates of the rectangular frame corresponding to the head object in the image frame.
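A minimal sketch of this thresholding step, assuming the [l, a, r, b, conf] representation above and an illustrative threshold value:

```python
CONF_THRESHOLD = 0.5   # preset confidence threshold (assumed value)

def filter_heads(detections):
    """detections: list of target frames [l, a, r, b, conf] output by the
    head detection model; keep those confident enough to be head objects
    and return their human head frame information [l, a, r, b]."""
    return [d[:4] for d in detections if d[4] >= CONF_THRESHOLD]

# filter_heads([[120, 40, 180, 110, 0.93], [300, 55, 340, 100, 0.21]])
# -> [[120, 40, 180, 110]]   (the low-confidence target frame is discarded)
```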
206a, tracking the target head object from the image frame according to the head frame information corresponding to the head object detected from the image frame, and obtaining the tracking information corresponding to the target head object.
The target head object is used for referring to the person needing to be tracked, and the tracking information comprises the corresponding head frame information of the target head object in the image frame.
While the activities of people in the scene are being captured, the continuously received video data consists of consecutive image frames, and the shorter the interval between frames, the more clearly they show the trajectory of each person's activity. Accordingly, in the embodiment of the present application, the tracking information corresponding to the target human head object means that, when the target person (represented in this embodiment by the target human head object) appears in consecutive image frames, the human head frame information of the head objects matching the target human head object can be identified from those frames, and this human head frame information constitutes the tracking information of the target human head object.
It should be noted that, while people move in the captured scene, the server side continuously receives the captured image frames. Accordingly, the embodiment of the application may perform human head detection and tracking processing on each image frame in real time, or perform it on a plurality of image frames in batch. These are two parallel technical solutions: the former favors the real-time performance of image frame processing, while the latter favors its efficiency. Both are explained in detail below.
In the embodiment of the present application, as the first parallel technical solution: if the acquired video data includes only one image frame at a time, then, while the server side continuously receives image frames, the server side can be regarded as performing human head detection and tracking processing in real time each time one image frame is acquired. The specific implementation steps, shown in fig. 5, are steps S401 to S406b below.
S401, in the process that the image frames are continuously received by the server side, real-time human head detection and tracking processing are determined to be carried out on each image frame.
S402, judging whether the image frame is a starting image frame or not.
In the embodiment of the present application, the starting image frame is the first image frame of a human head detection and tracking processing task. For example, while the server side continuously receives image frames, any image frame can be selected as the starting image frame, and the human head detection and tracking processing task is then executed once over that first image frame and a number of consecutive image frames after it.
It should be noted that, for one human head detection and tracking processing task: if the task performs real-time processing on each image frame, it actually comprises multiple combined detection-and-tracking operations according to the actual processing requirements (i.e., one combined operation is performed whenever an image frame is received); but if the task performs batch processing on a plurality of image frames, it comprises only one batch human head detection operation and one batch tracking processing operation.
For the embodiment of the present application, the head object detected from the starting image frame is used as the target human head object, which is then tracked in the consecutive image frames after the starting image frame.
S403a, if the image frame is determined to be the initial image frame, a tracking marker is added to the head object detected from the initial image frame to be the target head object.
In the embodiment of the present application, if the image frame is used as the starting image frame, the image frame is used as the first image frame to perform a combined operation of human head detection and tracking processing. Accordingly, the human head object detected from the image frame is used as the target human head object, namely, the human head object represents the tracked person.
And, in order to facilitate the following tracking operation on the target head object, a different tracking label may be added to each target head object, for example, assuming that three target head objects are contained in the image frame, the labels may be "track 1", "track 2", and "track 3", respectively.
S404a, searching the head frame information corresponding to the target head object from the continuous other image frames after the initial image frame as the tracking information of the target head object.
In the embodiment of the application, while the server side continuously receives the captured image frames, real-time human head detection and tracking processing is performed on each one. After the target human head object is determined from the starting image frame, human head detection and tracking processing is next performed in real time on the second image frame adjacent to the starting image frame, and then on the third, fourth, fifth, and so on, over as many consecutive image frames as required.
The purpose of such real-time human head detection and tracking processing is to search, frame by frame, from the second, third, fourth, fifth image frame and so on, for the head objects matching the target human head object, so that the human head frame information of the matched head objects serves as the human head frame information corresponding to the target human head object in the different image frames and constitutes the tracking information of the target human head object over those frames.
For example, for a target human head object (with tracking mark track1), its human head frame information in the starting image frame is t1 = [l1, a1, r1, b1], where l denotes the upper-left X-axis coordinate of the frame, a the upper-left Y-axis coordinate, r the lower-right X-axis coordinate, and b the lower-right Y-axis coordinate.
Accordingly, in the second image frame adjacent to the starting image frame, the head object matched with the target human head object (still carrying the tracking mark track1) is identified as t2, and its corresponding human head frame information is t2 = [l2, a2, r2, b2]. By analogy, the head object matching the target human head object in the third image frame is identified as t3 with human head frame information t3 = [l3, a3, r3, b3], and for m consecutive image frames after the starting image frame, the matched head object on the m-th image frame has human head frame information tm = [lm, am, rm, bm]. Accordingly, the tracking information of the target human head object over the m image frames obtained by the embodiment of the present application is track1 = {[l1, a1, r1, b1], …, [lm, am, rm, bm]}.
Specifically, in the embodiment of the present application, the implementation of step S404a is substantially the same as steps S403b-S406b; therefore, for the explanation of step S404a, please refer to the following steps S403b-S406b.
S403b, if the image frame is determined not to be the starting image frame, obtaining at least one first head object and the human head frame information corresponding to it from the image frame, and obtaining at least one second head object and the human head frame information corresponding to it from the adjacent image frame before the image frame.
In the embodiment of the application, while the server side continuously receives image frames, if an obtained image frame is not a starting image frame, this indicates that the starting image frame precedes it and has already undergone human head detection processing to obtain the target human head object; then, for a consecutive image frame after the starting image frame (that is, any other image frame following it), only the head object matched with the target human head object needs to be searched from that image frame.
To facilitate distinguishing the different image frames processed in time order, the head object detected from the current image frame being processed in real time is identified as a first head object, and the head object detected from the adjacent image frame before it is identified as a second head object.
It should be noted that the image frame containing the second head object is the one on which real-time human head detection and tracking processing has just finished, so the second head object carries a unique corresponding tracking mark; the tracking mark is used to mark the head objects matched with the target human head object in different image frames.
For example, assume the starting image frame contains three target human head objects labeled "track1", "track2", and "track3". If three head objects are also detected in the second image frame adjacent to the starting image frame, then after the tracking processing it is known which target human head object each of the three matches, that is, to which target human head object each belongs. According to this matching relationship, a head object that belongs to the behavior trajectory of a given target human head object carries that target's tracking mark. Accordingly, whenever a head object matched with the target human head object exists in an image frame, it carries the same tracking mark as the target human head object.
Therefore, for the embodiment of the present application, the tracking mark carried by the second head object has been passed down from the starting image frame and indicates to which target human head object's behavior trajectory the second head object belongs.
S404b, searching the first head object matched with the second head object from the image frame by calculating and comparing the head frame information corresponding to the first head object and the head frame information corresponding to the second head object.
S405b, adding the same tracking mark to the first head object on the basis that the first head object matches the second head object.
S406b, using the head frame information corresponding to the first head object carrying the same tracking mark as the tracking information of the target head object in the image frame.
In the embodiment of the present application, steps S404b-S406b are explained as follows. In captured video data of people's behavior trajectories, the change of the same person between two adjacent image frames is slight. For example, assume there are two people in the captured scene and two target human head objects A and B are obtained from the starting image frame; for two adjacent captured image frames, head objects c and d are detected in the former and head objects e and f in the latter. Then, by calculating and comparing the human head frame information of these head objects, it can be determined to which of the behavior trajectories of target human head objects A and B each of c and d belongs, and likewise for e and f; that is, which head object matches target human head object A and which matches target human head object B. According to this matching relationship, each head object is given the same tracking mark as its target human head object.
Accordingly, for a target human head object, if a head object carrying the same tracking mark as the target human head object is found in an image frame, the human head frame information of that head object is the tracking information of the target human head object in that image frame.
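The application does not fix the metric used to calculate and compare head frame information between adjacent frames; one common choice is the intersection-over-union (IoU) of the frames. The sketch below uses greedy IoU matching to pass tracking marks forward and starts a new target for any unmatched detection, which also covers the supplementary case of fig. 7 below; the IoU threshold is an assumed value.

```python
import itertools

_new_ids = itertools.count(1)   # shared counter so new marks never collide

def iou(a, b):
    """Intersection-over-union of two head frames [l, a, r, b]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_frame(tracks, detections, thr=0.3):
    """tracks: {tracking mark -> head frame in the previous image frame};
    detections: head frames in the current image frame. Matched detections
    inherit the tracking mark; unmatched ones become new target objects."""
    next_tracks, used = {}, set()
    for mark, prev in tracks.items():
        scores = [(iou(prev, d), i)
                  for i, d in enumerate(detections) if i not in used]
        if scores:
            best_iou, best = max(scores)
            if best_iou >= thr:
                next_tracks[mark] = detections[best]  # same tracking mark
                used.add(best)
    for i, det in enumerate(detections):
        if i not in used:                  # a person newly entering the scene
            next_tracks["track%d" % next(_new_ids)] = det
    return next_tracks
```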
In the embodiment of the present application, as the second parallel technical solution: if the acquired video data includes a plurality of consecutive image frames, then, while the server side continuously receives image frames, it may perform human head detection and tracking processing in batch after acquiring the batch of image frames. The specific implementation steps, shown in fig. 6, are steps S501 to S507 below.
S501, while the server side continuously receives image frames, determining that a plurality of consecutive image frames are to be processed in batch for human head detection and tracking processing.
S502, determining a starting image frame and other image frames subsequent to the starting image frame from the plurality of image frames.
S503, adding a tracking mark to the human head object detected from the start image frame as the target human head object.
For the second parallel technical scheme provided by the embodiment of the application, firstly, the head detection processing is performed on the continuous image frames to obtain the head object contained in each image frame, and then the tracking processing is performed on the head objects.
In the embodiment of the present application, any image frame may be selected as a starting image frame, where the starting image frame is used to determine a target head object, that is, to represent a tracked person, and to add a tracking marker to the target head object.
And S504, taking the human head object detected from the other image frames as the other human head object.
In the embodiment of the present application, in order to distinguish the head objects detected in different image frames, the head object detected in the initial image frame is used as the target head object, and the head objects detected in other image frames subsequent to the initial image frame are used as the other head objects.
And S505, starting from the starting image frame, calculating and comparing, for each pair of adjacent image frames, the head objects they contain and the human head frame information corresponding to those head objects, and searching the other image frames for the first other head objects matched with the target human head object.
It should be noted that, while the activities of people in the scene are being captured, a person may leave the scene or a new person may enter it. Accordingly, for the target human head object, there may be no head object matching it in some other image frame. Therefore, to clearly distinguish and refer to the other head objects contained in other image frames, an other head object matched with the target human head object is called a "first other head object", and an other head object not matched with it is called a "second other head object".
And S506, adding the same tracking mark to the first other human head object matched with the target human head object according to the tracking mark carried by the target human head object.
And S507, forming tracking information corresponding to the target head object by utilizing the head frame information corresponding to the first other head object with the same tracking mark.
In the embodiment of the application, if a first other head object matching the target human head object is found in the other image frames, the same tracking mark is added to it, indicating that it belongs to the behavior trajectory of the target human head object. In this way, the same tracking mark identifies the first other head objects matched with the target human head object across the different image frames, and their corresponding human head frame information constitutes the tracking information corresponding to the target human head object.
Further, as a supplement to the second parallel technical solution: while searching the other image frames for the first other head objects matched with the target human head object, if an unmatched second other head object exists in some other image frame, the following specific implementation steps can further be performed, as shown in fig. 7, steps S601 to S605.
S601, adding a tracking mark to the second other human head object to serve as a new target human head object.
And S602, taking the image frame where the new target human head object is as a new initial image frame.
In the embodiment of the present application, if a second other head object not matching any target human head object exists in some other image frame, this indicates that a new person has entered the scene. The embodiment of the application may then take the image frame in which the second other head object is first captured as a new starting image frame, so as to track the behavior trajectory of the newly entered person in the image frames after it.
And S603, searching for a third other human head object matched with the new target human head object from a plurality of continuous image frames adjacent to the new starting image frame.
It should be noted that, in order to distinguish and refer to the head objects included in other image frames, the head object matched with the new target head object is identified as a third other head object.
And S604, adding the same tracking mark to a third other person head object matched with the new target person head object according to the tracking mark carried by the new target person head object.
And S605, forming the tracking information corresponding to the new target head object by using the head frame information corresponding to the third other head object with the same tracking mark.
In the embodiment of the application, a tracking mark is added to the new target human head object, and the same tracking mark is also added to the third other head objects matched with it, so as to indicate which head objects in the other image frames belong to the behavior trajectory of the same new target human head object. The tracking information of the new target human head object is thus formed from the human head frame information corresponding to the third other head objects contained in this behavior trajectory.
Next, in the embodiment of the present application, no matter which of the two parallel technical solutions obtains the tracking information of the target human head object, the tracking information may be further utilized to perform the human behavior recognition processing operation, specifically, the following steps 207a to 209.
And 207a, judging whether the number of the other head objects matched with the target human head object in the tracking information reaches a preset threshold value.
208a, if it is judged that the number of the other head objects matched with the target human head object reaches the preset threshold value, processing the tracking information with the preset standing behavior recognition model and outputting a human behavior recognition result, wherein the result is one of the following three behavior states: standing-up behavior, sitting-down behavior, and state maintenance.
208c, if it is judged that the number of the other head objects matched with the target human head object does not reach the preset threshold value, continuing to receive the tracking information corresponding to the target human head object obtained from further consecutive image frames until the number of matched other head objects in the accumulated tracking information reaches the preset threshold value, and then triggering the preset standing behavior recognition model to perform the human behavior recognition operation.
In the embodiment of the application, while the activities of people in the scene are being captured, the continuously received video data consists of consecutive image frames, and the shorter the interval between frames, the more clearly they show the trajectory of each person's activity. However, the number of consecutive image frames must reach a certain threshold, otherwise the change in the person's activity is not evident; for example, even with a short frame interval, a mere 10 consecutive image frames cover too little time to show a person's behavior trajectory, and if the preset standing behavior recognition model were triggered to perform human behavior recognition on them, it would be difficult to clearly identify the person's actual behavior.
Accordingly, the embodiment of the present application sets a preset threshold in advance according to practical experience and uses it to measure the number of other head objects matched with the target human head object that the tracking information contains, thereby indirectly constraining the number of consecutive image frames required. If this preset threshold is reached, the image frames are guaranteed to be sufficient to exhibit the trajectory change of the human behavior.
If, however, the number of other head objects contained in the tracking information is less than the preset threshold, then, as image frames continue to be received and undergo human head detection and tracking processing, more other head objects matched with the target human head object are obtained and accumulated until the preset threshold is reached, at which point the preset standing behavior recognition model is triggered to perform the human behavior recognition operation.
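A minimal sketch of this accumulate-then-trigger logic for one target human head object, with an assumed threshold value:

```python
PRESET_THRESHOLD = 32   # required number of matched head frames (assumed)

class TrackBuffer:
    """Accumulate per-frame head frame information for one target human
    head object and run the recognition model only once enough matched
    frames have been collected."""
    def __init__(self, model):
        self.model, self.boxes = model, []

    def push(self, box):
        self.boxes.append(box)                 # [l, a, r, b] for this frame
        if len(self.boxes) >= PRESET_THRESHOLD:
            result = self.model(self.boxes)    # stand up / sit down / keep state
            self.boxes.clear()                 # start accumulating anew
            return result
        return None                            # keep accumulating
```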
Further, in the embodiment of the present application, as for the training method of the preset standing behavior recognition model, as shown in fig. 8, the following steps S701 to S706 are given in the embodiment of the present application.
S701, obtaining a video data sample, wherein the video data sample comprises a plurality of continuous image frame samples.
In the embodiment of the application, the video data samples are video data of people's activities captured historically; they comprise a plurality of consecutive image frame samples, and those consecutive frames must clearly show the changing course of the people's behavior trajectories.
S702, performing human head detection processing on the image frame samples to obtain human head frame information samples corresponding to at least one human head object sample contained in the image frame samples.
In the embodiment of the present application, the image frame samples may be processed by using the human head detection model trained in S301 to S303, and the human head object sample included in each image frame sample and the human head frame information sample corresponding to the human head object sample are output.
And S703, tracking the target human head object sample from the plurality of image frame samples according to the human head frame information sample to obtain a tracking information sample corresponding to the target human head object sample, wherein the tracking information sample comprises the human head frame information sample corresponding to the target human head object sample in the image frame samples.
In the embodiment of the present application, for example, the first image frame sample included in a video data sample may be taken as the starting image frame sample and the head object detected in it as the target human head object sample; the head object samples matched with the target human head object sample are then searched from the image frame samples after the starting one. These head object samples actually constitute the behavior trajectory of the target person, and the tracking information for the target human head object sample is formed from their corresponding human head frame information samples.
And S704, dividing the tracking information sample into a plurality of groups of sample information by taking the number of the preset image frames as a group of samples, wherein each group of sample information at least comprises a human head frame information sample which is acquired from adjacent image frame samples and matched with the target human head object sample.
The preset number of image frames is preset according to practical experience and is used for limiting the number of at least required continuous image frame samples on the premise of ensuring that the change process of the human behavior track is clearly shown.
In this embodiment of the present application, the tracking information samples may be divided into multiple groups of sample information, with a preset number of image frames per group. Exemplary grouping methods include: the groups share no overlapping human head frame information samples; or, alternatively, the groups may overlap. Whatever grouping method is adopted, the embodiment of the application only requires that each group clearly shows the person's behavior trajectory.
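Both grouping variants reduce to one sliding-window split; a sketch, with the group size and step as assumed parameters:

```python
def group_samples(track, size, step):
    """Split one target's tracking information sample (a list of per-frame
    head frames [l, a, r, b]) into groups of `size` consecutive samples;
    step == size gives non-overlapping groups, step < size overlapping ones."""
    return [track[i:i + size] for i in range(0, len(track) - size + 1, step)]

# With a 100-frame track and a preset group size of 32 (assumed values):
# group_samples(track, 32, 32) -> non-overlapping groups
# group_samples(track, 32, 8)  -> overlapping groups (more training samples)
```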
And S705, labeling corresponding labels to each group of sample information.
For the embodiment of the present application, the main purpose is to identify whether a human behavior is a standing-up behavior. So that the trained model can clearly judge this, three types of labels are preset, exemplarily: 0, standing up, representing the process from sitting or squatting to standing; 1, sitting down, representing the process from standing to sitting or squatting; 2, state maintenance, representing behavior that is neither standing up nor sitting down.
Therefore, according to the behavior track displayed by each group of samples, corresponding labels are added to each group of sample information, and a plurality of groups of sample information carrying the labels are input into the model for training.
And S706, inputting the multiple groups of sample information carrying the labels into a recurrent neural network for training, and outputting a standing behavior recognition model for human behavior recognition.
For the embodiment of the application, each group of sample information consists of a plurality of image frame samples arranged in time order. The embodiment of the application exploits the fact that a recurrent neural network is well suited to processing time-series data: model training is carried out on the multiple groups of sample information according to the labels they carry, and a standing behavior recognition model for human behavior recognition is output.
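As a minimal sketch of this training step, assuming an LSTM as the recurrent neural network and the four head-box coordinates as the per-frame feature (neither of which the application fixes), the labeled groups could be consumed as follows in PyTorch; all names, shapes, and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Labels as defined above, encoded as class indices (an assumption).
STANDING_UP, SITTING_DOWN, STATE_PRESERVATION = 0, 1, 2

class StandUpRecognizer(nn.Module):
    """LSTM classifier over one group of per-frame head-box features."""
    def __init__(self, feat_dim=4, hidden=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):              # x: (batch, window, feat_dim)
        out, _ = self.lstm(x)          # hidden state at every time step
        return self.head(out[:, -1])   # classify from the last step

model = StandUpRecognizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch: 8 groups of 16 frames, each frame
# encoded as the (x1, y1, x2, y2) box of the tracked head.
boxes = torch.rand(8, 16, 4)
labels = torch.randint(0, 3, (8,))
loss = criterion(model(boxes), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```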
Further, as an implementation of the methods shown in fig. 1 and fig. 2, an embodiment of the present application provides a standing behavior recognition device. This device embodiment corresponds to the foregoing method embodiments; for ease of reading, details already given there are not repeated one by one, but it should be clear that the device in this embodiment can correspondingly implement all the contents of the method embodiments. The device is applied to identifying whether the activity of a person in a scene is a standing behavior and, as shown in fig. 9, comprises:
an acquisition unit 801 for acquiring video data;
a human head detection processing unit 802, configured to perform human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame;
a tracking processing unit 803, configured to perform tracking processing on a target human head object from the image frame according to the human head frame information, to obtain tracking information corresponding to the target human head object, where the tracking information includes human head frame information corresponding to the target human head object in the image frame;
a model processing unit 804, configured to input the tracking information into a preset standing behavior recognition model, and output a human behavior recognition result performed on the target human head object, where the preset standing behavior recognition model is a model obtained by training in advance based on tracking information of a human head object sample and a standing behavior recognition result labeled to the tracking information.
Further, as shown in fig. 10, if the number of the image frames included in the video data is one, the tracking processing unit 803 includes:
a determining module 8031, configured to determine whether the image frame is an initial image frame;
an adding module 8032, configured to, if the image frame is determined to be the initial image frame, add a tracking mark to a head object detected from the initial image frame and take that head object as a target head object;
a searching module 8033, configured to search, from other consecutive image frames after the image frame, for the head frame information corresponding to the target head object, as tracking information of the target head object;
the searching module 8033 is further configured to, if it is determined that the image frame is not the initial image frame, search the image frame for the head frame information corresponding to the target head object according to the head frame information corresponding to the target head object in the adjacent image frame before the image frame, as tracking information for the target head object in the image frame.
Further, as shown in fig. 10, the lookup module 8033 includes:
an obtaining sub-module 80331, configured to obtain, from the image frame, at least one detected first head object and head frame information corresponding to the first head object;
the obtaining sub-module 80331 is further configured to obtain, from an adjacent image frame before the image frame, at least one detected second head object and head frame information corresponding to the second head object, where the second head object carries a uniquely corresponding tracking mark, and the tracking mark is used to mark a head object matched with a target head object in different image frames;
the searching submodule 80332 is configured to search for a first head object matching the second head object from the image frame by calculating and comparing head frame information corresponding to the first head object and head frame information corresponding to the second head object;
an adding sub-module 80333 for adding the same tracking tag to the first head object based on the first head object matching the second head object;
the determining sub-module 80334 is configured to use the head frame information corresponding to the first head object carrying the same tracking mark as the tracking information of the target head object in the image frame.
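The application specifies only that the head frame information is "calculated and compared"; one common metric, offered here purely as an assumption rather than as the disclosed method, is the intersection-over-union (IoU) of the upper-left/lower-right boxes, with each tracked second head object greedily claiming its best-overlapping first head object:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_heads(current_boxes, previous_tracks, thresh=0.3):
    """Greedy matching. previous_tracks maps tracking mark -> last head box
    (the second head objects); current_boxes lists detections in the current
    frame (the first head objects). Each mark claims the best unused current
    detection whose IoU exceeds `thresh` (threshold value is an assumption)."""
    matches, used = {}, set()
    for mark, prev_box in previous_tracks.items():
        best_idx, best_iou = None, thresh
        for i, cur_box in enumerate(current_boxes):
            if i in used:
                continue
            overlap = iou(prev_box, cur_box)
            if overlap > best_iou:
                best_idx, best_iou = i, overlap
        if best_idx is not None:
            matches[mark] = best_idx  # the matched detection inherits the mark
            used.add(best_idx)
    return matches  # detections absent from `matches` can start new tracks
```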
Further, as shown in fig. 10, if the number of the image frames included in the video data is multiple, the tracking processing unit 803 includes:
a determining module 8034, configured to determine a starting image frame and other image frames subsequent to the starting image frame from the plurality of image frames;
the adding module 8032 is configured to add a tracking mark to the head object detected from the starting image frame as a target head object;
the determining module 8034 is further configured to take the head objects detected from the other image frames as other head objects;
the searching module 8033 is further configured to calculate and compare, one by one and starting from the starting image frame, the head objects contained in each pair of adjacent image frames together with the head frame information corresponding to those head objects, and to search the other image frames for a first other head object matched with the target head object;
the adding module 8032 is further configured to add the same tracking mark to the first other human head object matched with the target human head object according to the tracking mark carried by the target human head object;
a composing module 8035, configured to compose tracking information corresponding to the target head object by using the head frame information corresponding to the first other head object with the same tracking mark.
Further, as shown in fig. 10, the tracking processing unit 803 further includes:
the adding module 8032 is further configured to, in the process of searching the other image frames for a first other head object matched with the target head object, add a tracking mark to any second other head object that remains unmatched in the other image frames, taking it as a new target head object;
the determining module 8034 is further configured to use the image frame where the new target head object is located as a new initial image frame;
the searching module 8033 is further configured to search, from a plurality of consecutive image frames adjacent to the new start image frame, a third other human head object matching the new target human head object;
the adding module 8032 is further configured to add the same tracking mark to a third other head object matched with the new target head object according to the tracking mark carried by the new target head object;
the composing module 8035 is further configured to compose tracking information corresponding to the new target head object by using the head frame information corresponding to the third other head object with the same tracking mark.
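Continuing the same hypothetical sketch, an unmatched second other head object simply opens a new track under a fresh tracking mark, and its image frame acts as the new starting image frame for that track:

```python
import itertools

_mark_counter = itertools.count(1)  # source of fresh tracking marks

def update_tracks(tracks, current_boxes, matches):
    """tracks: {tracking mark: list of head boxes}; matches comes from
    match_heads() above. Matched detections extend their existing tracks;
    every unmatched detection becomes a new target head object with a
    newly issued tracking mark."""
    for mark, idx in matches.items():
        tracks[mark].append(current_boxes[idx])
    matched = set(matches.values())
    for i, box in enumerate(current_boxes):
        if i not in matched:
            tracks[next(_mark_counter)] = [box]  # new track, new mark
    return tracks
```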
Further, as shown in fig. 10, the apparatus further includes:
a determining unit 805, configured to determine whether the number of other human head objects matching the target human head object in the tracking information reaches a preset threshold before the tracking information is input into a preset standing behavior recognition model and a human behavior recognition result of the target human head object is output;
the model processing unit 804 is further configured to, when it is determined that the number of other head objects matched with the target head object reaches the preset threshold, process the tracking information by using the preset standing behavior recognition model and output a human behavior recognition result, where the human behavior recognition result is one of the following three behavior states: standing-up behavior, sitting-down behavior, and state preservation;
an executing unit 806, configured to, when it is determined that the number of other head objects matched with the target head object has not reached the preset threshold, continue to receive tracking information corresponding to the target head object acquired from further consecutive image frames, and trigger the preset standing behavior recognition model to execute the human behavior recognition operation once the number of matched other head objects in the accumulated tracking information reaches the preset threshold.
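A hedged sketch of this accumulate-then-recognize behavior, with `WINDOW` standing in for the preset threshold (its value and all names are assumptions):

```python
from collections import defaultdict, deque

WINDOW = 16  # preset threshold of matched frames (an assumed value)
buffers = defaultdict(lambda: deque(maxlen=WINDOW))

def on_tracking_info(mark, head_box, recognize):
    """Accumulate head frame information per tracking mark and defer
    recognition until a full window of matched frames has arrived."""
    buffers[mark].append(head_box)
    if len(buffers[mark]) < WINDOW:
        return None                        # keep receiving tracking information
    return recognize(list(buffers[mark]))  # e.g. the recurrent model sketched above
```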
Further, as shown in fig. 10, the human head detection processing unit 802 includes:
a model processing module 8021, configured to process the image frame by using a preset human head detection model, and output coordinate information of a target frame detected from the image frame and a confidence of the target frame, where the coordinate information includes an upper left corner position coordinate and a lower right corner position coordinate of the target frame in the image frame;
a judging module 8022, configured to judge whether the confidence of the target frame reaches a preset confidence threshold;
a determining module 8023, configured to determine that the target frame is a human head object when the confidence of the target frame is determined to reach a preset confidence threshold;
the determining module 8023 is further configured to use the coordinate information corresponding to the target frame as the head frame information corresponding to the head object detected from the image frame.
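As an illustrative sketch only, filtering the model's target frames by the preset confidence threshold (the threshold value below is an assumption) reduces to:

```python
CONF_THRESH = 0.5  # preset confidence threshold (an assumed value)

def heads_from_detections(detections):
    """detections: (x1, y1, x2, y2, confidence) tuples output by the head
    detection model for one image frame. A target frame whose confidence
    reaches the threshold is kept as a head object, and its coordinate
    information becomes the corresponding head frame information."""
    return [(x1, y1, x2, y2)
            for (x1, y1, x2, y2, conf) in detections
            if conf >= CONF_THRESH]
```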
Further, as shown in fig. 10, the apparatus further includes:
the obtaining unit 801 is further configured to obtain a plurality of image frame samples, where the image frame samples are labeled with head objects;
an enhancement processing unit 807, configured to perform a data enhancement processing operation on the image frame sample to obtain a processed sample;
and the first model training unit 808 is configured to input the sample into a convolutional neural network for training, so as to obtain a human head detection model.
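The application does not enumerate the data enhancement operations; by way of assumption only, a typical torchvision pipeline might combine flipping and color jitter (geometric transforms would also have to be applied to the labeled head boxes, which is omitted here):

```python
import torchvision.transforms as T

# A hypothetical enhancement pipeline; the specific operations are not
# prescribed by the application.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # mirror the scene
    T.ColorJitter(brightness=0.3, contrast=0.3),   # simulate light changes
])
```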
Further, as shown in fig. 10, the apparatus further includes:
the obtaining unit 801 is further configured to obtain a video data sample, where the video data sample includes a plurality of consecutive image frame samples;
the human head detection processing unit 802 is further configured to perform human head detection processing on the image frame sample to obtain a human head frame information sample corresponding to at least one human head object sample included in the image frame sample;
the tracking processing unit 803 is further configured to perform tracking processing on a target human head object sample from a plurality of image frame samples according to the human head frame information sample, so as to obtain a tracking information sample corresponding to the target human head object sample, where the tracking information sample includes a human head frame information sample corresponding to the target human head object sample in the image frame samples;
a grouping unit 809, configured to divide the tracking information sample into multiple groups of sample information by taking a preset number of image frames as a group of samples, where each group of sample information at least includes a human head frame information sample that is obtained from an adjacent image frame sample and matches the target human head object sample;
a labeling unit 810, configured to label each set of sample information with a corresponding label;
and the second model training unit 811 is configured to input the multiple sets of sample information carrying the labels into a recurrent neural network for training, and output a standing behavior recognition model for human behavior recognition.
To sum up, the embodiments of the present application provide a standing behavior recognition method and device. While a person moves in the shooting scene, the transmitted video data is continuously received at the server side. The embodiment of the present application can determine a target person (i.e., the tracked target person) from any image frame, and then either apply head detection and tracking to each subsequently received image frame in real time, or perform head detection and tracking in batches after a plurality of image frames have been received; these two parallel technical schemes give the embodiment of the present application a more flexible choice of deployment. After the tracking information corresponding to the target head object is obtained, the tracking information is processed by the preset standing behavior recognition model, so as to identify whether the behavior of the tracked target person represented by the target head object is a standing behavior. By recognizing the target head object in place of the whole human body, the embodiment of the present application effectively avoids the situation in which some persons become difficult to recognize clearly as the number of persons in the scene increases; moreover, the combination of head detection processing, tracking processing, and model processing effectively resists the influence of objective factors such as light changes, improving the accuracy of human behavior recognition.
The standing behavior recognition device provided by the embodiment of the application comprises a processor and a memory, wherein the acquisition unit, the human head detection processing unit, the tracking processing unit, the model processing unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises one or more kernels, and a kernel calls the corresponding program unit from the memory. By adjusting kernel parameters, the influence of objective factors such as body occlusion and light changes caused by an increased number of persons in the scene can be mitigated, so that whether a human behavior is a standing behavior can be identified more accurately.
An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for identifying a standing behavior as described above.
An embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of standing behavior recognition as described above when executing the computer program.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and comprises at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A method of standing-up behavior recognition, the method comprising:
acquiring video data;
performing human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame;
tracking a target human head object from the image frame according to the human head frame information to obtain tracking information corresponding to the target human head object, wherein the tracking information comprises human head frame information corresponding to the target human head object in the image frame;
inputting the tracking information into a preset standing behavior recognition model, and outputting a human behavior recognition result of the target human head object, wherein the preset standing behavior recognition model is obtained by training in advance based on the tracking information of the human head object sample and the standing behavior recognition result marked on the tracking information.
2. The method according to claim 1, wherein if the number of the image frames included in the video data is one, the tracking processing a target head object from the image frames according to the head frame information to obtain tracking information corresponding to the target head object includes:
judging whether the image frame is an initial image frame;
if the image frame is judged to be the initial image frame, adding a tracking mark to the head object detected from the initial image frame and taking that head object as a target head object; searching human head frame information corresponding to the target human head object from other continuous image frames behind the image frame, as tracking information of the target human head object;
if the image frame is not the initial image frame, searching the head frame information corresponding to the target head object from the image frame according to the head frame information corresponding to the target head object existing in the adjacent image frame before the image frame, and using the head frame information as the tracking information of the target head object in the image frame.
3. The method according to claim 2, wherein the searching for the frame information corresponding to the target head object from the image frames according to the frame information corresponding to the target head object existing in the adjacent image frames before the image frames, as the tracking information for the target head object in the image frames, comprises:
acquiring at least one detected first head object and head frame information corresponding to the first head object from the image frame;
acquiring at least one detected second head object and head frame information corresponding to the second head object from an adjacent image frame before the image frame, wherein the second head object carries a unique corresponding tracking mark, and the tracking mark is used for marking the head object matched with the target head object in different image frames;
searching a first head object matched with the second head object from the image frame by calculating and comparing the head frame information corresponding to the first head object with the head frame information corresponding to the second head object;
adding, to the first head object matched with the second head object, the same tracking mark as that carried by the second head object;
and taking the head frame information corresponding to the first head object carrying the same tracking mark as the tracking information of the target head object in the image frame.
4. The method according to claim 1, wherein if the number of the image frames included in the video data is multiple, the tracking processing a target head object from the image frames according to the head frame information to obtain tracking information corresponding to the target head object includes:
determining a starting image frame and other image frames subsequent to the starting image frame from a plurality of image frames;
adding a tracking mark to a human head object detected from the starting image frame as a target human head object;
taking the human head objects detected from the other image frames as other human head objects;
calculating and comparing, one by one and starting from the starting image frame, the head objects contained in each pair of adjacent image frames together with the head frame information corresponding to those head objects, and searching the other image frames for a first other head object matched with the target head object;
adding the same tracking mark to a first other head object matched with the target head object according to the tracking mark carried by the target head object;
and utilizing the corresponding human head frame information of the first other human head object with the same tracking mark to form the tracking information corresponding to the target human head object.
5. The method of claim 4, further comprising:
in the process of searching the other image frames for a first other human head object matched with the target human head object, if a second other human head object that is not matched exists in the other image frames, adding a tracking mark to the second other human head object to serve as a new target human head object;
taking the image frame where the new target human head object is located as a new starting image frame;
searching for a third other human head object matching the new target human head object from a plurality of continuous image frames adjacent to the new starting image frame;
adding the same tracking mark to the third other human head object matched with the new target human head object according to the tracking mark carried by the new target human head object;
and utilizing the human head frame information corresponding to the third other human head object carrying the same tracking mark to form the tracking information corresponding to the new target human head object.
6. The method according to claim 1, wherein before the inputting the tracking information into a preset standing behavior recognition model and outputting a human behavior recognition result of the target human head object, the method further comprises:
judging whether the number of other human head objects in the tracking information that are matched with the target human head object reaches a preset threshold value;
if so, processing the tracking information by using the preset standing behavior recognition model and outputting a human behavior recognition result, wherein the human behavior recognition result is one of the following three behavior states: standing-up behavior, sitting-down behavior, and state preservation;
if not, continuing to receive tracking information corresponding to the target head object acquired from further continuous image frames until the number of other head objects matched with the target head object in the accumulated tracking information reaches the preset threshold value, and then triggering the preset standing behavior recognition model to execute the human behavior recognition operation.
7. The method according to claim 1, wherein the performing human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame includes:
processing the image frame by using a preset human head detection model, and outputting coordinate information of a target frame detected from the image frame and confidence of the target frame, wherein the coordinate information comprises an upper left corner position coordinate and a lower right corner position coordinate of the target frame in the image frame;
judging whether the confidence coefficient of the target frame reaches a preset confidence coefficient threshold value or not;
if yes, determining that the target frame is a human head object;
and taking the coordinate information corresponding to the target frame as the human head frame information corresponding to the human head object detected from the image frame.
8. The method according to any one of claims 1 to 7, further comprising:
acquiring a plurality of image frame samples, wherein the image frame samples are marked with human head objects;
performing data enhancement processing operation on the image frame sample to obtain a processed sample;
and inputting the sample into a convolutional neural network for training to obtain a human head detection model.
9. The method according to any one of claims 1 to 7, further comprising:
obtaining a video data sample, the video data sample comprising a plurality of consecutive image frame samples;
performing human head detection processing on the image frame sample to obtain a human head frame information sample corresponding to at least one human head object sample contained in the image frame sample;
tracking a target human head object sample from a plurality of image frame samples according to the human head frame information sample to obtain a tracking information sample corresponding to the target human head object sample, wherein the tracking information sample comprises a human head frame information sample corresponding to the target human head object sample in the image frame samples;
dividing the tracking information sample into a plurality of groups of sample information by taking a preset number of image frames as one group of samples, wherein each group of sample information at least comprises human head frame information samples which are acquired from adjacent image frame samples and matched with the target human head object sample;
labeling corresponding labels to each group of sample information;
and inputting a plurality of groups of sample information carrying the labels into a recurrent neural network for training, and outputting a standing behavior recognition model for recognizing human body behaviors.
10. A standing behavior recognition apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire video data;
the human head detection processing unit is used for carrying out human head detection processing on at least one image frame included in the video data to obtain human head frame information corresponding to at least one human head object included in the image frame;
the tracking processing unit is used for tracking a target human head object from the image frame according to the human head frame information to obtain tracking information corresponding to the target human head object, wherein the tracking information comprises human head frame information corresponding to the target human head object in the image frame;
and the model processing unit is used for inputting the tracking information into a preset standing behavior recognition model and outputting a human behavior recognition result of the target human head object, wherein the preset standing behavior recognition model is a model obtained by pre-training based on the tracking information of the human head object sample and the standing behavior recognition result marked on the tracking information.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the standing behavior identification method according to any one of claims 1-9.
12. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, the processor implementing the standing behavior identification method according to any of claims 1-9 when executing the computer program.
CN202210440719.1A 2022-04-25 2022-04-25 Standing behavior identification method and device Pending CN114783060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210440719.1A CN114783060A (en) 2022-04-25 2022-04-25 Standing behavior identification method and device

Publications (1)

Publication Number Publication Date
CN114783060A (en)

Family

ID=82433190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210440719.1A Pending CN114783060A (en) 2022-04-25 2022-04-25 Standing behavior identification method and device

Country Status (1)

Country Link
CN (1) CN114783060A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205982A (en) * 2022-09-08 2022-10-18 深圳市维海德技术股份有限公司 Standing tracking detection method, electronic device, and medium

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 101, floor 1, building 3, yard 18, Kechuang 10th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing 100176
Applicant after: Beijing ESWIN Computing Technology Co.,Ltd.
Address before: Room 101, floor 1, building 3, yard 18, Kechuang 10th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing 100176
Applicant before: Beijing yisiwei Computing Technology Co.,Ltd.