CN113065504A - Behavior identification method and device - Google Patents

Behavior identification method and device

Info

Publication number
CN113065504A
CN113065504A (Application CN202110407936.6A)
Authority
CN
China
Prior art keywords
behavior
image
frame
recognition
limb
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110407936.6A
Other languages
Chinese (zh)
Inventor
程斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sias Shanghai Information Technology Co ltd
Original Assignee
Sias Shanghai Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sias Shanghai Information Technology Co ltd filed Critical Sias Shanghai Information Technology Co ltd
Priority to CN202110407936.6A
Publication of CN113065504A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a behavior recognition method and device and a computer-readable storage medium. The behavior recognition method comprises the following steps: acquiring multi-frame continuous images of a recognition object; extracting the posture features of the recognition object in each image; combining the posture features of the recognition object in time sequence to obtain the behavior features of the recognition object in the multi-frame continuous images; and inputting the behavior features into a pre-trained behavior analysis module to identify the behavior of the recognition object in the multi-frame continuous images. By implementing the behavior recognition method, the invention can recognize continuous behavior actions of a human body from multi-frame continuous images, thereby meeting the actual recognition requirements for dynamic behaviors in fields such as intelligent control, security monitoring, sports competition, film and television entertainment, and education assistance, and expanding the application range of human body recognition technology.

Description

Behavior identification method and device
Technical Field
The present invention relates to the field of Computer Vision (Computer Vision) technologies, and in particular, to a method for recognizing continuous behaviors based on human body limb point detection, a recognition apparatus implementing the recognition method, and a corresponding Computer-readable storage medium.
Background
The gesture recognition technology is an emerging technology in the technical field of computer science, can automatically recognize facial expressions, body gestures and finger actions of a recognized person based on a human body image, outputs corresponding gesture classification results, and has great commercial value in multiple fields of intelligent control, security monitoring, sports competition, movie and television entertainment, education assistance and the like.
Existing gesture recognition techniques are implemented primarily on the basis of the static image itself that contains the body gestures. In these schemes, a deep learning model is generally trained on the pixel patterns of a large number of images, and the posture of the identified person is recognized from the feature data of all pixels in the image to be recognized. Such schemes therefore suffer from high feature dimensionality and a heavy data processing burden, and cannot meet the practical requirements of various fields for recognizing continuous human behaviors.
In order to overcome the above defects, there is a need in the art for a continuous behavior recognition technology for recognizing continuous behavior actions of a human body according to multi-frame continuous images, so as to further meet actual recognition requirements for dynamic behaviors in multiple fields such as intelligent control, security monitoring, sports competition, movie and television entertainment, and education assistance, and to expand the application range of the human body recognition technology.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In order to overcome the above-mentioned drawbacks, the present invention provides a behavior recognition method, a behavior recognition apparatus, and a computer-readable storage medium.
The behavior recognition method provided by the first aspect of the present invention includes the following steps: acquiring multi-frame continuous images of an identification object; extracting the posture characteristic of the recognition object in each image; combining the posture characteristics of the recognition object according to a time sequence to obtain the behavior characteristics of the recognition object in the multi-frame continuous images; and inputting the behavior characteristics into a pre-trained behavior analysis module to identify the behavior of the recognition object in the plurality of frames of continuous images. By implementing the behavior recognition method, the invention can recognize the continuous behavior actions of the human body according to the multi-frame continuous images, thereby further meeting the actual recognition requirements of a plurality of fields such as intelligent control, security monitoring, sports competition, movie and television entertainment, education assistance and the like on dynamic behaviors so as to expand the application range of the human body recognition technology.
The behavior recognition device provided according to the second aspect of the present invention includes a memory and a processor. The processor is connected to the memory and configured to implement the behavior recognition method provided by the first aspect of the present invention. By implementing the behavior recognition method, the behavior recognition device can recognize continuous behavior actions of the human body according to multi-frame continuous images, so that the actual recognition requirements of multiple fields such as intelligent control, security monitoring, sports competition, movie and television entertainment, education assistance and the like on dynamic behaviors are further met, and the application range of the human body recognition technology is expanded.
The above computer-readable storage medium provided according to a third aspect of the present invention has computer instructions stored thereon. The computer instructions, when executed by a processor, may implement the behavior recognition method provided by the first aspect of the invention. By implementing the behavior recognition method, the computer-readable storage medium can recognize continuous behavior actions of the human body according to the multi-frame continuous images, so that the actual recognition requirements of multiple fields such as intelligent control, security monitoring, sports competition, movie and television entertainment, education assistance and the like on dynamic behaviors are further met, and the application range of the human body recognition technology is expanded.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.
Fig. 1 illustrates a flow diagram of a behavior recognition method provided in accordance with some embodiments of the present invention.
Fig. 2A and 2B respectively illustrate schematic diagrams of confidence maps provided according to some embodiments of the invention.
FIG. 3 illustrates a schematic diagram of a correlation diagram provided in accordance with some embodiments of the invention.
Fig. 4 illustrates a schematic diagram of an end-to-end network architecture provided in accordance with some embodiments of the present invention.
FIG. 5 illustrates a schematic diagram of human key points provided in accordance with some embodiments of the present invention.
FIG. 6 illustrates a schematic diagram of a human body bounding box provided in accordance with some embodiments of the present invention.
FIG. 7 illustrates a schematic diagram of a behavior analysis module provided in accordance with some embodiments of the invention.
FIG. 8 illustrates a schematic diagram of an LSTM unit provided in accordance with some embodiments of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided for illustrative purposes, and other advantages and effects of the present invention will become apparent to those skilled in the art from the present disclosure. While the invention will be described in connection with the preferred embodiments, there is no intent to limit its features to those embodiments. On the contrary, the invention is described in connection with the embodiments for the purpose of covering alternatives or modifications that may be extended based on the claims of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may, however, be practiced without these particulars. Moreover, some specific details are omitted from the description so as to avoid confusing or obscuring the focus of the present invention.
As described above, the existing gesture recognition technology is mainly implemented on the basis of the still image itself containing the human body gesture. In these schemes, a deep learning model is generally trained on the pixel patterns of a large number of images, and the posture of the identified person is recognized from the feature data of all pixels in the image to be recognized. Such schemes therefore suffer from high feature dimensionality and a heavy data processing burden, and cannot meet the practical requirements of various fields for recognizing continuous human behaviors.
In order to overcome the above-mentioned drawbacks, the present invention provides a behavior recognition method, a behavior recognition apparatus, and a computer-readable storage medium.
In some non-limiting embodiments, the behavior recognizing method provided in the first aspect of the present invention may be implemented by the behavior recognizing apparatus provided in the second aspect of the present invention. Specifically, the behavior recognizing device provided by the second aspect of the present invention may include a memory and a processor. The memory includes, but is not limited to, the above-described computer-readable storage medium provided by the third aspect of the invention having computer instructions stored thereon. The processor is connected to the memory and configured to execute the computer instructions stored in the memory to implement the behavior recognition method provided in the first aspect.
The working principle of the above-described behavior recognition device will be described below in connection with some embodiments of the behavior recognition method. It will be appreciated by those skilled in the art that these examples of the behavior recognition method are merely non-limiting embodiments provided by the present invention, intended to clearly illustrate the broad concepts of the present invention and to provide some detailed illustrations convenient for the public, rather than to limit the manner in which the behavior recognition apparatus operates. Conversely, the behavior recognition apparatus is also only a non-limiting embodiment provided by the present invention, and does not limit the subject of implementation of these behavior recognition methods.
Referring to fig. 1, fig. 1 is a flow chart illustrating a behavior recognition method according to some embodiments of the invention.
As shown in fig. 1, in some embodiments of the present invention, a pose estimation module and a behavior analysis module 13 may be configured in the behavior recognition apparatus, wherein the pose estimation module further includes a limb point generation module 11 and a posture feature generation module 12. In the course of implementing the behavior recognition method, the processor of the behavior recognition device may first acquire the multi-frame continuous images 101 to 132 of the recognition object. The multi-frame continuous images 101 to 132 may be real-time continuous images acquired by an ordinary camera such as an IP camera or a USB camera, or may be cached continuous images acquired from a video file. Then, the processor may sequentially input the acquired multi-frame continuous images 101 to 132 into the limb point generation module 11 to acquire the coordinate information of the limb points 201 to 218 of the recognition object in each frame image 101 to 132. Next, the processor may input the coordinate information of the limb points 201 to 218 of the recognition object into the posture feature generation module 12 to extract the posture feature of the recognition object in each frame image. The processor may then combine the posture features of the recognition object in time sequence to obtain the behavior feature of the recognition object in the multi-frame continuous images. Finally, the processor may input the obtained behavior feature into the pre-trained behavior analysis module 13 to identify the behavior of the recognition object in the multi-frame continuous images.
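Purely as an illustration of the data flow just described, the following Python sketch chains the three modules frame by frame. The function names (limb_point_module, pose_feature_module, behavior_module) are placeholders standing in for modules 11, 12 and 13; they are assumptions for illustration, not interfaces defined by the patent.

```python
# Hypothetical sketch of the FIG. 1 pipeline; the callables are placeholders, not patent APIs.
def recognize_behavior(frames, limb_point_module, pose_feature_module, behavior_module):
    """frames: 32 consecutive images (e.g. decoded from an IP/USB camera stream or a video file)."""
    pose_features = []
    prev_points = None
    for image in frames:
        points = limb_point_module(image)                    # module 11: limb point coordinates
        feature = pose_feature_module(points, prev_points)   # module 12: per-frame posture feature
        pose_features.append(feature)
        prev_points = points
    behavior_feature = pose_features                         # combined in time order (32 x 24)
    return behavior_module(behavior_feature)                 # module 13: pre-trained classifier
```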
Specifically, in some embodiments, the limb point generating module 11 may obtain coordinate information of the plurality of limb points 201 to 218 of the recognition object in each frame image 101 to 132 based on a plurality of Confidence maps (Confidence maps). Referring to fig. 2A and 2B, fig. 2A and 2B respectively illustrate schematic diagrams of confidence maps provided according to some embodiments of the present invention.
In the embodiment shown in fig. 2A and 2B, in response to inputting a frame of image (e.g., image 101) containing a recognition object, the limb point generation module 11 may generate confidence maps for the frame image 101 and represent the positions of the limb points of the recognition object in the frame image 101 with the confidence maps. Specifically, the number of confidence map layers output by the limb point generation module 11 may be the same as the number of categories of the human limb points 201 to 218, so that each confidence map corresponds to one category of limb points 201 to 218 of the human body.
For example, in the embodiment shown in FIG. 2A, the confidence map corresponds to the left elbow 207 of the identified object.
The confidence map may represent the confidence of each pixel point with respect to the left elbow 207 by adjusting the brightness, so as to form a peak of the confidence at the position of the left elbow 207 of the two identified objects, respectively. In this way, the limb point generating module 11 can determine the coordinate information of the left elbow 207 of the two identified objects in the frame image 101 according to the positions of the two confidence peaks in the confidence map.
For another example, in the embodiment shown in FIG. 2B, the confidence map corresponds to the left shoulder 206 of the identified object.
The confidence map may represent the confidence of each pixel point with respect to the left shoulder 206 by adjusting the brightness, so that a peak of the confidence is formed at the position of the left shoulder 206 of each of the two recognition objects. In this way, the limb point generation module 11 can determine the coordinate information of the left shoulders 206 of the two recognition objects in the frame image 101 according to the positions of the two confidence peaks in the confidence map.
By analogy, the limb point generation module 11 may determine the coordinate information of each of the limb points 201 to 218 of the identified object in each of the frames of images 101 to 132 according to the confidence peak values in the multiple confidence maps corresponding to each of the frames of images 101 to 132, which is not described herein again.
It will be appreciated by those skilled in the art that the above scheme of including two recognition objects and two confidence peaks in one confidence map is only a non-limiting example provided by the present invention, and is intended to clearly demonstrate the broad concepts of the present invention and to provide a specific scheme for facilitating the implementation by the public without limiting the scope of the invention. Alternatively, in other embodiments, if there is only one identified object in a frame of image, there should be only one peak in each confidence map. Similarly, in other embodiments, if there are n identified objects in a frame of image, there should be n peaks in each confidence map.
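One way to read limb-point coordinates out of a confidence map, consistent with the description above, is to keep the local maxima whose confidence exceeds a threshold; with n recognition objects in the frame, up to n peaks are expected per map. The sketch below is a minimal NumPy implementation under those assumptions (the threshold value is illustrative, not specified by the patent).

```python
import numpy as np

def peaks_from_confidence_map(conf_map, threshold=0.3):
    """Return (x, y, score) for each local maximum of one limb-point confidence map.

    With n recognition objects in the frame, up to n peaks are expected per map.
    The threshold is an illustrative value only.
    """
    h, w = conf_map.shape
    peaks = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = conf_map[y, x]
            if v < threshold:
                continue
            neighborhood = conf_map[y - 1:y + 2, x - 1:x + 2]
            if v >= neighborhood.max():          # local maximum within its 3x3 neighborhood
                peaks.append((x, y, float(v)))
    return peaks
```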
Then, for the embodiment in which only one recognition object exists in the image, the processor can determine the posture feature of the recognition object in the frame image directly according to the coordinate information of each limb point 201 to 218 of the recognition object in the frame image. However, for the embodiment shown in fig. 2A and 2B in which the image contains a plurality of recognition objects, the processor needs to further use a plurality of association maps (Part Affinity Fields) to determine which recognition object each limb point 201 to 218 in the image belongs to, so as to determine the posture feature of each recognition object in the frame image according to the coordinate information of the limb points 201 to 218 of that recognition object.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a correlation diagram according to some embodiments of the invention.
As shown in FIG. 3, in some embodiments of the present invention, in response to inputting a frame of image (e.g., image 101) containing a recognition object, the limb point generation module 11 may further generate a plurality of association maps for the frame image 101 and represent the connection relationships between the limb points 201 to 218 in the frame image 101 with the association maps. Specifically, the number of association map layers output by the limb point generation module 11 may be the same as the number of limb segments formed by the limb points 201 to 218, so that each association map corresponds to one type of limb segment of the human body.
For the embodiment of the left shoulder 206 and the left elbow 207 shown in fig. 3, the association map may represent the direction vector of the limb segment on which each pixel is located by adjusting the brightness value. Specifically, in the association map of FIG. 3, there is one affinity region between each pair of limb points (e.g., 206 and 207). Each pixel in the affinity region has a 2D vector describing its direction. If a pixel is not located on the left upper arm (i.e., the limb segment connecting the left shoulder 206 and the left elbow 207) of the identified object, the brightness value of this pixel may be low. Conversely, if a pixel is located on the left upper arm of the identified object, the brightness value of this pixel may be high or even reach a peak. The highlighted left upper arms of the two recognition objects in fig. 3 indicate that the association map holds particularly meaningful values there, whose magnitudes encode the unit direction vector pointing from the left shoulder 206 to the left elbow 207. Thus, the limb point generation module 11 can determine the connection relationship between the two left shoulders 206 and the two left elbows 207 according to the positions of the two segments of association peaks in the association map, and further connect the limb points 201 to 218 one by one according to the association peaks in each association map, so as to determine the recognition object to which each limb point 201 to 218 belongs.
It will be appreciated by those skilled in the art that the above scheme of a correlation diagram including two identification objects and two correlation peaks is only a non-limiting example provided by the present invention, and is intended to clearly illustrate the main concept of the present invention and provide a specific scheme convenient for the public to implement without limiting the scope of the present invention. Alternatively, in other embodiments, if there is only one identified object in one frame of image, there should be only one peak in each correlation map. Similarly, in other embodiments, if there are n identification objects in one frame of image, there should be n peaks in each correlation map.
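The association step described above can be sketched as follows: for each candidate pair of limb points (e.g., a left shoulder and a left elbow), the 2D vectors of the corresponding association map are sampled along the straight line between the two candidates and projected onto the unit vector of that line; a high average projection suggests that the two points belong to the same person. This is a simplified sketch of the general technique under stated assumptions; the patent does not spell out this exact scoring and matching rule.

```python
import numpy as np

def paf_connection_score(point_a, point_b, paf_x, paf_y, num_samples=10):
    """Score one candidate connection (e.g. left shoulder -> left elbow) against one
    association map, given as two channels paf_x, paf_y of per-pixel 2D vectors."""
    a = np.asarray(point_a, dtype=float)
    b = np.asarray(point_b, dtype=float)
    direction = b - a
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        return 0.0
    unit = direction / norm
    scores = []
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (a + t * direction).round().astype(int)
        vec = np.array([paf_x[y, x], paf_y[y, x]])
        scores.append(float(np.dot(vec, unit)))   # alignment of the field with the limb direction
    return float(np.mean(scores))

# Keeping the highest-scoring candidate pairs assigns each limb point to one recognition object.
```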
In some embodiments of the present invention, the above-mentioned scheme for determining the coordinate information of the limb points 201 to 218 of the recognition objects in the frame images based on the confidence maps and the correlation maps may be implemented based on a pre-trained end-to-end neural network model. Referring to fig. 4, fig. 4 illustrates a schematic diagram of an end-to-end network structure provided in accordance with some embodiments of the present invention.
In the embodiment shown in fig. 4, the limb point generation module 11 employs an end-to-end neural network that uses a VGG19 classification network, subjected to some customized network tailoring, as its feature extraction network, where F denotes the output of the feature extraction network. As shown in fig. 4, the core part of the network is divided into an upper structure and a lower structure: the convolutional layer module in the upper half outputs the confidence maps (S), and the convolutional layer module in the lower half outputs the association maps (L) of the limb points. The loss functions of the upper and lower paths calculate the L2 loss between the predicted values and the ideal values. Further, the network may be divided into a plurality of steps (stages), and after each stage ends, the confidence maps (S) and association maps (L) are merged with the feature extraction result (F).
It will be understood by those skilled in the art that although fig. 4 shows only two stages of the network, the remaining stages are omitted only for convenience of illustrating the relationship between stages. In some embodiments of the present invention, the network may use 5 stages to obtain the confidence maps (S) and association maps (L) corresponding to each frame image. Optionally, in other embodiments, the number of stages may also be adjusted appropriately according to the accuracy requirement of limb point recognition to obtain the confidence maps (S) and association maps (L) corresponding to each frame image.
Further, since the data format of the training sample data set is the position coordinates of each limb point of the human body, while the loss function during training is an L2 loss over the confidence maps and association maps, the technician needs to perform format conversion on the training sample data set during training, converting the labeled data of each limb point into the corresponding confidence maps and association maps, and, during testing, converting the confidence maps and association maps output by the network into the coordinate information of the limb points. On the basis of a network trained in this way, the limb point generation module 11 can output the position coordinates of the limb points 201 to 218 of the human body in the image from an input RGB three-channel image.
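A hedged PyTorch sketch of the two-branch, multi-stage structure described above: a tailored VGG19 feature extractor producing F, and per-stage branches predicting the confidence maps S and association maps L, with S, L and F merged before the next stage and an L2 loss applied to every stage's outputs. The channel counts, layer sizes and default stage number below are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class StageBranch(nn.Module):
    """One prediction branch of one stage (either the S branch or the L branch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_ch, 1),
        )

    def forward(self, x):
        return self.net(x)

class TwoBranchPoseNet(nn.Module):
    """Sketch: tailored VGG19 backbone -> stages, each outputting confidence maps S
    (one channel per limb point, 18 here) and association maps L (2 channels per
    limb segment, 8 segments assumed here)."""
    def __init__(self, num_points=18, num_limbs=8, num_stages=5):
        super().__init__()
        self.backbone = vgg19(weights=None).features[:23]   # truncated ("tailored") VGG19 -> 512 ch
        feat_ch, s_ch, l_ch = 512, num_points, 2 * num_limbs
        self.stages = nn.ModuleList()
        in_ch = feat_ch
        for _ in range(num_stages):
            self.stages.append(nn.ModuleDict({
                "S": StageBranch(in_ch, s_ch),
                "L": StageBranch(in_ch, l_ch),
            }))
            in_ch = feat_ch + s_ch + l_ch        # S, L and F are merged before the next stage

    def forward(self, image):
        F = self.backbone(image)
        x, outputs = F, []
        for stage in self.stages:
            S, L = stage["S"](x), stage["L"](x)
            outputs.append((S, L))               # an L2 loss is applied to S and L of every stage
            x = torch.cat([F, S, L], dim=1)
        return outputs
```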
Referring to fig. 5, fig. 5 illustrates a schematic diagram of human key points provided according to some embodiments of the present invention.
As shown in fig. 5, in some embodiments of the present invention, the limb point generation module 11 may output a plurality of limb points of an identified subject from one frame of input image, such as the nose 201, chest 202, right shoulder 203, right elbow 204, right wrist 205, left shoulder 206, left elbow 207, left wrist 208, right hip 209, right knee 210, right ankle 211, left hip 212, left knee 213, left ankle 214, right eye 215, left eye 216, right ear 217, and left ear 218. The posture feature generation module 12 may generate the posture features of the human body based on the coordinate information of the limb points of the four limbs, namely the right shoulder 203, right elbow 204, right wrist 205, left shoulder 206, left elbow 207, left wrist 208, right hip 209, right knee 210, right ankle 211, left hip 212, left knee 213, and left ankle 214, for the behavior analysis module 13 to recognize the continuous behavior of the limbs of the human body.
As shown in FIG. 1, after acquiring the coordinate information of each limb point 203-214 of each recognition object in the multiple frames of continuous images 101-132, the processor may input the coordinate information of each limb point 203-214 of each recognition object into the pose feature generation module 12, respectively, to determine the pose feature of each recognition object in each frame of image 101-132.
Taking the second frame image 102 as an example, in response to inputting the coordinate information of the limb points of the frame image 102, the posture feature generation module 12 may first determine the angle information A_{2,1} of the right upper arm of the recognition object from the coordinate information of the right shoulder 203 and the right elbow 204; determine the angle information A_{2,2} of the right forearm from the coordinate information of the right elbow 204 and the right wrist 205; determine the angle information A_{2,3} of the left upper arm from the coordinate information of the left shoulder 206 and the left elbow 207; determine the angle information A_{2,4} of the left forearm from the coordinate information of the left elbow 207 and the left wrist 208; determine the angle information A_{2,5} of the right thigh from the coordinate information of the right hip 209 and the right knee 210; determine the angle information A_{2,6} of the right lower leg from the coordinate information of the right knee 210 and the right ankle 211; determine the angle information A_{2,7} of the left thigh from the coordinate information of the left hip 212 and the left knee 213; and determine the angle information A_{2,8} of the left lower leg from the coordinate information of the left knee 213 and the left ankle 214. It is understood that the angle information A_{2,1} to A_{2,8} may indicate the angle of the corresponding limb segment relative to the horizontal direction, the vertical direction, or any custom standard direction, and thereby characterizes the posture of each limb segment in the frame image 102. In some embodiments, the angle between a limb segment and the standard direction may be defined uniformly and expressed by an inverse trigonometric function (arctan), so as to ensure the data uniformity of the angle information.
Thereafter, the posture feature generation module 12 may calculate the difference between the above angle information A_{2,1} to A_{2,8} and the angle information A_{1,1} to A_{1,8} of each limb segment n in the previous frame image 101, i.e. ΔA_{2,n} = |A_{2,n} − A_{1,n}|, as the angle difference information ΔA_{2,1} to ΔA_{2,8} of each limb segment n corresponding to the second frame image 102. The angle difference information ΔA_{2,1} to ΔA_{2,8} can characterize the amount of angular change of each limb segment n between the two frame images 101 and 102, and thus the rate of angular change of each limb segment.
In addition, the posture feature generation module 12 may take the end limb point of each limb segment n (e.g., the right elbow 204 at the end of the right upper arm) as the positioning limb point P_n, and calculate the Euclidean distance between the coordinate position (x_{2,n}, y_{2,n}) of each positioning limb point P_{2,n} in the current frame image 102 and the coordinate position (x_{1,n}, y_{1,n}) of the corresponding positioning limb point P_{1,n} in the previous frame image 101, i.e.
ΔP_{2,n} = √((x_{2,n} − x_{1,n})² + (y_{2,n} − y_{1,n})²),
as the limb point difference information ΔP_{2,1} to ΔP_{2,8} of each limb segment n corresponding to the current frame image 102. The limb point difference information ΔP_{2,1} to ΔP_{2,8} can characterize the amount of positional change of each limb segment n between the two frame images 101 and 102, and thus the rate of positional change of each limb segment.
Optionally, in other embodiments, the posture feature generation module 12 may also take the head limb point of each limb segment n (e.g., the right shoulder 203 at the head end of the right upper arm) as the positioning limb point P_n, and calculate the Euclidean distance between the coordinate position (x_{2,n}, y_{2,n}) of each positioning limb point P_{2,n} in the current frame image 102 and the coordinate position (x_{1,n}, y_{1,n}) of the corresponding positioning limb point P_{1,n} in the previous frame image 101 as the limb point difference information ΔP_{2,1} to ΔP_{2,8} of each limb segment n corresponding to the current frame image 102.
Then, the posture feature generation module 12 may sequentially combine the angle information A_{2,n}, the angle difference information ΔA_{2,n}, and the limb point difference information ΔP_{2,n} of each limb segment n corresponding to the frame image 102 as the posture feature of the recognition object in the frame image 102, i.e. (A_{2,1}, ΔA_{2,1}, ΔP_{2,1}, …, A_{2,8}, ΔA_{2,8}, ΔP_{2,8}). For the above embodiment in which the 12 limb points 203 to 214 of the four limbs form 8 limb segments, the posture feature of each recognition object in each frame image 101 to 132 is 24-dimensional.
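Under the definitions above (one angle per limb segment via arctan, the per-frame angle difference, and the Euclidean displacement of the positioning limb point), the 24-dimensional posture feature of one frame could be computed as in the sketch below. The keypoint indices follow the FIG. 5 numbering; treating atan2 as the inverse trigonometric function and using the end limb point as positioning point are assumptions consistent with the embodiment described above.

```python
import math

# (head limb point, end limb point) index pairs for the 8 limb segments (FIG. 5 numbering).
LIMB_SEGMENTS = [
    (203, 204), (204, 205),   # right upper arm, right forearm
    (206, 207), (207, 208),   # left upper arm, left forearm
    (209, 210), (210, 211),   # right thigh, right lower leg
    (212, 213), (213, 214),   # left thigh, left lower leg
]

def pose_feature(points, prev_points, prev_angles):
    """points, prev_points: dict keypoint_id -> (x, y) for the current / previous frame.
    prev_angles: the 8 limb angles of the previous frame. Returns (feature, angles)."""
    feature, angles = [], []
    for n, (head, end) in enumerate(LIMB_SEGMENTS):
        (x1, y1), (x2, y2) = points[head], points[end]
        angle = math.atan2(y2 - y1, x2 - x1)          # A_{t,n}: limb angle via arctan
        delta_angle = abs(angle - prev_angles[n])     # ΔA_{t,n} = |A_{t,n} - A_{t-1,n}|
        px, py = prev_points[end]                     # end limb point as positioning point
        delta_pos = math.hypot(x2 - px, y2 - py)      # ΔP_{t,n}: Euclidean displacement
        feature += [angle, delta_angle, delta_pos]
        angles.append(angle)
    return feature, angles                            # feature has 8 x 3 = 24 dimensions
```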
Compared with prior-art schemes that recognize the posture of the identified person based on the feature data of all pixels in the image to be recognized, the present invention can recognize the posture of the recognition object from the coordinate information of a small number of limb points 201 to 218, and achieves a better recognition effect in multi-person scenes with partial occlusion.
It will be appreciated by those skilled in the art that the above scheme for recognizing the continuous behavior of the human body based on the coordinate information of the limb points of the four limbs, namely the right shoulder 203, right elbow 204, right wrist 205, left shoulder 206, left elbow 207, left wrist 208, right hip 209, right knee 210, right ankle 211, left hip 212, left knee 213 and left ankle 214, is only a non-limiting embodiment provided by the present invention, intended to clearly illustrate the main concept of the present invention and provide a specific scheme convenient for the public to implement, not to limit the scope of the present invention. Optionally, in other embodiments, the posture feature generation module 12 may further generate the posture features of the human body based on the coordinate information of other limb points of the recognition object, such as the nose 201, chest 202, right eye 215, left eye 216, right ear 217 and left ear 218, so that the behavior analysis module 13 can further recognize continuous behaviors relating to the head, the expression, or other aspects of the human body.
Those skilled in the art can also understand that the above-mentioned construction of the posture features of the recognition object in each frame image by using the angle information, the angle difference information and the limb point difference information can not only significantly characterize various postures of the limbs in the continuous behavior, but also truly characterize the rates of rotation and displacement of the limbs in the continuous behavior, thereby improving the accuracy of the behavior recognition result. Optionally, in other embodiments, based on the concept of the present invention, a technician may also select other data about the limb points and/or limbs to construct the posture features according to the recognition requirements of the continuous behaviors of the head, the expression or other aspects of the human body, so as to ensure the accuracy of the behavior recognition result.
As described above, the present invention requires the recognition of the continuous behavior of the human body based on a plurality of frames of continuous images. In some embodiments, in order to deal with the case where a plurality of recognition objects are included in the image at the same time, the behavior recognition apparatus according to the second aspect of the present invention may further include a tracking module configured to track the plurality of recognition objects in the plurality of frames of consecutive images to respectively determine the postures of the recognition objects in each frame of image.
Specifically, the tracking module may first determine, from the output of the limb point generation module 11, how many recognition objects each frame image 101 to 132 contains, and determine the coordinate information of all the limb points 201 to 218 of each recognition object. Then, the tracking module may determine the human body bounding box of each recognition object in each frame image 101 to 132 according to the coordinate information of the limb points belonging to the same recognition object in that frame image.
Referring to fig. 6, fig. 6 illustrates a schematic diagram of a human body bounding box according to some embodiments of the present invention.
As shown in FIG. 6, in some embodiments, the tracking module may first determine, from the maximum and minimum coordinates of the limb points 201 to 218 (e.g., x_max = x_207, x_min = x_204, y_max = y_215, y_min = y_214), a rectangle enclosing all the limb points 201 to 218. Then, the tracking module may multiply the shoulder width of a normal human body by a preset scaling factor to generate an expansion value, and expand the area of the rectangle according to the expansion value to obtain the human body bounding box of the recognition object.
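A minimal sketch of that bounding-box construction: the min/max coordinates of one person's limb points give a tight rectangle, which is then expanded by a value derived from a nominal shoulder width times a scaling factor. The numeric defaults below are illustrative assumptions, not values fixed by the patent.

```python
def human_bounding_box(points, shoulder_width=60.0, scale=0.5):
    """points: dict keypoint_id -> (x, y) for one recognition object.
    shoulder_width and scale are illustrative values only."""
    xs = [p[0] for p in points.values()]
    ys = [p[1] for p in points.values()]
    expand = shoulder_width * scale              # outward expansion value
    return (min(xs) - expand, min(ys) - expand,  # (x_min, y_min, x_max, y_max)
            max(xs) + expand, max(ys) + expand)
```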
After obtaining the human body bounding box of each recognition object in each frame image 101 to 132, the tracking module may first calculate the coordinate ranges A_O11 and A_O12 occupied in the first frame image 101 by the human body bounding boxes of the recognition objects contained in the first frame image 101 (e.g., a first recognition object O_11 and a second recognition object O_12). Then, the tracking module may perform an Intersection over Union (IOU) operation between the human body bounding box of each recognition object in the second frame image 102 and each human body bounding box in the first frame image 101, so as to calculate the degree of coincidence between them.
Specifically, it is assumed that the second frame image 102 contains three recognition objects O_21, O_22 and O_23. The tracking module may first calculate the coordinate range A_O21 occupied by the human body bounding box of the recognition object O_21 in the second frame image 102, then take the intersection of the coordinate range A_O21 and the coordinate range A_O11 (i.e., A_O21 ∩ A_O11) to calculate their overlap area, and take the union of the coordinate range A_O21 and the coordinate range A_O11 (i.e., A_O21 ∪ A_O11) to calculate their combined area. Then, the tracking module may divide the overlap area by the combined area to obtain the intersection-over-union ratio of the human body bounding boxes of the recognition object O_21 and the recognition object O_11, i.e.
IOU_21-11 = |A_O21 ∩ A_O11| / |A_O21 ∪ A_O11|,
which characterizes the degree of coincidence of the two human body bounding boxes. Likewise, the tracking module may calculate, as described above, the intersection-over-union ratio IOU_21-12 of the human body bounding boxes of the recognition object O_21 and the recognition object O_12 to characterize the degree of coincidence of those two bounding boxes, which is not described herein again.
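For axis-aligned boxes, the intersection-over-union computation above reduces to a few lines; a sketch:

```python
def iou(box_a, box_b):
    """Boxes are (x_min, y_min, x_max, y_max). Returns intersection area / union area."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```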
After determining the intersection-over-union ratios between the human body bounding box of the recognition object O_21 and the human body bounding boxes of the recognition objects O_11 and O_12 in the first frame image 101, the tracking module may compare the ratio IOU_21-11 with the ratio IOU_21-12 and determine the human body bounding box corresponding to the larger one as the candidate bounding box. Then, the tracking module may use a pre-trained convolutional neural network to perform feature extraction on the image in the human body bounding box of the recognition object O_21 to obtain a first coding feature (encoding feature) with a length of 128 dimensions, perform feature extraction on the image in the candidate bounding box to obtain a second coding feature with a length of 128 dimensions, and calculate the cosine value cos θ of the first coding feature and the second coding feature to verify the similarity of the images in the two human body bounding boxes.
If the cosine value cos θ of the first coding feature and the second coding feature is greater than or equal to a preset cosine threshold, the tracking module may determine that the images in the two human body bounding boxes indicate the same recognition object, and therefore determine the recognition object O_21 in the second frame image 102 to be the recognition object O_11 in the first frame image 101, so as to achieve the object tracking effect. Conversely, if the cosine value cos θ of the first coding feature and the second coding feature is smaller than the preset cosine threshold, the tracking module may determine that the images in the two human body bounding boxes indicate different recognition objects, and further determine that the recognition object O_21 in the second frame image 102 is a newly appearing recognition object.
By analogy, the tracking module may also determine one by one, as described above, whether the recognition objects O_22 and O_23 in the second frame image 102 are the recognition objects O_11 or O_12 appearing in the first frame image 101, so as to complete the object tracking from the first frame image 101 to the second frame image 102. Further, the tracking module may also perform object tracking as described above on the recognition objects contained in the third frame image 103, the fourth frame image 104, and so on up to the thirty-second frame image 132, to determine the posture of each recognition object in each frame image 101 to 132.
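The appearance check just described can be sketched as follows: a 128-dimensional encoding feature is extracted from each crop by some pre-trained CNN, represented here by an abstract `encoder` callable (an assumption, the patent does not name a specific network), and the cosine of the two encodings is compared against a threshold. The threshold value below is illustrative; the patent only requires a preset threshold.

```python
import numpy as np

def same_person(crop_current, crop_candidate, encoder, cos_threshold=0.8):
    """encoder: any pre-trained CNN mapping an image crop to a 128-D vector (assumed)."""
    f1 = np.asarray(encoder(crop_current), dtype=float)    # first 128-D encoding feature
    f2 = np.asarray(encoder(crop_candidate), dtype=float)  # second 128-D encoding feature
    cos_theta = float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))
    return cos_theta >= cos_threshold        # same recognition object if above the threshold
```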
As shown in FIG. 1, after determining the posture of the same recognition object in each frame image 101 to 132, the processor of the behavior recognition device can combine the posture features P101 to P132 of the recognition object in the frame images 101 to 132 in time sequence to obtain the behavior feature of the recognition object in the multi-frame continuous images 101 to 132, i.e. (A_{1,1}, ΔA_{1,1}, ΔP_{1,1}, …, A_{1,8}, ΔA_{1,8}, ΔP_{1,8}, …, A_{n,1}, ΔA_{n,1}, ΔP_{n,1}, …, A_{n,8}, ΔA_{n,8}, ΔP_{n,8}, …, A_{32,1}, ΔA_{32,1}, ΔP_{32,1}, …, A_{32,8}, ΔA_{32,8}, ΔP_{32,8}). In this embodiment, the behavior feature may be 32 × 24 dimensional. Thereafter, the processor of the behavior recognition device may input the obtained behavior feature into the pre-trained behavior analysis module 13 to identify the continuous behavior performed by the recognition object in the multi-frame continuous images 101 to 132.
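Combining the per-frame 24-dimensional posture features in time order is a simple concatenation; assuming per-frame features such as those from the earlier sketch, the 32 × 24 behavior feature could be built as:

```python
import numpy as np

def behavior_feature(per_frame_features):
    """per_frame_features: list of 32 posture features, each of length 24.
    Returns a 32 x 24 array ordered by time, as fed to the behavior analysis module."""
    assert len(per_frame_features) == 32
    return np.asarray(per_frame_features, dtype=np.float32)   # shape (32, 24)
```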
Referring to fig. 7, fig. 7 illustrates a schematic diagram of a behavior analysis module according to some embodiments of the invention.
As shown in FIG. 7, in some embodiments of the present invention, the behavior analysis module 13 may be a classification network model implemented based on a Long Short-Term Memory (LSTM) network. A technician may connect the fully connected input layer 71, the Long Short Term Memory (LSTM) modules 721-722, and the fully connected output layer 73 in sequence to construct the behavior analysis module 13, wherein each of the Long Short Term Memory (LSTM) modules 721-722 may include a plurality of Long Short Term Memory (LSTM) units.
The long short-term memory (LSTM) network is a recurrent neural network designed to solve the long-term dependence problem of the conventional recurrent neural network (RNN). Owing to its unique design structure, the LSTM network is suitable for processing and predicting significant events in time series with very long intervals and delays, and performs better than ordinary recurrent neural networks (RNNs) and hidden Markov models (HMMs). Referring further to fig. 8, fig. 8 illustrates a schematic diagram of an LSTM unit provided in accordance with some embodiments of the present invention.
As shown in FIG. 8, each LSTM unit may include a short-term memory state h, a long-term memory cell C, a sigmoid activation function σ, a tanh activation function, a forget gate f_t = σ(x_t·U_f + h_{t−1}·W_f), an input gate i_t = σ(x_t·U_i + h_{t−1}·W_i), an output gate o_t = σ(x_t·U_o + h_{t−1}·W_o), and an output h_t = tanh(C_t)·o_t. In the embodiment shown in FIG. 8, the cell state C and the hidden state h of each LSTM unit serve as the inputs of the next LSTM unit, and h serves as the output of the present LSTM unit. In the above formulas, U and W are the parameters of the long short-term memory network to be trained.
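The gate equations above can be written out directly. The sketch below follows them, and additionally assumes the standard cell-state update C_t = f_t·C_{t−1} + i_t·tanh(x_t·U_g + h_{t−1}·W_g), which FIG. 8 implies but the text does not spell out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, U, W):
    """One LSTM unit step. U and W are dicts of trainable matrices keyed by gate name.
    The candidate/cell-state update (key 'g') is the standard one, assumed here."""
    f_t = sigmoid(x_t @ U["f"] + h_prev @ W["f"])     # forget gate
    i_t = sigmoid(x_t @ U["i"] + h_prev @ W["i"])     # input gate
    o_t = sigmoid(x_t @ U["o"] + h_prev @ W["o"])     # output gate
    g_t = np.tanh(x_t @ U["g"] + h_prev @ W["g"])     # candidate memory (assumed)
    C_t = f_t * C_prev + i_t * g_t                    # long-term memory passed to the next unit
    h_t = np.tanh(C_t) * o_t                          # output of this unit, per the text
    return h_t, C_t
```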
In constructing the behavior analysis module 13, a technician may first determine the recognition frame rate of the behavior analysis module 13, i.e. the frame rate of the images to be recognized that the behavior analysis module 13 is mainly intended to handle. The technician may then determine, in combination with the average duration of each behavior action in the video content, the number of frames that constitute one unit of continuous behavior, and thus the number of LSTM units included in each LSTM module 721, 722.
In some embodiments, for fighting videos with a frame rate of 25 frames per second, a technician may first count the duration of each action (e.g., limb actions such as walking, punching, and kicking) of the identified subject in the video. Taking an average duration of about 1.3 seconds per action as an example, the technician can determine the recognition unit of the behavior analysis module 13 as 32 frames per behavior from the recognition frame rate of 25 frames per second and the behavior duration of about 1.3 seconds, and accordingly construct each LSTM module 721, 722 from 32 cascaded LSTM units. Thereafter, the technician may connect the fully connected input layer 71, the first LSTM module 721, the second LSTM module 722, and the fully connected output layer 73 in sequence to construct the behavior analysis module 13.
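A hedged PyTorch sketch of that construction: a fully connected input layer, two stacked LSTM layers that unroll over the 32 time steps, and a fully connected output layer over the behavior classes. The hidden size and the number of classes are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class BehaviorAnalysisModule(nn.Module):
    """Sketch of module 13: FC input layer -> two stacked LSTM layers (32 time steps)
    -> FC output layer. hidden_size and num_classes are illustrative values."""
    def __init__(self, feature_dim=24, hidden_size=128, num_classes=5):
        super().__init__()
        self.fc_in = nn.Linear(feature_dim, hidden_size)       # fully connected input layer 71
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=2,
                            batch_first=True)                  # LSTM modules 721 and 722
        self.fc_out = nn.Linear(hidden_size, num_classes)      # fully connected output layer 73

    def forward(self, behavior_feature):                       # shape (batch, 32, 24)
        x = torch.relu(self.fc_in(behavior_feature))
        out, _ = self.lstm(x)                                  # cascade over the 32 time steps
        logits = self.fc_out(out[:, -1, :])                    # output of the last LSTM unit
        return torch.softmax(logits, dim=-1)                   # probability per behavior class
```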
After the behavior analysis module 13 is constructed, a technician may also select a large number of fighting video samples with a frame rate of 25 frames per second to train the learning parameters U and W of each LSTM unit in the behavior analysis module 13, so that the behavior analysis module 13 has the function of outputting the corresponding behavior classification according to the input multi-frame continuous images.
Specifically, the technicians may first group the fight video samples in a group of 32 continuous images, and then classify and label the behaviors of the recognition objects implemented in the 32 continuous images one by one. Then, the technician may use each group of continuous images as input and the corresponding classification labels as output, and determine the learning parameters U and W with the best convergence effect through repeated iteration and network modification, so that the behavior analysis module 13 has a function of correct classification. The specific method for training the classification network belongs to the prior art in the field, and is not described herein again.
Further, in some embodiments, the technician may also perform random mirror flipping on the fighting video samples, in units of groups, to obtain multiple groups of new fighting video samples, and train the behavior analysis module 13 with both the new and the original video samples, so as to improve the recognition accuracy of the behavior analysis module 13 for continuous behaviors by increasing the diversity of the training sample data.
Based on the above description, compared with the prior art that directly trains on combinations of pixel points in the image (e.g., 224 × 224 × 3), the present invention trains the classification network on the posture features output by the pose estimation module, so that the training data has a much smaller dimension (32 × 24). Training the classification network with lower-dimensional training data can increase the training speed of the model, reduce the training difficulty, and effectively avoid problems such as video memory overflow that are common in the prior art.
After the construction and training of the behavior analysis module 13 are completed, the behavior analysis module 13 has the function of outputting the corresponding behavior classification according to the input multi-frame continuous images. Thus, in response to the behavior feature obtained from the above 32 frames of continuous images 101 to 132, the behavior analysis module 13 may first map the input behavior feature to a higher dimension through the fully connected input layer 71, input the output data of the fully connected input layer 71 into the first LSTM module 721, and have the 32 cascaded LSTM units of the first LSTM module 721 process the data for the first time. Thereafter, the behavior analysis module 13 may input the first result output by the 32 LSTM units of the first LSTM module 721 into the 32 LSTM units of the second LSTM module 722, have the 32 LSTM units of the second LSTM module 722 perform the second cascaded processing on the first result, and obtain the second result output by the last LSTM unit of the second LSTM module 722. Finally, the behavior analysis module 13 may input the second result output by the last LSTM unit into the fully connected output layer 73, which determines the behavior classification with the highest probability according to the second result and outputs it as the recognition result.
It can be understood by those skilled in the art that the behavior analysis module 13 including the two LSTM modules 721-722 connected in series is only a non-limiting embodiment provided by the present invention, and is intended to clearly demonstrate the main concept of the present invention and provide a preferred solution capable of integrating various factors such as recognition rate and recognition accuracy, but not to limit the protection scope of the present invention.
Optionally, in other embodiments, the behavior analysis module may include only one LSTM module, or include more than three LSTM modules, and also achieve the effect of performing recognition classification on the continuous behavior of the recognition object. Specifically, for an embodiment with only one LSTM module, the behavior analysis module may input output data of a fully-connected input layer into the LSTM module, process the output data by a plurality of LSTM units in the LSTM module in a cascade manner, input a result output by the last LSTM unit into the fully-connected output layer, and determine a corresponding behavior classification by the fully-connected output layer according to the result.
Further, in some embodiments of the present invention, if the video to be recognized contains more continuous images than one recognition unit (i.e., 32 frames), the processor of the behavior recognition device may first select the first to thirty-second frames of images to be recognized from the multi-frame continuous images according to the recognition frame rate (i.e., 25 frames per second) to generate the corresponding behavior feature, and judge, according to that behavior feature, the behavior performed by the recognition object in the first to thirty-second frames of images to be recognized. In some embodiments, the behavior analysis module 13 may output, along with the behavior classification result, a probability value that the object behavior belongs to that behavior classification. If the probability value is greater than or equal to a preset probability threshold, the processor may determine that the recognition object performs the behavior corresponding to the behavior classification result in the first to thirty-second frames of images to be recognized. Conversely, if the probability value is smaller than the preset probability threshold, the processor may determine that the recognition object does not perform any known behavior in the first to thirty-second frames of images to be recognized.
After completing the behavior recognition of the first to thirty-second frames of images to be recognized, the processor may remove the first frame image from the 32 frames of images to be recognized and supplement a thirty-third frame image from the original multi-frame continuous images to update the 32 frames of images to be recognized. Thereafter, the behavior recognition device may acquire, as described above, the behavior feature of the recognition object in the second to thirty-third frames of images to be recognized, and input that behavior feature into the behavior analysis module 13 to determine a new behavior classification result and its corresponding probability value. Similarly, the processor may compare the probability value corresponding to the new behavior classification result with the preset probability threshold to determine whether the recognition object performs the corresponding behavior in the second to thirty-third frames of images to be recognized.
By analogy, the behavior recognition device can continuously update the images to be recognized from the multi-frame continuous images, acquire the corresponding behavior feature of each updated set of images to be recognized, and determine whether the recognition object performs the corresponding behavior, until the recognition of all the images in the multi-frame continuous images is completed, thereby determining one by one all the behaviors performed by the recognition object in the multi-frame continuous images.
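The sliding-window procedure above amounts to shifting a 32-frame window forward by one frame at a time and keeping a classification only when its probability clears the threshold; a sketch (the threshold value and the helper callables are illustrative assumptions):

```python
def recognize_sliding_window(frames, get_behavior_feature, behavior_module, prob_threshold=0.5):
    """frames: full list of consecutive images; get_behavior_feature builds the 32 x 24
    feature for a 32-frame window; behavior_module returns (label, probability)."""
    results = []
    for start in range(0, len(frames) - 32 + 1):     # drop the oldest frame, add the next one
        window = frames[start:start + 32]
        label, prob = behavior_module(get_behavior_feature(window))
        if prob >= prob_threshold:                   # behavior recognized in this window
            results.append((start, label, prob))
    return results
```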
Further, considering the influence of behavior duration on the accuracy of the behavior recognition result, the frame rate of the multi-frame continuous images to be recognized should be the same as the recognition frame rate of the behavior analysis module 13. To this end, in some embodiments, in response to acquiring the multi-frame continuous images, the processor of the behavior recognition device may first detect the actual frame rate of the acquired images. If the actual frame rate is greater than the recognition frame rate of the behavior analysis module 13, the processor may preferably discard frames uniformly from the multi-frame continuous images so as to reduce the actual frame rate to the recognition frame rate. Conversely, if the actual frame rate is less than the recognition frame rate of the behavior analysis module 13, the processor may preferably insert, uniformly, duplicates of preceding frames into the multi-frame continuous images so as to raise the actual frame rate to the recognition frame rate. In this way, even if the actual frame rate of the images to be recognized does not match the recognition frame rate of the behavior analysis module 13, the behavior recognition apparatus provided by the present invention can still accurately recognize the continuous behavior performed by the recognition object in the images.
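The frame-rate adjustment described above (uniformly dropping frames when the source is faster than the recognition frame rate, uniformly repeating earlier frames when it is slower) can be sketched as:

```python
def match_frame_rate(frames, actual_fps, target_fps=25):
    """Resample a list of frames so its effective frame rate matches the
    recognition frame rate of the behavior analysis module (25 fps assumed here)."""
    if actual_fps == target_fps:
        return list(frames)
    duration = len(frames) / float(actual_fps)
    n_target = max(1, round(duration * target_fps))
    # Uniformly pick source indices: this drops frames when actual_fps > target_fps
    # and repeats earlier frames when actual_fps < target_fps.
    return [frames[min(int(i * actual_fps / target_fps), len(frames) - 1)]
            for i in range(n_target)]
```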
It will be appreciated by those skilled in the art that the above scheme of determining a recognition unit of 32 frames per behavior for a 25 frames/second fighting video is only a non-limiting embodiment provided by the present invention, intended to clearly illustrate the main concept of the present invention and to provide a specific scheme convenient for the public to implement, not to limit the scope of protection of the present invention. Optionally, in other embodiments, the technician may determine the recognition unit of the behavior analysis module 13 as n × m frames per behavior according to the frame rate of n frames per second of the images to be recognized and the average duration of m seconds per behavior in the images to be recognized, so as to ensure the recognition accuracy of the behavior recognition device for continuous behaviors.
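For example, the recognition unit follows directly from these two numbers; the 1.28-second average duration used below is simply the value implied by a 32-frame unit at 25 frames/second and is not stated in the disclosure:

```python
def recognition_unit(frame_rate_n, avg_duration_m):
    """Recognition unit in frames per behavior: n frames/second times m seconds/behavior."""
    return round(frame_rate_n * avg_duration_m)

# recognition_unit(25, 1.28) == 32, matching the 32-frame window of the example above.
```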
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
Those of skill in the art would understand that information, signals, and data may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Although the processors described in the above embodiments may be implemented by a combination of software and hardware, it will be appreciated that the processor may also be implemented solely in software or hardware. For a hardware implementation, the processor may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic devices designed to perform the functions described herein, or a selected combination thereof. For a software implementation, the processor may be implemented by means of separate software modules, such as program modules (procedures) and function modules (functions), running on a common chip, each of which may perform one or more of the functions and operations described herein.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (19)

1. A method of behavior recognition, comprising the steps of:
acquiring a plurality of frames of continuous images of a recognition object;
extracting a posture feature of the recognition object in each of the images;
combining the posture features of the recognition object in time sequence to obtain a behavior feature of the recognition object in the plurality of frames of continuous images; and
inputting the behavior feature into a pre-trained behavior analysis module to recognize the behavior of the recognition object in the plurality of frames of continuous images.
2. The behavior recognition method according to claim 1, wherein the step of extracting the posture feature of the recognition object in each of the images includes:
acquiring coordinate information of a plurality of limb points of the recognition object in each of the images; and
determining the posture feature of the recognition object in each of the images according to the coordinate information of the plurality of limb points.
3. The behavior recognition method according to claim 2, wherein the step of acquiring coordinate information of a plurality of limb points of the recognition object in each of the images includes:
inputting each of the images into a posture estimation module to obtain a plurality of confidence maps corresponding to each frame of image, wherein each confidence map corresponds to one type of limb point; and
determining the coordinate information of the corresponding types of limb points of the recognition object in each of the images according to the confidence peak in each confidence map of the image.
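A minimal sketch of reading limb-point coordinates from the confidence maps of claim 3, assuming each map is a 2-D array and the peak location is taken as the coordinate:

```python
import numpy as np

def keypoints_from_confidence_maps(confidence_maps):
    """confidence_maps: array of shape (num_limb_point_types, H, W) -> list of (x, y, score)."""
    points = []
    for cmap in confidence_maps:
        y, x = np.unravel_index(np.argmax(cmap), cmap.shape)  # location of the confidence peak
        points.append((int(x), int(y), float(cmap[y, x])))
    return points
```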
4. A behavior recognition method according to claim 2, wherein the step of determining the posture feature of the recognition object in each of the images based on the coordinate information of the plurality of limb points includes:
determining a section of limb of the recognition object according to every two adjacent limb points, so as to determine a plurality of sections of limbs of the recognition object;
determining angle information of each limb corresponding to one frame of image according to coordinate information of a first limb point and a second limb point of the limb in the frame of image;
calculating angle difference information of each limb corresponding to the frame of image according to the angle information of the limb corresponding to the frame of image and the angle information of the limb corresponding to the previous frame of image;
calculating limb point difference information of each limb corresponding to the frame of image according to the coordinate information of the first limb point of the limb in the frame of image and in the previous frame of image; and
determining the posture feature of the recognition object in the frame of image according to the angle information, the angle difference information and the limb point difference information of each limb of the recognition object corresponding to the frame of image.
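A minimal sketch of the per-frame posture feature of claims 4 and 5, assuming the six limbs listed in claim 5, hypothetical dictionaries of limb-point coordinates, and an atan2 angle convention in image coordinates (all assumptions made only for illustration):

```python
import math

LIMBS = [("left_wrist", "left_elbow"), ("right_wrist", "right_elbow"),
         ("left_ankle", "left_knee"), ("left_knee", "left_hip"),
         ("right_ankle", "right_knee"), ("right_knee", "right_hip")]

def frame_pose_feature(points, prev_points, prev_angles):
    """points/prev_points: dict name -> (x, y); prev_angles: dict limb -> angle from the previous frame."""
    feature, angles = [], {}
    for first, second in LIMBS:
        (x1, y1), (x2, y2) = points[first], points[second]
        angle = math.atan2(y2 - y1, x2 - x1)                       # limb angle in this frame
        angles[(first, second)] = angle
        d_angle = angle - prev_angles.get((first, second), angle)  # angle difference vs. previous frame
        px, py = prev_points.get(first, (x1, y1))
        dx, dy = x1 - px, y1 - py                                  # first limb point displacement
        feature.extend([angle, d_angle, dx, dy])
    return feature, angles   # 6 limbs x 4 values = 24-dimensional per-frame feature
```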
5. The behavior recognition method of claim 4, wherein the limb points comprise: a left wrist, a left elbow, a left shoulder, a right wrist, a right elbow, a right shoulder, a left ankle, a left knee, a left hip, a right ankle, a right knee and a right hip, and
the limbs comprise: a left forearm, a right forearm, a left shank, a left thigh, a right shank and a right thigh.
6. A behavior recognition method according to claim 3, wherein the plurality of frames of consecutive images include a plurality of the recognition objects, and the step of acquiring coordinate information of a plurality of limb points of the recognition objects in each of the images further includes:
inputting each of the images into the posture estimation module to obtain a plurality of association maps corresponding to each frame of image, wherein each association map indicates direction information from a first limb point of a section of limb to a second limb point of that section of limb; and
determining, according to the association maps of each frame of image, the recognition object to which the first limb point and the second limb point of each corresponding section of limb belong.
7. The behavior recognition method according to claim 2, wherein the step of acquiring coordinate information of a plurality of limb points of the recognition object in each of the images includes:
determining a human body bounding box of the recognition object in one frame of image according to the coordinate information of the plurality of limb points belonging to the same recognition object in the frame of image;
performing an intersection-over-union operation between the human body bounding box of the recognition object in the frame of image and each human body bounding box in the previous frame of image, so as to calculate the degree of overlap between the human body bounding box of the recognition object in the frame of image and each human body bounding box in the previous frame of image; and
tracking the position of the recognition object in each of the images according to the degree of overlap.
8. The behavior recognition method according to claim 7, wherein the step of tracking the position of the recognition object in each of the images according to the degree of overlap includes:
in the previous frame of image, determining the human body bounding box having the maximum degree of overlap with the human body bounding box of the recognition object in the frame of image as a candidate bounding box;
performing feature extraction on the image within the human body bounding box of the recognition object in the frame of image to obtain a first encoded feature;
performing feature extraction on the image within the candidate bounding box to obtain a second encoded feature;
calculating a cosine value between the first encoded feature and the second encoded feature; and
determining, in response to the cosine value being greater than or equal to a preset cosine threshold, that the recognition object in the frame of image is the recognition object corresponding to the candidate bounding box in the previous frame of image.
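A minimal sketch of the tracking test of claims 7 and 8, assuming boxes are given as corner coordinates, encode is a hypothetical appearance-feature extractor, and 0.5 is an assumed cosine threshold:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match_identity(cur_box, cur_crop, prev_boxes, prev_crops, encode, cos_thresh=0.5):
    """Pick the previous box with the highest overlap, then confirm identity by cosine similarity."""
    best = max(range(len(prev_boxes)), key=lambda i: iou(cur_box, prev_boxes[i]))
    f1, f2 = encode(cur_crop), encode(prev_crops[best])
    cos = float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-9))
    return best if cos >= cos_thresh else None   # None: treat as a new object
```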
9. The behavior recognition method according to claim 1, further comprising the steps of:
determining a recognition frame rate of the behavior analysis module;
constructing an LSTM module by using a corresponding number of LSTM units according to the recognition frame rate and the duration of the behavior to be recognized;
constructing the behavior analysis module by using a fully connected input layer, the LSTM module and a fully connected output layer; and
training the behavior analysis module according to a plurality of groups of behavior image samples conforming to the recognition frame rate and their corresponding label data, so that the behavior analysis module has the function of outputting a corresponding behavior classification according to input multiple frames of continuous images, wherein each group of behavior image samples comprises a plurality of frames of continuous image samples, each frame of image sample comprises at least one recognition object sample, and the label data indicates the behavior classification of each recognition object sample in the group of behavior image samples.
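A minimal PyTorch sketch of the architecture described in claims 9, 12 and 13 (one fully connected input layer, two stacked LSTM layers unrolled over the frames of the recognition unit, and one fully connected output layer); the feature dimension, hidden size, number of classes and 32-frame window are assumed values chosen only to make the example runnable:

```python
import torch
import torch.nn as nn

class BehaviorAnalysisModule(nn.Module):
    def __init__(self, feature_dim=24, hidden=128, num_classes=5, window=32):
        super().__init__()
        self.window = window                                 # recognition unit, e.g. 32 frames
        self.fc_in = nn.Linear(feature_dim, hidden)          # high-dimensional mapping
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.fc_out = nn.Linear(hidden, num_classes)

    def forward(self, x):                                    # x: (batch, window, feature_dim)
        h = torch.relu(self.fc_in(x))
        out, _ = self.lstm(h)                                # cascade over the time steps
        logits = self.fc_out(out[:, -1])                     # result of the last LSTM step
        return torch.softmax(logits, dim=-1)                 # probability per behavior class

# Usage: probs = BehaviorAnalysisModule()(torch.randn(1, 32, 24))
```

For training, one would typically drop the final softmax and apply a cross-entropy loss to the logits; the softmax is kept here only to mirror the probability value discussed in the description.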
10. The behavior recognition method according to claim 9, wherein the step of training the behavior analysis module according to the plurality of groups of behavior image samples that meet the recognition frame rate and their corresponding label data comprises:
performing random mirror flipping on the plurality of groups of behavior image samples to obtain a plurality of groups of new behavior image samples; and
training the behavior analysis module according to the plurality of groups of behavior image samples and the plurality of groups of new behavior image samples.
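A minimal sketch of the mirror-flip augmentation of claim 10, assuming keypoints are stored as normalized (x, y) coordinates in [0, 1] and that left/right limb-point names are swapped after flipping:

```python
import random

def random_mirror(sample, swap_pairs, p=0.5):
    """sample: list of frames, each a dict of limb-point name -> (x, y)."""
    if random.random() >= p:
        return sample                                   # keep the original sample unflipped
    flipped = []
    for frame in sample:
        f = {name: (1.0 - x, y) for name, (x, y) in frame.items()}   # mirror about the image centre
        for left, right in swap_pairs:                  # e.g. ("left_wrist", "right_wrist")
            f[left], f[right] = f[right], f[left]
        flipped.append(f)
    return flipped
```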
11. The behavior recognition method according to claim 9, wherein the step of combining the posture features of the recognition object in time sequence to obtain the behavior feature of the recognition object in the plurality of frames of continuous images comprises:
selecting a corresponding number of images to be recognized from the plurality of frames of continuous images according to the recognition frame rate; and
concatenating, in time sequence, the angle information, the angle difference information and the limb point difference information of each limb of the recognition object in each image to be recognized, so as to obtain the behavior feature of the recognition object in the corresponding number of images to be recognized.
12. The behavior recognition method according to claim 11, wherein the step of inputting the behavior feature into the pre-trained behavior analysis module to recognize the behavior of the recognition object in the plurality of frames of continuous images comprises:
inputting the behavior feature of the recognition object in the corresponding number of images to be recognized into the fully connected input layer, and performing high-dimensional mapping on the input behavior feature;
inputting the output data of the fully connected input layer into the LSTM module, and processing the output data by a plurality of LSTM units in the LSTM module in a cascade manner; and
inputting the result output by the last LSTM unit into the fully connected output layer, and determining, by the fully connected output layer, a corresponding behavior classification according to the result.
13. The behavior recognition method of claim 12, wherein the step of constructing the behavior analysis module with a fully connected input layer, the LSTM module, and a fully connected output layer comprises:
constructing the behavior analysis module by using one fully connected input layer, two LSTM modules and one fully connected output layer.
14. The behavior recognition method according to claim 13, wherein the step of inputting the output data of the fully connected input layer into the LSTM module and processing the output data by the plurality of LSTM units in the LSTM module in a cascade manner comprises:
inputting the output data of the fully connected input layer into a first LSTM module, and performing first processing on the output data by a plurality of LSTM units in the first LSTM module in a cascade manner; and
inputting the first result output by the plurality of LSTM units into a plurality of LSTM units of a second LSTM module, performing second processing on the first result by the plurality of LSTM units in the second LSTM module in a cascade manner, and acquiring a second result output by the last LSTM unit of the second LSTM module.
15. The behavior recognition method according to claim 12, further comprising the steps of:
obtaining a probability value of the behavior classification; and
in response to the probability value being greater than or equal to a preset probability threshold, determining that the recognition object performs the behavior corresponding to the behavior classification in the corresponding number of images to be recognized.
16. The behavior recognition method of claim 15, further comprising the steps of:
removing the first frame of image from the corresponding number of images to be recognized, and supplementing one frame of image that has not yet been selected from the plurality of frames of continuous images, so as to update the corresponding number of images to be recognized;
acquiring the behavior feature of the recognition object in the updated images to be recognized;
inputting the behavior feature of the recognition object in the updated images to be recognized into the behavior analysis module to determine a new behavior classification and obtain a probability value of the new behavior classification;
repeating the steps of updating the images to be recognized, acquiring the updated behavior feature, and determining the new behavior classification and its probability value, until the recognition of all images in the plurality of frames of continuous images is completed; and
determining, according to each probability value greater than or equal to the probability threshold, that the recognition object performs the behavior corresponding to the behavior classification in the corresponding images to be recognized.
17. The behavior recognition method according to claim 11, wherein, prior to the step of combining the posture features of the recognition object in time sequence to obtain the behavior feature of the recognition object in the plurality of frames of continuous images, the behavior recognition method further comprises the steps of:
acquiring an actual frame rate of the plurality of frames of continuous images; and
if the actual frame rate is greater than the recognition frame rate, evenly extracting frames from the plurality of frames of continuous images to reduce the actual frame rate to the recognition frame rate, and/or, if the actual frame rate is less than the recognition frame rate, evenly adding frames that duplicate the previous frame of image to the plurality of frames of continuous images to raise the actual frame rate to the recognition frame rate.
18. A behavior recognition apparatus, comprising:
a memory; and
a processor connected to the memory and configured to implement the behavior recognition method of any of claims 1-17.
19. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the behavior recognition method according to any one of claims 1 to 17.
CN202110407936.6A 2021-04-15 2021-04-15 Behavior identification method and device Pending CN113065504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110407936.6A CN113065504A (en) 2021-04-15 2021-04-15 Behavior identification method and device

Publications (1)

Publication Number Publication Date
CN113065504A true CN113065504A (en) 2021-07-02

Family

ID=76566938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110407936.6A Pending CN113065504A (en) 2021-04-15 2021-04-15 Behavior identification method and device

Country Status (1)

Country Link
CN (1) CN113065504A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110188599A (en) * 2019-04-12 2019-08-30 哈工大机器人义乌人工智能研究院 A kind of human body attitude behavior intellectual analysis recognition methods
CN110659570A (en) * 2019-08-21 2020-01-07 北京地平线信息技术有限公司 Target object posture tracking method, and neural network training method and device
CN110705390A (en) * 2019-09-17 2020-01-17 平安科技(深圳)有限公司 Body posture recognition method and device based on LSTM and storage medium
CN110910426A (en) * 2019-11-26 2020-03-24 爱菲力斯(深圳)科技有限公司 Action process and action trend identification method, storage medium and electronic device
CN111027458A (en) * 2019-08-28 2020-04-17 深圳大学 Gesture recognition method and device based on radar three-dimensional track characteristics and storage medium
CN111209818A (en) * 2019-12-30 2020-05-29 新大陆数字技术股份有限公司 Video individual identification method, system, equipment and readable storage medium
CN112487964A (en) * 2020-11-27 2021-03-12 深圳市维海德技术股份有限公司 Gesture detection and recognition method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210702