CN117152670A

CN117152670A - Behavior recognition method and system based on artificial intelligence

Info

Publication number: CN117152670A
Application number: CN202311422270.7A
Authority: CN
Inventors: 李火亮; 陈达剑
Original assignee: Jiangxi Tuoshi Intelligent Technology Co ltd
Current assignee: Jiangxi Tuoshi Intelligent Technology Co ltd
Priority date: 2023-10-31
Filing date: 2023-10-31
Publication date: 2023-12-01

Abstract

The invention provides a behavior recognition method and a system based on artificial intelligence, wherein the method comprises the steps of carrying out single-frame slicing processing and optical flow extraction on a target video; performing space domain sequence feature extraction on a plurality of RGB images, and performing time domain feature extraction on the optical flow images; feature fusion is carried out on features in the image feature set, target region interception is carried out on the fused feature image, and region segmentation is carried out on the target image; storing the key area image into an identification feature set; filling the skeleton point set, and determining a space-time pose image of the target based on the optimized skeleton point set; the method and the device can solve the problems of insufficient generalization capability, weak robustness and the like of the model when strong external interference, strong ambient light shielding and other interference factors exist in the intelligent real environment, improve the accuracy of target behavior recognition and greatly improve the speed of target recognition.

Description

Behavior recognition method and system based on artificial intelligence

Technical Field

The invention belongs to the technical field of intelligent recognition, and particularly relates to an artificial intelligence-based behavior recognition method and system.

Background

In the current artificial intelligence field, the intelligent recognition technology is widely applied to each line, meanwhile, the intelligent behavior recognition technology monitoring system software is a deep learning algorithm taking the behavior recognition technology as a key technology, a main module framework of a person is constructed according to an artificial intelligent neuron network, various body-building behaviors are measured and calculated according to tracks and target human body contours, and meanwhile, abnormal behaviors of various photographed field operators are monitored through the behavior recognition technology, so that monitoring staff is helped to improve the efficiency of solving various abnormal emergency events;

in the prior art, the target characteristics are usually extracted manually and are expressed by observing the characteristics of different behaviors through researchers, the problems of incomplete characteristic representation and overlarge noise exist, and meanwhile, after the target characteristics are identified, the prior art has larger identification errors, so that the accurate identification of the target behaviors cannot be realized.

Disclosure of Invention

In order to solve the technical problems, the invention provides an artificial intelligence-based behavior recognition method and system, which are used for solving the technical problems in the prior art.

In one aspect, the present invention provides the following technical solutions, and an artificial intelligence based behavior recognition method, including:

Acquiring a target video, performing single-frame slicing processing on the target video to obtain a plurality of single-frame RGB images, and performing optical flow extraction on the target video to obtain a plurality of optical flow images;

performing spatial sequence feature extraction on a plurality of RGB images to obtain a first feature and a second feature, performing time domain feature extraction on the optical flow image to obtain a third feature, and storing the first feature, the second feature and the third feature into an image feature set;

feature fusion is carried out on the features in the image feature set to obtain a fused feature image, target area interception is carried out on the fused feature image to obtain a target image, and area segmentation is carried out on the target image to obtain a target face image and a target body image;

extracting key points of the target face image, intercepting a key area image of the target face image based on the key points, and storing the key area image into an identification feature set;

extracting a skeleton point set of the target body image, filling the skeleton point set to obtain an optimized skeleton point set, determining a space-time pose image of the target based on the optimized skeleton point set, and storing the space-time pose image into the image feature set according to a frame number corresponding relation;

And acquiring a template behavior image set, inputting the template behavior image set into a preset behavior recognition model for training, inputting the image feature set into the trained preset behavior recognition model, and outputting a behavior recognition result.

Compared with the prior art, the application has the beneficial effects that: firstly, acquiring a target video, carrying out single-frame slicing processing on the target video to obtain a plurality of single-frame RGB images, and carrying out optical flow extraction on the target video to obtain a plurality of optical flow images; then, carrying out space domain sequence feature extraction on a plurality of RGB images to obtain a first feature and a second feature, carrying out time domain feature extraction on the optical flow images to obtain a third feature, and storing the first feature, the second feature and the third feature into an image feature set; then, carrying out feature fusion on the features in the image feature set to obtain a fused feature image, carrying out target region interception on the fused feature image to obtain a target image, and carrying out region segmentation on the target image to obtain a target face image and a target body image; then extracting key points of the target face image, intercepting a key area image of the target face image based on the key points, and storing the key area image into an identification feature set; then extracting a skeleton point set of the target body image, filling the skeleton point set to obtain an optimized skeleton point set, determining a space-time pose image of the target based on the optimized skeleton point set, and storing the space-time pose image into an image feature set according to a frame number corresponding relation; finally, a template behavior image set is acquired and is input into a preset behavior recognition model for training, an image feature set is input into the trained preset behavior recognition model, and a behavior recognition result is output.

Preferably, the step of performing single-frame slicing processing on the target video to obtain a plurality of single-frame RGB images, and performing optical flow extraction on the target video to obtain a plurality of optical flow images includes:

carrying out single-frame splitting and slicing treatment on the target video so as to split the target video into a plurality of continuous single-frame RGB images;

extracting a single-frame space image of the target video, and calculatingFirst image information of time-corresponding single-frame spatial image +.>And->Second image information of time-corresponding single-frame spatial image +.>：

；

In the method, in the process of the invention,is the pixel center point of the image, < >>First coefficient matrix, ">For the second coefficient matrix->For the third coefficient matrix->For the fourth coefficient matrix, ++>、/>First scalar and second scalar, < ->Is the displacement of the pixel point；

JudgingTime-corresponding single-frame spatial image and +.>Whether the pixel values of the single-frame space images corresponding to the moments are the same or not;

if it isTime-corresponding single-frame spatial image and +.>The pixel values of the corresponding single-frame spatial image at the moment are the same, then +.>，/>，/>；

Based on the first image informationAnd the second image information +>Solving the displacement of the pixel point>Based on the displacement amount of each pixel >Determining optical flow field information of the single-frame spatial image to obtain a plurality of optical flow images, wherein the displacement of the pixel point is +.>The method comprises the following steps:

。

preferably, the step of performing spatial sequence feature extraction on the plurality of RGB images to obtain a first feature and a second feature, and performing temporal feature extraction on the optical flow image to obtain a third feature includes:

inputting the RGB image into a first 2DCNN for spatial domain feature extraction to obtain a first feature;

acquiring an image path and a category label of the RGB image, storing the image path and the category label into a CSV file, and inputting the CSV file and the RGB image into an LSTM (LSTM) for sequence feature extraction to obtain a second feature;

and performing binarization gray scale processing on the optical flow image to obtain an optical flow gray scale image, calculating a first optical flow field and a second optical flow field of the optical flow gray scale image by adopting a dense optical flow method, and inputting the optical flow image, the first optical flow field and the second optical flow field into a second 2DCNN for time domain feature extraction to obtain a third feature.

Preferably, the step of extracting the key points of the target face image, intercepting the key region image of the target face image based on the key points, and storing the key region image in the recognition feature set includes:

Inputting the target facial image into a key point extraction network for key point extraction to obtain a plurality of key points, determining a facial key point matrix based on the key points, and determining a target matrix based on the facial key point matrix, wherein the facial key point matrixIs +.>The method comprises the following steps of:

；/>；

in the method, in the process of the invention,is the +.>Line->Column element->Is the first in the face key point matrixLine->Column element->Is the%>Line->Column element->Is the%>Line->Column element->The number of key points;

based on the facial key point matrixAnd said at least one ofTarget matrix->Calculating an intermediate matrix +.>：

；

In the method, in the process of the invention,for the first orthogonal matrix->Is a second orthogonal matrix->Is singular value decomposition;

based on the intermediate matrixDetermining a conversion matrix->Based on the transformation matrix, the facial key point matrixPerforming transformation processing to obtain a transformed coordinate matrix +.>：

；

In the method, in the process of the invention,、/>the first column vector and the second column vector in the facial key point matrix are respectively +.>、/>A first column vector and a second column vector in the target matrix;

based on the transformed coordinate matrixAnd intercepting a key area image of the target face image, and storing the key area image into an identification feature set.

Preferably, the step of extracting a skeleton point set of the target body image, and performing filling processing on the skeleton point set to obtain an optimized skeleton point set includes:

inputting the target body image into a skeleton point prediction network to obtain a plurality of initial skeleton points;

obtaining the prediction reliability of the skeleton points and an affinity field, and performing iterative processing on the prediction reliability and the affinity field to obtain iterative reliabilityAffinity field with the iteration->：

；

In the method, in the process of the invention,、/>first prediction function and second prediction function, respectively, ">For the image connection feature mapping,is->Iteration reliability after a number of iterations, +.>Is->An iterative affinity field after the iteration;

the iterative reliability is respectively based on a first loss function and a second loss functionAffinity field with the iterationPerforming predictive compensation, wherein the first loss function +.>And said second loss function->The method comprises the following steps of:

；

in the method, in the process of the invention,for the position of skeleton points in the image, +.>For mask (S)>Is->The degree of reliability of the iteration of the individual phases,for the mean value of the iterative reliability, +.>Is->A stepwise alternating affinity field, +.>Is the average value of the iterative affinity field, +.>For the number of stages->Is the square of the 2 norms;

Based on the iterative reliability after prediction compensationIs associated with the iterative affinity field->And screening the skeleton points to obtain a skeleton point set, and filling the skeleton point set according to a preset filling algorithm to obtain an optimized skeleton point set.

Preferably, the iterative reliability after the prediction-based compensationIs associated with the iterative affinity field->Screening the skeleton points to obtain a skeleton point set, and filling the skeleton point set according to a preset filling algorithm to obtain an optimized skeleton point set, wherein the step of obtaining the optimized skeleton point set comprises the following steps:

calculating the reliability and the affinity field of each skeleton point, wherein the reliability of the iteration after prediction compensation is smaller than that of the iteration after prediction compensationAnd/or said iterative affinity field +.>Removing skeleton points, and storing the reserved skeleton points into a skeleton point set;

eliminating the concentrated coordinates of the skeleton points asIs selected as a basic skeleton point in the skeleton point set>Selecting the front and the rear parts of the frame number corresponding to the reference skeleton point>Skeleton points of the frames and storing them in a filling dataset;

calculating coordinate values of skeleton points to be filled based on the coordinates of skeleton points in the filling data set；

；

In the method, in the process of the invention,for filling in samples of x coordinate values of each skeleton point in the data set The mean value of (I)>For filling the sample mean of y coordinate values of each skeleton point in the data set,/->Sample median for x-coordinate values of skeleton points in the filled data set,/->A sample median value for filling y coordinate values of each skeleton point in the data set;

and supplementing the coordinate values of the skeleton points to be filled into the corresponding filling data set to obtain an optimized skeleton point set.

Preferably, the step of determining the space-time pose image of the target based on the optimized skeleton point set and storing the space-time pose image into the image feature set according to the frame number correspondence includes:

acquiring a preset time periodCoordinate information of each skeleton point in the optimized skeleton point set;

connecting each skeleton point based on the coordinate information of each skeleton point to obtain a basic skeleton pose image;

selecting a root skeleton point from the basic skeleton pose image, calculating the shortest distance between the skeleton point and the root skeleton point, storing the skeleton point with the shortest distance not larger than a preset distance into a sampling skeleton point set, and sampling the skeleton point in the sampling skeleton point set based on a preset sampling function, wherein the preset sampling function is as follows:

；

in the method, in the process of the invention,for sampling the rest skeleton points except the root skeleton point in the skeleton point set, the +. >Is a root skeleton point;

mapping the adjacent areas of the skeleton points obtained by sampling intoSub-region:

；

in the method, in the process of the invention,for the mapping function +.>For the neighboring area of the skeleton point obtained by the sampling process, < >>The skeleton points are obtained through sampling;

determining updated skeleton points based on mapping resultsBased on the updated skeleton point +.>Updating the basic skeleton pose image to obtain a time-space pose image, and storing the time-space pose image into the image feature set according to the corresponding relation of the frame number, wherein the updated skeleton points +.>The method comprises the following steps:

；

in the method, in the process of the invention,for the mapping result of the rest skeleton points except the root skeleton point in the sampling skeleton point set, the ++>For the current skeleton point in a preset time period +.>Time position in->For a preset period of time->Is a starting point time position of (1).

In a second aspect, the present invention provides the following technical solutions, a behavior recognition system based on artificial intelligence, the system comprising:

the acquisition module is used for acquiring a target video, carrying out single-frame slicing processing on the target video to obtain a plurality of single-frame RGB images, and carrying out optical flow extraction on the target video to obtain a plurality of optical flow images;

the extraction module is used for carrying out space domain sequence feature extraction on a plurality of RGB images to obtain a first feature and a second feature, carrying out time domain feature extraction on the optical flow image to obtain a third feature, and storing the first feature, the second feature and the third feature into an image feature set;

The fusion module is used for carrying out feature fusion on the features in the image feature set to obtain a fusion feature image, carrying out target region interception on the fusion feature image to obtain a target image, and carrying out region segmentation on the target image to obtain a target face image and a target body image;

the intercepting module is used for extracting key points of the target face image, intercepting a key area image of the target face image based on the key points, and storing the key area image into an identification feature set;

the filling module is used for extracting a skeleton point set of the target body image, filling the skeleton point set to obtain an optimized skeleton point set, determining a space-time pose image of the target based on the optimized skeleton point set, and storing the space-time pose image into the image feature set according to a frame number corresponding relation;

the recognition module is used for acquiring a template behavior image set, inputting the template behavior image set into a preset behavior recognition model for training, inputting the image feature set into the trained preset behavior recognition model, and outputting a behavior recognition result.

In a third aspect, the present invention provides a computer, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the artificial intelligence based behavior recognition method as described above when executing the computer program.

In a fourth aspect, the present invention provides a storage medium having a computer program stored thereon, which when executed by a processor implements an artificial intelligence based behavior recognition method as described above.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of an artificial intelligence based behavior recognition method according to an embodiment of the present invention;

FIG. 2 is a detailed flowchart of step S1 in an artificial intelligence-based behavior recognition method according to an embodiment of the present invention;

FIG. 3 is a detailed flowchart of step S2 in an artificial intelligence-based behavior recognition method according to an embodiment of the present invention;

FIG. 4 is a detailed flowchart of step S4 in an artificial intelligence based behavior recognition method according to an embodiment of the present invention;

FIG. 5 is a detailed flowchart of step S51 in an artificial intelligence based behavior recognition method according to an embodiment of the present invention;

FIG. 6 is a detailed flowchart of step S514 in an artificial intelligence based behavior recognition method according to an embodiment of the present invention;

FIG. 7 is a detailed flowchart of step S52 in an artificial intelligence based behavior recognition method according to an embodiment of the present invention;

FIG. 8 is a block diagram of an artificial intelligence based behavior recognition system according to a second embodiment of the present invention;

fig. 9 is a schematic hardware structure of a computer according to another embodiment of the invention.

Embodiments of the present invention will be further described below with reference to the accompanying drawings.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended to illustrate embodiments of the invention and should not be construed as limiting the invention.

In the description of the embodiments of the present invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate description of the embodiments of the present invention and simplify description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present invention, the meaning of "plurality" is two or more, unless explicitly defined otherwise.

In the embodiments of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured" and the like are to be construed broadly and include, for example, either permanently connected, removably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the embodiments of the present invention will be understood by those of ordinary skill in the art according to specific circumstances.

Example 1

In a first embodiment of the present invention, as shown in fig. 1, an artificial intelligence-based behavior recognition method includes:

S1, acquiring a target video, performing single-frame slicing processing on the target video to obtain a plurality of single-frame RGB images, and performing optical flow extraction on the target video to obtain a plurality of optical flow images;

specifically, the target video may be a video image including a target behavior captured by the intelligent device in a certain period of time, and after the target video is captured, the video image is difficult to directly process and analyze, so that an RGB image of Zhang Shanzhen can be obtained after the target video is subjected to single-frame slicing processing, and meanwhile, when an optical flow image is obtained subsequently, single-frame splitting processing is required to be performed on the target video, so that the target video is converted into a picture form, so that subsequent behavior recognition and analysis can be performed conveniently.

As shown in fig. 2, the step S1 includes:

s11, carrying out single-frame splitting and slicing processing on the target video so as to split the target video into a plurality of continuous single-frame RGB images;

specifically, the RGB image has a queue of three channel information of red, green and blue, and contains spatial information, so that the image can be subjected to video frame splitting and slicing processing through OpenCV, so that a plurality of continuous RGB images can be obtained, and after splitting, the continuous RGB images can be arranged according to the corresponding frame number sequence and named respectively, so that a time sequence relationship exists between the RGB images, and the subsequent feature extraction is facilitated.

S12, extracting a single-frame space image of the target video, and calculatingFirst image information of time-corresponding single-frame spatial image +.>And->Second image information of time-corresponding single-frame spatial image +.>：

；

In the method, in the process of the invention,is the pixel center point of the image, < >>First coefficient matrix, ">For the second coefficient matrix->For the third coefficient matrix->For the fourth coefficient matrix, ++>、/>First scalar and second scalar, < ->The displacement of the pixel point;

specifically, the image information may be understood as the motion information of the pixels in the image, that is, from one moment to another moment, the dense motion information of the pixels in the image is the optical flow information, so in step S12, the optical flow information may be obtained by calculating the image information of two adjacent frames and solving in reverse according to the image information of two adjacent frames.

S13, judgingTime-corresponding single-frame spatial image and +.>Whether the pixel values of the single-frame space images corresponding to the moments are the same or not;

specifically, in the target video, since the target video contains a target and the target makes some behaviors in the video, the default background and noise will not be suddenly changed, so that the gradient and local optical flow of the image can be regarded as constant, the subsequent optical flow information calculation can be performed, if different, the sudden change of the background and noise between two adjacent images is considered, and the image cannot be used as the material for identifying the target behaviors.

S14, ifTime-corresponding single-frame spatial image and +.>The pixel values of the corresponding single-frame spatial image at the moment are the same, then +.>，/>，/>。

S15, based on the first image informationAnd the second image information +>Solving the displacement of the pixel point>Based on the displacement amount of each pixel>Determining optical flow field information of the single-frame spatial image to obtain a plurality of optical flow images, wherein the displacement of the pixel point is +.>The method comprises the following steps:

；

specifically, after the displacement of the pixel points in the image from one moment to another moment is calculated, the optical flow field of each pixel point can be determined, and the optical flow field of each pixel point forms the whole optical flow field information of the image.

S2, performing spatial sequence feature extraction on a plurality of RGB images to obtain first features and second features, performing time domain feature extraction on the optical flow images to obtain third features, and storing the first features, the second features and the third features into an image feature set;

specifically, the first feature is a spatial feature, the second feature is a sequence feature, the third feature is a time feature, only a single feature with prominence is usually extracted in the prior art, so that the robustness of a model and the expression capability of the feature are reduced, and the final behavior recognition result cannot be larger.

As shown in fig. 3, the step S2 includes:

s21, inputting the RGB image into a first 2DCNN for spatial domain feature extraction to obtain a first feature;

specifically, the first 2DCNN in this step is a two-dimensional CNN extraction network, which uses a processing result similar to a dual-stream network to extract the first feature of the RGB image, but the number of channels of the 2DCNN is different from that of the dual-stream network, and when the 2DCNN processes the RGB image, the input channel is 1, that is, only one RGB image is input at a time, so that simple information flow transmission, easy interpretation and easy implementation can be ensured.

S22, acquiring an image path and a category label of the RGB image, storing the image path and the category label into a CSV file, and inputting the CSV file and the RGB image into an LSTM (LSTM) for sequence feature extraction to obtain a second feature;

specifically, in the aspect of motion feature extraction, the LSTM may effectively obtain sequence motion information contained in a frame sequence, and further, because of its context feature expression capability, it may deconstruct and express front and rear motion sequence information of a relative position before and after a motion, so as to ensure the effectiveness of feature extraction.

S23, performing binarization gray scale processing on the optical flow image to obtain an optical flow gray scale image, calculating a first optical flow field and a second optical flow field of the optical flow gray scale image by adopting a dense optical flow method, and inputting the optical flow image, the first optical flow field and the second optical flow field into a second 2DCNN for time domain feature extraction to obtain a third feature;

specifically, in order to reduce the complexity of the operation, the optical flow image is subjected to binarization gray scale processing, so that the complexity of the operation can be reduced without losing the optical flow motion information. In order to improve the feature extraction speed of the second 2DCNN, the network structure of the second 2DCNN in the present application is similar to that of the first 2DCNN, but the number of input channels of the second 2DCNN is different, the number of input channels of the second 2DCNN is 8, that is, 8 Zhang Guang flowsheets are input each time, wherein 4 Zhang Wei transverse optical flowsheets exist, and 4 longitudinal optical flowsheets exist.

S3, feature fusion is carried out on the features in the image feature set to obtain a fused feature image, target region interception is carried out on the fused feature image to obtain a target image, and region segmentation is carried out on the target image to obtain a target face image and a target body image;

Specifically, after three features are extracted, in order to facilitate model recognition and processing, the model needs to be subjected to fusion processing, features in the image feature set are subjected to feature fusion in a serial fusion mode, recognition accuracy of the model can be further improved in a serial fusion mode, after the features are fused, target images in the images can be subjected to frame selection by using some target recognition algorithms, and after the target images are determined, head images in the target images are separated from body images, so that a target face image and a target body image are obtained, wherein the target face image can be used for analyzing facial behaviors of the target, such as yawning, drowsiness, laughing, crying and the like, and the target body image can be used for analyzing limb behaviors of the target, such as walking, dancing, jumping or some illegal operations and the like.

S4, extracting key points of the target face image, intercepting a key area image of the target face image based on the key points, and storing the key area image into an identification feature set;

specifically, the key points of the target facial image are approximately distributed at the five-sense-organ positions of the human face, so that the corresponding key region image is intercepted according to the key points, the key region image is also approximately the five-sense-organ image of the target, the facial behavior of the human body is approximately caused by the form of the five-sense-organ of the human body, such as yawning, laughing, coughing and the like, can be represented by the opening and closing states and the bending degrees of the mouth and eyes of the human body, and after the corresponding key region image is extracted, the key region image can be input into a model and the similarity between the key region image and the template image is calculated, so that the facial behavior of the target can be determined.

As shown in fig. 4, the step S4 includes:

s41, inputting the target facial image into a key point extraction network for key point extraction to obtain a plurality of key points, determining a facial key point matrix based on the key points, and determining a target matrix based on the facial key point matrix, wherein the facial key point matrixIs +.>The method comprises the following steps of:

；/>；

in the method, in the process of the application,is the +.>Line->Column element->Is the first in the face key point matrixLine->Column element->Is the%>Line->Column element->Is the%>Line->Column element->The number of key points;

specifically, in the present application, the number of key points is 5, which are the positions of the left and right eyes, nose and left and right mouth, respectively, so that the key point matrix of the face isIs +.>In (I)>5, since the extracted key points are two-dimensional in the image, the coordinates of the first key point are regarded as (++>，/>) The coordinates of the nth key point are (++>，) Facial key point matrix obtained by combining the coordinates of the key points>And the target matrix->Is the resulting target matrix affine transformed from the facial key point matrix.

S42, based on the facial key point matrixIs +.>Calculating an intermediate matrix +.>：

；

in particular, hereFor singular value decomposition process, +.>Is one ofPersonal->Orthogonal matrix of->Is one ofIs a matrix of orthogonal matrices.

S43, based on the intermediate matrixDetermining a conversion matrix->Based on the transformation matrix +.>Performing transformation processing to obtain a transformed coordinate matrix +.>：

；

specifically, in step S41, the facial key point matrixIs +.>Are all a matrix of n rows and 2 columns, thus +.>Column vectors composed of elements of the first column of the facial key matrix, and so on, ++>Column vector consisting of elements of the second column of the facial key matrix, < >>Column vector consisting of elements of the first column in the target matrix, < >>Is a column vector made up of elements of the second column in the target matrix.

S44, based on the transformation coordinate matrixIntercepting a key area image of the target face image, and storing the key area image into an identification feature set;

Specifically, the transformed coordinate matrix hereinThe target face image is converted into coordinates on an orthogonal plane, namely, an inclined face image is converted into a positive face image, and then the corresponding key area image can be intercepted according to the coordinate values of the converted key points.

S5, extracting a skeleton point set of the target body image, filling the skeleton point set to obtain an optimized skeleton point set, determining a space-time pose image of the target based on the optimized skeleton point set, and storing the space-time pose image into the image feature set according to a frame number corresponding relation;

specifically, in an actual target image, the motion of a human body can be determined according to the form of each bone, so that the problem that the target gesture cannot be accurately identified due to too few bone points can be avoided by determining a skeleton point set of the target body image and filling the skeleton point set, the accuracy of target behavior identification is realized, the speed of target identification is greatly improved, then a space-time pose image of the target in a space-time dimension can be determined according to optimized skeleton points, the limb behaviors of the target in the space-time dimension can be conveniently analyzed and understood through the space-time pose image, the space-time pose image is stored in an image feature set according to a frame number corresponding relation after the space-time pose image is obtained, and the frame number corresponding relation is specifically as follows: for each frame of image, the image contains the face image and the driving image of the target at the same time, so that a corresponding relation exists for the image of the same frame, and the space-time pose image obtained in the step and the key area image obtained in the step S4 can be correspondingly stored in the image feature set according to the corresponding relation.

And, the step S5 includes:

s51, extracting a skeleton point set of the target body image, and filling the skeleton point set to obtain an optimized skeleton point set;

s52, determining a space-time pose image of the target based on the optimized skeleton point set, and storing the space-time pose image into the image feature set according to the corresponding relation of the frame number.

As shown in fig. 5, the step S51 includes:

s511, inputting the target body image into a skeleton point prediction network to obtain a plurality of initial skeleton points;

specifically, in this step, the skeleton point prediction network is a multi-stage convolutional neural network.

S512, obtaining the prediction reliability of the skeleton points and an affinity field, and performing iterative processing on the prediction reliability and the affinity field to obtain iterative reliabilityAffinity field with the iteration->：

；

specifically, the prediction reliability can be used for measuring the possibility of a certain skeleton point of a human body in image pixels, and the affinity field is used for determining the matching degree between the skeleton points.

S513, based on the first loss function and the second loss function, respectively for the iterative reliabilityIs associated with the iterative affinity field->Performing predictive compensation, wherein the firstA loss function->And said second loss function->The method comprises the following steps of:

；

in particular, the purpose of the mask is to penalize the iterative reliabilityIs associated with the iterative affinity field->To ensure correct results, while in the course of the iteration, by means of said first loss function +.>And said second loss function->According to the iteration number, the iteration reliability is +.>Is associated with the iterative affinity field->Periodic compensation is performed, whereby the final said iterative reliability is obtained>Is associated with the iterative affinity field->。

S514, based on the iteration reliability after prediction compensationIs associated with the iterative affinity field->Screening the skeleton points to obtain a skeleton point set, and filling the skeleton point set according to a preset filling algorithm to obtain an optimized skeleton point set;

As shown in fig. 6, the step S514 includes:

s5141, calculating each of theThe reliability and affinity field of skeleton points will be less than the iterative reliability after predictive compensationAnd/or said iterative affinity field +.>Removing skeleton points, and storing the reserved skeleton points into a skeleton point set;

specifically, if the reliability of the skeleton points and the affinity field do not meet the iterative reliability calculated in the stepsAnd/or said iterative affinity field +.>This means that the skeleton point is unreasonable or not matched with the rest of skeleton points, so that it needs to be removed, and the influence of the skeleton point on the subsequent behavior classification process is avoided.

S5142, eliminating the concentrated coordinates of the skeleton points asIs selected as a basic skeleton point in the skeleton point set>Selecting the front and the rear parts of the frame number corresponding to the reference skeleton point>Skeleton points of the frames and storing them in a filling dataset;

specifically, if the coordinates of the skeleton point areInformation representing the location is miscalculated or miscalculated and therefore needs to be culled.

S5143, calculating coordinate values of the skeleton points to be filled based on the coordinate values of the skeleton points in the filling data set；

；

In the method, in the process of the invention,for filling the sample mean of x coordinate values of each skeleton point in the data set, +. >For filling the sample mean of y coordinate values of each skeleton point in the data set,/->Sample median for x-coordinate values of skeleton points in the filled data set,/->Is a sample median value that fills the y-coordinate values of each skeleton point in the dataset.

S5144, supplementing the coordinate values of the skeleton points to be filled into the corresponding filling data set to obtain an optimized skeleton point set.

As shown in fig. 7, the step S52 includes:

s521, obtaining a preset time periodAnd the coordinate information of each skeleton point in the optimized skeleton point set is included.

S522, connecting each skeleton point based on the coordinate information of each skeleton point to obtain a basic skeleton pose image;

specifically, the distance between default skeleton points is 1, so that skeleton points in an image of a single frame are connected according to the distance, and then a connecting line diagram of the skeleton points in a preset time period is connected, so that a basic skeleton pose image can be obtained.

S523, selecting a root skeleton point from the basic skeleton pose image, calculating the shortest distance between the skeleton point and the root skeleton point, storing the skeleton point with the shortest distance not larger than a preset distance into a sampling skeleton point set, and sampling the skeleton point in the sampling skeleton point set based on a preset sampling function, wherein the preset sampling function is as follows:

；

In the method, in the process of the invention,for sampling the rest skeleton points except the root skeleton point in the skeleton point set, the +.>Is a root skeleton point;

the preset sampling function is determined according to the set of skeleton points with the shortest distance not larger than the preset distance, and meanwhile, the preset sampling function is a sampling function when the sampling distance is 2 and the distance between two adjacent skeleton points is 1.

S524, mapping the adjacent areas of the skeleton points obtained by the sampling process intoSub-region:

；/>

in the method, in the process of the invention,for the mapping function +.>For the neighboring area of the skeleton point obtained by the sampling process, < >>And (5) sampling the obtained skeleton points.

S525Determining updated skeleton points based on mapping resultsBased on the updated skeleton point +.>Updating the basic skeleton pose image to obtain a time-space pose image, and storing the time-space pose image into the image feature set according to the corresponding relation of the frame number, wherein the updated skeleton points +.>The method comprises the following steps:

；

S6, acquiring a template behavior image set, inputting the template behavior image set into a preset behavior recognition model for training, inputting the image feature set into the trained preset behavior recognition model, and outputting a behavior recognition result;

Specifically, the preset behavior recognition model in the application is an ST-GCN model, specifically a graph convolution neural network recognition model, firstly template images of various facial behaviors, template images of limb behaviors and template behavior image sets formed by matching the facial behaviors and the limb behaviors are input into the preset behavior recognition model, the preset behavior recognition model is trained through the template image behavior sets, and behavior labels are provided in the template images, so that after the image feature sets obtained in the steps are input into the trained preset behavior recognition model, corresponding behavior labels can be output, and a final behavior recognition result is obtained.

According to the behavior recognition method based on artificial intelligence, firstly, a target video is obtained, single-frame slicing processing is carried out on the target video to obtain a plurality of single-frame RGB images, and optical flow extraction is carried out on the target video to obtain a plurality of optical flow images; then, carrying out space domain sequence feature extraction on a plurality of RGB images to obtain a first feature and a second feature, carrying out time domain feature extraction on the optical flow images to obtain a third feature, and storing the first feature, the second feature and the third feature into an image feature set; then, carrying out feature fusion on the features in the image feature set to obtain a fused feature image, carrying out target region interception on the fused feature image to obtain a target image, and carrying out region segmentation on the target image to obtain a target face image and a target body image; then extracting key points of the target face image, intercepting a key area image of the target face image based on the key points, and storing the key area image into an identification feature set; then extracting a skeleton point set of the target body image, filling the skeleton point set to obtain an optimized skeleton point set, determining a space-time pose image of the target based on the optimized skeleton point set, and storing the space-time pose image into an image feature set according to a frame number corresponding relation; finally, a template behavior image set is acquired and is input into a preset behavior recognition model for training, an image feature set is input into the trained preset behavior recognition model, and a behavior recognition result is output.

Example two

As shown in fig. 8, in a second embodiment of the present invention, there is provided an artificial intelligence-based behavior recognition system, the system including:

the acquisition module 1 is used for acquiring a target video, carrying out single-frame slicing processing on the target video to obtain a plurality of single-frame RGB images, and carrying out optical flow extraction on the target video to obtain a plurality of optical flow images;

the extraction module 2 is used for carrying out space domain sequence feature extraction on a plurality of RGB images to obtain a first feature and a second feature, carrying out time domain feature extraction on the optical flow image to obtain a third feature, and storing the first feature, the second feature and the third feature into an image feature set;

the fusion module 3 is used for carrying out feature fusion on the features in the image feature set to obtain a fused feature image, carrying out target region interception on the fused feature image to obtain a target image, and carrying out region segmentation on the target image to obtain a target face image and a target body image;

the intercepting module 4 is used for extracting key points of the target face image, intercepting a key area image of the target face image based on the key points, and storing the key area image into an identification feature set;

The filling module 5 is used for extracting a skeleton point set of the target body image, filling the skeleton point set to obtain an optimized skeleton point set, determining a space-time pose image of the target based on the optimized skeleton point set, and storing the space-time pose image into the image feature set according to a frame number corresponding relation;

the recognition module 6 is used for acquiring a template behavior image set, inputting the template behavior image set into a preset behavior recognition model for training, inputting the image feature set into the trained preset behavior recognition model, and outputting a behavior recognition result.

Wherein, the acquisition module 1 comprises:

the splitting module is used for carrying out single-frame splitting and slicing processing on the target video so as to split the target video into a plurality of continuous single-frame RGB images;

an image information determining sub-module for extracting single-frame space image of the target video and calculatingFirst image information of time-corresponding single-frame spatial image +.>And->Second image information of single-frame spatial image corresponding to time：

；

In the method, in the process of the invention,is the pixel center point of the image, < >>First coefficient matrix, ">For the second coefficient matrix->For the third coefficient matrix- >For the fourth coefficient matrix, ++>、/>First scalar and second scalar, < ->The displacement of the pixel point;

a judging sub-module for judgingTime-corresponding single-frame spatial image and +.>Whether the pixel values of the single-frame space images corresponding to the moments are the same or not;

a pixel determination sub-module for ifTime-corresponding single-frame spatial image and +.>The pixel values of the corresponding single-frame spatial image at the moment are the same, then +.>，/>，/>；

A displacement determination sub-module for determining a displacement based on the first image informationAnd the second image informationSolving the displacement of the pixel point>Based on the displacement amount of each pixel>Determining optical flow field information of the single-frame spatial image to obtain a plurality of optical flow images, wherein the displacement of the pixel point is +.>The method comprises the following steps:

。

the extraction module 2 comprises:

the first feature extraction module is used for inputting the RGB image into a first 2DCNN for spatial domain feature extraction so as to obtain a first feature;

the second feature extraction module is used for acquiring an image path and a category label of the RGB image, storing the image path and the category label into a CSV file, and inputting the CSV file and the RGB image into an LSTM (LSTM) for sequence feature extraction so as to obtain a second feature;

And the third feature extraction module is used for carrying out binarization gray scale processing on the optical flow image to obtain an optical flow gray scale image, calculating a first optical flow field and a second optical flow field of the optical flow gray scale image by adopting a dense optical flow method, and inputting the optical flow image, the first optical flow field and the second optical flow field into a second 2DCNN for carrying out time domain feature extraction to obtain a third feature.

The interception module 4 comprises:

a matrix determining sub-module, configured to input the target facial image into a key point extraction network to perform key point extraction, so as to obtain a plurality of key points, determine a facial key point matrix based on the key points, and determine a target matrix based on the facial key point matrix, where the face is related toKey point matrixIs +.>The method comprises the following steps of:

；/>；

a matrix calculation sub-module for calculating a matrix of facial key points based on the facial key pointsIs +.>Calculating an intermediate matrix +. >：

；

a conversion sub-module for based on the intermediate matrixDetermining a conversion matrix->Based on the transformation matrix +.>Performing transformation processing to obtain a transformed coordinate matrix +.>：

；

a interception sub-module for converting the coordinate matrixAnd intercepting a key area image of the target face image, and storing the key area image into an identification feature set.

The filling module 5 comprises:

the filling sub-module is used for extracting a skeleton point set of the target body image, and filling the skeleton point set to obtain an optimized skeleton point set;

and the pose determining sub-module is used for determining a space-time pose image of the target based on the optimized skeleton point set and storing the space-time pose image into the image feature set according to the corresponding relation of the frame number.

Wherein the filling sub-module comprises:

the skeleton point output unit is used for inputting the target body image into a skeleton point prediction network so as to obtain a plurality of initial skeleton points;

The iteration unit is used for acquiring the prediction reliability of the skeleton points and the affinity field, and carrying out iteration processing on the prediction reliability and the affinity field to obtain iteration reliabilityAffinity field with the iteration->：

；

a compensation unit for respectively correcting the iterative reliability based on the first loss function and the second loss functionIs associated with the iterative affinity field->Performing predictive compensation, wherein the first loss function +.>And said second loss function->The method comprises the following steps of:

；

a filling unit for estimating the iterative reliability after compensationIs associated with the iterative affinity field->And screening the skeleton points to obtain a skeleton point set, and filling the skeleton point set according to a preset filling algorithm to obtain an optimized skeleton point set.

The filling unit includes:

a first culling subunit, configured to calculate a reliability and an affinity field of each skeleton point, where the reliability is smaller than the iterative reliability after prediction compensationAnd/or said iterative affinity field +.>Removing skeleton points, and storing the reserved skeleton points into a skeleton point set;

a second eliminating subunit, configured to eliminate the central coordinates of the skeleton points asIs selected as a basic skeleton point in the skeleton point set>Selecting the front and the rear parts of the frame number corresponding to the reference skeleton point>Skeleton points of the frames and storing them in a filling dataset;

a coordinate calculation subunit for based on theCoordinate values of skeleton points to be filled are calculated according to the coordinates of skeleton points in the filling data set；/>

；

In the method, in the process of the invention,for filling the sample mean of x coordinate values of each skeleton point in the data set, +.>For filling the sample mean of y coordinate values of each skeleton point in the data set,/->Sample median for x-coordinate values of skeleton points in the filled data set,/->A sample median value for filling y coordinate values of each skeleton point in the data set;

and the supplementing subunit is used for supplementing the coordinate values of the skeleton points to be filled into the corresponding filling data set so as to obtain an optimized skeleton point set.

The pose determination submodule comprises:

an information acquisition unit for acquiring a preset time periodCoordinate information of each skeleton point in the optimized skeleton point set;

the framework connecting unit is used for connecting each framework point based on the coordinate information of each framework point so as to obtain a basic framework pose image;

the sampling unit is used for selecting a root skeleton point from the basic skeleton pose image, calculating the shortest distance between the skeleton point and the root skeleton point, storing the skeleton point with the shortest distance not larger than a preset distance into a sampling skeleton point set, and carrying out sampling treatment on the skeleton point in the sampling skeleton point set based on a preset sampling function, wherein the preset sampling function is as follows:

；

a mapping unit for mapping the adjacent regions of the skeleton points obtained by the sampling process intoSub-region:

；

a time-space pose image unit for determining updated skeleton points based on the mapping resultBased on the updated skeleton points Updating the basic skeleton pose image to obtain a time-space pose image, and storing the time-space pose image into the image feature set according to the corresponding relation of the frame number, wherein the updated skeleton points +.>The method comprises the following steps:

；

in the method, in the process of the application,for the mapping result of the rest skeleton points except the root skeleton point in the sampling skeleton point set, the ++>For the current skeleton point in a preset time period +.>Time position in->For a preset period of time->Is a starting point time position of (1).

In other embodiments of the present application, a computer is provided in the following embodiments, and the computer includes a memory 102, a processor 101, and a computer program stored in the memory 102 and capable of running on the processor 101, where the processor 101 implements the artificial intelligence based behavior recognition method as described above when executing the computer program.

In particular, the processor 101 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present application.

Memory 102 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 102 may comprise a Hard Disk Drive (HDD), floppy Disk Drive, solid state Drive (Solid State Drive, SSD), flash memory, optical Disk, magneto-optical Disk, tape, or universal serial bus (Universal Serial Bus, USB) Drive, or a combination of two or more of the foregoing. Memory 102 may include removable or non-removable (or fixed) media, where appropriate. The memory 102 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 102 is a Non-Volatile (Non-Volatile) memory. In a particular embodiment, the Memory 102 includes Read-Only Memory (ROM) and random access Memory (Random Access Memory, RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (Programmable Read-Only Memory, abbreviated PROM), an erasable PROM (Erasable Programmable Read-Only Memory, abbreviated EPROM), an electrically erasable PROM (Electrically Erasable Programmable Read-Only Memory, abbreviated EEPROM), an electrically rewritable ROM (Electrically Alterable Read-Only Memory, abbreviated EAROM), or a FLASH Memory (FLASH), or a combination of two or more of these. The RAM may be Static Random-Access Memory (SRAM) or dynamic Random-Access Memory (Dynamic Random Access Memory DRAM), where the DRAM may be a fast page mode dynamic Random-Access Memory (Fast Page Mode Dynamic Random Access Memory FPMDRAM), extended data output dynamic Random-Access Memory (Extended Date Out Dynamic Random Access Memory EDODRAM), synchronous dynamic Random-Access Memory (Synchronous Dynamic Random-Access Memory SDRAM), or the like, as appropriate.

Memory 102 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 101.

The processor 101 implements the artificial intelligence based behavior recognition method described above by reading and executing computer program instructions stored in the memory 102.

In some of these embodiments, the computer may also include a communication interface 103 and a bus 100. As shown in fig. 9, the processor 101, the memory 102, and the communication interface 103 are connected to each other via the bus 100 and perform communication with each other.

The communication interface 103 is used to implement communications between modules, devices, units, and/or units in embodiments of the application. The communication interface 103 may also enable communication with other components such as: and the external equipment, the image/data acquisition equipment, the database, the external storage, the image/data processing workstation and the like are used for data communication.

Bus 100 includes hardware, software, or both, coupling components of a computer device to each other. Bus 100 includes, but is not limited to, at least one of: data Bus (Data Bus), address Bus (Address Bus), control Bus (Control Bus), expansion Bus (Expansion Bus), local Bus (Local Bus). By way of example, and not limitation, bus 100 may include a graphics acceleration interface (Accelerated Graphics Port), abbreviated AGP, or other graphics Bus, an enhanced industry standard architecture (Extended Industry Standard Architecture, abbreviated EISA) Bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an industry standard architecture (Industry Standard Architecture, ISA) Bus, a wireless bandwidth (InfiniBand) interconnect, a Low Pin Count (LPC) Bus, a memory Bus, a micro channel architecture (Micro Channel Architecture, abbreviated MCa) Bus, a peripheral component interconnect (Peripheral Component Interconnect, abbreviated PCI) Bus, a PCI-Express (PCI-X) Bus, a serial advanced technology attachment (Serial Advanced Technology Attachment, abbreviated SATA) Bus, a video electronics standards association local (Video Electronics Standards Association Local Bus, abbreviated VLB) Bus, or other suitable Bus, or a combination of two or more of the foregoing. Bus 100 may include one or more buses, where appropriate. Although embodiments of the application have been described and illustrated with respect to a particular bus, the application contemplates any suitable bus or interconnect.

The computer can execute the artificial intelligence based behavior recognition method based on the acquired artificial intelligence based behavior recognition system, thereby realizing the behavior recognition of the target.

In still other embodiments of the present application, in combination with the above-described artificial intelligence-based behavior recognition method, embodiments of the present application provide a storage medium having a computer program stored thereon, which when executed by a processor implements the above-described artificial intelligence-based behavior recognition method.

Those of skill in the art will appreciate that the logic and/or steps represented in the flow diagrams or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.

The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. An artificial intelligence based behavior recognition method, comprising:

2. The artificial intelligence-based behavior recognition method according to claim 1, wherein the step of performing single-frame slicing processing on the target video to obtain RGB images of a plurality of single frames, and performing optical flow extraction on the target video to obtain optical flow images comprises:

extracting a single-frame space image of the target video, and calculatingFirst image information of a single-frame spatial image corresponding to a time instantAnd->Second image information of time-corresponding single-frame spatial image +.>：

；

if it isTime-corresponding single-frame spatial image and +.>The pixel values of the single-frame spatial images corresponding to the moments are the same, then，/>，/>；

Based on the first image informationAnd the second image information +>Solving the displacement of the pixel point>Based on the displacement amount of each pixel>Determining optical flow field information of the single-frame spatial image to obtain a plurality of optical flow images, wherein the displacement of the pixel point is +.>The method comprises the following steps:

。

3. the artificial intelligence based behavior recognition method of claim 1, wherein the step of performing spatial sequence feature extraction on the plurality of RGB images to obtain a first feature and a second feature, and performing temporal feature extraction on the optical flow image to obtain a third feature comprises:

4. The artificial intelligence based behavior recognition method of claim 1, wherein the extracting key points of the target face image, intercepting key region images of the target face image based on the key points, and storing the key region images in a recognition feature set comprises:

inputting the target facial image into a key point extraction network for key point extraction to obtain a plurality of key points, determining a facial key point matrix based on the key points, and determining a target matrix based on the facial key point matrix, wherein the facial key point matrix Is +.>The method comprises the following steps of:

；/>；

in the method, in the process of the invention,is the +.>Line->Column element->Is the +.>Line 1Column element->Is the%>Line->Column element->Is the%>Line->The elements of the column are arranged such that,the number of key points;

based on the facial key point matrixIs +.>Calculating an intermediate matrix +.>：

；

based on the intermediate matrixDetermining a conversion matrix->Based on the transformation matrix +.>Performing transformation processing to obtain a transformed coordinate matrix +.>：

；

5. The artificial intelligence based behavior recognition method of claim 1, wherein the step of extracting a skeleton point set of the target body image, and performing a filling process on the skeleton point set to obtain an optimized skeleton point set comprises:

；

In the method, in the process of the invention,、/>first prediction function and second prediction function, respectively, ">Mapping for image connection features->Is->Iteration reliability after a number of iterations, +.>Is->An iterative affinity field after the iteration;

the iterative reliability is respectively based on a first loss function and a second loss functionIs associated with the iterative affinity field->Performing predictive compensation, wherein the first loss function +.>And said second loss function->The method comprises the following steps of:

；

in the method, in the process of the invention,for the position of skeleton points in the image, +.>For mask (S)>Is->Iterative reliability of individual phases, +.>For iterative reliabilityAverage value->Is->A stepwise alternating affinity field, +.>Is the average value of the iterative affinity field, +.>For the number of stages->Is the square of the 2 norms;

6. The artificial intelligence based behavior recognition method of claim 5, wherein the iterative reliability after the prediction-based compensationIs associated with the iterative affinity field->Screening the skeleton points to obtain a skeleton point set, and filling the skeleton point set according to a preset filling algorithm to obtain an optimized skeleton point set, wherein the step of obtaining the optimized skeleton point set comprises the following steps:

calculate eachThe reliability and affinity field of each skeleton point are smaller than the iterative reliability after prediction compensationAnd/or said iterative affinity field +.>Removing skeleton points, and storing the reserved skeleton points into a skeleton point set;

；

In the method, in the process of the invention,for filling the sample mean of x coordinate values of each skeleton point in the data set, +.>For filling the sample mean of y coordinate values of each skeleton point in the data set,/->Sample median for x-coordinate values of skeleton points in the filled data set,/- >A sample median value for filling y coordinate values of each skeleton point in the data set;

7. The artificial intelligence based behavior recognition method according to claim 1, wherein the step of determining a spatio-temporal pose image of a target based on the optimized skeleton point set, and storing the spatio-temporal pose image in the image feature set according to a frame number correspondence comprises:

；

8. An artificial intelligence based behavior recognition system, the system comprising:

9. A computer comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the artificial intelligence based behavior recognition method of any one of claims 1 to 7 when the computer program is executed.

10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the artificial intelligence based behavior recognition method of any one of claims 1 to 7.