WO2023138154A1

WO2023138154A1 - Object recognition method, network training method and apparatus, device, medium, and program

Info

Publication number: WO2023138154A1
Application number: PCT/CN2022/129057
Authority: WO
Inventors: 苏海昇
Original assignee: 上海商汤智能科技有限公司
Priority date: 2022-01-24
Filing date: 2022-11-01
Publication date: 2023-07-27
Also published as: CN114494962A

Abstract

Disclosed in embodiments of the present invention are an object recognition method, a network training method and apparatus, a device, a medium, and a program. The object recognition method comprises: acquiring a video frame to be recognized comprising a target object of a screen; said video frame being any video frame in a video stream of the target object; determining an initial attitude sequence of the target object on the basis of said video frame and a historical video frame of said video frame in the video stream; performing probability mapping on the initial attitude sequence to obtain a target attitude sequence of the target object in said video frame; performing feature conversion on the target attitude sequence in space and time to obtain an attitude feature trajectory of the target object in said video frame; and determining a behavior state of the target object in said video frame on the basis of the attitude feature trajectory.

Description

Object recognition method, network training method, device, equipment, medium and program

Cross References to Related Applications

This disclosure claims the priority of the Chinese patent application number 202210082276.3 submitted on January 24, 2022, the applicant is Shanghai Shangtang Intelligent Technology Co., Ltd., and the application name is "object recognition method, network training method, device, equipment and medium". The full text of this application is incorporated into this disclosure by reference.

technical field

Embodiments of the present disclosure relate to the field of image processing, and in particular to an object recognition method, a network training method, a device, equipment, a medium and a program.

Background technique

In related technologies, abnormal video frames are identified from video streams based on pixel methods such as optical flow or frame gradient, which are easily affected by noise in video images, resulting in poor identification results.

Contents of the invention

An embodiment of the present disclosure provides a technical solution for object recognition.

The technical scheme of the embodiment of the present disclosure is realized in this way:

An embodiment of the present disclosure provides an object recognition method, the method comprising: acquiring a video frame to be recognized including a target object; the video frame to be recognized is any video frame in the video stream of the target object; determining an initial pose sequence of the target object based on the video frame to be recognized and historical video frames of the video frame to be recognized in the video stream; performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized; performing feature transformation on the target pose sequence in space and time to obtain the target object in the video frame to be recognized The gesture feature trajectory; based on the gesture feature trajectory, determine the behavior state of the target object in the video frame to be recognized.

An embodiment of the present disclosure provides a training method for an object recognition network, the method comprising: acquiring a sample video frame including a sample object; wherein, the sample video frame is any video frame in the sample video stream of the sample object;

Determine the sample normalized pose sequence of the sample object in the sample video frame; use the object recognition network to be trained to carry out probability mapping on the sample normalized pose sequence to obtain a sample pose sequence; perform feature conversion on the sample pose sequence in space and time to obtain a sample pose feature track of the sample object in the sample video frame; perform pose reconstruction on the sample pose feature track to obtain a reconstructed pose sequence; determine the reconstruction loss of the similarity between the reconstructed pose sequence and the sample normalized pose sequence; The parameters are adjusted so that the reconstruction loss output by the adjusted object recognition network meets the convergence condition.

An embodiment of the present disclosure provides an object recognition device, the device comprising: a first acquisition part configured to acquire a video frame to be recognized whose picture includes a target object; the video frame to be recognized is any video frame in a video stream of the target object; a first determination part configured to determine an initial pose sequence of the target object based on the video frame to be recognized and historical video frames of the video frame to be recognized in the video stream; a first mapping part configured to perform probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized; is configured to perform feature transformation on the target pose sequence in space and time to obtain a pose feature track of the target object in the video frame to be recognized; a second determining part is configured to determine the behavior state of the target object in the video frame to be recognized based on the pose feature track.

An embodiment of the present disclosure provides a training device for an object recognition network, the device comprising: a second acquisition part configured to obtain a sample video frame including a sample object; wherein the sample video frame is any video frame in the sample video stream of the sample object; a third determination part configured to determine a sample normalized pose sequence of the sample object in the sample video frame; a second mapping part configured to use an object recognition network to be trained to perform probability mapping on the sample normalized pose sequence to obtain a sample pose sequence; performing feature conversion in space and time to obtain a sample pose feature track of the sample object in the sample video frame; a reconstruction part configured to perform pose reconstruction on the sample pose feature track to obtain a reconstructed pose sequence; a fourth determination part configured to determine a reconstruction loss of similarity between the reconstructed pose sequence and the sample normalized pose sequence; an adjustment part configured to adjust network parameters of the object recognition network to be trained based on the reconstruction loss, so that the adjusted reconstruction loss output by the object recognition network meets a convergence condition.

An embodiment of the present disclosure provides a computer device. The computer device includes a memory and a processor. The memory stores computer-executable instructions. When the processor runs the computer-executable instructions on the memory, it can implement the above object recognition method, or the object recognition network training method.

An embodiment of the present disclosure provides a computer storage medium, on which computer-executable instructions are stored. After the computer-executable instructions are executed, the above-mentioned object recognition method or the training method of an object recognition network can be implemented.

An embodiment of the present disclosure provides a computer program, where the computer program includes computer-readable codes. When the computer-readable codes run in an electronic device, a processor of the electronic device executes the above-mentioned object recognition method, or the above-mentioned object recognition network training method.

Embodiments of the present disclosure provide an object recognition method, network training method, device, equipment, medium, and program. First, the acquired image includes a video frame to be recognized of the target object, and the video frame to be recognized is any video frame in the video stream of the target object; secondly, based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream, the initial pose sequence of the target object is determined; then, probability mapping is performed on the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized; thus, the target object in the video frame to be recognized can be obtained. The posture situation of the target is mapped to the specific probability corresponding to the previous movement; finally, the target posture sequence is subjected to feature conversion in space and time to obtain the posture feature trajectory of the target object in the video frame to be recognized; and based on the posture feature trajectory, determine the behavior state of the target object in the video frame to be recognized. In this way, through the dynamic posture information of the target object in the time dimension and space dimension, the accuracy of determining the motion information of the target object in the video frame to be recognized can be improved, and the accuracy of the behavior state of the target object in the video frame to be recognized based on posture characteristics can be improved.

In order to make the above-mentioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments will be described in detail below together with the accompanying drawings.

Description of drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the following will briefly introduce the accompanying drawings that need to be used in the description of the embodiments. Obviously, the accompanying drawings in the following description are only some embodiments of the embodiments of the present disclosure. For those of ordinary skill in the art, other drawings can also be obtained based on these drawings without creative work, wherein:

FIG. 1 is a schematic flowchart of a first object recognition method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of a second object recognition method provided by an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of a training method for an object recognition network provided by an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of converting a gesture trajectory into a motion feature in a probability domain based on a motion embedder provided by an embodiment of the present disclosure;

FIG. 5 is a schematic framework diagram of a training method for an object recognition network provided by an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a space-time converter provided by an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of an internal processing flow of a space-time converter provided by an embodiment of the present disclosure;

FIG. 8A is a schematic diagram of the structural composition of an object recognition device provided by an embodiment of the present disclosure;

FIG. 8B is a schematic diagram of the structural composition of a training device for an object recognition network provided by an embodiment of the present disclosure;

FIG. 9 is a schematic diagram of the structural composition of a computer device provided by an embodiment of the present disclosure.

Detailed ways

In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the specific technical solutions of the invention will be further described in detail below in conjunction with the drawings in the embodiments of the present disclosure. The following embodiments are configured to illustrate embodiments of the present disclosure, but are not intended to limit the scope of embodiments of the present disclosure.

In the following description, "some embodiments" are referred to, which describes a subset of all possible embodiments, but it can be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined without conflict.

In the following description, the terms "first\second\third" are only used to distinguish similar objects, and do not represent a specific ordering of objects. It is understandable that "first\second\third" can be interchanged with a specific order or sequence if allowed, so that the embodiments of the present disclosure described here can be implemented in an order other than those illustrated or described here.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the disclosure belong. The terms used herein are only for the purpose of describing the embodiments of the present disclosure, and are not intended to limit the embodiments of the present disclosure.

Before the embodiments of the present disclosure are further described in detail, the nouns and terms involved in the embodiments of the present disclosure will be described, and the nouns and terms involved in the embodiments of the present disclosure are applicable to the following explanations.

1) Normalization: It is a method of simplifying calculations, that is, transforming dimensional expressions into dimensionless expressions and becoming scalars; among them, normalization is a dimensionless processing method that makes the absolute value of the physical system numerical value into a certain relative value relationship. It is mainly used to simplify the amount of calculation and reduce the value.

2) Confidence: In statistics, the confidence interval (Confidence interval) of a probability sample is an interval estimate of a certain population parameter of this sample. The confidence interval shows the degree to which the true value of the parameter falls with a certain probability around the measured result. The confidence interval gives the range of the degree of confidence in the measured value of the measured parameter, that is, the "certain probability" required above. This probability is called the confidence level.

The exemplary application of the object recognition device provided by the embodiment of the present disclosure is described below. The device provided by the embodiment of the present disclosure can be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a camera, and a mobile device (for example, a personal digital assistant, a dedicated message device, a portable game device) with an image acquisition function, and can also be implemented as a server. Below, an exemplary application when the device is implemented as a terminal or a server will be described.

The method can be applied to a computer device, and the functions implemented by the method can be realized by calling a program code by a processor in the computer device. Of course, the program code can be stored in a computer storage medium. It can be seen that the computer device includes at least a processor and a storage medium.

The embodiment of the present disclosure provides an object recognition method, as shown in FIG. 1 , which is a schematic flow chart of the first object recognition method provided by the embodiment of the present disclosure; the following description is made in conjunction with the steps shown in FIG. 1 :

In step S101, a video frame to be recognized including a target object is acquired.

In some embodiments, the video frame to be identified is any video frame in the video stream of the target object. The image acquisition of the target object can be carried out through the device with image acquisition function to obtain the video stream, or the video stream sent by other devices can be obtained directly; at the same time, any video frame is randomly selected from the video stream as the video frame to be recognized. At the same time, the video stream may be set in a preset area, and at least one device with a collection function collects the preset area. Among them, at least one of the devices can be set at multiple collection points in the preset area, so as to collect relevant information in the preset area, such as: target objects appearing in the preset area. At the same time, the preset area can be any area in the real scene, such as: shopping malls, parks, roads, etc., and can also be road intersections, etc.

In some embodiments, the video stream may also be obtained by a related device collecting images of the target object, where the video stream may include at least one scene information, that is, the target object may be in multiple scenes in the video stream. In the following embodiments of the present disclosure, the target object may be a passer-by walking on the road, a vehicle driving on the road, or a puppy running in a park.

In some embodiments, the number of target objects included in the video frame of the video frame to be identified may be one, two or more. At the same time, when the number of target objects included is two or more, the areas where different target objects are located in the video frame to be identified can be adjacent, far away from, or partially overlapped, etc., and the areas occupied by different target objects in the video frame to be identified can be the same or different. In the following embodiments of the present disclosure, the number of target objects is taken as an example for illustration.

In some embodiments, the poses presented by the target objects included in different video frames in the video stream may be the same or different. Exemplarily, when the target object is a person, the person included in different video frames may be walking, running, standing, and so on.

Step S102: Determine an initial pose sequence of the target object based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream.

In some embodiments, the historical video frame of the video frame to be identified in the video stream may refer to at least one video frame in the video stream before the timing of the video frame to be identified and adjacent to the video frame to be identified; wherein, the number of historical video frames may be one frame, or two frames or more.

In some embodiments, at first, the key point recognition of the target object is carried out on the video frame to be recognized and the historical video frame respectively, so as to obtain the key point of the target object in the video frame to be recognized, and the key point of the target object in the historical video frame; then, based on the key point of the target object in the video frame to be recognized, and the key point of the target object in the historical video frame, determine the posture information of the target object in the video frame to be recognized and the posture information of the target object in the historical video frame; finally according to the temporal relationship between the historical video frame and the video frame to be recognized, the posture information in the historical video frame and the target object to be recognized The pose information in the video frame is recognized and sorted to obtain the initial pose sequence of the target object.

In some embodiments, when the target object is a person, human body key points or human body joint points can be identified on the video frame to be recognized and the historical video frame respectively, and then the human body key point in the video frame to be recognized and the human body key point in the historical video frame are obtained, and based on the human body key point in the video frame to be recognized and the human body key point in the historical video frame, the human body posture information in the video frame to be recognized is obtained respectively, and the human body posture information in the historical video frame is obtained; finally, based on the timing relationship between the historical video frame and the video frame to be recognized, The human body posture information and the human body posture information in the video frame to be recognized are sorted to obtain the initial posture sequence.

In some embodiments, each initial pose in the initial pose sequence may be represented based on the position information of the key points of the target object in the corresponding video frame; wherein, the position information may be the two-dimensional coordinate information of the key points of the target object in the corresponding video frame.

Step S103, performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized.

In some embodiments, probability mapping is performed on the determined initial pose sequence of the target object, that is, correlation mapping is performed on the pose information in the respective frames of the historical video frame and the video frame to be recognized, to obtain the target pose sequence of the target object in the video frame to be recognized. In some embodiments, probability mapping may refer to performing probability mapping on each initial pose based on position information of key points of the initial pose in the sequence of initial poses. At the same time, the target pose sequence of the target object in the video frame to be recognized can be obtained by sorting the determined target poses based on the temporal relationship between the historical video frame and the video frame to be recognized.

In some embodiments, when the number of historical video frames is 7 frames, the initial posture sequence correspondingly includes the posture data of the target object in 8 video frames (7 historical video frames and 1 frame to be recognized video frame); first, probability mapping is performed on the posture data of the target object in the 8 video frames, which can be based on the posture data of the target object in each video frame, such as: posture position information or pixel information, determine the relevant probability parameters, and then fuse the posture data of each video frame in the 8 video frames based on the probability parameters correspondingly, to obtain Characterize the motion pose of the target object in the video frame to be recognized, that is, the target pose sequence of the target object.

In some embodiments, the target pose sequence includes pose data in multiple frames of video frames, and the pose data in each frame of video frames may be represented by key points of the pose data, and the key points may be represented by coordinate information of the key points in the corresponding video frame.

Step S104, performing feature transformation on the target pose sequence in space and time to obtain a pose feature track of the target object in the video frame to be recognized.

In some embodiments, the determined target pose sequence may be subjected to feature conversion in the time dimension and the space dimension in turn, and then the pose feature track of the target object in the video frame to be recognized may be obtained; wherein, the target pose sequence may be input into a network including a space converter and a time converter for feature conversion, so as to obtain the corresponding gesture feature track.

In some embodiments, the coordinate information corresponding to the key point of each target pose in the target pose sequence may be first converted into a feature vector, and then based on the attention parameter, the converted feature vector is adjusted in the time dimension and the space dimension to obtain the final gesture feature track.

Step S105, based on the trajectory of the gesture feature, determine the behavior state of the target object in the video frame to be recognized.

In some embodiments, the behavior state of the target object in the video frame to be recognized can be determined based on the determined gesture feature trajectory; wherein, the behavior state can be the motion information of the target object, such as: the target object is walking, jumping, standing, etc. Wherein, the posture feature trajectory may be input to the corresponding network model to determine the behavior state corresponding to the posture feature trajectory.

In some embodiments, after determining the behavior state of the target object in the video frame to be recognized, the behavior state may be identified based on the scene information associated with the video frame to be recognized, and then determine whether the video frame to be recognized is an abnormal video frame. Exemplarily, when it is determined that the behavior state of the target object in the video frame to be recognized is jumping, the scene information associated with the video frame to be recognized is obtained, such as: a zebra crossing at a crossroad; then a preset behavior rule associated with the scene information is determined; finally, the behavior state is determined based on the preset behavior rule, that is, jumping is an abnormal behavior, and then the video frame to be recognized is determined to be an abnormal video frame.

In some embodiments, anomaly recognition is performed on a video frame including any video frame to be recognized in the video stream of the target object; wherein, first, the historical video frame of the video frame to be recognized in the video stream can be determined, and the pose information of the target object in the video frame to be recognized and the pose information of the target object in the historical video frame are obtained; secondly, the initial pose sequence of the target object is determined based on the pose information in the historical video frequency and the pose information in the video frame to be recognized, and the initial pose sequence is performed. The target posture sequence of the motion information of the object; then the target posture sequence is converted into a feature vector, and the feature vector is subjected to feature conversion and dimension adjustment on the spatial dimension and the time dimension at the same time, so as to obtain a gesture feature trajectory representing the motion of the target object in the video frame to be recognized; finally, based on the posture feature trajectory, that is, the dynamic feature information determines the behavior state of the target object in the video frame to be recognized, and determines whether the video frame to be recognized is an abnormal video frame based on the behavior state.

In the object recognition method provided by the embodiments of the present disclosure, firstly, the acquired picture includes the video frame to be recognized of the target object, and the video frame to be recognized is any video frame in the video stream of the target object; in this way, the video stream of the target object and the video frame to be recognized are obtained to provide a basis for subsequent determination of the behavior state of the target object in the video frame to be recognized; secondly, based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream, the initial pose sequence of the target object is determined; and probability mapping is performed on the initial pose sequence to obtain the target in the video frame to be recognized The target posture sequence of the object; in this way, the posture situation of the target object in the video frame to be recognized can be mapped to the specific probability corresponding to the previous movement, that is, the determined target posture sequence can be considered at the same time. The previous motion and current motion information of the target object, and then the determined target posture sequence can more accurately match the motion information of the target object in the video frame to be recognized; finally, the target posture sequence is subjected to feature conversion in space and time to obtain the posture feature trajectory of the target object in the video frame to be recognized; and based on the posture feature trajectory, determine the target object in the video frame to be recognized behavioral state. In this way, with the help of the dynamic posture information of the target object in the time dimension and the space dimension, determining the corresponding posture feature track can improve the accuracy of determining the motion information of the target object in the video frame to be recognized, and then can improve the accuracy of determining the behavior state of the target object in the video frame to be recognized based on the posture feature.

In some embodiments, based on the key points of the target object in the historical video frame and the key points of the target object in the video frame to be recognized, the pose information in the video frame to be recognized and the pose information in the historical video frame are determined; then based on the pose information in the video frame to be recognized and the pose information in the historical video frame respectively, the initial pose sequence of the target object is obtained; in this way, when considering the previous motion and current motion information of the target object at the same time, that is, considering both the pose information in the historical video frame and the pose information in the video frame to be recognized, the determined target The initial pose sequence of the object is more accurate, that is, the accuracy of determining the initial pose sequence can be improved on the basis of determining the pose information with high accuracy. That is, step S102 provided in the above embodiment may be implemented through the following steps S201 to S203. As shown in FIG. 2, it is a schematic flowchart of the second object recognition method provided by the embodiment of the present disclosure, and the following description is made in conjunction with the steps shown in FIG. 1 and FIG. 2:

Step S201, performing key point recognition on the video frame to be recognized and the historical video frame respectively, to obtain the key point of the target object in the video frame to be recognized and the key point of the target object in the historical video frame.

In some embodiments, a trained neural network can be used to identify the key points of the video frame to be recognized and the historical video frame respectively, and then obtain the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frame; wherein, the trained neural network can be any neural network, which is not limited in this embodiment of the present disclosure.

In some embodiments, when the target object is a person, key points of the human body, that is, joint points of the human body, may be identified on the video frame to be recognized and the historical video frame, respectively, to obtain the joint points of the human body contained in the video frame to be recognized and the historical video frame. Among them, the joint points of the human body can include 17 joint points, such as: nose, left and right eyes, left and right ears, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles.

Step S202, based on the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frame respectively, determine pose information in the video frame to be recognized and pose information in the historical video frame.

In some embodiments, the pose information of the target object in the video frame to be recognized can be represented based on the key points of the target object in the video frame to be recognized, and the pose information of the target object in the historical video frame can be represented based on the key points of the target object in the historical video frame; for example, the pose information of the target object in the video frame to be recognized can be characterized based on the position information of the key point of the target object in the video frame to be recognized; wherein, the position information can refer to the coordinate information of any key point in the video frame to be recognized. Similarly, the pose information of the target object can also be represented in the above-mentioned manner in the historical video frames.

Step S203, according to the temporal relationship between the historical video frame and the video frame to be recognized, sort the pose information in the historical video frame and the pose information in the video frame to be recognized to obtain an initial pose sequence.

In some embodiments, according to the sequential relationship between the historical video frames and the timing of the video frame, the posture information in the historical video frames and the posture information in the video frame to be identified are sorted in order to get the initial posture sequence; in an example, the posture information in the P _N ₊₁ indicates the attitude information in the video frame. Initial information, the initial posture sequence can be represented by [P ₁ , ..., _{P N} _, P _N+1 ]; where n is an integer greater than or equal to 1. At the same time, each initial pose in the initial pose sequence can be represented based on the coordinate information of key points in the corresponding video frame as described above.

In some embodiments of the present disclosure, a correlation normalization operation and a mapping operation may be performed on each initial pose in the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized. In this way, the posture of the target object in the video frame to be recognized can be mapped to the specific probability corresponding to the previous movement, and then the posture information of the target object in the video frame to be recognized can be represented based on historical posture information and current posture information; that is, the posture information of the target object in the video frame to be recognized can be represented based on dynamic parameters, and the accuracy of the motion information of the target object in the video frame to be recognized can be improved. That is, step S103 provided in the above embodiment can be realized through the following steps S204 and S205:

Step S204, based on the position information of the key points in each initial pose, a center point sequence for determining the displacement of adjacent initial poses and a normalized pose sequence of the initial pose sequences are obtained.

In some embodiments, based on the position information of key points in each initial pose in the initial pose sequence, a center point sequence for determining the displacement of adjacent initial poses and a normalized pose sequence corresponding to the initial pose sequence can be determined.

In some embodiments of the present disclosure, the bounding box of each initial pose can be determined based on the position information of the key points in each initial pose, and then based on the bounding box and the initial pose, a corresponding center point sequence and a normalized pose sequence can be determined. In this way, the accuracy of the determined central point sequence and the normalized attitude sequence can be improved; at the same time, based on the normalization operation for each initial attitude, the corresponding normalized data, that is, the normalized attitude sequence, is obtained, so that the subsequent accuracy and speed of determining the target attitude sequence based on the normalized data can be improved. That is, the above step S204 can be realized through the following process:

The first step is to determine the bounding box of each initial pose based on the position information of key points in each initial pose.

In some embodiments, the minimum rectangular frame that can enclose the initial pose may be determined in the corresponding video frame, that is, the minimum rectangular frame that can enclose the position information of key points in each initial pose. The sizes of the bounding boxes of different initial poses may be the same or different, and the position information of the bounding boxes of different initial poses in the corresponding video frames may be the same or different.

In some embodiments, when the position information of the key point is the coordinate information of the key point in the corresponding video frame, first, the position information of the key point in each initial pose can be compared to determine a minimum coordinate point and a maximum coordinate point, and then based on the minimum coordinate point and the maximum coordinate point, determine the smallest rectangular frame that can surround the minimum coordinate point and the maximum coordinate point, that is, the bounding box of each initial pose.

The second step is to sort the center points of the bounding boxes of each initial pose in the initial pose sequence to obtain the center point sequence.

In some embodiments, the center point of the bounding box of each initial pose is determined, and then the center of the bounding box of each initial pose is sorted based on the sequence information of the initial poses in the initial pose sequence to obtain a center point sequence. Wherein, the center point of the bounding box of each initial pose can be characterized based on the position information of the center point, that is, the coordinate information in the corresponding video frame.

The third step is to normalize each initial pose by using the bounding box of each initial pose to obtain the normalized pose sequence.

In some embodiments, the bounding box of each initial pose is used, and each initial pose is normalized to obtain a normalized pose sequence; wherein, the size information of the bounding box of each initial pose may be used, such as the width and height of the bounding box, and each initial pose is normalized to obtain a normalized pose corresponding to each initial pose, and then a normalized pose sequence is obtained.

Step S205, performing probability mapping on the normalized pose sequence based on the center point sequence to obtain the target pose sequence.

In some embodiments, each normalized pose in the normalized pose sequence is in one-to-one correspondence with each initial pose in the initial pose sequence, and the center point sequence is corresponding to the bounding box of each initial pose in the initial pose sequence, that is, the center point sequence corresponds to the normalized pose sequence; here, probability mapping is performed on each normalized pose in the normalized pose sequence based on the center point sequence to obtain a corresponding target pose sequence. Exemplarily, the corresponding probability parameter may be determined based on the central point sequence, and finally, based on the probability parameter, probability mapping is performed on each normalized posture in the normalized posture sequence to obtain the target posture sequence.

In some embodiments of the present disclosure, the normalized displacement sequence corresponding to the central point sequence can be determined based on the relationship between adjacent central points in the central point sequence, and then the normalized posture sequence can be probabilistically mapped based on the normalized displacement sequence to obtain the target posture sequence; in this way, the target posture sequence can take into account both the previous motion and the current motion information of the target object, so that the determined target posture sequence can more accurately match the motion information of the target object in the video frame to be recognized. That is, the above step S205 can be realized through the following process:

First, in the center point sequence, a displacement sequence is obtained based on the difference between the position information of every two adjacent center points.

In some embodiments, in the center point sequence, the difference between the position information of every two adjacent center points can be determined according to the arrangement order of the center points as the displacement corresponding to the previous center point or the next center point in the adjacent center points, and then the displacement sequence is obtained; wherein, the position information of the adjacent center points is the center point of the bounding box that can surround the initial posture determined in the adjacent video frame, and the coordinate information in the corresponding video frame.

In some embodiments, when the position information of the central point is represented by two-dimensional coordinates, the difference between the abscissas of every two adjacent central points and the difference between the vertical coordinates of every two adjacent central points can be calculated by a correlation function to obtain the displacement of the two adjacent central points.

Secondly, based on the size information of the bounding box of each initial pose, each displacement in the displacement sequence is normalized to obtain a normalized displacement sequence.

In some embodiments, each displacement in the displacement sequence may be normalized based on the size information of the bounding box of each initial pose, such as the height and width of the bounding box, to obtain a normalized displacement sequence. Exemplarily, the size information of the bounding box of each initial pose may be used, such as adding the length and width of the bounding box to obtain the corresponding size information, and then dividing each displacement in the displacement sequence based on the corresponding size information to obtain a normalized displacement sequence.

Then, probability mapping is performed on the normalized pose sequence based on the normalized displacement sequence to obtain the target pose sequence.

In some embodiments, based on the determined normalized displacement sequence, probability mapping is performed on the normalized pose sequence to obtain the corresponding target pose sequence; that is, the normalized pose sequence can be mapped to a specific probability to represent the motion information of the target object in the video frame to be recognized, that is, the target pose sequence. Here, the probability mapping of the normalized attitude sequence based on the normalized displacement sequence can improve the accuracy and speed of determining the target attitude sequence.

In some embodiments of the present disclosure, the continuous distribution function associated with the normalized displacement sequence can be determined, and then each normalized displacement is input to the continuous distribution function, and the corresponding scaling factor, that is, the scaling probability is determined, and finally, based on the scaling probability, each normalized posture in the normalized posture sequence is mapped to obtain the target posture sequence. In this way, mapping historical pose information and current pose information to the pose information of the target object in the video frame to be recognized based on a specific probability can improve the accuracy of characterizing the motion information of the target object in the video frame to be recognized, that is, the target pose sequence. That is, the above step S253 can be realized through the following process:

In the first step, each normalized displacement in the normalized displacement sequence is fitted to obtain a fitting result.

In some embodiments, a preset function, such as Rayleigh Distribution, is used to fit each normalized displacement in the normalized displacement sequence to obtain a fitting result. It is also possible to use a Gaussian distribution to fit each normalized displacement in the normalized displacement sequence to obtain a fitting result.

The second step is to determine the continuous distribution function that the fitting result satisfies.

In some embodiments, the continuous distribution function satisfied by the fitting result is determined; wherein, the continuous distribution function may be represented by a conventional function expression, or may be represented by a text description.

In the third step, each normalized displacement is input into the continuous distribution function to obtain the scaling probability of each normalized displacement.

In some embodiments, each normalized displacement can be input into the continuous distribution function to obtain a scaling factor corresponding to each normalized displacement, that is, a scaling probability.

The fourth step is to map each normalized pose based on the scaling probability of each normalized displacement to obtain the target pose sequence.

Here, the normalized posture sequence corresponds to the initial posture sequence, and the scaling probability corresponding to each normalized displacement corresponds to the central point sequence, that is, each central point in the initial posture sequence (the normalized displacement sequence is determined based on the central point sequence); furthermore, each normalized posture in the normalized posture sequence can be fused with the scaling probability of each normalized displacement to obtain the target posture sequence; that is, each normalized posture can be divided by the scaling probability of each normalized displacement to obtain the target posture sequence.

Here, inputting the target pose sequence obtained above into the space-time converter can now perform feature conversion and dimension adjustment on each target pose in the target pose sequence in the time and space dimensions, and then obtain a pose feature trajectory that is more matching with the motion information of the target object in the video frame to be recognized. That is, step S104 provided in the above embodiment can be realized through the following steps S206 and S207:

Step S206 , in the target pose sequence, perform feature transformation on each target pose based on key points of each target pose to obtain a feature sequence to be adjusted.

In some embodiments, in the target pose sequence, based on each target pose including multiple key points, feature conversion is performed on the key point information of each target pose, that is, feature conversion is performed on each target pose, so as to obtain the feature sequence to be adjusted.

Step S207, perform feature dimension adjustment on each feature to be adjusted in space and time to obtain the gesture feature track.

In some embodiments, based on the attention parameters of the spatial dimension in the attention mechanism and the attention parameters of the time dimension in the attention mechanism, feature fusion and dimension adjustment can be performed on each feature to be adjusted, so as to obtain the gesture feature trajectory.

In some embodiments of the present disclosure, the feature sequence to be adjusted can be adjusted sequentially in the space dimension and the time dimension, so as to obtain the final gesture feature trajectory. Here, the attitude information usually includes two dimensions, namely the space dimension and the time dimension. Therefore, sequentially adjusting the feature sequence to be adjusted in the space dimension and the time dimension can make the determined attitude feature track match the actual motion of the target object; thus, the determined attitude feature track can be more accurate. That is, the above step S207 can be realized through the following process:

In the first step, each feature to be adjusted is fused with a preset spatial feature of each feature to be adjusted to obtain a sequence of spatial features.

In some embodiments, the preset spatial feature of each feature to be adjusted may be a spatial feature parameter related to attributes of key points in each feature to be adjusted in the spatial dimension. Exemplarily, when the target object is a person, the preset spatial characteristics of the features to be adjusted corresponding to different key points of the person are different, which are determined based on the positions of the key points in the human body and the attributes of the key points of the human body.

In some embodiments, each feature to be adjusted and the preset spatial feature of each feature to be adjusted may be fused based on key points to obtain a sequence of spatial features; wherein each feature to be adjusted corresponds to each target pose, and each target pose includes multiple key points; that is, each feature to be adjusted corresponds to multiple key points of the same target pose.

In the second step, based on the attention parameter of the spatial dimension in the attention mechanism, multi-layer dimension adjustment is performed on the spatial feature sequence to obtain the spatial posture feature sequence.

In some embodiments, based on the attention parameters of the spatial dimension in the attention mechanism, such as: query value (Query), key value (Key), matrix (Value matrix), each spatial feature in the spatial feature sequence is adjusted in multiple dimensions, and then the spatial posture feature sequence is obtained.

The third step is to fuse each space attitude feature with the preset time feature of each space attitude feature to obtain a time feature sequence.

In some embodiments, the preset time feature of each space gesture feature may be a time feature parameter related to the attributes of key points in each space gesture feature in the time dimension. For example, when the target object is a person, the preset time characteristics of the features to be adjusted corresponding to different key points of the person are different, which is determined by the position of the key point in the human body and the attribute of the key point of the human body.

In some embodiments, based on the key points, each space posture feature and the preset time feature of each space posture feature are fused to obtain a time feature sequence; wherein each space posture feature corresponds to each target posture, and each target posture includes multiple key points; that is, each space posture feature corresponds to multiple key points of the same target posture.

Step 4: Based on the attention parameters of the time dimension in the attention mechanism, multi-dimensional adjustments are performed on the time feature sequence to obtain the gesture feature trajectory.

Among them, the output of the dimension adjustment of the previous layer is the input of the dimension adjustment of the next layer. In some embodiments, based on the attention parameters of the time dimension in the attention mechanism, multi-layer dimension adjustment or feature encoding can be performed on each time feature in the time feature sequence, so as to obtain the gesture feature trajectory.

Here, the attention parameters of the time dimension in the attention mechanism correspond one-to-one to the attention parameters of the space dimension in the attention mechanism, which are associated with the input feature parameters. At the same time, during multi-layer dimension adjustment, the output of the dimension adjustment of the previous layer is the input of the dimension adjustment of the next layer; for example, the features obtained after the adjustment of the first layer are directly used as the input of the second layer.

Here, in the object recognition method provided by the embodiment of the present disclosure, after determining the behavior state of the target object in the video frame to be recognized based on the gesture feature trajectory, that is, after step S105 provided in the above embodiment, the following process may also be performed:

First, acquire scene information corresponding to the video frame to be identified.

In some embodiments, the scene information corresponding to the video frame to be recognized is obtained, such as: office, park, parking lot, etc.

Second, a preset behavior rule associated with the scene information is determined.

In some embodiments, the preset behavior rules associated with the scene information are determined. Here, when the scene information is an office, the associated preset behavior rules are that the office staff in the office are only allowed to work normally and communicate with others, but the office staff are not allowed to lie down and rest. In the case that the scene information is a parking lot, its associated preset behavior rules are that vehicles in the parking lot are allowed to park, and low-speed driving is allowed, but high-speed driving is not allowed.

Finally, the preset behavior rule is used to determine whether the video frame to be identified to which the behavior state belongs is an abnormal video frame.

In some embodiments, the preset behavior rule is used to identify the behavior state to determine whether the behavior state belongs to an abnormal behavior state, and further to determine whether the video frame to be identified to which the behavior state belongs is an abnormal video frame. In this way, firstly, through the scene information corresponding to the video frame to be recognized, determine the preset behavior rules associated with the scene information of the video frame to be recognized, that is, determine the preset behavior rules that the objects contained in the picture in the video frame to be recognized need to abide by; then, based on the preset behavior rules, determine whether the behavior state of the object in the video frame to be recognized is reasonable, and determine whether the video frame to be recognized is an abnormal video frame based on the judgment of whether it is reasonable; in this way, it is possible to more accurately and conveniently determine whether the video frame belongs to an abnormal video frame.

In some embodiments of the present disclosure, the target object includes at least two objects to be identified. Using the preset behavior rules, determining whether the video frame to be identified to which the behavior state belongs is an abnormal video frame can be achieved by the following steps:

In the first step, the behavior state of each object to be recognized in the at least two objects to be recognized is recognized by using the preset behavior rule to obtain a behavior recognition result set.

In some embodiments, preset behavior rules are used to identify the behavior state of each object to be identified, determine the behavior recognition result of each object to be identified, and then obtain a behavior recognition result set; where the behavior recognition result of each object to be identified can be represented by abnormal and normal.

In the second step, the confidence of each behavior recognition result is sorted to obtain the result scoring sequence.

In some embodiments, the confidence levels of each behavior recognition result are sorted to obtain a result scoring sequence. Among them, the confidence level can be associated with the behavior state in the recognition result, and can also be associated with the picture definition corresponding to the changed behavior state.

The third step is to determine whether the video frame to be recognized is the abnormal video frame based on the behavior recognition result corresponding to the preset position in the result scoring sequence.

In some embodiments, the behavior recognition result corresponding to the preset position in the result scoring sequence is determined as the target recognition result of the video frame to be recognized, and then based on the target recognition result, it is determined whether the video frame to be recognized is an abnormal video frame.

In some embodiments, first, the behavior rules are usually preset to identify the behavior state of each object to be recognized included in the video frame to be recognized, and obtain multiple recognition results, that is, a behavior recognition result set, so that each object in the video frame to be recognized is analyzed, that is, a more comprehensive and specific behavior analysis is performed on the object included in the video frame to be recognized; Sequence, flexibly select a preset position in the sequence, and then judge the video frame to be recognized based on the behavior recognition result corresponding to the flexibly selected preset position in the sequence, and can flexibly identify whether the video frame to be recognized is an abnormal video frame, thereby improving the accuracy of detecting abnormal video frames.

In some embodiments, an object recognition network can be used to recognize the target object in the video frame to be recognized, and then obtain the behavior state of the target object in the video frame to be recognized; wherein, the object recognition network is obtained by training the object recognition network to be trained, and the training of the object recognition network to be trained can be realized by the steps shown in Figure 3, which is a schematic flow diagram of a training method for an object recognition network provided by an embodiment of the present disclosure; the following description is made in conjunction with the steps shown in Figure 3:

Step S31, acquiring a sample video frame including a sample object.

Wherein, the sample video frame is any video frame in the sample video stream of the sample object.

In some embodiments, a device with an image collection function can be used to collect related scenes or objects to obtain a sample video frame; wherein, the number of sample objects in the sample video frame can be one, or two or more. In the embodiments of the present disclosure, the number of sample objects is taken as one for illustration.

Step S32, determining a sample normalized pose sequence of the sample object in the sample video frame.

In some embodiments, the sample historical video frame with the sample video frame can be firstly determined in the sample video stream, and then the sample pose information of the target object in the sample historical video frame and the sample pose information of the sample object in the sample video frame are sorted according to the time sequence relationship between the sample historical video frame and the sample video frame to obtain the sample normalized pose sequence of the sample object in the sample video frame; then normalize the first sample pose sequence to obtain the sample normalized pose sequence, the implementation process here is the same as the implementation process of the above step S203 and step S204 Similarly, the normalized pose data characterizing the motion information of the sample object in the sample video frame is determined.

In some embodiments, the sample normalized pose sequence includes multiple sample normalized poses, and the multiple sample normalized poses are sorted based on the timing information of the sample video frame where the sample object is located in the sample video stream, and the sample normalized pose information can be characterized based on the key points contained in the poses in the sample video frame.

Step S33, using the object recognition network to be trained to perform probability mapping on the sample normalized pose sequence to obtain a sample pose sequence.

Here, the implementation process of step S33 is similar to the implementation process of the above-mentioned step S205, that is, the posture situation of the sample object in the sample video frame is mapped to the specific probability corresponding to the previous movement, and then the sample pose sequence of the sample object in the sample video frame is determined.

Step S34, performing feature transformation on the sample pose sequence in space and time to obtain a sample pose feature track of the sample object in the sample video frame.

Here, the implementation process of step S34 is similar to that of the above-mentioned step S104, and the implementation process of steps S206 and S207, that is, the determined sample pose sequence is input to a network including a space converter and a time converter for feature conversion, so as to obtain the corresponding sample pose feature trajectory.

In some embodiments, when the sample pose sequence is input to the corresponding conversion network, the key points in each sample pose in the sample pose sequence can be partially masked to reduce the calculation amount of related data in the network, thereby improving the running speed.

Step S35, performing pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence.

In some embodiments, pose reconstruction can be performed based on the sample pose feature track, that is, correspondingly obtain a reconstructed pose sequence associated with the sample pose feature track, where the sample pose feature track can be converted from a feature vector to a coordinate parameter, and then the corresponding reconstructed pose sequence can be obtained.

Step S36, determining the reconstruction loss of the similarity between the reconstructed pose sequence and the sample normalized pose sequence.

In some embodiments, the reconstruction loss for determining the similarity between the reconstructed pose sequence and the sample normalized pose sequence may be calculated based on the similarity between the reconstructed pose sequence and the sample normalized pose sequence based on the coordinate information of key points in each pose to obtain the reconstruction loss. Wherein, the reconstructed pose sequence and the sample normalized pose sequence can also be calculated based on the similarity between the coordinate information of the key points in each pose and the confidence of the key points in each pose to obtain the reconstruction loss.

Step S37 , based on the reconstruction loss, adjust the network parameters of the object recognition network to be trained, so that the reconstruction loss output by the adjusted object recognition network meets the convergence condition.

In some embodiments, the network parameters of the object recognition network to be trained are adjusted based on the reconstruction loss, so that the reconstruction loss output by the adjusted object recognition network meets the convergence condition.

Here, through the above steps S31 to S37, in the object recognition network to be trained, based on the probability mapping of the relevant pose sequence and the feature transformation and adjustment in the time and space dimensions, a reconstruction loss is introduced to supervise the similarity between the sample normalized pose sequence and the reconstructed pose sequence in the sample video frame; in this way, by training the object recognition network to be trained, the recognition accuracy of the entire object recognition network can be improved, so that an object recognition network with higher performance can be obtained; that is, the trained object recognition network can recognize the information movement of objects in any video frame in the video stream The accuracy is higher.

The above-mentioned object recognition method and the training method of the object recognition network will be described below in conjunction with a specific embodiment. However, it should be noted that this specific embodiment is only for better illustrating the embodiment of the present disclosure, and does not constitute an improper limitation to the embodiment of the present disclosure.

In the related art, there are certain limitations in abnormal video frame detection based on attitude methods. The reason is that pose-based methods are based on static features in video frames, while video frame anomaly detection relies more on dynamic features. Therefore, efficient motion representation is crucial for regular video pattern learning in abnormal video frame detection. In this case, implementing detection models based on static features in pose methods would be overwhelmed by learning motion and normal states simultaneously, which would degrade the performance of detection models.

In the object recognition method and the training method of the object recognition network provided by the embodiments of the present disclosure, a motion prior rule learner (MoPRL) based on Motion Prior Regularity Learner (MoPRL) is provided to alleviate the limitation of the above-mentioned pose-based method. MoPRL consists of two sub-parts, Motion Embedder (ME) and Spatial-Temporal Transformer (STT). Among them, ME is used to extract the spatio-temporal representation of the input pose from a probabilistic perspective; where the pose motion is modeled based on the displacement between pose center points between adjacent frames. At the same time, this movement can be transformed into the probability domain. That is, the motion prior information is obtained through statistics, which represents the explicit distribution of displacement on the training data. That is, to represent the corresponding motion, each pose displacement is mapped to a specific probability based on the previous motion. Simultaneously adopting the designed pose masking strategy, the STT is used as a task-specific model to learn the regular pattern by inputting the pose of the ME and its motion features. The framework of the object recognition network provided by the embodiments of the present disclosure adopts a self-supervised sequential input structure, which is naturally suitable for attitude rule learning.

Based on the MoPRL in the present disclosure, it is possible to intuitively represent the pose motion of a target object in a video frame in the probability domain based on ME, providing an effective pose motion representation for its regularity learning. At the same time, the regularity of pose trajectories can be simulated by using STT with pose masking and attention distraction. The following are the implementation steps for implementing the training method of the object recognition network provided by the embodiment of the present disclosure. Here, the default sample object is a human being for the following description:

The first step is to obtain a sample video stream; wherein, the sample video stream can be represented by a training set D _train ={F ₁ , . . . , F _m }, where F _i represents any video frame in the sample video stream. At the same time, each video frame in the sample video stream includes sample objects with labeled pose information, which can be represented by the test set D _test ={(F ₁ ,L ₁ ),...,(F _n ,L _n )}; where, L _i ∈{0,1}, which means that there are normal samples and abnormal samples in the training set and test set.

The second step is to determine the sample historical video frame of the sample video frame in the sample video stream, that is, by sliding the sample video frame based on the window, determine at least one sample historical video frame that is in the sample video stream before the sample video frame and adjacent to the sample video frame. Exemplarily, when the sample video frame is the eighth video frame in the sample video stream, the sample historical video frame may be the first seven video frames in the sample video stream, and the eighth sample video frame is used as an example for illustration below.

In the third step, based on the human body posture recognition, human body posture recognition is performed on the sample historical video frame and the sample video frame respectively, and the human body posture information represented by the key points of the human body is obtained; wherein, each human body posture can be represented by P _i ={J _i,1 ,...,J _i,k }; wherein, i represents the video frame in the sample historical video frame and the sample video frame, and k represents the maximum number of joints of the human body in a single human body posture, that is, J _i,j represents the jth joint point in the i-th human body posture. Wherein, each joint point can be represented by coordinates (xi _,j ,y _i,j ).

The fourth step is to use the human body posture trajectory sequence in the first 8 sample video frames, that is, S _i ={P ₁ ,...,P _t }, and t represents the number of postures included in the human body posture trajectory sequence. In this embodiment, j is equal to 8. If there is one person in the sample video frame, it can be represented by F _i ={S ₁ ,...,S _l }.

Wherein, according to the temporal relationship between the historical sample video frames and the sample video frames in the sample video stream, the human body postures in the historical sample video frames and the human body postures in the sample video frames are sorted, and a sample human body posture sequence S _i ={P ₁ ,...,P _t } representing the sample video frames is obtained.

The fifth step is to determine the smallest rectangular frame capable of each sample human pose in the human pose sequence, and then determine the center point of the corresponding smallest rectangular frame as the center point of each sample human pose, that is, (xi _, y _i ), and at the same time based on the size of the smallest rectangular frame of each sample human pose, such as the width and height of the smallest rectangular frame: (w _i , h _i ); normalize the sample human poses in each sample human pose sequence, that is, normalize each joint point in the sample human poses in each sample human pose sequence processing to obtain the normalized expression sequence of each sample human body posture, namely

Among them, the standardized coordinate parameters corresponding to each joint point in each sample human body pose are correspondingly obtained, that is,

In the sixth step, the fifth step is used to obtain the normalized representation sequence of each sample human body posture

The center point (xi _, y _i ) of each sample human body pose and the width and height (w _i , h _i ) of the minimum rectangular frame of each sample human body pose are input to ME to obtain the corresponding sample target pose sequence in the sample video frame. Among them, it can be realized through the following process:

First, based on the historical sample video frame and the temporal relationship of the sample video frame in the sample video stream, calculate the sample normalized displacement corresponding to the center point of the sample human pose in every two adjacent video frames in the historical sample video frame and the sample video frame, which can be realized by the following formula (1) and formula (2):

Among them, _υi represents the displacement of the center point of the sample human body in every two adjacent sample video frames, that is, the average speed from the sample human body posture P _i to the sample human body posture P _i+1 , where _υi can represent the displacement corresponding to the i-th video frame in the sample historical video frame and the sample video frame; at the same time, based on the displacement of the sample human body center in each two adjacent sample video frames and the size of the smallest rectangular frame corresponding to each sample human body posture, determine the sample normalized displacement corresponding to the i-th video frame

That is, the sample human body pose P _i (sample normalized human body pose

) corresponding to the normalized displacement

Second, fit the above using the preset fitting function

Then get the discretized data set with this

The matching predefined distribution function distribution, through which we can get the continuous displacement distribution function. Proving that the Rayleigh distribution matches the above normalized shift during training

The performance parameters of the corresponding displacement distribution function are optimal. Here, in order to obtain a multi-level information representation including temporal and spatial information, the normalized pose

(Representing spatial information and motion prior) combined, here is more reflected in time, and then each normalized displacement

Input to the corresponding displacement distribution function to get the normalized displacement

The matching scaling factor can be represented by a probability parameter, as shown in formula (3):

Among them, ρ is the discretization data set

The matching predefined distribution function distribution, that is, the continuous displacement distribution function; s _i is the same as

The corresponding scaling factor.

Finally, formula (4) is used to calculate each normalized pose

The corresponding pose feature after the scaling operation, that is, the pose representing the embedded motion fuses the spatial and temporal information of the i-th pose. Here to avoid numerical errors, a scaling factor can be used as the denominator. In this way, in the case of gestures corresponding to less frequent occurrences, gestures of larger size can be obtained.

Wherein, P _i =[J _i,1 ,...,J _i,j ] characterizes the target sample pose sequence of the sample video frame and can be represented by [P ₁ ,...,P _t ]. As shown in FIG. 4 , it is a schematic diagram of converting a posture trajectory into a motion feature in the probability domain based on a motion embedder ME provided by an embodiment of the present disclosure; wherein, 401 and 402 are respectively normalized human body posture sequences of different samples input into the ME; wherein, 401 is a marked normal posture trajectory, and 402 is a marked abnormal posture trajectory. 405 is a determined continuous distribution function corresponding to the motion probability of the sample human body; respectively map 401 and 402 to the continuous distribution function 405 to obtain a corresponding scaling factor; then multiply 401 and 402 by the reciprocal of the corresponding scaling factor respectively to obtain the target sample posture sequence 403 corresponding to 401 and the target sample posture sequence 405 corresponding to 402; wherein, because the abnormal posture trajectory represented in 402 is represented, the corresponding scaling factor is the probability value mapped to the relevant motion information is smaller, and then the pose size in the corresponding target sample pose sequence is larger.

In the seventh step, in order to better learn the regularity of the human body posture trajectory, the embodiment of the present disclosure uses a space-time transformer, that is, STT to process the parameters with time and space information based on the motion embedder in the sixth step, because it has recognized advantages in modeling sequence data. However, the data computational complexity of the traditional space-time transformer model is O((N×T) ² ) (where N is the number of joints in a single pose and T is the number of poses in a single pose trajectory), i.e., this O grows exponentially with the increase of N and T. Based on this, the space-time transformer can be divided into two parts based on the attention mechanism, that is, the space-time conversion part and the time conversion part, and then a model with a data calculation complexity of O(N ² +T ² ) can be obtained. Here, such a model is referred to as STT; where, STT includes a space transformer of L _s layer network and a time transformer of L _t layer network. At the same time, to give full play to the potential of STT, L _s and L _t can be regarded as hyperparameters, and their corresponding specific values can be determined through training.

First, pose mask processing is performed. This part is performed during the training process of the object recognition network and can be omitted during the application process. This is to reduce the amount of data calculation and improve the robustness of model processing. For arbitrary masked pose embeddings. Before inputting to the space-time transformer, we first normalize any joint point J _i,j in any sample pose, and map it to the embedding to obtain the joint vector z _i,j , where z _i,j ∈ R ^C , and C is the embedding dimension, as shown in the following formula (5):

Among them, mask(·) is a mask function running on J _i,j with a preset probability, and E∈R ^C×2 is a training parameter. at the same time,

Characterize the spatial feature vector corresponding to the attribute of the jth joint point; and then obtain the i-th pose, that is, the feature vector corresponding to P _i = [J _i,1 ,...,J _i,j ], Z _i =[zi _,0 ,...,zi _,N ], here, the feature vector corresponding to the entire target sample pose sequence [P ₁ ,...,P _t ] can be represented by Z=[Z ₁ ,...,Z _T ].

Secondly, in the space domain with L _s layer network, that is, in the space transformer, and in the case of Z∈R ^T×N×C corresponding to the attitude trajectory, the feature dimension adjustment or feature encoding is performed on the determined feature vector based on the parameters corresponding to the attention mechanism; where the input trajectory of the l-th layer is expressed as Z ^l , and l∈[1,L _s ]. The spatial domain multi-layer operation with L _s layers can be performed in the following way, as in Equation (6) to Equation (8):

Among them, Q, K, and V are query, key, and value matrices respectively, and W _Q , W _K , and W _V all belong to R ^C×C . Here, the subscript _ln denotes the tensor parameters after spatial domain normalization of layer L _s , where softmax and fc represent the softmax operation and fully connected layer, respectively. At the same time, the space-time converter uses the parameters corresponding to the multi-layer multi-head attention mechanism to perform attention operations, and can obtain feature parameters with more sufficient performance parameters that carry spatial information. Here, the output of the dimension adjustment of the previous layer is the input of the dimension adjustment of the next layer.

Finally, in the time domain with L _t layer network, i.e., the time converter, the spatial feature vector corresponding to the joint point of each sample pose output from the L _s layer of the space converter

Adjust the parameters of the attention mechanism in the spatial dimension, as shown in formula (9):

in,

Characterize the temporal feature vector corresponding to the attribute of the jth joint point.

At the same time, the corresponding operations of the time converter are shown in formulas (6) to (8), and will not be described in detail here. In the L _t layer of the time converter, the final dynamic feature representing the sample video frame is input, that is, Z°.

The eighth step is the training process. The training process is realized by the commonly used reconstruction method, that is, [P ₁ ,...,P _t ] (where [P ₁ ,...,P _t ] corresponds to the sample normalized pose sequence

) as input, after corresponding to the corresponding Z°, the attitude reconstruction of the Z° is obtained

That is, as shown in formula (10):

Then, calculate

and

The similarity loss between each joint point in each sample pose, as shown in formula (11):

where ω ^i,j is the confidence degree of each pose joint; at the same time

is reconstruction

The coordinate information of the joint points of each pose in .

Finally, the object recognition network is trained based on the loss Loss obtained in the eighth step, so that the reconstruction loss output by the adjusted object recognition network meets the convergence condition.

Here, after the object recognition network is trained to obtain a trained object recognition network, a test video stream is input to the trained object recognition network to obtain a corresponding recognition result.

Wherein, after the video stream to be tested is input to the trained object recognition network, the dynamic trajectory features of m objects included in the video frames in the video stream are obtained. The behavior corresponding to each dynamic trajectory feature is abnormally scored, that is, A _m,n , where m and n represent the pose feature of the mth frame of the nth frame trajectory in the video frame, and A _m,n satisfies the following formula (12):

Among them, the dynamic feature track corresponding to the highest abnormal score is selected, and based on it, it is determined whether the video frame in the video stream belongs to the abnormal video frame, that is, the highest A _m is selected through formula (13).

A _m = max(A _m,n ) formula (13);

Based on the above-mentioned training method of the object recognition network, as shown in FIG. 5 , it is a schematic framework diagram of a training method of the object recognition network provided by the embodiment of the present disclosure; wherein, 501 is to determine the sample pose information (including multi-frame pose information) of the sample object in any sample video frame in the sample video stream (here, the number of sample objects is 1), and input it to the ME of MoPRL, that is, 502 for probability mapping, and obtain the sample pose sequence of the sample object in the sample video frame, namely 503.其次，将503得到的样本姿态序列输入至MoPRL中的STT，即504中进行特征转换，以模拟样本姿态序列中的规律性进而得到样本姿态特征轨迹，即：第一步先进行姿态掩膜以及姿态嵌入得到与每一姿态中的关节点对应的特征向量；第二步，将第一步得到的与每一姿态中的关节点对应的特征向量，依次输入空间转换器(具有L _s层网络的空间域)和时间转换器(具有L _t层网络的时间域)进行特征编码，得到对应的样本姿态特征轨迹；第三步，对该样本姿态特征轨迹进行姿态重建，得到重建姿态序列，即505；最后，确定重建姿态序列和样本姿态信息之间相似度的重建损失即506，并基于该重建损失对待训练的对象识别网络的网络参数进行调整，以使调整后的对象识别网络输出的重建损失满足收敛条件。

At the same time, as shown in FIG. 6 , a schematic structural diagram of a space-time converter provided by an embodiment of the present disclosure is shown; wherein 601 is a space converter, 602 is a time converter, and 603 is the final output gesture feature track, wherein the gesture feature track in 603 is characterized by features corresponding to multiple joint points in each human body pose (here, the recognition object is a human being as an example for illustration). At the same time, in 601, in the spatial dimension, based on the attention parameters associated with the spatial dimension, feature encoding and dimension adjustment are performed on multiple joint points of each human body posture input to the space converter to obtain a sequence of spatial features to be input to the time converter; in 602, in the time dimension, based on the attention parameters associated with the time dimension, feature encoding and dimension adjustment are performed on the spatial features corresponding to the multiple joint points of each human body posture input to the time converter to obtain the final dynamic feature parameters; wherein the space converter has an L _s layer network; the time converter has an L _t layer network.

Correspondingly, FIG. 7 is a schematic diagram of the internal processing flow of a space-time converter provided by an embodiment of the present disclosure, wherein, 704 is the corresponding trajectory model parameters on the time dimension or the space dimension, that is, on the space dimension, the corresponding Z ^(I-1) (T×N×C); at the same time, the corresponding Z ^(I-1) (N×T×C) on the time dimension; 2's multi-layer perceptron, which maps multiple input feature sets to a single output feature, and finally obtains 701 corresponding trajectory feature parameters.

Based on the object recognition network and the training method of the object recognition network provided by the embodiments of the present disclosure, it is possible to obtain the more intuitive gesture motion of the object included in each video frame in the video stream, and embed the gesture motion into the corresponding video frame through ME to obtain the gesture trajectory sequence of the object in each video frame; and input the gesture trajectory sequence into the converter of time and space segmentation to learn the posture law, and obtain the dynamic characteristics representing the motion information of the object in each video frame; wherein, the step of determining the posture trajectory sequence is mainly realized through the following steps:

First, determine the normalized displacement between adjacent poses in multiple video frames (adjacent historical video frames) associated with each video frame; secondly, display the discrete distribution of the obtained multiple normalized displacements to obtain an explicit discrete distribution describing multiple normalized displacements; then, fit multiple normalized displacements based on the Rayleigh distribution or Gaussian distribution to determine a continuous distribution function that matches multiple normalized displacements; finally, use the normalized posture and its motion probability (the motion probability is to input each normalized displacement into the determined continuous distribution function to obtain the corresponding scaling Probability), to obtain motion embedding poses carrying spatial and temporal information.

Based on this, in the embodiments of the present disclosure, the intuitive motion of the object included in the video frame can be represented in the probability domain, and at the same time, the regularity of the pose trajectory can be learned by using a time- and space-segmented spatio-temporal converter. The accuracy of the motion information of the determined object in the video frame can be improved.

An embodiment of the present disclosure provides an object recognition device. FIG. 8A is a schematic structural composition diagram of an object recognition device provided by an embodiment of the present disclosure. As shown in FIG. 8A , the object recognition device 800 includes:

The first acquiring part 801 is configured to acquire a video frame to be identified whose picture includes a target object; the video frame to be identified is any video frame in the video stream of the target object;

The first determining part 802 is configured to determine an initial posture sequence of the target object based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream;

The first mapping part 803 is configured to perform probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized;

The first conversion part 804 is configured to perform feature conversion on the target pose sequence in space and time to obtain a pose feature track of the target object in the video frame to be recognized;

The second determining part 805 is configured to determine the behavior state of the target object in the video frame to be recognized based on the gesture feature trajectory.

In some embodiments, the first determining part 802 is further configured to respectively perform key point identification on the video frame to be identified and the historical video frame to obtain the key point of the target object in the video frame to be identified and the key point of the target object in the historical video frame; respectively determine the pose information in the video frame to be recognized and the pose information in the historical video frame based on the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frame; sequence relationship, and sort the pose information in the historical video frame and the pose information in the video frame to be recognized to obtain an initial pose sequence.

In some embodiments, the first mapping part 803 includes: a determining subsection configured to obtain a center point sequence for determining the displacement of adjacent initial poses based on position information of key points in each initial pose, and a normalized pose sequence of the initial pose sequence; a mapping subpart is configured to perform probability mapping on the normalized pose sequence based on the center point sequence to obtain the target pose sequence.

In some embodiments, the determination subpart is further configured to determine the bounding box of each initial pose based on the position information of key points in each initial pose; in the sequence of initial poses, sort the center points of the bounding boxes of each initial pose to obtain the sequence of center points; use the bounding boxes of each initial pose to normalize each initial pose to obtain the normalized pose sequence.

In some embodiments, the mapping subpart is further configured to obtain a displacement sequence based on the difference between the position information of every two adjacent central points in the central point sequence; based on the size information of the bounding box of each initial pose, normalize each displacement in the displacement sequence to obtain a normalized displacement sequence; perform probability mapping on the normalized pose sequence based on the normalized displacement sequence to obtain the target pose sequence.

In some embodiments, the mapping subpart is further configured to fit each normalized displacement in the normalized displacement sequence to obtain a fitting result; determine a continuous distribution function satisfied by the fitting result; input each normalized displacement into the continuous distribution function to obtain a scaling probability of each normalized displacement; map each normalized posture based on the scaling probability of each normalized displacement to obtain the target posture sequence.

In some embodiments, the first conversion part 804 is further configured to perform feature conversion on each target pose based on the key points of each target pose in the target pose sequence to obtain a feature sequence to be adjusted; perform feature dimension adjustment on each feature to be adjusted in space and time to obtain the pose feature trajectory.

In some embodiments, the first conversion part 804 is further configured to fuse each feature to be adjusted and the preset spatial feature of each feature to be adjusted to obtain a spatial feature sequence; based on the attention parameter of the spatial dimension in the attention mechanism, perform multi-layer dimension adjustment on the spatial feature sequence to obtain a spatial attitude feature sequence; fuse each spatial attitude feature and the preset temporal feature of each spatial attitude feature to obtain a temporal feature sequence; The output of one layer of dimension adjustment is the input of the next layer of dimension adjustment.

In some embodiments, the object recognition apparatus 800 further includes: an abnormality recognition part configured to obtain scene information corresponding to the video frame to be recognized; determine a preset behavior rule associated with the scene information; use the preset behavior rule to determine whether the video frame to be recognized to which the behavior state belongs is an abnormal video frame.

In some embodiments, the target object includes at least two objects to be identified, and the abnormal identification part is further configured to use the preset behavior rules to identify the behavior state of each of the at least two objects to be identified to obtain an intermediate identification result set; based on the reasonableness of the behavior state, sort the confidence of each intermediate identification result to obtain a result scoring sequence; determine whether the video frame to be identified is the abnormal video frame based on the intermediate identification result corresponding to a preset position in the result scoring sequence.

An embodiment of the present disclosure also provides a training device for an object recognition network. FIG. 8B is a schematic diagram of the structural composition of a training device for an object recognition network provided by an embodiment of the present disclosure. As shown in FIG. 8B , the training device 810 for an image conversion network includes:

The second acquiring part 811 is configured to acquire a sample video frame including a sample object; wherein, the sample video frame is any video frame in the sample video stream of the sample object;

The third determining part 812 is configured to determine a sample normalized pose sequence of the sample object in the sample video frame;

The second mapping part 813 is configured to use the object recognition network to be trained to perform probability mapping on the sample normalized pose sequence to obtain the sample pose sequence;

The second conversion part 814 is configured to perform feature conversion on the sample pose sequence in space and time to obtain a sample pose feature track of the sample object in the sample video frame;

The reconstruction part 815 is configured to perform pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence;

The fourth determining part 816 is configured to determine a reconstruction loss of similarity between the reconstructed pose sequence and the sample normalized pose sequence;

The adjustment part 817 is configured to adjust the network parameters of the object recognition network to be trained based on the reconstruction loss, so that the reconstruction loss output by the adjusted object recognition network meets a convergence condition.

It should be noted that the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.

It should be noted that, in the embodiments of the present disclosure, if the above-mentioned object recognition method, or the training method of the object recognition network is implemented in the form of software function parts, and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on such an understanding, the essence of the technical solutions of the embodiments of the present disclosure or the part that contributes to the prior art can be embodied in the form of a software product, the computer software product is stored in a storage medium, and includes several instructions to make a computer device (which can be a terminal, a server, etc.) execute all or part of the methods described in the various embodiments of the present disclosure. The aforementioned storage media include: various media that can store program codes such as U disk, sports hard disk, read-only memory (Read Only Memory, ROM), magnetic disk or optical disk. As such, embodiments of the present disclosure are not limited to any specific combination of hardware and software.

Correspondingly, an embodiment of the present disclosure further provides a computer program product, the computer program product includes computer-executable instructions, and after the computer-executable instructions are executed, the object recognition method provided by the embodiments of the present disclosure, or the training method of an object recognition network can be realized.

Correspondingly, an embodiment of the present disclosure provides a computer device. FIG. 9 is a schematic diagram of the composition and structure of a computer device according to an embodiment of the present disclosure. As shown in FIG. Wherein, the communication interface 902 is configured to realize connection and communication between these components. Wherein, the communication interface 902 may include a display screen, and the external communication interface may include a standard wired interface and a wireless interface. The processor 901 is configured to execute the information processing program in the memory, so as to realize the object recognition method provided by the above embodiments, or the training method of the object recognition network.

Correspondingly, an embodiment of the present disclosure further provides a computer storage medium, on which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the object recognition method provided in the above-mentioned embodiments, or the training method of an object recognition network is implemented.

Wherein, the computer storage medium may be a volatile storage medium or a non-volatile storage medium.

An embodiment of the present disclosure further provides a computer program, where the computer program includes computer-readable codes. When the computer-readable codes run in an electronic device, the processor of the electronic device executes the method for implementing the above-mentioned object recognition, or, the method for training the above-mentioned object recognition network.

The above descriptions of the object recognition device, the training device of the object recognition network, the computer equipment, the storage medium, and the program embodiment are similar to the description of the above-mentioned method embodiment, and have similar technical descriptions and beneficial effects as the corresponding method embodiment. Due to space limitations, you can refer to the description of the above-mentioned method embodiment. For the technical details not disclosed in the object recognition device, the training device of the object recognition network, the computer equipment, the storage medium and the program embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.

It should be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic related to the embodiment is included in at least one of the disclosed embodiments. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the order of execution, and the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure. The serial numbers of the above-mentioned embodiments of the present disclosure are for description only, and do not represent the advantages and disadvantages of the embodiments. It should be noted that, in this document, the term "comprising", "comprising" or any other variation thereof is intended to cover a non-exclusive inclusion, such that a process, method, article or device comprising a series of elements includes not only those elements, but also includes other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. Without further limitations, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.

In the several embodiments provided by the embodiments of the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods, such as: multiple units or components can be combined, or can be integrated into another system, or some features can be ignored or not implemented. In addition, the mutual coupling, or direct coupling, or communication connection of various components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical or other forms.

The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, all the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may be used as a single unit, or two or more units may be integrated into one unit; the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units. Those of ordinary skill in the art can understand that all or part of the steps for realizing the above-mentioned method embodiments can be completed by hardware related to program instructions, and the aforementioned program can be stored in a computer-readable storage medium.

Alternatively, if the above-mentioned integrated units in the embodiments of the present disclosure are realized in the form of software function parts and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present disclosure can essentially be embodied in the form of a software product, or the part that contributes to the prior art. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the methods described in each embodiment of the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks. The above is only the specific implementation of the embodiments of the present disclosure, but the scope of protection of the embodiments of the present disclosure is not limited thereto. Anyone skilled in the art can easily think of changes or substitutions within the technical scope disclosed in the embodiments of the present disclosure, and should be covered within the scope of protection of the embodiments of the present disclosure. Therefore, the protection scope of the embodiments of the present disclosure should be determined by the protection scope of the claims.

Industrial Applicability

The embodiment of the present disclosure discloses an object recognition method, network training method, device, equipment, medium, and program; wherein, the object recognition method includes: acquiring a video frame to be recognized in a picture including a target object; the video frame to be recognized is any video frame in the video stream of the target object; determining an initial pose sequence of the target object based on the video frame to be recognized and historical video frames of the video frame to be recognized in the video stream; performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized; performing feature conversion in time and time to obtain a gesture feature track of the target object in the video frame to be identified; based on the gesture feature track, determining the behavior state of the target object in the video frame to be identified.

Claims

A method for object recognition, the method comprising:

The acquired picture includes a video frame to be identified of the target object; the video frame to be identified is any video frame in the video stream of the target object;

determining an initial pose sequence of the target object based on the video frame to be identified and historical video frames of the video frame to be identified in the video stream;

Probability mapping is performed on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be identified;

Performing feature conversion on the target pose sequence in space and time to obtain a pose feature track of the target object in the video frame to be identified;

Based on the gesture feature track, determine the behavior state of the target object in the video frame to be recognized.
The method according to claim 1, wherein said determining the initial pose sequence of the target object based on the video frame to be identified and the historical video frame of the video frame to be identified in the video stream comprises:

Respectively performing key point identification on the video frame to be identified and the historical video frame to obtain key points of the target object in the video frame to be identified and key points of the target object in the historical video frame;

Based on the key points of the target object in the video frame to be identified and the key points of the target object in the historical video frame respectively, determine the gesture information in the video frame to be identified and the gesture information in the historical video frame;

According to the time sequence relationship between the historical video frame and the video frame to be recognized, the pose information in the historical video frame and the pose information in the video frame to be recognized are sorted to obtain an initial pose sequence.
The method according to claim 1 or 2, wherein the probabilistic mapping of the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be identified comprises:

Based on the position information of the key points in each initial pose, obtain a center point sequence for determining the displacement of adjacent initial poses, and a normalized pose sequence of the initial pose sequence;

Probabilistic mapping is performed on the normalized pose sequence based on the central point sequence to obtain the target pose sequence.
The method according to claim 3, wherein, based on the position information of key points in each initial pose, obtaining a center point sequence for determining the displacement of adjacent initial poses, and a normalized pose sequence of the initial pose sequence include:

Based on the position information of key points in each initial pose, determine the bounding box of each initial pose;

In the initial pose sequence, sorting the center points of the bounding boxes of each initial pose to obtain the center point sequence;

The bounding box of each initial pose is used to normalize each initial pose to obtain the normalized pose sequence.
The method according to claim 3 or 4, wherein the probabilistic mapping of the normalized pose sequence based on the central point sequence to obtain the target pose sequence includes:

In the center point sequence, a displacement sequence is obtained based on the difference between the position information of every two adjacent center points;

Based on the size information of the bounding box of each initial pose, normalize each displacement in the displacement sequence to obtain a normalized displacement sequence;

Probabilistic mapping is performed on the normalized pose sequence based on the normalized displacement sequence to obtain the target pose sequence.
The method according to claim 5, wherein said performing probability mapping on said normalized attitude sequence based on said normalized displacement sequence to obtain said target attitude sequence comprises:

Fitting each normalized displacement in the normalized displacement sequence to obtain a fitting result;

determining the continuous distribution function that the fitting result satisfies;

inputting each of the normalized displacements into the continuous distribution function to obtain a scaling probability of each of the normalized displacements;

Based on the scaling probability of each normalized displacement, each normalized pose is mapped to obtain the target pose sequence.
The method according to any one of claims 1 to 6, wherein said performing feature transformation on said target pose sequence in space and time to obtain a pose feature track of said target object in said video frame to be identified, comprising:

In the target pose sequence, based on the key points of each target pose, perform feature conversion on each target pose to obtain a feature sequence to be adjusted;

For each feature to be adjusted, the feature dimension is adjusted in space and time to obtain the gesture feature trajectory.
The method according to claim 7, wherein said adjusting the feature dimension of each feature to be adjusted in space and time to obtain the trajectory of the gesture feature comprises:

Fusing each feature to be adjusted with a preset spatial feature of each feature to be adjusted to obtain a sequence of spatial features;

Based on the attention parameter of the spatial dimension in the attention mechanism, multi-layer dimension adjustment is performed on the spatial feature sequence to obtain the spatial posture feature sequence;

Fusing each space attitude feature with the preset time feature of each space attitude feature to obtain a time feature sequence;

Based on the attention parameter of the time dimension in the attention mechanism, multi-layer dimension adjustment is performed on the time feature sequence to obtain the posture feature track;

Among them, the output of the dimension adjustment of the previous layer is the input of the dimension adjustment of the next layer.
The method according to any one of claims 1 to 8, wherein, after determining the behavior state of the target object in the video frame to be identified based on the gesture feature track, the method further comprises:

Acquiring scene information corresponding to the video frame to be identified;

determining preset behavior rules associated with the scene information;

Using the preset behavior rule, determine whether the video frame to be identified to which the behavior state belongs is an abnormal video frame.
The method according to claim 9, wherein the target object includes at least two objects to be identified, and using the preset behavior rules to determine whether the video frame to be identified to which the behavior state belongs is an abnormal video frame comprises:

Using the preset behavior rules to identify the behavior state of each of the at least two objects to be identified to obtain an intermediate identification result set;

Sorting the confidence of each intermediate recognition result to obtain the result scoring sequence;

Based on an intermediate recognition result corresponding to a preset position in the result scoring sequence, it is determined whether the video frame to be recognized is the abnormal video frame.
A training method for an object recognition network, the method comprising:

Obtaining a sample video frame including a sample object; wherein, the sample video frame is any video frame in the sample video stream of the sample object;

determining a sequence of sample normalized poses of the sample object in the sample video frame;

Using the object recognition network to be trained, performing probability mapping on the sample normalized pose sequence to obtain the sample pose sequence;

Performing feature conversion on the sample pose sequence in space and time to obtain a sample pose feature track of the sample object in the sample video frame;

performing pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence;

determining a reconstruction loss of similarity between said reconstructed pose sequence and said sample normalized pose sequence;

Based on the reconstruction loss, the network parameters of the object recognition network to be trained are adjusted, so that the reconstruction loss output by the adjusted object recognition network meets a convergence condition.
An object recognition device, the device comprising:

The first acquisition part is configured to acquire a video frame to be identified whose picture includes a target object; the video frame to be identified is any video frame in the video stream of the target object;

The first determining part is configured to determine an initial pose sequence of the target object based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream;

The first mapping part is configured to perform probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized;

The first conversion part is configured to perform feature conversion on the target pose sequence in space and time to obtain a pose feature track of the target object in the video frame to be recognized;

The second determination part is configured to determine the behavior state of the target object in the video frame to be recognized based on the gesture feature track.
A training device for an object recognition network, said device comprising:

The second acquisition part is configured to acquire a sample video frame including a sample object; wherein, the sample video frame is any video frame in the sample video stream of the sample object;

A third determining part configured to determine a sample normalized pose sequence of the sample object in the sample video frame;

The second mapping part is configured to use the object recognition network to be trained to perform probability mapping on the sample normalized pose sequence to obtain the sample pose sequence;

The second conversion part is configured to perform feature conversion on the sample pose sequence in space and time to obtain a sample pose feature track of the sample object in the sample video frame;

The reconstruction part is configured to perform pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence;

A fourth determining part configured to determine a reconstruction loss of similarity between the reconstructed pose sequence and the sample normalized pose sequence;

The adjustment part is configured to adjust the network parameters of the object recognition network to be trained based on the reconstruction loss, so that the adjusted reconstruction loss output by the object recognition network meets a convergence condition.
A computer device, the computer device comprising a memory and a processor, computer-executable instructions are stored in the memory, the processor can implement the object recognition method according to any one of claims 1 to 10 when running the computer-executable instructions on the memory, or, the processor can realize the object recognition network training method according to claim 11 when running the computer-executable instructions on the memory.
A computer storage medium, the computer storage medium is stored with computer-executable instructions, and after the computer-executable instructions are executed, the object recognition method described in any one of claims 1 to 10 can be realized, or, after the computer-executable instructions are executed, the object recognition network training method described in claim 11 can be realized.
A computer program, the computer program comprising computer-readable code, in the case where the computer-readable code is run in an electronic device, the processor of the electronic device executes the object recognition method for realizing any one of claims 1 to 10, or, for realizing the object recognition network training method according to claim 11.