WO2023138154A1 - Object recognition method, network training method and apparatus, device, medium, and program - Google Patents

Object recognition method, network training method and apparatus, device, medium, and program

Info

Publication number
WO2023138154A1
WO2023138154A1 PCT/CN2022/129057
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
sequence
pose
sample
feature
Prior art date
Application number
PCT/CN2022/129057
Other languages
French (fr)
Chinese (zh)
Inventor
苏海昇
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Publication of WO2023138154A1 publication Critical patent/WO2023138154A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Definitions

  • the coordinate information corresponding to the key points of each target pose in the target pose sequence may first be converted into a feature vector; then, based on the attention parameter, the converted feature vector is adjusted in the time dimension and the space dimension to obtain the final pose feature trajectory.
  • the behavior state of the target object in the video frame to be recognized can be determined based on the determined pose feature trajectory; wherein the behavior state can be the motion information of the target object, for example: the target object is walking, jumping, standing, etc.
  • the pose feature trajectory may be input into a corresponding network model to determine the behavior state corresponding to the pose feature trajectory.
  • first, a video frame to be recognized whose picture includes the target object is acquired, the video frame to be recognized being any video frame in the video stream of the target object; in this way, the video stream of the target object and the video frame to be recognized provide a basis for subsequently determining the behavior state of the target object in the video frame to be recognized; secondly, based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream, the initial pose sequence of the target object is determined, and probability mapping is performed on the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized; in this way, the pose of the target object in the video frame to be recognized can be mapped to a specific probability corresponding to its prior motion, so that the determined target pose sequence takes the prior motion into account at the same time.
  • a trained neural network can be used to identify the key points of the video frame to be recognized and the historical video frame respectively, and then obtain the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frame; wherein, the trained neural network can be any neural network, which is not limited in this embodiment of the present disclosure.
  • the pose information of the target object in the video frame to be recognized can be represented based on the key points of the target object in the video frame to be recognized, and the pose information of the target object in the historical video frame can be represented based on the key points of the target object in the historical video frame; for example, the pose information of the target object in the video frame to be recognized can be characterized based on the position information of the key points of the target object in that frame, where the position information can refer to the coordinate information of any key point in the video frame to be recognized.
  • the pose information of the target object can also be represented in the above-mentioned manner in the historical video frames.
  • a normalization operation and a mapping operation may be performed on each initial pose in the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized.
  • the pose of the target object in the video frame to be recognized can be mapped to a specific probability corresponding to its prior motion, so that the pose information of the target object in the video frame to be recognized can be represented based on both historical pose information and current pose information; that is, the pose information of the target object in the video frame to be recognized can be represented based on dynamic parameters, improving the accuracy of the motion information of the target object in the video frame to be recognized. That is, step S103 provided in the above embodiment can be realized through the following steps S204 and S205:
  • a center point sequence, used to determine the displacement between adjacent initial poses, and a normalized pose sequence corresponding to the initial pose sequence can be determined.
  • the first step is to determine the bounding box of each initial pose based on the position information of key points in each initial pose.
  • the second step is to sort the center points of the bounding boxes of each initial pose in the initial pose sequence to obtain the center point sequence.
  • the center point of the bounding box of each initial pose is determined, and then the center points of the bounding boxes are sorted based on the order of the initial poses in the initial pose sequence to obtain a center point sequence.
  • the center point of the bounding box of each initial pose can be characterized based on the position information of the center point, that is, the coordinate information in the corresponding video frame.
  • the third step is to normalize each initial pose by using the bounding box of each initial pose to obtain the normalized pose sequence.
  • the bounding box of each initial pose is used to normalize each initial pose, yielding the normalized pose sequence; specifically, the size information of the bounding box of each initial pose, such as its width and height, may be used to normalize each initial pose, obtaining a normalized pose corresponding to each initial pose and thereby the normalized pose sequence. A sketch of these steps follows.
  • Step S205 performing probability mapping on the normalized pose sequence based on the center point sequence to obtain the target pose sequence.
  • the normalized displacement sequence corresponding to the center point sequence can be determined based on the relationship between adjacent center points in the center point sequence, and the normalized pose sequence can then be probabilistically mapped based on the normalized displacement sequence to obtain the target pose sequence; in this way, the target pose sequence can take into account both the prior motion and the current motion information of the target object, so that the determined target pose sequence more accurately matches the motion information of the target object in the video frame to be recognized. That is, the above step S205 can be realized through the following process:
  • a displacement sequence is obtained based on the difference between the position information of every two adjacent center points.
  • the difference between the horizontal coordinates of every two adjacent center points and the difference between their vertical coordinates can be calculated by a suitable function to obtain the displacement between the two adjacent center points, as in the sketch below.
  • probability mapping is performed on the normalized pose sequence based on the normalized displacement sequence to obtain the target pose sequence.
  • probability mapping is performed on the normalized pose sequence to obtain the corresponding target pose sequence; that is, the normalized pose sequence can be mapped to a specific probability that represents the motion information of the target object in the video frame to be recognized, i.e., the target pose sequence.
  • performing probability mapping on the normalized pose sequence based on the normalized displacement sequence can improve the accuracy and speed of determining the target pose sequence.
  • the continuous distribution function associated with the normalized displacement sequence can be determined; each normalized displacement is then input into the continuous distribution function to determine the corresponding scaling factor, that is, the scaling probability; finally, based on the scaling probability, each normalized pose in the normalized pose sequence is mapped to obtain the target pose sequence.
  • mapping historical pose information and current pose information to the pose information of the target object in the video frame to be recognized based on a specific probability can improve the accuracy of characterizing the motion information of the target object in the video frame to be recognized, that is, the target pose sequence. That is, the above step S253 can be realized through the following process:
  • each normalized displacement in the normalized displacement sequence is fitted to obtain a fitting result.
  • a preset function, such as the Rayleigh distribution, is used to fit each normalized displacement in the normalized displacement sequence to obtain a fitting result; a Gaussian distribution may also be used to fit each normalized displacement in the normalized displacement sequence to obtain a fitting result.
  • the second step is to determine the continuous distribution function that the fitting result satisfies.
  • the continuous distribution function satisfied by the fitting result is determined; wherein, the continuous distribution function may be represented by a conventional function expression, or may be represented by a text description.
  • each normalized displacement is input into the continuous distribution function to obtain the scaling probability of each normalized displacement.
  • each normalized displacement can be input into the continuous distribution function to obtain a scaling factor corresponding to each normalized displacement, that is, a scaling probability.
  • the normalized pose sequence corresponds to the initial pose sequence, and the scaling probability corresponding to each normalized displacement corresponds to the center point sequence, that is, to each center point in the initial pose sequence (the normalized displacement sequence being determined based on the center point sequence); thus, each normalized pose in the normalized pose sequence can be fused with the scaling probability of the corresponding normalized displacement to obtain the target pose sequence; for example, each normalized pose can be divided by the scaling probability of the corresponding normalized displacement to obtain the target pose sequence, as sketched below.
  • step S104 provided in the above embodiment can be realized through the following steps S206 and S207:
  • Step S206 in the target pose sequence, perform feature transformation on each target pose based on key points of each target pose to obtain a feature sequence to be adjusted.
  • Step S207, performing feature dimension adjustment on each feature to be adjusted in space and time to obtain the pose feature trajectory.
  • feature fusion and dimension adjustment can be performed on each feature to be adjusted, so as to obtain the pose feature trajectory.
  • each feature to be adjusted is fused with a preset spatial feature of each feature to be adjusted to obtain a sequence of spatial features.
  • the preset spatial feature of each feature to be adjusted may be a spatial feature parameter related to attributes of key points in each feature to be adjusted in the spatial dimension.
  • the preset spatial features of the features to be adjusted corresponding to different key points of a person are different; they are determined based on the positions of the key points in the human body and the attributes of those key points.
  • each feature to be adjusted and the preset spatial feature of each feature to be adjusted may be fused based on key points to obtain a sequence of spatial features; wherein each feature to be adjusted corresponds to each target pose, and each target pose includes multiple key points; that is, each feature to be adjusted corresponds to multiple key points of the same target pose.
  • each spatial feature in the spatial feature sequence is adjusted in multiple dimensions to obtain the spatial pose feature sequence; a sketch of this transformation follows.
  • the following process may also be performed:
  • the preset behavior rules associated with the scene information are determined.
  • for example, when the scene information is an office, the associated preset behavior rules specify that office staff are only allowed to work normally and communicate with others, and are not allowed to lie down to rest.
  • when the scene information is a parking lot, the associated preset behavior rules specify that vehicles in the parking lot are allowed to park and to drive at low speed, but are not allowed to drive at high speed.
  • the preset behavior rule is used to determine whether the video frame to be identified to which the behavior state belongs is an abnormal video frame.
  • the preset behavior rule is used to identify the behavior state to determine whether the behavior state belongs to an abnormal behavior state, and further to determine whether the video frame to be identified to which the behavior state belongs is an abnormal video frame.
  • determining the preset behavior rules associated with the scene information of the video frame to be recognized, that is, determining the preset behavior rules that the objects contained in the picture of the video frame to be recognized need to abide by; then, based on the preset behavior rules, determining whether the behavior state of the object in the video frame to be recognized is reasonable, and determining whether the video frame to be recognized is an abnormal video frame based on that judgment; in this way, it is possible to determine more accurately and conveniently whether the video frame is an abnormal video frame. A rule check of this kind is sketched after this item.
  • the target object includes at least two objects to be identified.
  • determining whether the video frame to be identified to which the behavior state belongs is an abnormal video frame can be achieved by the following steps:
  • the behavior recognition result corresponding to the preset position in the result scoring sequence is determined as the target recognition result of the video frame to be recognized, and then based on the target recognition result, it is determined whether the video frame to be recognized is an abnormal video frame.
  • an object recognition network can be used to recognize the target object in the video frame to be recognized, and then obtain the behavior state of the target object in the video frame to be recognized; wherein, the object recognition network is obtained by training the object recognition network to be trained, and the training of the object recognition network to be trained can be realized by the steps shown in Figure 3, which is a schematic flow diagram of a training method for an object recognition network provided by an embodiment of the present disclosure; the following description is made in conjunction with the steps shown in Figure 3:
  • Step S31 acquiring a sample video frame including a sample object.
  • the sample video frame is any video frame in the sample video stream of the sample object.
  • the sample normalized pose sequence includes multiple sample normalized poses, and the multiple sample normalized poses are sorted based on the timing information of the sample video frame where the sample object is located in the sample video stream, and the sample normalized pose information can be characterized based on the key points contained in the poses in the sample video frame.
  • Step S34 performing feature transformation on the sample pose sequence in space and time to obtain a sample pose feature track of the sample object in the sample video frame.
  • pose reconstruction can be performed based on the sample pose feature track, that is, correspondingly obtain a reconstructed pose sequence associated with the sample pose feature track, where the sample pose feature track can be converted from a feature vector to a coordinate parameter, and then the corresponding reconstructed pose sequence can be obtained.
  • a reconstruction loss is introduced to supervise the similarity between the sample normalized pose sequence and the reconstructed pose sequence in the sample video frame; in this way, by training the object recognition network to be trained, the recognition accuracy of the entire object recognition network can be improved, so that an object recognition network with higher performance can be obtained; that is, the trained object recognition network can recognize the motion information of objects in any video frame of the video stream with higher accuracy. A minimal training step is sketched below.
  • a Motion Prior Regularity Learner (MoPRL) is provided to alleviate the limitations of the above-mentioned pose-based methods.
  • MoPRL consists of two sub-parts, Motion Embedder (ME) and Spatial-Temporal Transformer (STT).
  • ME is used to extract the spatio-temporal representation of the input pose from a probabilistic perspective; the pose motion is modeled based on the displacement of pose center points between adjacent frames, and this motion can be transformed into the probability domain; that is, motion prior information is obtained through statistics, representing the explicit distribution of displacements over the training data.
  • each joint point can be represented by coordinates $(x_{i,j}, y_{i,j})$.
  • the fifth step is to determine the smallest rectangular box capable of enclosing each sample human pose in the human pose sequence, and then determine the center point of the corresponding smallest rectangular box as the center point of each sample human pose, namely $(x_i, y_i)$, while recording the size of the smallest rectangular box of each sample human pose, such as its width and height $(w_i, h_i)$; the sample human poses in each sample human pose sequence are then normalized, that is, each joint point of each sample human pose in the sequence is normalized to obtain the normalized representation sequence of each sample human pose, in which the normalized coordinate parameters corresponding to each joint point of each sample human pose are correspondingly obtained.
  • the fifth step thus yields the normalized representation sequence of each sample human pose.
  • the center point $(x_i, y_i)$ of each sample human pose and the width and height $(w_i, h_i)$ of the smallest rectangular box of each sample human pose are input to ME to obtain the corresponding sample target pose sequence in the sample video frame.
  • the reconstruction part 815 is configured to perform pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence
  • an embodiment of the present disclosure further provides a computer program product, the computer program product includes computer-executable instructions, and after the computer-executable instructions are executed, the object recognition method provided by the embodiments of the present disclosure, or the training method of an object recognition network can be realized.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in embodiments of the present invention are an object recognition method, a network training method and apparatus, a device, a medium, and a program. The object recognition method comprises: acquiring a video frame to be recognized whose picture includes a target object, said video frame being any video frame in a video stream of the target object; determining an initial pose sequence of the target object on the basis of said video frame and historical video frames of said video frame in the video stream; performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in said video frame; performing feature transformation on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in said video frame; and determining a behavior state of the target object in said video frame on the basis of the pose feature trajectory.

Description

Object recognition method, network training method, device, equipment, medium and program
Cross-Reference to Related Applications
This disclosure claims priority to Chinese Patent Application No. 202210082276.3, filed on January 24, 2022 by Shanghai SenseTime Intelligent Technology Co., Ltd. (上海商汤智能科技有限公司) and entitled "Object recognition method, network training method, device, equipment and medium", the full text of which is incorporated into this disclosure by reference.
Technical Field
Embodiments of the present disclosure relate to the field of image processing, and in particular to an object recognition method, a network training method, a device, equipment, a medium and a program.
Background
In the related art, abnormal video frames are identified from video streams based on pixel-level methods such as optical flow or frame gradients; these methods are easily affected by noise in the video picture, resulting in poor identification results.
Summary
An embodiment of the present disclosure provides a technical solution for object recognition.
The technical solutions of the embodiments of the present disclosure are realized as follows:
An embodiment of the present disclosure provides an object recognition method, the method comprising: acquiring a video frame to be recognized whose picture includes a target object, the video frame to be recognized being any video frame in the video stream of the target object; determining an initial pose sequence of the target object based on the video frame to be recognized and historical video frames of the video frame to be recognized in the video stream; performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized; performing feature transformation on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in the video frame to be recognized; and determining, based on the pose feature trajectory, the behavior state of the target object in the video frame to be recognized.
An embodiment of the present disclosure provides a training method for an object recognition network, the method comprising: acquiring a sample video frame including a sample object, wherein the sample video frame is any video frame in the sample video stream of the sample object; determining a sample normalized pose sequence of the sample object in the sample video frame; performing probability mapping on the sample normalized pose sequence using the object recognition network to be trained, to obtain a sample pose sequence; performing feature transformation on the sample pose sequence in space and time to obtain a sample pose feature trajectory of the sample object in the sample video frame; performing pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence; determining a reconstruction loss measuring the similarity between the reconstructed pose sequence and the sample normalized pose sequence; and adjusting, based on the reconstruction loss, the network parameters of the object recognition network to be trained so that the reconstruction loss output by the adjusted object recognition network satisfies a convergence condition.
An embodiment of the present disclosure provides an object recognition device, the device comprising: a first acquisition part, configured to acquire a video frame to be recognized whose picture includes a target object, the video frame to be recognized being any video frame in the video stream of the target object; a first determination part, configured to determine an initial pose sequence of the target object based on the video frame to be recognized and historical video frames of the video frame to be recognized in the video stream; a first mapping part, configured to perform probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized; a first conversion part, configured to perform feature transformation on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in the video frame to be recognized; and a second determination part, configured to determine, based on the pose feature trajectory, the behavior state of the target object in the video frame to be recognized.
An embodiment of the present disclosure provides a training device for an object recognition network, the device comprising: a second acquisition part, configured to acquire a sample video frame including a sample object, wherein the sample video frame is any video frame in the sample video stream of the sample object; a third determination part, configured to determine a sample normalized pose sequence of the sample object in the sample video frame; a second mapping part, configured to perform probability mapping on the sample normalized pose sequence using the object recognition network to be trained, to obtain a sample pose sequence; a second conversion part, configured to perform feature transformation on the sample pose sequence in space and time to obtain a sample pose feature trajectory of the sample object in the sample video frame; a reconstruction part, configured to perform pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence; a fourth determination part, configured to determine a reconstruction loss measuring the similarity between the reconstructed pose sequence and the sample normalized pose sequence; and an adjustment part, configured to adjust, based on the reconstruction loss, the network parameters of the object recognition network to be trained so that the reconstruction loss output by the adjusted object recognition network satisfies a convergence condition.
An embodiment of the present disclosure provides a computer device. The computer device includes a memory and a processor, the memory storing computer-executable instructions; when the processor runs the computer-executable instructions on the memory, the above object recognition method, or the above training method of an object recognition network, can be implemented.
An embodiment of the present disclosure provides a computer storage medium on which computer-executable instructions are stored; after the computer-executable instructions are executed, the above object recognition method, or the above training method of an object recognition network, can be implemented.
An embodiment of the present disclosure provides a computer program comprising computer-readable code; when the computer-readable code runs in an electronic device, a processor of the electronic device executes the above object recognition method, or the above training method of an object recognition network.
Embodiments of the present disclosure provide an object recognition method, a network training method, a device, equipment, a medium and a program. First, a video frame to be recognized whose picture includes a target object is acquired, the video frame to be recognized being any video frame in the video stream of the target object; secondly, the initial pose sequence of the target object is determined based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream; then, probability mapping is performed on the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized; in this way, the pose of the target object in the video frame to be recognized can be mapped to a specific probability corresponding to its prior motion; finally, feature transformation is performed on the target pose sequence in space and time to obtain the pose feature trajectory of the target object in the video frame to be recognized, and the behavior state of the target object in the video frame to be recognized is determined based on the pose feature trajectory. In this way, through the dynamic pose information of the target object in the time and space dimensions, the accuracy of determining the motion information of the target object in the video frame to be recognized can be improved, which in turn improves the accuracy of determining, based on the pose features, the behavior state of the target object in the video frame to be recognized.
In order to make the above objects, features and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings based on these drawings without creative work, wherein:
FIG. 1 is a schematic flowchart of a first object recognition method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a second object recognition method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a training method for an object recognition network provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of converting a pose trajectory into motion features in the probability domain based on a motion embedder, provided by an embodiment of the present disclosure;
FIG. 5 is a schematic framework diagram of a training method for an object recognition network provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a spatial-temporal transformer provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the internal processing flow of a spatial-temporal transformer provided by an embodiment of the present disclosure;
FIG. 8A is a schematic diagram of the structural composition of an object recognition device provided by an embodiment of the present disclosure;
FIG. 8B is a schematic diagram of the structural composition of a training device for an object recognition network provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the structural composition of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the specific technical solutions of the invention are described in further detail below with reference to the drawings in the embodiments of the present disclosure. The following embodiments are used to illustrate the embodiments of the present disclosure, but are not intended to limit their scope.
In the following description, reference to "some embodiments" describes a subset of all possible embodiments; it can be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first\second\third" merely distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that "first\second\third" may be interchanged in specific order or sequence where permitted, so that the embodiments of the present disclosure described here can be implemented in an order other than that illustrated or described here.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which the embodiments of the present disclosure belong. The terms used herein are only for the purpose of describing the embodiments of the present disclosure and are not intended to limit them.
Before the embodiments of the present disclosure are further described in detail, the nouns and terms involved in the embodiments of the present disclosure are explained; the nouns and terms involved in the embodiments of the present disclosure are subject to the following explanations.
1) Normalization: a way of simplifying calculation, in which a dimensional expression is transformed into a dimensionless expression and becomes a scalar; normalization is a dimensionless processing method that turns the absolute values in a physical system into some relative-value relationship. It is mainly used to simplify calculation and reduce magnitudes.
2) Confidence: in statistics, the confidence interval of a probability sample is an interval estimate of some population parameter of this sample. The confidence interval shows the degree to which the true value of the parameter has a certain probability of falling around the measured result. The confidence interval gives the range of credibility of the measured value of the measured parameter, that is, the "certain probability" required above. This probability is called the confidence level.
Exemplary applications of the object recognition device provided by the embodiments of the present disclosure are described below. The device provided by the embodiments of the present disclosure can be implemented as various types of user terminals with an image acquisition function, such as a notebook computer, a tablet computer, a desktop computer, a camera, or a mobile device (for example, a personal digital assistant, a dedicated messaging device, or a portable game device), and can also be implemented as a server. Exemplary applications when the device is implemented as a terminal or a server are described below.
The method can be applied to a computer device, and the functions implemented by the method can be realized by a processor in the computer device calling program code; of course, the program code can be stored in a computer storage medium. It can be seen that the computer device includes at least a processor and a storage medium.
An embodiment of the present disclosure provides an object recognition method. FIG. 1 is a schematic flowchart of the first object recognition method provided by an embodiment of the present disclosure; the following description is made in conjunction with the steps shown in FIG. 1:
Step S101, acquiring a video frame to be recognized whose picture includes a target object.
In some embodiments, the video frame to be recognized is any video frame in the video stream of the target object. The video stream can be obtained by capturing images of the target object with a device having an image acquisition function, or by directly obtaining a video stream sent by another device; any video frame is then randomly selected from the video stream as the video frame to be recognized. The video stream may also be obtained by at least one device with an acquisition function, arranged in a preset area, capturing that preset area. The devices may be placed at multiple acquisition points in the preset area, so as to capture relevant information in the preset area, for example, target objects appearing in the preset area. The preset area can be any area in a real scene, such as a shopping mall, a park or a road, and can also be a road intersection, etc.
In some embodiments, the video stream may also be obtained by a related device capturing images of the target object; the video stream may include at least one piece of scene information, that is, in the video stream the target object may appear in multiple scenes. In the following embodiments of the present disclosure, the target object may be a passer-by walking on a road, a vehicle driving on a road, or a puppy running in a park, etc.
In some embodiments, the number of target objects included in the picture of the video frame to be recognized may be one, two or more. When two or more target objects are included, the regions in which different target objects are located in the video frame to be recognized may be adjacent, far apart, or partially overlapping, and the areas occupied by different target objects in the video frame to be recognized may be the same or different. In the following embodiments of the present disclosure, the case of a single target object is used as an example for illustration.
In some embodiments, the poses presented by the target object in different video frames of the video stream may be the same or different. Exemplarily, when the target object is a person, the person in different video frames may be walking, running, standing, and so on.
Step S102, determining an initial pose sequence of the target object based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream.
In some embodiments, the historical video frames of the video frame to be recognized in the video stream may refer to at least one video frame that precedes the video frame to be recognized in time in the video stream and is adjacent to it; the number of historical video frames may be one frame, or two or more frames.
In some embodiments, first, key point recognition of the target object is performed on the video frame to be recognized and on the historical video frames respectively, to obtain the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frames; then, based on these key points, the pose information of the target object in the video frame to be recognized and in the historical video frames is determined; finally, according to the temporal relationship between the historical video frames and the video frame to be recognized, the pose information in the historical video frames and the pose information in the video frame to be recognized are sorted to obtain the initial pose sequence of the target object.
In some embodiments, when the target object is a person, human body key points or human body joint points may be identified in the video frame to be recognized and in the historical video frames respectively, obtaining the human body key points in the video frame to be recognized and in the historical video frames; based on these key points, the human body pose information in the video frame to be recognized and in the historical video frames is obtained respectively; finally, based on the temporal relationship between the historical video frames and the video frame to be recognized, the human body pose information in the historical video frames and in the video frame to be recognized is sorted to obtain the initial pose sequence.
In some embodiments, each initial pose in the initial pose sequence may be represented based on the position information of the key points of the target object in the corresponding video frame; the position information may be the two-dimensional coordinate information of the key points of the target object in the corresponding video frame.
Step S103, performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized.
In some embodiments, probability mapping is performed on the determined initial pose sequence of the target object, that is, the pose information in the respective pictures of the historical video frames and of the video frame to be recognized is correspondingly mapped, to obtain the target pose sequence of the target object in the video frame to be recognized. In some embodiments, probability mapping may refer to performing probability mapping on each initial pose based on the position information of the key points of the initial poses in the initial pose sequence. The target pose sequence of the target object in the video frame to be recognized may be obtained by sorting the determined target poses based on the temporal relationship between the historical video frames and the video frame to be recognized.
In some embodiments, when the number of historical video frames is 7, the initial pose sequence correspondingly includes the pose data of the target object in 8 video frames (7 historical video frames and 1 video frame to be recognized). First, probability mapping is performed on the pose data of the target object in the 8 video frames: relevant probability parameters may be determined based on the pose data of the target object in each video frame, for example, pose position information or pixel information; then, based on the probability parameters, the pose data of each of the 8 video frames is correspondingly fused to obtain a representation of the motion pose of the target object in the video frame to be recognized, that is, the target pose sequence of the target object.
In some embodiments, the target pose sequence includes pose data from multiple video frames; the pose data in each video frame may be characterized by the key points of the pose data, and the key points may be represented by their coordinate information in the corresponding video frame picture.
Step S104, performing feature transformation on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in the video frame to be recognized.
In some embodiments, feature transformation may be performed on the determined target pose sequence successively in the time dimension and the space dimension, thereby obtaining the pose feature trajectory of the target object in the video frame to be recognized; the target pose sequence may be input into a network including a spatial transformer and a temporal transformer for feature transformation, to obtain the corresponding pose feature trajectory.
In some embodiments, the coordinate information corresponding to the key points of each target pose in the target pose sequence may first be converted into a feature vector; then, based on the attention parameter, the converted feature vector is adjusted in the time dimension and the space dimension to obtain the final pose feature trajectory.
Step S105, determining, based on the pose feature trajectory, the behavior state of the target object in the video frame to be recognized.
In some embodiments, the behavior state of the target object in the video frame to be recognized can be determined based on the determined pose feature trajectory; the behavior state may be the motion information of the target object, for example: the target object is walking, jumping, standing, etc. The pose feature trajectory may be input into a corresponding network model to determine the behavior state corresponding to the pose feature trajectory.
In some embodiments, after determining the behavior state of the target object in the video frame to be recognized, the behavior state may be identified based on the scene information associated with the video frame to be recognized, so as to determine whether the video frame to be recognized is an abnormal video frame. Exemplarily, when it is determined that the behavior state of the target object in the video frame to be recognized is jumping, the scene information associated with the video frame to be recognized is obtained, for example, a zebra crossing at an intersection; then the preset behavior rules associated with the scene information are determined; finally, based on the preset behavior rules, it is determined that the behavior state, namely jumping, is an abnormal behavior, and thus that the video frame to be recognized is an abnormal video frame.
In some embodiments, anomaly recognition is performed on any video frame to be recognized in a video stream whose picture includes the target object. First, the historical video frames of the video frame to be recognized in the video stream can be determined, and the pose information of the target object in the video frame to be recognized and in the historical video frames can be obtained. Secondly, the initial pose sequence of the target object is determined based on the pose information in the historical video frames and in the video frame to be recognized, and a normalization operation and a mapping operation are performed on the initial pose sequence to determine the target pose sequence used to characterize the motion information of the target object in the video frame to be recognized. Then, the target pose sequence is converted into a feature vector, and feature transformation and dimension adjustment are performed on the feature vector in the spatial and temporal dimensions to obtain the pose feature trajectory characterizing the motion of the target object in the video frame to be recognized. Finally, based on the pose feature trajectory, that is, the dynamic feature information, the behavior state of the target object in the video frame to be recognized is determined, and based on the behavior state it is determined whether the video frame to be recognized is an abnormal video frame.
The object recognition method provided by the embodiments of the present disclosure first acquires a video frame to be recognized whose picture includes a target object, the video frame to be recognized being any video frame in the video stream of the target object; in this way, the video stream and the video frame to be recognized provide a basis for subsequently determining the behavior state of the target object in the video frame to be recognized. Secondly, the initial pose sequence of the target object is determined based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream, and probability mapping is performed on the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized; in this way, the pose of the target object in the video frame to be recognized can be mapped to a specific probability corresponding to its prior motion, so that the determined target pose sequence takes into account both the prior motion and the current motion information of the target object, and therefore matches the motion information of the target object in the video frame to be recognized more accurately. Finally, feature transformation is performed on the target pose sequence in space and time to obtain the pose feature trajectory of the target object in the video frame to be recognized, and the behavior state of the target object in the video frame to be recognized is determined based on the pose feature trajectory. In this way, by means of the dynamic pose information of the target object in the time and space dimensions, the accuracy of determining the motion information of the target object in the video frame to be recognized can be improved, which in turn improves the accuracy of determining, based on the pose features, the behavior state of the target object in the video frame to be recognized.
In some embodiments, the pose information in the video frame to be recognized and the pose information in the historical video frames are determined based on the key points of the target object in the historical video frames and the key points of the target object in the video frame to be recognized, respectively; the initial pose sequence of the target object is then obtained from the pose information in the video frame to be recognized and the pose information in the historical video frames. By considering the previous motion and the current motion information of the target object at the same time, i.e., considering both the pose information in the historical video frames and the pose information in the video frame to be recognized, the determined initial pose sequence of the target object is made more accurate; in other words, the accuracy of determining the initial pose sequence is improved on the basis of pose information determined with high accuracy. That is, step S102 provided in the above embodiment may be implemented through the following steps S201 to S203. FIG. 2 is a schematic flowchart of a second object recognition method provided by an embodiment of the present disclosure, described below with reference to the steps shown in FIG. 1 and FIG. 2:
Step S201: perform key point recognition on the video frame to be recognized and on the historical video frames respectively, to obtain the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frames.

In some embodiments, a trained neural network may be used to perform key point recognition on the video frame to be recognized and on the historical video frames respectively, thereby obtaining the key points of the target object in the video frame to be recognized and in the historical video frames; the trained neural network may be any neural network, which is not limited in the embodiments of the present disclosure.

In some embodiments, where the target object is a person, human-body key point recognition, i.e., human joint point recognition, may be performed on the video frame to be recognized and on the historical video frames respectively, to obtain the human joint points contained in each of them. The human joint points may include 17 joint points, for example: nose, left and right eyes, left and right ears, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles.
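For reference, the 17 joint points enumerated above coincide with the keypoint set commonly used in public human-pose datasets; a minimal sketch of this layout is given below, where the ordering follows the conventional COCO convention (an assumption for illustration, not part of this disclosure):

```python
# The 17 human joint points listed above, in the conventional COCO ordering
# (assumed for illustration); index j gives the position of joint J_{i,j}
# within one detected pose.
COCO_KEYPOINTS = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]
assert len(COCO_KEYPOINTS) == 17
```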
Step S202: determine the pose information in the video frame to be recognized and the pose information in the historical video frames, based on the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frames, respectively.

In some embodiments, the pose information of the target object in the video frame to be recognized may be represented by the key points of the target object in that frame, and the pose information of the target object in the historical video frames may be represented by the key points of the target object in those frames. For example, the pose information of the target object in the video frame to be recognized may be represented by the position information of the key points of the target object in that frame, where the position information may refer to the coordinate information of a key point in the video frame to be recognized. The pose information of the target object in the historical video frames may be represented in the same manner.

Step S203: sort the pose information in the historical video frames and the pose information in the video frame to be recognized according to the temporal relationship between the historical video frames and the video frame to be recognized, to obtain the initial pose sequence.

In some embodiments, the pose information in the historical video frames and the pose information in the video frame to be recognized may be sorted in order according to the temporal relationship between the historical video frames and the video frame to be recognized, to obtain the initial pose sequence. For example, where P_{n+1} represents the pose information in the video frame to be recognized and P_1 to P_n represent the pose information in n historical video frames, the initial pose sequence may be written as [P_1, ..., P_n, P_{n+1}], where n is an integer greater than or equal to 1. Each initial pose in the initial pose sequence may be represented, as described above, by the coordinate information of its key points in the corresponding video frame.
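By way of illustration only, the ordering of step S203 may be sketched as follows; here `detect_keypoints` is an assumed stand-in for any trained keypoint network and is not part of this disclosure:

```python
import numpy as np

def build_initial_pose_sequence(history_frames, frame_to_recognize, detect_keypoints):
    """Order per-frame poses by time: [P_1, ..., P_n, P_{n+1}].

    history_frames: list of n video frames, oldest first.
    frame_to_recognize: the current frame, giving P_{n+1}.
    detect_keypoints: assumed helper mapping a frame -> (17, 2) keypoint array.
    """
    frames = list(history_frames) + [frame_to_recognize]
    # Each pose is represented by the coordinates of its key points.
    return np.stack([detect_keypoints(f) for f in frames])  # shape (n + 1, 17, 2)
```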
In some embodiments of the present disclosure, a normalization operation and a mapping operation may be performed on each initial pose in the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized. In this way, the pose of the target object in the video frame to be recognized can be mapped to a specific probability corresponding to its earlier motion, so that the pose information of the target object in the video frame to be recognized is represented on the basis of both historical and current pose information; that is, the pose information of the target object in the video frame to be recognized is represented by dynamic parameters, which improves the accuracy of representing the motion information of the target object in the video frame to be recognized. That is, step S103 provided in the above embodiment may be implemented through the following steps S204 and S205:
Step S204: based on the position information of the key points in each initial pose, obtain a center point sequence used for determining the displacement between adjacent initial poses, and a normalized pose sequence of the initial pose sequence.

In some embodiments, the center point sequence used for determining the displacement between adjacent initial poses, and the normalized pose sequence corresponding to the initial pose sequence, may be determined based on the position information of the key points in each initial pose of the initial pose sequence.

In some embodiments of the present disclosure, the bounding box of each initial pose may be determined based on the position information of the key points in that pose, and the center point sequence and the normalized pose sequence may then be determined correspondingly from the bounding boxes and the initial poses. This improves the accuracy of the determined center point sequence and normalized pose sequence; at the same time, normalizing each initial pose yields the corresponding normalized data, i.e., the normalized pose sequence, so that the accuracy and speed of subsequently determining the target pose sequence from the normalized data can be improved. That is, the above step S204 may be implemented through the following process:
In the first step, the bounding box of each initial pose is determined based on the position information of the key points in that initial pose.

In some embodiments, the minimum rectangular box that can enclose the initial pose, i.e., the minimum rectangular box enclosing the positions of the key points of that initial pose, may be determined in the corresponding video frame. The bounding boxes of different initial poses may have the same or different sizes, and their positions in the corresponding video frames may likewise be the same or different.

In some embodiments, where the position information of a key point is its coordinate information in the corresponding video frame, the position information of the key points in each initial pose may first be compared to determine a minimum coordinate point and a maximum coordinate point; the minimum rectangular box enclosing these two points, i.e., the bounding box of the initial pose, is then determined from them.
In the second step, the center points of the bounding boxes of the initial poses are sorted within the initial pose sequence to obtain the center point sequence.

In some embodiments, the center point of the bounding box of each initial pose is determined, and these center points are then sorted according to the sequence information of the initial poses in the initial pose sequence, to obtain the center point sequence. The center point of the bounding box of each initial pose may be represented by its position information, i.e., its coordinate information in the corresponding video frame.

In the third step, each initial pose is normalized using its bounding box to obtain the normalized pose sequence.

In some embodiments, each initial pose is normalized using its bounding box to obtain the normalized pose sequence; for example, each initial pose may be normalized using the size information of its bounding box, such as the width and height of the bounding box, to obtain the normalized pose corresponding to that initial pose, and thereby the normalized pose sequence.
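By way of illustration only, the three sub-steps of step S204 (bounding box, center point sequence, normalized pose sequence) may be sketched as follows; the exact normalization formula (subtracting the box center and dividing by the box width and height) is one plausible instantiation and is an assumption:

```python
import numpy as np

def pose_bbox(pose):
    """Minimum axis-aligned box enclosing all key points of one pose.

    pose: (K, 2) array of key point coordinates in the frame.
    Returns the box center (x, y) and size (w, h).
    """
    x_min, y_min = pose.min(axis=0)
    x_max, y_max = pose.max(axis=0)
    center = np.array([(x_min + x_max) / 2.0, (y_min + y_max) / 2.0])
    size = np.array([x_max - x_min, y_max - y_min])
    return center, size

def normalize_pose_sequence(poses):
    """poses: (T, K, 2) initial pose sequence.

    Returns the center point sequence (T, 2), the box sizes (T, 2), and the
    normalized pose sequence (T, K, 2) in which each pose is expressed
    relative to its own bounding box (an assumed convention).
    """
    centers, sizes, normed = [], [], []
    for pose in poses:
        c, s = pose_bbox(pose)
        centers.append(c)
        sizes.append(s)
        normed.append((pose - c) / np.maximum(s, 1e-6))  # guard zero-size boxes
    return np.stack(centers), np.stack(sizes), np.stack(normed)
```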
Step S205: perform probability mapping on the normalized pose sequence based on the center point sequence, to obtain the target pose sequence.

In some embodiments, each normalized pose in the normalized pose sequence corresponds one-to-one to an initial pose in the initial pose sequence, and the center point sequence corresponds to the bounding boxes of the initial poses; that is, the center point sequence corresponds to the normalized pose sequence. Here, probability mapping is performed on each normalized pose in the normalized pose sequence based on the center point sequence, to obtain the corresponding target pose sequence. For example, a corresponding probability parameter may be determined based on the center point sequence, and probability mapping may then be performed on each normalized pose in the normalized pose sequence based on that probability parameter, to obtain the target pose sequence.

In some embodiments of the present disclosure, a normalized displacement sequence corresponding to the center point sequence may be determined based on the relationship between adjacent center points in the center point sequence, and probability mapping may then be performed on the normalized pose sequence based on the normalized displacement sequence, to obtain the target pose sequence. In this way, the target pose sequence takes both the previous motion and the current motion information of the target object into account, so that the determined target pose sequence matches the motion information of the target object in the video frame to be recognized more accurately. That is, the above step S205 may be implemented through the following process:
First, in the center point sequence, a displacement sequence is obtained based on the difference between the position information of every two adjacent center points.

In some embodiments, in the center point sequence, the difference between the position information of every two adjacent center points may be determined, in the order in which the center points are arranged, as the displacement corresponding to the earlier or the later of the two adjacent center points, thereby obtaining the displacement sequence. The position information of adjacent center points is the coordinate information, in the corresponding video frames, of the center points of the bounding boxes determined in adjacent video frames to enclose the initial poses.

In some embodiments, where the position information of a center point is represented by two-dimensional coordinates, the difference between the horizontal coordinates and the difference between the vertical coordinates of every two adjacent center points may be combined by a suitable function to obtain the displacement between the two adjacent center points.
Second, each displacement in the displacement sequence is normalized based on the size information of the bounding box of each initial pose, to obtain a normalized displacement sequence.

In some embodiments, each displacement in the displacement sequence may be normalized based on the size information of the bounding box of the corresponding initial pose, for example the height and width of the bounding box, to obtain the normalized displacement sequence. For example, the size information of the bounding box of each initial pose may be obtained by adding the width and the height of the bounding box, and each displacement in the displacement sequence is then divided by the corresponding size information to obtain the normalized displacement sequence.
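A minimal sketch of the displacement computation and its normalization is given below; the use of the Euclidean distance for combining the coordinate differences, and of w + h as the size information, are plausible readings of the description above rather than verbatim requirements:

```python
import numpy as np

def normalized_displacements(centers, sizes):
    """centers: (T, 2) bounding-box center sequence; sizes: (T, 2) box (w, h).

    The displacement between adjacent centers is taken as the Euclidean
    norm of their coordinate differences (an assumed choice of combining
    function), and each displacement is divided by the corresponding box
    scale, here w + h.
    """
    deltas = np.diff(centers, axis=0)          # (T - 1, 2) coordinate differences
    disp = np.linalg.norm(deltas, axis=1)      # one displacement per adjacent pair
    scale = sizes[:-1].sum(axis=1)             # w_i + h_i of the earlier pose
    return disp / np.maximum(scale, 1e-6)
```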
Then, probability mapping is performed on the normalized pose sequence based on the normalized displacement sequence, to obtain the target pose sequence.

In some embodiments, probability mapping is performed on the normalized pose sequence based on the determined normalized displacement sequence, to obtain the corresponding target pose sequence; that is, the normalized pose sequence can be mapped to specific probabilities so as to represent the motion information of the target object in the video frame to be recognized, i.e., the target pose sequence. Performing the probability mapping on the normalized pose sequence based on the normalized displacement sequence improves the accuracy and speed of determining the target pose sequence.

In some embodiments of the present disclosure, a continuous distribution function associated with the normalized displacement sequence may be determined; each normalized displacement is then input into the continuous distribution function to determine a corresponding scaling factor, i.e., a scaling probability; finally, each normalized pose in the normalized pose sequence is mapped based on the scaling probability, to obtain the target pose sequence. Mapping the pose information of the target object in the video frame to be recognized to specific probabilities based on historical and current pose information in this way improves the accuracy of representing the motion information of the target object in the video frame to be recognized, i.e., the target pose sequence. That is, the above probability mapping step may be implemented through the following process:
In the first step, each normalized displacement in the normalized displacement sequence is fitted to obtain a fitting result.

In some embodiments, a preset function, for example a Rayleigh distribution, is used to fit each normalized displacement in the normalized displacement sequence to obtain the fitting result. A Gaussian distribution may also be used to fit each normalized displacement in the normalized displacement sequence to obtain the fitting result.

In the second step, the continuous distribution function satisfied by the fitting result is determined.

In some embodiments, the continuous distribution function satisfied by the fitting result is determined; the continuous distribution function may be expressed as a conventional function expression, or may be described in words.
In the third step, each normalized displacement is input into the continuous distribution function to obtain the scaling probability of that normalized displacement.

In some embodiments, each normalized displacement may be input into the continuous distribution function to obtain the scaling factor, i.e., the scaling probability, corresponding to that normalized displacement.

In the fourth step, each normalized pose is mapped based on the scaling probability of each normalized displacement, to obtain the target pose sequence.

Here, the normalized pose sequence corresponds to the initial pose sequence, and the scaling probability corresponding to each normalized displacement corresponds to the center point sequence, i.e., to each center point of the initial pose sequence (the normalized displacement sequence being determined from the center point sequence). Accordingly, each normalized pose in the normalized pose sequence can be directly fused with the scaling probability of the corresponding normalized displacement to obtain the target pose sequence; that is, each normalized pose may be divided by the scaling probability of the corresponding normalized displacement to obtain the target pose sequence.
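The four sub-steps above may be sketched as follows; the Rayleigh fit uses `scipy.stats.rayleigh` (such a fit would normally be performed once over the training data rather than per sequence), and the convention of attaching each displacement probability to the later pose of the pair is illustrative:

```python
import numpy as np
from scipy.stats import rayleigh

def probability_map_poses(normed_poses, normed_disps):
    """Map each normalized pose to a target pose via a motion-prior probability.

    normed_poses: (T, K, 2) normalized pose sequence.
    normed_disps: (T - 1,) normalized displacement sequence.
    """
    # Fit a Rayleigh distribution to the normalized displacements and use its
    # density at each displacement as the scaling probability.
    loc, scale = rayleigh.fit(normed_disps)
    probs = rayleigh.pdf(normed_disps, loc=loc, scale=scale)
    probs = np.clip(probs, 1e-6, None)   # the scaling factor is used as a denominator
    # Attach each displacement probability to the later pose of its pair; the
    # first pose keeps probability 1 (an illustrative convention).
    s = np.concatenate([[1.0], probs])
    return normed_poses / s[:, None, None]
```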
Here, the target pose sequence obtained above is input into a spatial-temporal transformer, which performs feature conversion and dimension adjustment on each target pose in the target pose sequence in the temporal and spatial dimensions, thereby obtaining a pose feature trajectory that better matches the motion information of the target object in the video frame to be recognized. That is, step S104 provided in the above embodiment may be implemented through the following steps S206 and S207:
Step S206: in the target pose sequence, perform feature conversion on each target pose based on the key points of that target pose, to obtain a feature sequence to be adjusted.

In some embodiments, since each target pose in the target pose sequence includes a plurality of key points, feature conversion may be performed on the key point information of each target pose, i.e., feature conversion is performed on each target pose, thereby obtaining the feature sequence to be adjusted.

Step S207: perform feature dimension adjustment on each feature to be adjusted in space and time, to obtain the pose feature trajectory.

In some embodiments, feature fusion and dimension adjustment may be performed on each feature to be adjusted, successively based on the attention parameters of the spatial dimension in the attention mechanism and the attention parameters of the temporal dimension in the attention mechanism, thereby obtaining the pose feature trajectory.
In some embodiments of the present disclosure, the feature sequence to be adjusted may be adjusted successively in the spatial dimension and the temporal dimension to obtain the final pose feature trajectory. Pose information typically involves two dimensions, namely space and time; adjusting the feature sequence to be adjusted successively in the spatial and temporal dimensions therefore makes the determined pose feature trajectory match the actual motion of the target object more closely, i.e., makes the determined pose feature trajectory more accurate. That is, the above step S207 may be implemented through the following process:
In the first step, each feature to be adjusted is fused with the preset spatial feature of that feature to obtain a spatial feature sequence.

In some embodiments, the preset spatial feature of each feature to be adjusted may be a spatial feature parameter related, in the spatial dimension, to the attributes of the key points in that feature. For example, where the target object is a person, the preset spatial features of the features to be adjusted differ for different key points of the person, being determined by the position of each key point on the human body and by the attributes of the human key points.

In some embodiments, each feature to be adjusted and its preset spatial feature may be fused on a per-key-point basis to obtain the spatial feature sequence; each feature to be adjusted corresponds to a target pose, and each target pose includes a plurality of key points; that is, each feature to be adjusted corresponds to the plurality of key points of the same target pose.

In the second step, multi-layer dimension adjustment is performed on the spatial feature sequence based on the attention parameters of the spatial dimension in the attention mechanism, to obtain a spatial pose feature sequence.

In some embodiments, multi-layer dimension adjustment may be performed on each spatial feature in the spatial feature sequence based on the attention parameters of the spatial dimension in the attention mechanism, such as the query, key, and value matrices, thereby obtaining the spatial pose feature sequence.
In the third step, each spatial pose feature is fused with the preset temporal feature of that spatial pose feature to obtain a temporal feature sequence.

In some embodiments, the preset temporal feature of each spatial pose feature may be a temporal feature parameter related, in the temporal dimension, to the attributes of the key points in that spatial pose feature. For example, where the target object is a person, the preset temporal features corresponding to different key points of the person differ, being determined by the position of each key point on the human body and by the attributes of the human key points.

In some embodiments, each spatial pose feature and its preset temporal feature may be fused on a per-key-point basis to obtain the temporal feature sequence; each spatial pose feature corresponds to a target pose, and each target pose includes a plurality of key points; that is, each spatial pose feature corresponds to the plurality of key points of the same target pose.

In the fourth step, multi-layer dimension adjustment is performed on the temporal feature sequence based on the attention parameters of the temporal dimension in the attention mechanism, to obtain the pose feature trajectory.

The output of each layer of dimension adjustment is the input of the next layer. In some embodiments, multi-layer dimension adjustment or feature encoding may be performed on each temporal feature in the temporal feature sequence based on the attention parameters of the temporal dimension in the attention mechanism, thereby obtaining the pose feature trajectory.

Here, the attention parameters of the temporal dimension in the attention mechanism correspond one-to-one to those of the spatial dimension, and are associated with the input feature parameters. In multi-layer dimension adjustment, the output of the previous layer serves as the input of the next layer; for example, the features obtained after the first layer of adjustment are used directly as the input of the second layer.
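A minimal sketch of steps S206 and S207 in this form is given below; the embedding sizes, layer counts, the use of `torch.nn.MultiheadAttention` for the attention parameters, and the residual connections are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PoseFeatureAdjust(nn.Module):
    """One spatial-then-temporal adjustment pass over a feature sequence.

    Assumed input shape: (T, K, C) -- T target poses, K key points, C channels.
    `spatial_emb` and `temporal_emb` play the role of the preset spatial and
    temporal features fused with the inputs; the attention layers supply the
    query/key/value parameters mentioned above.
    """
    def __init__(self, num_joints=17, seq_len=8, dim=64, heads=4, layers=2):
        super().__init__()
        self.spatial_emb = nn.Parameter(torch.zeros(1, num_joints, dim))
        self.temporal_emb = nn.Parameter(torch.zeros(seq_len, 1, dim))
        self.spatial_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers))
        self.temporal_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers))

    def forward(self, x):                  # x: (T, K, C)
        x = x + self.spatial_emb           # fuse preset spatial features
        for attn in self.spatial_attn:     # attend over the K joints of each pose
            x = x + attn(x, x, x)[0]       # previous layer's output feeds the next
        x = x + self.temporal_emb          # fuse preset temporal features
        x = x.transpose(0, 1)              # (K, T, C): attend over time per joint
        for attn in self.temporal_attn:
            x = x + attn(x, x, x)[0]
        return x.transpose(0, 1)           # (T, K, C): the pose feature trajectory
```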
In the object recognition method provided by the embodiments of the present disclosure, after the behavior state of the target object in the video frame to be recognized is determined based on the pose feature trajectory, i.e., after step S105 provided in the above embodiment, the following process may further be performed:

First, the scene information corresponding to the video frame to be recognized is acquired.

In some embodiments, the scene information corresponding to the video frame to be recognized is acquired, for example: an office, a park, a parking lot, and so on.

Second, a preset behavior rule associated with the scene information is determined.

In some embodiments, the preset behavior rule associated with the scene information is determined. For example, where the scene information is an office, the associated preset behavior rule may be that office staff are only permitted to work normally and communicate with others, but are not permitted to lie down to rest. Where the scene information is a parking lot, the associated preset behavior rule may be that vehicles in the parking lot are permitted to park and to travel at low speed, but are not permitted to travel at high speed.

Finally, the preset behavior rule is used to determine whether the video frame to be recognized, to which the behavior state belongs, is an abnormal video frame.

In some embodiments, the behavior state is checked against the preset behavior rule to determine whether it is an abnormal behavior state, and thereby whether the video frame to which the behavior state belongs is an abnormal video frame. In this way, first, the preset behavior rule associated with the scene of the video frame to be recognized is determined from the scene information of that frame, i.e., the preset behavior rule to be observed by the objects contained in the picture of the video frame to be recognized is determined; then, whether the behavior state of the object in the video frame to be recognized is reasonable is determined based on that preset behavior rule, and whether the video frame to be recognized is an abnormal video frame is determined based on this judgment. This makes it possible to determine more accurately and conveniently whether a video frame is abnormal.
In some embodiments of the present disclosure, the target object includes at least two objects to be recognized, and the above use of the preset behavior rule to determine whether the video frame to which the behavior state belongs is an abnormal video frame may be implemented through the following steps:

In the first step, the behavior state of each of the at least two objects to be recognized is recognized using the preset behavior rule, to obtain a behavior recognition result set.

In some embodiments, the behavior state of each object to be recognized is recognized using the preset behavior rule, the behavior recognition result of each object to be recognized is determined, and the behavior recognition result set is thereby obtained; the behavior recognition result of each object to be recognized may be expressed as abnormal or normal.

In the second step, the confidence of each behavior recognition result is sorted to obtain a result scoring sequence.

In some embodiments, the confidences of the behavior recognition results are sorted to obtain the result scoring sequence. The confidence may be associated with the behavior state in the recognition result, or with the picture clarity corresponding to that behavior state.

In the third step, whether the video frame to be recognized is the abnormal video frame is determined based on the behavior recognition result at a preset position in the result scoring sequence.

In some embodiments, the behavior recognition result at the preset position in the result scoring sequence is determined as the target recognition result of the video frame to be recognized, and whether the video frame to be recognized is an abnormal video frame is then determined based on that target recognition result.

In some embodiments, first, the behavior state of each object to be recognized included in the video frame to be recognized is recognized using the preset behavior rule, to obtain a plurality of recognition results, i.e., the behavior recognition result set; each object in the video frame to be recognized is thus analyzed, giving a more comprehensive and specific behavior analysis of the objects included in the frame. Then, the confidences of the behavior recognition results in the set are sorted to obtain a sequence, and whether the video frame to be recognized is an abnormal video frame is determined based on the behavior recognition result at a preset position in that sequence. By flexibly selecting a preset position in the generated sequence, and judging the video frame to be recognized based on the behavior recognition result corresponding to that position, whether the video frame to be recognized is an abnormal video frame can be identified flexibly, thereby improving the accuracy of detecting abnormal video frames.
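By way of illustration, the rule-based decision over at least two objects may be sketched as follows; the representation of the rule as a set of allowed behavior states and the choice of the preset position are assumptions:

```python
def frame_is_abnormal(recognition_results, allowed_behaviors, preset_position=0):
    """Decide whether a frame is abnormal from per-object recognition results.

    recognition_results: list of (behavior_state, confidence) pairs, one per
    object to be recognized in the frame.
    allowed_behaviors: the preset behavior rule for the scene, e.g.
    {"working", "talking"} for an office scene.
    """
    # Label each object's behavior against the scene's preset rule.
    labeled = [("normal" if state in allowed_behaviors else "abnormal", conf)
               for state, conf in recognition_results]
    # Rank the results by confidence: the result scoring sequence.
    ranked = sorted(labeled, key=lambda r: r[1], reverse=True)
    # The result at the preset position determines the frame-level decision.
    return ranked[preset_position][0] == "abnormal"
```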
In some embodiments, an object recognition network may be used to recognize the target object in the video frame to be recognized, thereby obtaining the behavior state of the target object in that frame. The object recognition network is obtained by training an object recognition network to be trained, and the training may be implemented through the steps shown in FIG. 3, which is a schematic flowchart of a training method for an object recognition network provided by an embodiment of the present disclosure; the following description is made with reference to the steps shown in FIG. 3:
Step S31: acquire a sample video frame including a sample object.

The sample video frame is any video frame in a sample video stream of the sample object.

In some embodiments, a device with an image acquisition function may be used to capture a relevant scene or object to obtain the sample video frame; the number of sample objects in the sample video frame may be one, or two or more. In the embodiments of the present disclosure, one sample object is taken as an example for illustration.
Step S32: determine a sample normalized pose sequence of the sample object in the sample video frame.

In some embodiments, the sample historical video frames of the sample video frame may first be determined in the sample video stream; the sample pose information of the sample object in the sample historical video frames and the sample pose information of the sample object in the sample video frame are then sorted according to the temporal relationship between the sample historical video frames and the sample video frame, to obtain a first sample pose sequence; the first sample pose sequence is then normalized to obtain the sample normalized pose sequence. The implementation here is similar to that of steps S203 and S204 above, i.e., normalized pose data representing the motion information of the sample object in the sample video frame is determined.

In some embodiments, the sample normalized pose sequence includes a plurality of sample normalized poses, sorted according to the temporal information, within the sample video stream, of the sample video frames in which the sample object appears; each sample normalized pose may be represented by the key points contained in the pose in the corresponding sample video frame.
Step S33: use the object recognition network to be trained to perform probability mapping on the sample normalized pose sequence, to obtain a sample pose sequence.

Here, the implementation of step S33 is similar to that of step S205 above, i.e., the pose of the sample object in the sample video frame is mapped to a specific probability corresponding to its earlier motion, and the sample pose sequence of the sample object in the sample video frame is thereby determined.

Step S34: perform feature conversion on the sample pose sequence in space and time, to obtain a sample pose feature trajectory of the sample object in the sample video frame.

Here, the implementation of step S34 is similar to that of step S104, and of steps S206 and S207, above, i.e., the determined sample pose sequence is input into a network including a spatial transformer and a temporal transformer for feature conversion, so as to obtain the corresponding sample pose feature trajectory.

In some embodiments, when the sample pose sequence is input into the corresponding transformer network, the key points in each sample pose of the sample pose sequence may be partially masked, so as to reduce the amount of computation on the relevant data in the network and thereby improve the running speed.
Step S35: perform pose reconstruction on the sample pose feature trajectory, to obtain a reconstructed pose sequence.

In some embodiments, pose reconstruction may be performed on the basis of the sample pose feature trajectory, i.e., a reconstructed pose sequence associated with the sample pose feature trajectory is obtained correspondingly; here the sample pose feature trajectory may be converted from feature vectors back into coordinate parameters, so as to obtain the corresponding reconstructed pose sequence.

Step S36: determine a reconstruction loss of the similarity between the reconstructed pose sequence and the sample normalized pose sequence.

In some embodiments, the reconstruction loss for the similarity between the reconstructed pose sequence and the sample normalized pose sequence may be calculated from the similarity between the coordinate information of the key points in each pose of the two sequences. Alternatively, the reconstruction loss may be calculated from both the similarity between the coordinate information of the key points in each pose and the confidence of the key points in each pose.
Step S37: adjust the network parameters of the object recognition network to be trained based on the reconstruction loss, so that the reconstruction loss output by the adjusted object recognition network satisfies a convergence condition.

In some embodiments, the network parameters of the object recognition network to be trained are adjusted based on the reconstruction loss, so that the reconstruction loss output by the adjusted object recognition network satisfies the convergence condition.
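A minimal sketch of one parameter-adjustment step is given below; the mean squared error over keypoint coordinates is one concrete choice of the similarity-based reconstruction loss, and `model` stands in for the network to be trained (probability mapping, spatio-temporal feature conversion, and pose reconstruction combined):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample_normalized_poses):
    """One illustrative optimization step for the network to be trained.

    sample_normalized_poses: (T, K, 2) float tensor, the sample normalized
    pose sequence. `model` is assumed to return a reconstructed pose sequence
    of the same shape as its input.
    """
    reconstructed = model(sample_normalized_poses)
    # The reconstruction loss supervises the similarity between the
    # reconstructed poses and the sample normalized poses.
    loss = F.mse_loss(reconstructed, sample_normalized_poses)
    optimizer.zero_grad()
    loss.backward()   # adjust the network parameters until the loss converges
    optimizer.step()
    return loss.item()
```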
Here, through the above steps S31 to S37, a reconstruction loss supervising the similarity between the sample normalized pose sequence and the reconstructed pose sequence of the sample video frame is introduced into the object recognition network to be trained, on the basis of the probability mapping of the relevant pose sequences and the feature conversion and adjustment in the temporal and spatial dimensions. Training the object recognition network to be trained in this way improves the recognition accuracy of the entire object recognition network, so that an object recognition network with higher performance is obtained; that is, the trained object recognition network recognizes the motion information of an object in any video frame of a video stream with higher accuracy.
The above object recognition method and the training method for the object recognition network are described below with reference to a specific embodiment. It should be noted, however, that this specific embodiment is only intended to better illustrate the embodiments of the present disclosure and does not constitute an improper limitation thereof.

In the related art, abnormal video frame detection based on pose methods also has certain limitations. The reason is that pose-based methods rely on static features in video frames, whereas video frame anomaly detection depends more on dynamic features. Effective motion representation is therefore crucial for learning regular video patterns in abnormal video frame detection. In this situation, a detection model built on the static features of a pose-based method is overwhelmed by having to learn motion and normal states simultaneously, which degrades the performance of the detection model.
In the object recognition method and the training method for the object recognition network provided by the embodiments of the present disclosure, a Motion Prior Regularity Learner (MoPRL) is presented to alleviate the limitations of the above pose-based methods. MoPRL consists of two sub-parts: a Motion Embedder (ME) and a Spatial-Temporal Transformer (STT). The ME extracts a spatio-temporal representation of the input poses from a probabilistic perspective, modeling pose motion from the displacement between pose center points in adjacent frames. This motion can be transformed into the probability domain: motion prior information is obtained statistically and represents the explicit distribution of displacements over the training data. That is, to represent the corresponding motion, each pose displacement is mapped to a specific probability based on prior motion. With the designed pose masking strategy, the STT serves as the task-specific model, learning regular patterns from the poses output by the ME together with their motion features. The framework of the object recognition network provided by the embodiments of the present disclosure adopts a self-supervised sequential input structure, which is naturally suited to pose regularity learning.

Based on the MoPRL of the present disclosure, the pose motion of a target object in a video frame can be represented intuitively in the probability domain via the ME, providing an effective pose motion representation for regularity learning; at the same time, the STT with pose masking and divided attention is used to model the regularity of pose trajectories. The following are the implementation steps of the training method for the object recognition network provided by the embodiments of the present disclosure, where the sample object is taken to be a person by default:
In the first step, a sample video stream is acquired. The sample video stream may be represented by a training set D_train = {F_1, ..., F_m}, where F_i denotes any video frame in the sample video stream. Each video frame in the sample video stream includes sample objects with annotated pose information, and a test set may be represented by D_test = {(F_1, L_1), ..., (F_n, L_n)}, where L_i ∈ {0, 1}, indicating that both normal and abnormal samples exist in the training and test sets.
In the second step, the sample historical video frames of a sample video frame are determined in the sample video stream, i.e., by sliding a window over the sample video stream, at least one sample historical video frame that precedes and is adjacent to the sample video frame is determined. For example, where the sample video frame is the 8th frame of the sample video stream, its sample historical video frames may be the first 7 frames of the stream; the 8th sample video frame is taken as the example in the following description.

In the third step, human pose recognition is performed on the sample historical video frames and the sample video frame respectively, to obtain human pose information represented by human key points. Each human pose may be represented by P_i = {J_{i,1}, ..., J_{i,k}}, where i indexes the video frame among the sample historical video frames and the sample video frame, k is the maximum number of human joints in a single human pose, and J_{i,j} denotes the j-th joint point of the i-th human pose. Each joint point may be represented by coordinates (x_{i,j}, y_{i,j}).

In the fourth step, the human pose trajectory sequence of the first 8 sample video frames is used, i.e., S_i = {P_1, ..., P_t}, where t is the number of poses included in the trajectory sequence; in this embodiment, t equals 8. Where the picture of the sample video frame includes l persons, it may be represented by F_i = {S_1, ..., S_l}.

The human poses in the historical sample video frames and in the sample video frame are sorted according to their temporal relationship in the sample video stream, to obtain the sample human pose sequence S_i = {P_1, ..., P_t} representing the sample video frame.
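By way of illustration, the sliding-window construction of the pose trajectory sequences may be sketched as follows (single-person case, window of 8 frames as in this embodiment):

```python
import numpy as np

def pose_trajectories(pose_per_frame, window=8):
    """Build per-window pose trajectory sequences S_i = [P_1, ..., P_t].

    pose_per_frame: (M, K, 2) array, one pose per frame of the sample video
    stream (single-person case, as in this example). The window slides over
    the stream so that each sample video frame is paired with its 7 preceding
    frames, giving t = 8 poses per trajectory.
    """
    M = pose_per_frame.shape[0]
    return [pose_per_frame[i - window + 1 : i + 1]   # ordered oldest to newest
            for i in range(window - 1, M)]
```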
In the fifth step, the minimum rectangular box of each sample human pose in the pose sequence is determined, and the center point of that minimum rectangular box is taken as the center point of the sample human pose, i.e., (x_i, y_i). Based on the size of the minimum rectangular box of each sample human pose, for example its width and height (w_i, h_i), the sample human poses in each sample human pose sequence are normalized, i.e., every joint point of every sample human pose is normalized, to obtain the normalized representation sequence of the sample human poses, i.e.,

    Ŝ_i = {P̂_1, ..., P̂_t}

and, correspondingly, the normalized coordinate parameters of each joint point in each sample human pose, i.e.,

    Ĵ_{i,j} = ((x_{i,j} − x_i) / w_i, (y_{i,j} − y_i) / h_i)
In the sixth step, the normalized representation sequence Ŝ_i = {P̂_1, ..., P̂_t} of the sample human poses obtained in the fifth step, the center point (x_i, y_i) of each sample human pose, and the width and height (w_i, h_i) of the minimum rectangular box of each sample human pose are input into the ME, to obtain the corresponding sample target pose sequence of the sample video frame. This may be implemented through the following process:
First, based on the temporal relationship of the historical sample video frames and the sample video frame in the sample video stream, the sample normalized displacement corresponding to the center points of the sample human poses in every two adjacent video frames is calculated, which may be implemented through the following formulas (1) and (2):

    υ_i = √((x_{i+1} − x_i)² + (y_{i+1} − y_i)²)        (1)

    υ̂_i = υ_i / (w_i + h_i)        (2)
Here, υ_i represents the displacement of the sample human-body center point between two adjacent sample video frames, i.e., the average velocity from sample human pose P_i to sample human pose P_{i+1}; υ_i may thus represent the displacement corresponding to the i-th video frame among the sample historical video frames and the sample video frame. Based on the displacement of the sample human-body center between every two adjacent sample video frames and the size of the minimum rectangular box corresponding to each sample human pose, the sample normalized displacement υ̂_i corresponding to the i-th video frame is determined, i.e., the normalized displacement υ̂_i corresponding to sample human pose P_i (sample normalized human pose P̂_i).
Next, a preset fitting function is used to fit the above υ̂_i, yielding a predefined distribution that matches this discretized data set {υ̂_i}, from which a continuous displacement distribution function is obtained. During training, the Rayleigh distribution was found to give the best-performing displacement distribution function for the above normalized displacements υ̂_i. Here, in order to obtain a multi-level information representation containing both temporal and spatial information, the normalized poses P̂_i (representing spatial information) are combined with the motion prior (reflected mainly in time): each normalized displacement υ̂_i is input into the corresponding displacement distribution function to obtain the scaling factor matching that normalized displacement, which may be expressed as a probability parameter, as shown in formula (3):

    s_i = ρ(υ̂_i)        (3)
Here, ρ is the predefined distribution function matching the discretized data set {υ̂_i}, i.e., the continuous displacement distribution function, and s_i is the scaling factor corresponding to υ̂_i.
Finally, formula (4) is used to compute, for each normalized pose P̂_i, the corresponding pose feature after the scaling operation, i.e., the motion-embedded pose fusing the spatial and temporal information of the i-th pose. To avoid numerical errors, the scaling factor is used as the denominator; in this way, poses corresponding to lower-frequency motions obtain a larger pose size.

    P_i = P̂_i / s_i        (4)
Here, $P_i=[J_{i,1},\dots,J_{i,N}]$, and the target sample pose sequence of the sample video frames can be represented by $[P_1,\dots,P_t]$. As shown in Figure 4, which is a schematic diagram provided by an embodiment of the present disclosure of converting pose trajectories into motion features in the probability domain based on the motion embedder ME: 401 and 402 are different sample normalized human-body pose sequences input to the ME, where 401 is a labeled normal pose trajectory and 402 is a labeled abnormal pose trajectory. 405 is the determined continuous distribution function corresponding to the motion probability of the sample human body. Mapping 401 and 402 onto the continuous distribution function 405 yields the corresponding scaling factors; 401 and 402 are then multiplied by the reciprocals of their respective scaling factors to obtain the target sample pose sequence 403 corresponding to 401 and the target sample pose sequence 404 corresponding to 402. Because 402 represents an abnormal pose trajectory, its scaling factor, i.e., the probability value mapped from the related motion information, is smaller, so the poses in the corresponding target sample pose sequence are larger.
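A sketch of the scaling operation of formula (4) under the same assumptions (the scaling factor enters as a denominator, so low-probability motions yield larger-magnitude poses); the epsilon guard and function name are assumptions for illustration:

```python
import numpy as np

def embed_motion(poses_norm, s, eps=1e-6):
    """poses_norm: (T, N, 2) normalized poses; s: (T,) scaling factors from formula (3).
    Returns the motion-embedded poses of formula (4); eps guards against division
    by near-zero densities (a numerical-safety assumption)."""
    return poses_norm / (s[:, None, None] + eps)
```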
In the seventh step, in order to better learn the regularity of human-body pose trajectories, an embodiment of the present disclosure uses a spatio-temporal transformer (STT) to process the parameters carrying temporal and spatial information obtained by the motion embedder in the sixth step, since transformers have well-recognized advantages in modeling sequence data. However, the computational complexity of a conventional spatio-temporal transformer model is O((N×T)^2), where N is the number of joints in a single pose and T is the number of poses in a single pose trajectory; that is, the cost grows quadratically in N×T. For this reason, the spatio-temporal transformer can be divided, based on the attention mechanism, into two parts, a spatial part and a temporal part, yielding a model with computational complexity O(N^2+T^2). Here, this model is referred to as the STT, which comprises a spatial transformer with L_s network layers and a temporal transformer with L_t network layers. To fully exploit the potential of the STT, L_s and L_t can be treated as hyperparameters whose specific values are determined through training.
First, pose mask processing is performed. This step is carried out during the training of the object recognition network and may be omitted during application; its purpose is to reduce the amount of data computation while improving the robustness of the model. For the masked pose embedding: before input to the spatio-temporal transformer, each joint point $J_{i,j}$ of each sample normalized pose is first mapped to an embedding to obtain the joint vector $z_{i,j}$, where $z_{i,j}\in\mathbb{R}^{C}$ and $C$ is the embedding dimension, as shown in the following formula (5):
$z_{i,j} = E\,\mathrm{mask}(J_{i,j}) + e^{s}_{j}$    Formula (5);
where $\mathrm{mask}(\cdot)$ is a mask function applied to $J_{i,j}$ with a preset probability, and $E\in\mathbb{R}^{C\times 2}$ is a training parameter. Meanwhile, $e^{s}_{j}$ represents the spatial feature vector corresponding to the attribute of the j-th joint point. The feature vector of the i-th pose, i.e., the vector corresponding to $P_i=[J_{i,1},\dots,J_{i,N}]$, is then obtained as $Z_i=[z_{i,0},\dots,z_{i,N}]$; the feature vectors corresponding to the entire target sample pose sequence $[P_1,\dots,P_t]$ can be represented by $Z=[Z_1,\dots,Z_T]$.
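As a minimal PyTorch sketch of this masked joint embedding (formula (5)); the class name, mask probability default, and the implementation of the learned per-joint spatial embedding are assumptions consistent with, but not dictated by, the description:

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Maps each 2-D joint J_{i,j} to a C-dimensional vector z_{i,j} as in formula (5)."""
    def __init__(self, num_joints, dim_c, mask_prob=0.1):
        super().__init__()
        self.E = nn.Linear(2, dim_c, bias=False)  # training parameter E in R^{C x 2}
        self.spatial_emb = nn.Parameter(torch.zeros(num_joints, dim_c))  # e^s_j per joint
        self.mask_prob = mask_prob                # preset masking probability

    def forward(self, joints):                    # joints: (T, N, 2)
        if self.training:                         # mask(.) applies only during training
            keep = (torch.rand(joints.shape[:2]) > self.mask_prob).float()
            joints = joints * keep.unsqueeze(-1)
        return self.E(joints) + self.spatial_emb  # Z with shape (T, N, C)
```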
Next, in the spatial domain, i.e., in the spatial transformer with its L_s network layers, and with the pose trajectory represented as $Z\in\mathbb{R}^{T\times N\times C}$, feature-dimension adjustment, i.e., feature encoding, is performed on the determined feature vectors based on the parameters of the attention mechanism. The input trajectory of the l-th layer is denoted $Z^{l}$, with $l\in[1,L_s]$. The multi-layer operation of the spatial domain with L_s layers can be carried out as shown in formulas (6) to (8):
$Q = Z^{l}_{ln}W_{Q},\quad K = Z^{l}_{ln}W_{K},\quad V = Z^{l}_{ln}W_{V}$    Formula (6);

$\mathrm{Att}(Q,K,V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{C}\right)V$    Formula (7);

$Z^{l+1} = Z^{l} + \mathrm{fc}\big(\mathrm{Att}(Q,K,V)\big)$    Formula (8);
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $W_Q$, $W_K$, $W_V$ all belong to $\mathbb{R}^{C\times C}$. Here, the subscript $ln$ denotes the tensor after normalization in the spatial domain of the L_s layers, and softmax and fc denote the softmax operation and the fully connected layer, respectively. The spatio-temporal transformer performs the attention operation with the parameters of a multi-layer, multi-head attention mechanism, so that better-performing feature parameters carrying spatial information can be obtained. Here, the output of the dimension adjustment of one layer is the input of the dimension adjustment of the next layer.
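A sketch of one such spatial layer in PyTorch; the single attention head, pre-normalization placement, and residual connection are assumptions consistent with formulas (6) to (8) as reconstructed above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionLayer(nn.Module):
    """One of the L_s spatial layers: attention over the N joints of each pose."""
    def __init__(self, dim_c):
        super().__init__()
        self.ln = nn.LayerNorm(dim_c)                   # the 'ln' normalization
        self.W_q = nn.Linear(dim_c, dim_c, bias=False)
        self.W_k = nn.Linear(dim_c, dim_c, bias=False)
        self.W_v = nn.Linear(dim_c, dim_c, bias=False)
        self.fc = nn.Linear(dim_c, dim_c)               # the fully connected layer

    def forward(self, z):                               # z: (T, N, C)
        z_ln = self.ln(z)
        q, k, v = self.W_q(z_ln), self.W_k(z_ln), self.W_v(z_ln)                 # formula (6)
        att = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)    # formula (7)
        return z + self.fc(att @ v)                     # formula (8); feeds the next layer
```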
Finally, in the temporal domain, i.e., in the temporal transformer with its L_t network layers, the spatial feature vector $z^{L_s}_{i,j}$ corresponding to each joint point of each sample pose output by the L_s-th layer of the spatial transformer is adjusted with the attention-mechanism parameters along the temporal dimension, as shown in formula (9):
$z^{0,t}_{i,j} = z^{L_s}_{i,j} + e^{t}_{j}$    Formula (9);
where $e^{t}_{j}$ represents the temporal feature vector corresponding to the attribute of the j-th joint point.
Meanwhile, the corresponding operations in the temporal transformer follow formulas (6) to (8) and are not elaborated again here. The L_t-th layer of the temporal transformer outputs the final dynamic feature characterizing the sample video frames, i.e., $Z^{\circ}$.
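A sketch of the temporal stage under the same assumptions; reusing the SpatialAttentionLayer sketched above along the time axis, and indexing the temporal embedding of formula (9) per frame, are both assumptions for illustration:

```python
import torch
import torch.nn as nn

class TemporalStage(nn.Module):
    """L_t temporal layers: the tensor from the spatial stage is transposed from
    (T, N, C) to (N, T, C) so attention runs over the T poses of each joint's
    trajectory; the temporal embedding of formula (9) is added first."""
    def __init__(self, dim_c, num_frames, num_layers):
        super().__init__()
        self.temporal_emb = nn.Parameter(torch.zeros(num_frames, dim_c))  # e^t per frame
        self.layers = nn.ModuleList(SpatialAttentionLayer(dim_c) for _ in range(num_layers))

    def forward(self, z):                          # z: (T, N, C) output of the spatial stage
        z = z + self.temporal_emb[:, None, :]      # formula (9), frame-wise (assumption)
        z = z.transpose(0, 1)                      # (N, T, C): attend along the time axis
        for layer in self.layers:
            z = layer(z)
        return z.transpose(0, 1)                   # Z° with shape (T, N, C)
```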
In the eighth step, the training process is implemented by a commonly used reconstruction method: $[P_1,\dots,P_t]$ (where $[P_1,\dots,P_t]$ corresponds to the sample normalized pose sequence $[\hat{P}_1,\dots,\hat{P}_t]$) is taken as input, the corresponding $Z^{\circ}$ is obtained, and pose reconstruction is performed on this $Z^{\circ}$ to obtain the reconstructed sequence $[\check{P}_1,\dots,\check{P}_t]$, as shown in formula (10):
$[\check{P}_1,\dots,\check{P}_t] = \mathrm{fc}(Z^{\circ})$    Formula (10);
Then, the similarity loss between $[\hat{P}_1,\dots,\hat{P}_t]$ and $[\check{P}_1,\dots,\check{P}_t]$ at each joint point of each sample pose is computed, as shown in formula (11):
$\mathrm{Loss} = \sum_{i=1}^{T}\sum_{j=1}^{N}\omega_{i,j}\,\big\|J_{i,j}-\check{J}_{i,j}\big\|_{2}$    Formula (11);
where $\omega_{i,j}$ is the confidence of each pose joint, and $\check{J}_{i,j}$ is the coordinate information of the joint points of each pose in the reconstructed sequence $[\check{P}_1,\dots,\check{P}_t]$.
Finally, the object recognition network is trained based on the loss obtained in the eighth step, so that the reconstruction loss output by the adjusted object recognition network satisfies the convergence condition.
Here, after the object recognition network has been trained and the trained object recognition network is obtained, a test video stream is input to the trained object recognition network to obtain the corresponding recognition result.
After the video stream to be tested is input to the trained object recognition network, the dynamic trajectory features of the m objects included in the video frames of the video stream are obtained. The behavior corresponding to each dynamic trajectory feature is given an anomaly score $A_{m,n}$ according to formula (12), where m and n denote the m-th trajectory and its pose feature at the n-th frame, respectively.
The dynamic feature trajectory corresponding to the highest anomaly score is then selected, and based on it, it is determined whether a video frame in the video stream is an abnormal video frame; that is, the highest $A_m$ is selected through formula (13):

$A_m = \max_{n}\big(A_{m,n}\big)$    Formula (13);
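A short sketch of this aggregation; the matrix layout of the scores and the use of the highest-scoring trajectory for the frame-level decision are assumptions for illustration:

```python
import numpy as np

def frame_anomaly_scores(scores):
    """scores: (M, N) matrix of anomaly scores A_{m,n} for M trajectories over N frames.
    Formula (13): A_m = max_n A_{m,n}; the frame-level decision then uses the
    trajectory with the highest score (thresholding that value is an assumption)."""
    a_m = scores.max(axis=1)      # per-trajectory maximum over frames
    return a_m, a_m.argmax()      # scores A_m and the index of the top trajectory
```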
Based on the above training method for the object recognition network, Figure 5 is a schematic framework diagram of a training method for an object recognition network provided by an embodiment of the present disclosure. In Figure 5, 501 is the determined sample pose information (comprising multiple frames of pose information) of the sample object in any sample video frame of the sample video stream (the number of sample objects is taken to be 1 here for illustration); it is input to the ME of the MoPRL, i.e., 502, for probability mapping, yielding the sample pose sequence of the sample object in the sample video frames, i.e., 503. Next, the sample pose sequence obtained at 503 is input to the STT in the MoPRL, i.e., 504, for feature conversion, so as to model the regularity in the sample pose sequence and obtain the sample pose feature trajectory. Specifically: in the first step, pose masking and pose embedding are performed to obtain the feature vector corresponding to the joint points of each pose; in the second step, the feature vectors obtained in the first step are input in turn to the spatial transformer (the spatial domain with L_s network layers) and the temporal transformer (the temporal domain with L_t network layers) for feature encoding, yielding the corresponding sample pose feature trajectory; in the third step, pose reconstruction is performed on this sample pose feature trajectory to obtain the reconstructed pose sequence, i.e., 505. Finally, the reconstruction loss measuring the similarity between the reconstructed pose sequence and the sample pose information, i.e., 506, is determined, and the network parameters of the object recognition network to be trained are adjusted based on this reconstruction loss, so that the reconstruction loss output by the adjusted object recognition network satisfies the convergence condition.
Meanwhile, Figure 6 shows a schematic structural diagram of a spatio-temporal transformer provided by an embodiment of the present disclosure, in which 601 is the spatial transformer, 602 is the temporal transformer, and 603 is the finally output pose feature trajectory; the pose feature trajectory in 603 is characterized by the features corresponding to the multiple joint points in each human-body pose (here, the recognized object is taken to be a person for illustration). In 601, in the spatial dimension, feature encoding and dimension adjustment are performed on the multiple joint points of each human-body pose input to the spatial transformer, based on the attention parameters associated with the spatial dimension, to obtain the spatial feature sequence to be input to the temporal transformer. In 602, in the temporal dimension, feature encoding and dimension adjustment are performed on the spatial features corresponding to the multiple joint points of each human-body pose input to the temporal transformer, based on the attention parameters associated with the temporal dimension, to obtain the final dynamic feature parameters. The spatial transformer has L_s network layers, and the temporal transformer has L_t network layers.
Correspondingly, Figure 7 is a schematic diagram of the internal processing flow of a spatio-temporal transformer provided by an embodiment of the present disclosure, in which 704 is the trajectory model parameters in the temporal or spatial dimension, i.e., $Z^{(l-1)}$ of shape (T×N×C) in the spatial dimension and $Z^{(l-1)}$ of shape (N×T×C) in the temporal dimension; 703 is the attention mechanism, with which feature encoding and feature-dimension adjustment can be performed in the temporal and spatial dimensions respectively, based on the corresponding attention parameters and trajectory model parameters; at the same time, the multi-layer perceptron of 702 maps the multiple input feature sets to a single output feature, finally yielding the trajectory feature parameters corresponding to 701.
Based on the object recognition network and its training method provided by the embodiments of the present disclosure, a more intuitive representation of the pose motion of the object included in each video frame of the video stream can be obtained, and this pose motion is embedded into the corresponding video frame through the ME to obtain the pose trajectory sequence of the object in each video frame; the pose trajectory sequence is then input to the transformer with separated temporal and spatial parts for pose-regularity learning, yielding dynamic features that characterize the motion information of the object in each video frame. Determining the pose trajectory sequence is mainly realized through the following steps:
First, the normalized displacements between adjacent poses in the multiple video frames (adjacent historical video frames) associated with each video frame are determined. Second, the obtained normalized displacements are collected into an explicit discrete distribution describing them. Then, the normalized displacements are fitted with a Rayleigh distribution or a Gaussian distribution to determine a continuous distribution function matching them. Finally, the normalized poses and their motion probabilities (each motion probability being obtained by inputting the corresponding normalized displacement into the determined continuous distribution function to obtain the corresponding scaling probability) are used to obtain motion-embedded poses carrying spatial and temporal information.
On this basis, the embodiments of the present disclosure can represent, in the probability domain, the intuitive motion of the objects included in the video frames of a video stream, while using a spatio-temporal transformer with separated temporal and spatial parts to learn the regularity of pose trajectories. This can improve the accuracy of the determined motion information of the objects in the video frames.
An embodiment of the present disclosure provides an object recognition apparatus. Figure 8A is a schematic diagram of the structural composition of an object recognition apparatus provided by an embodiment of the present disclosure. As shown in Figure 8A, the object recognition apparatus 800 includes:

a first acquisition part 801, configured to acquire a video frame to be recognized whose picture includes a target object, the video frame to be recognized being any video frame in the video stream of the target object;

a first determination part 802, configured to determine an initial pose sequence of the target object based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream;

a first mapping part 803, configured to perform probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized;

a first conversion part 804, configured to perform feature conversion on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in the video frame to be recognized;

a second determination part 805, configured to determine, based on the pose feature trajectory, the behavior state of the target object in the video frame to be recognized.

In some embodiments, the first determination part 802 is further configured to: perform key-point recognition on the video frame to be recognized and on the historical video frames, respectively, to obtain the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frames; determine the pose information in the video frame to be recognized and the pose information in the historical video frames based on those key points, respectively; and sort the pose information in the historical video frames and the pose information in the video frame to be recognized according to the temporal relationship between the historical video frames and the video frame to be recognized, to obtain the initial pose sequence.

In some embodiments, the first mapping part 803 includes: a determination sub-part, configured to obtain, based on the position information of the key points in each initial pose, a center-point sequence for determining the displacement between adjacent initial poses, as well as a normalized pose sequence of the initial pose sequence; and a mapping sub-part, configured to perform probability mapping on the normalized pose sequence based on the center-point sequence to obtain the target pose sequence.

In some embodiments, the determination sub-part is further configured to: determine the bounding box of each initial pose based on the position information of the key points in that initial pose; sort, within the initial pose sequence, the center points of the bounding boxes of the initial poses to obtain the center-point sequence; and normalize each initial pose using its bounding box to obtain the normalized pose sequence.

In some embodiments, the mapping sub-part is further configured to: obtain a displacement sequence in the center-point sequence based on the difference between the position information of every two adjacent center points; normalize each displacement in the displacement sequence based on the size information of the bounding box of each initial pose to obtain a normalized displacement sequence; and perform probability mapping on the normalized pose sequence based on the normalized displacement sequence to obtain the target pose sequence.

In some embodiments, the mapping sub-part is further configured to: fit each normalized displacement in the normalized displacement sequence to obtain a fitting result; determine a continuous distribution function satisfied by the fitting result; input each normalized displacement into the continuous distribution function to obtain the scaling probability of that normalized displacement; and map each normalized pose based on the scaling probability of each normalized displacement to obtain the target pose sequence.

In some embodiments, the first conversion part 804 is further configured to: perform, in the target pose sequence, feature conversion on each target pose based on the key points of that target pose to obtain a feature sequence to be adjusted; and perform feature-dimension adjustment on each feature to be adjusted in space and time to obtain the pose feature trajectory.

In some embodiments, the first conversion part 804 is further configured to: fuse each feature to be adjusted with the preset spatial feature of that feature to obtain a spatial feature sequence; perform multi-layer dimension adjustment on the spatial feature sequence based on the attention parameters of the spatial dimension in the attention mechanism to obtain a spatial pose feature sequence; fuse each spatial pose feature with the preset temporal feature of that spatial pose feature to obtain a temporal feature sequence; and perform multi-layer dimension adjustment on the temporal feature sequence based on the attention parameters of the temporal dimension in the attention mechanism to obtain the pose feature trajectory; wherein the output of the dimension adjustment of one layer is the input of the dimension adjustment of the next layer.

In some embodiments, the object recognition apparatus 800 further includes: an abnormality recognition part, configured to acquire the scene information corresponding to the video frame to be recognized; determine a preset behavior rule associated with the scene information; and determine, using the preset behavior rule, whether the video frame to be recognized to which the behavior state belongs is an abnormal video frame.

In some embodiments, the target object includes at least two objects to be recognized, and the abnormality recognition part is further configured to: recognize, using the preset behavior rule, the behavior state of each of the at least two objects to be recognized to obtain an intermediate recognition result set; sort the confidence of each intermediate recognition result according to the reasonableness of the behavior state to obtain a result scoring sequence; and determine, based on the intermediate recognition result corresponding to a preset position in the result scoring sequence, whether the video frame to be recognized is the abnormal video frame.
An embodiment of the present disclosure further provides a training apparatus for an object recognition network. Figure 8B is a schematic diagram of the structural composition of a training apparatus for an object recognition network provided by an embodiment of the present disclosure. As shown in Figure 8B, the training apparatus 810 for the object recognition network includes:

a second acquisition part 811, configured to acquire sample video frames including a sample object, wherein each sample video frame is any video frame in the sample video stream of the sample object;

a third determination part 812, configured to determine a sample normalized pose sequence of the sample object in the sample video frames;

a second mapping part 813, configured to perform probability mapping on the sample normalized pose sequence using the object recognition network to be trained, to obtain a sample pose sequence;

a second conversion part 814, configured to perform feature conversion on the sample pose sequence in space and time to obtain a sample pose feature trajectory of the sample object in the sample video frames;

a reconstruction part 815, configured to perform pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence;

a fourth determination part 816, configured to determine a reconstruction loss measuring the similarity between the reconstructed pose sequence and the sample normalized pose sequence;

an adjustment part 817, configured to adjust, based on the reconstruction loss, the network parameters of the object recognition network to be trained, so that the reconstruction loss output by the adjusted object recognition network satisfies a convergence condition.
It should be noted that the description of the above apparatus embodiments is similar to that of the above method embodiments, with beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.

It should be noted that, in the embodiments of the present disclosure, if the above object recognition method or the training method for an object recognition network is implemented in the form of software functional parts and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence, or the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present disclosure are not limited to any specific combination of hardware and software.

Correspondingly, an embodiment of the present disclosure further provides a computer program product; the computer program product includes computer-executable instructions which, when executed, can implement the object recognition method or the training method for an object recognition network provided by the embodiments of the present disclosure.

Correspondingly, an embodiment of the present disclosure provides a computer device. Figure 9 is a schematic diagram of the composition and structure of a computer device according to an embodiment of the present disclosure. As shown in Figure 9, the computer device 900 includes a processor 901, at least one communication bus 904, a communication interface 902, at least one external communication interface, and a memory 903. The communication interface 902 is configured to realize connection and communication between these components; the communication interface 902 may include a display screen, and the external communication interface may include standard wired and wireless interfaces. The processor 901 is configured to execute an information processing program in the memory, so as to implement the object recognition method or the training method for an object recognition network provided by the above embodiments.

Correspondingly, an embodiment of the present disclosure further provides a computer storage medium storing computer-executable instructions which, when executed by a processor, implement the object recognition method or the training method for an object recognition network provided by the above embodiments.

The computer storage medium may be a volatile or a non-volatile storage medium.

An embodiment of the present disclosure further provides a computer program comprising computer-readable code which, when run in an electronic device, causes a processor of the electronic device to execute the above object recognition method or the above training method for an object recognition network.

The above descriptions of the object recognition apparatus, the training apparatus for the object recognition network, the computer device, the storage medium, and the program embodiments are similar to the descriptions of the above method embodiments, and have technical descriptions and beneficial effects similar to those of the corresponding method embodiments; for reasons of space, reference may be made to the records of the above method embodiments. For technical details not disclosed in the object recognition apparatus, the training apparatus for the object recognition network, the computer device, the storage medium, and the program embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.

It should be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present disclosure; thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The serial numbers of the above embodiments of the present disclosure are for description only and do not represent the superiority or inferiority of the embodiments. It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other division schemes in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or of other forms.

The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present disclosure may all be integrated into one processing unit, or each unit may serve separately as a single unit, or two or more units may be integrated into one unit; the above integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional units. Those of ordinary skill in the art will understand that all or part of the steps for realizing the above method embodiments may be completed by hardware related to program instructions; the aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.

Alternatively, if the above integrated units of the embodiments of the present disclosure are implemented in the form of software functional parts and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence, or the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc. The above is only the specific implementation of the embodiments of the present disclosure, but the protection scope of the embodiments of the present disclosure is not limited thereto; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the embodiments of the present disclosure, and these should all be covered within the protection scope of the embodiments of the present disclosure. Therefore, the protection scope of the embodiments of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
The embodiments of the present disclosure disclose an object recognition method, a network training method, an apparatus, a device, a medium, and a program. The object recognition method includes: acquiring a video frame to be recognized whose picture includes a target object, the video frame to be recognized being any video frame in the video stream of the target object; determining an initial pose sequence of the target object based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream; performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized; performing feature conversion on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in the video frame to be recognized; and determining, based on the pose feature trajectory, the behavior state of the target object in the video frame to be recognized.

Claims (16)

  1. An object recognition method, the method comprising:
    acquiring a video frame to be recognized whose picture includes a target object, the video frame to be recognized being any video frame in a video stream of the target object;
    determining an initial pose sequence of the target object based on the video frame to be recognized and historical video frames of the video frame to be recognized in the video stream;
    performing probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized;
    performing feature conversion on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in the video frame to be recognized;
    determining, based on the pose feature trajectory, a behavior state of the target object in the video frame to be recognized.
  2. The method according to claim 1, wherein determining the initial pose sequence of the target object based on the video frame to be recognized and the historical video frames of the video frame to be recognized in the video stream comprises:
    performing key-point recognition on the video frame to be recognized and on the historical video frames, respectively, to obtain key points of the target object in the video frame to be recognized and key points of the target object in the historical video frames;
    determining pose information in the video frame to be recognized and pose information in the historical video frames based on the key points of the target object in the video frame to be recognized and the key points of the target object in the historical video frames, respectively;
    sorting the pose information in the historical video frames and the pose information in the video frame to be recognized according to a temporal relationship between the historical video frames and the video frame to be recognized, to obtain the initial pose sequence.
  3. The method according to claim 1 or 2, wherein performing probability mapping on the initial pose sequence to obtain the target pose sequence of the target object in the video frame to be recognized comprises:
    obtaining, based on position information of key points in each initial pose, a center-point sequence for determining displacements between adjacent initial poses, and a normalized pose sequence of the initial pose sequence;
    performing probability mapping on the normalized pose sequence based on the center-point sequence to obtain the target pose sequence.
  4. The method according to claim 3, wherein obtaining, based on the position information of the key points in each initial pose, the center-point sequence for determining the displacements between adjacent initial poses and the normalized pose sequence of the initial pose sequence comprises:
    determining a bounding box of each initial pose based on the position information of the key points in that initial pose;
    sorting, in the initial pose sequence, center points of the bounding boxes of the initial poses to obtain the center-point sequence;
    normalizing each initial pose using its bounding box to obtain the normalized pose sequence.
  5. The method according to claim 3 or 4, wherein performing probability mapping on the normalized pose sequence based on the center-point sequence to obtain the target pose sequence comprises:
    obtaining, in the center-point sequence, a displacement sequence based on a difference between position information of every two adjacent center points;
    normalizing each displacement in the displacement sequence based on size information of the bounding box of each initial pose to obtain a normalized displacement sequence;
    performing probability mapping on the normalized pose sequence based on the normalized displacement sequence to obtain the target pose sequence.
  6. The method according to claim 5, wherein performing probability mapping on the normalized pose sequence based on the normalized displacement sequence to obtain the target pose sequence comprises:
    fitting each normalized displacement in the normalized displacement sequence to obtain a fitting result;
    determining a continuous distribution function satisfied by the fitting result;
    inputting each normalized displacement into the continuous distribution function to obtain a scaling probability of that normalized displacement;
    mapping each normalized pose based on the scaling probability of each normalized displacement to obtain the target pose sequence.
  7. The method according to any one of claims 1 to 6, wherein performing feature conversion on the target pose sequence in space and time to obtain the pose feature trajectory of the target object in the video frame to be recognized comprises:
    performing, in the target pose sequence, feature conversion on each target pose based on key points of that target pose to obtain a feature sequence to be adjusted;
    performing feature-dimension adjustment on each feature to be adjusted in space and time to obtain the pose feature trajectory.
  8. The method according to claim 7, wherein performing feature-dimension adjustment on each feature to be adjusted in space and time to obtain the pose feature trajectory comprises:
    fusing each feature to be adjusted with a preset spatial feature of that feature to obtain a spatial feature sequence;
    performing multi-layer dimension adjustment on the spatial feature sequence based on attention parameters of a spatial dimension in an attention mechanism to obtain a spatial pose feature sequence;
    fusing each spatial pose feature with a preset temporal feature of that spatial pose feature to obtain a temporal feature sequence;
    performing multi-layer dimension adjustment on the temporal feature sequence based on attention parameters of a temporal dimension in the attention mechanism to obtain the pose feature trajectory;
    wherein an output of the dimension adjustment of one layer is an input of the dimension adjustment of the next layer.
  9. The method according to any one of claims 1 to 8, wherein, after determining, based on the pose feature trajectory, the behavior state of the target object in the video frame to be recognized, the method further comprises:
    acquiring scene information corresponding to the video frame to be recognized;
    determining a preset behavior rule associated with the scene information;
    determining, using the preset behavior rule, whether the video frame to be recognized to which the behavior state belongs is an abnormal video frame.
  10. The method according to claim 9, wherein the target object includes at least two objects to be recognized, and determining, using the preset behavior rule, whether the video frame to be recognized to which the behavior state belongs is an abnormal video frame comprises:
    recognizing, using the preset behavior rule, a behavior state of each of the at least two objects to be recognized to obtain an intermediate recognition result set;
    sorting a confidence of each intermediate recognition result to obtain a result scoring sequence;
    determining, based on an intermediate recognition result corresponding to a preset position in the result scoring sequence, whether the video frame to be recognized is the abnormal video frame.
  11. A training method for an object recognition network, the method comprising:
    acquiring sample video frames including a sample object, wherein each sample video frame is any video frame in a sample video stream of the sample object;
    determining a sample normalized pose sequence of the sample object in the sample video frames;
    performing probability mapping on the sample normalized pose sequence using an object recognition network to be trained, to obtain a sample pose sequence;
    performing feature conversion on the sample pose sequence in space and time to obtain a sample pose feature trajectory of the sample object in the sample video frames;
    performing pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence;
    determining a reconstruction loss measuring the similarity between the reconstructed pose sequence and the sample normalized pose sequence;
    adjusting, based on the reconstruction loss, network parameters of the object recognition network to be trained, so that the reconstruction loss output by the adjusted object recognition network satisfies a convergence condition.
  12. An object recognition apparatus, the apparatus comprising:
    a first acquisition part, configured to acquire a video frame to be recognized whose picture includes a target object, the video frame to be recognized being any video frame in a video stream of the target object;
    a first determination part, configured to determine an initial pose sequence of the target object based on the video frame to be recognized and historical video frames of the video frame to be recognized in the video stream;
    a first mapping part, configured to perform probability mapping on the initial pose sequence to obtain a target pose sequence of the target object in the video frame to be recognized;
    a first conversion part, configured to perform feature conversion on the target pose sequence in space and time to obtain a pose feature trajectory of the target object in the video frame to be recognized;
    a second determination part, configured to determine, based on the pose feature trajectory, a behavior state of the target object in the video frame to be recognized.
  13. A training apparatus for an object recognition network, the apparatus comprising:
    a second acquisition part, configured to acquire sample video frames including a sample object, wherein each sample video frame is any video frame in a sample video stream of the sample object;
    a third determination part, configured to determine a sample normalized pose sequence of the sample object in the sample video frames;
    a second mapping part, configured to perform probability mapping on the sample normalized pose sequence using an object recognition network to be trained, to obtain a sample pose sequence;
    a second conversion part, configured to perform feature conversion on the sample pose sequence in space and time to obtain a sample pose feature trajectory of the sample object in the sample video frames;
    a reconstruction part, configured to perform pose reconstruction on the sample pose feature trajectory to obtain a reconstructed pose sequence;
    a fourth determination part, configured to determine a reconstruction loss measuring the similarity between the reconstructed pose sequence and the sample normalized pose sequence;
    an adjustment part, configured to adjust, based on the reconstruction loss, network parameters of the object recognition network to be trained, so that the reconstruction loss output by the adjusted object recognition network satisfies a convergence condition.
  14. A computer device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor, when running the computer-executable instructions on the memory, can implement the object recognition method according to any one of claims 1 to 10, or the processor, when running the computer-executable instructions on the memory, can implement the training method for an object recognition network according to claim 11.
  15. A computer storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions, when executed, can implement the object recognition method according to any one of claims 1 to 10, or the computer-executable instructions, when executed, can implement the training method for an object recognition network according to claim 11.
  16. A computer program comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor of the electronic device executes the object recognition method according to any one of claims 1 to 10, or the training method for an object recognition network according to claim 11.
PCT/CN2022/129057 2022-01-24 2022-11-01 Object recognition method, network training method and apparatus, device, medium, and program WO2023138154A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210082276.3A CN114494962A (en) 2022-01-24 2022-01-24 Object identification method, network training method, device, equipment and medium
CN202210082276.3 2022-01-24

Publications (1)

Publication Number Publication Date
WO2023138154A1 true WO2023138154A1 (en) 2023-07-27

Family

ID=81475503

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/129057 WO2023138154A1 (en) 2022-01-24 2022-11-01 Object recognition method, network training method and apparatus, device, medium, and program

Country Status (2)

Country Link
CN (1) CN114494962A (en)
WO (1) WO2023138154A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494962A (en) * 2022-01-24 2022-05-13 上海商汤智能科技有限公司 Object identification method, network training method, device, equipment and medium
CN116311519B (en) * 2023-03-17 2024-04-19 北京百度网讯科技有限公司 Action recognition method, model training method and device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170316578A1 (en) * 2016-04-29 2017-11-02 Ecole Polytechnique Federale De Lausanne (Epfl) Method, System and Device for Direct Prediction of 3D Body Poses from Motion Compensated Sequence
CN111160258A (en) * 2019-12-30 2020-05-15 联想(北京)有限公司 Identity recognition method, device, system and storage medium
CN111414840A (en) * 2020-03-17 2020-07-14 浙江大学 Gait recognition method, device, equipment and computer readable storage medium
CN111401270A (en) * 2020-03-19 2020-07-10 南京未艾信息科技有限公司 Human motion posture recognition and evaluation method and system
WO2022000420A1 (en) * 2020-07-02 2022-01-06 浙江大学 Human body action recognition method, human body action recognition system, and device
CN111950496A (en) * 2020-08-20 2020-11-17 广东工业大学 Identity recognition method for masked person
CN112528891A (en) * 2020-12-16 2021-03-19 重庆邮电大学 Bidirectional LSTM-CNN video behavior identification method based on skeleton information
CN113920585A (en) * 2021-10-22 2022-01-11 上海商汤智能科技有限公司 Behavior recognition method and device, equipment and storage medium
CN114494962A (en) * 2022-01-24 2022-05-13 上海商汤智能科技有限公司 Object identification method, network training method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QING ZHIWU; SU HAISHENG; GAN WEIHAO; WANG DONGLIANG; WU WEI; WANG XIANG; QIAO YU; YAN JUNJIE; GAO CHANGXIN; SANG NONG: "Temporal Context Aggregation Network for Temporal Action Proposal Refinement", 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 20 June 2021 (2021-06-20), pages 485 - 494, XP034009439, DOI: 10.1109/CVPR46437.2021.00055 *

Also Published As

Publication number Publication date
CN114494962A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
WO2023138154A1 (en) Object recognition method, network training method and apparatus, device, medium, and program
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
TWI742690B (en) Method and apparatus for detecting a human body, computer device, and storage medium
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
Wei et al. Boosting deep attribute learning via support vector regression for fast moving crowd counting
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN111967379B (en) Human behavior recognition method based on RGB video and skeleton sequence
US11417095B2 (en) Image recognition method and apparatus, electronic device, and readable storage medium using an update on body extraction parameter and alignment parameter
KR102462934B1 (en) Video analysis system for digital twin technology
CN111104930B (en) Video processing method, device, electronic equipment and storage medium
Jain et al. Deep neural learning techniques with long short-term memory for gesture recognition
CN109558902A (en) A kind of fast target detection method
JP6948851B2 (en) Information processing device, information processing method
Ji et al. A large-scale varying-view rgb-d action dataset for arbitrary-view human action recognition
CN111444488A (en) Identity authentication method based on dynamic gesture
CN112949622A (en) Bimodal character classification method and device fusing text and image
Xu et al. Robust hand gesture recognition based on RGB-D Data for natural human–computer interaction
Wang et al. Quantifying legibility of indoor spaces using Deep Convolutional Neural Networks: Case studies in train stations
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
Wang et al. Occluded person re-identification via defending against attacks from obstacles
CN116012626B (en) Material matching method, device, equipment and storage medium for building elevation image
CN116645501A (en) Unbiased scene graph generation method based on candidate predicate relation deviation
Zhong A convolutional neural network based online teaching method using edge-cloud computing platform
Ou et al. 3D deformable convolution temporal reasoning network for action recognition
WO2023041969A1 (en) Face-hand correlation degree detection method and apparatus, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921578

Country of ref document: EP

Kind code of ref document: A1