WO2021098616A1 - Motion gesture recognition method, motion gesture recognition device, terminal device, and medium - Google Patents

Motion gesture recognition method, motion gesture recognition device, terminal device, and medium

Info

Publication number
WO2021098616A1
WO2021098616A1 (PCT/CN2020/128854; CN2020128854W)
Authority
WO
WIPO (PCT)
Prior art keywords
moving body
posture
target
period
video
Prior art date
Application number
PCT/CN2020/128854
Other languages
English (en)
French (fr)
Inventor
乔宇
邹静
王亚立
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2021098616A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • This application belongs to the field of image recognition technology, and in particular relates to a motion gesture recognition method, a motion gesture recognition device, a terminal device, and a computer-readable storage medium.
  • For example, a smart fitness system equipped with a wearable device can collect the user's exercise data through the wearable device and recognize the user's exercise posture based on that data, so as to guide the user's fitness exercise even without a human instructor.
  • However, in existing motion gesture recognition schemes, when posture estimation is performed based on the motion data collected by the wearable device, a given set of motion data can simultaneously characterize multiple motion postures; that is, posture estimation based on motion data carries a considerable margin of error. It can be seen that existing motion gesture recognition schemes suffer from low recognition efficiency.
  • the embodiments of the present application provide a motion gesture recognition method, a motion gesture recognition device, a terminal device, and a computer-readable storage medium to solve the problem of low recognition efficiency in existing motion gesture recognition solutions.
  • The first aspect of the embodiments of the present application provides a motion gesture recognition method, including:
  • acquiring a to-be-recognized video image containing a moving body in a target period, and inputting the to-be-recognized video image into a trained dual-stream long and short-term video pose estimation model; the trained dual-stream long and short-term video pose estimation model includes a dual-stream 3D convolutional neural network and a recurrent neural network;
  • performing feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain comprehensive features of the moving body in the target period;
  • obtaining, through the recurrent neural network, target estimated posture information of the moving body in the target period based on first posture estimation information of the moving body in a first period and the comprehensive features of the moving body; the first period is the unit period immediately preceding the target period; the target estimated posture information is used to characterize the estimated posture of the moving body in the target period;
  • determining the Euclidean distance between the estimated posture and a preset reference posture based on the target estimated posture information and preset reference posture information; the Euclidean distance is used to describe the magnitude of the difference between the estimated posture and the preset reference posture.
  • Further, the acquiring of the to-be-recognized video image containing the moving body in the target period and inputting the to-be-recognized video image into the trained dual-stream long and short-term video pose estimation model includes:
  • extracting an RGB picture set and a motion optical flow picture set of the to-be-recognized video image;
  • inputting the RGB picture set and the motion optical flow picture set into the dual-stream 3D convolutional neural network of the trained dual-stream long and short-term video pose estimation model.
  • Further, the performing of feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain the comprehensive features of the moving body in the target period includes:
  • extracting, through the dual-stream 3D convolutional neural network and based on the RGB picture set and the motion optical flow picture set, appearance features and motion features of the moving body in the to-be-recognized video image within the target period;
  • splicing the appearance features with the motion features to obtain the comprehensive features of the moving body in the target period.
  • Further, the obtaining, through the recurrent neural network, of the target estimated posture information of the moving body in the target period based on the first posture estimation information of the moving body in the first period and the comprehensive features of the moving body includes:
  • if the target period has a preceding unit period, identifying that preceding unit period as the first period, inputting the first posture estimation information corresponding to the first period together with the comprehensive features of the moving body into the recurrent neural network, and performing state-vector calculation through the recurrent neural network according to the first posture estimation information and the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period;
  • if the target period has no preceding unit period, setting the first posture estimation information to empty, inputting the comprehensive features of the moving body into the recurrent neural network, and performing state-vector calculation through the recurrent neural network according to the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period.
  • Further, after the step of determining the Euclidean distance, the method further includes:
  • if the Euclidean distance is equal to or greater than a preset threshold, outputting motion posture correction information for correcting the moving body;
  • if the Euclidean distance is less than the preset threshold, determining a target image corresponding to the estimated posture from the to-be-recognized video image.
  • Further, the method further includes:
  • acquiring a sample video file containing a moving body;
  • generating a training sample set based on the sample video file;
  • training the dual-stream long and short-term video pose estimation model with the training sample set to obtain a trained dual-stream long and short-term video pose estimation model.
  • Further, the generating of a training sample set based on the sample video file includes:
  • segmenting the sample video file to obtain T video clips, where T is an integer greater than 0 and each of the video clips is correspondingly configured with three-dimensional key point information of the moving body;
  • using the T video clips and the three-dimensional key point information corresponding to each of the video clips as the training sample set.
  • a second aspect of the embodiments of the present application provides a motion gesture recognition device, including:
  • an acquisition and input unit, used to acquire the to-be-recognized video image containing a moving body in the target period and input the to-be-recognized video image into the trained dual-stream long and short-term video pose estimation model; the trained dual-stream long and short-term video pose estimation model includes a dual-stream 3D convolutional neural network and a recurrent neural network;
  • a first execution unit, configured to perform feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain the comprehensive features of the moving body in the target period;
  • a second execution unit, configured to obtain, through the recurrent neural network, the target estimated posture information of the moving body in the target period based on the first posture estimation information of the moving body in the first period and the comprehensive features of the moving body; the first period is the unit period immediately preceding the target period; the target estimated posture information is used to characterize the estimated posture of the moving body in the target period;
  • a determining unit, configured to determine the Euclidean distance between the estimated posture and the preset reference posture based on the target estimated posture information and the preset reference posture information; the Euclidean distance is used to describe the magnitude of the difference between the estimated posture and the preset reference posture.
  • The third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the terminal device; when the processor executes the computer program, the steps of the motion gesture recognition method provided in the first aspect are implemented.
  • The fourth aspect of the embodiments of the present application provides a computer-readable storage medium that stores a computer program which, when executed by a processor, implements the steps of the motion gesture recognition method provided in the first aspect.
  • the fifth aspect of the embodiments of the present application provides a computer program product that, when the computer program product runs on a terminal device, causes the terminal device to execute the steps of the motion gesture recognition method described in any one of the first aspects.
  • In the motion gesture recognition method, a trained dual-stream long and short-term video pose estimation model performs motion gesture recognition on the to-be-recognized video image containing a moving body in the target period: the dual-stream 3D convolutional neural network in the trained model extracts the comprehensive features of the moving body from the video image to be recognized, and the recurrent neural network then obtains the target estimated posture information of the moving body in the target period based on the first posture estimation information of the moving body in the first period and the comprehensive features of the moving body. Since the first period is the unit period immediately preceding the target period, recursive posture prediction is realized; the estimated posture characterized by the resulting target estimated posture information therefore has temporal continuity, which makes the determined Euclidean distance between the estimated posture and the preset reference posture more accurate and improves the efficiency of motion posture recognition for the moving body.
  • FIG. 1 is an implementation flowchart of a motion gesture recognition method provided by an embodiment of the present application
  • FIG. 2 is an implementation flowchart of a motion gesture recognition method provided by another embodiment of the present application.
  • FIG. 3 is an implementation flowchart of a motion gesture recognition method provided by still another embodiment of the present application.
  • FIG. 4 is a structural block diagram of a motion gesture recognition device provided by an embodiment of the present application.
  • FIG. 5 is a structural block diagram of a terminal device provided by another embodiment of the present application.
  • FIG. 1 is an implementation flowchart of a motion gesture recognition method provided by an embodiment of the present application.
  • the motion gesture recognition method is used to perform motion gesture recognition on the moving body in the video image
  • its execution subject is a computer terminal, for example, a computer or server used for video image collection and video image analysis.
  • the motion gesture recognition method shown in Figure 1 includes the following steps:
  • S11 Obtain the to-be-recognized video image containing a moving body in the target period, and input the to-be-recognized video image into a trained dual-stream long and short-term video pose estimation model; the trained dual-stream long and short-term video pose estimation model includes a dual-stream 3D convolutional neural network and a recurrent neural network.
  • the video image to be identified may be a real-time moving image of a moving body, or a video file obtained by performing video recording on the moving body when the moving body is in motion.
  • the target period is used to describe the content duration of the video image to be recognized.
  • It should be noted that the trained dual-stream long and short-term video pose estimation model is obtained by training the constructed dual-stream long and short-term video pose estimation model. Because the trained model is built with a dual-stream 3D convolutional neural network and a recurrent neural network, feature extraction can be performed on the video image to be recognized through the dual-stream 3D convolutional neural network, and posture estimation can be performed by the recurrent neural network based on the features extracted by the 3D convolutional neural network. Since the recurrent neural network completes each posture estimation based on the current features extracted by the 3D convolutional neural network together with the previous posture estimation result, the posture estimation results obtained through the recurrent neural network have temporal continuity; that is, such a continuous posture estimation scheme yields more accurate estimation results.
  • In practical applications, a unit time period can be chosen as the selection criterion for the target period; that is, the target period can be composed of one or more unit periods.
  • Because the video image to be recognized contains a moving body, and a single video frame cannot describe the moving body's motion trajectory or motion posture, the video image to be recognized includes multiple consecutive frames so that the motion state of the moving body can be recognized; in other words, within the content duration represented by the target period, the video image to be recognized can continuously display multiple consecutive frames.
  • Scenario 1: In the process of playing a preset video file, obtain the to-be-recognized video image containing a moving body in the target period.
  • For example, the preset video file may include a sports fitness video file. While it is being played, a video image acquisition device is called to collect video images of the moving body in the target area to obtain a video file, from which the to-be-recognized video image containing the moving body in the target period is obtained.
  • Specifically, the sports fitness video file can be played on a mobile phone terminal while the phone's camera collects video images of the moving body in the target area to obtain a video file; the to-be-recognized video image containing the moving body in the target period is then selected from the video file through a video image selection window.
  • Scenario 2: If a preset instruction for performing motion gesture recognition on the video image to be recognized is detected, the to-be-recognized video image containing a moving body in the target period is acquired.
  • For example, live video recording of a sports competition yields a live video file. When the preset instruction is detected, the to-be-recognized video image containing the athlete in the target period is selected from the video file through a video image selection window.
  • As a possible implementation of this embodiment, step S11 may specifically include: extracting the RGB picture set and the motion optical flow picture set of the video image to be recognized, and inputting both sets into the dual-stream 3D convolutional neural network of the trained dual-stream long and short-term video pose estimation model.
  • RGB picture extraction and motion optical flow picture extraction can be performed on each frame of the video image to be recognized, so as to obtain the RGB picture set and motion optical flow picture set of the entire video image to be recognized.
  • The motion optical flow picture is used to express the changes of the moving body in the image. Because an optical flow picture contains the motion information of the moving body, it can be used to determine how the moving body moves.
  • Optical flow pictures also contain the three-dimensional structure information of the moving body, so the features the dual-stream 3D convolutional neural network obtains by convolving the RGB pictures and the motion optical flow pictures are more accurate; that is, recognizing the moving body's motion posture from RGB pictures together with motion optical flow pictures is more efficient.
  • S12 Perform feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain a comprehensive feature of the moving body in the target period.
  • the dual-stream 3D convolutional neural network is a convolutional neural network in which dual image streams are input and the comprehensive features of the moving body are output.
  • The input of the dual-stream 3D convolutional neural network can be the RGB images and the motion optical flow images of the video image to be recognized; convolutional layers, pooling layers, and a feature concatenation layer are built into the dual-stream 3D convolutional neural network.
  • The convolutional and pooling layers jointly perform feature convolutions of different dimensions on the RGB images and the motion optical flow images, extracting the features of the moving body from the video image to be recognized; the feature concatenation layer then fuses the extracted features, finally yielding the comprehensive features of the moving body in the target period.
  • step S12 may specifically include:
  • the appearance and motion characteristics of the moving body in the to-be-recognized video image are extracted;
  • the appearance feature and the movement feature are feature spliced to obtain the comprehensive feature of the moving body in the target time period.
  • The dual-stream 3D convolutional neural network (CNN_a, CNN_m) is configured with multiple image-feature convolutional layers, multiple pooling layers, and at least one feature splicing layer.
  • The image-feature convolutional layers perform image feature convolution at different levels on every RGB frame in the RGB picture set and every motion optical flow frame in the optical flow picture set; after each convolution, a pooling layer selects from the resulting features before the next convolution, yielding the appearance contour features A(t) = CNN_a(V(t)) and the motion features M(t) = CNN_m(V(t)) that describe the moving body; the feature splicing layer concatenates the two to give the comprehensive features C(t) = [A(t), M(t)], where t denotes the target period.
  • S13 Obtain target estimated posture information of the moving body in the target time period based on the first posture estimation information of the moving body in the first time period and the comprehensive characteristics of the moving body through a recurrent neural network;
  • the first period is the unit period immediately preceding the target period;
  • the estimated target posture information is used to characterize the estimated posture of the moving body in the target period.
  • the first period is a unit period before the target period, and the first posture estimation information is used to describe the motion state of the moving body obtained by performing posture estimation in the first period.
  • Before the recurrent neural network outputs the estimated posture information of the moving body in the target period, the estimated posture information of the moving body in the unit period immediately preceding the target period, that is, the first posture estimation information of the moving body in the first period, must be determined and obtained. The target estimated posture information of the moving body in the target period is then determined from the first posture estimation information and the comprehensive features of the moving body in the target period. The estimated posture it characterizes is continuous with the estimated posture of the moving body in the preceding unit period, so each posture estimation for the target period builds on the estimation result of the preceding unit period, which guarantees temporal coherence between successive posture estimations.
  • As a possible implementation of this embodiment, step S13 may include:
  • if the target period has a preceding unit period, identifying that preceding unit period as the first period, inputting the first posture estimation information corresponding to the first period together with the comprehensive features of the moving body into the recurrent neural network, and performing state-vector calculation through the recurrent neural network according to the first posture estimation information and the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period;
  • if the target period has no preceding unit period, setting the first posture estimation information to empty, inputting the comprehensive features of the moving body into the recurrent neural network, and performing state-vector calculation through the recurrent neural network according to the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period.
  • It should be noted that setting the first posture estimation information to empty means that the first posture estimation information is not used as a factor in the state-vector calculation; only the comprehensive features of the moving body are considered.
  • Because the target period describes the content duration of the video image to be recognized, its selection or determination is related to the duration of the video file containing that image.
  • When the target period is the first unit period of the video file, there is no unit period before the target period, so the posture estimation information of the moving body in a preceding unit period cannot be determined; the recurrent neural network therefore computes the state vector from the comprehensive features of the moving body alone to obtain the target estimated posture information for the target period.
  • When the target period is an intermediate unit period of the video file, a preceding unit period exists, so the posture estimation information of the moving body in that preceding period, that is, the first posture estimation information, can first be determined, and the recurrent neural network computes the state vector from the first posture estimation information together with the comprehensive features to obtain the target estimated posture information.
  • S14 Determine the Euclidean distance between the estimated posture and the preset reference posture based on the estimated posture information of the target and the preset reference posture information.
  • In step S14, the target estimated posture information is used to represent the estimated posture of the moving body in the target period.
  • the preset reference pose information is used to represent the reference pose of the moving body in the preset reference video image.
  • Euclidean distance is used to describe the difference between the estimated attitude and the reference attitude.
  • the estimated posture information of the target may include a set of three-dimensional coordinate values of each key point on the moving body in the video image to be recognized.
  • the preset reference posture information may include a set of three-dimensional coordinate values of each key point on the moving body in the preset reference video image.
  • the key points of the moving body in the preset reference video image are predefined key points.
  • the moving body may include multiple moving parts, and each moving part may be composed of one or more key points.
  • the preset reference posture information includes a set of three-dimensional coordinate values of each key point on the moving body, which can be used to describe the overall moving posture of the moving body.
  • When recognizing the motion state in the video image to be recognized, the key points of the moving body in the preset reference video image and the coordinates of each key point at each moment are already determined information; that is, the motion posture of the moving body in each period of the preset reference video image is known, and this posture serves as the reference posture against which the motion posture of the moving body in the video image to be recognized is recognized and compared.
  • Taking the case where the moving body in the video image to be recognized is a human body as an example, the moving body in the preset reference video image is correspondingly also a human body, and the key points of the moving body may be the various movable joints of the human body.
  • The target estimated posture information is used to characterize the motion posture of the human body in the target period in the video image to be recognized, and may be a set of coordinate values corresponding to the different motion postures of the human body's movable joints (key points) in the target period.
  • Based on the target estimated posture information and the preset reference posture information, the Euclidean distance between the estimated posture of the human body in the video image to be recognized in the target period and the preset reference posture can be determined, that is, the magnitude of the difference between the estimated posture of the human body and the preset reference posture, thereby recognizing the motion posture of the moving body in the video image to be recognized for the target period.
  • The video image to be recognized is obtained by calling a video image acquisition device to collect video of the moving body in the target area; after the video file is obtained, the video image containing the moving body in the determined target period is retrieved from the video file.
  • The target estimated posture information is used to describe the posture of the moving body in the target period, obtained by estimating the posture of the video image to be recognized while the sports fitness video file is playing.
  • For example, in a scenario where a user follows the movements in a fitness teaching video, a terminal plays the teaching video file while the terminal's camera collects video of the user, that is, the moving body, in the target area.
  • The to-be-recognized video image containing the moving body in the target period is obtained from the captured video file.
  • Because the body movements the user makes while following the fitness teaching video correspond to the movements in the video, the two sets of movements should be synchronous, with negligible time difference; the image content of the fitness teaching video can therefore be used as the preset reference video image, and the target estimated posture information of the moving body in the target period of the to-be-recognized video image can be compared with the preset reference posture information of the preset reference video image.
  • This determines the Euclidean distance between the estimated posture and the preset reference posture, that is, the size of the gap between the body movements the user makes while following the fitness teaching video and the movements in that video.
  • Taking the preset reference video image to be an image of a foul action in a sports competition as an example, the video image to be recognized is obtained by calling a video image acquisition device to record the athletes in the target area during the competition; after the video file is obtained, the video image containing the athlete in the determined target period is taken from the video file. The target estimated posture information is used to describe the posture of the athletes captured live during the sports competition.
  • For example, multi-directional cameras configured around the competition venue capture live images of the athletes during the game to obtain a video file.
  • When the preset instruction for performing motion gesture recognition on the video image to be recognized is detected, the video image to be recognized containing the athlete in the target period is selected from the video file through a video image selection window.
  • Comparing the estimated posture information of the athlete in the target period in the video image to be recognized with the preset reference posture information of the preset reference video image determines the Euclidean distance between the estimated posture and the preset reference posture, that is, the difference between the body movements the athletes make during the competition and the foul actions defined for the sport.
  • In the motion gesture recognition method, a trained dual-stream long and short-term video pose estimation model performs motion gesture recognition on the video image to be recognized that contains a moving body in the target period.
  • The dual-stream 3D convolutional neural network in the model extracts the comprehensive features of the moving body from the video image to be recognized, and the recurrent neural network then obtains the target estimated posture information of the moving body in the target period based on the first posture estimation information of the first period and the comprehensive features of the moving body.
  • The estimated posture characterized by this information has temporal continuity, which makes the determined Euclidean distance between the estimated posture and the preset reference posture more accurate and improves the efficiency of motion posture recognition for the moving body.
  • FIG. 2 is a flowchart of an implementation of a motion gesture recognition method provided by another embodiment of the present application. Compared with the embodiment corresponding to FIG. 1, the motion gesture recognition method provided in this embodiment further includes S21 to S22 after step S14. The details are as follows:
  • S21: If the Euclidean distance is equal to or greater than a preset threshold, output prompt information. S22: If the Euclidean distance is less than the preset threshold, determine the target image corresponding to the estimated posture from the video image to be recognized. Steps S21 and S22 in this embodiment are parallel and mutually exclusive: once S21 is executed, S22 is not executed, and once S22 is executed, S21 is not executed, until the Euclidean distance is determined anew.
  • The Euclidean distance is used to describe the magnitude of the difference between the estimated posture and the reference posture: the larger the value of the Euclidean distance, the greater the difference between the two and the farther the estimated posture is from the reference posture; the smaller the value of the Euclidean distance, the smaller the difference and the more similar the estimated posture and the reference posture are.
  • motion gesture recognition methods provided by all embodiments of the present application can be used in, but not limited to, the fields of video teaching error correction and foul action recognition.
  • When the Euclidean distance is equal to or greater than the preset threshold, the human body's motion posture differs considerably from the preset reference posture in the teaching video, so the output prompt information is used to correct the motion posture of the moving body.
  • When the Euclidean distance is less than the preset threshold, the human body's motion posture differs only slightly from the preset reference posture in the teaching video, and the target image corresponding to the estimated posture is determined from the video image to be recognized, making it convenient for the user to review the movements learned from the teaching video.
  • For example, in the scenario where a user follows a fitness teaching video, a terminal plays the teaching video file while the terminal's camera collects video of the user, that is, the moving body, in the target area.
  • The to-be-recognized video image containing the exercising human body in the target period is obtained from the captured video file. Comparing the estimated posture information of the exercising body in the target period with the preset reference posture information of the preset reference video image determines the Euclidean distance between the estimated posture and the preset reference posture.
  • If the Euclidean distance is equal to or greater than the preset threshold, the human body's motion posture differs considerably from the preset reference posture in the teaching video, so the output prompt information is used to correct the motion posture of the moving body.
  • If the Euclidean distance is less than the preset threshold, the human body's motion posture differs only slightly from the preset reference posture in the teaching video, and the target image corresponding to the estimated posture is determined from the video image to be recognized, making it convenient for the user to review the movements learned from the teaching video.
  • When the Euclidean distance is equal to or greater than the preset threshold, the athlete's motion posture differs considerably from the preset reference posture of the foul action, so the output prompt information is image information of actions in which the athlete did not foul.
  • When the Euclidean distance is less than the preset threshold, the athlete's motion posture differs only slightly from the foul action, which establishes the athlete's foul; the target image corresponding to the estimated posture, that is, an image of the foul behavior, is then determined from the live video image as evidence for ruling on the foul.
  • the multi-directional camera configured in the environment of the competition venue is used to collect the images of the athletes during the game live to obtain the video file.
  • When the preset instruction for performing motion gesture recognition on the video image to be recognized is detected, the video image to be recognized containing the athlete in the target period is selected from the video file through a video image selection window.
  • If the Euclidean distance is equal to or greater than the preset threshold, the output prompt information is image information of the athlete's non-foul actions.
  • If the Euclidean distance is less than the preset threshold, the athlete's motion posture differs only slightly from the foul action, which establishes the foul; the target image corresponding to the estimated posture, that is, an image of the foul behavior, is determined from the live video image as evidence for ruling on the foul.
  • By comparing the Euclidean distance with a preset threshold, the motion gesture recognition scheme can be applied in more fields, broadening the applicability of motion posture recognition technology for moving bodies.
  • FIG. 3 is a flowchart of an implementation of a motion gesture recognition method provided by still another embodiment of the present application.
  • the motion gesture recognition method provided in this embodiment further includes S31 to S33 before step S11. The details are as follows:
  • the sample video file contains a moving body.
  • the moving body may be a human body or a robot that simulates a human body.
  • a moving body can include multiple moving parts, and each moving part is composed of at least one key point.
  • By recording the coordinate information of each key point, different coordinates of the same key point in consecutive picture frames indicate that the key point has a motion trajectory; by determining the coordinate set of each key point across consecutive frames, the motion posture of the moving body can be determined.
  • step S32 includes:
  • segmenting the sample video file to obtain T video clips, where T is an integer greater than 0 and each of the video clips is correspondingly configured with the three-dimensional key point information of the moving body;
  • using the T video clips and the three-dimensional key point information corresponding to each of the video clips as a training sample set.
  • the three-dimensional key point information is used to characterize the three-dimensional coordinates of each key point on the moving body, and the coordinate changes of all key points on the moving body can be used to characterize the posture change of the moving body.
  • During training, each video clip is used as the input of the dual-stream long and short-term video pose estimation model, and the model outputs the corresponding posture estimation information, which is compared with the three-dimensional key point information.
  • The posture estimation information is used to describe the estimated posture, while the three-dimensional key point information describes the actual posture of the moving body in the video clip.
  • Comparing the posture estimation information with the three-dimensional key point information means calculating the Euclidean distance between the estimated posture and the actual posture; a corresponding loss function is built from this Euclidean distance, and gradient backpropagation realizes the training of the dual-stream long and short-term video pose estimation model.
  • That is, during training, the input to the dual-stream long and short-term video pose estimation model is video images whose moving-body posture information has already been determined.
  • The model estimates the motion posture of the moving body from these video images, compares the estimate with the determined posture information of the moving body, and is then adjusted so that its output moves ever closer to the determined posture information until convergence.
  • When the model converges, the trained dual-stream long and short-term video pose estimation model is obtained.
  • In the preset video file, the posture information has already been determined; posture estimation on the video image to be recognized yields estimation information that represents the motion state of the moving body well. Comparing it with the posture information of the moving body in the video file determines the difference between the posture of the moving body in the video image to be recognized and the posture of the moving body in the video file.
  • FIG. 4 is a structural block diagram of a motion gesture recognition device provided by an embodiment of the present application.
  • the units included in the motion gesture recognition device are used to execute the steps in the embodiments corresponding to FIGS. 1 to 3.
  • The motion gesture recognition device 400 includes an acquisition and input unit 41, a first execution unit 42, a second execution unit 43, and a determination unit 44, where:
  • the acquisition and input unit 41 is configured to acquire a video image to be recognized that includes a moving body in the target period, and input the video image to be recognized into a trained dual-stream long and short-term video pose estimation model; the trained dual-stream long and short-term video pose estimation model includes a dual-stream 3D convolutional neural network and a recurrent neural network.
  • the first execution unit 42 is configured to perform feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain comprehensive features of the moving body in the target period.
  • the second execution unit 43 is configured to obtain the target estimated posture of the moving body in the target time period based on the first posture estimation information of the moving body in the first time period and the comprehensive characteristics of the moving body through a recurrent neural network Information; the first time period is the previous unit time period of the target time period; the estimated target posture information is used to characterize the estimated posture of the moving body in the target time period.
  • The determining unit 44 is configured to determine the Euclidean distance between the estimated posture and the preset reference posture based on the target estimated posture information and the preset reference posture information; the Euclidean distance is used to describe the magnitude of the difference between the estimated posture and the preset reference posture.
  • Further, the acquisition and input unit 41 is specifically configured to extract the RGB picture set and the motion optical flow picture set of the video image to be recognized, and input the RGB picture set and the motion optical flow picture set into the dual-stream 3D convolutional neural network of the trained dual-stream long and short-term video pose estimation model.
  • Further, the first execution unit 42 is specifically configured to extract, through the dual-stream 3D convolutional neural network and based on the RGB picture set and the motion optical flow picture set, the appearance features and motion features of the moving body in the video image to be recognized within the target period, and to splice the appearance features with the motion features to obtain the comprehensive features of the moving body in the target period.
  • Further, the second execution unit 43 is specifically configured to: if the target period has a preceding unit period, identify that preceding unit period as the first period, input the first posture estimation information corresponding to the first period together with the comprehensive features of the moving body into the recurrent neural network, and compute the state vector through the recurrent neural network according to the first posture estimation information and the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period; if the target period has no preceding unit period, set the first posture estimation information to empty, input the comprehensive features of the moving body into the recurrent neural network, and compute the state vector through the recurrent neural network according to the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period.
  • the motion gesture recognition device 400 further includes:
  • the third execution unit 45 is configured to output motion posture correction information for correcting the moving body if the Euclidean distance is equal to or greater than a preset threshold value.
  • the fourth execution unit 46 is configured to determine a target image corresponding to the estimated posture from the video image to be recognized if the Euclidean distance is less than a preset threshold.
  • the motion gesture recognition device 400 further includes:
  • the obtaining unit 47 is used to obtain a sample video file containing a moving body.
  • the sample generating unit 48 is configured to generate a training sample set based on the sample video file.
  • the training unit 49 is configured to use the training sample set to train the dual-stream long and short-term video pose estimation model to obtain a trained dual-stream long and short-term video pose estimation model.
  • Further, the sample generating unit 48 is specifically configured to segment the sample video file to obtain T video clips, where T is an integer greater than 0 and each of the video clips is correspondingly configured with the three-dimensional key point information of the moving body, and to use the T video clips and the three-dimensional key point information corresponding to each of the video clips as a training sample set.
  • In the motion gesture recognition device, a trained dual-stream long and short-term video pose estimation model performs motion gesture recognition on the video image to be recognized that contains a moving body in the target period.
  • The dual-stream 3D convolutional neural network in the model extracts the comprehensive features of the moving body from the video image to be recognized, and the recurrent neural network then obtains the target estimated posture information of the moving body in the target period based on the first posture estimation information of the first period and the comprehensive features of the moving body.
  • The estimated posture characterized by this information has temporal continuity, which makes the determined Euclidean distance between the estimated posture and the preset reference posture more accurate and improves the efficiency of motion posture recognition for the moving body.
  • By comparing the Euclidean distance with a preset threshold, the motion gesture recognition scheme can be applied in more fields, broadening the applicability of motion posture recognition technology for moving bodies.
  • Fig. 5 is a structural block diagram of a terminal device provided by another embodiment of the present application.
  • The terminal device 5 of this embodiment includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50, for example, a motion gesture recognition program.
  • When the processor 50 executes the computer program 52, the steps in each embodiment of the aforementioned motion gesture recognition method are implemented, for example, S11 to S14 shown in FIG. 1.
  • Alternatively, when the processor 50 executes the computer program 52, the functions of the units in the embodiment corresponding to FIG. 4 are implemented, for example, the functions of the units 41 to 44 shown in FIG. 4; for details, refer to the relevant description in the corresponding embodiment, which will not be repeated here.
  • the computer program 52 may be divided into one or more units, and the one or more units are stored in the memory 51 and executed by the processor 50 to complete the application.
  • the one or more units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 52 in the terminal device 5.
  • the computer program 52 may be divided into an acquisition and input unit, a first execution unit, a second execution unit, and a determination unit, and the specific functions of each unit are as described above.
  • the terminal device may include, but is not limited to, a processor 50 and a memory 51.
  • FIG. 5 is only an example of the terminal device 5 and does not constitute a limitation on the terminal device 5; it may include more or fewer components than shown in the figure, a combination of certain components, or different components.
  • the terminal device may also include input and output devices, network access devices, buses, and so on.
  • The so-called processor 50 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5.
  • The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal device 5.
  • the memory 51 may also include both an internal storage unit of the terminal device 5 and an external storage device.
  • the memory 51 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 51 can also be used to temporarily store data that has been output or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

This application is applicable to the field of image recognition technology and provides a motion gesture recognition method, a motion gesture recognition device, a terminal device, and a medium. In the motion gesture recognition method, a trained dual-stream long and short-term video pose estimation model performs motion gesture recognition on a to-be-recognized video image containing a moving body in a target period: the dual-stream 3D convolutional neural network in the trained model extracts the comprehensive features of the moving body from the to-be-recognized video image, and the recurrent neural network then obtains the target estimated posture information of the moving body in the target period based on the first posture estimation information of the moving body in the first period and the comprehensive features of the moving body. The estimated posture characterized by the resulting target estimated posture information has temporal continuity, which makes the computed Euclidean distance between the estimated posture and a preset reference posture more accurate and improves the efficiency of motion posture recognition for the moving body.

Description

Motion gesture recognition method, motion gesture recognition device, terminal device, and medium
TECHNICAL FIELD
This application belongs to the field of image recognition technology, and in particular relates to a motion gesture recognition method, a motion gesture recognition device, a terminal device, and a computer-readable storage medium.
BACKGROUND
As living standards rise, more and more smart home appliances are favored by consumers. For example, a smart fitness system equipped with a wearable device can collect the user's exercise data through the wearable device and recognize the user's exercise posture based on that data, so as to guide the user's fitness exercise even without a human instructor.
TECHNICAL PROBLEM
However, in existing motion gesture recognition schemes, posture estimation or posture reconstruction is performed on the motion data collected by the wearable device, and a given set of motion data can simultaneously characterize multiple motion postures; that is, posture estimation based on motion data carries a considerable margin of error. It can be seen that existing motion gesture recognition schemes suffer from low recognition efficiency.
TECHNICAL SOLUTION
In view of this, the embodiments of this application provide a motion gesture recognition method, a motion gesture recognition device, a terminal device, and a computer-readable storage medium, to solve the problem of low recognition efficiency in existing motion gesture recognition schemes.
A first aspect of the embodiments of this application provides a motion gesture recognition method, including:
acquiring a to-be-recognized video image containing a moving body in a target period, and inputting the to-be-recognized video image into a trained dual-stream long and short-term video pose estimation model, where the trained dual-stream long and short-term video pose estimation model includes a dual-stream 3D convolutional neural network and a recurrent neural network;
performing feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain comprehensive features of the moving body in the target period;
obtaining, through the recurrent neural network, target estimated posture information of the moving body in the target period based on first posture estimation information of the moving body in a first period and the comprehensive features of the moving body, where the first period is the unit period immediately preceding the target period, and the target estimated posture information is used to characterize the estimated posture of the moving body in the target period;
determining the Euclidean distance between the estimated posture and a preset reference posture based on the target estimated posture information and preset reference posture information, where the Euclidean distance is used to describe the magnitude of the difference between the estimated posture and the preset reference posture.
Further, the acquiring of the to-be-recognized video image containing the moving body in the target period and inputting it into the trained dual-stream long and short-term video pose estimation model includes:
extracting an RGB picture set and a motion optical flow picture set of the to-be-recognized video image;
inputting the RGB picture set and the motion optical flow picture set into the dual-stream 3D convolutional neural network of the trained dual-stream long and short-term video pose estimation model.
Further, the performing of feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain the comprehensive features of the moving body in the target period includes:
extracting, through the dual-stream 3D convolutional neural network and based on the RGB picture set and the motion optical flow picture set, appearance features and motion features of the moving body in the to-be-recognized video image within the target period;
splicing the appearance features with the motion features to obtain the comprehensive features of the moving body in the target period.
Further, the obtaining, through the recurrent neural network, of the target estimated posture information of the moving body in the target period based on the first posture estimation information of the moving body in the first period and the comprehensive features of the moving body includes:
if the target period has a preceding unit period, identifying that preceding unit period as the first period, inputting the first posture estimation information corresponding to the first period together with the comprehensive features of the moving body into the recurrent neural network, and performing state-vector calculation through the recurrent neural network according to the first posture estimation information and the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period;
if the target period has no preceding unit period, setting the first posture estimation information to empty, inputting the comprehensive features of the moving body into the recurrent neural network, and performing state-vector calculation through the recurrent neural network according to the comprehensive features of the moving body to obtain the target estimated posture information of the moving body in the target period.
Further, after the step of determining the Euclidean distance between the estimated posture and the preset reference posture based on the target estimated posture information and the preset reference posture information, the method further includes:
if the Euclidean distance is equal to or greater than a preset threshold, outputting motion posture correction information for correcting the moving body;
if the Euclidean distance is less than the preset threshold, determining a target image corresponding to the estimated posture from the to-be-recognized video image.
Further, the method further includes:
acquiring a sample video file containing a moving body;
generating a training sample set based on the sample video file;
training the dual-stream long and short-term video pose estimation model with the training sample set to obtain a trained dual-stream long and short-term video pose estimation model.
Further, the generating of a training sample set based on the sample video file includes:
segmenting the sample video file to obtain T video clips, where T is an integer greater than 0 and each of the video clips is correspondingly configured with three-dimensional key point information of the moving body;
using the T video clips and the three-dimensional key point information corresponding to each of the video clips as the training sample set.
A second aspect of the embodiments of this application provides a motion gesture recognition device, including:
an acquisition and input unit, configured to acquire a to-be-recognized video image containing a moving body in a target period and input the to-be-recognized video image into a trained dual-stream long and short-term video pose estimation model, where the trained dual-stream long and short-term video pose estimation model includes a dual-stream 3D convolutional neural network and a recurrent neural network;
a first execution unit, configured to perform feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain comprehensive features of the moving body in the target period;
a second execution unit, configured to obtain, through the recurrent neural network, target estimated posture information of the moving body in the target period based on first posture estimation information of the moving body in a first period and the comprehensive features of the moving body, where the first period is the unit period immediately preceding the target period, and the target estimated posture information is used to characterize the estimated posture of the moving body in the target period;
a determining unit, configured to determine the Euclidean distance between the estimated posture and a preset reference posture based on the target estimated posture information and preset reference posture information, where the Euclidean distance is used to describe the magnitude of the difference between the estimated posture and the preset reference posture.
A third aspect of the embodiments of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the terminal device, where the processor, when executing the computer program, implements the steps of the motion gesture recognition method provided in the first aspect.
A fourth aspect of the embodiments of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the motion gesture recognition method provided in the first aspect.
A fifth aspect of the embodiments of this application provides a computer program product that, when run on a terminal device, causes the terminal device to execute the steps of the motion gesture recognition method according to any implementation of the first aspect.
BENEFICIAL EFFECTS
Implementing the motion gesture recognition method, motion gesture recognition device, terminal device, and computer-readable storage medium provided by the embodiments of this application has the following beneficial effects:
In the motion gesture recognition method provided by the embodiments of this application, a trained dual-stream long and short-term video pose estimation model performs motion gesture recognition on a to-be-recognized video image containing a moving body in a target period. The dual-stream 3D convolutional neural network in the trained model extracts the comprehensive features of the moving body from the to-be-recognized video image, and the recurrent neural network then obtains the target estimated posture information of the moving body in the target period based on the first posture estimation information of the moving body in the first period and the comprehensive features of the moving body. Because the first period is the unit period immediately preceding the target period, posture estimation is performed recursively, so the estimated posture characterized by the resulting target estimated posture information has temporal continuity. This makes the determined Euclidean distance between the estimated posture and the preset reference posture more accurate and improves the efficiency of motion posture recognition for the moving body.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is an implementation flowchart of a motion gesture recognition method provided by an embodiment of this application;
FIG. 2 is an implementation flowchart of a motion gesture recognition method provided by another embodiment of this application;
FIG. 3 is an implementation flowchart of a motion gesture recognition method provided by still another embodiment of this application;
FIG. 4 is a structural block diagram of a motion gesture recognition device provided by an embodiment of this application;
FIG. 5 is a structural block diagram of a terminal device provided by another embodiment of this application.
EMBODIMENTS OF THE INVENTION
To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
Please refer to FIG. 1, which is an implementation flowchart of a motion gesture recognition method provided by an embodiment of this application. In this embodiment, the motion gesture recognition method is used to perform motion gesture recognition on the moving body in a video image; its execution subject is a computer terminal, for example, a computer or server used for video image collection and video image analysis.
The motion gesture recognition method shown in FIG. 1 includes the following steps:
S11: Acquire a to-be-recognized video image containing a moving body in a target period, and input the to-be-recognized video image into a trained dual-stream long and short-term video pose estimation model; the trained dual-stream long and short-term video pose estimation model includes a dual-stream 3D convolutional neural network and a recurrent neural network.
In step S11, the to-be-recognized video image may be a real-time moving image of the moving body, or a video file obtained by recording the moving body while it is in motion. The target period is used to describe the content duration of the to-be-recognized video image.
It should be noted that the trained dual-stream long and short-term video pose estimation model is obtained by training the constructed dual-stream long and short-term video pose estimation model. Because the trained model is built with a dual-stream 3D convolutional neural network and a recurrent neural network, features can be extracted from the to-be-recognized video image through the dual-stream 3D convolutional neural network, and posture estimation can be performed by the recurrent neural network based on the features extracted by the 3D convolutional neural network. Since the recurrent neural network completes each posture estimation based on the current features extracted by the 3D convolutional neural network together with the previous posture estimation result, the posture estimation results obtained through the recurrent neural network have temporal continuity; that is, such a continuous posture estimation scheme yields more accurate estimation results.
In practical applications, a unit time period can be chosen as the selection criterion for the target period; that is, the target period may consist of one or more unit periods. Because the to-be-recognized video image contains a moving body, and a single video frame cannot describe the moving body's motion trajectory or motion posture, the to-be-recognized video image includes multiple consecutive frames so that the motion state of the moving body can be recognized; in other words, within the content duration represented by the target period, the to-be-recognized video image can continuously display multiple consecutive frames.
As for when to acquire the to-be-recognized video image containing the moving body in the target period, the following two scenarios are illustrative but not limiting.
Scenario 1: While a preset video file is being played, acquire the to-be-recognized video image containing the moving body in the target period.
For example, the preset video file may include a sports fitness video file. While it is being played, a video image acquisition device is called to collect video images of the moving body in the target area to obtain a video file, from which the to-be-recognized video image containing the moving body in the target period is acquired. Specifically, the sports fitness video file can be played on a mobile phone terminal while the phone's camera collects video images of the moving body in the target area to obtain a video file; the to-be-recognized video image containing the moving body in the target period is then selected from the video file through a video image selection window.
Scenario 2: If a preset instruction for performing motion gesture recognition on the to-be-recognized video image is detected, acquire the to-be-recognized video image containing the moving body in the target period.
For example, live video recording of a sports competition yields a live video file. When the preset instruction for motion gesture recognition is detected, the to-be-recognized video image containing the athlete in the target period is selected from the video file through a video image selection window.
As a possible implementation of this embodiment, step S11 may specifically include:
extracting the RGB picture set and the motion optical flow picture set of the to-be-recognized video image, and inputting the RGB picture set and the motion optical flow picture set into the dual-stream 3D convolutional neural network of the trained dual-stream long and short-term video pose estimation model.
In this embodiment, RGB picture extraction and motion optical flow extraction can be performed on every frame of the to-be-recognized video image to obtain the RGB picture set and the motion optical flow picture set of the whole video. A motion optical flow picture expresses the changes of the moving body in the image; because it contains the moving body's motion information, it can be used to determine how the body moves. Optical flow pictures also carry three-dimensional structure information of the moving body, so the features obtained when the dual-stream 3D convolutional neural network convolves the RGB pictures and the motion optical flow pictures are more accurate; that is, recognizing the moving body's motion posture from RGB pictures together with optical flow pictures is more efficient.
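As an illustration of this extraction step, the following sketch builds the RGB picture set and the motion optical flow picture set from a video file. The patent does not name a specific optical flow algorithm, so the use of OpenCV's Farneback dense flow here, along with the function name and parameter values, is an assumption for illustration only.

    import cv2

    def extract_rgb_and_flow(video_path, max_frames=None):
        """Return the RGB picture set and the motion optical flow picture set."""
        cap = cv2.VideoCapture(video_path)
        rgb_frames, flow_frames = [], []
        ok, prev = cap.read()
        if not ok:
            return rgb_frames, flow_frames
        rgb_frames.append(cv2.cvtColor(prev, cv2.COLOR_BGR2RGB))
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        while True:
            ok, frame = cap.read()
            if not ok or (max_frames and len(rgb_frames) >= max_frames):
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Dense flow between consecutive frames: one (H, W, 2) field per pair
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            flow_frames.append(flow)
            rgb_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            prev_gray = gray
        cap.release()
        return rgb_frames, flow_frames

Each flow frame is a two-channel displacement field, which matches the two-channel input that the motion stream of the network expects.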
S12: Perform feature extraction of the moving body on the to-be-recognized video image through the dual-stream 3D convolutional neural network to obtain the comprehensive features of the moving body in the target period.
In step S12, the dual-stream 3D convolutional neural network takes dual image streams as input and outputs the comprehensive features of the moving body.
In this embodiment, the input of the dual-stream 3D convolutional neural network can be the RGB images and the motion optical flow images of the to-be-recognized video image. Convolutional layers, pooling layers, and a feature concatenation layer are built into the dual-stream 3D convolutional neural network. After the RGB images and the motion optical flow images are fed in, the convolutional and pooling layers jointly perform feature convolutions of different dimensions on each stream, extracting the moving body's features from the to-be-recognized video image; the feature concatenation layer then fuses the extracted features, finally yielding the comprehensive features of the moving body in the target period.
As a possible implementation of this embodiment, step S12 may specifically include:
extracting, through the dual-stream 3D convolutional neural network and based on the RGB picture set and the motion optical flow picture set, the appearance features and motion features of the moving body in the to-be-recognized video image within the target period, and splicing the appearance features with the motion features to obtain the comprehensive features of the moving body in the target period.
In this embodiment, the dual-stream 3D convolutional neural network (CNN_a, CNN_m) is configured with multiple image-feature convolutional layers, multiple pooling layers, and at least one feature splicing layer. The image-feature convolutional layers perform image feature convolution at different levels on every RGB frame in the RGB picture set and every motion optical flow frame in the optical flow picture set; after each convolution, a pooling layer selects from the resulting features before the next convolution. This yields the appearance contour features A(t) = CNN_a(V(t)) and the motion features M(t) = CNN_m(V(t)) describing the moving body; the feature splicing layer concatenates the two to give the comprehensive features C(t) = [A(t), M(t)], where t denotes the target period.
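To make this structure concrete, here is a minimal PyTorch sketch of a dual-stream 3D convolutional network computing C(t) = [A(t), M(t)]. The layer sizes, feature dimension, and class names are illustrative assumptions; the patent only fixes the overall pattern of convolution, pooling, and feature splicing.

    import torch
    import torch.nn as nn

    class Stream3D(nn.Module):
        """One 3D convolution stream (CNN_a for RGB, CNN_m for optical flow)."""
        def __init__(self, in_channels, feat_dim=256):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d((1, 2, 2)),          # pool spatially, keep time
                nn.Conv3d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(2),
                nn.AdaptiveAvgPool3d(1),          # collapse (T, H, W)
            )
            self.fc = nn.Linear(64, feat_dim)

        def forward(self, clip):                  # clip: (B, C, T, H, W)
            return self.fc(self.features(clip).flatten(1))

    class DualStream3DCNN(nn.Module):
        """Computes C(t) = [A(t), M(t)] by splicing the two stream features."""
        def __init__(self, feat_dim=256):
            super().__init__()
            self.cnn_a = Stream3D(in_channels=3, feat_dim=feat_dim)  # RGB
            self.cnn_m = Stream3D(in_channels=2, feat_dim=feat_dim)  # flow

        def forward(self, rgb_clip, flow_clip):
            a_t = self.cnn_a(rgb_clip)            # appearance features A(t)
            m_t = self.cnn_m(flow_clip)           # motion features M(t)
            return torch.cat([a_t, m_t], dim=1)   # comprehensive features C(t)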
S13: Obtain, through the recurrent neural network, the target estimated posture information of the moving body in the target period based on the first posture estimation information of the moving body in the first period and the comprehensive features of the moving body; the first period is the unit period immediately preceding the target period; the target estimated posture information is used to characterize the estimated posture of the moving body in the target period.
In step S13, the first period is the unit period immediately preceding the target period, and the first posture estimation information describes the motion state of the moving body obtained by posture estimation in the first period.
In this embodiment, before the recurrent neural network outputs the target estimated posture information of the moving body in the target period, the estimated posture information of the moving body in the unit period immediately preceding the target period, that is, the first posture estimation information of the moving body in the first period, must be determined and obtained. The target estimated posture information is then determined from the first posture estimation information and the comprehensive features of the moving body in the target period. The estimated posture it characterizes is therefore continuous with the estimated posture of the preceding unit period, so that every posture estimation for the target period builds on the estimation result of the preceding unit period, guaranteeing temporal coherence between successive posture estimations.
As a possible implementation of this embodiment, step S13 may include:
if the target period has a preceding unit period, identifying that preceding unit period as the first period, inputting the first posture estimation information corresponding to the first period together with the comprehensive features of the moving body into the recurrent neural network, and computing the state vector through the recurrent neural network from the first posture estimation information and the comprehensive features to obtain the target estimated posture information of the moving body in the target period;
if the target period has no preceding unit period, setting the first posture estimation information to empty, inputting the comprehensive features of the moving body into the recurrent neural network, and computing the state vector through the recurrent neural network from the comprehensive features alone to obtain the target estimated posture information of the moving body in the target period.
It should be noted that setting the first posture estimation information to empty means that it is not used as a factor in the state-vector calculation; only the comprehensive features of the moving body are considered.
In this embodiment, because the target period describes the content duration of the to-be-recognized video image, its selection or determination depends on the duration of the video file containing that image. When the target period is the first unit period of the video file, there is no preceding unit period, so the posture estimation information of a preceding period cannot be determined; the recurrent neural network therefore computes the state vector from the comprehensive features alone to obtain the target estimated posture information. When the target period is an intermediate unit period of the video file, a preceding unit period exists, so the posture estimation information of that preceding period, that is, the first posture estimation information, can first be determined, and the recurrent neural network computes the state vector from the first posture estimation information together with the comprehensive features to obtain the target estimated posture information.
It can be understood that the appearance contour features can be expressed as A(t) = CNN_a(V(t)) and the motion features as M(t) = CNN_m(V(t)), with t denoting the target period; splicing the two through the feature splicing layer gives the comprehensive features C(t) = [A(t), M(t)]. The recurrent neural network then obtains the target estimated posture information from the first posture estimation information and the comprehensive features, which can be expressed as P(t) = LSTM(P(t-1), C(t)), where P(t) is the target estimated posture information for the target period t, P(t-1) is the first posture estimation information for the first period t-1 (the unit period immediately preceding the target period), and C(t) is the comprehensive features of the moving body.
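The recursion P(t) = LSTM(P(t-1), C(t)) can be sketched with a single LSTM cell as below. The hidden size, number of key points, and the zero placeholder used when P(t-1) is empty are illustrative assumptions.

    import torch
    import torch.nn as nn

    class RecurrentPoseEstimator(nn.Module):
        """State update implementing P(t) = LSTM(P(t-1), C(t))."""
        def __init__(self, feat_dim=512, num_keypoints=17, hidden=512):
            super().__init__()
            self.pose_dim = num_keypoints * 3     # 3-D coordinates per key point
            self.cell = nn.LSTMCell(feat_dim + self.pose_dim, hidden)
            self.head = nn.Linear(hidden, self.pose_dim)

        def forward(self, c_t, prev_pose=None, state=None):
            if prev_pose is None:
                # First unit period: P(t-1) is empty, so only C(t) contributes
                prev_pose = c_t.new_zeros(c_t.size(0), self.pose_dim)
            h, c = self.cell(torch.cat([prev_pose, c_t], dim=1), state)
            return self.head(h), (h, c)           # P(t) and recurrent state

    # Recursive prediction over consecutive unit periods:
    # pose, state = estimator(c_t)                # first period, P(t-1) empty
    # pose, state = estimator(c_t1, pose, state)  # later periods reuse P(t-1)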
S14: determining, based on the target pose estimation information and preset reference pose information, the Euclidean distance between the estimated pose and the preset reference pose.

In step S14, the target pose estimation information characterizes the estimated pose of the moving body in the target period. The preset reference pose information characterizes the reference pose of the moving body in a preset reference video image. The Euclidean distance describes the magnitude of the difference between the estimated pose and the reference pose.

It should be noted that the target pose estimation information may include a set of three-dimensional coordinate values of the key points on the moving body in the video image to be recognized. The preset reference pose information may include a set of three-dimensional coordinate values of the key points on the moving body in the preset reference video image.
In all embodiments of the present application, the key points of the moving body in the preset reference video image are predefined key points; the moving body may contain multiple moving parts, and each moving part may consist of one or more key points. The preset reference pose information contains the set of three-dimensional coordinate values of the key points on the moving body and can therefore describe the overall motion posture of the moving body. When motion-state recognition is performed on the video image to be recognized, the key points of the moving body in the preset reference video image and the motion coordinates of each key point at each moment are already determined information; that is, the motion posture of the moving body in the preset reference video image in each period is determined, and that motion posture in each period serves as the reference pose against which the motion posture of the moving body in the video image to be recognized is recognized and compared.
Taking the case where the moving body in the video image to be recognized is a human body, the moving body in the preset reference video image is correspondingly also a human body, and the key points of the moving body may be the movable joints of the human body. The target pose estimation information characterizes the motion posture of the human body in the video image to be recognized in the target period, and may be the set of coordinate values corresponding to the different motion postures of the human body's movable joints (key points) in the target period. Based on the target pose estimation information and the preset reference pose information, the Euclidean distance between the estimated pose of the human body in the video image to be recognized in the target period and the preset reference pose can be determined, i.e., the magnitude of the difference between the estimated pose of the human body and the preset reference pose, thereby recognizing the motion posture of the moving body in the video image to be recognized in the target period.
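For illustration, the distance of step S14 between two sets of 3D key-point coordinates might be computed as follows. Averaging the per-joint distances is an assumption made for this sketch; the patent does not fix how per-key-point distances are aggregated.

```python
import numpy as np

def pose_distance(estimated, reference):
    """Euclidean distance between an estimated pose and a reference pose.

    estimated, reference: (K, 3) arrays holding the 3D coordinates of the
    K key points (e.g. the movable joints of a human body).
    """
    per_joint = np.linalg.norm(estimated - reference, axis=1)  # distance per key point
    return float(per_joint.mean())                             # aggregated (assumed mean)
```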
Taking the case where the preset reference video image is the image displayed while a fitness video file is played, the video image to be recognized is the video image containing the moving body in the target period, determined from the video file obtained by invoking a video image acquisition device to capture video images of the moving body in the target area. The target pose estimation information describes the pose of the moving body in the target period, obtained by pose estimation on the video image to be recognized during playback of the fitness video file.

For example, in a scenario where a user learns the movements in a fitness tutorial video, a terminal plays the tutorial video file while the terminal's camera is invoked to capture video images of the user's body movements, i.e., the moving body, in the target area; the video image to be recognized that contains the moving body in the target period is then acquired from the video file captured by the camera. Since the body movements the user makes while learning the tutorial correspond to the movements in the tutorial video, i.e., the two sets of movements should be synchronous with a negligible time difference, the image content of the tutorial video can be taken as the preset reference video image. Comparing the target pose estimation information of the moving body in the target period in the video image to be recognized with the preset reference pose information of the preset reference video image yields the Euclidean distance between the estimated pose and the preset reference pose, i.e., the magnitude of the gap between the body movements the user makes while learning the tutorial and the movements in the tutorial video.
Taking the case where the preset reference video image is an image of a foul action in a sports competition, the video image to be recognized is the video image containing the athlete in the target period, determined from the video file obtained by invoking a video image acquisition device to capture video images of the athletes in the target area during the competition. The target pose estimation information describes the athlete's pose captured live during the competition.

For example, in a sports competition scenario, multi-angle cameras installed around the venue capture live images of the athletes during the competition to obtain a video file. When a preset instruction for performing motion posture recognition on the video image to be recognized is detected, the video image to be recognized that contains the athlete in the target period is selected from the video file through a video image selection window. Since the preset reference video image is an image of a foul action in the sport, comparing the target pose estimation information of the athlete in the target period in the video image to be recognized with the preset reference pose information of the preset reference video image yields the Euclidean distance between the estimated pose and the preset reference pose, i.e., the magnitude of the gap between the body movements the athlete makes during the competition and the foul action.
As can be seen from the above, the motion posture recognition method provided by this embodiment performs motion posture recognition on a video image to be recognized that contains a moving body in a target period through a trained dual-stream long- and short-term video pose estimation model: the dual-stream 3D convolutional neural network in the trained model extracts the comprehensive moving-body feature from the video image to be recognized, and the recurrent neural network then obtains the target pose estimation information of the moving body within the target period based on the first pose estimation information of the moving body within the first period and the comprehensive moving-body feature. Since the first period is the unit period preceding the target period, recursive pose prediction is achieved, so the estimated pose characterized by the resulting target pose estimation information has temporal coherence, which makes the determined Euclidean distance between the estimated pose and the preset reference pose more accurate and improves the efficiency of motion posture recognition of the moving body.
Referring to FIG. 2, FIG. 2 is a flowchart of an implementation of a motion posture recognition method provided by another embodiment of the present application. Compared with the embodiment corresponding to FIG. 1, the motion posture recognition method provided by this embodiment further includes S21-S22 after step S14, detailed as follows:

S21: if the Euclidean distance is equal to or greater than a preset threshold, outputting prompt information.

S22: if the Euclidean distance is less than the preset threshold, determining, from the video image to be recognized, a target image corresponding to the estimated pose.
It should be noted that steps S21 and S22 in this embodiment are parallel steps with no fixed execution order; once step S21 is executed, step S22 is not executed, and once step S22 is executed, step S21 is not executed, until the Euclidean distance is determined anew.
In this embodiment, since the Euclidean distance describes the magnitude of the difference between the estimated pose and the reference pose, a larger Euclidean distance indicates a greater difference between the estimated pose and the reference pose, i.e., the two poses diverge further; a smaller Euclidean distance indicates a smaller difference, i.e., the estimated pose and the reference pose are more similar.
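A small sketch of the mutually exclusive S21/S22 branch follows; the threshold value and the returned payloads are hypothetical placeholders, not values from the patent.

```python
POSE_THRESHOLD = 0.15  # illustrative assumption, in the coordinate units used

def postprocess(distance, target_image):
    """Branch on the Euclidean distance per steps S21/S22."""
    if distance >= POSE_THRESHOLD:
        # S21: the estimated pose deviates from the reference; output a prompt.
        return {"prompt": "posture differs from the reference pose"}
    # S22: the poses are close; return the target image matching the estimated pose.
    return {"target_image": target_image}
```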
In practical applications, when the content of the video image to be recognized differs, the actual meaning represented by the Euclidean distance also differs.

It should be noted that the motion posture recognition method provided by all embodiments of the present application can be applied to, but is not limited to, the fields of video-tutorial correction and foul-action recognition.
Taking the case where the video image to be recognized is a video image of a human body exercising, if the Euclidean distance is equal to or greater than the preset threshold, the human body's motion posture differs considerably from the preset reference pose in the tutorial video, so the output prompt information is used to correct the moving body's motion posture. If the Euclidean distance is less than the preset threshold, the human body's motion posture differs little from the preset reference pose in the tutorial video, and the target image corresponding to the estimated pose is determined from the video image to be recognized, making it convenient for the user to view the movements they have learned from the tutorial video.

For example, in a scenario where a user learns the movements in a fitness tutorial video, a terminal plays the tutorial video file while the terminal's camera is invoked to capture video images of the user's body movements, i.e., the moving body, in the target area, and the video image to be recognized that contains the exercising human body in the target period is acquired from the video file captured by the camera. Comparing the target pose estimation information of the exercising human body in the target period in the video image to be recognized with the preset reference pose information of the preset reference video image yields the Euclidean distance between the estimated pose and the preset reference pose. If the Euclidean distance is equal to or greater than the preset threshold, the human body's motion posture differs considerably from the preset reference pose in the tutorial video, so the output prompt information is used to correct the moving body's motion posture. If the Euclidean distance is less than the preset threshold, the human body's motion posture differs little from the preset reference pose in the tutorial video, and the target image corresponding to the estimated pose is determined from the video image to be recognized, making it convenient for the user to view the movements they have learned from the tutorial video.
Taking the case where the video image to be recognized is a live video image of a sports competition, if the Euclidean distance is equal to or greater than the preset threshold, the athlete's motion posture differs considerably from the preset reference pose of the foul action, so the output prompt information is action-image information indicating that the athlete has not committed a foul. If the Euclidean distance is less than the preset threshold, the athlete's motion posture differs little from the foul action, i.e., the athlete's foul is established, and the target image corresponding to the estimated pose, i.e., the image of the foul behavior, is determined from the live video image as evidence for ruling the athlete's foul.

For example, in a sports competition scenario, multi-angle cameras installed around the venue capture live images of the athletes during the competition to obtain a video file. When a preset instruction for performing motion posture recognition on the video image to be recognized is detected, the video image to be recognized that contains the athlete in the target period is selected from the video file through a video image selection window. Comparing the target pose estimation information of the athlete in the target period in the video image to be recognized with the preset reference pose information of the preset reference video image yields the Euclidean distance between the estimated pose and the preset reference pose. If the Euclidean distance is equal to or greater than the preset threshold, the athlete's motion posture differs considerably from the preset reference pose of the foul action, so the output prompt information is action-image information indicating that the athlete has not committed a foul. If the Euclidean distance is less than the preset threshold, the athlete's motion posture differs little from the foul action, i.e., the athlete's foul is established, and the target image corresponding to the estimated pose, i.e., the image of the foul behavior, is determined from the live video image as evidence for ruling the athlete's foul.
As can be seen from the above, comparing the Euclidean distance against the preset threshold allows the motion posture recognition scheme to be applied in more fields, broadening the applicability of moving-body motion posture recognition technology.
Referring to FIG. 3, FIG. 3 is a flowchart of an implementation of a motion posture recognition method provided by yet another embodiment of the present application. Compared with the embodiments corresponding to FIG. 1 or FIG. 2, the motion posture recognition method provided by this embodiment further includes S31-S33 before step S11, detailed as follows:

S31: acquiring a sample video file containing a moving body.

S32: generating a training sample set based on the sample video file.

S33: training a dual-stream long- and short-term video pose estimation model with the training sample set to obtain the trained dual-stream long- and short-term video pose estimation model.
In this embodiment, the sample video file contains a moving body, which may be a human body or a robot simulating a human body, among others. The moving body may include multiple moving parts, each consisting of at least one key point. When training the dual-stream long- and short-term video pose estimation model, the coordinate information of each key point must be determined in every frame of the sample video file that contains the moving body. If the coordinates of the same key point differ across consecutive frames, that key point has a motion trajectory; by determining the coordinate set of each key point across consecutive frames, the moving body's motion posture can be determined.
As a possible implementation of this embodiment, step S32 includes:

segmenting the sample video file into T video segments, where T is an integer greater than 0 and each video segment is correspondingly configured with three-dimensional key-point information of the moving body; and taking the T video segments and the three-dimensional key-point information corresponding to each video segment as the training sample set.
In this embodiment, the three-dimensional key-point information characterizes the three-dimensional coordinates of each key point on the moving body; the coordinate changes of all the key points on the moving body can characterize the moving body's posture changes.
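A minimal sketch of this segmentation step follows. The fixed clip length and the per-frame annotation format are assumptions made for illustration; the patent only requires that each segment carry the moving body's 3D key-point information.

```python
def build_training_set(frames, keypoints_3d, clip_len=16):
    """Pair each video segment with its 3D key-point annotations.

    frames: list of video frames of the sample video file.
    keypoints_3d: per-frame (K, 3) coordinate arrays for the K key points.
    """
    samples = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]          # one of the T video segments
        labels = keypoints_3d[start:start + clip_len]  # its 3D key-point information
        samples.append((clip, labels))
    return samples                                     # T (segment, annotation) pairs
```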
It should be noted that when the dual-stream long- and short-term video pose estimation model is trained with the training sample set, the video segments serve as input to the dual-stream long- and short-term video pose estimation model, the model outputs the corresponding pose estimation information from the video segments, and that pose estimation information is then compared with the three-dimensional key-point information, where the pose estimation information describes the estimated pose and the three-dimensional key-point information describes the actual pose of the moving body in the video segment. Comparing the pose estimation information with the three-dimensional key-point information, i.e., computing the Euclidean distance between the estimated pose and the actual pose, generates the corresponding loss function based on that Euclidean distance, and training of the dual-stream long- and short-term video pose estimation model is accomplished through gradient back-propagation to the model parameters.
In practical applications, when the dual-stream long- and short-term video pose estimation model is trained, video images whose moving-body pose information has already been determined are input to the model; the model estimates the moving body's motion posture based on those video images, the estimate is compared with the determined pose information, and the model is adjusted accordingly so that its results approach and converge toward the determined pose information. After training is complete, the trained dual-stream long- and short-term video pose estimation model is obtained. When the trained model is used for motion posture recognition, the pose information of the moving body in the preset video file has already been determined; the estimation information obtained by pose estimation on the video image to be recognized characterizes the moving body's motion state well, and comparing it with the pose information of the moving body in the video file determines the magnitude of the difference between the motion posture of the moving body in the video image to be recognized and the pose of the moving body in the video file.
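For illustration only, one training step along these lines might be sketched in PyTorch as follows. The stand-in model, the Adam optimizer, and the learning rate are assumptions made so the snippet runs on its own; the loss is the Euclidean-distance comparison described above.

```python
import torch
import torch.nn as nn

class _StandInModel(nn.Module):
    """Trivial stand-in mapping (rgb_clip, flow_clip) -> (N, K, 3) keypoints."""
    def __init__(self, k=17):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(1, k * 3)

    def forward(self, rgb_clip, flow_clip):
        # flow_clip is ignored by this stand-in; a real model would use both streams.
        pooled = rgb_clip.mean(dim=(1, 2, 3, 4)).unsqueeze(1)   # (N, 1)
        return self.fc(pooled).view(-1, self.k, 3)

model = _StandInModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)       # assumed settings

def train_step(rgb_clip, flow_clip, keypoints_3d):
    optimizer.zero_grad()
    pred = model(rgb_clip, flow_clip)                           # estimated pose
    # Loss from the Euclidean distance between estimate and annotated key points.
    loss = torch.norm(pred - keypoints_3d, dim=-1).mean()
    loss.backward()                                             # gradient back-propagation
    optimizer.step()
    return loss.item()
```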
Referring to FIG. 4, FIG. 4 is a structural block diagram of a motion posture recognition apparatus provided by an embodiment of the present application. The units included in the motion posture recognition apparatus in this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 3; for details, refer to FIG. 1 to FIG. 3 and the related descriptions in the embodiments corresponding to FIG. 1 to FIG. 3. For ease of illustration, only the parts relevant to this embodiment are shown. Referring to FIG. 4, the motion posture recognition apparatus 400 includes: an acquisition and input unit 41, a first execution unit 42, a second execution unit 43, and a determination unit 44, where:
the acquisition and input unit 41 is configured to acquire a video image to be recognized that contains a moving body in a target period and input the video image to be recognized into a trained dual-stream long- and short-term video pose estimation model, the trained dual-stream long- and short-term video pose estimation model including a dual-stream 3D convolutional neural network and a recurrent neural network;

the first execution unit 42 is configured to perform feature extraction of the moving body on the video image to be recognized through the dual-stream 3D convolutional neural network to obtain a comprehensive moving-body feature for the target period;

the second execution unit 43 is configured to obtain, through the recurrent neural network and based on first pose estimation information of the moving body within a first period and the comprehensive moving-body feature, target pose estimation information of the moving body within the target period, the first period being the unit period preceding the target period, and the target pose estimation information characterizing the estimated pose of the moving body in the target period; and

the determination unit 44 is configured to determine, based on the target pose estimation information and preset reference pose information, the Euclidean distance between the estimated pose and the preset reference pose, the Euclidean distance describing the magnitude of the difference between the estimated pose and the preset reference pose.
As an embodiment of the present application, the acquisition and input unit 41 is specifically configured to extract an RGB picture set and a motion optical-flow picture set of the video image to be recognized, and input the RGB picture set and the motion optical-flow picture set into the dual-stream 3D convolutional neural network of the trained dual-stream long- and short-term video pose estimation model.

As an embodiment of the present application, the first execution unit 42 is specifically configured to extract, through the dual-stream 3D convolutional neural network and based on the RGB picture set and the motion optical-flow picture set, the appearance features and motion features of the moving body in the video image to be recognized within the target period, and concatenate the appearance features with the motion features to obtain the comprehensive moving-body feature for the target period.

As an embodiment of the present application, the second execution unit 43 is specifically configured to: if a preceding unit period exists for the target period, identify the unit period preceding the target period as the first period, input the first pose estimation information corresponding to the first period together with the comprehensive moving-body feature into the recurrent neural network, and perform state-vector computation through the recurrent neural network according to the first pose estimation information and the comprehensive moving-body feature to obtain the target pose estimation information of the moving body within the target period; and if no preceding unit period exists for the target period, set the first pose estimation information to empty, input the comprehensive moving-body feature into the recurrent neural network, and perform state-vector computation through the recurrent neural network according to the comprehensive moving-body feature to obtain the target pose estimation information of the moving body within the target period.
As an embodiment of the present application, the motion posture recognition apparatus 400 further includes:

a third execution unit 45, configured to output, if the Euclidean distance is equal to or greater than a preset threshold, motion-posture correction information for correcting the moving body; and

a fourth execution unit 46, configured to determine, if the Euclidean distance is less than the preset threshold, a target image corresponding to the estimated pose from the video image to be recognized.
As an embodiment of the present application, the motion posture recognition apparatus 400 further includes:

an acquisition unit 47, configured to acquire a sample video file containing a moving body;

a sample generation unit 48, configured to generate a training sample set based on the sample video file; and

a training unit 49, configured to train a dual-stream long- and short-term video pose estimation model with the training sample set to obtain the trained dual-stream long- and short-term video pose estimation model.

As an embodiment of the present application, the sample generation unit 48 is specifically configured to segment the sample video file into T video segments, where T is an integer greater than 0 and each video segment is correspondingly configured with three-dimensional key-point information of the moving body, and take the T video segments and the three-dimensional key-point information corresponding to each video segment as the training sample set.
As can be seen from the above, the motion posture recognition apparatus provided by this embodiment performs motion posture recognition on a video image to be recognized that contains a moving body in a target period through a trained dual-stream long- and short-term video pose estimation model: the dual-stream 3D convolutional neural network in the trained model extracts the comprehensive moving-body feature from the video image to be recognized, and the recurrent neural network then obtains the target pose estimation information of the moving body within the target period based on the first pose estimation information of the moving body within the first period and the comprehensive moving-body feature. Since the first period is the unit period preceding the target period, recursive pose prediction is achieved, so the estimated pose characterized by the resulting target pose estimation information has temporal coherence, which makes the determined Euclidean distance between the estimated pose and the preset reference pose more accurate and improves the efficiency of motion posture recognition of the moving body.

In addition, comparing the Euclidean distance against the preset threshold allows the motion posture recognition scheme to be applied in more fields, broadening the applicability of moving-body motion posture recognition technology.
FIG. 5 is a structural block diagram of a terminal device provided by another embodiment of the present application. As shown in FIG. 5, the terminal device 5 of this embodiment includes: a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50, for example, a program of a motion posture recognition method. When executing the computer program 52, the processor 50 implements the steps in the above embodiments of the motion posture recognition method, such as S11 to S14 shown in FIG. 1. Alternatively, when executing the computer program 52, the processor 50 implements the functions of the units in the embodiment corresponding to FIG. 4, for example, the functions of units 41 to 44 shown in FIG. 4; for details, refer to the related description in the embodiment corresponding to FIG. 4, which will not be repeated here.

Exemplarily, the computer program 52 may be divided into one or more units, which are stored in the memory 51 and executed by the processor 50 to complete the present application. The one or more units may be a series of computer program instruction segments capable of accomplishing specific functions, the instruction segments being used to describe the execution process of the computer program 52 in the terminal device 5. For example, the computer program 52 may be divided into an acquisition and input unit, a first execution unit, a second execution unit, and a determination unit, with the specific functions of each unit as described above.

The terminal device may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will understand that FIG. 5 is merely an example of the terminal device 5 and does not constitute a limitation on the terminal device 5, which may include more or fewer components than shown, or combine certain components, or different components; for example, the terminal device may also include input/output devices, network access devices, a bus, and the like.
The processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 5. Further, the memory 51 may include both the internal storage unit of the terminal device 5 and an external storage device. The memory 51 is used to store the computer program and other programs and data required by the terminal device. The memory 51 may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent substitutions for some of the technical features therein, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application; all of them shall fall within the protection scope of the present application.

Claims (10)

  1. A motion posture recognition method, characterized by comprising:
    acquiring a video image to be recognized that contains a moving body in a target period, and inputting the video image to be recognized into a trained dual-stream long- and short-term video pose estimation model, the trained dual-stream long- and short-term video pose estimation model comprising a dual-stream 3D convolutional neural network and a recurrent neural network;
    performing feature extraction of the moving body on the video image to be recognized through the dual-stream 3D convolutional neural network to obtain a comprehensive moving-body feature for the target period;
    obtaining, through the recurrent neural network and based on first pose estimation information of the moving body within a first period and the comprehensive moving-body feature, target pose estimation information of the moving body within the target period, the first period being the unit period preceding the target period, and the target pose estimation information characterizing the estimated pose of the moving body in the target period; and
    determining, based on the target pose estimation information and preset reference pose information, the Euclidean distance between the estimated pose and the preset reference pose, the Euclidean distance describing the magnitude of the difference between the estimated pose and the preset reference pose.
  2. The motion posture recognition method according to claim 1, wherein acquiring the video image to be recognized that contains the moving body in the target period and inputting the video image to be recognized into the trained dual-stream long- and short-term video pose estimation model comprises:
    extracting an RGB picture set and a motion optical-flow picture set of the video image to be recognized; and
    inputting the RGB picture set and the motion optical-flow picture set into the dual-stream 3D convolutional neural network of the trained dual-stream long- and short-term video pose estimation model.
  3. The motion posture recognition method according to claim 2, wherein performing feature extraction of the moving body on the video image to be recognized through the dual-stream 3D convolutional neural network to obtain the comprehensive moving-body feature for the target period comprises:
    extracting, through the dual-stream 3D convolutional neural network and based on the RGB picture set and the motion optical-flow picture set, appearance features and motion features of the moving body in the video image to be recognized within the target period; and
    concatenating the appearance features with the motion features to obtain the comprehensive moving-body feature for the target period.
  4. The motion posture recognition method according to claim 1, wherein obtaining, through the recurrent neural network and based on the first pose estimation information of the moving body within the first period and the comprehensive moving-body feature, the target pose estimation information of the moving body within the target period comprises:
    if a preceding unit period exists for the target period, identifying the unit period preceding the target period as the first period, inputting the first pose estimation information corresponding to the first period together with the comprehensive moving-body feature into the recurrent neural network, and performing state-vector computation through the recurrent neural network according to the first pose estimation information and the comprehensive moving-body feature to obtain the target pose estimation information of the moving body within the target period; and
    if no preceding unit period exists for the target period, setting the first pose estimation information to empty, inputting the comprehensive moving-body feature into the recurrent neural network, and performing state-vector computation through the recurrent neural network according to the comprehensive moving-body feature to obtain the target pose estimation information of the moving body within the target period.
  5. The motion posture recognition method according to any one of claims 1 to 4, further comprising, after the step of determining, based on the target pose estimation information and the preset reference pose information, the Euclidean distance between the estimated pose and the preset reference pose:
    if the Euclidean distance is equal to or greater than a preset threshold, outputting prompt information; and
    if the Euclidean distance is less than the preset threshold, determining, from the video image to be recognized, a target image corresponding to the estimated pose.
  6. The motion posture recognition method according to any one of claims 1 to 4, further comprising:
    acquiring a sample video file containing a moving body;
    generating a training sample set based on the sample video file; and
    training a dual-stream long- and short-term video pose estimation model with the training sample set to obtain the trained dual-stream long- and short-term video pose estimation model.
  7. The motion posture recognition method according to claim 6, wherein generating the training sample set based on the sample video file comprises:
    segmenting the sample video file into T video segments, where T is an integer greater than 0 and each video segment is correspondingly configured with three-dimensional key-point information of the moving body; and
    taking the T video segments and the three-dimensional key-point information corresponding to each video segment as the training sample set.
  8. A motion posture recognition apparatus, characterized by comprising:
    an acquisition and input unit, configured to acquire a video image to be recognized that contains a moving body in a target period, and input the video image to be recognized into a trained dual-stream long- and short-term video pose estimation model, the trained dual-stream long- and short-term video pose estimation model comprising a dual-stream 3D convolutional neural network and a recurrent neural network;
    a first execution unit, configured to perform feature extraction of the moving body on the video image to be recognized through the dual-stream 3D convolutional neural network to obtain a comprehensive moving-body feature for the target period;
    a second execution unit, configured to obtain, through the recurrent neural network and based on first pose estimation information of the moving body within a first period and the comprehensive moving-body feature, target pose estimation information of the moving body within the target period, the first period being the unit period preceding the target period, and the target pose estimation information characterizing the estimated pose of the moving body in the target period; and
    a determination unit, configured to determine, based on the target pose estimation information and preset reference pose information, the Euclidean distance between the estimated pose and the preset reference pose, the Euclidean distance describing the magnitude of the difference between the estimated pose and the preset reference pose.
  9. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a computer program stored in the memory and executable on the terminal device, and when executing the computer program, the processor implements the steps of the motion posture recognition method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the steps of the motion posture recognition method according to any one of claims 1 to 7 are implemented.
PCT/CN2020/128854 2019-11-21 2020-11-13 Motion posture recognition method, motion posture recognition apparatus, terminal device and medium WO2021098616A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911148347.XA CN110942006B (zh) 2019-11-21 2019-11-21 Motion posture recognition method, motion posture recognition apparatus, terminal device and medium
CN201911148347.X 2019-11-21

Publications (1)

Publication Number Publication Date
WO2021098616A1 true WO2021098616A1 (zh) 2021-05-27

Family

ID=69907853

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128854 WO2021098616A1 (zh) 2019-11-21 2020-11-13 Motion posture recognition method, motion posture recognition apparatus, terminal device and medium

Country Status (2)

Country Link
CN (1) CN110942006B (zh)
WO (1) WO2021098616A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942006B (zh) 2019-11-21 2023-04-18 中国科学院深圳先进技术研究院 Motion posture recognition method, motion posture recognition apparatus, terminal device and medium
CN111669636B (zh) * 2020-06-19 2022-02-25 海信视像科技股份有限公司 Video recording method with audio-picture synchronization and display device
CN111967522B (zh) * 2020-08-19 2022-02-25 南京图格医疗科技有限公司 Image sequence classification method based on a funnel convolution structure
CN112434604A (zh) * 2020-11-24 2021-03-02 中国科学院深圳先进技术研究院 Action period localization method based on video features and computer device
CN113705329A (zh) * 2021-07-07 2021-11-26 浙江大华技术股份有限公司 Re-identification method, training method of a target re-identification network, and related devices
CN114842550B (zh) * 2022-03-31 2023-01-24 合肥的卢深视科技有限公司 Foul behavior detection method and apparatus, electronic device, and storage medium
CN114677666B (zh) * 2022-03-31 2024-05-31 东风商用车有限公司 Method and system for detecting cab motion posture in vibration tests
CN114419526B (zh) * 2022-03-31 2022-09-09 合肥的卢深视科技有限公司 Foul behavior detection method and apparatus, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256433A (zh) * 2017-12-22 2018-07-06 银河水滴科技(北京)有限公司 Motion posture evaluation method and system
US20190130578A1 (en) * 2017-10-27 2019-05-02 Siemens Healthcare Gmbh Vascular segmentation using fully convolutional and recurrent neural networks
CN110096938A (zh) * 2018-01-31 2019-08-06 腾讯科技(深圳)有限公司 Method and apparatus for processing action behavior in video
CN110427900A (zh) * 2019-08-07 2019-11-08 广东工业大学 Method, apparatus and device for intelligently guiding fitness
CN110942006A (zh) * 2019-11-21 2020-03-31 中国科学院深圳先进技术研究院 Motion posture recognition method, motion posture recognition apparatus, terminal device and medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569753A (zh) * 2021-07-29 2021-10-29 杭州逗酷软件科技有限公司 Method and apparatus for comparing actions in video, storage medium, and electronic device
CN113569753B (zh) * 2021-07-29 2024-05-31 杭州逗酷软件科技有限公司 Method and apparatus for comparing actions in video, storage medium, and electronic device
CN113807318A (zh) * 2021-10-11 2021-12-17 南京信息工程大学 Action recognition method based on a dual-stream convolutional neural network and bidirectional GRU
CN113807318B (zh) * 2021-10-11 2023-10-31 南京信息工程大学 Action recognition method based on a dual-stream convolutional neural network and bidirectional GRU
CN115035395A (zh) * 2022-07-07 2022-09-09 北京拙河科技有限公司 Security analysis method and apparatus for airport terminal scenes
CN115035395B (zh) * 2022-07-07 2023-11-10 北京拙河科技有限公司 Security analysis method and apparatus for airport terminal scenes

Also Published As

Publication number Publication date
CN110942006A (zh) 2020-03-31
CN110942006B (zh) 2023-04-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20890430

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20890430

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.05.2023)
