CN114463784A - Multi-person rope skipping analysis method based on video-audio multi-mode deep learning - Google Patents

Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Info

Publication number
CN114463784A
CN114463784A (application number CN202210091782.9A)
Authority
CN
China
Prior art keywords
audio
video
signal
rope skipping
preprocessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210091782.9A
Other languages
Chinese (zh)
Inventor
朱亮亮
熊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kaiwang Hangzhou Technology Co ltd
Original Assignee
Kaiwang Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kaiwang Hangzhou Technology Co ltd filed Critical Kaiwang Hangzhou Technology Co ltd
Priority to CN202210091782.9A priority Critical patent/CN114463784A/en
Publication of CN114463784A publication Critical patent/CN114463784A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Abstract

The invention discloses a multi-person rope skipping analysis method based on video-audio multi-modal deep learning, which comprises the following steps: obtaining an audio-video file of the rope skipping process and separating it into video and audio; detecting and extracting portraits from the video image signal, tracking a target portrait, extracting the skeletal feature point coordinates of the target portrait and preprocessing them; obtaining a single-channel audio signal, slicing and windowing it, applying a time-frequency transform to obtain a spectrum signal and preprocessing it; fusing the preprocessed video signal and the preprocessed audio signal into a video-audio fusion signal, passing the fusion signal through a bidirectional long short-term memory recurrent convolutional neural network and a cascaded fully connected network to obtain an output signal stream, and converting the stream into a square wave signal; and, after filtering, performing statistical analysis on the rising or falling edges. The invention can effectively filter out interference from testers who are not actually skipping rope and achieve more accurate rope skipping statistics.

Description

Multi-person rope skipping analysis method based on video-audio multi-mode deep learning
Technical Field
The invention relates to the technical field of deep learning, in particular to a multi-person rope skipping analysis method based on video-audio multi-modal deep learning.
Background
Rope skipping has a long history and is one of the oldest sports games. It has also become a very popular, quick and effective form of exercise in today's fast-paced life.
At present, there are mainly two kinds of intelligent rope skipping counting schemes on the market: one builds the counter into the skipping rope itself, and the other relies on video monitoring. Intelligent hardware built into the rope requires purchasing additional equipment, which makes it inconvenient to popularize, whereas video acquisition equipment is already widespread. Existing schemes that analyze rope skipping performance from video mostly rely on images alone to analyze the up-and-down movement of the portrait and cannot exclude failed attempts during the rope skipping process, so it is difficult to record the rope skipping score accurately.
Therefore, how to provide a multi-person rope skipping analysis method based on video-audio multi-modal deep learning that can accurately record rope skipping scores is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a multi-person rope skipping analysis method based on video-audio multi-modal deep learning that can accurately record rope skipping scores.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-person rope skipping analysis method based on video-audio multi-mode deep learning comprises the following steps:
s1, obtaining an audio-video file in a rope skipping process, and separating video and audio in the audio-video file to obtain a video image signal and a stereo audio signal;
s2, detecting and extracting a portrait aiming at a video image signal, tracking a target portrait, extracting a skeleton characteristic point coordinate of the target portrait, preprocessing the skeleton characteristic point coordinate, and acquiring a preprocessed video signal;
s3, acquiring a single-channel audio signal in the stereo audio signal, slicing and intercepting the single-channel audio signal, performing time-frequency transformation to obtain a frequency spectrum signal, preprocessing, and acquiring a preprocessed audio signal;
s4, fusing the preprocessed video signal and the preprocessed audio signal to obtain a video-audio fusion signal, enabling the video-audio fusion signal to pass through a bidirectional long-time and short-time memory cyclic convolution neural network and a cascaded full connection layer to obtain an output signal stream, and converting the output signal stream into a square wave signal;
s5, filtering the square wave signals, and performing statistical analysis on rising edges or falling edges, wherein one rising edge or falling edge is a real rope skipping action, so that the rope skipping statistical analysis is obtained.
Preferably, when the rope skipping process is recorded in S1, the rope skipping tester is filmed by a video acquisition device to obtain the audio-video file of the rope skipping process, and it is ensured that the tester's whole body remains within the viewfinder frame throughout the rope skipping process.
Preferably, the acquired audio-video file is captured in real time or exported after recording.
Preferably, the specific content of S2 includes:
S21, detecting and extracting portraits using a pre-trained YOLOv5 network model;
S22, associating the target portrait boxes between frames with the KM (Kuhn-Munkres) algorithm, so that the same target portrait is tracked from frame to frame;
S23, performing frame-by-frame 32-point 3D body modeling with the BlazePose algorithm to obtain the 32 skeletal feature point coordinates of the target in each frame;
S24, performing inter-frame smoothing of the skeletal feature point coordinates with a Kalman filtering algorithm, and up-converting the frame rate of the smoothed coordinates by Kalman filter prediction to obtain the preprocessed video signal.
Preferably, the specific content of S3 includes:
S31, obtaining a single-channel audio signal from the stereo audio signal, resampling the single-channel signal for short-time Fourier analysis, slicing the audio signal into segments of a preset duration, and intercepting frames from each segment with a Hamming window after determining the frame shift length;
S32, performing a time-frequency transform on each intercepted audio frame with the fast Fourier transform to obtain a spectrum signal;
S33, performing amplitude suppression on the audio spectrum signal by power compression A^0.3, where A is the input audio time-series signal, so that excessively loud interfering audio does not mask the effective signal, thereby obtaining the preprocessed audio signal.
Preferably, the specific method for fusing the preprocessed video signal and the preprocessed audio signal in S4 to obtain the video-audio fusion signal is as follows:
the preprocessed video signal and the preprocessed audio signal are input into their respective convolutional networks, and the outputs of the two convolutional networks are fused by cascading (concatenation) to obtain the video-audio fusion signal.
Preferably, there are one or more target portraits.
Through the above technical scheme, compared with the prior art, the invention discloses a multi-person rope skipping analysis method based on video-audio multi-modal deep learning. The method fuses the video and audio of a rope skipping recording into a video-audio fusion signal, obtains an output signal stream through deep learning, converts the stream into a square wave signal, and analyzes each tester's overall rope skipping state by statistically analyzing the rising edges of that square wave. The analysis can filter out each tester's unsuccessful rope skipping attempts. It solves the technical problems that single-modality video analysis can hardly avoid wrongly counting jumps made without a rope, while single-modality audio counting cannot filter out rope swinging without jumping. It thus filters out conditions such as jumping without a rope, swinging the rope without jumping, rope breakage or interruption of the jump, as well as interference from testers who are not skipping, and achieves more accurate rope skipping statistics.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to the present invention;
fig. 2 is a schematic structural diagram of a multi-modal video-audio network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a multi-person rope skipping analysis method based on video-audio multi-modal deep learning, comprising the following steps:
S1, obtaining an audio-video file of the rope skipping process, and separating the video and audio in the audio-video file to obtain a video image signal and a stereo audio signal;
S2, detecting and extracting portraits from the video image signal, tracking a target portrait, extracting the skeletal feature point coordinates of the target portrait, preprocessing these coordinates, and obtaining a preprocessed video signal;
S3, obtaining a single-channel audio signal from the stereo audio signal, slicing and windowing the single-channel audio signal, performing a time-frequency transform to obtain a spectrum signal, preprocessing it, and obtaining a preprocessed audio signal;
S4, fusing the preprocessed video signal and the preprocessed audio signal to obtain a video-audio fusion signal, passing the video-audio fusion signal through a bidirectional long short-term memory (BiLSTM) recurrent convolutional neural network and a cascaded fully connected layer to obtain an output signal stream, and converting the output signal stream into a square wave signal;
S5, filtering the square wave signal and performing statistical analysis on its rising or falling edges, where each rising edge (or each falling edge) corresponds to one real rope skipping action, thereby obtaining the rope skipping statistics.
It should be noted that:
In this embodiment, the video frame rate is 25 FPS and each input video slice spans 100 ms per slot, i.e. 4 frames of images. As shown in fig. 2, the fully connected part is a 3-layer fully connected network, the output is a 0 (low level) / 1 (high level) signal, and the output signal stream is converted into a square wave signal with a level holding time of 100 ms.
In S5, the square wave signal is passed through a low-pass filter with a cut-off frequency of 20 Hz to filter out high-frequency interference.
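The following is a minimal, illustrative sketch of the counting logic in S5 under the parameters stated above (100 ms level holding time, 20 Hz low-pass cut-off); the function names, the Butterworth filter and the 100 Hz internal sample rate are assumptions for illustration, not details taken from the patent.

```python
# Sketch only: hold each 0/1 network output for its 100 ms slot, low-pass filter
# the resulting square wave at 20 Hz, and count rising edges as valid jumps.
import numpy as np
from scipy.signal import butter, filtfilt

def count_jumps(levels, slot_ms=100, fs=100.0, cutoff_hz=20.0):
    # Hold each slot value to build a square wave sampled at fs Hz.
    samples_per_slot = int(round(fs * slot_ms / 1000.0))   # 10 samples per 100 ms slot
    square = np.repeat(np.asarray(levels, dtype=float), samples_per_slot)

    # 2nd-order Butterworth low-pass (20 Hz cut-off), applied forward-backward.
    b, a = butter(2, cutoff_hz / (fs / 2.0), btype="low")
    smoothed = filtfilt(b, a, square)

    # Re-binarise and count rising edges: each 0 -> 1 transition is one jump.
    binary = (smoothed > 0.5).astype(int)
    return int(np.count_nonzero(np.diff(binary) == 1))

# Example: two separate bursts of "jump" slots are counted as two jumps.
print(count_jumps([0, 1, 1, 0, 0, 1, 0, 0]))   # -> 2
```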
In order to further implement the above technical solution, when the rope skipping process is recorded in S1, the rope skipping tester is filmed by a video acquisition device to obtain an audio-video file of the rope skipping process, and it is ensured that the tester's whole body remains within the viewfinder frame throughout the rope skipping process.
In order to further implement the above technical solution, the acquired audio-video file is captured in real time or exported after recording.
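As an illustration of the separation in S1, the recorded file could be demultiplexed into a video-only stream and a stereo audio track, for example with ffmpeg invoked from Python; the tool choice and the file names are assumptions, since the patent does not prescribe any particular demuxer.

```python
# Sketch only: split one recording into a video image stream and a stereo audio track.
import subprocess

def separate_av(src, video_out="video.mp4", audio_out="audio.wav"):
    # Keep the video stream only (drop audio), copying it without re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out], check=True)
    # Keep the audio only, exported as a 2-channel (stereo) WAV file.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-ac", "2", audio_out], check=True)
    return video_out, audio_out
```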
In order to further implement the above technical solution, the specific content of S2 includes:
S21, detecting and extracting portraits using a pre-trained YOLOv5 network model;
S22, associating the target portrait boxes between frames with the KM (Kuhn-Munkres) algorithm, so that the same target portrait is tracked from frame to frame;
S23, performing frame-by-frame 32-point 3D body modeling with the BlazePose algorithm to obtain the 32 skeletal feature point coordinates of the target in each frame;
S24, performing inter-frame smoothing of the skeletal feature point coordinates with a Kalman filtering algorithm, and up-converting the frame rate of the smoothed coordinates by Kalman filter prediction to obtain the preprocessed video signal.
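A minimal sketch of the video branch S21 to S24 is given below, assuming off-the-shelf models: YOLOv5 loaded through torch.hub for person detection, MediaPipe's BlazePose for body landmarks (MediaPipe exposes 33 landmarks, whereas the text names 32), and a simple constant-velocity Kalman filter whose half-frame prediction up-converts the 25 FPS landmark stream to 50 FPS. The inter-frame association with the KM (Kuhn-Munkres) algorithm is omitted for brevity, and all names are illustrative rather than the patented implementation.

```python
import cv2
import numpy as np
import torch
import mediapipe as mp

detector = torch.hub.load("ultralytics/yolov5", "yolov5s")   # pretrained person detector
pose = mp.solutions.pose.Pose(static_image_mode=False)       # BlazePose landmark model

class ScalarKalman:
    """Constant-velocity Kalman filter for a single landmark coordinate."""
    def __init__(self, q=1e-3, r=1e-2):
        self.x, self.P = np.zeros(2), np.eye(2)               # state: [position, velocity]
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])
        self.H = np.array([[1.0, 0.0]])
        self.Q, self.R = q * np.eye(2), np.array([[r]])

    def step(self, z):
        # Predict to the next video frame, then correct with the measurement z.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        K = self.P @ self.H.T @ np.linalg.inv(self.H @ self.P @ self.H.T + self.R)
        self.x = self.x + (K @ (np.array([z]) - self.H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        # Smoothed value plus a half-frame prediction (25 FPS -> 50 FPS up-conversion).
        return self.x[0], self.x[0] + 0.5 * self.x[1]

def landmark_stream(video_path):
    """Yield smoothed landmark y-coordinates, one Kalman filter per landmark index."""
    cap = cv2.VideoCapture(video_path)
    filters = {}
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        _ = detector(rgb)           # person boxes; per-person cropping / KM tracking omitted
        result = pose.process(rgb)  # BlazePose landmarks for the dominant person
        if result.pose_landmarks is None:
            continue
        for i, lm in enumerate(result.pose_landmarks.landmark):
            smoothed, predicted_mid = filters.setdefault(i, ScalarKalman()).step(lm.y)
            yield i, smoothed, predicted_mid
    cap.release()
```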
It should be noted that:
In this embodiment, the preprocessed video signal obtained in S2 consists of 8 frames of data.
In order to further implement the above technical solution, the specific content of S3 includes:
S31, obtaining a single-channel audio signal from the stereo audio signal, resampling the single-channel signal for short-time Fourier analysis, slicing the audio signal into segments of a preset duration, and intercepting frames from each segment with a Hamming window after determining the frame shift length;
S32, performing a time-frequency transform on each intercepted audio frame with the fast Fourier transform to obtain a spectrum signal;
S33, performing amplitude suppression on the audio spectrum signal by power compression A^0.3, where A is the input audio time-series signal, so that excessively loud interfering audio does not mask the effective signal, thereby obtaining the preprocessed audio signal.
It should be noted that:
In this embodiment, the audio signal is resampled to 16 kHz and analyzed with a short-time Fourier transform (STFT): it is sliced into 100 ms segments, each segment is framed with a 30 ms Hamming window and a frame shift of 10 ms, and a 512-point FFT is used as the time-frequency transform to obtain the spectrum signal.
Amplitude suppression is performed on the audio spectrum using power compression with exponent p = 0.3. Each 100 ms audio slice is traversed by the 30 ms Hamming window to obtain 8 short-time frames (the same frame rate as the video after its two-fold up-sampling), and a fast Fourier transform of each frame yields a 257 x 8 x 2 array of scalar values, which serves as the preprocessed audio signal.
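A minimal sketch of the audio branch with the parameters just described (16 kHz resampling, 100 ms slices, 30 ms Hamming window, 10 ms frame shift, 512-point FFT, power compression with exponent 0.3, output of shape 257 x 8 x 2 per slice) might look as follows; the use of librosa for loading and resampling is an assumption.

```python
import numpy as np
import librosa

def preprocess_audio(path, sr=16000, slice_ms=100, win_ms=30, hop_ms=10, n_fft=512, p=0.3):
    audio, _ = librosa.load(path, sr=sr, mono=True)          # resample and mix down to mono
    slice_len = sr * slice_ms // 1000                         # 1600 samples per 100 ms slice
    win = np.hamming(sr * win_ms // 1000)                     # 480-sample Hamming window
    hop = sr * hop_ms // 1000                                  # 160-sample frame shift
    slices = []
    for start in range(0, len(audio) - slice_len + 1, slice_len):
        chunk = audio[start:start + slice_len]
        frames = []
        for f in range(0, slice_len - len(win) + hop, hop):    # 8 frames per 100 ms slice
            segment = chunk[f:f + len(win)] * win
            spec = np.fft.rfft(segment, n=n_fft)               # 257 complex bins
            mag = np.abs(spec) ** p                            # power compression |X|^0.3
            compressed = mag * np.exp(1j * np.angle(spec))     # keep the phase
            frames.append(np.stack([compressed.real, compressed.imag], axis=-1))
        slices.append(np.stack(frames, axis=1))                # shape (257, 8, 2) per slice
    return np.stack(slices)                                    # (num_slices, 257, 8, 2)
```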
In order to further implement the above technical solution, the specific method for fusing the preprocessed video signal and the preprocessed audio signal in S4 to obtain the video-audio fusion signal is as follows:
the preprocessed video signal and the preprocessed audio signal are input into their respective convolutional networks, and the outputs of the two convolutional networks are fused by cascading (concatenation) to obtain the video-audio fusion signal.
It should be noted that:
The convolutional network parameters for the video branch and the audio branch are shown in the following tables:
TABLE 1 Video input signal multilayer convolutional network (layer parameters reproduced only as an image in the original publication)
TABLE 2 Audio input signal multilayer convolutional network (layer parameters reproduced only as an image in the original publication)
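Since the exact layer parameters of Tables 1 and 2 are reproduced only as images, the following PyTorch sketch uses placeholder convolutional branches; the per-slot video input shape (8 frames x 32 landmarks x 3 coordinates) and all layer sizes are assumptions. It only illustrates the structure described in S4: two modality-specific convolutional branches, cascade (concatenation) fusion, a bidirectional LSTM over the slot sequence, and a cascaded 3-layer fully connected head emitting one high/low level per 100 ms slot.

```python
import torch
import torch.nn as nn

class AVFusionNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Video branch: treat the 8-frame skeleton slice as a (3, 8, 32) "image".
        self.video_conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Audio branch: treat the compressed spectrogram slice as a (2, 257, 8) "image".
        self.audio_conv = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fused per-slot features are modelled as a sequence by a bidirectional LSTM.
        self.bilstm = nn.LSTM(64 + 64, hidden, batch_first=True, bidirectional=True)
        # Cascaded 3-layer fully connected head -> probability of "jump in this slot".
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, video, audio):
        # video: (batch, slots, 3, 8, 32); audio: (batch, slots, 2, 257, 8)
        b, t = video.shape[:2]
        v = self.video_conv(video.flatten(0, 1)).view(b, t, -1)
        a = self.audio_conv(audio.flatten(0, 1)).view(b, t, -1)
        fused = torch.cat([v, a], dim=-1)            # cascade (concatenation) fusion
        seq, _ = self.bilstm(fused)
        return self.head(seq).squeeze(-1)             # (batch, slots); threshold at 0.5 -> 0/1

# Usage sketch: one batch of 20 slots (2 seconds) of fused input.
net = AVFusionNet()
levels = net(torch.randn(1, 20, 3, 8, 32), torch.randn(1, 20, 2, 257, 8))
print(levels.shape)   # torch.Size([1, 20])
```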
In order to further implement the above technical solution, there are one or more target portraits.
It should be noted that:
When multi-person rope skipping analysis is performed, the audio analysis part is kept the same for every tester, while each tester's own portrait tracking and skeleton extraction results are fed as the video input of the multi-modal neural network, so that rope skipping statistics can be produced for each tester individually.
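Illustratively, multi-person analysis could then loop over the tracked testers, reusing the sketches above: the audio slices are shared, while each tester contributes an individual skeleton stream. The helper skeleton_stream_for is hypothetical, and net and count_jumps refer to the earlier sketches.

```python
# Sketch only: one independent jump count per tracked tester, shared audio input.
def per_tester_counts(tracked_ids, shared_audio_slices):
    counts = {}
    for tester_id in tracked_ids:
        video_slices = skeleton_stream_for(tester_id)            # per-tester video branch input
        levels = net(video_slices, shared_audio_slices) > 0.5     # shared audio branch input
        counts[tester_id] = count_jumps(levels.squeeze(0).int().tolist())
    return counts
```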
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments can be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A multi-person rope skipping analysis method based on video-audio multi-modal deep learning, characterized by comprising the following steps:
S1, obtaining an audio-video file of the rope skipping process, and separating the video and audio in the audio-video file to obtain a video image signal and a stereo audio signal;
S2, detecting and extracting portraits from the video image signal, tracking a target portrait, extracting the skeletal feature point coordinates of the target portrait, preprocessing these coordinates, and obtaining a preprocessed video signal;
S3, obtaining a single-channel audio signal from the stereo audio signal, slicing and windowing the single-channel audio signal, performing a time-frequency transform to obtain a spectrum signal, preprocessing it, and obtaining a preprocessed audio signal;
S4, fusing the preprocessed video signal and the preprocessed audio signal to obtain a video-audio fusion signal, passing the video-audio fusion signal through a bidirectional long short-term memory (BiLSTM) recurrent convolutional neural network and a cascaded fully connected layer to obtain an output signal stream, and converting the output signal stream into a square wave signal;
S5, filtering the square wave signal and performing statistical analysis on its rising or falling edges, where each rising edge (or each falling edge) corresponds to one real rope skipping action, thereby obtaining the rope skipping statistics.
2. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein when the rope skipping process is recorded in step S1, a rope skipping tester is filmed by a video acquisition device to obtain the audio-video file of the rope skipping process, and it is ensured that the tester's whole body remains within the viewfinder frame throughout the rope skipping process.
3. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning as claimed in claim 1, wherein the acquired audio/video file is an audio/video file acquired in real time or exported after being recorded.
4. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein the specific content of S2 includes:
S21, detecting and extracting portraits using a pre-trained YOLOv5 network model;
S22, associating the target portrait boxes between frames with the KM (Kuhn-Munkres) algorithm, so that the same target portrait is tracked from frame to frame;
S23, performing frame-by-frame 32-point 3D body modeling with the BlazePose algorithm to obtain the 32 skeletal feature point coordinates of the target in each frame;
S24, performing inter-frame smoothing of the skeletal feature point coordinates with a Kalman filtering algorithm, and up-converting the frame rate of the smoothed coordinates by Kalman filter prediction to obtain the preprocessed video signal.
5. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein the specific content of S3 includes:
S31, obtaining a single-channel audio signal from the stereo audio signal, resampling the single-channel signal for short-time Fourier analysis, slicing the audio signal into segments of a preset duration, and intercepting frames from each segment with a Hamming window after determining the frame shift length;
S32, performing a time-frequency transform on each intercepted audio frame with the fast Fourier transform to obtain a spectrum signal;
S33, performing amplitude suppression on the audio spectrum signal by power compression A^0.3, where A is the input audio time-series signal, so that excessively loud interfering audio does not mask the effective signal, thereby obtaining the preprocessed audio signal.
6. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein the step S4 of fusing the preprocessed video signal and the preprocessed audio signal to obtain the video-audio fusion signal comprises:
inputting the preprocessed video signal and the preprocessed audio signal into their respective convolutional networks, and fusing the outputs of the two convolutional networks by cascading (concatenation) to obtain the video-audio fusion signal.
7. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein there are one or more target portraits.
CN202210091782.9A 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning Pending CN114463784A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210091782.9A CN114463784A (en) 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210091782.9A CN114463784A (en) 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Publications (1)

Publication Number Publication Date
CN114463784A true CN114463784A (en) 2022-05-10

Family

ID=81412330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210091782.9A Pending CN114463784A (en) 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Country Status (1)

Country Link
CN (1) CN114463784A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797943A (en) * 2023-02-08 2023-03-14 广州数说故事信息科技有限公司 Multimode-based video text content extraction method, system and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination