CN114463784A - Multi-person rope skipping analysis method based on video-audio multi-mode deep learning - Google Patents

Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Info

Publication number
CN114463784A
CN114463784A (application number CN202210091782.9A)
Authority
CN
China
Prior art keywords
audio
video
signal
rope skipping
preprocessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210091782.9A
Other languages
Chinese (zh)
Inventor
朱亮亮
熊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kaiwang Hangzhou Technology Co ltd
Original Assignee
Kaiwang Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kaiwang Hangzhou Technology Co ltd filed Critical Kaiwang Hangzhou Technology Co ltd
Priority to CN202210091782.9A priority Critical patent/CN114463784A/en
Publication of CN114463784A publication Critical patent/CN114463784A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Abstract

The invention discloses a multi-person rope skipping analysis method based on video-audio multi-modal deep learning, which comprises the following steps: obtaining an audio-video file of the rope skipping process and separating it into video and audio; detecting and extracting portraits from the video image signal, tracking a target portrait, extracting the skeletal feature point coordinates of the target portrait and preprocessing them; obtaining a single-channel audio signal, slicing and windowing it, applying a time-frequency transform to obtain a spectrum signal and preprocessing it; fusing the preprocessed video signal and the preprocessed audio signal into a video-audio fusion signal, passing the fusion signal through a bidirectional long short-term memory recurrent convolutional neural network and a cascaded fully connected network to obtain an output signal stream, and converting the stream into a square wave signal; and, after filtering, performing statistical analysis on the rising or falling edges. The invention can effectively filter out interference from testers who are not actually skipping rope and achieve more accurate rope skipping statistics.

Description

Multi-person rope skipping analysis method based on video-audio multi-mode deep learning
Technical Field
The invention relates to the technical field of deep learning, in particular to a multi-person rope skipping analysis method based on video-audio multi-modal deep learning.
Background
Rope skipping has a long history and is one of the oldest sports games. It has also become a very popular, quick and effective form of exercise in today's fast-paced life.
At present, there are mainly two kinds of intelligent rope skipping counting schemes on the market: one builds the counter into the skipping rope itself, and the other relies on video monitoring. Intelligent hardware built into the rope requires purchasing additional equipment, which makes it inconvenient to popularize, whereas video acquisition equipment is already widespread. Existing schemes that analyze rope skipping performance from video mostly rely on images alone to analyze the up-and-down movement of the portrait and cannot exclude failed attempts during the rope skipping process, so it is difficult to record the rope skipping score accurately.
Therefore, how to provide a multi-person rope skipping analysis method based on video-audio multi-modal deep learning that can accurately record rope skipping scores is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a multi-person rope skipping analysis method based on video-audio multi-modal deep learning that can accurately record rope skipping scores.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-person rope skipping analysis method based on video-audio multi-mode deep learning comprises the following steps:
s1, obtaining an audio-video file in a rope skipping process, and separating video and audio in the audio-video file to obtain a video image signal and a stereo audio signal;
s2, detecting and extracting a portrait aiming at a video image signal, tracking a target portrait, extracting a skeleton characteristic point coordinate of the target portrait, preprocessing the skeleton characteristic point coordinate, and acquiring a preprocessed video signal;
s3, acquiring a single-channel audio signal in the stereo audio signal, slicing and intercepting the single-channel audio signal, performing time-frequency transformation to obtain a frequency spectrum signal, preprocessing, and acquiring a preprocessed audio signal;
s4, fusing the preprocessed video signal and the preprocessed audio signal to obtain a video-audio fusion signal, enabling the video-audio fusion signal to pass through a bidirectional long-time and short-time memory cyclic convolution neural network and a cascaded full connection layer to obtain an output signal stream, and converting the output signal stream into a square wave signal;
s5, filtering the square wave signals, and performing statistical analysis on rising edges or falling edges, wherein one rising edge or falling edge is a real rope skipping action, so that the rope skipping statistical analysis is obtained.
Preferably, when the rope skipping process is recorded in S1, the rope skipping tester is filmed by a video acquisition device to obtain the audio-video file of the rope skipping process, and it is ensured that the tester's whole body remains within the viewfinder frame throughout the rope skipping process.
Preferably, the acquired audio-video file is captured in real time or exported after recording.
Preferably, the specific content of S2 includes:
S21, detecting and extracting portraits using a pre-trained YOLOv5 network model;
S22, associating the target portrait boxes between frames with the KM (Kuhn-Munkres) algorithm, so that the same target portrait is tracked from frame to frame;
S23, performing frame-by-frame 32-point 3D body modeling with the BlazePose algorithm to obtain the 32 skeletal feature point coordinates of the target in each frame;
S24, performing inter-frame smoothing of the skeletal feature point coordinates with a Kalman filtering algorithm, and up-converting the frame rate of the smoothed coordinates by Kalman filter prediction to obtain the preprocessed video signal.
Preferably, the specific content of S3 includes:
S31, obtaining a single-channel audio signal from the stereo audio signal, resampling the single-channel signal for short-time Fourier analysis, slicing the audio signal into segments of a preset duration, and intercepting frames from each segment with a Hamming window after determining the frame shift length;
S32, performing a time-frequency transform on each intercepted audio frame with the fast Fourier transform to obtain a spectrum signal;
S33, performing amplitude suppression on the audio spectrum signal by power compression A^0.3, where A is the input audio time-series signal, so that excessively loud interfering audio does not mask the effective signal, thereby obtaining the preprocessed audio signal.
Preferably, the specific method for fusing the preprocessed video signal and the preprocessed audio signal in S4 to obtain the video-audio fusion signal is as follows:
the preprocessed video signal and the preprocessed audio signal are input into their respective convolutional networks, and the outputs of the two convolutional networks are fused by cascading (concatenation) to obtain the video-audio fusion signal.
Preferably, there are one or more target portraits.
Through the above technical scheme, compared with the prior art, the invention discloses a multi-person rope skipping analysis method based on video-audio multi-modal deep learning. The method fuses the video and audio of a rope skipping recording into a video-audio fusion signal, obtains an output signal stream through deep learning, converts the stream into a square wave signal, and analyzes each tester's overall rope skipping state by statistically analyzing the rising edges of that square wave. The analysis can filter out each tester's unsuccessful rope skipping attempts. It solves the technical problems that single-modality video analysis can hardly avoid wrongly counting jumps made without a rope, while single-modality audio counting cannot filter out rope swinging without jumping. It thus filters out conditions such as jumping without a rope, swinging the rope without jumping, rope breakage or interruption of the jump, as well as interference from testers who are not skipping, and achieves more accurate rope skipping statistics.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to the present invention;
fig. 2 is a schematic structural diagram of a multi-modal video-audio network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a multi-person rope skipping analysis method based on video-audio multi-modal deep learning, comprising the following steps:
S1, obtaining an audio-video file of the rope skipping process, and separating the video and audio in the audio-video file to obtain a video image signal and a stereo audio signal;
S2, detecting and extracting portraits from the video image signal, tracking a target portrait, extracting the skeletal feature point coordinates of the target portrait, preprocessing these coordinates, and obtaining a preprocessed video signal;
S3, obtaining a single-channel audio signal from the stereo audio signal, slicing and windowing the single-channel audio signal, performing a time-frequency transform to obtain a spectrum signal, preprocessing it, and obtaining a preprocessed audio signal;
S4, fusing the preprocessed video signal and the preprocessed audio signal to obtain a video-audio fusion signal, passing the video-audio fusion signal through a bidirectional long short-term memory (BiLSTM) recurrent convolutional neural network and a cascaded fully connected layer to obtain an output signal stream, and converting the output signal stream into a square wave signal;
S5, filtering the square wave signal and performing statistical analysis on its rising or falling edges, where each rising edge (or each falling edge) corresponds to one real rope skipping action, thereby obtaining the rope skipping statistics.
It should be noted that:
In this embodiment, the video frame rate is 25 FPS and each input video slice spans 100 ms per slot, i.e. 4 frames of images. As shown in fig. 2, the fully connected part is a 3-layer fully connected network, the output is a 0 (low level) / 1 (high level) signal, and the output signal stream is converted into a square wave signal with a level holding time of 100 ms.
In S5, the square wave signal is passed through a low-pass filter with a cut-off frequency of 20 Hz to filter out high-frequency interference.
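The following is a minimal, illustrative sketch of the counting logic in S5 under the parameters stated above (100 ms level holding time, 20 Hz low-pass cut-off); the function names, the Butterworth filter and the 100 Hz internal sample rate are assumptions for illustration, not details taken from the patent.

```python
# Sketch only: hold each 0/1 network output for its 100 ms slot, low-pass filter
# the resulting square wave at 20 Hz, and count rising edges as valid jumps.
import numpy as np
from scipy.signal import butter, filtfilt

def count_jumps(levels, slot_ms=100, fs=100.0, cutoff_hz=20.0):
    # Hold each slot value to build a square wave sampled at fs Hz.
    samples_per_slot = int(round(fs * slot_ms / 1000.0))   # 10 samples per 100 ms slot
    square = np.repeat(np.asarray(levels, dtype=float), samples_per_slot)

    # 2nd-order Butterworth low-pass (20 Hz cut-off), applied forward-backward.
    b, a = butter(2, cutoff_hz / (fs / 2.0), btype="low")
    smoothed = filtfilt(b, a, square)

    # Re-binarise and count rising edges: each 0 -> 1 transition is one jump.
    binary = (smoothed > 0.5).astype(int)
    return int(np.count_nonzero(np.diff(binary) == 1))

# Example: two separate bursts of "jump" slots are counted as two jumps.
print(count_jumps([0, 1, 1, 0, 0, 1, 0, 0]))   # -> 2
```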
In order to further implement the above technical solution, when the rope skipping process is recorded in S1, the rope skipping tester is filmed by a video acquisition device to obtain an audio-video file of the rope skipping process, and it is ensured that the tester's whole body remains within the viewfinder frame throughout the rope skipping process.
In order to further implement the above technical solution, the acquired audio-video file is captured in real time or exported after recording.
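As an illustration of the separation in S1, the recorded file could be demultiplexed into a video-only stream and a stereo audio track, for example with ffmpeg invoked from Python; the tool choice and the file names are assumptions, since the patent does not prescribe any particular demuxer.

```python
# Sketch only: split one recording into a video image stream and a stereo audio track.
import subprocess

def separate_av(src, video_out="video.mp4", audio_out="audio.wav"):
    # Keep the video stream only (drop audio), copying it without re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out], check=True)
    # Keep the audio only, exported as a 2-channel (stereo) WAV file.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-ac", "2", audio_out], check=True)
    return video_out, audio_out
```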
In order to further implement the above technical solution, the specific content of S2 includes:
S21, detecting and extracting portraits using a pre-trained YOLOv5 network model;
S22, associating the target portrait boxes between frames with the KM (Kuhn-Munkres) algorithm, so that the same target portrait is tracked from frame to frame;
S23, performing frame-by-frame 32-point 3D body modeling with the BlazePose algorithm to obtain the 32 skeletal feature point coordinates of the target in each frame;
S24, performing inter-frame smoothing of the skeletal feature point coordinates with a Kalman filtering algorithm, and up-converting the frame rate of the smoothed coordinates by Kalman filter prediction to obtain the preprocessed video signal.
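A minimal sketch of the video branch S21 to S24 is given below, assuming off-the-shelf models: YOLOv5 loaded through torch.hub for person detection, MediaPipe's BlazePose for body landmarks (MediaPipe exposes 33 landmarks, whereas the text names 32), and a simple constant-velocity Kalman filter whose half-frame prediction up-converts the 25 FPS landmark stream to 50 FPS. The inter-frame association with the KM (Kuhn-Munkres) algorithm is omitted for brevity, and all names are illustrative rather than the patented implementation.

```python
import cv2
import numpy as np
import torch
import mediapipe as mp

detector = torch.hub.load("ultralytics/yolov5", "yolov5s")   # pretrained person detector
pose = mp.solutions.pose.Pose(static_image_mode=False)       # BlazePose landmark model

class ScalarKalman:
    """Constant-velocity Kalman filter for a single landmark coordinate."""
    def __init__(self, q=1e-3, r=1e-2):
        self.x, self.P = np.zeros(2), np.eye(2)               # state: [position, velocity]
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])
        self.H = np.array([[1.0, 0.0]])
        self.Q, self.R = q * np.eye(2), np.array([[r]])

    def step(self, z):
        # Predict to the next video frame, then correct with the measurement z.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        K = self.P @ self.H.T @ np.linalg.inv(self.H @ self.P @ self.H.T + self.R)
        self.x = self.x + (K @ (np.array([z]) - self.H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        # Smoothed value plus a half-frame prediction (25 FPS -> 50 FPS up-conversion).
        return self.x[0], self.x[0] + 0.5 * self.x[1]

def landmark_stream(video_path):
    """Yield smoothed landmark y-coordinates, one Kalman filter per landmark index."""
    cap = cv2.VideoCapture(video_path)
    filters = {}
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        _ = detector(rgb)           # person boxes; per-person cropping / KM tracking omitted
        result = pose.process(rgb)  # BlazePose landmarks for the dominant person
        if result.pose_landmarks is None:
            continue
        for i, lm in enumerate(result.pose_landmarks.landmark):
            smoothed, predicted_mid = filters.setdefault(i, ScalarKalman()).step(lm.y)
            yield i, smoothed, predicted_mid
    cap.release()
```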
It should be noted that:
In this embodiment, the preprocessed video signal obtained in S2 consists of 8 frames of data.
In order to further implement the above technical solution, the specific content of S3 includes:
S31, obtaining a single-channel audio signal from the stereo audio signal, resampling the single-channel signal for short-time Fourier analysis, slicing the audio signal into segments of a preset duration, and intercepting frames from each segment with a Hamming window after determining the frame shift length;
S32, performing a time-frequency transform on each intercepted audio frame with the fast Fourier transform to obtain a spectrum signal;
S33, performing amplitude suppression on the audio spectrum signal by power compression A^0.3, where A is the input audio time-series signal, so that excessively loud interfering audio does not mask the effective signal, thereby obtaining the preprocessed audio signal.
It should be noted that:
In this embodiment, the audio signal is resampled to 16 kHz and analyzed with a short-time Fourier transform (STFT): it is sliced into 100 ms segments, each segment is framed with a 30 ms Hamming window and a frame shift of 10 ms, and a 512-point FFT is used as the time-frequency transform to obtain the spectrum signal.
Amplitude suppression is performed on the audio spectrum using power compression with exponent p = 0.3. Each 100 ms audio slice is traversed by the 30 ms Hamming window to obtain 8 short-time frames (the same frame rate as the video after its two-fold up-sampling), and a fast Fourier transform of each frame yields a 257 x 8 x 2 array of scalar values, which serves as the preprocessed audio signal.
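A minimal sketch of the audio branch with the parameters just described (16 kHz resampling, 100 ms slices, 30 ms Hamming window, 10 ms frame shift, 512-point FFT, power compression with exponent 0.3, output of shape 257 x 8 x 2 per slice) might look as follows; the use of librosa for loading and resampling is an assumption.

```python
import numpy as np
import librosa

def preprocess_audio(path, sr=16000, slice_ms=100, win_ms=30, hop_ms=10, n_fft=512, p=0.3):
    audio, _ = librosa.load(path, sr=sr, mono=True)          # resample and mix down to mono
    slice_len = sr * slice_ms // 1000                         # 1600 samples per 100 ms slice
    win = np.hamming(sr * win_ms // 1000)                     # 480-sample Hamming window
    hop = sr * hop_ms // 1000                                  # 160-sample frame shift
    slices = []
    for start in range(0, len(audio) - slice_len + 1, slice_len):
        chunk = audio[start:start + slice_len]
        frames = []
        for f in range(0, slice_len - len(win) + hop, hop):    # 8 frames per 100 ms slice
            segment = chunk[f:f + len(win)] * win
            spec = np.fft.rfft(segment, n=n_fft)               # 257 complex bins
            mag = np.abs(spec) ** p                            # power compression |X|^0.3
            compressed = mag * np.exp(1j * np.angle(spec))     # keep the phase
            frames.append(np.stack([compressed.real, compressed.imag], axis=-1))
        slices.append(np.stack(frames, axis=1))                # shape (257, 8, 2) per slice
    return np.stack(slices)                                    # (num_slices, 257, 8, 2)
```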
In order to further implement the above technical solution, the specific method for fusing the preprocessed video signal and the preprocessed audio signal in S4 to obtain the video-audio fusion signal is as follows:
the preprocessed video signal and the preprocessed audio signal are input into their respective convolutional networks, and the outputs of the two convolutional networks are fused by cascading (concatenation) to obtain the video-audio fusion signal.
It should be noted that:
The convolutional network parameters for the video branch and the audio branch are shown in the following tables:
TABLE 1 Video input signal multilayer convolutional network (layer parameters reproduced only as an image in the original publication)
TABLE 2 Audio input signal multilayer convolutional network (layer parameters reproduced only as an image in the original publication)
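Since the exact layer parameters of Tables 1 and 2 are reproduced only as images, the following PyTorch sketch uses placeholder convolutional branches; the per-slot video input shape (8 frames x 32 landmarks x 3 coordinates) and all layer sizes are assumptions. It only illustrates the structure described in S4: two modality-specific convolutional branches, cascade (concatenation) fusion, a bidirectional LSTM over the slot sequence, and a cascaded 3-layer fully connected head emitting one high/low level per 100 ms slot.

```python
import torch
import torch.nn as nn

class AVFusionNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Video branch: treat the 8-frame skeleton slice as a (3, 8, 32) "image".
        self.video_conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Audio branch: treat the compressed spectrogram slice as a (2, 257, 8) "image".
        self.audio_conv = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fused per-slot features are modelled as a sequence by a bidirectional LSTM.
        self.bilstm = nn.LSTM(64 + 64, hidden, batch_first=True, bidirectional=True)
        # Cascaded 3-layer fully connected head -> probability of "jump in this slot".
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, video, audio):
        # video: (batch, slots, 3, 8, 32); audio: (batch, slots, 2, 257, 8)
        b, t = video.shape[:2]
        v = self.video_conv(video.flatten(0, 1)).view(b, t, -1)
        a = self.audio_conv(audio.flatten(0, 1)).view(b, t, -1)
        fused = torch.cat([v, a], dim=-1)            # cascade (concatenation) fusion
        seq, _ = self.bilstm(fused)
        return self.head(seq).squeeze(-1)             # (batch, slots); threshold at 0.5 -> 0/1

# Usage sketch: one batch of 20 slots (2 seconds) of fused input.
net = AVFusionNet()
levels = net(torch.randn(1, 20, 3, 8, 32), torch.randn(1, 20, 2, 257, 8))
print(levels.shape)   # torch.Size([1, 20])
```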
In order to further implement the above technical solution, there are one or more target portraits.
It should be noted that:
When multi-person rope skipping analysis is performed, the audio analysis part is kept the same for every tester, while each tester's own portrait tracking and skeleton extraction results are fed as the video input of the multi-modal neural network, so that rope skipping statistics can be produced for each tester individually.
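Illustratively, multi-person analysis could then loop over the tracked testers, reusing the sketches above: the audio slices are shared, while each tester contributes an individual skeleton stream. The helper skeleton_stream_for is hypothetical, and net and count_jumps refer to the earlier sketches.

```python
# Sketch only: one independent jump count per tracked tester, shared audio input.
def per_tester_counts(tracked_ids, shared_audio_slices):
    counts = {}
    for tester_id in tracked_ids:
        video_slices = skeleton_stream_for(tester_id)            # per-tester video branch input
        levels = net(video_slices, shared_audio_slices) > 0.5     # shared audio branch input
        counts[tester_id] = count_jumps(levels.squeeze(0).int().tolist())
    return counts
```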
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments can be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A multi-person rope skipping analysis method based on video-audio multi-modal deep learning, characterized by comprising the following steps:
S1, obtaining an audio-video file of the rope skipping process, and separating the video and audio in the audio-video file to obtain a video image signal and a stereo audio signal;
S2, detecting and extracting portraits from the video image signal, tracking a target portrait, extracting the skeletal feature point coordinates of the target portrait, preprocessing these coordinates, and obtaining a preprocessed video signal;
S3, obtaining a single-channel audio signal from the stereo audio signal, slicing and windowing the single-channel audio signal, performing a time-frequency transform to obtain a spectrum signal, preprocessing it, and obtaining a preprocessed audio signal;
S4, fusing the preprocessed video signal and the preprocessed audio signal to obtain a video-audio fusion signal, passing the video-audio fusion signal through a bidirectional long short-term memory (BiLSTM) recurrent convolutional neural network and a cascaded fully connected layer to obtain an output signal stream, and converting the output signal stream into a square wave signal;
S5, filtering the square wave signal and performing statistical analysis on its rising or falling edges, where each rising edge (or each falling edge) corresponds to one real rope skipping action, thereby obtaining the rope skipping statistics.
2. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein when the rope skipping process is recorded in step S1, a rope skipping tester is filmed by a video acquisition device to obtain the audio-video file of the rope skipping process, and it is ensured that the tester's whole body remains within the viewfinder frame throughout the rope skipping process.
3. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning as claimed in claim 1, wherein the acquired audio/video file is an audio/video file acquired in real time or exported after being recorded.
4. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein the specific content of S2 includes:
S21, detecting and extracting portraits using a pre-trained YOLOv5 network model;
S22, associating the target portrait boxes between frames with the KM (Kuhn-Munkres) algorithm, so that the same target portrait is tracked from frame to frame;
S23, performing frame-by-frame 32-point 3D body modeling with the BlazePose algorithm to obtain the 32 skeletal feature point coordinates of the target in each frame;
S24, performing inter-frame smoothing of the skeletal feature point coordinates with a Kalman filtering algorithm, and up-converting the frame rate of the smoothed coordinates by Kalman filter prediction to obtain the preprocessed video signal.
5. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein the specific content of S3 includes:
S31, obtaining a single-channel audio signal from the stereo audio signal, resampling the single-channel signal for short-time Fourier analysis, slicing the audio signal into segments of a preset duration, and intercepting frames from each segment with a Hamming window after determining the frame shift length;
S32, performing a time-frequency transform on each intercepted audio frame with the fast Fourier transform to obtain a spectrum signal;
S33, performing amplitude suppression on the audio spectrum signal by power compression A^0.3, where A is the input audio time-series signal, so that excessively loud interfering audio does not mask the effective signal, thereby obtaining the preprocessed audio signal.
6. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein the step S4 of fusing the preprocessed video signal and the preprocessed audio signal to obtain the video-audio fusion signal comprises:
inputting the preprocessed video signal and the preprocessed audio signal into their respective convolutional networks, and fusing the outputs of the two convolutional networks by cascading (concatenation) to obtain the video-audio fusion signal.
7. The multi-person rope skipping analysis method based on video-audio multi-modal deep learning according to claim 1, wherein there are one or more target portraits.
CN202210091782.9A 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning Pending CN114463784A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210091782.9A CN114463784A (en) 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210091782.9A CN114463784A (en) 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Publications (1)

Publication Number Publication Date
CN114463784A true CN114463784A (en) 2022-05-10

Family

ID=81412330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210091782.9A Pending CN114463784A (en) 2022-01-26 2022-01-26 Multi-person rope skipping analysis method based on video-audio multi-mode deep learning

Country Status (1)

Country Link
CN (1) CN114463784A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797943A (en) * 2023-02-08 2023-03-14 广州数说故事信息科技有限公司 Multimode-based video text content extraction method, system and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination