CN107170466B - Mopping sound detection method based on audio

Publication number: CN107170466B (granted); earlier published as CN107170466A
Application number: CN201710242995.6A
Original language: Chinese (zh)
Inventors: 王成 (Wang Cheng), 龙舟 (Long Zhou), 钱跃良 (Qian Yueliang), 王向东 (Wang Xiangdong), 袁静 (Yuan Jing), 李锦涛 (Li Jintao)
Original and current assignee: Institute of Computing Technology of CAS
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/66: Speech or voice analysis techniques for extracting parameters related to health condition
    • A61B5/112: Measuring movement of the body for diagnostic purposes; gait analysis
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window


Abstract

The invention provides an audio-based method for detecting mopping (foot-dragging) sounds during walking. The method comprises the following steps: framing the collected two-channel (left-foot and right-foot) audio data to obtain corresponding audio frames; taking the feature vector extracted from each audio frame as input to a classifier to obtain the probability that the frame belongs to mopping sound and the probability that it belongs to normal footstep sound, where the classifier is obtained by training on samples comprising positive samples identifying normal footstep sounds, mopping samples identifying mopping sounds, and negative samples identifying sounds other than footsteps; and obtaining the time intervals corresponding to mopping sounds from the per-frame probabilities of mopping sound and normal footstep sound. The method accurately detects mopping sounds produced during walking and supports gait analysis, fall early warning, and similar applications.

Description

Mopping sound detection method based on audio
Technical Field
The invention relates to the field of computer applications, and in particular to a method for detecting mopping (foot-dragging) sounds from audio information.
Background
Gait analysis is a technique for obtaining and analyzing gait parameters by observing or recording the posture of the human body while walking. Gait parameters generally include spatial parameters (e.g., stride length, step length, step width), temporal parameters (e.g., cadence, walking speed), the left-right symmetry of these parameters, the stability of long-term data, and so on. Gait analysis is widely applied and studied in physical exercise, medical rehabilitation, and related fields.
In gait analysis, whether the foot drags on the ground relates to what is medically called foot clearance. A normal person's steps are relatively stable during landing and lift-off, with sufficient ground clearance during the swing phase. A patient, by contrast, raises and lands the foot with difficulty at the start and end of a step, so the foot may scrape the ground; during the swing phase, an audible mopping sound can also be produced because the foot is not raised high enough. Detecting foot clearance is therefore important for rehabilitation medicine, gait monitoring, fall early warning, and so on.
However, gait analysis in the prior art is usually based on video images, pressure sensors, electromyography, and similar modalities. These devices are intrusive for the patient, and mopping events in particular are difficult to determine directly from motion sensors. Although an audio-based footstep detection method exists (e.g., Chinese patent application No. 201610971951.2 by Wang Cheng et al., entitled "Two-channel-based footstep detection method"), it does not include a scheme for determining whether a foot drags on the floor, and the prior art offers no general, effective foot-dragging detection mechanism.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above drawbacks of the prior art and to provide a method capable of accurately detecting footstep-mopping sounds from audio. The method comprises the following steps:
step 1: framing the collected two-channel (left-foot and right-foot) audio data to obtain corresponding audio frames;
step 2: taking the feature vector extracted from each audio frame as input, using a classifier to obtain the probability that the frame belongs to mopping sound and the probability that it belongs to normal footstep sound, wherein the classifier is obtained by training and the training samples comprise positive samples identifying normal footstep sounds, mopping samples identifying mopping sounds, and negative samples identifying sounds other than footsteps;
step 3: obtaining the time intervals corresponding to mopping sounds from the obtained per-frame probabilities of mopping sound and normal footstep sound.
Preferably, the positive samples include audio frames labeled heel strike and audio frames labeled forefoot strike under normal gait.
Preferably, in a known normal gait, the positive samples include three audio frames centered at each labeled heel-strike position and three audio frames centered at each labeled forefoot-strike position in the left-foot channel audio data, and three audio frames centered at each labeled heel-strike position and three audio frames centered at each labeled forefoot-strike position in the right-foot channel audio data.
Preferably, the mopping sample comprises an audio frame labeled heel strike and an audio frame labeled forefoot strike in a mopping gait.
Preferably, in a known mopping gait, the mopping samples include three audio frames centered at each labeled heel-strike position and three audio frames centered at each labeled forefoot-strike position in the left-foot channel audio data, and three audio frames centered at each labeled heel-strike position and three audio frames centered at each labeled forefoot-strike position in the right-foot channel audio data.
Preferably, the negative samples include nine audio frames between the forefoot strike of a preceding step and the heel strike of the following step in the left-foot channel audio data, and nine audio frames between the forefoot strike of a preceding step and the heel strike of the following step in the right-foot channel audio data.
Preferably, in step 2, the audio frames of the left-foot channel and their mopping-sound probabilities form the mopping-sound probability curve of the left-foot channel, and the audio frames of the right-foot channel and their mopping-sound probabilities form the mopping-sound probability curve of the right-foot channel; likewise, the audio frames of each channel and their normal-footstep-sound probabilities form the normal-footstep-sound probability curve of that channel. In step 3, the mopping-sound probability curves of the left and right channels are fused into a combined mopping-sound probability curve, the normal-footstep-sound probability curves of the left and right channels are fused into a combined normal-footstep-sound probability curve, and the time intervals corresponding to normal footstep sounds and to mopping sounds are obtained based on preset probability thresholds.
Preferably, the time interval corresponding to normal footstep sound is the interval in which the combined normal-footstep-sound probability curve is less than 0.5, and the time interval corresponding to mopping sound is the interval in which the combined mopping-sound probability curve is greater than 0.35.
Compared with the prior art, the invention can accurately determine from two-channel audio data whether an audio frame contains mopping sound and/or normal footstep sound; moreover, being based on machine learning, the method adapts to many different scenarios and is highly general.
Drawings
The invention is illustrated and described by way of example, and without limiting its scope, with reference to the following drawings, in which:
FIG. 1 shows a flow diagram of a method of training a classifier for detecting mopping sounds according to one embodiment of the invention;
FIG. 2 illustrates a flow diagram of a method of detecting mopping sounds and normal footsteps according to one embodiment of the present invention;
FIG. 3 illustrates a framing approach according to one embodiment of the present invention;
FIG. 4 illustrates an example of a binaural data annotation according to an embodiment of the invention;
FIG. 5 shows an example of detection results for mopping sound and normal footstep sound according to one embodiment of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, the audio-based footstep-mopping detection method according to the present invention will now be described in detail with reference to the accompanying drawings.
For a clear understanding of the present invention, the following patents (or patent applications) are incorporated herein by reference in their entirety:
1. Chinese patent application No. 201610971951.2 by Wang Cheng et al., entitled "Two-channel-based footstep detection method".
2. Chinese patent application No. 201610517381.X by Wang Cheng et al., entitled "Method for establishing a gait data set and gait analysis method".
FIG. 1 shows a schematic flow diagram of an audio-based mopping detection method according to one embodiment of the present invention. The method specifically comprises the following steps:
step S110, collecting audio data
By arranging a wearable gait data acquisition device at each of the left and right feet and collecting the sound signals produced while a person walks, two-channel audio data can be obtained.
In one embodiment, the wearable gait data acquisition device comprises a microphone unit for capturing sound signals. The device consists of a left-foot and a right-foot gait data acquisition node; each node comprises a storage unit, a microprocessor, a power supply unit, a wireless transceiver unit, a signal acquisition unit, and a signal transmitter. During data collection, the signal collector (e.g., a microphone) captures sound signals and sends them to the microprocessor for processing.
In one embodiment, the two-channel data are acquired as follows. The left-foot and right-foot gait acquisition nodes are fixed on the left and right feet of the subject, respectively, and used simultaneously: the left-foot node records the left foot's audio and the right-foot node records the right foot's audio, forming two channels. Analyzing and fusing the data of both feet yields more accurate information than a single-foot measurement. The gait acquisition nodes may be worn at various locations on the shoe, such as the front, lateral, or rear side of the upper, or on the sole near the ball of the foot, the midfoot, or the heel. Preferably, the left-foot and right-foot nodes are worn at symmetric positions on the two feet.
For the specific method of collecting audio data during walking, refer to Chinese patent application CN201610517381.X, "Method for establishing a gait data set and gait analysis method".
Step S120: data slicing
The collected two-channel audio data are framed and windowed to obtain a series of audio frames. As shown in Fig. 3, at an audio sampling rate of 8000 Hz each audio frame contains 200 samples, and adjacent frames overlap by 120 samples. Because the spectral characteristics and physical parameters of audio remain essentially stationary over 10-30 ms, the window length is generally chosen in that range (200 samples at 8000 Hz correspond to 25 ms). After framing, a Hamming window is applied to each frame to reduce signal discontinuities at the frame boundaries; in effect a sliding window moves over the audio data, and the resulting windowed frames serve as the basic units of analysis in this embodiment.
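The framing scheme just described can be sketched as follows (a minimal illustration assuming a mono NumPy signal; the function name is not from the patent):

```python
import numpy as np

def frame_signal(x, frame_len=200, overlap=120):
    """Split a mono signal into overlapping, Hamming-windowed frames.
    At 8000 Hz, 200 samples = 25 ms (within the 10-30 ms range) and a
    120-sample overlap gives an 80-sample (10 ms) hop between frames."""
    hop = frame_len - overlap
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)   # tapers frame edges to reduce discontinuities
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

signal = np.random.randn(8000)       # one second of audio at 8000 Hz
frames = frame_signal(signal)        # shape (98, 200)
```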
Step S130: extracting and selecting audio features
Feature extraction is performed on each audio frame to obtain its feature vector. According to one embodiment of the invention, the feature vector comprises: an autocorrelation coefficient, sub-band energy (0 to 4 kHz) features, the zero-crossing rate, linear prediction cepstral coefficients (LPCC), and Mel-frequency cepstral coefficient (MFCC) features. Table 1 shows the composition of the 36-dimensional feature vector in one embodiment: 1-dimensional autocorrelation coefficient, 10-dimensional sub-band energy features, 1-dimensional zero-crossing rate, 12-dimensional LPCC, and 12-dimensional MFCC.
TABLE 1

Feature                       Dimensions
Autocorrelation coefficient   1
Sub-band energy (0 to 4 kHz)  10
Zero-crossing rate            1
LPCC                          12
MFCC                          12
It should be understood that the dimensions of the feature vectors and the specific combination of feature vectors described above are not exclusive. In other embodiments, the feature vector may be a free combination of some or all of the above features or other feature combinations that can better characterize the information implied by the audio frame.
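Part of this feature vector can be sketched with plain NumPy (a hedged illustration: the autocorrelation coefficient, sub-band energies, and zero-crossing rate are computed directly, while the 12-dimensional MFCC and 12-dimensional LPCC features would in practice come from a speech-DSP library and are omitted here):

```python
import numpy as np

def frame_features(frame, n_bands=10):
    # lag-1 autocorrelation coefficient, normalized by frame energy
    energy = np.dot(frame, frame) + 1e-12
    autocorr = np.dot(frame[:-1], frame[1:]) / energy

    # power spectrum up to the Nyquist frequency (4 kHz at 8000 Hz sampling),
    # split into 10 roughly equal-width sub-bands
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    band_energy = np.array([b.sum() for b in np.array_split(spectrum, n_bands)])

    # zero-crossing rate: fraction of adjacent sample pairs that change sign
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2

    return np.concatenate(([autocorr], band_energy, [zcr]))

feat = frame_features(np.random.randn(200))   # 12 of the 36 dimensions
```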
Step S140: selecting training samples
A typical footstep sound has two ground contacts, the heel strike and the forefoot strike. The audio acquisition devices on both feet pick up the corresponding contact signals, but the signal is stronger on the side of the stepping foot. Therefore, during manual annotation, the positions of the two sounds of each step (the heel-strike sound and the forefoot-strike sound) are labeled in sequence on the audio channel of the corresponding foot, as shown in Fig. 4.
In this embodiment, to detect the three categories of normal footstep sound, mopping sound, and other sounds, the training samples include positive samples identifying normal footstep sounds, mopping samples identifying mopping sounds, and negative samples identifying sounds other than footsteps.
Preferably, under normal gait, 3 frames centered at each labeled position are taken as positive samples on each of the two audio channels, so that in a single channel (corresponding to the left foot or the right foot) each step yields 6 positive samples. Then, 9 consecutive frames at the midpoint between two adjacent steps (between the second sound of the preceding foot and the first sound of the following foot) are taken as negative samples, giving 18 negative samples between each pair of steps across the two channels.
It will be appreciated by those skilled in the art that other audio frames may be selected as long as they are distinguishable from the normal footstep sound.
The mopping samples can be collected by having healthy subjects simulate a mopping gait; the labeling method is the same as for normal footsteps. For example, in the mopping gait, 3 frames centered at each labeled position are taken on each of the two audio channels, so that each step corresponds to 6 mopping samples in a single channel.
In one embodiment, the collected training samples include 3264 positive audio frames, 4026 negative audio frames, and 463 mopping audio frames.
It should be understood that the number of positive samples, the number of negative samples, and the number of mopping samples per step and the number of audio frames per sample may be determined by considering the training time and the obtained model accuracy, and is not limited to the specific values listed herein.
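The sampling scheme above can be sketched as follows (the function and its arguments are illustrative, not from the patent; for simplicity the sketch takes negatives around every mid-point between labels, whereas the scheme described above takes them only between the second sound of one step and the first sound of the next):

```python
import numpy as np

def collect_samples(frames, strike_positions, half=1, neg_len=9):
    """For each annotated strike position, take the 3 frames centered on it
    (positive or mopping samples); between successive labels, take 9
    consecutive frames around the midpoint (negative samples)."""
    positives = []
    for p in strike_positions:
        positives.extend(frames[p - half : p + half + 1])   # 3 frames per label

    negatives = []
    for a, b in zip(strike_positions[:-1], strike_positions[1:]):
        mid = (a + b) // 2
        negatives.extend(frames[mid - neg_len // 2 : mid + neg_len // 2 + 1])
    return np.array(positives), np.array(negatives)

frames = np.random.randn(100, 200)        # pretend framed audio of one channel
pos, neg = collect_samples(frames, [10, 40, 80])
```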
Step S150: training classifier model
The positive, negative, and mopping samples form a sample library from which a classifier, such as a support vector machine (SVM), weighted support vector machine, extreme learning machine, or weighted extreme learning machine, can be trained by machine learning. The classifier's input is the feature vector extracted from an audio frame; its output is the probability that the frame is mopping sound, normal footstep sound, or another sound, and for each frame the three probabilities sum to 1.
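A minimal sketch of this training stage, using scikit-learn's SVC as one possible SVM implementation (the feature vectors below are random stand-ins for the 36-dimensional features of Table 1):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 36))      # 60 training frames, 36-dim feature vectors
y = np.repeat([0, 1, 2], 20)       # 0 = normal footstep, 1 = mopping, 2 = other

# probability=True enables Platt-scaled per-class probabilities
clf = SVC(probability=True, random_state=0).fit(X, y)

probs = clf.predict_proba(rng.normal(size=(5, 36)))
# Each row holds (P(normal), P(mopping), P(other)) and sums to 1.
```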
The trained classifier can then be used to detect mopping sounds. As shown in Fig. 2, in this embodiment the detection method comprises the following steps:
step S210, collecting audio data.
In this step, an audio frame of the two-channel audio data to be detected is obtained according to the method of step S110 shown in fig. 1.
Step S220: and obtaining the probability that the audio frame belongs to normal footstep sound and mopping sound by using the trained classifier.
Each audio frame of the two-channel audio data under test is classified with the trained classifier to obtain its probability of belonging to mopping sound, and a corresponding probability curve is built. The abscissa of the curve is the audio frame number (or the time the frame represents) and the ordinate is the probability that the frame belongs to mopping sound; the left-foot and right-foot audio data yield two such curves. Similarly, probability curves for normal footstep sound and for other sounds are obtained for the left and right feet. Fig. 5 shows the normal-footstep-sound and mopping-sound probability curves.
Step S230: smoothing the probability curve and identifying mopping and normal footstep sounds
In this step, the left-foot and right-foot mopping-sound probability curves, and likewise the normal-footstep-sound probability curves, are fused and smoothed. For example, in one embodiment the left and right curves are first merged by summation; then, to suppress the instability and noise in the merged curve, it is smoothed with a low-pass filter (relative cut-off frequency 0.1). The smoothed curve exhibits clear high-probability intervals: in the mopping-sound probability curve of Fig. 5, for example, intervals above 0.35 are evident. Intervals that continuously exceed a preset threshold are therefore located and judged to be mopping-sound intervals.
In another embodiment, the intervals are determined by a two-channel probability-maximization method. The channel on the side of the stepping foot generally yields a higher mopping-sound probability, so it can be relied on more heavily, while the other channel plays a complementary role. For each pair of candidate audio frames (the left- and right-channel frames at the same time position), the larger probability is selected to represent that position in the combined probability curve, yielding a curve that integrates the left- and right-foot audio data. Intervals of the combined curve that continuously exceed a preset probability threshold are then judged to be mopping-sound intervals.
The specific fusion and smoothing procedure for the probability curves is also described in patent application No. CN201610971951.2 (two-channel-based footstep detection method).
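The two fusion strategies and the subsequent smoothing and thresholding can be sketched as follows (a hedged illustration: a moving-average filter stands in for the low-pass filter with relative cut-off 0.1, the mean stands in for the summation-based merge so that the result stays a probability, and the function name is not from the patent):

```python
import numpy as np

def mop_intervals(p_left, p_right, threshold=0.35, win=15, fuse="max"):
    """Fuse the two per-channel probability curves (elementwise max, or the
    mean as a normalized stand-in for summation), smooth with a moving
    average, and return (start, end) frame intervals where the smoothed
    curve stays above the threshold."""
    fused = np.maximum(p_left, p_right) if fuse == "max" else (p_left + p_right) / 2
    smooth = np.convolve(fused, np.ones(win) / win, mode="same")

    above = smooth > threshold
    edges = np.diff(above.astype(int))
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    if above[0]:
        starts = np.r_[0, starts]
    if above[-1]:
        ends = np.r_[ends, len(above)]
    return list(zip(starts, ends))

p_left = np.zeros(100)
p_left[30:50] = 0.9                       # a burst of high mopping probability
ivals = mop_intervals(p_left, np.zeros(100))
```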
In tests, the inventors found that a patient's dragging steps are assigned a high mopping probability, that normal footstep sounds obtain a high probability of their own, and that positions counted as normal footstep sound can also contain mopping sound. Therefore, to distinguish the two accurately, the present invention does not count mopping sound in time intervals already judged to be normal footstep sound. For example, in one decision procedure, intervals where the normal-footstep-sound probability curve is below the preset threshold of 0.5 are regarded as normal footstep sound, while intervals where the normal-footstep-sound curve exceeds 0.5 and the mopping-sound curve exceeds 0.35 are regarded as mopping-sound intervals. In this way, mopping sounds are to some extent prevented from being masked by normal footsteps. Of course, depending on the application, normal footstep sound and mopping sound can also be counted independently: intervals where the normal-footstep-sound curve is below the preset threshold of 0.5 are regarded as normal footstep sound, and intervals where the mopping-sound curve exceeds 0.35 are regarded as mopping-sound intervals.
Fig. 5 shows the detection results for normal footstep sound and mopping sound; the abscissa is the audio frame number, and the ordinates are the normal-footstep-sound and mopping-sound probability curves. The original normal-footstep-sound probability curve is the curve after fusing the left- and right-foot audio data but before smoothing; the normal-footstep-sound and mopping-sound probability curves are those after both fusion and smoothing. In Fig. 5, the decision rule for mopping sound is probability greater than 0.35, and for normal footstep sound probability less than 0.5.
The audio-based mopping-sound detection method of the invention misses neither normal footstep sounds nor mopping sounds, and achieves high recall and accuracy.
To further verify the technical effect of the invention, the inventors ran tests with the classifier model of the invention. The test data comprise 3 healthy subjects and 2 patients with abnormal gait, each walking back and forth 4 times over a 5-meter distance. The results are shown in Table 2: no mopping was detected for the healthy subjects, while the patients exhibited frequent mopping.
TABLE 2

Subject            Healthy 1  Healthy 2  Healthy 3  Patient 1  Patient 2
Mopping detected   0          0          0          6 times    8 times
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. An audio-based mopping sound detection method, comprising the following steps:
step 1: performing framing on the collected two-channel (left foot and right foot) audio data to obtain corresponding audio frames;
step 2: taking the feature vector extracted from each audio frame as input, using a classifier to obtain the probability that the audio frame belongs to mopping sound and the probability that it belongs to normal footstep sound; the audio frames of the left foot channel and their mopping sound probabilities form the mopping sound probability curve of the left foot channel, the audio frames of the right foot channel and their mopping sound probabilities form the mopping sound probability curve of the right foot channel, the audio frames of the left foot channel and their normal footstep sound probabilities form the normal footstep sound probability curve of the left foot channel, and the audio frames of the right foot channel and their normal footstep sound probabilities form the normal footstep sound probability curve of the right foot channel; the classifier is obtained by training, and the training samples comprise positive samples identifying normal footstep sounds, mopping samples identifying mopping sounds, and negative samples identifying sounds other than footsteps;
step 3: fusing the mopping sound probability curves of the left and right foot channels into an integrated mopping sound probability curve, fusing the normal footstep sound probability curves of the left and right foot channels into an integrated normal footstep sound probability curve, and obtaining the time interval corresponding to normal footstep sound and the time interval corresponding to mopping sound based on preset probability thresholds.
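Outside the claim language itself, the three-step pipeline of claim 1 can be sketched as follows. This is an illustrative reading, not the patented implementation: the frame length, hop size, and per-frame maximum as the channel-fusion rule are all assumptions (the claim does not fix a fusion rule), and `classify` is a stand-in for the trained classifier of step 2.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a mono signal into overlapping frames (step 1)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def mop_probability_curves(left, right, classify, frame_len=1024, hop=512):
    """Steps 2-3: per-frame class probabilities for each foot channel,
    fused into integrated mopping / normal-footstep probability curves.
    `classify(frame)` stands in for the trained classifier and returns
    (p_mopping, p_normal) for one frame's feature vector."""
    curves = {}
    for name, chan in (("left", left), ("right", right)):
        probs = np.array([classify(f) for f in frame_signal(chan, frame_len, hop)])
        curves[name] = probs  # column 0: mopping, column 1: normal footstep
    # Fuse the two channels; a per-frame maximum is one plausible choice,
    # since either foot can produce the detected sound.
    fused = np.maximum(curves["left"], curves["right"])
    return fused[:, 0], fused[:, 1]  # (mopping curve, normal-footstep curve)
```

In practice the feature extraction (e.g. MFCCs) would run between framing and classification; it is omitted here so the sketch stays focused on the probability-curve construction.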
2. The method of claim 1, wherein the positive samples comprise audio frames labeled heel strike and audio frames labeled forefoot strike in a normal gait.
3. The method of claim 2, wherein the positive samples comprise, in the left foot channel audio data of a known normal gait, three audio frames centered at each location labeled heel strike and three audio frames centered at each location labeled forefoot strike, and, in the right foot channel audio data of a known normal gait, three audio frames centered at each location labeled heel strike and three audio frames centered at each location labeled forefoot strike.
4. The method of claim 1, wherein the mopping samples comprise audio frames labeled heel strike and audio frames labeled forefoot strike in a mopping gait.
5. The method of claim 4, wherein the mopping samples comprise, in the left foot channel audio data of a known mopping gait, three audio frames centered at each location labeled heel strike and three audio frames centered at each location labeled forefoot strike, and, in the right foot channel audio data of a known mopping gait, three audio frames centered at each location labeled heel strike and three audio frames centered at each location labeled forefoot strike.
6. The method of claim 1, wherein the negative samples comprise the nine audio frames between the forefoot strike of a preceding step and the heel strike of a succeeding step in the left foot channel audio data, and the nine audio frames between the forefoot strike of a preceding step and the heel strike of a succeeding step in the right foot channel audio data.
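The training-sample construction of claims 2 through 6 can be sketched for one channel as follows. The function names and the frame-index representation are illustrative; in particular, centering the nine negative frames on the midpoint between a forefoot strike and the next heel strike is an assumption, since the claim only says the frames lie between the two events.

```python
def frames_around(center, k=1):
    """k frames on either side of a labeled strike frame -> 2k+1 frames
    (three frames total for k=1, as in claims 3 and 5)."""
    return list(range(center - k, center + k + 1))

def build_sample_indices(heel_idx, forefoot_idx):
    """Collect training-frame indices for one channel of one gait class:
    three frames centered on every labeled heel strike and forefoot
    strike (positive or mopping samples, claims 3 and 5), and nine
    frames between each forefoot strike and the following heel strike
    (negative samples, claim 6)."""
    positive = []
    for c in heel_idx + forefoot_idx:
        positive += frames_around(c, k=1)
    negative = []
    for ff, hs in zip(forefoot_idx, heel_idx[1:]):
        mid = (ff + hs) // 2
        negative += frames_around(mid, k=4)  # nine frames between steps
    return positive, negative
```

Running the same construction on left- and right-channel label sets yields the per-channel samples the claims enumerate.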
7. The method according to any one of claims 1 to 6, wherein the time interval corresponding to normal footstep sound is a time interval in which the integrated normal footstep sound probability curve is greater than 0.5, and the time interval corresponding to mopping sound is a time interval in which the integrated mopping sound probability curve is greater than 0.35.
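The thresholding of claim 7 reduces to scanning the integrated probability curve for runs above a threshold (0.5 for normal footstep sound, 0.35 for mopping sound) and converting frame indices back to time. A minimal sketch, assuming a fixed hop size and sample rate that are not specified in the claims:

```python
def probability_intervals(curve, threshold, hop, sr):
    """Return (start, end) times in seconds of every run where the
    integrated probability curve exceeds `threshold`."""
    intervals, start = [], None
    for i, p in enumerate(curve):
        if p > threshold and start is None:
            start = i                                  # run begins
        elif p <= threshold and start is not None:
            intervals.append((start * hop / sr, i * hop / sr))
            start = None                               # run ends
    if start is not None:                              # run reaches the end
        intervals.append((start * hop / sr, len(curve) * hop / sr))
    return intervals
```

Applying it twice, once per curve with its own threshold, yields the normal-footstep and mopping-sound time intervals of step 3.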
8. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method of any one of claims 1 to 7.
CN201710242995.6A 2017-04-14 2017-04-14 Mopping sound detection method based on audio Active CN107170466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710242995.6A CN107170466B (en) 2017-04-14 2017-04-14 Mopping sound detection method based on audio


Publications (2)

Publication Number Publication Date
CN107170466A CN107170466A (en) 2017-09-15
CN107170466B true CN107170466B (en) 2020-12-29

Family

ID=59849006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710242995.6A Active CN107170466B (en) 2017-04-14 2017-04-14 Mopping sound detection method based on audio

Country Status (1)

Country Link
CN (1) CN107170466B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110364141B (en) * 2019-06-04 2021-09-28 杭州电子科技大学 Elevator typical abnormal sound alarm method based on depth single classifier
GB2607561B (en) * 2021-04-08 2023-07-19 Miicare Ltd Mobility analysis
CN113679300A (en) * 2021-09-13 2021-11-23 李侃 Electronic book recommendation method and device for floor sweeping robot
CN114550395B (en) * 2022-04-28 2022-07-19 常州分音塔科技有限公司 Sound alarm detection method and device

Citations (6)

Publication number Priority date Publication date Assignee Title
JP2002197437A (en) * 2000-12-27 2002-07-12 Sony Corp Walking detection system, walking detector, device and walking detecting method
CN101393660A (en) * 2008-10-15 2009-03-25 中山大学 Intelligent gate inhibition system based on footstep recognition
CN102799899A (en) * 2012-06-29 2012-11-28 北京理工大学 Special audio event layered and generalized identification method based on SVM (Support Vector Machine) and GMM (Gaussian Mixture Model)
CN104200815A (en) * 2014-07-16 2014-12-10 电子科技大学 Audio noise real-time detection method based on correlation analysis
CN106175778A (en) * 2016-07-04 2016-12-07 中国科学院计算技术研究所 A kind of method setting up gait data collection and gait analysis method
CN106531186A (en) * 2016-10-28 2017-03-22 中国科学院计算技术研究所 Footstep detecting method according to acceleration and audio information

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US7190775B2 (en) * 2003-10-29 2007-03-13 Broadcom Corporation High quality audio conferencing with adaptive beamforming
CN103730124A (en) * 2013-12-31 2014-04-16 上海交通大学无锡研究院 Noise robustness endpoint detection method based on likelihood ratio test
CN105215542B (en) * 2015-10-14 2017-10-10 西北工业大学 Underwater Acoustic channels method in friction welding (FW) welding process
CN106166071B (en) * 2016-07-04 2018-11-30 中国科学院计算技术研究所 A kind of acquisition method and equipment of gait parameter



Similar Documents

Publication Publication Date Title
CN107170466B (en) Mopping sound detection method based on audio
Dey et al. InstaBP: cuff-less blood pressure monitoring on smartphone using single PPG sensor
US11033205B2 (en) System and method for analyzing gait and postural balance of a person
Yancheva et al. Using linguistic features longitudinally to predict clinical scores for Alzheimer’s disease and related dementias
CN106653058B (en) Dual-track-based step detection method
Altaf et al. Acoustic gaits: Gait analysis with footstep sounds
JP2017504440A (en) Improved gait detection in user movement measurements
CN106531186B (en) Merge the step detection method of acceleration and audio-frequency information
CN108742637B (en) Body state detection method and detection system based on gait recognition device
Apte et al. A sensor fusion approach to the estimation of instantaneous velocity using single wearable sensor during sprint
Salvi et al. An optimised algorithm for accurate steps counting from smart-phone accelerometry
Kong et al. Comparison of gait event detection from shanks and feet in single-task and multi-task walking of healthy older adults
US10426426B2 (en) Methods and apparatus for performing dynamic respiratory classification and tracking
JP6479447B2 (en) Walking state determination method, walking state determination device, program, and storage medium
Wang et al. DeepDDK: A deep learning based oral-diadochokinesis analysis software
Aubol et al. Tibial acceleration reliability and minimal detectable difference during overground and treadmill running
JP5485924B2 (en) Walking sound analyzer, method, and program
US11918346B2 (en) Methods and systems for pulmonary condition assessment
Boutaayamou et al. Validated extraction of gait events from 3D accelerometer recordings
JP7489729B2 (en) Method for preventing falls and device for carrying out such method
EP3991157B1 (en) Evaluating movement of a subject
Zaeni et al. Classification of the Stride Length based on IMU Sensor using the Decision Tree
Zaeni et al. Detection of the Imbalance Step Length using the Decision Tree
Summoogum et al. Passive Tracking of Gait Biomarkers in Older Adults: Feasibility of an Acoustic Based Approach for Non-Intrusive gait Analysis
Ismail Gait and postural sway analysis, A multi-modal system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20170915

Assignee: Beijing Zhongke Huicheng Technology Co., Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000005

Denomination of invention: Shuffling sound detection method based on voice frequency

License type: Common License

Record date: 20180222

EC01 Cancellation of recordation of patent licensing contract

Assignee: Beijing Zhongke Huicheng Technology Co., Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000005

Date of cancellation: 20180309

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20170915

Assignee: Luoyang Zhongke Huicheng Technology Co., Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000009

Denomination of invention: Shuffling sound detection method based on voice frequency

License type: Common License

Record date: 20180319

GR01 Patent grant