CN110069199B - Skin type finger gesture recognition method based on smart watch - Google Patents


Info

Publication number
CN110069199B
Authority
CN
China
Prior art keywords
gesture
acoustic signals
finger
independent
finger gesture
Prior art date
Legal status
Active
Application number
CN201910248707.7A
Other languages
Chinese (zh)
Other versions
CN110069199A (en)
Inventor
杨盘隆 (Yang Panlong)
曹书敏 (Cao Shumin)
李向阳 (Li Xiangyang)
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201910248707.7A priority Critical patent/CN110069199B/en
Publication of CN110069199A publication Critical patent/CN110069199A/en
Application granted granted Critical
Publication of CN110069199B publication Critical patent/CN110069199B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser

Abstract

The invention discloses a skin-type finger gesture recognition method based on a smart watch, which comprises the following steps. Step 1, signal acquisition: acquiring, through the smart watch, the passive acoustic signal generated by friction between a finger and the skin of the back of the hand. Step 2, data preprocessing: removing noise from the signal through filtering to obtain the acoustic signal to be detected. Step 3, gesture motion detection: dividing the acoustic signal to be detected into a plurality of independent gesture acoustic signals through gesture detection processing. Step 4, feature extraction: converting the independent gesture acoustic signals into time-frequency spectrograms and images of Mel cepstrum coefficients, represented as grayscale images, as characteristic values. Step 5, gesture motion recognition: taking the extracted characteristic value of each independent finger action as input data and performing finger gesture action recognition on it with a convolutional neural network model to obtain the corresponding finger gesture action. The method expands the input area of the wearable device without requiring any additional equipment, and has the advantages of simplicity and good real-time performance.

Description

Skin type finger gesture recognition method based on smart watch
Technical Field
The invention relates to the field of smart watch application, in particular to a skin type finger gesture recognition method based on a smart watch.
Background
Currently, a great deal of research has developed new types of software based on smartwatches and proposed many new methods for smartwatch interaction. Many previous studies rely on special hardware or sensors. For example, Google developed a purpose-built chip in Project Soli that uses 60 GHz radar to replace ordinary input with hand motions; WatchIt provides a device prototype that extends input to the watch band; MoLe proposed using the motion sensors already in the watch to infer what the user is writing.
The human skin is an always-available input surface, and many technologies have been studied in this area; for example, novel skin-worn hardware has been designed in previously published technical schemes, or input is sensed by less common means such as electrical signals, sound, or even optical projection. iSkin proposes using a small biocompatible metal-plated sensor for input on the human body. SkinTrack uses a signal-emitting ring and measures the phase offset of the received RF signal to track the finger. Skinput uses the human body to transmit bio-acoustic sound so that the skin becomes the input surface for a set of sensors on the arm. SkinButtons proposes embedding small projectors into a watch to project images onto the skin.
However, these conventional methods using human skin input have a problem that specific hardware and a complicated system configuration are required.
Disclosure of Invention
Based on the problems in the prior art, the invention aims to provide a skin type finger gesture recognition method based on a smart watch, which can realize handwriting input by using a worn smart watch and using the back of a human hand as an input surface.
The purpose of the invention is realized by the following technical scheme:
the embodiment of the invention provides a skin type finger gesture recognition method based on a smart watch, which comprises the following steps:
step 1, signal acquisition: when a finger rubs and marks a gesture on the skin of the back of the hand, a microphone of the smart watch collects a passive acoustic signal generated by the friction between the finger and the skin of the back of the hand;
step 2, data preprocessing: removing noise signals in the passive acoustic signals collected in the step 1 through filtering processing to obtain acoustic signals to be detected;
step 3, gesture motion detection: dividing the acoustic signal to be detected obtained after the preprocessing in the step 2 into a plurality of independent gesture acoustic signals through gesture detection processing;
Step 4, feature extraction: converting each of the plurality of independent gesture acoustic signals into a time-frequency spectrogram and an image of Mel cepstrum coefficients, represented as grayscale images, as the characteristic value of each independent finger action;
Step 5, gesture motion recognition: taking the characteristic value of each independent finger action extracted in step 4 as input data, and performing finger gesture action recognition on the input data by using a convolutional neural network model to obtain the corresponding finger gesture action.
According to the technical scheme provided by the invention, the skin-type finger gesture recognition method based on the smart watch has the following beneficial effects:
The microphone of the smart watch collects the friction sound produced when a finger writes gestures on the skin of the back of the hand, and the collected sound is processed to recognize the written finger gesture. Interactive input to the wearable device can thus be performed by a finger on the skin of the back of the hand, which expands the input region of the wearable device without requiring any other bulky equipment. The method is simple and has good real-time performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a flowchart of a method for recognizing a skin-type finger gesture based on a smart watch according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a gesture action set of a recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an application system architecture of an identification method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of acoustic signals processed in the recognition method provided by the embodiment of the present invention, wherein (1) is a schematic diagram of the original signal and (2) is a schematic diagram of the signal after band-pass filtering;
fig. 5 is a schematic diagram of an acoustic signal after wavelet transform in the identification method according to the embodiment of the present invention;
fig. 6 is a graph, a frequency spectrum diagram and a mel-frequency cepstrum after data processing of four standard gesture actions in the recognition method according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the specific contents of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art.
As shown in fig. 1, an embodiment of the present invention provides a skin-type finger gesture recognition method based on a smart watch, including:
step 1, signal acquisition: when a finger rubs and marks a gesture on the skin of the back of the hand, a microphone of the smart watch collects a passive acoustic signal generated by the friction between the finger and the skin of the back of the hand;
step 2, data preprocessing: removing noise signals in the passive acoustic signals collected in the step 1 through filtering processing to obtain acoustic signals to be detected;
step 3, gesture motion detection: dividing the acoustic signal to be detected obtained after the preprocessing in the step 2 into a plurality of independent gesture acoustic signals through gesture detection processing;
Step 4, feature extraction: converting each of the plurality of independent gesture acoustic signals into a time-frequency spectrogram and an image of Mel cepstrum coefficients, represented as grayscale images, as the characteristic value of each independent finger action;
Step 5, gesture motion recognition: taking the characteristic value of each independent finger action extracted in step 4 as input data, and performing finger gesture action recognition on the input data by using a convolutional neural network model to obtain the corresponding finger gesture action.
In step 2 of the above method, the filtering process includes:
and filtering (namely, band-pass filtering) through an FIR filter to remove low-frequency and high-frequency noises in the collected passive acoustic signals, so as to obtain the acoustic signals to be detected which only contain the acoustic signals of the finger gesture actions.
The removal of low-frequency and high-frequency noise from the collected passive acoustic signals by FIR filtering is as follows: the filtered signal obtained at the output of the FIR filter, with out-of-band noise removed, is used as the acoustic signal to be detected y[n], where y[n] is as follows:
y[n] = Σ_{k=1}^{N+1} b_k · x[n − k + 1];
b_k = b_{N+2−k}, k = 1, 2, ..., N+1;
wherein x[n] is the collected signal, N is the order of the FIR filter and is set to 112, and b_k are the coefficients of the FIR filter, i.e. the values of its impulse response. The two pass-band cut-off frequencies are set to 6000 Hz and 14000 Hz, the two stop-band cut-off frequencies to 5000 Hz and 15000 Hz, and the sampling rate of the original signal is 44100 Hz.
In step 3 of the method, dividing the acoustic signal to be detected obtained after the preprocessing in step 2 into a plurality of independent gesture acoustic signals through gesture detection processing includes:
detecting a starting point and an end point of each independent finger gesture from the acoustic signals to be detected, extracting effective segments through the starting point and the end point of each independent finger gesture, and identifying a plurality of independent gesture acoustic signals according to the effective segments.
In the method, the starting point and the end point of each independent finger gesture are detected from the acoustic signal to be detected, and the mode of extracting the effective segment through the starting point and the end point of each independent finger gesture is as follows:
The acoustic signal to be detected y[n] is divided into multiple segments of data by a sliding window, and the short-term average energy of each segment is calculated as
E[n] = (1/W) · Σ_{i=n}^{n+W−1} y[i]²;
where W is the window size, W is set to 882, and the step size is set to 750;
by judging the average energy E [ n ]]And
Figure GDA0002083793070000042
first difference value therebetween
Figure GDA0002083793070000043
If the empirical threshold gamma is exceeded, the starting point n of the finger gesture is confirmed, i.e.
Figure GDA0002083793070000044
Two guard intervals, Δ_pre and Δ_post, are set on the two sides of the estimated gesture input sound. The computation above yields the candidates of gesture input starting points {n₁, n₂, ..., n_m}; Δ_pre is subtracted from each of these points to form a starting point and Δ_post is added to form an end point, i.e. each candidate n_i is expanded and the candidate set becomes {[n₁ − Δ_pre, n₁ + Δ_post], ..., [n_m − Δ_pre, n_m + Δ_post]}.
In step 4 of the method, the manner of converting the plurality of independent gesture acoustic signals into the grayscale images to obtain the characteristic values of the finger motions is as follows:
the method comprises the steps of calculating short-time Fourier transform of each independent gesture acoustic signal to obtain a time spectrum graph, calculating Mel cepstrum of each independent gesture sound signal to obtain an image of a Mel cepstrum coefficient, combining the time spectrum graph obtained through the short-time Fourier transform and the image of the Mel cepstrum coefficient obtained through calculation, converting the combined image into a gray level image, and obtaining a characteristic value of the gesture acoustic signal through the gray level image;
wherein the short-time Fourier transform (STFT) is as follows:
Y[m, ω] = Σ_{n=−∞}^{+∞} y[n] · w[n − m] · e^{−jωn};
where w[t] is a window function and Y[m, ω] is the Fourier transform of y[n]w[n − m]; a Hamming window of size 512, an FFT of length 512, and an overlap length of 256 are used.
Wherein, the Mel cepstrum coefficient is calculated as:
mel(f) = 2595 · log₁₀(1 + f / 700);
wherein f is frequency, and the set frequency is 100 Hz.
In step 5 of the above method, the convolutional neural network model used is:
taking a LeNet structure as a main structure, and using a convolution layer of AlexNet;
comprises four convolution layers and four pooling layers, and then two full-connection layers and an output layer; wherein the content of the first and second substances,
the sizes of the convolution kernels are two 11 × 11,5 × 5 and 3 × 3, the pool size is 3 × 3, and the stride is 2.
In the convolutional neural network model of method step 5 above,
The L2 regularization term
(λ/2) · Σ_w w²
is added to the error function of the convolutional neural network model and used in the fully connected layers;
each layer of the convolutional neural network model uses a dropout mechanism, and the probability is set to a fixed value p = 0.8 during training.
According to the method, the back of the hand is expanded into an input surface, and the finger gestures input on the back of the hand are recognized from the sound signals produced by the friction of the sliding finger gestures on the skin of the back of the hand; this alleviates the problems that existing smart wearable devices have small screens, that their touch screens consume power, and that input on them is inconvenient.
The embodiments of the present invention are described in further detail below.
The skin-type finger gesture recognition method based on the smart watch disclosed by the embodiment of the invention constructs a virtual handwriting input keyboard on the back of the hand, on the basis of a commercial smart watch, to recognize finger gestures and thereby realize skin-type handwriting input. In the recognition method, the microphone of the smart watch collects the acoustic signals of fingers sliding and rubbing on the back of the hand; features are then extracted from the collected acoustic signals and used as the input of machine learning, so that each action is recognized. The method differs from existing recognition methods in that it expands the input mode of the smart wearable device without requiring anything to be worn on a finger or an additional device to be held in the hand.
The skin-type finger gesture recognition method based on the smart watch extracts the friction sound between the fingers and the back of the hand using the microphone embedded in the smart wearable device (the smart watch), so that finger gestures are recognized in real time. The recognition method relies on the fact that the microphone of a commercial smart wearable device can capture the weak sound of skin friction, which makes it feasible to use the back of the hand as an extended gesture input surface. In view of extending the input of wearable devices with small screens and few buttons, the present invention employs multi-finger gestures on the back of the hand to perform more useful operations. As shown in fig. 2, 4 basic finger gestures may be defined, including swipe left, swipe right, pinch, and expand, which are commonly used gestures when a user interacts with a smart wearable device. To achieve more useful motion control, these four types of gestures are extended to 12 multi-finger gestures using two or more fingers. Multi-finger gestures give smart wearable device developers the flexibility to select the most appropriate gestures for their applications from a particular set of multi-finger gestures. These gestures involve only finger and hand movements and no larger body movements. The unique motion of each finger gesture introduces differences in the acoustic signals, and these differences can be exploited to recognize the finger gestures, thereby achieving the purpose of expanding the skin of the back of the hand into an input surface for the smart wearable device.
The identification method comprises the following specific processes:
a. firstly, an intelligent watch or an intelligent bracelet running the application of the identification method needs to be prepared and worn on a hand;
b. taking the back of the hand wearing the watch as an extended input surface, and performing the finger gesture actions shown in FIG. 2 on the extended input surface;
c. Tap the start button in the interface of the smart watch or smart band to turn on the recording function, and then perform finger gesture actions on the back of the hand; the smart watch or smart band will give the corresponding response, i.e. carry out the corresponding operation.
The recognition method is applied in an acoustic sensing system that can recognize finger gestures and realize human-computer interaction using the microphone in commercial smart devices. Fig. 3 depicts the system architecture, which mainly comprises five processing modules, namely: signal acquisition, data preprocessing, gesture action detection, feature extraction and gesture action recognition; wherein:
in the first signal acquisition module, the smart watch collects the passive acoustic signals emitted by the finger gestures, and then, due to its limited computing power, it sends the recorded signals via bluetooth to a smartphone or other computing processing device (e.g., tablet, computer, etc.) for further processing.
In a second data pre-processing module, minimizing the interference of ambient noise by applying a band-pass filter to the original audio signal;
then, a gesture detection method is adopted in a gesture action detection module to extract a part with gesture action in the preprocessed sound signal;
in the feature extraction module, converting the acoustic signals into a time-frequency spectrogram and a Mel inverse pedigree map as the features of each independent finger action;
in the last gesture action recognition module, the finger gesture action is recognized by using a Convolutional Neural Network (CNN), wherein a time-frequency spectrogram and mel-frequency cepstrum coefficients are converted into a visual image as the input of the CNN. And finally, the smart watch or the smart phone calls the corresponding function of each individual application based on the output of the CNN to interact with the user.
When the recognition method is running, the smart watch continuously monitors gesture input on the back of the hand and transmits the recorded sound to the smartphone. The sampling rate of the microphone embedded in the smart watch is 44100 Hz, which is sufficient to capture the relevant ambient sounds. Fig. 5 shows the signal of a single right-swipe finger gesture.
The raw sound captured by commercial microphones is inherently noisy and the surrounding environment usually has varying noise levels, so wavelet time-frequency analysis is used to obtain the frequency range of the sound produced by the friction of the fingers against the skin of the back of the hand. The received acoustic signal x[n] of length n is analyzed by computing a Discrete Wavelet Transform (DWT) through a series of filters, i.e.
x_{α,L}[n] = Σ_k x[k] · g[2n − k];
x_{α,H}[n] = Σ_k x[k] · h[2n − k];
where x_{α,L}[n] and x_{α,H}[n] are the outputs of the low-pass filter g and the high-pass filter h, respectively. As shown in fig. 4(1) and 4(2), there is a highlighted vertical line around 0.4 seconds indicating that a finger gesture occurred. The bright line occupies frequencies between 500 Hz and 20000 Hz, which ordinary noise does not reach.
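For illustration, a minimal sketch of one level of this wavelet decomposition is given below, assuming the PyWavelets library; the choice of the 'db4' wavelet is an assumption, since the patent does not specify the filter pair g and h.

import pywt

def dwt_level(x):
    # cA approximates x_{a,L}[n] (low-pass branch g), cD approximates x_{a,H}[n]
    # (high-pass branch h); both are downsampled by 2 as in the equations above.
    cA, cD = pywt.dwt(x, 'db4')
    return cA, cD

Repeating pywt.dwt on the approximation coefficients cA yields the multi-level decomposition used to inspect which frequency bands carry the friction sound.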
In order to optimize the audio signals for finger gesture recognition, the proposed system first passes them through a band-pass filter to remove low- and high-frequency noise. An FIR filter is a natural choice: it is inherently stable and can be designed to have a linear phase response. The output signal of the FIR filter is:
y[n] = Σ_{k=1}^{N+1} b_k · x[n − k + 1];
b_k = b_{N+2−k}, k = 1, 2, ..., N+1;
where N is the order of the FIR filter and b_k are its coefficients, i.e. the values of its impulse response. The filter is thus used to eliminate out-of-band interference in the sound signal. As mentioned above, the frequency of the sound signal produced by skin friction is typically within 5000-15000 Hz. The two pass-band cut-off frequencies are set to 6000 Hz and 14000 Hz and the two stop-band cut-off frequencies to 5000 Hz and 15000 Hz, with the sampling rate of the original sound signal being 44100 Hz. N is set to 112 empirically to obtain the required denoising result. As shown in fig. 4(2), the FIR filter almost completely eliminates out-of-band noise; fig. 4(1) shows the original sound signal.
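A minimal sketch of such a band-pass FIR filter is shown below, assuming SciPy's window-method design (scipy.signal.firwin); the patent does not disclose its design method, so the design routine and the exact cut-off placement between the stated pass-band and stop-band edges are assumptions.

import numpy as np
from scipy.signal import firwin, lfilter

FS = 44100          # sampling rate stated in the text
NUM_TAPS = 113      # N = 112 -> N + 1 symmetric coefficients b_k (linear phase)

def bandpass(x):
    # Cut-offs placed between the stated stop-band (5/15 kHz) and pass-band (6/14 kHz) edges.
    b = firwin(NUM_TAPS, [5500.0, 14500.0], pass_zero=False, fs=FS)
    return lfilter(b, [1.0], x)      # y[n] = sum_k b_k * x[n - k]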
The method for extracting an audio signal for each finger gesture by the skin-type input system is based on signal processing in the time domain. The primary effect of the frictional sound between the fingers and the back of the hand on the received sound signal is a rising or falling edge. These variations are crucial for detecting finger gestures and the uniqueness of the different variation patterns is exploited to classify finger movements. To detect the starting point of the gesture, a valid segment is extracted from the processed acoustic signal.
Inspired by Constant False Alarm Rate (CFAR) detection, a method of detecting the start point and end point of an input gesture is proposed. The system of the present invention uses a sliding window to divide the sound signal y[n] into segments of data and calculates the short-term average energy of each segment,
E[n] = (1/W) · Σ_{i=n}^{n+W−1} y[i]²;
where W is the window size. The window size is set to 882 and the step size to 750, i.e. each segment contains 0.02 s of audio signal at a sample rate of 44100 Hz. Since the signal has already been processed by the band-pass filter, the short-term average energy differs only slightly from that of nearby segments when there is no gesture. However, when gesture input occurs, the average energy E[n] bursts suddenly and its first difference ∇E[n] = E[n] − E[n−1] becomes large. Furthermore, the occurrence of a gesture input makes the difference between E[n] and E[n−1] large enough to exceed the empirical threshold γ. The starting point n of the finger gesture is then identified, i.e. n is taken as a candidate starting point whenever ∇E[n] > γ.
In the gesture detection process, the system sets two guard intervals, Δ_pre and Δ_post, on the two sides of the estimated gesture input sound. Using the computation above, the system obtains the candidates for the gesture input starting point; for example, the set of candidate gesture starting points is {n₁, n₂, ..., n_m}. Δ_pre is subtracted from each of these points to form a starting point and Δ_post is added to form an end point, i.e. each candidate n_i is expanded into the segment [n_i − Δ_pre, n_i + Δ_post].
The purpose of this operation is to better extract the complete gesture input signal.
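The following sketch illustrates this energy-difference detector with the stated window of 882 samples and step of 750 samples; the threshold GAMMA and the guard-interval lengths PRE and POST are assumed values, since the patent gives no numbers for them.

import numpy as np

W, STEP = 882, 750        # window and step stated in the text (0.02 s at 44.1 kHz)
GAMMA = 1e-4              # empirical threshold gamma (assumed value)
PRE, POST = 2205, 8820    # guard intervals in samples (assumed values)

def detect_gestures(y):
    starts = np.arange(0, len(y) - W, STEP)
    energy = np.array([np.mean(y[s:s + W] ** 2) for s in starts])   # E[n]
    diff = np.diff(energy, prepend=energy[0])                        # first difference of E[n]
    candidates = starts[diff > GAMMA]                                # {n_1, ..., n_m}
    # Expand each candidate by the guard intervals to cover the whole gesture.
    return [(max(int(n) - PRE, 0), min(int(n) + POST, len(y))) for n in candidates]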
After detecting the gesture input, the system obtains a valid sound signal for each gesture. The first row of fig. 6 shows the extracted sound signals of the swipe-left, swipe-right, pinch, and expand gestures. Since using only time-domain signals is not sufficient, the system extracts features mainly based on time-frequency analysis.
However, the classical Fourier transform does not provide resolution in both time and frequency. Instead, the Short-Time Fourier Transform (STFT) overcomes this limitation by dividing the longer time signal into shorter segments. The STFT is defined by the following equation:
Y[m, ω] = Σ_{n=−∞}^{+∞} y[n] · w[n − m] · e^{−jωn};
where w[t] is a window function and Y[m, ω] is essentially the Fourier transform of y[n]w[n − m]. The efficacy of the STFT depends on choosing appropriate parameters. In this system, a Hamming window of size 512, an FFT of length 512, and an overlap length of 256 are used. The second row of fig. 6 shows the sound spectrograms of the swipe-left, swipe-right, and pinch gestures. Then, 12 MFCC features are extracted; the Mel cepstrum images are shown in the third row of fig. 6. During gesture extraction, the system computes and combines the STFT and MFCC coefficients, which are then converted into a grayscale image.
The skin-type input system provided by the invention uses a CNN model to recognize the different gesture inputs: it computes the STFT and MFCC coefficients and converts the result into a grayscale image as the input of the CNN model. Obtaining a proper image is therefore very important for ensuring the classification performance of the CNN.
In the recognition method, a CNN structure suitable for running on mobile devices is designed with reference to two popular CNN structures (LeNet-5 and AlexNet) and serves as the CNN model of the invention. The CNN model combines the advantages of the LeNet-5 and AlexNet structures. Specifically, the LeNet structure is chosen as the main structure and the convolutional layers of AlexNet are used. Four convolutional layers and four pooling layers are used, followed by two fully connected layers and one output layer. The convolution kernel sizes are 11 × 11, 11 × 11, 5 × 5 and 3 × 3; the pooling size is 3 × 3 and the stride is 2. Meanwhile, regularization and dropout, which are commonly used to counter overfitting, are introduced into the recognition method of the invention. L2 regularization is achieved by adding the penalty term
(λ/2) · Σ_w w²
to the error function of the neural network; it is used only in the fully connected layers. Dropout is used at each layer, and the probability is set to a fixed value p = 0.8 during training.
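For illustration, a minimal Keras sketch of such a network is given below. Only the kernel sizes (two 11 × 11, one 5 × 5, one 3 × 3), the 3 × 3 pooling with stride 2, the two fully connected layers, the L2 regularization on the fully connected layers, and the dropout follow the text; the input size, filter counts, regularization weight, and optimizer are assumptions, and the stated p = 0.8 is interpreted here as a keep probability (a Keras dropout rate of 0.2).

from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(input_shape=(140, 173, 1), num_classes=12):
    l2 = regularizers.l2(1e-4)   # regularization weight lambda is an assumed value
    m = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 11, padding='same', activation='relu'),   # 11 x 11
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(64, 11, padding='same', activation='relu'),   # 11 x 11
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(96, 5, padding='same', activation='relu'),    # 5 x 5
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(96, 3, padding='same', activation='relu'),    # 3 x 3
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(256, activation='relu', kernel_regularizer=l2),
        layers.Dropout(0.2),     # keep probability p = 0.8
        layers.Dense(128, activation='relu', kernel_regularizer=l2),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation='softmax'),
    ])
    m.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
    return m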
Those of ordinary skill in the art will understand that: all or part of the processes of the methods for implementing the embodiments may be implemented by a program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. Namely, the method can be operated in a smart watch, a bracelet and a mobile phone in an application program mode.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A skin-type finger gesture recognition method based on a smart watch is characterized by comprising the following steps:
step 1, signal acquisition: when a finger rubs and marks a gesture on the skin of the back of the hand, a microphone of the smart watch collects a passive acoustic signal generated by the friction between the finger and the skin of the back of the hand;
step 2, data preprocessing: removing noise signals in the passive acoustic signals collected in the step 1 through filtering processing to obtain acoustic signals to be detected;
step 3, gesture motion detection: dividing the acoustic signal to be detected obtained after the preprocessing in the step 2 into a plurality of independent gesture acoustic signals through gesture detection processing;
Step 4, feature extraction: converting each of the plurality of independent gesture acoustic signals into a time-frequency spectrogram and an image of Mel cepstrum coefficients, represented as grayscale images, as the characteristic value of each independent finger action; specifically: the short-time Fourier transform of each independent gesture acoustic signal is computed to obtain its time-frequency spectrogram, and the Mel cepstrum of each independent gesture acoustic signal is computed to obtain an image of its Mel cepstrum coefficients; the time-frequency spectrogram obtained by the short-time Fourier transform and the computed image of Mel cepstrum coefficients are combined, the combined image is converted into a grayscale image, and the characteristic value of the gesture acoustic signal is obtained from the grayscale image; wherein the short-time Fourier transform (STFT) is as follows:
Y[m, ω] = Σ_{n=−∞}^{+∞} y[n] · w[n − m] · e^{−jωn};
where w[t] is a window function and Y[m, ω] is the Fourier transform of y[n]w[n − m]; a Hamming window of size 512, an FFT of length 512, and an overlap length of 256 are used;
Step 5, gesture motion recognition: taking the characteristic value of each independent finger action extracted in step 4 as input data, and performing finger gesture action recognition on the input data by using a convolutional neural network model to obtain the corresponding finger gesture action; the convolutional neural network model used is: a LeNet structure is taken as the main structure, and the convolutional layers of AlexNet are used; it comprises four convolutional layers and four pooling layers, followed by two fully connected layers and one output layer; wherein the convolution kernel sizes are 11 × 11, 11 × 11, 5 × 5 and 3 × 3, the pooling size is 3 × 3, and the stride is 2.
2. A smart watch based skin-type finger gesture recognition method according to claim 1, wherein in step 2 of the method, the filtering process used comprises:
filtering the acquired passive acoustic signals with an FIR filter to remove low-frequency and high-frequency noise, so as to obtain acoustic signals to be detected that contain only the acoustic signals of the finger gesture actions.
3. The smart watch-based skin-type finger gesture recognition method according to claim 2, wherein the method for eliminating low-frequency and high-frequency noise in the collected passive acoustic signals through FIR filter filtering is as follows:
the filtered signal obtained at the output of the FIR filter, with out-of-band noise removed, is used as the acoustic signal to be detected y[n], where y[n] is as follows:
y[n] = Σ_{k=1}^{N+1} b_k · x[n − k + 1];
b_k = b_{N+2−k}, k = 1, 2, ..., N+1;
wherein N is the order of the FIR filter and is set to 112; b_k are the coefficients of the FIR filter, i.e. the values of its impulse response; the two pass-band cut-off frequencies are set to 6000 Hz and 14000 Hz, the two stop-band cut-off frequencies to 5000 Hz and 15000 Hz, and the sampling rate of the original signal is 44100 Hz.
4. The smart watch-based skin-type finger gesture recognition method according to any one of claims 1 to 3, wherein in step 3 of the method, the step of dividing the acoustic signal to be detected obtained after the preprocessing in the step 2 into a plurality of independent gesture acoustic signals through gesture detection processing comprises the steps of:
detecting a starting point and an end point of each independent finger gesture from the acoustic signals to be detected, extracting effective segments through the starting point and the end point of each independent finger gesture, and identifying a plurality of independent gesture acoustic signals according to the effective segments.
5. The smart-watch-based skin-type finger gesture recognition method according to claim 4, wherein the starting point and the ending point of each independent finger gesture are detected from the acoustic signal to be detected, and the effective segments are extracted through the starting point and the ending point of each independent finger gesture as follows:
the acoustic signal to be detected y[n] is divided into multiple segments of data by a sliding window, and the short-term average energy of each segment is calculated as
E[n] = (1/W) · Σ_{i=n}^{n+W−1} y[i]²;
where W is the window size, W is set to 882, and the step size is set to 750;
by judging whether the first difference ∇E[n] = E[n] − E[n−1] between the average energy of the current segment and that of the previous segment exceeds the empirical threshold γ, the starting point n of a finger gesture is confirmed, i.e. n is taken as a candidate starting point whenever ∇E[n] > γ;
two guard intervals, Δ_pre and Δ_post, are set on the two sides of the estimated gesture input sound; the computation above yields the candidates of gesture input starting points {n₁, n₂, ..., n_m}; Δ_pre is subtracted from each of these points to form a starting point and Δ_post is added to form an end point, i.e. the candidate set becomes {[n₁ − Δ_pre, n₁ + Δ_post], ..., [n_m − Δ_pre, n_m + Δ_post]}.
6. A smart-watch based cutaneous finger gesture recognition method according to claim 1, characterized in that in the convolutional neural network model of step 5 of the method,
the L2 regularization term (λ/2) · Σ_w w² is added to the error function of the convolutional neural network model and used in the fully connected layers;
each layer of the convolutional neural network model uses a dropout mechanism, and the probability is set to a fixed value p = 0.8 during training.
CN201910248707.7A 2019-03-29 2019-03-29 Skin type finger gesture recognition method based on smart watch Active CN110069199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910248707.7A CN110069199B (en) 2019-03-29 2019-03-29 Skin type finger gesture recognition method based on smart watch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910248707.7A CN110069199B (en) 2019-03-29 2019-03-29 Skin type finger gesture recognition method based on smart watch

Publications (2)

Publication Number Publication Date
CN110069199A CN110069199A (en) 2019-07-30
CN110069199B true CN110069199B (en) 2022-01-11

Family

ID=67366749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910248707.7A Active CN110069199B (en) 2019-03-29 2019-03-29 Skin type finger gesture recognition method based on smart watch

Country Status (1)

Country Link
CN (1) CN110069199B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110784788A (en) * 2019-09-18 2020-02-11 广东思派康电子科技有限公司 Gesture recognition method based on microphone
CN110751105B (en) * 2019-10-22 2022-04-08 珠海格力电器股份有限公司 Finger image acquisition method and device and storage medium
CN111158487A (en) * 2019-12-31 2020-05-15 清华大学 Man-machine interaction method for interacting with intelligent terminal by using wireless earphone
CN111929689B (en) * 2020-07-22 2023-04-07 杭州电子科技大学 Object imaging method based on sensor of mobile phone
CN112364779B (en) * 2020-11-12 2022-10-21 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112966662A (en) * 2021-03-31 2021-06-15 安徽大学 Short-range capacitive dynamic gesture recognition system and method
CN113126764B (en) * 2021-04-22 2023-02-24 中国水利水电科学研究院 Personal water volume detection method based on smart watch
CN113197569B (en) * 2021-04-23 2022-05-20 华中科技大学 Human body intention recognition sensor based on friction power generation and recognition method thereof
CN113849068B (en) * 2021-09-28 2024-03-29 中国科学技术大学 Understanding and interaction method and system for multi-modal information fusion of gestures
US20240037529A1 (en) * 2022-07-27 2024-02-01 Bank Of America Corporation System and methods for detecting and implementing resource allocation in an electronic network based on non-contact instructions

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103885744A (en) * 2013-05-30 2014-06-25 美声克(成都)科技有限公司 Sound based gesture recognition method
CN106095203A (en) * 2016-07-21 2016-11-09 范小刚 Sensing touches the calculating Apparatus and method for that sound inputs as user's gesture
CN106919958A (en) * 2017-03-21 2017-07-04 电子科技大学 A kind of human finger action identification method based on intelligent watch

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9827420B2 (en) * 2013-03-29 2017-11-28 Neurometrix, Inc. Transcutaneous electrical nerve stimulator with user gesture detector and electrode-skin contact detector, with transient motion detector for increasing the accuracy of the same

Also Published As

Publication number Publication date
CN110069199A (en) 2019-07-30

Similar Documents

Publication Publication Date Title
CN110069199B (en) Skin type finger gesture recognition method based on smart watch
Mouawad et al. Robust detection of COVID-19 in cough sounds: using recurrence dynamics and variable Markov model
US11389084B2 (en) Electronic device and method of controlling same
US9720515B2 (en) Method and apparatus for a gesture controlled interface for wearable devices
Zhao et al. Towards low-cost sign language gesture recognition leveraging wearables
CN106919958B (en) Human body finger action recognition method based on smart watch
WO2017152531A1 (en) Ultrasonic wave-based air gesture recognition method and system
CN111354371B (en) Method, device, terminal and storage medium for predicting running state of vehicle
US10347249B2 (en) Energy-efficient, accelerometer-based hotword detection to launch a voice-control system
CN112327288B (en) Radar human body action recognition method, radar human body action recognition device, electronic equipment and storage medium
Zhu et al. Control with gestures: A hand gesture recognition system using off-the-shelf smartwatch
WO2011092549A1 (en) Method and apparatus for assigning a feature class value
EP4098182A1 (en) Machine-learning based gesture recognition with framework for adding user-customized gestures
WO2017036147A1 (en) Bioelectricity-based control method, device and controller
CN106323330A (en) Non-contact-type step count method based on WiFi motion recognition system
CN112699744A (en) Fall posture classification identification method and device and wearable device
Cao et al. ipand: Accurate gesture input with smart acoustic sensing on hand
CN111257890A (en) Fall behavior identification method and device
CN113051972A (en) Gesture recognition system based on WiFi
Hu et al. Vihand: Gesture recognition with ambient light
US11918346B2 (en) Methods and systems for pulmonary condition assessment
Liutkus et al. Source separation for target enhancement of food intake acoustics from noisy recordings
CN108520755B (en) Detection method and device
Zhou et al. Acoustic Sensing-based Hand Gesture Detection for Wearable Device Interaction
JP5998811B2 (en) Information input device, specific frequency extraction method, specific frequency extraction program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant