CN116027911B - Non-contact handwriting input recognition method based on audio signal


Info

Publication number
CN116027911B
CN116027911B (application number CN202310316251.XA)
Authority
CN
China
Prior art keywords
signal
frame
audio
impulse response
handwriting
Prior art date
Legal status
Active
Application number
CN202310316251.XA
Other languages
Chinese (zh)
Other versions
CN116027911A (en)
Inventor
李凡
孟玲
曾秋阳
刘晓晨
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310316251.XA priority Critical patent/CN116027911B/en
Publication of CN116027911A publication Critical patent/CN116027911A/en
Application granted granted Critical
Publication of CN116027911B publication Critical patent/CN116027911B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention relates to a non-contact handwriting input recognition method based on an audio signal, belonging to the technical fields of voice recognition and mobile computing applications. The invention uses the loudspeaker in a mobile device to continuously play a predefined audio signal and uses the microphone to collect the audio signal reflected by the finger during writing. When a user performs handwriting input, the movement of the hand causes changes in the reflected audio signal. A lightweight classification network is designed to learn fine-grained changes of the audio transmission channel and recognize the user's handwriting input content in real time. A data augmentation technique expands the dataset along the two dimensions of handwriting distance and handwriting speed, so as to adapt to the handwriting habits of different users. In addition, the recognition results of the classification network are corrected by a spelling-correction algorithm, improving fault tolerance.

Description

Non-contact handwriting input recognition method based on audio signal
Technical Field
The invention relates to a handwriting input recognition method, in particular to a non-contact handwriting input recognition method based on active acoustic sensing using a smartphone loudspeaker and microphone, thereby expanding the modes of human-computer interaction. It belongs to the technical fields of voice recognition and mobile computing applications.
Background
Touch screen interaction is widely used in various mobile devices (such as smartphones and smart tablets) as a simple and direct mode of human-computer interaction. However, as the usage scenarios of mobile devices continue to expand, touch interaction with fingers on a touch screen alone gradually fails to meet people's growing demands.
With the popularity of smart wearable devices, more and more users use these devices for activities such as entertainment and health monitoring. However, to ensure portability, wearable devices are typically equipped with only a small (about 1 inch) screen, and it is difficult for a user to perform touch-screen interactions such as handwriting input on such a small screen. Therefore, to overcome the limitations of touch-screen interaction on wearable devices, it is necessary to study a convenient and efficient off-screen handwriting recognition scheme.
Currently, there are handwriting recognition methods implemented with the motion sensors widely deployed in wristband devices. For example, when a user writes with the hand wearing a wristband device, the motion sensor can sense the data changes caused by the hand's movement and thereby recognize the handwritten content. However, research has found that people tend to wear wristband devices on the non-dominant hand to avoid knocks, while handwriting is usually performed with the dominant hand. In this case, the wristband device cannot capture the behavior of the writing hand and thus cannot recognize the handwritten content. Forcing users to change this habit and wear the device on the dominant hand results in a poor user experience.
In addition, there are handwriting recognition methods implemented with the audio devices (speakers and microphones) common on mobile devices, mainly based on two approaches: passive acoustic sensing and active acoustic sensing. Passive acoustic sensing uses microphones to directly collect the audio generated by a user's finger sliding on a surface near the device in order to identify the input content; this approach is susceptible to ambient noise and to the material of the writing surface. Active acoustic sensing uses a speaker to play a sensing audio signal that is blocked by the user's finger and reflected back to the microphone when the user writes near the device. By analyzing the change pattern of the reflected signal, the user's input content can be identified. Unlike passive acoustic sensing, active acoustic sensing does not depend on the writing-surface material. However, existing active-acoustic solutions are computationally expensive, impose an extra burden on users, or require multiple pairs of audio devices; they cannot be applied to most mobile devices, which are equipped with only one pair of audio devices, and therefore scale poorly.
In view of the foregoing, there are various drawbacks and shortcomings of the conventional handwriting recognition methods, and a new method is needed to overcome the above-mentioned limitations.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and creatively provides a non-contact handwriting input recognition method based on an audio signal. The method uses a single microphone-loudspeaker pair on the mobile device to collect the audio signals reflected by finger movement during handwriting, thereby recognizing handwritten content near the device and realizing handwriting input recognition.
The innovations of the invention include: the speaker in the mobile device continuously plays a predefined audio signal, and the microphone collects the audio signal reflected during finger writing. When a user performs handwriting input, the movement of the hand causes changes in the reflected audio signal. A lightweight classification network is designed to learn fine-grained changes of the audio transmission channel and recognize the user's handwriting input content in real time. A data augmentation technique expands the dataset along the two dimensions of handwriting distance and handwriting speed, so as to adapt to the handwriting habits of different users. In addition, the recognition results of the classification network are corrected by a spelling-correction algorithm, improving fault tolerance.
The aim of the invention is achieved by the following technical scheme.
A contactless handwriting recognition method based on an audio signal, comprising the steps of:
step 1: the audio signal reflected when the finger handwriting input is collected by a microphone is collected by playing a predefined audio signal frame by using a speaker in the mobile device.
In a real environment, there are rich multipath effects. To distinguish different transmission paths, the invention preferably designs the transmitted signal from an original signal with strong auto-correlation and weak cross-correlation, as follows:
First, two 13-bit Barker codes are spliced to obtain a 26-bit Barker code, so as to avoid frequency leakage.
The baseband sequence signal is then obtained by 12x frequency-domain interpolation and modulated to limit the signal bandwidth.
Step 2: The audio signals collected by the microphone are preprocessed to eliminate the influence of environmental noise and the inherent delay of the audio device.
Specifically, the following method can be adopted for treatment:
first, audio noise (e.g., speech sounds, music sounds, etc.) is removed by a bandpass filter.
The signal is then aligned by means of the arrival time of the signal transmitted via the direct path with the greatest energy, reducing the effect of the inherent delay of the audio device.
Step 3: IQ demodulation (I: In-phase; Q: Quadrature) is performed on the signal to obtain a baseband complex signal and thus richer information.
Because the audio signal received by the microphone is a passband real signal, IQ demodulation is required to construct a baseband complex signal and obtain richer handwriting information. Specifically, the following approach may be adopted:
first, the aligned audio signals are processed
Figure SMS_1
Multiplying the two waves with cosine and sine wave to obtain quadrature component +.>
Figure SMS_2
And in-phase component->
Figure SMS_3
Then, a low pass filter with a cut-off frequency of 2kHz is used for filtering
Figure SMS_4
And->
Figure SMS_5
Is a high frequency part of the (c).
Finally, the IQ component is combined to construct a baseband complex signal.
Step 4: The differential channel impulse response is estimated to eliminate the effects of static multipath.
Specifically, the following method can be adopted for realizing:
first, a Channel Impulse Response (CIR) is calculated using a least square method.
Then, the Channel Impulse Response (CIR) is differenced along the time axis to obtain a differential channel impulse response (dCIR), eliminating the influence of static multipath effects.
Step 5: and (3) post-processing the signals, eliminating random noise, reducing subsequent calculation cost and dividing handwriting input signals.
Specifically, the post-treatment can be achieved by the following method:
first, a smoothing filter is used to suppress outliers in the differential channel impulse response (dCIR) and to eliminate random noise introduced during sampling.
The differential channel impulse response (dCIR) is then downsampled 2-fold to reduce subsequent computational overhead.
Finally, signal segmentation of individual characters/words is achieved based on the logarithmic short-term energy and an adaptive threshold.
Step 6: the handwritten content is classified using a classification model.
Specifically, the following method may be employed:
first, with data enhancement techniques, the dataset is expanded in both the handwriting distance and handwriting speed dimensions.
The handwritten content is then classified at the character level using a classification model based on a convolutional neural network with gated recurrent units (CNN-GRU).
Step 7: word suggestions are provided by using a spelling error correction tool based on the editing distance and word frequency, handwriting errors/model classification errors of a user are corrected, and handwriting input recognition results are output.
Specifically, the classification result obtained in step 6 is input into an existing spelling error correction tool (e.g., symspllpy). Correcting the handwriting error or the model classification error of the user based on the minimum editing distance and word frequency, and outputting a handwriting input recognition result.
Advantageous effects
Compared with the prior art, the method has the following advantages:
1. The invention realizes high-precision, low-latency, and robust non-contact handwriting input recognition based on active acoustic sensing, relying only on the ordinary loudspeaker and microphone in a mobile device. The loudspeaker plays a predefined transmitted signal, and the microphone receives the signal reflected during finger movement while writing. The invention helps reduce the disease-transmission risk of touching public touch screens and overcomes the screen-size limitation of wearable devices.
2. The invention extracts the differential channel impulse response (dCIR) from the reflected handwriting audio signals collected by the microphone through denoising, alignment, demodulation, and differential channel impulse response estimation algorithms, thereby extracting fine-grained audio features of the audio transmission channel and capturing the differences between different characters written by a user.
3. The invention uses a data augmentation technique to expand the collected character dataset along the two dimensions of handwriting distance and handwriting speed, so as to adapt to the different handwriting habits of different users; it designs a lightweight classification network based on CNN-GRU to recognize the user's handwritten content in real time; and it finally provides word suggestions to users through a spelling-correction algorithm to improve the fault tolerance of the system.
4. The invention realizes a system prototype on a smartphone and has been evaluated extensively in different real environments. The evaluation results show a character-level recognition accuracy of 97.62% for handwriting input, with word accuracy of 96.4% and a word error rate of 1.5%.
Drawings
Fig. 1 is a schematic diagram of a contactless handwriting recognition method based on an audio signal according to the present invention.
Fig. 2 is a schematic diagram of a predefined audio signal played by a speaker according to an embodiment of the present invention.
FIG. 3 is a graph showing the logarithmic short-term energy of the differential channel impulse response dCIR calculated when "a word" is handwritten in an embodiment of the invention.
Fig. 4 is a schematic diagram of a segmentation result when handwriting "a word" in an embodiment of the present invention.
Fig. 5 is a schematic diagram of an android-based application for data acquisition according to an embodiment of the present invention.
Fig. 6 shows the handwriting recognition performance according to the embodiment of the present invention.
Fig. 7 illustrates the performance of handwriting recognition at different handwriting distances (distances between fingers and audio devices) according to an embodiment of the present invention.
Fig. 8 illustrates the handwriting recognition performance at different handwriting speeds according to an embodiment of the present invention.
Detailed Description
The principles of the present invention are described in further detail below with reference to the examples and the drawings.
Fig. 1 shows a schematic diagram of an embodiment of the invention, which consists of seven parts: signal generation, preprocessing, demodulation, differential channel impulse response estimation, post-processing, handwriting classification, and word suggestion. It is implemented with a single microphone-speaker pair on the mobile device.
Examples
A contactless handwriting recognition method based on an audio signal, comprising the steps of:
step 1: the audio signal reflected when the finger handwriting input is collected by a microphone is collected by playing a predefined audio signal frame by using a speaker in the mobile device.
Because of the rich multipath effects in real environments, the invention selects Barker codes, which have strong auto-correlation and weak cross-correlation, as the original signal for designing the transmitted audio signal, in order to distinguish different transmission paths. The specific method is as follows:
Step 1.1: Two 13-bit Barker codes are spliced to obtain a 26-bit Barker code, improving the sensing distance.
Step 1.2: the baseband sequence signal is obtained using frequency domain interpolation and then modulated to limit the bandwidth of the signal.
Specifically, the spliced Barker-code signal obtained in step 1.1 is first transformed into the frequency domain by the fast Fourier transform, and the signal is zero-padded in the frequency domain to 12 times its previous length; the signal length is then 312 sampling points. Next, the signal is transformed back into the time domain using the inverse fast Fourier transform to obtain the baseband sequence signal $b(t)$.
Finally, the signal bandwidth is limited to the range of 18kHz to 22kHz using a 20kHz carrier frequency.
In addition, to reduce interference between adjacent frames, a blank interval of 168 sampling points is appended to the signal, so the length of each signal frame is 480 sampling points. The transmitted signal is:
$s(t) = b(t)\cos(2\pi f_c t)$,
where $b(t)$ denotes the baseband sequence signal after frequency-domain interpolation, $f_c$ denotes the carrier frequency, $\pi$ denotes the circle constant, and $t$ denotes the time instant.
Fig. 2 shows a schematic diagram of a predefined audio signal played by a loudspeaker.
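As a concrete illustration, the frame design of steps 1.1 and 1.2 can be sketched in Python as follows. The patent specifies only the 26-bit spliced Barker code, the 12x frequency-domain interpolation (312 samples), the 20 kHz carrier, and the 168-sample blank interval; the 48 kHz sampling rate, unit amplitude, and this particular zero-padding layout are assumptions of the sketch:

```python
import numpy as np

def generate_frame(fs=48_000, fc=20_000, interp=12, gap=168):
    """Sketch of the transmitted frame (steps 1.1-1.2).

    Assumed (not stated in the patent): 48 kHz sampling rate, unit
    amplitude, and this zero-padding layout for the interpolation."""
    barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], float)
    code = np.concatenate([barker13, barker13])   # 26-bit spliced Barker code

    # 12x frequency-domain interpolation: zero-pad the spectrum to 312 bins.
    n, m = len(code), len(code) * interp
    spec = np.fft.fft(code)
    half = n // 2
    padded = np.zeros(m, dtype=complex)
    padded[:half] = spec[:half]
    padded[half] = spec[half] / 2         # split the Nyquist bin so the
    padded[m - half] = spec[half] / 2     # padded spectrum stays Hermitian
    padded[m - half + 1:] = spec[half + 1:]
    baseband = np.real(np.fft.ifft(padded)) * interp   # b[n], 312 samples

    # Modulate onto the 20 kHz carrier: s(t) = b(t) * cos(2*pi*fc*t).
    t = np.arange(m) / fs
    passband = baseband * np.cos(2 * np.pi * fc * t)

    # 168-sample blank interval -> 480 samples per frame.
    return np.concatenate([passband, np.zeros(gap)])
```

With a 4 kHz-wide baseband and the 20 kHz carrier, the frame's energy lands in the 18-22 kHz band stated in the description.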
Step 2: The audio signals collected by the microphone are preprocessed to eliminate the influence of environmental noise and the inherent delay of the audio device.
Step 2.1: common audio noise is removed by a bandpass filter.
After the microphone collects the reflected audio signal, an 18 kHz to 22 kHz band-pass filter is first used to remove audio noise (e.g., speech, music). A zero-phase filter is then used to reduce the signal phase offset introduced by the filtering.
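A minimal sketch of this filtering step, assuming a Butterworth design of order 4 (the patent does not specify the filter family or order); applying the filter forward and backward with `filtfilt` makes the net response zero-phase:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def remove_audio_noise(x, fs=48_000, low=18_000, high=22_000, order=4):
    """Sketch of step 2.1: band-pass 18-22 kHz, applied forward and
    backward (filtfilt) so the net result is zero-phase. The Butterworth
    family and the filter order are assumptions."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)
```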
Step 2.2: the signals are aligned to reduce the effects of the inherent delay of the audio device.
Audio devices commonly have an inherent playback delay that affects channel measurement. It is observed that there is typically no obstacle on the direct path from the speaker to the microphone, which means the audio signal arriving at the microphone via the direct path has the greatest energy. The invention therefore aligns the signals using the arrival time of the audio signal transmitted via the direct path.
First, the short-time energy (STE) of the audio signal is calculated to locate the start frame at which the signal arrives via the direct path. The short-time energy of the $i$-th frame is
$E_i = \sum_{n} x_i[n]^2$,
where $x_i[n]$ denotes the value of the $n$-th sampling point in the $i$-th frame of the received audio signal. The start frame is located within the first 20 frames (i.e., 0.2 s) of the received signal, and a dynamic threshold $T$ is set from the frame energies in this range. When the energy of 3 consecutive frames exceeds $T$, the first of these frames is determined to be the start frame.
To obtain a more accurate arrival time of the audio signal, the cross-correlation between the start frame and the transmitted signal is calculated; the time at which the correlation is maximal is the time at which the signal arrives at the microphone via the direct path.
Finally, the signal transmission time is calculated from the length of the direct path, thereby eliminating the inherent playback delay of the audio device.
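The alignment procedure above can be sketched as follows. The patent states only that a dynamic threshold is set and that three consecutive loud frames mark the start; the specific rule used here (a fraction `alpha` of the peak frame energy) and the synthetic test signal are assumptions:

```python
import numpy as np

def find_direct_path_offset(received, transmitted, frame_len=480,
                            search_frames=20, alpha=0.5, run=3):
    """Sketch of step 2.2: locate the direct-path arrival.

    `alpha` (threshold as a fraction of the peak frame energy) is an
    assumption; the patent only requires 3 consecutive frames above a
    dynamic threshold, then a correlation-based refinement."""
    n = min(search_frames, len(received) // frame_len)
    frames = received[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)        # short-time energy per frame
    thresh = alpha * energy.max()             # dynamic threshold

    start = 0
    for i in range(n - run + 1):
        if np.all(energy[i:i + run] > thresh):
            start = i                          # first of 3 loud frames
            break

    # Refine via cross-correlation with the known transmitted frame.
    seg = received[start * frame_len:(start + 2) * frame_len]
    corr = np.correlate(seg, transmitted, mode="valid")
    return start * frame_len + int(np.argmax(np.abs(corr)))
```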
Step 3: IQ demodulation is performed on the signal to obtain a baseband complex signal and thus richer information.
The audio signal received by the microphone is a passband real signal; the invention performs IQ demodulation on it to construct a baseband complex signal and obtain richer handwriting information. In a handwriting interaction scenario, the audio signal may be reflected by different obstacles.
Suppose there are $K$ propagation paths in the environment. According to the signal superposition principle, the audio signal $r(t)$ received by the microphone is:
$r(t) = \sum_{k=1}^{K} a_k\, s(t-\tau_k) = \sum_{k=1}^{K} a_k\, b(t-\tau_k)\cos\bigl(2\pi f_c (t-\tau_k)\bigr)$,
where $a_k$ and $\tau_k$ denote the attenuation and delay of the $k$-th propagation path, $s(t)$ denotes the transmitted signal, $b(t)$ denotes the baseband sequence signal after frequency-domain interpolation, $t$ denotes the time instant, $\pi$ denotes the circle constant, and $f_c$ denotes the carrier frequency.
In IQ demodulation, the audio signal $r(t)$ received by the microphone is first multiplied by cosine and sine waves, respectively, to obtain the quadrature component $Q(t)$ and the in-phase component $I(t)$.
Then, a low-pass filter with a cut-off frequency of 2 kHz is used to filter out the high-frequency parts of $I(t)$ and $Q(t)$.
Finally, the IQ components are combined to construct the baseband complex signal $r_b(t)$:
$r_b(t) = I(t) + jQ(t) = \sum_{k=1}^{K} \frac{a_k}{2}\, b(t-\tau_k)\, e^{-j 2\pi f_c \tau_k}$,
where $b(t)$ denotes the baseband sequence signal after frequency-domain interpolation, $e$ denotes the natural base, $a_k$ and $\tau_k$ denote the attenuation and delay of the $k$-th propagation path, and $j$ denotes the imaginary unit.
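The demodulation chain can be sketched as follows. The 48 kHz sampling rate and the order-4 Butterworth low-pass (only the 2 kHz cut-off is stated in the patent) are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def iq_demodulate(x, fs=48_000, fc=20_000, cutoff=2_000):
    """Sketch of step 3: multiply the aligned passband signal by cosine
    and sine carriers, low-pass filter at 2 kHz, and combine into the
    baseband complex signal r_b(t) = I(t) + j*Q(t)."""
    t = np.arange(len(x)) / fs
    i = x * np.cos(2 * np.pi * fc * t)          # in-phase branch
    q = -x * np.sin(2 * np.pi * fc * t)         # quadrature branch
    b, a = butter(4, cutoff, btype="low", fs=fs)
    return filtfilt(b, a, i) + 1j * filtfilt(b, a, q)
```

Feeding it a tone 500 Hz above the carrier yields a baseband complex exponential of magnitude 1/2 rotating at +500 Hz, matching the $a_k/2$ factor in the formula above.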
Step 4: The differential channel impulse response is estimated to eliminate the effects of static multipath.
After the baseband complex signal is obtained through IQ demodulation, the channel is continuously measured so as to track the channel change caused by finger movement during handwriting interaction.
Step 4.1: the Channel Impulse Response (CIR) is calculated using a least squares method.
First, the channel impulse response (CIR) $h$ is calculated using the least-squares method from $y = Xh$, where
$y = \bigl[r_b[L],\, r_b[L+1],\, \ldots,\, r_b[L+P-1]\bigr]^T$
is a subsequence of the baseband complex signal $r_b$, with $r_b[n]$ denoting the $n$-th sampling point of $r_b$; $X = [b_0, b_1, \ldots, b_{L-1}]$ is the cyclic training-sequence matrix, where $b_l$ denotes the training sequence (a subsequence of the baseband sequence signal $b$) delayed by $l$ sampling points and $T$ denotes the transpose; and $h$ denotes the channel impulse response (CIR).

The best linear unbiased estimate of $h$ is calculated as:
$\hat{h} = (X^H X)^{-1} X^H y$,
where $H$ denotes the conjugate transpose.

Since the $(m, n)$-th element of $X^H X$ is the product of the training sequence delayed by $m$ sampling points and the training sequence delayed by $n$ sampling points, and the autocorrelation function of the training sequence is approximately ideal, this product tends to 0 when $m \neq n$. Therefore, $X^H X$ is approximately a diagonal matrix.

The invention thus further simplifies the calculation of the channel impulse response (CIR) to:
$\hat{h} \approx \frac{1}{P} X^H y$,
where $P$ denotes the reference length; the larger $P$ is, the higher the confidence of the calculated CIR. $L$ denotes the number of CIR taps; the larger $L$ is, the larger the sensing range. According to step 1.2, the sum of $P$ and $L$ is a fixed length.

To balance confidence and sensing distance, the invention sets $P$ to 120; it should be noted that other values in [100, 140] are also within the scope of the invention. $L$ is set to 192; other values in [180, 200] are also within the scope of the invention. The handwriting sensing distance is then about 42 cm.
Step 4.2: The channel impulse response (CIR) is differenced along the time axis to obtain the differential channel impulse response (dCIR), eliminating the effects of static multipath.
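Steps 4.1 and 4.2 can be sketched as follows. The test uses a random training sequence and a synthetic two-path channel (assumptions, not from the patent); the per-column normalization in place of the constant $1/P$ is a minor robustness tweak of the sketch:

```python
import numpy as np

def estimate_cir(rb, baseband, P=120, L=192):
    """Sketch of step 4.1. `rb` is one frame of the baseband complex
    signal, `baseband` the known 312-sample training signal. Because the
    columns of the cyclic training matrix X are nearly orthogonal,
    X^H X is close to diagonal and the least-squares estimate
    (X^H X)^{-1} X^H y reduces to a per-tap matched filter."""
    X = np.empty((P, L), dtype=complex)
    idx = np.arange(L, L + P)
    for l in range(L):
        X[:, l] = baseband[idx - l]           # training delayed by l samples
    y = rb[L:L + P]
    norms = np.sum(np.abs(X) ** 2, axis=0)    # ~ P for a unit-power sequence
    return (X.conj().T @ y) / norms           # h: one CIR tap per delay

def differential_cir(cir_frames):
    """Step 4.2: difference the per-frame CIR along the time axis to
    remove static multipath components."""
    return np.diff(cir_frames, axis=0)
```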
Step 5: post-processing the signal. The method aims at eliminating random noise, reducing subsequent calculation amount and dividing handwriting input signals.
The specific method for post-treatment is as follows:
step 5.1: a smoothing filter is used to remove random noise during sampling.
The audio device inevitably introduces random noise during the sampling process, which will lead to some outliers in the differential channel impulse response (dCIR) obtained in step 4.
To eliminate the effect of outliers, in this embodiment, a Savitzky-Golay filter is used to smooth the differential channel impulse response (dCIR).
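A minimal sketch of this smoothing step; the window length and polynomial order are assumptions, as the patent names the Savitzky-Golay filter but not its parameters:

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_dcir(dcir, window=9, polyorder=2):
    """Sketch of step 5.1: Savitzky-Golay smoothing along the frame
    (time) axis of a complex (frames x taps) dCIR; window length and
    polynomial order are assumptions."""
    return (savgol_filter(dcir.real, window, polyorder, axis=0)
            + 1j * savgol_filter(dcir.imag, window, polyorder, axis=0))
```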
Step 5.2: downsampling is utilized to reduce subsequent computational overhead.
In order to reduce the calculation amount of the subsequent processing, the invention performs the average pooling processing on the differential channel impulse response (dCIR). Specifically, the pooling core size is 2×1, and the step size is 2×1.
Through downsampling, the size of the differential channel impulse response (dCIR) is reduced from $N \times 192$ to $N \times 96$, where $N$ denotes the number of frames of audio collected by the microphone.
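The pooling step can be sketched as follows. The patent's 2x1 kernel with 2x1 stride is read here as acting along the tap axis, consistent with the later reference to the number of "taps after downsampling"; that interpretation is an assumption:

```python
import numpy as np

def downsample_taps(dcir):
    """Sketch of step 5.2: average pooling with a 2x1 kernel and 2x1
    stride along the tap axis, so an (N, 192) dCIR becomes (N, 96)."""
    n_frames, n_taps = dcir.shape
    return dcir.reshape(n_frames, n_taps // 2, 2).mean(axis=2)
```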
Step 5.3: signal segmentation of individual characters or words is achieved based on logarithmic short-term energy and adaptive thresholds.
It is observed that a user's finger pauses naturally between writing successive characters. During these pauses the finger remains stationary and the differential channel impulse response (dCIR) is close to 0. Although other dynamic obstacles may exist in the environment, they are farther from the audio device than the finger, so their dynamic multipath effects are negligible.
Based on the above findings, the present invention further proposes a segmentation method combining the logarithmic short-term energy of the differential channel impulse response and the adaptive threshold for detecting the start and end frames of each character and word. The method comprises the following steps:
first, the logarithmic short-time energy of the differential channel impulse response (dCIR) is calculated frame by frame, the first
Figure SMS_83
Logarithmic short time energy of frame
Figure SMS_84
The method comprises the following steps: />
Figure SMS_86
Wherein->
Figure SMS_82
Indicate->
Figure SMS_85
Frame->
Figure SMS_87
Differential channel impulse response value of individual taps, < >>
Figure SMS_88
Representing modulo calculation +.>
Figure SMS_81
Indicating the number of differential channel impulse response (dCIR) taps after downsampling.
An adaptive threshold is then calculated based on a sliding window. The adaptive threshold $T_w$ of the $w$-th window is:
$T_w = \frac{\beta}{W} \sum_{i=(w-1)W+1}^{wW} E_i$,
where $W$ denotes the size of the sliding window, $E_i$ denotes the energy value of the $i$-th frame in the current window, and $\beta$ is the ratio constant that the average logarithmic short-time energy of the current window occupies in the window's adaptive threshold. In this embodiment, $\beta$ is set to 0.3; other values in [0.1, 0.5] are also within the scope of the invention.
To determine the beginning and ending frames of a character segment, the energy of the $i$-th frame is compared with the adaptive threshold $T_w$ of the window in which it resides. The invention defines three time thresholds, $N_{char}$, $N_{cgap}$, and $N_{wgap}$, representing the minimum number of consecutive frames of a handwritten character, the minimum number of consecutive frames in the interval between two characters, and the minimum number of consecutive frames in the interval between two words, respectively.
Then, the position of an interval segment is judged from the number of consecutive frames below the adaptive threshold:
When the logarithmic short-time energy of $N_{wgap}$ consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is the end frame of the current word, and the end frame of the interval is the start frame of the next word;
When the logarithmic short-time energy of $N_{cgap}$ consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is regarded as the end frame of the current character, and the end frame of the interval is the start frame of the next character.
In this embodiment, $N_{char}$ is set to 30; it should be noted that other values in [20, 40] are also within the scope of the invention. $N_{cgap}$ is set to 30; other values in [20, 40] are also within the scope of the invention. $N_{wgap}$ is set to 80; other values in [60, 100] are also within the scope of the invention.
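The segmentation logic can be sketched as follows, simplified to character boundaries (word boundaries follow the same pattern with the 80-frame gap threshold instead of 30). The 50-frame window size is an assumption, as the patent does not give it:

```python
import numpy as np

def segment_characters(dcir, window=50, beta=0.3,
                       min_char=30, min_char_gap=30):
    """Sketch of step 5.3: frame-wise log short-time energy, a
    sliding-window adaptive threshold, and gap-length rules."""
    # Log short-time energy per frame: E_i = log sum_l |d_i[l]|^2.
    energy = np.log(np.sum(np.abs(dcir) ** 2, axis=1) + 1e-12)

    # Adaptive threshold: beta times the mean energy of the frame's window.
    thresh = np.empty_like(energy)
    for i in range(len(energy)):
        w0 = (i // window) * window
        thresh[i] = beta * energy[w0:w0 + window].mean()
    active = energy > thresh

    segments, start, gap = [], None, 0
    for i, a in enumerate(active):
        if a:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_char_gap:           # interval between characters
                end = i - gap + 1
                if end - start >= min_char:   # long enough to be a character
                    segments.append((start, end))
                start, gap = None, 0
    if start is not None and len(active) - start >= min_char:
        segments.append((start, len(active)))
    return segments
```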
Fig. 3 shows a schematic diagram of the logarithmic short-time energy of the differential channel impulse response (dCIR) calculated when handwriting "a word" with a finger. Fig. 4 shows the segmentation result when handwriting "a word" with a finger.
Step 6: the handwritten content is classified using a classification model.
Step 6.1: the data set is expanded in both the handwriting distance and handwriting speed dimensions using data enhancement techniques.
Different users have different handwriting habits, such as the distance between the finger and the audio device and the handwriting speed. Therefore, the invention expands the dataset along different handwriting distances and handwriting speeds, respectively, to achieve independence from handwriting distance and handwriting speed.
When the finger is farther from the audio device, the transmission path of the audio signal reflected by the finger is longer, and the corresponding channel delay $\tau$ becomes larger. Therefore, in this embodiment, by adjusting the arrival time of the signals received by the microphone during alignment, signals are simulated in which the finger-to-device distance is extended by 5 cm and by 10 cm compared with the original distance.
As the handwriting speed changes, the time required to handwrite a character also changes, which stretches the differential channel impulse response map along the time dimension. The present embodiment therefore interpolates the modulus of the original differential channel impulse response along the time dimension with cubic spline interpolation, simulating signals whose handwriting speed is 2/3 and 4/3 of the original speed.
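The two augmentations can be sketched as follows (a NumPy/SciPy illustration; the function names and the sample-shift computation for the distance case are assumptions, not the patent's code):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def distance_augment(signal, extra_m, fs=48000, c=343.0):
    """Simulate a longer finger-device distance by delaying the received
    signal; extra_m is the assumed extra path length in metres."""
    shift = int(round(fs * extra_m / c))
    out = np.zeros_like(signal)
    out[shift:] = signal[:len(signal) - shift]
    return out

def speed_augment(dcir_mag, speed_ratio):
    """Simulate a handwriting-speed change by cubic-spline resampling of
    the dCIR modulus along the time axis.

    dcir_mag : (frames, taps) array; a speed_ratio of 2/3 yields a longer
    (slower) sequence, 4/3 a shorter (faster) one.
    """
    n = dcir_mag.shape[0]
    n_new = int(round(n / speed_ratio))
    spline = CubicSpline(np.arange(n), dcir_mag, axis=0)
    return spline(np.linspace(0, n - 1, n_new))
```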
Step 6.2: and classifying the handwritten content at a character level by using a classification model based on CNN-GRU.
The time taken by different users to handwrite the same character, and by the same user to handwrite different characters, differs; in this embodiment, character classification is therefore modeled as a sequence classification problem. The method is as follows:
First, the signal segment obtained in step 5.3 is divided into sub-segments by a sliding window with a window size of 40 frames and a step size of 20 frames, yielding the short-time differential channel impulse response (st-dCIR).
The short-time differential channel impulse response is then normalized and fed into a convolutional neural network (CNN) subnetwork to extract latent feature vectors. Specifically, the CNN subnetwork comprises 4 convolutional layers, 4 pooling layers, and 4 batch normalization layers.
The resulting sequence of feature vectors is then fed into a gated recurrent unit (GRU) network to learn the temporal feature information in the signal.
Finally, the output of the GRU network is passed to a fully connected layer and a softmax layer for classification, and the recognized character class is output.
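To make the sequence-modeling step concrete, here is a minimal NumPy sketch of the st-dCIR windowing and of a single GRU cell's forward recurrence (the weight shapes and names are illustrative; the real model is the trained CNN-GRU described above, not this toy):

```python
import numpy as np

def sliding_windows(dcir, win=40, step=20):
    """Split a (frames, taps) dCIR segment into st-dCIR sub-segments
    using a 40-frame window with a 20-frame step."""
    starts = range(0, dcir.shape[0] - win + 1, step)
    return np.stack([dcir[s:s + win] for s in starts])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_forward(x_seq, Wz, Uz, Wr, Ur, Wh, Uh):
    """Forward pass of one GRU cell over a sequence of feature vectors;
    returns the final hidden state."""
    h = np.zeros(Uz.shape[0])
    for x in x_seq:
        z = sigmoid(Wz @ x + Uz @ h)             # update gate
        r = sigmoid(Wr @ x + Ur @ h)             # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_cand
    return h
```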
Step 7: word suggestions are provided using spelling error correction tools based on edit distance and word frequency to correct user handwriting errors or model classification errors.
Specifically, the classification result obtained in step 6.2 is fed into the existing spelling-correction tool symspellpy. If the word given by the classification result does not exist in the dictionary, it is corrected using the configured maximum edit distance, and by default the candidate with the highest word frequency is taken as the correction result. Here, the maximum edit distance is the largest number of operations (insertions, deletions, and replacements) that may be performed to correct the word. The user may also manually select an appropriate word for input. If no suitable word can be found within the maximum edit distance, the recognition result of the classification model is output directly.
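The behaviour described here (smallest edit distance first, then highest word frequency) can be approximated by a naive sketch; this is not symspellpy's optimized algorithm, and the tiny dictionary is purely illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance with insert, delete, and replace operations."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete
                           cur[j - 1] + 1,             # insert
                           prev[j - 1] + (ca != cb)))  # replace
        prev = cur
    return prev[-1]

def correct(word, freq, max_edit_distance=2):
    """Return the in-dictionary word unchanged; otherwise the candidate
    within the maximum edit distance, ranked by (distance, frequency);
    otherwise the classifier's own output."""
    if word in freq:
        return word
    best = None
    for cand, count in freq.items():
        d = edit_distance(word, cand)
        if d <= max_edit_distance:
            key = (d, -count, cand)
            if best is None or key < best:
                best = key
    return best[2] if best else word
```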
So far, from step 1 to step 7, handwriting input recognition is realized.
Example verification
To verify the performance of the invention, the embodiment was deployed as a system prototype on 3 smartphones of different models (Honor 30 PRO, HUAWEI Mate 30, and OPPO Reno 9). A total of 8 participants (4 men and 4 women, aged between 20 and 50) were enrolled in the experiment. During data acquisition, the smartphone was placed on a desktop while the participant sat on a chair and performed handwriting input on the plane beside the device. To improve acquisition efficiency, an Android-based audio acquisition application was developed. Fig. 5 shows a schematic diagram of this application; the audio sampling rate was set to 48 kHz. After the participant clicks the start button, the speaker plays the predefined audio signal continuously while the microphone captures audio data. The participant then handwrites a character with a finger near the phone and clicks the end button when finished. The application automatically stores the captured audio in WAV format. To capture the characteristics of users handwriting both single characters and two consecutive characters, each participant was asked to handwrite the 26 English characters (a, b, …, y, z) 5 times each and the 676 English character pairs (aa, ab, …, zy, zz) 5 times each. A total of 28080 (26 English characters × 5 × 8 participants + 676 character pairs × 5 × 8 participants) audio files were collected. After data enhancement processing, 70200 samples were finally obtained.
The confusion matrix, precision, recall, and F1 score are used to evaluate the performance of the system at the character level. The confusion matrix is defined as follows: rows represent the true labels and columns the predicted labels; the value in row i, column j is the proportion of samples whose true label is the i-th character that are predicted as the j-th character. Precision is defined as the proportion of samples predicted as A that are actually A; recall is defined as the proportion of samples actually labeled A that are predicted correctly; the F1 score is defined as the harmonic mean of precision and recall.
The word accuracy and character error rate are used to evaluate the performance of the system at the word level. Word accuracy is defined as the proportion of correctly predicted words among all words; the character error rate is defined as the ratio of the edit distance between the predicted word and the actual word to the total number of characters in the actual word.
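These metrics can be computed from first principles as follows (a NumPy sketch; the confusion matrix M is assumed to hold raw counts with rows as true labels and columns as predicted labels):

```python
import numpy as np

def precision_recall_f1(M):
    """Per-class precision, recall, and F1 from a counts confusion
    matrix with rows = true labels, columns = predicted labels."""
    M = np.asarray(M, dtype=float)
    diag = np.diag(M)
    precision = diag / M.sum(axis=0)  # correct / predicted-as-class
    recall = diag / M.sum(axis=1)     # correct / actually-in-class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def edit_distance(a, b):
    """Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def character_error_rate(predicted, actual):
    """Edit distance between the words over the true word's length."""
    return edit_distance(predicted, actual) / len(actual)

def word_accuracy(predicted_words, actual_words):
    hits = sum(p == a for p, a in zip(predicted_words, actual_words))
    return hits / len(actual_words)
```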
First, the overall performance of the invention was tested. For character-level performance, Fig. 6 shows the precision, recall, and F1 score of the invention. The averages of the precision, recall, and F1 scores were 97.62%, 97.61%, and 97.60%, respectively. All samples had a precision above 94%, a recall of not less than 91.5%, and an F1 score above 93%. For word-level performance, this example collected 100 commonly used English words, 5 times per word. The experimental results show that without the word suggestion module the word accuracy and character error rate were 93.2% and 2.32%, respectively. With the word suggestion module enabled, they were 96.4% and 1.5%, respectively, verifying the effectiveness of the word suggestion module.
The effect of the distance between the finger and the audio device during handwriting was then tested. The performance of the system at different handwriting distances was evaluated through extensive experiments, and the results are shown in Fig. 7. As can be seen from the figure, the invention performs well at all three handwriting distances. Handwriting input performance is best 10 cm from the audio device, where the word accuracy reaches 97.78%, higher than the 93.80% at 5 cm and the 93.33% at 15 cm.
Finally, the influence of three different handwriting speeds on performance was tested, comparing the invention at low (0.1 m/s), moderate (0.15 m/s), and high (0.2 m/s) handwriting speeds. Fig. 8 shows that when the handwriting speed is too high, the word accuracy drops significantly and the character error rate rises significantly. This is because an excessive handwriting speed makes the generated short-time differential channel impulse response (st-dCIR) sequence too short for the CNN-GRU classification model to learn the sequence information effectively, degrading recognition accuracy.
The foregoing describes specific embodiments for the purpose of illustrating the principles of the invention; it is not intended to limit the scope of the invention, which is defined by the appended claims.

Claims (10)

1. The contactless handwriting input recognition method based on the audio signal is characterized by comprising the following steps of:
step 1: playing a predefined audio signal frame by using a loudspeaker in the mobile device, and collecting an audio signal reflected when finger handwriting is input by using a microphone;
step 2: preprocessing the audio signal acquired by the microphone to eliminate the influence of environmental noise and the inherent delay of the audio device; firstly, removing audio noise with a band-pass filter; then, aligning the signals by the arrival time of the maximum-energy signal transmitted via the direct path, reducing the effect of the inherent delay of the audio device;
step 3: IQ demodulation is carried out on the signal to obtain a baseband complex signal so as to obtain richer information;
step 4: estimating differential channel impulse response, eliminating the influence of static multipath effect;
firstly, calculating channel impulse response by using a least square method, and then, differencing the channel impulse response along a time axis to obtain differential channel impulse response, thereby eliminating the influence of static multipath effect;
step 5: post-processing the signals, eliminating random noise, reducing subsequent calculation cost and dividing handwriting input signals;
firstly, adopting a smoothing filter to inhibit abnormal values in differential channel impulse response, and eliminating random noise introduced in the sampling process;
then, 2 times down sampling is carried out on the differential channel impulse response;
finally, signal segmentation of single characters/words is realized based on the logarithmic short-time energy and the adaptive threshold;
step 6: classifying the handwritten content by using a classification model;
firstly, expanding a data set in two dimensions of handwriting distance and handwriting speed by utilizing a data enhancement technology; then, classifying the handwritten content at a character level by using a classification model based on a convolution gating circulation unit;
step 7: word suggestions are provided by using a spelling error correction tool based on the editing distance and word frequency, handwriting errors/model classification errors of a user are corrected, and handwriting input recognition results are output.
2. The method for recognizing a contactless handwriting input based on an audio signal according to claim 1, wherein in step 1 the transmitted signal is designed by taking Barker codes, which have strong auto-correlation and weak cross-correlation, as the original signal;
step 1.1: splicing two 13-bit Barker codes to obtain a 26-bit Barker code, thereby avoiding frequency leakage;
step 1.2: obtaining the baseband sequence signal by 12-times frequency-domain interpolation, and then modulating the signal to limit its bandwidth.
3. The method for recognizing non-contact handwriting input based on an audio signal according to claim 2, wherein the baseband sequence signal is obtained by frequency-domain interpolation, and the signal is then modulated to limit its bandwidth;
firstly, the spliced Barker code signal obtained in step 1.1 is transformed to the frequency domain by a fast Fourier transform, and zero-padded in the frequency domain to 12 times its previous length;
then, the signal is transformed back to the time domain by an inverse fast Fourier transform to obtain the baseband sequence signal b(t);
finally, a 20 kHz carrier frequency is used to confine the signal bandwidth to 18 kHz to 22 kHz;
to reduce interference between adjacent frames, a blank interval of 168 sampling points is appended to the signal, so that each frame of the signal is 480 sampling points long;
the transmitted signal is:
s(t) = b(t)·cos(2π·f_c·t),
wherein b(t) denotes the baseband sequence signal after frequency-domain interpolation, f_c denotes the carrier frequency, π denotes the circular constant, and t denotes the time instant.
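The signal construction of claim 3 can be sketched numerically (a NumPy illustration; the 13-bit Barker sequence is the standard one, and the zero-padding layout is one common way to realize frequency-domain interpolation):

```python
import numpy as np

BARKER13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)

def freq_interp(x, factor):
    """Interpolate x by zero-padding its spectrum between the positive
    and negative frequency halves; original samples are preserved."""
    n = len(x)
    X = np.fft.fft(x)
    Xp = np.zeros(n * factor, dtype=complex)
    Xp[: n // 2] = X[: n // 2]
    Xp[-(n - n // 2):] = X[n // 2:]
    return np.fft.ifft(Xp).real * factor

def build_frame(fs=48000, fc=20000.0, factor=12, gap=168):
    """One transmitted frame: 26-bit spliced Barker code, 12x
    frequency-domain interpolation, 20 kHz carrier, 168-sample gap."""
    code = np.concatenate([BARKER13, BARKER13])       # 26 samples
    baseband = freq_interp(code, factor)              # 312 samples
    t = np.arange(len(baseband)) / fs
    passband = baseband * np.cos(2 * np.pi * fc * t)  # s(t) = b(t)cos(2*pi*fc*t)
    return np.concatenate([passband, np.zeros(gap)])  # 480 samples
```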
4. The method for contactless handwriting recognition based on audio signals according to claim 1, wherein step 2 comprises the steps of:
step 2.1: removing audio noise through a band-pass filter;
when the microphone collects the reflected audio signal, a band-pass filter is first used to remove audio noise, and a zero-phase filter is then used to reduce the phase offset of the signal introduced by filtering;
step 2.2: the signals are aligned by means of the arrival time of the audio signals transmitted via the direct path to reduce the effects of the inherent delay of the audio device.
5. The method for contactless handwriting recognition based on audio signals according to claim 4, wherein step 2.2 comprises the steps of:
firstly, the short-time energy of the audio signal is calculated to locate the start frame at which the signal arrives via the direct path; the short-time energy E_i of the i-th frame is:
E_i = Σ_n x_i(n)²,
wherein x_i(n) denotes the value of the n-th sampling point in the i-th frame of the received audio signal; the start frame is located within the first 20 frames of the received signal; a dynamic energy threshold is set; when the energy of 3 consecutive frames exceeds the threshold, the first of these frames is judged to be the start frame;
then, the cross-correlation between the start frame and the transmitted signal is calculated, and the time of maximum correlation is taken as the time at which the signal reaches the microphone via the direct path;
finally, the transmission time of the signal is calculated from the length of the direct path, thereby eliminating the inherent playback delay of the audio device.
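A sketch of this alignment step follows (NumPy; the threshold value and frame length are illustrative parameters, and np.correlate stands in for the claim's correlation computation):

```python
import numpy as np

def find_start_frame(x, frame_len=480, energy_thresh=10.0, search_frames=20):
    """Return the index of the first of 3 consecutive frames whose
    short-time energy exceeds the (here hypothetical) dynamic threshold."""
    e = [np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
         for i in range(search_frames)]
    for i in range(len(e) - 2):
        if e[i] > energy_thresh and e[i + 1] > energy_thresh and e[i + 2] > energy_thresh:
            return i
    return None

def direct_path_offset(frame, template):
    """Sample offset of maximum cross-correlation between the start
    frame and the transmitted template (direct-path arrival)."""
    corr = np.correlate(frame, template, mode="valid")
    return int(np.argmax(np.abs(corr)))
```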
6. The method for recognizing contactless handwriting input based on an audio signal according to claim 1, wherein in step 3, assuming there are L propagation paths in the environment, according to the superposition principle the audio signal received by the microphone is:
r(t) = Σ_{l=1}^{L} a_l·s(t−τ_l) = Σ_{l=1}^{L} a_l·b(t−τ_l)·cos(2π·f_c·(t−τ_l)),
wherein a_l and τ_l denote respectively the attenuation and the delay of the audio signal over the l-th propagation path, s(t) denotes the transmitted signal, b(t) denotes the baseband sequence signal after frequency-domain interpolation, t denotes the time instant, π denotes the circular constant, and f_c denotes the carrier frequency;
firstly, the aligned audio signal r(t) is multiplied by a cosine wave and a sine wave respectively to obtain the in-phase component I(t) and the quadrature component Q(t);
then, a low-pass filter with a cut-off frequency of 2 kHz is used to filter out the high-frequency parts of I(t) and Q(t);
the baseband complex signal h(t) is constructed by combining the IQ components:
h(t) = Σ_{l=1}^{L} (a_l/2)·b(t−τ_l)·e^{−j2π·f_c·τ_l},
wherein b(t) denotes the baseband sequence signal after frequency-domain interpolation, e is the natural constant, a_l and τ_l denote respectively the attenuation and the delay over the l-th propagation path, and j denotes the imaginary unit.
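The IQ demodulation of claim 6 can be sketched with SciPy (a Butterworth low-pass stands in for the claim's 2 kHz filter; the filter order and the test tone are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def iq_demodulate(r, fc=20000.0, fs=48000, cutoff=2000.0):
    """Mix the received signal down with cosine/sine carriers and
    low-pass filter to obtain the baseband complex signal I + jQ."""
    t = np.arange(len(r)) / fs
    i_mixed = 2.0 * r * np.cos(2 * np.pi * fc * t)   # in-phase branch
    q_mixed = -2.0 * r * np.sin(2 * np.pi * fc * t)  # quadrature branch
    b, a = butter(4, cutoff / (fs / 2))              # 2 kHz low-pass
    return filtfilt(b, a, i_mixed) + 1j * filtfilt(b, a, q_mixed)
```

For a pure carrier delayed by phase θ = 2π·f_c·τ, the recovered baseband value is e^{−jθ}, matching the h(t) expression in the claim.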
7. The method for recognizing a contactless handwriting input based on an audio signal according to claim 1, wherein in step 4 the channel impulse response h is first calculated using the least squares method from
r = B·h,
wherein r = [r(1), r(2), …, r(N)]^T is a subsequence of the baseband complex signal, r(k) denoting the k-th sampling point of the baseband complex signal; B is the cyclic training-sequence matrix,
B = [b_0, b_1, …, b_{P−1}],
wherein b_p denotes the training sequence b delayed by p sampling points, T denotes the transpose, and b = [b(1), b(2), …, b(N)]^T is a subsequence of the baseband sequence signal; h denotes the channel impulse response;
the best linear unbiased estimate of h is computed as:
ĥ = (B^H·B)^{−1}·B^H·r,
wherein H denotes the conjugate transpose; B^H·B approximates a diagonal matrix;
the channel impulse response calculation is therefore further simplified to:
ĥ ≈ (1/N)·B^H·r,
wherein N denotes the reference length, and a larger N gives a higher confidence in the calculated channel impulse response; P denotes the number of taps of the channel impulse response, and a larger P gives a larger sensing range; N takes values in [100, 140], and P takes values in [180, 200].
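The simplified estimator of claim 7 can be sketched as follows (NumPy; the ±1 pseudo-random training sequence and the loose tolerances are illustrative assumptions exploiting B^H·B ≈ N·I):

```python
import numpy as np

def estimate_cir(r_seg, train, P):
    """Simplified least-squares CIR estimate h_hat = B^H r / N, where
    column p of the cyclic matrix B is the training sequence delayed
    by p samples."""
    n = len(r_seg)
    B = np.stack([np.roll(train[:n], p) for p in range(P)], axis=1)
    return B.conj().T @ r_seg / n
```

With a ±1 pseudo-noise training sequence the columns of B are nearly orthogonal, so the estimate recovers the dominant taps up to small crosstalk.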
8. The method for recognizing contactless handwriting input based on an audio signal according to claim 1, wherein in step 5 a filter is used to smooth the differential channel impulse response;
the differential channel impulse response is average-pooled with a pooling kernel of size 2×1 and a stride of 2×1;
through downsampling, the size of the differential channel impulse response is reduced from P×M to (P/2)×M, wherein M denotes the number of frames of audio collected by the microphone;
the start and end frames of each character and word are detected by a segmentation method combining the logarithmic short-time energy of the differential channel impulse response with an adaptive threshold, specifically as follows:
first, the logarithmic short-time energy of the differential channel impulse response is calculated frame by frame; the logarithmic short-time energy E_m of the m-th frame is:
E_m = log(Σ_{p=1}^{P'} |d_m(p)|²),
wherein d_m(p) denotes the differential channel impulse response value of the p-th tap in the m-th frame, |·| denotes the modulus, and P' denotes the number of dCIR taps after downsampling;
then, an adaptive threshold is calculated based on a sliding window; the adaptive threshold Th_w of the w-th window is:
Th_w = (β/W)·Σ_{m=1}^{W} E_m^{(w)},
wherein W denotes the size of the sliding window, β is a ratio constant used to adjust the weight of the average logarithmic short-time energy of the current window in the adaptive threshold, Th_w denotes the adaptive threshold of the w-th window, and E_m^{(w)} denotes the energy value of the m-th frame of the current window;
the energy of each frame is compared with the adaptive threshold Th_w of the window in which it lies; three time thresholds are defined, T_min, T_c and T_w, representing respectively the minimum number of consecutive frames of a handwritten character, the minimum number of frames in the interval between two consecutive characters, and the minimum number of frames in the interval between two consecutive words;
then, the position of an interval segment is judged by the number of consecutive frames below the adaptive threshold:
when the logarithmic short-time energy of T_w consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is taken as the end frame of the current word and its end frame as the start frame of the next word;
when the logarithmic short-time energy of T_c consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is taken as the end frame of the current character and its end frame as the start frame of the next character;
T_min takes values in [20, 40], T_c takes values in [20, 40], and T_w takes values in [60, 100].
9. The method according to claim 1, wherein in step 6, by adjusting the arrival time used when aligning the signals received by the microphone, signals are simulated as if the distance between the finger and the audio device were extended by 5 cm and 10 cm relative to the original distance; and the modulus of the original differential channel impulse response is interpolated along the time dimension using cubic spline interpolation to simulate signals whose handwriting speed is 2/3 and 4/3 of the original speed;
the handwritten content is classified at the character level using a CNN-GRU-based classification model, comprising:
modeling character classification as a sequence classification problem; first, the signal segment obtained in step 5 is divided into sub-segments by a sliding window with a window size of 40 frames and a step size of 20 frames to obtain the short-time differential channel impulse response;
then, the short-time differential channel impulse response is normalized and fed into a convolutional neural network subnetwork to extract latent feature vectors;
then, the sequence of feature vectors is fed into a gated recurrent unit network to learn the temporal feature information in the signal;
finally, the output of the gated recurrent unit network is passed to a fully connected layer and a softmax layer for classification, and the recognized character class is output.
10. The method for recognizing non-contact handwriting input based on an audio signal according to claim 1, wherein in step 7 the classification result obtained in step 6 is fed into a spelling-correction tool; if the word given by the classification result does not exist in the dictionary, the word is corrected using the configured maximum edit distance, and by default the word with the highest word frequency is taken as the correction result; wherein the maximum edit distance is the largest number of operations that may be performed to correct the word; if no suitable word can be found within the maximum edit distance, the recognition result of the classification model is output directly.
CN202310316251.XA 2023-03-29 2023-03-29 Non-contact handwriting input recognition method based on audio signal Active CN116027911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310316251.XA CN116027911B (en) 2023-03-29 2023-03-29 Non-contact handwriting input recognition method based on audio signal


Publications (2)

Publication Number Publication Date
CN116027911A CN116027911A (en) 2023-04-28
CN116027911B true CN116027911B (en) 2023-05-30

Family

ID=86077902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310316251.XA Active CN116027911B (en) 2023-03-29 2023-03-29 Non-contact handwriting input recognition method based on audio signal

Country Status (1)

Country Link
CN (1) CN116027911B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117763399B (en) * 2024-02-21 2024-05-14 电子科技大学 Neural network classification method for self-adaptive variable-length signal input

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6285785B1 (en) * 1991-03-28 2001-09-04 International Business Machines Corporation Message recognition employing integrated speech and handwriting information
CN109657739A (en) * 2019-01-09 2019-04-19 西北大学 A kind of hand-written Letter Identification Method based on high frequency sound wave Short Time Fourier Transform


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A multi-channel-fusion error correction method for continuous handwriting recognition; Ao Xiang, Wang Xugang, Dai Guozhong, Wang Hongan; Journal of Software (Issue 09); full text *
A speech- and pen-based error correction method for handwritten mathematical formulas; Jiang Yingying, Ao Xiang, Tian Feng, Wang Xugang, Dai Guozhong; Journal of Computer Research and Development (Issue 04); full text *

Also Published As

Publication number Publication date
CN116027911A (en) 2023-04-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant