CN116027911B - Non-contact handwriting input recognition method based on audio signal


Info

Publication number
CN116027911B
CN116027911B (application number CN202310316251.XA)
Authority
CN
China
Prior art keywords
signal
frame
audio
impulse response
handwriting
Prior art date
Legal status
Active
Application number
CN202310316251.XA
Other languages
Chinese (zh)
Other versions
CN116027911A (en)
Inventor
李凡
孟玲
曾秋阳
刘晓晨
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310316251.XA priority Critical patent/CN116027911B/en
Publication of CN116027911A publication Critical patent/CN116027911A/en
Application granted granted Critical
Publication of CN116027911B publication Critical patent/CN116027911B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention relates to a non-contact handwriting input recognition method based on an audio signal, belonging to the technical fields of voice recognition and mobile computing applications. The invention uses the loudspeaker in a mobile device to continuously play a predefined audio signal and uses the microphone to collect the audio signal reflected by the finger during writing. When a user performs handwriting input, the movement of the hand causes changes in the reflected audio signal. A lightweight classification network is designed to learn fine-grained changes of the audio transmission channel and recognize the user's handwriting input content in real time. A data augmentation technique expands the dataset along the two dimensions of handwriting distance and handwriting speed, so as to adapt to the handwriting habits of different users. In addition, the recognition results of the classification network are corrected by a spelling-correction algorithm, improving fault tolerance.

Description

Non-contact handwriting input recognition method based on audio signal
Technical Field
The invention relates to a handwriting input recognition method, in particular to a non-contact handwriting input recognition method based on active acoustic sensing using a smartphone loudspeaker and microphone, thereby expanding the modes of human-computer interaction. It belongs to the technical fields of voice recognition and mobile computing applications.
Background
Touch screen interaction is widely used in various mobile devices (such as smartphones and smart tablets) as a simple and direct mode of human-computer interaction. However, as the usage scenarios of mobile devices continue to expand, touch interaction with fingers on a touch screen alone gradually fails to meet people's growing demands.
With the popularity of smart wearable devices, more and more users use these devices for activities such as entertainment and health monitoring. However, to ensure portability, wearable devices are typically equipped with only a small (about 1 inch) screen, and it is difficult for a user to perform touch-screen interactions such as handwriting input on such a small screen. Therefore, to overcome the limitations of touch-screen interaction on wearable devices, it is necessary to study a convenient and efficient off-screen handwriting recognition scheme.
Currently, there are handwriting recognition methods implemented with the motion sensors widely deployed in wristband devices. For example, when a user writes with the hand wearing a wristband device, the motion sensor can sense the data changes caused by the hand's movement and thereby recognize the handwritten content. However, research has found that people tend to wear wristband devices on the non-dominant hand to avoid knocks, while handwriting is usually performed with the dominant hand. In this case, the wristband device cannot capture the behavior of the writing hand and thus cannot recognize the handwritten content. Forcing users to change this habit and wear the device on the dominant hand results in a poor user experience.
In addition, there are handwriting recognition methods implemented with the audio devices (speakers and microphones) common on mobile devices, mainly based on two approaches: passive acoustic sensing and active acoustic sensing. Passive acoustic sensing uses microphones to directly collect the audio generated by a user's finger sliding on a surface near the device in order to identify the input content; this approach is susceptible to ambient noise and to the material of the writing surface. Active acoustic sensing uses a speaker to play a sensing audio signal that is blocked by the user's finger and reflected back to the microphone when the user writes near the device. By analyzing the change pattern of the reflected signal, the user's input content can be identified. Unlike passive acoustic sensing, active acoustic sensing does not depend on the writing-surface material. However, existing active-acoustic solutions are computationally expensive, impose an extra burden on users, or require multiple pairs of audio devices; they cannot be applied to most mobile devices, which are equipped with only one pair of audio devices, and therefore scale poorly.
In view of the foregoing, there are various drawbacks and shortcomings of the conventional handwriting recognition methods, and a new method is needed to overcome the above-mentioned limitations.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and creatively provides a non-contact handwriting input recognition method based on an audio signal. The method uses a single microphone-loudspeaker pair on the mobile device to collect the audio signals reflected by finger movement during handwriting, thereby recognizing handwritten content near the device and realizing handwriting input recognition.
The innovations of the invention include: the speaker in the mobile device continuously plays a predefined audio signal, and the microphone collects the audio signal reflected during finger writing. When a user performs handwriting input, the movement of the hand causes changes in the reflected audio signal. A lightweight classification network is designed to learn fine-grained changes of the audio transmission channel and recognize the user's handwriting input content in real time. A data augmentation technique expands the dataset along the two dimensions of handwriting distance and handwriting speed, so as to adapt to the handwriting habits of different users. In addition, the recognition results of the classification network are corrected by a spelling-correction algorithm, improving fault tolerance.
The aim of the invention is achieved by the following technical scheme.
A contactless handwriting recognition method based on an audio signal, comprising the steps of:
step 1: the audio signal reflected when the finger handwriting input is collected by a microphone is collected by playing a predefined audio signal frame by using a speaker in the mobile device.
In a real environment, there are rich multipath effects. To distinguish different transmission paths, the invention preferably designs the transmitted signal from an original signal with strong auto-correlation and weak cross-correlation, as follows:
First, two 13-bit Barker codes are spliced to obtain a 26-bit Barker code, so as to avoid frequency leakage.
The baseband sequence signal is then obtained by 12x frequency-domain interpolation and modulated to limit the signal bandwidth.
Step 2: The audio signals collected by the microphone are preprocessed to eliminate the influence of environmental noise and the inherent delay of the audio device.
Specifically, the following method can be adopted for treatment:
first, audio noise (e.g., speech sounds, music sounds, etc.) is removed by a bandpass filter.
The signal is then aligned by means of the arrival time of the signal transmitted via the direct path with the greatest energy, reducing the effect of the inherent delay of the audio device.
Step 3: IQ demodulation (I: In-phase; Q: Quadrature) is performed on the signal to obtain a baseband complex signal and thus richer information.
Because the audio signal received by the microphone is a passband real signal, IQ demodulation is required to construct a baseband complex signal and obtain richer handwriting information. Specifically, the following approach may be adopted:
first, the aligned audio signals are processed
Figure SMS_1
Multiplying the two waves with cosine and sine wave to obtain quadrature component +.>
Figure SMS_2
And in-phase component->
Figure SMS_3
Then, a low pass filter with a cut-off frequency of 2kHz is used for filtering
Figure SMS_4
And->
Figure SMS_5
Is a high frequency part of the (c).
Finally, the IQ component is combined to construct a baseband complex signal.
Step 4: The differential channel impulse response is estimated to eliminate the effects of static multipath.
Specifically, the following method can be adopted for realizing:
first, a Channel Impulse Response (CIR) is calculated using a least square method.
Then, the Channel Impulse Response (CIR) is differenced along the time axis to obtain a differential channel impulse response (dCIR), eliminating the influence of static multipath effects.
Step 5: and (3) post-processing the signals, eliminating random noise, reducing subsequent calculation cost and dividing handwriting input signals.
Specifically, the post-treatment can be achieved by the following method:
first, a smoothing filter is used to suppress outliers in the differential channel impulse response (dCIR) and to eliminate random noise introduced during sampling.
The differential channel impulse response (dCIR) is then downsampled 2-fold to reduce subsequent computational overhead.
Finally, signal segmentation of individual characters/words is achieved based on the logarithmic short-term energy and an adaptive threshold.
Step 6: the handwritten content is classified using a classification model.
Specifically, the following method may be employed:
first, with data enhancement techniques, the dataset is expanded in both the handwriting distance and handwriting speed dimensions.
The handwritten content is then classified at the character level using a classification model based on a convolutional neural network with gated recurrent units (CNN-GRU).
Step 7: word suggestions are provided by using a spelling error correction tool based on the editing distance and word frequency, handwriting errors/model classification errors of a user are corrected, and handwriting input recognition results are output.
Specifically, the classification result obtained in step 6 is input into an existing spelling error correction tool (e.g., symspllpy). Correcting the handwriting error or the model classification error of the user based on the minimum editing distance and word frequency, and outputting a handwriting input recognition result.
Advantageous effects
Compared with the prior art, the method has the following advantages:
1. The invention realizes high-precision, low-latency, and robust non-contact handwriting input recognition based on active acoustic sensing, relying only on the ordinary loudspeaker and microphone in a mobile device. The loudspeaker plays a predefined transmitted signal, and the microphone receives the signal reflected during finger movement while writing. The invention helps reduce the disease-transmission risk of touching public touch screens and overcomes the screen-size limitation of wearable devices.
2. The invention extracts the differential channel impulse response (dCIR) from the reflected handwriting audio signals collected by the microphone through denoising, alignment, demodulation, and differential channel impulse response estimation algorithms, thereby extracting fine-grained audio features of the audio transmission channel and capturing the differences between different characters written by a user.
3. The invention uses a data augmentation technique to expand the collected character dataset along the two dimensions of handwriting distance and handwriting speed, so as to adapt to the different handwriting habits of different users; it designs a lightweight classification network based on CNN-GRU to recognize the user's handwritten content in real time; and it finally provides word suggestions to users through a spelling-correction algorithm to improve the fault tolerance of the system.
4. The invention realizes a system prototype on a smartphone and has been evaluated extensively in different real environments. The evaluation results show a character-level recognition accuracy of 97.62% for handwriting input, with word accuracy of 96.4% and a word error rate of 1.5%.
Drawings
Fig. 1 is a schematic diagram of a contactless handwriting recognition method based on an audio signal according to the present invention.
Fig. 2 is a schematic diagram of a predefined audio signal played by a speaker according to an embodiment of the present invention.
FIG. 3 is a graph showing the logarithmic short-term energy of the differential channel impulse response dCIR calculated when "a word" is handwritten in an embodiment of the invention.
Fig. 4 is a schematic diagram of a segmentation result when handwriting "a word" in an embodiment of the present invention.
Fig. 5 is a schematic diagram of an android-based application for data acquisition according to an embodiment of the present invention.
Fig. 6 shows the handwriting recognition performance according to the embodiment of the present invention.
Fig. 7 illustrates the performance of handwriting recognition at different handwriting distances (distances between fingers and audio devices) according to an embodiment of the present invention.
Fig. 8 illustrates the handwriting recognition performance at different handwriting speeds according to an embodiment of the present invention.
Detailed Description
The principles of the present invention are described in further detail below with reference to the examples and the drawings.
Fig. 1 shows a schematic diagram of an embodiment of the invention, which consists of seven parts: signal generation, preprocessing, demodulation, differential channel impulse response estimation, post-processing, handwriting classification, and word suggestion. It is implemented with a single microphone-speaker pair on the mobile device.
Examples
A contactless handwriting recognition method based on an audio signal, comprising the steps of:
step 1: the audio signal reflected when the finger handwriting input is collected by a microphone is collected by playing a predefined audio signal frame by using a speaker in the mobile device.
Because of the rich multipath effects in real environments, the invention selects Barker codes, which have strong auto-correlation and weak cross-correlation, as the original signal for designing the transmitted audio signal, in order to distinguish different transmission paths. The specific method is as follows:
Step 1.1: Two 13-bit Barker codes are spliced to obtain a 26-bit Barker code, improving the sensing distance.
Step 1.2: the baseband sequence signal is obtained using frequency domain interpolation and then modulated to limit the bandwidth of the signal.
Specifically, the spliced Barker-code signal obtained in step 1.1 is first transformed into the frequency domain by the fast Fourier transform, and the signal is zero-padded in the frequency domain to 12 times its previous length; the signal length is then 312 sampling points. Next, the signal is transformed back into the time domain using the inverse fast Fourier transform to obtain the baseband sequence signal $b(t)$.
Finally, the signal bandwidth is limited to the range of 18kHz to 22kHz using a 20kHz carrier frequency.
In addition, to reduce interference between adjacent frames, a blank interval of 168 sampling points is appended to the signal, so the length of each signal frame is 480 sampling points. The transmitted signal is:
$s(t) = b(t)\cos(2\pi f_c t)$,
where $b(t)$ denotes the baseband sequence signal after frequency-domain interpolation, $f_c$ denotes the carrier frequency, $\pi$ denotes the circle constant, and $t$ denotes the time instant.
Fig. 2 shows a schematic diagram of a predefined audio signal played by a loudspeaker.
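As a concrete illustration, the frame design of steps 1.1 and 1.2 can be sketched in Python as follows. The patent specifies only the 26-bit spliced Barker code, the 12x frequency-domain interpolation (312 samples), the 20 kHz carrier, and the 168-sample blank interval; the 48 kHz sampling rate, unit amplitude, and this particular zero-padding layout are assumptions of the sketch:

```python
import numpy as np

def generate_frame(fs=48_000, fc=20_000, interp=12, gap=168):
    """Sketch of the transmitted frame (steps 1.1-1.2).

    Assumed (not stated in the patent): 48 kHz sampling rate, unit
    amplitude, and this zero-padding layout for the interpolation."""
    barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], float)
    code = np.concatenate([barker13, barker13])   # 26-bit spliced Barker code

    # 12x frequency-domain interpolation: zero-pad the spectrum to 312 bins.
    n, m = len(code), len(code) * interp
    spec = np.fft.fft(code)
    half = n // 2
    padded = np.zeros(m, dtype=complex)
    padded[:half] = spec[:half]
    padded[half] = spec[half] / 2         # split the Nyquist bin so the
    padded[m - half] = spec[half] / 2     # padded spectrum stays Hermitian
    padded[m - half + 1:] = spec[half + 1:]
    baseband = np.real(np.fft.ifft(padded)) * interp   # b[n], 312 samples

    # Modulate onto the 20 kHz carrier: s(t) = b(t) * cos(2*pi*fc*t).
    t = np.arange(m) / fs
    passband = baseband * np.cos(2 * np.pi * fc * t)

    # 168-sample blank interval -> 480 samples per frame.
    return np.concatenate([passband, np.zeros(gap)])
```

With a 4 kHz-wide baseband and the 20 kHz carrier, the frame's energy lands in the 18-22 kHz band stated in the description.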
Step 2: The audio signals collected by the microphone are preprocessed to eliminate the influence of environmental noise and the inherent delay of the audio device.
Step 2.1: common audio noise is removed by a bandpass filter.
After the microphone collects the reflected audio signal, an 18 kHz to 22 kHz band-pass filter is first used to remove audio noise (e.g., speech, music). A zero-phase filter is then used to reduce the signal phase offset introduced by the filtering.
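A minimal sketch of this filtering step, assuming a Butterworth design of order 4 (the patent does not specify the filter family or order); applying the filter forward and backward with `filtfilt` makes the net response zero-phase:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def remove_audio_noise(x, fs=48_000, low=18_000, high=22_000, order=4):
    """Sketch of step 2.1: band-pass 18-22 kHz, applied forward and
    backward (filtfilt) so the net result is zero-phase. The Butterworth
    family and the filter order are assumptions."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)
```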
Step 2.2: the signals are aligned to reduce the effects of the inherent delay of the audio device.
Audio devices commonly have an inherent playback delay that affects channel measurement. It is observed that there is typically no obstacle on the direct path from the speaker to the microphone, which means the audio signal arriving at the microphone via the direct path has the greatest energy. The invention therefore aligns the signals using the arrival time of the audio signal transmitted via the direct path.
First, the short-time energy (STE) of the audio signal is calculated to locate the start frame at which the signal arrives via the direct path. The short-time energy of the $i$-th frame is
$E_i = \sum_{n} x_i[n]^2$,
where $x_i[n]$ denotes the value of the $n$-th sampling point in the $i$-th frame of the received audio signal. The start frame is located within the first 20 frames (i.e., 0.2 s) of the received signal, and a dynamic threshold $T$ is set from the frame energies in this range. When the energy of 3 consecutive frames exceeds $T$, the first of these frames is determined to be the start frame.
To obtain a more accurate arrival time of the audio signal, the cross-correlation between the start frame and the transmitted signal is calculated; the time at which the correlation is maximal is the time at which the signal arrives at the microphone via the direct path.
Finally, the signal transmission time is calculated from the length of the direct path, thereby eliminating the inherent playback delay of the audio device.
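The alignment procedure above can be sketched as follows. The patent states only that a dynamic threshold is set and that three consecutive loud frames mark the start; the specific rule used here (a fraction `alpha` of the peak frame energy) and the synthetic test signal are assumptions:

```python
import numpy as np

def find_direct_path_offset(received, transmitted, frame_len=480,
                            search_frames=20, alpha=0.5, run=3):
    """Sketch of step 2.2: locate the direct-path arrival.

    `alpha` (threshold as a fraction of the peak frame energy) is an
    assumption; the patent only requires 3 consecutive frames above a
    dynamic threshold, then a correlation-based refinement."""
    n = min(search_frames, len(received) // frame_len)
    frames = received[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)        # short-time energy per frame
    thresh = alpha * energy.max()             # dynamic threshold

    start = 0
    for i in range(n - run + 1):
        if np.all(energy[i:i + run] > thresh):
            start = i                          # first of 3 loud frames
            break

    # Refine via cross-correlation with the known transmitted frame.
    seg = received[start * frame_len:(start + 2) * frame_len]
    corr = np.correlate(seg, transmitted, mode="valid")
    return start * frame_len + int(np.argmax(np.abs(corr)))
```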
Step 3: IQ demodulation is performed on the signal to obtain a baseband complex signal and thus richer information.
The audio signal received by the microphone is a passband real signal; the invention performs IQ demodulation on it to construct a baseband complex signal and obtain richer handwriting information. In a handwriting interaction scenario, the audio signal may be reflected by different obstacles.
Suppose there are $K$ propagation paths in the environment. According to the signal superposition principle, the audio signal $r(t)$ received by the microphone is:
$r(t) = \sum_{k=1}^{K} a_k\, s(t-\tau_k) = \sum_{k=1}^{K} a_k\, b(t-\tau_k)\cos\bigl(2\pi f_c (t-\tau_k)\bigr)$,
where $a_k$ and $\tau_k$ denote the attenuation and delay of the $k$-th propagation path, $s(t)$ denotes the transmitted signal, $b(t)$ denotes the baseband sequence signal after frequency-domain interpolation, $t$ denotes the time instant, $\pi$ denotes the circle constant, and $f_c$ denotes the carrier frequency.
In IQ demodulation, the audio signal $r(t)$ received by the microphone is first multiplied by cosine and sine waves, respectively, to obtain the quadrature component $Q(t)$ and the in-phase component $I(t)$.
Then, a low-pass filter with a cut-off frequency of 2 kHz is used to filter out the high-frequency parts of $I(t)$ and $Q(t)$.
Finally, the IQ components are combined to construct the baseband complex signal $r_b(t)$:
$r_b(t) = I(t) + jQ(t) = \sum_{k=1}^{K} \frac{a_k}{2}\, b(t-\tau_k)\, e^{-j 2\pi f_c \tau_k}$,
where $b(t)$ denotes the baseband sequence signal after frequency-domain interpolation, $e$ denotes the natural base, $a_k$ and $\tau_k$ denote the attenuation and delay of the $k$-th propagation path, and $j$ denotes the imaginary unit.
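The demodulation chain can be sketched as follows. The 48 kHz sampling rate and the order-4 Butterworth low-pass (only the 2 kHz cut-off is stated in the patent) are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def iq_demodulate(x, fs=48_000, fc=20_000, cutoff=2_000):
    """Sketch of step 3: multiply the aligned passband signal by cosine
    and sine carriers, low-pass filter at 2 kHz, and combine into the
    baseband complex signal r_b(t) = I(t) + j*Q(t)."""
    t = np.arange(len(x)) / fs
    i = x * np.cos(2 * np.pi * fc * t)          # in-phase branch
    q = -x * np.sin(2 * np.pi * fc * t)         # quadrature branch
    b, a = butter(4, cutoff, btype="low", fs=fs)
    return filtfilt(b, a, i) + 1j * filtfilt(b, a, q)
```

Feeding it a tone 500 Hz above the carrier yields a baseband complex exponential of magnitude 1/2 rotating at +500 Hz, matching the $a_k/2$ factor in the formula above.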
Step 4: The differential channel impulse response is estimated to eliminate the effects of static multipath.
After the baseband complex signal is obtained through IQ demodulation, the channel is continuously measured so as to track the channel change caused by finger movement during handwriting interaction.
Step 4.1: the Channel Impulse Response (CIR) is calculated using a least squares method.
First, the channel impulse response (CIR) $h$ is calculated using the least-squares method from $y = Xh$, where
$y = \bigl[r_b[L],\, r_b[L+1],\, \ldots,\, r_b[L+P-1]\bigr]^T$
is a subsequence of the baseband complex signal $r_b$, with $r_b[n]$ denoting the $n$-th sampling point of $r_b$; $X = [b_0, b_1, \ldots, b_{L-1}]$ is the cyclic training-sequence matrix, where $b_l$ denotes the training sequence (a subsequence of the baseband sequence signal $b$) delayed by $l$ sampling points and $T$ denotes the transpose; and $h$ denotes the channel impulse response (CIR).

The best linear unbiased estimate of $h$ is calculated as:
$\hat{h} = (X^H X)^{-1} X^H y$,
where $H$ denotes the conjugate transpose.

Since the $(m, n)$-th element of $X^H X$ is the product of the training sequence delayed by $m$ sampling points and the training sequence delayed by $n$ sampling points, and the autocorrelation function of the training sequence is approximately ideal, this product tends to 0 when $m \neq n$. Therefore, $X^H X$ is approximately a diagonal matrix.

The invention thus further simplifies the calculation of the channel impulse response (CIR) to:
$\hat{h} \approx \frac{1}{P} X^H y$,
where $P$ denotes the reference length; the larger $P$ is, the higher the confidence of the calculated CIR. $L$ denotes the number of CIR taps; the larger $L$ is, the larger the sensing range. According to step 1.2, the sum of $P$ and $L$ is a fixed length.

To balance confidence and sensing distance, the invention sets $P$ to 120; it should be noted that other values in [100, 140] are also within the scope of the invention. $L$ is set to 192; other values in [180, 200] are also within the scope of the invention. The handwriting sensing distance is then about 42 cm.
Step 4.2: The channel impulse response (CIR) is differenced along the time axis to obtain the differential channel impulse response (dCIR), eliminating the effects of static multipath.
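Steps 4.1 and 4.2 can be sketched as follows. The test uses a random training sequence and a synthetic two-path channel (assumptions, not from the patent); the per-column normalization in place of the constant $1/P$ is a minor robustness tweak of the sketch:

```python
import numpy as np

def estimate_cir(rb, baseband, P=120, L=192):
    """Sketch of step 4.1. `rb` is one frame of the baseband complex
    signal, `baseband` the known 312-sample training signal. Because the
    columns of the cyclic training matrix X are nearly orthogonal,
    X^H X is close to diagonal and the least-squares estimate
    (X^H X)^{-1} X^H y reduces to a per-tap matched filter."""
    X = np.empty((P, L), dtype=complex)
    idx = np.arange(L, L + P)
    for l in range(L):
        X[:, l] = baseband[idx - l]           # training delayed by l samples
    y = rb[L:L + P]
    norms = np.sum(np.abs(X) ** 2, axis=0)    # ~ P for a unit-power sequence
    return (X.conj().T @ y) / norms           # h: one CIR tap per delay

def differential_cir(cir_frames):
    """Step 4.2: difference the per-frame CIR along the time axis to
    remove static multipath components."""
    return np.diff(cir_frames, axis=0)
```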
Step 5: post-processing the signal. The method aims at eliminating random noise, reducing subsequent calculation amount and dividing handwriting input signals.
The specific method for post-treatment is as follows:
step 5.1: a smoothing filter is used to remove random noise during sampling.
The audio device inevitably introduces random noise during the sampling process, which will lead to some outliers in the differential channel impulse response (dCIR) obtained in step 4.
To eliminate the effect of outliers, in this embodiment, a Savitzky-Golay filter is used to smooth the differential channel impulse response (dCIR).
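A minimal sketch of this smoothing step; the window length and polynomial order are assumptions, as the patent names the Savitzky-Golay filter but not its parameters:

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_dcir(dcir, window=9, polyorder=2):
    """Sketch of step 5.1: Savitzky-Golay smoothing along the frame
    (time) axis of a complex (frames x taps) dCIR; window length and
    polynomial order are assumptions."""
    return (savgol_filter(dcir.real, window, polyorder, axis=0)
            + 1j * savgol_filter(dcir.imag, window, polyorder, axis=0))
```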
Step 5.2: downsampling is utilized to reduce subsequent computational overhead.
In order to reduce the calculation amount of the subsequent processing, the invention performs the average pooling processing on the differential channel impulse response (dCIR). Specifically, the pooling core size is 2×1, and the step size is 2×1.
Through downsampling, the size of the differential channel impulse response (dCIR) is reduced from $N \times 192$ to $N \times 96$, where $N$ denotes the number of frames of audio collected by the microphone.
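The pooling step can be sketched as follows. The patent's 2x1 kernel with 2x1 stride is read here as acting along the tap axis, consistent with the later reference to the number of "taps after downsampling"; that interpretation is an assumption:

```python
import numpy as np

def downsample_taps(dcir):
    """Sketch of step 5.2: average pooling with a 2x1 kernel and 2x1
    stride along the tap axis, so an (N, 192) dCIR becomes (N, 96)."""
    n_frames, n_taps = dcir.shape
    return dcir.reshape(n_frames, n_taps // 2, 2).mean(axis=2)
```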
Step 5.3: signal segmentation of individual characters or words is achieved based on logarithmic short-term energy and adaptive thresholds.
It is observed that a user's finger pauses naturally between writing successive characters. During these pauses the finger remains stationary and the differential channel impulse response (dCIR) is close to 0. Although other dynamic obstacles may exist in the environment, they are farther from the audio device than the finger, so their dynamic multipath effects are negligible.
Based on the above findings, the present invention further proposes a segmentation method combining the logarithmic short-term energy of the differential channel impulse response and the adaptive threshold for detecting the start and end frames of each character and word. The method comprises the following steps:
first, the logarithmic short-time energy of the differential channel impulse response (dCIR) is calculated frame by frame, the first
Figure SMS_83
Logarithmic short time energy of frame
Figure SMS_84
The method comprises the following steps: />
Figure SMS_86
Wherein->
Figure SMS_82
Indicate->
Figure SMS_85
Frame->
Figure SMS_87
Differential channel impulse response value of individual taps, < >>
Figure SMS_88
Representing modulo calculation +.>
Figure SMS_81
Indicating the number of differential channel impulse response (dCIR) taps after downsampling.
An adaptive threshold is then calculated based on a sliding window. The adaptive threshold $T_w$ of the $w$-th window is:
$T_w = \frac{\beta}{W} \sum_{i=(w-1)W+1}^{wW} E_i$,
where $W$ denotes the size of the sliding window, $E_i$ denotes the energy value of the $i$-th frame in the current window, and $\beta$ is the ratio constant that the average logarithmic short-time energy of the current window occupies in the window's adaptive threshold. In this embodiment, $\beta$ is set to 0.3; other values in [0.1, 0.5] are also within the scope of the invention.
To determine the beginning and ending frames of a character segment, the energy of the $i$-th frame is compared with the adaptive threshold $T_w$ of the window in which it resides. The invention defines three time thresholds, $N_{char}$, $N_{cgap}$, and $N_{wgap}$, representing the minimum number of consecutive frames of a handwritten character, the minimum number of consecutive frames in the interval between two characters, and the minimum number of consecutive frames in the interval between two words, respectively.
Then, the position of an interval segment is judged from the number of consecutive frames below the adaptive threshold:
When the logarithmic short-time energy of $N_{wgap}$ consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is the end frame of the current word, and the end frame of the interval is the start frame of the next word;
When the logarithmic short-time energy of $N_{cgap}$ consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is regarded as the end frame of the current character, and the end frame of the interval is the start frame of the next character.
In this embodiment, $N_{char}$ is set to 30; it should be noted that other values in [20, 40] are also within the scope of the invention. $N_{cgap}$ is set to 30; other values in [20, 40] are also within the scope of the invention. $N_{wgap}$ is set to 80; other values in [60, 100] are also within the scope of the invention.
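The segmentation logic can be sketched as follows, simplified to character boundaries (word boundaries follow the same pattern with the 80-frame gap threshold instead of 30). The 50-frame window size is an assumption, as the patent does not give it:

```python
import numpy as np

def segment_characters(dcir, window=50, beta=0.3,
                       min_char=30, min_char_gap=30):
    """Sketch of step 5.3: frame-wise log short-time energy, a
    sliding-window adaptive threshold, and gap-length rules."""
    # Log short-time energy per frame: E_i = log sum_l |d_i[l]|^2.
    energy = np.log(np.sum(np.abs(dcir) ** 2, axis=1) + 1e-12)

    # Adaptive threshold: beta times the mean energy of the frame's window.
    thresh = np.empty_like(energy)
    for i in range(len(energy)):
        w0 = (i // window) * window
        thresh[i] = beta * energy[w0:w0 + window].mean()
    active = energy > thresh

    segments, start, gap = [], None, 0
    for i, a in enumerate(active):
        if a:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_char_gap:           # interval between characters
                end = i - gap + 1
                if end - start >= min_char:   # long enough to be a character
                    segments.append((start, end))
                start, gap = None, 0
    if start is not None and len(active) - start >= min_char:
        segments.append((start, len(active)))
    return segments
```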
Fig. 3 shows a schematic diagram of the logarithmic short-time energy of the differential channel impulse response (dCIR) calculated when handwriting "a word" with a finger. Fig. 4 shows the segmentation result when handwriting "a word" with a finger.
Step 6: the handwritten content is classified using a classification model.
Step 6.1: the data set is expanded in both the handwriting distance and handwriting speed dimensions using data enhancement techniques.
Different users have different handwriting habits, such as the distance between the finger and the audio device and the handwriting speed. Therefore, the invention expands the dataset along different handwriting distances and handwriting speeds, respectively, to achieve independence from handwriting distance and handwriting speed.
When the finger is farther from the audio device, the transmission path of the audio signal reflected by the finger is longer, and the corresponding channel delay $\tau$ becomes larger. Therefore, in this embodiment, by adjusting the arrival time of the signals received by the microphone during alignment, signals are simulated in which the finger-to-device distance is extended by 5 cm and by 10 cm compared with the original distance.
As the handwriting speed changes, the time required to handwrite a character also changes, which stretches the differential channel impulse response map along the time dimension. The present embodiment therefore interpolates the modulus of the original differential channel impulse response along the time dimension with cubic spline interpolation, simulating signals whose handwriting speed is 2/3 and 4/3 of the original speed.
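The two augmentations can be sketched as follows (a NumPy/SciPy illustration; the function names and the sample-shift computation for the distance case are assumptions, not the patent's code):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def distance_augment(signal, extra_m, fs=48000, c=343.0):
    """Simulate a longer finger-device distance by delaying the received
    signal; extra_m is the assumed extra path length in metres."""
    shift = int(round(fs * extra_m / c))
    out = np.zeros_like(signal)
    out[shift:] = signal[:len(signal) - shift]
    return out

def speed_augment(dcir_mag, speed_ratio):
    """Simulate a handwriting-speed change by cubic-spline resampling of
    the dCIR modulus along the time axis.

    dcir_mag : (frames, taps) array; a speed_ratio of 2/3 yields a longer
    (slower) sequence, 4/3 a shorter (faster) one.
    """
    n = dcir_mag.shape[0]
    n_new = int(round(n / speed_ratio))
    spline = CubicSpline(np.arange(n), dcir_mag, axis=0)
    return spline(np.linspace(0, n - 1, n_new))
```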
Step 6.2: and classifying the handwritten content at a character level by using a classification model based on CNN-GRU.
The time taken by different users to handwrite the same character, and by the same user to handwrite different characters, differs; in this embodiment, character classification is therefore modeled as a sequence classification problem. The method is as follows:
First, the signal segment obtained in step 5.3 is divided into sub-segments by a sliding window with a window size of 40 frames and a step size of 20 frames, yielding the short-time differential channel impulse response (st-dCIR).
The short-time differential channel impulse response is then normalized and fed into a convolutional neural network (CNN) subnetwork to extract latent feature vectors. Specifically, the CNN subnetwork comprises 4 convolutional layers, 4 pooling layers, and 4 batch normalization layers.
The resulting sequence of feature vectors is then fed into a gated recurrent unit (GRU) network to learn the temporal feature information in the signal.
Finally, the output of the GRU network is passed to a fully connected layer and a softmax layer for classification, and the recognized character class is output.
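To make the sequence-modeling step concrete, here is a minimal NumPy sketch of the st-dCIR windowing and of a single GRU cell's forward recurrence (the weight shapes and names are illustrative; the real model is the trained CNN-GRU described above, not this toy):

```python
import numpy as np

def sliding_windows(dcir, win=40, step=20):
    """Split a (frames, taps) dCIR segment into st-dCIR sub-segments
    using a 40-frame window with a 20-frame step."""
    starts = range(0, dcir.shape[0] - win + 1, step)
    return np.stack([dcir[s:s + win] for s in starts])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_forward(x_seq, Wz, Uz, Wr, Ur, Wh, Uh):
    """Forward pass of one GRU cell over a sequence of feature vectors;
    returns the final hidden state."""
    h = np.zeros(Uz.shape[0])
    for x in x_seq:
        z = sigmoid(Wz @ x + Uz @ h)             # update gate
        r = sigmoid(Wr @ x + Ur @ h)             # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_cand
    return h
```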
Step 7: word suggestions are provided using spelling error correction tools based on edit distance and word frequency to correct user handwriting errors or model classification errors.
Specifically, the classification result obtained in step 6.2 is fed into the existing spelling-correction tool symspellpy. If the word given by the classification result does not exist in the dictionary, it is corrected using the configured maximum edit distance, and by default the candidate with the highest word frequency is taken as the correction result. Here, the maximum edit distance is the largest number of operations (insertions, deletions, and replacements) that may be performed to correct the word. The user may also manually select an appropriate word for input. If no suitable word can be found within the maximum edit distance, the recognition result of the classification model is output directly.
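The behaviour described here (smallest edit distance first, then highest word frequency) can be approximated by a naive sketch; this is not symspellpy's optimized algorithm, and the tiny dictionary is purely illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance with insert, delete, and replace operations."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete
                           cur[j - 1] + 1,             # insert
                           prev[j - 1] + (ca != cb)))  # replace
        prev = cur
    return prev[-1]

def correct(word, freq, max_edit_distance=2):
    """Return the in-dictionary word unchanged; otherwise the candidate
    within the maximum edit distance, ranked by (distance, frequency);
    otherwise the classifier's own output."""
    if word in freq:
        return word
    best = None
    for cand, count in freq.items():
        d = edit_distance(word, cand)
        if d <= max_edit_distance:
            key = (d, -count, cand)
            if best is None or key < best:
                best = key
    return best[2] if best else word
```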
So far, from step 1 to step 7, handwriting input recognition is realized.
Example verification
To verify the performance of the invention, the embodiment was deployed as a system prototype on 3 smartphones of different models (Honor 30 PRO, HUAWEI Mate 30, and OPPO Reno 9). A total of 8 participants (4 men and 4 women, aged between 20 and 50) were enrolled in the experiment. During data acquisition, the smartphone was placed on a desktop while the participant sat on a chair and performed handwriting input on the plane beside the device. To improve acquisition efficiency, an Android-based audio acquisition application was developed. Fig. 5 shows a schematic diagram of this application; the audio sampling rate was set to 48 kHz. After the participant clicks the start button, the speaker plays the predefined audio signal continuously while the microphone captures audio data. The participant then handwrites a character with a finger near the phone and clicks the end button when finished. The application automatically stores the captured audio in WAV format. To capture the characteristics of users handwriting both single characters and two consecutive characters, each participant was asked to handwrite the 26 English characters (a, b, …, y, z) 5 times each and the 676 English character pairs (aa, ab, …, zy, zz) 5 times each. A total of 28080 (26 English characters × 5 × 8 participants + 676 character pairs × 5 × 8 participants) audio files were collected. After data enhancement processing, 70200 samples were finally obtained.
The confusion matrix, precision, recall, and F1 score are used to evaluate the performance of the system at the character level. The confusion matrix is defined as follows: rows represent the true labels and columns the predicted labels; the value in row i, column j is the proportion of samples whose true label is the i-th character that are predicted as the j-th character. Precision is defined as the proportion of samples predicted as A that are actually A; recall is defined as the proportion of samples actually labeled A that are predicted correctly; the F1 score is defined as the harmonic mean of precision and recall.
The word accuracy and character error rate are used to evaluate the performance of the system at the word level. Word accuracy is defined as the proportion of correctly predicted words among all words; the character error rate is defined as the ratio of the edit distance between the predicted word and the actual word to the total number of characters in the actual word.
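These metrics can be computed from first principles as follows (a NumPy sketch; the confusion matrix M is assumed to hold raw counts with rows as true labels and columns as predicted labels):

```python
import numpy as np

def precision_recall_f1(M):
    """Per-class precision, recall, and F1 from a counts confusion
    matrix with rows = true labels, columns = predicted labels."""
    M = np.asarray(M, dtype=float)
    diag = np.diag(M)
    precision = diag / M.sum(axis=0)  # correct / predicted-as-class
    recall = diag / M.sum(axis=1)     # correct / actually-in-class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def edit_distance(a, b):
    """Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def character_error_rate(predicted, actual):
    """Edit distance between the words over the true word's length."""
    return edit_distance(predicted, actual) / len(actual)

def word_accuracy(predicted_words, actual_words):
    hits = sum(p == a for p, a in zip(predicted_words, actual_words))
    return hits / len(actual_words)
```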
First, the overall performance of the invention was tested. For character-level performance, Fig. 6 shows the precision, recall, and F1 score of the invention. The averages of the precision, recall, and F1 scores were 97.62%, 97.61%, and 97.60%, respectively. All samples had a precision above 94%, a recall of not less than 91.5%, and an F1 score above 93%. For word-level performance, this example collected 100 commonly used English words, 5 times per word. The experimental results show that without the word suggestion module the word accuracy and character error rate were 93.2% and 2.32%, respectively. With the word suggestion module enabled, they were 96.4% and 1.5%, respectively, verifying the effectiveness of the word suggestion module.
The effect of the distance between the finger and the audio device during handwriting was then tested. The performance of the system at different handwriting distances was evaluated through extensive experiments, and the results are shown in Fig. 7. As can be seen from the figure, the invention performs well at all three handwriting distances. Handwriting input performance is best 10 cm from the audio device, where the word accuracy reaches 97.78%, higher than the 93.80% at 5 cm and the 93.33% at 15 cm.
Finally, the influence of three different handwriting speeds on performance was tested, comparing the invention at low (0.1 m/s), moderate (0.15 m/s), and high (0.2 m/s) handwriting speeds. Fig. 8 shows that when the handwriting speed is too high, the word accuracy drops significantly and the character error rate rises significantly. This is because an excessive handwriting speed makes the generated short-time differential channel impulse response (st-dCIR) sequence too short for the CNN-GRU classification model to learn the sequence information effectively, degrading recognition accuracy.
The foregoing describes specific embodiments for the purpose of illustrating the principles of the invention; it is not intended to limit the scope of the invention, which is defined by the appended claims.

Claims (10)

1. The contactless handwriting input recognition method based on the audio signal is characterized by comprising the following steps of:
step 1: playing a predefined audio signal frame by using a loudspeaker in the mobile device, and collecting an audio signal reflected when finger handwriting is input by using a microphone;
step 2: preprocessing the audio signal acquired by the microphone to eliminate the influence of environmental noise and the inherent delay of the audio device; firstly, removing audio noise with a band-pass filter; then, aligning the signals by the arrival time of the maximum-energy signal transmitted via the direct path, reducing the effect of the inherent delay of the audio device;
step 3: IQ demodulation is carried out on the signal to obtain a baseband complex signal so as to obtain richer information;
step 4: estimating differential channel impulse response, eliminating the influence of static multipath effect;
firstly, calculating channel impulse response by using a least square method, and then, differencing the channel impulse response along a time axis to obtain differential channel impulse response, thereby eliminating the influence of static multipath effect;
step 5: post-processing the signals, eliminating random noise, reducing subsequent calculation cost and dividing handwriting input signals;
firstly, adopting a smoothing filter to inhibit abnormal values in differential channel impulse response, and eliminating random noise introduced in the sampling process;
then, 2 times down sampling is carried out on the differential channel impulse response;
finally, signal segmentation of single characters/words is realized based on the logarithmic short-time energy and the adaptive threshold;
step 6: classifying the handwritten content by using a classification model;
firstly, expanding a data set in two dimensions of handwriting distance and handwriting speed by utilizing a data enhancement technology; then, classifying the handwritten content at a character level by using a classification model based on a convolution gating circulation unit;
step 7: word suggestions are provided by using a spelling error correction tool based on the editing distance and word frequency, handwriting errors/model classification errors of a user are corrected, and handwriting input recognition results are output.
2. The method for recognizing a contactless handwriting input based on an audio signal according to claim 1, wherein in step 1 the transmitted signal is designed by taking Barker codes, which have strong auto-correlation and weak cross-correlation, as the original signal;
step 1.1: splicing two 13-bit Barker codes to obtain a 26-bit Barker code, thereby avoiding frequency leakage;
step 1.2: obtaining the baseband sequence signal by 12-times frequency-domain interpolation, and then modulating the signal to limit its bandwidth.
3. The method for recognizing non-contact handwriting input based on an audio signal according to claim 2, wherein the baseband sequence signal is obtained by frequency-domain interpolation, and the signal is then modulated to limit its bandwidth;
firstly, the spliced Barker code signal obtained in step 1.1 is transformed to the frequency domain by a fast Fourier transform, and zero-padded in the frequency domain to 12 times its previous length;
then, the signal is transformed back to the time domain by an inverse fast Fourier transform to obtain the baseband sequence signal b(t);
finally, a 20 kHz carrier frequency is used to confine the signal bandwidth to 18 kHz to 22 kHz;
to reduce interference between adjacent frames, a blank interval of 168 sampling points is appended to the signal, so that each frame of the signal is 480 sampling points long;
the transmitted signal is:
s(t) = b(t)·cos(2π·f_c·t),
wherein b(t) denotes the baseband sequence signal after frequency-domain interpolation, f_c denotes the carrier frequency, π denotes the circular constant, and t denotes the time instant.
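The signal construction of claim 3 can be sketched numerically (a NumPy illustration; the 13-bit Barker sequence is the standard one, and the zero-padding layout is one common way to realize frequency-domain interpolation):

```python
import numpy as np

BARKER13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)

def freq_interp(x, factor):
    """Interpolate x by zero-padding its spectrum between the positive
    and negative frequency halves; original samples are preserved."""
    n = len(x)
    X = np.fft.fft(x)
    Xp = np.zeros(n * factor, dtype=complex)
    Xp[: n // 2] = X[: n // 2]
    Xp[-(n - n // 2):] = X[n // 2:]
    return np.fft.ifft(Xp).real * factor

def build_frame(fs=48000, fc=20000.0, factor=12, gap=168):
    """One transmitted frame: 26-bit spliced Barker code, 12x
    frequency-domain interpolation, 20 kHz carrier, 168-sample gap."""
    code = np.concatenate([BARKER13, BARKER13])       # 26 samples
    baseband = freq_interp(code, factor)              # 312 samples
    t = np.arange(len(baseband)) / fs
    passband = baseband * np.cos(2 * np.pi * fc * t)  # s(t) = b(t)cos(2*pi*fc*t)
    return np.concatenate([passband, np.zeros(gap)])  # 480 samples
```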
4. The method for contactless handwriting recognition based on audio signals according to claim 1, wherein step 2 comprises the steps of:
step 2.1: removing audio noise through a band-pass filter;
when the microphone collects the reflected audio signal, a band-pass filter is first used to remove audio noise, and a zero-phase filter is then used to reduce the phase offset of the signal introduced by filtering;
step 2.2: the signals are aligned by means of the arrival time of the audio signals transmitted via the direct path to reduce the effects of the inherent delay of the audio device.
5. The method for contactless handwriting recognition based on audio signals according to claim 4, wherein step 2.2 comprises the steps of:
firstly, the short-time energy of the audio signal is calculated to locate the start frame at which the signal arrives via the direct path; the short-time energy E_i of the i-th frame is:
E_i = Σ_n x_i(n)²,
wherein x_i(n) denotes the value of the n-th sampling point in the i-th frame of the received audio signal; the start frame is located within the first 20 frames of the received signal; a dynamic energy threshold is set; when the energy of 3 consecutive frames exceeds the threshold, the first of these frames is judged to be the start frame;
then, the cross-correlation between the start frame and the transmitted signal is calculated, and the time of maximum correlation is taken as the time at which the signal reaches the microphone via the direct path;
finally, the transmission time of the signal is calculated from the length of the direct path, thereby eliminating the inherent playback delay of the audio device.
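A sketch of this alignment step follows (NumPy; the threshold value and frame length are illustrative parameters, and np.correlate stands in for the claim's correlation computation):

```python
import numpy as np

def find_start_frame(x, frame_len=480, energy_thresh=10.0, search_frames=20):
    """Return the index of the first of 3 consecutive frames whose
    short-time energy exceeds the (here hypothetical) dynamic threshold."""
    e = [np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
         for i in range(search_frames)]
    for i in range(len(e) - 2):
        if e[i] > energy_thresh and e[i + 1] > energy_thresh and e[i + 2] > energy_thresh:
            return i
    return None

def direct_path_offset(frame, template):
    """Sample offset of maximum cross-correlation between the start
    frame and the transmitted template (direct-path arrival)."""
    corr = np.correlate(frame, template, mode="valid")
    return int(np.argmax(np.abs(corr)))
```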
6. The method for recognizing contactless handwriting input based on an audio signal according to claim 1, wherein in step 3, assuming there are L propagation paths in the environment, according to the superposition principle the audio signal received by the microphone is:
r(t) = Σ_{l=1}^{L} a_l·s(t−τ_l) = Σ_{l=1}^{L} a_l·b(t−τ_l)·cos(2π·f_c·(t−τ_l)),
wherein a_l and τ_l denote respectively the attenuation and the delay of the audio signal over the l-th propagation path, s(t) denotes the transmitted signal, b(t) denotes the baseband sequence signal after frequency-domain interpolation, t denotes the time instant, π denotes the circular constant, and f_c denotes the carrier frequency;
firstly, the aligned audio signal r(t) is multiplied by a cosine wave and a sine wave respectively to obtain the in-phase component I(t) and the quadrature component Q(t);
then, a low-pass filter with a cut-off frequency of 2 kHz is used to filter out the high-frequency parts of I(t) and Q(t);
the baseband complex signal h(t) is constructed by combining the IQ components:
h(t) = Σ_{l=1}^{L} (a_l/2)·b(t−τ_l)·e^{−j2π·f_c·τ_l},
wherein b(t) denotes the baseband sequence signal after frequency-domain interpolation, e is the natural constant, a_l and τ_l denote respectively the attenuation and the delay over the l-th propagation path, and j denotes the imaginary unit.
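The IQ demodulation of claim 6 can be sketched with SciPy (a Butterworth low-pass stands in for the claim's 2 kHz filter; the filter order and the test tone are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def iq_demodulate(r, fc=20000.0, fs=48000, cutoff=2000.0):
    """Mix the received signal down with cosine/sine carriers and
    low-pass filter to obtain the baseband complex signal I + jQ."""
    t = np.arange(len(r)) / fs
    i_mixed = 2.0 * r * np.cos(2 * np.pi * fc * t)   # in-phase branch
    q_mixed = -2.0 * r * np.sin(2 * np.pi * fc * t)  # quadrature branch
    b, a = butter(4, cutoff / (fs / 2))              # 2 kHz low-pass
    return filtfilt(b, a, i_mixed) + 1j * filtfilt(b, a, q_mixed)
```

For a pure carrier delayed by phase θ = 2π·f_c·τ, the recovered baseband value is e^{−jθ}, matching the h(t) expression in the claim.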
7. The method for recognizing a contactless handwriting input based on an audio signal according to claim 1, wherein in step 4 the channel impulse response h is first calculated using the least squares method from
r = B·h,
wherein r = [r(1), r(2), …, r(N)]^T is a subsequence of the baseband complex signal, r(k) denoting the k-th sampling point of the baseband complex signal; B is the cyclic training-sequence matrix,
B = [b_0, b_1, …, b_{P−1}],
wherein b_p denotes the training sequence b delayed by p sampling points, T denotes the transpose, and b = [b(1), b(2), …, b(N)]^T is a subsequence of the baseband sequence signal; h denotes the channel impulse response;
the best linear unbiased estimate of h is computed as:
ĥ = (B^H·B)^{−1}·B^H·r,
wherein H denotes the conjugate transpose; B^H·B approximates a diagonal matrix;
the channel impulse response calculation is therefore further simplified to:
ĥ ≈ (1/N)·B^H·r,
wherein N denotes the reference length, and a larger N gives a higher confidence in the calculated channel impulse response; P denotes the number of taps of the channel impulse response, and a larger P gives a larger sensing range; N takes values in [100, 140], and P takes values in [180, 200].
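The simplified estimator of claim 7 can be sketched as follows (NumPy; the ±1 pseudo-random training sequence and the loose tolerances are illustrative assumptions exploiting B^H·B ≈ N·I):

```python
import numpy as np

def estimate_cir(r_seg, train, P):
    """Simplified least-squares CIR estimate h_hat = B^H r / N, where
    column p of the cyclic matrix B is the training sequence delayed
    by p samples."""
    n = len(r_seg)
    B = np.stack([np.roll(train[:n], p) for p in range(P)], axis=1)
    return B.conj().T @ r_seg / n
```

With a ±1 pseudo-noise training sequence the columns of B are nearly orthogonal, so the estimate recovers the dominant taps up to small crosstalk.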
8. The method for recognizing contactless handwriting input based on an audio signal according to claim 1, wherein in step 5 a filter is used to smooth the differential channel impulse response;
the differential channel impulse response is average-pooled with a pooling kernel of size 2×1 and a stride of 2×1;
through downsampling, the size of the differential channel impulse response is reduced from P×M to (P/2)×M, wherein M denotes the number of frames of audio collected by the microphone;
the start and end frames of each character and word are detected by a segmentation method combining the logarithmic short-time energy of the differential channel impulse response with an adaptive threshold, specifically as follows:
first, the logarithmic short-time energy of the differential channel impulse response is calculated frame by frame; the logarithmic short-time energy E_m of the m-th frame is:
E_m = log(Σ_{p=1}^{P'} |d_m(p)|²),
wherein d_m(p) denotes the differential channel impulse response value of the p-th tap in the m-th frame, |·| denotes the modulus, and P' denotes the number of dCIR taps after downsampling;
then, an adaptive threshold is calculated based on a sliding window; the adaptive threshold Th_w of the w-th window is:
Th_w = (β/W)·Σ_{m=1}^{W} E_m^{(w)},
wherein W denotes the size of the sliding window, β is a ratio constant used to adjust the weight of the average logarithmic short-time energy of the current window in the adaptive threshold, Th_w denotes the adaptive threshold of the w-th window, and E_m^{(w)} denotes the energy value of the m-th frame of the current window;
the energy of each frame is compared with the adaptive threshold Th_w of the window in which it lies; three time thresholds are defined, T_min, T_c and T_w, representing respectively the minimum number of consecutive frames of a handwritten character, the minimum number of frames in the interval between two consecutive characters, and the minimum number of frames in the interval between two consecutive words;
then, the position of an interval segment is judged by the number of consecutive frames below the adaptive threshold:
when the logarithmic short-time energy of T_w consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is taken as the end frame of the current word and its end frame as the start frame of the next word;
when the logarithmic short-time energy of T_c consecutive frames is lower than the adaptive threshold of the current window, the start frame of the interval is taken as the end frame of the current character and its end frame as the start frame of the next character;
T_min takes values in [20, 40], T_c takes values in [20, 40], and T_w takes values in [60, 100].
9. The method according to claim 1, wherein in step 6, by adjusting the arrival time used when aligning the signals received by the microphone, signals are simulated as if the distance between the finger and the audio device were extended by 5 cm and 10 cm relative to the original distance; and the modulus of the original differential channel impulse response is interpolated along the time dimension using cubic spline interpolation to simulate signals whose handwriting speed is 2/3 and 4/3 of the original speed;
the handwritten content is classified at the character level using a CNN-GRU-based classification model, comprising:
modeling character classification as a sequence classification problem; first, the signal segment obtained in step 5 is divided into sub-segments by a sliding window with a window size of 40 frames and a step size of 20 frames to obtain the short-time differential channel impulse response;
then, the short-time differential channel impulse response is normalized and fed into a convolutional neural network subnetwork to extract latent feature vectors;
then, the sequence of feature vectors is fed into a gated recurrent unit network to learn the temporal feature information in the signal;
finally, the output of the gated recurrent unit network is passed to a fully connected layer and a softmax layer for classification, and the recognized character class is output.
10. The method for recognizing non-contact handwriting input based on an audio signal according to claim 1, wherein in step 7 the classification result obtained in step 6 is fed into a spelling-correction tool; if the word given by the classification result does not exist in the dictionary, the word is corrected using the configured maximum edit distance, and by default the word with the highest word frequency is taken as the correction result; wherein the maximum edit distance is the largest number of operations that may be performed to correct the word; if no suitable word can be found within the maximum edit distance, the recognition result of the classification model is output directly.
CN202310316251.XA 2023-03-29 2023-03-29 Non-contact handwriting input recognition method based on audio signal Active CN116027911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310316251.XA CN116027911B (en) 2023-03-29 2023-03-29 Non-contact handwriting input recognition method based on audio signal


Publications (2)

Publication Number Publication Date
CN116027911A CN116027911A (en) 2023-04-28
CN116027911B true CN116027911B (en) 2023-05-30

Family

ID=86077902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310316251.XA Active CN116027911B (en) 2023-03-29 2023-03-29 Non-contact handwriting input recognition method based on audio signal

Country Status (1)

Country Link
CN (1) CN116027911B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117763399B (en) * 2024-02-21 2024-05-14 电子科技大学 Neural network classification method for self-adaptive variable-length signal input

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6285785B1 (en) * 1991-03-28 2001-09-04 International Business Machines Corporation Message recognition employing integrated speech and handwriting information
CN109657739A (en) * 2019-01-09 2019-04-19 西北大学 A kind of hand-written Letter Identification Method based on high frequency sound wave Short Time Fourier Transform


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A multi-channel-fusion error correction method for continuous handwriting recognition; Ao Xiang, Wang Xugang, Dai Guozhong, Wang Hongan; Journal of Software (Issue 09); full text *
A speech- and pen-based error correction method for handwritten mathematical formulas; Jiang Yingying, Ao Xiang, Tian Feng, Wang Xugang, Dai Guozhong; Journal of Computer Research and Development (Issue 04); full text *

Also Published As

Publication number Publication date
CN116027911A (en) 2023-04-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant