WO2023124556A1 - Method, apparatus, device and storage medium for recognizing mixed keystroke sounds from multiple keyboards - Google Patents
Method, apparatus, device and storage medium for recognizing mixed keystroke sounds from multiple keyboards
- Publication number
- WO2023124556A1 WO2023124556A1 PCT/CN2022/130829 CN2022130829W WO2023124556A1 WO 2023124556 A1 WO2023124556 A1 WO 2023124556A1 CN 2022130829 W CN2022130829 W CN 2022130829W WO 2023124556 A1 WO2023124556 A1 WO 2023124556A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- keystroke
- signal
- signal segment
- segment
- keyboard
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/03—Arrangements for converting the position or the displacement of a member into a coded form
- G06F3/041—Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
- G06F3/043—Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means using propagating acoustic waves
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the invention belongs to the technical field of signal recognition, and in particular relates to a method, apparatus, device and storage medium for recognizing mixed keystroke sounds from multiple keyboards.
- a model for distinguishing the legitimate user from illegal users is trained from user input information recorded on the computer (such as keystroke flight times); the model is then run continuously while the computer operates to authenticate the user, and once the user is considered to be an illegal user, corresponding action is taken (such as locking the screen).
- this method can also effectively prevent the eavesdropper from directly using the victim's computer.
- with the development of signal detection systems, keyboard keystroke recognition has become a research hotspot.
- the keyboard keystroke recognition problem has become one of the key issues in protecting office information security.
- the existing keyboard keystroke recognition falls mainly into two categories. The first recognizes keyboard keystrokes by implanting malicious programs on the computer; at present, security technologies such as firewalls can prevent the resulting leakage of typed content. The second uses sound, Wi-Fi, light and other signals to identify what is typed on the keyboard; this type of method eavesdrops on typed content in varied forms and is often difficult to prevent.
- the second line of research can be divided mainly into the following categories: (1) based on Wi-Fi signals, CSI technology is used to identify keystrokes, such as WiFinger; (2) based on optical signals, keystrokes are identified from video data, such as Blind Recognition of Touched Keys on Mobile Devices; (3) keyboard keystrokes are recognized based on sound signals, such as Accurate Combined Keystrokes Detection Using Acoustic Signals, which captures sound signals to recognize keystroke combinations (such as Ctrl+C).
- the purpose of the embodiments of this specification is to provide a method, apparatus, device and storage medium for recognizing mixed keystroke sounds from multiple keyboards.
- the present application provides a method for recognizing a multi-keyboard mixed keypress sound, the method comprising:
- the Mel-frequency cepstral coefficients are input into the preset single-key recognition model, and the corresponding typing content of each keyboard is output.
- acquiring the sound signal sent when the keyboard is struck includes:
- the sound signal sent by the recording component of the terminal when the keyboard is struck is acquired, and the terminal includes at least one recording component.
- the keystroke signal interception is performed on the sound signal, and the keystroke signal segment is determined, including:
- if the energy value of the first signal segment is greater than the energy threshold, then intercept the signal segment spanning the first preset duration before and the second preset duration after the starting point of the first signal segment, as the second signal segment;
- the second signal segment adopts the voice activity detection method to determine the keystroke signal segment.
- the second signal segment uses a voice activity detection method to determine the keystroke signal segment, including:
- the second signal segment uses a voice activity detection method to determine the start point and end point of the keystroke action, and extract the keystroke signal;
- the keystroke signal contains only one keystroke operation, a signal segment with a length of 41.7ms is intercepted backward from the starting point as the keystroke signal segment;
- the keystroke signal contains two keystroke operations, intercept a signal segment with a length of 41.7 ms backwards from the starting point as the first keystroke signal segment;
- the first keystroke signal segment and the second keystroke signal segment serve as keystroke signal segments.
- determining the Mel-frequency cepstral coefficient according to the keystroke signal segment includes:
- a low-pass filter is used to denoise to obtain a denoised signal segment
- the Mel-frequency cepstral coefficients are determined.
- the preset single-key recognition model is constructed through the following steps:
- the sound signal segment and the training keystroke signal segment are superimposed to determine the training segment of the keystroke signal with noise;
- the Mel-frequency cepstral coefficient training set is used as input data, and the preset single-key recognition model is obtained through training.
- the present application provides a multi-keyboard mixed key sound recognition device, the device comprising:
- An acquisition module configured to acquire the sound signal sent when the keyboard is struck
- an interception module, used to perform keystroke signal interception on the sound signal and determine the keystroke signal segment;
- a determining module, configured to determine the Mel-frequency cepstral coefficients according to the keystroke signal segment;
- a processing module, used to input the Mel-frequency cepstral coefficients into the preset single-key recognition model and output the typed content corresponding to each keyboard.
- the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
- when the processor executes the program, it implements the method for recognizing mixed keystroke sounds from multiple keyboards as in the first aspect.
- the present application provides a readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for recognizing the mixed key sounds of multiple keyboards as in the first aspect is implemented.
- the multi-keyboard mixed key sound recognition method provided in the embodiment of the present application can be applied to recognize input content of multiple keyboards.
- the multi-keyboard mixed key sound recognition method provided by the embodiment of the present application only needs to use the recording component on the terminal, no additional equipment is required, the cost is low, and it is easy to obtain.
- the multi-keyboard mixed key sound recognition method provided by the embodiment of the present application proposes a BLSTM model based on the attention mechanism and exploits the relationship between the signals received by the two recording components in the same time period, raising the accuracy of BLSTM-based key recognition to 96.41%.
- Fig. 1 is a schematic flow chart of the method for recognizing mixed keystroke sounds from multiple keyboards provided by the present application;
- Fig. 2 is a layout diagram of the experimental platform provided by the present application;
- Fig. 3 is a schematic structural diagram of the preset single-key recognition model provided by the present application;
- Fig. 4 is a schematic structural diagram of the apparatus for recognizing mixed keystroke sounds from multiple keyboards provided by the present application;
- FIG. 5 is a schematic structural diagram of an electronic device provided by the present application.
- FIG. 1 shows a schematic flow chart of the method for recognizing mixed keystroke sounds from multiple keyboards provided by the embodiment of the present application.
- the method for recognizing mixed keystroke sounds from multiple keyboards may include:
- the recording component of the terminal collects sound signals emitted when the keyboard is struck, and uploads the collected sound signals to the cloud.
- the terminal may include any electronic device with a recording component, such as a mobile phone, a tablet computer, a wearable device, and the like.
- the recording element may be a microphone.
- a terminal may include at least one microphone, for example, a mobile phone may include two or more microphones.
- the recording component of the mobile phone collects the sound signal sent when the keyboard is struck
- the mobile phone is placed between the two keyboards, and two or more recording components on the mobile phone collect the keystroke sound and upload it to the cloud.
- the keystroke signal is intercepted on the collected sound signal, and then the intercepted keystroke signal is cut into signal segments to obtain keystroke signal segments. It can be understood that if the keystroke signal includes only one keystroke operation, one keystroke signal segment can be obtained, and if the keystroke signal includes two keystroke operations, two keystroke signal segments can be obtained.
- S120 performs keystroke signal interception on the sound signal, and determines the keystroke signal segment, which may include:
- if the energy value of the first signal segment is greater than the energy threshold, then intercept the signal segment spanning the first preset duration before and the second preset duration after the starting point of the first signal segment, as the second signal segment;
- the second signal segment adopts the voice activity detection method to determine the keystroke signal segment.
- the second signal segment adopts the voice activity detection method to determine the keystroke signal segment, which may include:
- the second signal segment uses a voice activity detection method to determine the start point and end point of the keystroke action, and extract the keystroke signal;
- the keystroke signal contains only one keystroke operation, a signal segment with a length of 41.7ms is intercepted backward from the starting point as the keystroke signal segment;
- the keystroke signal contains two keystroke operations, intercept a signal segment with a length of 41.7 ms backwards from the starting point as the first keystroke signal segment;
- the first keystroke signal segment and the second keystroke signal segment serve as keystroke signal segments.
- the energy threshold may be set according to actual requirements.
- the first preset duration and the second preset duration can be set according to actual needs, for example, both the first preset duration and the second preset duration are 1s.
- the energy value a of a signal segment is the sum of the absolute values of its samples, a = |x1| + |x2| + ... + |xn|, where n is the length of the signal segment.
- for the received sound signal, the energy value of the signal segment is calculated every 41.7 ms. If the energy value exceeds the threshold, the 1 s before and the 1 s after the starting point of that signal segment (that is, the first signal segment), 2 s in total, are intercepted as a signal segment that may contain a keystroke action (that is, the second signal segment).
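For illustration only (not part of the application), a minimal Python sketch of this energy-based screening is given below; the sum-of-absolute-values energy definition, the 48 kHz sampling rate and the threshold value are assumptions, since the application leaves these choices open.

```python
import numpy as np

def find_candidate_segments(signal, sr=48000, energy_threshold=50.0,
                            window_s=0.0417, context_s=1.0):
    """Scan the recording in ~41.7 ms windows; whenever a window's energy exceeds
    the threshold, cut out the 1 s before and the 1 s after its starting point
    (2 s in total) as a candidate segment that may contain a keystroke action."""
    win = int(window_s * sr)
    ctx = int(context_s * sr)
    candidates = []
    for start in range(0, len(signal) - win, win):
        # Energy taken as the sum of absolute sample values (assumption).
        energy = np.sum(np.abs(signal[start:start + win]))
        if energy > energy_threshold:
            lo, hi = max(0, start - ctx), min(len(signal), start + ctx)
            candidates.append(signal[lo:hi])
    return candidates
```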
- VAD: Voice Activity Detection
- SVM: Support Vector Machine
- if the keystroke signal contains only one keystroke operation, a signal segment 41.7 ms long is intercepted backward from the starting point stp obtained by VAD as the keystroke signal segment, and step S130 determines the Mel-frequency cepstral coefficients from it; if the keystroke signal contains two keystroke operations, the first keystroke signal segment is the 41.7 ms segment intercepted backward from the starting point stp, the position inv at which the second keystroke operation begins (that is, the moment the two keystroke operations start to overlap) is then calculated by the regression neural network, and the second keystroke signal segment is the 41.7 ms segment intercepted backward from inv; step S130 then determines the Mel-frequency cepstral coefficients from the first keystroke signal segment and the second keystroke signal segment respectively.
- This application uses the LSTM-based regression neural network model to calculate the overlapping starting position, and its network structure includes: input layer, LSTM layer, Flatten layer and dense (fully connected) layer.
- This model uses the random superposition of single-key signals in the set of keystroke signal fragments (overlapping start position, signal source and label are random) to generate an overlapping signal containing two keystroke operations, while recording the overlapping starting position as a label.
- Input layer: receives the intercepted keystroke signal fragment as the input of the model.
- LSTM layer: encodes the input data of the model so that the output data of the LSTM contains timing information.
- Flatten layer: turns the output data of the LSTM layer into a one-dimensional vector, which is convenient for the calculation of the fully connected layer.
- Fully connected layer: multiplies its input data by the weights to obtain the estimated overlap starting position. This layer does not use an activation function.
- the present application judges the signal source (that is, which keyboard the keystroke signal comes from) by calculating the energy difference of the signal segment received by different recording elements.
- since the path lengths from the same sound source to the two recording elements are different, the attenuation of the keystroke signal is also different;
- the longer the path, the higher the attenuation, that is, the lower the total energy of the signal received by the recording element.
- the two keyboards are located on both sides of the two recording elements, so the total energy difference corresponding to one keyboard is always positive, and the total energy difference corresponding to the other keyboard is always negative. From this, the source of the keystroke signal can be judged.
- a low-pass filter is used to denoise to obtain a denoised signal segment
- the Mel-frequency cepstral coefficients are determined.
- the low-pass filter is used for denoising to obtain the denoising signal segment; according to the denoising signal segment, the Mel frequency cepstral coefficient is calculated as the input data of the preset single key recognition model.
- the Mel frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the nonlinear Mel scale of the sound frequency, and the Mel frequency cepstrum coefficients are the coefficients that make up the Mel frequency cepstrum. It considers the human auditory characteristics, first maps the linear spectrum to the Mel nonlinear spectrum based on auditory perception, and then converts it to the cepstrum.
- the Mel frequency spectrum is obtained through the Mel filter bank.
- the preset single-key recognition model may be pre-trained.
- the preset single-key recognition model is a BLSTM (bidirectional long short-term memory recurrent neural network) model based on the attention mechanism; the network structure is shown in Figure 3 and includes two input layers, two BLSTM layers, a concatenate layer, an attention layer and a dense (fully connected) layer.
- Input layers: since there is some relationship between the signals received by the two recording elements in the same time period, this neural network uses two input layers to respectively receive the Mel-frequency cepstral coefficients corresponding to the two recording elements as its input data.
- BLSTM layers: a BLSTM is composed of a forward LSTM (long short-term memory recurrent neural network) and a backward LSTM and is often used in natural language processing tasks to model context information; that is, sequential data processed by a BLSTM contains both forward and backward information. Therefore, this application adopts two BLSTM layers to respectively receive the output data of the two input layers and encode it, so that the output sequences of the BLSTMs contain timing information.
- Concatenate layer: concatenates the output sequences of the two BLSTM layers.
- Attention layer: processes the concatenated sequence of the two BLSTMs so that the attention output data contains the correlation information between the signals received by the two recording elements.
- Dense layer: a fully connected layer that processes the data output by the attention layer and obtains the key recognition result.
- the activation function used in the fully connected layer of this application is the sigmoid function, and the output dimension is set to the number of labels.
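For illustration only, a minimal Keras sketch of a network with this structure is given below. The layer widths, the dot-product self-attention, the pooling before the dense layer, the loss function and the frame/coefficient counts are assumptions; the application itself only names the layer types and the sigmoid output.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_blstm_attention_model(n_frames, n_mfcc, n_labels, units=64):
    """Two MFCC streams (one per recording element) -> two BLSTM encoders ->
    concatenation of the two output sequences -> attention -> sigmoid dense."""
    in_a = layers.Input(shape=(n_frames, n_mfcc), name="mic_a")
    in_b = layers.Input(shape=(n_frames, n_mfcc), name="mic_b")

    enc_a = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(in_a)
    enc_b = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(in_b)

    # Serial concatenation of the two BLSTM output sequences along the time axis.
    joint = layers.Concatenate(axis=1)([enc_a, enc_b])
    # Dot-product self-attention over the joint sequence (one plausible choice;
    # the application only names an "attention layer").
    attended = layers.Attention()([joint, joint])
    pooled = layers.GlobalAveragePooling1D()(attended)  # reduce to a vector (assumption)

    out = layers.Dense(n_labels, activation="sigmoid")(pooled)
    model = Model(inputs=[in_a, in_b], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hypothetical shapes: 13 MFCCs per frame, a handful of frames per 41.7 ms segment,
# and 26 labels (keys A to Z).
model = build_blstm_attention_model(n_frames=9, n_mfcc=13, n_labels=26)
```

Training would then call model.fit with the two per-microphone MFCC arrays as inputs and one-hot key labels as targets.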
- the preset single-key recognition model is constructed through the following steps:
- the sound signal segment and the training keystroke signal segment are superimposed to determine the training segment of the keystroke signal with noise;
- the Mel-frequency cepstral coefficient training set is used as input data, and the preset single-key recognition model is obtained through training.
- a commercial mobile phone is used to collect the sound signals of keystrokes on each keyboard, and the user is required to press a key every 2 seconds to avoid overlapping keystroke signals. The collected sound signal is divided into groups of two seconds as signals to be processed. For each two-second group of signals, the VAD (Voice Activity Detection) algorithm is used to intercept keystroke signal segments with a duration of about 41.7 ms. For each intercepted keystroke signal segment, a randomly chosen collected sound signal fragment of equal length is superimposed onto it as a keystroke signal segment with added noise, thereby increasing the amount of training data. For the obtained set of keystroke signal segments, the total energy value, kurtosis and the signal after 5 wavelet transforms are used to obtain a trained support vector machine model for judging whether a keystroke signal segment contains only one keystroke operation. For the intercepted keystroke signal segments and the keystroke signal segments with added noise, the Mel-frequency cepstral coefficients are calculated to generate the training set, which is used as input data to train and obtain the single-key recognition model.
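For illustration only, a minimal sketch of the noise-superposition augmentation described above, assuming the segments are NumPy arrays; the random-offset selection is an assumption rather than the application's exact procedure.

```python
import numpy as np

def add_keyboard_noise(keystroke_segment, recording, rng=None):
    """Superimpose a randomly chosen, equal-length fragment of the collected
    recording onto a keystroke segment to create an extra noisy training copy.
    Assumes len(recording) > len(keystroke_segment)."""
    rng = rng or np.random.default_rng()
    start = int(rng.integers(0, len(recording) - len(keystroke_segment)))
    return keystroke_segment + recording[start:start + len(keystroke_segment)]
```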
- the total energy of a signal of length FrameLen is calculated at intervals of FrameInc sampling points, and the array amp, the set of per-frame total energies, is obtained. Specifically, with FrameInc as the step size, a signal of length FrameLen is extracted as frame i, and the sum of the absolute values of that frame is taken as its total energy, i.e. amp[i].
- the threshold Zs can be set according to actual requirements.
- a double-key signal differs from a single-key signal in the following three points: a. in the time domain, the double-key signal roughly presents three or more peaks; b. the total energy of the double-key signal is higher than that of the original single-key signal; c. a hit peak with relatively large energy appears in the second half of the keystroke signal. Therefore, the total energy value and kurtosis of the received keystroke signal segment are extracted as features for judging whether the segment contains a double-key signal. In addition, in order to describe the different numbers of peaks in double-key and single-key signals while reducing the amount of training data, the application uses the signal obtained after 5 wavelet transforms of the original signal as a judgment feature. Accordingly, the total energy value, kurtosis and the signal after 5 wavelet transforms are calculated from the original training data as input features of the support vector machine, generating a training set for judging whether only a single keystroke operation is contained.
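For illustration only, a possible sketch of this feature extraction and SVM training; the db4 wavelet, the use of the approximation branch for the five successive decompositions, the RBF kernel and the assumption that all segments have equal length are illustrative choices not fixed by the application.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis
from sklearn.svm import SVC

def keystroke_features(segment, wavelet="db4", levels=5):
    """Total energy, kurtosis, and the signal after 5 successive wavelet
    decompositions (approximation branch), concatenated into one feature vector."""
    approx = segment
    for _ in range(levels):
        approx, _detail = pywt.dwt(approx, wavelet)
    return np.concatenate(([np.sum(np.abs(segment)), kurtosis(segment)], approx))

def train_single_vs_double(segments, labels):
    """segments: equal-length keystroke segments (padded/truncated beforehand);
    labels: 0 = one keystroke operation, 1 = two overlapping keystroke operations."""
    X = np.stack([keystroke_features(s) for s in segments])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf
```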
- the multi-keyboard mixed key sound recognition method provided in the embodiment of the present application can be applied to recognize input content of multiple keyboards.
- the multi-keyboard mixed key sound recognition method provided by the embodiment of the present application only needs to use the recording component on the terminal, no additional equipment is required, the cost is low, and it is easy to obtain.
- the multi-keyboard mixed key sound recognition method provided by the embodiment of the present application proposes a BLSTM model based on the attention mechanism and exploits the relationship between the signals received by the two recording components in the same time period, raising the accuracy of BLSTM-based key recognition to 96.41%.
- the experiments were carried out in conference rooms and dormitories.
- the environment of the conference room is relatively quiet, and the noise mainly comes from the sound of vehicles passing by in the distance, the sound of the air conditioner, and the sound reflected by the keystrokes.
- the dormitory environment is relatively noisy, and there are a series of interference noises such as various human voices, non-target keyboard keystrokes, and washing machine sounds, which bring challenges to the extraction of keystroke signal fragments.
- there are more objects in the dormitory and the environment is more complex, resulting in more complex sounds reflected from the keystrokes.
- the keyboard and mobile phone are placed on the mouse pad, and the keyboard and mobile phone are fixed on the mouse pad at the same time to avoid slight changes in the position of the keyboard during the keystroke process.
- the mechanical keyboard model is iKBC typeman W200, and the mechanical keyboard has not been used before data collection, and there is no button wear.
- the keystroke sound of the mechanical keyboard is relatively clear, and the key position is stable.
- the duration of the complete single-key signal is about 125ms, and the duration of the hit peak is about 42ms.
- the Huawei P20 has two microphones, which are located on the top and bottom of the phone. It uses the Android 8.1 system and provides a maximum sampling rate of 48kHz.
- the Redmi K30 has 3 microphones, which are located at the top, bottom and among the four cameras of the phone. It uses the Android 10.0 system and provides a maximum sampling rate of 96kHz.
- the sampling rate of the data collected by Huawei P20 is 48kHz
- the sampling rate of data collected by Redmi is 96kHz.
- Tapping speed: the tester is required to press a key every 2 seconds to avoid overlapping signals in the signals received by the microphone.
- the tester is required to press the keys A to Z, a total of 26 keys, and each key is pressed 60 times in total.
- the tester is required to split the 60 presses of each key into 3 sessions, collecting 20 groups of keystroke audio signals each time, with an interval of at least 4 hours between sessions.
- the correct rate of recognition of 26 keys on a keyboard can reach up to 96.41%.
- the key recognition accuracy rate of two keyboard mixed signals can reach up to 67%.
- the recognition accuracy of the first key is 83.25%;
- the recognition accuracy of the second key is 74.84%.
- FIG. 4 shows a schematic structural diagram of an apparatus for recognizing a multi-keyboard mixed keypress sound according to an embodiment of the present application.
- the recognition device 400 of multi-keyboard mixed keystroke sound may include:
- Obtaining module 410 is used for obtaining the sound signal that sends out when keyboard is struck;
- the intercepting module 420 is used to intercept the keystroke signal to the sound signal, and determine the keystroke signal segment;
- a determining module 430 configured to determine the Mel frequency cepstral coefficients according to the keystroke signal segment
- the processing module 440 is used to input the Mel-frequency cepstral coefficients into the preset single-key recognition model and output the typed content corresponding to each keyboard.
- the acquiring module 410 is also used for:
- the sound signal sent by the recording component of the terminal when the keyboard is struck is acquired, and the terminal includes at least one recording component.
- the interception module 420 is also used for:
- if the energy value of the first signal segment is greater than the energy threshold, then intercept the signal segment spanning the first preset duration before and the second preset duration after the starting point of the first signal segment, as the second signal segment;
- the second signal segment adopts the voice activity detection method to determine the keystroke signal segment.
- the interception module 420 is also used for:
- the second signal segment uses a voice activity detection method to determine the start point and end point of the keystroke action, and extract the keystroke signal;
- the keystroke signal contains only one keystroke operation, a signal segment with a length of 41.7ms is intercepted backward from the starting point as the keystroke signal segment;
- the keystroke signal contains two keystroke operations, intercept a signal segment with a length of 41.7 ms backwards from the starting point as the first keystroke signal segment;
- the first keystroke signal segment and the second keystroke signal segment serve as keystroke signal segments.
- the determination module 430 is also used for:
- a low-pass filter is used to denoise to obtain a denoised signal segment
- the Mel-frequency cepstral coefficients are determined.
- processing module 440 is also used for:
- the sound signal segment and the training keystroke signal segment are superimposed to determine the training segment of the keystroke signal with noise;
- the Mel-frequency cepstral coefficient training set is used as input data, and the preset single-key recognition model is obtained through training.
- the multi-keyboard mixed key sound recognition device provided in this embodiment can implement the above-mentioned method embodiment, and its implementation principle and technical effect are similar, and will not be repeated here.
- FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 5 , a schematic structural diagram of an electronic device 300 suitable for implementing the embodiments of the present application is shown.
- the electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303.
- ROM read-only memory
- RAM random access memory
- in the RAM 303, various programs and data necessary for the operation of the device 300 are also stored.
- the CPU 301, ROM 302, and RAM 303 are connected to each other through a bus 304.
- An input/output (I/O) interface 305 is also connected to the bus 304 .
- the following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, etc.; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 308 including a hard disk, etc. and a communication section 309 including a network interface card such as a LAN card, a modem, or the like.
- the communication section 309 performs communication processing via a network such as the Internet.
- A drive 310 is also connected to the I/O interface 305 as needed.
- a removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 310 as necessary so that a computer program read therefrom is installed into the storage section 308 as necessary.
- the process described above with reference to FIG. 1 may be implemented as a computer software program.
- the embodiments of the present disclosure include a computer program product, which includes a computer program tangibly contained on a machine-readable medium, and the computer program includes program codes for executing the above-mentioned method for recognizing mixed key sounds of multiple keyboards.
- the computer program may be downloaded and installed from a network via communication portion 309 and/or installed from removable media 311 .
- each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
- the units or modules involved in the embodiments described in the present application may be implemented by means of software or by means of hardware.
- the described units or modules may also be provided in a processor.
- the names of these units or modules do not constitute limitations on the units or modules themselves in some cases.
- a typical implementing device is a computer.
- the computer can be, for example, a personal computer, a notebook computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any of these devices combination of devices.
- the present application also provides a storage medium, which may be the storage medium contained in the aforementioned device in the above embodiment, or may be a storage medium that exists independently and is not assembled into the device.
- the storage medium stores one or more programs, and the aforementioned programs are used by one or more processors to execute the multi-keyboard mixed key sound recognition method described in this application.
- Storage media includes permanent and non-permanent, removable and non-removable media.
- Information storage can be realized by any method or technology.
- Information may be computer readable instructions, data structures, modules of a program, or other data.
- Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridge, tape magnetic disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
- computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.
Abstract
The present application provides a method, apparatus, device and storage medium for recognizing mixed keystroke sounds from multiple keyboards. The method includes: acquiring a sound signal emitted when a keyboard is struck; performing keystroke signal interception on the sound signal to determine keystroke signal segments; determining Mel-frequency cepstral coefficients from the keystroke signal segments; and inputting the Mel-frequency cepstral coefficients into a preset single-key recognition model, which outputs the typed content corresponding to each keyboard. The scheme is applicable to recognizing the mixed keystroke sounds of multiple keyboards, with high recognition accuracy.
Description
The invention belongs to the technical field of signal recognition, and in particular relates to a method, apparatus, device and storage medium for recognizing mixed keystroke sounds from multiple keyboards.
Nowadays, most office work consists of using a keyboard and mouse to enter content into a computer in a room. The content entered sometimes contains information involving personal, customer or even company privacy, such as personal passwords, customer data and company bidding contracts; once exploited by criminals, such information can cause huge losses to the parties concerned. For example, the 2018 Cost of Data Breach Study reported that the average enterprise loss per information leakage incident was 3.86 million US dollars. The security of keyed-in information is therefore critical.
Usually, external eavesdroppers adopt intrusive eavesdropping to capture keyboard input, obtaining the victim's typed information by implanting malicious programs on the computer. With the development of network security computing such as cloud security, security technologies such as firewalls can effectively prevent eavesdropping by external parties. However, eavesdropping by insiders still poses a great threat to the security of typed information: an insider can use the victim's computer without a password and launch an attack during the short period when the victim leaves the computer (for example, to go to the toilet). For this eavesdropping scenario, researchers have proposed continuous user authentication as a preventive measure. A model for distinguishing the legitimate user from illegal users is trained from user input information recorded on the computer, such as keystroke flight times; the model is then run continuously while the computer is operating to authenticate the user, and once the user is judged to be illegal, corresponding action (such as locking the screen) is taken. This approach can also effectively prevent an eavesdropper from directly using the victim's computer.
With the development of signal detection systems, keyboard keystroke recognition has become a research hotspot. The keyboard keystroke recognition problem has become one of the key issues in protecting office information security.
Existing keyboard keystroke recognition falls mainly into two categories. The first recognizes keyboard keystrokes by implanting malicious programs on the computer; at present, security technologies such as firewalls can prevent the resulting leakage of typed content. The second uses sound, Wi-Fi, light and other signals to identify keyboard keystroke content; this type of method eavesdrops on typed content in varied forms and is often difficult to prevent. The second line of research can be further divided into the following categories: (1) based on Wi-Fi signals, using CSI technology to recognize keyboard input, such as WiFinger; (2) based on optical signals, recognizing keyboard input from video data, such as Blind Recognition of Touched Keys on Mobile Devices; (3) based on sound signals, recognizing keyboard input, such as Accurate Combined Keystrokes Detection Using Acoustic Signals, which captures sound signals to recognize keystroke combinations (such as Ctrl+C).
Existing keyboard keystroke recognition techniques all target the recognition of a single key or a specific key combination (such as Ctrl+C) on a single keyboard. In an office scenario, however, multiple keyboards are often struck at the same time, and the signal received by the recording device is usually a mixed sound signal from multiple keyboards. Existing keystroke sound recognition techniques therefore lack general applicability.
Summary of the invention
The purpose of the embodiments of this specification is to provide a method, apparatus, device and storage medium for recognizing mixed keystroke sounds from multiple keyboards.
To solve the above technical problem, the embodiments of the present application are implemented as follows:
In a first aspect, the present application provides a method for recognizing mixed keystroke sounds from multiple keyboards, the method comprising:
acquiring a sound signal emitted when a keyboard is struck;
performing keystroke signal interception on the sound signal to determine keystroke signal segments;
determining Mel-frequency cepstral coefficients from the keystroke signal segments;
inputting the Mel-frequency cepstral coefficients into a preset single-key recognition model, which outputs the typed content corresponding to each keyboard.
In one embodiment, acquiring the sound signal emitted when a keyboard is struck includes:
acquiring the sound signal emitted when the keyboard is struck, sent by a recording element of a terminal, the terminal including at least one recording element.
In one embodiment, performing keystroke signal interception on the sound signal to determine keystroke signal segments includes:
calculating the energy value of a signal segment of the sound signal every 41.7 ms;
if the energy value of the first signal segment is greater than an energy threshold, intercepting a signal segment spanning a first preset duration before and a second preset duration after the starting point of the first signal segment, as a second signal segment;
applying a voice activity detection method to the second signal segment to determine the keystroke signal segment.
In one embodiment, applying a voice activity detection method to the second signal segment to determine the keystroke signal segment includes:
applying the voice activity detection method to the second signal segment to determine the starting point and end point of the keystroke action and extract a keystroke signal;
calculating the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal;
inputting the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal into a preset support vector machine to judge whether the keystroke signal contains only one keystroke operation;
if the keystroke signal contains only one keystroke operation, intercepting a signal segment 41.7 ms long backward from the starting point as the keystroke signal segment;
if the keystroke signal contains two keystroke operations, intercepting a signal segment 41.7 ms long backward from the starting point as a first keystroke signal segment;
calculating, by means of a regression neural network, the starting position at which the second keystroke operation begins, and intercepting a signal segment 41.7 ms long backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
In one embodiment, determining the Mel-frequency cepstral coefficients from the keystroke signal segment includes:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
determining the Mel-frequency cepstral coefficients from the denoised signal segment.
In one embodiment, the preset single-key recognition model is constructed through the following steps:
acquiring the sound signal of key presses on each keyboard;
using a voice activity detection method to intercept, from the sound signal, keystroke signal training segments with a duration of 41.7 ms;
randomly acquiring, from the sound signal, sound signal fragments of the same length as the training keystroke signal segments;
superimposing the sound signal fragments onto the training keystroke signal segments to determine noisy keystroke signal training segments;
determining a Mel-frequency cepstral coefficient training set from all the keystroke signal training segments and all the noisy keystroke signal training segments respectively;
using the Mel-frequency cepstral coefficient training set as input data and training to obtain the preset single-key recognition model.
In a second aspect, the present application provides an apparatus for recognizing mixed keystroke sounds from multiple keyboards, the apparatus comprising:
an acquisition module, configured to acquire the sound signal emitted when a keyboard is struck;
an interception module, configured to perform keystroke signal interception on the sound signal and determine keystroke signal segments;
a determination module, configured to determine Mel-frequency cepstral coefficients from the keystroke signal segments;
a processing module, configured to input the Mel-frequency cepstral coefficients into a preset single-key recognition model and output the typed content corresponding to each keyboard.
In a third aspect, the present application provides an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the program, it implements the method for recognizing mixed keystroke sounds from multiple keyboards according to the first aspect.
In a fourth aspect, the present application provides a readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the method for recognizing mixed keystroke sounds from multiple keyboards according to the first aspect.
As can be seen from the technical solutions provided by the above embodiments of this specification:
The method for recognizing mixed keystroke sounds from multiple keyboards provided in the embodiments of the present application is applicable to recognizing the typed content of multiple keyboards.
The method provided in the embodiments of the present application only needs the recording elements on a terminal; no additional equipment is required, the cost is low, and the hardware is easy to obtain.
The method provided in the embodiments of the present application proposes an attention-based BLSTM model that exploits the relationship between the signals received by two recording elements in the same time period, raising the accuracy of BLSTM-based key recognition to 96.41%.
In order to explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this specification; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of the method for recognizing mixed keystroke sounds from multiple keyboards provided by the present application;
Fig. 2 is a layout diagram of the experimental platform provided by the present application;
Fig. 3 is a schematic structural diagram of the preset single-key recognition model provided by the present application;
Fig. 4 is a schematic structural diagram of the apparatus for recognizing mixed keystroke sounds from multiple keyboards provided by the present application;
Fig. 5 is a schematic structural diagram of the electronic device provided by the present application.
In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this specification. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of this specification.
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary details do not obscure the description of the present application.
Various modifications and changes can be made to the specific embodiments described in the specification of the present application without departing from the scope or spirit of the present application, as will be obvious to those skilled in the art. Other embodiments derived from the specification of the present application will be obvious to the skilled person. The specification and embodiments of the present application are merely exemplary.
Terms such as "comprise", "include", "have" and "contain" used herein are all open terms, meaning including but not limited to.
Unless otherwise specified, "parts" in the present application are parts by mass.
The present invention is further described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, it shows a schematic flow chart applicable to the method for recognizing mixed keystroke sounds from multiple keyboards provided by the embodiments of the present application.
As shown in Fig. 1, the method for recognizing mixed keystroke sounds from multiple keyboards may include:
S110: acquiring a sound signal emitted when a keyboard is struck.
Specifically, a recording element of a terminal collects the sound signal emitted when a keyboard is struck and uploads the collected sound signal to the cloud. The terminal may include any electronic device with a recording element, such as a mobile phone, a tablet computer or a wearable device. The recording element may be a microphone. The terminal may include at least one microphone; for example, a mobile phone may include two or more microphones.
As shown in Fig. 2, when the recording elements of the mobile phone collect the sound signal emitted when the keyboards are struck, the mobile phone is placed between the two keyboards, and two or more recording elements on the mobile phone collect the keystroke sound and upload it to the cloud.
S120: performing keystroke signal interception on the sound signal to determine keystroke signal segments.
Specifically, in the cloud, keystroke signal interception is performed on the collected sound signal, and the intercepted keystroke signal is then cut into signal segments to obtain keystroke signal segments. It can be understood that if the keystroke signal contains only one keystroke operation, one keystroke signal segment is obtained; if the keystroke signal contains two keystroke operations, two keystroke signal segments are obtained.
In one embodiment, S120, performing keystroke signal interception on the sound signal to determine keystroke signal segments, may include:
calculating the energy value of a signal segment of the sound signal every 41.7 ms;
if the energy value of the first signal segment is greater than an energy threshold, intercepting a signal segment spanning a first preset duration before and a second preset duration after the starting point of the first signal segment, as a second signal segment;
applying a voice activity detection method to the second signal segment to determine the keystroke signal segment.
Applying the voice activity detection method to the second signal segment to determine the keystroke signal segment may include:
applying the voice activity detection method to the second signal segment to determine the starting point and end point of the keystroke action and extract a keystroke signal;
calculating the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal;
inputting the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal into a preset support vector machine to judge whether the keystroke signal contains only one keystroke operation;
if the keystroke signal contains only one keystroke operation, intercepting a signal segment 41.7 ms long backward from the starting point as the keystroke signal segment;
if the keystroke signal contains two keystroke operations, intercepting a signal segment 41.7 ms long backward from the starting point as a first keystroke signal segment;
calculating, by means of a regression neural network, the starting position at which the second keystroke operation begins, and intercepting a signal segment 41.7 ms long backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
Specifically, the energy threshold can be set according to actual requirements. The first preset duration and the second preset duration can also be set according to actual requirements; for example, both may be 1 s.
The energy value a of a signal segment is the sum of the absolute values of its samples, a = |x1| + |x2| + ... + |xn|, where n is the length of the signal segment.
For the received sound signal, the energy value of the signal segment is calculated every 41.7 ms. If the energy value exceeds the threshold, the 1 s before and the 1 s after the starting point of that signal segment (that is, the first signal segment), 2 s in total, are intercepted as a signal segment that may contain a keystroke action (that is, the second signal segment).
For the intercepted signal segment, a Voice Activity Detection (VAD) method is used to find the starting point stp and the end point of the keystroke action and extract the keystroke signal.
For the keystroke signal extracted by VAD, its total energy value, kurtosis and the signal after 5 wavelet transforms are calculated, and a trained SVM (support vector machine) judges whether the keystroke signal contains only one keystroke operation. If the keystroke signal contains only one keystroke operation, a signal segment 41.7 ms long is intercepted backward from the starting point stp obtained by VAD as the keystroke signal segment, and step S130 determines the Mel-frequency cepstral coefficients from it. If the keystroke signal contains two keystroke operations, the first keystroke signal segment is the 41.7 ms segment intercepted backward from the starting point stp; the position inv at which the second keystroke operation begins (that is, the moment at which the two keystroke operations start to overlap) is then calculated by a regression neural network, and the second keystroke signal segment is the 41.7 ms segment intercepted backward from inv. Step S130 then determines the Mel-frequency cepstral coefficients from the first and second keystroke signal segments respectively.
The specific operations for obtaining the regression neural network model used to calculate the overlap starting position are as follows:
The present application uses an LSTM-based regression neural network model to calculate the overlap starting position. Its network structure includes an input layer, an LSTM layer, a Flatten layer and a dense (fully connected) layer. The model uses random superposition of single-key signals from the set of keystroke signal segments (the overlap starting position, signal sources and labels are random) to generate overlapping signals containing two keystroke operations, while recording the overlap starting position as the label.
Input layer: receives the intercepted keystroke signal segment as the input of the model.
LSTM layer: encodes the input data of the model so that the output data of the LSTM contains timing information.
Flatten layer: turns the output data of the LSTM layer into a one-dimensional vector, which is convenient for the calculation of the fully connected layer.
Fully connected layer: multiplies its input data by the weights to obtain the estimated overlap starting position. This layer does not use an activation function.
Loss function: in order to make the error between the predicted value and the true value as small as possible, the loss function is set to L(Y, f(X)) = max(|Y - f(X)|).
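For illustration only, a minimal Keras sketch of such a regression network with the stated loss L(Y, f(X)) = max(|Y - f(X)|); the LSTM width, the optimizer and the exact way overlapping training signals are synthesized are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

def max_abs_error(y_true, y_pred):
    # Loss from the description: L(Y, f(X)) = max(|Y - f(X)|)
    return tf.reduce_max(tf.abs(y_true - y_pred))

def build_overlap_regressor(segment_len, units=64):
    """Input -> LSTM -> Flatten -> Dense (no activation), predicting the sample
    index at which the second keystroke starts to overlap the first."""
    inp = layers.Input(shape=(segment_len, 1))
    x = layers.LSTM(units, return_sequences=True)(inp)
    x = layers.Flatten()(x)
    out = layers.Dense(1)(x)  # linear output, per the description
    model = Model(inp, out)
    model.compile(optimizer="adam", loss=max_abs_error)
    return model

def make_overlap_example(sig_a, sig_b, rng=None):
    """Randomly superimpose two single-key signals; the random offset (overlap
    starting position) is recorded as the regression label."""
    rng = rng or np.random.default_rng()
    offset = int(rng.integers(1, len(sig_a)))
    mixed = np.zeros(max(len(sig_a), offset + len(sig_b)))
    mixed[:len(sig_a)] += sig_a
    mixed[offset:offset + len(sig_b)] += sig_b
    return mixed, offset  # mixed would be cut/padded to segment_len before training
```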
It can be understood that, for a keystroke signal segment, the present application judges the signal source (that is, which keyboard the keystroke signal comes from) by calculating the energy difference between the signal segments received by different recording elements.
The specific operations for judging the signal source are as follows:
(1) The keystroke signal segments received by the two recording elements are aligned in time.
(2) After alignment, the total energy values of the signal segments of the two recording elements are calculated separately and their difference is obtained.
(3) Since the path lengths from the same sound source to the two recording elements are different, the attenuation of the keystroke signal also differs: the longer the path, the higher the attenuation, that is, the lower the total energy of the signal received by the recording element. The two keyboards are located on the two sides of the two recording elements, so the total energy difference corresponding to one keyboard is always positive and that corresponding to the other keyboard is always negative. From this, the source of the keystroke signal can be judged.
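For illustration only, a minimal sketch of this energy-difference test; the sum-of-absolute-values energy and the labels returned are assumptions consistent with, but not prescribed by, the description.

```python
import numpy as np

def keyboard_of_origin(seg_mic_a, seg_mic_b):
    """The time-aligned segments from the two recording elements are compared;
    the sign of the total-energy difference indicates which side (keyboard)
    the keystroke came from."""
    diff = np.sum(np.abs(seg_mic_a)) - np.sum(np.abs(seg_mic_b))
    return "keyboard near microphone A" if diff > 0 else "keyboard near microphone B"
```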
S130: determining Mel-frequency cepstral coefficients from the keystroke signal segments, which may include:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
determining the Mel-frequency cepstral coefficients from the denoised signal segment.
For a keystroke signal segment, a low-pass filter is used for denoising to obtain a denoised signal segment; from the denoised signal segment, the Mel-frequency cepstral coefficients are calculated as the input data of the preset single-key recognition model.
In the field of sound, the Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the nonlinear Mel scale of sound frequency, and the Mel-frequency cepstral coefficients are the coefficients that make up the Mel-frequency cepstrum. It takes human auditory characteristics into account: the linear spectrum is first mapped onto the Mel nonlinear spectrum based on auditory perception and then converted to the cepstrum.
The specific operations for calculating the Mel-frequency cepstral coefficients are as follows:
(1) Pre-emphasis, framing and windowing are applied to the denoised signal segment.
(2) For each frame, the corresponding spectrum is obtained through the FFT (fast Fourier transform).
(3) For the obtained spectrum, the Mel spectrum is obtained through the Mel filter bank.
(4) Cepstral analysis operations such as taking the logarithm and applying the inverse transform are performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients.
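For illustration only, a compact sketch of steps (1) to (4) using librosa; the Butterworth low-pass filter order, the 12 kHz cutoff and the choice of 13 coefficients are assumptions, since the application does not fix these parameters.

```python
import numpy as np
import librosa
from scipy.signal import butter, lfilter

def mfcc_features(segment, sr=48000, cutoff_hz=12000, n_mfcc=13):
    """Low-pass denoising followed by MFCC extraction. Framing, windowing,
    the FFT, the Mel filter bank and the log/DCT cepstral step are performed
    inside librosa.feature.mfcc."""
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    denoised = lfilter(b, a, segment)
    emphasized = librosa.effects.preemphasis(denoised.astype(np.float32))
    return librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_mfcc)
```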
S140: inputting the Mel-frequency cepstral coefficients into the preset single-key recognition model and outputting the typed content corresponding to each keyboard.
Specifically, the preset single-key recognition model may be trained in advance. The preset single-key recognition model is a BLSTM (bidirectional long short-term memory recurrent neural network) model based on the attention mechanism; the network structure is shown in Fig. 3 and includes two input layers, two BLSTM layers, a concatenate layer, an attention layer and a dense (fully connected) layer.
Input layers: since there is some relationship between the signals received by the two recording elements in the same time period, this neural network uses two input layers to respectively receive the Mel-frequency cepstral coefficients corresponding to the two recording elements as its input data.
BLSTM layers: a BLSTM is composed of a forward LSTM (long short-term memory recurrent neural network) and a backward LSTM and is often used in natural language processing tasks to model context information; that is, sequential data processed by a BLSTM contains both forward and backward information. The present application therefore uses two BLSTM layers to respectively receive the output data of the two input layers and encode it, so that the output sequences of the BLSTMs contain timing information.
Concatenate layer: concatenates the output sequences of the two BLSTM layers.
Attention layer: processes the concatenated sequence of the two BLSTMs so that the attention output data contains the correlation information between the signals received by the two recording elements.
Dense layer: a fully connected layer that processes the data output by the attention layer and obtains the key recognition result. The activation function used in the fully connected layer of the present application is the sigmoid function, and the output dimension is set to the number of labels.
In one embodiment, the preset single-key recognition model is constructed through the following steps:
acquiring the sound signal of key presses on each keyboard;
using a voice activity detection method to intercept, from the sound signal, keystroke signal training segments with a duration of 41.7 ms;
randomly acquiring, from the sound signal, sound signal fragments of the same length as the training keystroke signal segments;
superimposing the sound signal fragments onto the training keystroke signal segments to determine noisy keystroke signal training segments;
determining a Mel-frequency cepstral coefficient training set from all the keystroke signal training segments and all the noisy keystroke signal training segments respectively;
using the Mel-frequency cepstral coefficient training set as input data and training to obtain the preset single-key recognition model.
Specifically, a commercial mobile phone is used to collect the sound of key presses on each keyboard, and the user is required to press a key every 2 seconds to avoid overlapping keystroke signals. For the collected sound signal, every two seconds of sound is taken as one group of signals to be processed. For each two-second group, the VAD (voice activity detection) algorithm is used to intercept keystroke signal segments with a duration of about 41.7 ms. For each intercepted keystroke signal segment, a randomly chosen collected sound signal fragment of equal length is superimposed onto it as a noisy keystroke signal segment, thereby increasing the amount of training data. For the obtained set of keystroke signal segments, the total energy value, kurtosis and the signal after 5 wavelet transforms are used to obtain a trained support vector machine model for judging whether a keystroke signal segment contains only one keystroke operation. For the intercepted keystroke signal segments and the noisy keystroke signal segments, the Mel-frequency cepstral coefficients are calculated to generate the training set; this training set is used as input data to train and obtain the single-key recognition model.
In order to cut the keystroke signal out of the original signal and to prevent the back end from classifying data that contains no keystroke signal, a common voice activity detection algorithm, double-threshold endpoint detection, is used to identify and eliminate long periods of silence.
The specific operations for intercepting keystroke signal segments with the VAD algorithm are as follows:
(1) The normalized signal is updated using the formula x_i = α·x_(i-1) + β, i = 2...L, thereby introducing timing information.
(2) For the signal with timing information introduced, the total energy of a signal of length FrameLen is calculated at intervals of FrameInc sampling points, giving the array amp, the set of per-frame total energies. Specifically, with FrameInc as the step size, a signal of length FrameLen is extracted as frame i, and the sum of the absolute values of the frame is calculated as the total energy of this frame, i.e. amp[i].
(3) A higher short-time energy threshold MH and a lower short-time energy threshold ML are calculated from the maximum value of amp, where MH = min(max(amp)/4, 10) and ML = min(max(amp)/8, 2). If amp[i] > ML, the frame may be in the sounding stage (the frame is set to status 1); when the number of status-1 frames exceeds 15, the sounding stage is considered to have definitely begun.
(4) The short-time zero-crossing rate (that is, the number of crossings of the horizontal axis per unit time) is calculated for each frame, giving the array zcr. Specifically, zcr[i] is the number of horizontal-axis crossings counted in frame i divided by the frame length FrameLen.
(5) The array amp is traversed; if amp[i] exceeds the threshold MH, the first reference starting point stp1 is obtained.
(6) Traversing backward from stp1, if amp[i] exceeds the threshold MH or the short-time zero-crossing rate zcr[i] exceeds the threshold Zs, the keystroke sound is considered to be continuing and the traversal continues; otherwise, the keystroke sound is considered to have ended. The threshold Zs can be set according to actual requirements.
(7) The keystroke signal segment is cut out according to the starting point and end point found.
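For illustration only, a simplified sketch of the double-threshold endpoint detection above; the lower threshold ML, the 15-frame confirmation of the sounding stage and the normalization/update step (1) are omitted, and the frame sizes and the Zs value are assumptions.

```python
import numpy as np

def vad_keystroke_endpoints(x, frame_len=2000, frame_inc=1000, zs=0.1):
    """Locate the start and end of a keystroke in a candidate segment using
    frame energies (sum of absolute values) and the short-time zero-crossing rate."""
    n_frames = 1 + (len(x) - frame_len) // frame_inc
    amp = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * frame_inc: i * frame_inc + frame_len]
        amp[i] = np.sum(np.abs(frame))
        # fraction of adjacent samples whose sign differs (~ crossings / FrameLen)
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    mh = min(amp.max() / 4, 10)          # higher short-time energy threshold MH
    above = np.where(amp > mh)[0]
    if len(above) == 0:
        return None                      # no keystroke found
    start = end = above[0]               # first reference starting point stp1
    while end + 1 < n_frames and (amp[end + 1] > mh or zcr[end + 1] > zs):
        end += 1                         # keystroke sound still continuing
    return start * frame_inc, end * frame_inc + frame_len  # sample indices
```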
The specific operations for obtaining the support vector machine model used to judge whether a keystroke signal segment contains only one keystroke operation are as follows:
(1) Single-key signals from the set of keystroke signal segments are randomly superimposed (the overlap starting position, signal sources and labels are random) to generate overlapping signals containing two keystroke operations. Here, "random" means that the overlap starting position (a numerical value) is generated randomly and that two key signals are selected randomly (which keyboard and which key each signal comes from are both random); "superposition" means that the selected keystroke signals are linearly superimposed according to the generated overlap starting position.
(2) The single-key signals of the training set and the generated overlapping signals are labeled to produce the raw data of the support vector machine training set.
(3) A double-key signal differs from a single-key signal in the following three points: a. in the time domain, the double-key signal roughly presents three or more peaks; b. the total energy of the double-key signal is higher than that of the original single-key signal; c. a hit peak with relatively large energy appears in the second half of the keystroke signal. Therefore, the total energy value and kurtosis of the received keystroke signal segment are extracted as features for judging whether the keystroke signal segment contains a double-key signal. Meanwhile, in order to describe the different numbers of peaks in double-key and single-key signals while reducing the amount of training data, the present application uses the signal obtained after 5 wavelet transforms of the original signal as a judgment feature. Therefore, the total energy value, kurtosis and the signal after 5 wavelet transforms are calculated from the raw training data as input features of the support vector machine, generating a training set for judging whether only a single keystroke operation is contained.
(4) The SVM model is trained and obtained from the training set.
The method for recognizing mixed keystroke sounds from multiple keyboards provided in the embodiments of the present application is applicable to recognizing the typed content of multiple keyboards.
The method provided in the embodiments of the present application only needs the recording elements on a terminal; no additional equipment is required, the cost is low, and the hardware is easy to obtain.
The method provided in the embodiments of the present application proposes an attention-based BLSTM model that exploits the relationship between the signals received by two recording elements in the same time period, raising the accuracy of BLSTM-based key recognition to 96.41%.
Experimental verification
Experimental environment: the experiments were carried out in a conference room and in a dormitory. The conference room environment is relatively quiet; the noise mainly comes from vehicles passing in the distance, the air conditioner and the reflections of the keystroke sounds. There are many objects in the conference room and the environment is relatively complex. The dormitory environment is relatively noisy, with a series of interfering noises such as various human voices, keystrokes on non-target keyboards and the sound of a washing machine, which poses a challenge to the extraction of keystroke signal segments. At the same time, there are more objects in the dormitory and the environment is more complex, so the reflected keystroke sounds are more complex. To avoid the influence of the desktop material under the keyboard and of desk vibration when keys are struck, the keyboard and the mobile phone were placed on a mouse pad and fixed to it, avoiding slight changes in the keyboard's position while typing.
Keyboard: the experiments were mainly carried out on a mechanical keyboard. The mechanical keyboard model is iKBC typeman W200; it had not been used before data collection, so there was no key wear. The keystroke sound of the mechanical keyboard is relatively clear and the key positions are stable; the duration of a complete single-key signal is about 125 ms, and the duration of the hit peak is about 42 ms.
Mobile phones: the software was deployed on the Huawei P20 and Redmi K30 phone platforms for keystroke sound collection, data transmission and display of the eavesdropped text. The Huawei P20 has 2 microphones, located at the top and bottom of the phone, runs Android 8.1 and provides a maximum sampling rate of 48 kHz. The Redmi K30 has 3 microphones, located at the top, at the bottom and among the four cameras of the phone, runs Android 10.0 and provides a maximum sampling rate of 96 kHz. When deployed on the Redmi K30, the software can only use the top and bottom microphones. The data collected on both phone platforms are therefore two-channel data; the sampling rate of the data collected by the Huawei P20 is 48 kHz and that of the Redmi is 96 kHz.
Typing speed: the testers were required to press a key every 2 seconds to avoid overlapping signals in the signals received by the microphones.
Data set: the testers were required to press the keys A to Z, 26 keys in total, with each key pressed 60 times in total. To rule out the possibility that persistent environmental sounds varying over time (such as human speech or songs played outdoors) would be used as features for key classification, the testers were required to split the 60 presses of each key into 3 sessions, collecting 20 groups of keystroke audio signals each time, with an interval of at least 4 hours between sessions.
Single-key recognition performance
The recognition accuracy for the 26 keys of one keyboard reaches up to 96.41%.
Double-key recognition performance
The key recognition accuracy for the mixed signal of two keyboards reaches up to 67%.
Overall simulation experiment: single-key signals from the two keyboards are linearly superimposed, with the overlap starting position being a randomly generated value inv. The linearly superimposed signal is used to simulate the mixed signal of multiple keyboards. The overlap starting point, signal sources and labels of the mixed signal are all chosen at random.
Signal-source judgment performance
Premise: the overlap starting position is known.
Single-key judgment accuracy: 99.87%.
Double-key judgment accuracy: 94.37%.
Double-key recognition performance
Premise: the overlap starting position and the signal sources are known.
Recognition accuracy of the first key: 83.25%;
Recognition accuracy of the second key: 74.84%.
Referring to Fig. 4, it shows a schematic structural diagram of the apparatus for recognizing mixed keystroke sounds from multiple keyboards according to an embodiment of the present application.
As shown in Fig. 4, the apparatus 400 for recognizing mixed keystroke sounds from multiple keyboards may include:
an acquisition module 410, configured to acquire the sound signal emitted when a keyboard is struck;
an interception module 420, configured to perform keystroke signal interception on the sound signal and determine keystroke signal segments;
a determination module 430, configured to determine Mel-frequency cepstral coefficients from the keystroke signal segments;
a processing module 440, configured to input the Mel-frequency cepstral coefficients into the preset single-key recognition model and output the typed content corresponding to each keyboard.
Optionally, the acquisition module 410 is further configured to:
acquire the sound signal emitted when the keyboard is struck, sent by a recording element of a terminal, the terminal including at least one recording element.
Optionally, the interception module 420 is further configured to:
calculate the energy value of a signal segment of the sound signal every 41.7 ms;
if the energy value of the first signal segment is greater than an energy threshold, intercept a signal segment spanning a first preset duration before and a second preset duration after the starting point of the first signal segment, as a second signal segment;
apply a voice activity detection method to the second signal segment to determine the keystroke signal segment.
Optionally, the interception module 420 is further configured to:
apply the voice activity detection method to the second signal segment to determine the starting point and end point of the keystroke action and extract a keystroke signal;
calculate the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal;
input the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal into a preset support vector machine to judge whether the keystroke signal contains only one keystroke operation;
if the keystroke signal contains only one keystroke operation, intercept a signal segment 41.7 ms long backward from the starting point as the keystroke signal segment;
if the keystroke signal contains two keystroke operations, intercept a signal segment 41.7 ms long backward from the starting point as a first keystroke signal segment;
calculate, by means of a regression neural network, the starting position at which the second keystroke operation begins, and intercept a signal segment 41.7 ms long backward from that starting position as a second keystroke signal segment;
take the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
Optionally, the determination module 430 is further configured to:
denoise the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
determine the Mel-frequency cepstral coefficients from the denoised signal segment.
Optionally, the processing module 440 is further configured to:
acquire the sound signal of key presses on each keyboard;
use a voice activity detection method to intercept, from the sound signal, keystroke signal training segments with a duration of 41.7 ms;
randomly acquire, from the sound signal, sound signal fragments of the same length as the training keystroke signal segments;
superimpose the sound signal fragments onto the training keystroke signal segments to determine noisy keystroke signal training segments;
determine a Mel-frequency cepstral coefficient training set from all the keystroke signal training segments and all the noisy keystroke signal training segments respectively;
use the Mel-frequency cepstral coefficient training set as input data and train to obtain the preset single-key recognition model.
The apparatus for recognizing mixed keystroke sounds from multiple keyboards provided in this embodiment can carry out the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in Fig. 5, it is a schematic structural diagram of an electronic device 300 suitable for implementing the embodiments of the present application.
As shown in Fig. 5, the electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data necessary for the operation of the device 300. The CPU 301, the ROM 302 and the RAM 303 are connected to one another through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse and the like; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, and a speaker; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 310 as needed, so that a computer program read from it can be installed into the storage section 308 as needed.
In particular, according to an embodiment of the present disclosure, the process described above with reference to Fig. 1 may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly contained on a machine-readable medium; the computer program contains program code for executing the above method for recognizing mixed keystroke sounds from multiple keyboards. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 309 and/or installed from the removable medium 311.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules involved in the embodiments described in the present application may be implemented by software or by hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not, in some cases, constitute a limitation on the units or modules themselves.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be the storage medium contained in the aforementioned apparatus in the above embodiments, or a storage medium that exists independently and is not assembled into the device. The storage medium stores one or more programs, which are used by one or more processors to execute the method for recognizing mixed keystroke sounds from multiple keyboards described in the present application.
Storage media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should be noted that the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is basically similar to the method embodiment, it is described relatively simply; for relevant parts, reference may be made to the description of the method embodiment.
Claims (9)
- 1. A method for recognizing mixed keystroke sounds from multiple keyboards, characterized in that the method comprises: acquiring a sound signal emitted when a keyboard is struck; performing keystroke signal interception on the sound signal to determine a keystroke signal segment; determining Mel-frequency cepstral coefficients from the keystroke signal segment; and inputting the Mel-frequency cepstral coefficients into a preset single-key recognition model, which outputs the typed content corresponding to each keyboard.
- 2. The method according to claim 1, characterized in that acquiring the sound signal emitted when the keyboard is struck comprises: acquiring the sound signal emitted when the keyboard is struck, sent by a recording element of a terminal, the terminal comprising at least one recording element.
- 3. The method according to claim 2, characterized in that performing keystroke signal interception on the sound signal to determine the keystroke signal segment comprises: calculating the energy value of a signal segment of the sound signal every 41.7 ms; if the energy value of the first signal segment is greater than an energy threshold, intercepting a signal segment spanning a first preset duration before and a second preset duration after the starting point of the first signal segment as a second signal segment; and applying a voice activity detection method to the second signal segment to determine the keystroke signal segment.
- 4. The method according to claim 3, characterized in that applying the voice activity detection method to the second signal segment to determine the keystroke signal segment comprises: applying the voice activity detection method to the second signal segment to determine the starting point and end point of the keystroke action and extract a keystroke signal; calculating the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal; inputting the total energy, the peak value and the signal after 5 wavelet transforms of the keystroke signal into a preset support vector machine to judge whether the keystroke signal contains only one keystroke operation; if the keystroke signal contains only one keystroke operation, intercepting a signal segment 41.7 ms long backward from the starting point as the keystroke signal segment; if the keystroke signal contains two keystroke operations, intercepting a signal segment 41.7 ms long backward from the starting point as a first keystroke signal segment; calculating, by a regression neural network, the starting position at which the second keystroke operation begins, and intercepting a signal segment 41.7 ms long backward from the starting position as a second keystroke signal segment; and taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
- 5. The method according to any one of claims 1-4, characterized in that determining the Mel-frequency cepstral coefficients from the keystroke signal segment comprises: denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment; and determining the Mel-frequency cepstral coefficients from the denoised signal segment.
- 6. The method according to any one of claims 1-4, characterized in that the preset single-key recognition model is constructed through the following steps: acquiring the sound signal of key presses on each keyboard; using a voice activity detection method to intercept, from the sound signal, keystroke signal training segments with a duration of 41.7 ms; randomly acquiring, from the sound signal, sound signal fragments of the same length as the training keystroke signal segments; superimposing the sound signal fragments onto the training keystroke signal segments to determine noisy keystroke signal training segments; determining a Mel-frequency cepstral coefficient training set from all the keystroke signal training segments and all the noisy keystroke signal training segments respectively; and using the Mel-frequency cepstral coefficient training set as input data and training to obtain the preset single-key recognition model.
- 7. An apparatus for recognizing mixed keystroke sounds from multiple keyboards, characterized in that the apparatus comprises: an acquisition module, configured to acquire the sound signal emitted when a keyboard is struck; an interception module, configured to perform keystroke signal interception on the sound signal to determine a keystroke signal segment; a determination module, configured to determine Mel-frequency cepstral coefficients from the keystroke signal segment; and a processing module, configured to input the Mel-frequency cepstral coefficients into a preset single-key recognition model and output the typed content corresponding to each keyboard.
- 8. An electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method for recognizing mixed keystroke sounds from multiple keyboards according to any one of claims 1-6.
- 9. A readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for recognizing mixed keystroke sounds from multiple keyboards according to any one of claims 1-6.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111628149.0A CN116415166A (zh) | 2021-12-28 | 2021-12-28 | 多键盘混合按键声音的识别方法、装置、设备及存储介质 |
CN202111628149.0 | 2021-12-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023124556A1 true WO2023124556A1 (zh) | 2023-07-06 |
Family
ID=86997523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/130829 WO2023124556A1 (zh) | 2021-12-28 | 2022-11-09 | 多键盘混合按键声音的识别方法、装置、设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116415166A (zh) |
WO (1) | WO2023124556A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117827011A (zh) * | 2024-03-04 | 2024-04-05 | 渴创技术(深圳)有限公司 | 基于用户行为预测的按键反馈方法、装置和存储介质 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170358306A1 (en) * | 2016-06-13 | 2017-12-14 | Alibaba Group Holding Limited | Neural network-based voiceprint information extraction method and apparatus |
WO2018006797A1 (zh) * | 2016-07-05 | 2018-01-11 | 深圳大学 | 利用声音信号检测键盘敲击内容的系统及方法 |
CN110111812A (zh) * | 2019-04-15 | 2019-08-09 | 深圳大学 | 一种键盘击键内容的自适应识别方法和系统 |
US20210074264A1 (en) * | 2017-10-23 | 2021-03-11 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition method, apparatus, and computer readable storage medium |
-
2021
- 2021-12-28 CN CN202111628149.0A patent/CN116415166A/zh active Pending
-
2022
- 2022-11-09 WO PCT/CN2022/130829 patent/WO2023124556A1/zh unknown
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170358306A1 (en) * | 2016-06-13 | 2017-12-14 | Alibaba Group Holding Limited | Neural network-based voiceprint information extraction method and apparatus |
WO2018006797A1 (zh) * | 2016-07-05 | 2018-01-11 | 深圳大学 | 利用声音信号检测键盘敲击内容的系统及方法 |
US20210074264A1 (en) * | 2017-10-23 | 2021-03-11 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition method, apparatus, and computer readable storage medium |
CN110111812A (zh) * | 2019-04-15 | 2019-08-09 | 深圳大学 | 一种键盘击键内容的自适应识别方法和系统 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117827011A (zh) * | 2024-03-04 | 2024-04-05 | 渴创技术(深圳)有限公司 | 基于用户行为预测的按键反馈方法、装置和存储介质 |
CN117827011B (zh) * | 2024-03-04 | 2024-05-07 | 渴创技术(深圳)有限公司 | 基于用户行为预测的按键反馈方法、装置和存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN116415166A (zh) | 2023-07-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ahmed et al. | Void: A fast and light voice liveness detection system | |
Chen et al. | Who is real bob? adversarial attacks on speaker recognition systems | |
Anand et al. | Spearphone: a lightweight speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers | |
Wang et al. | Secure your voice: An oral airflow-based continuous liveness detection for voice assistants | |
Zhang et al. | Voiceprint mimicry attack towards speaker verification system in smart home | |
Shi et al. | Face-Mic: inferring live speech and speaker identity via subtle facial dynamics captured by AR/VR motion sensors | |
Rathore et al. | SonicPrint: A generally adoptable and secure fingerprint biometrics in smart devices | |
US20200243067A1 (en) | Environment classifier for detection of laser-based audio injection attacks | |
Wang et al. | When the differences in frequency domain are compensated: Understanding and defeating modulated replay attacks on automatic speech recognition | |
Xie et al. | TeethPass: Dental occlusion-based user authentication via in-ear acoustic sensing | |
Anand et al. | Spearphone: A speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers | |
Ahmed et al. | Towards more robust keyword spotting for voice assistants | |
Huang et al. | Stop deceiving! an effective defense scheme against voice impersonation attacks on smart devices | |
WO2023124556A1 (zh) | 多键盘混合按键声音的识别方法、装置、设备及存储介质 | |
Wang et al. | Vsmask: Defending against voice synthesis attack via real-time predictive perturbation | |
Luo et al. | PhyAug: Physics-directed data augmentation for deep sensing model transfer in cyber-physical systems | |
Kim et al. | TapSnoop: Leveraging tap sounds to infer tapstrokes on touchscreen devices | |
Li et al. | Security and privacy problems in voice assistant applications: A survey | |
Jiang et al. | Securing liveness detection for voice authentication via pop noises | |
Khoria et al. | On significance of constant-Q transform for pop noise detection | |
Wang et al. | Low-effort VR Headset User Authentication Using Head-reverberated Sounds with Replay Resistance | |
Liu et al. | Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System | |
Tian et al. | Spoofing detection under noisy conditions: a preliminary investigation and an initial database | |
Nagaraja et al. | VoIPLoc: passive VoIP call provenance via acoustic side-channels | |
Cao et al. | LiveProbe: Exploring continuous voice liveness detection via phonemic energy response patterns |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22913819 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |