CN116415166A - Multi-keyboard mixed key sound identification method, device, equipment and storage medium - Google Patents

Multi-keyboard mixed key sound identification method, device, equipment and storage medium

Info

Publication number
CN116415166A
CN116415166A
Authority
CN
China
Prior art keywords
signal
key
signal segment
sound
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111628149.0A
Other languages
Chinese (zh)
Inventor
Wang Lu (王璐)
Zhao Jiayi (赵家怡)
Huang Yongzhi (黄勇志)
Wu Kaishun (伍楷舜)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202111628149.0A priority Critical patent/CN116415166A/en
Priority to PCT/CN2022/130829 priority patent/WO2023124556A1/en
Publication of CN116415166A publication Critical patent/CN116415166A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F 3/041 Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
    • G06F 3/043 Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means using propagating acoustic waves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Input From Keyboards Or The Like (AREA)

Abstract

The application provides a method, an apparatus, a device, and a storage medium for recognizing multi-keyboard mixed key sounds. The method comprises the following steps: acquiring a sound signal emitted when a keyboard is struck; intercepting the keystroke signal from the sound signal and determining a keystroke signal segment; determining mel frequency cepstrum coefficients according to the keystroke signal segment; and inputting the mel frequency cepstrum coefficients into a preset single-key recognition model and outputting the keystroke content corresponding to each keyboard. The scheme is applicable to the recognition of mixed key sounds from multiple keyboards and achieves high recognition accuracy.

Description

Multi-keyboard mixed key sound identification method, device, equipment and storage medium
Technical Field
The invention belongs to the technical field of signal recognition, and particularly relates to a method, a device, equipment and a storage medium for recognizing multi-keyboard mixed key sounds.
Background
Today, most major office scenarios involve operating a computer in a room and entering content with a keyboard and a mouse. The content entered by a worker sometimes includes information related to personal, customer, or even company privacy, such as personal passwords, customer information, and company bidding contracts. Once such information is exploited by lawbreakers, the related personnel can suffer huge losses; for example, the 2018 Cost of Data Breach Study indicates that the average loss to an enterprise in the event of an information leak is $3.86 million. Therefore, the security of keyed-in information is critical.
In general, an external eavesdropper uses an invasive eavesdropping approach to eavesdrop on keyboard input, implanting a malicious program on a computer to obtain the victim's keyed-in information. With the development of network security technologies such as cloud security, external personnel can be effectively prevented from eavesdropping by security measures such as firewalls. However, eavesdropping by internal personnel still poses a great threat to the security of entered information. An insider can briefly use the victim's computer without a password when the victim leaves the computer (for example, to go to the toilet) and thereby carry out an attack. For such eavesdropping scenarios, researchers have proposed continuous user authentication as a defense: a model that distinguishes the legitimate user from illegitimate users is trained from user input information recorded on the computer, such as keystroke flight time, and the model runs continuously while the computer is in use to authenticate the user; once the current user is judged to be illegitimate, a corresponding action (such as locking the screen) is taken. This approach effectively prevents an eavesdropper from directly using the victim's computer.
With the development of signal detection systems, keyboard keystroke recognition has become a focus of attention. The problem of keyboard keystroke recognition becomes one of the key problems for protecting the security of office information.
Existing keyboard keystroke recognition falls into two main categories. The first implants a malicious program on the computer to recognize keystrokes; leakage of input content through this route can currently be prevented by security technologies such as firewalls. The second recognizes keyboard keystroke content from signals such as sound, Wi-Fi, and light; this form of eavesdropping is variable and is often difficult to defend against. The second line of research can be further divided into the following categories: (1) recognizing keyboard keystroke content with CSI techniques based on Wi-Fi signals, such as WiFinger; (2) recognizing keyboard keystroke content from video data based on optical signals, such as Blind Recognition of Touched Keys on Mobile Devices; and (3) recognizing keyboard keystroke content based on sound signals, such as Accurate Combined Keystrokes Detection Using Acoustic Signals, which recognizes key combinations (such as Ctrl+C) by capturing the sound signal.
Existing keyboard keystroke recognition techniques recognize a single key or a specific key combination (such as Ctrl+C) on a single keyboard. In an office scene, however, several keyboards are often struck at the same time, and the signal received by the recording device is often a mixed sound signal from multiple keyboards. Existing key sound recognition techniques therefore lack universality.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a method, an apparatus, a device, and a storage medium for identifying a multi-keyboard mixed key sound.
In order to solve the technical problems, the embodiments of the present application are implemented in the following manner:
in a first aspect, the present application provides a method for recognizing multi-keyboard mixed key sounds, the method comprising:
acquiring a sound signal emitted when a keyboard is struck;
intercepting the keystroke signal from the sound signal and determining a keystroke signal segment;
determining mel frequency cepstrum coefficients according to the keystroke signal segment;
inputting the mel frequency cepstrum coefficients into a preset single-key recognition model and outputting the keystroke content corresponding to each keyboard.
In one embodiment, acquiring a sound signal emitted when a keyboard is struck includes:
acquiring, by a recording element of a terminal, the sound signal emitted when the keyboard is struck, wherein the terminal comprises at least one recording element.
In one embodiment, performing keystroke signal interception on the sound signal and determining a keystroke signal segment includes:
calculating the energy value of the signal segment in the sound signal every 41.7 ms;
if the energy value of a first signal segment is greater than an energy threshold, intercepting, as a second signal segment, the signal from a first preset duration before the starting point of the first signal segment to a second preset duration after that starting point;
applying a voice activity detection method to the second signal segment to determine the keystroke signal segment.
In one embodiment, applying a voice activity detection method to the second signal segment to determine the keystroke signal segment comprises:
applying the voice activity detection method to the second signal segment to determine the starting point and the ending point of the keystroke action and extract the keystroke signal;
calculating the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal;
inputting the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal into a preset support vector machine, and judging whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length backward from the starting point as a first keystroke signal segment;
calculating the starting position at which the second keystroke operation begins through a regression neural network, and intercepting a signal segment of 41.7 ms in length backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
In one embodiment, determining the mel frequency cepstrum coefficients from the keystroke signal segment comprises:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
In one embodiment, the preset single-key recognition model is constructed by:
acquiring the sound signal of each keyboard when its keys are struck;
intercepting, from the sound signal, keystroke signal training segments with a duration of 41.7 ms using a voice activity detection method;
randomly acquiring, from the sound signal, sound signal segments of the same length as the keystroke signal training segments;
superimposing the sound signal segments on the keystroke signal training segments to determine noisy keystroke signal training segments;
determining a mel frequency cepstrum coefficient training set from all keystroke signal training segments and all noisy keystroke signal training segments respectively;
and training, with the mel frequency cepstrum coefficient training set as input data, to obtain the preset single-key recognition model.
In a second aspect, the present application provides a device for recognizing multi-keyboard mixed key sounds, the device comprising:
an acquisition module, configured to acquire a sound signal emitted when a keyboard is struck;
an interception module, configured to perform keystroke signal interception on the sound signal and determine a keystroke signal segment;
a determination module, configured to determine mel frequency cepstrum coefficients according to the keystroke signal segment;
and a processing module, configured to input the mel frequency cepstrum coefficients into a preset single-key recognition model and output the keystroke content corresponding to each keyboard.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for recognizing multi-keypad mixed key sounds as in the first aspect when the processor executes the program.
In a fourth aspect, the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for identifying multi-keypad mixed key sounds as in the first aspect.
The technical solutions provided by the embodiments of the present specification achieve at least the following effects:
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application is applicable to recognizing the keystroke content of multiple keyboards.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application requires only the recording elements on a terminal, needs no additional equipment, is low in cost, and is easy to deploy.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application proposes an attention-mechanism-based BLSTM model and exploits the fact that the signals received by two recording elements in the same time period are correlated, raising the key recognition accuracy of the BLSTM to 96.41%.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a method for identifying multi-keyboard mixed key sounds provided in the present application;
FIG. 2 is a layout of an experimental platform provided herein;
fig. 3 is a schematic structural diagram of a preset single bond recognition model provided in the present application;
fig. 4 is a schematic structural diagram of a multi-keyboard hybrid key sound recognition device provided in the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments of the present disclosure without departing from the scope or spirit of the disclosure. Other embodiments will be apparent to the skilled person from the description of the present application. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are intended to be inclusive and mean an inclusion, but not limited to.
The "parts" in the present application are all parts by mass unless otherwise specified.
The invention is described in further detail below with reference to the drawings and examples.
Referring to fig. 1, a flow chart of a method for identifying multi-keyboard mixed key sounds according to an embodiment of the present application is shown.
As shown in fig. 1, the method for identifying multi-key-pad mixed key sounds may include:
s110, acquiring a sound signal generated when the keyboard is knocked.
Specifically, a recording element of the terminal collects the sound signal emitted when the keyboard is struck and uploads the collected sound signal to the cloud. The terminal may be any electronic device with a recording element, such as a mobile phone, a tablet computer, or a wearable device. The recording element may be a microphone. The terminal may comprise at least one microphone; for example, a mobile phone may comprise two or more microphones.
As shown in fig. 2, when the recording elements of a mobile phone are used to collect the sound signal generated when a keyboard is struck, the phone is placed between the two keyboards, and two or more recording elements on the phone collect the keystroke sounds and upload them to the cloud.
S120, performing keystroke signal interception on the sound signal and determining a keystroke signal segment.
Specifically, in the cloud, keystroke signal interception is performed on the collected sound signal, and the intercepted keystroke signal is then cut into signal segments to obtain the keystroke signal segments. It will be appreciated that if the keystroke signal includes only one keystroke, one keystroke signal segment is obtained, and if it includes two keystrokes, two keystroke signal segments are obtained.
In one embodiment, S120, performing keystroke signal interception on the sound signal and determining a keystroke signal segment, may include:
calculating the energy value of the signal segment in the sound signal every 41.7 ms;
if the energy value of a first signal segment is greater than an energy threshold, intercepting, as a second signal segment, the signal from a first preset duration before the starting point of the first signal segment to a second preset duration after that starting point;
applying a voice activity detection method to the second signal segment to determine the keystroke signal segment.
Wherein applying the voice activity detection method to the second signal segment to determine the keystroke signal segment may comprise:
applying the voice activity detection method to the second signal segment to determine the starting point and the ending point of the keystroke action and extract the keystroke signal;
calculating the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal;
inputting the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal into a preset support vector machine, and judging whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length backward from the starting point as a first keystroke signal segment;
calculating the starting position at which the second keystroke operation begins through a regression neural network, and intercepting a signal segment of 41.7 ms in length backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
Specifically, the energy threshold may be set according to actual requirements. The first preset duration and the second preset duration can be set according to actual requirements, for example, the first preset duration and the second preset duration are both 1s.
The energy value A of a signal segment is:
A = x_1^2 + x_2^2 + ... + x_n^2
where n is the length of the signal segment.
For the received sound signal, the energy value of the signal segment is calculated every 41.7 ms. If the energy value exceeds the threshold, the 1 s before and the 1 s after the starting point of that signal segment (i.e. the first signal segment), 2 s in total, are intercepted as the signal segment in which a keystroke action may exist (i.e. the second signal segment).
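As an illustration of this pre-screening step, the following is a minimal Python sketch (not the patent's reference implementation); the 41.7 ms window comes from the description above, while the 1 s margins and the energy threshold value are assumptions for the example.

```python
import numpy as np

def candidate_segments(x, fs, win_ms=41.7, energy_threshold=1.0):
    """Return (start, end) sample indices of ~2 s candidate segments around energetic windows."""
    win = int(round(fs * win_ms / 1000.0))   # 41.7 ms window length in samples
    margin = int(fs * 1.0)                   # first/second preset duration, assumed 1 s each
    segments = []
    for start in range(0, len(x) - win, win):
        energy = np.sum(x[start:start + win] ** 2)    # energy value of the 41.7 ms segment
        if energy > energy_threshold:                 # candidate keystroke activity
            lo = max(0, start - margin)               # 1 s before the segment's starting point
            hi = min(len(x), start + margin)          # 1 s after the segment's starting point
            segments.append((lo, hi))
    return segments
```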
For the intercepted signal segments, a voice activity detection (Voice Activity Detection, VAD) method is used for finding a starting point stp and an ending point of the keystroke action, and a keystroke signal is extracted.
For the keystroke signal extracted by the VAD, the total energy value, the kurtosis, and the signal obtained after five wavelet transforms are calculated, and a trained SVM (support vector machine) judges whether the keystroke signal includes only one keystroke operation. If the keystroke signal includes only one keystroke operation, a signal segment of 41.7 ms in length is intercepted backward from the starting point stp obtained by the VAD as the keystroke signal segment, and the mel frequency cepstrum coefficients are determined from it in step S130. If the keystroke signal includes two keystroke operations, the first keystroke signal segment is the 41.7 ms segment intercepted backward from the starting point stp; the position inv at which the second keystroke operation begins (i.e. the moment at which the two keystroke operations start to overlap) is then calculated through the regression neural network, the second keystroke signal segment is the 41.7 ms segment intercepted backward from inv, and the mel frequency cepstrum coefficients are determined in step S130 from the first keystroke signal segment and the second keystroke signal segment respectively.
The specific operation of obtaining the regression neural network model used to calculate the overlap starting position is as follows:
The application adopts an LSTM-based regression neural network model to calculate the overlap starting position. The network structure comprises an input layer, an LSTM layer, a Flatten layer, and a fully connected (dense) layer. The model randomly superimposes single-key signals from the set of keystroke signal segments (the overlap starting position, signal source, and label are random) to generate overlapped signals containing two keystroke operations, while recording the overlap starting position as the label.
Input layer: receives an intercepted keystroke signal segment as the input of the model.
LSTM layer: encodes the input data of the model so that the output data of the LSTM contains timing information.
Flatten layer: reshapes the output data of the LSTM layer into a one-dimensional vector to facilitate the computation of the fully connected layer.
Fully connected layer: multiplies its input data by the weights to obtain the estimated overlap starting position. This layer uses no activation function.
Loss function: to minimize the error between the predicted value and the true value, the loss function is set to
L(Y, f(X)) = max(|Y - f(X)|).
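A minimal Keras sketch of such an LSTM-based regression network is given below; the layer sizes and the assumed segment length (about 2000 samples, i.e. roughly 41.7 ms at 48 kHz) are illustrative assumptions, not values fixed by the patent.

```python
import tensorflow as tf

def build_overlap_regressor(seg_len=2000):
    inputs = tf.keras.Input(shape=(seg_len, 1))                    # intercepted keystroke segment
    x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)    # LSTM layer: encode timing information
    x = tf.keras.layers.Flatten()(x)                               # Flatten layer: to a 1-D vector
    outputs = tf.keras.layers.Dense(1, activation=None)(x)         # dense layer: overlap start, no activation

    def max_abs_error(y_true, y_pred):                             # L(Y, f(X)) = max(|Y - f(X)|)
        return tf.reduce_max(tf.abs(y_true - y_pred))

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=max_abs_error)
    return model
```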
It will be appreciated that, for a keystroke signal segment, the present application determines the source of the signal (i.e. from which keyboard the keystroke signal originated) by calculating the energy difference between the signal segments received by the different recording elements.
The specific operation of determining the signal source is as follows:
(1) The key stroke signal segments received by the two recording elements are aligned in time.
(2) After alignment, the total energy value of the signal segments of the two recording elements is calculated respectively, and the difference value is obtained.
(3) Since the paths from the same sound source to the two recording elements differ in length, the keystroke signal attenuates to different degrees: the longer the path, the greater the attenuation, i.e. the lower the total energy of the signal received by that recording element. Because the two keyboards lie on opposite sides of the two recording elements, the total energy difference corresponding to one keyboard is positive and that corresponding to the other keyboard is negative, so the source of the keystroke signal can be determined.
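This decision can be illustrated by the following minimal sketch; which sign corresponds to which keyboard depends on the physical placement, so the returned labels are illustrative assumptions.

```python
import numpy as np

def infer_keyboard(seg_mic_top, seg_mic_bottom):
    """Infer the source keyboard from the total-energy difference of the time-aligned segments."""
    e_top = np.sum(np.abs(seg_mic_top))        # total energy at the top recording element
    e_bottom = np.sum(np.abs(seg_mic_bottom))  # total energy at the bottom recording element
    return "keyboard A" if e_top - e_bottom > 0 else "keyboard B"
```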
S130, determining the mel frequency cepstrum coefficients according to the keystroke signal segment, may include:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
The keystroke signal segment is denoised with a low-pass filter to obtain a denoised signal segment, and the mel frequency cepstrum coefficients are calculated from the denoised signal segment as the input data of the preset single-key recognition model.
In the field of audio processing, the mel frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the nonlinear mel scale of sound frequency, and mel frequency cepstrum coefficients are the coefficients that make up the mel frequency cepstrum. They take human auditory characteristics into account: the linear spectrum is mapped onto the mel nonlinear spectrum based on auditory perception and then converted to the cepstrum.
The specific operation of calculating the mel frequency cepstrum coefficient is as follows:
(1) Pre-emphasis, framing, and windowing are performed on the denoised signal segment.
(2) For each frame, the corresponding spectrum is obtained by FFT (fast Fourier transform).
(3) The obtained spectrum is passed through a mel filter bank to obtain the mel spectrum.
(4) Cepstral analysis operations such as taking the logarithm and applying the inverse transform are performed on the mel spectrum to obtain the mel frequency cepstrum coefficients.
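For illustration, the same pipeline can be approximated with librosa's built-in routines; this is a sketch rather than the patent's implementation, and n_mfcc is an assumed value.

```python
import librosa

def keystroke_mfcc(segment, fs, n_mfcc=13):
    emphasized = librosa.effects.preemphasis(segment)   # step (1): pre-emphasis
    # librosa.feature.mfcc internally frames and windows the signal, applies the FFT,
    # the mel filter bank, the logarithm and the DCT (steps (2)-(4) above).
    return librosa.feature.mfcc(y=emphasized, sr=fs, n_mfcc=n_mfcc)
```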
S140, inputting the mel frequency cepstrum coefficients into the preset single-key recognition model and outputting the keystroke content corresponding to each keyboard.
Specifically, the preset single-key recognition model may be trained in advance. The preset single-key recognition model is an attention-mechanism-based BLSTM neural network model (bidirectional long short-term memory recurrent neural network); its network structure is shown in fig. 3 and includes two input layers, two BLSTM layers, a concatenate layer, an attention layer, and a dense layer.
Input layers: because the signals received by the two recording elements in the same time period are correlated, the neural network uses two input layers to receive the mel frequency cepstrum coefficients corresponding to the two recording elements as input data.
BLSTM layers: a BLSTM consists of a forward LSTM (long short-term memory recurrent neural network) and a backward LSTM and is often used to model context information in natural language processing tasks; data processed by a BLSTM contains both forward and backward information. The present application therefore uses two BLSTM layers to receive the output data of the two input layers respectively, encoding the input data so that the output sequence of each BLSTM contains timing information.
Concatenate layer: connects the output sequences of the two BLSTM layers in series.
Attention layer: processes the concatenated output of the two BLSTMs so that the attention output data contains information about the association between the signals received by the two recording elements.
Dense layer: a fully connected layer that processes the data output by the attention layer to obtain the key recognition result. The fully connected layer uses the sigmoid activation function, and its output dimension is set to the number of labels.
In one embodiment, the preset single-key recognition model is constructed by:
acquiring the sound signal of each keyboard when its keys are struck;
intercepting, from the sound signal, keystroke signal training segments with a duration of 41.7 ms using a voice activity detection method;
randomly acquiring, from the sound signal, sound signal segments of the same length as the keystroke signal training segments;
superimposing the sound signal segments on the keystroke signal training segments to determine noisy keystroke signal training segments;
determining a mel frequency cepstrum coefficient training set from all keystroke signal training segments and all noisy keystroke signal training segments respectively;
and training, with the mel frequency cepstrum coefficient training set as input data, to obtain the preset single-key recognition model.
Specifically, a commercial mobile phone is used to collect the keystroke sound signal for each keyboard; the user is asked to strike a key every 2 seconds to prevent keystroke signals from overlapping. For the collected sound signal, every two seconds of sound is taken as one group of signals to be processed. For each two-second group, a keystroke signal segment with a duration of about 41.7 ms is intercepted from the signal using a VAD (voice activity detection) algorithm. For each intercepted keystroke signal segment, a randomly selected sound signal segment of equal length from the collected recording is superimposed on it to produce a noisy keystroke signal segment, thereby increasing the amount of training data. For the resulting set of keystroke signal segments, a support vector machine model is trained on the total energy value, the kurtosis, and the signal obtained after five wavelet transforms, and is used to judge whether a keystroke signal segment contains only one keystroke operation. Mel frequency cepstrum coefficients are calculated for the intercepted keystroke signal segments and the noisy keystroke signal segments to generate the training set, which is then used as input data to train and obtain the single-key recognition model.
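The noise-augmentation step can be sketched as follows (an assumed helper, not the patent's code): a randomly chosen background segment of equal length is linearly superimposed on an intercepted keystroke segment, and the result keeps the original key label.

```python
import numpy as np

def add_noisy_copy(keystroke_seg, recording, rng=None):
    rng = rng or np.random.default_rng()
    n = len(keystroke_seg)
    start = rng.integers(0, len(recording) - n)     # random equal-length background segment
    background = recording[start:start + n]
    return keystroke_seg + background               # linear superposition; label is unchanged
```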
In order to cut the keystroke signal out of the original signal and avoid classifying, in the background, data that contains no keystroke signal, a common voice activity detection algorithm, the double-threshold endpoint detection method, is used to identify and eliminate long silent periods.
The specific operation of intercepting a keystroke signal segment using the VAD algorithm is as follows:
(1) The original signal is normalized using x'_i = x_i / max(x_1, ..., x_L), i = 1, ..., L, where L is the length of the original signal.
The normalized signal is then updated with x_i = α · x_{i-1} + β, i = 2, ..., L, thereby introducing timing information.
(2) For the signal after timing information has been introduced, the total energy of a window of length FrameLen is calculated every FrameInc samples, yielding an array amp, which is the set of per-frame total energies. Specifically, with FrameInc as the step size, a signal of length FrameLen is extracted as frame i, and the sum of absolute values of the frame is calculated as the total energy of the frame, i.e. amp[i].
(3) A higher short-term energy threshold MH and a lower short-term energy threshold ML are calculated from the maximum value of amp, where MH = min(max(amp)/4, 10) and ML = min(max(amp)/8, 2). If amp[i] > ML, the frame may be in the sounding phase (the frame is marked status1); when more than 15 frames are status1, the signal is considered to have entered the sounding phase.
(4) The short-time zero-crossing rate (i.e. the number of times the signal crosses the horizontal axis per unit time) is calculated for each frame, yielding an array zcr. Specifically, zcr[i] is the number of times the signal in frame i crosses the horizontal axis divided by the frame length FrameLen.
(5) The array amp is traversed; if amp[i] exceeds the threshold MH, that frame is the first reference starting point stp1.
(6) Traversing backward from stp1, if amp[i] exceeds the threshold MH or the short-time zero-crossing rate zcr[i] exceeds a threshold Zs, the keystroke sound is regarded as continuing and the traversal continues; otherwise the keystroke sound is regarded as ended. The threshold Zs may be set according to actual requirements.
(7) The keystroke signal segment is intercepted according to the starting point and ending point found.
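A minimal sketch of steps (1)-(7) is given below; FrameLen, FrameInc, and the zero-crossing threshold Zs are assumed values, and only the MH branch of the double-threshold logic is shown.

```python
import numpy as np

def vad_endpoints(x, frame_len=512, frame_inc=128, zs=10):
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, frame_inc)]
    amp = np.array([np.sum(np.abs(f)) for f in frames])                        # per-frame total energy
    zcr = np.array([np.sum(np.abs(np.diff(np.sign(f))) > 0) for f in frames])  # approx. zero crossings

    mh = min(np.max(amp) / 4, 10)                    # higher short-term energy threshold MH
    # ML = min(np.max(amp) / 8, 2) would gate the tentative "sounding" state (status1)

    stp1 = next((i for i, a in enumerate(amp) if a > mh), None)  # first reference starting point
    if stp1 is None:
        return None
    end = stp1
    while end + 1 < len(amp) and (amp[end + 1] > mh or zcr[end + 1] > zs):
        end += 1                                     # keystroke sound regarded as continuing
    return stp1 * frame_inc, end * frame_inc + frame_len   # endpoints as sample indices
```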
The specific operation of obtaining the support vector machine model used to judge whether a keystroke signal segment contains only one keystroke operation is as follows:
(1) Superimposed signals containing two keystrokes are generated by randomly superimposing single-key signals from the set of keystroke signal segments (the overlap starting position, the signal sources, and the labels are random). Here, "random" means that the overlap starting position (a numerical value) is generated randomly; two keystroke signals are selected randomly (which keyboard and which key each signal comes from is random); and the superposition is a linear superposition of the selected keystroke signals at the generated overlap starting position.
(2) The single-key signals of the training set and the generated overlapped signals are labeled to produce the raw data of the support vector machine training set.
(3) A double-key signal differs from a single-key signal in the following three respects: a. the double-key signal has three or more peaks in the time domain; b. the total energy of the double-key signal is higher than that of the original single-key signal; c. the latter half of the keystroke signal contains a more energetic hit peak. Therefore, the total energy value and the kurtosis of the received keystroke signal segment are extracted as judgment features for distinguishing whether the segment contains a double-key signal. Meanwhile, to describe the difference in the number of peaks between double-key and single-key signals while reducing the amount of training data, the signal obtained after five wavelet transforms is also used as a judgment feature. The method therefore calculates the total energy value, the kurtosis, and the signal obtained after five wavelet transforms of the raw training data as the input features of the support vector machine, generating a training set for judging whether a segment contains only a single keystroke operation.
(4) The SVM model is trained and obtained from this training set.
Calculation formula of the total energy value:
E = |x_1| + |x_2| + ... + |x_n|
where n is the length of the signal segment.
Calculation formula of the kurtosis:
Kurt = [ (1/n) * Σ (x_i - x̄)^4 ] / [ (1/n) * Σ (x_i - x̄)^2 ]^2, i = 1, ..., n
where x̄ is the mean value of the signal segment.
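A minimal scikit-learn sketch of this single-key vs. double-key classifier is shown below; the wavelet family ('db4'), the choice of the level-5 approximation coefficients as the "five wavelet transforms" feature, and the SVM parameters are assumptions rather than the patent's exact settings.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis
from sklearn.svm import SVC

def segment_features(seg):
    total_energy = np.sum(np.abs(seg))                 # total energy value (sum of absolute values)
    kurt = kurtosis(seg)                               # kurtosis of the segment
    approx = pywt.wavedec(seg, "db4", level=5)[0]      # coefficients after a 5-level wavelet transform
    return np.concatenate(([total_energy, kurt], approx))

def train_double_key_svm(segments, labels):
    """labels: 0 = contains one keystroke operation, 1 = contains two keystroke operations."""
    X = np.array([segment_features(s) for s in segments])
    clf = SVC(kernel="rbf")
    clf.fit(X, np.asarray(labels))
    return clf
```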
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application is applicable to recognizing the keystroke content of multiple keyboards.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application requires only the recording elements on a terminal, needs no additional equipment, is low in cost, and is easy to deploy.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application proposes an attention-mechanism-based BLSTM model and exploits the fact that the signals received by two recording elements in the same time period are correlated, raising the key recognition accuracy of the BLSTM to 96.41%.
Experimental verification
Experimental environment: Experiments were performed in a conference room and in a dormitory. The conference room environment is quiet, and noise comes mainly from distant passing vehicles, the air conditioner, and reflections of the key sounds. There are many objects in the conference room and the environment is complex. The dormitory environment is noisy, with various kinds of interference such as human voices, keystroke sounds of non-target keyboards, and sounds from a washing machine, which challenges the extraction of keystroke signal segments. The dormitory also contains more objects, the environment is more complex, and the reflections of the key sounds are more complex. To avoid the influence of the desktop material beneath the keyboard and of desktop vibration when keys are struck, the keyboard and the mobile phone were placed on a mouse pad and fixed to it, preventing the keyboard's position from shifting slightly during typing.
A keyboard: experiments were mainly performed on a mechanical keyboard. The mechanical keyboard is of the type iKBC typeman W200 and is not used before collecting data, there is no key wear. The keystroke sound of the mechanical keyboard is clear, the key position is stable, the duration of the complete single key signal is about 125ms, and the hit peak duration is about 42ms.
Mobile phone: software is deployed on the mobile phone platform of Hua P20 and Hongmi K30 respectively, and key sound collection, data transmission and eavesdropping text display are carried out. The Hua P20 has 2 microphones which are respectively positioned at the top and the bottom of the mobile phone, and an Android 8.1 system is adopted to provide a sampling rate of 48kHz at most. The red rice K30 is provided with 3 microphones which are respectively positioned at the top, the bottom and the middle of four cameras of the mobile phone, and the Android 10.0 system is adopted to provide a sampling rate of 96kHz at the highest. Software deployed on the red-rice K30 handset platform can only invoke the two microphones located at the top and bottom. Therefore, the data collected on the two mobile phone platforms are two-channel data, the sampling rate of the data collected by P20 is 48kHz, and the sampling rate of the data collected by red rice is 96kHz.
Knocking speed: the tester is required to strike the key every 2 seconds to avoid overlapping signals in the signals received by the microphone.
Data set: the tester was asked to tap a total of 26 keys a through Z, each 60 times. To exclude the possibility of the application regarding stable characteristics of the environment over time (e.g. voice of a person speaking, voice of a song played outdoors) as characteristics of key classification, the tester is required to divide 60 taps of each key into 3 completions, collect 20 sets of audio signals of key taps each time, and have a time interval of at least 4 hours each time.
Single-key recognition effect
The recognition accuracy over the 26 keys of a single keyboard reaches up to 96.41%.
Double-key recognition effect
The key recognition accuracy on the mixed signal of two keyboards reaches up to 67%.
Overall simulation experiment: the single-key signals of the two keyboards are linearly superimposed, with the superposition starting position being a randomly generated value inv. The linearly superimposed signal is used to simulate the multi-keyboard mixed signal. The overlap starting point, signal source, and label of the mixed signal are all randomly selected.
Effect of signal source judgment
Precondition: the overlap starting position is known.
Single-key judgment accuracy: 99.87%
Double-key judgment accuracy: 94.37%
Double-key recognition effect
Precondition: the overlap starting position and the signal source are known.
Recognition accuracy of the first key: 83.25%;
recognition accuracy of the second key: 74.84%.
Referring to fig. 4, a schematic structural diagram of a device for recognizing multi-keyboard mixed key sounds according to an embodiment of the present application is shown.
As shown in fig. 4, the device 400 for recognizing multi-keyboard mixed key sounds may include:
an acquisition module 410, configured to acquire a sound signal emitted when a keyboard is struck;
an interception module 420, configured to perform keystroke signal interception on the sound signal and determine a keystroke signal segment;
a determination module 430, configured to determine mel frequency cepstrum coefficients according to the keystroke signal segment;
and a processing module 440, configured to input the mel frequency cepstrum coefficients into a preset single-key recognition model and output the keystroke content corresponding to each keyboard.
Optionally, the obtaining module 410 is further configured to:
the method comprises the steps of acquiring a sound signal sent by a recording element of a terminal when a keyboard is knocked, wherein the terminal comprises at least one recording element.
Optionally, the interception module 420 is further configured to:
calculating the energy value of the signal segment in the sound signal every 41.7 ms;
if the energy value of the first signal segment is larger than the energy threshold value, intercepting the signal segment with a first preset duration before the starting point of the first signal segment and a second preset duration after the starting point of the first signal segment as a second signal segment;
The second signal segment adopts a voice activity detection method to determine the keystroke signal segment.
Optionally, the interception module 420 is further configured to:
applying the voice activity detection method to the second signal segment to determine the starting point and the ending point of the keystroke action and extract the keystroke signal;
calculating the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal;
inputting the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal into a preset support vector machine, and judging whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length backward from the starting point as a first keystroke signal segment;
calculating the starting position at which the second keystroke operation begins through a regression neural network, and intercepting a signal segment of 41.7 ms in length backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
Optionally, the determining module 430 is further configured to:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
Optionally, the processing module 440 is further configured to:
acquire the sound signal of each keyboard when its keys are struck;
intercept, from the sound signal, keystroke signal training segments with a duration of 41.7 ms using a voice activity detection method;
randomly acquire, from the sound signal, sound signal segments of the same length as the keystroke signal training segments;
superimpose the sound signal segments on the keystroke signal training segments to determine noisy keystroke signal training segments;
determine a mel frequency cepstrum coefficient training set from all keystroke signal training segments and all noisy keystroke signal training segments respectively;
and train, with the mel frequency cepstrum coefficient training set as input data, to obtain the preset single-key recognition model.
The above method embodiments can be carried out by the device for recognizing multi-keyboard mixed key sounds; the implementation principles and technical effects are similar and are not repeated here.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, showing an electronic device 300 suitable for implementing embodiments of the present application.
As shown in fig. 5, the electronic device 300 includes a Central Processing Unit (CPU) 301 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 are also stored. The CPU 301, ROM 302, and RAM 303 are connected to each other through a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 308 including a hard disk or the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 310 as needed, so that a computer program read therefrom is installed into the storage section 308 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to fig. 1 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the multi-keyboard mixed key sound recognition method described above. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 309, and/or installed from the removable medium 311.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not in some way constitute a limitation of the unit or module itself.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a notebook computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be a storage medium contained in the foregoing apparatus in the foregoing embodiment; or may be a storage medium that exists alone and is not incorporated into the device. The storage medium stores one or more programs for use by one or more processors to perform the multi-keypad mixed key sound recognition method described herein.
Storage media, including both permanent and non-permanent, removable and non-removable media, may be implemented in any method or technology for storage of information. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises an element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (9)

1. A method for recognizing multi-keyboard mixed key sounds, the method comprising:
acquiring a sound signal emitted when a keyboard is struck;
intercepting the keystroke signal from the sound signal to determine a keystroke signal segment;
determining mel frequency cepstrum coefficients according to the keystroke signal segment;
and inputting the mel frequency cepstrum coefficients into a preset single-key recognition model, and outputting the keystroke content corresponding to each keyboard.
2. The method of claim 1, wherein acquiring the sound signal emitted when a keyboard is struck comprises:
the method comprises the steps of acquiring a sound signal sent by a recording element of a terminal when a keyboard is knocked, wherein the terminal comprises at least one recording element.
3. The method of claim 2, wherein the performing keystroke signal interception on the sound signal to determine a keystroke signal segment comprises:
calculating an energy value of a signal segment of the sound signal every 41.7 ms;
if the energy value of a first signal segment is greater than an energy threshold, intercepting, as a second signal segment, the signal segment spanning a first preset duration before the starting point of the first signal segment and a second preset duration after the starting point of the first signal segment;
and determining the keystroke signal segment from the second signal segment by using a voice activity detection method.
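For illustration only, the sketch below approximates the segmentation recited in claim 3: the 41.7 ms frame length comes from the claim, while the energy threshold and the first and second preset durations (10 ms before and 90 ms after the frame start here) are placeholder assumptions.

```python
import numpy as np


def intercept_candidate(sound, sr, energy_threshold, pre_ms=10.0, post_ms=90.0):
    """Scan the recording in 41.7 ms steps and cut out a candidate keystroke window."""
    sound = np.asarray(sound, dtype=np.float64)
    frame_len = int(round(0.0417 * sr))        # one 41.7 ms signal segment
    pre = int(pre_ms * sr / 1000)              # first preset duration (assumed)
    post = int(post_ms * sr / 1000)            # second preset duration (assumed)
    for start in range(0, len(sound) - frame_len + 1, frame_len):
        energy = float(np.sum(sound[start:start + frame_len] ** 2))
        if energy > energy_threshold:
            # Second signal segment: pre_ms before and post_ms after the segment start.
            return sound[max(0, start - pre):start + post]
    return None
```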
4. The method according to claim 3, wherein the determining the keystroke signal segment from the second signal segment by using the voice activity detection method comprises:
determining a starting point and an ending point of a keystroke action in the second signal segment by using the voice activity detection method, and extracting a keystroke signal;
calculating a total energy, a peak value, and a 5-level wavelet-transformed signal of the keystroke signal;
inputting the total energy, the peak value, and the 5-level wavelet-transformed signal of the keystroke signal into a preset support vector machine, and determining whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length extending backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length extending backward from the starting point as a first keystroke signal segment;
calculating a starting position of the second keystroke operation by using a regression neural network, and intercepting a signal segment of 41.7 ms in length extending backward from the starting position as a second keystroke signal segment;
and taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
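An illustrative sketch of the claim-4 decision step follows; it is not part of the claims. The wavelet family (`db4`), the reduction of the 5-level wavelet coefficients to per-level energies, and the objects `svm` (a pre-fitted support vector machine) and `second_start_regressor` (standing in for the regression neural network) are assumptions; the claim does not fix these details.

```python
import numpy as np
import pywt


def split_keystrokes(keystroke, sr, start, svm, second_start_regressor):
    """Decide whether the extracted keystroke signal holds one or two keystrokes
    and cut one 41.7 ms segment per keystroke."""
    keystroke = np.asarray(keystroke, dtype=np.float64)
    frame_len = int(round(0.0417 * sr))
    # Total energy, peak value, and per-level energies of a 5-level wavelet
    # transform (summarising the wavelet-transformed signal this way is an assumption).
    coeffs = pywt.wavedec(keystroke, 'db4', level=5)
    feats = np.array([[np.sum(keystroke ** 2),
                       np.max(np.abs(keystroke)),
                       *[float(np.sum(c ** 2)) for c in coeffs]]])
    if svm.predict(feats)[0] == 1:             # only one keystroke operation
        return [keystroke[start:start + frame_len]]
    # Two keystrokes: a regression model estimates the second starting position.
    second = int(second_start_regressor.predict(feats)[0])
    return [keystroke[start:start + frame_len],
            keystroke[second:second + frame_len]]
```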
5. The method of any one of claims 1-4, wherein the determining mel frequency cepstrum coefficients according to the keystroke signal segment comprises:
denoising the keystroke signal segment by using a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
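A minimal sketch of the claim-5 denoising step, assuming a 4th-order Butterworth low-pass filter with an 8 kHz cutoff; the claim specifies neither the filter type nor the cutoff, so these values are illustrative only.

```python
import numpy as np
import librosa
from scipy.signal import butter, filtfilt


def denoised_mfcc(segment, sr, cutoff_hz=8000.0):
    """Low-pass the keystroke segment, then compute its mel frequency cepstrum coefficients."""
    # 4th-order low-pass Butterworth filter (order and cutoff are assumptions).
    b, a = butter(4, cutoff_hz, btype='low', fs=sr)
    denoised = filtfilt(b, a, np.asarray(segment, dtype=np.float64))
    return librosa.feature.mfcc(y=denoised.astype(np.float32), sr=sr,
                                n_mfcc=13, n_fft=512, hop_length=128)
```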
6. The method according to any one of claims 1 to 4, wherein the preset single-key identification model is constructed by:
acquiring a sound signal of each keyboard when its keys are struck;
intercepting, from the sound signal, a keystroke signal training segment with a duration of 41.7 ms by using a voice activity detection method;
randomly acquiring, from the sound signal, a sound signal segment of equal length to the keystroke signal training segment;
superimposing the sound signal segment on the keystroke signal training segment to determine a noisy keystroke signal training segment;
determining a mel frequency cepstrum coefficient training set according to all the keystroke signal training segments and all the noisy keystroke signal training segments, respectively;
and training on the mel frequency cepstrum coefficient training set to obtain the preset single-key identification model.
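The construction of the training set in claim 6 can be illustrated as follows; the unscaled sample-wise addition used for the superposition and the MFCC parameters are assumptions. Training the preset single-key identification model on the returned feature matrix with any off-the-shelf classifier would then correspond to the last step of the claim.

```python
import numpy as np
import librosa


def build_mfcc_training_set(keystroke_clips, sound, sr, rng=None):
    """Pair every clean 41.7 ms keystroke clip with a noisy copy made by adding an
    equally long, randomly chosen slice of the recording, and return all MFCC vectors."""
    if rng is None:
        rng = np.random.default_rng()
    sound = np.asarray(sound, dtype=np.float64)
    features = []
    for clip in keystroke_clips:                       # clean keystroke training segments
        clip = np.asarray(clip, dtype=np.float64)
        offset = int(rng.integers(0, len(sound) - len(clip)))
        noisy = clip + sound[offset:offset + len(clip)]  # noisy keystroke training segment
        for seg in (clip, noisy):
            features.append(librosa.feature.mfcc(y=seg.astype(np.float32), sr=sr,
                                                 n_mfcc=13, n_fft=512,
                                                 hop_length=128).flatten())
    return np.stack(features)
```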
7. A multi-keyboard mixed key sound recognition apparatus, the apparatus comprising:
an acquisition module, configured to acquire a sound signal emitted when a keyboard is struck;
an interception module, configured to perform keystroke signal interception on the sound signal and determine a keystroke signal segment;
a determination module, configured to determine mel frequency cepstrum coefficients according to the keystroke signal segment;
and a processing module, configured to input the mel frequency cepstrum coefficients into a preset single-key identification model and output the keying content corresponding to each keyboard.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for recognizing multi-keyboard mixed key sounds according to any one of claims 1-6.
9. A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for recognizing multi-keyboard mixed key sounds according to any one of claims 1-6.
CN202111628149.0A 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium Pending CN116415166A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111628149.0A CN116415166A (en) 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium
PCT/CN2022/130829 WO2023124556A1 (en) 2021-12-28 2022-11-09 Method and apparatus for recognizing mixed key sounds of multiple keyboards, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111628149.0A CN116415166A (en) 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116415166A true CN116415166A (en) 2023-07-11

Family

ID=86997523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111628149.0A Pending CN116415166A (en) 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN116415166A (en)
WO (1) WO2023124556A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827011B (en) * 2024-03-04 2024-05-07 渴创技术(深圳)有限公司 Key feedback method and device based on user behavior prediction and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492382B (en) * 2016-06-13 2020-12-18 阿里巴巴集团控股有限公司 Voiceprint information extraction method and device based on neural network
CN106128452A * 2016-07-05 2016-11-16 深圳大学 System and method for detecting keyboard tapping content by using acoustic signals
CN107680597B (en) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN110111812B (en) * 2019-04-15 2020-11-03 深圳大学 Self-adaptive identification method and system for keyboard keystroke content

Also Published As

Publication number Publication date
WO2023124556A1 (en) 2023-07-06

Similar Documents

Publication Publication Date Title
Chen et al. Who is real bob? adversarial attacks on speaker recognition systems
JP7210634B2 (en) Voice query detection and suppression
Yuan et al. CommanderSong: a systematic approach for practical adversarial voice recognition
Ahmed et al. Void: A fast and light voice liveness detection system
US20180190280A1 (en) Voice recognition method and apparatus
Anand et al. Spearphone: a lightweight speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers
Shi et al. Face-Mic: inferring live speech and speaker identity via subtle facial dynamics captured by AR/VR motion sensors
CN107517207A (en) Server, auth method and computer-readable recording medium
Wang et al. When the differences in frequency domain are compensated: Understanding and defeating modulated replay attacks on automatic speech recognition
Anand et al. Spearphone: A speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers
Ahmed et al. Towards more robust keyword spotting for voice assistants
WO2023124556A1 (en) Method and apparatus for recognizing mixed key sounds of multiple keyboards, device, and storage medium
Singh et al. Countermeasures to replay attacks: A review
Garg et al. Subband analysis for performance improvement of replay attack detection in speaker verification systems
Wang et al. Vsmask: Defending against voice synthesis attack via real-time predictive perturbation
CN113614828A (en) Method and apparatus for fingerprinting audio signals via normalization
Li et al. Security and privacy problems in voice assistant applications: A survey
Tian et al. Spoofing detection under noisy conditions: a preliminary investigation and an initial database
Sun et al. A self-attentional ResNet-LightGBM model for IoT-enabled voice liveness detection
Nagaraja et al. VoIPLoc: passive VoIP call provenance via acoustic side-channels
WO2023030017A1 (en) Audio data processing method and apparatus, device and medium
Walker et al. Sok: assessing the threat potential of vibration-based attacks against live speech using mobile sensors
Shi et al. Anti-replay: A fast and lightweight voice replay attack detection system
Anand et al. Motion Sensor-based Privacy Attack on Smartphones
Nagaraja et al. VoipLoc: VoIP call provenance using acoustic side-channels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination