CN116415166A - Multi-keyboard mixed key sound identification method, device, equipment and storage medium - Google Patents

Multi-keyboard mixed key sound identification method, device, equipment and storage medium

Info

Publication number
CN116415166A
CN116415166A
Authority
CN
China
Prior art keywords
signal
key
signal segment
sound
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111628149.0A
Other languages
Chinese (zh)
Inventor
Wang Lu (王璐)
Zhao Jiayi (赵家怡)
Huang Yongzhi (黄勇志)
Wu Kaishun (伍楷舜)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202111628149.0A priority Critical patent/CN116415166A/en
Priority to PCT/CN2022/130829 priority patent/WO2023124556A1/en
Publication of CN116415166A publication Critical patent/CN116415166A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F 3/041 Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
    • G06F 3/043 Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means using propagating acoustic waves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Input From Keyboards Or The Like (AREA)

Abstract

The application provides a method, an apparatus, a device, and a storage medium for recognizing multi-keyboard mixed key sounds. The method comprises the following steps: acquiring a sound signal emitted when a keyboard is struck; intercepting the keystroke signal from the sound signal and determining a keystroke signal segment; determining mel frequency cepstrum coefficients according to the keystroke signal segment; and inputting the mel frequency cepstrum coefficients into a preset single-key recognition model and outputting the keystroke content corresponding to each keyboard. The scheme is applicable to the recognition of mixed key sounds from multiple keyboards and achieves high recognition accuracy.

Description

Multi-keyboard mixed key sound identification method, device, equipment and storage medium
Technical Field
The invention belongs to the technical field of signal recognition, and particularly relates to a method, a device, equipment and a storage medium for recognizing multi-keyboard mixed key sounds.
Background
Today, most major office scenarios involve operating a computer in a room and entering content with a keyboard and a mouse. The content entered by a worker sometimes includes information related to personal, customer, or even company privacy, such as personal passwords, customer information, and company bidding contracts. Once such information is exploited by lawbreakers, the related personnel can suffer huge losses; for example, the 2018 Cost of Data Breach Study indicates that the average loss to an enterprise in the event of an information leak is $3.86 million. Therefore, the security of keyed-in information is critical.
In general, an external eavesdropper uses an invasive eavesdropping approach to eavesdrop on keyboard input, implanting a malicious program on a computer to obtain the victim's keyed-in information. With the development of network security technologies such as cloud security, external personnel can be effectively prevented from eavesdropping by security measures such as firewalls. However, eavesdropping by internal personnel still poses a great threat to the security of entered information. An insider can briefly use the victim's computer without a password when the victim leaves the computer (for example, to go to the toilet) and thereby carry out an attack. For such eavesdropping scenarios, researchers have proposed continuous user authentication as a defense: a model that distinguishes the legitimate user from illegitimate users is trained from user input information recorded on the computer, such as keystroke flight time, and the model runs continuously while the computer is in use to authenticate the user; once the current user is judged to be illegitimate, a corresponding action (such as locking the screen) is taken. This approach effectively prevents an eavesdropper from directly using the victim's computer.
With the development of signal detection systems, keyboard keystroke recognition has become a focus of attention. The problem of keyboard keystroke recognition becomes one of the key problems for protecting the security of office information.
Existing keyboard keystroke recognition falls into two main categories. The first implants a malicious program on the computer to recognize keystrokes; leakage of input content through this route can currently be prevented by security technologies such as firewalls. The second recognizes keyboard keystroke content from signals such as sound, Wi-Fi, and light; this form of eavesdropping is variable and is often difficult to defend against. The second line of research can be further divided into the following categories: (1) recognizing keyboard keystroke content with CSI techniques based on Wi-Fi signals, such as WiFinger; (2) recognizing keyboard keystroke content from video data based on optical signals, such as Blind Recognition of Touched Keys on Mobile Devices; and (3) recognizing keyboard keystroke content based on sound signals, such as Accurate Combined Keystrokes Detection Using Acoustic Signals, which recognizes key combinations (such as Ctrl+C) by capturing the sound signal.
Existing keyboard keystroke recognition techniques recognize a single key or a specific key combination (such as Ctrl+C) on a single keyboard. In an office scene, however, several keyboards are often struck at the same time, and the signal received by the recording device is often a mixed sound signal from multiple keyboards. Existing key sound recognition techniques therefore lack universality.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a method, an apparatus, a device, and a storage medium for identifying a multi-keyboard mixed key sound.
In order to solve the technical problems, the embodiments of the present application are implemented in the following manner:
in a first aspect, the present application provides a method for recognizing multi-keyboard mixed key sounds, the method comprising:
acquiring a sound signal emitted when a keyboard is struck;
intercepting the keystroke signal from the sound signal and determining a keystroke signal segment;
determining mel frequency cepstrum coefficients according to the keystroke signal segment;
inputting the mel frequency cepstrum coefficients into a preset single-key recognition model and outputting the keystroke content corresponding to each keyboard.
In one embodiment, acquiring a sound signal emitted when a keyboard is struck includes:
acquiring, by a recording element of a terminal, the sound signal emitted when the keyboard is struck, wherein the terminal comprises at least one recording element.
In one embodiment, performing keystroke signal interception on the sound signal and determining a keystroke signal segment includes:
calculating the energy value of the signal segment in the sound signal every 41.7 ms;
if the energy value of a first signal segment is greater than an energy threshold, intercepting, as a second signal segment, the signal from a first preset duration before the starting point of the first signal segment to a second preset duration after that starting point;
applying a voice activity detection method to the second signal segment to determine the keystroke signal segment.
In one embodiment, applying a voice activity detection method to the second signal segment to determine the keystroke signal segment comprises:
applying the voice activity detection method to the second signal segment to determine the starting point and the ending point of the keystroke action and extract the keystroke signal;
calculating the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal;
inputting the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal into a preset support vector machine, and judging whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length backward from the starting point as a first keystroke signal segment;
calculating the starting position at which the second keystroke operation begins through a regression neural network, and intercepting a signal segment of 41.7 ms in length backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
In one embodiment, determining the mel frequency cepstrum coefficients from the keystroke signal segment comprises:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
In one embodiment, the preset single-key recognition model is constructed by:
acquiring the sound signal of each keyboard when its keys are struck;
intercepting, from the sound signal, keystroke signal training segments with a duration of 41.7 ms using a voice activity detection method;
randomly acquiring, from the sound signal, sound signal segments of the same length as the keystroke signal training segments;
superimposing the sound signal segments on the keystroke signal training segments to determine noisy keystroke signal training segments;
determining a mel frequency cepstrum coefficient training set from all keystroke signal training segments and all noisy keystroke signal training segments respectively;
and training, with the mel frequency cepstrum coefficient training set as input data, to obtain the preset single-key recognition model.
In a second aspect, the present application provides a device for recognizing multi-keyboard mixed key sounds, the device comprising:
an acquisition module, configured to acquire a sound signal emitted when a keyboard is struck;
an interception module, configured to perform keystroke signal interception on the sound signal and determine a keystroke signal segment;
a determination module, configured to determine mel frequency cepstrum coefficients according to the keystroke signal segment;
and a processing module, configured to input the mel frequency cepstrum coefficients into a preset single-key recognition model and output the keystroke content corresponding to each keyboard.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for recognizing multi-keypad mixed key sounds as in the first aspect when the processor executes the program.
In a fourth aspect, the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for identifying multi-keypad mixed key sounds as in the first aspect.
The technical solutions provided by the embodiments of the present specification achieve at least the following effects:
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application is applicable to recognizing the keystroke content of multiple keyboards.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application requires only the recording elements on a terminal, needs no additional equipment, is low in cost, and is easy to deploy.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application proposes an attention-mechanism-based BLSTM model and exploits the fact that the signals received by two recording elements in the same time period are correlated, raising the key recognition accuracy of the BLSTM to 96.41%.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a method for identifying multi-keyboard mixed key sounds provided in the present application;
FIG. 2 is a layout of an experimental platform provided herein;
fig. 3 is a schematic structural diagram of a preset single bond recognition model provided in the present application;
fig. 4 is a schematic structural diagram of a multi-keyboard hybrid key sound recognition device provided in the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments of the present disclosure without departing from the scope or spirit of the disclosure. Other embodiments will be apparent to the skilled person from the description of the present application. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are intended to be inclusive and mean an inclusion, but not limited to.
The "parts" in the present application are all parts by mass unless otherwise specified.
The invention is described in further detail below with reference to the drawings and examples.
Referring to fig. 1, a flow chart of a method for identifying multi-keyboard mixed key sounds according to an embodiment of the present application is shown.
As shown in fig. 1, the method for identifying multi-key-pad mixed key sounds may include:
s110, acquiring a sound signal generated when the keyboard is knocked.
Specifically, a recording element of the terminal collects the sound signal emitted when the keyboard is struck and uploads the collected sound signal to the cloud. The terminal may be any electronic device with a recording element, such as a mobile phone, a tablet computer, or a wearable device. The recording element may be a microphone. The terminal may comprise at least one microphone; for example, a mobile phone may comprise two or more microphones.
As shown in fig. 2, when the recording elements of a mobile phone are used to collect the sound signal generated when a keyboard is struck, the phone is placed between the two keyboards, and two or more recording elements on the phone collect the keystroke sounds and upload them to the cloud.
S120, performing keystroke signal interception on the sound signal and determining a keystroke signal segment.
Specifically, in the cloud, keystroke signal interception is performed on the collected sound signal, and the intercepted keystroke signal is then cut into signal segments to obtain the keystroke signal segments. It will be appreciated that if the keystroke signal includes only one keystroke, one keystroke signal segment is obtained, and if it includes two keystrokes, two keystroke signal segments are obtained.
In one embodiment, S120, performing keystroke signal interception on the sound signal and determining a keystroke signal segment, may include:
calculating the energy value of the signal segment in the sound signal every 41.7 ms;
if the energy value of a first signal segment is greater than an energy threshold, intercepting, as a second signal segment, the signal from a first preset duration before the starting point of the first signal segment to a second preset duration after that starting point;
applying a voice activity detection method to the second signal segment to determine the keystroke signal segment.
Wherein applying the voice activity detection method to the second signal segment to determine the keystroke signal segment may comprise:
applying the voice activity detection method to the second signal segment to determine the starting point and the ending point of the keystroke action and extract the keystroke signal;
calculating the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal;
inputting the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal into a preset support vector machine, and judging whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length backward from the starting point as a first keystroke signal segment;
calculating the starting position at which the second keystroke operation begins through a regression neural network, and intercepting a signal segment of 41.7 ms in length backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
Specifically, the energy threshold may be set according to actual requirements. The first preset duration and the second preset duration can be set according to actual requirements, for example, the first preset duration and the second preset duration are both 1s.
The energy value A of a signal segment is:
A = x_1^2 + x_2^2 + ... + x_n^2
where n is the length of the signal segment.
For the received sound signal, the energy value of the signal segment is calculated every 41.7 ms. If the energy value exceeds the threshold, the 1 s before and the 1 s after the starting point of that signal segment (i.e. the first signal segment), 2 s in total, are intercepted as the signal segment in which a keystroke action may exist (i.e. the second signal segment).
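As an illustration of this pre-screening step, the following is a minimal Python sketch (not the patent's reference implementation); the 41.7 ms window comes from the description above, while the 1 s margins and the energy threshold value are assumptions for the example.

```python
import numpy as np

def candidate_segments(x, fs, win_ms=41.7, energy_threshold=1.0):
    """Return (start, end) sample indices of ~2 s candidate segments around energetic windows."""
    win = int(round(fs * win_ms / 1000.0))   # 41.7 ms window length in samples
    margin = int(fs * 1.0)                   # first/second preset duration, assumed 1 s each
    segments = []
    for start in range(0, len(x) - win, win):
        energy = np.sum(x[start:start + win] ** 2)    # energy value of the 41.7 ms segment
        if energy > energy_threshold:                 # candidate keystroke activity
            lo = max(0, start - margin)               # 1 s before the segment's starting point
            hi = min(len(x), start + margin)          # 1 s after the segment's starting point
            segments.append((lo, hi))
    return segments
```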
For the intercepted signal segments, a voice activity detection (Voice Activity Detection, VAD) method is used for finding a starting point stp and an ending point of the keystroke action, and a keystroke signal is extracted.
For the keystroke signal extracted by the VAD, the total energy value, the kurtosis, and the signal obtained after five wavelet transforms are calculated, and a trained SVM (support vector machine) judges whether the keystroke signal includes only one keystroke operation. If the keystroke signal includes only one keystroke operation, a signal segment of 41.7 ms in length is intercepted backward from the starting point stp obtained by the VAD as the keystroke signal segment, and the mel frequency cepstrum coefficients are determined from it in step S130. If the keystroke signal includes two keystroke operations, the first keystroke signal segment is the 41.7 ms segment intercepted backward from the starting point stp; the position inv at which the second keystroke operation begins (i.e. the moment at which the two keystroke operations start to overlap) is then calculated through the regression neural network, the second keystroke signal segment is the 41.7 ms segment intercepted backward from inv, and the mel frequency cepstrum coefficients are determined in step S130 from the first keystroke signal segment and the second keystroke signal segment respectively.
The specific operation of obtaining the regression neural network model used to calculate the overlap starting position is as follows:
The application adopts an LSTM-based regression neural network model to calculate the overlap starting position. The network structure comprises an input layer, an LSTM layer, a Flatten layer, and a fully connected (dense) layer. The model randomly superimposes single-key signals from the set of keystroke signal segments (the overlap starting position, signal source, and label are random) to generate overlapped signals containing two keystroke operations, while recording the overlap starting position as the label.
Input layer: receives an intercepted keystroke signal segment as the input of the model.
LSTM layer: encodes the input data of the model so that the output data of the LSTM contains timing information.
Flatten layer: reshapes the output data of the LSTM layer into a one-dimensional vector to facilitate the computation of the fully connected layer.
Fully connected layer: multiplies its input data by the weights to obtain the estimated overlap starting position. This layer uses no activation function.
Loss function: to minimize the error between the predicted value and the true value, the loss function is set to
L(Y, f(X)) = max(|Y - f(X)|).
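A minimal Keras sketch of such an LSTM-based regression network is given below; the layer sizes and the assumed segment length (about 2000 samples, i.e. roughly 41.7 ms at 48 kHz) are illustrative assumptions, not values fixed by the patent.

```python
import tensorflow as tf

def build_overlap_regressor(seg_len=2000):
    inputs = tf.keras.Input(shape=(seg_len, 1))                    # intercepted keystroke segment
    x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)    # LSTM layer: encode timing information
    x = tf.keras.layers.Flatten()(x)                               # Flatten layer: to a 1-D vector
    outputs = tf.keras.layers.Dense(1, activation=None)(x)         # dense layer: overlap start, no activation

    def max_abs_error(y_true, y_pred):                             # L(Y, f(X)) = max(|Y - f(X)|)
        return tf.reduce_max(tf.abs(y_true - y_pred))

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=max_abs_error)
    return model
```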
It will be appreciated that, for a keystroke signal segment, the present application determines the source of the signal (i.e. from which keyboard the keystroke signal originated) by calculating the energy difference between the signal segments received by the different recording elements.
The specific operation of determining the signal source is as follows:
(1) The key stroke signal segments received by the two recording elements are aligned in time.
(2) After alignment, the total energy value of the signal segments of the two recording elements is calculated respectively, and the difference value is obtained.
(3) Since the paths from the same sound source to the two recording elements differ in length, the keystroke signal attenuates to different degrees: the longer the path, the greater the attenuation, i.e. the lower the total energy of the signal received by that recording element. Because the two keyboards lie on opposite sides of the two recording elements, the total energy difference corresponding to one keyboard is positive and that corresponding to the other keyboard is negative, so the source of the keystroke signal can be determined.
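This decision can be illustrated by the following minimal sketch; which sign corresponds to which keyboard depends on the physical placement, so the returned labels are illustrative assumptions.

```python
import numpy as np

def infer_keyboard(seg_mic_top, seg_mic_bottom):
    """Infer the source keyboard from the total-energy difference of the time-aligned segments."""
    e_top = np.sum(np.abs(seg_mic_top))        # total energy at the top recording element
    e_bottom = np.sum(np.abs(seg_mic_bottom))  # total energy at the bottom recording element
    return "keyboard A" if e_top - e_bottom > 0 else "keyboard B"
```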
S130, determining the mel frequency cepstrum coefficients according to the keystroke signal segment, may include:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
The keystroke signal segment is denoised with a low-pass filter to obtain a denoised signal segment, and the mel frequency cepstrum coefficients are calculated from the denoised signal segment as the input data of the preset single-key recognition model.
In the field of audio processing, the mel frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the nonlinear mel scale of sound frequency, and mel frequency cepstrum coefficients are the coefficients that make up the mel frequency cepstrum. They take human auditory characteristics into account: the linear spectrum is mapped onto the mel nonlinear spectrum based on auditory perception and then converted to the cepstrum.
The specific operation of calculating the mel frequency cepstrum coefficient is as follows:
(1) Pre-emphasis, framing, and windowing are performed on the denoised signal segment.
(2) For each frame, the corresponding spectrum is obtained by FFT (fast Fourier transform).
(3) The obtained spectrum is passed through a mel filter bank to obtain the mel spectrum.
(4) Cepstral analysis operations such as taking the logarithm and applying the inverse transform are performed on the mel spectrum to obtain the mel frequency cepstrum coefficients.
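For illustration, the same pipeline can be approximated with librosa's built-in routines; this is a sketch rather than the patent's implementation, and n_mfcc is an assumed value.

```python
import librosa

def keystroke_mfcc(segment, fs, n_mfcc=13):
    emphasized = librosa.effects.preemphasis(segment)   # step (1): pre-emphasis
    # librosa.feature.mfcc internally frames and windows the signal, applies the FFT,
    # the mel filter bank, the logarithm and the DCT (steps (2)-(4) above).
    return librosa.feature.mfcc(y=emphasized, sr=fs, n_mfcc=n_mfcc)
```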
S140, inputting the mel frequency cepstrum coefficients into the preset single-key recognition model and outputting the keystroke content corresponding to each keyboard.
Specifically, the preset single-key recognition model may be trained in advance. The preset single-key recognition model is an attention-mechanism-based BLSTM neural network model (bidirectional long short-term memory recurrent neural network); its network structure is shown in fig. 3 and includes two input layers, two BLSTM layers, a concatenate layer, an attention layer, and a dense layer.
Input layers: because the signals received by the two recording elements in the same time period are correlated, the neural network uses two input layers to receive the mel frequency cepstrum coefficients corresponding to the two recording elements as input data.
BLSTM layers: a BLSTM consists of a forward LSTM (long short-term memory recurrent neural network) and a backward LSTM and is often used to model context information in natural language processing tasks; data processed by a BLSTM contains both forward and backward information. The present application therefore uses two BLSTM layers to receive the output data of the two input layers respectively, encoding the input data so that the output sequence of each BLSTM contains timing information.
Concatenate layer: connects the output sequences of the two BLSTM layers in series.
Attention layer: processes the concatenated output of the two BLSTMs so that the attention output data contains information about the association between the signals received by the two recording elements.
Dense layer: a fully connected layer that processes the data output by the attention layer to obtain the key recognition result. The fully connected layer uses the sigmoid activation function, and its output dimension is set to the number of labels.
In one embodiment, the preset single-key recognition model is constructed by:
acquiring the sound signal of each keyboard when its keys are struck;
intercepting, from the sound signal, keystroke signal training segments with a duration of 41.7 ms using a voice activity detection method;
randomly acquiring, from the sound signal, sound signal segments of the same length as the keystroke signal training segments;
superimposing the sound signal segments on the keystroke signal training segments to determine noisy keystroke signal training segments;
determining a mel frequency cepstrum coefficient training set from all keystroke signal training segments and all noisy keystroke signal training segments respectively;
and training, with the mel frequency cepstrum coefficient training set as input data, to obtain the preset single-key recognition model.
Specifically, a commercial mobile phone is used to collect the keystroke sound signal for each keyboard; the user is asked to strike a key every 2 seconds to prevent keystroke signals from overlapping. For the collected sound signal, every two seconds of sound is taken as one group of signals to be processed. For each two-second group, a keystroke signal segment with a duration of about 41.7 ms is intercepted from the signal using a VAD (voice activity detection) algorithm. For each intercepted keystroke signal segment, a randomly selected sound signal segment of equal length from the collected recording is superimposed on it to produce a noisy keystroke signal segment, thereby increasing the amount of training data. For the resulting set of keystroke signal segments, a support vector machine model is trained on the total energy value, the kurtosis, and the signal obtained after five wavelet transforms, and is used to judge whether a keystroke signal segment contains only one keystroke operation. Mel frequency cepstrum coefficients are calculated for the intercepted keystroke signal segments and the noisy keystroke signal segments to generate the training set, which is then used as input data to train and obtain the single-key recognition model.
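The noise-augmentation step can be sketched as follows (an assumed helper, not the patent's code): a randomly chosen background segment of equal length is linearly superimposed on an intercepted keystroke segment, and the result keeps the original key label.

```python
import numpy as np

def add_noisy_copy(keystroke_seg, recording, rng=None):
    rng = rng or np.random.default_rng()
    n = len(keystroke_seg)
    start = rng.integers(0, len(recording) - n)     # random equal-length background segment
    background = recording[start:start + n]
    return keystroke_seg + background               # linear superposition; label is unchanged
```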
In order to cut the keystroke signal out of the original signal and avoid classifying, in the background, data that contains no keystroke signal, a common voice activity detection algorithm, the double-threshold endpoint detection method, is used to identify and eliminate long silent periods.
The specific operation of intercepting a keystroke signal segment using the VAD algorithm is as follows:
(1) The original signal is normalized using x'_i = x_i / max(x_1, ..., x_L), i = 1, ..., L, where L is the length of the original signal.
The normalized signal is then updated with x_i = α · x_{i-1} + β, i = 2, ..., L, thereby introducing timing information.
(2) For the signal after timing information has been introduced, the total energy of a window of length FrameLen is calculated every FrameInc samples, yielding an array amp, which is the set of per-frame total energies. Specifically, with FrameInc as the step size, a signal of length FrameLen is extracted as frame i, and the sum of absolute values of the frame is calculated as the total energy of the frame, i.e. amp[i].
(3) A higher short-term energy threshold MH and a lower short-term energy threshold ML are calculated from the maximum value of amp, where MH = min(max(amp)/4, 10) and ML = min(max(amp)/8, 2). If amp[i] > ML, the frame may be in the sounding phase (the frame is marked status1); when more than 15 frames are status1, the signal is considered to have entered the sounding phase.
(4) The short-time zero-crossing rate (i.e. the number of times the signal crosses the horizontal axis per unit time) is calculated for each frame, yielding an array zcr. Specifically, zcr[i] is the number of times the signal in frame i crosses the horizontal axis divided by the frame length FrameLen.
(5) The array amp is traversed; if amp[i] exceeds the threshold MH, that frame is the first reference starting point stp1.
(6) Traversing backward from stp1, if amp[i] exceeds the threshold MH or the short-time zero-crossing rate zcr[i] exceeds a threshold Zs, the keystroke sound is regarded as continuing and the traversal continues; otherwise the keystroke sound is regarded as ended. The threshold Zs may be set according to actual requirements.
(7) The keystroke signal segment is intercepted according to the starting point and ending point found.
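A minimal sketch of steps (1)-(7) is given below; FrameLen, FrameInc, and the zero-crossing threshold Zs are assumed values, and only the MH branch of the double-threshold logic is shown.

```python
import numpy as np

def vad_endpoints(x, frame_len=512, frame_inc=128, zs=10):
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, frame_inc)]
    amp = np.array([np.sum(np.abs(f)) for f in frames])                        # per-frame total energy
    zcr = np.array([np.sum(np.abs(np.diff(np.sign(f))) > 0) for f in frames])  # approx. zero crossings

    mh = min(np.max(amp) / 4, 10)                    # higher short-term energy threshold MH
    # ML = min(np.max(amp) / 8, 2) would gate the tentative "sounding" state (status1)

    stp1 = next((i for i, a in enumerate(amp) if a > mh), None)  # first reference starting point
    if stp1 is None:
        return None
    end = stp1
    while end + 1 < len(amp) and (amp[end + 1] > mh or zcr[end + 1] > zs):
        end += 1                                     # keystroke sound regarded as continuing
    return stp1 * frame_inc, end * frame_inc + frame_len   # endpoints as sample indices
```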
The specific operation of obtaining the support vector machine model used to judge whether a keystroke signal segment contains only one keystroke operation is as follows:
(1) Superimposed signals containing two keystrokes are generated by randomly superimposing single-key signals from the set of keystroke signal segments (the overlap starting position, the signal sources, and the labels are random). Here, "random" means that the overlap starting position (a numerical value) is generated randomly; two keystroke signals are selected randomly (which keyboard and which key each signal comes from is random); and the superposition is a linear superposition of the selected keystroke signals at the generated overlap starting position.
(2) The single-key signals of the training set and the generated overlapped signals are labeled to produce the raw data of the support vector machine training set.
(3) A double-key signal differs from a single-key signal in the following three respects: a. the double-key signal has three or more peaks in the time domain; b. the total energy of the double-key signal is higher than that of the original single-key signal; c. the latter half of the keystroke signal contains a more energetic hit peak. Therefore, the total energy value and the kurtosis of the received keystroke signal segment are extracted as judgment features for distinguishing whether the segment contains a double-key signal. Meanwhile, to describe the difference in the number of peaks between double-key and single-key signals while reducing the amount of training data, the signal obtained after five wavelet transforms is also used as a judgment feature. The method therefore calculates the total energy value, the kurtosis, and the signal obtained after five wavelet transforms of the raw training data as the input features of the support vector machine, generating a training set for judging whether a segment contains only a single keystroke operation.
(4) The SVM model is trained and obtained from this training set.
Calculation formula of the total energy value:
E = |x_1| + |x_2| + ... + |x_n|
where n is the length of the signal segment.
Calculation formula of the kurtosis:
Kurt = [ (1/n) * Σ (x_i - x̄)^4 ] / [ (1/n) * Σ (x_i - x̄)^2 ]^2, i = 1, ..., n
where x̄ is the mean value of the signal segment.
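A minimal scikit-learn sketch of this single-key vs. double-key classifier is shown below; the wavelet family ('db4'), the choice of the level-5 approximation coefficients as the "five wavelet transforms" feature, and the SVM parameters are assumptions rather than the patent's exact settings.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis
from sklearn.svm import SVC

def segment_features(seg):
    total_energy = np.sum(np.abs(seg))                 # total energy value (sum of absolute values)
    kurt = kurtosis(seg)                               # kurtosis of the segment
    approx = pywt.wavedec(seg, "db4", level=5)[0]      # coefficients after a 5-level wavelet transform
    return np.concatenate(([total_energy, kurt], approx))

def train_double_key_svm(segments, labels):
    """labels: 0 = contains one keystroke operation, 1 = contains two keystroke operations."""
    X = np.array([segment_features(s) for s in segments])
    clf = SVC(kernel="rbf")
    clf.fit(X, np.asarray(labels))
    return clf
```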
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application is applicable to recognizing the keystroke content of multiple keyboards.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application requires only the recording elements on a terminal, needs no additional equipment, is low in cost, and is easy to deploy.
The method for recognizing multi-keyboard mixed key sounds provided by the embodiments of the present application proposes an attention-mechanism-based BLSTM model and exploits the fact that the signals received by two recording elements in the same time period are correlated, raising the key recognition accuracy of the BLSTM to 96.41%.
Experimental verification
Experimental environment: Experiments were performed in a conference room and in a dormitory. The conference room environment is quiet, and noise comes mainly from distant passing vehicles, the air conditioner, and reflections of the key sounds. There are many objects in the conference room and the environment is complex. The dormitory environment is noisy, with various kinds of interference such as human voices, keystroke sounds of non-target keyboards, and sounds from a washing machine, which challenges the extraction of keystroke signal segments. The dormitory also contains more objects, the environment is more complex, and the reflections of the key sounds are more complex. To avoid the influence of the desktop material beneath the keyboard and of desktop vibration when keys are struck, the keyboard and the mobile phone were placed on a mouse pad and fixed to it, preventing the keyboard's position from shifting slightly during typing.
A keyboard: experiments were mainly performed on a mechanical keyboard. The mechanical keyboard is of the type iKBC typeman W200 and is not used before collecting data, there is no key wear. The keystroke sound of the mechanical keyboard is clear, the key position is stable, the duration of the complete single key signal is about 125ms, and the hit peak duration is about 42ms.
Mobile phone: software is deployed on the mobile phone platform of Hua P20 and Hongmi K30 respectively, and key sound collection, data transmission and eavesdropping text display are carried out. The Hua P20 has 2 microphones which are respectively positioned at the top and the bottom of the mobile phone, and an Android 8.1 system is adopted to provide a sampling rate of 48kHz at most. The red rice K30 is provided with 3 microphones which are respectively positioned at the top, the bottom and the middle of four cameras of the mobile phone, and the Android 10.0 system is adopted to provide a sampling rate of 96kHz at the highest. Software deployed on the red-rice K30 handset platform can only invoke the two microphones located at the top and bottom. Therefore, the data collected on the two mobile phone platforms are two-channel data, the sampling rate of the data collected by P20 is 48kHz, and the sampling rate of the data collected by red rice is 96kHz.
Knocking speed: the tester is required to strike the key every 2 seconds to avoid overlapping signals in the signals received by the microphone.
Data set: the tester was asked to tap a total of 26 keys a through Z, each 60 times. To exclude the possibility of the application regarding stable characteristics of the environment over time (e.g. voice of a person speaking, voice of a song played outdoors) as characteristics of key classification, the tester is required to divide 60 taps of each key into 3 completions, collect 20 sets of audio signals of key taps each time, and have a time interval of at least 4 hours each time.
Single-key recognition effect
The recognition accuracy over the 26 keys of a single keyboard reaches up to 96.41%.
Double-key recognition effect
The key recognition accuracy on the mixed signal of two keyboards reaches up to 67%.
Overall simulation experiment: the single-key signals of the two keyboards are linearly superimposed, with the superposition starting position being a randomly generated value inv. The linearly superimposed signal is used to simulate the multi-keyboard mixed signal. The overlap starting point, signal source, and label of the mixed signal are all randomly selected.
Effect of signal source judgment
Precondition: the overlap starting position is known.
Single-key judgment accuracy: 99.87%
Double-key judgment accuracy: 94.37%
Double-key recognition effect
Precondition: the overlap starting position and the signal source are known.
Recognition accuracy of the first key: 83.25%;
recognition accuracy of the second key: 74.84%.
Referring to fig. 4, a schematic structural diagram of a device for recognizing multi-keyboard mixed key sounds according to an embodiment of the present application is shown.
As shown in fig. 4, the device 400 for recognizing multi-keyboard mixed key sounds may include:
an acquisition module 410, configured to acquire a sound signal emitted when a keyboard is struck;
an interception module 420, configured to perform keystroke signal interception on the sound signal and determine a keystroke signal segment;
a determination module 430, configured to determine mel frequency cepstrum coefficients according to the keystroke signal segment;
and a processing module 440, configured to input the mel frequency cepstrum coefficients into a preset single-key recognition model and output the keystroke content corresponding to each keyboard.
Optionally, the obtaining module 410 is further configured to:
the method comprises the steps of acquiring a sound signal sent by a recording element of a terminal when a keyboard is knocked, wherein the terminal comprises at least one recording element.
Optionally, the interception module 420 is further configured to:
calculating the energy value of the signal segment in the sound signal every 41.7 ms;
if the energy value of the first signal segment is larger than the energy threshold value, intercepting the signal segment with a first preset duration before the starting point of the first signal segment and a second preset duration after the starting point of the first signal segment as a second signal segment;
The second signal segment adopts a voice activity detection method to determine the keystroke signal segment.
Optionally, the interception module 420 is further configured to:
applying the voice activity detection method to the second signal segment to determine the starting point and the ending point of the keystroke action and extract the keystroke signal;
calculating the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal;
inputting the total energy, the kurtosis, and the signal obtained after five wavelet transforms of the keystroke signal into a preset support vector machine, and judging whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length backward from the starting point as a first keystroke signal segment;
calculating the starting position at which the second keystroke operation begins through a regression neural network, and intercepting a signal segment of 41.7 ms in length backward from that starting position as a second keystroke signal segment;
taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
Optionally, the determining module 430 is further configured to:
denoising the keystroke signal segment with a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
Optionally, the processing module 440 is further configured to:
acquire the sound signal of each keyboard when its keys are struck;
intercept, from the sound signal, keystroke signal training segments with a duration of 41.7 ms using a voice activity detection method;
randomly acquire, from the sound signal, sound signal segments of the same length as the keystroke signal training segments;
superimpose the sound signal segments on the keystroke signal training segments to determine noisy keystroke signal training segments;
determine a mel frequency cepstrum coefficient training set from all keystroke signal training segments and all noisy keystroke signal training segments respectively;
and train, with the mel frequency cepstrum coefficient training set as input data, to obtain the preset single-key recognition model.
The above method embodiments can be carried out by the device for recognizing multi-keyboard mixed key sounds; the implementation principles and technical effects are similar and are not repeated here.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, showing an electronic device 300 suitable for implementing embodiments of the present application.
As shown in fig. 5, the electronic device 300 includes a Central Processing Unit (CPU) 301 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 are also stored. The CPU 301, ROM 302, and RAM 303 are connected to each other through a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 308 including a hard disk or the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 310 as needed, so that a computer program read therefrom is installed into the storage section 308 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to fig. 1 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the multi-keyboard mixed key sound recognition method described above. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 309, and/or installed from the removable medium 311.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not in some way constitute a limitation of the unit or module itself.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a notebook computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be a storage medium contained in the foregoing apparatus in the foregoing embodiment; or may be a storage medium that exists alone and is not incorporated into the device. The storage medium stores one or more programs for use by one or more processors to perform the multi-keypad mixed key sound recognition method described herein.
Storage media, including both permanent and non-permanent, removable and non-removable media, may be implemented in any method or technology for storage of information. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises an element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (9)

1. A method for recognizing multi-keyboard mixed key sounds, the method comprising:
acquiring a sound signal emitted when a keyboard is struck;
intercepting the keystroke signal from the sound signal to determine a keystroke signal segment;
determining mel frequency cepstrum coefficients according to the keystroke signal segment;
and inputting the mel frequency cepstrum coefficients into a preset single-key recognition model, and outputting the keystroke content corresponding to each keyboard.
2. The method of claim 1, wherein acquiring the sound signal emitted when a keyboard is struck comprises:
the method comprises the steps of acquiring a sound signal sent by a recording element of a terminal when a keyboard is knocked, wherein the terminal comprises at least one recording element.
3. The method of claim 2, wherein the performing keystroke signal interception on the sound signal to determine a keystroke signal segment comprises:
calculating an energy value of a signal segment of the sound signal every 41.7 ms;
if the energy value of a first signal segment is greater than an energy threshold, intercepting, as a second signal segment, the signal segment spanning a first preset duration before the starting point of the first signal segment and a second preset duration after the starting point of the first signal segment;
and determining the keystroke signal segment from the second signal segment by using a voice activity detection method.
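For illustration only, the sketch below approximates the segmentation recited in claim 3: the 41.7 ms frame length comes from the claim, while the energy threshold and the first and second preset durations (10 ms before and 90 ms after the frame start here) are placeholder assumptions.

```python
import numpy as np


def intercept_candidate(sound, sr, energy_threshold, pre_ms=10.0, post_ms=90.0):
    """Scan the recording in 41.7 ms steps and cut out a candidate keystroke window."""
    sound = np.asarray(sound, dtype=np.float64)
    frame_len = int(round(0.0417 * sr))        # one 41.7 ms signal segment
    pre = int(pre_ms * sr / 1000)              # first preset duration (assumed)
    post = int(post_ms * sr / 1000)            # second preset duration (assumed)
    for start in range(0, len(sound) - frame_len + 1, frame_len):
        energy = float(np.sum(sound[start:start + frame_len] ** 2))
        if energy > energy_threshold:
            # Second signal segment: pre_ms before and post_ms after the segment start.
            return sound[max(0, start - pre):start + post]
    return None
```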
4. The method according to claim 3, wherein the determining the keystroke signal segment from the second signal segment by using the voice activity detection method comprises:
determining a starting point and an ending point of a keystroke action in the second signal segment by using the voice activity detection method, and extracting a keystroke signal;
calculating a total energy, a peak value, and a 5-level wavelet-transformed signal of the keystroke signal;
inputting the total energy, the peak value, and the 5-level wavelet-transformed signal of the keystroke signal into a preset support vector machine, and determining whether the keystroke signal comprises only one keystroke operation;
if the keystroke signal comprises only one keystroke operation, intercepting a signal segment of 41.7 ms in length extending backward from the starting point as the keystroke signal segment;
if the keystroke signal comprises two keystroke operations, intercepting a signal segment of 41.7 ms in length extending backward from the starting point as a first keystroke signal segment;
calculating a starting position of the second keystroke operation by using a regression neural network, and intercepting a signal segment of 41.7 ms in length extending backward from the starting position as a second keystroke signal segment;
and taking the first keystroke signal segment and the second keystroke signal segment as the keystroke signal segments.
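An illustrative sketch of the claim-4 decision step follows; it is not part of the claims. The wavelet family (`db4`), the reduction of the 5-level wavelet coefficients to per-level energies, and the objects `svm` (a pre-fitted support vector machine) and `second_start_regressor` (standing in for the regression neural network) are assumptions; the claim does not fix these details.

```python
import numpy as np
import pywt


def split_keystrokes(keystroke, sr, start, svm, second_start_regressor):
    """Decide whether the extracted keystroke signal holds one or two keystrokes
    and cut one 41.7 ms segment per keystroke."""
    keystroke = np.asarray(keystroke, dtype=np.float64)
    frame_len = int(round(0.0417 * sr))
    # Total energy, peak value, and per-level energies of a 5-level wavelet
    # transform (summarising the wavelet-transformed signal this way is an assumption).
    coeffs = pywt.wavedec(keystroke, 'db4', level=5)
    feats = np.array([[np.sum(keystroke ** 2),
                       np.max(np.abs(keystroke)),
                       *[float(np.sum(c ** 2)) for c in coeffs]]])
    if svm.predict(feats)[0] == 1:             # only one keystroke operation
        return [keystroke[start:start + frame_len]]
    # Two keystrokes: a regression model estimates the second starting position.
    second = int(second_start_regressor.predict(feats)[0])
    return [keystroke[start:start + frame_len],
            keystroke[second:second + frame_len]]
```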
5. The method of any one of claims 1-4, wherein the determining mel frequency cepstrum coefficients according to the keystroke signal segment comprises:
denoising the keystroke signal segment by using a low-pass filter to obtain a denoised signal segment;
and determining the mel frequency cepstrum coefficients according to the denoised signal segment.
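A minimal sketch of the claim-5 denoising step, assuming a 4th-order Butterworth low-pass filter with an 8 kHz cutoff; the claim specifies neither the filter type nor the cutoff, so these values are illustrative only.

```python
import numpy as np
import librosa
from scipy.signal import butter, filtfilt


def denoised_mfcc(segment, sr, cutoff_hz=8000.0):
    """Low-pass the keystroke segment, then compute its mel frequency cepstrum coefficients."""
    # 4th-order low-pass Butterworth filter (order and cutoff are assumptions).
    b, a = butter(4, cutoff_hz, btype='low', fs=sr)
    denoised = filtfilt(b, a, np.asarray(segment, dtype=np.float64))
    return librosa.feature.mfcc(y=denoised.astype(np.float32), sr=sr,
                                n_mfcc=13, n_fft=512, hop_length=128)
```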
6. The method according to any one of claims 1 to 4, wherein the preset single-key identification model is constructed by:
acquiring a sound signal of each keyboard when its keys are struck;
intercepting, from the sound signal, a keystroke signal training segment with a duration of 41.7 ms by using a voice activity detection method;
randomly acquiring, from the sound signal, a sound signal segment of equal length to the keystroke signal training segment;
superimposing the sound signal segment on the keystroke signal training segment to determine a noisy keystroke signal training segment;
determining a mel frequency cepstrum coefficient training set according to all the keystroke signal training segments and all the noisy keystroke signal training segments, respectively;
and training on the mel frequency cepstrum coefficient training set to obtain the preset single-key identification model.
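The construction of the training set in claim 6 can be illustrated as follows; the unscaled sample-wise addition used for the superposition and the MFCC parameters are assumptions. Training the preset single-key identification model on the returned feature matrix with any off-the-shelf classifier would then correspond to the last step of the claim.

```python
import numpy as np
import librosa


def build_mfcc_training_set(keystroke_clips, sound, sr, rng=None):
    """Pair every clean 41.7 ms keystroke clip with a noisy copy made by adding an
    equally long, randomly chosen slice of the recording, and return all MFCC vectors."""
    if rng is None:
        rng = np.random.default_rng()
    sound = np.asarray(sound, dtype=np.float64)
    features = []
    for clip in keystroke_clips:                       # clean keystroke training segments
        clip = np.asarray(clip, dtype=np.float64)
        offset = int(rng.integers(0, len(sound) - len(clip)))
        noisy = clip + sound[offset:offset + len(clip)]  # noisy keystroke training segment
        for seg in (clip, noisy):
            features.append(librosa.feature.mfcc(y=seg.astype(np.float32), sr=sr,
                                                 n_mfcc=13, n_fft=512,
                                                 hop_length=128).flatten())
    return np.stack(features)
```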
7. A multi-keyboard mixed key sound recognition apparatus, the apparatus comprising:
an acquisition module, configured to acquire a sound signal emitted when a keyboard is struck;
an interception module, configured to perform keystroke signal interception on the sound signal and determine a keystroke signal segment;
a determination module, configured to determine mel frequency cepstrum coefficients according to the keystroke signal segment;
and a processing module, configured to input the mel frequency cepstrum coefficients into a preset single-key identification model and output the keying content corresponding to each keyboard.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for recognizing multi-keyboard mixed key sounds according to any one of claims 1-6.
9. A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for recognizing multi-keyboard mixed key sounds according to any one of claims 1-6.
CN202111628149.0A 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium Pending CN116415166A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111628149.0A CN116415166A (en) 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium
PCT/CN2022/130829 WO2023124556A1 (en) 2021-12-28 2022-11-09 Method and apparatus for recognizing mixed key sounds of multiple keyboards, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111628149.0A CN116415166A (en) 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116415166A true CN116415166A (en) 2023-07-11

Family

ID=86997523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111628149.0A Pending CN116415166A (en) 2021-12-28 2021-12-28 Multi-keyboard mixed key sound identification method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN116415166A (en)
WO (1) WO2023124556A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827011B (en) * 2024-03-04 2024-05-07 渴创技术(深圳)有限公司 Key feedback method and device based on user behavior prediction and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492382B (en) * 2016-06-13 2020-12-18 阿里巴巴集团控股有限公司 Voiceprint information extraction method and device based on neural network
CN106128452A * 2016-07-05 2016-11-16 深圳大学 System and method for detecting keyboard tapping content by using acoustic signals
CN107680597B (en) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN110111812B (en) * 2019-04-15 2020-11-03 深圳大学 Self-adaptive identification method and system for keyboard keystroke content

Also Published As

Publication number Publication date
WO2023124556A1 (en) 2023-07-06

Similar Documents

Publication Publication Date Title
Chen et al. Who is real bob? adversarial attacks on speaker recognition systems
JP7210634B2 (en) Voice query detection and suppression
Yuan et al. CommanderSong: a systematic approach for practical adversarial voice recognition
Ahmed et al. Void: A fast and light voice liveness detection system
US20180190280A1 (en) Voice recognition method and apparatus
Anand et al. Spearphone: a lightweight speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers
Shi et al. Face-Mic: inferring live speech and speaker identity via subtle facial dynamics captured by AR/VR motion sensors
CN107517207A (en) Server, auth method and computer-readable recording medium
Wang et al. When the differences in frequency domain are compensated: Understanding and defeating modulated replay attacks on automatic speech recognition
Anand et al. Spearphone: A speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers
Ahmed et al. Towards more robust keyword spotting for voice assistants
WO2023124556A1 (en) Method and apparatus for recognizing mixed key sounds of multiple keyboards, device, and storage medium
Singh et al. Countermeasures to replay attacks: A review
Garg et al. Subband analysis for performance improvement of replay attack detection in speaker verification systems
Wang et al. Vsmask: Defending against voice synthesis attack via real-time predictive perturbation
CN113614828A (en) Method and apparatus for fingerprinting audio signals via normalization
Li et al. Security and privacy problems in voice assistant applications: A survey
Tian et al. Spoofing detection under noisy conditions: a preliminary investigation and an initial database
Sun et al. A self-attentional ResNet-LightGBM model for IoT-enabled voice liveness detection
Nagaraja et al. VoIPLoc: passive VoIP call provenance via acoustic side-channels
WO2023030017A1 (en) Audio data processing method and apparatus, device and medium
Walker et al. Sok: assessing the threat potential of vibration-based attacks against live speech using mobile sensors
Shi et al. Anti-replay: A fast and lightweight voice replay attack detection system
Anand et al. Motion Sensor-based Privacy Attack on Smartphones
Nagaraja et al. VoipLoc: VoIP call provenance using acoustic side-channels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination