CN110738991A - Speech recognition equipment based on flexible wearable sensor - Google Patents

Speech recognition equipment based on flexible wearable sensor

Info

Publication number
CN110738991A
CN110738991A
Authority
CN
China
Prior art keywords
voice
wearable sensor
flexible wearable
signal
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910962682.7A
Other languages
Chinese (zh)
Inventor
吴俊
段升顺
查欣婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910962682.7A priority Critical patent/CN110738991A/en
Publication of CN110738991A publication Critical patent/CN110738991A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/142 - Hidden Markov Models [HMMs]
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 2015/081 - Search algorithms, e.g. Baum-Welch or Viterbi

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice recognition device based on a flexible wearable sensor, comprising a voice acquisition unit, a voice signal receiving and processing unit, and a voice recognition network unit. The voice acquisition unit includes the flexible wearable sensor, which converts the mechanical vibration of the laryngeal prominence during speech into an electrical signal and outputs it; the frequency and amplitude of the electrical signal are positively correlated with the frequency and amplitude of the laryngeal vibration.

Description

Speech recognition equipment based on flexible wearable sensor
Technical Field
The invention relates to voice recognition technology, flexible electronics, and neural networks, and in particular to a voice recognition device based on a flexible wearable sensor.
Background
Since Bell Laboratories developed a system capable of recognizing the ten English digits in the 1950s, speech recognition technology has undergone great development, and the successful introduction of hidden Markov models (HMMs) and artificial neural networks (ANNs) has made the performance of speech recognition systems far superior to that of earlier systems.
However, in conventional speech recognition technology, the acquisition of speech signals depends on microphones: the sound must travel from the speaker to the microphone through an air channel, and during this propagation the speech is easily corrupted by noise, seriously degrading the effective information received by the microphone. Because such a speech recognition system is sensitive to its environment, a system trained on collected speech is suited only to the environment in which that speech was collected, which also hinders its transition from a laboratory demonstration to a commercial product.
Disclosure of Invention
The invention aims to provide a voice recognition device based on a flexible wearable sensor, so as to overcome the drawback that the acquisition of the speech signal source is easily affected by the environment, and to increase the robustness of the speech recognition system and its applicability in complex environments.
A speech recognition device based on a flexible wearable sensor includes:
a voice acquisition unit, comprising a flexible wearable sensor and an analog-to-digital conversion unit, wherein the flexible wearable sensor is attached to the neck, acquires the vibration signal of the laryngeal prominence during speech and converts it into an analog electrical signal, and the analog-to-digital conversion unit receives the analog electrical signal and encodes it into a digital signal;
a voice signal receiving and processing unit, connected to the voice acquisition unit, which preprocesses the audio data of the digital signal and then extracts the feature vector of the voice signal;
and a voice recognition network unit, connected to the voice signal receiving and processing unit, which decodes the feature vectors extracted by the voice signal receiving and processing unit, constructs a search space using a dictionary, an acoustic model and a language model, and searches for an optimal path in the search space through a search algorithm to obtain the voice recognition result.
The audio data preprocessing specifically includes the following contents:
step 1, the voice signal receiving and processing unit acquires the digital signal, filters the voice signal, and then removes the silence at the beginning and end using an endpoint detection technique;
step 2, the resulting audio signal is split into a series of frames using a moving window function;
and step 3, each frame is processed with algorithms such as perceptual linear prediction (PLP) or Mel cepstral coefficients and converted into a feature vector containing the sound information.
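The framing in step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 16 kHz sample rate, the Hamming window, and the 440 Hz test tone are assumptions, with the 25 ms frame length and 10 ms shift taken from the embodiment in the detailed description.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)           # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[i*hop : i*hop + frame_len] * window
                     for i in range(n_frames)])

# one second of a 440 Hz tone as a stand-in for the sensor's digital signal
sig = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000, endpoint=False))
frames = frame_signal(sig, 16000)
print(frames.shape)  # (98, 400): 98 frames of 400 samples each
```

Each row of `frames` would then be fed to the PLP or Mel-cepstral feature extraction of step 3.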
The specific steps of the speech recognition are as follows:
step 1, the feature vector of each frame obtained by the speech signal receiving and processing unit is input into an acoustic model based on a deep neural network and a hidden Markov model; the acoustic model scores each feature vector on its acoustic features and outputs the phoneme (pinyin) information corresponding to the frame;
step 2, a Chinese character network space is constructed using the language model, and a phoneme (pinyin) network space is then constructed through the dictionary;
and step 3, an optimal path is searched in the phoneme network space through a dynamic-programming pruning algorithm, so that the accumulated probability along the path is maximized; the output along this path is the recognition result corresponding to the voice signal.
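The exact dynamic-programming search over such a state network is the Viterbi algorithm. A minimal sketch follows; the two-state transition and emission probabilities are toy values standing in for the DNN-HMM scores, not values from the patent.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most probable state path given per-frame log emission scores.
    log_emit has shape (T frames, N states)."""
    T, N = log_emit.shape
    delta = log_init + log_emit[0]            # best score ending in each state
    back = np.zeros((T, N), dtype=int)        # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (from-state, to-state) scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.8, 0.2], [0.2, 0.8]])
log_emit = np.log([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
print(viterbi(log_init, log_trans, log_emit))  # [0, 0, 1]
```

The accumulated probability of the returned path is maximal, matching step 3's criterion.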
The dictionary is the mapping between Chinese characters and phonemes; the phoneme set used for Chinese characters consists of all the initials and finals.
The language model adopts an N-Gram model, which obtains the probabilities of associations between individual characters or words by training on a large amount of text.
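The N-Gram idea can be sketched as a maximum-likelihood bigram (2-gram) model; the tiny English corpus below is made up for illustration in place of the large text collection, and a production model would add smoothing for unseen pairs.

```python
from collections import Counter

def bigram_probs(corpus_sentences):
    """Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks[:-1])                 # contexts (everything but </s>)
        bigrams.update(zip(toks[:-1], toks[1:]))   # adjacent word pairs
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

corpus = ["i am a robot", "i am here", "a robot is here"]
probs = bigram_probs(corpus)
print(probs[("i", "am")])   # 1.0: "am" always follows "i" in this corpus
print(probs[("am", "a")])   # 0.5: "a" follows "am" in one of two cases
```

During decoding, these conditional probabilities score how plausible each candidate character or word sequence is.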
Further, the voice acquisition unit comprises a Bluetooth module, and the voice acquisition unit and the voice signal receiving and processing unit communicate by Bluetooth wireless transmission. The voice acquisition unit also comprises a filtering unit; the analog electrical signal is processed by the filtering unit and then encoded into a digital signal. Further, the analog-to-digital conversion unit and the filtering unit are integrated in the Bluetooth module.
Further, the voice acquisition unit includes a power module.
Compared with the prior art, the invention has the following notable advantage: the speech signal is acquired by using the flexible wearable sensor to detect the vibration of the laryngeal prominence during speech. Compared with a conventional microphone that acquires the speech signal through air as the medium, this greatly improves the signal-to-noise ratio of the speech signal in noisy environments, overcomes the drawback that speech signal acquisition is easily affected by the environment, and increases the robustness of the speech recognition system and its applicability in complex environments.
Drawings
FIG. 1 is a schematic diagram of the apparatus of the present invention;
FIG. 2 is a schematic diagram of a wearable sensor according to an embodiment of the present invention;
FIG. 3 compares the voice signal obtained by the attached sensor of the invention with the voice signal obtained by a conventional microphone using air as the medium;
FIG. 4 is a flow chart illustrating a speech recognition process;
fig. 5 is a schematic diagram of an acoustic model.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the accompanying drawings and the detailed description.
As shown in FIG. 1, a flexible wearable sensor-based voice recognition device comprises a voice acquisition unit, a voice signal receiving and processing unit and a neural network unit.
In this embodiment, the invention employs a triboelectric wearable sensor; because it is self-powered and converts mechanical vibration directly into an electrical signal, the overall voice acquisition module is more energy-efficient and simpler to design.
(1) Voice acquisition unit
The voice acquisition unit comprises a power module, a Bluetooth module and a triboelectric wearable sensor, wherein each submodule is electrically connected.
The specific structure of the triboelectric wearable sensor is shown in fig. 2: it consists of an upper packaging layer, an upper electrode layer, an upper friction electrode layer, a lower friction electrode layer, a lower electrode layer and a lower packaging layer. The upper and lower packaging layers are PVA films whose Young's modulus matches that of the human body; the upper and lower electrode layers are sputtered copper layers; the upper friction electrode layer is a nylon film and the lower friction electrode layer is PDMS, both with surfaces microstructured with the aid of abrasive paper.
The working principle of the device is based on the triboelectrification effect and the electrostatic coupling effect, and the working principle is specifically as follows:
the device is in a double-electrode working mode, when a person speaks, the throat node can vibrate, and therefore contact-separation reciprocating motion of an upper double electrode layer and a lower double electrode layer of the triboelectric wearable sensor attached to the throat node is caused. Specifically, during contact-separation, the surfaces of the upper friction electrode layer nylon and the lower friction electrode layer PDMS respectively have positive and negative charges after contact-separation, and an electric pulse is generated under the connection of a double-electrode lead, so that a positive voltage peak value is generated; similarly, after separation-contact, the two charges of the two electrodes are electrically neutral to each other, which results from the former phenomenon of inverse charge movement, thereby generating a negative voltage pulse. Thereby generating a negative voltage peak value, and the amplitude and the frequency of the laryngeal knot vibration can be converted into the peak value and the number of the peak values of the output voltage and form positive correlation with the peak value and the number of the peak values. So far, the vibration information of the laryngeal node has been converted into voltage information during speaking.
As shown in fig. 3, the conventional microphone-based acquisition method mainly obtains information from the speaker's voice transmitted to the microphone by vibration of the air medium. In this process, the spoken voice is coupled with other sound signals in the air; in the worst case, when the environmental noise is large, the spoken voice is completely drowned out by the noise and the information becomes unusable. The triboelectric wearable sensor of the present method acquires the corresponding sound signal from the vibration of the laryngeal prominence rather than from a sound signal transmitted through the air, and is therefore hardly affected by noise in the environment.
(2) Voice signal receiving and processing unit
The voice signal receiving and processing unit acquires the voice digital signal through the Bluetooth module, filters it, and performs audio data preprocessing such as framing to obtain the feature vectors of the original voice signal, which are transmitted to the downstream voice recognition network unit. Specifically, the unit acquires the digital signal through the Bluetooth module, filters the voice signal, and removes the silence at the beginning and end using an endpoint detection technique to reduce interference with subsequent steps. The audio signal is then split into a series of frames by a moving window function with a frame length of 25 ms and a frame shift of 10 ms. Finally, each frame is processed with Mel-frequency cepstral coefficients (MFCC) and converted into a multi-dimensional vector containing the sound information, i.e., the feature vector.
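The endpoint detection step can be sketched with a simple short-time-energy gate. The 400-sample frames (25 ms at 16 kHz), the 10% energy threshold, and the synthetic tone are illustrative assumptions, not the patent's exact method:

```python
import numpy as np

def trim_silence(signal, frame_len=400, energy_ratio=0.1):
    """Crude endpoint detection: drop leading and trailing frames whose
    short-time energy is below a fraction of the loudest frame's energy."""
    n = len(signal) // frame_len
    energies = np.array([np.sum(signal[i*frame_len:(i+1)*frame_len]**2)
                         for i in range(n)])
    active = np.where(energies > energy_ratio * energies.max())[0]
    start, end = active[0], active[-1] + 1
    return signal[start*frame_len:end*frame_len]

# 0.25 s silence + 0.25 s of a 440 Hz tone + 0.25 s silence, at 16 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)
sig = np.concatenate([np.zeros(4000), tone, np.zeros(4000)])
trimmed = trim_silence(sig)
print(len(trimmed))  # 4000: only the voiced middle section remains
```

The trimmed signal is what the framing and MFCC stages then operate on.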
(3) Speech recognition network element
The speech recognition network unit comprises a dictionary (shown in Table 1), an acoustic model (shown in fig. 5), a language model, a decoding space and a search algorithm. The dictionary contains the mapping between Chinese characters or words and phonemes (all initials and finals), as shown in Table 1. The acoustic model consists of a deep neural network and a hidden Markov chain, as shown in fig. 5: the deep neural network performs further feature extraction on the feature vectors, and the corresponding phoneme probabilities are calculated through the hidden Markov chain. The decoding space is constructed as a phoneme network space through the hidden Markov model (HMM), and the search algorithm is a dynamic-programming pruning algorithm.
TABLE 1
Chinese characters	Phonemes
一 (one)	yi1
一事 (a series of events)	yi2 shi4
打开 (open)	da3 kai1
一中全会 (plenary session)	yi1 zhong1 quan2 hui4
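Such a dictionary is a lookup table from character strings to phoneme sequences. A toy sketch with entries mirroring Table 1; the greedy longest-match segmentation is an illustrative choice, not necessarily the patent's:

```python
# toy lexicon in the spirit of Table 1 (pinyin with tone digits as phonemes)
lexicon = {
    "一": ["yi1"],
    "打开": ["da3", "kai1"],
    "一中全会": ["yi1", "zhong1", "quan2", "hui4"],
}

def to_phonemes(text, lexicon):
    """Map a character string to its phoneme (pinyin) sequence
    by greedy longest-match lookup in the dictionary."""
    phones, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest entry first
            if text[i:j] in lexicon:
                phones += lexicon[text[i:j]]
                i = j
                break
        else:
            raise KeyError(text[i])            # character not in the lexicon
    return phones

print(to_phonemes("打开", lexicon))  # ['da3', 'kai1']
```

The decoder uses this mapping in reverse as well, turning the phoneme network's best path back into characters.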
The working process of the invention is as follows:
1. A person speaks, causing the laryngeal prominence to vibrate; the flexible wearable sensor attached to the neck then produces a corresponding electrical output signal whose amplitude and frequency variations match the vibration of the laryngeal prominence. The Bluetooth module filters the analog signal output by the wearable sensor and encodes it into a digital signal through the analog-to-digital converter and filter, and then, powered by the power module, transmits it to the signal receiving and processing unit.
2. The signal receiving and processing unit obtains the voice digital signal through the Bluetooth module, filters the voice signal, and removes the silence at the beginning and end using an endpoint detection technique to reduce interference with subsequent steps. A moving window function with a frame length of 25-50 ms and a frame shift of up to 10 ms splits the audio signal into a series of frames. Finally, each frame is processed with Mel-frequency cepstral coefficients (MFCC) and converted into a multi-dimensional vector containing the sound information, i.e., the feature vector.
3. The feature vector of each frame obtained by the speech signal receiving and processing unit is input into an acoustic model based on a deep neural network and hidden Markov model (DNN-HMM, as shown in fig. 5) and output as the phoneme (pinyin) information corresponding to the frame.
4. A Chinese character network space is constructed using the language model, and a phoneme (pinyin) network space is then constructed through the dictionary using the hidden Markov model (HMM). An optimal path is searched in the phoneme network space through a dynamic-programming pruning algorithm, so that the accumulated probability along the path is maximized; the output along this path is the recognition result corresponding to the voice signal.
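The pruning in step 4 can be illustrated with a beam search, which keeps only the best few partial paths at each step instead of all of them. The word options and probabilities below are invented for illustration:

```python
import math
import heapq

def beam_search(steps, beam_width=2):
    """Keep only the `beam_width` best partial paths at each step;
    this pruning is the core idea of the dynamic-programming search."""
    beams = [(0.0, [])]                      # (log-probability, path so far)
    for options in steps:                    # options: {word: probability}
        candidates = [(lp + math.log(p), path + [w])
                      for lp, path in beams for w, p in options.items()]
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams[0][1]                       # highest-scoring complete path

steps = [{"i": 0.7, "e": 0.3},
         {"am": 0.6, "an": 0.4},
         {"robot": 0.8, "rowboat": 0.2}]
print(beam_search(steps))  # ['i', 'am', 'robot']
```

With a narrow beam, low-probability branches are discarded early, which is what keeps the search over the full phoneme network tractable.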
The following example illustrates a simple speech recognition procedure:
(1) Voice signal: the acquired voice signal is converted into a PCM file; the spoken content is "I am a robot" (wo shi ji qi ren).
(2) Feature extraction: the feature vector [0.11 0.82 1.2]^T is extracted.
(3) Acoustic model: [0.11 0.82 1.2]^T corresponds to "wo shi ji qi ren".
(4) Dictionary: nest: wo; I: wo; is: shi; machine: ji; device: qi; person: ren; level: ji.
(5) Language model: I: 0.0786; is: 0.0546; I am: 0.0898; machine: 0.0967; I am a robot: 0.6785.
(6) Output: I am a robot.
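Step (6) simply selects the candidate with the highest accumulated language-model score from step (5). As a one-line sketch, with the scores copied from the example above:

```python
# candidate scores as listed in step (5) of the worked example
scores = {"I": 0.0786, "is": 0.0546, "I am": 0.0898,
          "machine": 0.0967, "I am a robot": 0.6785}
best = max(scores, key=scores.get)  # candidate with the highest score
print(best)  # I am a robot
```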

Claims (10)

  1. A flexible wearable sensor based speech recognition device, comprising:
    a voice acquisition unit, comprising a flexible wearable sensor and an analog-to-digital conversion unit, wherein the flexible wearable sensor is attached to the neck, acquires the vibration signal of the laryngeal prominence during speech and converts it into an analog electrical signal, and the analog-to-digital conversion unit receives the analog electrical signal and encodes it into a digital signal;
    a voice signal receiving and processing unit, connected to the voice acquisition unit, which preprocesses the audio data of the digital signal and then extracts the feature vector of the voice signal;
    and a voice recognition network unit, connected to the voice signal receiving and processing unit, which decodes the feature vectors extracted by the voice signal receiving and processing unit, constructs a search space using a dictionary, an acoustic model and a language model, and searches for an optimal path in the search space through a search algorithm to obtain the voice recognition result.
  2. The flexible wearable sensor-based speech recognition device of claim 1, wherein the audio data preprocessing specifically comprises:
    step 1, the voice signal receiving and processing unit acquires the digital signal, filters the voice signal, and then removes the silence at the beginning and end using an endpoint detection technique;
    step 2, the processed audio signal is split into a series of frames using a moving window function;
    and step 3, each frame is processed with algorithms such as perceptual linear prediction (PLP) or Mel cepstral coefficients and converted into a feature vector containing the sound information.
  3. The flexible wearable sensor-based voice recognition device of claim 1, wherein the specific steps of the voice recognition are as follows:
    step 1, the feature vector of each frame obtained by the speech signal receiving and processing unit is input into an acoustic model based on a deep neural network and a hidden Markov model; the acoustic model scores each feature vector on its acoustic features and outputs the phoneme information corresponding to the frame;
    step 2, a Chinese character network space is constructed using the language model, and a phoneme network space is then constructed through the dictionary;
    and step 3, an optimal path is searched in the phoneme network space through a dynamic-programming pruning algorithm, so that the accumulated probability along the path is maximized; the output along this path is the recognition result corresponding to the voice signal.
  4. The flexible wearable sensor-based speech recognition device of claim 3, wherein: the dictionary is a mapping relation between Chinese characters and phonemes.
  5. The flexible wearable sensor-based speech recognition device of claim 4, wherein: the phoneme set for Chinese characters consists of all initials and finals.
  6. The flexible wearable sensor-based speech recognition device of claim 3, wherein the language model employs an N-Gram model that derives the probabilities of associations between individual characters or words by training on text information.
  7. The voice recognition device based on the flexible wearable sensor according to claim 1, wherein the voice acquisition unit comprises a Bluetooth module, and the voice acquisition unit and the voice signal receiving and processing unit adopt Bluetooth wireless transmission; the analog-to-digital conversion unit is integrated in the Bluetooth module.
  8. The flexible wearable sensor-based speech recognition device of claim 7, wherein the Bluetooth module comprises a filtering unit.
  9. The speech recognition device based on the flexible wearable sensor according to claim 1, wherein the speech acquisition unit comprises a filtering unit, and the analog electrical signal is encoded into a digital signal after being processed by the filtering unit.
  10. The flexible wearable sensor-based speech recognition device of claim 1, wherein the speech acquisition unit comprises a power module.
CN201910962682.7A 2019-10-11 2019-10-11 Speech recognition equipment based on flexible wearable sensor Pending CN110738991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910962682.7A CN110738991A (en) 2019-10-11 2019-10-11 Speech recognition equipment based on flexible wearable sensor


Publications (1)

Publication Number Publication Date
CN110738991A true CN110738991A (en) 2020-01-31

Family

ID=69269957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910962682.7A Pending CN110738991A (en) 2019-10-11 2019-10-11 Speech recognition equipment based on flexible wearable sensor

Country Status (1)

Country Link
CN (1) CN110738991A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111616705A (en) * 2020-05-07 2020-09-04 清华大学 Flexible sensor for multi-modal muscle movement signal perception

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103961073A (en) * 2013-01-29 2014-08-06 中国科学院苏州纳米技术与纳米仿生研究所 Piezoresistive electronic skin and preparation method thereof
CN204089762U (en) * 2014-08-10 2015-01-07 纳米新能源(唐山)有限责任公司 Film sound-controlled switching device and apply its system
CN104575500A (en) * 2013-10-24 2015-04-29 中国科学院苏州纳米技术与纳米仿生研究所 Application of electronic skin in voice recognition, voice recognition system and voice recognition method
CN104836472A (en) * 2014-02-07 2015-08-12 北京纳米能源与系统研究所 Generator utilizing acoustic energy and sound transducer
CN105333943A (en) * 2014-08-14 2016-02-17 北京纳米能源与系统研究所 Sound sensor and sound detection method by using sound sensor
CN105326495A (en) * 2015-10-19 2016-02-17 杨军 Method for manufacturing and using wearable flexible skin electrode
CN106409289A (en) * 2016-09-23 2017-02-15 合肥华凌股份有限公司 Environment self-adaptive method of speech recognition, speech recognition device and household appliance
US20170069306A1 (en) * 2015-09-04 2017-03-09 Foundation of the Idiap Research Institute (IDIAP) Signal processing method and apparatus based on structured sparsity of phonological features
CN107633842A (en) * 2017-06-12 2018-01-26 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN108168734A (en) * 2018-02-08 2018-06-15 南方科技大学 Flexible electronic skin based on cilium temperature sensing and preparation method thereof
CN108172218A (en) * 2016-12-05 2018-06-15 中国移动通信有限公司研究院 A kind of pronunciation modeling method and device
CN109036381A (en) * 2018-08-08 2018-12-18 平安科技(深圳)有限公司 Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN109285544A (en) * 2018-10-25 2019-01-29 江海洋 Speech monitoring system
CN109410914A (en) * 2018-08-28 2019-03-01 江西师范大学 A kind of Jiangxi dialect phonetic and dialect point recognition methods
CN109584896A (en) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 A kind of speech chip and electronic equipment




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200131