CN110853629A - Speech recognition digital method based on deep learning - Google Patents
- Publication number
- CN110853629A CN110853629A CN201911149493.4A CN201911149493A CN110853629A CN 110853629 A CN110853629 A CN 110853629A CN 201911149493 A CN201911149493 A CN 201911149493A CN 110853629 A CN110853629 A CN 110853629A
- Authority
- CN
- China
- Prior art keywords
- pinyin
- digital
- chinese
- ctc
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
All classifications fall under G—PHYSICS, G10—MUSICAL INSTRUMENTS; ACOUSTICS, G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING, G10L15/00—Speech recognition:
- G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225: Feedback of the input speech
Abstract
The invention discloses a deep-learning-based method for recognizing digits in speech. Toneless Chinese pinyin is used as the modeling unit of the acoustic model to construct an end-to-end deep neural network from speech to pinyin. The network is modeled with a CNN + CTC structure, and at the CTC decoding stage a digit-pinyin constraint is innovatively added on top of the CTC maximum-decoding algorithm. This greatly reduces the CTC decoding search space, so that spoken digits can be recognized efficiently and accurately.
Description
Technical Field
The invention belongs to the technical field of speech digit recognition, and particularly relates to a deep-learning-based method for recognizing digits in speech.
Background
Speech digit recognition is an important branch of automatic speech recognition (ASR) technology and plays an important role in computer applications such as user identity verification, liveness authentication, and web data capture. In practical scenarios, however, the speech to be recognized may contain accents, dialects, background noise, and other complicating factors, which makes high-accuracy recognition of spoken digit verification codes a considerable challenge.
Chinese patent application CN201910560346.X, entitled "Speech digit recognition method and device", discloses a method for recognizing digital speech data comprising: acquiring the digital speech data to be recognized; extracting a spectral feature vector of the data using a short-time Fourier transform; and recognizing the feature vector with a preset DS2 network model to obtain the recognized digit, where the preset model is obtained by resetting the output of the last fully connected layer of an initial DS2 network to the 10 digits 0 to 9 and training it.
With the rapid development of deep learning (DL), acoustic models based on deep neural networks (DNN) have come to significantly outperform traditional GMM-HMM models.
Disclosure of Invention
Different from the prior art, the invention provides a deep-learning-based digit recognition method that recognizes spoken digits efficiently and accurately on the basis of a CNN + CTC network model.
The invention is mainly realized by the following technical scheme: a deep-learning-based speech digit recognition method uses toneless Chinese pinyin as the modeling unit of the acoustic model and a CNN + CTC structure to construct an end-to-end deep neural network model from speech to pinyin; after the model is trained, a CTC decoding algorithm with a digit-pinyin constraint is used to decode and recognize the spoken digits.
Further, in order to better implement the invention, the method specifically comprises the following steps:
step S100: collecting audio annotation data, and cleaning and preprocessing it to obtain toneless Chinese pinyin and spectrograms;
step S200: taking the toneless Chinese pinyin of step S100 as the modeling unit of the acoustic model, inputting the two-dimensional spectrogram matrix obtained in step S100 into the acoustic model, and training it as a CNN + CTC model;
step S300: based on the acoustic model of step S200, performing maximum decoding with the digit-pinyin-constrained CTC decoding algorithm to recognize the speech to be recognized as digit pinyin;
step S400: obtaining the final Arabic numeral sequence from the correspondence between digit pinyin and Arabic numerals.
In this way, the method takes toneless Chinese pinyin as the modeling unit of the acoustic model, constructs an end-to-end speech-to-pinyin deep neural network modeled with a CNN + CTC structure, and, at the CTC decoding stage, adds a digit-pinyin constraint on top of the CTC maximum-decoding algorithm, which greatly reduces the CTC decoding search space and allows spoken digits to be recognized efficiently and accurately.
Further, when the audio annotation data is collected in step S100, at least 200 hours of standard Chinese speech data are collected. The data is provided by multiple speakers with a balanced male-female ratio, and each speaker's speech consists of multiple audio clips; each clip serves as one sample of the standard Chinese speech data and carries its corresponding annotated Chinese characters.
Further, the total speaking time of each speaker does not exceed 30 minutes, and no single sample of the standard Chinese speech data exceeds 30 seconds; the audio format of every sample is single-channel, 16 kHz sample rate, 16-bit WAV.
Further, cleaning and preprocessing the audio annotation data in step S100 specifically comprises: deleting samples containing non-Chinese symbols; removing punctuation from the annotated Chinese characters and, where Arabic numerals appear, converting them into the corresponding Chinese characters; then uniformly converting the Chinese characters into toneless pinyin; and framing the audio signal of each sample, applying a short-time Fourier transform to each frame, and finally forming a spectrogram.
Further, the architecture of the acoustic model in step S200 is a 10-layer CNN followed by 1 fully connected layer.
Further, in step S300 the digit-pinyin-constrained CTC decoding algorithm performs maximum decoding of the speech to be recognized: the labels decoded at all time steps form paths, the optimal path is generated, and the optimal path sequence is converted into the final digit-pinyin sequence consisting of digit pinyins;
that is, the digit-pinyin constraint reduces the CTC decoding search range from all Chinese pinyins to the digit pinyins only.
Further, the optimal path sequence is converted into the final digit-pinyin sequence according to the following steps:
step S310: if consecutive repeated digit pinyins or BLANKs occur, merge them and then go to step S320; otherwise go directly to step S320;
step S320: remove all BLANKs; if the digit pinyin before and after a BLANK is the same, the repetition is preserved after the BLANK is removed.
The invention has the beneficial effects that:
(1) The invention achieves high-accuracy recognition with only a small amount of annotated speech and needs no dedicated digit-pronunciation data: the speech may contain any Chinese characters, and such data is freely and easily obtained from open-source datasets.
(2) The acoustic model uses deep learning combined with the CTC-based decoding method, so audio features are extracted automatically, saving a large amount of manual feature-extraction work.
(3) The modeling unit of the acoustic model is toneless pinyin, which makes the model highly robust to dialects; digits spoken in various tones can be recognized accurately.
Drawings
FIG. 1 is a schematic diagram of the architecture of the acoustic model of the present invention.
FIG. 2 is a schematic flow diagram of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments will be clearly and completely described below with reference to the accompanying drawings. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments presented in the figures is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
A deep-learning-based speech digit recognition method uses toneless Chinese pinyin as the modeling unit of the acoustic model and a CNN + CTC structure to construct an end-to-end deep neural network model from speech to pinyin; after training, a CTC decoding algorithm with a digit-pinyin constraint decodes the output to recognize the spoken digits.
The method specifically comprises the steps S100-S400.
Step S100: and collecting audio annotation data, and cleaning and preprocessing the audio annotation data to obtain the Chinese pinyin and the spectrogram without tones.
The method specifically comprises the following steps:
1. Collect more than 200 hours of annotated Chinese speech from multiple speakers with a balanced male-female ratio, with each speaker's total speaking time not exceeding 30 minutes. Each speaker's speech consists of several clips; each clip is one data sample and lasts at most 30 seconds. The audio format is unified to single-channel, 16 kHz sample rate, 16-bit WAV, and each audio clip has a corresponding text transcript.
2. Delete samples containing non-Chinese symbols such as English, remove punctuation from the transcripts, and convert any Arabic numerals into the corresponding Chinese characters. Finally, uniformly convert the Chinese characters into toneless pinyin.
For example: the following notations are provided:
today is No. 15, and the weather is clear.
The result after conversion is:
jin tian shi shi wu hao tian qi qing lang
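The digit-relevant part of this transcript cleaning can be sketched in Python. The lookup tables below are an illustrative assumption covering only the ten digit characters; a full system would use a complete Chinese pronunciation lexicon (e.g. the pypinyin package) and, as the example above shows, would also need number-word normalization, since "15" in running text reads as "shi wu" rather than digit by digit:

```python
# Minimal sketch (not from the patent) of the digit part of label cleaning:
# Arabic numerals are rewritten as Chinese characters, which are then
# mapped to toneless pinyin. Only the ten digit characters are covered.
DIGIT_TO_HANZI = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                  "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}
HANZI_TO_PINYIN = {"零": "ling", "一": "yi", "二": "er", "三": "san",
                   "四": "si", "五": "wu", "六": "liu", "七": "qi",
                   "八": "ba", "九": "jiu"}

def normalize_digits(text):
    """Replace each Arabic numeral with its Chinese-character reading."""
    return "".join(DIGIT_TO_HANZI.get(ch, ch) for ch in text)

def hanzi_to_toneless_pinyin(text):
    """Map characters to toneless pinyin (digit characters only here)."""
    return [HANZI_TO_PINYIN[ch] for ch in text if ch in HANZI_TO_PINYIN]

print(hanzi_to_toneless_pinyin(normalize_digits("502")))  # ['wu', 'ling', 'er']
```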
3. Frame the audio waveform of each sample, apply a short-time Fourier transform to each frame, and assemble the results into a spectrogram. The spectrogram is a time-ordered sequence of vectors, where the vector at each time step is the audio feature at that moment.
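The framing and short-time Fourier transform step can be sketched with NumPy as follows. The 25 ms frame, 10 ms hop, and Hamming window are common defaults assumed for illustration, not values specified by the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Frame the waveform, window each frame, and take the magnitude of a
    short-time Fourier transform. Frame/hop sizes (25 ms / 10 ms at 16 kHz)
    are typical choices, not values fixed by the patent."""
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame: shape (T, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

sig = np.random.randn(16000)  # one second of audio at 16 kHz
feats = spectrogram(sig)
print(feats.shape)  # (98, 257)
```

The resulting two-dimensional matrix (time frames by frequency bins) is what step S200 feeds into the acoustic model.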
Step S200: taking the toneless Chinese pinyin of step S100 as the modeling unit of the acoustic model, input the two-dimensional spectrogram matrix obtained in step S100 into the acoustic model and train it as a CNN + CTC model.
That is, an acoustic model with toneless pinyin as its modeling unit is trained using the CNN + CTC structure, with the two-dimensional spectrogram matrix of step S100 as its input.
As shown in Fig. 1, the architecture of the acoustic model is 10 CNN layers followed by a fully connected layer, with CTC as the loss function.
For example: suppose the spectrogram feature of the sample is x(i)The corresponding phonetic notation is y(i)。
x(i)And y(i)Belongs to the training set X { (X)(1),y(1)),(x(2),y(2)) ,. for each x(i)Assume that its timing length is T(i)Then the audio characteristic at each moment is
The output of the acoustic model is the probability distribution for all pinyins at each time t, given xWherein the content of the first and second substances,i.e., the set of all pinyin and BLANK marks. Assuming concealment of the l-th layerLayer is hlThe length of the sliding window is c, and the convolution kernel parameter is wlIf f is the nonlinear transformation ReLU, the convolution layer is calculated at time t as follows:
the output of the last convolutional layer is input to a full link layer, WlFor the parameters of the fully connected layer, the calculation of the fully connected layer is as follows:
the final output layer L is a softmax layer, calculated as follows:
Step S300: based on the acoustic model of step S200, perform maximum decoding with the digit-pinyin-constrained CTC decoding algorithm to recognize the speech to be recognized as digit pinyin.
The invention adds the digit-pinyin constraint to the CTC decoding algorithm, reducing the CTC search range from all Chinese pinyins to the digit pinyins. Let the path formed by the labels decoded at all time steps be $l' = (l'_1, l'_2, \ldots, l'_T)$; the optimal path $l^*$ is then given by:
$$l^* = \arg\max_{l'} \prod_{t=1}^{T} p(l'_t \mid x), \quad \text{s.t. } l'_t \in \{\text{ling}, \text{yi}, \text{er}, \text{san}, \text{si}, \text{wu}, \text{liu}, \text{qi}, \text{ba}, \text{jiu}, \text{BLANK}\} \tag{4}$$
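Constraint (4) amounts to taking the per-frame argmax over only the ten digit pinyins plus BLANK. A minimal sketch, with an invented toy vocabulary standing in for the full pinyin set:

```python
import numpy as np

DIGIT_PINYIN = ["ling", "yi", "er", "san", "si",
                "wu", "liu", "qi", "ba", "jiu"]

def constrained_greedy_decode(probs, vocab, allowed):
    """Per-frame maximum decoding restricted to an allowed label subset:
    at every time step the argmax is taken only over digit pinyins and
    BLANK, shrinking the search space as in constraint (4)."""
    allowed_idx = [i for i, lab in enumerate(vocab) if lab in allowed]
    return [vocab[max(allowed_idx, key=lambda i: frame[i])]
            for frame in probs]

# Toy vocabulary: two non-digit pinyins, the ten digits, and BLANK.
vocab = ["tian", "hao"] + DIGIT_PINYIN + ["BLANK"]
allowed = set(DIGIT_PINYIN) | {"BLANK"}
rng = np.random.default_rng(1)
probs = rng.random((4, len(vocab)))  # fake per-frame label probabilities
path = constrained_greedy_decode(probs, vocab, allowed)
print(path)  # four labels, each a digit pinyin or BLANK
```

Even when a non-digit pinyin such as "tian" has the highest score in a frame, it can never appear in the decoded path.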
Step S310: if consecutive repeated pinyins or BLANKs occur, merge them;
Step S320: remove all BLANKs; if the pinyin before and after a BLANK is the same, the repetition is preserved after removal.
For example: the following optimal path sequence:
BLANK san BALNK jiu liu qi qi ba wu BLANK wu
the combined results are:
san jiu liu qi ba wu wu。
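Steps S310 and S320 can be sketched as a short collapse function; applied to the optimal path shown above, it reproduces the example result:

```python
def collapse_ctc_path(path, blank="BLANK"):
    """Merge consecutive duplicates (step S310), then delete BLANK
    (step S320). Repeats separated by a BLANK survive as genuine
    repetitions, e.g. 'wu BLANK wu' becomes 'wu wu'."""
    merged = [path[0]] if path else []
    for label in path[1:]:
        if label != merged[-1]:
            merged.append(label)
    return [label for label in merged if label != blank]

path = ["BLANK", "san", "BLANK", "jiu", "liu", "qi", "qi",
        "ba", "wu", "BLANK", "wu"]
print(collapse_ctc_path(path))  # ['san', 'jiu', 'liu', 'qi', 'ba', 'wu', 'wu']
```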
Step S400: obtain the final Arabic numeral sequence from the correspondence between digit pinyin and Arabic numerals.
For example, the result for "san jiu liu qi ba wu wu" is 3967855.
The invention thus provides a speech digit recognition method that takes toneless Chinese pinyin as the modeling unit of the acoustic model and constructs an end-to-end speech-to-pinyin deep neural network. The model is built with a CNN + CTC structure, and at the CTC decoding stage the invention innovatively adds a digit-pinyin constraint on top of the CTC maximum-decoding algorithm, greatly reducing the CTC decoding search space and recognizing spoken digits efficiently and accurately.
(1) Annotated speech data is costly to produce; the invention achieves high-accuracy recognition with only a small amount of it and needs no dedicated digit-pronunciation recordings: the speech may contain any Chinese characters, and such data is freely and easily obtained from open-source datasets. (2) The acoustic model uses deep learning combined with CTC-based decoding, so audio features are extracted automatically, saving a large amount of manual feature-extraction work. (3) The modeling unit of the acoustic model is toneless pinyin, which makes the model robust to dialects, so digits spoken in various tones are recognized accurately.
The above is only a preferred embodiment of the present invention and does not limit it in any way; any simple modification or equivalent variation of the above embodiment made according to the technical spirit of the present invention falls within the scope of protection of the present invention.
Claims (8)
1. A deep-learning-based speech digit recognition method, characterized in that toneless Chinese pinyin is used as the modeling unit of an acoustic model, a CNN + CTC structure is used to construct an end-to-end deep neural network model from speech to pinyin, and, after the model is trained, a CTC decoding algorithm with a digit-pinyin constraint is used to decode and thereby recognize the spoken digits.
2. The deep-learning-based speech digit recognition method according to claim 1, characterized by comprising the following steps:
step S100, collecting audio annotation data and cleaning and preprocessing it to obtain toneless Chinese pinyin and spectrograms;
step S200, taking the toneless Chinese pinyin of step S100 as the modeling unit of the acoustic model, inputting the two-dimensional spectrogram matrix obtained in step S100 into the acoustic model, and training it as a CNN + CTC model;
step S300, based on the acoustic model of step S200, performing maximum decoding with the digit-pinyin-constrained CTC decoding algorithm to recognize the speech to be recognized as digit pinyin;
and step S400, obtaining the final Arabic numeral sequence from the correspondence between digit pinyin and Arabic numerals.
3. The deep-learning-based speech digit recognition method according to claim 2, characterized in that collecting the audio annotation data in step S100 comprises collecting at least 200 hours of standard Chinese speech data, wherein the data is provided by a plurality of speakers with a balanced male-female ratio and each speaker's speech consists of a plurality of audio clips; each audio clip serves as one sample of the standard Chinese speech data and carries its corresponding annotated Chinese characters.
4. The deep-learning-based speech digit recognition method according to claim 3, characterized in that the total speaking time of each speaker does not exceed 30 minutes and no single sample of the standard Chinese speech data exceeds 30 seconds; the audio format of each sample is single-channel, 16 kHz sample rate, 16-bit WAV.
5. The deep-learning-based speech digit recognition method according to claim 3, characterized in that cleaning and preprocessing the audio annotation data in step S100 specifically comprises: deleting samples containing non-Chinese symbols; removing punctuation from the annotated Chinese characters and, where Arabic numerals appear, converting them into the corresponding Chinese characters; then uniformly converting the Chinese characters into toneless Chinese pinyin; and framing the audio signal of each sample, applying a short-time Fourier transform to each frame, and finally forming a spectrogram.
6. The deep-learning-based speech digit recognition method according to claim 2, characterized in that the acoustic model in step S200 is constructed as a 10-layer CNN convolutional neural network followed by 1 fully connected layer.
7. The deep-learning-based speech digit recognition method according to claim 2, characterized in that in step S300 the digit-pinyin-constrained CTC decoding algorithm performs maximum decoding of the speech to be recognized, the labels decoded at all time steps form paths from which the optimal path is generated, and the optimal path sequence is converted into the final digit-pinyin sequence consisting of digit pinyins;
that is, the digit-pinyin constraint reduces the CTC decoding search range from all Chinese pinyins to the digit pinyins only.
8. The deep-learning-based speech digit recognition method according to claim 7, characterized in that the optimal path sequence is converted into the final digit-pinyin sequence according to the following steps:
step S310, if consecutive repeated digit pinyins or BLANKs occur, merging them and then jumping to step S320; if there are none, jumping directly to step S320;
step S320, removing all BLANKs; if the digit pinyin before and after a BLANK is the same, the repetition is preserved after the BLANK is removed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911149493.4A CN110853629A (en) | 2019-11-21 | 2019-11-21 | Speech recognition digital method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911149493.4A CN110853629A (en) | 2019-11-21 | 2019-11-21 | Speech recognition digital method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110853629A true CN110853629A (en) | 2020-02-28 |
Family
ID=69603396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911149493.4A Pending CN110853629A (en) | 2019-11-21 | 2019-11-21 | Speech recognition digital method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110853629A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105869624A (en) * | 2016-03-29 | 2016-08-17 | 腾讯科技(深圳)有限公司 | Method and apparatus for constructing speech decoding network in digital speech recognition |
US20160351188A1 (en) * | 2015-05-26 | 2016-12-01 | Google Inc. | Learning pronunciations from acoustic sequences |
CN107358951A (en) * | 2017-06-29 | 2017-11-17 | 阿里巴巴集团控股有限公司 | A kind of voice awakening method, device and electronic equipment |
CN109065032A (en) * | 2018-07-16 | 2018-12-21 | 杭州电子科技大学 | A kind of external corpus audio recognition method based on depth convolutional neural networks |
CN109272990A (en) * | 2018-09-25 | 2019-01-25 | 江南大学 | Audio recognition method based on convolutional neural networks |
CN110288995A (en) * | 2019-07-19 | 2019-09-27 | 出门问问(苏州)信息科技有限公司 | Exchange method, device, storage medium and electronic equipment based on speech recognition |
CN110299132A (en) * | 2019-06-26 | 2019-10-01 | 京东数字科技控股有限公司 | A kind of speech digit recognition methods and device |
US10468019B1 (en) * | 2017-10-27 | 2019-11-05 | Kadho, Inc. | System and method for automatic speech recognition using selection of speech models based on input characteristics |
2019-11-21: application CN201911149493.4A filed in China; publication CN110853629A; status Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111246026A (en) * | 2020-03-11 | 2020-06-05 | 兰州飞天网景信息产业有限公司 | Recording processing method based on convolutional neural network and connectivity time sequence classification |
CN111833869A (en) * | 2020-07-01 | 2020-10-27 | 中关村科学城城市大脑股份有限公司 | Voice interaction method and system applied to urban brain |
CN111833869B (en) * | 2020-07-01 | 2022-02-11 | 中关村科学城城市大脑股份有限公司 | Voice interaction method and system applied to urban brain |
CN111710330A (en) * | 2020-07-29 | 2020-09-25 | 深圳波洛斯科技有限公司 | Environmental noise elimination method and device based on deep neural network and storage medium |
CN112104457A (en) * | 2020-08-28 | 2020-12-18 | 苏州云葫芦信息科技有限公司 | Method and system for generating verification code for converting digital to Chinese character type |
CN112104457B (en) * | 2020-08-28 | 2022-06-17 | 苏州云葫芦信息科技有限公司 | Method and system for generating verification code for converting numbers into Chinese character types |
CN112767923A (en) * | 2021-01-05 | 2021-05-07 | 上海微盟企业发展有限公司 | Voice recognition method and device |
CN113380231A (en) * | 2021-06-15 | 2021-09-10 | 北京一起教育科技有限责任公司 | Voice conversion method and device and electronic equipment |
CN113380231B (en) * | 2021-06-15 | 2023-01-24 | 北京一起教育科技有限责任公司 | Voice conversion method and device and electronic equipment |
CN113506584A (en) * | 2021-07-06 | 2021-10-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
CN113506584B (en) * | 2021-07-06 | 2024-05-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
Legal Events
Date | Code | Title
---|---|---
2020-02-28 | PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| RJ01 | Rejection of invention patent application after publication