CN110827791A - Edge-device-oriented speech recognition-synthesis combined modeling method - Google Patents
- Publication number
- Publication number: CN110827791A (application number CN201910847985.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING (under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS)
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training (creation of reference templates; adaptation to the characteristics of the speaker's voice)
- G10L15/26—Speech to text systems
- G10L25/24—Speech or voice analysis techniques in which the extracted parameters are the cepstrum
- G10L25/30—Speech or voice analysis techniques using neural networks
Abstract
An edge-device-oriented speech recognition-synthesis joint modeling method. Drawing on research into real-time computation and the workload distribution of edge-computing strategies, and inspired by the "imitate without distortion" party game (akin to the telephone game), the method iterates a model that couples speech recognition and speech synthesis at the back end. A real-time, high-efficiency processing module is built on speech-enhancement techniques from the audio-processing field, and an iterative recognition and synthesis model for Chinese dialects is built on speech recognition and speech synthesis technology. By fully exploiting the characteristics of these speech technologies, the method realizes a dialect-processing model capable of recognition and synthesis at high efficiency, makes effective use of the processing capacity of the edge environment, and combines speech recognition with speech synthesis to yield a speech model with richer functionality and more robust performance.
Description
Technical Field
The invention belongs to the technical fields of edge computing and audio research. It involves edge servers, speech enhancement, speech recognition, speech synthesis, and neural networks, and in particular relates to an edge-device-oriented speech recognition-synthesis joint modeling method.
Background
In the era of Industry 4.0, the rapid rise of artificial intelligence and the Internet of Things (IoT) has brought great convenience to everyday life, and a large number of intelligent products have emerged. Meanwhile, with the development of edge computing in recent years, edge-computing strategies can effectively distribute heavy computational workloads, meet real-time requirements, and increase a model's effective computing capacity. This opens up broad possibilities for continually extending the functions of intelligent products.
Continued advances in neural networks and deep learning have driven major breakthroughs in related research, most visibly in the speech and image domains. In recent years, speech processing, speech recognition, and speech synthesis have developed rapidly and attracted wide attention. Nevertheless, several technical problems still call for optimization: the real-time performance of machine processing, the robustness of intelligent applications, and the completeness of object features, among others. When some Chinese dialect data were tested with the "Dictation Master" WeChat mini-program, several of the models under test recognized them poorly. The causes are mostly poor input-data characteristics, insufficient model capability, and program defects. Besides defect detection, therefore, improving model performance and data processing is an effective way to raise accuracy and address this problem. It is thus important to make full use of model-optimization techniques from the speech field and mobile-computing technology, and to extend application functionality by exploiting model characteristics.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide an edge-device-oriented speech recognition-synthesis joint modeling method. A real-time, high-efficiency processing module is built on speech-enhancement techniques from the audio-processing field; an iterative speech recognition and synthesis model for Chinese dialects is built on speech recognition and speech synthesis technology; and the characteristics of these technologies are fully exploited to realize a real-time dialect-processing model that recognizes, synthesizes, and runs efficiently.
To achieve this aim, the invention adopts the following technical scheme:
an edge-device-oriented speech recognition-synthesis joint modeling method is characterized by comprising the following steps:
1) collecting a data set sample:
collecting audio data from various environments and dividing it into two classes: (a) clean audio recorded in a quiet environment and (b) noise audio of different types, where class (b) is drawn from a noise library;
2) performing data processing:
first performing noise fusion: noise is added to the clean audio, and each clean clip is packaged together with its corresponding noise-added clip;
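For illustration, the noise-fusion step can be sketched as mixing a noise clip into clean audio at a chosen signal-to-noise ratio. The helper names and the default SNR below are illustrative assumptions of this sketch; the patent does not specify the mixing parameters.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR (dB)."""
    # Tile or trim the noise so it covers the whole clean clip.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(np.asarray(clean, dtype=np.float64) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=np.float64) ** 2)
    # Scale factor that brings the noise power to p_clean / 10^(snr_db/10).
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def make_pairs(clean_clips, noise, snr_db=10.0):
    """Package each clean clip with its noise-added counterpart, as in step 2)."""
    return [(c, mix_at_snr(c, noise, snr_db)) for c in clean_clips]
```

A 16 kHz PCM clip loaded as a float array can be passed directly; the paired output matches the "clean audio plus corresponding noise-added audio" packaging described above.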
3) building an edge server:
building stable edge-server equipment that provides reliable upload and download interfaces, together with an algorithmic speech-enhancement module. The module applies spectral subtraction, wavelet hard thresholding, and a GAN network model, and uses a speech recognition engine with a voting scheme to obtain the optimal result, which serves as the optimization step of audio preprocessing. Audio front-end processing (dereverberation, noise reduction, noise separation, and the like) is performed on this layer of equipment; the fusion of wavelet, spectral-subtraction, and neural-network models is used to screen the model best suited to the noise at hand, and the voting scheme selects the method that yields the higher audio quality;
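As one of the enhancement candidates named above, spectral subtraction can be sketched with a short-time Fourier transform: estimate the noise magnitude spectrum, subtract it from each frame, and resynthesize by overlap-add. Estimating the noise from the first few frames and the 1% spectral floor are assumptions of this sketch, not parameters given by the patent.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=5):
    """Magnitude spectral subtraction with overlap-add resynthesis."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Noise estimate: average magnitude of the leading (assumed noise-only) frames.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the estimate; keep a small spectral floor to limit musical noise.
    clean_mag = np.maximum(mag - noise_mag, 0.01 * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += clean[i]
    return out
```

The wavelet and GAN candidates would expose the same audio-in, audio-out signature, so all three can feed the voting step unchanged.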
4) building a Chinese dialect speech recognition model:
adopting a basic CNN + RNN model architecture, in which the CNN performs secondary feature processing after the MFCC speech features are extracted; the audio spectrogram is taken as an input feature, and the extracted spectrogram feature vector and MFCC feature vector are normalized. A Chinese dialect speech synthesis model is built on an improved WaveNet model and provides an interface for multi-dimensional cross fusion. The joint speech recognition and speech synthesis models are deployed in the upper-layer cloud and receive the processing results of the edge layer as their input source;
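The normalization of the spectrogram and MFCC feature vectors in step 4) can be sketched as per-coefficient mean-and-variance normalization of each stream followed by concatenation along the feature axis. The frames-by-coefficients layout and the epsilon guard are assumptions of this sketch.

```python
import numpy as np

def normalize(features, eps=1e-8):
    """Zero-mean, unit-variance normalization per coefficient (frames x dims)."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

def fuse_features(spectrogram_feats, mfcc_feats):
    """Normalize each stream, then concatenate along the feature axis
    to form the input fed to the CNN + RNN recognition model."""
    return np.concatenate([normalize(spectrogram_feats), normalize(mfcc_feats)],
                          axis=1)
```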
5) processing the data on the edge-service-layer equipment built in step 3): audio preprocessing by speech enhancement improves the machine intelligibility of the audio; the features of the speech sample set are then extracted and passed through the Chinese dialect speech recognition model of step 4), yielding positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2. T1 and F1 are input into the Chinese dialect speech synthesis model of step 4), and the synthesized audio falls into four output categories, T11, T12, F21, and F22: T11 means the speech recognition result is positive and the speech synthesis result is positive; T12 means the recognition result is positive and the synthesis result is negative; F21 means the recognition result is negative and the synthesis result is positive; F22 means both the recognition and synthesis results are negative;
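Step 5) amounts to tallying two booleans per sample: whether recognition was correct and whether the synthesized audio was judged correct. A minimal sketch of the bookkeeping follows, with the two evaluation functions left as caller-supplied placeholders, since the patent does not define how a synthesized clip is judged; reading acc1 and acc2 as the shares of recognition-positive and recognition-negative samples is likewise one plausible interpretation.

```python
def split_samples(samples, recognize_ok, synthesize_ok):
    """Partition samples into T11/T12/F21/F22 and report acc1/acc2.

    recognize_ok / synthesize_ok are callables returning True or False
    for a given sample, standing in for the real model pipelines.
    """
    buckets = {"T11": [], "T12": [], "F21": [], "F22": []}
    key_of = {(True, True): "T11", (True, False): "T12",
              (False, True): "F21", (False, False): "F22"}
    for s in samples:
        buckets[key_of[(recognize_ok(s), synthesize_ok(s))]].append(s)
    n_pos = len(buckets["T11"]) + len(buckets["T12"])  # recognition positives (T1)
    n_neg = len(buckets["F21"]) + len(buckets["F22"])  # recognition negatives (F1)
    acc1 = n_pos / len(samples) if samples else 0.0
    acc2 = n_neg / len(samples) if samples else 0.0
    return buckets, acc1, acc2
```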
6) evaluating the dominance ratio of the corresponding features according to the proportion of correct samples, and screening out a dominant-feature expression set as the staged model features; adjusting hyper-parameters and weights to train the speech recognition model until the loss converges, saving the model, and reloading it in combination with the speech synthesis module; iteratively training the model through the back-propagation feedback mechanism of the neural network, and setting a reasonable number of training iterations by tuning the hyper-parameters so that the network converges faster and more economically, thereby optimizing the final model. When the model's performance gradually converges and stabilizes, its robustness is assured.
Further, the processed speech data is converted into the corresponding text by speech recognition. A CCLD Chinese speech recognition network model is built with MFCC features, exploiting the feature-extraction strength of CNNs: MFCC audio features serve as the reference input, a CNN extracts key features and feeds them to an RNN with LSTM layers, and a three-layer DNN is attached as the output discriminator. A Chinese speech recognition engine is built around the speech characteristics of the Chinese data, the Chinese speech recognition model is trained, and the samples classified by the model are then divided into correct and incorrect samples.
Further, the positive and negative samples are converted into corresponding audio samples by the speech synthesis model; whether each corresponding recognition result is correct is counted, and the results are again classified as positive or negative. The text is converted into audio by the improved WaveNet-based speech synthesis model, where the text data is the output of the previous model and therefore already consists of positive and negative samples; after passing through the speech synthesis model, each sample acquires a further positive or negative label, so the resulting outcomes carry the sample attributes "positive-positive", "positive-negative", "negative-positive", and "negative-negative".
Further, according to the dominant-feature proportions of the classification results, the feature combination closest to the original audio is screened out by computation and comparison against the original audio: "positive-positive" samples form the class-A feature group, "positive-negative" samples the class-B group, "negative-positive" samples the class-C group, and "negative-negative" samples the class-D group, with priority A > B > C > D. The accuracy of each class of samples is then computed and used as the scoring criterion of feature superiority to screen out the dominant features.
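The grading rule in the paragraph above can be sketched directly: map each sample's recognition-synthesis outcome pair to a grade, then score each grade for dominant-feature screening. Treating the per-grade share of samples as the "accuracy" score, and breaking ties by the fixed A > B > C > D priority, are interpretations of the text rather than formulas the patent spells out.

```python
# Priority order from the patent: A > B > C > D.
GRADE_OF = {("positive", "positive"): "A", ("positive", "negative"): "B",
            ("negative", "positive"): "C", ("negative", "negative"): "D"}

def grade_and_score(outcomes):
    """outcomes: list of (recognition, synthesis) labels, each 'positive'/'negative'.

    Returns per-grade counts, the share of samples in each grade (used here
    as the feature-superiority score), and the resulting grade ranking.
    """
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for pair in outcomes:
        counts[GRADE_OF[pair]] += 1
    total = len(outcomes) or 1
    scores = {g: c / total for g, c in counts.items()}
    priority = {"A": 0, "B": 1, "C": 2, "D": 3}
    ranking = sorted(counts, key=lambda g: (-scores[g], priority[g]))
    return counts, scores, ranking
```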
The invention has the beneficial effects that:
1) Based on considerations of performance and resources, the invention provides a model-fusion method oriented to large edge environments. The real-time processing and task scheduling of edge devices are used to reasonably schedule audio sources of different noise levels, and the speech recognition and speech synthesis modules are combined, greatly enriching the creative scope of the new model.
2) The method evaluates the dominance ratio of corresponding features according to the proportion of correct samples, screens out dominant-feature expression sets as staged model features, and iteratively trains the model to optimize its final performance; once the model's performance gradually converges and stabilizes, its robustness is well assured.
3) The richer processing capacity of the edge environment is used effectively, and speech recognition is combined with speech synthesis to design a speech model with richer functionality and more robust performance.
4) The speech environment and speech experience of human-computer interaction are improved to a certain degree, bringing a comfortable experience to users in practical applications.
5) This novel modeling idea offers a way forward for audio equipment and demonstrates the great expressive power of feature-rich audio devices.
Drawings
FIG. 1 is an overall architecture diagram;
FIG. 2 is a diagram of an edge-side speech enhancement model;
FIG. 3 is an iterative diagram of speech recognition and speech synthesis models.
Detailed Description
The invention is further described below with reference to the drawings and examples, but it is not limited to the following examples:
as shown in fig. 1, 2 and 3, an edge device-oriented speech recognition-synthesis joint modeling method includes the following steps:
1) A data set sample is acquired. All audio data are divided into (a) clean audio recorded in a quiet environment and (b) noise audio of different types (for example white noise, pink noise, and babble, as catalogued in a noise library). The audio is sampled at 16 kHz and stored in PCM format, and the clean data cover six dialects: Shanxi, Minnan, Changsha, Sichuan, Hebei, and Shanghai;
2) Data processing is performed. Noise fusion comes first: noise is added to the clean audio, and each clean clip is packaged together with its corresponding noise-added clip;
3) An edge server is built. Audio front-end processing (dereverberation, noise reduction, noise separation, and the like) is performed on this layer of equipment; the fusion of wavelet, spectral-subtraction, and neural-network models is used to screen the model best suited to the noise at hand, and a voting scheme selects the method that yields the higher audio quality;
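The model-screening-plus-voting idea in step 3) can be sketched as: run each enhancement candidate over the noisy input, let several quality judges score every output, and pick the candidate with the most first-place votes. The judge functions here are placeholders (the patent uses measures such as a speech recognition engine's result), and the tie-breaking by mean score is an assumption of this sketch.

```python
import numpy as np

def vote_best_enhancement(noisy, methods, judges):
    """methods: {name: fn(audio) -> audio}; judges: [fn(audio) -> float].

    Each judge votes for its highest-scoring candidate; the candidate
    with the most votes wins, with mean score as the tie-breaker.
    """
    enhanced = {name: fn(noisy) for name, fn in methods.items()}
    votes = {name: 0 for name in enhanced}
    all_scores = {name: [] for name in enhanced}
    for judge in judges:
        scored = {name: judge(audio) for name, audio in enhanced.items()}
        votes[max(scored, key=scored.get)] += 1
        for name, s in scored.items():
            all_scores[name].append(s)
    winner = max(votes, key=lambda n: (votes[n], float(np.mean(all_scores[n]))))
    return winner, enhanced[winner]
```

In the patent's setting, the `methods` dictionary would hold the spectral-subtraction, wavelet-thresholding, and GAN enhancers, while each judge would wrap a quality estimate for one noise type.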
4) A Chinese dialect speech recognition model is built on a basic CNN + RNN architecture, in which the CNN performs secondary feature processing after the MFCC speech features are extracted; the audio spectrogram serves as an input feature, and the extracted spectrogram feature vector and MFCC feature vector are normalized;
5) A Chinese dialect speech synthesis model is built on an improved WaveNet model and provides an interface for multi-dimensional cross fusion;
6) The joint speech recognition and speech synthesis models are deployed in the upper-layer cloud, receiving the processing results of the edge layer as their input source;
7) The features of the speech sample set are extracted and passed through the Chinese dialect speech recognition model of step 4), yielding positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2;
8) T1 and F1 are input into the Chinese dialect speech synthesis model of step 5), and the synthesized audio falls into four output categories, T11, T12, F21, and F22 (T11: the recognition result is positive and the synthesis result is positive; T12: the recognition result is positive and the synthesis result is negative; F21: the recognition result is negative and the synthesis result is positive; F22: both results are negative);
9) The dominance ratio of the corresponding features is evaluated according to the proportion of correct samples, and a dominant-feature expression set is screened out as the staged model features. Hyper-parameters and weights are adjusted to train the speech recognition model until the loss converges; the model is saved and then reloaded in combination with the speech synthesis module. The model is iteratively trained through the back-propagation feedback mechanism of the neural network, a reasonable number of training iterations is set by tuning the hyper-parameters, and the final model is optimized. When the model's performance gradually converges and stabilizes, its robustness is assured.
Claims (4)
1. An edge-device-oriented speech recognition-synthesis joint modeling method is characterized by comprising the following steps:
1) collecting a data set sample:
collecting audio data from various environments and dividing it into two classes: (a) clean audio recorded in a quiet environment and (b) noise audio of different types, where class (b) is drawn from a noise library;
2) performing data processing:
performing noise fusion: noise is added to the clean audio, and the clean audio data and the corresponding noise-added audio data are packaged and assembled together;
3) building an edge server:
building stable edge-server equipment that provides reliable upload and download interfaces, together with an algorithmic speech-enhancement module. The module applies spectral subtraction, wavelet hard thresholding, and a GAN network model, and uses a speech recognition engine with a voting scheme to obtain the optimal result, which serves as the optimization step of audio preprocessing. Audio front-end processing (dereverberation, noise reduction, noise separation, and the like) is performed on this layer of equipment; the fusion of wavelet, spectral-subtraction, and neural-network models is used to screen the model best suited to the noise at hand, and the voting scheme selects the method that yields the higher audio quality;
4) building a Chinese dialect speech recognition model:
adopting a basic CNN + RNN model architecture, in which the CNN performs secondary feature processing after the MFCC speech features are extracted; the audio spectrogram is taken as an input feature, and the extracted spectrogram feature vector and MFCC feature vector are normalized. A Chinese dialect speech synthesis model is built on an improved WaveNet model and provides an interface for multi-dimensional cross fusion. The joint speech recognition and speech synthesis models are deployed in the upper-layer cloud and receive the processing results of the edge layer as their input source;
5) processing the data on the edge-service-layer equipment built in step 3): audio preprocessing by speech enhancement improves the machine intelligibility of the audio; the features of the speech sample set are then extracted and passed through the Chinese dialect speech recognition model of step 4), yielding positive samples T1 with accuracy acc1 and negative samples F1 with accuracy acc2. T1 and F1 are respectively input into the Chinese dialect speech synthesis model of step 4), and the synthesized audio falls into four output categories, T11, T12, F21, and F22: T11 means the speech recognition result is positive and the speech synthesis result is positive; T12 means the recognition result is positive and the synthesis result is negative; F21 means the recognition result is negative and the synthesis result is positive; F22 means both the recognition and synthesis results are negative;
6) evaluating the dominance ratio of the corresponding features according to the proportion of correct samples, and screening out a dominant-feature expression set as the staged model features; adjusting hyper-parameters and weights to train the speech recognition model until the loss converges, saving the model, and reloading it in combination with the speech synthesis module; iteratively training the model through the back-propagation feedback mechanism of the neural network, setting a reasonable number of training iterations by tuning the hyper-parameters, and optimizing the final model. When the model's performance gradually converges and stabilizes, its robustness is assured.
2. The edge-device-oriented speech recognition-synthesis joint modeling method of claim 1, wherein the processed speech data is converted into corresponding text by speech recognition; a CCLD Chinese speech recognition network model is built with MFCC features, exploiting the feature-extraction strength of CNNs: MFCC audio features serve as the reference input, a CNN extracts key features and feeds them to an RNN with LSTM layers, and a three-layer DNN is attached as the output discriminator; a Chinese speech recognition engine is built around the speech characteristics of the Chinese data, the Chinese speech recognition model is trained, and the samples classified by the model are then divided into correct and incorrect samples.
3. The edge-device-oriented speech recognition-synthesis joint modeling method of claim 1, wherein the positive and negative samples are respectively converted into corresponding audio samples by the speech synthesis model, whether each corresponding recognition result is correct is counted, and the results are again classified as positive or negative; the text is converted into audio by the improved WaveNet-based speech synthesis model, where the text data is the output of the previous model and therefore already consists of positive and negative samples; after passing through the speech synthesis model, each sample acquires a further positive or negative label, so the resulting outcomes carry the sample attributes "positive-positive", "positive-negative", "negative-positive", and "negative-negative".
4. The edge-device-oriented speech recognition-synthesis joint modeling method of claim 1, wherein, according to the dominant-feature proportions of the classification results, the feature combination closest to the original audio is screened out by computation and comparison against the original audio: "positive-positive" samples form the class-A feature group, "positive-negative" samples the class-B group, "negative-positive" samples the class-C group, and "negative-negative" samples the class-D group, with priority A > B > C > D; the accuracy of each class of samples is then computed and used as the scoring criterion of feature superiority to screen out the dominant features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910847985.4A CN110827791B (en) | 2019-09-09 | 2019-09-09 | Edge-device-oriented speech recognition-synthesis combined modeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110827791A true CN110827791A (en) | 2020-02-21 |
CN110827791B CN110827791B (en) | 2022-07-01 |
Family
ID=69547963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910847985.4A Active CN110827791B (en) | 2019-09-09 | 2019-09-09 | Edge-device-oriented speech recognition-synthesis combined modeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110827791B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111540345A (en) * | 2020-05-09 | 2020-08-14 | 北京大牛儿科技发展有限公司 | Weakly supervised speech recognition model training method and device |
CN111583913A (en) * | 2020-06-15 | 2020-08-25 | 深圳市友杰智新科技有限公司 | Model training method and device for speech recognition and speech synthesis and computer equipment |
CN111833878A (en) * | 2020-07-20 | 2020-10-27 | 中国人民武装警察部队工程大学 | Chinese voice interaction non-inductive control system and method based on raspberry Pi edge calculation |
CN113823314A (en) * | 2021-08-12 | 2021-12-21 | 荣耀终端有限公司 | Voice processing method and electronic equipment |
WO2023211369A3 (en) * | 2022-04-25 | 2024-03-21 | 脸萌有限公司 | Speech recognition model generation method and apparatus, speech recognition method and apparatus, medium, and device |
- 2019-09-09: application CN201910847985.4A filed in China; granted as patent CN110827791B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10373073B2 (en) * | 2016-01-11 | 2019-08-06 | International Business Machines Corporation | Creating deep learning models using feature augmentation |
CN109256118A (en) * | 2018-10-22 | 2019-01-22 | Jiangsu Normal University | End-to-end Chinese dialect identification system and method based on a generative auditory model |
CN109616093A (en) * | 2018-12-05 | 2019-04-12 | Ping An Technology (Shenzhen) Co., Ltd. | End-to-end speech synthesis method, apparatus, device and storage medium |
CN109712609A (en) * | 2019-01-08 | 2019-05-03 | South China University of Technology | Method for addressing training-sample imbalance in keyword recognition |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111540345A (en) * | 2020-05-09 | 2020-08-14 | 北京大牛儿科技发展有限公司 | Weakly supervised speech recognition model training method and device |
CN111540345B (en) * | 2020-05-09 | 2022-06-24 | 北京大牛儿科技发展有限公司 | Weakly supervised speech recognition model training method and device |
CN111583913A (en) * | 2020-06-15 | 2020-08-25 | 深圳市友杰智新科技有限公司 | Model training method and device for speech recognition and speech synthesis and computer equipment |
CN111833878A (en) * | 2020-07-20 | 2020-10-27 | Engineering University of the Chinese People's Armed Police Force | Imperceptible Chinese voice-interaction control system and method based on Raspberry Pi edge computing |
CN113823314A (en) * | 2021-08-12 | 2021-12-21 | Honor Device Co., Ltd. | Voice processing method and electronic device |
CN113823314B (en) * | 2021-08-12 | 2022-10-28 | Beijing Honor Device Co., Ltd. | Voice processing method and electronic device |
WO2023211369A3 (en) * | 2022-04-25 | 2024-03-21 | 脸萌有限公司 | Speech recognition model generation method and apparatus, speech recognition method and apparatus, medium, and device |
Also Published As
Publication number | Publication date |
---|---|
CN110827791B (en) | 2022-07-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110827791B (en) | Edge-device-oriented speech recognition-synthesis combined modeling method | |
Ding et al. | Autospeech: Neural architecture search for speaker recognition | |
CN109599091B (en) | Star-WAN-GP and x-vector based many-to-many speaker conversion method | |
WO2018014469A1 (en) | Voice recognition processing method and apparatus | |
CN102779510B (en) | Speech emotion recognition method based on feature space self-adaptive projection | |
CN109256118B (en) | End-to-end Chinese dialect identification system and method based on generative auditory model | |
CN1731509A (en) | Mobile speech synthesis method | |
CN111402928B (en) | Attention-based speech emotion state evaluation method, device, medium and equipment | |
CN112069310A (en) | Text classification method and system based on active learning strategy | |
CN112861984B (en) | Speech emotion classification method based on feature fusion and ensemble learning | |
CN112102813B (en) | Speech recognition test data generation method based on context in user comment | |
CN1924994B (en) | Embedded speech synthesis method and system | |
CN108647206B (en) | Chinese spam email identification method based on a chaotic-particle-swarm-optimized CNN | |
CN114282646A (en) | Optical power prediction method and system based on two-stage feature extraction and an improved BiLSTM | |
CN115101085A (en) | Multi-speaker time-domain speech separation method based on convolution-enhanced external attention | |
CN111583965A (en) | Voice emotion recognition method, device, equipment and storage medium | |
Lee et al. | NAS-TasNet: neural architecture search for time-domain speech separation | |
Yang et al. | Omni-sparsity dnn: Fast sparsity optimization for on-device streaming e2e asr via supernet | |
Zhang et al. | LD-CNN: A lightweight dilated convolutional neural network for environmental sound classification | |
CN110619886B (en) | End-to-end voice enhancement method for low-resource Tujia language | |
Zhao et al. | Transferring age and gender attributes for dimensional emotion prediction from big speech data using hierarchical deep learning | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
Ye et al. | Tdcgan: Temporal dilated convolutional generative adversarial network for end-to-end speech enhancement | |
Ding et al. | Speech emotion features selection based on BBO-SVM | |
Rana | Emotion classification from noisy speech-A deep learning approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||