CN113643695A - Dialect accent mandarin voice recognition optimization method and system - Google Patents


Info

Publication number
CN113643695A
CN113643695A (application CN202111048340.8A)
Authority
CN
China
Prior art keywords
mandarin
dialect
features
audio
convolution
Prior art date
Legal status
Granted
Application number
CN202111048340.8A
Other languages
Chinese (zh)
Other versions
CN113643695B (en)
Inventor
杨逸舟
陈海江
Current Assignee
Zhejiang Lishi Technology Co Ltd
Original Assignee
Zhejiang Lishi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lishi Technology Co Ltd filed Critical Zhejiang Lishi Technology Co Ltd
Priority to CN202111048340.8A priority Critical patent/CN113643695B/en
Publication of CN113643695A publication Critical patent/CN113643695A/en
Application granted granted Critical
Publication of CN113643695B publication Critical patent/CN113643695B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of speech recognition, and in particular to a speech recognition optimization method and system for dialect-accented Mandarin. A convolutional neural network performs convolution and feature extraction on the audio content, and the network learns the features of standard Mandarin. After standard Mandarin audio is generated, it is convolved again to extract features, which are added to the intermediate features as offsets. The Mandarin features and the convolution result of each step are then taken as features and added as offsets at the end of the corresponding convolutional layer of the dialect processing module. After the convolutional layers, the obtained parameters are deconvolved to amplify the original information and generate the target audio, which is finally input into the speech recognition function for recognition. The method reduces the cost of customizing a dedicated model for each dialect, and by superimposing Mandarin and dialect-accent features it amplifies the required features of standard Mandarin, generalizing the model while further improving the accuracy of speech recognition.

Description

Dialect accent mandarin voice recognition optimization method and system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method and a system for optimizing voice recognition of dialect accent Mandarin.
Background
At present, speech recognition targets only standard Mandarin; recognizing a dialect requires building a dedicated model for that dialect, and no relatively universal solution exists for removing dialect accents. The current approach is to build dedicated speech recognition model modules for regions where dialect accents are particularly heavy, so as to reduce the interference of accents with recognition accuracy.
In the prior art, each dialect-accent recognition model requires a large investment: accented Mandarin audio training data must be collected and labeled, and the model then trained in a targeted manner. Moreover, the number of regions with dialect accents is large, and the time and labor cost of building a dedicated model for each region is too high for practical application. The present dialect-accent Mandarin speech recognition optimization method and system are proposed to solve this problem.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a speech recognition optimization method and system for dialect-accented Mandarin. It addresses the problems that each dialect-accent recognition model requires a large investment, that accented Mandarin audio training data must be collected and labeled before the model can be trained in a targeted manner, and that, because the number of regions with dialect accents is large, the time and labor cost of building a dedicated model for each region is too high for practical application.
The invention is realized by the following technical scheme:
in a first aspect, the invention discloses a speech recognition optimization method for dialect-accented Mandarin, comprising the following steps:
S1, inputting standard Mandarin audio into a Mandarin enhancement module as model input, and performing convolution on the audio content with a convolutional neural network to extract features;
S2, deconvolving the extracted features to regenerate the original audio, so that the neural network learns the features of standard Mandarin;
S3, once standard Mandarin audio is generated, convolving it again to extract features and adding them to the intermediate features as offsets, strengthening the Mandarin content and the intonation-related features;
S4, taking the Mandarin features and the convolution result of each step as features, adding them at the end of the corresponding convolutional layer of the dialect processing module as offsets, and performing convolution processing;
S5, after the convolutional layers, deconvolving the obtained parameters to amplify the original information and generate the target audio, which is finally input into the speech recognition function for recognition.
Further, in the method, the Mandarin enhancement module uses a self-coding model structure that includes a convolution part and a deconvolution part.
Further, in the method, the feature extraction of the self-coding model part of the Mandarin enhancement module is trained separately, so that the base input is not pure white noise.
Further, in the method, the dialect processing module is based on a self-coding model framework and adds convolution result parameters from the standard Mandarin module at the end of each convolutional layer.
Further, in the method, the convolution parameters carried by the dialect processing module include information related to Mandarin semantics, intonation and meaning.
Further, the method is based on a convolutional neural network; the training samples are readings of the same text, and each dialect-accent audio segment corresponds to one standard Mandarin audio segment.
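To make the paired-sample requirement concrete, the sketch below builds hypothetical training pairs: each example is a dialect-accent clip and a standard Mandarin clip reading the same text. Random noise stands in for real recordings; the 3-text / 5-second / 44.1 kHz figures follow the embodiments described later, and everything else (names, seeds) is a placeholder.

```python
import numpy as np

SAMPLE_RATE = 44100   # Hz, per the embodiments
CLIP_SECONDS = 5      # seconds per audio segment

def make_pair(seed):
    """One training pair: (dialect-accent clip, standard Mandarin clip)
    reading the same 15-character text. Dummy noise stands in for audio."""
    rng = np.random.default_rng(seed)
    n = SAMPLE_RATE * CLIP_SECONDS
    return rng.standard_normal(n), rng.standard_normal(n)

pairs = [make_pair(i) for i in range(3)]   # 3 randomly chosen texts
dialect_clip, mandarin_clip = pairs[0]
print(len(pairs), dialect_clip.shape, mandarin_clip.shape)
```

The one-to-one pairing is what lets the dialect module be supervised against a standard Mandarin target for the same spoken content.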
In a second aspect, the present invention discloses a speech recognition optimization system for dialect accent mandarin, which is used for implementing the speech recognition optimization method for dialect accent mandarin according to the first aspect, and comprises a dialect accent processing module and a standard mandarin speech enhancement module.
Further, the dialect module is used to extract features from the dialect-accent audio, obtain the enhanced features of standard Mandarin, and regenerate the dialect accent as a segment of standard Mandarin audio.
Further, the standard Mandarin enhancement module is used to extract the features of standard Mandarin and the text content features, enhancing the content of the speech and improving the speech recognition capability.
The invention has the following beneficial effects:
unlike traditional speech recognition algorithms, which recognize only the Mandarin content in isolation and do not process the accent through any enhancement modules, the universal dialect-accent processing model reduces the cost of customizing a dedicated model for each dialect. At the same time, by superimposing Mandarin and dialect-accent features it amplifies the required features of standard Mandarin, generalizing the model while further improving the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a block diagram of the speech recognition optimization method for dialect accent Mandarin;
FIG. 2 is a schematic diagram of a method for optimizing speech recognition of dialect accent Mandarin.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Example 1
This embodiment discloses a speech recognition optimization method for dialect-accented Mandarin, as shown in FIG. 1, comprising the following steps:
S1, inputting standard Mandarin audio into a Mandarin enhancement module as model input, and performing convolution on the audio content with a convolutional neural network to extract features;
S2, deconvolving the extracted features to regenerate the original audio, so that the neural network learns the features of standard Mandarin;
S3, once standard Mandarin audio is generated, convolving it again to extract features and adding them to the intermediate features as offsets, strengthening the Mandarin content and the intonation-related features;
S4, taking the Mandarin features and the convolution result of each step as features, adding them at the end of the corresponding convolutional layer of the dialect processing module as offsets, and performing convolution processing;
S5, after the convolutional layers, deconvolving the obtained parameters to amplify the original information and generate the target audio, which is finally input into the speech recognition function for recognition.
In this embodiment, the neural network learns the features of standard Mandarin through the self-coding model structure of the Mandarin enhancement module. The feature extraction of the self-coding model part is trained separately, so that the base input for training is not pure white noise.
The dialect processing module of this embodiment is based on a self-coding model framework and adds convolution result parameters from the standard Mandarin module at the end of each convolutional layer. By carrying these convolution parameters, the dialect processing module strengthens the information related to Mandarin semantics, intonation and meaning within the dialect.
This embodiment is based on a convolutional neural network; the training samples are readings of the same text, and each dialect-accent audio segment corresponds to one standard Mandarin audio segment.
In this embodiment, the input speech is processed by the optimization model to generate audio without a dialect accent, and the result is used as the subsequent input for speech recognition.
This embodiment builds the relevant speech recognition model module for regions where the dialect accent is particularly heavy, thereby reducing the interference of the accent with recognition accuracy.
Example 2
This embodiment discloses a specific implementation of the speech recognition optimization method for dialect-accented Mandarin, shown in FIG. 2, which proceeds as follows:
the Mandarin enhancement module is trained separately, with standard Mandarin audio as the model input. Convolutional layer 1: filter size 4410x2, offset: 441 sample points. Convolutional layer 2: filter size 441x2, offset: 40 sample points. Convolutional layer 3: filter size 441x2, offset: 40 sample points. A convolutional neural network thus performs convolution and feature extraction on the audio content.
The training samples are readings of the same text: each text is 15 characters long, 3 texts are selected at random, each audio segment is 5 seconds long, and each dialect-accent segment corresponds to one standard Mandarin segment.
The original audio is then regenerated by deconvolving the features, using deconvolution layer 2 (filter size 4410x2, offset: 441 sample points) and deconvolution layer 1 (filter size 441x2, offset: 40 sample points). Through this self-coding model structure, the neural network learns the features of standard Mandarin more effectively. Once the features can fully regenerate standard Mandarin audio, the generated audio is convolved again to extract features, which are added to the intermediate features as offsets, strengthening the Mandarin content and the intonation-related features.
The training set comprises 50 audio segments each of Guangdong-, Sichuan-, Hunan-, Fujian-, and Beijing-accented Mandarin, with male and female speakers each accounting for 50% of the recordings, plus 100 segments of standard Mandarin, likewise split evenly between male and female speakers.
During training, the feature extraction of the self-coding model part of the Mandarin enhancement module is trained separately, so that the base input of the model is not pure white noise; this further improves the later training of the model.
After the Mandarin enhancement module is trained (learning rate: 0.005, reduced by 0.0001 whenever the loss falls below 0.01; training iterations: 5000), the obtained Mandarin features and the convolution result of each step are taken as features and added as offsets at the end of the corresponding convolutional layer of the dialect processing module, embedding the Mandarin features into the dialect audio.
The dialect processing module in this embodiment is likewise based on a self-coding model framework: a convolution result parameter from the standard Mandarin module is added at the end of each convolutional layer, and carrying these convolution parameters strengthens the information related to Mandarin semantics, intonation and meaning within the dialect.
Its layers are: convolutional layer 1: filter size 4410x2, offset: 441 sample points; convolutional layer 2: filter size 441x2, offset: 40 sample points.
After the convolutional layers, the obtained parameters are deconvolved (deconvolution layer 2: filter size 4410x2, offset: 441 sample points; deconvolution layer 1: filter size 441x2, offset: 40 sample points), with learning rate 0.005, reduced by 0.0001 when the loss falls below 0.01, and 50000 training iterations. The original information is amplified to generate the target audio, which is finally input into the speech recognition function for recognition.
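The learning-rate rule used for both modules (start at 0.005, subtract 0.0001 whenever the loss falls below 0.01) can be sketched as a small helper. The `floor` argument is an added assumption to keep the rate positive; the patent text does not specify a lower bound.

```python
def step_lr(lr, loss, threshold=0.01, decrement=0.0001, floor=1e-6):
    """Reduce the learning rate by `decrement` once the loss drops below
    `threshold`; `floor` (an assumption) keeps the rate positive."""
    return max(lr - decrement, floor) if loss < threshold else lr

lr = 0.005
for loss in [0.5, 0.05, 0.009, 0.008]:   # last two are below the threshold
    lr = step_lr(lr, loss)
print(round(lr, 6))                      # 0.0048
```

Whether the reduction is applied once or at every sub-threshold step is not stated; the helper above applies it at every step, which matches a simple per-iteration schedule.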
Unlike traditional speech recognition algorithms, which recognize only the Mandarin content in isolation and do not process the accent through any enhancement modules, this embodiment reduces the cost of customizing a dedicated model for each dialect. At the same time, by superimposing Mandarin and dialect-accent features it amplifies the required features of standard Mandarin, generalizing the model while further improving the accuracy of speech recognition.
Example 3
The embodiment discloses a dialect accent mandarin voice recognition optimization system, which comprises a dialect accent processing module and a standard mandarin voice enhancement module.
The dialect module of this embodiment mainly extracts features from the dialect-accent audio, obtains the enhanced features of standard Mandarin, and regenerates the dialect accent as a segment of standard Mandarin audio.
The standard Mandarin enhancement module of this embodiment extracts the features of standard Mandarin and the text content features, enhancing the content of the speech and improving the speech recognition capability.
The whole system of this embodiment is built on a convolutional neural network. The training samples are readings of the same text: each text is 15 characters long, 3 texts are selected at random, each audio segment is 5 seconds long, and each dialect-accent segment corresponds to one standard Mandarin segment.
The training set comprises 50 audio segments each of Guangdong-, Sichuan-, Hunan-, Fujian-, and Beijing-accented Mandarin, with male and female speakers each accounting for 50% of the recordings, plus 100 segments of standard Mandarin, likewise split evenly between male and female speakers.
The standard Mandarin module parameters of this embodiment are:
input audio vector: 44100x5x2;
convolutional layer 1: filter size 4410x2, offset: 441 sample points;
convolutional layer 2: filter size 441x2, offset: 40 sample points;
convolutional layer 3: filter size 441x2, offset: 40 sample points;
deconvolution layer 2: filter size 4410x2, offset: 441 sample points;
deconvolution layer 1: filter size 441x2, offset: 40 sample points;
learning rate: 0.005, reduced by 0.0001 when the loss falls below 0.01;
training iterations: 5000.
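Reading each layer's "offset" as its stride in sample points (an interpretation; the text does not define the term), the output length of a strided layer can be checked with a small helper. The 44100x5x2 input corresponds to 5 seconds of two-channel 44.1 kHz audio; the calculation below treats a single channel.

```python
def conv_out_len(n_in, kernel, stride):
    """Length of a 'valid' strided 1-D convolution output."""
    return (n_in - kernel) // stride + 1

n = 44100 * 5                       # 5 s of 44.1 kHz audio, one channel
n1 = conv_out_len(n, 4410, 441)     # convolutional layer 1 of the module
print(n1)                           # 491
```

Under this reading, layer 1 turns 220500 input samples into 491 feature values; padding conventions could shift these counts slightly, so treat the figures as indicative.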
The dialect processing module parameters of this embodiment are:
input audio vector: 44100x5x2;
convolutional layer 1: filter size 4410x2, offset: 441 sample points;
convolutional layer 2: filter size 441x2, offset: 40 sample points;
deconvolution layer 2: filter size 4410x2, offset: 441 sample points;
deconvolution layer 1: filter size 441x2, offset: 40 sample points;
learning rate: 0.005, reduced by 0.0001 when the loss falls below 0.01;
training iterations: 50000.
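The defining step of the dialect processing module, adding the standard Mandarin module's convolution result at the end of each convolutional layer, can be sketched as follows. This is a toy with made-up sizes, not the patented parameters; `conv1d` and the helper names are stand-ins introduced for illustration.

```python
import numpy as np

def conv1d(x, kernel, stride):
    """'Valid' strided 1-D convolution (toy stand-in for a conv layer)."""
    n = (len(x) - len(kernel)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(kernel)], kernel)
                     for i in range(n)])

def dialect_forward(dialect_audio, mandarin_layer_feats, kernels, strides):
    """Pass dialect audio through the conv stack, adding the Mandarin
    module's features for the matching layer as an offset at the end of
    each layer (toy sizes; not the patented parameters)."""
    h = dialect_audio
    for k, s, m in zip(kernels, strides, mandarin_layer_feats):
        h = conv1d(h, k, s) + m    # offset added at the end of the layer
    return h

rng = np.random.default_rng(1)
audio = rng.standard_normal(100)
kernels = [np.ones(10), np.ones(2)]
strides = [10, 2]
m_feats = [np.zeros(10), np.zeros(5)]      # offsets from the Mandarin module
out = dialect_forward(audio, m_feats, kernels, strides)
print(out.shape)                           # (5,)
```

Each Mandarin feature vector must match the shape of the corresponding layer output, which is why the two modules share the same layer geometry in the parameter lists above.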
The system solves the problems that each dialect-accent recognition model requires a large investment, that accented Mandarin audio training data must be collected and labeled before the model can be trained in a targeted manner, and that, because the number of regions with dialect accents is large, the time and labor cost of building a dedicated model for each region is too high for practical application.
Unlike traditional speech recognition algorithms, which recognize only the Mandarin content in isolation and do not process the accent through any enhancement modules, the universal dialect-accent processing model reduces the cost of customizing a dedicated model for each dialect. At the same time, by superimposing Mandarin and dialect-accent features it amplifies the required features of standard Mandarin, generalizing the model while further improving the accuracy of speech recognition.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A speech recognition optimization method for dialect-accented Mandarin, the method comprising the following steps:
S1, inputting standard Mandarin audio into a Mandarin enhancement module as model input, and performing convolution on the audio content with a convolutional neural network to extract features;
S2, deconvolving the extracted features to regenerate the original audio, so that the neural network learns the features of standard Mandarin;
S3, once standard Mandarin audio is generated, convolving it again to extract features and adding them to the intermediate features as offsets, strengthening the Mandarin content and the intonation-related features;
S4, taking the Mandarin features and the convolution result of each step as features, adding them at the end of the corresponding convolutional layer of the dialect processing module as offsets, and performing convolution processing;
S5, after the convolutional layers, deconvolving the obtained parameters to amplify the original information and generate the target audio, which is finally input into the speech recognition function for recognition.
2. The method of claim 1, wherein the Mandarin enhancement module uses a self-coding model structure that includes a convolution portion and a deconvolution portion.
3. The method of claim 1, wherein the feature extraction of the self-coding model part of the Mandarin enhancement module is trained separately, so that the base input is not pure white noise.
4. The method of claim 1, wherein the dialect processing module is based on a self-coding model framework, and convolution result parameters from a standard Mandarin module are added at the end of each layer of convolution.
5. The method as claimed in claim 4, wherein the convolution parameters carried by the dialect processing module include information related to Mandarin semantics, intonation and meaning.
6. The method of claim 1, wherein the method is based on a convolutional neural network, the training samples are readings of the same text, and each dialect-accent audio segment corresponds to one standard Mandarin audio segment.
7. A speech recognition optimization system for dialect accent Mandarin, the system being adapted to implement a method for speech recognition optimization of dialect accent Mandarin as claimed in any of claims 1-6, comprising a dialect accent processing module and a standard Mandarin speech enhancement module.
8. The system of claim 7, wherein the dialect module is configured to perform feature extraction on the dialect-accent audio, obtain the enhanced features of standard Mandarin, and regenerate the dialect accent as a segment of standard Mandarin audio.
9. The system of claim 7, wherein the standard Mandarin enhancement module is configured to extract features of standard Mandarin and text content features, so as to enhance the content of the speech process and improve the speech recognition capability.
CN202111048340.8A 2021-09-08 2021-09-08 Method and system for optimizing voice recognition of dialect accent mandarin Active CN113643695B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048340.8A CN113643695B (en) 2021-09-08 2021-09-08 Method and system for optimizing voice recognition of dialect accent mandarin


Publications (2)

Publication Number Publication Date
CN113643695A 2021-11-12
CN113643695B CN113643695B (en) 2024-03-08

Family

ID=78425319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111048340.8A Active CN113643695B (en) 2021-09-08 2021-09-08 Method and system for optimizing voice recognition of dialect accent mandarin

Country Status (1)

Country Link
CN (1) CN113643695B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314782A1 (en) * 2015-04-21 2016-10-27 Google Inc. Customizing speech-recognition dictionaries in a smart-home environment
CN109065021A (en) * 2018-10-18 2018-12-21 江苏师范大学 The end-to-end dialect identification method of confrontation network is generated based on condition depth convolution
CN115512722A (en) * 2022-10-10 2022-12-23 浙江力石科技股份有限公司 Multi-mode emotion recognition method, equipment and storage medium


Non-Patent Citations (3)

Title
MALYSHA, N., "Analysis of Nonmodal Phonation using Minimum Entropy Deconvolution", 9th International Conference on Spoken Language Processing / Interspeech *
更藏措毛, "Amdo Tibetan Speech Recognition Based on Deep Neural Networks", China Master's Theses Full-text Database (Information Science and Technology) *
潘嘉, "Research on Adaptation Methods in Deep Learning Speech Recognition Systems", China Master's Theses Full-text Database (Information Science and Technology) *

Also Published As

Publication number Publication date
CN113643695B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111382580B (en) Encoder-decoder framework pre-training method for neural machine translation
CN109065021B (en) End-to-end dialect identification method for generating countermeasure network based on conditional deep convolution
CN111862953B (en) Training method of voice recognition model, voice recognition method and device
CN111477216A (en) Training method and system for pronunciation understanding model of conversation robot
CN105261356A (en) Voice recognition system and method
CN109993169A (en) One kind is based on character type method for recognizing verification code end to end
CN114495904B (en) Speech recognition method and device
CN112989008A (en) Multi-turn dialog rewriting method and device and electronic equipment
CN111933113B (en) Voice recognition method, device, equipment and medium
CN110728154A (en) Construction method of semi-supervised general neural machine translation model
CN111933120A (en) Voice data automatic labeling method and system for voice recognition
CN113643695A (en) Dialect accent mandarin voice recognition optimization method and system
CN106682642A (en) Multi-language-oriented behavior identification method and multi-language-oriented behavior identification system
CN111079528A (en) Primitive drawing checking method and system based on deep learning
CN113160796B (en) Language identification method, device and equipment for broadcast audio and storage medium
CN115331703A (en) Song voice detection method and device
CN112241467A (en) Audio duplicate checking method and device
CN114550693A (en) Multilingual voice translation method and system
CN110858268B (en) Method and system for detecting unsmooth phenomenon in voice translation system
CN113658587B (en) Intelligent voice recognition method and system with high recognition rate based on deep learning
CN110399456B (en) Question dialogue completion method and device
CN115905500B (en) Question-answer pair data generation method and device
CN113035247B (en) Audio text alignment method and device, electronic equipment and storage medium
CN113792723B (en) Optimization method and system for identifying stone carving characters
CN111613208B (en) Language identification method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant