CN115240710A - Neural network-based multi-scale fusion pronunciation evaluation model optimization method - Google Patents

Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Info

Publication number
CN115240710A
CN115240710A
Authority
CN
China
Prior art keywords
neural network
score
features
gop
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210772121.2A
Other languages
Chinese (zh)
Inventor
张句
贡诚
王宇光
关昊天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Zhiyan Information Technology Co ltd
Original Assignee
Suzhou Zhiyan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Zhiyan Information Technology Co ltd filed Critical Suzhou Zhiyan Information Technology Co ltd
Priority to CN202210772121.2A priority Critical patent/CN115240710A/en
Publication of CN115240710A publication Critical patent/CN115240710A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: specially adapted for particular use
    • G10L25/51: specially adapted for particular use for comparison or discrimination
    • G10L25/60: specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Abstract

The invention relates to the field of pronunciation evaluation, and in particular to a neural network-based multi-scale fusion pronunciation evaluation model optimization method that optimizes a pronunciation evaluation system using a neural network and multi-scale fusion. The method mainly comprises acoustic model design and selection, neural network-based GOP score calculation, construction of a multi-scale convolutional neural network, attention-based multi-feature fusion, and speech evaluation. Considering both the global and local nature of prosodic information, CNN networks of different scales are adopted to mine prosody-related pronunciation features of different granularities, and an attention mechanism model fuses the pronunciation features of different scales with features related to the posterior probability, realizing multi-scale fused pronunciation features.

Description

Neural network-based multi-scale fusion pronunciation evaluation model optimization method
Technical Field
The invention relates to the field of pronunciation evaluation, and in particular to a neural network-based multi-scale fusion pronunciation evaluation model optimization method that optimizes a pronunciation evaluation system using a neural network and multi-scale fusion technology.
Background
Automatic evaluation of English pronunciation is a technology in which a test taker reads a specified English text aloud and a computer gives an evaluation score according to the pronunciation quality. The computer performs a fair, objective, and efficient automatic evaluation of the test taker's English pronunciation level, helping English learners correct pronunciation errors and improve their spoken English. With the rapid development of the global economy, exchanges and cooperation among countries in politics, economy, culture, and education have become more frequent, and more and more people are beginning to learn a second language beyond their native tongue. Mastering a language of communication depends heavily on spoken-language learning. However, one-to-one, face-to-face interactive teaching is often limited by time, space, and economic conditions for both teachers and students, so online education is becoming popular, and pronunciation evaluation technology that automatically evaluates learners' pronunciation and corrects accent errors by computer is welcomed by most learners.
At present, in research at home and abroad, most automatic evaluation of prosodic pronunciation quality is carried out from the perspective of overall listening quality; pronunciation quality evaluation for specific sub-items, such as accent or rhythm, is relatively rare. When people communicate, the information transmitted is not only linguistic and textual but also rich prosodic information. Prosodic information belongs to the suprasegmental level and mainly reflects the speaker's cadence and pausing (rhythm), emphasis (stress), intonation, and tone. On the one hand, prosodic information helps the speaker express the intended message more clearly and accurately, improving the naturalness and comprehensibility of the speech; on the other hand, it helps the listener understand what is heard more clearly and accurately, including the speaker's intention, emotion, attitude, and tone. In the task of automatic pronunciation quality evaluation, evaluating prosodic pronunciation quality is therefore both necessary and important.
Against this background, and aiming at the shortcomings of existing pronunciation evaluation, this patent proposes a neural network-based multi-scale fusion pronunciation evaluation model: a neural-network speech recognition model is adopted as the acoustic model, several CNN networks of different scales convolve the evaluation features to mine prosodic information at different scales, and an attention mechanism model fuses the features of different scales for pronunciation evaluation.
Disclosure of Invention
The invention aims to solve the technical problems described in the background and adopts a neural network-based multi-scale fusion pronunciation evaluation model optimization method.
The technical scheme of the invention is a pronunciation evaluation model optimization method based on multi-scale fusion of a neural network, which comprises the following steps:
Step one, designing and selecting an acoustic model: select an end-to-end speech recognition model as the acoustic model for calculating the GOP (Goodness of Pronunciation) score of the audio to be evaluated; in addition, a pronunciation evaluation training data set needs to be designed and constructed for training the subsequent models.
Step two, calculating the GOP score based on the neural network: after the speech recognition model from step one is trained, recognize the speech to be evaluated with it and calculate the GOP score from the output of the neural network:
GOP(P) = (1 / (t_e - t_s + 1)) * Σ_{t=t_s}^{t_e} log p(s_t | O_t)    (1)
Formula (1) takes the average frame-level posterior probability constructed from the neural network output as the GOP score. Here p(s_t | O_t) is the output of the last softmax layer of the neural network model, where O is the observation sequence of the speech and O_t is the observation of the speech frame at time t, t_s and t_e are the start and end frames of the phoneme P, and s_t is the state label of frame t obtained by forced alignment;
Step three, construction of the multi-scale convolutional neural network:
1) Extraction of prosody-related features: extract the prosody-related acoustic features of each frame and use them as the input of the convolutional neural network. Assuming the speech to be evaluated is divided into N frames and each frame contains M-dimensional prosody-related features, the input is an N × M matrix;
2) Construction of the multi-scale neural network, using one-dimensional convolution to analyze and extract the original convolution features:
setting T (1, 2, \8230;, T) convolution neural networks with different scales, wherein the convolution kernel size of each convolution network is C 1 *1,C 2 *1,…,C T *1, wherein the number of each convolution kernel is M;
step four, multi-feature fusion based on attention mechanism:
1) For the prosodic features of T different scales learned in step three, assume the T features are represented by S = [s_1, s_2, …, s_T]. The resulting feature representation E can be calculated according to the attention mechanism of formula (2) below:
Q = Q' · W_q,  K = S · W_k,  V = S · W_v
E = softmax(Q · K^T / √d_m) · V    (2)
where Q' is a randomly initialized vector of the neural network, and W_q, W_k, W_v are randomly initialized matrices of the neural network that apply linear transformations to Q' and S; the transformations yield the query vector Q, the comparison vector K, and the content vector V. Dividing by √d_m, where d_m is the vector dimension, scales down the dot products, and softmax is adopted as the scoring function to fix the result within the interval 0-1. These parameters are updated as the neural network keeps learning, finally realizing the fusion of the features of different scales.
2) The score calculated from the fused features, score_e, is further fused with the GOP score, as shown in formula (3) below:
score_final = α * score_e + (1 - α) * GOP
α = sigmoid(W_α * s_{t+1} + b_α)    (3)
where s_{t+1} is the output before the softmax of the neural-network recognition model used in step one, W_α and b_α are likewise randomly initialized matrices in the evaluation model used for the linear transformation, α is the weight of score_e, and (1 - α) is the weight of the GOP score. The resulting score_final is the final evaluation result that jointly considers the different prosodic acoustic features and the GOP, and sigmoid is an activation function that guarantees α is a weight value between 0 and 1.
Further, the evaluation steps:
1) Receive the audio to be evaluated and, through the calculation in step one, obtain the GOP score and the output s_{t+1} before the softmax of the recognition model;
2) Extract the prosody-related features and extract the corresponding deep features through the CNNs of different scales;
3) Fuse the features of different scales through the attention mechanism;
4) Fuse the fused feature score with the original GOP score to obtain the final score.
Beneficial effects:
The technical scheme of the invention achieves the following:
1) A traditional Goodness of Pronunciation (GOP) algorithm is combined with multiple prosody-related pronunciation features, realizing a neural network-based pronunciation evaluation model.
2) Considering the global and local characteristics of prosodic information, CNN networks of different scales are adopted to mine prosody-related pronunciation features of different granularities.
3) An attention mechanism model is adopted to fuse the pronunciation features of different scales with the features related to the posterior probability, realizing multi-scale fused pronunciation features.
Drawings
FIG. 1 is a schematic diagram of a one-dimensional convolutional neural network;
FIG. 2 is a flow chart of a pronunciation assessment system.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to solve the technical problems described in the background and adopts a neural network-based multi-scale fusion pronunciation evaluation model optimization method, whose design mainly covers the following three aspects:
1) A neural-network speech recognition model is used as the acoustic model, and the output of the neural network serves as the evaluation basis for pronunciation correctness.
2) Convolutional neural networks of different scales are adopted to mine prosody-related features of different scales, taking into account prosodic information such as emphasis, intonation, and stress.
3) An attention mechanism method is adopted to learn the weights of the related features of different scales and the GOP, realizing a pronunciation evaluation system that fuses multiple kinds of information.
Step one, designing and selecting an acoustic model.
1) The invention selects a general end-to-end speech recognition model as the acoustic model to calculate the GOP score of the audio to be evaluated. This neural-network-based end-to-end acoustic model needs to be pre-trained.
2) Construct a pronunciation evaluation training data set: invite three experienced English teachers to score the overall pronunciation quality of each utterance from the three aspects of pronunciation accuracy, fluency, and completeness on a 0-5 scale, where 0 is the lowest and 5 the highest; the average of the three teachers' scores is taken as the manual score for each piece of speech data.
Step two, calculating the GOP score based on the neural network.
After the speech recognition model from step one is trained, the speech to be evaluated can be recognized with it, and the GOP score is calculated from the output of the neural network:
GOP(P) = (1 / (t_e - t_s + 1)) * Σ_{t=t_s}^{t_e} log p(s_t | O_t)    (1)
Formula (1) takes the average frame-level posterior probability constructed from the neural network output as the GOP score. Here p(s_t | O_t) is the output of the last softmax layer of the neural network model, where O is the observation sequence of the speech and O_t is the observation of the speech frame at time t, t_s and t_e are the start and end frames of the phoneme P, and s_t is the state label of frame t obtained by forced alignment.
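As a concrete illustration of formula (1), the following minimal NumPy sketch (an illustration only, not the patent's implementation; the array layout and the epsilon guard are assumptions) computes the GOP score of one phoneme from the acoustic model's softmax outputs and a forced alignment:

```python
import numpy as np

def gop_score(posteriors: np.ndarray, states: np.ndarray,
              t_s: int, t_e: int) -> float:
    """Formula (1): average frame-level log posterior over a phoneme span.

    posteriors: (N, S) softmax outputs of the acoustic model, one row per frame.
    states:     (N,)   state label s_t of each frame, from forced alignment.
    t_s, t_e:   start and end frames of the phoneme P (inclusive).
    """
    frames = np.arange(t_s, t_e + 1)
    # p(s_t | O_t) is read off the softmax row of each aligned frame.
    p = posteriors[frames, states[frames]]
    # A small epsilon guards against log(0) on near-zero posteriors.
    return float(np.mean(np.log(p + 1e-10)))
```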
and step three, constructing a multi-scale convolution neural network.
1) Extraction of prosody-related features
The three acoustic features most commonly associated with prosodic perception are pitch, energy, and duration, together with their corresponding statistical and dynamic features, so these prosody-related acoustic features of each frame are first extracted and used as the input of the convolutional neural network. Assuming the speech to be evaluated is divided into N frames and each frame contains the above prosody-related features in M dimensions, the input is an N × M matrix.
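By way of illustration, a front-end sketch of such frame-level feature extraction is given below using librosa; the pitch range, the hop length, and the use of first-order deltas as the dynamic features are assumptions, and duration statistics would in practice come from the forced alignment rather than from this front end:

```python
import numpy as np
import librosa

def prosodic_features(wav_path: str, hop: int = 160) -> np.ndarray:
    """Frame-level pitch and energy tracks plus their deltas -> (N, M) matrix."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)  # pitch track
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]              # energy track
    n = min(len(f0), len(rms))
    base = np.stack([f0[:n], rms[:n]], axis=1)                     # static features
    delta = np.diff(base, axis=0, prepend=base[:1])                # dynamic features
    return np.concatenate([base, delta], axis=1)                   # (N, M = 4)
```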
2) Multi-scale neural network architecture
Because different acoustic features behave differently on different time scales, analyzing them at only one granularity may miss local information. For example, the stress or intonation of a whole sentence often requires a longer analysis window to find the corresponding statistical characteristics and patterns, whereas a convolutional network with a fixed kernel can only analyze features within a time window of one scale and ignores prosodic information at other scales.
The study of convolutional neural networks originated in biological research on the visual system: in 1962, Hubel and Wiesel discovered cells in the visual cortex of cats that are sensitive to local spatial regions of the visual input, which they defined as the "receptive field". Receptive fields cover the entire visual field in a certain way and better capture local spatial correlations in an image. Researchers therefore extended this structural property to neural networks to extract local features of the input layer. A convolutional neural network comprises an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer, among which the convolutional and pooling layers are the core. The invention uses the one-dimensional convolution shown in Fig. 1 below to analyze and extract the original convolution features:
the patent sets T (1,2, \8230;, T) convolutional neural networks with different scales, and the convolution kernel size of each convolutional network is C 1 *1,C 2 *1,…,C T *1, where the number of each convolution kernel is M.
Step four, multi-feature fusion based on attention mechanism
1) For the prosodic features of T different scales learned in step three, assume the T features are represented by S = [s_1, s_2, …, s_T]. The final feature expression E and its calculation by the attention mechanism are shown below:
Q = Q' · W_q,  K = S · W_k,  V = S · W_v
E = softmax(Q · K^T / √d_m) · V    (2)
where Q' is a randomly initialized vector of the neural network, and W_q, W_k, W_v are randomly initialized matrices of the neural network that apply linear transformations to Q' and S; the transformations yield the query vector Q, the comparison vector K, and the content vector V. Dividing by √d_m, where d_m is the vector dimension, scales down the dot products, and softmax is adopted as the scoring function to fix the result within the interval 0-1. These parameters are updated as the neural network keeps learning, finally realizing the fusion of the features of different scales.
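A PyTorch sketch of this fusion step follows; it implements the scaled dot-product attention of formula (2) with a single learned query Q', and the projection dimension d_m is an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse the T scale features S = [s_1, ..., s_T] as in formula (2)."""

    def __init__(self, m_dims: int, d_m: int):
        super().__init__()
        self.q_prime = nn.Parameter(torch.randn(1, d_m))  # randomly initialized Q'
        self.w_q = nn.Linear(d_m, d_m, bias=False)        # W_q
        self.w_k = nn.Linear(m_dims, d_m, bias=False)     # W_k
        self.w_v = nn.Linear(m_dims, d_m, bias=False)     # W_v

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, T, M), one row per scale from the multi-scale CNN.
        q = self.w_q(self.q_prime)                    # query Q, shape (1, d_m)
        k, v = self.w_k(s), self.w_v(s)               # keys and values, (batch, T, d_m)
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5  # scaled dot products
        weights = torch.softmax(scores, dim=-1)       # attention weights in 0-1
        return weights @ v                            # fused feature E, (batch, 1, d_m)
```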
2) Although a final evaluation score could be calculated from the fused feature E alone, considering the authority of the GOP score, this patent further fuses the score calculated from the fused features (score_e) with the GOP score, as shown in the following formula:
score_final = α * score_e + (1 - α) * GOP
α = sigmoid(W_α * s_{t+1} + b_α)    (3)
where s_{t+1} is the output before the softmax of the neural-network recognition model used in step one, and W_α and b_α are likewise randomly initialized weights in the evaluation model; the resulting score_final is the final evaluation result that jointly considers the different prosodic acoustic features and the GOP.
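Formula (3) itself is a one-layer gate; a hedged PyTorch sketch is below, where the dimensionality of s_{t+1} depends on the recognizer chosen in step one:

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """score_final = alpha * score_e + (1 - alpha) * GOP, as in formula (3)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.alpha_layer = nn.Linear(hidden_dim, 1)  # W_alpha and b_alpha

    def forward(self, score_e, gop, s_next):
        # s_next is the recognizer output before softmax (s_{t+1} in the text);
        # the sigmoid guarantees the weight alpha lies between 0 and 1.
        alpha = torch.sigmoid(self.alpha_layer(s_next))
        return alpha * score_e + (1 - alpha) * gop
```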
Evaluation steps (a dry-run sketch tying them together follows this list):
1) Receive the audio to be evaluated and, through the calculation in step one, obtain the GOP score and the output s_{t+1} before the softmax of the recognition model;
2) Extract the prosody-related features and extract the corresponding deep features through the CNNs of different scales;
3) Fuse the features of different scales through the attention mechanism;
4) Fuse the fused feature score with the original GOP score to obtain the final score.
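The following dry run shows how the sketches above would be wired together; the toy dimensions, the random stand-in tensors, and the simple mean used as a scoring head are all hypothetical, and a real system would use the trained recognizer, real alignments, and a learned regression head:

```python
import numpy as np
import torch

N, M, S_DIM = 200, 4, 512                 # toy frame count and feature dimensions
feats = torch.randn(1, N, M)              # stand-in for the (N, M) prosodic matrix
posteriors = torch.softmax(torch.randn(N, 40), dim=-1).numpy()  # stand-in softmax
states = np.random.randint(0, 40, size=N)                        # stand-in alignment

gop_val = gop_score(posteriors, states, t_s=50, t_e=80)          # steps one and two
scales = MultiScaleCNN(m_dims=M)(feats)                          # step three
E = AttentionFusion(m_dims=M, d_m=64)(torch.stack(scales, dim=1))  # step four 1)
score_e = E.mean()                        # hypothetical scoring head on E
s_next = torch.randn(1, S_DIM)            # stand-in pre-softmax output s_{t+1}
final = ScoreFusion(hidden_dim=S_DIM)(score_e, torch.tensor(gop_val), s_next)
```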

Claims (2)

1. A neural network-based multi-scale fusion pronunciation evaluation model optimization method, characterized by comprising the following steps:
step one, designing and selecting an acoustic model: selecting an end-to-end speech recognition model as the acoustic model for calculating the GOP (Goodness of Pronunciation) score of the audio to be evaluated; in addition, a pronunciation evaluation training data set needs to be designed and constructed for the training of subsequent models;
step two, calculating the GOP score based on the neural network: after the speech recognition model from step one is trained, recognizing the speech to be evaluated with it and calculating the GOP score from the output of the neural network:
GOP(P) = (1 / (t_e - t_s + 1)) * Σ_{t=t_s}^{t_e} log p(s_t | O_t)    (1)
formula (1) takes the average frame-level posterior probability constructed from the neural network output as the GOP score; here p(s_t | O_t) is the output of the last softmax layer of the neural network model, where O is the observation sequence of the speech and O_t is the observation of the speech frame at time t, t_s and t_e are the start and end frames of the phoneme P, and s_t is the state label of frame t obtained by forced alignment;
step three, construction of the multi-scale convolutional neural network:
1) Extraction of prosody-related features: extracting the prosody-related acoustic features of each frame and using them as the input of the convolutional neural network; assuming the speech to be evaluated is divided into N frames and each frame contains M-dimensional prosody-related features, the input is an N × M matrix;
2) Construction of the multi-scale neural network, using one-dimensional convolution to analyze and extract the original convolution features:
setting T (1, 2, \8230;, T) convolution neural networks with different scales, wherein the convolution kernel size of each convolution network is C 1 *1,C 2 *1,…,C T *1, wherein the number of each convolution kernel is M;
step four, multi-feature fusion based on attention mechanism:
1) For the prosodic features of T different scales learned in step three, assume the T features are represented by S = [s_1, s_2, …, s_T]. The resulting feature representation E can be calculated according to the attention mechanism of formula (2) below:
Q = Q' · W_q,  K = S · W_k,  V = S · W_v
E = softmax(Q · K^T / √d_m) · V    (2)
where Q' is a randomly initialized vector of the neural network, and W_q, W_k, W_v are randomly initialized matrices of the neural network that apply linear transformations to Q' and S; the transformations yield the query vector Q, the comparison vector K, and the content vector V; dividing by √d_m, where d_m is the vector dimension, scales down the dot products, and softmax is adopted as the scoring function to fix the result within the interval 0-1; these parameters are updated as the neural network keeps learning, finally realizing the fusion of the features of different scales;
2) The score calculated from the fused features, score_e, is further fused with the GOP score, as shown in formula (3) below:
score_final = α * score_e + (1 - α) * GOP
α = sigmoid(W_α * s_{t+1} + b_α)    (3)
where s_{t+1} is the output before the softmax of the neural-network recognition model used in step one, W_α and b_α are likewise randomly initialized matrices in the evaluation model used for the linear transformation, α is the weight of score_e, and (1 - α) is the weight of the GOP score; the resulting score_final is the final evaluation result that jointly considers the different prosodic acoustic features and the GOP, and sigmoid is an activation function that guarantees α is a weight value between 0 and 1.
2. The optimization method according to claim 1, characterized in that the evaluation steps are:
1) receiving the audio to be evaluated and, through the calculation in step one, obtaining the GOP score and the output s_{t+1} before the softmax of the recognition model;
2) extracting the prosody-related features and extracting the corresponding deep features through the CNNs of different scales;
3) fusing the features of different scales through the attention mechanism;
4) fusing the fused feature score with the original GOP score to obtain the final score.
CN202210772121.2A 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method Pending CN115240710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210772121.2A CN115240710A (en) 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210772121.2A CN115240710A (en) 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Publications (1)

Publication Number Publication Date
CN115240710A true CN115240710A (en) 2022-10-25

Family

ID=83672362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210772121.2A Pending CN115240710A (en) 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Country Status (1)

Country Link
CN (1) CN115240710A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115798519A (en) * 2023-02-10 2023-03-14 山东山大鸥玛软件股份有限公司 English multi-question spoken language pronunciation assessment method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination