CN115240710A - Neural network-based multi-scale fusion pronunciation evaluation model optimization method - Google Patents

Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Info

Publication number
CN115240710A
CN115240710A
Authority
CN
China
Prior art keywords
neural network
score
features
gop
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210772121.2A
Other languages
Chinese (zh)
Inventor
张句
贡诚
王宇光
关昊天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Zhiyan Information Technology Co ltd
Original Assignee
Suzhou Zhiyan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Zhiyan Information Technology Co ltd filed Critical Suzhou Zhiyan Information Technology Co ltd
Priority to CN202210772121.2A priority Critical patent/CN115240710A/en
Publication of CN115240710A publication Critical patent/CN115240710A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: specially adapted for particular use
    • G10L25/51: specially adapted for particular use for comparison or discrimination
    • G10L25/60: specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Abstract

The invention relates to the field of pronunciation evaluation, and in particular to a neural network-based multi-scale fusion pronunciation evaluation model optimization method that optimizes a pronunciation evaluation system using a neural network and multi-scale fusion. The method mainly comprises acoustic model design and selection, neural network-based GOP score calculation, construction of a multi-scale convolutional neural network, attention-based multi-feature fusion, and speech evaluation. Considering both the global and local nature of prosodic information, CNN networks of different scales are adopted to mine prosody-related pronunciation features of different granularities, and an attention mechanism model fuses the pronunciation features of different scales with features related to the posterior probability, realizing multi-scale fused pronunciation features.

Description

Neural network-based multi-scale fusion pronunciation evaluation model optimization method
Technical Field
The invention relates to the field of pronunciation evaluation, and in particular to a neural network-based multi-scale fusion pronunciation evaluation model optimization method that optimizes a pronunciation evaluation system using a neural network and multi-scale fusion technology.
Background
Automatic evaluation of English pronunciation is a technology in which a test taker reads a specified English text aloud and a computer gives an evaluation score according to the pronunciation quality. The computer performs a fair, objective, and efficient automatic evaluation of the test taker's English pronunciation level, helping English learners correct pronunciation errors and improve their spoken English. With the rapid development of the global economy, exchanges and cooperation among countries in politics, economy, culture, and education have become more frequent, and more and more people are beginning to learn a second language beyond their native tongue. Mastering a language of communication depends heavily on spoken-language learning. However, one-to-one, face-to-face interactive teaching is often limited by time, space, and economic conditions for both teachers and students, so online education is becoming popular, and pronunciation evaluation technology that automatically evaluates learners' pronunciation and corrects accent errors by computer is welcomed by most learners.
At present, in research at home and abroad, most automatic evaluation of prosodic pronunciation quality is carried out from the perspective of overall listening quality; pronunciation quality evaluation for specific sub-items, such as accent or rhythm, is relatively rare. When people communicate, the information transmitted is not only linguistic and textual but also rich prosodic information. Prosodic information belongs to the suprasegmental level and mainly reflects the speaker's cadence and pausing (rhythm), emphasis (stress), intonation, and tone. On the one hand, prosodic information helps the speaker express the intended message more clearly and accurately, improving the naturalness and comprehensibility of the speech; on the other hand, it helps the listener understand what is heard more clearly and accurately, including the speaker's intention, emotion, attitude, and tone. In the task of automatic pronunciation quality evaluation, evaluating prosodic pronunciation quality is therefore both necessary and important.
Against this background, and aiming at the shortcomings of existing pronunciation evaluation, this patent proposes a neural network-based multi-scale fusion pronunciation evaluation model: a neural-network speech recognition model is adopted as the acoustic model, several CNN networks of different scales convolve the evaluation features to mine prosodic information at different scales, and an attention mechanism model fuses the features of different scales for pronunciation evaluation.
Disclosure of Invention
The invention aims to solve the technical problems described in the background and adopts a neural network-based multi-scale fusion pronunciation evaluation model optimization method.
The technical scheme of the invention is a pronunciation evaluation model optimization method based on multi-scale fusion of a neural network, which comprises the following steps:
Step one, designing and selecting an acoustic model: select an end-to-end speech recognition model as the acoustic model for calculating the GOP (Goodness of Pronunciation) score of the audio to be evaluated; in addition, a pronunciation evaluation training data set needs to be designed and constructed for training the subsequent models.
Step two, calculating the GOP score based on the neural network: after the speech recognition model from step one is trained, recognize the speech to be evaluated with it and calculate the GOP score from the output of the neural network:
GOP(P) = (1 / (t_e - t_s + 1)) * Σ_{t=t_s}^{t_e} log p(s_t | O_t)    (1)
Formula (1) takes the average frame-level posterior probability constructed from the neural network output as the GOP score. Here p(s_t | O_t) is the output of the last softmax layer of the neural network model, where O is the observation sequence of the speech and O_t is the observation of the speech frame at time t, t_s and t_e are the start and end frames of the phoneme P, and s_t is the state label of frame t obtained by forced alignment;
Step three, construction of the multi-scale convolutional neural network:
1) Extraction of prosody-related features: extract the prosody-related acoustic features of each frame and use them as the input of the convolutional neural network. Assuming the speech to be evaluated is divided into N frames and each frame contains M-dimensional prosody-related features, the input is an N × M matrix;
2) Construction of the multi-scale neural network, using one-dimensional convolution to analyze and extract the original convolution features:
setting T (1, 2, \8230;, T) convolution neural networks with different scales, wherein the convolution kernel size of each convolution network is C 1 *1,C 2 *1,…,C T *1, wherein the number of each convolution kernel is M;
step four, multi-feature fusion based on attention mechanism:
1) For the prosodic features of T different scales learned in step three, assume the T features are represented by S = [s_1, s_2, …, s_T]. The resulting feature representation E can be calculated according to the attention mechanism of formula (2) below:
Q = Q' · W_q,  K = S · W_k,  V = S · W_v
E = softmax(Q · K^T / √d_m) · V    (2)
where Q' is a randomly initialized vector of the neural network, and W_q, W_k, W_v are randomly initialized matrices of the neural network that apply linear transformations to Q' and S; the transformations yield the query vector Q, the comparison vector K, and the content vector V. Dividing by √d_m, where d_m is the vector dimension, scales down the dot products, and softmax is adopted as the scoring function to fix the result within the interval 0-1. These parameters are updated as the neural network keeps learning, finally realizing the fusion of the features of different scales.
2) The score calculated from the fused features, score_e, is further fused with the GOP score, as shown in formula (3) below:
score_final = α * score_e + (1 - α) * GOP
α = sigmoid(W_α * s_{t+1} + b_α)    (3)
where s_{t+1} is the output before the softmax of the neural-network recognition model used in step one, W_α and b_α are likewise randomly initialized matrices in the evaluation model used for the linear transformation, α is the weight of score_e, and (1 - α) is the weight of the GOP score. The resulting score_final is the final evaluation result that jointly considers the different prosodic acoustic features and the GOP, and sigmoid is an activation function that guarantees α is a weight value between 0 and 1.
Further, the evaluation steps:
1) Receive the audio to be evaluated and, through the calculation in step one, obtain the GOP score and the output s_{t+1} before the softmax of the recognition model;
2) Extract the prosody-related features and extract the corresponding deep features through the CNNs of different scales;
3) Fuse the features of different scales through the attention mechanism;
4) Fuse the fused feature score with the original GOP score to obtain the final score.
Beneficial effects:
The technical scheme of the invention achieves the following:
1) A traditional Goodness of Pronunciation (GOP) algorithm is combined with multiple prosody-related pronunciation features, realizing a neural network-based pronunciation evaluation model.
2) Considering the global and local characteristics of prosodic information, CNN networks of different scales are adopted to mine prosody-related pronunciation features of different granularities.
3) An attention mechanism model is adopted to fuse the pronunciation features of different scales with the features related to the posterior probability, realizing multi-scale fused pronunciation features.
Drawings
FIG. 1 is a schematic diagram of a one-dimensional convolutional neural network;
FIG. 2 is a flow chart of a pronunciation assessment system.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to solve the technical problems described in the background and adopts a neural network-based multi-scale fusion pronunciation evaluation model optimization method, whose design mainly covers the following three aspects:
1) A neural-network speech recognition model is used as the acoustic model, and the output of the neural network serves as the evaluation basis for pronunciation correctness.
2) Convolutional neural networks of different scales are adopted to mine prosody-related features of different scales, taking into account prosodic information such as emphasis, intonation, and stress.
3) An attention mechanism method is adopted to learn the weights of the related features of different scales and the GOP, realizing a pronunciation evaluation system that fuses multiple kinds of information.
Step one, designing and selecting an acoustic model.
1) The invention selects a general end-to-end speech recognition model as the acoustic model to calculate the GOP score of the audio to be evaluated. This neural-network-based end-to-end acoustic model needs to be pre-trained.
2) Construct a pronunciation evaluation training data set: invite three experienced English teachers to score the overall pronunciation quality of each utterance from the three aspects of pronunciation accuracy, fluency, and completeness on a 0-5 scale, where 0 is the lowest and 5 the highest; the average of the three teachers' scores is taken as the manual score for each piece of speech data.
Step two, calculating the GOP score based on the neural network.
After the speech recognition model from step one is trained, the speech to be evaluated can be recognized with it, and the GOP score is calculated from the output of the neural network:
GOP(P) = (1 / (t_e - t_s + 1)) * Σ_{t=t_s}^{t_e} log p(s_t | O_t)    (1)
Formula (1) takes the average frame-level posterior probability constructed from the neural network output as the GOP score. Here p(s_t | O_t) is the output of the last softmax layer of the neural network model, where O is the observation sequence of the speech and O_t is the observation of the speech frame at time t, t_s and t_e are the start and end frames of the phoneme P, and s_t is the state label of frame t obtained by forced alignment.
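As a concrete illustration of formula (1), the following minimal NumPy sketch (an illustration only, not the patent's implementation; the array layout and the epsilon guard are assumptions) computes the GOP score of one phoneme from the acoustic model's softmax outputs and a forced alignment:

```python
import numpy as np

def gop_score(posteriors: np.ndarray, states: np.ndarray,
              t_s: int, t_e: int) -> float:
    """Formula (1): average frame-level log posterior over a phoneme span.

    posteriors: (N, S) softmax outputs of the acoustic model, one row per frame.
    states:     (N,)   state label s_t of each frame, from forced alignment.
    t_s, t_e:   start and end frames of the phoneme P (inclusive).
    """
    frames = np.arange(t_s, t_e + 1)
    # p(s_t | O_t) is read off the softmax row of each aligned frame.
    p = posteriors[frames, states[frames]]
    # A small epsilon guards against log(0) on near-zero posteriors.
    return float(np.mean(np.log(p + 1e-10)))
```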
and step three, constructing a multi-scale convolution neural network.
1) Extraction of prosody-related features
The three acoustic features most commonly associated with prosodic perception are pitch, energy, and duration, together with their corresponding statistical and dynamic features, so these prosody-related acoustic features of each frame are first extracted and used as the input of the convolutional neural network. Assuming the speech to be evaluated is divided into N frames and each frame contains the above prosody-related features in M dimensions, the input is an N × M matrix.
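By way of illustration, a front-end sketch of such frame-level feature extraction is given below using librosa; the pitch range, the hop length, and the use of first-order deltas as the dynamic features are assumptions, and duration statistics would in practice come from the forced alignment rather than from this front end:

```python
import numpy as np
import librosa

def prosodic_features(wav_path: str, hop: int = 160) -> np.ndarray:
    """Frame-level pitch and energy tracks plus their deltas -> (N, M) matrix."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)  # pitch track
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]              # energy track
    n = min(len(f0), len(rms))
    base = np.stack([f0[:n], rms[:n]], axis=1)                     # static features
    delta = np.diff(base, axis=0, prepend=base[:1])                # dynamic features
    return np.concatenate([base, delta], axis=1)                   # (N, M = 4)
```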
2) Multi-scale neural network architecture
Because different acoustic features behave differently on different time scales, analyzing them at only one granularity may miss local information. For example, the stress or intonation of a whole sentence often requires a longer analysis window to find the corresponding statistical characteristics and patterns, whereas a convolutional network with a fixed kernel can only analyze features within a time window of one scale and ignores prosodic information at other scales.
The study of convolutional neural networks originated in biological research on the visual system: in 1962, Hubel and Wiesel discovered cells in the visual cortex of cats that are sensitive to local spatial regions of the visual input, which they defined as the "receptive field". Receptive fields cover the entire visual field in a certain way and better capture local spatial correlations in an image. Researchers therefore extended this structural property to neural networks to extract local features of the input layer. A convolutional neural network comprises an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer, among which the convolutional and pooling layers are the core. The invention uses the one-dimensional convolution shown in Fig. 1 below to analyze and extract the original convolution features:
the patent sets T (1,2, \8230;, T) convolutional neural networks with different scales, and the convolution kernel size of each convolutional network is C 1 *1,C 2 *1,…,C T *1, where the number of each convolution kernel is M.
Step four, multi-feature fusion based on attention mechanism
1) For the prosodic features of T different scales learned in step three, assume the T features are represented by S = [s_1, s_2, …, s_T]. The final feature expression E and its calculation by the attention mechanism are shown below:
Q = Q' · W_q,  K = S · W_k,  V = S · W_v
E = softmax(Q · K^T / √d_m) · V    (2)
where Q' is a randomly initialized vector of the neural network, and W_q, W_k, W_v are randomly initialized matrices of the neural network that apply linear transformations to Q' and S; the transformations yield the query vector Q, the comparison vector K, and the content vector V. Dividing by √d_m, where d_m is the vector dimension, scales down the dot products, and softmax is adopted as the scoring function to fix the result within the interval 0-1. These parameters are updated as the neural network keeps learning, finally realizing the fusion of the features of different scales.
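A PyTorch sketch of this fusion step follows; it implements the scaled dot-product attention of formula (2) with a single learned query Q', and the projection dimension d_m is an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse the T scale features S = [s_1, ..., s_T] as in formula (2)."""

    def __init__(self, m_dims: int, d_m: int):
        super().__init__()
        self.q_prime = nn.Parameter(torch.randn(1, d_m))  # randomly initialized Q'
        self.w_q = nn.Linear(d_m, d_m, bias=False)        # W_q
        self.w_k = nn.Linear(m_dims, d_m, bias=False)     # W_k
        self.w_v = nn.Linear(m_dims, d_m, bias=False)     # W_v

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, T, M), one row per scale from the multi-scale CNN.
        q = self.w_q(self.q_prime)                    # query Q, shape (1, d_m)
        k, v = self.w_k(s), self.w_v(s)               # keys and values, (batch, T, d_m)
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5  # scaled dot products
        weights = torch.softmax(scores, dim=-1)       # attention weights in 0-1
        return weights @ v                            # fused feature E, (batch, 1, d_m)
```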
2) Although a final evaluation score could be calculated from the fused feature E alone, considering the authority of the GOP score, this patent further fuses the score calculated from the fused features (score_e) with the GOP score, as shown in the following formula:
score_final = α * score_e + (1 - α) * GOP
α = sigmoid(W_α * s_{t+1} + b_α)    (3)
where s_{t+1} is the output before the softmax of the neural-network recognition model used in step one, and W_α and b_α are likewise randomly initialized weights in the evaluation model; the resulting score_final is the final evaluation result that jointly considers the different prosodic acoustic features and the GOP.
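Formula (3) itself is a one-layer gate; a hedged PyTorch sketch is below, where the dimensionality of s_{t+1} depends on the recognizer chosen in step one:

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """score_final = alpha * score_e + (1 - alpha) * GOP, as in formula (3)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.alpha_layer = nn.Linear(hidden_dim, 1)  # W_alpha and b_alpha

    def forward(self, score_e, gop, s_next):
        # s_next is the recognizer output before softmax (s_{t+1} in the text);
        # the sigmoid guarantees the weight alpha lies between 0 and 1.
        alpha = torch.sigmoid(self.alpha_layer(s_next))
        return alpha * score_e + (1 - alpha) * gop
```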
Evaluation steps (a dry-run sketch tying them together follows this list):
1) Receive the audio to be evaluated and, through the calculation in step one, obtain the GOP score and the output s_{t+1} before the softmax of the recognition model;
2) Extract the prosody-related features and extract the corresponding deep features through the CNNs of different scales;
3) Fuse the features of different scales through the attention mechanism;
4) Fuse the fused feature score with the original GOP score to obtain the final score.
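The following dry run shows how the sketches above would be wired together; the toy dimensions, the random stand-in tensors, and the simple mean used as a scoring head are all hypothetical, and a real system would use the trained recognizer, real alignments, and a learned regression head:

```python
import numpy as np
import torch

N, M, S_DIM = 200, 4, 512                 # toy frame count and feature dimensions
feats = torch.randn(1, N, M)              # stand-in for the (N, M) prosodic matrix
posteriors = torch.softmax(torch.randn(N, 40), dim=-1).numpy()  # stand-in softmax
states = np.random.randint(0, 40, size=N)                        # stand-in alignment

gop_val = gop_score(posteriors, states, t_s=50, t_e=80)          # steps one and two
scales = MultiScaleCNN(m_dims=M)(feats)                          # step three
E = AttentionFusion(m_dims=M, d_m=64)(torch.stack(scales, dim=1))  # step four 1)
score_e = E.mean()                        # hypothetical scoring head on E
s_next = torch.randn(1, S_DIM)            # stand-in pre-softmax output s_{t+1}
final = ScoreFusion(hidden_dim=S_DIM)(score_e, torch.tensor(gop_val), s_next)
```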

Claims (2)

1. A neural network-based multi-scale fusion pronunciation evaluation model optimization method, characterized by comprising the following steps:
step one, designing and selecting an acoustic model: selecting an end-to-end speech recognition model as the acoustic model for calculating the GOP (Goodness of Pronunciation) score of the audio to be evaluated; in addition, a pronunciation evaluation training data set needs to be designed and constructed for the training of subsequent models;
step two, calculating the GOP score based on the neural network: after the speech recognition model from step one is trained, recognizing the speech to be evaluated with it and calculating the GOP score from the output of the neural network:
GOP(P) = (1 / (t_e - t_s + 1)) * Σ_{t=t_s}^{t_e} log p(s_t | O_t)    (1)
formula (1) takes the average frame-level posterior probability constructed from the neural network output as the GOP score; here p(s_t | O_t) is the output of the last softmax layer of the neural network model, where O is the observation sequence of the speech and O_t is the observation of the speech frame at time t, t_s and t_e are the start and end frames of the phoneme P, and s_t is the state label of frame t obtained by forced alignment;
step three, construction of the multi-scale convolutional neural network:
1) Extraction of prosody-related features: extracting the prosody-related acoustic features of each frame and using them as the input of the convolutional neural network; assuming the speech to be evaluated is divided into N frames and each frame contains M-dimensional prosody-related features, the input is an N × M matrix;
2) Construction of the multi-scale neural network, using one-dimensional convolution to analyze and extract the original convolution features:
setting T (1, 2, \8230;, T) convolution neural networks with different scales, wherein the convolution kernel size of each convolution network is C 1 *1,C 2 *1,…,C T *1, wherein the number of each convolution kernel is M;
step four, multi-feature fusion based on attention mechanism:
1) For the prosodic features of T different scales learned in step three, assume the T features are represented by S = [s_1, s_2, …, s_T]. The resulting feature representation E can be calculated according to the attention mechanism of formula (2) below:
Q = Q' · W_q,  K = S · W_k,  V = S · W_v
E = softmax(Q · K^T / √d_m) · V    (2)
where Q' is a randomly initialized vector of the neural network, and W_q, W_k, W_v are randomly initialized matrices of the neural network that apply linear transformations to Q' and S; the transformations yield the query vector Q, the comparison vector K, and the content vector V; dividing by √d_m, where d_m is the vector dimension, scales down the dot products, and softmax is adopted as the scoring function to fix the result within the interval 0-1; these parameters are updated as the neural network keeps learning, finally realizing the fusion of the features of different scales;
2) The score calculated from the fused features, score_e, is further fused with the GOP score, as shown in formula (3) below:
score_final = α * score_e + (1 - α) * GOP
α = sigmoid(W_α * s_{t+1} + b_α)    (3)
where s_{t+1} is the output before the softmax of the neural-network recognition model used in step one, W_α and b_α are likewise randomly initialized matrices in the evaluation model used for the linear transformation, α is the weight of score_e, and (1 - α) is the weight of the GOP score; the resulting score_final is the final evaluation result that jointly considers the different prosodic acoustic features and the GOP, and sigmoid is an activation function that guarantees α is a weight value between 0 and 1.
2. The optimization method according to claim 1, characterized in that the evaluation steps are:
1) receiving the audio to be evaluated and, through the calculation in step one, obtaining the GOP score and the output s_{t+1} before the softmax of the recognition model;
2) extracting the prosody-related features and extracting the corresponding deep features through the CNNs of different scales;
3) fusing the features of different scales through the attention mechanism;
4) fusing the fused feature score with the original GOP score to obtain the final score.
CN202210772121.2A 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method Pending CN115240710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210772121.2A CN115240710A (en) 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210772121.2A CN115240710A (en) 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Publications (1)

Publication Number Publication Date
CN115240710A true CN115240710A (en) 2022-10-25

Family

ID=83672362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210772121.2A Pending CN115240710A (en) 2022-06-30 2022-06-30 Neural network-based multi-scale fusion pronunciation evaluation model optimization method

Country Status (1)

Country Link
CN (1) CN115240710A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115798519A (en) * 2023-02-10 2023-03-14 山东山大鸥玛软件股份有限公司 English multi-question spoken language pronunciation assessment method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination