CN113345467B - Spoken language pronunciation evaluation method, device, medium and equipment - Google Patents
Spoken language pronunciation evaluation method, device, medium and equipment
- Publication number
- CN113345467B (application CN202110545441.XA)
- Authority
- CN
- China
- Prior art keywords
- evaluated
- acoustic
- disturbance
- acoustic feature
- frequency band
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a spoken language pronunciation evaluation method, device, medium and equipment. The method comprises the following steps: acquiring audio to be evaluated and text to be evaluated from the spoken language to be evaluated; extracting a first acoustic feature from the audio to be evaluated, and generating a second acoustic feature by applying a frequency disturbance to the first acoustic feature; generating a phoneme sequence from the text to be evaluated, and combining the phoneme sequence with an HMM model to generate a decoding network; and inputting the second acoustic feature into the decoding network to obtain acoustic information, and performing the GOP scoring calculation by using the acoustic information. The second acoustic feature is obtained by performing pre-emphasis, windowing and framing, and frequency-domain random disturbance on the audio features, so that the signal distortion caused by front-end signal processing is simulated and the extraction performance of the audio features in an actual noisy environment is improved. A decoding network is constructed from the text to be evaluated and word pronunciations are generated in combination with the context, improving the pronunciation evaluation accuracy under special pronunciation phenomena and thereby ensuring the accuracy of spoken language pronunciation evaluation.
Description
Technical Field
The invention belongs to the field of language identification, and particularly relates to a spoken language pronunciation evaluation method, device, medium and equipment.
Background
Computer-aided pronunciation scoring is an automatic method for evaluating pronunciation level, from which language learners can obtain real-time feedback on their pronunciation accuracy.
Mainstream computer-aided pronunciation scoring systems are based on an automatic speech recognition framework and generally comprise three parts: an acoustic model, decoding, and GOP scoring. The basic idea is to compute acoustic information such as the phoneme likelihoods, posterior probabilities, and durations of the audio to be evaluated in a decoding network, and then use this acoustic information to calculate the GOP score.
However, in practical applications, the current methods have the following drawbacks:
(1) The acoustic model is usually trained on standard audio recorded in quiet scenes, so the pronunciation evaluation technology is usually limited to quiet environments. In a complex human-voice environment such as a noisy classroom, the audio output by front-end signal processing is usually sent directly to the evaluation module; the speech distortion introduced by the front-end signal processing then causes serious degradation of evaluation performance, making the technology difficult to use in an actual English classroom.
(2) The decoding network is constructed based on a pronunciation dictionary to obtain the phoneme sequence of each word. In practice, however, the reasonable phoneme sequence of the same word may differ across contexts, which can lead to erroneous judgments under special pronunciation phenomena such as liaison or loss of plosion.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a spoken language pronunciation evaluation method, device, medium and equipment, which perform data amplification on the audio features to simulate the signal distortion caused by front-end signal processing, improving the extraction performance of the audio features in an actual noisy environment, and which obtain more reasonable phoneme information through a context-aware decoding network constructed from the text to be evaluated, thereby improving pronunciation evaluation accuracy.
In order to achieve the above objective, an embodiment of the present invention provides a spoken language pronunciation evaluation method, comprising the following steps:
acquiring audio to be evaluated and text to be evaluated from a spoken language to be evaluated;
extracting a first acoustic feature from the audio to be evaluated, and generating a second acoustic feature after frequency disturbance of the first acoustic feature;
generating a phoneme sequence from the text to be evaluated, and then combining the phoneme sequence with an HMM model to generate a decoding network;
and inputting the second acoustic characteristics into a decoding network to obtain acoustic information, and performing GOP scoring calculation by using the acoustic information.
Further, the method of extracting the first acoustic feature from the audio to be evaluated comprises: performing pre-emphasis, windowing and framing on the audio to be evaluated, and using the output of a mel-spectrum filter bank as the first acoustic feature.
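As a sketch, the pre-emphasis, windowing/framing, and mel filter-bank pipeline described above can be written as follows in Python. The specific parameter values (0.97 pre-emphasis coefficient, 25 ms frames, 10 ms hop, 40 mel bands, 512-point FFT) are common defaults and are not specified by the patent.

```python
import numpy as np

def first_acoustic_feature(signal, sr=16000, pre_emph=0.97,
                           frame_ms=25, hop_ms=10, n_mels=40, n_fft=512):
    # Pre-emphasis: boost high frequencies relative to low ones.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)           # 160 samples at 16 kHz
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum of each windowed frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filter bank spanning 0 .. sr/2.
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    hz_pts = 700.0 * (10.0 ** (np.linspace(0, mel_max, n_mels + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log filter-bank energies: one feature vector per frame.
    return np.log(power @ fbank.T + 1e-10)
```

The returned matrix (one row per frame, one column per mel band) plays the role of the first acoustic feature to which the frequency disturbance is then applied.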
Further, the method for generating the second acoustic feature after frequency disturbance of the first acoustic feature is as follows:
S100, randomly generating a starting frequency band number from a uniform distribution according to the set starting disturbance coefficient: i ~ uniform(0, ratio1 × F), wherein i is the generated starting frequency band number, F is the maximum frequency band index of the input feature, and ratio1 is the starting disturbance coefficient;
S101, randomly generating a disturbance frequency bandwidth from a uniform distribution according to the set frequency disturbance coefficient: K ~ uniform(0, ratio2 × F), wherein K is the generated disturbance frequency bandwidth, F is the maximum frequency band index of the input feature, and ratio2 is the frequency band disturbance coefficient;
S102, weighting the selected [i, i+K] frequency bands to generate the second acoustic feature.
Further, the step of generating a phoneme sequence from the text to be evaluated is as follows:
S200, dividing the text to be evaluated into sense groups by using a pre-trained spoken-language position prediction model to obtain the sense-group boundaries; S201, within each sense group, giving the phoneme sequence of the whole group in combination with the predicted pronunciation rules such as liaison and loss of plosion, while recording the correspondence between the phoneme sequence and the words for subsequent word-level score output; wherein,
when modeling with position-dependent phonemes, for adjacent words connected by liaison or loss of plosion, the generated phoneme sequence uses the middle-phoneme form for all phonemes except the head and tail phonemes.
Further, the acoustic information obtained by inputting the second acoustic feature into the decoding network includes a phoneme likelihood, a posterior probability, and a duration.
An embodiment of the present invention provides a spoken language pronunciation evaluation device, including:
the acquisition module is configured to acquire audio to be evaluated and text to be evaluated from the spoken language to be evaluated;
the feature extraction module is configured to extract a first acoustic feature from the audio to be evaluated, and generate a second acoustic feature after frequency disturbance of the first acoustic feature;
the decoding network module is configured to generate a phoneme sequence from the text to be evaluated, and then combine the phoneme sequence with an HMM model to generate a decoding network;
and the GOP scoring module is configured to input the second acoustic feature into the decoding network to obtain acoustic information, and then perform the score calculation by using the acoustic information.
An embodiment of the present invention provides a computer-readable storage medium storing program code which, when executed by a processor, implements the steps of the spoken utterance evaluation method described above.
An embodiment of the present invention provides an electronic device including a processor and a storage medium storing program code which, when executed by the processor, implements the steps of a spoken utterance evaluation method as described above.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
According to the spoken language pronunciation evaluation method, device, medium and equipment, the second acoustic feature is obtained by performing pre-emphasis, windowing and framing, and frequency-domain random disturbance on the audio features, so that the signal distortion caused by front-end signal processing is simulated and the extraction performance of the audio features in an actual noisy environment is improved; in addition, a decoding network is constructed from the text to be evaluated and word pronunciations are generated in combination with the context, improving the pronunciation evaluation accuracy under special pronunciation phenomena and thereby ensuring the accuracy of spoken language pronunciation evaluation.
Drawings
The technical scheme of the invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of a spoken utterance evaluation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of sense-group division boundaries in a spoken utterance evaluation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a phoneme sequence in a spoken utterance evaluation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a spoken language pronunciation evaluation device according to an embodiment of the present invention.
Detailed Description
The invention will be described in further detail with reference to the accompanying drawings and specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention.
Referring to fig. 1, a spoken language pronunciation evaluation method according to an embodiment of the invention includes the following steps:
s001, acquiring audio to be evaluated and text to be evaluated from the spoken language to be evaluated.
After the audio to be evaluated and the text to be evaluated are obtained, S002 or S003 may be executed, and in this embodiment, the step S002 is executed first: extracting a first acoustic feature from the audio to be evaluated, and generating a second acoustic feature after frequency disturbance of the first acoustic feature.
Specifically, after the audio to be evaluated is extracted, pre-emphasis is firstly carried out on the audio to be evaluated, windowing and framing are carried out, and the output of a Mel frequency spectrum filter bank is used as a first acoustic feature; then, frequency domain random disturbance is carried out on the first acoustic feature to generate a second acoustic feature, and the operation method is as follows:
S100, randomly generating a starting frequency band number from a uniform distribution according to the set starting disturbance coefficient: i ~ uniform(0, ratio1 × F), wherein i is the generated starting frequency band number, F is the maximum frequency band index of the input feature, and ratio1 is the starting disturbance coefficient.
S101, randomly generating a disturbance frequency bandwidth from a uniform distribution according to the set frequency disturbance coefficient: K ~ uniform(0, ratio2 × F), wherein K is the generated disturbance frequency bandwidth, F is the maximum frequency band index of the input feature, and ratio2 is the frequency band disturbance coefficient.
S102, weighting the selected [i, i+K] frequency bands to generate the second acoustic feature; the selected frequency bands may use a uniform weighting coefficient, or randomly generated weighting coefficients.
In addition, on the basis of the above steps, all frequency bands may be divided into several blocks, with each block using its own starting disturbance coefficient, frequency band disturbance coefficient and weighting parameters, so that the disturbance is applied at a finer granularity across the frequency bands.
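Steps S100-S102, including the optional block-wise variant, can be sketched as follows. The value ranges for ratio1/ratio2 and the weight interval (0.5-1.5) are illustrative assumptions; the patent only requires uniform sampling of the start band and bandwidth, and either a uniform or a randomly generated weighting coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

def frequency_perturb(feat, ratio1=0.3, ratio2=0.3, n_blocks=1,
                      random_weights=True):
    """Frequency-domain random disturbance of a (frames x bands) feature
    matrix: pick i ~ uniform(0, ratio1*F) and K ~ uniform(0, ratio2*F),
    then weight bands [i, i+K] (per block when n_blocks > 1)."""
    F = feat.shape[1]                      # number of frequency bands
    out = feat.copy()
    for block in np.array_split(np.arange(F), n_blocks):
        Fb = len(block)
        i = rng.integers(0, max(int(ratio1 * Fb), 1))   # S100: start band
        K = rng.integers(0, max(int(ratio2 * Fb), 1))   # S101: bandwidth
        lo, hi = block[0] + i, min(block[0] + i + K + 1, block[-1] + 1)
        if random_weights:
            w = rng.uniform(0.5, 1.5, size=hi - lo)     # per-band random weights
        else:
            w = 0.8                                     # one uniform weight
        out[:, lo:hi] *= w                              # S102: weight the bands
    return out
```

Applied to the first acoustic feature during training or evaluation, this yields the second acoustic feature that simulates front-end distortion.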
In S002, the conventional quiet-scene data are augmented through the above steps; that is, the audio features are pre-emphasized, windowed and framed, and subjected to frequency-domain random disturbance to obtain the second acoustic feature, so as to simulate the signal distortion caused by front-end signal processing and improve the extraction performance of the audio features in an actual noisy environment.
Then, S003: generating a phoneme sequence from the text to be evaluated, and combining the phoneme sequence with an HMM model to generate a decoding network.
Specifically, the steps of generating a phoneme sequence from the text to be evaluated are as follows:
S200, dividing the text to be evaluated into sense groups by using a pre-trained spoken-language position prediction model to obtain the sense-group boundaries, as shown in FIG. 2. S201, within each sense group, giving the phoneme sequence of the whole group in combination with the predicted pronunciation rules such as liaison and loss of plosion, while recording the correspondence between the phoneme sequence and the words for subsequent word-level score output.
When modeling with position-dependent phonemes, for adjacent words connected by liaison or loss of plosion, all generated phonemes except the head and tail phonemes use the middle-phoneme form; see FIG. 3, in which p_b denotes a starting phoneme, p_i an intermediate phoneme, and p_e an ending phoneme.
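The position-dependent labelling described above can be sketched as follows. The `lexicon` mapping and the `linked` list of word boundaries joined by liaison or loss of plosion are hypothetical inputs here, since the patent obtains them from a pronunciation dictionary and a trained prediction model.

```python
def group_phoneme_sequence(words, lexicon, linked):
    """Tag the phonemes of one sense group with position markers
    (p_b = begin, p_i = middle, p_e = end). `linked` lists the indices j
    of word boundaries (between words[j] and words[j+1]) joined by
    liaison or loss of plosion; at those boundaries the word-final and
    word-initial phonemes are retagged as middle phonemes."""
    tagged, mapping = [], []
    for w in words:
        ph = lexicon[w]
        start = len(tagged)
        # Default word-position tags: begin / middle / end.
        tags = (["_b"] + ["_i"] * (len(ph) - 2) + ["_e"]) if len(ph) > 1 else ["_b"]
        tagged.extend(p + t for p, t in zip(ph, tags))
        # Record phoneme indices per word for word-level score output.
        mapping.append((w, list(range(start, len(tagged)))))
    for j in linked:
        end_idx = mapping[j][1][-1]        # last phoneme of word j
        begin_idx = mapping[j + 1][1][0]   # first phoneme of word j+1
        tagged[end_idx] = tagged[end_idx][:-2] + "_i"
        tagged[begin_idx] = tagged[begin_idx][:-2] + "_i"
    return tagged, mapping
```

For example, with a linked boundary between "sit" and "down", the final /t/ and initial /d/ are emitted as middle phonemes rather than an end and a begin phoneme, matching the loss-of-plosion case in FIG. 3.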
Finally, S004: inputting the second acoustic features into the decoding network to obtain acoustic information, and then performing the GOP scoring calculation with that information. In this embodiment, the acoustic information obtained from the decoding network comprises phoneme likelihood, posterior probability, and duration information, and the GOP calculation uses this acoustic information to produce the final score.
In S004, the decoding network is constructed from the text to be evaluated and word pronunciations are generated in combination with the context, improving the pronunciation evaluation accuracy under special pronunciation phenomena and thus ensuring the accuracy of spoken language pronunciation evaluation.
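The patent does not give an explicit GOP formula, so the following sketch uses a common definition of GOP: the average log posterior of the aligned phoneme over its duration. The input formats (a frame-by-phoneme posterior matrix and a forced-alignment list) are assumptions about what the decoding network produces.

```python
import numpy as np

def gop_scores(frame_posteriors, alignment):
    """frame_posteriors: (n_frames, n_phones) posteriors from the acoustic
    model; alignment: list of (label, phone_id, start_frame, end_frame)
    tuples from the decoding network's forced alignment."""
    scores = []
    for label, pid, start, end in alignment:
        # Average log posterior of the aligned phoneme over its frames;
        # the small epsilon guards against log(0).
        avg = float(np.mean(np.log(frame_posteriors[start:end, pid] + 1e-10)))
        scores.append((label, avg))
    return scores
```

Together with the phoneme-to-word correspondence recorded in S201, these per-phoneme scores can be aggregated into the word-level scores mentioned in the embodiment.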
The invention also provides a spoken language pronunciation evaluation device, which comprises:
and the acquisition module is configured to acquire the audio to be evaluated and the text to be evaluated from the spoken language to be evaluated.
The feature extraction module is configured to extract a first acoustic feature from the audio to be evaluated, and generate a second acoustic feature after frequency disturbance of the first acoustic feature.
And the decoding network module is configured to generate a phoneme sequence from the text to be evaluated, and then combine the phoneme sequence with an HMM model to generate a decoding network.
And the GOP scoring module is configured to input the second acoustic feature into the decoding network to obtain acoustic information, and then perform the score calculation by using the acoustic information.
The invention also discloses a computer readable storage medium, on which a computer program (i.e. a program product) is stored which, when being executed by a processor, carries out the steps described in the above-mentioned method embodiments.
For example, audio to be evaluated and text to be evaluated are obtained from a spoken language to be evaluated; extracting a first acoustic feature from the audio to be evaluated, and generating a second acoustic feature after frequency disturbance of the first acoustic feature; generating a phoneme sequence from the text to be evaluated, and then combining the phoneme sequence with an HMM model to generate a decoding network; inputting the second acoustic characteristics into a decoding network to obtain acoustic information, and performing GOP scoring calculation by using the acoustic information; the specific implementation of each step is not repeated here.
Next, the invention also discloses an electronic device, which can be a computer system or a server; components of an electronic device may include, but are not limited to: one or more processors or processing units, a system memory, and a bus that connects the different system components (including the system memory and the processing units).
The foregoing is merely a specific application example of the present invention, and the protection scope of the present invention is not limited in any way. All technical schemes formed by equivalent transformation or equivalent substitution fall within the protection scope of the invention.
Claims (7)
1. A spoken language pronunciation evaluation method is characterized by comprising the following steps:
acquiring audio to be evaluated and text to be evaluated from a spoken language to be evaluated;
extracting a first acoustic feature from the audio to be evaluated, and generating a second acoustic feature after frequency disturbance of the first acoustic feature, wherein the method for generating the second acoustic feature is as follows:
S100, randomly generating a starting frequency band number from a uniform distribution according to the set starting disturbance coefficient: i ~ uniform(0, ratio1 × F); wherein i is the generated starting frequency band number, F is the maximum frequency band index of the input feature, and ratio1 is the starting disturbance coefficient;
S101, randomly generating a disturbance frequency bandwidth from a uniform distribution according to the set frequency disturbance coefficient: K ~ uniform(0, ratio2 × F); wherein K is the generated disturbance frequency bandwidth, F is the maximum frequency band index of the input feature, and ratio2 is the frequency band disturbance coefficient;
S102, weighting the selected [i, i+K] frequency bands to generate the second acoustic feature;
generating a phoneme sequence from the text to be evaluated, and then combining the phoneme sequence with an HMM model to generate a decoding network;
and inputting the second acoustic characteristics into a decoding network to obtain acoustic information, and performing GOP scoring calculation by using the acoustic information.
2. The spoken utterance evaluation method of claim 1, characterized in that the method of extracting the first acoustic feature from the audio to be evaluated comprises: performing pre-emphasis, windowing and framing on the audio to be evaluated, and using the output of a mel-spectrum filter bank as the first acoustic feature.
3. The spoken utterance evaluation method of claim 1, wherein the steps of generating the phoneme sequence from the text to be evaluated are as follows:
S200, dividing the text to be evaluated into sense groups by using a pre-trained spoken-language position prediction model to obtain the sense-group boundaries;
S201, within each sense group, giving the phoneme sequence of the whole group in combination with the predicted pronunciation rules such as liaison and loss of plosion, while recording the correspondence between the phoneme sequence and the words for subsequent word-level score output; wherein,
when modeling with position-dependent phonemes, for adjacent words connected by liaison or loss of plosion, the generated phoneme sequence uses the middle-phoneme form for all phonemes except the head and tail phonemes.
4. The spoken utterance evaluation method of claim 1, wherein the acoustic information obtained by inputting the second acoustic feature into the decoding network comprises a phoneme likelihood, a posterior probability, and a duration.
5. A spoken utterance evaluation device, comprising:
the acquisition module is configured to acquire audio to be evaluated and text to be evaluated from the spoken language to be evaluated;
the feature extraction module is configured to extract a first acoustic feature from the audio to be evaluated and to generate a second acoustic feature after frequency disturbance of the first acoustic feature, wherein the method for generating the second acoustic feature is as follows:
S100, randomly generating a starting frequency band number from a uniform distribution according to the set starting disturbance coefficient: i ~ uniform(0, ratio1 × F); wherein i is the generated starting frequency band number, F is the maximum frequency band index of the input feature, and ratio1 is the starting disturbance coefficient;
S101, randomly generating a disturbance frequency bandwidth from a uniform distribution according to the set frequency disturbance coefficient: K ~ uniform(0, ratio2 × F); wherein K is the generated disturbance frequency bandwidth, F is the maximum frequency band index of the input feature, and ratio2 is the frequency band disturbance coefficient;
S102, weighting the selected [i, i+K] frequency bands to generate the second acoustic feature;
the decoding network module is configured to generate a phoneme sequence from the text to be evaluated, and then combine the phoneme sequence with an HMM model to generate a decoding network;
and the GOP scoring module is configured to input the second acoustic characteristics into a decoding network to obtain acoustic information, and then score calculation is performed by utilizing the acoustic information.
6. A computer readable storage medium storing program code which, when executed by a processor, implements the method of one of claims 1-4.
7. An electronic device comprising a processor and a storage medium storing program code which, when executed by the processor, implements the method of one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110545441.XA CN113345467B (en) | 2021-05-19 | 2021-05-19 | Spoken language pronunciation evaluation method, device, medium and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110545441.XA CN113345467B (en) | 2021-05-19 | 2021-05-19 | Spoken language pronunciation evaluation method, device, medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113345467A CN113345467A (en) | 2021-09-03 |
CN113345467B true CN113345467B (en) | 2023-10-20 |
Family
ID=77469439
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110545441.XA Active CN113345467B (en) | 2021-05-19 | 2021-05-19 | Spoken language pronunciation evaluation method, device, medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113345467B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102044248A (en) * | 2009-10-10 | 2011-05-04 | 北京理工大学 | Objective evaluating method for audio quality of streaming media |
CN102044247A (en) * | 2009-10-10 | 2011-05-04 | 北京理工大学 | Objective evaluation method for VoIP speech |
CN106297828A (en) * | 2016-08-12 | 2017-01-04 | 苏州驰声信息科技有限公司 | The detection method of a kind of mistake utterance detection based on degree of depth study and device |
CN110033784A (en) * | 2019-04-10 | 2019-07-19 | 北京达佳互联信息技术有限公司 | A kind of detection method of audio quality, device, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070239444A1 (en) * | 2006-03-29 | 2007-10-11 | Motorola, Inc. | Voice signal perturbation for speech recognition |
-
2021
- 2021-05-19 CN CN202110545441.XA patent/CN113345467B/en active Active
Non-Patent Citations (1)
Title |
---|
No-reference objective assessment of network audio quality; Yang Jiajun; Information Science and Technology Series (Issue 03); pp. 1-92 *
Also Published As
Publication number | Publication date |
---|---|
CN113345467A (en) | 2021-09-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||