CN115862674A - Method, system, equipment and medium for speech recognition and error correction of oral English evaluation - Google Patents

Method, system, equipment and medium for speech recognition and error correction of oral English evaluation

Info

Publication number
CN115862674A
CN115862674A (application CN202310138725.6A)
Authority
CN
China
Prior art keywords
module
words
character string
result
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310138725.6A
Other languages
Chinese (zh)
Inventor
许信顺
辛洁
马磊
陈义学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310138725.6A priority Critical patent/CN115862674A/en
Publication of CN115862674A publication Critical patent/CN115862674A/en
Pending legal-status Critical Current

Abstract

The invention discloses a speech recognition and error correction method, system, equipment and medium for English spoken language evaluation, which relate to the technical field of speech recognition and comprise the following steps: extracting Mel frequency cepstrum coefficients from the spoken English speech, and performing feature enhancement to obtain a feature map; encoding the feature map; decoding according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment; checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance of the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words, so that a correct character string recognition result is obtained, and the accuracy of the recognition result is improved.

Description

Method, system, equipment and medium for speech recognition and error correction of oral English evaluation
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition and error correction method, system, equipment and medium for oral English evaluation.
Background
Spoken English evaluation technology can automatically score and correct spoken pronunciation, and is widely applied to automatic scoring of oral examinations, oral practice and the like. Automatic Speech Recognition (ASR) is pattern recognition based on speech feature parameters: through learning, input speech is classified according to a certain pattern, and the optimal matching result is then found according to a judgment criterion. ASR is applied in scenarios such as in-vehicle systems, smartphones and smart home appliances.
Before deep learning expanded into the field of speech recognition, speech recognition models based on the Gaussian mixture model-hidden Markov model (GMM-HMM) were the mainstream approach. A speech recognition system of that kind generally comprises three parts: feature extraction, an acoustic model and a language model. Feature extraction converts the speech signal from the time domain to the frequency domain and extracts suitable features for the acoustic model; the acoustic model combines acoustics and phonetics and takes the features as input to produce an acoustic model score; the language model then computes the corresponding word-sequence probability from the acoustic model score. Although the GMM-HMM model trains quickly and yields a small acoustic model, it has an obvious drawback: it does not make full use of context information. With the wide application of Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Transformers, the recognition capability of acoustic models based on deep learning, or on a mixture of deep learning and traditional methods, has greatly exceeded that of the GMM-HMM model.
At present, the end-to-end deep learning models widely used in the field of speech recognition can be classified into two types: CTC (Connectionist Temporal Classification)-based methods and Attention-based methods. The CTC-based method solves the alignment problem between input and output: by computing all possible alignments during prediction, it enables training without aligning the input and output sequences in advance. It is usually combined with recurrent architectures such as RNN, LSTM and GRU, but tends to focus only on local information and to ignore global information. The Attention-based method generally adopts an Encoder-Decoder architecture, in which the larger a value in the weight vector, the more important the corresponding part is to the output; the alignment between the input and output sequences is learned from historical outputs and feature encodings. Its decoding is more flexible, but it ignores the ordering relations within the sequence.
In the traditional GMM-HMM model as well as end-to-end speech recognition models based on deep learning, words that do not exist in reality inevitably appear in the final prediction result because of pronunciation, recognition-algorithm and other problems, so text error correction is adopted to solve this problem. The text correction task is typically a sequence-to-sequence task: the input is text obtained from an input device, ASR or Optical Character Recognition (OCR), and the output is a complete sentence with the erroneous words corrected.
Current text error correction is mainly divided into two-stage methods and end-to-end methods. The two-stage method comprises a judgment stage and an error correction stage: the judgment stage identifies the erroneous text with an N-Gram or deep learning model, the input of the error correction stage is the erroneous part of the text, and the text is corrected by a deep learning or traditional method. The end-to-end method only comprises an error correction stage, where the input is the complete text and the output is the corrected text; because the end-to-end method usually gains speed at the expense of accuracy, some special processing is usually added at the input to mitigate the loss of accuracy.
At present, research on English speech recognition in the scientific community mainly targets standard British- or American-accented read speech under quiet background conditions, with short sentences, moderate speaking rate and clear pronunciation; speech recognition for spoken English evaluation in real scenarios differs considerably from this research setting. Specifically, spoken English evaluation in real scenarios faces several problems: constrained by the examination-room environment, the speech data collected in real examination rooms contain a large amount of background noise, which greatly affects recognition; influenced by mother tongue and dialect, readers' English pronunciation varies, which increases the recognition difficulty; and unlike the high-quality public data sets used in research, the speech collected in real examination rooms usually exceeds one minute in length, with a large amount of three-minute recordings containing pauses, silence and other complex conditions, so current methods, limited by the application scenario, find it difficult to recognize such speech accurately.
Disclosure of Invention
In order to solve the above problems, the invention provides a method, system, equipment and medium for speech recognition and error correction for spoken English evaluation, which extract Mel frequency cepstrum coefficients as features, perform feature enhancement by warping and masking, and, after decoding, correct the decoding result using edit distance and occurrence frequency, so as to obtain a more accurate recognition result.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a speech recognition and error correction method for spoken English evaluation, including:
extracting Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech to form a spectrogram;
performing feature enhancement on the spectrogram by warping and masking to obtain a feature map;
encoding the feature map;
decoding according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance of the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words so as to obtain a correct character string recognition result.
As an alternative implementation, the time domain signal of the spoken English speech is pre-emphasized by a high-pass filter, the pre-emphasized time domain signal is subjected to frame division and windowing, and the time domain signal in each window is converted into a frequency domain signal by adopting fast fourier transform; and (3) passing the frequency domain signal through a set of triangular filter banks with a Mel scale, and then extracting Mel frequency cepstrum coefficients through discrete cosine transform.
As an alternative embodiment, the warping is time warping, and the masking is frequency masking and time masking.
As an alternative implementation, the feature map is encoded by an encoder comprising a feedforward module, a multi-head self-attention module and a convolution module, wherein the feedforward module consists of two 1/2-weighted feedforward modules connected respectively before the multi-head self-attention module and after the convolution module; the feature map is first processed by the feedforward module, attention features are extracted by the multi-head self-attention mechanism, the output of the multi-head self-attention module is input to the convolution module after layer normalization and point-by-point convolution, and encoding is completed by the final 1/2-weighted feedforward module.
As an alternative embodiment, the decoding process adds null elements to align the characters and phonemes.
As an alternative embodiment, a dictionary is constructed in advance and the occurrence frequency of words in the dictionary is determined at the same time, a BK tree is constructed according to the editing distance between the words in the dictionary, and the candidate words are screened based on the BK tree.
As an alternative embodiment, after weighting scores corresponding to edit distances and appearance frequencies of candidate words, the candidate word with the highest total score is taken as the correct word.
In a second aspect, the present invention provides a speech recognition and error correction system for spoken English evaluation, including:
the characteristic extraction module is configured to extract a Mel frequency cepstrum coefficient after time-frequency conversion is carried out on the spoken English voice so as to form a spectrogram;
the feature enhancement module is configured to perform feature enhancement on the spectrogram by warping and masking to obtain a feature map;
an encoding module configured to encode the feature map;
the decoding module is configured to decode according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and the error correction module is configured to check the character string decoding result according to a preset dictionary, screen candidate words in the dictionary according to the editing distance for the words with wrong decoding, and determine correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the words with wrong decoding, so that a correct character string recognition result is obtained.
In a third aspect, the present invention provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, wherein when the computer instructions are executed by the processor, the method of the first aspect is performed.
In a fourth aspect, the present invention provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method, a system, equipment and a medium for recognizing and correcting a speech for evaluating an English spoken language, which enable a model to better learn speech characteristics by performing a characteristic enhancement combined processing mode of time distortion, frequency shielding and time shielding on a spectrogram and by enlarging the number of the spectrogram.
The invention provides a method, system, equipment and medium for speech recognition and error correction for spoken English evaluation, and designs a CNN-improved Transformer-structured encoder: the Transformer can capture long-sequence dependencies and content-based global interaction information, while the CNN can effectively exploit local features, thereby realizing both local and global dependency modeling of the audio sequence; a multi-head self-attention mechanism is used in the decoding stage to strengthen attention to local information; and by expanding the label set and adding null elements, the misalignment between the input sequence and the output sequence is solved.
Traditional BK-tree-based error correction focuses on the several candidate words with the shortest edit distance; however, rarely used words often appear among these candidates, which is undesirable. Therefore, occurrence-frequency statistics are added when constructing the dictionary and the BK tree, and the final recognition result is determined by jointly considering the edit distance and the occurrence frequency, making recognition more accurate.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, not to limit it.
Fig. 1 is a flowchart of a speech recognition and error correction method for spoken english evaluation according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a BK tree provided in embodiment 1 of the present invention.
Detailed Description
The invention is further explained by the following embodiments in conjunction with the drawings.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example 1
The embodiment provides a speech recognition and error correction method for spoken English evaluation, as shown in fig. 1, including:
extracting Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech to form a spectrogram;
performing feature enhancement on the spectrogram by warping and masking to obtain a feature map;
encoding the feature map;
decoding according to the coding result and the character string recognition result at the previous moment to obtain a character string decoding result at the current moment;
checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance for the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words, so that a correct character string recognition result is obtained.
The Mel frequency is derived from the auditory characteristics of the human ear and forms a nonlinear correspondence with the Hertz frequency. Mel-frequency cepstrum coefficients (MFCC) are spectral features computed using this correspondence, which makes the spectrum closer to the nonlinear human auditory system; they are mainly used to extract speech-data features and reduce the computational dimensionality. The MFCC feature extraction process mainly converts the obtained spoken English speech from the time domain to the frequency domain and then obtains the Mel frequency cepstrum coefficients as features, and specifically comprises the following steps:
(1) For the time domain signal x(t) of the given spoken English speech at a time point t, pre-emphasis processing is carried out to obtain the processed time domain signal y(t). The pre-emphasis processing passes the time domain signal of the spoken English speech through a high-pass filter, so as to boost the high-frequency part, flatten the spectrum of the signal, and keep the same signal-to-noise ratio over the whole band from low frequency to high frequency, as shown in formula (1):

y(t) = x(t) − α·x(t−1)    (1)

where α is the pre-emphasis coefficient.
(2) In order to reduce the influence of unsteadiness and time variation of the whole time domain signal of the spoken English voice, the time domain signal after the pre-emphasis processing is subjected to framing processing, wherein the frame length is usually 25ms;
in order to ensure smooth transition between frames and maintain continuity of the frames, the framing generally adopts an overlapping and segmenting method to ensure that two adjacent frames overlap with each other by a portion, a time difference between start positions of the two adjacent frames is called frame shift, and the frame shift is generally 10ms.
The framed time domain signal is non-periodic, and there is a problem of frequency leakage after fourier transform, so in order to reduce leakage error to the maximum extent, the embodiment adopts a windowing function, so that the time domain signal better meets the periodicity requirement of fourier transform.
In this embodiment, a Hamming window is selected as the windowing function, so that the value of the framed time domain signal at the window boundary is approximately 0 and the framed signal approaches a periodic signal. The windowing function is:

w(n) = 0.54 − 0.46·cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1    (2)

where N is the window length.
(3) The data in each window are converted from the time domain signal y(n) into a frequency domain signal Y(k) using the Fast Fourier Transform (FFT), as shown in formula (3):

Y(k) = Σ_{n=0}^{N−1} y(n)·e^{−j2πkn/N}, 0 ≤ k ≤ N − 1    (3)

where N is the number of points of the Fourier transform and e is the natural base.
(4) The frequency domain signal passes through a group of Mel-scale triangular filter banks to smooth the frequency spectrum and eliminate the effect of harmonic wave, so as to highlight the formants of the original voice, and then the MFCC is obtained through Discrete Cosine Transform (DCT), thereby forming a spectrogram.
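The pipeline described in steps (1)-(4) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the 25 ms frame length and 10 ms frame shift follow the text, while the pre-emphasis coefficient, FFT size, number of Mel filters and number of coefficients are common-practice assumptions, and the triangular filter-bank construction is simplified.

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
         n_fft=512, n_mels=26, n_mfcc=13, alpha=0.97):
    """Minimal MFCC sketch: pre-emphasis -> framing -> Hamming window
    -> FFT -> Mel filter bank -> log -> DCT."""
    # (1) pre-emphasis: y(t) = x(t) - alpha * x(t-1)
    y = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # (2) framing (25 ms frames, 10 ms shift) and Hamming windowing
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + max(0, (len(y) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(flen)

    # (3) power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # (4) triangular Mel-scale filter bank, then log energies
    to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(to_mel(0), to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # (5) DCT decorrelates the log filter-bank energies -> MFCC features
    return dct(log_mel, type=2, axis=-1, norm='ortho')[:, :n_mfcc]
```

For a one-second 16 kHz utterance, `mfcc(np.random.randn(16000))` returns a (frames × 13) matrix playing the role of the spectrogram described above.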
Like other Natural Language Processing (NLP) problems, one of the core challenges faced by ASR is the lack of sufficient training data; as a result, trained models either overfit easily or have difficulty handling data never seen in the training set. Data enhancement is a common way to address this problem. The usual data enhancement methods in the speech recognition field operate on the original audio in three ways: adding noise, changing pitch and time stretching. Adding noise is not applicable because of the limitations of the actual application-scenario data set, while changing pitch and time stretching only make small-amplitude changes to the audio data without changing the number of spectrograms, i.e. the lack of sufficient training data remains.
Unlike the conventional method for processing audio data, the present embodiment uses a spectrogram as a reference for data enhancement, and directly performs operations on the spectrogram by using a combination of three basic methods, i.e., time warping, frequency masking and time masking, so as to achieve the purpose of data enhancement.
Among these, time warping deforms the sequence along the time direction: given a time point t and a time-domain adjustment parameter w, a point within the time interval (w, τ − w) is selected and warped to the left or to the right by a warp distance chosen from a previously set uniform distribution over [0, w], where τ is the time length of the audio sequence.

Frequency masking masks f consecutive mel frequency channels [f0, f0 + f), where the frequency masking parameter f is chosen from a uniform distribution over [0, F], f0 is chosen from [0, ν − f], and ν is the number of mel frequency channels.

Time masking masks the time steps [t0, t0 + t), where the time masking parameter t is chosen from a uniform distribution over [0, T], t0 is chosen from [0, τ − t], and τ is the time length of the audio sequence.

The three operations of time warping, frequency masking and time masking are combined, with the time-domain adjustment parameter w set to 80, the frequency-domain masking bound F set to 27, and the time-domain masking bound T set to 100.
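The masking operations can be sketched directly on a (mel channels × time) spectrogram as below. This is a rough illustration, not the patent's implementation: the time-warping step is omitted (it requires an image-warping routine), the number of masks and the fill value are assumptions, and the bounds F = 27 and T = 100 are taken from the parameters quoted above.

```python
import numpy as np

def augment(spec, num_freq_masks=1, num_time_masks=1, F=27, T=100, rng=None):
    """Frequency and time masking applied to a (mel_channels x time) spectrogram.
    Time warping is omitted; it would shift a point in (w, tau - w) by a
    distance drawn from U[0, w]."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    nu, tau = spec.shape                       # mel channels, time steps

    for _ in range(num_freq_masks):            # mask f consecutive mel channels
        f = rng.integers(0, F + 1)
        f0 = rng.integers(0, max(nu - f, 0) + 1)
        spec[f0:f0 + f, :] = 0.0

    for _ in range(num_time_masks):            # mask t consecutive time steps
        t = rng.integers(0, T + 1)
        t0 = rng.integers(0, max(tau - t, 0) + 1)
        spec[:, t0:t0 + t] = 0.0
    return spec
```

Because the masks are redrawn on every call, repeatedly augmenting the same spectrogram produces multiple distinct training examples, which is how the number of spectrograms is enlarged.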
In this embodiment, the obtained feature map is encoded using an encoder constructed from CNN and Transformer modules: the Transformer module can capture long-sequence dependencies and content-based global interaction information, and the CNN can effectively exploit local features, thereby realizing both local and global dependency modeling.
The Transformer encoder is improved with the CNN to obtain an improved Conformer encoder, which comprises a feedforward module, a multi-head self-attention module and a convolution module. Two 1/2-weighted feedforward modules are connected respectively before the multi-head self-attention module and after the convolution module, forming a sandwich structure; the feature map passes in sequence through a half-step feedforward module (i.e. a 1/2-weighted feedforward module), the multi-head self-attention module, the convolution module and the final half-step feedforward module to obtain the encoding result.
The Feed-forward module consists of two linear transformations and a nonlinear Swish activation function, connected by a pre-norm residual unit. The Swish activation is unbounded above, bounded below, smooth and non-monotonic, and its performance is generally superior to the ReLU activation. After the feature-enhanced feature map X is input into the encoder, it is first processed by the feedforward module; the output of the feedforward module is computed as shown in formula (4):

FFN(X) = W_2·Swish(W_1·X + b_1) + b_2, with Swish(x) = x·σ(x)    (4)

where σ is the sigmoid activation function and W_1, W_2, b_1, b_2 are the parameters of the two linear transformations.
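A minimal PyTorch sketch of the half-step (1/2-weighted) feedforward module with the pre-norm residual described above is given below; the model width, expansion factor and dropout are assumptions, and torch.nn.SiLU is used as the Swish activation.

```python
import torch
import torch.nn as nn

class HalfStepFeedForward(nn.Module):
    """Pre-norm feed-forward: x + 1/2 * W2(Swish(W1(LayerNorm(x))))."""
    def __init__(self, d_model=256, expansion=4, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                      # SiLU == Swish: x * sigmoid(x)
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):                   # x: (batch, time, d_model)
        return x + 0.5 * self.ff(self.norm(x))
```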
The Multi-head self-attention (MHSA) module uses the relative position coding of Transformer-XL, which is more general and robust for speech of different input lengths. For an input x_i with corresponding vector h_i, the attention is computed as follows. First the query, key and value of h_i are computed:

q_i = W^Q·h_i, k_i = W^K·h_i, v_i = W^V·h_i    (5)

where W^Q, W^K and W^V respectively denote the query, key and value weight matrices, and q_i, k_i, v_i are all d_k-dimensional.

Then the attention is computed using scaled dot-product attention:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V    (6)

Finally, the attention computed by the multiple heads is concatenated, as shown in formula (7):

MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O    (7)
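A PyTorch sketch of formulas (5)-(7) follows. It uses plain scaled dot-product attention with learned projections; the Transformer-XL relative position encoding mentioned above is not reproduced here, and the dimensions are assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Formulas (5)-(7): per-head q/k/v projections, scaled dot-product, concat."""
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        b, t, _ = x.shape
        split = lambda z: z.view(b, t, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))   # (5)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)              # (6)
        attn = torch.softmax(scores, dim=-1) @ v
        out = attn.transpose(1, 2).reshape(b, t, self.h * self.d_k)
        return self.wo(out)                                                 # (7)
```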
a gating mechanism consisting of a pointwise Convolution and a linear gating unit (GLU) is added before a Convolution Module (constraint Module), then a one-dimensional depth separation Convolution is carried out, and then a Batchnorm is added to help train a deeper model, wherein two activation functions are used, namely a sigmoid activation function and a swish activation function.
GLU is a gating mechanism in convolutional neural networks. It differs from the gated recurrent unit (GRU) of recurrent networks in that gradients propagate more easily, vanishing or exploding gradients are less likely, and the computation time is greatly reduced. Here the input is the output of the multi-head self-attention module after layer normalization and point-by-point convolution, denoted Z for convenience. Each GLU layer is composed of two convolution modules with different parameters and a gate mechanism; the outputs of the two convolutions are combined through the gate, as shown in formula (8):

h_l(Z) = (Z·W_l + b_l) ⊗ σ(Z·V_l + c_l)    (8)

where l denotes the l-th layer, W_l, V_l, b_l, c_l are parameters the convolution modules need to learn, σ denotes the sigmoid activation function, and ⊗ is the Hadamard product, i.e. multiplication of corresponding elements.
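The convolution module described above can be sketched as below under the assumption that the kernel size, dropout and residual placement follow the usual Conformer layout; only the pointwise convolution + GLU gate, depthwise convolution, BatchNorm, Swish and final pointwise convolution named in the text are essential.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Pointwise conv + GLU gate -> depthwise conv -> BatchNorm -> Swish -> pointwise conv."""
    def __init__(self, d_model=256, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)  # feeds the GLU gate
        self.glu = nn.GLU(dim=1)                                   # (Z*W) ⊗ sigmoid(Z*V)
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()                                       # Swish
        self.pw2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)         # -> (batch, d_model, time)
        y = self.glu(self.pw1(y))
        y = self.act(self.bn(self.dw(y)))
        y = self.drop(self.pw2(y)).transpose(1, 2)
        return x + y                             # residual connection
```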
Decoding must take into account the alignment problem between the input sequence and the output sequence, and the alignment process would need to be iterated many times to ensure its accuracy; therefore, this embodiment constructs a decoder based on CTC and Attention, which can output the prediction result directly without pre-aligned data.
For a given encoding result x and its output sequence label, the encoding result x is mapped to X = [x_1, x_2, …, x_T] and the corresponding sequence label is mapped to Y = [y_1, y_2, …, y_U]; the operation of aligning characters and phonemes corresponds to establishing an exact mapping between the encoding result and the sequence label. For a given X, CTC gives an output distribution over all possible alignments Y, from which the output with the maximum probability is taken, i.e. computing:

Y* = argmax_Y p(Y | X)
When CTC solves the alignment problem, it expands the label set and adds a null element ε. The null element ε only serves as a placeholder and corresponds to no character; finally, repeated characters are merged and null elements are removed. For example, two paths that differ only in the positions of repeated characters and null elements are ultimately mapped to the same output sequence.
For a given input sequence X, an intermediate result corresponds to a path π, and the final output is the sequence Y; then the posterior probability of Y given X is expressed as:

p(Y | X) = Σ_{π ∈ B⁻¹(Y)} p(π | X)    (9)

Assuming that the output variables at different times are independent of each other, the probability of a path π given X is expressed as:

p(π | X) = Π_{t=1}^{T} p(π_t | X)    (10)

where π_t denotes the output character of path π at time t, and p(π_t | X) denotes the probability of selecting character π_t at time t. Therefore, combining equations (9) and (10):

p(Y | X) = Σ_{π ∈ B⁻¹(Y)} Π_{t=1}^{T} p(π_t | X)    (11)

where B⁻¹(Y) denotes the set of all paths that map to the sequence Y.
it can be understood that the speech is time-series, and the recognition result of the speech at the previous time is needed for decoding.
Because of pronunciation, recognition-algorithm and other problems, various small errors, such as word spelling errors and homophone errors, inevitably exist in the final prediction result; this embodiment therefore corrects the decoding result according to the edit distance and the occurrence frequency, so as to obtain a more accurate recognition result.
A dictionary is constructed according to the examination level or prior knowledge of common words, and the occurrence frequency of each word in the dictionary is determined at the same time; a BK tree (Burkhard-Keller tree) is then constructed from the words in the dictionary. The BK tree is built on the edit distance, which measures the similarity between two character strings a and b, i.e. the minimum number of edit operations required to convert string a into string b, as shown in equation (12):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i−1, j) + 1, lev_{a,b}(i, j−1) + 1, lev_{a,b}(i−1, j−1) + [a_i ≠ b_j] ), otherwise    (12)

where i and j respectively denote subscripts into the strings a and b, starting from 1, and [a_i ≠ b_j] is 1 when the characters differ and 0 otherwise.
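Equation (12) is the standard dynamic-programming edit distance; a minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    turning string a into string b (equation (12))."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

# edit_distance("fame", "game") == 1
```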
The BK tree is a data structure whose core idea is to use d(x, y) to denote the edit distance from string x to string y, with the main requirements: d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x); and d(x, y) + d(y, z) ≥ d(x, z). Taking {game, same, fame, gain, gate, gay, aim, frame} as an example, a BK tree is constructed as shown in fig. 2; the construction process is:
a) Select a character string as the root node, e.g. game;
b) Select the next character string, same; the edit distance between same and game is calculated to be 1, so same becomes a branch node of the root node game;
c) Select the next character string, fame, and traverse from the root node game; the edit distance between fame and game is calculated to be 1, but the branch with edit distance 1 (same) already exists, so the edit distance between fame and same is calculated, which is also 1, and fame becomes a new branch of same;
d) The remaining words are selected in turn and the tree is expanded according to steps b) and c), finally constructing the BK tree.
As shown in fig. 2, all descendant nodes under the root node game branch 1 have an edit distance of 1, and all descendant nodes under the root node game branch 2 have an edit distance of 2, which makes the BK tree less computationally intensive in querying and allows high-frequency words to be placed at the top of the BK tree.
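A minimal BK-tree sketch built with the `edit_distance` function above is given below; the class and method names are illustrative. The pruning rule in `query` follows from the triangle inequality: only child branches whose distance label lies within the threshold of the current distance can contain matches.

```python
class BKTree:
    """BK tree keyed by edit distance; each child branch is labelled with
    its distance to the parent word."""
    def __init__(self, words, distance):
        self.distance = distance
        it = iter(words)
        self.root = (next(it), {})               # (word, {distance: child node})
        for w in it:
            self._add(self.root, w)

    def _add(self, node, word):
        parent, children = node
        d = self.distance(word, parent)
        if d in children:
            self._add(children[d], word)         # descend along the matching branch
        else:
            children[d] = (word, {})

    def query(self, word, threshold):
        """Return all (candidate, distance) with distance <= threshold."""
        results, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = self.distance(word, w)
            if d <= threshold:
                results.append((w, d))
            # triangle inequality: only branches in [d - threshold, d + threshold] can match
            stack.extend(child for k, child in children.items()
                         if d - threshold <= k <= d + threshold)
        return results

tree = BKTree(["game", "same", "fame", "gain", "gate", "gay", "aim", "frame"],
              edit_distance)
# tree.query("gane", 1) -> e.g. [("game", 1), ("gate", 1)]
```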
The character string decoding result is looked up in the dictionary; if it does not exist in the dictionary, the decoding result is erroneous. Then, candidate words whose edit distance to the erroneous decoding result is smaller than a set threshold are screened from the BK tree to realize candidate recall; for example, when a decoded word is found to be erroneous, all words in the BK tree with an edit distance of 1 from it, such as game and gate, can be retrieved, thereby implementing the candidate recall.
Since conventional BK-tree-based error correction focuses only on the several candidate words with the shortest edit distance, and rarely used words often appear among these candidates, which is undesirable, this embodiment also introduces the occurrence frequency. After the candidate words are determined, error correction is performed according to the occurrence frequency of the candidate words and their edit distance to the character string decoding result: the smaller the edit distance, the higher the score, and the higher the occurrence frequency, the higher the score; the two scores are weighted, and the candidate word with the highest total score is taken as the final recognition result.
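The final selection can be sketched as a weighted combination of an edit-distance score and a frequency score; the weights, normalization and example frequencies below are illustrative assumptions rather than the patent's exact formula.

```python
def best_correction(candidates, freq, w_dist=0.7, w_freq=0.3):
    """candidates: list of (word, edit_distance) from the BK-tree query.
    freq: dict mapping each dictionary word to its occurrence frequency.
    Smaller edit distance and higher frequency both raise the score."""
    max_freq = max(freq.get(w, 0) for w, _ in candidates) or 1
    def score(item):
        word, dist = item
        dist_score = 1.0 / (1.0 + dist)              # smaller distance -> higher score
        freq_score = freq.get(word, 0) / max_freq    # higher frequency -> higher score
        return w_dist * dist_score + w_freq * freq_score
    return max(candidates, key=score)[0]

# e.g. best_correction([("game", 1), ("gate", 1)], {"game": 120, "gate": 45}) -> "game"
```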
Example 2
The embodiment provides a speech recognition and error correction system for oral english evaluation, which includes:
the characteristic extraction module is configured to extract Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech so as to form a spectrogram;
the feature enhancement module is configured to perform feature enhancement on the spectrogram by warping and masking to obtain a feature map;
an encoding module configured to encode the feature map;
the decoding module is configured to decode according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and the error correction module is configured to check the character string decoding result according to a preset dictionary, screen candidate words in the dictionary according to the editing distance for the words with wrong decoding, and determine correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the words with wrong decoding, so that a correct character string recognition result is obtained.
It should be noted that the modules correspond to the steps described in embodiment 1, and the modules are the same as the corresponding steps in the implementation examples and application scenarios, but are not limited to the disclosure in embodiment 1. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of embodiment 1. For brevity, no further description is provided herein.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or other general purpose processors, digital signal processors (DSP), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements, i.e., algorithm steps, described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive changes in the technical solutions of the present invention.

Claims (10)

1. The speech recognition and error correction method for English spoken language evaluation is characterized by comprising the following steps:
extracting Mel frequency cepstrum coefficients from the spoken English speech after time-frequency conversion to form a spectrogram;
performing feature enhancement on the spectrogram by warping and masking to obtain a feature map;
encoding the feature map;
decoding according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance of the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words so as to obtain a correct character string recognition result.
2. The method for speech recognition and error correction for oral english evaluation according to claim 1, wherein the time domain signal of the oral english speech is pre-emphasized by a high pass filter, the pre-emphasized time domain signal is windowed by frames, and the time domain signal in each window is converted into a frequency domain signal by fast fourier transform; and (3) passing the frequency domain signal through a set of triangular filter banks with a Mel scale, and then extracting Mel frequency cepstrum coefficients through discrete cosine transform.
3. The speech recognition and error correction method for spoken English evaluation according to claim 1, wherein the warping is time warping, and the masking is frequency masking or time masking.
4. The method for speech recognition and error correction for spoken English evaluation according to claim 1, wherein the feature map is encoded using an encoder, the encoder comprising a feedforward module, a multi-head self-attention module, and a convolution module, the feedforward module consisting of two 1/2-weighted feedforward modules connected respectively before the multi-head self-attention module and after the convolution module; the feature map is processed by the feedforward module, attention features are extracted by the multi-head self-attention mechanism, the output of the multi-head self-attention module is input to the convolution module after layer normalization and point-by-point convolution, and encoding is completed by the final 1/2-weighted feedforward module.
5. The method for speech recognition and error correction for spoken English evaluation according to claim 1, wherein null elements are added during the decoding process to align characters and phonemes.
6. The method for speech recognition and correction for oral english evaluation according to claim 1, wherein a dictionary is constructed in advance while the occurrence frequency of words in the dictionary is determined, a BK tree is constructed based on the edit distance between words in the dictionary, and the candidate words are screened based on the BK tree.
7. The speech recognition and error correction method for oral english evaluation according to claim 1, wherein the candidate word with the highest total score is taken as the correct word after weighting the scores corresponding to the edit distance and the frequency of occurrence of the candidate word.
8. Speech recognition and error correction system that english spoken language evaluated, characterized by, include:
the characteristic extraction module is configured to extract Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech so as to form a spectrogram;
the feature enhancement module is configured to perform feature enhancement on the spectrogram by warping and masking to obtain a feature map;
an encoding module configured to encode the feature map;
the decoding module is configured to decode according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and the error correction module is configured to check the character string decoding result according to a preset dictionary, screen candidate words in the dictionary according to the editing distance for the words with wrong decoding, and determine correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the words with wrong decoding, so that a correct character string recognition result is obtained.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202310138725.6A 2023-02-21 2023-02-21 Method, system, equipment and medium for speech recognition and error correction of oral English evaluation Pending CN115862674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310138725.6A CN115862674A (en) 2023-02-21 2023-02-21 Method, system, equipment and medium for speech recognition and error correction of oral English evaluation

Publications (1)

Publication Number Publication Date
CN115862674A true CN115862674A (en) 2023-03-28

Family

ID=85658468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310138725.6A Pending CN115862674A (en) 2023-02-21 2023-02-21 Method, system, equipment and medium for speech recognition and error correction of oral English evaluation

Country Status (1)

Country Link
CN (1) CN115862674A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362824A (en) * 2019-06-24 2019-10-22 广州多益网络股份有限公司 A kind of method, apparatus of automatic error-correcting, terminal device and storage medium
CN113420219A (en) * 2021-06-30 2021-09-21 北京明略昭辉科技有限公司 Method and device for correcting query information, electronic equipment and readable storage medium
CN113569545A (en) * 2021-09-26 2021-10-29 中国电子科技集团公司第二十八研究所 Control information extraction method based on voice recognition error correction model
CN114860870A (en) * 2022-04-02 2022-08-05 北京明略昭辉科技有限公司 Text error correction method and device
CN114444479A (en) * 2022-04-11 2022-05-06 南京云问网络技术有限公司 End-to-end Chinese speech text error correction method, device and storage medium
CN114818668A (en) * 2022-04-26 2022-07-29 北京中科智加科技有限公司 Method and device for correcting personal name of voice transcribed text and computer equipment
CN115293138A (en) * 2022-08-03 2022-11-04 北京中科智加科技有限公司 Text error correction method and computer equipment
CN115497465A (en) * 2022-09-06 2022-12-20 平安银行股份有限公司 Voice interaction method and device, electronic equipment and storage medium
CN115565522A (en) * 2022-11-29 2023-01-03 支付宝(杭州)信息技术有限公司 Training language recognition model, language recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏比·艾依提 et al.: "End-to-end Uyghur speech recognition based on multi-task learning" (基于多任务学习的端到端维吾尔语语音识别), vol. 37, no. 10, pages 1-2 *

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
WO2019214047A1 (en) Method and apparatus for establishing voice print model, computer device, and storage medium
TWI396184B (en) A method for speech recognition on all languages and for inputing words using speech recognition
CN112767958A (en) Zero-learning-based cross-language tone conversion system and method
Imtiaz et al. Isolated word automatic speech recognition (ASR) system using MFCC, DTW & KNN
Razak et al. Quranic verse recitation recognition module for support in j-QAF learning: A review
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Sinha et al. Continuous density hidden markov model for context dependent Hindi speech recognition
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
CN112634860B (en) Method for screening training corpus of children voice recognition model
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN115472168B (en) Short-time voice voiceprint recognition method, system and equipment for coupling BGCC and PWPE features
Shafie et al. Al-Quran recitation speech signals time series segmentation for speaker adaptation using Dynamic Time Warping
CN115862674A (en) Method, system, equipment and medium for speech recognition and error correction of oral English evaluation
Aşlyan Syllable Based Speech Recognition
Khalifa et al. Statistical modeling for speech recognition
Elharati Performance evaluation of speech recognition system using conventional and hybrid features and hidden Markov model classifier
CN116403562B (en) Speech synthesis method and system based on semantic information automatic prediction pause
Gadekar et al. Analysis of speech recognition techniques
CN107305767A (en) A kind of Short Time Speech duration extended method recognized applied to languages
Viana et al. Self-organizing speech recognition that processes acoustic and articulatory features
Tian Research on Speech Recognition Technology of Oral English Learning Based on Improved GLR Algorithm
Satvik et al. Transformer Based Speech to Text Translation for Indic Languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination