CN115862674A - Method, system, equipment and medium for speech recognition and error correction of oral English evaluation - Google Patents

Method, system, equipment and medium for speech recognition and error correction of oral English evaluation

Info

Publication number
CN115862674A
CN115862674A (application CN202310138725.6A)
Authority
CN
China
Prior art keywords
module
words
character string
result
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310138725.6A
Other languages
Chinese (zh)
Inventor
许信顺
辛洁
马磊
陈义学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310138725.6A priority Critical patent/CN115862674A/en
Publication of CN115862674A publication Critical patent/CN115862674A/en
Pending legal-status Critical Current

Abstract

The invention discloses a speech recognition and error correction method, system, equipment and medium for English spoken language evaluation, which relate to the technical field of speech recognition and comprise the following steps: extracting Mel frequency cepstrum coefficients from the spoken English speech, and performing feature enhancement to obtain a feature map; encoding the feature map; decoding according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment; checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance of the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words, so that a correct character string recognition result is obtained, and the accuracy of the recognition result is improved.

Description

Method, system, equipment and medium for speech recognition and error correction of oral English evaluation
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition and error correction method, system, equipment and medium for oral English evaluation.
Background
Spoken English evaluation technology can automatically score and correct spoken pronunciation, and is widely applied to automatic scoring of oral examinations, oral practice and the like. Automatic Speech Recognition (ASR) is pattern recognition based on speech feature parameters: through learning, input speech is classified according to a certain pattern, and the optimal matching result is then found according to a judgment criterion. ASR is applied in scenarios such as in-vehicle systems, smartphones and smart home appliances.
Before deep learning expanded into the field of speech recognition, speech recognition models based on the Gaussian mixture model-hidden Markov model (GMM-HMM) were the mainstream approach. A speech recognition system of that kind generally comprises three parts: feature extraction, an acoustic model and a language model. Feature extraction converts the speech signal from the time domain to the frequency domain and extracts suitable features for the acoustic model; the acoustic model combines acoustics and phonetics and takes the features as input to produce an acoustic model score; the language model then computes the corresponding word-sequence probability from the acoustic model score. Although the GMM-HMM model trains quickly and yields a small acoustic model, it has an obvious drawback: it does not make full use of context information. With the wide application of Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Transformers, the recognition capability of acoustic models based on deep learning, or on a mixture of deep learning and traditional methods, has greatly exceeded that of the GMM-HMM model.
At present, the end-to-end deep learning models widely used in the field of speech recognition can be classified into two types: CTC (Connectionist Temporal Classification)-based methods and Attention-based methods. The CTC-based method solves the alignment problem between input and output: by computing all possible alignments during prediction, it enables training without aligning the input and output sequences in advance. It is usually combined with recurrent architectures such as RNN, LSTM and GRU, but tends to focus only on local information and to ignore global information. The Attention-based method generally adopts an Encoder-Decoder architecture, in which the larger a value in the weight vector, the more important the corresponding part is to the output; the alignment between the input and output sequences is learned from historical outputs and feature encodings. Its decoding is more flexible, but it ignores the ordering relations within the sequence.
In the traditional GMM-HMM model as well as end-to-end speech recognition models based on deep learning, words that do not exist in reality inevitably appear in the final prediction result because of pronunciation, recognition-algorithm and other problems, so text error correction is adopted to solve this problem. The text correction task is typically a sequence-to-sequence task: the input is text obtained from an input device, ASR or Optical Character Recognition (OCR), and the output is a complete sentence with the erroneous words corrected.
Current text error correction is mainly divided into two-stage methods and end-to-end methods. The two-stage method comprises a judgment stage and an error correction stage: the judgment stage identifies the erroneous text with an N-Gram or deep learning model, the input of the error correction stage is the erroneous part of the text, and the text is corrected by a deep learning or traditional method. The end-to-end method only comprises an error correction stage, where the input is the complete text and the output is the corrected text; because the end-to-end method usually gains speed at the expense of accuracy, some special processing is usually added at the input to mitigate the loss of accuracy.
At present, research on English speech recognition in the scientific community mainly targets standard British- or American-accented read speech under quiet background conditions, with short sentences, moderate speaking rate and clear pronunciation; speech recognition for spoken English evaluation in real scenarios differs considerably from this research setting. Specifically, spoken English evaluation in real scenarios faces several problems: constrained by the examination-room environment, the speech data collected in real examination rooms contain a large amount of background noise, which greatly affects recognition; influenced by mother tongue and dialect, readers' English pronunciation varies, which increases the recognition difficulty; and unlike the high-quality public data sets used in research, the speech collected in real examination rooms usually exceeds one minute in length, with a large amount of three-minute recordings containing pauses, silence and other complex conditions, so current methods, limited by the application scenario, find it difficult to recognize such speech accurately.
Disclosure of Invention
In order to solve the above problems, the invention provides a method, system, equipment and medium for speech recognition and error correction for spoken English evaluation, which extract Mel frequency cepstrum coefficients as features, perform feature enhancement by warping and masking, and, after decoding, correct the decoding result using edit distance and occurrence frequency, so as to obtain a more accurate recognition result.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a speech recognition and error correction method for spoken English evaluation, including:
extracting Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech to form a spectrogram;
performing feature enhancement on the spectrogram by warping and masking to obtain a feature map;
encoding the feature map;
decoding according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance of the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words so as to obtain a correct character string recognition result.
As an alternative implementation, the time domain signal of the spoken English speech is pre-emphasized by a high-pass filter, the pre-emphasized time domain signal is subjected to frame division and windowing, and the time domain signal in each window is converted into a frequency domain signal by adopting fast fourier transform; and (3) passing the frequency domain signal through a set of triangular filter banks with a Mel scale, and then extracting Mel frequency cepstrum coefficients through discrete cosine transform.
As an alternative embodiment, the warping is time warping, and the masking is frequency masking and time masking.
As an alternative implementation, the feature map is encoded by an encoder comprising a feedforward module, a multi-head self-attention module and a convolution module, wherein the feedforward module consists of two 1/2-weighted feedforward modules connected respectively before the multi-head self-attention module and after the convolution module; the feature map is first processed by the feedforward module, attention features are extracted by the multi-head self-attention mechanism, the output of the multi-head self-attention module is input to the convolution module after layer normalization and point-by-point convolution, and encoding is completed by the final 1/2-weighted feedforward module.
As an alternative embodiment, the decoding process adds null elements to align the characters and phonemes.
As an alternative embodiment, a dictionary is constructed in advance and the occurrence frequency of words in the dictionary is determined at the same time, a BK tree is constructed according to the editing distance between the words in the dictionary, and the candidate words are screened based on the BK tree.
As an alternative embodiment, after weighting scores corresponding to edit distances and appearance frequencies of candidate words, the candidate word with the highest total score is taken as the correct word.
In a second aspect, the present invention provides a speech recognition and error correction system for spoken English evaluation, including:
the characteristic extraction module is configured to extract a Mel frequency cepstrum coefficient after time-frequency conversion is carried out on the spoken English voice so as to form a spectrogram;
the feature enhancement module is configured to perform feature enhancement on the spectrogram by warping and masking to obtain a feature map;
an encoding module configured to encode the feature map;
the decoding module is configured to decode according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and the error correction module is configured to check the character string decoding result according to a preset dictionary, screen candidate words in the dictionary according to the editing distance for the words with wrong decoding, and determine correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the words with wrong decoding, so that a correct character string recognition result is obtained.
In a third aspect, the present invention provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, wherein when the computer instructions are executed by the processor, the method of the first aspect is performed.
In a fourth aspect, the present invention provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method, a system, equipment and a medium for recognizing and correcting a speech for evaluating an English spoken language, which enable a model to better learn speech characteristics by performing a characteristic enhancement combined processing mode of time distortion, frequency shielding and time shielding on a spectrogram and by enlarging the number of the spectrogram.
The invention provides a method, system, equipment and medium for speech recognition and error correction for spoken English evaluation, and designs a CNN-improved Transformer-structured encoder: the Transformer can capture long-sequence dependencies and content-based global interaction information, while the CNN can effectively exploit local features, thereby realizing both local and global dependency modeling of the audio sequence; a multi-head self-attention mechanism is used in the decoding stage to strengthen attention to local information; and by expanding the label set and adding null elements, the misalignment between the input sequence and the output sequence is solved.
Traditional BK-tree-based error correction focuses on the several candidate words with the shortest edit distance; however, rarely used words often appear among these candidates, which is undesirable. Therefore, occurrence-frequency statistics are added when constructing the dictionary and the BK tree, and the final recognition result is determined by jointly considering the edit distance and the occurrence frequency, making recognition more accurate.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, not to limit it.
Fig. 1 is a flowchart of a speech recognition and error correction method for spoken english evaluation according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a BK tree provided in embodiment 1 of the present invention.
Detailed Description
The invention is further explained by the following embodiments in conjunction with the drawings.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example 1
The embodiment provides a speech recognition and error correction method for spoken English evaluation, as shown in fig. 1, including:
extracting Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech to form a spectrogram;
performing feature enhancement on the spectrogram by warping and masking to obtain a feature map;
encoding the feature map;
decoding according to the coding result and the character string recognition result at the previous moment to obtain a character string decoding result at the current moment;
checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance for the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words, so that a correct character string recognition result is obtained.
The Mel frequency is derived from the auditory characteristics of the human ear and forms a nonlinear correspondence with the Hertz frequency. Mel-frequency cepstrum coefficients (MFCC) are spectral features computed using this correspondence, which makes the spectrum closer to the nonlinear human auditory system; they are mainly used to extract speech-data features and reduce the computational dimensionality. The MFCC feature extraction process mainly converts the obtained spoken English speech from the time domain to the frequency domain and then obtains the Mel frequency cepstrum coefficients as features, and specifically comprises the following steps:
(1) For the time domain signal x(t) of the given spoken English speech at a time point t, pre-emphasis processing is carried out to obtain the processed time domain signal y(t). The pre-emphasis processing passes the time domain signal of the spoken English speech through a high-pass filter, so as to boost the high-frequency part, flatten the spectrum of the signal, and keep the same signal-to-noise ratio over the whole band from low frequency to high frequency, as shown in formula (1):

y(t) = x(t) − α·x(t−1)    (1)

where α is the pre-emphasis coefficient.
(2) In order to reduce the influence of unsteadiness and time variation of the whole time domain signal of the spoken English voice, the time domain signal after the pre-emphasis processing is subjected to framing processing, wherein the frame length is usually 25ms;
in order to ensure smooth transition between frames and maintain continuity of the frames, the framing generally adopts an overlapping and segmenting method to ensure that two adjacent frames overlap with each other by a portion, a time difference between start positions of the two adjacent frames is called frame shift, and the frame shift is generally 10ms.
The framed time domain signal is non-periodic, and there is a problem of frequency leakage after fourier transform, so in order to reduce leakage error to the maximum extent, the embodiment adopts a windowing function, so that the time domain signal better meets the periodicity requirement of fourier transform.
In this embodiment, a Hamming window is selected as the windowing function, so that the value of the framed time domain signal at the window boundary is approximately 0 and the framed signal approaches a periodic signal. The windowing function is:

w(n) = 0.54 − 0.46·cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1    (2)

where N is the window length.
(3) The data in each window are converted from the time domain signal y(n) into a frequency domain signal Y(k) using the Fast Fourier Transform (FFT), as shown in formula (3):

Y(k) = Σ_{n=0}^{N−1} y(n)·e^{−j2πkn/N}, 0 ≤ k ≤ N − 1    (3)

where N is the number of points of the Fourier transform and e is the natural base.
(4) The frequency domain signal passes through a group of Mel-scale triangular filter banks to smooth the frequency spectrum and eliminate the effect of harmonic wave, so as to highlight the formants of the original voice, and then the MFCC is obtained through Discrete Cosine Transform (DCT), thereby forming a spectrogram.
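The pipeline described in steps (1)-(4) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the 25 ms frame length and 10 ms frame shift follow the text, while the pre-emphasis coefficient, FFT size, number of Mel filters and number of coefficients are common-practice assumptions, and the triangular filter-bank construction is simplified.

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
         n_fft=512, n_mels=26, n_mfcc=13, alpha=0.97):
    """Minimal MFCC sketch: pre-emphasis -> framing -> Hamming window
    -> FFT -> Mel filter bank -> log -> DCT."""
    # (1) pre-emphasis: y(t) = x(t) - alpha * x(t-1)
    y = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # (2) framing (25 ms frames, 10 ms shift) and Hamming windowing
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + max(0, (len(y) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(flen)

    # (3) power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # (4) triangular Mel-scale filter bank, then log energies
    to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(to_mel(0), to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # (5) DCT decorrelates the log filter-bank energies -> MFCC features
    return dct(log_mel, type=2, axis=-1, norm='ortho')[:, :n_mfcc]
```

For a one-second 16 kHz utterance, `mfcc(np.random.randn(16000))` returns a (frames × 13) matrix playing the role of the spectrogram described above.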
Like other Natural Language Processing (NLP) problems, one of the core challenges faced by ASR is the lack of sufficient training data; as a result, trained models either overfit easily or have difficulty handling data never seen in the training set. Data enhancement is a common way to address this problem. The usual data enhancement methods in the speech recognition field operate on the original audio in three ways: adding noise, changing pitch and time stretching. Adding noise is not applicable because of the limitations of the actual application-scenario data set, while changing pitch and time stretching only make small-amplitude changes to the audio data without changing the number of spectrograms, i.e. the lack of sufficient training data remains.
Unlike the conventional method for processing audio data, the present embodiment uses a spectrogram as a reference for data enhancement, and directly performs operations on the spectrogram by using a combination of three basic methods, i.e., time warping, frequency masking and time masking, so as to achieve the purpose of data enhancement.
Among these, time warping deforms the sequence along the time direction: given a time point t and a time-domain adjustment parameter w, a point within the time interval (w, τ − w) is selected and warped to the left or to the right by a warp distance chosen from a previously set uniform distribution over [0, w], where τ is the time length of the audio sequence.

Frequency masking masks f consecutive mel frequency channels [f0, f0 + f), where the frequency masking parameter f is chosen from a uniform distribution over [0, F], f0 is chosen from [0, ν − f], and ν is the number of mel frequency channels.

Time masking masks the time steps [t0, t0 + t), where the time masking parameter t is chosen from a uniform distribution over [0, T], t0 is chosen from [0, τ − t], and τ is the time length of the audio sequence.

The three operations of time warping, frequency masking and time masking are combined, with the time-domain adjustment parameter w set to 80, the frequency-domain masking bound F set to 27, and the time-domain masking bound T set to 100.
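The masking operations can be sketched directly on a (mel channels × time) spectrogram as below. This is a rough illustration, not the patent's implementation: the time-warping step is omitted (it requires an image-warping routine), the number of masks and the fill value are assumptions, and the bounds F = 27 and T = 100 are taken from the parameters quoted above.

```python
import numpy as np

def augment(spec, num_freq_masks=1, num_time_masks=1, F=27, T=100, rng=None):
    """Frequency and time masking applied to a (mel_channels x time) spectrogram.
    Time warping is omitted; it would shift a point in (w, tau - w) by a
    distance drawn from U[0, w]."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    nu, tau = spec.shape                       # mel channels, time steps

    for _ in range(num_freq_masks):            # mask f consecutive mel channels
        f = rng.integers(0, F + 1)
        f0 = rng.integers(0, max(nu - f, 0) + 1)
        spec[f0:f0 + f, :] = 0.0

    for _ in range(num_time_masks):            # mask t consecutive time steps
        t = rng.integers(0, T + 1)
        t0 = rng.integers(0, max(tau - t, 0) + 1)
        spec[:, t0:t0 + t] = 0.0
    return spec
```

Because the masks are redrawn on every call, repeatedly augmenting the same spectrogram produces multiple distinct training examples, which is how the number of spectrograms is enlarged.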
In this embodiment, the obtained feature map is encoded using an encoder constructed from CNN and Transformer modules: the Transformer module can capture long-sequence dependencies and content-based global interaction information, and the CNN can effectively exploit local features, thereby realizing both local and global dependency modeling.
The Transformer encoder is improved with the CNN to obtain an improved Conformer encoder, which comprises a feedforward module, a multi-head self-attention module and a convolution module. Two 1/2-weighted feedforward modules are connected respectively before the multi-head self-attention module and after the convolution module, forming a sandwich structure; the feature map passes in sequence through a half-step feedforward module (i.e. a 1/2-weighted feedforward module), the multi-head self-attention module, the convolution module and the final half-step feedforward module to obtain the encoding result.
The Feed-forward module consists of two linear transformations and a nonlinear Swish activation function, connected by a pre-norm residual unit. The Swish activation is unbounded above, bounded below, smooth and non-monotonic, and its performance is generally superior to the ReLU activation. After the feature-enhanced feature map X is input into the encoder, it is first processed by the feedforward module; the output of the feedforward module is computed as shown in formula (4):

FFN(X) = W_2·Swish(W_1·X + b_1) + b_2, with Swish(x) = x·σ(x)    (4)

where σ is the sigmoid activation function and W_1, W_2, b_1, b_2 are the parameters of the two linear transformations.
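A minimal PyTorch sketch of the half-step (1/2-weighted) feedforward module with the pre-norm residual described above is given below; the model width, expansion factor and dropout are assumptions, and torch.nn.SiLU is used as the Swish activation.

```python
import torch
import torch.nn as nn

class HalfStepFeedForward(nn.Module):
    """Pre-norm feed-forward: x + 1/2 * W2(Swish(W1(LayerNorm(x))))."""
    def __init__(self, d_model=256, expansion=4, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                      # SiLU == Swish: x * sigmoid(x)
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):                   # x: (batch, time, d_model)
        return x + 0.5 * self.ff(self.norm(x))
```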
The Multi-head self-attention (MHSA) module uses the relative position coding of Transformer-XL, which is more general and robust for speech of different input lengths. For an input x_i with corresponding vector h_i, the attention is computed as follows. First the query, key and value of h_i are computed:

q_i = W^Q·h_i, k_i = W^K·h_i, v_i = W^V·h_i    (5)

where W^Q, W^K and W^V respectively denote the query, key and value weight matrices, and q_i, k_i, v_i are all d_k-dimensional.

Then the attention is computed using scaled dot-product attention:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V    (6)

Finally, the attention computed by the multiple heads is concatenated, as shown in formula (7):

MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O    (7)
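A PyTorch sketch of formulas (5)-(7) follows. It uses plain scaled dot-product attention with learned projections; the Transformer-XL relative position encoding mentioned above is not reproduced here, and the dimensions are assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Formulas (5)-(7): per-head q/k/v projections, scaled dot-product, concat."""
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        b, t, _ = x.shape
        split = lambda z: z.view(b, t, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))   # (5)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)              # (6)
        attn = torch.softmax(scores, dim=-1) @ v
        out = attn.transpose(1, 2).reshape(b, t, self.h * self.d_k)
        return self.wo(out)                                                 # (7)
```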
a gating mechanism consisting of a pointwise Convolution and a linear gating unit (GLU) is added before a Convolution Module (constraint Module), then a one-dimensional depth separation Convolution is carried out, and then a Batchnorm is added to help train a deeper model, wherein two activation functions are used, namely a sigmoid activation function and a swish activation function.
GLU is a gating mechanism in convolutional neural networks. It differs from the gated recurrent unit (GRU) of recurrent networks in that gradients propagate more easily, vanishing or exploding gradients are less likely, and the computation time is greatly reduced. Here the input is the output of the multi-head self-attention module after layer normalization and point-by-point convolution, denoted Z for convenience. Each GLU layer is composed of two convolution modules with different parameters and a gate mechanism; the outputs of the two convolutions are combined through the gate, as shown in formula (8):

h_l(Z) = (Z·W_l + b_l) ⊗ σ(Z·V_l + c_l)    (8)

where l denotes the l-th layer, W_l, V_l, b_l, c_l are parameters the convolution modules need to learn, σ denotes the sigmoid activation function, and ⊗ is the Hadamard product, i.e. multiplication of corresponding elements.
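The convolution module described above can be sketched as below under the assumption that the kernel size, dropout and residual placement follow the usual Conformer layout; only the pointwise convolution + GLU gate, depthwise convolution, BatchNorm, Swish and final pointwise convolution named in the text are essential.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Pointwise conv + GLU gate -> depthwise conv -> BatchNorm -> Swish -> pointwise conv."""
    def __init__(self, d_model=256, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)  # feeds the GLU gate
        self.glu = nn.GLU(dim=1)                                   # (Z*W) ⊗ sigmoid(Z*V)
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()                                       # Swish
        self.pw2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)         # -> (batch, d_model, time)
        y = self.glu(self.pw1(y))
        y = self.act(self.bn(self.dw(y)))
        y = self.drop(self.pw2(y)).transpose(1, 2)
        return x + y                             # residual connection
```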
Decoding must take into account the alignment problem between the input sequence and the output sequence, and the alignment process would need to be iterated many times to ensure its accuracy; therefore, this embodiment constructs a decoder based on CTC and Attention, which can output the prediction result directly without pre-aligned data.
For a given encoding result x and its output sequence label, the encoding result x is mapped to X = [x_1, x_2, …, x_T] and the corresponding sequence label is mapped to Y = [y_1, y_2, …, y_U]; the operation of aligning characters and phonemes corresponds to establishing an exact mapping between the encoding result and the sequence label. For a given X, CTC gives an output distribution over all possible alignments Y, from which the output with the maximum probability is taken, i.e. computing:

Y* = argmax_Y p(Y | X)
When CTC solves the alignment problem, it expands the label set and adds a null element ε. The null element ε only serves as a placeholder and corresponds to no character; finally, repeated characters are merged and null elements are removed. For example, two paths that differ only in the positions of repeated characters and null elements are ultimately mapped to the same output sequence.
For a given input sequence X, an intermediate result corresponds to a path π, and the final output is the sequence Y; then the posterior probability of Y given X is expressed as:

p(Y | X) = Σ_{π ∈ B⁻¹(Y)} p(π | X)    (9)

Assuming that the output variables at different times are independent of each other, the probability of a path π given X is expressed as:

p(π | X) = Π_{t=1}^{T} p(π_t | X)    (10)

where π_t denotes the output character of path π at time t, and p(π_t | X) denotes the probability of selecting character π_t at time t. Therefore, combining equations (9) and (10):

p(Y | X) = Σ_{π ∈ B⁻¹(Y)} Π_{t=1}^{T} p(π_t | X)    (11)

where B⁻¹(Y) denotes the set of all paths that map to the sequence Y.
it can be understood that the speech is time-series, and the recognition result of the speech at the previous time is needed for decoding.
Because of pronunciation, recognition-algorithm and other problems, various small errors, such as word spelling errors and homophone errors, inevitably exist in the final prediction result; this embodiment therefore corrects the decoding result according to the edit distance and the occurrence frequency, so as to obtain a more accurate recognition result.
A dictionary is constructed according to the examination level or prior knowledge of common words, and the occurrence frequency of each word in the dictionary is determined at the same time; a BK tree (Burkhard-Keller tree) is then constructed from the words in the dictionary. The BK tree is built on the edit distance, which measures the similarity between two character strings a and b, i.e. the minimum number of edit operations required to convert string a into string b, as shown in equation (12):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i−1, j) + 1, lev_{a,b}(i, j−1) + 1, lev_{a,b}(i−1, j−1) + [a_i ≠ b_j] ), otherwise    (12)

where i and j respectively denote subscripts into the strings a and b, starting from 1, and [a_i ≠ b_j] is 1 when the characters differ and 0 otherwise.
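Equation (12) is the standard dynamic-programming edit distance; a minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    turning string a into string b (equation (12))."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

# edit_distance("fame", "game") == 1
```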
The BK tree is a data structure whose core idea is to use d(x, y) to denote the edit distance from string x to string y, with the main requirements: d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x); and d(x, y) + d(y, z) ≥ d(x, z). Taking {game, same, fame, gain, gate, gay, aim, frame} as an example, a BK tree is constructed as shown in fig. 2; the construction process is:
a) Select a character string as the root node, e.g. game;
b) Select the next character string, same; the edit distance between same and game is calculated to be 1, so same becomes a branch node of the root node game;
c) Select the next character string, fame, and traverse from the root node game; the edit distance between fame and game is calculated to be 1, but the branch with edit distance 1 (same) already exists, so the edit distance between fame and same is calculated, which is also 1, and fame becomes a new branch of same;
d) The remaining words are selected in turn and the tree is expanded according to steps b) and c), finally constructing the BK tree.
As shown in fig. 2, all descendant nodes under the root node game branch 1 have an edit distance of 1, and all descendant nodes under the root node game branch 2 have an edit distance of 2, which makes the BK tree less computationally intensive in querying and allows high-frequency words to be placed at the top of the BK tree.
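A minimal BK-tree sketch built with the `edit_distance` function above is given below; the class and method names are illustrative. The pruning rule in `query` follows from the triangle inequality: only child branches whose distance label lies within the threshold of the current distance can contain matches.

```python
class BKTree:
    """BK tree keyed by edit distance; each child branch is labelled with
    its distance to the parent word."""
    def __init__(self, words, distance):
        self.distance = distance
        it = iter(words)
        self.root = (next(it), {})               # (word, {distance: child node})
        for w in it:
            self._add(self.root, w)

    def _add(self, node, word):
        parent, children = node
        d = self.distance(word, parent)
        if d in children:
            self._add(children[d], word)         # descend along the matching branch
        else:
            children[d] = (word, {})

    def query(self, word, threshold):
        """Return all (candidate, distance) with distance <= threshold."""
        results, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = self.distance(word, w)
            if d <= threshold:
                results.append((w, d))
            # triangle inequality: only branches in [d - threshold, d + threshold] can match
            stack.extend(child for k, child in children.items()
                         if d - threshold <= k <= d + threshold)
        return results

tree = BKTree(["game", "same", "fame", "gain", "gate", "gay", "aim", "frame"],
              edit_distance)
# tree.query("gane", 1) -> e.g. [("game", 1), ("gate", 1)]
```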
The character string decoding result is looked up in the dictionary; if it does not exist in the dictionary, the decoding result is erroneous. Then, candidate words whose edit distance to the erroneous decoding result is smaller than a set threshold are screened from the BK tree to realize candidate recall; for example, when a decoded word is found to be erroneous, all words in the BK tree with an edit distance of 1 from it, such as game and gate, can be retrieved, thereby implementing the candidate recall.
Since conventional BK-tree-based error correction focuses only on the several candidate words with the shortest edit distance, and rarely used words often appear among these candidates, which is undesirable, this embodiment also introduces the occurrence frequency. After the candidate words are determined, error correction is performed according to the occurrence frequency of the candidate words and their edit distance to the character string decoding result: the smaller the edit distance, the higher the score, and the higher the occurrence frequency, the higher the score; the two scores are weighted, and the candidate word with the highest total score is taken as the final recognition result.
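The final selection can be sketched as a weighted combination of an edit-distance score and a frequency score; the weights, normalization and example frequencies below are illustrative assumptions rather than the patent's exact formula.

```python
def best_correction(candidates, freq, w_dist=0.7, w_freq=0.3):
    """candidates: list of (word, edit_distance) from the BK-tree query.
    freq: dict mapping each dictionary word to its occurrence frequency.
    Smaller edit distance and higher frequency both raise the score."""
    max_freq = max(freq.get(w, 0) for w, _ in candidates) or 1
    def score(item):
        word, dist = item
        dist_score = 1.0 / (1.0 + dist)              # smaller distance -> higher score
        freq_score = freq.get(word, 0) / max_freq    # higher frequency -> higher score
        return w_dist * dist_score + w_freq * freq_score
    return max(candidates, key=score)[0]

# e.g. best_correction([("game", 1), ("gate", 1)], {"game": 120, "gate": 45}) -> "game"
```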
Example 2
The embodiment provides a speech recognition and error correction system for oral english evaluation, which includes:
the characteristic extraction module is configured to extract Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech so as to form a spectrogram;
the feature enhancement module is configured to perform feature enhancement on the spectrogram by warping and masking to obtain a feature map;
an encoding module configured to encode the feature map;
the decoding module is configured to decode according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and the error correction module is configured to check the character string decoding result according to a preset dictionary, screen candidate words in the dictionary according to the editing distance for the words with wrong decoding, and determine correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the words with wrong decoding, so that a correct character string recognition result is obtained.
It should be noted that the modules correspond to the steps described in embodiment 1, and the modules are the same as the corresponding steps in the implementation examples and application scenarios, but are not limited to the disclosure in embodiment 1. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of embodiment 1. For brevity, no further description is provided herein.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or other general purpose processors, digital signal processors (DSP), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements, i.e., algorithm steps, described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive changes in the technical solutions of the present invention.

Claims (10)

1. The speech recognition and error correction method for English spoken language evaluation is characterized by comprising the following steps:
extracting Mel frequency cepstrum coefficients from the spoken English speech after time-frequency conversion to form a spectrogram;
performing feature enhancement on the spectrogram by warping and masking to obtain a feature map;
encoding the feature map;
decoding according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and checking the character string decoding result according to a preset dictionary, screening candidate words in the dictionary according to the editing distance of the wrongly decoded words, and determining correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the wrongly decoded words so as to obtain a correct character string recognition result.
2. The method for speech recognition and error correction for oral english evaluation according to claim 1, wherein the time domain signal of the oral english speech is pre-emphasized by a high pass filter, the pre-emphasized time domain signal is windowed by frames, and the time domain signal in each window is converted into a frequency domain signal by fast fourier transform; and (3) passing the frequency domain signal through a set of triangular filter banks with a Mel scale, and then extracting Mel frequency cepstrum coefficients through discrete cosine transform.
3. The speech recognition and error correction method for spoken English evaluation according to claim 1, wherein the warping is time warping, and the masking is frequency masking or time masking.
4. The method for speech recognition and error correction for spoken English evaluation according to claim 1, wherein the feature map is encoded using an encoder, the encoder comprising a feedforward module, a multi-head self-attention module, and a convolution module, the feedforward module consisting of two 1/2-weighted feedforward modules connected respectively before the multi-head self-attention module and after the convolution module; the feature map is processed by the feedforward module, attention features are extracted by the multi-head self-attention mechanism, the output of the multi-head self-attention module is input to the convolution module after layer normalization and point-by-point convolution, and encoding is completed by the final 1/2-weighted feedforward module.
5. The method for speech recognition and error correction for spoken English evaluation according to claim 1, wherein null elements are added during the decoding process to align characters and phonemes.
6. The method for speech recognition and correction for oral english evaluation according to claim 1, wherein a dictionary is constructed in advance while the occurrence frequency of words in the dictionary is determined, a BK tree is constructed based on the edit distance between words in the dictionary, and the candidate words are screened based on the BK tree.
7. The speech recognition and error correction method for oral english evaluation according to claim 1, wherein the candidate word with the highest total score is taken as the correct word after weighting the scores corresponding to the edit distance and the frequency of occurrence of the candidate word.
8. Speech recognition and error correction system that english spoken language evaluated, characterized by, include:
the characteristic extraction module is configured to extract Mel frequency cepstrum coefficients after time-frequency conversion of the spoken English speech so as to form a spectrogram;
the feature enhancement module is configured to perform feature enhancement on the spectrogram by warping and masking to obtain a feature map;
an encoding module configured to encode the feature map;
the decoding module is configured to decode according to the coding result and the character string identification result at the previous moment to obtain a character string decoding result at the current moment;
and the error correction module is configured to check the character string decoding result according to a preset dictionary, screen candidate words in the dictionary according to the editing distance for the words with wrong decoding, and determine correct words according to the occurrence frequency of the candidate words and the editing distance between the candidate words and the words with wrong decoding, so that a correct character string recognition result is obtained.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202310138725.6A 2023-02-21 2023-02-21 Method, system, equipment and medium for speech recognition and error correction of oral English evaluation Pending CN115862674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310138725.6A CN115862674A (en) 2023-02-21 2023-02-21 Method, system, equipment and medium for speech recognition and error correction of oral English evaluation

Publications (1)

Publication Number Publication Date
CN115862674A true CN115862674A (en) 2023-03-28

Family

ID=85658468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310138725.6A Pending CN115862674A (en) 2023-02-21 2023-02-21 Method, system, equipment and medium for speech recognition and error correction of oral English evaluation

Country Status (1)

Country Link
CN (1) CN115862674A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362824A (en) * 2019-06-24 2019-10-22 广州多益网络股份有限公司 A kind of method, apparatus of automatic error-correcting, terminal device and storage medium
CN113420219A (en) * 2021-06-30 2021-09-21 北京明略昭辉科技有限公司 Method and device for correcting query information, electronic equipment and readable storage medium
CN113569545A (en) * 2021-09-26 2021-10-29 中国电子科技集团公司第二十八研究所 Control information extraction method based on voice recognition error correction model
CN114860870A (en) * 2022-04-02 2022-08-05 北京明略昭辉科技有限公司 Text error correction method and device
CN114444479A (en) * 2022-04-11 2022-05-06 南京云问网络技术有限公司 End-to-end Chinese speech text error correction method, device and storage medium
CN114818668A (en) * 2022-04-26 2022-07-29 北京中科智加科技有限公司 Method and device for correcting personal name of voice transcribed text and computer equipment
CN115293138A (en) * 2022-08-03 2022-11-04 北京中科智加科技有限公司 Text error correction method and computer equipment
CN115497465A (en) * 2022-09-06 2022-12-20 平安银行股份有限公司 Voice interaction method and device, electronic equipment and storage medium
CN115565522A (en) * 2022-11-29 2023-01-03 支付宝(杭州)信息技术有限公司 Training language recognition model, language recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏比·艾依提 et al.: "End-to-end Uyghur speech recognition based on multi-task learning" (基于多任务学习的端到端维吾尔语语音识别), vol. 37, no. 10, pages 1-2 *

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
WO2019214047A1 (en) Method and apparatus for establishing voice print model, computer device, and storage medium
TWI396184B (en) A method for speech recognition on all languages and for inputing words using speech recognition
CN112767958A (en) Zero-learning-based cross-language tone conversion system and method
Imtiaz et al. Isolated word automatic speech recognition (ASR) system using MFCC, DTW & KNN
Razak et al. Quranic verse recitation recognition module for support in j-QAF learning: A review
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Sinha et al. Continuous density hidden markov model for context dependent Hindi speech recognition
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
CN112634860B (en) Method for screening training corpus of children voice recognition model
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN115472168B (en) Short-time voice voiceprint recognition method, system and equipment for coupling BGCC and PWPE features
Shafie et al. Al-Quran recitation speech signals time series segmentation for speaker adaptation using Dynamic Time Warping
CN115862674A (en) Method, system, equipment and medium for speech recognition and error correction of oral English evaluation
Aşlyan Syllable Based Speech Recognition
Khalifa et al. Statistical modeling for speech recognition
Elharati Performance evaluation of speech recognition system using conventional and hybrid features and hidden Markov model classifier
CN116403562B (en) Speech synthesis method and system based on semantic information automatic prediction pause
Gadekar et al. Analysis of speech recognition techniques
CN107305767A (en) A kind of Short Time Speech duration extended method recognized applied to languages
Viana et al. Self-organizing speech recognition that processes acoustic and articulatory features
Tian Research on Speech Recognition Technology of Oral English Learning Based on Improved GLR Algorithm
Satvik et al. Transformer Based Speech to Text Translation for Indic Languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination