CN107610720B - Pronunciation deviation detection method and device, storage medium and equipment - Google Patents

Pronunciation deviation detection method and device, storage medium and equipment

Info

Publication number
CN107610720B
CN107610720B (application CN201710895726.XA; published as CN107610720A)
Authority
CN
China
Prior art keywords
pronunciation
phoneme
speech
landmark
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710895726.XA
Other languages
Chinese (zh)
Other versions
CN107610720A (en)
Inventor
解焱陆
牛传迎
张劲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY filed Critical BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN201710895726.XA priority Critical patent/CN107610720B/en
Publication of CN107610720A publication Critical patent/CN107610720A/en
Application granted granted Critical
Publication of CN107610720B publication Critical patent/CN107610720B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a pronunciation deviation detection method, a device, a storage medium and equipment, wherein the method comprises the following steps: detecting the key frame position of a phoneme in the known correct speech by using a connectionist temporal classification (CTC) method, to be used as an acoustic landmark; and carrying out pronunciation deviation detection on the phoneme in the speech to be detected based on the landmark. The invention uses the CTC method to detect the key frame as the acoustic landmark, without marking the acoustic landmark in advance.

Description

Pronunciation deviation detection method and device, storage medium and equipment
Technical Field
The invention relates to the technical field of computer-assisted speech processing, and in particular to a pronunciation deviation detection method, device, storage medium, and equipment.
Background
Pronunciation deviation detection, as an important technology in computer-assisted pronunciation training systems, can provide learners with an effective way to improve their speaking ability. Over the past decades, a large number of segment-level pronunciation deviation detection methods have been developed. One class is based on automatic speech recognition technology and performs deviation detection within a statistical speech recognition framework; according to the form of feedback, it can be further divided into two types. One type is based on confidence scores, for example the log-likelihood score that measures the similarity between native-language and non-native-language acoustic phoneme models ("Phone-level pronunciation scoring and assessment for interactive language learning," Speech Communication, vol. 30, no. 2-3, pp. 95-108, 2000). The other type regards deviation detection as a classification problem: classifiers are trained on correct pronunciations and on the deviated pronunciations that learners actually produce, so that the specific deviation type can be identified and fed back. A confidence score tells the learner how good a pronunciation is but not what exactly went wrong, whereas classifier-based feedback is more instructive for correcting pronunciation, although it requires the deviation patterns to be known in advance.
Wen Cao et al. defined pronunciation deviation tendencies in terms of the place and manner of articulation, describing the common situation in which a speaker's production lies between the correct pronunciation and a clearly deviated one ("Developing a Chinese L2 speech database of Japanese learners with narrow-phonetic labels for computer assisted pronunciation training," Interspeech 2010, pp. 1922-1925). This situation often arises in advanced learners.
Stevens' acoustic landmark theory, based on the mechanism of human speech production, defines a landmark as a transient region that reflects the quantal, nonlinear relationship between articulation and acoustics (Acoustic Phonetics, MIT Press, 2000; "The quantal nature of speech: evidence from articulatory-acoustic data," 1972, pp. 51-66; "On the quantal nature of speech," Journal of Phonetics, 1989, vol. 17, no. 1-2, pp. 3-45; "Quantal theory, enhancement and overlap," Journal of Phonetics, 2010, vol. 38, no. 1, pp. 10-19). Such a region usually carries abundant information about both the articulatory and the perceptual targets of speech, and is therefore helpful for distinguishing phonemes. Detecting landmarks reliably and automatically, however, remains difficult, which limits the use of landmark-based features in pronunciation deviation detection.
To solve these problems, scholars at home and abroad have proposed various improved methods, which fall roughly into three categories:
The first category starts from the perspective of signal detection: landmarks are obtained by detecting changes in characteristic parameters of the speech signal at different levels and in different dimensions; commonly used parameters include short-time energy, zero-crossing rate and formants. Sharlene A. Liu proposed a method for detecting three consonant-related landmarks using the band energies of speech: the spectrum is divided into six frequency bands according to phoneme articulation characteristics, the peaks and valleys of the first-order difference curve of each band's energy are taken as landmark candidates, and corresponding decision criteria yield the landmark sequence of the signal ("Landmark detection for distinctive feature-based speech recognition," The Journal of the Acoustical Society of America, 1996, vol. 100, no. 5, pp. 3417-3430). A. R. Jayan and P. C. Pandey refined this energy-based detection and studied its behavior under noise.
The second category starts from the perspective of machine learning and selects different parameters for different landmark types. Howitt detected vowel landmarks with a recursive convex-hull algorithm for finding syllable nuclei, which separates the labels of speech frames into vowels and non-vowels ("Automatic Syllable Detection for Vowel Landmarks," PhD thesis, MIT, 1999). Hasegawa-Johnson et al. implemented a landmark-based speech recognition system that detects landmarks with binary SVM classifiers ("Landmark-based speech recognition: report of the 2004 Johns Hopkins summer workshop," ICASSP 2005).
The third category starts from the linguistic point of view: the landmark of an English vowel is assumed to lie at the middle of the phoneme's duration, and the landmarks of a consonant at the beginning, middle and ending times of the phoneme ("Landmark-based automated phonetic transcription"). Xie et al. used speech splicing and synthesis together with perception experiments to locate, within the nasal coda segment, the key perceptual cue that Japanese learners rely on for Chinese nasal finals, and took the middle time of that region as the landmark ("Landmark of Mandarin nasal codas and its application in pronunciation error detection," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016).
In summary, previous studies either design distinctive signal parameters for the landmark types of different phonemes after studying the pronunciation mechanism from the signal-detection point of view, or label landmarks manually from the perspective of perception experiments, or simply assume a fixed position as the landmark. The first method needs no training data with manually labeled landmarks; however, for the landmark of each phoneme the pronunciation mechanism must be studied and distinctive signal parameters designed separately, and the fixed decision criteria that are usually chosen cannot accommodate variation among speakers. The second method only requires selecting distinctive features for automatic classification by machine learning, but it typically relies on manually labeled data for training and requires different discriminative features for different landmarks; detecting all landmarks therefore requires multiple rounds of training. The third method has the advantage that assuming fixed positions is computationally convenient, but it does not take the context sufficiently into account.
Disclosure of Invention
The embodiment of the invention provides a pronunciation deviation detection method, device, storage medium and equipment, which are used for overcoming one or more defects in the prior art.
The embodiment of the invention provides a pronunciation deviation detection method, which comprises the following steps: detecting the key frame position of a phoneme in the known correct speech by using a connectionist temporal classification (CTC) method to be used as an acoustic landmark; and carrying out pronunciation bias detection on the phoneme in the speech to be detected based on the landmark.
In one embodiment, detecting the key frame locations of phonemes in known correct speech as acoustic landmarks using the connectionist temporal classification (CTC) method comprises: training an RNN acoustic model using CTC criteria; decoding the voice of the processing unit in the known correct voice by using the trained RNN acoustic model to obtain a sequence of posterior probabilities of the phonemes in the voice of the processing unit on each time frame; calculating to obtain a peak function value corresponding to each time frame by using the set window length, the set peak function and each posterior probability in the sequence; calculating the mean and variance of all the peak function values larger than zero; obtaining a Chebyshev inequality by using the mean value and the variance, and obtaining a peak function value meeting the Chebyshev inequality; acquiring a maximum peak function value within a set window length range; the key frame position of the phoneme is determined as landmark by using the peak position of the maximum peak function value.
In one embodiment, determining the key frame position of the phoneme using the peak position of the maximum peak function value comprises: judging whether the speech text corresponding to the processing unit of the known correct speech contains the phoneme corresponding to the peak position; if so, taking the peak position as a key frame position; and if the peak position does not exist, eliminating the peak position, re-acquiring the maximum peak function value from the rest peak function values meeting the Chebyshev inequality, and determining the key frame position of the phoneme by using the re-acquired peak position of the maximum peak function value.
In one embodiment, determining the key frame position of the phoneme as landmark using the peak position of the maximum peak function value comprises: determining a key frame relative position of the phoneme by comparing the key frame position with the annotated text phoneme time information corresponding to the processing unit of known correct speech; and averaging the relative positions of all the key frames of the phoneme to obtain a final key frame of the phoneme, wherein the final key frame is used as landmark.
In one embodiment, performing pronunciation bias detection on the phoneme in the speech to be detected based on the landmark includes: extracting the acoustic features of the phonemes in the known bias type speech and the acoustic features of the phonemes in the known correct speech based on the landmark; training an SVM classifier by using the acoustic characteristics of the phonemes in the known bias type speech and the acoustic characteristics of the phonemes in the known correct speech; and carrying out pronunciation bias detection on the phonemes in the speech to be detected by using the trained SVM classifier.
In one embodiment, the set spike function is:
$$S_1(k,i,x_i,T)=\frac{\max\{x_i-x_{i-1},\,x_i-x_{i-2},\,\ldots,\,x_i-x_{i-k}\}+\max\{x_i-x_{i+1},\,x_i-x_{i+2},\,\ldots,\,x_i-x_{i+k}\}}{2}$$
wherein S_1(k, i, x_i, T) represents the value of the spike function, T represents the sequence of the posterior probabilities of the initials and finals in the speech of the processing unit over the respective time frames, k represents the window length, x_i represents the value of the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
An embodiment of the present invention further provides a pronunciation deviation detecting device, including: an acoustic landmark determination unit for: detecting the key frame position of a phoneme in the known correct speech by using a connectionist temporal classification (CTC) method to be used as an acoustic landmark; a pronunciation bias detection unit for: and carrying out pronunciation bias detection on the phoneme in the speech to be detected based on the landmark.
In one embodiment, the acoustic landmark determination unit comprises: an acoustic model training module to: training an RNN acoustic model using CTC criteria; a probability sequence generation module to: decoding the voice of the processing unit in the known correct voice by using the trained RNN acoustic model to obtain a sequence of posterior probabilities of the phonemes in the voice of the processing unit on each time frame; a spike function value generation module to: calculating to obtain a peak function value corresponding to each time frame by using the set window length, the set peak function and each posterior probability in the sequence; an inequality parameter generation module to: calculating the mean and variance of all the peak function values larger than zero; a spike function value screening module to: obtaining a Chebyshev inequality by using the mean value and the variance, and obtaining a peak function value meeting the Chebyshev inequality; a maximum spike function value determination module to: acquiring a maximum peak function value within a set window length range; an acoustic landmark determination module to: the key frame position of the phoneme is determined as landmark by using the peak position of the maximum peak function value.
In one embodiment, the acoustic landmark determination module comprises: a phoneme judging module, configured to: judging whether the speech text corresponding to the processing unit of the known correct speech contains the phoneme corresponding to the peak position; a keyframe location determination module to: if so, taking the peak position as a key frame position; and if the peak position does not exist, eliminating the peak position, re-acquiring the maximum peak function value from the rest peak function values meeting the Chebyshev inequality, and determining the key frame position of the phoneme by using the re-acquired peak position of the maximum peak function value.
In one embodiment, the acoustic landmark determination module comprises: a keyframe relative position determination module to: determining a key frame relative position of the phoneme by comparing the key frame position with the annotated text phoneme time information corresponding to the processing unit of known correct speech; a final keyframe determination module to: and averaging the relative positions of all the key frames of the phoneme to obtain a final key frame of the phoneme, wherein the final key frame is used as landmark.
In one embodiment, the pronunciation deviation detecting unit includes: an acoustic feature extraction module to: extracting the acoustic features of the phonemes in the known bias type speech and the acoustic features of the phonemes in the known correct speech based on the landmark; an SVM classifier training module to: training an SVM classifier by using the acoustic characteristics of the phonemes in the known bias type speech and the acoustic characteristics of the phonemes in the known correct speech; a pronunciation bias detection module for: and carrying out pronunciation bias detection on the phonemes in the speech to be detected by using the trained SVM classifier.
In one embodiment, the spike function value generating module is further configured to perform:
the set spike function is:
$$S_1(k,i,x_i,T)=\frac{\max\{x_i-x_{i-1},\,x_i-x_{i-2},\,\ldots,\,x_i-x_{i-k}\}+\max\{x_i-x_{i+1},\,x_i-x_{i+2},\,\ldots,\,x_i-x_{i+k}\}}{2}$$
wherein S_1(k, i, x_i, T) represents the value of the spike function, T represents the sequence of the posterior probabilities of the initials and finals in the speech of the processing unit over the respective time frames, k represents the window length, x_i represents the value of the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method described in the above embodiments.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the steps of the method described in the above embodiments are implemented.
According to the pronunciation deviation detection method, device, storage medium and equipment of the embodiments of the invention, the landmark is determined by detecting the key frame position with the CTC method, so the landmark does not need to be manually labeled in advance and the dependence on manually labeled landmarks is avoided; moreover, a unified speech recognition framework is adopted, which facilitates pronunciation deviation detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
FIG. 1 is a flow chart of a pronunciation bias detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for determining the key frame locations of phonemes in known correct speech using a connectionist temporal classification (CTC) method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for determining a keyframe location of a phoneme using a peak location of a maximum peak function value according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for determining a key frame position of a phoneme as an acoustic landmark by using a peak position of a maximum peak function value according to another embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for pronunciation bias detection of phonemes in a speech to be detected based on acoustic landmarks according to an embodiment of the present invention;
FIG. 6 is a flow chart illustrating an algorithm for extracting a peak of each phoneme in a sentence according to an embodiment of the present invention;
FIG. 7 is a graphical representation of the CTC spike phenomenon in accordance with one embodiment of the present invention;
FIG. 8 is a flowchart of pronunciation bias detection according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a pronunciation deviation detection apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of the structure of an acoustic landmark determining unit in an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an acoustic landmark determination module in an embodiment of the invention;
FIG. 12 is a schematic structural diagram of an acoustic landmark determination module in another embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a pronunciation bias detection unit according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
In order to avoid the dependence on manually labeled landmarks, the embodiment of the invention provides a pronunciation bias detection method. Fig. 1 is a flowchart illustrating a pronunciation bias detection method according to an embodiment of the present invention. As shown in fig. 1, the pronunciation bias detection method according to the embodiment of the present invention may include:
step S110: detecting the key frame position of a phoneme in the known correct speech by using a connectionist temporal classification (CTC) method, to be used as an acoustic landmark;
step S120: and carrying out pronunciation bias detection on the phoneme in the speech to be detected based on the landmark.
The CTC method can utilize a recurrent neural network for sequence label learning. The main problem in speech recognition is to convert an acoustic feature sequence into a text label sequence, such as a sequence of Chinese initials and finals, where the former is usually much longer than the latter. In an embodiment, a blank tag may be introduced through CTC to absorb confusable or indeterminate boundaries between two pronunciation units, and tags are allowed to repeat, resulting in an optimal alignment between speech frames and output tags. In an embodiment, CTC may use the softmax layer of the RNN to give, at each time step, a posterior probability for each modeled unit. In an embodiment, multiple output tag paths may be mapped by a many-to-one mapping to a sequence without duplicate tags and blank tags. In an embodiment, CTC may sum over all possible alignments of a target sequence through a forward-backward algorithm.
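As a concrete illustration of the many-to-one mapping just described, the following minimal Python sketch collapses a frame-level label path into an output sequence; the integer IDs and the choice of 0 as the blank tag are illustrative assumptions rather than values fixed by the embodiment.

```python
from itertools import groupby

BLANK = 0  # illustrative choice: reserve ID 0 for the CTC blank tag

def ctc_collapse(frame_labels):
    """CTC's many-to-one mapping: merge repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]  # merge repeats
    return [label for label in merged if label != BLANK]    # remove blanks

# The path [5, 5, 0, 5, 0, 0, 7, 7] maps to [5, 5, 7]: the blank between
# the two runs of 5 keeps the genuinely repeated label from merging away.
assert ctc_collapse([5, 5, 0, 5, 0, 0, 7, 7]) == [5, 5, 7]
```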
In step S110, the known correct speech can be obtained from an existing native-language corpus. The phonemes may be, for example, initials or finals. In step S120, based on the determined landmark, pronunciation bias detection may be performed by a variety of different methods, for example with an SVM (Support Vector Machine) classifier.
In the embodiment of the invention, the position of the key frame is detected by using the CTC method to determine the landmark, so that the landmark is not required to be manually marked in advance, and the dependence on manually marking the landmark is avoided. Moreover, a uniform voice recognition framework is adopted, and the consistency of detection results is good.
FIG. 2 is a flowchart illustrating a method for determining the key frame locations of phonemes in known correct speech as acoustic landmarks using the connectionist temporal classification (CTC) method according to an embodiment of the present invention. As shown in fig. 2, in step S110, detecting the locations of the key frames of the phonemes in the known correct speech as the acoustic landmark by using the connectionist temporal classification (CTC) method may include:
step S111: training an RNN acoustic model using CTC criteria;
step S112: decoding the voice of the processing unit in the known correct voice by using the trained RNN acoustic model to obtain a sequence of posterior probabilities of the phonemes in the voice of the processing unit on each time frame;
step S113: calculating to obtain a peak function value corresponding to each time frame by using the set window length, the set peak function and each posterior probability in the sequence;
step S114: calculating the mean and variance of all the peak function values larger than zero;
step S115: obtaining a Chebyshev inequality by using the mean value and the variance, and obtaining a peak function value meeting the Chebyshev inequality;
step S116: acquiring a maximum peak function value within a set window length range;
step S117: the key frame position of the phoneme is determined as landmark by using the peak position of the maximum peak function value.
In step S111, an RNN (Recurrent Neural Network) acoustic model may be trained using the correct speech in the native-language corpus as input; in other embodiments, other acoustic models may be used. In step S112, the processing unit may be, for example, a sentence, and the sequence is a time sequence. In step S115, the calculated mean and variance may be substituted into the standard Chebyshev inequality to obtain a specific Chebyshev inequality, and each spike function value may be substituted into it as the value of the variable to determine whether the inequality is satisfied. In an embodiment, when a spike function value satisfying the Chebyshev inequality is retained, its original index may be recorded at the same time, so that its time frame (peak position) is obtained. In step S116, the spike function values may be compared within a set window length (e.g., 2k) and only the maximum value retained. In step S117, the obtained key frame position may be used directly as the landmark, or be determined to be the landmark after certain screening or judgment. The inventors found that the posterior probabilities of the output tags of an RNN model trained with the CTC criterion exhibit an obvious spike phenomenon, and the landmark can be determined effectively by exploiting this property.
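To make step S112 concrete, the short Python sketch below forms the posterior sequence T of one phoneme from a frame-by-unit posterior matrix; the matrix is simulated here with a row-wise softmax over random logits, and the shapes and the phoneme-to-column mapping are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(size=(200, 60))   # 200 time frames, 60 output units (incl. blank)
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax rows

phone_id = 17                 # hypothetical column index of the phoneme of interest
T = posteriors[:, phone_id]   # posterior of that phoneme on each time frame
print(T.shape)                # (200,): one probability per time frame
```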
In an embodiment, the set spike function may be:
$$S_1(k,i,x_i,T)=\frac{\max\{x_i-x_{i-1},\,x_i-x_{i-2},\,\ldots,\,x_i-x_{i-k}\}+\max\{x_i-x_{i+1},\,x_i-x_{i+2},\,\ldots,\,x_i-x_{i+k}\}}{2}$$
wherein S_1(k, i, x_i, T) represents the value of the spike function, T represents the sequence of the posterior probabilities of the initials and finals in the speech of the processing unit over the respective time frames, k represents the window length, x_i represents the value of the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
In this embodiment, the larger the spike function value S_1(k, i, x_i, T), the more likely the position is to be a spike, so the set spike function allows the maximum spike position to be selected effectively.
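A direct Python transcription of this spike function is sketched below; frames near the sequence boundaries, which the formula leaves unspecified, are handled by clipping the window, and that boundary treatment is an assumption.

```python
import numpy as np

def spike_value(T, i, k):
    """S1(k, i, x_i, T): average of the largest rise of x_i over its k left
    neighbours and over its k right neighbours (windows clipped at the ends)."""
    left = T[max(0, i - k):i]
    right = T[i + 1:i + 1 + k]
    left_max = float(np.max(T[i] - left)) if left.size else 0.0
    right_max = float(np.max(T[i] - right)) if right.size else 0.0
    return (left_max + right_max) / 2.0

# A frame standing well above both neighbourhoods gets a large value:
T = np.array([0.01, 0.02, 0.90, 0.03, 0.01])
print(spike_value(T, 2, k=2))  # 0.89: a strong spike candidate
```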
Fig. 3 is a flowchart illustrating a method for determining a keyframe position of a phoneme using a peak position with a maximum peak function value according to an embodiment of the present invention. As shown in fig. 3, in step S117, the method for determining the key frame position of the phoneme using the peak position of the maximum peak function value may include:
step S1171: judging whether the speech text corresponding to the processing unit of the known correct speech contains the phoneme corresponding to the peak position;
step S1172: if so, taking the peak position as a key frame position;
step S1173: and if the peak position does not exist, eliminating the peak position, re-acquiring the maximum peak function value from the rest peak function values meeting the Chebyshev inequality, and determining the key frame position of the phoneme by using the re-acquired peak position of the maximum peak function value.
When the calculation is inaccurate, the selected maximum of the spike function may be small, so that the corresponding phoneme is not actually contained in the sentence (processing unit). Through steps S1171 to S1173, the peak positions of phonemes not contained in the sentence (processing unit) are removed by combining the known text, which improves the accuracy of the key frame positions.
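A minimal sketch of this screening follows, assuming the candidate spikes arrive as (phoneme, frame, spike value) tuples and that the phoneme inventory of the sentence is available from the known text; the data layout is illustrative.

```python
def key_frame_for_phoneme(candidates, sentence_phonemes):
    """Walk the candidate spikes from the largest to the smallest spike value
    and return the first frame whose phoneme occurs in the known text
    (steps S1171 to S1173); return None if no candidate survives."""
    for phoneme, frame, _value in sorted(candidates, key=lambda c: -c[2]):
        if phoneme in sentence_phonemes:
            return frame
    return None

# "ng" has the larger spike but is absent from the text, so it is eliminated
# and the maximum is re-acquired among the remaining candidates:
cands = [("ng", 41, 0.62), ("an", 40, 0.45)]
print(key_frame_for_phoneme(cands, {"h", "an", "ao"}))  # -> 40
```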
Fig. 4 is a flowchart illustrating a method for determining a key frame position of a phoneme as an acoustic landmark by using a peak position of a maximum peak function value according to another embodiment of the present invention. As shown in fig. 4, in step S117, determining a key frame position of the phoneme by using the peak position with the largest peak function value may include:
step S1174: determining a key frame relative position of the phoneme by comparing the key frame position with the annotated text phoneme time information corresponding to the processing unit of known correct speech;
step S1175: and averaging the relative positions of all the key frames of the phoneme to obtain a final key frame of the phoneme, wherein the final key frame is used as landmark.
In this embodiment, the time information of the phoneme of the labeled text may be time information of initials and finals of the labeled text. A processing unit (a sentence) can contain a plurality of same phonemes, and a plurality of key frame positions of the same phoneme can be averaged to obtain a uniform key frame position, so that pronunciation bias detection is convenient to implement.
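The bookkeeping of steps S1174 and S1175 can be sketched as follows; expressing each key frame as a fraction of the annotated phoneme interval is one reasonable reading of "relative position", and both that normalization and the argument layout are assumptions.

```python
import numpy as np

def final_key_frame_position(key_frames, phone_intervals):
    """Average relative key-frame position of one phoneme.

    key_frames: absolute frame indices detected for this phoneme;
    phone_intervals: matching (start_frame, end_frame) pairs taken from the
    annotated text. Returns a fraction in [0, 1] of the phoneme's span."""
    rel = [(kf - start) / (end - start)
           for kf, (start, end) in zip(key_frames, phone_intervals)]
    return float(np.mean(rel))

# Three detections of the same phoneme across a corpus:
print(final_key_frame_position([15, 55, 105], [(10, 20), (50, 60), (100, 110)]))
# -> 0.5, i.e. the landmark sits at the middle of the phoneme on average
```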
In an embodiment, the key frame position, final key frame position, or key frame position average may be compared to manually labeled landmark, and if consistent, the key frame position, final key frame position, or key frame position average may be used as landmark to perform pronunciation bias detection, and if inconsistent, manually labeled landmark may be used to perform pronunciation bias detection, thereby improving pronunciation bias detection.
FIG. 5 is a flowchart illustrating a method for pronunciation bias detection of phonemes in a speech to be detected based on acoustic landmarks according to an embodiment of the present invention. As shown in fig. 5, in step S120, the method for performing pronunciation bias detection on phonemes in the speech to be detected based on landmark may include:
step S121: extracting the acoustic features of the phonemes in the known bias type speech and the acoustic features of the phonemes in the known correct speech based on the landmark;
step S122: training an SVM classifier by using the acoustic characteristics of the phonemes in the known bias type speech and the acoustic characteristics of the phonemes in the known correct speech;
step S123: and carrying out pronunciation bias detection on the phonemes in the speech to be detected by using the trained SVM classifier.
In the embodiment, the trained SVM classifier is used for pronunciation deviation detection, so that a better detection result can be obtained.
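A minimal sketch of the classifier stage (steps S121 to S123) using scikit-learn's SVC is given below; the random arrays stand in for acoustic features extracted around the landmark frame (e.g. MFCCs), since the embodiment does not fix the feature type or dimensionality at this point.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_correct = rng.normal(0.0, 1.0, size=(100, 39))  # stand-in features, correct tokens
X_deviant = rng.normal(0.8, 1.0, size=(100, 39))  # stand-in features, one deviation type

X = np.vstack([X_correct, X_deviant])
y = np.array([0] * 100 + [1] * 100)               # 0 = correct, 1 = deviated

clf = SVC(kernel="rbf").fit(X, y)                 # one binary SVM per deviation type

x_test = rng.normal(0.8, 1.0, size=(1, 39))       # landmark-centred features of a token
print("deviated" if clf.predict(x_test)[0] == 1 else "correct")
```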
FIG. 6 is a flowchart illustrating an algorithm for extracting the peak of each phoneme in the sentence according to an embodiment of the present invention. As shown in fig. 6, a method for extracting a peak of each phoneme in a sentence by using a sentence as a processing unit may include:
step S301: and directly decoding a sentence by using the RNN acoustic model trained by the CTC to obtain a probability sequence.
At each time frame, the posterior probability x_i of a modeling unit (e.g., an initial or final) is extracted from the native speech, forming a probability sequence T that contains N points (N represents the number of time steps of the sentence).
Step S302: calculating the spike function value a_i corresponding to each time frame, and obtaining the array of spike function values greater than zero.
In an embodiment, the spike function is selected as:
$$S_1(k,i,x_i,T)=\frac{\max\{x_i-x_{i-1},\,x_i-x_{i-2},\,\ldots,\,x_i-x_{i-k}\}+\max\{x_i-x_{i+1},\,x_i-x_{i+2},\,\ldots,\,x_i-x_{i+k}\}}{2}$$
S_1(k, i, x_i, T) represents how salient the probability value x_i of the i-th point in the time sequence T is relative to the surrounding points: the larger the value, the greater the probability that the point is a spike. The values of S_1(k, i, x_i, T) greater than 0 (the candidate spikes) are collected into an array a, and their original indices in the time sequence are kept.
In an embodiment, the window length k may be set to half of the average duration of each phoneme as counted from the corpus, or selected empirically, for example set to 4.
Step S303: the mean m and variance s of all elements in array a are calculated.
Step S304: the Chebyshev inequality is applied to screen the spike function values:

$$P\left(\left|X-\mu\right|\geq h\sigma\right)\leq\frac{1}{h^{2}}$$

where μ is the mean, σ is the standard deviation, and h is a constant greater than 0. The inequality does not assume that the random variable X obeys any particular distribution, and it implies that only few spikes can satisfy the screening condition. If a candidate's spike function value satisfies

$$S_1(k,i,x_i,T)-m>h\cdot s,$$

the candidate spike value x_i is retained and its original index recorded. Here h can be manually set to any constant greater than 0.
Step S305: post-processing is performed to compare the peak values over the window length range (2k) and only one maximum value is retained.
The finally remaining spikes are the true candidate spikes, and their original indices are the final candidate peak locations. Since the maximum selected by the algorithm may be small, the corresponding phoneme may not actually be contained in the sentence. For pronunciation deviation detection and labeling tasks the text of the speech is known, so the peak positions of phonemes not contained in the speech are eliminated by combining the known text.
In tasks such as speech recognition, where the text is not known, a threshold needs to be set instead for detecting the key frame, and candidate peak positions with excessively small peak values are eliminated.
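Putting steps S301 to S305 together, the Python sketch below extracts the candidate key frames of one modelling unit from its posterior sequence. It follows the procedure in the text, with two stated assumptions: the screening step uses the standard deviation (which the classical Chebyshev form calls for) and the spike-function window is clipped at the sequence boundaries.

```python
import numpy as np

def extract_peaks(T, k=4, h=1.0):
    """Candidate key-frame indices from the posterior sequence T
    (steps S301 to S305). k: window length (4 as in the example above);
    h: screening constant, manually set to a value greater than 0."""
    def s1(i):  # S302: spike function of frame i
        left, right = T[max(0, i - k):i], T[i + 1:i + 1 + k]
        lm = float(np.max(T[i] - left)) if left.size else 0.0
        rm = float(np.max(T[i] - right)) if right.size else 0.0
        return (lm + rm) / 2.0

    # S302: keep positive spike values together with their original indices.
    a = [(i, s1(i)) for i in range(len(T))]
    a = [(i, v) for i, v in a if v > 0]
    if not a:
        return []

    # S303: statistics of the positive spike values.
    vals = np.array([v for _, v in a])
    m, s = vals.mean(), vals.std()

    # S304: Chebyshev-style screening, keeping only values far above the mean.
    kept = [(i, v) for i, v in a if v - m > h * s]

    # S305: within any 2k neighbourhood keep only the largest spike.
    peaks = []
    for i, v in sorted(kept):
        if peaks and i - peaks[-1][0] <= 2 * k:
            if v > peaks[-1][1]:
                peaks[-1] = (i, v)
        else:
            peaks.append((i, v))
    return [i for i, _ in peaks]

# Toy posterior curve with three spikes; the weak one at frame 30 is screened out:
T = np.zeros(60)
T[14], T[30], T[43] = 0.9, 0.2, 0.8
print(extract_peaks(T, h=0.5))  # -> [14, 43]
```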
FIG. 7 is a graphical representation of the CTC spike phenomenon in accordance with one embodiment of the present invention. As shown in fig. 7, taking the sentence "We've done our part" as an example, the confusable or uncertain boundaries between two sounds are absorbed by the blank label, and under CTC the posterior probabilities of the labels corresponding to this sentence show spikes at w, iy, v, d, ah, n, aa, r, p, t.
FIG. 8 is a flowchart of pronunciation deviation detection according to an embodiment of the invention. As shown in fig. 8, the whole detection framework can be divided into two stages. In the first stage, the speech of a native-language corpus is taken as input and an RNN acoustic model is trained with the CTC criterion; the features extracted from the native speech are decoded according to the spike extraction algorithm above to generate label posterior probabilities and extract the spike positions; the spike positions are compared with the initial and final time information in the annotated text (relative to the starting time of each phoneme) to determine the key frame positions (the relative spike position of each phoneme is counted); and the key frame positions of each phoneme are averaged to obtain the final key frame. In the second stage, pronunciation deviation detection is performed based on the key frames: acoustic features are extracted from samples of a specific phoneme and of its deviated pronunciations at the key frame positions obtained in the first stage, and the specific phoneme is detected with an SVM classifier trained on correct pronunciations and their deviation types.
In an embodiment, the consistency of the CTC-driven spike positions with manually identified landmark positions may first be verified, and the data-driven spikes may then be used as key frames in the CTC system to detect pronunciation deviations. The advantage is that landmarks do not need to be labeled in advance and a unified speech recognition framework is adopted.
Based on the same inventive concept as the pronunciation bias detection method shown in fig. 1, the embodiment of the present application further provides a pronunciation bias detection device, as described in the following embodiments. Because the principle of solving the problem of the pronunciation error detection device is similar to that of the pronunciation error detection method, the implementation of the pronunciation error detection device can refer to the implementation of the pronunciation error detection method, and repeated parts are not repeated.
Fig. 9 is a schematic structural diagram of a pronunciation deviation detecting device according to an embodiment of the invention. As shown in fig. 9, the pronunciation deviation detecting device according to the embodiment of the present invention may include: an acoustic landmark determining unit 510 and a pronunciation bias detecting unit 520, which are connected to each other.
An acoustic landmark determination unit 510 for: detecting the key frame position of a phoneme in the known correct speech by using a connectionist temporal classification (CTC) method to be used as an acoustic landmark;
a pronunciation bias detection unit 520 for: and carrying out pronunciation bias detection on the phoneme in the speech to be detected based on the landmark.
Fig. 10 is a schematic structural diagram of an acoustic landmark determining unit in an embodiment of the present invention. As shown in fig. 10, the acoustic landmark determining unit 510 may include: the acoustic model training module 511, the probability sequence generating module 512, the spike function value generating module 513, the inequality parameter generating module 514, the spike function value screening module 515, the maximum spike function value determining module 516, and the acoustic landmark determining module 517, which are connected in sequence.
An acoustic model training module 511 configured to: training an RNN acoustic model using CTC criteria;
a probability sequence generation module 512 configured to: decoding the voice of the processing unit in the known correct voice by using the trained RNN acoustic model to obtain a sequence of posterior probabilities of the phonemes in the voice of the processing unit on each time frame;
a spike function value generating module 513 configured to: calculating to obtain a peak function value corresponding to each time frame by using the set window length, the set peak function and each posterior probability in the sequence;
an inequality parameter generation module 514 to: calculating the mean and variance of all the peak function values larger than zero;
a spike function value screening module 515 to: obtaining a Chebyshev inequality by using the mean value and the variance, and obtaining a peak function value meeting the Chebyshev inequality;
a maximum spike function value determining module 516 configured to: acquiring a maximum peak function value within a set window length range;
an acoustic landmark determination module 517 for: the key frame position of the phoneme is determined as landmark by using the peak position of the maximum peak function value.
Fig. 11 is a schematic structural diagram of an acoustic landmark determination module in an embodiment of the present invention. As shown in fig. 11, in an embodiment, the acoustic landmark determining module 517 may include: a phoneme judging module 5171 and a key frame position determining module 5172, which are connected to each other.
A phoneme judging module 5171, configured to: judging whether the speech text corresponding to the processing unit of the known correct speech contains the phoneme corresponding to the peak position;
a keyframe location determination module 5172 to: if so, taking the peak position as a key frame position; and if the peak position does not exist, eliminating the peak position, re-acquiring the maximum peak function value from the rest peak function values meeting the Chebyshev inequality, and determining the key frame position of the phoneme by using the re-acquired peak position of the maximum peak function value.
Fig. 12 is a schematic structural diagram of an acoustic landmark determination module in another embodiment of the present invention. As shown in fig. 12, in an embodiment, the acoustic landmark determining module 517 includes: a key frame relative position determination module 5173 and a final key frame determination module 5174, which are connected to each other.
A key frame relative position determination module 5173 configured to: determining a key frame relative position of the phoneme by comparing the key frame position with the annotated text phoneme time information corresponding to the processing unit of known correct speech;
a final key frame determination module 5174 configured to: and averaging the relative positions of all the key frames of the phoneme to obtain a final key frame of the phoneme, wherein the final key frame is used as landmark.
Fig. 13 is a schematic structural diagram of a pronunciation bias detection unit according to an embodiment of the invention. As shown in fig. 13, the pronunciation deviation detecting unit 520 may include: an acoustic feature extraction module 521, an SVM classifier training module 522, and a pronunciation bias detection module 523, which are connected in sequence.
An acoustic feature extraction module 521 for: extracting the acoustic features of the phonemes in the known bias type speech and the acoustic features of the phonemes in the known correct speech based on the landmark;
an SVM classifier training module 522 to: training an SVM classifier by using the acoustic characteristics of the phonemes in the known bias type speech and the acoustic characteristics of the phonemes in the known correct speech;
a pronunciation bias detection module 523 configured to: and carrying out pronunciation bias detection on the phonemes in the speech to be detected by using the trained SVM classifier.
In an embodiment, the spike function value generating module 513 may be further configured to:
the set spike function is:
$$S_1(k,i,x_i,T)=\frac{\max\{x_i-x_{i-1},\,x_i-x_{i-2},\,\ldots,\,x_i-x_{i-k}\}+\max\{x_i-x_{i+1},\,x_i-x_{i+2},\,\ldots,\,x_i-x_{i+k}\}}{2}$$
wherein S_1(k, i, x_i, T) represents the value of the spike function, T represents the sequence of the posterior probabilities of the initials and finals in the speech of the processing unit over the respective time frames, k represents the window length, x_i represents the value of the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method described in the above embodiments.
Fig. 14 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 14, the computer device 600 includes a memory 610, a processor 620 and a computer program stored in the memory and executable on the processor, and when the processor 620 executes the computer program, the steps of the method according to the above embodiments are implemented.
In summary, the pronunciation deviation detection method, apparatus, storage medium and device in the embodiments of the present invention determine the landmark by detecting the key frame position with the CTC method, without manually labeling landmarks in advance, avoid the dependence on manually labeled landmarks, and use a unified speech recognition framework that facilitates pronunciation deviation detection.
In the description herein, reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," "an example," "a particular example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the various embodiments is provided to schematically illustrate the practice of the invention, and the sequence of steps is not limited and can be suitably adjusted as desired.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A pronunciation bias detection method is characterized by comprising the following steps:
detecting the key frame position of a phoneme in the known correct speech by using a connectionist temporal classification (CTC) method to be used as an acoustic landmark;
performing pronunciation bias detection on the phoneme in the speech to be detected based on the landmark;
the method for detecting the key frame positions of phonemes in the known correct speech by using the CTC method comprises the following steps of:
training an RNN acoustic model using CTC criteria;
decoding the voice of the processing unit in the known correct voice by using the trained RNN acoustic model to obtain a sequence of posterior probabilities of the phonemes in the voice of the processing unit on each time frame;
calculating to obtain a peak function value corresponding to each time frame by using the set window length, the set peak function and each posterior probability in the sequence;
calculating the mean and variance of all the peak function values larger than zero;
obtaining a Chebyshev inequality by using the mean value and the variance, and obtaining a peak function value meeting the Chebyshev inequality;
acquiring a maximum peak function value within a set window length range;
the key frame position of the phoneme is determined as landmark by using the peak position of the maximum peak function value.
2. The pronunciation bias detection method as claimed in claim 1, wherein determining the key frame position of the phoneme using the peak position of the maximum peak function value comprises:
judging whether the speech text corresponding to the processing unit of the known correct speech contains the phoneme corresponding to the peak position;
if so, taking the peak position as a key frame position; and if the peak position does not exist, eliminating the peak position, re-acquiring the maximum peak function value from the rest peak function values meeting the Chebyshev inequality, and determining the key frame position of the phoneme by using the re-acquired peak position of the maximum peak function value.
3. The pronunciation bias detection method as claimed in claim 1, wherein determining the key frame position of the phoneme as landmark using the peak position of the maximum peak function value comprises:
determining a key frame relative position of the phoneme by comparing the key frame position with the annotated text phoneme time information corresponding to the processing unit of known correct speech;
and averaging the relative positions of all the key frames of the phoneme to obtain a final key frame of the phoneme, wherein the final key frame is used as landmark.
4. The pronunciation bias detection method as claimed in claim 1, wherein performing pronunciation bias detection on the phonemes in the speech to be detected based on the landmark comprises:
extracting the acoustic features of the phonemes in the known bias type speech and the acoustic features of the phonemes in the known correct speech based on the landmark;
training an SVM classifier by using the acoustic characteristics of the phonemes in the known bias type speech and the acoustic characteristics of the phonemes in the known correct speech;
and carrying out pronunciation bias detection on the phonemes in the speech to be detected by using the trained SVM classifier.
5. The pronunciation bias detection method as claimed in any one of claims 1 to 3, wherein the set peak function is:
$$S_1(k,i,x_i,T)=\frac{\max\{x_i-x_{i-1},\,x_i-x_{i-2},\,\ldots,\,x_i-x_{i-k}\}+\max\{x_i-x_{i+1},\,x_i-x_{i+2},\,\ldots,\,x_i-x_{i+k}\}}{2}$$
wherein S_1(k, i, x_i, T) represents the value of the spike function, T represents the sequence of the posterior probabilities of the initials and finals in the speech of the processing unit over the respective time frames, k represents the window length, x_i represents the value of the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
6. A pronunciation deviation detecting device, comprising:
an acoustic landmark determination unit for: detecting the key frame position of a phoneme in the known correct speech by using a connectionist temporal classification (CTC) method to be used as an acoustic landmark;
a pronunciation bias detection unit for: performing pronunciation bias detection on the phoneme in the speech to be detected based on the landmark;
the acoustic landmark determination unit includes:
an acoustic model training module to: training an RNN acoustic model using CTC criteria;
a probability sequence generation module to: decoding the voice of the processing unit in the known correct voice by using the trained RNN acoustic model to obtain a sequence of posterior probabilities of the phonemes in the voice of the processing unit on each time frame;
a spike function value generation module to: calculating to obtain a peak function value corresponding to each time frame by using the set window length, the set peak function and each posterior probability in the sequence;
an inequality parameter generation module to: calculating the mean and variance of all the peak function values larger than zero;
a spike function value screening module to: obtaining a Chebyshev inequality by using the mean value and the variance, and obtaining a peak function value meeting the Chebyshev inequality;
a maximum spike function value determination module to: acquiring a maximum peak function value within a set window length range;
an acoustic landmark determination module to: the key frame position of the phoneme is determined as landmark by using the peak position of the maximum peak function value.
7. The pronunciation bias detection device as claimed in claim 6, wherein the acoustic landmark determination module comprises:
a phoneme judging module, configured to: judging whether the speech text corresponding to the processing unit of the known correct speech contains the phoneme corresponding to the peak position;
a keyframe location determination module to: if so, taking the peak position as a key frame position; and if the peak position does not exist, eliminating the peak position, re-acquiring the maximum peak function value from the rest peak function values meeting the Chebyshev inequality, and determining the key frame position of the phoneme by using the re-acquired peak position of the maximum peak function value.
8. The pronunciation bias detection device as claimed in claim 6, wherein the acoustic landmark determination module comprises:
a key frame relative position determination module to: determine the relative key frame position of the phoneme by comparing the key frame position with the annotated phoneme time information of the text corresponding to the processing unit of the known correct speech;
a final key frame determination module to: average the relative key frame positions over all occurrences of the phoneme to obtain the final key frame of the phoneme, which is used as the landmark.
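A sketch of the averaging step of claim 8, assuming that "relative position" means the key frame's offset normalised by the annotated segment length (the claim does not spell out the normalisation):

def final_landmark(key_frames, segments):
    # key_frames: detected key frame index for each occurrence of the phoneme;
    # segments: (start_frame, end_frame) pairs from the annotated phoneme
    # time information of the same occurrences.
    rel = [(kf - s) / float(e - s) for kf, (s, e) in zip(key_frames, segments)]
    return sum(rel) / len(rel)  # average relative position in [0, 1]

A returned value of 0.4, for instance, would place the final landmark 40% of the way into any segment of that phoneme.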
9. The pronunciation bias detection device as claimed in claim 6, wherein the pronunciation bias detection unit comprises:
an acoustic feature extraction module to: extract, based on the landmark, the acoustic features of the phonemes in the known bias type speech and the acoustic features of the phonemes in the known correct speech;
an SVM classifier training module to: train an SVM classifier by using the acoustic features of the phonemes in the known bias type speech and the acoustic features of the phonemes in the known correct speech;
a pronunciation bias detection module to: carry out pronunciation bias detection on the phonemes in the speech to be detected by using the trained SVM classifier.
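A sketch of the classifier stage of claim 9 using scikit-learn; the feature layout (MFCC frames windowed around the landmark) and the RBF kernel are assumptions, and the random arrays merely stand in for real landmark-centred features:

import numpy as np
from sklearn.svm import SVC

# Hypothetical landmark-centred features: e.g. 5 MFCC frames (39 dims each)
# around the landmark, flattened into one vector per phoneme token.
X_correct = np.random.randn(200, 39 * 5)  # stand-in: known correct tokens
X_biased = np.random.randn(200, 39 * 5)   # stand-in: known bias-type tokens
X = np.vstack([X_correct, X_biased])
y = np.array([0] * len(X_correct) + [1] * len(X_biased))  # 0 correct, 1 biased

clf = SVC(kernel="rbf")  # kernel choice is an assumption, not stated here
clf.fit(X, y)

# Detection: extract the same landmark-centred features from the speech to
# be detected and classify each phoneme token.
x_test = np.random.randn(1, 39 * 5)
print("bias detected" if clf.predict(x_test)[0] == 1 else "pronunciation correct")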
10. The pronunciation bias detection device as claimed in any one of claims 6 to 8, wherein the peak function value generation module is further configured such that the set peak function is:
[Formula image FDA0002528699840000041: definition of the set peak function]
wherein S_i(k, i, x_i, T) represents the value of the peak function, T represents the sequence of the posterior probabilities of the initials and finals in the speech of the processing unit over the respective time frames, k represents the window length, x_i represents the value of the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
11. A computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 5 are implemented when the processor executes the program.
CN201710895726.XA 2017-09-28 2017-09-28 Pronunciation deviation detection method and device, storage medium and equipment Active CN107610720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710895726.XA CN107610720B (en) 2017-09-28 2017-09-28 Pronunciation deviation detection method and device, storage medium and equipment


Publications (2)

Publication Number Publication Date
CN107610720A CN107610720A (en) 2018-01-19
CN107610720B true CN107610720B (en) 2020-08-04

Family

ID=61059289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710895726.XA Active CN107610720B (en) 2017-09-28 2017-09-28 Pronunciation deviation detection method and device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN107610720B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377540B (en) * 2018-09-30 2023-12-19 网易(杭州)网络有限公司 Method and device for synthesizing facial animation, storage medium, processor and terminal
CN113571045B (en) * 2021-06-02 2024-03-12 北京它思智能科技有限公司 Method, system, equipment and medium for identifying Minnan language voice
CN113327595B (en) * 2021-06-16 2022-08-02 北京语言大学 Pronunciation deviation detection method and device and storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060212296A1 (en) * 2004-03-17 2006-09-21 Carol Espy-Wilson System and method for automatic speech recognition from phonetic features and acoustic landmarks
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models
CN105551483A (en) * 2015-12-11 2016-05-04 百度在线网络技术(北京)有限公司 Speech recognition modeling method and speech recognition modeling device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Alex Graves; Supervised Sequence Labelling with Recurrent Neural Networks; Springer; 2012-12-31; Vol. 385; pp. 52-60 *
Yanlu Xie et al.; Landmark of Mandarin nasal codas and its application in pronunciation error detection; 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2016-05-19; pp. 5370-5374 *
Yoon SY et al.; Landmark-based Automated Pronunciation Error Detection; INTERSPEECH 2010; 2010-09-30; pp. 614-617 *
Xuesong Yang et al.; Landmark-Based Pronunciation Error Identification on Chinese Learning; Speech Prosody 2016; 2016-06-03; pp. 247-251 *
Sun Wang; Research on speech recognition technology and its application in a pronunciation error recognition system; China Master's Theses Full-text Database, Information Science and Technology; 2009-06-30; I136-52 *


Similar Documents

Publication Publication Date Title
US10049657B2 (en) Using machine learning to classify phone posterior context information and estimating boundaries in speech from combined boundary posteriors
US11615785B2 (en) Speech recognition using natural language understanding related knowledge via deep feedforward neural networks
US8818813B2 (en) Methods and system for grammar fitness evaluation as speech recognition error predictor
US11210470B2 (en) Automatic text segmentation based on relevant context
CN110556093A (en) Voice marking method and system
CN105654940B (en) Speech synthesis method and device
US20180047387A1 (en) System and method for generating accurate speech transcription from natural speech audio signals
US11282511B2 (en) System and method for automatic speech analysis
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
CN107610720B (en) Pronunciation deviation detection method and device, storage medium and equipment
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
KR20160059265A (en) Method And Apparatus for Learning Acoustic Model Considering Reliability Score
CN110415725A (en) Use the method and system of first language data assessment second language pronunciation quality
Bhati et al. Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings.
Mary et al. Searching speech databases: features, techniques and evaluation measures
JPH0250198A (en) Voice recognizing system
Maqsood et al. A comparative study of classifier based mispronunciation detection system for confusing
Niu et al. A study on landmark detection based on CTC and its application to pronunciation error detection
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment
CN110992986B (en) Word syllable stress reading error detection method, device, electronic equipment and storage medium
US10783873B1 (en) Native language identification with time delay deep neural networks trained separately on native and non-native english corpora
CN117275458B (en) Speech generation method, device and equipment for intelligent customer service and storage medium
CN111681677B (en) Video object sound effect construction method, system, device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant