CN110349567B - Speech signal recognition method and device, storage medium and electronic device

Info

Publication number
CN110349567B
CN110349567B (application CN201910741238.2A)
Authority
CN
China
Prior art keywords
target
training data
phoneme
signal
target language
Prior art date
Legal status
Active
Application number
CN201910741238.2A
Other languages
Chinese (zh)
Other versions
CN110349567A (en)
Inventor
韦林煊
董文伟
林炳怀
张劲松
Current Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Tencent Technology Shenzhen Co Ltd
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY and Tencent Technology Shenzhen Co Ltd
Priority to CN201910741238.2A
Publication of CN110349567A
Application granted
Publication of CN110349567B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Abstract

The invention discloses a speech signal recognition method and apparatus, a storage medium, and an electronic apparatus. The method includes the following steps: acquiring, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language; acquiring, in the target application, a recognition result of the first speech signal recognized by a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output the probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language; and, if the recognition result indicates that a phoneme with a pronunciation bias exists in the first speech signal, marking, in the target application, the characters in the target text corresponding to that phoneme. The invention solves the technical problem of inaccurate pronunciation-bias detection in the related art.

Description

Speech signal recognition method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of speech, and in particular, to a method and an apparatus for recognizing a speech signal, a storage medium, and an electronic apparatus.
Background
In the prior art, applications that detect pronunciation bias replace the biased pronunciation with the corresponding phoneme from a single corpus. Because pronunciation spans widely different proficiency levels and the acoustic differences between speakers are pronounced, an acoustic model for automatic mispronunciation detection has poor robustness when sufficient pronunciation data are lacking.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a speech signal recognition method and apparatus, a storage medium, and an electronic apparatus, which at least solve the technical problem of inaccurate pronunciation-bias detection in the related art.
According to an aspect of an embodiment of the present invention, there is provided a method for recognizing a speech signal, including: acquiring a first voice signal of a first target language corresponding to a target text of the first target language in a target application; obtaining a recognition result of the first speech signal recognized by a target recognition model in the target application, wherein a target acoustic model in the target recognition model is obtained by training an initial acoustic model using first training data of the first target language and second training data of a second target language, and the target acoustic model is used for outputting a probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language; if the recognition result indicates that there is a phoneme of a pronunciation bias in the first speech signal, the target application marks a character corresponding to the phoneme of the pronunciation bias in the target text.
According to another aspect of the embodiments of the present invention, there is also provided a speech signal recognition apparatus, including: a first acquisition module, configured to acquire, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language; a second obtaining module, configured to obtain, in the target application, a recognition result of the first speech signal recognized by a target recognition model, where a target acoustic model in the target recognition model is obtained by training an initial acoustic model using first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output a probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language; and a marking module, configured to mark, in the target application, the character in the target text corresponding to the phoneme with the pronunciation bias when the recognition result indicates that the phoneme with the pronunciation bias exists in the first speech signal.
Optionally, the apparatus further comprises: a third obtaining module, configured to obtain the first training data of the first target language and the second training data of the second target language before obtaining a first voice signal of the first target language corresponding to a target text of the first target language in a target application, where the first training data includes first real training data of the first target language and first simulated training data of the first target language, and the second training data includes second real training data of the second target language and second simulated training data of the second target language; a first determining module, configured to train the initial acoustic model using the first training data of the first target language and the second training data of the second target language to obtain the target acoustic model.
Optionally, the first determining module includes: a first determining unit, configured to input a first phoneme in the first training data of the first target language into a fully connected layer in the initial acoustic model, and obtain a first probability that the first phoneme in the first training data output by the fully connected layer is a first target phoneme in the first target language; a second determining unit, configured to input a second phoneme in the second training data of the second target language into the fully connected layer, and obtain a second probability that the second phoneme in the second training data output by the fully connected layer is a second target phoneme in the second target language; a first obtaining unit, configured to obtain a first feature that is the same between the first phoneme and the second phoneme when the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold; and a third determining unit, configured to determine the initial acoustic model as the target acoustic model when a similarity between the first feature and a second feature is greater than a third threshold, the second feature being the same feature between the first target phoneme and the second target phoneme.
Optionally, the third obtaining module includes: a second acquiring unit, configured to acquire first real speech information uttered by a first object in the first target language, where the first real training data includes the first real speech information; a third obtaining unit, configured to obtain second real speech information uttered by a second object in the first target language, where a vocal tract length of the second real speech information is greater than a vocal tract length of the first real speech information; a fourth determining unit, configured to perform vocal tract conversion on the speech features of the second real speech information by using a vocal tract length normalization (VTLN) algorithm to obtain the first simulated training data, where a vocal tract length of the speech information in the first simulated training data is equal to the vocal tract length of the first real speech information; a fourth obtaining unit, configured to obtain third real speech information uttered by a third object in the second target language, where the second real training data includes the third real speech information; a fifth obtaining unit, configured to obtain fourth real speech information uttered by a fourth object in the second target language, where a vocal tract length of the fourth real speech information is greater than a vocal tract length of the third real speech information; and a fifth determining unit, configured to perform vocal tract conversion on the speech features of the fourth real speech information by using the VTLN algorithm to obtain the second simulated training data, where a vocal tract length of the speech information in the second simulated training data is equal to the vocal tract length of the third real speech information.
Optionally, the second obtaining module includes: a sixth determining unit, configured to perform feature extraction on the first speech signal to obtain frame signal feature information of the first speech signal; a seventh determining unit, configured to input the frame signal feature information into the target acoustic model and obtain a posterior probability corresponding to each frame signal in the first speech signal output by the target acoustic model, where the posterior probability is used to indicate the probability that each frame signal corresponds to a target phoneme in the first target language; and an eighth determining unit, configured to determine whether a phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme by using a pronunciation goodness (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal, so as to obtain the recognition result.
Optionally, the sixth determining unit includes: the first determining subunit is used for performing signal enhancement on the acquired current voice signal according to a preset algorithm to obtain a first enhanced voice signal; a second determining subunit, configured to perform windowing on the first enhancement signal to obtain a first windowed speech signal; a third determining subunit, configured to perform fast fourier FFT on each frame of speech signal in the first windowed speech signal to obtain a frequency domain signal corresponding to the first windowed speech signal; and the fourth determining subunit is configured to perform filtering extraction on the frequency domain signal by frame to obtain frame signal feature information of the first speech signal.
Optionally, the apparatus further comprises: an alignment module, configured to, after determining whether the phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme by using a pronunciation goodness (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal and obtaining the recognition result, align the phoneme corresponding to each frame signal in the first speech signal in the recognition result with the target phoneme.
According to yet another aspect of the embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is configured to execute the above-mentioned speech signal recognition method when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the above-mentioned speech signal recognition method through the computer program.
In the embodiment of the invention, a first speech signal of a first target language corresponding to a target text of the first target language is acquired in a target application; a recognition result of the first speech signal recognized by a target recognition model is acquired in the target application, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output the probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language; and, when the recognition result shows that a phoneme with a pronunciation bias exists in the first speech signal, the characters corresponding to that phoneme in the target text are marked in the target application. Because the first speech signal is recognized by a target acoustic model trained with training data of both the first and the second target language, the training data are more diverse, the technical effect of accurately recognizing whether the speech signal is biased is achieved, and the technical problem of inaccurate pronunciation-bias detection in the related art is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an application environment of an alternative speech signal recognition method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an alternative method of speech signal recognition according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an alternative software application for speech signal bias detection according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative training acoustic model according to an embodiment of the present invention;
FIG. 5 is a diagram of an alternative interaction framework for a user to practice pronunciation using English learning software, in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of an alternative speech signal conversion according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative speech signal hierarchy according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an alternative speech signal recognition apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a method for recognizing a speech signal is provided, and optionally, as an optional implementation, the method for recognizing a speech signal may be applied to, but is not limited to, an environment as shown in fig. 1.
In fig. 1, a target application may run in the user equipment 102, and the first voice signal may be acquired through the target application. The user equipment 102 includes a memory 104 for storing the first voice signal and a processor 106 for processing the first voice signal. The user device 102 and the server 112 may exchange data via the network 110. The server 112 includes a database 114 for storing operational data and a processing engine 116 for processing the operational data. As shown in fig. 1, a first speech signal may be obtained in a target application installed on the user device 102, where the first speech signal is uttered in the first target language. The user equipment 102 obtains a recognition result of the first voice signal recognized by a target recognition model in the target application, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output the probability that each frame signal in the first voice signal corresponds to a target phoneme in the first target language; in the case where the recognition result indicates that a phoneme with a pronunciation bias exists in the first speech signal, the user equipment 102 marks the characters corresponding to that phoneme in the target text in the target application.
Alternatively, the voice signal recognition method may be, but is not limited to being, applied to a client running on a user equipment 102 capable of computing data. The user equipment 102 may be a mobile phone, a tablet computer, a notebook computer, a PC, and the like, and the network 110 may include, but is not limited to, a wireless network or a wired network. The wireless network includes WIFI and other networks that enable wireless communication; the wired network may include, but is not limited to, wide area networks, metropolitan area networks, and local area networks. The server 112 may include, but is not limited to, any hardware device capable of performing computations.
Optionally, as an optional implementation manner, as shown in fig. 2, the method for recognizing a speech signal includes:
S202: acquiring, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language;
S204: acquiring, in the target application, a recognition result of the first speech signal recognized by a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output the probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language;
S206: if the recognition result indicates that a phoneme with a pronunciation bias exists in the first speech signal, marking, in the target application, the characters corresponding to that phoneme in the target text.
Alternatively, the above-mentioned speech signal recognition method can be applied, but not limited, to the field of speech recognition, such as in the field of pronunciation bias detection. The method can be applied to the K12 phoneme pronunciation bias detection scene in any foreign language learning.
Optionally, the method in this embodiment can be applied, but is not limited, to a PC or to mobile devices (mobile phones, tablets, vehicle-mounted systems, etc.).
Optionally, the first and second target languages include, but are not limited to, English, Chinese, and the like. For example, the speech "my name is Linda" is input in the target application.
Optionally, the target application includes, but is not limited to, an application program for speech detection, such as English learning software or Chinese learning software. Fig. 3 shows the procedure for detecting an English pronunciation error in English learning software. When practicing pronunciation with the software, the learner first reads the designated text aloud; the system background of the English learning software then detects pronunciation bias in the learner's speech and, after detection, feeds the result back to the speaker. The gray phonemes are the speaker's pronunciation errors detected by the algorithm.
Optionally, in this embodiment, the target acoustic model includes, but is not limited to, a neural network model. In a scenario of detecting which phonemes in a learner's pronunciation are biased, the performance of the target acoustic model is very important and directly influences subsequent detection performance. As a statistical model, the target acoustic model's performance depends on whether the training corpus can accurately represent the distribution of the whole standard speech. The lack of sufficient and appropriate corpora is one of the major problems in achieving high-performance bias detection. Different learners' pronunciations differ with individual speaker characteristics and language proficiency, so the lack of training data needs to be compensated by appropriate technical means. For example, the training of the target acoustic model may be informed by the following theories:
1) Second language acquisition theory: when a learner learns the pronunciation of a second language (abbreviated as L2), phonemes in L2 that are similar to phonemes of the learner's native language (abbreviated as L1) tend to be replaced by the L1 phonemes, which is one of the important sources of pronunciation bias.
2) Transfer learning theory of deep learning: different data and tasks may be inherently related; by using the hidden-layer parameters of a deep neural network to capture this relatedness, knowledge obtained from one task can be applied to solving another task.
Optionally, for example, in an English speech detection scenario, transfer learning is used to include as much data as possible that is strongly correlated with the target task, K12 English pronunciation-bias detection, so as to construct a bias detection technique with robust performance. The specific strategy is as follows:
1) a Time Delay Neural Network (TDNN) model is used as an acoustic modeling method.
2) The speech feature parameters of English-L1 adults are mapped by a Vocal Tract Length Normalization (VTLN) method to generate a simulated K12 English feature parameter library.
3) The speech feature parameters of Chinese-L1 adults are mapped by the VTLN method to generate a simulated K12 Mandarin Chinese feature parameter library.
4) Using a multi-task learning method, Chinese K12 training data (real and simulated) and English K12 training data (real and simulated) are introduced at the input layer, Chinese and English speech recognition tasks are set at the output layer respectively, and a highly robust acoustic model for the highly variable K12 Chinese/English pronunciation is obtained through an implicit transfer learning mechanism (see the sketch after this list).
5) K12 English pronunciation-bias detection can be implemented using the obtained English speech output nodes.
6) Comparing the obtained English speech output nodes with the Chinese speech output nodes yields K12 English pronunciation-bias detection with high robustness.
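Items 1) and 4) together amount to a shared TDNN trunk with one output head per language. The following is a minimal sketch of such a model, assuming PyTorch; the layer widths, dilation scheme, and phone-set sizes are illustrative assumptions rather than values from the patent (the 27-dimensional input matches the 23 FBANK + 3 pitch + 1 energy features described later):

```python
import torch
import torch.nn as nn

class MultiTaskTDNN(nn.Module):
    """Shared TDNN trunk with per-language output heads (illustrative sizes)."""
    def __init__(self, feat_dim=27, hidden=512, en_phones=42, zh_phones=60):
        super().__init__()
        # TDNN layers are 1-D convolutions over time; growing dilation gives
        # deeper layers a wider temporal context around each frame.
        self.trunk = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
        )
        # One output head per recognition task (English / Chinese phones).
        self.head_en = nn.Conv1d(hidden, en_phones, kernel_size=1)
        self.head_zh = nn.Conv1d(hidden, zh_phones, kernel_size=1)

    def forward(self, x):  # x: (batch, feat_dim, frames)
        h = self.trunk(x)
        return self.head_en(h), self.head_zh(h)  # per-frame logits per task

model = MultiTaskTDNN()
feats = torch.randn(4, 27, 200)  # 4 utterances, 200 frames each
en_logits, zh_logits = model(feats)
```

During training, English utterances would update the trunk and the English head while Chinese utterances update the trunk and the Chinese head; the shared trunk is where the implicit transfer between the two languages takes place.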
Alternatively, in order to implement a robust K12 English pronunciation-bias detection algorithm, a schematic diagram of the acoustic modeling method of this embodiment is shown in fig. 4 (training part), where first training data of English pronunciation and second training data of Chinese pronunciation serve as the corpus for training the target acoustic model. The first speech signal is detected by combining the characteristics of the two kinds of training data.
With the present embodiment, the target acoustic model is obtained by training the initial acoustic model with first training data of the first target language and second training data of the second target language, so the training corpus spans two target languages rather than a single one. This improves the accuracy of the target acoustic model's output and the robustness of the resulting pronunciation-bias detection model.
In an optional embodiment, before obtaining, in the target application, the first speech signal of the first target language corresponding to the target text of the first target language, the method further includes:
S1, acquiring first training data of a first target language and second training data of a second target language, wherein the first training data comprise first real training data of the first target language and first simulated training data of the first target language, and the second training data comprise second real training data of the second target language and second simulated training data of the second target language;
and S2, training the initial acoustic model by using the first training data of the first target language and the second training data of the second target language to obtain a target acoustic model.
Alternatively, as shown in fig. 4, in a children's English-learning scenario for example, children's English pronunciation may be used as the first real training data of the first target language, and adults' English pronunciation may be used to derive the first simulated training data of the first target language. Children's Chinese pronunciation serves as the second real training data of the second target language, and adults' Chinese pronunciation is used to derive the second simulated training data of the second target language. The target acoustic model is obtained by training on the common features of the two corpora.
With this embodiment, the target acoustic model is trained on the speech of different speakers in two target languages, which improves its robustness and the accuracy of pronunciation-bias detection, making it better suited to different speakers.
In an alternative embodiment, training the initial acoustic model using the first training data of the first target language and the second training data of the second target language to obtain the target acoustic model includes:
S1, inputting a first phoneme in the first training data of the first target language into a fully connected layer in the initial acoustic model, and obtaining a first probability that the first phoneme in the first training data output by the fully connected layer is a first target phoneme in the first target language;
Optionally, the initial acoustic model may include a plurality of fully connected layers, for example 6, and the phonemes in the first training data are input into the initial acoustic model in sequence to obtain the probability that each phoneme output by the fully connected layer is the target phoneme in the first target language.
S2, inputting a second phoneme in the second training data of the second target language into the fully connected layer to obtain a second probability that the second phoneme in the second training data output by the fully connected layer is a second target phoneme in the second target language;
Optionally, the phonemes in the second training data are input into the initial acoustic model in sequence to obtain the probability that each phoneme output by the fully connected layer is the target phoneme in the second target language.
S3, acquiring a first feature that is the same between the first phoneme and the second phoneme when the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold;
and S4, determining the initial acoustic model as the target acoustic model when the similarity between the first feature and a second feature is greater than a third threshold, where the second feature is the same feature shared by the first target phoneme and the second target phoneme.
Alternatively, in this embodiment, suppose the first target phoneme and the second target phoneme are similar (e.g., both are the pronunciation of "p"), the probability that the first phoneme is the pronunciation of "p" is 90%, and the probability that the second phoneme is the pronunciation of "p" is 85%. The first phoneme and the second phoneme are then considered to share the same first feature. When the similarity between this shared first feature and the second feature reaches the third threshold, the target acoustic model is determined to have converged after a certain number of iterations, and its detection result on the first speech signal is comparatively accurate.
Optionally, when the similarity between the first feature and the second feature is smaller than a third threshold, it indicates that the target acoustic model does not converge, and the training using the training data is continued until convergence is reached.
By this embodiment, the target acoustic model is obtained by extracting the features shared between phonemes and training on them, which improves the robustness of the target acoustic model.
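How such a check might look in code is sketched loosely below with numpy, under the assumptions that the "features" are hidden-layer activations, that the shared first feature is approximated by averaging the two phonemes' activations, and that similarity means cosine similarity; the threshold values are placeholders, not values from the patent:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_converged(p1, p2, h1, h2, target_feat,
                    thr1=0.8, thr2=0.8, thr3=0.9):
    """p1/p2: probabilities that the first/second phoneme is the
    first/second target phoneme (from the fully connected layer);
    h1/h2: hidden activations of the two input phonemes;
    target_feat: the feature shared by the two target phonemes.
    All thresholds are illustrative placeholders."""
    if p1 > thr1 and p2 > thr2:
        # "First feature": what the two input phonemes have in common,
        # approximated here as the mean of their hidden activations.
        first_feat = (h1 + h2) / 2.0
        return cosine(first_feat, target_feat) > thr3
    return False
```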
In an alternative embodiment, obtaining the first training data of the first target language and the second training data of the second target language includes:
S1, acquiring first real voice information uttered by a first object in the first target language, where the first real training data includes the first real voice information;
S2, acquiring second real voice information uttered by a second object in the first target language, where the vocal tract length of the second real voice information is greater than the vocal tract length of the first real voice information;
S3, performing vocal tract conversion on the voice features of the second real voice information by using a vocal tract length normalization (VTLN) algorithm to obtain the first simulated training data, where the vocal tract length of the voice information in the first simulated training data is equal to the vocal tract length of the first real voice information;
S4, acquiring third real voice information uttered by a third object in the second target language, where the second real training data includes the third real voice information;
S5, acquiring fourth real voice information uttered by a fourth object in the second target language, where the vocal tract length of the fourth real voice information is greater than the vocal tract length of the third real voice information;
and S6, performing vocal tract conversion on the voice features of the fourth real voice information by using the VTLN algorithm to obtain the second simulated training data, where the vocal tract length of the voice information in the second simulated training data is equal to the vocal tract length of the third real voice information.
Alternatively, in this embodiment, for example in a children's English-learning scenario, children's English pronunciation may be used as the first real training data of the first target language, and adults' English pronunciation may be used to derive the first simulated training data of the first target language. Children's Chinese pronunciation serves as the second real training data of the second target language, and adults' Chinese pronunciation is used to derive the second simulated training data of the second target language. The target acoustic model is obtained by training on the common features of the two corpora simultaneously. Since an adult's vocal tract is longer than a child's, in a scenario of children's speech detection the real adult speech needs to be converted into speech whose vocal tract length equals the child's, so as to increase the amount of training data.
With this embodiment, adult speech is converted by the VTLN algorithm, so the speech of different speakers can be used as the training corpus. This increases the amount of training data, which improves the accuracy of model training and the robustness of the target acoustic model.
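To make the vocal-tract conversion concrete, the following is a minimal numpy sketch of a linear VTLN frequency warp applied to one magnitude-spectrum frame; the warp factor alpha is an illustrative assumption (real systems typically estimate it per speaker and use a piecewise-linear warp that keeps the band edges fixed):

```python
import numpy as np

def vtln_warp(spec_frame, alpha=1.2):
    """Warp one magnitude-spectrum frame along the frequency axis.

    spec_frame: (n_bins,) magnitude spectrum of a single frame.
    alpha: warp factor; alpha > 1 reads from lower frequencies, shifting
           spectral content upward and simulating a shorter, child-like
           vocal tract.
    """
    bins = np.arange(spec_frame.shape[0])
    # Sample the original spectrum at the warped positions bin / alpha.
    return np.interp(bins / alpha, bins, spec_frame)

frame = np.abs(np.fft.rfft(np.random.randn(400)))  # fake 25 ms frame at 16 kHz
child_like = vtln_warp(frame, alpha=1.2)
```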
In an alternative embodiment, obtaining, in the target application, a recognition result of the first speech signal recognized by the target recognition model includes:
S1, performing feature extraction on the first speech signal to obtain frame signal feature information of the first speech signal;
S2, inputting the frame signal feature information into the target acoustic model to obtain a posterior probability corresponding to each frame signal in the first speech signal output by the target acoustic model, where the posterior probability is used to indicate the probability that each frame signal corresponds to a target phoneme in the first target language;
and S3, determining whether the phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme by using the pronunciation goodness (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal, to obtain the recognition result.
Optionally, in this embodiment, for example in an English speech detection scenario, as shown in fig. 5, the interaction framework for a user practicing pronunciation with English learning software is divided into a client part and a server part. The client part is used by the user to practice pronunciation against the English learning software (e.g., to input the first speech signal). After recording the audio of the first speech signal uttered by the user, the English learning software transmits the audio to the server; after detecting pronunciation errors, the server returns them to the user together with suggestions for correction. On the server side, the whole process of phoneme-level pronunciation-bias detection is performed on the user's pronunciation after the audio of the pronunciation practice is received, and the detected pronunciation-bias information is returned to the client for the user's next practice.
Optionally, the detection process at the server side includes the following steps:
S501: performing feature extraction on the first speech signal to obtain frame-level feature information of the first speech signal.
S502: inputting the frame signal feature information into the target acoustic model to obtain the posterior probability corresponding to each frame signal in the first speech signal output by the target acoustic model, where the posterior probability represents the phoneme the learner most likely intends to express in each frame of the pronunciation.
Since the target acoustic model is usually trained on native-speaker data, what the learner uttered can be judged from a native speaker's perspective. The target acoustic model adopted in this embodiment is based on an HMM-TDNN speech recognition framework, whose principle is the standard Bayesian decoding rule:
w* = argmax_w p(w|x) = argmax_w [ p(x|w) · p(w) / p(x) ]
where p(x|w) is the target-acoustic-model part, w is the pronunciation text of the first speech signal, and x is the learner's current pronunciation; the probability p(x|w) represents how well the learner's pronunciation matches the phonemes that the current text should produce.
S503: and determining whether the phoneme corresponding to each frame signal in the first voice signal has deviation from the target phoneme by using a pronunciation goodness GOP algorithm and the posterior probability corresponding to each frame signal in the first voice signal to obtain a recognition result.
According to the frame-level posterior probabilities output by the target acoustic model and the phoneme-level alignment information (which sounds the user should have uttered), the GOP algorithm combines the frame-level posteriors into phoneme-level posteriors (which sounds the user actually uttered), and judges whether each phone is mispronounced by comparing the probability of the phoneme the user should have uttered with that of the phoneme the user actually uttered. The GOP score used in this embodiment is:
GOP(p) = (1 / (te - ts + 1)) · log [ P(p | o; ts, te) / max_{q ∈ Q} P(q | o; ts, te) ]
where o denotes the observed speech signal and p a phoneme in the first speech signal; ts and te denote the frame indices of the beginning and end of the phoneme, respectively; P(p | o; ts, te) denotes the posterior probability of the phoneme over that segment; and Q denotes the phone set.
After GOP scoring, which phonemes of the current pronunciation are mispronounced is sent back to the user on the client side.
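Below is a small numpy sketch of this scoring, implementing a common frame-averaged form of the formula above; the decision threshold is an illustrative assumption that would in practice be tuned on annotated learner data:

```python
import numpy as np

def gop_score(posteriors, phone_idx, ts, te):
    """GOP for one aligned phone segment.

    posteriors: (frames, phones) frame-level posterior matrix from the
                acoustic model (each row sums to 1).
    phone_idx:  index of the phone the learner should have uttered.
    ts, te:     first and last frame of the segment (inclusive).
    """
    seg = posteriors[ts:te + 1]
    # Per-frame log ratio between the expected phone's posterior and the
    # best competing posterior over the phone set Q, averaged over frames.
    ratio = seg[:, phone_idx] / seg.max(axis=1)
    return float(np.mean(np.log(ratio + 1e-10)))

def is_mispronounced(posteriors, phone_idx, ts, te, threshold=-1.0):
    # The threshold is a placeholder; it is normally tuned on labeled data.
    return gop_score(posteriors, phone_idx, ts, te) < threshold
```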
Through this embodiment, using the flow in fig. 5, the target acoustic model is used to detect whether the first speech signal contains pronunciation bias, which improves the accuracy of detection.
In an optional embodiment, performing feature extraction on the first speech signal to obtain frame signal feature information of the first speech signal includes:
S1, performing signal enhancement on the acquired current speech signal according to a preset algorithm to obtain a first enhanced speech signal;
S2, performing a windowing operation on the first enhanced signal to obtain a first windowed speech signal;
S3, performing a fast Fourier transform (FFT) on each frame of the first windowed speech signal to obtain a frequency-domain signal corresponding to the first windowed speech signal;
and S4, filtering and extracting the frequency-domain signal by frame to obtain the frame signal feature information of the first speech signal.
Optionally, in this embodiment, performing signal enhancement on the acquired current speech signal according to a preset algorithm means applying pre-emphasis to the current speech signal. Its main purpose is to boost the high frequencies of the speech signal to a certain extent and to remove the influence of lip radiation. The formula is:
y(n)=x(n)-αx(n-1);
where y(n) is the first enhanced speech signal, x(n) is the current sample of the speech signal, x(n-1) is the previous sample, and α is a preset parameter (e.g., 0.98).
Optionally, the first enhanced signal is framed with a frame length of 25 ms and a frame shift of 10 ms, decomposing the several-seconds-long first enhanced signal into a sequence of 25 ms speech segments, and each short segment in the sequence is windowed. A Hamming window is typically used.
Alternatively, the speech signal may be transformed from the time domain to the frequency domain by performing an FFT on each small segment of speech, as shown in fig. 6.
Optionally, filtering extraction is performed on the frequency-domain signal frame by frame, i.e., Mel filtering is applied to the sequence of speech frames in the frequency domain to extract features usable by subsequent models; this is essentially a process of information compression and abstraction. Various features can be extracted at this stage, such as spectral features (Mel Frequency Cepstral Coefficients (MFCC), FBANK, PLP, etc.), frequency features (fundamental frequency, formants, etc.), time-domain features (duration), and energy features. The features used in this embodiment are 23-dimensional FBANK features plus a 3-dimensional fundamental-frequency feature and a 1-dimensional energy feature. After this module, a learner's pronunciation becomes a set of feature sequences representing that pronunciation, such as the frame-level features shown in fig. 7.
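The whole front end just described condenses into a short numpy sketch. This is a simplified illustration: the Mel filterbank construction is minimal, and the 3-dimensional pitch and 1-dimensional energy features of this embodiment are omitted:

```python
import numpy as np

def fbank_features(x, sr=16000, alpha=0.98, frame_ms=25, shift_ms=10,
                   n_mels=23):
    """Pre-emphasis -> 25 ms / 10 ms framing -> Hamming window -> FFT
    -> Mel filterbank -> log energies (a simplified FBANK front end)."""
    # 1) pre-emphasis: y(n) = x(n) - alpha * x(n-1)
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # 2) framing
    flen, fshift = sr * frame_ms // 1000, sr * shift_ms // 1000
    n_frames = 1 + (len(y) - flen) // fshift
    frames = np.stack([y[i * fshift: i * fshift + flen]
                       for i in range(n_frames)])
    # 3) windowing + 4) FFT (time domain -> frequency domain)
    spec = np.abs(np.fft.rfft(frames * np.hamming(flen), axis=1)) ** 2
    # 5) triangular Mel filterbank applied frame by frame
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((flen + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(spec @ fb.T + 1e-10)  # (frames, n_mels) log-Mel features

feats = fbank_features(np.random.randn(16000))  # one second of fake audio
```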
By this implementation, the signal features of the first speech signal are processed and the target acoustic model is used to detect whether the first speech signal contains pronunciation bias, which improves the detection accuracy.
In an optional embodiment, after determining whether the phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme by using the pronunciation goodness (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal, and obtaining the recognition result, the method further includes:
and S1, aligning the phoneme corresponding to each frame signal in the first voice signal in the recognition result with the target phoneme.
Alternatively, in this embodiment, the text in the first speech signal may be aligned at the phoneme level based on the speech recognition framework and the forced-alignment technique, so that the position of each phoneme in the speech segment, and which phoneme the user should have uttered at that position, are known.
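As a small illustration of how a frame-level alignment is collapsed into the phone-level spans (ts, te) consumed by the GOP step, assuming the aligner has already labeled every frame with a phone:

```python
def frames_to_segments(frame_labels):
    """Collapse per-frame phone labels into (phone, ts, te) spans."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i - 1))
            start = i
    return segments

# e.g. an aligner labels 8 frames of "cat" as: k k k ae ae ae t t
print(frames_to_segments(["k", "k", "k", "ae", "ae", "ae", "t", "t"]))
# -> [('k', 0, 2), ('ae', 3, 5), ('t', 6, 7)]
```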
In summary, in this embodiment, with the support of Time Delay Neural Network (TDNN) acoustic modeling with multi-task transfer learning, speech databases of several target languages are introduced comprehensively, for example four native-language speech libraries: English-native children, English-native adults, Chinese children, and Chinese adults. The method is suitable for K12 pronunciation-bias detection in any language and can effectively alleviate the problem of insufficient task-related data, thereby further improving detection performance.
In addition, compared with a conventional system that uses children's data of only one target language, this embodiment improves the error-rate metric for detecting English pronunciation phonemes of Chinese K12 children by more than 20%. It can effectively alleviate the lack of sufficient, suitable training data in a K12 children's mispronunciation detection system and improve the robustness of the resulting pronunciation-bias detection model.
The method in this embodiment can be combined with pronunciation detection software to detect correct and incorrect pronunciations in K12 children's speech more accurately, making scoring based on pronunciation quality more reliable. The phoneme biases that most need correction in a K12 child's pronunciation can be pinpointed accurately, so that children can focus their limited attention on correcting the most important phoneme biases and thus improve their spoken-language ability more efficiently and with more confidence.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiment of the present invention, there is also provided a speech signal recognition apparatus for implementing the above speech signal recognition method. As shown in fig. 8, the apparatus includes:
a first obtaining module 82, configured to obtain, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language;
a second obtaining module 84, configured to obtain, in the target application, a recognition result of the first speech signal recognized by a target recognition model, where a target acoustic model in the target recognition model is a model obtained by training an initial acoustic model using first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output a probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language;
a marking module 86, configured to mark, in the target application, a character corresponding to a phoneme of the pronunciation bias in the target text if the recognition result indicates that the phoneme of the pronunciation bias exists in the first speech signal.
Optionally, the apparatus further comprises:
a third obtaining module, configured to obtain the first training data of the first target language and the second training data of the second target language before obtaining a first voice signal of the first target language corresponding to a target text of the first target language in a target application, where the first training data includes first real training data of the first target language and first simulated training data of the first target language, and the second training data includes second real training data of the second target language and second simulated training data of the second target language;
a first determining module, configured to train the initial acoustic model using the first training data of the first target language and the second training data of the second target language to obtain the target acoustic model.
Optionally, the first determining module includes:
a first determining unit, configured to input a first phoneme in the first training data of the first target language into a fully connected layer in the initial acoustic model, and obtain a first probability that the first phoneme in the first training data output by the fully connected layer is a first target phoneme in the first target language;
a second determining unit, configured to input a second phoneme in the second training data of the second target language into the fully connected layer, and obtain a second probability that the second phoneme in the second training data output by the fully connected layer is a second target phoneme in the second target language;
a first obtaining unit configured to obtain a first feature that is the same between the first phoneme and the second phoneme when the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold;
and a third determining unit configured to determine the initial acoustic model as the target acoustic model when a similarity between the first feature and a second feature is greater than a third threshold, wherein the second feature is the same feature between the first target phoneme and the second target phoneme.
Optionally, the third obtaining module includes:
a second acquiring unit, configured to acquire first real speech information that is uttered by a first object in the first target language, where the first real training data includes the first real speech information;
a third obtaining unit, configured to obtain second real voice information that is uttered by a second object in the first target language, where a vocal tract length of the second real voice information is greater than a vocal tract length of the first real voice information;
a fourth determining unit, configured to perform vocal tract conversion on the voice features of the second real voice information by using a vocal tract length normalization (VTLN) algorithm to obtain the first simulated training data, where a vocal tract length of the voice information in the first simulated training data is equal to the vocal tract length of the first real voice information;
a fourth obtaining unit, configured to obtain third real speech information that is uttered by a third object in the second target language, where the second real training data includes the third real speech information;
a fifth acquiring unit, configured to acquire fourth real voice information that is uttered by a fourth object in the second target language, where a vocal tract length of the fourth real voice information is greater than a vocal tract length of the third real voice information;
a fifth determining unit, configured to perform vocal tract conversion on the voice features of the fourth real voice information by using the VTLN algorithm to obtain the second simulated training data, where a vocal tract length of the voice information in the second simulated training data is equal to the vocal tract length of the third real voice information.
Optionally, the second obtaining module includes:
a sixth determining unit, configured to perform feature extraction on the first speech signal to obtain frame signal feature information of the first speech signal;
a seventh determining unit, configured to input the frame signal feature information into the target acoustic model and obtain a posterior probability corresponding to each frame signal in the first speech signal output by the target acoustic model, where the posterior probability is used to indicate the probability that each frame signal corresponds to a target phoneme in the first target language;
an eighth determining unit, configured to determine whether the phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme by using a pronunciation goodness (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal, so as to obtain the recognition result.
Optionally, the sixth determining unit includes:
the first determining subunit is used for performing signal enhancement on the acquired current voice signal according to a preset algorithm to obtain a first enhanced voice signal;
a second determining subunit, configured to perform windowing on the first enhancement signal to obtain a first windowed speech signal;
a third determining subunit, configured to perform fast fourier FFT on each frame of speech signal in the first windowed speech signal to obtain a frequency domain signal corresponding to the first windowed speech signal;
and the fourth determining subunit is configured to perform filtering extraction on the frequency domain signal by frame to obtain frame signal feature information of the first speech signal.
Optionally, the apparatus further comprises:
and an alignment module, configured to, after determining whether the phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme by using a pronunciation goodness (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal and obtaining the recognition result, align the phoneme corresponding to each frame signal in the first speech signal in the recognition result with the target phoneme.
According to a further aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the method for recognizing a speech signal, as shown in fig. 9, the electronic device includes a memory 902 and a processor 904, the memory 902 stores a computer program, and the processor 904 is configured to execute the steps in any one of the method embodiments through the computer program.
Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:
S1: acquiring, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language;
S2: acquiring, in the target application, a recognition result of recognizing the first speech signal by a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model using first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output the probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language;
S3: if the recognition result indicates that there is a phoneme of pronunciation bias in the first speech signal, marking, in the target application, a character corresponding to the phoneme of the pronunciation bias in the target text.
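Tying the pieces together, steps S1 to S3 could look like the sketch below, which reuses the extract_features and deviates functions from the earlier sketches; the model callable and the lexicon helper that maps each character of the target text to its phoneme index and frame span are hypothetical stand-ins, not interfaces defined by this embodiment:

    def check_pronunciation(audio, target_text, model, lexicon,
                            threshold=-3.0):
        """Sketch of S1-S3: features -> posteriors -> GOP -> marked text."""
        feats = extract_features(audio)        # S1: frame features
        posteriors = model(feats)              # S2: per-frame posteriors
        marked = []                            # characters to highlight
        for char, phoneme_idx, frames in lexicon.segment(target_text,
                                                         posteriors):
            if deviates(posteriors[frames], phoneme_idx, threshold):
                marked.append(char)            # S3: mark biased characters
        return marked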
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 9 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, or a Mobile Internet Device (MID). Fig. 9 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces) than shown in fig. 9, or have a different configuration from that shown in fig. 9.
The memory 902 may be used to store software programs and modules, such as program instructions/modules corresponding to the speech signal recognition method and apparatus in the embodiments of the present invention. The processor 904 executes various functional applications and data processing by running the software programs and modules stored in the memory 902, that is, implements the speech signal recognition method. The memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memories, or other non-volatile solid-state memories. In some examples, the memory 902 may further include memories located remotely from the processor 904, and these remote memories may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 902 may be used to store information such as the first speech signal, but is not limited thereto. As an example, as shown in fig. 9, the memory 902 may include, but is not limited to, the first acquisition module 82, the second acquisition module 84, and the marking module 86 of the above speech signal recognition apparatus. In addition, it may further include, but is not limited to, other module units of the above speech signal recognition apparatus, which are not described in detail in this example.
Optionally, the transmission device 906 is used to receive or send data via a network. Examples of the network may include wired networks and wireless networks. In one example, the transmission device 906 includes a network interface controller (NIC), which can be connected to other network devices and a router via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 906 is a radio frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In addition, the electronic device further includes: a display 908 for displaying the recognition result; and a connection bus 910 for connecting the respective module parts in the above-described electronic apparatus.
According to a further aspect of embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: acquiring, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language;
S2: acquiring, in the target application, a recognition result of recognizing the first speech signal by a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model using first training data of the first target language and second training data of a second target language, and the target acoustic model is used to output the probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language;
S3: if the recognition result indicates that there is a phoneme of pronunciation bias in the first speech signal, marking, in the target application, a character corresponding to the phoneme of the pronunciation bias in the target text.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be implemented in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for recognizing a speech signal, comprising:
acquiring, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language;
obtaining, in the target application, a recognition result of recognizing the first speech signal by a target recognition model, wherein a target acoustic model in the target recognition model is a model obtained by training an initial acoustic model using first training data of the first target language and second training data of a second target language, the target acoustic model is used for outputting a probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language, the first training data comprises first real training data and first simulated training data, the first real training data comprises first real speech information, the first real speech information is speech information uttered by a first object in the first target language, the first simulated training data is data obtained by performing vocal tract conversion on speech features of second real speech information, the vocal tract length of the speech information in the first simulated training data is equal to the vocal tract length of the first real speech information, the second real speech information is speech information uttered by a second object in the first target language, and the vocal tract length of the second real speech information is greater than the vocal tract length of the first real speech information;
if the recognition result indicates that there is a phoneme of a pronunciation bias in the first speech signal, marking, in the target application, a character corresponding to the phoneme of the pronunciation bias in the target text.
2. The method of claim 1, wherein before the obtaining, in the target application, the first speech signal of the first target language corresponding to the target text of the first target language, the method further comprises:
acquiring the first training data of the first target language and the second training data of the second target language, wherein the second training data comprises second real training data of the second target language and second simulated training data of the second target language;
and training the initial acoustic model by using the first training data of the first target language and the second training data of the second target language to obtain the target acoustic model.
3. The method of claim 2, wherein training the initial acoustic model using the first training data for the first target language and the second training data for the second target language to obtain the target acoustic model comprises:
inputting a first phoneme in the first training data of the first target language into a fully connected layer in the initial acoustic model, and obtaining a first probability, output by the fully connected layer, that the first phoneme in the first training data is a first target phoneme in the first target language;
inputting a second phoneme in the second training data of the second target language into the fully connected layer, and obtaining a second probability, output by the fully connected layer, that the second phoneme in the second training data is a second target phoneme in the second target language;
acquiring a first feature that is the same between the first phoneme and the second phoneme when the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold;
and determining the initial acoustic model as the target acoustic model when the similarity between the first feature and a second feature is greater than a third threshold, wherein the second feature is a feature that is the same between the first target phoneme and the second target phoneme.
4. The method of claim 2, wherein the obtaining the first training data of the first target language and the second training data of the second target language comprises:
acquiring the first real speech information uttered by the first object in the first target language;
acquiring the second real speech information uttered by the second object in the first target language;
performing vocal tract conversion on the speech features of the second real speech information by using a vocal tract length normalization (VTLN) algorithm to obtain the first simulated training data;
acquiring third real speech information uttered by a third object in the second target language, wherein the second real training data comprises the third real speech information;
acquiring fourth real speech information uttered by a fourth object in the second target language, wherein the vocal tract length of the fourth real speech information is greater than the vocal tract length of the third real speech information;
and performing vocal tract conversion on the speech features of the fourth real speech information by using the VTLN algorithm to obtain the second simulated training data, wherein the vocal tract length of the speech information in the second simulated training data is equal to the vocal tract length of the third real speech information.
5. The method of claim 1, wherein obtaining, in the target application, a recognition result of the first speech signal recognized by the target recognition model comprises:
performing feature extraction on the first voice signal to obtain frame signal feature information of the first voice signal;
inputting the frame signal feature information into the target acoustic model to obtain a posterior probability corresponding to each frame signal in the first speech signal output by the target acoustic model, wherein the posterior probability represents the probability that each frame signal corresponds to a target phoneme in the first target language;
and determining whether the phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme by using a goodness of pronunciation (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal, to obtain the recognition result.
6. The method of claim 5, wherein performing feature extraction on the first speech signal to obtain frame signal feature information of the first speech signal comprises:
performing signal enhancement on the acquired current speech signal according to a preset algorithm to obtain a first enhanced speech signal;
performing a windowing operation on the first enhanced speech signal to obtain a first windowed speech signal;
performing a fast Fourier transform (FFT) on each frame of the first windowed speech signal to obtain a frequency domain signal corresponding to the first windowed speech signal;
and filtering the frequency domain signal frame by frame to extract the frame signal feature information of the first speech signal.
7. The method according to claim 5, wherein after the goodness of pronunciation (GOP) algorithm and the posterior probability corresponding to each frame signal in the first speech signal are used to determine whether the phoneme corresponding to each frame signal in the first speech signal deviates from the target phoneme to obtain the recognition result, the method further comprises:
and aligning the phoneme corresponding to each frame signal in the first speech signal in the recognition result with the target phoneme.
8. An apparatus for recognizing a speech signal, comprising:
a first acquisition module, configured to acquire, in a target application, a first speech signal of a first target language corresponding to a target text of the first target language;
a second acquisition module, configured to acquire, in the target application, a recognition result of recognizing the first speech signal by a target recognition model, wherein a target acoustic model in the target recognition model is a model obtained by training an initial acoustic model using first training data of the first target language and second training data of a second target language, the target acoustic model is used for outputting a probability that each frame signal in the first speech signal corresponds to a target phoneme in the first target language, the first training data includes first real training data and first simulated training data, the first real training data includes first real speech information, the first real speech information is speech information uttered by a first object in the first target language, the first simulated training data is data obtained by performing vocal tract conversion on speech features of second real speech information, the vocal tract length of the speech information in the first simulated training data is equal to the vocal tract length of the first real speech information, the second real speech information is speech information uttered by a second object in the first target language, and the vocal tract length of the second real speech information is greater than the vocal tract length of the first real speech information;
a marking module, configured to mark, in the target application, a character corresponding to a phoneme of pronunciation bias in the target text when the recognition result indicates that the phoneme of pronunciation bias exists in the first speech signal.
9. A computer-readable storage medium comprising a stored program, wherein the program when executed performs the method of any of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
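As background to the VTLN step recited in claims 1 and 4, the following sketch shows one simple way to warp the frequency axis of a power-spectrum frame by a factor alpha; linear warping with interpolation, and the meaning assigned to alpha here, are illustrative assumptions rather than the claimed algorithm:

    import numpy as np

    def vtln_warp_frame(power_frame: np.ndarray, alpha: float) -> np.ndarray:
        """Warp the frequency axis of one power-spectrum frame by alpha.

        With this convention, alpha < 1 moves spectral content (formants)
        toward higher frequencies, simulating a shorter vocal tract, e.g.
        converting adult speech toward child-like speech; alpha > 1 does
        the opposite. Piecewise-linear or bilinear warps are common
        refinements over this plain linear warp.
        """
        n_bins = len(power_frame)
        source = np.arange(n_bins) * alpha  # frequency each output bin reads
        return np.interp(source, np.arange(n_bins), power_frame)

A data pipeline in the spirit of claim 4 would apply such a warp to the spectra of the second (or fourth) real speech information before feature extraction, yielding the simulated training data whose effective vocal tract length matches that of the first (or third) real speech information.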

Priority Applications (1)

Application Number: CN201910741238.2A
Priority Date: 2019-08-12
Filing Date: 2019-08-12
Title: Speech signal recognition method and device, storage medium and electronic device (granted as CN110349567B)

Publications (2)

Publication Number Publication Date
CN110349567A CN110349567A (en) 2019-10-18
CN110349567B (en) 2022-09-13

Family

ID=68184687

Family Applications (1)

Application Number: CN201910741238.2A (granted as CN110349567B, status: Active)
Title: Speech signal recognition method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN110349567B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312219B (en) * 2020-01-16 2023-11-28 上海携程国际旅行社有限公司 Telephone recording labeling method, system, storage medium and electronic equipment
CN111724769A (en) * 2020-04-22 2020-09-29 深圳市伟文无线通讯技术有限公司 Production method of intelligent household voice recognition model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101785048A (en) * 2007-08-20 2010-07-21 Microsoft Corporation HMM-based bilingual (Mandarin-English) TTS techniques
CN104217713A (en) * 2014-07-15 2014-12-17 Northwest Normal University Tibetan-Chinese speech synthesis method and device
CN106782603A (en) * 2016-12-22 2017-05-31 Shanghai Yuzhiyi Information Technology Co., Ltd. Intelligent speech evaluation method and system
CN107731228A (en) * 2017-09-20 2018-02-23 Baidu Online Network Technology (Beijing) Co., Ltd. Text conversion method and device for English voice information
CN109036464A (en) * 2018-09-17 2018-12-18 Tencent Technology (Shenzhen) Co., Ltd. Pronunciation error detection method, device, equipment and storage medium
CN109545244A (en) * 2019-01-29 2019-03-29 Beijing Orion Star Technology Co., Ltd. Speech evaluation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110349567A (en) 2019-10-18

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant