CN112669880B - Method and system for adaptively detecting voice ending - Google Patents

Method and system for adaptively detecting voice ending

Info

Publication number
CN112669880B
CN112669880B (application CN202011498888.8A)
Authority
CN
China
Prior art keywords
target user
threshold
delay
voice input
energy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011498888.8A
Other languages
Chinese (zh)
Other versions
CN112669880A (en)
Inventor
邹朋朋
陈现麟
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Duwo Network Technology Co., Ltd.
Original Assignee
Beijing Duwo Network Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Duwo Network Technology Co., Ltd.
Priority to CN202011498888.8A
Publication of CN112669880A
Application granted
Publication of CN112669880B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method and a system for adaptively detecting the end of voice. The method comprises the following steps: acquiring voice input by a target user; obtaining a threshold of the target user, where the threshold of the target user comprises an energy threshold of the target user; obtaining a decoding result based on the reference text and the voice input by the target user; and, based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, deciding that the voice input of the target user has ended. The invention can automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.

Description

Method and system for adaptively detecting voice ending
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and system for adaptively detecting an end of speech.
Background
At present, a large number of people in China are learning foreign languages: in the K12 sector alone about 120 million students are learning English, and there are nearly 50 million preschool children. In English learning and training, English pronunciation evaluation technology is increasingly adopted to reduce the burden on teachers and parents and to raise students' interest in learning English. Pronunciation evaluation technology evaluates a student's pronunciation quality and gives a score using a pre-trained acoustic model combined with the reference pronunciation text and a decoder. At present, during evaluation the student must manually end the recording when finished speaking in order to obtain the evaluation result. This gives a poor user experience, and in the preschool segment children sometimes fail to end the recording actively, so the evaluation runs too long and the score is affected.
Therefore, how to automatically detect whether the voice has ended and then end the recording, freeing the child's hands and improving the user experience, is a problem to be solved urgently.
Disclosure of Invention
In view of this, the invention provides a method for adaptively detecting the end of voice, which can automatically detect whether the voice has ended and then end the recording, freeing the child's hands and improving the user experience.
The invention provides a method for adaptively detecting the end of voice, which comprises the following steps:
acquiring voice input by a target user;
obtaining a threshold of the target user, wherein the threshold of the target user comprises: an energy threshold of the target user;
obtaining a decoding result based on the reference text and the voice input by the target user;
based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
Preferably, the threshold of the target user further includes a delay threshold of the target user, and when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, the method further comprises:
based on the decoding result, judging whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
Preferably, when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
Preferably, the threshold of the target user further includes a duration threshold of the target user, and when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the current audio duration exceeds the duration threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
A system for adaptively detecting the end of voice comprises:
a first acquisition module, used for acquiring the voice input by the target user;
a second acquisition module, used for acquiring the threshold of the target user, where the threshold of the target user comprises: an energy threshold of the target user;
a third acquisition module, used for acquiring a decoding result obtained based on the reference text and the voice input by the target user;
a first judging module, used for judging, based on the decoding result, whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user;
and a deciding module, used for deciding that the voice input of the target user has ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user.
Preferably, the threshold of the target user further includes the delay threshold of the target user, and the system further comprises:
a second judging module, used for judging, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user;
the deciding module being further used for deciding that the voice input of the target user has ended when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user.
Preferably, the system further comprises:
a third judging module, used for judging, based on the decoding result, whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further used for deciding that the voice input of the target user has ended when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user.
Preferably, the threshold of the target user further includes the duration threshold of the target user, and the system further comprises:
a fourth judging module, used for judging, based on the decoding result, whether the current audio duration exceeds the duration threshold of the target user when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further used for deciding that the voice input of the target user has ended when the current audio duration exceeds the duration threshold of the target user.
In summary, the present invention discloses a method for adaptively detecting the end of voice. When it is necessary to automatically detect whether the voice has ended, the voice input by a target user is first acquired together with the threshold of the target user, which comprises the energy threshold of the target user; a decoding result based on the reference text and the voice input by the target user is obtained; then, based on the decoding result, it is judged whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, the voice input of the target user is decided to have ended. The invention can automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of embodiment 1 of a method for adaptively detecting the end of voice according to the present invention;
FIG. 2 is a diagram illustrating decoding state transitions according to the present invention;
FIG. 3 is a flowchart of embodiment 2 of a method for adaptively detecting the end of voice according to the present invention;
FIG. 4 is a schematic structural diagram of embodiment 1 of a system for adaptively detecting the end of voice according to the present invention;
FIG. 5 is a schematic structural diagram of embodiment 2 of a system for adaptively detecting the end of voice according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in fig. 1, embodiment 1 of the method for adaptively detecting the end of voice according to the present invention may include the following steps:
s101, acquiring voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
S102, acquiring a threshold value of a target user, wherein the threshold value of the target user comprises: an energy threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
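As a concrete illustration of this pre-analysis, the following is a minimal sketch of deriving per-user thresholds from historical recordings. The data layout, function name and scaling factors are our assumptions for illustration, not values specified by the patent:

    import numpy as np

    def estimate_user_thresholds(recordings):
        """Derive per-user thresholds from previously analyzed recordings.

        recordings: list of dicts, one per recording, with keys
          'frame_energies': per-frame energy values,
          'pause_lengths' : lengths (seconds) of pauses between words,
          'words_per_sec' : speech speed of the recording.
        """
        energies = np.concatenate([r['frame_energies'] for r in recordings])
        pauses = np.concatenate([r['pause_lengths'] for r in recordings])
        speed = np.mean([r['words_per_sec'] for r in recordings])
        return {
            # energy ratio below which trailing audio counts as silence:
            # a low percentile of this user's frame energy, relative to the mean
            'energy': float(np.percentile(energies, 10) / (np.mean(energies) + 1e-8)),
            # wait somewhat longer than the user's typical inter-word pause
            'delay': 1.5 * float(np.mean(pauses)),
            # seconds per word, later combined with the reference text length
            'avg_word_duration': 1.0 / max(float(speed), 1e-8),
        }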
S103, obtaining a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
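For illustration, the standard Mel-scale conversion implied by this equation can be computed as follows (the function names are ours):

    import math

    def hz_to_mel(f_hz: float) -> float:
        # Mel(f) = 2595 * log10(1 + f / 700), with f in Hz
        return 2595.0 * math.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m: float) -> float:
        # inverse mapping, used when placing Mel filter banks
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # hz_to_mel(1000.0) is roughly 1000: the scale is close to linear
    # below 1 kHz and grows logarithmically above it.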
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
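To make the pruned dynamic programming concrete, here is a minimal sketch of Viterbi decoding with beam pruning over a generic state graph. The graph layout and parameter names are illustrative; a real decoder operates on the HCLG graph described above:

    def viterbi_beam(obs_loglikes, transitions, beam=10.0, max_active=3):
        """obs_loglikes: T x N list of per-frame log-likelihoods per state.
        transitions: dict mapping src state -> list of (dst state, log prob).
        beam / max_active: pruning knobs trading accuracy for speed.
        Returns the best surviving state sequence."""
        n_states = len(obs_loglikes[0])
        # active: state -> (path score, best path reaching that state)
        active = {s: (obs_loglikes[0][s], [s]) for s in range(n_states)}
        for t in range(1, len(obs_loglikes)):
            nxt = {}
            for s, (score, path) in active.items():
                for dst, logp in transitions.get(s, []):
                    cand = score + logp + obs_loglikes[t][dst]
                    # keep only the best path into each state (Viterbi step)
                    if dst not in nxt or cand > nxt[dst][0]:
                        nxt[dst] = (cand, path + [dst])
            if not nxt:
                break  # no outgoing transitions; keep the last active set
            # beam pruning: drop states far below the best score,
            # then cap the number of active states
            best = max(v[0] for v in nxt.values())
            kept = {s: v for s, v in nxt.items() if v[0] >= best - beam}
            ranked = sorted(kept.items(), key=lambda kv: -kv[1][0])
            active = dict(ranked[:max_active])
        return max(active.values(), key=lambda v: v[0])[1]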
S104, based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; if so, proceed to S105.
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
S105, deciding that the voice input of the target user has ended.
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, it is decided that the voice input has ended.
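A minimal sketch of this energy check follows, assuming per-frame energies are available as a growing list; the window size and smoothing constant are our assumptions:

    def is_voice_ended(frame_energies, energy_threshold, window=30):
        """Return True when the average energy of the last `window` frames,
        relative to the accumulated global average energy, falls below
        the target user's energy threshold."""
        if len(frame_energies) < window:
            return False
        global_avg = sum(frame_energies) / len(frame_energies)
        recent_avg = sum(frame_energies[-window:]) / window
        return recent_avg / (global_avg + 1e-8) < energy_threshold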
In summary, in the above embodiment, when it is necessary to automatically detect whether the voice has ended, the voice input by the target user is first acquired together with the threshold of the target user, which comprises the energy threshold of the target user; a decoding result based on the reference text and the voice input by the target user is obtained; then, based on the decoding result, it is judged whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, the voice input of the target user is decided to have ended. The invention can thus automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.
As shown in fig. 3, embodiment 2 of the method for adaptively detecting the end of voice according to the present invention may include the following steps:
s301, acquiring voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
S302, acquiring the threshold of the target user, where the threshold of the target user comprises: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
S303, obtaining a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
S304, based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; if not, proceed to S305; if so, proceed to S308.
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
S305, based on the decoding result, judging whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user; if not, proceed to S306; if so, proceed to S308.
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, it is judged whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user.
Because the decoding graph is built from the reference text, when the decoded text equals the reference text, which indicates that the target user has finished reading, and the non-valid pronunciation section delay reaches the delay threshold of the target user, it is decided that the user has finished speaking. Note that abnormal situations must be handled: for example, a child user unfamiliar with the reference text may re-read or repeat parts of it, and cases such as foreign words, numbers and years also arise.
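The patent does not spell out how these abnormal cases are resolved; as one illustrative approach, the text comparison can be made tolerant of re-reads by collapsing repetitions before comparing:

    def text_matches_reference(decoded: str, reference: str) -> bool:
        """Illustrative tolerant comparison: consecutive repeated words are
        collapsed, and a decoding that ends with the full reference (e.g.
        after a false start or a complete re-read) also counts as a match."""
        def collapse(words):
            out = []
            for w in words:
                if not out or out[-1] != w:
                    out.append(w)
            return out
        d = collapse(decoded.lower().split())
        r = collapse(reference.lower().split())
        return d == r or (len(d) > len(r) and d[-len(r):] == r)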
S306, based on the decoding result, judging whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user; if not, proceed to S307; if so, proceed to S308.
When the decoded text does not equal the reference text and/or the non-valid pronunciation section delay has not reached the target user's delay threshold, it may be that the target user cannot read the reference text completely; the previous strategy then fails even though the target user is no longer reading. This problem is solved effectively by judging whether the currently decoded text no longer changes: if the text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user, the voice is judged to have ended.
S307, based on the decoding result, judging whether the current audio duration exceeds the duration threshold of the target user; if so, proceed to S308.
When the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the target user's average per-word reading time can be obtained from prior statistics, and the longest duration the user should need to read can be calculated from the length of the reference text; when the recording exceeds this duration threshold, the voice is judged to have ended.
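As an illustration, this maximum-duration calculation might look as follows, with the safety margin as our assumption:

    def duration_threshold(reference_text: str, avg_word_duration: float,
                           margin: float = 1.5) -> float:
        # longest time the user should plausibly need to read the text:
        # words in the reference times the user's average per-word time,
        # padded by a safety margin
        return margin * avg_word_duration * len(reference_text.split())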
S308, deciding that the voice input of the target user has ended.
The voice input is decided to have ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; or when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the current audio duration exceeds the duration threshold of the target user.
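Putting the four branches of fig. 3 together, an end-of-voice decision routine might look as follows; the state dictionary and all field names are illustrative assumptions about how a decoder would expose this information:

    def voice_input_ended(state: dict, thresholds: dict) -> bool:
        """state fields (illustrative):
          'energy_ratio'  : recent-to-global average energy ratio
          'decoded_text'  : current partial decoding result
          'reference_text': text the user was asked to read
          'silence_delay' : seconds since the last valid pronunciation
          'text_stable'   : True if decoded_text has stopped changing
          'audio_duration': seconds recorded so far
        thresholds: per-user values with keys 'energy', 'delay', 'duration'."""
        # S304 -> S308: energy has dropped well below the global average
        if state['energy_ratio'] < thresholds['energy']:
            return True
        # S305 -> S308: reference text read completely, followed by silence
        if (state['decoded_text'] == state['reference_text']
                and state['silence_delay'] >= thresholds['delay']):
            return True
        # S306 -> S308: the user stopped even though the text is incomplete
        if state['text_stable'] and state['silence_delay'] >= thresholds['delay']:
            return True
        # S307 -> S308: hard cap from text length and the user's speech speed
        return state['audio_duration'] > thresholds['duration']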
In summary, while the user learns English with this English evaluation technology, the invention can detect the end of the user's voice at millisecond level, which brings a substantial improvement in user experience, letting the user focus on the actual English learning effect and raising learning enthusiasm.
As shown in fig. 4, embodiment 1 of the system for adaptively detecting the end of voice according to the present invention may include:
a first obtaining module 401, configured to obtain a voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
A second obtaining module 402, configured to obtain a threshold of the target user, where the threshold of the target user includes: an energy threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
A third obtaining module 403, configured to obtain a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
A first determining module 404, configured to determine, based on the decoding result, whether a ratio of the average energy of the speech input by the target user to the accumulated global average energy is less than an energy threshold of the target user;
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
A deciding module 405, configured to decide that the voice input of the target user has ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user.
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, it is decided that the voice input has ended.
In summary, in the above embodiment, when it is necessary to automatically detect whether the voice has ended, the voice input by the target user is first acquired together with the threshold of the target user, which comprises the energy threshold of the target user; a decoding result based on the reference text and the voice input by the target user is obtained; then, based on the decoding result, it is judged whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, the voice input of the target user is decided to have ended. The system can thus automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.
As shown in fig. 5, embodiment 2 of the system for adaptively detecting the end of voice according to the present invention may include:
a first obtaining module 501, configured to obtain a voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
A second obtaining module 502, configured to obtain a threshold of the target user, where the threshold of the target user includes: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
A third obtaining module 503, configured to obtain a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
A first determining module 504, configured to determine, based on the decoding result, whether a ratio of the average energy of the speech input by the target user to the accumulated global average energy is less than an energy threshold of the target user;
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
A second judging module 505, configured to judge, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user;
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, it is judged whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user.
Because the decoding graph is built from the reference text, when the decoded text equals the reference text, which indicates that the target user has finished reading, and the non-valid pronunciation section delay reaches the delay threshold of the target user, it is decided that the user has finished speaking. Note that abnormal situations must be handled: for example, a child user unfamiliar with the reference text may re-read or repeat parts of it, and cases such as foreign words, numbers and years also arise.
A third judging module 506, configured to judge, based on the decoding result, whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
When the decoded text does not equal the reference text and/or the non-valid pronunciation section delay has not reached the target user's delay threshold, it may be that the target user cannot read the reference text completely; the previous strategy then fails even though the target user is no longer reading. This problem is solved effectively by judging whether the currently decoded text no longer changes: if the text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user, the voice is judged to have ended.
A fourth judging module 507, configured to judge, based on the decoding result, whether the current audio duration exceeds the duration threshold of the target user when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
When the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the target user's average per-word reading time can be obtained from prior statistics, and the longest duration the user should need to read can be calculated from the length of the reference text; when the recording exceeds this duration threshold, the voice is judged to have ended.
A deciding module 508, configured to decide that the voice input of the target user has ended.
The voice input is decided to have ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; or when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the current audio duration exceeds the duration threshold of the target user.
In summary, while the user learns English with this English evaluation technology, the invention can detect the end of the user's voice at millisecond level, which brings a substantial improvement in user experience, letting the user focus on the actual English learning effect and raising learning enthusiasm.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts between the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant points, refer to the description of the method.
Those skilled in the art will further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for adaptively detecting the end of voice, comprising:
acquiring voice input by a target user;
obtaining a threshold of the target user, wherein the threshold of the target user comprises: an energy threshold and a delay threshold of the target user; the energy threshold is set based on the average pronunciation energy of each target user; the delay threshold is set based on the average pronunciation interval of each target user;
obtaining a decoding result based on the reference text and the voice input by the target user;
based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user;
when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, deciding that the voice input of the target user has ended;
when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, judging, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so: deciding that the voice input of the target user has ended.
2. The method of claim 1, wherein, when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
3. The method of claim 2, wherein the threshold of the target user further comprises: a duration threshold of the target user; the duration threshold is set based on the average speech speed of each target user; when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the current audio duration exceeds the duration threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
4. A system for adaptively detecting the end of voice, comprising:
a first acquisition module, used for acquiring the voice input by a target user;
a second acquisition module, configured to obtain a threshold of the target user, where the threshold of the target user comprises: an energy threshold and a delay threshold of the target user; the energy threshold is set based on the average pronunciation energy of each target user; the delay threshold is set based on the average pronunciation interval of each target user;
a third acquisition module, used for obtaining a decoding result based on the reference text and the voice input by the target user;
a first judging module, configured to judge, based on the decoding result, whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; a deciding module, configured to decide that the voice input of the target user has ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user;
the system further comprising:
a second judging module, configured to judge, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user;
the deciding module being further configured to decide that the voice input of the target user has ended when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user.
5. The system of claim 4, further comprising:
a third judging module, configured to judge, based on the decoding result, whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further configured to decide that the voice input of the target user has ended when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user.
6. The system of claim 5, wherein the threshold of the target user further comprises: a duration threshold of the target user; the duration threshold is set based on the average speech speed of each target user; the system further comprising:
a fourth judging module, configured to judge, based on the decoding result, whether the current audio duration exceeds the duration threshold of the target user when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further configured to decide that the voice input of the target user has ended when the current audio duration exceeds the duration threshold of the target user.
CN202011498888.8A 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending Active CN112669880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011498888.8A CN112669880B (en) 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011498888.8A CN112669880B (en) 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending

Publications (2)

Publication Number Publication Date
CN112669880A CN112669880A (en) 2021-04-16
CN112669880B (en) 2023-05-02

Family

ID=75405103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011498888.8A Active CN112669880B (en) 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending

Country Status (1)

Country Link
CN (1) CN112669880B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689882A (en) * 2021-08-24 2021-11-23 上海喜马拉雅科技有限公司 Pronunciation evaluation method and device, electronic equipment and readable storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2371619B1 (en) * 2009-10-08 2012-08-08 Telefónica, S.A. VOICE SEGMENT DETECTION PROCEDURE.

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3909532A (en) * 1974-03-29 1975-09-30 Bell Telephone Labor Inc Apparatus and method for determining the beginning and the end of a speech utterance
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN105070287A (en) * 2015-07-03 2015-11-18 广东小天才科技有限公司 Method and device for voice endpoint detection in self-adaptive noisy environment
CN105261362A (en) * 2015-09-07 2016-01-20 科大讯飞股份有限公司 Conversation voice monitoring method and system
CN107527630A (en) * 2017-09-22 2017-12-29 百度在线网络技术(北京)有限公司 Sound end detecting method, device and computer equipment
CN108962284A (en) * 2018-07-04 2018-12-07 科大讯飞股份有限公司 A kind of voice recording method and device
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN110556128A (en) * 2019-10-15 2019-12-10 出门问问信息科技有限公司 Voice activity detection method and device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Threshold-adaptive automatic speech segmentation system model; Zhang Junxing et al.; Computer Engineering and Design (《计算机工程与设计》); 2010-04-28 (No. 08); pp. 118-119 *

Also Published As

Publication number Publication date
CN112669880A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN109545243B (en) Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium
CN109147765B (en) Audio quality comprehensive evaluation method and system
US8457967B2 (en) Automatic evaluation of spoken fluency
KR101183344B1 (en) Automatic speech recognition learning using user corrections
Lippmann Speech recognition by machines and humans
US7885817B2 (en) Easy generation and automatic training of spoken dialog systems using text-to-speech
US5634086A (en) Method and apparatus for voice-interactive language instruction
CN108986830B (en) Audio corpus screening method and device
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
CN112270933B (en) Audio identification method and device
Tan et al. Application of Malay speech technology in Malay speech therapy assistance tools
Inoue et al. A Study of Objective Measurement of Comprehensibility through Native Speakers' Shadowing of Learners' Utterances.
CN112669880B (en) Method and system for adaptively detecting voice ending
CN112382310A (en) Human voice audio recording method and device
CN113486970B (en) Reading capability evaluation method and device
Nagano et al. Data augmentation based on vowel stretch for improving children's speech recognition
CN114694678A (en) Sound quality detection model training method, sound quality detection method, electronic device, and medium
CN113053414B (en) Pronunciation evaluation method and device
Lavechin et al. Statistical learning models of early phonetic acquisition struggle with child-centered audio data
CN112562731B (en) Spoken language pronunciation evaluation method and device, electronic equipment and storage medium
CN112489692A (en) Voice endpoint detection method and device
Sadeghian et al. Towards an automated screening tool for pediatric speech delay
Middag et al. Towards an ASR-free objective analysis of pathological speech
CN111402887A (en) Method and device for escaping characters by voice
KR102336015B1 (en) Video-based language disorder analysis system, method and program for performing the analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant