CN112669880B - Method and system for adaptively detecting voice ending - Google Patents

Method and system for adaptively detecting voice ending

Info

Publication number
CN112669880B
CN112669880B (application CN202011498888.8A)
Authority
CN
China
Prior art keywords
target user
threshold
delay
voice input
energy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011498888.8A
Other languages
Chinese (zh)
Other versions
CN112669880A (en)
Inventor
邹朋朋
陈现麟
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Duwo Network Technology Co., Ltd.
Original Assignee
Beijing Duwo Network Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Duwo Network Technology Co., Ltd.
Priority to CN202011498888.8A
Publication of CN112669880A
Application granted
Publication of CN112669880B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method and a system for adaptively detecting the end of voice. The method comprises the following steps: acquiring voice input by a target user; obtaining a threshold of the target user, where the threshold of the target user comprises an energy threshold of the target user; obtaining a decoding result based on the reference text and the voice input by the target user; and, based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, deciding that the voice input of the target user has ended. The invention can automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.

Description

Method and system for adaptively detecting voice ending
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and system for adaptively detecting an end of speech.
Background
At present, a large number of people in China are learning foreign languages: in the K12 sector alone about 120 million students are learning English, and there are nearly 50 million preschool children. In English learning and training, English pronunciation evaluation technology is increasingly adopted to reduce the burden on teachers and parents and to raise students' interest in learning English. Pronunciation evaluation technology evaluates a student's pronunciation quality and gives a score using a pre-trained acoustic model combined with the reference pronunciation text and a decoder. At present, during evaluation the student must manually end the recording when finished speaking in order to obtain the evaluation result. This gives a poor user experience, and in the preschool segment children sometimes fail to end the recording actively, so the evaluation runs too long and the score is affected.
Therefore, how to automatically detect whether the voice has ended and then end the recording, freeing the child's hands and improving the user experience, is a problem to be solved urgently.
Disclosure of Invention
In view of this, the invention provides a method for adaptively detecting the end of voice, which can automatically detect whether the voice has ended and then end the recording, freeing the child's hands and improving the user experience.
The invention provides a method for adaptively detecting the end of voice, which comprises the following steps:
acquiring voice input by a target user;
obtaining a threshold of the target user, wherein the threshold of the target user comprises: an energy threshold of the target user;
obtaining a decoding result based on the reference text and the voice input by the target user;
based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
Preferably, the threshold of the target user further includes a delay threshold of the target user, and when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, the method further comprises:
based on the decoding result, judging whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
Preferably, when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
Preferably, the threshold of the target user further includes a duration threshold of the target user, and when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the current audio duration exceeds the duration threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
A system for adaptively detecting the end of voice comprises:
a first acquisition module, used for acquiring the voice input by the target user;
a second acquisition module, used for acquiring the threshold of the target user, where the threshold of the target user comprises: an energy threshold of the target user;
a third acquisition module, used for acquiring a decoding result obtained based on the reference text and the voice input by the target user;
a first judging module, used for judging, based on the decoding result, whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user;
and a deciding module, used for deciding that the voice input of the target user has ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user.
Preferably, the threshold of the target user further includes the delay threshold of the target user, and the system further comprises:
a second judging module, used for judging, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user;
the deciding module being further used for deciding that the voice input of the target user has ended when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user.
Preferably, the system further comprises:
a third judging module, used for judging, based on the decoding result, whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further used for deciding that the voice input of the target user has ended when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user.
Preferably, the threshold of the target user further includes the duration threshold of the target user, and the system further comprises:
a fourth judging module, used for judging, based on the decoding result, whether the current audio duration exceeds the duration threshold of the target user when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further used for deciding that the voice input of the target user has ended when the current audio duration exceeds the duration threshold of the target user.
In summary, the present invention discloses a method for adaptively detecting the end of voice. When it is necessary to automatically detect whether the voice has ended, the voice input by a target user is first acquired together with the threshold of the target user, which comprises the energy threshold of the target user; a decoding result based on the reference text and the voice input by the target user is obtained; then, based on the decoding result, it is judged whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, the voice input of the target user is decided to have ended. The invention can automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of embodiment 1 of a method for adaptively detecting the end of voice according to the present invention;
FIG. 2 is a diagram illustrating decoding state transitions according to the present invention;
FIG. 3 is a flowchart of embodiment 2 of a method for adaptively detecting the end of voice according to the present invention;
FIG. 4 is a schematic structural diagram of embodiment 1 of a system for adaptively detecting the end of voice according to the present invention;
FIG. 5 is a schematic structural diagram of embodiment 2 of a system for adaptively detecting the end of voice according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in fig. 1, embodiment 1 of the method for adaptively detecting the end of voice according to the present invention may include the following steps:
s101, acquiring voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
S102, acquiring a threshold value of a target user, wherein the threshold value of the target user comprises: an energy threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
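As a concrete illustration of this pre-analysis, the following is a minimal sketch of deriving per-user thresholds from historical recordings. The data layout, function name and scaling factors are our assumptions for illustration, not values specified by the patent:

    import numpy as np

    def estimate_user_thresholds(recordings):
        """Derive per-user thresholds from previously analyzed recordings.

        recordings: list of dicts, one per recording, with keys
          'frame_energies': per-frame energy values,
          'pause_lengths' : lengths (seconds) of pauses between words,
          'words_per_sec' : speech speed of the recording.
        """
        energies = np.concatenate([r['frame_energies'] for r in recordings])
        pauses = np.concatenate([r['pause_lengths'] for r in recordings])
        speed = np.mean([r['words_per_sec'] for r in recordings])
        return {
            # energy ratio below which trailing audio counts as silence:
            # a low percentile of this user's frame energy, relative to the mean
            'energy': float(np.percentile(energies, 10) / (np.mean(energies) + 1e-8)),
            # wait somewhat longer than the user's typical inter-word pause
            'delay': 1.5 * float(np.mean(pauses)),
            # seconds per word, later combined with the reference text length
            'avg_word_duration': 1.0 / max(float(speed), 1e-8),
        }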
S103, obtaining a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
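For illustration, the standard Mel-scale conversion implied by this equation can be computed as follows (the function names are ours):

    import math

    def hz_to_mel(f_hz: float) -> float:
        # Mel(f) = 2595 * log10(1 + f / 700), with f in Hz
        return 2595.0 * math.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m: float) -> float:
        # inverse mapping, used when placing Mel filter banks
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # hz_to_mel(1000.0) is roughly 1000: the scale is close to linear
    # below 1 kHz and grows logarithmically above it.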
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
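To make the pruned dynamic programming concrete, here is a minimal sketch of Viterbi decoding with beam pruning over a generic state graph. The graph layout and parameter names are illustrative; a real decoder operates on the HCLG graph described above:

    def viterbi_beam(obs_loglikes, transitions, beam=10.0, max_active=3):
        """obs_loglikes: T x N list of per-frame log-likelihoods per state.
        transitions: dict mapping src state -> list of (dst state, log prob).
        beam / max_active: pruning knobs trading accuracy for speed.
        Returns the best surviving state sequence."""
        n_states = len(obs_loglikes[0])
        # active: state -> (path score, best path reaching that state)
        active = {s: (obs_loglikes[0][s], [s]) for s in range(n_states)}
        for t in range(1, len(obs_loglikes)):
            nxt = {}
            for s, (score, path) in active.items():
                for dst, logp in transitions.get(s, []):
                    cand = score + logp + obs_loglikes[t][dst]
                    # keep only the best path into each state (Viterbi step)
                    if dst not in nxt or cand > nxt[dst][0]:
                        nxt[dst] = (cand, path + [dst])
            if not nxt:
                break  # no outgoing transitions; keep the last active set
            # beam pruning: drop states far below the best score,
            # then cap the number of active states
            best = max(v[0] for v in nxt.values())
            kept = {s: v for s, v in nxt.items() if v[0] >= best - beam}
            ranked = sorted(kept.items(), key=lambda kv: -kv[1][0])
            active = dict(ranked[:max_active])
        return max(active.values(), key=lambda v: v[0])[1]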
S104, based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; if so, proceed to S105.
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
S105, deciding that the voice input of the target user has ended.
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, it is decided that the voice input has ended.
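A minimal sketch of this energy check follows, assuming per-frame energies are available as a growing list; the window size and smoothing constant are our assumptions:

    def is_voice_ended(frame_energies, energy_threshold, window=30):
        """Return True when the average energy of the last `window` frames,
        relative to the accumulated global average energy, falls below
        the target user's energy threshold."""
        if len(frame_energies) < window:
            return False
        global_avg = sum(frame_energies) / len(frame_energies)
        recent_avg = sum(frame_energies[-window:]) / window
        return recent_avg / (global_avg + 1e-8) < energy_threshold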
In summary, in the above embodiment, when it is necessary to automatically detect whether the voice has ended, the voice input by the target user is first acquired together with the threshold of the target user, which comprises the energy threshold of the target user; a decoding result based on the reference text and the voice input by the target user is obtained; then, based on the decoding result, it is judged whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, the voice input of the target user is decided to have ended. The invention can thus automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.
As shown in fig. 3, embodiment 2 of the method for adaptively detecting the end of voice according to the present invention may include the following steps:
s301, acquiring voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
S302, acquiring the threshold of the target user, where the threshold of the target user comprises: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
S303, obtaining a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
S304, based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; if not, proceed to S305; if so, proceed to S308.
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
S305, based on the decoding result, judging whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user; if not, proceed to S306; if so, proceed to S308.
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, it is judged whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user.
Because the decoding graph is built from the reference text, when the decoded text equals the reference text, which indicates that the target user has finished reading, and the non-valid pronunciation section delay reaches the delay threshold of the target user, it is decided that the user has finished speaking. Note that abnormal situations must be handled: for example, a child user unfamiliar with the reference text may re-read or repeat parts of it, and cases such as foreign words, numbers and years also arise.
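The patent does not spell out how these abnormal cases are resolved; as one illustrative approach, the text comparison can be made tolerant of re-reads by collapsing repetitions before comparing:

    def text_matches_reference(decoded: str, reference: str) -> bool:
        """Illustrative tolerant comparison: consecutive repeated words are
        collapsed, and a decoding that ends with the full reference (e.g.
        after a false start or a complete re-read) also counts as a match."""
        def collapse(words):
            out = []
            for w in words:
                if not out or out[-1] != w:
                    out.append(w)
            return out
        d = collapse(decoded.lower().split())
        r = collapse(reference.lower().split())
        return d == r or (len(d) > len(r) and d[-len(r):] == r)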
S306, based on the decoding result, judging whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user; if not, proceed to S307; if so, proceed to S308.
When the decoded text does not equal the reference text and/or the non-valid pronunciation section delay has not reached the target user's delay threshold, it may be that the target user cannot read the reference text completely; the previous strategy then fails even though the target user is no longer reading. This problem is solved effectively by judging whether the currently decoded text no longer changes: if the text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user, the voice is judged to have ended.
S307, based on the decoding result, judging whether the current audio duration exceeds the duration threshold of the target user; if so, proceed to S308.
When the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the target user's average per-word reading time can be obtained from prior statistics, and the longest duration the user should need to read can be calculated from the length of the reference text; when the recording exceeds this duration threshold, the voice is judged to have ended.
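As an illustration, this maximum-duration calculation might look as follows, with the safety margin as our assumption:

    def duration_threshold(reference_text: str, avg_word_duration: float,
                           margin: float = 1.5) -> float:
        # longest time the user should plausibly need to read the text:
        # words in the reference times the user's average per-word time,
        # padded by a safety margin
        return margin * avg_word_duration * len(reference_text.split())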
S308, deciding that the voice input of the target user has ended.
The voice input is decided to have ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; or when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the current audio duration exceeds the duration threshold of the target user.
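Putting the four branches of fig. 3 together, an end-of-voice decision routine might look as follows; the state dictionary and all field names are illustrative assumptions about how a decoder would expose this information:

    def voice_input_ended(state: dict, thresholds: dict) -> bool:
        """state fields (illustrative):
          'energy_ratio'  : recent-to-global average energy ratio
          'decoded_text'  : current partial decoding result
          'reference_text': text the user was asked to read
          'silence_delay' : seconds since the last valid pronunciation
          'text_stable'   : True if decoded_text has stopped changing
          'audio_duration': seconds recorded so far
        thresholds: per-user values with keys 'energy', 'delay', 'duration'."""
        # S304 -> S308: energy has dropped well below the global average
        if state['energy_ratio'] < thresholds['energy']:
            return True
        # S305 -> S308: reference text read completely, followed by silence
        if (state['decoded_text'] == state['reference_text']
                and state['silence_delay'] >= thresholds['delay']):
            return True
        # S306 -> S308: the user stopped even though the text is incomplete
        if state['text_stable'] and state['silence_delay'] >= thresholds['delay']:
            return True
        # S307 -> S308: hard cap from text length and the user's speech speed
        return state['audio_duration'] > thresholds['duration']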
In summary, while the user learns English with this English evaluation technology, the invention can detect the end of the user's voice at millisecond level, which brings a substantial improvement in user experience, letting the user focus on the actual English learning effect and raising learning enthusiasm.
As shown in fig. 4, embodiment 1 of the system for adaptively detecting the end of voice according to the present invention may include:
a first obtaining module 401, configured to obtain a voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
A second obtaining module 402, configured to obtain a threshold of the target user, where the threshold of the target user includes: an energy threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
A third obtaining module 403, configured to obtain a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
A first determining module 404, configured to determine, based on the decoding result, whether a ratio of the average energy of the speech input by the target user to the accumulated global average energy is less than an energy threshold of the target user;
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
A deciding module 405, configured to decide that the voice input of the target user has ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user.
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, it is decided that the voice input has ended.
In summary, in the above embodiment, when it is necessary to automatically detect whether the voice has ended, the voice input by the target user is first acquired together with the threshold of the target user, which comprises the energy threshold of the target user; a decoding result based on the reference text and the voice input by the target user is obtained; then, based on the decoding result, it is judged whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, and if so, the voice input of the target user is decided to have ended. The system can thus automatically detect whether the voice has ended and then end the recording; compared with the prior art, it frees the child's hands and improves the user experience.
As shown in fig. 5, embodiment 2 of the system for adaptively detecting the end of voice according to the present invention may include:
a first obtaining module 501, configured to obtain a voice input by a target user;
When it is necessary to adaptively detect whether the voice has ended, the voice input by the target user is first acquired, i.e., the voice produced during voice evaluation by the user for whom end-of-voice detection is needed.
A second obtaining module 502, configured to obtain a threshold of the target user, where the threshold of the target user includes: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user;
At the same time, the threshold corresponding to the target user is acquired. The acquired threshold of the target user comprises: an energy threshold of the target user, a delay threshold of the target user, and a duration threshold of the target user.
Specifically, when obtaining the threshold of the target user, in order to achieve adaptive dynamic adjustment of the threshold, the recordings of all target users (such as child users) are analyzed in advance to obtain each target user's average pronunciation interval, pronunciation energy and speech speed, from which each target user's thresholds are derived.
A third obtaining module 503, configured to obtain a decoding result based on the reference text and the voice input by the target user;
Meanwhile, a decoding result obtained based on the reference text and the voice input by the target user is acquired.
The reference text is the English text that the target user is required to read. Specifically, during decoding, the reference text is compiled into a decoding graph, which is combined with an acoustic model to obtain the decoding result.
The decoding graph is the graph used for alignment and decoding, obtained by performing the HCLG composition with a pre-trained acoustic model and a pronunciation dictionary. The acoustic model computes the posterior probability that the acoustic features belong to each phoneme, and is trained on more than 100 hours of well-pronounced speech.
The acoustic model training process is as follows: the audio is first split into frames and features are extracted, with a frame length of 25 ms and a frame shift of 10 ms; the features are 40-dimensional Mel-frequency cepstral coefficient (MFCC) features. After feature extraction, the audio transcript is expanded into phonemes according to the dictionary, the frames are divided evenly over time and labeled with phoneme labels. Once the features correspond to the labels, a time-delay neural network (TDNN) is trained to obtain an initial model; Viterbi forced alignment with the initial model then re-aligns the audio, new phoneme labels are obtained for each recording, and a new model is trained. Training stops after a set number of iterations, yielding the final model. MFCC features are cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear characteristics of human auditory perception of frequency; the relationship between the Mel scale and frequency can be approximated by the following equation:
Mel(f) = 2595 * log10(1 + f/700)
where f is frequency in Hz.
Decoding means that, given the MFCC features of the input audio, the acoustic model outputs likelihoods, these are combined with the decoding graph, and the Viterbi algorithm selects the optimal path. The Viterbi algorithm is essentially a dynamic programming algorithm and obtains a globally optimal solution.
As shown in fig. 2, if there is a final optimal path from the start point to the end point, then every sub-path of this path is also the optimal path from the start point to the corresponding time point. In the figure, the dotted line is the optimal path from the start point to the end point, so the dotted line from the start point to time 4 is also the optimal path for that period. In other words, at any time only the optimal path to each state at that time needs to be recorded. Taking time 4 as an example, only the optimal paths to the three states S1, S2 and S3 at time 4 need to be recorded, i.e. only three paths. At time 5, two paths reach state S3; the better one is kept, and states S2 and S1 at time 5 are handled in the same way, so again only three paths need to be recorded at time 5.
Therefore, two nested loops are needed at each time step: the outer loop iterates over all states at the current time, and the inner loop iterates over the transitions from each of those states to the states at the next time. With N states this gives a time complexity of O(N^2) per time step and O(T*N^2) over all T time steps. In actual large-scale speech recognition the number of states at any time may be large, for example 5000, so even with Viterbi the time complexity is too high. In practice, the Beam Search algorithm is introduced to solve this problem: it limits the number of states kept at the current time and the number expanded to the next time, and reducing both values increases the decoding speed.
A first determining module 504, configured to determine, based on the decoding result, whether a ratio of the average energy of the speech input by the target user to the accumulated global average energy is less than an energy threshold of the target user;
Then, the average energy of the voice input by the target user is calculated and compared with the previously accumulated global average energy, and it is judged whether the ratio is smaller than the energy threshold of the target user.
A second judging module 505, configured to judge, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user;
When the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, it is judged whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user.
Because the decoding graph is built from the reference text, when the decoded text equals the reference text, which indicates that the target user has finished reading, and the non-valid pronunciation section delay reaches the delay threshold of the target user, it is decided that the user has finished speaking. Note that abnormal situations must be handled: for example, a child user unfamiliar with the reference text may re-read or repeat parts of it, and cases such as foreign words, numbers and years also arise.
A third judging module 506, configured to judge, based on the decoding result, whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
When the decoded text does not equal the reference text and/or the non-valid pronunciation section delay has not reached the target user's delay threshold, it may be that the target user cannot read the reference text completely; the previous strategy then fails even though the target user is no longer reading. This problem is solved effectively by judging whether the currently decoded text no longer changes: if the text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user, the voice is judged to have ended.
A fourth judging module 507, configured to judge, based on the decoding result, whether the current audio duration exceeds the duration threshold of the target user when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
When the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the target user's average per-word reading time can be obtained from prior statistics, and the longest duration the user should need to read can be calculated from the length of the reference text; when the recording exceeds this duration threshold, the voice is judged to have ended.
A deciding module 508, configured to decide that the voice input of the target user has ended.
The voice input is decided to have ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; or when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user; or when the current audio duration exceeds the duration threshold of the target user.
In summary, while the user learns English with this English evaluation technology, the invention can detect the end of the user's voice at millisecond level, which brings a substantial improvement in user experience, letting the user focus on the actual English learning effect and raising learning enthusiasm.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts between the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant points, refer to the description of the method.
Those skilled in the art will further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for adaptively detecting the end of voice, comprising:
acquiring voice input by a target user;
obtaining a threshold of the target user, wherein the threshold of the target user comprises: an energy threshold and a delay threshold of the target user; the energy threshold is set based on the average pronunciation energy of each target user; the delay threshold is set based on the average pronunciation interval of each target user;
obtaining a decoding result based on the reference text and the voice input by the target user;
based on the decoding result, judging whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user;
when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user, deciding that the voice input of the target user has ended;
when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user, judging, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so: deciding that the voice input of the target user has ended.
2. The method of claim 1, wherein, when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
3. The method of claim 2, wherein the threshold of the target user further comprises: a duration threshold of the target user; the duration threshold is set based on the average speech speed of each target user; when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user, the method further comprises:
based on the decoding result, judging whether the current audio duration exceeds the duration threshold of the target user, and if so:
deciding that the voice input of the target user has ended.
4. A system for adaptively detecting the end of voice, comprising:
a first acquisition module, used for acquiring the voice input by a target user;
a second acquisition module, configured to obtain a threshold of the target user, where the threshold of the target user comprises: an energy threshold and a delay threshold of the target user; the energy threshold is set based on the average pronunciation energy of each target user; the delay threshold is set based on the average pronunciation interval of each target user;
a third acquisition module, used for obtaining a decoding result based on the reference text and the voice input by the target user;
a first judging module, configured to judge, based on the decoding result, whether the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user; a deciding module, configured to decide that the voice input of the target user has ended when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is smaller than the energy threshold of the target user;
the system further comprising:
a second judging module, configured to judge, based on the decoding result, whether the decoded text equals the reference text and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the ratio of the average energy of the voice input by the target user to the accumulated global average energy is greater than or equal to the energy threshold of the target user;
the deciding module being further configured to decide that the voice input of the target user has ended when the decoded text equals the reference text and the non-valid pronunciation section delay reaches the delay threshold of the target user.
5. The system of claim 4, further comprising:
a third judging module, configured to judge, based on the decoding result, whether the decoded text no longer changes and whether the non-valid pronunciation section delay has reached the delay threshold of the target user when the decoded text does not equal the reference text and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further configured to decide that the voice input of the target user has ended when the decoded text no longer changes and the non-valid pronunciation section delay reaches the delay threshold of the target user.
6. The system of claim 5, wherein the threshold of the target user further comprises: a duration threshold of the target user; the duration threshold is set based on the average speech speed of each target user; the system further comprising:
a fourth judging module, configured to judge, based on the decoding result, whether the current audio duration exceeds the duration threshold of the target user when the decoded text no longer changes and/or the non-valid pronunciation section delay does not reach the delay threshold of the target user;
the deciding module being further configured to decide that the voice input of the target user has ended when the current audio duration exceeds the duration threshold of the target user.
CN202011498888.8A 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending Active CN112669880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011498888.8A CN112669880B (en) 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011498888.8A CN112669880B (en) 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending

Publications (2)

Publication Number Publication Date
CN112669880A CN112669880A (en) 2021-04-16
CN112669880B (en) 2023-05-02

Family

ID=75405103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011498888.8A Active CN112669880B (en) 2020-12-16 2020-12-16 Method and system for adaptively detecting voice ending

Country Status (1)

Country Link
CN (1) CN112669880B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689882A (en) * 2021-08-24 2021-11-23 上海喜马拉雅科技有限公司 Pronunciation evaluation method and device, electronic equipment and readable storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2371619B1 (en) * 2009-10-08 2012-08-08 Telefónica, S.A. VOICE SEGMENT DETECTION PROCEDURE.

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3909532A (en) * 1974-03-29 1975-09-30 Bell Telephone Labor Inc Apparatus and method for determining the beginning and the end of a speech utterance
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN105070287A (en) * 2015-07-03 2015-11-18 广东小天才科技有限公司 Method and device for voice endpoint detection in self-adaptive noisy environment
CN105261362A (en) * 2015-09-07 2016-01-20 科大讯飞股份有限公司 Conversation voice monitoring method and system
CN107527630A (en) * 2017-09-22 2017-12-29 百度在线网络技术(北京)有限公司 Sound end detecting method, device and computer equipment
CN108962284A (en) * 2018-07-04 2018-12-07 科大讯飞股份有限公司 A kind of voice recording method and device
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN110556128A (en) * 2019-10-15 2019-12-10 出门问问信息科技有限公司 Voice activity detection method and device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Threshold-adaptive automatic speech segmentation system model; Zhang Junxing et al.; Computer Engineering and Design (《计算机工程与设计》); 2010-04-28 (No. 08); pp. 118-119 *

Also Published As

Publication number Publication date
CN112669880A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN109545243B (en) Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium
CN109147765B (en) Audio quality comprehensive evaluation method and system
US8457967B2 (en) Automatic evaluation of spoken fluency
KR101183344B1 (en) Automatic speech recognition learning using user corrections
Lippmann Speech recognition by machines and humans
US7885817B2 (en) Easy generation and automatic training of spoken dialog systems using text-to-speech
US5634086A (en) Method and apparatus for voice-interactive language instruction
CN108986830B (en) Audio corpus screening method and device
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
CN112270933B (en) Audio identification method and device
Tan et al. Application of Malay speech technology in Malay speech therapy assistance tools
Inoue et al. A Study of Objective Measurement of Comprehensibility through Native Speakers' Shadowing of Learners' Utterances.
CN112669880B (en) Method and system for adaptively detecting voice ending
CN112382310A (en) Human voice audio recording method and device
CN113486970B (en) Reading capability evaluation method and device
Nagano et al. Data augmentation based on vowel stretch for improving children's speech recognition
CN114694678A (en) Sound quality detection model training method, sound quality detection method, electronic device, and medium
CN113053414B (en) Pronunciation evaluation method and device
Lavechin et al. Statistical learning models of early phonetic acquisition struggle with child-centered audio data
CN112562731B (en) Spoken language pronunciation evaluation method and device, electronic equipment and storage medium
CN112489692A (en) Voice endpoint detection method and device
Sadeghian et al. Towards an automated screening tool for pediatric speech delay
Middag et al. Towards an ASR-free objective analysis of pathological speech
CN111402887A (en) Method and device for escaping characters by voice
KR102336015B1 (en) Video-based language disorder analysis system, method and program for performing the analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant