CN109767792B - Voice endpoint detection method, device, terminal and storage medium - Google Patents
- Publication number
- CN109767792B (application CN201910204567.3A)
- Authority
- CN
- China
- Prior art keywords
- speech
- voice
- user
- endpoint detection
- historical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The embodiment of the invention discloses a voice endpoint detection method, device, terminal and storage medium, wherein the method comprises the following steps: determining whether the difference between the current speech rate of the user and the historical average speech rate is within a preset difference range; and if the difference is not within the preset difference range, adjusting the speech energy threshold of voice endpoint detection according to the current speech rate, so that the user's speech ending endpoint is determined according to the adjusted speech energy threshold in the next voice endpoint detection process adjacent to the current voice endpoint detection. The embodiment of the invention solves the problem that sharing one parameter threshold in existing voice endpoint detection methods reduces the accuracy of speech recognition results, realizes personalized detection of the speech ending endpoints of different users, and improves the accuracy of speech recognition.
Description
Technical Field
The embodiment of the invention relates to the field of computer technology, and in particular to a voice endpoint detection method, device, terminal and storage medium.
Background
The conventional voice endpoint detection (VAD) algorithm mainly determines whether an utterance has ended according to indicators such as the zero-crossing rate and the energy level. If, in the acquired speech stream, the energy of a preset number M0 of consecutive frames is lower than a pre-specified energy threshold Elow and the energy of the following M0 consecutive frames is higher than Elow, the point where the speech energy rises is taken as the starting endpoint of the speech. Similarly, if the energy of several consecutive frames is high and the energy of the subsequent frames then falls below a pre-specified energy threshold Ehigh and stays below it for a certain period of time, the point where the speech energy falls is taken as the ending endpoint of the speech.
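For illustration only, the following is a minimal sketch of such a fixed-threshold detector over per-frame energies; the frame representation, the thresholds Elow and Ehigh, and the confirmation window M0 are assumptions chosen for the example rather than values taken from the patent.

```python
def conventional_vad(frame_energies, e_low, e_high, m0):
    """Fixed-threshold endpoint detection over per-frame speech energies.

    e_low:  energy threshold used to confirm the starting endpoint (Elow)
    e_high: energy threshold used to confirm the ending endpoint (Ehigh)
    m0:     number of consecutive frames required on each side of a transition
    """
    start, end = None, None
    for i in range(m0, len(frame_energies) - m0 + 1):
        before = frame_energies[i - m0:i]
        after = frame_energies[i:i + m0]
        # Starting endpoint: M0 frames below Elow followed by M0 frames above it.
        if start is None and max(before) < e_low and min(after) > e_low:
            start = i
        # Ending endpoint: M0 frames above Ehigh followed by M0 frames below it
        # (a simplification of "stays below Ehigh for a certain period of time").
        elif start is not None and end is None and min(before) > e_high and max(after) < e_high:
            end = i
    return start, end
```

With a fixed Ehigh and a fixed confirmation window, the detector behaves identically for every speaker, which is precisely the limitation described below.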
However, every person speaks at a different rate: some speak quickly and some slowly. If the same VAD parameter thresholds are used for everyone during voice endpoint detection, speech recognition may work well for some users while the speech of others is frequently and erroneously truncated. For example, a slow speaker may not yet have finished a sentence, yet the conventional VAD algorithm already judges the sentence to be complete, which makes the speech recognition inaccurate.
Disclosure of Invention
The embodiment of the invention provides a voice endpoint detection method, a voice endpoint detection device, a terminal and a storage medium, which aim to realize personalized detection of voice ending endpoints of different users and improve the accuracy of voice recognition.
In a first aspect, an embodiment of the present invention provides a method for detecting a voice endpoint, where the method includes:
determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;
and if the difference between the current speech rate and the historical average speech rate is not in the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.
In a second aspect, an embodiment of the present invention further provides a device for detecting a voice endpoint, where the device includes:
the speech rate determining module is used for determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;
and the voice energy threshold adjusting module is used for adjusting the voice energy threshold of the voice endpoint detection according to the current speech rate if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, so that the voice ending endpoint of the user is determined according to the adjusted voice energy threshold in the next voice endpoint detection process adjacent to the current voice endpoint detection.
In a third aspect, an embodiment of the present invention further provides a terminal, including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a voice endpoint detection method as in any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a voice endpoint detection method according to any embodiment of the present invention.
According to the embodiment of the invention, it is judged whether the difference between the user's current speech rate and historical average speech rate is within a preset difference range; if it is not, the speech energy threshold for the next voice endpoint detection is adjusted according to the current speech rate. This realizes adaptive, dynamic adjustment of the speech energy threshold for different users during voice endpoint detection, solves the problem that sharing one parameter threshold in existing voice endpoint detection methods reduces the accuracy of speech recognition results, realizes personalized detection of the speech ending endpoints of different users, and improves the accuracy of speech recognition.
Drawings
Fig. 1 is a flowchart of a voice endpoint detection method according to an embodiment of the present invention;
fig. 2 is a flowchart of a voice endpoint detection method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a voice endpoint detection apparatus according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a voice endpoint detection method according to an embodiment of the present invention. This embodiment is applicable to detecting the voice endpoints of a user during voice-based human-computer interaction. The method may be executed by a voice endpoint detection apparatus, which may be implemented in software and/or hardware and may be integrated on a terminal with a speech recognition function, such as an intelligent mobile terminal or a vehicle-mounted device.
As shown in fig. 1, the voice endpoint detection method provided in this embodiment may include:
s110, determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range.
In the voice interaction process between the user and the terminal, the terminal can call a voice acquisition device, such as a microphone, to acquire the user's voice in real time, recognize it based on speech recognition technology, and record the user's speech rate in each interaction. Based on a plurality of the user's historical speech rates, the user's historical average speech rate can be determined; it reflects the user's overall speech rate level. By comparing the current speech rate with the historical average speech rate, it can be judged whether the user's current speech rate has changed, and hence whether the speech energy threshold needs to be adjusted in the voice endpoint detection process.
Optionally, the process of determining the historical average speech rate of the user includes:
counting historical voices of a user in a preset time period, and determining the corresponding time of a starting end point and an ending end point of the historical voices;
and determining the historical average speech rate of the user according to the text length of the recognition result of the historical speech and the corresponding time of the starting end point and the ending end point of the historical speech.
The text length of the speech recognition result refers to the number of language elements included in the user's speech. For example, for Chinese the text length may refer to the number of Chinese characters in the user's speech; for English it may refer to the number of words. From the moment t1 corresponding to the starting endpoint and the moment t2 corresponding to the ending endpoint of each historical speech segment, the effective duration t2-t1 of that segment can be determined; combined with the text length len of the segment's recognition result, the speech rate at which the user produced the segment can be expressed as (t2-t1)/len (time per language element) or len/(t2-t1) (language elements per unit time). Averaging the speech rates of all historical speech segments within the preset time period yields the user's historical average speech rate for that period.
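As a non-authoritative sketch of this computation, assuming the time-per-element form (t2-t1)/len and a hypothetical tuple layout for the statistics, the averaging could look as follows.

```python
def historical_average_rate(history):
    """Average seconds per language element over one statistics period.

    history: iterable of (t_start, t_end, text_len) tuples, one per historical
    utterance, where text_len counts language elements (Chinese characters or
    English words) in the recognition result.
    """
    rates = [(t_end - t_start) / text_len
             for t_start, t_end, text_len in history if text_len > 0]
    return sum(rates) / len(rates) if rates else None


# Example: three utterances collected within the preset statistics period.
avg_rate = historical_average_rate([(0.0, 2.4, 8), (5.0, 8.6, 12), (10.0, 11.5, 5)])
```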
The preset time period is the speech statistics period closest to the current time; its specific length may be set adaptively and is not specifically limited in this embodiment. Selecting the historical average speech rate of the statistics period closest to the current time as the reference for judging whether the user's current speech rate has changed makes that reference genuinely relevant to adjusting the speech energy threshold during voice endpoint detection, ensures that the threshold is adjusted only when necessary and adjusted accurately, and thereby further ensures the accuracy of speech recognition. The user's current speech rate is determined in the same way as the historical speech rate, so the details are not repeated here.
The specific value of the preset difference range is not limited in this embodiment and may be set adaptively, according to the voice detection requirements, to any numerical range that includes zero. For example, when the preset difference range is set to 0, whether to adjust the speech energy threshold is in effect decided by the absolute difference between the user's current speech rate and historical average speech rate; when the preset difference range is set to a non-zero range, for example within 50 milliseconds, the decision is in effect based on the relative difference between the two. Specifically, when the difference between the current speech rate and the historical average speech rate is within the preset difference range, the current speech rate is considered the same as the historical average speech rate; when it is not within the preset difference range, the current speech rate is considered different from the historical average speech rate. Deciding whether to adjust the speech energy threshold on the basis of this relative difference avoids frequent threshold adjustments caused by slight fluctuations in the user's speech rate and reduces the terminal's resource consumption.
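A minimal sketch of this comparison, assuming the time-per-element representation and a hypothetical 50-millisecond tolerance, might look as follows; a tolerance of zero reduces it to the absolute-difference case described above.

```python
def rate_changed(current_rate, historical_avg_rate, tolerance_s=0.05):
    """Return True when the current speech rate falls outside the preset
    difference range around the historical average.

    Rates are in seconds per language element; tolerance_s plays the role of
    the preset difference range."""
    return abs(current_rate - historical_avg_rate) > tolerance_s
```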
And S120, if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.
If the difference between the user's current speech rate and historical average speech rate is not within the preset difference range, the current speech rate differs from the historical average, that is, the user's current speech rate has changed, and the speech energy threshold needs to be adjusted according to the current speech rate so that the adjusted threshold is used to determine the user's speech ending endpoint in the next voice endpoint detection process. This realizes adaptive, dynamic adjustment of the speech energy threshold for the same user under different speech rates and makes the adjusted threshold better match the user's speaking characteristics. For example, if the user speaks quickly, the speech energy threshold may be raised appropriately in the next voice endpoint detection process adjacent to the current one; if the user speaks slowly, the threshold may be lowered in that next process. Each determination of the threshold is related to the comparison between the user's most recent speech rate and the historical average speech rate; as the user continues to interact with the terminal by voice, the speech energy threshold gradually stabilizes as the user's speech rate stabilizes and finally converges to the threshold best suited to detecting that user's voice endpoints.
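The following sketch illustrates only the direction of this adjustment; the adjustment step is a hypothetical tuning parameter, and the concrete mechanism adopted in the second embodiment below is the duration extension rather than a direct scaling of the threshold.

```python
def adjust_energy_threshold(current_rate, historical_avg_rate, threshold,
                            tolerance_s=0.05, step=0.1):
    """Raise or lower the speech energy threshold for the next detection pass.

    Rates are seconds per language element, so a smaller value means faster
    speech. Within the preset difference range the threshold is left unchanged.
    """
    if abs(current_rate - historical_avg_rate) <= tolerance_s:
        return threshold                      # speech rate unchanged
    if current_rate < historical_avg_rate:
        return threshold * (1 + step)         # speaking faster: raise threshold
    return threshold * (1 - step)             # speaking slower: lower threshold
```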
For different users, because their speech rate characteristics differ, the dynamic adjustment of the speech energy threshold during voice endpoint detection also differs, so each user ends up with a speech energy threshold that matches his or her own speech rate characteristics. Voice endpoints are therefore no longer detected for all users with one and the same threshold, as in the prior art; personalized detection of voice endpoints for different users is realized, and the accuracy of the speech recognition result for each user is improved.
According to the technical scheme of this embodiment, it is judged whether the difference between the user's current speech rate and historical average speech rate is within the preset difference range; if it is not, the speech energy threshold for the next voice endpoint detection is adjusted according to the current speech rate. This realizes adaptive, dynamic adjustment of the speech energy threshold for different users during voice endpoint detection, solves the problem that the accuracy of speech recognition results is reduced because all users share one set of parameter thresholds in existing voice endpoint detection methods, realizes personalized detection of the speech ending endpoints of different users, avoids erroneous truncation of the user's speech during speech recognition, and thereby ensures the integrity of the collected speech information and improves the accuracy of speech recognition. In addition, compared with determining a user's speech energy threshold through model training, dynamically adjusting the threshold based on changes in the user's speech rate is easier to implement as a product, involves no interaction with a remote server, and consumes less terminal power, which makes products with an accurate speech recognition function based on this scheme easier to popularize.
Example two
Fig. 2 is a flowchart of a voice endpoint detection method according to a second embodiment of the present invention, which is further optimized based on the above-mentioned embodiments. As shown in fig. 2, the method may include:
s210, determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range.
S220, if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, in the next speech endpoint detection process adjacent to the current speech endpoint detection, extending the target duration based on the moment when the speech energy of the user starts to decrease, and taking the speech energy corresponding to the moment when the duration extension is finished as a speech energy threshold, wherein the target duration is a preset time length determined according to the current speech rate.
Whether the user's speech rate is expressed as the ratio of effective speech duration to recognition-result text length, (t2-t1)/len, or as the ratio of text length to effective speech duration, len/(t2-t1), the time the user currently needs to output one language element, for example one Chinese character or one English word, can be determined from the current speech rate. In the next voice endpoint detection process, when the speech energy is detected to start decreasing, timing is extended according to this per-element time; for example, the total time corresponding to a preset multiple of the time the user needs to output one language element is taken as the target duration. The preset multiple may be set according to the accuracy or sensitivity requirements of voice endpoint detection, and may be the same or different for different users.
For example, in the next voice endpoint detection process adjacent to the current one, suppose the speech the user intends to output is "today is a fine day". After the user finishes saying "fine day", the speech energy starts to decrease. The target duration, for example twice the time needed to output one Chinese character as determined from the user's current speech rate, is then counted from the moment the user finishes saying "fine day", and the speech energy corresponding to the moment this extension ends is the speech energy threshold for the next voice endpoint detection process. If the speech energy of several consecutive frames before the extension-end moment is greater than this threshold and the speech energy of several consecutive frames after it is smaller than this threshold, the extension-end moment is the ending endpoint of the user's next speech output.
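The following sketch illustrates this extension-based decision under stated assumptions: per-frame energies, a hypothetical 10 ms frame length, a preset multiple of 2, and a 3-frame confirmation window are example values, not values specified by the patent.

```python
def endpoint_by_extension(frame_energies, drop_idx, current_rate,
                          frame_s=0.01, multiple=2, confirm=3):
    """Determine the new threshold and (possibly) the ending endpoint.

    frame_energies: per-frame speech energies in the next detection pass
    drop_idx:       index of the frame where speech energy starts to decrease
    current_rate:   current speech rate in seconds per language element
    The target duration is `multiple` times the per-element time; the energy at
    the frame where the extension ends becomes the new speech energy threshold.
    """
    target_frames = int(round(multiple * current_rate / frame_s))
    end_idx = min(drop_idx + target_frames, len(frame_energies) - 1)
    threshold = frame_energies[end_idx]
    before = frame_energies[max(0, end_idx - confirm):end_idx]
    after = frame_energies[end_idx + 1:end_idx + 1 + confirm]
    # The extension-end moment counts as the ending endpoint only when the
    # frames before it stay above the new threshold and the frames after it
    # stay below it, mirroring the confirmation step described above.
    is_end = bool(before) and bool(after) and \
        min(before) > threshold and max(after) < threshold
    return threshold, (end_idx if is_end else None)
```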
In effect, for users who speak quickly, the speech energy threshold for the voice endpoint detection process can be determined within a shorter extension time, whereas for users who speak slowly it must be determined within a relatively longer one. This avoids the situation in which users who speak quickly and users who speak slowly share the same fixed speech energy threshold, or the same delay for determining it, so that a slow speaker who has not yet finished talking is mistakenly judged by the terminal to have finished speaking; once part of the user's speech information collected by the terminal is lost in this way, the speech recognition result is naturally inaccurate.
Optionally, the method further includes: if the difference between the current speech rate and the historical average speech rate is within the preset difference range, that is, the user's current speech rate is considered the same as the historical average speech rate, determining the user's speech ending endpoint in the next voice endpoint detection process according to the speech energy threshold used by the current voice endpoint detection. In other words, when the current speech rate has not changed, the speech energy threshold for the next voice endpoint detection process does not need to be adjusted.
In the technical scheme of this embodiment, when the difference between the current speech rate and the historical average speech rate is determined not to be within the preset difference range, the target duration is counted, in the next voice endpoint detection process adjacent to the current voice endpoint detection, from the moment the user's speech energy starts to decrease, and the speech energy corresponding to the moment the extension ends is taken as the speech energy threshold. This realizes adaptive, dynamic adjustment of the speech energy threshold according to the speech rates of different users, solves the problem that the accuracy of speech recognition results is reduced because all users share one set of parameter thresholds in existing voice endpoint detection methods, realizes personalized detection of the speech ending endpoints of different users, and avoids erroneous truncation of the user's speech during speech recognition, thereby ensuring the integrity of the collected speech information and improving the accuracy of speech recognition.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a voice endpoint detection apparatus according to a third embodiment of the present invention, which is applicable to detecting a voice endpoint of a user in a voice-based human-computer interaction process. The device can be realized in a software and/or hardware mode, and can be integrated on a terminal with a voice recognition function, such as an intelligent mobile terminal, a vehicle-mounted device and the like.
As shown in fig. 3, the speech endpoint detection apparatus provided in this embodiment may include a speech rate determining module 310 and a speech energy threshold adjusting module 320, where:
a speech rate determining module 310, configured to determine whether a difference between a current speech rate of the user and a historical average speech rate is within a preset difference range;
the voice energy threshold adjusting module 320 is configured to adjust a voice energy threshold of the voice endpoint detection according to the current speech rate if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, so that in a next voice endpoint detection process adjacent to the current voice endpoint detection, a voice ending endpoint of the user is determined according to the adjusted voice energy threshold.
Optionally, the voice energy threshold adjusting module 320 is specifically configured to:
in the next voice endpoint detection process adjacent to the current voice endpoint detection, based on the time when the voice energy of the user starts to decrease, the target time length is prolonged, and the voice energy corresponding to the time when the time length extension ends is used as a voice energy threshold, wherein the target time length is a preset time length determined according to the current speech rate.
Optionally, the apparatus further comprises:
the historical voice counting module is used for counting the historical voice of the user in a preset time period and determining the corresponding time of a starting end point and an ending end point of the historical voice;
and the historical average speech rate determining module is used for determining the historical average speech rate of the user according to the text length of the recognition result of the historical speech and the moments corresponding to the starting endpoint and the ending endpoint of the historical speech.
Optionally, the apparatus further comprises:
and the voice ending endpoint determining module is used for determining the voice ending endpoint of the user in the next voice endpoint detection process according to the voice energy threshold used by the current voice endpoint detection if the difference between the current speech rate and the historical average speech rate is within the preset difference range.
The voice endpoint detection device provided by the embodiment of the invention can execute the voice endpoint detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects for executing that method. For details not described in this embodiment, reference may be made to the description of any method embodiment of the invention.
Example four
Fig. 4 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary terminal 412 suitable for use in implementing embodiments of the present invention. The terminal 412 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 4, the terminal 412 is represented in the form of a general-purpose terminal. The components of the terminal 412 may include, but are not limited to: one or more processors 416, a storage device 428, and a bus 418 that couples the various system components including the storage device 428 and the processors 416.
A program/utility 440 having a set (at least one) of program modules 442 may be stored, for instance, in storage 428, such program modules 442 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. The program modules 442 generally perform the functions and/or methodologies of the described embodiments of the invention.
The terminal 412 may also communicate with one or more external devices 414 (e.g., a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the terminal 412, and/or with any device (e.g., a network card, a modem, etc.) that enables the terminal 412 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 422. Also, the terminal 412 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 420. As shown in fig. 4, the network adapter 420 communicates with the other modules of the terminal 412 over the bus 418. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the terminal 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID (Redundant Arrays of Independent Disks) systems, tape drives, data backup storage systems, and the like.
The processor 416 executes various functional applications and data processing by running programs stored in the storage 428, for example, implementing a voice endpoint detection method provided by any embodiment of the present invention, which may include:
determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;
and if the difference between the current speech rate and the historical average speech rate is not in the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.
EXAMPLE five
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a voice endpoint detection method according to any embodiment of the present invention, where the method may include:
determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;
and if the difference between the current speech rate and the historical average speech rate is not in the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (8)
1. A method for voice endpoint detection, comprising:
determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;
if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, in the next speech endpoint detection process adjacent to the current speech endpoint detection, extending the target duration based on the moment when the user speech energy begins to decrease, and taking the speech energy corresponding to the moment when the duration extension ends as a speech energy threshold, wherein the target duration is a preset time length determined according to the current speech rate.
2. The method of claim 1, further comprising:
counting historical voices of a user in a preset time period, and determining moments corresponding to a starting endpoint and an ending endpoint of the historical voices respectively;
and determining the historical average speech speed of the user according to the text length of the recognition result of the historical speech and the corresponding time of the starting end point and the ending end point of the historical speech.
3. The method of claim 1, further comprising:
and if the difference between the current speech rate and the historical average speech rate is in a preset difference range, determining the end point of the user's speech in the next speech end point detection process according to the speech energy threshold used by the current speech end point detection.
4. A voice endpoint detection apparatus, comprising:
the speech rate determining module is used for determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;
and the voice energy threshold adjusting module is used for prolonging the target time length based on the moment when the voice energy of the user starts to decrease in the next voice endpoint detection process adjacent to the current voice endpoint detection if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, and taking the voice energy corresponding to the time length extension ending moment as the voice energy threshold, wherein the target time length is a preset time length determined according to the current speech rate.
5. The apparatus of claim 4, further comprising:
the historical voice counting module is used for counting the historical voice of a user in a preset time period and determining the time corresponding to the starting end point and the ending end point of the historical voice respectively;
and the historical average speech rate determining module is used for determining the historical average speech rate of the user according to the text length of the recognition result of the historical speech and the moments corresponding to the starting endpoint and the ending endpoint of the historical speech.
6. The apparatus of claim 4, further comprising:
and the voice ending endpoint determining module is used for determining the voice ending endpoint of the user in the next voice endpoint detection process according to the voice energy threshold used by the current voice endpoint detection if the difference between the current speech rate and the historical average speech rate is within a preset difference range.
7. A terminal, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the voice endpoint detection method of any one of claims 1 to 3.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for voice endpoint detection according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910204567.3A CN109767792B (en) | 2019-03-18 | 2019-03-18 | Voice endpoint detection method, device, terminal and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910204567.3A CN109767792B (en) | 2019-03-18 | 2019-03-18 | Voice endpoint detection method, device, terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109767792A CN109767792A (en) | 2019-05-17 |
CN109767792B true CN109767792B (en) | 2020-08-18 |
Family
ID=66459398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910204567.3A Active CN109767792B (en) | 2019-03-18 | 2019-03-18 | Voice endpoint detection method, device, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109767792B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11404044B2 (en) | 2019-05-14 | 2022-08-02 | Samsung Electronics Co., Ltd. | Method, apparatus, electronic device, and computer readable storage medium for voice translation |
CN110400576B (en) * | 2019-07-29 | 2021-10-15 | 北京声智科技有限公司 | Voice request processing method and device |
CN110415710B (en) * | 2019-08-06 | 2022-05-31 | 大众问问(北京)信息科技有限公司 | Parameter adjusting method, device, equipment and medium for vehicle-mounted voice interaction system |
CN110619888B (en) * | 2019-09-30 | 2023-06-27 | 北京淇瑀信息科技有限公司 | AI voice rate adjusting method and device and electronic equipment |
CN110728994B (en) * | 2019-12-19 | 2020-05-05 | 北京海天瑞声科技股份有限公司 | Voice acquisition method and device of voice library, electronic equipment and storage medium |
WO2021134549A1 (en) * | 2019-12-31 | 2021-07-08 | 李庆远 | Human merging and training of multiple artificial intelligence outputs |
CN111402931B (en) * | 2020-03-05 | 2023-05-26 | 云知声智能科技股份有限公司 | Voice boundary detection method and system assisted by sound image |
CN111583933B (en) * | 2020-04-30 | 2023-10-27 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
CN111968680B (en) * | 2020-08-14 | 2024-10-01 | 北京小米松果电子有限公司 | Voice processing method, device and storage medium |
CN112185365A (en) * | 2020-09-30 | 2021-01-05 | 深圳供电局有限公司 | Power supply intelligent client processing method and system |
CN112634907B (en) * | 2020-12-24 | 2024-05-17 | 百果园技术(新加坡)有限公司 | Audio data processing method and device for voice recognition |
CN113177114B (en) * | 2021-05-28 | 2022-10-21 | 重庆电子工程职业学院 | Natural language semantic understanding method based on deep learning |
CN114203204B (en) * | 2021-12-06 | 2024-04-05 | 北京百度网讯科技有限公司 | Tail point detection method, device, equipment and storage medium |
CN114743571A (en) * | 2022-04-08 | 2022-07-12 | 北京字节跳动网络技术有限公司 | Audio processing method and device, storage medium and electronic equipment |
CN114898755B (en) * | 2022-07-14 | 2023-01-17 | 科大讯飞股份有限公司 | Voice processing method and related device, electronic equipment and storage medium |
CN116092352B (en) * | 2022-12-29 | 2024-07-09 | 深圳市声扬科技有限公司 | Dialogue training method and device and computer readable storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9997172B2 (en) * | 2013-12-02 | 2018-06-12 | Nuance Communications, Inc. | Voice activity detection (VAD) for a coded speech bitstream without decoding |
KR20150105847A (en) * | 2014-03-10 | 2015-09-18 | 삼성전기주식회사 | Method and Apparatus for detecting speech segment |
- 2019-03-18: Application CN201910204567.3A filed in China; granted as patent CN109767792B (active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0713584A (en) * | 1992-10-05 | 1995-01-17 | Matsushita Electric Ind Co Ltd | Speech detecting device |
US8645133B2 (en) * | 2006-05-09 | 2014-02-04 | Core Wireless Licensing S.A.R.L. | Adaptation of voice activity detection parameters based on encoding modes |
CN104134440A (en) * | 2014-07-31 | 2014-11-05 | 百度在线网络技术(北京)有限公司 | Voice detection method and device used for portable terminal |
CN107068147A (en) * | 2015-10-19 | 2017-08-18 | 谷歌公司 | Sound end is determined |
CN106782508A (en) * | 2016-12-20 | 2017-05-31 | 美的集团股份有限公司 | The cutting method of speech audio and the cutting device of speech audio |
CN106611598A (en) * | 2016-12-28 | 2017-05-03 | 上海智臻智能网络科技股份有限公司 | VAD dynamic parameter adjusting method and device |
CN108962227A (en) * | 2018-06-08 | 2018-12-07 | 百度在线网络技术(北京)有限公司 | Voice beginning and end detection method, device, computer equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
《Speech endpoint detection algorithm for Uyghur based on acoustic frequency feature》; Yating Yang et al.; IEEE 10th International Conference on Signal Processing Proceedings; 2010-12-03; full text *
《Research on Voice Endpoint Detection Method Based on Convolutional Neural Network》; Wang Haixu; China Masters' Theses Full-text Database, Information Science and Technology; 2016-01-15 (No. 01); full text *
Also Published As
Publication number | Publication date |
---|---|
CN109767792A (en) | 2019-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109767792B (en) | Voice endpoint detection method, device, terminal and storage medium | |
US20200402500A1 (en) | Method and device for generating speech recognition model and storage medium | |
KR101942521B1 (en) | Speech endpointing | |
CN107632980B (en) | Voice translation method and device for voice translation | |
US9805715B2 (en) | Method and system for recognizing speech commands using background and foreground acoustic models | |
EP3726524B1 (en) | Speech endpointing | |
KR102390940B1 (en) | Context biasing for speech recognition | |
US8996366B2 (en) | Multi-stage speaker adaptation | |
WO2015169134A1 (en) | Method and apparatus for phonetically annotating text | |
CN110047485B (en) | Method and apparatus for recognizing wake-up word, medium, and device | |
CN111797632B (en) | Information processing method and device and electronic equipment | |
BR112014017708B1 (en) | METHOD AND APPARATUS TO DETECT VOICE ACTIVITY IN THE PRESENCE OF BACKGROUND NOISE, AND, COMPUTER-READABLE MEMORY | |
US9431005B2 (en) | System and method for supplemental speech recognition by identified idle resources | |
EP3739583B1 (en) | Dialog device, dialog method, and dialog computer program | |
CN108055617B (en) | Microphone awakening method and device, terminal equipment and storage medium | |
US11562735B1 (en) | Multi-modal spoken language understanding systems | |
CN108877779B (en) | Method and device for detecting voice tail point | |
CN113611316A (en) | Man-machine interaction method, device, equipment and storage medium | |
CN112071310A (en) | Speech recognition method and apparatus, electronic device, and storage medium | |
CN114242064A (en) | Speech recognition method and device, and training method and device of speech recognition model | |
CN114093358A (en) | Speech recognition method and apparatus, electronic device, and storage medium | |
CN113053390A (en) | Text processing method and device based on voice recognition, electronic equipment and medium | |
CN111862943A (en) | Speech recognition method and apparatus, electronic device, and storage medium | |
US20180082703A1 (en) | Suitability score based on attribute scores | |
CN112863496B (en) | Voice endpoint detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20211013. Patentee after: Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd., 100176 101, floor 1, building 1, yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing. Patentee before: BAIDU INTERNATIONAL TECHNOLOGY (SHENZHEN) Co.,Ltd., Unit D, Unit 3, 301, Productivity Building No. 5, High-tech Secondary Road, Nanshan District, Shenzhen City, Guangdong Province. |
TR01 | Transfer of patent right |