CN109767792B - Voice endpoint detection method, device, terminal and storage medium

Info

Publication number: CN109767792B
Application number: CN201910204567.3A
Authority: CN
Inventors: 欧阳能钧; 贺学焱; 彭汉迎
Original assignee: Baidu International Technology Shenzhen Co ltd
Current assignee: Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority date: 2019-03-18
Filing date: 2019-03-18
Publication date: 2020-08-18
Anticipated expiration: 2039-03-18
Also published as: CN109767792A

Abstract

The embodiment of the invention discloses a voice endpoint detection method, a voice endpoint detection device, a terminal and a storage medium. The method comprises the following steps: determining whether the difference between the current speech rate of the user and the historical average speech rate is within a preset difference range; and, if the difference is not within the preset difference range, adjusting the speech energy threshold of the voice endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold in the next voice endpoint detection process adjacent to the current voice endpoint detection. The embodiment of the invention solves the problem that the accuracy of the voice recognition result is reduced because all users share the same parameter threshold in existing voice endpoint detection methods, realizes personalized detection of the voice ending endpoints of different users, and improves the accuracy of voice recognition.

Description

Voice endpoint detection method, device, terminal and storage medium

Technical Field

The embodiment of the invention relates to the technical field of computers, and in particular to a voice endpoint detection method, a voice endpoint detection device, a terminal and a storage medium.

Background

Conventional voice endpoint detection (VAD) algorithms mainly detect whether a sentence has finished according to indicators such as zero-crossing rate and sound level. If, in the acquired speech stream, the speech energy of a preset number M0 of consecutive frames is lower than a pre-specified energy threshold Elow, and the speech energy of the following M0 consecutive frames is greater than Elow, the point at which the speech energy increases is taken as the starting endpoint of the speech. Similarly, if the speech energy of several consecutive frames is high and the energy of the subsequent frames then falls below a pre-specified energy threshold Ehigh and stays below it for a certain period of time, the point at which the speech energy decreases is taken as the ending endpoint of the speech.
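The conventional fixed-threshold scheme described above can be sketched roughly as follows. This is only an illustrative sketch, not an implementation taken from the text: the function name `detect_endpoints`, the use of precomputed per-frame energies, the reuse of M0 as the length of the "certain period of time" for the ending endpoint, and the reading of "high" energy as "not below Ehigh" are all assumptions.

```python
from typing import List, Optional, Tuple


def detect_endpoints(
    energies: List[float], e_low: float, e_high: float, m0: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return (start_frame, end_frame); an entry is None if not found.

    Assumption: `energies` holds one energy value per frame, and the same
    frame count m0 is used for both the start and the end test.
    """
    start = None
    for i in range(m0, len(energies) - m0 + 1):
        before = energies[i - m0:i]
        after = energies[i:i + m0]
        # Start: m0 quiet frames followed by m0 frames above Elow.
        if all(e < e_low for e in before) and all(e > e_low for e in after):
            start = i
            break
    if start is None:
        return None, None

    end = None
    for i in range(start + m0, len(energies) - m0 + 1):
        before = energies[i - m0:i]
        after = energies[i:i + m0]
        # End: m0 loud frames followed by m0 frames below Ehigh.
        if all(e >= e_high for e in before) and all(e < e_high for e in after):
            end = i
            break
    return start, end


if __name__ == "__main__":
    demo = [0.0] * 5 + [1.0] * 10 + [0.0] * 5
    print(detect_endpoints(demo, e_low=0.1, e_high=0.5, m0=3))
```

Because Elow, Ehigh and M0 are fixed here for every speaker, a pause inside a slow speaker's sentence can satisfy the end test, which is the problem discussed next.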

However, speaking speed differs from person to person: some people speak quickly and some speak slowly. If the same VAD parameter thresholds are used for everyone during voice endpoint detection, recognition may work well for some people while the speech of others is frequently truncated by mistake. For example, a slow speaker may not yet have finished a sentence when the conventional VAD algorithm judges the sentence to be complete, which makes the voice recognition inaccurate.

Disclosure of Invention

The embodiment of the invention provides a voice endpoint detection method, a voice endpoint detection device, a terminal and a storage medium, which aim to realize personalized detection of voice ending endpoints of different users and improve the accuracy of voice recognition.

In a first aspect, an embodiment of the present invention provides a method for detecting a voice endpoint, where the method includes:

determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;

and if the difference between the current speech rate and the historical average speech rate is not in the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.

In a second aspect, an embodiment of the present invention further provides a device for detecting a voice endpoint, where the device includes:

the speech rate determining module is used for determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;

and the voice energy threshold adjusting module is used for adjusting the voice energy threshold of the voice endpoint detection according to the current speech rate if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, so that the voice ending endpoint of the user is determined according to the adjusted voice energy threshold in the next voice endpoint detection process adjacent to the current voice endpoint detection.

In a third aspect, an embodiment of the present invention further provides a terminal, including:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a voice endpoint detection method as in any embodiment of the invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a voice endpoint detection method according to any embodiment of the present invention.

According to the embodiment of the invention, it is judged whether the difference between the current speech rate of the user and the historical average speech rate is within the preset difference range. If the difference is not within the preset difference range, the speech energy threshold used in the next voice endpoint detection is adjusted according to the current speech rate. In this way, the speech energy threshold is adjusted adaptively and dynamically for different users during voice endpoint detection, which solves the problem that the accuracy of the voice recognition result is reduced because all users share the same parameter threshold in existing voice endpoint detection methods, realizes personalized detection of the speech ending endpoints of different users, and improves the accuracy of voice recognition.

Drawings

Fig. 1 is a flowchart of a voice endpoint detection method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a voice endpoint detection method according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a voice endpoint detection apparatus according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a voice endpoint detection method according to an embodiment of the present invention, where the present embodiment is applicable to detecting a voice endpoint of a user in a voice-based human-computer interaction process, and the method may be executed by a voice endpoint detection apparatus, where the apparatus may be implemented in a software and/or hardware manner, and may be integrated on a terminal with a voice recognition function, such as an intelligent mobile terminal and a vehicle-mounted device.

As shown in fig. 1, the voice endpoint detection method provided in this embodiment may include:

S110, determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range.

During voice interaction between the user and the terminal, the terminal can call a voice acquisition device, such as a microphone, to acquire the voice of the user in real time, recognize the voice of the user based on a voice recognition technology, and record the speech rate of the user in each interaction. Based on multiple historical speech rates of the user, the historical average speech rate of the user may be determined. The historical average speech rate reflects the user's overall speech rate level. By comparing the current speech rate with the historical average speech rate, it can be judged whether the current speech rate of the user has changed, and therefore whether the speech energy threshold needs to be adjusted in the voice endpoint detection process.

Optionally, the process of determining the historical average speech rate of the user includes:

counting historical voices of a user in a preset time period, and determining the corresponding time of a starting end point and an ending end point of the historical voices;

and determining the historical average speech rate of the user according to the text length of the recognition result of the historical speech and the corresponding time of the starting end point and the ending end point of the historical speech.

The text length of the speech recognition result refers to the number of language elements included in the user's speech. For example, for Chinese, the text length may refer to the number of Chinese characters included in the user's speech; for English, the text length may refer to the number of words included in the user's speech. From the time t1 corresponding to the starting endpoint and the time t2 corresponding to the ending endpoint of each segment of historical speech, the effective duration t2-t1 of that segment can be determined. Combined with the text length len of the recognition result of that segment, the speech rate at which the user output the segment can then be determined, expressed either as (t2-t1)/len (time per language element) or as len/(t2-t1) (language elements per unit time). Averaging the speech rates of all historical speech segments in the preset time period gives the historical average speech rate of the user for that period.
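The speech rate calculation above can be illustrated with a short sketch. The names `Utterance`, `text_length`, `seconds_per_element` and `historical_average` are hypothetical, the (t2-t1)/len form is chosen arbitrarily, and counting every non-whitespace character as one Chinese language element is a simplifying assumption (punctuation would need to be filtered in practice).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start_time: float  # t1, in seconds
    end_time: float    # t2, in seconds
    text: str          # recognition result of the utterance


def text_length(text: str, language: str = "zh") -> int:
    """Number of language elements: words for English, characters otherwise."""
    if language == "en":
        return len(text.split())
    # Simplification: every non-whitespace character counts as one element.
    return sum(1 for ch in text if not ch.isspace())


def seconds_per_element(u: Utterance, language: str = "zh") -> float:
    """Speech rate in the (t2 - t1) / len form."""
    n = text_length(u.text, language)
    if n == 0:
        raise ValueError("empty recognition result")
    return (u.end_time - u.start_time) / n


def historical_average(history: List[Utterance], language: str = "zh") -> float:
    """Mean of the per-utterance speech rates in the statistics period."""
    if not history:
        raise ValueError("no historical speech available")
    rates = [seconds_per_element(u, language) for u in history]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    history = [Utterance(0.0, 2.0, "今天是晴天"), Utterance(5.0, 8.0, "今天是晴天")]
    print(historical_average(history))
```

The reciprocal form len/(t2-t1) carries the same information; whichever form is used, the preset difference range has to be expressed in the same unit.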

The preset time period includes the voice statistics period closest to the current time. Its specific length may be set adaptively and is not specifically limited in this embodiment. Using the historical average speech rate of the statistics period closest to the current time as the reference for judging whether the user's current speech rate has changed provides a meaningful basis for adjusting the speech energy threshold in the voice endpoint detection process: it ensures that the threshold is adjusted only when necessary and that the adjustment is accurate, which in turn ensures the accuracy of voice recognition. The current speech rate of the user is determined in the same way as the historical speech rate of the user, and the details are not repeated here.
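One possible way to restrict the average to the most recent statistics period is a sliding time window, sketched below. The class name `RecentRateHistory`, the window length and the use of plain timestamps in seconds are assumptions made for illustration; the text does not prescribe a data structure.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RecentRateHistory:
    """Keeps only the speech rates recorded within the last window_seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._items: Deque[Tuple[float, float]] = deque()  # (timestamp, rate)

    def _evict(self, now: float) -> None:
        # Drop entries that are older than the preset time period.
        while self._items and now - self._items[0][0] > self.window_seconds:
            self._items.popleft()

    def add(self, timestamp: float, rate: float) -> None:
        self._items.append((timestamp, rate))
        self._evict(timestamp)

    def average(self, now: float) -> Optional[float]:
        """Historical average over the window, or None if it is empty."""
        self._evict(now)
        if not self._items:
            return None
        return sum(rate for _, rate in self._items) / len(self._items)


if __name__ == "__main__":
    history = RecentRateHistory(window_seconds=60.0)
    history.add(0.0, 0.2)
    history.add(30.0, 0.4)
    history.add(90.0, 0.6)  # the entry recorded at t=0 falls out of the window
    print(history.average(now=90.0))
```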

The specific value of the preset difference range is not limited in this embodiment; it may be set adaptively, according to the voice detection requirements, to any numerical range that includes zero. For example, when the preset difference range is set to 0, whether to adjust the speech energy threshold is determined by the absolute difference between the user's current speech rate and the historical average speech rate; when the preset difference range is set to a non-zero range, for example within 50 milliseconds, whether to adjust the speech energy threshold is determined by the relative difference between the two. Specifically, when the difference between the current speech rate and the historical average speech rate of the user is within the preset difference range, the current speech rate is considered to be the same as the historical average speech rate; when the difference is not within the preset difference range, the current speech rate is considered to differ from the historical average speech rate. Deciding whether to adjust the speech energy threshold based on the relative difference avoids frequent adjustments caused by slight fluctuations in the user's speech rate and reduces the resource consumption of the terminal.
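The range check can be sketched as a single comparison. The function name `within_preset_range` is hypothetical, and the default range of plus or minus 0.05 assumes that the speech rate is expressed in seconds per language element, which is one reading of the 50 millisecond example above.

```python
from typing import Tuple


def within_preset_range(
    current_rate: float,
    historical_avg: float,
    diff_range: Tuple[float, float] = (-0.05, 0.05),
) -> bool:
    """True if current_rate - historical_avg lies inside diff_range.

    Assumption: rates are in seconds per language element, so the default
    range corresponds to a tolerance of 50 milliseconds either way.
    """
    low, high = diff_range
    if not low <= 0.0 <= high:
        raise ValueError("the preset difference range must include zero")
    return low <= current_rate - historical_avg <= high


if __name__ == "__main__":
    print(within_preset_range(0.42, 0.40))              # small fluctuation
    print(within_preset_range(0.50, 0.40))              # noticeable slowdown
    print(within_preset_range(0.41, 0.40, (0.0, 0.0)))  # range set to 0
```

With the range set to `(0.0, 0.0)`, any difference at all triggers an adjustment; a wider range suppresses adjustments for small fluctuations.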

S120, if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.

When the difference between the current speech rate and the historical average speech rate of the user is not within the preset difference range, the current speech rate differs from the historical average speech rate, that is, the user's speech rate has changed. The speech energy threshold therefore needs to be adjusted according to the current speech rate, so that the adjusted threshold is used to determine the user's speech ending endpoint in the next voice endpoint detection process. This realizes adaptive, dynamic adjustment of the speech energy threshold for the same user at different speech rates, so that the adjusted threshold better suits the user's speaking characteristics. For example, if the user's speech rate is fast, the speech energy threshold may be increased appropriately in the next voice endpoint detection process; if the speech rate is slow, the speech energy threshold may be reduced in the next voice endpoint detection process. Each determination of the speech energy threshold depends on the result of comparing the user's most recent speech rate with the historical average speech rate. As the user continues to interact with the terminal by voice, the speech energy threshold gradually stabilizes as the user's speech rate stabilizes, and eventually settles on the value best suited to voice endpoint detection for that user.
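The direction of the adjustment described above (raise the threshold for faster speech, lower it for slower speech) might be sketched as follows. The multiplicative update and the `step` parameter are purely illustrative assumptions; the text does not specify how large the adjustment should be. Rates are again taken in seconds per language element, so a smaller value means faster speech.

```python
def adjust_energy_threshold(
    threshold: float,
    current_rate: float,
    historical_avg: float,
    step: float = 0.1,
) -> float:
    """Return the energy threshold to use in the next endpoint detection.

    Assumption: rates are in seconds per language element and the threshold
    is scaled by a fixed fraction `step` in the appropriate direction.
    """
    if not 0.0 < step < 1.0:
        raise ValueError("step must be between 0 and 1")
    if current_rate < historical_avg:
        # Faster than usual: raise the threshold.
        return threshold * (1.0 + step)
    if current_rate > historical_avg:
        # Slower than usual: lower the threshold.
        return threshold * (1.0 - step)
    return threshold


if __name__ == "__main__":
    print(adjust_energy_threshold(1.0, current_rate=0.3, historical_avg=0.4))
    print(adjust_energy_threshold(1.0, current_rate=0.5, historical_avg=0.4))
```

In practice this function would only be called when the difference falls outside the preset difference range, so the unchanged branch is a safeguard rather than the normal path.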

Because speech rate characteristics differ between users, the dynamic adjustment of the speech energy threshold during voice endpoint detection produces different results for different users. Each user therefore has a speech energy threshold that matches his or her own speech rate characteristics, instead of all users being subjected, without distinction, to endpoint detection with the same speech energy threshold as in the prior art. This realizes personalized detection of the speech endpoints of different users and improves the accuracy of the speech recognition result for each user.

According to the technical scheme of this embodiment, it is judged whether the difference between the current speech rate of the user and the historical average speech rate is within the preset difference range, and if it is not, the speech energy threshold used in the next voice endpoint detection is adjusted according to the current speech rate. The speech energy threshold is thus adjusted adaptively and dynamically for different users during voice endpoint detection. This solves the problem that the accuracy of the speech recognition result is reduced because all people share one set of parameter thresholds in existing voice endpoint detection methods, realizes personalized detection of the speech ending endpoints of different users, and avoids mistaken truncation of the user's speech during speech recognition, which ensures the integrity of the acquired user speech information and improves the accuracy of speech recognition. In addition, because the scheme of this embodiment adjusts the speech energy threshold dynamically based on changes in the user's speech rate, it is easier to implement in a product than determining the user's speech energy threshold through model training: it involves no interaction with a remote server and consumes less terminal power, so products with an accurate speech recognition function based on this scheme are easier to popularize.

Example two

Fig. 2 is a flowchart of a voice endpoint detection method according to a second embodiment of the present invention, which is further optimized based on the above-mentioned embodiments. As shown in fig. 2, the method may include:

S210, determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range.

S220, if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, in the next speech endpoint detection process adjacent to the current speech endpoint detection, extending the target duration based on the moment when the speech energy of the user starts to decrease, and taking the speech energy corresponding to the moment when the duration extension is finished as a speech energy threshold, wherein the target duration is a preset time length determined according to the current speech rate.

Whether the user's speech rate is expressed as the ratio (t2-t1)/len of the effective speech duration to the text length of the recognition result, or as the ratio len/(t2-t1) of the text length of the recognition result to the effective speech duration, the time the user currently takes to output one language element, for example the time to speak one Chinese character or one English word, can be determined from the user's current speech rate. In the next voice endpoint detection process, when it is detected that the speech energy starts to decrease, a timed extension may be performed based on the time the user currently takes to output one language element. For example, a preset multiple of the time the user takes to output one language element is used as the target duration, and the preset multiple may be set according to the accuracy or sensitivity requirements of the voice endpoint detection. The preset multiple may be the same or different for different users.

For example, in the next voice endpoint detection process adjacent to the current voice endpoint detection, suppose the speech to be output by the user is "today is a fine day". After the user finishes saying "fine day", the speech energy starts to decrease, and the target duration is then extended from the moment at which the user finishes saying "fine day". For instance, the target duration may be 2 times the time for outputting one Chinese character, as determined from the user's current speech rate. The speech energy at the moment the extension ends is the speech energy threshold for this next voice endpoint detection process. If the speech energy of several consecutive frames before the moment the extension ends is greater than the speech energy threshold, and the speech energy of several consecutive frames after that moment is smaller than the speech energy threshold, then the moment the extension ends is the ending endpoint of the user's next speech output.
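The example above can be sketched in code as follows. This is a rough illustration under several assumptions: energies are given per frame, the frame at which the energy starts to decrease (`onset_frame`) has already been located by some other means, and the names `target_duration`, `threshold_and_endpoint`, `frame_seconds` and `confirm_frames` are all hypothetical.

```python
from typing import List, Optional, Tuple


def target_duration(seconds_per_element: float, multiple: float = 2.0) -> float:
    """Preset multiple of the time the user needs for one language element."""
    return seconds_per_element * multiple


def threshold_and_endpoint(
    energies: List[float],
    onset_frame: int,
    frame_seconds: float,
    seconds_per_element: float,
    multiple: float = 2.0,
    confirm_frames: int = 3,
) -> Optional[Tuple[float, int, bool]]:
    """Return (energy_threshold, end_frame, confirmed), or None if the
    extension runs past the available frames."""
    hold = round(target_duration(seconds_per_element, multiple) / frame_seconds)
    end_frame = onset_frame + hold
    if end_frame >= len(energies):
        return None
    # The energy at the moment the extension ends becomes the threshold.
    threshold = energies[end_frame]
    before = energies[max(0, end_frame - confirm_frames):end_frame]
    after = energies[end_frame + 1:end_frame + 1 + confirm_frames]
    # Endpoint confirmed: frames before are above, frames after are below.
    confirmed = (
        len(before) == confirm_frames
        and len(after) == confirm_frames
        and all(e > threshold for e in before)
        and all(e < threshold for e in after)
    )
    return threshold, end_frame, confirmed


if __name__ == "__main__":
    decay = [10, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.5, 0.2, 0.1]
    print(threshold_and_endpoint(decay, 3, 0.1, seconds_per_element=0.2))
    print(threshold_and_endpoint(decay, 3, 0.1, seconds_per_element=0.4))
```

With the same decaying energy curve, the slower speaker (0.4 s per element) gets a later end frame and a lower threshold than the faster speaker (0.2 s per element).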

Equivalently, for users who speak faster, the speech energy threshold of the voice endpoint detection process can be determined within a shorter extension time; for users who speak slower, the speech energy threshold needs to be determined within a relatively longer extension time. This avoids the situation in which, because fast and slow speakers share the same fixed speech energy threshold or the same extension time for determining the speech energy threshold, the terminal mistakenly concludes that a slow speaker's speech output has ended before the speaker has finished. When the user's speech information collected by the terminal is incomplete, the speech recognition result is naturally inaccurate.

Optionally, the method further includes: if the difference between the current speech rate and the historical average speech rate is within the preset difference range, that is, the current speech rate of the user is considered to be the same as the historical average speech rate, determining the speech ending endpoint of the user in the next voice endpoint detection process according to the speech energy threshold used by the current voice endpoint detection. In other words, when the current speech rate has not changed, the speech energy threshold does not need to be adjusted for the next voice endpoint detection process.
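Taken together, S210, S220 and the optional branch above amount to a small per-user state update after each utterance. The sketch below tracks the target duration from which the next threshold is derived; the class name `EndpointState`, its default values and the seconds-per-element unit are assumptions, not details given in the text.

```python
from typing import Tuple


class EndpointState:
    """Per-user state carried from one endpoint detection to the next."""

    def __init__(
        self,
        initial_target_duration: float,
        multiple: float = 2.0,
        diff_range: Tuple[float, float] = (-0.05, 0.05),
    ) -> None:
        self.target_duration = initial_target_duration
        self.multiple = multiple
        self.diff_range = diff_range

    def update(self, current_rate: float, historical_avg: float) -> bool:
        """Apply S210/S220; return True if the setting was adjusted.

        Assumption: rates are in seconds per language element.
        """
        low, high = self.diff_range
        if low <= current_rate - historical_avg <= high:
            # Speech rate unchanged: keep the current setting.
            return False
        # Speech rate changed: derive a new target duration from it.
        self.target_duration = current_rate * self.multiple
        return True


if __name__ == "__main__":
    state = EndpointState(initial_target_duration=0.8)
    print(state.update(0.42, 0.40), state.target_duration)
    print(state.update(0.60, 0.40), state.target_duration)
```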

In the technical scheme of this embodiment, when it is determined that the difference between the current speech rate and the historical average speech rate is not within the preset difference range, then in the next voice endpoint detection process adjacent to the current voice endpoint detection, the target duration is extended from the moment the user's speech energy starts to decrease, and the speech energy at the moment the extension ends is taken as the speech energy threshold. This realizes adaptive, dynamic adjustment of the speech energy threshold according to the speech rates of different users, solves the problem that the accuracy of the speech recognition result is reduced because all people share one set of parameter thresholds in existing voice endpoint detection methods, and realizes personalized detection of the speech ending endpoints of different users. It also avoids mistaken truncation of the user's speech during speech recognition, thereby ensuring the integrity of the acquired user speech information and improving the accuracy of speech recognition.

Example three

Fig. 3 is a schematic structural diagram of a voice endpoint detection apparatus according to a third embodiment of the present invention, which is applicable to detecting a voice endpoint of a user in a voice-based human-computer interaction process. The device can be realized in a software and/or hardware mode, and can be integrated on a terminal with a voice recognition function, such as an intelligent mobile terminal, a vehicle-mounted device and the like.

As shown in fig. 3, the speech endpoint detection apparatus provided in this embodiment may include a speech rate determining module 310 and a speech energy threshold adjusting module 320, where:

a speech rate determining module 310, configured to determine whether a difference between a current speech rate of the user and a historical average speech rate is within a preset difference range;

the voice energy threshold adjusting module 320 is configured to adjust a voice energy threshold of the voice endpoint detection according to the current speech rate if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, so that in a next voice endpoint detection process adjacent to the current voice endpoint detection, a voice ending endpoint of the user is determined according to the adjusted voice energy threshold.

Optionally, the voice energy threshold adjusting module 320 is specifically configured to:

in the next voice endpoint detection process adjacent to the current voice endpoint detection, extend the target duration from the moment the voice energy of the user starts to decrease, and take the voice energy at the moment the extension ends as the voice energy threshold, wherein the target duration is a preset time length determined according to the current speech rate.

Optionally, the apparatus further comprises:

the historical voice counting module is used for counting the historical voice of the user in a preset time period and determining the corresponding time of a starting end point and an ending end point of the historical voice;

and the historical average speech rate determining module is used for determining the historical average speech rate of the user according to the text length of the recognition result of the historical voice and the corresponding time of the starting end point and the ending end point of the historical voice.

Optionally, the apparatus further comprises:

and the voice ending endpoint determining module is used for determining the voice ending endpoint of the user in the next voice endpoint detection process according to the voice energy threshold used by the current voice endpoint detection if the difference between the current speech rate and the historical average speech rate is in the preset difference range.

The voice endpoint detection device provided by the embodiment of the invention can execute the voice endpoint detection method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method. For details not described in this embodiment, reference may be made to the description of any method embodiment of the invention.

Example four

Fig. 4 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary terminal 412 suitable for use in implementing embodiments of the present invention. The terminal 412 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 4, the terminal 412 is represented in the form of a general-purpose terminal. The components of the terminal 412 may include, but are not limited to: one or more processors 416, a storage device 428, and a bus 418 that couples the various system components including the storage device 428 and the processors 416.

Bus 418 represents one or more of any of several types of bus structures, including a memory device bus or memory device controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Terminal 412 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by terminal 412 and includes both volatile and nonvolatile media, removable and non-removable media.

Storage 428 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 430 and/or cache Memory 432. The terminal 412 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 434 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk such as a Compact disk Read-Only Memory (CD-ROM), Digital Video disk Read-Only Memory (DVD-ROM) or other optical media may be provided. In these cases, each drive may be connected to bus 418 by one or more data media interfaces. Storage 428 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 440 having a set (at least one) of program modules 442 may be stored, for instance, in storage 428, such program modules 442 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. The program modules 442 generally perform the functions and/or methodologies of the described embodiments of the invention.

The terminal 412 may also communicate with one or more external devices 414 (e.g., keyboard, pointing device, display 424, etc.), with one or more devices that enable a user to interact with the terminal 412, and/or with any device (e.g., network card, modem, etc.) that enables the terminal 412 to communicate with one or more other computing terminals. Such communication may occur via input/output (I/O) interfaces 422. Also, the terminal 412 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 420. As shown in fig. 4, the network adapter 420 communicates with the other modules of the terminal 412 over the bus 418. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the terminal 412, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems, among others.

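The processor 416 executes various functional applications and data processing by running programs stored in the storage 428, for example, implementing a voice endpoint detection method provided by any embodiment of the present invention, which may include:

determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;

and if the difference between the current speech rate and the historical average speech rate is not in the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.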

Example five

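An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a voice endpoint detection method according to any embodiment of the present invention, where the method may include:

determining whether the difference between the current speech rate of the user and the historical average speech rate is in a preset difference range;

and if the difference between the current speech rate and the historical average speech rate is not in the preset difference range, adjusting the speech energy threshold value of the speech endpoint detection according to the current speech rate, so that the speech ending endpoint of the user is determined according to the adjusted speech energy threshold value in the next speech endpoint detection process adjacent to the current speech endpoint detection.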

Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for voice endpoint detection, comprising:

if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, in the next speech endpoint detection process adjacent to the current speech endpoint detection, extending the target duration based on the moment when the user speech energy begins to decrease, and taking the speech energy corresponding to the moment when the duration extension ends as a speech energy threshold, wherein the target duration is a preset time length determined according to the current speech rate.

2. The method of claim 1, further comprising:

counting historical voices of a user in a preset time period, and determining moments corresponding to a starting endpoint and an ending endpoint of the historical voices respectively;

and determining the historical average speech rate of the user according to the text length of the recognition result of the historical speech and the times corresponding to the starting endpoint and the ending endpoint of the historical speech.

3. The method of claim 1, further comprising:

and if the difference between the current speech rate and the historical average speech rate is within the preset difference range, determining the speech ending endpoint of the user in the next speech endpoint detection process according to the speech energy threshold used by the current speech endpoint detection.

4. A voice endpoint detection apparatus, comprising:

and the voice energy threshold adjusting module is used for, if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, extending the target duration based on the moment when the speech energy of the user begins to decrease in the next voice endpoint detection process adjacent to the current voice endpoint detection, and taking the speech energy corresponding to the moment when the duration extension ends as the speech energy threshold, wherein the target duration is a preset time length determined according to the current speech rate.

5. The apparatus of claim 4, further comprising:

the historical voice counting module is used for counting the historical voice of a user in a preset time period and determining the time corresponding to the starting end point and the ending end point of the historical voice respectively;

and the historical average speed determining module is used for determining the historical average speech rate of the user according to the text length of the recognition result of the historical speech and the times corresponding to the starting endpoint and the ending endpoint of the historical speech.

6. The apparatus of claim 4, further comprising:

and the voice ending endpoint determining module is used for determining the voice ending endpoint of the user in the next voice endpoint detection process according to the speech energy threshold used by the current voice endpoint detection if the difference between the current speech rate and the historical average speech rate is within the preset difference range.

7. A terminal, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the voice endpoint detection method of any of claims 1-3.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for voice endpoint detection according to any one of claims 1 to 3.