CN109767792A - Voice endpoint detection method, device, terminal and storage medium

Info

Publication number: CN109767792A
Application number: CN201910204567.3A
Authority: CN
Inventors: 欧阳能钧; 贺学焱; 彭汉迎
Original assignee: Baidu International Technology Shenzhen Co Ltd
Current assignee: Apollo Zhilian Beijing Technology Co Ltd
Priority date: 2019-03-18
Filing date: 2019-03-18
Publication date: 2019-05-17
Anticipated expiration: 2039-03-18
Also published as: CN109767792B

Abstract

Embodiments of the invention disclose a voice endpoint detection method, device, terminal and storage medium. The method comprises: determining whether the difference between a user's current speaking rate and historical average speaking rate is within a preset difference range; and, if the difference is not within the preset difference range, adjusting the speech energy threshold used for voice endpoint detection according to the current speaking rate, so that in the next voice endpoint detection following the current one, the user's speech end point is determined according to the adjusted speech energy threshold. Embodiments of the invention solve the problem in existing voice endpoint detection methods that sharing parameter thresholds among all users reduces the accuracy of speech recognition results, achieve personalized detection of speech end points for different users, and improve the accuracy of speech recognition.

Description

Voice endpoint detection method, device, terminal and storage medium

Technical field

Embodiments of the present invention relate to the field of computer technology, and in particular to a voice endpoint detection method, device, terminal and storage medium.

Background art

Traditional voice endpoint detection (Voice Activity Detection, VAD) algorithms mainly rely on indicators such as the zero-crossing rate and the volume level to detect whether an utterance has ended. If, in the captured voice stream, the speech energy of a preceding run of a preset number M0 of consecutive frames is below a pre-specified energy threshold Elow, and the energy of the next M0 consecutive frames is greater than Elow, then the point where the speech energy rises is the start point of the speech. Likewise, if the speech energy of several consecutive frames is high and the energy of the subsequent frames becomes lower, that is, falls below a pre-specified energy threshold Ehigh, and stays there for a certain duration, then the point where the speech energy drops is considered the end point of the speech.
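The fixed-threshold scheme described above can be sketched as follows. This is a minimal illustration only, assuming per-frame energies are already available; the function name `detect_endpoints` and the parameter `hold` (the number of frames standing in for the "certain duration") are hypothetical and do not come from the patent.

```python
from typing import List, Optional, Tuple


def detect_endpoints(
    energies: List[float],
    e_low: float,
    e_high: float,
    m0: int,
    hold: int,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (start_frame, end_frame) for one utterance; None where not found."""
    n = len(energies)
    start = None
    # Start point: m0 quiet frames (below e_low) followed by m0 loud frames (above e_low).
    for i in range(m0, n - m0 + 1):
        quiet = all(e < e_low for e in energies[i - m0:i])
        loud = all(e > e_low for e in energies[i:i + m0])
        if quiet and loud:
            start = i
            break
    if start is None:
        return None, None
    # End point: first frame after the loud run from which the energy
    # stays below e_high for `hold` consecutive frames (assumed duration).
    for j in range(start + m0, n - hold + 1):
        if all(e < e_high for e in energies[j:j + hold]):
            return start, j
    return start, None
```

Because `e_low`, `e_high` and `hold` are the same for every speaker in this sketch, a slow speaker who pauses for longer than `hold` frames between words would be cut off, which is the problem the following paragraph describes.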

However, everyone speaks at a different rate: some people speak quickly and others speak slowly. If the same VAD parameter thresholds are used for all users during voice endpoint detection, speech recognition will work well for some people, while others will frequently be cut off by mistake. For example, a slow speaker may not yet have finished a sentence when a traditional VAD algorithm decides that the sentence is over, which makes speech recognition inaccurate.

Summary of the invention

Embodiments of the present invention provide a voice endpoint detection method, device, terminal and storage medium, so as to achieve personalized detection of speech end points for different users and improve the accuracy of speech recognition.

In a first aspect, an embodiment of the present invention provides a voice endpoint detection method, the method comprising:

determining whether the difference between a user's current speaking rate and historical average speaking rate is within a preset difference range;

if the difference between the current speaking rate and the historical average speaking rate is not within the preset difference range, adjusting the speech energy threshold for voice endpoint detection according to the current speaking rate, so that in the next voice endpoint detection following the current voice endpoint detection, the user's speech end point is determined according to the adjusted speech energy threshold.

In a second aspect, an embodiment of the present invention further provides a voice endpoint detection device, the device comprising:

a speaking rate determining module, configured to determine whether the difference between a user's current speaking rate and historical average speaking rate is within a preset difference range;

a speech energy threshold adjustment module, configured to, if the difference between the current speaking rate and the historical average speaking rate is not within the preset difference range, adjust the speech energy threshold for voice endpoint detection according to the current speaking rate, so that in the next voice endpoint detection following the current voice endpoint detection, the user's speech end point is determined according to the adjusted speech energy threshold.

In a third aspect, an embodiment of the present invention further provides a terminal, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the voice endpoint detection method described in any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the program implementing the voice endpoint detection method described in any embodiment of the present invention when executed by a processor.

Embodiments of the present invention judge whether the difference between the user's current speaking rate and historical average speaking rate is within a preset difference range and, if it is not, adjust the speech energy threshold for the next voice endpoint detection according to the current speaking rate. This achieves adaptive, dynamic adjustment of the speech energy threshold for different users during voice endpoint detection, solves the problem in existing voice endpoint detection methods that sharing parameter thresholds reduces the accuracy of speech recognition results, achieves personalized detection of speech end points for different users, and improves the accuracy of speech recognition.

Brief description of the drawings

Fig. 1 is a flowchart of the voice endpoint detection method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the voice endpoint detection method provided by Embodiment 2 of the present invention;

Fig. 3 is a structural schematic diagram of the voice endpoint detection device provided by Embodiment 3 of the present invention;

Fig. 4 is a structural schematic diagram of a terminal provided by Embodiment 4 of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.

Embodiment one

Fig. 1 is a flowchart of the voice endpoint detection method provided by Embodiment 1 of the present invention. This embodiment is applicable to detecting a user's speech endpoints during voice-based human-computer interaction. The method may be executed by a voice endpoint detection device, which may be implemented in software and/or hardware and may be integrated in a terminal with a speech recognition function, such as a smart mobile terminal or an in-vehicle device.

As shown in Fig. 1, the voice endpoint detection method provided by this embodiment may include:

S110. Determine whether the difference between the user's current speaking rate and historical average speaking rate is within a preset difference range.

During voice interaction between the user and the terminal, the terminal may call a voice capture device, such as a microphone, to acquire the user's speech in real time, recognize it using speech recognition technology, and record the user's speaking rate in each interaction. The user's historical average speaking rate can be determined from the multiple recorded speaking rates, and it reflects the user's overall speaking rate level. Comparing the current speaking rate with the historical average speaking rate shows whether the user's current speaking rate has changed, which in turn determines whether the speech energy threshold used in voice endpoint detection needs to be adjusted.

Optionally, the user's historical average speaking rate is determined as follows:

collecting the user's historical speech within a preset time period, and determining the moments corresponding to the start point and the end point of the historical speech;

determining the user's historical average speaking rate from the text length of the recognition result of the historical speech and the moments corresponding to its start point and end point.

The text length of a speech recognition result is the number of language elements contained in the user's speech. For example, for Chinese, the text length may be the number of Chinese characters in the speech; for English, it may be the number of words. From the moment t1 corresponding to the start point of each historical utterance and the moment t2 corresponding to its end point, the effective duration t2-t1 of that utterance can be determined. Combined with the text length len of its recognition result, this gives the speaking rate at which the user produced that utterance, expressed either as (t2-t1)/len (time per language element) or as len/(t2-t1) (language elements per unit time). Averaging the speaking rates of all historical utterances within the preset time period gives the user's historical average speaking rate for that period.
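The computation above can be sketched as follows. This is an illustrative sketch only: the `Utterance` record and both function names are hypothetical, and the len/(t2-t1) form (language elements per second) is chosen here for concreteness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Utterance:
    t1: float    # moment of the start point, in seconds
    t2: float    # moment of the end point, in seconds
    length: int  # text length (len) of the recognition result


def speaking_rate(u: Utterance) -> float:
    """Speaking rate as len / (t2 - t1), in language elements per second."""
    duration = u.t2 - u.t1
    if duration <= 0:
        raise ValueError("end point must come after start point")
    return u.length / duration


def historical_average_rate(history: Sequence[Utterance]) -> float:
    """Average the per-utterance rates over the preset time period."""
    if not history:
        raise ValueError("no historical speech in the period")
    return sum(speaking_rate(u) for u in history) / len(history)
```

Note that this averages the per-utterance rates, as the text describes, rather than dividing the total text length by the total duration; the two can differ when utterances have different lengths.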

The preset time period includes the speech statistics period closest to the current time; its specific length can be set as appropriate and is not limited by this embodiment. Using the historical average speaking rate of the statistics period closest to the current time as the reference for judging whether the user's current speaking rate has changed gives the adjustment of the speech energy threshold a more relevant basis, ensures that the adjustment is both necessary and accurate, and thus safeguards the accuracy of speech recognition. The user's current speaking rate is determined in the same way as the historical speaking rate, and the details are not repeated here.

This embodiment does not limit the specific value of the preset difference range, which can be set according to speech detection requirements to any numeric range that includes zero. For example, when the preset difference range is set to 0, whether to adjust the speech energy threshold is in effect decided by the absolute difference between the user's current speaking rate and historical average speaking rate; when the preset difference range is set to a non-zero range, for example 50 milliseconds, the decision is in effect based on the relative difference between the two. Specifically, when the difference between the user's current speaking rate and historical average speaking rate is within the preset difference range, the current speaking rate is considered the same as the historical average speaking rate; when the difference is not within the preset difference range, the two are considered different. Deciding on the basis of a relative difference avoids frequent adjustment of the speech energy threshold caused by slight fluctuations in the user's speaking rate and reduces the terminal's resource consumption.
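A minimal sketch of the range check follows. The representation of the preset difference range as a `(low, high)` pair and the function name `rate_has_changed` are assumptions made for illustration; the patent only requires that the range include zero.

```python
from typing import Tuple


def rate_has_changed(
    current_rate: float,
    historical_avg_rate: float,
    preset_range: Tuple[float, float],
) -> bool:
    """True if the rate difference falls outside the preset difference range."""
    low, high = preset_range
    if not (low <= 0.0 <= high):
        raise ValueError("the preset difference range must include zero")
    diff = current_rate - historical_avg_rate
    return not (low <= diff <= high)
```

With `preset_range=(0.0, 0.0)` any non-zero difference counts as a change; a wider range tolerates small fluctuations, which is the case the text recommends for avoiding frequent threshold adjustments.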

S120. If the difference between the current speaking rate and the historical average speaking rate is not within the preset difference range, adjust the speech energy threshold for voice endpoint detection according to the current speaking rate, so that in the next voice endpoint detection following the current voice endpoint detection, the user's speech end point is determined according to the adjusted speech energy threshold.

If the difference between the user's current speaking rate and historical average speaking rate is not within the preset difference range, the two are different and the user's current speaking rate has changed. The speech energy threshold then needs to be adjusted according to the current speaking rate, so that the adjusted threshold is used to determine the user's speech end point in the next voice endpoint detection. This achieves adaptive, dynamic adjustment of the speech energy threshold for the same user at different speaking rates, and the adjusted threshold fits that user's speaking characteristics more closely. For example, if the user speaks quickly, the speech energy threshold can be raised appropriately in the next voice endpoint detection; if the user speaks slowly, it can be lowered. Each determination of the speech energy threshold depends on the result of comparing the immediately preceding speaking rate with the historical average speaking rate. As voice interaction between the user and the terminal continues, the speech energy threshold gradually stabilizes along with the user's speaking rate and eventually converges to the threshold best suited to voice endpoint detection for that user.
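The direction of the adjustment (raise for faster speech, lower for slower speech) can be sketched as below. The patent does not specify how large the adjustment is, so the multiplicative `step` and the function name `adjust_energy_threshold` are purely hypothetical; rates are assumed to be in language elements per second, and the caller is assumed to have already established that the difference lies outside the preset difference range.

```python
def adjust_energy_threshold(
    threshold: float,
    current_rate: float,
    historical_avg_rate: float,
    step: float = 0.1,
) -> float:
    """Return the speech energy threshold to use in the next endpoint detection."""
    if not 0.0 < step < 1.0:
        raise ValueError("step must be between 0 and 1")
    if current_rate > historical_avg_rate:
        # Faster than usual: raise the threshold (assumed proportional step).
        return threshold * (1.0 + step)
    if current_rate < historical_avg_rate:
        # Slower than usual: lower the threshold.
        return threshold * (1.0 - step)
    return threshold
```

Applied once per interaction, a rule of this kind stops changing the threshold as soon as the speaking rate settles within the preset range of its historical average, which matches the stabilizing behaviour described above.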

Because speaking rate characteristics differ between users, the dynamic adjustment of the speech energy threshold during voice endpoint detection produces different results for different users. Each user thus has a speech energy threshold that matches their own speaking rate characteristics, instead of all users being handled with the same speech energy threshold for end point detection without distinction, as in the prior art. This achieves personalized detection of speech end points for different users and improves the accuracy of the speech recognition result for each user.

The technical solution of this embodiment judges whether the difference between the user's current speaking rate and historical average speaking rate is within a preset difference range and, if it is not, adjusts the speech energy threshold for the next voice endpoint detection according to the current speaking rate. This achieves adaptive, dynamic adjustment of the speech energy threshold for different users during voice endpoint detection, solves the problem in existing voice endpoint detection methods that having everyone share one set of parameter thresholds reduces the accuracy of speech recognition results, achieves personalized detection of speech end points for different users, avoids mistakenly truncating user speech during speech recognition, ensures the completeness of the acquired user speech information, and improves the accuracy of speech recognition. In addition, compared with determining a user's speech energy threshold through model training, dynamically adjusting the threshold based on changes in the user's speaking rate is easier to implement in products, involves no interaction with a remote server, and consumes less terminal power; products with accurate speech recognition built on this scheme are therefore also easier to popularize.

Embodiment two

Fig. 2 is a flowchart of the voice endpoint detection method provided by Embodiment 2 of the present invention. This embodiment further refines the above embodiment. As shown in Fig. 2, the method may include:

S210. Determine whether the difference between the user's current speaking rate and historical average speaking rate is within a preset difference range.

S220. If the difference between the current speaking rate and the historical average speaking rate is not within the preset difference range, then in the next voice endpoint detection following the current voice endpoint detection, extend by a target duration from the moment the user's speech energy begins to decrease, and take the speech energy at the moment the extension ends as the speech energy threshold, where the target duration is a preset time length determined according to the current speaking rate.

Whether the user's speaking rate is expressed as the ratio of the effective speech duration to the text length of the recognition result, (t2-t1)/len, or as the ratio of the text length of the recognition result to the effective speech duration, len/(t2-t1), the time the user currently takes to produce one language element can be determined from the current speaking rate, for example the time the user currently takes to say one Chinese character or one English word. In the next voice endpoint detection, when the speech energy is detected to begin decreasing, a timed extension can be applied based on the time the current user takes to produce one language element. For example, the total time corresponding to a preset multiple of the time the user takes to produce one language element is used as the target duration. The preset multiple can be set according to the accuracy or sensitivity requirements of voice endpoint detection, and it may be the same or different for different users.
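As a small worked sketch of the target duration: assuming the current speaking rate is given as language elements per second, the time per element is its reciprocal, and the target duration is that time multiplied by the preset multiple. The function name `target_duration` and the default multiple of 2 are illustrative assumptions.

```python
def target_duration(elements_per_second: float, multiple: float = 2.0) -> float:
    """Target duration in seconds: preset multiple x time per language element."""
    if elements_per_second <= 0 or multiple <= 0:
        raise ValueError("rate and multiple must be positive")
    time_per_element = 1.0 / elements_per_second
    return multiple * time_per_element
```

For a user speaking 4 elements per second with a multiple of 2, this gives 0.5 seconds; for a slower user at 2 elements per second, it gives 1.0 second, so slower speakers get a longer extension.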

As an example, suppose that in the next voice endpoint detection following the current one, the speech the user is about to produce is "today is a sunny day". After the user finishes saying "sunny day", the speech energy begins to decrease, and an extension of the target duration is applied from the moment the user finishes saying "sunny day". For example, the target duration is 2 times the time taken to produce one Chinese character, as determined from the user's current speaking rate, and the speech energy at the moment the extension ends is the speech energy threshold for this next voice endpoint detection. If the speech energy of several consecutive frames before the moment the extension ends is greater than this speech energy threshold, and the speech energy of several consecutive frames after that moment is less than this threshold, then the moment the extension ends is the end point of the user's next speech output.
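The end point check in this example can be sketched on per-frame energies as follows. This is a hedged illustration, not the patent's implementation: `drop_frame` (the frame at which energy starts to decrease), `target_frames` (the target duration converted to frames) and `check_frames` (the "several consecutive frames") are hypothetical parameters.

```python
from typing import List, Optional


def find_end_point(
    energies: List[float],
    drop_frame: int,
    target_frames: int,
    check_frames: int,
) -> Optional[int]:
    """Return the frame where the extension ends if it qualifies as the end point."""
    candidate = drop_frame + target_frames
    # Need check_frames frames on both sides of the candidate frame.
    if candidate - check_frames < 0 or candidate + check_frames >= len(energies):
        return None
    # The energy at the end of the extension serves as the speech energy threshold.
    threshold = energies[candidate]
    before = energies[candidate - check_frames:candidate]
    after = energies[candidate + 1:candidate + 1 + check_frames]
    if all(e > threshold for e in before) and all(e < threshold for e in after):
        return candidate
    return None
```

If the user resumes speaking during or just after the extension, the frames after the candidate are not all below the threshold and the function returns `None`, so the utterance is not truncated at that point.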

In effect, for a user who speaks faster, the speech energy threshold for voice endpoint detection can be determined within a relatively short delay; for a user who speaks slower, it needs to be determined within a relatively longer delay. This avoids using the same fixed speech energy threshold, or the same delay for determining the threshold, for both fast and slow speakers, which would cause the terminal to mistake a slow speaker's unfinished utterance for a finished one during voice endpoint detection. When that happens, the user speech information acquired by the terminal is incomplete, and the speech recognition result is naturally inaccurate as well.

Optionally, the method further includes: if the difference between the current speaking rate and the historical average speaking rate is within the preset difference range, the user's current speaking rate is considered the same as the historical average speaking rate, and the user's speech end point in the next voice endpoint detection is determined according to the speech energy threshold used in the current voice endpoint detection. That is, when the current speaking rate has not changed, the speech energy threshold does not need to be adjusted for the next voice endpoint detection.

In the technical solution of this embodiment, when the difference between the current speaking rate and the historical average speaking rate is determined not to be within the preset difference range, then in the next voice endpoint detection following the current one, an extension of the target duration is applied from the moment the user's speech energy begins to decrease, and the speech energy at the moment the extension ends is taken as the speech energy threshold. This achieves adaptive, dynamic adjustment of the speech energy threshold for different users according to their speaking rates, solves the problem in existing voice endpoint detection methods that having everyone share one set of parameter thresholds reduces the accuracy of speech recognition results, achieves personalized detection of speech end points for different users, avoids mistakenly truncating user speech during speech recognition, ensures the completeness of the acquired user speech information, and improves the accuracy of speech recognition.

Embodiment three

Fig. 3 is a structural schematic diagram of the voice endpoint detection device provided by Embodiment 3 of the present invention. This embodiment is applicable to detecting a user's speech endpoints during voice-based human-computer interaction. The device may be implemented in software and/or hardware and may be integrated in a terminal with a speech recognition function, such as a smart mobile terminal or an in-vehicle device.

As shown in Fig. 3, the voice endpoint detection device provided by this embodiment may include a speaking rate determining module 310 and a speech energy threshold adjustment module 320, where:

the speaking rate determining module 310 is configured to determine whether the difference between the user's current speaking rate and historical average speaking rate is within a preset difference range;

the speech energy threshold adjustment module 320 is configured to, if the difference between the current speaking rate and the historical average speaking rate is not within the preset difference range, adjust the speech energy threshold for voice endpoint detection according to the current speaking rate, so that in the next voice endpoint detection following the current voice endpoint detection, the user's speech end point is determined according to the adjusted speech energy threshold.

Optionally, the speech energy threshold adjustment module 320 is specifically configured to:

in the next voice endpoint detection following the current voice endpoint detection, extend by a target duration from the moment the user's speech energy begins to decrease, and take the speech energy at the moment the extension ends as the speech energy threshold, where the target duration is a preset time length determined according to the current speaking rate.

Optionally, the device further includes:

a historical speech statistics module, configured to collect the user's historical speech within a preset time period and determine the moments corresponding to the start point and the end point of the historical speech;

a historical average speaking rate determining module, configured to determine the user's historical average speaking rate from the text length of the recognition result of the historical speech and the moments corresponding to its start point and end point.

Optionally, the device further includes:

a speech end point determining module, configured to, if the difference between the current speaking rate and the historical average speaking rate is within the preset difference range, determine the user's speech end point in the next voice endpoint detection according to the speech energy threshold used in the current voice endpoint detection.

The voice endpoint detection device provided by this embodiment of the present invention can execute the voice endpoint detection method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method. For content not described in detail in this embodiment, refer to the description in any method embodiment of the present invention.

Embodiment four

Fig. 4 is a structural schematic diagram of a terminal provided by Embodiment 4 of the present invention. Fig. 4 shows a block diagram of an exemplary terminal 412 suitable for implementing embodiments of the present invention. The terminal 412 shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of embodiments of the present invention.

As shown in Fig. 4, the terminal 412 takes the form of a general-purpose terminal. Its components may include, but are not limited to: one or more processors 416, a storage device 428, and a bus 418 connecting the different system components (including the storage device 428 and the processors 416).

Bus 418 represents one or more of several types of bus structures, including a storage device bus or storage device controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The terminal 412 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the terminal 412, including volatile and non-volatile media and removable and non-removable media.

The storage device 428 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. The terminal 412 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 434 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 4, commonly called a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") may be provided, as may an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media). In these cases, each drive may be connected to the bus 418 through one or more data media interfaces. The storage device 428 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.

A program/utility 440 having a set of (at least one) program modules 442 may be stored, for example, in the storage device 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 442 generally perform the functions and/or methods of the embodiments described in the present invention.

The terminal 412 may also communicate with one or more external devices 414 (such as a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the terminal 412, and/or with any device (such as a network card or a modem) that enables the terminal 412 to communicate with one or more other computing terminals. Such communication may take place through an input/output (I/O) interface 422. The terminal 412 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 420. As shown in Fig. 4, the network adapter 420 communicates with the other modules of the terminal 412 through the bus 418. It should be understood that, although not shown in the drawings, other hardware and/or software modules may be used in conjunction with the terminal 412, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.

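By running the programs stored in the storage device 428, the processor 416 executes various functional applications and data processing, for example implementing the voice endpoint detection method provided by any embodiment of the present invention. The method may include:

determining whether the difference between a user's current speaking rate and historical average speaking rate is within a preset difference range;

if the difference between the current speaking rate and the historical average speaking rate is not within the preset difference range, adjusting the speech energy threshold for voice endpoint detection according to the current speaking rate, so that in the next voice endpoint detection following the current voice endpoint detection, the user's speech end point is determined according to the adjusted speech energy threshold.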

Embodiment five

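Embodiment 5 of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the voice endpoint detection method provided by any embodiment of the present invention. The method may include:

determining whether the difference between a user's current speaking rate and historical average speaking rate is within a preset difference range;

if the difference between the current speaking rate and the historical average speaking rate is not within the preset difference range, adjusting the speech energy threshold for voice endpoint detection according to the current speaking rate, so that in the next voice endpoint detection following the current voice endpoint detection, the user's speech end point is determined according to the adjusted speech energy threshold.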

The computer storage medium of the embodiment of the present invention, can be using any of one or more computer-readable media Combination.Computer-readable medium can be computer-readable signal media or computer readable storage medium.It is computer-readable Storage medium for example may be-but not limited to-the system of electricity, magnetic, optical, electromagnetic, infrared ray or semiconductor, device or Device, or any above combination.The more specific example (non exhaustive list) of computer readable storage medium includes: tool There are electrical connection, the portable computer diskette, hard disk, random access memory (RAM), read-only memory of one or more conducting wires (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD- ROM), light storage device, magnetic memory device or above-mentioned any appropriate combination.In this document, computer-readable storage Medium can be any tangible medium for including or store program, which can be commanded execution system, device or device Using or it is in connection.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.

The program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

The computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or terminal. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A voice endpoint detection method, characterized by comprising:

If the difference between the current speech rate and the historical average speech rate is not within the preset difference range, adjusting the speech energy threshold for voice endpoint detection according to the current speech rate, so that during the next voice endpoint detection adjacent to the current voice endpoint detection, the end endpoint of the user's speech is determined according to the adjusted speech energy threshold.
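The decision described in this claim can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the signed (low, high) form of the preset difference range, and the caller-supplied `adjust` callback are all assumptions introduced here, since the claim does not fix how the difference is measured or how the threshold is recomputed.

```python
def needs_threshold_adjustment(current_rate, historical_avg_rate, preset_range):
    """Return True when the speech-rate difference falls outside the preset range.

    Assumption: the difference is signed (current minus historical average)
    and the preset range is an inclusive (low, high) pair.
    """
    low, high = preset_range
    diff = current_rate - historical_avg_rate
    return not (low <= diff <= high)


def threshold_for_next_detection(current_rate, historical_avg_rate,
                                 preset_range, current_threshold, adjust):
    """Pick the speech energy threshold for the next endpoint detection.

    `adjust` is a hypothetical callback (current_rate, current_threshold) -> float
    standing in for whatever rate-dependent adjustment is used.
    """
    if needs_threshold_adjustment(current_rate, historical_avg_rate, preset_range):
        return adjust(current_rate, current_threshold)
    # Difference is within range: keep the threshold used by the current detection.
    return current_threshold
```

Passing the adjustment as a callback keeps the range check separate from the rate-dependent threshold rule, which the claim leaves open.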

2. The method according to claim 1, characterized in that adjusting the speech energy threshold for voice endpoint detection according to the current speech rate, so that during the next voice endpoint detection adjacent to the current voice endpoint detection the end endpoint of the user's speech is determined according to the adjusted speech energy threshold, comprises:

During the next voice endpoint detection adjacent to the current voice endpoint detection, extending by a target duration from the moment at which the user's speech energy begins to decrease, and taking the speech energy corresponding to the moment at which the extension ends as the speech energy threshold, wherein the target duration is a preset time length determined according to the current speech rate.
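As a rough sketch of this step, assume the speech is available as a list of per-frame energy values with a fixed frame length, and that the frame index where the energy begins to decrease has already been located. The function name, the millisecond parameters, and the clamping to the last frame are assumptions made for illustration; the claim itself does not specify them, nor the mapping from speech rate to target duration.

```python
def threshold_from_extension(frame_energies, frame_ms, decrease_start_index, target_ms):
    """Return the energy at the end of the extension as the new threshold.

    frame_energies: per-frame speech energy values (assumed representation).
    frame_ms: duration of one frame in milliseconds.
    decrease_start_index: frame index at which the energy begins to decrease.
    target_ms: target duration, assumed to be chosen from the current speech rate.
    """
    if not frame_energies:
        raise ValueError("frame_energies must not be empty")
    # Convert the target duration into a whole number of frames.
    extension_frames = round(target_ms / frame_ms)
    # Clamp so the extension never runs past the last available frame.
    end_index = min(decrease_start_index + extension_frames, len(frame_energies) - 1)
    return frame_energies[end_index]
```

With this representation, a longer target duration reaches further into the decaying tail of the utterance and therefore yields a lower threshold.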

3. The method according to claim 1 or 2, characterized in that the method further comprises:

Collecting statistics on the user's historical speech within a preset time period, and determining the moments corresponding respectively to the start endpoint and the end endpoint of the historical speech;

Determining the user's historical average speech rate according to the text length of the recognition result of the historical speech and the moments corresponding respectively to the start endpoint and the end endpoint of the historical speech.
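One way to compute such an average is sketched below. The `Utterance` record and the choice to divide total recognized text length by total speaking time (rather than averaging per-utterance rates) are assumptions made for illustration; the claim only states which quantities the average is derived from.

```python
from typing import NamedTuple, Sequence


class Utterance(NamedTuple):
    """One historical utterance (hypothetical record layout)."""
    text_length: int      # number of characters in the recognition result
    start_seconds: float  # moment of the start endpoint
    end_seconds: float    # moment of the end endpoint


def historical_average_rate(utterances: Sequence[Utterance]) -> float:
    """Return characters per second over all historical utterances."""
    total_chars = sum(u.text_length for u in utterances)
    total_seconds = sum(u.end_seconds - u.start_seconds for u in utterances)
    if total_seconds <= 0:
        raise ValueError("historical speech must have a positive total duration")
    return total_chars / total_seconds
```

For example, two utterances of 10 characters over 2 seconds and 20 characters over 4 seconds give an average of 5 characters per second under this definition.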

4. The method according to claim 1, characterized in that the method further comprises:

If the difference between the current speech rate and the historical average speech rate is within the preset difference range, determining the end endpoint of the user's speech during the next voice endpoint detection according to the speech energy threshold used by the current voice endpoint detection.

5. A voice endpoint detection device, characterized by comprising:

A speech rate determining module, configured to determine whether the difference between the user's current speech rate and historical average speech rate is within a preset difference range;

A speech energy threshold adjustment module, configured to, if the difference between the current speech rate and the historical average speech rate is not within the preset difference range, adjust the speech energy threshold for voice endpoint detection according to the current speech rate, so that during the next voice endpoint detection adjacent to the current voice endpoint detection, the end endpoint of the user's speech is determined according to the adjusted speech energy threshold.

6. The device according to claim 5, characterized in that the speech energy threshold adjustment module is specifically configured to:

7. The device according to claim 5 or 6, characterized in that the device further comprises:

A historical speech statistics module, configured to collect statistics on the user's historical speech within a preset time period, and to determine the moments corresponding respectively to the start endpoint and the end endpoint of the historical speech;

A historical average speech rate determining module, configured to determine the user's historical average speech rate according to the text length of the recognition result of the historical speech and the moments corresponding respectively to the start endpoint and the end endpoint of the historical speech.

8. The device according to claim 5, characterized in that the device further comprises:

A speech end endpoint determining module, configured to, if the difference between the current speech rate and the historical average speech rate is within the preset difference range, determine the end endpoint of the user's speech during the next voice endpoint detection according to the speech energy threshold used by the current voice endpoint detection.

9. A terminal, characterized by comprising:

One or more processors;

A storage device, configured to store one or more programs,

When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the voice endpoint detection method according to any one of claims 1-4.

10. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the voice endpoint detection method according to any one of claims 1-4.