CN108962226A - Method and apparatus for detecting the endpoint of voice - Google Patents
- Publication number
- CN108962226A (Application No. CN201810792887.0A)
- Authority
- CN
- China
- Prior art keywords
- audio frame
- audio
- type
- voice
- sound
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Abstract
Embodiments of the present application disclose a method and apparatus for detecting the endpoints of speech. One specific embodiment of the method includes: generating an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type; for each speech-type audio frame in the sequence, determining location information of the sound source at the time the sound corresponding to that frame was emitted; and determining the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and the location information corresponding to the audio frames in the sequence. This embodiment provides a new way of detecting the endpoints of speech.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, specifically to the field of Internet technology, and more particularly to a method and apparatus for detecting the endpoints of speech.
Background technique
With the development of artificial intelligence, new kinds of smart devices (such as smart speakers and interactive robots) have begun to emerge, voice interaction as a novel human-machine interaction technology is gradually being accepted by the public, and the importance of speech recognition technology is increasingly prominent. Speech endpoint detection, which finds the starting point and tail point of speech in continuous audio data, is an important component of a speech recognition system, and its accuracy affects the accuracy of speech recognition.
Summary of the invention
Embodiments of the present application propose a method and apparatus for detecting the endpoints of speech.
In a first aspect, an embodiment of the present application provides a method for detecting the endpoints of speech, the method comprising: generating an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type; for each speech-type audio frame in the sequence, determining location information of the sound source at the time the sound corresponding to that frame was emitted; and determining the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
In some embodiments, generating the audio frame sequence based on the acquired audio data comprises: determining effective audio data in the audio data according to acoustic energy; performing moving-window framing on the effective audio data to obtain the audio frame sequence; and performing speech detection on the audio frames in the sequence to determine the audio frame type corresponding to each frame.
In some embodiments, performing speech detection on the audio frames in the sequence to determine the audio frame type corresponding to each frame comprises: for each audio frame in the sequence, extracting audio feature values of a predefined type from the frame; and importing the feature values extracted from the frame into a pre-established speech detection model to generate the audio frame type, where the speech detection model characterizes the correspondence between audio feature values and audio frame types.
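The per-frame feature extraction and classification described above can be sketched as follows. The disclosure does not fix the feature types or the form of the model, so short-time energy and zero-crossing rate stand in for the "predefined-type audio feature values", and a simple thresholding rule stands in for the pre-established speech detection model; the threshold values are illustrative assumptions.

```python
import math

def frame_features(frame):
    """Illustrative predefined-type feature values for one audio frame:
    short-time energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)) / len(frame)
    return energy, zcr

def detect_type(features, energy_thresh=0.01, zcr_thresh=0.5):
    """Stand-in for the pre-established speech detection model: maps a
    frame's feature values to its audio frame type."""
    energy, zcr = features
    return "speech" if energy > energy_thresh and zcr < zcr_thresh else "non-speech"

# a loud, low-zero-crossing frame (voiced-like 100 Hz tone at 8 kHz) vs. silence
voiced = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(160)]
silent = [0.0] * 160
print(detect_type(frame_features(voiced)))  # speech
print(detect_type(frame_features(silent)))  # non-speech
```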
In some embodiments, the speech detection model is established by the following steps: obtaining an audio data set in which each piece of audio data corresponds to an audio frame type; extracting, for the audio data in the set, audio feature values of the predefined type as training samples to generate a training sample set, where each training sample corresponds to an audio frame type; and training an initial neural network using the training samples of the training sample set as input and the audio frame types corresponding to the input training samples as desired output, to obtain the speech detection model.
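A minimal sketch of this supervised training step, with a single-layer logistic classifier standing in for the "initial neural network" (the disclosure contemplates actual neural networks; the toy (energy, zero-crossing-rate) feature values and labels below are assumptions for illustration):

```python
import math

def train_detector(samples, labels, lr=0.5, epochs=500):
    """Fit a minimal logistic classifier on (feature values, type) pairs,
    standing in for training the initial neural network.
    samples: list of feature tuples; labels: 1 = speech, 0 = non-speech."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                      # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "speech" if z > 0 else "non-speech"

# toy training sample set: (energy, zero-crossing rate) feature values
samples = [(0.5, 0.05), (0.4, 0.1), (0.001, 0.4), (0.0, 0.0)]
labels = [1, 1, 0, 0]
model = train_detector(samples, labels)
print(predict(model, (0.45, 0.08)))  # speech
print(predict(model, (0.002, 0.3)))  # non-speech
```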
In some embodiments, determining the effective audio data in the audio data according to acoustic energy comprises: slicing the acquired audio data into segments of a fixed number of sample points to obtain at least one piece of sub-audio data; determining whether the acoustic energy of each piece of sub-audio data obtained by slicing is greater than a preset acoustic energy threshold; and in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determining that piece of sub-audio data to be effective audio data.
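The energy-based gating above can be sketched as follows; the segment length and energy threshold are illustrative assumptions, not values fixed by the disclosure.

```python
def effective_segments(audio, seg_len=160, energy_thresh=0.01):
    """Slice audio into fixed-length pieces of sub-audio data (seg_len
    sample points each) and keep those whose mean acoustic energy
    exceeds the preset threshold."""
    effective = []
    for start in range(0, len(audio) - seg_len + 1, seg_len):
        seg = audio[start:start + seg_len]
        energy = sum(s * s for s in seg) / seg_len
        if energy > energy_thresh:
            effective.append(seg)
    return effective

quiet = [0.001] * 160
loud = [0.5, -0.5] * 80
print(len(effective_segments(quiet + loud + quiet)))  # 1 (only the loud piece)
```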
In some embodiments, determining the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information comprises: determining the starting point of the speech from the first speech-type audio frame in the sequence, and determining the location information corresponding to that first frame as the initial location information; and determining the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames in the sequence after the first frame.
In some embodiments, determining the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames after the first frame comprises: for each speech-type audio frame in the sequence, determining whether the angle between the position indicated by the frame's location information and the position indicated by the initial location information is greater than a preset angle; in response to determining that it is greater than the preset angle, changing the frame's audio frame type to the non-speech type; starting from the first audio frame, determining whether a preset number of consecutive non-speech-type audio frames appear in the sequence; and in response to determining that a preset number of consecutive non-speech-type audio frames appear, determining the tail point of the speech from those non-speech-type audio frames.
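A sketch of this tail-point logic, assuming the location information is reduced to a single arrival angle in degrees; the deviation threshold and the length of the non-speech run are illustrative assumptions.

```python
def find_tail_point(frames, init_angle, max_dev_deg=30.0, run_len=3):
    """frames: list of (frame_type, angle_deg) pairs, where angle_deg is
    the direction indicated by the frame's location information.
    Speech frames whose direction deviates from the initial direction by
    more than max_dev_deg are re-labelled non-speech; the tail point is
    the index where run_len consecutive non-speech frames begin."""
    types = []
    for ftype, angle in frames:
        if ftype == "speech" and abs(angle - init_angle) > max_dev_deg:
            ftype = "non-speech"       # sound from another direction: noise
        types.append(ftype)
    run = 0
    for i, t in enumerate(types):
        run = run + 1 if t == "non-speech" else 0
        if run == run_len:
            return i - run_len + 1     # first frame of the non-speech run
    return None                        # no tail point found yet

frames = [("speech", 10), ("speech", 12), ("speech", 80),   # 80 deg deviates
          ("non-speech", 0), ("non-speech", 0), ("speech", 11)]
print(find_tail_point(frames, init_angle=10))  # 2
```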
In a second aspect, an embodiment of the present application provides an apparatus for detecting the endpoints of speech, the apparatus comprising: an audio generation unit configured to generate an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type; a location determination unit configured to determine, for each speech-type audio frame in the sequence, location information of the sound source at the time the sound corresponding to that frame was emitted; and an endpoint determination unit configured to determine the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
In some embodiments, the audio generation unit comprises: an effective audio determination module configured to determine effective audio data in the audio data according to acoustic energy; a moving-window framing module configured to perform moving-window framing on the effective audio data to obtain the audio frame sequence; and an audio frame type determination module configured to perform speech detection on the audio frames in the sequence to determine the audio frame type corresponding to each frame.
In some embodiments, the audio frame type determination module is further configured to: for each audio frame in the sequence, extract audio feature values of the predefined type from the frame, and import the feature values extracted from the frame into a pre-established speech detection model to generate the audio frame type, where the speech detection model characterizes the correspondence between audio feature values and audio frame types.
In some embodiments, the speech detection model is established by the following steps: obtaining an audio data set in which each piece of audio data corresponds to an audio frame type; extracting, for the audio data in the set, audio feature values of the predefined type as training samples to generate a training sample set, where each training sample corresponds to an audio frame type; and training an initial neural network using the training samples of the training sample set as input and the audio frame types corresponding to the input training samples as desired output, to obtain the speech detection model.
In some embodiments, the effective audio determination module is further configured to: slice the acquired audio data into segments of a fixed number of sample points to obtain at least one piece of sub-audio data; determine whether the acoustic energy of each piece of sub-audio data obtained by slicing is greater than a preset acoustic energy threshold; and in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determine that piece of sub-audio data to be effective audio data.
In some embodiments, the endpoint determination unit comprises: a starting point determination module configured to determine the starting point of the speech from the first speech-type audio frame in the sequence and to determine the location information corresponding to that first frame as the initial location information; and a tail point determination module configured to determine the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames in the sequence after the first frame.
In some embodiments, the tail point determination module is further configured to: for each speech-type audio frame in the sequence, determine whether the angle between the position indicated by the frame's location information and the position indicated by the initial location information is greater than a preset angle; in response to determining that it is greater than the preset angle, change the frame's audio frame type to the non-speech type; starting from the first audio frame, determine whether a preset number of consecutive non-speech-type audio frames appear in the sequence; and in response to determining that a preset number of consecutive non-speech-type audio frames appear, determine the tail point of the speech from those non-speech-type audio frames.
In a third aspect, an embodiment of the present application provides an electronic device comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
The method and apparatus for detecting the endpoints of speech provided by embodiments of the present application generate an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type that is a speech type or a non-speech type; determine, for each speech-type audio frame in the sequence, the location information of the sound source at the time the corresponding sound was emitted; and determine the endpoints of the speech in the audio corresponding to the sequence according to the audio frame types and location information. The technical effects at least include providing a new way of detecting the endpoints of speech.
Detailed description of the invention
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture to which the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for detecting the endpoints of speech according to the present application;
Fig. 3A is a schematic flowchart of an implementation of step 201 according to the present application;
Fig. 3B is a schematic diagram of an application scenario of the method for detecting the endpoints of speech according to the present application;
Fig. 4 is a flowchart of another embodiment of the method for detecting the endpoints of speech according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the apparatus for detecting the endpoints of speech according to the present application;
Fig. 6 is a structural schematic diagram of a computer system suitable for implementing an electronic device of embodiments of the present application.
Specific embodiment
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It will be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for convenience of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of the method or apparatus for detecting the endpoints of speech of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101 and 102, a network 103 and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. The network 103 may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user may use the terminal devices 101, 102 to interact with the server 104 through the network 103, to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, such as audio collection applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102 may be hardware or software. When they are hardware, they may be various electronic devices with a sound collection function, including but not limited to smart speakers, smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers and the like. When the terminal devices 101, 102 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, for providing a sound collection service) or as a single piece of software or software module. No specific limitation is made here.
The server 104 may be a server providing various services, for example a background server providing support for the audio data collected by the terminal devices 101, 102. The background server may analyze and otherwise process the received data such as audio, and feed the processing result (for example endpoint information) back to the terminal devices.
It should be noted that the method for detecting the endpoints of speech provided by the embodiments of the present application is generally executed by the server 104, and correspondingly, the apparatus for detecting the endpoints of speech is generally disposed in the server 104.
It should be noted that the server 104 may be hardware or software. When the server 104 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, for providing an endpoint determination service), or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the method for detecting the endpoints of speech provided by the embodiments of the present application may be executed by the server 104, may be executed by the terminal devices 101, 102, or may be executed jointly by the server 104 and the terminal devices 101, 102; the present application does not limit this.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. Depending on implementation needs, there may be any number of terminal devices, networks and servers.
Referring to Fig. 2, a process 200 of one embodiment of the method for detecting the endpoints of speech is shown. This embodiment is mainly exemplified as applied to an electronic device with certain computing capability; the electronic device may be, for example, the server 104 shown in Fig. 1, or the terminal device 101 shown in Fig. 1. The method for detecting the endpoints of speech includes the following steps:
Step 201: generate an audio frame sequence based on acquired audio data.
In this embodiment, the executing subject of the method for detecting the endpoints of speech (for example the smart speaker shown in Fig. 1) may obtain audio data locally or from other electronic devices, and generate the audio frame sequence.
Optionally, if the executing subject is a terminal, the terminal may collect audio data using a sound collection device in the terminal. Here, the sound collection device may be any of various devices that can assist in determining the position from which a sound is emitted. As an example, the sound collection device may be various forms of microphone array.
Optionally, if the executing subject is a server, the server may receive from a terminal the audio data collected by the terminal.
In this embodiment, the audio data obtained by the executing subject may be raw data collected by the sound collection device, or data obtained after processing the raw data collected by the sound collection device. As an example, the processing may be to filter the intensity information of the raw data while retaining the spectral information.
It should be noted that if the obtained audio data is raw data, the obtained audio data includes parameters for determining location information. If the obtained audio data is processed data, the audio data is associated with parameters for determining location information.
Here, a parameter for determining location information may be a parameter related to the location information of the sound source at the time the sound corresponding to an audio frame was emitted, and may be a parameter of a predefined type. As an example, predefined-type parameters may include, but are not limited to, the sound intensity information and sound density information received by each microphone in a microphone array.
In this embodiment, the audio data may be collected in real time by a terminal device. The audio data may include human speech as well as background noise other than human speech.
In this embodiment, the audio frame sequence may be a sequence of audio frames. Each audio frame in the sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type. Here, the speech type may be used to indicate that the sound corresponding to the audio frame is speech; the non-speech type may be used to indicate that the sound corresponding to the audio frame is not speech. It should be noted that in this application, speech may refer to sound produced by a human.
As an example, a windowing operation may be performed on the acquired audio data, with each window corresponding to one audio frame; the audio frames are then arranged in chronological order to obtain the audio frame sequence. The audio frame type of an audio frame may be determined according to sound intensity.
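The moving-window framing described here can be sketched as follows; the 25 ms window and 10 ms hop at a 16 kHz sampling rate are conventional choices, not values fixed by the disclosure.

```python
def frame_audio(samples, frame_len=400, hop=160):
    """Moving-window framing: slide a window of frame_len sample points
    over the audio with a hop of hop points, yielding the audio frames
    in chronological order (25 ms window / 10 ms hop at 16 kHz)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

frames = frame_audio([0.0] * 1600)   # 100 ms of audio at 16 kHz
print(len(frames))                   # 8
```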
Step 202: for each speech-type audio frame in the audio frame sequence, determine the location information of the sound source at the time the sound corresponding to that frame was emitted.
In this embodiment, the executing subject of the method for detecting the endpoints of speech (for example the smart speaker shown in Fig. 1) may determine, for each speech-type audio frame in the audio, the location information of the sound source at the time the sound corresponding to that frame was emitted.
In this embodiment, various algorithms may be applied to the location-determining parameters to determine the location information of the sound source at the time the sound corresponding to the frame was emitted. As an example, at least one of the following may be used, without limitation: beamforming methods, time-difference-of-arrival (TDOA) methods and the like.
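A minimal time-difference-of-arrival sketch: the delay between two microphone channels is estimated by maximising their cross-correlation. A real system would typically use a method such as GCC-PHAT and map the delay to a direction using the microphone geometry; the impulsive test signal below is an assumption for illustration.

```python
def estimate_delay(sig_a, sig_b, max_lag=20):
    """Estimate the sample delay between two microphone signals by
    maximising their cross-correlation over a range of lags."""
    best_lag, best_corr = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(sig_a[i] * sig_b[i - lag]
                   for i in range(max(0, lag), min(len(sig_a), len(sig_b) + lag)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

src = [0.0] * 100
src[40] = 1.0                       # an impulsive sound source
mic_b = src[5:] + [0.0] * 5         # the second microphone hears it 5 samples earlier
print(estimate_delay(src, mic_b))   # 5
```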
Step 203: determine the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
In this embodiment, the executing subject of the method for detecting the endpoints of speech (for example the smart speaker shown in Fig. 1) may determine the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
It will be appreciated that audio frames may be obtained by framing a segment of audio, and the audio frames arranged in order form the audio frame sequence. That segment of audio may be called the audio corresponding to the audio frame sequence.
In this embodiment, the endpoints of the speech may include at least one of the following: the starting point of the speech and the tail point of the speech. The starting point may also be called the head point.
In this embodiment, the endpoints of the speech may be indicated in various forms. As an example, an endpoint may be indicated by an audio frame, or by the position of an audio frame in the audio sequence.
In this embodiment, the above step 203 may be implemented in various ways.
In some embodiments, step 203 may be implemented as follows: determine the starting point of the speech from the first speech-type audio frame in the audio frame sequence, and determine the location information corresponding to that first frame as the initial location information; then determine the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames in the sequence after the first frame.
As an example, starting from the first audio frame, it may be determined whether a predetermined number of consecutive non-speech-type audio frames appear in the audio frame sequence. In response to determining that a predetermined number of consecutive non-speech-type audio frames appear, the angle between the position indicated by the location information of each of those frames and the position indicated by the initial location information is determined. In response to determining that the angle is less than a predetermined angle, the frame is determined to be a target audio frame. In response to determining that the number of target audio frames among the predetermined number of non-speech-type frames is greater than a predetermined number threshold, the tail point of the speech is determined from the predetermined number of non-speech-type audio frames.
As an example, the tail point of the voice may be determined from the first non-target audio frame among the predetermined number of non-voice-type audio frames.
In the method provided by the above embodiment of the present application, an audio frame sequence is generated based on the acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type; for each voice-type audio frame in the audio frame sequence, the location information of the sound source at the time it emitted the sound corresponding to the audio frame is determined; and the endpoint of the voice in the audio data is determined according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence. The technical effects may include at least the following:
First, a new way of detecting the endpoint of a voice is provided.
Second, determining the audio frame type at the granularity of individual audio frames makes it possible to locate the voice segments in the audio frame sequence at a fine granularity, which provides an accurate basis for the subsequent endpoint detection and can thus improve the accuracy of voice endpoint detection.
Third, using the location information of the sound source at the time the sound corresponding to an audio frame was emitted, audio frames whose source positions deviate significantly can be excluded. Background noise can thus be suppressed and its interference with the determination of the voice endpoint eliminated, improving the accuracy of endpoint detection.
In some embodiments, step 201 above can be implemented by process 201 shown in Fig. 3A, which may include:
Step 2011: determining effective audio data in the audio data according to acoustic energy.
Here, step 2011 can be implemented as follows: splitting the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data; determining whether the acoustic energy of each piece of sub-audio data obtained by the splitting is greater than a preset acoustic energy threshold; and, in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determining that the piece of sub-audio data is effective audio data.
Here, the acoustic energy can be used to perform a preliminary classification of the audio data: audio data whose acoustic energy is below the acoustic energy threshold is regarded as silence data. Skipping the subsequent processing for silence data can reduce the computational load of the executing body.
Step 2012: performing moving-window framing on the effective audio data to obtain an audio frame sequence.
Step 2013: performing speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
As an example, step 2013 can be implemented as follows: for each audio frame in the audio frame sequence, importing the audio frame into a pre-established detection model to generate the audio frame type, wherein the detection model characterizes the correspondence between audio and audio frame types.
As another example, step 2013 can be implemented as follows: for each audio frame in the audio frame sequence, extracting audio feature values of predefined types from the audio frame, and then importing the extracted audio feature values into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes the correspondence between audio feature values and audio frame types.
Here, the predefined types of audio feature values can include, but are not limited to: Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) coefficients, the first-order differences of the MFCCs, the second-order differences of the MFCCs, the first-order differences of the PLP coefficients, and the second-order differences of the PLP coefficients.
Optionally, the speech detection model can be a correspondence table characterizing the correspondence between speech feature values and audio frame types.
Optionally, the speech detection model can be established by the following steps: acquiring an audio data set in which each piece of audio data corresponds to an audio frame type; for each piece of audio data in the audio data set, extracting audio feature values of the predefined types as a training sample, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and training an initial neural network using the training samples in the training sample set as input and the audio frame types corresponding to the input training samples as the desired output, to obtain the speech detection model.
Here, the audio data in the audio data set can be data collected from real scenes, and may include voice data and non-voice data. Voice data corresponds to the voice type; non-voice data corresponds to the non-voice type.
Here, a group of audio feature values of the predefined types can be extracted for each piece of audio data in the audio data set, and this group of audio feature values is used as a training sample corresponding to the audio frame type of that piece of audio data. It can be appreciated that, if there are multiple pieces of audio data in the audio data set, multiple groups of audio feature values can be extracted, thereby obtaining multiple training samples, which form the training sample set.
Here, the initial neural network can be a neural network of various structures, including but not limited to at least one of the following: a convolutional neural network, a recurrent neural network, and a long short-term memory (LSTM) network.
Referring to Fig. 3B, Fig. 3B is a schematic diagram of an application scenario of the method for detecting the endpoint of a voice according to the present embodiment. In the application scenario of Fig. 3B:
The user 301 utters a segment of voice after waking up the smart speaker 302. As an example, the voice uttered by the user is "please play a song".
After being woken up, the smart speaker can start collecting sound to obtain audio data.
The smart speaker can generate an audio frame sequence based on the collected audio data. Each audio frame in the generated audio frame sequence corresponds to an audio frame type. As an example, the collected audio data with the silence data removed can serve as the basis for generating the audio frame sequence.
For each voice-type audio frame in the audio frame sequence, the smart speaker can determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame.
The smart speaker can determine the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence. As an example, the smart speaker can determine the starting point and/or the tail point of the voice "please play a song".
With further reference to Fig. 4, a flow 400 of another embodiment of the method for detecting the endpoint of a voice is illustrated. The flow 400 of the method for detecting the endpoint of a voice includes the following steps:
Step 401: generating an audio frame sequence based on the acquired audio data.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can acquire audio data from the executing body itself or from another electronic device, and generate an audio frame sequence. Here, each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type.
Step 402: for each voice-type audio frame in the audio frame sequence, determining the location information of the sound source at the time it emitted the sound corresponding to the audio frame.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can, for each voice-type audio frame in the audio frame sequence, determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame.
The specific operations of step 401 and step 402 in the present embodiment are substantially the same as those of step 201 and step 202 in the embodiment shown in Fig. 2, and are not described again here.
Step 403: determining the starting point of the voice according to the first voice-type audio frame in the audio frame sequence, and determining the location information corresponding to the first audio frame as the initial location information.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can determine the starting point of the voice according to the first voice-type audio frame in the audio frame sequence, and determine the location information corresponding to the first audio frame as the initial location information.
As an example, the first voice-type audio frame in the audio frame sequence can be determined as the starting point of the voice.
Step 404: for each voice-type audio frame in the audio frame sequence, determining whether the angle between the position indicated by the location information corresponding to the audio frame and the position indicated by the initial location information is greater than a predetermined angle; and, in response to determining that it is greater than the predetermined angle, changing the audio frame type of the audio frame to the non-voice type.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can, for each voice-type audio frame in the audio frame sequence, determine whether the angle between the position indicated by the location information corresponding to the audio frame and the position indicated by the initial location information is greater than the predetermined angle, and, in response to determining that it is greater than the predetermined angle, change the audio frame type of the audio frame to the non-voice type.
Step 405: determining whether a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, starting from the first audio frame.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can determine whether a predetermined number of non-voice-type audio frames occur consecutively, starting from the first audio frame, in the audio frame sequence processed by step 404.
Here, the predetermined number can be determined according to the practical application scenario. As an example, the predetermined number in a Chinese speech scenario and that in a Japanese speech scenario may differ.
Step 406: in response to determining that a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, determining the tail point of the voice according to the predetermined number of non-voice-type audio frames.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can, in response to determining that a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, determine the tail point of the voice according to the predetermined number of non-voice-type audio frames.
It can be appreciated that the tail point of the voice can be determined from the first run of a predetermined number of consecutive non-voice-type audio frames occurring after the first audio frame.
As an example, the first audio frame among the predetermined number of non-voice-type audio frames can be determined as the tail point of the voice.
As another example, the audio frame at the middle position among the predetermined number of non-voice-type audio frames can be determined as the tail point of the voice. Alternatively, the last audio frame among the predetermined number of non-voice-type audio frames can be determined as the tail point of the voice.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for detecting the endpoint of a voice in the present embodiment highlights the step of changing the audio frame type of an audio frame according to its angular deviation from the position indicated by the initial location information, and then determining the tail point of the voice. Thus, the technical effects may include at least the following:
First, a new way of detecting the endpoint of a voice is provided.
Second, audio frames whose audio frame type deviates (that is, frames that are not voice but were mistakenly labeled as the voice type) can be identified from their angular difference from the initial position. Sounds from sound sources whose positions differ significantly from the initial position can thus be excluded.
Third, by excluding the sounds of sound sources whose positions differ significantly from the initial position, the voice of non-target users can be excluded in applications of the present embodiment. For example, when a user issues a voice command while other people in the room are uttering interfering speech, the voices of the other people can be excluded by the approach of the present embodiment, so that a more accurate voice endpoint can be determined, preparing accurate material for subsequent speech recognition.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for detecting the endpoint of a voice. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus can be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for detecting the endpoint of a voice of the present embodiment includes: an audio generation unit 501, a position determination unit 502, and an endpoint determination unit 503. The audio generation unit 501 is configured to generate an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type. The position determination unit 502 is configured to, for each voice-type audio frame in the audio frame sequence, determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame. The endpoint determination unit 503 is configured to determine the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence.
In the present embodiment, for the specific processing of the audio generation unit 501, the position determination unit 502, and the endpoint determination unit 503 of the apparatus 500 and the technical effects thereof, reference can be made to the descriptions of step 201, step 202, and step 203 in the embodiment corresponding to Fig. 2, respectively, which are not described again here.
In some optional implementations of the present embodiment, the audio generation unit 501 may include: an effective audio determining module (not shown in Fig. 5), configured to determine effective audio data in the audio data according to acoustic energy; a moving-window framing module (not shown in Fig. 5), configured to perform moving-window framing on the effective audio data to obtain an audio frame sequence; and an audio frame type determining module (not shown in Fig. 5), configured to perform speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
In some optional implementations of the present embodiment, the audio frame type determining module (not shown in Fig. 5) can be further configured to: for each audio frame in the audio frame sequence, extract audio feature values of predefined types from the audio frame; and import the audio feature values extracted from the audio frame into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes the correspondence between audio feature values and audio frame types.
In some optional implementations of the present embodiment, the speech detection model can be established by the following steps: acquiring an audio data set in which each piece of audio data corresponds to an audio frame type; for each piece of audio data in the audio data set, extracting audio feature values of the predefined types as a training sample, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and training an initial neural network using the training samples in the training sample set as input and the audio frame types corresponding to the input training samples as the desired output, to obtain the speech detection model.
In some optional implementations of the present embodiment, the effective audio determining module (not shown in Fig. 5) can be further configured to: split the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data; determine whether the acoustic energy of each piece of sub-audio data obtained by the splitting is greater than a preset acoustic energy threshold; and, in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determine that the piece of sub-audio data is effective audio data.
In some optional implementations of the present embodiment, the endpoint determination unit 503 may include: a starting point determining module (not shown in Fig. 5), configured to determine the starting point of the voice according to the first voice-type audio frame in the audio frame sequence, and determine the location information corresponding to the first audio frame as the initial location information; and a tail point determining module (not shown in Fig. 5), configured to determine the tail point of the voice according to the initial location information and the location information corresponding to the voice-type audio frames after the first audio frame in the audio frame sequence.
In some optional implementations of the present embodiment, the tail point determining module (not shown in Fig. 5) can be further configured to: for each voice-type audio frame in the audio frame sequence, determine whether the angle between the position indicated by the location information corresponding to the audio frame and the position indicated by the initial location information is greater than a predetermined angle; in response to determining that it is greater than the predetermined angle, change the audio frame type of the audio frame to the non-voice type; determine whether a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, starting from the first audio frame; and, in response to determining that a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, determine the tail point of the voice according to the predetermined number of non-voice-type audio frames.
It should be noted that, for the implementation details and technical effects of the units in the apparatus for detecting the endpoint of a voice provided by the embodiment of the present application, reference can be made to the descriptions of the other embodiments in the present application, which are not described again here.
Referring now to Fig. 6, a structural schematic diagram of a computer system 600 of an electronic device suitable for implementing the embodiments of the present application is illustrated. The electronic device shown in Fig. 6 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN (local area network) card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read therefrom can be installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are executed. It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for executing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to the various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the boxes may occur in an order different from that noted in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor; for example, a processor may be described as including an audio generation unit, a position determination unit, and an endpoint determination unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the audio generation unit may also be described as "a unit that generates an audio frame sequence based on acquired audio data".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: generate an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type; for each voice-type audio frame in the audio frame sequence, determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame; and determine the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (16)
1. A method for detecting the endpoint of a voice, comprising:
generating an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type;
for each voice-type audio frame in the audio frame sequence, determining location information of a sound source at the time it emitted the sound corresponding to the audio frame; and
determining the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence.
2. The method according to claim 1, wherein the generating an audio frame sequence based on acquired audio data comprises:
determining effective audio data in the audio data according to acoustic energy;
performing moving-window framing on the effective audio data to obtain an audio frame sequence; and
performing speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
3. The method according to claim 2, wherein the performing speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame comprises:
for each audio frame in the audio frame sequence, extracting audio feature values of predefined types from the audio frame; and
importing the audio feature values extracted from the audio frame into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes the correspondence between audio feature values and audio frame types.
4. The method according to claim 3, wherein the speech detection model is established by:
acquiring an audio data set, each piece of audio data in the audio data set corresponding to an audio frame type;
for the audio data in the audio data set, extracting audio feature values of the predetermined type as training samples, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and
training an initial neural network to obtain the speech detection model, using the training samples in the training sample set as input of the initial neural network and using the audio frame types corresponding to the input training samples as desired output of the initial neural network.
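The training procedure of claim 4 can be sketched with a toy stand-in. This is not the patented implementation: the synthetic features, the single-layer model, and all hyperparameters are assumptions made for illustration (a real system would extract acoustic features such as MFCCs and could use a deeper network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training sample set: each row is a feature vector extracted from one
# audio frame, labelled 1 (voice type) or 0 (non-voice type).
voice = rng.normal(loc=2.0, scale=0.5, size=(200, 8))
silence = rng.normal(loc=0.0, scale=0.5, size=(200, 8))
X = np.vstack([voice, silence])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Minimal "initial neural network" (one sigmoid unit) trained by gradient
# descent: training samples are the input, frame types the desired output.
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid activation
    grad = p - y                              # cross-entropy gradient
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The trained model then plays the role of the "pre-established speech detection model" of claim 3, mapping a frame's feature vector to a frame type.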
5. The method according to claim 2, wherein determining the valid audio data in the audio data according to acoustic energy comprises:
cutting the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data;
determining whether the acoustic energy of each piece of sub-audio data obtained by the cutting is greater than a preset acoustic energy threshold; and
in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determining that the piece of sub-audio data is valid audio data.
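The energy-gating step of claim 5 reduces to a few lines. This sketch is illustrative only; the chunk length and threshold are assumed values, since the claim only requires a fixed sampling-point count and a preset threshold:

```python
import numpy as np

def valid_chunks(samples: np.ndarray, chunk_len: int = 1600,
                 energy_threshold: float = 0.01) -> list:
    """Cut audio into fixed-length chunks and keep those whose mean
    squared amplitude exceeds the preset acoustic energy threshold."""
    chunks = [samples[i:i + chunk_len]
              for i in range(0, len(samples) - chunk_len + 1, chunk_len)]
    return [c for c in chunks if np.mean(c ** 2) > energy_threshold]

quiet = np.zeros(1600)               # mean energy 0.0  -> discarded
loud = 0.5 * np.ones(1600)           # mean energy 0.25 -> kept
kept = valid_chunks(np.concatenate([quiet, loud, quiet]))
print(len(kept))  # 1: only the loud chunk survives
```

Only the surviving chunks would be passed on to the framing step of claim 2.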
6. The method according to any one of claims 1-5, wherein determining the endpoints of speech in the audio corresponding to the audio frame sequence according to the audio frame type and the position information corresponding to each audio frame in the audio frame sequence comprises:
determining a start point of the speech according to the first audio frame of the voice type in the audio frame sequence, and determining the position information corresponding to the first audio frame as initial position information; and
determining a tail point of the speech according to the initial position information and the position information corresponding to the audio frames of the voice type after the first audio frame in the audio frame sequence.
7. described according in the initial position message and the audio frame sequence according to the method described in claim 6, wherein
The corresponding location information of audio frame of sound-type after first audio frame determines the tail point of voice, comprising:
For the audio frame of the sound-type in the audio frame sequence, determine indicated by the corresponding location information of the audio frame
Whether position indicated by position and the initial position message is greater than predetermined angle;It is greater than predetermined angle in response to determining, it will
The audio frame type of the audio frame is changed to non-voice type;
Since first audio frame, determine in audio frame sequence whether predetermined number non-voice type continuously occur
Audio frame;
In response to continuously there is the audio frame of predetermined number non-voice type in the determination audio frame sequence, according to described pre-
The audio frame of fixed number mesh non-voice type determines the tail point of voice.
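The endpoint logic of claims 6-7 can be sketched over per-frame labels. This is illustrative only; the function name and the trailing-frame count are assumptions standing in for the claim's "predetermined number", and the angle-based relabelling of off-direction frames is assumed to have happened upstream:

```python
def find_endpoints(frame_types: list, min_trailing: int = 20):
    """Locate the start point and tail point of speech from frame labels.

    Start point: index of the first 'voice' frame. Tail point: the last
    frame before min_trailing consecutive 'non-voice' frames. Frames whose
    source direction deviated from the initial direction beyond a preset
    angle are assumed to have already been relabelled 'non-voice'.
    """
    try:
        start = frame_types.index("voice")
    except ValueError:
        return None  # no speech at all
    run = 0
    for i in range(start + 1, len(frame_types)):
        run = run + 1 if frame_types[i] == "non-voice" else 0
        if run == min_trailing:
            return start, i - min_trailing  # tail = frame before the run
    return start, None  # speech started but has not ended yet

labels = ["non-voice"] * 3 + ["voice"] * 10 + ["non-voice"] * 25
print(find_endpoints(labels))  # (3, 12)
```

Returning `(start, None)` models the streaming case where speech is still in progress and the tail point cannot yet be declared.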
8. An apparatus for detecting endpoints of speech, comprising:
an audio generation unit, configured to generate an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type;
a position determination unit, configured to, for each audio frame of the voice type in the audio frame sequence, determine position information of a sound source at the time the sound corresponding to the audio frame was emitted; and
an endpoint determination unit, configured to determine endpoints of speech in the audio corresponding to the audio frame sequence according to the audio frame type and the position information corresponding to each audio frame in the audio frame sequence.
9. The apparatus according to claim 8, wherein the audio generation unit comprises:
a valid audio determination module, configured to determine valid audio data in the audio data according to acoustic energy;
a moving-window framing module, configured to perform moving-window framing on the valid audio data to obtain the audio frame sequence; and
an audio frame type determination module, configured to perform speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
10. The apparatus according to claim 9, wherein the audio frame type determination module is further configured to:
for each audio frame in the audio frame sequence, extract an audio feature value of a predetermined type from the audio frame; and
for each audio frame in the audio frame sequence, import the audio feature value extracted from the audio frame into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes a correspondence between audio feature values and audio frame types.
11. The apparatus according to claim 10, wherein the speech detection model is established by:
acquiring an audio data set, each piece of audio data in the audio data set corresponding to an audio frame type;
for the audio data in the audio data set, extracting audio feature values of the predetermined type as training samples, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and
training an initial neural network to obtain the speech detection model, using the training samples in the training sample set as input of the initial neural network and using the audio frame types corresponding to the input training samples as desired output of the initial neural network.
12. The apparatus according to claim 9, wherein the valid audio determination module is further configured to:
cut the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data;
determine whether the acoustic energy of each piece of sub-audio data obtained by the cutting is greater than a preset acoustic energy threshold; and
in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determine that the piece of sub-audio data is valid audio data.
13. The apparatus according to any one of claims 8-12, wherein the endpoint determination unit comprises:
a start point determination module, configured to determine a start point of the speech according to the first audio frame of the voice type in the audio frame sequence, and determine the position information corresponding to the first audio frame as initial position information; and
a tail point determination module, configured to determine a tail point of the speech according to the initial position information and the position information corresponding to the audio frames of the voice type after the first audio frame in the audio frame sequence.
14. The apparatus according to claim 13, wherein the tail point determination module is further configured to:
for each audio frame of the voice type in the audio frame sequence, determine whether the angle between the position indicated by the position information corresponding to the audio frame and the position indicated by the initial position information is greater than a predetermined angle, and in response to determining that it is greater than the predetermined angle, change the audio frame type of the audio frame to the non-voice type;
determine, starting from the first audio frame, whether a predetermined number of audio frames of the non-voice type appear consecutively in the audio frame sequence; and
in response to determining that the predetermined number of audio frames of the non-voice type appear consecutively in the audio frame sequence, determine the tail point of the speech according to the predetermined number of audio frames of the non-voice type.
15. An electronic device, comprising:
one or more processors; and
a storage device on which one or more programs are stored,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
16. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810792887.0A CN108962226B (en) | 2018-07-18 | 2018-07-18 | Method and apparatus for detecting end point of voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108962226A true CN108962226A (en) | 2018-12-07 |
CN108962226B CN108962226B (en) | 2019-12-20 |
Family
ID=64481698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810792887.0A Active CN108962226B (en) | 2018-07-18 | 2018-07-18 | Method and apparatus for detecting end point of voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108962226B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110310625A (en) * | 2019-07-05 | 2019-10-08 | 四川长虹电器股份有限公司 | Voice punctuate method and system |
CN110648692A (en) * | 2019-09-26 | 2020-01-03 | 苏州思必驰信息科技有限公司 | Voice endpoint detection method and system |
WO2020173488A1 (en) * | 2019-02-28 | 2020-09-03 | 北京字节跳动网络技术有限公司 | Audio starting point detection method and apparatus |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103426440A (en) * | 2013-08-22 | 2013-12-04 | 厦门大学 | Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information |
US20140379345A1 (en) * | 2013-06-20 | 2014-12-25 | Electronic And Telecommunications Research Institute | Method and apparatus for detecting speech endpoint using weighted finite state transducer |
CN107564546A (en) * | 2017-07-27 | 2018-01-09 | 上海师范大学 | A kind of sound end detecting method based on positional information |
CN107742522A (en) * | 2017-10-23 | 2018-02-27 | 科大讯飞股份有限公司 | Target voice acquisition methods and device based on microphone array |
CN107799126A (en) * | 2017-10-16 | 2018-03-13 | 深圳狗尾草智能科技有限公司 | Sound end detecting method and device based on Supervised machine learning |
2018-07-18: CN application CN201810792887.0A filed; granted as CN108962226B (en); status: Active
Also Published As
Publication number | Publication date |
---|---|
CN108962226B (en) | 2019-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11151765B2 (en) | Method and apparatus for generating information | |
US10553201B2 (en) | Method and apparatus for speech synthesis | |
WO2022052481A1 (en) | Artificial intelligence-based vr interaction method, apparatus, computer device, and medium | |
CN108962255B (en) | Emotion recognition method, emotion recognition device, server and storage medium for voice conversation | |
CN109545192A (en) | Method and apparatus for generating model | |
CN111599343B (en) | Method, apparatus, device and medium for generating audio | |
KR102346046B1 (en) | 3d virtual figure mouth shape control method and device | |
CN107767869A (en) | Method and apparatus for providing voice service | |
CN109545193B (en) | Method and apparatus for generating a model | |
CN107657017A (en) | Method and apparatus for providing voice service | |
CN108877779B (en) | Method and device for detecting voice tail point | |
CN111798821B (en) | Sound conversion method, device, readable storage medium and electronic equipment | |
CN107481715B (en) | Method and apparatus for generating information | |
CN107705782B (en) | Method and device for determining phoneme pronunciation duration | |
CN107808007A (en) | Information processing method and device | |
CN113257283B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
CN107680584B (en) | Method and device for segmenting audio | |
CN108933730A (en) | Information-pushing method and device | |
CN112992190B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
CN108962226A (en) | Method and apparatus for detecting the endpoint of voice | |
CN109697978A (en) | Method and apparatus for generating model | |
CN111696520A (en) | Intelligent dubbing method, device, medium and electronic equipment | |
CN110138654A (en) | Method and apparatus for handling voice | |
CN109087627A (en) | Method and apparatus for generating information | |
CN109600665A (en) | Method and apparatus for handling data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||