CN108962226A - Method and apparatus for detecting the endpoint of voice - Google Patents
- Publication number
- CN108962226A (Application No. CN201810792887.0A)
- Authority
- CN
- China
- Prior art keywords
- audio frame
- audio
- type
- voice
- sound
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Abstract
Embodiments of the present application disclose a method and apparatus for detecting the endpoints of speech. One specific embodiment of the method includes: generating an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type; for each speech-type audio frame in the sequence, determining location information of the sound source at the time the sound corresponding to that frame was emitted; and determining the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and the location information corresponding to the audio frames in the sequence. This embodiment provides a new way of detecting the endpoints of speech.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, specifically to the field of Internet technology, and more particularly to a method and apparatus for detecting the endpoints of speech.
Background technique
With the development of artificial intelligence, new kinds of smart devices (such as smart speakers and interactive robots) have begun to emerge, voice interaction as a novel human-machine interaction technology is gradually being accepted by the public, and the importance of speech recognition technology is increasingly prominent. Speech endpoint detection, which finds the starting point and tail point of speech in continuous audio data, is an important component of a speech recognition system, and its accuracy affects the accuracy of speech recognition.
Summary of the invention
Embodiments of the present application propose a method and apparatus for detecting the endpoints of speech.
In a first aspect, an embodiment of the present application provides a method for detecting the endpoints of speech, the method comprising: generating an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type; for each speech-type audio frame in the sequence, determining location information of the sound source at the time the sound corresponding to that frame was emitted; and determining the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
In some embodiments, generating the audio frame sequence based on the acquired audio data comprises: determining effective audio data in the audio data according to acoustic energy; performing moving-window framing on the effective audio data to obtain the audio frame sequence; and performing speech detection on the audio frames in the sequence to determine the audio frame type corresponding to each frame.
In some embodiments, performing speech detection on the audio frames in the sequence to determine the audio frame type corresponding to each frame comprises: for each audio frame in the sequence, extracting audio feature values of a predefined type from the frame; and importing the feature values extracted from the frame into a pre-established speech detection model to generate the audio frame type, where the speech detection model characterizes the correspondence between audio feature values and audio frame types.
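The per-frame feature extraction and classification described above can be sketched as follows. The disclosure does not fix the feature types or the form of the model, so short-time energy and zero-crossing rate stand in for the "predefined-type audio feature values", and a simple thresholding rule stands in for the pre-established speech detection model; the threshold values are illustrative assumptions.

```python
import math

def frame_features(frame):
    """Illustrative predefined-type feature values for one audio frame:
    short-time energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)) / len(frame)
    return energy, zcr

def detect_type(features, energy_thresh=0.01, zcr_thresh=0.5):
    """Stand-in for the pre-established speech detection model: maps a
    frame's feature values to its audio frame type."""
    energy, zcr = features
    return "speech" if energy > energy_thresh and zcr < zcr_thresh else "non-speech"

# a loud, low-zero-crossing frame (voiced-like 100 Hz tone at 8 kHz) vs. silence
voiced = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(160)]
silent = [0.0] * 160
print(detect_type(frame_features(voiced)))  # speech
print(detect_type(frame_features(silent)))  # non-speech
```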
In some embodiments, the speech detection model is established by the following steps: obtaining an audio data set in which each piece of audio data corresponds to an audio frame type; extracting, for the audio data in the set, audio feature values of the predefined type as training samples to generate a training sample set, where each training sample corresponds to an audio frame type; and training an initial neural network using the training samples of the training sample set as input and the audio frame types corresponding to the input training samples as desired output, to obtain the speech detection model.
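A minimal sketch of this supervised training step, with a single-layer logistic classifier standing in for the "initial neural network" (the disclosure contemplates actual neural networks; the toy (energy, zero-crossing-rate) feature values and labels below are assumptions for illustration):

```python
import math

def train_detector(samples, labels, lr=0.5, epochs=500):
    """Fit a minimal logistic classifier on (feature values, type) pairs,
    standing in for training the initial neural network.
    samples: list of feature tuples; labels: 1 = speech, 0 = non-speech."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                      # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "speech" if z > 0 else "non-speech"

# toy training sample set: (energy, zero-crossing rate) feature values
samples = [(0.5, 0.05), (0.4, 0.1), (0.001, 0.4), (0.0, 0.0)]
labels = [1, 1, 0, 0]
model = train_detector(samples, labels)
print(predict(model, (0.45, 0.08)))  # speech
print(predict(model, (0.002, 0.3)))  # non-speech
```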
In some embodiments, determining the effective audio data in the audio data according to acoustic energy comprises: slicing the acquired audio data into segments of a fixed number of sample points to obtain at least one piece of sub-audio data; determining whether the acoustic energy of each piece of sub-audio data obtained by slicing is greater than a preset acoustic energy threshold; and in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determining that piece of sub-audio data to be effective audio data.
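The energy-based gating above can be sketched as follows; the segment length and energy threshold are illustrative assumptions, not values fixed by the disclosure.

```python
def effective_segments(audio, seg_len=160, energy_thresh=0.01):
    """Slice audio into fixed-length pieces of sub-audio data (seg_len
    sample points each) and keep those whose mean acoustic energy
    exceeds the preset threshold."""
    effective = []
    for start in range(0, len(audio) - seg_len + 1, seg_len):
        seg = audio[start:start + seg_len]
        energy = sum(s * s for s in seg) / seg_len
        if energy > energy_thresh:
            effective.append(seg)
    return effective

quiet = [0.001] * 160
loud = [0.5, -0.5] * 80
print(len(effective_segments(quiet + loud + quiet)))  # 1 (only the loud piece)
```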
In some embodiments, determining the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information comprises: determining the starting point of the speech from the first speech-type audio frame in the sequence, and determining the location information corresponding to that first frame as the initial location information; and determining the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames in the sequence after the first frame.
In some embodiments, determining the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames after the first frame comprises: for each speech-type audio frame in the sequence, determining whether the angle between the position indicated by the frame's location information and the position indicated by the initial location information is greater than a preset angle; in response to determining that it is greater than the preset angle, changing the frame's audio frame type to the non-speech type; starting from the first audio frame, determining whether a preset number of consecutive non-speech-type audio frames appear in the sequence; and in response to determining that a preset number of consecutive non-speech-type audio frames appear, determining the tail point of the speech from those non-speech-type audio frames.
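A sketch of this tail-point logic, assuming the location information is reduced to a single arrival angle in degrees; the deviation threshold and the length of the non-speech run are illustrative assumptions.

```python
def find_tail_point(frames, init_angle, max_dev_deg=30.0, run_len=3):
    """frames: list of (frame_type, angle_deg) pairs, where angle_deg is
    the direction indicated by the frame's location information.
    Speech frames whose direction deviates from the initial direction by
    more than max_dev_deg are re-labelled non-speech; the tail point is
    the index where run_len consecutive non-speech frames begin."""
    types = []
    for ftype, angle in frames:
        if ftype == "speech" and abs(angle - init_angle) > max_dev_deg:
            ftype = "non-speech"       # sound from another direction: noise
        types.append(ftype)
    run = 0
    for i, t in enumerate(types):
        run = run + 1 if t == "non-speech" else 0
        if run == run_len:
            return i - run_len + 1     # first frame of the non-speech run
    return None                        # no tail point found yet

frames = [("speech", 10), ("speech", 12), ("speech", 80),   # 80 deg deviates
          ("non-speech", 0), ("non-speech", 0), ("speech", 11)]
print(find_tail_point(frames, init_angle=10))  # 2
```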
In a second aspect, an embodiment of the present application provides an apparatus for detecting the endpoints of speech, the apparatus comprising: an audio generation unit configured to generate an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type; a location determination unit configured to determine, for each speech-type audio frame in the sequence, location information of the sound source at the time the sound corresponding to that frame was emitted; and an endpoint determination unit configured to determine the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
In some embodiments, the audio generation unit comprises: an effective audio determination module configured to determine effective audio data in the audio data according to acoustic energy; a moving-window framing module configured to perform moving-window framing on the effective audio data to obtain the audio frame sequence; and an audio frame type determination module configured to perform speech detection on the audio frames in the sequence to determine the audio frame type corresponding to each frame.
In some embodiments, the audio frame type determination module is further configured to: for each audio frame in the sequence, extract audio feature values of the predefined type from the frame, and import the feature values extracted from the frame into a pre-established speech detection model to generate the audio frame type, where the speech detection model characterizes the correspondence between audio feature values and audio frame types.
In some embodiments, the speech detection model is established by the following steps: obtaining an audio data set in which each piece of audio data corresponds to an audio frame type; extracting, for the audio data in the set, audio feature values of the predefined type as training samples to generate a training sample set, where each training sample corresponds to an audio frame type; and training an initial neural network using the training samples of the training sample set as input and the audio frame types corresponding to the input training samples as desired output, to obtain the speech detection model.
In some embodiments, the effective audio determination module is further configured to: slice the acquired audio data into segments of a fixed number of sample points to obtain at least one piece of sub-audio data; determine whether the acoustic energy of each piece of sub-audio data obtained by slicing is greater than a preset acoustic energy threshold; and in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determine that piece of sub-audio data to be effective audio data.
In some embodiments, the endpoint determination unit comprises: a starting point determination module configured to determine the starting point of the speech from the first speech-type audio frame in the sequence and to determine the location information corresponding to that first frame as the initial location information; and a tail point determination module configured to determine the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames in the sequence after the first frame.
In some embodiments, the tail point determination module is further configured to: for each speech-type audio frame in the sequence, determine whether the angle between the position indicated by the frame's location information and the position indicated by the initial location information is greater than a preset angle; in response to determining that it is greater than the preset angle, change the frame's audio frame type to the non-speech type; starting from the first audio frame, determine whether a preset number of consecutive non-speech-type audio frames appear in the sequence; and in response to determining that a preset number of consecutive non-speech-type audio frames appear, determine the tail point of the speech from those non-speech-type audio frames.
In a third aspect, an embodiment of the present application provides an electronic device comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
The method and apparatus for detecting the endpoints of speech provided by embodiments of the present application generate an audio frame sequence based on acquired audio data, where each audio frame in the generated sequence corresponds to an audio frame type that is a speech type or a non-speech type; determine, for each speech-type audio frame in the sequence, the location information of the sound source at the time the corresponding sound was emitted; and determine the endpoints of the speech in the audio corresponding to the sequence according to the audio frame types and location information. The technical effects at least include providing a new way of detecting the endpoints of speech.
Detailed description of the invention
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture to which the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for detecting the endpoints of speech according to the present application;
Fig. 3A is a schematic flowchart of an implementation of step 201 according to the present application;
Fig. 3B is a schematic diagram of an application scenario of the method for detecting the endpoints of speech according to the present application;
Fig. 4 is a flowchart of another embodiment of the method for detecting the endpoints of speech according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the apparatus for detecting the endpoints of speech according to the present application;
Fig. 6 is a structural schematic diagram of a computer system suitable for implementing an electronic device of embodiments of the present application.
Specific embodiment
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It will be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for convenience of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of the method or apparatus for detecting the endpoints of speech of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101 and 102, a network 103 and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. The network 103 may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user may use the terminal devices 101, 102 to interact with the server 104 through the network 103, to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, such as audio collection applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102 may be hardware or software. When they are hardware, they may be various electronic devices with a sound collection function, including but not limited to smart speakers, smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers and the like. When the terminal devices 101, 102 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, for providing a sound collection service) or as a single piece of software or software module. No specific limitation is made here.
The server 104 may be a server providing various services, for example a background server providing support for the audio data collected by the terminal devices 101, 102. The background server may analyze and otherwise process the received data such as audio, and feed the processing result (for example endpoint information) back to the terminal devices.
It should be noted that the method for detecting the endpoints of speech provided by the embodiments of the present application is generally executed by the server 104, and correspondingly, the apparatus for detecting the endpoints of speech is generally disposed in the server 104.
It should be noted that the server 104 may be hardware or software. When the server 104 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, for providing an endpoint determination service), or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the method for detecting the endpoints of speech provided by the embodiments of the present application may be executed by the server 104, may be executed by the terminal devices 101, 102, or may be executed jointly by the server 104 and the terminal devices 101, 102; the present application does not limit this.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. Depending on implementation needs, there may be any number of terminal devices, networks and servers.
Referring to Fig. 2, a process 200 of one embodiment of the method for detecting the endpoints of speech is shown. This embodiment is mainly exemplified as applied to an electronic device with certain computing capability; the electronic device may be, for example, the server 104 shown in Fig. 1, or the terminal device 101 shown in Fig. 1. The method for detecting the endpoints of speech includes the following steps:
Step 201: generate an audio frame sequence based on acquired audio data.
In this embodiment, the executing subject of the method for detecting the endpoints of speech (for example the smart speaker shown in Fig. 1) may obtain audio data locally or from other electronic devices, and generate the audio frame sequence.
Optionally, if the executing subject is a terminal, the terminal may collect audio data using a sound collection device in the terminal. Here, the sound collection device may be any of various devices that can assist in determining the position from which a sound is emitted. As an example, the sound collection device may be various forms of microphone array.
Optionally, if the executing subject is a server, the server may receive from a terminal the audio data collected by the terminal.
In this embodiment, the audio data obtained by the executing subject may be raw data collected by the sound collection device, or data obtained after processing the raw data collected by the sound collection device. As an example, the processing may be to filter the intensity information of the raw data while retaining the spectral information.
It should be noted that if the obtained audio data is raw data, the obtained audio data includes parameters for determining location information. If the obtained audio data is processed data, the audio data is associated with parameters for determining location information.
Here, a parameter for determining location information may be a parameter related to the location information of the sound source at the time the sound corresponding to an audio frame was emitted, and may be a parameter of a predefined type. As an example, predefined-type parameters may include, but are not limited to, the sound intensity information and sound density information received by each microphone in a microphone array.
In this embodiment, the audio data may be collected in real time by a terminal device. The audio data may include human speech as well as background noise other than human speech.
In this embodiment, the audio frame sequence may be a sequence of audio frames. Each audio frame in the sequence corresponds to an audio frame type, the audio frame type being a speech type or a non-speech type. Here, the speech type may be used to indicate that the sound corresponding to the audio frame is speech; the non-speech type may be used to indicate that the sound corresponding to the audio frame is not speech. It should be noted that in this application, speech may refer to sound produced by a human.
As an example, a windowing operation may be performed on the acquired audio data, with each window corresponding to one audio frame; the audio frames are then arranged in chronological order to obtain the audio frame sequence. The audio frame type of an audio frame may be determined according to sound intensity.
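The moving-window framing described here can be sketched as follows; the 25 ms window and 10 ms hop at a 16 kHz sampling rate are conventional choices, not values fixed by the disclosure.

```python
def frame_audio(samples, frame_len=400, hop=160):
    """Moving-window framing: slide a window of frame_len sample points
    over the audio with a hop of hop points, yielding the audio frames
    in chronological order (25 ms window / 10 ms hop at 16 kHz)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

frames = frame_audio([0.0] * 1600)   # 100 ms of audio at 16 kHz
print(len(frames))                   # 8
```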
Step 202: for each speech-type audio frame in the audio frame sequence, determine the location information of the sound source at the time the sound corresponding to that frame was emitted.
In this embodiment, the executing subject of the method for detecting the endpoints of speech (for example the smart speaker shown in Fig. 1) may determine, for each speech-type audio frame in the audio, the location information of the sound source at the time the sound corresponding to that frame was emitted.
In this embodiment, various algorithms may be applied to the location-determining parameters to determine the location information of the sound source at the time the sound corresponding to the frame was emitted. As an example, at least one of the following may be used, without limitation: beamforming methods, time-difference-of-arrival (TDOA) methods and the like.
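A minimal time-difference-of-arrival sketch: the delay between two microphone channels is estimated by maximising their cross-correlation. A real system would typically use a method such as GCC-PHAT and map the delay to a direction using the microphone geometry; the impulsive test signal below is an assumption for illustration.

```python
def estimate_delay(sig_a, sig_b, max_lag=20):
    """Estimate the sample delay between two microphone signals by
    maximising their cross-correlation over a range of lags."""
    best_lag, best_corr = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(sig_a[i] * sig_b[i - lag]
                   for i in range(max(0, lag), min(len(sig_a), len(sig_b) + lag)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

src = [0.0] * 100
src[40] = 1.0                       # an impulsive sound source
mic_b = src[5:] + [0.0] * 5         # the second microphone hears it 5 samples earlier
print(estimate_delay(src, mic_b))   # 5
```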
Step 203: determine the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
In this embodiment, the executing subject of the method for detecting the endpoints of speech (for example the smart speaker shown in Fig. 1) may determine the endpoints of the speech in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the sequence.
It will be appreciated that audio frames may be obtained by framing a segment of audio, and the audio frames arranged in order form the audio frame sequence. That segment of audio may be called the audio corresponding to the audio frame sequence.
In this embodiment, the endpoints of the speech may include at least one of the following: the starting point of the speech and the tail point of the speech. The starting point may also be called the head point.
In this embodiment, the endpoints of the speech may be indicated in various forms. As an example, an endpoint may be indicated by an audio frame, or by the position of an audio frame in the audio sequence.
In this embodiment, the above step 203 may be implemented in various ways.
In some embodiments, step 203 may be implemented as follows: determine the starting point of the speech from the first speech-type audio frame in the audio frame sequence, and determine the location information corresponding to that first frame as the initial location information; then determine the tail point of the speech according to the initial location information and the location information corresponding to the speech-type audio frames in the sequence after the first frame.
As an example, starting from the first audio frame, it may be determined whether a predetermined number of consecutive non-speech-type audio frames appear in the audio frame sequence. In response to determining that a predetermined number of consecutive non-speech-type audio frames appear, the angle between the position indicated by the location information of each of those frames and the position indicated by the initial location information is determined. In response to determining that the angle is less than a predetermined angle, the frame is determined to be a target audio frame. In response to determining that the number of target audio frames among the predetermined number of non-speech-type frames is greater than a predetermined number threshold, the tail point of the speech is determined from the predetermined number of non-speech-type audio frames.
As an example, the tail point of the voice may be determined from the first non-target audio frame among the predetermined number of non-voice-type audio frames.
In the method provided by the above embodiment of the present application, an audio frame sequence is generated based on the acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type; for each voice-type audio frame in the audio frame sequence, the location information of the sound source at the time it emitted the sound corresponding to the audio frame is determined; and the endpoint of the voice in the audio data is determined according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence. The technical effects may include at least the following:
First, a new way of detecting the endpoint of a voice is provided.
Second, determining the audio frame type at the granularity of individual audio frames makes it possible to locate the voice segments in the audio frame sequence at a fine granularity, which provides an accurate basis for the subsequent endpoint detection and can thus improve the accuracy of voice endpoint detection.
Third, using the location information of the sound source at the time the sound corresponding to an audio frame was emitted, audio frames whose source positions deviate significantly can be excluded. Background noise can thus be suppressed and its interference with the determination of the voice endpoint eliminated, improving the accuracy of endpoint detection.
In some embodiments, step 201 above can be implemented by process 201 shown in Fig. 3A, which may include:
Step 2011: determining effective audio data in the audio data according to acoustic energy.
Here, step 2011 can be implemented as follows: splitting the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data; determining whether the acoustic energy of each piece of sub-audio data obtained by the splitting is greater than a preset acoustic energy threshold; and, in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determining that the piece of sub-audio data is effective audio data.
Here, the acoustic energy can be used to perform a preliminary classification of the audio data: audio data whose acoustic energy is below the acoustic energy threshold is regarded as silence data. Skipping the subsequent processing for silence data can reduce the computational load of the executing body.
Step 2012: performing moving-window framing on the effective audio data to obtain an audio frame sequence.
Step 2013: performing speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
As an example, step 2013 can be implemented as follows: for each audio frame in the audio frame sequence, importing the audio frame into a pre-established detection model to generate the audio frame type, wherein the detection model characterizes the correspondence between audio and audio frame types.
As another example, step 2013 can be implemented as follows: for each audio frame in the audio frame sequence, extracting audio feature values of predefined types from the audio frame, and then importing the extracted audio feature values into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes the correspondence between audio feature values and audio frame types.
Here, the predefined types of audio feature values can include, but are not limited to: Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) coefficients, the first-order differences of the MFCCs, the second-order differences of the MFCCs, the first-order differences of the PLP coefficients, and the second-order differences of the PLP coefficients.
Optionally, the speech detection model can be a correspondence table characterizing the correspondence between speech feature values and audio frame types.
Optionally, the speech detection model can be established by the following steps: acquiring an audio data set in which each piece of audio data corresponds to an audio frame type; for each piece of audio data in the audio data set, extracting audio feature values of the predefined types as a training sample, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and training an initial neural network using the training samples in the training sample set as input and the audio frame types corresponding to the input training samples as the desired output, to obtain the speech detection model.
Here, the audio data in the audio data set can be data collected from real scenes, and may include voice data and non-voice data. Voice data corresponds to the voice type; non-voice data corresponds to the non-voice type.
Here, a group of audio feature values of the predefined types can be extracted for each piece of audio data in the audio data set, and this group of audio feature values is used as a training sample corresponding to the audio frame type of that piece of audio data. It can be appreciated that, if there are multiple pieces of audio data in the audio data set, multiple groups of audio feature values can be extracted, thereby obtaining multiple training samples, which form the training sample set.
Here, the initial neural network can be a neural network of various structures, including but not limited to at least one of the following: a convolutional neural network, a recurrent neural network, and a long short-term memory (LSTM) network.
Referring to Fig. 3B, Fig. 3B is a schematic diagram of an application scenario of the method for detecting the endpoint of a voice according to the present embodiment. In the application scenario of Fig. 3B:
The user 301 utters a segment of voice after waking up the smart speaker 302. As an example, the voice uttered by the user is "please play a song".
After being woken up, the smart speaker can start collecting sound to obtain audio data.
The smart speaker can generate an audio frame sequence based on the collected audio data. Each audio frame in the generated audio frame sequence corresponds to an audio frame type. As an example, the collected audio data with the silence data removed can serve as the basis for generating the audio frame sequence.
For each voice-type audio frame in the audio frame sequence, the smart speaker can determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame.
The smart speaker can determine the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence. As an example, the smart speaker can determine the starting point and/or the tail point of the voice "please play a song".
With further reference to Fig. 4, a flow 400 of another embodiment of the method for detecting the endpoint of a voice is illustrated. The flow 400 of the method for detecting the endpoint of a voice includes the following steps:
Step 401: generating an audio frame sequence based on the acquired audio data.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can acquire audio data from the executing body itself or from another electronic device, and generate an audio frame sequence. Here, each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type.
Step 402: for each voice-type audio frame in the audio frame sequence, determining the location information of the sound source at the time it emitted the sound corresponding to the audio frame.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can, for each voice-type audio frame in the audio frame sequence, determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame.
The specific operations of step 401 and step 402 in the present embodiment are substantially the same as those of step 201 and step 202 in the embodiment shown in Fig. 2, and are not described again here.
Step 403: determining the starting point of the voice according to the first voice-type audio frame in the audio frame sequence, and determining the location information corresponding to the first audio frame as the initial location information.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can determine the starting point of the voice according to the first voice-type audio frame in the audio frame sequence, and determine the location information corresponding to the first audio frame as the initial location information.
As an example, the first voice-type audio frame in the audio frame sequence can be determined as the starting point of the voice.
Step 404: for each voice-type audio frame in the audio frame sequence, determining whether the angle between the position indicated by the location information corresponding to the audio frame and the position indicated by the initial location information is greater than a predetermined angle; and, in response to determining that it is greater than the predetermined angle, changing the audio frame type of the audio frame to the non-voice type.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can, for each voice-type audio frame in the audio frame sequence, determine whether the angle between the position indicated by the location information corresponding to the audio frame and the position indicated by the initial location information is greater than the predetermined angle, and, in response to determining that it is greater than the predetermined angle, change the audio frame type of the audio frame to the non-voice type.
Step 405: determining whether a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, starting from the first audio frame.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can determine whether a predetermined number of non-voice-type audio frames occur consecutively, starting from the first audio frame, in the audio frame sequence processed by step 404.
Here, the predetermined number can be determined according to the practical application scenario. As an example, the predetermined number in a Chinese speech scenario and that in a Japanese speech scenario may differ.
Step 406: in response to determining that a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, determining the tail point of the voice according to the predetermined number of non-voice-type audio frames.
In the present embodiment, the executing body of the method for detecting the endpoint of a voice (for example, the smart speaker shown in Fig. 1) can, in response to determining that a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, determine the tail point of the voice according to the predetermined number of non-voice-type audio frames.
It can be appreciated that the tail point of the voice can be determined from the first run of a predetermined number of consecutive non-voice-type audio frames occurring after the first audio frame.
As an example, the first audio frame among the predetermined number of non-voice-type audio frames can be determined as the tail point of the voice.
As another example, the audio frame at the middle position among the predetermined number of non-voice-type audio frames can be determined as the tail point of the voice. Alternatively, the last audio frame among the predetermined number of non-voice-type audio frames can be determined as the tail point of the voice.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for detecting the endpoint of a voice in the present embodiment highlights the step of changing the audio frame type of an audio frame according to its angular deviation from the position indicated by the initial location information, and then determining the tail point of the voice. Thus, the technical effects may include at least the following:
First, a new way of detecting the endpoint of a voice is provided.
Second, audio frames whose audio frame type deviates (that is, frames that are not voice but were mistakenly labeled as the voice type) can be identified from their angular difference from the initial position. Sounds from sound sources whose positions differ significantly from the initial position can thus be excluded.
Third, by excluding the sounds of sound sources whose positions differ significantly from the initial position, the voice of non-target users can be excluded in applications of the present embodiment. For example, when a user issues a voice command while other people in the room are uttering interfering speech, the voices of the other people can be excluded by the approach of the present embodiment, so that a more accurate voice endpoint can be determined, preparing accurate material for subsequent speech recognition.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for detecting the endpoint of a voice. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus can be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for detecting the endpoint of a voice of the present embodiment includes: an audio generation unit 501, a position determination unit 502, and an endpoint determination unit 503. The audio generation unit 501 is configured to generate an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type. The position determination unit 502 is configured to, for each voice-type audio frame in the audio frame sequence, determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame. The endpoint determination unit 503 is configured to determine the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence.
In the present embodiment, for the specific processing of the audio generation unit 501, the position determination unit 502, and the endpoint determination unit 503 of the apparatus 500 and the technical effects thereof, reference can be made to the descriptions of step 201, step 202, and step 203 in the embodiment corresponding to Fig. 2, respectively, which are not described again here.
In some optional implementations of the present embodiment, the audio generation unit 501 may include: an effective audio determining module (not shown in Fig. 5), configured to determine effective audio data in the audio data according to acoustic energy; a moving-window framing module (not shown in Fig. 5), configured to perform moving-window framing on the effective audio data to obtain an audio frame sequence; and an audio frame type determining module (not shown in Fig. 5), configured to perform speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
In some optional implementations of the present embodiment, the audio frame type determining module (not shown in Fig. 5) can be further configured to: for each audio frame in the audio frame sequence, extract audio feature values of predefined types from the audio frame; and import the audio feature values extracted from the audio frame into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes the correspondence between audio feature values and audio frame types.
In some optional implementations of the present embodiment, the speech detection model can be established by the following steps: acquiring an audio data set in which each piece of audio data corresponds to an audio frame type; for each piece of audio data in the audio data set, extracting audio feature values of the predefined types as a training sample, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and training an initial neural network using the training samples in the training sample set as input and the audio frame types corresponding to the input training samples as the desired output, to obtain the speech detection model.
In some optional implementations of the present embodiment, the effective audio determining module (not shown in Fig. 5) can be further configured to: split the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data; determine whether the acoustic energy of each piece of sub-audio data obtained by the splitting is greater than a preset acoustic energy threshold; and, in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determine that the piece of sub-audio data is effective audio data.
In some optional implementations of the present embodiment, the endpoint determination unit 503 may include: a starting point determining module (not shown in Fig. 5), configured to determine the starting point of the voice according to the first voice-type audio frame in the audio frame sequence, and determine the location information corresponding to the first audio frame as the initial location information; and a tail point determining module (not shown in Fig. 5), configured to determine the tail point of the voice according to the initial location information and the location information corresponding to the voice-type audio frames after the first audio frame in the audio frame sequence.
In some optional implementations of the present embodiment, the tail point determining module (not shown in Fig. 5) can be further configured to: for each voice-type audio frame in the audio frame sequence, determine whether the angle between the position indicated by the location information corresponding to the audio frame and the position indicated by the initial location information is greater than a predetermined angle; in response to determining that it is greater than the predetermined angle, change the audio frame type of the audio frame to the non-voice type; determine whether a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, starting from the first audio frame; and, in response to determining that a predetermined number of non-voice-type audio frames occur consecutively in the audio frame sequence, determine the tail point of the voice according to the predetermined number of non-voice-type audio frames.
It should be noted that, for the implementation details and technical effects of the units in the apparatus for detecting the endpoint of a voice provided by the embodiment of the present application, reference can be made to the descriptions of the other embodiments in the present application, which are not described again here.
Referring now to Fig. 6, a structural schematic diagram of a computer system 600 of an electronic device suitable for implementing the embodiments of the present application is illustrated. The electronic device shown in Fig. 6 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN (local area network) card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read therefrom can be installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are executed. It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for executing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to the various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the boxes may occur in an order different from that noted in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor; for example, a processor may be described as including an audio generation unit, a position determination unit, and an endpoint determination unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the audio generation unit may also be described as "a unit that generates an audio frame sequence based on acquired audio data".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: generate an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type; for each voice-type audio frame in the audio frame sequence, determine the location information of the sound source at the time it emitted the sound corresponding to the audio frame; and determine the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (16)
1. A method for detecting the endpoint of a voice, comprising:
generating an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type;
for each voice-type audio frame in the audio frame sequence, determining location information of a sound source at the time it emitted the sound corresponding to the audio frame; and
determining the endpoint of the voice in the audio corresponding to the audio frame sequence according to the audio frame types and location information corresponding to the audio frames in the audio frame sequence.
2. The method according to claim 1, wherein the generating an audio frame sequence based on acquired audio data comprises:
determining effective audio data in the audio data according to acoustic energy;
performing moving-window framing on the effective audio data to obtain an audio frame sequence; and
performing speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
3. The method according to claim 2, wherein the performing speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame comprises:
for each audio frame in the audio frame sequence, extracting audio feature values of predefined types from the audio frame; and
importing the audio feature values extracted from the audio frame into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes the correspondence between audio feature values and audio frame types.
4. The method according to claim 3, wherein the speech detection model is established by:
acquiring an audio data set, each piece of audio data in the audio data set corresponding to an audio frame type;
for the audio data in the audio data set, extracting audio feature values of the predetermined type as training samples, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and
training an initial neural network to obtain the speech detection model, using the training samples in the training sample set as input of the initial neural network and using the audio frame types corresponding to the input training samples as desired output of the initial neural network.
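The training procedure of claim 4 can be sketched with a toy stand-in. This is not the patented implementation: the synthetic features, the single-layer model, and all hyperparameters are assumptions made for illustration (a real system would extract acoustic features such as MFCCs and could use a deeper network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training sample set: each row is a feature vector extracted from one
# audio frame, labelled 1 (voice type) or 0 (non-voice type).
voice = rng.normal(loc=2.0, scale=0.5, size=(200, 8))
silence = rng.normal(loc=0.0, scale=0.5, size=(200, 8))
X = np.vstack([voice, silence])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Minimal "initial neural network" (one sigmoid unit) trained by gradient
# descent: training samples are the input, frame types the desired output.
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid activation
    grad = p - y                              # cross-entropy gradient
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The trained model then plays the role of the "pre-established speech detection model" of claim 3, mapping a frame's feature vector to a frame type.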
5. The method according to claim 2, wherein determining the valid audio data in the audio data according to acoustic energy comprises:
cutting the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data;
determining whether the acoustic energy of each piece of sub-audio data obtained by the cutting is greater than a preset acoustic energy threshold; and
in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determining that the piece of sub-audio data is valid audio data.
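The energy-gating step of claim 5 reduces to a few lines. This sketch is illustrative only; the chunk length and threshold are assumed values, since the claim only requires a fixed sampling-point count and a preset threshold:

```python
import numpy as np

def valid_chunks(samples: np.ndarray, chunk_len: int = 1600,
                 energy_threshold: float = 0.01) -> list:
    """Cut audio into fixed-length chunks and keep those whose mean
    squared amplitude exceeds the preset acoustic energy threshold."""
    chunks = [samples[i:i + chunk_len]
              for i in range(0, len(samples) - chunk_len + 1, chunk_len)]
    return [c for c in chunks if np.mean(c ** 2) > energy_threshold]

quiet = np.zeros(1600)               # mean energy 0.0  -> discarded
loud = 0.5 * np.ones(1600)           # mean energy 0.25 -> kept
kept = valid_chunks(np.concatenate([quiet, loud, quiet]))
print(len(kept))  # 1: only the loud chunk survives
```

Only the surviving chunks would be passed on to the framing step of claim 2.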
6. The method according to any one of claims 1-5, wherein determining the endpoints of speech in the audio corresponding to the audio frame sequence according to the audio frame type and the position information corresponding to each audio frame in the audio frame sequence comprises:
determining a start point of the speech according to the first audio frame of the voice type in the audio frame sequence, and determining the position information corresponding to the first audio frame as initial position information; and
determining a tail point of the speech according to the initial position information and the position information corresponding to the audio frames of the voice type after the first audio frame in the audio frame sequence.
7. described according in the initial position message and the audio frame sequence according to the method described in claim 6, wherein
The corresponding location information of audio frame of sound-type after first audio frame determines the tail point of voice, comprising:
For the audio frame of the sound-type in the audio frame sequence, determine indicated by the corresponding location information of the audio frame
Whether position indicated by position and the initial position message is greater than predetermined angle;It is greater than predetermined angle in response to determining, it will
The audio frame type of the audio frame is changed to non-voice type;
Since first audio frame, determine in audio frame sequence whether predetermined number non-voice type continuously occur
Audio frame;
In response to continuously there is the audio frame of predetermined number non-voice type in the determination audio frame sequence, according to described pre-
The audio frame of fixed number mesh non-voice type determines the tail point of voice.
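The endpoint logic of claims 6-7 can be sketched over per-frame labels. This is illustrative only; the function name and the trailing-frame count are assumptions standing in for the claim's "predetermined number", and the angle-based relabelling of off-direction frames is assumed to have happened upstream:

```python
def find_endpoints(frame_types: list, min_trailing: int = 20):
    """Locate the start point and tail point of speech from frame labels.

    Start point: index of the first 'voice' frame. Tail point: the last
    frame before min_trailing consecutive 'non-voice' frames. Frames whose
    source direction deviated from the initial direction beyond a preset
    angle are assumed to have already been relabelled 'non-voice'.
    """
    try:
        start = frame_types.index("voice")
    except ValueError:
        return None  # no speech at all
    run = 0
    for i in range(start + 1, len(frame_types)):
        run = run + 1 if frame_types[i] == "non-voice" else 0
        if run == min_trailing:
            return start, i - min_trailing  # tail = frame before the run
    return start, None  # speech started but has not ended yet

labels = ["non-voice"] * 3 + ["voice"] * 10 + ["non-voice"] * 25
print(find_endpoints(labels))  # (3, 12)
```

Returning `(start, None)` models the streaming case where speech is still in progress and the tail point cannot yet be declared.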
8. An apparatus for detecting endpoints of speech, comprising:
an audio generation unit, configured to generate an audio frame sequence based on acquired audio data, wherein each audio frame in the generated audio frame sequence corresponds to an audio frame type, the audio frame type being a voice type or a non-voice type;
a position determination unit, configured to, for each audio frame of the voice type in the audio frame sequence, determine position information of a sound source at the time the sound corresponding to the audio frame was emitted; and
an endpoint determination unit, configured to determine endpoints of speech in the audio corresponding to the audio frame sequence according to the audio frame type and the position information corresponding to each audio frame in the audio frame sequence.
9. The apparatus according to claim 8, wherein the audio generation unit comprises:
a valid audio determination module, configured to determine valid audio data in the audio data according to acoustic energy;
a moving-window framing module, configured to perform moving-window framing on the valid audio data to obtain the audio frame sequence; and
an audio frame type determination module, configured to perform speech detection on the audio frames in the audio frame sequence to determine the audio frame type corresponding to each audio frame.
10. The apparatus according to claim 9, wherein the audio frame type determination module is further configured to:
for each audio frame in the audio frame sequence, extract an audio feature value of a predetermined type from the audio frame; and
for each audio frame in the audio frame sequence, import the audio feature value extracted from the audio frame into a pre-established speech detection model to generate the audio frame type, wherein the speech detection model characterizes a correspondence between audio feature values and audio frame types.
11. The apparatus according to claim 10, wherein the speech detection model is established by:
acquiring an audio data set, each piece of audio data in the audio data set corresponding to an audio frame type;
for the audio data in the audio data set, extracting audio feature values of the predetermined type as training samples, and generating a training sample set, wherein each training sample corresponds to an audio frame type; and
training an initial neural network to obtain the speech detection model, using the training samples in the training sample set as input of the initial neural network and using the audio frame types corresponding to the input training samples as desired output of the initial neural network.
12. The apparatus according to claim 9, wherein the valid audio determination module is further configured to:
cut the acquired audio data into segments of a fixed number of sampling points to obtain at least one piece of sub-audio data;
determine whether the acoustic energy of each piece of sub-audio data obtained by the cutting is greater than a preset acoustic energy threshold; and
in response to determining that the acoustic energy of a piece of sub-audio data is greater than the preset acoustic energy threshold, determine that the piece of sub-audio data is valid audio data.
13. The apparatus according to any one of claims 8-12, wherein the endpoint determination unit comprises:
a start point determination module, configured to determine a start point of the speech according to the first audio frame of the voice type in the audio frame sequence, and determine the position information corresponding to the first audio frame as initial position information; and
a tail point determination module, configured to determine a tail point of the speech according to the initial position information and the position information corresponding to the audio frames of the voice type after the first audio frame in the audio frame sequence.
14. The apparatus according to claim 13, wherein the tail point determination module is further configured to:
for each audio frame of the voice type in the audio frame sequence, determine whether the angle between the position indicated by the position information corresponding to the audio frame and the position indicated by the initial position information is greater than a predetermined angle, and in response to determining that it is greater than the predetermined angle, change the audio frame type of the audio frame to the non-voice type;
determine, starting from the first audio frame, whether a predetermined number of audio frames of the non-voice type appear consecutively in the audio frame sequence; and
in response to determining that the predetermined number of audio frames of the non-voice type appear consecutively in the audio frame sequence, determine the tail point of the speech according to the predetermined number of audio frames of the non-voice type.
15. An electronic device, comprising:
one or more processors; and
a storage device on which one or more programs are stored,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
16. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810792887.0A CN108962226B (en) | 2018-07-18 | 2018-07-18 | Method and apparatus for detecting end point of voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108962226A true CN108962226A (en) | 2018-12-07 |
CN108962226B CN108962226B (en) | 2019-12-20 |
Family
ID=64481698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810792887.0A Active CN108962226B (en) | 2018-07-18 | 2018-07-18 | Method and apparatus for detecting end point of voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108962226B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110310625A (en) * | 2019-07-05 | 2019-10-08 | 四川长虹电器股份有限公司 | Voice punctuate method and system |
CN110648692A (en) * | 2019-09-26 | 2020-01-03 | 苏州思必驰信息科技有限公司 | Voice endpoint detection method and system |
WO2020173488A1 (en) * | 2019-02-28 | 2020-09-03 | 北京字节跳动网络技术有限公司 | Audio starting point detection method and apparatus |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103426440A (en) * | 2013-08-22 | 2013-12-04 | 厦门大学 | Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information |
US20140379345A1 (en) * | 2013-06-20 | 2014-12-25 | Electronic And Telecommunications Research Institute | Method and apparatus for detecting speech endpoint using weighted finite state transducer |
CN107564546A (en) * | 2017-07-27 | 2018-01-09 | 上海师范大学 | A kind of sound end detecting method based on positional information |
CN107742522A (en) * | 2017-10-23 | 2018-02-27 | 科大讯飞股份有限公司 | Target voice acquisition methods and device based on microphone array |
CN107799126A (en) * | 2017-10-16 | 2018-03-13 | 深圳狗尾草智能科技有限公司 | Sound end detecting method and device based on Supervised machine learning |
2018-07-18: CN application CN201810792887.0A filed; granted as CN108962226B (en); status: Active
Also Published As
Publication number | Publication date |
---|---|
CN108962226B (en) | 2019-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11151765B2 (en) | Method and apparatus for generating information | |
US10553201B2 (en) | Method and apparatus for speech synthesis | |
WO2022052481A1 (en) | Artificial intelligence-based vr interaction method, apparatus, computer device, and medium | |
CN108962255B (en) | Emotion recognition method, emotion recognition device, server and storage medium for voice conversation | |
CN109545192A (en) | Method and apparatus for generating model | |
CN111599343B (en) | Method, apparatus, device and medium for generating audio | |
KR102346046B1 (en) | 3d virtual figure mouth shape control method and device | |
CN107767869A (en) | Method and apparatus for providing voice service | |
CN109545193B (en) | Method and apparatus for generating a model | |
CN107657017A (en) | Method and apparatus for providing voice service | |
CN108877779B (en) | Method and device for detecting voice tail point | |
CN111798821B (en) | Sound conversion method, device, readable storage medium and electronic equipment | |
CN107481715B (en) | Method and apparatus for generating information | |
CN107705782B (en) | Method and device for determining phoneme pronunciation duration | |
CN107808007A (en) | Information processing method and device | |
CN113257283B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
CN107680584B (en) | Method and device for segmenting audio | |
CN108933730A (en) | Information-pushing method and device | |
CN112992190B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
CN108962226A (en) | Method and apparatus for detecting the endpoint of voice | |
CN109697978A (en) | Method and apparatus for generating model | |
CN111696520A (en) | Intelligent dubbing method, device, medium and electronic equipment | |
CN110138654A (en) | Method and apparatus for handling voice | |
CN109087627A (en) | Method and apparatus for generating information | |
CN109600665A (en) | Method and apparatus for handling data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||