CN109545193A - Method and apparatus for generating model - Google Patents
Method and apparatus for generating model
- Publication number
- CN109545193A (CN201811550086.XA)
- Authority
- CN
- China
- Prior art keywords
- audio
- speech
- initial
- sub
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Embodiments of the present application disclose a method and apparatus for generating a model, and a method and apparatus for generating information. One specific embodiment of the method for generating a model includes: obtaining a training sample set for a target audio set, where the target audio set includes audio obtained by performing truncation processing on initial audio, the training samples correspond one-to-one to the target audios, each training sample in the training sample set includes feature data and identification information of a target audio in the target audio set, the identification information indicates whether an audio frame included in the target audio contains speech audio, and the initial audio contains speech audio; and, using a machine learning algorithm, training a speech recognition model with the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as the desired output. This embodiment enriches the ways in which the model can be trained and helps improve the accuracy of speech endpoint detection.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, and in particular to a method and apparatus for generating a model.
Background art
In voice interaction, it is important to be able to determine where speech begins and ends within the audio. In the prior art, voice activity detection (VAD) is generally used to perform speech endpoint detection. Voice activity detection, also known as speech endpoint detection or speech boundary detection, refers to detecting the presence or absence of speech in a noisy environment. In general, voice activity detection can be used in speech processing systems such as speech coding and speech enhancement, where it serves to lower the speech coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.
Training samples for existing voice activity detection models are generally derived from relatively high-quality audio.
Summary of the invention
Embodiments of the present application propose a method and apparatus for generating a model, and a method and apparatus for generating information.
In a first aspect, an embodiment of the present application provides a method for generating a model, the method including: obtaining a training sample set for a target audio set, where the target audio set includes audio obtained by performing truncation processing on initial audio, the training samples in the training sample set correspond one-to-one to the target audios in the target audio set, each training sample in the training sample set includes feature data and identification information of a target audio in the target audio set, the identification information indicates whether an audio frame included in the target audio contains speech audio, and the initial audio contains speech audio; and, using a machine learning algorithm, training a speech recognition model with the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as the desired output.
In some embodiments, the target audio set includes target audio obtained by performing the following processing on the initial audio: performing truncation processing on the initial audio to obtain a sub-audio sequence; deleting at least one sub-audio from the sub-audio sequence; and determining the combination of the sub-audios retained after the deletion as the target audio.
In some embodiments, deleting at least one sub-audio from the sub-audio sequence includes: deleting the first quantity of sub-audios at the start of the sub-audio sequence, where the first quantity is less than the number of sub-audios included in the sub-audio sequence.
In some embodiments, deleting at least one sub-audio from the sub-audio sequence includes: deleting the second quantity of sub-audios at the end of the sub-audio sequence, where the second quantity is less than the number of sub-audios included in the sub-audio sequence.
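The truncate, delete, and combine processing described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: audio is assumed to be a plain sequence of samples, and the names `truncate`, `delete_and_combine`, `cut_points`, `first_n`, and `last_m` are hypothetical.

```python
from typing import List, Sequence


def truncate(audio: Sequence[float], cut_points: Sequence[int]) -> List[List[float]]:
    """Split audio at the given sample indices into an ordered sub-audio sequence."""
    bounds = [0, *sorted(cut_points), len(audio)]
    return [list(audio[a:b]) for a, b in zip(bounds, bounds[1:])]


def delete_and_combine(sub_audios: List[List[float]],
                       first_n: int = 0, last_m: int = 0) -> List[float]:
    """Delete the first `first_n` and last `last_m` sub-audios, then join the rest.

    Assumption: at least one sub-audio must be retained, mirroring the
    requirement that the deleted quantity is less than the sequence length.
    """
    if first_n + last_m >= len(sub_audios):
        raise ValueError("at least one sub-audio must be retained")
    kept = sub_audios[first_n:len(sub_audios) - last_m]
    return [sample for sub in kept for sample in sub]


audio = list(range(10))              # stand-in for real samples
subs = truncate(audio, [3, 7])       # three sub-audios
head_cut = delete_and_combine(subs, first_n=1)
tail_cut = delete_and_combine(subs, last_m=1)
```

Deleting from the start yields a target audio that begins mid-utterance, and deleting from the end yields one that stops mid-utterance, which is the kind of truncated audio the training set is meant to cover.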
In some embodiments, the target audio set is obtained by performing the following steps on each initial audio in an initial audio set: randomly generating a first random number and a second random number, where the first random number and the second random number are both numbers between 0 and 1; in response to determining that the first random number is less than a predetermined first value, performing truncation and deletion processing on the first third-quantity sub-audios included in the initial audio, and determining the combination of the at least one sub-audio retained after the deletion as the target audio, where the third quantity is less than half the number of sub-audios included in the initial audio, and the first predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose first audio frame contains speech audio to the number of audios in the predetermined audio set; and, in response to determining that the second random number is less than a predetermined second value, performing truncation and deletion processing on the last fourth-quantity sub-audios included in the initial audio, and determining the combination of the at least one sub-audio retained after the deletion as the target audio, where the fourth quantity is less than half the number of sub-audios included in the initial audio, and the second predetermined value characterizes the ratio of the number of audios in the predetermined audio set whose last audio frame contains speech audio to the number of audios in the predetermined audio set.
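The random truncation steps above can be sketched as follows, under several assumptions that the text does not fix: the number of sub-audios to delete is drawn uniformly, both deletions may be applied to the same initial audio, and frame labels are given as lists of 0/1 values. The names `boundary_speech_ratios` and `make_target_audio` are hypothetical.

```python
import random
from typing import List, Tuple


def boundary_speech_ratios(frame_labels: List[List[int]]) -> Tuple[float, float]:
    """Estimate the first and second predetermined values from a reference set.

    Each inner list holds per-frame labels of one audio (1 = contains speech).
    """
    total = len(frame_labels)
    first = sum(1 for labels in frame_labels if labels[0] == 1) / total
    last = sum(1 for labels in frame_labels if labels[-1] == 1) / total
    return first, last


def make_target_audio(sub_audios: List[List[float]], first_value: float,
                      second_value: float, rng: random.Random) -> List[float]:
    """Randomly drop leading and/or trailing sub-audios and join the rest."""
    n = len(sub_audios)
    max_delete = (n - 1) // 2          # largest count strictly below n / 2
    head = tail = 0
    r1, r2 = rng.random(), rng.random()
    if max_delete > 0:
        if r1 < first_value:           # truncate the beginning
            head = rng.randint(1, max_delete)   # assumed uniform choice
        if r2 < second_value:          # truncate the end
            tail = rng.randint(1, max_delete)
    kept = sub_audios[head:n - tail]
    return [sample for sub in kept for sample in sub]


ratios = boundary_speech_ratios([[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1]])
subs = [[0], [1], [2], [3], [4]]
untouched = make_target_audio(subs, 0.0, 0.0, random.Random(0))
both_cut = make_target_audio(subs, 1.0, 1.0, random.Random(0))
```

Because each deleted quantity stays below half the number of sub-audios, at least one sub-audio is always retained even when both ends are cut.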
In some embodiments, training the speech recognition model, using a machine learning algorithm, with the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as the desired output, includes: selecting a training sample from the training sample set and performing the following training step: inputting the feature data included in the selected training sample into an initial speech recognition model to obtain an actual output, where the actual output is the output of the initial speech recognition model; determining, based on the actual output, whether the initial speech recognition model satisfies a predetermined termination condition; and, in response to determining that the termination condition is satisfied, determining the initial speech recognition model that satisfies the termination condition as the trained speech recognition model.
In some embodiments, the method further includes: in response to determining that the termination condition is not satisfied, adjusting the parameter values of the model parameters of the initial speech recognition model based on the obtained actual output and the desired output corresponding to the obtained actual output, selecting a previously unselected training sample from the training sample set, and continuing to perform the training step based on the initial speech recognition model with the adjusted parameter values.
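The control flow of this training step can be sketched as follows. To keep the example self-contained, a single-layer logistic model stands in for the initial speech recognition model, and the termination condition is assumed to be "the loss of the actual output falls below a threshold"; neither choice comes from the application, and the names `predict`, `train`, `lr`, and `loss_threshold` are hypothetical.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], int]   # (feature data of a frame, identification: 1 = speech)


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Actual output of the stand-in model: probability that the frame is speech."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train(samples: List[Sample], lr: float = 0.5,
          loss_threshold: float = 0.05) -> Tuple[List[float], float]:
    w = [0.0] * len(samples[0][0])
    b = 0.0
    unselected = list(samples)
    while unselected:
        x, y = unselected.pop(0)                   # select an unselected sample
        p = predict(w, b, x)                       # actual output
        p = min(max(p, 1e-12), 1.0 - 1e-12)        # keep log() finite
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        if loss < loss_threshold:                  # assumed termination condition
            break
        grad = p - y                               # compare with desired output
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
        b -= lr * grad
    return w, b


data = [([2.0], 1), ([-2.0], 0)] * 20
w, b = train(data)
```

The loop follows the described order: compute the actual output, test the termination condition, and only if it is not met adjust the parameters and move on to a sample that has not yet been selected.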
In some embodiments, the activation function of the output layer of the initial speech recognition model is the normalized exponential function (softmax), and the cost function of the output layer of the initial speech recognition model is the cross-entropy cost function.
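For reference, the normalized exponential function and the cross-entropy cost can be written in a few lines. This is a generic sketch of the two standard functions for a single frame, assuming a two-class output (non-speech, speech); the function names are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalized exponential function; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Cross-entropy cost for one frame whose true class is `target_index`."""
    return -math.log(probs[target_index])


probs = softmax([0.0, 0.0])        # equal scores for non-speech and speech
cost = cross_entropy(probs, 1)     # true class: speech
```

With softmax at the output layer, the per-frame outputs sum to 1 and can be read directly as the probability that the frame contains speech audio.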
In some embodiments, the speech recognition model is a recurrent neural network model with gated recurrent units (GRUs).
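For readers unfamiliar with gated recurrent units, one common formulation of a GRU step is given below. The application does not specify which variant is used, so the symbols here (input $x_t$, hidden state $h_t$, weights $W$, $U$, biases $b$) are illustrative, and some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Here $z_t$ is the update gate, $r_t$ is the reset gate, $\sigma$ is the logistic sigmoid, and $\odot$ denotes element-wise multiplication.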
In a second aspect, an embodiment of the present application provides an apparatus for generating a model, the apparatus including: a first acquisition unit configured to obtain a training sample set for a target audio set, where the target audio set includes audio obtained by performing truncation processing on initial audio, the training samples in the training sample set correspond one-to-one to the target audios in the target audio set, each training sample in the training sample set includes feature data and identification information of a target audio in the target audio set, the identification information indicates whether an audio frame included in the target audio contains speech audio, and the initial audio contains speech audio; and a training unit configured to, using a machine learning algorithm, train a speech recognition model with the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as the desired output.
In some embodiments, the target audio set includes target audio obtained by performing the following processing on the initial audio: performing truncation processing on the initial audio to obtain a sub-audio sequence; deleting at least one sub-audio from the sub-audio sequence; and determining the combination of the sub-audios retained after the deletion as the target audio.
In some embodiments, deleting at least one sub-audio from the sub-audio sequence includes: deleting the first quantity of sub-audios at the start of the sub-audio sequence, where the first quantity is less than the number of sub-audios included in the sub-audio sequence.
In some embodiments, deleting at least one sub-audio from the sub-audio sequence includes: deleting the second quantity of sub-audios at the end of the sub-audio sequence, where the second quantity is less than the number of sub-audios included in the sub-audio sequence.
In some embodiments, the target audio set is obtained by performing the following steps on each initial audio in an initial audio set: randomly generating a first random number and a second random number, where the first random number and the second random number are both numbers between 0 and 1; in response to determining that the first random number is less than a predetermined first value, performing truncation and deletion processing on the first third-quantity sub-audios included in the initial audio, and determining the combination of the at least one sub-audio retained after the deletion as the target audio, where the third quantity is less than half the number of sub-audios included in the initial audio, and the first predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose first audio frame contains speech audio to the number of audios in the predetermined audio set; and, in response to determining that the second random number is less than a predetermined second value, performing truncation and deletion processing on the last fourth-quantity sub-audios included in the initial audio, and determining the combination of the at least one sub-audio retained after the deletion as the target audio, where the fourth quantity is less than half the number of sub-audios included in the initial audio, and the second predetermined value characterizes the ratio of the number of audios in the predetermined audio set whose last audio frame contains speech audio to the number of audios in the predetermined audio set.
In some embodiments, the training unit includes: a training module configured to select a training sample from the training sample set and perform the following training step: inputting the feature data included in the selected training sample into an initial speech recognition model to obtain an actual output, where the actual output is the output of the initial speech recognition model; determining, based on the actual output, whether the initial speech recognition model satisfies a predetermined termination condition; and, in response to determining that the termination condition is satisfied, determining the initial speech recognition model that satisfies the termination condition as the trained speech recognition model.
In some embodiments, the apparatus further includes: an adjustment unit configured to, in response to determining that the termination condition is not satisfied, adjust the parameter values of the model parameters of the initial speech recognition model based on the obtained actual output and the desired output corresponding to the obtained actual output, select a previously unselected training sample from the training sample set, and continue to perform the training step based on the initial speech recognition model with the adjusted parameter values.
In some embodiments, the activation function of the output layer of the initial speech recognition model is the normalized exponential function (softmax), and the cost function of the output layer of the initial speech recognition model is the cross-entropy cost function.
In some embodiments, the speech recognition model is a recurrent neural network model with gated recurrent units (GRUs).
In a third aspect, an embodiment of the present application provides a method for generating information, the method including: obtaining target audio, where the target audio contains speech audio; for each audio frame included in the target audio, inputting the audio frame into a pre-trained speech recognition model to obtain the probability that the audio frame contains speech audio, where the speech recognition model is trained according to the method of any embodiment of the above method for generating a model; and generating a speech endpoint detection result for the target audio based on the magnitude relationship between the obtained probability and a predetermined probability threshold.
In some optional implementations, the predetermined threshold includes a predetermined first threshold and a predetermined second threshold, the first threshold being greater than the second threshold, and the speech endpoint detection result includes location information of the start point and location information of the end point, within the target audio, of the speech audio included in the target audio; and generating the speech endpoint detection result of the target audio based on the magnitude relationship between the obtained probability and the predetermined threshold includes: generating, based on the magnitude relationship between the obtained probability and the predetermined first threshold, the location information of the start point, within the target audio, of the speech audio included in the target audio; and generating, based on the magnitude relationship between the obtained probability and the predetermined second threshold, the location information of the end point, within the target audio, of the speech audio included in the target audio.
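One way to turn per-frame probabilities into a start point and an end point with two thresholds is sketched below. The text only states that the start point depends on the first (higher) threshold and the end point on the second (lower) threshold; the exact rule used here (first frame above the first threshold, then the last frame before the probability drops below the second threshold) is an assumption, as are the name `detect_endpoints` and the default threshold values.

```python
from typing import List, Optional, Tuple


def detect_endpoints(probs: List[float], first_threshold: float = 0.7,
                     second_threshold: float = 0.3) -> Optional[Tuple[int, int]]:
    """Return (start_frame, end_frame) of the speech, or None if no start is found."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second")
    # Start point: first frame whose speech probability exceeds the first threshold.
    start = next((i for i, p in enumerate(probs) if p > first_threshold), None)
    if start is None:
        return None
    # End point: last frame before the probability falls below the second threshold.
    end = len(probs) - 1
    for i in range(start + 1, len(probs)):
        if probs[i] < second_threshold:
            end = i - 1
            break
    return start, end


frame_probs = [0.1, 0.2, 0.8, 0.9, 0.5, 0.6, 0.2, 0.1]
endpoints = detect_endpoints(frame_probs)
```

Using a higher threshold to enter speech and a lower one to leave it means that frames with middling probabilities (0.5 and 0.6 in the example) do not end the detected segment prematurely.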
In a fourth aspect, an embodiment of the present application provides an apparatus for generating information, the apparatus including: a second acquisition unit configured to obtain target audio, where the target audio contains speech audio; an input unit configured to, for each audio frame included in the target audio, input the audio frame into a pre-trained speech recognition model to obtain the probability that the audio frame contains speech audio, where the speech recognition model is trained according to the method of any embodiment of the above method for generating a model; and a generation unit configured to generate a speech endpoint detection result for the target audio based on the magnitude relationship between the obtained probability and a predetermined probability threshold.
In some optional implementations, the predetermined threshold includes a predetermined first threshold and a predetermined second threshold, the first threshold being greater than the second threshold, and the speech endpoint detection result includes location information of the start point and location information of the end point, within the target audio, of the speech audio included in the target audio; and the generation unit includes: a first generation module configured to generate, based on the magnitude relationship between the obtained probability and the predetermined first threshold, the location information of the start point, within the target audio, of the speech audio included in the target audio; and a second generation module configured to generate, based on the magnitude relationship between the obtained probability and the predetermined second threshold, the location information of the end point, within the target audio, of the speech audio included in the target audio.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the above method for generating a model, or cause the one or more processors to implement the method of any embodiment of the above method for generating information.
In a sixth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method of any embodiment of the above method for generating a model, or implements the method of any embodiment of the above method for generating information.
The method and apparatus for generating a model provided by the embodiments of the present application obtain a training sample set for a target audio set, where the target audio set includes audio obtained by performing truncation processing on initial audio, the training samples in the training sample set correspond one-to-one to the target audios in the target audio set, each training sample in the training sample set includes feature data and identification information of a target audio in the target audio set, and the identification information indicates whether the target audio contains speech audio. Then, using a machine learning algorithm, a speech recognition model is trained with the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as the desired output. This enriches the ways in which the model can be trained and helps improve the accuracy of speech endpoint detection.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which an embodiment of the present application can be applied;
Fig. 2 is a flowchart of one embodiment of the method for generating a model according to the present application;
Fig. 3A is a waveform diagram of the initial audio in one embodiment of the method for generating a model according to the present application;
Fig. 3B to Fig. 3D are diagrams of the operation of truncating the initial audio of Fig. 3A to obtain target audio;
Fig. 4 is a schematic diagram of an application scenario of the method for generating a model according to the present application;
Fig. 5 is a flowchart of another embodiment of the method for generating a model according to the present application;
Fig. 6 is a structural schematic diagram of one embodiment of the apparatus for generating a model according to the present application;
Fig. 7 is a flowchart of one embodiment of the method for generating information according to the present application;
Fig. 8 is a structural schematic diagram of one embodiment of the apparatus for generating information according to the present application;
Fig. 9 is a structural schematic diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the relevant invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, provided there is no conflict, the embodiments of the present application and the features in them may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method or apparatus for generating a model, or of the method or apparatus for generating information, of the present application can be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as speech recognition applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with an audio transmission function, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, for example a background server that provides support for the audio sent by the terminal devices 101, 102, 103. The background server may perform processing such as audio feature extraction on the received audio and generate a processing result (for example, the extracted audio features).
It should be noted that the method for generating a model provided by the embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103; correspondingly, the apparatus for generating a model may be provided in the server 105 or in the terminal devices 101, 102, 103. In addition, the method for generating information provided by the embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103; correspondingly, the apparatus for generating information may be provided in the server 105 or in the terminal devices 101, 102, 103. Here, the execution body of the method for generating a model and the execution body of the method for generating information may be the same or different.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks and servers, as required by the implementation. For example, when the electronic device on which the method for generating a model runs does not need to exchange data with other electronic devices, the system architecture may include only the electronic device on which the method for generating a model runs.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating a model according to the present application is shown. The method for generating a model includes the following steps:
Step 201: obtain a training sample set for a target audio set.
In the present embodiment, (such as server shown in FIG. 1 or terminal are set the executing subject for generating the method for model
It is standby) target audio can be directed to from other electronic equipments or local obtain by wired connection mode or radio connection
The training sample set of set.Wherein, target audio set includes executing the audio that truncation obtains to initial audio.Training
The target audio in training sample and target audio set in sample set corresponds.Training sample in training sample set
This includes the characteristic and identification information of the target audio in target audio set.Identification information is used to indicate target audio packet
It whether include speech audio in the audio frame included, initial audio includes speech audio.
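As a rough illustration of the training sample described above, the following Python sketch pairs per-frame feature data with per-frame identification information. The class name `TrainingSample`, its field names, and the 0/1 label encoding are assumptions made for this example and are not taken from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    """Hypothetical container for one target audio's training sample."""
    features: List[List[float]]  # one feature vector per audio frame
    labels: List[int]            # assumed encoding: 1 = frame contains speech, 0 = it does not

    def has_speech(self) -> bool:
        # True if at least one frame of the target audio is marked as speech.
        return any(label == 1 for label in self.labels)


sample = TrainingSample(features=[[0.1, 0.2], [0.3, 0.4]], labels=[0, 1])
```

One sample of this kind would exist for each target audio, matching the one-to-one correspondence stated above.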
The initial audio may be any of various audios that include speech audio. For example, the initial audio may include, but is not limited to, any one of the following: speech audio with noise, audio that includes both silence and speech audio, and so on. The initial audio may be audio of arbitrary length, for example, a spoken sentence; it may also be an audio frame, where the length of the audio frame may be preset, for example, a frame length of 32 milliseconds or 25 milliseconds.
A target audio in the target audio set may be audio obtained by performing truncation processing on the initial audio.
As an example, refer to Figs. 3A to 3D. Fig. 3A is a waveform diagram of the initial audio in one embodiment of the method for generating a model according to the present application. Figs. 3B to 3D are schematic diagrams of the operation of truncating the initial audio of Fig. 3A to obtain target audio.
As shown in Fig. 3B, if the executing subject, or another electronic device communicatively connected to the executing subject, performs truncation processing on the initial audio at a cut point (for example, the point where line segment 301 intersects the waveform of the initial audio), two sub-audios of the initial audio are obtained (namely the sub-audio shown in Fig. 3C and the sub-audio shown in Fig. 3D). In this scenario, the obtained target audio may include at least one of the two sub-audios. That is, the target audio may be the sub-audio shown in Fig. 3C, may be the sub-audio shown in Fig. 3D, or may include both.
Here, there may be one cut point or multiple cut points. One or more target audios may be obtained for a single initial audio. The embodiments of the present application do not limit this.
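The truncation at one or more cut points can be sketched as follows. This is a minimal illustration that assumes the audio is a plain list of samples and that cut points are sample indices; the function name `truncate` is hypothetical.

```python
from typing import List, Sequence


def truncate(samples: Sequence[float], cut_points: Sequence[int]) -> List[List[float]]:
    """Split `samples` at each cut point and return the consecutive sub-audios."""
    bounds = [0] + sorted(cut_points) + [len(samples)]
    return [list(samples[a:b]) for a, b in zip(bounds, bounds[1:])]


initial_audio = [0.0, 0.1, 0.4, -0.3, 0.2, 0.0]
two_parts = truncate(initial_audio, [2])       # one cut point, two sub-audios
three_parts = truncate(initial_audio, [2, 4])  # two cut points, three sub-audios
```

With a single cut point the result corresponds to the two sub-audios of Figs. 3C and 3D.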
It should be noted that, when the target audio includes only some of the sub-audios obtained by truncating the initial audio (that is, not all of them), the truncation processing may be referred to as truncation-and-deletion processing. It can be understood that truncation-and-deletion processing truncates the initial audio, deletes at least one of the resulting sub-audios, and determines the combination of the sub-audios retained after the deletion as the target audio for that initial audio.
The feature data may include, but is not limited to, data on at least one of the following audio features: amplitude, frame rate, zero-crossing rate, short-time energy, and so on. As an example, the feature data may be 64-dimensional Mel filter-bank features (Mel Bank Features).
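Two of the features listed above, short-time energy and zero-crossing rate, can be computed per frame as in the sketch below. The function names and the exact definitions used (sum of squared samples, fraction of adjacent sample pairs that change sign) are common conventions assumed here, not definitions given in the application.

```python
from typing import Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Sum of squared sample values within one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


frame = [0.5, -0.5, 0.5, -0.5]
energy = short_time_energy(frame)
zcr = zero_crossing_rate(frame)
```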
In some optional implementations of the present embodiment, the target audio set includes target audio obtained by performing the following truncation processing (the first step to the third step) on the initial audio. The truncation processing may be performed by the executing subject, or by an electronic device communicatively connected to the executing subject.
In the first step, truncation processing is performed on the initial audio to obtain a sub-audio sequence.
In the second step, at least one sub-audio in the sub-audio sequence is deleted.
In the third step, the combination of the sub-audios retained after the deletion is determined as the target audio.
In some optional implementations of the present embodiment, the second step may specifically include: deleting a first quantity of sub-audios from the front of the sub-audio sequence, where the first quantity is less than the number of sub-audios in the sub-audio sequence. Here, the first quantity may be determined randomly or specified manually.
In some optional implementations of the present embodiment, the second step may specifically include: deleting a second quantity of sub-audios from the end of the sub-audio sequence, where the second quantity is less than the number of sub-audios in the sub-audio sequence. Here, the second quantity may be determined randomly or specified manually.
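The three steps and the two deletion variants above can be sketched as follows, assuming the sub-audio sequence is already available as a list of sample lists. The function name `truncate_and_delete` and its `front` and `rear` parameters (standing in for the first and second quantities) are hypothetical.

```python
from typing import List


def truncate_and_delete(sub_audios: List[List[float]], front: int = 0, rear: int = 0) -> List[float]:
    """Delete `front` leading and `rear` trailing sub-audios, then join what is left."""
    if front < 0 or rear < 0 or front + rear >= len(sub_audios):
        raise ValueError("at least one sub-audio must be retained")
    retained = sub_audios[front:len(sub_audios) - rear]
    # The combination of the retained sub-audios is the target audio.
    return [sample for sub in retained for sample in sub]


sequence = [[1.0, 2.0], [3.0], [4.0, 5.0]]
front_deleted = truncate_and_delete(sequence, front=1)
rear_deleted = truncate_and_delete(sequence, rear=1)
```

Rejecting deletions that would remove every sub-audio reflects the requirement that each quantity be less than the number of sub-audios in the sequence.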
In some optional implementations of the present embodiment, the target audio set is obtained by the executing subject, or by an electronic device communicatively connected to the executing subject, performing the following steps for each initial audio in an initial audio set:
First, a first random number and a second random number are randomly generated. The first random number and the second random number are both numbers between 0 and 1.
It should be noted that "first" and "second" in the first random number and the second random number serve only to distinguish the random numbers and do not constitute a particular limitation on them. The first random number and the second random number may be equal or unequal.
Then, in response to determining that the first random number is less than a predetermined first predetermined value, truncation-and-deletion processing is performed on a third quantity of leading sub-audios of the initial audio, and the combination of the at least one sub-audio retained after the deletion is determined as the target audio.
Here, the third quantity is less than half the number of sub-audios included in the initial audio. The first predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose first audio frame includes speech audio to the total number of audios in the predetermined audio set.
The third quantity may be determined randomly or specified manually.
It can be understood that the target audio obtained in this step does not include the third quantity of leading sub-audios of the initial audio; it includes only the remaining sub-audios of the initial audio.
Here, the first predetermined value characterizes the probability that the first audio frame of an audio includes speech audio.
Finally, in response to determining that the second random number is less than a predetermined second predetermined value, truncation-and-deletion processing is performed on a fourth quantity of trailing sub-audios of the initial audio, and the combination of the at least one sub-audio retained after the deletion is determined as the target audio.
Here, the fourth quantity is less than half the number of sub-audios included in the initial audio. The second predetermined value characterizes the ratio of the number of audios in the predetermined audio set whose last audio frame includes speech audio to the total number of audios in the predetermined audio set.
The fourth quantity may be determined randomly or specified manually.
It can be understood that the target audio obtained in this step does not include the fourth quantity of trailing sub-audios of the initial audio; it includes only the remaining sub-audios of the initial audio.
Here, the second predetermined value characterizes the probability that the last audio frame of an audio includes speech audio.
It should be noted that "first", "second", "third", and "fourth" in the first quantity, the second quantity, the third quantity, and the fourth quantity serve only to distinguish the quantities and do not constitute a particular limitation on them. The first, second, third, and fourth quantities may be equal or unequal.
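The random-number procedure above can be sketched for a single initial audio as follows. This is an illustration under stated assumptions: the third and fourth quantities are drawn at random, both deletions may apply to the same initial audio, and the names `make_target`, `p_first`, and `p_last` (standing in for the first and second predetermined values) are hypothetical.

```python
import random
from typing import List


def make_target(sub_audios: List[List[float]], p_first: float, p_last: float,
                rng: random.Random) -> List[float]:
    """Build one target audio from the sub-audios of one initial audio."""
    n = len(sub_audios)
    max_k = (n - 1) // 2                 # largest count strictly below n / 2
    r1, r2 = rng.random(), rng.random()  # first and second random numbers in [0, 1)
    # Third quantity: leading sub-audios to delete when r1 < p_first.
    front = rng.randint(1, max_k) if r1 < p_first and max_k >= 1 else 0
    # Fourth quantity: trailing sub-audios to delete when r2 < p_last.
    rear = rng.randint(1, max_k) if r2 < p_last and max_k >= 1 else 0
    retained = sub_audios[front:n - rear]
    return [sample for sub in retained for sample in sub]


rng = random.Random(0)
subs = [[1.0], [2.0], [3.0], [4.0], [5.0]]
untouched = make_target(subs, 0.0, 0.0, rng)
front_cut = make_target(subs, 1.0, 0.0, rng)
both_cut = make_target(subs, 1.0, 1.0, rng)
```

Because each quantity is kept strictly below half the number of sub-audios, at least one sub-audio is always retained even when both deletions are applied.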
In some optional implementations of the present embodiment, the target audio set may include audio whose first audio frame includes speech audio and audio whose last audio frame includes speech audio.
The audio whose first audio frame includes speech audio may be audio obtained by performing first-half random truncation-and-deletion processing on the initial audio. First-half random truncation-and-deletion processing means the following: if an initial audio is L sampling points long, a sampling point Q1 may be selected at random from the segment of sampling points from 0 to L/2, and the data between 0 and Q1 are deleted.
The audio whose last audio frame includes speech audio may be audio obtained by performing second-half random truncation-and-deletion processing on the initial audio. Second-half random truncation-and-deletion processing means the following: if an initial audio is L sampling points long, a sampling point Q2 may be selected at random from the segment of sampling points from L/2 to L, and the data between Q2 and L are deleted.
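The sampling-point form of the two random truncations can be sketched as below. The sketch assumes that the first-half variant removes the samples from 0 up to Q1 and that the second-half variant removes the samples from Q2 to the end of the audio; the function names are hypothetical.

```python
import random
from typing import List


def truncate_front_half(samples: List[float], rng: random.Random) -> List[float]:
    """Pick Q1 in [0, L/2] and delete the samples before Q1."""
    q1 = rng.randrange(0, len(samples) // 2 + 1)
    return samples[q1:]


def truncate_rear_half(samples: List[float], rng: random.Random) -> List[float]:
    """Pick Q2 in [L/2, L] and delete the samples from Q2 onward (assumed range)."""
    q2 = rng.randrange(len(samples) // 2, len(samples) + 1)
    return samples[:q2]


rng = random.Random(0)
audio = [float(i) for i in range(10)]
front_truncated = truncate_front_half(audio, rng)
rear_truncated = truncate_rear_half(audio, rng)
```

In both cases at least half of the original samples are kept.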
Optionally, the numbers of audios in the target audio set whose first audio frame includes speech audio and whose last audio frame includes speech audio may be arbitrary, or may be predetermined. For example, the ratio of the number of audios in the target audio set whose first audio frame is speech audio to the number of target audios in the target audio set may be the first predetermined value, and the ratio of the number of audios in the target audio set whose last audio frame is speech audio to the number of target audios in the target audio set may be the second predetermined value.
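The two ratios discussed above can be measured on a labeled audio set as in the following sketch. It assumes that each audio is represented by a list of per-frame labels in which 1 marks a frame containing speech; the function name `boundary_speech_ratios` is hypothetical.

```python
from typing import List, Tuple


def boundary_speech_ratios(frame_labels_per_audio: List[List[int]]) -> Tuple[float, float]:
    """Return (share of audios whose first frame is speech, share whose last frame is speech)."""
    total = len(frame_labels_per_audio)
    first = sum(1 for labels in frame_labels_per_audio if labels[0] == 1)
    last = sum(1 for labels in frame_labels_per_audio if labels[-1] == 1)
    return first / total, last / total


audio_set = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 0, 0]]
first_ratio, last_ratio = boundary_speech_ratios(audio_set)
```

Ratios measured this way on a predetermined audio set could serve as the first and second predetermined values.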
Step 202, using a machine learning algorithm, taking the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as desired output, training to obtain a speech recognition model.
In the present embodiment, the executing subject may use a machine learning algorithm to train a speech recognition model, taking the feature data included in the training samples in the training sample set obtained in step 201 as input and the identification information corresponding to the input feature data as desired output.
Specifically, the executing subject may use a machine learning algorithm to train an initial model (for example, a recurrent neural network or a convolutional neural network), taking the feature data included in the training samples in the training sample set obtained in step 201 as input and the identification information corresponding to the input feature data as desired output. An actual output is obtained for the feature data input in each round of training, where the actual output is what the initial model actually outputs. The executing subject may then use gradient descent to adjust the parameters of the initial model based on the actual output and the desired output, use the model obtained after each parameter adjustment as the initial model for the next round of training, and end the training when a preset training termination condition is met, thereby obtaining the speech recognition model.
Here, the executing subject may train the initial model using a batch training algorithm or a stochastic training algorithm; the embodiments of the present application do not limit this.
It should be noted that the preset training termination condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the computed difference (for example, the value of the loss function) is less than a preset difference threshold.
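The training procedure and its termination conditions can be illustrated with a deliberately small stand-in model. The sketch below trains a one-feature logistic classifier by batch gradient descent rather than a neural network; the function name `train`, the default thresholds, and the toy data are all assumptions made for the example.

```python
import math
import time
from typing import List, Tuple


def train(samples: List[Tuple[float, int]], lr: float = 0.5, max_steps: int = 1000,
          max_seconds: float = 5.0, loss_threshold: float = 0.05):
    """Batch gradient descent that stops on loss, step-count, or time limits."""
    w, b = 0.0, 0.0
    steps = 0
    start = time.monotonic()
    while True:
        grad_w = grad_b = total = 0.0
        for x, y in samples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # actual output
            total -= y * math.log(p + 1e-12) + (1 - y) * math.log(1.0 - p + 1e-12)
            grad_w += (p - y) * x                     # gradient w.r.t. w
            grad_b += p - y                           # gradient w.r.t. b
        loss = total / len(samples)
        # Termination: loss below threshold, step limit reached, or time limit exceeded.
        if (loss < loss_threshold or steps >= max_steps
                or time.monotonic() - start > max_seconds):
            break
        w -= lr * grad_w / len(samples)
        b -= lr * grad_b / len(samples)
        steps += 1
    return w, b, loss, steps


data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, desired output)
w, b, final_loss, steps = train(data)
```

Each pass plays the role of one round of training: the adjusted parameters become the starting point of the next round until one of the conditions holds.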
In some optional implementations of the present embodiment, the executing subject may perform step 202 according to the following steps:
A training sample is selected from the training sample set, and the following training steps are performed:
First, the feature data included in the selected training sample is input into an initial speech recognition model to obtain an actual output.
Here, the actual output may be the output of the initial speech recognition model. The initial speech recognition model may be an untrained model, or a model that has been trained but does not yet meet the termination condition.
Then, based on the actual output, it is determined whether the initial speech recognition model meets a predetermined termination condition.
The predetermined termination condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the computed difference (for example, the value of the loss function) is less than a preset difference threshold.
As an example, when the predetermined termination condition is "the number of training iterations exceeds a preset number", the executing subject may take the number of actual outputs obtained as the number of training iterations; if the number of actual outputs obtained (that is, the number of training iterations) exceeds the preset number, it may determine that the initial speech recognition model meets the predetermined termination condition. When the predetermined termination condition is "the computed value of the loss function is less than a preset difference threshold", the executing subject may compute the value of a predetermined loss function from the actual output obtained and the desired output corresponding to that actual output; if the computed value of the loss function is less than the preset difference threshold, it may determine that the initial speech recognition model meets the predetermined termination condition.
Finally, in response to determining that the termination condition is met, the initial speech recognition model that meets the termination condition is determined as the trained speech recognition model.
In some optional implementations of the present embodiment, upon determining that the termination condition is not met, the executing subject may also adjust the parameter values of the model parameters of the initial speech recognition model based on the actual output obtained and the desired output corresponding to it, select a previously unselected training sample from the training sample set, and continue to perform the training steps based on the initial speech recognition model with the adjusted parameter values.
Here, the executing subject may use backpropagation, adjusting the parameter values of the model parameters of the initial speech recognition model by computing gradient values from the actual output and the desired output corresponding to it. Specifically, the executing subject may compute the gradient values analytically, or may compute them by numerical gradient computation, and then adjust the parameter values of the model parameters of the initial speech recognition model using the computed gradient values.
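The difference between analytic and numerical gradient computation can be illustrated on a one-parameter squared-error loss. The loss function, the central-difference formula, and the step size of 0.1 below are assumptions chosen for the example; the application does not specify them.

```python
def loss(w: float, x: float, y: float) -> float:
    """Squared error between the actual output w * x and the desired output y."""
    return (w * x - y) ** 2


def analytic_grad(w: float, x: float, y: float) -> float:
    """Closed-form derivative of the loss with respect to w."""
    return 2.0 * (w * x - y) * x


def numerical_grad(w: float, x: float, y: float, eps: float = 1e-5) -> float:
    """Central-difference approximation of the same derivative."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2.0 * eps)


w, x, y = 0.3, 2.0, 1.0
g_analytic = analytic_grad(w, x, y)
g_numeric = numerical_grad(w, x, y)
w_adjusted = w - 0.1 * g_analytic  # one gradient-descent adjustment of the parameter
```

The two gradient values agree closely, and either can drive the parameter adjustment.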
It should be noted that adjusting the parameter values of model parameters by means of gradient values is a well-known technique widely studied by those skilled in the art, and is not described in detail here.
In some optional implementations of the present embodiment, the activation function of the output layer of the initial speech recognition model is the normalized exponential function (softmax), and the cost function of the output layer of the initial speech recognition model is the cross-entropy cost function.
It can be understood that using the normalized exponential function as the activation function of the output layer yields the probability corresponding to each kind of identification information (namely, the probability that a sub-audio is non-speech audio, the probability that it is initial-consonant audio, and the probability that it is final (vowel) audio). Using the cross-entropy cost function as the cost function of the output layer usually brings a better training effect, for example, faster training.
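A normalized exponential output layer with a cross-entropy cost can be sketched as follows. The three-class ordering (non-speech, initial consonant, final) and the example logits are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalized exponential function, shifted by the maximum for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Cross-entropy cost for a single frame whose true class is `target_index`."""
    return -math.log(probs[target_index])


# Assumed class order: 0 = non-speech, 1 = initial consonant, 2 = final (vowel).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
cost_if_non_speech = cross_entropy(probs, 0)
cost_if_final = cross_entropy(probs, 2)
```

The outputs sum to one, so they can be read as the per-class probabilities mentioned above, and the cost is lower when the most probable class is the true one.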
In some optional implementations of the present embodiment, the initial speech recognition model may be a recurrent neural network model with gated recurrent units, so that the trained speech recognition model may be a recurrent neural network model with gated recurrent units.
Here, a speech recognition model trained with a recurrent neural network model with gated recurrent units as the initial speech recognition model can be more computationally efficient than speech recognition models trained with other models as the initial speech recognition model.
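For readers unfamiliar with gated recurrent units, the following sketch shows one step of a scalar GRU cell in a commonly used formulation. The weights in `params` are arbitrary placeholder values, and the application does not state which GRU variant or dimensions it uses.

```python
import math
from typing import Dict


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x: float, h: float, p: Dict[str, float]) -> float:
    """One scalar GRU step: returns the new hidden state for input x and previous state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])                # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_tilde                              # blend old and candidate state


# Placeholder weights, not trained values.
params = {name: 0.5 for name in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
hidden = 0.0
for frame_feature in [0.1, 0.9, -0.3]:
    hidden = gru_step(frame_feature, hidden, params)
```

A GRU uses two gates and no separate cell state, which is one reason it is often regarded as cheaper to compute than some other gated recurrent cells.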
Optionally, the initial speech recognition model may also be a recurrent neural network, a convolutional neural network, or the like.
With continued reference to Fig. 4, Fig. 4 is a schematic diagram of an application scenario of the method for generating a model according to the present embodiment. In the application scenario of Fig. 4, the server 401 first obtains a training sample set 4001 for a target audio set. The target audio set includes audio obtained by performing truncation processing on initial audio; the training samples in the training sample set correspond one-to-one to the target audios in the target audio set; each training sample in the training sample set includes feature data and identification information of a target audio in the target audio set; the identification information indicates whether the audio frames included in the target audio include speech audio; and the initial audio includes speech audio. Then, using a machine learning algorithm, the server 401 takes the feature data included in the training samples in the training sample set as the input of an initial model 4002 (for example, a recurrent neural network model with gated recurrent units) and the identification information corresponding to the input feature data as the desired output of the initial model 4002, and trains to obtain a speech recognition model 4003.
The method provided by the above embodiment of the present application obtains a training sample set for a target audio set, where the target audio set includes audio obtained by performing truncation processing on initial audio, the training samples in the training sample set correspond one-to-one to the target audios in the target audio set, each training sample includes feature data and identification information of a target audio in the target audio set, the identification information indicates whether the audio frames included in the target audio include speech audio, and the initial audio includes speech audio. The method then uses a machine learning algorithm to train a speech recognition model, taking the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as desired output. The speech recognition model is thus trained on training samples that include the feature data of audio obtained by truncating initial audio together with the corresponding identification information, which enriches the ways in which the model can be trained. In addition, using the trained speech recognition model can improve the accuracy of speech endpoint detection.
With further reference to Fig. 5, a flow 500 of another embodiment of the method for generating a model is shown. The flow 500 of the method for generating a model includes the following steps:
Step 501, selecting a previously unselected initial audio from an initial audio set. Then step 502 is performed.
In the present embodiment, the executing subject of the method for generating a model (for example, the server or terminal device shown in Fig. 1) may select a previously unselected initial audio from the initial audio set.
The initial audio may be any of various audios that include speech audio. For example, the initial audio may include, but is not limited to, any one of the following: speech audio with noise, audio that includes both silence and speech audio, and so on. The initial audio may be audio of arbitrary length, for example, a spoken sentence; it may also be an audio frame, where the length of the audio frame may be preset, for example, a frame length of 32 milliseconds or 25 milliseconds.
Step 502, randomly generating a first random number and a second random number. Then step 503 and step 504 are performed.
In the present embodiment, the executing subject may randomly generate the first random number and the second random number. The first random number and the second random number are both numbers between 0 and 1.
Here, "first" and "second" in the first random number and the second random number serve only to distinguish the random numbers and do not constitute a particular limitation on them. The first random number and the second random number may be equal or unequal.
Step 503, determining whether the first random number is less than a predetermined first predetermined value. If so, step 505 is performed; if not, step 507 is performed.
In the present embodiment, the executing subject may determine whether the first random number is less than the predetermined first predetermined value. The first predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose first audio frame includes speech audio to the total number of audios in the predetermined audio set. Here, the first predetermined value characterizes the probability that the first audio frame of an audio includes speech audio.
Step 504, determining whether the second random number is less than a predetermined second predetermined value. If so, step 506 is performed; if not, step 507 is performed.
In the present embodiment, the executing subject may determine whether the second random number is less than the predetermined second predetermined value. The second predetermined value characterizes the ratio of the number of audios in the predetermined audio set whose last audio frame includes speech audio to the total number of audios in the predetermined audio set. Here, the second predetermined value characterizes the probability that the last audio frame of an audio includes speech audio.
Step 505, performing truncation-and-deletion processing on a third quantity of leading sub-audios of the initial audio, and determining the combination of the at least one sub-audio retained after the deletion as the target audio. Then step 508 is performed.
In the present embodiment, the executing subject may perform truncation-and-deletion processing on the third quantity of leading sub-audios of the initial audio, and determine the combination of the at least one sub-audio retained after the deletion as the target audio. The third quantity is less than half the number of sub-audios included in the initial audio.
Step 506, performing truncation-and-deletion processing on a fourth quantity of trailing sub-audios of the initial audio, and determining the combination of the at least one sub-audio retained after the deletion as the target audio. Then step 508 is performed.
In the present embodiment, the executing subject may perform truncation-and-deletion processing on the fourth quantity of trailing sub-audios of the initial audio, and determine the combination of the at least one sub-audio retained after the deletion as the target audio. The fourth quantity is less than half the number of sub-audios included in the initial audio.
Step 507, determining the initial audio as the target audio. Then step 508 is performed.
In the present embodiment, the executing subject may determine the initial audio as the target audio.
Step 508, determining whether the initial audio set contains an initial audio that has not yet been selected. If so, step 501 is performed; if not, step 509 is performed.
In the present embodiment, the executing subject may determine whether the initial audio set contains an initial audio that has not yet been selected.
It can be understood that, when the initial audio set contains no initial audio that has not yet been selected, the target audio set has been obtained.
Step 509, obtaining a training sample set for the target audio set. Then step 510 is performed.
In the present embodiment, step 509 is substantially the same as step 201 in the embodiment corresponding to Fig. 2, and is not described again here.
Step 510, using a machine learning algorithm, taking the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as desired output, training to obtain a speech recognition model.
In the present embodiment, step 510 is substantially the same as step 202 in the embodiment corresponding to Fig. 2, and is not described again here.
As can be seen from Fig. 5, compared with the embodiment corresponding to Fig. 2, the flow 500 of the method for generating a model in the present embodiment highlights the steps of obtaining the target audio set. Thus, in the scheme described in the present embodiment, the ratio of the number of front-truncated audios (target audios obtained by performing truncation-and-deletion processing on a third quantity of leading sub-audios of an initial audio) included in the training samples used to train the speech recognition model to the number of target audios in the target audio set approaches the probability that the first audio frame of an audio includes speech audio, and the ratio of the number of rear-truncated audios (target audios obtained by performing truncation-and-deletion processing on a fourth quantity of trailing sub-audios of an initial audio) to the number of target audios in the target audio set approaches the probability that the last audio frame of an audio includes speech audio. As a result, the trained speech recognition model can more accurately determine whether an audio contains speech audio and where the speech audio is located in the audio.
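The claim that the shares of front-truncated and rear-truncated audios approach the two predetermined values can be checked with a small simulation. The sketch below only counts how often each comparison succeeds; the function name `simulate_shares`, the seed, and the example values 0.3 and 0.6 are assumptions.

```python
import random
from typing import Tuple


def simulate_shares(num_audios: int, p_first: float, p_last: float, seed: int = 0) -> Tuple[float, float]:
    """Share of audios that would be front-truncated and rear-truncated."""
    rng = random.Random(seed)
    front = rear = 0
    for _ in range(num_audios):
        r1, r2 = rng.random(), rng.random()  # first and second random numbers
        if r1 < p_first:
            front += 1                       # would go through step 505
        if r2 < p_last:
            rear += 1                        # would go through step 506
    return front / num_audios, rear / num_audios


front_share, rear_share = simulate_shares(10000, 0.3, 0.6)
```

With enough initial audios, the observed shares stay close to the chosen values, which is the behavior the paragraph above relies on.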
With further reference to Fig. 6, as an implementation of the methods shown in the figures above, the present application provides one embodiment of an apparatus for generating a model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2; in addition to the features described below, it may also include features identical or corresponding to those of the method embodiment shown in Fig. 2. The apparatus may be applied in various electronic devices.
As shown in Fig. 6, the apparatus 600 for generating a model of the present embodiment includes a first acquisition unit 601 and a training unit 602. The first acquisition unit 601 is configured to obtain a training sample set for a target audio set, where the target audio set includes audio obtained by performing truncation processing on initial audio, the training samples in the training sample set correspond one-to-one to the target audios in the target audio set, each training sample in the training sample set includes feature data and identification information of a target audio in the target audio set, the identification information indicates whether the audio frames included in the target audio include speech audio, and the initial audio includes speech audio. The training unit 602 is configured to use a machine learning algorithm to train a speech recognition model, taking the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as desired output.
In the present embodiment, the first acquisition unit 601 of the apparatus 600 for generating a model may obtain the training sample set for the target audio set from another electronic device or locally, through a wired or wireless connection. The target audio set includes audio obtained by performing truncation processing on initial audio. The training samples in the training sample set correspond one-to-one to the target audios in the target audio set. Each training sample in the training sample set includes feature data and identification information of a target audio in the target audio set. The identification information indicates whether the audio frames included in the target audio include speech audio. The initial audio includes speech audio.
The initial audio may be any of various audios that include speech audio. For example, the initial audio may include, but is not limited to, any one of the following: speech audio with noise, audio that includes both silence and speech audio, and so on. The initial audio may be audio of arbitrary length, for example, a spoken sentence; it may also be an audio frame, where the length of the audio frame may be preset, for example, a frame length of 32 milliseconds or 25 milliseconds.
A target audio in the target audio set may be audio obtained by performing truncation processing on the initial audio.
The feature data may include, but is not limited to, data on at least one of the following audio features: amplitude, frame rate, zero-crossing rate, short-time energy, and so on. As an example, the feature data may be 64-dimensional Mel filter-bank features (Mel Bank Features).
In the present embodiment, the training unit 602 may use a machine learning algorithm to train a speech recognition model, taking the feature data included in the training samples in the training sample set as input and the identification information corresponding to the input feature data as desired output.
In some optional implementations of the present embodiment, the target audio set includes target audio obtained by performing the following processing on the initial audio:
In the first step, truncation processing is performed on the initial audio to obtain a sub-audio sequence.
In the second step, at least one sub-audio in the sub-audio sequence is deleted.
In the third step, the combination of the sub-audios retained after the deletion is determined as the target audio.
In some optional implementations of the present embodiment, at least one sub-audio in consonant frequency sequence, packet are deleted
It includes: deleting the preceding first quantity sub-audio in consonant frequency sequence, wherein the first quantity is less than the consonant that consonant frequency sequence includes
The quantity of frequency.
In some optional implementations of this embodiment, deleting at least one sub-audio from the sub-audio sequence includes: deleting the last second-quantity sub-audios in the sub-audio sequence, where the second quantity is less than the number of sub-audios included in the sub-audio sequence.
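As an illustrative sketch of the three processing steps and the two deletion variants described above, assuming audio is represented as a list of samples and sub-audios have a fixed length (both are assumptions for this example, as are all function names):

```python
def truncate(audio, sub_len):
    """Step 1: cut the initial audio into a sequence of consecutive sub-audios."""
    return [audio[i:i + sub_len] for i in range(0, len(audio), sub_len)]


def delete_leading(sub_audios, first_quantity):
    """Step 2 (variant): delete the first `first_quantity` sub-audios."""
    if not 0 < first_quantity < len(sub_audios):
        raise ValueError("first_quantity must be positive and less than the sequence length")
    return sub_audios[first_quantity:]


def delete_trailing(sub_audios, second_quantity):
    """Step 2 (variant): delete the last `second_quantity` sub-audios."""
    if not 0 < second_quantity < len(sub_audios):
        raise ValueError("second_quantity must be positive and less than the sequence length")
    return sub_audios[:len(sub_audios) - second_quantity]


def combine(sub_audios):
    """Step 3: concatenate the retained sub-audios into the target audio."""
    return [sample for sub in sub_audios for sample in sub]


initial_audio = list(range(10))            # hypothetical samples
sub_audio_sequence = truncate(initial_audio, 2)
target_a = combine(delete_leading(sub_audio_sequence, 2))
target_b = combine(delete_trailing(sub_audio_sequence, 1))
```

Because the deleted quantity is required to be smaller than the sequence length, at least one sub-audio is always retained.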
In some optional implementations of this embodiment, the target audio set is obtained by performing the following steps for each initial audio in an initial audio set:
First, randomly generate a first random number and a second random number, where the first random number and the second random number are both numbers between 0 and 1.
Then, in response to determining that the first random number is less than a first predetermined value, truncate and delete the first third-quantity sub-audios included in the initial audio, and determine the combination of the at least one sub-audio retained after deletion as the target audio. Here, the third quantity is less than half of the number of sub-audios included in the initial audio, and the first predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose first audio frame includes speech audio to the total number of audios in the predetermined audio set.
Finally, in response to determining that the second random number is less than a second predetermined value, truncate and delete the last fourth-quantity sub-audios included in the initial audio, and determine the combination of the at least one sub-audio retained after deletion as the target audio. Here, the fourth quantity is less than half of the number of sub-audios included in the initial audio, and the second predetermined value characterizes the ratio of the number of audios in the predetermined audio set whose last audio frame includes speech audio to the total number of audios in the predetermined audio set.
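The random truncation described above can be sketched as follows. This is only an illustration under stated assumptions: the initial audio is already given as a list of sub-audios, the third and fourth quantities are drawn uniformly at random (the text only bounds them by half of the sub-audio count), and the function and parameter names are hypothetical.

```python
import random


def make_target_audio(sub_audios, first_value, second_value, rng=random):
    """Randomly delete leading and/or trailing sub-audios of one initial audio.

    first_value / second_value play the role of the first and second
    predetermined values (ratios between 0 and 1).
    """
    n = len(sub_audios)
    max_cut = (n - 1) // 2          # largest count strictly less than half of n
    first_random, second_random = rng.random(), rng.random()
    kept = list(sub_audios)
    if first_random < first_value and max_cut >= 1:
        third_quantity = rng.randint(1, max_cut)     # assumed: uniform choice
        kept = kept[third_quantity:]
    if second_random < second_value and max_cut >= 1:
        fourth_quantity = rng.randint(1, max_cut)    # assumed: uniform choice
        kept = kept[:len(kept) - fourth_quantity]
    return kept


# Hypothetical usage: ten sub-audios, each represented here by an integer id.
rng = random.Random(0)
target = make_target_audio(list(range(10)), first_value=1.0, second_value=1.0, rng=rng)
```

Since each deleted quantity is strictly less than half of the sub-audio count, at least one sub-audio is retained even when both deletions are applied.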
In some optional implementations of this embodiment, the training unit includes a training module (not shown) configured to select a training sample from the training sample set and perform the following training step: input the feature data included in the selected training sample into an initial speech recognition model to obtain an actual output, where the actual output is the output of the initial speech recognition model; determine, based on the actual output, whether the initial speech recognition model satisfies a predetermined termination condition; and, in response to determining that the termination condition is satisfied, determine the initial speech recognition model that satisfies the termination condition as the trained speech recognition model.
In some optional implementations of this embodiment, the device 600 further includes an adjustment unit (not shown) configured to, in response to determining that the termination condition is not satisfied, adjust the parameter values of the model parameters of the initial speech recognition model based on the obtained actual output and the desired output corresponding to that actual output, select a training sample that has not yet been selected from the training sample set, and continue to perform the training step based on the initial speech recognition model with the adjusted parameter values.
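The training step and the parameter adjustment described above can be illustrated with the following minimal sketch. It deliberately substitutes a single-layer logistic model for the initial speech recognition model, and it assumes a termination condition of "mean loss below a threshold, or a maximum number of passes reached"; neither choice is stated in the application, and all names are hypothetical.

```python
import math


def predict(w, b, x):
    """Actual output: probability that the frame with features x contains speech."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, lr=0.5, loss_threshold=0.05, max_passes=1000):
    """samples: list of (feature_vector, label) pairs, label 1 = speech, 0 = no speech."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(max_passes):
        total_loss = 0.0
        for x, y in samples:                       # select each not-yet-selected sample
            p = predict(w, b, x)                   # actual output
            total_loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            grad = p - y                           # compare with the desired output
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]   # adjust parameters
            b -= lr * grad
        if total_loss / len(samples) < loss_threshold:   # assumed termination condition
            break
    return w, b


# Hypothetical one-dimensional feature data.
w, b = train([([-1.0], 0), ([1.0], 1)])
```

In this sketch the termination condition is checked once per pass over the sample set; checking it after every sample, as the training step above allows, would only move the test inside the inner loop.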
In some optional implementations of this embodiment, the activation function of the output layer included in the initial speech recognition model is the normalized exponential function (softmax), and the cost function of the output layer included in the initial speech recognition model is the cross-entropy cost function.
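For reference, the normalized exponential function and the cross-entropy cost for a single output can be sketched as follows. The two-class layout (index 0 = no speech, index 1 = speech) is an assumption made for this example.

```python
import math


def softmax(logits):
    """Normalized exponential function; subtracting the maximum avoids overflow."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Cross-entropy cost when the desired output is the class at target_index."""
    return -math.log(probs[target_index])


probs = softmax([0.0, 0.0])          # hypothetical output-layer values for one frame
cost = cross_entropy(probs, 1)       # desired output: the frame includes speech
```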
In some optional implementations of this embodiment, the speech recognition model is a recurrent neural network model with gated recurrent units.
In the device provided by the above embodiment of the present application, the first acquisition unit 601 obtains a training sample set for a target audio set. The target audio set includes audio obtained by performing truncation on initial audio; the training samples in the training sample set correspond one-to-one with the target audios in the target audio set; each training sample includes the feature data and the identification information of a target audio in the target audio set; the identification information indicates whether the audio frames included in the target audio include speech audio; and the initial audio includes speech audio. The training unit 602 then uses a machine learning algorithm to train a speech recognition model, taking the feature data included in the training samples as input and the identification information corresponding to the input feature data as the desired output. The speech recognition model is thus trained with training samples that contain the feature data of audio obtained by truncating initial audio together with the corresponding identification information, which enriches the ways in which the model can be trained. In addition, using the trained speech recognition model can improve the accuracy of speech endpoint detection.
With continued reference to Fig. 7, a flow 700 of one embodiment of the method for generating information according to the present application is shown. The method for generating information includes the following steps:
Step 701: obtain target audio.
In this embodiment, the executing body of the method for generating information (for example, the server or terminal device shown in Fig. 1) may obtain the target audio from another electronic device or locally, through a wired or wireless connection. The target audio may be any of various audios that include speech audio and on which speech endpoint detection is to be performed.
Step 702: for an audio frame included in the target audio, input the audio frame into a pre-trained speech recognition model to obtain the probability that the audio frame includes speech audio.
In this embodiment, for each audio frame among the at least one audio frame included in the target audio, the executing body may input the feature data of the audio frame into the pre-trained speech recognition model to obtain the probability that the audio frame includes speech audio. The speech recognition model may be trained, by the executing body or by an electronic device communicatively connected to the executing body, according to the method described in any embodiment of the method for generating a model shown in Fig. 2.
It can be understood that, in general, a speech recognition model obtained by the above training method outputs, in actual use, the probability that an audio frame includes speech audio. The executing body can then compare the obtained probability with a preset probability threshold to determine whether the audio frame includes speech audio.
Step 703: generate a speech endpoint detection result for the target audio based on the magnitude relationship between the obtained probability and a predetermined probability threshold.
In this embodiment, the executing body may generate the speech endpoint detection result of the target audio based on the magnitude relationship between the obtained probability and the predetermined probability threshold.
The speech endpoint detection result may indicate the start position and the end position of the speech audio included in the target audio.
For example, if an obtained probability is greater than the predetermined probability threshold, the executing body may determine that the audio frame corresponding to that probability is speech audio, and thereby obtain the speech endpoint detection result of the target audio.
In some optional implementations of this embodiment, the predetermined threshold includes a predetermined first threshold and a predetermined second threshold, the first threshold being greater than the second threshold, and the speech endpoint detection result includes position information of the start point and position information of the end point, within the target audio, of the speech audio included in the target audio. Accordingly, the executing body may perform the above step 703 as follows:
First, based on the magnitude relationship between the obtained probabilities and the predetermined first threshold, generate the position information of the start point, within the target audio, of the speech audio included in the target audio.
For example, the executing body may find, in the probability sequence for the audio frame sequence included in the target audio, the first probability that is greater than the predetermined first threshold, and determine the audio frame corresponding to that probability as the start point, within the target audio, of the speech audio included in the target audio, thereby obtaining the position information of the start point of the speech audio in the target audio.
Then, based on the magnitude relationship between the obtained probabilities and the predetermined second threshold, generate the position information of the end point, within the target audio, of the speech audio included in the target audio.
For example, the executing body may find, in the probability sequence for the audio frame sequence included in the target audio, the last probability that is greater than the predetermined second threshold, and determine the audio frame corresponding to that probability as the end point, within the target audio, of the speech audio included in the target audio, thereby obtaining the position information of the end point of the speech audio in the target audio.
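The two-threshold procedure above can be sketched as follows. Frame positions are expressed as zero-based indices into the probability sequence, which is an assumption of this example, as are the function name and the sample values.

```python
def detect_endpoints(probabilities, first_threshold, second_threshold):
    """Return (start_index, end_index) of speech, or None where no frame qualifies.

    start: first frame whose probability exceeds first_threshold
    end:   last frame whose probability exceeds second_threshold
    """
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second threshold")
    start = next((i for i, p in enumerate(probabilities) if p > first_threshold), None)
    end = next((i for i in range(len(probabilities) - 1, -1, -1)
                if probabilities[i] > second_threshold), None)
    return start, end


# Hypothetical per-frame probabilities output by the speech recognition model.
start, end = detect_endpoints([0.1, 0.4, 0.9, 0.8, 0.6, 0.2],
                              first_threshold=0.7, second_threshold=0.5)
```

Using a higher threshold for the start point than for the end point makes the detector stricter about where speech begins and more tolerant about where it fades out.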
After determining whether each audio frame included in the target audio includes speech audio, the executing body may determine the first and the last audio frames that include speech audio in the audio frame sequence of the target audio. It may then determine the first such audio frame as the start position of the speech audio included in the target audio, and the last such audio frame as the end position of the speech audio included in the target audio, thereby obtaining the speech endpoint detection result.
Optionally, the executing body may also directly use, as the speech endpoint detection result, the result indicating whether each audio frame included in the target audio includes speech audio. For example, suppose the target audio consists of 10 audio frames, of which the 2nd to 9th frames include speech audio and the 1st and 10th frames do not. The executing body can then generate the sequence {0, 1, 1, 1, 1, 1, 1, 1, 1, 0} to characterize this result, where the first element of the sequence indicates whether the first audio frame of the target audio includes speech audio, the second element indicates whether the second audio frame includes speech audio, and so on. "0" indicates that speech audio is not included, and "1" indicates that speech audio is included. The executing body can thus directly determine the sequence {0, 1, 1, 1, 1, 1, 1, 1, 1, 0} as the speech endpoint detection result. In this application scenario, the speech endpoint detection result shows that the target audio consists of 10 audio frames, of which the 2nd to 9th frames include speech audio and the 1st and 10th frames do not.
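The per-frame result sequence in this example, and the start and end positions derived from it, can be sketched as follows. The probabilities and the threshold are hypothetical values chosen to reproduce the 10-frame example; indices are zero-based, so index 1 corresponds to the 2nd frame.

```python
def frame_labels(probabilities, threshold):
    """1 if the frame's probability exceeds the threshold (speech), else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


def speech_span(labels):
    """Zero-based indices of the first and last frames labeled 1, or None."""
    speech_frames = [i for i, label in enumerate(labels) if label == 1]
    if not speech_frames:
        return None
    return speech_frames[0], speech_frames[-1]


probabilities = [0.1, 0.9, 0.8, 0.9, 0.7, 0.8, 0.9, 0.9, 0.6, 0.2]
labels = frame_labels(probabilities, threshold=0.5)
span = speech_span(labels)
```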
In the method provided by the above embodiment of the present application, target audio that includes speech audio is obtained. Then, for each audio frame included in the target audio, the audio frame is input into a pre-trained speech recognition model to obtain the probability that the audio frame includes speech audio, where the speech recognition model is trained according to the method of any embodiment of the above method for generating a model. Finally, the speech endpoint detection result of the target audio is generated based on the magnitude relationship between the obtained probability and a predetermined probability threshold. The speech recognition model is thereby applied to speech endpoint detection, which improves the accuracy of speech endpoint detection and enriches the ways in which speech endpoint detection can be performed.
With further reference to Fig. 8, as an implementation of the methods shown in the above figures, the present application provides one embodiment of a device for generating information. This device embodiment corresponds to the method embodiment shown in Fig. 7; in addition to the features described below, it may also include features that are the same as or correspond to those of the method embodiment shown in Fig. 7. The device can be applied in various electronic devices.
As shown in Fig. 8, the device 800 for generating information of this embodiment includes a second acquisition unit 801, an input unit 802, and a generation unit 803. The second acquisition unit 801 is configured to obtain target audio, where the target audio includes speech audio. The input unit 802 is configured to, for each audio frame included in the target audio, input the audio frame into a pre-trained speech recognition model to obtain the probability that the audio frame includes speech audio, where the speech recognition model is trained according to the method of any embodiment of the above method for generating a model. The generation unit 803 is configured to generate the speech endpoint detection result of the target audio based on the magnitude relationship between the obtained probability and a predetermined probability threshold.
In this embodiment, the second acquisition unit 801 of the device 800 for generating information may obtain the target audio from another electronic device or locally, through a wired or wireless connection.
The target audio may be any of various audios that include speech audio.
In this embodiment, for each audio frame among the at least one audio frame included in the target audio obtained by the second acquisition unit 801, the input unit 802 may input the feature data of the audio frame into the pre-trained speech recognition model to obtain the probability that the audio frame includes speech audio. The speech recognition model may be trained, by the executing body or by an electronic device communicatively connected to the executing body, according to the method described in any embodiment of the method for generating a model shown in Fig. 2.
In this embodiment, the generation unit 803 may generate the speech endpoint detection result of the target audio based on the magnitude relationship between the probability obtained by the input unit 802 and the predetermined probability threshold. The speech endpoint detection result may indicate the start position and the end position of the speech audio included in the target audio.
In some optional implementations of this embodiment, the predetermined threshold includes a predetermined first threshold and a predetermined second threshold, the first threshold being greater than the second threshold, and the speech endpoint detection result includes position information of the start point and position information of the end point, within the target audio, of the speech audio included in the target audio. The generation unit includes: a first generation module (not shown) configured to generate, based on the magnitude relationship between the obtained probabilities and the predetermined first threshold, the position information of the start point of the speech audio in the target audio; and a second generation module (not shown) configured to generate, based on the magnitude relationship between the obtained probabilities and the predetermined second threshold, the position information of the end point of the speech audio in the target audio.
In the device provided by the above embodiment of the present application, the second acquisition unit 801 obtains target audio that includes speech audio. Then, for each audio frame included in the target audio, the input unit 802 inputs the audio frame into a pre-trained speech recognition model to obtain the probability that the audio frame includes speech audio, where the speech recognition model is trained according to the method of any embodiment of the above method for generating a model. Finally, the generation unit 803 generates the speech endpoint detection result of the target audio based on the magnitude relationship between the obtained probability and a predetermined probability threshold. The speech recognition model is thereby applied to speech endpoint detection, which improves the accuracy of speech endpoint detection and enriches the ways in which speech endpoint detection can be performed.
Referring now to Fig. 9, a schematic structural diagram is shown of a computer system 900 of an electronic device suitable for implementing the embodiments of the present application. The electronic device shown in Fig. 9 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the system 900. The CPU 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the method of the present application are performed.
It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program may be used by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and such a medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to: wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Python, Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and any combination of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, this may be described as: a processor including a first acquisition unit and a training unit. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves; for example, the first acquisition unit may also be described as "a unit that obtains a training sample set for a target audio set".
In another aspect, the present application also provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtain a training sample set for a target audio set, where the target audio set includes audio obtained by performing truncation on initial audio, the training samples in the training sample set correspond one-to-one with the target audios in the target audio set, each training sample in the training sample set includes the feature data and the identification information of a target audio in the target audio set, the identification information indicates whether the audio frames included in the target audio include speech audio, and the initial audio includes speech audio; and, using a machine learning algorithm, train a speech recognition model by taking the feature data included in the training samples of the training sample set as input and the identification information corresponding to the input feature data as the desired output.
The above description is only of the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (24)
1. A method for generating a model, comprising:
obtaining a training sample set for a target audio set, wherein the target audio set includes audio obtained by performing truncation on initial audio, the training samples in the training sample set correspond one-to-one with the target audios in the target audio set, each training sample in the training sample set includes the feature data and the identification information of a target audio in the target audio set, the identification information indicates whether the audio frames included in the target audio include speech audio, and the initial audio includes speech audio;
using a machine learning algorithm, training a speech recognition model by taking the feature data included in the training samples of the training sample set as input and the identification information corresponding to the input feature data as the desired output.
2. The method according to claim 1, wherein the target audio set includes target audio obtained by performing the following processing on the initial audio:
truncating the initial audio to obtain a sub-audio sequence;
deleting at least one sub-audio from the sub-audio sequence;
determining the combination of the sub-audios retained after deletion as the target audio.
3. The method according to claim 2, wherein the deleting at least one sub-audio from the sub-audio sequence comprises:
deleting the first first-quantity sub-audios in the sub-audio sequence, wherein the first quantity is less than the number of sub-audios included in the sub-audio sequence.
4. The method according to claim 2, wherein the deleting at least one sub-audio from the sub-audio sequence comprises:
deleting the last second-quantity sub-audios in the sub-audio sequence, wherein the second quantity is less than the number of sub-audios included in the sub-audio sequence.
5. The method according to claim 1, wherein the target audio set is obtained by performing the following steps for each initial audio in an initial audio set:
randomly generating a first random number and a second random number, wherein the first random number and the second random number are numbers between 0 and 1;
in response to determining that the first random number is less than a first predetermined value, truncating and deleting the first third-quantity sub-audios included in the initial audio and determining the combination of the at least one sub-audio retained after deletion as the target audio, wherein the third quantity is less than half of the number of sub-audios included in the initial audio, and the first predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose first audio frame includes speech audio to the number of audios in the predetermined audio set;
in response to determining that the second random number is less than a second predetermined value, truncating and deleting the last fourth-quantity sub-audios included in the initial audio and determining the combination of the at least one sub-audio retained after deletion as the target audio, wherein the fourth quantity is less than half of the number of sub-audios included in the initial audio, and the second predetermined value characterizes the ratio of the number of audios in the predetermined audio set whose last audio frame includes speech audio to the number of audios in the predetermined audio set.
6. The method according to claim 1, wherein the using a machine learning algorithm, training a speech recognition model by taking the feature data included in the training samples of the training sample set as input and the identification information corresponding to the input feature data as the desired output, comprises:
selecting a training sample from the training sample set, and performing the following training step: inputting the feature data included in the selected training sample into an initial speech recognition model to obtain an actual output, wherein the actual output is the output of the initial speech recognition model; determining, based on the actual output, whether the initial speech recognition model satisfies a predetermined termination condition; and, in response to determining that the termination condition is satisfied, determining the initial speech recognition model that satisfies the termination condition as the trained speech recognition model.
7. The method according to claim 6, wherein the method further comprises:
in response to determining that the termination condition is not satisfied, adjusting the parameter values of the model parameters of the initial speech recognition model based on the obtained actual output and the desired output corresponding to the obtained actual output, selecting a training sample that has not yet been selected from the training sample set, and continuing to perform the training step based on the initial speech recognition model with the adjusted parameter values.
8. The method according to claim 6 or 7, wherein the activation function of the output layer of the initial speech
recognition model is a normalized exponential function (softmax), and the cost function of the output layer of the
initial speech recognition model is a cross-entropy cost function.
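For illustration only, and not part of the claims: the training step of claims 6 to 8 can be sketched as follows. The linear two-class model, the learning rate, the loss-window termination condition, and all function and parameter names are assumptions made for this sketch; the claims leave the model structure and the termination condition open.

```python
import math
import random


def softmax(logits):
    """Normalized exponential function over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def forward(weights, biases, features):
    """Actual output: [P(non-speech), P(speech)] for one frame's feature data."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def train(samples, dim, lr=0.5, loss_threshold=0.05, window=20,
          max_steps=5000, seed=0):
    """samples: list of (features, label), label 1 = speech, 0 = non-speech."""
    rng = random.Random(seed)
    weights = [[0.0] * dim for _ in range(2)]
    biases = [0.0, 0.0]
    unselected, recent = [], []
    for _ in range(max_steps):
        if not unselected:
            # Assumption: once every sample has been used, start a new pass.
            unselected = list(samples)
            rng.shuffle(unselected)
        features, label = unselected.pop()
        probs = forward(weights, biases, features)
        # Cross-entropy cost of the actual output against the desired output.
        recent = (recent + [-math.log(probs[label])])[-window:]
        # Assumed termination condition: low mean cost over a recent window.
        if len(recent) == window and sum(recent) / window < loss_threshold:
            break
        # For softmax with cross-entropy, d(cost)/d(logit_k) = p_k - y_k.
        for k in range(2):
            grad = probs[k] - (1.0 if k == label else 0.0)
            for j in range(dim):
                weights[k][j] -= lr * grad * features[j]
            biases[k] -= lr * grad
    return weights, biases
```

Calling `train(samples, dim)` returns the adjusted parameters; `forward(weights, biases, features)[1]` then gives the estimated probability that a frame includes speech audio.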
9. The method according to any one of claims 1-8, wherein the speech recognition model is a recurrent neural network
model with gated recurrent units.
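For illustration only, and not part of the claims: a single gated recurrent unit step, as commonly defined, can be sketched as below. The parameter names (`W_z`, `U_z`, `b_z` and so on), the list-based matrices, and the gate convention are assumptions of this sketch; the claim does not specify layer sizes or weights.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, x, u, h, b):
    """Compute W x + U h + b for small list-based matrices and vectors."""
    return [sum(wi * xi for wi, xi in zip(w_row, x))
            + sum(ui * hi for ui, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(w, u, b)]


def gru_step(x, h, p):
    """One gated recurrent unit step; p maps hypothetical names to weights."""
    z = [sigmoid(v) for v in affine(p["W_z"], x, p["U_z"], h, p["b_z"])]  # update gate
    r = [sigmoid(v) for v in affine(p["W_r"], x, p["U_r"], h, p["b_r"])]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(p["W_h"], x, p["U_h"], gated, p["b_h"])]
    # Convention assumed here: h_new = (1 - z) * h + z * candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(frames, p, hidden_size):
    """Return the hidden state after each frame of a feature sequence."""
    h = [0.0] * hidden_size
    states = []
    for x in frames:
        h = gru_step(x, h, p)
        states.append(h)
    return states
```

In a frame-level speech detector, each hidden state returned by `run_gru` would typically feed an output layer that produces the per-frame speech probability.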
10. A method for generating information, comprising:
acquiring a target audio, wherein the target audio includes speech audio;
for each audio frame included in the target audio, inputting the audio frame into a pre-trained speech recognition
model to obtain the probability that the audio frame includes speech audio, wherein the speech recognition model is
trained according to the method of any one of claims 1 to 9;
generating a speech endpoint detection result for the target audio based on the magnitude relationship between the
obtained probability and a predetermined probability threshold.
11. The method according to claim 10, wherein the predetermined threshold includes a predetermined first threshold
and a predetermined second threshold, the first threshold being greater than the second threshold, and the speech
endpoint detection result includes position information, within the target audio, of the start point and of the end
point of the speech audio included in the target audio; and
generating the speech endpoint detection result for the target audio based on the magnitude relationship between
the obtained probability and the predetermined threshold comprises:
generating, based on the magnitude relationship between the obtained probability and the predetermined first
threshold, the position information, within the target audio, of the start point of the speech audio included in
the target audio;
generating, based on the magnitude relationship between the obtained probability and the predetermined second
threshold, the position information, within the target audio, of the end point of the speech audio included in the
target audio.
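For illustration only, and not part of the claims: one way to turn per-frame speech probabilities into start and end positions using two thresholds is sketched below. The specific rule (start at the first frame at or above the first threshold, end at the last frame before the probability falls below the second threshold) and the default threshold values are assumptions; the claims only require that the first threshold exceed the second.

```python
def detect_endpoints(probs, start_threshold=0.7, end_threshold=0.3):
    """Return (start_frame, end_frame) of the speech segment, or None.

    probs: per-frame probabilities that the frame includes speech audio.
    """
    if start_threshold <= end_threshold:
        raise ValueError("the first threshold must exceed the second")
    # Start point: first frame whose probability reaches the first threshold.
    start = next((i for i, p in enumerate(probs) if p >= start_threshold), None)
    if start is None:
        return None
    # End point: last frame before the probability drops below the
    # second (lower) threshold; defaults to the final frame.
    end = len(probs) - 1
    for i in range(start + 1, len(probs)):
        if probs[i] < end_threshold:
            end = i - 1
            break
    return start, end
```

Using a higher threshold for the start than for the end makes the detector slower to open a segment on noise and slower to close it on a brief dip in probability.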
12. A device for generating a model, comprising:
a first acquisition unit, configured to acquire a training sample set for a target audio set, wherein the target
audio set includes audio obtained by performing truncation processing on initial audio, the training samples in the
training sample set correspond one-to-one to the target audios in the target audio set, each training sample in the
training sample set includes feature data and identification information of a target audio in the target audio set,
the identification information indicates whether the audio frames included in the target audio include speech
audio, and the initial audio includes speech audio;
a training unit, configured to use a machine learning algorithm to train a speech recognition model, with the
feature data included in the training samples of the training sample set as input and the identification
information corresponding to the input feature data as desired output.
13. The device according to claim 12, wherein the target audio set includes target audio obtained by performing the
following processing on initial audio:
performing truncation processing on the initial audio to obtain a sub-audio sequence;
deleting at least one sub-audio in the sub-audio sequence;
determining the combination of the sub-audios retained after deletion as the target audio.
14. The device according to claim 13, wherein deleting at least one sub-audio in the sub-audio sequence comprises:
deleting a first quantity of sub-audios from the front of the sub-audio sequence, wherein the first quantity is
less than the number of sub-audios included in the sub-audio sequence.
15. The device according to claim 13, wherein deleting at least one sub-audio in the sub-audio sequence comprises:
deleting a second quantity of sub-audios from the rear of the sub-audio sequence, wherein the second quantity is
less than the number of sub-audios included in the sub-audio sequence.
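For illustration only, and not part of the claims: the truncation, deletion and recombination of claims 13 to 15 can be sketched on a plain list of samples. The fixed sub-audio length and the function names are assumptions of this sketch.

```python
def split_into_sub_audios(samples, sub_len):
    """Truncate a sample list into consecutive sub-audios of sub_len samples."""
    if sub_len <= 0:
        raise ValueError("sub_len must be positive")
    return [samples[i:i + sub_len] for i in range(0, len(samples), sub_len)]


def delete_and_combine(sub_audios, drop_front=0, drop_back=0):
    """Delete sub-audios from the front and/or rear, then join the rest."""
    if drop_front < 0 or drop_back < 0:
        raise ValueError("drop counts must be non-negative")
    if drop_front + drop_back >= len(sub_audios):
        raise ValueError("at least one sub-audio must be retained")
    kept = sub_audios[drop_front:len(sub_audios) - drop_back]
    # The combination of the retained sub-audios is the target audio.
    return [sample for sub in kept for sample in sub]
```

Deleting from the front yields a target audio that begins in the middle of the original recording, and deleting from the rear yields one that ends early.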
16. The device according to claim 12, wherein the target audio set is obtained by performing the following steps on
each initial audio in an initial audio set:
randomly generating a first random number and a second random number, wherein the first random number and the
second random number are both numbers between 0 and 1;
in response to determining that the first random number is less than a predetermined first value, performing
truncation and deletion processing on a third quantity of sub-audios at the front of the initial audio, and
determining the combination of the at least one sub-audio retained after deletion as a target audio, wherein the
third quantity is less than half of the number of sub-audios included in the initial audio, and the first
predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose first audio
frame includes speech audio to the number of audios in the predetermined audio set;
in response to determining that the second random number is less than a predetermined second value, performing
truncation and deletion processing on a fourth quantity of sub-audios at the rear of the initial audio, and
determining the combination of the at least one sub-audio retained after deletion as a target audio, wherein the
fourth quantity is less than half of the number of sub-audios included in the initial audio, and the second
predetermined value characterizes the ratio of the number of audios in a predetermined audio set whose last audio
frame includes speech audio to the number of audios in the predetermined audio set.
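For illustration only, and not part of the claims: the random truncation described in claim 16 can be sketched as below. How many sub-audios to delete is not fixed by the claim beyond being fewer than half, so drawing that count uniformly at random is an assumption, as are the function and parameter names.

```python
import random


def speech_ratio(frame_has_speech):
    """Share of audios in a reference set whose first (or last) frame has speech."""
    return sum(1 for flag in frame_has_speech if flag) / len(frame_has_speech)


def augment(sub_audios, p_front, p_back, rng):
    """Randomly delete sub-audios from the front and/or rear of one audio.

    p_front and p_back play the role of the first and second predetermined
    values; rng is a random.Random instance.
    """
    n = len(sub_audios)
    max_drop = (n - 1) // 2  # largest count strictly below half of n
    r1, r2 = rng.random(), rng.random()  # both lie in [0, 1)
    # Assumption: the number of deleted sub-audios is drawn uniformly.
    front = rng.randint(1, max_drop) if r1 < p_front and max_drop >= 1 else 0
    back = rng.randint(1, max_drop) if r2 < p_back and max_drop >= 1 else 0
    # front + back < n, so at least one sub-audio is always retained.
    return sub_audios[front:n - back]
```

Setting `p_front` and `p_back` from `speech_ratio` over a reference set keeps the share of truncated training audio in line with how often speech is observed at the very first or last frame in that set.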
17. The device according to claim 12, wherein the training unit includes:
a training module, configured to select a training sample from the training sample set and perform the following
training step: inputting the feature data included in the selected training sample into an initial speech
recognition model to obtain an actual output, wherein the actual output is the output of the initial speech
recognition model; determining, based on the actual output, whether the initial speech recognition model satisfies a
predetermined termination condition; and in response to determining that the termination condition is satisfied,
determining the initial speech recognition model that satisfies the termination condition as the trained speech
recognition model.
18. The device according to claim 17, wherein the device further comprises:
an adjustment unit, configured to, in response to determining that the termination condition is not satisfied,
adjust the parameter values of the model parameters of the initial speech recognition model based on the obtained
actual output and the desired output corresponding to the obtained actual output, select a previously unselected
training sample from the training sample set, and continue to perform the training step based on the initial speech
recognition model with the adjusted parameter values.
19. The device according to claim 17 or 18, wherein the activation function of the output layer of the initial
speech recognition model is a normalized exponential function (softmax), and the cost function of the output layer
of the initial speech recognition model is a cross-entropy cost function.
20. The device according to any one of claims 12-19, wherein the speech recognition model is a recurrent neural
network model with gated recurrent units.
21. A device for generating information, comprising:
a second acquisition unit, configured to acquire a target audio, wherein the target audio includes speech audio;
an input unit, configured to, for each audio frame included in the target audio, input the audio frame into a
pre-trained speech recognition model to obtain the probability that the audio frame includes speech audio, wherein
the speech recognition model is trained according to the method described in any one of claims 12-20;
a generation unit, configured to generate a speech endpoint detection result for the target audio based on the
magnitude relationship between the obtained probability and a predetermined probability threshold.
22. The device according to claim 21, wherein the predetermined threshold includes a predetermined first threshold
and a predetermined second threshold, the first threshold being greater than the second threshold, and the speech
endpoint detection result includes position information, within the target audio, of the start point and of the end
point of the speech audio included in the target audio; and
the generation unit includes:
a first generation module, configured to generate, based on the magnitude relationship between the obtained
probability and the predetermined first threshold, the position information, within the target audio, of the start
point of the speech audio included in the target audio;
a second generation module, configured to generate, based on the magnitude relationship between the obtained
probability and the predetermined second threshold, the position information, within the target audio, of the end
point of the speech audio included in the target audio.
23. An electronic device, comprising:
one or more processors;
a storage device storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to
implement the method according to any one of claims 1-11.
24. A computer-readable medium storing a computer program, wherein the program, when executed by a processor,
implements the method according to any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811550086.XA CN109545193B (en) | 2018-12-18 | 2018-12-18 | Method and apparatus for generating a model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811550086.XA CN109545193B (en) | 2018-12-18 | 2018-12-18 | Method and apparatus for generating a model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109545193A (en) | 2019-03-29 |
CN109545193B (en) | 2023-03-14 |
Family
ID=65855598
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811550086.XA Active CN109545193B (en) | 2018-12-18 | 2018-12-18 | Method and apparatus for generating a model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109545193B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111292766A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device, and medium for generating speech samples |
CN111432233A (en) * | 2020-03-20 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating video |
CN111985643A (en) * | 2020-08-21 | 2020-11-24 | 腾讯音乐娱乐科技(深圳)有限公司 | Training method for generating network, audio data enhancement method and related device |
CN112151054A (en) * | 2020-09-07 | 2020-12-29 | 北京达佳互联信息技术有限公司 | Audio noise reduction processing method and device, server and storage medium |
CN112581970A (en) * | 2019-09-12 | 2021-03-30 | 深圳市韶音科技有限公司 | System and method for audio signal generation |
CN112887789A (en) * | 2021-01-22 | 2021-06-01 | 北京百度网讯科技有限公司 | Video generation model construction method, video generation device, video generation equipment and video generation medium |
CN113076932A (en) * | 2021-04-28 | 2021-07-06 | 百度在线网络技术(北京)有限公司 | Method for training audio language recognition model, video detection method and device thereof |
CN113380225A (en) * | 2021-06-18 | 2021-09-10 | 广州虎牙科技有限公司 | Language model training method, speech recognition method and related device |
US11902759B2 (en) | 2019-09-12 | 2024-02-13 | Shenzhen Shokz Co., Ltd. | Systems and methods for audio signal generation |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050055201A1 (en) * | 2003-09-10 | 2005-03-10 | Microsoft Corporation, Corporation In The State Of Washington | System and method for real-time detection and preservation of speech onset in a signal |
US20060004573A1 (en) * | 2004-07-01 | 2006-01-05 | International Business Machines Corporation | Microphone initialization enhancement for speech recognition |
US20110145001A1 (en) * | 2009-12-10 | 2011-06-16 | At&T Intellectual Property I, L.P. | Automated detection and filtering of audio advertisements |
CN105976810A (en) * | 2016-04-28 | 2016-09-28 | Tcl集团股份有限公司 | Method and device for detecting endpoints of effective discourse segment in voices |
CN106887241A (en) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | A kind of voice signal detection method and device |
US20170236532A1 (en) * | 2013-07-02 | 2017-08-17 | Family Systems, Ltd. | Systems and methods for improving audio conferencing services |
CN108766418A (en) * | 2018-05-24 | 2018-11-06 | 百度在线网络技术(北京)有限公司 | Sound end recognition methods, device and equipment |
CN108877778A (en) * | 2018-06-13 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Sound end detecting method and equipment |
CN108877779A (en) * | 2018-08-22 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for detecting voice tail point |
CN108922513A (en) * | 2018-06-04 | 2018-11-30 | 平安科技(深圳)有限公司 | Speech differentiation method, apparatus, computer equipment and storage medium |
2018-12-18: CN201811550086.XA (CN109545193B), status: Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050055201A1 (en) * | 2003-09-10 | 2005-03-10 | Microsoft Corporation, Corporation In The State Of Washington | System and method for real-time detection and preservation of speech onset in a signal |
US20060004573A1 (en) * | 2004-07-01 | 2006-01-05 | International Business Machines Corporation | Microphone initialization enhancement for speech recognition |
US20110145001A1 (en) * | 2009-12-10 | 2011-06-16 | At&T Intellectual Property I, L.P. | Automated detection and filtering of audio advertisements |
US20170236532A1 (en) * | 2013-07-02 | 2017-08-17 | Family Systems, Ltd. | Systems and methods for improving audio conferencing services |
CN105976810A (en) * | 2016-04-28 | 2016-09-28 | Tcl集团股份有限公司 | Method and device for detecting endpoints of effective discourse segment in voices |
CN106887241A (en) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | A kind of voice signal detection method and device |
CN108766418A (en) * | 2018-05-24 | 2018-11-06 | 百度在线网络技术(北京)有限公司 | Sound end recognition methods, device and equipment |
CN108922513A (en) * | 2018-06-04 | 2018-11-30 | 平安科技(深圳)有限公司 | Speech differentiation method, apparatus, computer equipment and storage medium |
CN108877778A (en) * | 2018-06-13 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Sound end detecting method and equipment |
CN108877779A (en) * | 2018-08-22 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for detecting voice tail point |
Non-Patent Citations (1)
Title |
---|
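童思博 (Tong Sibo): "Speech endpoint detection based on deep learning", China Master's Theses Full-text Database (Information Science and Technology) * |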
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112581970A (en) * | 2019-09-12 | 2021-03-30 | 深圳市韶音科技有限公司 | System and method for audio signal generation |
US11902759B2 (en) | 2019-09-12 | 2024-02-13 | Shenzhen Shokz Co., Ltd. | Systems and methods for audio signal generation |
CN111292766A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device, and medium for generating speech samples |
CN111292766B (en) * | 2020-02-07 | 2023-08-08 | 抖音视界有限公司 | Method, apparatus, electronic device and medium for generating voice samples |
CN111432233A (en) * | 2020-03-20 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating video |
CN111985643B (en) * | 2020-08-21 | 2023-12-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Training method for generating network, audio data enhancement method and related devices |
CN111985643A (en) * | 2020-08-21 | 2020-11-24 | 腾讯音乐娱乐科技(深圳)有限公司 | Training method for generating network, audio data enhancement method and related device |
CN112151054A (en) * | 2020-09-07 | 2020-12-29 | 北京达佳互联信息技术有限公司 | Audio noise reduction processing method and device, server and storage medium |
CN112151054B (en) * | 2020-09-07 | 2024-02-13 | 北京达佳互联信息技术有限公司 | Audio noise reduction processing method, device, server and storage medium |
CN112887789A (en) * | 2021-01-22 | 2021-06-01 | 北京百度网讯科技有限公司 | Video generation model construction method, video generation device, video generation equipment and video generation medium |
CN112887789B (en) * | 2021-01-22 | 2023-02-21 | 北京百度网讯科技有限公司 | Video generation model construction method, video generation device, video generation equipment and video generation medium |
CN113076932A (en) * | 2021-04-28 | 2021-07-06 | 百度在线网络技术(北京)有限公司 | Method for training audio language recognition model, video detection method and device thereof |
CN113076932B (en) * | 2021-04-28 | 2023-08-04 | 百度在线网络技术(北京)有限公司 | Method for training audio language identification model, video detection method and device thereof |
CN113380225A (en) * | 2021-06-18 | 2021-09-10 | 广州虎牙科技有限公司 | Language model training method, speech recognition method and related device |
CN113380225B (en) * | 2021-06-18 | 2024-05-17 | 广州虎牙科技有限公司 | Language model training method, voice recognition method and related device |
Also Published As
Publication number | Publication date |
---|---|
CN109545193B (en) | 2023-03-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109545193A (en) | Method and apparatus for generating model | |
CN109545192A (en) | Method and apparatus for generating model | |
CN107945786A (en) | Phoneme synthesizing method and device | |
CN107623614A (en) | Method and apparatus for pushed information | |
CN108305626A (en) | The sound control method and device of application program | |
CN109189544B (en) | Method and device for generating dial plate | |
CN109635095A (en) | Method and apparatus for optimizing dialog model | |
CN108805091A (en) | Method and apparatus for generating model | |
CN109101919A (en) | Method and apparatus for generating information | |
CN107153496A (en) | Method and apparatus for inputting emotion icons | |
CN107705782B (en) | Method and device for determining phoneme pronunciation duration | |
CN108196820A (en) | For adjusting the method and apparatus of play parameter | |
CN107481715B (en) | Method and apparatus for generating information | |
CN109977839A (en) | Information processing method and device | |
CN109697978B (en) | Method and apparatus for generating a model | |
CN113257283B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
CN111081280A (en) | Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method | |
CN108877779A (en) | Method and apparatus for detecting voice tail point | |
CN108521516A (en) | Control method and device for terminal device | |
CN109920431A (en) | Method and apparatus for output information | |
CN107342083A (en) | Method and apparatus for providing voice service | |
CN114999441A (en) | Avatar generation method, apparatus, device, storage medium, and program product | |
CN110084317A (en) | The method and apparatus of image for identification | |
CN109582825A (en) | Method and apparatus for generating information | |
CN108595412A (en) | Correction processing method and device, computer equipment and readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||