CN108682436A - Voice alignment method and device - Google Patents
- Publication number
- CN108682436A (application CN201810449585.3A)
- Authority
- CN
- China
- Prior art keywords
- voice data
- speech
- voice
- speech samples
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The voice alignment method and device provided by the invention obtain multiple voice data corresponding to the same voice content collected by different recording devices, and select a speech segment from any one of the voice data as a speech sample; determine the sample frame number of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame number; determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data; and align the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thus effectively improves processing efficiency.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a voice alignment method and device.
Background
Voice capture refers to collecting a speaker's voice in the form of data; this technology has a wide range of application scenarios.
In general, the voice of the same speaker in the same recording scene often needs to be captured by multiple recording devices, and the capture starting points of the voice data collected by different recording devices cannot be guaranteed to be exactly the same. Therefore, to ensure that the capture starting points of the voice data collected by the multiple recording devices are consistent, and to facilitate subsequent processing of these voice data such as synthesis, how to align the voice data becomes a technical problem.
In the prior art, voice data are usually aligned manually. For example, when faced with voice data with different capture starting points, a technician must manually compare the waveform of each piece of voice data and drag the starting points together to align them. This manual alignment requires a great deal of time, its processing efficiency and alignment accuracy are both low, and it is ill-suited to processing large volumes of voice data.
Summary of the invention
In view of the above technical problem of how to improve the processing efficiency and alignment accuracy of voice alignment, the present invention provides a voice alignment method and device.
In one aspect, the present invention provides a voice alignment method, including:
obtaining multiple voice data corresponding to the same voice content collected by different recording devices, and selecting a speech segment from any one of the voice data as a speech sample;
determining the sample frame number of the speech sample, and extracting speech feature parameters of the speech sample according to the sample frame number;
determining, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data;
aligning the time axes of the multiple voice data according to the speech sample and each target speech segment.
In one optional embodiment, determining the sample frame number of the speech sample and extracting the speech feature parameters of the speech sample according to the sample frame number includes:
determining the sample frame number of the speech sample according to the duration of the speech sample;
performing cepstral analysis on the speech sample according to the sample frame number to obtain the mel-frequency cepstral coefficients of the speech sample.
In one optional embodiment, determining, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data includes, for each pending voice data among the other voice data:
selecting a target frame of the pending voice data as the current frame, and taking the current frame and the several consecutive frames after it as the current speech segment, where the number of those consecutive frames equals the sample frame number;
extracting the speech feature parameters of the current speech segment, and calculating the similarity of the current speech segment according to the speech feature parameters of the current speech segment and those of the speech sample;
taking the next frame after the target frame as the current frame and repeating the step of taking the current frame and the several consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data;
according to the obtained similarities, taking the current speech segment with the highest similarity as the target speech segment of the pending voice data.
In one optional embodiment, selecting a speech segment from any one of the voice data as a speech sample includes:
determining the duration of the voice data;
selecting a speech segment as the speech sample according to the duration of the voice data.
In one optional embodiment, aligning the time axes of the multiple voice data according to the speech sample and each target speech segment includes:
aligning the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
In another aspect, the present invention provides a voice alignment device, including:
a collecting unit, configured to obtain multiple voice data corresponding to the same voice content collected by different recording devices;
a processing unit, configured to select a speech segment from any one of the voice data as a speech sample; determine the sample frame number of the speech sample and extract speech feature parameters of the speech sample according to the sample frame number; and determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data;
an alignment unit, configured to align the time axes of the multiple voice data according to the speech sample and each target speech segment.
In one optional embodiment, the processing unit is specifically configured to:
determine the sample frame number of the speech sample according to the duration of the speech sample; and
perform cepstral analysis on the speech sample according to the sample frame number to obtain the mel-frequency cepstral coefficients of the speech sample.
In one optional embodiment, the processing unit is specifically configured to, for each pending voice data among the other voice data:
select a target frame of the pending voice data as the current frame, and take the current frame and the several consecutive frames after it as the current speech segment, where the number of those consecutive frames equals the sample frame number;
extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment according to the speech feature parameters of the current speech segment and those of the speech sample;
take the next frame after the target frame as the current frame and repeat the step of taking the current frame and the several consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and
according to the obtained similarities, take the current speech segment with the highest similarity as the target speech segment of the pending voice data.
In one optional embodiment, the processing unit is specifically configured to:
determine the duration of the voice data; and
select a speech segment as the speech sample according to the duration of the voice data.
In one optional embodiment, the alignment unit is specifically configured to:
align the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
The voice alignment method and device provided by the invention obtain multiple voice data corresponding to the same voice content collected by different recording devices, select a speech segment from any one of the voice data as a speech sample, determine the sample frame number of the speech sample, extract speech feature parameters of the speech sample according to the sample frame number, determine the target speech segment with the highest similarity to the speech sample in each of the other voice data according to those feature parameters (the other voice data being the voice data other than the selected voice data among the multiple voice data), and align the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thus effectively improves processing efficiency.
Description of the drawings
The accompanying drawings described below illustrate specific embodiments of the disclosure in more detail. These drawings and the accompanying text are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure for those skilled in the art by reference to specific embodiments.
Fig. 1 is a flow diagram of a voice alignment method provided by embodiment one of the present invention;
Fig. 2 is a flow diagram of a voice alignment method provided by embodiment two of the present invention;
Fig. 3 is a structural diagram of a voice alignment device provided by embodiment three of the present invention.
The drawings herein are incorporated into and form part of this specification; they show embodiments consistent with the disclosure and, together with the specification, serve to explain its principles.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely in conjunction with the accompanying drawings.
Fig. 1 is a flow diagram of a voice alignment method provided by embodiment one of the present invention.
As shown in Fig. 1, the voice alignment method includes:
Step 101: obtain multiple voice data corresponding to the same voice content collected by different recording devices;
Step 102: select a speech segment from any one of the voice data as a speech sample;
Step 103: determine the sample frame number of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame number;
Step 104: determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data;
Step 105: align the time axes of the multiple voice data according to the speech sample and each target speech segment.
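As a toy end-to-end illustration of steps 101 to 105, the following sketch recovers the capture-start differences of three simulated recordings of one utterance. It is an illustration only: the function name `align_offsets`, the use of raw sample values as the "feature parameters", and the squared-error score are our assumptions for the sketch; the embodiments themselves use mel-frequency cepstral coefficients.

```python
# Toy illustration of steps 101-105, using raw sample values as features
# and a squared-error score (both assumptions; the patent uses MFCCs).
import random

def align_offsets(recordings, sample_len):
    """Return, for each recording, the start index of the best match to a
    sample taken from the middle of recordings[0]."""
    ref = recordings[0]
    start = (len(ref) - sample_len) // 2          # step 102: middle segment
    sample = ref[start:start + sample_len]
    offsets = [start]                             # position in the reference
    for rec in recordings[1:]:                    # step 104: other voice data
        best_pos, best_err = 0, float("inf")
        for pos in range(len(rec) - sample_len + 1):
            seg = rec[pos:pos + sample_len]
            err = sum((a - b) ** 2 for a, b in zip(sample, seg))
            if err < best_err:
                best_pos, best_err = pos, err
        offsets.append(best_pos)
    return offsets

# One utterance "captured" by three devices that started 0, 30, and 10
# samples late; the relative shifts recover those differences (step 105).
random.seed(0)
speech = [random.random() for _ in range(200)]
recs = [speech, speech[30:], speech[10:]]
offs = align_offsets(recs, sample_len=40)
print([offs[0] - o for o in offs])  # [0, 30, 10]
```

The brute-force inner loop is quadratic in the recording length; it stands in for the frame-stepped search that embodiment two spells out.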
It should be noted that the execution body of the voice alignment method provided by the invention may specifically be a voice alignment device, which can be realized in hardware and/or software. It can generally be integrated into the local server or cloud server on which a voice platform is based, cooperating with the data server on which the voice platform stores its various voice data. The local server or cloud server on which the voice alignment device is based may be the same server as the data server, or a different server belonging to the same server cluster; the present invention is not limited in this respect.
Specifically, the application scenario on which the present invention is based is as follows: the same speaker in the same recording scene is often recorded with different recording devices, so multiple voice data corresponding to the same voice content, collected by different recording devices, can be obtained first. A recording device may specifically be a smart terminal loaded with a particular application system, or professional recording equipment. Although the multiple voice data correspond to the same voice content, their time axes are inconsistent because of differences among the recording devices they belong to. For example, when several people each record the same speech content with their own mobile phones, the moments at which they press the record button differ, so the time axes of the resulting voice data for that speech content differ.
Then, any one of the multiple voice data can be selected, and a speech segment within it chosen as the speech sample. In general, this speech segment may be a set of randomly chosen consecutive frames. Preferably, to further improve alignment accuracy, the segment can be chosen according to the duration of the voice data it belongs to: determine the duration of the voice data, and select a speech segment as the speech sample according to that duration. For example, if the voice data is 10 minutes long, a 1-minute segment may be chosen as the speech sample; if it is 50 minutes long, a 5-minute segment may be chosen. In both examples the chosen segment's duration is 10% of the duration of its voice data, but segments of other durations may also be chosen; this embodiment is not limited in this respect. More preferably, the chosen segment can be a middle section of the voice data, i.e., the first frame of the segment is not the first frame of the voice data and the last frame of the segment is not the last frame of the voice data; choosing a middle section can further improve accuracy.
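The preferred selection just described can be sketched in a few lines. The function name and the fixed 10% default are our own; the text only gives 10% as an example ratio, and the middle placement follows its "middle section" preference.

```python
# Sketch of the preferred sample selection: a middle segment covering
# roughly 10% of the recording, so the segment's first/last frames are
# never the recording's first/last frames.
def choose_sample_range(num_frames, ratio=0.1):
    """Return (start, end) frame indices of a middle segment covering
    `ratio` of a recording that is `num_frames` frames long."""
    sample_len = max(1, int(num_frames * ratio))
    start = (num_frames - sample_len) // 2
    return start, start + sample_len

# A 10-minute recording at 20 ms per frame has 30000 frames; 10% of it
# is a 3000-frame (1-minute) sample taken from the middle.
print(choose_sample_range(30000))  # (13500, 16500)
```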
After the speech sample is chosen, its speech feature parameters can be extracted according to its sample frame number. The sample frame number is generally positively correlated with the duration of the speech sample; for example, for voice data with 16-bit acquisition precision, one frame corresponds to a duration of 20 ms, so the sample frame number can be determined from the duration of the speech sample. The speech feature parameters can then be extracted according to the sample frame number, for example mel-frequency cepstral coefficients (MFCCs): cepstral analysis is performed on the speech sample to obtain its mel-frequency cepstral coefficients. The speech feature parameters may also be other parameters, and any existing method may be used to perform the cepstral analysis that yields the mel-frequency cepstral coefficients; this embodiment is not limited in either respect. It should be noted that, in general, to balance processing speed against alignment accuracy, only the first 12 columns of coefficients need be extracted as the mel-frequency cepstral coefficients, i.e., the obtained coefficients form a one-dimensional array of 12 coefficients.
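A minimal sketch of this stage is given below, under stated simplifications: it computes a plain real cepstrum per frame (log magnitude spectrum followed by another transform) and keeps the first 12 coefficients, mirroring the "first 12 columns" choice. It is not a true mel-scaled MFCC (there is no mel filterbank), the naive DFT is only suitable for tiny frames, and the function names are ours; the 20 ms frame duration follows the text.

```python
# Simplified cepstral features: per frame, log-magnitude spectrum, then a
# transform of it, keeping 12 coefficients (a real cepstrum, NOT a true
# mel-scaled MFCC -- no mel filterbank is applied in this sketch).
import cmath, math

def dft(xs):
    """Naive O(n^2) discrete Fourier transform (fine for tiny frames)."""
    n = len(xs)
    return [sum(xs[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def frame_features(frame, n_coeffs=12):
    """First 12 cepstral coefficients of one frame."""
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(frame)]
    ceps = dft(log_mag)  # log-magnitude is symmetric, so real part suffices
    return [c.real / len(frame) for c in ceps[:n_coeffs]]

def sample_frame_count(duration_s, frame_ms=20):
    """Sample frame number from duration, at 20 ms per frame."""
    return int(duration_s * 1000 // frame_ms)

# A 1-minute sample yields 3000 frames; each frame maps to 12 coefficients.
print(sample_frame_count(60))  # 3000
print(len(frame_features([math.sin(0.3 * t) for t in range(32)])))  # 12
```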
After the speech feature parameters of the speech sample are obtained, the target speech segment with the highest similarity to the speech sample must be determined in each of the other voice data according to those feature parameters, the other voice data being the voice data other than the selected voice data among the multiple voice data. Note that in this step the target speech segment of each of the other voice data is determined separately, and the target segments may be determined one after another or in parallel. When a speech segment in one of the other voice data is determined to be that voice data's target speech segment, it follows that, compared with the other speech segments in that voice data, its speech feature parameters have the highest similarity to those of the speech sample.
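The text does not fix a particular similarity measure for comparing feature parameters. Cosine similarity over the flattened one-dimensional feature arrays is one common choice, sketched here with a function name of our own:

```python
# One possible similarity measure between two flattened feature arrays
# (an assumption for the sketch; the patent leaves the measure open).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 9))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))                      # 0.0
```

Because it normalizes by the vector magnitudes, cosine similarity is insensitive to overall gain differences between recordings, which is convenient when the same utterance was captured at different levels by different devices.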
Finally, the time axes of the multiple voice data are aligned according to the speech sample and each target speech segment; specifically, the alignment can be performed according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
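One straightforward reading of this position-based alignment is sketched below: once each recording's own time-axis position of the shared content is known (the sample's position in its recording, the matched target segment's position in each other recording), the relative offsets follow by subtraction. The helper name is ours, and subtraction is our interpretation, not a formula given in the text.

```python
# Sketch of the final alignment: relative time-axis offsets from the
# positions of the common segment on each recording's own time axis.
def time_axis_offsets(positions):
    """positions[i]: where the common segment starts (seconds) on
    recording i's own time axis. Returns each recording's offset relative
    to recording 0; a value of -1.5 means that recording began capturing
    1.5 s after the reference."""
    ref = positions[0]
    return [p - ref for p in positions]

# The sample starts at 13.0 s in recording A; the matched segments start
# at 11.5 s in B and 14.2 s in C.
print([round(x, 6) for x in time_axis_offsets([13.0, 11.5, 14.2])])
# [0.0, -1.5, 1.2]
```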
The voice alignment method provided by embodiment one obtains multiple voice data corresponding to the same voice content collected by different recording devices, selects a speech segment from any one of the voice data as a speech sample, determines the sample frame number of the speech sample, extracts speech feature parameters of the speech sample according to the sample frame number, determines the target speech segment with the highest similarity to the speech sample in each of the other voice data according to those feature parameters, and aligns the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thus effectively improves processing efficiency.
To describe the voice alignment method better, and building on embodiment one, Fig. 2 is a flow diagram of a voice alignment method provided by embodiment two of the present invention.
Embodiment two differs from embodiment one as follows: for each pending voice data among the other voice data, a target frame of the pending voice data is selected as the current frame, and the current frame and the several consecutive frames after it are taken as the current speech segment, where the number of those consecutive frames equals the sample frame number; the speech feature parameters of the current speech segment are extracted, and the similarity of the current speech segment is calculated according to the speech feature parameters of the current speech segment and those of the speech sample; the next frame after the target frame is then taken as the current frame, and the step of taking the current frame and the several consecutive frames after it as the current speech segment is repeated until the last frame of the current speech segment is the last frame of the pending voice data; according to the obtained similarities, the current speech segment with the highest similarity is taken as the target speech segment of the pending voice data.
Specifically, as shown in Fig. 2, the voice alignment method includes:
Step 201: obtain multiple voice data corresponding to the same voice content collected by different recording devices;
Step 202: select a speech segment from any one of the voice data as a speech sample;
Step 203: determine the sample frame number of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame number;
Step 204: select one of the other voice data as the pending voice data;
Step 205: select a target frame in the pending voice data as the current frame;
Step 206: take the current frame and the several consecutive frames after it as the current speech segment, where the number of those consecutive frames equals the sample frame number;
Step 207: extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment according to the speech feature parameters of the current speech segment and those of the speech sample;
Step 208: judge whether the last frame of the current speech segment is the last frame of the pending voice data; if so, execute step 210; otherwise, execute step 209;
Step 209: take the next frame after the target frame as the current frame, and return to step 206;
Step 210: according to the obtained similarities of the pending voice data, take the current speech segment with the highest similarity as the target speech segment of the pending voice data;
Step 211: select the next of the other voice data as the pending voice data, and return to the step of selecting a target frame in the pending voice data as the current frame, until the target speech segments of all the other voice data have been obtained;
Step 212: align the time axes of the multiple voice data according to the speech sample and each target speech segment.
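The inner loop of steps 205 to 211 can be written out directly: the current frame advances one frame per iteration until the current segment's last frame is the pending data's last frame. In this sketch frames are small lists of feature values, the negated squared error stands in for the similarity calculation of step 207 (an assumption; the patent leaves the measure open), and the function name is ours.

```python
# The frame-by-frame search of steps 205-211 for one pending voice data.
def find_target_segment(sample_frames, pending_frames):
    """Return the start index of the segment of pending_frames most
    similar to sample_frames (same frame count as the sample)."""
    n = len(sample_frames)
    best_start, best_score = 0, float("-inf")
    current = 0                                  # step 205: target frame
    while True:
        segment = pending_frames[current:current + n]      # step 206
        score = -sum((a - b) ** 2                          # step 207
                     for fa, fb in zip(sample_frames, segment)
                     for a, b in zip(fa, fb))
        if score > best_score:
            best_start, best_score = current, score
        if current + n >= len(pending_frames):   # step 208: last frame?
            return best_start                    # step 210: best segment
        current += 1                             # step 209: next frame

# Ten one-value frames; the sample was cut from position 6 and is found there.
frames = [[float(i)] for i in range(10)]
print(find_target_segment(frames[6:9], frames))  # 6
```

Step 211's outer loop simply repeats this search for each remaining pending voice data, sequentially or in parallel, as noted in embodiment one.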
As in embodiment one, the execution body of the voice alignment method may specifically be a voice alignment device realized in hardware and/or software. It can generally be integrated into the local server or cloud server on which a voice platform is based, cooperating with the data server on which the voice platform stores its various voice data; the local server or cloud server on which the voice alignment device is based may be the same server as the data server, or a different server belonging to the same server cluster, and the present invention is not limited in this respect.
Specifically, the application scenarios that the present invention is based on are:The same speaker being directed under same recording scene, it is past
It is past to be recorded to it using different sound pick-up outfits.Therefore, the same language acquired by different sound pick-up outfits can be obtained first
The corresponding multiple voice data of sound content.Wherein, sound pick-up outfit is concretely loaded with the different intelligent end of different application systems
End or professional recording equipment.Although multiple voice data corresponds to same voice content, but due to the record belonging to it
Because the recording devices differ, the time axes of the voice data are inconsistent. For example, for multiple pieces of voice data of the same speech content, different recorders pressing the record button on their own mobile phones at different moments will produce differences, so the time axes of the multiple pieces of voice data corresponding to that speech content will differ. Any one piece of voice data can then be chosen from the multiple pieces, and a speech segment can be chosen from it as the speech sample. In general, the speech segment may be a set of randomly selected consecutive frames. Preferably, to further improve alignment accuracy, the speech segment can be chosen according to the duration of the voice data to which it belongs: first determine the duration of the chosen voice data, then choose a speech segment as the speech sample according to that duration. For example, if the chosen voice data is 10 minutes long, a segment of 1 minute may be chosen as the speech sample; if it is 50 minutes long, a segment of 5 minutes may be chosen. In both examples the chosen segment's duration is 10% of the duration of the voice data it belongs to; segments of other durations may of course be chosen, and this embodiment is not limited in this respect. More preferably, the chosen segment can be a middle section of the voice data, i.e., the first frame of the segment is not the first frame of the voice data and the last frame of the segment is not the last frame of the voice data; choosing a middle section as the speech sample further improves alignment accuracy. After the speech sample has been chosen, its speech feature parameters can be extracted according to its sample frame count. The sample frame count is generally positively correlated with the duration of the speech sample; for example, for voice data acquired at 16-bit precision, one frame corresponds to a duration of 20 ms, so the sample frame count can be determined from the duration of the speech sample. The speech feature parameters, for example Mel-frequency cepstral coefficients (MFCC), can then be extracted according to the sample frame count, i.e., cepstral analysis is performed on the speech sample to obtain its Mel-frequency cepstral coefficients. The speech feature parameters may also be other parameters, and existing methods may be used to perform the cepstral analysis; this embodiment is not limited in either respect. It should be noted that, to balance processing speed against alignment accuracy, usually only the first 12 coefficients are extracted as the Mel-frequency cepstral coefficients, i.e., the MFCC obtained for each frame is a one-dimensional array of 12 coefficients.
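The segment-selection rule above can be sketched as follows. This is a minimal illustration assuming the 10% duration ratio, the middle-section placement, and the 20 ms frame length taken from the examples in the text; the function name is illustrative, not part of the patent:

```python
def choose_speech_sample(duration_s: float, ratio: float = 0.10,
                         frame_len_s: float = 0.02):
    """Pick a middle section of a recording as the speech sample.

    Returns (start_s, sample_len_s, sample_frame_count), where the
    sample duration is `ratio` of the recording duration and the sample
    is centred so that it contains neither the recording's first frame
    nor its last frame.
    """
    sample_len_s = duration_s * ratio
    # Centre the sample inside the recording.
    start_s = (duration_s - sample_len_s) / 2
    # One frame corresponds to 20 ms, so the frame count follows the
    # sample duration, as described in the text.
    sample_frame_count = int(round(sample_len_s / frame_len_s))
    return start_s, sample_len_s, sample_frame_count
```

For a 10-minute (600 s) recording this yields a 60 s sample of 3000 frames starting at 270 s, matching the 10% example in the text.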
Different from the first embodiment, in this embodiment, after the speech feature parameters of the speech sample have been obtained, the following is performed for each pending voice data among the other voice data: choose a target frame of the pending voice data as the current frame, and take the current frame together with the consecutive frames after it as the current speech segment, where the number of consecutive frames equals the sample frame count; extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment from the speech feature parameters of the current speech segment and those of the speech sample; choose the frame after the target frame as the new current frame, and repeat the step of taking the current frame and the consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; finally, among the similarities thus obtained, take the current speech segment with the highest similarity as the target speech segment of the pending voice data. The specific flow can be found in steps 204-211.
Formula (1) is a similarity formula provided by this embodiment:

f(x) = Σ_{n=1}^{numFrame} sim(MFCCref[n], MFCCwav2[n+x])    (1)

In formula (1), f(x) is the similarity between the speech sample and the speech segment of the pending voice data that starts at frame x, where x is the target frame in step 205; numFrame is the sample frame count of the speech sample; MFCCref[n] is the Mel-frequency cepstral coefficient vector of the n-th frame of the speech sample; MFCCwav2[n+x] is the Mel-frequency cepstral coefficient vector of the (n+x)-th frame of the pending voice data; and x, numFrame, and n are positive integers. Formula (1) computes, for the current speech segment of the pending voice data, the sum of the per-frame similarities between each of its frames and the corresponding frame of the speech sample. This sum can then be used as the similarity of the current speech segment in the subsequent process of determining the target speech segment.
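The sliding-window search of steps 204-211 can be sketched as below. The text does not fix a particular per-frame similarity measure, so the negative Euclidean distance between 12-dimensional MFCC vectors is assumed here purely for illustration:

```python
import numpy as np

def best_offset(mfcc_ref: np.ndarray, mfcc_wav2: np.ndarray) -> int:
    """Slide the sample's MFCC frames over the pending recording's
    MFCC frames and return the offset x with the highest f(x).

    mfcc_ref:  (numFrame, 12) MFCCs of the speech sample.
    mfcc_wav2: (numWav, 12)   MFCCs of the pending voice data.
    """
    num_frame = len(mfcc_ref)
    scores = []
    for x in range(len(mfcc_wav2) - num_frame + 1):
        window = mfcc_wav2[x:x + num_frame]
        # f(x): sum over the current segment of the per-frame
        # similarities (negative distance: larger means more similar).
        f_x = -np.linalg.norm(window - mfcc_ref, axis=1).sum()
        scores.append(f_x)
    return int(np.argmax(scores))
```

When the sample's frames are embedded verbatim at some offset of the pending recording, the search recovers exactly that offset, since the per-frame distances there are all zero.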
Finally, the time axes of the multiple voice data are aligned according to the speech sample and each target speech segment. Specifically, the time axes of the multiple voice data can be aligned according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
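A minimal sketch of this position-based alignment, assuming the start time of the speech sample and of the matched target segment are known in seconds on their respective time axes (the function names are illustrative):

```python
def time_axis_offset(sample_start_s: float, target_start_s: float) -> float:
    """Offset (in seconds) to add to the pending recording's time axis
    so that its target speech segment lines up with the speech sample
    on the reference recording's time axis.
    """
    return sample_start_s - target_start_s

def align_timestamp(t_pending_s: float, sample_start_s: float,
                    target_start_s: float) -> float:
    """Map a timestamp of the pending recording onto the reference
    recording's time axis using one matched segment pair.
    """
    return t_pending_s + time_axis_offset(sample_start_s, target_start_s)
```

For example, if the sample starts at 270 s on the reference recording and the matched segment starts at 250 s on the pending recording, every pending timestamp is shifted by +20 s.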
Preferably, on the basis of any of the foregoing embodiments, when the voice data are long and therefore carry a large amount of information, the voice data can be partitioned after the multiple pieces of voice data of the same speech content collected by different recording devices have been obtained. For example, each piece of voice data can be divided into equal parts along its time axis, yielding multiple speech data blocks per piece of voice data. Correspondingly, a speech segment can be chosen as a speech sample in each speech data block of the chosen voice data, i.e., the chosen voice data corresponds to multiple speech samples. Then, similarly to the preceding steps, the target speech segment with the highest similarity to each speech sample must be determined in the corresponding speech data block of each of the other voice data. In addition, in the process of aligning the time axes of the multiple voice data according to the speech samples and the target speech segments, linear fitting can be used to determine the time relationship between the voice data to which each target speech segment belongs and the voice data to which the speech samples belong.
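When several sample/segment pairs are available (one per data block), the linear-fit step can be sketched as follows. The text does not name a fitting method, so a first-degree least-squares fit (`numpy.polyfit`) is assumed here, recovering a relative clock rate and offset between two recordings:

```python
import numpy as np

def fit_time_relation(sample_starts_s, target_starts_s):
    """Fit t_ref ~ a * t_pending + b from matched segment positions.

    sample_starts_s: start times of the speech samples on the
                     reference recording's time axis.
    target_starts_s: start times of the matched target segments on
                     the pending recording's time axis.
    Returns (a, b): slope (relative clock rate) and offset in seconds.
    """
    a, b = np.polyfit(target_starts_s, sample_starts_s, deg=1)
    return float(a), float(b)
```

With targets at 10 s, 110 s, and 210 s matching samples at 12 s, 112 s, and 212 s, the fit yields a slope of 1.0 and an offset of 2.0 s, i.e., the recordings run at the same rate but are shifted by two seconds.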
The voice alignment method provided by embodiment two of the present invention obtains multiple voice data of the same speech content collected by different recording devices, and chooses a speech segment from any one of the voice data as a speech sample; determines the sample frame count of the speech sample, and extracts the speech feature parameters of the speech sample according to the sample frame count; determines, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, the other voice data being the voice data in the multiple voice data other than the chosen voice data; and aligns the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art technical problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thereby effectively improves processing efficiency.
Fig. 3 is a structural schematic diagram of a voice alignment apparatus provided by embodiment three of the present invention. As shown in Fig. 3, the voice alignment apparatus includes:

an acquisition unit 10, configured to obtain multiple voice data of the same speech content collected by different recording devices;

a processing unit 20, configured to choose a speech segment from any one of the voice data as a speech sample; determine the sample frame count of the speech sample, and extract the speech feature parameters of the speech sample according to the sample frame count; and determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data in the multiple voice data other than the chosen voice data; and

an alignment unit 30, configured to align the time axes of the multiple voice data according to the speech sample and each target speech segment.
Preferably, the processing unit 20 is specifically configured to:

determine the sample frame count of the speech sample according to the duration of the speech sample; and perform cepstral analysis on the speech sample according to the sample frame count to obtain the Mel-frequency cepstral coefficients of the speech sample.
Preferably, the processing unit 20 is specifically configured to:

for each pending voice data among the other voice data: choose a target frame of the pending voice data as the current frame, and take the current frame together with the consecutive frames after it as the current speech segment, where the number of consecutive frames equals the sample frame count; extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment from the speech feature parameters of the current speech segment and those of the speech sample; choose the frame after the target frame as the new current frame, and repeat the step of taking the current frame and the consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and, among the similarities obtained, take the current speech segment with the highest similarity as the target speech segment of the pending voice data.
Preferably, the processing unit 20 is specifically configured to:

determine the duration of the chosen voice data; and choose a speech segment as the speech sample according to the duration of the chosen voice data.
Preferably, the alignment unit 30 is specifically configured to:

align the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working process and corresponding beneficial effects of the system described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The voice alignment apparatus provided by embodiment three of the present invention obtains multiple voice data of the same speech content collected by different recording devices, and chooses a speech segment from any one of the voice data as a speech sample; determines the sample frame count of the speech sample, and extracts the speech feature parameters of the speech sample according to the sample frame count; determines, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, the other voice data being the voice data in the multiple voice data other than the chosen voice data; and aligns the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art technical problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thereby effectively improves processing efficiency.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by program instructions and related hardware. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disk, or optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A voice alignment method, comprising:

obtaining multiple voice data of the same speech content collected by different recording devices, and choosing a speech segment from any one of the voice data as a speech sample;

determining a sample frame count of the speech sample, and extracting speech feature parameters of the speech sample according to the sample frame count;

determining, according to the speech feature parameters of the speech sample, a target speech segment with the highest similarity to the speech sample in each of the other voice data, wherein the other voice data are the voice data in the multiple voice data other than the chosen voice data; and

aligning the time axes of the multiple voice data according to the speech sample and each target speech segment.
2. The voice alignment method according to claim 1, wherein determining the sample frame count of the speech sample and extracting the speech feature parameters of the speech sample according to the sample frame count comprises:

determining the sample frame count of the speech sample according to the duration of the speech sample; and

performing cepstral analysis on the speech sample according to the sample frame count to obtain the Mel-frequency cepstral coefficients of the speech sample.
3. The voice alignment method according to claim 1, wherein determining, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data comprises:

for each pending voice data among the other voice data:

choosing a target frame of the pending voice data as a current frame, and taking the current frame together with the consecutive frames after the current frame as a current speech segment, wherein the number of the consecutive frames equals the sample frame count;

extracting the speech feature parameters of the current speech segment, and calculating the similarity of the current speech segment according to the speech feature parameters of the current speech segment and the speech feature parameters of the speech sample;

choosing the frame after the target frame as the new current frame, and repeating the step of taking the current frame and the consecutive frames after the current frame as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and

taking, among the similarities obtained, the current speech segment with the highest similarity as the target speech segment of the pending voice data.
4. The voice alignment method according to claim 1, wherein choosing a speech segment from any one of the voice data as the speech sample comprises:

determining the duration of the chosen voice data; and

choosing a speech segment as the speech sample according to the duration of the chosen voice data.
5. The voice alignment method according to any one of claims 1 to 4, wherein aligning the time axes of the multiple voice data according to the speech sample and each target speech segment comprises:

aligning the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
6. A voice alignment apparatus, comprising:

an acquisition unit, configured to obtain multiple voice data of the same speech content collected by different recording devices;

a processing unit, configured to choose a speech segment from any one of the voice data as a speech sample; determine a sample frame count of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame count; and determine, according to the speech feature parameters of the speech sample, a target speech segment with the highest similarity to the speech sample in each of the other voice data, wherein the other voice data are the voice data in the multiple voice data other than the chosen voice data; and

an alignment unit, configured to align the time axes of the multiple voice data according to the speech sample and each target speech segment.
7. The voice alignment apparatus according to claim 6, wherein the processing unit is specifically configured to:

determine the sample frame count of the speech sample according to the duration of the speech sample; and

perform cepstral analysis on the speech sample according to the sample frame count to obtain the Mel-frequency cepstral coefficients of the speech sample.
8. The voice alignment apparatus according to claim 6, wherein the processing unit is specifically configured to:

for each pending voice data among the other voice data:

choose a target frame of the pending voice data as a current frame, and take the current frame together with the consecutive frames after the current frame as a current speech segment, wherein the number of the consecutive frames equals the sample frame count;

extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment according to the speech feature parameters of the current speech segment and the speech feature parameters of the speech sample;

choose the frame after the target frame as the new current frame, and repeat the step of taking the current frame and the consecutive frames after the current frame as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and

take, among the similarities obtained, the current speech segment with the highest similarity as the target speech segment of the pending voice data.
9. The voice alignment apparatus according to claim 6, wherein the processing unit is specifically configured to:

determine the duration of the chosen voice data; and

choose a speech segment as the speech sample according to the duration of the chosen voice data.
10. The voice alignment apparatus according to any one of claims 6 to 9, wherein the alignment unit is specifically configured to:

align the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810449585.3A CN108682436B (en) | 2018-05-11 | 2018-05-11 | Voice alignment method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108682436A true CN108682436A (en) | 2018-10-19 |
CN108682436B CN108682436B (en) | 2020-06-23 |
Family
ID=63805967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810449585.3A Active CN108682436B (en) | 2018-05-11 | 2018-05-11 | Voice alignment method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108682436B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091849A (en) * | 2020-03-03 | 2020-05-01 | 龙马智芯(珠海横琴)科技有限公司 | Snore identification method and device, storage medium snore stopping equipment and processor |
CN111597239A (en) * | 2020-04-10 | 2020-08-28 | 中科驭数(北京)科技有限公司 | Data alignment method and device |
CN113409815A (en) * | 2021-05-28 | 2021-09-17 | 合肥群音信息服务有限公司 | Voice alignment method based on multi-source voice data |
CN114495977A (en) * | 2022-01-28 | 2022-05-13 | 北京百度网讯科技有限公司 | Speech translation and model training method, device, electronic equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1742492A (en) * | 2003-02-14 | 2006-03-01 | 汤姆森特许公司 | Automatic synchronization of audio and video based media services of media content |
CN103931199A (en) * | 2011-11-14 | 2014-07-16 | 苹果公司 | Generation of multi -views media clips |
CN105430537A (en) * | 2015-11-27 | 2016-03-23 | 刘军 | Method and server for synthesis of multiple paths of data, and music teaching system |
CN105827997A (en) * | 2016-04-26 | 2016-08-03 | 厦门幻世网络科技有限公司 | Method and device for dubbing audio and visual digital media |
CN105845127A (en) * | 2015-01-13 | 2016-08-10 | 阿里巴巴集团控股有限公司 | Voice recognition method and system |
CN105989846A (en) * | 2015-06-12 | 2016-10-05 | 乐视致新电子科技(天津)有限公司 | Multi-channel speech signal synchronization method and device |
CN106612457A (en) * | 2016-11-09 | 2017-05-03 | 广州视源电子科技股份有限公司 | Method and system for video sequence alignment |
CN107667400A (en) * | 2015-03-09 | 2018-02-06 | 弗劳恩霍夫应用研究促进协会 | The audio coding of fragment alignment |
Also Published As
Publication number | Publication date |
---|---|
CN108682436B (en) | 2020-06-23 |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |