CN108682436A - Voice alignment method and device - Google Patents
- Publication number
- CN108682436A (application CN201810449585.3A)
- Authority
- CN
- China
- Prior art keywords
- voice data
- speech
- voice
- speech samples
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The voice alignment method and device provided by the invention obtain multiple voice data corresponding to the same voice content collected by different recording devices, and select a speech segment from any one of the voice data as a speech sample; determine the sample frame number of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame number; determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data; and align the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thus effectively improves processing efficiency.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a voice alignment method and device.
Background
Voice capture refers to collecting a speaker's voice in the form of data; this technology has a wide range of application scenarios.
In general, the voice of the same speaker in the same recording scene often needs to be captured by multiple recording devices, and the capture starting points of the voice data collected by different recording devices cannot be guaranteed to be exactly the same. Therefore, to ensure that the capture starting points of the voice data collected by the multiple recording devices are consistent, and to facilitate subsequent processing of these voice data such as synthesis, how to align the voice data becomes a technical problem.
In the prior art, voice data are usually aligned manually. For example, when faced with voice data with different capture starting points, a technician must manually compare the waveform of each piece of voice data and drag the starting points together to align them. This manual alignment requires a great deal of time, its processing efficiency and alignment accuracy are both low, and it is ill-suited to processing large volumes of voice data.
Summary of the invention
In view of the above technical problem of how to improve the processing efficiency and alignment accuracy of voice alignment, the present invention provides a voice alignment method and device.
In one aspect, the present invention provides a voice alignment method, including:
obtaining multiple voice data corresponding to the same voice content collected by different recording devices, and selecting a speech segment from any one of the voice data as a speech sample;
determining the sample frame number of the speech sample, and extracting speech feature parameters of the speech sample according to the sample frame number;
determining, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data;
aligning the time axes of the multiple voice data according to the speech sample and each target speech segment.
In one optional embodiment, determining the sample frame number of the speech sample and extracting the speech feature parameters of the speech sample according to the sample frame number includes:
determining the sample frame number of the speech sample according to the duration of the speech sample;
performing cepstral analysis on the speech sample according to the sample frame number to obtain the mel-frequency cepstral coefficients of the speech sample.
In one optional embodiment, determining, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data includes, for each pending voice data among the other voice data:
selecting a target frame of the pending voice data as the current frame, and taking the current frame and the several consecutive frames after it as the current speech segment, where the number of those consecutive frames equals the sample frame number;
extracting the speech feature parameters of the current speech segment, and calculating the similarity of the current speech segment according to the speech feature parameters of the current speech segment and those of the speech sample;
taking the next frame after the target frame as the current frame and repeating the step of taking the current frame and the several consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data;
according to the obtained similarities, taking the current speech segment with the highest similarity as the target speech segment of the pending voice data.
In one optional embodiment, selecting a speech segment from any one of the voice data as a speech sample includes:
determining the duration of the voice data;
selecting a speech segment as the speech sample according to the duration of the voice data.
In one optional embodiment, aligning the time axes of the multiple voice data according to the speech sample and each target speech segment includes:
aligning the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
In another aspect, the present invention provides a voice alignment device, including:
a collecting unit, configured to obtain multiple voice data corresponding to the same voice content collected by different recording devices;
a processing unit, configured to select a speech segment from any one of the voice data as a speech sample; determine the sample frame number of the speech sample and extract speech feature parameters of the speech sample according to the sample frame number; and determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data;
an alignment unit, configured to align the time axes of the multiple voice data according to the speech sample and each target speech segment.
In one optional embodiment, the processing unit is specifically configured to:
determine the sample frame number of the speech sample according to the duration of the speech sample; and
perform cepstral analysis on the speech sample according to the sample frame number to obtain the mel-frequency cepstral coefficients of the speech sample.
In one optional embodiment, the processing unit is specifically configured to, for each pending voice data among the other voice data:
select a target frame of the pending voice data as the current frame, and take the current frame and the several consecutive frames after it as the current speech segment, where the number of those consecutive frames equals the sample frame number;
extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment according to the speech feature parameters of the current speech segment and those of the speech sample;
take the next frame after the target frame as the current frame and repeat the step of taking the current frame and the several consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and
according to the obtained similarities, take the current speech segment with the highest similarity as the target speech segment of the pending voice data.
In one optional embodiment, the processing unit is specifically configured to:
determine the duration of the voice data; and
select a speech segment as the speech sample according to the duration of the voice data.
In one optional embodiment, the alignment unit is specifically configured to:
align the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
The voice alignment method and device provided by the invention obtain multiple voice data corresponding to the same voice content collected by different recording devices, select a speech segment from any one of the voice data as a speech sample, determine the sample frame number of the speech sample, extract speech feature parameters of the speech sample according to the sample frame number, determine the target speech segment with the highest similarity to the speech sample in each of the other voice data according to those feature parameters (the other voice data being the voice data other than the selected voice data among the multiple voice data), and align the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thus effectively improves processing efficiency.
Description of the drawings
The accompanying drawings described below illustrate specific embodiments of the disclosure in more detail. These drawings and the accompanying text are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure for those skilled in the art by reference to specific embodiments.
Fig. 1 is a flow diagram of a voice alignment method provided by embodiment one of the present invention;
Fig. 2 is a flow diagram of a voice alignment method provided by embodiment two of the present invention;
Fig. 3 is a structural diagram of a voice alignment device provided by embodiment three of the present invention.
The drawings herein are incorporated into and form part of this specification; they show embodiments consistent with the disclosure and, together with the specification, serve to explain its principles.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely in conjunction with the accompanying drawings.
Fig. 1 is a flow diagram of a voice alignment method provided by embodiment one of the present invention.
As shown in Fig. 1, the voice alignment method includes:
Step 101: obtain multiple voice data corresponding to the same voice content collected by different recording devices;
Step 102: select a speech segment from any one of the voice data as a speech sample;
Step 103: determine the sample frame number of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame number;
Step 104: determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data other than the selected voice data among the multiple voice data;
Step 105: align the time axes of the multiple voice data according to the speech sample and each target speech segment.
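As a toy end-to-end illustration of steps 101 to 105, the following sketch recovers the capture-start differences of three simulated recordings of one utterance. It is an illustration only: the function name `align_offsets`, the use of raw sample values as the "feature parameters", and the squared-error score are our assumptions for the sketch; the embodiments themselves use mel-frequency cepstral coefficients.

```python
# Toy illustration of steps 101-105, using raw sample values as features
# and a squared-error score (both assumptions; the patent uses MFCCs).
import random

def align_offsets(recordings, sample_len):
    """Return, for each recording, the start index of the best match to a
    sample taken from the middle of recordings[0]."""
    ref = recordings[0]
    start = (len(ref) - sample_len) // 2          # step 102: middle segment
    sample = ref[start:start + sample_len]
    offsets = [start]                             # position in the reference
    for rec in recordings[1:]:                    # step 104: other voice data
        best_pos, best_err = 0, float("inf")
        for pos in range(len(rec) - sample_len + 1):
            seg = rec[pos:pos + sample_len]
            err = sum((a - b) ** 2 for a, b in zip(sample, seg))
            if err < best_err:
                best_pos, best_err = pos, err
        offsets.append(best_pos)
    return offsets

# One utterance "captured" by three devices that started 0, 30, and 10
# samples late; the relative shifts recover those differences (step 105).
random.seed(0)
speech = [random.random() for _ in range(200)]
recs = [speech, speech[30:], speech[10:]]
offs = align_offsets(recs, sample_len=40)
print([offs[0] - o for o in offs])  # [0, 30, 10]
```

The brute-force inner loop is quadratic in the recording length; it stands in for the frame-stepped search that embodiment two spells out.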
It should be noted that the execution body of the voice alignment method provided by the invention may specifically be a voice alignment device, which can be realized in hardware and/or software. It can generally be integrated into the local server or cloud server on which a voice platform is based, cooperating with the data server on which the voice platform stores its various voice data. The local server or cloud server on which the voice alignment device is based may be the same server as the data server, or a different server belonging to the same server cluster; the present invention is not limited in this respect.
Specifically, the application scenario on which the present invention is based is as follows: the same speaker in the same recording scene is often recorded with different recording devices, so multiple voice data corresponding to the same voice content, collected by different recording devices, can be obtained first. A recording device may specifically be a smart terminal loaded with a particular application system, or professional recording equipment. Although the multiple voice data correspond to the same voice content, their time axes are inconsistent because of differences among the recording devices they belong to. For example, when several people each record the same speech content with their own mobile phones, the moments at which they press the record button differ, so the time axes of the resulting voice data for that speech content differ.
Then, any one of the multiple voice data can be selected, and a speech segment within it chosen as the speech sample. In general, this speech segment may be a set of randomly chosen consecutive frames. Preferably, to further improve alignment accuracy, the segment can be chosen according to the duration of the voice data it belongs to: determine the duration of the voice data, and select a speech segment as the speech sample according to that duration. For example, if the voice data is 10 minutes long, a 1-minute segment may be chosen as the speech sample; if it is 50 minutes long, a 5-minute segment may be chosen. In both examples the chosen segment's duration is 10% of the duration of its voice data, but segments of other durations may also be chosen; this embodiment is not limited in this respect. More preferably, the chosen segment can be a middle section of the voice data, i.e., the first frame of the segment is not the first frame of the voice data and the last frame of the segment is not the last frame of the voice data; choosing a middle section can further improve accuracy.
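The preferred selection just described can be sketched in a few lines. The function name and the fixed 10% default are our own; the text only gives 10% as an example ratio, and the middle placement follows its "middle section" preference.

```python
# Sketch of the preferred sample selection: a middle segment covering
# roughly 10% of the recording, so the segment's first/last frames are
# never the recording's first/last frames.
def choose_sample_range(num_frames, ratio=0.1):
    """Return (start, end) frame indices of a middle segment covering
    `ratio` of a recording that is `num_frames` frames long."""
    sample_len = max(1, int(num_frames * ratio))
    start = (num_frames - sample_len) // 2
    return start, start + sample_len

# A 10-minute recording at 20 ms per frame has 30000 frames; 10% of it
# is a 3000-frame (1-minute) sample taken from the middle.
print(choose_sample_range(30000))  # (13500, 16500)
```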
After the speech sample is chosen, its speech feature parameters can be extracted according to its sample frame number. The sample frame number is generally positively correlated with the duration of the speech sample; for example, for voice data with 16-bit acquisition precision, one frame corresponds to a duration of 20 ms, so the sample frame number can be determined from the duration of the speech sample. The speech feature parameters can then be extracted according to the sample frame number, for example mel-frequency cepstral coefficients (MFCCs): cepstral analysis is performed on the speech sample to obtain its mel-frequency cepstral coefficients. The speech feature parameters may also be other parameters, and any existing method may be used to perform the cepstral analysis that yields the mel-frequency cepstral coefficients; this embodiment is not limited in either respect. It should be noted that, in general, to balance processing speed against alignment accuracy, only the first 12 columns of coefficients need be extracted as the mel-frequency cepstral coefficients, i.e., the obtained coefficients form a one-dimensional array of 12 coefficients.
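A minimal sketch of this stage is given below, under stated simplifications: it computes a plain real cepstrum per frame (log magnitude spectrum followed by another transform) and keeps the first 12 coefficients, mirroring the "first 12 columns" choice. It is not a true mel-scaled MFCC (there is no mel filterbank), the naive DFT is only suitable for tiny frames, and the function names are ours; the 20 ms frame duration follows the text.

```python
# Simplified cepstral features: per frame, log-magnitude spectrum, then a
# transform of it, keeping 12 coefficients (a real cepstrum, NOT a true
# mel-scaled MFCC -- no mel filterbank is applied in this sketch).
import cmath, math

def dft(xs):
    """Naive O(n^2) discrete Fourier transform (fine for tiny frames)."""
    n = len(xs)
    return [sum(xs[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def frame_features(frame, n_coeffs=12):
    """First 12 cepstral coefficients of one frame."""
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(frame)]
    ceps = dft(log_mag)  # log-magnitude is symmetric, so real part suffices
    return [c.real / len(frame) for c in ceps[:n_coeffs]]

def sample_frame_count(duration_s, frame_ms=20):
    """Sample frame number from duration, at 20 ms per frame."""
    return int(duration_s * 1000 // frame_ms)

# A 1-minute sample yields 3000 frames; each frame maps to 12 coefficients.
print(sample_frame_count(60))  # 3000
print(len(frame_features([math.sin(0.3 * t) for t in range(32)])))  # 12
```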
After the speech feature parameters of the speech sample are obtained, the target speech segment with the highest similarity to the speech sample must be determined in each of the other voice data according to those feature parameters, the other voice data being the voice data other than the selected voice data among the multiple voice data. Note that in this step the target speech segment of each of the other voice data is determined separately, and the target segments may be determined one after another or in parallel. When a speech segment in one of the other voice data is determined to be that voice data's target speech segment, it follows that, compared with the other speech segments in that voice data, its speech feature parameters have the highest similarity to those of the speech sample.
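The text does not fix a particular similarity measure for comparing feature parameters. Cosine similarity over the flattened one-dimensional feature arrays is one common choice, sketched here with a function name of our own:

```python
# One possible similarity measure between two flattened feature arrays
# (an assumption for the sketch; the patent leaves the measure open).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 9))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))                      # 0.0
```

Because it normalizes by the vector magnitudes, cosine similarity is insensitive to overall gain differences between recordings, which is convenient when the same utterance was captured at different levels by different devices.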
Finally, the time axes of the multiple voice data are aligned according to the speech sample and each target speech segment; specifically, the alignment can be performed according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
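One straightforward reading of this position-based alignment is sketched below: once each recording's own time-axis position of the shared content is known (the sample's position in its recording, the matched target segment's position in each other recording), the relative offsets follow by subtraction. The helper name is ours, and subtraction is our interpretation, not a formula given in the text.

```python
# Sketch of the final alignment: relative time-axis offsets from the
# positions of the common segment on each recording's own time axis.
def time_axis_offsets(positions):
    """positions[i]: where the common segment starts (seconds) on
    recording i's own time axis. Returns each recording's offset relative
    to recording 0; a value of -1.5 means that recording began capturing
    1.5 s after the reference."""
    ref = positions[0]
    return [p - ref for p in positions]

# The sample starts at 13.0 s in recording A; the matched segments start
# at 11.5 s in B and 14.2 s in C.
print([round(x, 6) for x in time_axis_offsets([13.0, 11.5, 14.2])])
# [0.0, -1.5, 1.2]
```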
The voice alignment method provided by embodiment one obtains multiple voice data corresponding to the same voice content collected by different recording devices, selects a speech segment from any one of the voice data as a speech sample, determines the sample frame number of the speech sample, extracts speech feature parameters of the speech sample according to the sample frame number, determines the target speech segment with the highest similarity to the speech sample in each of the other voice data according to those feature parameters, and aligns the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thus effectively improves processing efficiency.
To describe the voice alignment method better, and building on embodiment one, Fig. 2 is a flow diagram of a voice alignment method provided by embodiment two of the present invention.
Embodiment two differs from embodiment one as follows: for each pending voice data among the other voice data, a target frame of the pending voice data is selected as the current frame, and the current frame and the several consecutive frames after it are taken as the current speech segment, where the number of those consecutive frames equals the sample frame number; the speech feature parameters of the current speech segment are extracted, and the similarity of the current speech segment is calculated according to the speech feature parameters of the current speech segment and those of the speech sample; the next frame after the target frame is then taken as the current frame, and the step of taking the current frame and the several consecutive frames after it as the current speech segment is repeated until the last frame of the current speech segment is the last frame of the pending voice data; according to the obtained similarities, the current speech segment with the highest similarity is taken as the target speech segment of the pending voice data.
Specifically, as shown in Fig. 2, the voice alignment method includes:
Step 201: obtain multiple voice data corresponding to the same voice content collected by different recording devices;
Step 202: select a speech segment from any one of the voice data as a speech sample;
Step 203: determine the sample frame number of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame number;
Step 204: select one of the other voice data as the pending voice data;
Step 205: select a target frame in the pending voice data as the current frame;
Step 206: take the current frame and the several consecutive frames after it as the current speech segment, where the number of those consecutive frames equals the sample frame number;
Step 207: extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment according to the speech feature parameters of the current speech segment and those of the speech sample;
Step 208: judge whether the last frame of the current speech segment is the last frame of the pending voice data; if so, execute step 210; otherwise, execute step 209;
Step 209: take the next frame after the target frame as the current frame, and return to step 206;
Step 210: according to the obtained similarities of the pending voice data, take the current speech segment with the highest similarity as the target speech segment of the pending voice data;
Step 211: select the next of the other voice data as the pending voice data, and return to the step of selecting a target frame in the pending voice data as the current frame, until the target speech segments of all the other voice data have been obtained;
Step 212: align the time axes of the multiple voice data according to the speech sample and each target speech segment.
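The inner loop of steps 205 to 211 can be written out directly: the current frame advances one frame per iteration until the current segment's last frame is the pending data's last frame. In this sketch frames are small lists of feature values, the negated squared error stands in for the similarity calculation of step 207 (an assumption; the patent leaves the measure open), and the function name is ours.

```python
# The frame-by-frame search of steps 205-211 for one pending voice data.
def find_target_segment(sample_frames, pending_frames):
    """Return the start index of the segment of pending_frames most
    similar to sample_frames (same frame count as the sample)."""
    n = len(sample_frames)
    best_start, best_score = 0, float("-inf")
    current = 0                                  # step 205: target frame
    while True:
        segment = pending_frames[current:current + n]      # step 206
        score = -sum((a - b) ** 2                          # step 207
                     for fa, fb in zip(sample_frames, segment)
                     for a, b in zip(fa, fb))
        if score > best_score:
            best_start, best_score = current, score
        if current + n >= len(pending_frames):   # step 208: last frame?
            return best_start                    # step 210: best segment
        current += 1                             # step 209: next frame

# Ten one-value frames; the sample was cut from position 6 and is found there.
frames = [[float(i)] for i in range(10)]
print(find_target_segment(frames[6:9], frames))  # 6
```

Step 211's outer loop simply repeats this search for each remaining pending voice data, sequentially or in parallel, as noted in embodiment one.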
As in embodiment one, the execution body of the voice alignment method may specifically be a voice alignment device realized in hardware and/or software. It can generally be integrated into the local server or cloud server on which a voice platform is based, cooperating with the data server on which the voice platform stores its various voice data; the local server or cloud server on which the voice alignment device is based may be the same server as the data server, or a different server belonging to the same server cluster, and the present invention is not limited in this respect.
Specifically, the application scenarios that the present invention is based on are:The same speaker being directed under same recording scene, it is past
It is past to be recorded to it using different sound pick-up outfits.Therefore, the same language acquired by different sound pick-up outfits can be obtained first
The corresponding multiple voice data of sound content.Wherein, sound pick-up outfit is concretely loaded with the different intelligent end of different application systems
End or professional recording equipment.Although multiple voice data corresponds to same voice content, but due to the record belonging to it
Because the recording devices differ, the time axes of the voice data are inconsistent. For example, for multiple pieces of voice data of the same speech content, different recorders pressing the record button on their own mobile phones at different moments will produce differences, so the time axes of the multiple pieces of voice data corresponding to that speech content will differ. Any one piece of voice data can then be chosen from the multiple pieces, and a speech segment can be chosen from it as the speech sample. In general, the speech segment may be a set of randomly selected consecutive frames. Preferably, to further improve alignment accuracy, the speech segment can be chosen according to the duration of the voice data to which it belongs: first determine the duration of the chosen voice data, then choose a speech segment as the speech sample according to that duration. For example, if the chosen voice data is 10 minutes long, a segment of 1 minute may be chosen as the speech sample; if it is 50 minutes long, a segment of 5 minutes may be chosen. In both examples the chosen segment's duration is 10% of the duration of the voice data it belongs to; segments of other durations may of course be chosen, and this embodiment is not limited in this respect. More preferably, the chosen segment can be a middle section of the voice data, i.e., the first frame of the segment is not the first frame of the voice data and the last frame of the segment is not the last frame of the voice data; choosing a middle section as the speech sample further improves alignment accuracy. After the speech sample has been chosen, its speech feature parameters can be extracted according to its sample frame count. The sample frame count is generally positively correlated with the duration of the speech sample; for example, for voice data acquired at 16-bit precision, one frame corresponds to a duration of 20 ms, so the sample frame count can be determined from the duration of the speech sample. The speech feature parameters, for example Mel-frequency cepstral coefficients (MFCC), can then be extracted according to the sample frame count, i.e., cepstral analysis is performed on the speech sample to obtain its Mel-frequency cepstral coefficients. The speech feature parameters may also be other parameters, and existing methods may be used to perform the cepstral analysis; this embodiment is not limited in either respect. It should be noted that, to balance processing speed against alignment accuracy, usually only the first 12 coefficients are extracted as the Mel-frequency cepstral coefficients, i.e., the MFCC obtained for each frame is a one-dimensional array of 12 coefficients.
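The segment-selection rule above can be sketched as follows. This is a minimal illustration assuming the 10% duration ratio, the middle-section placement, and the 20 ms frame length taken from the examples in the text; the function name is illustrative, not part of the patent:

```python
def choose_speech_sample(duration_s: float, ratio: float = 0.10,
                         frame_len_s: float = 0.02):
    """Pick a middle section of a recording as the speech sample.

    Returns (start_s, sample_len_s, sample_frame_count), where the
    sample duration is `ratio` of the recording duration and the sample
    is centred so that it contains neither the recording's first frame
    nor its last frame.
    """
    sample_len_s = duration_s * ratio
    # Centre the sample inside the recording.
    start_s = (duration_s - sample_len_s) / 2
    # One frame corresponds to 20 ms, so the frame count follows the
    # sample duration, as described in the text.
    sample_frame_count = int(round(sample_len_s / frame_len_s))
    return start_s, sample_len_s, sample_frame_count
```

For a 10-minute (600 s) recording this yields a 60 s sample of 3000 frames starting at 270 s, matching the 10% example in the text.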
Different from the first embodiment, in this embodiment, after the speech feature parameters of the speech sample have been obtained, the following is performed for each pending voice data among the other voice data: choose a target frame of the pending voice data as the current frame, and take the current frame together with the consecutive frames after it as the current speech segment, where the number of consecutive frames equals the sample frame count; extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment from the speech feature parameters of the current speech segment and those of the speech sample; choose the frame after the target frame as the new current frame, and repeat the step of taking the current frame and the consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; finally, among the similarities thus obtained, take the current speech segment with the highest similarity as the target speech segment of the pending voice data. The specific flow can be found in steps 204-211.
Formula (1) is a similarity formula provided by this embodiment:

f(x) = Σ_{n=1}^{numFrame} sim(MFCCref[n], MFCCwav2[n+x])    (1)

In formula (1), f(x) is the similarity between the speech sample and the speech segment of the pending voice data that starts at frame x, where x is the target frame in step 205; numFrame is the sample frame count of the speech sample; MFCCref[n] is the Mel-frequency cepstral coefficient vector of the n-th frame of the speech sample; MFCCwav2[n+x] is the Mel-frequency cepstral coefficient vector of the (n+x)-th frame of the pending voice data; and x, numFrame, and n are positive integers. Formula (1) computes, for the current speech segment of the pending voice data, the sum of the per-frame similarities between each of its frames and the corresponding frame of the speech sample. This sum can then be used as the similarity of the current speech segment in the subsequent process of determining the target speech segment.
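The sliding-window search of steps 204-211 can be sketched as below. The text does not fix a particular per-frame similarity measure, so the negative Euclidean distance between 12-dimensional MFCC vectors is assumed here purely for illustration:

```python
import numpy as np

def best_offset(mfcc_ref: np.ndarray, mfcc_wav2: np.ndarray) -> int:
    """Slide the sample's MFCC frames over the pending recording's
    MFCC frames and return the offset x with the highest f(x).

    mfcc_ref:  (numFrame, 12) MFCCs of the speech sample.
    mfcc_wav2: (numWav, 12)   MFCCs of the pending voice data.
    """
    num_frame = len(mfcc_ref)
    scores = []
    for x in range(len(mfcc_wav2) - num_frame + 1):
        window = mfcc_wav2[x:x + num_frame]
        # f(x): sum over the current segment of the per-frame
        # similarities (negative distance: larger means more similar).
        f_x = -np.linalg.norm(window - mfcc_ref, axis=1).sum()
        scores.append(f_x)
    return int(np.argmax(scores))
```

When the sample's frames are embedded verbatim at some offset of the pending recording, the search recovers exactly that offset, since the per-frame distances there are all zero.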
Finally, the time axes of the multiple voice data are aligned according to the speech sample and each target speech segment. Specifically, the time axes of the multiple voice data can be aligned according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
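A minimal sketch of this position-based alignment, assuming the start time of the speech sample and of the matched target segment are known in seconds on their respective time axes (the function names are illustrative):

```python
def time_axis_offset(sample_start_s: float, target_start_s: float) -> float:
    """Offset (in seconds) to add to the pending recording's time axis
    so that its target speech segment lines up with the speech sample
    on the reference recording's time axis.
    """
    return sample_start_s - target_start_s

def align_timestamp(t_pending_s: float, sample_start_s: float,
                    target_start_s: float) -> float:
    """Map a timestamp of the pending recording onto the reference
    recording's time axis using one matched segment pair.
    """
    return t_pending_s + time_axis_offset(sample_start_s, target_start_s)
```

For example, if the sample starts at 270 s on the reference recording and the matched segment starts at 250 s on the pending recording, every pending timestamp is shifted by +20 s.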
Preferably, on the basis of any of the foregoing embodiments, when the voice data are long and therefore carry a large amount of information, the voice data can be partitioned after the multiple pieces of voice data of the same speech content collected by different recording devices have been obtained. For example, each piece of voice data can be divided into equal parts along its time axis, yielding multiple speech data blocks per piece of voice data. Correspondingly, a speech segment can be chosen as a speech sample in each speech data block of the chosen voice data, i.e., the chosen voice data corresponds to multiple speech samples. Then, similarly to the preceding steps, the target speech segment with the highest similarity to each speech sample must be determined in the corresponding speech data block of each of the other voice data. In addition, in the process of aligning the time axes of the multiple voice data according to the speech samples and the target speech segments, linear fitting can be used to determine the time relationship between the voice data to which each target speech segment belongs and the voice data to which the speech samples belong.
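When several sample/segment pairs are available (one per data block), the linear-fit step can be sketched as follows. The text does not name a fitting method, so a first-degree least-squares fit (`numpy.polyfit`) is assumed here, recovering a relative clock rate and offset between two recordings:

```python
import numpy as np

def fit_time_relation(sample_starts_s, target_starts_s):
    """Fit t_ref ~ a * t_pending + b from matched segment positions.

    sample_starts_s: start times of the speech samples on the
                     reference recording's time axis.
    target_starts_s: start times of the matched target segments on
                     the pending recording's time axis.
    Returns (a, b): slope (relative clock rate) and offset in seconds.
    """
    a, b = np.polyfit(target_starts_s, sample_starts_s, deg=1)
    return float(a), float(b)
```

With targets at 10 s, 110 s, and 210 s matching samples at 12 s, 112 s, and 212 s, the fit yields a slope of 1.0 and an offset of 2.0 s, i.e., the recordings run at the same rate but are shifted by two seconds.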
The voice alignment method provided by embodiment two of the present invention obtains multiple voice data of the same speech content collected by different recording devices, and chooses a speech segment from any one of the voice data as a speech sample; determines the sample frame count of the speech sample, and extracts the speech feature parameters of the speech sample according to the sample frame count; determines, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, the other voice data being the voice data in the multiple voice data other than the chosen voice data; and aligns the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art technical problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thereby effectively improves processing efficiency.
Fig. 3 is a structural schematic diagram of a voice alignment apparatus provided by embodiment three of the present invention. As shown in Fig. 3, the voice alignment apparatus includes:

an acquisition unit 10, configured to obtain multiple voice data of the same speech content collected by different recording devices;

a processing unit 20, configured to choose a speech segment from any one of the voice data as a speech sample; determine the sample frame count of the speech sample, and extract the speech feature parameters of the speech sample according to the sample frame count; and determine, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, where the other voice data are the voice data in the multiple voice data other than the chosen voice data; and

an alignment unit 30, configured to align the time axes of the multiple voice data according to the speech sample and each target speech segment.
Preferably, the processing unit 20 is specifically configured to:

determine the sample frame count of the speech sample according to the duration of the speech sample; and perform cepstral analysis on the speech sample according to the sample frame count to obtain the Mel-frequency cepstral coefficients of the speech sample.
Preferably, the processing unit 20 is specifically configured to:

for each pending voice data among the other voice data: choose a target frame of the pending voice data as the current frame, and take the current frame together with the consecutive frames after it as the current speech segment, where the number of consecutive frames equals the sample frame count; extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment from the speech feature parameters of the current speech segment and those of the speech sample; choose the frame after the target frame as the new current frame, and repeat the step of taking the current frame and the consecutive frames after it as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and, among the similarities obtained, take the current speech segment with the highest similarity as the target speech segment of the pending voice data.
Preferably, the processing unit 20 is specifically configured to:

determine the duration of the chosen voice data; and choose a speech segment as the speech sample according to the duration of the chosen voice data.
Preferably, the alignment unit 30 is specifically configured to:

align the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working process and corresponding beneficial effects of the system described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The voice alignment apparatus provided by embodiment three of the present invention obtains multiple voice data of the same speech content collected by different recording devices, and chooses a speech segment from any one of the voice data as a speech sample; determines the sample frame count of the speech sample, and extracts the speech feature parameters of the speech sample according to the sample frame count; determines, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data, the other voice data being the voice data in the multiple voice data other than the chosen voice data; and aligns the time axes of the multiple voice data according to the speech sample and each target speech segment. This avoids the prior-art technical problems of long processing time and low alignment accuracy caused by manually comparing the waveforms of the voice data and dragging their starting points together, and thereby effectively improves processing efficiency.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by program instructions and related hardware. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disk, or optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A voice alignment method, comprising:

obtaining multiple voice data of the same speech content collected by different recording devices, and choosing a speech segment from any one of the voice data as a speech sample;

determining a sample frame count of the speech sample, and extracting speech feature parameters of the speech sample according to the sample frame count;

determining, according to the speech feature parameters of the speech sample, a target speech segment with the highest similarity to the speech sample in each of the other voice data, wherein the other voice data are the voice data in the multiple voice data other than the chosen voice data; and

aligning the time axes of the multiple voice data according to the speech sample and each target speech segment.
2. The voice alignment method according to claim 1, wherein determining the sample frame count of the speech sample and extracting the speech feature parameters of the speech sample according to the sample frame count comprises:

determining the sample frame count of the speech sample according to the duration of the speech sample; and

performing cepstral analysis on the speech sample according to the sample frame count to obtain the Mel-frequency cepstral coefficients of the speech sample.
3. The voice alignment method according to claim 1, wherein determining, according to the speech feature parameters of the speech sample, the target speech segment with the highest similarity to the speech sample in each of the other voice data comprises:

for each pending voice data among the other voice data:

choosing a target frame of the pending voice data as a current frame, and taking the current frame together with the consecutive frames after the current frame as a current speech segment, wherein the number of the consecutive frames equals the sample frame count;

extracting the speech feature parameters of the current speech segment, and calculating the similarity of the current speech segment according to the speech feature parameters of the current speech segment and the speech feature parameters of the speech sample;

choosing the frame after the target frame as the new current frame, and repeating the step of taking the current frame and the consecutive frames after the current frame as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and

taking, among the similarities obtained, the current speech segment with the highest similarity as the target speech segment of the pending voice data.
4. The voice alignment method according to claim 1, wherein choosing a speech segment from any one of the voice data as the speech sample comprises:

determining the duration of the chosen voice data; and

choosing a speech segment as the speech sample according to the duration of the chosen voice data.
5. The voice alignment method according to any one of claims 1 to 4, wherein aligning the time axes of the multiple voice data according to the speech sample and each target speech segment comprises:

aligning the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
6. A voice alignment apparatus, comprising:

an acquisition unit, configured to obtain multiple voice data of the same speech content collected by different recording devices;

a processing unit, configured to choose a speech segment from any one of the voice data as a speech sample; determine a sample frame count of the speech sample, and extract speech feature parameters of the speech sample according to the sample frame count; and determine, according to the speech feature parameters of the speech sample, a target speech segment with the highest similarity to the speech sample in each of the other voice data, wherein the other voice data are the voice data in the multiple voice data other than the chosen voice data; and

an alignment unit, configured to align the time axes of the multiple voice data according to the speech sample and each target speech segment.
7. The voice alignment apparatus according to claim 6, wherein the processing unit is specifically configured to:

determine the sample frame count of the speech sample according to the duration of the speech sample; and

perform cepstral analysis on the speech sample according to the sample frame count to obtain the Mel-frequency cepstral coefficients of the speech sample.
8. The voice alignment apparatus according to claim 6, wherein the processing unit is specifically configured to:

for each pending voice data among the other voice data:

choose a target frame of the pending voice data as a current frame, and take the current frame together with the consecutive frames after the current frame as a current speech segment, wherein the number of the consecutive frames equals the sample frame count;

extract the speech feature parameters of the current speech segment, and calculate the similarity of the current speech segment according to the speech feature parameters of the current speech segment and the speech feature parameters of the speech sample;

choose the frame after the target frame as the new current frame, and repeat the step of taking the current frame and the consecutive frames after the current frame as the current speech segment, until the last frame of the current speech segment is the last frame of the pending voice data; and

take, among the similarities obtained, the current speech segment with the highest similarity as the target speech segment of the pending voice data.
9. The voice alignment apparatus according to claim 6, wherein the processing unit is specifically configured to:

determine the duration of the chosen voice data; and

choose a speech segment as the speech sample according to the duration of the chosen voice data.
10. The voice alignment apparatus according to any one of claims 6 to 9, wherein the alignment unit is specifically configured to:

align the time axes of the multiple voice data according to the position of the speech sample on the time axis of the voice data it belongs to and the position of each target speech segment on the time axis of the voice data it belongs to.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810449585.3A CN108682436B (en) | 2018-05-11 | 2018-05-11 | Voice alignment method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108682436A true CN108682436A (en) | 2018-10-19 |
CN108682436B CN108682436B (en) | 2020-06-23 |
Family
ID=63805967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810449585.3A Active CN108682436B (en) | 2018-05-11 | 2018-05-11 | Voice alignment method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108682436B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091849A (en) * | 2020-03-03 | 2020-05-01 | 龙马智芯(珠海横琴)科技有限公司 | Snore identification method and device, storage medium snore stopping equipment and processor |
CN111597239A (en) * | 2020-04-10 | 2020-08-28 | 中科驭数(北京)科技有限公司 | Data alignment method and device |
CN113409815A (en) * | 2021-05-28 | 2021-09-17 | 合肥群音信息服务有限公司 | Voice alignment method based on multi-source voice data |
CN114495977A (en) * | 2022-01-28 | 2022-05-13 | 北京百度网讯科技有限公司 | Speech translation and model training method, device, electronic equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1742492A (en) * | 2003-02-14 | 2006-03-01 | 汤姆森特许公司 | Automatic synchronization of audio and video based media services of media content |
CN103931199A (en) * | 2011-11-14 | 2014-07-16 | 苹果公司 | Generation of multi -views media clips |
CN105430537A (en) * | 2015-11-27 | 2016-03-23 | 刘军 | Method and server for synthesis of multiple paths of data, and music teaching system |
CN105827997A (en) * | 2016-04-26 | 2016-08-03 | 厦门幻世网络科技有限公司 | Method and device for dubbing audio and visual digital media |
CN105845127A (en) * | 2015-01-13 | 2016-08-10 | 阿里巴巴集团控股有限公司 | Voice recognition method and system |
CN105989846A (en) * | 2015-06-12 | 2016-10-05 | 乐视致新电子科技(天津)有限公司 | Multi-channel speech signal synchronization method and device |
CN106612457A (en) * | 2016-11-09 | 2017-05-03 | 广州视源电子科技股份有限公司 | Method and system for video sequence alignment |
CN107667400A (en) * | 2015-03-09 | 2018-02-06 | 弗劳恩霍夫应用研究促进协会 | The audio coding of fragment alignment |
Also Published As
Publication number | Publication date |
---|---|
CN108682436B (en) | 2020-06-23 |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |