CN108682436B - Voice alignment method and device

Voice alignment method and device

Info

Publication number
CN108682436B
CN108682436B
Authority
CN
China
Prior art keywords
voice
sample
voice data
data
segment
Prior art date
Legal status
Active
Application number
CN201810449585.3A
Other languages
Chinese (zh)
Other versions
CN108682436A
Inventor
邵志明
郝玉峰
Current Assignee
Beijing Speechocean Technology Co ltd
Original Assignee
Beijing Speechocean Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Speechocean Technology Co ltd
Priority to CN201810449585.3A
Publication of CN108682436A
Application granted
Publication of CN108682436B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/24: the extracted parameters being the cepstrum

Abstract

The voice alignment method and device provided by the invention acquire a plurality of voice data corresponding to the same voice content collected by different recording devices, and select a voice segment from any one of the voice data as a voice sample; determine the sample frame number of the voice sample, and extract the voice characteristic parameters of the voice sample according to the sample frame number; determine, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and align the time axes of the plurality of voice data according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.

Description

Voice alignment method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for aligning speech.
Background
Voice recording refers to collecting a speaker's voice in data form; the technology has a wide range of application scenarios.
Generally, for the voice of the same speaker in the same recording scene, a plurality of recording devices are used to collect voice data, and the starting points of the voice data collected by different recording devices cannot be guaranteed to be completely consistent. Therefore, how to align the voices, so that the collection starting points of the voice data collected by the plurality of recording devices are consistent and subsequent processing such as synthesis of the voice data is facilitated, is a technical problem to be solved.
In the prior art, the alignment operation is generally performed on the voice data manually. For example, when facing voice data with different collection starting points, technicians must manually compare the sound waves of the voice data and align their starting points. This manual alignment takes a great deal of time, has low processing efficiency and alignment accuracy, and is ill-suited to voice data of large volume.
Disclosure of Invention
In view of the above-mentioned technical problem of how to improve the processing efficiency and alignment accuracy of voice alignment, the present invention provides a voice alignment method and apparatus.
In one aspect, the present invention provides a speech alignment method, including:
acquiring a plurality of voice data corresponding to the same voice content collected by different recording devices, and selecting a voice segment from any one voice data as a voice sample;
determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
and aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment.
In an optional implementation manner, the determining the sample frame number of the voice sample and extracting the voice characteristic parameters of the voice sample according to the sample frame number includes:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
In an optional implementation manner, the determining, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data includes:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
and according to the obtained similarity, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed.
In an optional implementation manner, the selecting a voice segment from any voice data as a voice sample includes:
determining the duration of any voice data;
and selecting a voice segment as a voice sample according to the duration of any voice data.
In an optional implementation manner, the aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment includes:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
In another aspect, the present invention provides a speech alignment apparatus, including:
the acquisition unit is used for acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices;
the processing unit is used for selecting a voice segment from any voice data as a voice sample; determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number; determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
and the alignment unit is used for aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment.
In one optional implementation, the processing unit is specifically configured to:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
In one optional implementation, the processing unit is specifically configured to:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
and according to the obtained similarity, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed.
In one optional implementation, the processing unit is specifically configured to:
determining the duration of any voice data;
and selecting a voice segment as a voice sample according to the duration of any voice data.
In an optional implementation manner, the alignment unit is specifically configured to:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
The voice alignment method and device provided by the invention acquire a plurality of voice data corresponding to the same voice content collected by different recording devices, and select a voice segment from any one of the voice data as a voice sample; determine the sample frame number of the voice sample, and extract the voice characteristic parameters of the voice sample according to the sample frame number; determine, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and align the time axes of the plurality of voice data according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
Drawings
Certain embodiments of the disclosure are shown in the drawings and described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Fig. 1 is a schematic flowchart of a voice alignment method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a speech alignment method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech alignment apparatus according to a third embodiment of the present invention.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Voice recording refers to collecting a speaker's voice in data form; the technology has a wide range of application scenarios.
Generally, for the voice of the same speaker in the same recording scene, a plurality of recording devices are used to collect voice data, and the starting points of the voice data collected by different recording devices cannot be guaranteed to be completely consistent. Therefore, how to align the voices, so that the collection starting points of the voice data collected by the plurality of recording devices are consistent and subsequent processing such as synthesis of the voice data is facilitated, is a technical problem to be solved.
In the prior art, the alignment operation is generally performed on the voice data manually. For example, when facing voice data with different collection starting points, technicians must manually compare the sound waves of the voice data and align their starting points. This manual alignment takes a great deal of time, has low processing efficiency and alignment accuracy, and is ill-suited to voice data of large volume.
In view of the above-mentioned technical problem of how to improve the processing efficiency and alignment accuracy of voice alignment, the present invention provides a voice alignment method and apparatus.
Fig. 1 is a flowchart illustrating a voice alignment method according to an embodiment of the present invention.
As shown in fig. 1, the voice alignment method includes:
step 101, acquiring a plurality of voice data corresponding to the same voice content acquired by different voice recording devices;
step 102, selecting a voice segment from any one voice data as a voice sample;
step 103, determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
step 104, determining a target voice segment with the highest similarity to the voice sample in each piece of other voice data according to the voice characteristic parameters of the voice sample.
Wherein the other voice data is voice data other than the any voice data in the plurality of voice data.
And 105, aligning the time axes of the plurality of voice data according to the voice samples and the target voice clips.
It should be noted that the executing body of the voice alignment method provided by the invention may specifically be a voice alignment apparatus, which may be implemented by hardware and/or software. The apparatus is typically integrated in a local server or a cloud server of a voice platform, and cooperates with a data server of the voice platform on which various voice data are stored; however, the invention is not limited thereto.
Specifically, the application scenario on which the invention is based is as follows: for the same speaker in the same recording scene, different recording devices are often used to record simultaneously. Therefore, a plurality of voice data corresponding to the same voice content collected by different recording devices can be acquired first. The recording devices may be intelligent terminals running different application systems, or professional recording equipment. Although the plurality of voice data correspond to the same voice content, their time axes are not identical because they were collected by different recording devices. For example, when several people record the same speech on their mobile phones, they press the record button at slightly different times, so the time axes of the resulting voice data corresponding to that speech content differ.
Then, any one voice data can be selected from the plurality of voice data, and a voice segment can be selected from it as a voice sample. In general, the voice segment may be a randomly selected set of consecutive frames. Preferably, to further improve alignment accuracy, the voice segment may be selected according to the duration of the voice data to which it belongs; that is, the duration of the selected voice data is determined, and a voice segment is selected as the voice sample according to that duration. For example, if the duration of the voice data is 10 minutes, a voice segment of 1 minute may be selected as the voice sample; if the duration is 50 minutes, a voice segment of 5 minutes may be selected. In both examples the duration of the selected voice segment is 10% of the duration of the voice data to which it belongs; of course, voice segments of other durations may also be selected, which is not limited in this embodiment. More preferably, the selected voice segment may lie in the middle portion of the voice data, that is, the first frame of the voice segment is not the first frame of the voice data, and the last frame of the voice segment is not the last frame of the voice data; selecting the middle portion of the voice data as the voice segment further improves alignment accuracy. A sketch of this selection rule is given below.
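A minimal sketch of the selection rule just described, assuming the 10% ratio from the example above and a frame-indexed recording; the function name and signature are illustrative, not part of the patented method:

```python
def select_voice_sample(num_frames: int, sample_ratio: float = 0.1) -> tuple[int, int]:
    """Pick a voice sample from the middle of a recording.

    Returns (start_frame, end_frame), chosen so the sample covers roughly
    `sample_ratio` of the recording and, for recordings long enough, neither
    its first nor its last frame is the recording's first or last frame.
    """
    sample_len = max(1, int(num_frames * sample_ratio))
    start = (num_frames - sample_len) // 2  # centre the sample
    return start, start + sample_len
```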
After the voice sample is selected, the voice characteristic parameters of the voice sample can be extracted according to its sample frame number. For example, for voice data collected with 16-bit sampling precision, one frame corresponds to a duration of 20 ms, so the sample frame number of the voice sample can be determined from the duration of the voice sample. Then a voice characteristic parameter of the voice sample, such as the Mel-frequency cepstrum coefficients (MFCCs), can be extracted according to the sample frame number; that is, cepstrum analysis can be performed on the voice sample to obtain its Mel-frequency cepstrum coefficients. The voice characteristic parameter may also be another parameter, which is not limited in this embodiment. The cepstrum analysis that yields the Mel-frequency cepstrum coefficients may be implemented in any existing manner, which is likewise not limited here. In general, to balance processing speed against alignment accuracy, only the first 12 feature coefficients may be extracted as the Mel-frequency cepstrum coefficients; that is, each frame yields a one-dimensional array of 12 feature coefficients.
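As an illustration of this step, the sketch below extracts 12 Mel-frequency cepstrum coefficients per 20 ms frame with the librosa library; the patent does not prescribe a toolkit, so the library choice and the FFT window size are assumptions:

```python
import librosa

def extract_mfcc(path: str, n_mfcc: int = 12, frame_ms: int = 20):
    """Return an array of shape (num_frames, n_mfcc): one vector of the
    first 12 cepstral coefficients per 20 ms frame, as in the text above."""
    y, sr = librosa.load(path, sr=None)   # keep the file's native sampling rate
    hop = int(sr * frame_ms / 1000)       # 20 ms of samples per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2 * hop, hop_length=hop)
    return mfcc.T
```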
After the voice characteristic parameters of the voice sample are obtained, a target voice segment with the highest similarity to the voice sample is determined in each piece of other voice data according to those parameters, where the other voice data are the voice data other than the selected voice data among the plurality of voice data. It should be noted that this step determines a target voice segment for every piece of other voice data, and the target voice segments may be determined sequentially or in parallel. When a voice segment in a piece of other voice data is determined as that piece's target voice segment, the similarity between that segment's voice characteristic parameters and those of the voice sample is the highest among all voice segments in that piece of voice data.
Finally, the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment; specifically, the alignment is performed according to the position of the voice sample on the time axis of the voice data to which it belongs and the position of each target voice segment on the time axis of the voice data to which that segment belongs. A sketch of this shift computation follows.
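The embodiment does not spell out the shift computation; under the natural reading that each other recording is shifted by the difference between the voice sample's start time and its target segment's start time, a sketch looks like this (names are illustrative):

```python
def alignment_offsets(sample_start_s: float, target_starts_s: list[float]) -> list[float]:
    """Seconds to add to each other recording's time axis so that its
    target voice segment lines up with the voice sample."""
    return [sample_start_s - t for t in target_starts_s]

# If the voice sample starts at 60.0 s in its own recording and the target
# segments start at 58.5 s and 61.2 s in two other recordings, the offsets
# are approximately +1.5 s and -1.2 s respectively.
print(alignment_offsets(60.0, [58.5, 61.2]))
```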
In the voice alignment method provided by the first embodiment of the invention, a plurality of voice data corresponding to the same voice content collected by different recording devices are acquired, and a voice segment is selected from any one of the voice data as a voice sample; the sample frame number of the voice sample is determined, and the voice characteristic parameters of the voice sample are extracted according to the sample frame number; a target voice segment with the highest similarity to the voice sample is determined in each piece of other voice data according to the voice characteristic parameters of the voice sample, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
To better describe the voice alignment method provided by the present invention, a second embodiment is presented below on the basis of the first embodiment; fig. 2 is a schematic flowchart of the voice alignment method provided by the second embodiment of the present invention.
Different from the first embodiment, the second embodiment proceeds as follows for each piece of voice data to be processed among the other voice data: a target frame of the data to be processed is selected as the current frame, and the current frame and a plurality of continuous frames behind the current frame are taken as the current voice segment, wherein the number of the continuous frames is the same as the sample frame number; the voice characteristic parameters of the current voice segment are extracted, and the similarity of the current voice segment is calculated according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample; the next frame of the target frame is then selected as the current frame, and the step of taking the current frame and the continuous frames behind it as the current voice segment is repeated until the last frame of the current voice segment is the last frame of the voice data to be processed; finally, according to the obtained similarities, the current voice segment with the highest similarity is taken as the target voice segment of the voice data to be processed.
Specifically, as shown in fig. 2, the voice alignment method includes:
step 201, acquiring a plurality of voice data corresponding to the same voice content acquired by different voice recording devices;
step 202, selecting a voice segment from any voice data as a voice sample;
step 203, determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
step 204, selecting voice data from other voice data as voice data to be processed;
step 205, selecting a target frame as a current frame from the data to be processed;
step 206, taking the current frame and a plurality of continuous frames behind the current frame as current voice segments;
wherein the number of the continuous frames is the same as the number of the sample frames;
step 207, extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
step 208, judging whether the last frame of the current voice segment is the last frame of the voice data to be processed;
if yes, go to step 210; otherwise, step 209 is performed.
Step 209, selecting the next frame of the target frame as the current frame, and returning to step 206.
step 210, according to the similarities obtained for the voice data to be processed, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed;
step 211, selecting the next voice data from the other voice data as the voice data to be processed, and returning to the step of selecting a target frame of the data to be processed as the current frame, until the target voice segments corresponding to all the other voice data are obtained;
and step 212, aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment.
Similar to the first embodiment, the executing body of the voice alignment method provided by the invention may be a voice alignment apparatus, which may be implemented by hardware and/or software. The apparatus is typically integrated in a local server or a cloud server of a voice platform, and cooperates with a data server of the voice platform on which various voice data are stored; however, the invention is not limited thereto.
Specifically, the application scenario on which the invention is based is as follows: for the same speaker in the same recording scene, different recording devices are often used to record simultaneously. Therefore, a plurality of voice data corresponding to the same voice content collected by different recording devices can be acquired first. The recording devices may be intelligent terminals running different application systems, or professional recording equipment. Although the plurality of voice data correspond to the same voice content, their time axes are not identical because they were collected by different recording devices. For example, when several people record the same speech on their mobile phones, they press the record button at slightly different times, so the time axes of the resulting voice data corresponding to that speech content differ.
Then, any one voice data can be selected from the plurality of voice data, and a voice segment can be selected from it as a voice sample. In general, the voice segment may be a randomly selected set of consecutive frames. Preferably, to further improve alignment accuracy, the voice segment may be selected according to the duration of the voice data to which it belongs; that is, the duration of the selected voice data is determined, and a voice segment is selected as the voice sample according to that duration. For example, if the duration of the voice data is 10 minutes, a voice segment of 1 minute may be selected as the voice sample; if the duration is 50 minutes, a voice segment of 5 minutes may be selected. In both examples the duration of the selected voice segment is 10% of the duration of the voice data to which it belongs; of course, voice segments of other durations may also be selected, which is not limited in this embodiment. More preferably, the selected voice segment may lie in the middle portion of the voice data, that is, the first frame of the voice segment is not the first frame of the voice data, and the last frame of the voice segment is not the last frame of the voice data; selecting the middle portion of the voice data as the voice segment further improves alignment accuracy.
After the voice sample is selected, the voice characteristic parameters of the voice sample can be extracted according to its sample frame number. For example, for voice data collected with 16-bit sampling precision, one frame corresponds to a duration of 20 ms, so the sample frame number of the voice sample can be determined from the duration of the voice sample. Then a voice characteristic parameter of the voice sample, such as the Mel-frequency cepstrum coefficients (MFCCs), can be extracted according to the sample frame number; that is, cepstrum analysis can be performed on the voice sample to obtain its Mel-frequency cepstrum coefficients. The voice characteristic parameter may also be another parameter, which is not limited in this embodiment. The cepstrum analysis that yields the Mel-frequency cepstrum coefficients may be implemented in any existing manner, which is likewise not limited here. In general, to balance processing speed against alignment accuracy, only the first 12 feature coefficients may be extracted as the Mel-frequency cepstrum coefficients; that is, each frame yields a one-dimensional array of 12 feature coefficients.
Different from the first embodiment, in this embodiment, after the voice characteristic parameters of the voice sample are obtained, the following is performed for each piece of voice data to be processed among the other voice data: a target frame of the data to be processed is selected as the current frame, and the current frame and a plurality of continuous frames behind the current frame are taken as the current voice segment, wherein the number of the continuous frames is the same as the sample frame number; the voice characteristic parameters of the current voice segment are extracted, and the similarity of the current voice segment is calculated according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample; the next frame of the target frame is then selected as the current frame, and the step of taking the current frame and the continuous frames behind it as the current voice segment is repeated until the last frame of the current voice segment is the last frame of the voice data to be processed; according to the obtained similarities, the current voice segment with the highest similarity is taken as the target voice segment of the voice data to be processed. For the specific process, refer to steps 204-211.
Formula (1) is the similarity calculation formula provided in this embodiment:
f(x) = Σ_{n=1}^{numFrame} sim(MFCCref[n], MFCCwav2[n+x])    (1)
In formula (1), f(x) is the similarity between the voice sample and the voice segment starting from the xth frame of the voice data to be processed, where x is the target frame in step 205; numFrame is the sample frame number of the voice sample; MFCCref[n] is the Mel-frequency cepstrum coefficient vector corresponding to the nth frame of the voice sample; MFCCwav2[n+x] is the Mel-frequency cepstrum coefficient vector corresponding to the (n+x)th frame of the voice data to be processed; sim(·,·) denotes the similarity between two such vectors; and x, numFrame and n are positive integers. Formula (1) computes the sum, over the current voice segment of the voice data to be processed, of the similarity between each frame and the corresponding frame of the voice sample; this sum is taken as the similarity of the current voice segment and used in the subsequent determination of the target voice segment.
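A minimal sketch of the sliding-window search of steps 204-211 built on formula (1); because the image of formula (1) does not survive in this text, the per-frame measure is assumed here to be the negative Euclidean distance between MFCC vectors (larger means more similar), and all names are illustrative:

```python
import numpy as np

def frame_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Assumed per-frame measure: negative Euclidean distance between two
    # 12-dimensional MFCC vectors; the concrete measure of formula (1)
    # is not recoverable from the placeholder image.
    return -float(np.linalg.norm(a - b))

def find_target_segment(mfcc_ref: np.ndarray, mfcc_wav2: np.ndarray) -> int:
    """Return the start frame x that maximises f(x) over the data to be
    processed; mfcc_ref has numFrame rows, mfcc_wav2 one row per frame."""
    num_frame = mfcc_ref.shape[0]
    best_x, best_f = 0, float("-inf")
    # Slide one frame at a time until the last frame of the current voice
    # segment is the last frame of the voice data to be processed.
    for x in range(mfcc_wav2.shape[0] - num_frame + 1):
        f_x = sum(frame_similarity(mfcc_ref[n], mfcc_wav2[n + x])
                  for n in range(num_frame))
        if f_x > best_f:
            best_x, best_f = x, f_x
    return best_x
```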
Finally, the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment; specifically, the alignment is performed according to the position of the voice sample on the time axis of the voice data to which it belongs and the position of each target voice segment on the time axis of the voice data to which that segment belongs.
Preferably, on the basis of any of the foregoing embodiments, when the voice data have long durations and therefore carry a large amount of information, the plurality of voice data may be segmented after being acquired; for example, each voice data may be divided equally along its time axis to obtain a plurality of voice data blocks corresponding to it. Correspondingly, a voice segment can be selected as a voice sample from each voice data block of the selected voice data, so that the selected voice data corresponds to a plurality of voice samples. Subsequently, similarly to the foregoing steps, the target voice segment with the highest similarity to each voice sample is determined in the corresponding voice data block of each piece of other voice data. In addition, in the process of aligning the time axes of the plurality of voice data according to the voice samples and the target voice segments, the time relationship between each voice data and the voice data to which the voice samples belong can be determined by linear fitting, as sketched below.
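A sketch of the block-wise linear fitting mentioned above, assuming the per-block start times of the voice samples and of their target segments have already been found; a degree-1 fit captures a constant offset and, if present, a constant rate difference between the two time axes (the fitting method is not specified in the text, so numpy.polyfit is an assumption):

```python
import numpy as np

def fit_time_mapping(sample_starts_s: list[float], target_starts_s: list[float]):
    """Fit t_other ≈ a * t_ref + b from per-block (sample, target) start times."""
    a, b = np.polyfit(sample_starts_s, target_starts_s, deg=1)
    return float(a), float(b)

# Usage: with three blocks whose samples start at 10, 20 and 30 s on the
# reference track and whose target segments start 1.5 s later on the other
# track, the fit recovers a ≈ 1.0 and b ≈ 1.5.
a, b = fit_time_mapping([10.0, 20.0, 30.0], [11.5, 21.5, 31.5])
print(a, b)
```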
In the voice alignment method provided by the second embodiment of the invention, a plurality of voice data corresponding to the same voice content collected by different recording devices are acquired, and a voice segment is selected from any one of the voice data as a voice sample; the sample frame number of the voice sample is determined, and the voice characteristic parameters of the voice sample are extracted according to the sample frame number; a target voice segment with the highest similarity to the voice sample is determined in each piece of other voice data according to the voice characteristic parameters of the voice sample, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
Fig. 3 is a schematic structural diagram of a speech alignment apparatus according to a third embodiment of the present invention, as shown in fig. 3, the speech alignment apparatus includes:
the voice recording device comprises an acquisition unit 10, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices;
a processing unit 20, configured to select a voice segment from any voice data as a voice sample; determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number; determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
an alignment unit 30, configured to align the time axes of the plurality of voice data according to the voice sample and each target voice segment.
Preferably, the processing unit 20 is specifically configured to:
determining the sample frame number of the voice sample according to the duration of the voice sample; and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
Preferably, the processing unit 20 is specifically configured to:
for each voice data to be processed in the other voice data; selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the sample frame number; extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample; selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed; and according to the obtained similarities, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed.
Preferably, the processing unit 20 is specifically configured to:
determining the duration of any voice data; and selecting a voice segment as a voice sample according to the duration of any voice data.
Preferably, the alignment unit 30 is specifically configured to:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice clip on the time axis of the voice data to which the target voice clip belongs.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and corresponding beneficial effects of the system described above may refer to the corresponding process in the foregoing method embodiments, and are not described herein again.
The voice alignment apparatus provided by the third embodiment of the invention acquires a plurality of voice data corresponding to the same voice content collected by different recording devices, and selects a voice segment from any one of the voice data as a voice sample; determines the sample frame number of the voice sample, and extracts the voice characteristic parameters of the voice sample according to the sample frame number; determines, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and aligns the time axes of the plurality of voice data according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by program instructions executed on related hardware. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method of speech alignment, comprising:
acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices, and selecting, from any one voice data, a voice segment in the middle portion of the any voice data as a voice sample;
determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment;
the determining, according to the voice feature parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each other voice data includes:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
according to the obtained similarity, taking the current voice segment with the highest similarity as a target voice segment of the voice data to be processed;
the selecting a voice segment from any voice data as a voice sample comprises:
determining the duration of any voice data;
selecting a voice segment as a voice sample according to the duration of the any voice data;
after acquiring the plurality of voice data corresponding to the same voice content acquired by different recording devices, performing segmentation processing on the plurality of voice data to obtain a plurality of voice data blocks corresponding to each voice data, and correspondingly, for the any voice data, selecting a voice segment from each voice data block as a voice sample; and
the determining, in each of the other voice data, the target voice segment having the highest similarity to the voice sample includes: determining, in each voice data block corresponding to each of the other voice data, each target voice segment with the highest similarity corresponding to each voice sample.
2. The method of claim 1, wherein the determining the sample frame number of the voice sample and extracting the voice characteristic parameters of the voice sample according to the sample frame number comprises:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
3. The voice alignment method according to any one of claims 1-2, wherein the aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment comprises:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
4. A speech alignment apparatus, comprising:
the acquisition unit is used for acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices;
the processing unit is used for selecting a voice segment from any voice data as a voice sample; determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number; determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
an alignment unit, configured to align the time axes of the plurality of voice data according to the voice sample and each target voice segment;
the processing unit is specifically configured to:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
according to the obtained similarity, taking the current voice segment with the highest similarity as a target voice segment of the voice data to be processed;
the processing unit is further configured to:
determining the duration of any voice data;
selecting a voice segment as a voice sample according to the duration of the any voice data;
after acquiring the plurality of voice data corresponding to the same voice content acquired by different recording devices, performing segmentation processing on the plurality of voice data to obtain a plurality of voice data blocks corresponding to each voice data, and correspondingly, for the any voice data, selecting a voice segment from each voice data block as a voice sample; and
the determining, in each of the other voice data, the target voice segment having the highest similarity to the voice sample includes: determining, in each voice data block corresponding to each of the other voice data, each target voice segment with the highest similarity corresponding to each voice sample.
5. The speech alignment apparatus according to claim 4, wherein the processing unit is specifically configured to:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
6. The speech alignment apparatus according to claim 5, wherein the alignment unit is specifically configured to:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
CN201810449585.3A, filed 2018-05-11: Voice alignment method and device (Active; granted as CN108682436B)

Priority Applications (1)

CN201810449585.3A (granted as CN108682436B); priority date 2018-05-11; filing date 2018-05-11; title: Voice alignment method and device

Publications (2)

CN108682436A, published 2018-10-19
CN108682436B, granted 2020-06-23

Family

ID=63805967

Family Applications (1)

CN201810449585.3A (Active, granted as CN108682436B); priority date 2018-05-11; filing date 2018-05-11; title: Voice alignment method and device

Country Status (1)

CN: CN108682436B

Families Citing this family (4)

* Cited by examiner, † Cited by third party
CN111091849B * (priority 2020-03-03, published 2020-12-22, 龙马智芯(珠海横琴)科技有限公司): Snore identification method and device, storage medium, snore stopping equipment and processor
CN111597239B * (priority 2020-04-10, published 2021-08-31, 中科驭数(北京)科技有限公司): Data alignment method and device
CN113409815B * (priority 2021-05-28, published 2022-02-11, 合肥群音信息服务有限公司): Voice alignment method based on multi-source voice data
CN114495977B * (priority 2022-01-28, published 2024-01-30, 北京百度网讯科技有限公司): Speech translation and model training method, device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
CN1742492B * (priority 2003-02-14, published 2011-07-20, 汤姆森特许公司): Automatic synchronization of audio and video based media services of media content
CN105845127B * (priority 2015-01-13, published 2019-10-01, 阿里巴巴集团控股有限公司): Audio recognition method and its system
CA2978835C * (priority 2015-03-09, published 2021-01-19, Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.): Fragment-aligned audio coding
CN105989846B * (priority 2015-06-12, published 2020-01-17, 乐融致新电子科技(天津)有限公司): Multichannel voice signal synchronization method and device
CN105827997A * (priority 2016-04-26, published 2016-08-03, 厦门幻世网络科技有限公司): Method and device for dubbing audio and visual digital media
CN106612457B * (priority 2016-11-09, published 2019-09-03, 广州视源电子科技股份有限公司): Video sequence alignment method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
CN103931199A * (priority 2011-11-14, published 2014-07-16, 苹果公司): Generation of multi-view media clips
CN105430537A * (priority 2015-11-27, published 2016-03-23, 刘军): Method and server for synthesis of multiple paths of data, and music teaching system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant