CN108682436B - Voice alignment method and device

Voice alignment method and device

Info

Publication number
CN108682436B
CN108682436B
Authority
CN
China
Prior art keywords
voice
sample
voice data
data
segment
Prior art date
Legal status
Active
Application number
CN201810449585.3A
Other languages
Chinese (zh)
Other versions
CN108682436A
Inventor
邵志明
郝玉峰
Current Assignee
Beijing Speechocean Technology Co ltd
Original Assignee
Beijing Speechocean Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Speechocean Technology Co ltd
Priority to CN201810449585.3A
Publication of CN108682436A
Application granted
Publication of CN108682436B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/24: the extracted parameters being the cepstrum

Abstract

The voice alignment method and device provided by the invention acquire a plurality of voice data corresponding to the same voice content collected by different recording devices, and select a voice segment from any one of the voice data as a voice sample; determine the sample frame number of the voice sample, and extract the voice characteristic parameters of the voice sample according to the sample frame number; determine, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and align the time axes of the plurality of voice data according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.

Description

Voice alignment method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for aligning speech.
Background
Voice recording refers to collecting a speaker's voice in data form; the technology has a wide range of application scenarios.
Generally, for the voice of the same speaker in the same recording scene, a plurality of recording devices are used to collect voice data, and the starting points of the voice data collected by different recording devices cannot be guaranteed to be completely consistent. Therefore, how to align the voices, so that the collection starting points of the voice data collected by the plurality of recording devices are consistent and subsequent processing such as synthesis of the voice data is facilitated, is a technical problem to be solved.
In the prior art, the alignment operation is generally performed on the voice data manually. For example, when facing voice data with different collection starting points, technicians must manually compare the sound waves of the voice data and align their starting points. This manual alignment takes a great deal of time, has low processing efficiency and alignment accuracy, and is ill-suited to voice data of large volume.
Disclosure of Invention
In view of the above-mentioned technical problem of how to improve the processing efficiency and alignment accuracy of voice alignment, the present invention provides a voice alignment method and apparatus.
In one aspect, the present invention provides a speech alignment method, including:
acquiring a plurality of voice data corresponding to the same voice content collected by different recording devices, and selecting a voice segment from any one voice data as a voice sample;
determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
and aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment.
In an optional implementation manner, the determining the sample frame number of the voice sample and extracting the voice characteristic parameters of the voice sample according to the sample frame number includes:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
In an optional implementation manner, the determining, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data includes:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
and according to the obtained similarity, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed.
In an optional implementation manner, the selecting a voice segment from any voice data as a voice sample includes:
determining the duration of any voice data;
and selecting a voice segment as a voice sample according to the duration of any voice data.
In an optional implementation manner, the aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment includes:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
In another aspect, the present invention provides a speech alignment apparatus, including:
the acquisition unit is used for acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices;
the processing unit is used for selecting a voice segment from any voice data as a voice sample; determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number; determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
and the alignment unit is used for aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment.
In one optional implementation, the processing unit is specifically configured to:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
In one optional implementation, the processing unit is specifically configured to:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
and according to the obtained similarity, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed.
In one optional implementation, the processing unit is specifically configured to:
determining the duration of any voice data;
and selecting a voice segment as a voice sample according to the duration of any voice data.
In an optional implementation manner, the alignment unit is specifically configured to:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
The voice alignment method and device provided by the invention acquire a plurality of voice data corresponding to the same voice content collected by different recording devices, and select a voice segment from any one of the voice data as a voice sample; determine the sample frame number of the voice sample, and extract the voice characteristic parameters of the voice sample according to the sample frame number; determine, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and align the time axes of the plurality of voice data according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
Drawings
Certain embodiments of the disclosure are shown in the drawings and described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Fig. 1 is a schematic flowchart of a voice alignment method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a speech alignment method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech alignment apparatus according to a third embodiment of the present invention.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Voice recording refers to collecting a speaker's voice in data form; the technology has a wide range of application scenarios.
Generally, for the voice of the same speaker in the same recording scene, a plurality of recording devices are used to collect voice data, and the starting points of the voice data collected by different recording devices cannot be guaranteed to be completely consistent. Therefore, how to align the voices, so that the collection starting points of the voice data collected by the plurality of recording devices are consistent and subsequent processing such as synthesis of the voice data is facilitated, is a technical problem to be solved.
In the prior art, the alignment operation is generally performed on the voice data manually. For example, when facing voice data with different collection starting points, technicians must manually compare the sound waves of the voice data and align their starting points. This manual alignment takes a great deal of time, has low processing efficiency and alignment accuracy, and is ill-suited to voice data of large volume.
In view of the above-mentioned technical problem of how to improve the processing efficiency and alignment accuracy of voice alignment, the present invention provides a voice alignment method and apparatus.
Fig. 1 is a flowchart illustrating a voice alignment method according to an embodiment of the present invention.
As shown in fig. 1, the voice alignment method includes:
step 101, acquiring a plurality of voice data corresponding to the same voice content acquired by different voice recording devices;
step 102, selecting a voice segment from any one voice data as a voice sample;
step 103, determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
step 104, determining a target voice segment with the highest similarity to the voice sample in each piece of other voice data according to the voice characteristic parameters of the voice sample.
Wherein the other voice data is voice data other than the any voice data in the plurality of voice data.
And 105, aligning the time axes of the plurality of voice data according to the voice samples and the target voice clips.
It should be noted that the executing body of the voice alignment method provided by the invention may specifically be a voice alignment apparatus, which may be implemented by hardware and/or software. The apparatus is typically integrated in a local server or a cloud server of a voice platform, and cooperates with a data server of the voice platform on which various voice data are stored; however, the invention is not limited thereto.
Specifically, the application scenario on which the invention is based is as follows: for the same speaker in the same recording scene, different recording devices are often used to record simultaneously. Therefore, a plurality of voice data corresponding to the same voice content collected by different recording devices can be acquired first. The recording devices may be intelligent terminals running different application systems, or professional recording equipment. Although the plurality of voice data correspond to the same voice content, their time axes are not identical because they were collected by different recording devices. For example, when several people record the same speech on their mobile phones, they press the record button at slightly different times, so the time axes of the resulting voice data corresponding to that speech content differ.
Then, any one voice data can be selected from the plurality of voice data, and a voice segment can be selected from it as a voice sample. In general, the voice segment may be a randomly selected set of consecutive frames. Preferably, to further improve alignment accuracy, the voice segment may be selected according to the duration of the voice data to which it belongs; that is, the duration of the selected voice data is determined, and a voice segment is selected as the voice sample according to that duration. For example, if the duration of the voice data is 10 minutes, a voice segment of 1 minute may be selected as the voice sample; if the duration is 50 minutes, a voice segment of 5 minutes may be selected. In both examples the duration of the selected voice segment is 10% of the duration of the voice data to which it belongs; of course, voice segments of other durations may also be selected, which is not limited in this embodiment. More preferably, the selected voice segment may lie in the middle portion of the voice data, that is, the first frame of the voice segment is not the first frame of the voice data, and the last frame of the voice segment is not the last frame of the voice data; selecting the middle portion of the voice data as the voice segment further improves alignment accuracy. A sketch of this selection rule is given below.
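A minimal sketch of the selection rule just described, assuming the 10% ratio from the example above and a frame-indexed recording; the function name and signature are illustrative, not part of the patented method:

```python
def select_voice_sample(num_frames: int, sample_ratio: float = 0.1) -> tuple[int, int]:
    """Pick a voice sample from the middle of a recording.

    Returns (start_frame, end_frame), chosen so the sample covers roughly
    `sample_ratio` of the recording and, for recordings long enough, neither
    its first nor its last frame is the recording's first or last frame.
    """
    sample_len = max(1, int(num_frames * sample_ratio))
    start = (num_frames - sample_len) // 2  # centre the sample
    return start, start + sample_len
```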
After the voice sample is selected, the voice characteristic parameters of the voice sample can be extracted according to its sample frame number. For example, for voice data collected with 16-bit sampling precision, one frame corresponds to a duration of 20 ms, so the sample frame number of the voice sample can be determined from the duration of the voice sample. Then a voice characteristic parameter of the voice sample, such as the Mel-frequency cepstrum coefficients (MFCCs), can be extracted according to the sample frame number; that is, cepstrum analysis can be performed on the voice sample to obtain its Mel-frequency cepstrum coefficients. The voice characteristic parameter may also be another parameter, which is not limited in this embodiment. The cepstrum analysis that yields the Mel-frequency cepstrum coefficients may be implemented in any existing manner, which is likewise not limited here. In general, to balance processing speed against alignment accuracy, only the first 12 feature coefficients may be extracted as the Mel-frequency cepstrum coefficients; that is, each frame yields a one-dimensional array of 12 feature coefficients.
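As an illustration of this step, the sketch below extracts 12 Mel-frequency cepstrum coefficients per 20 ms frame with the librosa library; the patent does not prescribe a toolkit, so the library choice and the FFT window size are assumptions:

```python
import librosa

def extract_mfcc(path: str, n_mfcc: int = 12, frame_ms: int = 20):
    """Return an array of shape (num_frames, n_mfcc): one vector of the
    first 12 cepstral coefficients per 20 ms frame, as in the text above."""
    y, sr = librosa.load(path, sr=None)   # keep the file's native sampling rate
    hop = int(sr * frame_ms / 1000)       # 20 ms of samples per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2 * hop, hop_length=hop)
    return mfcc.T
```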
After the voice characteristic parameters of the voice sample are obtained, a target voice segment with the highest similarity to the voice sample is determined in each piece of other voice data according to those parameters, where the other voice data are the voice data other than the selected voice data among the plurality of voice data. It should be noted that this step determines a target voice segment for every piece of other voice data, and the target voice segments may be determined sequentially or in parallel. When a voice segment in a piece of other voice data is determined as that piece's target voice segment, the similarity between that segment's voice characteristic parameters and those of the voice sample is the highest among all voice segments in that piece of voice data.
Finally, the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment; specifically, the alignment is performed according to the position of the voice sample on the time axis of the voice data to which it belongs and the position of each target voice segment on the time axis of the voice data to which that segment belongs. A sketch of this shift computation follows.
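The embodiment does not spell out the shift computation; under the natural reading that each other recording is shifted by the difference between the voice sample's start time and its target segment's start time, a sketch looks like this (names are illustrative):

```python
def alignment_offsets(sample_start_s: float, target_starts_s: list[float]) -> list[float]:
    """Seconds to add to each other recording's time axis so that its
    target voice segment lines up with the voice sample."""
    return [sample_start_s - t for t in target_starts_s]

# If the voice sample starts at 60.0 s in its own recording and the target
# segments start at 58.5 s and 61.2 s in two other recordings, the offsets
# are approximately +1.5 s and -1.2 s respectively.
print(alignment_offsets(60.0, [58.5, 61.2]))
```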
In the voice alignment method provided by the first embodiment of the invention, a plurality of voice data corresponding to the same voice content collected by different recording devices are acquired, and a voice segment is selected from any one of the voice data as a voice sample; the sample frame number of the voice sample is determined, and the voice characteristic parameters of the voice sample are extracted according to the sample frame number; a target voice segment with the highest similarity to the voice sample is determined in each piece of other voice data according to the voice characteristic parameters of the voice sample, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
To better describe the voice alignment method provided by the present invention, a second embodiment is presented below on the basis of the first embodiment; fig. 2 is a schematic flowchart of the voice alignment method provided by the second embodiment of the present invention.
Different from the first embodiment, the second embodiment proceeds as follows for each piece of voice data to be processed among the other voice data: a target frame of the data to be processed is selected as the current frame, and the current frame and a plurality of continuous frames behind the current frame are taken as the current voice segment, wherein the number of the continuous frames is the same as the sample frame number; the voice characteristic parameters of the current voice segment are extracted, and the similarity of the current voice segment is calculated according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample; the next frame of the target frame is then selected as the current frame, and the step of taking the current frame and the continuous frames behind it as the current voice segment is repeated until the last frame of the current voice segment is the last frame of the voice data to be processed; finally, according to the obtained similarities, the current voice segment with the highest similarity is taken as the target voice segment of the voice data to be processed.
Specifically, as shown in fig. 2, the voice alignment method includes:
step 201, acquiring a plurality of voice data corresponding to the same voice content acquired by different voice recording devices;
step 202, selecting a voice segment from any voice data as a voice sample;
step 203, determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
step 204, selecting voice data from other voice data as voice data to be processed;
step 205, selecting a target frame as a current frame from the data to be processed;
step 206, taking the current frame and a plurality of continuous frames behind the current frame as current voice segments;
wherein the number of the continuous frames is the same as the number of the sample frames;
step 207, extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
step 208, judging whether the last frame of the current voice segment is the last frame of the voice data to be processed;
if yes, go to step 210; otherwise, step 209 is performed.
Step 209, selecting the next frame of the target frame as the current frame, and returning to step 206.
step 210, according to the similarities obtained for the voice data to be processed, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed;
step 211, selecting the next voice data from the other voice data as the voice data to be processed, and returning to the step of selecting a target frame of the data to be processed as the current frame, until the target voice segments corresponding to all the other voice data are obtained;
and step 212, aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment.
Similar to the first embodiment, the executing body of the voice alignment method provided by the invention may be a voice alignment apparatus, which may be implemented by hardware and/or software. The apparatus is typically integrated in a local server or a cloud server of a voice platform, and cooperates with a data server of the voice platform on which various voice data are stored; however, the invention is not limited thereto.
Specifically, the application scenario on which the invention is based is as follows: for the same speaker in the same recording scene, different recording devices are often used to record simultaneously. Therefore, a plurality of voice data corresponding to the same voice content collected by different recording devices can be acquired first. The recording devices may be intelligent terminals running different application systems, or professional recording equipment. Although the plurality of voice data correspond to the same voice content, their time axes are not identical because they were collected by different recording devices. For example, when several people record the same speech on their mobile phones, they press the record button at slightly different times, so the time axes of the resulting voice data corresponding to that speech content differ.
Then, any one voice data can be selected from the plurality of voice data, and a voice segment can be selected from it as a voice sample. In general, the voice segment may be a randomly selected set of consecutive frames. Preferably, to further improve alignment accuracy, the voice segment may be selected according to the duration of the voice data to which it belongs; that is, the duration of the selected voice data is determined, and a voice segment is selected as the voice sample according to that duration. For example, if the duration of the voice data is 10 minutes, a voice segment of 1 minute may be selected as the voice sample; if the duration is 50 minutes, a voice segment of 5 minutes may be selected. In both examples the duration of the selected voice segment is 10% of the duration of the voice data to which it belongs; of course, voice segments of other durations may also be selected, which is not limited in this embodiment. More preferably, the selected voice segment may lie in the middle portion of the voice data, that is, the first frame of the voice segment is not the first frame of the voice data, and the last frame of the voice segment is not the last frame of the voice data; selecting the middle portion of the voice data as the voice segment further improves alignment accuracy.
After the voice sample is selected, the voice characteristic parameters of the voice sample can be extracted according to its sample frame number. For example, for voice data collected with 16-bit sampling precision, one frame corresponds to a duration of 20 ms, so the sample frame number of the voice sample can be determined from the duration of the voice sample. Then a voice characteristic parameter of the voice sample, such as the Mel-frequency cepstrum coefficients (MFCCs), can be extracted according to the sample frame number; that is, cepstrum analysis can be performed on the voice sample to obtain its Mel-frequency cepstrum coefficients. The voice characteristic parameter may also be another parameter, which is not limited in this embodiment. The cepstrum analysis that yields the Mel-frequency cepstrum coefficients may be implemented in any existing manner, which is likewise not limited here. In general, to balance processing speed against alignment accuracy, only the first 12 feature coefficients may be extracted as the Mel-frequency cepstrum coefficients; that is, each frame yields a one-dimensional array of 12 feature coefficients.
Different from the first embodiment, in this embodiment, after the voice characteristic parameters of the voice sample are obtained, the following is performed for each piece of voice data to be processed among the other voice data: a target frame of the data to be processed is selected as the current frame, and the current frame and a plurality of continuous frames behind the current frame are taken as the current voice segment, wherein the number of the continuous frames is the same as the sample frame number; the voice characteristic parameters of the current voice segment are extracted, and the similarity of the current voice segment is calculated according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample; the next frame of the target frame is then selected as the current frame, and the step of taking the current frame and the continuous frames behind it as the current voice segment is repeated until the last frame of the current voice segment is the last frame of the voice data to be processed; according to the obtained similarities, the current voice segment with the highest similarity is taken as the target voice segment of the voice data to be processed. For the specific process, refer to steps 204-211.
Formula (1) is the similarity calculation formula provided in this embodiment:
f(x) = Σ_{n=1}^{numFrame} sim(MFCCref[n], MFCCwav2[n+x])    (1)
In formula (1), f(x) is the similarity between the voice sample and the voice segment starting from the xth frame of the voice data to be processed, where x is the target frame in step 205; numFrame is the sample frame number of the voice sample; MFCCref[n] is the Mel-frequency cepstrum coefficient vector corresponding to the nth frame of the voice sample; MFCCwav2[n+x] is the Mel-frequency cepstrum coefficient vector corresponding to the (n+x)th frame of the voice data to be processed; sim(·,·) denotes the similarity between two such vectors; and x, numFrame and n are positive integers. Formula (1) computes the sum, over the current voice segment of the voice data to be processed, of the similarity between each frame and the corresponding frame of the voice sample; this sum is taken as the similarity of the current voice segment and used in the subsequent determination of the target voice segment.
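A minimal sketch of the sliding-window search of steps 204-211 built on formula (1); because the image of formula (1) does not survive in this text, the per-frame measure is assumed here to be the negative Euclidean distance between MFCC vectors (larger means more similar), and all names are illustrative:

```python
import numpy as np

def frame_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Assumed per-frame measure: negative Euclidean distance between two
    # 12-dimensional MFCC vectors; the concrete measure of formula (1)
    # is not recoverable from the placeholder image.
    return -float(np.linalg.norm(a - b))

def find_target_segment(mfcc_ref: np.ndarray, mfcc_wav2: np.ndarray) -> int:
    """Return the start frame x that maximises f(x) over the data to be
    processed; mfcc_ref has numFrame rows, mfcc_wav2 one row per frame."""
    num_frame = mfcc_ref.shape[0]
    best_x, best_f = 0, float("-inf")
    # Slide one frame at a time until the last frame of the current voice
    # segment is the last frame of the voice data to be processed.
    for x in range(mfcc_wav2.shape[0] - num_frame + 1):
        f_x = sum(frame_similarity(mfcc_ref[n], mfcc_wav2[n + x])
                  for n in range(num_frame))
        if f_x > best_f:
            best_x, best_f = x, f_x
    return best_x
```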
Finally, the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment; specifically, the alignment is performed according to the position of the voice sample on the time axis of the voice data to which it belongs and the position of each target voice segment on the time axis of the voice data to which that segment belongs.
Preferably, on the basis of any of the foregoing embodiments, when the voice data have long durations and therefore carry a large amount of information, the plurality of voice data may be segmented after being acquired; for example, each voice data may be divided equally along its time axis to obtain a plurality of voice data blocks corresponding to it. Correspondingly, a voice segment can be selected as a voice sample from each voice data block of the selected voice data, so that the selected voice data corresponds to a plurality of voice samples. Subsequently, similarly to the foregoing steps, the target voice segment with the highest similarity to each voice sample is determined in the corresponding voice data block of each piece of other voice data. In addition, in the process of aligning the time axes of the plurality of voice data according to the voice samples and the target voice segments, the time relationship between each voice data and the voice data to which the voice samples belong can be determined by linear fitting, as sketched below.
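A sketch of the block-wise linear fitting mentioned above, assuming the per-block start times of the voice samples and of their target segments have already been found; a degree-1 fit captures a constant offset and, if present, a constant rate difference between the two time axes (the fitting method is not specified in the text, so numpy.polyfit is an assumption):

```python
import numpy as np

def fit_time_mapping(sample_starts_s: list[float], target_starts_s: list[float]):
    """Fit t_other ≈ a * t_ref + b from per-block (sample, target) start times."""
    a, b = np.polyfit(sample_starts_s, target_starts_s, deg=1)
    return float(a), float(b)

# Usage: with three blocks whose samples start at 10, 20 and 30 s on the
# reference track and whose target segments start 1.5 s later on the other
# track, the fit recovers a ≈ 1.0 and b ≈ 1.5.
a, b = fit_time_mapping([10.0, 20.0, 30.0], [11.5, 21.5, 31.5])
print(a, b)
```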
In the voice alignment method provided by the second embodiment of the invention, a plurality of voice data corresponding to the same voice content collected by different recording devices are acquired, and a voice segment is selected from any one of the voice data as a voice sample; the sample frame number of the voice sample is determined, and the voice characteristic parameters of the voice sample are extracted according to the sample frame number; a target voice segment with the highest similarity to the voice sample is determined in each piece of other voice data according to the voice characteristic parameters of the voice sample, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and the time axes of the plurality of voice data are aligned according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
Fig. 3 is a schematic structural diagram of a speech alignment apparatus according to a third embodiment of the present invention, as shown in fig. 3, the speech alignment apparatus includes:
the voice recording device comprises an acquisition unit 10, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices;
a processing unit 20, configured to select a voice segment from any voice data as a voice sample; determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number; determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
an alignment unit 30, configured to align the time axes of the plurality of voice data according to the voice sample and each target voice segment.
Preferably, the processing unit 20 is specifically configured to:
determining the sample frame number of the voice sample according to the duration of the voice sample; and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
Preferably, the processing unit 20 is specifically configured to:
for each voice data to be processed in the other voice data; selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the sample frame number; extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample; selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed; and according to the obtained similarities, taking the current voice segment with the highest similarity as the target voice segment of the voice data to be processed.
Preferably, the processing unit 20 is specifically configured to:
determining the duration of any voice data; and selecting a voice segment as a voice sample according to the duration of any voice data.
Preferably, the alignment unit 30 is specifically configured to:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice clip on the time axis of the voice data to which the target voice clip belongs.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and corresponding beneficial effects of the system described above may refer to the corresponding process in the foregoing method embodiments, and are not described herein again.
The voice alignment apparatus provided by the third embodiment of the invention acquires a plurality of voice data corresponding to the same voice content collected by different recording devices, and selects a voice segment from any one of the voice data as a voice sample; determines the sample frame number of the voice sample, and extracts the voice characteristic parameters of the voice sample according to the sample frame number; determines, according to the voice characteristic parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each piece of other voice data, where the other voice data are the voice data other than the selected voice data among the plurality of voice data; and aligns the time axes of the plurality of voice data according to the voice sample and each target voice segment. This avoids the prior-art approach of manually comparing the sound waves of voice data and aligning their starting points, which takes a long time and yields low alignment accuracy, and thereby effectively improves processing efficiency.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by program instructions executed on related hardware. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method of speech alignment, comprising:
acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices, and selecting, from any one voice data, a voice segment in the middle portion of the any voice data as a voice sample;
determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number;
determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment;
the determining, according to the voice feature parameters of the voice sample, a target voice segment with the highest similarity to the voice sample in each other voice data includes:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
according to the obtained similarity, taking the current voice segment with the highest similarity as a target voice segment of the voice data to be processed;
the selecting a voice segment from any voice data as a voice sample comprises:
determining the duration of any voice data;
selecting a voice segment as a voice sample according to the duration of the any voice data;
after acquiring the plurality of voice data corresponding to the same voice content acquired by different recording devices, performing segmentation processing on the plurality of voice data to obtain a plurality of voice data blocks corresponding to each voice data, and correspondingly, for the any voice data, selecting a voice segment from each voice data block as a voice sample; and
the determining, in each of the other voice data, the target voice segment having the highest similarity to the voice sample includes: determining, in each voice data block corresponding to each of the other voice data, each target voice segment with the highest similarity corresponding to each voice sample.
2. The method of claim 1, wherein the determining the sample frame number of the voice sample and extracting the voice characteristic parameters of the voice sample according to the sample frame number comprises:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
3. The voice alignment method according to any one of claims 1-2, wherein the aligning the time axes of the plurality of voice data according to the voice sample and each target voice segment comprises:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
4. A speech alignment apparatus, comprising:
the acquisition unit is used for acquiring a plurality of voice data corresponding to the same voice content acquired by different recording devices;
the processing unit is used for selecting a voice segment from any voice data as a voice sample; determining the sample frame number of the voice sample, and extracting the voice characteristic parameters of the voice sample according to the sample frame number; determining a target voice segment with the highest similarity with the voice sample in other voice data according to the voice characteristic parameters of the voice sample; wherein the other voice data is voice data other than the any voice data in the plurality of voice data;
an alignment unit, configured to align the time axes of the plurality of voice data according to the voice sample and each target voice segment;
the processing unit is specifically configured to:
for each voice data to be processed in the other voice data;
selecting a target frame of the data to be processed as a current frame, and taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, wherein the number of the continuous frames is the same as the number of the sample frames;
extracting the voice characteristic parameters of the current voice segment, and calculating the similarity of the current voice segment according to the voice characteristic parameters of the current voice segment and the voice characteristic parameters of the voice sample;
selecting a next frame of the target frame as a current frame, and repeating the step of taking the current frame and a plurality of continuous frames behind the current frame as a current voice segment, until the last frame of the current voice segment is the last frame of the voice data to be processed;
according to the obtained similarity, taking the current voice segment with the highest similarity as a target voice segment of the voice data to be processed;
the processing unit is further configured to:
determining the duration of any voice data;
selecting a voice segment as a voice sample according to the duration of the any voice data;
after acquiring the plurality of voice data corresponding to the same voice content acquired by different recording devices, performing segmentation processing on the plurality of voice data to obtain a plurality of voice data blocks corresponding to each voice data, and correspondingly, for the any voice data, selecting a voice segment from each voice data block as a voice sample; and
the determining, in each of the other voice data, the target voice segment having the highest similarity to the voice sample includes: determining, in each voice data block corresponding to each of the other voice data, each target voice segment with the highest similarity corresponding to each voice sample.
5. The speech alignment apparatus according to claim 4, wherein the processing unit is specifically configured to:
determining the sample frame number of the voice sample according to the duration of the voice sample;
and carrying out cepstrum analysis on the voice sample according to the sample frame number to obtain a Mel frequency cepstrum coefficient of the voice sample.
6. The speech alignment apparatus according to claim 5, wherein the alignment unit is specifically configured to:
and performing alignment processing on the time axes of the plurality of voice data according to the position of the voice sample on the time axis of the voice data to which the voice sample belongs and the position of each target voice segment on the time axis of the voice data to which the target voice segment belongs.
CN201810449585.3A, filed 2018-05-11: Voice alignment method and device (Active; granted as CN108682436B)

Priority Applications (1)

CN201810449585.3A (granted as CN108682436B); priority date 2018-05-11; filing date 2018-05-11; title: Voice alignment method and device

Publications (2)

CN108682436A, published 2018-10-19
CN108682436B, granted 2020-06-23

Family

ID=63805967

Family Applications (1)

CN201810449585.3A (Active, granted as CN108682436B); priority date 2018-05-11; filing date 2018-05-11; title: Voice alignment method and device

Country Status (1)

CN: CN108682436B

Families Citing this family (4)

* Cited by examiner, † Cited by third party
CN111091849B * (priority 2020-03-03, published 2020-12-22, 龙马智芯(珠海横琴)科技有限公司): Snore identification method and device, storage medium, snore stopping equipment and processor
CN111597239B * (priority 2020-04-10, published 2021-08-31, 中科驭数(北京)科技有限公司): Data alignment method and device
CN113409815B * (priority 2021-05-28, published 2022-02-11, 合肥群音信息服务有限公司): Voice alignment method based on multi-source voice data
CN114495977B * (priority 2022-01-28, published 2024-01-30, 北京百度网讯科技有限公司): Speech translation and model training method, device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
CN1742492B * (priority 2003-02-14, published 2011-07-20, 汤姆森特许公司): Automatic synchronization of audio and video based media services of media content
CN105845127B * (priority 2015-01-13, published 2019-10-01, 阿里巴巴集团控股有限公司): Audio recognition method and its system
CA2978835C * (priority 2015-03-09, published 2021-01-19, Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.): Fragment-aligned audio coding
CN105989846B * (priority 2015-06-12, published 2020-01-17, 乐融致新电子科技(天津)有限公司): Multichannel voice signal synchronization method and device
CN105827997A * (priority 2016-04-26, published 2016-08-03, 厦门幻世网络科技有限公司): Method and device for dubbing audio and visual digital media
CN106612457B * (priority 2016-11-09, published 2019-09-03, 广州视源电子科技股份有限公司): Video sequence alignment method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
CN103931199A * (priority 2011-11-14, published 2014-07-16, 苹果公司): Generation of multi-view media clips
CN105430537A * (priority 2015-11-27, published 2016-03-23, 刘军): Method and server for synthesis of multiple paths of data, and music teaching system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant