CN113409815A - Voice alignment method based on multi-source voice data - Google Patents

Voice alignment method based on multi-source voice data

Info

Publication number
CN113409815A
CN113409815A (application CN202110591658.4A)
Authority
CN
China
Prior art keywords
voice
voice data
data
module
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110591658.4A
Other languages
Chinese (zh)
Other versions
CN113409815B (en)
Inventor
李天洋
胡环环
朱保龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Feishu Information Technology Co ltd
Original Assignee
Hefei Qunyin Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Qunyin Information Service Co ltd filed Critical Hefei Qunyin Information Service Co ltd
Priority to CN202110591658.4A priority Critical patent/CN113409815B/en
Publication of CN113409815A publication Critical patent/CN113409815A/en
Application granted granted Critical
Publication of CN113409815B publication Critical patent/CN113409815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract


Figure 202110591658

The invention discloses a voice alignment method based on multi-source voice data, belonging to the field of voice processing and relating to voice alignment technology. The method aligns the starting points of the voice data so that the individual recordings are aligned without manual alignment, which consumes a large amount of time, has low processing efficiency, and yields low alignment accuracy. Step 1: the voice acquisition modules collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module; the voice processing module processes the voice data sent by the multiple acquisition modules and sends the processed voice data to the voice analysis module; the voice analysis module performs voice alignment on the processed voice data and sends the aligned voice data to the voice combination module; the voice combination module combines the aligned voice data.


Description

Voice alignment method based on multi-source voice data
Technical Field
The invention belongs to the field of voice processing, relates to a voice alignment technology, and particularly relates to a voice alignment method based on multi-source voice data.
Background
Generally, for the voice of the same speaker in the same recording scene, a plurality of recording devices are required to collect voice data, and the starting points of the voice data collected by different recording devices cannot be guaranteed to be completely consistent. Therefore, to ensure that the collection starting points of the voice data recorded by multiple devices are consistent, and to facilitate subsequent processing such as synthesis of the voice data, how to align the voices is a technical problem.
In the prior art, the alignment operation is generally performed on the voice data manually. For example, when facing voice data with different collection starting points, technicians must manually compare the sound waves of the voice data and align their starting points. This manual processing takes a great deal of time, has low processing efficiency and alignment accuracy, and is unsuitable for voice data of large volume.
Therefore, a voice alignment method based on multi-source voice data is provided.
Disclosure of Invention
The invention provides a voice alignment method based on multi-source voice data, which aligns the starting points of the voice data and thereby avoids manual alignment, which consumes a large amount of time, has low processing efficiency, and yields low alignment accuracy. The voice acquisition modules collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module; the voice processing module processes the voice data sent by the multiple acquisition modules and sends the processed data to the voice analysis module; the voice analysis module performs voice alignment on the processed voice data and sends the aligned data to the voice combination module. Specifically, the voice analysis module arranges the data characteristic coefficients TZij of the single-frame voice data by frame number and by acquisition module, and arbitrarily selects the voice data collected by one acquisition module as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1), and the resulting quotient is taken as a comparison value, marked Dij; the remaining single-frame voice data are processed in the same way to obtain the other comparison values; the comparison values are combined into different sequences, the Dij of each sequence are compared with the Dij of the reference sequence, and when more than 10 consecutive comparison values agree, or the quotient of corresponding values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned; finally, the voice combination module combines the aligned voice data.
The purpose of the invention can be realized by the following technical scheme:
a voice alignment method based on multi-source voice data comprises a voice alignment system based on the multi-source voice data, and the voice alignment system comprises a plurality of voice acquisition modules, a voice analysis module, a voice processing module and a voice combination module, wherein the voice acquisition modules are respectively positioned around a sound source and used for acquiring voice data of the same sound source at different positions and sending the acquired voice data of the sound source to the voice processing module;
the voice processing module is used for processing the voice data sent by the voice acquisition modules; the processed voice data are sent to a voice analysis module;
the voice analysis module is used for carrying out voice alignment on the processed voice data; sending the aligned voice data to a voice combination module;
and the voice combination module performs voice combination on the aligned voice data.
It should be noted that the voice acquisition modules are devices with a recording function, or microphones; they are distributed around the sound source at different spatial distances from it and are assumed by default to be identical equipment;
the voice acquisition modules send acquired voice data to the voice processing module;
the voice processing module numbers the voice acquisition modules, marking each as i, where i denotes the index of the voice acquisition module, i = 1, 2, …, n;
the voice processing module acquires the space linear distance between the voice acquisition module and the sound source, and marks the space linear distance between the voice acquisition module and the sound source as Li;
the voice processing module acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively; here j denotes the index of the single frame of voice data, j = 1, 2, …, m;
the voice processing module calculates the data characteristic coefficient TZij of the single-frame voice data by using a calculation formula, wherein the calculation formula is
Figure BDA0003089812720000031
Wherein c is a proportionality coefficient, and c is related to the timbre of the sound source;
the voice processing module sends the calculated data characteristic coefficient TZij of the single-frame voice data to the voice analysis module;
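The per-frame processing above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the exact formula for TZij appears only as an image in the original, so TZ = c * Zf * Pl is used here purely as a stand-in, and the frame length, peak-magnitude amplitude, and dominant-FFT-bin frequency estimate are likewise assumptions.

```python
import numpy as np

def frame_features(signal, sample_rate, frame_len=1024, c=1.0):
    """Split a mono signal into fixed-length frames and compute, per frame,
    an amplitude value Zf (peak magnitude) and a frequency value Pl
    (dominant FFT bin, in Hz).  The patent gives the TZ formula only as an
    image, so TZ = c * Zf * Pl is an assumed stand-in here."""
    n_frames = len(signal) // frame_len
    feats = []
    for j in range(n_frames):
        frame = signal[j * frame_len:(j + 1) * frame_len]
        zf = np.max(np.abs(frame))                           # amplitude value Zf_ij
        spectrum = np.abs(np.fft.rfft(frame))
        pl = np.argmax(spectrum) * sample_rate / frame_len   # dominant frequency Pl_ij
        feats.append(c * zf * pl)                            # assumed TZ_ij
    return np.array(feats)
```

For a pure 440 Hz tone sampled at 8 kHz, every frame yields roughly the same coefficient (peak near 1.0 times a dominant frequency near 437.5 Hz, the closest FFT bin), which is what makes the frame-to-frame ratios described next useful for matching.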
the voice analysis module is used for analyzing the data characteristic coefficient TZij of the single-frame voice data, and the specific analysis process comprises the following steps:
the voice analysis module acquires a spatial linear distance Li between the voice acquisition module and a sound source; the voice analysis module acquires a data characteristic coefficient TZij of single-frame voice data;
the voice analysis module carries out data arrangement on the acquired data characteristic coefficient TZij of the single-frame voice data according to different frame numbers and different voice acquisition modules, and the arrangement form is as follows:
TZ11、TZ12、TZ13、TZ14、TZ15……TZ1m;
TZ21、TZ22、TZ23、TZ24、TZ25……TZ2m;
……
TZn1、TZn2、TZn3、TZn4、TZn5……TZnm;
it should be noted that when the voice data collected by different voice acquisition modules are processed into single-frame voice data, the total number of frames may differ, i.e. the value of m may differ between acquisition modules;
the voice analysis module arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij;
the remaining single-frame voice data are processed in the same way to obtain the other comparison values;
the comparison values are combined into different sequences, namely the reference sequence, sequence 1, sequence 2, …, sequence n-1:
D11, D12, D13, D14, D15 … D1(m-1); (reference sequence)
D21, D22, D23, D24, D25 … D2(m-1); (sequence 1)
……
Dn1, Dn2, Dn3, Dn4, Dn5 … Dn(m-1); (sequence n-1)
The Dij of sequence 1, sequence 2, …, sequence n-1 are compared with the Dij of the reference sequence; when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned;
the voice analysis module sends the single-frame voice data to be aligned to the voice combination module; the voice combination module obtains the first comparison value of such a run of more than 10 consecutive matching values (or values whose quotient lies within (0.95-1.05)), thereby obtains the position of the corresponding single-frame voice data, takes that single frame as the alignment standard, performs voice combination frame by frame starting from it, and finally completes the voice alignment.
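The ratio-sequence comparison described above can be sketched as follows. This is a hedged sketch: the patent states the matching criterion (more than 10 consecutive comparison values agreeing, taken here as their quotient lying within (0.95-1.05)), but not the search order, so sliding one sequence against the other is an assumption, and all function and variable names are illustrative.

```python
def ratio_sequence(tz):
    """D_j = TZ_j / TZ_{j-1} for j = 2..m (the patent's comparison values)."""
    return [tz[j] / tz[j - 1] for j in range(1, len(tz))]

def find_alignment(ref_tz, other_tz, run_len=10, lo=0.95, hi=1.05):
    """Return (ref_index, other_index) of the start of the first run of
    more than `run_len` consecutive comparison values whose quotient lies
    within (lo, hi), or None if no such run exists.  Trying every relative
    offset is an assumption; the patent only gives the comparison rule."""
    d_ref = ratio_sequence(ref_tz)
    d_oth = ratio_sequence(other_tz)
    for off in range(-(len(d_oth) - run_len), len(d_ref) - run_len + 1):
        run = 0
        for j in range(len(d_ref)):
            k = j - off
            if 0 <= k < len(d_oth) and d_oth[k] != 0 and lo < d_ref[j] / d_oth[k] < hi:
                run += 1
                if run > run_len:
                    # start indices of the matching run in each ratio sequence
                    return (j - run + 1, k - run + 1)
            else:
                run = 0
    return None
```

The returned pair gives, per stream, the index at which the first sufficiently long matching run begins; the difference between the two indices is the frame offset used for alignment.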
Compared with the prior art, the invention has the beneficial effects that:
1. The voice acquisition modules of the invention are devices with a recording function, or microphones; they are distributed around the sound source at different spatial distances from it and are assumed by default to be identical equipment. This guarantees the consistency of the collected voice data, avoids the inaccuracy in later voice alignment that different acquisition devices would cause, and improves the accuracy of the voice alignment.
2. The voice processing module of the invention acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively; the voice processing module calculates the data characteristic coefficient TZij of each single frame using a calculation formula, which is
Figure BDA0003089812720000051
where c is related to the timbre of the sound source; the voice processing module sends the calculated data characteristic coefficient TZij of each single frame to the voice analysis module. Processing the voice data in this way facilitates the later voice alignment.
3. The voice analysis module of the invention arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij; the remaining single-frame voice data are processed in the same way to obtain the other comparison values; the comparison values are combined into different sequences, namely the reference sequence, sequence 1, sequence 2, …, sequence n-1:
D11, D12, D13, D14, D15 … D1(m-1); (reference sequence)
D21, D22, D23, D24, D25 … D2(m-1); (sequence 1)
……
Dn1, Dn2, Dn3, Dn4, Dn5 … Dn(m-1); (sequence n-1)
The Dij of sequence 1, sequence 2, …, sequence n-1 are compared with the Dij of the reference sequence; when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned. The alignment of the voices is thus realized by means of sequences.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flow chart of a speech alignment method based on multi-source speech data according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a voice alignment method based on multi-source voice data includes a voice alignment system based on multi-source voice data, comprising a plurality of voice acquisition modules, a voice analysis module, a voice processing module and a voice combination module; the voice acquisition modules are located around the sound source and are configured to collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module;
the voice processing module is used for processing the voice data sent by the voice acquisition modules; the processed voice data are sent to a voice analysis module;
the voice analysis module is used for carrying out voice alignment on the processed voice data; sending the aligned voice data to a voice combination module;
and the voice combination module performs voice combination on the aligned voice data.
It should be noted that the voice acquisition modules are devices with a recording function, or microphones; they are distributed around the sound source at different spatial distances from it and are assumed by default to be identical equipment;
the voice acquisition modules send acquired voice data to the voice processing module;
the voice processing module numbers the voice acquisition modules, marking each as i, where i denotes the index of the voice acquisition module, i = 1, 2, …, n;
the voice processing module acquires the space linear distance between the voice acquisition module and the sound source, and marks the space linear distance between the voice acquisition module and the sound source as Li;
the voice processing module acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively; here j denotes the index of the single frame of voice data, j = 1, 2, …, m;
the voice processing module calculates the data characteristic coefficient TZij of the single-frame voice data by using a calculation formula, wherein the calculation formula is
Figure BDA0003089812720000071
Wherein c is a proportionality coefficient, and c is related to the timbre of the sound source;
the voice processing module sends the calculated data characteristic coefficient TZij of the single-frame voice data to the voice analysis module;
the voice analysis module is used for analyzing the data characteristic coefficient TZij of the single-frame voice data, and the specific analysis process comprises the following steps:
the voice analysis module acquires a spatial linear distance Li between the voice acquisition module and a sound source; the voice analysis module acquires a data characteristic coefficient TZij of single-frame voice data;
the voice analysis module carries out data arrangement on the acquired data characteristic coefficient TZij of the single-frame voice data according to different frame numbers and different voice acquisition modules, and the arrangement form is as follows:
TZ11、TZ12、TZ13、TZ14、TZ15……TZ1m;
TZ21、TZ22、TZ23、TZ24、TZ25……TZ2m;
……
TZn1、TZn2、TZn3、TZn4、TZn5……TZnm;
it should be noted that when the voice data collected by different voice acquisition modules are processed into single-frame voice data, the total number of frames may differ, i.e. the value of m may differ between acquisition modules;
the voice analysis module arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij;
the remaining single-frame voice data are processed in the same way to obtain the other comparison values;
the comparison values are combined into different sequences, namely the reference sequence, sequence 1, sequence 2, …, sequence n-1:
D11, D12, D13, D14, D15 … D1(m-1); (reference sequence)
D21, D22, D23, D24, D25 … D2(m-1); (sequence 1)
……
Dn1, Dn2, Dn3, Dn4, Dn5 … Dn(m-1); (sequence n-1)
The Dij of sequence 1, sequence 2, …, sequence n-1 are compared with the Dij of the reference sequence; when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned;
the voice analysis module sends the single-frame voice data to be aligned to the voice combination module; the voice combination module obtains the first comparison value of such a run of more than 10 consecutive matching values (or values whose quotient lies within (0.95-1.05)), thereby obtains the position of the corresponding single-frame voice data, takes that single frame as the alignment standard, performs voice combination frame by frame starting from it, and finally completes the voice alignment.
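Once the alignment-standard frame has been located in each stream, the final combination step might look like the following illustrative Python sketch. The patent does not specify the combination operation, so trimming each stream to its matched frame and averaging the overlapping samples is an assumption, and all names here are hypothetical.

```python
def align_and_mix(streams, match_frames, frame_len=1024):
    """Trim each stream so that its matched frame becomes the common
    starting point, then mix the overlap by sample-wise averaging.
    `match_frames[i]` is the frame index found consistent with the
    reference (the patent's "alignment standard"); averaging is an
    assumption, since the patent leaves the combination unspecified."""
    trimmed = [s[m * frame_len:] for s, m in zip(streams, match_frames)]
    n = min(len(t) for t in trimmed)          # combine only the common overlap
    return [sum(t[k] for t in trimmed) / len(trimmed) for k in range(n)]
```

For example, a stream with one extra leading frame and a stream starting at the common content would be passed match frames [1, 0], after which both contribute from the same starting point.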
The above formulas operate on dimensionless numerical values; the formula is one obtained, by collecting a large amount of data and performing software simulation, to be closest to the real situation, and the preset parameters and preset thresholds in the formula are set by those skilled in the art according to the actual situation or obtained by simulation over a large amount of data.
The working principle of the invention is as follows: the voice acquisition modules collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module; the voice processing module processes the voice data sent by the multiple acquisition modules and sends the processed data to the voice analysis module; the voice analysis module performs voice alignment on the processed voice data and sends the aligned data to the voice combination module. The voice analysis module arranges the data characteristic coefficients TZij of the single-frame voice data by frame number and by acquisition module, and arbitrarily selects the voice data collected by one acquisition module as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1), and the resulting quotient is taken as a comparison value, marked Dij; the remaining single-frame voice data are processed in the same way to obtain the other comparison values; the comparison values are combined into different sequences, the Dij of each sequence are compared with the Dij of the reference sequence, and when more than 10 consecutive comparison values agree, or the quotient of corresponding values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned; finally, the voice combination module combines the aligned voice data.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and there may be other divisions when the actual implementation is performed; the modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the method of the embodiment.
It will also be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above examples are only intended to illustrate the technical process of the present invention and not to limit the same, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical process of the present invention without departing from the spirit and scope of the technical process of the present invention.

Claims (5)

1. A voice alignment method based on multi-source voice data, characterized in that the method comprises the following steps:
Step 1: collecting voice data of the same sound source at different positions through the voice acquisition modules, and sending the collected voice data of the sound source to the voice processing module;
Step 2: processing, through the voice processing module, the voice data sent by the plurality of voice acquisition modules, and sending the processed voice data to the voice analysis module;
Step 3: performing voice alignment on the processed voice data through the voice analysis module, and sending the aligned voice data to the voice combination module;
wherein the voice analysis module arranges the data characteristic coefficients TZij of the single-frame voice data by frame number and by voice acquisition module, and arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij;
the remaining single-frame voice data are processed in the same way to obtain the other comparison values;
the comparison values are combined into different sequences, the Dij of each sequence are compared with the Dij of the reference sequence, and when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned;
Step 4: combining the aligned voice data through the voice combination module.
2. The voice alignment method based on multi-source voice data according to claim 1, characterized in that the voice acquisition modules are devices with a recording function; the voice acquisition modules are distributed around the sound source at different spatial distances from it.
3. The voice alignment method based on multi-source voice data according to claim 1, characterized in that the voice processing module numbers the voice acquisition modules, marking each as i, where i denotes the index of the voice acquisition module, i = 1, 2, …, n;
the voice processing module obtains the spatial straight-line distance between each voice acquisition module and the sound source, marked Li;
the voice processing module acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively, where j = 1, 2, …, m denotes the index of the single frame of voice data;
the voice processing module calculates the data characteristic coefficient TZij of each single frame using a calculation formula (given only as image FDA0003089812710000021 in the original), where c is a proportionality coefficient related to the timbre of the sound source;
the voice processing module sends the calculated data characteristic coefficient TZij of each single frame to the voice analysis module.
4. The voice alignment method based on multi-source voice data according to claim 3, characterized in that when the voice data collected by different voice acquisition modules are processed into single-frame voice data, the total number of frames may differ, i.e. the value of m may differ between acquisition modules.
5. The voice alignment method based on multi-source voice data according to claim 1, characterized in that the voice combination module obtains the first comparison value of a run of more than 10 consecutive comparison values that agree, or whose quotient lies within (0.95-1.05), thereby obtains the position of the corresponding single-frame voice data, takes that single frame as the alignment standard, performs voice combination frame by frame starting from it, and finally completes the voice alignment.
CN202110591658.4A 2021-05-28 2021-05-28 A speech alignment method based on multi-source speech data Active CN113409815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110591658.4A CN113409815B (en) 2021-05-28 2021-05-28 A speech alignment method based on multi-source speech data


Publications (2)

Publication Number Publication Date
CN113409815A true CN113409815A (en) 2021-09-17
CN113409815B CN113409815B (en) 2022-02-11

Family

ID=77674998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110591658.4A Active CN113409815B (en) 2021-05-28 2021-05-28 A speech alignment method based on multi-source speech data

Country Status (1)

Country Link
CN (1) CN113409815B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030220789A1 (en) * 2002-05-21 2003-11-27 Kepuska Veton K. Dynamic time warping of speech
CN105989846A (en) * 2015-06-12 2016-10-05 乐视致新电子科技(天津)有限公司 Multi-channel speech signal synchronization method and device
US9697849B1 (en) * 2016-07-25 2017-07-04 Gopro, Inc. Systems and methods for audio based synchronization using energy vectors
CN107657947A (en) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 Method of speech processing and its device based on artificial intelligence
CN108682436A (en) * 2018-05-11 2018-10-19 北京海天瑞声科技股份有限公司 Voice alignment schemes and device
CN109192223A (en) * 2018-09-20 2019-01-11 广州酷狗计算机科技有限公司 The method and apparatus of audio alignment
EP3573059A1 (en) * 2018-05-25 2019-11-27 Dolby Laboratories Licensing Corp. Dialogue enhancement based on synthesized speech
CN111276156A (en) * 2020-01-20 2020-06-12 深圳市数字星河科技有限公司 Real-time voice stream monitoring method
CN111383658A (en) * 2018-12-29 2020-07-07 广州市百果园信息技术有限公司 Method and device for aligning audio signals
CN211628033U (en) * 2019-07-15 2020-10-02 兰州工业学院 Container anti-drop monitoring and transmission system
US20210065676A1 (en) * 2019-08-28 2021-03-04 International Business Machines Corporation Speech characterization using a synthesized reference audio signal


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JENNIFER LISTGARTEN ET AL: "Multiple Alignment of Continuous Time Series", Advances in Neural Information Processing Systems *
LAI JIAHAO: "Research on Voice Conversion Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology series *

Also Published As

Publication number Publication date
CN113409815B (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
ES2774018T3 (en) Method and system for evaluating the sound quality of a human voice
CN102723079B (en) Music and chord automatic identification method based on sparse representation
CN106375780A (en) Method and apparatus for generating multimedia file
CN106571146A (en) Noise signal determining method, and voice de-noising method and apparatus
CN104240712A (en) Three-dimensional audio multichannel grouping and clustering coding method and three-dimensional audio multichannel grouping and clustering coding system
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
Chaurasiya Time-frequency representations: spectrogram, cochleogram and correlogram
CN114352486B (en) A classification-based method for detecting wind turbine blade audio faults
CN109920446A (en) A kind of audio data processing method, device and computer storage medium
CN113409815B (en) A speech alignment method based on multi-source speech data
CN105679331A (en) Sound-breath signal separating and synthesizing method and system
CN102184733A (en) Audio attention-based audio quality evaluation system and method
CN102820037B (en) Chinese initial and final visualization method based on combination feature
CN118629419A (en) Embedded AI speech noise reduction model database construction method, device and storage medium
CN108010533A (en) The automatic identifying method and device of voice data code check
Falk et al. Improving instrumental quality prediction performance for the Blizzard Challenge
Zhan et al. Audio post-processing detection and identification based on audio features
CN113488070B (en) Detection method, device, electronic device and storage medium for tampering with audio
CN108271017A (en) The audio loudness measuring system and method for digital broadcast television
Santacruz et al. Spectral envelope transformation in singing voice for advanced pitch shifting
CN108769874B (en) Method and device for separating audio in real time
CN107025902A (en) Data processing method and device
CN113329190B (en) Animation design video production analysis management method, equipment, system and computer storage medium
CN112233693A (en) Sound quality evaluation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250113

Address after: Room B03, 9th Floor, Building B2, Big Data Industrial Park, 668 Xiangpu Road, High tech Zone, Hefei City, Anhui Province, China 230000

Patentee after: Anhui Feishu Information Technology Co.,Ltd.

Country or region after: China

Address before: 230000 Room 401, No. 3, Tianzhu Road, high tech Zone, Hefei, Anhui

Patentee before: Hefei qunyin Information Service Co.,Ltd.

Country or region before: China
