CN113409815A - Voice alignment method based on multi-source voice data - Google Patents

Voice alignment method based on multi-source voice data

Info

Publication number
CN113409815A
CN113409815A (application CN202110591658.4A)
Authority
CN
China
Prior art keywords
voice
voice data
data
module
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110591658.4A
Other languages
Chinese (zh)
Other versions
CN113409815B (en)
Inventor
李天洋
胡环环
朱保龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Feishu Information Technology Co ltd
Original Assignee
Hefei Qunyin Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Qunyin Information Service Co ltd filed Critical Hefei Qunyin Information Service Co ltd
Priority to CN202110591658.4A priority Critical patent/CN113409815B/en
Publication of CN113409815A publication Critical patent/CN113409815A/en
Application granted granted Critical
Publication of CN113409815B publication Critical patent/CN113409815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract


Figure 202110591658

The invention discloses a voice alignment method based on multi-source voice data, belonging to the field of voice processing and relating to voice alignment technology. The method aligns the starting points of the voice data so that the individual recordings are aligned without manual alignment, which consumes a large amount of time, has low processing efficiency, and yields low alignment accuracy. Step 1: the voice acquisition modules collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module; the voice processing module processes the voice data sent by the multiple acquisition modules and sends the processed voice data to the voice analysis module; the voice analysis module performs voice alignment on the processed voice data and sends the aligned voice data to the voice combination module; the voice combination module combines the aligned voice data.


Description

Voice alignment method based on multi-source voice data
Technical Field
The invention belongs to the field of voice processing, relates to a voice alignment technology, and particularly relates to a voice alignment method based on multi-source voice data.
Background
Generally, for the voice of the same speaker in the same recording scene, a plurality of recording devices are required to collect voice data, and the starting points of the voice data collected by different recording devices cannot be guaranteed to be completely consistent. Therefore, to ensure that the collection starting points of the voice data recorded by multiple devices are consistent, and to facilitate subsequent processing such as synthesis of the voice data, how to align the voices is a technical problem.
In the prior art, the alignment operation is generally performed on the voice data manually. For example, when facing voice data with different collection starting points, technicians must manually compare the sound waves of the voice data and align their starting points. This manual processing takes a great deal of time, has low processing efficiency and alignment accuracy, and is unsuitable for voice data of large volume.
Therefore, a voice alignment method based on multi-source voice data is provided.
Disclosure of Invention
The invention provides a voice alignment method based on multi-source voice data, which aligns the starting points of the voice data and thereby avoids manual alignment, which consumes a large amount of time, has low processing efficiency, and yields low alignment accuracy. The voice acquisition modules collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module; the voice processing module processes the voice data sent by the multiple acquisition modules and sends the processed data to the voice analysis module; the voice analysis module performs voice alignment on the processed voice data and sends the aligned data to the voice combination module. Specifically, the voice analysis module arranges the data characteristic coefficients TZij of the single-frame voice data by frame number and by acquisition module, and arbitrarily selects the voice data collected by one acquisition module as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1), and the resulting quotient is taken as a comparison value, marked Dij; the remaining single-frame voice data are processed in the same way to obtain the other comparison values; the comparison values are combined into different sequences, the Dij of each sequence are compared with the Dij of the reference sequence, and when more than 10 consecutive comparison values agree, or the quotient of corresponding values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned; finally, the voice combination module combines the aligned voice data.
The purpose of the invention can be realized by the following technical scheme:
a voice alignment method based on multi-source voice data comprises a voice alignment system based on the multi-source voice data, and the voice alignment system comprises a plurality of voice acquisition modules, a voice analysis module, a voice processing module and a voice combination module, wherein the voice acquisition modules are respectively positioned around a sound source and used for acquiring voice data of the same sound source at different positions and sending the acquired voice data of the sound source to the voice processing module;
the voice processing module is used for processing the voice data sent by the voice acquisition modules; the processed voice data are sent to a voice analysis module;
the voice analysis module is used for carrying out voice alignment on the processed voice data; sending the aligned voice data to a voice combination module;
and the voice combination module performs voice combination on the aligned voice data.
It should be noted that the voice acquisition modules are devices with a recording function, or microphones; they are distributed around the sound source at different spatial distances from it and are assumed by default to be identical equipment;
the voice acquisition modules send acquired voice data to the voice processing module;
the voice processing module numbers the voice acquisition modules, marking each as i, where i denotes the index of the voice acquisition module, i = 1, 2, …, n;
the voice processing module acquires the space linear distance between the voice acquisition module and the sound source, and marks the space linear distance between the voice acquisition module and the sound source as Li;
the voice processing module acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively; here j denotes the index of the single frame of voice data, j = 1, 2, …, m;
the voice processing module calculates the data characteristic coefficient TZij of the single-frame voice data by using a calculation formula, wherein the calculation formula is
Figure BDA0003089812720000031
Wherein c is a proportionality coefficient, and c is related to the timbre of the sound source;
the voice processing module sends the calculated data characteristic coefficient TZij of the single-frame voice data to the voice analysis module;
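The per-frame processing above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the exact formula for TZij appears only as an image in the original, so TZ = c * Zf * Pl is used here purely as a stand-in, and the frame length, peak-magnitude amplitude, and dominant-FFT-bin frequency estimate are likewise assumptions.

```python
import numpy as np

def frame_features(signal, sample_rate, frame_len=1024, c=1.0):
    """Split a mono signal into fixed-length frames and compute, per frame,
    an amplitude value Zf (peak magnitude) and a frequency value Pl
    (dominant FFT bin, in Hz).  The patent gives the TZ formula only as an
    image, so TZ = c * Zf * Pl is an assumed stand-in here."""
    n_frames = len(signal) // frame_len
    feats = []
    for j in range(n_frames):
        frame = signal[j * frame_len:(j + 1) * frame_len]
        zf = np.max(np.abs(frame))                           # amplitude value Zf_ij
        spectrum = np.abs(np.fft.rfft(frame))
        pl = np.argmax(spectrum) * sample_rate / frame_len   # dominant frequency Pl_ij
        feats.append(c * zf * pl)                            # assumed TZ_ij
    return np.array(feats)
```

For a pure 440 Hz tone sampled at 8 kHz, every frame yields roughly the same coefficient (peak near 1.0 times a dominant frequency near 437.5 Hz, the closest FFT bin), which is what makes the frame-to-frame ratios described next useful for matching.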
the voice analysis module is used for analyzing the data characteristic coefficient TZij of the single-frame voice data, and the specific analysis process comprises the following steps:
the voice analysis module acquires a spatial linear distance Li between the voice acquisition module and a sound source; the voice analysis module acquires a data characteristic coefficient TZij of single-frame voice data;
the voice analysis module carries out data arrangement on the acquired data characteristic coefficient TZij of the single-frame voice data according to different frame numbers and different voice acquisition modules, and the arrangement form is as follows:
TZ11、TZ12、TZ13、TZ14、TZ15……TZ1m;
TZ21、TZ22、TZ23、TZ24、TZ25……TZ2m;
……
TZn1、TZn2、TZn3、TZn4、TZn5……TZnm;
it should be noted that when the voice data collected by different voice acquisition modules are processed into single-frame voice data, the total number of frames may differ, i.e. the value of m may differ between acquisition modules;
the voice analysis module arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij;
the remaining single-frame voice data are processed in the same way to obtain the other comparison values;
the comparison values are combined into different sequences, namely the reference sequence, sequence 1, sequence 2, …, sequence n-1:
D11, D12, D13, D14, D15 … D1(m-1); (reference sequence)
D21, D22, D23, D24, D25 … D2(m-1); (sequence 1)
……
Dn1, Dn2, Dn3, Dn4, Dn5 … Dn(m-1); (sequence n-1)
The Dij of sequence 1, sequence 2, …, sequence n-1 are compared with the Dij of the reference sequence; when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned;
the voice analysis module sends the single-frame voice data to be aligned to the voice combination module; the voice combination module obtains the first comparison value of such a run of more than 10 consecutive matching values (or values whose quotient lies within (0.95-1.05)), thereby obtains the position of the corresponding single-frame voice data, takes that single frame as the alignment standard, performs voice combination frame by frame starting from it, and finally completes the voice alignment.
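The ratio-sequence comparison described above can be sketched as follows. This is a hedged sketch: the patent states the matching criterion (more than 10 consecutive comparison values agreeing, taken here as their quotient lying within (0.95-1.05)), but not the search order, so sliding one sequence against the other is an assumption, and all function and variable names are illustrative.

```python
def ratio_sequence(tz):
    """D_j = TZ_j / TZ_{j-1} for j = 2..m (the patent's comparison values)."""
    return [tz[j] / tz[j - 1] for j in range(1, len(tz))]

def find_alignment(ref_tz, other_tz, run_len=10, lo=0.95, hi=1.05):
    """Return (ref_index, other_index) of the start of the first run of
    more than `run_len` consecutive comparison values whose quotient lies
    within (lo, hi), or None if no such run exists.  Trying every relative
    offset is an assumption; the patent only gives the comparison rule."""
    d_ref = ratio_sequence(ref_tz)
    d_oth = ratio_sequence(other_tz)
    for off in range(-(len(d_oth) - run_len), len(d_ref) - run_len + 1):
        run = 0
        for j in range(len(d_ref)):
            k = j - off
            if 0 <= k < len(d_oth) and d_oth[k] != 0 and lo < d_ref[j] / d_oth[k] < hi:
                run += 1
                if run > run_len:
                    # start indices of the matching run in each ratio sequence
                    return (j - run + 1, k - run + 1)
            else:
                run = 0
    return None
```

The returned pair gives, per stream, the index at which the first sufficiently long matching run begins; the difference between the two indices is the frame offset used for alignment.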
Compared with the prior art, the invention has the beneficial effects that:
1. The voice acquisition modules of the invention are devices with a recording function, or microphones; they are distributed around the sound source at different spatial distances from it and are assumed by default to be identical equipment. This guarantees the consistency of the collected voice data, avoids the inaccuracy in later voice alignment that different acquisition devices would cause, and improves the accuracy of the voice alignment.
2. The voice processing module of the invention acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively; the voice processing module calculates the data characteristic coefficient TZij of each single frame using a calculation formula, which is
Figure BDA0003089812720000051
where c is related to the timbre of the sound source; the voice processing module sends the calculated data characteristic coefficient TZij of each single frame to the voice analysis module. Processing the voice data in this way facilitates the later voice alignment.
3. The voice analysis module of the invention arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij; the remaining single-frame voice data are processed in the same way to obtain the other comparison values; the comparison values are combined into different sequences, namely the reference sequence, sequence 1, sequence 2, …, sequence n-1:
D11, D12, D13, D14, D15 … D1(m-1); (reference sequence)
D21, D22, D23, D24, D25 … D2(m-1); (sequence 1)
……
Dn1, Dn2, Dn3, Dn4, Dn5 … Dn(m-1); (sequence n-1)
The Dij of sequence 1, sequence 2, …, sequence n-1 are compared with the Dij of the reference sequence; when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned. The alignment of the voices is thus realized by means of sequences.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flow chart of a speech alignment method based on multi-source speech data according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a voice alignment method based on multi-source voice data includes a voice alignment system based on multi-source voice data, comprising a plurality of voice acquisition modules, a voice analysis module, a voice processing module and a voice combination module; the voice acquisition modules are located around the sound source and are configured to collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module;
the voice processing module is used for processing the voice data sent by the voice acquisition modules; the processed voice data are sent to a voice analysis module;
the voice analysis module is used for carrying out voice alignment on the processed voice data; sending the aligned voice data to a voice combination module;
and the voice combination module performs voice combination on the aligned voice data.
It should be noted that the voice acquisition modules are devices with a recording function, or microphones; they are distributed around the sound source at different spatial distances from it and are assumed by default to be identical equipment;
the voice acquisition modules send acquired voice data to the voice processing module;
the voice processing module numbers the voice acquisition modules, marking each as i, where i denotes the index of the voice acquisition module, i = 1, 2, …, n;
the voice processing module acquires the space linear distance between the voice acquisition module and the sound source, and marks the space linear distance between the voice acquisition module and the sound source as Li;
the voice processing module acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively; here j denotes the index of the single frame of voice data, j = 1, 2, …, m;
the voice processing module calculates the data characteristic coefficient TZij of the single-frame voice data by using a calculation formula, wherein the calculation formula is
Figure BDA0003089812720000071
Wherein c is a proportionality coefficient, and c is related to the timbre of the sound source;
the voice processing module sends the calculated data characteristic coefficient TZij of the single-frame voice data to the voice analysis module;
the voice analysis module is used for analyzing the data characteristic coefficient TZij of the single-frame voice data, and the specific analysis process comprises the following steps:
the voice analysis module acquires a spatial linear distance Li between the voice acquisition module and a sound source; the voice analysis module acquires a data characteristic coefficient TZij of single-frame voice data;
the voice analysis module carries out data arrangement on the acquired data characteristic coefficient TZij of the single-frame voice data according to different frame numbers and different voice acquisition modules, and the arrangement form is as follows:
TZ11、TZ12、TZ13、TZ14、TZ15……TZ1m;
TZ21、TZ22、TZ23、TZ24、TZ25……TZ2m;
……
TZn1、TZn2、TZn3、TZn4、TZn5……TZnm;
it should be noted that when the voice data collected by different voice acquisition modules are processed into single-frame voice data, the total number of frames may differ, i.e. the value of m may differ between acquisition modules;
the voice analysis module arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij;
the remaining single-frame voice data are processed in the same way to obtain the other comparison values;
the comparison values are combined into different sequences, namely the reference sequence, sequence 1, sequence 2, …, sequence n-1:
D11, D12, D13, D14, D15 … D1(m-1); (reference sequence)
D21, D22, D23, D24, D25 … D2(m-1); (sequence 1)
……
Dn1, Dn2, Dn3, Dn4, Dn5 … Dn(m-1); (sequence n-1)
The Dij of sequence 1, sequence 2, …, sequence n-1 are compared with the Dij of the reference sequence; when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned;
the voice analysis module sends the single-frame voice data to be aligned to the voice combination module; the voice combination module obtains the first comparison value of such a run of more than 10 consecutive matching values (or values whose quotient lies within (0.95-1.05)), thereby obtains the position of the corresponding single-frame voice data, takes that single frame as the alignment standard, performs voice combination frame by frame starting from it, and finally completes the voice alignment.
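Once the alignment-standard frame has been located in each stream, the final combination step might look like the following illustrative Python sketch. The patent does not specify the combination operation, so trimming each stream to its matched frame and averaging the overlapping samples is an assumption, and all names here are hypothetical.

```python
def align_and_mix(streams, match_frames, frame_len=1024):
    """Trim each stream so that its matched frame becomes the common
    starting point, then mix the overlap by sample-wise averaging.
    `match_frames[i]` is the frame index found consistent with the
    reference (the patent's "alignment standard"); averaging is an
    assumption, since the patent leaves the combination unspecified."""
    trimmed = [s[m * frame_len:] for s, m in zip(streams, match_frames)]
    n = min(len(t) for t in trimmed)          # combine only the common overlap
    return [sum(t[k] for t in trimmed) / len(trimmed) for k in range(n)]
```

For example, a stream with one extra leading frame and a stream starting at the common content would be passed match frames [1, 0], after which both contribute from the same starting point.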
The above formulas operate on dimensionless numerical values; the formula is one obtained, by collecting a large amount of data and performing software simulation, to be closest to the real situation, and the preset parameters and preset thresholds in the formula are set by those skilled in the art according to the actual situation or obtained by simulation over a large amount of data.
The working principle of the invention is as follows: the voice acquisition modules collect voice data of the same sound source at different positions and send the collected voice data to the voice processing module; the voice processing module processes the voice data sent by the multiple acquisition modules and sends the processed data to the voice analysis module; the voice analysis module performs voice alignment on the processed voice data and sends the aligned data to the voice combination module. The voice analysis module arranges the data characteristic coefficients TZij of the single-frame voice data by frame number and by acquisition module, and arbitrarily selects the voice data collected by one acquisition module as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1), and the resulting quotient is taken as a comparison value, marked Dij; the remaining single-frame voice data are processed in the same way to obtain the other comparison values; the comparison values are combined into different sequences, the Dij of each sequence are compared with the Dij of the reference sequence, and when more than 10 consecutive comparison values agree, or the quotient of corresponding values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned; finally, the voice combination module combines the aligned voice data.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and there may be other divisions when the actual implementation is performed; the modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the method of the embodiment.
It will also be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above examples are only intended to illustrate the technical process of the present invention and not to limit the same, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical process of the present invention without departing from the spirit and scope of the technical process of the present invention.

Claims (5)

1. A voice alignment method based on multi-source voice data, characterized in that the method comprises the following steps:
Step 1: collecting voice data of the same sound source at different positions through the voice acquisition modules, and sending the collected voice data of the sound source to the voice processing module;
Step 2: processing, through the voice processing module, the voice data sent by the plurality of voice acquisition modules, and sending the processed voice data to the voice analysis module;
Step 3: performing voice alignment on the processed voice data through the voice analysis module, and sending the aligned voice data to the voice combination module;
wherein the voice analysis module arranges the data characteristic coefficients TZij of the single-frame voice data by frame number and by voice acquisition module, and arbitrarily selects the voice data collected by one of the acquisition modules as the reference voice data; the data characteristic coefficient of each single frame is divided by that of the preceding single frame, i.e. TZij/TZi(j-1); the resulting quotient is taken as a comparison value and marked Dij;
the remaining single-frame voice data are processed in the same way to obtain the other comparison values;
the comparison values are combined into different sequences, the Dij of each sequence are compared with the Dij of the reference sequence, and when more than 10 consecutive comparison values agree, or the quotient of corresponding comparison values lies within (0.95-1.05), the single-frame voice data can be adopted and is marked as single-frame voice data to be aligned;
Step 4: combining the aligned voice data through the voice combination module.
2. The voice alignment method based on multi-source voice data according to claim 1, characterized in that the voice acquisition modules are devices with a recording function; the voice acquisition modules are distributed around the sound source at different spatial distances from it.
3. The voice alignment method based on multi-source voice data according to claim 1, characterized in that the voice processing module numbers the voice acquisition modules, marking each as i, where i denotes the index of the voice acquisition module, i = 1, 2, …, n;
the voice processing module obtains the spatial straight-line distance between each voice acquisition module and the sound source, marked Li;
the voice processing module acquires the voice data, processes it into single-frame voice data, decodes and splits each single frame to obtain an amplitude value and a frequency value, marked Zfij and Plij respectively, where j = 1, 2, …, m denotes the index of the single frame of voice data;
the voice processing module calculates the data characteristic coefficient TZij of each single frame using a calculation formula (given only as image FDA0003089812710000021 in the original), where c is a proportionality coefficient related to the timbre of the sound source;
the voice processing module sends the calculated data characteristic coefficient TZij of each single frame to the voice analysis module.
4. The voice alignment method based on multi-source voice data according to claim 3, characterized in that when the voice data collected by different voice acquisition modules are processed into single-frame voice data, the total number of frames may differ, i.e. the value of m may differ between acquisition modules.
5. The voice alignment method based on multi-source voice data according to claim 1, characterized in that the voice combination module obtains the first comparison value of a run of more than 10 consecutive comparison values that agree, or whose quotient lies within (0.95-1.05), thereby obtains the position of the corresponding single-frame voice data, takes that single frame as the alignment standard, performs voice combination frame by frame starting from it, and finally completes the voice alignment.
CN202110591658.4A 2021-05-28 2021-05-28 A speech alignment method based on multi-source speech data Active CN113409815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110591658.4A CN113409815B (en) 2021-05-28 2021-05-28 A speech alignment method based on multi-source speech data


Publications (2)

Publication Number Publication Date
CN113409815A true CN113409815A (en) 2021-09-17
CN113409815B CN113409815B (en) 2022-02-11

Family

ID=77674998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110591658.4A Active CN113409815B (en) 2021-05-28 2021-05-28 A speech alignment method based on multi-source speech data

Country Status (1)

Country Link
CN (1) CN113409815B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030220789A1 (en) * 2002-05-21 2003-11-27 Kepuska Veton K. Dynamic time warping of speech
CN105989846A (en) * 2015-06-12 2016-10-05 乐视致新电子科技(天津)有限公司 Multi-channel speech signal synchronization method and device
US9697849B1 (en) * 2016-07-25 2017-07-04 Gopro, Inc. Systems and methods for audio based synchronization using energy vectors
CN107657947A (en) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 Method of speech processing and its device based on artificial intelligence
CN108682436A (en) * 2018-05-11 2018-10-19 北京海天瑞声科技股份有限公司 Voice alignment schemes and device
CN109192223A (en) * 2018-09-20 2019-01-11 广州酷狗计算机科技有限公司 The method and apparatus of audio alignment
EP3573059A1 (en) * 2018-05-25 2019-11-27 Dolby Laboratories Licensing Corp. Dialogue enhancement based on synthesized speech
CN111276156A (en) * 2020-01-20 2020-06-12 深圳市数字星河科技有限公司 Real-time voice stream monitoring method
CN111383658A (en) * 2018-12-29 2020-07-07 广州市百果园信息技术有限公司 Method and device for aligning audio signals
CN211628033U (en) * 2019-07-15 2020-10-02 兰州工业学院 Container anti-drop monitoring and transmission system
US20210065676A1 (en) * 2019-08-28 2021-03-04 International Business Machines Corporation Speech characterization using a synthesized reference audio signal


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JENNIFER LISTGARTEN ET AL: "Multiple Alignment of Continuous Time Series", Advances in Neural Information Processing Systems *
LAI JIAHAO: "Research on Voice Conversion Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology series *

Also Published As

Publication number Publication date
CN113409815B (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
ES2774018T3 (en) Method and system for evaluating the sound quality of a human voice
CN102723079B (en) Music and chord automatic identification method based on sparse representation
CN106375780A (en) Method and apparatus for generating multimedia file
CN106571146A (en) Noise signal determining method, and voice de-noising method and apparatus
CN104240712A (en) Three-dimensional audio multichannel grouping and clustering coding method and three-dimensional audio multichannel grouping and clustering coding system
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
Chaurasiya Time-frequency representations: spectrogram, cochleogram and correlogram
CN114352486B (en) A classification-based method for detecting wind turbine blade audio faults
CN109920446A (en) A kind of audio data processing method, device and computer storage medium
CN113409815B (en) A speech alignment method based on multi-source speech data
CN105679331A (en) Sound-breath signal separating and synthesizing method and system
CN102184733A (en) Audio attention-based audio quality evaluation system and method
CN102820037B (en) Chinese initial and final visualization method based on combination feature
CN118629419A (en) Embedded AI speech noise reduction model database construction method, device and storage medium
CN108010533A (en) The automatic identifying method and device of voice data code check
Falk et al. Improving instrumental quality prediction performance for the Blizzard Challenge
Zhan et al. Audio post-processing detection and identification based on audio features
CN113488070B (en) Detection method, device, electronic device and storage medium for tampering with audio
CN108271017A (en) The audio loudness measuring system and method for digital broadcast television
Santacruz et al. Spectral envelope transformation in singing voice for advanced pitch shifting
CN108769874B (en) Method and device for separating audio in real time
CN107025902A (en) Data processing method and device
CN113329190B (en) Animation design video production analysis management method, equipment, system and computer storage medium
CN112233693A (en) Sound quality evaluation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250113

Address after: Room B03, 9th Floor, Building B2, Big Data Industrial Park, 668 Xiangpu Road, High tech Zone, Hefei City, Anhui Province, China 230000

Patentee after: Anhui Feishu Information Technology Co.,Ltd.

Country or region after: China

Address before: 230000 Room 401, No. 3, Tianzhu Road, high tech Zone, Hefei, Anhui

Patentee before: Hefei qunyin Information Service Co.,Ltd.

Country or region before: China
