CN105989846B - Multichannel voice signal synchronization method and device - Google Patents

Info

Publication number: CN105989846B
Application number: CN201510321268.XA
Authority: CN (China)
Prior art keywords: channel, template, voice signal, sampling, channels
Other languages: Chinese (zh)
Other versions: CN105989846A
Inventor: 王育军
Current assignee: Leshi Zhixin Electronic Technology Tianjin Co Ltd
Original assignee: Leshi Zhixin Electronic Technology Tianjin Co Ltd
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Application filed by Leshi Zhixin Electronic Technology Tianjin Co Ltd
Priority to CN201510321268.XA
Publication of CN105989846A
Application granted
Publication of CN105989846B


Abstract

The invention provides a method and a device for synchronizing multi-channel voice signals. The method comprises the following steps: selecting a channel as a template channel, and generating a corresponding voice signal energy envelope template; matching the voice signals of the other channels against the energy envelope template, so as to determine the offset value between the voice signal of each other channel and the voice signal of the template channel; and synchronizing the voice signals of the other channels with the voice signal of the template channel according to the offset values. The invention solves the problems of the prior art, in which multi-channel voice signals are synchronized by manual adjustment, which wastes human resources and is inefficient.

Description

Multichannel voice signal synchronization method and device
Technical Field
The embodiment of the invention relates to the field of voice signal processing, in particular to a multi-channel voice signal synchronization method and device.
Background
Currently, in the field of speech signal processing, it is often necessary to separately collect speech signals from multiple channels for research in noise immunity, speech recognition, etc., where each channel included in the multiple channels may be a speech signal input or output channel provided by any speech collecting device.
However, in practical applications, the voice signals respectively collected from the multiple channels (hereinafter, simply referred to as "multi-channel voice signals") may not be synchronized with each other (i.e., may not be aligned on the time axis). For example, to study the perceptual relationship between the far field and the near field for the same sound source, one voice capture device (e.g., a cell phone) may be used to record at a location closer to the sound source and another voice capture device (e.g., a microphone) may be used to record at a location further from the sound source, but since the cell phone and microphone may not start recording at the same time, the voice signals captured from the channels of the cell phone and microphone may not be synchronized. Using an unsynchronized multi-channel speech signal such as in the above example for subsequent studies may reduce the reliability of the results of the study.
In view of the above problems, in the prior art, a manual adjustment mode is generally adopted to synchronize asynchronous multi-channel voice signals, and specifically, a researcher may separately observe waveforms of voice signals of each channel in the multi-channel voice signals, and then manually synchronize the multi-channel signals according to shapes of the waveforms. However, this synchronization method is not only wasteful of human resources, but also inefficient.
Disclosure of Invention
The embodiments of the invention provide a method and a device for synchronizing multi-channel voice signals, which are used to solve the problems that, in the prior art, multi-channel voice signals are synchronized by manual adjustment, which wastes human resources and yields low efficiency and accuracy.
The embodiment of the invention provides a multichannel voice signal synchronization method, which comprises the following steps:
selecting a channel as a template channel, and generating a corresponding voice signal energy envelope template;
respectively matching the energy envelopes of the voice signals of the other channels with the energy envelope template so as to respectively determine offset values between the voice signals of the other channels and the voice signals of the template channel;
and respectively synchronizing the voice signals of the other channels with the voice signal of the template channel according to the offset values.
An embodiment of the present invention further provides a multi-channel speech signal synchronization apparatus, including:
the generating module is used for selecting a channel as a template channel and generating a corresponding voice signal energy envelope template;
a determining module, configured to match the speech signals of the other channels with the energy envelope template, respectively, so as to determine offset values between the speech signals of the other channels and the speech signals of the template channel, respectively;
and the synchronization module is used for respectively synchronizing the voice signals of the other channels with the voice signal of the template channel according to the offset values.
The multi-channel voice signal synchronization method and device provided by the embodiments of the invention match the energy envelope of a waveform segment cut from each channel against the energy envelope template generated from a waveform segment cut from the template channel, determine the offset value between the voice signal of each channel and that of the template channel, and synchronize the multi-channel voice signals by trimming each channel's voice signal according to its offset value. This saves labor and improves efficiency, solving the prior-art problems in which manual adjustment of multi-channel voice signals wastes human resources and is inefficient.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 illustrates a multi-channel speech signal synchronization process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of calculating an energy vector of a waveform segment by sliding a selected sliding window in the waveform segment according to an embodiment of the present invention;
FIG. 3 is a process for generating an energy envelope template according to a speech signal of a template channel by using a selected parameter value in practical applications according to an embodiment of the present invention;
FIG. 4 is a process for determining an offset value between a speech signal of each other channel and a speech signal of a template channel for the speech signal of the other channel according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an offset between the speech signal of one of the other channels and the speech signal of the template channel according to the embodiment of the present invention;
FIG. 6 is a simplified process diagram for parallel processing and synchronization of multi-channel speech signals according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a multi-channel speech signal synchronization apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a process of synchronizing a multi-channel speech signal according to an embodiment of the present invention, which specifically includes the following steps:
S101: Select a channel as the template channel, and generate a corresponding voice signal energy envelope template.
The execution body of the embodiments of the present invention may be any device that can process a voice signal, including but not limited to: personal computers, smart phones, tablet computers, smart televisions, smart watches, smart bracelets, in-vehicle terminals, large and medium-sized computers, computer clusters, and the like. The execution body described here does not limit the present invention.
In embodiments of the present invention, a speech signal may be acquired from the same sound source using multiple channels, which may include the selected template channel and at least one other channel. The energy envelope template may be a single feature or a combined feature in terms of energy envelope extracted from a portion or all of the speech signal of the template channel. Of course, besides the energy envelope, features in terms of volume, frequency, timbre, waveform shape, etc. may also be extracted from the speech signal of the template channel as a template for subsequent matching.
S102: Match the voice signals of the other channels against the energy envelope template, so as to determine the offset value between the voice signal of each other channel and the voice signal of the template channel.
In the embodiment of the present invention, the voice signal of the template channel and the voice signals of the other channels are the multi-channel voice signals to be synchronized.
After the energy envelope template is generated, the voice signals of the other channels may be processed in a similar way to the method used to generate the template. Then, for each other channel, the portion of its voice signal corresponding on the time axis to the voice signal of the template channel, and the offset value between the two corresponding portions, may be determined for use in the subsequent synchronization.
S103: Synchronize the voice signals of the other channels with the voice signal of the template channel according to the offset values.
In the embodiment of the present invention, the offset value gives the difference on the time axis between the voice signal of the template channel and the voice signal of each other channel (i.e., the difference between the positions on the time axis of corresponding waveform segments in the voice signals of any two channels). The voice signal of each other channel can therefore be aligned with the voice signal of the template channel by shifting and/or trimming on the time axis, thereby achieving multi-channel voice signal synchronization.
By the method, the multichannel voice signals can be automatically synchronized by the equipment, so that manpower is saved, and efficiency is improved.
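As a concrete illustration of the problem and of what steps S101 to S103 achieve, the following sketch (the array names and the 5-sample offset are invented for illustration) builds two "recordings" of the same source that did not start at the same time, and shows that once the offset value is known, synchronization reduces to a shift/crop on the time axis:

```python
import numpy as np

# Two "recordings" of the same source that did not start simultaneously:
source = np.sin(np.linspace(0.0, 20.0, 200))
near = source                                # this device started on time
far = np.concatenate([np.zeros(5), source])  # this device started 5 samples earlier

# The two recordings are not aligned on the time axis ...
assert not np.array_equal(far[:200], near)
# ... but once the offset value (5 samples here) is determined,
# synchronization is just a shift/crop on the time axis (step S103):
assert np.array_equal(far[5:], near)
```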
In this embodiment of the present invention, for step S101, generating a corresponding voice signal energy envelope template specifically includes: cutting a waveform segment from the voice signal of the template channel, and calculating the energy envelope of that waveform segment as the generated voice signal energy envelope template.
The cut waveform segment may be a portion of the template channel's voice signal where the waveform varies significantly, or a portion whose waveform differs markedly from other portions, etc. This makes the subsequent matching easier and makes the matching and synchronization results more reliable. The length of the cut waveform segment is not limited; in general, the longer the segment, the more reliable the subsequent matching, but correspondingly the longer the subsequent processing takes. In most application scenarios, the segment length can be set to about 5 seconds.
In addition, when cutting the waveform segment, the estimated maximum offset value between the voice signal of each other channel and the voice signal of the template channel (hereinafter, the estimated maximum offset value) needs to be considered. Suppose the voice signal of the template channel is ahead of the voice signals of the other channels on the time axis (i.e., the template channel started collecting voice signals earlier than the other channels). If the waveform segment were cut directly from the start of the template channel's voice signal, the other channels' voice signals might contain no portion corresponding to the cut segment, which would undermine the reliability of the subsequent synchronization result.
To prevent this problem, the waveform segment may be cut not from the start of the template channel's voice signal, but from the time point that lies the estimated maximum offset value after the start, or from any later time point. For example, if the estimated maximum offset value is 10 seconds, the waveform segment may be cut starting at or after the 10th second of the template channel's voice signal.
Further, the voice signal of the template channel may be a discrete digital voice signal or a continuous analog voice signal. To reduce the amount of calculation and speed up computing the energy envelope, the template channel's voice signal may first be sampled and extracted, and the energy envelope calculated afterwards. Therefore, for the above step, calculating the energy envelope of the waveform segment specifically includes: sampling and extracting the waveform segment to determine a first set number of sampling points; sliding a selected sliding window across the waveform segment in a set manner; and calculating an energy vector of the waveform segment from the sampling points contained in the sliding window during the sliding process, to serve as the energy envelope of the waveform segment.
Further, sliding the selected sliding window across the waveform segment in the set manner and calculating the energy vector of the waveform segment from the sampling points contained in the window specifically comprises: sliding the selected sliding window m times across the waveform segment with a set sliding step, generating an m-dimensional energy vector of the waveform segment, where the value of the i-th dimension of the m-dimensional energy vector is the average energy of the sampling points contained in the window after the i-th slide, m and i are positive integers, and i is less than or equal to m.
For example, as shown in fig. 2, the abscissa axis is the time axis and the ordinate (y) axis may represent volume. Suppose the sliding window slides m times across the cut waveform segment, and the generated m-dimensional energy vector is denoted [x_1, x_2, x_3, ..., x_m]. Then x_1 is the average energy of the sampling points contained in the window after the 1st slide, x_2 is the average energy after the 2nd slide, x_3 is the average energy after the 3rd slide (the window after the 3rd slide is omitted from fig. 2), and so on.
In addition, to speed up calculating the average energy of the sampling points contained in the sliding window, after the sampling points are determined, the following preprocessing may be performed: the sampling values of every several (e.g., 16 or 8) consecutive sampling points are averaged, and the calculated average sampling value is re-determined as the sampling value of each of those points. The sampling value of a sampling point may be its value on the y-axis, and the energy of a sampling point equals the square of its sampling value. The energy vector of the waveform segment is then calculated through the sliding window on this preprocessed basis. This speeds up the calculation of the average energy and also removes high-frequency random disturbance from the voice signal.
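A minimal sketch of this preprocessing step (the function name and the toy 4-sample signal are illustrative; block sizes of 16 or 8 are the ones mentioned above):

```python
import numpy as np

def block_average(x, block=16):
    """Average the sampling values of every `block` consecutive points
    and re-determine that average as the sampling value of each of them."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for s in range(0, len(x), block):
        out[s:s + block] = x[s:s + block].mean()
    return out

# The energy of a sampling point is the square of its (re-determined) value:
samples = np.array([1.0, 3.0, 2.0, 2.0])
smoothed = block_average(samples, block=2)  # [2.0, 2.0, 2.0, 2.0]
energies = smoothed ** 2                    # [4.0, 4.0, 4.0, 4.0]
```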
In practical application, appropriate values may be assigned to the variable parameters before use. For example, fig. 3 shows the process of selecting a group of values, assigning them to the variable parameters, and then generating an energy envelope template from the voice signal of the template channel, where the template channel is channel 1 and the other channels are channels 2, 3 and 4. The process may specifically comprise the following steps:
S301: A waveform segment 5 seconds in length is cut starting at the 10th second of the channel-1 voice signal, where it is assumed that the estimated maximum offset value is not greater than 10 seconds.
S302: The cut waveform segment is sampled and extracted at a sampling interval of 1 millisecond, determining 5000 sampling points.
S303: Starting from the 1st sampling point, the sampling values of every 16 consecutive sampling points are averaged, and the calculated average sampling value is re-determined as the sampling value of those 16 points, until the sampling values of all sampling points have been re-determined.
Executing S303 accomplishes the preprocessing of the sampling points. For convenience of description, the sampling points whose values have been re-determined are hereinafter referred to as preprocessed sampling points.
S304: A sliding window 32 milliseconds in length is slid 313 times in sequence across the sampled waveform segment with a sliding step of 16 milliseconds, generating a 313-dimensional energy vector, where the value of the i-th dimension is the average energy of the preprocessed sampling points contained in the window after the i-th slide, i is a positive integer, and i is less than or equal to 313.
S305: the generated 313-dimensional energy vector is taken as an energy envelope of the truncated waveform segment, i.e., an energy envelope template.
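Steps S301 to S305 can be sketched end to end as follows. A random signal stands in for real speech, and counting the trailing partial windows is an assumption made here so that the slide count comes out to the 313 stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(5000)  # 5 s segment at a 1 ms sampling interval (S301-S302)

# S303: average every 16 consecutive sampling values and re-determine them.
pre = signal.copy()
for s in range(0, 5000, 16):
    pre[s:s + 16] = signal[s:s + 16].mean()

# S304: slide a 32 ms (32-sample) window in 16 ms (16-sample) steps; the
# mean energy (squared value) of the samples inside each window position
# gives one dimension of the energy vector.
template = np.array([np.mean(pre[i:i + 32] ** 2)
                     for i in range(0, 5000, 16)])

# S305: the 313-dimensional energy vector is the energy envelope template.
assert template.shape == (313,)
```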
In multiple actual tests with the above parameter values, the synchronization accuracy of the multi-channel voice signals reached 100%; the theoretical error of aligning the voice signals during synchronization is 16 milliseconds, and the measured error is about 100 milliseconds.
In this embodiment of the present invention, for step S102, an offset value between the voice signal of each other channel and the voice signal of the template channel may be determined as shown in fig. 4, which specifically includes the following steps:
S401: Starting from the beginning of the other channel's voice signal, a second set number of waveform segments, each the same length as the segment cut from the voice signal of the template channel, are cut in sequence using the same method used to cut the segment from the template channel.
S402: The second set number of waveform segments are each sampled and extracted, and their energy envelopes calculated, using the same sampling-and-extraction and energy-envelope-calculation methods applied to the waveform segment of the template channel.
S403: Among the second set number of waveform segments, the waveform segment whose energy envelope best matches the energy envelope of the segment cut from the voice signal of the template channel is determined.
In the embodiment of the present invention, the m-dimensional energy vector corresponding to the waveform segment cut from the voice signal of the template channel may be denoted as [x_1, x_2, ..., x_m], and, among the second set number of waveform segments, the m-dimensional energy vector corresponding to the n-th waveform segment may be denoted as [y_n1, y_n2, ..., y_nm], where n is a positive integer not greater than the second set number;
the distance between [y_n1, y_n2, ..., y_nm] and k_n × [x_1, x_2, ..., x_m] is calculated, where k_n is the energy gain factor:

k_n = (y_n1 + y_n2 + ... + y_nm) / (x_1 + x_2 + ... + x_m);
and determining the waveform segment corresponding to the calculated minimum distance as the waveform segment of which the corresponding energy envelope is most matched with the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
In practical applications, the distance between [y_n1, y_n2, ..., y_nm] and k_n × [x_1, x_2, ..., x_m] may be measured in a variety of ways, including but not limited to: based on the mean square error, the Euclidean distance, etc.

For example, when the distance is measured based on the mean square error, calculating the distance between [y_n1, y_n2, ..., y_nm] and k_n × [x_1, x_2, ..., x_m] may specifically include: using the formula

d_n = ((y_n1 - k_n * x_1)^2 + (y_n2 - k_n * x_2)^2 + ... + (y_nm - k_n * x_m)^2) / m

to calculate the mean square error between [y_n1, y_n2, ..., y_nm] and k_n × [x_1, x_2, ..., x_m] as the distance between them.
The gain factor k_n is needed because, even for corresponding waveform segments in the multi-channel voice signals, the volumes of the segments may differ significantly, so the amplitudes of their energy envelopes differ significantly and the matching degree could otherwise be rather low. To solve this problem, k_n brings the energy envelope of each other channel's segment and the energy envelope of the template channel's segment to a substantially consistent amplitude level, so that the corresponding waveform segment can be determined more reliably on each other channel's voice signal from the energy envelope of the template channel's voice signal.
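A sketch of this gain-compensated distance follows. The summed-energy definition of k_n used here is one plausible reading of "energy gain factor" rather than a formula taken verbatim from the text:

```python
import numpy as np

def gain_mse(x, y):
    """Mean square error between envelope y and k*x, where the gain k
    scales the template envelope x to y's overall energy level
    (assumed definition: k = sum(y) / sum(x))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = y.sum() / x.sum()
    return float(np.mean((y - k * x) ** 2))

# A segment that is simply a louder copy of the template matches exactly:
template_env = np.array([1.0, 4.0, 2.0])
assert gain_mse(template_env, 3.0 * template_env) == 0.0
# whereas an unrelated envelope does not:
assert gain_mse(template_env, np.array([4.0, 1.0, 2.0])) > 0.0
```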
S404: The difference on the time axis between the best-matching waveform segment and the waveform segment cut from the voice signal of the template channel is determined as the offset value between the voice signal of that other channel and the voice signal of the template channel.
Here, the difference is the difference on the time axis between the start of the best-matching waveform segment and the start of the waveform segment cut from the voice signal of the template channel. For example, fig. 5 illustrates the offset between the voice signal of one other channel and the voice signal of the template channel of fig. 2. Assuming that, on the other channel's voice signal, the start of the best-matching waveform segment on the time axis is t, and the start of the segment cut from the template channel's voice signal on the time axis is γ, the offset τ between the two voice signals is: τ = t − γ.
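The scan of steps S401 to S404 can be sketched as follows. The function and parameter names are invented for illustration, and a small window keeps the toy envelope cheap to compute:

```python
import numpy as np

def envelope(seg, win=4, step=2):
    # Minimal stand-in for the windowed mean-energy envelope described above.
    return np.array([np.mean(seg[i:i + win] ** 2)
                     for i in range(0, len(seg) - win + 1, step)])

def find_offset(other, template, gamma, seg_len, n_candidates):
    """Return tau = t - gamma: gamma is where the segment was cut from the
    template channel, t is the start of the best-matching candidate segment
    cut from the beginning of the other channel."""
    x = envelope(template[gamma:gamma + seg_len])
    best_t, best_d = 0, np.inf
    for t in range(n_candidates):             # S401-S402: candidate segments
        y = envelope(other[t:t + seg_len])
        k = y.sum() / x.sum()                 # energy gain factor k_n
        d = np.mean((y - k * x) ** 2)         # mean-square-error distance
        if d < best_d:                        # S403: best-matching segment
            best_t, best_d = t, d
    return best_t - gamma                     # S404: tau = t - gamma

# The other channel began recording 7 samples earlier, so its content is
# shifted 7 samples to the right relative to the template channel:
rng = np.random.default_rng(1)
base = rng.standard_normal(200)
template_sig = base
other_sig = np.concatenate([np.zeros(7), base])
tau = find_offset(other_sig, template_sig, gamma=20, seg_len=40, n_candidates=60)
assert tau == 7
```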
In the embodiment of the present invention, the channel with a larger offset value evidently started earlier, and the channel with the smallest offset value started latest. Therefore, relative to the channel with the smallest offset value, the voice signals of all other channels should be trimmed from their beginnings so as to align with the voice signal of that channel.
According to the above analysis, for step S103, synchronizing the voice signals of the other channels with the voice signal of the template channel according to the offset values specifically includes: determining the smallest of the offset values corresponding to the voice signals of the other channels, and performing the following operation for the voice signal of each other channel: cutting from the beginning of that voice signal a waveform segment whose length is the difference between the signal's offset value and the smallest offset value, and aligning the trimmed voice signal with the voice signal of the channel corresponding to the smallest offset value.
Of course, besides aligning the multi-channel voice signals to the channel with the smallest offset value, they may be aligned to the voice signal of any other channel. For example, based on the voice signal of the template channel and the offset value corresponding to each other channel, each other channel's voice signal may be shifted along the time axis by the distance given by its offset value, so that it aligns with the template channel's voice signal: when the offset value is positive, shift left; when it is negative, shift right; when it is 0, no shift is needed.
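The trim-to-latest-channel alignment described above can be sketched as follows (the function name and toy data are illustrative):

```python
import numpy as np

def align_to_latest(signals, offsets):
    """Trim from the start of each channel the difference between its
    offset value and the smallest offset value, so every channel lines
    up with the latest-starting one."""
    smallest = min(offsets)
    return [np.asarray(s)[off - smallest:] for s, off in zip(signals, offsets)]

a = np.arange(10)        # channel with the smallest offset value (0)
b = np.arange(-3, 10)    # started 3 samples earlier, so offset value 3
aligned_a, aligned_b = align_to_latest([a, b], [0, 3])
assert np.array_equal(aligned_a, aligned_b)
```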
In practical application, the multi-channel voice signals may be processed in parallel: all the offset values are determined first, and synchronization is then performed. Fig. 6 is a simplified diagram of this parallel processing and synchronization. The topmost channel is the template channel, and below it are 3 other channels. When matching the voice signal of each other channel, the energy envelope of each waveform segment cut from that channel may be matched in turn against the energy envelope of the segment cut from the template channel's voice signal (each segment may be matched as soon as it is cut, or multiple segments may be cut first and then matched; fig. 6 adopts the former). The matching scan produces a sequence of mean square errors, and the waveform segment corresponding to the smallest mean square error in the sequence is determined as the segment in that channel corresponding to the segment cut from the template channel's voice signal. The offset value between the two channels' voice signals can then be determined, and synchronization performed on that basis.
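Since each non-template channel is matched against the template independently, the per-channel scans can run in parallel, as sketched below. A direct waveform mean-square-error scan stands in for the envelope matching to keep the sketch short; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def offset_of(args):
    """Scan candidate positions in one channel for the segment cut from
    the template channel (simplified: raw waveform MSE, no envelope)."""
    other, segment = args
    dists = [np.mean((other[t:t + len(segment)] - segment) ** 2)
             for t in range(len(other) - len(segment))]
    return int(np.argmin(dists))

rng = np.random.default_rng(2)
base = rng.standard_normal(200)
segment = base[:40]  # segment cut from the template channel
# Three other channels that started 3, 5 and 7 samples earlier:
others = [np.concatenate([np.zeros(d), base]) for d in (3, 5, 7)]

# S102 runs for every channel in parallel; S103 runs once all offset
# values are known.
with ThreadPoolExecutor() as pool:
    offsets = list(pool.map(offset_of, [(o, segment) for o in others]))
assert offsets == [3, 5, 7]
```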
Based on the same idea, in addition to the multi-channel voice signal synchronization method, the embodiments of the present invention further provide a corresponding multi-channel voice signal synchronization apparatus, as shown in fig. 7.
Fig. 7 is a schematic structural diagram of a multi-channel speech signal synchronization apparatus provided in an embodiment of the present invention, which specifically includes:
a generating module 701, configured to select a channel as a template channel, and generate a corresponding speech signal energy envelope template;
a determining module 702, configured to match the speech signal of each other channel with the energy envelope template, so as to determine an offset value between the speech signal of each other channel and the speech signal of the template channel;
a synchronization module 703, configured to synchronize the voice signals of the other channels with the voice signal of the template channel according to the offset value.
The generating module 701 is specifically configured to intercept a waveform segment from the voice signal of the template channel, sample and extract the waveform segment, determine a first set number of sampling points, slide a selected sliding window in the waveform segment according to a set manner, and calculate an energy vector of the waveform segment according to each sampling point included in the selected sliding window in the sliding process, where the energy vector is used as a corresponding generated voice signal energy envelope template.
The determining module 702 is specifically configured to sequentially intercept, from the start of the voice signal of the other channel, a second set number of waveform segments having the same length as the waveform segments intercepted from the voice signal of the template channel by using a method used for intercepting the waveform segments from the voice signal of the template channel;
sampling and extracting the waveform segments of the second set number by adopting a sampling and extracting method and an energy envelope calculating method of the waveform segments of the template channel respectively, and calculating corresponding energy envelopes;
determining a waveform segment with the corresponding energy envelope which is most matched with the energy envelope of the waveform segment intercepted from the voice signal of the template channel in the waveform segments with the second set number;
and determining the difference value of the waveform segment with the most matched energy envelope and the waveform segment cut from the voice signal of the template channel on a time axis as the offset value between the voice signals of the other channels and the voice signal of the template channel.
The determining module 702 is specifically configured to slide a selected sliding window m times across the waveform segment with a set sliding step, generating an m-dimensional energy vector of the waveform segment, where the value of the ith dimension of the m-dimensional energy vector is the average energy of the sampling points contained in the sliding window after the ith slide, m and i are positive integers, and i is less than or equal to m.
The determining module 702 is specifically configured to record the m-dimensional energy vector corresponding to the waveform segment intercepted from the voice signal of the template channel as [x1, x2, ..., xm], and to record the m-dimensional energy vector corresponding to the nth of the second set number of waveform segments as [yn1, yn2, ..., ynm], where n is at most the second set number;
calculate the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm], where kn is the energy gain factor,
kn = (x1·yn1 + x2·yn2 + ... + xm·ynm) / (x1² + x2² + ... + xm²);
and determine the waveform segment corresponding to the smallest calculated distance as the waveform segment whose energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
The determining module 702 is specifically configured to employ the formula
dn = [(yn1 − kn·x1)² + (yn2 − kn·x2)² + ... + (ynm − kn·xm)²] / m
to calculate the mean square error between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] as the distance between them.
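A sketch of this gain-compensated distance (the patent defines kn only as an energy gain factor; the least-squares closed form used below is an assumption, chosen because it minimizes the mean square error between y and kn·x):

```python
def gain_mse(x, y):
    """Distance between an envelope y and the gain-scaled template
    envelope x: k is the least-squares energy gain (an assumed closed
    form, not taken from the patent text), followed by the mean square
    error between y and k*x."""
    k = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return sum((b - k * a) ** 2 for a, b in zip(x, y)) / len(x)
```

A segment whose envelope is an exact scaled copy of the template, e.g. `gain_mse([1, 2], [2, 4])`, gives distance 0, so level differences between channels do not hurt the match.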
The synchronization module 703 is specifically configured to determine the minimum offset value among the offset values corresponding to the voice signals of the other channels, and to perform the following operation for the voice signal of each other channel: cut off, from the beginning of the voice signal, a waveform segment whose length equals the difference between the offset value corresponding to that voice signal and the minimum offset value, and align the cut voice signal with the voice signal corresponding to the minimum offset value.
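The trimming operation above can be sketched as follows (illustrative names; offsets are expressed in sampling points):

```python
def synchronize(channels, offsets):
    """Cut (offset - minimum offset) sampling points from the start of
    each channel so that every channel aligns with the channel having
    the minimum offset value."""
    base = min(offsets)
    return [ch[off - base:] for ch, off in zip(channels, offsets)]
```

For example, `synchronize([[1, 2, 3, 4], [9, 1, 2, 3]], [0, 1])` trims one sampling point from the second channel, returning `[[1, 2, 3, 4], [1, 2, 3]]`.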
The apparatus shown in Figure 7 and described above may be located on any device capable of processing speech signals.
In the embodiment of the present invention, the related functional modules may be implemented by a hardware processor (hardware processor).
The multichannel voice signal synchronization method and device provided by the embodiments of the invention match the energy envelope of a waveform segment intercepted from each channel against the energy envelope template generated from a waveform segment of the template channel, determine the offset value between the voice signal of each channel and that of the template channel, and synchronize the multichannel voice signals by cutting each channel's voice signal according to its offset value, thereby saving labor and improving efficiency. This solves the problems of the prior art, in which multichannel voice signals are synchronized by manual adjustment, wasting human resources and operating at low efficiency.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for synchronizing a multi-channel speech signal, comprising:
selecting a channel as a template channel, and generating a corresponding voice signal energy envelope template;
respectively matching the energy envelopes of the voice signals of the other channels with the energy envelope template so as to respectively determine offset values between the voice signals of the other channels and the voice signals of the template channel;
synchronizing, according to the offset values, the voice signals of the other channels with the voice signal of the template channel respectively;
generating a corresponding voice signal energy envelope template, specifically comprising:
intercepting a waveform segment from the voice signal of the template channel, starting from a time point whose distance from the start of the voice signal of the template channel equals an estimated maximum offset value, or from a time point after it; the estimated maximum offset value is the estimated maximum offset value between the voice signals of the other channels and the voice signal of the template channel;
sampling and extracting the waveform segments to determine a first set number of sampling points;
sliding a selected sliding window in the waveform segment according to a set mode, and calculating an energy vector of the waveform segment according to each sampling point contained in the selected sliding window in the sliding process to serve as a corresponding generated voice signal energy envelope template;
for each of the speech signals of the other channels, determining an offset value between the speech signal of the other channel and the speech signal of the template channel as follows:
sequentially intercepting, from the start of the voice signal of the other channel, a second set number of waveform segments of the same length as the waveform segment intercepted from the voice signal of the template channel, using the method used for intercepting the waveform segment from the voice signal of the template channel;
sampling and extracting each of the second set number of waveform segments, and calculating the corresponding energy envelopes, using the sampling-and-extraction method and energy envelope calculation method applied to the waveform segment of the template channel;
determining, among the second set number of waveform segments, the waveform segment whose energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel;
determining the difference on the time axis between the best-matching waveform segment and the waveform segment intercepted from the voice signal of the template channel as the offset value between the voice signal of the other channel and the voice signal of the template channel;
preprocessing the sampling points, comprising:
averaging the sampling values of every several consecutive sampling points, and re-determining the calculated average sampling value as the sampling value of each of those sampling points;
wherein the sampling value of a sampling point is the value of the sampling point on the y axis, and the energy of a sampling point equals the square of its sampling value.
2. The method according to claim 1, wherein sliding a selected sliding window in a set manner in the waveform segment, and calculating an energy vector of the waveform segment according to each of the sample points included in the selected sliding window during the sliding process, specifically comprises:
sliding the selected sliding window m times across the waveform segment with a set sliding step to generate an m-dimensional energy vector of the waveform segment, wherein the value of the ith dimension of the m-dimensional energy vector is the average energy of the sampling points contained in the selected sliding window after the ith slide, m and i are positive integers, and i is less than or equal to m.
3. The method according to claim 2, wherein determining, among the second set number of waveform segments, a waveform segment whose corresponding energy envelope best matches an energy envelope of a waveform segment extracted from the speech signal of the template channel specifically includes:
recording the m-dimensional energy vector corresponding to the waveform segment intercepted from the voice signal of the template channel as [x1, x2, ..., xm], and recording the m-dimensional energy vector corresponding to the nth of the second set number of waveform segments as [yn1, yn2, ..., ynm], where n is at most the second set number;
calculating the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm], where kn is the energy gain factor,
kn = (x1·yn1 + x2·yn2 + ... + xm·ynm) / (x1² + x2² + ... + xm²);
and determining the waveform segment corresponding to the smallest calculated distance as the waveform segment whose energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
4. The method according to claim 3, wherein calculating the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] specifically comprises:
using the formula dn = [(yn1 − kn·x1)² + (yn2 − kn·x2)² + ... + (ynm − kn·xm)²] / m to calculate the mean square error between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] as the distance between them.
5. The method according to claim 1, wherein the synchronizing the voice signals of the other channels with the voice signal of the template channel according to the offset value comprises:
determining the minimum offset value in the offset values corresponding to the voice signals of the other channels;
for the voice signal of each of the other channels, performing the following operation: cutting off, from the beginning of the voice signal, a waveform segment whose length equals the difference between the offset value corresponding to that voice signal and the minimum offset value, and aligning the cut voice signal with the voice signal corresponding to the minimum offset value.
6. A multi-channel speech signal synchronization apparatus, comprising:
the generating module is used for selecting a channel as a template channel and generating a corresponding voice signal energy envelope template;
a determining module, configured to match the speech signals of the other channels with the energy envelope template, respectively, so as to determine offset values between the speech signals of the other channels and the speech signals of the template channel, respectively;
a synchronization module, configured to synchronize, according to the offset values, the voice signals of the other channels with the voice signal of the template channel respectively;
the generating module is specifically configured to intercept a waveform segment from the voice signal of the template channel starting from a time point which is a starting point of the voice signal of the template channel and is a predicted maximum deviation value, or a time point after the time point, where the predicted maximum deviation value is a predicted maximum deviation value between the voice signals of the other channels and the voice signal of the template channel; sampling and extracting the waveform segment, determining a first set number of sampling points, sliding a selected sliding window in the waveform segment according to a set mode, and calculating an energy vector of the waveform segment according to each sampling point contained in the selected sliding window in the sliding process to serve as a generated corresponding voice signal energy envelope template;
preprocessing the sampling points, comprising:
averaging the sampling values of every several consecutive sampling points, and re-determining the calculated average sampling value as the sampling value of each of those sampling points;
wherein the sampling value of a sampling point is the value of the sampling point on the y axis, and the energy of a sampling point equals the square of its sampling value.
7. The apparatus according to claim 6, wherein the synchronization module is specifically configured to determine the minimum offset value among the offset values corresponding to the voice signals of the other channels, and, for the voice signal of each of the other channels, to perform the following operation: cut off, from the beginning of the voice signal, a waveform segment whose length equals the difference between the offset value corresponding to that voice signal and the minimum offset value, and align the cut voice signal with the voice signal corresponding to the minimum offset value.
CN201510321268.XA 2015-06-12 2015-06-12 Multichannel voice signal synchronization method and device Active CN105989846B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510321268.XA CN105989846B (en) 2015-06-12 2015-06-12 Multichannel voice signal synchronization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510321268.XA CN105989846B (en) 2015-06-12 2015-06-12 Multichannel voice signal synchronization method and device

Publications (2)

Publication Number Publication Date
CN105989846A CN105989846A (en) 2016-10-05
CN105989846B true CN105989846B (en) 2020-01-17

Family

ID=57040005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510321268.XA Active CN105989846B (en) 2015-06-12 2015-06-12 Multichannel voice signal synchronization method and device

Country Status (1)

Country Link
CN (1) CN105989846B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986831B (en) * 2017-05-31 2021-04-20 南宁富桂精密工业有限公司 Method for filtering voice interference, electronic device and computer readable storage medium
CN107221340B (en) * 2017-05-31 2021-01-15 福建星网视易信息系统有限公司 Real-time scoring method based on multi-channel audio, storage device and application
CN108021675B (en) * 2017-12-07 2021-11-09 北京慧听科技有限公司 Automatic segmentation and alignment method for multi-equipment recording
CN108725340B (en) * 2018-03-30 2022-04-12 斑马网络技术有限公司 Vehicle voice processing method and system
CN108682436B (en) * 2018-05-11 2020-06-23 北京海天瑞声科技股份有限公司 Voice alignment method and device
CN113409815B (en) * 2021-05-28 2022-02-11 合肥群音信息服务有限公司 Voice alignment method based on multi-source voice data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514430A (en) * 1995-01-13 2004-07-21 雅马哈株式会社 Digital signal processor for processing sound signal
CN1742492A (en) * 2003-02-14 2006-03-01 汤姆森特许公司 Automatic synchronization of audio and video based media services of media content
CN1971710A (en) * 2006-12-08 2007-05-30 中兴通讯股份有限公司 Single-chip based multi-channel multi-voice codec scheduling method
CN102088625A (en) * 2003-02-14 2011-06-08 汤姆森特许公司 Automatic synchronization of audio-video-based media services of media content
CN102419998A (en) * 2011-09-30 2012-04-18 广州市动景计算机科技有限公司 Voice frequency processing method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6658112B1 (en) * 1999-08-06 2003-12-02 General Dynamics Decision Systems, Inc. Voice decoder and method for detecting channel errors using spectral energy evolution
AT410874B (en) * 2001-02-22 2003-08-25 Peter Ing Gutwillinger DATA TRANSFER METHOD

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514430A (en) * 1995-01-13 2004-07-21 雅马哈株式会社 Digital signal processor for processing sound signal
CN1742492A (en) * 2003-02-14 2006-03-01 汤姆森特许公司 Automatic synchronization of audio and video based media services of media content
CN102088625A (en) * 2003-02-14 2011-06-08 汤姆森特许公司 Automatic synchronization of audio-video-based media services of media content
CN1971710A (en) * 2006-12-08 2007-05-30 中兴通讯股份有限公司 Single-chip based multi-channel multi-voice codec scheduling method
CN102419998A (en) * 2011-09-30 2012-04-18 广州市动景计算机科技有限公司 Voice frequency processing method and system

Also Published As

Publication number Publication date
CN105989846A (en) 2016-10-05

Similar Documents

Publication Publication Date Title
CN105989846B (en) Multichannel voice signal synchronization method and device
WO2019101123A1 (en) Voice activity detection method, related device, and apparatus
US10481859B2 (en) Audio synchronization and delay estimation
EP1887831A2 (en) Method, apparatus and program for estimating the direction of a sound source
CN110047478B (en) Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation
JP2010112996A (en) Voice processing device, voice processing method and program
CN111640411B (en) Audio synthesis method, device and computer readable storage medium
CN105374357B (en) Voice recognition method and device and voice control system
CN107509155A (en) A kind of bearing calibration of array microphone, device, equipment and storage medium
CN110853677B (en) Drumbeat beat recognition method and device for songs, terminal and non-transitory computer readable storage medium
CN111986695A (en) Non-overlapping sub-band division fast independent vector analysis voice blind separation method and system
CN108682436B (en) Voice alignment method and device
CN110610718A (en) Method and device for extracting expected sound source voice signal
CN102760435A (en) Frequency-domain blind deconvolution method for voice signal
CN107592600B (en) Pickup screening method and pickup device based on distributed microphones
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
Kepesi et al. Joint position-pitch estimation for multiple speaker scenarios
CN107025902A (en) Data processing method and device
CN104900227A (en) Voice characteristic information extraction method and electronic equipment
CN111028857B (en) Method and system for reducing noise of multichannel audio-video conference based on deep learning
CN113707149A (en) Audio processing method and device
CN111028860B (en) Audio data processing method and device, computer equipment and storage medium
CN109785864B (en) Method and device for eliminating court trial noise interference
CN112804043A (en) Clock asynchronism detection method, device and equipment
CN112509597A (en) Recording data identification method and device and recording equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 301-1, Room 301-3, Area B2, Animation Building, No. 126 Animation Road, Zhongxin Eco-city, Tianjin Binhai New Area, Tianjin

Applicant after: LE SHI ZHI XIN ELECTRONIC TECHNOLOGY (TIANJIN) Ltd.

Address before: 300453 Tianjin Binhai New Area, Tianjin Eco-city, No. 126 Animation and Animation Center Road, Area B1, Second Floor 201-427

Applicant before: Xinle Visual Intelligent Electronic Technology (Tianjin) Co.,Ltd.

Address after: 300453 Tianjin Binhai New Area, Tianjin Eco-city, No. 126 Animation and Animation Center Road, Area B1, Second Floor 201-427

Applicant after: Xinle Visual Intelligent Electronic Technology (Tianjin) Co.,Ltd.

Address before: 300467 Tianjin Binhai New Area, ecological city, animation Middle Road, building, No. two, B1 District, 201-427

Applicant before: LE SHI ZHI XIN ELECTRONIC TECHNOLOGY (TIANJIN) Ltd.

GR01 Patent grant
PP01 Preservation of patent right

Effective date of registration: 20210201

Granted publication date: 20200117

PD01 Discharge of preservation of patent

Date of cancellation: 20240201

Granted publication date: 20200117

PP01 Preservation of patent right

Effective date of registration: 20240313

Granted publication date: 20200117