Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates a process of synchronizing multi-channel speech signals according to an embodiment of the present invention, which specifically includes the following steps:
S101: selecting a channel as a template channel, and generating a corresponding voice signal energy envelope template.
The execution subject of the embodiment of the present invention may be any device that can be used to process a voice signal. Such devices include, but are not limited to: personal computers, smart phones, tablet computers, smart televisions, smart watches, smart bracelets, vehicle-mounted terminals, large and medium-sized computers, computer clusters, and the like. The described execution subject does not constitute a limitation of the present invention.
In embodiments of the present invention, a speech signal may be acquired from the same sound source using multiple channels, which may include the selected template channel and at least one other channel. The energy envelope template may be a single feature or a combined feature in terms of energy envelope extracted from a portion or all of the speech signal of the template channel. Of course, besides the energy envelope, features in terms of volume, frequency, timbre, waveform shape, etc. may also be extracted from the speech signal of the template channel as a template for subsequent matching.
S102: matching the voice signals of the other channels with the energy envelope template respectively, so as to respectively determine the offset value between the voice signal of each other channel and the voice signal of the template channel.
In the embodiment of the present invention, the voice signal of the template channel and the voice signals of the other channels are the multi-channel voice signals to be synchronized.
After the energy envelope template is generated, the voice signals of the other channels may be processed similarly, using the method for generating the energy envelope template. Then, on the voice signal of each other channel, the portion corresponding to the voice signal of the template channel on the time axis may be determined, together with the offset value between the mutually corresponding portions on the time axis, for use in subsequent synchronization.
S103: synchronizing the voice signals of the other channels with the voice signal of the template channel respectively according to the offset values.
In the embodiment of the present invention, according to the offset values, the time difference on the time axis between the speech signal of the template channel and the speech signal of each other channel (which may be the difference between the positions of corresponding waveform segments of any two channels on the time axis) may be determined, so that the speech signal of the template channel and the speech signals of the other channels can be aligned on the time axis by shifting and/or clipping on the time axis, thereby implementing multi-channel speech signal synchronization.
By this method, the device can automatically synchronize the multi-channel voice signals, thereby saving manpower and improving efficiency.
In this embodiment of the present invention, for step S101, generating a corresponding speech signal energy envelope template specifically includes: intercepting a waveform segment from the voice signal of the template channel, and calculating the energy envelope of the waveform segment as the generated corresponding voice signal energy envelope template.
The intercepted waveform segment may be a portion of the voice signal of the template channel where the waveform variation is significant, or a portion whose waveform differs greatly from the other portions, or the like. In this way, subsequent matching is easier, and the matching and synchronization results are more reliable. The length of the intercepted waveform segment is not limited; generally, the longer the intercepted waveform segment, the more reliable the subsequent matching result, but correspondingly the longer the subsequent processing time. In most application scenarios, the length of the intercepted waveform segment may be set to about 5 seconds.
In addition, when the waveform segment is intercepted, the estimated maximum offset value between the voice signal of each other channel and the voice signal of the template channel (hereinafter referred to as the estimated maximum offset value) needs to be considered. Assuming that the voice signal of the template channel is ahead of the voice signals of the other channels on the time axis (that is, the template channel started collecting the voice signal earlier than the other channels), then if the waveform segment is intercepted directly from the start of the voice signal of the template channel, the voice signals of the other channels may contain no portion corresponding to the intercepted waveform segment, thereby affecting the reliability of the subsequent synchronization result.
In order to prevent such a problem from occurring, the waveform segment may not be intercepted directly from the start of the voice signal of the template channel, but may instead be intercepted from the time point that is the estimated maximum offset value away from the start of the voice signal of the template channel, or from a time point after it. For example, assuming that the estimated maximum offset value is 10 seconds, the waveform segment may be intercepted from the 10th second of the voice signal of the template channel, or from any time point after the 10th second.
Further, the voice signal of the template channel may be a discrete digital voice signal or a continuous analog voice signal. In order to reduce the amount of calculation and increase the speed of calculating the energy envelope, the energy envelope can be calculated after sampling and extracting the voice signal of the template channel. Therefore, for the above steps, calculating the energy envelope of the waveform segment specifically includes: sampling and extracting the waveform segment, determining a first set number of sampling points, sliding a selected sliding window in the waveform segment according to a set mode, and calculating an energy vector of the waveform segment according to each sampling point contained in the selected sliding window in the sliding process to be used as an energy envelope of the waveform segment.
Further, sliding a selected sliding window in the waveform segment according to a set mode, and calculating an energy vector of the waveform segment according to each sampling point contained in the selected sliding window in the sliding process, specifically comprising: and sliding the selected sliding window m times in the waveform segment according to a set sliding step length to generate an m-dimensional energy vector of the waveform segment, wherein the value of the ith dimension in the m-dimensional energy vector is the average energy of each sampling point contained in the selected sliding window after the selected sliding window is slid for the ith time, m and i are positive integers, and i is less than or equal to m.
For example, as shown in fig. 2, the abscissa axis is the time axis and the ordinate axis is the y axis, which may be used to represent the volume. Assuming that the sliding window slides m times in the intercepted waveform segment, the generated m-dimensional energy vector is denoted as [x1, x2, x3, ..., xm]. Then x1 is the average energy of the sampling points contained in the sliding window after the 1st slide, x2 is the average energy of the sampling points contained in the sliding window after the 2nd slide, x3 is the average energy of the sampling points contained in the sliding window after the 3rd slide (the sliding window after the 3rd slide is omitted from fig. 2 and not shown), and so on.
In addition, in order to speed up the calculation of the average energy of the sampling points included in the sliding window, after the sampling points are determined, the following preprocessing may be performed on them: calculating the average of the sampling values of every several (e.g., 16 or 8) consecutive sampling points, and re-determining the calculated average sampling value as the sampling value of each of those sampling points. The sampling value of a sampling point may be its value on the y axis, and the energy of a sampling point is equal to the square of its sampling value. Based on this preprocessing, the energy vector of the waveform segment is subsequently calculated through the sliding window. In this way, the calculation speed of the average energy can be increased, and high-frequency random disturbance in the voice signal can be removed.
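As a minimal illustration, the preprocessing described above can be sketched in Python (the function name and the NumPy representation of the sampling values are illustrative assumptions, not part of the embodiment):

```python
import numpy as np

def preprocess_samples(samples, block=16):
    """Average the sampling values of every `block` consecutive sampling
    points and re-determine that average as the sampling value of each of
    them, removing high-frequency random disturbance.  The block size of
    16 (or 8) follows the text above; a trailing shorter block is simply
    averaged over the points it contains."""
    samples = np.asarray(samples, dtype=float)
    out = np.empty_like(samples)
    for start in range(0, len(samples), block):
        out[start:start + block] = samples[start:start + block].mean()
    return out
```

The energy of each preprocessed sampling point would then be the square of its (re-determined) sampling value.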
In practical application, the variable parameters may each be given appropriate values before use. For example, fig. 3 shows a process of selecting a group of values, assigning them to the variable parameters, and then generating an energy envelope template according to the voice signal of the template channel, where the template channel is channel 1 and the other channels are channel 2, channel 3, and channel 4. The process may specifically comprise the following steps:
s301: a waveform segment of 5 seconds in length is truncated starting at the 10 th second of the channel 1 speech signal, where it is assumed that the estimated maximum offset value is not greater than 10 seconds.
S302: sampling and extracting the intercepted waveform segment at a sampling time interval of 1 millisecond to determine 5000 sampling points.
S303: starting from the 1st sampling point, calculating the average of the sampling values of every 16 consecutive sampling points, and re-determining the calculated average sampling value as the sampling value of those 16 sampling points, until the sampling values of all the sampling points have been re-determined.
By executing S303, the purpose of preprocessing each sampling point is achieved. For convenience of description, each sampling point whose sampling value has been re-determined is hereinafter referred to as a preprocessed sampling point.
S304: sliding a sliding window with the length of 32 milliseconds for 313 times in sequence in the sampled waveform segment according to a sliding step of 16 milliseconds to generate a 313-dimensional energy vector, wherein the value of the ith dimension in the 313-dimensional energy vector is the average energy of each preprocessed sampling point contained in the sliding window after the sliding window slides for the ith time, i is a positive integer, and i is less than or equal to 313.
S305: the generated 313-dimensional energy vector is taken as an energy envelope of the truncated waveform segment, i.e., an energy envelope template.
In multiple actual tests based on these parameter values, the synchronization accuracy of the multi-channel voice signals reached 100%; the theoretical error of aligning the voice signals during synchronization is 16 milliseconds, and the measured error is about 100 milliseconds.
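Steps S301 to S305 can be sketched as follows, assuming the voice signal is available as one sampling value per millisecond (so the 1 ms sampling of S302 is already done), and assuming the trailing partial windows are kept so that the number of window positions comes out to ceil(5000/16) = 313, matching the stated dimension; all names are illustrative:

```python
import numpy as np

def energy_envelope_template(signal_1khz):
    """Sketch of steps S301-S305 with the parameter values above."""
    # S301/S302: a 5-second segment starting at the 10th second
    # -> 5000 sampling points at 1 sample per millisecond
    seg = np.array(signal_1khz[10 * 1000:15 * 1000], dtype=float)
    # S303: re-determine every 16 consecutive sampling values as their average
    for i in range(0, len(seg), 16):
        seg[i:i + 16] = seg[i:i + 16].mean()
    # energy of a sampling point = square of its sampling value
    energy = seg ** 2
    # S304: slide a 32 ms window in 16 ms steps; trailing partial windows
    # are kept, yielding 313 dimensions (one per window start)
    vector = np.array([energy[t:t + 32].mean() for t in range(0, len(seg), 16)])
    return vector  # S305: the 313-dimensional energy envelope template
```

For a 20-second input, the returned vector has exactly 313 dimensions, in line with S304.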
In this embodiment of the present invention, for the step S102, an offset value between the speech signal of the other channel and the speech signal of the template channel may be determined for each speech signal of the other channel, as shown in fig. 4, specifically including the following steps:
S401: sequentially intercepting, from the start of the voice signal of the other channel, a second set number of waveform segments having the same length as the waveform segment intercepted from the voice signal of the template channel, by using the method used for intercepting the waveform segment from the voice signal of the template channel.
S402: sampling and extracting the second set number of waveform segments respectively, by using the sampling and extracting method and the energy envelope calculating method applied to the waveform segment of the template channel, and calculating the corresponding energy envelopes.
S403: determining, among the second set number of waveform segments, the waveform segment whose corresponding energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
In the embodiment of the present invention, the m-dimensional energy vector corresponding to the waveform segment intercepted from the speech signal of the template channel may be denoted as [x1, x2, ..., xm], and among the second set number of waveform segments, the m-dimensional energy vector corresponding to the nth waveform segment may be denoted as [yn1, yn2, ..., ynm], where n is a positive integer not greater than the second set number;

the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] is calculated, where kn is the energy gain coefficient;

and the waveform segment corresponding to the calculated minimum distance is determined as the waveform segment whose corresponding energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
In practical applications, the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] may be measured in various ways, including but not limited to: measurement based on the mean square error, the Euclidean distance, and the like.
For example, when the distance is measured based on the mean square error, calculating the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] may specifically include: using the formula

MSEn = (1/m) × Σ (j = 1 to m) (ynj − kn × xj)²

to calculate the mean square error between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] as the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm].
The energy gain coefficient kn is introduced because, even for corresponding waveform segments in the multi-channel speech signals, the volumes of the waveform segments may differ significantly from each other, so that the amplitudes of their energy envelopes differ significantly and the matching degree could otherwise be rather low. To solve this problem, with the amplitude of the energy envelope of the speech signal of the template channel as a reference, kn may be used to adjust the amplitude of the energy envelope of the speech signal of each other channel to a level substantially consistent with that of the template channel, so that the corresponding waveform segment can be determined more reliably on the speech signal of each other channel according to the energy envelope of the speech signal of the template channel.
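The gain-compensated mean square error can be sketched as follows. The text does not fix how kn is obtained, so this sketch uses the least-squares gain that minimizes the error, which is an assumption rather than part of the embodiment:

```python
import numpy as np

def gain_mse_distance(y, x):
    """Mean square error between y = [yn1, ..., ynm] and kn * [x1, ..., xm].

    kn is chosen here as the least-squares gain (an assumption; the
    embodiment only requires some energy gain coefficient kn).
    Returns (distance, kn)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    k = y.dot(x) / x.dot(x)          # assumed least-squares energy gain kn
    mse = np.mean((y - k * x) ** 2)  # the distance used for matching
    return mse, k
```

With this choice of kn, a candidate envelope that is simply a louder or quieter copy of the template envelope yields a distance near zero, which is exactly the amplitude-compensation effect described above.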
S404: determining the difference on the time axis between the waveform segment with the best-matched energy envelope and the waveform segment intercepted from the voice signal of the template channel, as the offset value between the voice signal of the other channel and the voice signal of the template channel.
Wherein the difference is the difference on the time axis between the start of the waveform segment with the best-matched energy envelope and the start of the waveform segment intercepted from the voice signal of the template channel. For example, fig. 5 illustrates the offset between the speech signal of one other channel and the speech signal of the template channel in fig. 2. It can be seen that, assuming that, on the speech signal of the other channel, the start of the waveform segment with the best-matched energy envelope on the time axis is t, and the start of the waveform segment intercepted from the speech signal of the template channel on the time axis is γ, the offset value τ between the speech signal of the other channel and the speech signal of the template channel is: τ = t − γ.
In the embodiment of the present invention, obviously, a channel corresponding to a larger offset value started collecting earlier, and the channel corresponding to the smallest offset value started latest. Therefore, compared with the channel corresponding to the minimum offset value, the voice signals of all the other channels should be clipped from the beginning so as to be aligned with the voice signal of the channel corresponding to the minimum offset value.
According to the above analysis, for step S103, synchronizing the speech signals of the other channels with the speech signal of the template channel according to the offset values specifically includes: determining the smallest offset value among the offset values corresponding to the voice signals of the other channels, and performing the following operation for the voice signal of each other channel: clipping, from the beginning of the voice signal, a waveform segment whose length is the difference between the offset value corresponding to that voice signal and the minimum offset value, and aligning the clipped voice signal with the voice signal corresponding to the minimum offset value.
Of course, in addition to aligning the multi-channel speech signals with the channel corresponding to the minimum offset value as the reference, the multi-channel speech signals may also be aligned with the speech signal of any other channel as the reference. For example, based on the speech signal of the template channel and the offset value corresponding to the speech signal of each other channel, the speech signal of each other channel may be shifted on the time axis by a distance equal to its offset value from the speech signal of the template channel, so that the speech signals of the other channels can be aligned with the speech signal of the template channel. When the offset value is positive, the shift is leftward; when the offset value is negative, the shift is rightward; and when the offset value is 0, no shift is needed.
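The clipping-based alignment of step S103 can be sketched as follows. The channel names and the unit (sampling points) are illustrative assumptions:

```python
def synchronize_by_clipping(signals, offsets):
    """Clip (offset - minimum offset) sampling points from the beginning
    of each channel so that every channel aligns with the channel holding
    the minimum offset value.

    signals: dict mapping a channel name to its list of sampling values
    offsets: dict mapping the same channel name to its offset value,
             expressed in sampling points relative to the template channel
    """
    min_offset = min(offsets.values())
    # the channel with the minimum offset started latest and is kept whole;
    # every other channel loses its extra leading samples
    return {ch: sig[offsets[ch] - min_offset:] for ch, sig in signals.items()}
```

For instance, a channel whose offset exceeds the minimum by 2 sampling points loses its first 2 samples, after which the remaining signals line up on the time axis.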
In practical application, the multi-channel voice signals can be processed in parallel, with synchronization performed after all the offset values have been determined. Fig. 6 is a simplified process diagram of the parallel processing and synchronization of multi-channel voice signals according to the above description, in which the topmost channel is the template channel and below it are the 3 other channels. When matching is performed on the voice signal of each other channel, the energy envelope of each waveform segment intercepted from that channel may be matched in sequence with the energy envelope of the waveform segment intercepted from the voice signal of the template channel (each intercepted waveform segment may be matched immediately after interception, or multiple waveform segments may be intercepted first and then matched respectively; the former method is adopted in fig. 6). This matching scan generates a mean square error sequence, and the waveform segment corresponding to the minimum mean square error in the sequence can be determined as the waveform segment in the other channel that corresponds to the waveform segment intercepted from the voice signal of the template channel. The offset value between the voice signal of the other channel and the voice signal of the template channel may then be determined, and synchronization may be performed based on the offset value.
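The matching scan and its parallel execution across the other channels can be sketched as follows. Thread-based parallelism and the candidate-envelope data structure are illustrative assumptions, and the gain adjustment by kn is omitted for brevity:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def best_match_index(candidate_envelopes, template_envelope):
    """Matching scan for one channel: compute the mean square error of
    every candidate energy envelope against the template envelope and
    return the index of the minimum of the resulting error sequence."""
    errors = [np.mean((np.asarray(c) - template_envelope) ** 2)
              for c in candidate_envelopes]
    return int(np.argmin(errors))

def scan_channels_in_parallel(channel_envelopes, template_envelope):
    """Run the matching scan for several other channels in parallel.

    channel_envelopes: dict mapping a channel name to that channel's list
    of candidate energy envelopes (one per intercepted waveform segment).
    Returns, per channel, the index of the best-matching segment, from
    which the offset value can be derived."""
    with ThreadPoolExecutor() as pool:
        futures = {ch: pool.submit(best_match_index, cands, template_envelope)
                   for ch, cands in channel_envelopes.items()}
        return {ch: f.result() for ch, f in futures.items()}
```

The returned index locates the start t of the best-matched segment on each other channel, from which τ = t − γ follows as described above.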
Based on the same idea as the multi-channel speech signal synchronization method provided in the embodiment of the present invention, an embodiment of the present invention further provides a corresponding multi-channel speech signal synchronization apparatus, as shown in fig. 7.
Fig. 7 is a schematic structural diagram of a multi-channel speech signal synchronization apparatus provided in an embodiment of the present invention, which specifically includes:
a generating module 701, configured to select a channel as a template channel, and generate a corresponding speech signal energy envelope template;
a determining module 702, configured to match the speech signal of each other channel with the energy envelope template, so as to determine an offset value between the speech signal of each other channel and the speech signal of the template channel;
a synchronization module 703, configured to synchronize the voice signals of the other channels with the voice signal of the template channel according to the offset value.
The generating module 701 is specifically configured to intercept a waveform segment from the voice signal of the template channel, sample and extract the waveform segment, determine a first set number of sampling points, slide a selected sliding window in the waveform segment according to a set manner, and calculate an energy vector of the waveform segment according to each sampling point included in the selected sliding window in the sliding process, where the energy vector is used as a corresponding generated voice signal energy envelope template.
The determining module 702 is specifically configured to sequentially intercept, from the start of the voice signal of the other channel, a second set number of waveform segments having the same length as the waveform segments intercepted from the voice signal of the template channel by using a method used for intercepting the waveform segments from the voice signal of the template channel;
sampling and extracting the waveform segments of the second set number by adopting a sampling and extracting method and an energy envelope calculating method of the waveform segments of the template channel respectively, and calculating corresponding energy envelopes;
determining a waveform segment with the corresponding energy envelope which is most matched with the energy envelope of the waveform segment intercepted from the voice signal of the template channel in the waveform segments with the second set number;
and determining the difference value of the waveform segment with the most matched energy envelope and the waveform segment cut from the voice signal of the template channel on a time axis as the offset value between the voice signals of the other channels and the voice signal of the template channel.
The determining module 702 is specifically configured to slide a selected sliding window m times in the waveform segment according to a set sliding step length, and generate an m-dimensional energy vector of the waveform segment, where an ith dimension of the m-dimensional energy vector is an average energy of each sampling point included in the selected sliding window after the selected sliding window is slid for the ith time, m and i are positive integers, and i is less than or equal to m.
The determining module 702 is specifically configured to: denote the m-dimensional energy vector corresponding to the waveform segment intercepted from the speech signal of the template channel as [x1, x2, ..., xm], and denote, among the second set number of waveform segments, the m-dimensional energy vector corresponding to the nth waveform segment as [yn1, yn2, ..., ynm], where n is a positive integer not greater than the second set number;

calculate the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm], where kn is the energy gain coefficient;

and determine the waveform segment corresponding to the calculated minimum distance as the waveform segment whose corresponding energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
The determining module 702 is specifically configured to use the formula

MSEn = (1/m) × Σ (j = 1 to m) (ynj − kn × xj)²

to calculate the mean square error between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] as the distance between them.
The synchronization module 703 is specifically configured to determine a minimum offset value among offset values corresponding to the voice signals of the other channels, and execute the following operations for the voice signal of each of the other channels: and cutting out a waveform segment with the length of the difference between the offset value corresponding to the voice signal and the minimum offset value from the beginning of the voice signal, and aligning the voice signal corresponding to the cut voice signal and the minimum offset value.
The apparatus shown in fig. 7 and described above may specifically be located on any device that can be used to process speech signals.
In the embodiment of the present invention, the related functional modules may be implemented by a hardware processor (hardware processor).
The multi-channel voice signal synchronization method and apparatus provided by the embodiments of the present invention match the energy envelope of the waveform segments intercepted from each channel with the energy envelope template generated from the waveform segment intercepted from the template channel, determine the offset value between the voice signal of each channel and that of the template channel, and synchronize the multi-channel voice signals according to the offset values, thereby saving manpower and improving efficiency. This solves the problems in the prior art that synchronizing multi-channel voice signals by manual adjustment wastes human resources and is inefficient.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.