Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates a process of synchronizing multi-channel speech signals according to an embodiment of the present invention, which specifically includes the following steps:
S101: selecting a channel as a template channel, and generating a corresponding voice signal energy envelope template.
The execution subject of the embodiment of the present invention may be any device that can be used to process a voice signal. Such devices include, but are not limited to: personal computers, smart phones, tablet computers, smart televisions, smart watches, smart bracelets, vehicle-mounted terminals, large and medium-sized computers, computer clusters, and the like. The described execution subject does not constitute a limitation of the present invention.
In embodiments of the present invention, a speech signal may be acquired from the same sound source using multiple channels, which may include the selected template channel and at least one other channel. The energy envelope template may be a single feature or a combined feature in terms of energy envelope extracted from a portion or all of the speech signal of the template channel. Of course, besides the energy envelope, features in terms of volume, frequency, timbre, waveform shape, etc. may also be extracted from the speech signal of the template channel as a template for subsequent matching.
S102: matching the voice signals of the other channels with the energy envelope template respectively, so as to respectively determine the offset value between the voice signal of each other channel and the voice signal of the template channel.
In the embodiment of the present invention, the voice signal of the template channel and the voice signals of the other channels are the multi-channel voice signals to be synchronized.
After the energy envelope template is generated, the voice signals of the other channels may be processed similarly, using the method for generating the energy envelope template. Then, on the voice signal of each other channel, the portion corresponding to the voice signal of the template channel on the time axis may be determined, together with the offset value between the mutually corresponding portions on the time axis, for use in subsequent synchronization.
S103: synchronizing the voice signals of the other channels with the voice signal of the template channel respectively according to the offset values.
In the embodiment of the present invention, according to the offset values, the time difference on the time axis between the speech signal of the template channel and the speech signal of each other channel (which may be the difference between the positions of corresponding waveform segments of any two channels on the time axis) may be determined, so that the speech signal of the template channel and the speech signals of the other channels can be aligned on the time axis by shifting and/or clipping on the time axis, thereby implementing multi-channel speech signal synchronization.
By this method, the device can automatically synchronize the multi-channel voice signals, thereby saving manpower and improving efficiency.
In this embodiment of the present invention, for step S101, generating a corresponding speech signal energy envelope template specifically includes: intercepting a waveform segment from the voice signal of the template channel, and calculating the energy envelope of the waveform segment as the generated corresponding voice signal energy envelope template.
The intercepted waveform segment may be a portion of the voice signal of the template channel where the waveform variation is significant, or a portion whose waveform differs greatly from the other portions, or the like. In this way, subsequent matching is easier, and the matching and synchronization results are more reliable. The length of the intercepted waveform segment is not limited; generally, the longer the intercepted waveform segment, the more reliable the subsequent matching result, but correspondingly the longer the subsequent processing time. In most application scenarios, the length of the intercepted waveform segment may be set to about 5 seconds.
In addition, when the waveform segment is intercepted, the estimated maximum offset value between the voice signal of each other channel and the voice signal of the template channel (hereinafter referred to as the estimated maximum offset value) needs to be considered. Assuming that the voice signal of the template channel is ahead of the voice signals of the other channels on the time axis (that is, the template channel started collecting the voice signal earlier than the other channels), then if the waveform segment is intercepted directly from the start of the voice signal of the template channel, the voice signals of the other channels may contain no portion corresponding to the intercepted waveform segment, thereby affecting the reliability of the subsequent synchronization result.
In order to prevent such a problem from occurring, the waveform segment may not be intercepted directly from the start of the voice signal of the template channel, but may instead be intercepted from the time point that is the estimated maximum offset value away from the start of the voice signal of the template channel, or from a time point after it. For example, assuming that the estimated maximum offset value is 10 seconds, the waveform segment may be intercepted from the 10th second of the voice signal of the template channel, or from any time point after the 10th second.
Further, the voice signal of the template channel may be a discrete digital voice signal or a continuous analog voice signal. In order to reduce the amount of calculation and increase the speed of calculating the energy envelope, the energy envelope can be calculated after sampling and extracting the voice signal of the template channel. Therefore, for the above steps, calculating the energy envelope of the waveform segment specifically includes: sampling and extracting the waveform segment, determining a first set number of sampling points, sliding a selected sliding window in the waveform segment according to a set mode, and calculating an energy vector of the waveform segment according to each sampling point contained in the selected sliding window in the sliding process to be used as an energy envelope of the waveform segment.
Further, sliding a selected sliding window in the waveform segment according to a set mode, and calculating an energy vector of the waveform segment according to each sampling point contained in the selected sliding window in the sliding process, specifically comprising: and sliding the selected sliding window m times in the waveform segment according to a set sliding step length to generate an m-dimensional energy vector of the waveform segment, wherein the value of the ith dimension in the m-dimensional energy vector is the average energy of each sampling point contained in the selected sliding window after the selected sliding window is slid for the ith time, m and i are positive integers, and i is less than or equal to m.
For example, as shown in fig. 2, the abscissa axis is the time axis and the ordinate axis is the y axis, which may be used to represent the volume. Assuming that the sliding window slides m times in the intercepted waveform segment, the generated m-dimensional energy vector is denoted as [x1, x2, x3, ..., xm]. Then x1 is the average energy of the sampling points contained in the sliding window after the 1st slide, x2 is the average energy of the sampling points contained in the sliding window after the 2nd slide, x3 is the average energy of the sampling points contained in the sliding window after the 3rd slide (the sliding window after the 3rd slide is omitted from fig. 2 and not shown), and so on.
In addition, in order to speed up the calculation of the average energy of the sampling points included in the sliding window, after the sampling points are determined, the following preprocessing may be performed on them: calculating the average of the sampling values of every several (e.g., 16 or 8) consecutive sampling points, and re-determining the calculated average sampling value as the sampling value of each of those sampling points. The sampling value of a sampling point may be its value on the y axis, and the energy of a sampling point is equal to the square of its sampling value. Based on this preprocessing, the energy vector of the waveform segment is subsequently calculated through the sliding window. In this way, the calculation speed of the average energy can be increased, and high-frequency random disturbance in the voice signal can be removed.
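As a minimal illustration, the preprocessing described above can be sketched in Python (the function name and the NumPy representation of the sampling values are illustrative assumptions, not part of the embodiment):

```python
import numpy as np

def preprocess_samples(samples, block=16):
    """Average the sampling values of every `block` consecutive sampling
    points and re-determine that average as the sampling value of each of
    them, removing high-frequency random disturbance.  The block size of
    16 (or 8) follows the text above; a trailing shorter block is simply
    averaged over the points it contains."""
    samples = np.asarray(samples, dtype=float)
    out = np.empty_like(samples)
    for start in range(0, len(samples), block):
        out[start:start + block] = samples[start:start + block].mean()
    return out
```

The energy of each preprocessed sampling point would then be the square of its (re-determined) sampling value.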
In practical application, the variable parameters may each be given appropriate values before use. For example, fig. 3 shows a process of selecting a group of values, assigning them to the variable parameters, and then generating an energy envelope template according to the voice signal of the template channel, where the template channel is channel 1 and the other channels are channel 2, channel 3, and channel 4. The process may specifically comprise the following steps:
s301: a waveform segment of 5 seconds in length is truncated starting at the 10 th second of the channel 1 speech signal, where it is assumed that the estimated maximum offset value is not greater than 10 seconds.
S302: sampling and extracting the intercepted waveform segment at a sampling time interval of 1 millisecond to determine 5000 sampling points.
S303: starting from the 1st sampling point, calculating the average of the sampling values of every 16 consecutive sampling points, and re-determining the calculated average sampling value as the sampling value of those 16 sampling points, until the sampling values of all the sampling points have been re-determined.
By executing S303, the purpose of preprocessing each sampling point is achieved. For convenience of description, each sampling point whose sampling value has been re-determined is hereinafter referred to as a preprocessed sampling point.
S304: sliding a sliding window with the length of 32 milliseconds for 313 times in sequence in the sampled waveform segment according to a sliding step of 16 milliseconds to generate a 313-dimensional energy vector, wherein the value of the ith dimension in the 313-dimensional energy vector is the average energy of each preprocessed sampling point contained in the sliding window after the sliding window slides for the ith time, i is a positive integer, and i is less than or equal to 313.
S305: the generated 313-dimensional energy vector is taken as an energy envelope of the truncated waveform segment, i.e., an energy envelope template.
In multiple actual tests based on these parameter values, the synchronization accuracy of the multi-channel voice signals reached 100%; the theoretical error of aligning the voice signals during synchronization is 16 milliseconds, and the measured error is about 100 milliseconds.
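Steps S301 to S305 can be sketched as follows, assuming the voice signal is available as one sampling value per millisecond (so the 1 ms sampling of S302 is already done), and assuming the trailing partial windows are kept so that the number of window positions comes out to ceil(5000/16) = 313, matching the stated dimension; all names are illustrative:

```python
import numpy as np

def energy_envelope_template(signal_1khz):
    """Sketch of steps S301-S305 with the parameter values above."""
    # S301/S302: a 5-second segment starting at the 10th second
    # -> 5000 sampling points at 1 sample per millisecond
    seg = np.array(signal_1khz[10 * 1000:15 * 1000], dtype=float)
    # S303: re-determine every 16 consecutive sampling values as their average
    for i in range(0, len(seg), 16):
        seg[i:i + 16] = seg[i:i + 16].mean()
    # energy of a sampling point = square of its sampling value
    energy = seg ** 2
    # S304: slide a 32 ms window in 16 ms steps; trailing partial windows
    # are kept, yielding 313 dimensions (one per window start)
    vector = np.array([energy[t:t + 32].mean() for t in range(0, len(seg), 16)])
    return vector  # S305: the 313-dimensional energy envelope template
```

For a 20-second input, the returned vector has exactly 313 dimensions, in line with S304.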
In this embodiment of the present invention, for the step S102, an offset value between the speech signal of the other channel and the speech signal of the template channel may be determined for each speech signal of the other channel, as shown in fig. 4, specifically including the following steps:
S401: sequentially intercepting, from the start of the voice signal of the other channel, a second set number of waveform segments having the same length as the waveform segment intercepted from the voice signal of the template channel, by using the method used for intercepting the waveform segment from the voice signal of the template channel.
S402: sampling and extracting the second set number of waveform segments respectively, by using the sampling and extracting method and the energy envelope calculating method applied to the waveform segment of the template channel, and calculating the corresponding energy envelopes.
S403: determining, among the second set number of waveform segments, the waveform segment whose corresponding energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
In the embodiment of the present invention, the m-dimensional energy vector corresponding to the waveform segment intercepted from the speech signal of the template channel may be denoted as [x1, x2, ..., xm], and among the second set number of waveform segments, the m-dimensional energy vector corresponding to the nth waveform segment may be denoted as [yn1, yn2, ..., ynm], where n is a positive integer not greater than the second set number;

the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] is calculated, where kn is the energy gain coefficient;

and the waveform segment corresponding to the calculated minimum distance is determined as the waveform segment whose corresponding energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
In practical applications, the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] may be measured in various ways, including but not limited to: measurement based on the mean square error, the Euclidean distance, and the like.
For example, when the distance is measured based on the mean square error, calculating the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] may specifically include: using the formula

MSEn = (1/m) × Σ (j = 1 to m) (ynj − kn × xj)²

to calculate the mean square error between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] as the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm].
The energy gain coefficient kn is introduced because, even for corresponding waveform segments in the multi-channel speech signals, the volumes of the waveform segments may differ significantly from each other, so that the amplitudes of their energy envelopes differ significantly and the matching degree could otherwise be rather low. To solve this problem, with the amplitude of the energy envelope of the speech signal of the template channel as a reference, kn may be used to adjust the amplitude of the energy envelope of the speech signal of each other channel to a level substantially consistent with that of the template channel, so that the corresponding waveform segment can be determined more reliably on the speech signal of each other channel according to the energy envelope of the speech signal of the template channel.
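The gain-compensated mean square error can be sketched as follows. The text does not fix how kn is obtained, so this sketch uses the least-squares gain that minimizes the error, which is an assumption rather than part of the embodiment:

```python
import numpy as np

def gain_mse_distance(y, x):
    """Mean square error between y = [yn1, ..., ynm] and kn * [x1, ..., xm].

    kn is chosen here as the least-squares gain (an assumption; the
    embodiment only requires some energy gain coefficient kn).
    Returns (distance, kn)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    k = y.dot(x) / x.dot(x)          # assumed least-squares energy gain kn
    mse = np.mean((y - k * x) ** 2)  # the distance used for matching
    return mse, k
```

With this choice of kn, a candidate envelope that is simply a louder or quieter copy of the template envelope yields a distance near zero, which is exactly the amplitude-compensation effect described above.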
S404: determining the difference on the time axis between the waveform segment with the best-matched energy envelope and the waveform segment intercepted from the voice signal of the template channel, as the offset value between the voice signal of the other channel and the voice signal of the template channel.
Wherein the difference is the difference on the time axis between the start of the waveform segment with the best-matched energy envelope and the start of the waveform segment intercepted from the voice signal of the template channel. For example, fig. 5 illustrates the offset between the speech signal of one other channel and the speech signal of the template channel in fig. 2. It can be seen that, assuming that, on the speech signal of the other channel, the start of the waveform segment with the best-matched energy envelope on the time axis is t, and the start of the waveform segment intercepted from the speech signal of the template channel on the time axis is γ, the offset value τ between the speech signal of the other channel and the speech signal of the template channel is: τ = t − γ.
In the embodiment of the present invention, obviously, a channel corresponding to a larger offset value started collecting earlier, and the channel corresponding to the smallest offset value started latest. Therefore, compared with the channel corresponding to the minimum offset value, the voice signals of all the other channels should be clipped from the beginning so as to be aligned with the voice signal of the channel corresponding to the minimum offset value.
According to the above analysis, for step S103, synchronizing the speech signals of the other channels with the speech signal of the template channel according to the offset values specifically includes: determining the smallest offset value among the offset values corresponding to the voice signals of the other channels, and performing the following operation for the voice signal of each other channel: clipping, from the beginning of the voice signal, a waveform segment whose length is the difference between the offset value corresponding to that voice signal and the minimum offset value, and aligning the clipped voice signal with the voice signal corresponding to the minimum offset value.
Of course, in addition to aligning the multi-channel speech signals with the channel corresponding to the minimum offset value as the reference, the multi-channel speech signals may also be aligned with the speech signal of any other channel as the reference. For example, based on the speech signal of the template channel and the offset value corresponding to the speech signal of each other channel, the speech signal of each other channel may be shifted on the time axis by a distance equal to its offset value from the speech signal of the template channel, so that the speech signals of the other channels can be aligned with the speech signal of the template channel. When the offset value is positive, the shift is leftward; when the offset value is negative, the shift is rightward; and when the offset value is 0, no shift is needed.
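The clipping-based alignment of step S103 can be sketched as follows. The channel names and the unit (sampling points) are illustrative assumptions:

```python
def synchronize_by_clipping(signals, offsets):
    """Clip (offset - minimum offset) sampling points from the beginning
    of each channel so that every channel aligns with the channel holding
    the minimum offset value.

    signals: dict mapping a channel name to its list of sampling values
    offsets: dict mapping the same channel name to its offset value,
             expressed in sampling points relative to the template channel
    """
    min_offset = min(offsets.values())
    # the channel with the minimum offset started latest and is kept whole;
    # every other channel loses its extra leading samples
    return {ch: sig[offsets[ch] - min_offset:] for ch, sig in signals.items()}
```

For instance, a channel whose offset exceeds the minimum by 2 sampling points loses its first 2 samples, after which the remaining signals line up on the time axis.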
In practical application, the multi-channel voice signals can be processed in parallel, with synchronization performed after all the offset values have been determined. Fig. 6 is a simplified process diagram of the parallel processing and synchronization of multi-channel voice signals according to the above description, in which the topmost channel is the template channel and below it are the 3 other channels. When matching is performed on the voice signal of each other channel, the energy envelope of each waveform segment intercepted from that channel may be matched in sequence with the energy envelope of the waveform segment intercepted from the voice signal of the template channel (each intercepted waveform segment may be matched immediately after interception, or multiple waveform segments may be intercepted first and then matched respectively; the former method is adopted in fig. 6). This matching scan generates a mean square error sequence, and the waveform segment corresponding to the minimum mean square error in the sequence can be determined as the waveform segment in the other channel that corresponds to the waveform segment intercepted from the voice signal of the template channel. The offset value between the voice signal of the other channel and the voice signal of the template channel may then be determined, and synchronization may be performed based on the offset value.
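The matching scan and its parallel execution across the other channels can be sketched as follows. Thread-based parallelism and the candidate-envelope data structure are illustrative assumptions, and the gain adjustment by kn is omitted for brevity:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def best_match_index(candidate_envelopes, template_envelope):
    """Matching scan for one channel: compute the mean square error of
    every candidate energy envelope against the template envelope and
    return the index of the minimum of the resulting error sequence."""
    errors = [np.mean((np.asarray(c) - template_envelope) ** 2)
              for c in candidate_envelopes]
    return int(np.argmin(errors))

def scan_channels_in_parallel(channel_envelopes, template_envelope):
    """Run the matching scan for several other channels in parallel.

    channel_envelopes: dict mapping a channel name to that channel's list
    of candidate energy envelopes (one per intercepted waveform segment).
    Returns, per channel, the index of the best-matching segment, from
    which the offset value can be derived."""
    with ThreadPoolExecutor() as pool:
        futures = {ch: pool.submit(best_match_index, cands, template_envelope)
                   for ch, cands in channel_envelopes.items()}
        return {ch: f.result() for ch, f in futures.items()}
```

The returned index locates the start t of the best-matched segment on each other channel, from which τ = t − γ follows as described above.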
Based on the same idea as the multi-channel speech signal synchronization method provided in the embodiment of the present invention, an embodiment of the present invention further provides a corresponding multi-channel speech signal synchronization apparatus, as shown in fig. 7.
Fig. 7 is a schematic structural diagram of a multi-channel speech signal synchronization apparatus provided in an embodiment of the present invention, which specifically includes:
a generating module 701, configured to select a channel as a template channel, and generate a corresponding speech signal energy envelope template;
a determining module 702, configured to match the speech signal of each other channel with the energy envelope template, so as to determine an offset value between the speech signal of each other channel and the speech signal of the template channel;
a synchronization module 703, configured to synchronize the voice signals of the other channels with the voice signal of the template channel according to the offset value.
The generating module 701 is specifically configured to intercept a waveform segment from the voice signal of the template channel, sample and extract the waveform segment, determine a first set number of sampling points, slide a selected sliding window in the waveform segment according to a set manner, and calculate an energy vector of the waveform segment according to each sampling point included in the selected sliding window in the sliding process, where the energy vector is used as a corresponding generated voice signal energy envelope template.
The determining module 702 is specifically configured to sequentially intercept, from the start of the voice signal of the other channel, a second set number of waveform segments having the same length as the waveform segments intercepted from the voice signal of the template channel by using a method used for intercepting the waveform segments from the voice signal of the template channel;
sampling and extracting the waveform segments of the second set number by adopting a sampling and extracting method and an energy envelope calculating method of the waveform segments of the template channel respectively, and calculating corresponding energy envelopes;
determining a waveform segment with the corresponding energy envelope which is most matched with the energy envelope of the waveform segment intercepted from the voice signal of the template channel in the waveform segments with the second set number;
and determining the difference value of the waveform segment with the most matched energy envelope and the waveform segment cut from the voice signal of the template channel on a time axis as the offset value between the voice signals of the other channels and the voice signal of the template channel.
The determining module 702 is specifically configured to slide a selected sliding window m times in the waveform segment according to a set sliding step length, and generate an m-dimensional energy vector of the waveform segment, where an ith dimension of the m-dimensional energy vector is an average energy of each sampling point included in the selected sliding window after the selected sliding window is slid for the ith time, m and i are positive integers, and i is less than or equal to m.
The determining module 702 is specifically configured to: denote the m-dimensional energy vector corresponding to the waveform segment intercepted from the speech signal of the template channel as [x1, x2, ..., xm], and denote, among the second set number of waveform segments, the m-dimensional energy vector corresponding to the nth waveform segment as [yn1, yn2, ..., ynm], where n is a positive integer not greater than the second set number;

calculate the distance between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm], where kn is the energy gain coefficient;

and determine the waveform segment corresponding to the calculated minimum distance as the waveform segment whose corresponding energy envelope best matches the energy envelope of the waveform segment intercepted from the voice signal of the template channel.
The determining module 702 is specifically configured to use the formula

MSEn = (1/m) × Σ (j = 1 to m) (ynj − kn × xj)²

to calculate the mean square error between [yn1, yn2, ..., ynm] and kn × [x1, x2, ..., xm] as the distance between them.
The synchronization module 703 is specifically configured to determine a minimum offset value among offset values corresponding to the voice signals of the other channels, and execute the following operations for the voice signal of each of the other channels: and cutting out a waveform segment with the length of the difference between the offset value corresponding to the voice signal and the minimum offset value from the beginning of the voice signal, and aligning the voice signal corresponding to the cut voice signal and the minimum offset value.
The apparatus shown in fig. 7 and described above may specifically be located on any device that can be used to process speech signals.
In the embodiment of the present invention, the related functional modules may be implemented by a hardware processor (hardware processor).
The multi-channel voice signal synchronization method and apparatus provided by the embodiments of the present invention match the energy envelope of the waveform segments intercepted from each channel with the energy envelope template generated from the waveform segment intercepted from the template channel, determine the offset value between the voice signal of each channel and that of the template channel, and synchronize the multi-channel voice signals according to the offset values, thereby saving manpower and improving efficiency. This solves the problems in the prior art that synchronizing multi-channel voice signals by manual adjustment wastes human resources and is inefficient.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.