CN108021675B - Automatic segmentation and alignment method for multi-equipment recording - Google Patents


Info

Publication number
CN108021675B
CN108021675B (application CN201711284222.0A)
Authority
CN
China
Prior art keywords
recording
time
long
recordings
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711284222.0A
Other languages
Chinese (zh)
Other versions
CN108021675A (en)
Inventor
吴妍
郑羲光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huiting Technology Corp
Original Assignee
Beijing Huiting Technology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huiting Technology Corp filed Critical Beijing Huiting Technology Corp
Priority to CN201711284222.0A priority Critical patent/CN108021675B/en
Publication of CN108021675A publication Critical patent/CN108021675A/en
Application granted granted Critical
Publication of CN108021675B publication Critical patent/CN108021675B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61: Indexing; Data structures therefor; Storage structures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content

Abstract

The invention discloses an automatic segmentation and alignment method for multi-equipment sound recording, which comprises the following steps: processing a plurality of original recordings in different forms into a plurality of long-time recordings in the same format; associating the long-time recordings that contain the same content; and aligning each associated long-time recording with the short-time reference recordings, then cutting the long-time recordings into short-time recordings corresponding to the short-time reference recordings. The invention solves the problem of complex data processing when building a voice recognition database from multi-device recordings.

Description

Automatic segmentation and alignment method for multi-equipment recording
Technical Field
The invention relates to the technical field of voice recognition database manufacturing, in particular to an automatic segmentation and alignment method for multi-device recording.
Background
In the voice recognition database manufacturing process, collecting recordings with multiple devices simultaneously can greatly improve the efficiency and diversity of recording. For example, simultaneously collecting the signals of a head-mounted microphone, a mobile phone and a microphone array ensures channel diversity, which in turn improves the practicability of the recognition database and makes it usable in applications such as far-field recognition, wake-up and noise reduction. Because near-talk and far-talk data exist for the same moments in time, the performance of far-field recognition, wake-up and noise-reduction algorithms can also be conveniently evaluated.
However, in the process of acquiring multi-device recordings, the recording devices differ and therefore cannot start recording at exactly the same time (i.e., the recording switches cannot be pressed, or recording commands sent, simultaneously). In addition, frame loss on some recording devices and misoperation during recording bring certain challenges to the post-processing of voice recognition data.
Disclosure of Invention
Aiming at the technical defects in the prior art, the invention provides an automatic segmentation and alignment method of multi-equipment sound records for manufacturing a voice recognition database. The method automatically aligns the associated recordings among a plurality of target recordings, each against a short-time reference recording, and then segments them into corresponding short-time recordings that are stored in the voice recognition database, thereby converting different original recordings into short-time recordings usable by a voice recognition system.
The technical scheme adopted for realizing the purpose of the invention is as follows:
an automatic segmentation and alignment method for multi-device sound recording comprises the following steps:
correspondingly processing a plurality of original recordings in different forms into a plurality of long-time recordings in the same format;
associating the same long-term recordings included in the plurality of long-term recordings;
and respectively aligning the associated long-time records by using the short-time reference records, and then cutting the long-time records into the short-time records corresponding to the short-time reference records.
In the invention, a long-time recording refers to all audio continuously acquired by the different recording devices from the recording start time to the recording end time, including both valid and invalid recording; a short-time recording refers to a valid recording cut out of a long-time recording.
In the invention, the original recordings include original short-time recordings and original long-time recordings, and the long-time recordings are formed from them by the following steps:
for the original long-term recording, performing uniform format conversion after decompressing the original long-term recording, and resampling the original long-term recording according to a uniform sampling rate, thereby forming the long-term recording;
and for the original short-time recording, performing unified format conversion after decompressing the original short-time recording, resampling the original short-time recording according to a unified sampling rate, and splicing the original short-time recording into the long-time recording according to the timestamp.
The alignment of the plurality of associated long-time recordings with the short-time reference recordings may be implemented by searching each of the associated long-time recordings for the short-time reference recordings.
Further, the short-time reference recordings are used to align the plurality of associated long-time recordings respectively, and the following method can be adopted:
respectively intercepting the head and tail sections of the associated long-time recording and the short-time reference recording, and calculating the recording offset of the associated long-time recording and short-time reference recording at the starting stage and the ending stage of the recording;
and acquiring the position of the short-time reference recording in the associated long-time recording according to the recording offset, and then cutting out the corresponding short recording in the associated long-time recording by using the short-time reference recording.
Specifically, the recording offset may be calculated on the original time domain signal, or on the time domain signal after noise reduction, or on the domain of the signal characteristics.
The short-time reference recording can be formed by cutting a long-time reference recording recorded by a reference recording device or a short-time recording directly recorded by the reference recording device.
The segmentation of the long-time reference record recorded by the reference recording equipment is performed by using voice activity detection information.
In the invention, the same long-time recordings in the plurality of long-time recordings are associated by reading the content of the long-time recordings and calculating the correlation degree of the content of the plurality of long-time recordings.
The correlation includes the time-domain correlation of the recordings or the correlation of audio feature sequences.
According to the automatic segmentation and alignment method for multi-equipment sound recording, after the original recording formats of a plurality of different recording devices are unified, the target recording files are automatically associated; the target recordings are then aligned with the short-time reference recordings and segmented. Original recordings in different formats produced by multiple recording devices can thus be automatically converted into the short-time recordings used by a voice recognition system, solving the problem of complex data processing when building a multi-device voice recognition database.
Drawings
FIG. 1 is a process flow diagram of an automatic segmentation alignment method for multi-device audio recordings;
FIG. 2 is a flow diagram illustrating format unification processing of original audio records.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1-2, an automatic splitting and aligning method for multi-device sound recording includes the steps of:
correspondingly processing a plurality of original recordings in different forms into a plurality of long-time recordings in the same format;
associating the same long-term recordings included in the plurality of long-term recordings;
and respectively aligning the associated long-time records by using the short-time reference records, and then cutting the long-time records into the short-time records corresponding to the short-time reference records.
The long-time recordings are cut into short-time recordings corresponding to the short-time reference recordings, and these short-time recordings are stored in a voice recognition database for recognition, so that different original recordings are converted into short-time recordings that can be used by a voice recognition system.
As shown in fig. 1, a plurality of original recordings in different forms are input by different recording devices (recording device 1, recording device 2, ...). A format unification step processes the original recordings in different forms into a plurality of long-time recordings in the same format; the recording files containing the same content among these long-time recordings are then associated; the associated long-time recordings are aligned with the short-time reference recordings and segmented; finally, the resulting recordings of recording device 1, recording device 2, ... are output to the voice recognition database for storage.
The original recordings come from different recording devices, such as a head-mounted microphone, a mobile phone and a microphone array. Because the formats of the recordings acquired by these devices may be inconsistent, the invention first unifies them to facilitate the subsequent segmentation processing.
Because of device differences, the original recordings collected may be either original short-time recordings or original long-time recordings; the corresponding long-time recordings are therefore formed by the following steps:
for the original long-term recording, performing unified format conversion after decompressing (and decrypting) the original long-term recording, and resampling the original long-term recording according to a unified sampling rate, thereby forming the long-term recording;
and for the original short-time recording, performing unified format conversion after decompressing (and decrypting) the original short-time recording, resampling the original short-time recording according to a unified sampling rate, and splicing the original short-time recording into the long-time recording according to the timestamp information.
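As a minimal sketch of the resampling part of this step, the following Python brings a signal to the unified sampling rate by linear interpolation. This is an illustrative simplification: a production pipeline would use an anti-aliased polyphase resampler, and the decompression, decryption and format conversion are assumed to have happened beforehand.

```python
import numpy as np

def resample(signal, sr_in, sr_out):
    """Resample a 1-D signal from sr_in to sr_out by linear interpolation.

    Illustrative only: a real pipeline would use an anti-aliased
    polyphase resampler instead of np.interp.
    """
    n_out = int(round(len(signal) * sr_out / sr_in))
    t_in = np.arange(len(signal)) / sr_in    # original sample instants
    t_out = np.arange(n_out) / sr_out        # instants at the unified rate
    return np.interp(t_out, t_in, signal)

# one second of 8 kHz audio brought to a unified 16 kHz rate
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
y = resample(x, 8000, 16000)
print(len(y))  # 16000
```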
The splicing of the original short-time recordings may specifically be as follows: let Sk be the k-th original short-time recording (1 ≤ k ≤ K, K a natural number) with corresponding timestamp tk = [tk_start, tk_end]. The long-time recording s(t) spliced according to the timestamps is then
s(t) = Sk(t), if t ∈ [tk_start, tk_end] for some k; s(t) = 0, otherwise,
where Sk(t) is the k-th original short-time recording at time t, and tk_start and tk_end are the start time and end time of the timestamp corresponding to Sk.
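The splicing rule above can be sketched as follows. The (t_start, t_end, samples) tuple format is an assumption for illustration, standing in for the timestamps carried by the original short-time recordings; samples outside every timestamp interval are left as silence.

```python
import numpy as np

def splice(short_recordings, sample_rate):
    """Splice short recordings into one long recording s(t).

    `short_recordings` is a list of (t_start, t_end, samples) tuples with
    times in seconds. Samples outside every [t_start, t_end] stay zero.
    """
    total = max(t_end for _, t_end, _ in short_recordings)
    s = np.zeros(int(round(total * sample_rate)))
    for t_start, t_end, samples in short_recordings:
        i = int(round(t_start * sample_rate))
        s[i:i + len(samples)] = samples    # s(t) = Sk(t) on [t_start, t_end]
    return s

sr = 1000
clips = [(0.0, 0.5, np.ones(500)), (1.0, 1.5, np.ones(500))]
long_rec = splice(clips, sr)
print(len(long_rec), long_rec[250], long_rec[750])  # 1500 1.0 0.0
```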
The short-time reference recording is provided by a recording reference device. The reference device may be chosen as the device whose recording files have the highest signal-to-noise ratio, or it may be chosen according to the requirements of the actual recording project.
Forming the long-time recordings with a unified file format and sampling rate facilitates the subsequent processing.
In the invention, the long-time recording refers to all audio continuously acquired by a recording device from the recording start time to the recording end time, including valid and invalid recording; since the start and/or end times of the individual recording devices are not necessarily the same, and re-recording, pausing and similar events may occur while capturing audio, all of these are included in the long-time recording.
The short-time recording refers to an effective recording cut from the long-time recording according to a cutting rule, and is usually a complete sentence or paragraph.
Because the start and stop times of different recording devices are different and some recording devices may have frame loss and pause in the recording process, when the recordings of other recording devices are divided, the short-time reference recording and the target long-time recording (i.e., the associated same long-time recording) need to be aligned first.
This can be realized by searching each of the associated long-time recordings for the short-time reference recordings; however, this approach must search for every short recorded sentence, has a large search range, and easily causes alignment errors.
Further, the short-time reference recordings are used to align the plurality of associated long-time recordings respectively, and the following method can be adopted:
respectively intercepting the head and tail sections of the associated long-time recording and the short-time reference recording, and calculating the recording offset of the associated long-time recording and short-time reference recording at the starting stage and the ending stage of the recording;
and acquiring the position of the short-time reference recording in the associated long-time recording according to the recording offset, and then cutting out the corresponding short recording in the associated long-time recording by using the short-time reference recording.
An improved method calculates the cross-correlation coefficient between corresponding signals intercepted at the beginning and ending stages of the target long-time recording and the reference long-time recording; it improves alignment accuracy while reducing the search range. The steps are as follows:
Step 1: intercept the head and tail sections of the target long-time recording S1 and the reference long-time recording S2, and calculate the recording offsets D1 and D2 of the target and reference recordings at the beginning stage and the end stage of the recording, respectively. The offset is an offset in time: for example, because the devices capturing S1 and S2 did not press the recording switch at the same moment, there may be a difference of D seconds between S1 and S2, and the recording offset is then D seconds. If the target long-time recording S1 and the reference long-time recording S2 both have length N and no time deviation occurs between them, the cross-correlation coefficient between the two signals has its maximum value at position N + 1; otherwise, D equals the position of the maximum of the cross-correlation coefficient minus (N + 1), where D is the recording offset.
If the head and tail offsets are equal (D1 = D2), the recording device worked well: the recording at time t1 on the reference device lies at position t1 + D on the target device, and the method proceeds directly to step 3. Otherwise, frame loss, pauses or similar phenomena occurred during recording, and the method proceeds to step 2.
Step 2: using the head and tail offsets D1 and D2, for a short recording on the reference device starting at time t1 and ending at time t2, search for the corresponding recording within the range [D1 + t1 - delta, D2 + t2 + delta] of the target long-time recording, thereby obtaining the position of the short recording on the target device, and proceed to step 3. Here delta is an extended search duration (e.g., 1 second).
Step 3: cut out the corresponding short recording from the target long-time recording according to the position of the short-time reference recording in the target long-time recording.
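The offset computation of step 1 can be sketched with NumPy's cross-correlation; the signal names and the synthetic 7-sample delay are illustrative. Note that the description's "maximum at N + 1" uses 1-based indexing; with 0-based arrays the aligned peak sits at index N - 1.

```python
import numpy as np

def recording_offset(target, reference):
    """Recording offset D (in samples) of `target` relative to `reference`.

    For equal-length signals, the full cross-correlation has length
    2N - 1 with its peak at index N - 1 (0-based) when no time deviation
    exists; the peak's shift from that centre is the offset D of step 1.
    """
    n = len(reference)
    xcorr = np.correlate(target, reference, mode="full")
    return int(np.argmax(xcorr)) - (n - 1)

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)                   # reference device head section
tgt = np.concatenate([np.zeros(7), ref])[:1000]   # target device started 7 samples late
print(recording_offset(tgt, ref))  # 7
```

Computing this on the head and tail sections yields D1 and D2; step 2 then restricts the search for each short recording to [D1 + t1 - delta, D2 + t2 + delta].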
Specifically, the recording offset may be calculated on the original time domain signal, or on the time domain signal after noise reduction, or on the domain of the signal characteristics.
Wherein the short-time reference recording may be a short-time recording directly recorded by a reference recording device.
The original short-time recording can be directly used as the short-time reference recording to carry out alignment segmentation processing on the target long-time recording to be processed.
The short-time reference recordings can also be formed by segmenting a long-time reference recording recorded by the reference recording device; in that case, the segmentation can use voice activity detection information.
Segmentation with Voice Activity Detection (VAD) information: for a long-time original recording file, the VAD information of the voice signal can be analyzed and the long-time recording divided into short sentences according to a predefined criterion, for example the pause duration of the voice signal; generally, the pause at the end of a sentence is obviously longer than the pauses within it. The VAD information can thus be used to segment according to the pause length between two stretches for which the VAD decision is true: if a continuous pause exceeds, say, 2 seconds, a cut is made at that pause. When a conversation database is recorded, the energy of the head-mounted microphones of the two parties can also be combined to improve segmentation precision.
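A minimal sketch of the pause-based segmentation follows. The frame-energy threshold stands in for a real VAD decision (an assumption, since the text only specifies that VAD information and a roughly 2-second pause rule are used); the returned segments are (start, end) sample indices.

```python
import numpy as np

def split_on_pauses(signal, sr, frame_ms=20, energy_thresh=0.01, min_pause_s=2.0):
    """Split a long recording into sentence segments at long pauses.

    Frame-energy thresholding is a crude stand-in for a real VAD; a cut
    is made wherever the decision stays false for at least min_pause_s.
    """
    frame = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame
    energy = np.array([np.mean(signal[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    voiced = energy > energy_thresh                  # per-frame VAD decision
    min_pause = int(min_pause_s * 1000 / frame_ms)   # pause length in frames

    segments, start, pause = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            pause = 0
        elif start is not None:
            pause += 1
            if pause >= min_pause:                   # long pause: close the sentence
                segments.append((start * frame, (i - pause + 1) * frame))
                start, pause = None, 0
    if start is not None:
        segments.append((start * frame, n_frames * frame))
    return segments

# 1 s of "speech", a 3 s pause, then 1 s of "speech" at 1 kHz
sig = np.concatenate([np.ones(1000), np.zeros(3000), np.ones(1000)])
print(split_on_pauses(sig, 1000))  # [(0, 1000), (4000, 5000)]
```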
During recording acquisition it is often necessary to process the recordings of multiple persons (sessions) simultaneously. In the multi-device recording process it is therefore necessary to associate the recordings of different recording devices, that is, to find the files that correspond to a given person (session) on the different devices; in other words, to associate the same long-time recordings included in the plurality of long-time recordings.
As described above, the same long-time recordings included in the plurality of long-time recordings may be associated, for example, according to information such as file names, recording durations and file sizes. Association can also be realized by reading the contents of the long-time recordings and calculating the correlation between them.
Association can be carried out by calculating the correlation between the contents of the recording files that have been read in. Suppose there are N recording devices, each having M recordings; a device may still hold multiple files after short-time recording splicing, because it may have participated in the recordings of several persons, with the files stored on the same storage device. Taking the reference recording as the baseline, the correlation between all files of the target recording and all files of the reference recording is calculated, giving an M × M recording correlation matrix T.
Consider two recording devices n1 and n2 (1 ≤ n1 ≤ N, 1 ≤ n2 ≤ N, n1 ≠ n2) and two of their recordings s1 and s2, numbered m1 and m2 (1 ≤ m1 ≤ M, 1 ≤ m2 ≤ M). Their correlation coefficient ρ12 is
ρ12 = E[(s1 - μ1)(s2 - μ2)] / (σ1 σ2),
where μi = E[si], σi = sqrt(E[(si - μi)²]) and E[·] denotes expectation. The correlation matrix T of the two recording devices n1 and n2 is then
T = [ρ(m1, m2)], with 1 ≤ m1 ≤ M and 1 ≤ m2 ≤ M.
Based on the correlation matrix T and a chosen selection criterion (such as maximizing the total correlation), the one-to-one correspondence between the target recording files and the reference recording files is obtained: each reference file m1 is associated with the target file m2 whose correlation T(m1, m2) is highest.
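The association step can be sketched as follows, using the Pearson correlation coefficient (np.corrcoef) between whole files and the simple highest-correlation selection criterion; the shuffled, noisy target files are synthetic illustration data.

```python
import numpy as np

def associate(reference_files, target_files):
    """Match each reference recording to a target recording.

    Builds the M x M correlation matrix T of Pearson correlation
    coefficients and pairs each reference file with its
    highest-correlation target file (a simple selection criterion).
    """
    M = len(reference_files)
    T = np.zeros((M, M))
    for i, r in enumerate(reference_files):
        for j, t in enumerate(target_files):
            n = min(len(r), len(t))
            T[i, j] = np.corrcoef(r[:n], t[:n])[0, 1]   # rho(m1, m2)
    return T.argmax(axis=1)        # best target index for each reference file

rng = np.random.default_rng(1)
refs = [rng.standard_normal(2000) for _ in range(3)]
# the target device recorded the same sessions, shuffled and with noise
targets = [refs[2] + 0.1 * rng.standard_normal(2000),
           refs[0] + 0.1 * rng.standard_normal(2000),
           refs[1] + 0.1 * rng.standard_normal(2000)]
print(associate(refs, targets))  # [1 2 0]
```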
The correlation degree can be the time domain correlation degree of the sound recording or the correlation degree of the audio characteristic sequence.
The method described above has the advantage of being directly applicable to all devices. In a practical system, computational complexity can be reduced by simplifying the correlation computation (e.g., by subsampling when computing the time-domain correlation).
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (9)

1. The method for automatically segmenting and aligning the multi-device sound records is characterized by comprising the following steps of:
correspondingly processing a plurality of original recordings in different forms into a plurality of long-time recordings in the same format;
associating the same long-term recordings included in the plurality of long-term recordings;
respectively aligning the associated long-time recordings by using short-time reference recordings, and then cutting the long-time recordings into short-time recordings corresponding to the short-time reference recordings;
aligning a plurality of associated long-time recordings respectively using a short-time reference recording, comprising the steps of:
respectively intercepting the head and tail sections of the associated long-time recording and the short-time reference recording, and calculating the recording offset of the associated long-time recording and short-time reference recording at the starting stage and the ending stage of the recording;
and acquiring the position of the short-time reference recording in the associated long-time recording according to the recording offset, and then cutting out the corresponding short recording in the associated long-time recording by using the short-time reference recording.
2. The method for automatically segmenting and aligning the multi-device recording according to claim 1, wherein the long-time recording refers to all recordings that are continuously acquired by different recording devices from the recording start time to the recording end time, and includes valid recordings and invalid recordings; the short-time recording refers to an effective recording cut out from the long-time recording.
3. The method for automatically segmenting and aligning multi-device recordings according to claim 1, wherein the original recordings include an original short-time recording and an original long-time recording, the long-time recording being formed by the following steps:
for the original long-term recording, performing uniform format conversion after decompressing the original long-term recording, and resampling the original long-term recording according to a uniform sampling rate, thereby forming the long-term recording;
and for the original short-time recording, performing unified format conversion after decompressing the original short-time recording, resampling the original short-time recording according to a unified sampling rate, and splicing the original short-time recording into the long-time recording according to the timestamp.
4. The method for automatically segmenting and aligning multiple device records according to claim 1, wherein the short-time reference records are used to align the multiple associated long-time records respectively by searching the multiple associated long-time records for the short-time reference records respectively.
5. The method of claim 1, wherein the recording offset is calculated in the original time domain signal, in the noise-reduced time domain signal, or in the signal feature domain.
6. The method for automatically slicing and aligning multiple device recordings according to claim 1, wherein the short time reference recording is formed by slicing a long time reference recording recorded by a reference recording device or is a short time recording directly recorded by a reference recording device.
7. The method for automatically segmenting and aligning multiple device recordings according to claim 6, wherein the segmentation of the long-term reference recordings recorded by the reference recording device is performed by using voice activity detection information.
8. The method for automatically segmenting and aligning the multi-device sound recordings according to claim 1, wherein the same long-time recordings included in the plurality of long-time recordings are associated by reading the contents of the long-time recordings and calculating the correlation degree of the contents of the plurality of long-time recordings.
9. The method of claim 8, wherein the correlation comprises the time-domain correlation of the recordings or the correlation of audio feature sequences.
CN201711284222.0A 2017-12-07 2017-12-07 Automatic segmentation and alignment method for multi-equipment recording Active CN108021675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711284222.0A CN108021675B (en) 2017-12-07 2017-12-07 Automatic segmentation and alignment method for multi-equipment recording

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711284222.0A CN108021675B (en) 2017-12-07 2017-12-07 Automatic segmentation and alignment method for multi-equipment recording

Publications (2)

Publication Number Publication Date
CN108021675A CN108021675A (en) 2018-05-11
CN108021675B true CN108021675B (en) 2021-11-09

Family

ID=62078879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711284222.0A Active CN108021675B (en) 2017-12-07 2017-12-07 Automatic segmentation and alignment method for multi-equipment recording

Country Status (1)

Country Link
CN (1) CN108021675B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108769559B (en) * 2018-05-25 2020-12-01 数据堂(北京)科技股份有限公司 Multimedia file synchronization method and device
CN109166570B (en) * 2018-07-24 2019-11-26 百度在线网络技术(北京)有限公司 A kind of method, apparatus of phonetic segmentation, equipment and computer storage medium
CN109151705A (en) * 2018-08-27 2019-01-04 北京爱数智慧科技有限公司 A kind of alignment schemes and relevant device of conferencing data
CN109195048B (en) * 2018-09-03 2020-05-08 中科探索创新(北京)科技院 Distortion-free recording earphone
CN110334240B (en) * 2019-07-08 2021-10-22 联想(北京)有限公司 Information processing method and system, first device and second device
CN116758939B (en) * 2023-08-21 2023-11-03 北京希尔贝壳科技有限公司 Multi-device audio data alignment method, device and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1612205A (en) * 2003-10-29 2005-05-04 雅马哈株式会社 Audio signal processor
CN1716380A (en) * 2005-07-26 2006-01-04 浙江大学 Audio frequency splitting method for changing detection based on decision tree and speaking person
CN101075183A (en) * 2007-06-29 2007-11-21 北京中星微电子有限公司 Multi-path audio-frequency data processing system
CN102364952A (en) * 2011-10-25 2012-02-29 浙江万朋网络技术有限公司 Method for processing audio and video synchronization in simultaneous playing of a plurality of paths of audio and video
CN103354588A (en) * 2013-06-28 2013-10-16 贵阳朗玛信息技术股份有限公司 Determination method, apparatus and system for recording and playing sampling rate
CN104347096A (en) * 2013-08-09 2015-02-11 上海证大喜马拉雅网络科技有限公司 Recording system and method integrating audio cutting, continuous recording and combination
CN104700839A (en) * 2015-02-26 2015-06-10 深圳市中兴移动通信有限公司 Method and device for collecting multichannel sound, cellphone and system
CN105989846A (en) * 2015-06-12 2016-10-05 乐视致新电子科技(天津)有限公司 Multi-channel speech signal synchronization method and device
CN106504777A (en) * 2016-11-25 2017-03-15 维沃移动通信有限公司 A kind of processing method of recording data and mobile terminal
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107195316B (en) * 2017-04-28 2019-11-08 北京声智科技有限公司 Training data preparation system and method for far field speech recognition


Also Published As

Publication number Publication date
CN108021675A (en) 2018-05-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant