CN108198570B - Method and device for separating voice during interrogation - Google Patents


Info

Publication number
CN108198570B
Authority
CN
China
Prior art keywords
voice data
trial
voice
matrix
determining
Prior art date
Legal status
Active
Application number
CN201810106940.7A
Other languages
Chinese (zh)
Other versions
CN108198570A (en)
Inventor
马金龙
关海欣
Current Assignee
Unisound Intelligent Technology Co Ltd
Original Assignee
Beijing Yunzhisheng Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yunzhisheng Information Technology Co Ltd
Priority to CN201810106940.7A
Publication of CN108198570A
Application granted
Publication of CN108198570B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method and a device for separating voice during interrogation. The method comprises: acquiring first voice data collected by a first audio acquisition device and second voice data collected by a second audio acquisition device, wherein the first audio acquisition device is pointed at the interrogator and the second audio acquisition device is pointed at the interrogated person; filtering the first voice data to determine the interrogator voice data corresponding to the interrogator; and, taking the interrogator voice data as a reference signal, removing the interrogator voice data from the second voice data to determine the interrogated person's voice data in the second voice data. By using the two sets of voice data, the method effectively reduces cross-channel interference from the interrogator, so that the speech signals of the interrogator and the interrogated person are correctly separated; speech recognition can then correctly transcribe both speakers, the interrogation record can be generated automatically, interrogation efficiency is improved, and labor cost is saved.

Description

Method and device for separating voice during interrogation
Technical Field
The invention relates to the technical field of voice separation, and in particular to a method and a device for voice separation during interrogation.
Background
At present, interrogation records in judicial settings (for example, criminal interrogations) are generally produced as written transcripts, which is inefficient, requires manual work, and wastes manpower and material resources. Meanwhile, owing to the inherent constraints of interrogation scenes, the signal captured by a microphone often contains several speakers, so directly recognizing the captured speech cannot effectively distinguish the speakers; moreover, the interrogated person's voice is often too quiet or captured from far away. For these reasons, most interrogations today are still recorded by hand.
Disclosure of Invention
The invention provides a method and a device for separating voice during interrogation, to address the inefficiency of recording interrogation transcripts manually.
An embodiment of the invention provides a method for separating voice during interrogation, comprising:
acquiring first voice data collected by a first audio acquisition device and second voice data collected by a second audio acquisition device, wherein the first audio acquisition device is pointed at the interrogator and the second audio acquisition device is pointed at the interrogated person;
filtering the first voice data to determine the interrogator voice data corresponding to the interrogator;
and, taking the interrogator voice data as a reference signal, removing the interrogator voice data from the second voice data to determine the interrogated person's voice data in the second voice data.
In one possible implementation, the method further includes:
recognizing the interrogator voice data and the interrogated person's voice data respectively, and determining the corresponding interrogator text and interrogated-person text.
In one possible implementation, determining the corresponding interrogator text and interrogated-person text comprises:
determining timestamps of the interrogator voice data and the interrogated person's voice data, and adding corresponding timestamps to the interrogator text and the interrogated-person text respectively, wherein the timestamps comprise a start timestamp and an end timestamp;
and determining the overlapping part of the interrogator text and the interrogated-person text according to the timestamps, and highlighting the text corresponding to the overlapping part.
In a possible implementation, removing the interrogator voice data from the second voice data by taking it as a reference signal comprises:
determining the signal delay according to the distance between the first audio acquisition device and the interrogator and the distance between the second audio acquisition device and the interrogator;
and delaying the interrogator voice data by the signal delay, taking the delayed interrogator voice data as the reference signal, and removing the interrogator voice data from the second voice data.
In a possible implementation, removing the interrogator voice data from the second voice data by taking it as a reference signal comprises:
preprocessing the interrogator voice data and the second voice data respectively, and determining a first voice matrix G1 corresponding to the interrogator voice data and a second voice matrix G2 corresponding to the second voice data;
determining a third voice matrix G3 from the first voice matrix G1 and the second voice matrix G2, wherein G3 = G2 - λG1 and λ is a weight coefficient;
performing dimensionality reduction on the third voice matrix G3 to convert it into a discrete voice array Xs, restoring the discrete voice array Xs into continuous voice data according to a preset sampling period, and taking the restored voice data as the interrogated person's voice data;
wherein the preprocessing comprises:
performing discrete sampling on the voice data according to the preset sampling period to obtain a discrete voice array X = [x1, x2, …, xj, …, xn] represented in array form, wherein the voice data is the interrogator voice data or the second voice data, xj represents the sampling value of the voice data at the jth sampling point, and n is the total number of samples;
performing row expansion by taking the discrete voice array X as a row, and determining an m × n discrete voice matrix M after row expansion, wherein m is an odd number, the middle row of M (row (m+1)/2) is the array X itself, and the value of m_{i,j} is negatively correlated with the distance |i - (m+1)/2| from the middle row, m_{i,j} denoting the element in the ith row and jth column of the discrete voice matrix M;
determining a k × k reference matrix H, wherein k is an odd number greater than 1 and the element in the ith row and jth column of H is given by a formula (presented as an image in the original) parameterized by an adjustment coefficient σ;
performing difference-reduction processing on the discrete voice matrix M according to the reference matrix H and determining the difference-reduced voice matrix G, the element g_{x,y} in the xth row and yth column of G being computed from the elements of M in the k × k window around m_{x,y}, weighted by H (the formula is presented as an image in the original);
when the voice data is the interrogator voice data, the determined voice matrix G is the first voice matrix G1; and when the voice data is the second voice data, the determined voice matrix G is the second voice matrix G2.
Based on the same inventive concept, an embodiment of the present invention further provides a device for separating voice during interrogation, comprising:
an acquisition module, configured to acquire first voice data collected by a first audio acquisition device and second voice data collected by a second audio acquisition device, wherein the first audio acquisition device is pointed at the interrogator and the second audio acquisition device is pointed at the interrogated person;
a first processing module, configured to filter the first voice data and determine the interrogator voice data corresponding to the interrogator;
and a second processing module, configured to take the interrogator voice data as a reference signal, remove the interrogator voice data from the second voice data, and determine the interrogated person's voice data in the second voice data.
In one possible implementation, the apparatus further includes:
and a recognition module, configured to recognize the interrogator voice data and the interrogated person's voice data respectively and determine the corresponding interrogator text and interrogated-person text.
In one possible implementation, the identification module is further configured to:
determine timestamps of the interrogator voice data and the interrogated person's voice data, and add corresponding timestamps to the interrogator text and the interrogated-person text respectively, wherein the timestamps comprise a start timestamp and an end timestamp;
and determine the overlapping part of the interrogator text and the interrogated-person text according to the timestamps, and highlight the text corresponding to the overlapping part.
In one possible implementation manner, the second processing module is configured to:
determine the signal delay according to the distance between the first audio acquisition device and the interrogator and the distance between the second audio acquisition device and the interrogator;
and delay the interrogator voice data by the signal delay, take the delayed interrogator voice data as the reference signal, and remove the interrogator voice data from the second voice data.
In one possible implementation manner, the second processing module is configured to:
preprocess the interrogator voice data and the second voice data respectively, and determine a first voice matrix G1 corresponding to the interrogator voice data and a second voice matrix G2 corresponding to the second voice data;
determine a third voice matrix G3 from the first voice matrix G1 and the second voice matrix G2, wherein G3 = G2 - λG1 and λ is a weight coefficient;
perform dimensionality reduction on the third voice matrix G3 to convert it into a discrete voice array Xs, restore the discrete voice array Xs into continuous voice data according to a preset sampling period, and take the restored voice data as the interrogated person's voice data;
wherein the preprocessing comprises:
performing discrete sampling on the voice data according to the preset sampling period to obtain a discrete voice array X = [x1, x2, …, xj, …, xn] represented in array form, wherein the voice data is the interrogator voice data or the second voice data, xj represents the sampling value of the voice data at the jth sampling point, and n is the total number of samples;
performing row expansion by taking the discrete voice array X as a row, and determining an m × n discrete voice matrix M after row expansion, wherein m is an odd number, the middle row of M is the array X itself, and the value of m_{i,j} is negatively correlated with the distance |i - (m+1)/2| from the middle row, m_{i,j} denoting the element in the ith row and jth column of the discrete voice matrix M;
determining a k × k reference matrix H, wherein k is an odd number greater than 1 and the element in the ith row and jth column of H is given by a formula (presented as an image in the original) parameterized by an adjustment coefficient σ;
performing difference-reduction processing on the discrete voice matrix M according to the reference matrix H and determining the difference-reduced voice matrix G, the element g_{x,y} in the xth row and yth column of G being computed from the elements of M in the k × k window around m_{x,y}, weighted by H (the formula is presented as an image in the original);
when the voice data is the interrogator voice data, the determined voice matrix G is the first voice matrix G1; and when the voice data is the second voice data, the determined voice matrix G is the second voice matrix G2.
The embodiments of the invention provide a method and a device for separating voice during interrogation: two sets of voice data are obtained from two audio acquisition devices, and one set is used as a reference signal to cancel its contribution to the other, thereby achieving voice separation. Using two sets of voice data effectively reduces cross-channel interference from the interrogator, so the speech signals of the interrogator and the interrogated person are correctly separated; speech recognition can then correctly transcribe both speakers, the interrogation record can be generated automatically, interrogation efficiency is improved, and labor cost is saved. Highlighting the text corresponding to the overlapping part lets the user quickly locate positions where speech recognition may have erred and conveniently verify whether the recognized text is accurate. The delay processing allows the interrogator voice data to be removed from the second voice data more accurately, yielding more accurate voice data for the interrogated person. Expanding the voice data from array form into matrix form makes it easier to process; after the third voice matrix corresponding to the interrogated person's voice data is determined, dimensionality reduction yields that voice data. Removing the interrogator voice data from the second voice data in this higher-dimensional form effectively reduces the distortion that the removal would otherwise cause.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a method for speech separation during interrogation according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the determination of overlapping text in an embodiment of the present invention;
FIG. 3 is a first block diagram of an apparatus for speech separation during interrogation in accordance with an embodiment of the present invention;
FIG. 4 is a second block diagram of an apparatus for speech separation during interrogation according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings; it should be understood that they serve only to illustrate and explain the invention, not to limit it.
The voice separation method provided by the embodiment of the invention achieves voice separation based on two audio acquisition devices (for example, microphones). Specifically, referring to FIG. 1, the method includes steps 101-103:
step 101: the method comprises the steps of acquiring first voice data acquired by a first audio acquisition device and second voice data acquired by a second audio acquisition device, wherein the first audio acquisition device is a device pointing to an auditor, and the second audio acquisition device is a device pointing to a polled person.
In the embodiment of the invention, two audio acquisition devices are provided: a first audio acquisition device and a second audio acquisition device. The first audio acquisition device is pointed at the interrogator and collects the interrogator's voice; because it can be placed close to the interrogator and far from the interrogated person, the data it collects is essentially the interrogator's voice data.
The second audio acquisition device is pointed at the interrogated person. For safety reasons it cannot be placed too close to the interrogated person; its distance to the interrogated person may even exceed its distance to the interrogator. Consequently, although the second audio acquisition device is pointed at the interrogated person, it still collects mixed voice data of both the interrogator and the interrogated person. It should be noted that, in the embodiment of the invention, "the first audio acquisition device is pointed at the interrogator" means that its voice capture direction (e.g., of a microphone) faces the interrogator and that its distance to the interrogator is smaller than the second audio acquisition device's distance to the interrogator.
Step 102: and carrying out filtering processing on the first voice data, and determining the trial voice data corresponding to the trial person.
In the embodiment of the present invention, as described above, since the first audio acquisition device can be close to the auditor and far from the auditor, the first voice data acquired by the first audio acquisition device is basically the voice data of the auditor, and at this time, the auditor voice data of the auditor can be determined by simply performing filtering and difference reduction processing on the first voice data.
Step 103: and taking the trial voice data as a reference signal, removing the trial voice data in the second voice data, and determining the trial voice data in the second voice data.
The second voice data collected by the second audio collecting device comprises voice data of the auditor and the audited person, and at the moment, voice separation is needed to be carried out on the second voice data. In the embodiment of the invention, the auditor voice data acquired by the first audio acquisition device is used as the reference signal, so that the voice part of the auditor in the second voice data can be removed, and the auditor voice data in the second voice data can be determined.
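The description at this point leaves the concrete cancellation algorithm open (a matrix-based scheme is detailed in a later embodiment). As an illustration only, a normalized LMS (NLMS) adaptive filter is one standard way to subtract a known reference signal from a mixed channel; the function names and parameter values below are illustrative, not taken from the patent.

```python
# Illustrative sketch, not the patented method: cancel the interrogator
# reference channel from the mixed second channel with an NLMS filter.

def nlms_cancel(mixture, reference, taps=4, mu=0.5, eps=1e-8):
    """Subtract the best linear estimate of `reference` from `mixture`."""
    w = [0.0] * taps                      # adaptive filter weights
    out = []
    for n in range(len(mixture)):
        # current window of the reference signal (zero-padded at the start)
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))    # estimated leakage
        e = mixture[n] - y                          # residual = separated voice
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out

# Sanity check: if the mixture IS the reference, the residual converges to zero.
ref = [1.0, -1.0] * 50
residual = nlms_cancel(ref, ref)
print(abs(residual[-1]) < 0.1)
```

In a real deployment the residual would be the interrogated person's voice plus whatever the filter could not model; the later matrix-based embodiment is the patent's own removal scheme.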
In the method for separating voice during interrogation provided by the embodiment of the invention, two sets of voice data are obtained from the two audio acquisition devices, and one set is used as a reference signal to cancel its contribution to the other, thereby achieving voice separation. Using two sets of voice data effectively reduces cross-channel interference from the interrogator, so the speech signals of the interrogator and the interrogated person are correctly separated; speech recognition can then correctly transcribe both speakers, the interrogation record can be generated automatically, interrogation efficiency is improved, and labor cost is saved.
In one possible implementation, the method further includes: recognizing the interrogator voice data and the interrogated person's voice data respectively, and determining the corresponding interrogator text and interrogated-person text.
In the embodiment of the invention, the interrogator voice data contains only the interrogator's voice, and the corresponding interrogator text can be determined by speech recognition; similarly, the interrogated person's voice data contains only the interrogated person's voice, and the corresponding interrogated-person text can likewise be determined by speech recognition. The interrogation record can then be generated from the interrogator text and the interrogated-person text.
In a possible implementation, because the interrogator and the interrogated person may speak simultaneously, or one party may interrupt the other, the recognized text content may be problematic. In an embodiment of the invention, determining the corresponding interrogator text and interrogated-person text therefore includes steps A1-A2:
Step A1: determining timestamps of the interrogator voice data and the interrogated person's voice data, and adding corresponding timestamps to the interrogator text and the interrogated-person text respectively.
Step A2: determining the overlapping part of the interrogator text and the interrogated-person text according to the timestamps, and highlighting the text corresponding to the overlapping part.
In the embodiment of the invention, the timestamps of the two texts are determined from the timestamps of the voice data. A timestamp in a text indicates the time of a passage, a sentence, or a word. The whole text is divided by start and end timestamps: text lies between a start timestamp and the following end timestamp, and the interval from an end timestamp to the next start timestamp is blank. Text between a start timestamp and an end timestamp may also carry ordinary timestamps, which merely mark the time points of that text. In other words, the timestamps make it possible to determine during which time period any piece of the interrogator text or the interrogated-person text was collected. When the timestamps show that the interrogator text and the interrogated-person text overlap, both were collected during the same period; in that case the interrogated person's voice data obtained by using the interrogator voice data as a reference signal may suffer significant interference, so the corresponding text is highlighted to let the user check whether the recognition result is accurate.
As shown in FIG. 2, a rectangular wave indicates the presence or absence of text: a level of 1 means text is present and 0 means it is absent. Specifically, for the interrogator text, T1, T3, and T5 are start timestamps and T2, T4, and T6 are end timestamps; one text segment of the interrogator text lies in each of the intervals T1-T2, T3-T4, and T5-T6. Similarly, for the interrogated-person text, T7 and T9 are start timestamps and T8 and T10 are end timestamps, with one text segment in each of the intervals T7-T8 and T9-T10. As FIG. 2 shows, the overlap between the two texts is the segment T9-T4: the interrogator starts speaking at T3, and before finishing (at T4) the interrogated person interjects at T9, so both texts are collected during the period T9-T4, and the text corresponding to that period must be highlighted (for example, rendered with a highlight color). Optionally, because the interrogator voice data is less likely to be disturbed by the interrogated person, the interrogator text may be taken as correct and only the overlapping portion of the interrogated-person text need be highlighted.
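The FIG. 2 logic can be sketched as a simple interval-intersection check, assuming each text segment is stored as a (start, end) timestamp pair; the function and variable names are illustrative, not from the patent.

```python
# Sketch of the overlap detection of steps A1-A2: find every interval during
# which an interrogator segment and an interrogated-person segment coincide.

def overlapping_spans(interrogator_segments, interrogated_segments):
    """Return (start, end) intervals where segments from both speakers overlap."""
    overlaps = []
    for a_start, a_end in interrogator_segments:
        for b_start, b_end in interrogated_segments:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # non-empty intersection => simultaneous speech
                overlaps.append((start, end))
    return overlaps

# Mirroring FIG. 2: the interrogator speaks over [3, 4] and the interrogated
# person interjects at 3.5, so the span [3.5, 4] would be highlighted.
interrogator = [(1, 2), (3, 4), (5, 6)]
interrogated = [(3.5, 4.5), (9, 10)]
print(overlapping_spans(interrogator, interrogated))  # [(3.5, 4)]
```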
Another embodiment of the present invention provides a method for separating voice during interrogation that includes steps 101-103 of the above embodiment; its implementation principle and technical effects are as in the embodiment corresponding to FIG. 1. In addition, in this embodiment, removing the interrogator voice data from the second voice data by taking it as a reference signal in step 103 includes steps B1-B2:
Step B1: determining the signal delay according to the distance between the first audio acquisition device and the interrogator and the distance between the second audio acquisition device and the interrogator.
Step B2: delaying the interrogator voice data by the signal delay, taking the delayed interrogator voice data as the reference signal, and removing the interrogator voice data from the second voice data.
Because the two audio acquisition devices are at different distances from the interrogator, the interrogator's voice actually reaches them with a relative delay. In the embodiment of the invention, the signal delay is determined from the distance between the first audio acquisition device and the interrogator and the distance between the second audio acquisition device and the interrogator; it can also be determined with an existing delay estimation algorithm. After the interrogator voice data is obtained, it is delayed by the signal delay, eliminating the offset between the signals collected by the two devices, so the interrogator voice data can be removed from the second voice data more accurately and more accurate voice data for the interrogated person obtained.
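Steps B1-B2 can be sketched as follows, under the assumption (not stated in the patent, which only says the delay follows from the two distances) that the delay is the path-length difference divided by the speed of sound, applied as a whole-sample shift; all names and parameter values are illustrative.

```python
# Sketch of delay computation (step B1) and delay processing (step B2).

def delay_in_samples(d1_m, d2_m, fs_hz=16000, c_mps=343.0):
    """Delay between the two capture paths, rounded to whole samples."""
    return round(fs_hz * (d2_m - d1_m) / c_mps)

def align_reference(reference, delay_samples):
    """Delay the reference signal by prepending zeros (positive delay only)."""
    return [0.0] * delay_samples + reference[:len(reference) - delay_samples]

d = delay_in_samples(0.3, 2.0)            # mic 1 at 0.3 m, mic 2 at 2.0 m
print(d)                                  # 79 samples at 16 kHz
print(align_reference([1.0, 2.0, 3.0, 4.0], 2))  # [0.0, 0.0, 1.0, 2.0]
```

The aligned reference would then be fed to whatever removal step follows (e.g., the matrix scheme of steps C1-C3).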
On the basis of the above embodiment, "taking the interrogator voice data as the reference signal and removing it from the second voice data" in step 103 includes steps C1-C3:
Step C1: preprocessing the interrogator voice data and the second voice data respectively, and determining a first voice matrix G1 corresponding to the interrogator voice data and a second voice matrix G2 corresponding to the second voice data.
Step C2: determining a third voice matrix G3 from the first voice matrix G1 and the second voice matrix G2, where G3 = G2 - λG1 and λ is a weight coefficient.
Step C3: performing dimensionality reduction on the third voice matrix G3 to convert it into a discrete voice array Xs, restoring Xs into continuous voice data according to the preset sampling period, and taking the restored voice data as the interrogated person's voice data.
The preprocessing in step C1 specifically comprises steps C11-C14:
Step C11: performing discrete sampling on the voice data according to the preset sampling period to obtain a discrete voice array X = [x1, x2, …, xj, …, xn] represented in array form, where the voice data is the interrogator voice data or the second voice data, xj is the sampling value of the voice data at the jth sampling point, and n is the total number of samples.
In the embodiment of the invention, the voice data (the interrogator voice data or the second voice data) is continuous audio data; discrete sampling determines the voice features it contains, and recording it as a discrete voice array in array form facilitates subsequent processing. The preset sampling period here is the same as that in step C3; that is, sampling the voice data with the preset sampling period yields n sampling values.
Step C12: performing row expansion by taking the discrete voice array X as a row, and determining an M multiplied by n discrete voice matrix M after row expansion; wherein m is an odd number,
Figure BDA0001568016200000101
and m isi,jValue of and
Figure BDA0001568016200000102
m is a negative correlation between themi,jRepresenting the elements in the ith row and jth column of the discrete speech matrix M.
In the embodiment of the present invention, the discrete voice array X may also be regarded as a 1 × n matrix, and row expansion expands it into an m × n discrete voice matrix M whose middle row is X itself, i.e. m_{(m+1)/2, j} = x_j. For the other elements, the value of m_{i,j} is negatively correlated with the distance |i - (m+1)/2| from the middle row: the farther an element lies from the middle row, the smaller its value. Alternative concrete formulas for m_{i,j} are given as images in the original; m_{i,j} may also be determined in other ways, and this embodiment is not limited in this respect.
Step C13: determining a k × k reference matrix H; k is an odd number greater than 1, and the element in the ith row and the jth column of the reference matrix H is:
Figure BDA0001568016200000117
where σ is an adjustment coefficient.
Because errors are introduced when the originally collected voice data is expanded into the discrete voice matrix M, these errors need to be reduced or even eliminated at this stage. In the embodiment of the invention, the error in the discrete voice matrix M is removed by establishing the k-order reference matrix H. The larger the adjustment coefficient σ, the more obvious the error-reduction effect, but the more easily the discrete voice matrix M is distorted; therefore the adjustment coefficient σ is generally selected according to the actual situation, for example σ = 0.8.
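As the patent's element formula for H survives only as an image, the sketch below assumes a Gaussian kernel in the distance from the centre element, normalised to sum to 1; both the Gaussian form and the normalisation are assumptions:

```python
import numpy as np

def reference_matrix(k, sigma):
    """Build a k x k reference matrix H centred on row/column (k+1)/2.
    A Gaussian kernel in the distance from the centre, normalised to
    sum to 1, is assumed here as one plausible reading."""
    assert k % 2 == 1 and k > 1
    c = (k + 1) / 2                        # 1-based centre index
    i, j = np.mgrid[1:k + 1, 1:k + 1]      # 1-based row/column index grids
    H = np.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
    return H / H.sum()                     # keep the overall scale of M

H = reference_matrix(3, 0.8)               # e.g. sigma = 0.8 as in the text
```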
Step C14: performing difference reduction processing on the discrete voice matrix M according to the reference matrix H, and determining the voice matrix G after the difference reduction processing; the element g(x,y) in the xth row and yth column of the voice matrix G is given by a formula (available only as an image in the original) that combines the reference matrix H with the elements of the discrete voice matrix M.
when the voice data is trial voice data, the determined voice matrix G is a first voice matrix G1; when the voice data is the second voice data, the determined voice matrix G is the second voice matrix G2.
In the embodiment of the invention, the difference reduction processing is performed on an element m(x,y) according to the reference matrix H when the k × k window centred on m(x,y) fits entirely inside the discrete voice matrix M (the two boundary conditions on x and y are given as formula images in the original). For the border elements m(x,y) of the discrete voice matrix M, for which the window does not fit, the difference reduction processing is not performed; however, since the original discrete voice array X is located in the middle row of the discrete voice matrix M, the influence of these peripheral elements on the discrete voice array X can be ignored even without the difference reduction processing.
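The interior-only application of H described above resembles a windowed correlation; a sketch, with the pass-through border rule and the element-wise combination of H and M as assumptions:

```python
import numpy as np

def difference_reduction(M, H):
    """Apply the k x k reference matrix H to every interior element of M
    (where the window fits inside the matrix), leaving border elements
    unchanged -- a windowed-correlation sketch of the patent's step."""
    m, n = M.shape
    k = H.shape[0]
    r = (k - 1) // 2                      # half-width of the k x k window
    G = M.copy()                          # border elements pass through
    for x in range(r, m - r):             # interior rows only
        for y in range(r, n - r):         # interior columns only
            G[x, y] = np.sum(H * M[x - r:x + r + 1, y - r:y + r + 1])
    return G

# With a delta kernel (1 at the centre, 0 elsewhere) the matrix is unchanged
M = np.arange(20, dtype=float).reshape(4, 5)
delta = np.zeros((3, 3))
delta[1, 1] = 1.0
G = difference_reduction(M, delta)
```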
After the first voice matrix G1 and the second voice matrix G2 are obtained, the audited voice data can be determined according to the above steps C2 and C3. The weight coefficient λ is a real number between 0 and 1, i.e. λ is at most 1; specifically, λ may be determined according to the distance between the two audio acquisition devices: the larger the distance between the two audio acquisition devices, the smaller λ is. Meanwhile, the third voice matrix G3 is converted into the discrete voice array Xs; specifically, the elements of the middle row of G3 may be used directly as the discrete voice array Xs, or each element of the discrete voice array Xs may be determined from all the elements of the corresponding column of the third voice matrix G3.
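Steps C2 and C3 referenced above (G3 = G2 - λ·G1, then taking the middle row of G3 as Xs) can be sketched as follows, with toy matrices standing in for the two preprocessed recordings:

```python
import numpy as np

def remove_reference(G1, G2, lam):
    """Subtract the weighted trial-voice matrix G1 from the second-voice
    matrix G2 (G3 = G2 - lam * G1, 0 <= lam <= 1) and take the middle
    row of G3 as the discrete array Xs of the audited voice data."""
    G3 = G2 - lam * G1                    # the weighted subtraction step
    mid = G3.shape[0] // 2                # 0-based index of the middle row
    return G3[mid]

# Toy 3 x 4 matrices standing in for the preprocessed recordings
G1 = np.ones((3, 4))
G2 = np.full((3, 4), 2.0)
Xs = remove_reference(G1, G2, 0.5)        # every element: 2 - 0.5*1 = 1.5
```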
In the embodiment of the invention, the voice data is first subjected to dimension-increasing processing, expanding the voice data represented in array form into matrix form, which facilitates its processing; after the third voice matrix corresponding to the audited voice data is determined, dimension-reduction processing is performed to obtain the audited voice data. In this way the trial voice data is removed from the second voice data in a higher-dimensional representation, which can effectively reduce the distortion caused by the removal.
In the method for separating voice during interrogation provided by the embodiment of the invention, two groups of voice data are respectively obtained through the two audio acquisition devices, and one group of voice data is used as a reference signal to cancel it from the other group, thereby realizing voice separation. The two groups of voice data effectively reduce the interference of the auditor's channel, so that the conversation signals of the auditor and the audited person are correctly separated; the voices of both can then be correctly recognized by voice recognition, the trial record can be generated automatically, trial efficiency is improved, and labor cost is saved. By highlighting the text corresponding to the overlapped part, the user can quickly locate positions where the voice recognition may be wrong, making it convenient to check and confirm whether the recognized text is accurate. Through the delay processing, the trial voice data can be removed from the second voice data more accurately, so that more accurate audited voice data is obtained. Expanding the voice data represented in array form into matrix form facilitates its processing; after the third voice matrix corresponding to the audited voice data is determined, dimension-reduction processing is performed to obtain the audited voice data. This removes the trial voice data from the second voice data in a higher-dimensional representation, which can effectively reduce the distortion caused by the removal.
The method flow of voice separation during interrogation has been described in detail above; the method can also be implemented by a corresponding device, whose structure and function are described in detail below.
An apparatus for separating speech during interrogation according to an embodiment of the present invention is shown in fig. 3, and includes:
the acquisition module 31 is used for acquiring first voice data acquired by a first audio acquisition device and second voice data acquired by a second audio acquisition device, wherein the first audio acquisition device is a device pointing to an auditor, and the second audio acquisition device is a device pointing to a person to be audited;
the first processing module 32 is configured to perform filtering processing on the first voice data, and determine interrogation voice data corresponding to an interrogation person;
the second processing module 33 is configured to remove the trial voice data from the second voice data by using the trial voice data as a reference signal, and determine the audited voice data in the second voice data.
In one possible implementation, referring to fig. 4, the apparatus further includes:
and the recognition module 34 is configured to recognize the trial voice data and the audited voice data respectively, and determine the corresponding trial text and audited text.
In one possible implementation, the identification module 34 is further configured to:
determining time stamps of the trial voice data and the audited voice data, and adding corresponding time stamps to the trial text and the audited text respectively according to the time stamps, wherein the time stamps comprise a start time stamp and an end time stamp;

and determining the overlapping part of the trial text and the audited text according to the time stamps, and highlighting the text corresponding to the overlapping part.
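The overlap test behind the highlighting step can be sketched as follows; the (start, end) tuple representation and the units (seconds) are illustrative:

```python
def overlap(seg_a, seg_b):
    """Overlapping interval of two (start, end) timestamp pairs, or None.
    This is the interval test behind highlighting overlapped text."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    return (start, end) if start < end else None

# Trial speech 0-5 s, audited speech 4-9 s: seconds 4-5 would be highlighted
both = overlap((0.0, 5.0), (4.0, 9.0))
```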
In a possible implementation manner, the second processing module 33 is specifically configured to:
determining signal time delay according to the distance between the first audio acquisition device and the auditor and the distance between the second audio acquisition device and the auditor;
and carrying out time delay processing on the trial voice data according to the signal time delay, taking the trial voice data after the time delay processing as a reference signal, and removing the trial voice data in the second voice data.
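The signal time delay can be estimated from the extra acoustic path between the two devices; the sketch below assumes sound travelling at roughly 343 m/s, and all parameter names are illustrative:

```python
def delay_samples(d_first, d_second, sample_rate, speed_of_sound=343.0):
    """Signal time delay, in samples, between the two audio acquisition
    devices: the extra acoustic path (d_second - d_first, in metres)
    divided by the speed of sound, then scaled by the sampling rate."""
    delay_seconds = (d_second - d_first) / speed_of_sound
    return round(delay_seconds * sample_rate)

# Second device 1.715 m farther from the auditor, 16 kHz sampling
n_delay = delay_samples(0.5, 2.215, 16000)   # 0.005 s -> 80 samples
```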
In a possible implementation manner, the second processing module 33 is specifically configured to:
preprocessing the trial voice data and the second voice data respectively, and determining a first voice matrix G1 corresponding to the trial voice data and a second voice matrix G2 corresponding to the second voice data;
determining a third voice matrix G3 according to the first voice matrix G1 and the second voice matrix G2, wherein G3 = G2 - λ·G1, and λ is a weight coefficient;
performing dimension-reduction processing on the third voice matrix G3, converting the third voice matrix G3 into a discrete voice array Xs, restoring the discrete voice array Xs into continuous voice data according to the preset sampling period, and taking the restored voice data as the audited voice data;
wherein, the pretreatment process comprises the following steps:
performing discrete sampling processing on the voice data according to a preset sampling period to obtain a discrete voice array X represented in array form, wherein the voice data is the trial voice data or the second voice data; X = [x1, x2, …, xj, …, xn], wherein xj represents the sampling value of the voice data corresponding to the jth sampling point, and n is the total number of samples;
performing row expansion by taking the discrete voice array X as a row, and determining an m × n discrete voice matrix M after row expansion; wherein m is an odd number, the middle row of M is the discrete voice array X itself (the formula is given only as an image in the original), the value of m(i,j) is negatively correlated with the distance of row i from the middle row, and m(i,j) represents the element in the ith row and jth column of the discrete voice matrix M;
determining a k × k reference matrix H; k is an odd number greater than 1, and the element in the ith row and jth column of the reference matrix H is given by a formula (available only as an image in the original) in which σ is an adjustment coefficient;
performing difference reduction processing on the discrete voice matrix M according to the reference matrix H, and determining the voice matrix G after the difference reduction processing; the element g(x,y) in the xth row and yth column of the voice matrix G is given by a formula available only as an image in the original;
when the voice data is trial voice data, the determined voice matrix G is a first voice matrix G1; when the voice data is the second voice data, the determined voice matrix G is the second voice matrix G2.
In the device for separating voice during interrogation provided by the embodiment of the invention, two groups of voice data are respectively obtained through the two audio acquisition devices, and one group of voice data is used as a reference signal to cancel it from the other group, thereby realizing voice separation. The two groups of voice data effectively reduce the interference of the auditor's channel, so that the conversation signals of the auditor and the audited person are correctly separated; the voices of both can then be correctly recognized by voice recognition, the trial record can be generated automatically, trial efficiency is improved, and labor cost is saved. By highlighting the text corresponding to the overlapped part, the user can quickly locate positions where the voice recognition may be wrong, making it convenient to check and confirm whether the recognized text is accurate. Through the delay processing, the trial voice data can be removed from the second voice data more accurately, so that more accurate audited voice data is obtained. Expanding the voice data represented in array form into matrix form facilitates its processing; after the third voice matrix corresponding to the audited voice data is determined, dimension-reduction processing is performed to obtain the audited voice data. This removes the trial voice data from the second voice data in a higher-dimensional representation, which can effectively reduce the distortion caused by the removal.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A method for speech separation during interrogation, comprising:
acquiring first voice data collected by a first audio acquisition device and second voice data collected by a second audio acquisition device, wherein the first audio acquisition device is a device pointing to the auditor, and the second audio acquisition device is a device pointing to the audited person;

performing filtering processing on the first voice data, and determining the trial voice data corresponding to the auditor;

removing the trial voice data from the second voice data by taking the trial voice data as a reference signal, and determining the audited voice data in the second voice data;
the removing the trial voice data in the second voice data by using the trial voice data as a reference signal includes:
preprocessing the trial voice data and the second voice data respectively, and determining a first voice matrix G1 corresponding to the trial voice data and a second voice matrix G2 corresponding to the second voice data;
determining a third voice matrix G3 according to the first voice matrix G1 and the second voice matrix G2, wherein G3 = G2 - λ·G1, and λ is a weight coefficient;
performing dimensionality reduction processing on the third voice matrix G3, converting the third voice matrix G3 into a discrete voice array Xs, reducing the discrete voice array Xs into continuous voice data according to a preset sampling period, and taking the reduced voice data as audited voice data;
wherein the preprocessing process comprises the following steps:
performing discrete sampling processing on the voice data according to the preset sampling period to obtain a discrete voice array X represented in array form, wherein the voice data is the trial voice data or the second voice data; X = [x1, x2, …, xj, …, xn], wherein xj represents the sampling value of the voice data corresponding to the jth sampling point, and n is the total number of samples;
performing row expansion by taking the discrete voice array X as a row, and determining an m × n discrete voice matrix M after row expansion; wherein m is an odd number, the middle row of M is the discrete voice array X itself (the formula is given only as an image in the original), the value of m(i,j) is negatively correlated with the distance of row i from the middle row, and m(i,j) represents the element in the ith row and jth column of the discrete voice matrix M;
determining a k × k reference matrix H; k is an odd number greater than 1, and the element in the ith row and jth column of the reference matrix H is given by a formula (available only as an image in the original) in which σ is an adjustment coefficient;
performing difference reduction processing on the discrete voice matrix M according to the reference matrix H, and determining the voice matrix G after the difference reduction processing; the element g(x,y) in the xth row and yth column of the voice matrix G is given by a formula available only as an image in the original;
when the voice data is the interrogation voice data, the determined voice matrix G is a first voice matrix G1; and when the voice data is the second voice data, the determined voice matrix G is a second voice matrix G2.
2. The method of claim 1, further comprising:
and recognizing the trial voice data and the audited voice data respectively, and determining the corresponding trial text and audited text.
3. The method of claim 2, wherein determining the corresponding trial text and the audited text comprises:

determining time stamps of the trial voice data and the audited voice data, and adding corresponding time stamps to the trial text and the audited text respectively according to the time stamps, wherein the time stamps comprise a start time stamp and an end time stamp;

and determining the overlapping part of the trial text and the audited text according to the time stamps, and highlighting the text corresponding to the overlapping part.
4. The method of claim 1, wherein removing the trial voice data from the second voice data using the trial voice data as a reference signal comprises:
determining signal time delay according to the distance between the first audio acquisition device and the auditor and the distance between the second audio acquisition device and the auditor;
and carrying out time delay processing on the trial voice data according to the signal time delay, taking the trial voice data after the time delay processing as a reference signal, and removing the trial voice data in the second voice data.
5. An apparatus for audio separation during interrogation, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring first voice data acquired by a first audio acquisition device and second voice data acquired by a second audio acquisition device, the first audio acquisition device is a device pointing to an auditor, and the second audio acquisition device is a device pointing to a polled person;
the first processing module is used for carrying out filtering processing on the first voice data and determining the audition voice data corresponding to the audition person;
the second processing module is used for removing the trial voice data in the second voice data by taking the trial voice data as a reference signal and determining the trial voice data in the second voice data;
the second processing module is configured to:
preprocessing the trial voice data and the second voice data respectively, and determining a first voice matrix G1 corresponding to the trial voice data and a second voice matrix G2 corresponding to the second voice data;
determining a third voice matrix G3 according to the first voice matrix G1 and the second voice matrix G2, wherein G3 = G2 - λ·G1, and λ is a weight coefficient;
performing dimensionality reduction processing on the third voice matrix G3, converting the third voice matrix G3 into a discrete voice array Xs, reducing the discrete voice array Xs into continuous voice data according to a preset sampling period, and taking the reduced voice data as audited voice data;
wherein the preprocessing process comprises the following steps:
performing discrete sampling processing on the voice data according to the preset sampling period to obtain a discrete voice array X represented in array form, wherein the voice data is the trial voice data or the second voice data; X = [x1, x2, …, xj, …, xn], wherein xj represents the sampling value of the voice data corresponding to the jth sampling point, and n is the total number of samples;
performing row expansion by taking the discrete voice array X as a row, and determining an m × n discrete voice matrix M after row expansion; wherein m is an odd number, the middle row of M is the discrete voice array X itself (the formula is given only as an image in the original), the value of m(i,j) is negatively correlated with the distance of row i from the middle row, and m(i,j) represents the element in the ith row and jth column of the discrete voice matrix M;
determining a k × k reference matrix H; k is an odd number greater than 1, and the element in the ith row and jth column of the reference matrix H is given by a formula (available only as an image in the original) in which σ is an adjustment coefficient;
performing difference reduction processing on the discrete voice matrix M according to the reference matrix H, and determining the voice matrix G after the difference reduction processing; the element g(x,y) in the xth row and yth column of the voice matrix G is given by a formula available only as an image in the original;
when the voice data is the interrogation voice data, the determined voice matrix G is a first voice matrix G1; and when the voice data is the second voice data, the determined voice matrix G is a second voice matrix G2.
6. The apparatus of claim 5, further comprising:
and a recognition module, configured to recognize the trial voice data and the audited voice data respectively, and determine the corresponding trial text and audited text.
7. The apparatus of claim 6, wherein the identification module is further configured to:
determining time stamps of the trial voice data and the audited voice data, and adding corresponding time stamps to the trial text and the audited text respectively according to the time stamps, wherein the time stamps comprise a start time stamp and an end time stamp;

and determining the overlapping part of the trial text and the audited text according to the time stamps, and highlighting the text corresponding to the overlapping part.
8. The apparatus of claim 5, wherein the second processing module is configured to:
determining signal time delay according to the distance between the first audio acquisition device and the auditor and the distance between the second audio acquisition device and the auditor;
and carrying out time delay processing on the trial voice data according to the signal time delay, taking the trial voice data after the time delay processing as a reference signal, and removing the trial voice data in the second voice data.
CN201810106940.7A 2018-02-02 2018-02-02 Method and device for separating voice during interrogation Active CN108198570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810106940.7A CN108198570B (en) 2018-02-02 2018-02-02 Method and device for separating voice during interrogation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810106940.7A CN108198570B (en) 2018-02-02 2018-02-02 Method and device for separating voice during interrogation

Publications (2)

Publication Number Publication Date
CN108198570A CN108198570A (en) 2018-06-22
CN108198570B true CN108198570B (en) 2020-10-23

Family

ID=62592089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810106940.7A Active CN108198570B (en) 2018-02-02 2018-02-02 Method and device for separating voice during interrogation

Country Status (1)

Country Link
CN (1) CN108198570B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065023A (en) * 2018-08-23 2018-12-21 广州势必可赢网络科技有限公司 A kind of voice identification method, device, equipment and computer readable storage medium
CN109785855B (en) * 2019-01-31 2022-01-28 秒针信息技术有限公司 Voice processing method and device, storage medium and processor
CN110689900B (en) * 2019-09-29 2022-05-13 北京地平线机器人技术研发有限公司 Signal enhancement method and device, computer readable storage medium and electronic equipment
CN111128212A (en) * 2019-12-09 2020-05-08 秒针信息技术有限公司 Mixed voice separation method and device
CN111145774A (en) * 2019-12-09 2020-05-12 秒针信息技术有限公司 Voice separation method and device
EP4344449A1 (en) * 2022-06-13 2024-04-03 Orcam Technologies Ltd. Processing and utilizing audio signals

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998004100A1 (en) * 1996-07-19 1998-01-29 David Griesinger Multichannel active matrix sound reproduction with maximum lateral separation
JP3695324B2 (en) * 2000-12-12 2005-09-14 日本電気株式会社 Information system using TV broadcasting
US20070133811A1 (en) * 2005-12-08 2007-06-14 Kabushiki Kaisha Kobe Seiko Sho Sound source separation apparatus and sound source separation method
CN101964192A (en) * 2009-07-22 2011-02-02 索尼公司 Sound processing device, sound processing method, and program
CN103106903A (en) * 2013-01-11 2013-05-15 太原科技大学 Single channel blind source separation method
CN103247295A (en) * 2008-05-29 2013-08-14 高通股份有限公司 Systems, methods, apparatus, and computer program products for spectral contrast enhancement
CN104408042A (en) * 2014-10-17 2015-03-11 广州三星通信技术研究有限公司 Method and device for displaying a text corresponding to voice of a dialogue in a terminal
CN104505099A (en) * 2014-12-08 2015-04-08 北京云知声信息技术有限公司 Method and equipment for removing known interference in voice signal
CN106448722A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 Sound recording method, device and system
CN106887238A (en) * 2017-03-01 2017-06-23 中国科学院上海微系统与信息技术研究所 A kind of acoustical signal blind separating method based on improvement Independent Vector Analysis algorithm
CN107093438A (en) * 2012-06-18 2017-08-25 谷歌公司 System and method for recording selective removal audio content from mixed audio

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101096091B1 (en) * 2010-05-20 2011-12-19 충북대학교 산학협력단 Apparatus for Separating Voice and Method for Separating Voice of Single Channel Using the Same


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Natural Gradient Multichannel Blind Deconvolution and Speech Separation Using Causal FIR Filters;Scott C. Douglas,Hiroshi Sawada,Shoji Makino;《IEEE International Conference on Acoustics IEEE》;20041220;全文 *
Convolutive Blind Separation of Multichannel Speech Signals Based on Givens-Hyperbolic Double Rotations; Zhang Hua, Zuo Jiancun, Dai Hong, Gui Lin; Journal of Shanghai Polytechnic University (《上海第二工业大学学报》); 20160615; Vol. 33, No. 2; full text *

Also Published As

Publication number Publication date
CN108198570A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108198570B (en) Method and device for separating voice during interrogation
JP6535706B2 (en) Method for creating a ternary bitmap of a data set
Tom et al. End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention.
CN108630193B (en) Voice recognition method and device
US5583961A (en) Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands
US8078463B2 (en) Method and apparatus for speaker spotting
US20090177466A1 (en) Detection of speech spectral peaks and speech recognition method and system
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN111243619B (en) Training method and device for speech signal segmentation model and computer equipment
KR20010005685A (en) Speech analysis system
Pan et al. USEV: Universal speaker extraction with visual cue
CN113870892A (en) Conference recording method, device, equipment and storage medium based on voice recognition
CN108399913B (en) High-robustness audio fingerprint identification method and system
CN110265000A (en) A method of realizing Rapid Speech writing record
CN113744742A (en) Role identification method, device and system in conversation scene
CN113709313A (en) Intelligent quality inspection method, device, equipment and medium for customer service call data
CN113035225B (en) Visual voiceprint assisted voice separation method and device
KR20200140235A (en) Method and device for building a target speaker's speech model
CN114996489A (en) Method, device and equipment for detecting violation of news data and storage medium
CN210606618U (en) System for realizing voice and character recording
JPS6129518B2 (en)
CN112151070B (en) Voice detection method and device and electronic equipment
CN117831537A (en) Conference transfer method and system based on multi-stage memristor array and electronic equipment
CN116129901A (en) Speech recognition method, device, electronic equipment and readable storage medium
CN112687273A (en) Voice transcription method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 101, 1st floor, building 1, Xisanqi building materials City, Haidian District, Beijing 100096

Patentee after: Yunzhisheng Intelligent Technology Co.,Ltd.

Address before: 12 / F, Guanjie building, building 1, No. 16, Taiyanggong Middle Road, Chaoyang District, Beijing

Patentee before: BEIJING UNISOUND INFORMATION TECHNOLOGY Co.,Ltd.

CP03 Change of name, title or address