CN110808062B - Mixed voice separation method and device - Google Patents


Info

Publication number
CN110808062B
Authority
CN
China
Prior art keywords
voice
time
target
vibration
time period
Legal status
Active
Application number
CN201911175510.1A
Other languages
Chinese (zh)
Other versions
CN110808062A (en)
Inventor
李健
徐浩
梁志婷
Current Assignee
Miaozhen Information Technology Co Ltd
Original Assignee
Miaozhen Information Technology Co Ltd
Application filed by Miaozhen Information Technology Co Ltd
Priority to CN201911175510.1A
Publication of CN110808062A
Application granted
Publication of CN110808062B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating

Abstract

The invention discloses a mixed voice separation method and device. The method includes: acquiring a target voice to be separated, where the target voice includes a first voice uttered by a first object and a second voice uttered by a second object, and the first voice and the second voice do not overlap; acquiring the time period during which the first object utters the first voice, where the time period runs from the time the sounding part of the first object starts vibrating to the time it stops vibrating, as detected by a vibration detection module; separating target voice segments from the target voice according to the time period; and taking the target voice segments as a first voice segment and the remaining segments of the target voice as a second voice segment, where the first voice segment contains the first voice and the second voice segment contains the second voice. The invention solves the technical problem, found in the related art, of low efficiency in separating mixed voice.

Description

Mixed voice separation method and device
Technical Field
The invention relates to the field of computers, and in particular to a mixed voice separation method and device.
Background
After a multi-person scene is recorded, the resulting recording contains the sounds of several objects. To separate those sounds, the recording currently has to be played back manually and the sound of each object cut out by hand, which completes the separation of the mixed voice.
With this manual approach, however, the efficiency of separating mixed voice is low.
No effective solution to this problem has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide a mixed voice separation method and device, so as to at least solve the technical problem, found in the related art, of low efficiency in separating mixed voice.
According to one aspect of the embodiments of the invention, a mixed voice separation method is provided, including: acquiring a target voice to be separated, where the target voice includes a first voice uttered by a first object and a second voice uttered by a second object, and the first voice and the second voice do not overlap; acquiring the time period during which the first object utters the first voice, where the time period runs from the time the sounding part of the first object starts vibrating to the time it stops vibrating, as detected by a vibration detection module; separating target voice segments from the target voice according to the time period; and taking the target voice segments as a first voice segment and the remaining segments of the target voice as second voice segments, where the first voice segment includes the first voice and the second voice segment includes the second voice.
As an optional example, obtaining the time period during which the first object utters the first voice includes: recording the vibration start time when the vibration detection module detects that the sounding part of the first object starts vibrating; and recording the vibration end time when the vibration detection module detects that the sounding part of the first object stops vibrating.
As an optional example, before obtaining the time period during which the first object utters the first voice, the method further includes: when the vibration start time and vibration end time are natural time points while the time recorded in the target voice is relative, acquiring the target natural time point at which recording of the target voice started; and, according to that target natural time point, recording into the target voice the natural time point of each frame of the recording.
As an optional example, before obtaining the time period during which the first object utters the first voice, the method further includes: when the vibration start time and vibration end time are natural time points while the time recorded in the target voice is relative, acquiring the target natural time point at which recording of the target voice started; determining the length of time from the target natural time point to the vibration start time as the new vibration start time; and determining the length of time from the target natural time point to the vibration end time as the new vibration end time.
As an optional example, taking the target voice segments as the first voice segment and the remaining segments of the target voice as the second voice segment includes: splicing the target voice segments in chronological order to obtain the first voice segment; and splicing the remaining segments of the target voice in chronological order to obtain the second voice segment.
According to another aspect of the embodiments of the present invention, there is also provided a mixed voice separation apparatus, including: a first acquisition unit, configured to acquire a target voice to be separated, where the target voice includes a first voice uttered by a first object and a second voice uttered by a second object, and the first voice and the second voice do not overlap; a second acquisition unit, configured to acquire the time period during which the first object utters the first voice, where the time period runs from the time the sounding part of the first object starts vibrating to the time it stops vibrating, as detected by the vibration detection module; a separation unit, configured to separate target voice segments from the target voice according to the time period; and a first determining unit, configured to take the target voice segments as a first voice segment and the remaining segments of the target voice as a second voice segment, where the first voice segment includes the first voice and the second voice segment includes the second voice.
As an optional implementation, the second acquisition unit includes: a first recording module, configured to record the vibration start time when the vibration detection module detects that the sounding part of the first object starts vibrating; and a second recording module, configured to record the vibration end time when the vibration detection module detects that the sounding part of the first object stops vibrating.
As an optional implementation, the apparatus further includes: a third acquisition unit, configured, before the time period during which the first object utters the first voice is acquired, to acquire the target natural time point at which recording of the target voice started when the vibration start time and vibration end time are natural time points while the time recorded in the target voice is relative; and a recording unit, configured to record into the target voice, according to that target natural time point, the natural time point of each frame of the recording.
As an optional implementation, before the time period during which the first object utters the first voice is acquired, the apparatus further includes: a fourth acquisition unit, configured to acquire the target natural time point at which recording of the target voice started when the vibration start time and vibration end time are natural time points while the time recorded in the target voice is relative; a second determining unit, configured to determine the length of time from the target natural time point to the vibration start time as the new vibration start time; and a third determining unit, configured to determine the length of time from the target natural time point to the vibration end time as the new vibration end time.
As an optional implementation, the first determining unit includes: a first splicing module, configured to splice the target voice segments in chronological order to obtain the first voice segment; and a second splicing module, configured to splice the remaining segments of the target voice in chronological order to obtain the second voice segment.
In the embodiments of the invention, a target voice to be separated is acquired, where the target voice includes a first voice uttered by a first object and a second voice uttered by a second object, and the two do not overlap; the time period during which the first object utters the first voice is acquired, running from the time the sounding part of the first object starts vibrating to the time it stops vibrating, as detected by a vibration detection module; target voice segments are separated from the target voice according to the time period; and the target voice segments are taken as a first voice segment while the remaining segments of the target voice are taken as a second voice segment, where the first voice segment includes the first voice and the second voice segment includes the second voice. With this method, once the mixed voice has been obtained, the vibration of the first object's sounding part can be detected, so the start and end times of the first object's speech can be determined and its sound separated from the mixed voice. This improves the efficiency of separating mixed voice and solves the technical problem, found in the related art, of low efficiency in separating mixed voice.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of an optional mixed voice separation method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an optional mixed voice separation apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, there is provided a mixed speech separation method, optionally, as an optional implementation, as shown in fig. 1, the method includes:
s102, target voices to be separated are obtained, wherein the target voices comprise first voices sent by a first object and second voices sent by a second object, and the first voices are not overlapped with the second voices;
s104, acquiring a time period of the first object for emitting the first voice, wherein the time period is from the time of starting vibration to the time of finishing vibration of the sound-emitting part of the first object detected by the vibration detection module;
s106, separating a target voice segment from the target voice according to the time period;
and S108, taking the target voice segment as a first voice segment, and taking the rest voice segments in the target voice as second voice segments, wherein the first voice segment comprises the first voice, and the second voice segment comprises the second voice.
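Steps S102 to S108 can be sketched in code. The following is a minimal illustration only, assuming the target voice is a mono signal held in a NumPy array and that the vibration-derived speaking periods are already available as (start, end) offsets in seconds; all names are illustrative and not part of the patent.

```python
import numpy as np

def separate_mixed_voice(target_voice, sample_rate, periods):
    """Split target_voice into first-object and second-object audio.

    periods: list of (start_s, end_s) offsets in seconds during which
    the first object's sounding part was vibrating (cf. S104).
    """
    mask = np.zeros(len(target_voice), dtype=bool)
    for start_s, end_s in periods:
        lo = int(start_s * sample_rate)
        hi = int(end_s * sample_rate)
        mask[lo:hi] = True                 # S106: mark the target voice segments
    first_voice = target_voice[mask]       # S108: the first voice segment
    second_voice = target_voice[~mask]     # S108: the remaining segments
    return first_voice, second_voice
```

For a 10-second recording with one speaking period from 2 s to 5 s, this would return 3 s of first-object audio and 7 s of second-object audio.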
Optionally, the mixed voice separation method may be, but is not limited to being, applied on a terminal capable of computation, such as a mobile phone, tablet computer, notebook computer, or PC, which may interact with a server through a network. The network may include, but is not limited to, a wireless network or a wired network, where the wireless network includes WIFI and other networks enabling wireless communication, and the wired network may include, but is not limited to, wide area networks, metropolitan area networks, and local area networks. The server may include, but is not limited to, any hardware device capable of performing computations.
Optionally, the mixed voice separation method can be applied, but is not limited, to separating voice recordings, such as conference recordings, or to separating the recorded audio of two or more parties to a transaction.
For example, to separate a recording of a conversation between two people, the related-art method requires a person to judge which speaker each utterance belongs to. In this scheme, the vibration detection module of an acquisition device detects the vibration of one speaker, so that speaker's talking periods can be detected and the two speakers' sounds separated automatically according to those periods, which improves the efficiency of mixed voice separation. Here the acquisition device is the equipment used for recording and may specifically include: a microphone for receiving sound, a vibration detection module for detecting the vibration of a person's sounding part, a storage module for storing the recording file, a connection module for exchanging information with a server, and the like.
Take, as an example, separating the conversation between a customer and a store clerk in a shop. The clerk carries the acquisition device, which is fitted with a vibration detection module; the module can be located near the clerk's throat so that it touches the throat while the device is in use. When the clerk's throat vibrates, the vibration detection module detects a vibration signal. The microphone that picks up the sound may be located on the clerk's chest or otherwise carried by the clerk, for example fixed to the chest with a brooch. When the clerk talks with a customer, the clerk's throat vibrates while the clerk is speaking; once the module detects the vibration, the vibration start time and vibration end time are recorded, and that time period is the clerk's speaking period. After the microphone has captured the conversation between the customer and the clerk, the clerk's speech can be cut out of the recording using that time period, and the remaining content is the customer's speech.
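The recording of speaking periods from vibration events can be sketched as a small event handler. The interface (callbacks firing when vibration starts and ends) is an assumption made for illustration; the patent does not specify an API for the vibration detection module.

```python
class SpeakingPeriodRecorder:
    """Collects (start, end) speaking periods from vibration events."""

    def __init__(self):
        self.periods = []      # completed (start, end) pairs
        self._start = None     # time at which the current vibration began

    def on_vibration_start(self, timestamp):
        # the sounding part (e.g. the clerk's throat) starts vibrating
        self._start = timestamp

    def on_vibration_end(self, timestamp):
        # the vibration ends; the pair delimits one speaking period
        if self._start is not None:
            self.periods.append((self._start, timestamp))
            self._start = None
```

Each completed pair in `periods` is one speaking period that can later be used to cut the clerk's speech out of the recording.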
Optionally, since the recorded content usually carries only relative time, such as a 20-minute duration, while the vibration start and end times are generally clock times (for example, the vibration starts at one clock time, say 4 p.m., and ends at a later one), the two kinds of time need to be expressed in the same units. One option is to acquire the target natural time point at which recording of the target voice started, where a natural time point is real-world clock time, such as 12 noon, and then map each frame of the target voice to real-world time to obtain the natural time point of each frame. For example, at 30 frames per second, frames 31 through 60 correspond to 12:00:01. This unifies the units of measurement. Alternatively, after the target natural time point at which recording started is acquired, for example 3 p.m., the length of time from that point to the vibration start time, for example 4 p.m., i.e. one hour, is taken as the new vibration start time, and the length of time from that point to the vibration end time, for example 5 p.m., is taken as the new vibration end time. Then, in a target voice 4 hours long, the beginning of the second hour is the vibration start time and the end of the second hour is the vibration end time, i.e. the start and end of the speech of the first object, such as the clerk. This likewise unifies the units of measurement.
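The second unification option above, converting the clock times into offsets from the moment recording started, can be sketched as follows; the datetime values are illustrative.

```python
from datetime import datetime

def to_recording_offsets(recording_start, vib_start, vib_end):
    """Express the vibration times as seconds from the start of recording."""
    new_start = (vib_start - recording_start).total_seconds()
    new_end = (vib_end - recording_start).total_seconds()
    return new_start, new_end

# recording starts at 3 p.m.; the vibration lasts from 4 p.m. to 5 p.m.
start, end = to_recording_offsets(
    datetime(2019, 11, 26, 15, 0),
    datetime(2019, 11, 26, 16, 0),
    datetime(2019, 11, 26, 17, 0),
)
# start is 3600.0 and end is 7200.0, i.e. the second hour of the recording
```

These offsets line up directly with the relative time carried by the recording, matching the "new vibration start/end time" described above.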
It should be noted that a clerk and a customer may speak in turn, in which case there are multiple vibration start times and vibration end times, for example start times of 1 p.m. and 3 p.m. and end times of 2 p.m. and 4 p.m. Starting from the first vibration start time, each vibration start time is paired with the following vibration end time, and the target-voice content corresponding to each pair is extracted and spliced with the content corresponding to the next pair. The spliced result is the clerk's speech, and the remaining content is the customer's speech.
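The pairing of alternating vibration times described above can be sketched as follows; representing the clock times as plain numbers is an illustrative simplification.

```python
def pair_periods(start_times, end_times):
    """Pair each vibration start time with the next vibration end time."""
    return list(zip(sorted(start_times), sorted(end_times)))

# start times at 1 p.m. and 3 p.m., end times at 2 p.m. and 4 p.m.
periods = pair_periods([13, 15], [14, 16])
# periods is [(13, 14), (15, 16)]: the clerk's two speaking periods
```

Splicing the target-voice content selected by these pairs, in order, yields the clerk's speech; everything outside them is the customer's.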
Optionally, there may be several microphones whose recording start times differ, in which case the voice contents recorded by the different microphones must first be aligned. For example, if the 1st second of the sound received by the first microphone and the 2nd second of the sound received by the second microphone contain the same content, those two positions are aligned, and the sound is then separated using the method above.
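The patent does not say how the "same content" in two recordings is found. One common way to estimate such an offset, shown here purely as an assumed approach rather than the patent's method, is cross-correlation:

```python
import numpy as np

def estimate_offset(ref, other):
    """Estimate how many samples `other` lags behind `ref`."""
    corr = np.correlate(other, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
other = np.concatenate([np.zeros(16), ref])  # same content, 16 samples late
# estimate_offset(ref, other) recovers the 16-sample lag; shifting `other`
# left by that amount aligns the two recordings before separation
```

Once the lag is known, trimming that many samples from the later recording aligns the microphones so the vibration time period applies to both.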
With this embodiment, the mixed voice can be separated without manual judgment, which improves the efficiency of separating mixed voice.
It should be noted that for simplicity of description, the above-mentioned method embodiments are shown as a series of combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiment of the invention, a mixed voice separation device for implementing the mixed voice separation method is also provided. As shown in fig. 2, the apparatus includes:
(1) A first obtaining unit 202, configured to obtain target voices to be separated, where the target voices include a first voice uttered by a first object and a second voice uttered by a second object, and the first voice and the second voice do not overlap;
(2) A second obtaining unit 204, configured to obtain a time period during which the first object utters the first voice, where the time period is from a time when the sound generation portion of the first object starts to vibrate to a time when the vibration detection module detects that the sound generation portion of the first object finishes vibrating;
(3) A separating unit 206, configured to separate a target speech segment from the target speech according to the time period;
(4) A first determining unit 208, configured to use the target speech segment as a first speech segment, and use remaining speech segments in the target speech as a second speech segment, where the first speech segment includes the first speech, and the second speech segment includes the second speech.
Optionally, the mixed voice separation apparatus may be, but is not limited to being, applied on a terminal capable of computation, such as a mobile phone, tablet computer, notebook computer, or PC, which may interact with a server through a network. The network may include, but is not limited to, a wireless network or a wired network, where the wireless network includes WIFI and other networks enabling wireless communication, and the wired network may include, but is not limited to, wide area networks, metropolitan area networks, and local area networks. The server may include, but is not limited to, any hardware device capable of performing computations.
Optionally, the mixed voice separation apparatus can be applied, but is not limited, to separating voice recordings, such as conference recordings, or to separating the recorded audio of two or more parties to a transaction.
For example, to separate a recording of a conversation between two people, the related-art method requires a person to judge which speaker each utterance belongs to. In this scheme, the vibration detection module of an acquisition device detects the vibration of one speaker, so that speaker's talking periods can be detected and the two speakers' sounds separated automatically according to those periods, which improves the efficiency of mixed voice separation. Here the acquisition device is the equipment used for recording and may specifically include: a microphone for receiving sound, a vibration detection module for detecting the vibration of a person's sounding part, a storage module for storing the recording file, a connection module for exchanging information with a server, and the like.
Take, as an example, separating the conversation between a customer and a store clerk in a shop. The clerk carries the acquisition device, which is fitted with a vibration detection module; the module can be located near the clerk's throat so that it touches the throat while the device is in use. The microphone that picks up the sound may be located on the clerk's chest or otherwise carried by the clerk, for example fixed to the chest with a brooch. When the clerk talks with a customer, the clerk's throat vibrates while the clerk is speaking; once the module detects the vibration, the vibration start time and vibration end time are recorded, and that time period is the clerk's speaking period. After the microphone has captured the conversation between the customer and the clerk, the clerk's speech can be cut out of the recording using that time period, and the remaining content is the customer's speech.
Optionally, since the recorded content usually carries only relative time, such as a 20-minute duration, while the vibration start and end times are generally clock times (for example, the vibration starts at one clock time, say 4 p.m., and ends at a later one), the two kinds of time need to be expressed in the same units. One option is to acquire the target natural time point at which recording of the target voice started, where a natural time point is real-world clock time, such as 12 noon, and then map each frame of the target voice to real-world time to obtain the natural time point of each frame. For example, at 30 frames per second, frames 31 through 60 correspond to 12:00:01. This unifies the units of measurement. Alternatively, after the target natural time point at which recording started is acquired, for example 3 p.m., the length of time from that point to the vibration start time, for example 4 p.m., i.e. one hour, is taken as the new vibration start time, and the length of time from that point to the vibration end time, for example 5 p.m., is taken as the new vibration end time. Then, in a target voice 4 hours long, the beginning of the second hour is the vibration start time and the end of the second hour is the vibration end time, i.e. the start and end of the speech of the first object, such as the clerk. This likewise unifies the units of measurement.
It should be noted that a clerk and a customer may speak in turn, in which case there are multiple vibration start times and vibration end times, for example start times of 1 p.m. and 3 p.m. and end times of 2 p.m. and 4 p.m. Starting from the first vibration start time, each vibration start time is paired with the following vibration end time, and the target-voice content corresponding to each pair is extracted and spliced with the content corresponding to the next pair. The spliced result is the clerk's speech, and the remaining content is the customer's speech.
Optionally, there may be several microphones whose recording start times differ, in which case the recording results of the different microphones must first be aligned. For example, if the 1st second of the sound received by the first microphone and the 2nd second of the sound received by the second microphone contain the same content, those two positions are aligned, and the sound is then separated using the method above.
With this embodiment, the mixed voice can be separated without manual judgment, which improves the efficiency of separating mixed voice.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described in detail in a certain embodiment.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and amendments can be made without departing from the principle of the present invention, and these modifications and amendments should also be considered as the protection scope of the present invention.

Claims (8)

1. A method for separating mixed speech, comprising:
acquiring target voices to be separated, wherein the target voices comprise first voices sent by a first object and second voices sent by a second object, and the first voices and the second voices are not overlapped;
acquiring a time period of the first object for emitting the first voice, wherein the time period is from the time of starting vibration to the time of finishing vibration of a sound emitting part of the first object detected by a vibration detection module;
separating a target voice segment from the target voice according to the time period;
taking the target voice segment as a first voice segment, and taking the rest voice segments in the target voice as second voice segments, wherein the first voice segment comprises the first voice, and the second voice segment comprises the second voice;
wherein the acquiring the time period during which the first object utters the first voice comprises:
recording the vibration start time when the vibration detection module detects that the sound-emitting part of the first object starts vibrating; and
recording the vibration end time when the vibration detection module detects that the sound-emitting part of the first object finishes vibrating.
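The separation step of claim 1 can be illustrated in code. The sketch below is not from the patent itself; the function name, the in-memory sample format (a flat sequence of samples), and the sample-rate parameter are all assumptions. It splits a mixed recording into first-object segments (those inside the vibration-detected periods) and second-object segments (everything else).

```python
# Illustrative sketch of claim 1's separation step (names and audio
# representation are assumptions, not the patent's implementation).

def separate_by_vibration(samples, sample_rate, vibration_periods):
    """Split `samples` into (first_segments, second_segments).

    `vibration_periods` is a list of (start_sec, end_sec) pairs recorded
    when the vibration detection module observed the first object's
    sound-emitting part vibrating.
    """
    first_segments, second_segments = [], []
    cursor = 0
    for start_sec, end_sec in sorted(vibration_periods):
        start = int(start_sec * sample_rate)
        end = int(end_sec * sample_rate)
        if cursor < start:                          # audio before this vibration period
            second_segments.append(samples[cursor:start])
        first_segments.append(samples[start:end])   # target voice segment
        cursor = end
    if cursor < len(samples):                       # trailing audio after the last period
        second_segments.append(samples[cursor:])
    return first_segments, second_segments
```

Because the claim assumes the two voices never overlap, everything outside the vibration-detected periods can be attributed wholesale to the second object.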
2. The method according to claim 1, wherein before the acquiring the time period during which the first object utters the first voice, the method further comprises:
when the vibration start time and the vibration end time are natural time points and the time recorded in the target voice is a time period, acquiring the target natural time point at which recording of the target voice started; and
recording, according to the target natural time point, the natural time point of each frame of the recorded target voice into the target voice.
3. The method according to claim 1, wherein before the acquiring the time period during which the first object utters the first voice, the method further comprises:
when the vibration start time and the vibration end time are natural time points and the time recorded in the target voice is a time period, acquiring the target natural time point at which recording of the target voice started;
determining the time length from the target natural time point to the vibration start time as a new vibration start time; and
determining the time length from the target natural time point to the vibration end time as a new vibration end time.
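Claim 3's conversion amounts to subtracting the recording's start timestamp from the absolute ("natural") vibration timestamps, yielding offsets that index directly into the recording. A minimal sketch, with function and parameter names assumed:

```python
# Illustrative sketch of claim 3 (not the patent's code): converting
# absolute "natural" vibration times into offsets from recording start.

def to_recording_offsets(recording_start, vib_start, vib_end):
    """All arguments are absolute timestamps in seconds (e.g. epoch time).

    Returns the new vibration start/end times as the time lengths elapsed
    since `recording_start`.
    """
    return vib_start - recording_start, vib_end - recording_start
```

With these offsets, the time period of claim 1 can be applied directly to the recorded target voice, which only carries relative time.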
4. The method according to any one of claims 1 to 3, wherein the taking the target voice segment as a first voice segment and taking the remaining voice segments in the target voice as a second voice segment comprises:
splicing the target voice segments in chronological order to obtain the first voice segment; and
splicing the remaining voice segments in the target voice in chronological order to obtain the second voice segment.
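The splicing of claim 4 is a chronological concatenation. In the sketch below (names and the `(start_time, samples)` segment representation are assumptions), the same helper serves both the target segments and the remaining segments:

```python
# Illustrative sketch of claim 4's splicing step (segment format assumed).

def splice_chronologically(segments):
    """Concatenate timed segments into one continuous sample list, oldest first.

    `segments` is a list of (start_time, samples) pairs.
    """
    spliced = []
    for _, samples in sorted(segments, key=lambda seg: seg[0]):
        spliced.extend(samples)
    return spliced
```

Sorting by start time before concatenating preserves the original utterance order even when the segments were collected out of order.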
6. A mixed voice separation device, comprising:
a first acquisition unit, configured to acquire a target voice to be separated, wherein the target voice comprises a first voice uttered by a first object and a second voice uttered by a second object, and the first voice and the second voice do not overlap;
a second acquisition unit, configured to acquire a time period during which the first object utters the first voice, wherein the time period extends from the time at which a sound-emitting part of the first object starts vibrating to the time at which it finishes vibrating, as detected by a vibration detection module;
a separation unit, configured to separate a target voice segment from the target voice according to the time period; and
a first determination unit, configured to take the target voice segment as a first voice segment and take the remaining voice segments in the target voice as a second voice segment, wherein the first voice segment comprises the first voice and the second voice segment comprises the second voice;
wherein the second acquisition unit comprises:
a first recording module, configured to record the vibration start time when the vibration detection module detects that the sound-emitting part of the first object starts vibrating; and
a second recording module, configured to record the vibration end time when the vibration detection module detects that the sound-emitting part of the first object finishes vibrating.
6. The device according to claim 5, further comprising:
a third acquisition unit, configured to acquire, before the time period during which the first object utters the first voice is acquired, the target natural time point at which recording of the target voice started when the vibration start time and the vibration end time are natural time points and the time recorded in the target voice is a time period; and
a recording unit, configured to record, according to the target natural time point, the natural time point of each frame of the recorded target voice into the target voice.
7. The device according to claim 5, further comprising:
a fourth acquisition unit, configured to acquire, before the time period during which the first object utters the first voice is acquired, the target natural time point at which recording of the target voice started when the vibration start time and the vibration end time are natural time points and the time recorded in the target voice is a time period;
a second determination unit, configured to determine the time length from the target natural time point to the vibration start time as a new vibration start time; and
a third determination unit, configured to determine the time length from the target natural time point to the vibration end time as a new vibration end time.
8. The device according to any one of claims 5 to 7, wherein the first determination unit comprises:
a first splicing module, configured to splice the target voice segments in chronological order to obtain the first voice segment; and
a second splicing module, configured to splice the remaining voice segments in the target voice in chronological order to obtain the second voice segment.
CN201911175510.1A 2019-11-26 2019-11-26 Mixed voice separation method and device Active CN110808062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911175510.1A CN110808062B (en) 2019-11-26 2019-11-26 Mixed voice separation method and device


Publications (2)

Publication Number Publication Date
CN110808062A CN110808062A (en) 2020-02-18
CN110808062B true CN110808062B (en) 2022-12-13

Family

ID=69491398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911175510.1A Active CN110808062B (en) 2019-11-26 2019-11-26 Mixed voice separation method and device

Country Status (1)

Country Link
CN (1) CN110808062B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583947A (en) * 2020-04-30 2020-08-25 厦门快商通科技股份有限公司 Voice enhancement method, device and equipment
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106791097A (en) * 2016-12-21 2017-05-31 珠海格力电器股份有限公司 Method of speech processing and device, mobile phone
CN108922538A (en) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 Conferencing information recording method, device, computer equipment and storage medium
US10171908B1 (en) * 2015-07-27 2019-01-01 Evernote Corporation Recording meeting audio via multiple individual smartphones
CN109905764A (en) * 2019-03-21 2019-06-18 广州国音智能科技有限公司 Method and device for extracting a target person's voice from a video
CN110322869A (en) * 2019-05-21 2019-10-11 平安科技(深圳)有限公司 Conference role-based speech synthesis method, device, computer equipment and storage medium



Similar Documents

Publication Publication Date Title
CN108766418B (en) Voice endpoint recognition method, device and equipment
US9064160B2 (en) Meeting room participant recogniser
JP6721298B2 (en) Voice information control method and terminal device
CN105259459B (en) Automation quality detecting method, device and the equipment of a kind of electronic equipment
US20130304243A1 (en) Method for synchronizing disparate content files
CN110808062B (en) Mixed voice separation method and device
CN107918771A (en) Character recognition method and Worn type person recognition system
CN108696763A (en) Advertisement broadcast method and device
CN111128212A (en) Mixed voice separation method and device
CN109509472A (en) Method, apparatus and system based on voice platform identification background music
CN104092809A (en) Communication sound recording method and recorded communication sound playing method and device
CN110428798B (en) Method for synchronizing voice and accompaniment, Bluetooth device, terminal and storage medium
JP2021061527A (en) Information processing apparatus, information processing method, and information processing program
US20120242860A1 (en) Arrangement and method relating to audio recognition
CN111311774A (en) Sign-in method and system based on voice recognition
CN108763475B (en) Recording method, recording device and terminal equipment
CN110556114B (en) Speaker identification method and device based on attention mechanism
CN112562644A (en) Customer service quality inspection method, system, equipment and medium based on human voice separation
US20220335949A1 (en) Conference Data Processing Method and Related Device
CN112071315A (en) Portable information acquisition device, method, storage medium and electronic device
JP2008109686A (en) Voice conference terminal device and program
CN109785855B (en) Voice processing method and device, storage medium and processor
CN205812273U (en) The machine shake test fixture of a kind of audio output apparatus and system
CN116472705A (en) Conference content display method, conference system and conference equipment
CN111667837A (en) Conference record acquisition method, intelligent terminal and device with storage function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant