CN109036386A - Speech processing method and device - Google Patents
Speech processing method and device Download PDF Info
- Publication number
- CN109036386A CN109036386A CN201811076321.4A CN201811076321A CN109036386A CN 109036386 A CN109036386 A CN 109036386A CN 201811076321 A CN201811076321 A CN 201811076321A CN 109036386 A CN109036386 A CN 109036386A
- Authority
- CN
- China
- Prior art keywords
- sound
- bic
- detection
- speech segment
- segments
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Telephone Function (AREA)
- Telephonic Communication Services (AREA)
Abstract
The present invention provides a speech processing method and device. The method comprises: dividing mixed speech into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2; performing Bayesian information criterion (BIC) detection on any two adjacent segments among the N speech segments, and discarding the segments for which the BIC detection reports an anomaly, to obtain the valid speech segments of the target object. The invention solves the problem in the related art that a target speaker's speech cannot be quickly and effectively separated from mixed speech that consists mainly of that speaker, and achieves fast separation of the target speaker's speech from mixed speech.
Description
Technical field
The present invention relates to the field of communications, and in particular to a speech processing method and device.
Background technique
The original schemes that perform speaker turning-point detection based on the Bayesian information criterion (BIC) aim at separation, usually with the ultimate goal of separating the mixed speech of several speakers. Technically they make no assumption about the position of a turning point, and they generally try to retain the speech data of every speaker as far as possible. In addition, such methods are rarely used alone; they are typically combined with, for example, computing distances between different data distributions and clustering. For occasions where one particular speaker dominates the speech duration, the speech of other people or noise is comparatively short, the speech content matters little, and the speaker characteristics matter more, separation-oriented schemes have been proposed. For this kind of problem, the current solutions are complex, their results are unsatisfactory, and no mature solution exists.
For the problem in the related art that a target speaker's speech cannot be quickly and effectively separated from mixed speech that consists mainly of that speaker, no solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a speech processing method and device, so as at least to solve the problem in the related art that a target speaker's speech cannot be quickly and effectively separated from mixed speech that consists mainly of that speaker.
According to an embodiment of the present invention, a speech processing method is provided, comprising:
dividing mixed speech into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2;
performing Bayesian information criterion (BIC) detection on any two adjacent segments among the N speech segments, and discarding the segments for which the BIC detection reports an anomaly, to obtain the valid speech segments of the target object.
Optionally, performing BIC detection on any two adjacent segments among the N speech segments, and discarding the segments for which the BIC detection reports an anomaly, comprises: performing BIC detection on pairs of adjacent segments among the N speech segments; judging whether the two segments under BIC detection are anomalous; if the judgment result is yes, discarding the two anomalous segments; and repeating BIC detection on adjacent pairs among the remaining N-2 segments, discarding anomalous pairs, until no remaining pair of adjacent segments is anomalous.
Optionally, judging whether the two segments under BIC detection are anomalous comprises: judging whether the BIC value between the two segments is greater than a predetermined threshold; if the judgment result is yes, determining that the two segments are anomalous; if the judgment result is no, determining that the two segments are normal.
Optionally, performing BIC detection on two segments among the N speech segments, and discarding the segments for which the BIC detection reports an anomaly, comprises: performing BIC detection on segment pairs in the N speech segments, where a segment pair is two segments among the N speech segments; judging whether each segment pair under BIC detection is anomalous, to obtain a detection result; and discarding the segment pairs whose detection result is anomalous.
Optionally, judging whether a segment pair under BIC detection is anomalous comprises: judging whether the BIC value of the segment pair is greater than a predetermined threshold; if the judgment result is yes, determining that the pair is anomalous; if the judgment result is no, determining that the pair is normal.
Optionally, performing BIC detection on two segments among the N speech segments comprises: calculating the BIC value between the two segments; and normalizing the BIC value.
Optionally, dividing mixed speech into N speech segments by endpoint detection comprises: obtaining the silent sections in the mixed speech; removing the silent sections; splitting the mixed speech at the silent sections to obtain long speech segments; and dividing the long speech segments into the N speech segments by endpoint detection.
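A minimal sketch of the silence-based splitting described above, using frame energy as the voice-activity cue. This is illustrative only: the frame length, the energy-ratio threshold, the minimum silence run, and the names (`split_on_silence`, `energy_ratio`, `min_silence_frames`) are all assumptions, not the patent's implementation.

```python
import numpy as np

def split_on_silence(signal, rate, frame_ms=20, energy_ratio=0.1,
                     min_silence_frames=5):
    """Split a 1-D signal into voiced chunks at silent stretches.

    A frame counts as silent when its RMS energy falls below
    energy_ratio times the mean frame energy; a run of at least
    min_silence_frames silent frames ends the current chunk.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt((frames ** 2).mean(axis=1))
    silent = energy < energy_ratio * energy.mean()

    segments, start, run = [], None, 0
    for i, is_silent in enumerate(silent):
        if is_silent:
            run += 1
            if run >= min_silence_frames and start is not None:
                # Close the chunk at the first frame of this silent run.
                end = (i - run + 1) * frame_len
                if end > start:
                    segments.append(signal[start:end])
                start = None
        else:
            if start is None:
                start = i * frame_len
            run = 0
    if start is not None:
        segments.append(signal[start:n_frames * frame_len])
    return segments
```

On a toy ones-zeros-ones signal this returns the two voiced chunks with the silent stretch removed.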
According to another embodiment of the present invention, a speech processing device is further provided, comprising: a dividing module, configured to divide mixed speech into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2; and a detection module, configured to perform Bayesian information criterion (BIC) detection on any two adjacent segments among the N speech segments and discard the segments for which the BIC detection reports an anomaly, to obtain the valid speech segments of the target object.
Optionally, the detection module comprises: a detection unit, configured to perform BIC detection on pairs of adjacent segments among the N speech segments; a judging unit, configured to judge whether the two segments under BIC detection are anomalous; a discarding unit, configured to discard the two anomalous segments if the judgment result is yes; and a repeat-detection unit, configured to repeat BIC detection on adjacent pairs among the remaining N-2 segments and discard anomalous pairs, until no remaining pair of adjacent segments is anomalous.
Optionally, the judging unit is further configured to: judge whether the BIC value between the two segments is greater than a predetermined threshold; if the judgment result is yes, determine that the two segments are anomalous; if the judgment result is no, determine that the two segments are normal.
Optionally, the detection module comprises: a computing unit, configured to calculate the BIC value between two segments; and a processing unit, configured to normalize the BIC value.
Optionally, the dividing module comprises: an obtaining unit, configured to obtain the silent sections in the mixed speech; a removing unit, configured to remove the silent sections; a first splitting unit, configured to split the mixed speech at the silent sections to obtain long speech segments; and a second splitting unit, configured to divide the long speech segments into the N speech segments by endpoint detection.
According to yet another embodiment of the present invention, a storage medium is further provided, the storage medium storing a computer program, wherein the computer program is configured to perform, when run, the steps in any of the above method embodiments.
According to yet another embodiment of the present invention, an electronic device is further provided, comprising a memory and a processor, the memory storing a computer program, the processor being configured to run the computer program so as to perform the steps in any of the above method embodiments.
Through the invention, mixed speech is divided into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2; BIC detection is performed on any two adjacent segments among the N speech segments, and the segments for which the BIC detection reports an anomaly are discarded, yielding the valid speech segments of the target object. This solves the problem in the related art that a target speaker's speech cannot be quickly and effectively separated from mixed speech that consists mainly of that speaker, and achieves fast separation of the target speaker's speech from mixed speech.
Detailed description of the invention
The drawings described herein are provided for a further understanding of the present invention and constitute a part of this application; the illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not improperly limit it. In the drawings:
Fig. 1 is a hardware block diagram of a mobile terminal running a speech processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a speech processing method according to an embodiment of the present invention;
Fig. 3 is a block diagram of a speech processing device according to an embodiment of the present invention;
Fig. 4 is a first block diagram of a speech processing device according to a preferred embodiment of the present invention;
Fig. 5 is a second block diagram of a speech processing device according to a preferred embodiment of the present invention.
Specific embodiment
Hereinafter, the present invention is described in detail with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features therein may be combined with each other. It should also be noted that the terms "first", "second", etc. in the description, claims and drawings are used to distinguish similar objects, not to describe a particular order or sequence.
Embodiment 1
The method embodiment provided in Embodiment 1 of the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Taking a mobile terminal as an example, Fig. 1 is a hardware block diagram of a mobile terminal running a speech processing method according to an embodiment of the present invention. As shown in Fig. 1, the mobile terminal 10 may comprise one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing unit such as a microcontroller (MCU) or a programmable logic device such as an FPGA) and a memory 104 for storing data; optionally, the mobile terminal may further comprise a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will appreciate that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the mobile terminal; for example, the mobile terminal 10 may comprise more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 may be used to store computer programs, for example the software programs and modules of application software, such as the computer program corresponding to the speech processing method in the embodiments of the present invention. By running the computer program stored in the memory 104, the processor 102 executes various function applications and data processing, i.e., implements the above method. The memory 104 may comprise high-speed random access memory, and may also comprise non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further comprise memory located remotely relative to the processor 102, and such remote memory may be connected to the mobile terminal 10 via a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of the network may include a wireless network provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 comprises a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
This embodiment provides a speech processing method running on the above mobile terminal or network architecture. Fig. 2 is a flowchart of a speech processing method according to an embodiment of the present invention. As shown in Fig. 2, the flow comprises the following steps:
Step S202: dividing mixed speech into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2;
Step S204: performing Bayesian information criterion (BIC) detection on any two adjacent segments among the N speech segments, and discarding the segments for which the BIC detection reports an anomaly, to obtain the valid speech segments of the target object.
Through the above steps, mixed speech is divided into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2; BIC detection is performed on any two adjacent segments among the N speech segments, and the segments for which the BIC detection reports an anomaly are discarded, yielding the valid speech segments of the target object. This solves the problem in the related art that a target speaker's speech cannot be quickly and effectively separated from mixed speech that consists mainly of that speaker, and achieves fast separation of the target speaker's speech from mixed speech.
In the embodiments of the present invention, there are various ways to perform BIC detection on two adjacent segments among the N speech segments and discard the anomalous segments: any two of the N segments may be detected, or every two may be detected in turn in the chronological order of the speech. In an optional embodiment, the process specifically comprises: performing BIC detection on pairs of adjacent segments among the N speech segments; judging whether the two segments under BIC detection are anomalous; if the judgment result is yes, discarding the two anomalous segments; and repeating BIC detection on adjacent pairs among the remaining N-2 segments, discarding anomalous pairs, until no remaining pair of adjacent segments is anomalous. Further, judging whether the two segments under BIC detection are anomalous may comprise: judging whether the BIC value between the two segments is greater than a predetermined threshold; if yes, determining that the two segments are anomalous; if no, determining that the two segments are normal.
In another optional embodiment, performing BIC detection on two segments among the N speech segments and discarding the anomalous segments comprises: performing BIC detection on segment pairs in the N speech segments, where a segment pair is any two segments among the N speech segments; judging whether each segment pair under BIC detection is anomalous, to obtain a detection result; and discarding the segment pairs whose detection result is anomalous. Further, judging whether a segment pair under BIC detection is anomalous may comprise: judging whether the BIC value of the segment pair is greater than a predetermined threshold; if yes, determining that the pair is anomalous; if no, determining that the pair is normal.
In the embodiments of the present invention, performing BIC detection on two segments among the N speech segments may specifically comprise: calculating the BIC value between the two segments, and normalizing the BIC value.
In the embodiments of the present invention, dividing mixed speech into N speech segments by endpoint detection may specifically comprise: obtaining the silent sections in the mixed speech; removing the silent sections; splitting the mixed speech at the silent sections to obtain long speech segments; and dividing the long speech segments into the N speech segments by endpoint detection. Endpoint detection is a basic link of speech recognition and speech processing, and a hot field of speech recognition research. Its main purpose is to distinguish speech from non-speech in the input signal, so that the silent portions can be removed and the valid speech in the input can be obtained.
The embodiments of the present invention target a specific kind of mixed speech: one particular speaker dominates the speech duration, while the speech of other people or noise is comparatively short. For occasions where the speech content matters little and the speaker characteristics matter more, a scheme aimed at detection is proposed; by weakening the objective, the algorithm is improved, the result is promoted, and a relatively clean recording of the specific speaker is guaranteed. The specific method is as follows.
An assumption is made about where speaker turning points occur: a turning point is assumed to appear only on the boundary of a speech segment produced by endpoint detection. The benefit: if a piece of mixed speech has 100,000 sample points and yields 100 speech segments after endpoint detection, the original BIC detection would have to consider all 100,000 points as possible turning points, whereas in the improved scheme of the embodiments of the present invention turning points can only lie at the head and tail of the 100 segments, which greatly improves computational efficiency.
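The arithmetic behind the efficiency claim above, using the passage's own figures; counting only the 99 internal boundaries between 100 segments is an assumption (counting every segment's head and tail changes the constant, not the conclusion):

```python
n_samples = 100_000              # sample points in the mixed speech
n_segments = 100                 # speech segments after endpoint detection

sample_level_candidates = n_samples       # original BIC: any sample may be a turning point
boundary_candidates = n_segments - 1      # improved scheme: internal segment boundaries only

print(sample_level_candidates // boundary_candidates)  # → 1010
```

Even with this coarse count, the candidate set shrinks by roughly three orders of magnitude.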
Any two segments found anomalous by BIC detection are directly discarded, and BIC is then calculated pairwise among the remaining segments, until every remaining segment has had at least one BIC value calculated.
If anomalies appear during the current round of BIC calculation, the above steps are repeated until a round produces no BIC anomaly. At that point the remaining segments are recombined, which is exactly the specific speaker's speech that has passed BIC detection; this speech contains neither noise nor the speech of non-target speakers.
To reduce the influence of differing segment lengths and distributions on the BIC calculation, the BIC value is normalized, specifically by the maximum number of sample points and the current BIC value. In the standard single-Gaussian form consistent with the quantities defined here, the BIC between two fragments is
ΔBIC = (N/2)·ln σ − (N1/2)·ln σ1 − (N2/2)·ln σ2 − λ·ln N
where N1 is the length (number of sample points) of fragment 1, N2 is the length of fragment 2, N is the length after fragments 1 and 2 are merged, σ1 is the variance of the sample points of fragment 1, σ2 is the variance of the sample points of fragment 2, σ is the variance of the sample points after fragments 1 and 2 are merged, and λ is a coefficient. The BIC above is normalized to ΔBIC/F, where F is a function of the sample-point variance and length, determined from real data and experience.
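A sketch over the quantities just defined, for 1-D sample arrays. The delta-BIC form is the standard single-Gaussian one consistent with the symbols above; dividing by the merged length in `normalized_bic` is only one plausible reading of the normalization, since the patent's exact normalizing function F is not reproduced here, and both function names are illustrative.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Single-Gaussian delta-BIC between two 1-D sample arrays.

    Large positive values suggest the two segments are better
    modelled by separate Gaussians (e.g. different speakers).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    merged = np.concatenate([x, y])
    v, v1, v2 = merged.var(), x.var(), y.var()
    eps = 1e-12  # guard against zero variance on constant segments
    return (0.5 * n * np.log(v + eps)
            - 0.5 * n1 * np.log(v1 + eps)
            - 0.5 * n2 * np.log(v2 + eps)
            - lam * np.log(n))

def normalized_bic(x, y, lam=1.0):
    """Scale delta-BIC by the merged length so that thresholds stay
    comparable across segment sizes (normalization choice assumed)."""
    return delta_bic(x, y, lam) / (len(x) + len(y))
```

Two segments drawn from the same distribution score low; segments with clearly different statistics score high, which is what the predetermined threshold separates.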
The threshold chosen for the above endpoint detection has a large influence on the BIC, reflected in the sharpness of the BIC signal and in false alarms, and mainly in the signal sharpness. The present invention chooses a suitable endpoint-detection threshold experimentally, so as to improve the sharpness of the BIC signal.
The difficulty of applying traditional BIC in practice is that a BIC anomaly is defined relative to a threshold, and this threshold is usually specific to each pair of speech segments whose BIC value is calculated; it has no universality, and a global BIC threshold is hard to obtain. It is worth pointing out that, with the above improvement, the BIC signal is normalized, and since the endpoint-detection threshold is controlled, the BIC signal is comparatively accurate and effective. The BIC threshold of the embodiments of the present invention is therefore essentially global and no longer needs to be chosen per call from the data, which guarantees the practical effect of the embodiments of the present invention.
Embodiment 2
This embodiment further provides a speech processing device, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and conceivable.
Fig. 3 is a block diagram of a speech processing device according to an embodiment of the present invention. As shown in Fig. 3, the device comprises: a dividing module 32, configured to divide mixed speech into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2; and a detection module 34, configured to perform Bayesian information criterion (BIC) detection on any two adjacent segments among the N speech segments and discard the segments for which the BIC detection reports an anomaly, to obtain the valid speech segments of the target object.
Fig. 4 is a first block diagram of a speech processing device according to a preferred embodiment of the present invention. As shown in Fig. 4, the detection module 34 comprises: a detection unit 42, configured to perform BIC detection on segment pairs among the N speech segments; a judging unit 44, configured to judge whether the two segments under BIC detection are anomalous; a discarding unit 46, configured to discard the two anomalous segments if the judgment result is yes; and a repeat-detection unit 48, configured to repeat BIC detection on segment pairs among the remaining N-2 segments and discard anomalous pairs, until no remaining pair of segments is anomalous.
Optionally, the judging unit 44 is further configured to: judge whether the BIC value between the two segments is greater than a predetermined threshold; if the judgment result is yes, determine that the two segments are anomalous; if the judgment result is no, determine that the two segments are normal.
Fig. 5 is a second block diagram of a speech processing device according to a preferred embodiment of the present invention. As shown in Fig. 5, the detection module 34 comprises: a computing unit 52, configured to calculate the BIC value between any two segments; and a processing unit 54, configured to normalize the BIC value.
Optionally, the dividing module 32 comprises: an obtaining unit, configured to obtain the silent sections in the mixed speech; a removing unit, configured to remove the silent sections; a first splitting unit, configured to split the mixed speech at the silent sections to obtain long speech segments; and a second splitting unit, configured to divide the long speech segments into the N speech segments by endpoint detection.
It should be noted that the above modules may be implemented in software or hardware; in the latter case, this may be achieved, without limitation, as follows: the above modules are all located in the same processor, or the above modules are located in different processors in arbitrary combinations.
Embodiment 3
The embodiments of the present invention further provide a storage medium storing a computer program, wherein the computer program is configured to perform, when run, the steps in any of the above method embodiments. Optionally, in this embodiment, the storage medium may be configured to store a computer program for performing the following steps:
S11: dividing mixed speech into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2;
S12: performing Bayesian information criterion (BIC) detection on any two adjacent segments among the N speech segments, and discarding the segments for which the BIC detection reports an anomaly, to obtain the valid speech segments of the target object.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media capable of storing a computer program.
Embodiment 4
The embodiments of the present invention further provide an electronic device comprising a memory and a processor, the memory storing a computer program, the processor being configured to run the computer program to perform the steps in any of the above method embodiments. Optionally, the electronic device may further comprise a transmission device and an input/output device, both connected to the processor.
Optionally, in this embodiment, the processor may be configured to perform the following steps by means of the computer program:
S11: dividing mixed speech into N speech segments by endpoint detection, where N is a natural number greater than or equal to 2;
S12: performing Bayesian information criterion (BIC) detection on any two adjacent segments among the N speech segments, and discarding the segments for which the BIC detection reports an anomaly, to obtain the valid speech segments of the target object.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented with a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in a different order than here, or they may be made into individual integrated circuit modules, or multiple of the modules or steps may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the principle of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A speech processing method, comprising:
dividing mixed speech into N speech segments by endpoint detection, wherein N is a natural number greater than or equal to 2;
performing Bayesian information criterion (BIC) detection on any two adjacent speech segments among the N speech segments, and discarding speech segments for which the BIC detection indicates an abnormality, to obtain valid speech segments of a target object.
2. The method according to claim 1, wherein performing BIC detection on any two adjacent speech segments among the N speech segments and discarding speech segments for which the BIC detection indicates an abnormality comprises:
performing BIC detection on two adjacent speech segments among the N speech segments;
judging whether the BIC detection indicates that the two speech segments are abnormal;
if the judgment result is yes, discarding the two speech segments indicated as abnormal;
repeatedly performing BIC detection on two adjacent speech segments among the remaining N-2 speech segments, and discarding the two speech segments indicated as abnormal, until no abnormality occurs between any remaining adjacent speech segments.
3. The method according to claim 2, wherein judging whether the BIC detection indicates that the two speech segments are abnormal comprises:
judging whether the BIC value between the two speech segments is greater than a predetermined threshold;
if the judgment result is yes, determining that the two speech segments are abnormal;
if the judgment result is no, determining that the two speech segments are normal.
4. The method according to any one of claims 1 to 3, wherein performing BIC detection on two speech segments among the N speech segments comprises:
calculating the BIC value between the two speech segments;
normalizing the BIC value.
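Claim 4's normalization step is not spelled out in the specification. One common convention is to divide the ΔBIC value by the number of frames involved, so that the single predetermined threshold of claim 3 applies uniformly to segment pairs of different lengths. The sketch below assumes that convention; the function names and the default threshold are illustrative, not from the patent.

```python
def normalized_bic(delta_bic, n_frames):
    """Length-normalize a delta-BIC value so one threshold works for
    segment pairs of different durations (an assumed convention)."""
    return delta_bic / max(n_frames, 1)

def is_abnormal(delta_bic, n_frames, threshold=0.5):
    """Claims 3-4 style check: normalize the BIC value, then compare it
    with a predetermined threshold; True means the pair is abnormal."""
    return normalized_bic(delta_bic, n_frames) > threshold
```

Under this reading, a raw ΔBIC of 100 over 100 frames normalizes to 1.0 and exceeds a threshold of 0.5, while the same raw value over 1000 frames normalizes to 0.1 and does not.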
5. The method according to any one of claims 1 to 3, wherein dividing the mixed speech into N speech segments by endpoint detection comprises:
obtaining silent sections in the mixed speech;
removing the silent sections;
splitting the mixed speech according to the silent sections to obtain long speech segments;
dividing the long speech segments into the N speech segments by endpoint detection.
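Claim 5's silence detection is likewise unspecified; a frame-energy criterion is the simplest plausible reading, in which frames whose mean energy falls below a threshold are treated as silence and removed, and each run of consecutive speech frames becomes one segment. The sketch below assumes that reading; the frame length, threshold, and function name are illustrative.

```python
import numpy as np

def endpoint_split(samples, frame_len=160, energy_thresh=1e-3):
    """Energy-based endpoint detection sketch: classify fixed-length
    frames as speech/silence, drop the silent frames, and return each
    run of consecutive speech frames as one segment (claim 5 reading)."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    speech = energy > energy_thresh

    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i                      # a speech run begins
        elif not is_speech and start is not None:
            segments.append(frames[start:i].ravel())  # run ends at silence
            start = None
    if start is not None:                  # run extends to the end
        segments.append(frames[start:].ravel())
    return segments
```

On a signal alternating between silence and activity, this returns one array per contiguous active region, which is the segment list that the BIC stage of claim 1 then consumes.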
6. A speech processing apparatus, comprising:
a dividing module, configured to divide mixed speech into N speech segments by endpoint detection, wherein N is a natural number greater than or equal to 2;
a detection module, configured to perform Bayesian information criterion (BIC) detection on any two adjacent speech segments among the N speech segments, and discard speech segments for which the BIC detection indicates an abnormality, to obtain valid speech segments of a target object.
7. The apparatus according to claim 6, wherein the detection module comprises:
a detection unit, configured to perform BIC detection on two adjacent speech segments among the N speech segments;
a judging unit, configured to judge whether the BIC detection indicates that the two speech segments are abnormal;
a discarding unit, configured to discard the two speech segments indicated as abnormal if the judgment result is yes;
a repetition detection unit, configured to repeatedly perform BIC detection on two adjacent speech segments among the remaining N-2 speech segments, and discard the two speech segments indicated as abnormal, until no abnormality occurs between any remaining adjacent speech segments.
8. The apparatus according to claim 7, wherein the judging unit is further configured to:
judge whether the BIC value between the two speech segments is greater than a predetermined threshold;
if the judgment result is yes, determine that the two speech segments are abnormal;
if the judgment result is no, determine that the two speech segments are normal.
9. A storage medium, wherein a computer program is stored in the storage medium, and the computer program is configured to perform the method according to any one of claims 1 to 5 when run.
10. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811076321.4A CN109036386B (en) | 2018-09-14 | 2018-09-14 | Voice processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811076321.4A CN109036386B (en) | 2018-09-14 | 2018-09-14 | Voice processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109036386A true CN109036386A (en) | 2018-12-18 |
CN109036386B CN109036386B (en) | 2021-03-16 |
Family
ID=64622220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811076321.4A Active CN109036386B (en) | 2018-09-14 | 2018-09-14 | Voice processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036386B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390946A (en) * | 2019-07-26 | 2019-10-29 | 龙马智芯(珠海横琴)科技有限公司 | A kind of audio signal processing method, device, electronic equipment and storage medium |
CN111343344A (en) * | 2020-03-13 | 2020-06-26 | Oppo(重庆)智能科技有限公司 | Voice abnormity detection method and device, storage medium and electronic equipment |
CN111613249A (en) * | 2020-05-22 | 2020-09-01 | 云知声智能科技股份有限公司 | Voice analysis method and equipment |
CN111883159A (en) * | 2020-08-05 | 2020-11-03 | 龙马智芯(珠海横琴)科技有限公司 | Voice processing method and device |
CN112562635A (en) * | 2020-12-03 | 2021-03-26 | 云知声智能科技股份有限公司 | Method, device and system for solving pulse signal generation at splicing position in voice synthesis |
CN112735470A (en) * | 2020-12-28 | 2021-04-30 | 携程旅游网络技术(上海)有限公司 | Audio cutting method, system, device and medium based on time delay neural network |
CN112951212A (en) * | 2021-04-19 | 2021-06-11 | 中国科学院声学研究所 | Voice turning point detection method and device for multiple speakers |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1716380A (en) * | 2005-07-26 | 2006-01-04 | 浙江大学 | Audio frequency splitting method for changing detection based on decision tree and speaking person |
CN101315771A (en) * | 2008-06-04 | 2008-12-03 | 哈尔滨工业大学 | Compensation method for different speech coding influence in speaker recognition |
CN102034472A (en) * | 2009-09-28 | 2011-04-27 | 戴红霞 | Speaker recognition method based on Gaussian mixture model embedded with time delay neural network |
CN102682760A (en) * | 2011-03-07 | 2012-09-19 | 株式会社理光 | Overlapped voice detection method and system |
CN103559882A (en) * | 2013-10-14 | 2014-02-05 | 华南理工大学 | Meeting presenter voice extracting method based on speaker division |
CN104021785A (en) * | 2014-05-28 | 2014-09-03 | 华南理工大学 | Method of extracting speech of most important guest in meeting |
CN105304082A (en) * | 2015-09-08 | 2016-02-03 | 北京云知声信息技术有限公司 | Voice output method and voice output device |
US20160241346A1 (en) * | 2015-02-17 | 2016-08-18 | Adobe Systems Incorporated | Source separation using nonnegative matrix factorization with an automatically determined number of bases |
CN107393527A (en) * | 2017-07-17 | 2017-11-24 | 广东讯飞启明科技发展有限公司 | The determination methods of speaker's number |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1716380A (en) * | 2005-07-26 | 2006-01-04 | 浙江大学 | Audio frequency splitting method for changing detection based on decision tree and speaking person |
CN101315771A (en) * | 2008-06-04 | 2008-12-03 | 哈尔滨工业大学 | Compensation method for different speech coding influence in speaker recognition |
CN102034472A (en) * | 2009-09-28 | 2011-04-27 | 戴红霞 | Speaker recognition method based on Gaussian mixture model embedded with time delay neural network |
CN102682760A (en) * | 2011-03-07 | 2012-09-19 | 株式会社理光 | Overlapped voice detection method and system |
CN103559882A (en) * | 2013-10-14 | 2014-02-05 | 华南理工大学 | Meeting presenter voice extracting method based on speaker division |
CN104021785A (en) * | 2014-05-28 | 2014-09-03 | 华南理工大学 | Method of extracting speech of most important guest in meeting |
US20160241346A1 (en) * | 2015-02-17 | 2016-08-18 | Adobe Systems Incorporated | Source separation using nonnegative matrix factorization with an automatically determined number of bases |
CN105304082A (en) * | 2015-09-08 | 2016-02-03 | 北京云知声信息技术有限公司 | Voice output method and voice output device |
CN107393527A (en) * | 2017-07-17 | 2017-11-24 | 广东讯飞启明科技发展有限公司 | The determination methods of speaker's number |
Non-Patent Citations (3)
Title |
---|
PATRICK KENNY: "Joint Factor Analysis Versus Eigenchannels", 《AUDIO,SPEECH,AND LANGUAGE PROCESSING》 * |
YANG, Dengzhou et al.: "Speaker Change Detection Based on Computational Auditory Scene Analysis", Computer Engineering * |
LAI, Songxuan: "Initial Cluster Generation Method for Speaker Clustering", Computer Engineering and Applications * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390946A (en) * | 2019-07-26 | 2019-10-29 | 龙马智芯(珠海横琴)科技有限公司 | A kind of audio signal processing method, device, electronic equipment and storage medium |
CN111343344A (en) * | 2020-03-13 | 2020-06-26 | Oppo(重庆)智能科技有限公司 | Voice abnormity detection method and device, storage medium and electronic equipment |
CN111613249A (en) * | 2020-05-22 | 2020-09-01 | 云知声智能科技股份有限公司 | Voice analysis method and equipment |
CN111883159A (en) * | 2020-08-05 | 2020-11-03 | 龙马智芯(珠海横琴)科技有限公司 | Voice processing method and device |
CN112562635A (en) * | 2020-12-03 | 2021-03-26 | 云知声智能科技股份有限公司 | Method, device and system for solving pulse signal generation at splicing position in voice synthesis |
CN112562635B (en) * | 2020-12-03 | 2024-04-09 | 云知声智能科技股份有限公司 | Method, device and system for solving generation of pulse signals at splicing position in speech synthesis |
CN112735470A (en) * | 2020-12-28 | 2021-04-30 | 携程旅游网络技术(上海)有限公司 | Audio cutting method, system, device and medium based on time delay neural network |
CN112735470B (en) * | 2020-12-28 | 2024-01-23 | 携程旅游网络技术(上海)有限公司 | Audio cutting method, system, equipment and medium based on time delay neural network |
CN112951212A (en) * | 2021-04-19 | 2021-06-11 | 中国科学院声学研究所 | Voice turning point detection method and device for multiple speakers |
Also Published As
Publication number | Publication date |
---|---|
CN109036386B (en) | 2021-03-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109036386A (en) | A kind of method of speech processing and device | |
CN109584876B (en) | Voice data processing method and device and voice air conditioner | |
CN103440867B (en) | Audio recognition method and system | |
CN110517667A (en) | A kind of method of speech processing, device, electronic equipment and storage medium | |
CN104781862B (en) | Real-time traffic is detected | |
CN112400310A (en) | Voice-based call quality detector | |
CN110390946A (en) | A kind of audio signal processing method, device, electronic equipment and storage medium | |
CN108877783A (en) | The method and apparatus for determining the audio types of audio data | |
CN109065051A (en) | Voice recognition processing method and device | |
CN111816216A (en) | Voice activity detection method and device | |
CN112581938B (en) | Speech breakpoint detection method, device and equipment based on artificial intelligence | |
CN109740530A (en) | Extracting method, device, equipment and the computer readable storage medium of video-frequency band | |
CN113129876A (en) | Network searching method and device, electronic equipment and storage medium | |
CN113329372B (en) | Method, device, equipment, medium and product for vehicle-mounted call | |
CN107196979A (en) | Pre- system for prompting of calling out the numbers based on speech recognition | |
CN112562727A (en) | Audio scene classification method, device and equipment applied to audio monitoring | |
JP2000163098A (en) | Voice recognition device | |
CN114038487A (en) | Audio extraction method, device, equipment and readable storage medium | |
US7571093B1 (en) | Method of identifying duplicate voice recording | |
CN114049898A (en) | Audio extraction method, device, equipment and storage medium | |
CN112053686B (en) | Audio interruption method, device and computer readable storage medium | |
US11322137B2 (en) | Video camera | |
EP3309777A1 (en) | Device and method for audio frame processing | |
CN114005436A (en) | Method, device and storage medium for determining voice endpoint | |
EP3171360B1 (en) | Speech recognition with determination of noise suppression processing mode |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20230105 Address after: Room 502 and Room 504, Jiayuan Office Building, No. 369, Yuelu Avenue, Xianjiahu Street, Yuelu District, Changsha City, Hunan Province 410205 Patentee after: Hunan Huawei Jin'an Enterprise Management Co.,Ltd. Address before: 100080 370m south of Huandao, Yanfu Road, Yancun Town, Fangshan District, Beijing Patentee before: BEIJING WANGZHONG GONGCHUANG TECHNOLOGY CO.,LTD. |
TR01 | Transfer of patent right |