CN110491409A - Method, apparatus, storage medium and electronic device for separating a mixed speech signal - Google Patents

Method, apparatus, storage medium and electronic device for separating a mixed speech signal

Info

Publication number
CN110491409A
CN110491409A (application CN201910736585.6A; granted as CN110491409B)
Authority
CN
China
Prior art keywords
angle
target
matrix
weight coefficient
spatial domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910736585.6A
Other languages
Chinese (zh)
Other versions
CN110491409B (en)
Inventor
顾容之
陈联武
张世雄
徐勇
于蒙
苏丹
俞栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910736585.6A priority Critical patent/CN110491409B/en
Publication of CN110491409A publication Critical patent/CN110491409A/en
Application granted granted Critical
Publication of CN110491409B publication Critical patent/CN110491409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention provides a method, apparatus, storage medium and electronic device for separating a mixed speech signal, comprising: acquiring a mixed speech signal collected by a voice acquisition device, the mixed speech signal including speech uttered by at least two target objects; obtaining a first frequency-domain matrix formed from the frequency-domain features of the mixed speech signal and a first spatial matrix formed from its spatial features; determining, among the angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as the target angle; weighting the first spatial matrix with a weight coefficient corresponding to the target angle to obtain a second spatial matrix; and inputting the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain, as output of the target neural network model, multiple speech signals separated from the mixed speech signal in one-to-one correspondence with the at least two target objects. The present invention thereby solves the problem that the performance of speech separation methods declines when the target angle is small.

Description

Method, apparatus, storage medium and electronic device for separating a mixed speech signal
Technical field
The present invention relates to the field of communications, and in particular to a method, apparatus, storage medium and electronic device for separating a mixed speech signal.
Background art
Speech recognition and interaction tasks in complex scenes often face challenges such as overlapping speech from multiple speakers and room reverberation, and a robust speech recognition system cannot do without front-end speech signal separation and enhancement modules. Recently, deep-learning-based multi-channel speech separation methods for complex acoustic environments have received extensive attention from both academia and industry.

When the speakers are far apart in space, their spatial positions differ greatly, so the spatial information is highly discriminative, which helps a multi-channel separation network separate the speech; compared with a single-channel speech separation system that uses only frequency-domain features, this brings a significant performance gain. However, when the speakers are close to each other, the angle between them relative to the microphone array that collects the speech is small, and the spatial features are then no longer discriminative. Spatial features without discriminability confuse the separation network and make its performance significantly worse than that of a single-channel speech separation system.

The prior art addresses this problem by switching between the outputs of a single-channel and a multi-channel speech separation system. It judges whether the spatial positions of the speakers in the current mixed speech overlap or are close; in the overlapping or close case, the single-channel speech separation system is selected to separate the mixed speech and its single-channel separation result is used; otherwise, the multi-channel speech separation system is selected and its multi-channel separation result is used. However, this prior-art approach needs two independent systems, requires training two network models, and additionally requires running a discrimination network, which increases the running time and the computational complexity of the system.

Regarding the problem in the related art that the performance of speech separation methods declines when the target angle is small, no effective solution has yet been proposed.
Summary of the invention
The embodiments of the present invention provide a method, apparatus, storage medium and electronic device for separating a mixed speech signal, so as at least to solve the problem in the related art that the performance of speech separation methods declines when the target angle is small.

According to one embodiment of the present invention, a method for separating a mixed speech signal is provided, comprising: acquiring a mixed speech signal collected by a voice acquisition device, wherein the mixed speech signal includes speech uttered by at least two target objects; obtaining a first frequency-domain matrix formed from the frequency-domain features of the mixed speech signal and a first spatial matrix formed from the spatial features of the mixed speech signal; determining, among the angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as the target angle; weighting the first spatial matrix with a weight coefficient corresponding to the target angle to obtain a second spatial matrix, wherein 0 ≤ the weight coefficient ≤ 1; and inputting the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain, as output of the target neural network model, multiple speech signals separated from the mixed speech signal in one-to-one correspondence with the at least two target objects, wherein the target neural network model is obtained by training an original neural network model with multiple groups of data, each group of data including a frequency-domain feature matrix of the speech uttered by at least two target objects and a weighted spatial feature matrix.

Optionally, determining the smallest angle as the target angle from among the angles between every two of the at least two target objects and the voice acquisition device comprises: determining a first angle formed by a first line between a first center position and a first position and a second line between the first center position and a second position, wherein the first center position is the center of the voice acquisition device, the first position is the position of the first target object of the two target objects, and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one first angle; and determining, among the first angles corresponding to every two of the at least two target objects, the smallest angle as a first minimum angle, wherein the target angle includes the first minimum angle.
Optionally, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the method comprises: determining the first weight coefficient corresponding to the target angle by the following formula:

att1(θ) = 2*max(σ(θ) - 0.5, 0), with σ(θ) = 1/(1 + exp(-w(θ - b)))

where θ is the target angle, θ ranges from 0 to 180 degrees, w and b are network parameters determined at initialization, and att1(θ) denotes the weight coefficient corresponding to the target angle.

Optionally, weighting the first spatial matrix with the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises: weighting the first spatial matrix with the first weight coefficient to obtain the second spatial matrix.
Optionally, determining the smallest angle as the target angle from among the angles between every two of the at least two target objects and the voice acquisition device comprises: determining the target number of spatial features represented in the first spatial matrix; selecting the target number of microphone pairs in the voice acquisition device and determining the second center position between the first microphone and the second microphone of each microphone pair; determining a second angle formed by a third line between the second center position and the first position and a fourth line between the second center position and the second position, wherein the first position is the position of the first target object of the two target objects and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one second angle; and determining, among the second angles corresponding to every two of the at least two target objects, the smallest angle as a second minimum angle, each microphone pair corresponding to one second minimum angle, the target angle including the second minimum angle corresponding to each microphone pair.

Optionally, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the method comprises: determining the second weight coefficient corresponding to the second minimum angle of each microphone pair by the following formula:

att2(θk) = 2*max(σ(θk) - 0.5, 0), with σ(θk) = 1/(1 + exp(-wk(θk - bk)))

where θk is the second minimum angle corresponding to the k-th microphone pair, θk ranges from 0 to 180 degrees, wk and bk are network parameters determined at initialization, and att2(θk) denotes the second weight coefficient corresponding to the second minimum angle of the k-th microphone pair.

Optionally, weighting the first spatial matrix with the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises: weighting the first spatial matrix with a weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of weight coefficients corresponding to the target number of microphone pairs form the weight coefficient matrix.
Optionally, determining the smallest angle as the target angle from among the angles between every two of the at least two target objects and the voice acquisition device further comprises: determining the target number of spatial features represented in the first spatial matrix; selecting the target number of microphone pairs in the voice acquisition device and determining the second center position between the first microphone and the second microphone of each microphone pair; determining a second angle formed by a third line between the second center position and the first position and a fourth line between the second center position and the second position, wherein the first position is the position of the first target object of the two target objects and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one second angle; and determining, among the second angles corresponding to every two of the at least two target objects, the smallest angle as a second minimum angle, each microphone pair corresponding to one second minimum angle, the target angle including the second minimum angle corresponding to each microphone pair.

Optionally, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the method comprises: determining the second weight coefficient corresponding to the second minimum angle of each microphone pair by the following formula:

att2(θk) = 2*max(σ(θk) - 0.5, 0), with σ(θk) = 1/(1 + exp(-wk(θk - bk)))

where θk is the second minimum angle corresponding to the k-th microphone pair, θk ranges from 0 to 180 degrees, wk and bk are network parameters determined at initialization, and att2(θk) denotes the second weight coefficient corresponding to the second minimum angle of the k-th microphone pair.

Optionally, weighting the first spatial matrix with the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises: weighting the first spatial matrix with the product of the first weight coefficient and the weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of weight coefficients corresponding to the target number of microphone pairs form the weight coefficient matrix.
According to another embodiment of the present invention, an apparatus for separating a mixed speech signal is provided, comprising: a first acquisition module for acquiring a mixed speech signal collected by a voice acquisition device, wherein the mixed speech signal includes speech uttered by at least two target objects; a second acquisition module for obtaining a first frequency-domain matrix formed from the frequency-domain features of the mixed speech signal and a first spatial matrix formed from the spatial features of the mixed speech signal; a first determining module for determining, among the angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as the target angle; a weighting module for weighting the first spatial matrix with a weight coefficient corresponding to the target angle to obtain a second spatial matrix, wherein 0 ≤ the weight coefficient ≤ 1; and an input module for inputting the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain, as output of the target neural network model, multiple speech signals separated from the mixed speech signal in one-to-one correspondence with the at least two target objects, wherein the target neural network model is obtained by training an original neural network model with multiple groups of data, each group of data including a frequency-domain feature matrix of the speech uttered by at least two target objects and a weighted spatial feature matrix.

Optionally, the determining module includes: a first determining unit for determining a first angle formed by a first line between a first center position and a first position and a second line between the first center position and a second position, wherein the first center position is the center of the voice acquisition device, the first position is the position of the first target object of the two target objects, and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one first angle; and a second determining unit for determining, among the first angles corresponding to every two of the at least two target objects, the smallest angle as a first minimum angle, wherein the target angle includes the first minimum angle.

Optionally, the apparatus includes a second determining module for determining, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the first weight coefficient corresponding to the target angle by the following formula:

att1(θ) = 2*max(σ(θ) - 0.5, 0), with σ(θ) = 1/(1 + exp(-w(θ - b)))

where θ is the target angle, θ ranges from 0 to 180 degrees, w and b are network parameters determined at initialization, and att1(θ) denotes the weight coefficient corresponding to the target angle.
According to still another embodiment of the present invention, a storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps of any of the above method embodiments.

According to still another embodiment of the present invention, an electronic device is further provided, including a memory and a processor; a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.

Through the present invention, the smallest angle among the angles between every two of the at least two target objects and the voice acquisition device is determined as the target angle, the spatial matrix of the mixed speech signal is weighted with a weight coefficient corresponding to the target angle, and the frequency-domain matrix of the mixed speech signal together with the weighted spatial matrix is input to the target neural network model, which outputs the separation result for the mixed speech signal. Since the separation result for the mixed speech signal can be obtained with only one target neural network model, the prior-art problem that the performance of speech separation declines when the target angle is small is avoided, the computational complexity of the system is reduced, and its operating efficiency is improved.
Brief description of the drawings
The drawings described herein are provided for a further understanding of the present invention and form a part of this application; the illustrative embodiments of the present invention and their description serve to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a hardware structure block diagram of a terminal running the method for separating a mixed speech signal according to an embodiment of the present invention;

Fig. 2 is a flowchart of the separation of a mixed speech signal according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the angles between every two of at least two target objects and the voice acquisition device according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the angles between every two of at least two target objects and the voice acquisition device according to another embodiment of the present invention;

Fig. 5 is a flowchart of a multi-channel speech separation system based on a learnable spatial-feature attention mechanism;

Fig. 6 is a plot of the attention curve according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the angles of two target objects relative to different microphone pairs;

Fig. 8 is a structural block diagram of the apparatus for separating a mixed speech signal according to an embodiment of the present invention.
Specific embodiment
Hereinafter, the present invention will be described in detail with reference to the accompanying drawings and in combination with the embodiments. It should be noted that, in the absence of conflict, the embodiments of this application and the features in the embodiments may be combined with each other.

It should be noted that the terms 'first', 'second' and so on in the specification, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
The method embodiments provided in the embodiments of this application may be executed in a mobile terminal, a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, Fig. 1 is a hardware structure block diagram of a terminal running the method for separating a mixed speech signal according to an embodiment of the present invention. As shown in Fig. 1, the terminal 10 may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing unit such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data; optionally, the above terminal may further include a transmission device 106 for communication functions and an input-output device 108. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the above terminal; for example, the terminal 10 may also include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.

The memory 104 may be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the method for separating a mixed speech signal in the embodiments of the present invention. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, and such remote memories may be connected to the mobile terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

The transmission device 106 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
This embodiment provides a method for separating a mixed speech signal that runs on the above terminal. Fig. 2 is a flowchart of the separation of a mixed speech signal according to an embodiment of the present invention. As shown in Fig. 2, the process includes the following steps:

Step S202: acquire a mixed speech signal collected by a voice acquisition device, wherein the mixed speech signal includes speech uttered by at least two target objects.

Here, the mixed speech signal is a speech signal in which multiple sounds are mixed, including overlapping speech from multiple speakers and sounds from the environment.

Step S204: obtain a first frequency-domain matrix formed from the frequency-domain features of the mixed speech signal and a first spatial matrix formed from the spatial features of the mixed speech signal.

Here, in a real scene the sound sources are separated from each other in space; spatial features such as the inter-channel phase difference, the inter-channel level difference and the inter-channel time difference implicitly carry the spatial position information of the sound sources. Frequency-domain features include the log power spectrum, the spectral magnitude, the log-mel spectrum and so on.

Step S206: determine, among the angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as the target angle.

Here, the voice acquisition device may be a microphone array, and there are multiple objects emitting speech. Every two sounding objects form one angle with the voice acquisition device, and among the multiple angles formed by the multiple objects and the voice acquisition device, the smallest angle is determined as the target angle.

Step S208: weight the first spatial matrix with a weight coefficient corresponding to the target angle to obtain a second spatial matrix, wherein 0 ≤ the weight coefficient ≤ 1.

Step S210: input the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain, as output of the target neural network model, multiple speech signals separated from the mixed speech signal in one-to-one correspondence with the at least two target objects, wherein the target neural network model is obtained by training an original neural network model with multiple groups of data, each group of data including a frequency-domain feature matrix of the speech uttered by at least two target objects and a weighted spatial feature matrix.

Through the above steps, the smallest angle among the angles between every two of the at least two target objects and the voice acquisition device is determined as the target angle, the spatial matrix of the mixed speech signal is weighted with a weight coefficient corresponding to the target angle, and the frequency-domain matrix of the mixed speech signal together with the weighted spatial matrix is input to the target neural network model, which outputs the separation result for the mixed speech signal. Since the separation result can be obtained with only one target neural network model, the prior-art problem that separation performance declines when the target angle is small is avoided, the computational complexity of the system is reduced, and its operating efficiency is improved.
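To make the flow concrete, the following is a minimal Python sketch of steps S202 through S210. The names extract_freq, extract_spatial, att and model are hypothetical placeholders for the feature extractors, the attention module and the trained separation network; none of them is specified at this level by the embodiment.

import numpy as np

def min_pairwise_angle(directions_deg):
    # Step S206: the target angle is the smallest angle between any two
    # sources, measured at the voice acquisition device
    diffs = [abs(a - b) for i, a in enumerate(directions_deg)
             for b in directions_deg[i + 1:]]
    return min(min(d, 360.0 - d) for d in diffs)

def separate(mixture, directions_deg, extract_freq, extract_spatial, att, model):
    X = extract_freq(mixture)                   # first frequency-domain matrix (S204)
    Y = extract_spatial(mixture)                # first spatial matrix (S204)
    theta = min_pairwise_angle(directions_deg)  # target angle (S206)
    Y_att = att(theta) * Y                      # second spatial matrix, 0 <= att(theta) <= 1 (S208)
    return model(X, Y_att)                      # one output stream per target object (S210)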
Optionally, the executing subject of the above steps may be a terminal or the like, but is not limited thereto.

In an alternative embodiment, determining the smallest angle as the target angle from among the angles between every two of the at least two target objects and the voice acquisition device comprises: determining a first angle formed by a first line between a first center position and a first position and a second line between the first center position and a second position, wherein the first center position is the center of the voice acquisition device, the first position is the position of the first target object of the two target objects, and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one first angle; and determining, among the first angles corresponding to every two of the at least two target objects, the smallest angle as a first minimum angle, wherein the target angle includes the first minimum angle. In this embodiment, taking three target objects as an example, Fig. 3 is a schematic diagram of the angles between every two of at least two target objects and the voice acquisition device according to an embodiment of the present invention, where the first position is the position of the first object, the second position is the position of the second object, and the third position is the position of the third object. The angle formed by the first position and the second position is θ1, the angle formed by the second position and the third position is θ2, and the angle formed by the first position and the third position is θ3. The smallest of θ1, θ2 and θ3, namely θ1, is determined as the target angle.
In an alternative embodiment, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the method comprises: determining the first weight coefficient corresponding to the target angle by the following formula:

att1(θ) = 2*max(σ(θ) - 0.5, 0), with σ(θ) = 1/(1 + exp(-w(θ - b)))

where θ is the target angle, θ ranges from 0 to 180 degrees, w and b are network parameters determined at initialization, and att1(θ) denotes the weight coefficient corresponding to the target angle. In this embodiment, continuing the three-target-object example above, the frequency-domain features and the spatial features are extracted from the mixed speech waveform produced by the three sounding objects, and the target angle is determined to be θ1. θ1 is substituted into the above formula to obtain the first weight coefficient att1(θ1) corresponding to the target angle θ1. In this embodiment, w and b are learnable: when the above network model is trained, the network parameters w0 and b0 are first initialized; during training, all network parameters, including w0 and b0, are updated iteratively by gradient descent, and the above w and b are obtained once training of the network model is completed.
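A minimal sketch of this weight computation, using as defaults the example parameters w = 0.5 and b = 10.0 that appear in the attention-curve discussion in specific embodiment 1 below; in the actual system w and b are learned.

import math

def att1(theta_deg, w=0.5, b=10.0):
    # sigmoid score sigma(theta) = 1 / (1 + exp(-w * (theta - b)))
    sigma = 1.0 / (1.0 + math.exp(-w * (theta_deg - b)))
    # clip the lower half of the sigmoid to zero, rescale the rest to [0, 1]
    return 2.0 * max(sigma - 0.5, 0.0)

Weighting the first spatial matrix then reduces to Yatt = att1(θ1) · Y, as described next.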
In an alternative embodiment, weighting the first spatial matrix with the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises: weighting the first spatial matrix with the first weight coefficient to obtain the second spatial matrix. In this embodiment, take the first spatial matrix Y = [y1, y2, ..., yK] as an example, and denote the first weight coefficient att1(θ1) obtained in the above embodiment by α1; weighting Y = [y1, y2, ..., yK] with α1 yields the second spatial matrix Yatt = [α1y1, α1y2, ..., α1yK].
In an alternative embodiment, determining the smallest angle as the target angle from among the angles between every two of the at least two target objects and the voice acquisition device comprises: determining the target number of spatial features represented in the first spatial matrix; selecting the target number of microphone pairs in the voice acquisition device and determining the second center position between the first microphone and the second microphone of each microphone pair; determining a second angle formed by a third line between the second center position and the first position and a fourth line between the second center position and the second position, wherein the first position is the position of the first target object of the two target objects and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one second angle; and determining, among the second angles corresponding to every two of the at least two target objects, the smallest angle as a second minimum angle, each microphone pair corresponding to one second minimum angle, the target angle including the second minimum angle corresponding to each microphone pair. In this embodiment, the voice acquisition device is a microphone array and three target objects are taken as an example. Fig. 4 is a schematic diagram of the angles between every two of at least two target objects and the voice acquisition device according to another embodiment of the present invention, where the first position is the position of the first object, the second position is the position of the second object, and the third position is the position of the third object. Since the first spatial matrix Y = [y1, y2, ..., yK] contains K features, K microphone pairs are selected in the voice acquisition device; the midpoint of each microphone pair is determined as its second center position, and in Fig. 4 the angles formed at the second center position by the first, second and third positions are θ1, θ2 and θ3 respectively. The smallest of these, θ1, is chosen as the target angle θk corresponding to the k-th microphone pair; the K microphone pairs correspond to K target angles, which form the target angle set.
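A sketch of this per-pair angle computation with NumPy; representing microphone-pair midpoints and source positions as 2-D coordinates is an assumption made for illustration.

import numpy as np

def angle_at(vertex, p1, p2):
    # angle in degrees formed at `vertex` by the lines to p1 and p2
    vertex = np.asarray(vertex, float)
    v1 = np.asarray(p1, float) - vertex
    v2 = np.asarray(p2, float) - vertex
    c = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def pair_target_angles(pair_midpoints, source_positions):
    # theta_k: for each microphone pair, the smallest second angle
    # over all pairs of target objects
    thetas = []
    for mid in pair_midpoints:
        angles = [angle_at(mid, s1, s2)
                  for i, s1 in enumerate(source_positions)
                  for s2 in source_positions[i + 1:]]
        thetas.append(min(angles))
    return thetas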
In an alternative embodiment, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the method comprises: determining the second weight coefficient corresponding to the second minimum angle of each microphone pair by the following formula:

att2(θk) = 2*max(σ(θk) - 0.5, 0), with σ(θk) = 1/(1 + exp(-wk(θk - bk)))

where θk is the second minimum angle corresponding to the k-th microphone pair, θk ranges from 0 to 180 degrees, wk and bk are network parameters determined at initialization, and att2(θk) denotes the second weight coefficient corresponding to the second minimum angle of the k-th microphone pair. In this embodiment, substituting θk into the above formula yields the second weight coefficient corresponding to the k-th microphone pair; the K microphone pairs yield K second weight coefficients, and these K second weight coefficients correspond to the weight coefficient matrix. In this embodiment, wk and bk are learnable: when the above network model is trained, the network parameters are first initialized; during training, all network parameters are updated iteratively by gradient descent, and the above wk and bk are obtained once training of the network model is completed.
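The per-pair weight is the same sigmoid construction with pair-specific parameters; a sketch, where w_k and b_k stand in for the learned parameters of the k-th microphone pair:

import math

def att2(theta_k_deg, w_k, b_k):
    # second weight coefficient for the k-th microphone pair
    sigma = 1.0 / (1.0 + math.exp(-w_k * (theta_k_deg - b_k)))
    return 2.0 * max(sigma - 0.5, 0.0)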
In an alternative embodiment, weighting the first spatial matrix with the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises: weighting the first spatial matrix with the weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of weight coefficients corresponding to the target number of microphone pairs form the weight coefficient matrix. In this embodiment, taking the weight coefficient matrix α2 = [α2,1, α2,2, ..., α2,K] corresponding to the K microphone pairs as an example, weighting the first spatial matrix Y = [y1, y2, ..., yK] with this weight coefficient matrix yields the second spatial matrix Yatt = [α2,1y1, α2,2y2, ..., α2,KyK].
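Applying the weight coefficient matrix row-wise can be sketched as follows, reusing the att2 sketch above; assuming, for illustration, that Y holds one feature row per microphone pair:

import numpy as np

def weight_by_pairs(Y, thetas, pair_params):
    # one second weight coefficient per microphone pair, forming alpha2
    alpha2 = np.array([att2(t, w, b) for t, (w, b) in zip(thetas, pair_params)])
    return alpha2[:, None] * Y   # Yatt = [alpha2_1*y1, ..., alpha2_K*yK]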
In an alternative embodiment, determining the smallest angle as the target angle from among the angles between every two of the at least two target objects and the voice acquisition device further comprises: determining the target number of spatial features represented in the first spatial matrix; selecting the target number of microphone pairs in the voice acquisition device and determining the second center position between the first microphone and the second microphone of each microphone pair; determining a second angle formed by a third line between the second center position and the first position and a fourth line between the second center position and the second position, wherein the first position is the position of the first target object of the two target objects and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one second angle; and determining, among the second angles corresponding to every two of the at least two target objects, the smallest angle as a second minimum angle, each microphone pair corresponding to one second minimum angle, the target angle including the second minimum angle corresponding to each microphone pair.

In an alternative embodiment, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the method comprises: determining the second weight coefficient corresponding to the second minimum angle of each microphone pair by the following formula:

att2(θk) = 2*max(σ(θk) - 0.5, 0), with σ(θk) = 1/(1 + exp(-wk(θk - bk)))

where θk is the second minimum angle corresponding to the k-th microphone pair, θk ranges from 0 to 180 degrees, wk and bk are network parameters determined at initialization, and att2(θk) denotes the second weight coefficient corresponding to the second minimum angle of the k-th microphone pair.
In an alternative embodiment, weighting the first spatial matrix with the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises: weighting the first spatial matrix with the product of the first weight coefficient and the weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of weight coefficients corresponding to the target number of microphone pairs form the weight coefficient matrix. In this embodiment, combining the above methods, the first spatial matrix can be weighted with the product of the first weight coefficient α1 and the weight coefficient matrix α2 = [α2,1, α2,2, ..., α2,K], giving the second spatial matrix Yatt = [α1α2,1y1, α1α2,2y2, ..., α1α2,KyK].
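Combining both mechanisms is then a single elementwise product, sketched under the same layout assumption (a1 = att1(θ) and alpha2 as computed in the sketches above):

import numpy as np

def weight_combined(Y, a1, alpha2):
    # product of the first weight coefficient and the weight coefficient matrix
    return (a1 * np.asarray(alpha2))[:, None] * Y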
The application is described below through specific embodiments.
Specific embodiment 1:
This embodiment proposes a learnable attention mechanism that, under different inter-speaker angles, selectively assigns different degrees of attention (corresponding to weight coefficients) to the frequency-domain and spatial features. Fig. 5 is a flowchart of a multi-channel speech separation system based on a learnable spatial-feature attention mechanism, which includes the following steps:

Step 1: acquire a mixed speech signal through the voice acquisition device, where the mixed speech signal is the speech signal obtained by mixing the speech uttered by multiple target objects.

Step 2: extract the frequency-domain features and the spatial features from the mixed speech signal to obtain the first frequency-domain matrix and the first spatial matrix.

Step 3: determine the angle between every two of the at least two target objects and the voice acquisition device. Specifically, the angle of every two of the multiple speakers with respect to the voice acquisition device, such as θ1, θ2 and θ3 in Fig. 3, can be obtained from the speaker angles configured for the scene or from a speaker-angle estimation module.

Step 4: determine the minimum of the angles between every two target objects and the voice acquisition device as the target angle. For example, for separation systems with three or more speakers, the inter-speaker angle can be defined as the minimum of the pairwise angles over all speakers.

Step 5: input the target angle into the attention module of the multi-channel speech separation network to obtain the weight coefficient of the spatial matrix. The attention module computes the weight coefficient from the target angle θ, and the spatial feature matrix is weighted with the obtained weight coefficient. Specifically, the weight coefficient is computed as follows:
att1(θ)=f1(θ)
Here att1(θ) denotes, when the smallest target angle among the angles formed by the multiple speakers at the voice acquisition device is θ, the contribution of the spatial features to the separation network, i.e. the weight coefficient of the spatial features. f1(θ) is a monotonically increasing (or monotonically non-decreasing) function of the angle θ: as the angle increases, the contribution of the spatial features to the network grows. The angle θ ranges from 0 to 180 degrees and f1(θ) ranges from 0 to 1. One modular design of f1(θ) is as follows:

f1(θ) = 2*max(σ(θ) - 0.5, 0)

Here σ(θ) = 1/(1 + exp(-w(θ - b))) can be a sigmoid score, or any other method that can compute a score; its range is [0, 1] and it indicates the weight that the network should assign to the spatial features, and w and b are learnable network parameters. b controls the critical value at which the sigmoid score curve approaches 1, while b/w controls the critical value at which the curve approaches 0. Fig. 6 is a plot of the attention curve according to an embodiment of the present invention, drawn with w = 0.5 and b = 10.0; the attention level represents the weight: the larger the attention level, the larger the weight, and the smaller the attention level, the smaller the weight. As can be seen from the figure, when the angle between the speakers and the voice acquisition device is small (θ < 10°), the contribution of the spatial features to the multi-channel separation network is 0 and the network relies only on the frequency-domain features to separate the mixed speech; when the angle is large (θ > 30°), the spatial features are more discriminative than the frequency-domain features and are assigned a substantial weight, so the network refers to both the frequency-domain and the spatial information and obtains a better separation result for the mixed speech.
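Evaluating the att1 sketch given earlier with w = 0.5 and b = 10.0 reproduces this behavior (values rounded to three decimals):

for theta in (5, 10, 15, 20, 30, 60):
    print(theta, round(att1(theta), 3))
# prints: 5 0.0, 10 0.0, 15 0.848, 20 0.987, 30 1.0, 60 1.0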
Step 6: input the frequency-domain matrix and the weighted spatial matrix into the separation module of the multi-channel speech separation network; the network outputs the separated speech.
Specific embodiment 2:
The extraction of spatial information is mainly based on the information differences between microphone pairs, such as the interaural time difference (ITD), the interaural level difference (ILD) and the interaural phase difference (IPD). The attention mechanism described in specific embodiment 1 applies the same degree of attention to the spatial features extracted from all microphones. However, the angle of a speaker relative to the center of the entire microphone array differs from the corresponding angle relative to the center of an individual microphone pair.
Fig. 7 is a schematic diagram of the angles of two target objects relative to different microphone pairs; each solid line connects one microphone pair. θ1 denotes the angle formed at the center of the first microphone pair by the first position of the first target object and the second position of the second target object, and θ2 denotes the angle formed at the center of the second microphone pair by the first position of the first target object and the second position of the second target object, where the first target object and the second target object are the sound sources emitting speech. As can be seen from the figure, the angle of the sounding objects with respect to the center of each microphone pair in the array is different. Taking a circular array of 6 microphones as an example, the computation of θ1 is introduced. With the center of the microphone array as the origin, the coordinates of microphone 1 are (-r·sin(0°), r·cos(0°)) and the coordinates of microphone 2 are (-r·sin(60°), r·cos(60°)), where r is the radius of the microphone array, and so on. The first microphone pair is (microphone 1, microphone 2), with midpoint A0 = (-r(sin(0°) + sin(60°))/2, r(cos(0°) + cos(60°))/2).

Assume that sounding objects 1 and 2 arrive from directions φ1 and φ2 respectively, where φ1 is the angle between the line from sounding object 1 to the center of the microphone array and the array's 0° line (the 0° line generally depends on the design pattern of the microphone array; for example, the 0° line of a circular microphone array can be the line pointing vertically upward, and that of a linear microphone array can be a horizontal line). The arrival coordinates of sounding object 1 on the circular microphone array can then be taken as A1 = (-r·sin(φ1), r·cos(φ1)), and the coordinates of sounding object 2 as A2 = (-r·sin(φ2), r·cos(φ2)). θ1 can then be computed as the angle formed at A0 by the lines A0A1 and A0A2.

θ2 is computed in the same way as θ1.
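A sketch of this geometry for the 6-microphone circular array, reusing angle_at from the earlier sketch. Placing the arrival points on the array circle of radius r follows the coordinate pattern above and is an assumption where the original formula is not reproduced; the radius value and the direction angles below are hypothetical.

import numpy as np

def mic_position(i, r):
    # microphone i+1 on a 6-mic circular array of radius r; the 0-degree
    # line points vertically upward and microphones are 60 degrees apart
    a = np.radians(60.0 * i)
    return np.array([-r * np.sin(a), r * np.cos(a)])

def arrival_position(phi_deg, r):
    # arrival point of a source from direction phi (degrees off the 0-degree line)
    a = np.radians(phi_deg)
    return np.array([-r * np.sin(a), r * np.cos(a)])

r = 0.05                                      # hypothetical array radius in meters
A0 = (mic_position(0, r) + mic_position(1, r)) / 2.0  # midpoint of (mic 1, mic 2)
A1 = arrival_position(30.0, r)                # hypothetical direction of object 1
A2 = arrival_position(150.0, r)               # hypothetical direction of object 2
theta_1 = angle_at(A0, A1, A2)                # angle at A0 between A0-A1 and A0-A2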
When the angle formed by the two target objects at the k-th microphone pair is 0 degrees, the spatial features computed from the k-th microphone pair have no discriminability; when it is 180 degrees, the discriminability of the features computed from the k-th microphone pair is maximal. The discriminability of the extracted spatial features therefore differs from one microphone pair to another.

Accordingly, this application computes a separate weight coefficient from the angle difference of the two speakers relative to each microphone pair, as follows:

att2(Δθk) = f2(Δθk)

Here att2(Δθk) denotes, when the angle difference of the two speakers relative to the k-th microphone pair is Δθk, the contribution of the spatial features obtained from that microphone pair to the separation network. f2(Δθk) is a monotonically increasing (or monotonically non-decreasing) function of the angle Δθk: as the angle increases, the contribution of the spatial features to the network grows. The angle Δθk ranges from 0 to 180 degrees and f2(Δθk) ranges from 0 to 1. One modular design of f2(Δθk) is as follows:

att2(Δθk) = 2*max(σ(Δθk) - 0.5, 0)

Here σ(Δθk) = 1/(1 + exp(-wk(Δθk - bk))) can be the sigmoid score for the k-th microphone pair; it indicates the degree of attention that the network should assign to the spatial features extracted from the k-th microphone pair, where the degree of attention also represents the weight: the larger the degree of attention, the larger the weight coefficient, and the smaller the degree of attention, the smaller the weight coefficient. wk and bk are the corresponding learnable network parameters. The attention mechanism based on microphone pairs can use the spatial features extracted by every microphone pair more precisely according to their discriminability, which further improves the effectiveness of the spatial features and thereby the performance of the multi-channel speech separation system. For separation systems with three or more speakers, the angle difference with respect to the k-th microphone pair can be defined as the minimum of the pairwise angle differences over all speakers.
Specific embodiment 3:
The above attention mechanisms can directly weight the spatial features at the input-feature level. Suppose the frequency-domain features are X and the spatial features are Y = [y1, y2, ..., yK], where K is the total number of microphone pairs the separation system selects from its N microphones. For a given pair of speakers, let α1 be the spatial-feature weight computed in specific embodiment 1 and α2 = [α2,1, α2,2, ..., α2,K] the spatial-feature weights computed in specific embodiment 2, where α2,K is the spatial-feature weight of the K-th microphone pair. The spatial features can then be weighted as:
Yatt = [α1y1, α1y2, ..., α1yK]

or

Yatt = [α2,1y1, α2,2y2, ..., α2,KyK]

or

Yatt = [α1α2,1y1, α1α2,2y2, ..., α1α2,KyK]
The weighted spatial features Yatt are spliced with the frequency-domain features X and used as the input of the network.
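A sketch of this input-level weighting and splicing; the feature layout (one row per microphone pair in Y, frequency-domain features X stacked on top) is an assumption made for illustration.

import numpy as np

def network_input(X, Y, a1=None, alpha2=None):
    # apply the embodiment-1 scalar weight, the embodiment-2 per-pair
    # weights, or their product, then splice with the frequency features
    K = Y.shape[0]
    w = np.ones(K)
    if a1 is not None:
        w = w * a1
    if alpha2 is not None:
        w = w * np.asarray(alpha2)
    Y_att = w[:, None] * Y
    return np.concatenate([X, Y_att], axis=0)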
This application proposes a multi-channel speech separation system based on an attention mechanism. To address the degraded performance of multi-channel speech separation when the speakers' positions are close, it proposes a multi-channel separation network with learnable attention, which lets the network adaptively adjust the weights assigned to different features under different spatial distributions of the speakers, so as to make full use of the more discriminative features and improve the separation result. This application integrates a single attention module into the multi-channel speech separation system, without multiple alternative systems, so the added running time and computational complexity are very small.

Through the above description of the embodiments, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
This embodiment further provides an apparatus for separating a mixed speech signal. The apparatus is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term 'module' may be a combination of software and/or hardware implementing a predetermined function. Although the apparatus described in the following embodiment is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and conceivable.
Fig. 8 is a structural block diagram of the apparatus for separating a mixed speech signal according to an embodiment of the present invention. As shown in Fig. 8, the apparatus includes: a first acquisition module 82 for acquiring a mixed speech signal collected by a voice acquisition device, wherein the mixed speech signal includes speech uttered by at least two target objects; a second acquisition module 84 for obtaining a first frequency-domain matrix formed from the frequency-domain features of the mixed speech signal and a first spatial matrix formed from the spatial features of the mixed speech signal; a first determining module 86 for determining, among the angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as the target angle; a weighting module 88 for weighting the first spatial matrix with a weight coefficient corresponding to the target angle to obtain a second spatial matrix, wherein 0 ≤ the weight coefficient ≤ 1; and an input module 810 for inputting the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain, as output of the target neural network model, multiple speech signals separated from the mixed speech signal in one-to-one correspondence with the at least two target objects, wherein the target neural network model is obtained by training an original neural network model with multiple groups of data, each group of data including a frequency-domain feature matrix of the speech uttered by at least two target objects and a weighted spatial feature matrix.
In an alternative embodiment, the determining module includes: a first determining unit for determining a first angle formed by a first line between a first center position and a first position and a second line between the first center position and a second position, wherein the first center position is the center of the voice acquisition device, the first position is the position of the first target object of the two target objects, and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one first angle; and a second determining unit for determining, among the first angles corresponding to every two of the at least two target objects, the smallest angle as a first minimum angle, wherein the target angle includes the first minimum angle.
In an alternative embodiment, the apparatus includes a second determining module for determining, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the first weight coefficient corresponding to the target angle by the following formula:

att1(θ) = 2*max(σ(θ) - 0.5, 0), with σ(θ) = 1/(1 + exp(-w(θ - b)))

where θ is the target angle, θ ranges from 0 to 180 degrees, w and b are network parameters determined at initialization, and att1(θ) denotes the weight coefficient corresponding to the target angle.
In an alternative embodiment, the above weighting module is further configured to weight the first spatial matrix with the first weight coefficient to obtain the second spatial matrix.

In an alternative embodiment, the above first determining module is further configured to: determine the target number of spatial features represented in the first spatial matrix; select the target number of microphone pairs in the voice acquisition device and determine the second center position between the first microphone and the second microphone of each microphone pair; determine a second angle formed by a third line between the second center position and the first position and a fourth line between the second center position and the second position, wherein the first position is the position of the first target object of the two target objects and the second position is the position of the second target object of the two target objects, every two target objects corresponding to one second angle; and determine, among the second angles corresponding to every two of the at least two target objects, the smallest angle as a second minimum angle, each microphone pair corresponding to one second minimum angle, the target angle including the second minimum angle corresponding to each microphone pair.

In an alternative embodiment, the above apparatus is further configured to determine, before the first spatial matrix is weighted with the weight coefficient corresponding to the target angle, the second weight coefficient corresponding to the second minimum angle of each microphone pair by the following formula:

att2(θk) = 2*max(σ(θk) - 0.5, 0), with σ(θk) = 1/(1 + exp(-wk(θk - bk)))

where θk is the second minimum angle corresponding to the k-th microphone pair, θk ranges from 0 to 180 degrees, wk and bk are network parameters determined at initialization, and att2(θk) denotes the second weight coefficient corresponding to the second minimum angle of the k-th microphone pair.

In an alternative embodiment, the above weighting module is further configured to weight the first spatial matrix with the weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of weight coefficients corresponding to the target number of microphone pairs form the weight coefficient matrix.

In an alternative embodiment, the above weighting module is further configured to weight the first spatial matrix with the product of the first weight coefficient and the weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of weight coefficients corresponding to the target number of microphone pairs form the weight coefficient matrix.
It should be noted that the above modules can be implemented by software or by hardware; in the latter case this can be achieved, without limitation, as follows: the above modules are all located in the same processor, or the above modules are located, in any combination, in different processors.

An embodiment of the present invention further provides a storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps of any of the above method embodiments.
Optionally, in the present embodiment, above-mentioned storage medium can be set to store by executing based on following steps Calculation machine program:
S1: acquire a mixed voice signal collected by a voice acquisition device, where the mixed voice signal includes voices uttered by at least two target objects;
S2: obtain a first frequency-domain matrix formed from the frequency-domain features of the mixed voice signal and a first spatial matrix formed from the spatial features of the mixed voice signal;
S3: determine, from the angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as the target angle;
S4: weight the first spatial matrix using the weight coefficient corresponding to the target angle to obtain a second spatial matrix, where 0 ≤ weight coefficient ≤ 1;
S5: input the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain multiple voice signals that are output by the target neural network model, separated from the mixed voice signal, and in one-to-one correspondence with the at least two target objects, where the target neural network model is trained from an original neural network model using multiple groups of data, each group of data including a frequency-domain feature matrix of the voices uttered by the at least two target objects and a weighted spatial feature matrix.
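Putting S1 through S5 together, a sketch of the full pipeline follows. The feature choices are assumptions rather than disclosures of the patent: STFT magnitudes of a reference channel stand in for the frequency-domain matrix, inter-channel phase differences (IPDs) stand in for the spatial features, and `model` is any callable standing in for the target neural network model.

import numpy as np
from scipy.signal import stft

def separate(mixture, model, att1, att2, fs=16000):
    # S1: `mixture` is the multi-channel recording, shape (channels, samples).
    _, _, spec = stft(mixture, fs=fs, nperseg=512)          # (C, F, T)
    # S2: first frequency-domain matrix from the reference channel.
    freq_matrix = np.abs(spec[0])
    # S2: first spatial matrix, one IPD map per microphone pair.
    spatial = np.stack([np.angle(spec[c] * np.conj(spec[0]))
                        for c in range(1, spec.shape[0])])  # (P, F, T)
    # S3/S4: angle-dependent weighting with coefficients in [0, 1].
    weighted = att1 * np.asarray(att2)[:, None, None] * spatial
    # S5: the model outputs one separated signal per target object.
    return model(freq_matrix, weighted)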
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing a computer program.
An embodiment of the present invention further provides an electronic device including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, both connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:
S1: acquire a mixed voice signal collected by a voice acquisition device, where the mixed voice signal includes voices uttered by at least two target objects;
S2: obtain a first frequency-domain matrix formed from the frequency-domain features of the mixed voice signal and a first spatial matrix formed from the spatial features of the mixed voice signal;
S3: determine, from the angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as the target angle;
S4: weight the first spatial matrix using the weight coefficient corresponding to the target angle to obtain a second spatial matrix, where 0 ≤ weight coefficient ≤ 1;
S5: input the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain multiple voice signals that are output by the target neural network model, separated from the mixed voice signal, and in one-to-one correspondence with the at least two target objects, where the target neural network model is trained from an original neural network model using multiple groups of data, each group of data including a frequency-domain feature matrix of the voices uttered by the at least two target objects and a weighted spatial feature matrix.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations; details are not repeated here.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein, or they may be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the principles of the present invention shall be included in the protection scope of the present invention.

Claims (15)

1. A method for separating a mixed voice signal, comprising:
acquiring a mixed voice signal collected by a voice acquisition device, wherein the mixed voice signal comprises voices uttered by at least two target objects;
obtaining a first frequency-domain matrix formed from frequency-domain features of the mixed voice signal and a first spatial matrix formed from spatial features of the mixed voice signal;
determining, from angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as a target angle;
weighting the first spatial matrix using a weight coefficient corresponding to the target angle to obtain a second spatial matrix, wherein 0 ≤ weight coefficient ≤ 1;
inputting the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain multiple voice signals that are output by the target neural network model, separated from the mixed voice signal, and in one-to-one correspondence with the at least two target objects, wherein the target neural network model is trained from an original neural network model using multiple groups of data, each group of data comprising a frequency-domain feature matrix of the voices uttered by the at least two target objects and a weighted spatial feature matrix.
2. The method according to claim 1, wherein determining the smallest angle as the target angle from the angles between every two of the at least two target objects and the voice acquisition device comprises:
determining a first angle formed by a first line connecting a first center position and a first position and a second line connecting the first center position and a second position, wherein the first center position is the center position of the voice acquisition device, the first position is the position of a first target object of the every two target objects, the second position is the position of a second target object of the every two target objects, and every two target objects correspond to one first angle;
determining, from the first angles corresponding to every two of the at least two target objects, the smallest angle as a first minimum angle, wherein the target angle comprises the first minimum angle.
3. The method according to claim 2, wherein before the first spatial matrix is weighted using the weight coefficient corresponding to the target angle, the method comprises:
determining a first weight coefficient corresponding to the target angle by the following formula:
wherein θ is the target angle and takes values from 0 to 180 degrees, w and b are network parameters determined at initialization, and att1(θ) denotes the weight coefficient corresponding to the target angle.
4. The method according to claim 3, wherein weighting the first spatial matrix using the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises:
weighting the first spatial matrix using the first weight coefficient to obtain the second spatial matrix.
5. The method according to claim 1, wherein determining the smallest angle as the target angle from the angles between every two of the at least two target objects and the voice acquisition device comprises:
determining a target number of spatial features represented in the first spatial matrix;
selecting the target number of microphone pairs in the voice acquisition device; determining a second center position between the first microphone and the second microphone of each microphone pair; and determining a second angle formed by a third line connecting the second center position and a first position and a fourth line connecting the second center position and a second position, wherein the first position is the position of a first target object of the every two target objects, the second position is the position of a second target object of the every two target objects, and every two target objects correspond to one second angle;
determining, from the second angles corresponding to every two of the at least two target objects, the smallest angle as a second minimum angle, wherein each microphone pair corresponds to one second minimum angle, and the target angle comprises the second minimum angle corresponding to each microphone pair.
6. The method according to claim 5, wherein before the first spatial matrix is weighted using the weight coefficient corresponding to the target angle, the method comprises:
determining the second weight coefficient corresponding to the second minimum angle of each microphone pair by the following formula:
wherein θ_k is the second minimum angle corresponding to the k-th microphone pair and takes values from 0 to 180 degrees, w_k and b_k are network parameters determined at initialization, and att2(θ_k) denotes the second weight coefficient corresponding to the second minimum angle of the k-th microphone pair.
7. The method according to claim 6, wherein weighting the first spatial matrix using the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises:
weighting the first spatial matrix using a weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of second weight coefficients corresponding to the target number of microphone pairs constitute the weight coefficient matrix.
8. The method according to claim 3, wherein determining the smallest angle as the target angle from the angles between every two of the at least two target objects and the voice acquisition device further comprises:
determining a target number of spatial features represented in the first spatial matrix;
selecting the target number of microphone pairs in the voice acquisition device; determining a second center position between the first microphone and the second microphone of each microphone pair; and determining a second angle formed by a third line connecting the second center position and a first position and a fourth line connecting the second center position and a second position, wherein the first position is the position of a first target object of the every two target objects, the second position is the position of a second target object of the every two target objects, and every two target objects correspond to one second angle;
determining, from the second angles corresponding to every two of the at least two target objects, the smallest angle as a second minimum angle, wherein each microphone pair corresponds to one second minimum angle, and the target angle comprises the second minimum angle corresponding to each microphone pair.
9. The method according to claim 8, wherein before the first spatial matrix is weighted using the weight coefficient corresponding to the target angle, the method comprises:
determining the second weight coefficient corresponding to the second minimum angle of each microphone pair by the following formula:
wherein θ_k is the second minimum angle corresponding to the k-th microphone pair and takes values from 0 to 180 degrees, w_k and b_k are network parameters determined at initialization, and att2(θ_k) denotes the second weight coefficient corresponding to the second minimum angle of the k-th microphone pair.
10. The method according to claim 9, wherein weighting the first spatial matrix using the weight coefficient corresponding to the target angle to obtain the second spatial matrix comprises:
weighting the first spatial matrix using the product of the first weight coefficient and the weight coefficient matrix to obtain the second spatial matrix, wherein each microphone pair corresponds to one second weight coefficient, and the target number of second weight coefficients corresponding to the target number of microphone pairs constitute the weight coefficient matrix.
11. An apparatus for separating a mixed voice signal, comprising:
a first acquisition module, configured to acquire a mixed voice signal collected by a voice acquisition device, wherein the mixed voice signal comprises voices uttered by at least two target objects;
a second acquisition module, configured to obtain a first frequency-domain matrix formed from frequency-domain features of the mixed voice signal and a first spatial matrix formed from spatial features of the mixed voice signal;
a first determining module, configured to determine, from angles between every two of the at least two target objects and the voice acquisition device, the smallest angle as a target angle;
a weighting module, configured to weight the first spatial matrix using a weight coefficient corresponding to the target angle to obtain a second spatial matrix, wherein 0 ≤ weight coefficient ≤ 1;
an input module, configured to input the first frequency-domain matrix and the second spatial matrix into a target neural network model to obtain multiple voice signals that are output by the target neural network model, separated from the mixed voice signal, and in one-to-one correspondence with the at least two target objects, wherein the target neural network model is trained from an original neural network model using multiple groups of data, each group of data comprising a frequency-domain feature matrix of the voices uttered by the at least two target objects and a weighted spatial feature matrix.
12. The apparatus according to claim 11, wherein the first determining module comprises:
a first determining unit, configured to determine a first angle formed by a first line connecting a first center position and a first position and a second line connecting the first center position and a second position, wherein the first center position is the center position of the voice acquisition device, the first position is the position of a first target object of the every two target objects, the second position is the position of a second target object of the every two target objects, and every two target objects correspond to one first angle;
a second determining unit, configured to determine, from the first angles corresponding to every two of the at least two target objects, the smallest angle as a first minimum angle, wherein the target angle comprises the first minimum angle.
13. The apparatus according to claim 12, wherein the apparatus further comprises:
a second determining module, configured to determine, before the first spatial matrix is weighted using the weight coefficient corresponding to the target angle, a first weight coefficient corresponding to the target angle by the following formula:
wherein θ is the target angle and takes values from 0 to 180 degrees, w and b are network parameters determined at initialization, and att1(θ) denotes the weight coefficient corresponding to the target angle.
14. A storage medium having a computer program stored therein, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 10.
15. An electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 10.
CN201910736585.6A — priority date 2019-08-09, filing date 2019-08-09 — Method and device for separating mixed voice signal, storage medium and electronic device — Active — granted as CN110491409B (en)

Priority Applications (1)

- CN201910736585.6A — priority date 2019-08-09, filing date 2019-08-09: Method and device for separating mixed voice signal, storage medium and electronic device


Publications (2)

- CN110491409A — published 2019-11-22 (this application publication)
- CN110491409B — published 2021-09-24 (granted publication)

Family

ID=68550353


Country Status (1)

- CN: CN110491409B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party

- EP2063419A1 * — priority 2007-11-21, published 2009-05-27, Harman Becker Automotive Systems GmbH: Speaker localization
- CN107251138A * — priority 2015-02-16, published 2017-10-13, Dolby Laboratories Licensing Corp.: Separating audio sources
- US20180047407A1 * — priority 2015-03-23, published 2018-02-15, Sony Corporation: Sound source separation apparatus and method, and program
- CN109830245A * — priority 2019-01-02, published 2019-05-31, Peking University: A multi-speaker speech separation method and system based on beamforming

Cited By (4)

* Cited by examiner, † Cited by third party

- CN111063365A * — priority 2019-12-13, published 2020-04-24, Beijing Sogou Technology Development Co., Ltd.: Voice processing method and device and electronic equipment
- CN111063365B * — priority 2019-12-13, granted 2022-06-07, Beijing Sogou Technology Development Co., Ltd.: Voice processing method and device and electronic equipment
- CN111261186A * — priority 2020-01-16, published 2020-06-09, Nanjing University of Science and Technology: Audio sound source separation method based on improved self-attention mechanism and cross-frequency-band features
- CN113241092A * — priority 2021-06-15, published 2021-08-10, Xinjiang University: Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network

Also Published As

- CN110491409B — published 2021-09-24


Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant