CN108564952B - Method and apparatus for speaker role separation - Google Patents
- Publication number: CN108564952B (CN201810198543.7A)
- Authority: CN (China)
- Prior art keywords: audio, role, channel audio, label, speaks
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- G10L21/0202—
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The object of the present invention is to provide a method and apparatus for speaker role separation. By using a multi-array directional microphone, the voices of different people are captured by different hardware; combining algorithm and hardware capability yields higher accuracy than relying on algorithms alone for role separation. A reporter conducting an interview does not need to understand the technical details: it is only necessary to place the corresponding recording device for each interviewee and open the App on a human-computer interaction device such as a mobile phone. Speech can then be converted to text in real time or non-real time, and text results with accurate role separation are obtained, saving the reporter considerable time and effort in processing audio material.
Description
Technical field
The present invention relates to the computer field, and more particularly to a method and apparatus for speaker role separation.
Background art
As informatization and automation continue to advance across all industries, the demand for more accurate data keeps growing. Taking the interview scenario as an example, recording is an indispensable part of an interview: journalists need to record the audio content, analyze the audio material, extract the useful information, and finally write it up as an article, which is a heavy workload. The development of speech recognition technology offers a solution for processing such audio material.
Speaker role separation is an important step in processing interview audio material. At present, most role separation schemes are based mainly on the speaker's voiceprint features. That is, after a speech signal is received, speaker change point detection is first performed on the speech signal based on BIC (Bayesian Information Criterion), dividing the speech signal into multiple speech segments; then GMM (Gaussian Mixture Model) and HMM (Hidden Markov Model) are used to model the voice of each role. The speech segments of each speaker are thereby extracted, achieving role separation.
Here, BIC (Bayesian Information Criterion) is an index for evaluating how well a model fits the data; the smaller the BIC value, the better the model fits the data: BIC = -2 ln(L) + k ln(n), where L is the maximized likelihood of the model, n is the number of samples, and k is the number of model parameters. GMM (Gaussian Mixture Model) quantifies an object precisely with Gaussian probability density functions, decomposing it into several components each based on a Gaussian probability density function. HMM (Hidden Markov Model) is a statistical model used to describe a Markov process with hidden, unknown parameters.
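As a hedged illustration of the BIC formula above, the following Python sketch compares one Gaussian fitted to a whole window against two Gaussians fitted on either side of a candidate speaker change point. The function names, the one-dimensional single-Gaussian model, and the synthetic data are assumptions made for illustration only; real systems of the kind described usually work on multi-dimensional acoustic features.

```python
import math
import random
from statistics import pvariance

def bic(log_likelihood, n_samples, n_params):
    """BIC = -2 ln(L) + k ln(n); a lower value means a better fit."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def gaussian_log_likelihood(x):
    """Log-likelihood of 1-D data under its own maximum-likelihood Gaussian."""
    var = pvariance(x) + 1e-12  # small floor guards against zero variance
    return -0.5 * len(x) * (math.log(2.0 * math.pi * var) + 1.0)

def split_improves_bic(x, t):
    """True if modelling x[:t] and x[t:] separately gives a lower BIC."""
    n = len(x)
    one_model = bic(gaussian_log_likelihood(x), n, n_params=2)
    two_models = bic(
        gaussian_log_likelihood(x[:t]) + gaussian_log_likelihood(x[t:]),
        n,
        n_params=4,  # a mean and a variance for each side
    )
    return two_models < one_model

# Synthetic stand-in for a feature track with a change at sample 500 (assumption).
rng = random.Random(0)
track = [rng.gauss(0.0, 1.0) for _ in range(500)] + [rng.gauss(5.0, 1.0) for _ in range(500)]
print(split_improves_bic(track, 500))
```

A change point is accepted only when the likelihood gain from splitting outweighs the extra k ln(n) penalty for the additional parameters.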
The above solutions separate roles well in an ideal recording environment. In an interview scenario, however, the interview space is not known in advance and sound propagation is strongly affected by the space. Because of reflection and diffraction, the signal received by the microphone contains superimposed multipath signals in addition to the direct signal, so the signal is disturbed; this is reverberation. In an indoor environment, diffraction and reflection from room boundaries or obstacles cause sound to linger, which greatly affects speech intelligibility. In addition, the number of speakers is not fixed, so the accuracy of role separation may drop considerably.
Summary of the invention
An object of the present invention is to provide a method and apparatus for speaker role separation that can solve the problem of low accuracy in existing speaker role separation schemes.
According to one aspect of the invention, a method of speaker role separation is provided, the method comprising:
Collecting, through pickup heads pointed at different speakers, the channel audio corresponding to each speaker role;
Applying gain processing to each channel audio according to the speaker role at which the channel audio is directed;
Applying noise reduction processing to each gain-processed channel audio according to the side audio other than that of the speaker role at which the channel audio is directed;
Applying echo elimination processing to each noise-reduced channel audio;
Cutting each echo-eliminated channel audio into audio segments, and marking each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed;
Converting each audio segment into corresponding text, and marking the text with the speaker role label of its audio segment.
Further, in the above method, the pickup heads pointed at different speakers include any of the following:
A microphone with a single pickup head but multiple directional modes;
Two or more microphones on a mobile phone;
Two or more microphones on a voice recorder;
Microphones of two or more independent devices.
Further, in the above method, applying echo elimination processing to each noise-reduced channel audio comprises:
Applying echo elimination processing to each noise-reduced channel audio using a method based on ANC active noise cancellation.
Further, in the above method, marking each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed comprises:
Estimating, with a TDOA algorithm, the time delay difference with which an audio segment in each channel audio reaches the different microphones, calculating the range difference from the delay difference, and then determining the speaker role to which the audio segment corresponds from the calculated range difference and the spatial geometry of the microphones.
Further, in the above method, cutting each echo-eliminated channel audio into audio segments, and marking each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed, comprises:
A human-computer interaction unit receives each echo-eliminated channel audio;
The human-computer interaction unit cuts each channel audio into audio segments and marks each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed;
The human-computer interaction unit uploads the audio segments marked with speaker role labels to the cloud.
Further, in the above method, after converting each audio segment into corresponding text and marking the text with the speaker role label of its audio segment, the method further comprises:
The human-computer interaction unit obtains the audio segments marked with speaker role labels and the corresponding text;
The human-computer interaction unit obtains a request, selected by the user, for the audio and text of a particular speaker role;
Based on the request, the human-computer interaction unit obtains the audio segments marked with the corresponding speaker role label and the corresponding text, and plays them.
Further, in the above method, converting each audio segment into corresponding text comprises:
Identifying and removing, with a VAD algorithm, the audio frames in each audio segment that contain no speech signal;
Converting, with ASR, the audio segments from which the non-speech audio frames have been removed into corresponding text.
Further, in the above method, the number of pickup heads pointed at different speakers is 2 to 4, and the distance between a pickup head and the speaker role is less than 1 meter.
According to another aspect of the invention, a device for speaker role separation is also provided, the device comprising:
A speech signal collection unit, configured to collect, through pickup heads pointed at different speakers, the channel audio corresponding to each speaker role;
An enhancement processing unit, configured to apply gain processing to each channel audio according to the speaker role at which the channel audio is directed;
A noise reduction processing unit, configured to apply noise reduction processing to each gain-processed channel audio according to the side audio other than that of the speaker role at which the channel audio is directed;
An adaptive beamforming unit, configured to apply echo elimination processing to each noise-reduced channel audio;
A sound source localization unit, configured to cut each echo-eliminated channel audio into audio segments and mark each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed;
A role separation unit, configured to convert each audio segment into corresponding text and mark the text with the speaker role label of its audio segment.
According to another aspect of the invention, a computing-based device is also provided, comprising:
A processor; and
A memory arranged to store computer-executable instructions which, when executed, cause the processor to:
Collect, through pickup heads pointed at different speakers, the channel audio corresponding to each speaker role;
Apply gain processing to each channel audio according to the speaker role at which the channel audio is directed;
Apply noise reduction processing to each gain-processed channel audio according to the side audio other than that of the speaker role at which the channel audio is directed;
Apply echo elimination processing to each noise-reduced channel audio;
Cut each echo-eliminated channel audio into audio segments, and mark each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed;
Convert each audio segment into corresponding text, and mark the text with the speaker role label of its audio segment.
According to another aspect of the invention, a computer-readable storage medium is also provided, on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
Collect, through pickup heads pointed at different speakers, the channel audio corresponding to each speaker role;
Apply gain processing to each channel audio according to the speaker role at which the channel audio is directed;
Apply noise reduction processing to each gain-processed channel audio according to the side audio other than that of the speaker role at which the channel audio is directed;
Apply echo elimination processing to each noise-reduced channel audio;
Cut each echo-eliminated channel audio into audio segments, and mark each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed;
Convert each audio segment into corresponding text, and mark the text with the speaker role label of its audio segment.
Compared with the prior art, the present invention uses a multi-array directional microphone so that the voices of different people are captured by different hardware; combining algorithm and hardware capability yields higher accuracy than relying on algorithms alone for role separation. A reporter conducting an interview does not need to understand the technical details: it is only necessary to place the corresponding recording device for each interviewee and open the App on a human-computer interaction device such as a mobile phone. Speech can then be converted to text in real time or non-real time, and text results with accurate role separation are obtained, saving the reporter considerable time and effort in processing audio material.
Description of the drawings
Other features, objects, and advantages of the invention will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 shows a flow chart of the method of speaker role separation according to one embodiment of the invention;
Fig. 2 shows a schematic diagram of the method and apparatus for speaker role separation according to one embodiment of the invention;
Fig. 3 shows a schematic diagram of an adaptive noise canceller according to one embodiment of the invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPU), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As shown in Figs. 1 and 2, the present invention provides a method of speaker role separation, comprising:
Step S1: collecting, through pickup heads pointed at different speakers, the channel audio corresponding to each speaker role;
Here, the speech signal collection unit can obtain the voices of different speakers through a multi-directional pickup microphone array, or through multiple shotgun microphones each pointed at a different speaker, so as to obtain multiple distinct audio signals. Because a multi-array directional microphone is used, the voices of different people are captured by different hardware; combining algorithm and hardware capability yields higher accuracy than relying on algorithms alone, which improves the accuracy of speaker role separation;
Step S2: applying gain processing to each channel audio according to the speaker role at which the channel audio is directed;
Here, the enhancement processing unit can apply gain processing to the audio beam captured in the direction in which, for example, a shotgun microphone points;
Step S3: applying noise reduction processing to each gain-processed channel audio according to the side audio other than that of the speaker role at which the channel audio is directed;
Here, the noise reduction processing unit can suppress the audio signal entering from the side of each directional microphone, thereby performing noise reduction;
Step S4: applying echo elimination processing to each noise-reduced channel audio;
Step S5: cutting each echo-eliminated channel audio into audio segments, and marking each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed;
Step S6: converting each audio segment into corresponding text, and marking the text with the speaker role label of its audio segment.
Here, the role separation unit can mark the corresponding text with the speaker role label of each audio segment.
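The data flow of steps S1 to S6 can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `Segment` class, the `separate_roles` function, the fixed-length segmentation, and the stand-in stage and transcription callables are all hypothetical names introduced here, and the real gain, noise reduction, echo elimination, and speech recognition steps are replaced by placeholders. In particular, the per-channel stages below do not model the cross-channel side audio used in step S3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Segment:
    role: str             # speaker role label inherited from the channel
    start: int            # index of the first sample within the channel
    samples: List[float]
    text: str = ""

def separate_roles(
    channels: Dict[str, Sequence[float]],
    stages: List[Callable[[List[float]], List[float]]],
    segment_len: int,
    transcribe: Callable[[List[float]], str],
) -> List[Segment]:
    segments: List[Segment] = []
    for role, audio in channels.items():        # S1: one channel per speaker role
        processed = list(audio)
        for stage in stages:                     # S2-S4: placeholder processing stages
            processed = stage(processed)
        for start in range(0, len(processed), segment_len):   # S5: cut and label
            chunk = processed[start:start + segment_len]
            segments.append(Segment(role, start, chunk, transcribe(chunk)))  # S6
    # Stable sort: segments from different roles interleave in time order.
    return sorted(segments, key=lambda s: s.start)

def double_gain(audio: List[float]) -> List[float]:
    """Stand-in for gain processing (assumption: a fixed gain of 2)."""
    return [2.0 * x for x in audio]

channels = {"reporter": [1.0, 2.0, 3.0, 4.0], "interviewee": [5.0, 6.0, 7.0, 8.0]}
result = separate_roles(channels, [double_gain], 2, lambda chunk: f"<{len(chunk)} samples>")
for seg in result:
    print(seg.role, seg.start, seg.text)
```

The key point the sketch tries to capture is that the role label comes from the channel itself, so no voiceprint clustering is needed to decide who is speaking.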
With the scheme of the invention, recording is performed with multiple directional pickup heads (such as shotgun microphones), so that the speech signals of different speakers are captured as fully as possible and noise interference is avoided. In a two-speaker scenario, two pickup heads are preferably used, which gives better results. A reporter conducting an interview does not need to understand the technical details: it is only necessary to place the corresponding recording device for each interviewee. Speech can then be converted to text in real time or non-real time, and text results with accurate role separation are obtained, saving the reporter considerable time and effort in processing audio material.
In an embodiment of the method of speaker role separation of the invention, in step S1, the pickup heads pointed at different speakers include any of the following:
A microphone with a single pickup head but multiple directional modes;
Two or more microphones on a mobile phone;
Two or more microphones on a voice recorder;
Microphones of two or more independent devices.
Here, in order to distinguish different speakers, the following ways of collecting speech signals can also be supported:
A) a microphone with a single pickup head but multiple directional modes as the audio input port, so that the audio obtained from the differently directed microphone modes can be transmitted over different channels;
B) a mobile phone with two or more microphones as the audio input source, such as the Samsung GALAXY S6;
C) a voice recorder with two or more microphones;
D) capturing live push streams: in a video interview scenario, the voices of different speakers can be obtained from the live push streams of different devices;
E) other multi-channel audio capture and transmission devices, such as the built-in microphone of a computer plus a separate microphone, or the built-in microphone of a mobile phone plus a separate microphone.
In an embodiment of the method of speaker role separation of the invention, in step S4, applying echo elimination processing to each noise-reduced channel audio comprises:
Applying echo elimination processing to each noise-reduced channel audio using a method based on ANC active noise cancellation.
Here, an adaptive beamforming unit can use an ABF (Adaptive Beam Forming) scheme to reduce interference from noise and echo. ABF uses an antenna array to concentrate signal energy into a very narrow beam, improving antenna propagation efficiency, radio link reliability, and frequency reuse. As shown in Fig. 3, GSLC (generalized sidelobe canceller), one form of ABF, is an active noise cancellation method based on ANC (adaptive noise cancellation, that is, an adaptive noise canceller). The noisy signal passes through a main channel and an auxiliary channel at the same time; the blocking matrix of the auxiliary channel filters out the speech signal to yield a reference signal containing only multi-channel noise; each channel obtains an optimal signal estimate from the noise signal, giving an estimate of the clean speech signal. ANC performs a cancellation operation between the noise-contaminated speech signal and the reference signal, thereby eliminating the noise in the noisy signal.
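The cancellation step described above (subtracting an adaptively filtered reference from the noisy signal) can be illustrated with a basic LMS adaptive filter. This is only a sketch of the general adaptive noise cancellation idea, not the patent's GSLC implementation: the function name, the tap count, the step size, and the synthetic signals are assumptions, and the blocking matrix that would produce the noise-only reference is replaced here by a directly available noise signal.

```python
import math
import random

def lms_noise_canceller(primary, reference, n_taps=4, step=0.01):
    """Subtract an adaptively filtered copy of `reference` from `primary`.

    primary:   noisy speech samples (speech + noise)
    reference: noise-only samples correlated with the noise in `primary`
    Returns the error signal, which serves as the clean speech estimate.
    """
    weights = [0.0] * n_taps
    history = [0.0] * n_taps            # recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - noise_estimate          # what remains after cancelling the noise
        weights = [w + 2.0 * step * e * h for w, h in zip(weights, history)]
        cleaned.append(e)
    return cleaned

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic test signals (assumptions): a sine wave stands in for speech.
rng = random.Random(0)
n = 4000
speech = [0.5 * math.sin(2.0 * math.pi * 0.01 * i) for i in range(n)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]
primary = [s + 0.8 * v for s, v in zip(speech, noise)]

cleaned = lms_noise_canceller(primary, noise)
half = n // 2                           # compare after the filter has adapted
error_before = mse(primary[half:], speech[half:])
error_after = mse(cleaned[half:], speech[half:])
print(error_after < error_before)
```

The step size trades convergence speed against residual error; a value that is too large makes the filter unstable, so it would need tuning for real audio.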
In an embodiment of the method of speaker role separation of the invention, in step S5, marking each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed comprises:
Estimating, with a TDOA algorithm, the time delay difference with which an audio segment in each channel audio reaches the different microphones, calculating the range difference from the delay difference, and then determining the speaker role to which the audio segment corresponds from the calculated range difference and the spatial geometry of the microphones.
Here, a sound source localization unit can use the TDOA (Time Difference of Arrival) algorithm to estimate the time delay difference with which an audio segment (sound source) in each channel audio reaches the different microphones, calculate the range difference, and then determine the position of the audio segment (the corresponding speaker role) from the range difference and the spatial geometry of the microphones.
TDOA (Time Difference of Arrival) is a method of localization using time differences. By measuring the time at which a signal reaches a monitoring station, the distance to the signal source can be determined; using the distances from the signal source to the monitoring stations (drawing a circle centered on each monitoring station with the distance as its radius), the position of the signal can be determined.
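The delay and range-difference calculation can be sketched for two microphones as follows. This is a minimal illustration, assuming a brute-force time-domain cross-correlation as the delay estimator, a speed of sound of 343 m/s, and a 16 kHz sample rate; none of these values, nor the function names, come from the patent, and practical systems often use more robust estimators.

```python
import random

SPEED_OF_SOUND_M_S = 343.0   # assumed speed of sound in air

def estimate_delay_samples(ref, other, max_lag):
    """Lag (in samples) by which `other` trails `ref`, found by maximising
    the cross-correlation over lags in [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += value * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def range_difference_m(delay_samples, sample_rate_hz):
    """Convert a delay difference into a path-length difference in meters."""
    return delay_samples / sample_rate_hz * SPEED_OF_SOUND_M_S

def closer_microphone(delay_samples):
    """With two microphones A (ref) and B (other), the source is nearer the
    microphone that receives the sound first."""
    if delay_samples > 0:
        return "A"
    if delay_samples < 0:
        return "B"
    return "equidistant"

# Synthetic source: microphone B hears the same signal 8 samples later.
rng = random.Random(1)
source = [rng.uniform(-1.0, 1.0) for _ in range(200)]
true_delay = 8
mic_a = source
mic_b = [0.0] * true_delay + source[:-true_delay]

delay = estimate_delay_samples(mic_a, mic_b, max_lag=20)
print(delay, range_difference_m(delay, 16000), closer_microphone(delay))
```

With more than two microphones, the pairwise range differences would be combined with the known microphone positions to locate the source, which is the geometric step the text refers to.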
In an embodiment of the method of speaker role separation of the invention, step S5, cutting each echo-eliminated channel audio into audio segments and marking each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed, comprises:
A human-computer interaction unit receives each echo-eliminated channel audio;
The human-computer interaction unit cuts each channel audio into audio segments and marks each audio segment with the corresponding speaker role label according to the speaker role at which the channel audio is directed;
The human-computer interaction unit uploads the audio segments marked with speaker role labels to the cloud.
Here, an audio transmission unit can transfer the audio signal over a wired (data cable) or wireless (WiFi, Bluetooth, other wireless transmission channels, etc.) connection to a human-computer interaction unit on a mobile phone, computer, or smart hardware (such as a mobile App, a web application, or a smart speaker), and the human-computer interaction unit transmits the audio signal to the cloud, which carries out the subsequent text conversion.
For audio transmission, a wired scheme such as USB is used for signal transmission to avoid data loss in the audio signal (wireless transmission is more subject to transmission constraints and is prone to packet loss in transit).
Uploading the audio data in the form of a recording file (rather than as streaming audio data) and then converting it to text gives better accuracy.
In an embodiment of the method of speaker role separation of the invention, after step S6, converting each audio segment into corresponding text and marking the text with the speaker role label of its audio segment, the method further comprises:
The human-computer interaction unit obtains the audio segments marked with speaker role labels and the corresponding text;
The human-computer interaction unit obtains a request, selected by the user, for the audio and text of a particular speaker role;
Based on the request, the human-computer interaction unit obtains the audio segments marked with the corresponding speaker role label and the corresponding text, and plays them.
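Serving such a request amounts to filtering the labelled segments by role. A minimal sketch, assuming a hypothetical `LabeledSegment` record with a role, a start time in seconds, and the transcribed text (the field names and sample data are illustrative only):

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass(frozen=True)
class LabeledSegment:
    role: str        # speaker role label
    start_s: float   # start time of the segment in seconds
    text: str        # text converted from the segment

def segments_for_role(segments: Iterable[LabeledSegment], role: str) -> List[LabeledSegment]:
    """Return the segments of one speaker role in playback (time) order."""
    return sorted((s for s in segments if s.role == role), key=lambda s: s.start_s)

interview = [
    LabeledSegment("interviewee", 4.0, "Thank you for having me."),
    LabeledSegment("reporter", 0.0, "Welcome to the program."),
    LabeledSegment("reporter", 9.5, "Let us begin."),
]
for seg in segments_for_role(interview, "reporter"):
    print(seg.start_s, seg.text)
```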
Here, the human-computer interaction unit can provide a corresponding application (such as a mobile App or a web application) for the user, which may specifically include the following functions:
A) Recording control: start, pause, stop, and save the recording in real time;
B) Marking: the user can mark important passages during recording and review them later;
C) Naming the different microphones: set the name of the interviewee;
D) Selective playback of different speakers' audio: selecting a particular speaker plays only that speaker's audio;
E) Selective display of different speakers' text: selecting a particular speaker displays the text converted from that speaker's audio;
F) Tap to play: at sentence granularity, the user can select different sentences or paragraphs to play;
G) Editing text content: edit, delete, and rename the text produced from the recording;
H) Keyword search: combined with search engine technology, the user can enter a keyword to search across all recordings, text, and speakers;
I) Downloading and exporting audio files: export the recording file;
J) Cloud synchronization: the application can be used on multiple devices at once, with audio content synchronized through the cloud to avoid data loss.
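As a rough illustration of the keyword search in function H), the following sketch does a case-insensitive substring match over segment text and speaker names. It is an assumption-laden stand-in: the `Entry` record and `search_entries` function are hypothetical, and a real application would more likely use a proper search engine index, as the text suggests.

```python
from typing import List, NamedTuple

class Entry(NamedTuple):
    speaker: str
    text: str

def search_entries(entries: List[Entry], keyword: str) -> List[Entry]:
    """Case-insensitive match against both the text and the speaker name."""
    needle = keyword.casefold()
    return [
        e for e in entries
        if needle in e.text.casefold() or needle in e.speaker.casefold()
    ]

transcript = [
    Entry("Reporter", "How did the project start?"),
    Entry("Engineer", "The PROJECT began as a prototype."),
    Entry("Engineer", "We tested it for a year."),
]
print(search_entries(transcript, "project"))
```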
The method of speech roles separation of the invention is in embodiment, and in step S6, each audio fragment is converted to pair
The text answered, comprising:
By by vad algorithm, identifying and rejecting the audio frame in each audio fragment not comprising voice signal;
It is calculated using ASR, will identify and reject the audio fragment after the audio frame not comprising voice signal and be converted to correspondence
Text.
Here, can be by a silence removal unit, using vad algorithm (mono- language of Voice Activity Detection
Sound movement monitoring) audio frame for not including voice signal in each audio fragment is identified and rejects, turn text to reduce subsequent voice
The unnecessary calculation amount of word.VAD (monitoring of Voice Activity Detection- speech activity) purpose is from voice signal
The prolonged mute phase is identified and eliminated in stream, to save the scheme of speech transcription cost.
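The frame-removal idea can be sketched with a simple energy threshold. The patent does not say which VAD algorithm is used, so the energy criterion, the 160-sample frame length, and the threshold value below are assumptions chosen purely for illustration; production VADs typically use more robust features.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def drop_silent_frames(samples, frame_len=160, threshold=1e-3):
    """Keep only the frames whose energy reaches the threshold.

    Trailing samples that do not fill a whole frame are discarded.
    """
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_energy(frame) >= threshold:
            kept.extend(frame)
    return kept

# Two silent frames, one "voiced" frame, one silent frame (synthetic data).
audio = [0.0] * 320 + [0.5] * 160 + [0.0] * 160
voiced = drop_silent_frames(audio)
print(len(audio), len(voiced))
```

Only the retained frames would then be passed on to speech recognition, which is where the saving in computation comes from.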
In addition, the voice of people is converted to text by ASR (Automatic Speech Recognition- speech recognition)
Technology, can be by a speech-to-text unit, using ASR (Automatic Speech Recognition- speech recognition) skill
Above-mentioned each audio fragment is changed into text and returns to audio conversion program writing by art.The present embodiment can be supported a variety of using field
Scape, comprising: real-time audio circulation text, and offline recording file turn text.
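The silence removal step can be illustrated with a very simple energy-based detector. The text does not specify which VAD algorithm is used, so the RMS threshold, the 160-sample frame length, and the function names below are assumptions chosen for illustration; production VADs typically use more robust features than frame energy.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def drop_silent_frames(samples, frame_len=160, threshold=500.0):
    """Keep only the frames whose RMS reaches the threshold (assumed energy VAD)."""
    kept = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame and frame_rms(frame) >= threshold:
            kept.extend(frame)
    return kept


silence = [0] * 320                 # two silent frames
speech = [1000, -1000] * 160        # two loud frames standing in for speech
audio = silence + speech + silence

voiced = drop_silent_frames(audio)  # only the loud frames would go on to ASR
```

Only `voiced` would be handed to the speech-to-text unit, which is where the saving in transcription cost comes from.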
In one embodiment of the speech role separation method of the invention, in step S1, the number of pick-up heads directed at different speakers is 2 to 4, and the distance between a pick-up head and its speaking role is less than 1 meter.
Here, the number of microphones is ≥ 2. For far-field recording scenarios, a dual-microphone design is less difficult to build, consumes less power, costs less to use, and is a more mature solution.
Pickup may be performed by multiple pick-up heads; alternatively, if a single pick-up head is used, the voices of different speakers can be extracted separately by separating the left and right channels.
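As a rough sketch of left/right channel separation, the snippet below de-interleaves a stereo PCM buffer into one sample list per channel. It assumes interleaved 16-bit little-endian samples, which the text does not state; `split_stereo` is a hypothetical helper name.

```python
import struct


def split_stereo(pcm_bytes):
    """Split interleaved 16-bit little-endian stereo PCM into (left, right) lists."""
    frames = len(pcm_bytes) // 4            # 2 channels x 2 bytes per sample
    count = frames * 2                      # ignore any trailing partial frame
    samples = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    return list(samples[0::2]), list(samples[1::2])


# Three stereo frames: the left channel carries one speaker, the right the other.
interleaved = struct.pack("<6h", 100, -100, 200, -200, 300, -300)
left, right = split_stereo(interleaved)
```

Each returned list can then be treated as the channel audio for the speaking role that side of the pick-up head points at.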
The distance and angle of the microphones can be customized by the user, who can position the microphones individually according to the interview scene and interview distance.
The distance between a microphone and its speaker is preferably no more than 1 meter, in which case an accuracy of 90% or more can be obtained.
Owing to the limited bandwidth of Bluetooth transmission, when audio signals are transmitted over Bluetooth the original PCM-format signal must be compressed before transmission, and the number of microphones should be less than 4.
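The bandwidth concern can be made concrete with a back-of-the-envelope calculation of the uncompressed PCM bit rate, which grows linearly with the number of channels. The 16 kHz sample rate and 16-bit depth used below are illustrative assumptions; the text gives no sampling parameters.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed parameters: 16 kHz, 16-bit samples, one channel per microphone.
rates = {n: pcm_bitrate_bps(16_000, 16, n) for n in (1, 2, 4)}
for n, bps in rates.items():
    print(f"{n} microphone(s): {bps / 1000:.0f} kbit/s uncompressed")
```

Under these assumptions each added microphone costs another 256 kbit/s before compression, which is why both compression and a cap on the microphone count are needed on a bandwidth-limited link.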
According to another aspect of the invention, a device for speech role separation is also provided, wherein the device comprises:
a speech signal collection unit, configured to collect, through pick-up heads directed at different speakers, the channel audio corresponding to each speaking role pointed at;
an enhancement processing unit, configured to apply gain processing to each channel audio according to the speaking role that the channel audio points at;
a noise reduction processing unit, configured to perform noise reduction on each gain-processed channel audio according to the side audio other than that of the speaking role the channel audio points at;
an adaptive beamforming unit, configured to perform echo cancellation on each noise-reduced channel audio;
a sound source localization unit, configured to cut each echo-cancelled channel audio into audio fragments and to mark each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
a role separation unit, configured to convert each audio fragment into the corresponding text and to label that text with the speaking-role label marked on the audio fragment.
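One common way to reduce side audio using another channel as a reference is an adaptive filter. The sketch below uses a normalized LMS (NLMS) filter, which is a swapped-in textbook technique and not necessarily what the noise reduction processing unit does; the tap count, step size, and function name are assumptions.

```python
import random


def nlms_cancel(primary, reference, taps=4, mu=0.5, eps=1e-8):
    """Remove the part of `primary` that is predictable from `reference` (NLMS)."""
    weights = [0.0] * taps
    history = [0.0] * taps                  # recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))  # estimated leakage
        error = d - estimate                                     # cleaned sample
        norm = eps + sum(h * h for h in history)
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


rng = random.Random(0)
side = [rng.uniform(-1.0, 1.0) for _ in range(1000)]  # the other speaker's channel
leak = [0.5 * s for s in side]                        # its leakage into the target channel
residual = nlms_cancel(leak, side)

energy_in = sum(v * v for v in leak[-200:])
energy_out = sum(v * v for v in residual[-200:])
```

In this toy case the target channel contains only leakage, so after the filter converges the residual energy is a small fraction of the input energy; with real speech present, the speech would remain in the residual.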
According to another aspect of the invention, a computing-based device is also provided, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
collect, through pick-up heads directed at different speakers, the channel audio corresponding to each speaking role pointed at;
apply gain processing to each channel audio according to the speaking role that the channel audio points at;
perform noise reduction on each gain-processed channel audio according to the side audio other than that of the speaking role the channel audio points at;
perform echo cancellation on each noise-reduced channel audio;
cut each echo-cancelled channel audio into audio fragments, and mark each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
convert each audio fragment into the corresponding text, and label that text with the speaking-role label marked on the audio fragment.
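The last two steps, labeling fragments by channel and carrying the label over to the text, can be sketched as follows. The data layout, the function names, and the stub `fake_asr` are hypothetical; in particular, the stub stands in for a real ASR call, and the fragment boundaries are supplied by hand rather than detected.

```python
def label_fragments(channel_fragments, channel_roles):
    """Tag each (start_s, end_s) fragment with its channel's speaking role,
    then order all fragments by start time."""
    labeled = []
    for channel, fragments in channel_fragments.items():
        role = channel_roles[channel]
        for start_s, end_s in fragments:
            labeled.append({"role": role, "start_s": start_s, "end_s": end_s})
    return sorted(labeled, key=lambda f: f["start_s"])


def attach_text(labeled, transcribe):
    """Add the text returned by `transcribe` to each labeled fragment."""
    return [dict(f, text=transcribe(f)) for f in labeled]


# Hypothetical inputs: channel 0 points at the interviewer, channel 1 at the guest.
roles = {0: "Interviewer", 1: "Guest"}
fragments = {0: [(0.0, 2.0), (5.0, 6.5)], 1: [(2.0, 5.0)]}


def fake_asr(fragment):
    """Stand-in for a real speech recognizer; returns placeholder text."""
    return f"<speech {fragment['start_s']:.1f}-{fragment['end_s']:.1f}s>"


transcript = attach_text(label_fragments(fragments, roles), fake_asr)
```

Because the role is fixed by the channel a fragment came from, no speaker-identification model is needed at this stage; the label simply follows the fragment into the transcript.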
According to another aspect of the invention, a computer-readable storage medium is also provided, storing computer-executable instructions which, when executed by a processor, cause the processor to:
collect, through pick-up heads directed at different speakers, the channel audio corresponding to each speaking role pointed at;
apply gain processing to each channel audio according to the speaking role that the channel audio points at;
perform noise reduction on each gain-processed channel audio according to the side audio other than that of the speaking role the channel audio points at;
perform echo cancellation on each noise-reduced channel audio;
cut each echo-cancelled channel audio into audio fragments, and mark each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
convert each audio fragment into the corresponding text, and label that text with the speaking-role label marked on the audio fragment.
Obviously, those skilled in the art can make various modifications and variations to this application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to include them as well.
It should be noted that the invention may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the invention may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the invention (including related data structures) may be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or similar devices. In addition, some steps or functions of the invention may be implemented in hardware, for example as circuitry that cooperates with a processor to perform each step or function.
In addition, part of the invention may be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the invention through the operation of that computer. The program instructions that invoke the method of the invention may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the invention includes an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions based on the aforementioned embodiments of the invention.
It will be apparent to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced in the invention. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Claims (8)
1. A method of speech role separation, wherein the method comprises:
extracting, with a single pick-up head, the voices of different speakers separately by separating the left and right channels, wherein the single pick-up head is a microphone having multiple directional modes;
applying gain processing to each channel audio according to the speaking role that the channel audio points at;
performing noise reduction on each gain-processed channel audio according to the side audio other than that of the speaking role the channel audio points at;
performing echo cancellation on each noise-reduced channel audio;
cutting each echo-cancelled channel audio into audio fragments, and marking each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
converting each audio fragment into the corresponding text, and labeling that text with the speaking-role label marked on the audio fragment;
marking important paragraphs during recording so that they can be reviewed in subsequent use;
selecting, with the sentence as the unit, different sentences/paragraphs for audio playback.
2. The method according to claim 1, wherein performing echo cancellation on each noise-reduced channel audio comprises:
performing echo cancellation on each noise-reduced channel audio using a method based on ANC (active noise cancellation).
3. The method according to claim 1, wherein cutting each echo-cancelled channel audio into audio fragments, and marking each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at, comprises:
a human-computer interaction unit receiving each echo-cancelled channel audio;
the human-computer interaction unit cutting each channel audio into audio fragments, and marking each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
the human-computer interaction unit uploading the audio fragments marked with the corresponding speaking-role labels to the cloud.
4. The method according to claim 3, wherein, after converting each audio fragment into the corresponding text and labeling that text with the speaking-role label marked on the audio fragment, the method further comprises:
the human-computer interaction unit obtaining the audio fragments marked with speaking-role labels and the corresponding text;
the human-computer interaction unit obtaining a request, selected by the user, for the audio and text corresponding to a certain speaking role;
the human-computer interaction unit, based on the request, obtaining the audio fragments marked with the corresponding speaking-role label and the corresponding text, and playing them.
5. The method according to claim 1, wherein converting each audio fragment into the corresponding text comprises:
identifying and rejecting, by means of a VAD algorithm, the audio frames in each audio fragment that contain no speech signal;
converting, by means of ASR, each audio fragment from which the non-speech frames have been rejected into the corresponding text.
6. A device for speech role separation, wherein the device comprises:
a speech signal collection unit, configured to extract, with a single pick-up head, the voices of different speakers separately by separating the left and right channels, wherein the single pick-up head is a microphone having multiple directional modes;
an enhancement processing unit, configured to apply gain processing to each channel audio according to the speaking role that the channel audio points at;
a noise reduction processing unit, configured to perform noise reduction on each gain-processed channel audio according to the side audio other than that of the speaking role the channel audio points at;
an adaptive beamforming unit, configured to perform echo cancellation on each noise-reduced channel audio;
a sound source localization unit, configured to cut each echo-cancelled channel audio into audio fragments and to mark each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
a role separation unit, configured to convert each audio fragment into the corresponding text and to label that text with the speaking-role label marked on the audio fragment;
a human-computer interaction unit, configured to mark important paragraphs during recording so that they can be reviewed in subsequent use, and to select, with the sentence as the unit, different sentences/paragraphs for audio playback.
7. A computing-based device, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
extract, with a single pick-up head, the voices of different speakers separately by separating the left and right channels, wherein the single pick-up head is a microphone having multiple directional modes;
apply gain processing to each channel audio according to the speaking role that the channel audio points at;
perform noise reduction on each gain-processed channel audio according to the side audio other than that of the speaking role the channel audio points at;
perform echo cancellation on each noise-reduced channel audio;
cut each echo-cancelled channel audio into audio fragments, and mark each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
convert each audio fragment into the corresponding text, and label that text with the speaking-role label marked on the audio fragment;
mark important paragraphs during recording so that they can be reviewed in subsequent use;
select, with the sentence as the unit, different sentences/paragraphs for audio playback.
8. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to:
extract, with a single pick-up head, the voices of different speakers separately by separating the left and right channels, wherein the single pick-up head is a microphone having multiple directional modes;
apply gain processing to each channel audio according to the speaking role that the channel audio points at;
perform noise reduction on each gain-processed channel audio according to the side audio other than that of the speaking role the channel audio points at;
perform echo cancellation on each noise-reduced channel audio;
cut each echo-cancelled channel audio into audio fragments, and mark each audio fragment with the corresponding speaking-role label according to the speaking role the channel audio points at;
convert each audio fragment into the corresponding text, and label that text with the speaking-role label marked on the audio fragment;
mark important paragraphs during recording so that they can be reviewed in subsequent use;
select, with the sentence as the unit, different sentences/paragraphs for audio playback.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810198543.7A CN108564952B (en) | 2018-03-12 | 2018-03-12 | The method and apparatus of speech roles separation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810198543.7A CN108564952B (en) | 2018-03-12 | 2018-03-12 | The method and apparatus of speech roles separation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108564952A | 2018-09-21 |
CN108564952B | 2019-06-07 |
Family
ID=63531600
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810198543.7A Active CN108564952B (en) | 2018-03-12 | 2018-03-12 | The method and apparatus of speech roles separation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108564952B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109709518B (en) * | 2018-12-25 | 2021-07-20 | 北京猎户星空科技有限公司 | Sound source positioning method and device, intelligent equipment and storage medium |
CN110459239A (en) * | 2019-03-19 | 2019-11-15 | 深圳壹秘科技有限公司 | Role analysis method, apparatus and computer readable storage medium based on voice data |
CN110189764B (en) * | 2019-05-29 | 2021-07-06 | 深圳壹秘科技有限公司 | System and method for displaying separated roles and recording equipment |
CN112151041B (en) * | 2019-06-26 | 2024-03-29 | 北京小米移动软件有限公司 | Recording method, device, equipment and storage medium based on recorder program |
CN110473566A (en) * | 2019-07-25 | 2019-11-19 | 深圳壹账通智能科技有限公司 | Audio separation method, device, electronic equipment and computer readable storage medium |
CN110648665A (en) * | 2019-09-09 | 2020-01-03 | 北京左医科技有限公司 | Session process recording system and method |
CN110853639B (en) * | 2019-10-23 | 2023-09-01 | 天津讯飞极智科技有限公司 | Voice transcription method and related device |
CN111128132A (en) * | 2019-12-19 | 2020-05-08 | 秒针信息技术有限公司 | Voice separation method, device and system and storage medium |
CN111243595B (en) * | 2019-12-31 | 2022-12-27 | 京东科技控股股份有限公司 | Information processing method and device |
CN111883135A (en) * | 2020-07-28 | 2020-11-03 | 北京声智科技有限公司 | Voice transcription method and device and electronic equipment |
CN111883168B (en) * | 2020-08-04 | 2023-12-22 | 上海明略人工智能(集团)有限公司 | Voice processing method and device |
CN112530411B (en) * | 2020-12-15 | 2021-07-20 | 北京快鱼电子股份公司 | Real-time role-based role transcription method, equipment and system |
CN113808592A (en) * | 2021-08-17 | 2021-12-17 | 百度在线网络技术(北京)有限公司 | Method and device for transcribing call recording, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106683661A (en) * | 2015-11-05 | 2017-05-17 | 阿里巴巴集团控股有限公司 | Role separation method and device based on voice |
CN107464564A (en) * | 2017-08-21 | 2017-12-12 | 腾讯科技(深圳)有限公司 | voice interactive method, device and equipment |
CN107749313A (en) * | 2017-11-23 | 2018-03-02 | 郑州大学第附属医院 | A kind of automatic transcription and the method for generation Telemedicine Consultation record |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009118316A (en) * | 2007-11-08 | 2009-05-28 | Yamaha Corp | Voice communication device |
JP2011107603A (en) * | 2009-11-20 | 2011-06-02 | Sony Corp | Speech recognition device, speech recognition method and program |
CN102610237A (en) * | 2012-03-21 | 2012-07-25 | 山东大学 | Digital signal processor (DSP) implementation system for two-channel convolution mixed voice signal blind source separation algorithm |
Non-Patent Citations (2)
Title |
---|
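"Research on Adaptive Filters and Their Applications" (《自适应滤波器及其应用研究》); 吴正茂; 《南昌水专学报》; 2004-06-30; Vol. 23, No. 2; pp. 36-38, 45 |
"The Role of Adaptive Filters in Noise Cancellation" (《自适应滤波器在噪声对消中的作用》); 杨红 et al.; 《长江工程职业技术学院院报》; 2005-12-31; Vol. 22, No. 4; pp. 55-56, 74 |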
Also Published As
Publication number | Publication date |
---|---|
CN108564952A (en) | 2018-09-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108564952B (en) | The method and apparatus of speech roles separation | |
JP7434137B2 (en) | Speech recognition method, device, equipment and computer readable storage medium | |
US11710478B2 (en) | Pre-wakeword speech processing | |
US10134421B1 (en) | Neural network based beam selection | |
CN110556103B (en) | Audio signal processing method, device, system, equipment and storage medium | |
US20180075860A1 (en) | Method for Microphone Selection and Multi-Talker Segmentation with Ambient Automated Speech Recognition (ASR) | |
US20080312918A1 (en) | Voice performance evaluation system and method for long-distance voice recognition | |
US10650840B1 (en) | Echo latency estimation | |
Sun et al. | Speaker diarization system for RT07 and RT09 meeting room audio | |
CN110858476B (en) | Sound collection method and device based on microphone array | |
CN110610718B (en) | Method and device for extracting expected sound source voice signal | |
CN107863099A (en) | A kind of new dual microphone speech detection and Enhancement Method | |
Ochi et al. | Multi-Talker Speech Recognition Based on Blind Source Separation with ad hoc Microphone Array Using Smartphones and Cloud Storage. | |
WO2022183968A1 (en) | Audio signal processing method, devices, system, and storage medium | |
CN115376534A (en) | Microphone array audio processing method and pickup chest card | |
CN111596261B (en) | Sound source positioning method and device | |
Parada et al. | Robust statistical processing of TDOA estimates for distant speaker diarization | |
Raj et al. | Frustratingly easy noise-aware training of acoustic models | |
Wang et al. | Denoising autoencoder and environment adaptation for distant-talking speech recognition with asynchronous speech recording | |
Takashima et al. | Estimation of talker's head orientation based on discrimination of the shape of cross-power spectrum phase coefficients | |
Sun et al. | Frame selection of interview channel for NIST speaker recognition evaluation | |
Wang et al. | Robust distant speech recognition based on position dependent CMN using a novel multiple microphone processing technique. | |
WO2023245700A1 (en) | Audio energy analysis method and related apparatus | |
Nayak | Multi-channel Enhancement and Diarization for Distant Speech Recognition | |
Fox et al. | Extending Limabeam with discrimination and coarse gradients |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||