CN108564952B - The method and apparatus of speech roles separation - Google Patents

The method and apparatus of speech roles separation

Info

Publication number
CN108564952B
Authority
CN
China
Prior art keywords
audio
role
channel audio
label
speaks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810198543.7A
Other languages
Chinese (zh)
Other versions
CN108564952A (en)
Inventor
徐常亮
陈凌云
廖健
范梦真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Wisdom Cloud Technology Co Ltd
Original Assignee
Xinhua Wisdom Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Wisdom Cloud Technology Co Ltd filed Critical Xinhua Wisdom Cloud Technology Co Ltd
Priority to CN201810198543.7A priority Critical patent/CN108564952B/en
Publication of CN108564952A publication Critical patent/CN108564952A/en
Application granted granted Critical
Publication of CN108564952B publication Critical patent/CN108564952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L21/0202

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The object of the present invention is to provide a method and apparatus for speech role separation. Microphones with multi-array directivity capture the voices of different people on different hardware, and combining algorithmic and hardware capabilities yields higher accuracy than role separation that relies on algorithms alone. A reporter conducting an interview does not need to understand the technical details: it is enough to position a recording device for each interviewee and open the app on a human-computer interaction device such as a mobile phone. Speech is then converted to text in real time or offline, and the resulting text already carries accurate role separation, which saves the reporter considerable time and effort when processing audio material.

Description

The method and apparatus of speech roles separation
Technical field
The present invention relates to the field of computers, and in particular to a method and apparatus for speech role separation.
Background art
As informatization and automation advance across industries, demand for more accurate data keeps rising. Take interviews as an example: recording is an indispensable part of an interview, and a reporter must transcribe the audio, analyse the content of the audio material, extract the useful information, and finally write it up as an article, which is heavy work. Advances in speech recognition technology offer a solution for processing such audio material.
Speaker role separation is an important step in processing interview audio. At present, most role separation schemes rely on the speaker's voiceprint features: after a speech signal is received, speaker change points are first detected using BIC (Bayesian Information Criterion) and the signal is split into multiple speech segments; the voice of each role is then modelled using GMM (Gaussian Mixture Model) and HMM (Hidden Markov Model). The speech segments of each speaker are thereby separated out, achieving role separation.
Here, BIC (Bayesian Information Criterion) is an index for evaluating how well a model fits the data; the smaller the BIC value, the better the fit. BIC = -2 ln(L) + k ln(n), where L is the maximized likelihood of the model, n is the number of samples, and k is the number of model parameters. GMM (Gaussian Mixture Model) uses Gaussian probability density functions to quantify things precisely, decomposing a thing into several models each based on a Gaussian probability density function. HMM (Hidden Markov Model) is a statistical model that describes a Markov process with hidden, unknown parameters.
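As a hedged illustration of the BIC formula above, the sketch below compares one Gaussian fitted to a whole feature sequence against two Gaussians fitted on either side of a candidate change point, which is one common way of applying BIC to speaker change detection. The function names, the scalar feature, and the parameter counts (2 per Gaussian) are assumptions for illustration; the patent does not describe the prior-art procedure at this level of detail.

```python
import math

def gaussian_log_likelihood(xs):
    """Maximized log-likelihood of xs under one 1-D Gaussian (ML mean and variance)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    var = max(var, 1e-12)  # guard against zero variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def bic(log_likelihood, n, k):
    """BIC = -2 ln(L) + k ln(n); a smaller value means a better fit."""
    return -2.0 * log_likelihood + k * math.log(n)

def split_improves_bic(xs, t):
    """True if modelling xs[:t] and xs[t:] separately gives a lower BIC than one model."""
    n = len(xs)
    one_model = bic(gaussian_log_likelihood(xs), n, k=2)
    two_models = bic(
        gaussian_log_likelihood(xs[:t]) + gaussian_log_likelihood(xs[t:]), n, k=4
    )
    return two_models < one_model
```

A change point is accepted only when the better fit of the two-model split outweighs the penalty for its extra parameters.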
The above solution separates well in an ideal recording environment. In an interview, however, the space is not known in advance and sound propagation is strongly affected by it: because of reflection and diffraction, the signal a microphone receives contains superimposed multipath signals in addition to the direct signal, so the signal is disturbed; this is reverberation. Indoors, diffraction and reflection from room boundaries or obstacles make sound linger, which greatly reduces speech intelligibility. In addition, the number of speakers is uncertain, so the accuracy of role separation may drop sharply.
Summary of the invention
An object of the present invention is to provide a method and apparatus for speech role separation that solve the low accuracy of existing speech role separation schemes.
According to one aspect of the invention, a method of speech role separation is provided, the method comprising:
acquiring, through pickups pointed at different speakers, the channel audio corresponding to each speaking role pointed at;
applying gain processing to each channel audio according to the speaking role that the channel audio is pointed at;
applying noise reduction to each gain-processed channel audio according to the side audio in that channel, that is, audio other than that of the speaking role pointed at;
applying echo cancellation to each noise-reduced channel audio;
cutting each echo-cancelled channel audio into audio segments, and marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at;
converting each audio segment into corresponding text, and marking the text with the speaking-role label of its audio segment.
Further, in the above method, the pickups pointed at different speakers include any of the following:
a microphone with a single pickup but multiple directional modes;
two or more microphones on a mobile phone;
two or more microphones on a voice recorder;
microphones of two or more independent devices.
Further, in the above method, applying echo cancellation to each noise-reduced channel audio comprises:
applying echo cancellation to each noise-reduced channel audio using a method based on ANC active noise cancellation.
Further, in the above method, marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at comprises:
using a TDOA algorithm to estimate the difference in the time at which an audio segment in each channel audio arrives at different microphones, calculating a distance difference from that time difference, and then determining the speaking role the audio segment corresponds to from the calculated distance difference and the spatial geometry of the microphones.
Further, in the above method, cutting each echo-cancelled channel audio into audio segments and marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at comprises:
a human-computer interaction unit receiving each echo-cancelled channel audio;
the human-computer interaction unit cutting each channel audio into audio segments and marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at;
the human-computer interaction unit uploading the audio segments marked with speaking-role labels to the cloud.
Further, in the above method, after converting each audio segment into corresponding text and marking the text with the speaking-role label of its audio segment, the method further comprises:
the human-computer interaction unit obtaining the audio segments marked with speaking-role labels and the corresponding text;
the human-computer interaction unit receiving a user's request for the audio and text of a selected speaking role;
the human-computer interaction unit, based on the request, obtaining the audio segments marked with the corresponding speaking-role label and the corresponding text, and playing them.
Further, in the above method, converting each audio segment into corresponding text comprises:
using a VAD algorithm, identifying and removing the audio frames in each audio segment that contain no speech signal;
using ASR, converting the audio segments from which the frames containing no speech signal have been removed into corresponding text.
Further, in the above method, the number of pickups pointed at different speakers is 2 to 4, and the distance between a pickup and its speaking role is less than 1 meter.
According to another aspect of the present invention, a device for speech role separation is also provided, the device comprising:
a speech signal acquisition unit, for acquiring, through pickups pointed at different speakers, the channel audio corresponding to each speaking role pointed at;
an enhancement processing unit, for applying gain processing to each channel audio according to the speaking role that the channel audio is pointed at;
a noise reduction processing unit, for applying noise reduction to each gain-processed channel audio according to the side audio in that channel, that is, audio other than that of the speaking role pointed at;
an adaptive beamforming unit, for applying echo cancellation to each noise-reduced channel audio;
a sound source localization unit, for cutting each echo-cancelled channel audio into audio segments and marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at;
a role separation unit, for converting each audio segment into corresponding text and marking the text with the speaking-role label of its audio segment.
According to another aspect of the present invention, a computing-based device is also provided, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
acquire, through pickups pointed at different speakers, the channel audio corresponding to each speaking role pointed at;
apply gain processing to each channel audio according to the speaking role that the channel audio is pointed at;
apply noise reduction to each gain-processed channel audio according to the side audio in that channel, that is, audio other than that of the speaking role pointed at;
apply echo cancellation to each noise-reduced channel audio;
cut each echo-cancelled channel audio into audio segments, and mark each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at;
convert each audio segment into corresponding text, and mark the text with the speaking-role label of its audio segment.
According to another aspect of the present invention, a computer-readable storage medium is also provided, storing computer-executable instructions which, when executed by a processor, cause the processor to:
acquire, through pickups pointed at different speakers, the channel audio corresponding to each speaking role pointed at;
apply gain processing to each channel audio according to the speaking role that the channel audio is pointed at;
apply noise reduction to each gain-processed channel audio according to the side audio in that channel, that is, audio other than that of the speaking role pointed at;
apply echo cancellation to each noise-reduced channel audio;
cut each echo-cancelled channel audio into audio segments, and mark each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at;
convert each audio segment into corresponding text, and mark the text with the speaking-role label of its audio segment.
Compared with the prior art, the present invention uses microphones with multi-array directivity and captures the voices of different people on different hardware; combining algorithmic and hardware capabilities yields higher accuracy than role separation that relies on algorithms alone. A reporter conducting an interview does not need to understand the technical details: it is enough to position a recording device for each interviewee and open the app on a human-computer interaction device such as a mobile phone. Speech is then converted to text in real time or offline, and the resulting text already carries accurate role separation, which saves the reporter considerable time and effort when processing audio material.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent on reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 shows a flow chart of the speech role separation method of one embodiment of the invention;
Fig. 2 shows a schematic diagram of the speech role separation method and device of one embodiment of the invention;
Fig. 3 shows a schematic diagram of an adaptive noise canceller according to one embodiment of the invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As shown in Figs. 1 and 2, the present invention provides a method of speech role separation, comprising:
Step S1: acquiring, through pickups pointed at different speakers, the channel audio corresponding to each speaking role pointed at;
Here, the speech signal acquisition unit may capture the voices of different speakers through a multi-directional pickup microphone array, for example several shotgun microphones each pointed at a different speaker, thereby obtaining multiple distinct channels of audio. Because microphones with multi-array directivity are used and the voices of different people are captured on different hardware, algorithmic and hardware capabilities are combined, which gives higher accuracy than role separation relying on algorithms alone;
Step S2: applying gain processing to each channel audio according to the speaking role that the channel audio is pointed at;
Here, the enhancement processing unit may apply gain to the audio beam captured from the direction in which, for example, a shotgun microphone is pointed;
Step S3: applying noise reduction to each gain-processed channel audio according to the side audio in that channel, that is, audio other than that of the speaking role pointed at;
Here, the noise reduction processing unit may suppress the audio signal entering from the side of each directional microphone, thereby performing noise reduction;
Step S4: applying echo cancellation to each noise-reduced channel audio;
Step S5: cutting each echo-cancelled channel audio into audio segments, and marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at;
Step S6: converting each audio segment into corresponding text, and marking the text with the speaking-role label of its audio segment.
Here, the role separation unit may mark the text with the speaking-role label of the corresponding audio segment.
With the scheme of the invention, recording uses multiple directional pickups (such as shotgun microphones), which captures the voice of each speaker as fully as possible and avoids noise interference. In a two-speaker scene, recording with two pickups is preferred and gives better results. A reporter conducting an interview does not need to understand the technical details: it is enough to position a recording device for each interviewee. Speech is then converted to text in real time or offline, and the resulting text already carries accurate role separation, which saves the reporter considerable time and effort when processing audio material.
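The labelling in steps S5 and S6 can be sketched as follows, assuming each channel is already known to point at one speaker. The `Segment` class, the `channel_to_speaker` mapping, and the `asr` callable are hypothetical names introduced for illustration; the patent does not define a data model, and the signal processing of steps S2 to S4 is omitted.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # segment start time in seconds
    end: float          # segment end time in seconds
    channel: int        # index of the channel the segment was cut from
    speaker: str = ""   # speaking-role label, filled in by step S5
    text: str = ""      # transcript, filled in by step S6

def label_segments(segments, channel_to_speaker):
    """S5: give each segment the label of the speaker its channel's pickup points at."""
    for seg in segments:
        seg.speaker = channel_to_speaker[seg.channel]
    return segments

def transcribe(segments, asr):
    """S6: convert each segment to text; the text keeps the segment's speaker label."""
    for seg in segments:
        seg.text = asr(seg)  # asr is a placeholder for any speech-to-text callable
    return sorted(segments, key=lambda s: s.start)
```

Sorting by start time interleaves the channels back into a single role-separated transcript.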
In an embodiment of the speech role separation method of the invention, in step S1, the pickups pointed at different speakers include any of the following:
a microphone with a single pickup but multiple directional modes;
two or more microphones on a mobile phone;
two or more microphones on a voice recorder;
microphones of two or more independent devices.
Here, in order to distinguish different speakers, the following ways of acquiring the speech signal may also be supported:
A) a microphone with a single pickup but multiple directional modes as the audio input, so that the audio obtained from differently pointed microphones can be transmitted over different channels;
B) a mobile phone with two or more microphones as the audio input source, such as the Samsung GALAXY S6;
C) a voice recorder with two or more microphones;
D) captured live-stream push: in a video interview scene, the voices of different speakers can be obtained from the live streams pushed by different devices;
E) other multi-channel audio capture and transmission devices, such as a computer's built-in microphone plus a separate microphone, or a mobile phone's built-in microphone plus a separate microphone.
In an embodiment of the speech role separation method of the invention, step S4, applying echo cancellation to each noise-reduced channel audio, comprises:
applying echo cancellation to each noise-reduced channel audio using a method based on ANC active noise cancellation.
Here, an adaptive beamforming unit may use an ABF (Adaptive Beam Forming) scheme to reduce interference from noise and echo. ABF uses an antenna array to concentrate signal energy into a very narrow beam, improving antenna propagation efficiency, radio link reliability, and frequency reuse. As shown in Fig. 3, GSLC (generalized sidelobe canceller), one form of ABF, is an active noise cancellation method based on ANC (adaptive noise cancellation). The noisy signal passes through a main channel and an auxiliary channel simultaneously; the blocking matrix of the auxiliary channel filters out the speech signal, leaving a reference signal that contains only multi-channel noise. Each channel derives an optimal signal estimate from the noise signal, from which an estimate of the clean speech signal is obtained. ANC removes the noise in the noisy signal by performing a cancellation operation between the noise-contaminated speech signal and the reference signal.
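The cancellation between a noisy signal and a noise-only reference can be sketched with a least-mean-squares (LMS) adaptive filter. LMS is used here as one common adaptation rule; the patent does not name the update rule, and the tap count, step size `mu`, and function name are assumptions chosen for illustration.

```python
def lms_cancel(primary, reference, taps=4, mu=0.01):
    """Subtract an adaptively filtered copy of `reference` from `primary`.

    primary:   speech plus noise (main channel)
    reference: noise only (auxiliary channel after the blocking matrix)
    Returns the error signal, which serves as the clean speech estimate.
    """
    w = [0.0] * taps  # adaptive filter weights
    out = []
    for n in range(len(primary)):
        # Most recent `taps` reference samples, newest first, zero-padded at the start.
        x = [reference[n - i] if n - i >= 0 else 0.0 for i in range(taps)]
        y = sum(wi * xi for wi, xi in zip(w, x))  # estimate of the noise in primary
        e = primary[n] - y                        # error = cleaned output sample
        w = [wi + 2 * mu * e * xi for wi, xi in zip(w, x)]  # LMS weight update
        out.append(e)
    return out
```

Because the speech is uncorrelated with the reference, the filter converges toward the path the noise takes into the main channel, so subtracting its output leaves mostly speech.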
The method of speech roles separation of the invention according in each channel audio in step S5, to correspond in embodiment The pointed role that speaks marks corresponding role's label of speaking to each audio fragment, comprising:
Estimate that the audio fragment in each channel audio reaches the delay inequality of different microphones using TDOA algorithm, according to institute It states delay inequality and calculates range difference, then the space geometry of range difference obtained by calculation and microphone to determine that audio fragment is corresponding The pointed role that speaks.
Here, can use TDOA (when Time Difference of Arrival- is reached by an auditory localization unit Between it is poor) algorithm estimates that the audio fragment (sound source) in each channel audio reaches the delay inequality of different microphones, and calculates distance Difference, then determine by the space geometry of range difference and microphone the position (the corresponding pointed role that speaks) of audio fragment.
TDOA (Time Difference of Arrival- reaching time-difference) is a kind of to be positioned using the time difference Method.The time for reaching monitoring station by measuring signal, it can determine the distance of signal source, utilize signal source to each monitoring station Distance (centered on monitoring station, distance be radius work justify), just can determine that the position of signal.
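A minimal sketch of the delay and distance-difference estimate for two microphones is shown below, assuming the delay is found by maximizing the cross-correlation of the two channels. The function names, the brute-force search over lags, and the speed of sound of 343 m/s are assumptions for illustration; the patent does not specify how the delay is estimated.

```python
def estimate_delay(a, b, max_lag):
    """Lag in samples by which channel b trails channel a (negative if b leads)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(a)):
            m = n + lag
            if 0 <= m < len(b):
                score += a[n] * b[m]  # cross-correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def distance_difference(lag_samples, sample_rate_hz, speed_of_sound=343.0):
    """Convert a delay in samples to a path-length difference in meters."""
    return lag_samples / sample_rate_hz * speed_of_sound
```

A positive lag means the sound reached microphone a first, so the source is closer to a; combined with the microphone positions, this is what lets a segment be attributed to the speaker each pickup points at.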
In an embodiment of the speech role separation method of the invention, step S5, cutting each echo-cancelled channel audio into audio segments and marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at, comprises:
a human-computer interaction unit receiving each echo-cancelled channel audio;
the human-computer interaction unit cutting each channel audio into audio segments and marking each audio segment with the speaking-role label of the speaking role that its channel audio is pointed at;
the human-computer interaction unit uploading the audio segments marked with speaking-role labels to the cloud.
Here, an audio transmission unit may transfer the audio signal, over a wired connection (data cable) or wirelessly (WiFi, Bluetooth, or another wireless channel), to a human-computer interaction unit such as a mobile phone, computer, or smart hardware (for example a mobile app, a web application, or a smart speaker); the human-computer interaction unit then transfers the audio signal to the cloud, where the subsequent speech-to-text conversion is performed.
For audio transmission, using a wired scheme such as USB avoids loss of audio data (wireless transmission is more easily constrained by the link and is prone to packet loss in transit).
Uploading the audio data as a recording file (rather than as an audio stream) and then converting it to text gives better accuracy.
The method of speech roles separation of the invention is in embodiment, each audio fragment is converted to correspondence by step S6 Text, according to role's label of speaking that each audio fragment marks, after role's mark of speaking described in corresponding label character, Further include:
Man-machine interaction unit obtains the audio fragment and corresponding text after role's label of speaking of mark;
The man-machine interaction unit obtains the correspondence audio of a certain role that speaks of user's selection and the request of text;
The man-machine interaction unit is based on the request, obtains the audio fragment and correspondence of the corresponding role's label of speaking of mark Text play out.
Here, the man-machine interaction unit can provide corresponding application program (such as cell phone application and web application) to user It uses, following function can be specifically included:
α) recording control: recording is opened, pause, terminates, save recording in real time;
B) user can be marked important paragraph in Recording Process, and be looked into the subsequent use process It sees;
C) different microphones is named: the title of interviewee is set;
D) selectivity plays the audio of different speakers: selection speaker undepandent can individually play corresponding speaker's Audio;
E) selectivity shows the text of different speakers: selection speaker undepandent can show corresponding speaker's audio conversion Text out;
F) order and which where is broadcast: using sentence as dimension, user can choose different words/paragraphs and play out;
G) content of text is edited: produces next word content later to recording and is edited, deleted, renamed;
H) search for particular keywords: in conjunction with search engine technique, user can input keyword, to own recording, text Word and speaker scan for;
I) it downloads and exports audio file: recording file is exported;
J) cloud is synchronous: can be used simultaneously in more equipment, and carry out cloud to audio content and synchronize, avoid losing for data It loses.
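Functions D), E), and H) amount to filtering a labelled transcript. The sketch below assumes the transcript is a list of dictionaries with `speaker` and `text` keys; that layout and both function names are hypothetical, since the patent does not describe how the application stores its data.

```python
def select_by_speaker(transcript, speaker):
    """Return only the entries marked with the given speaking-role label."""
    return [item for item in transcript if item["speaker"] == speaker]

def search(transcript, keyword):
    """Case-insensitive keyword search over both text and speaker names."""
    kw = keyword.lower()
    return [
        item for item in transcript
        if kw in item["text"].lower() or kw in item["speaker"].lower()
    ]
```

Because every sentence already carries its label from step S6, per-speaker playback and display need no further audio analysis.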
In an embodiment of the speech role separation method of the invention, in step S6, converting each audio segment into corresponding text comprises:
using a VAD algorithm, identifying and removing the audio frames in each audio segment that contain no speech signal;
using ASR, converting the audio segments from which the frames containing no speech signal have been removed into corresponding text.
Here, a silence removal unit may use a VAD (Voice Activity Detection) algorithm to identify and remove the audio frames in each audio segment that contain no speech signal, reducing unnecessary computation in the subsequent speech-to-text step. The purpose of VAD is to identify and eliminate long silent periods from the speech signal stream, saving speech transcription cost.
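Frame-level silence removal can be sketched with a simple energy threshold. This is a stand-in for a real VAD algorithm, which the patent does not specify; the frame length of 160 samples, the threshold value, and the function names are assumptions for illustration.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def drop_silent_frames(samples, frame_len=160, threshold=1e-4):
    """Keep only frames whose energy reaches the threshold; drop the rest."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame and frame_energy(frame) >= threshold:
            kept.extend(frame)
    return kept
```

Only the retained frames would be passed on to ASR, which is where the saving in transcription cost comes from.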
In addition, ASR (Automatic Speech Recognition) is a technology that converts human speech into text. A speech-to-text unit may use ASR to convert each of the above audio segments into text and return it to the audio-to-text program. This embodiment supports several usage scenarios, including converting a real-time audio stream to text and converting an offline recording file to text.
In an embodiment of the speech role separation method of the invention, in step S1, the number of pickups pointed at different speakers is 2 to 4, and the distance between a pickup and its speaking role is less than 1 meter.
Here, the number of microphones is >= 2. For far-field recording scenes, a dual-microphone setup is easier to build, consumes less power, costs less to use, and is a more mature scheme.
Pickup may use multiple pickups; with a single pickup, the voices of different speakers can be extracted separately by splitting the left and right channels.
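Splitting left and right channels can be sketched as below, assuming the stereo samples arrive interleaved as left, right, left, right. The interleaved layout and the function name are assumptions; the patent does not state the sample format.

```python
def split_stereo(interleaved):
    """Split interleaved [L0, R0, L1, R1, ...] samples into (left, right) lists."""
    if len(interleaved) % 2 != 0:
        raise ValueError("stereo data must contain an even number of samples")
    return interleaved[0::2], interleaved[1::2]
```

Each returned list can then be treated as the channel audio of one speaker.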
The distance and angle of the microphones can be customized by the user, who can position the microphones to suit the interview scene and interview distance.
The distance between a microphone and its speaker should preferably not exceed 1 meter, in which case an accuracy of 90% or more can be obtained.
Because Bluetooth transmission bandwidth is limited, when audio is transmitted over Bluetooth the signal in its original PCM format must be compressed before transmission, and the number of microphones should be less than 4.
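The bandwidth constraint can be made concrete with the raw PCM bitrate, which is the sample rate times the bit depth times the number of channels. In the sketch below the sample rate, bit depth, link capacity, and compression ratio are all caller-supplied assumptions; the patent gives no figures for any of them.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def fits_link(bitrate_bps, link_capacity_bps, compression_ratio=1.0):
    """True if the stream, after compression, fits within the link capacity."""
    return bitrate_bps / compression_ratio <= link_capacity_bps
```

For instance, four channels of 16-bit audio at 16 kHz give `pcm_bitrate(16000, 16, 4)` = 1,024,000 bit/s, so each added microphone raises the load on the link in proportion.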
According to another aspect of the present invention, a device for speech role separation is also provided, wherein the device comprises:
A speech signal acquisition unit, configured to acquire, through pickup heads directed at different speakers, the channel audio corresponding to each speaking role;
An enhancement processing unit, configured to apply gain processing to each channel audio according to the speaking role that the channel audio is directed at;
A noise reduction processing unit, configured to apply noise reduction to each gain-processed channel audio according to the side audio in that channel, that is, audio other than that of the speaking role the channel is directed at;
An adaptive beamforming unit, configured to apply echo elimination to each noise-reduced channel audio;
A sound source localization unit, configured to cut each echo-eliminated channel audio into audio fragments and to mark each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
A role separation unit, configured to convert each audio fragment into corresponding text and to mark that text with the speaking-role label marked on the audio fragment.
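The last two units in the list above (fragment labeling and role-labeled text) can be sketched as a small data structure plus a merge step. This is a minimal illustration only: the `Segment` class, the `build_transcript` function, and the stand-in recognizer `fake_asr` are hypothetical names, and a real system would call an actual ASR engine instead of decoding bytes.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    role: str        # speaking-role label inherited from the channel
    start_s: float   # fragment start time in seconds
    end_s: float     # fragment end time in seconds
    audio: bytes     # fragment audio payload
    text: str = ""   # filled in after speech-to-text

def build_transcript(segments: List[Segment],
                     recognize: Callable[[bytes], str]) -> List[str]:
    """Convert each fragment to text and prefix it with its speaking-role label."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start_s):
        seg.text = recognize(seg.audio)
        lines.append(f"[{seg.role}] {seg.text}")
    return lines

def fake_asr(audio: bytes) -> str:
    # Stand-in for a real recognizer: the "audio" here is just UTF-8 text.
    return audio.decode("utf-8")

segments = [
    Segment("interviewee", 2.0, 4.5, b"Fine, thanks."),
    Segment("interviewer", 0.0, 1.8, b"How are you?"),
]
transcript = build_transcript(segments, fake_asr)
```

Sorting by start time interleaves the fragments from different channels into a single role-labeled transcript.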
According to another aspect of the present invention, a computing-based device is also provided, comprising:
A processor; and
A memory arranged to store computer-executable instructions which, when executed, cause the processor to:
Acquire, through pickup heads directed at different speakers, the channel audio corresponding to each speaking role;
Apply gain processing to each channel audio according to the speaking role that the channel audio is directed at;
Apply noise reduction to each gain-processed channel audio according to the side audio in that channel other than that of the speaking role it is directed at;
Apply echo elimination to each noise-reduced channel audio;
Cut each echo-eliminated channel audio into audio fragments, and mark each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
Convert each audio fragment into corresponding text, and mark that text with the speaking-role label marked on the audio fragment.
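The text does not specify how the side audio is used for noise reduction. One possible reading, sketched below as an assumption rather than as the patented method, is a frame-wise comparison: in each frame the channel with the highest energy is treated as the active speaking role, and the other channels, which mostly contain leaked side audio, are attenuated. The function names, frame length, and attenuation factor are all hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of a frame (0.0 for an empty frame)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0

def suppress_side_audio(channels, frame_len=4, attenuation=0.5):
    """Attenuate, frame by frame, every channel that is not the loudest one.

    channels maps a speaking-role label to a list of samples.
    """
    roles = list(channels)
    n = min(len(channels[r]) for r in roles)
    out = {r: [] for r in roles}
    for start in range(0, n, frame_len):
        frames = {r: channels[r][start:start + frame_len] for r in roles}
        # Assumed heuristic: the loudest channel carries the active speaker.
        dominant = max(roles, key=lambda r: frame_energy(frames[r]))
        for r in roles:
            scale = 1.0 if r == dominant else attenuation
            out[r].extend(x * scale for x in frames[r])
    return out

mixed = {"A": [10, 10, 10, 10, 1, 1, 1, 1],
         "B": [2, 2, 2, 2, 8, 8, 8, 8]}
cleaned_channels = suppress_side_audio(mixed)
```

In the example, role A dominates the first frame and role B the second, so each channel keeps its own speech at full level while the leaked portion is halved.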
According to another aspect of the present invention, a computer-readable storage medium is also provided, on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
Acquire, through pickup heads directed at different speakers, the channel audio corresponding to each speaking role;
Apply gain processing to each channel audio according to the speaking role that the channel audio is directed at;
Apply noise reduction to each gain-processed channel audio according to the side audio in that channel other than that of the speaking role it is directed at;
Apply echo elimination to each noise-reduced channel audio;
Cut each echo-eliminated channel audio into audio fragments, and mark each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
Convert each audio fragment into corresponding text, and mark that text with the speaking-role label marked on the audio fragment.
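The claims state that echo elimination may use a method based on ANC (active noise cancellation) but do not name a specific algorithm. As one commonly used possibility, the sketch below applies an LMS adaptive filter: a reference signal (for example, the other speaker's channel) is filtered to estimate the component leaking into the primary channel, and the estimate is subtracted. The choice of LMS, the tap count, and the step size `mu` are assumptions made for illustration.

```python
def lms_cancel(primary, reference, taps=4, mu=0.05):
    """Subtract an adaptively filtered copy of `reference` from `primary`.

    Returns the error signal, which is the primary channel with the
    estimated leakage removed.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate
        # LMS update: move the weights along the error gradient.
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned

# Toy case: the primary channel contains only leakage (0.5 x reference),
# so the cleaned output should decay toward zero as the filter adapts.
reference = [1.0 if n % 2 == 0 else -1.0 for n in range(200)]
primary = [0.5 * x for x in reference]
cleaned = lms_cancel(primary, reference)
```

With real speech the primary channel would also contain the target speaker, which is uncorrelated with the reference and therefore survives in the error signal.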
Obviously, those skilled in the art can make various modifications and variations to the application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the application and their technical equivalents, the application is also intended to include them.
It should be noted that the present invention can be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the invention can be executed by a processor to implement the steps or functions described above. Likewise, the software program of the invention (including related data structures) can be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the invention may be implemented in hardware, for example as a circuit that cooperates with a processor to execute each step or function.
In addition, part of the invention can be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the invention through the operation of that computer. The program instructions that invoke the method of the invention may be stored in a fixed or removable recording medium, and/or transmitted through a data stream in broadcast or other signal-bearing media, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the invention includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the invention.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whatever point of view, the embodiments are to be regarded as illustrative and not restrictive, and the scope of the invention is defined by the appended claims rather than by the above description; all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be included in the invention. Reference signs in the claims should not be construed as limiting the claims concerned. Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.

Claims (8)

1. A method of speech role separation, wherein the method comprises:
Extracting, with a single pickup head, the sounds of different speakers separately by separating the left and right channels, wherein the single pickup head is a microphone having multiple directivity modes;
Applying gain processing to each channel audio according to the speaking role that the channel audio is directed at;
Applying noise reduction to each gain-processed channel audio according to the side audio in that channel other than that of the speaking role it is directed at;
Applying echo elimination to each noise-reduced channel audio;
Cutting each echo-eliminated channel audio into audio fragments, and marking each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
Converting each audio fragment into corresponding text, and marking that text with the speaking-role label marked on the audio fragment;
Marking important paragraphs during recording, for review during subsequent use;
Selecting, with the sentence as the unit, different utterances/paragraphs for audio playback.
2. The method according to claim 1, wherein applying echo elimination to each noise-reduced channel audio comprises:
Applying echo elimination to each noise-reduced channel audio using a method based on ANC (active noise cancellation).
3. The method according to claim 1, wherein cutting each echo-eliminated channel audio into audio fragments, and marking each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at, comprises:
A human-computer interaction unit receiving each echo-eliminated channel audio;
The human-computer interaction unit cutting each channel audio into audio fragments, and marking each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
The human-computer interaction unit uploading the audio fragments marked with the corresponding speaking-role labels to the cloud.
4. The method according to claim 3, wherein, after converting each audio fragment into corresponding text and marking that text with the speaking-role label marked on the audio fragment, the method further comprises:
The human-computer interaction unit obtaining the audio fragments and corresponding text marked with speaking-role labels;
The human-computer interaction unit obtaining a user's request for the audio and text corresponding to a selected speaking role;
The human-computer interaction unit, based on the request, obtaining and playing the audio fragments and corresponding text marked with the corresponding speaking-role label.
5. The method according to claim 1, wherein converting each audio fragment into corresponding text comprises:
Identifying and rejecting, by a VAD algorithm, the audio frames in each audio fragment that do not contain a speech signal;
Converting, using an ASR algorithm, the audio fragments from which the audio frames not containing a speech signal have been rejected into corresponding text.
6. A device for speech role separation, wherein the device comprises:
A speech signal acquisition unit, configured to extract, with a single pickup head, the sounds of different speakers separately by separating the left and right channels, wherein the single pickup head is a microphone having multiple directivity modes;
An enhancement processing unit, configured to apply gain processing to each channel audio according to the speaking role that the channel audio is directed at;
A noise reduction processing unit, configured to apply noise reduction to each gain-processed channel audio according to the side audio in that channel other than that of the speaking role it is directed at;
An adaptive beamforming unit, configured to apply echo elimination to each noise-reduced channel audio;
A sound source localization unit, configured to cut each echo-eliminated channel audio into audio fragments and to mark each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
A role separation unit, configured to convert each audio fragment into corresponding text and to mark that text with the speaking-role label marked on the audio fragment;
A human-computer interaction unit, configured to mark important paragraphs during recording, for review during subsequent use, and to select, with the sentence as the unit, different utterances/paragraphs for audio playback.
7. A computing-based device, comprising:
A processor; and
A memory arranged to store computer-executable instructions which, when executed, cause the processor to:
Extract, with a single pickup head, the sounds of different speakers separately by separating the left and right channels, wherein the single pickup head is a microphone having multiple directivity modes;
Apply gain processing to each channel audio according to the speaking role that the channel audio is directed at;
Apply noise reduction to each gain-processed channel audio according to the side audio in that channel other than that of the speaking role it is directed at;
Apply echo elimination to each noise-reduced channel audio;
Cut each echo-eliminated channel audio into audio fragments, and mark each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
Convert each audio fragment into corresponding text, and mark that text with the speaking-role label marked on the audio fragment;
Mark important paragraphs during recording, for review during subsequent use;
Select, with the sentence as the unit, different utterances/paragraphs for audio playback.
8. A computer-readable storage medium on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
Extract, with a single pickup head, the sounds of different speakers separately by separating the left and right channels, wherein the single pickup head is a microphone having multiple directivity modes;
Apply gain processing to each channel audio according to the speaking role that the channel audio is directed at;
Apply noise reduction to each gain-processed channel audio according to the side audio in that channel other than that of the speaking role it is directed at;
Apply echo elimination to each noise-reduced channel audio;
Cut each echo-eliminated channel audio into audio fragments, and mark each audio fragment with the speaking-role label of the speaking role that its channel audio is directed at;
Convert each audio fragment into corresponding text, and mark that text with the speaking-role label marked on the audio fragment;
Mark important paragraphs during recording, for review during subsequent use;
Select, with the sentence as the unit, different utterances/paragraphs for audio playback.
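Claim 5 refers to a VAD algorithm that rejects audio frames containing no speech signal before ASR, without specifying the algorithm. A minimal energy-threshold sketch is shown below as one simple possibility; the frame length of 160 samples (10 ms at an assumed 16 kHz sample rate), the threshold value, and the function name `drop_silent_frames` are hypothetical, and production VADs typically use more robust features.

```python
def drop_silent_frames(samples, frame_len=160, threshold=1000.0):
    """Keep only frames whose mean squared amplitude reaches the threshold."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept

# One silent frame, one loud frame, one near-silent frame.
fragment = [0] * 160 + [500] * 160 + [3] * 160
speech_only = drop_silent_frames(fragment)
```

Only the retained samples would then be passed to the ASR step, which reduces the amount of audio the recognizer has to process.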
CN201810198543.7A 2018-03-12 2018-03-12 The method and apparatus of speech roles separation Active CN108564952B (en)

Publications (2)

Publication Number Publication Date
CN108564952A CN108564952A (en) 2018-09-21
CN108564952B true CN108564952B (en) 2019-06-07

Family

ID=63531600

Country Status (1)

Country Link
CN (1) CN108564952B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant