WO2022068675A1 - Speaker voice extraction method and apparatus, storage medium, and electronic device - Google Patents

Speaker voice extraction method and apparatus, storage medium, and electronic device

Info

Publication number
WO2022068675A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
feature information
target speaker
voiceprint
Prior art date
Application number
PCT/CN2021/120026
Other languages
English (en)
French (fr)
Inventor
许家铭
秦磊
郝云喆
徐波
崔强强
陈天珞
Original Assignee
华为技术有限公司
中国科学院自动化研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司, 中国科学院自动化研究所
Publication of WO2022068675A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G10L 17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method, apparatus, storage medium, and electronic device for extracting the voice of a speaker.
  • ASR automatic speech recognition
  • the speech separation solution must know in advance the exact number of speakers in the mixed speech, but in real scenarios this number may change dynamically and cannot be obtained accurately. Moreover, this solution cannot predict in advance the speaker label of each output channel, that is, it suffers from the permutation problem. In addition, this solution separates the speech of all speakers in the mixed-speech scene, while in practice we are not necessarily interested in all of them and may only be interested in some speakers. Therefore, the speech separation solution cannot adapt well to practical application scenarios.
  • the target speaker extraction scheme aims to extract the voice of a specified speaker from the mixed speech, and can better adapt to practical application scenarios.
  • however, existing target speaker extraction schemes usually adopt short-time Fourier transform (STFT) frequency-domain coding, so the real-time capability of such a scheme (i.e., its latency upper bound) is limited by the STFT window length.
  • the latency upper bound equals the STFT window length, generally 32 milliseconds, so such schemes have the disadvantage of low real-time processing capability.
  • practical application scenarios, such as ASR or hearing-aid front ends, place high requirements on the real-time processing capability of the solution.
  • the present application provides a method, apparatus, storage medium, and electronic device for extracting a speaker's voice, so as to improve the real-time performance of speech recognition and better adapt to practical application scenarios.
  • a method for extracting a speaker's voice comprising:
  • the mixed speech including the speech of the target speaker
  • a speech segment of the target speaker is obtained.
  • the obtaining first voice time domain feature information based on the mixed voice includes:
  • Segmentation is performed on the first single-channel voice to obtain a first voice segmented data stream containing a preset type of voice;
  • the first speech segment data stream is processed by a pre-trained time domain encoder to obtain first speech time domain feature information.
  • before extracting, in real time, the second speech time domain feature information of the target speaker from the first speech time domain feature information based on the existing voiceprint information of the target speaker, the method further includes:
  • Segmentation is performed on the second single-channel voice to obtain a second voice segmented data stream containing preset type sounds;
  • the voiceprint information of the target speaker is obtained.
  • the single-channel speech is segmented to obtain a speech segmented data stream containing preset types of sounds, including:
  • the energy of the frame of voice is detected, and according to the preset energy threshold, it is determined whether the frame of voice is a voice frame with voice or a voice frame without voice;
  • Segmentation is performed on all speech frames with speech in the single-channel speech to obtain a speech segmented data stream containing preset types of sounds.
  • the extracting voice feature information of the target speaker from the second voice segment data stream includes:
  • Short-time Fourier transform is performed on the second speech segment data stream, and only the feature information of the amplitude part is extracted as the speech feature information of the target speaker.
  • the voiceprint information is a voiceprint vector, and the obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker includes: inputting the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
  • the voiceprint information is a voiceprint vector, and the extracting, in real time, the second speech time domain feature information of the target speaker from the first speech time domain feature information based on the existing voiceprint information of the target speaker includes: dividing the first speech time domain feature information into segments of a preset length; and inputting the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain, in real time, the second speech time domain feature information of the target speaker corresponding to each segment; wherein the extraction of the second speech time domain feature information of the target speaker corresponding to the current segment depends on the intermediate variables cached in the process of historical segment processing.
  • before inputting the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network, the method further includes:
  • the training sample set includes: a mixed speech and a speaker's reference speech, wherein the mixed speech includes the speaker's speech;
  • the speech extraction network is trained, so that the second speech time domain feature information output by the speech extraction network is that of the speaker.
  • the obtaining the speech segment of the target speaker based on the second speech time domain feature information includes: restoring the second speech time domain feature information to discrete speech sample points through a pre-trained time domain decoder; and fusing the discrete speech sample points to obtain the speech segment of the target speaker.
  • an apparatus for extracting a speaker's voice comprising:
  • a first voice acquisition module configured to collect mixed voices in the environment, the mixed voices including the voice of the target speaker
  • a mixed speech encoding module configured to obtain first speech time domain feature information based on the mixed speech
  • a voice extraction module configured to extract, in real time, the second voice time domain feature information of the target speaker from the first voice time domain feature information based on the existing voiceprint information of the target speaker;
  • the speech decoding module is configured to obtain the speech segment of the target speaker based on the second speech time domain feature information.
  • the hybrid speech coding module includes:
  • a first single-channel voice acquisition sub-module configured to obtain the first single-channel voice based on the mixed voice
  • a first voice segment data stream acquisition submodule configured to perform sentence segmentation on the first single-channel voice to obtain a first voice segment data stream containing preset types of sounds
  • the mixed speech time-domain feature acquisition sub-module is configured to process the first speech segment data stream through a pre-trained time-domain encoder to obtain first speech time-domain feature information.
  • the submodule for obtaining the first voice segment data stream is configured as:
  • the energy of the frame of voice is detected, and according to the preset energy threshold, it is determined whether the frame of voice is a voice frame with voice or a voice frame without voice;
  • the apparatus further includes:
  • a second voice acquisition module configured to collect the voice of the target speaker
  • a second single-channel voice acquisition module configured to obtain a second single-channel voice based on the voice of the target speaker
  • the second voice segment data stream acquisition module is configured to perform sentence segmentation on the second single-channel voice to obtain a second voice segment data stream containing preset types of sounds;
  • a voice feature extraction module configured to extract voice feature information of the target speaker from the second voice segment data stream
  • the voiceprint obtaining module is configured to obtain the voiceprint information of the target speaker based on the voice feature information of the target speaker.
  • the second voice segment data stream acquisition module is configured to:
  • the energy of the frame of voice is detected, and according to the preset energy threshold, it is determined whether the frame of voice is a voice frame with voice or a voice frame without voice;
  • the speech feature extraction module is configured to:
  • Short-time Fourier transform is performed on the second speech segment data stream, and only the feature information of the amplitude part is extracted as the speech feature information of the target speaker.
  • the voiceprint information is a voiceprint vector, and the voiceprint acquisition module is configured to: input the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
  • the voiceprint information is a voiceprint vector, and the voice extraction module is configured to: divide the first speech time domain feature information into segments of a preset length; and input the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain, in real time, the second speech time domain feature information of the target speaker corresponding to each segment; wherein the extraction of the second speech time domain feature information of the target speaker corresponding to the current segment depends on the intermediate variables cached in the process of historical segment processing.
  • the apparatus further includes: a training module;
  • the training module is configured to:
  • the training sample set includes: a mixed speech and a speaker's reference speech, wherein the mixed speech includes the speaker's speech;
  • the speech extraction network is trained, so that the second speech time domain feature information output by the speech extraction network is that of the speaker.
  • the speech decoding module is configured to:
  • the discrete speech sample points are fused to obtain the speech segment of the target speaker.
  • a storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the speaker voice extraction method in the first aspect or any possible implementation of the first aspect are implemented.
  • an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the speaker voice extraction method in the first aspect or any possible implementation of the first aspect are implemented.
  • the technical solutions provided by the embodiments of the present application utilize the voiceprint information of the target speaker to extract the speech segment of the target speaker from the mixed speech in real time, so it can better adapt to the actual application scenario.
  • the scheme adopts the time domain coding method, and the coding window length is much shorter than the STFT window length in frequency domain coding, so it has higher real-time processing capability.
  • FIG. 1 is a schematic flowchart of a method for extracting voice of a speaker according to an embodiment of the present application
  • FIG. 2 is a first structural schematic diagram of a speaker voice extraction device provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a mixed speech encoding module in a speaker speech extraction device provided by an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of a second type of a speaker voice extraction device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a third type of a speaker voice extraction device provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • first, second, third, etc. may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish the same type of information from each other.
  • first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information without departing from the scope of the present application.
  • word "if” as used herein can be interpreted as "at the time of” or "when” or "in response to determining.”
  • an embodiment of the present application provides a method for extracting a speaker's voice.
  • the method can be used in an electronic device, such as a terminal, and the method can include the following steps:
  • S101 Collect mixed voices in the environment, where the mixed voices include the voice of a target speaker;
  • the sound generator may be a human, an animal, or a musical instrument (eg, piano, violin, accordion, flute, erhu, etc.).
  • a microphone may be employed to capture mixed speech in the environment.
  • in step S102, obtaining the first speech time domain feature information based on the mixed speech includes:
  • Segmentation is performed on the first single-channel voice to obtain a first voice segmented data stream containing a preset type of voice;
  • the first speech segment data stream is processed by a pre-trained time domain encoder to obtain first speech time domain feature information.
  • the time domain encoder may map a signal with one time resolution to a time domain with another time resolution for processing.
  • the convolution layer in the time domain encoder is only a single layer, and the convolution window may be, for example, 2 ms.
  • the above-mentioned obtaining the first single-channel voice based on the mixed voice includes:
  • A/D conversion and/or sample rate conversion are performed on the mixed speech to obtain a first single-channel speech.
  • sample rate conversion is performed on the mixed speech to obtain the first single-channel speech with a sampling rate of 16000.
  • the preset type of sound includes human voice, animal voice, or musical instrument sound.
  • performing sentence segmentation on the first single-channel voice above to obtain a first voice segmented data stream containing a preset type of voice including:
  • the energy of the frame of voice is detected, and according to the preset energy threshold, it is determined whether the frame of voice is a voice frame with voice or a voice frame without voice;
  • a speech frame whose energy is greater than the energy threshold is a speech frame with speech
  • a speech frame whose energy is less than the energy threshold is a speech frame without speech (or called a silent zone).
  • the voiceprint information is a voiceprint vector, and step S103 of extracting, in real time, the second speech time domain feature information of the target speaker from the first speech time domain feature information based on the existing voiceprint information of the target speaker includes: dividing the first speech time domain feature information into segments of a preset length (for example, 100 ms); and inputting the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain, in real time, the second speech time domain feature information of the target speaker corresponding to each segment, where the extraction for the current segment depends on the intermediate variables cached during the processing of historical segments (segments preceding the current one).
  • the speech extraction network may first normalize the input segment, then reduce the feature dimension through a convolutional layer, then take a dot product with the voiceprint vector to enhance the voice of the target speaker, and then perform temporal integration; the voiceprint fusion and temporal integration steps are repeated several times, finally yielding the second speech time domain feature information of the target speaker corresponding to the segment, and the intermediate variables of the segment produced during this pass of the speech extraction network are cached.
  • the speech extraction network may include TCN (Temporal Convolutional Network) or DPRNN (Dual Path Recurrent Neural Network).
  • TCN or DPRNN can efficiently process temporal information, capturing short-time scale and long-time scale dependencies.
  • the speech extraction network only uses historical information.
  • for example, when the application scenario requires high real-time performance, the normalization can be designed as causal normalization. If the speech extraction network includes a TCN, the TCN input padding can be set so that the TCN only uses historical information to predict the output; if the speech extraction network includes a DPRNN, the intra-chunk LSTM in the DPRNN module can be set to a unidirectional LSTM.
  • as the allowed latency gradually increases, the performance of the speech extraction network will gradually improve, which makes it convenient to trade off latency against performance during actual deployment and make targeted settings.
  • time-domain encoder the time-domain decoder, the voiceprint network, and the speech extraction network in the embodiments of the present application may be trained jointly or separately, which is not limited in the embodiments of the present application.
  • the time-domain encoder, the time-domain decoder, the voiceprint network, and the speech extraction network can be jointly trained; before the divided segments and the existing voiceprint vector of the target speaker are input into the pre-trained speech extraction network, the method further includes:
  • the training sample set includes: a mixed speech and a speaker's reference speech, wherein the mixed speech includes the speaker's speech;
  • the speech extraction network is trained, so that the second speech time domain feature information output by the speech extraction network is that of the speaker.
  • the time-domain encoder, the time-domain decoder, the voiceprint network and the speech extraction network can be trained separately.
  • a database can be established, and training can be performed based on the database.
  • the method further includes:
  • the training sample set includes: the first voice time domain feature information and the voiceprint vector of the speaker;
  • the speech extraction network is trained, so that the second speech time domain feature information output by the speech extraction network is that of the speaker.
  • the first speech time-domain feature information in the training sample set includes the speaker's speech time-domain feature information.
  • the training of the speech extraction network can also be completed on other devices, and then the trained speech extraction network is used on this device, which is not limited in this embodiment of the present application.
  • step S104 of obtaining the speech segment of the target speaker based on the second speech time domain feature information includes: restoring the second speech time domain feature information to discrete speech sample points through a pre-trained time domain decoder; and fusing the discrete speech sample points to obtain the speech segment of the target speaker.
  • the above method for extracting the speaker's voice uses the voiceprint information of the target speaker to extract the speech segment of the target speaker from the mixed speech, so it can better adapt to the actual application scenario.
  • this method adopts time-domain coding, and the coding window length is much shorter than the STFT window length in frequency-domain coding, so it has higher real-time processing capability.
  • before performing step S103 of extracting the second speech time domain feature information of the target speaker from the first speech time domain feature information based on the existing voiceprint information of the target speaker, the method further includes:
  • Segmentation is performed on the second single-channel voice to obtain a second voice segmented data stream containing preset type sounds;
  • the voiceprint information of the target speaker is obtained.
  • the voice of the target speaker may be captured using a microphone.
  • obtaining the second single-channel voice based on the voice of the target speaker includes:
  • A/D conversion and/or sampling rate conversion is performed on the speech of the target speaker to obtain a second single-channel speech.
  • sample rate conversion is performed on the speech of the target speaker to obtain a second single-channel speech with a sampling rate of 16000.
  • performing sentence segmentation on the second single-channel voice above to obtain a second voice segmented data stream containing a preset type of voice including:
  • the energy of the frame of voice is detected, and according to the preset energy threshold, it is determined whether the frame of voice is a voice frame with voice or a voice frame without voice;
  • Segmentation is performed on all speech frames with speech in the second single-channel speech to obtain a second speech segmented data stream containing preset type sounds.
  • a speech frame whose energy is greater than the energy threshold is a speech frame with speech
  • a speech frame whose energy is less than the energy threshold is a speech frame without speech (or called a silent zone).
  • the above-mentioned extraction of the voice feature information of the target speaker from the second voice segment data stream includes:
  • Short-time Fourier transform is performed on the second speech segment data stream, and only the feature information of the amplitude part is extracted as the speech feature information of the target speaker.
  • the voiceprint information is a voiceprint vector, and obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker includes: inputting the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
  • the voiceprint network includes: LSTM (Long Short Term Memory Neural Network), linear layers and mean-pooling layers.
  • the voiceprint network integrates the information on the time scale, and then performs mean-pooling in the time dimension to obtain the voiceprint vector of the target speaker in the high-dimensional space.
  • an embodiment of the present application further provides a speaker voice extraction device, including: a first voice acquisition module 11 , a mixed voice encoding module 12 , a voice extraction module 13 and a voice decoding module 14 .
  • the first voice acquisition module 11 is configured to collect mixed voices in the environment, and the mixed voices include the voice of the target speaker;
  • the mixed speech coding module 12 is configured to obtain the time domain feature information of the first speech based on the mixed speech;
  • the voice extraction module 13 is configured to extract, in real time, the second voice time domain feature information of the target speaker from the first voice time domain feature information based on the existing voiceprint information of the target speaker;
  • the speech decoding module 14 is configured to obtain the speech segment of the target speaker based on the second speech time domain feature information, so as to recognize the speech of the target speaker.
  • the mixed speech coding module 12 includes:
  • the first single-channel voice obtaining sub-module 121 is configured to obtain the first single-channel voice based on the mixed voice
  • the first voice segment data stream acquisition submodule 122 is configured to perform sentence segmentation on the first single-channel voice to obtain a first voice segment data stream containing preset types of sounds;
  • the mixed speech time-domain feature acquisition sub-module 123 is configured to process the first speech segment data stream through a pre-trained time-domain encoder to obtain first speech time-domain feature information.
  • the submodule 122 for obtaining the first voice segment data stream is configured to:
  • the energy of the frame of voice is detected, and according to the preset energy threshold, it is determined whether the frame of voice is a voice frame with voice or a voice frame without voice;
  • Segmentation is performed on all speech frames with speech in the first single-channel speech to obtain a first speech segmented data stream containing preset type sounds.
  • the above-mentioned apparatus further includes:
  • the second voice collection module 15 is configured to collect the voice of the target speaker
  • the second single-channel voice obtaining module 16 is configured to obtain a second single-channel voice based on the voice of the target speaker
  • the second voice segment data stream acquisition module 17 is configured to perform sentence segmentation on the second single-channel voice to obtain a second voice segment data stream containing preset types of sounds;
  • a voice feature extraction module 18 configured to extract voice feature information of the target speaker from the second voice segment data stream
  • the voiceprint obtaining module 19 is configured to obtain voiceprint information of the target speaker based on the voice feature information of the target speaker.
  • the second voice segment data stream obtaining module 17 is configured to:
  • the energy of the frame of voice is detected, and according to the preset energy threshold, it is determined whether the frame of voice is a voice frame with voice or a voice frame without voice;
  • the speech feature extraction module 18 is configured to:
  • Short-time Fourier transform is performed on the second speech segment data stream, and only the feature information of the amplitude part is extracted as the speech feature information of the target speaker.
  • the voiceprint information is a voiceprint vector
  • the voiceprint acquisition module 19 is configured to:
  • the voiceprint information is a voiceprint vector
  • the voice extraction module 13 is configured as:
  • Second speech time domain feature information wherein, the extraction of the second speech time domain feature information of the target speaker corresponding to the current segment depends on the intermediate variables cached in the process of historical segment processing.
  • the above apparatus further includes: a training module 20;
  • the training module 20 is configured to:
  • the training sample set includes: a mixed speech and a speaker's reference speech, wherein the mixed speech includes the speaker's speech;
  • the speech extraction network is trained, so that the second speech time domain feature information output by the speech extraction network is that of the speaker.
  • the speech decoding module 14 is configured to:
  • the discrete speech sample points are fused to obtain the speech segment of the target speaker.
  • an embodiment of the present application further provides a storage medium on which a computer program is stored, and when the program is executed by a processor, implements the steps of the speaker voice extraction method in any possible implementation manner described above.
  • the storage medium may be a non-transitory computer-readable storage medium, for example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, and optical data storage equipment, etc.
  • an embodiment of the present application further provides a computer program product, including a computer program, when the program is executed by a processor, the steps of the method for extracting a speaker's voice in any possible implementation manner described above are implemented.
  • an embodiment of the present application further provides an electronic device, including a memory 71 (for example, a non-volatile memory), a processor 72, and a computer program stored in the memory 71 and executable on the processor 72; when the processor 72 executes the program, the steps of the method for extracting a speaker's voice in any possible implementation described above are implemented.
  • the electronic device may be, for example, a PC or a terminal.
  • the electronic device may also generally include: a memory 73 , a network interface 74 , and an internal bus 75 .
  • in addition to these components, other hardware may also be included, which will not be described in detail here.
  • the above-mentioned apparatus for extracting the voice of a speaker can be implemented by software; as a logical apparatus, it is formed by the processor 72 of the electronic device where it is located reading the computer program instructions stored in the non-volatile memory into the memory 73 and running them.
  • Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information and transmit it to a suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of these.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, eg, an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from read only memory and/or random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from them, transmit data to them, or both.
  • the computer does not have to have such a device.
  • the computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
  • PDA personal digital assistant
  • GPS global positioning system
  • USB universal serial bus
  • Computer-readable media suitable for storage of computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory may be supplemented by or incorporated in special purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speaker voice extraction method and apparatus, a storage medium, and an electronic device, for improving the real-time performance of speech recognition so as to better adapt to practical application scenarios. The speaker voice extraction method includes: collecting mixed speech in an environment, the mixed speech including the speech of a target speaker (S101); obtaining first speech time-domain feature information based on the mixed speech (S102); extracting, in real time, second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on existing voiceprint information of the target speaker (S103); and obtaining a speech segment of the target speaker based on the second speech time-domain feature information (S104).

Description

Speaker voice extraction method and apparatus, storage medium, and electronic device
This application claims priority to Chinese Patent Application No. 202011055886.1, filed with the China National Intellectual Property Administration on September 29, 2020 and entitled "Speaker voice extraction method and apparatus, storage medium, and electronic device", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of computer technology, and in particular, to a speaker voice extraction method and apparatus, a storage medium, and an electronic device.
Background
At present, automatic speech recognition (ASR) technology can already achieve impressive results on speech recognition tasks in quiet scenes with a single speaker, but in complex acoustic scenes, especially multi-speaker speech recognition tasks in noisy environments, the results are still unsatisfactory; this is the well-known cocktail party problem. To solve the cocktail party problem, researchers have made great efforts and proposed numerous solutions, including speech separation solutions and target speaker extraction solutions.
The speech separation solution must know in advance the exact number of speakers in the mixed speech, but in real scenarios this number may change dynamically and cannot be obtained accurately. Moreover, this solution cannot predict in advance the speaker label of each output channel, that is, it suffers from the permutation problem. Furthermore, this solution separates the speech of all speakers in the mixed-speech scene, while in practice we are not necessarily interested in all of them and may only be interested in some speakers. Therefore, the speech separation solution cannot adapt well to practical application scenarios.
The target speaker extraction solution aims to extract the voice of a specified speaker from the mixed speech, and can better adapt to practical application scenarios. However, existing target speaker extraction solutions usually adopt short-time Fourier transform (STFT) frequency-domain coding, so the real-time capability of such a solution (i.e., its latency upper bound) is limited by the STFT window length; the latency upper bound equals the STFT window length, generally 32 milliseconds, so such solutions have the disadvantage of low real-time processing capability. However, practical application scenarios such as ASR or hearing-aid front ends place high requirements on the real-time processing capability of a solution.
Summary
In view of this, the present application provides a speaker voice extraction method and apparatus, a storage medium, and an electronic device, so as to improve the real-time performance of speech recognition and better adapt to practical application scenarios.
The technical solutions of the present application are as follows:
According to a first aspect of the embodiments of the present application, a speaker voice extraction method is provided, the method including:
collecting mixed speech in an environment, the mixed speech including the speech of a target speaker;
obtaining first speech time-domain feature information based on the mixed speech;
extracting, in real time, second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on existing voiceprint information of the target speaker;
obtaining a speech segment of the target speaker based on the second speech time-domain feature information.
In a possible implementation, the obtaining first speech time-domain feature information based on the mixed speech includes:
obtaining first single-channel speech based on the mixed speech;
performing sentence segmentation on the first single-channel speech to obtain a first segmented speech data stream containing a preset type of sound;
processing the first segmented speech data stream through a pre-trained time-domain encoder to obtain the first speech time-domain feature information.
In a possible implementation, before the extracting, in real time, the second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on the existing voiceprint information of the target speaker, the method further includes:
collecting speech of the target speaker;
obtaining second single-channel speech based on the speech of the target speaker;
performing sentence segmentation on the second single-channel speech to obtain a second segmented speech data stream containing a preset type of sound;
extracting voice feature information of the target speaker from the second segmented speech data stream;
obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker.
In a possible implementation, performing sentence segmentation on single-channel speech to obtain a segmented speech data stream containing a preset type of sound includes:
for any frame of the single-channel speech, detecting the energy of the frame and determining, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
performing sentence segmentation on all speech frames with speech in the single-channel speech to obtain the segmented speech data stream containing the preset type of sound.
In a possible implementation, the extracting voice feature information of the target speaker from the second segmented speech data stream includes:
performing a short-time Fourier transform on the second segmented speech data stream, and extracting only the feature information of the magnitude part as the voice feature information of the target speaker.
In a possible implementation, the voiceprint information is a voiceprint vector, and the obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker includes:
inputting the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
In a possible implementation, the voiceprint information is a voiceprint vector, and the extracting, in real time, the second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on the existing voiceprint information of the target speaker includes:
dividing the first speech time-domain feature information into segments of a preset length;
inputting the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain, in real time, the second speech time-domain feature information of the target speaker corresponding to each segment output by the speech extraction network; wherein the extraction of the second speech time-domain feature information of the target speaker corresponding to the current segment depends on intermediate variables cached during the processing of historical segments.
In a possible implementation, before the inputting the divided segments and the existing voiceprint vector of the target speaker into the pre-trained speech extraction network, the method further includes:
obtaining a training sample set, the training sample set including mixed speech and reference speech of a speaker, wherein the mixed speech includes the speech of the speaker;
training the speech extraction network with the training sample set, so that the second speech time-domain feature information output by the speech extraction network is that of the speaker.
In a possible implementation, the obtaining the speech segment of the target speaker based on the second speech time-domain feature information includes:
restoring the second speech time-domain feature information to discrete speech sample points through a pre-trained time-domain decoder;
fusing the discrete speech sample points to obtain the speech segment of the target speaker.
According to a second aspect of the embodiments of the present application, a speaker voice extraction apparatus is provided, the apparatus including:
a first speech collection module configured to collect mixed speech in an environment, the mixed speech including the speech of a target speaker;
a mixed speech encoding module configured to obtain first speech time-domain feature information based on the mixed speech;
a speech extraction module configured to extract, in real time, second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on existing voiceprint information of the target speaker;
a speech decoding module configured to obtain a speech segment of the target speaker based on the second speech time-domain feature information.
In a possible implementation, the mixed speech encoding module includes:
a first single-channel speech acquisition submodule configured to obtain first single-channel speech based on the mixed speech;
a first segmented speech data stream acquisition submodule configured to perform sentence segmentation on the first single-channel speech to obtain a first segmented speech data stream containing a preset type of sound;
a mixed speech time-domain feature acquisition submodule configured to process the first segmented speech data stream through a pre-trained time-domain encoder to obtain the first speech time-domain feature information.
In a possible implementation, the first segmented speech data stream acquisition submodule is configured to:
for any frame of the first single-channel speech, detect the energy of the frame and determine, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
perform sentence segmentation on all speech frames with speech in the first single-channel speech to obtain the first segmented speech data stream containing the preset type of sound.
In a possible implementation, the apparatus further includes:
a second speech collection module configured to collect speech of the target speaker;
a second single-channel speech acquisition module configured to obtain second single-channel speech based on the speech of the target speaker;
a second segmented speech data stream acquisition module configured to perform sentence segmentation on the second single-channel speech to obtain a second segmented speech data stream containing a preset type of sound;
a voice feature extraction module configured to extract voice feature information of the target speaker from the second segmented speech data stream;
a voiceprint acquisition module configured to obtain voiceprint information of the target speaker based on the voice feature information of the target speaker.
In a possible implementation, the second segmented speech data stream acquisition module is configured to:
for any frame of the second single-channel speech, detect the energy of the frame and determine, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
perform sentence segmentation on all speech frames with speech in the second single-channel speech to obtain the second segmented speech data stream containing the preset type of sound.
In a possible implementation, the voice feature extraction module is configured to:
perform a short-time Fourier transform on the second segmented speech data stream, and extract only the feature information of the magnitude part as the voice feature information of the target speaker.
In a possible implementation, the voiceprint information is a voiceprint vector, and the voiceprint acquisition module is configured to:
input the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
In a possible implementation, the voiceprint information is a voiceprint vector, and the speech extraction module is configured to:
divide the first speech time-domain feature information into segments of a preset length;
input the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain, in real time, the second speech time-domain feature information of the target speaker corresponding to each segment output by the speech extraction network; wherein the extraction of the second speech time-domain feature information of the target speaker corresponding to the current segment depends on intermediate variables cached during the processing of historical segments.
In a possible implementation, the apparatus further includes a training module;
the training module is configured to:
obtain a training sample set, the training sample set including mixed speech and reference speech of a speaker, wherein the mixed speech includes the speech of the speaker;
train the speech extraction network with the training sample set, so that the second speech time-domain feature information output by the speech extraction network is that of the speaker.
In a possible implementation, the speech decoding module is configured to:
restore the second speech time-domain feature information to discrete speech sample points through a pre-trained time-domain decoder;
fuse the discrete speech sample points to obtain the speech segment of the target speaker.
According to a third aspect of the embodiments of the present application, a storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the steps of the speaker voice extraction method in the first aspect or any possible implementation of the first aspect are implemented.
According to a fourth aspect of the embodiments of the present application, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the speaker voice extraction method in the first aspect or any possible implementation of the first aspect are implemented.
The technical solutions provided by the embodiments of the present application bring at least the following beneficial effects:
On the one hand, the technical solutions provided by the embodiments of the present application use the voiceprint information of the target speaker to extract the speech segment of the target speaker from the mixed speech in real time, and therefore can better adapt to practical application scenarios. On the other hand, the solution adopts time-domain coding, and the coding window length is much shorter than the STFT window length used in frequency-domain coding, so it has higher real-time processing capability.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a speaker voice extraction method provided by an embodiment of the present application;
FIG. 2 is a first schematic structural diagram of a speaker voice extraction apparatus provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a mixed speech encoding module in a speaker voice extraction apparatus provided by an embodiment of the present application;
FIG. 4 is a second schematic structural diagram of a speaker voice extraction apparatus provided by an embodiment of the present application;
FIG. 5 is a third schematic structural diagram of a speaker voice extraction apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
The terms used in the present application are for the purpose of describing particular embodiments only and are not intended to limit the present application. The singular forms "a", "the", and "said" used in the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present application, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application.
Referring to FIG. 1, an embodiment of the present application provides a speaker voice extraction method. The method can be used in an electronic device, such as a terminal, and may include the following steps:
S101. Collect mixed speech in an environment, where the mixed speech includes the speech of a target speaker.
In the embodiments of the present application, the speaker (sound producer) may be a human, an animal, or a musical instrument (for example, a piano, violin, accordion, flute, erhu, etc.).
In some embodiments, a microphone may be used to collect the mixed speech in the environment.
S102. Obtain first speech time-domain feature information based on the mixed speech.
In some embodiments, obtaining the first speech time-domain feature information based on the mixed speech in step S102 includes:
obtaining first single-channel speech based on the mixed speech;
performing sentence segmentation on the first single-channel speech to obtain a first segmented speech data stream containing a preset type of sound;
processing the first segmented speech data stream through a pre-trained time-domain encoder to obtain the first speech time-domain feature information.
In the embodiments of the present application, the time-domain encoder may map a signal with one time resolution to a time domain with another time resolution for processing.
In some embodiments, in order to improve real-time performance, the convolutional layer in the time-domain encoder is only a single layer, and the convolution window may be, for example, 2 ms.
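By way of illustration only (this sketch is an editorial example and not part of the original disclosure), a single-layer 1-D convolutional time-domain encoder of the kind described above might look as follows; the 2 ms window at a 16 kHz sampling rate (32 samples), the 50% hop, and the 256 basis filters are assumed example values.

```python
import torch
import torch.nn as nn

class TimeDomainEncoder(nn.Module):
    """Single-layer 1-D conv encoder: waveform -> time-domain feature frames.

    Assumed example values: 16 kHz audio, 2 ms window (32 samples),
    1 ms hop (16 samples), 256 learned basis filters.
    """
    def __init__(self, n_filters=256, win=32, hop=16):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel_size=win, stride=hop, bias=False)

    def forward(self, wav):                  # wav: (batch, samples)
        x = wav.unsqueeze(1)                 # (batch, 1, samples)
        return torch.relu(self.conv(x))      # (batch, n_filters, frames)

# Usage: encode a 100 ms chunk of 16 kHz single-channel speech.
encoder = TimeDomainEncoder()
chunk = torch.randn(1, 1600)                 # 100 ms at 16 kHz
features = encoder(chunk)                    # first speech time-domain feature information
print(features.shape)                        # torch.Size([1, 256, 99])
```

Because the analysis window is only 2 ms, each feature frame commits to far less future signal than a 32 ms STFT frame, which is where the lower latency bound of time-domain coding comes from.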
In some embodiments, the obtaining first single-channel speech based on the mixed speech includes:
performing A/D conversion and/or sample rate conversion on the mixed speech to obtain the first single-channel speech.
For example, sample rate conversion is performed on the mixed speech to obtain the first single-channel speech with a sampling rate of 16000.
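Purely as an illustrative sketch (the patent does not prescribe a specific resampling routine), converting captured audio to 16 kHz single-channel samples could look like this; the 48 kHz source rate and channel-averaging are assumptions made only for the example.

```python
import numpy as np
from scipy.signal import resample_poly

def to_single_channel_16k(pcm, src_rate=48000, dst_rate=16000):
    """Average channels to mono and convert the sample rate to 16 kHz."""
    mono = pcm.mean(axis=1) if pcm.ndim == 2 else pcm   # (samples, channels) -> (samples,)
    g = np.gcd(src_rate, dst_rate)
    return resample_poly(mono, dst_rate // g, src_rate // g)

mixed = np.random.randn(48000, 2)            # 1 s of stereo audio at 48 kHz (placeholder data)
first_single_channel = to_single_channel_16k(mixed)
print(len(first_single_channel))             # 16000
```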
In the embodiments of the present application, the preset type of sound includes human voice, animal sound, or musical instrument sound.
In some embodiments, performing sentence segmentation on the first single-channel speech to obtain the first segmented speech data stream containing the preset type of sound includes:
for any frame of the first single-channel speech, detecting the energy of the frame and determining, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
performing sentence segmentation on all speech frames with speech in the first single-channel speech to obtain the first segmented speech data stream containing the preset type of sound.
For example, a speech frame whose energy is greater than the energy threshold is a speech frame with speech, and a speech frame whose energy is less than the energy threshold is a speech frame without speech (also called a silent region).
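The sketch below illustrates this energy-based decision (editorial example only; the frame length, hop, and threshold are assumed values): each frame is labelled speech or non-speech against a preset energy threshold, and contiguous speech frames are grouped into segments.

```python
import numpy as np

def energy_segments(speech, frame_len=320, hop=160, energy_thresh=1e-3):
    """Label 20 ms frames (16 kHz) as speech/non-speech by energy and merge
    consecutive speech frames into (start_sample, end_sample) segments."""
    flags = []
    for start in range(0, len(speech) - frame_len + 1, hop):
        frame = speech[start:start + frame_len]
        flags.append(np.mean(frame ** 2) > energy_thresh)   # True = speech frame

    segments, seg_start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and seg_start is None:
            seg_start = i * hop                              # segment opens on first speech frame
        elif not is_speech and seg_start is not None:
            segments.append((seg_start, i * hop + frame_len))  # segment closes on silence
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(speech)))
    return segments  # segmented speech data stream: sample ranges containing sound

# Usage: cut single-channel speech into voiced segments.
speech = np.random.randn(16000) * 0.05       # placeholder single-channel speech
voiced = [speech[s:e] for s, e in energy_segments(speech)]
```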
S103. Extract, in real time, second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on existing voiceprint information of the target speaker.
In some embodiments, the voiceprint information is a voiceprint vector, and extracting, in real time, the second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on the existing voiceprint information of the target speaker in step S103 includes:
dividing the first speech time-domain feature information into segments of a preset length (for example, 100 ms);
inputting the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain, in real time, the second speech time-domain feature information of the target speaker corresponding to each segment output by the speech extraction network; wherein the extraction of the second speech time-domain feature information of the target speaker corresponding to the current segment depends on intermediate variables cached during the processing of historical segments (segments preceding the current segment).
In the embodiments of the present application, the speech extraction network may first normalize the input segment, then reduce the feature dimension through a convolutional layer, then take a dot product with the voiceprint vector to enhance the voice of the target speaker, and then perform temporal integration; the voiceprint fusion and temporal integration steps are repeated several times, finally yielding the second speech time-domain feature information of the target speaker corresponding to the segment, and the intermediate variables of the segment produced during this pass of the speech extraction network are cached.
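A minimal sketch of this per-segment flow follows (editorial illustration only, not the patent's prescribed architecture): normalize, reduce the feature dimension with a convolution, fuse with the voiceprint vector by an element-wise product, integrate over time, and cache state for the next segment. The layer sizes and the use of a GRU as the temporal-integration step are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

class ExtractionBlock(nn.Module):
    """One 'voiceprint fusion + temporal integration' round; a GRU stands in
    for the temporal-integration step purely for illustration."""
    def __init__(self, feat_dim, spk_dim):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.reduce = nn.Conv1d(feat_dim, spk_dim, kernel_size=1)
        self.rnn = nn.GRU(spk_dim, spk_dim, batch_first=True)
        self.expand = nn.Conv1d(spk_dim, feat_dim, kernel_size=1)

    def forward(self, feats, spk_vec, state=None):
        # feats: (batch, feat_dim, frames); spk_vec: (batch, spk_dim)
        x = self.norm(feats.transpose(1, 2)).transpose(1, 2)   # normalize the segment
        x = self.reduce(x)                                     # reduce feature dimension
        x = x * spk_vec.unsqueeze(-1)                          # fuse with the voiceprint vector (element-wise product)
        y, state = self.rnn(x.transpose(1, 2), state)          # temporal integration, causal over frames
        return feats + self.expand(y.transpose(1, 2)), state   # enhanced features + cached state

# Usage on consecutive 100 ms chunks; the cached GRU state plays the role of the
# intermediate variables carried over from historical segments.
block = ExtractionBlock(feat_dim=256, spk_dim=128)
voiceprint = torch.randn(1, 128)                 # existing voiceprint vector of the target speaker
out1, cached_state = block(torch.randn(1, 256, 99), voiceprint)                # first segment
out2, cached_state = block(torch.randn(1, 256, 99), voiceprint, cached_state)  # next segment reuses the cache
```

In a full network this block would be repeated several times, matching the "repeat the voiceprint fusion and temporal integration several times" description above.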
In the embodiments of the present application, the speech extraction network may include a TCN (temporal convolutional network) or a DPRNN (dual-path recurrent neural network). A TCN or DPRNN can process temporal information efficiently, capturing both short-time-scale and long-time-scale dependencies.
In the embodiments of the present application, the speech extraction network only uses historical information. For example, when the application scenario requires high real-time performance, the normalization can be designed as causal normalization; if the speech extraction network includes a TCN, the TCN input padding can be set so that the TCN uses only historical information to predict the output; if the speech extraction network includes a DPRNN, the intra-chunk LSTM in the DPRNN module can be set to a unidirectional LSTM. As the allowed latency gradually increases, the performance of the speech extraction network gradually improves, which makes it convenient to trade off latency against performance during actual deployment and make targeted settings.
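As an illustration of the causal padding mentioned above (an editorial example, not the patent's exact layer), a dilated 1-D convolution can be restricted to historical information by padding on the left side only, so that no output frame looks at future input frames:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated 1-D convolution padded on the left only, so each output frame
    depends only on the current and past input frames (no future look-ahead)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                    # x: (batch, channels, frames)
        x = nn.functional.pad(x, (self.left_pad, 0))         # pad the past side only
        return self.conv(x)

layer = CausalConv1d(channels=256, kernel_size=3, dilation=2)
print(layer(torch.randn(1, 256, 99)).shape)                  # torch.Size([1, 256, 99])
```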
It should be noted that, in the embodiments of the present application, the time-domain encoder, the time-domain decoder, the voiceprint network, and the speech extraction network may be trained jointly or separately, which is not limited in the embodiments of the present application.
In some embodiments, the time-domain encoder, the time-domain decoder, the voiceprint network, and the speech extraction network may be trained jointly. Before the divided segments and the existing voiceprint vector of the target speaker are input into the pre-trained speech extraction network, the method further includes:
obtaining a training sample set, the training sample set including mixed speech and reference speech of a speaker, wherein the mixed speech includes the speech of the speaker;
training the speech extraction network with the training sample set, so that the second speech time-domain feature information output by the speech extraction network is that of the speaker.
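The patent does not name a training objective; the sketch below shows only one common way such joint training could be set up, assuming a scale-invariant SNR (SI-SNR) loss between the decoded estimate and the speaker's reference speech. The loss choice and the placeholder loop are editorial assumptions.

```python
import torch

def si_snr_loss(estimate, reference, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms.

    estimate, reference: (batch, samples). Lower loss means better reconstruction.
    """
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    proj = (torch.sum(estimate * reference, dim=-1, keepdim=True)
            / (torch.sum(reference ** 2, dim=-1, keepdim=True) + eps)) * reference
    noise = estimate - proj
    si_snr = 10 * torch.log10(
        (torch.sum(proj ** 2, dim=-1) + eps) / (torch.sum(noise ** 2, dim=-1) + eps))
    return -si_snr.mean()

# Typical loop shape (names are placeholders, not the patent's API):
# for mixed_wav, reference_wav in loader:
#     est_wav = decoder(extractor(encoder(mixed_wav), voiceprint_net(reference_feats)))
#     loss = si_snr_loss(est_wav, reference_wav)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```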
In other embodiments, the time-domain encoder, the time-domain decoder, the voiceprint network, and the speech extraction network may be trained separately; for example, a database can be established and training performed on that database. Before the divided segments and the existing voiceprint vector of the target speaker are input into the pre-trained speech extraction network, the method further includes:
obtaining a training sample set, the training sample set including first speech time-domain feature information and the voiceprint vector of a speaker;
training the speech extraction network with the training sample set, so that the second speech time-domain feature information output by the speech extraction network is that of the speaker.
It should be noted that the first speech time-domain feature information in the training sample set contains the speech time-domain feature information of the speaker.
Of course, the training of the speech extraction network may also be completed on another device, and the trained speech extraction network is then used on this device, which is not limited in the embodiments of the present application.
S104. Obtain a speech segment of the target speaker based on the second speech time-domain feature information.
In some embodiments, obtaining the speech segment of the target speaker based on the second speech time-domain feature information in step S104 includes:
restoring the second speech time-domain feature information to discrete speech sample points through a pre-trained time-domain decoder;
fusing the discrete speech sample points to obtain the speech segment of the target speaker.
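A minimal sketch of the decoding step (illustrative only): a transposed convolution mirrors the encoder to recover discrete sample points, and the overlapping windows are fused by overlap-add. The window, hop, and filter count match the assumed encoder example given earlier.

```python
import torch
import torch.nn as nn

class TimeDomainDecoder(nn.Module):
    """Transposed 1-D convolution mapping time-domain feature frames back to
    discrete speech sample points; overlapping windows are fused by overlap-add
    (the overlap-add is implicit in the transposed convolution)."""
    def __init__(self, n_filters=256, win=32, hop=16):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(n_filters, 1, kernel_size=win, stride=hop, bias=False)

    def forward(self, feats):                 # feats: (batch, n_filters, frames)
        return self.deconv(feats).squeeze(1)  # (batch, samples)

decoder = TimeDomainDecoder()
second_feats = torch.randn(1, 256, 99)        # second speech time-domain feature information
speech_segment = decoder(second_feats)        # speech segment of the target speaker
print(speech_segment.shape)                   # torch.Size([1, 1600])
```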
With the above speaker voice extraction method, on the one hand, the voiceprint information of the target speaker is used to extract the speech segment of the target speaker from the mixed speech, so the method can better adapt to practical application scenarios. On the other hand, the method adopts time-domain coding, and the coding window length is much shorter than the STFT window length used in frequency-domain coding, so it has higher real-time processing capability.
In some embodiments, before performing step S103 of extracting the second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on the existing voiceprint information of the target speaker, the method further includes:
collecting speech of the target speaker;
obtaining second single-channel speech based on the speech of the target speaker;
performing sentence segmentation on the second single-channel speech to obtain a second segmented speech data stream containing a preset type of sound;
extracting voice feature information of the target speaker from the second segmented speech data stream;
obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker.
In some embodiments, a microphone may be used to collect the speech of the target speaker.
In some embodiments, the obtaining second single-channel speech based on the speech of the target speaker includes:
performing A/D conversion and/or sample rate conversion on the speech of the target speaker to obtain the second single-channel speech.
For example, sample rate conversion is performed on the speech of the target speaker to obtain the second single-channel speech with a sampling rate of 16000.
In some embodiments, performing sentence segmentation on the second single-channel speech to obtain the second segmented speech data stream containing the preset type of sound includes:
for any frame of the second single-channel speech, detecting the energy of the frame and determining, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
performing sentence segmentation on all speech frames with speech in the second single-channel speech to obtain the second segmented speech data stream containing the preset type of sound.
For example, a speech frame whose energy is greater than the energy threshold is a speech frame with speech, and a speech frame whose energy is less than the energy threshold is a speech frame without speech (also called a silent region).
In some embodiments, the extracting voice feature information of the target speaker from the second segmented speech data stream includes:
performing a short-time Fourier transform on the second segmented speech data stream and extracting only the feature information of the magnitude part as the voice feature information of the target speaker.
It should be noted that, in the embodiments of the present application, only the feature information of the magnitude part is extracted in order to facilitate subsequent voiceprint extraction.
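A small illustrative sketch of this step follows (the window length, hop, and placeholder data are assumed values, not prescribed by the patent): take the STFT of the reference speech and keep only the magnitude, discarding the phase.

```python
import numpy as np
from scipy.signal import stft

def magnitude_features(segment, rate=16000, win_ms=25, hop_ms=10):
    """STFT of a speech segment, keeping only the magnitude (amplitude) part."""
    nperseg = int(rate * win_ms / 1000)             # 400 samples per analysis window
    noverlap = nperseg - int(rate * hop_ms / 1000)  # 240 samples of overlap (10 ms hop)
    _, _, spec = stft(segment, fs=rate, nperseg=nperseg, noverlap=noverlap)
    return np.abs(spec).T                           # (frames, 201 frequency bins), phase discarded

reference = np.random.randn(32000)                  # 2 s of the target speaker's speech (placeholder)
voice_features = magnitude_features(reference)      # voice feature information of the target speaker
```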
In some embodiments, the voiceprint information is a voiceprint vector, and the obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker includes:
inputting the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
In some embodiments, the voiceprint network includes an LSTM (long short-term memory network), a linear layer, and a mean-pooling layer. The voiceprint network integrates information along the time scale and then performs mean-pooling along the time dimension to obtain the voiceprint vector of the target speaker in a high-dimensional space.
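A minimal sketch of such a voiceprint network (the hidden size and embedding size are assumed example values): an LSTM integrates the magnitude features over time, a linear layer projects them, and mean-pooling over the time dimension yields a single voiceprint vector.

```python
import torch
import torch.nn as nn

class VoiceprintNetwork(nn.Module):
    """LSTM + linear + mean-pooling: magnitude features -> one voiceprint vector."""
    def __init__(self, n_freq_bins=201, hidden=256, embed_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_freq_bins, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, embed_dim)

    def forward(self, mag):                  # mag: (batch, frames, freq_bins)
        seq, _ = self.lstm(mag)              # integrate information along the time scale
        emb = self.linear(seq)               # (batch, frames, embed_dim)
        return emb.mean(dim=1)               # mean-pooling over time -> (batch, embed_dim)

voiceprint_net = VoiceprintNetwork()
mag_feats = torch.randn(1, 200, 201)          # magnitude features of the reference speech
voiceprint_vector = voiceprint_net(mag_feats)  # voiceprint vector of the target speaker
print(voiceprint_vector.shape)                 # torch.Size([1, 128])
```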
Based on the same inventive concept, referring to FIG. 2, an embodiment of the present application further provides a speaker voice extraction apparatus, including: a first speech collection module 11, a mixed speech encoding module 12, a speech extraction module 13, and a speech decoding module 14.
The first speech collection module 11 is configured to collect mixed speech in an environment, the mixed speech including the speech of a target speaker;
the mixed speech encoding module 12 is configured to obtain first speech time-domain feature information based on the mixed speech;
the speech extraction module 13 is configured to extract, in real time, second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on existing voiceprint information of the target speaker;
the speech decoding module 14 is configured to obtain a speech segment of the target speaker based on the second speech time-domain feature information, so as to recognize the speech of the target speaker.
In a possible implementation, as shown in FIG. 3, the mixed speech encoding module 12 includes:
a first single-channel speech acquisition submodule 121 configured to obtain first single-channel speech based on the mixed speech;
a first segmented speech data stream acquisition submodule 122 configured to perform sentence segmentation on the first single-channel speech to obtain a first segmented speech data stream containing a preset type of sound;
a mixed speech time-domain feature acquisition submodule 123 configured to process the first segmented speech data stream through a pre-trained time-domain encoder to obtain the first speech time-domain feature information.
In a possible implementation, the first segmented speech data stream acquisition submodule 122 is configured to:
for any frame of the first single-channel speech, detect the energy of the frame and determine, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
perform sentence segmentation on all speech frames with speech in the first single-channel speech to obtain the first segmented speech data stream containing the preset type of sound.
In a possible implementation, as shown in FIG. 4, the apparatus further includes:
a second speech collection module 15 configured to collect speech of the target speaker;
a second single-channel speech acquisition module 16 configured to obtain second single-channel speech based on the speech of the target speaker;
a second segmented speech data stream acquisition module 17 configured to perform sentence segmentation on the second single-channel speech to obtain a second segmented speech data stream containing a preset type of sound;
a voice feature extraction module 18 configured to extract voice feature information of the target speaker from the second segmented speech data stream;
a voiceprint acquisition module 19 configured to obtain voiceprint information of the target speaker based on the voice feature information of the target speaker.
In a possible implementation, the second segmented speech data stream acquisition module 17 is configured to:
for any frame of the second single-channel speech, detect the energy of the frame and determine, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
perform sentence segmentation on all speech frames with speech in the second single-channel speech to obtain the second segmented speech data stream containing the preset type of sound.
In a possible implementation, the voice feature extraction module 18 is configured to:
perform a short-time Fourier transform on the second segmented speech data stream and extract only the feature information of the magnitude part as the voice feature information of the target speaker.
In a possible implementation, the voiceprint information is a voiceprint vector, and the voiceprint acquisition module 19 is configured to:
input the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
In a possible implementation, the voiceprint information is a voiceprint vector, and the speech extraction module 13 is configured to:
divide the first speech time-domain feature information into segments of a preset length;
input the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain, in real time, the second speech time-domain feature information of the target speaker corresponding to each segment output by the speech extraction network; wherein the extraction of the second speech time-domain feature information of the target speaker corresponding to the current segment depends on intermediate variables cached during the processing of historical segments.
In a possible implementation, as shown in FIG. 5, the apparatus further includes a training module 20;
the training module 20 is configured to:
obtain a training sample set, the training sample set including mixed speech and reference speech of a speaker, wherein the mixed speech includes the speech of the speaker;
train the speech extraction network with the training sample set, so that the second speech time-domain feature information output by the speech extraction network is that of the speaker.
In a possible implementation, the speech decoding module 14 is configured to:
restore the second speech time-domain feature information to discrete speech sample points through a pre-trained time-domain decoder;
fuse the discrete speech sample points to obtain the speech segment of the target speaker.
For the implementation process of the functions and roles of each unit in the above apparatus, refer to the implementation process of the corresponding steps in the above method, which will not be repeated here.
Since the apparatus embodiments basically correspond to the method embodiments, for related parts, reference may be made to the description of the method embodiments. The apparatus embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of the present application. Those of ordinary skill in the art can understand and implement them without creative effort.
Based on the same inventive concept, an embodiment of the present application further provides a storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the speaker voice extraction method in any of the above possible implementations are implemented.
Optionally, the storage medium may be a non-transitory computer-readable storage medium, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Based on the same inventive concept, an embodiment of the present application further provides a computer program product including a computer program; when the program is executed by a processor, the steps of the speaker voice extraction method in any of the above possible implementations are implemented.
Based on the same inventive concept, referring to FIG. 6, an embodiment of the present application further provides an electronic device, including a memory 71 (for example, a non-volatile memory), a processor 72, and a computer program stored in the memory 71 and executable on the processor 72; when the processor 72 executes the program, the steps of the speaker voice extraction method in any of the above possible implementations are implemented. The electronic device may be, for example, a PC or a terminal.
As shown in FIG. 6, the electronic device may generally further include a memory 73, a network interface 74, and an internal bus 75. In addition to these components, other hardware may also be included, which will not be described in detail here.
It should be noted that the above speaker voice extraction apparatus may be implemented by software; as a logical apparatus, it is formed by the processor 72 of the electronic device where it is located reading the computer program instructions stored in the non-volatile memory into the memory 73 and running them.
Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information and transmit it to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, for example, an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special purpose logic circuitry.
Computers suitable for the execution of a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, the central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from them, transmit data to them, or both. However, a computer need not have such devices. Furthermore, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features described in this specification in the context of multiple embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
The above are merely preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (19)

  1. A speaker voice extraction method, characterized in that the method comprises:
    collecting mixed speech in an environment, the mixed speech comprising the speech of a target speaker;
    obtaining first speech time-domain feature information based on the mixed speech;
    extracting, in real time, second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on existing voiceprint information of the target speaker;
    obtaining a speech segment of the target speaker based on the second speech time-domain feature information.
  2. The method according to claim 1, characterized in that the obtaining first speech time-domain feature information based on the mixed speech comprises:
    obtaining first single-channel speech based on the mixed speech;
    performing sentence segmentation on the first single-channel speech to obtain a first segmented speech data stream containing a preset type of sound;
    processing the first segmented speech data stream through a pre-trained time-domain encoder to obtain the first speech time-domain feature information.
  3. The method according to claim 1, characterized in that before the extracting, in real time, the second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on the existing voiceprint information of the target speaker, the method further comprises:
    collecting speech of the target speaker;
    obtaining second single-channel speech based on the speech of the target speaker;
    performing sentence segmentation on the second single-channel speech to obtain a second segmented speech data stream containing a preset type of sound;
    extracting voice feature information of the target speaker from the second segmented speech data stream;
    obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker.
  4. The method according to claim 2 or 3, characterized in that performing sentence segmentation on single-channel speech to obtain a segmented speech data stream containing a preset type of sound comprises:
    for any frame of the single-channel speech, detecting the energy of the frame and determining, according to a preset energy threshold, whether the frame is a speech frame with speech or a speech frame without speech;
    performing sentence segmentation on all speech frames with speech in the single-channel speech to obtain the segmented speech data stream containing the preset type of sound.
  5. The method according to claim 3, characterized in that the extracting voice feature information of the target speaker from the second segmented speech data stream comprises:
    performing a short-time Fourier transform on the second segmented speech data stream, and extracting only the feature information of the magnitude part as the voice feature information of the target speaker.
  6. The method according to claim 3, characterized in that the voiceprint information is a voiceprint vector, and the obtaining the voiceprint information of the target speaker based on the voice feature information of the target speaker comprises:
    inputting the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
  7. The method according to any one of claims 1-3, 5, and 6, characterized in that the voiceprint information is a voiceprint vector, and the extracting, in real time, the second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on the existing voiceprint information of the target speaker comprises:
    dividing the first speech time-domain feature information into segments of a preset length;
    inputting the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain the second speech time-domain feature information of the target speaker corresponding to each of the segments, output by the speech extraction network in real time; wherein the extraction of the second speech time-domain feature information of the target speaker corresponding to the current segment depends on intermediate variables cached during the processing of historical segments.
  8. The method according to claim 7, characterized in that before the inputting the divided segments and the existing voiceprint vector of the target speaker into the pre-trained speech extraction network, the method further comprises:
    obtaining a training sample set, the training sample set comprising mixed speech and reference speech of a speaker, wherein the mixed speech comprises the speech of the speaker;
    training the speech extraction network with the training sample set, so that the second speech time-domain feature information output by the speech extraction network is that of the speaker.
  9. The method according to claim 2, characterized in that the obtaining the speech segment of the target speaker based on the second speech time-domain feature information comprises:
    restoring the second speech time-domain feature information to discrete speech sample points through a pre-trained time-domain decoder;
    fusing the discrete speech sample points to obtain the speech segment of the target speaker.
  10. A speaker voice extraction apparatus, characterized in that the apparatus comprises:
    a first speech collection module configured to collect mixed speech in an environment, the mixed speech comprising the speech of a target speaker;
    a mixed speech encoding module configured to obtain first speech time-domain feature information based on the mixed speech;
    a speech extraction module configured to extract, in real time, second speech time-domain feature information of the target speaker from the first speech time-domain feature information based on existing voiceprint information of the target speaker;
    a speech decoding module configured to obtain a speech segment of the target speaker based on the second speech time-domain feature information.
  11. The apparatus according to claim 10, characterized in that the mixed speech encoding module comprises:
    a first single-channel speech acquisition submodule configured to obtain first single-channel speech based on the mixed speech;
    a first segmented speech data stream acquisition submodule configured to perform sentence segmentation on the first single-channel speech to obtain a first segmented speech data stream containing a preset type of sound;
    a mixed speech time-domain feature acquisition submodule configured to process the first segmented speech data stream through a pre-trained time-domain encoder to obtain the first speech time-domain feature information.
  12. The apparatus according to claim 10, characterized in that the apparatus further comprises:
    a second speech collection module configured to collect speech of the target speaker;
    a second single-channel speech acquisition module configured to obtain second single-channel speech based on the speech of the target speaker;
    a second segmented speech data stream acquisition module configured to perform sentence segmentation on the second single-channel speech to obtain a second segmented speech data stream containing a preset type of sound;
    a voice feature extraction module configured to extract voice feature information of the target speaker from the second segmented speech data stream;
    a voiceprint acquisition module configured to obtain voiceprint information of the target speaker based on the voice feature information of the target speaker.
  13. The apparatus according to claim 12, characterized in that the voice feature extraction module is configured to:
    perform a short-time Fourier transform on the second segmented speech data stream, and extract only the feature information of the magnitude part as the voice feature information of the target speaker.
  14. The apparatus according to claim 12, characterized in that the voiceprint information is a voiceprint vector, and the voiceprint acquisition module is configured to:
    input the voice feature information of the target speaker into a pre-trained voiceprint network to obtain the voiceprint vector of the target speaker output by the voiceprint network.
  15. The apparatus according to any one of claims 10-14, characterized in that the voiceprint information is a voiceprint vector, and the speech extraction module is configured to:
    divide the first speech time-domain feature information into segments of a preset length;
    input the divided segments and the existing voiceprint vector of the target speaker into a pre-trained speech extraction network to obtain the second speech time-domain feature information of the target speaker corresponding to each of the segments, output by the speech extraction network in real time; wherein the extraction of the second speech time-domain feature information of the target speaker corresponding to the current segment depends on intermediate variables cached during the processing of historical segments.
  16. The apparatus according to claim 15, characterized in that the apparatus further comprises a training module;
    the training module is configured to:
    obtain a training sample set, the training sample set comprising mixed speech and reference speech of a speaker, wherein the mixed speech comprises the speech of the speaker;
    train the speech extraction network with the training sample set, so that the second speech time-domain feature information output by the speech extraction network is that of the speaker.
  17. The apparatus according to claim 11, characterized in that the speech decoding module is configured to:
    restore the second speech time-domain feature information to discrete speech sample points through a pre-trained time-domain decoder;
    fuse the discrete speech sample points to obtain the speech segment of the target speaker.
  18. A storage medium having a computer program stored thereon, characterized in that when the program is executed by a processor, the steps of the method according to any one of claims 1-9 are implemented.
  19. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the program, the steps of the method according to any one of claims 1-9 are implemented.
PCT/CN2021/120026 2020-09-29 2021-09-24 发声者语音抽取方法、装置、存储介质及电子设备 WO2022068675A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011055886.1 2020-09-29
CN202011055886.1A CN114333767A (zh) 2020-09-29 2020-09-29 发声者语音抽取方法、装置、存储介质及电子设备

Publications (1)

Publication Number Publication Date
WO2022068675A1 true WO2022068675A1 (zh) 2022-04-07

Family

ID=80949601

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120026 WO2022068675A1 (zh) 2020-09-29 2021-09-24 发声者语音抽取方法、装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN114333767A (zh)
WO (1) WO2022068675A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8775178B2 (en) * 2008-10-27 2014-07-08 International Business Machines Corporation Updating a voice template
CN108899037A (zh) * 2018-07-05 2018-11-27 平安科技(深圳)有限公司 Animal voiceprint feature extraction method and apparatus, and electronic device
US10176811B2 (en) * 2016-06-13 2019-01-08 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN109378002A (zh) * 2018-10-11 2019-02-22 平安科技(深圳)有限公司 Voiceprint verification method and apparatus, computer device, and storage medium
CN111402880A (zh) * 2020-03-24 2020-07-10 联想(北京)有限公司 Data processing method and apparatus, and electronic device
CN111429914A (zh) * 2020-03-30 2020-07-17 招商局金融科技有限公司 Microphone control method, electronic apparatus, and computer-readable storage medium


Also Published As

Publication number Publication date
CN114333767A (zh) 2022-04-12

Similar Documents

Publication Publication Date Title
Zhang et al. Deep learning for environmentally robust speech recognition: An overview of recent developments
CN110709924B (zh) Audio-visual speech separation
Peddinti et al. Jhu aspire system: Robust lvcsr with tdnns, ivector adaptation and rnn-lms
CN111508498B (zh) Conversational speech recognition method and system, electronic device, and storage medium
KR20230043250A (ko) Synthesis of speech from text in a target speaker's voice using neural networks
CN112397083B (zh) Speech processing method and related apparatus
Zmolikova et al. Neural target speech extraction: An overview
JP2006079079A (ja) Distributed speech recognition system and method
Ji et al. Speaker-aware target speaker enhancement by jointly learning with speaker embedding extraction
CN111179911A (zh) Target speech extraction method, apparatus, device, medium, and joint training method
WO2023030235A1 (zh) Target audio output method and system, readable storage medium, and electronic apparatus
CN111261145B (zh) Speech processing apparatus, device, and training method thereof
CN107464563B (zh) Voice interaction toy
CN113488063B (zh) Audio separation method based on mixed features and encoding-decoding
CN111667834B (zh) Hearing aid device and hearing aid method
CN110858476A (zh) Sound collection method and apparatus based on a microphone array
JP5385876B2 (ja) Speech period detection method, speech recognition method, speech period detection apparatus, speech recognition apparatus, program therefor, and recording medium
KR20080059881A (ko) Apparatus and method for preprocessing a speech signal
Zhao et al. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding.
WO2022068675A1 (zh) Speaker voice extraction method and apparatus, storage medium, and electronic device
Park et al. Analysis of confidence and control through voice of Kim Jung-un
Zhao et al. Radio2Speech: High Quality Speech Recovery from Radio Frequency Signals
KR101610708B1 (ko) Speech recognition apparatus and method
CN114333874A (zh) Method for processing an audio signal
WO2020068401A1 (en) Audio watermark encoding/decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874336

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21874336

Country of ref document: EP

Kind code of ref document: A1