WO2022037383A1 - Voice processing method and apparatus, electronic device, and computer readable medium - Google Patents

Voice processing method and apparatus, electronic device, and computer readable medium

Info

Publication number
WO2022037383A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
segment
feature vector
voiceprint feature
Prior art date
Application number
PCT/CN2021/109283
Other languages
English (en)
French (fr)
Inventor
蔡猛
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Priority to US18/041,710 (published as US20230306979A1)
Publication of WO2022037383A1

Classifications

    • G: Physics > G10: Musical instruments; acoustics > G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L21/0308: Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L15/04: Speech recognition; Segmentation; Word boundary detection
    • G10L17/02: Speaker identification or verification; Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L21/0272: Voice signal separating
    • G10L25/24: Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L17/00: Speaker identification or verification techniques

Definitions

  • Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a speech processing method, apparatus, device, and computer-readable medium.
  • A related approach is to use a segmentation-and-clustering method to obtain a target speech from a given piece of speech.
  • However, the accuracy of the target speech obtained by the segmentation-and-clustering method is not high.
  • Some embodiments of the present disclosure propose a speech processing method, apparatus, electronic device, and computer-readable medium to solve the technical problems mentioned in the background section above.
  • In a first aspect, some embodiments of the present disclosure provide a speech processing method, the method comprising: dividing speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; generating at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source; performing feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • In a second aspect, some embodiments of the present disclosure provide a speech processing apparatus, the apparatus comprising: a segmentation unit configured to divide speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; a first generation unit configured to generate at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source; a feature extraction unit configured to perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and a second generation unit configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method of any implementation of the first aspect.
  • In a fourth aspect, some embodiments of the present disclosure provide a computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any implementation of the first aspect.
  • First, the speech to be processed is divided into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; then, at least one first speech is generated based on the clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source. Through this process, the target speech can be segmented with a certain precision, which lays the foundation for the subsequent generation of the second speech.
  • Further, feature extraction is performed on each first speech of the at least one first speech to obtain the voiceprint feature vector corresponding to each first speech, and a second speech is generated based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source. By performing feature extraction on the first speech and then performing a further speech separation on it, a more accurate second speech is obtained, which improves the overall speech segmentation result.
  • FIG. 1 is a schematic diagram of an application scenario of a speech processing method according to some embodiments of the present disclosure;
  • FIG. 2 is a flowchart of some embodiments of a speech processing method according to the present disclosure;
  • FIG. 3 is a flowchart of other embodiments of speech processing methods according to the present disclosure.
  • FIG. 4 is a schematic diagram of another application scenario of the speech processing method according to some embodiments of the present disclosure.
  • FIG. 5 is a schematic structural diagram of some embodiments of a speech processing apparatus according to the present disclosure.
  • FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing some embodiments of the present disclosure.
  • FIG. 1 is a schematic diagram 100 of an application scenario of a speech processing method according to some embodiments of the present disclosure.
  • As shown in FIG. 1, the electronic device 101 divides the to-be-processed speech 102, which contains multiple speakers, into 9 speech segments (segments 1 through 9 in the figure) according to the start point and end point of each speaker's utterance.
  • Based on the clustering result of the 9 speech segments, 4 first speeches can be generated: the first speech A, first speech B, first speech C, and first speech D in the figure.
  • For each of these 4 first speeches, its voiceprint feature is extracted, yielding 4 voiceprint feature vectors: voiceprint feature vector A, voiceprint feature vector B, voiceprint feature vector C, and voiceprint feature vector D in the figure.
  • For each of the 4 voiceprint feature vectors, a second speech corresponding to that voiceprint feature vector can be generated: the second speech A, second speech B, second speech C, and second speech D in the figure.
  • It can be understood that the speech processing method may be executed by the above electronic device 101.
  • The electronic device 101 may be hardware or software.
  • When the electronic device 101 is hardware, it may be any of various electronic devices with information processing capabilities, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, servers, and the like.
  • When the electronic device 101 is software, it may be installed in the electronic devices listed above. It may be implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
  • It should be understood that the number of electronic devices in FIG. 1 is merely illustrative. There may be any number of electronic devices depending on implementation needs.
  • The speech processing method includes the following steps:
  • Step 201: Divide the speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech.
  • In some embodiments, the execution body of the speech processing method (for example, the electronic device shown in FIG. 1) may use various methods to divide the target speech into at least one speech segment.
  • The speech to be processed may be any piece of speech.
  • In practice, the speech to be processed may be speech from a conference that includes the voices of multiple speakers.
  • As an example, the execution body may use speech segmentation software to divide the speech to be processed into at least one speech segment; a minimal sketch of one possible realization follows.
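  • As a concrete illustration only (not part of the disclosure), the following minimal Python sketch shows one way such a segmentation step could be implemented with a simple energy-based voice activity detector. The frame length, energy threshold, silence tolerance, and the function name segment_speech are all illustrative assumptions, not the method required by the embodiments.

      import numpy as np

      def segment_speech(samples: np.ndarray, sr: int, frame_ms: int = 30,
                         energy_thresh: float = 1e-4, min_silence_frames: int = 10):
          """Return (start, end) sample ranges of continuous speech.

          Hypothetical energy-based detector: frames whose mean energy exceeds
          energy_thresh count as speech; a long enough run of quiet frames
          closes the current segment.
          """
          frame_len = int(sr * frame_ms / 1000)
          n_frames = len(samples) // frame_len
          energies = np.array([np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                               for i in range(n_frames)])
          segments, start, silence = [], None, 0
          for i, e in enumerate(energies):
              if e > energy_thresh:
                  if start is None:
                      start = i              # a new segment begins here
                  silence = 0
              elif start is not None:
                  silence += 1
                  if silence >= min_silence_frames:  # long pause: close the segment
                      segments.append((start * frame_len, (i - silence + 1) * frame_len))
                      start, silence = None, 0
          if start is not None:                      # speech ran to the end
              segments.append((start * frame_len, n_frames * frame_len))
          return segments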
  • Step 202: Generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source.
  • In some embodiments, the execution body may generate at least one first speech based on the clustering result of the at least one speech segment.
  • Here, the clustering result is obtained by applying a clustering algorithm to the at least one speech segment.
  • The clustering result may include multiple categories of speech segments. For each category of speech segments, the first speech may be obtained by various methods; in practice, the speech segments of a category can be spliced together to obtain one first speech.
  • The clustering algorithm may be one of the following: the K-Means clustering method, the Gaussian mixture clustering method, the mean-shift clustering method, or a density-based clustering method. As a concrete sketch of this step, see the example below.
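  • For illustration only, here is a minimal sketch of this step using scikit-learn's K-Means, one of the algorithms listed above. Summarizing each segment by a mean MFCC vector and fixing the number of clusters in advance are simplifying assumptions, not requirements of the embodiments.

      import numpy as np
      import librosa
      from sklearn.cluster import KMeans

      def cluster_and_splice(samples, sr, segments, n_clusters=4):
          """Cluster segments by a crude spectral summary and splice each cluster.

          Each (start, end) segment is described by its mean MFCC vector, the
          vectors are clustered with K-Means, and the segments of each cluster
          are concatenated into one candidate "first speech".
          """
          feats = [librosa.feature.mfcc(y=samples[s:e], sr=sr, n_mfcc=20).mean(axis=1)
                   for s, e in segments]
          labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(feats))
          first_speeches = []
          for k in range(n_clusters):
              parts = [samples[s:e] for (s, e), lab in zip(segments, labels) if lab == k]
              if parts:
                  first_speeches.append(np.concatenate(parts))
          return first_speeches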
  • In some optional implementations, each first speech of the at least one first speech includes at least one of the following: unmixed speech, mixed speech.
  • Unmixed speech may be speech in which only one person speaks, or speech produced by a single sound source.
  • Mixed speech may be speech in which multiple people speak at the same time, or speech produced by different sound sources simultaneously.
  • Step 203: Perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech.
  • In some embodiments, the execution body may use a feature extraction algorithm (for example, a pre-trained deep neural network) to extract the voiceprint feature vector of each first speech, thereby obtaining the voiceprint feature vector corresponding to each of the at least one first speech.
  • In practice, the voiceprint feature vector may be one of the following: an x-vector or an i-vector.
  • the voiceprint feature vector includes at least one of the following: a voiceprint feature vector corresponding to unmixed speech, and a voiceprint feature vector corresponding to mixed speech.
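  • For illustration, the sketch below exposes the interface of this step: it returns a fixed-length voiceprint-style vector for one first speech. A real system would use a trained x-vector or i-vector extractor; the mean-plus-standard-deviation statistics pooling over MFCCs used here is only an untrained stand-in for that interface.

      import numpy as np
      import librosa

      def voiceprint_vector(first_speech: np.ndarray, sr: int) -> np.ndarray:
          """Fixed-length embedding for one first speech (untrained stand-in).

          Frame-level MFCCs are reduced by statistics pooling (mean and standard
          deviation), mirroring the pooling stage of x-vector networks but
          without any learned layers.
          """
          mfcc = librosa.feature.mfcc(y=first_speech, sr=sr, n_mfcc=24)
          return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # 48-dim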
  • Step 204: Generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • In some embodiments, the execution body may generate, in various ways, the second speech corresponding to the voiceprint feature vector.
  • As an example, the execution body may input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech corresponding to the voiceprint feature vector.
  • In practice, the voiceprint feature vector is often input into a pre-trained time-domain audio separation network to obtain the corresponding second speech.
  • The second speech contains only the voice of one speaker, that is, unmixed speech.
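  • The disclosure does not fix a particular network architecture, so the following PyTorch sketch only illustrates the interface of a voiceprint-conditioned time-domain separator, loosely in the spirit of VoiceFilter/SpEx-style conditioning: the mixture waveform is encoded, the voiceprint vector is projected and injected into the latent features, a mask selects the target speaker, and a decoder reconstructs the waveform. All layer sizes are assumptions, and the module is untrained.

      import torch
      import torch.nn as nn

      class ConditionedSeparator(nn.Module):
          """Schematic time-domain separator conditioned on a voiceprint vector."""

          def __init__(self, emb_dim: int = 48, channels: int = 128, kernel: int = 16):
              super().__init__()
              self.encoder = nn.Conv1d(1, channels, kernel, stride=kernel // 2)
              self.cond = nn.Linear(emb_dim, channels)
              self.mask = nn.Sequential(
                  nn.Conv1d(channels, channels, 1), nn.ReLU(),
                  nn.Conv1d(channels, channels, 1), nn.Sigmoid())
              self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=kernel // 2)

          def forward(self, mixture: torch.Tensor, voiceprint: torch.Tensor):
              # mixture: (batch, samples); voiceprint: (batch, emb_dim)
              latent = self.encoder(mixture.unsqueeze(1))            # (B, C, T')
              latent = latent + self.cond(voiceprint).unsqueeze(-1)  # inject speaker
              masked = latent * self.mask(latent)                    # keep the target
              return self.decoder(masked).squeeze(1)                 # (B, ~samples)

  • A hypothetical call would be ConditionedSeparator()(torch.randn(1, 16000), torch.randn(1, 48)); a deployed separator would of course be trained to reconstruct the target speaker before being used this way.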
  • First, the to-be-processed speech is divided into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; then, at least one first speech is generated based on the clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source. Through this process, the target speech can be segmented with a certain precision, which lays the foundation for the subsequent generation of the second speech.
  • Further, feature extraction is performed on each first speech of the at least one first speech to obtain the voiceprint feature vector corresponding to each first speech, and a second speech is generated based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source. By performing feature extraction on the first speech and then performing a further speech separation on it, a more accurate second speech is obtained, which improves the overall speech segmentation result.
  • The flow 300 of the speech processing method includes the following steps:
  • Step 301: Divide the target speech into at least one speech segment.
  • For the specific implementation of step 301 and the technical effects it brings, reference may be made to step 201 in the embodiments corresponding to FIG. 2; details are not repeated here.
  • Step 302: Splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
  • In some embodiments, the execution body may splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment to generate multiple initial first speeches.
  • The clustering result may include multiple clusters.
  • Each speech segment cluster is generated by applying a clustering algorithm to the at least one speech segment.
  • Each speech segment cluster may include at least one speech segment.
  • The clustering algorithm may be one of the following: the K-Means clustering method, the Gaussian mixture clustering method, the mean-shift clustering method, or a density-based clustering method.
  • Step 303: For each initial first speech of the at least one initial first speech, divide the initial first speech into frames to obtain a set of speech frames, and splice the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set to generate the at least one first speech.
  • In some embodiments, the execution body may perform framing on the multiple initial first speeches to obtain a set of speech frames.
  • The length of a speech frame may span from the start to the end of an utterance of a single sound source.
  • The execution body may splice the speech frames in each cluster in the clustering result of the speech frames in the speech frame set to generate at least one first speech.
  • In practice, the execution body may apply a clustering algorithm, for example an HMM (Hidden Markov Model), to the speech frames in the speech frame set to obtain the clustering result.
  • The clustering result may include multiple speech frame clusters, each including multiple speech frames.
  • For each of the multiple speech frame clusters, the speech frames within that cluster are spliced; the multiple speech frame clusters may thus generate at least one first speech. A sketch of this second pass follows below.
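  • As an illustrative sketch of this second, frame-level pass, the code below uses K-Means as a lightweight stand-in for the HMM-based clustering mentioned above; the frame length, MFCC settings, and cluster count are assumed parameters.

      import numpy as np
      import librosa
      from sklearn.cluster import KMeans

      def recluster_frames(initial_first_speeches, sr, frame_ms=200, n_clusters=4):
          """Second pass: cut initial first speeches into frames and re-cluster.

          Frames pooled from all initial first speeches are described by mean
          MFCC vectors, clustered, and the frames of each cluster are spliced
          into a refined first speech.
          """
          hop = int(sr * frame_ms / 1000)
          frames, feats = [], []
          for speech in initial_first_speeches:
              for i in range(0, max(len(speech) - hop, 0), hop):
                  frame = speech[i:i + hop]
                  frames.append(frame)
                  feats.append(librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13,
                                                    n_fft=512, hop_length=256).mean(axis=1))
          labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(feats))
          return [np.concatenate([f for f, lab in zip(frames, labels) if lab == k])
                  for k in range(n_clusters) if (labels == k).any()]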
  • Step 304: Perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech.
  • In some embodiments, the execution body may use a feature extraction algorithm (for example, a pre-trained deep neural network) to extract the voiceprint feature vector of each first speech, thereby obtaining the voiceprint feature vector corresponding to each of the at least one first speech.
  • In practice, the voiceprint feature vector may be one of the following: an x-vector or an i-vector.
  • Step 305: Generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • In some embodiments, the execution body may generate, in various ways, the second speech corresponding to the voiceprint feature vector.
  • As an example, the execution body may input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech corresponding to the voiceprint feature vector.
  • In practice, the voiceprint feature vector is often input into a pre-trained time-domain audio separation network to obtain the corresponding second speech, which contains only the voice of one speaker, that is, unmixed speech.
  • For the specific implementation of step 305 and the technical effects it brings, reference may be made to step 204 in the embodiments corresponding to FIG. 2; details are not repeated here.
  • The flow 300 of the speech processing method in some embodiments corresponding to FIG. 3 embodies performing segmentation and clustering twice on the given target speech: the first pass segments and clusters according to a preset duration, and the second pass segments and clusters according to audio frames.
  • The first speech obtained through two passes of segmentation and clustering is more accurate, and using it for speech separation can greatly improve the accuracy of the second speech separated from the speech. A hypothetical end-to-end composition of the sketches above follows.
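  • Putting the two passes together, a hypothetical end-to-end composition of the sketches above (segment_speech, cluster_and_splice, recluster_frames, and voiceprint_vector are the illustrative helpers defined earlier, not names from the disclosure):

      import numpy as np

      sr = 16000
      rng = np.random.default_rng(0)
      # stand-in recording: 8 noise "utterances" separated by half-second pauses
      utt = [rng.standard_normal(sr).astype(np.float32) for _ in range(8)]
      gap = np.zeros(sr // 2, dtype=np.float32)
      mixture = np.concatenate([x for u in utt for x in (u, gap)])

      segments = segment_speech(mixture, sr)                 # split at pauses
      initial = cluster_and_splice(mixture, sr, segments)    # pass 1: cluster segments
      refined = recluster_frames(initial, sr)                # pass 2: cluster frames
      vectors = [voiceprint_vector(s, sr) for s in refined]  # one voiceprint per speech
      # each vector would then condition the separation network to emit a second speech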
  • FIG. 4 is a schematic diagram 400 of another application scenario of the speech processing method according to some embodiments of the present disclosure.
  • As shown in FIG. 4, the electronic device 401 divides the to-be-processed speech 402, which contains multiple speakers, into 9 speech segments: segments 1 through 9 in the figure.
  • Based on the clustering result of the 9 speech segments, 4 initial first speeches can be generated: the initial first speech A, initial first speech B, initial first speech C, and initial first speech D in the figure.
  • The four initial first speeches can then be further cut into speech frames, yielding a set of speech frames (speech frames 1 through 8 in the figure).
  • At least one first speech can be generated by splicing the speech frames in each cluster in the clustering result of the speech frames in the speech frame set: the first speech A, first speech B, first speech C, and first speech D in the figure.
  • For each of the 4 first speeches, the voiceprint feature vector of the first speech can be extracted, yielding 4 voiceprint feature vectors: voiceprint feature vector A, voiceprint feature vector B, voiceprint feature vector C, and voiceprint feature vector D in the figure.
  • For each of the 4 voiceprint feature vectors, a second speech corresponding to that voiceprint feature vector can be generated: the second speech A, second speech B, second speech C, and second speech D in the figure.
  • With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a speech processing apparatus; these apparatus embodiments correspond to the method embodiments shown in FIG. 2, and the apparatus can be applied to various electronic devices.
  • The speech processing apparatus 500 of some embodiments includes: a segmentation unit 501, a first generation unit 502, a feature extraction unit 503, and a second generation unit 504.
  • The segmentation unit 501 is configured to divide the speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech;
  • the first generation unit 502 is configured to generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source;
  • the feature extraction unit 503 is configured to perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech;
  • the second generation unit 504 is configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • In some optional implementations, the first generation unit 502 may be further configured to: splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
  • In some optional implementations, the first generation unit 502 may be further configured to: for each initial first speech of the at least one initial first speech, divide the initial first speech into frames to obtain a set of speech frames, and splice the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set to generate the at least one first speech.
  • each of the above at least one first voice includes at least one of the following: unmixed voice and mixed voice.
  • the voiceprint feature vector corresponding to the first voice includes at least one of the following: a voiceprint feature vector corresponding to unmixed voice, and a voiceprint feature vector corresponding to mixed voice.
  • In some optional implementations, the second generation unit 504 may be further configured to: input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate the unmixed speech of a target sound source from the voiceprint feature vector.
  • The units recorded in the apparatus 500 correspond to the respective steps in the method described with reference to FIG. 2. Therefore, the operations, features, and beneficial effects described above with respect to the method are also applicable to the apparatus 500 and the units included therein; details are not described here again.
  • Referring now to FIG. 6, a schematic structural diagram of an electronic device 600 (for example, the electronic device of FIG. 1) suitable for implementing some embodiments of the present disclosure is shown.
  • the electronic device shown in FIG. 6 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • As shown in FIG. 6, the electronic device 600 may include a processing device 601 (for example, a central processing unit or a graphics processor), which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • The following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speakers, and vibrators; storage devices 608 including, for example, magnetic tapes and hard disks; and a communication device 609.
  • The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 6 shows an electronic device 600 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as required.
  • the processes described above with reference to the flowcharts may be implemented as computer software programs.
  • some embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 608, or from the ROM 602.
  • When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of some embodiments of the present disclosure are performed.
  • the computer-readable medium described above may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the foregoing two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein.
  • Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • In some implementations, the client and server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned apparatus; or may exist alone without being assembled into the electronic device.
  • The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: divide the speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source; perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units described in some embodiments of the present disclosure may be implemented by means of software, and may also be implemented by means of hardware.
  • The described units may also be provided in a processor; for example, it may be described as: a processor comprising a segmentation unit, a first generation unit, a feature extraction unit, and a second generation unit. The names of these units do not, in some cases, constitute a limitation of the unit itself; for example, the segmentation unit may also be described as "a unit for dividing the speech to be processed into at least one speech segment".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • According to one or more embodiments of the present disclosure, a speech processing method is provided, comprising: dividing speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; generating at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source; performing feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • According to one or more embodiments, the speech segments in each speech segment cluster in the clustering result of the at least one speech segment are spliced into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
  • According to one or more embodiments, for each initial first speech of the at least one initial first speech, the initial first speech is divided into frames to obtain a set of speech frames, and the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set are spliced to generate the at least one first speech.
  • each of the at least one first voices includes at least one of the following: unmixed voice, mixed voice.
  • the voiceprint feature vector corresponding to the first voice includes at least one of the following: a voiceprint feature vector corresponding to unmixed voice, and a voiceprint feature vector corresponding to mixed voice.
  • According to one or more embodiments, the voiceprint feature vector is input into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate the unmixed speech of a target sound source from the voiceprint feature vector.
  • According to one or more embodiments of the present disclosure, a speech processing apparatus is provided, comprising: a segmentation unit, a first generation unit, a feature extraction unit, and a second generation unit.
  • The segmentation unit is configured to divide the speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; the first generation unit is configured to generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech includes at least one speech segment of the same sound source;
  • the feature extraction unit is configured to perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech;
  • the second generation unit is configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  • The first generation unit may be further configured to: splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
  • The first generation unit may be further configured to: for each initial first speech of the at least one initial first speech, divide the initial first speech into frames to obtain a set of speech frames, and splice the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set to generate the at least one first speech.
  • each of the at least one first voices includes at least one of the following: unmixed voice, mixed voice.
  • the voiceprint feature vector corresponding to the first voice includes at least one of the following: a voiceprint feature vector corresponding to unmixed voice, and a voiceprint feature vector corresponding to mixed voice.
  • The second generation unit may be further configured to: input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate the unmixed speech of a target sound source from the voiceprint feature vector.
  • According to one or more embodiments, an electronic device is provided, comprising: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the above embodiments.
  • a computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method described in any of the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A speech processing method (200) includes: dividing speech to be processed into at least one speech segment (201), wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; and generating at least one first speech based on a clustering result of the at least one speech segment (202). Through this process, the target speech can be segmented with a certain precision, which lays the foundation for the subsequent generation of a second speech. Feature extraction is performed on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech (203), and a second speech is generated based on the voiceprint feature vector (204), wherein the second speech is unmixed speech of a single sound source. A speech processing apparatus (500), an electronic device (600), and a computer-readable medium are also disclosed. By performing feature extraction on the first speech and then performing a further speech separation on the first speech, a more accurate second speech is obtained, which improves the overall speech segmentation result.

Description

Voice processing method and apparatus, electronic device, and computer readable medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on, and claims priority to, the Chinese patent application with application number 202010824772.2, filed on August 17, 2020 and entitled "Voice processing method and apparatus, electronic device, and computer readable medium"; the entire contents of that Chinese patent application are incorporated herein by reference.
TECHNICAL FIELD
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a speech processing method, apparatus, device, and computer-readable medium.
BACKGROUND
At present, in the course of speech separation, it is often necessary to separate a target speech out of a given piece of speech. A related approach is to use a segmentation-and-clustering method to obtain the target speech from the given speech. However, the accuracy of the target speech obtained by the segmentation-and-clustering method is not high.
SUMMARY
This summary is provided to introduce, in brief form, concepts that are described in detail in the detailed description below. This summary is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
Some embodiments of the present disclosure propose a speech processing method, apparatus, electronic device, and computer-readable medium to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a speech processing method, the method comprising: dividing speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; generating at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source; performing feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
In a second aspect, some embodiments of the present disclosure provide a speech processing apparatus, the apparatus comprising: a segmentation unit configured to divide speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; a first generation unit configured to generate at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source; a feature extraction unit configured to perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and a second generation unit configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method of any implementation of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any implementation of the first aspect.
One of the above embodiments of the present disclosure has the following beneficial effects. First, the speech to be processed is divided into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; then, at least one first speech is generated based on the clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source. Through this process, the target speech can be segmented with a certain precision, which lays the foundation for the subsequent generation of the second speech. Further, feature extraction is performed on each first speech of the at least one first speech to obtain the voiceprint feature vector corresponding to each first speech, and a second speech is generated based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source. By performing feature extraction on the first speech and then performing a further speech separation on the first speech, a more accurate second speech is obtained, which improves the overall speech segmentation result.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of an application scenario of a speech processing method according to some embodiments of the present disclosure;
FIG. 2 is a flowchart of some embodiments of a speech processing method according to the present disclosure;
FIG. 3 is a flowchart of other embodiments of a speech processing method according to the present disclosure;
FIG. 4 is a schematic diagram of another application scenario of a speech processing method according to some embodiments of the present disclosure;
FIG. 5 is a schematic structural diagram of some embodiments of a speech processing apparatus according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing some embodiments of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should also be noted that, for ease of description, only the parts related to the relevant invention are shown in the drawings. The embodiments of the present disclosure and the features in the embodiments may be combined with each other without conflict.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of the functions performed by these devices, modules, or units, or their interdependence.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings and in conjunction with embodiments.
FIG. 1 is a schematic diagram 100 of an application scenario of a speech processing method according to some embodiments of the present disclosure.
As shown in FIG. 1, the electronic device 101 divides the to-be-processed speech 102, which contains multiple speakers, into 9 speech segments according to the start point and end point of each speaker's utterance: segments 1 through 9 in the figure. Based on the clustering result 103 of the 9 speech segments, 4 first speeches can be generated: the first speech A, first speech B, first speech C, and first speech D in the figure. For each of these 4 first speeches, the voiceprint feature of each first speech is extracted, yielding 4 voiceprint feature vectors: voiceprint feature vector A, voiceprint feature vector B, voiceprint feature vector C, and voiceprint feature vector D in the figure. For each of the 4 voiceprint feature vectors, a second speech corresponding to that voiceprint feature vector can be generated: the second speech A, second speech B, second speech C, and second speech D in the figure.
It can be understood that the speech processing method may be executed by the above electronic device 101. The electronic device 101 may be hardware or software. When the electronic device 101 is hardware, it may be any of various electronic devices with information processing capabilities, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, servers, and the like. When the electronic device 101 is software, it may be installed in the electronic devices listed above; it may be implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the number of electronic devices in FIG. 1 is merely illustrative. There may be any number of electronic devices as required by the implementation.
With continued reference to FIG. 2, a flow 200 of some embodiments of a speech processing method according to the present disclosure is shown. The speech processing method includes the following steps.
Step 201: Divide the speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech.
In some embodiments, the execution body of the speech processing method (for example, the electronic device shown in FIG. 1) may use various methods to divide the target speech into at least one speech segment. The speech to be processed may be any piece of speech. In practice, the speech to be processed may be speech from a conference that includes the voices of multiple speakers.
As an example, the execution body may use speech segmentation software to divide the speech to be processed into at least one speech segment.
Step 202: Generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source.
In some embodiments, the execution body may generate at least one first speech based on the clustering result of the at least one speech segment. Here, the clustering result is obtained by applying a clustering algorithm to the at least one speech segment, and may include multiple categories of speech segments. For each category of speech segments, the first speech may be obtained by various methods; in practice, the speech segments of a category can be spliced together to obtain one first speech. Here, the clustering algorithm may be one of the following: the K-Means clustering method, the Gaussian mixture clustering method, the mean-shift clustering method, or a density-based clustering method.
In some optional implementations of some embodiments, each first speech of the at least one first speech includes at least one of the following: unmixed speech, mixed speech. Unmixed speech may be speech in which only one person speaks, or speech produced by a single sound source; mixed speech may be speech in which multiple people speak at the same time, or speech produced by different sound sources simultaneously.
Step 203: Perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech.
In some embodiments, for each first speech of the at least one first speech, the execution body may use a feature extraction algorithm (for example, a pre-trained deep neural network) to extract the voiceprint feature vector of the first speech, thereby obtaining the voiceprint feature vector corresponding to each of the at least one first speech. In practice, the voiceprint feature vector may be one of the following: an x-vector or an i-vector.
In some optional implementations of some embodiments, the voiceprint feature vector includes at least one of the following: a voiceprint feature vector corresponding to unmixed speech, a voiceprint feature vector corresponding to mixed speech.
Step 204: Generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
In some embodiments, for the voiceprint feature vector, the execution body may generate, in various ways, the second speech corresponding to the voiceprint feature vector.
As an example, the execution body may input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech corresponding to the voiceprint feature vector. In practice, the voiceprint feature vector is often input into a pre-trained time-domain audio separation network to obtain the second speech corresponding to the voiceprint feature vector; this second speech contains only the voice of one speaker, that is, unmixed speech.
One of the above embodiments of the present disclosure has the following beneficial effects:
First, the speech to be processed is divided into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; then, at least one first speech is generated based on the clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source. Through this process, the target speech can be segmented with a certain precision, which lays the foundation for the subsequent generation of the second speech. Further, feature extraction is performed on each first speech of the at least one first speech to obtain the voiceprint feature vector corresponding to each first speech, and a second speech is generated based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source. By performing feature extraction on the first speech and then performing a further speech separation on the first speech, a more accurate second speech is obtained, which improves the overall speech segmentation result.
With further reference to FIG. 3, a flow 300 of other embodiments of the speech processing method is shown. The flow 300 of the speech processing method includes the following steps.
Step 301: Divide the target speech into at least one speech segment.
In some embodiments, for the specific implementation of step 301 and the technical effects it brings, reference may be made to step 201 in the embodiments corresponding to FIG. 2; details are not repeated here.
Step 302: Splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
In some embodiments, the execution body may splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment to generate multiple initial first speeches. Here, the clustering result may include multiple clusters; each of the speech segment clusters is generated by applying a clustering algorithm to the at least one speech segment, and each speech segment cluster may include at least one speech segment. Here, the clustering algorithm may be one of the following: the K-Means clustering method, the Gaussian mixture clustering method, the mean-shift clustering method, or a density-based clustering method.
Step 303: For each initial first speech of the at least one initial first speech, divide the initial first speech into frames to obtain a set of speech frames, and splice the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set to generate the at least one first speech.
In some embodiments, the execution body may perform framing on the multiple initial first speeches to obtain a set of speech frames. The length of a speech frame may span from the start to the end of an utterance of a single sound source. The execution body may splice the speech frames in each cluster in the clustering result of the speech frames in the speech frame set to generate at least one first speech. In practice, the execution body may apply a clustering algorithm, for example an HMM (Hidden Markov Model), to the speech frames in the speech frame set to obtain the clustering result. The clustering result may include multiple speech frame clusters, each including multiple speech frames. For each of the multiple speech frame clusters, the speech frames within that cluster are spliced; the multiple speech frame clusters may thus generate at least one first speech.
Step 304: Perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech.
In some embodiments, for each first speech of the at least one first speech, the execution body may use a feature extraction algorithm (for example, a pre-trained deep neural network) to extract the voiceprint feature vector of the first speech, thereby obtaining the voiceprint feature vector corresponding to each of the at least one first speech. In practice, the voiceprint feature vector may be one of the following: an x-vector or an i-vector.
Step 305: Generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
In some embodiments, for the voiceprint feature vector, the execution body may generate, in various ways, the second speech corresponding to the voiceprint feature vector.
As an example, the execution body may input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech corresponding to the voiceprint feature vector. In practice, the voiceprint feature vector is often input into a pre-trained time-domain audio separation network to obtain the second speech corresponding to the voiceprint feature vector; this second speech contains only the voice of one speaker, that is, unmixed speech.
In some embodiments, for the specific implementation of step 305 and the technical effects it brings, reference may be made to step 204 in the embodiments corresponding to FIG. 2; details are not repeated here.
As can be seen from FIG. 3, compared with the description of the embodiments corresponding to FIG. 2, the flow 300 of the speech processing method in the embodiments corresponding to FIG. 3 embodies performing segmentation and clustering twice on the given target speech: the first pass segments and clusters according to a preset duration, and the second pass segments and clusters according to audio frames. The first speech obtained through two passes of segmentation and clustering is more accurate, and using the first speech obtained through two passes of segmentation and clustering for speech separation can greatly improve the accuracy of the separated second speech.
FIG. 4 is a schematic diagram 400 of another application scenario of a speech processing method according to some embodiments of the present disclosure.
As shown in FIG. 4, the electronic device 401 divides the to-be-processed speech 402, which contains multiple speakers, into 9 speech segments: segments 1 through 9 in the figure. Based on the clustering result 403 of the 9 speech segments, 4 initial first speeches can be generated: the initial first speech A, initial first speech B, initial first speech C, and initial first speech D in the figure. These four initial first speeches can then be further cut into speech frames to obtain a set of speech frames: speech frames 1 through 8 in the figure. By splicing the speech frames in each cluster in the clustering result of the speech frames in the speech frame set, at least one first speech can be generated: the first speech A, first speech B, first speech C, and first speech D in the figure. For each of the 4 first speeches, the voiceprint feature vector of the first speech can be extracted, yielding 4 voiceprint feature vectors: voiceprint feature vector A, voiceprint feature vector B, voiceprint feature vector C, and voiceprint feature vector D in the figure. For each of the 4 voiceprint feature vectors, a second speech corresponding to that voiceprint feature vector can be generated: the second speech A, second speech B, second speech C, and second speech D in the figure.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a speech processing apparatus. These apparatus embodiments correspond to the method embodiments shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 5, the speech processing apparatus 500 of some embodiments includes: a segmentation unit 501, a first generation unit 502, a feature extraction unit 503, and a second generation unit 504. The segmentation unit 501 is configured to divide speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; the first generation unit 502 is configured to generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source; the feature extraction unit 503 is configured to perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and the second generation unit 504 is configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
In some optional implementations of some embodiments, the first generation unit 502 may be further configured to: splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
In some optional implementations of some embodiments, the first generation unit 502 may be further configured to: for each initial first speech of the at least one initial first speech, divide the initial first speech into frames to obtain a set of speech frames, and splice the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set to generate the at least one first speech.
In some optional implementations of some embodiments, each first speech of the at least one first speech includes at least one of the following: unmixed speech, mixed speech.
In some optional implementations of some embodiments, the voiceprint feature vector corresponding to the first speech includes at least one of the following: a voiceprint feature vector corresponding to unmixed speech, a voiceprint feature vector corresponding to mixed speech.
In some optional implementations of some embodiments, the second generation unit 504 may be further configured to: input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate the unmixed speech of a target sound source from the voiceprint feature vector.
It can be understood that the units recorded in the apparatus 500 correspond to the respective steps of the method described with reference to FIG. 2. Therefore, the operations, features, and resulting beneficial effects described above for the method also apply to the apparatus 500 and the units contained therein; details are not repeated here.
Referring now to FIG. 6, a schematic structural diagram of an electronic device 600 (for example, the electronic device of FIG. 1) suitable for implementing some embodiments of the present disclosure is shown. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing device 601 (for example, a central processing unit or a graphics processor), which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speakers, and vibrators; storage devices 608 including, for example, magnetic tapes and hard disks; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as required.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the methods of some embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium of some embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In some embodiments of the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable signal medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: electrical wire, optical cable, RF (radio frequency), and the like, or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), a peer-to-peer network (for example, an ad hoc peer-to-peer network), and any currently known or future-developed network.
The above computer-readable medium may be contained in the above apparatus, or it may exist separately without being assembled into the electronic device. The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: divide speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source; perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
Computer program code for executing the operations of some embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor; for example, it may be described as: a processor comprising a segmentation unit, a first generation unit, a feature extraction unit, and a second generation unit. The names of these units do not, in some cases, constitute a limitation on the unit itself; for example, the segmentation unit may also be described as "a unit that divides speech to be processed into at least one speech segment".
The functions described above herein may be executed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and so on.
According to one or more embodiments of the present disclosure, a speech processing method is provided, comprising: dividing speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; generating at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source; performing feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
According to one or more embodiments of the present disclosure, the speech segments in each speech segment cluster in the clustering result of the at least one speech segment are spliced into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
According to one or more embodiments of the present disclosure, for each initial first speech of the at least one initial first speech, the initial first speech is divided into frames to obtain a set of speech frames, and the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set are spliced to generate the at least one first speech.
According to one or more embodiments of the present disclosure, each first speech of the at least one first speech includes at least one of the following: unmixed speech, mixed speech.
According to one or more embodiments of the present disclosure, the voiceprint feature vector corresponding to the first speech includes at least one of the following: a voiceprint feature vector corresponding to unmixed speech, a voiceprint feature vector corresponding to mixed speech.
According to one or more embodiments of the present disclosure, the voiceprint feature vector is input into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate the unmixed speech of a target sound source from the voiceprint feature vector.
According to one or more embodiments of the present disclosure, a speech processing apparatus is provided, comprising: a segmentation unit, a first generation unit, a feature extraction unit, and a second generation unit. The segmentation unit is configured to divide speech to be processed into at least one speech segment, wherein a speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech; the first generation unit is configured to generate at least one first speech based on the clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source; the feature extraction unit is configured to perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and the second generation unit is configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
According to one or more embodiments of the present disclosure, the first generation unit may be further configured to: splice the speech segments in each speech segment cluster in the clustering result of the at least one speech segment into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
According to one or more embodiments of the present disclosure, the first generation unit may be further configured to: for each initial first speech of the at least one initial first speech, divide the initial first speech into frames to obtain a set of speech frames, and splice the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set to generate the at least one first speech.
According to one or more embodiments of the present disclosure, each first speech of the at least one first speech includes at least one of the following: unmixed speech, mixed speech.
According to one or more embodiments of the present disclosure, the voiceprint feature vector corresponding to the first speech includes at least one of the following: a voiceprint feature vector corresponding to unmixed speech, a voiceprint feature vector corresponding to mixed speech.
According to one or more embodiments of the present disclosure, the second generation unit may be further configured to: input the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate the unmixed speech of a target sound source from the voiceprint feature vector.
According to one or more embodiments of the present disclosure, an electronic device is provided, comprising: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the above embodiments.
According to one or more embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the above embodiments.
The above description is only of some preferred embodiments of the present disclosure and an illustration of the technical principles applied. Those skilled in the art should understand that the scope of invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (9)

  1. A speech processing method, comprising:
    dividing speech to be processed into at least one speech segment, wherein the speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech;
    generating at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source;
    performing feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and
    generating a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  2. The method according to claim 1, wherein the generating at least one first speech based on the clustering result of the at least one speech segment comprises:
    splicing the speech segments in each speech segment cluster in the clustering result of the at least one speech segment into an initial first speech, generating at least one initial first speech corresponding to the at least one speech segment.
  3. The method according to claim 2, wherein the generating at least one first speech based on the clustering result of the at least one speech segment comprises:
    for each initial first speech of the at least one initial first speech, dividing the initial first speech into frames to obtain a set of speech frames, and splicing the speech frames in each speech frame cluster in the clustering result of the speech frames in the speech frame set to generate the at least one first speech.
  4. The method according to claim 1, wherein each first speech of the at least one first speech includes at least one of the following: unmixed speech, mixed speech.
  5. The method according to claim 1, wherein the voiceprint feature vector corresponding to the first speech includes at least one of the following: a voiceprint feature vector corresponding to unmixed speech, a voiceprint feature vector corresponding to mixed speech.
  6. The method according to claim 5, wherein the generating a second speech based on the voiceprint feature vector comprises:
    inputting the voiceprint feature vector into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate the unmixed speech of a target sound source from the voiceprint feature vector.
  7. A speech processing apparatus, comprising:
    a segmentation unit configured to divide speech to be processed into at least one speech segment, wherein the speech segment is a segment of speech from a single sound source, from the start of the speech to the end of the speech;
    a first generation unit configured to generate at least one first speech based on a clustering result of the at least one speech segment, wherein the first speech contains at least one speech segment of the same sound source;
    a feature extraction unit configured to perform feature extraction on each first speech of the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and
    a second generation unit configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is unmixed speech of a single sound source.
  8. An electronic device, comprising:
    one or more processors; and
    a storage device on which one or more programs are stored,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
  9. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
PCT/CN2021/109283 2020-08-17 2021-07-29 Voice processing method and apparatus, electronic device, and computer readable medium WO2022037383A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/041,710 US20230306979A1 (en) 2020-08-17 2021-07-29 Voice processing method and apparatus, electronic device, and computer readable medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010824772.2A CN111968657B (zh) 2020-08-17 Voice processing method and apparatus, electronic device, and computer readable medium
CN202010824772.2 2020-08-17

Publications (1)

Publication Number Publication Date
WO2022037383A1 (zh)

Family

ID=73388065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109283 WO2022037383A1 (zh) 2020-08-17 2021-07-29 Voice processing method and apparatus, electronic device, and computer readable medium

Country Status (3)

Country Link
US (1) US20230306979A1 (zh)
CN (1) CN111968657B (zh)
WO (1) WO2022037383A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968657B (zh) * 2020-08-17 2022-08-16 北京字节跳动网络技术有限公司 Voice processing method and apparatus, electronic device, and computer readable medium
CN113674755B (zh) * 2021-08-19 2024-04-02 北京百度网讯科技有限公司 Speech processing method and apparatus, electronic device, and medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000767A (zh) * 2006-01-09 2007-07-18 杭州世导科技有限公司 Speech recognition device and method
CN105975569A (zh) * 2016-05-03 2016-09-28 深圳市金立通信设备有限公司 Speech processing method and terminal
US20170076713A1 (en) * 2015-09-14 2017-03-16 International Business Machines Corporation Cognitive computing enabled smarter conferencing
CN106782545A (zh) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 System and method for converting audio and video data into text records
CN107749296A (zh) * 2017-10-12 2018-03-02 深圳市沃特沃德股份有限公司 Voice translation method and apparatus
CN109256137A (zh) * 2018-10-09 2019-01-22 深圳市声扬科技有限公司 Voice acquisition method and apparatus, computer device, and storage medium
CN110381389A (zh) * 2018-11-14 2019-10-25 腾讯科技(深圳)有限公司 Subtitle generation method and apparatus based on artificial intelligence
CN110853615A (zh) * 2019-11-13 2020-02-28 北京欧珀通信有限公司 Data processing method and apparatus, and storage medium
CN110930984A (zh) * 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 Speech processing method and apparatus, and electronic device
CN111161710A (zh) * 2019-12-11 2020-05-15 Oppo广东移动通信有限公司 Simultaneous interpretation method and apparatus, electronic device, and storage medium
CN111524527A (zh) * 2020-04-30 2020-08-11 合肥讯飞数码科技有限公司 Speaker separation method and apparatus, electronic device, and storage medium
CN111968657A (zh) * 2020-08-17 2020-11-20 北京字节跳动网络技术有限公司 Voice processing method and apparatus, electronic device, and computer readable medium

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060248019A1 (en) * 2005-04-21 2006-11-02 Anthony Rajakumar Method and system to detect fraud using voice data
CN103514884A (zh) * 2012-06-26 2014-01-15 华为终端有限公司 Call audio noise reduction method and terminal
US10134400B2 (en) * 2012-11-21 2018-11-20 Verint Systems Ltd. Diarization using acoustic labeling
CN103530432A (zh) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extraction function and speech extraction method
CN106056996B (zh) * 2016-08-23 2017-08-29 深圳市鹰硕技术有限公司 Multimedia interactive teaching system and method
CN108198560A (zh) * 2018-01-18 2018-06-22 安徽三弟电子科技有限责任公司 Recording optimization method and recording optimization system based on voiceprint recognition
CN109741754A (zh) * 2018-12-10 2019-05-10 上海思创华信信息技术有限公司 Conference speech recognition method and system, storage medium, and terminal
CN110197665B (zh) * 2019-06-25 2021-07-09 广东工业大学 Speech separation and tracking method for public security criminal investigation monitoring
CN110335612A (zh) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Conference record generation method and apparatus based on speech recognition, and storage medium
CN110473566A (zh) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method and apparatus, electronic device, and computer-readable storage medium
CN110675891B (zh) * 2019-09-25 2020-09-18 电子科技大学 Speech separation method and module based on a multi-layer attention mechanism
CN111105801B (zh) * 2019-12-03 2022-04-01 云知声智能科技股份有限公司 Role-based speech separation method and apparatus
CN110853666B (zh) * 2019-12-17 2022-10-04 科大讯飞股份有限公司 Speaker separation method, apparatus, device, and storage medium
CN111128223B (zh) * 2019-12-30 2022-08-05 科大讯飞股份有限公司 Text-information-assisted speaker separation method and related apparatus
CN111063342B (zh) * 2020-01-02 2022-09-30 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
CN111968657A (zh) 2020-11-20
US20230306979A1 (en) 2023-09-28
CN111968657B (zh) 2022-08-16

Similar Documents

Publication Publication Date Title
US11196540B2 (en) End-to-end secure operations from a natural language expression
WO2022033327A1 (zh) Video generation method, generation model training method, apparatus, medium, and device
CN112786006B (zh) Speech synthesis method, synthesis model training method, apparatus, medium, and device
WO2022037388A1 (zh) Voice generation method and apparatus, device, and computer-readable medium
WO2022105861A1 (zh) Method and apparatus for recognizing speech, electronic device, and medium
WO2022037383A1 (zh) Voice processing method and apparatus, electronic device, and computer readable medium
US11783808B2 (en) Audio content recognition method and apparatus, and device and computer-readable medium
WO2022105553A1 (zh) Speech synthesis method and apparatus, readable medium, and electronic device
CN112153460B (zh) Video soundtrack method and apparatus, electronic device, and storage medium
WO2023083142A1 (zh) Sentence segmentation method and apparatus, storage medium, and electronic device
CN111798821B (zh) Voice conversion method and apparatus, readable storage medium, and electronic device
WO2022156464A1 (zh) Speech synthesis method and apparatus, readable medium, and electronic device
WO2022237665A1 (zh) Speech synthesis method and apparatus, electronic device, and storage medium
CN111785247A (zh) Voice generation method and apparatus, device, and computer-readable medium
WO2023082931A1 (zh) Method, device, and storage medium for punctuation restoration in speech recognition
WO2020220824A1 (zh) Method and apparatus for recognizing speech
CN108962226B (zh) Method and apparatus for detecting endpoints of speech
EP4360085A1 (en) Robust direct speech-to-speech translation
KR20230020508A (ko) Text echo cancellation
CN111933119A (zh) Method and apparatus for generating a speech recognition network, electronic device, and medium
CN114550728B (zh) Method and apparatus for labeling speakers, and electronic device
CN112017685B (zh) Voice generation method and apparatus, device, and computer-readable medium
CN112489662A (zh) Method and apparatus for training a speech processing model
de Abreu Pinna et al. A brazilian portuguese real-time voice recognition to deal with sensitive data
US20230107475A1 (en) Exploring Heterogeneous Characteristics of Layers In ASR Models For More Efficient Training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21857478

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21857478

Country of ref document: EP

Kind code of ref document: A1