WO2024082928A1 - Speech processing method, apparatus, device and medium - Google Patents

Speech processing method, apparatus, device and medium

Info

Publication number
WO2024082928A1
WO2024082928A1 · PCT/CN2023/121068 · CN2023121068W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
feature
mixed
speaker
Prior art date
Application number
PCT/CN2023/121068
Other languages
English (en)
French (fr)
Inventor
崔国辉
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2024082928A1 publication Critical patent/WO2024082928A1/zh

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Coding or decoding using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/0212 Coding or decoding using spectral analysis, using orthogonal transformation
    • G10L 19/04 Coding or decoding using predictive techniques
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a speech processing method, apparatus, device and medium.
  • Speech processing technology refers to the technology of audio processing of speech signals.
  • Speech extraction is one of the speech processing technologies.
  • with speech extraction technology, the sound of interest to the user can be extracted from a complex speech scene.
  • a complex speech scene may include at least one of multi-person speaking interference, large reverberation, high background noise and music noise.
  • with speech extraction technology, the user can extract the sound of the object of interest from a complex speech scene.
  • conventionally, speech extraction is usually performed directly on the complex speech, and the extracted speech is directly used as the speech of the object to be extracted.
  • the speech extracted in this way often contains residual noise (for example, the extracted speech also includes the sound of other objects), resulting in low speech extraction accuracy.
  • a speech processing method, apparatus, device and medium are provided.
  • the present application provides a speech processing method, the method comprising: acquiring a registered voice of a speaker and acquiring a mixed voice, wherein the mixed voice includes voice information of multiple sound-generating objects, and the multiple sound-generating objects include the speaker; determining a registered voice feature of the registered voice, and extracting a preliminary recognized voice of the speaker from the mixed voice based on the registered voice feature; determining, according to the registered voice feature, the voice similarity between the registered voice and the voice information included in the preliminary recognized voice; and filtering out, from the preliminary recognized voice, voice information whose voice similarity is less than a preset similarity, to obtain the clean voice of the speaker.
  • the present application provides a speech processing device, the device comprising:
  • An acquisition module used to acquire a registered voice of a speaker and to acquire a mixed voice, wherein the mixed voice includes voice information of multiple sound-generating objects, and the multiple sound-generating objects include the speaker;
  • a first extraction module configured to determine a registered voice feature of the registered voice, and extract a preliminary recognized voice of the speaker from the mixed voice based on the registered voice feature;
  • a determination module configured to determine the voice similarity between the registered voice and the voice information included in the preliminary recognized voice according to the registered voice feature
  • the filtering module is used to filter out the voice information whose voice similarity is less than a preset similarity from the preliminary recognized voice, so as to obtain the clean voice of the speaker.
  • the present application provides a computer device including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor executes the steps in the method embodiments of the present application when executing the computer-readable instructions.
  • the present application provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to perform the steps in the method embodiments of the present application.
  • the present application provides a computer program product comprising computer-readable instructions which, when executed by a processor, cause the processor to perform the steps in the method embodiments of the present application.
  • FIG. 1 is a diagram of an application environment of a speech processing method according to an embodiment
  • FIG. 2 is a flow chart of a speech processing method in one embodiment
  • FIG. 3 is a schematic diagram of a network structure of a speech extraction network in one embodiment
  • FIG. 4 is a schematic diagram of a network structure of a model for performing speech extraction on mixed speech in one embodiment
  • FIG. 5 is a schematic diagram of the network structure of a primary speech extraction network in one embodiment
  • FIG. 6 is a schematic diagram of a network structure of a noise reduction network in one embodiment
  • FIG. 7 is a schematic diagram of a network structure of a registration network in one embodiment
  • FIG. 8 is a diagram showing an application environment of a speech processing method according to another embodiment
  • FIG. 9 is a schematic diagram showing the principle of a speech processing method in one embodiment
  • FIG. 10 is a schematic diagram showing the principle of filtering the initially recognized speech in one embodiment
  • FIG. 11 is a flow chart of a speech processing method in another embodiment
  • FIG. 12 is a block diagram of a speech processing device according to an embodiment
  • FIG. 13 is a diagram showing the internal structure of a computer device in one embodiment.
  • the speech processing method provided in this application can be applied in an application environment as shown in FIG1.
  • the terminal 102 communicates with the server 104 through a network.
  • the data storage system can store the data that the server 104 needs to process.
  • the data storage system can be integrated on the server 104, or it can be placed on the cloud or other servers.
  • the terminal 102 can be, but is not limited to, various desktop computers, laptops, smart phones, tablet computers, Internet of Things devices and portable wearable devices.
  • the Internet of Things devices can be smart speakers, smart TVs, smart air conditioners, smart car-mounted devices, etc.
  • Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, etc.
  • the server 104 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (content distribution network), and big data and artificial intelligence platforms.
  • the terminal 102 and the server 104 can be directly or indirectly connected by wired or wireless communication, and this application is not limited here.
  • the terminal 102 may obtain the registered voice of the speaker and obtain the mixed voice, wherein the mixed voice includes voice information of multiple sound-emitting objects, and the multiple sound-emitting objects include the speaker.
  • the terminal 102 may determine the registered voice features of the registered voice, and extract the preliminary recognized voice of the speaker from the mixed voice according to the registered voice features.
  • the terminal 102 may determine the voice similarity between the registered voice and the voice information included in the preliminary recognized voice according to the registered voice features.
  • the terminal 102 may filter out the voice information whose voice similarity is less than a preset similarity from the preliminary recognized voice, and obtain the clean voice of the speaker.
  • the speech processing method in some embodiments of the present application uses artificial intelligence technology.
  • the registered speech features of the registered speech are features encoded using artificial intelligence technology
  • the speaker's preliminary recognized speech is also speech recognized using artificial intelligence technology.
  • a voice processing method is provided. This embodiment is described by taking the method applied to the terminal 102 in FIG. 1 as an example, and includes the following steps:
  • Step 202: obtain the registered voice of the speaker and obtain the mixed voice, where the mixed voice includes the voice information of multiple sound-generating objects, and the multiple sound-generating objects include the speaker.
  • the sound object is an entity that can make a sound, which can be a natural object or an artificial object, and can be a living or non-living object.
  • the sound object includes at least one of a person, an animal, or an object.
  • the sound object targeted by the speech processing can be called a speaker or a target object. It can be understood that the speaker is the object from which the speech needs to be extracted by the speech processing method of the present application.
  • the speech can be stored in the form of a digital signal as an audio format file.
  • Mixed speech is speech that includes the speech information of multiple sound-generating objects.
  • the multiple sound-generating objects may all be users, and one of the multiple sound-generating objects is a speaker.
  • Mixed speech includes the speaker's speech information.
  • Mixed speech includes the speaker's speech information, which can be understood as the sound recorded by the mixed speech includes the speaker's voice.
  • Registered Voice is a clean voice pre-registered for a speaker, and is a segment of the speaker's voice pre-stored in a voice database. It can be understood that the registered voice basically only includes the speaker's voice information, and does not include the voice information of other sound-making objects other than the speaker, or the voice information of other sound-making objects other than the speaker is very small and can be ignored.
  • the speaker can speak a paragraph in a relatively quiet environment, and the terminal can collect the sound of the speaker speaking this paragraph and generate the registered voice. It can be understood that this paragraph does not include the sound of other objects except the speaker.
  • the terminal can collect the words spoken by the speaker in a quiet environment, and generate the speaker's registered voice based on the words spoken by the speaker in a quiet environment.
  • a quiet environment can be one in which the ambient noise does not exceed a preset decibel level.
  • the preset decibel level can be 30-40 dB, or a lower or higher level can be set as needed.
  • the speaker can speak a paragraph in a noisy environment, and the terminal can collect the sound of the speaker speaking this paragraph and generate a mixed voice. It can be understood that this paragraph includes the sound of other sound objects other than the speaker, and can also include environmental noise.
  • the terminal can collect the words spoken by the speaker in a noisy environment, and generate a mixed voice including the speaker's voice information based on the words spoken by the speaker in the noisy environment.
  • a noisy environment can be one in which the ambient noise exceeds the preset decibel level.
  • the terminal can directly use the voice corresponding to the words spoken by the speaker in a quiet environment as the speaker's registered voice.
  • the terminal can directly use the voice corresponding to the words spoken by the speaker in a noisy environment as a mixed voice including the speaker's voice information.
  • Step 204: determine the registered voice features of the registered voice.
  • the registered speech feature is a feature of the registered speech, which can characterize the characteristics of the speaker's speech and can also be called the speaker's speech feature.
  • the terminal can use a machine learning model to extract the registered speech features from the registered speech, or can use at least one acoustic feature extraction method, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), Linear Spectral Frequency (LSF), Discrete Wavelet Transform (DWT) or Perceptual Linear Prediction (PLP); a minimal MFCC sketch is shown after the acronym list below.
  • MFCC: Mel Frequency Cepstral Coefficient
  • LPC: Linear Predictive Coding
  • LPCC: Linear Prediction Cepstral Coefficient
  • LSF: Linear Spectral Frequency
  • DWT: Discrete Wavelet Transform
  • PLP: Perceptual Linear Prediction
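  • The patent does not prescribe a specific implementation of acoustic feature extraction; the following is a minimal sketch, assuming the librosa library, a 16 kHz sample rate, 20 MFCC coefficients, and an example file name.

```python
# Minimal sketch (not from the patent): extracting MFCC features from a
# registered-voice file with librosa; file name and parameters are assumptions.
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return an (n_mfcc, frames) MFCC matrix for one utterance."""
    y, sr = librosa.load(path, sr=16000, mono=True)   # resample to 16 kHz
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

# Averaging over time gives a single fixed-length vector that could serve
# as a simple stand-in for a registered speech feature vector.
# registered_vec = extract_mfcc("registered_voice.wav").mean(axis=1)
```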
  • Step 206: extract the speaker's preliminary recognized speech from the mixed speech based on the registered speech features.
  • the registered speech features can be used to perform preliminary recognition of the speech information of the speaker in the mixed speech.
  • Preliminary recognition is a relatively rough recognition, which is used to extract the preliminary recognition speech from the mixed speech.
  • the preliminary recognized speech is the speech obtained by preliminary recognition of the speech information of the speaker in the mixed speech. It can be understood that the preliminary recognized speech includes the speech information of the speaker, but may also include the speech information of other sound-generating objects besides the speaker.
  • the preliminary recognition speech is the basis for subsequent processing and can also be called the initial speech.
  • the terminal can extract the speech that meets the conditions associated with the registered speech features from the mixed speech to obtain the preliminary recognized speech of the speaker.
  • the condition is, for example, that the values of one or more voice parameters of a certain segment or piece of voice information in the mixed speech and of the registered speech features meet a preset matching condition.
  • the terminal can extract features from the registered voice to obtain the registered voice features of the registered voice. Further, the terminal can perform preliminary recognition of the voice information of the speaker in the mixed voice based on the registered voice features of the registered voice, that is, perform initial speech extraction on the mixed voice to obtain the speaker's preliminary recognized voice.
  • the terminal may perform feature extraction on the mixed speech to obtain the mixed speech feature of the mixed speech. Further, the terminal may perform preliminary recognition of the voice information of the speaker in the mixed speech based on the mixed speech feature and the registered voice feature to obtain the preliminary recognized voice of the speaker.
  • the mixed speech feature is a feature of the mixed speech.
  • the preliminary recognition speech can be extracted by a pre-trained speech extraction model.
  • the terminal can input the mixed speech and the registered speech features of the registered speech into the speech extraction network, so as to perform preliminary recognition of the speech information of the speaker in the mixed speech through the speech extraction network to obtain the preliminary recognized speech of the speaker.
  • the speech extraction network can adopt a convolutional neural network (CNN).
  • Step 208: determine the voice similarity between the registered voice and the voice information included in the preliminary recognized voice based on the registered voice features.
  • the terminal can determine the voice similarity between the registered voice and the voice information in the preliminary recognized voice according to the registered voice features.
  • the voice similarity is the similarity of the voice's acoustic characteristics, and is essentially unrelated to the content expressed by the voice.
  • the voice similarity specifically refers to the similarity between the registered voice and the voice information in the preliminary recognized voice. The greater the voice similarity, the more similar they are; the smaller the voice similarity, the less similar they are.
  • the terminal may extract features from the voice information in the preliminary recognized voice to obtain voice information features. Further, the terminal may determine the voice similarity between the registered voice and the voice information in the preliminary recognized voice based on the registered voice features and the voice information features.
  • Step 210: filter out speech information with speech similarity less than a preset similarity from the preliminary recognized speech to obtain the clean speech of the speaker.
  • the terminal can determine the voice information whose voice similarity is less than the preset similarity from the preliminary recognized voice, and obtain the voice information to be filtered.
  • the terminal can filter out the voice information to be filtered from the preliminary recognized voice, and obtain the clean voice of the speaker.
  • the voice information to be filtered is the voice information in the preliminary recognized voice that is to be removed.
  • the clean voice is the clean voice of the speaker. It can be understood that the clean voice only includes the voice information of the speaker, and does not include the voice information of other objects except the speaker.
  • the clean voice of the speaker is the result of the voice processing method of each embodiment of the present application, which can be called the target voice.
  • the terminal can determine whether the voice similarity between each voice information in the preliminary recognition voice and the registered voice is less than the preset similarity. If the voice similarity is less than the preset similarity, the terminal can use the corresponding voice information as the voice information to be filtered. If the voice similarity is greater than or equal to the preset similarity, it can be understood that the voice similarity between the registered voice and the corresponding voice information is high, indicating that the voice information is likely to belong to the voice information corresponding to the speaker. At this time, the terminal can retain the corresponding voice information.
  • the preset similarity can be set according to the range of speech similarity and filtering strength. The smaller the preset similarity, the smaller the filtering strength, and the easier it is to retain some noise; the larger the preset similarity, the greater the filtering strength, and the easier it is to filter the speaker's voice. Therefore, the preset similarity can be determined within the range of speech similarity according to actual needs and test results.
  • the terminal may filter out the voice information to be filtered in the preliminary recognized voice.
  • the terminal may mute the voice information to be filtered in the preliminary recognized voice, and generate the clean voice of the speaker based on the voice information retained in the preliminary recognized voice.
  • the retained voice information is the voice information that is not muted.
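  • As an illustration of this filtering step only, the following is a minimal sketch; it assumes the preliminary recognized speech is split into equal-length waveform segments with a precomputed per-segment similarity score, and the threshold value is an assumption rather than a value from the patent.

```python
# Minimal sketch: mute ("filter out") segments whose similarity to the registered
# voice is below a preset threshold, keeping the retained segments unchanged.
import numpy as np

def filter_segments(prelim: np.ndarray, similarities: np.ndarray,
                    seg_len: int, threshold: float = 0.7) -> np.ndarray:
    """Return the clean speech built from the retained (non-muted) segments."""
    clean = prelim.copy()
    for i, sim in enumerate(similarities):
        if sim < threshold:                      # "less than a preset similarity"
            start = i * seg_len
            clean[start:start + seg_len] = 0.0   # mute the segment to be filtered
    return clean
```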
  • the mixed speech includes the speech information of the speaker.
  • based on the registered speech features, the speaker's preliminary recognized speech can be initially and relatively accurately extracted from the mixed speech.
  • advanced filtering processing will be performed on the basis of the preliminary recognition speech, that is, according to the registered speech features, the speech similarity between the speech information in the registered speech and the preliminary recognition speech is determined, and the speech information with speech similarity less than the preset similarity is filtered out from the preliminary recognition speech, so that the residual noise in the preliminary recognition speech can be filtered out, thereby obtaining a cleaner clean speech of the speaker and improving the accuracy of speech extraction.
  • extracting the speaker's preliminary recognized voice from the mixed voice includes: determining the mixed voice features of the mixed voice; fusing the mixed voice features with the registered voice features of the registered voice to obtain speech fusion features; and, based on the speech fusion features, preliminarily recognizing the speech information of the speaker in the mixed speech to obtain the speaker's preliminary recognized speech.
  • the speech fusion feature is the speech feature obtained by fusing the mixed speech feature with the registered speech feature of the registered speech.
  • the terminal can extract features from the mixed speech to obtain mixed speech features of the mixed speech, and fuse the mixed speech features with the registered speech features of the registered speech to obtain speech fusion features. Furthermore, the terminal can perform preliminary recognition of the speech information of the speaker in the mixed speech based on the speech fusion features to obtain preliminary recognition speech of the speaker.
  • the terminal may perform Fourier transform on the mixed speech to obtain a Fourier transform result, and perform feature extraction based on the Fourier transform result to obtain mixed speech features of the mixed speech.
  • the terminal may perform feature splicing of the mixed speech feature and the registered speech feature of the registered speech, and use the spliced feature as the speech fusion feature.
  • the terminal may map the mixed speech feature and the registered speech feature of the registered speech to the same dimension, and then perform a weighted summation or weighted average operation to obtain the speech fusion feature.
  • a speech fusion feature including the mixed speech features and the registered speech features can be obtained, and then the speech information of the speaker in the mixed speech is preliminarily recognized based on the speech fusion feature, which can improve the extraction accuracy of the preliminarily recognized speech.
  • the mixed speech feature includes a mixed speech feature matrix, the speech fusion feature includes a speech fusion feature matrix, and the registered speech feature includes a registered speech feature vector; fusing the mixed speech feature with the registered speech feature of the registered speech to obtain the speech fusion feature includes: repeating the registered speech feature vector in the time dimension to generate a registered speech feature matrix whose time dimension is the same as that of the mixed speech feature matrix; and splicing the mixed speech feature matrix and the registered speech feature matrix to obtain the speech fusion feature matrix.
  • the time dimension is the dimension corresponding to the number of frames of the speech signal in the time domain.
  • the mixed speech feature matrix is the feature matrix corresponding to the mixed speech feature, and is a specific embodiment of the mixed speech feature.
  • the speech fusion feature matrix is the feature matrix corresponding to the speech fusion feature, and is a specific embodiment of the speech fusion feature.
  • the registered speech feature vector is the feature vector corresponding to the registered speech feature.
  • the registered speech feature matrix is a feature matrix composed of the registered speech feature vectors.
  • the terminal may obtain the length of the time dimension of the mixed speech feature matrix, and repeat the registered speech feature vector in the time dimension with the length of the time dimension of the mixed speech feature matrix as a constraint to generate a registered speech feature matrix having the same time dimension as the time dimension of the mixed speech feature matrix. Furthermore, the terminal may concatenate the mixed speech feature matrix and the registered speech feature matrix to obtain a speech fusion feature matrix.
  • the registered speech feature vector is repeated in the time dimension to generate a registered speech feature matrix having the same time dimension as the mixed speech feature matrix, so that the mixed speech feature matrix and the registered speech feature matrix can be subsequently concatenated to obtain a speech fusion feature matrix, thereby improving the accuracy of feature fusion.
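  • The repeat-and-splice operation described above can be illustrated with the following minimal sketch; the matrix shapes and feature dimensions are assumptions for the example, not values from the patent.

```python
# Minimal sketch: repeat the registered speech feature vector along the time
# dimension and concatenate it with the mixed speech feature matrix.
import numpy as np

mixed_feat = np.random.randn(200, 256)        # (time frames, feature dim), example shapes
registered_vec = np.random.randn(128)         # registered speech feature vector

# Repeat the vector so its time dimension matches the mixed speech feature matrix.
registered_mat = np.tile(registered_vec, (mixed_feat.shape[0], 1))   # (200, 128)

# Splice (concatenate) along the feature dimension to obtain the fusion matrix.
fusion_mat = np.concatenate([mixed_feat, registered_mat], axis=1)    # (200, 384)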
  • determining mixed speech features of mixed speech includes: extracting an amplitude spectrum of the mixed speech to obtain a first amplitude spectrum; performing feature extraction on the first amplitude spectrum to obtain amplitude spectrum features; and performing feature extraction on the amplitude spectrum features to obtain mixed speech features of the mixed speech.
  • the first amplitude spectrum is the amplitude spectrum of the mixed speech, and the amplitude spectrum feature is the feature of the first amplitude spectrum.
  • the terminal may perform Fourier transform on the mixed speech in the time domain to obtain speech information of the mixed speech in the frequency domain.
  • the terminal may obtain a first amplitude spectrum of the mixed speech based on the speech information of the mixed speech in the frequency domain.
  • the terminal may perform feature extraction on the first amplitude spectrum to obtain amplitude spectrum features, and perform feature extraction on the amplitude spectrum features to obtain mixed speech features of the mixed speech.
  • the mixed speech features of the mixed speech can be obtained, thereby improving the accuracy of the mixed speech features.
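  • A minimal sketch of obtaining the first amplitude spectrum via a short-time Fourier transform is shown below; the librosa library, transform parameters, and file name are assumptions, not part of the patent.

```python
# Minimal sketch (assumed parameters): first amplitude spectrum of the mixed speech.
import librosa
import numpy as np

mixed, sr = librosa.load("mixed_voice.wav", sr=16000)   # file name is an assumption
stft = librosa.stft(mixed, n_fft=512, hop_length=256)   # complex spectrogram
first_amplitude_spectrum = np.abs(stft)                  # amplitude (magnitude) spectrum
phase_spectrum = np.angle(stft)                          # kept for later reconstruction
```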
  • preliminarily recognizing the speech information of the speaker in the mixed speech to obtain the speaker's preliminary recognized speech includes: preliminarily recognizing the speech information of the speaker in the mixed speech based on the speech fusion feature to obtain the speaker's speech features; performing feature decoding on the speaker's speech features to obtain a second amplitude spectrum; and transforming the second amplitude spectrum according to the phase spectrum of the mixed speech to obtain the speaker's preliminary recognized speech.
  • the speech feature of the speaker is a feature that reflects the characteristics of the speaker's voice when speaking, and can be called the speaker's target speech feature.
  • the second amplitude spectrum is an amplitude spectrum obtained after decoding the target speech feature.
  • the terminal can perform preliminary recognition of the voice information of the speaker in the mixed voice based on the voice fusion feature to obtain the target voice feature of the speaker. Then, the terminal can feature decode the target voice feature to obtain the second amplitude spectrum. The terminal can obtain the phase spectrum of the mixed voice, and transform the second amplitude spectrum according to the phase spectrum of the mixed voice to obtain the preliminary recognition voice of the speaker.
  • the second amplitude spectrum is used to characterize the speech signal in the frequency domain.
  • the terminal can perform an inverse Fourier transform on the second amplitude spectrum according to the phase spectrum of the mixed speech to obtain the preliminary recognized speech of the speaker in the time domain.
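  • The reconstruction described above can be sketched as follows; the transform parameters and file name are assumptions, and the predicted second amplitude spectrum is stood in for by a placeholder.

```python
# Minimal sketch: combine a predicted second amplitude spectrum with the phase
# spectrum of the mixed speech and apply the inverse short-time Fourier transform
# to recover a time-domain waveform.
import librosa
import numpy as np

mixed, sr = librosa.load("mixed_voice.wav", sr=16000)
stft = librosa.stft(mixed, n_fft=512, hop_length=256)
phase_spectrum = np.angle(stft)

# Placeholder standing in for the model-predicted amplitude spectrum of the speaker.
second_amplitude_spectrum = np.abs(stft)

complex_spec = second_amplitude_spectrum * np.exp(1j * phase_spectrum)
preliminary_recognized = librosa.istft(complex_spec, hop_length=256)
```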
  • the preliminary recognition speech is obtained by extracting through a speech extraction network.
  • the speech extraction network includes a Fourier transform unit, an encoder, a long short-term memory unit, and an inverse Fourier transform unit. It can be understood that the terminal can extract the first amplitude spectrum of the mixed speech through the Fourier transform unit in the speech extraction network. The terminal can perform feature extraction on the first amplitude spectrum through the encoder in the speech extraction network to obtain the amplitude spectrum feature.
  • the terminal can perform feature extraction on the amplitude spectrum feature through the long short-term memory unit in the speech extraction network to obtain the mixed speech feature of the mixed speech, and perform preliminary recognition on the speech information of the speaker in the mixed speech based on the speech fusion feature to obtain the speaker's object speech feature, and perform feature decoding on the object speech feature to obtain the second amplitude spectrum. Furthermore, the terminal can transform the second amplitude spectrum according to the phase spectrum of the mixed speech through the inverse Fourier transform unit in the speech extraction network to obtain the speaker's preliminary recognition speech.
  • the target speech feature of the speaker can be obtained. Then, by feature decoding the target speech feature, a second amplitude spectrum can be obtained, and the second amplitude spectrum is transformed according to the phase spectrum of the mixed speech to convert the signal in the frequency domain into a speech signal in the time domain, and the speaker's preliminary recognition speech is obtained, thereby improving the extraction accuracy of the preliminary recognition speech.
  • determining the registered speech features of the registered speech includes: extracting the frequency spectrum of the registered speech; generating a Mel frequency spectrum of the registered speech based on the frequency spectrum; and performing feature extraction on the Mel frequency spectrum to obtain the registered speech features of the registered speech.
  • the terminal may perform Fourier transform on the registered voice in the time domain to obtain voice information of the registered voice in the frequency domain.
  • the terminal may obtain the frequency spectrum of the registered voice according to the voice information of the registered voice in the frequency domain.
  • the terminal may generate a Mel frequency spectrum of the registered voice according to the frequency spectrum of the registered voice, and perform feature extraction on the Mel frequency spectrum to obtain the registered voice features of the registered voice.
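  • A minimal sketch of computing a Mel frequency spectrum from the registered voice is given below; the librosa library, 80 Mel bands, transform parameters, and file name are assumptions rather than details from the patent.

```python
# Minimal sketch: Mel frequency spectrum of the registered voice, which a downstream
# encoder could turn into the registered voice features.
import librosa
import numpy as np

registered, sr = librosa.load("registered_voice.wav", sr=16000)  # file name is an assumption
spectrum = np.abs(librosa.stft(registered, n_fft=512, hop_length=256)) ** 2
mel_spectrum = librosa.feature.melspectrogram(S=spectrum, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel_spectrum)    # log-Mel is a common input representation
```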
  • the speech information includes speech segments; based on the registered speech features, the speech similarity between the registered speech and the speech information included in the preliminary recognized speech is determined, including: for each speech segment in the preliminary recognized speech, determining the segment speech features corresponding to the speech segment; based on the segment speech features and the registered speech features, determining the speech similarity between the registered speech and the speech segment.
  • the segment speech feature is a speech feature of a speech segment.
  • the preliminary recognized speech includes multiple speech segments.
  • the terminal can extract features of each speech segment in the preliminary recognized speech to obtain the segment speech features of the speech segment, and determine the speech similarity between the registered speech and the speech segment based on the segment speech features and the registered speech features.
  • the terminal may perform feature extraction on each speech segment in the initially recognized speech to obtain a segment speech feature corresponding to the speech segment.
  • the segment speech feature includes a segment speech feature vector
  • the registered speech feature includes a registered speech feature vector.
  • the terminal can determine the speech similarity between the registered speech and the speech segment according to the segment speech feature vector and the registered speech feature vector of the speech segment for each speech segment in the preliminary recognition speech.
  • the speech similarity between the registered speech and the speech segment can be calculated by the following cosine similarity formula: cosθ = (A · B) / (‖A‖ × ‖B‖), where A represents the segment speech feature vector, B represents the registered speech feature vector, θ represents the angle between the segment speech feature vector and the registered speech feature vector, and cosθ represents the speech similarity between the registered speech and the speech segment.
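  • The formula above corresponds directly to the following minimal sketch (the vector contents are whatever feature representation is used upstream).

```python
# Minimal sketch: cosine similarity between the segment speech feature vector A
# and the registered speech feature vector B, as in the formula above.
import numpy as np

def speech_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (|A| * |B|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```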
  • the calculation accuracy of the speech similarity between the registered speech and the speech information in the preliminary recognized speech can be improved.
  • the segment speech feature corresponding to the speech segment is determined, including: for each speech segment in the preliminary recognized speech, the speech segment is repeated to obtain a reconstructed speech with the same time length as the registered speech; wherein the reconstructed speech includes multiple speech segments; and the segment speech feature corresponding to the speech segment is determined according to the reconstructed speech feature of the reconstructed speech.
  • the voice information in the preliminary recognized speech includes voice segments
  • the voice similarity between the registered speech and the voice information in the preliminary recognized speech is determined based on the registered speech features, including: repeating each voice segment in the preliminary recognized speech according to the time length of the registered speech to obtain a reconstructed speech of the time length; obtaining reconstructed speech features extracted from the reconstructed speech, and determining the segment voice features corresponding to each voice segment in the preliminary recognized speech based on the reconstructed speech features; and determining the voice similarity between the registered speech and each voice segment based on the segment voice features corresponding to each voice segment and the registered speech features.
  • the reconstructed speech is the speech obtained by reconstructing a plurality of identical speech segments. It can be understood that the reconstructed speech includes a plurality of identical speech segments.
  • the terminal can obtain the time length of the registered voice, and for each voice segment in the preliminary recognized voice, repeat the voice segment according to the time length of the registered voice to obtain a reconstructed voice that is consistent with the time length of the registered voice.
  • the obtained reconstructed voice includes multiple identical voice segments.
  • the terminal can perform feature extraction on the reconstructed voice to obtain a reconstructed voice feature of the reconstructed voice, and determine the segment voice feature corresponding to the voice segment according to the reconstructed voice feature of the reconstructed voice.
  • the terminal may directly use the reconstructed speech features of the reconstructed speech as the segment speech features corresponding to the speech segment.
  • the speech segments are repeatedly processed to obtain a reconstructed speech that is consistent with the time length of the registered speech and includes multiple identical speech segments, and then the segment speech features corresponding to the speech segments are determined based on the reconstructed speech features of the reconstructed speech, which can further improve the calculation accuracy of the speech similarity between the speech information in the registered speech and the preliminary recognized speech.
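  • The segment repetition described above can be sketched as follows; it assumes the segment and the registered voice are waveforms at the same sample rate, and trims the repeated signal to the exact registered-voice length.

```python
# Minimal sketch: repeat one speech segment until it matches the time length of the
# registered voice, producing a reconstructed speech of multiple identical segments.
import numpy as np

def reconstruct_segment(segment: np.ndarray, registered_len: int) -> np.ndarray:
    repeats = int(np.ceil(registered_len / len(segment)))
    return np.tile(segment, repeats)[:registered_len]   # trim to the exact length
```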
  • the steps of determining the speech similarity between the speech information included in the registered speech and the preliminary recognized speech based on the registered speech features, and filtering out the speech information with speech similarity less than a preset similarity from the preliminary recognized speech to obtain the clean speech of the speaker are performed in the first processing mode.
  • the speech processing method also includes: in a second processing mode, obtaining interference speech, the interference speech is extracted from the mixed speech based on the registered speech features; obtaining mixed speech features of the mixed speech, speech features of the preliminary recognized speech, and speech features of the interference speech; fusing the mixed speech features with the speech features of the preliminary recognized speech based on an attention mechanism to obtain a first attention feature; fusing the mixed speech features with the speech features of the interference speech based on an attention mechanism to obtain a second attention feature; and fusing the mixed speech features, the first attention features, and the second attention features, and obtaining the speaker's clean speech based on the fused features.
  • the terminal in the first processing mode, can perform the steps of determining the voice similarity and the subsequent corresponding voice filtering.
  • the terminal can also extract the interference voice from the mixed voice based on the registered voice features.
  • the interference voice is the voice that interferes with the voice information of the speaker in the mixed voice.
  • in the second processing mode, the terminal can fuse the mixed voice features of the mixed voice and the voice features of the preliminary recognized voice based on the attention mechanism to obtain the first attention feature, fuse the mixed voice features and the voice features of the interference voice based on the attention mechanism to obtain the second attention feature, fuse the mixed voice features, the first attention feature and the second attention feature, and obtain the speaker's clean speech based on the fused features.
  • the first attention feature is a feature obtained by fusing the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech based on the attention mechanism.
  • the second attention feature is a feature obtained by fusing the mixed speech features and the speech features of the interfering speech based on the attention mechanism. It can be understood that fusing the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech based on the attention mechanism means multiplying the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech by the corresponding attention weights for fusion. It can also be understood that fusing the mixed speech features and the speech features of the interfering speech based on the attention mechanism means multiplying the mixed speech features and the speech features of the interfering speech by the corresponding attention weights for fusion.
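  • As a rough illustration of attention-based fusion only, the following PyTorch sketch weights the other features (the preliminary recognized speech features or the interference speech features) by attention scores computed against the mixed speech features; the layer sizes and the exact attention form are assumptions, not the patent's architecture.

```python
# Minimal sketch (PyTorch): attention-weighted fusion of the mixed speech features
# with another feature sequence, yielding an "attention feature".
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, mixed_feat: torch.Tensor, other_feat: torch.Tensor) -> torch.Tensor:
        # mixed_feat, other_feat: (batch, time, dim)
        scores = self.query(mixed_feat) @ self.key(other_feat).transpose(1, 2)
        weights = torch.softmax(scores / mixed_feat.size(-1) ** 0.5, dim=-1)
        attended = weights @ other_feat                 # attention-weighted other features
        return mixed_feat + attended                    # fused attention feature

# first_attention  = AttentionFusion()(mixed_feat, prelim_feat)
# second_attention = AttentionFusion()(mixed_feat, interference_feat)
```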
  • whether the processing mode is the first processing mode or the second processing mode determines the way in which the speaker's clean speech is extracted.
  • the processing mode can be pre-configured or modified in real time, or can be freely selected by the user.
  • the terminal in response to the first processing mode selection operation, may determine the current processing mode as the first processing mode. In the first processing mode, the terminal may determine the voice similarity between the registered voice and the voice information in the preliminary recognized voice according to the registered voice feature, determine the voice information whose voice similarity is less than the preset similarity from the preliminary recognized voice, obtain the voice information to be filtered, filter the voice information to be filtered in the preliminary recognized voice, and obtain the clean voice of the speaker.
  • the terminal in response to the second processing mode selection operation, can determine the current processing mode as the second processing mode.
  • the terminal can fuse the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech based on the attention mechanism to obtain the first attention features, and fuse the mixed speech features and the speech features of the interfering speech based on the attention mechanism to obtain the second attention features; fuse the mixed speech features, the first attention features and the second attention features, and obtain the speaker's clean speech based on the fused features.
  • the terminal may directly fuse the mixed speech feature, the first attention feature, and the second attention feature to obtain a fused feature. Further, the terminal may determine the clean speech of the speaker based on the fused feature.
  • the terminal may input the mixed speech and registered speech features into a pre-trained speech extraction model to perform speech extraction based on the mixed speech and registered speech features through the speech extraction model, and output preliminary recognition speech and interference speech.
  • in the first processing mode, the preliminary recognized voice extracted from the mixed voice is subjected to advanced voice filtering based on the voice similarity between the registered voice and the voice information in the preliminary recognized voice, so as to obtain a cleaner clean voice of the speaker. It can be understood that in the first processing mode, a relatively clean voice can be obtained quickly, which improves the efficiency of voice extraction.
  • the second processing mode by fusing the mixed voice features of the mixed voice and the voice features of the preliminary recognition voice based on the attention mechanism, and fusing the mixed voice features and the voice features of the interfering voice based on the attention mechanism, the first attention feature and the second attention feature are obtained respectively.
  • the clean voice of the speaker is determined based on the mixed voice features, the first attention feature and the second attention feature. It can be understood that, compared with the first processing mode, a cleaner clean voice can be obtained in the second processing mode, and the accuracy of voice extraction is further improved. In this way, two processing modes are provided for users to choose, which can improve the flexibility of voice extraction.
  • the mixed speech feature, the first attention feature and the second attention feature are fused, and the clean speech of the speaker is obtained based on the fused features, including: fusing the mixed speech feature, the first attention feature, the second attention feature and the registered speech feature, and obtaining the clean speech of the speaker based on the fused features.
  • the terminal may fuse the mixed speech feature, the first attention feature, the second attention feature and the registered speech feature to obtain a fused feature. Further, the terminal may determine the clean speech of the speaker based on the fused feature.
  • the fused features can be made more accurate, and then the speaker's clean speech can be determined based on the more accurate fused features, which can further improve the accuracy of speech extraction.
  • the preliminary recognition speech and the interfering speech are extracted from the mixed speech by a trained speech extraction model.
  • the method also includes: inputting the mixed speech and the registered speech features into a speech extraction model; generating first mask information and second mask information based on the mixed speech and the registered speech features through the speech extraction model; shielding the interference information in the mixed speech according to the first mask information through the speech extraction model to obtain the preliminary recognized speech of the speaker; and shielding the speech information of the speaker in the mixed speech according to the second mask information through the speech extraction model to obtain the interference speech.
  • the preliminary recognition speech and the interference speech are extracted from the mixed speech by a pre-trained speech extraction model; the method also includes: inputting the mixed speech and the registered speech features into the speech extraction model to generate first mask information and second mask information based on the mixed speech and the registered speech features through the speech extraction model; shielding the interference information in the mixed speech according to the first mask information to obtain the preliminary recognition speech of the speaker; shielding the speech information of the speaker in the mixed speech according to the second mask information to obtain the interference speech.
  • the first mask information is information for shielding interference information in the mixed speech
  • the second mask information is information for shielding speech information of the speaker in the mixed speech.
  • the terminal may input the mixed speech and registered speech features into a pre-trained speech extraction model to generate first mask information and second mask information corresponding to the input mixed speech and registered speech features based on the mixed speech and registered speech features through the speech extraction model.
  • the terminal can shield the interference information in the mixed speech according to the first mask information to generate the speaker's preliminary recognized speech, and shield the speaker's speech information in the mixed speech according to the second mask information to generate interference speech that interferes with the speaker's speech information.
  • the terminal may input the mixed speech and the registered speech features into the speech extraction model to generate first mask information and second mask information corresponding to the mixed speech and the registered speech features based on the trained model parameters through the speech extraction model.
  • the first mask information includes a first masking parameter. It can be understood that since the first mask information is used to shield the interference information in the mixed speech, the first mask information includes the first masking parameter to achieve shielding of the interference information in the mixed speech.
  • the terminal can multiply the first masking parameter with the mixed speech amplitude spectrum of the mixed speech to obtain the object speech amplitude spectrum corresponding to the speaker's voice information, and generate the speaker's preliminary recognition speech based on the object speech amplitude spectrum.
  • the mixed speech amplitude spectrum is the amplitude spectrum of the mixed speech.
  • the object speech amplitude spectrum is the amplitude spectrum of the speaker's voice information.
  • the second mask information includes a second masking parameter. It can be understood that since the second mask information is used to mask the voice information of the speaker in the mixed voice, the second mask information includes the second masking parameter to achieve the masking of the voice information of the speaker in the mixed voice.
  • the terminal can multiply the second masking parameter with the mixed voice amplitude spectrum of the mixed voice to obtain the interference amplitude spectrum corresponding to the interference information in the mixed voice, and generate interference voice that interferes with the voice information of the speaker based on the interference amplitude spectrum.
  • the interference amplitude spectrum is the amplitude spectrum of the interference information in the mixed voice.
  • the first mask information and the second mask information corresponding to the mixed speech and the registered speech features can be generated by the speech extraction model based on the mixed speech and the registered speech features, and then the interference information in the mixed speech can be shielded according to the first mask information, and the preliminary recognition speech of the speaker can be obtained, thereby further improving the extraction accuracy of the preliminary recognition speech. And, according to the second mask information, the speech information of the speaker in the mixed speech can be shielded, and the interference speech can be obtained, thereby improving the extraction accuracy of the interference speech.
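  • The mask application described above (multiplying each masking parameter with the mixed speech amplitude spectrum) can be sketched as follows; the masks are assumed to be arrays of values in [0, 1] predicted by the speech extraction model.

```python
# Minimal sketch: apply the first and second masking parameters to the mixed
# speech amplitude spectrum to separate speaker and interference spectra.
import numpy as np

def apply_masks(mixed_amplitude: np.ndarray,
                first_mask: np.ndarray,
                second_mask: np.ndarray):
    object_amplitude = first_mask * mixed_amplitude          # speaker's speech amplitude spectrum
    interference_amplitude = second_mask * mixed_amplitude   # interference amplitude spectrum
    return object_amplitude, interference_amplitude
```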
  • the pre-trained model parameters in the speech extraction model include a first mask mapping parameter and a second mask mapping parameter; the mixed speech and registered speech features are input into the speech extraction model to generate the first mask information and the second mask information based on the mixed speech and registered speech features through the speech extraction model, including: the mixed speech and registered speech features are input into the speech extraction model to generate the corresponding first mask information through the first mask mapping parameter mapping, and generate the corresponding second mask information through the second mask mapping parameter mapping.
  • the terminal may generate first mask information based on the first mask mapping parameter of the speech extraction model, the mixed speech and the registered speech features.
  • the terminal may generate second mask information based on the second mask mapping parameter of the speech extraction model, the mixed speech and the registered speech features.
  • a mask mapping parameter is a parameter used to map speech features into mask information.
  • the first mask mapping parameters can be used to map and generate mask information for shielding the interference information in the mixed speech, namely the first mask information.
  • the second mask mapping parameters can be used to map and generate mask information for shielding the speech information of the speaker in the mixed speech, namely the second mask information.
  • the terminal can input the mixed speech and registered speech features into the speech extraction model to map and generate first mask information corresponding to the input mixed speech and registered speech features through the first mask mapping parameters in the speech extraction model, and map and generate second mask information corresponding to the input mixed speech and registered speech features through the second mask mapping parameters in the speech extraction model.
  • the first mask information and the second mask information are generated based on the mixed speech and registered speech features input to the speech extraction model, and the first mask mapping parameters and the second mask mapping parameters pre-trained in the speech extraction model, the first mask information and the second mask information can be dynamically changed with different inputs. In this way, the accuracy of the first mask information and the second mask information can be improved, thereby further improving the extraction accuracy of the preliminary recognition speech and the interference speech.
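  • One way to read the two sets of mask mapping parameters is as two prediction heads with independent weights; the following PyTorch sketch is an illustration under that assumption, with assumed layer shapes, not the patent's model.

```python
# Minimal sketch (PyTorch): two mask mapping heads produce the first and second
# mask information from the same fused mixed/registered features.
import torch
import torch.nn as nn

class MaskHeads(nn.Module):
    def __init__(self, feat_dim: int = 384, freq_bins: int = 257):
        super().__init__()
        self.first_mask_map = nn.Linear(feat_dim, freq_bins)    # "first mask mapping parameters"
        self.second_mask_map = nn.Linear(feat_dim, freq_bins)   # "second mask mapping parameters"

    def forward(self, fused_feat: torch.Tensor):
        # fused_feat: (batch, time, feat_dim); sigmoid keeps mask values in [0, 1]
        first_mask = torch.sigmoid(self.first_mask_map(fused_feat))
        second_mask = torch.sigmoid(self.second_mask_map(fused_feat))
        return first_mask, second_mask
```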
  • the mixed speech features of the mixed speech, the speech features of the preliminary recognized speech, and the speech features of the interfering speech are extracted by the feature extraction layer after the mixed speech, the preliminary recognized speech, and the interfering speech are respectively input into the feature extraction layer in the secondary processing model.
  • the first attention feature is obtained by the first attention unit in the secondary processing model by fusing the mixed speech feature and the speech feature of the preliminary recognized speech through an attention mechanism.
  • the second attention feature is obtained by the second attention unit in the secondary processing model by fusing the mixed speech feature and the speech feature of the interfering speech through an attention mechanism.
  • the terminal inputs the mixed speech, the preliminary recognized speech output by the primary speech extraction model and the interference speech into the feature extraction layer in the secondary processing model for feature extraction, and obtains the mixed speech features of the mixed speech, the speech features of the preliminary recognized speech and the speech features of the interference speech.
  • the terminal can input the speech features and mixed speech features of the preliminary recognized speech into the first attention unit in the secondary processing model to fuse the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech based on the attention mechanism to obtain the first attention features.
  • the terminal can input the speech features and mixed speech features of the interfering speech into the second attention unit in the secondary processing model to fuse the mixed speech features of the mixed speech and the speech features of the interfering speech based on the attention mechanism to obtain the second attention features.
  • the model for voice extraction of mixed speech includes a primary voice extraction model and a secondary processing model.
  • the primary voice extraction model is used to extract preliminary recognition voice and interference voice from the mixed speech.
  • the secondary processing model is used to perform advanced voice extraction on the mixed speech based on the preliminary recognition voice and interference voice to obtain the speaker's clean voice.
  • the secondary processing model includes a feature extraction layer, a first attention unit, and a second attention unit.
  • the terminal may input the mixed speech, the preliminary recognition speech output by the primary speech extraction model, and the interference speech into the feature extraction layer in the secondary processing model respectively, so as to extract features of the mixed speech, the preliminary recognition speech, and the interference speech respectively through the feature extraction layer, and obtain the mixed speech features of the mixed speech, the speech features of the preliminary recognition speech, and the speech features of the interference speech.
  • the terminal may input the speech features of the preliminary recognition speech and the mixed speech features into the first attention unit in the secondary processing model, so as to fuse the mixed speech features of the mixed speech and the speech features of the preliminary recognition speech through the first attention unit based on the attention mechanism, and obtain the first attention features.
  • the terminal may input the speech features of the interference speech and the mixed speech features into the second attention unit in the secondary processing model, so as to fuse the mixed speech features of the mixed speech and the speech features of the interference speech through the second attention unit based on the attention mechanism, and obtain the second attention features.
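As an illustrative sketch of the two attention units described above (not necessarily the disclosed architecture; the tensor shapes, layer sizes and the use of PyTorch's MultiheadAttention are assumptions), the mixed-speech features can attend to the features of the preliminary recognition speech and of the interference speech respectively:

```python
import torch
import torch.nn as nn

class AttentionFusionUnit(nn.Module):
    """One attention unit: fuses mixed-speech features with one branch feature."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mixed_feat: torch.Tensor, branch_feat: torch.Tensor) -> torch.Tensor:
        # mixed_feat, branch_feat: (batch, time, dim).
        # The mixed-speech features act as the query; the branch features
        # (preliminary recognition speech or interference speech) act as key/value.
        fused, _ = self.attn(query=mixed_feat, key=branch_feat, value=branch_feat)
        return fused

# first_attention  = AttentionFusionUnit()(mixed_feat, prelim_feat)        # first attention feature
# second_attention = AttentionFusionUnit()(mixed_feat, interference_feat)  # second attention feature
```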
  • the primary speech extraction model is used to extract the preliminary recognition speech and the interference speech
  • the secondary processing model is used to refer to the preliminary recognition speech and the interference speech to perform advanced speech extraction on the mixed speech, which can further improve the accuracy of speech extraction.
  • the preliminary recognition speech and the interference speech are extracted from the mixed speech by a primary speech extraction model
  • the secondary processing model also includes a feature fusion layer and a secondary speech extraction model; fusing the mixed speech features, the first attention features and the second attention features, and obtaining the clean speech of the speaker based on the fused features, includes: inputting the mixed speech features, the first attention features, the second attention features and the registered speech features into the feature fusion layer for fusion to obtain speech fusion features; and inputting the speech fusion features into the secondary speech extraction model to obtain the clean speech of the speaker based on the speech fusion features through the secondary speech extraction model.
  • the speech extraction model that extracts the preliminary recognition speech and the interference speech is a primary speech extraction model.
  • the secondary processing model also includes a feature fusion layer and a secondary speech extraction model.
  • the terminal can input the mixed speech feature, the first attention feature, the second attention feature and the registered speech feature into the feature fusion layer for fusion to obtain the speech fusion feature; and input the speech fusion feature into the secondary speech extraction model to obtain the speaker's clean speech based on the speech fusion feature through the secondary speech extraction model.
  • the secondary processing model also includes a feature fusion layer and a secondary speech extraction model.
  • the terminal can input the mixed speech feature, the first attention feature, the second attention feature and the registered speech feature into the feature fusion layer in the secondary processing model, so as to fuse the mixed speech feature, the first attention feature, the second attention feature and the registered speech feature through the feature fusion layer to obtain a speech fusion feature.
  • the terminal can input the speech fusion feature into the secondary speech extraction model in the secondary processing model, so as to obtain the clean speech of the speaker based on the speech fusion feature through the secondary speech extraction model.
  • the terminal may input the speech fusion features into a secondary speech extraction model in the secondary processing model to perform feature extraction on the speech fusion features through the secondary speech extraction model, and generate the speaker's clean speech based on the extracted features.
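A minimal sketch of the feature fusion layer followed by a secondary speech-extraction head, under assumed layer sizes (the concatenation-plus-projection fusion and the LSTM head are illustrative choices, not necessarily the disclosed structure):

```python
import torch
import torch.nn as nn

class SecondaryExtractor(nn.Module):
    """Fuses the four feature streams and maps them to a per-frame output."""
    def __init__(self, dim: int = 256, out_bins: int = 257):
        super().__init__()
        self.fusion = nn.Linear(4 * dim, dim)      # feature fusion layer
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, out_bins)       # e.g. per-frame amplitude spectrum or mask

    def forward(self, mixed, attn1, attn2, registered):
        # All inputs: (batch, time, dim); a per-utterance registered feature
        # vector would first be repeated along the time axis to match.
        fused = torch.cat([mixed, attn1, attn2, registered], dim=-1)   # speech fusion feature
        fused = torch.relu(self.fusion(fused))
        seq, _ = self.lstm(fused)
        return self.head(seq)
```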
  • the model for extracting speech from mixed speech includes a primary speech extraction model and a secondary processing model.
  • the secondary processing model includes a first feature extraction layer, a second feature extraction layer, a third feature extraction layer, a first attention unit, a second attention unit, a feature fusion layer, and a secondary speech extraction model.
  • the terminal can input the mixed speech and the registered speech features into the primary speech extraction model to obtain preliminary recognition speech and interference speech based on the mixed speech and the registered speech features through the speech extraction model.
  • the terminal can input the mixed speech, preliminary recognition speech and interference speech into the first feature extraction layer, the second feature extraction layer and the third feature extraction layer in the secondary processing model, respectively, to extract features of the mixed speech, the preliminary recognition speech and the interference speech, respectively, to obtain the mixed speech features of the mixed speech, the speech features of the preliminary recognition speech and the speech features of the interference speech.
  • the terminal can input the speech features and mixed speech features of the preliminary recognized speech into the first attention unit in the secondary processing model, so as to fuse the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech based on the attention mechanism through the first attention unit to obtain the first attention features.
  • the terminal can input the speech features and mixed speech features of the interfering speech into the second attention unit in the secondary processing model, so as to fuse the mixed speech features of the mixed speech and the speech features of the interfering speech based on the attention mechanism through the second attention unit to obtain the second attention features.
  • the terminal may input the mixed speech feature, the first attention feature, the second attention feature, and the registered speech feature into the feature fusion layer in the secondary processing model, so as to fuse the mixed speech feature, the first attention feature, the second attention feature, and the registered speech feature through the feature fusion layer to obtain a speech fusion feature. Furthermore, the terminal may input the speech fusion feature into the secondary speech extraction model in the secondary processing model, so as to obtain the clean speech of the speaker based on the speech fusion feature through the secondary speech extraction model.
  • the primary speech extraction model includes a Fourier transform unit, an encoder, a long short-term memory unit, a first inverse Fourier transform unit, and a second inverse Fourier transform unit. It can be understood that the terminal can extract the mixed speech amplitude spectrum of the mixed speech through the Fourier transform unit in the primary speech extraction model.
  • the terminal can extract the mixed speech amplitude spectrum through the encoder in the primary speech extraction model to obtain the amplitude spectrum feature.
  • the terminal can generate the first mask mapping parameter and the second mask mapping parameter based on the amplitude spectrum feature through the long short-term memory unit in the primary speech extraction model.
  • the terminal may multiply the first mask mapping parameter by the mixed speech amplitude spectrum of the mixed speech to obtain the object speech amplitude spectrum corresponding to the speaker's speech information.
  • the terminal may transform the object speech amplitude spectrum according to the phase spectrum of the mixed speech through the first inverse Fourier transform unit in the primary speech extraction model to obtain the speaker's preliminary recognized speech.
  • the terminal can multiply the second mask mapping parameter with the mixed speech amplitude spectrum of the mixed speech to obtain the interference amplitude spectrum corresponding to the interference information in the mixed speech.
  • the terminal can transform the interference amplitude spectrum according to the phase spectrum of the mixed speech through the second inverse Fourier transform unit in the primary speech extraction model to obtain the interference speech.
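A minimal sketch of the masking and reconstruction steps described above, assuming librosa and illustrative STFT parameters (the mask values themselves would come from the long short-term memory unit; they are not computed here):

```python
import numpy as np
import librosa

def apply_mask_and_reconstruct(mixed: np.ndarray, mask: np.ndarray,
                               n_fft: int = 512, hop: int = 160) -> np.ndarray:
    stft = librosa.stft(mixed, n_fft=n_fft, hop_length=hop)     # Fourier transform of the mixed speech
    amplitude, phase = np.abs(stft), np.angle(stft)             # amplitude spectrum / phase spectrum
    target_amplitude = mask * amplitude                         # mask x mixed amplitude spectrum
    target_stft = target_amplitude * np.exp(1j * phase)         # reuse the phase of the mixed speech
    return librosa.istft(target_stft, hop_length=hop)           # inverse transform back to a waveform

# The same function can be called with the first mask to obtain the preliminary
# recognized speech and with the second mask to obtain the interference speech;
# `mask` must have the same shape as the STFT matrix.
```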
  • the secondary speech extraction model is used to determine the speaker's clean speech based on the more accurate speech fusion feature, so the speech extraction accuracy can be further improved.
  • obtaining a registered voice of a speaker and obtaining a mixed voice include: obtaining an initial mixed voice and an initial registered voice of the speaker; the initial mixed voice includes the voice information of the speaker; performing noise reduction processing on the initial mixed voice and the initial registered voice respectively to obtain the mixed voice and the registered voice of the speaker.
  • the initial mixed speech is the mixed speech that has not been subjected to noise reduction processing
  • the initial registered speech is the registered speech that has not been subjected to noise reduction processing.
  • the terminal may respectively obtain the initial mixed voice and the initial registered voice of the speaker, wherein the initial mixed voice includes the voice information of the speaker. It is understandable that the initial mixed voice and the initial registered voice contain noise, for example, at least one of large reverberation, high background noise and music noise.
  • the terminal may perform noise reduction processing on the initial mixed voice to obtain the mixed voice.
  • the terminal may perform noise reduction processing on the initial registered voice to obtain the registered voice of the speaker.
  • the mixed speech and the registered speech are obtained by performing noise reduction processing through a pre-trained noise reduction network.
  • the terminal can input the obtained initial mixed speech and the initial registered speech of the speaker into the noise reduction network respectively, so as to perform noise reduction processing on the initial mixed speech and the initial registered speech through the noise reduction network to obtain the mixed speech and the registered speech of the speaker.
  • the noise in the initial mixed speech and the initial registered speech can be removed to obtain noise-free mixed speech and registered speech, so that subsequent speech extraction based on the noise-free mixed speech and registered speech can further improve the accuracy of speech extraction.
  • the preliminary recognition speech is generated by a pre-trained speech processing model; the speech processing model includes a noise reduction network and a speech extraction network; the mixed speech and the registered speech are obtained by noise reduction processing through the noise reduction network.
  • the preliminary recognition speech of the speaker is extracted from the mixed speech, including: inputting the registered speech features of the registered speech into the speech extraction network, so as to perform preliminary recognition on the speech information of the speaker in the mixed speech through the speech extraction network, and obtain the preliminary recognition speech of the speaker.
  • the speech processing model includes a noise reduction network and a speech extraction network.
  • the terminal can input the acquired initial mixed speech and the initial registered speech of the speaker into the noise reduction network respectively, so as to perform noise reduction processing on the initial mixed speech and the initial registered speech through the noise reduction network, and obtain the mixed speech and the registered speech of the speaker. Furthermore, the terminal can input the mixed speech and the registered speech features of the registered speech into the speech extraction network, so as to perform preliminary recognition of the speech information of the speaker in the mixed speech through the speech extraction network, and obtain the preliminary recognized speech of the speaker.
  • the noise reduction network includes a Fourier transform unit, an encoder, a long short-term memory unit, a decoder, and an inverse Fourier transform unit.
  • the noisy speech includes an initial mixed speech and an initial registered speech.
  • Clean speech includes a mixed speech and a registered speech.
  • the terminal can input the noisy speech into the noise reduction network. The Fourier transform unit in the noise reduction network performs a Fourier transform on the noisy speech to obtain its amplitude spectrum and phase spectrum. The encoder in the noise reduction network then performs feature encoding on the amplitude spectrum of the noisy speech to obtain encoded features, the long short-term memory unit extracts features from the encoded features, and the decoder decodes the extracted features to obtain a decoded amplitude spectrum. Finally, the inverse Fourier transform unit in the noise reduction network performs an inverse Fourier transform on the decoded amplitude spectrum to obtain clean speech.
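A minimal sketch of that noise-reduction path, with assumed layer sizes and STFT settings (a simplified stand-in for the disclosed network, not its exact structure):

```python
import torch
import torch.nn as nn

class DenoiseNet(nn.Module):
    """Fourier transform -> encoder -> LSTM -> decoder -> inverse Fourier transform."""
    def __init__(self, n_fft: int = 512, hop: int = 160, hidden: int = 256):
        super().__init__()
        bins = n_fft // 2 + 1
        self.n_fft, self.hop = n_fft, hop
        self.encoder = nn.Linear(bins, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, bins)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, samples)
        window = torch.hann_window(self.n_fft, device=noisy.device)
        spec = torch.stft(noisy, self.n_fft, self.hop, window=window, return_complex=True)
        mag, phase = spec.abs(), torch.angle(spec)                 # amplitude / phase spectra
        x = torch.relu(self.encoder(mag.transpose(1, 2)))          # (batch, frames, bins) -> hidden
        x, _ = self.lstm(x)                                        # feature extraction over time
        denoised_mag = torch.relu(self.decoder(x)).transpose(1, 2)
        clean_spec = torch.polar(denoised_mag, phase)              # recombine with the noisy phase
        return torch.istft(clean_spec, self.n_fft, self.hop,
                           window=window, length=noisy.shape[-1])
```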
  • the noise reduction network in the speech processing model performs noise reduction processing on the initial mixed speech and the initial registered speech, so as to obtain the noise-free mixed speech and the registered speech, thereby improving the speech noise reduction effect. Then, the speech information of the speaker in the mixed speech is preliminarily recognized by the speech extraction network, so as to improve the extraction accuracy of the preliminarily recognized speech.
  • noise reduction is performed respectively through a trained noise reduction network.
  • the speech processing method also includes: obtaining sample noisy speech, which is obtained by adding noise to a reference clean speech used as a reference; inputting the sample noisy speech into the noise reduction network to be trained, so as to perform noise reduction on the sample noisy speech through the noise reduction network to obtain a predicted speech after noise reduction; and iteratively training the noise reduction network to be trained according to the difference between the predicted speech and the reference clean speech to obtain a trained noise reduction network.
  • the mixed speech and the registered speech are obtained by performing noise reduction processing on a pre-trained noise reduction network.
  • the terminal can obtain a sample noisy speech; the sample noisy speech is obtained by adding noise to a reference clean speech used as a reference.
  • the terminal can input the sample noisy speech into the noise reduction network to be trained, so as to perform noise reduction processing on the sample noisy speech through the noise reduction network to obtain a predicted speech after noise reduction.
  • the terminal iteratively trains the noise reduction network to be trained according to the difference between the predicted speech and the reference clean speech to obtain a pre-trained noise reduction network.
  • the sample noisy speech is a speech containing noise and used to train the noise reduction network.
  • the sample noisy speech is obtained by adding noise to the clean speech used as a reference.
  • the reference clean speech is a speech without noise and serves as a reference in training the noise reduction network.
  • the predicted speech is the speech predicted by the sample noisy speech after noise reduction during the noise reduction network training process.
  • the terminal may obtain a reference clean speech as a reference, and add noise to the reference clean speech to obtain a sample noisy speech. Furthermore, the terminal may input the sample noisy speech into the noise reduction network to be trained, so as to perform noise reduction processing on the sample noisy speech through the noise reduction network to obtain a noise-reduced predicted speech. The terminal may iteratively train the noise reduction network to be trained according to the difference between the predicted speech and the reference clean speech to obtain a pre-trained noise reduction network.
  • the terminal may determine a noise reduction loss value based on the difference between the predicted speech and the reference clean speech, and iteratively train the noise reduction network to be trained based on the noise reduction loss value, and obtain a pre-trained noise reduction network when the iteration stops.
  • the above noise reduction loss value can be calculated by a loss function in which:
  • X represents the predicted speech and the other operand represents the reference clean speech; the two are of the same type, which may be the speech itself, the speech signal, the energy value, or the probability distribution of the occurrence probability of each frequency in the frequency domain.
  • Loss SDR indicates the noise reduction loss value.
  • ‖·‖ indicates a norm function, which may be an L2 norm function.
  • the noise reduction capability of the noise reduction network can be improved by iteratively training the noise reduction network to be trained through the difference between the predicted speech and the clean speech.
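The exact loss expression is not reproduced in this text; an SDR-style objective of the general shape suggested by the description (an L2-norm ratio between the reference clean speech and the residual error) is sketched below purely as an assumption:

```python
import torch

def sdr_loss(predicted: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # predicted, reference: (batch, samples) waveforms of the same length.
    error = predicted - reference                                   # residual after noise reduction
    ratio = reference.pow(2).sum(-1) / (error.pow(2).sum(-1) + eps)
    return (-10.0 * torch.log10(ratio + eps)).mean()                # lower loss = higher SDR
```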
  • the preliminary recognition speech is extracted by a pre-trained speech extraction network.
  • the method also includes: obtaining sample data; the sample data includes sample mixed speech and sample registered speech features of a sample speaker; the sample mixed speech is obtained by adding noise to the sample clean speech of the sample speaker; the sample data is input into the speech extraction network to be trained, so as to recognize the sample speech information of the sample speaker in the sample mixed speech according to the sample registered speech features through the speech extraction network, and obtain the predicted clean speech of the sample speaker; according to the difference between the predicted clean speech and the sample clean speech, the speech extraction network to be trained is iteratively trained to obtain the pre-trained speech extraction network.
  • the sample data is the data used to train the speech extraction network.
  • the sample mixed speech is the mixed speech used to train the speech extraction network.
  • the sample speaker is the speaker involved in the process of training the speech extraction network.
  • the sample registered voice features are the registered voice features used to train the voice extraction network.
  • the sample clean voice is the voice that contains only the voice information of the sample speaker and serves as a reference in the training of the voice extraction network.
  • the predicted clean voice is the voice of the sample speaker extracted from the sample mixed voice during the training of the voice extraction network.
  • the terminal can obtain the sample clean speech of the sample speaker, and add noise to the sample clean speech of the sample speaker to obtain the sample mixed speech.
  • the terminal can obtain the sample registered speech of the sample speaker, and perform feature extraction on the sample registered speech to obtain the sample registered speech feature of the sample speaker.
  • the terminal can use the sample mixed speech and the sample registered speech feature of the sample speaker as sample data.
  • the terminal can input the sample data into the speech extraction network to be trained, so as to identify the sample speech information of the sample speaker in the sample mixed speech according to the sample registered speech feature through the speech extraction network, and obtain the predicted clean speech of the sample speaker, and iteratively train the speech extraction network to be trained according to the difference between the predicted clean speech and the sample clean speech to obtain the pre-trained speech extraction network.
  • the terminal may determine an extraction loss value based on the difference between the predicted clean speech and the sample clean speech, and iteratively train the speech extraction network to be trained based on the extraction loss value, and obtain a pre-trained speech extraction network when the iteration stops.
  • the above extraction loss value can be calculated by a loss function in which:
  • i represents the i-th sample among the N sample mixed speeches.
  • Xi represents the sample clean speech corresponding to the i-th sample mixed speech, which can be that speech itself, its speech signal, its energy value, or the probability distribution of the occurrence probability of each of its frequencies in the frequency domain.
  • Yi represents the predicted clean speech, which can be the predicted clean speech itself, the speech signal of the predicted clean speech, the energy value of the predicted clean speech, or the probability distribution of the occurrence probability of each frequency of the predicted clean speech in the frequency domain.
  • Loss MAE represents the extraction loss value.
  • the speech extraction network to be trained is iteratively trained, so as to improve the speech extraction accuracy of the speech extraction network.
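A minimal sketch of an MAE-style extraction loss averaged over N training samples, consistent with the description above (the exact normalisation used in the disclosure is an assumption):

```python
import torch

def mae_extraction_loss(predicted_clean: torch.Tensor, sample_clean: torch.Tensor) -> torch.Tensor:
    # predicted_clean, sample_clean: (N, samples), predicted vs. reference clean speech.
    return (predicted_clean - sample_clean).abs().mean()
```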
  • the above-mentioned speech processing model also includes a registration network.
  • the registration speech feature is obtained by extraction through the registration network.
  • the registration network includes a Mel frequency spectrum generation unit, a long short-term memory unit and a feature generation unit.
  • the terminal can extract the frequency spectrum of the registration speech through the Mel frequency spectrum generation unit in the registration network, and generate the Mel frequency spectrum of the registration speech based on the frequency spectrum.
  • the terminal can perform feature extraction on the Mel frequency spectrum through the long short-term memory unit in the registration network to obtain multiple feature vectors.
  • the terminal can average the above-mentioned multiple feature vectors in the time dimension through the feature generation unit in the registration network to obtain the registration speech feature of the registration speech.
  • the frequency spectrum of the registration speech is extracted to convert the registration speech signal in the time domain into a signal in the frequency domain. Then, the Mel frequency spectrum of the registration speech is generated according to the frequency spectrum, and the Mel frequency spectrum is subjected to feature extraction, so as to improve the accuracy of the registration speech feature extraction.
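A minimal sketch of the registration path described above (Mel spectrum, LSTM feature extraction, then averaging over the time dimension); the frame parameters and layer sizes are illustrative, and the LSTM here is untrained, standing in for the trained registration network:

```python
import torch
import torch.nn as nn
import librosa

def registered_feature(wav, sr=16000, n_mels=80, hidden=256):
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)   # Mel frequency spectrum
    log_mel = librosa.power_to_db(mel)                                  # (n_mels, frames)
    x = torch.tensor(log_mel.T, dtype=torch.float32).unsqueeze(0)       # (1, frames, n_mels)
    lstm = nn.LSTM(n_mels, hidden, batch_first=True)                    # feature extraction unit
    seq, _ = lstm(x)                                                    # per-frame feature vectors
    return seq.mean(dim=1)                                              # average over the time dimension
```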
  • obtaining the registered voice of a speaker and obtaining a mixed voice includes: in response to a call trigger operation, determining the speaker specified by the call trigger operation, and determining the registered voice of the speaker from pre-stored candidate registered voices; and when a voice call is established with a terminal corresponding to the speaker based on the call trigger operation, receiving the mixed voice sent by the terminal corresponding to the speaker in the voice call.
  • the terminal determines the speaker's registered voice from pre-stored candidate registered voices in response to a call trigger operation for the speaker.
  • the terminal establishes a voice call with the terminal corresponding to the speaker based on the call trigger operation, the terminal receives the mixed voice sent by the terminal corresponding to the speaker in the voice call.
  • the user can initiate a call request to the speaker based on the terminal. That is, the terminal can respond to the user's call trigger operation on the speaker and search for the speaker's registered voice from the pre-stored candidate registered voices. At the same time, the terminal can generate a call request for the speaker in response to the call trigger operation and send the call request to the terminal corresponding to the speaker. When a voice call is established with the terminal corresponding to the speaker based on the call request, the terminal can receive the mixed voice sent by the terminal corresponding to the speaker in the voice call.
  • the terminal can, based on the registered voice features of the registered voice, preliminarily identify the voice information of the speaker in the received mixed voice to obtain the speaker's preliminary recognized voice, determine the voice similarity between the registered voice and the voice information in the preliminary recognized voice based on the registered voice features, determine the voice information whose voice similarity is less than the preset similarity from the preliminary recognized voice, obtain the voice information to be filtered, and filter the voice information to be filtered in the preliminary recognized voice to obtain the speaker's clean voice.
  • the speaker's registered voice can be determined from the pre-stored candidate registered voices.
  • the speaker's voice can be extracted in the call scenario, thereby improving the call quality.
  • obtaining a registered voice of a speaker and obtaining a mixed voice include: obtaining a multimedia voice of a multimedia object; the multimedia voice is a mixed voice including voice information of multiple speakers; in response to a designated operation for a speaker in the multimedia voice, obtaining an identifier of the designated speaker; the speaker is a speaker designated from multiple sound-generating objects whose voice needs to be extracted; and obtaining a registered voice having a mapping relationship with the identifier of the speaker from the pre-stored registered voices for each speaker in the multimedia voice to obtain the registered voice of the speaker.
  • the terminal can obtain multimedia voice of the multimedia object, and the multimedia voice is a mixed voice including voice information of multiple sound objects.
  • the terminal can obtain the identifier of the designated speaker in response to the designated operation for the speaker in the multimedia voice, and the speaker is the sound object designated to extract the voice from the multiple sound objects.
  • the terminal can obtain the registered voice with a mapping relationship with the identifier of the speaker from the registered voices pre-stored for each sound object in the multimedia voice, and obtain the registered voice of the speaker.
  • the multiple sound objects can be multiple speakers, and the designated speaker can be called the target speaker.
  • a multimedia object is a multimedia file, including a video object and an audio object.
  • a multimedia voice is a voice in a multimedia object.
  • An identifier is a string used to uniquely identify the speaker.
  • the terminal can extract multimedia voice from the multimedia object. It can be understood that the multimedia voice is a mixed voice including voice information of multiple speakers.
  • the terminal can obtain the identifier of the designated speaker in response to the designated operation for the speaker in the multimedia voice. It can be understood that the speaker is the speaker designated to extract the voice from the multiple speakers.
  • the terminal can find the registered voice with a mapping relationship with the identifier from the registered voices pre-stored for each speaker in the multimedia voice as the registered voice of the designated speaker.
  • the terminal can extract the speaker's clean voice from the multimedia voice. Specifically, the terminal can perform preliminary recognition of the speaker's voice information in the multimedia voice based on the registered voice features of the speaker's registered voice to obtain the speaker's preliminary recognized voice, determine the voice similarity between the registered voice and the voice information in the preliminary recognized voice based on the registered voice features, determine the voice information whose voice similarity is less than the preset similarity from the preliminary recognized voice, obtain the voice information to be filtered, and filter the voice information to be filtered in the preliminary recognized voice to obtain the speaker's clean voice.
  • the identifier of the designated speaker can be acquired. Then, from the registered voices pre-stored for each speaker in the multimedia voice, the registered voice with a mapping relationship with the speaker identifier can be acquired, and the registered voice of the speaker can be obtained, so that the voice of the speaker of interest to the user can be extracted from the multimedia object.
  • the speaker for extracting clean voice can be quickly designated and the clean voice can be extracted, avoiding the consumption of extra resources due to the inability to hear clearly in a noisy multi-voice environment.
  • the speech processing method of the present application can be applied to speech extraction scenarios in film and television videos or voice calls.
  • the terminal can obtain the film and television speech of the film and television video, which is a mixed speech including the speech information of multiple speakers.
  • the terminal can obtain the identifier of the designated target speaker in response to the designation operation for a speaker in the film and television speech; the target speaker is the speaker designated, from the multiple speakers, as the one whose speech is to be extracted.
  • the terminal can obtain, from the registered speech pre-stored for each speaker in the film and television speech, the registered speech that has a mapping relationship with the identifier of the target speaker, so as to obtain the registered speech of the target speaker.
  • the clean speech of the target speaker is then extracted from the film and television speech based on the registered speech.
  • the terminal can determine the registered voice of the target speaker from the pre-stored candidate registered voices in response to the call trigger operation for the target speaker, and when a voice call is established with the terminal corresponding to the target speaker based on the call trigger operation, receive the mixed voice sent by the terminal corresponding to the target speaker in the voice call.
  • the clean voice of the target speaker is extracted from the mixed voice obtained during the voice call based on the registered voice.
  • clean speech is generated by a speech processing model and a filtering processing unit, wherein the speech processing model includes a noise reduction network, a registration network, and a speech extraction network.
  • the terminal can perform noise reduction on the initial mixed speech and the initial registered speech respectively through the noise reduction network in the speech processing model to obtain the mixed speech after noise reduction and the registered speech after noise reduction.
  • the terminal can perform feature encoding on the registered speech after noise reduction through the registration network in the speech processing model to obtain the registered speech feature.
  • the terminal can extract the preliminary recognized speech from the mixed speech after noise reduction through the speech extraction network in the speech processing model according to the registered speech feature.
  • the terminal uses the filtering processing unit to filter the preliminary recognized speech based on the registered speech feature to obtain the clean speech of the speaker.
  • the terminal uses a filtering processing unit to filter the preliminary recognized voice based on the registered voice features to obtain the speaker's clean voice.
  • the specific implementation is as follows: for each voice segment in the preliminary recognized voice, the terminal can extract features of the voice segment through the above-mentioned registration network to obtain the segment voice features of the voice segment. Then, the terminal can determine the voice similarity between the registered voice and the voice segment based on the segment voice features and the registered voice features. The terminal can store voice segments whose similarity is greater than or equal to a preset voice similarity threshold, and mute voice segments whose similarity is less than a preset voice similarity threshold. Then, the terminal can generate the speaker's clean voice based on the retained voice segments.
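A minimal sketch of that filtering step: each segment of the preliminary recognized voice is embedded, compared with the registered voice feature by cosine similarity, and segments below the threshold are muted. The embedding function and the threshold value are assumptions standing in for the registration network and the preset similarity:

```python
import numpy as np

def filter_segments(segments, registered_feat, embed_fn, threshold=0.7):
    # segments: list of waveform arrays; registered_feat: 1-D registered voice feature.
    kept = []
    for seg in segments:
        seg_feat = embed_fn(seg)                                   # segment voice feature
        sim = np.dot(seg_feat, registered_feat) / (
            np.linalg.norm(seg_feat) * np.linalg.norm(registered_feat) + 1e-8)
        # keep segments similar enough to the registered voice, mute the rest
        kept.append(seg if sim >= threshold else np.zeros_like(seg))
    return np.concatenate(kept)                                    # the speaker's clean voice
```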
  • a voice processing method is provided. This embodiment is described by taking the method applied to the terminal 102 in FIG. 1 as an example. The method specifically includes the following steps:
  • Step 1102 obtaining a mixed voice and a registered voice of a speaker; the mixed voice includes the voice information of the speaker; the voice information includes voice segments.
  • Step 1104 input the mixed speech and the registered speech features into the speech extraction model, and generate, based on the mixed speech and the registered speech features, at least the first mask information in the first processing mode, and the first mask information and the second mask information in the second processing mode. It can be understood that the first mask information and the second mask information can also both be generated in the first processing mode.
  • Step 1106 shield the interference information in the mixed speech according to the first mask information to obtain the preliminary recognized speech of the speaker.
  • Step 1108 mask the voice information of the speaker in the mixed voice according to the second mask information to obtain the interference voice.
  • Step 1110 in the first processing mode, for each speech segment in the preliminary recognized speech, the speech segment is repeatedly processed to obtain a reconstructed speech with the same time length as the registered speech; wherein the reconstructed speech includes multiple speech segments.
  • Step 1112 determining the segment speech feature corresponding to the speech segment according to the reconstructed speech feature of the reconstructed speech.
  • Step 1114 determining the speech similarity between the registered speech and the speech segment according to the segment speech feature and the registered speech feature.
  • Step 1116 determining voice information whose voice similarity is less than a preset similarity from the initially recognized voice, and obtaining voice information to be filtered.
  • Step 1118 filtering the to-be-filtered voice information in the preliminary recognized voice to obtain the clean voice of the speaker.
  • Step 1120 in the second processing mode, the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech are fused based on the attention mechanism to obtain the first attention features, and the mixed speech features and the speech features of the interfering speech are fused based on the attention mechanism to obtain the second attention features.
  • Step 1122 the mixed speech feature, the first attention feature, the second attention feature and the registered speech feature are fused, and the clean speech of the speaker is obtained based on the fused features.
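A high-level sketch tying these steps together; the callables are placeholders for the components described in the steps above, not disclosed APIs:

```python
def process_mixed_voice(mixed, registered_feat, extract, refine, filter_prelim, mode="first"):
    # `extract`, `refine` and `filter_prelim` stand in for the speech extraction
    # model, the secondary processing model and the similarity filtering step.
    prelim, interference = extract(mixed, registered_feat)          # steps 1104-1108
    if mode == "first":
        return filter_prelim(prelim, registered_feat)               # steps 1110-1118
    return refine(mixed, prelim, interference, registered_feat)     # steps 1120-1122
```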
  • the present application also provides an application scenario, which applies the above-mentioned speech processing method.
  • the speech processing method can be applied to the scenario of speech extraction in film and television videos.
  • the film and television video includes film and television voice (i.e. mixed voice), and the film and television voice includes voice information of multiple actors (i.e. speakers).
  • the terminal can obtain the initial film and television voice and the initial registered voice of the target actor; the initial film and television voice includes the voice information of the target actor; the voice information includes a voice segment.
  • the mixed voice and the registered voice features are input into the speech extraction model to generate the first mask information and the second mask information based on the mixed voice and the registered voice features through the speech extraction model.
  • according to the first mask information, the interference information in the mixed voice is shielded to obtain the preliminary recognized voice of the target actor; according to the second mask information, the voice information of the target actor in the mixed voice is shielded to obtain the interference voice.
  • the terminal can repeatedly process the voice segment to obtain a reconstructed voice with the same time length as the registered voice; wherein the reconstructed voice includes multiple voice segments.
  • the segment voice features corresponding to the voice segment are determined according to the reconstructed voice features of the reconstructed voice.
  • the voice similarity between the registered voice and the voice segment is determined according to the segment voice features and the registered voice features.
  • the voice information whose voice similarity is less than the preset similarity is determined from the preliminary recognized voice to obtain the voice information to be filtered.
  • the voice information to be filtered in the preliminary recognized voice is filtered to obtain the clean voice of the target actor.
  • the terminal can fuse the mixed speech features of the mixed speech and the speech features of the preliminary recognized speech based on the attention mechanism to obtain the first attention features, and fuse the mixed speech features and the speech features of the interfering speech based on the attention mechanism to obtain the second attention features.
  • the mixed speech features, the first attention features, the second attention features and the registered speech features are fused, and the clean speech of the target actor is obtained based on the fused features.
  • the present application also provides an application scenario, which applies the above-mentioned voice processing method.
  • the voice processing method can be applied to the scenario of voice extraction in voice calls.
  • the terminal can determine the registered voice of the target caller from the pre-stored candidate registered voices in response to a call trigger operation for the target caller (ie, the speaker).
  • when a voice call is established with the terminal corresponding to the target caller based on the call trigger operation, the terminal can receive the call voice (i.e., the mixed voice) sent by the terminal corresponding to the target caller in the voice call.
  • the voice of the target caller can be extracted from the call voice to improve the call quality.
  • the present application also provides an application scenario, which applies the above-mentioned speech processing method.
  • the speech processing method can be applied to the scenario of obtaining training data before training the neural network model.
  • training the neural network model requires a large amount of training data.
  • the speech processing method of the present application can extract clean speech of interest from complex mixed speech as training data. Through the speech processing method of the present application, large amounts of training data can be quickly acquired, which saves labor costs compared to traditional manual extraction methods.
  • although the steps in the flow charts of the above-mentioned embodiments are shown in sequence, these steps are not necessarily performed in that sequence. Unless expressly stated herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in the above-mentioned embodiments may include multiple sub-steps or stages; these sub-steps or stages are not necessarily performed at the same time, but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with at least some of the other steps, or with sub-steps or stages of other steps.
  • a speech processing device 1200 is provided.
  • the device may adopt a software module or a hardware module, or a combination of the two to form a part of a computer device.
  • the device specifically includes:
  • the acquisition module 1202 is used to acquire the registered voice of the speaker and acquire the mixed voice, where the mixed voice includes voice information of multiple sound-generating objects, and the multiple sound-generating objects include the speaker.
  • the first extraction module 1204 is used to determine the registered speech features of the registered speech, and extract the speaker's preliminary recognized speech from the mixed speech based on the registered speech features.
  • the determination module 1206 is used to determine the voice similarity between the registered voice and the voice information included in the preliminary recognition voice according to the registered voice feature.
  • the filtering module 1208 is used to filter out the voice information with a voice similarity less than a preset similarity from the preliminary recognized voice, so as to obtain the clean voice of the speaker.
  • the first extraction module 1204 is also used to determine the mixed speech features of the mixed speech; fuse the mixed speech features with the registered speech features of the registered speech to obtain speech fusion features; and based on the speech fusion features, preliminarily recognize the speech information of the speaker in the mixed speech to obtain the speaker's preliminarily recognized speech.
  • the mixed speech feature includes a mixed speech feature matrix
  • the speech fusion feature includes a speech fusion feature matrix
  • the first extraction module 1204 is further used to repeat the registered speech feature vector in the time dimension to generate a registered speech feature matrix, the time dimension of the registered speech feature matrix is the same as the time dimension of the mixed speech feature matrix; and to concatenate the mixed speech feature matrix and the registered speech feature matrix to obtain a speech fusion feature matrix.
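A minimal numpy sketch of that fusion: the registered speech feature vector is repeated along the time dimension to match the mixed speech feature matrix, and the two are concatenated along the feature axis (shapes are assumptions):

```python
import numpy as np

def fuse_features(mixed_feat: np.ndarray, registered_vec: np.ndarray) -> np.ndarray:
    # mixed_feat: (frames, dim_m) mixed speech feature matrix; registered_vec: (dim_r,)
    registered_mat = np.tile(registered_vec, (mixed_feat.shape[0], 1))   # (frames, dim_r)
    return np.concatenate([mixed_feat, registered_mat], axis=1)          # speech fusion feature matrix
```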
  • the first extraction module 1204 is further used to extract the amplitude spectrum of the mixed speech to obtain a first amplitude spectrum; perform feature extraction on the first amplitude spectrum to obtain amplitude spectrum features; and perform feature extraction on the amplitude spectrum features to obtain mixed speech features of the mixed speech.
  • the first extraction module 1204 is also used to perform preliminary recognition of the speech information of the speaker in the mixed speech based on the speech fusion features to obtain the speech features of the speaker; perform feature decoding on the speech features of the speaker to obtain a second amplitude spectrum; and transform the second amplitude spectrum according to the phase spectrum of the mixed speech to obtain the preliminary recognized speech of the speaker.
  • the first extraction module 1204 is further used to extract the frequency spectrum of the registration speech; generate the Mel frequency spectrum of the registration speech according to the frequency spectrum; and perform feature extraction on the Mel frequency spectrum to obtain the registration speech features of the registration speech.
  • the speech information in the preliminary recognized speech includes speech segments; the determination module 1206 is also used to repeat each speech segment in the preliminary recognized speech according to the time length of the registered speech to obtain a reconstructed speech of the time length; obtain reconstructed speech features extracted from the reconstructed speech, and determine the segment speech features corresponding to each speech segment in the preliminary recognized speech according to the reconstructed speech features; and determine the speech similarity between the registered speech and each speech segment according to the segment speech features corresponding to each speech segment and the registered speech features.
  • the device 1200 also includes a noise reduction network training module, which is used to obtain sample noisy speech, where the sample noisy speech is obtained by adding noise to the reference clean speech used as a reference; input the sample noisy speech into the noise reduction network to be trained, so as to perform noise reduction on the sample noisy speech through the noise reduction network to obtain the predicted speech after noise reduction; and iteratively train the noise reduction network to be trained according to the difference between the predicted speech and the reference clean speech to obtain a trained noise reduction network.
  • the determination module 1206 is further configured to determine, in the first processing mode, the voice similarity between the registered voice and the voice information included in the preliminary recognized voice according to the registered voice feature.
  • the filtering module 1208 is further configured to filter out speech information with speech similarity less than a preset similarity from the preliminary recognized speech in the first processing mode, so as to obtain the clean speech of the speaker.
  • the device 1200 further includes a primary speech extraction model for obtaining interference speech in the second processing mode, where the interference speech is extracted from the mixed speech based on the registered speech features.
  • the device 1200 also includes a secondary processing model for obtaining mixed speech features of mixed speech, speech features of preliminary recognized speech, and speech features of interfering speech; fusing the mixed speech features and the speech features of preliminary recognized speech based on an attention mechanism to obtain a first attention feature; fusing the mixed speech features and the speech features of interfering speech based on an attention mechanism to obtain a second attention feature; and fusing the mixed speech features, the first attention features, and the second attention features, and obtaining the speaker's clean speech based on the fused features.
  • the secondary processing model is also used to fuse the mixed speech features, the first attention features, the second attention features and the registered speech features, and obtain the speaker's clean speech based on the fused features.
  • the primary speech extraction model is further used to generate first mask information and second mask information based on the mixed speech and the registered speech features after the mixed speech and the registered speech features are input, shield the interference information in the mixed speech according to the first mask information to obtain the preliminary recognition speech of the speaker, and shield the voice information of the speaker in the mixed speech according to the second mask information to obtain the interference voice.
  • the trained model parameters in the primary speech extraction model include first mask mapping parameters and second mask mapping parameters.
  • the primary speech extraction model is also used to generate first mask information based on the first mask mapping parameters of the speech extraction model, mixed speech and registered speech features; and to generate second mask information based on the second mask mapping parameters of the speech extraction model, mixed speech and registered speech features.
  • the mixed speech features of the mixed speech, the speech features of the preliminary recognized speech, and the speech features of the interfering speech are extracted by the feature extraction layer after the mixed speech, the preliminary recognized speech, and the interfering speech are respectively input into the feature extraction layer in the secondary processing model.
  • the first attention feature is obtained by the first attention unit in the secondary processing model by fusing the mixed speech feature and the speech feature of the preliminary recognized speech through the attention mechanism.
  • the second attention feature is obtained by the second attention unit in the secondary processing model by fusing the mixed speech features and the speech features of the interfering speech through the attention mechanism.
  • the secondary processing model also includes a feature fusion layer and a secondary speech extraction model.
  • the secondary processing model is also used to input the mixed speech features, the first attention features, the second attention features and the registered speech features into the feature fusion layer for fusion to obtain speech fusion features; and input the speech fusion features into the secondary speech extraction model to obtain the speaker's clean speech based on the speech fusion features through the secondary speech extraction model.
  • the acquisition module 1202 is also used to determine the speaker specified by the call trigger operation in response to the call trigger operation, and determine the speaker's registered voice from pre-stored candidate registered voices; when a voice call is established with the terminal corresponding to the speaker based on the call trigger operation, receive the mixed voice sent by the terminal corresponding to the speaker in the voice call.
  • the acquisition module 1202 is also used to acquire multimedia speech of the multimedia object;
  • the multimedia speech is a mixed speech including speech information of multiple speakers; in response to a designated operation for a speaker in the multimedia speech, an identifier of the designated speaker is acquired;
  • the speaker is a speaker whose speech needs to be extracted from multiple sound-emitting objects; from the registered speech pre-stored for each speaker in the multimedia speech, a registered speech having a mapping relationship with the speaker's identifier is acquired to obtain the registered speech of the speaker.
  • the above-mentioned speech processing device 1200 obtains mixed speech and the registered speech of the speaker, wherein the mixed speech includes the speech information of the speaker.
  • based on the registered speech features, the preliminary recognition speech of the speaker can be initially and relatively accurately extracted from the mixed speech.
  • advanced filtering processing is performed on the basis of the preliminary recognition speech, that is, based on the registered speech features, the speech similarity between the speech information in the registered speech and the preliminary recognition speech is determined, and the speech information with speech similarity less than the preset similarity is filtered out from the preliminary recognition speech, so that the residual noise in the preliminary recognition speech can be filtered out, thereby obtaining a cleaner clean speech of the speaker, and improving the accuracy of speech extraction.
  • Each module in the above-mentioned speech processing device 1200 can be implemented in whole or in part by software, hardware or a combination thereof.
  • Each module can be embedded in or independent of a processor in a computer device in the form of hardware, or can be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to each module above.
  • a computer device which may be a terminal, and its internal structure diagram may be shown in FIG13.
  • the computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device.
  • the processor, the memory, and the input/output interface are connected via a system bus, and the communication interface, the display unit, and the input device are connected to the system bus via the input/output interface.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the input/output interface of the computer device is used to exchange information between the processor and an external device.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies.
  • the display unit of the computer device is used to form a visually visible picture, which can be a display screen, a projection device or a virtual reality imaging device.
  • the display screen can be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the computer device can be a touch layer covered on the display screen, or a key, trackball or touchpad set on the computer device shell, or an external keyboard, touchpad or mouse.
  • FIG. 13 is merely a block diagram of a partial structure related to the scheme of the present application, and does not constitute a limitation on the computer device to which the scheme of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the steps in the above-mentioned method embodiments when executing the computer-readable instructions.
  • a computer-readable storage medium is provided, which stores computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps in the above-mentioned method embodiments are implemented.
  • a computer-readable instruction product including computer-readable instructions, which implement the steps in the above-mentioned method embodiments when executed by a processor.
  • user information including but not limited to user device information, user personal information, etc.
  • data including but not limited to data used for analysis, stored data, displayed data, etc.
  • Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech processing method, including: obtaining a registered speech of a speaker and obtaining a mixed speech, the mixed speech including speech information of multiple sound-generating objects, and the multiple sound-generating objects including the speaker (202); determining a registered speech feature of the registered speech (204); extracting a preliminary recognized speech of the speaker from the mixed speech according to the registered speech feature (206); determining, according to the registered speech feature, a speech similarity between the registered speech and the speech information included in the preliminary recognized speech (208); and filtering out, from the preliminary recognized speech, speech information whose speech similarity is less than a preset similarity, to obtain a clean speech of the speaker (210).

Description

Speech processing method, apparatus, device, and medium
Related Application
This application claims priority to the Chinese patent application No. 2022112978433, entitled "Speech processing method, apparatus, device, and medium", filed with the China Patent Office on October 21, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to a speech processing method, apparatus, device, and medium.
Background
With the development of computer technology, speech processing technology has emerged; it refers to technology for performing audio processing on speech signals. Speech extraction is one kind of speech processing technology: through speech extraction, the sound a user is interested in can be extracted from a complex speech scene. It can be understood that a complex speech scene may include at least one of interference from multiple people speaking, heavy reverberation, high background noise, and music noise. For example, through speech extraction technology, a user can extract the sound of an object of interest from a complex speech scene. In conventional technology, speech extraction is usually performed directly on the complex speech, and the extracted speech is directly used as the speech of the object to be extracted; however, the speech extracted in this way often retains considerable noise (for example, the extracted speech may still include the sounds of other objects), resulting in low speech extraction accuracy.
Summary
According to various embodiments provided in the present application, a speech processing method, apparatus, device, and medium are provided.
In a first aspect, the present application provides a speech processing method, the method including:
obtaining a registered speech of a speaker and obtaining a mixed speech, the mixed speech including speech information of multiple sound-generating objects, and the multiple sound-generating objects including the speaker;
determining a registered speech feature of the registered speech;
extracting a preliminary recognized speech of the speaker from the mixed speech according to the registered speech feature;
determining, according to the registered speech feature, a speech similarity between the registered speech and the speech information included in the preliminary recognized speech; and
filtering out, from the preliminary recognized speech, speech information whose speech similarity is less than a preset similarity, to obtain a clean speech of the speaker.
In a second aspect, the present application provides a speech processing apparatus, the apparatus including:
an acquisition module, configured to obtain a registered speech of a speaker and obtain a mixed speech, the mixed speech including speech information of multiple sound-generating objects, and the multiple sound-generating objects including the speaker;
a first extraction module, configured to determine a registered speech feature of the registered speech and extract a preliminary recognized speech of the speaker from the mixed speech according to the registered speech feature;
a determination module, configured to determine, according to the registered speech feature, a speech similarity between the registered speech and the speech information included in the preliminary recognized speech; and
a filtering module, configured to filter out, from the preliminary recognized speech, speech information whose speech similarity is less than a preset similarity, to obtain a clean speech of the speaker.
In a third aspect, the present application provides a computer device, including a memory and a processor, the memory storing computer-readable instructions, and the processor performing the steps in the method embodiments of the present application when executing the computer-readable instructions.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, perform the steps in the method embodiments of the present application.
In a fifth aspect, the present application provides a computer-readable instruction product, including computer-readable instructions which, when executed by a processor, perform the steps in the method embodiments of the present application.
Details of one or more embodiments of the present application are set forth in the following drawings and description. Other features, objects, and advantages of the present application will become apparent from the specification, the drawings, and the claims.
附图说明
为了更清楚地说明本申请实施例或传统技术中的技术方案,下面将对实施例或传统技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据公开的附图获得其他的附图。
图1为一个实施例中语音处理方法的应用环境图;
图2为一个实施例中语音处理方法的流程示意图;
图3为一个实施例中语音提取网络的网络结构示意图;
图4为一个实施例中用于对混合语音进行语音提取的模型的网络结构示意图;
图5为一个实施例中一级语音提取网络的网络结构示意图;
图6为一个实施例中降噪网络的网络结构示意图;
图7为一个实施例中注册网络的网络结构示意图;
图8为另一个实施例中语音处理方法的应用环境图;
图9为一个实施例中语音处理方法的原理示意图;
图10为一个实施例中对初步识别语音进行过滤处理的原理示意图;
图11为另一个实施例中语音处理方法的流程示意图;
图12为一个实施例中语音处理装置的结构框图;
图13为一个实施例中计算机设备的内部结构图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请提供的语音处理方法,可以应用于如图1所示的应用环境中。其中,终端102通过网络与服务器104进行通信。数据存储系统可以存储服务器104需要处理的数据。数据存储系统可以集成在服务器104上,也可以放在云上或其他服务器上。其中,终端102可以但不限于是各种台式计算机、笔记本电脑、智能手机、平板电脑、物联网设备和便携式可穿戴设备,物联网设备可为智能音箱、智能电视、智能空调、智能车载设备等。便携式可穿戴设备可为智能手表、智能手环、头戴设备等。服务器104可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器。终端102以及服务器104可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
终端102可获取说话人的注册语音,并获取混合语音,混合语音包括多个发声对象的语音信息,多个发声对象包括说话人。终端102可确定注册语音的注册语音特征,依据注册语音特征,从混合语音中,提取出说话人的初步识别语音。终端102可根据注册语音特征,确定注册语音和初步识别语音所包括语音信息之间的语音相似度。终端102可从初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到说话人的干净语音。
需要说明的是,本申请一些实施例中的语音处理方法使用到了人工智能技术。比如,注册语音的注册语音特征,则属于使用人工智能技术编码得到的特征,以及,说话人的初步识别语音,也属于使用人工智能技术识别得到的语音。
在一个实施例中,如图2所示,提供了一种语音处理方法,本实施例以该方法应用于图1中的终端102为例进行说明,包括以下步骤:
步骤202，获取说话人的注册语音，并获取混合语音，混合语音包括多个发声对象的语音信息，多个发声对象包括说话人。
其中,发声对象是可以发出声音的实体,可以是自然物或人造物,可以是活体或非活体。发声对象包括人物、动物或物体等中的至少一种。作为语音处理所针对目标的发声对象,可以称其为说话人,也可以称其为目标对象。可以理解,说话人是需要通过本申请的语音处理方法提取语音的对象。语音可以数字信号的形式存储为音频格式文件。
混合语音是包括多个发声对象各自语音信息的语音,这里多个发声对象可以均是用户,多个发声对象中的一个为说话人。混合语音包括说话人的语音信息。混合语音包括说话人的语音信息,可以理解为混合语音所记载的声音包括说话人的声音。
注册语音(Registered Voice)是预先针对说话人注册的干净的语音,是在一个语音数据库中预存储的该说话人的一段语音。可以理解,注册语音中基本上仅包括说话人的语音信息,不包括除说话人之外的其他发声对象的语音信息,或者除说话人之外的其他发声对象的语音信息非常少,可以忽略。
说话人可以在较为安静的环境下说一段话,终端可采集说话人说这一段话时的声音,生成注册语音。可以理解,这段话不包括除说话人之外的其他对象的声音。终端可采集说话人在安静的环境下所说的话,并根据说话人在安静的环境下所说的话,生成说话人的注册语音。安静可以是环境噪声的分贝数不超过预设分贝数。预设分贝数可以取30-40,也可以根据需要设置更低或更高的分贝数。
说话人可以在较为吵闹的环境下说一段话,终端可采集说话人说这一段话时的声音,生成混合语音。可以理解,这段话包括除说话人之外的其他发声对象的声音,还可以包括环境噪音。终端可采集说话人在吵闹的环境下所说的话,并根据说话人在吵闹的环境下所说的话,生成包括说话人的语音信息的混合语音。吵闹可以是环境噪声的分贝数超过预设分贝数。
在一个实施例中,终端可将说话人在安静的环境下所说的话对应的语音,直接作为说话人的注册语音。终端可将说话人在吵闹的环境下所说的话对应的语音,直接作为包括说话人的语音信息的混合语音。
步骤204,确定注册语音的注册语音特征。
其中，注册语音特征是注册语音的特征，可以表征说话人语音的特性，也可以称为说话人语音特征。终端可采用机器学习模型从注册语音中提取注册语音特征，还可以采用梅尔频率倒谱系数（MFCC，Mel Frequency Cepstrum Coefficient）、线性预测系数（LPC，Linear Predictive Coding）、线性预测倒谱系数（LPCC，Linear Prediction Cepstrum Coefficient）、线谱频率（LSF，Linear Spectral Frequency）、离散小波变换（Discrete Wavelet Transform）或感知线性预测（PLP，Perceptual Linear Predictive）等声学特征提取方式中的至少一种。终端可从注册语音中提取特征，得到注册语音的注册语音特征。注册语音的注册语音特征可以是即时提取的，也可以是预先提取并存储的。
步骤206,依据注册语音特征,从混合语音中,提取出说话人的初步识别语音。
注册语音特征可用于对混合语音中说话人的语音信息进行初步识别。初步识别是相对较为粗略的识别,用以从混合语音中提取出初步识别语音。初步识别语音,是对混合语音中说话人的语音信息进行初步识别得到的语音,可以理解,初步识别语音中除了包括说话人的语音信息,还有可能包括除说话人之外的其他发声对象的语音信息。初步识别语音是后续处理的基础,也可以称之为初始语音。
终端可从混合语音中,提取满足与注册语音特征相关联的条件的语音,得到说话人的初步识别语音。这里的条件,比如混合语音中的某段或某条语音信息与注册语音特征,二者的某个或某些语音参数的值满足预设的匹配条件。
终端可对注册语音进行特征提取，得到注册语音的注册语音特征。进而，终端可依据注册语音的注册语音特征，对混合语音中说话人的语音信息进行初步识别，即，对混合语音进行初步的语音提取，得到说话人的初步识别语音。
在一个实施例中,终端可对混合语音进行特征提取,得到混合语音的混合语音特征。进而,终端可根据混合语音特征和注册语音特征,对混合语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音。其中,混合语音特征是混合语音的特征。
在一个实施例中,初步识别语音可以是通过预先训练的语音提取模型提取得到的。终端可将混合语音和注册语音的注册语音特征输入至语音提取网络,以通过语音提取网络对混合语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音。语音提取网络可以采用卷积神经网络(CNN)。
步骤208,根据注册语音特征,确定注册语音和初步识别语音所包括语音信息之间的语音相似度。
终端可根据注册语音特征,确定注册语音和初步识别语音中语音信息之间的语音相似度。其中,语音相似度,是语音声音特性的相似度,基本跟语音内容所表达的内容无关。这里语音相似度具体是注册语音和初步识别语音中的语音信息之间的相似度。语音相似度越大,表示越相似,语音相似度越小,表示越不相似。
在一个实施例中,终端可对初步识别语音中的语音信息进行特征提取,得到语音信息特征。进而,终端可根据注册语音特征和语音信息特征,确定注册语音和初步识别语音中语音信息之间的语音相似度。
步骤210,从初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到说话人的干净语音。
终端可从初步识别语音中确定语音相似度小于预设相似度的语音信息,得到待过滤语音信息。终端可从初步识别语音中,滤除该待过滤语音信息,得到说话人的干净语音。
其中,待过滤语音信息,是初步识别语音中即将要进行过滤处理的语音信息。其中,干净语音,是说话人的干净的语音,可以理解,干净语音中仅包括说话人的语音信息,不包括除说话人之外的其他对象的语音信息。说话人的干净语音,是本申请各实施例的语音处理方法处理的结果,可以称之为目标语音。
终端可分别判断初步识别语音中的各个语音信息与注册语音之间的语音相似度是否小于预设相似度。若语音相似度小于预设相似度,则终端可将相应的语音信息作为待过滤语音信息。若语音相似度大于或等于预设相似度,可以理解,注册语音和相应语音信息之间的语音相似度较高,说明该语音信息大概率属于说话人对应的语音信息,此时,终端可将相应的语音信息保留。
预设相似度可以根据语音相似度的取值范围和过滤强度设定。预设相似度越小,过滤强度越小,越容易保留一些噪声;预设相似度越大,过滤强度越大,也越容易将说话人的声音也过滤。因此,预设相似度可在语音相似度取值范围内根据实际需要和测试效果确定。
终端可将初步识别语音中待过滤语音信息进行滤除。终端可在初步识别语音中,将待过滤语音信息置为静音,并根据初步识别语音中保留下来的语音信息,生成说话人的干净语音。保留下来的语音信息,是未置为静音的语音信息。
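作为便于理解的示意，下面给出上述“按片段过滤并置静音”处理的一个最小实现草图（基于numpy；片段的划分方式、预设相似度取0.6等均为示例性假设，并非本申请实施例的限定）：

```python
import numpy as np

def filter_segments(segments, similarities, threshold=0.6):
    """从初步识别语音的各语音片段中，滤除语音相似度小于预设相似度的片段。

    segments:     语音片段列表，每个元素为一维波形数组
    similarities: 每个片段与注册语音之间的语音相似度
    threshold:    预设相似度（示例取0.6，实际可按过滤强度和测试效果确定）
    """
    kept = []
    for seg, sim in zip(segments, similarities):
        if sim < threshold:
            kept.append(np.zeros_like(seg))   # 小于预设相似度：置为静音
        else:
            kept.append(seg)                  # 大于或等于预设相似度：保留
    return np.concatenate(kept)               # 由保留下来的语音信息生成干净语音

# 用法示例：三个片段，相似度分别为0.9、0.3、0.8，中间片段将被静音
segs = [np.random.randn(8000) for _ in range(3)]
clean = filter_segments(segs, [0.9, 0.3, 0.8])
print(clean.shape)  # (24000,)
```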
上述语音处理方法中,通过获取混合语音和说话人的注册语音,混合语音中包括说话人的语音信息。依据注册语音的注册语音特征,从混合语音中初步提取出说话人的初步识别语音,能够初步较为准确地提取到说话人的初步识别语音。进而,会在初步识别语音的基础上进行进阶地过滤处理,即,根据注册语音特征,确定注册语音和初步识别语音中语音信息之间的语音相似度,并从初步识别语音中过滤掉语音相似度小于预设相似度的语音信息,就可以将初步识别语音中残留的噪声过滤掉,从而得到更为干净的说话人的干净语音,提升语音提取的准确率。
在一个实施例中，依据注册语音特征，从混合语音中，提取出说话人的初步识别语音，包括：确定混合语音的混合语音特征；将混合语音特征和注册语音的注册语音特征进行融合，得到语音融合特征；及基于语音融合特征，对混合语音中说话人的语音信息进行初步识别，得到说话人的初步识别语音。
其中,语音融合特征,是将混合语音特征和注册语音的注册语音特征进行融合之后得到的语音特征。
终端可对混合语音进行特征提取,得到混合语音的混合语音特征,并将混合语音特征和注册语音的注册语音特征进行融合,得到语音融合特征。进而,终端可基于语音融合特征对混合语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音。
在一个实施例中,终端可对混合语音进行傅里叶变换,获得傅里叶变换结果,并基于傅里叶变换结果进行特征提取,得到混合语音的混合语音特征。
在一个实施例中,终端可将混合语音特征和注册语音的注册语音特征进行特征拼接,并将拼接后的特征作为语音融合特征。
在一个实施例中,终端可将混合语音特征和注册语音的注册语音特征映射到相同维度后,进行加权求和或者加权求平均的运算,得到语音融合特征。
上述实施例中,通过将混合语音特征和注册语音的注册语音特征进行融合,可以得到包括混合语音特征和注册语音特征的语音融合特征,进而再基于语音融合特征对混合语音中说话人的语音信息进行初步识别,可以提升初步识别语音的提取准确率。
在一个实施例中,混合语音特征包括混合语音特征矩阵,语音融合特征包括语音融合特征矩阵,注册语音特征包括注册语音特征向量,将混合语音特征和注册语音的注册语音特征进行融合,得到语音融合特征,包括:将注册语音特征向量在时间维度上重复,以生成注册语音特征矩阵,注册语音特征矩阵的时间维度与混合语音特征矩阵的时间维度相同;及将混合语音特征矩阵和注册语音特征矩阵拼接,得到语音融合特征矩阵。
其中,时间维度,是时域中的语音信号的帧数所对应的维度。混合语音特征矩阵是混合语音特征对应的特征矩阵,是混合语音特征的具体体现形式。语音融合特征矩阵是语音融合特征对应的特征矩阵,是语音融合特征的具体体现形式。注册语音特征向量是注册语音特征对应的特征向量。注册语音特征矩阵是注册语音特征向量所组成的特征矩阵。
终端可获取混合语音特征矩阵时间维度的长度,以混合语音特征矩阵时间维度的长度为约束,将注册语音特征向量在时间维度上重复,以生成时间维度与混合语音特征矩阵的时间维度相同的注册语音特征矩阵。进而,终端可将混合语音特征矩阵和注册语音特征矩阵进行拼接,得到语音融合特征矩阵。
上述实施例中,通过将注册语音特征向量在时间维度上进行重复,以生成时间维度与混合语音特征矩阵的时间维度相同的注册语音特征矩阵,以便后续将混合语音特征矩阵和注册语音特征矩阵进行拼接,得到语音融合特征矩阵,提升特征融合的准确率。
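作为示意，下面给出“将注册语音特征向量在时间维度上重复并与混合语音特征矩阵拼接”的一个numpy草图（其中帧数、特征维度均为示例性假设）：

```python
import numpy as np

def fuse_features(mixed_feat, reg_vec):
    """mixed_feat: (T, D1) 混合语音特征矩阵；reg_vec: (D2,) 注册语音特征向量。

    返回 (T, D1 + D2) 的语音融合特征矩阵。
    """
    T = mixed_feat.shape[0]
    reg_mat = np.tile(reg_vec[None, :], (T, 1))             # 时间维度重复 -> (T, D2)
    return np.concatenate([mixed_feat, reg_mat], axis=-1)   # 拼接 -> (T, D1 + D2)

fused = fuse_features(np.random.randn(100, 257), np.random.randn(256))
print(fused.shape)  # (100, 513)
```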
在一个实施例中,确定混合语音的混合语音特征,包括:提取混合语音的幅度谱,得到第一幅度谱;对第一幅度谱进行特征提取,得到幅度谱特征;及对幅度谱特征进行特征提取,得到混合语音的混合语音特征。
其中,第一幅度谱是混合语音的幅度谱。幅度谱特征是第一幅度谱的特征。
终端可对时域下的混合语音进行傅里叶变换,得到频域下的混合语音的语音信息。终端可根据频域下的混合语音的语音信息,得到混合语音的第一幅度谱。进而,终端可对第一幅度谱进行特征提取,得到幅度谱特征,并对幅度谱特征进行特征提取,得到混合语音的混合语音特征。
上述实施例中,通过提取混合语音的第一幅度谱,以将时域的混合语音信号转换为频域的信号,并对第一幅度谱进行特征提取得到幅度谱特征,进而再对幅度谱特征进行特征提取,可以得到混合语音的混合语音特征,从而提升混合语音特征的准确率。
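作为示意，下面给出提取混合语音第一幅度谱（以及后续变换会用到的相位谱）的一个numpy草图（帧长512、帧移160等参数均为示例性假设）：

```python
import numpy as np

def magnitude_and_phase(wave, n_fft=512, hop=160):
    """对时域语音分帧加窗并做傅里叶变换，返回幅度谱与相位谱（帧数 x 频点数）。"""
    window = np.hanning(n_fft)
    frames = [wave[s:s + n_fft] * window
              for s in range(0, len(wave) - n_fft + 1, hop)]
    spec = np.fft.rfft(np.stack(frames), axis=-1)   # 复数频谱
    return np.abs(spec), np.angle(spec)             # 幅度谱、相位谱

mag, phase = magnitude_and_phase(np.random.randn(16000))  # 假设16kHz采样的1秒语音
print(mag.shape)  # (97, 257)
```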
基于语音融合特征，对混合语音中说话人的语音信息进行初步识别，得到说话人的初步识别语音，包括：基于语音融合特征，对混合语音中说话人的语音信息进行初步识别，得到说话人的语音特征；对说话人的语音特征进行特征解码，得到第二幅度谱；及根据混合语音的相位谱将第二幅度谱进行变换，得到说话人的初步识别语音。
其中,说话人的语音特征,是反映说话人说话时声音特性的特征,可以称之为说话人的对象语音特征。第二幅度谱是对象语音特征解码后得到的幅度谱。
终端可基于语音融合特征对混合语音中说话人的语音信息进行初步识别,得到说话人的对象语音特征。进而,终端可将对象语音特征进行特征解码,得到第二幅度谱。终端可获取混合语音的相位谱,并根据混合语音的相位谱将第二幅度谱进行变换,得到说话人的初步识别语音。
在一个实施例中,第二幅度谱用于表征位于频域的语音信号。终端可根据混合语音的相位谱将第二幅度谱进行反傅里叶变换,得到位于时域的说话人的初步识别语音。
在一个实施例中,初步识别语音是通过语音提取网络提取得到的。如图3所示,语音提取网络包括傅里叶变换单元、编码器、长短期记忆单元和反傅里叶变换单元。可以理解,终端可通过语音提取网络中的傅里叶变换单元,提取混合语音的第一幅度谱。终端可通过语音提取网络中的编码器对第一幅度谱进行特征提取,得到幅度谱特征。终端可通过语音提取网络中的长短期记忆单元对幅度谱特征进行特征提取,得到混合语音的混合语音特征,并基于语音融合特征对混合语音中说话人的语音信息进行初步识别,得到说话人的对象语音特征,对对象语音特征进行特征解码,得到第二幅度谱。进而,终端可通过语音提取网络中的反傅里叶变换单元,根据混合语音的相位谱将第二幅度谱进行变换,得到说话人的初步识别语音。
上述实施例中,通过基于语音融合特征对混合语音中说话人的语音信息进行初步识别,可以得到说话人的对象语音特征。进而再通过对对象语音特征进行特征解码,可以得到第二幅度谱,根据混合语音的相位谱将第二幅度谱进行变换,以将频域的信号转换为时域的语音信号,得到说话人的初步识别语音,提升初步识别语音的提取准确率。
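作为示意，下面给出“根据混合语音的相位谱将第二幅度谱反变换回时域”的一个numpy草图（采用简化的重叠相加，未做窗函数补偿，仅用于说明原理）：

```python
import numpy as np

def reconstruct(mag, phase, n_fft=512, hop=160):
    """用相位谱把幅度谱还原为复数频谱，再逐帧反傅里叶变换并重叠相加得到波形。"""
    spec = mag * np.exp(1j * phase)                  # 还原复数频谱
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)    # 每帧反傅里叶变换
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame        # 重叠相加
    return out

wave = reconstruct(np.random.rand(97, 257), np.random.uniform(-np.pi, np.pi, (97, 257)))
print(wave.shape)  # (15872,)
```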
在一个实施例中,确定注册语音的注册语音特征,包括:提取注册语音的频率谱;根据频率谱,生成注册语音的梅尔频率谱;及对梅尔频率谱进行特征提取,得到注册语音的注册语音特征。
具体地,终端可对时域下的注册语音进行傅里叶变换,得到频域下的注册语音的语音信息。终端可根据频域下的注册语音的语音信息,得到注册语音的频率谱。进而,终端可根据注册语音的频率谱,生成注册语音的梅尔频率谱,并对梅尔频率谱进行特征提取,得到注册语音的注册语音特征。
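作为示意，下面给出提取注册语音梅尔频率谱并得到注册语音特征的一个简化草图（假设已安装常用音频库librosa；实际实施例中梅尔频率谱之后还会经过长短期记忆单元等进一步编码，这里仅以时间维度求平均近似特征生成单元）：

```python
import numpy as np
import librosa

def registration_feature(reg_wave, sr=16000):
    """提取注册语音的（对数）梅尔频率谱，并在时间维度上求平均得到注册语音特征向量。"""
    mel = librosa.feature.melspectrogram(y=reg_wave, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=80)  # (80, 帧数)
    log_mel = np.log(mel + 1e-6)
    return log_mel.mean(axis=1)   # (80,) 的注册语音特征向量

print(registration_feature(np.random.randn(16000)).shape)  # (80,)
```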
在一个实施例中,语音信息包括语音片段;根据注册语音特征,确定注册语音和初步识别语音所包括语音信息之间的语音相似度,包括:针对初步识别语音中的每一个语音片段,确定语音片段对应的片段语音特征;根据片段语音特征和注册语音特征,确定注册语音和语音片段之间的语音相似度。
其中,片段语音特征是语音片段的语音特征。初步识别语音中包括多个语音片段。
终端可针对初步识别语音中的每一个语音片段,对该语音片段进行特征提取,得到该语音片段的片段语音特征,并根据该片段语音特征和注册语音特征,确定注册语音和语音片段之间的语音相似度。
在一个实施例中,终端可针对初步识别语音中的每一个语音片段,对该语音片段进行特征提取,得到该语音片段对应的片段语音特征。
在一个实施例中,片段语音特征包括片段语音特征向量,注册语音特征包括注册语音特征向量。终端可针对初步识别语音中的每一个语音片段,根据该语音片段的片段语音特征向量和注册语音特征向量,确定注册语音和该语音片段之间的语音相似度。
在一个实施例中，注册语音和语音片段之间的语音相似度可通过以下公式计算得到：
cosθ = (A·B) / (||A||·||B||)
其中,A表示片段语音特征向量,B表示注册语音特征向量,cosθ表示注册语音和语音片段之间的语音相似度,θ表示片段语音特征向量和注册语音特征向量的夹角。
上述实施例中,通过根据片段语音特征和注册语音特征,确定注册语音和语音片段之间的语音相似度,可以提升注册语音和初步识别语音中语音信息之间的语音相似度的计算准确率。
在一个实施例中,针对初步识别语音中的每一个语音片段,确定语音片段对应的片段语音特征,包括:针对初步识别语音中的每一个语音片段,将语音片段进行重复,得到与注册语音的时间长度一致的重组语音;其中,重组语音包括多个语音片段;根据重组语音的重组语音特征确定语音片段对应的片段语音特征。
本实施例中,初步识别语音中语音信息包括语音片段,根据注册语音特征,确定注册语音和初步识别语音中语音信息之间的语音相似度,包括:按照注册语音的时间长度,对初步识别语音中的每一个语音片段分别进行重复,得到时间长度的重组语音;获取从重组语音提取的重组语音特征,根据重组语音特征确定初步识别语音中每一个语音片段对应的片段语音特征;及分别根据每个语音片段对应的片段语音特征和注册语音特征,确定注册语音和每个语音片段之间的语音相似度。
其中,重组语音是由多个相同的语音片段重组得到的语音,可以理解,重组语音中包括多个相同的语音片段。
终端可获取注册语音的时间长度,针对初步识别语音中的每一个语音片段,按照注册语音的时间长度将该语音片段进行重复,得到与注册语音的时间长度一致的重组语音。针对每一个语音片段,得到的重组语音包括多个相同的语音片段。终端可对重组语音进行特征提取,得到重组语音的重组语音特征,并根据重组语音的重组语音特征,确定该语音片段对应的片段语音特征。
在一个实施例中,终端可将重组语音的重组语音特征,直接作为该语音片段对应的片段语音特征。
上述实施例中,将语音片段进行重复处理,得到与注册语音的时间长度一致的、且包括多个相同的语音片段的重组语音,进而再根据重组语音的重组语音特征确定语音片段对应的片段语音特征,可以进一步提升注册语音和初步识别语音中语音信息之间的语音相似度的计算准确率。
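作为示意，下面给出“将语音片段重复成与注册语音等长的重组语音，再计算其特征与注册语音特征之间余弦相似度”的一个numpy草图（其中feature_fn表示任意把波形映射为特征向量的函数，例如前述注册网络，此处仅为占位假设）：

```python
import numpy as np

def segment_similarity(segment, reg_wave, reg_feat, feature_fn):
    """返回注册语音与某个语音片段之间的语音相似度（余弦相似度）。"""
    repeats = int(np.ceil(len(reg_wave) / len(segment)))
    recombined = np.tile(segment, repeats)[:len(reg_wave)]   # 重组语音，与注册语音等长
    seg_feat = feature_fn(recombined)                        # 片段语音特征
    return float(np.dot(seg_feat, reg_feat) /
                 (np.linalg.norm(seg_feat) * np.linalg.norm(reg_feat) + 1e-8))

# 用法示例：用最简单的“频谱幅度”特征函数演示（真实场景应替换为注册网络的特征编码）
toy_fn = lambda w: np.abs(np.fft.rfft(w, n=512))
reg_wave = np.random.randn(16000)
sim = segment_similarity(np.random.randn(4000), reg_wave, toy_fn(reg_wave), toy_fn)
print(round(sim, 3))
```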
在一个实施例中,根据注册语音特征,确定注册语音和初步识别语音所包括语音信息之间的语音相似度的步骤,以及从初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到说话人的干净语音的步骤,是在第一处理模式下执行的。
在一个实施例中,该语音处理方法还包括:在第二处理模式下,获取干扰语音,干扰语音,是依据注册语音特征,从混合语音中提取出的;获取混合语音的混合语音特征、初步识别语音的语音特征、以及干扰语音的语音特征;将混合语音特征和初步识别语音的语音特征,基于注意力机制融合,得到第一注意力特征;将混合语音特征和干扰语音的语音特征,基于注意力机制融合,得到第二注意力特征;及基于混合语音特征、第一注意力特征和第二注意力特征融合,并基于融合后的特征得到说话人的干净语音。
本实施例中，在第一处理模式下，终端可执行确定语音相似度及后续相应语音过滤步骤。在第二处理模式下，终端还可依据注册语音特征从混合语音中提取出干扰语音，干扰语音是在混合语音中干扰识别说话人的语音信息的语音。进一步地，在第二处理模式下，终端可将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合，得到第一注意力特征，以及将混合语音特征和干扰语音的语音特征基于注意力机制进行融合，得到第二注意力特征；基于混合语音特征、第一注意力特征和第二注意力特征进行融合，并基于融合后的特征得到说话人的干净语音。
其中,第一注意力特征,是将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制融合得到的特征。第二注意力特征,是将混合语音特征和干扰语音的语音特征基于注意力机制融合得到的特征。可以理解,将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合,是指将混合语音的混合语音特征和初步识别语音的语音特征分别乘以相应的注意力权重,以进行融合。还可以理解,将混合语音特征和干扰语音的语音特征基于注意力机制进行融合,是指将混合语音特征和干扰语音的语音特征,分别乘以相应的注意力权重,以进行融合。
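作为示意，下面给出“将两路语音特征分别乘以相应的注意力权重后进行融合”的一个numpy草图（逐帧打分方式、特征维度等均为示例性假设，并非本申请实施例的限定）：

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(mixed_feat, other_feat):
    """以逐帧内积作为打分，得到两个注意力权重，再对两路特征加权求和。"""
    score_mixed = np.sum(mixed_feat * mixed_feat, axis=-1, keepdims=True)
    score_other = np.sum(mixed_feat * other_feat, axis=-1, keepdims=True)
    weights = softmax(np.concatenate([score_mixed, score_other], axis=-1))  # (T, 2)
    return weights[:, :1] * mixed_feat + weights[:, 1:] * other_feat

mixed, prelim, interf = (np.random.randn(100, 64) for _ in range(3))
attn1 = attention_fuse(mixed, prelim)   # 第一注意力特征
attn2 = attention_fuse(mixed, interf)   # 第二注意力特征
```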
处理模式是第一处理模式还是第二处理模式,决定了提取出说话人的干净语音的方式不同。处理模式可以预先配置或实时修改配置,也可以由用户自由选择。
在用户需要快速获取干净语音的情况下,响应于第一处理模式选择操作,终端可将当前处理模式确定为第一处理模式。在第一处理模式下,终端可根据注册语音特征,确定注册语音和初步识别语音中语音信息之间的语音相似度,从初步识别语音中确定语音相似度小于预设相似度的语音信息,得到待过滤语音信息,将初步识别语音中待过滤语音信息进行过滤处理,得到说话人的干净语音。
在用户需要获取高准确率的干净语音的情况下,响应于第二处理模式选择操作,终端可将当前处理模式确定为第二处理模式。在第二处理模式下,终端可将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合,得到第一注意力特征,以及将混合语音特征和干扰语音的语音特征基于注意力机制进行融合,得到第二注意力特征;基于混合语音特征、第一注意力特征和第二注意力特征进行融合,并基于融合后的特征得到说话人的干净语音。
在一个实施例中,终端可将混合语音特征、第一注意力特征和第二注意力特征直接进行特征融合,得到融合后的特征。进而,终端可基于融合后的特征确定说话人的干净语音。
在一个实施例中,终端可将混合语音和注册语音特征输入至预先训练的语音提取模型,以通过语音提取模型基于混合语音和注册语音特征进行语音提取,输出初步识别语音和干扰语音。
上述实施例中,在第一处理模式下,通过注册语音和初步识别语音中语音信息之间的语音相似度,对从混合语音中提取的初步识别语音进行进阶的语音过滤,得到更为干净的说话人的干净语音。可以理解,在第一处理模式下可以快速获得较为干净的干净语音,提升语音提取效率。在第二处理模式下,通过将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合,以及将混合语音特征和干扰语音的语音特征基于注意力机制进行融合,分别得到第一注意力特征和第二注意力特征。进而再基于混合语音特征、第一注意力特征和第二注意力特征确定说话人的干净语音。可以理解,相较于第一处理模式,在第二处理模式下可以获得更为干净的干净语音,进一步提升语音提取准确率。这样,提供两种处理模式供用户选择,可以提升语音提取的灵活性。
在一个实施例中,基于混合语音特征、第一注意力特征和第二注意力特征融合,并基于融合后的特征得到说话人的干净语音,包括:将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行融合,并基于融合后的特征得到说话人的干净语音。
本实施例中,终端可将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行特征融合,得到融合后的特征。进而,终端可基于融合后的特征确定说话人的干净语音。
上述实施例中,通过将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行融合,可以使得融合后的特征更为准确,从而再基于更为准确的融合后的特征确定说话人的干净语音,可以进一步提升语音提取准确率。
在一个实施例中，初步识别语音和干扰语音是通过训练过的语音提取模型从混合语音中提取出的，方法还包括：将混合语音和注册语音特征输入至语音提取模型；通过语音提取模型，基于混合语音和注册语音特征，生成第一掩码信息和第二掩码信息；通过语音提取模型，根据第一掩码信息屏蔽混合语音中的干扰信息，得到说话人的初步识别语音；及通过语音提取模型，根据第二掩码信息屏蔽混合语音中说话人的语音信息，得到干扰语音。
本实施例中,初步识别语音和干扰语音是通过预先训练的语音提取模型从混合语音中提取出的;方法还包括:将混合语音和注册语音特征输入至语音提取模型,以通过语音提取模型基于混合语音和注册语音特征,生成第一掩码信息和第二掩码信息;根据第一掩码信息屏蔽混合语音中的干扰信息,得到说话人的初步识别语音;根据第二掩码信息屏蔽混合语音中说话人的语音信息,得到干扰语音。
其中,第一掩码信息,是用于屏蔽混合语音中的干扰信息的信息。第二掩码信息,是用于屏蔽混合语音中说话人的语音信息的信息。
终端可将混合语音和注册语音特征输入至预先训练的语音提取模型,以通过语音提取模型基于混合语音和注册语音特征,生成与输入的混合语音和注册语音特征对应的第一掩码信息和第二掩码信息。
进而,终端可根据第一掩码信息屏蔽混合语音中的干扰信息,生成说话人的初步识别语音,以及根据第二掩码信息屏蔽混合语音中说话人的语音信息,生成干扰说话人的语音信息的干扰语音。
在一个实施例中,终端可将混合语音和注册语音特征输入至语音提取模型,以通过语音提取模型基于已训练的模型参数,生成与混合语音和注册语音特征对应的第一掩码信息和第二掩码信息。
在一个实施例中,第一掩码信息包括第一屏蔽参数。可以理解,由于第一掩码信息是用于屏蔽混合语音中的干扰信息的,所以第一掩码信息包括第一屏蔽参数,以实现对混合语音中的干扰信息的屏蔽。具体地,终端可将第一屏蔽参数与混合语音的混合语音幅度谱相乘,得到说话人的语音信息对应的对象语音幅度谱,并根据对象语音幅度谱,生成说话人的初步识别语音。其中,混合语音幅度谱是混合语音的幅度谱。对象语音幅度谱是说话人的语音信息的幅度谱。
在一个实施例中,第二掩码信息包括第二屏蔽参数。可以理解,由于第二掩码信息是用于屏蔽混合语音中说话人的语音信息的,所以第二掩码信息包括第二屏蔽参数,以实现对混合语音中说话人的语音信息的屏蔽。具体地,终端可将第二屏蔽参数与混合语音的混合语音幅度谱相乘,得到混合语音中干扰信息对应的干扰幅度谱,并根据干扰幅度谱,生成干扰说话人的语音信息的干扰语音。其中,干扰幅度谱是混合语音中干扰信息的幅度谱。
上述实施例中,通过语音提取模型基于混合语音和注册语音特征,可以生成与混合语音和注册语音特征对应的第一掩码信息和第二掩码信息,进而根据第一掩码信息屏蔽混合语音中的干扰信息,可以得到说话人的初步识别语音,从而进一步提升了初步识别语音的提取准确率。以及,根据第二掩码信息屏蔽混合语音中说话人的语音信息,可以得到干扰语音,从而提升了干扰语音的提取准确率。
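作为示意，下面给出“将掩码与混合语音幅度谱相乘”的一个numpy草图（其中掩码取值在[0, 1]区间仅为示例性假设）：

```python
import numpy as np

def apply_masks(mixed_mag, mask_target, mask_interf):
    """mixed_mag、mask_target、mask_interf 形状相同（帧数 x 频点数）。"""
    target_mag = mask_target * mixed_mag   # 说话人语音信息对应的对象语音幅度谱
    interf_mag = mask_interf * mixed_mag   # 干扰信息对应的干扰幅度谱
    return target_mag, interf_mag

mixed_mag = np.random.rand(97, 257)
target_mag, interf_mag = apply_masks(mixed_mag, np.random.rand(97, 257), np.random.rand(97, 257))
```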
在一个实施例中,语音提取模型中预先训练好的模型参数中包括第一掩码映射参数和第二掩码映射参数;将混合语音和注册语音特征输入至语音提取模型,以通过语音提取模型基于混合语音和注册语音特征,生成第一掩码信息和第二掩码信息,包括:将混合语音和注册语音特征输入至语音提取模型,以通过第一掩码映射参数映射生成对应的第一掩码信息,以及通过第二掩码映射参数映射生成对应的第二掩码信息。
本实施例中,终端可基于语音提取模型的第一掩码映射参数、混合语音和注册语音特征,生成第一掩码信息。终端可基于语音提取模型的第二掩码映射参数、混合语音和注册语音特征,生成第二掩码信息。
其中，掩码映射参数，是将语音特征映射为掩码信息的相关参数。通过第一掩码映射参数可映射生成用来屏蔽混合语音中干扰信息的掩码信息，即第一掩码信息。通过第二掩码映射参数可映射生成用来屏蔽混合语音中说话人的语音信息的掩码信息，即第二掩码信息。
终端可将混合语音和注册语音特征输入至语音提取模型,以通过语音提取模型中的第一掩码映射参数,映射生成与输入的混合语音和注册语音特征对应的第一掩码信息,以及通过语音提取模型中的第二掩码映射参数,映射生成与输入的混合语音和注册语音特征对应的第二掩码信息。
上述实施例中,由于第一掩码信息和第二掩码信息是基于输入至语音提取模型的混合语音和注册语音特征,以及语音提取模型中预先训练好的第一掩码映射参数和第二掩码映射参数生成的,因此,第一掩码信息和第二掩码信息是可随着输入的不同而动态改变的。这样可以提升第一掩码信息和第二掩码信息的准确率,从而进一步提升初步识别语音和干扰语音的提取准确率。
在一个实施例中,混合语音的混合语音特征、初步识别语音的语音特征、以及干扰语音的语音特征,是将混合语音、初步识别语音和干扰语音分别输入至二级处理模型中的特征提取层后,由特征提取层提取的。
在一个实施例中,第一注意力特征,是由二级处理模型中的第一注意力单元,将混合语音特征和初步识别语音的语音特征进行注意力机制融合得到的。
在一个实施例中,第二注意力特征,是由二级处理模型中的第二注意力单元,将混合语音特征和干扰语音的语音特征进行注意力机制融合得到的。
终端在第二处理模式下,将混合语音、一级语音提取模型输出的初步识别语音和干扰语音分别输入至二级处理模型中的特征提取层进行特征提取,得到混合语音的混合语音特征、初步识别语音的语音特征和干扰语音的语音特征。
终端可将初步识别语音的语音特征和混合语音特征输入至二级处理模型中的第一注意力单元,以将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合,得到第一注意力特征。
终端可将干扰语音的语音特征和混合语音特征输入至二级处理模型中的第二注意力单元,以将混合语音的混合语音特征和干扰语音的语音特征基于注意力机制进行融合,得到第二注意力特征。
可以理解,用于对混合语音进行语音提取的模型包括一级语音提取模型和二级处理模型。其中,一级语音提取模型用于从混合语音中提取出初步识别语音和干扰语音。二级处理模型用于基于初步识别语音和干扰语音对混合语音进行进阶的语音提取,得到说话人的干净语音。
在一个实施例中,二级处理模型中包括特征提取层、第一注意力单元和第二注意力单元。在第二处理模式下,终端可将混合语音、一级语音提取模型输出的初步识别语音和干扰语音分别输入至二级处理模型中的特征提取层,以通过特征提取层对混合语音、初步识别语音和干扰语音分别进行特征提取,得到混合语音的混合语音特征、初步识别语音的语音特征和干扰语音的语音特征。终端可将初步识别语音的语音特征和混合语音特征输入至二级处理模型中的第一注意力单元,以通过第一注意力单元将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合,得到第一注意力特征。终端可将干扰语音的语音特征和混合语音特征输入至二级处理模型中的第二注意力单元,以通过第二注意力单元将混合语音的混合语音特征和干扰语音的语音特征基于注意力机制进行融合,得到第二注意力特征。
上述实施例中,通过一级语音提取模型提取初步识别语音和干扰语音,通过二级处理模型参考初步识别语音和干扰语音对混合语音进行进阶的语音提取,可以进一步提升语音提取准确率。
在一个实施例中,初步识别语音和干扰语音,是通过一级语音提取模型从混合语音中提取的,二级处理模型还包括特征融合层和二级语音提取模型,基于混合语音特征、第一注意力特征和第二注意力特征融合,并基于融合后的特征得到说话人的干净语音,包括:将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征输入至特征融合层进行融合,得到语音融合特征;及将语音融合特征输入至二级语音提取模型,以通过二级语音提取模型基于语音融合特征得到说话人的干净语音。
其中,提取初步识别语音和干扰语音的语音提取模型为一级语音提取模型。二级处理模型还包括特征融合层和二级语音提取模型。终端可将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征输入至特征融合层进行融合,得到语音融合特征;将语音融合特征输入至二级语音提取模型,以通过二级语音提取模型基于语音融合特征得到说话人的干净语音。
二级处理模型中除了包括特征提取层、第一注意力单元和第二注意力单元之外,还包括特征融合层和二级语音提取模型。终端可将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征输入至二级处理模型中的特征融合层,以通过特征融合层对混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行融合,得到语音融合特征。进而,终端可将语音融合特征输入至二级处理模型中的二级语音提取模型,以通过二级语音提取模型基于语音融合特征得到说话人的干净语音。
在一个实施例中,终端可将语音融合特征输入至二级处理模型中的二级语音提取模型,以通过二级语音提取模型对语音融合特征进行特征提取,并基于提取到的特征生成说话人的干净语音。
在一个实施例中,如图4所示,用于对混合语音进行语音提取的模型包括一级语音提取模型和二级处理模型。其中,二级处理模型中包括第一特征提取层、第二特征提取层、第三特征提取层、第一注意力单元、第二注意力单元、特征融合层和二级语音提取模型。终端可将混合语音和注册语音特征输入至一级语音提取模型,以通过语音提取模型基于混合语音和注册语音特征,得到初步识别语音和干扰语音。
进而,终端可将混合语音、初步识别语音和干扰语音,分别输入至二级处理模型中的第一特征提取层、第二特征提取层和第三特征提取层,以对混合语音、初步识别语音和干扰语音分别进行特征提取,得到混合语音的混合语音特征、初步识别语音的语音特征和干扰语音的语音特征。
终端可将初步识别语音的语音特征和混合语音特征输入至二级处理模型中的第一注意力单元,以通过第一注意力单元将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合,得到第一注意力特征。
终端可将干扰语音的语音特征和混合语音特征输入至二级处理模型中的第二注意力单元,以通过第二注意力单元将混合语音的混合语音特征和干扰语音的语音特征基于注意力机制进行融合,得到第二注意力特征。
终端可将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征输入至二级处理模型中的特征融合层,以通过特征融合层对混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行融合,得到语音融合特征。进而,终端可将语音融合特征输入至二级处理模型中的二级语音提取模型,以通过二级语音提取模型基于语音融合特征得到说话人的干净语音。
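作为示意，下面给出特征融合层的一个简化草图（以拼接方式融合混合语音特征、第一注意力特征、第二注意力特征和注册语音特征；拼接仅是其中一种可行的融合方式，维度均为示例性假设）：

```python
import numpy as np

def second_stage_fuse(mixed_feat, attn1, attn2, reg_vec):
    """mixed_feat/attn1/attn2: (T, D)；reg_vec: (D2,)，沿时间维度重复后一并拼接。

    拼接结果随后可送入二级语音提取模型（此处未实现）。
    """
    T = mixed_feat.shape[0]
    reg_mat = np.tile(reg_vec[None, :], (T, 1))
    return np.concatenate([mixed_feat, attn1, attn2, reg_mat], axis=-1)

out = second_stage_fuse(np.random.randn(100, 64), np.random.randn(100, 64),
                        np.random.randn(100, 64), np.random.randn(32))
print(out.shape)  # (100, 224)
```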
在一个实施例中,如图5所示,上述一级语音提取模型中包括傅里叶变换单元、编码器、长短期记忆单元、第一反傅里叶变换单元和第二反傅里叶变换单元。可以理解,终端可通过一级语音提取模型中的傅里叶变换单元,提取混合语音的混合语音幅度谱。
终端可通过一级语音提取模型中的编码器对混合语音幅度谱进行特征提取，得到幅度谱特征。终端可通过一级语音提取模型中的长短期记忆单元基于幅度谱特征生成第一掩码映射参数和第二掩码映射参数。
终端可将第一掩码映射参数与混合语音的混合语音幅度谱相乘,得到说话人的语音信息对应的对象语音幅度谱。终端可通过一级语音提取模型中的第一反傅里叶变换单元,根据混合语音的相位谱将对象语音幅度谱进行变换,得到说话人的初步识别语音。
终端可将第二掩码映射参数与混合语音的混合语音幅度谱相乘,得到混合语音中干扰信息对应的干扰幅度谱,终端可通过一级语音提取模型中的第二反傅里叶变换单元,根据混合语音的相位谱将干扰幅度谱进行变换,得到干扰语音。
上述实施例中，通过将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征输入至二级处理模型的特征融合层进行融合，可以得到更为准确的语音融合特征，进而通过二级语音提取模型基于更为准确的语音融合特征，确定说话人的干净语音，可以进一步提升语音提取准确率。
在一个实施例中,获取说话人的注册语音,并获取混合语音,包括:获取初始混合语音和说话人的初始注册语音;初始混合语音中包括说话人的语音信息;分别对初始混合语音和初始注册语音进行降噪处理,得到混合语音和说话人的注册语音。
其中,初始混合语音是未经过降噪处理的混合语音。初始注册语音是未经过降噪处理的注册语音。
具体地,终端可分别获取初始混合语音和说话人的初始注册语音,其中,初始混合语音中包括说话人的语音信息。可以理解,初始混合语音和初始注册语音中含有噪声,比如,含有大混响、高背景噪音和音乐噪音等中的至少一种。终端可对初始混合语音进行降噪处理,得到混合语音。终端可对初始注册语音进行降噪处理,得到说话人的注册语音。
在一个实施例中,混合语音和注册语音是通过预先训练的降噪网络进行降噪处理得到的。具体地,终端可将获取的初始混合语音和说话人的初始注册语音,分别输入至降噪网络,以通过降噪网络对初始混合语音和初始注册语音进行降噪处理,得到混合语音和说话人的注册语音。
上述实施例中,通过分别对初始混合语音和初始注册语音进行降噪处理,可以去除掉初始混合语音和初始注册语音中的噪音,得到不含噪声的混合语音和注册语音,从而后续基于不含噪声的混合语音和注册语音进行语音提取,可以进一步提升语音提取的准确率。
在一个实施例中,初步识别语音是通过预先训练的语音处理模型生成得到的;语音处理模型包括降噪网络和语音提取网络;混合语音和注册语音是通过降噪网络进行降噪处理得到的。依据注册语音特征,从混合语音中,提取出说话人的初步识别语音,包括:将注册语音的注册语音特征输入至语音提取网络,以通过语音提取网络对混合语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音。
本实施例中,语音处理模型包括降噪网络和语音提取网络。终端可将获取的初始混合语音和说话人的初始注册语音,分别输入至降噪网络,以通过降噪网络对初始混合语音和初始注册语音进行降噪处理,得到混合语音和说话人的注册语音。进而,终端可将混合语音和注册语音的注册语音特征输入至语音提取网络,以通过语音提取网络对混合语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音。
在一个实施例中，如图6所示，降噪网络包括傅里叶变换单元、编码器、长短期记忆单元、解码器和反傅里叶变换单元。可以理解，噪声语音包括初始混合语音和初始注册语音。干净语音包括混合语音和注册语音。终端可将噪声语音输入至降噪网络，以通过降噪网络中的傅里叶变换单元对噪声语音进行傅里叶变换，得到噪声语音的幅度谱和相位谱，进而，通过降噪网络中的编码器对噪声语音的幅度谱进行特征编码，得到编码后的特征，再通过降噪网络中的长短期记忆单元对编码后的特征进行特征提取，并通过降噪网络中的解码器对提取的特征进行解码，得到解码后的幅度谱，进而再通过降噪网络中的反傅里叶变换单元对解码后的幅度谱进行反傅里叶变换，得到干净语音。
上述实施例中,通过语音处理模型中的降噪网络对初始的混合语音和初始的注册语音进行降噪处理,可以得到不含噪声的混合语音和注册语音,提升语音降噪效果。进而通过语音提取网络对混合语音中说话人的语音信息进行初步识别,可以提升初步识别语音的提取准确率。
在一个实施例中,在获取到混合语音和注册语音后,分别通过训练过的降噪网络进行降噪,语音处理方法还包括:获取样本噪声语音,样本噪声语音是对作为参照的参考干净语音增加噪声得到;将样本噪声语音输入至待训练的降噪网络,以通过降噪网络对样本待降噪语音进行降噪,得到降噪后的预测语音;及根据预测语音和参考干净语音之间的差异,对待训练的降噪网络进行迭代训练,得到训练过的降噪网络。
本实施例中,混合语音和注册语音是通过预先训练的降噪网络进行降噪处理得到的。终端可获取样本噪声语音;样本噪声语音是通过对作为参照的参考干净语音增加噪声得到。终端可将样本噪声语音输入至待训练的降噪网络,以通过降噪网络对样本待降噪语音进行降噪处理,得到降噪后的预测语音。终端根据预测语音和参考干净语音之间的差异,对待训练的降噪网络进行迭代训练,得到预先训练的降噪网络。
其中,样本噪声语音是含有噪声的、且用于训练降噪网络的语音,样本噪声语音是通过对作为参照的干净语音增加噪声得到。参考干净语音是不含噪声的、且在训练降噪网络中起参照作用的语音。预测语音,是训练降噪网络过程中样本噪声语音经过降噪之后所预测得到的语音。
具体地,终端可获取作为参照的参考干净语音,并对参考干净语音增加噪声,得到样本噪声语音。进而,终端可将样本噪声语音输入至待训练的降噪网络,以通过降噪网络对样本待降噪语音进行降噪处理,得到降噪后的预测语音。终端可根据预测语音和参考干净语音之间的差异,对待训练的降噪网络进行迭代训练,得到预先训练的降噪网络。
在一个实施例中,终端可根据预测语音和参考干净语音之间的差异,确定降噪损失值,并根据降噪损失值,对待训练的降噪网络进行迭代训练,在迭代停止的情况下得到预先训练的降噪网络。
在一个实施例中，上述的降噪损失值可通过以下损失函数计算得到：
LossSDR = -10·log10(||X̂||² / ||X̂ - X||²)
其中，X̂代表参考干净语音，具体可以是参考干净语音本身，可以是参考干净语音的语音信号，可以是参考干净语音的能量值，还可以是参考干净语音在频域上各频率出现概率的概率分布。X代表预测语音，与X̂相同种类，具体可以是预测语音本身，可以是预测语音的语音信号，可以是预测语音的能量值，还可以是预测语音在频域上各频率出现概率的概率分布。LossSDR表示降噪损失值。||·||表示范数函数，具体可以是L2范数函数。
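作为示意，下面给出按上述SDR形式计算降噪损失值的一个numpy草图（该损失形式为常见写法之一，具体实施可按需要调整）：

```python
import numpy as np

def sdr_loss(pred, ref, eps=1e-8):
    """pred: 降噪后的预测语音波形；ref: 参考干净语音波形。返回负的信号失真比（dB）。"""
    ratio = np.sum(ref ** 2) / (np.sum((ref - pred) ** 2) + eps)
    return -10.0 * np.log10(ratio + eps)

ref = np.random.randn(16000)
print(sdr_loss(ref + 0.1 * np.random.randn(16000), ref))  # 约为 -20 左右
```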
上述实施例中,通过预测语音和干净语音之间的差异,对待训练的降噪网络进行迭代训练,可以提升降噪网络的降噪能力。
在一个实施例中,初步识别语音是通过预先训练的语音提取网络提取得到的。该方法还包括:获取样本数据;样本数据包括样本混合语音和样本说话人的样本注册语音特征;样本混合语音是通过对样本说话人的样本干净语音增加噪声得到的;将样本数据输入至待训练的语音提取网络,以通过语音提取网络依据样本注册语音特征,对样本混合语音中样本说话人的样本语音信息进行识别,得到样本说话人的预测干净语音;根据预测干净语音和样本干净语音之间的差异,对待训练的语音提取网络进行迭代训练,得到预先训练的语音提取网络。
其中，样本数据是用于训练语音提取网络的数据。样本混合语音是用于训练语音提取网络的混合语音。样本说话人是训练语音提取网络过程中所涉及到的说话人。样本注册语音特征是用于训练语音提取网络的注册语音特征。样本干净语音是仅含样本说话人的语音信息的、且在训练语音提取网络中起参照作用的语音。预测干净语音，是训练语音提取网络过程中从样本混合语音中提取得到的样本说话人的语音。
终端可获取样本说话人的样本干净语音,并将样本说话人的样本干净语音增加噪声,得到样本混合语音。终端可获取样本说话人的样本注册语音,并对样本注册语音进行特征提取,得到样本说话人的样本注册语音特征。进而,终端可根据样本混合语音和样本说话人的样本注册语音特征一起作为样本数据。终端可将样本数据输入至待训练的语音提取网络,以通过语音提取网络依据样本注册语音特征,对样本混合语音中样本说话人的样本语音信息进行识别,得到样本说话人的预测干净语音,并根据预测干净语音和样本干净语音之间的差异,对待训练的语音提取网络进行迭代训练,得到预先训练的语音提取网络。
在一个实施例中,终端可根据预测干净语音和样本干净语音之间的差异,确定提取损失值,并根据提取损失值,对待训练的语音提取网络进行迭代训练,在迭代停止的情况下得到预先训练的语音提取网络。
在一个实施例中，上述的提取损失值可通过以下损失函数计算得到：
LossMAE = (1/N)·Σi |X̂i - Yi|，i = 1, 2, …, N
其中，i表示N个样本混合语音中的第i个，X̂i代表第i个样本混合语音，具体可以是第i个样本混合语音本身，可以是第i个样本混合语音的语音信号，可以是第i个样本混合语音的能量值，还可以是第i个样本混合语音在频域上各频率出现概率的概率分布。Yi代表预测干净语音，具体可以是预测干净语音本身，可以是预测干净语音的语音信号，可以是预测干净语音的能量值，还可以是预测干净语音在频域上各频率出现概率的概率分布。LossMAE表示提取损失值。
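作为示意，下面给出按平均绝对误差计算提取损失值的一个numpy草图（批量的组织方式为示例性假设）：

```python
import numpy as np

def mae_loss(pred_batch, ref_batch):
    """对 N 条样本，计算预测干净语音与参照之间的平均绝对误差。"""
    pred = np.asarray(pred_batch, dtype=float)
    ref = np.asarray(ref_batch, dtype=float)
    return float(np.mean(np.abs(pred - ref)))

print(mae_loss(np.random.randn(4, 16000), np.random.randn(4, 16000)))
```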
上述实施例中，通过预测干净语音和样本干净语音之间的差异，对待训练的语音提取网络进行迭代训练，可以提升语音提取网络的语音提取准确率。
在一个实施例中,上述的语音处理模型还包括注册网络。注册语音特征是通过注册网络提取得到的。注册网络包括梅尔频率谱生成单元、长短期记忆单元和特征生成单元。如图7所示,终端可通过注册网络中的梅尔频率谱生成单元,提取注册语音的频率谱,并根据频率谱,生成注册语音的梅尔频率谱。终端可通过注册网络中的长短期记忆单元,对梅尔频率谱进行特征提取,得到多个特征向量。进而,终端可通过注册网络中的特征生成单元,在时间维度上对上述的多个特征向量求平均,得到注册语音的注册语音特征。
上述实施例中,通过提取注册语音的频率谱,以将时域的注册语音信号转换为频域的信号。进而再根据频率谱生成注册语音的梅尔频率谱,并对梅尔频率谱进行特征提取,可以提升注册语音特征的提取准确率。
在一个实施例中,获取说话人的注册语音,并获取混合语音,包括:响应于通话触发操作,确定通话触发操作指定的说话人,从预先存储的候选的注册语音中,确定说话人的注册语音;在基于通话触发操作与说话人对应的终端建立有语音通话的情况下,接收说话人对应的终端在语音通话中发送的混合语音。
本实施例中,终端响应于针对说话人的通话触发操作,从预先存储的候选的注册语音中,确定说话人的注册语音。终端在基于通话触发操作与说话人对应的终端建立语音通话的情况下,接收说话人对应的终端在语音通话中发送的混合语音。
在语音通话的场景下，用户可基于终端向说话人发起通话请求，即，终端可响应于用户针对说话人的通话触发操作，从预先存储的候选的注册语音中，查找说话人的注册语音。同时，终端可响应于通话触发操作，生成针对说话人的通话请求，并将通话请求发送至说话人对应的终端。在基于通话请求与说话人对应的终端建立语音通话的情况下，终端可接收说话人对应的终端在语音通话中发送的混合语音。
可以理解,终端可在语音通话过程中,依据注册语音的注册语音特征,对接收到的混合语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音,根据注册语音特征,确定注册语音和初步识别语音中语音信息之间的语音相似度,从初步识别语音中确定语音相似度小于预设相似度的语音信息,得到待过滤语音信息,并将初步识别语音中待过滤语音信息进行过滤处理,得到说话人的干净语音。
上述实施例中,通过响应于针对说话人的通话触发操作,可以从预先存储的候选的注册语音中,确定说话人的注册语音。通过在基于通话触发操作与说话人对应的终端建立语音通话的情况下,接收说话人对应的终端在语音通话中发送的混合语音,可以实现在通话场景下提取说话人的语音,从而提升通话质量。
在一个实施例中,获取说话人的注册语音,并获取混合语音,包括:获取多媒体对象的多媒体语音;多媒体语音是包括多个说话人的语音信息的混合语音;响应于针对多媒体语音中说话人的指定操作,获取指定的说话人的标识;说话人是从多个发声对象中指定的需提取语音的说话人;从针对多媒体语音中各说话人预先存储的注册语音中,获取与说话人的标识具有映射关系的注册语音,得到说话人的注册语音。
本实施例中,终端可获取多媒体对象的多媒体语音,多媒体语音是包括多个发声对象的语音信息的混合语音。终端可响应于针对多媒体语音中的说话人的指定操作,获取指定的说话人的标识,说话人是多个发声对象中指定提取语音的发声对象。终端可从针对多媒体语音中各发声对象预先存储的注册语音中,获取与说话人的标识具有映射关系的注册语音,得到说话人的注册语音。多个发声对象可以是多个说话人,指定的说话人可以称之为目标说话人。
其中,多媒体对象是一种多媒体文件,多媒体对象包括视频对象和音频对象。多媒体语音是多媒体对象中的语音。标识是用于唯一标识说话人身份的字符串。
终端可从多媒体对象中提取多媒体语音,可以理解,该多媒体语音是包括多个说话人的语音信息的混合语音。终端可响应于针对多媒体语音中的说话人的指定操作,获取指定的说话人的标识,可以理解,说话人是多个说话人中指定提取语音的说话人。终端可从针对多媒体语音中各说话人预先存储的注册语音中,查找到与该标识具有映射关系的注册语音,作为指定的说话人的注册语音。
可以理解,终端可从多媒体语音中提取得到说话人的干净语音,具体地,终端可依据该说话人的注册语音的注册语音特征,对多媒体语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音,根据注册语音特征,确定注册语音和初步识别语音中语音信息之间的语音相似度,从初步识别语音中确定语音相似度小于预设相似度的语音信息,得到待过滤语音信息,并将初步识别语音中待过滤语音信息进行过滤处理,得到说话人的干净语音。
上述实施例中,通过获取多媒体对象的多媒体语音,并响应于针对多媒体语音中的说话人的指定操作,可以获取指定的说话人的标识。进而从针对多媒体语音中各说话人预先存储的注册语音中,可以获取与说话人的标识具有映射关系的注册语音,得到说话人的注册语音,可以实现从多媒体对象中提取用户感兴趣的说话人的语音。可以快速指定提取干净语音的说话人并提取干净语音,避免因多人语音环境嘈杂导致无法听清而消耗额外的资源。
在一个实施例中，如图8所示，本申请的语音处理方法可应用于影视视频或语音通话中的语音提取场景。具体地，针对应用于影视视频的场景，终端可获取影视视频的视频语音，视频语音是包括多个说话人的语音信息的混合语音。终端可响应于针对视频语音中的说话人的指定操作，获取指定的目标说话人的标识，目标说话人是多个说话人中指定提取语音的说话人。终端可从针对视频语音中各说话人预先存储的注册语音中，获取与该标识具有映射关系的注册语音，得到目标说话人的注册语音。从而通过本申请的语音处理方法，基于注册语音从视频语音中提取出目标说话人的干净语音。针对应用于语音通话的场景，终端可响应于针对目标说话人的通话触发操作，从预先存储的候选的注册语音中，确定目标说话人的注册语音，在基于通话触发操作与目标说话人对应的终端建立语音通话的情况下，接收目标说话人对应的终端在语音通话中发送的混合语音。从而通过本申请的语音处理方法，基于注册语音从语音通话过程中获取的混合语音中，提取出目标说话人的干净语音。
在一个实施例中，干净语音是通过语音处理模型和过滤处理单元生成得到的，其中，语音处理模型包括降噪网络、注册网络和语音提取网络。如图9所示，终端可通过语音处理模型中的降噪网络对初始混合语音和初始注册语音分别进行降噪，得到降噪后的混合语音和降噪后的注册语音。终端可通过语音处理模型中的注册网络对降噪后的注册语音进行特征编码，得到注册语音特征。终端可根据注册语音特征，通过语音处理模型中的语音提取网络从降噪后的混合语音中提取得到初步识别语音。进而，终端再使用过滤处理单元，基于注册语音特征对初步识别语音进行过滤处理，得到说话人的干净语音。
在一个实施例中,如图10所示,终端使用过滤处理单元,基于注册语音特征对初步识别语音进行过滤处理,得到说话人的干净语音的具体实现如下:针对初步识别语音中的每一个语音片段,终端可通过上述的注册网络对该语音片段进行特征提取,得到该语音片段的片段语音特征,进而,终端可根据片段语音特征和注册语音特征,确定注册语音和该语音片段之间的语音相似度。终端可将相似度大于或等于预设语音相似度阈值的语音片段进行存储,并将相似度小于预设语音相似度阈值的语音片段置为静音。进而,终端可根据保留下来的语音片段,生成说话人的干净语音。
如图11所示,在一个实施例中,提供了一种语音处理方法,本实施例以该方法应用于图1中的终端102为例进行说明,该方法具体包括以下步骤:
步骤1102,获取混合语音和说话人的注册语音;混合语音中包括说话人的语音信息;语音信息包括语音片段。
步骤1104，将混合语音和注册语音特征输入至语音提取模型，基于混合语音和注册语音特征，在第一处理模式下至少生成第一掩码信息，在第二处理模式下生成第一掩码信息和第二掩码信息。可以理解，在第一处理模式下也可以生成第一掩码信息和第二掩码信息。
步骤1106,根据第一掩码信息屏蔽混合语音中的干扰信息,得到说话人的初步识别语音。
步骤1108,根据第二掩码信息屏蔽混合语音中说话人的语音信息,得到干扰语音。
步骤1110,在第一处理模式下,针对初步识别语音中的每一个语音片段,将语音片段进行重复处理,得到与注册语音的时间长度一致的重组语音;其中,重组语音包括多个语音片段。
步骤1112,根据重组语音的重组语音特征确定语音片段对应的片段语音特征。
步骤1114,根据片段语音特征和注册语音特征,确定注册语音和语音片段之间的语音相似度。
步骤1116,从初步识别语音中确定语音相似度小于预设相似度的语音信息,得到待过滤语音信息。
步骤1118,将初步识别语音中待过滤语音信息进行过滤,得到说话人的干净语音。
步骤1120,在第二处理模式下,将混合语音的混合语音特征和初步识别语音的语音特征基于注意力机制进行融合,得到第一注意力特征,以及将混合语音特征和干扰语音的语音特征基于注意力机制进行融合,得到第二注意力特征。
步骤1122,将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行融合,并基于融合后的特征得到说话人的干净语音。
本申请还提供一种应用场景,该应用场景应用上述的语音处理方法。具体地,该语音处理方法可应用于影视视频中语音提取的场景。可以理解,影视视频中包括影视语音(即混合语音),该影视语音中包括多个演员(即说话人)的语音信息。具体地,终端可获取初始影视语音和目标演员的初始注册语音;初始影视语音中包括目标演员的语音信息;语音信息包括语音片段。将混合语音和注册语音特征输入至语音提取模型,以通过语音提取模型基于混合语音和注册语音特征,生成第一掩码信息和第二掩码信息。根据第一掩码信息屏蔽混合语音中的干扰信息,得到目标演员的初始影视语音;根据第二掩码信息屏蔽混合语音中目标演员的语音信息,得到干扰语音。
在第一处理模式下,针对初始影视语音中的每一个语音片段,终端可将语音片段进行重复处理,得到与注册语音的时间长度一致的重组语音;其中,重组语音包括多个语音片段。根据重组语音的重组语音特征确定语音片段对应的片段语音特征。根据片段语音特征和注册语音特征,确定注册语音和语音片段之间的语音相似度。从初始影视语音中确定语音相似度小于预设相似度的语音信息,得到待过滤语音信息。将初始影视语音中待过滤语音信息进行过滤处理,得到目标演员的干净语音。
在第二处理模式下,终端可将混合语音的混合语音特征和初始影视语音的语音特征基于注意力机制进行融合,得到第一注意力特征,以及将混合语音特征和干扰语音的语音特征基于注意力机制进行融合,得到第二注意力特征。将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行融合,并基于融合后的特征得到目标演员的干净语音。通过本申请的语音处理方法,可以准确提取得到用户感兴趣的演员的声音,提升演员语音的提取准确率。
本申请还另外提供一种应用场景,该应用场景应用上述的语音处理方法。具体地,该语音处理方法可应用于语音通话中语音提取的场景。具体地,终端可响应于针对目标通话人(即说话人)的通话触发操作,从预先存储的候选的注册语音中,确定目标通话人的注册语音。在基于通话触发操作与目标通话人对应的终端建立语音通话的情况下,接收目标通话人对应的终端在语音通话中发送的通话语音(即混合语音)。可以理解,通过本申请的语音处理方法,可以从通话语音中提取出目标通话人的声音,以提升通话质量。
此外,本申请还另外提供一种应用场景,该应用场景应用上述的语音处理方法。具体地,该语音处理方法可应用于训练神经网络模型之前的针对训练数据的获取场景。具体地,训练神经网络模型需要大量的训练数据,通过本申请的语音处理方法可从复杂的混合语音中提取感兴趣的干净语音,以作为训练数据。通过本申请的语音处理方法,可快速获取到大批量的训练数据,相较于传统的人工提取的方式,节省了人力成本。
应该理解的是,虽然上述各实施例的流程图中的各个步骤按照顺序依次显示,但是这些步骤并不是必然按照顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,上述各实施例中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
在一个实施例中,如图12所示,提供了一种语音处理装置1200,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:
获取模块1202,用于获取说话人的注册语音,并获取混合语音,混合语音包括多个发声对象的语音信息,多个发声对象包括说话人。
第一提取模块1204,用于确定注册语音的注册语音特征,依据注册语音特征,从混合语音中,提取出说话人的初步识别语音。
确定模块1206,用于根据注册语音特征,确定注册语音和初步识别语音所包括语音信息之间的语音相似度。
过滤模块1208,用于从初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到说话人的干净语音。
第一提取模块1204,还用于确定混合语音的混合语音特征;将混合语音特征和注册语音的注册语音特征进行融合,得到语音融合特征;及基于语音融合特征,对混合语音中说话人的语音信息进行初步识别,得到说话人的初步识别语音。
在一个实施例中,混合语音特征包括混合语音特征矩阵,语音融合特征包括语音融合特征矩阵,第一提取模块1204还用于将注册语音特征向量在时间维度上重复,以生成注册语音特征矩阵,注册语音特征矩阵的时间维度与混合语音特征矩阵的时间维度相同;及将混合语音特征矩阵和注册语音特征矩阵拼接,得到语音融合特征矩阵。
在一个实施例中,第一提取模块1204还用于提取混合语音的幅度谱,得到第一幅度谱;对第一幅度谱进行特征提取,得到幅度谱特征;及对幅度谱特征进行特征提取,得到混合语音的混合语音特征。
在一个实施例中,第一提取模块1204还用于基于语音融合特征,对混合语音中说话人的语音信息进行初步识别,得到说话人的语音特征;对说话人的语音特征进行特征解码,得到第二幅度谱;及根据混合语音的相位谱将第二幅度谱进行变换,得到说话人的初步识别语音。
在一个实施例中,第一提取模块1204还用于提取注册语音的频率谱;根据频率谱,生成注册语音的梅尔频率谱;及对梅尔频率谱进行特征提取,得到注册语音的注册语音特征。
在一个实施例中,初步识别语音中语音信息包括语音片段;确定模块1206还用于按照注册语音的时间长度,对初步识别语音中的每一个语音片段分别进行重复,得到时间长度的重组语音;获取从重组语音提取的重组语音特征,根据重组语音特征确定初步识别语音中每一个语音片段对应的片段语音特征;及分别根据每个语音片段对应的片段语音特征和注册语音特征,确定注册语音和每个语音片段之间的语音相似度。
在一个实施例中,在获取到混合语音和注册语音后,分别通过训练过的降噪网络进行降噪,装置1200还包括降噪网络训练模块,用于获取样本噪声语音,样本噪声语音是对作为参照的参考干净语音增加噪声得到;将样本噪声语音输入至待训练的降噪网络,以通过降噪网络对样本待降噪语音进行降噪,得到降噪后的预测语音;及根据预测语音和参考干净语音之间的差异,对待训练的降噪网络进行迭代训练,得到训练过的降噪网络。
在一个实施例中,确定模块1206,还用于在第一处理模式下,根据注册语音特征,确定注册语音和初步识别语音所包括语音信息之间的语音相似度。
过滤模块1208,还用于在第一处理模式下,从初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到说话人的干净语音。
在一个实施例中,装置1200还包括一级语音提取模型,用于在第二处理模式下,获取干扰语音,干扰语音,是依据注册语音特征,从混合语音中提取出的。
在一个实施例中,装置1200还包括二级处理模型,用于获取混合语音的混合语音特征、初步识别语音的语音特征、以及干扰语音的语音特征;将混合语音特征和初步识别语音的语音特征,基于注意力机制融合,得到第一注意力特征;将混合语音特征和干扰语音的语音特征,基于注意力机制融合,得到第二注意力特征;及基于混合语音特征、第一注意力特征和第二注意力特征融合,并基于融合后的特征得到说话人的干净语音。
在一个实施例中,二级处理模型还用于将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征进行融合,并基于融合后的特征得到说话人的干净语音。
在一个实施例中，一级语音提取模型，还用于在输入了混合语音和注册语音特征后，基于混合语音和注册语音特征，生成第一掩码信息和第二掩码信息，根据第一掩码信息屏蔽混合语音中的干扰信息，得到说话人的初步识别语音，根据第二掩码信息屏蔽混合语音中说话人的语音信息，得到干扰语音。
在一个实施例中,一级语音提取模型中训练过的模型参数包括第一掩码映射参数和第二掩码映射参数,一级语音提取模型还用于基于语音提取模型的第一掩码映射参数、混合语音和注册语音特征,生成第一掩码信息;及基于语音提取模型的第二掩码映射参数、混合语音和注册语音特征,生成第二掩码信息。
在一个实施例中,混合语音的混合语音特征、初步识别语音的语音特征、以及干扰语音的语音特征,是将混合语音、初步识别语音和干扰语音分别输入至二级处理模型中的特征提取层后,由特征提取层提取的。
第一注意力特征,是由二级处理模型中的第一注意力单元,将混合语音特征和初步识别语音的语音特征进行注意力机制融合得到的。
第二注意力特征,是由二级处理模型中的第二注意力单元,将混合语音特征和干扰语音的语音特征进行注意力机制融合得到的。
在一个实施例中,二级处理模型还包括特征融合层和二级语音提取模型,二级处理模型还用于将混合语音特征、第一注意力特征、第二注意力特征和注册语音特征输入至特征融合层进行融合,得到语音融合特征;及将语音融合特征输入至二级语音提取模型,以通过二级语音提取模型基于语音融合特征得到说话人的干净语音。
在一个实施例中,获取模块1202,还用于响应于通话触发操作,确定通话触发操作指定的说话人,从预先存储的候选的注册语音中,确定说话人的注册语音;在基于通话触发操作与说话人对应的终端建立有语音通话的情况下,接收说话人对应的终端在语音通话中发送的混合语音。
在一个实施例中,获取模块1202还用于获取多媒体对象的多媒体语音;多媒体语音是包括多个说话人的语音信息的混合语音;响应于针对多媒体语音中说话人的指定操作,获取指定的说话人的标识;说话人是从多个发声对象中指定的需提取语音的说话人;从针对多媒体语音中各说话人预先存储的注册语音中,获取与说话人的标识具有映射关系的注册语音,得到说话人的注册语音。
上述语音处理装置1200,通过获取混合语音和说话人的注册语音,混合语音中包括说话人的语音信息。依据注册语音的注册语音特征,从混合语音中初步提取出说话人的初步识别语音,能够初步较为准确地提取到说话人的初步识别语音。进而,会在初步识别语音的基础上进行进阶地过滤处理,即,根据注册语音特征,确定注册语音和初步识别语音中语音信息之间的语音相似度,并从初步识别语音中过滤掉语音相似度小于预设相似度的语音信息,就可以将初步识别语音中残留的噪声过滤掉,从而得到更为干净的说话人的干净语音,提升语音提取的准确率。
上述语音处理装置1200中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图13所示。该计算机设备包括处理器、存储器、输入/输出接口、通信接口、显示单元和输入装置。其中,处理器、存储器和输入/输出接口通过系统总线连接,通信接口、显示单元和输入装置通过输入/输出接口连接到系统总线。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信,无线方式可通过WIFI、移动蜂窝网络、NFC(近场通信)或其他技术实现。该 计算机可读指令被处理器执行时以实现一种语音处理方法。该计算机设备的显示单元用于形成视觉可见的画面,可以是显示屏、投影装置或虚拟现实成像装置,显示屏可以是液晶显示屏或电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图13中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,还提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机可读指令,该处理器执行计算机可读指令时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机可读指令,该计算机可读指令被处理器执行时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机可读指令产品,包括计算机可读指令,计算机可读指令被处理器执行时实现上述各方法实施例中的步骤。
需要说明的是,本申请所涉及的用户信息(包括但不限于用户设备信息、用户个人信息等)和数据(包括但不限于用于分析的数据、存储的数据、展示的数据等),均为经用户授权或者经过各方充分授权的信息和数据,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种语音处理方法,包括:
    获取说话人的注册语音,并获取混合语音,所述混合语音包括多个发声对象的语音信息,所述多个发声对象包括所述说话人;
    确定所述注册语音的注册语音特征;
    依据所述注册语音特征,从所述混合语音中提取出所述说话人的初步识别语音;
    根据所述注册语音特征,确定所述注册语音和所述初步识别语音所包括语音信息之间的语音相似度;及
    从所述初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到所述说话人的干净语音。
  2. 根据权利要求1所述的方法,所述依据所述注册语音特征,从所述混合语音中,提取出所述说话人的初步识别语音,包括:
    确定所述混合语音的混合语音特征;
    将所述混合语音特征和所述注册语音的注册语音特征进行融合,得到语音融合特征;及
    基于所述语音融合特征,对所述混合语音中所述说话人的语音信息进行初步识别,得到所述说话人的初步识别语音。
  3. 根据权利要求2所述的方法,所述混合语音特征包括混合语音特征矩阵,所述语音融合特征包括语音融合特征矩阵,所述注册语音特征包括注册语音特征向量,所述将所述混合语音特征和所述注册语音的注册语音特征进行融合,得到语音融合特征,包括:
    将所述注册语音特征向量在时间维度上重复,以生成注册语音特征矩阵,所述注册语音特征矩阵的时间维度与所述混合语音特征矩阵的时间维度相同;及
    将所述混合语音特征矩阵和所述注册语音特征矩阵拼接,得到语音融合特征矩阵。
  4. 根据权利要求2或3所述的方法,所述确定所述混合语音的混合语音特征,包括:
    提取所述混合语音的幅度谱,得到第一幅度谱;
    对所述第一幅度谱进行特征提取,得到幅度谱特征;及
    对所述幅度谱特征进行特征提取,得到所述混合语音的混合语音特征。
  5. 根据权利要求4所述的方法,所述基于所述语音融合特征,对所述混合语音中所述说话人的语音信息进行初步识别,得到所述说话人的初步识别语音,包括:
    基于所述语音融合特征,对所述混合语音中所述说话人的语音信息进行初步识别,得到所述说话人的语音特征;
    对所述说话人的语音特征进行特征解码,得到第二幅度谱;及
    根据所述混合语音的相位谱将所述第二幅度谱进行变换,得到所述说话人的初步识别语音。
  6. 根据权利要求1至5中任一项所述的方法,所述确定所述注册语音的注册语音特征,包括:
    提取所述注册语音的频率谱;
    根据所述频率谱,生成所述注册语音的梅尔频率谱;及
    对所述梅尔频率谱进行特征提取,得到所述注册语音的注册语音特征。
  7. 根据权利要求1至6中任一项所述的方法,所述初步识别语音中语音信息包括语音片段;所述根据所述注册语音特征,确定所述注册语音和所述初步识别语音中语音信息之间的语音相似度,包括:
    按照所述注册语音的时间长度,对所述初步识别语音中的每一个语音片段分别进行重复,得到所述时间长度的重组语音;
    获取从所述重组语音提取的重组语音特征，根据所述重组语音特征确定所述初步识别语音中每一个语音片段对应的片段语音特征；及
    分别根据每个语音片段对应的所述片段语音特征和所述注册语音特征,确定所述注册语音和每个所述语音片段之间的语音相似度。
  8. 根据权利要求1至7中任一项所述的方法,在获取到所述混合语音和所述注册语音后,分别通过训练过的降噪网络进行降噪,所述方法还包括:
    获取样本噪声语音,所述样本噪声语音是对作为参照的参考干净语音增加噪声得到;
    将所述样本噪声语音输入至待训练的降噪网络,以通过所述降噪网络对所述样本待降噪语音进行降噪,得到降噪后的预测语音;及
    根据所述预测语音和所述参考干净语音之间的差异,对所述待训练的降噪网络进行迭代训练,得到训练过的降噪网络。
  9. 根据权利要求1至8任一项所述的方法,所述根据所述注册语音特征,确定所述注册语音和所述初步识别语音所包括语音信息之间的语音相似度的步骤,以及所述从所述初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到所述说话人的干净语音的步骤,是在第一处理模式下执行的,所述方法还包括:
    在第二处理模式下,获取干扰语音,所述干扰语音,是依据所述注册语音特征,从所述混合语音中提取出的;
    获取所述混合语音的混合语音特征、所述初步识别语音的语音特征、以及所述干扰语音的语音特征;
    将所述混合语音特征和所述初步识别语音的语音特征,基于注意力机制融合,得到第一注意力特征;
    将所述混合语音特征和所述干扰语音的语音特征,基于注意力机制融合,得到第二注意力特征;及
    基于所述混合语音特征、所述第一注意力特征和所述第二注意力特征融合,并基于融合后的特征得到所述说话人的干净语音。
  10. 根据权利要求9所述的方法,所述基于所述混合语音特征、所述第一注意力特征和所述第二注意力特征融合,并基于融合后的特征得到所述说话人的干净语音,包括:
    将所述混合语音特征、所述第一注意力特征、所述第二注意力特征和所述注册语音特征进行融合,并基于融合后的特征得到所述说话人的干净语音。
  11. 根据权利要求10所述的方法,所述初步识别语音和所述干扰语音是通过训练过的语音提取模型从所述混合语音中提取出的,所述方法还包括:
    将所述混合语音和所述注册语音特征输入至所述语音提取模型;
    通过所述语音提取模型,基于所述混合语音和所述注册语音特征,生成第一掩码信息和第二掩码信息;
    通过所述语音提取模型,根据所述第一掩码信息屏蔽所述混合语音中的干扰信息,得到所述说话人的初步识别语音;及
    通过所述语音提取模型,根据所述第二掩码信息屏蔽所述混合语音中所述说话人的语音信息,得到干扰语音。
  12. 根据权利要求11所述的方法,所述语音提取模型中训练过的模型参数包括第一掩码映射参数和第二掩码映射参数,所述通过所述语音提取模型,基于所述混合语音和所述注册语音特征,生成第一掩码信息和第二掩码信息,包括:
    基于所述语音提取模型的第一掩码映射参数、所述混合语音和所述注册语音特征,生成第一掩码信息;及
    基于所述语音提取模型的第二掩码映射参数、所述混合语音和所述注册语音特征,生成第二掩码信息。
  13. 根据权利要求10至12任一项所述的方法，所述混合语音的混合语音特征、所述初步识别语音的语音特征、以及所述干扰语音的语音特征，是将所述混合语音、所述初步识别语音和所述干扰语音分别输入至二级处理模型中的特征提取层后，由所述特征提取层提取的；
    所述第一注意力特征,是由所述二级处理模型中的第一注意力单元,将所述混合语音特征和所述初步识别语音的语音特征进行注意力机制融合得到的;及
    所述第二注意力特征,是由所述二级处理模型中的第二注意力单元,将所述混合语音特征和所述干扰语音的语音特征进行注意力机制融合得到的。
  14. 根据权利要求13所述的方法,所述初步识别语音和所述干扰语音,是通过一级语音提取模型从所述混合语音中提取的,所述二级处理模型还包括特征融合层和二级语音提取模型,所述基于所述混合语音特征、所述第一注意力特征和所述第二注意力特征融合,并基于融合后的特征得到所述说话人的干净语音,包括:
    将所述混合语音特征、所述第一注意力特征、所述第二注意力特征和所述注册语音特征输入至所述特征融合层进行融合,得到语音融合特征;及
    将所述语音融合特征输入至所述二级语音提取模型,以通过所述二级语音提取模型基于所述语音融合特征得到所述说话人的干净语音。
  15. 根据权利要求1至14中任一项所述的方法,所述获取说话人的注册语音,并获取混合语音,包括:
    响应于通话触发操作,确定所述通话触发操作指定的说话人,从预先存储的候选的注册语音中,确定所述说话人的注册语音;
    在基于所述通话触发操作与所述说话人对应的终端建立有语音通话的情况下,接收所述说话人对应的终端在所述语音通话中发送的混合语音。
  16. 根据权利要求1至14中任一项所述的方法,所述获取说话人的注册语音,并获取混合语音,包括:
    获取多媒体对象的多媒体语音;所述多媒体语音是包括多个说话人的语音信息的混合语音;
    响应于针对多媒体语音中说话人的指定操作,获取指定的说话人的标识;所述说话人是从所述多个发声对象中指定的需提取语音的说话人;
    从针对多媒体语音中各说话人预先存储的注册语音中,获取与所述说话人的标识具有映射关系的注册语音,得到所述说话人的注册语音。
  17. 一种语音处理装置,包括:
    获取模块,用于获取说话人的注册语音,并获取混合语音,所述混合语音包括多个发声对象的语音信息,所述多个发声对象包括所述说话人;
    第一提取模块,用于确定所述注册语音的注册语音特征,依据所述注册语音特征,从所述混合语音中,提取出所述说话人的初步识别语音;
    确定模块,用于根据所述注册语音特征,确定所述注册语音和所述初步识别语音所包括语音信息之间的语音相似度;及
    过滤模块,用于从所述初步识别语音中,滤除语音相似度小于预设相似度的语音信息,得到所述说话人的干净语音。
  18. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,所述处理器执行所述计算机可读指令时执行权利要求1至16中任一项所述的方法。
  19. 一种计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被处理器执行时执行权利要求1至16中任一项所述的方法。
  20. 一种计算机可读指令产品,包括计算机可读指令,所述计算机可读指令被处理器执行时执行权利要求1至16中任一项所述的方法。
PCT/CN2023/121068 2022-10-21 2023-09-25 语音处理方法、装置、设备和介质 WO2024082928A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211297843.3 2022-10-21
CN202211297843.3A CN116978358A (zh) 2022-10-21 2022-10-21 语音处理方法、装置、设备和介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/431,826 Continuation US20240177717A1 (en) 2022-10-21 2024-02-02 Voice processing method and apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2024082928A1 true WO2024082928A1 (zh) 2024-04-25

Family

ID=88475462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/121068 WO2024082928A1 (zh) 2022-10-21 2023-09-25 语音处理方法、装置、设备和介质

Country Status (2)

Country Link
CN (1) CN116978358A (zh)
WO (1) WO2024082928A1 (zh)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962237A (zh) * 2018-05-24 2018-12-07 腾讯科技(深圳)有限公司 混合语音识别方法、装置及计算机可读存储介质
KR20200040425A (ko) * 2018-10-10 2020-04-20 주식회사 케이티 화자 인식 장치 및 그 동작방법
CN112053695A (zh) * 2020-09-11 2020-12-08 北京三快在线科技有限公司 声纹识别方法、装置、电子设备及存储介质
CN113823293A (zh) * 2021-09-28 2021-12-21 武汉理工大学 一种基于语音增强的说话人识别方法及系统
CN114495973A (zh) * 2022-01-25 2022-05-13 中山大学 一种基于双路径自注意力机制的特定人语音分离方法
CN114898762A (zh) * 2022-05-07 2022-08-12 北京快鱼电子股份公司 基于目标人的实时语音降噪方法、装置和电子设备

Also Published As

Publication number Publication date
CN116978358A (zh) 2023-10-31

Similar Documents

Publication Publication Date Title
CN110600017B (zh) 语音处理模型的训练方法、语音识别方法、系统及装置
CN109801644B (zh) 混合声音信号的分离方法、装置、电子设备和可读介质
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
WO2019196196A1 (zh) 一种耳语音恢复方法、装置、设备及可读存储介质
JP2019216408A (ja) 情報を出力するための方法、及び装置
WO2021082941A1 (zh) 视频人物识别方法、装置、存储介质与电子设备
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN112435684A (zh) 语音分离方法、装置、计算机设备和存储介质
CN111883107A (zh) 语音合成、特征提取模型训练方法、装置、介质及设备
CN114203163A (zh) 音频信号处理方法及装置
CN113205793B (zh) 音频生成方法、装置、存储介质及电子设备
CN114333865A (zh) 一种模型训练以及音色转换方法、装置、设备及介质
WO2022062800A1 (zh) 语音分离方法、电子设备、芯片及计算机可读存储介质
WO2024082928A1 (zh) 语音处理方法、装置、设备和介质
US20210249033A1 (en) Speech processing method, information device, and computer program product
CN116129931A (zh) 一种视听结合的语音分离模型搭建方法及语音分离方法
WO2022166738A1 (zh) 语音增强方法、装置、设备及存储介质
CN115472174A (zh) 声音降噪方法和装置、电子设备和存储介质
Zhao et al. Radio2Speech: High Quality Speech Recovery from Radio Frequency Signals
CN117373468A (zh) 远场语音增强处理方法、装置、计算机设备和存储介质
US20240177717A1 (en) Voice processing method and apparatus, device, and medium
CN111916095A (zh) 语音增强方法、装置、存储介质及电子设备
WO2024055751A1 (zh) 音频数据处理方法、装置、设备、存储介质及程序产品
CN117649846B (zh) 语音识别模型生成方法、语音识别方法、设备和介质
CN117316160B (zh) 无声语音识别方法、装置、电子设备和计算机可读介质