CN113257230B - Voice processing method and device and computer storage medium - Google Patents

Voice processing method and device and computer storage medium

Info

Publication number
CN113257230B
CN113257230B · Application CN202110694885.XA
Authority
CN
China
Prior art keywords
speaker
identity
voice
current
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110694885.XA
Other languages
Chinese (zh)
Other versions
CN113257230A (en)
Inventor
李成飞
汪光璟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110694885.XA
Publication of CN113257230A
Application granted
Publication of CN113257230B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a voice processing method and device and a computer storage medium, in the field of voice processing. The voice processing method comprises: performing feature extraction on the voice of a current speaker to obtain voice features; determining, from the voice features, a current text content feature and a current speaker identity feature using a first encoder and a second encoder that have different parameters; determining, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature. According to the present disclosure, the accuracy of speech processing can be improved.

Description

Voice processing method and device and computer storage medium
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a speech processing method and apparatus, and a computer-readable storage medium.
Background
In the related art, a single encoder encodes the speech features of a voice to obtain both text content features and speaker identity features, which are then fed into two different decoders to obtain the text content information and speaker identity information corresponding to the voice.
Disclosure of Invention
In the related art, a single encoder performs multiple encoding tasks at once: it is harder to train, requires far more training data than a single-task encoder, and achieves lower accuracy, which in turn lowers the accuracy of speech processing. Moreover, because the two decoders decode the text content features and the speaker identity features separately, neither decoding task can assist the other, which also limits the accuracy of speech processing.
In order to solve the technical problem, the present disclosure provides a solution to improve the accuracy of speech processing.
According to a first aspect of the present disclosure, there is provided a speech processing method comprising: performing feature extraction on the voice of a current speaker to obtain voice features; determining, from the voice features, a current text content feature and a current speaker identity feature using a first encoder and a second encoder that have different parameters; determining, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature.
In some embodiments, the speech comprises multiple frames, the current speaker identity feature comprises a per-frame current speaker identity feature for each frame of the speech, and determining the target speaker identity feature corresponding to the voice comprises: calculating the average of the per-frame current speaker identity features; acquiring reference speaker identity features of a plurality of reference speakers; and screening out the target speaker identity feature from the reference speaker identity features of the plurality of reference speakers according to the similarity between the average and the reference speaker identity feature of each reference speaker.
In some embodiments, determining the text content information and speaker identity information corresponding to the voice comprises: determining a weighting value for the target speaker identity feature according to the similarity between the average and the reference speaker identity feature of each reference speaker; adjusting the target speaker identity feature according to the weighting value; and determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the adjusted target speaker identity feature.
In some embodiments, screening out the target speaker identity feature from the reference speaker identity features of the plurality of reference speakers comprises: selecting, from the reference speaker identity features of the plurality of reference speakers, the reference speaker identity feature with the greatest similarity to the average as the target speaker identity feature.
In some embodiments, the speech processing method further comprises: training a deep neural network model with reference voices of the plurality of reference speakers that carry speaker identity annotation information, to obtain the reference speaker identity features of the plurality of reference speakers.
In some embodiments, the speech processing method further comprises: training the first encoder with first training data comprising a plurality of first training voices and the text content annotation information corresponding to each first training voice; and training the second encoder with second training data comprising a plurality of second training voices and the speaker identity annotation information corresponding to each second training voice.
In some embodiments, the first encoder comprises an encoding layer of a Transformer model, and the second encoder comprises an encoding layer based on a convolution-enhanced Transformer model.
In some embodiments, the speech features are Mel-frequency cepstral coefficients (MFCC) or filter-bank (Fbank) features.
In some embodiments, the speaker identity information includes students and teachers responsible for different disciplines.
According to a second aspect of the present disclosure, there is provided a speech processing apparatus comprising: a processor configured to perform feature extraction on the voice of a current speaker to obtain voice features; a first encoder configured to determine a current text content feature from the voice features; a second encoder configured to determine a current speaker identity feature from the voice features, the second encoder having different parameters from the first encoder; the processor being further configured to determine, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and a decoder configured to determine the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature.
According to a third aspect of the present disclosure, there is provided a speech processing apparatus comprising: a memory; and a processor coupled to the memory, the processor configured to perform the speech processing method of any of the above embodiments based on instructions stored in the memory.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the speech processing method of any of the above embodiments.
In the above embodiment, the accuracy of the speech processing can be improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram illustrating a method of speech processing according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram illustrating determining identity characteristics of a target speaker corresponding to speech according to some embodiments of the present disclosure;
FIG. 3 is a block diagram illustrating a speech processing apparatus according to some embodiments of the present disclosure;
FIG. 4 is a block diagram illustrating a speech processing apparatus according to further embodiments of the present disclosure;
FIG. 5 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
FIG. 1 is a flow diagram illustrating a method of speech processing according to some embodiments of the present disclosure.
As shown in fig. 1, the speech processing method includes: step S10, performing feature extraction on the voice of the current speaker to obtain voice features; step S20, determining, from the voice features, a current text content feature and a current speaker identity feature using a first encoder and a second encoder with different parameters; step S30, determining, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and step S40, determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature. For example, the voice processing method is executed by a voice processing apparatus.
In the above embodiment, a dual-encoder structure with different parameters is adopted to determine the text content features and the speaker identity features separately, so that the two encoders have relatively independent functions; this reduces the training difficulty of each encoder, improves the accuracy of each encoder, and thereby improves the accuracy of speech processing. Moreover, the speaker identity feature and the text content feature are fed into the same decoder, so that the decoding of the text content feature and the decoding of the speaker identity feature assist each other, which further improves the accuracy of speech processing. A minimal sketch of this wiring is given below.
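Purely as an illustration of this wiring, the sketch below chains the four steps in Python/PyTorch. All names (asr_encoder, spk_encoder, joint_decoder, ref_dvectors) are hypothetical placeholders rather than terms from this disclosure, and the encoders and decoder are assumed to be provided as callables.

```python
import torch

def process_utterance(feats, asr_encoder, spk_encoder, joint_decoder, ref_dvectors):
    """feats: (T, F) frame-level speech features (e.g. MFCC/Fbank).
    ref_dvectors: (N, D) registered reference speaker identity features."""
    # Step S20: two encoders with different parameters
    text_feats = asr_encoder(feats)     # (T, D) current text content features
    spk_feats = spk_encoder(feats)      # (T, D) per-frame speaker identity features

    # Step S30: pick the registered speaker closest to the utterance-level mean
    mean_vec = spk_feats.mean(dim=0)
    sims = torch.nn.functional.cosine_similarity(mean_vec.unsqueeze(0), ref_dvectors, dim=1)
    target_dvec = ref_dvectors[sims.argmax()] * sims.max()   # weighted target d-vector

    # Step S40: a single decoder consumes both the text features and the speaker vector
    return joint_decoder(text_feats, target_dvec)
```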
In step S10, voice features are obtained by performing feature extraction on the voice of the current speaker. In some embodiments, the voice comprises multiple frames of speech, and the voice features comprise one frame of speech features per frame of speech. For example, the speech features are MFCC (Mel-Frequency Cepstral Coefficients) or Fbank (log Mel Filter Bank) features. In some embodiments, the voice is audio in wav format.
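One possible realization of the feature extraction in step S10 uses librosa; the disclosure only specifies MFCC or Fbank features, so the library, sampling rate and feature dimensions below are assumptions.

```python
import librosa

def extract_features(wav_path, kind="fbank", n_mels=80, n_mfcc=40):
    """Return a (num_frames, feature_dim) matrix, one row per speech frame."""
    y, sr = librosa.load(wav_path, sr=16000)          # assumed 16 kHz sampling rate
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    else:                                             # log-Mel filter bank (Fbank)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        feats = librosa.power_to_db(mel)
    return feats.T
```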
In step S20, a current text content feature and a current speaker identity feature (also called speaker feature) are determined based on the speech feature using a first encoder and a second encoder having different parameters, respectively. The current textual content characteristics describe coding characteristics of the textual content of the current speaker's voice. The current speaker identity characteristic describes an encoding characteristic of speaker identity information of the current speaker.
In some embodiments, the current speaker identity features corresponding to the multi-frame speech comprise one frame of current speaker identity feature per frame of speech. Similarly, the current text content features may comprise one frame of current text content feature per frame of speech. For example, the per-frame current speaker identity features may form a current speaker identity feature sequence, and the per-frame current text content features may form a current text content feature sequence.
In some embodiments, the first encoder is trained using the first training data prior to determining the current textual content feature using the first encoder. The first training data includes a plurality of pieces of first training voices and text content label information corresponding to each piece of the first training voices. In training the first encoder, a goal is to minimize a loss function with respect to a difference between predicted textual content information and textual content annotation information.
The first encoder may be named ASR-Encoder. The ASR-Encoder is a Transformer encoding layer (Encoder layer) built on self-attention. The Transformer encoding layer consists of 6 identical layers, each layer consisting of two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer is wrapped with a residual connection and layer normalization.
For example, the output of each sub-layer of the Transformer encoding layer can be expressed as: Transformer_encoder = LayerNorm(x + Sublayer(x)), where x denotes the input, Sublayer(x) denotes the operation performed by the sub-layer, and LayerNorm denotes layer normalization.
The sub-layer operations include the multi-head attention mechanism. For example, the result of the multi-head mechanism can be expressed as: MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) × W_O, where Q, K and V are vectors obtained by multiplying the input vector x by different first parameter matrices, W_O is a second parameter matrix, head_i denotes the computation of the i-th head, and Concat denotes the concatenation operation. Each head is computed as head_i = Self_attention(Q × W1, K × W2, V × W3), where W1, W2 and W3 are three third parameter matrices and Self_attention denotes the self-attention mechanism:
Self_attention(Q, K, V) = Softmax(Q × K^T / sqrt(d_k)) × V
where Softmax is the softmax function, d_k is the dimension of Q and K, and K^T denotes the transpose of K.
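The formulas above are the standard scaled dot-product and multi-head attention; a generic sketch is given below for reference. It is not the patent's implementation, and the projection matrices are passed in explicitly only to mirror the notation (W1, W2, W3, W_O).

```python
import math
import torch

def self_attention(Q, K, V):
    # Self_attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head(x, W_q, W_k, W_v, W_o, num_heads):
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
    T, d_model = x.shape
    Q, K, V = x @ W_q, x @ W_k, x @ W_v           # project the input x to Q, K, V
    d_head = d_model // num_heads
    heads = [
        self_attention(Q[:, i * d_head:(i + 1) * d_head],
                       K[:, i * d_head:(i + 1) * d_head],
                       V[:, i * d_head:(i + 1) * d_head])
        for i in range(num_heads)                 # head_i = Self_attention(Q W1, K W2, V W3)
    ]
    return torch.cat(heads, dim=-1) @ W_o         # concatenate the heads, then apply W_O
```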
The second encoder is trained using second training data before the second encoder is used to determine the current speaker identity feature. The second training data includes a plurality of second training voices and the speaker identity annotation information corresponding to each second training voice. In training the second encoder, a goal is to minimize a loss function on the difference between the predicted speaker identity information and the speaker identity annotation information.
The second encoder may be named SPK-Encoder. The SPK-Encoder is a modification of the first encoder. Because the speaker identity information is encoded for a particular voice, and a speaker's identity information is temporally continuous across consecutive speech, the second encoder adds a one-dimensional convolutional network between the multi-head self-attention mechanism and the fully connected feed-forward network, so that the SPK-Encoder takes global and local information into account simultaneously and learns the speaker identity information better.
In some embodiments, the first encoder includes an encoding layer of a Transformer model and the second encoder includes an encoding layer based on a convolution enhanced Transformer model.
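As an illustration of how a one-dimensional convolution can sit between the self-attention and feed-forward sub-layers, here is a rough PyTorch sketch of such an encoder layer; the dimensions, kernel size and exact ordering of the sub-layers are assumptions, not values given in this disclosure.

```python
import torch
import torch.nn as nn

class ConvAugmentedEncoderLayer(nn.Module):
    """Encoder layer with a 1-D convolution between multi-head self-attention
    and the feed-forward network (illustrative sizes only)."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, kernel_size=15):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise, local context
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):                                    # x: (batch, frames, d_model)
        attn, _ = self.mhsa(x, x, x)
        x = self.norm1(x + attn)                             # residual connection + LayerNorm
        conv = self.conv(x.transpose(1, 2)).transpose(1, 2)  # 1-D convolution over time
        x = self.norm2(x + conv)
        return self.norm3(x + self.ffn(x))
```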
In step S30, the identity of the target speaker corresponding to the speech is determined based on the identity of the current speaker.
Taking the example that the identity characteristics of the current speaker corresponding to the multiple frames of voices include the identity characteristics of the multiple frames of the current speaker corresponding to the multiple frames of voices, the identity characteristics of the target speaker corresponding to the voices can be determined in the manner shown in fig. 2.
FIG. 2 is a flow diagram illustrating determining identity characteristics of a target speaker corresponding to speech according to some embodiments of the present disclosure.
As shown in FIG. 2, determining the identity of the target speaker corresponding to speech includes steps S31-S33.
In step S31, the average of the per-frame current speaker identity features is calculated. For example, each frame's current speaker identity feature is represented as a feature vector, and the average of these per-frame feature vectors serves as the speaker identity encoding vector of the voice.
In step S32, reference speaker identity characteristics for a plurality of reference speakers are obtained.
In some embodiments, before the reference speaker identity features of the multiple reference speakers are obtained, a Deep Neural Network (DNN) model is trained using reference voices of the multiple reference speakers that carry speaker identity annotation information, to obtain the reference speaker identity features of the multiple reference speakers. For example, audio of each teacher and student (i.e., each reference speaker's voice) is prepared in advance and labeled with speaker identity information. After training, the DNN model can perform speaker identity recognition (speaker classification) at the frame level. Once training is finished, the output of the last hidden layer of the DNN model is taken as the reference speaker identity feature (d-vector). In this way, registration of the speaker identity features is completed. A rough sketch of such a model is given below.
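As a rough sketch of this registration step, the frame-level DNN below classifies speakers and exposes its last hidden layer as the d-vector; the layer sizes, depth and pooling are assumptions.

```python
import torch
import torch.nn as nn

class DVectorDNN(nn.Module):
    """Frame-level speaker classifier; after training, the last hidden layer
    is used as the reference speaker identity feature (d-vector)."""
    def __init__(self, feat_dim=40, hidden=256, num_speakers=1000):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # last hidden layer -> d-vector
        )
        self.classifier = nn.Linear(hidden, num_speakers)

    def forward(self, frames):                      # frames: (num_frames, feat_dim)
        return self.classifier(self.hidden_layers(frames))

    def d_vector(self, frames):
        # Utterance-level d-vector: average the last hidden layer over all frames
        with torch.no_grad():
            return self.hidden_layers(frames).mean(dim=0)
```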
For example, the reference speaker identity features of the multiple reference speakers can be represented as a two-dimensional matrix of size D × N, where D is the dimension of the speaker identity feature vector and N is the number of reference speakers.
In step S33, the target speaker identity characteristic is filtered out from the speaker identity characteristics of the plurality of reference speakers according to the similarity between the average value and the reference speaker identity characteristic of each reference speaker. In some embodiments, the reference speaker identity feature having the greatest similarity to the mean is selected from the reference speaker identity features of the plurality of reference speakers as the target speaker identity feature. For example, the similarity is calculated using cosine distances.
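Steps S31 to S33 can be summarized in a few lines of numpy, assuming the reference d-vectors are stored column-wise in the D × N matrix described above; this is a sketch, not the patent's code.

```python
import numpy as np

def select_target_speaker(frame_spk_feats, ref_matrix):
    """frame_spk_feats: (num_frames, D) per-frame current speaker identity features.
    ref_matrix: (D, N) registered reference d-vectors, one column per reference speaker.
    Returns the index of the most similar reference speaker and that similarity."""
    mean_vec = frame_spk_feats.mean(axis=0)                          # step S31
    # Cosine similarity between the mean vector and every reference d-vector
    sims = (ref_matrix.T @ mean_vec) / (
        np.linalg.norm(ref_matrix, axis=0) * np.linalg.norm(mean_vec) + 1e-8)
    best = int(np.argmax(sims))                                      # step S33
    return best, float(sims[best])
```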
In the above embodiment, the identity characteristic of the target speaker is screened by using an average value method, so that the efficiency of speech processing can be improved.
Returning to fig. 1, in step S40, the text content information and the speaker identity information corresponding to the voice are determined with the same decoder from the current text content features and the target speaker identity feature. For example, the speaker identity information (speaker role information) includes students and teachers responsible for different disciplines; teachers responsible for different disciplines include physics teachers, chemistry teachers, Chinese teachers, English teachers, and the like. In some embodiments, the decoder is the decoding layer of a Transformer model. When attention is computed, the Q, K, and V vectors are derived from the output of the Transformer encoding layer (the first encoder), the target speaker identity feature, and the decoding vector at the previous time step, respectively.
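One plausible way to let a single Transformer decoding layer see both the text content features and the target speaker identity feature is to append the speaker vector to the first encoder's output before cross-attention; this wiring is an assumption, since the disclosure does not fix the exact mechanism.

```python
import torch
import torch.nn as nn

# Illustrative single decoder for step S40 (dimensions are assumptions).
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

def decode_step(prev_outputs_emb, text_feats, target_dvec):
    """prev_outputs_emb: (1, t, 256) embeddings of the already-decoded outputs.
    text_feats: (1, T, 256) output of the first encoder.
    target_dvec: (256,) adjusted target speaker identity feature."""
    memory = torch.cat([text_feats, target_dvec.view(1, 1, -1)], dim=1)
    return decoder(prev_outputs_emb, memory)    # (1, t, 256) decoder states
```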
In some embodiments, after the average of the per-frame current speaker identity features is calculated, a weighting value for the target speaker identity feature is determined according to the similarity between the average and the reference speaker identity feature of each reference speaker. The target speaker identity feature is then adjusted according to the weighting value. Finally, the text content information and the speaker identity information corresponding to the voice are determined with the same decoder from the current text content features and the adjusted target speaker identity feature. Adjusting the target speaker identity feature with a weight derived from the similarity reduces the gap between the reference speaker identity feature and the current speaker's actual identity feature, reduces the negative influence of the reference speaker on the decoding process, and further improves the accuracy of speech processing.
In some embodiments, the maximum of the similarities between the average and the reference speaker identity features of the reference speakers may be taken as the weighting value for the target speaker identity feature, as in the sketch below.
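Continuing the earlier numpy sketch (and reusing its hypothetical select_target_speaker helper), the adjustment then amounts to scaling the selected reference d-vector by the maximum similarity:

```python
def adjusted_target_dvector(frame_spk_feats, ref_matrix):
    # Weight the selected reference d-vector by its similarity to the utterance mean
    idx, weight = select_target_speaker(frame_spk_feats, ref_matrix)
    return weight * ref_matrix[:, idx]
```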
FIG. 3 is a block diagram illustrating a speech processing apparatus according to some embodiments of the present disclosure.
As shown in fig. 3, the speech processing apparatus 3 includes a processor 31, a first encoder 32, a second encoder 33, and a decoder 34. The first encoder 32 and the second encoder 33 have different parameters.
The processor 31 is configured to perform feature extraction on the speech of the current speaker to obtain speech features, for example, execute step S10 shown in fig. 1.
The first encoder 32 is configured to determine the current text content characteristic based on the speech characteristic, for example, to perform step S20 as shown in fig. 1.
The second encoder 33 is configured to determine the identity of the current speaker based on the speech characteristics, for example, to perform step S20 as shown in fig. 1.
The processor 31 is further configured to determine the identity of the target speaker corresponding to the speech based on the identity of the current speaker, for example, to perform step S30 shown in fig. 1.
The decoder 34 is configured to determine the text content information and the speaker identity information corresponding to the speech based on the current text content characteristics and the identity characteristics of the target speaker, for example, to perform step S40 shown in fig. 1.
In the above embodiment, the encoder-decoder model in the whole speech processing apparatus adopts an autoregressive mode to complete the decoding of speech, so as to obtain text content information and speaker identity information.
FIG. 4 is a block diagram illustrating a speech processing apparatus according to further embodiments of the present disclosure.
As shown in fig. 4, the speech processing apparatus 4 includes a memory 41; and a processor 42 coupled to the memory 41. The memory 41 is used for storing instructions for executing the corresponding embodiment of the speech processing method. The processor 42 is configured to perform the speech processing method in any of the embodiments of the present disclosure based on instructions stored in the memory 41.
FIG. 5 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.
As shown in FIG. 5, the computer system 50 may be embodied in the form of a general purpose computing device. Computer system 50 includes a memory 510, a processor 520, and a bus 500 that connects the various system components.
The memory 510 may include, for example, system memory, non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs. The system memory may include volatile storage media such as Random Access Memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions to perform corresponding embodiments of at least one of the speech processing methods. Non-volatile storage media include, but are not limited to, magnetic disk storage, optical storage, flash memory, and the like.
The processor 520 may be implemented as discrete hardware components, such as a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gates or transistors, or the like. Accordingly, each of the modules, such as the judging module and the determining module, may be implemented by a Central Processing Unit (CPU) executing instructions in a memory for performing the corresponding step, or may be implemented by a dedicated circuit for performing the corresponding step.
Bus 500 may use any of a variety of bus architectures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
Computer system 50 may also include an input/output interface 530, a network interface 540, a storage interface 550, and the like. These interfaces 530, 540, 550, the memory 510, and the processor 520 may be connected by the bus 500. The input/output interface 530 provides a connection interface for input/output devices such as a display, a mouse, and a keyboard. The network interface 540 provides a connection interface for various networking devices. The storage interface 550 provides a connection interface for external storage devices such as a floppy disk, a USB flash drive, and an SD card.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable apparatus to produce a machine, such that the execution of the instructions by the processor results in an apparatus that implements the functions specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable memory that can direct a computer to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function specified in the flowchart and/or block diagram block or blocks.
The present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
By the voice processing method and device and the computer storage medium in the embodiment, the accuracy of voice processing can be improved.
So far, the speech processing method and apparatus, computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.

Claims (10)

1. A method of speech processing, comprising:
performing feature extraction on the voice of the current speaker to obtain voice features, wherein the voice comprises multi-frame voice;
respectively determining the current text content characteristic and the current speaker identity characteristic corresponding to the current speaker by utilizing a first encoder and a second encoder with different parameters according to the voice characteristics, wherein the current speaker identity characteristic comprises multi-frame current speaker identity characteristics corresponding to the multi-frame voice;
calculating the average value of the identity characteristics of the current speakers of the multiple frames;
acquiring the identity characteristics of reference speakers of a plurality of reference speakers;
screening out the identity characteristics of the target speaker from the identity characteristics of the reference speakers of the plurality of reference speakers according to the similarity between the average value and the identity characteristics of the reference speakers of each reference speaker;
determining a weighted value of the identity characteristics of the target speaker according to the similarity between the average value and the identity characteristics of the reference speakers of each reference speaker;
adjusting the identity characteristics of the target speaker according to the weighted value;
and determining text content information and speaker identity information corresponding to the voice by using the same decoder according to the current text content characteristics and the adjusted identity characteristics of the target speaker.
2. The speech processing method of claim 1, wherein screening the identity of the target speaker from the identity of the reference speakers of the plurality of reference speakers comprises:
and selecting the reference speaker identity characteristic with the maximum similarity with the average value from the reference speaker identity characteristics of the plurality of reference speakers as the target speaker identity characteristic.
3. The speech processing method according to claim 1, further comprising:
and training a deep neural network model by using the reference voices of the multiple reference speakers with the speaker identity labeling information to obtain the reference speaker identity characteristics of the multiple reference speakers.
4. The speech processing method according to claim 1, further comprising:
training the first encoder by using first training data, wherein the first training data comprises a plurality of pieces of first training voice and text content labeling information corresponding to each piece of first training voice;
and training the second encoder by using second training data, wherein the second training data comprises a plurality of second training voices and speaker identity marking information corresponding to each second training voice.
5. The speech processing method of claim 1, wherein the first encoder comprises a Transformer model coding layer and the second encoder comprises a coding layer based on a convolution-enhanced Transformer model.
6. The speech processing method according to claim 1, wherein the speech features are Mel Frequency Cepstral Coefficients (MFCCs) or a filter bank (Fbank).
7. The speech processing method of claim 1 wherein the speaker identity information comprises students and teachers responsible for different disciplines.
8. A speech processing apparatus, comprising:
a processor configured to perform feature extraction on the voice of a current speaker to obtain voice features, wherein the voice comprises multi-frame voice;
a first encoder configured to determine a current text content feature corresponding to the current speaker based on the speech feature;
a second encoder configured to determine a current speaker identity feature corresponding to the current speaker based on the speech features, the second encoder having different parameters than the first encoder, the current speaker identity feature comprising a multi-frame current speaker identity feature corresponding to the multi-frame speech;
the processor is further configured to calculate an average of identity characteristics of the plurality of frames of the current speaker; acquiring the identity characteristics of reference speakers of a plurality of reference speakers; screening out the identity characteristics of the target speaker from the identity characteristics of the reference speakers of the plurality of reference speakers according to the similarity between the average value and the identity characteristics of the reference speakers of each reference speaker;
a decoder configured to: determine a weighted value of the identity characteristics of the target speaker according to the similarity between the average value and the reference speaker identity characteristics of each reference speaker; adjust the identity characteristics of the target speaker according to the weighted value; and determine text content information and speaker identity information corresponding to the voice according to the current text content characteristics and the adjusted identity characteristics of the target speaker.
9. A speech processing apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the speech processing method of any of claims 1 to 7 based on instructions stored in the memory.
10. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the speech processing method according to any one of claims 1 to 7.
CN202110694885.XA 2021-06-23 2021-06-23 Voice processing method and device and computer storage medium Active CN113257230B (en)

Priority Applications (1)

Application Number: CN202110694885.XA · Publication: CN113257230B (en) · Priority Date: 2021-06-23 · Filing Date: 2021-06-23 · Title: Voice processing method and device and computer storage medium

Applications Claiming Priority (1)

Application Number: CN202110694885.XA · Publication: CN113257230B (en) · Priority Date: 2021-06-23 · Filing Date: 2021-06-23 · Title: Voice processing method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113257230A CN113257230A (en) 2021-08-13
CN113257230B true CN113257230B (en) 2022-02-08

Family

ID=77189235

Family Applications (1)

Application Number: CN202110694885.XA · Status: Active · Publication: CN113257230B (en) · Priority Date: 2021-06-23 · Filing Date: 2021-06-23 · Title: Voice processing method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113257230B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913859B (en) * 2022-05-17 2024-06-04 北京百度网讯科技有限公司 Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN115019804B (en) * 2022-08-03 2022-11-01 北京惠朗时代科技有限公司 Multi-verification type voiceprint recognition method and system for multi-employee intensive sign-in

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10614813B2 (en) * 2016-11-04 2020-04-07 Intellisist, Inc. System and method for performing caller identity verification using multi-step voice analysis
CN110634492B (en) * 2019-06-13 2023-08-25 中信银行股份有限公司 Login verification method, login verification device, electronic equipment and computer readable storage medium
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN111768789B (en) * 2020-08-03 2024-02-23 上海依图信息技术有限公司 Electronic equipment, and method, device and medium for determining identity of voice generator of electronic equipment
CN112259106B (en) * 2020-10-20 2024-06-11 网易(杭州)网络有限公司 Voiceprint recognition method and device, storage medium and computer equipment
CN112712813B (en) * 2021-03-26 2021-07-20 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113257230A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
US11488586B1 (en) System for speech recognition text enhancement fusing multi-modal semantic invariance
US9728183B2 (en) System and method for combining frame and segment level processing, via temporal pooling, for phonetic classification
CN110570845B (en) Voice recognition method based on domain invariant features
CN112951240B (en) Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113257230B (en) Voice processing method and device and computer storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN111310441A (en) Text correction method, device, terminal and medium based on BERT (binary offset transcription) voice recognition
CN113744727A (en) Model training method, system, terminal device and storage medium
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN113674733A (en) Method and apparatus for speaking time estimation
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN110120231B (en) Cross-corpus emotion recognition method based on self-adaptive semi-supervised non-negative matrix factorization
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN113505611B (en) Training method and system for obtaining better speech translation model in generation of confrontation
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN113327584A (en) Language identification method, device, equipment and storage medium
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN116844573A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
ElMaghraby et al. Noise-robust speech recognition system based on multimodal audio-visual approach using different deep learning classification techniques
Zhang et al. Discriminatively trained sparse inverse covariance matrices for speech recognition
CN112863518A (en) Method and device for voice data theme recognition
CN114724547A (en) Method and system for identifying accent English
Tamura et al. GIF-SP: GA-based informative feature for noisy speech recognition
CN117238324A (en) Voice emotion recognition method and system based on multi-mode double-convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant