CN113257230B - Voice processing method and device and computer storage medium - Google Patents

Voice processing method and device and computer storage medium

Info

Publication number
CN113257230B
CN113257230B · Application CN202110694885.XA
Authority
CN
China
Prior art keywords
speaker
identity
voice
current
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110694885.XA
Other languages
Chinese (zh)
Other versions
CN113257230A (en)
Inventor
李成飞
汪光璟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110694885.XA
Publication of CN113257230A
Application granted
Publication of CN113257230B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a voice processing method and device and a computer storage medium, in the field of voice processing. The voice processing method comprises: performing feature extraction on the voice of a current speaker to obtain voice features; determining, from the voice features, a current text content feature and a current speaker identity feature using a first encoder and a second encoder that have different parameters; determining, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature. According to the present disclosure, the accuracy of speech processing can be improved.

Description

Voice processing method and device and computer storage medium
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a speech processing method and apparatus, and a computer-readable storage medium.
Background
In the related art, a single encoder encodes the speech features of a voice to obtain both text content features and speaker identity features, which are then fed into two different decoders to obtain the text content information and speaker identity information corresponding to the voice.
Disclosure of Invention
In the related art, a single encoder performs multiple encoding tasks at once: it is harder to train, requires far more training data than a single-task encoder, and achieves lower accuracy, which in turn lowers the accuracy of speech processing. Moreover, because the two decoders decode the text content features and the speaker identity features separately, neither decoding task can assist the other, which also limits the accuracy of speech processing.
In order to solve the technical problem, the present disclosure provides a solution to improve the accuracy of speech processing.
According to a first aspect of the present disclosure, there is provided a speech processing method comprising: performing feature extraction on the voice of a current speaker to obtain voice features; determining, from the voice features, a current text content feature and a current speaker identity feature using a first encoder and a second encoder that have different parameters; determining, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature.
In some embodiments, the speech comprises multiple frames, the current speaker identity feature comprises a per-frame current speaker identity feature for each frame of the speech, and determining the target speaker identity feature corresponding to the voice comprises: calculating the average of the per-frame current speaker identity features; acquiring reference speaker identity features of a plurality of reference speakers; and screening out the target speaker identity feature from the reference speaker identity features of the plurality of reference speakers according to the similarity between the average and the reference speaker identity feature of each reference speaker.
In some embodiments, determining the text content information and speaker identity information corresponding to the voice comprises: determining a weighting value for the target speaker identity feature according to the similarity between the average and the reference speaker identity feature of each reference speaker; adjusting the target speaker identity feature according to the weighting value; and determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the adjusted target speaker identity feature.
In some embodiments, screening out the target speaker identity feature from the reference speaker identity features of the plurality of reference speakers comprises: selecting, from the reference speaker identity features of the plurality of reference speakers, the reference speaker identity feature with the greatest similarity to the average as the target speaker identity feature.
In some embodiments, the speech processing method further comprises: training a deep neural network model with reference voices of the plurality of reference speakers that carry speaker identity annotation information, to obtain the reference speaker identity features of the plurality of reference speakers.
In some embodiments, the speech processing method further comprises: training the first encoder with first training data comprising a plurality of first training voices and the text content annotation information corresponding to each first training voice; and training the second encoder with second training data comprising a plurality of second training voices and the speaker identity annotation information corresponding to each second training voice.
In some embodiments, the first encoder comprises an encoding layer of a Transformer model, and the second encoder comprises an encoding layer based on a convolution-enhanced Transformer model.
In some embodiments, the speech features are Mel-frequency cepstral coefficients (MFCC) or filter-bank (Fbank) features.
In some embodiments, the speaker identity information includes students and teachers responsible for different disciplines.
According to a second aspect of the present disclosure, there is provided a speech processing apparatus comprising: a processor configured to perform feature extraction on the voice of a current speaker to obtain voice features; a first encoder configured to determine a current text content feature from the voice features; a second encoder configured to determine a current speaker identity feature from the voice features, the second encoder having different parameters from the first encoder; the processor being further configured to determine, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and a decoder configured to determine the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature.
According to a third aspect of the present disclosure, there is provided a speech processing apparatus comprising: a memory; and a processor coupled to the memory, the processor configured to perform the speech processing method of any of the above embodiments based on instructions stored in the memory.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the speech processing method of any of the above embodiments.
In the above embodiment, the accuracy of the speech processing can be improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram illustrating a method of speech processing according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram illustrating determining identity characteristics of a target speaker corresponding to speech according to some embodiments of the present disclosure;
FIG. 3 is a block diagram illustrating a speech processing apparatus according to some embodiments of the present disclosure;
FIG. 4 is a block diagram illustrating a speech processing apparatus according to further embodiments of the present disclosure;
FIG. 5 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
FIG. 1 is a flow diagram illustrating a method of speech processing according to some embodiments of the present disclosure.
As shown in fig. 1, the speech processing method includes: step S10, performing feature extraction on the voice of the current speaker to obtain voice features; step S20, determining, from the voice features, a current text content feature and a current speaker identity feature using a first encoder and a second encoder with different parameters; step S30, determining, from the current speaker identity feature, a target speaker identity feature corresponding to the voice; and step S40, determining, with the same decoder, the text content information and speaker identity information corresponding to the voice from the current text content feature and the target speaker identity feature. For example, the voice processing method is executed by a voice processing apparatus.
In the above embodiment, a dual-encoder structure with different parameters is adopted to determine the text content features and the speaker identity features separately, so that the two encoders have relatively independent functions; this reduces the training difficulty of each encoder, improves the accuracy of each encoder, and thereby improves the accuracy of speech processing. Moreover, the speaker identity feature and the text content feature are fed into the same decoder, so that the decoding of the text content feature and the decoding of the speaker identity feature assist each other, which further improves the accuracy of speech processing. A minimal sketch of this wiring is given below.
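Purely as an illustration of this wiring, the sketch below chains the four steps in Python/PyTorch. All names (asr_encoder, spk_encoder, joint_decoder, ref_dvectors) are hypothetical placeholders rather than terms from this disclosure, and the encoders and decoder are assumed to be provided as callables.

```python
import torch

def process_utterance(feats, asr_encoder, spk_encoder, joint_decoder, ref_dvectors):
    """feats: (T, F) frame-level speech features (e.g. MFCC/Fbank).
    ref_dvectors: (N, D) registered reference speaker identity features."""
    # Step S20: two encoders with different parameters
    text_feats = asr_encoder(feats)     # (T, D) current text content features
    spk_feats = spk_encoder(feats)      # (T, D) per-frame speaker identity features

    # Step S30: pick the registered speaker closest to the utterance-level mean
    mean_vec = spk_feats.mean(dim=0)
    sims = torch.nn.functional.cosine_similarity(mean_vec.unsqueeze(0), ref_dvectors, dim=1)
    target_dvec = ref_dvectors[sims.argmax()] * sims.max()   # weighted target d-vector

    # Step S40: a single decoder consumes both the text features and the speaker vector
    return joint_decoder(text_feats, target_dvec)
```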
In step S10, voice features are obtained by performing feature extraction on the voice of the current speaker. In some embodiments, the voice comprises multiple frames of speech, and the voice features comprise one frame of speech features per frame of speech. For example, the speech features are MFCC (Mel-Frequency Cepstral Coefficients) or Fbank (log Mel Filter Bank) features. In some embodiments, the voice is audio in wav format.
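One possible realization of the feature extraction in step S10 uses librosa; the disclosure only specifies MFCC or Fbank features, so the library, sampling rate and feature dimensions below are assumptions.

```python
import librosa

def extract_features(wav_path, kind="fbank", n_mels=80, n_mfcc=40):
    """Return a (num_frames, feature_dim) matrix, one row per speech frame."""
    y, sr = librosa.load(wav_path, sr=16000)          # assumed 16 kHz sampling rate
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    else:                                             # log-Mel filter bank (Fbank)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        feats = librosa.power_to_db(mel)
    return feats.T
```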
In step S20, a current text content feature and a current speaker identity feature (also called speaker feature) are determined based on the speech feature using a first encoder and a second encoder having different parameters, respectively. The current textual content characteristics describe coding characteristics of the textual content of the current speaker's voice. The current speaker identity characteristic describes an encoding characteristic of speaker identity information of the current speaker.
In some embodiments, the current speaker identity features corresponding to the multi-frame speech comprise one frame of current speaker identity feature per frame of speech. Similarly, the current text content features may comprise one frame of current text content feature per frame of speech. For example, the per-frame current speaker identity features may form a current speaker identity feature sequence, and the per-frame current text content features may form a current text content feature sequence.
In some embodiments, the first encoder is trained using the first training data prior to determining the current textual content feature using the first encoder. The first training data includes a plurality of pieces of first training voices and text content label information corresponding to each piece of the first training voices. In training the first encoder, a goal is to minimize a loss function with respect to a difference between predicted textual content information and textual content annotation information.
The first encoder may be named ASR-Encoder. The ASR-Encoder is a Transformer encoding layer (Encoder layer) built on self-attention. The Transformer encoding layer consists of 6 identical layers, each layer consisting of two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer is wrapped with a residual connection and layer normalization.
For example, the output of each sub-layer of the Transformer encoding layer can be expressed as: Transformer_encoder = LayerNorm(x + Sublayer(x)), where x denotes the input, Sublayer(x) denotes the operation performed by the sub-layer, and LayerNorm denotes layer normalization.
The sub-layer operations include the multi-head attention mechanism. For example, the result of the multi-head mechanism can be expressed as: MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) × W_O, where Q, K and V are vectors obtained by multiplying the input vector x by different first parameter matrices, W_O is a second parameter matrix, head_i denotes the computation of the i-th head, and Concat denotes the concatenation operation. Each head is computed as head_i = Self_attention(Q × W1, K × W2, V × W3), where W1, W2 and W3 are three third parameter matrices and Self_attention denotes the self-attention mechanism:
Self_attention(Q, K, V) = Softmax(Q × K^T / sqrt(d_k)) × V
where Softmax is the softmax function, d_k is the dimension of Q and K, and K^T denotes the transpose of K.
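The formulas above are the standard scaled dot-product and multi-head attention; a generic sketch is given below for reference. It is not the patent's implementation, and the projection matrices are passed in explicitly only to mirror the notation (W1, W2, W3, W_O).

```python
import math
import torch

def self_attention(Q, K, V):
    # Self_attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head(x, W_q, W_k, W_v, W_o, num_heads):
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
    T, d_model = x.shape
    Q, K, V = x @ W_q, x @ W_k, x @ W_v           # project the input x to Q, K, V
    d_head = d_model // num_heads
    heads = [
        self_attention(Q[:, i * d_head:(i + 1) * d_head],
                       K[:, i * d_head:(i + 1) * d_head],
                       V[:, i * d_head:(i + 1) * d_head])
        for i in range(num_heads)                 # head_i = Self_attention(Q W1, K W2, V W3)
    ]
    return torch.cat(heads, dim=-1) @ W_o         # concatenate the heads, then apply W_O
```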
The second encoder is trained using second training data before the second encoder is used to determine the current speaker identity feature. The second training data includes a plurality of second training voices and the speaker identity annotation information corresponding to each second training voice. In training the second encoder, a goal is to minimize a loss function on the difference between the predicted speaker identity information and the speaker identity annotation information.
The second encoder may be named SPK-Encoder. The SPK-Encoder is a modification of the first encoder. Because the speaker identity information is encoded for a particular voice, and a speaker's identity information is temporally continuous across consecutive speech, the second encoder adds a one-dimensional convolutional network between the multi-head self-attention mechanism and the fully connected feed-forward network, so that the SPK-Encoder takes global and local information into account simultaneously and learns the speaker identity information better.
In some embodiments, the first encoder includes an encoding layer of a Transformer model and the second encoder includes an encoding layer based on a convolution enhanced Transformer model.
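As an illustration of how a one-dimensional convolution can sit between the self-attention and feed-forward sub-layers, here is a rough PyTorch sketch of such an encoder layer; the dimensions, kernel size and exact ordering of the sub-layers are assumptions, not values given in this disclosure.

```python
import torch
import torch.nn as nn

class ConvAugmentedEncoderLayer(nn.Module):
    """Encoder layer with a 1-D convolution between multi-head self-attention
    and the feed-forward network (illustrative sizes only)."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, kernel_size=15):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise, local context
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):                                    # x: (batch, frames, d_model)
        attn, _ = self.mhsa(x, x, x)
        x = self.norm1(x + attn)                             # residual connection + LayerNorm
        conv = self.conv(x.transpose(1, 2)).transpose(1, 2)  # 1-D convolution over time
        x = self.norm2(x + conv)
        return self.norm3(x + self.ffn(x))
```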
In step S30, the identity of the target speaker corresponding to the speech is determined based on the identity of the current speaker.
Taking the example that the identity characteristics of the current speaker corresponding to the multiple frames of voices include the identity characteristics of the multiple frames of the current speaker corresponding to the multiple frames of voices, the identity characteristics of the target speaker corresponding to the voices can be determined in the manner shown in fig. 2.
FIG. 2 is a flow diagram illustrating determining identity characteristics of a target speaker corresponding to speech according to some embodiments of the present disclosure.
As shown in FIG. 2, determining the identity of the target speaker corresponding to speech includes steps S31-S33.
In step S31, the average of the per-frame current speaker identity features is calculated. For example, each frame's current speaker identity feature is represented as a feature vector, and the average of these per-frame feature vectors serves as the speaker identity encoding vector of the voice.
In step S32, reference speaker identity characteristics for a plurality of reference speakers are obtained.
In some embodiments, before the reference speaker identity features of the multiple reference speakers are obtained, a Deep Neural Network (DNN) model is trained using reference voices of the multiple reference speakers that carry speaker identity annotation information, to obtain the reference speaker identity features of the multiple reference speakers. For example, audio of each teacher and student (i.e., each reference speaker's voice) is prepared in advance and labeled with speaker identity information. After training, the DNN model can perform speaker identity recognition (speaker classification) at the frame level. Once training is finished, the output of the last hidden layer of the DNN model is taken as the reference speaker identity feature (d-vector). In this way, registration of the speaker identity features is completed. A rough sketch of such a model is given below.
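As a rough sketch of this registration step, the frame-level DNN below classifies speakers and exposes its last hidden layer as the d-vector; the layer sizes, depth and pooling are assumptions.

```python
import torch
import torch.nn as nn

class DVectorDNN(nn.Module):
    """Frame-level speaker classifier; after training, the last hidden layer
    is used as the reference speaker identity feature (d-vector)."""
    def __init__(self, feat_dim=40, hidden=256, num_speakers=1000):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # last hidden layer -> d-vector
        )
        self.classifier = nn.Linear(hidden, num_speakers)

    def forward(self, frames):                      # frames: (num_frames, feat_dim)
        return self.classifier(self.hidden_layers(frames))

    def d_vector(self, frames):
        # Utterance-level d-vector: average the last hidden layer over all frames
        with torch.no_grad():
            return self.hidden_layers(frames).mean(dim=0)
```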
For example, the reference speaker identity features of the multiple reference speakers can be represented as a two-dimensional matrix of size D × N, where D is the dimension of the speaker identity feature vector and N is the number of reference speakers.
In step S33, the target speaker identity characteristic is filtered out from the speaker identity characteristics of the plurality of reference speakers according to the similarity between the average value and the reference speaker identity characteristic of each reference speaker. In some embodiments, the reference speaker identity feature having the greatest similarity to the mean is selected from the reference speaker identity features of the plurality of reference speakers as the target speaker identity feature. For example, the similarity is calculated using cosine distances.
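Steps S31 to S33 can be summarized in a few lines of numpy, assuming the reference d-vectors are stored column-wise in the D × N matrix described above; this is a sketch, not the patent's code.

```python
import numpy as np

def select_target_speaker(frame_spk_feats, ref_matrix):
    """frame_spk_feats: (num_frames, D) per-frame current speaker identity features.
    ref_matrix: (D, N) registered reference d-vectors, one column per reference speaker.
    Returns the index of the most similar reference speaker and that similarity."""
    mean_vec = frame_spk_feats.mean(axis=0)                          # step S31
    # Cosine similarity between the mean vector and every reference d-vector
    sims = (ref_matrix.T @ mean_vec) / (
        np.linalg.norm(ref_matrix, axis=0) * np.linalg.norm(mean_vec) + 1e-8)
    best = int(np.argmax(sims))                                      # step S33
    return best, float(sims[best])
```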
In the above embodiment, the identity characteristic of the target speaker is screened by using an average value method, so that the efficiency of speech processing can be improved.
Returning to fig. 1, in step S40, the text content information and the speaker identity information corresponding to the voice are determined with the same decoder from the current text content features and the target speaker identity feature. For example, the speaker identity information (speaker role information) includes students and teachers responsible for different disciplines; teachers responsible for different disciplines include physics teachers, chemistry teachers, Chinese teachers, English teachers, and the like. In some embodiments, the decoder is the decoding layer of a Transformer model. When attention is computed, the Q, K, and V vectors are derived from the output of the Transformer encoding layer (the first encoder), the target speaker identity feature, and the decoding vector at the previous time step, respectively.
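One plausible way to let a single Transformer decoding layer see both the text content features and the target speaker identity feature is to append the speaker vector to the first encoder's output before cross-attention; this wiring is an assumption, since the disclosure does not fix the exact mechanism.

```python
import torch
import torch.nn as nn

# Illustrative single decoder for step S40 (dimensions are assumptions).
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

def decode_step(prev_outputs_emb, text_feats, target_dvec):
    """prev_outputs_emb: (1, t, 256) embeddings of the already-decoded outputs.
    text_feats: (1, T, 256) output of the first encoder.
    target_dvec: (256,) adjusted target speaker identity feature."""
    memory = torch.cat([text_feats, target_dvec.view(1, 1, -1)], dim=1)
    return decoder(prev_outputs_emb, memory)    # (1, t, 256) decoder states
```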
In some embodiments, after the average of the per-frame current speaker identity features is calculated, a weighting value for the target speaker identity feature is determined according to the similarity between the average and the reference speaker identity feature of each reference speaker. The target speaker identity feature is then adjusted according to the weighting value. Finally, the text content information and the speaker identity information corresponding to the voice are determined with the same decoder from the current text content features and the adjusted target speaker identity feature. Adjusting the target speaker identity feature with a weight derived from the similarity reduces the gap between the reference speaker identity feature and the current speaker's actual identity feature, reduces the negative influence of the reference speaker on the decoding process, and further improves the accuracy of speech processing.
In some embodiments, the maximum of the similarities between the average and the reference speaker identity features of the reference speakers may be taken as the weighting value for the target speaker identity feature, as in the sketch below.
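Continuing the earlier numpy sketch (and reusing its hypothetical select_target_speaker helper), the adjustment then amounts to scaling the selected reference d-vector by the maximum similarity:

```python
def adjusted_target_dvector(frame_spk_feats, ref_matrix):
    # Weight the selected reference d-vector by its similarity to the utterance mean
    idx, weight = select_target_speaker(frame_spk_feats, ref_matrix)
    return weight * ref_matrix[:, idx]
```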
FIG. 3 is a block diagram illustrating a speech processing apparatus according to some embodiments of the present disclosure.
As shown in fig. 3, the speech processing apparatus 3 includes a processor 31, a first encoder 32, a second encoder 33, and a decoder 34. The first encoder 32 and the second encoder 33 have different parameters.
The processor 31 is configured to perform feature extraction on the speech of the current speaker to obtain speech features, for example, execute step S10 shown in fig. 1.
The first encoder 32 is configured to determine the current text content characteristic based on the speech characteristic, for example, to perform step S20 as shown in fig. 1.
The second encoder 33 is configured to determine the identity of the current speaker based on the speech characteristics, for example, to perform step S20 as shown in fig. 1.
The processor 31 is further configured to determine the identity of the target speaker corresponding to the speech based on the identity of the current speaker, for example, to perform step S30 shown in fig. 1.
The decoder 34 is configured to determine the text content information and the speaker identity information corresponding to the speech based on the current text content characteristics and the identity characteristics of the target speaker, for example, to perform step S40 shown in fig. 1.
In the above embodiment, the encoder-decoder model in the whole speech processing apparatus adopts an autoregressive mode to complete the decoding of speech, so as to obtain text content information and speaker identity information.
FIG. 4 is a block diagram illustrating a speech processing apparatus according to further embodiments of the present disclosure.
As shown in fig. 4, the speech processing apparatus 4 includes a memory 41; and a processor 42 coupled to the memory 41. The memory 41 is used for storing instructions for executing the corresponding embodiment of the speech processing method. The processor 42 is configured to perform the speech processing method in any of the embodiments of the present disclosure based on instructions stored in the memory 41.
FIG. 5 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.
As shown in FIG. 5, the computer system 50 may be embodied in the form of a general purpose computing device. Computer system 50 includes a memory 510, a processor 520, and a bus 500 that connects the various system components.
The memory 510 may include, for example, system memory, non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs. The system memory may include volatile storage media such as Random Access Memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions to perform corresponding embodiments of at least one of the speech processing methods. Non-volatile storage media include, but are not limited to, magnetic disk storage, optical storage, flash memory, and the like.
The processor 520 may be implemented as discrete hardware components, such as a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gates or transistors, or the like. Accordingly, each of the modules, such as the judging module and the determining module, may be implemented by a Central Processing Unit (CPU) executing instructions in a memory for performing the corresponding step, or may be implemented by a dedicated circuit for performing the corresponding step.
Bus 500 may use any of a variety of bus architectures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
Computer system 50 may also include an input/output interface 530, a network interface 540, a storage interface 550, and the like. These interfaces 530, 540, 550, the memory 510, and the processor 520 may be connected by the bus 500. The input/output interface 530 provides a connection interface for input/output devices such as a display, a mouse, and a keyboard. The network interface 540 provides a connection interface for various networking devices. The storage interface 550 provides a connection interface for external storage devices such as a floppy disk, a USB flash drive, and an SD card.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable apparatus to produce a machine, such that the execution of the instructions by the processor results in an apparatus that implements the functions specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable memory that can direct a computer to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function specified in the flowchart and/or block diagram block or blocks.
The present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
By the voice processing method and device and the computer storage medium in the embodiment, the accuracy of voice processing can be improved.
So far, the speech processing method and apparatus, computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.

Claims (10)

1. A method of speech processing, comprising:
performing feature extraction on the voice of the current speaker to obtain voice features, wherein the voice comprises multi-frame voice;
respectively determining the current text content characteristic and the current speaker identity characteristic corresponding to the current speaker by utilizing a first encoder and a second encoder with different parameters according to the voice characteristics, wherein the current speaker identity characteristic comprises multi-frame current speaker identity characteristics corresponding to the multi-frame voice;
calculating the average value of the identity characteristics of the current speakers of the multiple frames;
acquiring the identity characteristics of reference speakers of a plurality of reference speakers;
screening out the identity characteristics of the target speaker from the identity characteristics of the reference speakers of the plurality of reference speakers according to the similarity between the average value and the identity characteristics of the reference speakers of each reference speaker;
determining a weighted value of the identity characteristics of the target speaker according to the similarity between the average value and the identity characteristics of the reference speakers of each reference speaker;
adjusting the identity characteristics of the target speaker according to the weighted value;
and determining text content information and speaker identity information corresponding to the voice by using the same decoder according to the current text content characteristics and the adjusted identity characteristics of the target speaker.
2. The speech processing method of claim 1, wherein screening the identity of the target speaker from the identity of the reference speakers of the plurality of reference speakers comprises:
and selecting the reference speaker identity characteristic with the maximum similarity with the average value from the reference speaker identity characteristics of the plurality of reference speakers as the target speaker identity characteristic.
3. The speech processing method according to claim 1, further comprising:
and training a deep neural network model by using the reference voices of the multiple reference speakers with the speaker identity labeling information to obtain the reference speaker identity characteristics of the multiple reference speakers.
4. The speech processing method according to claim 1, further comprising:
training the first encoder by using first training data, wherein the first training data comprises a plurality of pieces of first training voice and text content labeling information corresponding to each piece of first training voice;
and training the second encoder by using second training data, wherein the second training data comprises a plurality of second training voices and speaker identity marking information corresponding to each second training voice.
5. The speech processing method of claim 1, wherein the first encoder comprises a Transformer model coding layer and the second encoder comprises a coding layer based on a convolution-enhanced Transformer model.
6. The speech processing method according to claim 1, wherein the speech features are Mel Frequency Cepstral Coefficients (MFCCs) or a filter bank (Fbank).
7. The speech processing method of claim 1 wherein the speaker identity information comprises students and teachers responsible for different disciplines.
8. A speech processing apparatus, comprising:
a processor configured to perform feature extraction on the voice of a current speaker to obtain voice features, wherein the voice comprises multi-frame voice;
a first encoder configured to determine a current text content feature corresponding to the current speaker based on the speech feature;
a second encoder configured to determine a current speaker identity feature corresponding to the current speaker based on the speech features, the second encoder having different parameters than the first encoder, the current speaker identity feature comprising a multi-frame current speaker identity feature corresponding to the multi-frame speech;
the processor is further configured to calculate an average of identity characteristics of the plurality of frames of the current speaker; acquiring the identity characteristics of reference speakers of a plurality of reference speakers; screening out the identity characteristics of the target speaker from the identity characteristics of the reference speakers of the plurality of reference speakers according to the similarity between the average value and the identity characteristics of the reference speakers of each reference speaker;
a decoder configured to: determine a weighted value of the identity characteristics of the target speaker according to the similarity between the average value and the reference speaker identity characteristics of each reference speaker; adjust the identity characteristics of the target speaker according to the weighted value; and determine text content information and speaker identity information corresponding to the voice according to the current text content characteristics and the adjusted identity characteristics of the target speaker.
9. A speech processing apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the speech processing method of any of claims 1 to 7 based on instructions stored in the memory.
10. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the speech processing method according to any one of claims 1 to 7.
CN202110694885.XA 2021-06-23 2021-06-23 Voice processing method and device and computer storage medium Active CN113257230B (en)

Priority Applications (1)

Application Number: CN202110694885.XA · Publication: CN113257230B (en) · Priority Date: 2021-06-23 · Filing Date: 2021-06-23 · Title: Voice processing method and device and computer storage medium

Applications Claiming Priority (1)

Application Number: CN202110694885.XA · Publication: CN113257230B (en) · Priority Date: 2021-06-23 · Filing Date: 2021-06-23 · Title: Voice processing method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113257230A CN113257230A (en) 2021-08-13
CN113257230B true CN113257230B (en) 2022-02-08

Family

ID=77189235

Family Applications (1)

Application Number: CN202110694885.XA · Status: Active · Publication: CN113257230B (en) · Priority Date: 2021-06-23 · Filing Date: 2021-06-23 · Title: Voice processing method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113257230B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913859B (en) * 2022-05-17 2024-06-04 北京百度网讯科技有限公司 Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN115019804B (en) * 2022-08-03 2022-11-01 北京惠朗时代科技有限公司 Multi-verification type voiceprint recognition method and system for multi-employee intensive sign-in

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10614813B2 (en) * 2016-11-04 2020-04-07 Intellisist, Inc. System and method for performing caller identity verification using multi-step voice analysis
CN110634492B (en) * 2019-06-13 2023-08-25 中信银行股份有限公司 Login verification method, login verification device, electronic equipment and computer readable storage medium
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN111768789B (en) * 2020-08-03 2024-02-23 上海依图信息技术有限公司 Electronic equipment, and method, device and medium for determining identity of voice generator of electronic equipment
CN112259106B (en) * 2020-10-20 2024-06-11 网易(杭州)网络有限公司 Voiceprint recognition method and device, storage medium and computer equipment
CN112712813B (en) * 2021-03-26 2021-07-20 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113257230A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
US11488586B1 (en) System for speech recognition text enhancement fusing multi-modal semantic invariance
US9728183B2 (en) System and method for combining frame and segment level processing, via temporal pooling, for phonetic classification
CN110570845B (en) Voice recognition method based on domain invariant features
CN112951240B (en) Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113257230B (en) Voice processing method and device and computer storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN111310441A (en) Text correction method, device, terminal and medium based on BERT (binary offset transcription) voice recognition
CN113744727A (en) Model training method, system, terminal device and storage medium
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN113674733A (en) Method and apparatus for speaking time estimation
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN110120231B (en) Cross-corpus emotion recognition method based on self-adaptive semi-supervised non-negative matrix factorization
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN113505611B (en) Training method and system for obtaining better speech translation model in generation of confrontation
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN113327584A (en) Language identification method, device, equipment and storage medium
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN116844573A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
ElMaghraby et al. Noise-robust speech recognition system based on multimodal audio-visual approach using different deep learning classification techniques
Zhang et al. Discriminatively trained sparse inverse covariance matrices for speech recognition
CN112863518A (en) Method and device for voice data theme recognition
CN114724547A (en) Method and system for identifying accent English
Tamura et al. GIF-SP: GA-based informative feature for noisy speech recognition
CN117238324A (en) Voice emotion recognition method and system based on multi-mode double-convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant