CN112634868A

CN112634868A - Voice signal processing method, device, medium and equipment

Info

Publication number: CN112634868A
Application number: CN202011517821.4A
Authority: CN
Inventors: 陈孝良; 孔德威; 冯大航; 常乐
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2021-04-09
Anticipated expiration: 2040-12-21
Also published as: CN112634868B

Abstract

The invention relates to a voice signal processing method, a voice signal processing device, a voice signal processing medium and voice signal processing equipment. According to the scheme provided by the embodiment of the invention, the damaged voice segment in the voice signal, namely the lost voice signal segment and/or the lost voice signal segment can be identified. And then synthesizing to obtain the voice signal corresponding to the damaged voice segment aiming at each damaged voice segment, wherein aiming at each lost voice segment, the voice signal corresponding to the lost voice segment can be synthesized by utilizing at least one normal voice signal segment adjacent to the lost voice segment according to the time length of the lost voice signal segment. And aiming at each voice signal loss segment, synthesizing the voice signal corresponding to the voice signal loss segment by utilizing a pre-trained voice synthesis model based on the voice signal loss segment, thereby realizing the restoration of the voice signal based on the voice signal obtained by synthesis, improving the voice communication quality and ensuring that a user can continuously and clearly receive and hear the voice.

Description

Voice signal processing method, device, medium and equipment

Technical Field

The present invention relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a medium, and a device for processing a speech signal.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

In the process of performing voice communication, for example, in the process of performing a video conference voice call, the voice communication quality is affected by factors such as network transmission quality. In some cases, for example, in the case of poor network communication quality, there often occurs a situation that a part of the voice signal is lost (which can be understood as poor voice quality of the part of the voice signal) or even lost (which can be understood as very poor voice quality of the part of the voice signal), which results in poor voice communication quality.

In the case of partial voice signal loss, the definition of the voice heard by the user is low, and in the case of partial voice signal loss, the user cannot hear the voice, and the integrity of the semantics is affected.

Therefore, it is desirable to provide a solution that can improve the quality of voice communication.

Disclosure of Invention

Embodiments of the present invention provide a method, an apparatus, a medium, and a device for processing a voice signal, which are used to solve the problem of poor voice communication quality when part of a voice signal is lost.

In a first aspect, the present invention provides a speech signal processing method, including:

determining at least one voice corrupted segment in the received voice signal, the voice corrupted segment comprising a voice signal lost segment and a voice signal lost segment;

synthesizing to obtain a voice signal corresponding to each damaged voice fragment, wherein if one damaged voice fragment is a lost voice signal fragment, synthesizing to obtain a voice signal corresponding to the lost voice signal fragment based on at least one normal voice signal fragment adjacent to the lost voice signal fragment according to the duration of the lost voice signal fragment; if one voice damaged segment is a voice signal loss segment, synthesizing to obtain a voice signal corresponding to the voice signal loss segment by utilizing a pre-trained voice synthesis model based on the voice signal loss segment;

and replacing the original voice signal corresponding to the damaged voice segment in the received voice signal by utilizing the voice signal corresponding to the damaged voice segment obtained by synthesis.

Optionally, if the duration of a speech signal missing segment is less than the duration corresponding to a phoneme, synthesizing to obtain a speech signal corresponding to the speech signal missing segment based on at least one speech signal normal segment adjacent to the speech signal missing segment, including:

determining at least one normal segment of the voice signal adjacent to the lost segment of the voice signal in the voice signal;

determining the voice characteristics corresponding to each voice frame of the normal voice signal segment as input, and sequentially determining the voice characteristics corresponding to each voice frame of the lost voice signal segment by utilizing a pre-trained characteristic prediction model;

and taking the voice characteristics corresponding to each voice frame of the determined voice signal loss segment as input, and sequentially synthesizing each voice point corresponding to the voice frame corresponding to the input voice characteristics by using a pre-trained vocoder model.

Optionally, if the duration of a lost segment of a speech signal is not less than the duration corresponding to a phoneme, synthesizing to obtain a speech signal corresponding to the lost segment of the speech signal based on at least one normal segment of the speech signal adjacent to the lost segment of the speech signal, including:

determining a phonon sequence corresponding to the normal fragment of the voice signal through automatic voice recognition;

determining the corresponding phononic sequence of the normal fragment of the voice signal as input, and determining the corresponding phononic sequence of the lost fragment of the voice signal by utilizing a pre-trained language model;

taking the determined phonon sequence corresponding to the voice signal loss segment as input, and sequentially determining voice characteristics corresponding to each voice frame of the voice signal loss segment by utilizing a pre-trained acoustic model;

Optionally, determining at least one normal segment of the speech signal adjacent to the lost segment of the speech signal in the speech signal includes:

and determining a normal segment of the voice signal which is adjacent to the lost segment of the voice signal and is positioned before the lost segment of the voice signal in the voice signal.

Optionally, the speech feature corresponding to each speech frame includes:

linear predictive coding LPC features, Pitch features, Pitch frequency f0 features, gain features, bark frequency cepstral coefficients BFCC features.

Optionally, based on the speech signal loss segment, synthesizing to obtain a speech signal corresponding to the speech signal loss segment by using a pre-trained speech synthesis model, including:

and taking the voice characteristics corresponding to each voice frame of the voice signal loss segment as input, and sequentially synthesizing each voice point corresponding to the voice frame corresponding to the input voice characteristics by using a pre-trained voice synthesis model.

Optionally, the speech synthesis model is obtained by training through the following method:

acquiring a training sample set, wherein each training sample pair in the training sample set comprises a voice frame of which the voice quality is not less than a second threshold, the voice quality of the processed voice frame is less than the second threshold and not less than a first threshold, the acquired voice characteristics corresponding to the processed voice frame and each voice point corresponding to the processed voice frame execute the following operations aiming at each training sample pair:

taking the voice characteristics corresponding to the processed voice frame as the input of a pre-established voice synthesis model to obtain each voice point output by the pre-established voice synthesis model;

and adjusting the pre-established speech synthesis model in a mode of reducing the error between each speech point output by the pre-established speech synthesis model and each speech point corresponding to the speech frame before processing until each training sample pair finishes the operation or the output error of the pre-established speech synthesis model is smaller than a set value.

Optionally, before determining at least one speech corrupted segment in the received speech signal, the method further includes:

it is determined that the voice quality of the received voice signal is lower than a set value.

In a second aspect, the present invention also provides a speech signal processing apparatus, comprising:

the device comprises a fragment determining module, a fragment determining module and a fragment analyzing module, wherein the fragment determining module is used for determining at least one voice damaged fragment in a received voice signal, and the voice damaged fragment comprises a voice signal lost fragment and a voice signal lost fragment;

the synthesis module is used for synthesizing each damaged voice segment to obtain a voice signal corresponding to the damaged voice segment, wherein if the damaged voice segment is determined to be a lost voice signal segment according to the voice quality of each voice frame in the received voice signal, the synthesis module synthesizes at least one normal voice signal segment adjacent to the lost voice signal segment according to the duration of the lost voice signal segment to obtain a voice signal corresponding to the lost voice signal segment; if a damaged voice segment is determined to be a voice signal loss segment according to the voice quality of each voice frame in the received voice signals, and a voice signal corresponding to the voice signal loss segment is synthesized by utilizing a pre-trained voice synthesis model based on the voice signal loss segment;

and the recovery module is used for replacing the original voice signal corresponding to the damaged voice segment in the received voice signal by utilizing the voice signal corresponding to the damaged voice segment obtained by synthesis.

In a third aspect, the present invention also provides a non-volatile computer storage medium storing an executable program for execution by a processor to implement the method as described above.

In a fourth aspect, the present invention further provides a speech signal processing device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;

the memory is used for storing a computer program;

the processor, when executing the program stored in the memory, is configured to implement the method steps as described above.

According to the scheme provided by the embodiment of the invention, the damaged voice segment in the voice signal, namely the lost voice signal segment and/or the lost voice signal segment can be identified. And then synthesizing to obtain the voice signal corresponding to the damaged voice segment aiming at each damaged voice segment, wherein aiming at each lost voice segment, the voice signal corresponding to the lost voice segment can be synthesized by utilizing at least one normal voice signal segment adjacent to the lost voice segment according to the time length of the lost voice signal segment. And aiming at each voice signal loss segment, synthesizing the voice signal corresponding to the voice signal loss segment by utilizing a pre-trained voice synthesis model based on the voice signal loss segment, thereby realizing the restoration of the voice signal based on the voice signal obtained by synthesis, improving the voice communication quality and ensuring that a user can continuously and clearly receive and hear the voice.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a speech signal processing method according to an embodiment of the present invention;

FIG. 2 is a diagram of a received speech signal according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a speech signal with missing segments according to an embodiment of the present invention;

fig. 4 is a schematic flow chart of synthesizing a speech signal corresponding to a missing segment of a speech signal according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating synthesizing a speech signal corresponding to a missing segment of the speech signal according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of an lpcnet vocoder model according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a speech synthesis model according to an embodiment of the present invention;

fig. 8 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;

fig. 9 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that, the "plurality" or "a plurality" mentioned herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.

Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The following is a brief description of several concepts involved in embodiments of the invention.

Speech signal loss segment: part of voice signals in a section of voice signals comprise at least one voice frame, and each voice frame is continuous and meets the condition that the voice quality is less than a first threshold value;

speech signal loss segment: part of voice signals in a section of voice signals comprise at least one voice frame, each voice frame is continuous and meets the condition that the voice quality is not less than a first threshold value but less than a second threshold value, wherein the second threshold value is greater than the first threshold value;

damaged speech segment: the above-mentioned voice signal loss segment and voice signal loss segment are collectively referred to as a voice damaged segment, and it can be understood that the voice quality of the voice signal loss segment is better than that of the voice signal loss segment;

normal segment of speech signal: part of voice signals in a section of voice signals comprise at least one voice frame, and each voice frame is continuous and meets the condition that the voice quality is not less than a second threshold value;

a phonon: the smallest semantic unit in a speech signal, for example, an initial or a final.

The core concept of the present invention is briefly explained below.

In the scheme provided by the embodiment of the invention, the voice signal loss segment and/or the voice signal loss segment in the voice signal can be determined, and the voice signal is repaired respectively through corresponding voice synthesis modes aiming at the voice signal loss segment or the voice signal loss segment in the voice signal, so that the voice signal quality is improved, and the user can be ensured to hear normal voice.

For a lost segment of a voice signal, a voice signal corresponding to the lost segment of the voice signal can be re-synthesized based on at least one normal segment of the voice signal adjacent to the lost segment of the voice signal according to the duration of the lost segment of the voice signal.

When the time length corresponding to the voice signal loss segment is shorter than the time length corresponding to one phoneme, because the lost voice signal does not influence the semantics, the voice feature of the voice signal loss segment can be predicted directly based on the voice feature corresponding to the normal segment of the voice signal adjacent to the voice signal loss segment, and then the voice signal corresponding to the voice signal loss segment is synthesized based on the predicted voice feature.

When the time length corresponding to the lost voice signal segment is longer and is not less than the time length corresponding to one phoneme, for example, when the lost voice signal segment corresponds to a part of a sentence, because the lost voice signal can affect the semantics, the automatic voice recognition can be performed on the normal voice signal segments adjacent to the lost voice signal segment, the phoneme sequence corresponding to the lost voice signal segment is predicted and obtained based on the recognized phoneme sequence, and then the voice feature of the lost voice signal segment can be determined based on the predicted phoneme sequence, and further the voice signal corresponding to the lost voice signal segment is synthesized and obtained based on the obtained voice feature.

For the speech signal loss segment, the speech signal corresponding to the speech signal loss segment may be re-synthesized by using a pre-trained speech synthesis model based on the speech signal loss segment, so as to implement compensation for the lost speech signal.

Based on the above description, an embodiment of the present invention provides a speech signal processing method, where the flow of the steps of the method may be as shown in fig. 1, and the method includes:

step 101, determining at least one speech breakage segment in the received speech signal.

In the present embodiment, the speech breakage segment includes a speech signal loss segment and a speech signal loss segment, and the speech quality of the speech signal loss segment is better than that of the speech signal loss segment. In this embodiment, determining at least one speech corrupted segment in the received speech signal may be understood as determining at least one speech signal missing segment and/or at least one speech signal missing segment in the received speech signal.

Wherein the speech signal loss segment and the speech signal loss segment in the received speech signal may be determined by:

the method comprises the steps of firstly, determining the voice quality of each voice frame in a received voice signal.

In this step, the speech quality of each speech frame may be determined in units of speech frames with respect to the received speech signal. The method for determining the speech quality of the speech frame may be any method, which is not described in this embodiment again. A schematic diagram of a received speech signal may be as shown in fig. 2.

The voice quality may be determined based on an index value of at least one voice quality index, and the voice quality index may be any index that can indicate voice quality, such as voice intensity, signal-to-noise ratio, and the like.

And secondly, determining a voice signal loss segment and/or a voice signal loss segment in the voice signal according to the determined voice quality of each voice frame.

In this step, a speech signal loss segment and/or a speech signal loss segment in the received speech signal may be determined according to the speech quality of each speech frame determined in the first step.

That is, there may be at least one missing segment of the speech signal and/or at least one missing segment of the speech signal in the received speech signal. In this step, a speech signal loss segment and/or a speech signal loss segment in the received speech signal may be determined.

Assuming that there is a speech signal missing segment in the received speech signal, a schematic diagram of the speech signal with the speech signal missing segment may be shown in fig. 3. In fig. 3, the speech signal discontinuity area pointed by 2 arrows can be understood as the area corresponding to two lost speech signal segments.

If there is no speech signal loss segment and no speech signal loss segment in the received speech signal, the process may be ended, and if there is a speech signal loss segment and/or a speech signal loss segment in the received speech signal, the process may continue to execute step 102.

It should be further noted that, in a possible implementation manner, it may be determined whether the voice quality of the received voice signal is lower than a set value through a voice scoring model, and if so, it is determined that the voice signal needs to be repaired, and then recognition of a damaged voice segment is performed on the voice signal whose voice quality is lower than the set value. If the voice quality of the received voice signal is judged to be not lower than the set value, the voice signal is not required to be repaired, and the process can be ended.

And 102, synthesizing each damaged voice segment to obtain a voice signal corresponding to the damaged voice segment.

In this step, for each lost segment of the voice signal, the voice signal corresponding to the lost segment of the voice signal may be synthesized based on at least one normal segment of the voice signal adjacent to the lost segment of the voice signal according to the duration of the lost segment of the voice signal.

In a possible implementation manner, if the duration of a lost segment of a speech signal is less than the duration corresponding to a phoneme, synthesizing to obtain a speech signal corresponding to the lost segment of the speech signal based on at least one normal segment of the speech signal adjacent to the lost segment of the speech signal, which may include:

determining at least one normal segment of the voice signal adjacent to the lost segment of the voice signal in the received voice signal;

taking the voice characteristics corresponding to each voice frame of the determined normal voice signal segment as input, and sequentially determining the voice characteristics corresponding to each voice frame of the lost voice signal segment by utilizing a pre-trained characteristic prediction model;

and taking the determined voice characteristics corresponding to each voice frame of the voice signal loss segment as input, and sequentially synthesizing each voice point (the voice point can be understood as a sampling point) corresponding to the voice frame corresponding to the input voice characteristics by using a pre-trained vocoder model.

It should be noted that the at least one determined normal segment of the voice signal adjacent to the lost segment of the voice signal may be a normal segment of the voice signal adjacent to the lost segment of the voice signal and located before the lost segment of the voice signal in the received voice signal. And rapidly predicting the voice characteristics corresponding to each voice frame of the voice signal loss segment through the normal segment of the voice signal before the voice signal loss segment.

Of course, the determined at least one normal segment of the voice signal adjacent to the lost segment of the voice signal may also include normal segments of the voice signal at other positions, for example, a normal segment of the voice signal adjacent to and before the lost segment of the voice signal and a normal segment of the voice signal adjacent to and before the lost segment of the voice signal in the received voice signal may also be included. So as to more accurately predict the corresponding voice characteristics of each voice frame of the voice signal loss segment through the normal segments of the voice signal before and after the voice signal loss segment.

At this time, a flow chart of synthesizing the speech signal corresponding to the missing segment of the speech signal can be as shown in fig. 4. The method comprises the steps of taking the voice characteristics corresponding to each voice frame of the determined normal voice signal segment as input, sequentially predicting the characteristics of each voice frame of the voice signal loss segment through a characteristic prediction model, taking the voice characteristics of each predicted voice frame as input, performing voice synthesis through a vocoder model, and sequentially obtaining each voice point of each voice frame of the voice signal loss segment.

The speech features corresponding to each speech frame may include, but are not limited to: linear Predictive Coding (LPC) features, Pitch (Pitch) features, Pitch frequency (f0) features, gain (gain) features, Bark Frequency Cepstral Coefficient (BFCC) features.

The feature prediction model trained in advance may be any feature prediction model, for example, a tacontron feature prediction model, a tacontron 2 feature prediction model, a wavenet feature prediction model, or a Res-net feature prediction model may be used.

The pre-trained vocoder model may be any vocoder model, for example, lpcnet vocoder model or wavernn vocoder model may be used.

In a possible implementation manner, if the duration of a lost segment of a speech signal is not less than the duration corresponding to a phoneme, synthesizing to obtain a speech signal corresponding to the lost segment of the speech signal based on at least one normal segment of the speech signal adjacent to the lost segment of the speech signal, which may include:

It should be noted that, similarly to the case where the duration of the speech signal missing segment is less than the duration corresponding to a phonon, the determined at least one speech signal normal segment adjacent to the speech signal missing segment may be a speech signal normal segment which is adjacent to the speech signal missing segment and is located before the speech signal missing segment in the received speech signal. And rapidly predicting the corresponding phoneme subsequence of each speech frame of the speech signal loss segment through the normal segment of the speech signal before the speech signal loss segment.

Of course, the determined at least one normal segment of the voice signal adjacent to the lost segment of the voice signal may also include normal segments of the voice signal at other positions, for example, a normal segment of the voice signal adjacent to and before the lost segment of the voice signal and a normal segment of the voice signal adjacent to and before the lost segment of the voice signal in the received voice signal may also be included. So as to more accurately predict the corresponding phoneme subsequence of each speech frame of the speech signal loss segment through the normal segments of the speech signal before and after the speech signal loss segment.

At this time, a flow chart of synthesizing the speech signal corresponding to the missing segment of the speech signal can be as shown in fig. 5. The method comprises the steps of taking a determined phonon sequence corresponding to a normal segment of a voice signal as input, predicting the phonon sequence corresponding to a lost segment of the voice signal through a language model, taking the predicted phonon sequence as input, determining voice characteristics corresponding to each voice frame of the lost segment of the voice signal through an acoustic model, taking the voice characteristics corresponding to each determined voice frame as input, carrying out voice synthesis through a vocoder model, and sequentially obtaining each voice point of each voice frame of the lost segment of the voice signal.

Similar to the case where the duration of the lost segment of the speech signal is less than the duration corresponding to one phoneme, the speech feature corresponding to each speech frame may include, but is not limited to: linear Predictive Coding (LPC) characteristics, Pitch (Pitch) characteristics, Pitch Frequency (f0) characteristics, gain (gain) characteristics, Bark Frequency Cepstral Coefficients (BFCC) characteristics.

The pre-trained language model may be any language model, for example, a BERT language model or a GPT language model may be used.

The acoustic model trained in advance may be any acoustic model, for example, a tacotron2 acoustic model or a wavenet acoustic model may be used.

In this step, for each speech signal loss segment, a speech signal corresponding to the speech signal loss segment may be synthesized based on the speech signal loss segment by using a speech synthesis model trained in advance.

In a possible implementation manner, for a speech signal loss segment, synthesizing a speech signal corresponding to the speech signal loss segment by using a speech synthesis model trained in advance based on the speech signal loss segment may include:

In one possible implementation, the speech synthesis model may be trained by:

It should be noted that the speech frame with the speech quality not less than the second threshold is processed, so that the speech quality of the processed speech frame is less than the second threshold and not less than the first threshold, and the processing can be implemented in any manner. For example, this may be achieved by adding noise to the speech frame, for example, by performing spectral addition to the speech frame, for example, by performing spectral subtraction to the speech frame, and so on.

The speech synthesis model may employ any speech synthesis model. In a possible implementation manner, the speech synthesis model may be implemented based on an improvement on the lpcnet vocoder model, the schematic structure of the lpcnet vocoder model may be as shown in fig. 6, and the schematic structure of the speech synthesis model obtained through the improvement on the lpcnet vocoder model may be, but is not limited to, as shown in fig. 7.

In the lpcnet vocoder model, the speech characteristics of an input speech frame can be obtained through a frame prediction network (including two convolution layers (respectively designated as conv1 × 3) and two full-connection DNN layers (respectively designated as FC)) to obtain a prediction network (including a concatenation layer (respectively designated as concat) and two GRU layers (respectively designated as GRU) to a speech point_AAnd GRU_B) The speech characteristics of the speech frame input by the fully connected DNN layer (denoted as dual FC) and the softmax classification layer (denoted as softmax) are further sequentially synthesized through the speech point prediction network and the speech quantization layer to obtain each speech point of the speech frame, for example, assuming that one speech frame is 10ms and corresponds to 160 speech points (corresponding to a sampling rate of 16000hz), 160 speech points corresponding to the speech frame may be sequentially synthesized based on the speech characteristics of the speech frame.

Wherein, the voice point prediction network can calculate the expected value p according to the voice characteristics of the voice frame input by the frame prediction network_tA speech point synthesized at time t-1 and an excitation signal e at time t-1_t-1And synthesizing another voice point corresponding to the voice frame at the time t through the voice quantization layer.

In the voice synthesis model obtained by improving the lpcnet vocoder model, the frame prediction network is mainly improved. For example, as shown in FIG. 7, the frame prediction network of the speech synthesis model may include six layers, including a GRU layer and 5 full-connected DNN layers. In addition, as shown in fig. 7, in the speech point prediction network of the speech synthesis model, two GRU layers can be simplified into one, so as to reduce the amount of calculation and improve the synthesis speed of speech points.

It should be noted that, through experiments by the inventor of the present application, it is found that, by using the speech synthesis model shown in fig. 7 for a speech signal loss segment, the accuracy of a synthesized speech point can be higher, so that the speech loss can be compensated better, and the speech quality can be improved.

And 103, replacing the original voice signal corresponding to the damaged voice segment in the received voice signal by using the voice signal corresponding to the damaged voice segment obtained by synthesis.

In this step, the synthesized speech signal may be used to replace the speech signal corresponding to the speech signal loss segment and the speech signal loss segment in the received speech signal, and the speech signal with the speech signal loss segment and the speech signal loss segment replaced is output, so that the user can hear clear and coherent speech.

According to the scheme provided by the embodiment of the invention, the problems of voice loss, voice loss and the like caused by network communication quality fluctuation can be effectively repaired, and the tone color and the like of the repaired voice can be basically consistent with the original voice. In addition, the scheme provided by the embodiment of the invention is particularly suitable for the English voice communication process, and can effectively realize the pronunciation synthesis of unknown English words, so that the voice is more coherent and the pronunciation is more natural.

Corresponding to the provided method, the following device is further provided.

An embodiment of the present invention provides a speech signal processing apparatus, where the apparatus may be as shown in fig. 8, and includes:

the segment determining module 11 is configured to determine at least one voice corrupted segment in the received voice signal, where the voice corrupted segment includes a voice signal lost segment and a voice signal lost segment;

the synthesis module 12 is configured to synthesize, for each damaged speech segment, a speech signal corresponding to the damaged speech segment, where if a damaged speech segment is determined to be a lost speech signal segment according to the speech quality of each speech frame in the received speech signal, according to the duration of the lost speech signal segment, a speech signal corresponding to the lost speech signal segment is synthesized based on at least one normal speech signal segment adjacent to the lost speech signal segment; if a damaged voice segment is determined to be a voice signal loss segment according to the voice quality of each voice frame in the received voice signals, and a voice signal corresponding to the voice signal loss segment is synthesized by utilizing a pre-trained voice synthesis model based on the voice signal loss segment;

the restoring module 13 is configured to replace the original voice signal corresponding to the damaged voice segment in the received voice signal by using the voice signal corresponding to the damaged voice segment obtained through synthesis.

It is understood that the speech signal loss segment includes at least one speech frame, each speech frame being continuous and satisfying that the speech quality is less than the first threshold;

the voice signal loss segment comprises at least one voice frame, each voice frame is continuous and meets the condition that the voice quality is not less than the first threshold value but less than a second threshold value, wherein the second threshold value is greater than the first threshold value;

the normal segment of the voice signal comprises at least one voice frame, and each voice frame is continuous and meets the condition that the voice quality is not less than the second threshold value.

Optionally, if the duration of a lost segment of a speech signal is less than the duration corresponding to a phoneme, the synthesizing module 12 synthesizes, based on at least one normal segment of the speech signal adjacent to the lost segment of the speech signal, a speech signal corresponding to the lost segment of the speech signal, including:

Optionally, if the duration of a lost segment of a speech signal is not less than the duration corresponding to a phoneme, the synthesizing module 12 synthesizes, based on at least one normal segment of the speech signal adjacent to the lost segment of the speech signal, a speech signal corresponding to the lost segment of the speech signal, including:

Further optionally, the determining, by the synthesis module 12, at least one normal segment of the speech signal adjacent to the lost segment of the speech signal in the speech signal includes:

Further optionally, the speech feature corresponding to each speech frame includes:

Optionally, the synthesizing module 12 synthesizes, based on the speech signal loss segment, a speech synthesis model trained in advance to obtain a speech signal corresponding to the speech signal loss segment, including:

acquiring a training sample set, wherein each training sample in the training sample set carries out noise adding processing on the voice features corresponding to the voice frames of which the voice quality is not less than a second threshold value, and the obtained noise added voice features corresponding to the voice frames and each voice point corresponding to the voice frames execute the following operations aiming at each training sample pair:

taking the voice characteristics after noise addition corresponding to the voice frame as the input of a pre-established voice synthesis model to obtain each voice point output by the pre-established voice synthesis model;

and adjusting the pre-established speech synthesis model in a mode of reducing the error between each speech point output by the pre-established speech synthesis model and each speech point corresponding to the speech frame until each training sample pair finishes the operation or the output error of the pre-established speech synthesis model is smaller than a set value.

Optionally, the segment determining module 11 is further configured to determine that the voice quality of the received voice signal is lower than a set value before determining at least one voice corrupted segment in the received voice signal.

The functions of the functional units of the apparatuses provided in the above embodiments of the present invention may be implemented by the steps of the corresponding methods, and therefore, detailed working processes and beneficial effects of the functional units in the apparatuses provided in the embodiments of the present invention are not described herein again.

Based on the same inventive concept, embodiments of the present invention provide the following apparatus and medium.

The structure of the speech signal processing device provided by the embodiment of the present invention can be as shown in fig. 9, and the speech signal processing device includes a processor 21, a communication interface 22, a memory 23, and a communication bus 24, where the processor 21, the communication interface 22, and the memory 23 complete mutual communication through the communication bus 24;

the memory 23 is used for storing computer programs;

the processor 21 is configured to implement the steps of the above method embodiments of the present invention when executing the program stored in the memory.

Optionally, the processor 21 may specifically include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), one or more Integrated circuits for controlling program execution, a hardware Circuit developed by using a Field Programmable Gate Array (FPGA), and a baseband processor.

Optionally, the processor 21 may include at least one processing core.

Alternatively, the Memory 23 may include a Read-Only Memory (ROM), a Random Access Memory (RAM), and a disk Memory. The memory 23 is used for storing data required by the at least one processor 21 during operation. The number of the memory 23 may be one or more.

An embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores an executable program, and when the executable program is executed by a processor, the method provided in the foregoing method embodiment of the present invention is implemented.

In particular implementations, computer storage media may include: various storage media capable of storing program codes, such as a Universal Serial Bus Flash Drive (USB), a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

In the embodiments of the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the described unit or division of units is only one division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical or other form.

The functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be an independent physical module.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device, such as a personal computer, a server, or a network device, or a processor (processor) to execute all or part of the steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a Universal Serial Bus Flash Drive (usb Flash Drive), a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method of speech signal processing, the method comprising:

2. The method of claim 1, wherein if the duration of a missing segment of a speech signal is less than the duration corresponding to a phoneme, synthesizing a speech signal corresponding to the missing segment of the speech signal based on at least one normal segment of the speech signal adjacent to the missing segment of the speech signal, comprises:

3. The method of claim 1, wherein if the duration of a missing segment of a speech signal is not less than the duration corresponding to a phoneme, synthesizing a speech signal corresponding to the missing segment of the speech signal based on at least one normal segment of the speech signal adjacent to the missing segment of the speech signal, comprises:

4. The method of claim 2 or 3, wherein determining at least one normal segment of the speech signal adjacent to the lost segment of the speech signal comprises:

5. The method of claim 2 or 3, wherein the speech characteristics corresponding to each speech frame comprise:

6. The method of claim 1, wherein synthesizing a speech signal corresponding to the lost segment of the speech signal based on the lost segment of the speech signal by using a pre-trained speech synthesis model comprises:

7. The method of claim 6, wherein the speech synthesis model is trained by:

8. The method of claim 1, wherein prior to determining at least one speech corrupted segment in the received speech signal, the method further comprises:

9. A speech signal processing apparatus, characterized in that the apparatus comprises:

10. A non-transitory computer storage medium storing an executable program for execution by a processor to perform the method of any one of claims 1 to 8.

11. A speech signal processing device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;

the memory is used for storing a computer program;

the processor, when executing the program stored in the memory, implementing the method steps of any of claims 1-8.