CN112634868B - Voice signal processing method, device, medium and equipment - Google Patents


Info

Publication number
CN112634868B
CN112634868B
Authority
CN
China
Prior art keywords
voice
segment
signal
speech
speech signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011517821.4A
Other languages
Chinese (zh)
Other versions
CN112634868A (en)
Inventor
陈孝良 (Chen Xiaoliang)
孔德威 (Kong Dewei)
冯大航 (Feng Dahang)
常乐 (Chang Le)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202011517821.4A
Publication of CN112634868A
Application granted
Publication of CN112634868B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Coding or decoding using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The invention relates to a speech signal processing method, apparatus, medium, and device. According to the scheme provided by the embodiments of the invention, the corrupted segments in a speech signal, namely the lost segments and/or the degraded segments of the signal, can be identified, and a speech signal is synthesized for each corrupted segment. For each lost segment, the corresponding speech signal can be synthesized, according to the duration of the lost segment, from at least one normal segment adjacent to it. For each degraded segment, the corresponding speech signal can be synthesized from the degraded segment itself, using a pre-trained speech synthesis model. The received signal can then be restored with the synthesized speech, improving voice communication quality so that the user hears continuous, clear speech.

Description

Voice signal processing method, device, medium and equipment
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a medium, and a device for processing a speech signal.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
During voice communication, for example a video-conference voice call, communication quality may be affected by factors such as network transmission quality. In some cases, such as when the network communication quality is poor, part of the speech signal is often degraded (that is, the speech quality of that part is poor) or even lost (that is, the speech quality of that part is extremely poor), resulting in poor voice communication quality.
When part of the speech signal is degraded, that part of the speech is less clear to the user; when part of the speech signal is lost, the user cannot hear that part at all, and the semantic integrity of the speech suffers.
Accordingly, there is a need to provide a solution that can improve the quality of voice communications.
Disclosure of Invention
The embodiments of the invention provide a speech signal processing method, apparatus, medium, and device, to solve the problem of poor voice communication quality when part of the speech signal is degraded or lost.
In a first aspect, the present invention provides a method for processing a speech signal, the method comprising:
determining at least one corrupted segment in a received speech signal, wherein a corrupted segment is a speech signal lost segment or a speech signal degraded segment;
synthesizing, for each corrupted segment, a speech signal corresponding to that segment, wherein, if a corrupted segment is a lost segment, the speech signal corresponding to the lost segment is synthesized, according to the duration of the lost segment, based on at least one normal speech signal segment adjacent to the lost segment; and if a corrupted segment is a degraded segment, the speech signal corresponding to the degraded segment is synthesized based on the degraded segment, using a pre-trained speech synthesis model;
and replacing the original speech signal corresponding to each corrupted segment in the received speech signal with the synthesized speech signal for that segment.
Optionally, if the duration of a lost segment is less than the duration of one phoneme, synthesizing the speech signal corresponding to the lost segment based on at least one normal segment adjacent to the lost segment includes:
determining, in the speech signal, at least one normal segment adjacent to the lost segment;
taking the speech features of each speech frame of the determined normal segment as input, and sequentially determining the speech features of each speech frame of the lost segment with a pre-trained feature prediction model;
and taking the determined speech features of each speech frame of the lost segment as input, and sequentially synthesizing, with a pre-trained vocoder model, the speech points of the speech frame corresponding to the input features.
Optionally, if the duration of a lost segment is not less than the duration of one phoneme, synthesizing the speech signal corresponding to the lost segment based on at least one normal segment adjacent to the lost segment includes:
determining, in the speech signal, at least one normal segment adjacent to the lost segment;
determining, through automatic speech recognition, the phoneme sequence corresponding to the normal segment;
taking the determined phoneme sequence of the normal segment as input, and determining the phoneme sequence of the lost segment with a pre-trained language model;
taking the determined phoneme sequence of the lost segment as input, and sequentially determining the speech features of each speech frame of the lost segment with a pre-trained acoustic model;
and taking the determined speech features of each speech frame of the lost segment as input, and sequentially synthesizing, with a pre-trained vocoder model, the speech points of the speech frame corresponding to the input features.
Optionally, determining at least one normal segment adjacent to the lost segment includes:
determining, in the speech signal, a normal segment that is adjacent to and precedes the lost segment.
Optionally, the speech features of each speech frame include:
at least one of a linear predictive coding (LPC) feature, a pitch feature, a fundamental frequency (f0) feature, a gain feature, and a Bark-frequency cepstral coefficient (BFCC) feature.
Optionally, synthesizing the speech signal corresponding to a degraded segment based on the degraded segment using a pre-trained speech synthesis model includes:
taking the speech features of each speech frame of the degraded segment as input, and sequentially synthesizing, with the pre-trained speech synthesis model, the speech points of the speech frame corresponding to the input features.
Optionally, the speech synthesis model is trained as follows:
obtaining a training sample set, wherein each training sample pair is constructed by processing a speech frame whose speech quality is not less than a second threshold so that the speech quality of the processed frame is less than the second threshold but not less than a first threshold; each pair consists of the speech features of the processed frame and the speech points of the frame before processing. For each training sample pair, the following operations are performed:
taking the speech features of the processed frame as the input of a pre-established speech synthesis model, and obtaining the speech points output by the pre-established speech synthesis model;
and adjusting the pre-established speech synthesis model so as to reduce the error between the speech points it outputs and the speech points of the frame before processing, until the above operations have been performed for every training sample pair or the output error of the model is less than a set value.
Optionally, before determining at least one corrupted segment in the received speech signal, the method further includes:
determining that the speech quality of the received speech signal is lower than a set value.
In a second aspect, the present invention also provides a speech signal processing apparatus, the apparatus comprising:
a segment determining module, configured to determine at least one corrupted segment in a received speech signal, wherein a corrupted segment is a speech signal lost segment or a speech signal degraded segment;
a synthesis module, configured to synthesize, for each corrupted segment, a speech signal corresponding to that segment, wherein, if a corrupted segment is determined to be a lost segment according to the speech quality of each speech frame in the received signal, the corresponding speech signal is synthesized, according to the duration of the lost segment, based on at least one normal segment adjacent to the lost segment; and if a corrupted segment is determined to be a degraded segment according to the speech quality of each speech frame in the received signal, the corresponding speech signal is synthesized based on the degraded segment, using a pre-trained speech synthesis model;
and a restoration module, configured to replace the original speech signal corresponding to each corrupted segment in the received speech signal with the synthesized speech signal for that segment.
In a third aspect, the present invention also provides a non-volatile computer storage medium storing an executable program for execution by a processor to implement the method as described above.
In a fourth aspect, the present invention further provides a speech signal processing device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor, when executing the program stored on the memory, implements the method steps described above.
According to the scheme provided by the embodiments of the invention, the corrupted segments in a speech signal, namely the lost segments and/or the degraded segments, can be identified, and a speech signal is synthesized for each corrupted segment. For each lost segment, the corresponding speech signal can be synthesized, according to the duration of the lost segment, from at least one normal segment adjacent to it. For each degraded segment, the corresponding speech signal can be synthesized from the degraded segment itself, using a pre-trained speech synthesis model. The received signal can then be restored with the synthesized speech, improving voice communication quality so that the user hears continuous, clear speech.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a voice signal processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a received voice signal according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech signal containing lost segments according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of synthesizing the speech signal corresponding to a lost segment by feature prediction according to an embodiment of the present invention;
fig. 5 is a schematic flow chart of synthesizing the speech signal corresponding to a lost segment by speech recognition and language modeling according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an LPCNet vocoder model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a speech synthesis model according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a speech signal processing device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, as used herein, "a plurality of" or "several" means two or more. "And/or" describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects it connects.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Several concepts related to embodiments of the present invention are briefly described below.
Speech signal lost segment: a portion of a speech signal comprising at least one speech frame, where the frames are consecutive and each has speech quality less than a first threshold;
Speech signal degraded segment: a portion of a speech signal comprising at least one speech frame, where the frames are consecutive and each has speech quality not less than the first threshold but less than a second threshold, the second threshold being greater than the first;
Corrupted segment: lost segments and degraded segments are collectively called corrupted segments; it can be understood that the speech quality of a degraded segment is better than that of a lost segment;
Normal segment: a portion of a speech signal comprising at least one speech frame, where the frames are consecutive and each has speech quality not less than the second threshold;
Phoneme: the smallest semantic unit in a speech signal, e.g., an initial or a final.
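To make these definitions concrete, the sketch below (an illustration added for this text, not part of the patent) groups per-frame quality scores into lost, degraded, and normal runs using the two thresholds; the quality scale, threshold values, and all names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    kind: str    # "lost", "degraded", or "normal"
    start: int   # index of the first speech frame in the run
    end: int     # index one past the last speech frame in the run

def classify_segments(frame_quality, t1, t2):
    """Group consecutive speech frames by quality band: quality < t1 -> lost;
    t1 <= quality < t2 -> degraded; quality >= t2 -> normal (with t2 > t1)."""
    def band(q):
        if q < t1:
            return "lost"
        return "degraded" if q < t2 else "normal"

    segments = []
    for i, q in enumerate(frame_quality):
        kind = band(q)
        if segments and segments[-1].kind == kind:
            segments[-1].end = i + 1          # extend the current run
        else:
            segments.append(Segment(kind, i, i + 1))
    return segments

# Illustrative per-frame quality scores in [0, 1] with made-up thresholds:
print(classify_segments([0.9, 0.4, 0.1, 0.05, 0.5, 0.95], t1=0.3, t2=0.7))
```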
The core concept of the present invention will be briefly described as follows.
According to the scheme provided by the embodiments of the invention, the lost segments and/or degraded segments in a speech signal can be determined, and the signal is restored with a synthesis method matched to each kind of segment, improving speech quality so that the user hears normal speech.
For a lost segment, the corresponding speech signal can be re-synthesized, according to the duration of the lost segment, based on at least one normal segment adjacent to it.
When the duration of the lost segment is short, that is, less than the duration of one phoneme, the lost signal does not affect the semantics. The speech features of the lost segment can therefore be predicted directly from the speech features of the adjacent normal segment, and the speech signal corresponding to the lost segment synthesized from the predicted features.
When the duration of the lost segment is long, that is, not less than the duration of one phoneme (for example, when the lost segment covers part of a sentence), the lost signal does affect the semantics. In that case, automatic speech recognition can be performed on the adjacent normal segment; the phoneme sequence of the lost segment is predicted from the recognized phoneme sequence; the speech features of the lost segment are determined from the predicted phoneme sequence; and the speech signal corresponding to the lost segment is synthesized from those features.
For a degraded segment, the corresponding speech signal can be re-synthesized from the degraded segment itself using a pre-trained speech synthesis model, thereby compensating for the degraded signal.
Based on the above description, the embodiment of the present invention provides a method for processing a voice signal, where the flow of steps of the method may be as shown in fig. 1, and the method includes:
step 101, determining at least one speech break in the received speech signal.
In this embodiment, the speech corruption section includes a speech signal loss section and a speech signal loss section, and the speech quality of the speech signal loss section is superior to that of the speech signal loss section. In this embodiment, determining at least one speech corruption segment in the received speech signal may be understood as determining at least one speech signal loss segment and/or at least one speech signal loss segment in the received speech signal.
The lost segments and degraded segments in the received speech signal may be determined as follows:
First, the speech quality of each speech frame in the received speech signal is determined.
In this step, the speech quality of the received signal may be determined frame by frame. Any method of determining the speech quality of a speech frame may be used; this embodiment does not elaborate on it. A schematic representation of a received speech signal is shown in fig. 2.
The speech quality may be determined from the value of at least one speech quality indicator, which may be any indicator that reflects speech quality, such as speech energy or signal-to-noise ratio.
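The per-frame quality indicator is left open here; purely as a sketch under assumptions (the frame length, noise-floor estimate, and all names are invented for this illustration), a simple signal-to-noise-style indicator could be computed as follows:

```python
import numpy as np

def frame_snr_db(frame, noise_floor_power=1e-6):
    """Rough per-frame quality proxy: mean frame power relative to an assumed
    fixed noise-floor power, in dB. A deployed system could instead use a
    trained voice scoring model, as mentioned later in this description."""
    power = float(np.mean(np.asarray(frame, dtype=np.float64) ** 2))
    return 10.0 * np.log10(max(power, 1e-12) / noise_floor_power)

def per_frame_quality(signal, frame_len=160):   # 10 ms frames at 16 kHz
    return [frame_snr_db(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]
```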
Second, the lost segments and/or degraded segments in the signal are determined according to the speech quality of each frame.
In this step, the lost segments and/or degraded segments in the received speech signal may be determined from the per-frame speech quality obtained in the first step. That is, the received signal may contain at least one lost segment and/or at least one degraded segment, and this step identifies them.
Assuming the received speech signal contains lost segments, a schematic diagram of such a signal is shown in fig. 3. In fig. 3, the two speech signal interruption regions indicated by the arrows correspond to two lost segments.
If the received speech signal contains neither lost segments nor degraded segments, the process may end; if it contains lost segments and/or degraded segments, the process continues with step 102.
It should also be noted that, in one possible implementation, a voice scoring model may first be used to judge whether the speech quality of the received signal is lower than a set value. If it is, the signal is considered to need repair, and corrupted-segment identification is performed on it; if it is not, the signal is considered not to need repair, and the process may end.
Step 102, synthesizing a speech signal corresponding to each corrupted segment.
In this step, for each lost segment, the corresponding speech signal may be synthesized, according to the duration of the lost segment, based on at least one normal segment adjacent to the lost segment.
In one possible implementation, if the duration of a lost segment is less than the duration of one phoneme, synthesizing the speech signal corresponding to the lost segment based on at least one adjacent normal segment may include:
determining, in the received speech signal, at least one normal segment adjacent to the lost segment;
taking the speech features of each speech frame of the determined normal segment as input, and sequentially determining the speech features of each speech frame of the lost segment with a pre-trained feature prediction model;
and taking the determined speech features of each speech frame of the lost segment as input, and sequentially synthesizing, with a pre-trained vocoder model, the speech points (a speech point can be understood as a sample point) of the speech frame corresponding to the input features.
The determined normal segment adjacent to the lost segment may be the normal segment that is adjacent to and precedes the lost segment in the received signal, so that the speech features of the lost segment's frames can be predicted quickly from the preceding normal segment.
Of course, the determined normal segments may also include others, for example both the normal segment adjacent to and preceding the lost segment and the normal segment adjacent to and following it, so that the speech features of the lost segment's frames can be predicted more accurately from the normal segments on both sides.
A flow chart of synthesizing the speech signal corresponding to a lost segment in this case is shown in fig. 4: the speech features of each frame of the determined normal segment are taken as input, and the feature prediction model predicts the features of each frame of the lost segment in turn; the predicted per-frame features are then taken as input, and the vocoder model synthesizes the speech points of each frame of the lost segment in turn.
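Purely as an illustration of this flow (the patent publishes no code), the short-gap pipeline could be organized as in the sketch below; `feature_predictor.predict` and `vocoder.synthesize` are assumed interfaces standing in for the pre-trained models named in the following paragraphs.

```python
import numpy as np

def synthesize_short_gap(normal_frame_features, n_lost_frames, feature_predictor, vocoder):
    """Lost segment shorter than one phoneme: predict the lost frames' speech
    features from the adjacent normal frames, then synthesize each frame's
    speech points with a vocoder model."""
    context = list(normal_frame_features)            # features of adjacent normal frames
    frames = []
    for _ in range(n_lost_frames):
        feats = feature_predictor.predict(context)   # assumed API: next frame's features
        frames.append(vocoder.synthesize(feats))     # assumed API: features -> speech points
        context.append(feats)                        # feed the prediction back as context
    return np.concatenate(frames)
```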
The speech features of each speech frame may include, but are not limited to: at least one of linear predictive coding (LPC) features, pitch features, fundamental frequency (f0) features, gain features, and Bark-frequency cepstral coefficient (BFCC) features.
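For concreteness, the sketch below extracts a few of these features with the librosa library; the frame length, LPC order, and pitch range are assumptions, and BFCC extraction (which librosa does not provide) is omitted.

```python
import numpy as np
import librosa

def frame_features(frame):
    """Per-frame LPC and gain features for one short frame of speech."""
    frame = np.asarray(frame, dtype=np.float64)
    lpc = librosa.lpc(frame, order=16)           # LPC coefficients (order assumed)
    gain = float(np.sqrt(np.mean(frame ** 2)))   # RMS gain
    return {"lpc": lpc, "gain": gain}

def f0_track(signal, sr=16000):
    """Fundamental frequency (f0) track; pYIN needs more context than a
    single 10 ms frame, so it runs over a longer stretch of signal."""
    f0, voiced_flag, voiced_prob = librosa.pyin(signal, fmin=60.0, fmax=400.0, sr=sr)
    return f0, voiced_flag
```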
The pre-trained feature prediction model may be any feature prediction model, for example a Tacotron, Tacotron 2, WaveNet, or ResNet feature prediction model.
The pre-trained vocoder model may be any vocoder model, for example an LPCNet vocoder model or a WaveRNN vocoder model.
In one possible implementation, if the duration of a lost segment is not less than the duration of one phoneme, synthesizing the speech signal corresponding to the lost segment based on at least one adjacent normal segment may include:
determining, in the received speech signal, at least one normal segment adjacent to the lost segment;
determining, through automatic speech recognition, the phoneme sequence corresponding to the normal segment;
taking the determined phoneme sequence of the normal segment as input, and determining the phoneme sequence of the lost segment with a pre-trained language model;
taking the determined phoneme sequence of the lost segment as input, and sequentially determining the speech features of each speech frame of the lost segment with a pre-trained acoustic model;
and taking the determined speech features of each speech frame of the lost segment as input, and sequentially synthesizing, with a pre-trained vocoder model, the speech points of the speech frame corresponding to the input features.
As in the case where the lost segment is shorter than one phoneme, the determined normal segment adjacent to the lost segment may be the normal segment that is adjacent to and precedes the lost segment in the received signal, so that the phoneme sequence of the lost segment can be predicted quickly from the preceding normal segment.
Of course, the determined normal segments may also include others, for example both the normal segment adjacent to and preceding the lost segment and the normal segment adjacent to and following it, so that the phoneme sequence of the lost segment can be predicted more accurately from the normal segments on both sides.
A flow chart of synthesizing the speech signal corresponding to a lost segment in this case is shown in fig. 5: the phoneme sequence of the determined normal segment is taken as input, and the language model predicts the phoneme sequence of the lost segment; the acoustic model then determines, from the predicted phoneme sequence, the speech features of each frame of the lost segment; finally, the vocoder model takes the determined per-frame features as input and synthesizes the speech points of each frame of the lost segment in turn.
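As with the short-gap case, a sketch of this flow follows; all four model interfaces are invented stand-ins for the pre-trained models named in the next paragraphs (e.g., BERT or GPT for the language model, Tacotron 2 or WaveNet for the acoustic model, LPCNet for the vocoder).

```python
import numpy as np

def synthesize_long_gap(normal_audio, gap_seconds, asr, language_model, acoustic_model, vocoder):
    """Lost segment not shorter than one phoneme: recover the phoneme
    sequence first, then per-frame speech features, then speech points."""
    context = asr.transcribe_phonemes(normal_audio)                # assumed API
    phonemes = language_model.predict_gap(context, gap_seconds)    # assumed API
    frames = [vocoder.synthesize(feats)                            # features -> speech points
              for feats in acoustic_model.features_for(phonemes)]  # frame-by-frame features
    return np.concatenate(frames)
```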
As before, the speech features of each speech frame may include, but are not limited to: at least one of linear predictive coding (LPC) features, pitch features, fundamental frequency (f0) features, gain features, and Bark-frequency cepstral coefficient (BFCC) features.
The pre-trained language model may be any language model, for example a BERT or GPT language model.
The pre-trained acoustic model may be any acoustic model, for example a Tacotron 2 or WaveNet acoustic model.
The pre-trained vocoder model may be any vocoder model, for example an LPCNet vocoder model or a WaveRNN vocoder model.
In this step, for each degraded segment, the corresponding speech signal may be synthesized based on the degraded segment itself, using a pre-trained speech synthesis model.
In one possible implementation, synthesizing the speech signal corresponding to a degraded segment based on the degraded segment with a pre-trained speech synthesis model may include:
taking the speech features of each speech frame of the degraded segment as input, and sequentially synthesizing, with the pre-trained speech synthesis model, the speech points of the speech frame corresponding to the input features.
In one possible implementation, the speech synthesis model may be trained as follows:
obtaining a training sample set, wherein each training sample pair is constructed by processing a speech frame whose speech quality is not less than the second threshold so that the speech quality of the processed frame is less than the second threshold but not less than the first threshold; each pair consists of the speech features of the processed frame and the speech points of the frame before processing. For each training sample pair, the following operations are performed:
taking the speech features of the processed frame as the input of a pre-established speech synthesis model, and obtaining the speech points output by the pre-established speech synthesis model;
and adjusting the pre-established speech synthesis model so as to reduce the error between the speech points it outputs and the speech points of the frame before processing, until the above operations have been performed for every training sample pair or the output error of the model is less than a set value.
Processing a speech frame whose speech quality is not less than the second threshold so that the quality of the processed frame is less than the second threshold but not less than the first threshold may be implemented in any manner, for example by adding noise to the frame, by spectral addition, or by spectral subtraction.
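The sketch below builds one such training sample pair using the additive-noise option; the SNR value and all names are assumptions made for this illustration.

```python
import numpy as np

def make_training_pair(clean_frame, extract_features, snr_db=10.0, rng=None):
    """Degrade a high-quality frame with additive white noise at an assumed
    SNR, then pair the degraded frame's speech features (model input) with
    the clean frame's speech points (training target). In practice the
    degraded frame's quality would be verified to fall between the two
    thresholds before the pair is kept."""
    rng = rng or np.random.default_rng(0)
    clean = np.asarray(clean_frame, dtype=np.float64)
    noise_power = np.mean(clean ** 2) / (10.0 ** (snr_db / 10.0))
    noisy = clean + rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return extract_features(noisy), clean
```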
The speech synthesis model may be any speech synthesis model. In one possible implementation, it may be obtained by improving the LPCNet vocoder model. The structure of the LPCNet vocoder model is shown in fig. 6, and the structure of the speech synthesis model obtained by improving it may be, but is not limited to, the one shown in fig. 7.
In the LPCNet vocoder model, the speech features of an input speech frame are passed through a frame prediction network, consisting of two convolution layers (each denoted conv 1×3) and two fully connected DNN layers (each denoted FC), to produce the conditioning input of a speech point prediction network, which consists of a concatenation layer (concat), two gated recurrent unit layers (GRU_A and GRU_B), a dual fully connected DNN layer (dual FC), and a softmax classification layer (softmax). The speech point prediction network and the quantization layer then synthesize the speech points of the frame one by one. For example, if one speech frame is 10 ms long and corresponds to 160 speech points (at a 16,000 Hz sampling rate), the 160 speech points of the frame are synthesized in turn from the frame's speech features.
At each time t, the speech point prediction network takes the frame's speech features provided by the frame prediction network, the predicted value p_t, the speech point synthesized at time t-1, and the excitation signal e_{t-1} at time t-1, and synthesizes, through the quantization layer, the next speech point of the frame.
The speech synthesis model obtained by improving the LPCNet vocoder model mainly modifies the frame prediction network. For example, as shown in fig. 7, the frame prediction network of the speech synthesis model may comprise six layers: one GRU layer and five fully connected DNN layers. In addition, as shown in fig. 7, the two GRU layers of the speech point prediction network may be simplified into one, reducing the amount of computation and increasing the speed of speech point synthesis.
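As a structural illustration only (layer sizes and the dual-FC formulation are assumptions; this is not the patent's implementation), the modified networks of fig. 7 might be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class FramePredictionNet(nn.Module):
    """Modified frame prediction network as described for fig. 7: one GRU
    layer followed by five fully connected (DNN) layers."""
    def __init__(self, n_features=20, hidden=128):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        layers = []
        for _ in range(5):
            layers += [nn.Linear(hidden, hidden), nn.Tanh()]
        self.dnn = nn.Sequential(*layers)

    def forward(self, feats):             # feats: (batch, frames, n_features)
        h, _ = self.gru(feats)
        return self.dnn(h)                # per-frame conditioning vectors

class SpeechPointPredictionNet(nn.Module):
    """Simplified speech point prediction network: a single GRU layer
    (instead of LPCNet's two) over the concatenation of the conditioning
    vector with p_t, the previous speech point, and the previous excitation,
    followed by a dual FC layer and a softmax over quantized sample values."""
    def __init__(self, cond=128, hidden=256, levels=256):
        super().__init__()
        self.gru = nn.GRU(cond + 3, hidden, batch_first=True)
        self.fc_a = nn.Linear(hidden, levels)
        self.fc_b = nn.Linear(hidden, levels)

    def forward(self, x):                 # x: (batch, steps, cond + 3)
        h, _ = self.gru(x)
        logits = torch.tanh(self.fc_a(h)) + torch.tanh(self.fc_b(h))   # dual FC
        return torch.softmax(logits, dim=-1)   # distribution over sample levels
```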
According to the inventors' experiments, applying the speech synthesis model shown in fig. 7 to degraded segments yields more accurate synthesized speech points, better compensates for the degraded speech, and improves speech quality.
Step 103, replacing the original speech signal corresponding to each corrupted segment in the received speech signal with the synthesized speech signal for that segment.
In this step, the synthesized speech signals replace the signals of the lost segments and degraded segments in the received speech signal, and the signal with those segments replaced is output, so that the user hears clear and coherent speech.
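A minimal sketch of this replacement step, assuming the corrupted segments' sample spans have already been located, is given below; cross-fading at the boundaries is a further assumption the patent does not describe.

```python
import numpy as np

def patch_signal(received, spans, synthesized):
    """Replace each corrupted segment's original samples with the synthesized
    ones. `spans` holds (start, end) sample indices of the lost and degraded
    segments; `synthesized` holds the matching synthesized audio arrays."""
    repaired = np.array(received, dtype=np.float64, copy=True)
    for (start, end), audio in zip(spans, synthesized):
        repaired[start:end] = audio[: end - start]
    return repaired
```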
The scheme provided by the embodiments of the invention can effectively repair speech degradation and speech loss caused by fluctuations in network communication quality, and the timbre of the repaired speech is essentially consistent with the original speech. The scheme is particularly suitable for English voice communication, where it can effectively synthesize pronunciations of unknown English words, making the speech more coherent and the pronunciation more natural.
Corresponding to the provided method, the following apparatus is further provided.
An embodiment of the present invention provides a speech signal processing apparatus, where the apparatus may have a structure as shown in fig. 8, and includes:
the segment determining module 11 is configured to determine at least one corrupted segment in the received speech signal, wherein a corrupted segment is a speech signal lost segment or a speech signal degraded segment;
the synthesis module 12 is configured to synthesize, for each corrupted segment, a speech signal corresponding to that segment, wherein, if a corrupted segment is determined to be a lost segment according to the speech quality of each speech frame in the received signal, the corresponding speech signal is synthesized, according to the duration of the lost segment, based on at least one normal segment adjacent to the lost segment; and if a corrupted segment is determined to be a degraded segment according to the speech quality of each speech frame in the received signal, the corresponding speech signal is synthesized based on the degraded segment, using a pre-trained speech synthesis model;
the restoration module 13 is configured to replace the original speech signal corresponding to each corrupted segment in the received speech signal with the synthesized speech signal for that segment.
It can be understood that a lost segment comprises at least one speech frame, where the frames are consecutive and each has speech quality less than a first threshold;
a degraded segment comprises at least one speech frame, where the frames are consecutive and each has speech quality not less than the first threshold but less than a second threshold, the second threshold being greater than the first;
and a normal segment comprises at least one speech frame, where the frames are consecutive and each has speech quality not less than the second threshold.
Optionally, if the duration of a lost segment is less than the duration of one phoneme, the synthesis module 12 synthesizes the speech signal corresponding to the lost segment based on at least one normal segment adjacent to the lost segment, including:
determining, in the speech signal, at least one normal segment adjacent to the lost segment;
taking the speech features of each speech frame of the determined normal segment as input, and sequentially determining the speech features of each speech frame of the lost segment with a pre-trained feature prediction model;
and taking the determined speech features of each speech frame of the lost segment as input, and sequentially synthesizing, with a pre-trained vocoder model, the speech points of the speech frame corresponding to the input features.
Optionally, if the duration of a lost segment is not less than the duration of one phoneme, the synthesis module 12 synthesizes the speech signal corresponding to the lost segment based on at least one normal segment adjacent to the lost segment, including:
determining, in the speech signal, at least one normal segment adjacent to the lost segment;
determining, through automatic speech recognition, the phoneme sequence corresponding to the normal segment;
taking the determined phoneme sequence of the normal segment as input, and determining the phoneme sequence of the lost segment with a pre-trained language model;
taking the determined phoneme sequence of the lost segment as input, and sequentially determining the speech features of each speech frame of the lost segment with a pre-trained acoustic model;
and taking the determined speech features of each speech frame of the lost segment as input, and sequentially synthesizing, with a pre-trained vocoder model, the speech points of the speech frame corresponding to the input features.
Further optionally, the synthesizing module 12 determines at least one normal segment of the speech signal adjacent to the lost segment of the speech signal, including:
and determining a normal voice signal segment which is adjacent to and positioned before the voice signal loss segment in the voice signal.
Further optionally, the voice features corresponding to each voice frame include:
the linear prediction encodes at least one of an LPC feature, a Pitch Pitch feature, a Pitch frequency f0 feature, a gain feature, and a bark frequency cepstrum coefficient BFCC feature.
Optionally, the synthesizing module 12 synthesizes, based on the speech signal loss segment, a speech signal corresponding to the speech signal loss segment by using a speech synthesis model trained in advance, including:
and taking the voice characteristics corresponding to each voice frame of the voice signal loss segment as input, and sequentially synthesizing each voice point corresponding to the voice frame corresponding to the input voice characteristics by utilizing a pre-trained voice synthesis model.
Optionally, the speech synthesis model is trained as follows:
obtaining a training sample set, wherein each training sample pair is obtained by adding noise to a speech frame whose speech quality is not less than the second threshold; each pair consists of the speech features of the noise-added frame and the speech points of the original frame. For each training sample pair, the following operations are performed:
taking the speech features of the noise-added frame as the input of a pre-established speech synthesis model, and obtaining the speech points output by the pre-established speech synthesis model;
and adjusting the pre-established speech synthesis model so as to reduce the error between the speech points it outputs and the speech points of the original frame, until the above operations have been performed for every training sample pair or the output error of the model is less than a set value.
Optionally, the segment determining module 11 is further configured to determine, before determining at least one corrupted segment in the received speech signal, that the speech quality of the received speech signal is lower than a set value.
The functions of the functional units of the apparatuses provided in the foregoing embodiments can be implemented by the steps of the corresponding methods, so the specific working processes and beneficial effects of the individual functional units are not repeated here.
Based on the same inventive concept, embodiments of the present invention provide the following apparatuses and media.
The embodiment of the invention provides a voice signal processing device, which can be structured as shown in fig. 9, and comprises a processor 21, a communication interface 22, a memory 23 and a communication bus 24, wherein the processor 21, the communication interface 22 and the memory 23 complete communication with each other through the communication bus 24;
The memory 23 is used for storing a computer program;
the processor 21 is configured to implement the steps described in the above method embodiments of the present invention when executing the program stored in the memory.
Alternatively, the processor 21 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits for controlling program execution, a hardware circuit developed using a field-programmable gate array (FPGA), or a baseband processor.
Alternatively, the processor 21 may comprise at least one processing core.
Alternatively, the memory 23 may include a read-only memory (ROM), a random access memory (RAM), and disk storage. The memory 23 is used for storing the data required by the at least one processor 21 during operation. There may be one or more memories 23.
The embodiment of the invention also provides a non-volatile computer storage medium storing an executable program which, when executed by a processor, implements the methods provided by the method embodiments of the invention.
In a specific implementation, the computer storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
In the embodiments of the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Moreover, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical or in other forms.
The functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be an independent physical module.
The integrated units, if implemented as software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the technical solution of the embodiments of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (for example, a personal computer, a server, or a network device) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method of processing a speech signal, the method comprising:
determining at least one damaged speech segment in a received speech signal, wherein a damaged speech segment is either a speech signal loss segment or a speech signal impairment segment;
synthesizing a speech signal corresponding to each damaged speech segment, wherein, if a damaged speech segment is a speech signal loss segment, the speech signal corresponding to the loss segment is synthesized, according to the duration of the loss segment, based on at least one normal speech signal segment adjacent to the loss segment; and if a damaged speech segment is a speech signal impairment segment, the speech signal corresponding to the impairment segment is synthesized from the impairment segment using a pre-trained speech synthesis model; and
replacing, in the received speech signal, the original speech signal corresponding to each damaged speech segment with the synthesized speech signal corresponding to that segment;
wherein, if the duration of a speech signal loss segment is smaller than the duration of one phone, synthesizing the speech signal corresponding to the loss segment based on at least one normal speech signal segment adjacent to the loss segment comprises:
determining, in the speech signal, at least one normal speech signal segment adjacent to the loss segment;
taking the speech features of each speech frame of the determined normal segment as input, and sequentially predicting the speech features of each speech frame of the loss segment using a pre-trained feature prediction model; and
taking the predicted speech features of each speech frame of the loss segment as input, and sequentially synthesizing, using a pre-trained vocoder model, the speech sample points of the frame corresponding to the input features.
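For concreteness, the following Python sketch wires the short-gap branch of claim 1 together. It is a minimal illustration under stated assumptions, not the patented implementation: predict_next_features and synthesize_frame are hypothetical stand-ins for the pre-trained feature prediction model and vocoder model, and the 10 ms frame size is an assumption.

    import numpy as np

    FRAME_LEN = 160  # 10 ms frames at 16 kHz (assumed)

    def conceal_short_gap(context_feats, n_lost_frames,
                          predict_next_features, synthesize_frame):
        # context_feats: per-frame feature vectors of the adjacent
        # normal segment, most recent frame last.
        history = list(context_feats)
        samples = []
        for _ in range(n_lost_frames):
            # Sequentially predict the features of the next lost frame
            # from the features determined so far (feature prediction model).
            feats = predict_next_features(history)
            history.append(feats)
            # Vocode the predicted features into the frame's sample points.
            samples.append(synthesize_frame(feats))
        return np.concatenate(samples)

Any frame-wise autoregressive feature predictor and frame-wise vocoder with these shapes would slot in; the toy stand-ins below only exercise the data flow.

    # Toy stand-ins so the sketch runs end to end.
    dummy_predict = lambda hist: hist[-1]               # repeat the last features
    dummy_vocoder = lambda feats: np.zeros(FRAME_LEN)   # silent frames
    out = conceal_short_gap([np.ones(20)] * 5, 3, dummy_predict, dummy_vocoder)
    assert out.shape == (3 * FRAME_LEN,)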
2. The method of claim 1, wherein, if the duration of a speech signal loss segment is not smaller than the duration of one phone, synthesizing the speech signal corresponding to the loss segment based on at least one normal speech signal segment adjacent to the loss segment comprises:
determining, in the speech signal, at least one normal speech signal segment adjacent to the loss segment;
determining, by automatic speech recognition, the phone sequence corresponding to the normal segment;
taking the determined phone sequence of the normal segment as input, and determining the phone sequence corresponding to the loss segment using a pre-trained language model;
taking the determined phone sequence of the loss segment as input, and sequentially determining the speech features of each speech frame of the loss segment using a pre-trained acoustic model; and
taking the determined speech features of each speech frame of the loss segment as input, and sequentially synthesizing, using a pre-trained vocoder model, the speech sample points of the frame corresponding to the input features.
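The longer-gap branch of claim 2 chains four pre-trained components. The sketch below is again hypothetical plumbing: asr, phone_lm, acoustic_model and vocoder name assumed interfaces, not models the patent specifies, and acoustic_model is assumed to yield one feature vector per lost frame.

    import numpy as np

    def conceal_long_gap(normal_segment, n_lost_frames,
                         asr, phone_lm, acoustic_model, vocoder):
        # 1. Recognize the adjacent normal segment into a phone sequence.
        context_phones = asr(normal_segment)
        # 2. Continue the phone sequence over the gap with a language
        #    model trained on phone sequences.
        lost_phones = phone_lm(context_phones)
        # 3. Map the predicted phones to per-frame speech features with
        #    an acoustic model, then 4. vocode each feature frame into
        #    sample points.
        frames = [vocoder(feats)
                  for feats in acoustic_model(lost_phones, n_lost_frames)]
        return np.concatenate(frames)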
3. The method of claim 1 or 2, wherein determining at least one normal speech signal segment adjacent to the loss segment comprises:
determining, in the speech signal, a normal speech signal segment that is adjacent to and precedes the loss segment.
4. The method of claim 1 or 2, wherein the speech features of each speech frame comprise:
at least one of a linear predictive coding (LPC) feature, a pitch feature, a fundamental frequency (f0) feature, a gain feature, and a Bark-frequency cepstral coefficient (BFCC) feature.
5. The method of claim 1, wherein synthesizing the speech signal corresponding to a speech signal impairment segment from the impairment segment using a pre-trained speech synthesis model comprises:
taking the speech features of each speech frame of the impairment segment as input, and sequentially synthesizing, using the pre-trained speech synthesis model, the speech sample points of the frame corresponding to the input features.
6. The method of claim 5, wherein the speech synthesis model is trained by:
obtaining a training sample set, wherein each training sample pair is constructed by degrading a speech frame whose speech quality is not smaller than a second threshold so that the speech quality of the degraded frame is smaller than the second threshold but not smaller than a first threshold, the pair comprising the speech features of the degraded speech frame and the speech sample points of the original speech frame; and performing the following operations for each training sample pair:
taking the speech features of the degraded speech frame as input to a pre-established speech synthesis model, and obtaining the speech sample points output by the model; and
adjusting the pre-established speech synthesis model so as to reduce the error between the speech sample points it outputs and the speech sample points of the original speech frame, until the above operations have been performed for every training sample pair or the output error of the model is smaller than a set value.
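Rendered as a PyTorch-style loop, the training procedure of claim 6 could look as follows. This is a sketch under stated assumptions: degrade(), quality(), and extract_features() are hypothetical helpers the patent does not fix, and MSE over sample points stands in for whatever error measure an implementation chooses.

    import torch

    def train_synthesis_model(model, clean_frames, extract_features,
                              degrade, quality, thr1, thr2,
                              stop_error=1e-3, lr=1e-3):
        # clean_frames: 1-D tensors of sample points, each with
        # quality(frame) >= thr2 before degradation.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.MSELoss()
        for frame in clean_frames:
            noisy = degrade(frame)
            # The degraded frame must land between the two thresholds.
            assert thr1 <= quality(noisy) < thr2
            feats = extract_features(noisy)   # features of the degraded frame
            pred = model(feats)               # predicted sample points
            loss = loss_fn(pred, frame)       # error against the clean samples
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Stop early once the output error drops below the set value.
            if loss.item() < stop_error:
                break
        return model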
7. The method of claim 1, wherein, before determining at least one damaged speech segment in the received speech signal, the method further comprises:
determining that the speech quality of the received speech signal is below a set value.
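The patent leaves the quality estimate of claim 7 open. Purely as an illustration, a segmental-SNR proxy could serve as the set-value gate; both the proxy (quietest decile of frame energies as the noise floor) and the 15 dB threshold below are assumptions, not part of the claims.

    import numpy as np

    def needs_concealment(y, frame_len=160, threshold_db=15.0):
        # Run the concealment pipeline only when the estimated
        # segmental SNR of the received signal falls below a set value.
        frames = y[: len(y) // frame_len * frame_len].reshape(-1, frame_len)
        energy = (frames ** 2).mean(axis=1) + 1e-12
        noise_floor = np.quantile(energy, 0.1)   # quietest frames ~ noise
        snr_db = 10 * np.log10(energy.mean() / noise_floor)
        return snr_db < threshold_db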
8. A speech signal processing apparatus, the apparatus comprising:
a segment determination module, configured to determine at least one damaged speech segment in a received speech signal, wherein a damaged speech segment is either a speech signal loss segment or a speech signal impairment segment;
a synthesis module, configured to synthesize a speech signal corresponding to each damaged speech segment, wherein, if a damaged speech segment is determined, according to the speech quality of each speech frame in the received speech signal, to be a speech signal loss segment, the speech signal corresponding to the loss segment is synthesized, according to the duration of the loss segment, based on at least one normal speech signal segment adjacent to the loss segment; and if a damaged speech segment is determined, according to the speech quality of each speech frame in the received speech signal, to be a speech signal impairment segment, the speech signal corresponding to the impairment segment is synthesized from the impairment segment using a pre-trained speech synthesis model; and
a recovery module, configured to replace, in the received speech signal, the original speech signal corresponding to each damaged speech segment with the synthesized speech signal corresponding to that segment;
wherein, if the duration of a speech signal loss segment is smaller than the duration of one phone, the synthesis module synthesizing the speech signal corresponding to the loss segment based on at least one normal speech signal segment adjacent to the loss segment comprises:
determining, in the speech signal, at least one normal speech signal segment adjacent to the loss segment;
taking the speech features of each speech frame of the determined normal segment as input, and sequentially predicting the speech features of each speech frame of the loss segment using a pre-trained feature prediction model; and
taking the predicted speech features of each speech frame of the loss segment as input, and sequentially synthesizing, using a pre-trained vocoder model, the speech sample points of the frame corresponding to the input features.
9. A non-transitory computer storage medium storing an executable program that is executed by a processor to implement the method of any one of claims 1 to 7.
10. A speech signal processing device, characterized in that the device comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method steps of any one of claims 1 to 7 when executing the program stored on the memory.
CN202011517821.4A 2020-12-21 2020-12-21 Voice signal processing method, device, medium and equipment Active CN112634868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011517821.4A CN112634868B (en) 2020-12-21 2020-12-21 Voice signal processing method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN112634868A CN112634868A (en) 2021-04-09
CN112634868B true CN112634868B (en) 2024-04-05

Family

ID=75320287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011517821.4A Active CN112634868B (en) 2020-12-21 2020-12-21 Voice signal processing method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112634868B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5907822A (en) * 1997-04-04 1999-05-25 Lincom Corporation Loss tolerant speech decoder for telecommunications
CN101789240A (en) * 2009-12-25 2010-07-28 华为技术有限公司 Voice signal processing method and device and communication system
CN101976567A (en) * 2010-10-28 2011-02-16 吉林大学 Voice signal error concealing method
CN107077851A (en) * 2014-08-27 2017-08-18 弗劳恩霍夫应用研究促进协会 Using for strengthening encoder, decoder and method that hiding parameter is coded and decoded to audio content
CN110277097A (en) * 2019-06-24 2019-09-24 北京声智科技有限公司 Data processing method and relevant device
CN110782906A (en) * 2018-07-30 2020-02-11 南京中感微电子有限公司 Audio data recovery method and device and Bluetooth equipment
CN111128203A (en) * 2020-02-27 2020-05-08 北京达佳互联信息技术有限公司 Audio data encoding method, audio data decoding method, audio data encoding device, audio data decoding device, electronic equipment and storage medium
CN111653285A (en) * 2020-06-01 2020-09-11 北京猿力未来科技有限公司 Packet loss compensation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101335417B1 (en) * 2008-03-31 2013-12-05 (주)트란소노 Procedure for processing noisy speech signals, and apparatus and program therefor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A low-bit-rate speech coding scheme based on wavelet transform and compressed sensing; Ye Lei; Yang Zhen; Guo Haiyan; Chinese Journal of Scientific Instrument (07); full text *

Also Published As

Publication number Publication date
CN112634868A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
CN111429889B (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
US8670990B2 (en) Dynamic time scale modification for reduced bit rate audio coding
KR102446441B1 (en) Coding mode determination method and apparatus, audio encoding method and apparatus, and audio decoding method and apparatus
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
EP2128855A1 (en) Voice encoding device and voice encoding method
CN113112995B (en) Word acoustic feature system, and training method and system of word acoustic feature system
CN112908294B (en) Speech synthesis method and speech synthesis system
CN112908301B (en) Voice recognition method, device, storage medium and equipment
CN114464162A (en) Speech synthesis method, neural network model training method, and speech synthesis model
JPH08179795A (en) Voice pitch lag coding method and device
CN112863480B (en) Method and device for optimizing end-to-end speech synthesis model and electronic equipment
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN112634868B (en) Voice signal processing method, device, medium and equipment
CN114783410B (en) Speech synthesis method, system, electronic device and storage medium
CN114120971B (en) Spoken English speech recognition and evaluation method, system, computer and readable storage medium
US20100306625A1 (en) Transmission error dissimulation in a digital signal with complexity distribution
CN114203151A (en) Method, device and equipment for training speech synthesis model
EP2228789A1 (en) Open-loop pitch track smoothing
CN112669857B (en) Voice processing method, device and equipment
CN118197277B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113826161A (en) Method and device for detecting attack in a sound signal to be coded and decoded and for coding and decoding the detected attack
CN115188362A (en) Speech synthesis model generation method and device, equipment, medium and product thereof
CN116129858A (en) Speech synthesis method, training method and device of speech posterior probability generation model
KR100934528B1 (en) Frame loss concealment method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant