CN113409803B - Voice signal processing method, device, storage medium and equipment - Google Patents

Voice signal processing method, device, storage medium and equipment

Info

Publication number: CN113409803B
Application number: CN202011233786.3A
Authority: CN (China)
Prior art keywords: voice signal, original, effective, target, original voice
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN113409803A
Inventor: 陈杰
Assignee (current and original): Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202011233786.3A
Publication of CN113409803A
Application granted; publication of CN113409803B

Classifications

    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters


Abstract

The embodiments of this application disclose a voice signal processing method, apparatus, storage medium and device, belonging to the field of artificial intelligence speech technology. The method comprises: obtaining an original voice signal to be processed; separating the original voice signal to obtain the effective voice signal contained in it; extracting features from the original voice signal to obtain its feature information; generating an enhancement coefficient for the effective voice signal according to that feature information; and enhancing the effective voice signal according to the enhancement coefficient and the original voice signal to obtain an enhanced target voice signal. With this method and apparatus, loss of the effective voice signal can be effectively avoided and the signal-to-noise ratio of the effective voice signal improved.

Description

Voice signal processing method, device, storage medium and equipment
Technical Field
The present application relates to the field of artificial intelligence speech technology, and in particular to a voice signal processing method, apparatus, storage medium and device.
Background
Voice separation is a technique for separating the effective voice signal from a voice signal so as to filter out background interference, where the effective voice signal is the signal with utilization value. For example, in a conference the effective voice signal may be the speech of the main participants, which helps a user grasp the main content of the conference; in a concert it may be the singer's vocal signal, which gives the user a better listening experience. Voice separation algorithms therefore have great practical value.
At present, voice signals are mainly separated with time-domain processing methods. These can separate out part of the effective voice signal, but background interference remains in the separated effective voice signal, and part of the effective voice signal is lost.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide a method, an apparatus, a storage medium, and a device for processing a voice signal, which can effectively avoid loss of an effective voice signal and improve a signal-to-noise ratio of the effective voice signal.
An aspect of an embodiment of the present application provides a method for processing a speech signal, including:
acquiring an original voice signal to be processed;
separating the original voice signals to obtain effective voice signals in the original voice signals;
extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;
and carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.
The feature extraction of the original voice signal to obtain feature information of the original voice signal, and generating an enhancement coefficient of the effective voice signal according to the feature information of the original voice signal, including:
dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments, wherein one original voice signal fragment corresponds to one effective voice signal fragment;
extracting the characteristics of each original voice signal segment in the at least two original voice signal segments to obtain the characteristic information of each original voice signal segment;
generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments according to the characteristic information of each original voice signal fragment;
and taking the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments as the enhancement coefficient of the effective voice signal.
Wherein the generating the enhancement coefficient of the corresponding valid voice signal segment in the at least two valid voice signal segments according to the feature information of each original voice signal segment includes:
Determining the data volume duty ratio of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;
and generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments by adopting the data volume duty ratio.
Wherein the determining the data size ratio of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment includes:
determining the data quantity of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;
acquiring the total data volume of the original voice signals;
and obtaining the ratio between the data volume of the effective voice signals included in each original voice signal segment and the total data volume of the original voice signals, and obtaining the data volume ratio of the effective voice signals included in each original voice signal segment.
Wherein the at least two original voice signal fragments comprise target original voice signal fragments, and the at least two effective voice signal fragments comprise target effective voice signal fragments corresponding to the target original voice signal fragments;
The step of performing enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal comprises the following steps:
if the enhancement coefficient of the target effective voice signal segment is larger than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, extracting an original voice signal sub-segment of a target data volume from the target original voice signal segment, and carrying out fusion processing on the original voice signal sub-segment and the target effective voice signal segment to obtain an enhanced target effective voice signal segment; the target data volume is determined according to the data volume ratio of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold value is smaller than the second enhancement coefficient threshold value;
if the enhancement coefficient of the target effective voice signal segment is greater than or equal to the second enhancement coefficient threshold, taking the target original voice signal segment as an enhanced target effective voice signal segment;
and splicing the enhanced target effective voice signal fragments to obtain an enhanced target voice signal.
Wherein generating enhancement coefficients for corresponding ones of the at least two valid speech signal segments using the data size ratio comprises:
if the data volume duty ratio corresponding to the target original voice information fragment is larger than a first data volume duty ratio threshold and smaller than a second data volume duty ratio threshold, determining a first enhancement coefficient as an enhancement coefficient of the target effective voice information fragment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;
if the corresponding data volume duty ratio of the target original voice information fragment is larger than the second data volume duty ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
The step of separating the original voice signal to obtain an effective voice signal in the original voice signal includes:
masking the original voice signal according to the characteristic information of the original voice signal to obtain a mask matrix corresponding to the original voice signal;
And separating the effective voice signals from the original voice signals according to the mask matrix corresponding to the original voice signals.
In one aspect, an embodiment of the present application provides a voice signal processing apparatus, including:
the acquisition module is used for acquiring an original voice signal to be processed;
the separation processing module is used for carrying out separation processing on the original voice signals to obtain effective voice signals in the original voice signals;
the generating module is used for extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;
and the enhancement processing module is used for carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.
Wherein, the generating module comprises:
the dividing processing unit is used for dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments, wherein one original voice signal fragment corresponds to one effective voice signal fragment;
The feature extraction unit is used for extracting the features of each original voice signal segment in the at least two original voice signal segments to obtain the feature information of each original voice signal segment;
the generating unit is used for generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments according to the characteristic information of each original voice signal fragment;
and the first determining unit is used for taking the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments as the enhancement coefficient of the effective voice signal.
Wherein, the generating unit is specifically configured to:
determining the data volume duty ratio of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;
and generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments by adopting the data volume duty ratio.
Wherein, the generating unit is further specifically configured to:
determining the data quantity of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;
Acquiring the total data volume of the original voice signals;
and obtaining the ratio between the data volume of the effective voice signals included in each original voice signal segment and the total data volume of the original voice signals, and obtaining the data volume ratio of the effective voice signals included in each original voice signal segment.
Wherein the at least two original voice signal fragments comprise target original voice signal fragments, and the at least two effective voice signal fragments comprise target effective voice signal fragments corresponding to the target original voice signal fragments;
the enhancement processing module includes:
the fusion processing unit is used for extracting original voice signal sub-segments with target data volume from the target original voice signal segments if the enhancement coefficient of the target effective voice signal segments is larger than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, and carrying out fusion processing on the original voice signal sub-segments and the target effective voice signal segments to obtain enhanced target effective voice signal segments; the target data volume is determined according to the data volume ratio of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold value is smaller than the second enhancement coefficient threshold value;
A second determining unit, configured to take the target original speech signal segment as an enhanced target valid speech signal segment if the enhancement coefficient of the target valid speech signal segment is greater than or equal to the second enhancement coefficient threshold;
and the splicing unit is used for splicing the enhanced target effective voice signal fragments to obtain an enhanced target voice signal.
Wherein, the generating unit is further specifically configured to:
if the data volume ratio corresponding to the target original voice information fragment is larger than the first data volume ratio threshold and smaller than the second data volume ratio threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;
if the corresponding data volume duty ratio of the target original voice information fragment is larger than the second data volume duty ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
Wherein, separation processing module includes:
the mask processing unit is used for carrying out mask processing on the original voice signal according to the characteristic information of the original voice signal to obtain a mask matrix corresponding to the original voice signal;
And the separation unit is used for separating the effective voice signals from the original voice signals according to the mask matrix corresponding to the original voice signals.
In one aspect, the present application provides a computer device comprising: a processor and a memory;
wherein the memory is configured to store a computer program, and the processor is configured to call the computer program to perform the following steps:
acquiring an original voice signal to be processed;
separating the original voice signals to obtain effective voice signals in the original voice signals;
extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;
and carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.
In one aspect, the present application provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, perform the steps of:
Acquiring an original voice signal to be processed;
separating the original voice signals to obtain effective voice signals in the original voice signals;
extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;
and carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.
In the embodiments of this application, the original voice signal to be processed is obtained; the original voice signal is separated to obtain the effective voice signal it contains; features are extracted from the original voice signal to obtain its feature information; and an enhancement coefficient for the effective voice signal is generated from that feature information. The effective voice signal is then enhanced according to the enhancement coefficient and the original voice signal to obtain an enhanced target voice signal. In this way, information loss in the effective voice signal is effectively avoided, that is, the performance damage to the effective voice signal is reduced; the background interference remaining in the effective voice signal is also reduced, so the signal-to-noise ratio of the effective voice signal can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a speech signal processing system according to the present application;
fig. 2 is a flow chart of a voice signal processing method provided in the present application;
fig. 3 is a schematic diagram of a method for separating a voice signal by using a Conv-TasNet pure time domain processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of separating an original speech signal according to an embodiment of the present application;
fig. 5 is a schematic diagram of a 1×D convolution processing module according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a method for generating enhancement coefficients for an active speech signal according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a system for obtaining an enhanced target speech signal provided by an embodiment of the present application;
FIG. 8 is a flow chart of another method for processing speech signals provided in the present application;
fig. 9 is a schematic structural diagram of a voice signal processing device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with technologies at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
The key technologies of speech technology (Speech Technology) include automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and voice is becoming one of the most promising modes of human-computer interaction. In this application, speech technology can be used to separate the original voice signal and obtain the effective voice signal contained in it, after which features are extracted from the original voice signal to obtain its feature information. The enhancement coefficient of the effective voice signal is generated according to the feature information of the original voice signal, and the effective voice signal is enhanced according to the enhancement coefficient and the original voice signal to obtain an enhanced target voice signal. In this way, the signal-to-noise ratio of the target voice signal can be significantly improved and the performance damage of the separated target voice signal can be reduced.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a speech signal processing system according to an embodiment of the present application. As shown in fig. 1, the speech signal processing system can comprise a server 10 and a cluster of user terminals. The cluster of user terminals may comprise one or more user terminals, the number of which will not be limited here. As shown in fig. 1, the user terminals 100a, 100b, 100c, …, and 100n may be specifically included. As shown in fig. 1, the user terminals 100a, 100b, 100c, …, 100n may respectively perform network connection with the server 10, so that each user terminal may perform data interaction with the server 10 through the network connection.
Each user terminal in the user terminal cluster may include a smart terminal with a service data processing function, such as a smart phone, tablet computer, notebook computer, desktop computer, wearable device, smart home device or head-mounted device. It should be understood that each user terminal in the user terminal cluster shown in fig. 1 may be provided with a target application (i.e. an application client); when the target application runs in a user terminal, it may exchange data with the server 10 shown in fig. 1.
As shown in fig. 1, the server 10 may be configured to perform separation processing on an original voice signal to obtain an effective voice signal in the original voice signal, then generate an enhancement coefficient of the effective voice signal according to feature information of the original voice signal, and perform enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal; the server 10 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms.
For ease of understanding, the embodiments of this application may select one user terminal from the plurality of user terminals shown in fig. 1 as the target user terminal. For example, the user terminal 100a shown in fig. 1 may be taken as the target user terminal, on which a target application (i.e. an application client) with the service data processing function is integrated. The target user terminal can then exchange data with the server 10 via the service data platform corresponding to the application client. For example, the target user terminal may send an original voice signal to the server 10; the server 10 separates the original voice signal, enhances the effective voice signal obtained from the separation, and returns the resulting target voice signal to the target user terminal.
Fig. 2 is a schematic flow chart of a voice signal processing method according to an embodiment of the present application. The method may be performed by a computer device, which may be the server 10 or any of the terminals in fig. 1. As shown in fig. 2, the voice signal processing method may include steps S101-S104.
S101, acquiring an original voice signal to be processed.
The original voice signal to be processed may be captured by a voice acquisition module, for example speech recorded by a smart television for its user, or the voice signal in a video uploaded by a user explaining some object, and so on.
S102, separating the original voice signals to obtain effective voice signals in the original voice signals.
After the original voice signal is obtained, it may contain an interfering voice signal, i.e. signals other than the effective voice signal. For example, if a target user speaks in a piece of video data, the corresponding voice signal is the effective voice signal, while vehicle horns and other such sounds are interference signals. The original voice signal may be separated with either of two processing methods, time-frequency-domain processing or pure time-domain processing, to obtain the effective voice signal in the original voice signal. The effective voice signal and the interference signal can be separated with a time-frequency-domain processing method to obtain the effective voice signal in the original voice signal; they can also be separated with a pure time-domain processing method, which processes the original voice signal directly and preserves its phase information, so that better performance can be obtained. The time-frequency-domain processing method converts the original voice signal from its representation over the time axis into a representation over the frequency axis, analyses the original voice signal there, and separates the effective voice signal from the original voice signal.
Optionally, masking processing is performed on the original voice signal according to the characteristic information of the original voice signal, so as to obtain a mask matrix corresponding to the original voice signal. And separating effective voice signals from the original voice signals according to the mask matrix corresponding to the original voice signals.
Based on the feature information of the original voice signal, the sound characteristics of different sound sources can be distinguished and a mask matrix constructed for the different sources; each mask matrix contains the mask vectors corresponding to the individual sound sources, and the effective voice signal is separated from the original voice signal according to the mask matrix corresponding to the original voice signal. Applying the mask vector of any sound source to the time-domain speech features yields the time-domain speech features of that source, so the voice signals of the different sound sources in the mixture are separated and output as the speech separation result.
Alternatively, the speech separation model may be trained in advance, specifically as follows: first, obtain a candidate speech separation model, sample voice signals and the labeled separation results corresponding to the sample voice signals. Input the sample voice signals into the candidate speech separation model and separate them to obtain predicted separation results. Compute a model loss value from the labeled and predicted separation results, and adjust the candidate model according to the loss value until the candidate model satisfies the convergence condition; the candidate model that satisfies the convergence condition is taken as the target speech separation model, and the original voice signal is separated with the target speech separation model.
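The training procedure above can be summarized by the following minimal sketch in PyTorch; the model class, the mean-squared-error loss and the convergence test are illustrative assumptions rather than details fixed by this application.

```python
import torch

def train_separator(model, loader, epochs=10, lr=1e-3, tol=1e-4):
    """Train a candidate separation model until an (assumed) convergence condition holds."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for _ in range(epochs):
        total = 0.0
        for mixture, labeled_result in loader:       # sample signal and its labeled separation result
            predicted = model(mixture)               # predicted separation result
            loss = torch.nn.functional.mse_loss(predicted, labeled_result)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if abs(prev_loss - total) < tol:             # convergence condition (assumed form)
            break
        prev_loss = total
    return model                                     # target speech separation model
```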
Fig. 3 is a schematic diagram of separating a voice signal with the Conv-TasNet pure time-domain processing method provided by an embodiment of this application. As shown in fig. 3, Conv-TasNet is a fully convolutional time-domain audio separation network consisting mainly of an Encoder, a Separator and a Decoder. The Encoder encodes the original voice signal directly in the time domain, converting each segment of the time-domain waveform into a corresponding representation in an intermediate feature space. The Separator is built from a stack of TCNs (temporal convolutional networks) and uses the encoder output to estimate a mask; the mask acts on the encoder output to keep the useful signal and remove interference. Finally, the Decoder reconstructs the masked output to obtain the separated effective voice signal. After the original voice signal is input into the Conv-TasNet model, the Encoder converts each segment of the time-domain waveform into its representation in the intermediate feature space, i.e. features are extracted from the original voice signal to obtain its feature information. The Separator generates a mask matrix from the feature information output by the Encoder, thereby obtaining the representation of the single target voice; the mask acts on the Encoder output to filter the useful signal and remove interference. The Decoder then reconstructs, i.e. decodes, the masked output to obtain the separated effective voice signal.
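As a rough illustration of this encoder / mask-estimating separator / decoder pipeline, the following PyTorch sketch wires the three stages together; the window size, feature dimension and the two-layer stand-in for the TCN stack are assumptions made for illustration, not the structure claimed by this application.

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Schematic encoder / separator / decoder pipeline (illustrative sizes)."""
    def __init__(self, win=16, feat=128):
        super().__init__()
        # Encoder: non-overlapping 1-D convolution over the time-domain waveform
        self.encoder = nn.Conv1d(1, feat, kernel_size=win, stride=win, bias=False)
        # Separator: stands in for the stack of temporal convolutional blocks (TCN)
        self.separator = nn.Sequential(
            nn.Conv1d(feat, feat, kernel_size=3, padding=1), nn.PReLU(),
            nn.Conv1d(feat, feat, kernel_size=1), nn.Sigmoid())   # mask values in [0, 1]
        # Decoder: transposed convolution reconstructs the masked representation
        self.decoder = nn.ConvTranspose1d(feat, 1, kernel_size=win, stride=win, bias=False)

    def forward(self, x):                 # x: (batch, 1, samples)
        w = self.encoder(x)               # intermediate feature-space representation
        mask = self.separator(w)          # estimated mask
        return self.decoder(w * mask)     # separated effective voice signal
```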
Fig. 4 is a schematic diagram of separating an original speech signal according to an embodiment of this application. As shown in fig. 4, the original voice signal may be input into the speech separation model and first passed through a 1×1 convolution in the encoding module. The output of the encoding module is then used as the input of the separation module, where the signal is processed by 1×1 convolutions and 1×D convolution blocks; the convolved features are then classified and the effective voice signal is separated. The separated effective voice signal is decoded to obtain the effective voice signal in the original voice signal.
Fig. 5 is a schematic diagram of the 1×D convolution processing module provided by an embodiment of this application. As shown in fig. 5, the 1×D convolution module first applies a 1×1 convolution to its input, followed by activation and normalization. A D convolution is then applied and its output is again activated and normalized, after which a final 1×1 convolution produces the module output, and a jump connection is added. The jump connection is the skip connection commonly used in residual networks; it is used to alleviate gradient explosion and gradient vanishing when training deeper networks.
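A minimal sketch of such a block is given below, assuming PyTorch; the channel counts, activation (PReLU) and normalization (GroupNorm) choices are illustrative, and the depthwise convolution stands in for the D convolution described above.

```python
import torch.nn as nn

class ConvBlock1D(nn.Module):
    """1x1 conv -> act/norm -> depthwise (D) conv -> act/norm -> 1x1 conv, with a skip connection."""
    def __init__(self, channels=128, hidden=256, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2
        self.pointwise_in = nn.Conv1d(channels, hidden, 1)
        self.depthwise = nn.Conv1d(hidden, hidden, kernel, padding=pad,
                                   dilation=dilation, groups=hidden)   # D convolution
        self.pointwise_out = nn.Conv1d(hidden, channels, 1)
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()
        self.norm1, self.norm2 = nn.GroupNorm(1, hidden), nn.GroupNorm(1, hidden)

    def forward(self, x):                       # x: (batch, channels, time)
        y = self.norm1(self.act1(self.pointwise_in(x)))
        y = self.norm2(self.act2(self.depthwise(y)))
        return x + self.pointwise_out(y)        # jump (skip) connection eases gradient flow
```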
S103, extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal.
The feature extraction can be performed on the original voice signal to obtain feature information of the original voice signal, wherein the feature information of the original voice signal comprises the data volume ratio of the effective voice signal in the original voice signal, the number of different voice signals in the original voice signal and the type of the original voice signal (such as voice signals of dubbing actors recorded in a recording studio, voice signals of a reporter recorded in bad weather, and the like). And generating enhancement coefficients of the effective voice signal according to the characteristic information of the original voice signal.
Fig. 6 is a schematic diagram of a method for generating enhancement coefficients of an effective speech signal according to an embodiment of the present application, and as shown in fig. 6, the method for generating enhancement coefficients of an effective speech signal includes steps S21-S24.
S21, dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments.
In an optional embodiment, one original voice signal segment corresponds to one effective voice signal segment. The original voice signal may be divided into at least two original voice signal segments, for example into T segments, where T is a positive integer greater than or equal to 2. The segments may have equal length, for example each of length L, where L is a natural number greater than 0, although the lengths may also differ. After the original voice signal has been separated to obtain the effective voice signal in the original voice signal, the effective voice signal can be divided in the same way as the original voice signal, giving at least two effective voice signal segments. One original voice signal segment corresponds to one effective voice signal segment, that is, each original segment has the same length and position information as its corresponding effective segment; for example, the third of the at least two original voice signal segments corresponds to the third of the at least two effective voice signal segments, and the two have the same length and position.
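The segmentation step can be sketched as follows, assuming equal-length, non-overlapping segments; the function name and the NumPy representation are illustrative.

```python
import numpy as np

def split_signal(signal: np.ndarray, segment_len: int) -> np.ndarray:
    """Split a 1-D signal of N samples into T = N // segment_len segments of length L."""
    n_segments = len(signal) // segment_len
    return signal[: n_segments * segment_len].reshape(n_segments, segment_len)

# Applying the same split to the original signal and to the separated effective signal
# keeps segment i of one aligned (same length and position) with segment i of the other.
```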
S22, extracting the characteristics of each original voice signal segment in at least two original voice signal segments to obtain the characteristic information of each original voice signal segment.
Features are extracted from each of the at least two original voice signal segments to obtain the feature information of each segment. The feature information of each original voice signal segment includes the data volume duty ratio of the effective voice signal in the segment, the number of different voice signals in the segment, the type of the segment (for example, the voice of a dubbing actor recorded in a recording studio, or the voice of a reporter recorded in severe weather), and so on.
S23, generating enhancement coefficients of corresponding effective voice signal fragments in at least two effective voice signal fragments according to the characteristic information of each original voice signal fragment.
S24, the enhancement coefficient corresponding to each effective voice signal segment in at least two effective voice signal segments is used as the enhancement coefficient of the effective voice signal.
And generating enhancement coefficients of corresponding effective voice signal segments in at least two effective voice signal segments according to the characteristic information of each original voice signal segment, namely generating the enhancement coefficients of the corresponding effective voice signal segments in at least two effective voice signal segments according to at least one of the data volume ratio of the effective voice signals in the original voice signal segments, the number of different voice signals in the original voice signal segments and the types of the original voice signal segments.
Optionally, when generating the enhancement coefficients of the corresponding effective speech signal segments in the at least two effective speech signal segments according to the feature information of each original speech signal segment, the data volume ratio of the effective speech signal included in each original speech signal segment may be determined according to the feature information of each original speech signal segment, and the enhancement coefficients of the corresponding effective speech signal segments in the at least two effective speech signal segments may be generated by using the data volume ratio.
The data volume duty ratio of the effective voice signal included in each original voice signal segment, i.e. how much effective voice the segment contains, can be determined from the feature information of that segment. The more effective voice a segment contains, the larger its data volume duty ratio; the less it contains, the smaller the ratio. The duty ratio is then used to generate the enhancement coefficient of the corresponding effective voice signal segment. The enhancement coefficient reflects the signal-to-noise ratio of the effective voice signal: a higher signal-to-noise ratio gives a higher enhancement coefficient, and the strength of the enhancement processing can be reduced; a lower signal-to-noise ratio gives a lower enhancement coefficient, and the strength of the enhancement processing can be increased. The enhancement coefficient is used to decide whether part of the original voice signal should be extracted and fused back into the effective voice signal, so as to restore the performance of the effective voice signal. If the data volume of effective voice in an original segment is larger, the enhancement coefficient of the corresponding effective segment is larger; if it is smaller, the coefficient is smaller. Because different positions in the original voice signal contain different amounts of effective voice, i.e. different degrees of mixed interference, dividing the original voice signal into at least two segments and generating an enhancement coefficient from the feature information of each segment allows each effective voice signal segment to be enhanced more precisely, which raises the signal-to-noise ratio of the enhanced target voice signal and improves its performance.
Alternatively, the enhancement coefficient of the corresponding effective voice signal segment may be generated from the number of different voice signals in the original segment: the more different voice signals there are, the more mixed the original signal is and the smaller the enhancement coefficient; the fewer there are, the cleaner the original signal is and the larger the enhancement coefficient. The coefficient may also be generated from the type of the original voice signal segment: for example, the voice of a dubbing actor recorded in a recording studio is relatively clean with little interference, so its enhancement coefficient is larger, whereas the voice of a reporter recorded in severe weather is more mixed and contains more interference, so its enhancement coefficient is lower.
After the enhancement coefficient of each effective voice signal segment in the at least two effective voice signal segments is obtained, the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments is used as the enhancement coefficient of the effective voice signal.
Alternatively, when determining the data volume duty ratio of the effective voice signal included in each original voice signal segment according to the feature information of each segment, the data volume of the effective voice signal included in each original voice signal segment may first be determined from that feature information. The total data volume of the original voice signal is then obtained, and the ratio between the data volume of the effective voice signal in each segment and the total data volume of the original voice signal gives the data volume duty ratio of the effective voice signal in that segment.
In this way the data volume duty ratio of the effective voice signal contained in each original voice signal segment is known more accurately; the enhancement coefficient is then generated from this ratio and the effective voice signal segment corresponding to the original segment is enhanced accordingly, which improves both the signal-to-noise ratio and the performance of the enhanced target voice signal.
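A small sketch of this ratio computation, under the assumption that the per-segment data volume of effective speech has already been obtained from the feature-extraction step:

```python
def data_volume_ratios(valid_volume_per_segment, total_volume):
    """Ratio of the effective-speech data volume in each original segment
    to the total data volume of the original voice signal."""
    return [volume / total_volume for volume in valid_volume_per_segment]

# Example: three segments containing 0, 120 and 200 units of effective speech
# in an original signal of 400 units give ratios [0.0, 0.3, 0.5].
print(data_volume_ratios([0, 120, 200], 400))
```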
S104, according to the enhancement coefficient of the effective voice signal and the original voice signal, enhancing the effective voice signal to obtain an enhanced target voice signal.
After the enhancement coefficient of the effective voice signal is obtained, the effective voice signal can be enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, so as to obtain an enhanced target voice signal. Thus, the signal-to-noise ratio of the target voice signal after enhancement processing can be improved, and the performance damage of the effective voice signal after separation from the original voice signal can be reduced, namely, the performance of the target voice signal after enhancement processing can be improved.
Optionally, if the enhancement coefficient of the target effective speech signal segment is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold, extracting an original speech signal sub-segment of the target data amount from the target original speech signal segment, and performing fusion processing on the original speech signal sub-segment and the target effective speech signal segment to obtain the enhanced target effective speech signal segment. The target data volume is determined according to the data volume duty ratio of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold value is smaller than the second enhancement coefficient threshold value; and if the enhancement coefficient of the target effective voice signal segment is greater than or equal to the second enhancement coefficient threshold value, taking the target original voice signal segment as the enhanced target effective voice signal segment. And then splicing the enhanced target effective voice signal fragments to obtain an enhanced target voice signal.
The at least two original voice signal fragments comprise target original voice signal fragments, the target original voice signal fragments are any one of the at least two original voice signal fragments, and the at least two effective voice signal fragments comprise target effective voice signal fragments corresponding to the target original voice signal fragments. If the enhancement coefficient of the target effective speech signal segment is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold, the target original speech signal segment corresponding to the target effective speech signal segment is indicated to contain an effective speech signal with a certain data volume and an interference signal with a certain data volume, namely the target original speech signal segment is a mixed speech signal. Because the target effective voice signal segment is separated from the target original voice signal segment, and the problem of performance damage exists, the original voice signal sub-segment with the target data volume can be extracted from the target original voice signal segment, and the original voice signal sub-segment and the target effective voice signal segment are fused to obtain the enhanced target effective voice signal segment. Thus, the signal-to-noise ratio of the separated target effective voice signal segment can be obviously improved, and the performance damage of the separated target effective voice signal segment can be reduced. The target data amount is determined based on a data amount duty cycle of the valid voice information included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold.
If the enhancement coefficient of the target effective speech signal segment is greater than or equal to the second enhancement coefficient threshold, the target original speech signal segment corresponding to the target effective speech signal segment is indicated to only contain effective speech signals, i.e. the target original speech signal segment is a clean speech signal. Because the target original voice signal segment does not have any interference signals, the target effective voice signal segment is separated from the target original voice signal segment, and the target effective voice signal segment also has the problem of performance damage. Therefore, the target original voice signal segment can be used as the enhanced target effective voice signal segment, and thus, the enhanced target effective voice signal segment does not have the problem of performance damage. After the enhanced target effective voice signal segment of each effective voice signal segment in the at least two effective voice signal segments is obtained, the enhanced target effective voice signal segment of each effective voice signal segment is spliced, and an enhanced target voice signal is obtained.
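The per-segment enhancement rule described above can be sketched as follows; the threshold values, the equal-weight fusion of the extracted sub-segment, and the choice of taking the sub-segment from the start of the segment are assumptions made for illustration, since the description does not fix them.

```python
import numpy as np

def enhance_segment(orig_seg, valid_seg, coeff, ratio, t1=0.0, t2=1.0):
    """Enhance one effective voice signal segment using its original segment."""
    if coeff >= t2:                        # segment is essentially clean speech
        return orig_seg.copy()             # use the original segment directly
    if coeff > t1:                         # mixed segment: fuse part of the original back in
        take = int(ratio * len(orig_seg))  # target data volume from the effective-speech ratio
        fused = valid_seg.astype(float).copy()
        fused[:take] = 0.5 * fused[:take] + 0.5 * orig_seg[:take]
        return fused
    return np.zeros_like(valid_seg)        # no effective speech: output an all-zero segment

def enhance(orig_segs, valid_segs, coeffs, ratios):
    # splice the enhanced segments back together into the enhanced target voice signal
    return np.concatenate([enhance_segment(o, s, c, r)
                           for o, s, c, r in zip(orig_segs, valid_segs, coeffs, ratios)])
```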
Optionally, when the data volume ratio is adopted to generate the enhancement coefficient of the corresponding effective voice signal segment in the at least two effective voice signal segments, if the data volume ratio corresponding to the target original voice information segment is greater than the first data volume ratio threshold and less than the second data volume ratio threshold, the first enhancement coefficient is determined as the enhancement coefficient of the target effective voice information segment. A first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold. And if the corresponding data volume ratio of the target original voice information fragment is larger than the second data volume ratio threshold, determining the second enhancement coefficient as the enhancement coefficient of the target effective voice information fragment, wherein the second enhancement coefficient is larger than or equal to the second enhancement coefficient threshold.
If the data size ratio corresponding to the target original voice information fragment is larger than the first data size ratio threshold and smaller than the second data size ratio threshold, the target original voice signal fragment corresponding to the target effective voice signal fragment contains an effective voice signal with a certain data size and an interference signal with a certain data size, namely the target original voice signal fragment is a mixed voice signal. Then a first enhancement coefficient may be determined as an enhancement coefficient for the target significant speech information segment, the first enhancement coefficient being greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold. If the data volume duty ratio corresponding to the target original voice information fragment is larger than the second data volume duty ratio threshold value, the target original voice signal fragment corresponding to the target effective voice signal fragment only contains effective voice signals, namely the target original voice signal fragment is a pure clean voice signal. Then a second enhancement coefficient is determined as an enhancement coefficient for the target valid speech information segment, the second enhancement coefficient being greater than or equal to the second enhancement coefficient threshold. If the data size ratio corresponding to the target original voice information fragment is smaller than the first data size ratio threshold value, the target original voice signal fragment does not contain effective voice signals and only contains interference signals, and all-zero signals can be output, and the all-zero signals do not have any signal information. Thus, the interference signal can be completely removed, and the signal-to-noise ratio of the target voice signal can be improved.
For example, the first data volume ratio threshold may be 0 and the second data volume ratio threshold may be 1. If the data volume ratio corresponding to the target original voice signal segment is greater than the first data volume ratio threshold and less than the second data volume ratio threshold, the target original voice signal segment contains not only effective voice signals but also other voice signals, i.e. it is a mixed voice signal. The first enhancement coefficient may then be determined as the enhancement coefficient of the target effective voice signal segment, the first enhancement coefficient being greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold. The value of the first enhancement coefficient may be determined based on the data volume ratio of the effective voice signal in the target original voice signal segment: the higher the data volume ratio of the effective voice signal in the target original voice signal segment, the higher the first enhancement coefficient; the lower the data volume ratio, the lower the first enhancement coefficient. When the enhancement coefficient of the target effective voice signal segment is the second enhancement coefficient, the target original voice signal segment contains only the effective voice signal.
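The mapping from data volume ratio to enhancement coefficient can be summarized as in the sketch below, assuming the example thresholds 0 and 1 from above; the concrete coefficient values `coeff1` and `coeff2` are placeholders chosen for illustration.

```python
def enhancement_coefficient(ratio, ratio_thr1=0.0, ratio_thr2=1.0,
                            coeff1=0.5, coeff2=1.0):
    """Map the data-volume ratio of effective speech in an original segment
    to an enhancement coefficient (all constants are assumed values)."""
    if ratio >= ratio_thr2:
        return coeff2          # clean segment: coefficient >= second threshold
    if ratio > ratio_thr1:
        return coeff1          # mixed segment: coefficient between the thresholds
    return 0.0                 # interference-only segment: output an all-zero signal
```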
Fig. 7 is a schematic diagram of an enhanced target voice signal obtaining system according to an embodiment of the present application. The first module is a speech separation model; its input is an original speech signal X of length N, i.e. N sampling points, and its output is the separated effective speech signal S, also of length N. The speech separation model of the first module may be a ConvTasNet network. The second module comprises four parts: a segmentation part, an encoding part, a self-attention network part, and a classification part. First, the segmentation part divides the input original voice signal to obtain at least two original voice signal segments; for example, the original voice signal is divided into T non-overlapping original voice signal segments of length L, where T = N/L, giving an original voice signal X of dimension T×L. The separated effective voice signal is then divided in the same way as the original voice signal, for example into T non-overlapping segments of length L with T = N/L, giving an effective voice signal S of dimension T×L. Next, the at least two original voice signal segments are input into the encoding part, that is, the original voice signal X is input into an encoder composed of a 1D-Conv for encoding. The convolution kernel size of the one-dimensional convolution network in the encoder is L and the stride is also L, i.e. a non-overlapping convolution is performed; the number of input channels is 1 and the number of output channels is D, which is the dimension of the encoded feature. After processing, an encoded feature of dimension T×D is obtained, i.e. an encoded feature of dimension 1×D for each original voice signal segment, thereby obtaining the feature information of each of the at least two original voice signal segments.
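A minimal PyTorch sketch of the segmentation and encoding parts of Fig. 7 might look as follows; the values of N, L, and D and the module name `SegmentEncoder` are assumptions for the example.

```python
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Splits a length-N signal into T = N / L non-overlapping segments of
    length L and encodes each segment into a D-dimensional feature, roughly
    as in the encoding part of Fig. 7 (a sketch, not the patented design)."""

    def __init__(self, seg_len: int, feat_dim: int):
        super().__init__()
        self.seg_len = seg_len
        # kernel size = stride = L gives a non-overlapping convolution;
        # 1 input channel, D output channels.
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=seg_len, stride=seg_len)

    def forward(self, x: torch.Tensor):
        # x: (batch, N) raw waveform; N is assumed divisible by L in this sketch.
        segments = x.unfold(dimension=1, size=self.seg_len, step=self.seg_len)  # (batch, T, L)
        feats = self.encoder(x.unsqueeze(1))       # (batch, D, T)
        return segments, feats.transpose(1, 2)      # segments (batch, T, L), features (batch, T, D)

# Example usage with assumed sizes: N = 16000, L = 40, D = 256.
enc = SegmentEncoder(seg_len=40, feat_dim=256)
segs, feats = enc(torch.randn(2, 16000))            # segs: (2, 400, 40), feats: (2, 400, 256)
```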
Then, the encoded features of the at least two original speech signal segments are input into a self-attention network, where a position encoding operation is added before the first self-attention layer to add position information, and a Linear layer with an output dimension of 3 followed by a Softmax layer is added after the last self-attention layer, so that features of dimension T×3 are output; the three dimensions represent the weighting coefficients of the effective speech signal, the original speech signal, and the all-zero signal respectively. The self-attention network part uses the information of the whole original voice signal to learn the weighting coefficients corresponding to each of the T original voice signal segments (namely the weighting coefficient of the target effective voice signal segment, the weighting coefficient of the target original voice signal segment, and the weighting coefficient of the all-zero signal), thereby acting as a gating mechanism; the weighting coefficient corresponding to the target effective voice signal segment is the enhancement coefficient. Finally, the at least two original voice signal segments and effective voice signal segments obtained by the segmentation part are weighted and summed according to the weighting coefficients obtained from the self-attention network, yielding the enhanced target voice signal. Thus, when the input original voice signal is a mixed voice signal, the original voice signal can be separated to obtain the effective voice signal, and the enhancement coefficient of the effective voice signal is then determined according to the data volume ratio of the effective voice signal in the original voice signal. According to the enhancement coefficient, an original voice signal of the target data volume is extracted from the original voice signal and fused with the effective voice signal to obtain the enhanced target voice signal, which repairs the separated effective voice signal to a certain extent, effectively improves its signal-to-noise ratio, and improves the accuracy of subsequent voice recognition. When the input original voice signal is a clean voice signal, i.e. a pure effective voice signal, separating it would cause performance damage in the resulting effective voice signal, so the original voice signal can be used directly as the enhanced target voice signal and output directly, and the obtained target voice signal has no post-separation performance damage. When the input original voice signal is a pure interference signal, separating it would leave part of the interference signal as a residue, so an all-zero signal can be used directly as the target voice signal; since the all-zero signal carries no signal information, the interference information is completely removed and the signal-to-noise ratio of the target voice signal is improved. The scheme can be used in scenarios where the separated effective voice signal is enhanced after voice separation, and can also be applied directly to voice enhancement scenarios.
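The gating and weighted fusion performed by the self-attention and classification parts can be sketched as below. The use of `nn.TransformerEncoder`, the layer sizes, and the omission of the position encoding are simplifying assumptions; the sketch only illustrates the T×3 softmax weights and the per-segment weighted sum.

```python
import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    """Sketch of the self-attention gating of Fig. 7: per-segment encoded
    features are turned into three weights (effective segment, original
    segment, all-zero signal) and the output is their weighted sum."""

    def __init__(self, feat_dim: int = 256, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        # Position encoding is omitted here for brevity (an intentional simplification).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.self_attention = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 3), nn.Softmax(dim=-1))

    def forward(self, feats, orig_segs, sep_segs):
        # feats: (batch, T, D); orig_segs / sep_segs: (batch, T, L)
        weights = self.classifier(self.self_attention(feats))   # (batch, T, 3)
        w_sep, w_orig, w_zero = weights.unbind(dim=-1)           # each (batch, T)
        fused = (w_sep.unsqueeze(-1) * sep_segs
                 + w_orig.unsqueeze(-1) * orig_segs)             # all-zero term adds nothing
        return fused.flatten(start_dim=1)                        # splice back to (batch, N)
```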
Optionally, a candidate speech enhancement model may be obtained together with a sample original speech signal and the labeled target speech signal corresponding to it. The sample original speech signal is separated using the candidate speech enhancement model to obtain the effective speech signal in the original speech signal, and feature extraction is performed on the original speech signal to obtain its feature information. The enhancement coefficient of the effective speech signal in the original speech signal is then determined according to the feature information of the original speech signal, i.e. the weighting coefficients corresponding to the original speech signal (the weighting coefficient of the target effective speech signal segment, the weighting coefficient of the target original speech signal segment, and the weighting coefficient of the all-zero signal) are obtained, which determines the enhancement coefficient of the effective speech signal. The effective speech signal is enhanced according to its enhancement coefficient and the original speech signal, and a predicted target speech signal is output. A prediction loss value of the candidate speech enhancement model is determined from the labeled target speech signal and the predicted target speech signal. The candidate speech enhancement model is adjusted according to the prediction loss value until it satisfies the convergence condition, and the candidate speech enhancement model satisfying the convergence condition is taken as the target speech enhancement model. The input original speech signal may then be processed by the target speech enhancement model to obtain the enhanced target speech signal.
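A training loop matching this description might be sketched as follows; the mean-squared-error loss and Adam optimizer are assumptions, since the embodiment only specifies that a prediction loss is computed from the labeled and predicted target speech signals and that the model is adjusted until convergence.

```python
import torch
import torch.nn.functional as F

def train_enhancement_model(model, loader, epochs=10, lr=1e-3):
    """Sketch of training the candidate speech enhancement model.  The MSE
    loss and Adam optimizer are assumed choices, not taken from the patent."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for sample_orig, labeled_target in loader:
            predicted_target = model(sample_orig)        # separate + gate + fuse
            loss = F.mse_loss(predicted_target, labeled_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                              # adjust the candidate model
    return model                                          # treated as the target model
```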
In the embodiment of the application, the original voice signal to be processed is obtained, the original voice signal is subjected to separation processing to obtain an effective voice signal in the original voice signal, the original voice signal is subjected to feature extraction to obtain feature information of the original voice signal, and the enhancement coefficient of the effective voice signal is generated according to the feature information of the original voice signal. And carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal. When the input original voice signal is a mixed voice signal, the original voice sub-signal of the target data amount is extracted from the original voice signal according to the enhancement coefficient of the effective voice signal, the effective voice signal and the original voice sub-signal are fused, the enhanced target voice signal is obtained, the separated effective voice signal can be repaired to a certain extent, and the signal to noise ratio of the target voice signal can be effectively improved. When the input original voice signal is a pure clean voice signal, the original voice signal can be directly used as an enhanced target voice signal, and the original voice signal can be directly output, so that the obtained target voice signal can not have the problem of performance damage after separation. When the input original voice signal is a pure interference signal, the all-zero signal can be directly used as a target voice signal, and the input all-zero signal has no signal information, so that the interference information can be perfectly removed, and the signal-to-noise ratio of the target voice signal is improved. The effective voice signal is enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, so that an enhanced target voice signal is obtained; the information loss of the effective voice signal can be effectively avoided, namely, the performance damage of the effective voice signal is reduced; and the background interference signal in the effective voice signal is reduced, so that the signal-to-noise ratio of the effective voice signal can be improved.
Fig. 8 is a schematic diagram of another voice signal processing method according to an embodiment of the present application; as shown in fig. 8, the method includes steps S201-S207.
S201, obtaining an original voice signal to be processed.
S202, separating the original voice signals to obtain effective voice signals in the original voice signals.
The details of steps S201-S202 can be found in the embodiment described in fig. 2 and are not repeated here.
S203, dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments.
S204, extracting the characteristics of each original voice signal segment in the at least two original voice signal segments to obtain the characteristic information of each original voice signal segment.
S205, generating enhancement coefficients of corresponding effective voice signal fragments in at least two effective voice signal fragments according to the characteristic information of each original voice signal fragment.
S206, using the enhancement coefficient corresponding to each effective voice signal segment in at least two effective voice signal segments as the enhancement coefficient of the effective voice signal.
S207, according to the enhancement coefficient of the effective voice signal and the original voice signal, enhancing the effective voice signal to obtain an enhanced target voice signal.
In step S203, the original voice signal is divided to obtain at least two original voice signal segments, and the effective voice signal is divided to obtain at least two effective voice signal segments.
In this embodiment of the present application, the original speech signal may be divided into at least two original speech signal segments; if the original speech signal is divided into T original speech signal segments, T is a positive integer greater than or equal to 2. The length of each original voice signal segment may be equal, for example each segment has length L, where L is a natural number greater than 0; of course, the lengths of the original voice signal segments may also be unequal. After the original voice signal is separated to obtain the effective voice signal in the original voice signal, the effective voice signal can be divided according to the same method used for dividing the original voice signal, obtaining at least two effective voice signal segments. One original speech signal segment corresponds to one valid speech signal segment, that is, each original speech signal segment has the same length and position information as the corresponding valid speech signal segment; for example, the third of the at least two original speech signal segments corresponds to the third of the at least two valid speech signal segments, and the two have the same length and position.
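The one-to-one correspondence between original and effective segments can be illustrated with a short sketch (equal segment lengths and an evenly divisible signal length are assumed):

```python
import numpy as np

N, L = 16000, 40                         # assumed signal length and segment length
T = N // L                               # number of segments
original = np.random.randn(N)
effective = np.random.randn(N)           # stands in for the separated effective signal

orig_segs = original.reshape(T, L)       # segment i of the original signal ...
eff_segs = effective.reshape(T, L)       # ... corresponds to segment i here
assert orig_segs.shape == eff_segs.shape == (T, L)
```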
Feature extraction is performed on each of the at least two original voice signal segments to obtain the feature information of each original voice signal segment. The feature information of each original voice signal segment includes the data volume ratio of the effective voice signal in that original voice signal segment, the number of different voice signals in the original voice signal segment, the type of the original voice signal segment, and the like (for example, a voice signal of a dubbing actor recorded in a recording studio, a voice signal of a reporter recorded in severe weather, and so on).
The enhancement coefficient of the corresponding effective voice signal segment among the at least two effective voice signal segments is generated according to the feature information of each original voice signal segment, that is, according to at least one of the data volume ratio of the effective voice signal in the original voice signal segment, the number of different voice signals in the original voice signal segment, and the type of the original voice signal segment.
After the enhancement coefficient of the effective voice signal is obtained, the effective voice signal can be enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, so as to obtain an enhanced target voice signal. Thus, the signal-to-noise ratio of the target voice signal after enhancement processing can be improved, and the performance damage of the effective voice signal after separation from the original voice signal can be reduced, namely, the performance of the target voice signal after enhancement processing can be improved.
The details of this embodiment can be found in the embodiment described with reference to fig. 2, which is not further described here.
In the embodiment of the application, the original voice signal to be processed is obtained, the original voice signal is subjected to separation processing to obtain an effective voice signal in the original voice signal, the original voice signal is subjected to feature extraction to obtain feature information of the original voice signal, and the enhancement coefficient of the effective voice signal is generated according to the feature information of the original voice signal. And carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal. When the input original voice signal is a mixed voice signal, the original voice sub-signal of the target data amount is extracted from the original voice signal according to the enhancement coefficient of the effective voice signal, the effective voice signal and the original voice sub-signal are fused, the enhanced target voice signal is obtained, the separated effective voice signal can be repaired to a certain extent, and the signal to noise ratio of the target voice signal can be effectively improved. When the input original voice signal is a pure clean voice signal, the original voice signal can be directly used as an enhanced target voice signal, and the original voice signal can be directly output, so that the obtained target voice signal can not have the problem of performance damage after separation. When the input original voice signal is a pure interference signal, the all-zero signal can be directly used as a target voice signal, and the input all-zero signal has no signal information, so that the interference information can be perfectly removed, and the signal-to-noise ratio of the target voice signal is improved. The effective voice signal is enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, so that an enhanced target voice signal is obtained; the information loss of the effective voice signal can be effectively avoided, namely, the performance damage of the effective voice signal is reduced; and the background interference signal in the effective voice signal is reduced, so that the signal-to-noise ratio of the effective voice signal can be improved.
Fig. 9 is a schematic structural diagram of a speech signal processing device according to an embodiment of the present application. The above-mentioned speech signal processing means may be a computer program (comprising program code) running in a computer device, for example the speech signal processing means is an application software; the device can be used for executing corresponding steps in the method provided by the embodiment of the application. As shown in fig. 9, the voice signal processing apparatus may include: an acquisition module 11, a separation processing module 12, a generation module 13, and an enhancement processing module 14.
An acquisition module 11, configured to acquire an original speech signal to be processed.
The separation processing module 12 is configured to perform separation processing on the original voice signal, so as to obtain an effective voice signal in the original voice signal.
The generating module 13 is configured to perform feature extraction on the original speech signal to obtain feature information of the original speech signal, and generate an enhancement coefficient of the effective speech signal according to the feature information of the original speech signal.
The enhancement processing module 14 is configured to perform enhancement processing on the effective speech signal according to the enhancement coefficient of the effective speech signal and the original speech signal, so as to obtain an enhanced target speech signal.
Wherein, the generating module 13 includes:
the dividing processing unit is used for dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments, wherein one original voice signal fragment corresponds to one effective voice signal fragment;
the feature extraction unit is used for extracting the features of each original voice signal segment in the at least two original voice signal segments to obtain the feature information of each original voice signal segment;
the generating unit is used for generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments according to the characteristic information of each original voice signal fragment;
and the first determining unit is used for taking the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments as the enhancement coefficient of the effective voice signal.
Wherein, the generating unit is specifically configured to:
determining the data volume duty ratio of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;
And generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments by adopting the data volume duty ratio.
Wherein, the generating unit is further specifically configured to:
determining the data quantity of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;
acquiring the total data volume of the original voice signals;
and obtaining the ratio between the data volume of the effective voice signals included in each original voice signal segment and the total data volume of the original voice signals, and obtaining the data volume ratio of the effective voice signals included in each original voice signal segment.
Wherein the at least two original voice signal fragments comprise target original voice signal fragments, and the at least two effective voice signal fragments comprise target effective voice signal fragments corresponding to the target original voice signal fragments;
the enhancement processing module 14 includes:
the fusion processing unit is used for extracting original voice signal sub-segments with target data volume from the target original voice signal segments if the enhancement coefficient of the target effective voice signal segments is larger than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, and carrying out fusion processing on the original voice signal sub-segments and the target effective voice signal segments to obtain enhanced target effective voice signal segments; the target data volume is determined according to the data volume ratio of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold value is smaller than the second enhancement coefficient threshold value;
A second determining unit, configured to take the target original speech signal segment as an enhanced target valid speech signal segment if the enhancement coefficient of the target valid speech signal segment is greater than or equal to the second enhancement coefficient threshold;
and the splicing unit is used for splicing the enhanced target effective voice signal fragments to obtain an enhanced target voice signal.
Wherein, the generating unit is further specifically configured to:
if the data volume ratio corresponding to the target original voice information fragment is larger than the first data volume ratio threshold and smaller than the second data volume ratio threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;
if the corresponding data volume duty ratio of the target original voice information fragment is larger than the second data volume duty ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
Wherein the separation processing module 12 comprises:
the mask processing unit is used for carrying out mask processing on the original voice signal according to the characteristic information of the original voice signal to obtain a mask matrix corresponding to the original voice signal;
And the separation unit is used for separating the effective voice signals from the original voice signals according to the mask matrix corresponding to the original voice signals.
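As a rough sketch of the mask-based separation performed by these two units (an illustration only; the network that estimates the mask and the decoding back to the waveform are omitted, and element-wise masking in the feature domain is an assumption):

```python
import torch

def separate_with_mask(encoded_orig: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Apply a mask matrix to the encoded original signal so that only the
    effective speech component is kept (sketch; mask estimation and waveform
    reconstruction are not shown)."""
    return encoded_orig * mask            # element-wise masking, e.g. shapes (T, D)
```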
According to one embodiment of the present application, the steps involved in the speech signal processing method shown in fig. 2 may be performed by respective modules in the speech signal processing apparatus shown in fig. 9. For example, step S101 shown in fig. 2 may be performed by the acquisition module 11 in fig. 9, and step S102 shown in fig. 2 may be performed by the separation processing module 12 in fig. 9; step S103 shown in fig. 2 may be performed by the generation module 13 in fig. 9; step S104 shown in fig. 2 may be performed by the enhancement processing module 14 in fig. 9.
According to an embodiment of the present application, the modules in the speech signal processing apparatus shown in fig. 9 may be separately or wholly combined into one or several units, or some of the units may be further split into a plurality of sub-units with smaller functions, which can implement the same operations without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logical functions; in practical applications, the function of one module may be implemented by a plurality of units, or the functions of a plurality of modules may be implemented by one unit. In other embodiments of the present application, the speech signal processing apparatus may also include other units, and in practical applications these functions may also be implemented with the assistance of other units or by the cooperation of a plurality of units.
According to one embodiment of the present application, a speech signal processing apparatus as shown in fig. 9 may be constructed, and the speech signal processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 or fig. 8 on a general-purpose computer device that includes processing elements such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device via the computer-readable recording medium, and executed therein.
In the embodiment of the application, the original voice signal to be processed is obtained, the original voice signal is subjected to separation processing to obtain an effective voice signal in the original voice signal, the original voice signal is subjected to feature extraction to obtain feature information of the original voice signal, and the enhancement coefficient of the effective voice signal is generated according to the feature information of the original voice signal. And carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal. When the input original voice signal is a mixed voice signal, the original voice sub-signal of the target data amount is extracted from the original voice signal according to the enhancement coefficient of the effective voice signal, the effective voice signal and the original voice sub-signal are fused, the enhanced target voice signal is obtained, the separated effective voice signal can be repaired to a certain extent, and the signal to noise ratio of the target voice signal can be effectively improved. When the input original voice signal is a pure clean voice signal, the original voice signal can be directly used as an enhanced target voice signal, and the original voice signal can be directly output, so that the obtained target voice signal can not have the problem of performance damage after separation. When the input original voice signal is a pure interference signal, the all-zero signal can be directly used as a target voice signal, and the input all-zero signal has no signal information, so that the interference information can be perfectly removed, and the signal-to-noise ratio of the target voice signal is improved. The effective voice signal is enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, so that an enhanced target voice signal is obtained; the information loss of the effective voice signal can be effectively avoided, namely, the performance damage of the effective voice signal is reduced; and the background interference signal in the effective voice signal is reduced, so that the signal-to-noise ratio of the effective voice signal can be improved.
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 10, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable communication among these components. The user interface 1003 may include a display and a keyboard, and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g. a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory; it may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 10, the network interface 1004 may provide network communication functions, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 may be used to invoke the device control application program stored in the memory 1005 to implement:
acquiring an original voice signal to be processed;
separating the original voice signals to obtain effective voice signals in the original voice signals;
extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;
and carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.
Alternatively, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005 to implement:
dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments, wherein one original voice signal fragment corresponds to one effective voice signal fragment;
Extracting the characteristics of each original voice signal segment in the at least two original voice signal segments to obtain the characteristic information of each original voice signal segment;
generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments according to the characteristic information of each original voice signal fragment;
and taking the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments as the enhancement coefficient of the effective voice signal.
Alternatively, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005 to implement:
determining the data volume duty ratio of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;
and generating enhancement coefficients of corresponding effective voice signal fragments in the at least two effective voice signal fragments by adopting the data volume duty ratio.
Alternatively, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005 to implement:
determining the data quantity of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;
Acquiring the total data volume of the original voice signals;
and obtaining the ratio between the data volume of the effective voice signals included in each original voice signal segment and the total data volume of the original voice signals, and obtaining the data volume ratio of the effective voice signals included in each original voice signal segment.
Alternatively, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005 to implement:
the step of performing enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal comprises the following steps:
if the enhancement coefficient of the target effective voice signal segment is larger than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, extracting an original voice signal sub-segment of a target data volume from the target original voice signal segment, and carrying out fusion processing on the original voice signal sub-segment and the target effective voice signal segment to obtain an enhanced target effective voice signal segment; the target data volume is determined according to the data volume ratio of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold value is smaller than the second enhancement coefficient threshold value;
If the enhancement coefficient of the target effective voice signal segment is greater than or equal to the second enhancement coefficient threshold, taking the target original voice signal segment as an enhanced target effective voice signal segment;
and splicing the enhanced target effective voice signal fragments to obtain an enhanced target voice signal.
Alternatively, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005 to implement:
if the data volume ratio corresponding to the target original voice information fragment is larger than the first data volume ratio threshold and smaller than the second data volume ratio threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;
if the corresponding data volume duty ratio of the target original voice information fragment is larger than the second data volume duty ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
Alternatively, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005 to implement:
Masking the original voice signal according to the characteristic information of the original voice signal to obtain a mask matrix corresponding to the original voice signal;
and separating the effective voice signals from the original voice signals according to the mask matrix corresponding to the original voice signals.
In the embodiment of the application, by acquiring the original voice signal to be processed, separating the original voice signal to obtain the effective voice signal in the original voice signal, extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal, wherein the enhancement coefficient is used for determining whether a part of original voice signal is extracted from the original voice signal to be fused into the effective voice signal or not, so that the performance of the effective voice signal is enhanced. And carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal. When the input original voice signal is a mixed voice signal, the original voice sub-signal of the target data amount is extracted from the original voice signal according to the enhancement coefficient of the effective voice signal, the effective voice signal and the original voice sub-signal are fused, the enhanced target voice signal is obtained, the separated effective voice signal can be repaired to a certain extent, the signal to noise ratio of the target voice signal can be effectively improved, and the accuracy rate of the subsequent target voice recognition is improved. When the input original voice signal is a pure clean voice signal, namely a pure effective voice signal, the original voice signal is separated, and the obtained effective voice signal has the problem of performance damage, so that the original voice signal can be directly used as an enhanced target voice signal and directly output the original voice signal, and the obtained target voice signal has no problem of performance damage after separation. When the input original voice signal is a pure interference signal, if the original voice signal is separated, part of interference signal residues exist, so that an all-zero signal can be directly used as a target voice signal, the input all-zero signal has no signal information, the interference information can be perfectly removed, and the signal to noise ratio of the target voice signal is improved. According to the method and the device, the signal-to-noise ratio of the target voice signal can be remarkably improved, the performance damage of the target voice signal is reduced, and the accuracy of subsequent target voice recognition processing is improved.
It should be understood that the computer device 1000 described in the embodiment of the present application may perform the description of the above-mentioned voice signal processing method in the embodiment corresponding to fig. 2 or fig. 8, and may also perform the description of the above-mentioned voice signal processing apparatus in the embodiment corresponding to fig. 9, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device may execute the method for processing a voice signal in the embodiment corresponding to fig. 2 or fig. 8, which is not described herein. In addition, the description of the beneficial effects of the same method is omitted.
As an example, the above-described program instructions may be executed on one computer device or on a plurality of computer devices disposed at one site, or alternatively, on a plurality of computer devices distributed at a plurality of sites and interconnected by a communication network, which may constitute a blockchain network.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of computer programs, which may be stored on a computer-readable storage medium, and which, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is only illustrative of the preferred embodiments of the present application and is not intended to limit the scope of the claims herein, as the equivalent of the claims herein shall be construed to fall within the scope of the claims herein.

Claims (8)

1. A method of processing a speech signal, comprising:
acquiring an original voice signal to be processed;
separating the original voice signals to obtain effective voice signals in the original voice signals;
dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments, wherein one original voice signal fragment corresponds to one effective voice signal fragment;
Extracting the characteristics of each original voice signal segment in the at least two original voice signal segments to obtain the characteristic information of each original voice signal segment;
determining the data volume duty ratio of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;
generating enhancement coefficients of corresponding effective voice signal segments in the at least two effective voice signal segments by adopting the data volume duty ratio;
taking the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments as the enhancement coefficient of the effective voice signal;
and carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.
2. The method of claim 1, wherein said determining the data size ratio of the effective speech signal included in each original speech signal segment based on the characteristic information of each original speech signal segment comprises:
determining the data quantity of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;
Acquiring the total data volume of the original voice signals;
and obtaining the ratio between the data volume of the effective voice signals included in each original voice signal segment and the total data volume of the original voice signals, and obtaining the data volume ratio of the effective voice signals included in each original voice signal segment.
3. The method of claim 2, wherein the at least two original speech signal segments comprise a target original speech signal segment and the at least two valid speech signal segments comprise a target valid speech signal segment corresponding to the target original speech signal segment;
the step of performing enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal comprises the following steps:
if the enhancement coefficient of the target effective voice signal segment is larger than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, extracting an original voice signal sub-segment of a target data volume from the target original voice signal segment, and carrying out fusion processing on the original voice signal sub-segment and the target effective voice signal segment to obtain an enhanced target effective voice signal segment; the target data volume is determined according to the data volume ratio of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold value is smaller than the second enhancement coefficient threshold value;
If the enhancement coefficient of the target effective voice signal segment is greater than or equal to the second enhancement coefficient threshold, taking the target original voice signal segment as an enhanced target effective voice signal segment;
and splicing the enhanced target effective voice signal fragments to obtain an enhanced target voice signal.
4. The method of claim 3, wherein said generating enhancement coefficients for corresponding ones of the at least two active speech signal segments using the data volume duty cycle comprises:
if the data volume duty ratio corresponding to the target original voice information fragment is larger than a first data volume duty ratio threshold and smaller than a second data volume duty ratio threshold, determining a first enhancement coefficient as an enhancement coefficient of the target effective voice information fragment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;
if the corresponding data volume duty ratio of the target original voice information fragment is larger than the second data volume duty ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice information fragment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
5. The method of claim 1, wherein the separating the original speech signal to obtain the valid speech signal in the original speech signal comprises:
masking the original voice signal according to the characteristic information of the original voice signal to obtain a mask matrix corresponding to the original voice signal;
and separating the effective voice signals from the original voice signals according to the mask matrix corresponding to the original voice signals.
6. A speech signal processing apparatus, comprising:
the acquisition module is used for acquiring an original voice signal to be processed;
the separation processing module is used for carrying out separation processing on the original voice signals to obtain effective voice signals in the original voice signals;
the generation module comprises:
the dividing processing unit is used for dividing the original voice signal to obtain at least two original voice signal fragments, and dividing the effective voice signal to obtain at least two effective voice signal fragments, wherein one original voice signal fragment corresponds to one effective voice signal fragment;
the feature extraction unit is used for extracting the features of each original voice signal segment in the at least two original voice signal segments to obtain the feature information of each original voice signal segment;
A generating unit, configured to determine a data volume ratio of an effective speech signal included in each original speech signal segment according to the feature information of each original speech signal segment; generating enhancement coefficients of corresponding effective voice signal segments in the at least two effective voice signal segments by adopting the data volume duty ratio;
a first determining unit, configured to use, as an enhancement coefficient of the effective speech signal, an enhancement coefficient corresponding to each of the at least two effective speech signal segments;
and the enhancement processing module is used for carrying out enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.
7. A computer device, comprising: a processor and a memory;
wherein the memory is for storing program code, the processor is for invoking the program code to perform the method of any of claims 1 to 5.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the steps of the method according to any of claims 1 to 5.
CN202011233786.3A 2020-11-06 2020-11-06 Voice signal processing method, device, storage medium and equipment Active CN113409803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011233786.3A CN113409803B (en) 2020-11-06 2020-11-06 Voice signal processing method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011233786.3A CN113409803B (en) 2020-11-06 2020-11-06 Voice signal processing method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN113409803A CN113409803A (en) 2021-09-17
CN113409803B true CN113409803B (en) 2024-01-23

Family

ID=77677399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011233786.3A Active CN113409803B (en) 2020-11-06 2020-11-06 Voice signal processing method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN113409803B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114171038B (en) * 2021-12-10 2023-07-28 北京百度网讯科技有限公司 Voice noise reduction method, device, equipment and storage medium
CN114783459B (en) * 2022-03-28 2024-04-09 腾讯科技(深圳)有限公司 Voice separation method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104080024A (en) * 2013-03-26 2014-10-01 杜比实验室特许公司 Volume leveler controller and control method
CN105244037A (en) * 2015-08-27 2016-01-13 广州市百果园网络科技有限公司 Voice signal processing method and device
CN105336341A (en) * 2014-05-26 2016-02-17 杜比实验室特许公司 Method for enhancing intelligibility of voice content in audio signals
CN105913854A (en) * 2016-04-15 2016-08-31 腾讯科技(深圳)有限公司 Voice signal cascade processing method and apparatus
WO2018174310A1 (en) * 2017-03-22 2018-09-27 삼성전자 주식회사 Method and apparatus for processing speech signal adaptive to noise environment
CN109416914A (en) * 2016-06-24 2019-03-01 三星电子株式会社 Signal processing method and device suitable for noise circumstance and the terminal installation using it
CN110503968A (en) * 2018-05-18 2019-11-26 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104080024A (en) * 2013-03-26 2014-10-01 杜比实验室特许公司 Volume leveler controller and control method
CN105336341A (en) * 2014-05-26 2016-02-17 杜比实验室特许公司 Method for enhancing intelligibility of voice content in audio signals
CN105244037A (en) * 2015-08-27 2016-01-13 广州市百果园网络科技有限公司 Voice signal processing method and device
CN105913854A (en) * 2016-04-15 2016-08-31 腾讯科技(深圳)有限公司 Voice signal cascade processing method and apparatus
CN109416914A (en) * 2016-06-24 2019-03-01 三星电子株式会社 Signal processing method and device suitable for noise circumstance and the terminal installation using it
WO2018174310A1 (en) * 2017-03-22 2018-09-27 삼성전자 주식회사 Method and apparatus for processing speech signal adaptive to noise environment
CN110503968A (en) * 2018-05-18 2019-11-26 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing

Also Published As

Publication number Publication date
CN113409803A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN111488489B (en) Video file classification method, device, medium and electronic equipment
CN113409803B (en) Voice signal processing method, device, storage medium and equipment
CN112804558B (en) Video splitting method, device and equipment
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113177538A (en) Video cycle identification method and device, computer equipment and storage medium
CN111276119A (en) Voice generation method and system and computer equipment
CN111563161B (en) Statement identification method, statement identification device and intelligent equipment
CN112837669A (en) Voice synthesis method and device and server
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN114974215A (en) Audio and video dual-mode-based voice recognition method and system
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
CN116975357A (en) Video generation method, device, electronic equipment, storage medium and program product
KR102334390B1 (en) Apparatus and method for improving codec compression efficiency using artificial intelligence technology
CN111048065B (en) Text error correction data generation method and related device
CN114842857A (en) Voice processing method, device, system, equipment and storage medium
CN113571063A (en) Voice signal recognition method and device, electronic equipment and storage medium
CN112687262A (en) Voice conversion method and device, electronic equipment and computer readable storage medium
CN116580716B (en) Audio encoding method, device, storage medium and computer equipment
CN115471875B (en) Multi-code-rate pedestrian recognition visual feature coding compression method and device
US20240212706A1 (en) Audio data processing
CN117373463A (en) Model training method, device, medium and program product for speech processing
CN118052827A (en) Image processing method and related device
CN115457969A (en) Speech conversion method, apparatus, computer device and medium based on artificial intelligence
CN116434763A (en) Autoregressive audio generation method, device, equipment and storage medium based on audio quantization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051395

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant