CN111081264A

CN111081264A - Voice signal processing method, device, equipment and storage medium

Info

Publication number: CN111081264A
Application number: CN201911248791.9A
Authority: CN
Inventors: 谭志鹏; 谭北平
Original assignee: Tsinghua University; Beijing Mininglamp Software System Co ltd
Current assignee: Tsinghua University; Beijing Mininglamp Software System Co ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-04-28
Anticipated expiration: 2039-12-06
Also published as: CN111081264B

Abstract

The application provides a voice signal processing method, apparatus, device and storage medium, and relates to the technical field of voice recognition. The method comprises: detecting the voice quality of an input analog audio signal; determining weights of differential pulse code modulation (DPCM) and adaptive differential pulse code modulation (ADPCM) according to the voice quality; performing DPCM processing and ADPCM processing on the analog audio signal respectively to obtain a first modulation signal and a second modulation signal; and weighting preset type parameters of the first modulation signal and the second modulation signal according to the weights of the DPCM and the ADPCM to obtain a target modulation signal. The method and apparatus can effectively address the problem of low voice recognition efficiency and improve the voice recognition rate.

Description

Voice signal processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of communications technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing a voice signal.

Background

Speech recognition technology allows machines to convert speech signals into corresponding text or commands through a process of recognition and understanding. It is widely applied in many aspects of daily life: a user only needs to dictate, and the speech is converted into text, which makes life more convenient.

In the prior art, the speech coding techniques applied in the speech recognition field mainly include Differential Pulse Code Modulation (DPCM) and Adaptive Differential Pulse Code Modulation (ADPCM).

However, ADPCM and DPCM each have their own advantages and disadvantages, and most smart devices on the market adopt only one of these speech coding techniques during speech processing. As a result, problems such as poor speech input, a low recognition rate, or even recognition failure may occur, which greatly affects the usability of the smart devices and the user experience.

Disclosure of Invention

An object of the present application is to provide a method, an apparatus, a device and a storage medium for processing a voice signal, so as to solve the above-mentioned drawbacks of the prior art.

In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:

in a first aspect, an embodiment of the present application provides a speech signal processing method, including:

detecting the voice quality of an input analog audio signal;

determining weights of Differential Pulse Code Modulation (DPCM) and Adaptive Differential Pulse Code Modulation (ADPCM) according to the voice quality;

respectively carrying out DPCM processing and ADPCM processing on the analog audio signal to obtain a first modulation signal and a second modulation signal;

and weighting preset type parameters of the first modulation signal and the second modulation signal according to the weight of the DPCM and the ADPCM to obtain a target modulation signal.

Optionally, the determining weights of Differential Pulse Code Modulation (DPCM) and Adaptive Differential Pulse Code Modulation (ADPCM) according to the voice quality comprises:

judging whether the voice quality meets a preset quality requirement or not;

and determining the weight of the DPCM and the ADPCM according to the judgment result of the voice quality.

Optionally, the determining the weight of the DPCM and the ADPCM according to the determination result of the voice quality includes:

if the voice quality does not meet the quality requirement, determining a first weighted modulation mode, wherein the weight of the DPCM in the first weighted modulation mode is greater than the weight of the ADPCM;

and if the voice quality meets the quality requirement, determining a second weighted modulation mode, wherein the weight of the ADPCM in the second weighted modulation mode is greater than the weight of the DPCM.

Optionally, the voice quality comprises: signal correlation of adjacent sampling points in the analog audio signal; the judging whether the voice quality meets the preset quality requirement includes:

judging whether the signal correlation is greater than or equal to a preset correlation threshold value;

determining that the speech quality does not meet the quality requirement if the signal correlation is less than the correlation threshold;

determining that the speech quality satisfies the quality requirement if the signal correlation is greater than or equal to the correlation threshold.

Optionally, the method further comprises:

performing analog-to-digital conversion on the target modulation signal to generate an audio file;

generating a Hamming window image corresponding to the audio file;

and generating a target spectrogram according to the Hamming window image, and performing audio matching in a preset audio recognition library by adopting the target spectrogram.

Optionally, the preset type parameter is any one of the following types: peak, common peak, frequency, stop, fricative.

In a second aspect, an embodiment of the present application further provides a speech signal processing apparatus, including: a detection module, a determining module, a processing module and a weighting module, wherein:

the detection module is used for detecting the voice quality of the input analog audio signal;

the determining module is used for determining the weights of the Differential Pulse Code Modulation (DPCM) and the Adaptive Differential Pulse Code Modulation (ADPCM) according to the voice quality;

the processing module is used for respectively carrying out DPCM processing and ADPCM processing on the analog audio signal to obtain a first modulation signal and a second modulation signal;

the weighting module is configured to weight preset type parameters of the first modulation signal and the second modulation signal according to the weights of the DPCM and the ADPCM, so as to obtain a target modulation signal.

Optionally, the apparatus further comprises a judging module, which is used for judging whether the voice quality meets the preset quality requirement;

the determining module is further configured to determine the weights of the DPCM and the ADPCM according to the determination result of the voice quality.

Optionally, the determining module is further configured to determine a first weighted modulation scheme if the voice quality does not meet the quality requirement, where a weight of the DPCM in the first weighted modulation scheme is greater than a weight of the ADPCM.

Optionally, the determining module is further configured to determine a second weighted modulation scheme if the voice quality meets the quality requirement, where a weight of the ADPCM in the second weighted modulation scheme is greater than a weight of the DPCM.

Optionally, the determining module is further configured to determine whether the signal correlation is greater than or equal to a preset correlation threshold;

the determining module is further configured to determine that the speech quality does not meet the quality requirement if the signal correlation is less than the correlation threshold;

the determining module is further configured to determine that the speech quality meets the quality requirement if the signal correlation is greater than or equal to the correlation threshold.

Optionally, the apparatus further comprises a generating module and a matching module, wherein:

the generating module is used for performing analog-to-digital conversion on the target modulation signal to generate an audio file, and for generating a Hamming window image corresponding to the audio file;

and the matching module is used for generating a target spectrogram according to the Hamming window image and performing audio matching in a preset audio recognition library by adopting the target spectrogram.

In a third aspect, an embodiment of the present application further provides a speech signal processing apparatus, including: a memory storing a computer program executable by the processor, and a processor implementing any of the methods provided by the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is read and executed, the computer program implements any one of the methods provided in the first aspect.

The beneficial effect of this application is: by adopting the voice signal processing method provided by the application, the analog audio signal can be respectively processed by adopting Differential Pulse Code Modulation (DPCM) and Adaptive Differential Pulse Code Modulation (ADPCM) to obtain a corresponding first modulation signal and a second modulation signal, the weights of the DPCM and the ADPCM are determined according to the voice quality of the input analog audio signal, and the first modulation signal and the second modulation signal are weighted according to the determined weights to obtain a target modulation signal. The processing mode can determine the weight of DPCM and ADPCM according to different voice quality, and carry out weighting processing on the first modulation signal and the second modulation signal according to the weight to obtain a target modulation signal, so that the problem of low voice recognition efficiency can be effectively solved by the obtained target modulation signal, and the voice recognition rate is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a schematic flowchart of a speech signal processing method according to an embodiment of the present application;

fig. 2 is a schematic flowchart of a speech signal processing method according to another embodiment of the present application;

fig. 3 is a schematic flowchart of a speech signal processing method according to another embodiment of the present application;

fig. 4 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of a speech signal processing apparatus according to another embodiment of the present application;

fig. 6 is a schematic structural diagram of a speech signal processing apparatus according to another embodiment of the present application;

fig. 7 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.

Fig. 1 is a schematic flowchart of a speech signal processing method according to an embodiment of the present application. The method may be executed by any electronic device with speech recognition and speech processing functions, such as a mobile phone, a tablet computer, or a wearable device, or by an application server corresponding to a speech application program in the electronic device. The following description takes execution by the electronic device as an example; execution by the server is similar and is not repeated herein. As shown in fig. 1, the method may include:

S101: the speech quality of an input analog audio signal is detected.

Optionally, the input analog audio signal may be a user audio signal acquired by the electronic device in real time, or may also be an audio signal selected by the electronic device from a preset audio signal set, and the specific audio signal uploading mode may be determined according to a user requirement, which is not limited herein.

Optionally, the voice quality of the analog audio signal may be determined by taking the waveform of the voice signal as sampled data and analyzing the sampled data. The analysis may use any one of the following: judging whether the amplitudes in the sampled data are uniformly distributed; judging by the correlation between adjacent samples in the sampled data; judging by the correlation between periods of the sampled data; or judging by the correlation between pitches of the sampled data.

S102: the weights of the differential pulse code modulation DPCM and the adaptive differential pulse code modulation ADPCM are determined according to the speech quality.

DPCM encodes the difference between adjacent samples: the current sample is predicted from the previous n samples according to a certain rule, and the error between the predicted value and the actual value is quantized and then transmitted. The receiving end recovers the original signal from the error signal by adopting the same prediction method as the transmitting end. A DPCM-modulated signal correspondingly reduces the required system transmission bandwidth and can improve the signal-to-noise ratio. The predictor estimates the current sample from the previous n sample values so that the difference between the actual sample value and the predicted value is kept as small as possible.
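The predict, quantize and reconstruct loop described above can be sketched as follows. This is a minimal illustration rather than the method of the application: it assumes a first-order predictor (n = 1, the previous reconstructed sample) and a uniform quantizer with a fixed step, and the names `dpcm_encode`, `dpcm_decode` and `step` are hypothetical.

```python
import math


def dpcm_encode(samples, step=0.05):
    """Encode samples as quantized prediction errors (assumed first-order DPCM)."""
    codes = []
    prediction = 0.0  # the decoder starts from the same value
    for sample in samples:
        error = sample - prediction      # difference between actual and predicted value
        code = round(error / step)       # uniform quantization of the error
        codes.append(code)
        prediction += code * step        # track what the decoder will reconstruct
    return codes


def dpcm_decode(codes, step=0.05):
    """Rebuild the signal using the same predictor as the encoder."""
    restored = []
    prediction = 0.0
    for code in codes:
        prediction += code * step
        restored.append(prediction)
    return restored


# Illustrative input: one slow sine wave, 50 samples per period.
signal = [math.sin(2 * math.pi * n / 50) for n in range(200)]
codes = dpcm_encode(signal)
restored = dpcm_decode(codes)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
```

Because the encoder predicts from its own reconstruction, the quantization error does not accumulate: in this sketch each restored sample stays within half a quantization step of the input.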

ADPCM is a waveform coding technique with good performance and a good way to obtain high-quality sound with low storage consumption. Compared with other speech coding techniques, it has the shortest coding and decoding delay, low algorithm complexity and a small compression ratio.
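For comparison with DPCM, the adaptive element of ADPCM can be sketched as a quantizer whose step size changes with the signal. The following is a simplified illustration only: it is not the IMA or ITU-T ADPCM algorithm, and the 4-bit code range, the adaptation factors (1.5 and 0.9) and the names `adpcm_encode`, `adpcm_decode` and `_adapt` are all assumptions made for this example.

```python
import math


def _adapt(step, code, min_step=1e-4, max_step=1.0):
    """Grow the step after a large code, shrink it after a small one (assumed rule)."""
    factor = 1.5 if abs(code) >= 4 else 0.9
    return min(max(step * factor, min_step), max_step)


def adpcm_encode(samples, initial_step=0.01):
    """Encode samples as 4-bit-range codes with an adaptive quantization step."""
    codes = []
    prediction, step = 0.0, initial_step
    for sample in samples:
        code = round((sample - prediction) / step)
        code = max(-8, min(7, code))     # clamp to a signed 4-bit range
        codes.append(code)
        prediction += code * step        # same reconstruction as the decoder
        step = _adapt(step, code)
    return codes


def adpcm_decode(codes, initial_step=0.01):
    """Rebuild the signal, adapting the step exactly as the encoder did."""
    restored = []
    prediction, step = 0.0, initial_step
    for code in codes:
        prediction += code * step
        restored.append(prediction)
        step = _adapt(step, code)
    return restored


signal = [math.sin(2 * math.pi * n / 50) for n in range(200)]
codes = adpcm_encode(signal)
restored = adpcm_decode(codes)
```

The step size is derived only from previously transmitted codes, so the decoder can reproduce it without any side information.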

DPCM has a better processing effect when the voice quality is higher, whereas ADPCM has a poor processing effect when the voice quality is low. In the present application, the weights of the two algorithms are determined according to the voice quality, so that weighting the two algorithms can improve the voice recognition rate regardless of whether the voice quality of the analog audio signal is high or low.

S103: DPCM processing and ADPCM processing are respectively carried out on the analog audio signal to obtain a first modulation signal and a second modulation signal.

Wherein, the first modulation signal is obtained by processing the audio signal according to DPCM, and the second modulation signal is obtained by processing the audio signal according to ADPCM.

S104: and weighting preset type parameters of the first modulation signal and the second modulation signal according to the weight of the DPCM and the ADPCM to obtain a target modulation signal.

Optionally, the preset type parameter may include any one of the following types of parameters: peak, common peak, frequency, stop, fricative, etc.

Each parameter of the preset type is weighted according to the determined weights of the DPCM and the ADPCM to obtain the target modulation signal.
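One plausible reading of this weighting step is a per-parameter weighted sum of the values extracted from the two modulation signals. The sketch below assumes that reading, assumes the two weights sum to 1, and uses made-up parameter values; the function name `weight_parameters` and the dictionary layout are hypothetical and are not defined by the application.

```python
def weight_parameters(first_params, second_params, dpcm_weight, adpcm_weight):
    """Combine per-parameter values of the DPCM and ADPCM signals by weighted sum."""
    if abs(dpcm_weight + adpcm_weight - 1.0) > 1e-9:
        raise ValueError("this sketch assumes the two weights sum to 1")
    if first_params.keys() != second_params.keys():
        raise ValueError("both signals must provide the same parameter names")
    return {
        name: dpcm_weight * first_params[name] + adpcm_weight * second_params[name]
        for name in first_params
    }


# Illustrative values only: parameters measured on the first (DPCM) and
# second (ADPCM) modulation signals.
first = {"peak": 0.80, "frequency": 440.0}
second = {"peak": 0.90, "frequency": 444.0}
target = weight_parameters(first, second, dpcm_weight=0.6, adpcm_weight=0.4)
```

With a DPCM weight of 0.6 and an ADPCM weight of 0.4, each target parameter lies between the two inputs and closer to the DPCM value.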

By adopting the voice signal processing method provided by the application, the analog audio signal can be respectively processed by adopting Differential Pulse Code Modulation (DPCM) and Adaptive Differential Pulse Code Modulation (ADPCM) to obtain a corresponding first modulation signal and a second modulation signal, the weights of the DPCM and the ADPCM are determined according to the voice quality of the input analog audio signal, and the first modulation signal and the second modulation signal are weighted according to the determined weights to obtain a target modulation signal. The processing mode can determine the weight of DPCM and ADPCM according to different voice quality, and carry out weighting processing on the first modulation signal and the second modulation signal according to the weight to obtain a target modulation signal, so that the problem of low voice recognition efficiency can be effectively solved by the obtained target modulation signal, and the voice recognition rate is improved.

Fig. 2 is a schematic flowchart of a speech signal processing method according to an embodiment of the present application, and as shown in fig. 2, S102 includes:

S105: judging whether the voice quality meets the preset quality requirement.

Then, the weight of DPCM and ADPCM is determined according to the judgment result of the voice quality.

In one embodiment of the present application, the speech quality includes: the signal correlation of adjacent sample points in the analog audio signal. Then, determining whether the voice quality meets a preset quality requirement may be: judging whether the signal correlation is greater than or equal to a preset correlation threshold value; if the signal correlation is smaller than the correlation threshold value, determining that the voice quality does not meet the quality requirement; and if the signal correlation is greater than or equal to the correlation threshold, determining that the voice quality meets the quality requirement.

For example, in an embodiment of the present application, the preset correlation threshold is set to 0.04, and the voice quality is judged by the correlation between adjacent signals in the sampled data. Specifically, the sampled data is collected at a sampling frequency of 48 kHz and the correlation between adjacent signals is evaluated. If the correlation between adjacent signals is greater than or equal to 0.04, the audio signal is considered to be high-quality voice; if the correlation is less than 0.04, the audio signal is considered to be low-quality voice. The specific preset correlation threshold may be set according to the user's needs and is not limited to the threshold given in this embodiment.
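The threshold test in this example can be sketched as follows. The application does not give a formula for the correlation between adjacent signals, so this sketch assumes the Pearson correlation between each sample and the next one; the names `adjacent_correlation` and `meets_quality_requirement` are hypothetical, and the test signals are synthetic.

```python
import math


def adjacent_correlation(samples):
    """Pearson correlation between each sample and its successor (assumed measure)."""
    x, y = samples[:-1], samples[1:]
    n = len(x)
    if n < 2:
        raise ValueError("need at least three samples")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant signal: treat the correlation as zero in this sketch
    return cov / math.sqrt(var_x * var_y)


def meets_quality_requirement(samples, threshold=0.04):
    """True when the adjacent-sample correlation reaches the preset threshold."""
    return adjacent_correlation(samples) >= threshold


SAMPLE_RATE_HZ = 48_000  # sampling frequency from the example above
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ) for n in range(4800)]
alternating = [1.0 if n % 2 == 0 else -1.0 for n in range(4800)]

tone_is_high_quality = meets_quality_requirement(tone)
alternating_is_high_quality = meets_quality_requirement(alternating)
```

A smooth tone sampled at 48 kHz changes little from one sample to the next and therefore passes the test, while a signal that flips sign at every sample has a negative adjacent correlation and fails it.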

The correlation between adjacent signals of the sampled data is collected at a preset sampling frequency, and the sampled data is divided into a plurality of frames. Each frame corresponds to a frequency spectrum (calculated by a short-time FFT), which represents the relationship between frequency and energy.

Optionally, if the voice quality does not meet the preset criterion, the current voice quality is considered not to meet the quality requirement, and then S106a is executed: a first weighted modulation scheme is determined.

Wherein, the weight of DPCM in the first weighted modulation is larger than that of ADPCM, that is, the DPCM algorithm is taken as the main algorithm to perform the weighted modulation of the audio. The modulation mode can improve the voice recognition rate, and avoids the problem of low recognition rate caused by only adopting an ADPCM algorithm for modulation when the collected voice quality is poor.

If the voice quality meets the preset standard, the current voice quality is considered to meet the quality requirement, and then S106b is executed: a second weighted modulation scheme is determined.

The weight of ADPCM in the second weighted modulation mode is greater than the weight of DPCM, that is, the ADPCM algorithm is taken as the main algorithm for the weighted modulation of the audio. After the subsequent A/D conversion of the voice, this modulation mode allows the resulting audio file to be compressed without degrading the sound quality, which reduces the file size, improves the voice recognition rate, and avoids the difficulty of subsequent compression that arises when only the DPCM algorithm is used for audio modulation.

Optionally, in an embodiment of the present application, in the first weighted modulation mode, the weight of DPCM is 60% and the weight of ADPCM is 40%; in the second weighted modulation mode, the weight of ADPCM is 60% and the weight of DPCM is 40%. The specific weights can be adjusted according to the user's needs and are not limited to this embodiment.
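The choice between the two weighted modulation modes can be sketched as a small selection function. The 60%/40% split and the 0.04 threshold come from the embodiments above; the function names `select_weights` and `select_weights_from_correlation` and the dictionary keys are hypothetical.

```python
def select_weights(meets_quality_requirement, major=0.6, minor=0.4):
    """Return DPCM/ADPCM weights for the first or second weighted modulation mode."""
    if meets_quality_requirement:
        # Second weighted modulation mode: ADPCM is the main algorithm.
        return {"dpcm": minor, "adpcm": major}
    # First weighted modulation mode: DPCM is the main algorithm.
    return {"dpcm": major, "adpcm": minor}


def select_weights_from_correlation(correlation, threshold=0.04):
    """Apply the correlation threshold test, then pick the weights."""
    return select_weights(correlation >= threshold)


low_quality_weights = select_weights_from_correlation(0.01)
high_quality_weights = select_weights_from_correlation(0.04)
```

A correlation exactly equal to the threshold counts as meeting the quality requirement, matching the "greater than or equal to" condition in the text.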

Fig. 3 is a schematic flowchart of a speech signal processing method according to another embodiment of the present application, and as shown in fig. 3, the method further includes:

S107: performing analog-to-digital conversion on the target modulation signal to generate an audio file.

The audio file is obtained according to the weighted target modulation signal, and compared with the audio file obtained by only processing through one algorithm in the prior art, the voice recognition rate is higher.

S108: generating a Hamming window image corresponding to the audio file.

After the audio file is obtained, format conversion, resampling, pre-emphasis, and framing are performed on the audio file, and a Hamming window image corresponding to the audio is constructed.
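The pre-emphasis, framing and Hamming windowing steps can be sketched as follows. This is an illustration under assumptions: the pre-emphasis coefficient (0.97), the frame length (400 samples) and the hop size (160 samples) are common choices that do not come from the application, and the function names are hypothetical. Format conversion and resampling are omitted.

```python
import math


def pre_emphasis(samples, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n - 1]."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[i] - alpha * samples[i - 1] for i in range(1, len(samples))
    ]


def hamming(n_points):
    """Symmetric Hamming window: 0.54 - 0.46 * cos(2 * pi * k / (N - 1))."""
    if n_points < 1:
        raise ValueError("window length must be positive")
    if n_points == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * k / (n_points - 1))
        for k in range(n_points)
    ]


def windowed_frames(samples, frame_len=400, hop=160):
    """Split the signal into overlapping frames and apply the Hamming window."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + k] * window[k] for k in range(frame_len)])
    return frames


audio = [math.sin(2 * math.pi * n / 40) for n in range(1000)]
frames = windowed_frames(pre_emphasis(audio))
```

Each windowed frame tapers toward its edges, which reduces spectral leakage when the frame is later Fourier transformed.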

S109: generating a target spectrogram according to the Hamming window image, and performing audio matching in a preset audio recognition library by adopting the target spectrogram.

After a Fourier transform is performed on the Hamming window image, a target spectrogram corresponding to the input analog audio signal is generated. Audio matching is then performed in the preset audio recognition library according to the target spectrogram to obtain the text information corresponding to the target spectrogram.

Performing a Fourier transform on the Hamming window image converts a nonlinear problem into a linear one, which makes the matching more intuitive.

In the method provided by this embodiment, the target modulation signal is obtained by weighting the results of processing the input analog audio signal with the two algorithms; the target modulation signal is subjected to analog-to-digital conversion, a corresponding Hamming window image is generated, and a target spectrogram is generated from the Hamming window image. Compared with the conventional technique in which only one algorithm is used to process the analog audio signal, the processing method of the present application lowers the transmission bit rate of the target modulation signal and reduces the system transmission bandwidth; at the same bit rate, it can improve the signal-to-noise ratio, increase the number of quantization levels, and reduce quantization noise.

Fig. 4 is a speech signal processing apparatus according to an embodiment of the present application, and as shown in fig. 4, the apparatus includes: a detection module 201, a determination module 202, a processing module 203, and a weighting module 204, wherein:

the detecting module 201 is configured to detect a voice quality of an input analog audio signal.

A determining module 202, configured to determine weights of the differential pulse code modulation DPCM and the adaptive differential pulse code modulation ADPCM according to the voice quality.

The processing module 203 is configured to perform DPCM processing and ADPCM processing on the analog audio signal to obtain a first modulation signal and a second modulation signal.

And the weighting module 204 is configured to weight preset type parameters of the first modulation signal and the second modulation signal according to the weights of the DPCM and the ADPCM, so as to obtain a target modulation signal.

Fig. 5 is a speech signal processing apparatus according to another embodiment of the present application, and as shown in fig. 5, the apparatus further includes: a judging module 205, configured to judge whether the voice quality meets a preset quality requirement.

The determining module 202 is further configured to determine the weights of the DPCM and the ADPCM according to the determination result of the voice quality.

Optionally, the determining module 202 is further configured to determine a first weighted modulation mode if the voice quality does not meet the quality requirement, where the weight of the DPCM in the first weighted modulation mode is greater than the weight of the ADPCM.

Optionally, the determining module 202 is further configured to determine a second weighted modulation scheme if the voice quality meets the quality requirement, where a weight of the ADPCM in the second weighted modulation scheme is greater than a weight of the DPCM.

Optionally, the judging module 205 is further configured to judge whether the signal correlation is greater than or equal to a preset correlation threshold.

The determining module 202 is further configured to determine that the voice quality does not meet the quality requirement if the signal correlation is smaller than the correlation threshold.

The determining module 202 is further configured to determine that the voice quality meets the quality requirement if the signal correlation is greater than or equal to the correlation threshold.

Fig. 6 is a speech signal processing apparatus according to another embodiment of the present application, and as shown in fig. 6, the apparatus further includes a generating module 206 and a matching module 207, where:

a generating module 206, configured to perform analog-to-digital conversion on the target modulation signal to generate an audio file, and to generate a Hamming window image corresponding to the audio file.

And the matching module 207 is used for generating a target spectrogram according to the Hamming window image, and performing audio matching in a preset audio recognition library by adopting the target spectrogram.

The following describes apparatuses, devices, and storage media for executing the methods provided in the present application, and specific implementation procedures and technical effects thereof are referred to above, and will not be described again below.

The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.

The above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), or one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 7 is a schematic structural diagram of a speech signal processing device provided in an embodiment of the present application, where the speech signal processing device may be integrated in a terminal device or a chip of the terminal device.

The speech signal processing device comprises: a processor 501, a storage medium 502, and a bus 503.

The storage medium 502 is used for storing a program, and the processor 501 calls the program stored in the storage medium 502 to execute the method embodiments corresponding to fig. 1-3. The specific implementation and technical effects are similar, and are not described herein again.

Optionally, the present application also provides a program product, such as a storage medium, on which a computer program is stored; when executed by a processor, the program performs the embodiments corresponding to the above-described method.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to perform some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A speech signal processing method, comprising:

detecting the voice quality of an input analog audio signal;

2. The method according to claim 1, wherein determining weights for Differential Pulse Code Modulation (DPCM) and Adaptive Differential Pulse Code Modulation (ADPCM) based on the speech quality comprises:

judging whether the voice quality meets a preset quality requirement or not;

3. The method of claim 2, wherein the determining the weight of the DPCM and the ADPCM according to the determination of the speech quality comprises:

if the voice quality does not meet the quality requirement, determining a first weighted modulation mode, wherein the weight of the DPCM in the first weighted modulation mode is larger than the weight of the ADPCM.

4. The method of claim 2, wherein the determining the weight of the DPCM and the ADPCM according to the determination of the speech quality comprises:

5. The method of claim 2, wherein the speech quality comprises: signal correlation of adjacent sampling points in the analog audio signal; the judging whether the voice quality meets the preset quality requirement includes:

6. The method according to any one of claims 1-5, further comprising:

generating a Hamming window image corresponding to the audio file;

7. The method according to any one of claims 1 to 5, wherein the preset type of parameter is any one of the following types of parameters: peak, common peak, frequency, stop, fricative.

8. A speech signal processing apparatus, comprising: a detection module, a determining module, a processing module and a weighting module, wherein:

9. A speech signal processing apparatus, characterized by comprising: a memory storing a computer program executable by the processor, and a processor implementing the method of any of the preceding claims 1-7 when executing the computer program.

10. A storage medium having stored thereon a computer program which, when read and executed, implements the method of any of claims 1-7.