CN113707162A - Voice signal processing method, device, equipment and storage medium

Info

Publication number: CN113707162A
Application number: CN202110226589.7A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 梁俊斌
Assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Pending
Prior art keywords: power spectrum, voice signal, frequency point, voice, frequency

Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202110226589.7A; publication of CN113707162A.

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/04 Time compression or expansion
    • G10L21/057 Time compression or expansion for improving intelligibility


Abstract

The present application provides a voice signal processing method, apparatus, device, and storage medium, belonging to the technical field of artificial intelligence. For a voice signal to be processed, a first power spectrum and phase information of the voice signal at each frequency point in the frequency domain are first acquired; the first power spectrum is then enhanced using a band gain value acquired for each frequency point, yielding a second power spectrum for each frequency point; and a target voice signal meeting a voice playing condition is then generated from the second power spectrum and the phase information of each frequency point. This processing enhances the power spectrum of each frequency point in a targeted manner, so the enhancement effect on the voice signal is more stable, voice quality is effectively improved, and voice intelligibility is enhanced. Moreover, the voice signal can be enhanced in this way whether or not it has undergone cascade coding processing, giving the method a wide application range.

Description

Voice signal processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing a voice signal.
Background
With the rapid development of mobile communication and internet technology, various applications with communication functions have emerged, and users can conduct voice calls with each other through applications installed on their terminals. To enable voice interworking between terminals located in different networks, a call link may pass through multiple codecs, i.e., cascade (concatenated) coding. However, the more stages of cascade coding there are, the more severely the speech signal is damaged, so that the parties cannot clearly hear each other's speech content; that is, speech intelligibility is reduced.
Related-art solutions to the above problem generally perform a formant search on the cascade-coded speech signal, extract the formants of the damaged speech signal from the search results, and boost all formants with the same enhancement amplitude, so as to compensate the damaged speech signal.
However, after cascade coding, the degrees of damage to the speech signal at different frequencies are often inconsistent, whereas the above scheme applies the same enhancement amplitude everywhere; that is, speech components with different degrees of damage receive identical compensation. This can make the enhancement effect on the damaged speech signal unstable and fail to effectively improve speech quality.
Disclosure of Invention
The embodiments of the present application provide a voice signal processing method, apparatus, device, and storage medium, which effectively improve voice quality and thereby enhance voice intelligibility. The technical solution is as follows:
in one aspect, a method for processing a speech signal is provided, the method comprising:
converting a voice signal to be processed from a time domain to a frequency domain, and acquiring a first power spectrum and phase information of each frequency point in the frequency domain; the voice signal to be processed is an initial voice signal or a damaged voice signal, where the initial voice signal refers to a voice signal that has not undergone cascade coding processing, and the damaged voice signal refers to a voice signal obtained after the cascade coding processing;
acquiring a frequency band gain value of each frequency point, and determining a second power spectrum of each frequency point based on the first power spectrum and the frequency band gain value of each frequency point;
and generating a target voice signal which accords with the voice playing condition based on the phase information and the second power spectrum of each frequency point.
In another aspect, there is provided a speech signal processing apparatus, the apparatus comprising:
the acquisition module is used for converting the voice signal to be processed from a time domain to a frequency domain and acquiring a first power spectrum and phase information of each frequency point in the frequency domain; the voice signal to be processed is an initial voice signal or a damaged voice signal, where the initial voice signal refers to a voice signal that has not undergone cascade coding processing, and the damaged voice signal refers to a voice signal obtained after the cascade coding processing;
the determining module is used for acquiring the frequency band gain value of each frequency point and determining the second power spectrum of each frequency point based on the first power spectrum and the frequency band gain value of each frequency point;
and the generating module is used for generating a target voice signal which accords with the voice playing condition based on the phase information and the second power spectrum of each frequency point.
In an optional implementation, in response to the voice signal to be processed being the damaged voice signal, the apparatus further includes:
a processing module, configured to perform the cascade coding processing on the initial voice signal before the voice signal to be processed is transformed from the time domain to the frequency domain, to obtain the damaged voice signal.
In an optional implementation, the apparatus further comprises a training module configured to:
acquiring a third power spectrum of each frequency point of a voice sample in the frequency domain, wherein the third power spectrum is obtained by transforming the voice sample from the time domain to the frequency domain;
inputting a third power spectrum corresponding to the voice sample into an initial neural network to obtain a predicted frequency band gain value corresponding to the third power spectrum;
constructing a loss function based on the predicted band gain value and the target band gain value of the voice sample;
continuously adjusting network parameters of the initial neural network based on the loss function until preset conditions are met to obtain the target neural network;
the target frequency band gain value is obtained based on the third power spectrum and a fourth power spectrum corresponding to the voice sample, and the fourth power spectrum is obtained by performing the cascade coding processing on the voice sample and then transforming the voice sample from a time domain to a frequency domain.
In an alternative implementation, the target band gain value is a square root of a ratio of the third power spectrum to the fourth power spectrum.
In an optional implementation, the determining module is further configured to:
inputting the first power spectrum of each frequency point into the first fully connected layer, and performing feature extraction on the first power spectrum of each frequency point through the first fully connected layer to obtain feature vectors;
inputting the feature vectors into the gated recurrent unit (GRU) layer, and extracting the correlation and effective information among the feature vectors through an update gate and a reset gate in the GRU layer to obtain output vectors;
and inputting the output vectors into the second fully connected layer, and integrating the output vectors into the band gain value of each frequency point through the second fully connected layer.
In an optional implementation manner, the obtaining module is further configured to:
sequentially performing framing processing and windowing processing on the voice signal to be processed;
carrying out fast Fourier transform on the voice signal to be processed after framing processing and windowing processing; and determining the first power spectrum and the phase information of each frequency point on the frequency domain based on the obtained conversion result.
In an optional implementation manner, the concatenated coding process includes M coding and decoding processes, where M is a positive integer greater than 1, and the processing module is further configured to:
performing encoding and decoding processing on the initial voice signal for M times to obtain the damaged voice signal;
wherein the output of the previous coding and decoding processing is used as the input of the next coding and decoding processing; and each coding and decoding processing comprises one coding process and one decoding process, with the output of the coding process used as the input of the decoding process.
In another aspect, a computer device is provided, which includes a processor and a memory, where the memory is used to store at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the operations executed in the speech signal processing method in the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor to implement the operations performed in the speech signal processing method in the embodiments of the present application.
In another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, so that the computer device performs the voice signal processing method provided in the above-described various alternative implementations.
For a voice signal to be processed, a first power spectrum and phase information of the voice signal at each frequency point in the frequency domain are first acquired; the first power spectrum is then enhanced using the band gain value acquired for each frequency point, yielding a second power spectrum for each frequency point; and a target voice signal meeting the voice playing condition is then generated from the second power spectrum and the phase information of each frequency point. This processing enhances the power spectrum of each frequency point in a targeted manner, so the enhancement effect on the voice signal is more stable, voice quality is effectively improved, and voice intelligibility is enhanced. Moreover, the voice signal can be enhanced in this way whether or not it has undergone cascade coding processing, giving the method a wide application range.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a speech signal processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for processing a speech signal according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a speech signal processing scheme provided in accordance with an embodiment of the present application;
FIG. 4 is a flow chart of another speech signal processing method provided according to an embodiment of the present application;
FIG. 5 is a flow chart of another speech signal processing method provided according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another speech signal processing scheme provided in accordance with an embodiment of the present application;
FIG. 7 is a flow chart of another speech signal processing method provided in accordance with an embodiment of the present application;
FIG. 8 is a flow chart of another speech signal processing method provided in accordance with an embodiment of the present application;
fig. 9 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a terminal provided in an embodiment of the present application;
fig. 11 is a schematic structural diagram of a server provided according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with some aspects of the present application, as detailed in the appended claims.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms.
These terms are only used to distinguish one element from another. For example, the first power spectrum can be referred to as a second power spectrum, and similarly, the second power spectrum can also be referred to as a first power spectrum, without departing from the scope of the various examples. Both the first power spectrum and the second power spectrum may be power spectra, and in some cases, may be separate and distinct power spectra.
For example, at least one frequency point may be any integer number of frequency points greater than or equal to one, such as one frequency point, two frequency points, three frequency points, and the like. The plurality of frequency points refers to two or more, for example, the plurality of frequency points may be two frequency points, three frequency points, or any integer number of frequency points greater than or equal to two.
Techniques that may be used in the speech signal processing scheme provided by the embodiments of the present application are described below.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specially studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
The following describes key terms or abbreviations that may be used in the speech signal processing scheme provided by the embodiments of the present application.
Voice over IP (VoIP): a voice call technology, which achieves a voice call via Internet Protocol (IP), that is, performs communication via the Internet. Other informal names are IP Phone (IP Telephony), Internet Phone (Internet Telephony), Broadband Phone (Broadband Telephony) and Broadband Phone Service (Broadband Phone Service). VoIP can be used in many internet access devices including VoIP phones, smart phones, and personal computers, and can communicate and send short messages via cellular networks and wireless networks.
Fast Fourier Transform (FFT): a method of rapidly computing a discrete Fourier transform of a sequence or its inverse. Fourier analysis transforms the signal from the original domain (usually the time or spatial domain) to a representation of the frequency domain or vice versa. Accordingly, converting the signal from the frequency domain to the original domain is called Inverse Fast Fourier Transform (IFFT).
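For a quick illustration of the FFT/IFFT pair, here is a minimal sketch (illustrative only; NumPy and the 440 Hz test tone are our own assumptions, not part of the patent):

```python
# Minimal FFT/IFFT round trip with NumPy; the test signal is arbitrary.
import numpy as np

signal = np.sin(2 * np.pi * 440 * np.arange(256) / 16000)  # 440 Hz tone at 16 kHz
spectrum = np.fft.rfft(signal)         # time domain -> frequency domain (FFT)
recovered = np.fft.irfft(spectrum)     # frequency domain -> time domain (IFFT)
assert np.allclose(signal, recovered)  # the two transforms are mutually inverse
```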
Frequency domain: a coordinate system used to describe the frequency characteristics of a signal. In electronics, control system engineering, and statistics, a frequency-domain plot shows how much of the signal lies within each given frequency band over a range of frequencies. A frequency-domain representation may also include the phase information of each sinusoid, so that the frequency components can be recombined to recover the original time-domain signal.
Power spectrum: short for power spectral density function, defined as the signal power within a unit frequency band. It shows how signal power varies with frequency, i.e., the distribution of signal power over the frequency domain.
Gated Recurrent Unit (GRU): a variant recurrent neural network (RNN) structure. It effectively alleviates the gradient vanishing and gradient explosion problems of the traditional RNN and learns the hidden state more quickly; compared with the Long Short-Term Memory network (LSTM), its structure is simpler and it trains faster.
Cascade coding (concatenated coding): in a system where encoding occurs multiple times (at least twice), the stages of encoding are regarded together as one overall code, referred to as a concatenated code. A concatenated code completes the coding process in several stages; it can meet channel error-correction requirements on code length and achieve error-correction capability and coding gain close to, or even equal to, those of long codes, while the added coding complexity is not very large. That is, if a system includes multiple encodings, those encodings are regarded as cascade coding. In transmission networks, concatenated codes balance coding-gain performance against encoding and decoding complexity, and are therefore widely used. A concatenated code can be realized as a combination of two or more coding methods.
The following describes an implementation environment related to a speech signal processing method provided by an embodiment of the present application.
Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment of a speech signal processing method according to an embodiment of the present application. The implementation environment includes: a terminal 101 and a server 102.
The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. Optionally, the terminal 101 is a smartphone, a tablet, a laptop, a desktop computer, or the like, but is not limited thereto. An application can be installed and run on the terminal 101. Optionally, the application is a social application, an online conference application, a voice call application, or the like. Illustratively, the terminal 101 is a terminal used by a user, and a user account of the user is logged in to the application running on the terminal 101. For example, a social application providing a voice call function runs on the terminal 101, and users can conduct voice calls with each other through it.
The server 102 may be an independent physical server, a server cluster or a distributed system including a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform. The server 102 is configured to provide background services for the application program executed by the terminal 101.
Optionally, during voice signal processing, the server 102 undertakes the primary computing work and the terminal 101 undertakes the secondary computing work; or the server 102 undertakes the secondary computing work and the terminal 101 undertakes the primary computing work; alternatively, the server 102 or the terminal 101 can each undertake the computing work alone.
Optionally, the terminal 101 generally refers to one of a plurality of terminals; this embodiment merely takes the terminal 101 as an example. Those skilled in the art will appreciate that the number of terminals can be greater. For example, there may be dozens or hundreds of terminals 101, or more, in which case the implementation environment of the voice signal processing method also includes those other terminals. The number of terminals and the device types are not limited in the embodiments of the present application.
Optionally, the wireless or wired networks described above use standard communication techniques and/or protocols. The Network is typically the Internet, but can be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wireline or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links can also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques can also be used in place of or in addition to the data communication techniques described above.
Schematically, application scenarios of the speech signal processing method provided by the embodiment of the present application include, but are not limited to, the following exemplary scenarios:
Scenario one: interworking between different networks
With the popularization of VoIP services, interworking between different networks is becoming more common. For example, an IP telephone on the Internet interworks with a fixed telephone on the Public Switched Telephone Network (PSTN), or an IP telephone interworks with a handset on a wireless network. Different networks adopt different speech codecs: for example, the wireless Global System for Mobile Communication (GSM) network adopts AMR-NB coding, fixed telephones adopt G.711 coding, and IP telephones adopt G.729 coding. Because the speech coding formats supported by the terminals of the various networks are not consistent, cascade coding on the call link is inevitable. In this scenario, the voice signal processing method provided by the present application can enhance the collected initial voice signal, or enhance the damaged voice signal that has undergone cascade coding processing, so that the finally played voice signal is closer to the collected initial voice signal; this effectively improves voice quality in this application scenario and further enhances voice intelligibility.
Scenario two: multi-party voice call scenario
At present, many applications running on terminals provide multi-party voice call functions, such as group audio/video calls and online audio/video conferences. In such a multi-party voice call scenario, the voice signal of each call party is collected, encoded, and compressed at that party's terminal and then sent to the mixing server for audio mixing, so the voice signal must undergo a further decoding and encoding pass; this also belongs to cascade coding processing and damages the voice signal, degrading voice quality. With the voice signal processing method provided by the present application, the initial voice signal of each call party can be enhanced, or the damaged voice signal after cascade coding processing can be enhanced, so that the finally played voice signal is closer to the collected initial voice signal; this effectively improves voice quality in this application scenario and enhances voice intelligibility.
Scenario three: live broadcast scenario
With the development of internet technology and the wide application of live broadcast services, people can watch live broadcasts on different types of terminals. Because different terminals use different coding formats, the voice signal in a live broadcast needs to be transcoded; transcoding also belongs to cascade coding processing, so the voice signal in the live broadcast is damaged and live broadcast quality suffers. With the voice signal processing method provided by the present application, the initial voice signal in the live broadcast can be enhanced, or the damaged voice signal after cascade coding processing can be enhanced, so that the finally played voice signal is closer to the collected initial voice signal; this effectively improves voice quality in this application scenario and enhances voice intelligibility.
The embodiments of the present application provide a speech signal processing method in which the speech signal to be processed may be an initial speech signal that has not undergone cascade coding processing, or a damaged speech signal that has. Generally, because the human ear is sensitive to sound energy, speech signals in different frequency bands are perceived very differently, and the subjective damage that cascade coding does to sound is most directly expressed as damage to frequency-domain sub-bands. For example, after the initial speech signal undergoes multiple passes of cascade coding, the high-frequency part of the signal is attenuated markedly more, which makes the sound perceptually blurred and harder to recognize. The embodiments of the present application make full use of these characteristics: a target neural network is used to learn the damage that cascade coding processing inflicts on the speech signal to be processed, and the band gain value of each frequency point of the speech signal in the frequency domain is obtained through the target neural network. The speech signal is thus enhanced per frequency-domain sub-band, improving speech quality and, in turn, speech intelligibility.
Fig. 2 is a flowchart of a speech signal processing method according to an embodiment of the present application. The main body of the speech signal processing method is a computer device, which is schematically the terminal 101 or the server 102 in fig. 1, and this is not limited in this embodiment of the present application. Referring to fig. 2, taking an application of the embodiment of the present application to a terminal as an example for explanation, as shown in fig. 2, the voice signal processing method includes the following steps:
201. converting a voice signal to be processed from a time domain to a frequency domain, and acquiring a first power spectrum and phase information of each frequency point in the frequency domain; the voice signal to be processed is an initial voice signal or a damaged voice signal, where the initial voice signal refers to a voice signal that has not undergone cascade coding processing, and the damaged voice signal refers to a voice signal obtained after the cascade coding processing.
In the embodiment of the present application, the speech signal to be processed may be a speech signal of a certain speaker, or may be a speech signal received in a certain scene. For example, the terminal collects the voice signal of the speaker in real time through the microphone. As another example, the terminal receives a voice signal in a live scene or an online conference scene. The embodiment of the present application does not limit the acquisition mode of the to-be-processed speech signal.
Optionally, the cascade coding processing includes M coding and decoding passes, where M is a positive integer greater than 1. The output of the previous coding and decoding pass is used as the input of the next pass; each pass comprises one coding process and one decoding process, and the output of the coding process is used as the input of the decoding process.
Schematically, taking the scenario of interworking between different networks as an example: when the voice signal to be processed traverses an actual link, it undergoes multiple codec passes. For example, when an IP phone supporting G.729 interworks with a GSM handset, the cascade coding processing includes two codec passes: G.729 encoding + G.729 decoding, followed by AMR-NB encoding + AMR-NB decoding. The number and types of codec passes in the cascade coding processing are not limited in the embodiments of the present application.
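The structure of such a link can be sketched as follows (a hedged illustration: the codec functions are hypothetical placeholders, not real G.729/AMR-NB implementations):

```python
# One codec pass = encode then decode; the output of each pass feeds the next.
def concatenated_coding(signal, codec_passes):
    """Run a signal through M encode/decode passes (M = len(codec_passes))."""
    for encode, decode in codec_passes:
        signal = decode(encode(signal))  # output of encoding is input of decoding
    return signal

# e.g. an IP phone (G.729) interworking with a GSM handset (AMR-NB):
# damaged = concatenated_coding(initial, [(g729_encode, g729_decode),
#                                         (amrnb_encode, amrnb_decode)])
```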
Optionally, the terminal transforms the voice signal to be processed from the time domain to the frequency domain, and obtains the first power spectrum and the phase information of each frequency point on the frequency domain, including but not limited to the following steps 2011 and 2012.
2011. And sequentially performing framing processing and windowing processing on the voice signal to be processed.
The speech signal to be processed is a series of ordered signals that is non-stationary macroscopically but stationary microscopically; that is, it has short-time stationarity (for example, the speech signal can be considered approximately stationary within 10 ms to 30 ms). Based on this characteristic, the speech signal to be processed can be divided into short segments for processing, where each short segment is called a frame, i.e., an audio frame. Illustratively, the playing duration of an audio frame may be 16 ms, 46.64 ms, 128 ms, or the like, which is not limited in the embodiments of the present application.
Optionally, when the terminal performs framing processing on the speech signal to be processed, in order to ensure transition smoothness and continuity between adjacent audio frames, it is further required to ensure that there is overlap between frames, where an overlapping portion between two adjacent frames is referred to as frame shift.
Optionally, when the terminal performs windowing on the voice signal to be processed, an analysis window of 10 ms or 20 ms may be used, where the window function may be a Hanning window, a Hamming window, a rectangular window, or the like; this is not limited in the embodiments of the present application. That is, after windowing, a plurality of analysis windows are formed, and the speech signal within one analysis window may be processed at a time. It should be understood here that windowing makes each segment of the speech signal to be processed behave periodically, which reduces the spectral energy leakage of the speech signal in the subsequent FFT.
2012. Performing FFT on the framed and windowed speech signal to be processed, and determining the first power spectrum and the phase information of each frequency point in the frequency domain based on the obtained transform result.
The terminal performs an N-point FFT (N is a positive integer) on the framed and windowed speech signal to be processed, obtaining K frequency points (K is a positive integer; for a real-valued signal, K = N/2 + 1), which yields the FFT result, i.e., a spectrogram. The terminal can then compute the power spectrum value of each frequency point from the amplitude of that frequency point in the spectrogram, and also obtain the phase information of each frequency point. In the embodiments of the present application, the power spectrum value of each frequency point is referred to as the first power spectrum.
Illustratively, take N = 256 FFT points and K = 129 frequency points: the terminal performs a 256-point FFT on an audio frame of the speech signal to be processed and obtains the power spectrum values of 129 frequency points. It should be noted that the number of FFT points and the number of frequency points may be set according to actual needs, which is not limited in the embodiments of the present application.
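As a minimal sketch of this analysis step (illustrative only; NumPy, a Hanning window, and the 256-point/129-frequency-point sizes above are assumptions, not mandated by the text):

```python
import numpy as np

def analyze_frame(frame, n_fft=256):
    """Window one audio frame and return its first power spectrum and phase."""
    windowed = frame * np.hanning(len(frame))   # windowing (step 2011)
    spectrum = np.fft.rfft(windowed, n=n_fft)   # N-point FFT (step 2012)
    power = np.abs(spectrum) ** 2               # K = n_fft // 2 + 1 = 129 points
    phase = np.angle(spectrum)                  # phase information per frequency point
    return power, phase
```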
Through the step 201, the terminal performs frequency domain conversion on the voice signal to be processed, obtains the first power spectrum and the phase information of each frequency point, and provides a basis for subsequently obtaining the frequency band gain value of each frequency point.
202. And acquiring the frequency band gain value of each frequency point, and determining the second power spectrum of each frequency point based on the first power spectrum and the frequency band gain value of each frequency point.
In the embodiments of the present application, the band gain value is the enhancement amplitude required by the first power spectrum of each frequency point; it can also be understood as the gain needed to enhance the first power spectrum of that frequency point. Enhancement here essentially means enhancing the speech signal to be processed, so as to improve voice quality. The second power spectrum is obtained by enhancing the first power spectrum of each frequency point.
The frequency band gain value can also be used to measure the damage degree of the voice signal corresponding to each frequency point, that is, the higher the frequency band gain value of a certain frequency point is, the more serious the damage degree of the voice signal corresponding to the frequency point is.
Optionally, the terminal uses the product of the first power spectrum of each frequency point and the band gain value as the second power spectrum of each frequency point. For example, the first power spectrum of a certain frequency point is 30dB/Hz, and the gain value of the frequency band corresponding to the frequency point acquired by the terminal is 1.10, then the second power spectrum of the frequency point is 33 dB/Hz.
It should be noted that the above examples of the band gain value are only illustrative, and in some embodiments, the band gain value may also be expressed in a percentage form, for example, the band gain value is 110%, and the like, and the embodiments of the present application do not limit this.
Optionally, the terminal obtains the band gain value of each frequency point by means of deep learning, and this step 202 may be replaced by the following step 2021 and step 2022, illustratively.
2021. Inputting the first power spectrum of each frequency point into a target neural network to obtain the band gain value of each frequency point; the target neural network comprises a first fully connected layer, gated recurrent unit (GRU) layers, and a second fully connected layer connected in sequence.
The target neural network is a deep-learning-based neural network. Optionally, the target neural network adopts a four-layer structure with the following characteristics: the input layer is a fully connected layer, i.e., the first fully connected layer; the output layer is a fully connected layer, i.e., the second fully connected layer; and the two middle hidden layers are GRU layers, with all layers connected in sequence.
Schematically, in the embodiments of the present application, the neuron numbers of the first fully connected layer and the GRU layers are set to 64, and the neuron number of the second fully connected layer is set to 129. In other embodiments, the neuron numbers of the first and second fully connected layers are set to 129 and the neuron number of the GRU layers is set to 64. The embodiments of the present application do not limit this.
Schematically, in the embodiments of the present application, the activation function of the first fully connected layer is set to the tanh function; the activation functions of the GRU layers are set to the relu function and the sigmoid function; and the activation function of the second fully connected layer is set to the sigmoid function. The embodiments of the present application do not limit the activation function types of the layers of the target neural network.
It should be noted that the structure of the target neural network can be flexibly adjusted according to the actual situation. Adjustments include, but are not limited to: changing the connection mode between layers, the feature input dimension, the neuron numbers, the hidden layer type, the activation function types, and so on. The embodiments of the present application do not limit this.
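As a sketch of the four-layer structure described above (an assumption on our part: the patent names no framework, so Keras is used here with the illustrative sizes and activations from the text):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_gain_network(num_bins=129, hidden=64):
    """FC(64, tanh) -> GRU(64) -> GRU(64) -> FC(129, sigmoid), per audio frame."""
    return tf.keras.Sequential([
        layers.Input(shape=(None, num_bins)),          # sequence of 129-point power spectra
        layers.Dense(hidden, activation="tanh"),       # first fully connected layer
        layers.GRU(hidden, activation="relu",
                   recurrent_activation="sigmoid",
                   return_sequences=True),             # hidden GRU layer 1
        layers.GRU(hidden, activation="relu",
                   recurrent_activation="sigmoid",
                   return_sequences=True),             # hidden GRU layer 2
        layers.Dense(num_bins, activation="sigmoid"),  # band gain value per frequency point
    ])
```

Note that a sigmoid output is bounded in (0, 1), so representing gains above 1 (such as the 1.10 in the earlier example) would require a rescaling step that the text leaves unspecified.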
This step 2021 is described in detail below, and includes the following steps 2021-1 to 2021-3.
2021-1, inputting the first power spectrum of each frequency point into the first fully connected layer, and performing feature extraction on the first power spectrum of each frequency point through the first fully connected layer to obtain feature vectors.
The first fully connected layer serves as the input layer of the target neural network and can perform feature extraction on the first power spectrum of each frequency point through its feature extraction function. A feature vector is used to characterize the power spectrum features of the first power spectrum.
2021-2, inputting the feature vectors into the GRU layer, and extracting the correlation and valid information between the feature vectors through an update gate and a reset gate in the GRU layer to obtain output vectors.
A GRU is a commonly used gated recurrent neural network. Its inputs are the input at the current time step and the hidden state from the previous time step; that is, the output vector is influenced by the current input and the preceding T-1 time steps, where T is greater than 1. The GRU layer includes two gate functions: the update gate controls how much state information from the previous time step is carried into the current state (the larger the update gate value, the more previous state information is carried in), while the reset gate controls how much information of the previous state is written into the current output vector (the smaller the reset gate value, the less previous-state information is written).
Because the voice signal is a time-series feature, after the terminal inputs the feature vector of each frequency point into the GRU layer, the GRU layer can extract the correlation and effective information among the feature vectors of the frequency points, thereby obtaining the output vectors. Illustratively, the GRU layer combines the feature vector corresponding to the current frequency point with the retained output vector of the previous frequency point and, through the processing of the update gate and the reset gate, generates the output vector of the current frequency point; this is iterated until finished.
The update gate and reset gate of the GRU layer determine which feature vectors can ultimately serve as the output vectors of the GRU layer. These two gating mechanisms can preserve information in long sequences; such information is neither cleared away over time nor removed merely because it appears irrelevant to the prediction, which ensures the reliability of the target neural network.
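For reference, the standard GRU gate equations are given below (well-known background rather than material from the patent; the update-gate sign convention is chosen to match the description above):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) && \text{(candidate state)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```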
2021-3, inputting the output vectors into the second fully connected layer, and integrating the output vectors into the band gain value of each frequency point through the second fully connected layer.
The second fully connected layer serves as the output layer; each neuron in this layer is fully connected to all neurons of the previous layer. Based on this full connection, the second fully connected layer can integrate the output vectors produced by the GRU layer and finally obtain the band gain value of each frequency point.
2022. And taking the product of the first power spectrum of each frequency point and the band gain value of each frequency point as the second power spectrum of each frequency point.
Through the step 2021 and the step 2022, the terminal acquires the band gain value of each frequency point in a targeted manner through the target neural network based on deep learning, so that the enhancement effect of the voice signal is more stable.
203. And generating a target voice signal which accords with the voice playing condition based on the phase information and the second power spectrum of each frequency point.
In the embodiments of the present application, the voice playing condition means that the voice quality of the voice signal meets a preset requirement. Based on the phase information and the second power spectrum of each frequency point, the terminal performs an N-point IFFT (with K frequency points) to obtain the IFFT result, i.e., generates the target voice signal. Since the FFT and the IFFT are mutually inverse transforms and the FFT was described in step 201, the IFFT is not described again here.
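Continuing the earlier sketch, steps 202 and 203 for a single frame might look as follows (an assumption in line with the text: the band gains multiply the power spectrum directly, and `gains` comes from the target neural network):

```python
import numpy as np

def enhance_frame(power, phase, gains, n_fft=256):
    """Apply band gain values and reconstruct the time-domain frame."""
    enhanced_power = power * gains             # second power spectrum (step 202)
    magnitude = np.sqrt(enhanced_power)        # power back to amplitude
    spectrum = magnitude * np.exp(1j * phase)  # recombine with phase information
    return np.fft.irfft(spectrum, n=n_fft)     # N-point IFFT (step 203)
```

Overlap-adding successive frames (using the frame shift from step 2011) would then stitch the enhanced frames back into a continuous signal.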
Wherein, the step 203 includes the following two cases:
Case one: in response to the voice signal to be processed being the initial voice signal, step 203 can be replaced by the following steps 2031 and 2032.
2031. And generating an intermediate voice signal based on the phase information and the second power spectrum of each frequency point.
2032. And carrying out cascade coding processing on the intermediate voice signal to obtain a target voice signal.
Case two: in response to the voice signal to be processed being the damaged voice signal, the terminal first performs the step of "performing cascade coding processing on the initial voice signal to obtain the damaged voice signal" before performing step 201, and then performs steps 201 to 203 in sequence.
Optionally, the step of "performing the cascade coding processing on the initial speech signal to obtain the damaged speech signal" may also be replaced by "performing the coding and decoding processing on the initial speech signal M times to obtain the damaged speech signal".
Optionally, before performing step 2021, the terminal trains an initial neural network with a large number of voice samples to finally obtain the target neural network. Illustratively, the training process for the target neural network includes the following steps 2021-4 to 2021-7.
2021-4, obtaining a third power spectrum of each frequency point of the voice sample in the frequency domain, where the third power spectrum is obtained by transforming the voice sample from the time domain to the frequency domain.
The voice samples include, but are not limited to: the voice signal of a speaker, the voice signal in a video, the voice signal collected in a particular scene, and the like. Generally, multiple voice samples are used; training on a large number of voice samples gives the finally trained target neural network good universality and robustness.
In step 2021-4, the terminal sequentially performs framing, windowing and FFT on the voice sample to obtain an FFT result, thereby implementing frequency domain conversion on the voice sample and obtaining a third power spectrum of each frequency point of the voice sample in the frequency domain.
2021-5, inputting the third power spectrum corresponding to the voice sample into the initial neural network to obtain a predicted band gain value corresponding to the third power spectrum.
Each voice sample is labeled with the target band gain value corresponding to its third power spectrum. The terminal obtains the predicted band gain value corresponding to the third power spectrum based on the current network parameters of the initial neural network.
2021-6, constructing a loss function based on the predicted band gain values and the target band gain values for the speech samples.
The target band gain value is obtained based on the third power spectrum and a fourth power spectrum corresponding to the voice sample, where the fourth power spectrum is obtained by performing cascade coding processing on the voice sample and then transforming it from the time domain to the frequency domain.
Optionally, the method for the terminal to construct the loss function includes, but is not limited to: the loss function is constructed using a difference between a predicted band gain value and a target band gain value for a speech sample, using a ratio between the predicted band gain value and the target band gain value for the speech sample, using a product value between the predicted band gain value and the target band gain value for the speech sample, and so on.
In addition, the loss function in the embodiment of the present application may be various loss functions commonly used in neural network training, for example, an absolute value loss function, a cosine similarity loss function, a square loss function, a cross entropy loss function, and the like, which is not limited in the embodiment of the present application.
Optionally, the target band gain value is the square root of the ratio of the third power spectrum to the fourth power spectrum. Schematically, see the following formula (1):
target band gain value(i) = sqrt(E_org(i) / E_deg(i))    (1)
where sqrt is the square-root function; E_org(i) is the original speech power spectrum at the i-th frequency point of each audio frame after the voice sample is transformed to the frequency domain by FFT, i.e., the third power spectrum; and E_deg(i) is the degraded speech power spectrum at the i-th frequency point of each audio frame after the voice sample undergoes cascade coding processing and is transformed to the frequency domain by FFT, i.e., the fourth power spectrum.
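A small sketch of formula (1) (the epsilon guard is our own assumption; the text does not say how zero-power frequency points are handled):

```python
import numpy as np

def target_band_gains(power_original, power_degraded, eps=1e-12):
    """Per-frequency-point training target: sqrt(E_org(i) / E_deg(i))."""
    return np.sqrt(power_original / (power_degraded + eps))
```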
2021-7, continuously adjusting network parameters of the initial neural network based on the loss function until a preset condition is met, and obtaining a target neural network.
The preset condition is that the loss value (also called the error value) is smaller than a set threshold, where the threshold may be set according to actual requirements, for example, according to the numerical precision of the target neural network; this is not limited in the present application. If the loss function does not meet the preset condition, the network parameters of the current neural network are adjusted and execution resumes from step 2021-4; training stops once the loss function meets the preset condition, yielding the target neural network.
It should be noted that the training process of the target neural network may further include other steps or other alternative implementations, which are not limited in this application.
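Under the assumptions of the earlier sketches, the training loop itself could be as minimal as the following (the squared loss is one of the choices the text permits; the Adam optimizer is our own assumption):

```python
import tensorflow as tf

model = build_gain_network()                 # from the earlier network sketch
model.compile(optimizer="adam", loss="mse")  # squared loss on band gain values

# third_power:  (batch, frames, 129) power spectra of the voice samples
# target_gains: (batch, frames, 129) labels from target_band_gains(...)
# model.fit(third_power, target_gains, epochs=..., batch_size=...)
```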
In addition, the target neural network of the embodiments of the present application is not limited to the above type; any other network that is based on machine learning or deep learning and obtains the band gain value of each frequency point may serve as the target neural network of the embodiments of the present application.
For a voice signal to be processed, a first power spectrum and phase information of the voice signal at each frequency point in the frequency domain are first acquired; the first power spectrum is then enhanced using the band gain value acquired for each frequency point, yielding a second power spectrum for each frequency point; and a target voice signal meeting the voice playing condition is then generated from the second power spectrum and the phase information of each frequency point. This processing enhances the power spectrum of each frequency point in a targeted manner, so the enhancement effect on the voice signal is more stable, voice quality is effectively improved, and voice intelligibility is enhanced. Moreover, the voice signal can be enhanced in this way whether or not it has undergone cascade coding processing, giving the method a wide application range.
It should be noted that the speech signal processing method shown in fig. 2 covers two speech signal processing schemes, one is a speech signal processing method when the speech signal to be processed is an initial speech signal, and the other is a speech signal processing method when the speech signal to be processed is a damaged speech signal. The two speech signal processing schemes provided in the present application are schematically illustrated based on two specific embodiments.
First, a speech signal processing scheme when a speech signal to be processed is an initial speech signal.
Referring first to fig. 3, fig. 3 is a schematic diagram of a speech signal processing scheme according to an embodiment of the present application. As shown in fig. 3, the terminal performs deep-learning preprocessing on the initial speech signal and then performs cascade coding processing on the preprocessed signal, finally obtaining the target speech signal. Deep-learning preprocessing is the process in which the terminal obtains the band gain values corresponding to the initial voice signal through the target neural network and then enhances the initial voice signal.
Next, referring to fig. 4, fig. 4 is a flowchart of another speech signal processing method provided in an embodiment of the present application. This speech signal processing scheme is explained in detail below in conjunction with fig. 4. As shown in fig. 4: first, the terminal performs FFT on the initial voice signal to obtain the power spectrum and phase information of each frequency point of the initial voice signal in the frequency domain; then, the terminal inputs the power spectrum of each frequency point into the target neural network, which comprises two fully connected layers and two GRU layers, to obtain the band gain value corresponding to each frequency point; next, the terminal multiplies the power spectrum of each frequency point by its corresponding band gain value to obtain the enhanced power spectrum of each frequency point; finally, based on the phase information and the enhanced power spectrum of each frequency point, the terminal obtains a preprocessed voice signal through IFFT and performs cascade coding processing on it to obtain the target voice signal. The preprocessed voice signal is the intermediate voice signal described in the above embodiment.
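Gluing the earlier sketches together, the order of operations in this scheme is (hypothetical glue code; framing and overlap-add details are omitted):

```python
power, phase = analyze_frame(frame)                # FFT analysis of one frame
gains = model.predict(power[None, None, :])[0, 0]  # target neural network
preprocessed = enhance_frame(power, phase, gains)  # enhanced (intermediate) frame
# target = concatenated_coding(preprocessed_signal, codec_passes)  # then cascade coding
```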
Finally, referring to fig. 5, fig. 5 is a flowchart of another speech signal processing method provided according to an embodiment of the present application. As shown in fig. 5, the speech signal processing method includes the following steps 501 to 504.
501. Transform the initial speech signal from the time domain to the frequency domain, and acquire the first power spectrum and phase information of each frequency point in the frequency domain; the initial speech signal is a speech signal that has not undergone cascade coding.
502. Acquire the band gain value of each frequency point, and determine the second power spectrum of each frequency point based on the first power spectrum and the band gain value of each frequency point.
503. Generate an intermediate speech signal based on the phase information and the second power spectrum of each frequency point.
504. Perform cascade coding on the intermediate speech signal to obtain a target speech signal that meets the speech playing condition.
It should be noted that the preprocessed speech signal is obtained by deep learning preprocessing of the initial speech signal, so the signal that emerges from the subsequent cascade coding is closer to the initial speech signal; that is, better sound quality can be restored. This effectively improves speech quality in cascade-coding scenarios and thereby enhances speech intelligibility.
Second, the speech signal processing scheme used when the speech signal to be processed is a damaged speech signal.
Referring first to fig. 6, fig. 6 is a schematic diagram of another speech signal processing scheme provided in an embodiment of the present application. As shown in fig. 6, the terminal performs cascade coding on the initial speech signal to obtain a damaged speech signal, also referred to as a degraded speech signal; the terminal then performs deep learning repair on the damaged speech signal to obtain the target speech signal. Deep learning repair is the process in which the terminal obtains the band gain values corresponding to the damaged speech signal through a target neural network and then enhances the damaged speech signal.
Next, referring to fig. 7, fig. 7 is a flowchart of another speech signal processing method provided in an embodiment of the present application; the scheme is explained in detail below in conjunction with it. As shown in fig. 7, the terminal first performs cascade coding on the initial speech signal to obtain a damaged speech signal, and then performs an FFT on the damaged speech signal to obtain the power spectrum and phase information of each frequency point of the signal in the frequency domain. The terminal inputs the power spectra into a target neural network, which comprises two fully connected layers and two GRU layers, to obtain the band gain value corresponding to each frequency point, and multiplies the power spectrum of each frequency point by its band gain value to obtain the enhanced power spectrum of each frequency point. Finally, based on the phase information and the enhanced power spectra, the terminal obtains the target speech signal through an IFFT.
Finally, referring to fig. 8, fig. 8 is a flowchart of another speech signal processing method provided according to an embodiment of the present application. As shown in fig. 8, the speech signal processing method includes the following steps 801 to 804.
801. Perform cascade coding on the initial speech signal to obtain a damaged speech signal; the initial speech signal is a speech signal that has not undergone cascade coding, and the damaged speech signal is the speech signal obtained after cascade coding.
802. Transform the damaged speech signal from the time domain to the frequency domain to obtain the first power spectrum and phase information of each frequency point in the frequency domain.
803. Acquire the band gain value of each frequency point, and determine the second power spectrum of each frequency point based on the first power spectrum and the band gain value of each frequency point.
804. Generate a target speech signal meeting the speech playing condition based on the phase information and the second power spectrum of each frequency point.
It should be noted that the speech signal obtained after deep learning repair of the damaged speech signal is closer to the initial speech signal; that is, better sound quality can be restored. This effectively improves speech quality in cascade-coding scenarios and thereby enhances speech intelligibility.
In summary, for the speech signal to be processed, the embodiments of the present application first obtain the first power spectrum and phase information of each frequency point in the frequency domain, then enhance the first power spectrum with the band gain value corresponding to each frequency point to obtain the second power spectrum of each frequency point, and then generate a target speech signal meeting the speech playing condition from the second power spectra and the phase information. This targeted, per-frequency-point enhancement makes the enhancement effect more stable, effectively improves speech quality, and enhances speech intelligibility; and because it applies whether or not the signal has undergone cascade coding, it has a wide range of application.
Fig. 9 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application. The apparatus is configured to perform the steps of the speech signal processing method described above. Referring to fig. 9, the speech signal processing apparatus includes: an obtaining module 901, a determining module 902 and a generating module 903.
An obtaining module 901, configured to transform a speech signal to be processed from a time domain to a frequency domain, and obtain a first power spectrum and phase information of each frequency point in the frequency domain; the voice signal to be processed is an initial voice signal or a damaged voice signal, the initial voice signal refers to a voice signal which is not subjected to cascade coding processing, and the damaged voice signal refers to a voice signal obtained after the cascade coding processing;
a determining module 902, configured to obtain a frequency band gain value of each frequency point, and determine a second power spectrum of each frequency point based on the first power spectrum and the frequency band gain value of each frequency point;
a generating module 903, configured to generate a target voice signal meeting the voice playing condition based on the phase information and the second power spectrum of each frequency point.
In an optional implementation manner, in response to the to-be-processed speech signal being the initial speech signal, the generating module 903 is further configured to:
generating an intermediate voice signal based on the phase information and the second power spectrum of each frequency point;
and performing the cascade coding processing on the intermediate voice signal to obtain the target voice signal.
In an optional implementation manner, in response to the to-be-processed speech signal being the damaged speech signal, the apparatus further includes:
and the processing module is used for performing the cascade coding processing on the initial voice signal before transforming the voice signal to be processed from the time domain to the frequency domain to obtain the damaged voice signal.
In an alternative implementation, the determining module 902 is configured to:
inputting the first power spectrum of each frequency point into a target neural network to obtain a frequency band gain value of each frequency point; wherein the target neural network comprises a first fully connected layer, a gated recurrent unit (GRU) layer and a second fully connected layer which are connected in sequence;
taking the product of the first power spectrum of each frequency point and the band gain value as the second power spectrum of each frequency point.
In an optional implementation, the apparatus further comprises a training module configured to:
acquiring a third power spectrum of each frequency point of the voice sample on a frequency domain, wherein the third power spectrum is obtained by converting the voice sample from a time domain to the frequency domain;
inputting a third power spectrum corresponding to the voice sample into an initial neural network to obtain a predicted frequency band gain value corresponding to the third power spectrum;
constructing a loss function based on the predicted band gain value and the target band gain value of the voice sample;
continuously adjusting network parameters of the initial neural network based on the loss function until preset conditions are met to obtain the target neural network;
the target frequency band gain value is obtained based on the third power spectrum and a fourth power spectrum corresponding to the voice sample, and the fourth power spectrum is obtained by performing the cascade coding processing on the voice sample and then transforming the voice sample from a time domain to a frequency domain.
In an alternative implementation, the target band gain value is the square root of the ratio of the third power spectrum to the fourth power spectrum.
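As a concrete illustration, the gain label and one plausible loss can be computed as below. The mean-squared-error form is an assumption, since the embodiments only state that a loss function is constructed from the predicted and target band gain values; eps is an illustrative guard against division by zero.

    import numpy as np

    def target_band_gains(third_power, fourth_power, eps=1e-10):
        # Target gain per frequency point: square root of the ratio of the clean
        # sample's power spectrum (third) to its cascade-coded version's (fourth).
        return np.sqrt(third_power / (fourth_power + eps))

    def gain_loss(predicted_gains, target_gains):
        # Assumed MSE loss between predicted and target band gain values.
        return np.mean((predicted_gains - target_gains) ** 2)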
In an optional implementation, the determining module 902 is further configured to:
inputting the first power spectrum of each frequency point into the first fully connected layer, and performing feature extraction on the first power spectrum of each frequency point through the first fully connected layer to obtain feature vectors;
inputting the feature vectors into the gated recurrent unit layer, and extracting the correlations and effective information among the feature vectors through the update gate and reset gate in the gated recurrent unit layer to obtain output vectors;
and inputting the output vectors into the second fully connected layer, and integrating the output vectors into the band gain value of each frequency point through the second fully connected layer.
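A minimal sketch of such a network, assuming PyTorch, is given below. All layer sizes are illustrative; the stack uses two GRU layers so that it also matches the two-fully-connected-plus-two-GRU description accompanying figs. 4 and 7, while keeping the fully connected / gated recurrent unit / fully connected ordering described here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GainNet(nn.Module):
        def __init__(self, num_bins=257, hidden=256):
            super().__init__()
            self.fc_in = nn.Linear(num_bins, hidden)    # first fully connected layer
            self.gru = nn.GRU(hidden, hidden, num_layers=2,
                              batch_first=True)         # gated recurrent unit layers
            self.fc_out = nn.Linear(hidden, num_bins)   # second fully connected layer

        def forward(self, power_spectra):
            # power_spectra: (batch, frames, num_bins) first power spectra.
            features = torch.relu(self.fc_in(power_spectra))  # feature extraction
            outputs, _ = self.gru(features)  # update/reset gating across frames
            # Softplus keeps gains positive; the embodiments do not specify an output
            # activation, and target gains may exceed 1, so this choice is an assumption.
            return F.softplus(self.fc_out(outputs))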
In an optional implementation manner, the obtaining module 901 is further configured to:
sequentially performing framing processing and windowing processing on the voice signal to be processed;
performing a fast Fourier transform on the speech signal to be processed after the framing and windowing; and determining the first power spectrum and the phase information of each frequency point in the frequency domain based on the obtained transform result.
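A sketch of this analysis step follows; the frame length, hop size, and Hann window are illustrative choices that the embodiments do not mandate.

    import numpy as np

    def analyze(signal, frame_len=512, hop=256):
        window = np.hanning(frame_len)  # windowing (Hann window assumed)
        powers, phases = [], []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window  # framing + windowing
            spectrum = np.fft.rfft(frame)                     # fast Fourier transform
            powers.append(np.abs(spectrum) ** 2)              # first power spectrum
            phases.append(np.angle(spectrum))                 # phase information
        return np.array(powers), np.array(phases)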
In an optional implementation manner, the cascade coding process includes M encoding-decoding passes, where M is a positive integer greater than 1, and the processing module is further configured to:
perform encoding and decoding on the initial speech signal M times to obtain the damaged speech signal;
wherein the output of each encoding-decoding pass serves as the input of the next; any single pass consists of one encoding process and one decoding process, with the output of the encoding process serving as the input of the decoding process.
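The cascade itself reduces to a simple loop. In this sketch, encode and decode are hypothetical placeholders for whatever real speech codecs the legs of the call chain use; the embodiments do not name specific codecs.

    def cascade_code(signal, encode, decode, m):
        # M encoding-decoding passes, M > 1: each pass's decoded output
        # becomes the next pass's encoder input.
        assert m > 1
        for _ in range(m):
            bitstream = encode(signal)  # encoding output feeds the decoder
            signal = decode(bitstream)
        return signal                   # the damaged (degraded) speech signal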
For the speech signal to be processed, the apparatus first obtains the first power spectrum and phase information of each frequency point in the frequency domain, then enhances the first power spectrum with the band gain value corresponding to each frequency point to obtain the second power spectrum of each frequency point, and then generates a target speech signal meeting the speech playing condition from the second power spectra and the phase information. Enhancing the power spectrum of each frequency point in this targeted manner makes the enhancement effect more stable, effectively improves speech quality, and enhances speech intelligibility; and since the apparatus can enhance the speech signal whether or not it has undergone cascade coding, it has a wide range of application.
It should be noted that, when the speech signal processing apparatus provided in the foregoing embodiment processes a speech signal, the division into the above functional modules is only used as an example; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech signal processing apparatus and the speech signal processing method provided by the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
In the speech signal processing method provided by the embodiments of the present application, the computer device can be configured as a terminal or a server; that is, the method can be executed with either the terminal or the server as the execution subject. The processing can also be performed through interaction between the two: for example, the terminal sends the speech signal to be processed to the server and requests the target speech signal, and the server performs the speech signal processing on the received signal and feeds the target speech signal back to the terminal once it is obtained. The embodiments of the present application do not limit the interaction mode between the terminal and the server.
In an exemplary embodiment, a computer device is also provided. Taking the computer device as a terminal as an example, fig. 10 shows a schematic structural diagram of a terminal 1000 according to an exemplary embodiment of the present application. The terminal 1000 can be a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 1000 can also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 1000 can include: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1001 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1002 is used to store at least one program code for execution by the processor 1001 to implement the speech signal processing methods provided by the method embodiments herein.
In some embodiments, terminal 1000 can also optionally include: a peripheral interface 1003 and at least one peripheral. The processor 1001, memory 1002 and peripheral interface 1003 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1004, display screen 1005, camera assembly 1006, audio circuitry 1007, positioning assembly 1008, and power supply 1009.
The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The radio frequency circuit 1004 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1004 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 1005 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, it can also capture touch signals on or over its surface, which may be input to the processor 1001 as control signals for processing. At this point, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, disposed on the front panel of terminal 1000; in other embodiments, there may be at least two display screens 1005, respectively disposed on different surfaces of terminal 1000 or in a folded design; in still other embodiments, the display screen 1005 may be a flexible display disposed on a curved or folded surface of terminal 1000. The display screen 1005 may even be arranged as a non-rectangular irregular figure, i.e., an irregularly shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1006 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1001 for processing or inputting the electric signals to the radio frequency circuit 1004 for realizing voice communication. For stereo sound collection or noise reduction purposes, multiple microphones can be provided, each at a different location of terminal 1000. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic location of terminal 1000 to implement navigation or LBS (Location Based Service). The positioning component 1008 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 1009 is used to supply power to various components in terminal 1000. The power source 1009 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 1009 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1000 can also include one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyro sensor 1012, pressure sensor 1013, fingerprint sensor 1014, optical sensor 1015, and proximity sensor 1016.
Acceleration sensor 1011 can detect acceleration magnitudes on three coordinate axes of a coordinate system established with terminal 1000. For example, the acceleration sensor 1011 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1001 may control the display screen 1005 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1011. The acceleration sensor 1011 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1012 may detect a body direction and a rotation angle of the terminal 1000, and the gyro sensor 1012 and the acceleration sensor 1011 may cooperate to acquire a 3D motion of the user on the terminal 1000. From the data collected by the gyro sensor 1012, the processor 1001 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1013 can be disposed on a side frame of terminal 1000 and/or underneath display screen 1005. When pressure sensor 1013 is disposed on a side frame of terminal 1000, a user's grip signal on terminal 1000 can be detected, and processor 1001 performs left-right hand recognition or shortcut operation according to the grip signal collected by pressure sensor 1013. When the pressure sensor 1013 is disposed at a lower layer of the display screen 1005, the processor 1001 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1005. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1014 is used to collect a fingerprint of the user, and the processor 1001 identifies the user according to the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying, and changing settings, etc. Fingerprint sensor 1014 may be disposed on a front, back, or side of terminal 1000. When a physical key or vendor Logo is provided on terminal 1000, fingerprint sensor 1014 can be integrated with the physical key or vendor Logo.
The optical sensor 1015 is used to collect the ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the display screen 1005 according to the ambient light intensity collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the display screen 1005 is turned down. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the intensity of the ambient light collected by the optical sensor 1015.
The proximity sensor 1016, also known as a distance sensor, is typically disposed on the front panel of terminal 1000 and is used to gather the distance between the user and the front face of terminal 1000. In one embodiment, when the proximity sensor 1016 detects that this distance is gradually decreasing, the processor 1001 controls the display screen 1005 to switch from the screen-on state to the screen-off state; when the proximity sensor 1016 detects that the distance is gradually increasing, the processor 1001 controls the display screen 1005 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in FIG. 10 is not intended to be limiting and that terminal 1000 can include more or fewer components than shown, or some components can be combined, or a different arrangement of components can be employed.
Taking the computer device as a server as an example, fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1100 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 1101 and one or more memories 1102, where the memories 1102 store at least one computer program that is loaded and executed by the processors 1101 to implement the speech signal processing methods provided by the above method embodiments. Certainly, the server can also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and can include other components for realizing the functions of the device, which are not described here again.
The embodiment of the present application further provides a computer-readable storage medium, which is applied to a computer device, and the computer-readable storage medium stores at least one computer program, which is loaded and executed by a processor to implement the operations performed by the computer device in the voice signal processing method of the foregoing embodiment.
Embodiments of the present application also provide a computer program product or a computer program comprising computer program code stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, so that the computer device performs the voice signal processing method provided in the above-described various alternative implementations.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method of speech signal processing, the method comprising:
converting a voice signal to be processed from a time domain to a frequency domain, and acquiring a first power spectrum and phase information of each frequency point on the frequency domain; the voice signal to be processed is an initial voice signal or a damaged voice signal, the initial voice signal refers to a voice signal which is not subjected to cascade coding processing, and the damaged voice signal refers to a voice signal obtained after the cascade coding processing;
acquiring a frequency band gain value of each frequency point, and determining a second power spectrum of each frequency point based on the first power spectrum and the frequency band gain value of each frequency point;
and generating a target voice signal which accords with the voice playing condition based on the phase information and the second power spectrum of each frequency point.
2. The method of claim 1, wherein in response to the to-be-processed speech signal being the initial speech signal, the generating a target speech signal meeting speech playing conditions based on the phase information and the second power spectrum of each frequency point comprises:
generating an intermediate voice signal based on the phase information and the second power spectrum of each frequency point;
and performing the cascade coding processing on the intermediate voice signal to obtain the target voice signal.
3. The method of claim 1, wherein in response to the speech signal to be processed being the corrupted speech signal, the method further comprises:
and before the voice signal to be processed is converted from the time domain to the frequency domain, the initial voice signal is subjected to the cascade coding processing to obtain the damaged voice signal.
4. The method according to claim 1, wherein the obtaining the band gain value of each frequency point and determining the second power spectrum of each frequency point based on the first power spectrum and the band gain value of each frequency point comprises:
inputting the first power spectrum of each frequency point into a target neural network to obtain a frequency band gain value of each frequency point; wherein the target neural network comprises a first fully-connected layer, a gated recurrent unit layer and a second fully-connected layer which are connected in sequence;
and taking the product of the first power spectrum of each frequency point and the frequency band gain value as the second power spectrum of each frequency point.
5. The method of claim 4, wherein the training process of the target neural network comprises:
acquiring a third power spectrum of each frequency point of a voice sample on a frequency domain, wherein the third power spectrum is obtained by converting the voice sample from a time domain to the frequency domain;
inputting a third power spectrum corresponding to the voice sample into an initial neural network to obtain a predicted frequency band gain value corresponding to the third power spectrum;
constructing a loss function based on the predicted band gain value and the target band gain value of the voice sample;
continuously adjusting network parameters of the initial neural network based on the loss function until preset conditions are met to obtain the target neural network;
the target frequency band gain value is obtained based on the third power spectrum and a fourth power spectrum corresponding to the voice sample, and the fourth power spectrum is obtained by performing the cascade coding processing on the voice sample and then transforming the voice sample from a time domain to a frequency domain.
6. The method of claim 5, wherein the target band gain value is a square root of a ratio of the third power spectrum to the fourth power spectrum.
7. The method according to any one of claims 4 to 6, wherein the inputting the first power spectrum of each frequency point into a target neural network to obtain a band gain value of each frequency point comprises:
inputting the first power spectrum of each frequency point into the first full-connection layer, and performing feature extraction on the first power spectrum of each frequency point through the first full-connection layer to obtain a feature vector;
inputting the feature vectors into the gated recurrent unit layer, and extracting correlation and effective information between the feature vectors through an update gate and a reset gate in the gated recurrent unit layer to obtain output vectors;
and inputting the output vectors into the second full-connection layer, and integrating the output vectors into the frequency band gain value of each frequency point through the second full-connection layer.
8. The method according to claim 1, wherein the transforming the speech signal to be processed from the time domain to the frequency domain to obtain the first power spectrum and the phase information of each frequency point in the frequency domain comprises:
sequentially performing framing processing and windowing processing on the voice signal to be processed;
performing a fast Fourier transform on the voice signal to be processed after the framing processing and windowing processing; and determining the first power spectrum and the phase information of each frequency point in the frequency domain based on the obtained transform result.
9. The method of claim 3, wherein the cascade coding process comprises M encoding and decoding processes, where M is a positive integer greater than 1, and wherein the performing the cascade coding processing on the initial voice signal to obtain the damaged voice signal comprises:
performing encoding and decoding processing on the initial voice signal for M times to obtain the damaged voice signal;
wherein the output of each encoding and decoding process serves as the input of the next; any single encoding and decoding process comprises one encoding process and one decoding process, with the output of the encoding process serving as the input of the decoding process.
10. A speech signal processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for converting the voice signal to be processed from a time domain to a frequency domain and acquiring a first power spectrum and phase information of each frequency point on the frequency domain; the voice signal to be processed is an initial voice signal or a damaged voice signal, the initial voice signal refers to a voice signal which is not subjected to cascade coding processing, and the damaged voice signal refers to a voice signal obtained after the cascade coding processing;
the determining module is used for acquiring the frequency band gain value of each frequency point and determining the second power spectrum of each frequency point based on the first power spectrum and the frequency band gain value of each frequency point;
and the generating module is used for generating a target voice signal which accords with a voice playing condition based on the phase information and the second power spectrum of each frequency point.
11. The apparatus of claim 10, wherein in response to the to-be-processed speech signal being the initial speech signal, the generation module is further configured to:
generating an intermediate voice signal based on the phase information and the second power spectrum of each frequency point;
and performing the cascade coding processing on the intermediate voice signal to obtain the target voice signal.
12. The apparatus of claim 10, wherein in response to the speech signal to be processed being the corrupted speech signal, the apparatus further comprises:
and the processing module is used for performing the cascade coding processing on the initial voice signal to obtain the damaged voice signal before transforming the voice signal to be processed from a time domain to a frequency domain.
13. The apparatus of claim 10, wherein the determining module is configured to:
inputting the first power spectrum of each frequency point into a target neural network to obtain a frequency band gain value of each frequency point; wherein the target neural network comprises a first fully-connected layer, a gated recurrent unit layer and a second fully-connected layer which are connected in sequence;
and taking the product of the first power spectrum of each frequency point and the frequency band gain value as the second power spectrum of each frequency point.
14. A computer device, characterized in that the computer device comprises a processor and a memory for storing at least one computer program, which is loaded by the processor and which performs the speech signal processing method according to any one of claims 1 to 9.
15. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor to implement the speech signal processing method according to any one of claims 1 to 9.
CN202110226589.7A 2021-03-01 2021-03-01 Voice signal processing method, device, equipment and storage medium Pending CN113707162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110226589.7A CN113707162A (en) 2021-03-01 2021-03-01 Voice signal processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110226589.7A CN113707162A (en) 2021-03-01 2021-03-01 Voice signal processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113707162A true CN113707162A (en) 2021-11-26

Family

ID=78647795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110226589.7A Pending CN113707162A (en) 2021-03-01 2021-03-01 Voice signal processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113707162A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338623A (en) * 2022-01-05 2022-04-12 腾讯科技(深圳)有限公司 Audio processing method, device, equipment, medium and computer program product
CN114338623B (en) * 2022-01-05 2023-12-05 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination