CN111710344A - Signal processing method, device, equipment and computer readable storage medium - Google Patents

Signal processing method, device, equipment and computer readable storage medium

Info

Publication number
CN111710344A
CN111710344A (application CN202010597937.7A)
Authority
CN
China
Prior art keywords
audio signal
noise
audio
dimensional
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010597937.7A
Other languages
Chinese (zh)
Inventor
夏咸军 (Xia Xianjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010597937.7A priority Critical patent/CN111710344A/en
Publication of CN111710344A publication Critical patent/CN111710344A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082 — Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

An embodiment of the invention provides a signal processing method, apparatus, device, and computer-readable storage medium. The method comprises: collecting an audio signal to be processed and extracting its spectral features, the spectral features comprising an N-dimensional log energy spectrum feature; calling a noise optimization model to process the log energy spectrum feature to obtain M-dimensional noise correction coefficients corresponding to the N-dimensional log energy spectrum feature, where N and M are positive integers; and computing the processed audio signal from the N-dimensional log energy spectrum feature and the M-dimensional noise correction coefficients. In this embodiment, the noise optimization model generates the noise correction coefficients, which are then used to reduce or eliminate the noise in the audio signal to be processed, thereby improving call quality.

Description

Signal processing method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a signal processing method, apparatus, device, and computer-readable storage medium.
Background
With the continuous development of communication technology, people's demands on signal quality keep rising. In scenarios such as holding a network conference over a computer network or a mobile communication network, it is desirable that the conference call signal be clear and intelligible, and that unwanted signals captured along with the participants' voices be reduced as far as possible.
In one scenario, the unwanted signal is primarily a noise signal, which may be an unwanted echo audio signal. In a multi-party teleconference, several participants may speak simultaneously. The local voice communication device must then play the voices of participants in other locations while also collecting the local participants' voices, and owing to factors such as the conference room environment, the local audio collected by the device contains special noise signals, such as echoes of the played-back voices reflected by the conference room.
These echo signals adversely affect the conference voice signals: for example, they may introduce howling and other noises into the voice conference, degrading the quality of voice interaction.
Disclosure of Invention
The embodiment of the invention provides a signal processing method, a signal processing device, signal processing equipment and a computer readable storage medium, which can improve the quality of voice interaction.
In one aspect, an embodiment of the present application provides a signal processing method, where the method includes:
acquiring an audio signal to be processed, and extracting the spectral feature of the audio signal to be processed, wherein the spectral feature comprises an N-dimensional logarithmic energy spectral feature;
calling a noise optimization model to process the logarithmic energy spectrum feature to obtain an M-dimensional noise correction coefficient corresponding to the N-dimensional logarithmic energy spectrum feature, wherein N and M are positive integers;
calculating the N-dimensional logarithmic energy spectrum feature and the M-dimensional noise correction coefficient to obtain a processed audio signal;
the noise optimization model is obtained by training according to audio training data including noise audio signals, and the M-dimensional noise correction coefficient output by the noise optimization model includes: and a p-dimensional coefficient for modifying a characteristic of the input log energy spectrum characteristic with respect to the noise audio signal, p being smaller than M.
In another aspect, the present application provides a signal processing apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an audio signal to be processed and extracting the spectral characteristics of the audio signal to be processed, and the spectral characteristics comprise N-dimensional logarithmic energy spectral characteristics;
the processing unit is used for calling a noise optimization model to process the logarithmic energy spectrum feature to obtain an M-dimensional noise correction coefficient corresponding to the N-dimensional logarithmic energy spectrum feature, wherein N and M are positive integers; calculating the N-dimensional logarithmic energy spectrum feature and the M-dimensional noise correction coefficient to obtain a processed audio signal;
the noise optimization model is obtained by training according to audio training data including noise audio signals, and the M-dimensional noise correction coefficient output by the noise optimization model includes: and a p-dimensional coefficient for modifying a characteristic of the input log energy spectrum characteristic with respect to the noise audio signal, p being smaller than M.
Correspondingly, an embodiment of the present application further provides a signal processing device comprising a processor, a memory, and a communication interface connected to one another, where the memory stores a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the signal processing method described above.
Accordingly, the present application provides a computer-readable storage medium having stored thereon one or more instructions adapted to be loaded by a processor and to perform the above-mentioned signal processing method.
Accordingly, the present application provides a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium, the computer instructions being read by a processor of a computer device from the computer-readable storage medium, the computer instructions being executed by the processor to cause the computer device to perform the above-mentioned signal processing method.
In this embodiment of the application, a pre-trained noise optimization model generates noise correction coefficients from the log energy spectrum features of collected audio signals to be processed, such as those produced in audio/video conferences and calls. These coefficients optimize and correct the signals, reducing or even eliminating the adverse effect of noise audio signals such as echo, thereby improving the quality of voice interaction.
Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described here show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1a is a scene architecture diagram of signal processing according to an embodiment of the present invention;
fig. 1b is a flowchart of signal processing according to an embodiment of the present application;
fig. 2 is a flowchart of a signal processing method according to an embodiment of the present application;
fig. 3 is a flowchart of extracting frequency-domain spectral features from a time-domain audio signal according to an embodiment of the present application;
FIG. 4a is a flowchart of a model training method according to an embodiment of the present disclosure;
FIG. 4b is a flow chart of another model training method provided by the embodiments of the present application;
FIG. 5 is a schematic diagram illustrating training of a noise optimization model according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another signal processing method provided in the embodiment of the present application;
fig. 7 is a diagram of a conference session interface provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a signal processing apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an intelligent device according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The embodiments of the application involve Artificial Intelligence (AI) and Machine Learning (ML). Combining AI and ML allows features in an audio signal to be mined and analyzed, so that a device can more accurately identify and process the audio signal and determine the spectral features of a noise signal such as an echo, thereby reducing or even eliminating the noise signal's adverse effect on the original audio signal. AI is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
AI is a comprehensive discipline spanning a wide range of fields, covering both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning; the embodiments of the present application mainly involve speech processing technology.
ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. ML is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications span all fields of AI. ML and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Statistical echo cancellation algorithms based on conventional machine learning may be used to analyze the audio signal to be processed; such algorithms include, for example, adaptive-filter-based echo cancellation. These conventional statistical cancellation algorithms estimate the filter coefficients from the statistical characteristics of the input and output signals and automatically adjust the weighting coefficients with a specific algorithm to achieve echo cancellation. The filter coefficients are usually estimated by optimizing a least mean squares (LMS) objective.
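As a rough illustration of the adaptive-filtering approach described above, the following is a minimal normalized-LMS (NLMS) sketch in NumPy. The function name, tap count, and step size are illustrative assumptions, not details taken from the patent:

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, num_taps=64, mu=0.5, eps=1e-8):
    """Cancel echo in `mic` given the far-end reference signal, via NLMS.

    Returns the error signal, i.e. the near-end estimate after echo removal.
    """
    w = np.zeros(num_taps)                  # adaptive filter taps (echo-path estimate)
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        # Most recent far-end samples, newest first; zero-pad before signal start.
        x = far_end[max(0, n - num_taps + 1):n + 1][::-1]
        x = np.pad(x, (0, num_taps - len(x)))
        y_hat = w @ x                       # predicted echo
        e = mic[n] - y_hat                  # residual = near-end speech estimate
        w += mu * e * x / (x @ x + eps)     # normalized LMS weight update
        out[n] = e
    return out
```

With a white-noise far-end signal and a short simulated echo path, the residual energy drops well below the microphone energy once the filter converges.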
Neural-network-based echo cancellation algorithms can also be used to analyze and process the audio signal to be processed. They collect the Far-End and Near-End signals, extract the spectral features of each, and concatenate them as the input of the neural network, with the spectral features of the near end as the network's output. Mainstream network models, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), can be applied to eliminating noise signals such as echo.
To address the above problems, the present application provides a signal processing method. For a collected audio signal to be processed, the method first extracts the signal's spectral features, then calls a pre-trained noise optimization model to process the log energy spectrum feature, obtaining a noise correction coefficient corresponding to that feature. The log energy spectrum feature is then corrected by the noise correction coefficient, so that noise in the audio signal to be processed is reduced or even eliminated and the quality of voice interaction is improved.
Referring to fig. 1a, fig. 1a is a schematic diagram of a signal processing scene architecture according to an embodiment of the present invention. As shown in fig. 1a, the scene includes an attendee, a participant, and a terminal device 101. The attendee and the participant join a teleconference through their respective terminal devices; for example, the attendee uses terminal device 101. During the teleconference, terminal device 101 collects the attendee's sound waves and sends them to the participant, and it also plays the voice transmitted by the participant. In fig. 1a, the far-end sound wave is the sound emitted when terminal device 101 plays the participant's voice; when this sound wave meets a reflector (such as a wall), it is reflected and forms an echo sound wave. If the attendee speaks while the echo is present, the audio signal collected by terminal device 101 may include both the audio signal corresponding to the attendee's voice and the audio signal corresponding to the echo sound wave. Of course, the signal processing scenario on the participant's side may be the same.
There may be one or more terminal devices 101, and the form shown is only an example. Terminal device 101 may include, but is not limited to: smartphones (such as Android and iOS phones), tablet computers, portable personal computers, Mobile Internet Devices (MID), voice collection/playback devices, and other devices with voice playing and collecting functions. There may be one or more participants; this embodiment of the application imposes no limit.
Fig. 1b is a flowchart of signal processing according to an embodiment of the present disclosure. As shown in fig. 1b, the flow mainly comprises: terminal device 101 collects the attendee's voice sound wave together with a noise sound wave (for example, the echo sound wave mentioned above) and obtains the corresponding audio input signal, i.e. the audio signal to be processed; it then extracts the spectral feature of that signal, which may be an N-dimensional log energy spectrum feature; it calls a noise optimization model to process the N-dimensional log energy spectrum feature and obtain the corresponding M-dimensional noise correction coefficients. In one embodiment, the noise optimization model may be a neural network built on Long Short-Term Memory (LSTM) units, and the M-dimensional noise correction coefficients are estimated spectrum correction coefficients. Finally, the log energy spectrum feature of each dimension is combined with the noise correction coefficient of the corresponding dimension to obtain the processed audio signal, in which signals such as echo are weakened or eliminated; this signal can then be transmitted to one or more participants through the conference system.
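The shape of such a model can be sketched as follows: a toy, forward-only LSTM that maps a sequence of N-dimensional log-energy-spectrum frames to M-dimensional correction coefficients in [0, 1]. This is a minimal sketch under the assumption M == N; the random weights stand in for a trained model, and the class and parameter names are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MaskLSTM:
    """Toy LSTM: N-dim log-energy-spectrum frames in, N-dim coefficients out.

    A sigmoid output layer keeps every coefficient in (0, 1), matching the
    [0, 1] coefficient range described in the text.
    """
    def __init__(self, n_feat, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.standard_normal((4 * n_hidden, n_feat)) * 0.1
        self.Wh = rng.standard_normal((4 * n_hidden, n_hidden)) * 0.1
        self.b = np.zeros(4 * n_hidden)
        self.Wo = rng.standard_normal((n_feat, n_hidden)) * 0.1
        self.n_hidden = n_hidden

    def __call__(self, frames):
        h = np.zeros(self.n_hidden)
        c = np.zeros(self.n_hidden)
        masks = []
        for x in frames:                        # one N-dim frame per time step
            z = self.Wx @ x + self.Wh @ h + self.b
            i, f, g, o = np.split(z, 4)         # input, forget, cell, output gates
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            masks.append(sigmoid(self.Wo @ h))  # correction coefficients in (0, 1)
        return np.array(masks)
```

In practice the weights would be learned from mixed voice-plus-echo training data, as described in the model training section below.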
In one embodiment, when the model is constructed and trained, the N-dimensional log energy spectrum feature comprises the log energy spectrum of the n-dimensional human voice audio signal and the log energy spectrum of the p-dimensional echo audio signal, that is, N = n + p. The two may be arranged in sequence, for example with the first n dimensions defined as the log energy spectrum of the human voice audio signal and the last p dimensions as the log energy spectrum of the echo audio signal, or they may be interleaved. Correspondingly, the M-dimensional noise correction coefficients comprise noise correction coefficients for the log energy spectrum of the n-dimensional human voice audio signal and for the log energy spectrum of the p-dimensional echo audio signal, that is, M = n + p. The arrangement of these correction coefficients corresponds to the arrangement of the two log energy spectra within the N-dimensional log energy spectrum feature.
For example, suppose the j-th dimension of the noise correction coefficients is a correction coefficient corresponding to the human voice audio signal, and the i-th dimension corresponds to the noise signal; then the j-th dimension of the log energy spectrum feature likewise represents the human voice audio signal, and the i-th dimension represents the noise signal. If, among the M-dimensional noise correction coefficients obtained, the j-th dimension coefficient is 1 and the i-th dimension coefficient is 0.01, then computing the log energy spectrum feature with the correction coefficient of the corresponding dimension means: multiplying the value of the i-th dimension of the N-dimensional log energy spectrum feature by the i-th dimension coefficient 0.01 (and the j-th dimension by 1) to obtain the new log energy spectrum value. It can be understood that since the echo energy is greatly reduced after the echo's log energy spectrum is multiplied by the corresponding correction coefficient 0.01, while the human voice energy is unchanged after multiplication by the corresponding coefficient 1, the influence of the echo audio signal on the human voice audio signal is greatly reduced by this operation.
Referring to fig. 2, fig. 2 is a flowchart of a signal processing method according to an embodiment of the present disclosure. The method may be executed by an intelligent device, which may specifically be the terminal device 101 shown in fig. 1a, and the terminal device is installed with an application program designed based on a noise optimization model, and the method of the embodiment of the present invention includes the following steps.
S201: the method comprises the steps of collecting an audio signal to be processed, and extracting the spectral characteristics of the audio signal to be processed, wherein the spectral characteristics comprise N-dimensional logarithmic energy spectral characteristics. In the embodiment of the present invention, the log energy spectrum feature of the N-dimension can uniquely characterize the acquired audio signal to be processed, and the duration of the audio signal to be processed may be predetermined, for example, the duration of the audio signal to be processed is 10ms, so that each 10ms of the audio signal corresponds to the log energy spectrum feature of the N-dimension; for another example, if the duration of the audio signal to be processed is 100ms, then every 100ms of the audio signal will correspond to the N-dimensional logarithmic energy spectrum feature; of course, the duration of the audio signal to be processed may also be other values. For the voice which is continuously sent by the user and corresponds to the input audio signal, the audio signals to be processed with the time duration of 10ms (or 100ms) and the like can be obtained in sequence according to the time sequence. The audio signal to be processed is obtained by dividing the acquired time domain audio signal, and the audio signal to be processed mentioned in the embodiment of the present invention includes at least one of the following audio signals: in the embodiment of the present invention, the noise signal mainly refers to an echo signal, which is considered as a noise signal in the present application, and thus, the echo audio signal is used as a noise signal for explanation. The spectral features of the audio signal to be processed are frequency-domain spectral features, which include a Log Power Spectrum (LPS).
Fig. 3 is a flowchart of extracting frequency-domain spectral features from a time-domain audio signal according to an embodiment of the present disclosure. As shown in fig. 3, the sound corresponding to the time-domain audio signal is first divided into frames, with a sliding-window operation; for example, if time-domain audio signal 1 lasts 10 seconds and each audio signal to be processed lasts 100 ms, signal 1 is divided chronologically into 100 frames. A Fast Fourier Transform (FFT) is then applied to each frame to obtain the spectral energy distribution over the frequency bins (i.e. the frequency-domain discrete spectrum). Next, a squaring operation is applied to the frequency-domain discrete spectrum (for example, by feeding it into a spectrum-squaring operator). Finally, taking the logarithm of the squared result gives the log energy spectrum feature corresponding to the time-domain audio signal.
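The frame/FFT/square/log pipeline above can be sketched in a few lines of NumPy. The frame length, hop size, and window choice below are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def log_power_spectrum(signal, frame_len=512, hop=256, eps=1e-10):
    """Frame, window, FFT, square, log: one LPS vector per frame."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)        # frequency-domain discrete spectrum
        power = np.abs(spectrum) ** 2        # squaring operation per frequency bin
        frames.append(np.log(power + eps))   # logarithm -> log power spectrum
    return np.array(frames)                  # shape: (num_frames, frame_len // 2 + 1)
```

For one second of 16 kHz audio with these settings, the result is 61 frames of 257-dimensional LPS features.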
S202: and calling a noise optimization model to process the logarithmic energy spectrum characteristics to obtain M-dimensional noise correction coefficients corresponding to the N-dimensional logarithmic energy spectrum characteristics, wherein N and M are positive integers. The noise correction coefficient is used to reduce or eliminate a noise audio signal in the audio signal to be processed. In one embodiment, M and N may take values between 64 and 512 dimensions, such as 500 dimensions. For each acquired audio signal to be processed (a segment of speech between 10ms and 100ms), a 500-dimensional equal-dimension log energy spectrum feature is corresponding.
In one embodiment, M equals N; that is, after the noise optimization model processes the N-dimensional log energy spectrum feature, a noise correction coefficient is obtained for each dimension. Under the noise optimization model, the larger the echo energy contained in a given dimension of the log energy spectrum feature, the smaller the expected noise correction coefficient: the coefficient is inversely related to the noise energy, so that the echo is reduced or even removed as far as possible. Each noise correction coefficient lies in the range [0, 1]; coefficients corresponding to the human voice portion are 1 or close to 1, and coefficients corresponding to the echo portion are 0 or close to 0. For example, if in the 500-dimensional log energy spectrum 400 dimensions characterize the human voice audio signal and 100 dimensions characterize the echo audio signal, the correction coefficients of the 400 voice dimensions are 1 or close to 1, and those of the 100 echo dimensions are 0 or close to 0.
It can be understood that if, in the 500-dimensional log energy spectrum, the noise correction coefficients of the 400 dimensions characterizing the human voice audio signal are all smaller than an energy threshold, or are all 0, then the audio signal to be processed does not contain the attendee's human voice audio signal; similarly, if the coefficients of the 100 dimensions characterizing the echo audio signal are all smaller than the energy threshold, or are all 0, the audio signal to be processed does not contain an echo audio signal.
S203: and calculating the N-dimensional logarithmic energy spectrum characteristic and the M-dimensional noise correction coefficient to obtain the processed audio signal. In one embodiment, each dimension of the log-energy spectrum feature is multiplied by a corresponding noise correction coefficient (e.g., the ith dimension of the log-energy spectrum feature is multiplied by the ith dimension of the noise correction coefficient) to obtain a processed audio signal. By reducing the energy of the noise, the effect of eliminating or weakening the residual noise is achieved.
In this embodiment of the application, a noise optimization model trained and optimized in advance generates noise correction coefficients from the log energy spectrum features of collected audio signals to be processed, such as those produced during voice conferences and voice calls. These coefficients optimize and correct the signals, reducing or even eliminating the adverse effect of noise audio signals such as echo on the audio signals to be processed and improving the quality of voice interaction.
Referring to fig. 4a, fig. 4a is a flowchart of a model training method according to an embodiment of the present application. The method may be executed by an intelligent device, which may specifically be the terminal device 101 shown in fig. 1a, or a server that performs model training and optimization. The method includes the following steps.
S401: echo audio signals are collected in a target environment where the audio signals are played. The target environment refers to selected environments that generate echoes, and the target environment may be multiple, for example, an office, a conference room, and the like. Under these circumstances, an echo audio signal can be acquired to further derive audio training data to train the model.
In one embodiment, in order to obtain more suitable audio training data, different target environments may be selected when recording the echo audio signal, because the echo effect differs between environments: in some target environments the echo is relatively loud, and in others relatively quiet. The collected echo audio signals therefore cover different sound intensities, so that after they are mixed with clean human voice audio signals, mixed audio signals containing echoes of different intensities are obtained, and the model can be trained more comprehensively across multiple echo intensities.
In addition, in some embodiments, in these target environments the volume of the device playing the audio signal may also be adjusted, so that audio is played at different volume levels and richer echo audio signals are obtained. Echo audio signals can also be collected while different playback devices play in the same or different target environments. Audio training data can then be generated for different clients (for example, intelligent terminals such as mobile phones, and dedicated conference devices such as the octopus), and noise optimization models for the different clients are obtained through training.
S402: a human voice audio signal is acquired. In one embodiment, the human voice audio signal is generated by directly capturing the user's voice. In another embodiment, the human voice audio signal is obtained from a library of pre-stored voice audio signals.
S403: superpose the acquired human voice audio signal and the echo audio signal in the time domain to obtain a mixed audio signal, and generate audio training data from the mixed audio signal. In one embodiment, a plurality of human voice audio signals are each superposed with the collected echo audio signals to obtain audio training data, where each mixed audio signal contains an echo audio signal and a human voice audio signal. In another embodiment, some periods of a mixed audio signal may contain both an echo audio signal and a human voice audio signal, while other periods contain only the echo audio signal. That is, the audio training data may include X segments of mixed audio signals, each containing an echo audio signal, where the ith segment contains both a human voice audio signal and an echo audio signal and the jth segment contains only an echo audio signal, with i, j, X positive integers, i ≠ j, and i, j ≤ X. For example, the audio training data may include 100 segments of mixed audio signals, of which 50 contain only echo audio signals and 50 contain both echo audio signals and human voice audio signals.
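The time-domain superposition in S403 amounts to sample-wise addition; a minimal sketch assuming NumPy arrays of samples (the function name and toy sample values are illustrative):

```python
import numpy as np

def mix_signals(voice, echo):
    # Superpose a human voice signal and an echo signal in the time
    # domain to form one mixed training sample (S403).
    n = max(len(voice), len(echo))
    mixed = np.zeros(n)
    mixed[:len(voice)] += voice
    mixed[:len(echo)] += echo
    return mixed

# A segment may contain voice + echo, or echo only (silent voice part).
echo = np.array([0.1, -0.2, 0.05, 0.0])
voice = np.array([0.5, 0.4, -0.3, 0.2])
sample_with_voice = mix_signals(voice, echo)
sample_echo_only = mix_signals(np.zeros_like(echo), echo)
```

Repeating this over many voice/echo pairs yields the X segments of mixed audio signals described above.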
S404: train the initial model with the audio training data to obtain the noise optimization model. The noise optimization model is obtained by training the initial model on the audio training data with a loss function: the audio training data are calculated and converted into the corresponding logarithmic energy spectrum features, the features are normalized and used as the input of the initial model, a loss is computed from the model output based on the loss function, and the parameters of the initial model are adjusted according to the loss, finally yielding the noise optimization model.
In one embodiment, the loss function employed by the noise optimization model may be:
min E[(Y_mix:clean+echo(w) · H_model_coef(w) − X_clean(w))²]
where w denotes the dimension index; Y_mix:clean+echo(w) is the input of the noise optimization model, i.e. the N-dimensional logarithmic energy spectrum feature corresponding to a segment of mixed audio signal in the audio training data; H_model_coef(w) is the coefficient estimated by the noise optimization model, i.e. the noise correction coefficient mentioned earlier; Y_mix:clean+echo(w) · H_model_coef(w) is the first clean spectral feature; and X_clean(w) is the logarithmic energy spectrum feature corresponding to the human voice audio signal, i.e. the second clean spectral feature. Based on the value of the loss function, the parameters of the noise optimization model (such as convolution parameters) are optimized so that, for target audio training data (any one piece among all the audio training data), the model generates a noise correction coefficient such that the difference between the logarithmic energy spectrum feature of the target audio training data multiplied by the coefficient and the logarithmic energy spectrum feature of the corresponding human voice audio signal is minimized.
Further, H_model_coef(w) can be expressed as:

H_model_coef(w) = S_clean(w) / (S_clean(w) + S_echo(w))
The noise optimization model can be considered to be constructed based on the above expression, where S_clean(w) is the logarithmic spectral energy of the human voice audio signal corresponding to the second clean log spectrum feature; S_echo(w) is the logarithmic spectral energy of the echo audio signal; and S_clean(w) + S_echo(w) is the logarithmic spectral energy corresponding to the mixed audio signal. That is, the logarithmic spectral energy corresponding to the mixed audio signal can be taken as the sum of the logarithmic spectral energy of the human voice audio signal and that of the echo audio signal.
Further, S_clean(w) = log{|F[x_clean(t)]|²} and S_echo(w) = log{|F[x_echo(t)]|²}; that is, the logarithmic spectral energies of the human voice audio signal and of the echo audio signal are obtained by fast Fourier transform followed by taking the logarithm.
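Under the formulas above, the log spectral energies, the target coefficient, and the training loss can be computed as follows (a sketch on synthetic signals; the frame length, the small floor constant, and the variable names are our assumptions, not values from the application):

```python
import numpy as np

def log_spectral_energy(x):
    # S(w) = log(|F[x(t)]|^2): FFT, squared magnitude, then logarithm.
    power = np.abs(np.fft.rfft(x)) ** 2
    return np.log(power + 1e-12)  # small floor avoids log(0)

rng = np.random.default_rng(0)
voice = rng.standard_normal(256)          # stand-in for x_clean(t)
echo = 0.3 * rng.standard_normal(256)     # stand-in for x_echo(t)

s_clean = log_spectral_energy(voice)
s_echo = log_spectral_energy(echo)

# Target coefficient per the text: ratio of the clean log spectral
# energy to the mixed log spectral energy (taken as the sum of the two).
h_target = s_clean / (s_clean + s_echo)

# Training objective min E[(Y * H - X)^2], with Y the mixed log energy
# spectrum and X the clean one:
y_mix = log_spectral_energy(voice + echo)
loss = np.mean((y_mix * h_target - s_clean) ** 2)
```

The model is trained so that its output H_model_coef(w) drives this mean squared error down.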
S405: test the noise optimization model obtained through optimization training. The performance of the noise optimization model is determined by testing; this testing step is optional.
S401 to S404 describe the training process of the noise optimization model of the present application; refer also to fig. 5, which is a schematic diagram of noise optimization model training provided in an embodiment of the present application. As shown in fig. 5, an echo is first collected and a corresponding Echo Signal is generated; then different human voice audio signals (such as pre-recorded voices or the currently collected voice of a user) are each superposed with echo audio signals in the time domain to obtain a plurality of different mixed audio signals, which serve as the audio training data of the initial model; finally, the initial model is trained and optimized with the audio training data, using the loss function described above, to obtain the noise optimization model.
With continued reference to fig. 5, after training and optimization of the noise optimization model are completed, the testing procedure is entered. Audio test data are collected, calculated, and converted into the corresponding logarithmic energy spectrum features; these features are input into the noise optimization model to obtain the noise correction coefficients it outputs, which are multiplied by the logarithmic energy spectrum features of the audio test data to obtain a test result. If the test result contains no crackling noise when played back, and/or the value of the loss function of the noise optimization model is smaller than the noise cancellation threshold, the model is judged to have passed the test, and it is deployed to a client, such as a conference application, in order to perform the embodiment described below with respect to fig. 6. Correspondingly, if the test result contains crackling noise when played back, and/or the value of the loss function is larger than the noise cancellation threshold, the model is judged to have failed the test, and training continues by the method of S401 to S404 until the test passes.
Referring to fig. 4b, fig. 4b is a flowchart of another model training method according to an embodiment of the present application. The method may be executed by an intelligent device, which may specifically be the terminal device 101 shown in fig. 1a, or a server that performs model training and optimization. The method includes the following steps.
S411: perform audio recording operations in a plurality of target environments in which audio signals are played, to obtain multiple segments of noise audio information, each segment comprising a noise audio signal and recording device information. A target environment is a selected environment that generates echoes; there may be multiple target environments, for example an office, a conference room, and the like. In such environments, echo audio signals can be collected and used to derive audio training data for training the model.
In an embodiment, different audio recording devices (such as an intelligent terminal, a multi-party conference phone, or a dedicated conference device such as the octopus) may further be used to collect the echo generated while audio signals are played in different target environments at different volume levels (for example, setting the playing volume of the audio playback device to levels 1 through 10), thereby obtaining multiple segments of echo audio information, each segment comprising the echo audio signal and the recording device information. For example: the 1st segment of echo audio information is obtained by using a microphone to collect the echo while a voice playback device in conference room 1 plays voice at volume level 2; the 2nd segment is obtained by using a multi-party conference phone in conference room 1 to collect the echo while a voice playback device plays voice at volume level 2; the 3rd segment is obtained by using a microphone in conference room 2 to collect the echo while a voice playback device plays voice at volume level 2; and the 4th segment is obtained by using an intelligent terminal such as a mobile phone in conference room 1 to collect the echo while a voice playback device plays voice at volume level 5.
S412: generate, from the multiple segments of noise audio information, audio training data corresponding to each piece of recording device information, where the audio training data comprise Y segments of noise audio signals and Y is a positive integer. In S412, audio training data corresponding to different recording devices may be generated separately from the multiple segments of noise audio information. For example, audio training data 1 corresponding to an intelligent terminal is generated from the echo audio signals recorded by the intelligent terminal, and audio training data 2 corresponding to a multi-party conference phone is generated from the echo audio signals recorded by the multi-party conference phone (e.g., a dedicated conference device such as the octopus).
S413: train the initial model with the audio training data to obtain the noise optimization model. Collecting echo audio signals through different recording devices yields audio training data per device, and thus a more targeted noise optimization model. For example, when the recording device in a target environment is an intelligent terminal such as a mobile phone, a noise optimization model for the intelligent terminal can be trained on audio training data generated from the echo audio signals of different sound intensities collected by that terminal; when the recording device is a dedicated conference device such as the octopus, a noise optimization model corresponding to that dedicated conference device can be trained on the corresponding audio training data.
It can be understood that the recording device used during model training corresponds to the voice input device used by the client when the model is in use. A mapping table may therefore be established that records the relationship between a device type identifier (the type identifier of the recording device or voice input device) and the identifier of the corresponding noise optimization model.
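Such a device-type mapping table might be sketched as a simple dictionary lookup (all identifiers here are hypothetical placeholders, not values from the application):

```python
# Hypothetical mapping between device type identifiers and noise
# optimization model identifiers.
DEVICE_MODEL_MAP = {
    "smart_terminal": "model_smart_terminal",
    "conference_device": "model_conference_device",
}

def select_model_id(device_type, default="model_default"):
    # Look up the noise optimization model for a recording / voice
    # input device type; fall back to a default when unknown.
    return DEVICE_MODEL_MAP.get(device_type, default)
```

At inference time the client would resolve its voice input device type and load the model the table points to.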
S414: test the noise optimization model obtained through optimization training. The performance of the noise optimization model is determined by testing; this testing step is optional. The performance of each noise optimization model may be tested separately for the different device types. For details of S413 and S414, reference may also be made to the descriptions of S404 and S405 corresponding to fig. 4a, which are not repeated here.
Because different noise optimization models correspond to different device types, during subsequent use of the models the type of the sound-collecting device can be consulted to select the corresponding noise optimization model for the optimization processing that weakens or even removes noise such as echo.
According to the embodiment of the application, the echo is treated as a specific noise (i.e., one kind of noise) to be eliminated, and the signals of the other participants do not need to be collected during training, which effectively reduces the size of the model and thereby improves computational efficiency.
Referring to fig. 6, fig. 6 is a flowchart of another signal processing method according to an embodiment of the present application. The method may be executed by an intelligent device, which may specifically be the terminal device 101 shown in fig. 1a; the terminal device has a conference application installed, and the conference application is deployed with a noise optimization model that has passed the test.
S601: collect an audio signal to be processed, and extract the spectral features of the audio signal to be processed, the spectral features comprising an N-dimensional logarithmic energy spectrum feature. When a terminal user opens the conference application to join an internet conference, whether a video conference or a voice-only conference, the microphone of the terminal device can be invoked to collect the audio signal to be processed, which at this point is a conference audio signal. The conference audio signal may include a human voice audio signal, an echo audio signal, or both. In particular, the conference audio signal is captured upon detecting entry into the conference session interface shown in fig. 7 (i.e., during multi-end conference communication).
In one embodiment, before extracting the spectral features of the conference audio signal, the terminal device may determine the space type of its current location, so as to decide according to that type whether to invoke the noise optimization model deployed in the conference application to optimize away echo and the like. If the space type of the current location belongs to a first type (in this application, an open environment in which echo is generally small or even absent), the collected conference audio signal is encoded and the encoded signal is sent to the participants; that is, no echo cancellation processing is needed for conference audio signals in first-type spaces. If the space type belongs to a second type (in this application, mainly indoor environments; for example, when positioning shows the user is inside a building, the space type is considered to belong to the second type), the step of extracting the spectral features of the conference audio signal is performed, followed by the subsequent echo cancellation steps. By judging the space type of the current location in this way, echo cancellation processing can be avoided for audio signals whose echo is below a threshold, reducing the waste of memory resources and improving the efficiency of signal processing.
S602: normalize the extracted spectral features. In one embodiment, the extracted spectral features are mapped into the value interval [0, 1] by normalization.
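One plausible reading of this step is min-max normalization; the application does not specify the exact scheme, so this is an assumption:

```python
import numpy as np

def normalize_features(features):
    # Map spectral features into the [0, 1] interval (min-max
    # normalization; one plausible reading of S602).
    lo, hi = features.min(), features.max()
    if hi == lo:                      # constant feature vector
        return np.zeros_like(features)
    return (features - lo) / (hi - lo)

x = np.array([2.0, 4.0, 6.0])
x_norm = normalize_features(x)
```

The same normalization would be applied to the training features so that model inputs share one value range.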
S603: invoke the noise optimization model to process the logarithmic energy spectrum feature to obtain an M-dimensional noise correction coefficient corresponding to the N-dimensional logarithmic energy spectrum feature, where N and M are positive integers. Specifically, the input is the logarithmic energy spectrum feature of a speech segment of 10 ms to 100 ms, with for example 64 or 500 dimensions, and the output is a coefficient vector of the same dimensionality as the energy spectrum feature (64 to 512 dimensions).
In an embodiment, when at least two noise optimization models are recorded in a stored noise optimization model set, a noise optimization model associated with the location attribute of the current conference environment can be selected from the set according to that attribute, and the associated model is invoked to process the logarithmic energy spectrum feature of the conference audio signal currently used as input. The location attribute refers to a location identifier: for example, location A in the YY building of the XX zone corresponds to identifier 567678, and location B in the XY building of the ZZ zone corresponds to identifier 877454.
Further, a relationship mapping table of the position attribute and the noise optimization model is established, and table 1 is an exemplary relationship mapping table provided in the embodiment of the present application:
TABLE 1
Address               Location identifier     Noise optimization model identifier
XX zone YY mansion    567678                  MX-68465
ZZ zone XY mansion    877454                  MX-68968
In table 1, the address, the location identifier, and the noise optimization model identifier all serve as indexes, and each address, location identifier, and model identifier are associated with one another. When the terminal device, via its positioning function, locates the current conference environment at one of the addresses in table 1, it invokes the associated noise optimization model to process the logarithmic energy spectrum feature of the conference audio signal to be processed and obtain its noise correction coefficient. Optionally, the terminal device adds the logarithmic energy spectrum feature of the conference audio signal to the audio training data set and re-optimizes the noise optimization model corresponding to the current conference environment on the updated audio training data; for the specific optimization manner, refer to steps S404 and S405 in fig. 4a, which are not repeated here.
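The lookup against Table 1 could be sketched as follows (the rows reuse the example values from Table 1; the data structure and function name are ours):

```python
# Relationship mapping table mirroring Table 1.
LOCATION_MODEL_TABLE = [
    {"address": "XX zone YY mansion", "location_id": "567678", "model_id": "MX-68465"},
    {"address": "ZZ zone XY mansion", "location_id": "877454", "model_id": "MX-68968"},
]

def find_model_by_location(location_id):
    # Return the associated noise optimization model identifier,
    # or None when the current location is not in the table.
    for row in LOCATION_MODEL_TABLE:
        if row["location_id"] == location_id:
            return row["model_id"]
    return None
```

A `None` result corresponds to the fallback path described next, where an environment-specific model is trained and added to the table.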
Correspondingly, if the terminal device does not detect through its positioning function that its location in the current conference environment matches an address in table 1, it collects an environment audio signal in the current conference environment; for example, when no conference participant is detected speaking, the ambient audio signal (i.e., the echo of the current conference environment) is collected. The environment audio signal is superposed with a human voice audio signal to obtain new audio training data (i.e., audio training data for the current conference environment), and the noise optimization model is optimized with these data to obtain an optimized noise optimization model (i.e., one for the current conference environment); for the specific implementation of this optimization training, refer to the model training process of steps S401 to S405, which is not repeated here. The location attribute of the current conference environment and the optimized noise optimization model are then stored in the relationship mapping table in association with each other.
As described in the foregoing embodiment, different noise optimization models may correspond to different types of recording devices, that is, to the voice input device of the corresponding client. Thus, in an embodiment, when a user joins a conference through the conference application, the type of the voice input device currently collecting the user's voice may also be determined: if it is an intelligent terminal such as a mobile phone, the noise optimization model invoked in S603 is the optimized model corresponding to intelligent terminals; if it is a dedicated conference device type such as the octopus, the model invoked in S603 is the optimized model corresponding to that dedicated conference device type.
In one embodiment, an intelligent terminal such as a mobile phone may connect external voice input devices through wireless connections such as Bluetooth. The device currently wirelessly connected to the intelligent terminal can therefore be determined, and if it is a voice input device (such as a microphone or a dedicated conference device), the noise optimization model is also selected based on the type of that external device. If no voice input device type corresponding to the current conference application is detected, a default or randomly selected noise optimization model performs the corresponding processing. The mapping between device type identifiers and noise optimization models can be established as a mapping table, through which the corresponding model is found and invoked.
S604: and calculating the N-dimensional logarithmic energy spectrum characteristic and the M-dimensional noise correction coefficient to obtain the processed audio signal.
In one embodiment, after obtaining the processed audio signal, the embodiment of the present invention may further perform the following steps.
S605: and performing inverse logarithmic transformation on the processed audio signal to obtain a conference audio signal.
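One plausible sketch of the inverse transformation in S605, assuming the processed feature is a per-frame log power spectrum and the original phase is reused for reconstruction (the application does not detail this step):

```python
import numpy as np

def reconstruct_frame(processed_log_energy, phase):
    # Invert the logarithm: exp() recovers the power spectrum, the
    # square root gives magnitudes, and the original phase is reused
    # for the inverse FFT back to the time domain.
    magnitude = np.sqrt(np.exp(processed_log_energy))
    spectrum = magnitude * np.exp(1j * phase)
    return np.fft.irfft(spectrum)

# Round trip on one frame with no correction applied (coefficients of 1):
frame = np.array([0.5, -0.2, 0.3, 0.1, -0.4, 0.2, 0.0, 0.1])
spec = np.fft.rfft(frame)
log_energy = np.log(np.abs(spec) ** 2 + 1e-12)
restored = reconstruct_frame(log_energy, np.angle(spec))
```

With all-ones coefficients the round trip returns the original frame, which is a quick sanity check for the transform pair.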
S606: and coding the conference audio signal, and sending the coded conference audio signal to terminal equipment logged in by each conference account on a conference session interface.
In the embodiment of the application, the noise optimization model is optimized a second time for different multi-end communication environments, so that the optimized noise optimization model is targeted, further improving its echo cancellation effect for the current conference environment. In addition, storing the environment attributes of different environments in association with their corresponding noise optimization models allows the corresponding model to be invoked quickly the next time the user holds a teleconference, further improving user experience.
While the method of the embodiments of the present application has been described in detail above, to facilitate better implementation of the above-described aspects of the embodiments of the present application, the apparatus of the embodiments of the present application is provided below accordingly.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a signal processing apparatus according to an embodiment of the present application, where the apparatus may be mounted on an intelligent device in the foregoing method embodiment, and the intelligent device may specifically be the terminal device 101 shown in fig. 1a, and the terminal device is installed with an application program designed based on a noise optimization model. The signal processing means shown in fig. 8 may be used to perform some or all of the functions in the method embodiments described above with reference to fig. 2, 4a, 4b and 6. Wherein, the detailed description of each unit is as follows:
the acquisition unit 801 is configured to acquire an audio signal to be processed and extract a spectral feature of the audio signal to be processed, where the spectral feature includes an N-dimensional logarithmic energy spectral feature;
a processing unit 802, configured to invoke a noise optimization model to process the logarithmic energy spectrum feature, so as to obtain an M-dimensional noise correction coefficient corresponding to the N-dimensional logarithmic energy spectrum feature, where N and M are positive integers; calculating the N-dimensional logarithmic energy spectrum feature and the M-dimensional noise correction coefficient to obtain a processed audio signal;
the noise optimization model is obtained by training according to audio training data including noise audio signals, and the M-dimensional noise correction coefficient output by the noise optimization model includes: and a p-dimensional coefficient for modifying a characteristic of the input log energy spectrum characteristic with respect to the noise audio signal, p being smaller than M.
In one embodiment, the processing unit 802 is further configured to:
collecting a noise audio signal in a target environment in which the audio signal is played;
acquiring a human voice audio signal;
superposing the acquired human voice audio signal and the acquired noise audio signal on a time domain to obtain a mixed audio signal, and generating audio training data according to the mixed audio signal;
the audio training data comprises X sections of mixed audio signals, the ith section of mixed audio signals comprises human voice audio signals and noise audio signals, wherein i and X are positive integers, and i is smaller than or equal to X.
In one embodiment, the processing unit 802 is further configured to:
performing audio recording operation in a plurality of target environments in which audio signals are played to obtain a plurality of sections of noise audio information, wherein each section of noise audio information comprises a noise audio signal and recording equipment information;
generating audio training data corresponding to the recording equipment information according to the multiple sections of noise audio information;
wherein the audio training data comprises Y segments of noise audio signals, wherein Y is a positive integer.
In one embodiment, the noise optimization model is obtained by optimizing an initial model based on a loss function constructed from the mean square error between a first clean log spectrum feature and a second clean log spectrum feature; the first clean log spectrum feature is obtained by multiplying the log spectrum feature of a mixed audio signal in the audio training data by the training noise correction coefficient that the initial model outputs for that mixed audio signal, and the second clean log spectrum feature is obtained from the human voice audio signal.
In one embodiment, the training noise correction coefficient output by the constructed initial model is used for embodying a ratio of the log spectrum energy of the human voice audio signal corresponding to the second clean log spectrum characteristic to the log spectrum energy corresponding to the mixed audio signal; wherein, the logarithmic spectrum energy corresponding to the mixed audio signal is: the sum of the logarithmic spectral energy of the noise audio in the mixed audio signal and the logarithmic spectral energy of the human voice audio signal in the mixed audio signal.
In an embodiment, the audio signal to be processed is acquired when entry into a conference session interface is detected, and the processed audio signal is the signal obtained by multiplying the N-dimensional logarithmic energy spectrum feature by the M-dimensional noise correction coefficient; the processing unit 802 is further configured to:
carrying out inverse logarithmic transformation on the processed audio signal to obtain a conference audio signal;
and coding the conference audio signal, and sending the coded conference audio signal to each corresponding conference account on the conference session interface.
In one embodiment, the processing unit 802 is further configured to:
when no sound signal from a participant account is detected, collecting an environment audio signal;
taking the environment audio signal as new audio training data to carry out optimization training on the noise optimization model to obtain an optimized noise optimization model;
and recording the optimized noise optimization model so as to process the acquired logarithmic energy spectrum characteristics corresponding to the audio signal to be processed according to the optimized noise optimization model.
In one embodiment, at least two noise optimization models are recorded in a stored set of noise optimization models; the processing unit 802 is specifically configured to: calling a noise optimization model to process the logarithmic energy spectrum characteristics;
selecting a noise optimization model from the noise optimization model set according to the position attribute of the position of the current conference environment;
and calling the selected noise optimization model to process the logarithmic energy spectrum characteristics.
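The selection step above can be sketched as a lookup keyed by the location attribute. The attribute names, the string stand-ins for trained models, and the fallback key are all assumptions made for illustration.

```python
# Stored set of noise optimization models, keyed by the location
# attribute of the conference environment (keys are illustrative).
NOISE_MODEL_SET = {
    "office": "model_office",     # stand-ins for trained models
    "cafe": "model_cafe",
    "outdoor": "model_outdoor",
}

def select_noise_model(location_attribute, default="office"):
    """Select the noise optimization model registered for the location
    attribute of the current conference environment, falling back to a
    default model when the attribute is not recorded in the set."""
    return NOISE_MODEL_SET.get(location_attribute, NOISE_MODEL_SET[default])
```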
In one embodiment, before extracting the spectral feature of the audio signal to be processed, the processing unit 802 is further configured to:
determining the space type of the current location;
if the space type is the first type, encoding the collected audio signal to be processed to obtain an encoded audio signal;
and if the space type is a second type, triggering and executing the step of extracting the spectral feature of the audio signal to be processed.
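The branch above can be sketched as follows. The space-type labels and the two helper functions are hypothetical stand-ins; the patent only specifies that a first-type space is encoded directly while a second-type space proceeds to spectral feature extraction.

```python
def encode(audio):
    # Stand-in for the audio encoder.
    return b"encoded:" + bytes(str(len(audio)), "ascii")

def extract_spectral_features(audio):
    # Stand-in for the logarithmic energy spectrum extraction step.
    return [x * x for x in audio]

def handle_audio(audio, space_type):
    """Route the collected audio signal to be processed by space type,
    as in the embodiment above: a first-type space encodes the signal
    directly; a second-type space triggers spectral feature
    extraction (the denoising path)."""
    if space_type == "first":
        return "encode", encode(audio)
    return "extract_features", extract_spectral_features(audio)
```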
According to an embodiment of the present application, some steps involved in the signal processing methods shown in fig. 2, 4a, 4b and 6 may be performed by corresponding units in the signal processing apparatus shown in fig. 8. For example, step S201 shown in fig. 2 may be performed by the acquisition unit 801 shown in fig. 8, and steps S202 and S203 may be performed by the processing unit 802 shown in fig. 8. Steps S401 and S402 shown in fig. 4a may be performed by the acquisition unit 801, and steps S403 to S405 by the processing unit 802. Step S411 shown in fig. 4b may be performed by the acquisition unit 801, and steps S412 to S414 by the processing unit 802. Step S601 shown in fig. 6 may be performed by the acquisition unit 801, and steps S602 to S606 by the processing unit 802. The units of the signal processing apparatus shown in fig. 8 may be combined, separately or entirely, into one or several other units, or one or more of these units may be further split into multiple functionally smaller units; either way the same operations can be achieved without affecting the technical effects of the embodiments of the present application. The above units are divided on the basis of logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the signal processing apparatus may likewise include other units, and in practical applications these functions may also be realized with the assistance of other units and through the cooperation of multiple units.
According to another embodiment of the present application, the signal processing apparatus shown in fig. 8 may be constructed, and the signal processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 2, 4a, 4b and 6 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device via the computer-readable recording medium.
Based on the same inventive concept, the principle by which the signal processing apparatus provided in the embodiments of the present application solves the problem, and its advantageous effects, are similar to those of the signal processing method in the method embodiments of the present application; for brevity, reference may be made to the description of the method implementation, and details are not repeated here.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an intelligent device according to an embodiment of the present disclosure. The intelligent device includes at least a processor 901, a communication interface 902, and a memory 903, which may be connected by a bus or in other ways. The processor 901 (or central processing unit (CPU)) is the computing and control core of the terminal; it can parse various instructions within the terminal and process the terminal's various data. For example, the CPU can parse a power-on/off instruction sent to the terminal by a user and control the terminal to perform the power-on/off operation; for another example, the CPU can transfer various types of interactive data between the internal structures of the terminal. The communication interface 902 may optionally include a standard wired interface or a wireless interface (e.g., WI-FI or a mobile communication interface), and may transmit and receive data under the control of the processor 901; the communication interface 902 may also be used for internal data transmission and interaction within the terminal. The memory 903 is a storage device in the terminal for storing programs and data. It is understood that the memory 903 here may include both the terminal's built-in memory and, of course, any expansion memory the terminal supports. The memory 903 provides storage space that stores the operating system of the terminal, which may include, but is not limited to, an Android system, an iOS system, a Windows Phone system, and the like; this is not limited in the present application.
In the embodiment of the present application, the processor 901 executes the executable program code in the memory 903 to perform the following operations:
acquiring an audio signal to be processed through a communication interface 902, and extracting a spectral feature of the audio signal to be processed, wherein the spectral feature comprises an N-dimensional logarithmic energy spectral feature;
calling a noise optimization model to process the logarithmic energy spectrum feature to obtain an M-dimensional noise correction coefficient corresponding to the N-dimensional logarithmic energy spectrum feature, wherein N and M are positive integers;
calculating the N-dimensional logarithmic energy spectrum feature and the M-dimensional noise correction coefficient to obtain a processed audio signal;
the noise optimization model is obtained by training according to audio training data including noise audio signals, and the M-dimensional noise correction coefficient output by the noise optimization model includes: p-dimensional coefficients for modifying, among the input N-dimensional logarithmic energy spectrum features, the features associated with the noise audio signal, p being smaller than M.
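The three operations above (feature extraction, model call, element-wise correction) can be sketched as follows. The FFT size, the natural-log base, and the toy model are assumptions; in the patent the model is a trained network whose output dimension M matches or covers the feature dimension N.

```python
import numpy as np

def log_energy_spectrum(frame, n_fft=512):
    """N-dimensional logarithmic energy spectrum feature of one frame
    (here N = n_fft // 2 + 1; the exact layout is an assumption)."""
    spectrum = np.fft.rfft(frame, n=n_fft)
    return np.log(np.abs(spectrum) ** 2 + 1e-12)

def apply_noise_correction(log_feature, noise_model):
    """Call the noise optimization model to obtain the M-dimensional
    noise correction coefficients, then multiply them element-wise
    with the log energy spectrum feature to obtain the processed
    feature."""
    coeffs = noise_model(log_feature)  # M-dimensional coefficient vector
    return log_feature * coeffs

def toy_model(feat):
    # Stand-in for a trained model: attenuate every bin by 10%.
    return np.full_like(feat, 0.9)

frame = np.ones(512)
processed = apply_noise_correction(log_energy_spectrum(frame), toy_model)
```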
As an alternative embodiment, the processor 901 further performs the following operations:
collecting a noise audio signal in a target environment in which the audio signal is played;
acquiring a human voice audio signal;
superposing the acquired human voice audio signal and the acquired noise audio signal on a time domain to obtain a mixed audio signal, and generating audio training data according to the mixed audio signal;
the audio training data comprises X sections of mixed audio signals, the ith section of mixed audio signals comprises human voice audio signals and noise audio signals, wherein i and X are positive integers, and i is smaller than or equal to X.
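The superposition step above can be sketched as follows. The optional SNR scaling, the segment lengths, and the function name are illustrative assumptions; the embodiment only specifies that the voice and noise signals are superposed in the time domain.

```python
import numpy as np

def make_mixed_segment(voice, noise, snr_db=None):
    """Superpose a human voice segment and a noise segment in the time
    domain to form one mixed training segment. When snr_db is given,
    the noise is scaled so the mixture reaches that SNR (an added
    convenience, not stated in the patent)."""
    n = min(len(voice), len(noise))
    voice, noise = voice[:n], noise[:n]
    if snr_db is not None:
        pv = np.mean(voice ** 2) + 1e-12   # voice power
        pn = np.mean(noise ** 2) + 1e-12   # noise power
        noise = noise * np.sqrt(pv / (pn * 10 ** (snr_db / 10)))
    return voice + noise

# X segments of audio training data, each pairing a voice and a noise take.
rng = np.random.default_rng(0)
training_data = [
    make_mixed_segment(rng.standard_normal(1600), rng.standard_normal(1600))
    for _ in range(4)
]
```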
As an alternative embodiment, the processor 901 further performs the following operations:
performing audio recording operation in a plurality of target environments in which audio signals are played to obtain a plurality of sections of noise audio information, wherein each section of noise audio information comprises a noise audio signal and recording equipment information;
generating audio training data corresponding to the recording equipment information according to the multiple sections of noise audio information;
wherein the audio training data comprises Y segments of noise audio signals, wherein Y is a positive integer.
As an optional embodiment, the noise optimization model is obtained by optimizing an initial model based on a loss function constructed from the mean square error between a first clean log spectrum feature and a second clean log spectrum feature; the first clean log spectrum feature is obtained by multiplying the mixed audio signal in the audio training data by the training noise correction coefficient output after that mixed audio signal is processed through the initial model, and the second clean log spectrum feature is obtained according to the human voice audio signal.
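The loss described above can be sketched as follows, operating on feature vectors. Treating the multiplication as an element-wise product of log-domain features is an assumption consistent with the processed-signal definition elsewhere in the document; the function name is illustrative.

```python
import numpy as np

def training_loss(mixed_log_feature, coeffs, clean_log_feature):
    """Mean squared error between the first clean log spectrum feature
    (the mixed feature multiplied element-wise by the model's training
    noise correction coefficients) and the second clean log spectrum
    feature derived from the clean human voice signal."""
    first_clean = mixed_log_feature * coeffs
    return np.mean((first_clean - clean_log_feature) ** 2)
```

During training, this scalar would be minimized with respect to the model parameters that produce `coeffs`; a perfect coefficient vector drives the loss to zero.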
As an optional embodiment, the training noise correction coefficient output by the constructed initial model represents the ratio of the log spectrum energy of the human voice audio signal corresponding to the second clean log spectrum feature to the log spectrum energy corresponding to the mixed audio signal, where the log spectrum energy corresponding to the mixed audio signal is the sum of the log spectrum energy of the noise audio in the mixed audio signal and the log spectrum energy of the human voice audio signal in the mixed audio signal.
As an optional embodiment, the audio signal to be processed is collected when entry into the conference session interface is detected, and the processed audio signal is the signal obtained by multiplying the N-dimensional logarithmic energy spectrum feature by the M-dimensional noise correction coefficient. The processor 901 further performs the following operations:
performing inverse logarithmic transformation on the processed audio signal to obtain a conference audio signal;
and encoding the conference audio signal, and sending the encoded conference audio signal to each corresponding conference account on the conference session interface.
As an alternative embodiment, the processor 901 further performs the following operations:
when a sound signal from a participant account is detected, collecting an environmental audio signal;
taking the environmental audio signal as new audio training data to perform optimization training on the noise optimization model, obtaining an optimized noise optimization model;
and recording the optimized noise optimization model, so that the logarithmic energy spectrum feature corresponding to a subsequently collected audio signal to be processed is processed according to the optimized noise optimization model.
As an alternative embodiment, at least two noise optimization models are recorded in a stored noise optimization model set. When invoking the noise optimization model to process the logarithmic energy spectrum feature, the processor 901 specifically performs the following operations:
selecting a noise optimization model from the noise optimization model set according to the position attribute of the position of the current conference environment;
and calling the selected noise optimization model to process the logarithmic energy spectrum characteristics.
As an alternative embodiment, before extracting the spectral feature of the audio signal to be processed, the processor 901 further performs the following operations:
determining the space type of the current location;
if the space type is the first type, encoding the collected audio signal to be processed to obtain an encoded audio signal;
and if the space type is a second type, triggering and executing the step of extracting the spectral feature of the audio signal to be processed.
Based on the same inventive concept, the principle and the beneficial effect of solving the problem of the intelligent device provided in the embodiment of the present application are similar to the principle and the beneficial effect of solving the problem of the signal processing method in the embodiment of the present application, and for brevity, the principle and the beneficial effect of the implementation of the method can be referred to, and are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where one or more instructions are stored in the computer-readable storage medium, and the one or more instructions are adapted to be loaded by a processor and to execute the signal processing method according to the foregoing method embodiment.
The embodiments of the present application also provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium, and a processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the methods mentioned in the foregoing embodiments.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations. However, those skilled in the art should understand that the present application is not limited by the described order of actions, since some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A method of signal processing, the method comprising:
acquiring an audio signal to be processed, and extracting the spectral feature of the audio signal to be processed, wherein the spectral feature comprises an N-dimensional logarithmic energy spectral feature;
calling a noise optimization model to process the N-dimensional logarithmic energy spectrum feature to obtain an M-dimensional noise correction coefficient corresponding to the N-dimensional logarithmic energy spectrum feature, wherein N and M are positive integers;
calculating the N-dimensional logarithmic energy spectrum feature and the M-dimensional noise correction coefficient to obtain a processed audio signal;
the noise optimization model is obtained by training according to audio training data including noise audio signals, and the M-dimensional noise correction coefficient output by the noise optimization model includes: and a p-dimensional coefficient for modifying a feature with respect to the noise audio signal among the input N-dimensional logarithmic energy spectrum features, p being smaller than M.
2. The method of claim 1, wherein the method further comprises:
collecting a noise audio signal in a target environment in which the audio signal is played;
acquiring a human voice audio signal;
superposing the acquired human voice audio signal and the acquired noise audio signal on a time domain to obtain a mixed audio signal, and generating audio training data according to the mixed audio signal;
the audio training data comprises X sections of mixed audio signals, the ith section of mixed audio signals comprises human voice audio signals and noise audio signals, wherein i and X are positive integers, and i is smaller than or equal to X.
3. The method of claim 1, wherein the method further comprises:
performing audio recording operation in a plurality of target environments in which audio signals are played to obtain a plurality of sections of noise audio information, wherein each section of noise audio information comprises a noise audio signal and recording equipment information;
generating audio training data corresponding to the information of each recording device according to the multiple sections of noise audio information;
wherein the audio training data comprises Y segments of noise audio signals, wherein Y is a positive integer.
4. The method of claim 2,
the noise optimization model is obtained by optimizing an initial model by a loss function constructed on the basis of mean square errors of a first clean log spectrum characteristic and a second clean log spectrum characteristic;
the first clean log spectrum feature is obtained by multiplying the mixed audio signal in the audio training data by a training noise correction coefficient output after the mixed audio signal in the audio training data is processed through the initial model, and the second clean log spectrum feature is obtained according to the human voice audio signal.
5. The method of claim 4,
the training noise correction coefficient output by the constructed initial model is used for reflecting the ratio of the log spectrum energy of the human voice audio signal corresponding to the second clean log spectrum characteristic to the log spectrum energy corresponding to the mixed audio signal;
wherein, the logarithmic spectrum energy corresponding to the mixed audio signal is: the sum of the logarithmic spectral energy of the noise audio in the mixed audio signal and the logarithmic spectral energy of the human voice audio signal in the mixed audio signal.
6. The method of claim 1, wherein the audio signal to be processed is collected when an entry into a conference session interface is detected, and the audio signal after processing is a signal obtained by multiplying the N-dimensional logarithmic energy spectrum feature by the M-dimensional noise correction coefficient, the method further comprising:
carrying out inverse logarithmic transformation on the processed audio signal to obtain a conference audio signal;
and coding the conference audio signal, and sending the coded conference audio signal to each corresponding conference account on the conference session interface.
7. The method of claim 6, wherein the method further comprises:
when sound signals from the participant account are detected, collecting environmental audio signals;
taking the environment audio signal as new audio training data to carry out optimization training on the noise optimization model to obtain an optimized noise optimization model;
and recording the optimized noise optimization model so as to process the acquired logarithmic energy spectrum characteristics corresponding to the audio signal to be processed according to the optimized noise optimization model.
8. The method of claim 7, wherein at least two noise optimization models are recorded in a stored set of noise optimization models; the step of calling a noise optimization model to process the logarithmic energy spectrum features comprises the following steps:
selecting a noise optimization model from the noise optimization model set according to the position attribute of the position of the current conference environment;
and calling the selected noise optimization model to process the logarithmic energy spectrum characteristics.
9. The method of claim 1, wherein prior to extracting the spectral features of the audio signal to be processed, the method further comprises:
judging the space type of the current position;
if the space type is the first type, encoding the collected audio signal to be processed to obtain an encoded audio signal;
and if the space type is a second type, triggering and executing the step of extracting the spectral feature of the audio signal to be processed.
10. A signal processing apparatus, characterized by comprising:
the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring an audio signal to be processed and extracting the spectral feature of the audio signal to be processed, and the spectral feature comprises an N-dimensional logarithmic energy spectrum feature;
the processing unit is used for calling a noise optimization model to process the logarithmic energy spectrum feature to obtain an M-dimensional noise correction coefficient corresponding to the N-dimensional logarithmic energy spectrum feature, wherein N and M are positive integers; calculating the N-dimensional logarithmic energy spectrum feature and the M-dimensional noise correction coefficient to obtain a processed audio signal;
the noise optimization model is obtained by training according to audio training data including noise audio signals, and the M-dimensional noise correction coefficient output by the noise optimization model includes: p-dimensional coefficients for modifying, among the input logarithmic energy spectrum features, the features associated with the noise audio signal, p being smaller than M.
11. A signal processing apparatus characterized by comprising:
a memory storing computer readable instructions;
a processor coupled to the memory, the processor to execute the computer readable instructions to implement the signal processing method of any of claims 1-9.
12. A computer-readable storage medium having stored thereon one or more instructions adapted to be loaded by a processor and to perform the signal processing method according to any of claims 1-9.
CN202010597937.7A 2020-06-28 2020-06-28 Signal processing method, device, equipment and computer readable storage medium Pending CN111710344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010597937.7A CN111710344A (en) 2020-06-28 2020-06-28 Signal processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010597937.7A CN111710344A (en) 2020-06-28 2020-06-28 Signal processing method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111710344A true CN111710344A (en) 2020-09-25

Family

ID=72544410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597937.7A Pending CN111710344A (en) 2020-06-28 2020-06-28 Signal processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111710344A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112492380A (en) * 2020-11-18 2021-03-12 腾讯科技(深圳)有限公司 Sound effect adjusting method, device, equipment and storage medium
CN112492380B (en) * 2020-11-18 2023-06-30 腾讯科技(深圳)有限公司 Sound effect adjusting method, device, equipment and storage medium
CN112634923A (en) * 2020-12-14 2021-04-09 广州智讯通信系统有限公司 Audio echo cancellation method, device and storage medium based on command scheduling system
CN113516988A (en) * 2020-12-30 2021-10-19 腾讯科技(深圳)有限公司 Audio processing method and device, intelligent equipment and storage medium
CN113516988B (en) * 2020-12-30 2024-02-23 腾讯科技(深圳)有限公司 Audio processing method and device, intelligent equipment and storage medium
CN113192527A (en) * 2021-04-28 2021-07-30 北京达佳互联信息技术有限公司 Method, apparatus, electronic device and storage medium for cancelling echo
CN113192527B (en) * 2021-04-28 2024-03-19 北京达佳互联信息技术有限公司 Method, apparatus, electronic device and storage medium for canceling echo
CN113296727A (en) * 2021-05-08 2021-08-24 广州市奥威亚电子科技有限公司 Sound control method, device, electronic equipment and storage medium
CN113296727B (en) * 2021-05-08 2024-05-14 广州市奥威亚电子科技有限公司 Sound control method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111710344A (en) Signal processing method, device, equipment and computer readable storage medium
CN110619885B (en) Method for generating confrontation network voice enhancement based on deep complete convolution neural network
US11948552B2 (en) Speech processing method, apparatus, electronic device, and computer-readable storage medium
CN107910014B (en) Echo cancellation test method, device and test equipment
CN111489760B (en) Speech signal dereverberation processing method, device, computer equipment and storage medium
CN103391347B (en) A kind of method and device of automatic recording
CN111429931B (en) Noise reduction model compression method and device based on data enhancement
CN112185410B (en) Audio processing method and device
US20240071402A1 (en) Method and apparatus for processing audio data, device, storage medium
CN114338623B (en) Audio processing method, device, equipment and medium
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
WO2023216760A1 (en) Speech processing method and apparatus, and storage medium, computer device and program product
CN112151055B (en) Audio processing method and device
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
CN113516992A (en) Audio processing method and device, intelligent equipment and storage medium
CN113571079A (en) Voice enhancement method, device, equipment and storage medium
CN113516988B (en) Audio processing method and device, intelligent equipment and storage medium
CN116741193B (en) Training method and device for voice enhancement network, storage medium and computer equipment
CN117153178B (en) Audio signal processing method, device, electronic equipment and storage medium
CN111833897B (en) Voice enhancement method for interactive education
CN117219107B (en) Training method, device, equipment and storage medium of echo cancellation model
CN114512141A (en) Method, apparatus, device, storage medium and program product for audio separation
CN114898765A (en) Noise reduction method and device, electronic equipment and computer readable storage medium
CN111800552A (en) Audio output processing method, device and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination