CN113593595A - Voice noise reduction method and device based on artificial intelligence and electronic equipment - Google Patents

Info

Publication number
CN113593595A
CN113593595A
Authority
CN
China
Prior art keywords
noise reduction
voice signal
voice
training
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110116096.8A
Other languages
Chinese (zh)
Inventor
梁俊斌 (Liang Junbin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority claimed from CN202110116096.8A
Publication of CN113593595A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides an artificial-intelligence-based voice noise reduction method and device, an electronic device, and a computer-readable storage medium. The method includes the following steps: acquiring a voice signal; acquiring a constraint condition corresponding to the voice signal; and calling a noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal, to obtain a noise-reduced voice signal. Through the application, both the effect and the efficiency of noise reduction processing can be improved.

Description

Voice noise reduction method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a method, an apparatus, a device, and a computer readable storage medium for speech noise reduction based on artificial intelligence.
Background
Artificial Intelligence (AI) encompasses the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
In voice call or man-machine voice interaction applications, collected voice signals generally contain noise, which degrades the quality of calls and of recognition. Noise reduction processing therefore needs to be performed on the voice signals, so as to improve the voice signal-to-noise ratio, enhance call clarity and intelligibility, and make machine speech recognition more accurate and effective.
Disclosure of Invention
The embodiment of the application provides a voice noise reduction method and device based on artificial intelligence, an electronic device and a computer readable storage medium, and can improve the noise reduction effect and efficiency of noise reduction processing.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a voice noise reduction method based on artificial intelligence, which comprises the following steps:
acquiring a voice signal;
acquiring a constraint condition corresponding to the voice signal;
and calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise reduction voice signal.
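The three steps above amount to a dispatch: look up the noise reduction model adapted to the acquired constraint condition, then run it on the signal. The following is a minimal sketch of that dispatch step under stated assumptions; the registry keys, model names, and the generic fallback are illustrative inventions, not taken from the patent.

```python
# Hypothetical registry mapping a constraint condition to its adapted
# noise reduction model; the fallback to a generic model is an assumption.
def pick_model(constraint, registry, default="generic"):
    """Return the model adapted to `constraint`, else a generic fallback."""
    return registry.get(constraint, registry[default])

registry = {
    "generic": "broadband_model",
    ("subway", "rush_hour"): "subway_model",
    ("office", "daytime"): "office_model",
}

model = pick_model(("subway", "rush_hour"), registry)  # -> "subway_model"
```

A constraint not present in the registry falls back to the generic model, which mirrors the later embodiments where a set of all callable noise reduction models is kept alongside the constraint-adapted ones.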
The embodiment of the application provides an artificial-intelligence-based voice noise reduction device, which includes:
the acquisition module is used for acquiring a voice signal;
the acquisition module is used for acquiring constraint conditions corresponding to the voice signals;
and the noise reduction module is used for calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise reduction voice signal.
In the foregoing solution, the obtaining module is further configured to: acquiring a constraint condition corresponding to the voice signal before acquiring the voice signal, wherein the constraint condition comprises at least one of a scheduled time and a scheduled place of voice communication; the noise reduction module is further configured to: before acquiring the voice signal, acquiring a noise reduction model adaptive to the constraint condition in advance; and when the constraint condition is met, automatically calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the acquired voice signal to obtain a noise reduction voice signal.
In the foregoing solution, the obtaining module is further configured to obtain a plurality of noise reduction models before calling the noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal, where the plurality of noise reduction models are a set of all noise reduction models that can be called, or a set of noise reduction models adapted to the constraint condition; presenting options of the plurality of noise reduction models in a human-computer interaction interface; and in response to the selection operation, using the selected noise reduction model as the noise reduction model adapted to the constraint condition.
In the foregoing solution, the obtaining module is further configured to: acquiring at least one of the following attribute information of the voice signal: sending or receiving time information of the voice signal, sending or receiving geographical location information of the voice signal, sending or receiving user information of the voice signal, and sending or receiving environment information of the voice signal; and determining the constraint condition corresponding to the attribute information of the voice signal according to the preset corresponding relation between the different attribute information and the different constraint conditions.
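The preset correspondence between attribute information and constraint conditions described above can be sketched as a rule table that is scanned for the first match. The attribute keys (`location`, `hour`) and the rules themselves are hypothetical examples; the patent does not fix a concrete schema.

```python
# Hedged sketch: map attribute information of a voice signal to a
# constraint condition via a preset correspondence (first matching rule wins).
def constraint_for(attrs, rules, default="default"):
    """Return the constraint of the first rule whose attributes all match."""
    for required, constraint in rules:
        if required.items() <= attrs.items():  # rule is a subset of attrs
            return constraint
    return default

rules = [
    ({"location": "subway", "hour": 8}, "subway_rush_hour"),
    ({"location": "office"}, "office"),
]

attrs = {"location": "office", "hour": 14, "user": "alice"}
constraint = constraint_for(attrs, rules)  # -> "office"
```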
In the foregoing solution, the noise reduction module is further configured to: extract speech features from the voice signal; perform first fully-connected processing on the speech features corresponding to the voice signal through a first fully-connected layer of the voice endpoint network, to obtain a first fully-connected processing result corresponding to the voice signal; perform voice endpoint detection processing on the first fully-connected processing result corresponding to the voice signal through a first gated recurrent unit of the voice endpoint network, to obtain a voice endpoint detection result corresponding to the voice signal; predict the noise spectrum characteristic of the voice signal through a second gated recurrent unit of the noise spectrum prediction network, taking as input the speech features corresponding to the voice signal, the first fully-connected processing result corresponding to the voice signal, and the voice endpoint detection result corresponding to the voice signal; and predict the gain corresponding to the voice signal through a third gated recurrent unit of the noise spectrum removal network, taking as input the noise spectrum characteristic of the voice signal, the voice endpoint detection result corresponding to the voice signal, and the speech features corresponding to the voice signal, and apply the gain corresponding to the voice signal to obtain a noise-reduced voice signal.
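The data flow above (fully-connected layer, then a VAD branch, a noise-spectrum branch, and a gain branch whose output multiplies the signal) can be sketched at shape level. In this sketch plain dense layers with random weights stand in for the fully-connected layer and the three gated recurrent units, and all layer sizes and feature counts are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, act=np.tanh):
    """Dense layer standing in for an FC layer or GRU (sketch only)."""
    return act(x @ w + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_feat, n_band = 42, 22   # illustrative feature and frequency-band counts

# Randomly initialised stand-in weights for the four stages.
W_fc  = rng.standard_normal((n_feat, 24));                  b_fc  = np.zeros(24)
W_vad = rng.standard_normal((24, 1));                       b_vad = np.zeros(1)
W_ns  = rng.standard_normal((n_feat + 24 + 1, n_band));     b_ns  = np.zeros(n_band)
W_g   = rng.standard_normal((n_feat + n_band + 1, n_band)); b_g   = np.zeros(n_band)

def denoise_frame(feat, band_energy):
    h_fc  = dense(feat, W_fc, b_fc)                              # first FC layer
    vad   = dense(h_fc, W_vad, b_vad, act=sigmoid)               # voice endpoint detection
    noise = dense(np.concatenate([feat, h_fc, vad]), W_ns, b_ns) # noise spectrum prediction
    gain  = dense(np.concatenate([feat, noise, vad]), W_g, b_g,
                  act=sigmoid)                                   # per-band gain in (0, 1)
    return gain * band_energy                                    # apply gain to the bands

feat = rng.standard_normal(n_feat)
band = np.abs(rng.standard_normal(n_band))
out = denoise_frame(feat, band)
```

Because each per-band gain lies in (0, 1), applying it can only attenuate band energy, which matches the suppression role of the gain described above.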
In the foregoing solution, the noise reduction module is further configured to: calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal through the noise reduction model to obtain a noise reduction voice signal, and then determining a noise reduction effect parameter of the noise reduction voice signal; when the noise reduction effect parameter is lower than a noise reduction effect parameter threshold, performing the following processing: determining the similarity between a plurality of other constraint conditions and the constraint conditions corresponding to the voice signals, and determining other constraint conditions with the highest similarity to call corresponding noise reduction models to perform noise reduction processing on the noise reduction voice signals to obtain updated noise reduction voice signals; wherein the other constraints are different from the constraints corresponding to the speech signal.
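The fallback path above picks, among other constraint conditions, the one most similar to the current constraint. A sketch under stated assumptions: Jaccard similarity over constraint attribute tuples is an illustrative choice of similarity measure; the patent leaves the measure unspecified.

```python
# Sketch of the fallback: when the noise reduction effect parameter is
# below threshold, choose the other constraint most similar to the target.
def jaccard(a, b):
    """Jaccard similarity of two attribute collections (illustrative)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def best_fallback(current, others):
    """Other constraint condition with the highest similarity to `current`."""
    return max(others, key=lambda c: jaccard(current, c))

current = ("subway", "rush_hour", "crowded")
others = [("subway", "rush_hour", "quiet"), ("office", "daytime", "quiet")]
chosen = best_fallback(current, others)
```

The noise reduction model adapted to `chosen` would then re-process the already noise-reduced signal, as the claim describes.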
In the above scheme, each noise reduction model is adapted to a constraint condition, and different noise reduction models are adapted to different constraint conditions; the device also comprises a training module, a processing module and a processing module, wherein the training module is used for acquiring a training voice signal sample set corresponding to a plurality of constraint conditions one to one before acquiring the constraint conditions corresponding to the voice signals; and training to obtain a plurality of noise reduction models corresponding to the plurality of constraint conditions one by one on the basis of a training voice signal sample set corresponding to the plurality of constraint conditions one by one.
In the above solution, each noise reduction model is adapted to at least one of the constraints, and the constraints for different noise reduction model adaptations are different; the training module is further configured to, before obtaining a constraint condition corresponding to the voice signal, obtain a training voice signal sample set corresponding to a plurality of the constraint conditions one to one; training to obtain a plurality of noise reduction models corresponding to the plurality of constraint conditions one by one on the basis of a training voice signal sample set corresponding to the plurality of constraint conditions one by one; clustering the plurality of noise reduction models to obtain at least one clustered noise reduction model; wherein the noise reduction model of each cluster corresponds to at least one of the constraints.
In the above scheme, the training module is further configured to obtain a test speech signal sample set corresponding to each constraint condition; performing the following for each set of test speech signal samples: carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set through the plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one to one; and clustering the noise reduction models according to the noise reduction results which are obtained by aiming at each test voice signal sample set and correspond to the noise reduction models one by one to obtain at least one clustered noise reduction model.
In the foregoing solution, the training module is further configured to perform the following processing for each test speech signal sample set: determining a plurality of noise reduction results obtained after the noise reduction processing is carried out on the test voice signal samples in the test voice signal sample set by the plurality of noise reduction models; determining a minimum mean square error for each of the noise reduction results, wherein the noise reduction results comprise noise reduced speech signal samples of the plurality of test speech signal samples in the set of test speech signal samples; based on the minimum mean square error of the noise reduction result, sequencing the noise reduction models in an ascending order, and taking at least one noise reduction model sequenced at the front as at least one candidate noise reduction model corresponding to the test voice signal sample set; extracting an intersection noise reduction model from the candidate noise reduction models corresponding to each of the test speech signal sample sets to use the intersection noise reduction model as a clustered noise reduction model; and the candidate noise reduction model serving as the intersection noise reduction model corresponds to a plurality of test voice signal sample sets.
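The clustering step above keeps, per test speech signal sample set, the top-ranked candidate models (ascending minimum mean square error) and then takes the models common to several candidate lists as a clustered model. A minimal sketch of the intersection step, with invented model names and rankings:

```python
# Sketch: intersect per-test-set candidate model lists to obtain the
# clustered noise reduction model(s). Rankings are assumed already done.
def clustered_models(candidates_per_set):
    """Models appearing in every test set's candidate list."""
    sets = [set(c) for c in candidates_per_set]
    return set.intersection(*sets)

# Top-2 candidates (by ascending minimum mean square error) per test set.
candidates = [
    ["model_a", "model_b"],   # test set for constraint condition 1
    ["model_b", "model_c"],   # test set for constraint condition 2
]
cluster = clustered_models(candidates)  # -> {"model_b"}
```

Here `model_b` corresponds to both test speech signal sample sets, so it serves as the intersection noise reduction model.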
In the above scheme, the training module is further configured to perform clustering processing on the plurality of constraint conditions as original constraint conditions to obtain at least one clustering constraint condition; acquiring a test voice signal sample set corresponding to a plurality of original constraint conditions one by one and a noise reduction model corresponding to the original constraint conditions one by one; carrying out fusion processing on the test voice signal sample sets which correspond to the original constraint conditions one by one to obtain a test voice signal sample set corresponding to the clustering constraint conditions; carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set corresponding to the clustering constraint condition through a plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one by one; and acquiring the minimum mean square error of each noise reduction result, and determining a noise reduction model corresponding to the minimum mean square error as the clustered noise reduction model.
In the above scheme, the training module is further configured to acquire a plurality of noises carrying a plurality of attribute information; wherein the attribute information includes: sending or receiving time information of the voice signal, sending or receiving geographical location information of the voice signal, sending or receiving user information of the voice signal, and sending or receiving environment information of the voice signal; dividing the plurality of noises according to the plurality of attribute information to obtain noise sets corresponding to the plurality of constraint conditions one by one; wherein each of the constraints has the same plurality of kinds of attribute information; and overlapping the noise of each noise set and a pure voice signal sample to obtain a training voice signal sample set corresponding to each constraint condition.
In the above scheme, the training module is further configured to obtain a weight of the clean speech signal sample and a weight of the noise; weighting the clean voice signal sample and the noise according to the weight of the clean voice signal sample and the weight of the noise to obtain a training voice signal sample; adding the pure voice signal or the noise based on the training voice signal sample to obtain a new training voice signal sample; and forming the training voice signal sample set according to the training voice signal samples and the new training voice signal samples.
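The two paragraphs above build training samples by superimposing weighted noise onto weighted clean speech. A minimal sketch, assuming sampled waveforms as arrays; the weight values are illustrative, since the patent leaves them open:

```python
import numpy as np

def mix(clean, noise, w_clean=1.0, w_noise=0.5):
    """Weight and superimpose a clean speech sample and a noise sample."""
    n = min(len(clean), len(noise))           # align lengths before mixing
    return w_clean * clean[:n] + w_noise * noise[:n]

clean = np.array([0.2, -0.1, 0.4])
noise = np.array([0.05, 0.05, -0.05])
sample = mix(clean, noise)                    # one training voice signal sample
louder = mix(clean, noise, w_noise=1.0)       # new sample with more added noise
```

Varying the weights (or re-adding clean speech or noise, as the claim states) yields new training samples at different signal-to-noise ratios from the same source material.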
In the above scheme, the training module is further configured to perform noise reduction processing on a training speech signal sample included in the training speech signal sample set through the noise reduction model to obtain a noise reduction speech signal corresponding to the training speech signal sample; determining an error between a noise reduction speech signal corresponding to the training speech signal sample and a clean speech signal sample, and substituting the error into a loss function of the noise reduction model; and determining a parameter change value of the noise reduction model when the loss function obtains a minimum value based on the learning rate of the noise reduction model, and updating the parameter of the noise reduction model based on the parameter change value.
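The update rule above (error against the clean sample, loss, learning-rate-scaled parameter change) can be shown on a toy model. This hypothetical stand-in learns a single scalar gain `g` so that `g * noisy` matches the clean signal under a mean squared error loss; the real model's parameters are the network weights, not one scalar.

```python
import numpy as np

def train_step(g, noisy, clean, lr=0.1):
    """One gradient step minimising mean squared error of g*noisy vs clean."""
    err = g * noisy - clean
    loss = np.mean(err ** 2)
    grad = np.mean(2 * err * noisy)    # d(loss)/d(g)
    return g - lr * grad, loss         # parameter change scaled by learning rate

noisy = np.array([1.0, 2.0, 3.0])
clean = 0.5 * noisy                    # ground truth: the ideal gain is 0.5
g = 1.0
for _ in range(100):
    g, loss = train_step(g, noisy, clean)
```

After repeated updates `g` converges to the ideal value 0.5, illustrating how minimising the loss drives the parameter change values.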
In the above scheme, the training module is further configured to extract speech features from a training voice signal sample; perform first fully-connected processing on the speech features through a first fully-connected layer of a voice endpoint network to obtain a first fully-connected processing result; perform voice endpoint detection processing on the first fully-connected processing result through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result; predict the noise spectrum characteristic of the training voice signal sample through a second gated recurrent unit of a noise spectrum prediction network, taking the speech features, the first fully-connected processing result, and the voice endpoint detection result as input; and predict the gain corresponding to the training voice signal sample through a third gated recurrent unit of a noise spectrum removal network, taking the noise spectrum characteristic, the voice endpoint detection result, and the speech features as input, and apply the gain to the training voice signal sample.
The embodiment of the application provides a training method of a noise reduction model based on artificial intelligence, which comprises the following steps:
obtaining a plurality of training voice signal samples carrying noise in a plurality of constraint conditions to form a training voice signal sample set corresponding to the constraint conditions one by one;
performing noise reduction processing on training voice signal samples included in a training voice signal sample set corresponding to the plurality of constraint conditions through noise reduction models corresponding to the plurality of constraint conditions one to obtain noise reduction voice signals;
determining an error between the noise reduction voice signal and a pure voice signal sample corresponding to the training voice signal sample, and updating parameters of a noise reduction model corresponding to the constraint condition according to the error;
and the noise reduction model is used for carrying out noise reduction processing on the voice signals in the corresponding constraint conditions.
The embodiment of the application provides a training device for an artificial-intelligence-based noise reduction model, which includes:
the training module is used for acquiring a plurality of training voice signal samples carrying noise in a plurality of constraint conditions to form a training voice signal sample set in one-to-one correspondence with the plurality of constraint conditions;
the training module is further configured to perform noise reduction processing on training speech signal samples included in a training speech signal sample set corresponding to the multiple constraint conditions through noise reduction models corresponding to the multiple constraint conditions one to one, so as to obtain noise-reduced speech signals;
the training module is further configured to determine an error between the noise reduction speech signal and a clean speech signal sample corresponding to the training speech signal sample, and update parameters of a noise reduction model corresponding to the constraint condition according to the error;
and the noise reduction model is used for carrying out noise reduction processing on the voice signals in the corresponding constraint conditions.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the artificial intelligence based speech noise reduction method or the artificial intelligence based noise reduction model training method provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for implementing, when executed by a processor, the artificial intelligence based speech noise reduction method or the artificial intelligence based noise reduction model training method provided in the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
the different constraint conditions of the voice signals are identified through the attribute information of the voice signals, and corresponding noise reduction models are called according to the different constraint conditions to perform noise reduction processing on the voice signals. That is, different suppression means are adopted for different constraint conditions, which optimizes the noise reduction effect in a targeted manner and improves noise reduction efficiency.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence-based speech noise reduction system provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIGS. 3A-3D are schematic flowcharts of artificial intelligence based speech noise reduction methods provided by embodiments of the present application;
FIG. 4 is a schematic diagram of a noise reduction model of an artificial intelligence-based speech noise reduction method provided by an embodiment of the present application;
FIGS. 5A-5B are schematic diagrams of constraint conditions of an artificial intelligence based speech noise reduction method provided by an embodiment of the present application;
FIG. 6A is a flowchart illustrating an artificial intelligence based speech noise reduction method according to an embodiment of the present application;
FIG. 6B is a flowchart illustrating a training method of a noise reduction model based on artificial intelligence according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first \ second \ third" are used only to distinguish similar objects and do not denote a particular order; where permitted, the specific order or sequence denoted by "first \ second \ third" may be interchanged, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Noise spectrum characteristics: frequency is one of the main parameters describing sound, and studying how sound intensity is distributed over frequency requires the sound spectrum, a graph showing how the intensity of the components of a complex sound (a sound wave synthesized from simple harmonic components of different frequencies) is distributed over frequency. The sound spectrum of a noise signal is its noise spectrum. Determining the composition and properties of noise from its spectrum is called spectral analysis. Spectral analysis usually reveals whether the peak noise lies in the low-, intermediate-, or high-frequency range, which provides a basis for noise control.
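The kind of spectral analysis described above can be illustrated in a few lines: compute the magnitude spectrum of a synthetic signal and locate the frequency where its intensity peaks. The sample rate and tone frequency below are arbitrary illustrative values.

```python
import numpy as np

# Minimal spectral analysis: find the peak frequency of a synthetic tone.
fs = 8000                              # sample rate in Hz (illustrative)
t = np.arange(fs) / fs                 # one second of samples
signal = np.sin(2 * np.pi * 440 * t)   # a 440 Hz component

spectrum = np.abs(np.fft.rfft(signal))          # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), 1 / fs)    # bin frequencies in Hz
peak_hz = freqs[np.argmax(spectrum)]            # frequency of peak intensity
```

For real noise, the same analysis shows whether the peak noise energy sits in the low-, intermediate-, or high-frequency range.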
The related art includes a class of general noise reduction algorithms. Algorithms of this type usually assume that noise is additive, random, and stationary and that speech has short-time stationary characteristics; they estimate the noise spectrum characteristics through various statistical methods and then suppress noise in the noisy speech according to the computed signal-to-noise ratio. In the enhanced speech signal after noise suppression, the proportion of noise components is reduced, the speech signal-to-noise ratio is improved, and the speech is clearer and more intelligible. However, these general algorithms assume additive, random, stationary noise, while under actual constraint conditions noise divides into stationary and non-stationary noise; some noise reduction means in the related art cannot effectively suppress non-stationary noise, which is an obvious shortcoming.
The related art also includes AI noise reduction algorithms, whose workflow divides into a deep learning training stage and a deep learning inference stage. The training stage first extracts related speech time-frequency-domain feature data, such as the power spectrum, pitch period, and voice endpoints, from a large number of speech and noise samples, and uses this data to train a designed deep network model (usually a structure such as a multilayer deep neural network, a convolutional neural network, a gated recurrent unit, or a long short-term memory network). The goal of training is to continuously optimize all parameters in the model so that it more accurately predicts the clean speech components (or noise components) in various kinds of noisy speech, that is, accurately distinguishes noise from speech. AI noise reduction is a data-driven technology: if the noise types of the actual application scene are covered by the sample library of the training stage, the AI noise reduction algorithm can usually suppress them effectively to obtain cleaner speech, but if a noise type of the actual scene was not included during training, the algorithm may fail. Another disadvantage of AI noise reduction algorithms is that, because effective speech under some constraint conditions is close to particular noises, the algorithm may misjudge and suppress the effective speech signal.
The following describes an exemplary application of the electronic device provided in the embodiment of the present application, and the device provided in the embodiment of the present application may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated message device, and a portable game device), and may also be implemented as a server. In the following, an exemplary application will be explained when the device is implemented as a server.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an artificial intelligence based voice noise reduction system provided in an embodiment of the present application. To support a voice call application, a terminal 400-1 and a terminal 400-2 are connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two. The terminal 400-1 initiates a voice call request, carrying a voice message, to the terminal 400-2 through the server 200. After the server sends the voice message to the terminal 400-2, the terminal 400-2 sends attribute information of the voice message (the geographical location information of the terminal 400-2 and the time information of receiving the voice message) to the server 200, so that the server 200 acquires the corresponding constraint condition, calls the corresponding noise reduction model to perform noise reduction processing on the voice message, obtains a noise-reduced voice signal, and returns the noise-reduced voice signal to the terminal 400-2 for playback.
In some embodiments, the artificial intelligence based speech noise reduction system is applied to a voice call application. Voice call applications include real-time voice call applications (e.g., making a phone call or conducting a voice call) and non-real-time voice call applications (e.g., sending a voice message). The terminal 400-1 used by a user can both receive and send voice messages. For a sent voice message, a noise reduction model running in the server 200 and adapted to the constraint condition under which the message was sent can be called to perform noise reduction processing on it, and the noise-reduced voice signal is sent to the terminal 400-2 used by another user. For a received voice message, the terminal 400-1 can call a noise reduction model running in the server 200 and adapted to the constraint condition under which the message was received to perform noise reduction processing on it, and the noise-reduced voice signal is played on the terminal 400-1.
In some embodiments, the artificial intelligence based speech noise reduction system is applied to an application having a voice interaction function. The terminal 400-1 used by a user receives voice information from the user, and the voice information carries the content of a control instruction. For this voice information, a noise reduction model running in the server 200 and adapted to the constraint condition under which the voice information was sent can be called to perform noise reduction processing on it; speech recognition processing is then performed on the noise-reduced voice signal to recognize the control instruction, and the control instruction is returned to the terminal 400-1, so that the terminal 400-1 can respond to the recognized control instruction.
In some embodiments, the electronic device may be implemented as a terminal. The terminal 400-1 used by a user receives voice information from the terminal 400-2 of another user; according to the constraint conditions under which the voice information is received (geographical location information and time information), the terminal 400-1 directly calls an adapted local noise reduction model to perform noise reduction processing on the voice information, and plays the noise-reduced voice signal corresponding to the voice information on the terminal 400-1.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminals 400-1 and 400-2 may be, but are not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present invention.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The server 200 shown in fig. 2 includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The various components in the server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communication among these components. In addition to a data bus, the bus system 240 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as the bus system 240 in fig. 2.
The Processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor, any conventional processor, or the like.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software. Fig. 2 shows an artificial intelligence based speech noise reduction apparatus 255-1 stored in the memory 250, which may be software in the form of programs, plug-ins, and the like, and includes the following software modules: a collection module 2551, an acquisition module 2552, a noise reduction module 2553, and a training module 2554. Fig. 2 also shows a training apparatus 255-2 for an artificial intelligence based noise reduction model stored in the memory 250, which may likewise be software in the form of programs, plug-ins, and the like, and includes the following software module: a training module 2555. These modules are logical, and thus can be arbitrarily combined or further split according to the functions implemented; their functions will be described below.
The artificial intelligence based speech noise reduction method provided by the embodiments of the present application will be described below in conjunction with exemplary applications and implementations of the electronic device provided by the embodiments of the present application. The method can be implemented by a terminal or a server: for received voice information, the terminal directly collects the voice signal from the voice information and calls a noise reduction model adapted to the constraint condition of receiving the voice information to perform noise reduction processing, or the terminal calls, from the server, a noise reduction model adapted to the constraint condition of receiving the voice information to perform noise reduction processing.
Referring to fig. 3A, fig. 3A is a schematic flowchart of a method for speech noise reduction based on artificial intelligence according to an embodiment of the present application, which will be described with reference to the steps shown in fig. 3A.
In step 101, a speech signal is acquired.
As an example, the voice signal may be a voice signal collected in real time during a real-time voice call, where the real-time voice call includes the process of making a phone call or a voice call; the voice signal may also be collected during a non-real-time voice call, where the non-real-time voice call includes the process of sending voice messages and the like; the voice signal may further be collected from a human-computer interaction voice command issued by the user on a human-computer interaction interface.
In step 102, constraints corresponding to the speech signal are obtained.
In some embodiments, obtaining the constraint condition corresponding to the voice signal in step 102 may be implemented by the following technical solution: acquiring at least one of the following attribute information of the voice signal: time information of sending or receiving the voice signal; geographical location information of sending or receiving the voice signal; user information of sending or receiving the voice signal, such as the person who initiates a call or the person who accepts a call; and environment information in which the voice signal is sent or received, for example whether a voice call is accepted or initiated through a computer or through a mobile device; and determining the constraint condition corresponding to the attribute information of the voice signal according to preset correspondences between different attribute information and different constraint conditions.
As an example, the attribute information is information characterizing constraints related to the voice signal, and includes one or more of time information, geographical location information, user information, or environment information. The constraint conditions of the voice signal include constraint conditions for generating, sending, or receiving the voice signal. Accordingly, the time information includes time information of sending or receiving the voice signal; the geographical location information includes geographical location information of sending or receiving the voice signal; the user information includes user information of sending or receiving the voice signal, such as the person who initiated a call or the person who accepted a call; and the environment information includes the environment in which the voice signal was sent or received, for example accepting or initiating a voice call through a computer versus through a mobile device. The computer and the mobile device represent different environment information: the computer represents relatively fixed environment information, such as an office environment, while the mobile device represents changing environment information, such as a leisure environment. Different constraint conditions can be defined by a combination of any several kinds of attribute information, and a correspondence exists between an individual piece of attribute information, or a combination of attribute information, and a constraint condition; for example, an airport from 10 am to 11 am serves as one constraint condition. Finally, the constraint condition corresponding to the attribute information of the voice signal is determined according to the preset correspondences between different attribute information and different constraint conditions.
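The correspondence lookup described above can be sketched as a simple table keyed by attribute combinations. The attribute names, constraint labels, and table contents below are illustrative assumptions, not data from the embodiment:

```python
# Hypothetical attribute-to-constraint table; a real system would hold
# correspondences for all supported attribute combinations.
CONSTRAINT_TABLE = {
    ("airport", range(10, 11)): "airport_10_11",   # airport, 10 am to 11 am
    ("airport", range(8, 10)): "airport_8_10",
    ("office", range(9, 18)): "office_workday",
}

def constraint_for(location: str, hour: int, default: str = "generic") -> str:
    """Map attribute information (location, hour) to a constraint condition."""
    for (loc, hours), constraint in CONSTRAINT_TABLE.items():
        if loc == location and hour in hours:
            return constraint
    return default   # no preset correspondence: fall back to a generic model
```

In this sketch, `constraint_for("airport", 10)` resolves to the "airport from 10 am to 11 am" constraint condition of the example above.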
In step 103, a noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the speech signal, so as to obtain a noise reduced speech signal.
Referring to fig. 3B, fig. 3B is a schematic flowchart of the artificial intelligence based speech noise reduction method provided in the embodiment of the present application. Step 103, in which a noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the voice signal to obtain a noise-reduced voice signal, may be implemented through steps 1031-1035, which will be described with reference to the steps shown in fig. 3B.
In step 1031, speech features are extracted from the speech signal.
In step 1032, first full-connection processing is performed on the speech feature through a first full-connection layer of the voice endpoint network to obtain a first full-connection processing result.
In step 1033, voice endpoint detection processing is performed on the first full-connection processing result through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result.
In step 1034, the speech feature, the first full-connection processing result, and the voice endpoint detection result are taken as input, and the noise spectrum characteristic of the voice signal is predicted through a second gated recurrent unit of the noise spectrum estimation network.
In step 1035, the noise spectrum characteristic, the voice endpoint detection result, and the speech feature are taken as input; the gain corresponding to the voice signal is predicted through a third gated recurrent unit of the noise spectrum removal network, and the gain is applied to the voice signal to obtain a noise-reduced voice signal.
By way of example, referring to fig. 4, fig. 4 is a schematic diagram of a noise reduction model of the artificial intelligence based speech noise reduction method provided in an embodiment of the present application. The noise reduction model includes a voice endpoint network, a noise spectrum estimation network, and a noise spectrum removal network. A speech feature is extracted from the voice signal; the speech feature is extracted from time-frequency domain data, such as the power spectrum, pitch period, voice endpoint, and the like. The speech feature is subjected to first full-connection processing through the first full-connection layer (24-dimensional) of the voice endpoint network; the first full-connection processing is implemented through a dense function and a tanh activation function, yielding a first full-connection processing result. A voice endpoint detection result is obtained by performing voice endpoint detection processing on the first full-connection processing result through the first gated recurrent unit (24-dimensional) of the voice endpoint network. The gated recurrent unit is a commonly used gated recurrent neural network: it receives the current input and the hidden state passed from the previous node, where the hidden state carries related information about the previous node, and by combining the hidden state with the current input it produces the output of the current hidden node and the hidden state passed to the next node. The voice endpoint detection result is obtained through iterative processing by a plurality of gated recurrent units and mapping by an activation function. Taking the speech feature, the first full-connection processing result, and the voice endpoint detection result as input, the noise spectrum characteristic of the voice signal is predicted through the second gated recurrent unit (48-dimensional) of the noise spectrum estimation network. Taking the noise spectrum characteristic, the voice endpoint detection result, and the speech feature as input, a gain (22-dimensional) corresponding to the voice signal is predicted through the third gated recurrent unit (96-dimensional) of the noise spectrum removal network, and the gain is applied to the voice signal to obtain a noise-reduced voice signal. The data involved in steps 1031-1035 are all processing results obtained for the corresponding voice signal.
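The per-frame dataflow of fig. 4 can be sketched with a minimal NumPy forward pass. The 42-dimensional input feature is an assumption (the embodiment does not specify the feature dimension), and all weights here are random placeholders rather than trained parameters; only the layer sizes (24/24/48/96/22) and the wiring between the three networks follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mat(n_in, n_out):
    # random placeholder weights; a real model would use trained parameters
    return rng.normal(0.0, 0.1, (n_in, n_out))

class GRUCell:
    """Minimal gated recurrent unit (update gate z, reset gate r)."""
    def __init__(self, n_in, n_hidden):
        self.Wz, self.Wr, self.Wh = (mat(n_in, n_hidden) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (mat(n_hidden, n_hidden) for _ in range(3))

    def step(self, x, h):
        z = sigmoid(x @ self.Wz + h @ self.Uz)            # update gate
        r = sigmoid(x @ self.Wr + h @ self.Ur)            # reset gate
        h_cand = np.tanh(x @ self.Wh + (r * h) @ self.Uh)
        return (1 - z) * h + z * h_cand                   # new hidden state

FEAT = 42  # assumed input feature dimension (not specified by the embodiment)

W_fc, b_fc = mat(FEAT, 24), np.zeros(24)     # first full-connection layer (tanh)
vad_gru = GRUCell(24, 24)                    # voice endpoint network GRU
W_vad, b_vad = mat(24, 1), np.zeros(1)       # voice endpoint probability head
noise_gru = GRUCell(FEAT + 24 + 1, 48)       # noise spectrum estimation GRU
gain_gru = GRUCell(48 + 1 + FEAT, 96)        # noise spectrum removal GRU
W_gain, b_gain = mat(96, 22), np.zeros(22)   # 22-dimensional band gains

def denoise_frame(feat, states):
    h_vad, h_noise, h_gain = states
    fc = np.tanh(feat @ W_fc + b_fc)                                      # step 1032
    h_vad = vad_gru.step(fc, h_vad)                                       # step 1033
    vad = sigmoid(h_vad @ W_vad + b_vad)                                  # endpoint result
    h_noise = noise_gru.step(np.concatenate([feat, fc, vad]), h_noise)    # step 1034
    h_gain = gain_gru.step(np.concatenate([h_noise, vad, feat]), h_gain)  # step 1035
    gains = sigmoid(h_gain @ W_gain + b_gain)    # per-band gains in [0, 1]
    return gains, vad, (h_vad, h_noise, h_gain)

states = (np.zeros(24), np.zeros(48), np.zeros(96))
gains, vad, states = denoise_frame(rng.normal(size=FEAT), states)
```

Because the output gains pass through a sigmoid, each of the 22 band gains lies in [0, 1] and can be applied multiplicatively to the corresponding frequency bands of the frame.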
It should be noted that the execution order of steps 101, 102, and 103 above can vary, as described below.
In some embodiments, step 101 may be performed first, and then step 102 and step 103 may be performed in sequence, one specific example being described below. In step 101 a speech signal is acquired and in step 102 the speech signal acquired in step 101 is analyzed to obtain constraints corresponding to the speech signal. In step 103, a noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the speech signal, so as to obtain a noise reduced speech signal.
When steps 101-103 are performed sequentially, step 102 is implemented as follows: acquiring at least one of the following attribute information of the voice signal: time information of sending or receiving the voice signal; geographical location information of sending or receiving the voice signal; user information of sending or receiving the voice signal, such as the person who initiates a call or the person who accepts a call; and environment information in which the voice signal is sent or received, for example whether a voice call is accepted or initiated through a computer or through a mobile device; and determining the constraint condition corresponding to the attribute information of the voice signal according to preset correspondences between different attribute information and different constraint conditions.
As an example, a terminal receives a piece of voice information and collects it. In response to a playing operation of the user for the voice information, the terminal acquires the playing time information and playing location information of the voice information, acquires the constraint condition corresponding to the playing time information and playing location information, calls the noise reduction model corresponding to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise-reduced voice signal, and plays the noise-reduced voice signal.
In some embodiments, step 102 may be performed first, followed by step 103 and then step 101; an example is described below. In step 102, that is, before the voice signal is acquired, the constraint condition corresponding to the voice signal is acquired, the constraint condition including at least one of a scheduled time and a scheduled location of voice communication. For step 103, a noise reduction model adapted to the constraint condition is obtained in advance; then step 101 is performed to acquire the voice signal, and when the constraint condition is satisfied, step 103 is performed to automatically call the noise reduction model adapted to the constraint condition to perform noise reduction processing on the acquired voice signal, obtaining a noise-reduced voice signal.
As an example, for a voice signal in a teleconference, the scheduled time and scheduled location of the teleconference are extracted from a schedule function and/or to-do function of the user's instant messaging client, or from a memo of a voice assistant. The scheduled time and scheduled location of the teleconference constitute the constraint condition of the voice signal to be transmitted in the teleconference, where the scheduled location may include an airport, an office, and the like. Before the teleconference starts, a noise reduction model adapted to the constraint condition is obtained in advance, for example acquired in advance from a server, or loaded in advance on the processing end (terminal or server). When the constraint condition is satisfied, that is, when the user reaches the scheduled conference location at the scheduled conference time, the noise reduction model adapted to the constraint condition is automatically called to perform noise reduction processing on the acquired voice signal, obtaining a noise-reduced voice signal. This saves the time otherwise spent adapting and calling a noise reduction model in real time, and improves noise reduction efficiency.
As an example, for a voice call scenario, the scheduled time of the voice call is extracted from a memo of a voice assistant (the scheduled location of the voice call is not recorded in the memo), and the geographical location information of the terminal is obtained while the voice call is being dialed but not yet connected. The geographical location information and the scheduled time constitute the constraint condition of the voice signal to be transmitted in the voice call. Before the voice call is connected, a noise reduction model adapted to the constraint condition is obtained in advance, for example acquired in advance from a server, or loaded in advance on the processing end (terminal or server). After the voice call is connected, when the constraint condition is satisfied (the point in time at which the constraint condition is satisfied may be before the voice call is connected), the noise reduction model adapted to the constraint condition is automatically called to perform noise reduction processing on the acquired voice signal, obtaining a noise-reduced voice signal. This saves the time otherwise spent adapting and calling a noise reduction model in real time, and improves noise reduction efficiency.
In some embodiments, step 101 and step 102 may be performed first, then a user-selected noise reduction model is obtained according to the constraint condition obtained in step 102, and step 103 is then performed with the user-selected noise reduction model; an example is described below. In step 101 the voice signal is acquired, and in step 102 the voice signal acquired in step 101 is analyzed to obtain the constraint condition corresponding to the voice signal. Before the noise reduction model adapted to the constraint condition is called in step 103 to perform noise reduction processing on the voice signal, a plurality of noise reduction models are acquired; the plurality of noise reduction models are either the set of all callable noise reduction models or the set of noise reduction models adapted to the constraint condition (obtained through step 102). Options for the plurality of noise reduction models are presented in a human-computer interaction interface; in response to a selection operation, the selected noise reduction model is used as the noise reduction model adapted to the constraint condition, and step 103 is executed with the noise reduction model selected by the user.
As an example, for a teleconference scene, after step 102 and before step 103, options for a plurality of noise reduction models are presented in the human-computer interaction interface, each option corresponding to a different noise reduction model. The noise reduction models corresponding to the presented options are either all noise reduction models or the noise reduction models adapted to the constraint condition; if all noise reduction models are presented, the option corresponding to the noise reduction model adapted to the constraint condition is highlighted. In response to a user selection operation, the selected noise reduction model is updated to the noise reduction model adapted to the constraint condition; in the teleconference scene, the user selection operation may occur during the teleconference.
As an example, in a teleconference scenario, the user's selection operation may also occur before the teleconference. The scheduled time and scheduled location of the teleconference are extracted from a memo of a voice assistant; the scheduled time and scheduled location constitute the constraint condition of the voice signal to be transmitted in the teleconference. Before the teleconference starts, a noise reduction model adapted to the constraint condition is acquired in advance, for example from a server, or loaded in advance on the processing end (terminal or server). Options for a plurality of noise reduction models are presented in the human-computer interaction interface, each option corresponding to a different noise reduction model; the noise reduction models corresponding to the presented options are either all noise reduction models or the pre-acquired noise reduction model adapted to the constraint condition, and if all noise reduction models are presented, the option corresponding to the pre-acquired noise reduction model adapted to the constraint condition is highlighted. In response to a user selection operation, the selected noise reduction model is updated to the noise reduction model adapted to the constraint condition. After the conference starts, voice signal acquisition begins, and the noise reduction model adapted to the constraint condition is automatically called to perform noise reduction processing on the acquired voice signal, obtaining a noise-reduced voice signal. This accurately meets the user's needs, improves the pertinence of noise reduction, and improves the efficiency of the user's human-computer interaction.
Referring to fig. 3C, fig. 3C is a schematic flowchart of the artificial intelligence based speech noise reduction method provided in the embodiment of the present application. After the noise reduction model adapted to the constraint condition is called in step 103 to perform noise reduction processing on the voice signal to obtain a noise-reduced voice signal, step 104 and step 105 may be executed, which will be described with reference to the steps shown in fig. 3C.
In step 104, noise reduction effect parameters of the noise reduced speech signal are determined.
In step 105, when the noise reduction effect parameter is lower than a noise reduction effect parameter threshold, similarities between a plurality of other constraint conditions and the constraint condition corresponding to the voice signal are determined, and the other constraint condition with the highest similarity is determined, so that the corresponding noise reduction model is called to perform noise reduction processing on the noise-reduced voice signal, obtaining an updated noise-reduced voice signal.
As an example, a noise reduction effect parameter of the noise-reduced voice signal is determined; the noise reduction effect parameter is obtained based on the signal-to-noise ratio, the noise spectrum intensity, and the like of the noise-reduced voice signal, and the noise reduction effect parameter threshold is a threshold calculated based on a plurality of set parameters (signal-to-noise ratio, noise spectrum intensity, and the like). A noise reduction effect parameter lower than the threshold indicates that the noise reduction effect is not ideal and that further noise reduction processing is required. Therefore, the similarities between a plurality of other constraint conditions and the constraint condition corresponding to the voice signal are determined, the other constraint condition with the highest similarity is determined, and the noise reduction model corresponding to that other constraint condition is called to perform noise reduction processing on the noise-reduced voice signal, obtaining an updated noise-reduced voice signal. The other constraint conditions differ from the constraint condition corresponding to the voice signal, and this additional noise reduction processing can effectively improve noise reduction quality and efficiency.
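The fallback logic above can be sketched minimally as follows, assuming constraint conditions are represented as attribute dictionaries and similarity is measured by the number of matching attributes; both representations, and the model names, are illustrative assumptions:

```python
def similarity(c1: dict, c2: dict) -> int:
    """Count the attributes on which two constraint conditions agree."""
    return sum(1 for k in c1 if k in c2 and c1[k] == c2[k])

def fallback_model(models, current, effect, effect_threshold):
    """When the noise reduction effect parameter falls below the threshold,
    return the model of the most similar *other* constraint condition;
    otherwise return None (no second noise reduction pass is needed)."""
    if effect >= effect_threshold:
        return None
    others = [(c, m) for c, m in models if c != current]
    _, best_model = max(others, key=lambda cm: similarity(cm[0], current))
    return best_model

# Hypothetical registry of per-constraint noise reduction models.
models = [
    ({"location": "airport", "hour": 10}, "model_airport_10"),
    ({"location": "airport", "hour": 12}, "model_airport_12"),
    ({"location": "office", "hour": 9}, "model_office_9"),
]
current = {"location": "airport", "hour": 10}
```

With these values, a poor effect parameter (below the threshold) selects `model_airport_12`, the model of the other airport constraint, while an acceptable effect parameter selects no fallback model.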
In some embodiments, each noise reduction model is adapted to one constraint condition, and different noise reduction models are adapted to different constraint conditions. Before the constraint condition corresponding to the voice signal is acquired in step 102, the following technical solution may be implemented: acquiring training voice signal sample sets in one-to-one correspondence with a plurality of constraint conditions; and training, based on the training voice signal sample sets in one-to-one correspondence with the plurality of constraint conditions, a plurality of original noise reduction models in one-to-one correspondence with the plurality of constraint conditions.
As an example, a plurality of noise-carrying training voice signal samples under each of a plurality of constraint conditions are acquired to form training voice signal sample sets in one-to-one correspondence with the plurality of constraint conditions, and noise reduction models in one-to-one correspondence with the plurality of constraint conditions are initialized. Noise reduction processing is performed on the training voice signal samples in each training voice signal sample set through the noise reduction model corresponding to the same constraint condition, obtaining noise-reduced voice signals. The error between each noise-reduced voice signal and the clean voice signal sample corresponding to the training voice signal sample is determined, and the parameters of the noise reduction model corresponding to the constraint condition are updated according to the error; the trained noise reduction model is used to perform noise reduction processing on voice signals under the corresponding constraint condition.
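The per-constraint training loop above can be sketched as follows. For brevity, the "model" is reduced to a single learnable gain vector trained by gradient descent on the mean squared error between the denoised output and the clean reference; this is a deliberate simplification of the GRU model of fig. 4, and the synthetic data (a fixed 1.5x distortion standing in for noise) is an illustrative assumption:

```python
import numpy as np

def train_constraint_model(noisy_samples, clean_samples, lr=0.02, epochs=300):
    """Fit one per-constraint noise reduction 'model' (a per-dimension gain
    vector) by gradient descent on the mean squared error between the
    denoised output and the corresponding clean voice signal sample."""
    gain = np.ones(noisy_samples[0].shape[0])     # initialized parameters
    for _ in range(epochs):
        for noisy, clean in zip(noisy_samples, clean_samples):
            err = gain * noisy - clean            # denoised minus clean
            gain -= lr * err * noisy              # gradient of 0.5 * ||err||^2
    return gain

rng = np.random.default_rng(1)
clean = [rng.normal(size=8) for _ in range(20)]   # clean reference samples
noisy = [c * 1.5 for c in clean]                  # noisy samples for one constraint
model = train_constraint_model(noisy, clean)      # optimal gain is 1 / 1.5
```

One such model would be trained per constraint condition, each on the sample set collected under that condition; the update rule is the exact gradient of the squared error for this simplified model.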
In some embodiments, each noise reduction model is adapted to at least one constraint condition, and the constraint conditions adapted to different noise reduction models are different. Before the constraint condition corresponding to the voice signal is acquired, the following technical solution may be executed: acquiring training voice signal sample sets in one-to-one correspondence with a plurality of constraint conditions; training, based on the training voice signal sample sets in one-to-one correspondence with the plurality of constraint conditions, a plurality of original noise reduction models in one-to-one correspondence with the plurality of constraint conditions; and clustering the plurality of original noise reduction models to obtain at least one clustered noise reduction model, where each clustered noise reduction model corresponds to at least one constraint condition.
As an example, the noise reduction models obtained by training based on the plurality of constraint conditions and in one-to-one correspondence with them are used as original noise reduction models, and clustering processing is performed on the original noise reduction models to obtain at least one clustered noise reduction model. After the clustering processing, one noise reduction model can perform noise reduction for a plurality of constraint conditions; for example, the noise reduction model corresponding to the constraint condition of an airport from 8 to 10 am can also be used to process voice signals under the constraint condition of the airport from 10 to 12 am. Clustering the noise reduction models effectively reduces the occupation of storage resources and also improves the utilization rate of the noise reduction models.
In some embodiments, the clustering processing performed on the plurality of noise reduction models to obtain at least one clustered noise reduction model may be implemented by the following technical solution: acquiring a test voice signal sample set corresponding to each constraint condition; for each test voice signal sample set, performing the following processing: performing noise reduction processing on the test voice signal samples in the set through the plurality of noise reduction models to obtain noise reduction results (that is, noise-reduced voice signal samples) in one-to-one correspondence with the plurality of noise reduction models; and clustering the plurality of noise reduction models according to the noise reduction results obtained for each test voice signal sample set, in one-to-one correspondence with the plurality of noise reduction models, to obtain at least one clustered noise reduction model.
As an example, M groups of test voice signal sample sets are used as test sets, and each test set corresponds to a trained noise reduction model (not yet subjected to clustering processing). For a given test set, the M trained noise reduction models are used to perform noise reduction processing on the test voice signal samples of that test set, obtaining M noise reduction results (each corresponding to a different noise reduction model). The plurality of noise reduction models are then clustered according to the noise reduction results obtained for each test voice signal sample set, in one-to-one correspondence with the models, to obtain at least one clustered noise reduction model.
In some embodiments, clustering the plurality of noise reduction models according to the noise reduction results obtained for each test voice signal sample set, in one-to-one correspondence with the plurality of noise reduction models, to obtain at least one clustered noise reduction model may be implemented by the following technical solution: for each test voice signal sample set, performing the following processing: determining the plurality of noise reduction results obtained after the plurality of noise reduction models perform noise reduction processing on the test voice signal samples in the set; determining the minimum mean square error of each noise reduction result, where a noise reduction result comprises the noise-reduced voice signal samples of the plurality of test voice signal samples in the set; sorting the plurality of noise reduction models in ascending order based on the minimum mean square error of their noise reduction results, and taking at least one top-ranked noise reduction model as a candidate noise reduction model corresponding to the test voice signal sample set; and extracting intersection noise reduction models from the candidate noise reduction models corresponding to the test voice signal sample sets, so as to take each intersection noise reduction model as a clustered noise reduction model, where a candidate noise reduction model serving as an intersection noise reduction model corresponds to a plurality of test voice signal sample sets.
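The candidate-selection and intersection-extraction procedure above can be sketched as follows; the error values, the error threshold, and K are illustrative assumptions rather than parameters specified by the embodiment:

```python
def cluster_by_intersection(mse, k=2, err_threshold=1.0):
    """mse[s][m]: minimum mean square error of noise reduction model m on
    test voice signal sample set s. Returns the intersection models, i.e.
    candidates shared by at least two test sets."""
    candidates = {}
    for s, errors in mse.items():
        # top-K models under the error threshold, ascending by MMSE
        passing = sorted((e, m) for m, e in errors.items() if e < err_threshold)[:k]
        if not passing:
            # no model beats the threshold: keep the single best model
            passing = [min((e, m) for m, e in errors.items())]
        candidates[s] = {m for _, m in passing}
    all_models = set().union(*candidates.values())
    return {m for m in all_models
            if sum(m in c for c in candidates.values()) >= 2}

# Hypothetical MMSE matrix for three test sets (A, B, C) and models (a, b, c).
mse = {
    "A": {"a": 0.2, "b": 0.4, "c": 2.5},
    "B": {"a": 0.3, "b": 0.5, "c": 0.6},
    "C": {"a": 1.8, "b": 1.9, "c": 0.1},
}
shared = cluster_by_intersection(mse)
```

Here sets A and B both list models a and b as candidates, so a and b are extracted as intersection models, while model c remains the sole candidate for set C.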
As an example, carrying on the above embodiment with the minimum mean square error as the evaluation criterion, the M noise reduction results are sorted. Each noise reduction result comprises the noise-reduced test speech signals produced by one noise reduction model for all test speech signal samples in a given test speech signal sample set; if the set includes 10 test speech signal samples, 10 noise-reduced test speech signals are obtained, and their minimum mean square error is taken as the minimum mean square error of that noise reduction model on that set. Based on the M noise reduction models, M minimum mean square errors are thus obtained for the set. The noise reduction models corresponding to the first K noise reduction results whose minimum mean square error is below an error threshold are selected as the candidate noise reduction models of the set; if fewer than K results fall below the threshold, the actual number is used; if no result falls below the threshold, the noise reduction model with the smallest minimum mean square error is selected as the only candidate. If the candidate noise reduction models of the M test speech signal sample sets have an intersection, for example, test speech signal sample set A has two candidate noise reduction models (a, b) and test speech signal sample set B has three candidate noise reduction models (a, b, c), then (a, b) are taken as the intersection noise reduction models of sets A and B. The sets sharing an intersection uniformly use the intersection model with the smallest comprehensive error as the final clustered noise reduction model for sets A and B; when only one intersection noise reduction model exists between sets A and B, it is directly used as the final clustered noise reduction model. The comprehensive error may be the minimum mean square error or another evaluation parameter, and is the combined noise reduction evaluation result of the plurality of intersection noise reduction models over the plurality of test speech signal sample sets.
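The candidate selection and intersection steps above can be sketched in a few lines of Python. This is a minimal illustration with invented helper names (`mmse`, `candidate_models`, `cluster_by_intersection`); the patent does not prescribe these interfaces.

```python
import numpy as np

def mmse(denoised, clean):
    """Mean squared error between a noise-reduced signal and its clean reference."""
    return float(np.mean((np.asarray(denoised) - np.asarray(clean)) ** 2))

def candidate_models(results, clean_refs, k, err_threshold):
    """results: {model_id: [denoised signals]} for one test set.
    Returns up to k model ids whose average error over the set is below
    err_threshold; if none qualify, falls back to the single best model."""
    errors = {m: np.mean([mmse(d, c) for d, c in zip(sigs, clean_refs)])
              for m, sigs in results.items()}
    ranked = sorted(errors, key=errors.get)          # ascending error
    under = [m for m in ranked if errors[m] < err_threshold]
    return under[:k] if under else ranked[:1]

def cluster_by_intersection(candidates_per_set):
    """candidates_per_set: {set_id: [candidate model ids]}.
    Model ids shared by all the given test sets form the intersection models."""
    sets = list(candidates_per_set.values())
    shared = set(sets[0])
    for s in sets[1:]:
        shared &= set(s)
    return shared
```

In the set A / set B example from the text, `cluster_by_intersection({"A": ["a", "b"], "B": ["a", "b", "c"]})` yields the intersection `{"a", "b"}`, from which the model with the smallest comprehensive error would then be kept.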
In some embodiments, clustering the plurality of original noise reduction models to obtain at least one clustered noise reduction model may be implemented by the following technical solution: cluster the plurality of constraint conditions, treated as original constraint conditions, to obtain at least one clustering constraint condition; acquire the test speech signal sample sets corresponding one to one to the original constraint conditions and the original noise reduction models corresponding one to one to the original constraint conditions; fuse the test speech signal sample sets of the original constraint conditions to obtain the test speech signal sample set corresponding to the clustering constraint condition; perform noise reduction processing on the test speech signal samples in that set through the plurality of noise reduction models to obtain noise reduction results corresponding one to one to the noise reduction models; and acquire the minimum mean square error of each noise reduction result, determining the noise reduction model with the smallest minimum mean square error as a clustered noise reduction model.
As an example, and differently from the foregoing embodiment, the plurality of constraint conditions may first be clustered as original constraint conditions to obtain at least one clustering constraint condition, after which the noise reduction model corresponding to each clustering constraint condition is acquired as a clustered noise reduction model. Assume M original constraint conditions, M corresponding test speech signal sample sets, and M corresponding original noise reduction models; after clustering the constraint conditions, N clustering constraint conditions are obtained, and thus N test speech signal sample sets. The test speech signal samples in the set of a given clustering constraint condition are noise-reduced by the M noise reduction models, producing M noise reduction results; the minimum mean square error of each result is acquired, and the noise reduction model with the smallest error is determined as the noise reduction model of that clustering constraint condition (the clustered noise reduction model). The noise reduction model of each clustering constraint condition is obtained in turn, and the clustered noise reduction models of different clustering constraint conditions may intersect: if the noise reduction model of clustering constraint condition A is a and the noise reduction model of clustering constraint condition B is also a, the noise reduction model a is automatically multiplexed by the two clustering constraint conditions. Clustering the noise reduction models thus effectively reduces the occupation of storage resources and improves the utilization rate of the noise reduction models.
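The constraint-first variant reduces to pooling the merged test set and picking the model with minimum error on it. A minimal sketch, assuming each model is a callable mapping a noisy signal to a denoised one (the function name and data layout are illustrative):

```python
import numpy as np

def best_model_for_cluster(models, noisy_pool, clean_pool):
    """models: {model_id: callable noisy_signal -> denoised_signal}.
    noisy_pool / clean_pool: parallel lists of signals fused from all the
    original constraint conditions merged into one clustering constraint.
    Returns the id of the model with minimum mean square error on the pool."""
    def pooled_error(fn):
        return np.mean([np.mean((fn(x) - c) ** 2)
                        for x, c in zip(noisy_pool, clean_pool)])
    return min(models, key=lambda m: pooled_error(models[m]))
```

Two clustering constraint conditions that end up selecting the same model id would, as the text notes, simply share (multiplex) that one stored model.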
In some embodiments, acquiring the training speech signal sample sets corresponding one to one to the plurality of constraint conditions may be implemented by the following technical solution: acquire a plurality of noises carrying multiple kinds of attribute information, where the attribute information includes the time of sending or receiving the voice signal, the geographical location of sending or receiving the voice signal, the user sending or receiving the voice signal, and the environment of sending or receiving the voice signal; divide the plurality of noises according to the multiple kinds of attribute information to obtain noise sets corresponding one to one to the plurality of constraint conditions, where the noises under each constraint condition share the same multiple kinds of attribute information; and superimpose the noise of each noise set on clean speech signal samples to obtain the training speech signal sample set corresponding to each constraint condition.
In some embodiments, superimposing the noise of each noise set on the clean speech signal samples to obtain the training speech signal sample set corresponding to each constraint condition may be implemented by the following technical solution: acquire the weight of the clean speech signal sample and the weight of the noise; weight the clean speech signal sample and the noise according to these weights to obtain a training speech signal sample; add a clean speech signal or noise on the basis of the training speech signal sample to obtain a new training speech signal sample; and form the training speech signal sample set from the training speech signal samples and the new training speech signal samples.
Taking the time information and the geographical location information as an example, the two form corresponding constraint conditions as two-dimensional information. All noises are divided into M groups (M being an integer greater than or equal to 2) according to time and geographical location; for example, the first group of noises corresponds to 10:00 to 11:00 in the morning at geographical coordinate GPS0, the second group to 11:00 to 12:00 in the morning at GPS0, and so on. Each group of noises is linearly superimposed on clean speech signals with different weighting values to obtain a training speech signal sample set of a certain scale for training the noise reduction models, and the noise reduction model corresponding to each time period and geographical location (constraint condition) is obtained by training each noise reduction model separately. From the M groups of noise sample sets (each group corresponding to a different time and geographical location, that is, a different constraint condition), M training speech signal sample sets are constructed. Larger-scale training speech signal samples can be generated by linear superposition with different parameters (for example, a weighting value of 0.7 for the clean speech signal and 0.3 for the noise); as the scale of the training speech signal sample set grows, the deep network of the noise reduction model has better adaptability to the constraint conditions and can better identify and suppress noise under different constraint conditions.
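The linear superposition described above can be sketched directly. The function names and the 0.7/0.3 defaults mirror the example weights in the text; crossing clean signals, noises, and weight pairs is one plausible way to enlarge the set, not a prescribed procedure.

```python
import numpy as np

def synthesize_sample(clean, noise, w_clean=0.7, w_noise=0.3):
    """Linear superposition: noisy training input = w_clean*clean + w_noise*noise.
    The clean signal itself serves as the expected (label) output."""
    n = min(len(clean), len(noise))
    return w_clean * clean[:n] + w_noise * noise[:n]

def build_training_set(clean_signals, noise_set, weight_pairs):
    """Cross every clean signal with every noise in one constraint condition's
    noise set and every weight pair, producing (noisy_input, clean_label)
    training pairs of a larger scale."""
    return [(synthesize_sample(c, n, wc, wn), c)
            for c in clean_signals
            for n in noise_set
            for wc, wn in weight_pairs]
```

Each constraint condition's noise set would be fed through `build_training_set` with its own clean corpus, yielding the M training speech signal sample sets.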
In some embodiments, training, based on the training speech signal sample sets corresponding one to one to the plurality of constraint conditions, to obtain the plurality of noise reduction models corresponding one to one to the plurality of constraint conditions may be implemented by the following technical solution: perform noise reduction processing on the training speech signal samples in the training speech signal sample set through the noise reduction model to obtain noise-reduced speech signals corresponding to the training speech signal samples; determine the error between the noise-reduced speech signal corresponding to a training speech signal sample and the clean speech signal sample, and substitute the error into the loss function of the noise reduction model; and determine, based on the learning rate of the noise reduction model, the parameter change values at which the loss function attains its minimum, and update the parameters of the noise reduction model based on those parameter change values.
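The update rule (error into loss, parameter change scaled by the learning rate) can be illustrated with a deliberately tiny stand-in model: a single scalar gain g as the entire "noise reduction model", with MSE loss. This is only a sketch of the gradient step, not the patent's network.

```python
import numpy as np

def train_gain(noisy, clean, lr=0.01, steps=500):
    """Toy gradient descent: loss = mean((g*noisy - clean)^2).
    The parameter change is the gradient scaled by the learning rate lr."""
    g = 0.0
    for _ in range(steps):
        grad = 2.0 * np.mean((g * noisy - clean) * noisy)  # d(loss)/dg
        g -= lr * grad                                      # parameter update
    return g
```

With `clean = 0.5 * noisy`, the loss is minimized at g = 0.5, and the iteration converges there, mirroring how the real network's parameters move toward the loss minimum.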
In some embodiments, performing noise reduction processing on the training speech signal samples in the training speech signal sample set through the noise reduction model may be implemented by the following technical solution: extract speech features from the training speech signal sample; perform first full-connection processing on the speech features through the first fully connected layer of the voice endpoint network to obtain a first full-connection processing result; perform voice endpoint detection processing on the first full-connection processing result through the first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result; predict the noise spectrum features of the training speech signal sample through the second gated recurrent unit of the noise spectrum estimation network, taking the speech features, the first full-connection processing result, and the voice endpoint detection result as input; and predict the gain of the training speech signal sample through the third gated recurrent unit of the noise spectrum removal network, taking the noise spectrum features, the voice endpoint detection result, and the speech features as input, and apply the gain to the training speech signal sample.
As an example, the noise reduction model includes a voice endpoint network, a noise spectrum estimation network, and a noise spectrum removal network. Speech features are extracted from the time-frequency domain data of the training speech signal sample, such as the power spectrum, pitch period, and voice endpoints. The first fully connected layer of the voice endpoint network performs first full-connection processing on the speech features to obtain the first full-connection processing result of the training speech signal sample; the first gated recurrent unit of the voice endpoint network then performs voice endpoint detection processing on that result to obtain the voice endpoint detection result of the training speech signal sample. The speech features, the first full-connection processing result, and the voice endpoint detection result of the training speech signal sample are input to the second gated recurrent unit of the noise spectrum estimation network, which predicts the noise spectrum features of the training speech signal sample. The noise spectrum features, the voice endpoint detection result, and the speech features of the training speech signal sample are then input to the third gated recurrent unit of the noise spectrum removal network, which predicts the gain of the training speech signal sample; applying this gain to the training speech signal sample yields the noise-reduced speech signal of the training speech signal sample. The error between the noise-reduced speech signal and the clean speech signal sample is determined and substituted into the loss function of the noise reduction model; based on the learning rate of the noise reduction model, the parameter change values at which the loss function attains its minimum are determined, and the parameters of the noise reduction model are updated based on those parameter change values.
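The data flow of the three sub-networks can be sketched structurally in numpy. All dimensions and weights below are illustrative placeholders (the patent specifies no sizes), and the minimal GRU cell stands in for the trained gated recurrent units; the point is only the wiring of inputs between the three branches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal gated recurrent unit with random weights, standing in for the
    trained gating units of the three sub-networks."""
    def __init__(self, in_dim, hid_dim, rng):
        s = 1.0 / np.sqrt(hid_dim)
        self.Wz = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wr = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wh = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.h = np.zeros(hid_dim)

    def step(self, x):
        xh = np.concatenate([x, self.h])
        z = sigmoid(self.Wz @ xh)                                  # update gate
        r = sigmoid(self.Wr @ xh)                                  # reset gate
        h_new = np.tanh(self.Wh @ np.concatenate([x, r * self.h]))
        self.h = (1 - z) * self.h + z * h_new
        return self.h

class NoiseReducer:
    """Structural sketch: voice endpoint branch (FC layer + first GRU),
    noise spectrum estimation branch (second GRU), and noise spectrum
    removal branch (third GRU) predicting per-band gains."""
    def __init__(self, feat_dim=8, hid=4, seed=0):
        rng = np.random.default_rng(seed)
        self.fc_w = rng.uniform(-0.5, 0.5, (hid, feat_dim))    # first FC layer
        self.vad_gru = GRUCell(hid, 1, rng)                    # endpoint GRU
        self.spec_gru = GRUCell(feat_dim + hid + 1, hid, rng)  # noise spectrum
        self.gain_gru = GRUCell(hid + 1 + feat_dim, feat_dim, rng)

    def step(self, features):
        fc = np.tanh(self.fc_w @ features)         # first full-connection result
        vad = sigmoid(self.vad_gru.step(fc))       # voice endpoint detection
        spec = self.spec_gru.step(np.concatenate([features, fc, vad]))
        gains = sigmoid(self.gain_gru.step(np.concatenate([spec, vad, features])))
        return gains                               # gain applied per frequency band
```

Per frame, the returned gains would be multiplied onto the frame's frequency bands to suppress the estimated noise spectrum.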
Referring to fig. 3D, fig. 3D is a schematic flowchart of a training method of a noise reduction model based on artificial intelligence according to an embodiment of the present application, and the steps shown in fig. 3D will be described.
In step 201, a plurality of training speech signal samples carrying noise under a plurality of constraint conditions are obtained to form training speech signal sample sets corresponding one to one to the plurality of constraint conditions, and noise reduction models corresponding one to one to the plurality of constraint conditions are initialized;
in step 202, noise reduction processing is performed, through the noise reduction models corresponding one to one to the plurality of constraint conditions, on the training speech signal samples in the corresponding training speech signal sample sets to obtain noise-reduced speech signals;
in step 203, determining an error between the noise-reduced speech signal and a clean speech signal sample corresponding to the training speech signal sample, and updating parameters of the noise-reduced model corresponding to the constraint conditions according to the error;
and the noise reduction model is used for carrying out noise reduction processing on the voice signals in the corresponding constraint conditions.
For the specific implementation of steps 201 to 203, reference may be made to the embodiments corresponding to steps 101 to 103.
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
A plurality of noise reduction models are trained separately on different sound samples collected under two kinds of attribute information, namely system time and geographical location, and the trained noise reduction models are clustered to obtain deep-network noise reduction models adapted to different constraint conditions (time and geographical location), thereby effectively improving the noise reduction effect and noise reduction efficiency.
Referring to fig. 5A-5B, fig. 5A-5B are schematic diagrams of constraint conditions of the artificial intelligence based voice noise reduction method provided by an embodiment of the present application. Fig. 5A shows four scenarios: a street from 8 am to 10 am, an airport from 8 am to 10 am, an office from 7 pm to 9 pm, and a subway station from 3 pm to 5 pm; each combination of place and time has its own noise reduction requirement and noise characteristics and corresponds to one constraint condition. Constraint condition identification is performed on the two-dimensional information consisting of the geographical location information and the time information, identifying constraint conditions 1-4 and correspondingly calling noise reduction models A-D.
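The dispatch from a (location, time) pair to a model can be expressed as a lookup table. The place names, time windows, and model names below are taken from the fig. 5A example; the table structure itself is an illustrative assumption.

```python
from datetime import time

# Hypothetical mapping mirroring fig. 5A's four scenarios:
# (place, time window) -> constraint id, then constraint id -> model.
CONSTRAINTS = {
    ("street", (time(8, 0), time(10, 0))): 1,
    ("airport", (time(8, 0), time(10, 0))): 2,
    ("office", (time(19, 0), time(21, 0))): 3,
    ("subway_station", (time(15, 0), time(17, 0))): 4,
}
MODELS = {1: "model_A", 2: "model_B", 3: "model_C", 4: "model_D"}

def select_model(place, t):
    """Identify the constraint condition from the two-dimensional
    (location, time) information and return the adapted noise reduction
    model, or None when no constraint condition matches."""
    for (p, (start, end)), cid in CONSTRAINTS.items():
        if p == place and start <= t < end:
            return MODELS[cid]
    return None
```

A call at 9 am on the street would thus be routed to noise reduction model A, while an unrecognized place-time pair falls through to a default (here `None`).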
Referring to fig. 6A, fig. 6A is a schematic flowchart of the speech noise reduction method based on artificial intelligence provided in an embodiment of the present application; referring to fig. 6B, fig. 6B is a schematic flowchart of the training method of the noise reduction model based on artificial intelligence provided in an embodiment of the present application. Noise reduction model training is performed first, and the noise reduction model is then called for inference. The training stage may run on a server and finally yields N types of noise reduction models (N being an integer greater than or equal to 2); the inference stage may run on a terminal in real time, or on the server, to obtain noise-reduced speech signals.
In some embodiments, a training speech signal sample set is first obtained for training. The training speech signal sample set needs to be labeled in advance, that is, the attribute of each frame of signal (noise or speech) must be confirmed, and the training process needs an expected output result (the clean speech with noise suppressed in the ideal case) as a guide for parameter optimization of the noise reduction model. Since noise reduction model training needs a large number of labeled samples and expected output results, the training speech signal sample set is synthesized in a structured manner to meet the requirements of a trainable model network: the clean speech signal is linearly superimposed with noise, so that the synthesized noisy speech signal serves as the training input sample and the clean speech signal before synthesis serves as the expected output. By using different parameters (for example, a weighting value of 0.7 for the clean speech signal and 0.3 for the noise in a single frame signal), a larger scale of training speech signal samples can be generated, sufficiently supporting the training process of the noise reduction model; this sample construction manner also provides the expected data corresponding to each sample to guide the optimization training of the noise reduction model.
In some embodiments, the system time information and the geographical coordinate information may be read from the terminal, and the corresponding constraint condition is formed from this two-dimensional information. All noises are divided into M groups (M being an integer greater than or equal to 2): the collected noise set is divided into M groups of different samples according to time and coordinates, for example, the first group of noise samples corresponds to 10:00 to 11:00 in the morning at geographical coordinate GPS0, the second group to 11:00 to 12:00 in the morning at GPS0, and so on. Each group of noise is linearly superimposed on clean speech signals (which may be purchased or found on open source websites) with different weighting values to obtain a training speech signal sample set of a certain scale for noise reduction model training, and the noise reduction model corresponding to each time period and geographical location (constraint condition) is obtained by training each noise reduction model separately. For example, from M groups of noise sample sets (each group corresponding to a different time and geographical location, that is, a different constraint condition), M training speech signal sample sets are constructed and M noise reduction models are trained; the noise reduction models may be designed with the same structure or different structures. As the scale and parameters of the training speech signal sample sets increase, the deep network of the noise reduction model has better adaptability to the constraint conditions and can better identify and suppress noise under different constraint conditions.
In some embodiments, to reduce the number of models and the storage resources, the noise reduction models may be clustered into N types of noise reduction models (M >= N), where the N types correspond to constraint conditions obtained by combining the M different time-and-coordinate pairs. In practical application, for example, when the user talks between 11:00 and 12:00 in the morning at geographical coordinate GPS0, the server calls the noise reduction model corresponding to the second set of training speech signal samples to perform real-time noise reduction processing. Referring to fig. 4, fig. 4 is a schematic diagram of the noise reduction model of the artificial intelligence based speech noise reduction method provided by an embodiment of the present application; the noise reduction model shown in fig. 4 performs noise reduction processing on the training speech signal samples, that is, it accurately identifies noise and speech signals, and the noise is further suppressed by spectral subtraction or Wiener filtering post-processing to obtain the noise-reduced speech signal (clean speech signal). The inference stage is the noise reduction process during the actual call: the trained noise reduction model (whose network unit parameters have already reached a relatively optimal solution during training) performs speech or noise estimation on the collected noisy speech signal and finally suppresses the noise components therein to obtain a cleaner speech signal.
In some embodiments, for noise reduction model clustering, to reduce the number of models and the consumption of storage resources, a noise reduction model clustering method is proposed that compresses M noise reduction models into N noise reduction models (M >= N). The process is as follows: first, M groups of noisy sample sets are prepared as noise reduction test sets, corresponding to the M time-and-geographical-coordinate pairs; the number of samples need not be large. The M trained noise reduction models are then run on each group of noisy test samples (AI noise reduction processing) to obtain M noise reduction results, which are sorted using the minimum mean square error as the evaluation criterion. The first K model results with error values below a preset error threshold are selected as the candidate noise reduction models of that test sample set; if fewer than K models satisfy the error condition, the actual number is used, and if all model errors exceed the preset threshold, the model with the smallest error is selected as the only candidate noise reduction model of the sample set. Finally, if the candidate noise reduction models of the M sample sets have intersection models, the intersection model with the smallest comprehensive error is uniformly used as the final noise reduction model of those candidate sets. This clustering method realizes the compression of the number of models.
Continuing with the exemplary structure of the artificial intelligence based speech noise reduction apparatus 255 provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based speech noise reduction apparatus 255-1 of the memory 250 may include: an obtaining module 2551, configured to obtain a voice signal; an obtaining module 2552, configured to obtain the constraint condition corresponding to the voice signal; and a noise reduction module 2553, configured to call the noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise-reduced voice signal.
In some embodiments, the obtaining module 2552 is further configured to acquire the constraint condition corresponding to the voice signal before the voice signal is acquired, where the constraint condition includes at least one of a scheduled time and a scheduled place of the voice communication; the noise reduction module 2553 is further configured to acquire, before the voice signal is acquired, the noise reduction model adapted to the constraint condition in advance, and, when the constraint condition is satisfied, automatically call the noise reduction model adapted to the constraint condition to perform noise reduction processing on the acquired voice signal to obtain the noise-reduced voice signal.
In some embodiments, the obtaining module 2552 is further configured to acquire, before the noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the voice signal, a plurality of noise reduction models, where the plurality of noise reduction models is the set of all callable noise reduction models or the set of noise reduction models adapted to the constraint condition; present options of the plurality of noise reduction models in a human-computer interaction interface; and, in response to a selection operation, take the selected noise reduction model as the noise reduction model adapted to the constraint condition.
In some embodiments, the obtaining module 2552 is further configured to: acquiring at least one of the following attribute information of the voice signal: time information of sending or receiving voice signals, geographical location information of sending or receiving voice signals, user information of sending or receiving voice signals, and environment information of sending or receiving voice signals; and determining the constraint condition corresponding to the attribute information of the voice signal according to the preset corresponding relation between different attribute information and different constraint conditions.
In some embodiments, the noise reduction module 2553 is further configured to: extract speech features from the voice signal; perform first full-connection processing on the speech features of the voice signal through the first fully connected layer of the voice endpoint network to obtain the first full-connection processing result of the voice signal; perform voice endpoint detection processing on the first full-connection processing result through the first gated recurrent unit of the voice endpoint network to obtain the voice endpoint detection result of the voice signal; predict the noise spectrum features of the voice signal through the second gated recurrent unit of the noise spectrum estimation network, taking the speech features, the first full-connection processing result, and the voice endpoint detection result of the voice signal as input; and predict the gain of the voice signal through the third gated recurrent unit of the noise spectrum removal network, taking the noise spectrum features, the voice endpoint detection result, and the speech features of the voice signal as input, and apply the gain to the voice signal to obtain the noise-reduced voice signal.
In some embodiments, the noise reduction module 2553 is further configured to: after calling the noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal to obtain the noise-reduced voice signal, determine a noise reduction effect parameter of the noise-reduced voice signal; and, when the noise reduction effect parameter is below a noise reduction effect parameter threshold, determine the similarity between each of a plurality of other constraint conditions and the constraint condition corresponding to the voice signal, determine the other constraint condition with the highest similarity, and call its corresponding noise reduction model to perform noise reduction processing on the noise-reduced voice signal to obtain an updated noise-reduced voice signal, where the other constraint conditions are different from the constraint condition corresponding to the voice signal.
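The fallback step can be sketched as a similarity search over the other constraint conditions. The patent does not define the similarity measure; counting matching attribute entries is one hedged assumption, and all names below are illustrative.

```python
def attribute_similarity(a, b):
    """Illustrative similarity: number of matching attribute entries
    (time, place, etc.) between two constraint condition descriptions."""
    return sum(1 for k in a if k != "model" and a.get(k) == b.get(k))

def fallback_model(current, others):
    """When the noise reduction effect parameter falls below its threshold,
    return the model of the other constraint condition most similar to the
    current one, for a second noise reduction pass."""
    return max(others, key=lambda o: attribute_similarity(current, o))["model"]
```

The returned model would then be applied to the already noise-reduced signal to produce the updated noise-reduced voice signal described above.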
In some embodiments, each noise reduction model is adapted to one constraint, and different noise reduction models are adapted to different constraints; the apparatus further comprises a training module 2554, configured to, before obtaining the constraint condition of the corresponding speech signal, obtain a training speech signal sample set corresponding to the multiple constraint conditions one to one; and training to obtain a plurality of noise reduction models which are in one-to-one correspondence with the plurality of constraint conditions based on the training voice signal sample set in one-to-one correspondence with the plurality of constraint conditions.
In some embodiments, each noise reduction model is adapted to at least one constraint condition, and different noise reduction models are adapted to different constraint conditions; the training module 2554 is further configured to, before the constraint condition corresponding to the voice signal is obtained, acquire training speech signal sample sets corresponding one to one to the plurality of constraint conditions; train, based on the training speech signal sample sets corresponding one to one to the plurality of constraint conditions, to obtain a plurality of noise reduction models corresponding one to one to the plurality of constraint conditions; and cluster the plurality of noise reduction models to obtain at least one clustered noise reduction model, where each clustered noise reduction model corresponds to at least one constraint condition.
In some embodiments, the training module 2554 is further configured to obtain a set of test speech signal samples corresponding to each constraint; for each set of test speech signal samples, performing the following: carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set through the plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one to one; and clustering the plurality of noise reduction models according to the noise reduction results which are obtained by aiming at each test voice signal sample set and correspond to the plurality of noise reduction models one by one to obtain at least one clustered noise reduction model.
In some embodiments, the training module 2554 is further configured to perform the following for each test voice signal sample set: determine the noise reduction results obtained after the plurality of noise reduction models perform noise reduction processing on the test voice signal samples in the set; determine the minimum mean square error of each noise reduction result, wherein a noise reduction result comprises the noise-reduced voice signal samples of the test voice signal samples in the set; sort the noise reduction models in ascending order of minimum mean square error, and take at least one front-ranked noise reduction model as at least one candidate noise reduction model of the corresponding test voice signal sample set; and extract an intersection noise reduction model from the candidate noise reduction models of all test voice signal sample sets, the intersection noise reduction model serving as a clustered noise reduction model. A candidate noise reduction model qualifies as an intersection noise reduction model when it corresponds to a plurality of test voice signal sample sets.
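The rank-and-intersect step above can be sketched in a few lines. The MSE matrix, the number of models and test sets, and the top-K cutoff below are all hypothetical illustrations, not values from this application:

```python
import numpy as np

# Hypothetical minimum-mean-square-error scores: one row per noise
# reduction model, one column per test voice signal sample set.
# Lower is better; all numbers are illustrative.
mse = np.array([
    [0.10, 0.12, 0.40],   # model 0
    [0.30, 0.50, 0.11],   # model 1
    [0.12, 0.13, 0.12],   # model 2
    [0.90, 0.80, 0.85],   # model 3
])

TOP_K = 2  # how many front-ranked models are kept as candidates per set

# For each test set, sort models in ascending order of MSE and keep the
# top K as candidate noise reduction models of that set.
candidates = [set(np.argsort(mse[:, j])[:TOP_K]) for j in range(mse.shape[1])]

# The intersection across all test sets yields the clustered noise
# reduction model(s): models that rank well under several constraints.
clustered = {int(m) for m in set.intersection(*candidates)}
print(sorted(clustered))  # → [2]
```

Only model 2 stays in the top 2 for every test set here, so it alone survives the intersection and becomes the clustered model.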
In some embodiments, the training module 2554 is further configured to: cluster the plurality of constraint conditions, taken as original constraint conditions, to obtain at least one clustering constraint condition; obtain test voice signal sample sets in one-to-one correspondence with the plurality of original constraint conditions, and noise reduction models in one-to-one correspondence with the original constraint conditions; fuse the test voice signal sample sets corresponding to the original constraint conditions into a test voice signal sample set corresponding to the clustering constraint condition; perform noise reduction processing on the test voice signal samples in that set through the plurality of noise reduction models to obtain noise reduction results in one-to-one correspondence with the plurality of noise reduction models; and obtain the minimum mean square error of each noise reduction result, determining the noise reduction model with the smallest minimum mean square error as a clustered noise reduction model.
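The fuse-then-select procedure can be illustrated with a deliberately degenerate sketch: two original constraints "A" and "B" are assumed already clustered together, each toy "model" is just a fixed gain, and the clean targets are taken equal to the test samples. None of these names or values come from the application:

```python
import numpy as np

# Hypothetical per-constraint test sets for original constraints A and B.
test_sets = {
    "A": [np.full(8, 1.0)],
    "B": [np.full(8, 2.0)],
}
fused = test_sets["A"] + test_sets["B"]   # fusion of the per-constraint sets
clean = fused                             # toy clean targets for illustration

models = {0: 0.5, 1: 0.95, 2: 1.3}        # model id -> gain (illustrative)

# Denoise the fused set with every model and compute each model's MSE.
mse = {}
for m, gain in models.items():
    results = [gain * s for s in fused]   # noise reduction results
    mse[m] = float(np.mean([np.mean((r - c) ** 2)
                            for r, c in zip(results, clean)]))

# The model with the smallest error becomes the clustered model for the
# clustering constraint condition.
clustered_model = min(mse, key=mse.get)
print(clustered_model)  # → 1
```

Model 1's gain of 0.95 is closest to the identity, so it wins on the fused set.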
In some embodiments, the training module 2554 is further configured to obtain a plurality of noises carrying multiple kinds of attribute information, wherein the attribute information includes: time information of sending or receiving the voice signal, geographical location information of sending or receiving the voice signal, user information of sending or receiving the voice signal, and environment information of sending or receiving the voice signal; divide the plurality of noises according to the attribute information to obtain noise sets in one-to-one correspondence with a plurality of constraint conditions, wherein the noises belonging to one constraint condition share the same multiple kinds of attribute information; and superimpose the noises of each noise set on clean voice signal samples to obtain a training voice signal sample set corresponding to each constraint condition.
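A minimal sketch of the divide-and-superimpose step follows. The attribute field names ("time", "env"), their values, and the clip length are hypothetical, chosen only to make the grouping visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise clips tagged with attribute information.
noises = [
    {"time": "day",   "env": "street", "samples": rng.standard_normal(160)},
    {"time": "day",   "env": "street", "samples": rng.standard_normal(160)},
    {"time": "night", "env": "office", "samples": rng.standard_normal(160)},
]

# Divide the noises by attribute information: each distinct attribute
# tuple plays the role of one constraint condition, and the noises in
# one set share the same attribute values.
noise_sets = {}
for n in noises:
    key = (n["time"], n["env"])
    noise_sets.setdefault(key, []).append(n["samples"])

clean = rng.standard_normal(160)  # a clean voice signal sample

# Superimpose the noises of each set on the clean sample to obtain the
# training voice signal sample set of each constraint condition.
training_sets = {
    key: [clean + noise for noise in clips]
    for key, clips in noise_sets.items()
}
print({key: len(samples) for key, samples in training_sets.items()})
# → {('day', 'street'): 2, ('night', 'office'): 1}
```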
In some embodiments, the training module 2554 is further configured to obtain a weight of the clean voice signal sample and a weight of the noise; weight the clean voice signal sample and the noise according to those weights to obtain a training voice signal sample; add further clean voice signal or noise to the training voice signal sample to obtain a new training voice signal sample; and form a training voice signal sample set from the training voice signal sample and the new training voice signal sample.
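The weighted mixing can be sketched as below; the concrete weights 0.8, 0.5 and 0.3 and the 160-sample clip length are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.standard_normal(160)   # clean voice signal sample
noise = rng.standard_normal(160)   # one noise clip from a noise set

# Weights of the clean sample and of the noise (illustrative values).
w_clean, w_noise = 0.8, 0.5
sample = w_clean * clean + w_noise * noise   # weighted training sample

# A new training sample is derived by adding further noise (or clean
# speech) on top of the weighted mix, enlarging the sample set.
new_sample = sample + 0.3 * noise

training_set = [sample, new_sample]
print(len(training_set))  # → 2
```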
In some embodiments, the training module 2554 is further configured to perform noise reduction processing, through the noise reduction model, on the training voice signal samples included in the training voice signal sample set to obtain noise-reduced voice signals corresponding to the training voice signal samples; determine the error between the noise-reduced voice signal corresponding to a training voice signal sample and the clean voice signal sample, and substitute the error into the loss function of the noise reduction model; and determine, based on the learning rate of the noise reduction model, the parameter change value at which the loss function reaches its minimum, updating the parameters of the noise reduction model based on the parameter change value.
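The error/loss/learning-rate update loop can be made concrete with a toy model: a single scalar gain standing in for the recurrent networks described in this application. The gain, learning rate, and iteration count are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
clean = rng.standard_normal(160)
noisy = clean + 0.5 * rng.standard_normal(160)   # training sample with noise

# Toy noise reduction "model": one scalar gain g applied to the noisy
# signal; the scalar keeps the gradient-descent update rule visible.
g = 0.0
lr = 0.05  # learning rate of the noise reduction model

for _ in range(300):
    denoised = g * noisy                    # noise reduction processing
    error = denoised - clean                # error vs. the clean sample
    loss = np.mean(error ** 2)              # loss function (mean square error)
    grad = 2.0 * np.mean(error * noisy)     # gradient of the loss w.r.t. g
    g -= lr * grad                          # parameter change value applied

# For this quadratic loss, g converges to the least-squares optimum.
optimum = np.dot(clean, noisy) / np.dot(noisy, noisy)
print(abs(g - optimum) < 1e-6)  # → True
```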
In some embodiments, the training module 2554 is further configured to extract speech features from the training voice signal samples; perform first fully-connected processing on the speech features through a first fully-connected layer of a voice endpoint network to obtain a first fully-connected processing result; perform voice endpoint detection processing on the first fully-connected processing result through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result; predict the noise spectrum characteristic of the training voice signal sample through a second gated recurrent unit of a noise spectrum estimation network, taking the speech features, the first fully-connected processing result, and the voice endpoint detection result as input; and predict the gain of the corresponding training voice signal sample through a third gated recurrent unit of a noise spectrum removal network, taking the noise spectrum characteristic, the voice endpoint detection result, and the speech features as input, and apply the gain to the training voice signal sample.
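The data flow through the three networks can be sketched for one frame. This is a simplification, not the application's model: randomly initialised dense layers stand in for the gated recurrent units (a real model carries state across frames), and the feature and hidden sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

def dense(x, out_dim, act):
    """Randomly initialised fully-connected layer (illustration only)."""
    w = rng.standard_normal((out_dim, x.size)) * 0.1
    return act(w @ x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
F, H = 42, 24                      # feature / hidden sizes (illustrative)
feat = rng.standard_normal(F)      # speech features of one frame

# Stage 1 — voice endpoint network: a first fully-connected layer,
# then a unit that outputs the endpoint detection result.
fc1 = dense(feat, H, np.tanh)      # first fully-connected processing result
vad = dense(fc1, 1, sigmoid)       # voice endpoint detection result

# Stage 2 — noise spectrum estimation network: takes the features, the
# first fully-connected result, and the endpoint result as input.
noise_spec = dense(np.concatenate([feat, fc1, vad]), H, np.tanh)

# Stage 3 — noise spectrum removal network: predicts per-band gains from
# the noise spectrum characteristic, the endpoint result, and the
# features, then applies the gains to the frame.
gains = dense(np.concatenate([noise_spec, vad, feat]), F, sigmoid)
denoised = gains * feat
print(denoised.shape)  # → (42,)
```

The sigmoid keeps every predicted gain in (0, 1), so applying the gains can only attenuate each feature band, matching the suppression role of the removal network.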
Continuing with the exemplary structure of the artificial intelligence based noise reduction model training device 255-2 implemented as software modules provided by the embodiments of the present application, in some embodiments, as shown in fig. 2, the software modules of the artificial intelligence based noise reduction model training device 255-2 stored in the memory 250 may include a training module 2555, configured to: obtain a plurality of training voice signal samples carrying noise under a plurality of constraint conditions to form training voice signal sample sets in one-to-one correspondence with the plurality of constraint conditions; perform noise reduction processing on the training voice signal samples included in each training voice signal sample set through the noise reduction model corresponding to the respective constraint condition to obtain noise-reduced voice signals; and determine the error between the noise-reduced voice signal and the clean voice signal sample corresponding to the training voice signal sample, updating the parameters of the noise reduction model corresponding to the constraint condition according to the error. The noise reduction model is used to perform noise reduction processing on voice signals under the corresponding constraint condition.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the artificial intelligence based speech noise reduction method or the artificial intelligence based noise reduction model training method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform methods provided by embodiments of the present application, such as an artificial intelligence based speech noise reduction method as shown in fig. 3A-3C and an artificial intelligence based noise reduction model training method as shown in fig. 3D.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiments of the present application, by obtaining the constraint condition under which the voice signal is applied, the noise reduction model corresponding to that constraint condition is invoked to perform noise reduction processing on the voice signal; that is, a differentiated suppression means is adopted for different constraint conditions, so that the noise reduction effect is optimized in a targeted manner and the noise reduction efficiency is improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (19)

1. An artificial intelligence based speech noise reduction method, the method comprising:
acquiring a voice signal;
acquiring a constraint condition corresponding to the voice signal;
and calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise reduction voice signal.
2. The method of claim 1, wherein obtaining the constraint corresponding to the speech signal comprises:
acquiring a constraint condition corresponding to the voice signal before acquiring the voice signal, wherein the constraint condition comprises at least one of a scheduled time and a scheduled place of voice communication;
the calling of the noise reduction model adaptive to the constraint condition is used for carrying out noise reduction processing on the voice signal to obtain a noise reduction voice signal, and the noise reduction processing method comprises the following steps:
before acquiring the voice signal, acquiring a noise reduction model adaptive to the constraint condition in advance;
and when the constraint condition is met, automatically calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the acquired voice signal to obtain a noise reduction voice signal.
3. The method according to claim 1, wherein before the calling the noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal, the method comprises:
obtaining a plurality of noise reduction models, wherein the plurality of noise reduction models are a set of all noise reduction models which can be called or a set of noise reduction models which are adaptive to the constraint condition;
presenting options of the plurality of noise reduction models in a human-computer interaction interface;
and in response to the selection operation, using the selected noise reduction model as the noise reduction model adapted to the constraint condition.
4. The method of claim 1, wherein obtaining the constraint corresponding to the speech signal comprises:
acquiring at least one of the following attribute information of the voice signal: sending or receiving time information of the voice signal, sending or receiving geographical location information of the voice signal, sending or receiving user information of the voice signal, and sending or receiving environment information of the voice signal;
and determining the constraint condition corresponding to the attribute information of the voice signal according to the preset corresponding relation between the different attribute information and the different constraint conditions.
5. The method of claim 1, wherein the noise reduction model includes a voice endpoint network, a noise spectrum estimation network, and a noise spectrum removal network, and the invoking the noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise-reduced voice signal includes:
extracting speech features from the voice signal;
performing first fully-connected processing on the speech features corresponding to the voice signal through a first fully-connected layer of the voice endpoint network to obtain a first fully-connected processing result corresponding to the voice signal;
performing voice endpoint detection processing on the first fully-connected processing result corresponding to the voice signal through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result corresponding to the voice signal;
predicting the noise spectrum characteristic of the voice signal through a second gated recurrent unit of the noise spectrum estimation network, taking the speech features corresponding to the voice signal, the first fully-connected processing result corresponding to the voice signal, and the voice endpoint detection result corresponding to the voice signal as input;
and predicting the gain corresponding to the voice signal through a third gated recurrent unit of the noise spectrum removal network, taking the noise spectrum characteristic of the voice signal, the voice endpoint detection result corresponding to the voice signal, and the speech features corresponding to the voice signal as input, and applying the gain corresponding to the voice signal to obtain a noise-reduced voice signal.
6. The method according to claim 1, wherein after the invoking of the noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal by the noise reduction model to obtain a noise-reduced speech signal, the method further comprises:
determining a noise reduction effect parameter of the noise reduction voice signal;
when the noise reduction effect parameter is lower than a noise reduction effect parameter threshold, performing the following processing:
determining the similarity between a plurality of other constraint conditions and the constraint conditions corresponding to the voice signals, and determining other constraint conditions with the highest similarity to call a noise reduction model corresponding to the other constraint conditions to perform noise reduction processing on the noise reduction voice signals to obtain updated noise reduction voice signals;
wherein the other constraints are different from the constraints corresponding to the speech signal.
7. The method of claim 1,
each noise reduction model is adapted to one constraint condition, and different noise reduction models are adapted to different constraint conditions;
before the obtaining of the constraint condition corresponding to the speech signal, the method further includes:
acquiring a training voice signal sample set corresponding to a plurality of constraint conditions one by one;
and training to obtain a plurality of noise reduction models corresponding to the plurality of constraint conditions one by one on the basis of a training voice signal sample set corresponding to the plurality of constraint conditions one by one.
8. The method of claim 1,
each noise reduction model is adapted to at least one constraint condition, and different noise reduction models are adapted to different constraint conditions;
before the obtaining of the constraint condition corresponding to the speech signal, the method further includes:
acquiring a training voice signal sample set corresponding to a plurality of constraint conditions one by one;
training to obtain a plurality of noise reduction models corresponding to the plurality of constraint conditions one by one on the basis of a training voice signal sample set corresponding to the plurality of constraint conditions one by one;
clustering the plurality of noise reduction models to obtain at least one clustered noise reduction model;
wherein the noise reduction model of each cluster corresponds to at least one of the constraints.
9. The method of claim 8, wherein clustering the plurality of noise reduction models to obtain at least one clustered noise reduction model comprises:
acquiring a test voice signal sample set corresponding to each constraint condition;
performing the following for each set of test speech signal samples: carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set through the plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one to one;
and clustering the noise reduction models according to the noise reduction results which are obtained by aiming at each test voice signal sample set and correspond to the noise reduction models one by one to obtain at least one clustered noise reduction model.
10. The method of claim 9, wherein clustering the noise reduction models according to the noise reduction results obtained for each test speech signal sample set and corresponding to the noise reduction models in a one-to-one manner to obtain at least one clustered noise reduction model comprises:
performing the following for each set of test speech signal samples: determining a plurality of noise reduction results obtained after the noise reduction processing is carried out on the test voice signal samples in the test voice signal sample set by the plurality of noise reduction models;
determining a minimum mean square error for each of the noise reduction results, wherein the noise reduction results comprise noise reduced speech signal samples of the plurality of test speech signal samples in the set of test speech signal samples;
based on the minimum mean square error of the noise reduction result, sequencing the noise reduction models in an ascending order, and taking at least one noise reduction model sequenced at the front as at least one candidate noise reduction model corresponding to the test voice signal sample set;
extracting an intersection noise reduction model from the candidate noise reduction models corresponding to each of the test speech signal sample sets to use the intersection noise reduction model as a clustered noise reduction model;
and the candidate noise reduction model serving as the intersection noise reduction model corresponds to a plurality of test voice signal sample sets.
11. The method of claim 8, wherein the clustering the plurality of noise reduction models to obtain at least one clustered noise reduction model comprises:
clustering the constraint conditions serving as original constraint conditions to obtain at least one clustering constraint condition;
acquiring a test voice signal sample set corresponding to a plurality of original constraint conditions one by one and a noise reduction model corresponding to the original constraint conditions one by one;
carrying out fusion processing on the test voice signal sample sets which correspond to the original constraint conditions one by one to obtain a test voice signal sample set corresponding to the clustering constraint conditions;
carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set corresponding to the clustering constraint condition through a plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one by one;
and acquiring the minimum mean square error of each noise reduction result, and determining a noise reduction model corresponding to the minimum mean square error as the clustered noise reduction model.
12. The method according to claim 7 or 8, wherein the obtaining of the training speech signal sample sets corresponding to the plurality of constraints comprises:
acquiring a plurality of noises carrying various attribute information;
wherein the attribute information includes: sending or receiving time information of the voice signal, sending or receiving geographical location information of the voice signal, sending or receiving user information of the voice signal, and sending or receiving environment information of the voice signal;
dividing the plurality of noises according to the plurality of attribute information to obtain noise sets corresponding to the plurality of constraint conditions one by one;
wherein each of the constraints has the same plurality of kinds of attribute information;
and overlapping the noise of each noise set and a pure voice signal sample to obtain a training voice signal sample set corresponding to each constraint condition.
13. The method of claim 12, wherein the superimposing the noise of each noise set with the clean speech signal sample to obtain a training speech signal sample set corresponding to each constraint condition comprises:
acquiring the weight of the pure voice signal sample and the weight of the noise;
weighting the clean voice signal sample and the noise according to the weight of the clean voice signal sample and the weight of the noise to obtain a training voice signal sample;
adding the pure voice signal or the noise based on the training voice signal sample to obtain a new training voice signal sample;
and forming the training voice signal sample set according to the training voice signal samples and the new training voice signal samples.
14. The method according to claim 7 or 8, wherein the training of the noise reduction models corresponding to the constraints based on the training speech signal sample sets corresponding to the constraints comprises:
carrying out noise reduction processing on training voice signal samples included in the training voice signal sample set through the noise reduction model to obtain noise reduction voice signals corresponding to the training voice signal samples;
determining an error between a noise reduction speech signal corresponding to the training speech signal sample and a clean speech signal sample, and substituting the error into a loss function of the noise reduction model;
and determining a parameter change value of the noise reduction model when the loss function obtains a minimum value based on the learning rate of the noise reduction model, and updating the parameter of the noise reduction model based on the parameter change value.
15. The method according to claim 14, wherein the performing noise reduction processing on the training voice signal samples included in the training voice signal sample set through the noise reduction model includes:
extracting speech features from the training voice signal samples;
performing first fully-connected processing on the speech features corresponding to the training voice signal sample through a first fully-connected layer of a voice endpoint network to obtain a first fully-connected processing result;
performing voice endpoint detection processing on the first fully-connected processing result corresponding to the training voice signal sample through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result corresponding to the training voice signal sample;
predicting the noise spectrum characteristic of the training voice signal sample through a second gated recurrent unit of a noise spectrum estimation network, taking the speech features corresponding to the training voice signal sample, the first fully-connected processing result corresponding to the training voice signal sample, and the voice endpoint detection result corresponding to the training voice signal sample as input;
and predicting the gain corresponding to the training voice signal sample through a third gated recurrent unit of a noise spectrum removal network, taking the noise spectrum characteristic corresponding to the training voice signal sample, the voice endpoint detection result corresponding to the training voice signal sample, and the speech features corresponding to the training voice signal sample as input, and applying the gain corresponding to the training voice signal sample.
16. A training method of a noise reduction model based on artificial intelligence is characterized by comprising the following steps:
obtaining a plurality of training voice signal samples carrying noise in a plurality of constraint conditions to form a training voice signal sample set corresponding to the constraint conditions one by one;
performing noise reduction processing on training voice signal samples included in a training voice signal sample set corresponding to the plurality of constraint conditions through noise reduction models corresponding to the plurality of constraint conditions one to obtain noise reduction voice signals;
determining an error between the noise reduction voice signal and a pure voice signal sample corresponding to the training voice signal sample, and updating parameters of a noise reduction model corresponding to the constraint condition according to the error;
and the noise reduction model is used for carrying out noise reduction processing on the voice signals in the corresponding constraint conditions.
17. A speech noise reduction device based on artificial intelligence, comprising:
the acquisition module is used for acquiring a voice signal;
the acquisition module is used for acquiring constraint conditions corresponding to the voice signals;
and the noise reduction module is used for calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise reduction voice signal.
18. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based speech noise reduction method of any one of claims 1 to 15 or the artificial intelligence based noise reduction model training method of claim 16 when executing executable instructions stored in the memory.
19. A computer-readable storage medium storing executable instructions for, when executed by a processor, implementing the artificial intelligence based speech noise reduction method of any one of claims 1 to 15 or the training method of the artificial intelligence based noise reduction model of claim 16.
CN202110116096.8A 2021-01-28 2021-01-28 Voice noise reduction method and device based on artificial intelligence and electronic equipment Pending CN113593595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110116096.8A CN113593595A (en) 2021-01-28 2021-01-28 Voice noise reduction method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110116096.8A CN113593595A (en) 2021-01-28 2021-01-28 Voice noise reduction method and device based on artificial intelligence and electronic equipment

Publications (1)

Publication Number Publication Date
CN113593595A true CN113593595A (en) 2021-11-02

Family

ID=78238137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110116096.8A Pending CN113593595A (en) 2021-01-28 2021-01-28 Voice noise reduction method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN113593595A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793620A (en) * 2021-11-17 2021-12-14 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device and equipment based on scene classification and storage medium
CN113793620B (en) * 2021-11-17 2022-03-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device and equipment based on scene classification and storage medium
CN113823309A (en) * 2021-11-22 2021-12-21 成都启英泰伦科技有限公司 Noise reduction model construction and noise reduction processing method
CN114664322A (en) * 2022-05-23 2022-06-24 深圳市听多多科技有限公司 Single-microphone hearing-aid noise reduction method based on Bluetooth headset chip and Bluetooth headset

Similar Documents

Publication Publication Date Title
CN113593595A (en) Voice noise reduction method and device based on artificial intelligence and electronic equipment
CN108255934B (en) Voice control method and device
CN104538024B (en) Phoneme synthesizing method, device and equipment
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN110853618A (en) Language identification method, model training method, device and equipment
CN116844543A (en) Control method and system based on voice interaction
CN107316635B (en) Voice recognition method and device, storage medium and electronic equipment
CN111081280A (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN112148850A (en) Dynamic interaction method, server, electronic device and storage medium
CN111816170A (en) Training of audio classification model and junk audio recognition method and device
CN112837669A (en) Voice synthesis method and device and server
CN114007064B (en) Special effect synchronous evaluation method, device, equipment and storage medium
CN111933135A (en) Terminal control method and device, intelligent terminal and computer readable storage medium
CN117059068A (en) Speech processing method, device, storage medium and computer equipment
US20220137917A1 (en) Method and system for assigning unique voice for electronic device
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
CN116741193B (en) Training method and device for voice enhancement network, storage medium and computer equipment
CN109885668A (en) A kind of expansible field interactive system status tracking method and apparatus
CN109637509A (en) A kind of music automatic generation method, device and computer readable storage medium
CN113571082A (en) Voice call control method and device, computer readable medium and electronic equipment
CN113571063A (en) Voice signal recognition method and device, electronic equipment and storage medium
CN117037772A (en) Voice audio segmentation method, device, computer equipment and storage medium
CN116959464A (en) Training method of audio generation network, audio generation method and device
CN114722234B (en) Music recommendation method, device and storage medium based on artificial intelligence
CN110364169A (en) Method for recognizing sound-groove, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40054076

Country of ref document: HK

SE01 Entry into force of request for substantive examination