CN113593595A - Voice noise reduction method and device based on artificial intelligence and electronic equipment - Google Patents


Info

Publication number
CN113593595A
Authority
CN
China
Prior art keywords
noise reduction
voice signal
voice
training
speech signal
Prior art date
Legal status
Granted
Application number
CN202110116096.8A
Other languages
Chinese (zh)
Other versions
CN113593595B (en)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110116096.8A
Publication of CN113593595A
Application granted
Publication of CN113593595B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise


Abstract



The present application provides an artificial intelligence-based voice noise reduction method, apparatus, electronic device, and computer-readable storage medium. The method includes: acquiring a voice signal; acquiring a constraint condition corresponding to the voice signal; and calling a noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal, obtaining a noise-reduced voice signal. Through the present application, both the effect and the efficiency of noise reduction processing can be improved.


Description

Voice noise reduction method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to artificial intelligence technology, and in particular, to an artificial intelligence-based speech noise reduction method, apparatus, electronic device, and computer-readable storage medium.
Background
Artificial Intelligence (AI) encompasses the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, and acquire and apply knowledge to obtain the best results.
In voice call or human-machine voice interaction applications, collected voice signals generally contain noise, which degrades both call quality and recognition. Noise reduction processing therefore needs to be performed on the voice signals to raise the signal-to-noise ratio, enhance call clarity and intelligibility, and make machine speech recognition more accurate and effective.
Disclosure of Invention
Embodiments of the present application provide an artificial intelligence-based voice noise reduction method and apparatus, an electronic device, and a computer-readable storage medium, which can improve both the effect and the efficiency of noise reduction processing.
The technical scheme of the embodiment of the application is realized as follows:
An embodiment of the present application provides an artificial intelligence-based voice noise reduction method, which includes the following steps:
acquiring a voice signal;
acquiring a constraint condition corresponding to the voice signal;
and calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise reduction voice signal.
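The three steps above amount to a lookup: derive a constraint condition for the signal, then dispatch to the noise reduction model registered for that constraint. Below is a minimal Python sketch of that dispatch, under the assumption that models are keyed by a (place, period) constraint; the `Constraint`, `MODEL_REGISTRY`, and `identity_model` names are illustrative and not part of this application:

```python
# Hypothetical sketch of the claimed flow: acquire a voice signal,
# derive its constraint condition, and call the model adapted to it.
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    place: str   # e.g. "office", "street"
    period: str  # e.g. "daytime", "night"

def identity_model(signal):
    # Placeholder model: a real model would suppress noise here.
    return list(signal)

MODEL_REGISTRY = {
    Constraint("office", "daytime"): identity_model,
    Constraint("street", "daytime"): identity_model,
}

def denoise(signal, constraint):
    # Call the noise reduction model adapted to the constraint condition.
    model = MODEL_REGISTRY[constraint]
    return model(signal)
```

In practice the registry would map each constraint condition to a separately trained model, as the training embodiments below describe.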
An embodiment of the present application provides an artificial intelligence-based voice noise reduction apparatus, including:
the acquisition module is used for acquiring a voice signal;
the acquisition module is used for acquiring constraint conditions corresponding to the voice signals;
and the noise reduction module is used for calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise reduction voice signal.
In the foregoing solution, the obtaining module is further configured to: acquiring a constraint condition corresponding to the voice signal before acquiring the voice signal, wherein the constraint condition comprises at least one of a scheduled time and a scheduled place of voice communication; the noise reduction module is further configured to: before acquiring the voice signal, acquiring a noise reduction model adaptive to the constraint condition in advance; and when the constraint condition is met, automatically calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the acquired voice signal to obtain a noise reduction voice signal.
In the foregoing solution, the obtaining module is further configured to obtain a plurality of noise reduction models before calling the noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal, where the plurality of noise reduction models are a set of all noise reduction models that can be called, or a set of noise reduction models adapted to the constraint condition; presenting options of the plurality of noise reduction models in a human-computer interaction interface; and in response to the selection operation, using the selected noise reduction model as the noise reduction model adapted to the constraint condition.
In the foregoing solution, the obtaining module is further configured to: acquiring at least one of the following attribute information of the voice signal: sending or receiving time information of the voice signal, sending or receiving geographical location information of the voice signal, sending or receiving user information of the voice signal, and sending or receiving environment information of the voice signal; and determining the constraint condition corresponding to the attribute information of the voice signal according to the preset corresponding relation between the different attribute information and the different constraint conditions.
In the foregoing solution, the noise reduction module is further configured to: extract speech features from the speech signal; perform first full-connection processing on the speech features corresponding to the speech signal through a first fully connected layer of a voice endpoint network, to obtain a first full-connection processing result corresponding to the speech signal; perform voice endpoint detection processing on the first full-connection processing result corresponding to the speech signal through a first gated recurrent unit of the voice endpoint network, to obtain a voice endpoint detection result corresponding to the speech signal; predict the noise spectrum characteristic of the speech signal through a second gated recurrent unit of a noise spectrum prediction network, taking as input the speech features corresponding to the speech signal, the first full-connection processing result corresponding to the speech signal, and the voice endpoint detection result corresponding to the speech signal; and predict the gain corresponding to the speech signal through a third gated recurrent unit of a noise spectrum removal network, taking as input the noise spectrum characteristic of the speech signal, the voice endpoint detection result corresponding to the speech signal, and the speech features corresponding to the speech signal, and apply the gain corresponding to the speech signal to obtain a noise-reduced speech signal.
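The forward pass described above (a voice endpoint branch with a fully connected layer and a gated recurrent unit, a noise spectrum prediction network fed by the features, the full-connection result, and the endpoint result, and a noise removal network that predicts per-band gains) can be sketched as a data-flow toy. The stand-in functions below only mimic the wiring of the three stages; they are not trained gated recurrent units:

```python
# Toy data-flow sketch of the three-stage model; all "layers" are
# illustrative stand-ins, not the patent's trained networks.
import math

def fc(features):                       # first fully connected layer
    return [math.tanh(x) for x in features]

def vad_gru(fc_out):                    # voice endpoint detection result
    return 1.0 if sum(fc_out) > 0 else 0.0

def noise_gru(features, fc_out, vad):   # predicted noise spectrum
    return [abs(f) * (1.0 - vad) for f in features]

def gain_gru(noise_spec, vad, features):  # per-band gain in [0, 1]
    return [max(0.0, 1.0 - n / (abs(f) + 1e-9))
            for n, f in zip(noise_spec, features)]

def denoise_frame(features):
    h = fc(features)                    # FC output feeds both branches
    vad = vad_gru(h)
    noise_spec = noise_gru(features, h, vad)
    gains = gain_gru(noise_spec, vad, features)
    return [g * f for g, f in zip(gains, features)]
```

The point of the sketch is the wiring: the endpoint result and the predicted noise spectrum are both inputs to the gain stage, exactly as the claim language above specifies.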
In the foregoing solution, the noise reduction module is further configured to: calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the voice signal through the noise reduction model to obtain a noise reduction voice signal, and then determining a noise reduction effect parameter of the noise reduction voice signal; when the noise reduction effect parameter is lower than a noise reduction effect parameter threshold, performing the following processing: determining the similarity between a plurality of other constraint conditions and the constraint conditions corresponding to the voice signals, and determining other constraint conditions with the highest similarity to call corresponding noise reduction models to perform noise reduction processing on the noise reduction voice signals to obtain updated noise reduction voice signals; wherein the other constraints are different from the constraints corresponding to the speech signal.
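The fallback rule above can be illustrated as follows: when the noise reduction effect parameter is below threshold, pick the other constraint condition most similar to the current one and run its model. The attribute-matching `similarity` measure below is an assumption for illustration, not this application's definition:

```python
# Illustrative fallback: choose the most similar other constraint.
def similarity(c1, c2):
    # Toy similarity: fraction of matching attribute fields.
    matches = sum(1 for a, b in zip(c1, c2) if a == b)
    return matches / len(c1)

def pick_fallback(current, others):
    # "others" are constraint conditions different from "current".
    return max(others, key=lambda c: similarity(current, c))
```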
In the above scheme, each noise reduction model is adapted to a constraint condition, and different noise reduction models are adapted to different constraint conditions; the device also comprises a training module, a processing module and a processing module, wherein the training module is used for acquiring a training voice signal sample set corresponding to a plurality of constraint conditions one to one before acquiring the constraint conditions corresponding to the voice signals; and training to obtain a plurality of noise reduction models corresponding to the plurality of constraint conditions one by one on the basis of a training voice signal sample set corresponding to the plurality of constraint conditions one by one.
In the above solution, each noise reduction model is adapted to at least one of the constraints, and the constraints for different noise reduction model adaptations are different; the training module is further configured to, before obtaining a constraint condition corresponding to the voice signal, obtain a training voice signal sample set corresponding to a plurality of the constraint conditions one to one; training to obtain a plurality of noise reduction models corresponding to the plurality of constraint conditions one by one on the basis of a training voice signal sample set corresponding to the plurality of constraint conditions one by one; clustering the plurality of noise reduction models to obtain at least one clustered noise reduction model; wherein the noise reduction model of each cluster corresponds to at least one of the constraints.
In the above scheme, the training module is further configured to obtain a test speech signal sample set corresponding to each constraint condition; performing the following for each set of test speech signal samples: carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set through the plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one to one; and clustering the noise reduction models according to the noise reduction results which are obtained by aiming at each test voice signal sample set and correspond to the noise reduction models one by one to obtain at least one clustered noise reduction model.
In the foregoing solution, the training module is further configured to perform the following processing for each test speech signal sample set: determining a plurality of noise reduction results obtained after the noise reduction processing is carried out on the test voice signal samples in the test voice signal sample set by the plurality of noise reduction models; determining a minimum mean square error for each of the noise reduction results, wherein the noise reduction results comprise noise reduced speech signal samples of the plurality of test speech signal samples in the set of test speech signal samples; based on the minimum mean square error of the noise reduction result, sequencing the noise reduction models in an ascending order, and taking at least one noise reduction model sequenced at the front as at least one candidate noise reduction model corresponding to the test voice signal sample set; extracting an intersection noise reduction model from the candidate noise reduction models corresponding to each of the test speech signal sample sets to use the intersection noise reduction model as a clustered noise reduction model; and the candidate noise reduction model serving as the intersection noise reduction model corresponds to a plurality of test voice signal sample sets.
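The candidate-selection and intersection step above can be sketched directly: rank models per test sample set by mean squared error in ascending order, keep the top k as candidates, and intersect the candidate sets. The model names and the choice k=2 below are illustrative:

```python
# Sketch of clustering noise reduction models by per-set MSE ranking.
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cluster_models(errors_per_set, k=2):
    # errors_per_set: one {model_name: mse} dict per test sample set.
    candidate_sets = []
    for errors in errors_per_set:
        ranked = sorted(errors, key=errors.get)   # ascending MSE
        candidate_sets.append(set(ranked[:k]))    # top-k candidates
    # Models that are candidates for every sample set form the cluster.
    return set.intersection(*candidate_sets)
```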
In the above scheme, the training module is further configured to perform clustering processing on the plurality of constraint conditions as original constraint conditions to obtain at least one clustering constraint condition; acquiring a test voice signal sample set corresponding to a plurality of original constraint conditions one by one and a noise reduction model corresponding to the original constraint conditions one by one; carrying out fusion processing on the test voice signal sample sets which correspond to the original constraint conditions one by one to obtain a test voice signal sample set corresponding to the clustering constraint conditions; carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set corresponding to the clustering constraint condition through a plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one by one; and acquiring the minimum mean square error of each noise reduction result, and determining a noise reduction model corresponding to the minimum mean square error as the clustered noise reduction model.
In the above scheme, the training module is further configured to acquire a plurality of noises carrying a plurality of attribute information; wherein the attribute information includes: sending or receiving time information of the voice signal, sending or receiving geographical location information of the voice signal, sending or receiving user information of the voice signal, and sending or receiving environment information of the voice signal; dividing the plurality of noises according to the plurality of attribute information to obtain noise sets corresponding to the plurality of constraint conditions one by one; wherein each of the constraints has the same plurality of kinds of attribute information; and overlapping the noise of each noise set and a pure voice signal sample to obtain a training voice signal sample set corresponding to each constraint condition.
In the above scheme, the training module is further configured to obtain a weight of the clean speech signal sample and a weight of the noise; weighting the clean voice signal sample and the noise according to the weight of the clean voice signal sample and the weight of the noise to obtain a training voice signal sample; adding the pure voice signal or the noise based on the training voice signal sample to obtain a new training voice signal sample; and forming the training voice signal sample set according to the training voice signal samples and the new training voice signal samples.
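The weighted superposition described in the two passages above can be sketched in a few lines: mix a clean speech sample with noise at per-source weights, then vary the noise weight to derive additional training samples at different signal-to-noise ratios. The weight values below are illustrative:

```python
# Sketch of building noisy training samples by weighted superposition.
def mix(clean, noise, w_clean=1.0, w_noise=0.5):
    return [w_clean * c + w_noise * n for c, n in zip(clean, noise)]

def build_samples(clean, noise, noise_weights=(0.25, 0.5, 1.0)):
    # One training sample per noise weight (i.e. per SNR level).
    return [mix(clean, noise, 1.0, w) for w in noise_weights]
```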
In the above scheme, the training module is further configured to perform noise reduction processing on a training speech signal sample included in the training speech signal sample set through the noise reduction model to obtain a noise reduction speech signal corresponding to the training speech signal sample; determining an error between a noise reduction speech signal corresponding to the training speech signal sample and a clean speech signal sample, and substituting the error into a loss function of the noise reduction model; and determining a parameter change value of the noise reduction model when the loss function obtains a minimum value based on the learning rate of the noise reduction model, and updating the parameter of the noise reduction model based on the parameter change value.
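The parameter update described above is, in effect, a gradient step on a squared-error loss between the denoised output and the clean sample. A one-parameter sketch, in which a single scalar gain stands in for the full model and the learning rate is illustrative:

```python
# One training step: squared-error loss, gradient, learning-rate update.
def train_step(noisy, clean, gain, lr=0.1):
    denoised = [gain * x for x in noisy]
    # d(loss)/d(gain) for loss = mean((gain*x - y)^2)
    grad = sum(2 * (d - y) * x
               for d, y, x in zip(denoised, clean, noisy)) / len(noisy)
    return gain - lr * grad          # updated parameter
```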
In the above scheme, the training module is further configured to: extract speech features from a training speech signal sample; perform first full-connection processing on the speech features through a first fully connected layer of a voice endpoint network to obtain a first full-connection processing result; perform voice endpoint detection processing on the first full-connection processing result through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result; predict the noise spectrum characteristic of the training speech signal sample through a second gated recurrent unit of a noise spectrum prediction network, taking the speech features, the first full-connection processing result, and the voice endpoint detection result as input; and predict the gain corresponding to the training speech signal sample through a third gated recurrent unit of a noise spectrum removal network, taking the noise spectrum characteristic, the voice endpoint detection result, and the speech features as input, and apply the gain to the training speech signal sample.
The embodiment of the application provides a training method of a noise reduction model based on artificial intelligence, which comprises the following steps:
obtaining a plurality of training voice signal samples carrying noise in a plurality of constraint conditions to form a training voice signal sample set corresponding to the constraint conditions one by one;
performing noise reduction processing on training voice signal samples included in a training voice signal sample set corresponding to the plurality of constraint conditions through noise reduction models corresponding to the plurality of constraint conditions one to obtain noise reduction voice signals;
determining an error between the noise reduction voice signal and a pure voice signal sample corresponding to the training voice signal sample, and updating parameters of a noise reduction model corresponding to the constraint condition according to the error;
and the noise reduction model is used for carrying out noise reduction processing on the voice signals in the corresponding constraint conditions.
An embodiment of the present application provides an artificial intelligence-based noise reduction model training apparatus, including:
the training module is used for acquiring a plurality of training voice signal samples carrying noise in a plurality of constraint conditions to form a training voice signal sample set in one-to-one correspondence with the plurality of constraint conditions;
the training module is further configured to perform noise reduction processing on training speech signal samples included in a training speech signal sample set corresponding to the multiple constraint conditions through noise reduction models corresponding to the multiple constraint conditions one to one, so as to obtain noise-reduced speech signals;
the training module is further configured to determine an error between the noise reduction speech signal and a clean speech signal sample corresponding to the training speech signal sample, and update parameters of a noise reduction model corresponding to the constraint condition according to the error;
and the noise reduction model is used for carrying out noise reduction processing on the voice signals in the corresponding constraint conditions.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the artificial intelligence based speech noise reduction method or the artificial intelligence based noise reduction model training method provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for implementing, when executed by a processor, the artificial intelligence based speech noise reduction method or the artificial intelligence based noise reduction model training method provided in the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
Different constraint conditions of the voice signals are identified through attribute information of the voice signals, and the corresponding noise reduction model is called according to the constraint condition to perform noise reduction processing on the voice signal. That is, different suppression means are adopted under different constraint conditions, so the noise reduction effect is optimized in a targeted manner and the noise reduction efficiency is improved.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence-based speech noise reduction system provided by an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIGS. 3A-3D are schematic flow diagrams of artificial intelligence-based speech noise reduction methods provided by embodiments of the present application;
FIG. 4 is a schematic diagram of a noise reduction model of an artificial intelligence-based speech noise reduction method provided by an embodiment of the present application;
FIGS. 5A-5B are schematic diagrams of constraint conditions of an artificial intelligence-based speech noise reduction method provided by an embodiment of the present application;
FIG. 6A is a flowchart illustrating an artificial intelligence based speech noise reduction method according to an embodiment of the present application;
fig. 6B is a flowchart illustrating a training method of a noise reduction model based on artificial intelligence according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" are only used to distinguish similar objects and do not denote a particular ordering. It is understood that, where permitted, the specific order or sequence may be interchanged so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Noise spectrum characteristics: frequency is one of the main parameters describing sound, and studying how sound intensity is distributed over frequency requires the sound spectrum, i.e., a graph of the intensity of the components of a complex sound (a sound wave synthesized from simple harmonic components of different frequencies) plotted against frequency. The spectrum of a noise is the noise spectrum. Determining the composition and properties of noise from its spectrum is called spectral analysis. Spectral analysis usually reveals whether the noise peak lies in the low, intermediate, or high frequency range, which provides a basis for noise control.
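As a small illustration of such spectral analysis, the sketch below computes a magnitude spectrum with a plain DFT and reports which band holds the peak. The 300 Hz and 2000 Hz band boundaries are assumptions chosen for illustration, not values from this application:

```python
# Illustrative spectral analysis: DFT magnitude spectrum + peak band.
import cmath

def magnitude_spectrum(samples):
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def peak_band(samples, rate):
    spec = magnitude_spectrum(samples)
    k = max(range(1, len(spec)), key=lambda i: spec[i])  # skip DC bin
    freq = k * rate / len(samples)                       # bin -> Hz
    return "low" if freq < 300 else "intermediate" if freq < 2000 else "high"
```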
The related art includes a class of conventional noise reduction algorithms. These algorithms usually assume that noise is additive, random, and stationary, and that speech is short-time stationary; the noise spectrum characteristics can then be estimated by various statistical methods, and noise suppression is applied to the noisy speech according to the computed signal-to-noise ratio. In the enhanced speech signal after noise suppression, the proportion of noise components is reduced, the signal-to-noise ratio is improved, and the speech is clearer and more intelligible. However, these algorithms assume additive, random, stationary noise, whereas under actual constraint conditions noise is divided into stationary and non-stationary noise; some noise reduction means in the related art cannot effectively suppress non-stationary noise, which is an obvious shortcoming.
The related art also includes AI noise reduction algorithms, whose pipeline is divided into a deep learning training stage and a deep learning inference stage. The training stage first extracts relevant speech time-frequency domain feature data, such as the power spectrum, pitch period, and voice endpoints, from a large number of speech and noise samples, and uses this data to train a designed deep network model (usually a multilayer deep neural network, convolutional neural network, gated recurrent unit, long short-term memory network, or similar structure). The goal of training is to continuously optimize all parameters in the model so that it can more accurately predict the clean speech components (or noise components) in various noisy speech, that is, accurately distinguish noise from speech. AI noise reduction is a data-driven technology: if the noise types of the actual application scene are covered in the training-stage sample library, the algorithm can usually suppress them effectively and obtain cleaner speech; if a noise type of the actual scene is not included in the training process, the algorithm may fail. Another disadvantage is that, because effective speech under some constraint conditions is close to certain noises, the algorithm may misjudge and suppress the effective speech signal.
The following describes an exemplary application of the electronic device provided in the embodiment of the present application, and the device provided in the embodiment of the present application may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated message device, and a portable game device), and may also be implemented as a server. In the following, an exemplary application will be explained when the device is implemented as a server.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an artificial intelligence based voice noise reduction system provided in an embodiment of the present application. To support a voice call application, a terminal 400-1 and a terminal 400-2 are connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two. The terminal 400-1 initiates a voice call request to the terminal 400-2 through the server 200, and the request carries a voice message. After the server sends the voice message to the terminal 400-2, the terminal 400-2 sends attribute information of the voice message (the geographical location information of the terminal 400-2 and the time at which the voice message was received) to the server 200, so that the server 200 acquires the corresponding constraint condition and calls the corresponding noise reduction model to perform noise reduction processing on the voice message; the resulting noise-reduced voice signal is returned to the terminal 400-2 for playback.
In some embodiments, the artificial intelligence-based speech noise reduction system is applied to voice call applications, which include real-time and non-real-time voice calls. Real-time voice calls include making a phone call and conducting a voice call; non-real-time voice calls include sending voice messages. The terminal 400-1 used by a user can both receive and send voice messages. For an outgoing voice message, a noise reduction model running in the server 200 and adapted to the constraint condition under which the message was sent can be called to perform noise reduction, and the noise-reduced voice signal is sent to the terminal 400-2 used by another user. For a received voice message, the terminal 400-1 can call a noise reduction model running in the server 200 and adapted to the constraint condition under which the message was received, and the noise-reduced voice signal is played on the terminal 400-1.
In some embodiments, the artificial intelligence-based speech noise reduction system is applied to an application with a speech interaction function. The terminal 400-1 used by a user receives the user's speech information, which carries the content of a control instruction. For this speech information, a noise reduction model running in the server 200 and adapted to the constraint condition under which the speech was sent can be called to perform noise reduction; speech recognition is then performed on the noise-reduced speech signal, the control instruction is recognized and returned to the terminal 400-1, and the terminal 400-1 responds to the recognized control instruction.
In some embodiments, the electronic device may be implemented as a terminal. The terminal 400-1 used by a user receives voice information from the terminal 400-2 of another user; according to the constraint conditions of receiving the voice information (geographical location information and time information), the terminal 400-1 directly calls an adapted local noise reduction model to perform noise reduction processing on the voice information, and plays the noise-reduced voice signal corresponding to the voice information on the terminal 400-1.
In some embodiments, the server 200 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and big data and artificial intelligence platforms. The terminals 400-1 and 400-2 may be, but are not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The server 200 shown in fig. 2 includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The various components in the server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among these components. In addition to a data bus, the bus system 240 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general purpose processor may be a microprocessor, any conventional processor, or the like.
The user interface 230 includes one or more output devices 231 that enable the presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, a mouse, a microphone, a touch screen display, a camera, and other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 250 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for reaching other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software. Fig. 2 shows an artificial intelligence based speech noise reduction apparatus 255-1 stored in the memory 250, which may be software in the form of programs, plug-ins, and the like, and includes the following software modules: a collection module 2551, an acquisition module 2552, a noise reduction module 2553, and a training module 2554. Fig. 2 also shows a training apparatus 255-2 for an artificial intelligence based noise reduction model stored in the memory 250, which may be software in the form of programs, plug-ins, and the like, and includes the following software module: a training module 2555. These modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented; their functions will be described below.
The artificial intelligence based speech noise reduction method provided by the embodiments of the present application will be described below in conjunction with exemplary applications and implementations of the electronic device provided by the embodiments of the present application. The method may be implemented by a terminal or a server. For received voice information, the terminal either directly collects the voice signal from the voice information and calls a noise reduction model adapted to the constraint conditions of receiving the voice information to perform noise reduction processing, or calls, from the server, a noise reduction model adapted to the constraint conditions of receiving the voice information to perform noise reduction processing.
Referring to fig. 3A, fig. 3A is a schematic flowchart of a method for speech noise reduction based on artificial intelligence according to an embodiment of the present application, which will be described with reference to the steps shown in fig. 3A.
In step 101, a speech signal is acquired.
As an example, the voice signal may be a voice signal collected in real time during a real-time voice call, where a real-time voice call includes the process of making a phone call or a voice call; the voice signal may be a voice signal collected during a non-real-time voice call, where a non-real-time voice call includes the process of sending a voice message and the like; or the voice signal may be collected from a human-computer interaction voice command issued by a user on a human-computer interaction interface.
In step 102, constraints corresponding to the speech signal are obtained.
In some embodiments, obtaining the constraint condition corresponding to the voice signal in step 102 may be implemented by the following technical solution: acquiring at least one of the following items of attribute information of the voice signal: time information of sending or receiving the voice signal; geographical location information of sending or receiving the voice signal; user information of sending or receiving the voice signal, such as the person who initiates a call or the person who accepts a call; and environment information in which the voice signal is sent or received, for example, whether a voice call is accepted or initiated through a computer or through a mobile device; and determining the constraint condition corresponding to the attribute information of the voice signal according to preset correspondences between different attribute information and different constraint conditions.
As an example, the attribute information is information for characterizing constraints related to the voice signal, and includes one or more of time information, geographical location information, user information, and environment information. The constraint conditions of the voice signal include the constraint condition under which the voice signal is generated, sent, or received. Accordingly, the time information includes the time of sending or receiving the voice signal; the geographical location information includes the geographical location of sending or receiving the voice signal; the user information includes the user who sends or receives the voice signal, such as the person who initiates a call or the person who accepts a call; and the environment information includes the environment in which the voice signal is sent or received, for example, accepting or initiating a voice call through a computer versus through a mobile device. The computer and the mobile device represent different environment information: the computer represents relatively fixed environment information, such as an office environment, while the mobile device represents changing environment information, such as a leisure environment. Different constraint conditions may be defined by a combination of any several kinds of attribute information, and a correspondence exists between a single item of attribute information, or a combination of attribute information, and a constraint condition; for example, an airport from 10 am to 11 am serves as one constraint condition. Finally, the constraint condition corresponding to the attribute information of the voice signal is determined according to the preset correspondences between different attribute information and different constraint conditions.
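The correspondence between attribute information and constraint conditions described above can be sketched as a simple lookup. The table entries and constraint identifiers below are illustrative assumptions modeled on the "airport from 10 am to 11 am" example, not part of the embodiment.

```python
from typing import Optional

# Hypothetical preset correspondences between attribute information and
# constraint conditions; entries and identifiers are illustrative.
CONSTRAINT_TABLE = {
    # (geographical location, hour range) -> constraint condition identifier
    ("airport", range(10, 11)): "airport_10_to_11",
    ("office", range(9, 18)): "office_working_hours",
}

def resolve_constraint(location: str, hour: int) -> Optional[str]:
    """Determine the constraint condition matching the voice signal's
    attribute information, or None if no preset correspondence exists."""
    for (loc, hours), constraint in CONSTRAINT_TABLE.items():
        if loc == location and hour in hours:
            return constraint
    return None
```

In practice the key could combine any subset of time, location, user, and environment attributes, as the text notes; a two-field key is used here only to keep the sketch short.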
In step 103, a noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the speech signal, so as to obtain a noise reduced speech signal.
Referring to fig. 3B, fig. 3B is a schematic flow chart of the artificial intelligence based speech noise reduction method provided in the embodiment of the present application. Step 103, in which a noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the speech signal to obtain a noise-reduced speech signal, may be realized through steps 1031-1035, which will be described with reference to the steps shown in fig. 3B.
In step 1031, speech features are extracted from the speech signal.
In step 1032, a first full-connection processing is performed on the voice feature through a first full-connection layer of the voice endpoint network, so as to obtain a first full-connection processing result.
In step 1033, voice endpoint detection processing is performed on the first full-connection processing result through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result.
In step 1034, the speech feature, the first full-connection processing result, and the voice endpoint detection result are used as input, and the noise spectrum characteristic of the speech signal is predicted through a second gated recurrent unit of the noise spectrum estimation network.
In step 1035, the noise spectrum characteristic, the voice endpoint detection result, and the speech feature are used as input, the gain corresponding to the speech signal is predicted through a third gated recurrent unit of the noise spectrum removal network, and the gain is applied to the speech signal to obtain a noise-reduced speech signal.
By way of example, referring to fig. 4, fig. 4 is a schematic diagram of a noise reduction model of the artificial intelligence based speech noise reduction method provided in an embodiment of the present application. The noise reduction model includes a voice endpoint network, a noise spectrum estimation network, and a noise spectrum removal network. A speech feature is extracted from the speech signal; the speech feature is extracted from time-frequency domain data and includes, for example, a power spectrum, a pitch period, and a voice endpoint. The speech feature is subjected to first full-connection processing through a first fully connected layer (24-dimensional) of the voice endpoint network; the first full-connection processing is implemented through a dense function and a tanh activation function, and a first full-connection processing result is obtained. Voice endpoint detection processing is performed on the first full-connection processing result through a first gated recurrent unit (24-dimensional) of the voice endpoint network to obtain a voice endpoint detection result. The gated recurrent unit is a commonly used gated recurrent neural network: it receives the current input and the hidden state transmitted by the previous node, the hidden state containing related information of the previous node, and combines the hidden state with the current input to obtain the output of the current hidden node and the hidden state transmitted to the next node. The voice endpoint detection result is obtained through iterative processing by a plurality of gated recurrent units and mapping processing by an activation function. The speech feature, the first full-connection processing result, and the voice endpoint detection result are used as input to a second gated recurrent unit (48-dimensional) of the noise spectrum estimation network, which predicts the noise spectrum characteristic of the speech signal. The noise spectrum characteristic, the voice endpoint detection result, and the speech feature are then used as input to a third gated recurrent unit (96-dimensional) of the noise spectrum removal network, which predicts the gain (22-dimensional) corresponding to the speech signal, and the gain is applied to the speech signal to obtain a noise-reduced speech signal. The data involved in steps 1031-1035 are all processing results obtained for the corresponding speech signal.
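The data flow through the three networks can be sketched with the stated layer sizes (24-dim dense + tanh, 24/48/96-dim gated recurrent units, 22-dim gain output). In this sketch the weights are random placeholders and the 42-dim input feature size is an assumption, so it shows only the wiring between networks, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 42  # assumed speech feature size (band power spectrum, pitch, etc.)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRU:
    """Minimal gated recurrent unit keeping a hidden state across frames."""
    def __init__(self, in_dim, hid_dim):
        shape = (hid_dim, in_dim + hid_dim)
        self.Wz = rng.standard_normal(shape) * 0.1  # update gate weights
        self.Wr = rng.standard_normal(shape) * 0.1  # reset gate weights
        self.Wh = rng.standard_normal(shape) * 0.1  # candidate state weights
        self.h = np.zeros(hid_dim)

    def step(self, x):
        xh = np.concatenate([x, self.h])
        z = sigmoid(self.Wz @ xh)                         # update gate
        r = sigmoid(self.Wr @ xh)                         # reset gate
        cand = np.tanh(self.Wh @ np.concatenate([x, r * self.h]))
        self.h = (1 - z) * self.h + z * cand              # mix old and new state
        return self.h

dense_W = rng.standard_normal((24, FEAT_DIM)) * 0.1  # first fully connected layer
vad_gru = GRU(24, 24)                                # voice endpoint network
noise_gru = GRU(FEAT_DIM + 24 + 24, 48)              # noise spectrum estimation
denoise_gru = GRU(48 + 24 + FEAT_DIM, 96)            # noise spectrum removal
gain_W = rng.standard_normal((22, 96)) * 0.1         # 22-dim gain output

def denoise_frame(features):
    fc = np.tanh(dense_W @ features)                 # dense + tanh
    vad = vad_gru.step(fc)                           # endpoint detection state
    noise = noise_gru.step(np.concatenate([features, fc, vad]))
    h = denoise_gru.step(np.concatenate([noise, vad, features]))
    return sigmoid(gain_W @ h)                       # per-band gains in (0, 1)

gains = denoise_frame(rng.standard_normal(FEAT_DIM))
```

Each gain is then applied to its spectral band of the noisy frame; because the GRUs keep hidden state, calling `denoise_frame` frame by frame carries context across the signal.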
It should be noted that the execution order of the above steps 101, 102 and 103 is diversified, and the following description is given.
In some embodiments, step 101 may be performed first, and then step 102 and step 103 may be performed in sequence, one specific example being described below. In step 101 a speech signal is acquired and in step 102 the speech signal acquired in step 101 is analyzed to obtain constraints corresponding to the speech signal. In step 103, a noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the speech signal, so as to obtain a noise reduced speech signal.
When steps 101-103 are performed sequentially, step 102 is implemented as follows: acquiring at least one of the following items of attribute information of the voice signal: time information of sending or receiving the voice signal; geographical location information of sending or receiving the voice signal; user information of sending or receiving the voice signal, such as the person who initiates a call or the person who accepts a call; and environment information in which the voice signal is sent or received, for example, whether a voice call is accepted or initiated through a computer or through a mobile device; and determining the constraint condition corresponding to the attribute information of the voice signal according to the preset correspondences between different attribute information and different constraint conditions.
As an example, a terminal receives a piece of voice information and collects the voice signal from it; in response to a playing operation of the user on the voice information, the terminal acquires the playing time information and playing place information of the voice information, obtains the constraint condition corresponding to the playing time information and the playing place information, calls the noise reduction model corresponding to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise-reduced voice signal, and plays the noise-reduced voice signal.
In some embodiments, step 102 may be performed first, followed by step 103 and step 101; an example is described below. In step 102, that is, before the voice signal is acquired, the constraint conditions corresponding to the voice signal are acquired, the constraint conditions including at least one of a scheduled time and a scheduled location of the voice communication. In step 103, a noise reduction model adapted to the constraint conditions is obtained in advance; step 101 is then performed to acquire the voice signal, and when the constraint conditions are satisfied, step 103 is performed to automatically call the noise reduction model adapted to the constraint conditions to perform noise reduction processing on the acquired voice signal to obtain a noise-reduced voice signal.
As an example, for a voice signal in a teleconference, the scheduled time and scheduled location of the teleconference are extracted from a schedule function and/or an agenda function of the user's instant messaging client, or from a memo of a voice assistant. The scheduled time and scheduled location of the teleconference constitute the constraint conditions of the voice signal to be transmitted in the teleconference, where the scheduled location includes an airport, an office, and the like. Before the teleconference starts, a noise reduction model adapted to the constraint conditions is obtained in advance, for example, acquired in advance from a server, or loaded in advance on the processing end (terminal or server). When the constraint conditions are satisfied, that is, when the user reaches the scheduled conference location at the scheduled conference time, the noise reduction model adapted to the constraint conditions is automatically called to perform noise reduction processing on the acquired voice signal to obtain a noise-reduced voice signal, thereby saving the time of adapting and calling the noise reduction model in real time and improving noise reduction efficiency.
As an example, for a voice call scenario, the scheduled time of the voice call is extracted from a memo of a voice assistant (the scheduled location of the voice call is not recorded in the memo), and the geographical location information of the terminal is obtained when the voice call is placed but not yet connected. The geographical location information and the scheduled time constitute the constraint conditions of the voice signal to be transmitted in the voice call. Before the voice call is connected, a noise reduction model adapted to the constraint conditions is obtained in advance, for example, acquired in advance from a server, or loaded in advance on the processing end (terminal or server). After the voice call is connected, when the constraint conditions are satisfied (the time point at which the constraint conditions are satisfied may be before the voice call is connected), the noise reduction model adapted to the constraint conditions is automatically called to perform noise reduction processing on the acquired voice signal to obtain a noise-reduced voice signal, thereby saving the time of adapting and calling the noise reduction model in real time and improving noise reduction efficiency.
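The pre-loading described in the two examples above amounts to fetching the adapted model into a cache before the call so that denoising can start immediately. The names below (`fetch_model_from_server`, `MODEL_CACHE`) and the string-valued "models" are illustrative assumptions, not part of the embodiment.

```python
from typing import Dict

MODEL_CACHE: Dict[str, str] = {}  # constraint -> pre-loaded model

def fetch_model_from_server(constraint: str) -> str:
    """Placeholder for downloading the noise reduction model adapted to
    the given constraint condition from the server."""
    return f"model_for_{constraint}"

def preload(constraint: str) -> None:
    """Load the adapted model before the call, saving the time of
    adapting and calling the model in real time."""
    if constraint not in MODEL_CACHE:
        MODEL_CACHE[constraint] = fetch_model_from_server(constraint)

def get_model(constraint: str) -> str:
    """Return the cached model, falling back to an on-demand fetch."""
    return MODEL_CACHE.get(constraint) or fetch_model_from_server(constraint)
```

The fallback in `get_model` preserves the sequential flow of steps 101-103 for signals whose constraint conditions were not known in advance.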
In some embodiments, step 101 and step 102 may be performed first, then a user-selected noise reduction model is obtained according to the constraint conditions obtained in step 102, and step 103 is then performed with the user-selected noise reduction model; an example is described below. In step 101, the voice signal is acquired, and in step 102, the voice signal acquired in step 101 is analyzed to obtain the constraint conditions corresponding to the voice signal. Before the noise reduction model adapted to the constraint conditions is called in step 103 to perform noise reduction processing on the voice signal, a plurality of noise reduction models are acquired, the plurality of noise reduction models being either the set of all callable noise reduction models or the set of noise reduction models adapted to the constraint conditions (obtained through step 102). Options for the plurality of noise reduction models are presented in a human-computer interaction interface; in response to a selection operation, the selected noise reduction model is used as the noise reduction model adapted to the constraint conditions, and step 103 is executed with the noise reduction model selected by the user.
As an example, for a teleconference scenario, after step 102 and before step 103, options for a plurality of noise reduction models are presented in the human-computer interaction interface, the options respectively corresponding to different noise reduction models. The noise reduction models corresponding to the presented options are either all of the noise reduction models or the noise reduction models adapted to the constraint conditions; if the presented options correspond to all of the noise reduction models, the options corresponding to the noise reduction models adapted to the constraint conditions are highlighted. In response to a user selection operation, the noise reduction model adapted to the constraint conditions is updated to the selected noise reduction model; in the teleconference scenario, the user selection operation may occur during the teleconference.
As an example, in a teleconference scenario, the user's selection operation may occur before the teleconference. The scheduled time and scheduled location of the teleconference are extracted from a memo of a voice assistant and constitute the constraint conditions of the voice signal to be transmitted in the teleconference. Before the teleconference starts, a noise reduction model adapted to the constraint conditions is obtained in advance, for example, acquired in advance from a server, or loaded in advance on the processing end (terminal or server). Options for a plurality of noise reduction models are presented in the human-computer interaction interface, the options respectively corresponding to different noise reduction models; the noise reduction models corresponding to the presented options are either all of the noise reduction models or the pre-acquired noise reduction models adapted to the constraint conditions. If the presented options correspond to all of the noise reduction models, the options corresponding to the pre-acquired noise reduction models adapted to the constraint conditions are highlighted. In response to a user selection operation, the noise reduction model adapted to the constraint conditions is updated to the selected noise reduction model. After the conference starts, voice signals begin to be acquired, and the noise reduction model adapted to the constraint conditions is automatically called to perform noise reduction processing on the acquired voice signals to obtain noise-reduced voice signals, thereby accurately meeting user requirements, improving noise reduction pertinence, and improving the efficiency of human-computer interaction.
Referring to fig. 3C, fig. 3C is a schematic flow chart of the artificial intelligence based speech noise reduction method provided in the embodiment of the present application. After the noise reduction model adapted to the constraint condition is called in step 103 to perform noise reduction processing on the speech signal to obtain a noise-reduced speech signal, step 104 and step 105 may be executed, which will be described with reference to the steps shown in fig. 3C.
In step 104, noise reduction effect parameters of the noise reduced speech signal are determined.
In step 105, when the noise reduction effect parameter is lower than a noise reduction effect parameter threshold, the similarities between a plurality of other constraint conditions and the constraint condition corresponding to the speech signal are determined, and the other constraint condition with the highest similarity is determined, so that the corresponding noise reduction model is called to perform noise reduction processing on the noise-reduced speech signal to obtain an updated noise-reduced speech signal.
As an example, a noise reduction effect parameter of the noise-reduced speech signal is determined; the noise reduction effect parameter is a parameter obtained based on the signal-to-noise ratio, noise spectrum intensity, and the like of the noise-reduced speech signal, and the noise reduction effect parameter threshold is a threshold calculated based on a plurality of set parameters (signal-to-noise ratio, noise spectrum intensity, and the like). The noise reduction effect parameter being lower than the noise reduction effect parameter threshold indicates that the noise reduction effect is not ideal enough and that further noise reduction processing is required. Accordingly, the similarities between a plurality of other constraint conditions and the constraint condition corresponding to the speech signal are determined, the other constraint condition with the highest similarity is determined, and the noise reduction model corresponding to that other constraint condition is called to perform noise reduction processing on the noise-reduced speech signal to obtain an updated noise-reduced speech signal. The other constraint conditions are different from the constraint condition corresponding to the speech signal, and noise reduction quality and noise reduction efficiency can be effectively improved through this further noise reduction processing.
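The fallback in steps 104-105 can be sketched as a similarity search over other constraint conditions. The constraint encoding (a small feature vector) and the cosine similarity metric below are illustrative assumptions; the embodiment does not fix a particular representation or metric.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two encoded constraint conditions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_constraint(current, others):
    """Return the other constraint condition most similar to the current
    one, whose model is then called for further noise reduction."""
    return max(others, key=lambda name: cosine(others[name], current))

# Hypothetical encodings of (place, time, device) attributes:
current = np.array([1.0, 0.0, 0.5])
others = {
    "airport_10_to_12": np.array([1.0, 0.1, 0.4]),
    "office_9_to_18":   np.array([0.0, 1.0, 0.9]),
}
best = most_similar_constraint(current, others)
```

In the full flow, this search only runs when the noise reduction effect parameter falls below the threshold, and the model adapted to `best` then re-denoises the already noise-reduced signal.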
In some embodiments, each noise reduction model is adapted to one constraint condition, and different noise reduction models are adapted to different constraint conditions. Before the constraint condition corresponding to the voice signal is obtained in step 102, the following technical solution may be implemented: acquiring training voice signal sample sets in one-to-one correspondence with a plurality of constraint conditions; and training, based on the training voice signal sample sets in one-to-one correspondence with the plurality of constraint conditions, a plurality of original noise reduction models in one-to-one correspondence with the plurality of constraint conditions.
As an example, a plurality of training voice signal samples carrying noise under a plurality of constraint conditions are acquired to form training voice signal sample sets in one-to-one correspondence with the plurality of constraint conditions, and noise reduction models in one-to-one correspondence with the plurality of constraint conditions are initialized. Noise reduction processing is performed, through the noise reduction model corresponding to each constraint condition, on the training voice signal samples included in the training voice signal sample set corresponding to that constraint condition to obtain noise-reduced voice signals. The error between each noise-reduced voice signal and the clean voice signal sample corresponding to the training voice signal sample is determined, and the parameters of the noise reduction model corresponding to the constraint condition are updated according to the error; the noise reduction model is used for performing noise reduction processing on voice signals under the corresponding constraint condition.
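The per-constraint training described above can be sketched end to end: the model denoises noisy training samples, the error against the clean samples is computed, and the parameters are updated from that error. As an illustrative stand-in for the GRU-based model, the "model" below is a single per-band gain vector trained by gradient descent on the mean squared error; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
BANDS = 22  # matches the 22-dim gain output of the noise reduction model

clean = rng.uniform(0.1, 1.0, size=(100, BANDS))          # clean speech spectra
noisy = clean + rng.uniform(0.0, 0.5, size=(100, BANDS))  # noise added per band

gain = np.ones(BANDS)  # model parameters for one constraint condition
lr = 0.05
for _ in range(500):
    err = gain * noisy - clean                      # denoised output vs clean sample
    gain -= lr * 2 * np.mean(err * noisy, axis=0)   # gradient of the per-band MSE

final_mse = float(np.mean((gain * noisy - clean) ** 2))
```

One such model would be trained per constraint condition on that condition's sample set; the learned gains fall below 1 because the added noise inflates the noisy spectra.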
In some embodiments, each noise reduction model is adapted to at least one constraint condition, and different noise reduction models are adapted to different constraint conditions. Before the constraint condition corresponding to the voice signal is obtained, the following technical solution may be executed: acquiring training voice signal sample sets in one-to-one correspondence with a plurality of constraint conditions; training, based on the training voice signal sample sets in one-to-one correspondence with the plurality of constraint conditions, a plurality of original noise reduction models in one-to-one correspondence with the plurality of constraint conditions; and clustering the plurality of original noise reduction models to obtain at least one clustered noise reduction model, where each clustered noise reduction model corresponds to at least one constraint condition.
As an example, the noise reduction models obtained by training based on a plurality of constraint conditions and in one-to-one correspondence with the constraint conditions are used as original noise reduction models, and clustering processing is performed on the original noise reduction models to obtain at least one clustered noise reduction model. After the clustering processing, one noise reduction model can perform noise reduction corresponding to a plurality of constraint conditions; for example, a noise reduction model corresponding to the constraint condition of an airport from 8 to 10 am can also be used to process voice signals under the constraint condition of the airport from 10 to 12 am. Clustering the noise reduction models can effectively reduce the occupation of storage resources and also improve the utilization rate of the noise reduction models.
In some embodiments, performing clustering processing on the plurality of noise reduction models to obtain at least one clustered noise reduction model may be implemented by the following technical solution: acquiring a test voice signal sample set corresponding to each constraint condition; for each test voice signal sample set, performing the following processing: performing noise reduction processing on the test voice signal samples in the test voice signal sample set through the plurality of noise reduction models to obtain noise reduction results, namely noise-reduced voice signal samples, in one-to-one correspondence with the plurality of noise reduction models; and clustering the plurality of noise reduction models according to the noise reduction results obtained for each test voice signal sample set and in one-to-one correspondence with the plurality of noise reduction models, to obtain at least one clustered noise reduction model.
As an example, M test voice signal sample sets are used as test sets, and each test set corresponds to a trained noise reduction model (not yet subjected to clustering processing). For a given test set, the M trained noise reduction models are used to perform noise reduction processing on the test voice signal samples of the test set to obtain M noise reduction results (respectively corresponding to the different noise reduction models); the plurality of noise reduction models are then clustered according to the noise reduction results obtained for each test voice signal sample set and in one-to-one correspondence with the plurality of noise reduction models, to obtain at least one clustered noise reduction model.
In some embodiments, clustering the plurality of noise reduction models according to the noise reduction results obtained for each test voice signal sample set and in one-to-one correspondence with the plurality of noise reduction models, to obtain at least one clustered noise reduction model, may be implemented by the following technical solution: for each test voice signal sample set, performing the following processing: determining the plurality of noise reduction results obtained after the plurality of noise reduction models perform noise reduction processing on the test voice signal samples in the test voice signal sample set; determining the minimum mean square error of each noise reduction result, where each noise reduction result includes the noise-reduced voice signal samples of the plurality of test voice signal samples in the test voice signal sample set; sorting the plurality of noise reduction models in ascending order based on the minimum mean square errors of their noise reduction results, and taking at least one top-ranked noise reduction model as a candidate noise reduction model corresponding to the test voice signal sample set; and extracting intersection noise reduction models from the candidate noise reduction models corresponding to the test voice signal sample sets, and taking the intersection noise reduction models as the clustered noise reduction models, where a candidate noise reduction model serving as an intersection noise reduction model corresponds to a plurality of test voice signal sample sets.
Continuing the above example with the minimum mean square error as the evaluation criterion, the M noise reduction results are sorted. Each noise reduction result includes the noise-reduced test speech signals produced by one noise reduction model for all test speech signal samples in a given test speech signal sample set; if the test speech signal sample set includes 10 test speech signal samples, 10 noise-reduced test speech signals are obtained, and the minimum mean square error over these 10 noise-reduced test speech signals is determined as the minimum mean square error of that noise reduction model for this test speech signal sample set. Based on the M noise reduction models, M minimum mean square errors are thus obtained for the test speech signal sample set. The noise reduction models corresponding to the first K noise reduction results whose minimum mean square error is smaller than an error threshold are selected as the candidate noise reduction models of the test speech signal sample set; if fewer than K noise reduction models satisfy this condition, the actual number is used; and if no noise reduction result has a minimum mean square error smaller than the error threshold, the noise reduction model corresponding to the smallest minimum mean square error is selected as the only candidate noise reduction model of the test speech signal sample set. If the candidate noise reduction models corresponding to the M test speech signal sample sets have an intersection (for example, test speech signal sample set A corresponds to two candidate noise reduction models (a, b) and test speech signal sample set B corresponds to three candidate noise reduction models (a, b, c)), the candidate noise reduction models (a, b) are taken as the intersection noise reduction models of test speech signal sample set A and test speech signal sample set B. The test speech signal sample sets sharing an intersection uniformly use the intersection noise reduction model with the smallest comprehensive error as the final clustered noise reduction model corresponding to test speech signal sample set A and test speech signal sample set B; when only one intersection noise reduction model exists between them, that model is directly used as the final clustered noise reduction model. The comprehensive error may be the minimum mean square error or another evaluation parameter, and is the comprehensive noise reduction evaluation result of the plurality of intersection noise reduction models over the plurality of test speech signal sample sets.
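The candidate selection and intersection steps described above can be sketched as follows. This is a minimal Python illustration on toy data: the function names and data layout are hypothetical, and plain mean squared error is used as a stand-in for the patent's minimum-mean-square-error criterion.

```python
import numpy as np

def mse(clean, denoised):
    """Mean squared error between a clean reference and a denoised signal."""
    return float(np.mean((np.asarray(clean) - np.asarray(denoised)) ** 2))

def candidate_models(results, clean_refs, k, err_threshold):
    """Rank models by average error over one test set and keep candidates.

    results: {model_name: [denoised output per test sample]}
    clean_refs: clean reference samples for the same test set.
    Returns up to k model names whose error is below err_threshold;
    if none qualify, the single best model is returned instead.
    """
    errors = {
        name: float(np.mean([mse(c, d) for c, d in zip(clean_refs, outs)]))
        for name, outs in results.items()
    }
    ranked = sorted(errors, key=errors.get)            # ascending error
    passing = [m for m in ranked if errors[m] < err_threshold]
    return passing[:k] if passing else ranked[:1]

def intersection_models(candidates_per_set):
    """Intersect the candidate lists of all test sets."""
    sets = [set(c) for c in candidates_per_set.values()]
    return set.intersection(*sets) if sets else set()
```

Among the intersection models, the one with the smallest comprehensive error over all involved test sets would then serve as the shared clustered model.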
In some embodiments, the clustering process is performed on the plurality of original noise reduction models to obtain at least one clustered noise reduction model, and the clustering process may be implemented by the following technical solutions: clustering the plurality of constraint conditions as original constraint conditions to obtain at least one clustering constraint condition; acquiring a test voice signal sample set corresponding to a plurality of original constraint conditions one by one and an original noise reduction model corresponding to the original constraint conditions one by one; carrying out fusion processing on the test voice signal sample sets which correspond to the original constraint conditions one by one to obtain a test voice signal sample set corresponding to the clustering constraint conditions; carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set corresponding to the clustering constraint conditions through a plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one to one; and acquiring the minimum mean square error of each noise reduction result, and determining the noise reduction model corresponding to the minimum mean square error as a clustered noise reduction model.
As an example, and different from the foregoing embodiment, the plurality of constraint conditions may first be clustered as original constraint conditions to obtain at least one clustering constraint condition; the noise reduction model corresponding to each clustering constraint condition is then acquired as a clustered noise reduction model. Assuming there are M original constraint conditions, M corresponding test speech signal sample sets and M corresponding original noise reduction models, N constraint conditions are obtained after the constraint conditions are clustered, and N test speech signal sample sets are further obtained. The test speech signal samples in the test speech signal sample set of a given clustering constraint condition are subjected to noise reduction processing by the M noise reduction models to obtain M noise reduction results; the minimum mean square error of each noise reduction result is acquired, and the noise reduction model corresponding to the smallest minimum mean square error is determined as the noise reduction model of that clustering constraint condition (the clustered noise reduction model). The noise reduction model of each clustering constraint condition is obtained in turn in this way. The clustered noise reduction models corresponding to different clustering constraint conditions may intersect; that is, the noise reduction model of clustering constraint condition A may be a, and the noise reduction model of clustering constraint condition B may also be a. In this case the noise reduction models can be multiplexed: noise reduction model a is automatically shared by the two clustering constraint conditions. Clustering the noise reduction models thus effectively reduces the occupation of storage resources and improves the utilization rate of the noise reduction models.
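The model-selection step for one clustered constraint can be sketched as below. This is an illustrative Python fragment, not the patent's implementation: models are stand-in callables, and `pick_model_for_cluster` is a hypothetical name.

```python
import numpy as np

def pick_model_for_cluster(models, test_samples, clean_refs):
    """Run every trained model over the merged test set of one clustered
    constraint and keep the model with the smallest mean squared error.

    models: {name: callable mapping a noisy sample to a denoised sample}
    """
    best_name, best_err = None, float("inf")
    for name, model in models.items():
        errs = [float(np.mean((model(np.asarray(x)) - np.asarray(ref)) ** 2))
                for x, ref in zip(test_samples, clean_refs)]
        err = float(np.mean(errs))
        if err < best_err:
            best_name, best_err = name, err
    return best_name, best_err
```

Applying this per clustered constraint naturally allows two constraints to map to the same model, which is the multiplexing effect described above.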
In some embodiments, the acquisition of the training speech signal sample sets in one-to-one correspondence with the plurality of constraint conditions may be implemented by the following technical solution: acquiring a plurality of noises carrying multiple kinds of attribute information, where the attribute information includes: time information of sending or receiving the voice signal, geographical location information of sending or receiving the voice signal, user information of sending or receiving the voice signal, and environment information of sending or receiving the voice signal; dividing the plurality of noises according to the multiple kinds of attribute information to obtain noise sets in one-to-one correspondence with the plurality of constraint conditions, where the noises belonging to each constraint condition share the same multiple kinds of attribute information; and superimposing the noise of each noise set on pure voice signal samples to obtain the training speech signal sample set corresponding to each constraint condition.
In some embodiments, the above superposition of the noise of each noise set on the pure voice signal samples to obtain the training speech signal sample set corresponding to each constraint condition may be implemented by the following technical solution: acquiring the weight of the pure voice signal sample and the weight of the noise; weighting the pure voice signal sample and the noise according to their respective weights to obtain a training speech signal sample; adding a pure voice signal or noise on the basis of the training speech signal sample to obtain a new training speech signal sample; and forming the training speech signal sample set from the training speech signal sample and the new training speech signal sample.
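The weighted superposition can be sketched in a few lines of Python. The function names are illustrative; the 0.7/0.3 defaults follow the example weights given later in this document.

```python
import numpy as np

def synthesize_noisy_sample(clean, noise, w_clean=0.7, w_noise=0.3):
    """Linearly mix a pure speech frame with a noise frame.

    The pure signal doubles as the expected (target) output during
    training; the mixture is the model input.
    """
    clean, noise = np.asarray(clean, float), np.asarray(noise, float)
    noisy = w_clean * clean + w_noise * noise
    return noisy, clean  # (training input, expected output)

def build_sample_set(clean_frames, noise_frames, weight_pairs):
    """Sweep several weight pairs to enlarge the training set."""
    return [
        synthesize_noisy_sample(c, n, wc, wn)
        for c in clean_frames
        for n in noise_frames
        for wc, wn in weight_pairs
    ]
```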
Taking the time information and the geographical location information as an example, the time information and the geographical location information form corresponding constraint conditions as two-dimensional information, and all the noises are divided into M groups (M is an integer greater than or equal to 2). The collected noises are divided into M groups of different noises according to the time information and the geographical location information; for example, the time of the first group of noises is 10:00 to 11:00 in the morning with geographical coordinate GPS0, the time of the second group of noises is 11:00 to 12:00 in the morning with geographical coordinate GPS0, and so on. Each group of noises is linearly superimposed on the pure voice signals with different weighting values to obtain a training speech signal sample set of a certain scale for training the noise reduction models, and a noise reduction model corresponding to each time period and geographical location (constraint condition) is obtained by training each noise reduction model separately. From the M groups of noise sample sets (each group corresponds to a different time and geographical location, that is, to a different constraint condition), M training speech signal sample sets are constructed. Larger-scale training speech signal samples can be generated through linear superposition with different parameters (for example, a weighting value of 0.7 for the pure voice signal and 0.3 for the noise), and as the scale of the training speech signal sample set increases, the deep network of the noise reduction model acquires better adaptability to the constraint conditions and can better identify and suppress noise under different constraint conditions.
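The grouping of noise clips by the two-dimensional (time, location) key can be sketched as follows. The clip dictionary fields (`hour`, `gps`, `samples`) are illustrative names, not from the patent.

```python
from collections import defaultdict

def group_noise_by_constraint(noise_clips):
    """Partition noise clips by their (time window, GPS cell) attributes.

    noise_clips: iterable of dicts with 'hour', 'gps' and 'samples' keys.
    Each resulting group corresponds to one constraint condition.
    """
    groups = defaultdict(list)
    for clip in noise_clips:
        key = (clip["hour"], clip["gps"])   # two-dimensional constraint key
        groups[key].append(clip["samples"])
    return dict(groups)
```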
In some embodiments, the above training to obtain a plurality of noise reduction models corresponding to a plurality of constraints one-to-one based on a training speech signal sample set corresponding to a plurality of constraints one-to-one may be implemented by the following technical solutions: carrying out noise reduction processing on training voice signal samples included in the training voice signal sample set through a noise reduction model to obtain noise reduction voice signals corresponding to the training voice signal samples; determining an error between a noise reduction voice signal corresponding to the training voice signal sample and the pure voice signal sample, and substituting the error into a loss function of the noise reduction model; and determining a parameter change value of the noise reduction model when the loss function obtains the minimum value based on the learning rate of the noise reduction model, and updating the parameter of the noise reduction model based on the parameter change value.
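The error / loss / learning-rate update cycle described above can be sketched on a toy linear denoiser. This is an illustrative stand-in: the patent's model is a deep network trained with automatic differentiation, and `train_step` and `loss` are hypothetical names.

```python
import numpy as np

def loss(W, noisy, clean):
    """Half sum-of-squares error between the denoised output and the clean sample."""
    return float(0.5 * np.sum((W @ noisy - clean) ** 2))

def train_step(W, noisy, clean, lr=0.1):
    """One gradient step on the toy denoiser y = W @ x under the MSE loss,
    mirroring the parameter-change-value update based on the learning rate."""
    err = W @ noisy - clean            # error vs the pure speech sample
    grad = np.outer(err, noisy)        # dLoss/dW for 0.5 * ||err||^2
    return W - lr * grad               # parameter update
```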
In some embodiments, the noise reduction processing of the training speech signal samples included in the training speech signal sample set by the noise reduction model may be implemented by the following technical solution: extracting voice features from the training speech signal sample; performing a first full-connection processing on the voice features through a first full-connection layer of a voice endpoint network to obtain a first full-connection processing result; performing voice endpoint detection processing on the first full-connection processing result through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result; predicting the noise spectrum feature of the training speech signal sample through a second gated recurrent unit of a noise spectrum estimation network, with the voice features, the first full-connection processing result and the voice endpoint detection result as input; and predicting the gain corresponding to the training speech signal sample through a third gated recurrent unit of a noise spectrum removal network, with the noise spectrum feature, the voice endpoint detection result and the voice features as input, and applying the gain to the training speech signal sample.
As an example, the noise reduction model includes a voice endpoint network, a noise spectrum estimation network and a noise spectrum removal network. Voice features are extracted from the training speech signal sample; the voice features are extracted from time-frequency domain data, such as the power spectrum, pitch period and voice endpoint. A first full-connection processing is performed on the voice features through the first full-connection layer of the voice endpoint network to obtain a first full-connection processing result corresponding to the training speech signal sample, and voice endpoint detection processing is performed on this first full-connection processing result through the first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result corresponding to the training speech signal sample. Taking the voice features, the first full-connection processing result and the voice endpoint detection result corresponding to the training speech signal sample as input, the second gated recurrent unit of the noise spectrum estimation network predicts the noise spectrum feature of the training speech signal sample. Taking the noise spectrum feature, the voice endpoint detection result and the voice features corresponding to the training speech signal sample as input, the third gated recurrent unit of the noise spectrum removal network predicts the gain corresponding to the training speech signal sample, and the gain is applied to the training speech signal sample to obtain the noise-reduced speech signal corresponding to the training speech signal sample. The error between the noise-reduced speech signal corresponding to the training speech signal sample and the pure voice signal sample is determined and substituted into the loss function of the noise reduction model; the parameter change value of the noise reduction model when the loss function attains its minimum value is determined based on the learning rate of the noise reduction model, and the parameters of the noise reduction model are updated based on the parameter change value.
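The data flow through the three sub-networks can be sketched with a minimal NumPy GRU cell. All layer sizes and random weights below are illustrative assumptions (the sizes are reminiscent of RNNoise-style denoisers, not values from the patent), and the sketch only shows how the outputs of each stage feed the next.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (update gate z, reset gate r, candidate state)."""
    def __init__(self, n_in, n_hidden):
        s = 0.1
        self.Wz = rng.normal(0, s, (n_hidden, n_in + n_hidden))
        self.Wr = rng.normal(0, s, (n_hidden, n_in + n_hidden))
        self.Wh = rng.normal(0, s, (n_hidden, n_in + n_hidden))
        self.h = np.zeros(n_hidden)

    def step(self, x):
        xh = np.concatenate([x, self.h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * self.h]))
        self.h = (1 - z) * self.h + z * h_cand
        return self.h

# Illustrative sizes: 42 input features, 24 endpoint units, 48 noise-spectrum
# units, 96 removal units, 22 per-band gains.
N_FEAT, N_VAD, N_NOISE, N_DENOISE, N_BANDS = 42, 24, 48, 96, 22
fc_W = rng.normal(0, 0.1, (N_VAD, N_FEAT))              # first full-connection layer
vad_gru = GRUCell(N_VAD, N_VAD)                          # voice endpoint network
noise_gru = GRUCell(N_FEAT + N_VAD + N_VAD, N_NOISE)     # noise spectrum estimation
denoise_gru = GRUCell(N_FEAT + N_VAD + N_NOISE, N_DENOISE)  # noise spectrum removal
gain_W = rng.normal(0, 0.1, (N_BANDS, N_DENOISE))

def denoise_frame(features):
    fc = np.tanh(fc_W @ features)                        # first full-connection result
    vad = vad_gru.step(fc)                               # voice endpoint detection result
    noise = noise_gru.step(np.concatenate([features, fc, vad]))   # noise spectrum feature
    dn = denoise_gru.step(np.concatenate([features, vad, noise]))
    return sigmoid(gain_W @ dn)                          # per-band gains to apply
```

Applying the returned gains to the frame's band spectrum would yield the noise-reduced signal for that frame.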
Referring to fig. 3D, fig. 3D is a schematic flowchart of a training method of a noise reduction model based on artificial intelligence according to an embodiment of the present application, and the steps shown in fig. 3D will be described.
In step 201, obtaining a plurality of training speech signal samples carrying noise in a plurality of constraint conditions to form a training speech signal sample set corresponding to the plurality of constraint conditions one to one, and initializing a noise reduction model corresponding to the plurality of constraint conditions one to one;
in step 202, noise reduction processing is performed on training speech signal samples included in a training speech signal sample set corresponding to a plurality of constraint conditions through noise reduction models corresponding to the plurality of constraint conditions one to obtain noise reduction speech signals;
in step 203, determining an error between the noise-reduced speech signal and a clean speech signal sample corresponding to the training speech signal sample, and updating parameters of the noise-reduced model corresponding to the constraint conditions according to the error;
and the noise reduction model is used for carrying out noise reduction processing on the voice signals in the corresponding constraint conditions.
For the specific implementation of steps 201 to 203, reference may be made to the embodiments corresponding to steps 101 to 103.
In the following, an exemplary application of the embodiments of the present application in an actual application scenario will be described.
A plurality of noise reduction models are independently trained on different sound samples collected under two kinds of attribute information, namely system time and geographical location, and the trained noise reduction models are clustered to obtain deep-network noise reduction models adapted to different constraint conditions (time and geographical location), thereby effectively improving the noise reduction effect and noise reduction efficiency.
Referring to fig. 5A-5B, fig. 5A-5B are schematic diagrams of constraint conditions of the artificial intelligence based voice noise reduction method provided by an embodiment of the present application. Fig. 5A shows a street from 8 am to 10 am: the constraint condition corresponding to this street and time period has specific noise reduction requirements and noise characteristics. Likewise, an airport from 8 am to 10 am, an office from 7 pm to 9 pm, and a subway station from 3 pm to 5 pm each correspond to a constraint condition with its own specific noise reduction requirements and noise characteristics. Constraint condition identification is performed on the two-dimensional information consisting of the geographical location information and the time information; these are identified as constraint conditions 1-4, and the noise reduction models A-D are correspondingly called.
Referring to fig. 6A, fig. 6A is a schematic flowchart of the artificial intelligence based speech noise reduction method provided in an embodiment of the present application, and referring to fig. 6B, fig. 6B is a schematic flowchart of the artificial intelligence based training method of the noise reduction model provided in an embodiment of the present application. Noise reduction model training is performed first, and the noise reduction model is then called to perform inference. The training stage may be run on a server and finally yields N types of noise reduction models (N is an integer greater than or equal to 2); the inference stage may be run in real time on a terminal, or on the server, to obtain the noise-reduced speech signals.
In some embodiments, a training speech signal sample set for training is first acquired. The training speech signal sample set needs to be labeled in advance, that is, the attribute of each frame of signal needs to be confirmed, for example, whether it is noise or speech, and the training process needs an expected output result (that is, the pure speech with noise suppressed in the ideal case) as a guide for parameter optimization of the noise reduction model. Since noise reduction model training requires a large number of labeled samples and expected output results, in order to meet the requirements of a trainable model network, the training speech signal sample set is synthesized in a structured manner: a pure speech signal is linearly superimposed with noise, so that a noisy speech signal is synthesized as the training input speech signal sample, while the expected output signal is the pure speech signal before synthesis. By using different parameters (for example, a weighting value of 0.7 for the pure speech signal in a single frame signal and 0.3 for the noise), larger-scale training speech signal samples can be generated, which sufficiently supports the training process of the noise reduction model, and this sample construction manner readily provides the expected data corresponding to each sample to guide the optimization training of the noise reduction model.
In some embodiments, the system time information and the geographical coordinate information may be read from the terminal, and corresponding constraint conditions are formed from this two-dimensional information. All the noises are divided into M groups (M is an integer greater than or equal to 2): the collected noise set is divided into M groups of different samples according to time and coordinates; for example, the time of the first group of noise samples is 10:00 to 11:00 in the morning with geographical coordinate GPS0, the time of the second group of noise samples is 11:00 to 12:00 in the morning with geographical coordinate GPS0, and so on. Each group of noises is linearly superimposed on pure voice signals (which can be purchased or found on open source websites) with different weighting values to obtain a training speech signal sample set of a certain scale for training the noise reduction model, and a noise reduction model corresponding to each time period and geographical location (constraint condition) is obtained by training each noise reduction model separately. An example is as follows: from M groups of noise sample sets (each group corresponds to a different time and geographical location, that is, to a different constraint condition), M training speech signal sample sets are constructed and M noise reduction models are trained. The noise reduction models can be designed with the same structure or different structures. As the scale and parameters of the training speech signal sample sets increase, the deep network of the noise reduction model acquires better adaptability to the constraint conditions and can better identify and suppress noise under different constraint conditions.
In some embodiments, in order to reduce the number of models and the consumption of storage resources, the noise reduction models may be clustered to obtain N types of noise reduction models (M >= N), where the N types of noise reduction models correspond to constraint conditions obtained by combining the M different time-coordinate pairs. In practical application, for example, when the user talks between 11:00 and 12:00 in the morning at geographical coordinate GPS0, the server calls the noise reduction model corresponding to the second group of training speech signal samples to perform real-time noise reduction processing. Referring to fig. 4, fig. 4 is a schematic diagram of the noise reduction model of the artificial intelligence based speech noise reduction method provided by an embodiment of the present application. The noise reduction model shown in fig. 4 is used to perform noise reduction processing on the training speech signal samples, that is, to accurately distinguish noise from voice signals, and the noise is then further suppressed by post-processing such as spectral subtraction or Wiener filtering to obtain the noise-reduced voice signal (pure voice signal). The inference stage is the noise reduction process during an actual call: the trained noise reduction model (whose network unit parameters have already reached a relatively optimal solution during training) performs voice or noise estimation on the collected noisy voice signal, and finally suppresses the noise components therein to obtain a cleaner voice signal.
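The spectral-subtraction post-processing mentioned above can be sketched in NumPy. This is a textbook magnitude-subtraction sketch, not the patent's implementation; the noise magnitude spectrum is assumed to be supplied by the model's noise estimate, and the spectral floor guards against negative magnitudes.

```python
import numpy as np

def spectral_subtraction(noisy_frame, noise_mag, n_fft=256, floor=0.01):
    """Suppress an estimated noise magnitude spectrum from one noisy frame.

    noisy_frame: time-domain samples of the frame.
    noise_mag: estimated noise magnitude spectrum (length n_fft // 2 + 1),
    here assumed to come from the noise spectrum estimation network.
    """
    spec = np.fft.rfft(noisy_frame, n=n_fft)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=n_fft)
```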
In some embodiments, the noise reduction models are clustered. In order to reduce the number of models and the consumption of storage resources, a noise reduction model clustering method is proposed that compresses the M noise reduction models into N noise reduction models (M >= N). The process is as follows: first, M groups of noisy sample sets are prepared as noise reduction test sets, respectively corresponding to the M time-geographical coordinates; the number of samples does not need to be large. The M trained noise reduction models are then run on each group of noisy test samples (that is, AI noise reduction processing) to obtain M noise reduction results; the M results are sorted with the minimum mean square error as the judgment criterion, and the first K models whose error values are smaller than a preset error threshold are selected as the candidate noise reduction models of that group of test samples. If fewer than K models satisfy the error condition, the actual number is used, and if all model errors are larger than the preset threshold, the model with the smallest error is selected as the only candidate noise reduction model of the sample set. Finally, if the candidate noise reduction models corresponding to the M sample sets have intersection models, the model with the smallest comprehensive error among the intersection models is uniformly used as the final noise reduction model for the candidate noise reduction models having intersection models. The compression of the number of models is thus achieved by this clustering method.
Continuing with the exemplary structure of the artificial intelligence based speech noise reduction device 255 provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based speech noise reduction device 255-1 of the memory 250 may include: an obtaining module 2551, configured to obtain a voice signal; an obtaining module 2552, configured to obtain a constraint condition corresponding to the voice signal; and a noise reduction module 2553, configured to call a noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal to obtain a noise-reduced voice signal.
In some embodiments, the obtaining module 2552 is further configured to: acquiring a constraint condition corresponding to the voice signal before acquiring the voice signal, wherein the constraint condition comprises at least one of a scheduled time and a scheduled place of voice communication; a noise reduction module further to: before acquiring a voice signal, acquiring a noise reduction model adaptive to a constraint condition in advance; and when the constraint condition is met, automatically calling a noise reduction model adaptive to the constraint condition to perform noise reduction processing on the acquired voice signal to obtain a noise reduction voice signal.
In some embodiments, the obtaining module 2552 is further configured to, before the noise reduction model adapted to the constraint condition is called to perform noise reduction processing on the speech signal, obtain a plurality of noise reduction models, where the plurality of noise reduction models is a set of all noise reduction models that can be called, or a set of noise reduction models adapted to the constraint condition; presenting options of a plurality of noise reduction models in a human-computer interaction interface; and in response to the selection operation, taking the selected noise reduction model as the noise reduction model adapted to the constraint condition.
In some embodiments, the obtaining module 2552 is further configured to: acquiring at least one of the following attribute information of the voice signal: time information of sending or receiving voice signals, geographical location information of sending or receiving voice signals, user information of sending or receiving voice signals, and environment information of sending or receiving voice signals; and determining the constraint condition corresponding to the attribute information of the voice signal according to the preset corresponding relation between different attribute information and different constraint conditions.
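The preset correspondence between attribute information and constraint conditions can be sketched as a simple lookup. The key fields (`time_slot`, `location`) and the returned constraint ids are illustrative assumptions.

```python
def resolve_constraint(attrs, mapping):
    """Look up the constraint condition for a signal's attribute information.

    attrs: dict describing the signal's attributes (illustrative keys).
    mapping: preset correspondence from attribute tuples to constraint ids.
    Returns None when no preset constraint condition matches.
    """
    key = (attrs.get("time_slot"), attrs.get("location"))
    return mapping.get(key)
```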
In some embodiments, the noise reduction module 2553 is further configured to: extract voice features from the voice signal; perform a first full-connection processing on the voice features of the voice signal through the first full-connection layer of the voice endpoint network to obtain a first full-connection processing result corresponding to the voice signal; perform voice endpoint detection processing on the first full-connection processing result through the first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result corresponding to the voice signal; predict the noise spectrum feature of the voice signal through the second gated recurrent unit of the noise spectrum estimation network, with the voice features, the first full-connection processing result and the voice endpoint detection result of the voice signal as input; and predict the gain corresponding to the voice signal through the third gated recurrent unit of the noise spectrum removal network, with the noise spectrum feature, the voice endpoint detection result and the voice features of the voice signal as input, and apply the gain to the voice signal to obtain the noise-reduced voice signal.
In some embodiments, the noise reduction module 2553 is further configured to: after calling the noise reduction model adapted to the constraint condition to perform noise reduction processing on the voice signal to obtain the noise-reduced voice signal, determine a noise reduction effect parameter of the noise-reduced voice signal; and when the noise reduction effect parameter is lower than a noise reduction effect parameter threshold, perform the following processing: determining the similarity between each of a plurality of other constraint conditions and the constraint condition corresponding to the voice signal, and determining the other constraint condition with the highest similarity, so as to call the corresponding noise reduction model to perform noise reduction processing on the noise-reduced voice signal to obtain an updated noise-reduced voice signal, where the other constraint conditions are different from the constraint condition corresponding to the voice signal.
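The fallback to the most similar other constraint condition can be sketched as follows. The overlap-count similarity is an illustrative assumption; the patent does not specify how similarity between constraint conditions is computed.

```python
def attribute_overlap(a, b):
    """Toy similarity: count attribute fields on which two constraints agree."""
    return sum(1 for k in a if k in b and a[k] == b[k])

def fallback_constraint(current, others):
    """When the noise reduction effect parameter is below threshold, choose
    the other constraint condition most similar to the current one."""
    return max(others, key=lambda c: attribute_overlap(current, c))
```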
In some embodiments, each noise reduction model is adapted to one constraint, and different noise reduction models are adapted to different constraints; the apparatus further comprises a training module 2554, configured to, before obtaining the constraint condition of the corresponding speech signal, obtain a training speech signal sample set corresponding to the multiple constraint conditions one to one; and training to obtain a plurality of noise reduction models which are in one-to-one correspondence with the plurality of constraint conditions based on the training voice signal sample set in one-to-one correspondence with the plurality of constraint conditions.
In some embodiments, each noise reduction model is adapted with at least one constraint, and the constraints for different noise reduction model adaptations are different; the training module 2554 is further configured to, before obtaining the constraint conditions of the corresponding voice signals, obtain a training voice signal sample set corresponding to the multiple constraint conditions one to one; training to obtain a plurality of noise reduction models corresponding to the plurality of constraint conditions one by one on the basis of a training voice signal sample set corresponding to the plurality of constraint conditions one by one; clustering the plurality of noise reduction models to obtain at least one clustered noise reduction model; wherein the noise reduction model of each cluster corresponds to at least one constraint condition.
In some embodiments, the training module 2554 is further configured to obtain a set of test speech signal samples corresponding to each constraint; for each set of test speech signal samples, performing the following: carrying out noise reduction processing on the test voice signal samples in the test voice signal sample set through the plurality of noise reduction models to obtain noise reduction results corresponding to the plurality of noise reduction models one to one; and clustering the plurality of noise reduction models according to the noise reduction results which are obtained by aiming at each test voice signal sample set and correspond to the plurality of noise reduction models one by one to obtain at least one clustered noise reduction model.
In some embodiments, the training module 2554 is further configured to perform the following processing for each test speech signal sample set: determining the noise reduction results obtained after the test speech signal samples in the set are denoised by the multiple noise reduction models; determining the minimum mean square error of each noise reduction result, where a noise reduction result comprises the noise-reduced speech signal samples of multiple test speech signal samples in the set; sorting the noise reduction models in ascending order based on the minimum mean square error of their results, and taking at least one top-ranked noise reduction model as at least one candidate noise reduction model for the corresponding test speech signal sample set; and extracting an intersection noise reduction model from the candidate noise reduction models of each test speech signal sample set, so as to take the intersection noise reduction model as a clustered noise reduction model. A candidate noise reduction model serving as the intersection noise reduction model corresponds to multiple test speech signal sample sets.
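The ranking-and-intersection clustering described above can be sketched as follows. This is an illustrative toy, not the patent's implementation: the models, gains, and sample values are invented, and "MMSE" is computed here as the average mean square error of a model's results over one test set.

```python
# Cluster noise reduction models by ranking them per test set on mean
# square error and intersecting the top-ranked candidates across sets.
# All model names and sample values below are hypothetical.

def mmse(denoised, clean):
    """Mean square error between a denoised sample and its clean reference."""
    return sum((d - c) ** 2 for d, c in zip(denoised, clean)) / len(clean)

def cluster_by_intersection(models, test_sets, top_k=2):
    """models: {name: denoise_fn}; test_sets: list of (noisy, clean) pair lists."""
    candidates_per_set = []
    for samples in test_sets:
        # Average error of each model over all samples in this test set.
        scores = {
            name: sum(mmse(fn(noisy), clean) for noisy, clean in samples) / len(samples)
            for name, fn in models.items()
        }
        # Ascending sort: the smaller the error, the higher the rank.
        ranked = sorted(scores, key=scores.get)
        candidates_per_set.append(set(ranked[:top_k]))
    # A model that is a candidate for every test set becomes a clustered model.
    return set.intersection(*candidates_per_set)

# Toy models: each applies a fixed gain to the noisy signal.
models = {
    "strong": lambda x: [0.5 * v for v in x],
    "mild":   lambda x: [0.9 * v for v in x],
    "none":   lambda x: list(x),
}
test_sets = [
    [([1.0, 2.0], [0.9, 1.8])],   # mildly noisy: a gain near 0.9 fits best
    [([2.0, 4.0], [1.8, 3.6])],
]
clustered = cluster_by_intersection(models, test_sets, top_k=2)
```

Here "mild" and "none" rank in the top two for both sets, so both survive the intersection while "strong" is discarded.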
In some embodiments, the training module 2554 is further configured to cluster the multiple constraint conditions, as original constraint conditions, to obtain at least one clustering constraint condition; obtain test speech signal sample sets in one-to-one correspondence with the multiple original constraint conditions, and noise reduction models in one-to-one correspondence with the original constraint conditions; fuse the test speech signal sample sets corresponding to the original constraint conditions to obtain a test speech signal sample set corresponding to the clustering constraint condition; perform noise reduction processing on the test speech signal samples in that set through the multiple noise reduction models to obtain noise reduction results in one-to-one correspondence with the multiple noise reduction models; and obtain the minimum mean square error of each noise reduction result, determining the noise reduction model with the smallest minimum mean square error as the clustered noise reduction model.
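A hedged sketch of this alternative: the test sets of the constraints merged into one clustering constraint are fused into a single set, and the one model with the smallest average error on the fused set is kept. The model names, gains, and sample values are illustrative, not from the patent.

```python
# Pick the single clustered model for a fused test set: the model whose
# average mean square error over the fused set is smallest wins.
# Names and values below are hypothetical.

def mmse(denoised, clean):
    return sum((d - c) ** 2 for d, c in zip(denoised, clean)) / len(clean)

def pick_clustered_model(models, fused_test_set):
    """models: {name: denoise_fn}; fused_test_set: list of (noisy, clean) pairs."""
    scores = {
        name: sum(mmse(fn(noisy), clean) for noisy, clean in fused_test_set)
              / len(fused_test_set)
        for name, fn in models.items()
    }
    return min(scores, key=scores.get)  # smallest error wins

# Fuse the per-constraint test sets belonging to one clustering constraint.
set_a = [([1.0, 2.0], [0.8, 1.6])]
set_b = [([2.0, 2.0], [1.6, 1.6])]
fused = set_a + set_b

models = {
    "gain_0.8": lambda x: [0.8 * v for v in x],
    "gain_1.0": lambda x: list(x),
}
winner = pick_clustered_model(models, fused)
```

Both fused samples have a clean reference at 0.8 times the noisy signal, so the 0.8-gain model achieves zero error and is selected.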
In some embodiments, the training module 2554 is further configured to obtain multiple noises carrying multiple kinds of attribute information, where the attribute information includes: time information of sending or receiving the speech signal, geographic location information of sending or receiving the speech signal, user information of sending or receiving the speech signal, and information on the environment in which the speech signal is sent or received; divide the multiple noises according to the multiple kinds of attribute information to obtain noise sets in one-to-one correspondence with multiple constraint conditions, where each constraint condition has the same multiple kinds of attribute information; and superimpose the noise of each noise set with clean speech signal samples to obtain a training speech signal sample set corresponding to each constraint condition.
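The partitioning step above can be illustrated as grouping noises by their attribute tuple, one group per constraint condition, and then overlaying each group's noises on clean speech. The attribute values and signal samples are invented for illustration.

```python
# Group noises by their attribute information (time, location, user,
# environment) so that noises sharing one attribute combination fall
# under one constraint condition, then superimpose them on clean speech.
from collections import defaultdict

def partition_noises(noises):
    """noises: list of dicts with 'attrs' (hashable tuple) and 'samples'."""
    noise_sets = defaultdict(list)
    for noise in noises:
        # One constraint condition per distinct attribute combination.
        noise_sets[noise["attrs"]].append(noise["samples"])
    return dict(noise_sets)

def superimpose(clean, noise):
    """Overlay a noise on a clean speech sample to form a training sample."""
    return [c + n for c, n in zip(clean, noise)]

noises = [
    {"attrs": ("evening", "subway"), "samples": [0.1, -0.1]},
    {"attrs": ("morning", "office"), "samples": [0.02, 0.01]},
    {"attrs": ("evening", "subway"), "samples": [0.2, -0.2]},
]
sets_by_constraint = partition_noises(noises)

clean = [0.5, 0.4]
training_set = [superimpose(clean, n)
                for n in sets_by_constraint[("evening", "subway")]]
```

Each key of `sets_by_constraint` plays the role of one constraint condition; its value is the noise set that will be superimposed on the clean samples for that condition.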
In some embodiments, the training module 2554 is further configured to obtain a weight for the clean speech signal sample and a weight for the noise; weight the clean speech signal sample and the noise according to those weights to obtain a training speech signal sample; add clean speech or noise on the basis of the training speech signal sample to obtain a new training speech signal sample; and form a training speech signal sample set from the training speech signal sample and the new training speech signal sample.
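The weighted superposition can be sketched in a few lines. The 0.8/0.2 weights and the extra-noise factor are arbitrary example values, not figures from the patent.

```python
# Combine a clean speech sample and a noise with weights to form a
# training sample, then add extra noise on top of it to derive a new
# training sample. All weights and sample values are illustrative.

def weighted_mix(clean, noise, w_clean, w_noise):
    """Weight and superimpose a clean speech sample and a noise."""
    return [w_clean * c + w_noise * n for c, n in zip(clean, noise)]

clean = [1.0, 0.5, -0.5]
noise = [0.2, -0.2, 0.1]

sample = weighted_mix(clean, noise, w_clean=0.8, w_noise=0.2)
# Derive a new sample by adding a further portion of noise to the first.
new_sample = [s + 0.1 * n for s, n in zip(sample, noise)]
training_set = [sample, new_sample]
```

Varying the weights and the added portion is a cheap way to grow one clean/noise pair into several training samples at different signal-to-noise ratios.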
In some embodiments, the training module 2554 is further configured to perform noise reduction processing on the training speech signal samples included in the training speech signal sample set through the noise reduction model to obtain noise-reduced speech signals corresponding to the training speech signal samples; determine the error between the noise-reduced speech signal corresponding to a training speech signal sample and the clean speech signal sample, and substitute the error into the loss function of the noise reduction model; and determine, based on the learning rate of the noise reduction model, the parameter change value of the noise reduction model at which the loss function attains its minimum, and update the parameters of the noise reduction model based on that parameter change value.
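A toy sketch of this update rule, deliberately far simpler than the patent's network: the "model" is a single scalar gain applied to the noisy sample, the loss is the mean square error against the clean sample, and each step changes the parameter by the gradient scaled by the learning rate.

```python
# Gradient-descent parameter update with an MSE loss, using a one-
# parameter "model" (a scalar gain g) purely for illustration.

def denoise(g, noisy):
    return [g * v for v in noisy]

def mse_loss(denoised, clean):
    return sum((d - c) ** 2 for d, c in zip(denoised, clean)) / len(clean)

def train_step(g, noisy, clean, lr=0.1):
    # dL/dg for L = mean((g*x - y)^2) is mean(2*x*(g*x - y)).
    grad = sum(2 * x * (g * x - y) for x, y in zip(noisy, clean)) / len(clean)
    return g - lr * grad  # move the parameter toward the loss minimum

g = 1.0                      # start as a pass-through gain
noisy = [2.0, -2.0]
clean = [1.0, -1.0]          # the clean signal is half the noisy one
for _ in range(100):
    g = train_step(g, noisy, clean)
# g converges toward 0.5, the gain that minimises the loss
```

With this data the update is a contraction (`g ← 0.2·g + 0.4`), so the gain settles at the optimum 0.5 after a few dozen steps; a real noise reduction model repeats the same loop over millions of parameters.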
In some embodiments, the training module 2554 is further configured to extract speech features from the training speech signal samples; perform first fully connected processing on the speech features through a first fully connected layer of a voice endpoint network to obtain a first fully connected processing result; perform voice endpoint detection processing on the first fully connected processing result through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result; predict the noise spectral characteristics of the training speech signal samples through a second gated recurrent unit of a noise spectrum estimation network, taking the speech features, the first fully connected processing result, and the voice endpoint detection result as input; and predict the gain of the corresponding training speech signal samples through a third gated recurrent unit of a noise spectrum removal network, taking the noise spectral characteristics, the voice endpoint detection result, and the speech features as input, and apply the gain to the training speech signal samples.
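The three-stage data flow described above (endpoint detection, then noise spectrum estimation, then gain prediction and application) can be made concrete with deterministic stand-ins. This is only a flow sketch: real implementations use a fully connected layer and gated recurrent units, whereas every function below is a hand-written stub chosen so the pipeline stays readable.

```python
# Simplified data flow of the three-network structure: features -> voice
# endpoint detection -> noise spectrum estimate -> per-frame gain, which
# is finally applied to the signal. Every stage is an illustrative stub.

def extract_features(signal):
    # Stand-in for spectral feature extraction: per-frame energy.
    return [abs(v) for v in signal]

def voice_endpoint(features):
    # Stub VAD: a frame counts as speech when its energy clears a threshold.
    return [1.0 if f > 0.3 else 0.0 for f in features]

def estimate_noise_spectrum(features, vad):
    # Stub noise estimate: average energy of the non-speech frames.
    silent = [f for f, v in zip(features, vad) if v == 0.0]
    return sum(silent) / len(silent) if silent else 0.0

def predict_gain(features, noise_level):
    # Stub gain: suppress frames near the noise floor, keep strong frames.
    return [max(0.0, 1.0 - noise_level / f) if f > 0 else 0.0
            for f in features]

def denoise(signal):
    feats = extract_features(signal)
    vad = voice_endpoint(feats)
    noise_level = estimate_noise_spectrum(feats, vad)
    gain = predict_gain(feats, noise_level)
    return [g * v for g, v in zip(gain, signal)]

out = denoise([0.1, 0.9, -0.8, 0.05])
```

Low-energy frames end up attenuated toward zero while high-energy (speech) frames pass almost unchanged, which is the qualitative behaviour the gain stage of the described network is trained to produce.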
Continuing with the exemplary structure of the artificial intelligence based noise reduction model training apparatus 255-2, implemented as software modules, provided by the embodiments of the present application: in some embodiments, as shown in Fig. 2, the software modules of the artificial intelligence based noise reduction model training apparatus 255-2 stored in the memory 250 may include: a training module 2555, configured to obtain multiple training speech signal samples carrying noise under multiple constraint conditions, so as to form training speech signal sample sets in one-to-one correspondence with the multiple constraint conditions; the training module 2555 is further configured to perform noise reduction processing on the training speech signal samples included in the training speech signal sample sets corresponding to the multiple constraint conditions, through noise reduction models in one-to-one correspondence with the multiple constraint conditions, to obtain noise-reduced speech signals; the training module 2555 is further configured to determine the error between the noise-reduced speech signal and the clean speech signal sample corresponding to the training speech signal sample, and update the parameters of the noise reduction model corresponding to the constraint condition according to the error; the noise reduction model is used for performing noise reduction processing on speech signals under the corresponding constraint condition.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the artificial intelligence based speech noise reduction method or the artificial intelligence based noise reduction model training method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform methods provided by embodiments of the present application, such as an artificial intelligence based speech noise reduction method as shown in fig. 3A-3C and an artificial intelligence based noise reduction model training method as shown in fig. 3D.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiments of the present application, by obtaining the constraint condition under which a speech signal is acquired, the corresponding noise reduction model is invoked for each constraint condition to perform noise reduction processing on the speech signal; that is, differentiated suppression is applied under different constraint conditions, so that the noise reduction effect is optimized in a targeted manner and noise reduction efficiency is improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (19)

1. An artificial intelligence based speech noise reduction method, characterized in that the method comprises: acquiring a speech signal; acquiring a constraint condition corresponding to the speech signal; and invoking a noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal to obtain a noise-reduced speech signal.

2. The method according to claim 1, characterized in that acquiring the constraint condition corresponding to the speech signal comprises: before acquiring the speech signal, acquiring a constraint condition corresponding to the speech signal, the constraint condition comprising at least one of a planned time and a planned location of voice communication; and invoking the noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal to obtain a noise-reduced speech signal comprises: before acquiring the speech signal, acquiring in advance a noise reduction model adapted to the constraint condition; and when the constraint condition is satisfied, automatically invoking the noise reduction model adapted to the constraint condition to perform noise reduction processing on the acquired speech signal to obtain a noise-reduced speech signal.

3. The method according to claim 1, characterized in that before invoking the noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal, the method comprises: acquiring a plurality of noise reduction models, the plurality of noise reduction models being a set of all noise reduction models that can be invoked, or a set of noise reduction models adapted to the constraint condition; presenting options for the plurality of noise reduction models in a human-computer interaction interface; and in response to a selection operation, taking the selected noise reduction model as the noise reduction model adapted to the constraint condition.

4. The method according to claim 1, characterized in that acquiring the constraint condition corresponding to the speech signal comprises: acquiring at least one of the following attribute information of the speech signal: time information of sending or receiving the speech signal, geographic location information of sending or receiving the speech signal, user information of sending or receiving the speech signal, and information on the environment in which the speech signal is sent or received; and determining the constraint condition corresponding to the attribute information of the speech signal according to preset correspondences between different attribute information and different constraint conditions.

5. The method according to claim 1, characterized in that the noise reduction model comprises a voice endpoint network, a noise spectrum estimation network and a noise spectrum removal network, and invoking the noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal to obtain a noise-reduced speech signal comprises: extracting speech features from the speech signal; performing first fully connected processing on the speech features corresponding to the speech signal through a first fully connected layer of the voice endpoint network to obtain a first fully connected processing result corresponding to the speech signal; performing voice endpoint detection processing on the first fully connected processing result corresponding to the speech signal through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result corresponding to the speech signal; predicting noise spectral characteristics of the speech signal through a second gated recurrent unit of the noise spectrum estimation network, taking as input the speech features corresponding to the speech signal, the first fully connected processing result corresponding to the speech signal, and the voice endpoint detection result corresponding to the speech signal; and predicting a gain corresponding to the speech signal through a third gated recurrent unit of the noise spectrum removal network, taking as input the noise spectral characteristics of the speech signal, the voice endpoint detection result corresponding to the speech signal, and the speech features corresponding to the speech signal, and applying the gain corresponding to the speech signal to the speech signal to obtain a noise-reduced speech signal.

6. The method according to claim 1, characterized in that after invoking the noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal through the noise reduction model to obtain a noise-reduced speech signal, the method further comprises: determining a noise reduction effect parameter of the noise-reduced speech signal; and when the noise reduction effect parameter is below a noise reduction effect parameter threshold, performing the following processing: determining similarities between a plurality of other constraint conditions and the constraint condition corresponding to the speech signal, and determining the other constraint condition with the highest similarity, so as to invoke the noise reduction model corresponding to the other constraint condition to perform noise reduction processing on the noise-reduced speech signal to obtain an updated noise-reduced speech signal; wherein the other constraint conditions are different from the constraint condition corresponding to the speech signal.

7. The method according to claim 1, characterized in that each noise reduction model is adapted to one constraint condition, and different noise reduction models are adapted to different constraint conditions; and before acquiring the constraint condition corresponding to the speech signal, the method further comprises: acquiring training speech signal sample sets in one-to-one correspondence with a plurality of constraint conditions; and training, based on the training speech signal sample sets in one-to-one correspondence with the plurality of constraint conditions, a plurality of noise reduction models in one-to-one correspondence with the plurality of constraint conditions.

8. The method according to claim 1, characterized in that each noise reduction model is adapted to one constraint condition, and the constraint conditions to which different noise reduction models are adapted are different; and before acquiring the constraint condition corresponding to the speech signal, the method further comprises: acquiring training speech signal sample sets in one-to-one correspondence with a plurality of constraint conditions; training, based on the training speech signal sample sets in one-to-one correspondence with the plurality of constraint conditions, a plurality of noise reduction models in one-to-one correspondence with the plurality of constraint conditions; and performing clustering processing on the plurality of noise reduction models to obtain at least one clustered noise reduction model; wherein each clustered noise reduction model corresponds to at least one constraint condition.

9. The method according to claim 8, characterized in that performing clustering processing on the plurality of noise reduction models to obtain at least one clustered noise reduction model comprises: acquiring a test speech signal sample set corresponding to each constraint condition; performing the following processing for each test speech signal sample set: performing noise reduction processing on the test speech signal samples in the test speech signal sample set through the plurality of noise reduction models to obtain noise reduction results in one-to-one correspondence with the plurality of noise reduction models; and performing clustering processing on the plurality of noise reduction models according to the noise reduction results obtained for each test speech signal sample set and in one-to-one correspondence with the plurality of noise reduction models, to obtain at least one clustered noise reduction model.

10. The method according to claim 9, characterized in that performing clustering processing on the plurality of noise reduction models according to the noise reduction results obtained for each test speech signal sample set and in one-to-one correspondence with the plurality of noise reduction models, to obtain at least one clustered noise reduction model, comprises: performing the following processing for each test speech signal sample set: determining a plurality of noise reduction results obtained after the test speech signal samples in the test speech signal sample set are subjected to noise reduction processing by the plurality of noise reduction models; determining the minimum mean square error of each noise reduction result, wherein the noise reduction result comprises noise-reduced speech signal samples of a plurality of test speech signal samples in the test speech signal sample set; sorting the plurality of noise reduction models in ascending order based on the minimum mean square error of the noise reduction results, and taking at least one top-ranked noise reduction model as at least one candidate noise reduction model corresponding to the test speech signal sample set; and extracting an intersection noise reduction model from the candidate noise reduction models corresponding to each test speech signal sample set, so as to take the intersection noise reduction model as a clustered noise reduction model; wherein a candidate noise reduction model serving as the intersection noise reduction model corresponds to a plurality of test speech signal sample sets.

11. The method according to claim 8, characterized in that performing clustering processing on the plurality of original noise reduction models to obtain at least one clustered noise reduction model comprises: performing clustering processing on the plurality of constraint conditions as original constraint conditions to obtain at least one clustering constraint condition; acquiring test speech signal sample sets in one-to-one correspondence with the plurality of original constraint conditions, and noise reduction models in one-to-one correspondence with the plurality of original constraint conditions; performing fusion processing on the test speech signal sample sets in one-to-one correspondence with the plurality of original constraint conditions to obtain a test speech signal sample set corresponding to the clustering constraint condition; performing noise reduction processing on the test speech signal samples in the test speech signal sample set corresponding to the clustering constraint condition through the plurality of noise reduction models to obtain noise reduction results in one-to-one correspondence with the plurality of noise reduction models; and acquiring the minimum mean square error of each noise reduction result, and determining the noise reduction model corresponding to the smallest minimum mean square error as the clustered noise reduction model.

12. The method according to claim 7 or 8, characterized in that acquiring training speech signal sample sets in one-to-one correspondence with a plurality of constraint conditions comprises: acquiring a plurality of noises carrying multiple kinds of attribute information; wherein the attribute information comprises: time information of sending or receiving the speech signal, geographic location information of sending or receiving the speech signal, user information of sending or receiving the speech signal, and information on the environment in which the speech signal is sent or received; dividing the plurality of noises according to the multiple kinds of attribute information to obtain noise sets in one-to-one correspondence with the plurality of constraint conditions; wherein each constraint condition has the same multiple kinds of attribute information; and superimposing the noise of each noise set with clean speech signal samples to obtain a training speech signal sample set corresponding to each constraint condition.

13. The method according to claim 12, characterized in that superimposing the noise of each noise set with clean speech signal samples to obtain a training speech signal sample set corresponding to each constraint condition comprises: acquiring a weight of the clean speech signal sample and a weight of the noise; weighting the clean speech signal sample and the noise according to the weight of the clean speech signal sample and the weight of the noise to obtain a training speech signal sample; adding the clean speech signal or the noise on the basis of the training speech signal sample to obtain a new training speech signal sample; and forming the training speech signal sample set from the training speech signal sample and the new training speech signal sample.

14. The method according to claim 7 or 8, characterized in that training, based on the training speech signal sample sets in one-to-one correspondence with the plurality of constraint conditions, a plurality of noise reduction models in one-to-one correspondence with the plurality of constraint conditions comprises: performing noise reduction processing on the training speech signal samples included in the training speech signal sample set through the noise reduction model to obtain a noise-reduced speech signal corresponding to the training speech signal samples; determining an error between the noise-reduced speech signal corresponding to the training speech signal samples and the clean speech signal sample, and substituting the error into a loss function of the noise reduction model; and determining, based on the learning rate of the noise reduction model, a parameter change value of the noise reduction model at which the loss function attains its minimum value, and updating the parameters of the noise reduction model based on the parameter change value.

15. The method according to claim 14, characterized in that performing noise reduction processing on the training speech signal samples included in the training speech signal sample set through the noise reduction model comprises: extracting speech features from the training speech signal samples; performing first fully connected processing on the speech features corresponding to the training speech signal samples through a first fully connected layer of a voice endpoint network to obtain a first fully connected processing result; performing voice endpoint detection processing on the first fully connected processing result corresponding to the training speech signal samples through a first gated recurrent unit of the voice endpoint network to obtain a voice endpoint detection result corresponding to the training speech signal samples; predicting noise spectral characteristics of the training speech signal samples through a second gated recurrent unit of a noise spectrum estimation network, taking as input the speech features corresponding to the training speech signal samples, the first fully connected processing result corresponding to the training speech signal samples, and the voice endpoint detection result corresponding to the training speech signal samples; and predicting a gain corresponding to the training speech signal samples through a third gated recurrent unit of a noise spectrum removal network, taking as input the noise spectral characteristics corresponding to the training speech signal samples, the voice endpoint detection result corresponding to the training speech signal samples, and the speech features corresponding to the training speech signal samples, and applying the gain corresponding to the training speech signal samples to the training speech signal samples.

16. An artificial intelligence based method for training a noise reduction model, characterized in that the method comprises: acquiring a plurality of training speech signal samples carrying noise under a plurality of constraint conditions, to form training speech signal sample sets in one-to-one correspondence with the plurality of constraint conditions; performing noise reduction processing on the training speech signal samples included in the training speech signal sample sets corresponding to the plurality of constraint conditions, through noise reduction models in one-to-one correspondence with the plurality of constraint conditions, to obtain noise-reduced speech signals; and determining an error between the noise-reduced speech signal and a clean speech signal sample corresponding to the training speech signal sample, and updating parameters of the noise reduction model corresponding to the constraint condition according to the error; wherein the noise reduction model is used for performing noise reduction processing on speech signals under the corresponding constraint condition.

17. An artificial intelligence based speech noise reduction apparatus, characterized by comprising: an acquisition module, configured to acquire a speech signal; an acquisition module, configured to acquire a constraint condition corresponding to the speech signal; and a noise reduction module, configured to invoke a noise reduction model adapted to the constraint condition to perform noise reduction processing on the speech signal to obtain a noise-reduced speech signal.

18. An electronic device, characterized by comprising: a memory, configured to store executable instructions; and a processor, configured to implement, when executing the executable instructions stored in the memory, the artificial intelligence based speech noise reduction method according to any one of claims 1 to 15, or the artificial intelligence based method for training a noise reduction model according to claim 16.

19. A computer-readable storage medium, characterized in that it stores executable instructions for implementing, when executed by a processor, the artificial intelligence based speech noise reduction method according to any one of claims 1 to 15, or the artificial intelligence based method for training a noise reduction model according to claim 16.
CN202110116096.8A 2021-01-28 2021-01-28 Speech noise reduction method, device and electronic equipment based on artificial intelligence Active CN113593595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110116096.8A CN113593595B (en) 2021-01-28 2021-01-28 Speech noise reduction method, device and electronic equipment based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN113593595A true CN113593595A (en) 2021-11-02
CN113593595B CN113593595B (en) 2025-02-18

Family

ID=78238137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110116096.8A Active CN113593595B (en) 2021-01-28 2021-01-28 Speech noise reduction method, device and electronic equipment based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN113593595B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005338286A (en) * 2004-05-25 2005-12-08 Yamaha Motor Co Ltd Object sound processor and transport equipment system using same, and object sound processing method
US20070041589A1 (en) * 2005-08-17 2007-02-22 Gennum Corporation System and method for providing environmental specific noise reduction algorithms
CN108447494A (en) * 2018-01-31 2018-08-24 广东聚晨知识产权代理有限公司 A kind of voice communication intelligent processing method
CN111031186A (en) * 2019-12-03 2020-04-17 苏宁云计算有限公司 Noise processing method, server and client
CN111768795A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Noise suppression method, device, equipment and storage medium for voice signal

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CONG GUO et al.: "A speech enhancement algorithm using computational auditory scene analysis with spectral subtraction", IEEE, 27 March 2017 (2017-03-27) *
ZHANG Weiqiang et al.: "A speech enhancement algorithm based on computational auditory scene analysis", Journal of Tianjin University (Science and Technology), vol. 38, no. 08, 31 August 2015 (2015-08-31) *
ZHANG Xing; ZHAO Xin: "Speech enhancement algorithm based on neural network noise classification", Journal of China Academy of Electronics and Information Technology, no. 09, 20 September 2020 (2020-09-20) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793620A (en) * 2021-11-17 2021-12-14 深圳市北科瑞声科技股份有限公司 Speech noise reduction method, device, device and storage medium based on scene classification
CN113793620B (en) * 2021-11-17 2022-03-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device and equipment based on scene classification and storage medium
CN113823309A (en) * 2021-11-22 2021-12-21 成都启英泰伦科技有限公司 Noise reduction model construction and noise reduction processing method
CN114664322A (en) * 2022-05-23 2022-06-24 深圳市听多多科技有限公司 Single-microphone hearing-aid noise reduction method based on Bluetooth headset chip and Bluetooth headset

Also Published As

Publication number Publication date
CN113593595B (en) 2025-02-18

Similar Documents

Publication Publication Date Title
CN108255934B (en) Voice control method and device
CN104538024B (en) Phoneme synthesizing method, device and equipment
CN113593595B (en) Speech noise reduction method, device and electronic equipment based on artificial intelligence
JP2021533397A (en) Speaker dialification using speaker embedding and a trained generative model
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN112309365A (en) Training method, device, storage medium and electronic device for speech synthesis model
CN107316635B (en) Voice recognition method and device, storage medium and electronic equipment
CN111081280A (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN111816170A (en) Training of audio classification model and junk audio recognition method and device
CN109215679A (en) Dialogue method and device based on user emotion
Islam et al. Soundsifter: Mitigating overhearing of continuous listening devices
JP2024542658A (en) Audio processing method and device, computer device, and program
CN114817514A (en) Method and device for determining reply audio, storage medium and electronic device
CN108986804A (en) Man-machine dialogue system method, apparatus, user terminal, processing server and system
WO2025031102A9 (en) Method and apparatus for training speech enhancement network, and storage medium, device and product
CN112398952A (en) Electronic resource pushing method, system, equipment and storage medium
US12164828B2 (en) Method and system for assigning unique voice for electronic device
CN109637509A (en) A kind of music automatic generation method, device and computer readable storage medium
CN112740219A (en) Generating method, device, storage medium and electronic device for gesture recognition model
Li et al. Overview and analysis of speech recognition
WO2024093557A1 (en) Data processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
KR102644989B1 (en) Method for providing psychological counseling service using voice data of the deceased based on artificial intelligence algorithm
CN112035648A (en) User data processing method and device and electronic equipment
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
CN113761232B (en) A method, device, electronic device and storage medium for generating an audio library

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40054076

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant