CN113823303A - Audio noise reduction method and device and computer readable storage medium - Google Patents

Audio noise reduction method and device and computer readable storage medium

Info

Publication number
CN113823303A
Authority
CN
China
Prior art keywords
noise reduction
voice
audio data
scene
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110653790.3A
Other languages
Chinese (zh)
Inventor
郑吉剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN202110653790.3A
Publication of CN113823303A

Classifications

    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L21/0208: Noise filtering
          • G10L15/00: Speech recognition
            • G10L15/08: Speech classification or search
              • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
              • G10L15/16: Speech classification or search using artificial neural networks
            • G10L15/26: Speech to text systems
          • G10L17/00: Speaker identification or verification techniques
          • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
              • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • H: ELECTRICITY
      • H04: ELECTRIC COMMUNICATION TECHNIQUE
        • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
            • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
              • H04N21/21: Server components or server architectures
                • H04N21/218: Source of audio or video content, e.g. local disk arrays
                  • H04N21/2187: Live feed

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of this application provide an audio noise reduction method and apparatus and a computer-readable storage medium, relating to the technical field of speech processing. The method comprises the following steps: acquiring, from an audio stream, the current audio data and a preset scene tag for the current moment; performing voice recognition on the current audio data and determining a first voice category of the current audio data; generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag; and performing noise reduction processing on the audio stream based on the target noise reduction parameter. By performing voice recognition on the current audio data and matching it, in combination with the scene tag, to corresponding target noise reduction parameters, the embodiments achieve the technical effect of improved sound quality.

Description

Audio noise reduction method and device and computer readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to an audio denoising method and apparatus, and a computer-readable storage medium.
Background
In the digital network era, once speech has been recorded, whether speaking voice, singing voice, musical instruments, or even noise, it can be processed with digital audio software. To pursue excellent sound quality, people usually need to further denoise the audio file so as to reduce the interference of external noise on the listener.
In the prior art, fixed noise reduction parameters are generally configured on the server side. For example, in a live webcast scenario, a noise reduction function is applied during recording to eliminate the background noise picked up while capturing the anchor's voice, so as to improve the anchor's sound quality. However, fixed noise reduction parameters cannot be matched to different audio content, so the noise reduction effect is not ideal.
Disclosure of Invention
The application provides an audio noise reduction method, an audio noise reduction device and a computer readable storage medium, which are used for solving the technical problem that the noise reduction effect is not ideal.
In a first aspect, a method for audio noise reduction is provided, the method comprising:
acquiring, from an audio stream, the current audio data and a preset scene tag for the current moment;
performing voice recognition on the current audio data, and determining a first voice category of the current audio data;
generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag;
and performing noise reduction processing on the audio stream based on the target noise reduction parameter.
In one possible implementation, generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag includes:
acquiring a second voice category corresponding to the audio data at the previous moment;
and if the first voice category does not match the second voice category, generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag.
In one possible implementation, performing speech recognition on the current audio data, and determining a first voice category of the current audio data includes:
performing voice detection on the current audio data, and extracting at least one voice segment;
acquiring the audio characteristics of each voice segment;
determining a first voice category of the current audio data based on the audio features; wherein the first voice category includes speaking voice and singing voice.
In another possible implementation, generating the target noise reduction parameter for the current audio data based on the first voice category and the scene tag includes:
weighting the voice noise reduction parameter corresponding to the first voice category and the scene noise reduction parameter corresponding to the scene tag to obtain the target noise reduction parameter; the voice noise reduction parameter corresponding to speaking voice is larger than the voice noise reduction parameter corresponding to singing voice.
In another possible implementation, generating the target noise reduction parameter for the current audio data based on the first voice category and the scene tag includes:
determining an acquisition path of the audio data;
and generating the target noise reduction parameter for the current audio data based on the first voice category, the scene tag and the acquisition path.
In yet another possible implementation, generating the target noise reduction parameter for the current audio data based on the first voice category, the scene tag and the acquisition path includes:
if the acquisition path has no matched noise reduction attribute, generating the target noise reduction parameter for the current audio data based on the first voice category, the scene tag and the acquisition path.
In yet another possible implementation, generating the target noise reduction parameter for the current audio data based on the first voice category, the scene tag and the acquisition path includes:
weighting the voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the path noise reduction parameter corresponding to the acquisition path to obtain the target noise reduction parameter.
In another possible implementation manner, weighting the voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the path noise reduction parameter corresponding to the acquisition path to obtain the target noise reduction parameter includes:
determining a first weight of the voice noise reduction parameter, a second weight of the scene noise reduction parameter and a third weight of the path noise reduction parameter;
performing a weighted summation of the voice noise reduction parameter, the scene noise reduction parameter and the path noise reduction parameter based on the first weight, the second weight and the third weight to obtain the target noise reduction parameter; wherein the first weight is greater than either the second weight or the third weight.
In a second aspect, there is provided an audio noise reduction apparatus, comprising:
the acquisition module is used for acquiring the current audio data and a preset scene tag for the current moment from the audio stream;
the recognition module is used for performing voice recognition on the current audio data and determining a first voice category of the current audio data;
the generating module is used for generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag;
and the noise reduction module is used for performing noise reduction processing on the audio stream based on the target noise reduction parameter.
In a possible implementation manner, the generating module is specifically configured to:
acquiring a second voice category corresponding to the audio data at the previous moment;
and if the first voice category does not match the second voice category, generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag.
In a possible implementation manner, the identification module is specifically configured to:
performing voice detection on the current audio data, and extracting at least one voice segment;
acquiring the audio characteristics of each voice segment;
a first voice category of the current audio data is determined based on the audio features, wherein the first voice category includes speaking voice and singing voice.
In another possible implementation manner, the generating module is specifically configured to:
weighting the voice noise reduction parameter corresponding to the first voice category and the scene noise reduction parameter corresponding to the scene tag to obtain the target noise reduction parameter; the voice noise reduction parameter corresponding to speaking voice is larger than the voice noise reduction parameter corresponding to singing voice.
In another possible implementation manner, the generating module specifically includes:
the determining unit is used for determining an acquisition path of the audio data;
and the generating unit is used for generating the target noise reduction parameter for the current audio data based on the first voice category, the scene tag and the acquisition path.
In another possible implementation manner, the generating unit is specifically configured to:
and if the acquisition path has no matched noise reduction attribute, generating the target noise reduction parameter for the current audio data based on the first voice category, the scene tag and the acquisition path.
In another possible implementation manner, the generating unit is further configured to:
and weighting the voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the path noise reduction parameter corresponding to the acquisition path to obtain the target noise reduction parameter.
In another possible implementation manner, the generating unit is further configured to:
determining a first weight of the voice noise reduction parameter, a second weight of the scene noise reduction parameter and a third weight of the path noise reduction parameter;
performing a weighted summation of the voice noise reduction parameter, the scene noise reduction parameter and the path noise reduction parameter based on the first weight, the second weight and the third weight to obtain the target noise reduction parameter; wherein the first weight is greater than either the second weight or the third weight.
In a third aspect, an electronic device is provided, which includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the audio noise reduction method as shown in the first aspect of the present application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the audio noise reduction method shown in the first aspect of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device implements the method provided in the first aspect embodiment or the second aspect embodiment when executed.
The beneficial effects brought by the technical solution provided by this application are as follows:
according to the method, voice recognition is performed on the current audio data and, in combination with the preset scene tag, the target noise reduction parameters for the current audio data are determined, thereby achieving effective noise reduction processing of the audio stream.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1a is an application scenario diagram of an audio denoising method according to an embodiment of the present application;
fig. 1b is a diagram of another application scenario of an audio denoising method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an audio denoising method according to an embodiment of the present application;
fig. 3 is a configuration diagram of a live broadcast page provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a speech recognition scheme according to an embodiment of the present application;
fig. 5 is a target noise reduction parameter configuration table for a scene tag according to an embodiment of the present application;
fig. 6 is a table of configuration of target noise reduction parameters for a first voice category according to an embodiment of the present application;
fig. 7 is a target noise reduction parameter configuration table for an acquisition channel according to an embodiment of the present disclosure;
fig. 8 is a flowchart illustrating an audio denoising method in an example provided by an embodiment of the present application;
fig. 9 is a flowchart illustrating a live webcasting process in an example provided by an embodiment of the present application;
fig. 10 is a schematic structural diagram of an audio noise reduction device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device for audio noise reduction according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting it.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines gain the abilities of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
The audio noise reduction method provided by this application can match the noise reduction requirements of the current audio data in real time, so that the sound quality of the audio stream can be effectively improved.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transactions and is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform can comprise processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for the identity information management of all blockchain participants, including maintaining public/private key generation (account management), key management, and the correspondence between a user's real identity and blockchain address (authority management); with authorization, it can supervise and audit the transactions of certain real identities and provide rule configuration for risk control. The basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after reaching consensus on valid requests, record them to storage; for a new service request, the basic service first performs interface adaptation, parsing and authentication (interface adaptation), then encrypts the service information via a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for registering, issuing, triggering and executing contracts; developers can define contract logic in a programming language and issue it to the blockchain (contract registration), where it is triggered by keys or other events and executed according to the contract terms, and the module also provides functions for upgrading and canceling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting and cloud adaptation during product release, as well as visual output of real-time state during product operation, such as alarms, monitoring network conditions, and monitoring the health of node devices.
In the audio noise reduction scheme provided by the embodiment of the application, the scene tag of the audio stream, the first voice category of the audio data, and the corresponding target noise reduction parameter may be stored in the block chain, and when the server or the terminal for audio noise reduction performs audio noise reduction, whether the target noise reduction parameter corresponding to the current scene tag and the first voice category exists in the block chain may be queried first, and the target noise reduction parameter is obtained from the block chain, so as to perform noise reduction processing on the audio data.
The scheme provided by the embodiment of the application relates to a voice processing technology of natural language processing, and is specifically explained by the following embodiment.
In the digital network age, as speech processing technology continues to advance, recording technology plays an irreplaceable role in social development. For example, the use of smartphones and the development of the film and music industries all require the support and guarantee of recording technology. Once speech has been recorded, whether speaking voice, singing voice, musical instruments, or even noise, it can be processed with digital audio software, and to pursue excellent sound quality people usually need to further denoise the audio file, reducing the interference of external noise on the listener.
In the prior art, generally either the server configures fixed noise reduction parameters or the user manually adjusts the noise reduction strength. For example, during a live webcast, to improve the anchor's sound quality, a noise reduction function is added during recording to eliminate the background noise picked up while capturing the anchor's voice. In a chat scenario a fixed noise reduction configuration may suffice, but singing, dancing and outdoor scenes have different requirements: when singing, the noise reduction parameters should be lowered or noise reduction even turned off; when dancing, noise reduction degrades the dance music; and outdoors, wind noise and other environmental noise in the background affect the intelligibility of speech.
In general, noise reduction parameters are configured on the server side or the anchor manually adjusts the noise reduction degree, and these approaches have the following disadvantages: when the audio data of the audio stream changes in real time, its noise reduction requirements also change, and a fixed configuration cannot match an audio stream that changes in real time; and when the live scene changes, it is inconvenient for the anchor to adjust the scene tag in time, so the noise reduction parameters cannot be configured automatically according to the change of scene.
The audio noise reduction method provided by this application configures the target noise reduction parameters in real time based on the current audio data, can meet the noise reduction requirements of an audio stream that changes in real time, improves the noise reduction effect and efficiency compared with the prior art, improves the sound quality, and effectively improves the user experience.
The present application provides an audio denoising method, apparatus and computer-readable storage medium, which aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
As shown in fig. 1a, the audio noise reduction method of this application may be applied to the scenario of fig. 1a. Specifically, after acquiring an audio stream 101 to be processed, a server 102 performs voice recognition on the current audio data in the audio stream 101, acquires a preset scene tag, determines a target noise reduction parameter for the current audio data based on the recognized first voice category and the scene tag, and then performs noise reduction processing on the audio stream 101 according to the target noise reduction parameter to obtain a denoised audio stream 103.
As shown in fig. 1b, the audio noise reduction method of this application may also be applied to the scenario of fig. 1b. Specifically, the terminal 104 may collect the audio stream 101 and send it to the server; the server 102 performs recognition on the current audio data in the audio stream, determines a target noise reduction parameter according to the recognized first voice category and the preset scene tag, and sends the target noise reduction parameter to an APP (Application) on the terminal 104 for noise reduction. In other scenarios, the terminal may itself collect the audio data, determine the target noise reduction parameter, and process the audio accordingly.
Those skilled in the art will understand that the "terminal" used herein may be a Mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), etc.; a "server" may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
An embodiment of the present application provides an audio denoising method, as shown in fig. 2, where the method may be applied to a server shown in fig. 1a and fig. 1b, and may also be applied to a terminal, and the method may include the following steps:
s201, current audio data and a preset scene label at the current moment are acquired from the audio stream.
The server or the terminal for performing audio noise reduction processing may use audio data acquired in real time from an audio stream as current audio data, for example, record an audio stream signal through an audio acquisition device, such as a microphone, to obtain the current audio data; the existing audio stream may also be processed to obtain current audio data, for example, capturing and intercepting sound from the audio stream by using audio processing software, stripping sound of the audio stream in the video, or intercepting a segment of sound from the audio stream as the current audio data.
Specifically, a preset scene tag can be acquired from a terminal or a server; the scene label can represent a scene or an activity category where a signal source of the audio stream is located, or a scene or an activity category corresponding to the content of the audio stream; the scene label can be preset by the user, and can also be determined by the system according to the historical use data of the user.
Taking live webcasting as an example, the scenes of audio data during a live webcast can be divided into: conference, lecture, singing, chatting, outdoor, dancing, and so on. Fig. 3 shows the layout of a live broadcast page. T1 is an avatar or cover area displayed on the recommendation page; T2 is a title area showing the broadcast title set by the user; T3 is a tag area in which tags distinguish the specific scene classification, such as conference, singing, teaching, chatting, outdoor, dancing, relative and the like; T4 is a function area in which the anchor, when starting the broadcast, can switch between front and rear cameras, select a suitable beauty filter, and so on. The user can set the desired scene tag in the tag area according to the actual live content and live scene.
S202, performing voice recognition on the current audio data, and determining a first voice category of the current audio data.
The first voice category may be the category of the human voice features in the audio data; voice categories may include speaking voice and singing voice.
In some embodiments, the server or terminal performing audio noise reduction may first recognize the text content of the current audio data and then classify the text content to obtain the first voice category. Specifically, voice recognition can be performed on the audio data by a speech recognition network to obtain text data corresponding to the content of the audio data, realizing the conversion from audio to text; the text data is then classified by a pre-trained classification network to determine the first voice category of the current audio data.
In other embodiments, the server or terminal performing audio noise reduction may extract audio features from the current audio data and then determine the first voice category by applying digital signal processing to those features. The specific recognition process based on audio features is described in detail below.
S203, generating target noise reduction parameters aiming at the current audio data based on the first voice category and the scene label.
The target noise reduction parameter may take a value from 0 to 1 and in effect represents the fraction of the full noise reduction capability of the noise reduction software or device that is applied. For example, when the audio stream has a high requirement for noise reduction, the target noise reduction parameter may range from 0.6 to 1, i.e., the audio stream is denoised with 60% to 100% of the total noise reduction capability of the software or device.
In some embodiments, the server or terminal performing audio noise reduction may be preset with correspondences between classification results, scene tags and noise reduction parameters, so that the noise reduction parameter for the audio data can be looked up directly from the first voice category and scene tag; alternatively, a functional relationship among the classification result, the scene tag and the noise reduction parameter may be defined, and the noise reduction parameter computed from the first voice category and the scene tag.
In other embodiments, the classification result and the scene category may be combined with other parameters, for example, noise reduction parameters corresponding to an acquisition path of the audio data, to determine the noise reduction parameters for the current audio data, and a specific process of determining the noise reduction parameters will be described in detail below.
And S204, performing noise reduction processing on the audio stream based on the target noise reduction parameters.
Specifically, the server or terminal performing the audio noise reduction may configure preset noise reduction software based on the target noise reduction parameter and denoise the audio stream through the configured software. Before the audio stream is denoised, noise audio needs to be acquired and used to configure the noise reduction software, so that the software can accurately identify the noise when denoising the audio stream.
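The application leaves the suppressor itself to preset noise reduction software. As a minimal illustration of how a 0-1 target parameter can scale suppression strength, the following Python spectral-subtraction sketch (an assumption for illustration, not the algorithm disclosed here) estimates a per-bin noise magnitude profile from a noise-only clip and subtracts it in proportion to the parameter:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    # Hann-windowed frames; with hop = n_fft/2 the windows overlap-add to 1.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=512, hop=256):
    # Plain overlap-add of the inverse-transformed frames.
    out = np.zeros(hop * (spec.shape[0] - 1) + n_fft)
    for i, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=1)):
        out[i * hop:i * hop + n_fft] += frame
    return out

def denoise(signal, noise_clip, strength):
    # strength plays the role of the target noise reduction parameter:
    # 0 leaves the audio untouched, 1 subtracts the full noise estimate.
    spec = stft(signal)
    noise_profile = np.abs(stft(noise_clip)).mean(axis=0)  # per-bin noise magnitude
    mag, phase = np.abs(spec), np.angle(spec)
    mag = np.maximum(mag - strength * noise_profile, 0.0)
    return istft(mag * np.exp(1j * phase))
```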
In one embodiment, take game audio acquisition as an application scenario. In a typical large-scale game there is abundant background sound and many sound effects, and multiple players need to maintain real-time voice communication with one another, so real-time noise reduction according to the changing game scenes is needed to preserve the game's sound effects while ensuring the players' voice communication. Before audio acquisition, the user presets a scene tag such as fighting, music or shooting according to the needs of the game scene; voice recognition is then performed on the current game audio data to obtain its first voice category, such as speaking voice or singing voice; and a target noise reduction parameter is generated based on the scene tag and the first voice category to denoise the game's audio stream.
According to the method, voice recognition is performed on the current audio data and, in combination with the preset scene tag, the target noise reduction parameters for the current audio data are determined, thereby achieving effective noise reduction processing of the audio stream.
An embodiment of this application provides a possible implementation in which performing voice recognition on the current audio data in step S202 and determining the first voice category of the current audio data may include:
(1) and carrying out voice detection on the current audio data, and extracting at least one voice segment.
Specifically, fluctuations of the time-domain signal of the current audio data may be detected with a VAD (Voice Activity Detection) algorithm to identify the voice and non-voice parts of the current audio data and extract the voice segments.
In phonetics, sounds produced with vibrating vocal cords are called voiced, and sounds produced without vocal cord vibration are called unvoiced. Among speech features, short-time energy is better suited to detecting voiced speech, while the short-time zero-crossing rate is better suited to detecting unvoiced speech. The VAD algorithm adopts a double-threshold endpoint detection method that combines the short-time zero-crossing rate and short-time energy as decision indicators. The short-time average zero-crossing rate is the number of times one frame of the speech time-domain signal crosses the horizontal axis (zero level); short-time energy is the energy of one frame of the speech signal. The energy of the non-voice portions is usually smaller than that of the voice portions, and the energy of unvoiced sounds is smaller than that of voiced sounds.
The specific VAD detection steps are as follows: first, frame the current voice data; then compute the short-time energy and short-time zero-crossing rate of each frame; and then determine the start frame and end frame of each voice segment based on preset upper and lower thresholds for the short-time energy or the short-time zero-crossing rate, thereby extracting the voice segments.
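The double-threshold detection just described can be sketched as follows; the frame sizes and threshold values are illustrative assumptions, not values given in this application:

```python
import numpy as np

def vad_segments(x, sr, frame_ms=25, hop_ms=10, zcr_thresh=0.15):
    # Short-time energy and zero-crossing rate per frame.
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, len(x) - frame) // hop
    energy = np.array([np.sum(x[i*hop:i*hop+frame] ** 2) for i in range(n)])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(x[i*hop:i*hop+frame]))) > 0)
                    for i in range(n)])
    # Double threshold: high-energy frames count as voiced speech; lower-energy
    # frames with a high zero-crossing rate are kept as unvoiced speech.
    high, low = 0.5 * energy.max(), 0.1 * energy.max()
    voiced = (energy > high) | ((energy > low) & (zcr > zcr_thresh))
    # Merge consecutive voice frames into (start, end) sample ranges.
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * hop
        if not v and start is not None:
            segments.append((start, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start, len(x)))
    return segments
```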
(2) And acquiring the audio characteristics of each human voice segment.
Specifically, feature extraction can be performed on each voice segment to obtain a time-varying sequence of speech features, i.e., the audio features. The audio features may be acoustic features such as LPC (linear predictive coding coefficients), MFCC (mel-frequency cepstral coefficients), or CEP (cepstral coefficients).
Taking MFCC feature extraction as an example: the VAD-preprocessed voice segment is first pre-emphasized with a high-pass filter; the pre-emphasized audio is then divided into frames of 20 ms, and each frame is windowed to reduce spectral leakage of the audio signal; a discrete Fourier transform is applied to obtain the frequency-domain signal, which is filtered by a mel-scale filter bank to obtain the mel spectrum; finally, cepstral analysis of the mel spectrum yields the MFCC coefficients.
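A compact version of this MFCC pipeline, assuming the librosa library is available (librosa bundles the windowing, DFT, mel filtering and cepstral steps; the 0.97 pre-emphasis coefficient is a conventional choice, not one specified here):

```python
import numpy as np
import librosa

def extract_mfcc(segment, sr, n_mfcc=13):
    # Pre-emphasis with a simple high-pass filter, as described above.
    emphasized = np.append(segment[0], segment[1:] - 0.97 * segment[:-1])
    # 20 ms frames (per the text); librosa handles windowing, DFT,
    # the mel filter bank, and the cepstral (DCT) step internally.
    frame = int(0.020 * sr)
    return librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame, hop_length=frame // 2)
```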
The audio features can represent the feature information of the voice segments from multiple dimensions, and the voice categories of the audio data are identified based on the audio features, so that the identification accuracy can be effectively improved.
(3) A first voice category of the current audio data is determined based on the audio features, wherein the first voice category includes speaking voice and singing voice.
In some embodiments, the audio features may be classified by digital signal processing to determine the first voice category.
The main digital-signal-processing classification method is to judge how fast the fundamental frequency changes: the rate of change of the fundamental frequency of the audio features determines whether a voice segment is singing voice or speaking voice. In music, each sung tone holds an essentially constant pitch, while the fundamental frequency of speaking voice changes continuously. A specific classification method is as follows: for the audio features, use a Matlab findpeaks-based computation to obtain the fundamental frequency value of each audio feature and examine the fundamental frequency variation within a preset time period, such as 1 second. If within one second the fundamental frequency swings between 200 Hz and 400 Hz, changing some 400 times, the voice segment corresponding to the audio features can be judged to be speaking voice; if within one second the fundamental frequency stays around 300 Hz with differences of less than 10 Hz, the voice segment can be judged to be singing voice.
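A minimal Python analogue of this decision rule, assuming librosa's YIN pitch tracker in place of the Matlab findpeaks computation; the 80-500 Hz search range and the use of the median absolute deviation are assumptions, while the 10 Hz cutoff follows the example above:

```python
import numpy as np
import librosa

def classify_by_f0(segment, sr):
    # Frame-level fundamental frequency estimates over the segment.
    f0 = librosa.yin(segment, fmin=80, fmax=500, sr=sr)
    # Singing holds each note near a constant pitch; speech drifts
    # continuously, so its f0 spread is much larger.
    deviation = np.median(np.abs(f0 - np.median(f0)))
    return "singing" if deviation < 10.0 else "speaking"
```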
In other embodiments, the first voice category may be determined by classifying the textual content of the audio data obtained by the speech recognition network.
Specifically, as shown in fig. 4, the process of obtaining the first voice category through the speech recognition network includes:
First, based on a trained acoustic model, the probability that each phoneme in a preset training set generated each frame of audio features is calculated, and the phoneme sequence with the maximum probability is determined, realizing the conversion from audio features to a phoneme sequence. The acoustic model may be a GMM (Gaussian Mixture Model), an HMM (Hidden Markov Model), or the like.
Text data is then determined by the trained language model so that the probability of the phoneme sequence converting into that text is maximal, realizing the conversion from phoneme sequence to text data. The language model computes the probability that the phoneme sequence forms each complete text; it may be a statistics-based N-gram model, a neural network language model, or a model based on the Transformer architecture.
Finally, the text data is classified by a pre-trained classification network to obtain the first voice category. The classification network may be based on a random forest model, an SVM (Support Vector Machine) classification model, or a neural network classification model; the neural network classification model may be a text classification network such as TextCNN (a text convolutional neural network) or LSTM (Long Short-Term Memory network).
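The overall route of fig. 4 can be summarized in the following Python skeleton. The three model objects and their decode/predict methods are hypothetical placeholders: the text names model families (GMM/HMM acoustic models, N-gram or neural language models, TextCNN/LSTM/SVM classifiers) but fixes no concrete API.

```python
def recognize_first_voice_category(audio_features,
                                   acoustic_model, language_model, classifier):
    # 1. Acoustic model: most probable phoneme sequence for the features.
    phonemes = acoustic_model.decode(audio_features)
    # 2. Language model: the text with the highest conversion probability.
    text = language_model.decode(phonemes)
    # 3. Text classifier: map the transcript to speaking vs. singing voice.
    return classifier.predict(text)
```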
According to the embodiment of the application, the first voice category is obtained through digital signal processing or a voice recognition network, the accuracy of voice recognition of the audio data is improved, and a reliable guarantee is provided for a subsequent noise reduction scheme to meet the noise reduction requirement of the audio stream signal which changes in real time.
A possible implementation manner is provided in this embodiment of the present application, where the generating of the target noise reduction parameter for the current audio data based on the first voice category and the scene tag in step S203 may include:
(1) and acquiring a second voice category corresponding to the audio data at the previous moment.
Specifically, the second voice category corresponding to the audio data at the previous time may be obtained by querying from a database for storing the classification result. The recognition mode of the second voice category is the same as the recognition mode of the first voice category.
(2) And if the first voice category is not matched with the second voice category, generating a target noise reduction parameter aiming at the current audio data based on the first voice category and the scene label.
Specifically, when the first voice category matches the second voice category, the scene or activity category and the voice category corresponding to the current audio data are the same as those corresponding to the previous audio data, and the target noise reduction parameter is the noise reduction parameter already used for the audio data at the previous moment;
when the first voice category does not match the second voice category, the voice category corresponding to the current audio data differs from that of the previous audio data, and the target noise reduction parameter is a new noise reduction parameter generated in real time based on the first voice category and the scene tag.
According to the method and the device, the target noise reduction parameters for the current audio data are generated in real time according to the first voice category and the scene label based on the changed audio stream signals, and the noise reduction effect of the audio stream can be effectively improved.
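A minimal sketch of this category-change check; the cache layout and the generate callable are hypothetical names introduced for illustration:

```python
_last_category = {}  # stream id -> voice category of the previous audio chunk

def target_params_for_chunk(stream_id, first_category, scene_tag,
                            prev_params, generate):
    # Reuse the previous parameters while the voice category is unchanged;
    # regenerate them (via any generate(first_category, scene_tag) callable,
    # e.g. the weighting shown later) only on a category change.
    second_category = _last_category.get(stream_id)
    _last_category[stream_id] = first_category
    if second_category == first_category and prev_params is not None:
        return prev_params
    return generate(first_category, scene_tag)
```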
In another possible implementation manner provided in this embodiment of the present application, the generating, in step S203, a target noise reduction parameter for the current audio data based on the first voice category and the scene tag may include:
weighting the voice noise reduction parameter corresponding to the first voice category and the scene noise reduction parameter corresponding to the scene tag to obtain the target noise reduction parameter; the voice noise reduction parameter corresponding to speaking voice is larger than the voice noise reduction parameter corresponding to singing voice.
Specifically, in some embodiments, the corresponding relationship between the first voice category and the voice noise reduction parameter, and the corresponding relationship between the scene tag and the scene noise reduction parameter may be preset, and then the voice noise reduction parameter and the scene noise reduction parameter may be determined based on the corresponding relationships.
Specifically, taking a live webcast scene as an example, the preset correspondence between scene tags and scene noise reduction parameter value intervals may be as shown in fig. 5. When the scene tag is conference, outdoor or relative, the scene places high demands on background noise processing, and the corresponding noise reduction weight is set high; when the scene is singing, music or dancing, the scene places high demands on sound fidelity and restoration and low demands on background noise processing, and the corresponding noise reduction weight is set low; when the scene is lecture or chat, the scene balances sound fidelity against background noise processing, and the corresponding noise reduction weight is set to a middle level. The different levels of noise reduction weight correspond to different scene parameter value intervals.
As shown in fig. 6, when the first voice category is speaking voice, noise other than the speech is to be suppressed, and the corresponding noise reduction weight is set high; when the first voice category is singing voice, the background music needs to be preserved, so the corresponding noise reduction weight is set low.
In other embodiments, a functional relationship among the first voice category, the voice noise reduction parameter, the scene tag, and the scene noise reduction parameter may be further established, and the voice noise reduction parameter and the scene noise reduction parameter may be calculated based on the functional relationship.
When weighting the voice noise reduction parameter corresponding to the first voice category and the scene noise reduction parameter corresponding to the scene tag, the voice noise reduction parameter results from real-time voice recognition, so its weight is made greater than that of the scene noise reduction parameter. This avoids the mismatch between the target noise reduction parameter and the current audio data that arises when the scene or activity of the audio stream changes but the user cannot update the scene tag in time, improving both the noise reduction effect and efficiency; a sketch of this weighting follows.
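A two-factor weighting sketch under the constraint just stated; the 0.7 weight is an assumed value:

```python
def two_factor_target(voice_param, scene_param, voice_weight=0.7):
    # The voice parameter comes from real-time recognition, so it gets
    # the larger weight (any voice_weight > 0.5 satisfies the constraint).
    assert voice_weight > 0.5
    return voice_weight * voice_param + (1.0 - voice_weight) * scene_param
```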
In another possible implementation manner provided in this embodiment of the application, the generating, based on the first voice category and the scene tag, a target noise reduction parameter for the current audio data in step S203 may include:
(1) an acquisition path of the audio data is determined.
Specifically, audio data may be collected by recording software on a computer or mobile phone through a microphone, and the acquisition path of the audio data can be determined from the terminal's acquisition interface. The acquisition path may be a headset MIC (microphone), a phone MIC, or a sound card, and different acquisition paths also affect the noise reduction requirements of the audio data. For example, audio collected through a phone MIC contains strong environmental noise and requires strong noise reduction, whereas audio collected through a headset MIC contains little environmental noise and requires only weak noise reduction.
(2) The target noise reduction parameter for the current audio data is generated based on the first voice category, the scene tag and the acquisition path.
In some embodiments, a functional relationship between the first voice category, the scene tag, the acquisition path, and the target noise reduction parameter may be set, and then the target noise reduction parameter for the current audio data is calculated based on the first voice category, the scene tag, and the acquisition path of the audio data.
In other embodiments, the correspondence between different voice categories, different scene labels, different acquisition paths, and different target noise reduction parameters may be preset, and the target noise reduction parameters for the audio data may be determined based on the correspondence.
Specifically, taking a live webcast scene as an example, the correspondence between audio acquisition paths and path noise reduction parameter value intervals is preset as shown in fig. 7. When a phone MIC serves as the acquisition path, the background noise is strong, so the corresponding noise reduction weight is set high; when a headset MIC serves as the acquisition path, background noise and echo are minor, so the corresponding noise reduction weight is set low; when a sound card or a Bluetooth headset serves as the acquisition path, the background noise is moderate, so the corresponding noise reduction weight is a middle level. The different levels of noise reduction weight correspond to different path parameter value intervals.
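Pulling figs. 5-7 together, the correspondences can be expressed as lookup tables. The application discloses only high/middle/low weight levels, so the numeric values below are assumptions chosen for the sketch:

```python
SCENE_PARAMS = {                     # Fig. 5: scene tag
    "conference": 0.9, "outdoor": 0.9,             # high
    "lecture": 0.6, "chat": 0.6,                   # middle
    "singing": 0.3, "music": 0.3, "dancing": 0.3,  # low
}
VOICE_PARAMS = {                     # Fig. 6: first voice category
    "speaking": 0.8,   # high: suppress everything except the voice
    "singing": 0.3,    # low: preserve the backing music
}
PATH_PARAMS = {                      # Fig. 7: acquisition path
    "phone_mic": 0.8,                              # high background noise
    "sound_card": 0.5, "bluetooth_headset": 0.5,   # moderate
    "headset_mic": 0.2,                            # low noise and echo
}
```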
This application comprehensively considers the influence of the classification result, the scene tag and the acquisition path of the audio data on the required noise reduction strength, so that the generated target noise reduction parameter matches the audio data more closely. It can accommodate personalized scene tags and a variety of audio acquisition paths while matching the classification result of the audio data in real time, effectively improving the noise reduction effect and achieving the goal of better sound quality.
In an embodiment of the present application, a further possible implementation manner is provided, where the generating of the target noise reduction parameter for the current audio data based on the first voice category, the scene tag, and the acquisition path includes:
and if the acquisition path has no matched noise reduction attribute, generating the target noise reduction parameter for the current audio data based on the first voice category, the scene tag and the acquisition path.
Specifically, the server or terminal performing audio noise reduction detects the current acquisition path. When a noise reduction attribute matching the acquisition path is detected, for example when the acquisition path is a headset MIC with a built-in noise reduction function, the target noise reduction parameter for the current audio data is generated based on the first voice category and the scene tag alone;
when the acquisition path is detected to have no matched noise reduction attribute, the target noise reduction parameter for the current audio data is generated based on the first voice category, the scene tag and the acquisition path.
Compared with the prior art, the audio noise reduction method provided by this embodiment requires no additional user interaction for setting the audio noise reduction parameters; the noise reduction configuration and processing are transparent and imperceptible to the user, effectively improving the user experience.
In an embodiment of the present application, a further possible implementation manner is provided, where generating a target noise reduction parameter for current audio data based on the first voice category, the scene tag, and the acquisition path includes:
and weighting the voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the path noise reduction parameter corresponding to the acquisition path to obtain the target noise reduction parameter.
Specifically, the weights corresponding to the human voice noise reduction parameter, the scene noise reduction parameter, and the channel noise reduction parameter may be determined separately, and the three parameters may then be weighted and summed based on these weights to obtain the target noise reduction parameter.
In this embodiment, three factors are considered together when determining the noise reduction parameter: the scene tag set by the user, the first voice category identified by voice recognition, and the acquisition channel of the audio data. By combining the real-time voice classification result and the acquisition channel with the user's subjective judgment, the sound quality of the audio data is further improved and user experience is effectively enhanced.
In an embodiment of the present application, another possible implementation manner is provided, where weighting the human voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the channel noise reduction parameter corresponding to the acquisition channel to obtain the target noise reduction parameter includes:
(1) determining a first weight of the human voice noise reduction parameter, a second weight of the scene noise reduction parameter and a third weight of the channel noise reduction parameter.
The preset first, second, and third weights may be obtained from a terminal or a server, or they may be derived from statistical analysis of data in actual engineering applications.
(2) Based on the first weight, the second weight, and the third weight, performing a weighted summation of the human voice noise reduction parameter, the scene noise reduction parameter, and the channel noise reduction parameter to obtain the target noise reduction parameter; wherein the first weight is greater than either of the second weight and the third weight.
Specifically, the target noise reduction parameter may be obtained by multiplying the first weight by the human voice noise reduction parameter, the second weight by the scene noise reduction parameter, and the third weight by the channel noise reduction parameter, and summing the three products.
The sum of the first weight, the second weight, and the third weight is 1. Because the human voice noise reduction parameter is determined by real-time voice recognition, the first weight has the largest value of the three. When an external influence occurs, for example the scene or activity corresponding to the audio stream changes and the user cannot update the scene tag in time, or the noise reduction function of the acquisition channel fails, the human voice noise reduction parameter generated in real time from voice recognition still accounts for the largest weight. This avoids a mismatch between the target noise reduction parameter and the current audio data under such external influences and further improves the noise reduction effect.
In this embodiment, live webcasting is taken as an example. The audio noise reduction parameter in a live webcast scene is determined by three components: a voice analysis unit, a tag configuration analysis unit, and a noise reduction unit:
the voice analysis unit determines the human voice noise reduction parameter and the channel noise reduction parameter according to the recognized first voice category and the detected acquisition channel, respectively;
the tag configuration analysis unit determines the scene noise reduction parameter according to the scene tag set by the user;
the noise reduction unit generates the target noise reduction parameter based on the human voice noise reduction parameter, the scene noise reduction parameter, and the channel noise reduction parameter.
Different classification results, scene tags, and acquisition channels carry different noise reduction weights. The value intervals of the human voice, scene, and channel noise reduction parameters can be determined from these weights, and the actual values within those intervals can then be chosen according to the conditions of the actual engineering application.
Specifically, taking a live webcast application as an example, the anchor may set a scene tag on a configuration page during the broadcast. When the scene changes during the broadcast, for example when the anchor moves from an indoor to an outdoor broadcast, the anchor may not update the scene tag in time; to reduce errors caused by manual configuration, the second weight of the scene noise reduction parameter may be set to 10%. Secondly, the audio acquisition channel depends on the anchor's usage habits and generally does not change much, and since audio acquisition technology is mature, its influence on the overall noise reduction parameter is small; the third weight of the channel noise reduction parameter may be set to 20%. Accordingly, the first weight of the human voice noise reduction parameter, determined from real-time speech recognition of the audio stream, is set to 70%. The final target noise reduction parameter r can then be calculated by the following formula (1):
r = S1 × 10% + S2 × 70% + S3 × 20%    (1)
where r is the target noise reduction parameter, S1 is the scene noise reduction parameter, S2 is the human voice noise reduction parameter, and S3 is the channel noise reduction parameter.
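As a worked example of formula (1), the function below evaluates it for one set of inputs. The input values are illustrative assumptions only; the application does not specify concrete parameter values.

def target_noise_reduction(s1_scene, s2_voice, s3_channel):
    """Formula (1): r = S1 * 10% + S2 * 70% + S3 * 20%."""
    return 0.10 * s1_scene + 0.70 * s2_voice + 0.20 * s3_channel

# Illustrative inputs (assumed values, not from the application):
r = target_noise_reduction(s1_scene=0.5, s2_voice=0.9, s3_channel=0.6)
print(r)  # 0.05 + 0.63 + 0.12, approximately 0.80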
The target noise reduction parameter provided by this embodiment is computed from multiple inputs: the user's preset scene tag, the voice recognition result, and the acquisition channel of the current audio data. This solves the problem that a fixed parameter configuration cannot adapt to diverse voice activities and scenes, requires no manual adjustment by the user, and improves both sound quality and user experience. Meanwhile, computing the target noise reduction parameter places low demands on system processing performance and memory, which further preserves the efficiency of the audio noise reduction.
In order to better understand the above audio noise reduction method, as shown in fig. 8, an example of the audio noise reduction method of the present application is set forth in detail as follows:
s801, acquiring current audio data and a preset scene label at the current moment from an audio stream.
S802, performing voice recognition on the current audio data, and determining a first voice category of the current audio data.
S803, acquiring the audio data corresponding to the previous moment.
S804, if the first voice category does not match the second voice category, determining the acquisition channel of the audio data.
S805, if the acquisition channel has no matching noise reduction attribute, determining a first weight of the human voice noise reduction parameter, a second weight of the scene noise reduction parameter, and a third weight of the channel noise reduction parameter.
S806, based on the first weight, the second weight, and the third weight, performing a weighted summation of the human voice noise reduction parameter, the scene noise reduction parameter, and the channel noise reduction parameter to obtain the target noise reduction parameter; wherein the first weight is greater than either of the second weight and the third weight.
S807, noise reduction processing is performed on the audio stream based on the target noise reduction parameter.
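The control flow of S801 to S807 can be sketched as follows. The classifier and the channel check are hypothetical stubs standing in for the units described above; the behavior on the S804 and S805 early exits (reusing the previous parameter, or dropping the channel term) follows the earlier embodiments, and the exact fallback weights are assumptions.

def classify_voice(audio):
    """S802: stand-in classifier returning 'speech' or 'singing'."""
    return "speech"  # stub

def channel_has_builtin_nr(channel):
    """S805 check: does the capture device do its own noise reduction?"""
    return channel == "headset_mic"  # stub

def denoise_step(audio, prev_category, scene_param, channel,
                 voice_param, channel_param):
    """One pass over the current frame (S801-S807); returns the target
    noise reduction parameter r, or None when the previous parameter
    can be kept."""
    category = classify_voice(audio)                     # S802
    if category == prev_category:                        # S803-S804
        return None                                      # categories match
    if channel_has_builtin_nr(channel):                  # S805
        # Assumed two-factor fallback: drop the channel term.
        return 0.875 * voice_param + 0.125 * scene_param
    w1, w2, w3 = 0.7, 0.1, 0.2                           # voice/scene/channel
    return w1 * voice_param + w2 * scene_param + w3 * channel_param  # S806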
In order to better understand the above audio noise reduction method, a webcast application is taken as an example below. As shown in fig. 9, a webcast process applying the above audio noise reduction method may include the following steps:
(1) the audio acquisition module 901 acquires audio data through a mobile phone MIC, an earphone MIC, or a standalone microphone; different acquisition devices need to be configured with different parameters, commonly including: the number of channels (mono/stereo), the sampling rate (44100/48000 Hz), the bit depth (8/16), etc.;
(2) the scene recognition module 902 obtains the audio data from the corresponding acquisition channel; the scene recognition module 902 includes a tag configuration analysis unit 9021 and a voice analysis unit 9022;
(3) the tag configuration analysis unit 9021 determines the scene noise reduction parameter according to the scene tag set by the user; the voice analysis unit 9022 performs voice recognition on the audio data and then determines the human voice noise reduction parameter and the channel noise reduction parameter according to the recognized first voice category and the detected acquisition channel, respectively;
(4) the scene recognition module 902 further includes a noise reduction unit 9023, where the noise reduction unit 9023 generates the target noise reduction parameter based on the scene noise reduction parameter, the human voice noise reduction parameter, and the channel noise reduction parameter;
(5) the audio preprocessing module 903 receives the target noise reduction parameter and performs preprocessing operations on the audio data such as noise reduction, echo cancellation, automatic gain control, and sampling rate conversion;
(6) the audio coding module 904 compresses the preprocessed audio data through an encoder to save storage space and transmission bandwidth; common coding standards include AAC (Advanced Audio Coding), MP3 (Moving Picture Experts Group Audio Layer-3), Opus, etc.; Opus is a lossy audio coding format that can handle a wide range of audio applications, scaling from low-bitrate narrowband speech to very high-fidelity stereo music;
(7) the camera acquisition module 905 acquires images through a mobile phone camera to obtain video data; different camera devices provide data streams with different sizes and frame rates according to different capabilities;
(8) the video preprocessing module 906 preprocesses the video data before encoding, including size cropping, boundary alignment, rotation, and color space conversion;
(9) the video coding module 907 compresses the preprocessed video data to save storage space and improve transmission efficiency; common coding standards include MPEG2 (Moving Picture Experts Group), H.264, H.265, etc.; H.264 is a highly compressed digital video codec standard proposed by the International Organization for Standardization and the International Telecommunication Union, and H.265 is a video coding standard developed as an improvement on the existing H.264 standard;
(10) after receiving the compressed audio data and video data, the audio/video encapsulation module 908 aligns them according to the PTS (Presentation Time Stamp), interleaves them into audio/video data according to the specification of the chosen container format, and stores or transmits the interleaved data to the streaming server module 909;
(11) the streaming server module 909 distributes the interleaved audio/video data through its internal network to a Content Delivery Network (CDN); the viewer terminal 910 then acquires the live data stream from the CDN server and decodes and plays it through the player.
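The module chain in steps (1) to (11) amounts to a per-frame pipeline. The sketch below shows only that structure; every stage is a hypothetical one-line stub, since real implementations would wrap a capture API, AAC/Opus and H.264/H.265 codecs, a PTS-aligned muxer, and a CDN uploader.

# Hypothetical stubs so the control flow executes; each stands in for a
# real capture, codec, mux, or transport component from fig. 9.
scene_recognition = lambda a: 0.8          # (2)-(4): target NR parameter
preprocess_audio = lambda a, r: a          # (5): NR, AEC, AGC, resampling
encode_audio = lambda a: b"aac"            # (6)
preprocess_video = lambda v: v             # (8): crop, rotate, convert
encode_video = lambda v: b"h264"           # (9)
mux_by_pts = lambda ap, vp: (ap, vp)       # (10): align on PTS
push_to_stream_server = lambda pkt: None   # (11): hand off toward the CDN

def run_live_pipeline(frames):
    """Process (audio, video) frame pairs through the fig. 9 stages."""
    for audio_raw, video_raw in frames:
        r = scene_recognition(audio_raw)
        audio_pkt = encode_audio(preprocess_audio(audio_raw, r))
        video_pkt = encode_video(preprocess_video(video_raw))
        push_to_stream_server(mux_by_pts(audio_pkt, video_pkt))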
According to the method, voice recognition is performed on the current audio data, the target noise reduction parameter for the current audio data is determined in combination with the preset scene tag, and effective noise reduction processing of the audio stream is thereby achieved.
An embodiment of the present application provides an audio noise reduction apparatus. As shown in fig. 10, the audio noise reduction apparatus 110 may include: an acquisition module 1101, a recognition module 1102, a generation module 1103, and a noise reduction module 1104. The acquisition module 1101 is configured to acquire current audio data and a preset scene tag at the current moment from an audio stream;
the recognition module 1102 is configured to perform voice recognition on current audio data, and determine a first voice category of the current audio data;
a generating module 1103, configured to generate a target noise reduction parameter for the current audio data based on the first voice category and the scene tag;
and a noise reduction module 1104, configured to perform noise reduction processing on the audio stream based on the target noise reduction parameter.
A possible implementation manner is provided in the embodiment of the present application, and the generating module 1103 is specifically configured to:
acquiring a second voice category corresponding to the audio data at the previous moment;
and if the first voice category does not match the second voice category, generate a target noise reduction parameter for the current audio data based on the first voice category and the scene tag.
In an embodiment of the present application, a possible implementation manner is provided, and the recognition module 1102 is specifically configured to:
performing voice detection on the current audio data, and extracting at least one voice segment;
acquiring the audio characteristics of each voice segment;
determine a first voice category of the current audio data based on the audio features, wherein the first voice category includes speaking voice and singing voice.
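A minimal sketch of this recognition flow is given below: detect voiced segments, extract a per-segment feature, and label the data as speech or singing. The energy-based voice detection and the zero-crossing-rate heuristic are illustrative assumptions; the application does not fix a concrete detector, feature set, or classifier.

import numpy as np

def classify_first_voice_category(audio, sr=44100):
    """Label mono PCM samples as 'speech', 'singing', or 'no_voice'."""
    frame = sr // 100                              # 10 ms frames
    if len(audio) < frame:
        return "no_voice"
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    voiced = frames[energy > 2 * energy.mean()]    # crude energy-based VAD
    if len(voiced) == 0:
        return "no_voice"
    # Toy feature: zero-crossing-rate variance. Sustained singing tends to
    # hold pitch, while speech fluctuates more (a heuristic, not the
    # application's model).
    zcr = (np.abs(np.diff(np.sign(voiced), axis=1)) > 0).mean(axis=1)
    return "speech" if zcr.var() > 1e-3 else "singing"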
A possible implementation manner is provided in the embodiment of the present application, and the generating module 1103 is specifically configured to:
weight the human voice noise reduction parameter corresponding to the first voice category and the scene noise reduction parameter corresponding to the scene tag to obtain the target noise reduction parameter; wherein when the first voice category is speaking voice, the corresponding human voice noise reduction parameter is larger than when it is singing voice.
A possible implementation manner is provided in the embodiment of the present application, and the generating module 1103 specifically includes:
the determining unit is used for determining an acquisition channel of the audio data;
and the generating unit is used for generating a target noise reduction parameter for the current audio data based on the first voice category, the scene tag, and the acquisition channel.
In an embodiment of the present application, there is provided another possible implementation manner, and the generating unit is specifically configured to:
and if the acquisition channel has no matching noise reduction attribute, generate a target noise reduction parameter for the current audio data based on the first voice category, the scene tag, and the acquisition channel.
In an embodiment of the present application, there is provided another possible implementation manner, where the generating unit is further configured to:
and weight the human voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the channel noise reduction parameter corresponding to the acquisition channel to obtain the target noise reduction parameter.
In an embodiment of the present application, a possible implementation manner is provided, and the generating unit is further configured to:
determine a first weight of the human voice noise reduction parameter, a second weight of the scene noise reduction parameter, and a third weight of the channel noise reduction parameter;
and based on the first weight, the second weight, and the third weight, perform a weighted summation of the human voice noise reduction parameter, the scene noise reduction parameter, and the channel noise reduction parameter to obtain the target noise reduction parameter; wherein the first weight is greater than either of the second weight and the third weight.
According to the apparatus, voice recognition is performed on the current audio data, the target noise reduction parameter for the current audio data is determined in combination with the preset scene tag, and effective noise reduction processing of the audio stream is thereby achieved.
An embodiment of the present application provides an electronic device, including a memory and a processor. The memory stores at least one program which, when executed by the processor, implements the corresponding content of the foregoing method embodiments. Compared with the prior art, this achieves: performing voice recognition on the current audio data, determining the target noise reduction parameter for the current audio data in combination with the preset scene tag, and thereby effectively reducing noise in the audio stream.
In an alternative embodiment, an electronic device is provided. As shown in fig. 11, the electronic device 4000 includes a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. In practical applications the number of transceivers 4004 is not limited to one, and the structure shown for the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the following:
acquiring current audio data and a preset scene tag at the current moment from an audio stream; performing voice recognition on the current audio data, and determining a first voice category of the current audio data; generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag; and performing noise reduction processing on the audio stream based on the target noise reduction parameter.
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. An audio noise reduction method, comprising:
acquiring current audio data and a preset scene tag at the current moment from an audio stream;
performing voice recognition on the current audio data, and determining a first voice category of the current audio data;
generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag;
and performing noise reduction processing on the audio stream based on the target noise reduction parameter.
2. The audio noise reduction method of claim 1, wherein the generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag comprises:
acquiring a second voice category corresponding to the audio data at the previous moment;
and if the first voice category does not match the second voice category, generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag.
3. The audio noise reduction method of claim 1, wherein the performing voice recognition on the current audio data and determining the first voice category of the current audio data comprises:
performing voice detection on the current audio data, and extracting at least one voice segment;
acquiring the audio characteristics of each voice segment;
determining a first voice category of the current audio data based on the audio features; wherein the first voice category includes a speaking voice and a singing voice.
4. The audio noise reduction method of claim 1, wherein the generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag comprises:
weighting the human voice noise reduction parameter corresponding to the first voice category and the scene noise reduction parameter corresponding to the scene tag to obtain the target noise reduction parameter; wherein when the first voice category is a speaking voice, the corresponding human voice noise reduction parameter is larger than when the first voice category is a singing voice.
5. The audio noise reduction method of claim 1, wherein the generating a target noise reduction parameter for the current audio data based on the first voice category and the scene tag comprises:
determining an acquisition path of the audio data;
generating a target noise reduction parameter for the current audio data based on the first voice category, the scene tag, and the acquisition path.
6. The audio noise reduction method of claim 5, wherein the generating a target noise reduction parameter for the current audio data based on the first voice category, the scene tag, and the acquisition path comprises:
and if the acquisition path has no matching noise reduction attribute, generating a target noise reduction parameter for the current audio data based on the first voice category, the scene tag, and the acquisition path.
7. The audio noise reduction method of claim 5, wherein the generating a target noise reduction parameter for the current audio data based on the first voice category, the scene tag, and the acquisition path comprises:
and weighting the human voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the path noise reduction parameter corresponding to the acquisition path to obtain the target noise reduction parameter.
8. The audio noise reduction method of claim 7, wherein the weighting the human voice noise reduction parameter corresponding to the first voice category, the scene noise reduction parameter corresponding to the scene tag, and the path noise reduction parameter corresponding to the acquisition path to obtain the target noise reduction parameter comprises:
determining a first weight of the human voice noise reduction parameter, a second weight of the scene noise reduction parameter, and a third weight of the path noise reduction parameter;
based on the first weight, the second weight and the third weight, weighting and summing the human voice noise reduction parameter, the scene noise reduction parameter and the path noise reduction parameter to obtain the target noise reduction parameter; wherein the first weight is greater than either of the second weight and the third weight.
9. An audio noise reduction apparatus, comprising:
the acquisition module is used for acquiring current audio data and a preset scene tag at the current moment from the audio stream;
the recognition module is used for carrying out voice recognition on the current audio data and determining a first voice category of the current audio data;
a generating module, configured to generate a target noise reduction parameter for the current audio data based on the first voice category and the scene tag;
and the noise reduction module is used for carrying out noise reduction processing on the audio stream based on the target noise reduction parameters.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the audio noise reduction method according to any one of claims 1 to 8.
CN202110653790.3A 2021-06-11 2021-06-11 Audio noise reduction method and device and computer readable storage medium Pending CN113823303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110653790.3A CN113823303A (en) 2021-06-11 2021-06-11 Audio noise reduction method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113823303A true CN113823303A (en) 2021-12-21

Family

ID=78923851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110653790.3A Pending CN113823303A (en) 2021-06-11 2021-06-11 Audio noise reduction method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113823303A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116033078A (en) * 2023-01-09 2023-04-28 绍兴泰民科技股份有限公司 Production safety consultation system based on block chain
CN117440440A (en) * 2023-12-21 2024-01-23 艾康恩(深圳)电子科技有限公司 Bluetooth headset low-delay transmission method
CN117440440B (en) * 2023-12-21 2024-03-15 艾康恩(深圳)电子科技有限公司 Bluetooth headset low-delay transmission method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination