CN113709291A - Audio processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN113709291A
Authority
CN
China
Prior art keywords
noise
voice
user
recording
noise reduction
Prior art date
Legal status
Pending
Application number
CN202110904018.4A
Other languages
Chinese (zh)
Inventor
谢慧智
张黎
虞国桥
万广鲁
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202110904018.4A
Publication of CN113709291A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/64 Automatic arrangements for answering calls; Automatic arrangements for recording messages for absent subscribers; Arrangements for recording conversations
    • H04M1/65 Recording arrangements for recording a message from the calling party
    • H04M1/6505 Recording arrangements for recording a message from the calling party storing speech in digital form
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones

Abstract

The embodiments of the present application provide an audio processing method and apparatus, an electronic device, and a readable storage medium. The method comprises the following steps: in response to a user's operation on a recording start button on a recording initial interface, displaying a recording acquisition interface and outputting noise detection prompt information on the recording acquisition interface, where the noise detection prompt information is used for prompting that noise detection is performed before or while the voice uttered by the user is collected; collecting the voice uttered by the user and performing noise reduction processing on the collected voice to obtain a recorded voice packet; and outputting noise reduction prompt information on the recording acquisition interface, where the noise reduction prompt information is used for prompting that noise reduction processing is being performed on the voice uttered by the user. With this audio processing method, the user can record voice unobstructed in an environment with noise interference, completing the whole voice recording process more smoothly while the barrier to use is lowered.

Description

Audio processing method and device, electronic equipment and readable storage medium
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to an audio processing method and device, an electronic device and a readable storage medium.
Background
In current scenarios of personalized voice recording on mobile terminals, the requirements on recording quality are high, but because recording environments are complex, it is often difficult to record voice in a completely noise-free place. During recording, if there is noise interference in the environment, the recording interface cannot be entered, and the user is forced to repeat noise detection until the environmental noise meets the requirement before the recording flow can begin. Moreover, even if recording starts in a noise-free environment, the user is required to record the audio again whenever noise is detected during recording; that is, once noise interference occurs, recording cannot proceed, which severely degrades the user's recording experience.
Disclosure of Invention
Embodiments of the present application provide an audio processing method and apparatus, an electronic device, and a readable storage medium, which enable a user to record voice unobstructed in an environment with noise interference, helping the user complete the whole voice recording process more smoothly while lowering the barrier to use.
A first aspect of an embodiment of the present application provides an audio processing method, where the method includes:
in response to a user's operation on a recording start button on a recording initial interface, displaying a recording acquisition interface, and outputting noise detection prompt information on the recording acquisition interface, where the noise detection prompt information is used for prompting that noise detection is performed before or while the voice uttered by the user is collected;
collecting the voice uttered by the user and performing noise reduction processing on the collected voice to obtain a recorded voice packet;
and outputting noise reduction prompt information on the recording acquisition interface, where the noise reduction prompt information is used for prompting that noise reduction processing is performed on the voice uttered by the user.
Optionally, the method further includes:
outputting a plurality of noise source options on the recording acquisition interface, wherein the plurality of noise source options at least comprise: a recording environment option, a recording device option and a pronunciation feature option;
the performing noise reduction processing on the collected voice to obtain a recorded voice packet includes:
and carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the noise source option selected by the user to obtain a recorded voice packet.
Optionally, the method further includes:
collecting noisy voices uttered by a plurality of recorders;
extracting the corresponding noise from each of the collected noisy voices, and storing the extracted noise in a noise database;
the performing noise reduction processing on the collected voice to obtain a recorded voice packet includes:
comparing the detected noise with each noise in the noise database respectively, and determining a target noise matched with the detected noise;
and carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the target noise to obtain a recorded voice packet.
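The claims leave the matching step abstract. Below is a minimal sketch of one way it could work, assuming a noise database keyed by name and a crude band-energy fingerprint compared by cosine similarity — both are illustrative assumptions, not the patent's stated method:

```python
import numpy as np

def band_energies(signal: np.ndarray, n_bands: int = 8) -> np.ndarray:
    """Crude spectral fingerprint: normalised energy in n_bands frequency bands."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(spectrum, n_bands)
    energies = np.array([b.sum() for b in bands])
    return energies / (energies.sum() + 1e-12)  # normalise so loudness is ignored

def match_noise(detected: np.ndarray, database: dict) -> str:
    """Return the key of the database noise whose fingerprint is closest
    (by cosine similarity) to the detected noise segment."""
    probe = band_energies(detected)
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(database, key=lambda k: cosine(probe, band_energies(database[k])))
```

The matched key would then select the pre-formulated noise reduction strategy for that target noise.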
Optionally, the method further includes:
in response to the user operating the recording start button again on the recording initial interface, outputting personalized noise reduction prompt information on the recording acquisition interface, where the personalized noise reduction prompt information is used for prompting that noise reduction processing is to be performed on the voice triggered by the user's current recording operation according to the noisy voice triggered by the user's historical recording operations;
extracting historical noise from voice with noise triggered by historical recording operation of a user;
and collecting voice triggered by the recording operation of the user, and performing noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the extracted historical noise to obtain a recorded voice packet.
Optionally, after obtaining the recorded voice packet, the method further includes:
analyzing the current content to be broadcasted by voice, and determining a broadcasting scene;
under the condition that the playing scene is a personalized playing scene, voice broadcasting is carried out on the content of the current voice to be broadcasted according to the recorded voice packet;
and under the condition that the playing scene is a universal playing scene, carrying out voice broadcasting on the content of the current voice to be broadcasted according to a default voice packet.
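The scene-dependent playback choice above reduces to a simple dispatch; a minimal sketch, with hypothetical scene labels and packet identifiers (none of these names come from the patent):

```python
def pick_voice_packet(scene: str, recorded_packet: str, default_packet: str) -> str:
    """Select the voice packet used for text-to-speech playback:
    personalized scenes use the user's recorded voice packet, while
    general-purpose scenes fall back to the default packet."""
    return recorded_packet if scene == "personalized" else default_packet
```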
Optionally, the performing noise reduction processing on the collected voice to obtain a recorded voice packet specifically includes:
extracting corresponding spectral features from the collected voice;
and inputting the collected voice and its spectral features into a pre-trained voice noise reduction model to obtain a recorded voice packet, where the voice noise reduction model is obtained by training a pre-trained model with the spectral features corresponding to composite noisy speech and the noise-free speech corresponding to the composite noisy speech as training samples.
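The claims do not fix the choice of spectral features; log-magnitude STFT frames are one common option. A minimal sketch under that assumption (the frame length and hop size are illustrative, not specified by the patent):

```python
import numpy as np

def log_spectral_features(signal: np.ndarray,
                          frame_len: int = 512,
                          hop: int = 256) -> np.ndarray:
    """Frame the waveform, apply a Hann window, and return per-frame
    log-magnitude spectra -- one common realisation of 'spectral features'."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-8)  # shape: (n_frames, frame_len // 2 + 1)
```

These per-frame features, together with the waveform, would form the input to the voice noise reduction model.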
Optionally, the composite noisy speech is obtained according to the following steps:
collecting voices uttered by a plurality of recorders;
screening out noisy voices and noise-free voices from the voices uttered by the plurality of recorders;
extracting the corresponding noise from each of the screened noisy voices;
and performing noise-adding processing on the screened noise-free voices according to the extracted noise to obtain the composite noisy speech.
Optionally, the performing noise-adding processing on the screened noise-free speech according to the extracted noise to obtain the composite noisy speech includes:
classifying the extracted noise, wherein the category of the extracted noise at least comprises: background noise, noise due to audio impairments;
under the condition that the extracted noise type is background noise, additive noise adding processing is carried out on the screened noise-free voice to obtain a composite voice with noise;
and under the condition that the extracted noise type is noise caused by audio damage, multiplicative noise adding processing is carried out on the screened noise-free voice to obtain a composite voice with noise.
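A minimal sketch of the two augmentation modes described above: additive mixing at a target SNR for background noise, and multiplicative (frequency-domain) filtering to model noise caused by audio impairment. The SNR value and channel response used here are illustrative assumptions:

```python
import numpy as np

def add_background_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additive noise-adding: mix background noise into clean speech at a target SNR."""
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def apply_channel_damage(clean: np.ndarray, response: np.ndarray) -> np.ndarray:
    """Multiplicative noise-adding: model audio impairment as a channel,
    i.e. multiplication in the frequency domain (convolution in time)."""
    spectrum = np.fft.rfft(clean) * np.fft.rfft(response, n=len(clean))
    return np.fft.irfft(spectrum, n=len(clean))
```

Applying these to the screened noise-free voices yields composite noisy speech paired with its clean original, which is exactly the training-sample structure the claims describe.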
Optionally, the training step of the pre-trained model includes:
extracting corresponding spectral features from the screened noise-free voice;
and training a preset model by taking the screened noise-free voice and the corresponding frequency spectrum characteristics as training samples to obtain the pre-trained model.
Optionally, the training step of the speech noise reduction model includes:
extracting corresponding spectral features from the composite noisy speech, and establishing a correspondence between the spectral features and the noiseless speech for generating the composite noisy speech;
and training the pre-trained model by taking the corresponding relation and the noiseless voice corresponding to the composite noisy voice as a training sample to obtain the voice noise reduction model.
A second aspect of the embodiments of the present application provides an audio processing apparatus, including:
the first display module is used for responding to the operation of a user on a recording starting button on a recording initial interface, displaying a recording acquisition interface and outputting noise detection prompt information on the recording acquisition interface, wherein the noise detection prompt information is used for prompting that noise detection is carried out before or in the process of acquiring voice sent by the user;
the acquisition and noise reduction module is used for acquiring voice sent by a user and carrying out noise reduction processing on the acquired voice to obtain a recorded voice packet;
and the second display module is used for outputting noise reduction prompt information on the recording acquisition interface, wherein the noise reduction prompt information is used for prompting that noise reduction processing is carried out on the voice sent by the user.
Optionally, the apparatus further comprises:
a third display module, configured to output a plurality of noise source options on the recording collection interface, where the plurality of noise source options at least include: a recording environment option, a recording device option and a pronunciation feature option;
the acquisition noise reduction module comprises:
and the first noise reduction module is used for carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the noise source option selected by the user to obtain the recorded voice packet.
Optionally, the apparatus further comprises:
the first acquisition module is used for acquiring noise voices sent by a plurality of recorders respectively;
the first extraction module is used for respectively extracting corresponding noise from the collected multiple noisy speeches and storing the extracted noise into a noise database;
the acquisition noise reduction module comprises:
a target noise determination module, configured to compare the detected noise with each noise in the noise database, and determine a target noise matching the detected noise;
and the second noise reduction module is used for carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the target noise to obtain a recorded voice packet.
Optionally, the apparatus further comprises:
the voice processing module is used for receiving voice input by a user, acquiring a recording initial interface, and sending the recording initial interface to a voice input module, wherein the voice input module is used for inputting voice information of the user, and the voice information is used for inputting voice information of the user;
the second extraction module is used for extracting historical noise from the voice with noise triggered by the historical recording operation of the user;
and the third noise reduction module is used for acquiring voice triggered by the recording operation of the user, and performing noise reduction processing on the acquired voice by adopting a noise reduction strategy corresponding to the extracted historical noise to obtain a recorded voice packet.
Optionally, after obtaining the recorded voice packet, the apparatus further includes:
the analysis module is used for analyzing the content to be broadcasted by voice currently and determining a broadcasting scene;
the first broadcasting module is used for carrying out voice broadcasting on the content of the current voice to be broadcasted according to the recorded voice packet under the condition that the broadcasting scene is a personalized broadcasting scene;
and the second broadcasting module is used for carrying out voice broadcasting on the current voice content to be broadcasted according to a default voice packet under the condition that the broadcasting scene is a general broadcasting scene.
Optionally, the acquisition and noise reduction module specifically includes:
the third extraction module is used for extracting corresponding spectrum characteristics from the collected voice;
and the fourth noise reduction module is used for inputting the collected voice and the frequency spectrum characteristics thereof into a pre-trained voice noise reduction model to obtain a recorded voice packet, wherein the voice noise reduction model is obtained by taking the frequency spectrum characteristics corresponding to the composite band noise voice and the noiseless voice corresponding to the composite band noise voice as training samples and training the pre-trained model.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring voices sent by a plurality of recorders respectively;
the screening module is used for screening out noise-containing voice and noise-free voice from the voice sent by each of the plurality of recorders;
the noise extraction module is used for respectively extracting corresponding noise from the screened multiple noisy voices;
and the noise adding processing module is used for adding noise to the screened noise-free voice according to the extracted noise to obtain the composite voice with noise.
Optionally, the noise adding processing module includes:
a classification module, configured to classify the extracted noise, where a category of the extracted noise at least includes: background noise, noise due to audio impairments;
the additive noise adding module is used for performing additive noise adding processing on the screened noise-free voice under the condition that the extracted noise type is background noise to obtain a composite voice with noise;
and the multiplicative denoising module is used for performing multiplicative denoising processing on the screened noise-free voice under the condition that the extracted noise type is noise caused by audio damage, so as to obtain a composite voice with noise.
Optionally, the apparatus further comprises:
the fourth extraction module is used for extracting corresponding spectrum characteristics from the screened noiseless voice;
and the first model training module is used for training a preset model by taking the screened noiseless voice and the corresponding spectrum characteristics as training samples to obtain the pre-trained model.
Optionally, the apparatus further comprises:
a relation establishing module, configured to extract corresponding spectral features from the composite noisy speech, and establish a correspondence between the spectral features and the noise-free speech used to generate the composite noisy speech;
and the second model training module is used for training the pre-trained model by taking the corresponding relation and the noiseless voice corresponding to the composite voice with noise as a training sample to obtain the voice noise reduction model.
A third aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps in the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the present application when executing the computer program.
With the audio processing method provided by the embodiments of the present application, the recording acquisition interface is displayed in response to the user's operation on the recording start button on the recording initial interface, and noise detection prompt information is output on the recording acquisition interface; the voice uttered by the user is collected and noise reduction processing is performed on it to obtain a recorded voice packet, and noise reduction prompt information is output on the recording acquisition interface. With this audio processing method, the user can record voice unobstructed in an environment with noise interference: noise can be monitored before or while the user speaks, and voice noise reduction can be performed automatically while the user's voice is being collected.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flow chart illustrating an audio processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an audio processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a model training method according to an embodiment of the present application;
fig. 4 is a block diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In current scenarios of personalized voice recording on mobile terminals, the requirements on recording quality are high: the sound must be clear and free of background noise. However, because recording environments are complex, it is often difficult to record voice in a completely noise-free place. During recording, the system performs noise detection; if there is noise interference in the environment, the recording interface cannot be entered, the user is prompted to record in a quiet environment such as a bedroom or meeting room, and noise detection must be repeated until the environmental noise meets the requirement before the recording flow can begin. During voice recording, if noise is detected, the voice must be recorded again; that is, once noise interference occurs, recording cannot proceed, which severely degrades the user's recording experience.
Based on the above, the present application provides an audio processing method that enables a user to record voice unobstructed in an environment with noise interference. Noise can be monitored before or while the user speaks, and voice noise reduction can be performed automatically while the user's voice is being collected; the method does not force the user to repeat detection or re-record, thereby helping the user complete the whole voice recording process more smoothly and lowering the barrier to use.
Referring to fig. 1, fig. 1 is a flowchart illustrating an audio processing method according to an embodiment of the present application. This embodiment provides an audio processing method applied to a terminal device, which may include but is not limited to a mobile phone, a tablet, a computer, a smart watch, or another device with networking capability.
As shown in fig. 1, the audio processing method of the present embodiment may include the steps of:
step S11: responding to the operation of a user on a recording starting button on a recording initial interface, displaying a recording acquisition interface, and outputting noise detection prompt information on the recording acquisition interface, wherein the noise detection prompt information is used for prompting that noise detection is carried out before or in the process of acquiring voice sent by the user.
In this embodiment, the user may operate the recording start button on the recording initial interface of the terminal, where the operation includes but is not limited to a click, slide, or press on the recording start button. After receiving the user's operation on the recording start button, the terminal device responds by displaying the recording acquisition interface; when the user sees the recording acquisition interface, the user can speak to record voice. At this point, the terminal can output noise detection prompt information on the recording acquisition interface to prompt the user that the terminal device performs noise detection before or while collecting the voice uttered by the user. For example, the noise detection prompt information output in this embodiment may be "xx noise has been detected in your current environment and will be reduced automatically; please proceed with confidence".
Step S12: and collecting voice sent by a user and carrying out noise reduction processing on the collected voice to obtain a recorded voice packet.
In this embodiment, the terminal device collects the voice uttered by the user and performs noise reduction processing on it to obtain a recorded voice packet. Various noise reduction methods may be adopted in the present application, which is not specifically limited in this embodiment.
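Since the embodiment leaves the noise reduction method open, one classical option is spectral subtraction, sketched below under the assumption that a noise-only estimate is available (e.g. from the detection phase before the user starts speaking). This is an illustration of one applicable method, not the patent's specific algorithm:

```python
import numpy as np

def spectral_subtraction(noisy: np.ndarray, noise_estimate: np.ndarray,
                         floor: float = 0.02) -> np.ndarray:
    """Subtract the estimated noise magnitude spectrum from the noisy
    signal's spectrum, keep the noisy phase, and resynthesise."""
    spectrum = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_estimate, n=len(noisy)))
    mag = np.abs(spectrum) - noise_mag
    mag = np.maximum(mag, floor * np.abs(spectrum))  # spectral floor limits musical noise
    return np.fft.irfft(mag * np.exp(1j * np.angle(spectrum)), n=len(noisy))
```

In practice this would run frame by frame on the incoming recording; the single-shot version above only shows the core operation.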
Step S13: and outputting noise reduction prompt information on the recording acquisition interface, wherein the noise reduction prompt information is used for prompting that noise reduction processing is carried out on the voice sent by the user.
In this embodiment, in the process of recording the voice, the terminal may output the noise reduction prompt information on the recording acquisition interface to prompt the user that the terminal is performing noise reduction processing on the voice sent by the user.
In this embodiment, the terminal device displays the recording acquisition interface in response to the user's operation on the recording start button on the recording initial interface, and outputs noise detection prompt information on the recording acquisition interface to prompt the user that the terminal performs noise detection before or during voice collection. The terminal device collects the voice uttered by the user, outputs noise reduction prompt information on the recording acquisition interface, and performs noise reduction processing on the collected voice to obtain a recorded voice packet. With this audio processing method, noise can be monitored before or while the user speaks, and voice noise reduction can be performed automatically while the user's voice is collected, so the user can record voice unobstructed in an environment with noise interference. The method does not force the user to repeat detection or re-record, thereby helping the user complete the whole voice recording process more smoothly, lowering the barrier to use, reducing the interference of noise, improving the adaptability of the custom voice synthesis function to complex and varied environments, and thus improving the quality of the synthesized speech.
With reference to the foregoing embodiment, in an implementation manner, the present application further provides an audio processing method, and specifically, the method may further include:
step S21: outputting a plurality of noise source options on the recording acquisition interface, wherein the plurality of noise source options at least comprise: a recording environment option, a recording device option, and a pronunciation feature option.
In this embodiment, when the user enters the recording acquisition interface to record voice, the terminal may further output a plurality of noise source options on the recording acquisition interface, including but not limited to a recording environment option, a recording device option, and a pronunciation feature option. The user can select a noise source option on the recording acquisition interface: if the user believes the noise in the current recording is mainly environmental noise, the recording environment option can be selected; if the user believes it is mainly caused by the quality of the device commonly used (such as a microphone), the recording device option can be selected; and if the user believes it is mainly caused by the user's own pronunciation characteristics (such as breathiness or plosive pops into the microphone), the pronunciation feature option can be selected.
Step S22: and carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the noise source option selected by the user to obtain a recorded voice packet.
In this embodiment, after the user selects the noise source option on the recording acquisition interface, the terminal device responds to the selection of the user, and performs noise reduction processing on the acquired voice by using a noise reduction strategy corresponding to the noise source option selected by the user, so as to obtain a recorded voice packet.
The terminal device of this embodiment may learn the noise features of the noise types corresponding to the respective noise source options in advance, and formulate and store a corresponding noise reduction strategy for the noise feature of each noise type. The strategy formulated for each noise type is usually the one best suited to that type and can perform noise reduction targeted at it. When the user selects a noise source option on the recording acquisition interface, the terminal device can, during recording, apply the pre-formulated noise reduction strategy corresponding to the selected option to the collected user voice to obtain the recorded voice packet.
In this embodiment, the user may further select a corresponding noise source option on the recording acquisition interface according to the actual recording scene, so that the terminal device performs noise reduction processing on the acquired voice by using a noise reduction strategy corresponding to the noise source option selected by the user, and obtains a recorded voice packet. Therefore, the most suitable denoising strategy can be adopted for denoising the current recording scene, noise interference can be effectively removed, and the synthesized tone with higher quality can be obtained.
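For illustration, the mapping from a user-selected noise source option to a pre-formulated noise reduction strategy can be sketched as follows. The option keys, method names, and parameters are hypothetical stand-ins, since the application does not specify concrete strategies:

```python
# Hypothetical sketch of per-option noise reduction strategies. The method
# names and aggressiveness values are illustrative assumptions only.
NOISE_STRATEGIES = {
    "recording_environment": {"method": "spectral_subtraction", "aggressiveness": 0.8},
    "recording_device":      {"method": "wiener_filter",        "aggressiveness": 0.5},
    "pronunciation_feature": {"method": "high_pass_de_pop",     "aggressiveness": 0.3},
}

def select_strategy(user_option: str) -> dict:
    """Return the pre-formulated strategy for the noise source option the user selected."""
    if user_option not in NOISE_STRATEGIES:
        raise ValueError(f"unknown noise source option: {user_option}")
    return NOISE_STRATEGIES[user_option]

print(select_strategy("recording_device")["method"])  # wiener_filter
```

In a real implementation, the returned strategy would parameterize the denoiser applied to the audio stream during recording.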
With reference to the foregoing embodiment, in an implementation manner, the present application further provides an audio processing method, and specifically, the method may further include:
step S31: the method comprises the steps of collecting noise voices sent by a plurality of sound recorders respectively.
In this embodiment, the terminal device may further collect, in advance, the noisy voices uttered by each of a plurality of recorders. These pre-collected noisy voices may be voices collected while the recorders recorded voice on the terminal device, or voices collected while the recorders used other items or functions on the terminal device.
Step S32: extracting corresponding noise from the collected multiple voices with noise respectively, and storing the extracted noise into a noise database.
In this embodiment, the terminal device may establish a noise database in advance. Specifically, the terminal device may extract the corresponding noise from the plurality of pre-collected noisy voices and store each extracted noise in the noise database according to the scene type it belongs to, so that the database organizes noise features by scene type.
For example, noises typical of an in-vehicle environment, such as traffic sounds outside the vehicle, may be extracted from the plurality of noisy voices and stored in the noise database under the in-vehicle environment type; noises such as the footsteps of people outside a poorly sound-insulated room may be stored under the indoor environment type; and distant speech, music, or outdoor wind sounds from the public area of a shopping mall or office building may be stored under the public area type. It should be noted that the noise types in the noise database listed above are only examples, and the embodiment of this application does not specifically limit them.
Step S33: and comparing the detected noise with each noise in the noise database respectively, and determining the target noise matched with the detected noise.
In this embodiment, in the scenario where the current user is recording voice, the noise detected from the collected user voice may be matched in turn against each noise stored in advance in the noise database. For example, a similarity may be calculated between the detected noise and each stored noise; when the similarity exceeds a similarity threshold, the detected noise is considered to match that stored noise, and the successfully matched noise in the database is determined as the target noise. The similarity threshold in this embodiment is set in advance according to historical similarity data or manual experience, and its specific value is not specifically limited in this application.
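The matching against the noise database can be sketched with a cosine similarity over noise feature vectors. The feature vectors, scene names, and the 0.9 threshold below are illustrative assumptions, not values from the application:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two noise feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def match_noise(detected, noise_db, threshold=0.9):
    """Compare the detected noise against every stored noise and return the
    name of the best match above the threshold, or None if nothing matches."""
    best_name, best_sim = None, threshold
    for name, stored in noise_db.items():
        sim = cosine_similarity(detected, stored)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

# Toy database keyed by scene type, with made-up 4-bin noise spectra.
noise_db = {
    "in_vehicle":  [0.9, 0.2, 0.1, 0.05],
    "indoor":      [0.1, 0.8, 0.3, 0.1],
    "public_area": [0.2, 0.3, 0.9, 0.4],
}
print(match_noise([0.85, 0.25, 0.12, 0.06], noise_db))  # in_vehicle
```

Returning None when no stored noise clears the threshold lets the caller fall back to a generic noise reduction strategy.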
Step S34: and carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the target noise to obtain a recorded voice packet.
In this embodiment, after the terminal device determines the target noise matched with the noise corresponding to the currently detected user voice, a pre-established denoising strategy corresponding to the target noise may be adopted to denoise the collected user voice to obtain the recorded voice packet.
In this embodiment, a corresponding noise reduction strategy may be formulated for each noise in the noise database, where each formulated strategy is the one best suited to that noise and yields the best noise reduction effect. Alternatively, a noise reduction strategy may be formulated for each scene type of noise in the database, again choosing the strategy best suited to that scene type with the best effect. Whether the strategies are formulated per noise type or per noise scene type is not specifically limited in this application.
Specifically, the terminal device may perform noise reduction on the collected user voice using the noise reduction strategy corresponding to the target noise itself to obtain the recorded voice packet. The terminal device may also first determine the scene type to which the target noise belongs and then apply the noise reduction strategy corresponding to that scene type. Which of these two strategies the terminal device adopts is not specifically limited in this application.
In this embodiment, by establishing the noise database in advance and formulating a corresponding noise reduction strategy for each noise in it, the terminal device can determine which noise feature in the database the noise detected during the current user's recording is closest to, and adaptively select a targeted noise reduction strategy. Noise interference can thus be removed more effectively, yielding a higher-quality synthesized timbre.
With reference to the foregoing embodiment, in an implementation manner, the present application further provides an audio processing method, and specifically, the method may further include:
step S41: responding to the user's repeated operation of the recording start button on the recording initial interface by outputting personalized noise reduction prompt information on the recording acquisition interface, where the personalized noise reduction prompt information is used to prompt that the voice triggered by the user's current recording operation will be denoised according to the noisy voice triggered by the user's historical recording operations.
In this embodiment, when the terminal device receives a repeated operation by the current user on the recording start button on the recording initial interface, it outputs personalized noise reduction prompt information on the recording acquisition interface, prompting the user that the voice triggered by the current recording operation will be denoised according to the noisy voice triggered by the user's historical recording operations.
Step S42: the historical noise is extracted from the noisy speech triggered by the historical recording operation of the user.
In this embodiment, the terminal device may determine from the user's account information whether the current user has performed recording operations before. After determining that the current user has operated the recording start button on the recording initial interface again, the terminal device determines from historical data the noisy voice triggered by the current user's historical recording operations and extracts the current user's historical noise from that noisy voice.
Step S43: and collecting voice triggered by the recording operation of the user, and performing noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the extracted historical noise to obtain a recorded voice packet.
In this embodiment, the terminal device determines a noise reduction strategy corresponding to the historical noise according to the historical noise of the current user, collects voice triggered by the current recording operation of the current user, and performs noise reduction processing on the collected voice of the current user by using the extracted noise reduction strategy corresponding to the historical noise to obtain a recorded voice packet.
In this embodiment, by drawing on the user's historical noise reduction operations, the noise data and noise reduction strategies in the user's historical data are effectively reused to perform personalized noise reduction on the user's current voice, giving the user a better experience.
With reference to the foregoing embodiment, in an implementation manner, the present application further provides an audio processing method, and specifically after obtaining the recorded voice packet, the method may further include:
step S51: and analyzing the current content to be subjected to voice broadcast, and determining a playing scene.
In this embodiment, after obtaining the user's recorded voice packet, the terminal device may analyze the content currently to be broadcast by voice and determine the playing scene corresponding to that content. The recorded voice packet here may be one obtained by the terminal collecting the voice uttered by the user and performing noise reduction on it. The user may utter speech corresponding to text displayed on the terminal, or may decide the speech content independently; this embodiment places no specific limitation on the content of the user's speech. The playing scenes in this embodiment can be divided into personalized playing scenes and general playing scenes. Personalized playing scenes are all scenes in which personalized voice broadcasting can be enabled, such as a mobile phone ringtone scene; order-message broadcasting at the takeaway rider end, merchant end, driver end, or passenger end; a robot outbound-call scene provided for sales; a scene where training videos and dubbing need to be produced for a preferred service; or a customer service telephone scene. General playing scenes are voice broadcast scenes for normal or serious occasions, such as television news broadcasting or examination room announcements.
Step S52: and under the condition that the playing scene is a personalized playing scene, carrying out voice broadcasting on the content of the current voice to be broadcasted according to the recorded voice packet.
In this embodiment, when the playing scene is a personalized playing scene, the terminal device can broadcast the content currently to be broadcast according to the recorded voice packet; specifically, it can broadcast that content in the user's timbre separated from the recorded voice packet.
For example, for merchant order-message broadcasting, merchants can be allowed to customize a favorite voice, improving merchant satisfaction. For the robot outbound-call scene provided for sales, business development (BD) staff can be allowed to create their own voice and then place notification-type calls to merchants, improving BD efficiency without hurting the merchant experience. For scenes where training videos and dubbing must be produced for a preferred service, and where, owing to the characteristics of lower-tier markets, many cities have local accents, community group leaders can be allowed to create their own timbre and then produce personalized notifications or dubbing in their own voice, increasing the group leaders' enthusiasm for learning and sense of participation. For the customer service telephone scene, agents can be allowed to create their own voice and then use their synthesized voice directly to broadcast system messages to users wherever system broadcasting is required, improving the users' customer service experience, and so on.
Step S53: and under the condition that the playing scene is a universal playing scene, carrying out voice broadcasting on the content of the current voice to be broadcasted according to a default voice packet.
In this embodiment, when the playing scene is a general playing scene, the terminal device can broadcast the content currently to be broadcast using the stored default voice packet; specifically, it can broadcast that content in the timbre of the default voice packet. The default voice packet in this embodiment may be a voice packet prepared in advance and stored on the terminal device, recorded in a studio in a standard broadcast style with clear, well-rounded enunciation.
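The branch between steps S52 and S53 amounts to choosing a voice packet by scene type. The concrete scene labels below are illustrative, since the application only distinguishes personalized vs. general playing scenes:

```python
# Hypothetical scene labels for the personalized playing scenes described above.
PERSONALIZED_SCENES = {"phone_ringtone", "order_broadcast", "robot_outbound_call",
                       "training_dubbing", "customer_service"}

def choose_voice_packet(scene: str, recorded_packet, default_packet):
    """Use the user's recorded voice packet for personalized scenes and the
    studio-recorded default packet otherwise (general scenes such as news)."""
    if scene in PERSONALIZED_SCENES and recorded_packet is not None:
        return recorded_packet
    return default_packet

print(choose_voice_packet("order_broadcast", "user_voice.pkt", "default.pkt"))  # user_voice.pkt
print(choose_voice_packet("tv_news", "user_voice.pkt", "default.pkt"))          # default.pkt
```

Falling back to the default packet when no recorded packet exists keeps broadcasting functional for users who have not recorded a voice packet yet.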
In this embodiment, after the voice packet is recorded, the terminal device can perform personalized voice broadcast according to the recorded voice packet, and can also perform common broadcast by using the voice packet carried by the system, so that various requirements of users are met, and the user experience is improved.
With reference to the foregoing embodiment, in an implementation manner, the present application further provides an audio processing method, which specifically may include:
step S61: and extracting corresponding spectral features from the collected voice.
In this embodiment, the terminal device may extract the spectrum feature corresponding to the user voice from the collected user voice.
Step S62: and inputting the collected voice and the frequency spectrum characteristics thereof into a pre-trained voice noise reduction model to obtain a recorded voice packet, wherein the voice noise reduction model is obtained by training the pre-trained model by taking the frequency spectrum characteristics corresponding to the composite band noise voice and the noiseless voice corresponding to the composite band noise voice as training samples.
In this embodiment, the user voice collected by the terminal device and its corresponding spectral features can be input together into the pre-trained voice noise reduction model to obtain the denoised recorded voice packet. The voice noise reduction model in this embodiment is obtained by further training the pre-trained model with the spectral features corresponding to composite noisy speech and the noiseless speech corresponding to that composite noisy speech as training samples.
In this embodiment, by performing noise reduction on the collected user voice with the pre-trained voice noise reduction model, interference from noise factors can be reduced, the adaptability of the custom voice synthesis function to complex and varied environments can be improved, and the sound quality of the recorded voice packet after voice synthesis can thereby be improved.
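The front end of steps S61–S62, extracting spectral features from framed audio, can be sketched with a naive DFT. The frame length and log-magnitude features are common choices but are assumptions here, since the application does not specify which spectral features are used:

```python
import cmath, math

def frame_spectrum(frame):
    """Naive DFT magnitude spectrum of one analysis frame (O(N^2), for illustration)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def log_spectral_features(signal, frame_len=8):
    """Frame the signal and take log-magnitude spectra per frame -- a simplified
    stand-in for the 'spectral features' fed to the noise reduction model."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        mags = frame_spectrum(signal[start:start + frame_len])
        feats.append([math.log(m + 1e-8) for m in mags])
    return feats

# A pure tone at DFT bin 2 of an 8-sample frame, 16 samples long.
tone = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
feats = log_spectral_features(tone)
print(len(feats), len(feats[0]))  # 2 frames, 5 bins each
```

A production system would use an FFT (e.g. a windowed STFT) instead of this quadratic DFT, but the feature shape per frame is the same.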
In combination with the above embodiments, in an implementation manner, the present application further provides an audio processing method, and referring to fig. 2, fig. 2 is a schematic diagram of an audio processing method shown in an embodiment of the present application. Specifically, as shown in fig. 2, the complex noisy speech (i.e., the noisy complex audio in fig. 2) is obtained according to the following steps, which may include:
step S71: the voice of each of a plurality of recorders is collected.
In this embodiment, the pre-collected speech (audio data) may be preprocessed to obtain composite noisy speech for model training. Specifically, the terminal device can collect in advance the voices uttered by a plurality of recorders; these voices may have been collected while the recorders recorded voice on the terminal device, or while they used other items or functions on the terminal device.
Step S72: and screening out the voice with noise and the voice without noise from the voice sent by each of the plurality of sound recorders.
In this embodiment, the terminal device may classify voices uttered by each of the plurality of recorders acquired in advance, and filter out voices with noise and voices without noise from the voices. In this embodiment, the method may use manual classification or automatic classification (e.g., modeling) to screen out noisy speech (i.e., noisy audio) and noiseless speech (i.e., clean audio) from the speech uttered by each of the multiple recorders.
Step S73: extracting corresponding noise from the screened multiple noisy speeches respectively.
In this embodiment, the noisy speech screened from the voices uttered by the plurality of recorders comprises multiple noisy utterances, and the noise information of each frequency band must be extracted from each of them. Since noise occupies different frequency ranges in different audio, in this embodiment the noise information of each frequency band is extracted for each noisy speech.
Step S74: and according to the extracted noise, carrying out noise adding processing on the screened noise-free voice to obtain the composite voice with noise.
In this embodiment, corresponding noise is extracted from the plurality of screened noisy voices respectively, and corresponding noise adding processing is performed on the screened noiseless voices according to the noise extracted from the plurality of noisy voices to obtain composite noisy voices, so as to be used for training a subsequent voice noise reduction model.
In this embodiment, clean data and noisy data are screened from the historical speech data of the user, and corresponding noise adding processing is performed on the clean data according to noise information extracted from the noisy data to generate a composite noisy speech, so that a noise environment (such as noise caused by human or equipment) most consistent with the reality is simulated for the subsequent training of the noise reduction model, and the speech noise reduction accuracy of the speech noise reduction model is improved.
In combination with the above embodiment, in an implementation manner, the present application further provides an audio processing method, and specifically, the method may include:
step S81: classifying the extracted noise, wherein the category of the extracted noise at least comprises: background noise, noise due to audio impairments.
In this embodiment, the noise information of each frequency band extracted from the noisy speech may be classified into at least: background noise, noise due to audio impairments. In this embodiment, the noise information of each frequency band extracted from the noisy speech may be classified by manual classification or automatic classification (e.g., modeling). The manner of classification is not particularly limited by the present application.
Step S82: and under the condition that the extracted noise type is background noise, performing additive noise addition processing on the screened noise-free voice to obtain the composite voice with noise.
In this embodiment, when the type of the extracted noise is the background noise, the selected noise-free speech may be additively denoised according to the extracted noise information, so as to obtain a composite noisy speech. For example, when the extracted noise information is weak noise or stable background noise (such as continuous current sound in a radio), the noise may be added to the noise-free speech through additive noise adding processing, so as to generate a complex speech with noise.
Step S83: and under the condition that the extracted noise type is noise caused by audio damage, multiplicative noise adding processing is carried out on the screened noise-free voice to obtain a composite voice with noise.
In this embodiment, when the type of the extracted noise is noise caused by audio damage, multiplicative noise-adding processing may be performed on the screened noiseless speech according to the extracted noise information to obtain composite noisy speech. For example, when the extracted noise is relatively random, varies greatly, or is very loud, it may be added to the noiseless speech by multiplicative noise-adding, thereby generating composite noisy speech. The composite noisy speech obtained in this way can, while preserving the noise characteristics, keep the signal-to-noise ratio controlled so that the original (noiseless) speech remains dominant.
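The two noise-adding branches (steps S82 and S83) can be sketched as follows. The SNR control and the per-sample gain model of audio damage are plausible realizations, not details given by the application:

```python
import math, random

def rms(x):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def add_noise_additive(clean, noise, snr_db):
    """Additive noising for background noise: scale the noise so the mix hits a
    target SNR, keeping the original speech dominant at high SNR."""
    gain = rms(clean) / (rms(noise) * 10 ** (snr_db / 20))
    return [c + gain * n for c, n in zip(clean, noise)]

def add_noise_multiplicative(clean, distortion):
    """Multiplicative noising for audio-damage noise: model the impairment as a
    per-sample gain applied to the signal itself."""
    return [c * d for c, d in zip(clean, distortion)]

random.seed(0)
clean = [math.sin(2 * math.pi * t / 32) for t in range(64)]
background = [random.uniform(-1, 1) for _ in range(64)]
noisy_add = add_noise_additive(clean, background, snr_db=20)  # noise at 1% of speech energy
damage = [1.0 + 0.1 * random.uniform(-1, 1) for _ in range(64)]
noisy_mul = add_noise_multiplicative(clean, damage)
```

Sweeping `snr_db` over a range of values is a common way to make the resulting training set cover both mild and severe noise conditions.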
In this embodiment, different noise adding processes are performed on clean audio (non-noise speech) through different types of noise, so that the finally obtained composite noise-carrying speech can simulate a noise environment that matches the actual environment, and a solid foundation is laid for subsequently training a speech noise reduction model with a good noise reduction effect and strong generalization capability.
With reference to the foregoing embodiment, in an implementation manner, the present application further provides a model training method, and specifically, the training step of the pre-trained model may include:
step S91: and extracting corresponding spectral features from the screened noiseless voice.
In this embodiment, the terminal device may extract a spectrum feature corresponding to the noiseless voice from the screened noiseless voice.
Step S92: and training a preset model by taking the screened noise-free voice and the corresponding frequency spectrum characteristics as training samples to obtain the pre-trained model.
In this embodiment, after the terminal device extracts the spectral features corresponding to the noiseless speech, it uses the noiseless speech screened in step S72 and the corresponding spectral features extracted in step S91 as training samples and trains a preset model to obtain the pre-trained model. Specifically, the spectral features corresponding to the noiseless speech may be used as input features and the screened noiseless speech as labels to train a basic neural network vocoder model.
In this embodiment, the model after pre-training is obtained by training the screened noiseless speech and the spectrum features corresponding to the noiseless speech together, so as to provide a model training basis for further training on the model after pre-training to obtain a final speech noise reduction model.
In an implementation manner, with reference to fig. 3, fig. 3 is a schematic diagram of a model training method according to an embodiment of the present application. Specifically, as shown in fig. 3, the training step of the speech noise reduction model (i.e. the adaptive noise reduction model in fig. 3) may include:
step S101: extracting corresponding spectral features from the composite noisy speech, and establishing a correspondence between the spectral features and the noiseless speech used to generate the composite noisy speech.
In this embodiment, the spectral features corresponding to the composite noisy speech are extracted from the generated composite noisy speech, and a correspondence is established between those spectral features and the noiseless speech used to generate the composite noisy speech (the spectral features serve as features and the noiseless speech as labels). The noiseless speech used to generate the composite noisy speech here is the noiseless speech screened in step S72 from the voices uttered by the plurality of recorders. Moreover, since this correspondence between the noisy spectral features and the clean audio is established on top of the pre-trained model, this embodiment does not need to model the "noisy audio to clean audio" mapping from scratch and involves no separate training work for it, saving time and computing resources.
Step S102: and training the pre-trained model by taking the corresponding relation and the noiseless voice corresponding to the composite noisy voice as a training sample to obtain the voice noise reduction model.
In this embodiment, the correspondence between the spectral features of the composite noisy speech and the noiseless speech used to generate it, together with the noiseless speech corresponding to the composite noisy speech, is used as the training sample (the noisy spectral features are the input features and the clean audio is the label). Training then continues on the converged pre-trained model (for example, by adaptive training) to obtain a model capable of voice noise reduction.
In this embodiment, the technical scheme of obtaining the voice noise reduction model through two stages of training can effectively remove noise interference from the user's voice and obtain a higher-quality synthesized timbre, while saving model training time and computing resources and strengthening the generalization ability and stability of the model. This addresses the problem that a neural network vocoder produces blurred or obviously noisy synthesized audio when the original audio data contains noise or when an acoustic model with poor robustness predicts noisy spectral features.
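As a toy illustration of the two-stage scheme, pre-training followed by adaptation, consider a model reduced to one least-squares gain per frequency bin. The real application trains a neural network vocoder, so this is only a schematic stand-in:

```python
def fit_per_bin_gain(inputs, targets):
    """Least-squares scalar gain per frequency bin: g_k = sum(x_k*y_k)/sum(x_k^2).
    A toy stand-in for training the noise reduction model."""
    n_bins = len(inputs[0])
    gains = []
    for k in range(n_bins):
        num = sum(x[k] * y[k] for x, y in zip(inputs, targets))
        den = sum(x[k] * x[k] for x in inputs)
        gains.append(num / den if den else 1.0)
    return gains

# Stage 1 (pre-training): clean features map to themselves, so gains start at 1.
clean = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]
pretrained = fit_per_bin_gain(clean, clean)

# Stage 2 (adaptation): noisy features (clean scaled per bin) map to clean targets,
# so the fitted gains learn to undo the per-bin distortion.
noisy = [[x[0] * 2.0, x[1] * 0.5] for x in clean]
finetuned = fit_per_bin_gain(noisy, clean)
print(pretrained, finetuned)  # [1.0, 1.0] [0.5, 2.0]
```

The point of the analogy: the second stage starts from a model that already reproduces clean speech and only has to learn the noisy-to-clean correction, which is what makes the two-stage training cheaper than training from scratch.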
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments of the application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
Based on the same inventive concept, an embodiment of the present application provides an audio processing apparatus 400. Referring to fig. 4, fig. 4 is a block diagram of an audio processing apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus 400 includes:
the first display module 401, configured to display a recording acquisition interface in response to the user operating the recording start button on the recording initial interface, and to output noise detection prompt information on the recording acquisition interface, where the noise detection prompt information is used to prompt that noise detection is performed before or during acquisition of the voice uttered by the user;
a collecting and denoising module 402, configured to collect voice sent by a user and perform denoising processing on the collected voice to obtain a recorded voice packet;
a second display module 403, configured to output noise reduction prompt information on the recording acquisition interface, where the noise reduction prompt information is used to prompt that noise reduction processing is performed on a voice sent by a user.
Optionally, the apparatus 400 further includes:
a third display module, configured to output a plurality of noise source options on the recording collection interface, where the plurality of noise source options at least include: a recording environment option, a recording device option and a pronunciation feature option;
the acquisition noise reduction module 402 comprises:
and the first noise reduction module is used for carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the noise source option selected by the user to obtain the recorded voice packet.
Optionally, the apparatus 400 further includes:
the first acquisition module is used for acquiring noise voices sent by a plurality of recorders respectively;
the first extraction module is used for respectively extracting corresponding noise from the collected multiple noisy speeches and storing the extracted noise into a noise database;
the acquisition noise reduction module 402 comprises:
a target noise determination module, configured to compare the detected noise with each noise in the noise database, and determine a target noise matching the detected noise;
and the second noise reduction module is used for carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the target noise to obtain a recorded voice packet.
Optionally, the apparatus 400 further includes:
the voice processing module is used for receiving voice input by a user, acquiring a recording initial interface, and sending the recording initial interface to a voice input module, wherein the voice input module is used for inputting voice information of the user, and the voice information is used for inputting voice information of the user;
the second extraction module is used for extracting historical noise from the voice with noise triggered by the historical recording operation of the user;
a third noise reduction module, configured to collect the voice triggered by the user's recording operation and perform noise reduction processing on the collected voice by using a noise reduction strategy corresponding to the extracted historical noise, to obtain a recorded voice packet.
Optionally, after obtaining the recorded voice packet, the apparatus 400 further includes:
an analysis module, configured to analyze the current content to be broadcasted by voice and determine a broadcasting scene;
a first broadcasting module, configured to perform voice broadcasting on the current content to be broadcasted according to the recorded voice packet when the broadcasting scene is a personalized broadcasting scene;
a second broadcasting module, configured to perform voice broadcasting on the current content to be broadcasted according to a default voice packet when the broadcasting scene is a general broadcasting scene.
Optionally, the acquisition and noise reduction module 402 specifically includes:
a third extraction module, configured to extract corresponding spectral features from the collected voice;
a fourth noise reduction module, configured to input the collected voice and its spectral features into a pre-trained speech noise reduction model to obtain a recorded voice packet, where the speech noise reduction model is obtained by training a pre-trained model using, as training samples, the spectral features corresponding to composite noisy speech and the noiseless speech corresponding to that composite noisy speech.
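The patent does not disclose the architecture of the speech noise reduction model. The following sketch shows only the general shape of spectral-domain noise reduction — extract spectra from the collected voice, attenuate a noise estimate, and resynthesize — with classical spectral subtraction standing in for the trained model; the function names and the non-overlapping framing are assumptions.

```python
import numpy as np

def stft(x, n_fft=256):
    """Split into non-overlapping frames and return per-frame complex spectra."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, n_fft)]
    return np.array([np.fft.rfft(f) for f in frames])

def denoise(noisy, noise_profile, n_fft=256):
    """Spectral subtraction as an illustrative stand-in for the learned model:
    subtract the average noise magnitude from each frame, keep the noisy phase,
    and resynthesize the waveform."""
    spec = stft(noisy, n_fft)
    noise_mag = np.mean(np.abs(stft(noise_profile, n_fft)), axis=0)
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # half-wave rectify
    rebuilt = clean_mag * np.exp(1j * np.angle(spec))
    return np.concatenate([np.fft.irfft(fr, n=n_fft) for fr in rebuilt])
```

In the patented design, the trained model would replace the fixed subtraction rule.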
Optionally, the apparatus 400 further includes:
a second acquisition module, configured to collect voices uttered by each of a plurality of recorders;
a screening module, configured to screen out noisy voice and noise-free voice from the voices uttered by each of the plurality of recorders;
a noise extraction module, configured to extract corresponding noise from each of the screened noisy voices;
a noise adding processing module, configured to perform noise adding processing on the screened noise-free voice according to the extracted noise, to obtain composite noisy speech.
Optionally, the noise adding processing module includes:
a classification module, configured to classify the extracted noise, where the categories of the extracted noise at least include background noise and noise caused by audio damage;
an additive noise adding module, configured to perform additive noise adding processing on the screened noise-free voice when the extracted noise category is background noise, to obtain composite noisy speech;
a multiplicative noise adding module, configured to perform multiplicative noise adding processing on the screened noise-free voice when the extracted noise category is noise caused by audio damage, to obtain composite noisy speech.
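The additive versus multiplicative noise adding used to synthesize composite noisy speech can be sketched as follows. The target-SNR mixing rule and the `(1 + noise)` channel model are illustrative assumptions; the patent does not disclose the exact mixing parameters.

```python
import numpy as np

def add_noise(clean, noise, kind="additive", snr_db=10.0):
    """Synthesize composite noisy speech for training.

    additive: background noise mixed in at a target signal-to-noise ratio.
    multiplicative: channel-style damage, clean scaled sample-wise by (1 + noise).
    """
    noise = np.resize(noise, clean.shape)  # tile/trim noise to match length
    if kind == "additive":
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + scale * noise
    elif kind == "multiplicative":
        return clean * (1.0 + noise)
    raise ValueError(f"unknown kind: {kind}")
```

Pairs of (composite noisy speech, original clean speech) then serve as training samples.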
Optionally, the apparatus 400 further includes:
a fourth extraction module, configured to extract corresponding spectral features from the screened noise-free voice;
a first model training module, configured to train a preset model using the screened noise-free voice and its corresponding spectral features as training samples, to obtain the pre-trained model.
Optionally, the apparatus 400 further includes:
a relation establishing module, configured to extract corresponding spectral features from the composite noisy speech and establish a correspondence between the spectral features and the noiseless speech used to generate the composite noisy speech;
a second model training module, configured to train the pre-trained model using the correspondence and the noiseless speech corresponding to the composite noisy speech as training samples, to obtain the speech noise reduction model.
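The second training stage — fitting a model on the correspondence between noisy spectral features and the noiseless speech — can be illustrated with a linear least-squares mapping standing in for the undisclosed network; all names and the magnitude-spectrum feature choice are assumptions.

```python
import numpy as np

def frame_features(x, n_fft=64):
    """Per-frame magnitude spectra used as the training features."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, n_fft)]
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

def train_denoiser(noisy_clean_pairs, n_fft=64):
    """Fit a linear map W so that noisy-frame features @ W approximate the
    corresponding clean-frame features, by least squares. This stands in for
    fine-tuning the pre-trained model on the established correspondences."""
    X = np.vstack([frame_features(n, n_fft) for n, _ in noisy_clean_pairs])
    Y = np.vstack([frame_features(c, n_fft) for _, c in noisy_clean_pairs])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W
```

A real implementation would use a neural model and iterate over many speaker recordings, but the data flow — noisy features in, clean targets out — is the same.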
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to any of the above-mentioned embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device 500, as shown in fig. 5. Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a memory 502, a processor 501, and a computer program stored on the memory and executable on the processor; the processor, when executing the program, implements the steps of the method according to any of the embodiments of the present application.
Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiment description.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The foregoing describes in detail the audio processing method, audio processing apparatus, storage medium, and electronic device provided by the present application. Specific examples are used herein to explain the principles and implementations of the present application, and the above descriptions are intended only to help understand the method and its core ideas. Meanwhile, a person skilled in the art may, following the ideas of the present application, make changes to the specific implementations and application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (13)

1. A method of audio processing, the method comprising:
responding to the operation of a user on a recording starting button on a recording initial interface, displaying a recording acquisition interface, and outputting noise detection prompt information on the recording acquisition interface, wherein the noise detection prompt information is used for prompting that noise detection is carried out before or in the process of acquiring voice sent by the user;
collecting voice sent by a user and carrying out noise reduction processing on the collected voice to obtain a recorded voice packet;
and outputting noise reduction prompt information on the recording acquisition interface, wherein the noise reduction prompt information is used for prompting that noise reduction processing is carried out on the voice sent by the user.
2. The method of claim 1, further comprising:
outputting a plurality of noise source options on the recording acquisition interface, wherein the plurality of noise source options at least comprise: a recording environment option, a recording device option and a pronunciation feature option;
the performing noise reduction processing on the collected voice to obtain a recorded voice packet comprises:
and carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the noise source option selected by the user to obtain a recorded voice packet.
3. The method of claim 1, further comprising:
collecting noise voices sent by a plurality of recorders respectively;
extracting corresponding noise from the collected multiple voices with noise respectively, and storing the extracted noise into a noise database;
the performing noise reduction processing on the collected voice to obtain a recorded voice packet comprises:
comparing the detected noise with each noise in the noise database respectively, and determining a target noise matched with the detected noise;
and carrying out noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the target noise to obtain a recorded voice packet.
4. The method of claim 1, further comprising:
responding to the re-operation of the user on the recording start key on the recording initial interface, and outputting personalized noise reduction prompt information on the recording acquisition interface, wherein the personalized noise reduction prompt information is used for prompting that noise reduction processing is to be performed on voice triggered by the current recording operation of the user according to the voice with noise triggered by the historical recording operation of the user;
extracting historical noise from voice with noise triggered by historical recording operation of a user;
and collecting voice triggered by the recording operation of the user, and performing noise reduction processing on the collected voice by adopting a noise reduction strategy corresponding to the extracted historical noise to obtain a recorded voice packet.
5. The method according to any of claims 1-4, wherein after obtaining the recorded voice packets, the method further comprises:
analyzing the current content to be broadcasted by voice, and determining a broadcasting scene;
under the condition that the broadcasting scene is a personalized broadcasting scene, performing voice broadcasting on the current content to be broadcasted according to the recorded voice packet;
and under the condition that the broadcasting scene is a general broadcasting scene, performing voice broadcasting on the current content to be broadcasted according to a default voice packet.
6. The method according to any one of claims 1 to 4, wherein the denoising processing is performed on the collected voice to obtain a recorded voice packet, specifically comprising:
extracting corresponding spectral features from the collected voice;
and inputting the collected voice and the frequency spectrum characteristics thereof into a pre-trained voice noise reduction model to obtain a recorded voice packet, wherein the voice noise reduction model is obtained by training the pre-trained model by taking the frequency spectrum characteristics corresponding to the composite band noise voice and the noiseless voice corresponding to the composite band noise voice as training samples.
7. The method of claim 6, wherein the complex noisy speech is obtained by:
collecting voices sent by a plurality of sound recorders respectively;
screening out noise-containing voice and noise-free voice from voices respectively emitted by the plurality of sound recorders;
extracting corresponding noise from the screened multiple noisy voices respectively;
and according to the extracted noise, carrying out noise adding processing on the screened noise-free voice to obtain the composite voice with noise.
8. The method according to claim 7, wherein the denoising the filtered noiseless speech according to the extracted noise to obtain a composite noisy speech, comprises:
classifying the extracted noise, wherein the category of the extracted noise at least comprises: background noise, noise due to audio impairments;
under the condition that the extracted noise type is background noise, additive noise adding processing is carried out on the screened noise-free voice to obtain a composite voice with noise;
and under the condition that the extracted noise type is noise caused by audio damage, multiplicative noise adding processing is carried out on the screened noise-free voice to obtain a composite voice with noise.
9. The method of claim 7, wherein the training step of the pre-trained model comprises:
extracting corresponding spectral features from the screened noise-free voice;
and training a preset model by taking the screened noise-free voice and the corresponding frequency spectrum characteristics as training samples to obtain the pre-trained model.
10. The method according to any of claims 7-9, wherein the training step of the speech noise reduction model comprises:
extracting corresponding spectral features from the composite noisy speech, and establishing a correspondence between the spectral features and the noiseless speech for generating the composite noisy speech;
and training the pre-trained model by taking the corresponding relation and the noiseless voice corresponding to the composite noisy voice as a training sample to obtain the voice noise reduction model.
11. An audio processing apparatus, characterized in that the apparatus comprises:
the first display module is used for responding to the operation of a user on a recording starting button on a recording initial interface, displaying a recording acquisition interface and outputting noise detection prompt information on the recording acquisition interface, wherein the noise detection prompt information is used for prompting that noise detection is carried out before or in the process of acquiring voice sent by the user;
the acquisition and noise reduction module is used for acquiring voice sent by a user and carrying out noise reduction processing on the acquired voice to obtain a recorded voice packet;
and the second display module is used for outputting noise reduction prompt information on the recording acquisition interface, wherein the noise reduction prompt information is used for prompting that noise reduction processing is carried out on the voice sent by the user.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any of claims 1-10.
CN202110904018.4A 2021-08-06 2021-08-06 Audio processing method and device, electronic equipment and readable storage medium Pending CN113709291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110904018.4A CN113709291A (en) 2021-08-06 2021-08-06 Audio processing method and device, electronic equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN113709291A true CN113709291A (en) 2021-11-26

Family

ID=78651862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110904018.4A Pending CN113709291A (en) 2021-08-06 2021-08-06 Audio processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113709291A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104991754A (en) * 2015-06-29 2015-10-21 小米科技有限责任公司 Recording method and apparatus
CN105228054A (en) * 2015-10-15 2016-01-06 深圳市大疆创新科技有限公司 Flight instruments, filming apparatus and recording denoising device thereof and method
CN108922523A (en) * 2018-06-19 2018-11-30 Oppo广东移动通信有限公司 Position indicating method, device, storage medium and electronic equipment
CN109273001A (en) * 2018-10-25 2019-01-25 珠海格力电器股份有限公司 A kind of voice broadcast method, device, computing device and storage medium
CN111627416A (en) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 Audio noise elimination method, device, equipment and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114441029A (en) * 2022-01-20 2022-05-06 深圳壹账通科技服务有限公司 Recording noise detection method, device, equipment and medium of voice labeling system
CN116030788A (en) * 2023-02-23 2023-04-28 福建博士通信息股份有限公司 Intelligent voice interaction method and device
CN116030788B (en) * 2023-02-23 2023-06-09 福建博士通信息股份有限公司 Intelligent voice interaction method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211126