WO2021114808A1 - Audio processing method, apparatus, electronic device and storage medium - Google Patents

Audio processing method, apparatus, electronic device and storage medium

Info

Publication number
WO2021114808A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
foreground
category
audio
environmental
Prior art date
Application number
PCT/CN2020/116711
Other languages
English (en)
French (fr)
Inventor
邓朔
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2021114808A1
Priority to US17/527,935 (US11948597B2)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16: Sound input; Sound output
    • G06F3/165: Management of the audio stream, e.g. setting of volume, audio stream path
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/083: Recognition networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Definitions

  • This application relates to the field of communication technology, in particular to audio processing methods, devices, electronic equipment, and storage media.
  • the embodiment of the present application provides an audio processing method, which is executed by an electronic device, and the method includes:
  • an audio processing device including:
  • the acquiring unit is used to acquire the current playing environment of the audio
  • the recognition unit is configured to perform audio recognition on the ambient sound of the current playback environment if the current playback environment is in the foreground state;
  • a determining unit configured to determine the foreground sound in the environmental sound according to the result of audio recognition
  • the classification unit is used to classify the foreground sounds in the environmental sounds to determine the category of the foreground sounds
  • the mixing unit is configured to mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
  • an embodiment of the present application also provides a computer-readable storage medium; the computer-readable storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to execute the steps in any of the audio processing methods provided in the embodiments of the present application.
  • an embodiment of the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor; when executing the program, the processor implements the steps in any of the audio processing methods.
  • FIG. 1a is a schematic diagram of a scene of an audio processing method provided by an embodiment of the present application.
  • Fig. 1b is a first flowchart of an audio processing method provided by an embodiment of the present application.
  • Figure 2a is a schematic diagram of the training process of the adaptive discriminant network provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of another training process of the adaptive discriminant network provided by an embodiment of the present application.
  • FIG. 2c is a second flowchart of the audio processing method provided by an embodiment of the present application.
  • FIG. 2d is a third flowchart of an audio processing method provided by an embodiment of the present application.
  • FIG. 2e is a fourth flowchart of an audio processing method provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an audio processing device provided by an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the embodiments of the present invention provide audio processing methods, devices, electronic equipment, and storage media, which can improve the flexibility of audio playback.
  • the embodiments of the present application provide an audio processing method, device, and storage medium.
  • the audio processing device may be integrated in an electronic device, and the electronic device may be a server or a terminal or other equipment.
  • when the user starts the audio processing mode, the electronic device integrated with the audio processing device can obtain the current playback environment in which the user plays audio. If the current playback environment is in the foreground state, it performs audio recognition on the ambient sound of the current playback environment, then determines the foreground sound in the ambient sound according to the result of the audio recognition, classifies the foreground sound in the ambient sound to determine the category of the foreground sound, and finally mixes the foreground sound with the audio based on that category to obtain a mixed playback sound.
  • this solution can obtain the ambient sound during audio playback, infer the current playback state from the ambient sound, and mix the audio according to that state, which can effectively improve the flexibility of audio playback and enables users, when wearing headphones for audio playback, to remain aware of the surrounding environment information at all times and obtain a safer and more convenient listening experience.
  • the audio processing device may be specifically integrated in an electronic device.
  • the electronic device may be a server or a terminal.
  • the terminal may include devices such as a mobile phone, a tablet computer, a laptop computer, and a personal computer.
  • An audio processing method includes: obtaining the current playback environment of the audio; if the current playback environment is in the foreground state, performing audio recognition on the ambient sound of the current playback environment; then determining the foreground sound in the ambient sound according to the result of the audio recognition; next, classifying the foreground sound in the ambient sound to determine the category of the foreground sound; and then mixing the foreground sound with the audio based on that category to obtain a mixed playback sound.
  • the audio processing method may be specifically executed by an audio processing device integrated in an electronic device, and the specific process may include the following steps.
  • Step 101 Acquire the current playing environment of the audio.
  • the audio processing device may obtain the environment information of the current playback environment when playing audio according to the instruction, and determine the current playback environment according to the environment information.
  • for example, before the current playback environment of the audio is acquired, recording permission can be obtained; this permission is used to identify the current playback environment and, at the same time, to mix the ambient sound with the audio being played on the electronic device.
  • for example, when the user is wearing headphones to watch videos or listen to music, broadcasts, and the like, the user can turn on the audio processing mode.
  • the electronic device asks the user whether he agrees to turn on the microphone permission according to the user's turn-on instruction.
  • the microphone can be used to collect the ambient sound of the current playing environment when the audio is played, and then, based on the ambient sound, determine the environment the user is currently in, that is, the current playing environment.
  • the current playing environment may include the following scenes, for example, classrooms, campuses, sports fields, roads, offices, cafes, parks, construction sites, libraries, and so on.
  • the user can set the scene that needs audio processing.
  • there are many ways to set the scenes; for example, they can be set flexibly according to actual needs, or set in advance and stored in the electronic device, and so on.
  • Step 102 If the current playback environment is in the foreground state, perform audio recognition on the ambient sound of the current playback environment.
  • a microphone can be used to collect the ambient sound of the current playback environment when playing audio, and based on the collected ambient sound, the adaptive discrimination network is used to determine whether the current playback environment is in the foreground state or the background state. If the current playback environment is in the foreground state, audio recognition is performed on the ambient sound of the current playback environment, and if the current playback environment is in the background state, the ambient sound of the current playback environment can be filtered or shielded.
  • the foreground state refers to a state (scene) that requires mixing; for example, it can be a relatively important scene set by the user in which the ambient sound also needs to be heard while listening to the audio. For example, if the user presets the scenes that need mixing as classrooms, roads, and the like, then when the current playback environment is a classroom, a road, or a similar scene, it can be considered to be in the foreground state.
  • the background state refers to a state (scenario) that does not need to be mixed. For example, it may be a scene where the user can ignore the surrounding environmental sounds, such as environmental white noise, noise on construction sites, and rain on rainy days.
  • the foreground state or the background state can be flexibly set according to actual applications, or can be preset and stored in the electronic device, and so on.
  • for example, if the current playback environment is in the foreground state, the ambient sound of the current playback environment is sampled, Mel-frequency cepstral coefficient features are extracted from the sampled ambient sound to obtain the Mel features of the ambient sound, and the adaptive discriminant network is used to perform audio recognition on those Mel features.
  • for sampling, a sampling window T can be set, where T is the time required for sampling; T can be set flexibly according to the needs of the actual application, for example, T can be 1 second. An illustration of this step is sketched below.
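  • As an illustration of this sampling and feature-extraction step, the following is a minimal sketch assuming the librosa library; the window length, sample rate, and MFCC order shown are assumptions, not values fixed by this application.

```python
# A minimal sketch of the sampling-window + MFCC step, assuming the librosa
# library is available; the window length T, sample rate, and MFCC order are
# illustrative assumptions, not values fixed by this application.
import numpy as np
import librosa

T = 1.0        # sampling window in seconds (the description suggests 1 s)
SR = 16000     # assumed microphone sample rate
N_MFCC = 20    # assumed number of Mel-frequency cepstral coefficients

def mel_features(window: np.ndarray, sr: int = SR) -> np.ndarray:
    """Extract the Mel (MFCC) feature vector for one window of ambient sound."""
    mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=N_MFCC)
    # Average over time frames so each window yields a single feature vector.
    return mfcc.mean(axis=1)
```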
  • in the embodiments of the present application, the adaptive discriminant network may be trained by other equipment and then provided to the audio processing device, or the audio processing device may train it itself; that is, before the adaptive discriminant network is used, the audio processing method may further include the following steps:
  • the target playback environment may be a playback environment set by the user that requires audio processing.
  • a microphone may be used to collect environmental sound samples of the target playback environment set by the user, and the collected environmental sound samples are sent to the audio processing device, so that the audio processing device can further process them.
  • specifically, Mel-frequency cepstral coefficient features can be extracted from the environmental sound samples to obtain the Mel features of the environmental sound samples; the environmental sound samples can be classified according to their Mel features to obtain the classification results of the environmental sound samples; and the Mel features and classification results of the environmental sound samples can be used to adaptively train the discriminant network to obtain the adaptive discriminant network.
  • the classification result of the environmental sound samples may be to divide the environmental sound samples into foreground sound samples and background sound samples.
  • in the field of sound processing, the Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear Mel scale of sound frequency.
  • Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum. They are derived from the cepstrum of an audio segment. The difference between the cepstrum and the Mel-frequency cepstrum is that the band division of the Mel-frequency cepstrum is equally spaced on the Mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum.
  • there are many ways to classify the environmental sound samples according to their Mel features. To reduce the complexity of the problem and the amount of computation, user interaction can be used for classification: through interaction with the user, the foreground sound samples and the background sound samples among the environmental sound samples are determined, and so on. For example, user interaction is started based on the Mel features of the environmental sound samples, the current feature category (label), foreground or background, is obtained, and the current feature is determined to be a foreground sound sample or a background sound sample.
  • the discriminant network may include a preset Gaussian mixture model, and using the Mel features and classification results of the environmental sound samples to adaptively train the discriminant network to obtain the adaptive discriminant network may include: estimating the parameters of the preset Gaussian mixture model using the Mel features of the environmental sound samples, and adjusting the estimated parameters according to the true values of the classification results until the preset Gaussian mixture model converges, to obtain the adaptive discriminant network.
  • the Gaussian mixture model is a parametric model that can smoothly simulate various complex models.
  • compared with machine learning and other algorithms, the Gaussian model requires less computation and iterates quickly.
  • the Gaussian model uses the Gaussian probability density function (normal distribution curve) to accurately quantify things, decomposing a thing into several models based on the Gaussian probability density function.
  • in the scenario of this application, only the background needs to be separated out, and the Gaussian mixture model can provide relatively high accuracy.
  • different Gaussian mixture models can be used for discrimination, reducing interference between models and improving accuracy.
  • the Gaussian mixture model uses K (generally 3 to 5) Gaussian models to characterize the features in the audio, and it is mainly determined by the two parameters of variance and mean. Adopting different learning mechanisms for learning the mean and the variance will directly affect the stability, accuracy, and convergence of the model.
  • during modeling, some parameters in the Gaussian mixture model, such as the variance, mean, and weights, need to be initialized, and the data required for modeling, such as the Mahalanobis distance, are obtained from these parameters.
  • during initialization, the variance can generally be set as large as possible (such as 15), and the weights as small as possible (such as 0.001). This is because the initialized Gaussian model is an inaccurate, merely possible model; its range must be continuously narrowed and its parameter values updated during training to obtain the most likely Gaussian model. Setting the variance larger serves to include as much audio as possible in one model, so as to obtain the most probable model. This initialization is sketched below.
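  • An illustrative initialization of these parameters follows; the component count K and feature dimension D are assumed values, not requirements of this application.

```python
# An illustrative initialization of the mixture parameters described above:
# large variances (e.g. 15) and small weights (e.g. 0.001). K components and
# feature dimension D are assumed values.
import numpy as np

K, D = 5, 20                        # components; MFCC feature dimension
means = np.zeros((K, D))            # refined iteratively during training
variances = np.full((K, D), 15.0)   # large, to cover as much audio as possible
weights = np.full(K, 0.001)         # small, updated as training narrows the model
```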
  • an Expectation-Maximization algorithm (EM) algorithm can be used for estimation.
  • the maximum expectation algorithm is a type of optimization algorithm that performs Maximum Likelihood Estimation (MLE) through iteration.
  • the standard computational framework of the EM algorithm alternates between an E-step (Expectation step) and an M-step (Maximization step); the convergence of the algorithm ensures that the iteration approaches at least a local maximum.
  • hidden variables $Z$ can represent missing data, or any random variable that cannot be directly observed in the probability model. Writing the observed data as $X$ and the model parameters as $\theta$, the likelihood of the observed data is $p(X\mid\theta)=\int p(X,Z\mid\theta)\,dZ$ when the hidden variable is a continuous variable, and $p(X\mid\theta)=\sum_{Z}p(X,Z\mid\theta)$ when the hidden variable is a discrete variable; the part under the integral/sum, $p(X,Z\mid\theta)$, is also called the joint likelihood of $X$ and $Z$. Without loss of generality, a discrete variable is used as an example for illustration. Following the general method of MLE, taking the natural logarithm of the above formula gives $\log p(X\mid\theta)=\log\sum_{Z}p(X,Z\mid\theta)$.
  • the EM algorithm has the following optimization goal: maximize the lower bound $L(\theta,q)=\sum_{Z}q(Z)\log\frac{p(X,Z\mid\theta)}{q(Z)}$, where $q(Z)$ is a distribution over the hidden variables. The $L(\theta,q)$ in the formula is equivalent to the surrogate function in the MM algorithm (Minorize-Maximization algorithm) and is the lower limit of the MLE optimization problem; the EM algorithm approximates the maximum of the log-likelihood by iteratively maximizing this surrogate function.
  • that is, the Gaussian mixture model in this embodiment uses the EM algorithm to estimate the parameter distribution, and then adjusts the estimated parameters according to the true values of the environmental sound samples until the likelihood function value of the preset Gaussian mixture model converges, obtaining the adaptive discriminant network.
  • after training, the adaptive discriminant network can be verified. For example, accept input environmental sound samples, feed them into the Gaussian mixture model, and observe whether the judgment is accurate. If the user confirms the judgment is accurate, training ends; if it fails, the environmental sound continues to be sampled, Mel-frequency cepstral coefficient features are extracted from the sampled environmental sound, and the subsequent training process continues.
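  • As a concrete illustration of this training-and-discrimination step, below is a minimal sketch assuming scikit-learn's GaussianMixture (whose fit() performs the EM-based parameter estimation internally); fitting one mixture per class and comparing log-likelihoods is an illustrative design choice, not the mandated implementation.

```python
# A minimal sketch of the adaptive discriminant network, assuming scikit-learn;
# one GaussianMixture per class (foreground/background) is an illustrative
# design. GaussianMixture.fit() performs EM-based parameter estimation.
import numpy as np
from sklearn.mixture import GaussianMixture

K = 5  # number of Gaussian components (the description suggests 3 to 5)

class ForegroundDiscriminator:
    def __init__(self, n_components: int = K):
        self.fg = GaussianMixture(n_components=n_components)
        self.bg = GaussianMixture(n_components=n_components)

    def train(self, fg_feats: np.ndarray, bg_feats: np.ndarray) -> None:
        """fg_feats/bg_feats: user-labeled MFCC vectors, shape (n, d)."""
        self.fg.fit(fg_feats)
        self.bg.fit(bg_feats)

    def is_foreground(self, feat: np.ndarray) -> bool:
        # Higher log-likelihood under the foreground model => foreground sound.
        x = feat.reshape(1, -1)
        return self.fg.score(x) > self.bg.score(x)
```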
  • in some embodiments, the target playback environment can include multiple playback scenes, and the discriminant network can include multiple preset Gaussian mixture models. That is, the step of "using the environmental sound samples to adaptively train the discriminant network to obtain an adaptive discriminant network" may include:
  • training the preset Gaussian mixture model using the environmental sound samples of the first playback scene to obtain a first Gaussian mixture model; training the preset Gaussian mixture model using the environmental sound samples of the second playback scene to obtain a second Gaussian mixture model; and calculating the similarity between the first Gaussian mixture model and the second Gaussian mixture model. If the similarity exceeds a preset threshold, the two models are considered similar, and the first Gaussian mixture model is determined as the Gaussian mixture model of the adaptive discriminant network; if the similarity does not exceed the preset threshold, both the first Gaussian mixture model and the second Gaussian mixture model are determined as Gaussian mixture models of the adaptive discriminant network.
  • the preset threshold can be set in many ways; for example, it can be set flexibly according to the needs of the actual application, or preset and stored in an electronic device. In addition, the preset threshold can be built into the electronic device, or stored in a memory and sent to the electronic device, and so on.
  • for example, calculating the similarity between the first Gaussian mixture model and the second Gaussian mixture model can include: calculating the distance between the two models according to their parameters.
  • determining the Gaussian mixture model of the adaptive discriminant network may then include: if the distance is less than a preset threshold (the second preset threshold), determining the first Gaussian mixture model as the Gaussian mixture model of the adaptive discriminant network; and if the distance is not less than the preset threshold, determining both the first Gaussian mixture model and the second Gaussian mixture model as Gaussian mixture models of the adaptive discriminant network.
  • for example, after a Gaussian mixture model has been obtained and the user trains in a new scene, a new Gaussian mixture model is obtained.
  • the parameters of the Gaussian mixture model are as follows: $p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\sigma_k^{2})$. This formula shows that the Gaussian mixture model is composed of $K$ Gaussian models with parameters $\mu$, $\sigma$, and $\pi$, where $\pi_k$ is the weighting coefficient of the $k$-th Gaussian model. Based on these parameter characteristics, a fast distance estimation method is proposed, which can quickly judge the similarity between models.
  • the distance estimation compares the corresponding parameters of the two models: if the resulting distance is within the maximum distance factor, the two Gaussian mixture models are considered similar, where the maximum distance factor represents the maximum tolerable distance under the current Gaussian model.
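  • Since the exact distance formula is not reproduced above, the following sketch only illustrates the idea of a fast, parameter-based comparison: a weight-averaged distance between matched component means of two fitted scikit-learn mixtures, compared against a maximum distance factor.

```python
# An assumption-laden sketch of a fast parameter-based distance between two
# fitted GaussianMixture models; the patent's exact formula is not reproduced
# here, so a weight-averaged nearest-mean distance is used purely as an example.
import numpy as np
from sklearn.mixture import GaussianMixture

MAX_DISTANCE_FACTOR = 1.0  # the description later sets this threshold to 1

def gmm_distance(a: GaussianMixture, b: GaussianMixture) -> float:
    """Weighted distance between each component of `a` and its nearest in `b`."""
    d = 0.0
    for weight, mean in zip(a.weights_, a.means_):
        d += weight * np.min(np.linalg.norm(b.means_ - mean, axis=1))
    return d

def models_similar(a: GaussianMixture, b: GaussianMixture) -> bool:
    return gmm_distance(a, b) < MAX_DISTANCE_FACTOR
```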
  • Step 103 Determine the foreground sound in the environmental sound according to the result of audio recognition.
  • specifically, the ambient sound can be classified according to its Mel features to obtain the foreground sound and the background sound in the ambient sound, and the foreground sound of the ambient sound can then be selected from the identified foreground and background sounds.
  • the foreground sound can refer to sounds that contain important information such as dialogue sounds and whistle sounds
  • the background sounds can refer to sounds that can be ignored by the user, such as environmental white noise, rainy sounds, and so on.
  • Step 104 Classify the foreground sound in the environmental sound to determine the category of the foreground sound.
  • for example, obtain the classification categories of the audio, classify the foreground sound based on these categories, obtain the confidence level of the foreground sound for each category, and determine the category with the highest confidence level as the category of the foreground sound.
  • confidence, also called reliability, confidence level, or confidence coefficient, arises because, when sampling is used to estimate a population parameter, the conclusion is always uncertain due to the randomness of the sample. Therefore a probabilistic statement is adopted, namely the interval estimation method in mathematical statistics: the probability that the estimated value and the population parameter fall within a certain allowable error range is called the confidence.
  • Confidence level is one of the important indicators to describe the uncertainty of the position of line elements and surface elements in a geographic information system (Geographic Information System or Geo-Information system, GIS).
  • the confidence level indicates the degree of confidence in the interval estimation.
  • the span of the confidence interval is a positive function of the confidence level; that is, the greater the degree of confidence required, the wider the confidence interval obtained, which correspondingly reduces the precision of the estimate.
  • the confidence interval is only used in frequentist statistics; the corresponding concept in Bayesian statistics is the credible interval. However, confidence intervals and credible intervals are based on different concepts, so generally speaking their values will not be the same.
  • the confidence interval indicates the interval in which the estimated value is calculated.
  • the confidence level indicates the probability that the accurate value falls within this interval.
  • in other words, the confidence level refers to the probability that the population parameter value falls within a certain region around the sample statistic, while the confidence interval refers to the error range between the sample statistic and the population parameter value at a given confidence level. The larger the confidence interval, the higher the confidence level.
  • the support vector machine can be trained using the features in the audio training set, and the audio classification category can be determined according to the training result.
  • for example, the distance between the foreground sound and each category can be calculated; that is, the step of "classifying the foreground sound based on the classification categories and obtaining the confidence of the foreground sound for each category" can include: calculating the distance between the Mel feature of the foreground sound and each classification category, and determining, according to the distance, the probability that the foreground sound belongs to each category.
  • determining the category with the highest confidence level as the category of the foreground sound may then include: determining the category with the highest probability among the classification categories as the classification category of the foreground sound.
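  • As a concrete illustration of this classification-with-confidence step, below is a minimal sketch assuming scikit-learn's SVC; the category names follow the dialogue/music/siren example given later in this description, and the toy training data is a hypothetical stand-in for MFCC features from a corpus such as Youtube-8K.

```python
# A minimal sketch of the classification-with-confidence step, assuming
# scikit-learn's SVC; the toy training data below is a hypothetical stand-in
# for real MFCC features from a corpus such as Youtube-8K.
import numpy as np
from sklearn.svm import SVC

CATEGORIES = ["dialogue", "music", "siren"]

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(90, 20))     # placeholder MFCC vectors
train_labels = np.repeat([0, 1, 2], 30)     # indices into CATEGORIES

clf = SVC(probability=True)                 # enables per-category confidence
clf.fit(train_feats, train_labels)

def classify_foreground(feat: np.ndarray) -> tuple[str, float]:
    """Return the highest-confidence category for one foreground feature."""
    probs = clf.predict_proba(feat.reshape(1, -1))[0]
    best = int(np.argmax(probs))            # highest confidence wins
    return CATEGORIES[best], float(probs[best])
```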
  • Step 105 Mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
  • the mixing mode may be determined according to the category of the foreground sound, and the determined mixing mode may be used to mix the foreground sound with the audio to obtain a mixed playback sound.
  • for example, the input is divided into two parts, the environmental sound input EnvInput and the audio input VideoInput, and the output is Output.
  • in the mixing stage, a linear superposition method is used. The formula is as follows: $Output = a \cdot EnvInput + b \cdot VideoInput$
  • where a and b are superposition coefficients, and different superposition coefficients can be used for different categories.
  • this embodiment can obtain the current playback environment of the audio; if the current playback environment is in the foreground state, it performs audio recognition on the ambient sound of the current playback environment, then determines the foreground sound in the ambient sound according to the result of the audio recognition, classifies the foreground sound to determine its category, and mixes the foreground sound with the audio based on that category to obtain a mixed playback sound. Because this solution obtains the ambient sound during audio playback, infers the current playback state from the ambient sound, and mixes the currently played audio according to that state, it can effectively improve the flexibility of audio playback and enables users wearing headphones for audio playback to remain aware of the surrounding environment information at all times, obtaining a safer and more convenient listening experience.
  • in this embodiment, the audio processing device being specifically integrated in an electronic device is taken as an example for description.
  • the discriminant network needs to be trained, as shown in Figure 2a, which can specifically include the following steps.
  • the electronic device obtains environmental sound samples of the target playback environment.
  • the user can set classrooms, roads, etc. as the target playback environment that needs to be processed by audio.
  • a microphone can be used to collect environmental sound samples of the target playback environment, and the collected environmental sound samples are sent to the electronic device, so that the electronic device can further process it.
  • the electronic device uses the environmental sound sample to perform adaptive training on the discriminant network to obtain an adaptive discriminant network.
  • for example, the electronic device can specifically extract the Mel-frequency cepstral coefficient features of the environmental sound samples to obtain the Mel features of the samples; then, based on those Mel features and through interaction with the user, determine the foreground sound samples and the background sound samples among the environmental sound samples; and then use the Mel features and classification results of the environmental sound samples to adaptively train the discriminant network to obtain the adaptive discriminant network.
  • for example, the Gaussian mixture model can be initialized first, with the number of Gaussian models set to 5; sampling is then performed over the sampling window T, the MFCC features are extracted, user interaction is started, and the current feature labels (foreground, background) are obtained.
  • the extracted MFCC features are input into the Gaussian mixture model for parameter estimation, and the parameters are estimated with the EM algorithm.
  • the discriminant network may include a preset Gaussian mixture model.
  • specifically, the Mel features of the environmental sound samples can be used to estimate the parameters of the preset Gaussian mixture model with the maximum expectation (EM) algorithm, and the estimated parameters are adjusted according to the true values of the classification results obtained through user interaction, until the likelihood function value of the preset Gaussian mixture model converges, yielding the adaptive discriminant network.
  • after training, the adaptive discriminant network can be verified. For example, accept input environmental sound samples, feed them into the Gaussian mixture model, and observe whether the judgment is accurate. If the user confirms the judgment is accurate, training ends; if it fails, the environmental sound continues to be sampled, Mel-frequency cepstral coefficient features are extracted from the sampled environmental sound, and subsequent training processes such as parameter estimation are performed.
  • in some embodiments, the target playback environment can include multiple playback scenes, and the discrimination network can include multiple preset Gaussian mixture models. Specifically, the environmental sound samples of the multiple playback scenes can be used to train the preset Gaussian mixture model to obtain multiple Gaussian mixture models; the similarity between the multiple Gaussian mixture models is then calculated; and if the similarity exceeds a preset threshold (a first preset threshold), one of the two Gaussian mixture models whose similarity exceeds the threshold is determined as the Gaussian mixture model of the adaptive discriminant network.
  • the model can also be merged in other ways.
  • for example, the preset Gaussian mixture model can be trained using the environmental sound samples of the first playback scene to obtain the first Gaussian mixture model, and trained using the environmental sound samples of the second playback scene to obtain the second Gaussian mixture model; the distance between the first Gaussian mixture model and the second Gaussian mixture model is then calculated according to the parameters of the models.
  • if the distance is less than the preset threshold, the first Gaussian mixture model is determined as the Gaussian mixture model of the adaptive discriminant network; if the distance is not less than the preset threshold, both the first Gaussian mixture model and the second Gaussian mixture model are determined as Gaussian mixture models of the adaptive discriminant network.
  • for example, the preset threshold (the second preset threshold) can be set to 1.
  • if the distance is less than the threshold, the two Gaussian mixture models are considered similar, and one of them can be used as the Gaussian mixture model for both playback scenes; that is, the first Gaussian mixture model is determined as the Gaussian mixture model of the adaptive discriminant network, where the first Gaussian mixture model refers to any one of the multiple similar Gaussian mixture models.
  • for example, after a Gaussian mixture model has been obtained and the user trains in a new scene, a new Gaussian mixture model is obtained.
  • the parameters of the Gaussian mixture model are as follows: $p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\sigma_k^{2})$. This formula shows that the Gaussian mixture model is composed of $K$ Gaussian models with parameters $\mu$, $\sigma$, and $\pi$, where $\pi_k$ is the weighting coefficient of the $k$-th Gaussian model. Based on these parameter characteristics, a fast distance estimation method is proposed, which can quickly judge the similarity between models.
  • the distance estimation compares the corresponding parameters of the two models: if the resulting distance is within the maximum distance factor, the two Gaussian mixture models are considered similar, where the maximum distance factor represents the maximum tolerable distance under the current Gaussian model. Through this distance estimation, the distance between Gaussian mixture models can be quickly estimated, reducing the number of models.
  • next, an audio processing method is described; the specific process may include the following steps.
  • Step 201 The electronic device obtains the current playing environment of the audio.
  • for example, the user can choose to turn on the audio processing mode in the player of the electronic device when watching a video while wearing a headset.
  • when the electronic device receives the user's instruction to turn on the audio processing mode, it asks the user, according to that instruction, whether he agrees to turn on the microphone permission. After receiving the user's consent, the microphone can be used to collect the ambient sound of the current playback environment while audio is playing, and the environment the user is currently in is then determined based on that ambient sound.
  • to adapt to the user's different playback environments, this embodiment provides an adaptive dynamic discrimination method that adds user interaction feedback and uses the Gaussian mixture model to dynamically update the foreground/background discrimination network.
  • the user can set the foreground state before turning on the audio processing mode.
  • the user presets the scene that needs to be mixed as a classroom, a road, and so on.
  • Step 202 If the current playback environment is in the foreground state, the electronic device samples the ambient sound of the current playback environment.
  • the electronic device can specifically determine whether the current playback environment is in the foreground state or the background state. If the current playback environment is in the foreground state, the electronic device can sample the ambient sound of the current playback environment. For example, a sampling window T can be set, T is the time required for sampling, where T can be 1 second.
  • Step 203 The electronic device extracts the Mel frequency cepstrum coefficient feature from the sampled environmental sound to obtain the Mel feature of the environmental sound.
  • for example, the electronic device can specifically extract the Mel-frequency cepstral coefficient features of the ambient sound to obtain the Mel features of the ambient sound.
  • Step 204 The electronic device uses the adaptive discrimination network to perform audio recognition on the Mel feature of the ambient sound.
  • the electronic device may specifically input the mel feature of the environmental sound into a trained adaptive discrimination network, and use the adaptive discrimination network to perform audio recognition to identify foreground and background sounds in the environmental sound.
  • Step 205 The electronic device determines the foreground sound in the environmental sound according to the result of audio recognition.
  • for example, the electronic device can specifically classify the ambient sound according to its Mel features to determine the foreground sound and the background sound in the ambient sound, and filter out the foreground sound from the determined foreground and background sounds.
  • Step 206 The electronic device classifies the foreground sound in the environmental sound to determine the category of the foreground sound.
  • the electronic device may specifically obtain the classification category of the audio, classify the foreground sound based on the classification category, obtain the confidence level of the foreground sound in each classification category, and determine the classification category with the highest confidence level as the foreground sound Category.
  • for example, a classification algorithm based on a Support Vector Machine (SVM) can be used, with the Youtube-8K training set as training data; the audio classification categories can be dialogue, music, and siren.
  • for example, the distance between the Mel feature of the foreground sound and each classification category can be calculated, the probability that the foreground sound belongs to each category is determined according to the distance, and the category with the highest probability is determined as the classification category of the foreground sound.
  • Step 207 The electronic device mixes the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
  • the mixing mode may be determined according to the category of the foreground sound, and the determined mixing mode may be used to mix the foreground sound with the audio to obtain a mixed playback sound.
  • for example, the input is divided into two parts, the environmental sound input EnvInput and the audio input VideoInput, and the output is Output.
  • in the mixing stage, a linear superposition method is used. The formula is as follows: $Output = a \cdot EnvInput + b \cdot VideoInput$
  • where a and b are superposition coefficients, and different superposition coefficients can be used for different categories; an illustrative per-category setting is sketched below.
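  • A minimal sketch of this per-category coefficient setting follows; the (a, b) values are hypothetical illustrations, since the description only states that different categories use different superposition coefficients.

```python
# A minimal sketch of the linear-superposition mixing step. The per-category
# (a, b) coefficient values are hypothetical illustrations; the description
# only states that different categories use different coefficients.
import numpy as np

COEFFS = {
    "siren":    (0.8, 0.2),   # safety-relevant foreground: favour ambient sound
    "dialogue": (0.5, 0.5),
    "music":    (0.2, 0.8),
}

def mix(env_input: np.ndarray, video_input: np.ndarray, category: str) -> np.ndarray:
    a, b = COEFFS[category]
    return a * env_input + b * video_input  # Output = a*EnvInput + b*VideoInput
```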
  • this embodiment can obtain the current playback environment of the audio; if the current playback environment is in the foreground state, the electronic device performs audio recognition on the ambient sound of the current playback environment, then determines the foreground sound in the ambient sound according to the result of the audio recognition, classifies the foreground sound to determine its category, and mixes the foreground sound with the audio based on that category to obtain a mixed playback sound. Because the electronic device obtains the ambient sound during audio playback, infers the current playback state from the ambient sound, and mixes the currently played audio according to that state, this can effectively improve the flexibility of audio playback and allows the user, when wearing headphones for audio playback, to remain aware of the surrounding environment information at all times and obtain a safer and more convenient listening experience.
  • This solution is applied to a player of an electronic device.
  • when the player turns on the audio processing mode and the user wears headphones to watch videos or listen to music, broadcasts, and the like, the current user's playback environment can be obtained and, according to the method in this solution, it is decided when to invoke and apply the mixing strategy, so that users can easily receive audio information from the external environment, improving their viewing experience and keeping them aware of external environment information while they focus on watching the video.
  • an embodiment of the present application also provides an audio processing device.
  • the audio processing device may be specifically integrated in an electronic device.
  • the electronic device may be a server or a terminal.
  • the audio processing device may include an acquisition unit 301, an identification unit 302, a determination unit 303, a classification unit 304, and a mixing unit 305.
  • the acquiring unit 301 is used to acquire the current playing environment of the audio.
  • the acquiring unit 301 may specifically acquire the current playback environment when the audio is played according to the instruction after receiving the user's instruction to turn on the audio processing mode.
  • for example, the acquiring unit 301 asks the user, according to the user's turn-on instruction, whether he agrees to turn on the microphone permission.
  • after receiving the user's consent, the acquiring unit 301 may use the microphone to collect the ambient sound of the current playback environment when playing audio, and then determine, based on the ambient sound, the environment the user is currently in, that is, the current playback environment.
  • the current playing environment may include the following scenes, such as classrooms, campuses, sports fields, roads, offices, cafes, parks, construction sites, libraries, and so on.
  • the user can set the scene that needs audio processing.
  • there are many ways to set the scenes; for example, they can be set flexibly according to actual needs, or set in advance and stored in the electronic device, and so on.
  • the recognition unit 302 is configured to perform audio recognition on the ambient sound of the current playback environment if the current playback environment is in the foreground state.
  • the identification unit 302 can be specifically used to: sample the ambient sound of the current playback environment if the current playback environment is in the foreground state; extract the Mel-frequency cepstral coefficient features from the sampled ambient sound to obtain the Mel features of the ambient sound; and use the adaptive discriminant network to perform audio recognition on the Mel features of the ambient sound.
  • the audio processing device may further include a training unit for obtaining environmental sound samples of the target playback environment, and using the environmental sound samples to perform adaptive training on the discrimination network to obtain an adaptive discrimination network.
  • the training unit may include an extraction subunit and a training subunit.
  • the extraction subunit is used to extract the Mel frequency cepstral coefficient feature of the environmental sound sample to obtain the Mel feature of the environmental sound sample, and classify the environmental sound sample according to the Mel feature of the environmental sound sample to obtain the environmental sound sample Classification results;
  • the training subunit is used for adaptively training the discriminant network by using the Mel feature and classification result of the environmental sound sample to obtain an adaptive discriminant network.
  • the discriminant network includes a preset Gaussian mixture model
  • the training subunit can be specifically used to estimate the parameters of the preset Gaussian mixture model using the Mel features of the environmental sound samples, and to adjust the estimated parameters according to the true values of the classification results of the environmental sound samples until the preset Gaussian mixture model converges, obtaining the adaptive discriminant network.
  • the target playback environment includes multiple playback scenes
  • the discrimination network includes multiple preset Gaussian mixture models
  • in some embodiments, the training unit may be specifically configured to train the preset Gaussian mixture models using the environmental sound samples of the multiple playback scenes to obtain multiple Gaussian mixture models, calculate the similarity between the multiple Gaussian mixture models, and, if the similarity exceeds the preset threshold, determine one of the similar Gaussian mixture models as the Gaussian mixture model of the adaptive discriminant network.
  • for example, the preset Gaussian mixture model can be trained using the environmental sound samples of the first playback scene to obtain the first Gaussian mixture model, and trained using the environmental sound samples of the second playback scene to obtain the second Gaussian mixture model; the similarity between the first Gaussian mixture model and the second Gaussian mixture model is then calculated. If the similarity exceeds the preset threshold, the first Gaussian mixture model is determined as the Gaussian mixture model of the adaptive discriminant network; if the similarity does not exceed the preset threshold, both the first Gaussian mixture model and the second Gaussian mixture model are determined as Gaussian mixture models of the adaptive discriminant network.
  • the determining unit 303 is configured to determine the foreground sound in the environmental sound according to the result of audio recognition.
  • the determining unit 303 may be specifically used to classify the ambient sound according to its Mel features to obtain the foreground sound and the background sound in the ambient sound, and to select the foreground sound of the ambient sound from the identified foreground and background sounds.
  • the classification unit 304 is used to classify the foreground sound in the environmental sound to determine the category of the foreground sound.
  • the classification unit 304 may include a classification sub-unit and a determination sub-unit.
  • the classification subunit is used to obtain the classification category of the audio, classify the foreground sound based on the classification category, and obtain the confidence level of the foreground sound in each classification category;
  • the determining subunit is used to determine the classification category with the highest confidence as the foreground sound category.
  • in some embodiments, the classification subunit may be specifically used to calculate the distance between the Mel feature of the foreground sound and each classification category and to determine, according to the distance, the probability that the foreground sound belongs to each category; the determining subunit may then be specifically used to determine the category with the highest probability as the classification category of the foreground sound.
  • the mixing unit 305 is configured to mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
  • the sound mixing unit 305 may be specifically used to determine a mixing mode according to the category of the foreground sound, and use the determined mixing mode to mix the foreground sound with the audio to obtain a mixed playback sound.
  • each of the above units can be implemented as an independent entity, or can be combined arbitrarily, and implemented as the same or several entities.
  • each of the above units please refer to the previous method embodiments, which will not be repeated here.
  • as can be seen from the above, the acquisition unit 301 acquires the current playback environment of the audio; if the current playback environment is in the foreground state, the recognition unit 302 performs audio recognition on the ambient sound of the current playback environment, and the determination unit 303 then determines the foreground sound in the ambient sound according to the result of the audio recognition.
  • next, the classification unit 304 classifies the foreground sound in the ambient sound to determine the category of the foreground sound, and the mixing unit 305 mixes the foreground sound with the audio based on that category to obtain a mixed playback sound. Because this solution uses perception of the environment to obtain the ambient sound during audio playback, infers the current playback state from the ambient sound, and mixes the currently played audio according to that state, it can effectively improve the flexibility of audio playback and enables users wearing headphones for audio playback to remain aware of the surrounding environment information at all times, obtaining a safer and more convenient listening experience.
  • an embodiment of the present application also provides an electronic device, as shown in FIG. 4, which shows a schematic structural diagram of the electronic device involved in the embodiment of the present application, specifically:
  • the electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components.
  • the structure shown in FIG. 4 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown in the figure, combine certain components, or use a different arrangement of components.
  • the processor 401 is the control center of the electronic device. It uses various interfaces and lines to connect the various parts of the entire electronic device, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
  • in some embodiments, the processor 401 may include one or more processing cores; in the embodiment of the present application, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like,
  • and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may alternatively not be integrated into the processor 401.
  • the memory 402 may be used to store software programs and modules.
  • the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402.
  • the memory 402 may mainly include a program storage area and a data storage area.
  • the program storage area may store the operating system, application programs required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created through the use of the electronic device, and the like.
  • the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
  • the electronic device also includes a power supply 403 for supplying power to various components.
  • the power supply 403 may be logically connected to the processor 401 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • the power supply 403 may also include any components such as one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
  • the electronic device may further include an input unit 404, which can be used to receive input digital or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
  • the electronic device may also include a display unit, etc., which will not be repeated here.
  • specifically, in this embodiment, the processor 401 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby realizing various functions, as follows:
  • this embodiment can obtain the current playback environment of the audio; if the current playback environment is in the foreground state, perform audio recognition on the ambient sound of the current playback environment; then determine the foreground sound in the ambient sound according to the result of the audio recognition; classify the foreground sound in the ambient sound to determine its category; and mix the foreground sound with the audio based on that category to obtain a mixed playback sound.
  • to this end, an embodiment of the present application further provides a computer-readable storage medium in which a plurality of instructions are stored, the instructions being loadable by a processor to execute the steps in any of the audio processing methods provided in the embodiments of the present application.
  • for example, the instructions can perform the following steps: acquire the current playback environment of the audio; if the current playback environment is in the foreground state, perform audio recognition on the ambient sound of the current playback environment; determine the foreground sound in the ambient sound according to the result of the audio recognition; classify the foreground sound in the ambient sound to determine the category of the foreground sound; and mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
  • the storage medium may include: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.
  • since the instructions stored in the storage medium can execute the steps in any audio processing method provided in the embodiments of the present application, they can achieve the beneficial effects that can be achieved by any audio processing method provided in the embodiments of the present application; for details, refer to the previous embodiments, which will not be repeated here.

Abstract

An audio processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring the current playback environment of audio (101); if the current playback environment is in the foreground state, performing audio recognition on the ambient sound of the current playback environment (102); then determining the foreground sound in the ambient sound according to the result of the audio recognition (103); next, classifying the foreground sound in the ambient sound to determine the category of the foreground sound (104); and mixing the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound (105).

Description

Audio processing method, apparatus, electronic device, and storage medium
This application claims priority to Chinese patent application No. 201911267593.7, entitled "Audio processing method, apparatus, and storage medium", filed with the China National Intellectual Property Administration on December 11, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of communication technology, and in particular to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of the 4G era and the arrival of the 5G era, enjoying video content on mobile devices has gradually become a primary form of entertainment for a large number of users.
Summary
An embodiment of the present application provides an audio processing method, executed by an electronic device, the method including:
acquiring the current playback environment of audio;
if the current playback environment is in the foreground state, performing audio recognition on the ambient sound of the current playback environment;
determining the foreground sound in the ambient sound according to the result of the audio recognition;
classifying the foreground sound in the ambient sound to determine the category of the foreground sound;
mixing the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
Correspondingly, an embodiment of the present application further provides an audio processing apparatus, including:
an acquiring unit, configured to acquire the current playback environment of audio;
a recognition unit, configured to perform audio recognition on the ambient sound of the current playback environment if the current playback environment is in the foreground state;
a determining unit, configured to determine the foreground sound in the ambient sound according to the result of the audio recognition;
a classification unit, configured to classify the foreground sound in the ambient sound to determine the category of the foreground sound;
a mixing unit, configured to mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the steps in any of the audio processing methods provided in the embodiments of the present application.
In addition, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when executing the program, the processor implements the steps in any of the audio processing methods provided in the embodiments of the present invention.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present application, and a person skilled in the art may still derive other drawings from these accompanying drawings without creative effort.
FIG. 1a is a schematic diagram of a scene of an audio processing method provided by an embodiment of the present application;
FIG. 1b is a first flowchart of an audio processing method provided by an embodiment of the present application;
FIG. 2a is a schematic diagram of a training process of the adaptive discriminant network provided by an embodiment of the present application;
FIG. 2b is a schematic diagram of another training process of the adaptive discriminant network provided by an embodiment of the present application;
FIG. 2c is a second flowchart of the audio processing method provided by an embodiment of the present application;
FIG. 2d is a third flowchart of the audio processing method provided by an embodiment of the present application;
FIG. 2e is a fourth flowchart of the audio processing method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of this application with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of this application. All other embodiments obtained by a person skilled in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
When enjoying video content on a mobile device in certain scenarios, for example when watching video through headphones in a complex environment, a user is easily absorbed by the video content and may ignore the surrounding ambient sound, causing unforeseeable danger or inconvenience. For example, while walking, the user may fail to notice the surrounding environment and its sounds, and thus overlook nearby hazards. And when the user wants to talk with someone, the user has to take off the headphones or turn down the volume in order to hear the other party clearly, interrupting the viewing and degrading the viewing experience.
In view of this, embodiments of the present invention provide an audio processing method, apparatus, electronic device, and storage medium, which can improve the flexibility of audio playback.
Embodiments of this application provide an audio processing method, apparatus, and storage medium. The audio processing apparatus may be integrated in an electronic device, and the electronic device may be a server, a terminal, or another device.
For example, referring to FIG. 1a, when the user enables the audio processing mode, the electronic device integrated with the audio processing apparatus may first obtain the current playback environment in which the user is playing audio. If the current playback environment is in a foreground state, audio recognition is performed on the ambient sound of the current playback environment; the foreground sound in the ambient sound is then determined according to the result of the audio recognition; next, the foreground sound in the ambient sound is classified to determine the category of the foreground sound; and finally the foreground sound is mixed with the audio based on that category to obtain a mixed playback sound.
Because this solution captures the ambient sound during audio playback, infers the current playback state from that ambient sound, and mixes the currently played audio according to the inferred state, it can effectively improve the flexibility of audio playback, enabling a user wearing headphones to stay aware of the surrounding environment at all times and obtain a safer and more convenient listening experience.
Detailed descriptions are given below. It should be noted that the description order of the following embodiments is not intended to limit a preferred order of the embodiments.
This embodiment is described from the perspective of the audio processing apparatus. The audio processing apparatus may be specifically integrated in an electronic device, and the electronic device may be a server, a terminal, or another device; the terminal may include a mobile phone, a tablet computer, a notebook computer, a personal computer, and the like.
An audio processing method includes: obtaining the current playback environment of the audio; if the current playback environment is in a foreground state, performing audio recognition on the ambient sound of the current playback environment; determining the foreground sound in the ambient sound according to the result of the audio recognition; classifying the foreground sound in the ambient sound to determine the category of the foreground sound; and mixing the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
As shown in FIG. 1b, the audio processing method may be specifically performed by the audio processing apparatus integrated in the electronic device, and the specific procedure may include the following steps.
Step 101: Obtain the current playback environment of the audio.
For example, after receiving a user instruction to enable the audio processing mode, the audio processing apparatus may obtain, according to the instruction, environment information of the playback environment in which the audio is currently being played, and determine the current playback environment based on that information.
For instance, before obtaining the current playback environment of the audio, recording permission may be obtained; this permission is used both to distinguish the current playback environment and to mix the ambient sound with the audio currently being played on the electronic device.
For example, when the user is wearing headphones to watch a video or listen to music or radio, the user may enable the audio processing mode. According to the user's enabling instruction, the electronic device asks whether the user agrees to grant microphone permission. After the user grants the permission, the electronic device can use the microphone to capture the ambient sound of the current playback environment during audio playback, and then determine the environment the user is currently in, i.e., the current playback environment, from that ambient sound.
The current playback environment may include scenarios such as a classroom, a campus, a sports field, a road, an office, a café, a park, a construction site, a library, and so on. In this embodiment of the application, the user may set the scenarios that require audio processing; the scenarios may be set in many ways, for example flexibly configured according to actual needs, or preset and stored in the electronic device.
Step 102: If the current playback environment is in a foreground state, perform audio recognition on the ambient sound of the current playback environment.
For example, a microphone may be used to capture the ambient sound of the current playback environment during audio playback, and an adaptive discrimination network may be used to decide, based on the captured ambient sound, whether the current playback environment is in a foreground state or a background state. If the current playback environment is in a foreground state, audio recognition is performed on the ambient sound; if it is in a background state, the ambient sound may be filtered out or muted.
The foreground state refers to a state (scenario) in which mixing is required, for example a scenario the user designates as important, in which the ambient sound should be heard while listening to the audio. For instance, if the user has preset the scenarios requiring mixing as classroom, road, and the like, then when the current playback environment is a classroom or a road, it is considered to be in the foreground state. The background state refers to a state (scenario) in which no mixing is required, for example a scenario whose surrounding sounds the user can ignore, such as ambient white noise, the din of a construction site, or the sound of rain. The foreground and background states can be flexibly configured according to the actual application, or preset and stored in the electronic device.
The ambient sound of the current playback environment may be recognized in many ways. For example, if the current playback environment is in a foreground state, the ambient sound of the current playback environment may be sampled, Mel-frequency cepstral coefficient features may be extracted from the sampled ambient sound to obtain the Mel features of the ambient sound, and the adaptive discrimination network may then perform audio recognition on those Mel features.
For sampling, a sampling window T may be set, where T is the time required for sampling; T can be flexibly configured according to the needs of the actual application, for example T may be 1 second.
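As a concrete sketch of this sampling-and-feature step, the following Python snippet extracts MFCC (Mel) features from a one-second window; the librosa library, the sample rate, and the number of coefficients are illustrative assumptions, not values fixed by this application:

    import numpy as np
    import librosa

    def sample_mel_features(samples, sr=16000, n_mfcc=13):
        """Extract MFCC (Mel) features from one sampling window T of audio.
        samples: 1-D array holding sr*T mono samples captured by the microphone."""
        y = np.asarray(samples, dtype=np.float32)
        # Result shape: (n_mfcc, n_frames); each column is one frame's Mel feature.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)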
In this embodiment of the application, the adaptive discrimination network may be trained by another device and then provided to the audio processing apparatus, or it may be trained by the audio processing apparatus itself. That is, before the adaptive discrimination network is used, the audio processing method may further include the following steps:
(1) Obtain ambient sound samples of a target playback environment.
The target playback environment may be a playback environment for which the user has configured audio processing. For example, a microphone may be used to collect ambient sound samples of the target playback environment set by the user, and the collected samples are sent to the audio processing apparatus for further processing.
(2) Adaptively train a discrimination network with the ambient sound samples to obtain the adaptive discrimination network.
For example, Mel-frequency cepstral coefficient features may be extracted from the ambient sound samples to obtain their Mel features; the ambient sound samples are classified according to these Mel features to obtain classification results; and the discrimination network is adaptively trained with the Mel features and the classification results to obtain the adaptive discrimination network. The classification results may divide the ambient sound samples into foreground sound samples and background sound samples.
In the field of sound processing, the Mel-frequency cepstrum is a linear transform of the log-energy spectrum on the nonlinear Mel scale of sound frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum; they are derived from the cepstrum of an audio clip. The difference between the cepstrum and the Mel-frequency cepstrum is that the frequency bands of the latter are equally spaced on the Mel scale, which approximates the human auditory system better than the linearly spaced bands of the ordinary log cepstrum.
There are many ways to classify the ambient sound samples according to their Mel features. To reduce the complexity of the problem and the amount of computation, classification may be done through user interaction: the foreground and background samples in the ambient sound are determined by interacting with the user. For example, user interaction is started on the Mel features of the ambient sound samples, the current feature category (label) — foreground or background — is obtained, and the current feature is thereby determined to be a foreground sound sample or a background sound sample.
The discrimination network may include a preset Gaussian mixture model. Adaptively training the discrimination network with the Mel features and classification results of the ambient sound samples to obtain the adaptive discrimination network may include:
performing parameter estimation on the preset Gaussian mixture model with the Mel features of the ambient sound samples; and adjusting the estimated parameters according to the ground-truth classification results of the ambient sound samples until the preset Gaussian mixture model converges, thereby obtaining the adaptive discrimination network.
A Gaussian mixture model is a parametric model that can smoothly approximate a wide variety of complex distributions, while requiring less computation than machine-learning algorithms and iterating quickly. A Gaussian model quantifies things precisely with Gaussian probability density functions (normal distribution curves), decomposing a phenomenon into several models built from such curves. In the scenario of this application, only foreground and background need to be distinguished, so a Gaussian mixture model provides high accuracy; moreover, different Gaussian mixture models can be used for different scenarios, reducing interference between models and improving accuracy.
A Gaussian mixture model uses K Gaussian components (generally 3 to 5) to characterize the features of the audio; each component is determined mainly by two parameters, the mean and the variance. The learning schemes adopted for the mean and variance directly affect the stability, accuracy, and convergence of the model. During modeling, parameters of the Gaussian mixture model such as the variances, means, and weights need to be initialized, and quantities required for modeling, such as the Mahalanobis distance, are computed from these parameters. In initialization, the variance is generally set as large as possible (e.g., 15) and the weight as small as possible (e.g., 0.001). This is because the initial Gaussian model is an inaccurate, tentative model whose range must be continually narrowed during training, and whose parameter values must be updated, to arrive at the most likely Gaussian model; setting the variance large is intended to include as much audio as possible in one model so as to obtain the most probable model.
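A minimal initialization sketch consistent with the values just described might look as follows; the feature dimensionality and the use of per-dimension variances are assumptions made only for illustration:

    import numpy as np

    def init_gmm(n_components=5, n_dims=13):
        """Tentative GMM initialization as described above: variances set
        large (15) so each component initially covers as much audio as
        possible, weights set small (0.001), to be refined during training."""
        means = np.zeros((n_components, n_dims))
        variances = np.full((n_components, n_dims), 15.0)
        weights = np.full(n_components, 0.001)
        return means, variances, weights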
There are many ways to estimate the parameters; for example, the Expectation-Maximization (EM) algorithm may be used. The EM algorithm is a class of optimization algorithms that performs maximum likelihood estimation (MLE) through iteration. Its standard computational framework alternates between an E-step (Expectation step) and an M-step (Maximization step), and the convergence of the algorithm guarantees that the iteration approaches at least a local maximum.
The EM algorithm is an optimization algorithm based on maximum likelihood estimation theory. Given mutually independent observations X = {X_1, ..., X_N} and a probability model f(X, Z, θ) containing a latent variable Z and parameters θ, MLE theory gives the optimal point estimate of θ where the model likelihood attains its maximum: θ = argmax_θ p(X|θ). Taking the latent variable into account, the likelihood expands as follows:
p(X|θ) = ∫ p(X, Z|θ) dZ        (latent variable Z continuous)
p(X|θ) = Σ_Z p(X, Z|θ)        (latent variable Z discrete)
The latent variable can represent missing data, or any random variable in the probability model that cannot be observed directly. The first line above is the case where the latent variable is continuous, and the second line the case where it is discrete; the integral/summation part is also called the joint likelihood of X and Z. Without loss of generality, the discrete case is used for illustration here. By the general method of MLE, taking the natural logarithm of the above yields:
log p(X|θ) = Σ_{i=1}^{N} log Σ_Z p(X_i, Z|θ)
This expansion uses the mutual independence of the observations. Introducing a probability distribution q(Z) over the latent variable, i.e., the latent distribution (which can be viewed as the posterior of the latent variable given the observations; see the E-step derivation of the standard algorithm), Jensen's inequality gives the following bound on the log-likelihood of the observations:
log p(X|θ) ≥ Σ_{i=1}^{N} Σ_Z q(Z) log [ p(X_i, Z|θ) / q(Z) ]
When the right-hand side of the inequality attains its global maximum over θ and q, the resulting θ makes the left-hand side attain at least a local maximum. Therefore, denoting the right-hand side by L(θ, q), the EM algorithm has the following optimization target:
(θ, q) = argmax_{θ, q} L(θ, q)
Here L(θ, q) is equivalent to the surrogate function in the MM algorithm (Minorize-Maximization algorithm): it is a lower bound of the MLE optimization problem, and the EM algorithm approaches the maximum of the log-likelihood by maximizing this surrogate function.
In this embodiment, the Gaussian mixture model uses the EM algorithm to estimate the parameter distribution; the estimated parameters are then adjusted according to the ground-truth values of the ambient sound samples until the likelihood of the preset Gaussian mixture model converges, yielding the adaptive discrimination network.
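This EM-based fitting is available off the shelf; the sketch below, which assumes scikit-learn, fits one mixture per user-provided label (foreground/background) on Mel feature vectors and discriminates a new window by comparing likelihoods — one plausible realization of the scheme, not the exact procedure claimed here:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_discriminator(fg_feats, bg_feats, n_components=5):
        """Fit one GMM per label via the EM algorithm; each row of
        fg_feats/bg_feats is a Mel feature vector labeled through
        user interaction as foreground or background."""
        fg = GaussianMixture(n_components=n_components, covariance_type="diag").fit(fg_feats)
        bg = GaussianMixture(n_components=n_components, covariance_type="diag").fit(bg_feats)
        return fg, bg

    def is_foreground(fg, bg, feats):
        """A window is foreground if its average log-likelihood is higher
        under the foreground mixture than under the background mixture."""
        return fg.score_samples(feats).mean() > bg.score_samples(feats).mean()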
After training, the adaptive discrimination network can be verified: for example, accept an ambient sound sample as input, feed it into the Gaussian mixture model, and check whether the judgment is accurate. If the user confirms it is accurate, training ends; if it fails, the ambient sound continues to be sampled, Mel-frequency cepstral coefficient features are extracted from the newly sampled ambient sound, and the subsequent training process continues.
Because the user may input many scenarios, some of which may overlap heavily, each new scenario the user trains produces a new Gaussian mixture model, and too many models waste space. Therefore, to reduce wasted space and the number of models, model merging is proposed so that the models are trained more finely. The target playback environment may include multiple playback scenarios, and the discrimination network may include multiple preset Gaussian mixture models. That is, the step "adaptively training a discrimination network with the ambient sound samples to obtain the adaptive discrimination network" may include:
training the multiple preset Gaussian mixture models with the ambient sound samples of the multiple playback scenarios to obtain multiple Gaussian mixture models; computing the pairwise similarity between these Gaussian mixture models; and, if the similarity exceeds a preset threshold, determining one of the two Gaussian mixture models whose similarity exceeds the threshold as a Gaussian mixture model of the adaptive discrimination network.
For example, when the target playback environment includes a first playback scenario and a second playback scenario, the preset Gaussian mixture model may be trained with the ambient sound samples of the first playback scenario to obtain a first Gaussian mixture model, and with the samples of the second scenario to obtain a second Gaussian mixture model; the similarity between the first and second models is computed; if the similarity exceeds a preset threshold, the two models are considered similar, and the first Gaussian mixture model is determined as a Gaussian mixture model of the adaptive discrimination network; if the similarity does not exceed the threshold, both the first and second Gaussian mixture models are determined as Gaussian mixture models of the adaptive discrimination network.
The preset threshold may be set in many ways; for example, it can be flexibly configured according to the needs of the actual application, or preset and stored in the electronic device. In addition, the preset threshold may be built into the electronic device, or saved in a memory and sent to the electronic device, and so on.
There are many ways to compute the similarity between two Gaussian mixture models, for example by computing the distance between them. That is, the step "computing the similarity between the first Gaussian mixture model and the second Gaussian mixture model" may include:
computing the distance between the first and second Gaussian mixture models from the parameters of the models;
in that case, "if the similarity exceeds a preset threshold (a first preset threshold), determining the first Gaussian mixture model as a Gaussian mixture model of the adaptive discrimination network, and otherwise determining both models as Gaussian mixture models of the adaptive discrimination network" may include: if the distance is smaller than a preset threshold (a second preset threshold), determining the first Gaussian mixture model as a Gaussian mixture model of the adaptive discrimination network; if the distance is not smaller than the preset threshold, determining both the first and the second Gaussian mixture models as Gaussian mixture models of the adaptive discrimination network.
For example, after the user trains one scenario, a Gaussian mixture model G1 is obtained; after the user trains a new scenario, a new Gaussian mixture model G2 is obtained. The parameters of a Gaussian mixture model are as follows:
p(x) = Σ_{k=1}^{K} α_k · N(x; μ_k, σ_k)
The above expresses that a Gaussian mixture model is composed of K Gaussian components with parameters μ, σ, and α, where α is the weighting coefficient of the current Gaussian component. Based on the properties of these parameters, a fast distance-estimation method is proposed that can quickly judge the similarity between models; its distance estimation formula appears in the original only as an image, computing a distance D(G1, G2) from the two models' parameters μ, σ, and α together with a maximum distance factor. When the distance value is smaller than the preset threshold, the Gaussian mixture models can be considered similar; the maximum distance factor represents the maximum tolerable distance, taken as the mirror image of the current Gaussian model. With this distance formula, the distance between Gaussian mixture models can be estimated quickly, reducing the number of models.
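Because the exact closed form of this fast distance survives only as an image in the source, the sketch below substitutes a simple symmetric parameter-space distance (weight-averaged nearest-mean matching) purely to show where the merge test sits in the flow; it is an illustrative stand-in, not the patented formula:

    import numpy as np

    def gmm_distance(means1, weights1, means2, weights2):
        """Illustrative stand-in distance between two GMMs, computed from
        their parameters: each component's weight times the distance to
        the nearest component of the other model, symmetrized."""
        d12 = sum(w * np.min(np.linalg.norm(means2 - m, axis=1))
                  for m, w in zip(means1, weights1))
        d21 = sum(w * np.min(np.linalg.norm(means1 - m, axis=1))
                  for m, w in zip(means2, weights2))
        return 0.5 * (d12 + d21)

    def merge_models(models, threshold=1.0):
        """Keep only one model out of any pair whose distance is below
        the preset threshold, reducing the number of models."""
        kept = []
        for m in models:
            if all(gmm_distance(m["means"], m["weights"],
                                k["means"], k["weights"]) >= threshold
                   for k in kept):
                kept.append(m)
        return kept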
Step 103: Determine the foreground sound in the ambient sound according to the result of the audio recognition.
For example, the ambient sound may be classified according to its Mel features to obtain the foreground sound and the background sound in the ambient sound, and the foreground sound is then taken from among them.
Foreground sound may refer to sounds that carry important information, such as conversation or a horn or siren; background sound may refer to sounds the user can ignore, such as ambient white noise or the sound of rain.
Step 104: Classify the foreground sound in the ambient sound to determine the category of the foreground sound.
For example, the classification categories of audio may be obtained; the foreground sound is classified against these categories to obtain its confidence in each category, and the category with the highest confidence is determined as the category of the foreground sound.
Confidence, also known as reliability, confidence level, or confidence coefficient, arises because conclusions drawn from samples about population parameters are always uncertain due to sampling randomness. A probabilistic statement is therefore used — the interval estimation method of mathematical statistics — namely the probability that the estimate and the population parameter fall within a certain allowable margin of error; this probability is called the confidence.
The confidence level is one of the important indicators describing the positional uncertainty of line and area elements in a Geographic Information System (GIS). The confidence level expresses the degree of assurance of an interval estimate; the span of the confidence interval is an increasing function of the confidence level, i.e., the greater the required assurance, the wider the resulting confidence interval, which correspondingly reduces the precision of the estimate.
The larger the confidence interval, the higher the confidence level. Confidence intervals are used only in frequentist statistics; the corresponding concept in Bayesian statistics is the credible interval. Credible intervals and confidence intervals rest on different conceptual foundations, however, so their values generally differ. The confidence interval denotes the computed interval in which the estimate lies; the confidence level denotes the probability that the true value falls within this interval.
The confidence level refers to the probability that the population parameter falls within a certain region of the sample statistic, while the confidence interval refers to the margin of error between the sample statistic and the population parameter at a given confidence level. The larger the confidence interval, the higher the confidence level.
There may be many audio classification categories, such as conversation, music, horn, alarm, and so on. For example, a support vector machine may be trained with the features of an audio training set, and the classification categories of audio are determined from the training result.
There are many ways to classify the foreground sound against these categories, for example by computing the distance between the foreground sound and each category. That is, the step "classifying the foreground sound based on the classification categories to obtain the confidence of the foreground sound for each category" may include:
computing the distance between the Mel features of the foreground sound and each classification category, and determining, from this distance, the probability that the foreground sound belongs to each category;
then, determining the category with the highest confidence as the category of the foreground sound may include: determining the category with the highest probability among the classification categories as the classification category of the foreground sound.
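A minimal sketch of this distance-to-probability step follows; representing each category by the centroid of its training Mel features and turning negative distances into probabilities with a softmax are assumptions made for illustration:

    import numpy as np

    def classify_foreground(mel_vec, category_centroids):
        """category_centroids: dict mapping category name -> mean Mel vector.
        Returns the most probable category and the per-category probabilities."""
        names = list(category_centroids)
        dists = np.array([np.linalg.norm(mel_vec - category_centroids[n])
                          for n in names])
        logits = -dists                      # smaller distance -> higher confidence
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return names[int(np.argmax(probs))], dict(zip(names, probs))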
Step 105: Mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
For example, a mixing mode may be determined according to the category of the foreground sound, and the foreground sound is mixed with the audio in the determined mixing mode to obtain the mixed playback sound. For instance, during mixing, the input is divided into two parts, the ambient sound input EnvInput and the audio input VideoInput, and the output is Output. In the mixing stage, linear superposition is used, with the following formula:
Output = a*EnvInput + b*VideoInput
where a and b are superposition coefficients, and different coefficients can be used for different categories.
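A direct sketch of this linear superposition is given below; only the form Output = a*EnvInput + b*VideoInput comes from the formula above, while the per-category coefficient values are invented placeholders, since the application leaves them configurable:

    import numpy as np

    # Hypothetical coefficient table: (a, b) per foreground-sound category.
    MIX_COEFFS = {
        "conversation": (0.7, 0.3),
        "music":        (0.4, 0.6),
        "alert":        (0.9, 0.1),
    }

    def mix(env_input, video_input, category):
        """Output = a*EnvInput + b*VideoInput with category-specific a, b."""
        a, b = MIX_COEFFS.get(category, (0.5, 0.5))
        n = min(len(env_input), len(video_input))   # align block lengths
        return a * np.asarray(env_input[:n]) + b * np.asarray(video_input[:n])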
As can be seen from the above, this embodiment can obtain the current playback environment of the audio; if the current playback environment is in a foreground state, perform audio recognition on the ambient sound of the current playback environment; determine the foreground sound in the ambient sound according to the result of the audio recognition; classify the foreground sound in the ambient sound to determine its category; and mix the foreground sound with the audio based on that category to obtain a mixed playback sound. Because this solution captures the ambient sound during audio playback, infers the current playback state from it, and mixes the currently played audio according to that state, it can effectively improve the flexibility of audio playback and enable a user wearing headphones to stay aware of the surrounding environment at all times, obtaining a safer and more convenient listening experience.
Based on the method described in the previous embodiment, further detailed examples are given below.
In this embodiment, the description takes as an example the audio processing apparatus being specifically integrated in an electronic device.
(I) First, the discrimination network needs to be trained; as shown in FIG. 2a, this may specifically include the following steps.
(1) The electronic device obtains ambient sound samples of the target playback environment.
For example, the user may set a classroom, a road, and the like as target playback environments that require audio processing. For instance, a microphone may be used to collect ambient sound samples of the target playback environment, and the collected samples are sent to the electronic device for further processing.
(2) The electronic device adaptively trains the discrimination network with the ambient sound samples to obtain the adaptive discrimination network.
For example, to reduce the complexity of the problem and the amount of computation, the electronic device may extract Mel-frequency cepstral coefficient features from the ambient sound samples to obtain their Mel features; then, based on these Mel features, determine the foreground and background sound samples in the ambient sound through interaction with the user; and then adaptively train the discrimination network with the Mel features and the classification results to obtain the adaptive discrimination network. For instance, a Gaussian mixture model with 5 Gaussian components may first be initialized; sampling is performed within the sampling window T, MFCC features are extracted, and user interaction is started to obtain the current feature label: foreground or background. The extracted MFCC features are fed into the Gaussian mixture model for parameter estimation, which is performed with the EM algorithm.
The discrimination network may include a preset Gaussian mixture model. For example, the Mel features of the ambient sound samples may be used to estimate the parameters of the preset Gaussian mixture model with the Expectation-Maximization algorithm; the ground-truth classification results obtained from user interaction are acquired, and the estimated parameters are adjusted according to these ground-truth values until the likelihood of the preset Gaussian mixture model converges, yielding the adaptive discrimination network.
After training, the adaptive discrimination network can be verified: for example, accept an ambient sound sample as input, feed it into the Gaussian mixture model, and check whether the judgment is accurate. If the user's input confirms accuracy, training ends; if it fails, the ambient sound continues to be sampled, Mel-frequency cepstral coefficient features are extracted from the newly sampled ambient sound samples, and subsequent training steps such as parameter estimation continue.
Because the user may input many scenarios, some of which may overlap heavily, each new scenario trained by the user produces a new Gaussian mixture model, and too many models waste space. Therefore, to reduce wasted space and the number of models, model merging is proposed so that the models are trained more finely. For example, the target playback environment may include multiple playback scenarios and the discrimination network may include multiple preset Gaussian mixture models; the preset Gaussian mixture models may be trained with the ambient sound samples of the multiple playback scenarios to obtain multiple Gaussian mixture models; the pairwise similarity between these models is computed; and, if the similarity exceeds a preset threshold (a first preset threshold), one of the two models whose similarity exceeds the threshold is determined as a Gaussian mixture model of the adaptive discrimination network. Of course, models may also be merged in other ways.
For example, when the target playback environment includes a first playback scenario and a second playback scenario, the preset Gaussian mixture model may be trained with the ambient sound samples of the first scenario to obtain a first Gaussian mixture model, and with the samples of the second scenario to obtain a second Gaussian mixture model; the distance between the first and second models is then computed from the models' parameters; if the distance is smaller than a preset threshold (a second preset threshold), the first Gaussian mixture model is determined as a Gaussian mixture model of the adaptive discrimination network; if the distance is not smaller than the preset threshold, both the first and the second Gaussian mixture models are determined as Gaussian mixture models of the adaptive discrimination network.
For example, the preset threshold (second preset threshold) may here be set to 1. When the distance value is smaller than 1, the two Gaussian mixture models are considered similar, and one of the two may serve as the Gaussian mixture model for both playback scenarios; that is, the first Gaussian mixture model is determined as a Gaussian mixture model of the adaptive discrimination network, where the first Gaussian mixture model refers to any one of the multiple similar Gaussian mixture models.
For example, as shown in FIG. 2b, after the user trains one scenario, a Gaussian mixture model G1 is obtained; after the user trains a new scenario, a new Gaussian mixture model G2 is obtained. The parameters of a Gaussian mixture model are as follows:
p(x) = Σ_{k=1}^{K} α_k · N(x; μ_k, σ_k)
The above expresses that a Gaussian mixture model is composed of K Gaussian components with parameters μ, σ, and α, where α is the weighting coefficient of the current Gaussian component. Based on the properties of these parameters, a fast distance-estimation method is proposed that can quickly judge the similarity between models; its distance estimation formula appears in the original only as an image, computing a distance between the two models from their parameters μ, σ, and α together with a maximum distance factor. When the distance value is smaller than the preset threshold, the Gaussian mixture models can be considered similar; the maximum distance factor represents the maximum tolerable distance, taken as the mirror image of the current Gaussian model, and a specific value for it (given only as an image in the original) may be chosen. With this distance formula, the distance between Gaussian mixture models can be estimated quickly, reducing the number of models.
(II) With the trained adaptive discrimination network, audio processing can then proceed; see FIG. 2c, FIG. 2d, and FIG. 2e for details.
As shown in FIG. 2c, an audio processing method may specifically include the following steps.
Step 201: The electronic device obtains the current playback environment of the audio.
For example, when watching a video while wearing headphones, the user may choose to enable the audio processing mode in the player of the electronic device. After receiving the user's instruction to enable the audio processing mode, the electronic device asks, according to the instruction, whether the user agrees to grant microphone permission. Once the user grants the permission, the electronic device can use the microphone to capture the ambient sound of the current playback environment during audio playback and then determine the user's current environment from that ambient sound.
When discriminating the user's current playback environment, a traditional algorithm would continuously detect and discriminate the current ambient sound; continuous detection incurs a huge performance cost, and given the diversity of the user's environments it also poses a great challenge to recognition accuracy. This embodiment therefore proposes an adaptive dynamic discrimination method that incorporates interactive user feedback and uses Gaussian mixture models to dynamically update the foreground/background discrimination network, adapting to the user's different playback environments.
For example, before enabling the audio processing mode, the user may set the foreground state; for instance, the user presets the scenarios that require mixing as classroom, road, and the like.
Step 202: If the current playback environment is in a foreground state, the electronic device samples the ambient sound of the current playback environment.
For example, the electronic device may determine whether the current playback environment is in a foreground state or a background state. If it is in a foreground state, the electronic device may sample the ambient sound of the current playback environment; for instance, a sampling window T may be set, where T is the time required for sampling, and T may be 1 second.
Step 203: The electronic device extracts Mel-frequency cepstral coefficient features from the sampled ambient sound to obtain the Mel features of the ambient sound.
For example, so that the audio features of the ambient sound better match the auditory characteristics of the human ear while retaining good recognition performance when the signal-to-noise ratio drops, the electronic device may extract the Mel-frequency cepstral coefficient features of the ambient sound to obtain its Mel features.
Step 204: The electronic device performs audio recognition on the Mel features of the ambient sound using the adaptive discrimination network.
For example, the electronic device may feed the Mel features of the ambient sound into the trained adaptive discrimination network and use it to perform audio recognition, identifying the foreground sound and the background sound in the ambient sound.
Step 205: The electronic device determines the foreground sound in the ambient sound according to the result of the audio recognition.
For example, the electronic device may classify the ambient sound according to its Mel features to determine the foreground and background sounds in it, and then select the foreground sound from the determined foreground and background sounds.
Step 206: The electronic device classifies the foreground sound in the ambient sound to determine the category of the foreground sound.
For example, the electronic device may obtain the classification categories of audio, classify the foreground sound against these categories to obtain the confidence of the foreground sound in each category, and determine the category with the highest confidence as the category of the foreground sound.
For instance, a classification algorithm based on a support vector machine (SVM) may be used, trained on the Youtube-8K training set, yielding audio classification categories such as conversation, music, and siren.
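A minimal sketch of such a classifier, assuming scikit-learn and already-extracted Mel feature vectors (the label names mirror the categories just listed; everything else is an illustrative choice):

    from sklearn.svm import SVC

    def train_audio_classifier(features, labels):
        """features: (n_samples, n_dims) Mel feature vectors;
        labels: e.g. "conversation" / "music" / "siren".
        probability=True enables per-category confidence estimates."""
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(features, labels)
        return clf

    # Usage sketch: per-category probabilities align with clf.classes_.
    # probs = clf.predict_proba(new_feature.reshape(1, -1))[0]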
For example, the distance between the Mel features of the foreground sound and each classification category may be computed; from this distance, the probability that the foreground sound belongs to each category is determined, and the category with the highest probability is determined as the classification category of the foreground sound.
Step 207: The electronic device mixes the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
For example, a mixing mode may be determined according to the category of the foreground sound, and the foreground sound is mixed with the audio in the determined mixing mode to obtain the mixed playback sound. For instance, during mixing, the input is divided into two parts, the ambient sound input EnvInput and the audio input VideoInput, and the output is Output. In the mixing stage, linear superposition is used, with the following formula:
Output = a*EnvInput + b*VideoInput
where a and b are superposition coefficients, and different coefficients can be used for different categories. Specifically, they may be set as follows:
[Table image in the original: the superposition coefficients a and b configured for each category — conversation, music, and alert.]
For example, when the foreground sound is conversation, the "conversation" mixing mode may be used to mix the foreground sound with the audio; when the foreground sound is music, the "music" mixing mode may be used; and when the foreground sound is a siren, the "alert" mixing mode may be used.
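Tying steps 201 through 207 together, an end-to-end loop might read as follows; it reuses the hypothetical helpers sketched earlier (sample_mel_features, is_foreground, classify_foreground, mix), so every name here is an assumption rather than an interface defined by this application:

    def playback_loop(mic, player, fg_model, bg_model, centroids):
        """One pass per sampling window T: discriminate, classify, mix."""
        while player.is_playing():
            window = mic.read_window()                    # step 202 (assumed accessor)
            mel = sample_mel_features(window)             # step 203
            feats = mel.T                                 # frames as rows
            if is_foreground(fg_model, bg_model, feats):  # steps 204-205
                category, _ = classify_foreground(feats.mean(axis=0), centroids)  # step 206
                player.output(mix(window, player.next_block(), category))         # step 207
            else:
                player.output(player.next_block())        # background state: audio unchanged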
As can be seen from the above, this embodiment can obtain the current playback environment of the audio; if the current playback environment is in a foreground state, perform audio recognition on the ambient sound of the current playback environment; determine the foreground sound in the ambient sound according to the result of the audio recognition; classify the foreground sound to determine its category; and mix the foreground sound with the audio based on that category to obtain a mixed playback sound. Because this solution can use the electronic device to capture the ambient sound during audio playback, infer the current playback state from it, and mix the currently played audio according to that state, it can effectively improve the flexibility of audio playback and enable a user wearing headphones to stay aware of the surrounding environment at all times, obtaining a safer and more convenient listening experience. This solution is applied in the player of an electronic device: when the player enables the audio processing mode and the user is wearing headphones to watch a video or listen to music or radio, the user's current playback environment can be obtained and, following the method in this solution, the player decides when to wake up and which mixing strategy to use, so that the user can conveniently receive audio information from the external environment, improving the viewing experience and keeping the user aware of the external environment even while focused on watching a video.
To better implement the above method, an embodiment of this application correspondingly further provides an audio processing apparatus, which may be specifically integrated in an electronic device; the electronic device may be a server, a terminal, or another device.
For example, as shown in FIG. 3, the audio processing apparatus may include an obtaining unit 301, a recognition unit 302, a determining unit 303, a classification unit 304, and a mixing unit 305.
The obtaining unit 301 is configured to obtain the current playback environment of the audio.
For example, after receiving the user's instruction to enable the audio processing mode, the obtaining unit 301 may obtain, according to the instruction, the playback environment in which the audio is currently being played.
For example, when the user is wearing headphones to watch a video or listen to music or radio, the user may enable the audio processing mode. According to the user's enabling instruction, the obtaining unit 301 asks whether the user agrees to grant microphone permission; after the user grants it, the obtaining unit 301 can use the microphone to capture the ambient sound of the current playback environment during audio playback and then determine the user's current environment, i.e., the current playback environment, from that ambient sound.
The current playback environment may include scenarios such as a classroom, a campus, a sports field, a road, an office, a café, a park, a construction site, a library, and so on. In this embodiment of the application, the user may set the scenarios that require audio processing; these may be set in many ways, for example flexibly configured according to actual needs, or preset and stored in the electronic device.
The recognition unit 302 is configured to perform audio recognition on the ambient sound of the current playback environment if the current playback environment is in a foreground state.
In some embodiments, the recognition unit 302 may be specifically configured to: if the current playback environment is in a foreground state, sample the ambient sound of the current playback environment; extract Mel-frequency cepstral coefficient features from the sampled ambient sound to obtain the Mel features of the ambient sound; and perform audio recognition on the Mel features of the ambient sound using an adaptive discrimination network.
In some embodiments, the audio processing apparatus may further include a training unit configured to obtain ambient sound samples of a target playback environment and to adaptively train a discrimination network with the samples to obtain the adaptive discrimination network.
In some embodiments, the training unit may include an extraction subunit and a training subunit.
The extraction subunit is configured to extract Mel-frequency cepstral coefficient features from the ambient sound samples to obtain their Mel features, and to classify the samples according to these Mel features to obtain the classification results of the ambient sound samples.
The training subunit is configured to adaptively train the discrimination network with the Mel features and classification results of the ambient sound samples to obtain the adaptive discrimination network.
In some embodiments, the discrimination network includes a preset Gaussian mixture model, and the training subunit may be specifically configured to estimate the parameters of the preset Gaussian mixture model with the Mel features of the ambient sound samples, and to adjust the estimated parameters according to the ground-truth classification results of the samples until the preset Gaussian mixture model converges, thereby obtaining the adaptive discrimination network.
In some embodiments, the target playback environment includes multiple playback scenarios and the discrimination network includes multiple preset Gaussian mixture models; the training unit may be specifically configured to train the multiple preset Gaussian mixture models with the ambient sound samples of the multiple playback scenarios to obtain multiple Gaussian mixture models; compute the pairwise similarity between the models; and, if the similarity exceeds a preset threshold, determine one of the two models whose similarity exceeds the threshold as a Gaussian mixture model of the adaptive discrimination network.
For example, when the target playback environment includes a first playback scenario and a second playback scenario, the preset Gaussian mixture model may be trained with the ambient sound samples of the first scenario to obtain a first Gaussian mixture model, and with the samples of the second scenario to obtain a second Gaussian mixture model; the similarity between the two models is computed; if the similarity exceeds a preset threshold, the first Gaussian mixture model is determined as a Gaussian mixture model of the adaptive discrimination network; otherwise, both the first and second Gaussian mixture models are so determined.
The determining unit 303 is configured to determine the foreground sound in the ambient sound according to the result of the audio recognition.
In some embodiments, the determining unit 303 may be specifically configured to classify the ambient sound according to its Mel features to obtain the foreground and background sounds in the ambient sound, and to take the foreground sound from among them.
The classification unit 304 is configured to classify the foreground sound in the ambient sound to determine the category of the foreground sound.
In some embodiments, the classification unit 304 may include a classification subunit and a determining subunit.
The classification subunit is configured to obtain the classification categories of audio and to classify the foreground sound against these categories to obtain the confidence of the foreground sound in each category.
The determining subunit is configured to determine the category with the highest confidence as the category of the foreground sound.
In some embodiments, the classification subunit may be specifically configured to compute the distance between the Mel features of the foreground sound and each classification category and to determine from this distance the probability that the foreground sound belongs to each category; the determining subunit may then be specifically configured to determine the category with the highest probability as the classification category of the foreground sound.
The mixing unit 305 is configured to mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
In some embodiments, the mixing unit 305 may be specifically configured to determine a mixing mode according to the category of the foreground sound and to mix the foreground sound with the audio in the determined mode to obtain the mixed playback sound.
In specific implementations, the above units may be implemented as independent entities, or combined arbitrarily and implemented as one or several entities; for the specific implementation of each unit, refer to the foregoing method embodiments, which are not repeated here.
As can be seen from the above, in this embodiment, the obtaining unit 301 obtains the current playback environment of the audio; if the current playback environment is in a foreground state, the recognition unit 302 performs audio recognition on the ambient sound of the current playback environment; the determining unit 303 then determines the foreground sound in the ambient sound according to the result of the audio recognition; the classification unit 304 classifies the foreground sound to determine its category; and the mixing unit 305 mixes the foreground sound with the audio based on that category to obtain a mixed playback sound. Because this solution can use environment sensing to capture the ambient sound during audio playback, infer the current playback state from it, and mix the currently played audio according to that state, it can effectively improve the flexibility of audio playback and enable a user wearing headphones to stay aware of the surrounding environment at all times, obtaining a safer and more convenient listening experience.
In addition, an embodiment of this application further provides an electronic device. FIG. 4 shows a schematic structural diagram of the electronic device involved in this embodiment of the application. Specifically:
The electronic device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. A person skilled in the art will understand that the electronic device structure shown in FIG. 4 does not constitute a limitation on the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The processor 401 is the control center of the electronic device. It connects all parts of the electronic device through various interfaces and lines, and performs the various functions of the device and processes data by running or executing the software programs and/or modules stored in the memory 402 and invoking the data stored in the memory 402, thereby monitoring the electronic device as a whole. In this embodiment of the application, the processor 401 may include one or more processing cores; the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be understood that the modem processor may alternatively not be integrated into the processor 401.
The memory 402 may be configured to store software programs and modules; the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area: the program storage area may store the operating system, applications required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created during use of the electronic device, and the like. In addition, the memory 402 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device or flash memory device, or another solid-state storage device. Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes the power supply 403 that supplies power to each component. In this embodiment of the application, the power supply 403 may be logically connected to the processor 401 through a power management system, so that functions such as charging, discharging, and power consumption management are handled by the power management system. The power supply 403 may further include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, or any combination of such components.
The electronic device may further include the input unit 404, which may be configured to receive input digital or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail here. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby implementing various functions, as follows:
obtaining the current playback environment of the audio; if the current playback environment is in a foreground state, performing audio recognition on the ambient sound of the current playback environment; then determining the foreground sound in the ambient sound according to the result of the audio recognition; next, classifying the foreground sound in the ambient sound to determine the category of the foreground sound; and then mixing the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
For the specific implementation of each of the above operations, refer to the foregoing embodiments, which are not repeated here.
As can be seen from the above, this embodiment can obtain the current playback environment of the audio; if the current playback environment is in a foreground state, perform audio recognition on the ambient sound of the current playback environment; then determine the foreground sound in the ambient sound according to the result of the audio recognition; next, classify the foreground sound in the ambient sound to determine its category; and then mix the foreground sound with the audio based on that category to obtain a mixed playback sound. Because this solution can use environment sensing to capture the ambient sound during audio playback, infer the current playback state from it, and mix the currently played audio according to that state, it can effectively improve the flexibility of audio playback and enable a user wearing headphones to stay aware of the surrounding environment at all times, obtaining a safer and more convenient listening experience.
A person of ordinary skill in the art will understand that all or some of the steps in the various methods of the above embodiments can be completed by instructions, or by instructions controlling relevant hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of this application further provides a computer-readable storage medium storing a plurality of instructions that can be loaded by a processor to perform the steps of any audio processing method provided in the embodiments of this application. For example, the instructions can perform the following steps:
obtaining the current playback environment of the audio; if the current playback environment is in a foreground state, performing audio recognition on the ambient sound of the current playback environment; then determining the foreground sound in the ambient sound according to the result of the audio recognition; next, classifying the foreground sound in the ambient sound to determine the category of the foreground sound; and then mixing the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
For the specific implementation of each of the above operations, refer to the foregoing embodiments, which are not repeated here.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Because the instructions stored in the storage medium can perform the steps of any audio processing method provided in the embodiments of this application, they can achieve the beneficial effects achievable by any such method; for details, refer to the foregoing embodiments, which are not repeated here.
The audio processing method, apparatus, electronic device, and storage medium provided in the embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the descriptions of the above embodiments are merely intended to help understand the method of this application and its core ideas. Meanwhile, a person skilled in the art may make changes to the specific implementations and the application scope according to the ideas of this application. In conclusion, the content of this specification should not be construed as a limitation on this application.

Claims (13)

  1. An audio processing method, performed by an electronic device, comprising:
    obtaining the current playback environment of the audio;
    if the current playback environment is in a foreground state, performing audio recognition on the ambient sound of the current playback environment;
    determining the foreground sound in the ambient sound according to the result of the audio recognition;
    classifying the foreground sound in the ambient sound to determine the category of the foreground sound; and
    mixing the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
  2. The method according to claim 1, wherein, if the current playback environment is in a foreground state, performing audio recognition on the ambient sound of the current playback environment comprises:
    if the current playback environment is in a foreground state, sampling the ambient sound of the current playback environment;
    extracting Mel-frequency cepstral coefficient features from the sampled ambient sound to obtain the Mel features of the ambient sound; and
    performing audio recognition on the Mel features of the ambient sound using an adaptive discrimination network.
  3. The method according to claim 2, wherein determining the foreground sound in the ambient sound according to the result of the audio recognition comprises:
    classifying the ambient sound according to the Mel features of the ambient sound to obtain the foreground sound and the background sound in the ambient sound; and
    obtaining the foreground sound in the ambient sound from the foreground sound and the background sound in the ambient sound.
  4. The method according to claim 2, wherein classifying the foreground sound in the ambient sound to determine the category of the foreground sound comprises:
    obtaining classification categories of audio;
    classifying the foreground sound based on the classification categories to obtain the confidence of the foreground sound in each classification category; and
    determining the classification category with the highest confidence as the category of the foreground sound.
  5. The method according to claim 4, wherein classifying the foreground sound based on the classification categories to obtain the confidence of the foreground sound for each category comprises:
    computing the distance between the Mel features of the foreground sound and each classification category, and determining, according to the distance, the probability that the foreground sound belongs to each classification category;
    and determining the category with the highest confidence as the category of the foreground sound comprises: determining the category with the highest probability among the classification categories as the classification category of the foreground sound.
  6. The method according to any one of claims 2 to 5, further comprising, before performing audio recognition on the Mel features of the ambient sound using the adaptive discrimination network:
    obtaining ambient sound samples of a target playback environment; and
    adaptively training a discrimination network with the ambient sound samples to obtain the adaptive discrimination network.
  7. The method according to claim 6, wherein adaptively training a discrimination network with the ambient sound samples to obtain the adaptive discrimination network comprises:
    extracting Mel-frequency cepstral coefficient features from the ambient sound samples to obtain the Mel features of the ambient sound samples;
    classifying the ambient sound samples according to their Mel features to obtain classification results of the ambient sound samples; and
    adaptively training the discrimination network with the Mel features and the classification results of the ambient sound samples to obtain the adaptive discrimination network.
  8. The method according to claim 7, wherein the discrimination network comprises a preset Gaussian mixture model, and adaptively training the discrimination network with the Mel features and the classification results of the ambient sound samples to obtain the adaptive discrimination network comprises:
    performing parameter estimation on the preset Gaussian mixture model with the Mel features of the ambient sound samples; and
    adjusting the estimated parameters according to the ground-truth classification results of the ambient sound samples until the preset Gaussian mixture model converges, to obtain the adaptive discrimination network.
  9. The method according to claim 6, wherein the target playback environment comprises multiple playback scenarios, and the discrimination network comprises multiple preset Gaussian mixture models,
    and adaptively training a discrimination network with the ambient sound samples to obtain the adaptive discrimination network comprises:
    training the multiple preset Gaussian mixture models with the ambient sound samples of the multiple playback scenarios to obtain multiple Gaussian mixture models;
    computing the pairwise similarity between the multiple Gaussian mixture models; and
    if the similarity exceeds a preset threshold, determining one of the two Gaussian mixture models whose similarity exceeds the preset threshold as a Gaussian mixture model of the adaptive discrimination network.
  10. The method according to claim 1, wherein mixing the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound comprises:
    determining a mixing mode according to the category of the foreground sound; and
    mixing the foreground sound with the audio in the determined mixing mode to obtain the mixed playback sound.
  11. An audio processing apparatus, comprising:
    an obtaining unit, configured to obtain the current playback environment of the audio;
    a recognition unit, configured to perform audio recognition on the ambient sound of the current playback environment if the current playback environment is in a foreground state;
    a determining unit, configured to determine the foreground sound in the ambient sound according to the result of the audio recognition;
    a classification unit, configured to classify the foreground sound in the ambient sound to determine the category of the foreground sound; and
    a mixing unit, configured to mix the foreground sound with the audio based on the category of the foreground sound to obtain a mixed playback sound.
  12. A computer-readable storage medium storing a plurality of instructions, the instructions being suitable for loading by a processor to perform the steps of the audio processing method according to any one of claims 1 to 10.
  13. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 10.
PCT/CN2020/116711 2019-12-11 2020-09-22 Audio processing method and apparatus, electronic device, and storage medium WO2021114808A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/527,935 US11948597B2 (en) 2019-12-11 2021-11-16 Audio processing method and apparatus, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911267593.7 2019-12-11
CN201911267593.7A CN110930987B (zh) 2019-12-11 2019-12-11 Audio processing method, apparatus, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/527,935 Continuation US11948597B2 (en) 2019-12-11 2021-11-16 Audio processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021114808A1 true WO2021114808A1 (zh) 2021-06-17

Family

ID=69860032

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116711 WO2021114808A1 (zh) 2019-12-11 2020-09-22 音频处理方法、装置、电子设备和存储介质

Country Status (3)

Country Link
US (1) US11948597B2 (zh)
CN (1) CN110930987B (zh)
WO (1) WO2021114808A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930987B (zh) 2019-12-11 2021-01-08 腾讯科技(深圳)有限公司 Audio processing method, apparatus, and storage medium
CN113539279A (zh) * 2020-04-16 2021-10-22 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, and computer-readable storage medium
CN111583950B (zh) * 2020-04-21 2024-05-03 珠海格力电器股份有限公司 Audio processing method and apparatus, electronic device, and storage medium
CN114722884B (zh) * 2022-06-08 2022-09-30 深圳市润东来科技有限公司 Ambient-sound-based audio control method, apparatus, device, and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005295175A (ja) * 2004-03-31 2005-10-20 Jpix:Kk Headphone device
CN1897054A (zh) * 2005-07-14 2007-01-17 松下电器产业株式会社 Transmission apparatus and method capable of issuing an alarm according to the type of sound
CN103971680A (zh) * 2013-01-24 2014-08-06 华为终端有限公司 Speech recognition method and apparatus
KR101647974B1 (ko) * 2015-03-30 2016-08-16 주식회사 이드웨어 Smart earphone with a smart mixing module, device with a smart mixing module, and method and system for mixing external sound and device sound
CN205864671U (zh) * 2016-06-17 2017-01-04 万魔声学科技有限公司 Earphone
US20170156006A1 (en) * 2015-11-16 2017-06-01 Tv Ears, Inc. Headphone audio and ambient sound mixer
CN107613113A (zh) * 2017-09-05 2018-01-19 深圳天珑无线科技有限公司 Headset mode control method and apparatus, and computer-readable storage medium
CN108475502A (zh) * 2015-12-30 2018-08-31 美商楼氏电子有限公司 Voice-enhanced awareness mode
CN110930987A (zh) * 2019-12-11 2020-03-27 腾讯科技(深圳)有限公司 Audio processing method, apparatus, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010046304A1 (en) * 2000-04-24 2001-11-29 Rast Rodger H. System and method for selective control of acoustic isolation in headsets
CN101404160B (zh) * 2008-11-21 2011-05-04 北京科技大学 Speech noise reduction method based on audio recognition
EP3324406A1 (en) * 2016-11-17 2018-05-23 Fraunhofer Gesellschaft zur Förderung der Angewand Apparatus and method for decomposing an audio signal using a variable threshold
CN107103901B (zh) * 2017-04-03 2019-12-24 浙江诺尔康神经电子科技股份有限公司 Cochlear implant sound scene recognition system and method
CN108764304B (zh) * 2018-05-11 2020-03-06 Oppo广东移动通信有限公司 Scene recognition method and apparatus, storage medium, and electronic device


Also Published As

Publication number Publication date
US11948597B2 (en) 2024-04-02
CN110930987B (zh) 2021-01-08
US20220076692A1 (en) 2022-03-10
CN110930987A (zh) 2020-03-27

Similar Documents

Publication Publication Date Title
WO2021114808A1 (zh) Audio processing method and apparatus, electronic device, and storage medium
CN109166593B (zh) Audio data processing method, apparatus, and storage medium
CN111179961B (zh) Audio signal processing method and apparatus, electronic device, and storage medium
US20210217433A1 (en) Voice processing method and apparatus, and device
CN111179962B (zh) Training method for a speech separation model, and speech separation method and apparatus
CN107799126A (zh) Voice endpoint detection method and apparatus based on supervised machine learning
WO2020177190A1 (zh) Processing method, apparatus, and device
CN106164845A (zh) Attention-based dynamic audio level adjustment
JP7086521B2 (ja) Information processing method and information processing apparatus
CN108922525B (zh) Speech processing method and apparatus, storage medium, and electronic device
WO2021114847A1 (zh) Network call method and apparatus, computer device, and storage medium
CN110277106B (zh) Audio quality determination method, apparatus, device, and storage medium
US20230252964A1 (en) Method and apparatus for determining volume adjustment ratio information, device, and storage medium
US11511200B2 (en) Game playing method and system based on a multimedia file
CN109361995B (zh) Volume adjustment method and apparatus for an electrical appliance, electrical appliance, and medium
CN111785238A (zh) Audio calibration method, apparatus, and storage medium
US20150254054A1 (en) Audio Signal Processing
CN112667844A (zh) Method, apparatus, device, and storage medium for retrieving audio
JP6856115B2 (ja) Information processing method and information processing apparatus
CN113301372A (zh) Live streaming method and apparatus, terminal, and storage medium
CN111696566B (zh) Speech processing method, apparatus, and medium
WO2020154916A1 (zh) Video subtitle synthesis method and apparatus, storage medium, and electronic device
WO2020154883A1 (zh) Voice information processing method and apparatus, storage medium, and electronic device
CN112382296A (zh) Method and apparatus for voiceprint remote control of wireless audio devices
CN110489572B (zh) Multimedia data processing method and apparatus, terminal, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20899325

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20899325

Country of ref document: EP

Kind code of ref document: A1