CN114067782A - Audio recognition method and device, medium and chip system thereof - Google Patents

Audio recognition method and device, medium and chip system thereof

Info

Publication number
CN114067782A
Authority
CN
China
Prior art keywords
audio
frequency band
frequency
band range
identified
Prior art date
Legal status
Pending
Application number
CN202010759752.1A
Other languages
Chinese (zh)
Inventor
杨舒
张柏雄
吴义镇
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010759752.1A
Publication of CN114067782A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K - SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K 11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K 11/16 - Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Abstract

The present application relates to an audio recognition method and an apparatus, medium, and chip system thereof, and relates to speech recognition technology in the field of Artificial Intelligence (AI). The audio recognition method includes: acquiring audio to be identified; separating, by a linear predictor, a low-frequency part representing vocal tract characteristics and a high-frequency harmonic part representing sound source characteristics from the audio to be identified; and identifying the audio to determine the type of the audio to be identified based on at least one of a first audio feature extracted from the low-frequency part and a second audio feature extracted from the high-frequency harmonic part. After the high-frequency harmonic part and the low-frequency part are separated, different algorithms can be used to extract audio features from the two parts, which can improve the accuracy of audio recognition.

Description

Audio recognition method and device, medium and chip system thereof
Technical Field
The present application relates to the field of speech recognition, and in particular, to an audio recognition method, an audio recognition device, an audio recognition medium, and a chip system.
Background
With the rapid development of the Internet and information technology, people's living standards continue to rise, and so do expectations for the quality of life and work. Audio, as a medium in people's daily life and work, greatly influences everyday behavior. Audio carries extraordinarily rich information, such as the environment, the language or dialect, and emotion. Audio processing extracts useful audio information from a complex speech environment; by analyzing the audio information extracted from the audio, the type of noise in the environment corresponding to the audio can be distinguished, and the sounds of people or objects in the audio can be identified (voiceprint recognition), among other tasks.
Taking voiceprint recognition as an example, a voiceprint is a sound feature that uniquely identifies a person or an object; it is the sound-wave spectrum, displayed by an electro-acoustic instrument, that carries the sound information. Voiceprint recognition technology automatically identifies the attributes and category of a sound-producing device based on the physiological and physical characteristics it exhibits. Voiceprint recognition generally consists of three parts: audio preprocessing, sound feature parameter extraction, and voiceprint model training and decision. Audio feature extraction is one of the key parts of voiceprint recognition; its purpose is to extract feature parameters that reflect the characteristics of the sound, and the choice of feature parameters directly affects the overall performance of voiceprint recognition. Ideally, the chosen sound feature parameters have the largest inter-class distance and the smallest intra-class distance. A commonly used acoustic feature parameter in the field of voiceprint recognition is the Mel-Frequency Cepstral Coefficient (MFCC).
Disclosure of Invention
The embodiment of the application provides an audio identification method and device, medium and chip system thereof, so as to improve the accuracy of audio identification.
A first aspect of the present application provides an audio recognition method, including: acquiring audio to be identified; separating a first frequency band range part and a second frequency band range part from the audio to be identified through a linear predictor, wherein the frequency of the frequency band contained in the first frequency band range part is lower than the frequency of the frequency band contained in the second frequency band range part; the audio is identified to determine a type of the audio to be identified based on at least one of a first audio feature extracted from the first frequency band range portion and a second audio feature extracted from the second frequency band range portion.
In this method, the low-frequency part, or vocal tract signal (the first frequency band range portion), which represents the vocal tract characteristics, and the high-frequency harmonic part, or sound source signal (the second frequency band range portion), which represents the sound source characteristics, are separated from the audio, and audio features are extracted from each part separately. This avoids interference by the high-frequency harmonic part with audio feature extraction algorithms that simulate the perception of the human cochlea, such as MFCC (Mel-frequency cepstral coefficients), thereby improving the accuracy of audio recognition.
In a possible implementation of the first aspect, the second audio feature is extracted from the second frequency range portion by wavelet transform, where the second audio feature is a time-frequency feature obtained by wavelet transform.
In one possible implementation of the first aspect, the first frequency range portion characterizes a vocal tract of a sound generating object that emits audio to be recognized, and the second frequency range portion characterizes a sound source of the sound generating object.
In one possible implementation of the first aspect, the separating, by the linear predictor, the first frequency band range portion and the second frequency band range portion from the audio to be identified includes: and separating a first frequency range part from the audio to be identified through a linear predictor, and taking the rest part of the audio to be identified after the first frequency range part is separated as a second frequency range part.
In a possible implementation of the first aspect, the method further includes: and extracting a first audio characteristic from the first frequency band range part by an audio characteristic extraction algorithm simulating the human ear cochlea perception capability.
In one possible implementation of the first aspect described above, the audio feature extraction algorithm that simulates human ear-cochlea perception capability is a mel-frequency cepstrum coefficient MFCC extraction method, and the first audio feature is a mel-frequency cepstrum coefficient MFCC.
In one possible implementation of the first aspect, identifying the audio to determine the type of the audio to be identified based on at least one of a first audio feature extracted from the first frequency band range portion and a second audio feature extracted from the second frequency band range portion includes:
and matching the first audio characteristic or the second audio characteristic of the audio to be identified with the first audio characteristic corresponding to the first audio type, and determining that the type of the audio to be identified is the first audio type when the matching degree is greater than a first matching degree threshold value. That is, one of the first audio feature and the second audio feature is used for audio recognition, for example, by matching the first audio feature or the second audio feature of the audio to be recognized with an audio feature of a known audio type, it is determined whether the type of the audio to be recognized is the known audio type.
In one possible implementation of the first aspect, identifying the audio to determine the type of the audio to be identified based on at least one of a first audio feature extracted from the first frequency band range portion and a second audio feature extracted from the second frequency band range portion includes:
and fusing the first audio features and the second audio features to obtain fused audio features, matching the fused audio features with second audio features corresponding to a second audio type, and determining that the type of the audio to be identified is the second audio type when the matching degree is greater than a second matching degree threshold value.
That is, the first audio feature and the second audio feature are used together for audio recognition by way of fusion. For example, when the first audio feature and the second audio feature are an MFCC feature parameter and a time-frequency feature parameter, respectively, the two may be linearly fused into one feature vector; alternatively, the two may be normalized and then linearly fused, or weighted and then linearly fused, into a feature vector. A feature value corresponding to the feature vector is calculated, and when the difference between the calculated feature value and the feature value corresponding to the second audio type is greater than a second matching degree threshold, the type of the audio to be identified is determined to be the second audio type.
In one possible implementation of the first aspect, identifying the audio to determine the type of the audio to be identified based on at least one of a first audio feature extracted from the first frequency band range portion and a second audio feature extracted from the second frequency band range portion includes:
and inputting the first audio characteristic, the second audio characteristic or the fusion audio characteristic of the first audio characteristic and the second audio characteristic into the neural network model to obtain the type of the audio to be identified.
In one possible implementation of the first aspect, the audio to be identified comprises noise.
For example, a user wearing a noise reduction earphone takes a subway. The noise reduction earphone collects the audio in the subway through a microphone, and when the audio intensity exceeds a sound intensity threshold preset in the noise reduction earphone, the earphone separates the vocal tract signal and the sound source signal from the collected audio through a linear filter. MFCC feature parameters are then extracted from the vocal tract signal, and time-frequency feature parameters are extracted from the sound source signal. Finally, the audio is identified according to the MFCC feature parameters and the time-frequency feature parameters, and the noise reduction earphone reduces the noise accordingly.
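As an illustration of the flow just described, the following minimal Python sketch gates on sound intensity, splits the audio into a low-frequency and a high-frequency part, extracts simple features from each part, and hands the fused feature vector to a classifier. The Butterworth split at 3000 Hz is only a crude stand-in for the linear predictor of this application, and the band-energy features, function names, and threshold value are assumptions for illustration, not the patented implementation.

```python
# Illustrative sketch only: the Butterworth band split is a crude stand-in for
# the linear predictor described in this application, and the band-energy
# "features" are placeholders for the MFCC / time-frequency features.
import numpy as np
from scipy.signal import butter, sosfilt

SOUND_INTENSITY_THRESHOLD = 1e-5          # assumed gate value, arbitrary units

def intensity(audio):
    """Mean squared amplitude as a simple proxy for sound intensity."""
    return float(np.mean(np.asarray(audio, dtype=float) ** 2))

def split_bands(audio, sr, cutoff_hz=3000.0):
    """Split the audio into a low-frequency part and a high-frequency part."""
    low = sosfilt(butter(4, cutoff_hz, btype="low", fs=sr, output="sos"), audio)
    high = sosfilt(butter(4, cutoff_hz, btype="high", fs=sr, output="sos"), audio)
    return low, high

def band_energies(part, n=8):
    """Placeholder features: log energies of n equal FFT sub-bands."""
    spectrum = np.abs(np.fft.rfft(part)) ** 2
    return np.log(np.array([b.sum() for b in np.array_split(spectrum, n)]) + 1e-12)

def recognize(audio, sr, classify):
    """Gate on intensity, separate the bands, extract and fuse features, classify."""
    if intensity(audio) < SOUND_INTENSITY_THRESHOLD:
        return None                        # scene considered quiet: no processing
    low, high = split_bands(audio, sr)
    fused = np.concatenate([band_energies(low), band_energies(high)])
    return classify(fused)                 # e.g. "subway", used to pick a noise reduction mode
```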
A second aspect of the present application provides an audio recognition apparatus, comprising: an acquisition module, configured to acquire audio to be identified; a separation module, configured to separate a first frequency band range portion and a second frequency band range portion from the audio to be identified, where the frequency of the band contained in the first frequency band range portion is lower than the frequency of the band contained in the second frequency band range portion; and an identification module, configured to identify the audio to determine the type of the audio to be identified based on at least one of a first audio feature extracted from the first frequency band range portion and a second audio feature extracted from the second frequency band range portion. The audio recognition apparatus may implement any of the methods provided in the foregoing first aspect.
A third aspect of the present application provides a computer-readable medium, characterized in that the computer-readable medium has stored thereon instructions, which when executed on a computer, cause the computer to perform any one of the methods provided by the first aspect.
A fourth aspect of the present application provides an electronic apparatus, comprising: a processor, the processor coupled to a memory, the memory storing program instructions that, when executed by the processor, cause the electronic device to perform any of the methods provided by the foregoing first aspect.
A fifth aspect of the present application provides a chip system, where the chip system includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to execute any one of the methods provided in the foregoing first aspect.
Drawings
FIG. 1 illustrates a scenario for noise recognition by the audio recognition methods provided herein, according to some embodiments of the present application;
FIG. 2 illustrates a hardware block diagram of the noise reducing headphone of FIG. 1, according to some embodiments of the present application;
FIG. 3 illustrates a flow chart for training a noise scene recognition model with a server and transplanting the trained noise scene recognition model to a noise reduction headphone to achieve intelligent noise reduction, according to some embodiments of the present application;
fig. 4 illustrates a process of MFCC feature parameter extraction for a separated channel signal in a subway scene, according to some embodiments of the present application;
FIG. 5(a) illustrates a time domain waveform diagram of an acoustic source signal, according to some embodiments of the present application;
FIG. 5(b) illustrates a time domain waveform of a pitch pulse signal extracted from the acoustic source signal shown in FIG. 5(a), according to some embodiments of the present application;
FIG. 5(c) illustrates a time domain waveform diagram of the acoustic source signal shown in FIG. 5(a) at different sub-band frequencies, according to some embodiments of the present application;
FIG. 6 illustrates a flow diagram of a method of audio recognition, according to some embodiments of the present application;
FIG. 7 illustrates a schematic structural diagram of an audio recognition device, according to some embodiments of the present application;
FIG. 8 illustrates a schematic structural diagram of an electronic device, according to some embodiments of the present application;
fig. 9 illustrates a block diagram of a system on a chip (SoC), according to some embodiments of the present application.
Detailed Description
Illustrative embodiments of the present application include, but are not limited to, an audio recognition method, and apparatus, medium, and electronic device thereof.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another.
The embodiments of the application disclose an audio recognition method, an audio recognition apparatus, a medium, and an electronic device. Existing MFCC features represent the vocal tract characteristics of a sound-producing device and capture rich low-frequency vocal tract signal information, but they cannot capture the high-frequency sound source signal characteristics that reflect the sound source of the device. Moreover, when Mel cepstral coefficients are extracted directly from the original audio, in which low and high frequencies are mixed, the extracted coefficients are easily contaminated by high-frequency signals, which degrades their generalization capability and, in turn, the accuracy of voiceprint recognition. Some embodiments of the application therefore design a linear predictor that separates the low-frequency part of the audio (characterizing the vocal tract of the object producing the audio) from the high-frequency harmonic part (characterizing the sound source of that object), and then apply a corresponding feature extraction algorithm to each separated part, obtaining a low-frequency feature for the low-frequency part of the audio (hereinafter, the vocal tract signal) and a high-frequency feature for the high-frequency harmonic part (hereinafter, the sound source signal). For example, Mel-scale Frequency Cepstral Coefficients (MFCC) are extracted from the vocal tract signal in the audio, so that the extracted MFCC feature parameters are free from interference by high-frequency harmonics, describe the vocal tract characteristics of the sound-producing object better, and generalize better. As another example, time-frequency feature parameters are extracted, through multi-scale wavelet transform, from the sound source signal separated out of the audio by the linear predictor, which effectively represents the sound source characteristics of the sound-producing object. Finally, exploiting the strong complementarity between the vocal tract signal in the low-frequency part and the sound source signal in the high-frequency harmonic part, the MFCC feature parameters extracted from the low-frequency vocal tract signal and the time-frequency feature parameters extracted from the high-frequency sound source signal are linearly fused into a final feature vector that reflects the characteristics of the audio more accurately, which helps improve the effect of audio recognition (such as voiceprint recognition and noise recognition).
It is understood that in the embodiments of the present application, the sound-producing object may be a sound-producing organ of a living being (e.g., a human being), or may be various devices capable of producing audio, such as a sound-producing device of a non-living being (e.g., a musical instrument, a machine, a sound-producing apparatus).
Embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Fig. 1 illustrates a scenario 10 for noise recognition by the audio recognition methods provided herein, according to some embodiments of the present application. Specifically, as shown in fig. 1, the scenario 10 includes an electronic device 100 and an electronic device 200. The electronic device 100 can identify the noise in the scene where the user is located through the audio identification method provided by the application, so that the electronic device 100 determines the type of the scene where the user is located according to the identified noise type, and then adaptively adjusts the noise reduction mode according to the determined type of the scene, thereby meeting the personalized noise reduction requirements of the user in different scenes (such as airports, railway stations, buses, subways, markets, conference rooms and the like), and improving the user experience.
It is understood that the electronic devices 100 and 200 provided by some embodiments of the present application may be various electronic devices capable of performing audio recognition using the audio recognition methods provided by the present application, including but not limited to noise reduction headsets, servers, tablets, smartphones, laptop computers, desktop computers, wearable electronic devices, head-mounted displays, mobile email devices, portable game consoles, portable music players, televisions with one or more processors embedded or coupled thereto, and other electronic devices capable of accessing a network. It is understood that the electronic device 100 and the electronic device 200 may capture the audio of the different scenes where the user is located through an audio capturing device. The audio capturing device may be a part of the electronic device 100 or the electronic device 200, or may be an independent device separate from the electronic device 100 and the electronic device 200 that has a data connection with them to transmit the captured audio to the electronic device 100 and the electronic device 200.
For convenience of description, the following description will use the electronic device 100 as the noise reduction earphone 100 and the electronic device 200 as the server 200 as an example to describe the technical solution of the present application.
Fig. 2 illustrates a hardware block diagram of a noise reduction headphone 100, according to some embodiments of the present application. Specifically, as shown in fig. 2, the noise reduction headphone 100 includes a data processing chip 110, an audio module 120, a power supply module 130, a noise reduction circuit 140, a Neural-Network Processing Unit (NPU) 150, a microphone 160, and a speaker 170.
the microphone 160 is used to capture audio in the scene in which the user is located.
The audio module 120 is used for converting digital audio information into an analog audio signal for the speaker 170 to output, and for converting the analog audio signal collected by the microphone 160 into a digital audio signal. The audio module 120 may also be used to encode and decode audio signals. In some embodiments, the audio module 120 may be disposed in the data processing chip 110, or some functional modules of the audio module 120 may be disposed in the data processing chip 110.
The data processing chip 110 (e.g., a Digital Signal Processing (DSP) chip) is configured to separate the low-frequency vocal tract signal and the high-frequency sound source signal in the audio collected by the microphone 160, to perform MFCC feature parameter extraction on the separated vocal tract signal and time-frequency feature parameter extraction on the sound source signal, and to fuse the MFCC feature parameters and the time-frequency feature parameters (e.g., linearly fuse the two into one feature vector, or normalize or weight the two before combining them) to obtain the fused feature parameters corresponding to the audio. The neural network processor 150 is configured to identify the type of the noise scene where the user is located according to one of the extracted MFCC feature parameters and time-frequency feature parameters, or according to the fused feature parameters obtained by fusing the two, and then adaptively match a corresponding noise reduction mode according to the identified noise scene type. In one possible implementation, the neural network processor 150 may be located outside the noise reduction earphone 100, for example in an electronic device (e.g., a mobile phone) cooperating with the noise reduction earphone 100.
The noise reduction circuit 140 is configured to generate, based on the noise reduction mode determined by the neural network processor 150, an electrical signal corresponding to the identified noise scene (e.g., an electrical signal whose phase is opposite to, and whose amplitude is the same as, that of the noise in the identified scene). The speaker 170 is used for converting the electrical signal generated by the noise reduction circuit 140 into sound waves for output, so as to achieve noise reduction. The power module 130 is used for supplying power to the neural network processor 150, the data processing chip 110, the noise reduction circuit 140, and the audio module 120.
It is to be understood that the hardware structure of the noise reduction earphone 100 provided in the embodiment of the present application does not constitute a specific limitation to the noise reduction earphone 100. In other embodiments of the present application, the noise reducing headphone 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components.
FIG. 3 illustrates a particular noise reduction technique for the scenario illustrated in FIG. 1, according to some embodiments of the present application. In the technical scheme, a noise scene recognition model is trained through the server 200, and then the trained noise scene recognition model is transplanted to the noise reduction earphone 100, so that the noise reduction earphone 100 can recognize the noise type through the noise scene recognition model, and adaptively adjust the noise reduction mode according to the recognized noise scene.
Specifically, the noise reduction technique shown in fig. 3 mainly includes noise scene recognition model training and model noise reduction. In the process of training the noise scene recognition model, the server 200 may separate a large amount of audio data acquired in different scenes by using a linear predictor to obtain a sound source signal and a sound channel signal corresponding to the audio data. And then, respectively extracting the MFCC characteristic parameters of the sound channel signals and the time-frequency characteristic parameters of the sound source signals, and training a neural network model based on the extracted MFCC characteristic parameters, the time-frequency characteristic parameters and the corresponding scene types to train a noise scene recognition model.
It should be noted that the audio recognition method provided in the embodiment of the present application may be applied to various Neural Network models, such as Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Binary Neural Networks (BNNs), and the like. In the specific implementation, the number of layers of the neural network model, the number of nodes in each layer, and the connection parameters of two connected nodes (i.e., the weights on the connection line of the two nodes) can be preset according to actual requirements.
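The application leaves the network architecture open; purely for illustration, a minimal PyTorch sketch of one possible noise-scene classifier is given below. The fully connected layout, layer sizes, input dimension, and scene label set are assumptions, not values specified by this application.

```python
# Minimal sketch of a noise-scene classifier; sizes and classes are assumed.
import torch
import torch.nn as nn

SCENES = ["subway", "bus", "airport", "mall", "office"]   # assumed label set

class NoiseSceneModel(nn.Module):
    def __init__(self, feature_dim=29, hidden=64):
        # feature_dim would be the length of the fused MFCC + time-frequency vector
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(SCENES)),   # one logit per noise scene
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        return self.net(fused_features)
```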
As shown in fig. 3, the training process of the noise scene recognition model includes:
s301: the server 200 obtains audio data for model training.
It is understood that the server 200 may obtain the audio data for model training in real time, or may obtain the audio data that has been collected by the audio collecting apparatus.
S302: the server 200 selects audio data for training by detecting the audio intensity.
The server 200 selects audio whose intensity reaches a sound intensity threshold (e.g., 10^-5 W/m²) for training. Alternatively, the server 200 monitors in real time the intensity of the audio collected by the audio collecting device (e.g., a microphone) and starts subsequent processing only when it determines that the intensity of the collected audio reaches the sound intensity threshold (e.g., 10^-5 W/m²). For example, when the server 200 determines that the audio intensity of the acquired audio data is less than the sound intensity threshold, the current scene (e.g., an empty office after working hours) may be considered quiet and free of noise. When the server 200 determines that the audio intensity of the collected audio data is greater than the sound intensity threshold, the server 200 separates the vocal tract signal and the sound source signal in the collected audio, and then performs feature parameter extraction on the separated vocal tract signal and sound source signal respectively, obtaining MFCC feature parameters corresponding to the vocal tract signal and time-frequency feature parameters corresponding to the sound source signal.
S303: the server 200 separates the channel signal and the sound source signal from the audio data through linear prediction.
When training the noise scene recognition model, the server 200 first performs linear prediction on the training audio data, collected in multiple scenes, that meets the audio intensity condition, in order to separate the low-frequency vocal tract signal (e.g., the low-frequency part of the audio below 200 Hz and the mid-frequency part between 200 Hz and 3000 Hz) from the high-frequency sound source signal (e.g., the part of the audio above 3000 Hz).
For example, taking the extraction of characteristic parameters of audio in a subway scene by the server 200 as an example, in some embodiments the vocal tract signal and the sound source signal in the audio data collected in the subway scene are separated by using a P-order linear predictor. That is, the current sample value $x(n)$ of the audio in the subway scene can be predicted from the weighted sum of the sample values of that audio at the past $P$ historical moments, giving the predicted value $\hat{x}(n)$ (i.e., the vocal tract signal in the audio of the subway scene), which can be expressed as:

$\hat{x}(n) = \sum_{i=1}^{P} a_i\, x(n-i)$   (Equation 1)

where $a_i$ are the linear prediction coefficients and $P$ is the order of the linear predictor.

The transfer function of the P-order linear predictor is:

$P(z) = \sum_{i=1}^{P} a_i\, z^{-i}$   (Equation 2)

To find the linear prediction coefficients $a_i$, the error $E$ between the audio $x(n)$ in the subway scene and its vocal tract signal $\hat{x}(n)$ is defined as:

$E = \sum_{n} \Big[ x(n) - \sum_{i=1}^{P} a_i\, x(n-i) \Big]^2$   (Equation 3)

Setting the partial derivative of the error $E$ with respect to each linear prediction coefficient $a_i$ equal to 0 minimizes the error $E$:

$\dfrac{\partial E}{\partial a_j} = 0, \quad j = 1, 2, \dots, P$   (Equation 4)

Combining Equation 3 and Equation 4 gives:

$\sum_{n} x(n-j)\, x(n) = \sum_{i=1}^{P} a_i \sum_{n} x(n-i)\, x(n-j), \quad j = 1, 2, \dots, P$   (Equation 5)

Defining $\phi(j, i) = \sum_{n} x(n-j)\, x(n-i)$, Equation 5 can be reduced to:

$\phi(j, 0) = \sum_{i=1}^{P} a_i\, \phi(j, i), \quad j = 1, 2, \dots, P$   (Equation 6)

Substituting Equation 6 into Equation 3 yields the minimum error:

$E_{\min} = \phi(0, 0) - \sum_{i=1}^{P} a_i\, \phi(0, i)$   (Equation 7)

Therefore, if $\phi(j, i)$ can be calculated, the linear prediction coefficients $a_i$ can be obtained. To find $\phi(j, i)$, the autocorrelation function of the audio $x(n)$ in the subway scene can be defined as:

$r(j) = \sum_{n=j}^{L-1} x(n)\, x(n-j)$   (Equation 8)

where $L$ denotes the segment length of the audio. It follows that:

$\phi(j, i) = r(j - i)$   (Equation 9)

Since $r(j)$ is an even function, Equation 6 can be reduced to:

$\sum_{i=1}^{P} a_i\, r(|j - i|) = r(j), \quad j = 1, 2, \dots, P$   (Equation 10)

The matrix form of Equation 10 is:

$\begin{bmatrix} r(0) & r(1) & \cdots & r(P-1) \\ r(1) & r(0) & \cdots & r(P-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(P-1) & r(P-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_P \end{bmatrix} = \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(P) \end{bmatrix}$   (Equation 11)

Solving the above system gives the values of the linear prediction coefficients $a_i$. At this point $\hat{x}(n)$ is the vocal tract signal, and the sound source signal in the audio of the subway scene is obtained as the difference between the audio $x(n)$ in the subway scene and the vocal tract signal $\hat{x}(n)$.
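The derivation above can be exercised numerically. The sketch below is a non-authoritative illustration: it estimates the autocorrelation of Equation 8, solves the Toeplitz system of Equation 11 with SciPy, and returns the prediction as the vocal tract estimate and the residual as the sound source estimate; the order P = 12 and the use of scipy.linalg.solve_toeplitz are assumed choices.

```python
# Sketch of P-order linear prediction: solve Equation 11 for the coefficients,
# take the prediction as the vocal tract signal and the residual as the source.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_separate(x, p=12):
    x = np.asarray(x, dtype=np.float64)
    L = len(x)
    # Autocorrelation r(0..p) over the analysis segment (Equation 8).
    r = np.array([np.dot(x[j:], x[:L - j]) for j in range(p + 1)])
    # Solve the Toeplitz system R a = r (Equation 11).
    a = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])
    # Predicted signal x_hat(n) = sum_i a_i x(n-i)  (Equation 1).
    x_hat = np.zeros_like(x)
    for i, ai in enumerate(a, start=1):
        x_hat[i:] += ai * x[:-i]
    source = x - x_hat            # residual: sound source estimate
    return x_hat, source          # vocal tract estimate, sound source estimate
```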
S304: the server 200 performs feature extraction on the channel signal and the sound source signal separated from the audio data, respectively.
Fig. 4 illustrates a process of MFCC extraction from the separated vocal tract signal in a subway scene, according to some embodiments of the present application. Specifically, referring to fig. 4, the vocal tract signal separated from the audio in the subway scene is first preprocessed by pre-emphasis, framing, windowing, and the like, so as to enhance the signal-to-noise ratio of the vocal tract signal and improve the processing accuracy. A corresponding spectrum is then obtained for each short-time analysis window through a Fast Fourier Transform (FFT), giving the spectra of the vocal tract signal distributed over different time windows on the time axis. The obtained spectrum is passed through a Mel filter bank to obtain a Mel spectrum, converting the linear natural spectrum into a Mel spectrum that embodies the auditory characteristics of the human ear. Cepstral analysis is then performed on the Mel spectrum, for example by taking its logarithm, and the MFCC of the vocal tract signal is obtained by a DCT (Discrete Cosine Transform).
It is understood that existing MFCC extraction techniques are all applicable to the technical solution of the present application and are not limited to the scheme shown in fig. 4; the present application does not limit this. In addition, the vocal tract characteristics of the vocal tract signal may also be extracted by other audio feature extraction algorithms that simulate the perception of the human cochlea, such as a Linear Prediction Cepstrum Coefficient (LPCC) extraction algorithm.
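For reference, a compact sketch of the fig. 4 pipeline is given below. The frame length, hop size, FFT size, number of Mel filters, and number of coefficients are assumed values, and the triangular Mel filterbank is a simplified textbook construction rather than the exact one used in this application.

```python
# Sketch of MFCC extraction over a vocal-tract signal (cf. fig. 4):
# pre-emphasis, framing, Hamming window, FFT, mel filterbank, log, DCT.
import numpy as np
from scipy.fftpack import dct

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, frame_len=400, hop=160, n_fft=512, n_filt=26, n_mfcc=13):
    signal = np.asarray(signal, dtype=np.float64)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])      # pre-emphasis
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)                                     # windowing
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft           # power spectrum
    # Triangular mel filterbank between 0 Hz and sr/2 (simplified construction).
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = np.log(power @ fbank.T + 1e-10)                        # log mel spectrum
    return dct(mel_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]    # cepstrum via DCT
```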
The following describes in detail the process of extracting time-frequency features of sound source signals by the server 200 using multi-scale wavelet transform.
In some embodiments, to eliminate the influence of sound intensity on the time-frequency feature extraction of the sound source signal, the sound source signal is amplitude-normalized, and the normalized sound source signal $x_e(n)$ is:

$x_e(n) = \dfrac{x(n) - \hat{x}(n)}{\max_n \left| x(n) - \hat{x}(n) \right|}$   (Equation 12)

where $x(n)$ in Equation 12 represents the current sample value of the audio in the subway scene and $\hat{x}(n)$ represents the vocal tract signal in that audio.

For example, in the embodiment shown in FIG. 5(a), the sound source signal $x_e(n)$ consists of periodic pitch pulses. A Hamming window whose length is two pitch periods (here the pitch period may be set to 10 ms, for example) is used to extract each pitch pulse in the sound source signal $x_e(n)$, yielding the pitch pulse signal $x_{ew}(n)$ shown in FIG. 5(b).

A dyadic discrete wavelet transform is applied to the pitch pulse signal $x_{ew}(n)$ and can be calculated as:

$w(a, b) = 2^{-a/2} \sum_{n=0}^{N-1} x_{ew}(n)\, \psi^{*}\!\left(2^{-a} n - b\right)$   (Equation 13)

where $N$ is the Hamming window length, $\psi^{*}(n)$ is the conjugate function of the Daubechies wavelet basis, and $a$ and $b$ represent the scale parameter and the time factor, reflecting the frequency information and the time information of the pitch pulse $x_{ew}(n)$, respectively.

To extract the frequency features, the pitch pulse $x_{ew}(n)$ is represented by $K$ subbands $W_k$ with different frequency resolutions:

$W_k = \{\, w(k, b) \mid b = 1, 2, \dots, N/2^{k} \,\}, \quad k = 1, 2, \dots, K$   (Equation 14)

In general, the frequency range of the audio is 300 to 3400 Hz. Therefore, $K = 4$ is chosen, resulting in 4 subbands with different frequency ranges as shown in FIG. 5(c): 2000–4000 Hz ($W_1$), 1000–2000 Hz ($W_2$), 500–1000 Hz ($W_3$), and 250–500 Hz ($W_4$).

To preserve the time information, each set of wavelet coefficients in Equation 14 is divided into $M$ subsets:

$W_k = \{\, w_k^{(1)}, w_k^{(2)}, \dots, w_k^{(M)} \,\}$   (Equation 15)

where $M = 4$ is the number of subsets. Calculating the 2-norm of each subset of the wavelet coefficients $w_k$ yields 4 subvectors:

$\omega_k = \left[\, \|w_k^{(1)}\|,\ \|w_k^{(2)}\|,\ \|w_k^{(3)}\|,\ \|w_k^{(4)}\| \,\right]$   (Equation 16)

where $\|\cdot\|$ denotes the 2-norm. The time-frequency feature parameters of the sound source signal can then be expressed as:

$\omega = [\omega_1, \omega_2, \omega_3, \omega_4]^{T}$   (Equation 17)
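For reference, the sketch below computes subband 2-norm features of this kind with PyWavelets. The db4 wavelet, the 4-level decomposition, the assumed 8 kHz sampling rate, the use of the first window of the signal as the pitch pulse, and the equal split of each coefficient set into M = 4 subsets are illustrative assumptions; the result approximates, but is not numerically identical to, Equations 12 to 17.

```python
# Sketch: dyadic wavelet decomposition of a normalized pitch pulse and
# 2-norms of M subsets of each subband's coefficients (cf. Equations 12-17).
import numpy as np
import pywt

def time_frequency_features(x, x_hat, pitch_ms=10.0, sr=8000, levels=4, m=4):
    source = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    source = source / (np.max(np.abs(source)) + 1e-12)   # amplitude normalization (Eq. 12)
    win = int(2 * pitch_ms * 1e-3 * sr)                  # Hamming window of two pitch periods
    pulse = source[:win] * np.hamming(win)               # first pulse only, for illustration
    # Dyadic DWT with a Daubechies basis; the detail coefficients of levels 1..4
    # play the role of the subbands W1..W4.
    coeffs = pywt.wavedec(pulse, "db4", level=levels)
    subbands = coeffs[1:][::-1]                          # finest (highest band) first
    feats = []
    for w_k in subbands:
        subsets = np.array_split(w_k, m)                 # M subsets keep coarse timing
        feats.extend(np.linalg.norm(s) for s in subsets) # 2-norm of each subset (Eq. 16)
    return np.asarray(feats)                             # omega, length levels * m (Eq. 17)
```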
It is understood that besides wavelet transformation, other algorithms can be used to extract time-frequency features in the sound source signal, such as, but not limited to, pitch period extraction method.
S305: and fusing the MFCC extracted from the sound channel signal and the time-frequency characteristics extracted from the sound source signal to obtain the characteristic vector of the audio data.
Specifically, in some embodiments, the server 200 may extract the feature parameters of a large number of audio recordings acquired in different scenes according to the above method, fuse the MFCC feature parameters of the vocal tract signal in the audio acquired in each scene with the time-frequency feature parameters of the sound source signal to obtain fused feature parameters, and input the fused feature parameters into the neural network model for training. Because the fused feature parameters include both feature parameters reflecting the vocal tract signal in the low- and mid-frequency parts of the audio and feature parameters reflecting the sound source signal in the high-frequency part of the audio, a neural network model trained with the fused feature parameters can identify the scene where the user is located more accurately, which helps improve the scene recognition effect.
For example, in some embodiments, the MFCC feature parameters and the time-frequency feature parameters may be linearly fused into one feature vector; or the MFCC feature parameters and the time-frequency feature parameters may be normalized and then linearly fused; or they may be weighted and then linearly fused. In other embodiments, the two may also be fused non-linearly, for example by multiplication. In a specific implementation, the fusion rule can be preset as needed; this solution does not limit the fusion rule. A sketch of these options is given below.
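Purely as an illustration of the fusion rules listed above, the following sketch implements plain concatenation, per-part z-score normalization before concatenation, and weighted concatenation; the normalization scheme and the default weights are assumptions, not values specified by this application.

```python
# Sketch of possible fusion rules for the MFCC vector and time-frequency vector.
import numpy as np

def fuse(mfcc_vec, tf_vec, rule="concat", w_mfcc=0.6, w_tf=0.4):
    mfcc_vec, tf_vec = np.ravel(mfcc_vec), np.ravel(tf_vec)
    if rule == "concat":            # plain linear fusion into one feature vector
        return np.concatenate([mfcc_vec, tf_vec])
    if rule == "normalized":        # z-score each part, then concatenate
        z = lambda v: (v - v.mean()) / (v.std() + 1e-12)
        return np.concatenate([z(mfcc_vec), z(tf_vec)])
    if rule == "weighted":          # weight each part, then concatenate
        return np.concatenate([w_mfcc * mfcc_vec, w_tf * tf_vec])
    raise ValueError(f"unknown fusion rule: {rule}")
```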
In addition, it is understood that in other embodiments, the MFCC features and the time-frequency features may be directly input into the neural network model for training without feature fusion.
S306: and inputting the feature vector obtained after fusion into a neural network model for model training.
Specifically, the server 200 may input the feature vector obtained by fusing the MFCC feature parameters of the vocal tract signal and the time-frequency feature parameters of the sound source signal in the audio collected in each scene into the neural network model for training. For example, the feature vector obtained by linearly fusing the MFCC feature parameters of the vocal tract signal and the time-frequency feature parameters of the sound source signal in audio collected in a subway scene is input to the neural network model; the output of the model (i.e., the training result obtained by training the model with the fused feature vector of the audio collected in the subway scene) is then compared with the data representing the subway scene to obtain the error (i.e., the difference between the two), the partial derivatives of the error are computed, and the weights are updated according to the partial derivatives. The training of the model is considered complete when the model finally outputs data representing the subway scene. It can be understood that the fused feature parameters of other scenes can also be input to train the model, so that, over a large number of sample scenes, when the output error becomes sufficiently small (e.g., meets a predetermined error threshold) through continual weight adjustment, the neural network model is considered to have converged and the noise scene recognition model has been trained.
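The weight-update loop described above could look like the following sketch; the cross-entropy loss, the Adam optimizer, the learning rate, and the stopping threshold are assumed choices, not requirements of this application.

```python
# Sketch of training the noise-scene recognition model on fused feature vectors.
import torch
import torch.nn as nn

def train(model, fused_vectors, scene_labels, epochs=50, lr=1e-3, err_threshold=0.05):
    """fused_vectors: (N, feature_dim) float tensor; scene_labels: (N,) long tensor."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()         # error between model output and scene label
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(fused_vectors), scene_labels)
        loss.backward()                     # partial derivatives of the error
        optimizer.step()                    # update the weights accordingly
        if loss.item() < err_threshold:     # output error small enough: treat as converged
            break
    return model
```

For example, `train(NoiseSceneModel(), fused_vectors, scene_labels)` would use the illustrative model sketched earlier.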
It is to be understood that the trained noise scene recognition model may only include the trained neural network model, or may include one or more of the audio intensity detection function, the linear prediction function, the feature extraction function, and the feature fusion function in steps S302 to 305 while having the trained neural network model.
For the noise reduction earphone 100, the trained noise scene recognition model may be transplanted into the noise reduction earphone 100, so that the noise reduction earphone 100 performs noise reduction processing during the use process. For example, after the noise scene recognition model is trained on the server 200, an Android project may be established, the model is read and analyzed through a model reading interface in the foregoing project, and then an APK (Android application package) file is generated by compiling and installed in the noise reduction earphone 100, so as to complete transplantation of the noise scene recognition model.
Then, the noise reduction earphone can perform scene recognition by using the noise scene recognition model transplanted to the noise reduction earphone 100 to recognize a corresponding noise scene, and then the noise reduction earphone 100 sets a noise reduction mode corresponding to the scene according to the recognized scene to realize an individualized noise reduction function.
It can be understood that the process of noise reduction in different scenes with the noise reduction earphone 100 is similar. Taking noise reduction of audio in a subway scene as an example, and with reference to fig. 3 together with figs. 2 and 4, the noise reduction process of the noise reduction earphone 100 is described below; in the following embodiment, the noise scene recognition model includes only the trained neural network model. Specifically, the noise reduction process of the noise reduction headphone 100 includes:
s307: and acquiring audio data to be identified. For example, the microphone 160 of the noise reduction earphone 100 collects analog audio data in a subway, and performs analog-to-digital conversion by the audio module 120 to obtain digital audio data to be identified.
S308: the data processing chip 110 of the noise reduction headphone 100 separates the channel signal and the source signal from the collected audio data using a P-order linear filter. The specific separation process is similar to the process of separating the sound channel signal and the sound source signal by using the server 200, and is not described herein again.
S309: the data processing chip 110 of the noise reduction headphone 100 performs feature extraction on the separated sound channel signal and sound source signal, respectively, and performs feature fusion on the extracted features. The specific extraction process and the fusion process are similar to the above process and the fusion process for respectively extracting the features of the vocal tract signal and the sound source signal through the server 200, and are not described herein again.
S310: the noise reduction headphone 100 inputs the fused feature vector into a noise scene recognition model which is transplanted into the neural network processor 150 of the noise reduction headphone 100, and performs scene recognition. For example, the finally identified result of the audio data collected from the subway scene represents that the user is riding a subway, and the noise reduction scene is the subway scene.
S311: the noise reduction circuit 140 of the noise reduction earphone 100 generates an electrical signal of a corresponding noise reduction mode according to the identified scene, then converts the electrical signal into audio data, combines the audio data with the audio data normally output by the noise reduction earphone 100 to generate noise-reduced audio data, and outputs the noise-reduced audio data.
It will be appreciated that different noise reduction modes may be employed for different scenarios. For a subway scene, a bus scene, an airport scene, and the like, the noise reduction earphone 100 may reduce noise by generating an electrical signal whose phase is opposite to, and whose amplitude is equal to, that of the noise in the scene. For example, when a user takes a subway and uses an electronic device such as a mobile phone or a tablet computer to listen to music, play games, or watch movies, in order to avoid interference from noise such as the noisy voices of surrounding people and the rumble of the train, the noise reduction circuit 140 in the noise reduction earphone 100 may generate an electrical signal with the same amplitude and opposite phase as the noise in the audio of the scene, and superimpose it on the audio data normally output by the noise reduction earphone 100 to output noise-reduced audio data. In other situations the noise reduction should not be too strong. For example, when the user is walking through an intersection, the noise reduction earphone 100 does not apply strong noise reduction: if noise such as the horns of vehicles on the road and the rumble of engines were removed completely, the user might not hear external warning sounds, which could lead to traffic safety accidents. In this case the phase of the electrical signal generated by the noise reduction circuit 140 is still opposite to that of the noise, but its amplitude is smaller than the noise amplitude.
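As a simplified illustration of the mode-dependent behavior just described, the sketch below inverts an estimated noise waveform and scales it by a per-scene strength before mixing it with the normally output audio; the strength table, the default value, and the assumption that a noise estimate is already available are illustrative, not specified by this application.

```python
# Sketch: generate an anti-phase signal whose amplitude depends on the scene.
import numpy as np

# Assumed per-scene noise-reduction strengths (1.0 = full amplitude, opposite phase).
NOISE_REDUCTION_STRENGTH = {"subway": 1.0, "bus": 1.0, "airport": 1.0, "street": 0.4}

def apply_noise_reduction(playback, noise_estimate, scene):
    strength = NOISE_REDUCTION_STRENGTH.get(scene, 0.7)   # default value is assumed
    anti_noise = -strength * np.asarray(noise_estimate)   # opposite phase, scaled amplitude
    return np.asarray(playback) + anti_noise              # superimpose on normal output
```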
The above embodiments disclose the scheme of applying the audio recognition technology of the present application to noise reduction, and it can be understood that the audio recognition technology of the present application may also be applied to voiceprint recognition, for example, voice assistant of electronic device, voice command recognition of car machine, voiceprint recognition of user, and the like. An embodiment of applying the technical solution of the present application to voiceprint recognition of an electronic device is described below. Specifically, as shown in fig. 6, the method includes:
s602: the electronic equipment collects voice of a user to obtain audio data.
S604: the electronics utilize a linear filter to separate the channel signal and the source signal from the captured audio data. The specific separation scheme is the same as that shown in fig. 3, and is not described herein again.
S606: and the electronic equipment respectively extracts the characteristics of the separated sound channel signals and the separated sound source signals, and performs characteristic fusion on the extracted characteristics to obtain a characteristic vector. The specific extraction and fusion scheme is the same as that shown in fig. 3, and is not described herein again.
S608: and the electronic equipment matches the obtained feature vector with the stored feature vector of the voice signal of the legal user.
For example, a matching degree threshold A is configured for the feature vector of the legitimate user's speech. The feature value of the fused feature vector of the user's speech and the feature value of the feature vector of the legitimate user's speech are calculated, and when the difference between the two feature values (the absolute value of their difference) is greater than the matching degree threshold A, the user who uttered the speech is confirmed to be a legitimate user. The matching degree threshold A may be configured as 0.5: when the absolute value of the difference between the feature value of the fused feature vector of the user's speech and the feature value of the feature vector of the legitimate user's speech is greater than 0.5, the user who uttered the speech is determined to be a legitimate user. A sketch of this kind of matching is given below.
If so, go to 610; otherwise, the user is reminded that the voice cannot be recognized, and is asked to re-send the voice, and the process goes to 602 again.
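For illustration only, the sketch below uses cosine similarity as the matching degree between the fused feature vector of the incoming speech and the stored feature vector of the legitimate user, with a threshold of 0.5; the choice of cosine similarity is an assumption and is not the exact feature-value comparison described in this embodiment.

```python
# Sketch: decide whether the speaker is the enrolled (legitimate) user by
# comparing fused feature vectors; cosine similarity is an assumed measure.
import numpy as np

def is_legitimate_user(fused_vec, enrolled_vec, threshold=0.5):
    fused_vec, enrolled_vec = np.ravel(fused_vec), np.ravel(enrolled_vec)
    matching_degree = float(
        np.dot(fused_vec, enrolled_vec)
        / (np.linalg.norm(fused_vec) * np.linalg.norm(enrolled_vec) + 1e-12)
    )
    return matching_degree > threshold
```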
S610: and the electronic equipment executes corresponding operation through the voiceprint recognition. For example, the electronic device is a door lock, and the door is opened after the voiceprint recognition is passed. For another example, the electronic device is a mobile phone, and the mobile phone is unlocked after the voiceprint recognition is passed.
Furthermore, it is understood that in other embodiments, the features extracted from the vocal tract signal and the sound source signal in S606 may not be fused; instead, in S608, the stored features of the vocal tract signal and/or the sound source signal in the legitimate user's speech are respectively matched against the features of the vocal tract signal and the sound source signal obtained in S606.
For example, a matching degree threshold B is configured for the features of the vocal tract signal and/or the sound source signal in the legitimate user's speech. The feature value of the vocal tract signal feature or of the sound source signal feature vector of the user's speech is calculated; when the difference (the absolute value of the difference) between the feature value of the vocal tract signal feature and the feature value of the vocal tract signal feature in the legitimate user's speech is greater than the matching degree threshold B, or the difference (the absolute value of the difference) between the feature value of the sound source signal feature and the feature value of the sound source signal feature in the legitimate user's speech is greater than the matching degree threshold B, the user who uttered the speech is confirmed to be a legitimate user.
Fig. 7 provides a schematic structural diagram of an audio recognition apparatus 700 according to some embodiments of the present application. As shown in fig. 7, the audio recognition apparatus 700 includes:
an obtaining module 702, configured to obtain an audio to be identified;
a separation module 704, configured to separate a low frequency part and a high frequency harmonic part from the audio to be identified;
the recognition module 706 recognizes the audio to determine a type of the audio to be recognized based on at least one of low audio features extracted from the low frequency part and high audio features extracted from the high frequency harmonic part.
It can be understood that the audio recognition apparatus 700 shown in fig. 7 corresponds to the audio recognition method provided in the present application, and the technical details in the above detailed description about the audio recognition method provided in the present application are still applicable to the audio recognition apparatus 700 shown in fig. 7, and the detailed description is referred to above and is not repeated herein.
Fig. 8 shows a schematic structural diagram of an electronic device 800 according to an embodiment of the present application. The electronic device 800 is also capable of executing the audio recognition method disclosed in the above-mentioned embodiments of the present application. In fig. 8, like parts have the same reference numerals. As shown in fig. 8, electronic device 800 may include processor 810, power module 840, memory 880, mobile communication module 830, wireless communication module 820, sensor module 890, audio module 850, camera 870, interface module 860, keys 801, display 802, and the like.
It is to be understood that the illustrated structure of the embodiments of the invention is not to be construed as a specific limitation to the electronic device 800. In other embodiments of the present application, the electronic device 800 may include more or fewer components than illustrated, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 810 may include one or more processing units, for example, processing modules or processing circuits such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU), an Artificial Intelligence (AI) processor, or a Field Programmable Gate Array (FPGA). The different processing units may be separate devices or may be integrated into one or more processors. A storage unit may be provided in the processor 810 for storing instructions and data; in some embodiments, the storage unit in the processor 810 is a cache. The memory 880 mainly includes a program storage area 881 and a data storage area 882. The program storage area 881 may store an operating system and at least one application program required for functions such as voice playing and voice recognition. The data storage area 882 may store the MFCC feature parameters and time-frequency feature parameters extracted from audio using the methods provided herein. The neural network model provided in the embodiments of the present application may be regarded as an application program, stored in the program storage area 881, capable of implementing functions such as audio recognition.
The power module 840 may include a power supply, power management components, and the like. The power source may be a battery. The power management component is used for managing the charging of the power supply and the power supply of the power supply to other modules. In some embodiments, the power management component includes a charge management module and a power management module. The charging management module is used for receiving charging input from the charger; the power management module is used to connect a power supply, the charging management module and the processor 810. The power management module receives power and/or charge management module inputs and provides power to the processor 810, the display 802, the camera 870, and the wireless communication module 820, among other things.
The mobile communication module 830 may include, but is not limited to, an antenna, a power amplifier, a filter, a Low Noise Amplifier (LNA), and the like. The mobile communication module 830 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 800.
The wireless communication module 820 may include an antenna and realize transceiving of electromagnetic waves via the antenna. The wireless communication module 820 may provide a solution for wireless communication applied to the electronic device 800, including Wireless Local Area Networks (WLANs) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite Systems (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The electronic device 800 may communicate with networks and other devices via wireless communication techniques.
In some embodiments, the mobile communication module 830 and the wireless communication module 820 of the electronic device 800 may also be located in the same module.
The display screen 802 is used for displaying human-computer interaction interfaces, images, videos, and the like. The display screen 802 includes a display panel.
The sensor module 890 may include a proximity light sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
The audio module 850 is used to convert digital audio information into analog audio output or analog audio input into digital audio. The audio module 850 may also be used to encode and decode audio. In some embodiments, the audio module 850 may be disposed in the processor 810, or some functional modules of the audio module 850 may be disposed in the processor 810. In some embodiments, audio module 850 may include speakers, an earpiece, a microphone, and a headphone interface.
The camera 870 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element converts the optical signal into an electrical signal and then transmits the electrical signal to an image signal processor (ISP), which converts it into a digital image signal. The electronic device 800 may implement a shooting function through the ISP, the camera 870, a video codec, a graphics processing unit (GPU), the display screen 802, an application processor, and the like.
The interface module 860 includes an external memory interface, a Universal Serial Bus (USB) interface, a Subscriber Identity Module (SIM) card interface, and the like. The external memory interface may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the electronic device 800. The external memory card communicates with the processor 810 through the external memory interface to implement a data storage function. The universal serial bus interface is used for communication between the electronic device 800 and other electronic devices. The SIM card interface is used to communicate with a SIM card installed in the electronic device 800, for example, to read a phone number stored in the SIM card or to write a phone number into the SIM card.
In some embodiments, the electronic device 800 also includes keys 801, a motor, indicators, and the like. The keys 801 may include a volume key, an on/off key, and the like. The motor is used to cause the electronic device 800 to produce a vibration effect, for example, vibrating when the electronic device 800 receives an incoming call, to prompt the user to answer the call on the electronic device 800. The indicators may include laser indicators, radio-frequency indicators, LED indicators, and the like.
Fig. 9 shows a block diagram of a System on Chip (SoC) 900 according to an embodiment of the present application. In fig. 9, like parts have the same reference numerals, and the dashed boxes represent optional features of more advanced SoCs. In fig. 9, the SoC 900 includes: an interconnect unit 950 coupled to an application processor 910; a system agent unit 970; a bus controller unit 980; an integrated memory controller unit 940; a set of one or more coprocessors 920, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random Access Memory (SRAM) unit 930; and a Direct Memory Access (DMA) unit 960. In one embodiment, the coprocessor 920 includes a special-purpose processor, such as a network or communication processor, a compression engine, a GPU, a high-throughput MIC processor, or an embedded processor.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. The processing system may include any system having a processor, such as a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, and other tangible machine-readable storage, as well as propagated signals in electrical, optical, acoustical, or other form (e.g., carrier waves, infrared signals, digital signals) used to transmit information over the Internet. Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some features of the structures or methods may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the apparatus embodiments of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logical unit/module itself is not the key point, and it is the combination of functions implemented by these logical units/modules that solves the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above apparatus embodiments do not introduce units/modules that are less closely related to solving the technical problem addressed by the present application; this does not indicate that no other units/modules exist in the above apparatus embodiments.
It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises that element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (14)

1. An audio recognition method, comprising:
acquiring audio to be identified;
separating a first frequency band range portion and a second frequency band range portion from the audio to be identified through a linear predictor, wherein the frequency of a frequency band contained in the first frequency band range portion is lower than the frequency of a frequency band contained in the second frequency band range portion;
identifying the audio to determine a type of the audio to be identified based on at least one of first audio features extracted from the first frequency band range portion and second audio features extracted from the second frequency band range portion.
2. The method of claim 1, further comprising:
extracting the second audio features from the second frequency band range portion through wavelet transformation, wherein the second audio features are time-frequency features obtained through the wavelet transformation.
3. The method according to claim 1 or 2, wherein the first frequency band range portion characterizes a vocal tract of a sound generating object emitting the audio to be recognized, and the second frequency band range portion characterizes a sound source of the sound generating object.
4. The method according to any of claims 1-3, wherein the separating the first and second frequency band range portions from the audio to be identified by the linear predictor comprises:
separating the first frequency band range portion from the audio to be identified through the linear predictor, and taking the remaining part of the audio to be identified after the first frequency band range portion is separated as the second frequency band range portion.
5. The method according to any one of claims 1-4, further comprising:
extracting the first audio features from the first frequency band range portion through an audio feature extraction algorithm that simulates the perception capability of the human cochlea.
6. The method of claim 5, wherein the audio feature extraction algorithm that simulates the perception capability of the human cochlea is Mel-frequency cepstral coefficient (MFCC) extraction, and the first audio features are Mel-frequency cepstral coefficients (MFCCs).
7. The method of any of claims 1-6, wherein identifying the audio to determine the type of the audio to be identified based on at least one of first audio features extracted from the first frequency band range portion and second audio features extracted from the second frequency band range portion comprises:
matching the first audio features or the second audio features of the audio to be identified with first audio features corresponding to a first audio type, and determining that the type of the audio to be identified is the first audio type when the matching degree is greater than a first matching degree threshold.
8. The method of any of claims 1-6, wherein identifying the audio to determine the type of the audio to be identified based on at least one of first audio features extracted from the first frequency band range portion and second audio features extracted from the second frequency band range portion comprises:
fusing the first audio features and the second audio features to obtain fused audio features, matching the fused audio features with second audio features corresponding to a second audio type, and determining that the type of the audio to be identified is the second audio type when the matching degree is greater than a second matching degree threshold.
9. The method of any of claims 1-6, wherein identifying the audio to determine the type of the audio to be identified based on at least one of first audio features extracted from the first frequency band range portion and second audio features extracted from the second frequency band range portion comprises:
inputting the first audio features, the second audio features, or fused audio features of the first audio features and the second audio features into a neural network model to obtain the type of the audio to be identified.
10. The method of any of claims 1-9, wherein the audio to be identified comprises noise.
11. An audio recognition apparatus, comprising:
an acquisition module, used for acquiring audio to be identified;
a separation module, used for separating a first frequency band range portion and a second frequency band range portion from the audio to be identified, wherein the frequency of the frequency band contained in the first frequency band range portion is lower than the frequency of the frequency band contained in the second frequency band range portion; and
an identification module, used for identifying the audio to determine a type of the audio to be identified based on at least one of a first audio feature extracted from the first frequency band range portion and a second audio feature extracted from the second frequency band range portion.
12. A computer-readable medium having stored thereon instructions which, when executed on a computer, cause the computer to perform the audio recognition method of any one of claims 1 to 10.
13. An electronic device, comprising:
a processor coupled to a memory, the memory storing program instructions that, when executed by the processor, cause the electronic device to perform the audio recognition method of any of claims 1-10.
14. A chip system, comprising a processor and a data interface, wherein the processor reads instructions stored on a memory through the data interface to perform the method according to any one of claims 1 to 10.
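The following Python sketch is provided for readability only and is not part of the claims. It shows one possible way to realize the pipeline summarized in claims 1 to 9: separating the audio with a linear predictor into a lower-frequency portion and a higher-frequency residual portion, extracting MFCCs from the former and wavelet time-frequency features from the latter, and fusing the two feature sets. The librosa, scipy, and pywt calls, the LPC order, and the simple energy-based wavelet features are assumptions chosen for illustration; the claims do not prescribe these particulars.

# Illustrative sketch only; library calls and parameter choices are assumptions,
# not a definitive implementation of the claimed method.
import librosa
import numpy as np
import pywt
from scipy.signal import lfilter


def separate_with_linear_predictor(y: np.ndarray, order: int = 16):
    """Split audio into an LPC-predicted (lower-frequency, envelope-like) portion
    and the prediction residual, which retains the higher-frequency harmonic detail."""
    a = librosa.lpc(y, order=order)       # linear-prediction coefficients, a[0] == 1
    residual = lfilter(a, [1.0], y)       # inverse filtering yields the prediction residual
    predicted = y - residual              # remaining portion follows the spectral envelope
    return predicted, residual


def fused_features(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    low_portion, high_portion = separate_with_linear_predictor(y)
    # First audio features: MFCCs from the lower-frequency portion (averaged over frames).
    mfcc = librosa.feature.mfcc(y=low_portion, sr=sr, n_mfcc=13).mean(axis=1)
    # Second audio features: wavelet sub-band log energies from the residual portion.
    coeffs = pywt.wavedec(high_portion, "db4", level=4)
    tf = np.array([np.log1p(np.sum(c ** 2)) for c in coeffs])
    return np.concatenate([mfcc, tf])     # fused feature vector

The fused vector could then be matched against stored reference features or fed to a neural network classifier, which is the kind of alternative that claims 7 to 9 describe.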
CN202010759752.1A 2020-07-31 2020-07-31 Audio recognition method and device, medium and chip system thereof Pending CN114067782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010759752.1A CN114067782A (en) 2020-07-31 2020-07-31 Audio recognition method and device, medium and chip system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010759752.1A CN114067782A (en) 2020-07-31 2020-07-31 Audio recognition method and device, medium and chip system thereof

Publications (1)

Publication Number Publication Date
CN114067782A true CN114067782A (en) 2022-02-18

Family

ID=80227647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010759752.1A Pending CN114067782A (en) 2020-07-31 2020-07-31 Audio recognition method and device, medium and chip system thereof

Country Status (1)

Country Link
CN (1) CN114067782A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086887A (en) * 2022-05-11 2022-09-20 山东工商学院 Instant messaging system and method based on 5G local area network
CN115086887B (en) * 2022-05-11 2023-11-24 山东工商学院 Instant messaging system based on 5G local area network
CN115116232A (en) * 2022-08-29 2022-09-27 深圳市微纳感知计算技术有限公司 Voiceprint comparison method, device and equipment for automobile whistling and storage medium
CN115116232B (en) * 2022-08-29 2022-12-09 深圳市微纳感知计算技术有限公司 Voiceprint comparison method, device and equipment for automobile whistling and storage medium

Similar Documents

Publication Publication Date Title
RU2373584C2 (en) Method and device for increasing speech intelligibility using several sensors
JP4796309B2 (en) Method and apparatus for multi-sensor speech improvement on mobile devices
US11854550B2 (en) Determining input for speech processing engine
CN111201565A (en) System and method for sound-to-sound conversion
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
CN108711429B (en) Electronic device and device control method
US20160314781A1 (en) Computer-implemented method, computer system and computer program product for automatic transformation of myoelectric signals into audible speech
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
JP2003255993A (en) System, method, and program for speech recognition, and system, method, and program for speech synthesis
WO2022033556A1 (en) Electronic device and speech recognition method therefor, and medium
CN111508511A (en) Real-time sound changing method and device
Zhang et al. Sensing to hear: Speech enhancement for mobile devices using acoustic signals
CN114666695A (en) Active noise reduction method, device and system
CN114067782A (en) Audio recognition method and device, medium and chip system thereof
US20230164509A1 (en) System and method for headphone equalization and room adjustment for binaural playback in augmented reality
CN111883135A (en) Voice transcription method and device and electronic equipment
Fu et al. SVoice: enabling voice communication in silence via acoustic sensing on commodity devices
CN110728993A (en) Voice change identification method and electronic equipment
CN110956949B (en) Buccal type silence communication method and system
JP2024504435A (en) Audio signal generation system and method
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
CN115331672B (en) Device control method, device, electronic device and storage medium
CN111508503B (en) Method and device for identifying same speaker
CN117014761B (en) Interactive brain-controlled earphone control method and device, brain-controlled earphone and storage medium
CN116530944B (en) Sound processing method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination