CN117423355A - Audio signal identification method and device, smoke kitchen system and computer equipment - Google Patents


Info

Publication number
CN117423355A
Authority
CN
China
Prior art keywords
result
audio signal
frequency spectrum
target sound
coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311259178.3A
Other languages
Chinese (zh)
Inventor
袁帅
李思聪
王志轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Micronano Core Electronic Technology Co ltd
Original Assignee
Wuxi Micronano Core Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Micronano Core Electronic Technology Co ltd filed Critical Wuxi Micronano Core Electronic Technology Co ltd
Priority to CN202311259178.3A
Publication of CN117423355A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 · specially adapted for particular use
    • G10L 25/51 · · for comparison or discrimination
    • G10L 25/03 · characterised by the type of extracted parameters
    • G10L 25/24 · · the extracted parameters being the cepstrum
    • G10L 25/27 · characterised by the analysis technique
    • G10L 25/30 · · using neural networks

Abstract

The application relates to an audio signal identification method and device, a smoke kitchen system, and computer equipment. The method comprises the following steps: acquiring an initial audio signal, performing feature extraction on it to obtain at least two spectrum frame results, and arranging the spectrum frame results along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal; inputting the spectral coefficients into a fully trained target sound discrimination network to obtain a target sound confidence result corresponding to the spectral coefficients, and calculating a similarity result of the spectral coefficients in the time dimension; and identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result. With this method, the accuracy of target sound recognition can be improved.

Description

Audio signal identification method and device, smoke kitchen system and computer equipment
Technical Field
The present disclosure relates to the field of voice recognition technologies, and in particular, to an audio signal recognition method, an audio signal recognition device, a smoke kitchen system, and a computer device.
Background
With the development of artificial intelligence technology, speech recognition has become increasingly widespread: after a segment of audio signal is obtained, it can be automatically judged whether the signal contains a required target sound. In practical applications, however, target sound recognition is inevitably affected by environmental noise, which reduces its accuracy and thus lowers the accuracy of audio signal identification.
At present, no effective solution has been proposed for the problem of low accuracy in discriminating the target sound.
Disclosure of Invention
Based on the foregoing, it is necessary to provide an audio signal recognition method, an audio signal recognition device, a smoke kitchen system and a computer device.
In a first aspect, the present application provides an audio signal recognition method. The method comprises the following steps:
acquiring an initial audio signal, performing feature extraction on the initial audio signal to obtain at least two spectrum frame results, and arranging the spectrum frame results along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal;
inputting the spectral coefficients into a fully trained target sound discrimination network to obtain a target sound confidence result corresponding to the spectral coefficients, and calculating a similarity result of the spectral coefficients in the time dimension;
and identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result.
In one embodiment, performing feature extraction on the initial audio signal to obtain at least two spectrum frame results, and arranging the spectrum frame results along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal, comprises:
performing framing and windowing on the initial audio signal to obtain a windowed sound-information result;
performing a Fourier transform on the windowed result to obtain the spectrum frame results;
and arranging the spectrum frame results along the time dimension and filtering them to obtain the spectral coefficients of the initial audio signal.
In one embodiment, calculating the similarity result of the spectral coefficients in the time dimension comprises:
centering the spectral coefficients to obtain a centred distribution result, and calculating the covariance matrix of the centred distribution result;
and obtaining the similarity result from the covariance matrix.
In one embodiment, the similarity result comprises at least one eigenvalue result, and identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result comprises:
if the target sound confidence result is smaller than a preset confidence threshold, or the maximum eigenvalue among the eigenvalue results of the similarity result is larger than a preset eigenvalue threshold, obtaining an audio signal identification result indicating that the initial audio signal is not the target audio signal;
if the target sound confidence result is larger than or equal to the confidence threshold and the maximum eigenvalue of the similarity result is smaller than or equal to the eigenvalue threshold, obtaining an audio signal identification result indicating that the initial audio signal is the target audio signal.
In one embodiment, the method further comprises:
acquiring a preset spectral-coefficient training set, the training set carrying feature labels for the target sound;
inputting the training set into a preset initial target sound discrimination network for training to obtain a spectral-coefficient prediction result, calculating a loss function result from the prediction result and the feature labels, and back-propagating the gradient of the loss function result into the initial target sound discrimination network for iterative training, thereby generating the fully trained target sound discrimination network.
In one embodiment, calculating the similarity result of the spectral coefficients in the time dimension comprises:
obtaining a distance total result corresponding to the spectral coefficients from the variation amplitude of the spectrum frame results, and obtaining the similarity result from the distance total result.
In one embodiment, obtaining the distance total result corresponding to the spectral coefficients from the variation amplitude of the spectrum frame results comprises:
when the variation amplitude is a distance difference, squaring and summing the distance differences between adjacent spectrum frame results to obtain the distance total result corresponding to the spectral coefficients;
and when the variation amplitude is a frequency entropy value, summing the entropy values of all the spectrum frame results to obtain the distance total result corresponding to the spectral coefficients.
In a second aspect, the present application further provides an audio signal identification apparatus, comprising:
an acquisition module, used for acquiring an initial audio signal, performing feature extraction on the initial audio signal to obtain at least two spectrum frame results, and arranging the spectrum frame results along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal;
a calculation module, used for inputting the spectral coefficients into a fully trained target sound discrimination network to obtain a target sound confidence result corresponding to the spectral coefficients, and calculating a similarity result of the spectral coefficients in the time dimension;
and a generating module, used for identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result.
In a third aspect, the present application also provides a range system comprising a range hood and a cooktop;
the range hood is connected with the cooktop, is used for acquiring the target sound generated at the cooktop, and executes, based on the target sound, the following steps:
acquiring an initial audio signal, performing feature extraction on the initial audio signal to obtain at least two spectrum frame results, and arranging the spectrum frame results along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal;
inputting the spectral coefficients into a fully trained target sound discrimination network to obtain a target sound confidence result corresponding to the spectral coefficients, and calculating a similarity result of the spectral coefficients in the time dimension;
and identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result.
In a fourth aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor when executing the computer program performs the steps of:
acquiring an initial audio signal, performing feature extraction on the initial audio signal to obtain at least two spectrum frame results, and arranging the spectrum frame results along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal;
inputting the spectral coefficients into a fully trained target sound discrimination network to obtain a target sound confidence result corresponding to the spectral coefficients, and calculating a similarity result of the spectral coefficients in the time dimension;
and identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result.
The above audio signal identification method and apparatus, smoke kitchen system, and computer device first acquire the spectral coefficients corresponding to the initial audio signal; the spectral coefficients are input into the target sound discrimination network to obtain a target sound confidence result, while the similarity result of the spectral coefficients in the time dimension is calculated; finally, the initial audio signal is identified according to the target sound confidence result and the similarity result to obtain an audio signal identification result indicating whether the initial audio signal is noise or the target sound. Because the spectral coefficients are processed in parallel by the neural network and by the similarity calculation, i.e. identified simultaneously by two completely different methods, the accuracy of identifying the initial audio signal is improved; further, correcting the output of the neural network with the similarity result computed in the time dimension ensures high identification accuracy while reducing the computational cost of the method.
Drawings
FIG. 1 is a diagram of an application environment for an audio signal recognition method in one embodiment;
FIG. 2 is a flow chart of a method for identifying audio signals according to one embodiment;
FIG. 3 is a flowchart illustrating a method for obtaining spectral coefficients in an audio signal recognition method according to an embodiment;
FIG. 4 is a flowchart of a method for obtaining a similarity result in an audio signal recognition method according to an embodiment;
FIG. 5 is a flow chart of a method for identifying audio signals in a preferred embodiment;
FIG. 6 is a schematic diagram of bandpass filtering in an audio signal recognition method according to one embodiment;
FIG. 7 is a block diagram of an audio signal recognition device in one embodiment;
FIG. 8 is a block diagram of a smoke kitchen system in one embodiment;
fig. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The audio signal identification method provided by the embodiments of the application can be applied in the environment shown in fig. 1, where the terminal 102 communicates with the server 104 via a network. The data storage system may store the data that the server 104 needs to process, and may be integrated on the server 104 or located on a cloud or other network server. First, an initial audio signal is acquired and feature extraction is performed on it to obtain the spectral coefficients; the spectral coefficients are then input into the fully trained target sound discrimination network to obtain the target sound confidence result and the similarity result corresponding to the spectral coefficients; finally, the initial audio signal is identified based on the target sound confidence result and the similarity result to obtain the identification result of the initial audio signal. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, internet-of-things device, or portable wearable device; internet-of-things devices include smart speakers, smart televisions, smart air conditioners, and smart in-vehicle devices, and portable wearable devices include smart watches, smart bracelets, and headsets. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, an audio signal identification method is provided. Taking its application to the server in fig. 1 as an example, the method includes the following steps:
Step S202, an initial audio signal is acquired, feature extraction is performed on it to obtain at least two spectrum frame results, and the spectrum frame results are arranged along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal.
The initial audio signal may be a piece of voice data obtained in advance by a user, or data containing other sound information; feature extraction on this signal yields multiple spectrum frame results. Further, in practice the spectral coefficients are typically chosen to be mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, MFCC), obtained by sorting the extracted spectrum frame results along the time dimension. The MFCC coefficients can be understood as a set of two-dimensional vectors: the vectors forming the MFCC matrix are the spectrum frame results above, arranged in time order, so the MFCC coefficients contain data in both the time and frequency dimensions.
Step S204, the spectral coefficients are input into the fully trained target sound discrimination network to obtain the target sound confidence result corresponding to the spectral coefficients, and the similarity result of the spectral coefficients in the time dimension is calculated.
A CNN or DNN is typically chosen as the fully trained target sound discrimination network, which is a neural network for identifying the target sound, generated by training on samples containing the target sound. Specifically, after the spectral coefficients are input into the target sound discrimination network through the above steps, the network performs feature extraction on them, carries out class discrimination, and outputs the target sound confidence result. The target sound may be the sound of a fire igniting, a speaking voice, or other sounds carrying target information, and the confidence result is usually expressed as the probability that the initial audio signal is the target signal. The similarity result, in turn, exploits the fact that noise signals show higher similarity in the time dimension.
Step S206, the initial audio signal is identified according to the target sound confidence result and the similarity result, and an audio signal identification result is obtained.
If the confidence result is smaller than a preset threshold, i.e. the probability that the initial audio signal is the target sound is low, the signal can be judged to be noise; likewise, if the similarity result is greater than a preset threshold, the similarity in the time dimension is too high, the signal is very likely noise, and the initial audio signal is judged to be a noise signal.
Through steps S202 to S206, computing the spectral coefficients of the initial audio signal amplifies, from the frequency-domain perspective, the differences between the different kinds of information the signal contains. Applying the spectral coefficients to two different discrimination methods at the same time improves recognition efficiency while avoiding the larger errors a single discrimination method incurs, and improves the accuracy of target sound recognition even when the sample set is small, thereby reducing the maintenance cost of audio signal recognition.
In one embodiment, performing feature extraction on the initial audio signal to obtain at least two spectrum frame results, and arranging the spectrum frame results along the time dimension to obtain the spectral coefficients corresponding to the initial audio signal, comprises:
performing framing and windowing on the initial audio signal to obtain a windowed sound-information result;
performing a Fourier transform on the windowed result to obtain the spectrum frame results;
and arranging the spectrum frame results along the time dimension and filtering them to obtain the spectral coefficients of the initial audio signal.
Specifically, fig. 3 provides a method for obtaining the spectral coefficients. After the initial audio signal is obtained, it is split into frames and each frame is windowed, in practice usually with a Hamming window. After windowing, a fast Fourier transform is applied to each frame of data to obtain the spectrum frame results, converting the signal from the time domain to the frequency domain; this sharpens the distinction between target sound and noise signals and facilitates their subsequent discrimination. The spectrum frame results are then ordered along the time dimension and band-pass filtered. Taking a mel filterbank as an example, the frequency response of the filterbank is applied, the logarithmic energy of the filtered data is computed, and the spectral coefficients are obtained through a discrete cosine transform. The initial audio signal is thus converted into a frequency-domain representation. In practice, since a Fourier transform over the whole signal is not meaningful, and also to save computational resources, the transform is performed over short time frames; the frame length may be set manually according to the application, usually 20 to 40 ms, with some overlap between consecutive frames. The Hamming window also improves continuity at the left and right ends of each frame. The final filtering step smooths the spectrum, highlights the formants of the original audio signal, and further reduces the amount of computation.
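As a concrete illustration of the pipeline just described (framing, Hamming windowing, FFT, mel filtering, log energy, DCT), the following numpy-only sketch computes MFCC-like coefficients. The frame length, hop size, filter count, and coefficient count are illustrative assumptions, not values fixed by this disclosure:

```python
import numpy as np

def mfcc_like(signal, sr=16000, frame_len=400, hop=160, n_filt=26, n_coeff=13):
    """Sketch of the extraction pipeline: framing -> Hamming window -> FFT
    -> mel filterbank -> log energy -> DCT. Sizes are assumptions."""
    # Split the signal into overlapping frames (25 ms frames, 10 ms hop at 16 kHz).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)            # window for frame-edge continuity
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # per-frame power spectrum

    # Triangular mel filterbank between 0 Hz and sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filt + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, power.shape[1]))
    for j in range(n_filt):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_e = np.log(power @ fbank.T + 1e-10)            # log filterbank energies

    # DCT-II over the filter axis keeps the first n_coeff cepstral coefficients.
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeff), (2 * n + 1) / (2 * n_filt)))
    return log_e @ dct.T                               # shape: (frames, n_coeff)
```

For a 1-second signal at 16 kHz with these settings, the result is a (frames × coefficients) matrix whose rows are the spectrum frame results arranged in time order, i.e. the two-dimensional time-frequency representation the description refers to.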
In one embodiment, calculating a similarity result of spectral coefficients in a time dimension includes:
carrying out centering treatment on the frequency spectrum coefficient to obtain a center distribution result, and calculating a covariance matrix of the center distribution result;
and obtaining a similarity result according to the covariance matrix.
Specifically, fig. 4 provides a method for obtaining the similarity result. After the spectral coefficients are obtained, they are centred so that their mean becomes 0:

x_i' = x_i − (1/N) · Σ_{i=1}^{N} x_i

where N indicates the total number of data points and x_i refers to the i-th value in the spectral coefficients. The covariance matrix XX^T is then computed and subjected to eigenvalue decomposition to obtain its eigenvalues, and the similarity result is obtained from those eigenvalues. In this eigenvalue-decomposition approach, the two-dimensional spectral coefficients are treated directly as an image and subjected to principal component analysis, which outputs eigenvalues and eigenvectors. On top of the neural-network judgment of the initial audio signal, this identifies the signal by a method completely different from neural-network prediction, using the eigenvalues as the basis, which further improves the accuracy of the identification result.
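A minimal sketch of this similarity computation, assuming global mean-centering and a frame-by-frame covariance matrix (the text leaves both details open):

```python
import numpy as np

def similarity_eigenvalues(coeffs):
    """Centre the time-frequency coefficient matrix, form the covariance
    matrix X X^T over the frame axis, and return its eigenvalues in
    descending order. A dominant top eigenvalue means the frames are
    highly similar over time, i.e. the signal is noise-like."""
    X = coeffs - coeffs.mean()            # centering: subtract the global mean
    cov = X @ X.T / X.shape[1]            # covariance-like matrix between frames
    eigvals = np.linalg.eigvalsh(cov)     # symmetric matrix -> real eigenvalues
    return eigvals[::-1]                  # largest first
```

For a matrix of identical frames (the extreme noise-like case), all the variance collapses into the first eigenvalue and the rest are near zero.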
In one embodiment, the similarity result comprises at least one eigenvalue result, and identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result comprises:
if the target sound confidence result is smaller than a preset confidence threshold, or the maximum eigenvalue among the eigenvalue results of the similarity result is larger than a preset eigenvalue threshold, obtaining an audio signal identification result indicating that the initial audio signal is not the target audio signal;
if the target sound confidence result is larger than or equal to the confidence threshold and the maximum eigenvalue of the similarity result is smaller than or equal to the eigenvalue threshold, obtaining an audio signal identification result indicating that the initial audio signal is the target audio signal.
Specifically, the similarity result consists of multiple eigenvalue results. Because noise information is more similar in the time dimension, an MFCC map containing noise is easier to express at low rank; that is, the maximum eigenvalue of a noise MFCC map is larger than that of the target sound. The confidence threshold and the eigenvalue threshold may be set by a technician according to the actual situation. Only when the output of the neural network and the eigenvalue result output by principal component analysis both satisfy the preset threshold requirements can the initial audio signal be judged to be the target sound; if either requirement is not met, the signal is directly judged to be noise. This effectively avoids the errors of a single recognition method; further, with limited samples, it reduces the probability of recognition errors and the cost of the audio signal recognition method, while using the eigenvalues of the MFCC image as an auxiliary judgment effectively suppresses interference from various environmental noises.
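The two-threshold decision rule just described can be sketched as follows; the threshold values here are placeholders a technician would tune, not values from this disclosure:

```python
def identify(confidence, eigenvalues, conf_thresh=0.5, eig_thresh=10.0):
    """Decision rule from the embodiment: the signal is accepted as the
    target sound only if the network confidence reaches the confidence
    threshold AND the largest eigenvalue of the similarity result stays
    at or below the eigenvalue threshold."""
    if confidence < conf_thresh or max(eigenvalues) > eig_thresh:
        return "not target sound"   # either check failing marks it as noise
    return "target sound"
```

Note that the two checks are combined with OR on the rejection side: failing either one is enough to classify the signal as noise, which is what lets the eigenvalue test correct a false positive from the network.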
In one embodiment, the method further comprises:
acquiring a preset spectral-coefficient training set, the training set carrying feature labels for the target sound;
inputting the training set into a preset initial target sound discrimination network for training to obtain a spectral-coefficient prediction result, calculating a loss function result from the prediction result and the feature labels, and back-propagating the gradient of the loss function result into the initial target sound discrimination network for iterative training, thereby generating the fully trained target sound discrimination network.
Specifically, when the spectral-coefficient training set is built, a large number of speech signals are usually collected first and labelled as target sound or noise; the MFCC coefficients corresponding to each segment are then computed and input into the initial target sound discrimination network for supervised training, yielding a fully trained target sound discrimination network. Further, in practice the number of MFCC coefficients required differs depending on whether the target sound information is concentrated at high frequency, low frequency, or across the full band: human voice is generally concentrated between 40 Hz and 4000 Hz, ignition sound between 500 Hz and 6000 Hz, and the technician may choose the number accordingly. In this way, training of the target sound discrimination network is realised; treating the MFCC coefficients as an image input to the network allows the original data to be augmented, improving the generalisation ability of the model, so that training can be completed efficiently at low cost.
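The disclosure trains a CNN or DNN; as a minimal runnable stand-in, the following sketch trains a logistic classifier on flattened MFCC matrices to illustrate the described loop of prediction, loss computation, and gradient back-propagation. The model choice, learning rate, and epoch count are assumptions for illustration only:

```python
import numpy as np

def train_discriminator(mfcc_batch, labels, epochs=200, lr=0.1):
    """Supervised training loop stand-in: predict a target-sound
    confidence, compute the cross-entropy gradient against the labels,
    and propagate it back into the weights, iterating to convergence."""
    X = mfcc_batch.reshape(len(mfcc_batch), -1)   # flatten each MFCC matrix
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted confidence per sample
        grad = p - labels                          # gradient of cross-entropy loss
        w -= lr * X.T @ grad / len(labels)         # back-propagate into the weights
        b -= lr * grad.mean()
    return w, b

def predict_confidence(w, b, mfcc):
    """Target-sound confidence for a single MFCC matrix."""
    return 1.0 / (1.0 + np.exp(-(mfcc.ravel() @ w + b)))
```

A real deployment would replace the logistic model with the CNN/DNN the text names, but the label handling and the gradient-descent iteration follow the same pattern.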
In one embodiment, calculating the similarity result of the spectral coefficients in the time dimension comprises:
obtaining a distance total result corresponding to the spectral coefficients from the variation amplitude of the spectrum frame results, and obtaining the similarity result from the distance total result.
Specifically, noise information has higher similarity in the time dimension, whereas the target sound has lower similarity there and changes markedly (e.g. voice or the crackle of ignition), so the two can be distinguished by their similarity results in the time dimension. The distance total result indicates the magnitude of the initial audio signal's variation over a period of time, enabling an accurate and fast judgment of whether the signal is noise or the target sound. This method exploits the characteristics of noise signals and identifies the signal from another angle, namely the variation amplitude of the initial audio signal in the time dimension, on top of the neural-network identification, further improving the accuracy with which the application identifies the initial audio signal.
In one embodiment, obtaining a distance total result corresponding to the frequency spectrum coefficient according to the variation amplitude of the frequency spectrum frame result comprises:
under the condition that the variation amplitude is a distance difference value, squaring and summing the distance difference values between adjacent frequency spectrum frame results to obtain a distance total result corresponding to the frequency spectrum coefficient;
and under the condition that the variation amplitude is the frequency entropy value, summing the entropy values of all the spectrum frame results to obtain a distance total result corresponding to the spectrum coefficient.
Specifically, the present application provides two specific methods for discriminating the initial audio signal, both of which exploit the high similarity of noise signals in the time dimension. In practical application, the distance difference value may be a Euclidean distance: the squared differences between adjacent spectrum frame results are summed, and the smaller the total distance, the more likely the signal is noise. Alternatively, the frequency entropy value may be calculated as the entropy of the MFCC coefficients at each frequency, summed over all frequencies; the smaller the total entropy, the more likely the signal is noise. In this way, two ways of judging whether the initial audio signal is a noise signal are provided, both based on the idea that a noise signal changes little in the time dimension, so that the judgment result can be obtained accurately and quickly without additional devices or calculation steps, improving the efficiency of speech recognition.
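The two distance total measures above can be sketched as follows. The MFCC matrix shape and the data are illustrative; for the frequency entropy, the distribution (coefficient magnitudes at each frequency, normalized over time) is an assumption, since the text does not specify how the distribution is formed.

```python
import numpy as np

def distance_total_euclidean(mfcc):
    # Sum of squared differences between adjacent spectrum frame results:
    # a small total indicates a signal that barely changes over time (noise).
    diffs = np.diff(mfcc, axis=0)
    return float(np.sum(diffs ** 2))

def distance_total_entropy(mfcc, eps=1e-12):
    # Entropy of the coefficients at each frequency, summed over frequencies.
    # Assumption: magnitudes at each frequency are normalized over time to
    # form the distribution, since the text leaves this unspecified.
    e = np.abs(mfcc) + eps
    p = e / e.sum(axis=0, keepdims=True)
    return float(-(p * np.log(p)).sum())

steady = np.ones((10, 13))                      # noise-like: identical frames
varying = np.vstack([np.ones((5, 13)), 5 * np.ones((5, 13))])
d_steady = distance_total_euclidean(steady)     # no change over time
d_varying = distance_total_euclidean(varying)   # obvious change over time
```

With identical frames the Euclidean distance total is exactly zero, while the varying signal yields a large total, matching the noise-versus-target intuition above.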
This embodiment also provides a specific example of the audio signal recognition method. As shown in fig. 5, fig. 5 is a flow chart of the audio signal recognition method in a preferred embodiment.
First, voice data of a certain duration is collected at a certain sampling rate; the voice data may be preprocessed in advance, and the MFCC coefficients of the voice data are then obtained through the MFCC calculation flow. Specifically, the MFCC calculation flow is as follows. Pre-emphasis processing is first performed on the collected initial audio signal:
H(z) = 1 - μz^(-1)
wherein the pre-emphasis processing is implemented by the z-transform. Pre-emphasis passes the voice signal through a high-pass filter to boost the high-frequency part so that the spectrum becomes flatter, allowing the spectrum to be obtained with the same signal-to-noise ratio over the whole band from low frequency to high frequency. μ is generally taken as 0.97. The speech is then split into frames, and each frame is windowed (typically with a Hamming window):

w(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1

Different values of a produce different Hamming windows; typically a is taken as 0.46. Each frame of windowed data is then subjected to a fast Fourier transform (FFT) to obtain the energy distribution of each frame over the spectrum:

X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N - 1
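The pre-emphasis, framing, windowing and FFT steps above can be sketched as follows. The sampling rate, frame length, hop and FFT size are illustrative assumptions not fixed by the text; only μ = 0.97 and a = 0.46 come from the description.

```python
import numpy as np

fs = 16000
signal = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s test tone (stand-in input)

# Pre-emphasis: y[n] = x[n] - mu * x[n-1], i.e. the filter H(z) = 1 - mu z^-1
mu = 0.97
emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

# Framing: assumed 25 ms frames with a 10 ms hop at 16 kHz
frame_len, hop = 400, 160
n_frames = 1 + (len(emphasized) - frame_len) // hop
idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
frames = emphasized[idx]

# Hamming window with a = 0.46, as in the text
a = 0.46
n = np.arange(frame_len)
window = (1 - a) - a * np.cos(2 * np.pi * n / (frame_len - 1))
windowed = frames * window

# FFT power spectrum: energy distribution of each frame over the spectrum
nfft = 512
power_spectrum = np.abs(np.fft.rfft(windowed, nfft)) ** 2 / nfft
```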
the obtained frequency domain data is subjected to band-pass filtering through a mel filter bank, the bandwidth of the triangular band-pass filter is larger and larger along with the increase of frequency, and the band-pass filtering schematic diagram of the mel filter bank is shown in fig. 6.
The frequency response of the m-th triangular mel filter is as follows:

H_m(k) = 0, for k < f(m-1) or k > f(m+1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1),

where f(m) denotes the center frequency of the m-th filter.
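A triangular mel filter bank of the kind shown in fig. 6 can be sketched as follows: the filters are spaced uniformly on the mel scale, so their bandwidth in Hz grows with frequency, as described above. The filter count, FFT size and sampling rate are illustrative assumptions.

```python
import numpy as np

def mel_filterbank(n_filters=26, nfft=512, fs=16000):
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # n_filters + 2 points uniformly spaced on the mel scale give the
    # left edge, center and right edge of each triangular filter.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)

    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fbank = mel_filterbank()
```

The last (highest-frequency) filter covers many more FFT bins than the first, which is exactly the growing-bandwidth behavior the schematic illustrates.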
the filtered data firstly calculates logarithmic energy, and then a group of two-dimensional MFCC coefficients are obtained through discrete cosine transform DCT:
c (u) may be assigned by a person skilled in the art in practice, for example when u=0,otherwise->Where u refers to the MFCC coefficient on u frequency.
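The log-energy and DCT step can be sketched as follows, using the c(u) normalization just described (c(0) = √(1/M), otherwise √(2/M)). The filter-bank energies here are random illustrative data, and keeping 13 coefficients is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 26                                        # number of mel filters (assumed)
filter_energies = rng.uniform(0.1, 10.0, size=(98, M))  # frames x filters (stand-in)
log_energy = np.log(filter_energies)          # logarithmic energy per filter

n_coeffs = 13                                 # retain the first 13 MFCC coefficients
u = np.arange(n_coeffs)[:, None]
m = np.arange(M)[None, :]
basis = np.cos(np.pi * u * (m + 0.5) / M)     # DCT-II cosine basis
c = np.where(u == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))  # c(u) normalization
mfcc = log_energy @ (c * basis).T             # two-dimensional MFCC coefficients
```

With this choice of c(u) the DCT basis rows are orthonormal, which is why this normalization is the conventional one.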
After the MFCC coefficients are obtained, they are input into two paths: one path is input into a neural network, which may be a CNN or a DNN, to be determined by the relevant technician in practical application; the other path is input into principal component analysis, i.e. eigenvalue decomposition. Specifically, the principal component analysis treats the two-dimensional MFCC coefficients directly as an image: the coefficients are first centered so that the mean of the values becomes 0:

x'(i, j) = x(i, j) - x̄
A covariance matrix is then calculated, and eigenvalue decomposition is performed on the covariance matrix to obtain the maximum eigenvalue. The path input to the neural network outputs a confidence coefficient, and the path input to principal component analysis outputs the eigenvalues and eigenvectors. The output of the neural network is judged first: if the confidence coefficient is smaller than the threshold value, a non-target sound is directly determined; if it is greater than the threshold value, the magnitude of the maximum eigenvalue is then judged, and if the maximum eigenvalue is greater than its threshold value a non-target sound is determined, while if it is smaller than the threshold value the initial audio signal is determined to be the target sound.

Through the above method, on the one hand, a supplementary judgment can be made on the basis of the neural network judgment; principal component analysis is a judgment method completely different from the neural network, and introducing it can further improve the accuracy of target sound judgment. On the other hand, the method maintains high speech recognition accuracy even when the training data of the neural network is sparse, is better suited to practical application, and reduces the calculation cost required for speech recognition.
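The principal component analysis path and the two-threshold decision above can be sketched as follows. The threshold values are illustrative assumptions; the eigenvalue rule follows the text, where a maximum eigenvalue above its threshold yields a non-target result.

```python
import numpy as np

def max_eigenvalue(mfcc):
    centered = mfcc - mfcc.mean(axis=0)        # centering: column means become 0
    cov = np.cov(centered, rowvar=False)       # covariance matrix of the coefficients
    return float(np.linalg.eigvalsh(cov)[-1])  # eigvalsh sorts ascending; take largest

def is_target_sound(confidence, mfcc, conf_threshold=0.5, eig_threshold=10.0):
    # Network path is judged first: low confidence rejects immediately.
    if confidence < conf_threshold:
        return False
    # PCA path: per the rule above, a maximum eigenvalue above its
    # threshold indicates a non-target sound.
    return max_eigenvalue(mfcc) <= eig_threshold
```

A perfectly steady MFCC matrix has zero covariance, hence a zero maximum eigenvalue, so the PCA path accepts it whenever the network confidence clears its threshold.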
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or sub-steps.
Based on the same inventive concept, an embodiment of the present application also provides an audio signal recognition device for implementing the audio signal recognition method described above. The implementation of the solution provided by the device is similar to that described for the above method, so for the specific limitations in the embodiments of the one or more audio signal recognition devices provided below, reference may be made to the limitations of the audio signal recognition method above, which will not be repeated here.
In one embodiment, as shown in fig. 7, there is provided an apparatus for recognizing an audio signal, including:
the obtaining module 71 is configured to obtain an initial audio signal, perform feature extraction processing on the initial audio signal to obtain at least two spectrum frame results, and arrange the spectrum frame results according to a time dimension to obtain spectrum coefficients corresponding to the initial audio signal;
the calculation module 72 is configured to input the spectrum coefficients into the fully trained target sound discrimination network to obtain a target sound confidence result corresponding to the spectrum coefficients, and to calculate a similarity result of the spectrum coefficients in the time dimension;
the generating module 73 is configured to identify the initial audio signal according to the target sound confidence result and the similarity result, so as to obtain an audio signal identification result.
Specifically, after the obtaining module 71 acquires the initial audio signal, it obtains at least two spectrum frame results from the initial audio signal and arranges them according to the time dimension to obtain the spectrum coefficients. In practical application, the spectrum coefficients are MFCC coefficients, which form a set of two-dimensional vectors in which each vector is a spectrum frame result; in subsequent processing the spectrum coefficients may also be treated as an image. The obtaining module 71 sends the spectrum coefficients to the calculation module 72, which inputs them into the target sound discrimination network to obtain the target sound confidence result while simultaneously calculating the similarity result of the spectrum coefficients in the time dimension; the calculation method may be eigenvalue decomposition, Euclidean distance calculation, or the like. After the neural network recognition and the time-dimension similarity calculation are completed in parallel, the calculation module 72 inputs the target sound confidence result and the similarity result into the generation module 73, and the generation module 73 obtains the audio signal recognition result from them.
With the above device, after the spectrum coefficients are obtained they are applied simultaneously to two different judgment methods. While improving recognition efficiency, this avoids the larger error that a single judgment method may incur, and can improve the accuracy of target sound recognition even when the sample set is sparse, thereby reducing the maintenance cost of audio signal recognition.
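The cooperation of the three modules can be sketched structurally as follows; the discrimination network is replaced by a stub returning a fixed confidence, and all names, shapes and thresholds are illustrative assumptions rather than the device's actual implementation.

```python
import numpy as np

class ObtainingModule:                          # stands in for module 71
    def extract(self, audio, n_frames=10, n_mfcc=13):
        # Stand-in for feature extraction: reshape the signal into a 2-D
        # matrix of spectrum frame results arranged along the time dimension.
        return audio[: n_frames * n_mfcc].reshape(n_frames, n_mfcc)

class CalculationModule:                        # stands in for module 72
    def confidence(self, coeffs):
        return 0.8                              # stub for the discrimination network

    def similarity(self, coeffs):
        centered = coeffs - coeffs.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        return float(np.linalg.eigvalsh(cov)[-1])  # maximum eigenvalue result

class GenerationModule:                         # stands in for module 73
    def identify(self, confidence, similarity, conf_th=0.5, eig_th=10.0):
        return confidence >= conf_th and similarity <= eig_th

audio = np.zeros(1000)
coeffs = ObtainingModule().extract(audio)
calc = CalculationModule()
result = GenerationModule().identify(calc.confidence(coeffs), calc.similarity(coeffs))
```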
Each of the modules in the above audio signal recognition device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a smoke kitchen system is provided. As shown in fig. 8, the system comprises a smoke machine 81 and a kitchen range 84;
the smoke machine 81 is connected to the kitchen range 84, and is used for acquiring a target sound generated by the kitchen range 84 and performing the audio signal recognition method described above based on the target sound.
Specifically, in this embodiment, the target sound in the initial audio signal may be the ignition sound generated by the kitchen range 84. In practical application, a control device 82 and a recognition device 83 are further integrated on the smoke machine 81: the recognition device 83 calculates the audio signal recognition result based on the ignition sound generated by the kitchen range 84, and the control device 82 controls the operation of the smoke machine 81 in response to that result.
In this embodiment, considering that the audio signal recognition method of the present application is used to recognize the ignition sound of the kitchen range 84 in a kitchen environment, i.e. only the ignition sound needs to be distinguished from noise, the noise samples in the audio sample set could be enhanced when the set is established; however, while this increases the ignition sound hit rate it may also cause false activation. Because the ignition sound of the kitchen range 84 has the characteristic that its MFCC coefficients change obviously over time, while the MFCC coefficients of kitchen environmental noise change little over time, the method of the present application exploits this obvious distinction between ignition sound and noise, together with the fact that the ignition sounds of different kitchen ranges 84 are basically the same, and completes the recognition of the ignition sound by the two methods of applying the target sound discrimination network to the spectrum coefficients and calculating the time-dimension similarity. The method is efficient, easy to operate and highly accurate, can provide the user with an AI-based intelligent experience of smoke-range linkage, makes the response of the smoke machine 81 more accurate, and avoids misrecognition of the ignition sound and false triggering of the smoke machine 81.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a communication interface, a display screen and an input device connected by a system bus, wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program; the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements an audio signal recognition method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device may be a touch layer covering the display screen, keys, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should all be considered to be within the scope of this specification.
The above embodiments represent only a few implementations of the present application, and although they are described in relative detail, they should not be construed as limiting the scope of the patent application. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the concept of the present application, and all of them fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of audio signal recognition, the method comprising:
acquiring an initial audio signal, performing feature extraction processing on the initial audio signal to obtain at least two frequency spectrum frame results, and arranging the frequency spectrum frame results according to a time dimension to obtain frequency spectrum coefficients corresponding to the initial audio signal;
inputting the frequency spectrum coefficient into a fully trained target sound discrimination network to obtain a target sound confidence coefficient result corresponding to the frequency spectrum coefficient, and calculating a similarity result of the frequency spectrum coefficient in the time dimension;
and identifying the initial audio signal according to the target sound confidence coefficient result and the similarity result to obtain an audio signal identification result.
2. The method according to claim 1, wherein the performing feature extraction processing on the initial audio signal to obtain at least two spectral frame results, and arranging the spectral frame results according to a time dimension to obtain spectral coefficients corresponding to the initial audio signal, includes:
framing and windowing the initial audio signal to obtain a windowing result of sound information;
performing Fourier transform on the sound information windowing result to obtain the frequency spectrum frame result;
and arranging the frequency spectrum frame results according to the time dimension and performing filtering processing to obtain the frequency spectrum coefficients of the initial audio signal.
3. The method of claim 1, wherein said calculating a similarity result of said spectral coefficients in said time dimension comprises:
carrying out centering treatment on the frequency spectrum coefficient to obtain a center distribution result, and calculating a covariance matrix of the center distribution result;
and obtaining the similarity result according to the covariance matrix.
4. The method of claim 1, wherein the similarity result comprises at least one eigenvalue result; and wherein the identifying the initial audio signal according to the target sound confidence result and the similarity result to obtain an audio signal identification result comprises:
if the target sound confidence coefficient result is detected to be smaller than a preset confidence coefficient threshold value, or the maximum eigenvalue result in at least one eigenvalue result of the similarity results is larger than a preset eigenvalue threshold value, obtaining an audio signal identification result indicating that the initial audio signal is not the target audio signal;
and if the target sound confidence coefficient result is detected to be greater than or equal to the confidence coefficient threshold value, and the maximum eigenvalue result in the similarity result is detected to be less than or equal to the eigenvalue threshold value, obtaining an audio signal identification result indicating that the initial audio signal is a target audio signal.
5. The method according to claim 1, wherein the method further comprises:
acquiring a preset frequency spectrum coefficient training set, wherein the frequency spectrum coefficient training set carries a characteristic label aiming at the target sound;
and inputting the frequency spectrum coefficient training set into a preset initial target sound discrimination network for training to obtain a frequency spectrum coefficient prediction result, calculating a loss function result according to the frequency spectrum coefficient prediction result and the characteristic label, and reversely transmitting the gradient of the loss function result to the initial target sound discrimination network for iterative training to generate the target sound discrimination network with complete training.
6. The method of claim 1, wherein said calculating a similarity result of said spectral coefficients in a time dimension comprises:
and obtaining a distance total result corresponding to the frequency spectrum coefficient according to the variation amplitude of the frequency spectrum frame result, and obtaining the similarity result according to the distance total result.
7. The method of claim 6, wherein said deriving a distance summary corresponding to said spectral coefficients from a magnitude of change in said spectral frame result comprises:
under the condition that the variation amplitude is a distance difference value, squaring and summing the distance difference values between the adjacent spectrum frame results to obtain the distance total result corresponding to the spectrum coefficient;
and under the condition that the variation amplitude is a frequency entropy value, summing the entropy values of all the spectrum frame results to obtain the distance total result corresponding to the spectrum coefficient.
8. An audio signal recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring an initial audio signal, carrying out feature extraction processing on the initial audio signal to obtain at least two frequency spectrum frame results, and arranging the frequency spectrum frame results according to a time dimension to obtain frequency spectrum coefficients corresponding to the initial audio signal;
the calculation module, configured to input the frequency spectrum coefficient into a fully trained target sound discrimination network to obtain a target sound confidence coefficient result corresponding to the frequency spectrum coefficient, and to calculate a similarity result of the frequency spectrum coefficient in the time dimension;
and the generating module is used for identifying the initial audio signal according to the target sound confidence coefficient result and the similarity result to obtain an audio signal identification result.
9. A smoke kitchen system, characterized in that the smoke kitchen system comprises a smoke machine and a kitchen range;
the range hood is connected with the range for acquiring a target sound generated by the range and performing the audio signal recognition method according to any one of claims 1 to 7 based on the target sound.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
CN202311259178.3A 2023-09-26 2023-09-26 Audio signal identification method and device, smoke kitchen system and computer equipment Pending CN117423355A (en)

Publications (1)

Publication Number Publication Date
CN117423355A true CN117423355A (en) 2024-01-19

Family

ID=89527440



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination