CN114550731A - Audio identification method and device, electronic equipment and storage medium - Google Patents

Audio identification method and device, electronic equipment and storage medium

Info

Publication number
CN114550731A
CN114550731A (application CN202210343564.XA)
Authority
CN
China
Prior art keywords
audio, identified, determining, compression function, nonlinear
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210343564.XA
Other languages
Chinese (zh)
Inventor
张银辉
赵情恩
熊新雷
陈蓉
梁芸铭
周羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210343564.XA priority Critical patent/CN114550731A/en
Publication of CN114550731A publication Critical patent/CN114550731A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 — Training, enrolment or model building
    • G10L17/18 — Artificial neural networks; Connectionist approaches
    • G10L17/20 — Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22 — Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides an audio recognition method, apparatus, electronic device, readable storage medium and computer program product, and relates to the technical fields of artificial intelligence, security authentication and voiceprint recognition. The specific implementation scheme is as follows: determining a second audio feature corresponding to the audio to be recognized in the real-number domain based on a first audio feature corresponding to the audio to be recognized in the frequency domain; performing feature compression on the second audio feature with a target compression function to obtain a nonlinear audio feature corresponding to the audio to be recognized, wherein the target compression function is obtained in advance by parameter learning on a smooth logarithmic compression function that contains preset learnable parameters; and determining an audio recognition result corresponding to the audio to be recognized based on the nonlinear audio feature. With this scheme, the nonlinearity of the audio to be recognized can be simulated efficiently without manually extracting audio features, thereby improving both the security and the efficiency of audio recognition.

Description

Audio identification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and further relates to the field of security authentication technology and voiceprint recognition technology, and in particular, to an audio recognition method, apparatus, electronic device, readable storage medium, and computer program product.
Background
With the rapid development of computer technology and artificial intelligence, biometric identification has also been rapidly popularized. Biometric identification authenticates an individual using physiological characteristics inherent to the human body, and offers advantages such as having nothing to forget and being usable at any time and place.
However, biometric identification also suffers from low security and low recognition efficiency in practice. For example, audio recognition technology that authenticates personal identity from a user's audio characteristics often faces malicious spoofing attacks in application.
Disclosure of Invention
The present disclosure provides an audio recognition method, apparatus, electronic device, readable storage medium, and computer program product to improve security and recognition efficiency of audio recognition.
According to an aspect of the present disclosure, there is provided an audio recognition method, which may include the steps of:
determining a second audio characteristic corresponding to the audio to be identified in a real number domain based on a first audio characteristic corresponding to the audio to be identified in a frequency domain;
performing feature compression on the second audio features by using a target compression function to obtain nonlinear audio features corresponding to the audio to be recognized, wherein the target compression function is obtained by parameter learning of a smooth logarithmic compression function in advance, and the smooth logarithmic compression function comprises preset learnable parameters;
determining an audio recognition result corresponding to the audio to be recognized based on the nonlinear audio feature.
According to a second aspect of the present disclosure, there is provided an audio recognition apparatus, which may include:
the second audio characteristic determining unit is used for determining a second audio characteristic corresponding to the audio to be identified in a real number domain based on the first audio characteristic corresponding to the audio to be identified in a frequency domain;
the nonlinear audio characteristic determining unit is used for performing characteristic compression on the second audio characteristic by using a target compression function to obtain a nonlinear audio characteristic corresponding to the audio to be identified, wherein the target compression function is obtained by parameter learning of a smooth logarithmic compression function in advance, and the smooth logarithmic compression function comprises preset learnable parameters;
and the audio recognition result determining unit is used for determining an audio recognition result corresponding to the audio to be recognized based on the nonlinear audio characteristics.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the method in any of the embodiments of the present disclosure.
According to the disclosed technology, the nonlinear audio feature corresponding to the audio to be recognized is obtained with a target compression function derived by parameter learning on a smooth logarithmic compression function, and the smooth logarithmic compression function makes the process of obtaining the nonlinear audio feature simpler. Moreover, the target compression function obtained in this way can efficiently simulate the nonlinearity of the audio to be recognized without manually extracting audio features, so both the security and the efficiency of audio recognition can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flow chart of an audio recognition method provided by an embodiment of the present disclosure;
fig. 2 is a flowchart of a nonlinear audio feature obtaining method provided in an embodiment of the present disclosure;
fig. 3 is a flowchart of an audio recognition result determination method provided in an embodiment of the present disclosure;
fig. 4 is a flowchart of an audio recognition result obtaining method provided in an embodiment of the present disclosure;
fig. 5 is a schematic diagram of an audio recognition process provided in an embodiment of the present disclosure;
fig. 6 is a schematic diagram of an audio recognition apparatus provided in an embodiment of the present disclosure;
fig. 7 is a schematic diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
An embodiment of the present disclosure provides an audio recognition method, and specifically, referring to fig. 1, a flowchart of the audio recognition method is provided, where the method includes the following steps:
step S101: and determining a second audio characteristic corresponding to the audio to be identified in a real number domain based on the first audio characteristic corresponding to the audio to be identified in the frequency domain.
Step S102: and performing feature compression on the second audio features by using a target compression function to obtain nonlinear audio features corresponding to the audio to be recognized, wherein the target compression function is obtained by performing parameter learning on a smooth logarithmic compression function in advance, and the smooth logarithmic compression function comprises preset learnable parameters.
Step S103: and determining an audio identification result corresponding to the audio to be identified based on the nonlinear audio characteristics.
According to the audio identification method provided by the embodiment of the disclosure, the target compression function obtained by parameter learning of the smooth logarithmic compression function is utilized to obtain the nonlinear audio feature corresponding to the audio to be identified, and the smooth logarithmic compression function can enable the obtaining process of the nonlinear audio feature to be simpler. And the target compression function obtained by parameter learning of the smooth logarithmic compression function can efficiently simulate the nonlinearity of the audio to be identified without manually extracting audio features, so that the safety and the identification efficiency of audio identification can be improved.
In addition, learnable parameters are preset in the smooth log compression function, so that the smooth log compression function can be accessed into a specific neural network model for end-to-end training.
The audio to be recognized is a time-domain signal that varies over time, and is generally nonlinear. In practical applications, the audio to be recognized may be audio collected for a target object by a preset audio collection device, or it may be the processed audio obtained after audio preprocessing is performed on the audio collected by the audio collection device.
In the embodiment of the present disclosure, audio preprocessing methods include, but are not limited to, noise removal and audio temporal enhancement. The purpose of noise removal is to remove environmental noise, busy tones, sound played aloud by a mobile phone, and the like from the audio collected by the audio collection device; the purpose of audio temporal enhancement is to superimpose echoes, change the speech rate of the audio, and so on.
Performing audio preprocessing on the collected audio to obtain the audio to be recognized reduces interference factors and improves the audio quality, which in turn can improve the accuracy of the audio recognition result.
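As an illustration only, the preprocessing stage described above might be sketched as follows. The patent does not specify concrete operations, so the peak normalization and the naive linear-interpolation speed change below (and the function name `preprocess`) are assumptions standing in for "noise removal" and "audio temporal enhancement":

```python
import numpy as np

def preprocess(audio: np.ndarray, speed: float = 1.0) -> np.ndarray:
    """Illustrative preprocessing sketch (not the patent's implementation):
    peak-normalize the waveform, then change the speech rate by naive
    linear-interpolation resampling as a stand-in for temporal enhancement."""
    audio = audio / (np.max(np.abs(audio)) + 1e-8)   # peak normalization
    n_out = int(len(audio) / speed)                  # fewer samples -> faster speech
    idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(idx, np.arange(len(audio)), audio)
```

A real system would use a proper resampler and a denoising front end; this sketch only shows where such steps sit in the flow.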
In the embodiment of the present disclosure, the audio recognition method may be implemented by a pre-trained target audio recognition model. That is, the above-described steps S101 to S103 may be implemented by a target audio recognition model.
The target audio recognition model is obtained by performing model training on the audio recognition model to be trained based on the audio samples and the corresponding labels. The target audio recognition model is used for recognizing the audio to be recognized to obtain an audio recognition result.
In order for the target audio recognition model to implement the above steps S101 to S103, the audio recognition model to be trained and the target audio recognition model need to include at least a first sub-model for performing audio feature extraction and a second sub-model for determining an audio recognition result based on the nonlinear audio features. Specifically, the first sub-model includes at least a first audio feature extraction layer, a second audio feature extraction layer, and a third audio feature extraction layer.
It should be noted that the first audio feature extraction layer is the feature extraction layer that obtains the first audio feature from the audio to be recognized. The second audio feature extraction layer is the feature extraction layer that obtains the second audio feature based on the first audio feature. The third audio feature extraction layer is the feature extraction layer that obtains the nonlinear audio feature from the second audio feature. That is, the target compression function belongs to the feature extraction layer in the target audio recognition model that obtains the nonlinear audio feature.
In the embodiment of the disclosure, learnable parameters may be respectively preset for the first audio feature extraction layer and the third audio feature extraction layer, and parameter learning is performed on the learnable parameters in a process of performing model training on an audio recognition model to be trained to obtain a target audio recognition model. Taking the target compression function as an example, the method for parameter learning of the smooth log compression function is as follows: in the process of training to obtain the target audio recognition model, parameter learning is carried out on the smooth logarithmic compression function to obtain the target compression function.
Presetting learnable parameters for the first and third audio feature extraction layers can enhance the generalization ability of the target audio recognition model. In addition, by parameter learning on these preset learnable parameters, the first and third audio feature extraction layers can be trained for audio feature extraction, so the nonlinear audio feature used to determine the audio recognition result can be extracted without manually extracting audio features. Therefore, the security and recognition efficiency of audio recognition can be improved.
In the embodiment of the disclosure, the core of the first audio feature extraction layer is N band-pass (Gabor) filters, and the N Gabor filters can each perform a convolution operation on the audio to be recognized based on a Gaussian kernel to obtain the first audio feature. Each of the N Gabor filters may be formulated as follows:

g_n(t) = (1 / (√(2π) · σ_n)) · exp(−t² / (2σ_n²)) · exp(i · 2π · η_n · t),  t ∈ [−W/2, W/2]

In the above formula, t represents time, n indexes the N Gabor filters, η_n represents the center frequency of the nth Gabor filter, σ_n its inverse bandwidth, W the length of the filter, and i the imaginary unit.
In the case where the core of the first audio feature extraction layer is these N band-pass (Gabor) filters, the learnable parameters preset for the first audio feature extraction layer may be η_n and σ_n.
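A minimal sketch of such a Gabor filter bank, assuming the Gaussian-windowed complex-exponential form described above; the function names and the NumPy realization are illustrative, not the patent's implementation:

```python
import numpy as np

def gabor_filter(eta: float, sigma: float, W: int) -> np.ndarray:
    """One complex Gabor band-pass filter of length W: a Gaussian envelope
    (inverse bandwidth sigma) modulated by a complex exponential at center
    frequency eta -- eta and sigma are the layer's learnable parameters."""
    t = np.arange(-W // 2, W // 2)
    envelope = np.exp(-t**2 / (2.0 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return envelope * np.exp(1j * 2 * np.pi * eta * t)

def first_audio_features(audio: np.ndarray, etas, sigmas, W: int) -> np.ndarray:
    """First feature-extraction layer sketch: convolve the waveform with each
    of the N Gabor filters, giving an (N, T) complex frequency-domain map."""
    return np.stack([np.convolve(audio, gabor_filter(e, s, W), mode="same")
                     for e, s in zip(etas, sigmas)])
```

In a trained model `etas` and `sigmas` would be optimized end to end; here they are plain arguments.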
In an embodiment of the disclosure, the core of the third audio feature extraction layer is a smooth logarithmic compression function that includes preset learnable parameters. The function has the form:

f(x) = s · log(1 + θ · x)

In the above formula, x represents the second audio feature, s represents the smoothing parameter matrix, and θ represents the weight of the second audio feature. Since f(x) is differentiable, both s and θ can be preset as learnable parameters.
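The smooth logarithmic compression is simple enough to state directly in code; this sketch just evaluates f(x) = s · log(1 + θ · x) element-wise. In a trained model s and θ would be learned parameters; here they are plain arguments:

```python
import numpy as np

def smooth_log_compress(x, s, theta):
    """Smooth logarithmic compression f(x) = s * log(1 + theta * x).
    Both s and theta are differentiable, so they can be learned end to end;
    with s = theta = 1 this reduces to standard log1p compression."""
    return s * np.log1p(theta * x)
```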
In the embodiment of the disclosure, based on the audio samples and the corresponding labels, the specific implementation manner of performing model training on the audio recognition model to be trained is as follows:
Firstly, audio samples are collected and labeled, where the label information is the actual audio recognition result corresponding to each audio sample.
Then, the audio samples and their labels are input into the audio recognition model to be trained, and the current audio recognition result that the model outputs for each sample is obtained. A preset loss function produces a loss value between the actual and current audio recognition results, and the parameters of the model are updated based on that loss value. These steps are iterated repeatedly, and the target audio recognition model is obtained once the loss function converges.
In the embodiment of the present disclosure, the preset loss function is typically the additive angular margin softmax loss (AAM-Softmax, also known as ArcFace loss), originally proposed for face recognition, but other types of loss functions may also be used. The form or kind of the loss function is not particularly limited in the embodiments of the present disclosure.
In addition, the parameters here are network model parameters, and they may be updated as follows: the learnable parameters in the audio recognition model to be trained are updated using the Adaptive Moment Estimation (Adam) algorithm.
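For illustration, a single Adam update on one scalar parameter can be sketched as below. This is the textbook Adam rule, not code from the patent; in practice the same update is applied to every learnable parameter (filter centers η_n, bandwidths σ_n, s and θ):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam (adaptive moment estimation) update for a scalar parameter.
    m, v are the running first/second moment estimates; t is the step count
    (starting at 1), used for bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)          # bias-corrected first moment
    v_hat = v / (1 - b2**t)          # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

A framework's built-in optimizer would normally be used instead; the sketch only shows what one update does.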
In an embodiment of the present disclosure, a specific implementation manner of step S102 is shown in fig. 2, where fig. 2 is a flowchart of a nonlinear audio feature obtaining method provided in an embodiment of the present disclosure, and the method may include the following steps:
step S201: and performing feature sampling on the second audio features to obtain sampled audio features corresponding to the second audio features.
Step S202: and inputting the sampled audio features into a target compression function to obtain nonlinear audio features.
Feature sampling compresses the second audio feature, which reduces both the amount of computation and the memory consumed in subsequent feature processing.
In the embodiment of the present disclosure, the manner of performing feature sampling on the second audio feature is as follows: the second audio feature is downsampled.
Accordingly, when step S102 is implemented with the target audio recognition model, the sampling may be performed by a filtering pooling layer in the model, which downsamples the second audio feature to a lower sampling rate to obtain the sampled audio feature.
Specifically, the filtering pooling layer is a feature extraction layer preset in the audio recognition model to be trained. In the target audio recognition model, each convolution channel in the filtering pooling layer is associated with a high-pass filter, and the second audio feature is filtered through the corresponding high-pass filter on each convolution channel, thereby realizing down-sampling of the second audio feature.
The advantage of down-sampling the audio features by filtering with a corresponding high-pass filter on each convolution channel is that: because the convolution kernel parameter (bandwidth of the high-pass filter) of each convolution channel is learnable, the bandwidth information suitable for the corresponding audio identification task can be learnt according to different audio identification tasks.
That is, the target audio recognition model in embodiments of the present disclosure may further include a filtering pooling layer for feature sampling the second audio feature. In addition, for the audio recognition model to be trained, learnable parameters may be set in advance for the filter pooling layer. E.g., the bandwidth of the high pass filter.
When a learnable parameter is preset for the filter pooling layer, the method for parameter learning the learnable parameter in the filter pooling layer is as follows: and in the process of carrying out model training on the audio recognition model to be trained to obtain a target audio recognition model, carrying out parameter learning on the bandwidth of the high-pass filter.
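A hedged sketch of such a filtering pooling layer: each channel is convolved with its own FIR kernel and then decimated. The patent states the per-channel filters are high-pass with learnable bandwidth but gives no formula, so here the kernels are simply passed in as assumptions (fixed rather than learned):

```python
import numpy as np

def filter_and_downsample(features: np.ndarray, kernels, stride: int = 4) -> np.ndarray:
    """Filtering-pooling sketch: each channel of the (N, T) feature map is
    convolved with its own FIR kernel (whose shape/bandwidth would be the
    learnable parameter), then decimated by `stride` to downsample."""
    out = [np.convolve(ch, k, mode="same")[::stride]
           for ch, k in zip(features, kernels)]
    return np.stack(out)
```

Per-channel kernels let each band learn bandwidth information suited to the recognition task, as the text notes.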
The audio recognition method provided in the embodiments of the present disclosure may be applied to audio classification, in which case the audio recognition result is the audio type corresponding to the audio to be recognized. Specifically, the classification may distinguish real human voice from non-real human voice. Other audio classification tasks are also possible, such as determining the sound-producing object corresponding to the audio to be recognized.
And under the condition that the audio is classified into the real voice and the non-real voice of the audio to be recognized, the audio recognition result comprises a recognition result used for representing whether the audio to be recognized is the real voice. At this time, the step of determining the audio recognition result corresponding to the audio to be recognized is shown in fig. 3, where fig. 3 is a flowchart of an audio recognition result determining method provided in an embodiment of the present disclosure, and the audio recognition result determining method may include the following steps:
step S301: and determining the probability that the audio to be identified is the real human voice based on the nonlinear audio characteristics.
Step S302: based on the probability, an audio recognition result is determined.
According to the audio recognition method provided by the embodiment of the present disclosure, after the probability that the audio to be recognized is real human voice is determined, whether it actually is real human voice is decided based on that probability. This effectively prevents an attacker from passing off forged audio, such as a recording, as real human voice during personal identity authentication, and thus improves the security of authenticating identity from a user's audio characteristics.
In the embodiment of the present disclosure, based on the probability, the recognition mode for determining the audio recognition result is shown in fig. 4, where fig. 4 is a flowchart of a method for obtaining an audio recognition result provided in the embodiment of the present disclosure, and the method may include the following steps:
step S401: and under the condition that the probability meets a preset condition, determining that the audio to be identified is the real voice.
Step S402: and determining the audio to be recognized as the real voice as the audio recognition result.
And under the condition that the probability meets the preset condition, determining the audio to be recognized as the real voice, so that the accuracy of the audio recognition result can be improved.
The preset conditions may include: the probability reaches a corresponding probability threshold, for example: 50 percent. Further, the preset condition may be: the probability that the audio to be identified is the real voice is larger than or equal to the probability that the audio to be identified is the non-real voice. At this time, the specific implementation manner of determining that the audio to be recognized is the real voice is as follows:
Firstly, after the probability that the audio to be recognized is real human voice is determined, it is taken as a first probability; meanwhile, the probability that the audio to be recognized is non-real human voice is determined based on the nonlinear audio feature and taken as a second probability. Then the first probability is compared with the second, and the audio to be recognized is determined to be real human voice when the first probability is greater than or equal to the second probability.
In addition, in the case where the first probability is smaller than the second probability, it may be determined that the audio to be recognized is an unreal human voice.
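The decision rule in the two paragraphs above reduces to a one-line comparison; this sketch is merely that comparison made explicit (the function name and string labels are illustrative):

```python
def classify(p_real: float, p_non_real: float) -> str:
    """Decision rule from the disclosure: the audio is accepted as real human
    voice when its 'real' probability is >= its 'non-real' probability."""
    return "real" if p_real >= p_non_real else "non-real"
```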
When performing audio recognition based on the target audio recognition model, the second sub-model may be built around an ECAPA-TDNN classifier, which obtains, from the nonlinear audio feature, the probability that the audio to be recognized is real human voice and the probability that it is non-real human voice. In addition, in the embodiment of the present disclosure, other types of second sub-model may also be adopted to determine the audio recognition result based on the nonlinear audio feature.
In the embodiment of the present disclosure, the first audio feature is determined as follows: firstly, converting the audio to be identified from a time domain signal into a frequency domain signal, and obtaining a frequency domain audio signal corresponding to the audio to be identified. The frequency domain audio signal is then determined as the first audio feature.
Because a frequency-domain signal can represent the audio characteristics of the audio to be recognized more intuitively and conveniently, determining the frequency-domain audio signal as the first audio feature lets those characteristics be represented directly by the first audio feature.
In the embodiment of the present disclosure, the manner of determining the second audio characteristic corresponding to the audio to be identified in the real number domain is as follows: firstly, converting the first audio characteristic from a frequency domain signal into a real number domain signal, and obtaining a real number domain audio signal corresponding to the audio to be identified. Then, the real-number domain audio signal is determined as the second audio feature.
The second audio characteristic is obtained by converting the frequency domain signal of the first audio characteristic into the real number domain signal, so that the nonlinear factor can be added in the second audio characteristic, and the characteristic expression capability of the second audio characteristic can be enhanced.
In the embodiment of the disclosure, the first audio feature may be converted from a frequency-domain signal to a real-number-domain signal by a preset operation such as a square activation function. Accordingly, a square activation function may serve as the core of the second audio feature extraction layer when performing audio recognition based on the target audio recognition model.
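Assuming the square activation acts as a squared modulus on the complex filter outputs (a common reading, though the patent does not give a formula), it can be sketched as:

```python
import numpy as np

def square_activation(first_features: np.ndarray) -> np.ndarray:
    """Second feature-extraction layer sketch: the squared modulus maps the
    complex (frequency-domain) filter responses to non-negative real values,
    adding the nonlinear factor mentioned in the text."""
    return np.abs(first_features) ** 2
```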
The following specifically takes the example of recognizing whether the audio to be recognized is a real human voice through the target audio recognition model, to describe in detail the audio recognition method provided in the embodiment of the present disclosure. Specifically, referring to fig. 5, which is a schematic diagram of an audio recognition process provided in an embodiment of the present disclosure, the audio recognition process includes:
firstly, inputting an audio to be recognized into a target audio recognition model, performing convolution operation on the audio to be recognized by a first audio feature extraction layer in a first sub-model based on N Gabor filters to obtain a first audio feature, and outputting the first audio feature to a second audio feature extraction layer.
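The first feature extraction layer described above can be sketched in NumPy, assuming a standard complex Gabor form (Gaussian envelope modulating a complex sinusoid); the center frequencies, bandwidth, and filter length are illustrative assumptions, as the patent does not specify the filter parameters:

```python
import numpy as np

def gabor_filter(center_freq: float, bandwidth: float, length: int = 401) -> np.ndarray:
    """One complex Gabor filter: Gaussian envelope times a complex sinusoid."""
    t = np.arange(length) - length // 2
    envelope = np.exp(-0.5 * (bandwidth * t) ** 2)
    return envelope * np.exp(2j * np.pi * center_freq * t)

def gabor_frontend(audio: np.ndarray, n_filters: int = 4) -> np.ndarray:
    """Convolve the waveform with N Gabor filters (first audio feature layer)."""
    centers = np.linspace(0.01, 0.4, n_filters)  # normalized frequencies, illustrative
    return np.stack([np.convolve(audio, gabor_filter(f, 0.01), mode="same")
                     for f in centers])

audio = np.random.default_rng(0).standard_normal(1000)
first_feature = gabor_frontend(audio)  # shape: (n_filters, samples), complex-valued
```

Because the Gabor output is complex, it is naturally a frequency-domain representation; this is what the second layer then maps to the real number domain.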
Secondly, after the second audio feature extraction layer obtains the first audio feature, the second audio feature extraction layer converts the first audio feature from a frequency domain signal to a real number domain signal based on a square activation function, so as to obtain a second audio feature, and outputs the second audio feature to the filtering pooling layer.
Thirdly, after the second audio features are obtained, the filtering pooling layer filters the second audio features through the corresponding high-pass filters on each convolution channel, so that down-sampling of the second audio features is achieved, sampling audio features corresponding to the second audio features are obtained, and the sampling audio features are output to a third audio feature extraction layer.
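The filter-then-downsample behavior of the pooling layer can be sketched as follows; a simple 3-tap FIR kernel and a stride of 4 stand in for the per-channel filters described above, whose taps the patent does not give:

```python
import numpy as np

def filter_pool(feature: np.ndarray, kernel: np.ndarray, stride: int = 4) -> np.ndarray:
    """Per-channel filtering followed by strided downsampling.

    Each row of `feature` is one convolution channel; in the patent each
    channel has its own filter, while here one illustrative kernel is shared.
    """
    filtered = np.stack([np.convolve(ch, kernel, mode="same") for ch in feature])
    return filtered[:, ::stride]

kernel = np.array([0.25, 0.5, 0.25])  # illustrative taps only
second_feature = np.random.default_rng(1).standard_normal((4, 1000))
sampled_feature = filter_pool(second_feature, kernel)  # shape (4, 250)
```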
Fourthly, after the third audio feature extraction layer obtains the sampled audio features, it performs feature compression on the sampled audio features based on a target compression function, so as to obtain nonlinear audio features, and inputs the nonlinear audio features into the second submodel.
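The patent does not give the exact expression of the smooth logarithmic compression function, only that it is smooth, logarithmic, and contains learnable parameters. A plausible minimal sketch, assuming a log1p form whose scale is controlled by a learnable parameter `alpha` (tuned during training of the target audio recognition model to yield the target compression function):

```python
import numpy as np

def smooth_log_compress(x: np.ndarray, alpha: float) -> np.ndarray:
    """Smooth logarithmic compression with one learnable parameter.

    log1p(exp(alpha) * x) is smooth everywhere (including x = 0) and
    compresses the dynamic range of the non-negative input feature;
    alpha is what parameter learning would adjust.
    """
    return np.log1p(np.exp(alpha) * x)

sampled = np.array([[0.0, 1.0, 10.0, 100.0]])
nonlinear_feature = smooth_log_compress(sampled, alpha=0.0)  # exp(0)=1, so log1p(x)
```

With `alpha = 0` this reduces to plain `log(1 + x)`; training would move `alpha` away from zero to sharpen or soften the compression.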
Fifthly, after obtaining the nonlinear audio features, the second submodel obtains, based on the ECAPA-TDNN classifier, the probability that the audio to be identified is a real human voice and the probability that it is not, and then determines whether the audio to be identified is a real human voice based on these two probabilities, thereby obtaining the audio recognition result.
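The final decision step above, turning the two class scores into a real/unreal label, can be sketched as follows; the softmax-plus-threshold form is an assumption, since the patent only states that the decision is based on the two probabilities:

```python
import numpy as np

def decide(logits: np.ndarray, threshold: float = 0.5) -> bool:
    """Turn the classifier's two logits (real, unreal) into a decision.

    Softmax gives the two probabilities; the audio is labeled a real
    human voice when p_real meets the preset threshold condition.
    """
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    p_real, _p_unreal = exp / exp.sum()
    return bool(p_real >= threshold)

is_real = decide(np.array([2.0, -1.0]))  # "real" logit dominates -> True
```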
As shown in fig. 6, an embodiment of the present disclosure provides an audio recognition apparatus, including:
a second audio characteristic determining unit 601, configured to determine, based on a first audio characteristic corresponding to the audio to be recognized in a frequency domain, a second audio characteristic corresponding to the audio to be recognized in a real number domain;
a nonlinear audio feature determining unit 602, configured to perform feature compression on the second audio feature by using a target compression function to obtain a nonlinear audio feature corresponding to the audio to be identified, where the target compression function is obtained by performing parameter learning on a smooth logarithmic compression function in advance, and the smooth logarithmic compression function includes preset learnable parameters;
the audio recognition result determining unit 603 is configured to determine an audio recognition result corresponding to the audio to be recognized based on the nonlinear audio feature.
In one embodiment, the audio recognition result determining unit 603 may include:
the characteristic sampling subunit is used for carrying out characteristic sampling on the second audio characteristic to obtain a sampled audio characteristic corresponding to the second audio characteristic;
and the nonlinear audio characteristic subunit is used for inputting the sampled audio characteristics into the target compression function to obtain nonlinear audio characteristics.
In one embodiment, the audio recognition result determining unit 603 may include:
the target audio recognition model training module is used for performing parameter learning on the smooth logarithmic compression function in the process of training to obtain the target audio recognition model to obtain a target compression function;
the target audio recognition model is a model obtained based on audio samples and corresponding label training and used for recognizing audio to be recognized so as to obtain an audio recognition result, and the target compression function belongs to a feature extraction layer used for obtaining nonlinear audio features in the target audio recognition model.
In one embodiment, in a case that the audio recognition result includes a recognition result indicating whether the audio to be recognized is a real human voice, the audio recognition result determining unit 603 may include:
the probability determining subunit is used for determining the probability that the audio to be identified is the real human voice based on the nonlinear audio features;
and the audio recognition result determining subunit is used for determining the audio recognition result based on the probability.
In one embodiment, the audio recognition result determining subunit may include:
the real voice determining subunit is used for determining the audio to be identified as the real voice under the condition that the probability meets a preset condition;
and the real voice recognition result determining subunit is used for determining the audio to be recognized as the real voice as the audio recognition result.
In one embodiment, the second audio feature determination unit 601 may include:
the frequency domain audio signal obtaining unit is used for converting the audio to be identified from the time domain signal into a frequency domain signal and obtaining a frequency domain audio signal corresponding to the audio to be identified;
a first audio feature determination subunit for determining the frequency domain audio signal as the first audio feature.
In one embodiment, the second audio feature determination unit 601 may include:
the real number domain audio signal obtaining unit is used for converting the first audio characteristic from a frequency domain signal into a real number domain signal and obtaining a real number domain audio signal corresponding to the audio to be identified;
a second audio feature determination subunit for determining the real-number-domain audio signal as the second audio feature.
In the technical scheme of the disclosure, the acquisition, storage, and application of the personal information of the related user all comply with relevant laws and regulations and do not violate public order or good customs.
According to an embodiment of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as an audio recognition method. For example, in some embodiments, the audio recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the audio recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the audio recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable audio recognition device such that the program codes, when executed by the processor or controller, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (11)

1. An audio recognition method, comprising:
determining a second audio characteristic corresponding to the audio to be identified in a real number domain based on a first audio characteristic corresponding to the audio to be identified in a frequency domain;
performing feature compression on the second audio features by using a target compression function to obtain nonlinear audio features corresponding to the audio to be recognized, wherein the target compression function is obtained by parameter learning of a smooth logarithmic compression function in advance, and the smooth logarithmic compression function comprises preset learnable parameters;
and determining an audio identification result corresponding to the audio to be identified based on the nonlinear audio characteristics.
2. The method according to claim 1, wherein the performing feature compression on the second audio feature by using a target compression function to obtain a nonlinear audio feature corresponding to the audio to be identified comprises:
performing feature sampling on the second audio features to obtain sampled audio features corresponding to the second audio features;
and inputting the sampling audio features into the target compression function to obtain the nonlinear audio features.
3. The method of claim 1 or 2, wherein the target compression function is determined in a manner comprising:
in the process of training to obtain a target audio recognition model, parameter learning is carried out on the smooth logarithmic compression function to obtain the target compression function;
the target audio recognition model is a model obtained based on an audio sample and corresponding label training and used for recognizing the audio to be recognized so as to obtain the audio recognition result, and the target compression function belongs to a feature extraction layer used for obtaining the nonlinear audio features in the target audio recognition model.
4. The method according to claim 1 or 2, wherein in a case that the audio recognition result includes a recognition result representing whether the audio to be recognized is a real human voice, the determining, based on the nonlinear audio feature, the audio recognition result corresponding to the audio to be recognized includes:
determining the probability that the audio to be identified is the real human voice based on the nonlinear audio features;
determining the audio recognition result based on the probability.
5. The method of claim 4, wherein the determining the audio recognition result based on the probability comprises:
determining the audio to be identified as the real voice under the condition that the probability meets a preset condition;
and determining the audio to be recognized as the real voice as the audio recognition result.
6. The method of claim 1, wherein the manner of determining the first audio feature comprises:
converting the audio to be identified into a frequency domain signal from a time domain signal, and obtaining a frequency domain audio signal corresponding to the audio to be identified;
determining the frequency domain audio signal as the first audio feature.
7. The method of claim 6, wherein the determining of the second audio feature corresponding to the audio to be identified in the real number domain comprises:
converting the first audio characteristic from a frequency domain signal into a real number domain signal to obtain a real number domain audio signal corresponding to the audio to be identified;
determining the real-number domain audio signal as the second audio feature.
8. An audio recognition apparatus comprising:
the second audio characteristic determining unit is used for determining a second audio characteristic corresponding to the audio to be identified in a real number domain based on a first audio characteristic corresponding to the audio to be identified in a frequency domain;
the nonlinear audio characteristic determining unit is used for performing characteristic compression on the second audio characteristic by using a target compression function to obtain a nonlinear audio characteristic corresponding to the audio to be identified, wherein the target compression function is obtained by performing parameter learning on a smooth logarithmic compression function in advance, and the smooth logarithmic compression function comprises preset learnable parameters;
and the audio identification result determining unit is used for determining an audio identification result corresponding to the audio to be identified based on the nonlinear audio characteristics.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
11. A computer program product comprising computer programs/instructions, wherein the computer programs/instructions, when executed by a processor, implement the method of any one of claims 1 to 7.
CN202210343564.XA 2022-03-31 2022-03-31 Audio identification method and device, electronic equipment and storage medium Pending CN114550731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210343564.XA CN114550731A (en) 2022-03-31 2022-03-31 Audio identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210343564.XA CN114550731A (en) 2022-03-31 2022-03-31 Audio identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114550731A true CN114550731A (en) 2022-05-27

Family

ID=81666247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210343564.XA Pending CN114550731A (en) 2022-03-31 2022-03-31 Audio identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114550731A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188109A (en) * 2022-07-26 2022-10-14 思必驰科技股份有限公司 Device audio unlocking method, electronic device and storage medium


Similar Documents

Publication Publication Date Title
CN114118287A (en) Sample generation method, sample generation device, electronic device and storage medium
CN111276124B (en) Keyword recognition method, device, equipment and readable storage medium
KR20220116395A (en) Method and apparatus for determining pre-training model, electronic device and storage medium
CN114360562A (en) Voice processing method, device, electronic equipment and storage medium
CN114550731A (en) Audio identification method and device, electronic equipment and storage medium
CN114494814A (en) Attention-based model training method and device and electronic equipment
CN112634880A (en) Speaker identification method, device, equipment, storage medium and program product
CN114724144B (en) Text recognition method, training device, training equipment and training medium for model
CN116363444A (en) Fuzzy classification model training method, fuzzy image recognition method and device
CN116312559A (en) Training method of cross-channel voiceprint recognition model, voiceprint recognition method and device
CN113035230B (en) Authentication model training method and device and electronic equipment
CN112786058B (en) Voiceprint model training method, voiceprint model training device, voiceprint model training equipment and storage medium
CN112820298B (en) Voiceprint recognition method and device
CN113361621B (en) Method and device for training model
CN114078274A (en) Face image detection method and device, electronic equipment and storage medium
CN114067805A (en) Method and device for training voiceprint recognition model and voiceprint recognition
CN114220430A (en) Multi-sound-zone voice interaction method, device, equipment and storage medium
CN114333912A (en) Voice activation detection method and device, electronic equipment and storage medium
CN114171038A (en) Voice noise reduction method, device, equipment, storage medium and program product
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN113539300A (en) Voice detection method and device based on noise suppression, storage medium and terminal
CN112951268B (en) Audio recognition method, apparatus and storage medium
CN110895929B (en) Voice recognition method and device
CN114882890A (en) Deep learning model training method, voiceprint recognition method, device and equipment
CN115761717A (en) Method and device for identifying topic image, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination