CN113851147A - Audio recognition method, audio recognition model training method and device and electronic equipment


Info

Publication number
CN113851147A
Authority
CN
China
Prior art keywords: audio data, sample, data, identified, length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111213690.5A
Other languages
Chinese (zh)
Inventor
熊新雷
肖岩
赵情恩
陈蓉
张银辉
梁芸铭
周羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111213690.5A
Publication of CN113851147A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides an audio data recognition method, an audio data recognition model training method and apparatus, an electronic device, and a medium, and relates to the field of data processing, in particular to audio processing and deep learning. The audio data recognition method comprises: acquiring audio data to be identified; respectively performing feature extraction on the audio data to be identified using N parameter sets to obtain N feature data of the audio data to be identified, wherein each parameter set in the N parameter sets is respectively associated with a different frequency range, and N is a positive integer greater than 1; and classifying the audio data to be identified based on the N feature data.

Description

Audio recognition method, audio recognition model training method and device and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, in particular to audio processing and deep learning technologies, and more particularly to an audio data recognition method, an audio data recognition model training method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
There are many scenarios in which it is desirable to identify and classify audio, such as classifying the source of the audio, detecting whether the audio is attack audio, or comparing whether audio features match expected features. Classifying audio requires extracting features of the audio, and a method that can effectively extract audio features and thereby accurately identify audio is desired.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides an audio data recognition method, an audio data recognition model training method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided an audio data recognition method including: acquiring audio data to be identified; respectively extracting features of the audio data to be identified by using N parameter sets to obtain N feature data of the audio data to be identified, wherein each parameter set in the N parameter sets is respectively associated with different frequency ranges, and N is a positive integer greater than 1; and classifying the audio data to be identified based on the N characteristic data.
According to another aspect of the present disclosure, there is provided a method of training an audio data recognition model, the audio data recognition model comprising M feature extraction sub-networks and a classification sub-network connected to an output of each of the M feature extraction sub-networks, M being a positive integer greater than 1, the method comprising: acquiring sample audio data and a real label of the sample audio data; inputting the sample audio data into each of the M feature extraction subnetworks to obtain M feature data for the sample audio data; inputting the M characteristic data into a classification sub-network to obtain a prediction label of the sample audio data; calculating a loss function based on the real label and the predicted label; and adjusting parameters of the audio data recognition model based on the loss function.
According to another aspect of the present disclosure, there is provided an audio data recognition apparatus including: an audio data acquisition unit configured to acquire audio data to be identified; a feature extraction unit configured to respectively perform feature extraction on the audio data to be identified using N parameter sets to obtain N feature data of the audio data to be identified, wherein each parameter set in the N parameter sets is respectively associated with a different frequency range, and N is a positive integer greater than 1; and a classification unit configured to classify the audio data to be identified based on the N feature data.
According to another aspect of the present disclosure, there is provided a training apparatus for an audio data recognition model, the audio data recognition model including M feature extraction sub-networks and a classification sub-network connected to an output of each of the M feature extraction sub-networks, M being a positive integer greater than 1, the training apparatus comprising: the sample acquiring unit is used for acquiring sample audio data and a real label of the sample audio data; a feature extraction unit configured to input the sample audio data into each of the M feature extraction subnetworks to acquire M feature data for the sample audio data; the classification unit is used for inputting the M characteristic data into a classification sub-network to obtain a prediction label of the sample audio data; a loss function calculation unit for calculating a loss function based on the true label and the predicted label; and a parameter adjusting unit for adjusting parameters of the audio data recognition model based on the loss function.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of audio data recognition or a method of training an audio data recognition model according to one or more embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform an audio data recognition method or a training method of an audio data recognition model according to one or more embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements an audio data recognition method or a training method of an audio data recognition model according to one or more embodiments of the present disclosure.
According to one or more embodiments of the present disclosure, better audio feature extraction can be achieved, thereby achieving a better audio recognition effect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of an audio data identification method according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a method of training an audio data recognition model according to an embodiment of the present disclosure;
FIG. 4A shows a schematic diagram of an audio data recognition model to which a method according to an embodiment of the present disclosure may be applied;
FIG. 4B illustrates another schematic diagram of an audio data recognition model to which a method according to an embodiment of the present disclosure may be applied;
FIG. 5 shows a block diagram of the structure of an audio data recognition apparatus according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of an apparatus for training an audio data recognition model according to an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the execution of an audio data recognition method or a training method of an audio data recognition model.
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to recognize audio data, train audio data recognition models, input audio, interact with audio recognition results, and so forth. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various Mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
An audio data identification method 200 according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2.
At step S201, audio data to be recognized is acquired.
At step S202, feature extraction is performed on the audio data to be identified by using N parameter sets respectively, so as to obtain N feature data of the audio data to be identified, where each parameter set of the N parameter sets is respectively associated with a different frequency range, and N is a positive integer greater than 1.
At step S203, the audio data to be recognized is classified based on the N feature data.
According to the method of the embodiments of the present disclosure, audio data can be identified more accurately. Specifically, by performing feature extraction on the audio data with at least two different parameter sets, features associated with different frequency ranges in the audio data can be better covered, achieving better audio feature extraction and thus a better audio recognition effect.
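To make steps S201-S203 concrete, the following minimal Python sketch illustrates the idea of extracting N feature data with N frequency-range-specific parameter sets and then classifying based on all of them together. The helper names (band_energy_features, recognize), the log-frame-energy features, and the frame parameters are illustrative assumptions rather than the CNN-based extraction described later; the classifier is assumed to be any callable that accepts the stacked features.

```python
import numpy as np

def band_energy_features(audio, kernel, frame_len=400, hop=160):
    # Illustrative feature extraction for one parameter set: band-pass the
    # waveform with a time-domain kernel tied to one frequency range, then
    # take log frame energies.
    filtered = np.convolve(audio, kernel, mode="same")
    frames = [filtered[i:i + frame_len]
              for i in range(0, len(filtered) - frame_len + 1, hop)]
    return np.log(np.array([np.sum(f * f) for f in frames]) + 1e-8)

def recognize(audio, kernels, classifier):
    # Steps S202-S203: obtain N feature data (one per parameter set / kernel)
    # and classify the audio based on all of them together.
    features = np.stack([band_energy_features(audio, k) for k in kernels])
    return classifier(features)
```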
The audio data to be identified may be an original time-domain waveform of the audio or a segment of the time-domain waveform. Extracting audio features from the time-domain waveform reduces information loss and improves the audio recognition effect.
According to some embodiments, obtaining the audio data to be identified may include: acquiring original audio data, and, in response to determining that the time length of the original audio data is greater than a length threshold, truncating the original audio data using the length threshold to obtain the audio data to be identified, wherein the time length of the audio data to be identified is equal to the length threshold. Excessively long audio can thus be truncated to obtain input data of a regular time length.
As an example, the length threshold may be 0.5s, 1s, 4s, 10s, etc., but those skilled in the art will appreciate that the present disclosure is not limited thereto and other time lengths that are appropriate for audio features may also be applicable to the methods described in the present disclosure.
As an example, the length of the cut audio may be determined according to the task requirements and the balance between computation speed and the amount of content contained: if the clip is too long, computation is slow; if it is too short, the audio may not contain enough content to serve as a basis for classification. For example, where the audio to be identified tends to be short, the real-time requirements for recognition are strict, and/or the information contained in the audio tends to be salient, shorter length thresholds such as 0.2 s, 0.02 s, or 0.002 s may be selected. Conversely, where the audio to be identified generally has a longer length, real-time requirements are less strict, or the information in the audio is often less distinct and therefore longer audio is needed for analysis, longer length thresholds such as 20 s, 60 s, or 120 s may be selected. As a specific, non-limiting example, where the audio recognition task is to prevent a voice recording attack, audio shorter than 2 seconds may not be sufficient to contain human voice or the features of the attack, and thus a time length of 3 to 5 seconds (e.g., 4 seconds) may be chosen; it is understood that the disclosure is not limited thereto.
According to some embodiments, obtaining the audio data to be identified may include: acquiring original audio data; in response to determining that the time length of the original audio data is less than a length threshold, copying the original audio data until the time length of the copied original audio data is not less than the length threshold; and truncating the copied original audio data using the length threshold to obtain the audio data to be identified, wherein the time length of the audio data to be identified is equal to the length threshold. Audio that is too short can thus also be copied and truncated to obtain input data of a regular time length.
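A minimal sketch of the length-regularization logic described in the two embodiments above, assuming the audio is a one-dimensional NumPy array of samples and target_len is the length threshold expressed in samples (e.g., 4 s × 16000 Hz = 64000); the function name is an assumption for illustration.

```python
import numpy as np

def to_fixed_length(audio: np.ndarray, target_len: int) -> np.ndarray:
    # Audio longer than the threshold is truncated; audio shorter than the
    # threshold is copied (tiled) until it reaches the threshold, then truncated.
    if len(audio) < target_len:
        reps = int(np.ceil(target_len / len(audio)))
        audio = np.tile(audio, reps)
    return audio[:target_len]
```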
According to some alternative embodiments, feature extraction and classification of audio data may be implemented by an audio data recognition model. According to further embodiments, feature extraction and classification of audio data may be achieved by other feature extraction means and classification means known to those skilled in the art. It is to be understood that the present disclosure is not limited thereto.
A method 300 of training an audio data recognition model according to another embodiment of the present disclosure is described below with reference to fig. 3. The audio data recognition model may include M feature extraction sub-networks and a classification sub-network connected to an output of each of the M feature extraction sub-networks, where M is a positive integer.
At step 301, sample audio data and a true tag of the sample audio data are obtained.
At step 302, sample audio data is input into each of the M feature extraction subnetworks to obtain M feature data for the sample audio data.
At step 303, the M feature data are input into a classification subnetwork to obtain a prediction tag for the sample audio data.
At step 304, a loss function is calculated based on the true tags and the predicted tags.
At step 305, parameters of the audio data recognition model are adjusted based on the loss function.
According to the method of the embodiments of the present disclosure, audio data can be identified more accurately. Specifically, by performing feature extraction on the audio data with at least two different feature extraction sub-networks, better audio feature extraction can be achieved, so that a model trained in this way achieves a better audio recognition effect.
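A hedged sketch of one training iteration covering steps 302-305 is given below, written with PyTorch as an assumed framework (the disclosure does not prescribe one). The model is assumed to bundle the M feature extraction sub-networks and the classification sub-network and to return class logits; cross-entropy is used as the loss, consistent with the loss function discussed later in this description.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_audio, true_labels):
    # Steps 302-303: forward pass through the M feature extraction sub-networks
    # and the classification sub-network to obtain predicted labels (logits).
    logits = model(sample_audio)
    # Step 304: loss between the real labels and the predicted labels.
    loss = F.cross_entropy(logits, true_labels)
    # Step 305: adjust the model parameters based on the loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```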
According to some embodiments, each of the M feature extraction sub-networks is initialized based on a respective one of M sets of filter parameters, and each of the M sets of filter parameters includes an upper cutoff frequency and a lower cutoff frequency. In such an embodiment, initializing the feature extraction sub-networks with the filter parameters enables a faster model learning process and better convergence.
A schematic diagram of a model 400 to which a method according to an embodiment of the present disclosure may be applied is described below with reference to FIG. 4A. As shown in FIG. 4A, the model 400 may include M feature extraction sub-networks 410-1, 410-2 … 410-M and a classification sub-network 420. The model 400 may also include an optional feature integration unit for integrating the M feature data obtained from the M feature extraction sub-networks, but it will be appreciated that this is merely an example: the M feature data may be input directly into the classification sub-network without an additional feature integration unit, or the classification sub-network itself may include a unit or one or more layers (e.g., one or more residual blocks) that integrate the features.
It will be appreciated that while FIG. 4A illustrates the model 400 as including at least 3 feature extraction sub-networks, such a model may have more or fewer feature extraction sub-networks. For example, the model 400 may include only one feature extraction sub-network (M = 1). Hereinafter, the feature extraction portion, which for convenience is generally described as M feature extraction sub-networks 410-1, 410-2 … 410-M, may include only one or two sub-networks, or may include more (e.g., tens of) sub-networks. The choice of the value of M is described in more detail below in connection with specific embodiments, and the disclosure is not limited thereto.
In signals-and-systems theory, extracting signal features with a filter requires performing a convolution in the time domain. Here, the filter's convolution can be approximated using the convolution layers of a neural network, with the neural network parameters initialized using conventional filter parameters. Initializing with the empirical filter formula thus places the neural network in a relatively good initial state, greatly reducing the learning cost. Specifically, the filter parameters are calculated from the preset upper and lower limit frequencies of the filter, for example by substituting them into the filter formula, and such parameters are used as the initial parameters of the neural network. This leverages traditional empirical formulas and existing filter parameters from classical theory to obtain relatively good initial parameters for the model, so the training process is fast and a better convergence value can be obtained.
In the signals-and-systems and audio processing arts, multiplication in the frequency domain corresponds to convolution in the time domain. Assuming that the filter acts in the time domain as g[n, θ], the filtered version of an input signal x[n] is:
y[n] = x[n] * g[n, θ]
where n represents the time index of the signal (for a discrete signal, a normalized unitless time index), * represents the convolution operation, and θ is the generalized parameter set of the filter, which may include any parameters of the filter equation other than n (one, several, or none).
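As a small numerical illustration of y[n] = x[n] * g[n, θ], the sketch below convolves a waveform with a placeholder impulse response using NumPy; the signal length and kernel values are arbitrary assumptions.

```python
import numpy as np

x = np.random.randn(64000)           # e.g. 4 s of audio sampled at 16 kHz
g = np.ones(129) / 129               # placeholder impulse response g[n, θ]
y = np.convolve(x, g, mode="same")   # time-domain filtering as convolution
```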
Conventionally, features may be filtered using the filter g[n, θ] and then fed into a back-end classifier. Further, in order to exploit the powerful feature extraction capability of a convolutional neural network (CNN) and minimize information loss (e.g., spectral leakage), the CNN may be used as the feature extraction sub-network part, learning the features through the CNN's capability. In this process, g[n, θ] may be used as the initialization parameters of the CNN network. With continued reference to FIG. 4A, the M feature extraction sub-networks 410-1, 410-2 … 410-M may be M CNNs. The sample audio data may be time series data, in particular a time-domain waveform of audio or a discrete time-domain signal, or other raw or processed audio signals or audio data as will be understood by those skilled in the art.
The back-end classifier may use an RNN, a CNN, or other networks. As an example, a CNN may be used to extract features, a Residual Block module may then be used to integrate the audio features so that they are more distinctive, and finally the features integrated by the Residual Block module are fed into an RNN network, which models the time sequence to obtain audio-level features. As a more specific, non-limiting example, the back-end classifier may use gated recurrent units (GRUs), but those skilled in the art will appreciate that the present disclosure is not so limited.
The operation of the audio recognition model is described below in connection with a more specific, non-limiting example model 4200 with reference to FIG. 4B. For audio with a sampling frequency of 16 kHz, a maximum frequency of 8000 Hz, and a threshold length of 4 s, there are 16000 × 4 = 64000 sample points. The bandwidth of each corresponding filtering network 4211, 4212 … 421M in this case is 8000/M Hz. As one example, each filtering network 4211, 4212 … 421M may be a CNN network of dimension (129, 0, 128).
A pooling layer 4220, such as a max pooling layer with batch normalization (Batch Norm) and a leaky ReLU activation, may be arranged after the feature extraction sub-network to extract the largest feature points of each feature by pooling. For example, if the pooling coefficient is 3, the data dimension can be calculated as (64000 - 128)/3 ≈ 21290.
Residual blocks can also be placed in the neural network to further reduce the data dimensionality. As an example, a first residual network 4230 may be set in the model, with 128 channels and a stride of 1. Thus, the data dimension may be calculated as 21290/3 ≈ 7096. Thereafter, the data may be subjected to a feature map scaling process. One possible embodiment is to derive a corresponding coefficient c for each feature s via the softmax function, and then calculate the updated feature s' by s' = c × s + c. Similarly, two first residual networks 4230 may be provided (not shown), and after passing through the second such residual network, the dimension may be further reduced to 7096/3 ≈ 2365.
More residual blocks can be set to further reduce the dimension. As an example, as shown in FIG. 4B, 4 identical second residual networks 4240 are provided, each with 512 channels and a stride of 1, and the dimension may be further reduced to 2365/3^4 ≈ 29. The reduced-dimension data may be input to a gated recurrent unit (GRU) 4250 to integrate information over the entire audio. Finally, the data passes through the fully connected layer 4260 for a linear transformation.
It is to be understood that the number of modules, the module names and the division methods, and the parameters of the audio data, etc., described above in connection with fig. 4B are examples, and the present disclosure is not limited thereto.
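The following PyTorch-style sketch mirrors the FIG. 4B pipeline at a high level: M band-limited convolutional "filter" sub-networks, batch normalization with a leaky ReLU and max pooling, a residual-style convolution stage, a GRU that integrates the whole clip, and a final fully connected layer. All channel counts, kernel sizes, pooling factors, and the simplified single-stage stand-in for the residual blocks are illustrative assumptions and do not reproduce the exact 128/512-channel configuration above.

```python
import torch
import torch.nn as nn

class AudioRecognitionModel(nn.Module):
    def __init__(self, num_bands=8, kernel_len=129, hidden=128, num_classes=2):
        super().__init__()
        # M band-limited "filter" convolutions applied to the raw waveform.
        self.filters = nn.ModuleList(
            [nn.Conv1d(1, 1, kernel_size=kernel_len) for _ in range(num_bands)]
        )
        # Batch Norm + leaky ReLU + max pooling after the filter stage.
        self.post = nn.Sequential(
            nn.BatchNorm1d(num_bands), nn.LeakyReLU(), nn.MaxPool1d(3)
        )
        # Simplified stand-in for the residual blocks that reduce dimensionality.
        self.res = nn.Sequential(
            nn.Conv1d(num_bands, hidden, 3, padding=1), nn.LeakyReLU(), nn.MaxPool1d(3)
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                         # x: (batch, 1, samples)
        bands = torch.cat([f(x) for f in self.filters], dim=1)
        h = self.res(self.post(bands))            # (batch, hidden, time)
        out, _ = self.gru(h.transpose(1, 2))      # GRU integrates the whole clip
        return self.fc(out[:, -1, :])             # last time step -> class logits
```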
According to some embodiments, the M sets of filter parameters may be set by: acquiring a predetermined frequency range; dividing the predetermined frequency range into M consecutive sub-bands; and setting the lower limit frequency and the upper limit frequency of each of the M consecutive sub-bands as the lower cutoff frequency (denoted f1) and the upper cutoff frequency (denoted f2) in the corresponding set of filter parameters. The network is thus initialized with filter parameters covering the full frequency range, and the trained neural network can better cover the various characteristics of the audio.
For example, suppose the total frequency band covers 0-8 kHz and a filtering scheme of 8 filters, i.e., 8 feature extraction sub-networks (e.g., 8 CNNs), is set. In this case, as an exemplary frequency-averaging scheme, the start and end points of the bands may be 0-1 kHz, 1-2 kHz, and so on. One CNN may then be set as a filter in each frequency band to extract the features in that band.
The number of filters or feature extraction sub-networks may be set based on experience, accuracy requirements, system computing power, and so on. For example, a smaller value of M (or M = 1, i.e., no division of the frequency band) results in larger frequency band intervals, a simpler system architecture, and faster computation; a larger value of M (e.g., tens or even hundreds) can result in finer-grained intervals and more accurate results. However, an excessively large M value may slow down computation, or the accuracy may stop increasing as M grows because the information in each interval becomes too little to characterize the audio. As one example, 10, 20, or 40 frequency bands may be set for an 8 kHz sampling rate, and those skilled in the art will appreciate that the present disclosure is not limited thereto.
It will be appreciated that the division of the frequency bands may be uniform or non-uniform, depending on the filter type or filter formula. For example, for rectangular filter parameters, an equal-width band division may be adopted, while for other filter formulas (e.g., a mel filter) a non-uniform division scheme may be used. It is to be understood that the present disclosure is not limited thereto.
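A small helper illustrating the equal-width division just described; the function name and its return format (a list of (f1, f2) pairs) are assumptions for illustration.

```python
def band_edges(f_max_hz: float, num_bands: int):
    # Split [0, f_max_hz] into num_bands equal sub-bands; for f_max_hz = 8000
    # and num_bands = 8 this yields (0, 1000), (1000, 2000), ..., (7000, 8000).
    width = f_max_hz / num_bands
    return [(i * width, (i + 1) * width) for i in range(num_bands)]
```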
A method according to an alternative embodiment of the present disclosure is described below using a rectangular filter as an example. It will be appreciated that such filter parameters are merely examples, and other filtering formulas in the signal and system processing arts may also be applicable to the initialization and training of audio recognition models according to embodiments of the present disclosure. According to some embodiments, each of the M sets of filter parameters may correspond to the parameters of a rectangular filter in the frequency domain, and dividing the predetermined frequency range into M consecutive frequency sub-bands may comprise dividing the predetermined frequency range equally to obtain M sub-bands of the same width. For a rectangular filter, this equal-width division scheme is appropriate according to empirical formulas in the signal processing field.
For each divided frequency range, denoted [f1, f2], the initialization parameters of the corresponding neural network are set respectively. A rectangular filter in the frequency domain corresponds to a sinc function in the time domain. An example of the corresponding time-domain sinc filtering parameters is given below:
g[n, f1, f2] = 2f2·sinc(2πf2n) - 2f1·sinc(2πf1n)
The g[n, f1, f2] thus calculated can be used to initialize the weights of the feature extraction sub-network. As one example, the first layer of the neural network may be initialized using the parameters of the above function, while later layers may be initialized randomly. Such a model may then be trained so that the convolution kernel corresponding to the filter is continuously updated to learn kernel parameters appropriate for the desired task.
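A hedged sketch of computing the sinc-based kernel g[n, f1, f2] and copying it into the first convolution layer of a feature extraction sub-network is shown below. Normalizing f1 and f2 by the sampling rate, centring n on zero, and reusing the same kernel for every channel of the layer are simplifying assumptions not spelled out in the text above.

```python
import numpy as np
import torch
import torch.nn as nn

def sinc(x: np.ndarray) -> np.ndarray:
    # Unnormalized sinc sin(x)/x, with the x = 0 sample set to its limit of 1.
    out = np.ones_like(x)
    nz = x != 0
    out[nz] = np.sin(x[nz]) / x[nz]
    return out

def sinc_band_kernel(f1, f2, kernel_len, fs):
    # g[n, f1, f2] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n),
    # with f1, f2 normalized by the sampling rate fs and n centred on zero.
    n = np.arange(kernel_len) - (kernel_len - 1) / 2.0
    f1n, f2n = f1 / fs, f2 / fs
    return 2 * f2n * sinc(2 * np.pi * f2n * n) - 2 * f1n * sinc(2 * np.pi * f1n * n)

def init_first_conv(conv: nn.Conv1d, f1, f2, fs):
    # Copy the band's kernel into every output channel of the first conv layer;
    # later layers keep their random initialization.
    k = torch.tensor(sinc_band_kernel(f1, f2, conv.kernel_size[0], fs),
                     dtype=torch.float32)
    with torch.no_grad():
        conv.weight.copy_(k.repeat(conv.out_channels, conv.in_channels, 1))
    return conv
```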
According to some embodiments, the real tags of the sample audio data may indicate whether the sample data is real human voice or machine-generated audio. The trained model can then perform voice liveness recognition and prevent recording attacks. A recording anti-attack system needs to ensure that the audio received by the voiceprint system is real human voice, thereby ensuring the security of the voiceprint system. It is understood that this application is merely an example, and that such training methods and trained models may be applied to speech recognition, classification, and other purposes.
Further, in order to reduce or prevent spectral leakage for better feature extraction, according to some embodiments, each of the N sets of filtering parameters may correspond to a set of parameters of the windowed filter. Continuing with the example of the rectangular filter above, the filter parameter formula after windowing for initialization may be as follows
gw[n,f1,f2]=g[n,f1,f2]·w[n]
Where w [ n ] is a window function. As an example, a hamming window function as shown below may be employed,
Figure BDA0003309872200000131
where n represents time and L represents the length of the convolution kernel.
In such an embodiment, the gw[n, f1, f2] thus calculated may be used to initialize the weights of the feature extraction sub-network, for example the first layer of the neural network. It will be appreciated that the above filter type, window function type, and filter formulas are merely examples; those skilled in the art will appreciate that other sets of parameters related to filtering and feature extraction may be used to initialize a feature extraction sub-network according to embodiments of the present disclosure and achieve better results than random initialization.
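A short continuation of the previous sketch showing the windowed kernel gw[n, f1, f2]; the Hamming window follows the formula above (with the kernel length L as the denominator), and sinc_band_kernel refers to the helper defined in the earlier sketch.

```python
import numpy as np

def hamming_window(kernel_len: int) -> np.ndarray:
    # w[n] = 0.54 - 0.46 * cos(2*pi*n / L), with L the convolution kernel length.
    n = np.arange(kernel_len)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / kernel_len)

def windowed_sinc_kernel(f1, f2, kernel_len, fs):
    # g_w[n, f1, f2] = g[n, f1, f2] * w[n] (element-wise); sinc_band_kernel is
    # the helper from the earlier initialization sketch.
    return sinc_band_kernel(f1, f2, kernel_len, fs) * hamming_window(kernel_len)
```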
According to some embodiments, obtaining sample audio data may comprise: acquiring original audio data; and, in response to determining that the time length of the original audio data is greater than a sample length threshold, truncating the original audio data using the sample length threshold to obtain at least one sample audio data, wherein the time length of each of the at least one sample audio data is equal to the sample length threshold. According to some embodiments, obtaining sample audio data may comprise: acquiring original audio data; in response to determining that the time length of the original audio data is less than the sample length threshold, copying the original audio data until the time length of the copied original audio data is not less than the sample length threshold; and truncating the copied original audio data using the sample length threshold to obtain sample audio data, wherein the time length of the sample audio data is equal to the sample length threshold. The samples can thereby be processed to obtain a regular sample length. The input of a neural network is often data of a fixed length, so in practical training and testing, data longer than this length needs to be truncated, and data shorter than this length needs to be copied and then truncated. As previously mentioned, the sample length threshold may be determined based on the task requirements for which the model is to be used and the balance between computation speed and the amount of content contained. As an example, the length threshold may be 0.5 s, 1 s, 4 s, 10 s, etc.
As a specific, non-limiting example, where the sample length threshold is 4 seconds, the method may include first slicing the audio: if the audio is longer than 4 seconds, it is sliced into consecutive 4-second segments; if the audio is shorter than 4 s, it is copied from the beginning until long enough, and a 4 s segment is then cut from the copied audio.
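A minimal sketch of this sample-slicing rule, reusing the to_fixed_length helper from the earlier sketch for clips shorter than the threshold; seg_len is the sample length threshold in samples (e.g., 4 s × 16000 Hz), and the function name is an assumption.

```python
def slice_samples(audio, seg_len: int):
    # Audio longer than the threshold is cut into consecutive seg_len segments;
    # shorter audio is copied and truncated via to_fixed_length (defined above).
    if len(audio) >= seg_len:
        return [audio[i:i + seg_len]
                for i in range(0, len(audio) - seg_len + 1, seg_len)]
    return [to_fixed_length(audio, seg_len)]
```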
According to one or more embodiments of the present disclosure, the model may be a binary classification model trained with manually labeled true and false samples (e.g., samples labeled as attack audio and real human voice, respectively, in the case of recording anti-attack or liveness recognition), and the loss function may be designed as a cross-entropy function.
According to one or more embodiments of the present disclosure, audio data may first be obtained, and the time-series audio data may be segmented and padded to a predetermined length. The number N of filters is set as required, the upper and lower cutoff frequencies f1 and f2 of each frequency band are calculated from the number N of filters and the sampling rate, and the parameter set of each filter, i.e., the initialization parameters of each feature extraction sub-network, is then calculated from f1 and f2. Each feature extraction sub-network (e.g., CNN) is then initialized with these parameters. After model construction and initialization are complete, training is performed with the samples, and thereafter the model may be tested. Each audio clip may also be regularized to a predetermined length (e.g., 4 s) at test time. After training and testing are completed, the trained model can be used for judgment. For example, where the real tags of the sample audio data indicate real human voice or machine-generated audio, the trained model may judge audio with a high output probability as real human voice and audio with a low probability as attack audio.
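Tying the preceding sketches together, the following hedged setup illustrates the order of the steps just described: split the audio to a fixed length, derive per-band cutoff frequencies from the number of filters and the sampling rate, initialize each sub-network's first convolution from those frequencies, then train. All helper and class names (AudioRecognitionModel, band_edges, init_first_conv, slice_samples, train_step) come from the earlier sketches, and dataset is an assumed iterable of (waveform, label) pairs.

```python
import torch

fs, num_bands, kernel_len, seg_len = 16000, 8, 129, 4 * 16000

model = AudioRecognitionModel(num_bands=num_bands, kernel_len=kernel_len)
for conv, (f1, f2) in zip(model.filters, band_edges(fs / 2, num_bands)):
    init_first_conv(conv, f1, f2, fs)               # per-band initialization

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for clip, label in dataset:                          # dataset: (np.ndarray, int) pairs
    for segment in slice_samples(clip, seg_len):
        x = torch.tensor(segment, dtype=torch.float32).view(1, 1, -1)
        y = torch.tensor([label])
        train_step(model, optimizer, x, y)
```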
An audio data recognition apparatus 500 according to an embodiment of the present disclosure will now be described with reference to fig. 5. The audio data recognition apparatus 500 may include an audio data acquisition unit 501, a feature extraction unit 502, and a classification unit 503. The audio data acquiring unit 501 is used for acquiring audio data to be identified. The feature extraction unit 502 is configured to perform feature extraction on the audio data to be identified respectively by using N parameter sets to obtain N feature data of the audio data to be identified, where each parameter set in the N parameter sets is associated with a different frequency range respectively, and N is a positive integer greater than 1. The classification unit 503 is configured to classify the audio data to be recognized based on the N feature data.
According to the device of the embodiment of the disclosure, the audio data can be identified more accurately.
Referring now to fig. 6, an apparatus 600 for training an audio data recognition model according to an embodiment of the present disclosure is described. The audio data recognition model may comprise M feature extraction sub-networks and a classification sub-network connected to an output of each of the M feature extraction sub-networks, M being a positive integer greater than 1. The training apparatus 600 for audio data recognition models may include a sample obtaining unit 601, a feature extracting unit 602, a classifying unit 603, a loss function calculating unit 604, and a parameter adjusting unit 605. The sample acquiring unit 601 is used for acquiring sample audio data and a real tag of the sample audio data. The feature extraction unit 602 is configured to input the sample audio data into each of the M feature extraction subnetworks to obtain M feature data for the sample audio data. The classification unit 603 is configured to input the M feature data into a classification sub-network to obtain a prediction tag of the sample audio data. The loss function calculation unit 604 is configured to calculate a loss function based on the true label and the predicted label. The parameter adjustment unit 605 is configured to adjust parameters of the audio data recognition model based on the loss function. In such an embodiment, each of the M feature extraction sub-networks is initialized based on a respective one of the M sets of filter parameters, respectively, and each of the M sets of filter parameters includes an upper cutoff frequency and a lower cutoff frequency.
According to the device disclosed by the embodiment of the disclosure, the feature extraction sub-network can be initialized by utilizing the filtering parameters, so that a faster model learning process and a better convergence effect are realized.
In the technical solution of the present disclosure, the collection, acquisition, storage, use, processing, transmission, provision, public disclosure, and other processing of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to fig. 7, a block diagram of the structure of an electronic device 700, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the methods 200 and/or 300 and variations thereof. For example, in some embodiments, the methods 200 and/or 300, variations thereof, and so forth, may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the methods 200 and/or 300, variations thereof, and the like, described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other suitable manner (e.g., by way of firmware) to perform the methods 200 and/or 300, variations thereof, and so forth.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure, and various elements in the embodiments or examples may be combined in various ways. It is important to note that, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (17)

1. An audio data recognition method, comprising:
acquiring audio data to be identified;
respectively performing feature extraction on the audio data to be identified by using N parameter sets to obtain N feature data of the audio data to be identified, wherein each parameter set in the N parameter sets is respectively associated with different frequency ranges, and N is a positive integer greater than 1; and
classifying the audio data to be identified based on the N feature data.
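For illustration only (this sketch is not part of the claims and does not limit them): assuming each of the N parameter sets is a pair of lower/upper cutoff frequencies, the feature extraction and classification of claim 1 could be approximated as follows; the Butterworth band-pass filter, the per-band statistics, and all function names are hypothetical choices.

```python
# Illustrative sketch only; names and feature choices are hypothetical.
import numpy as np
from scipy.signal import butter, sosfilt

def extract_band_features(audio, sample_rate, parameter_sets):
    """Compute one feature vector per parameter set, where each set is a
    (lower_cutoff_hz, upper_cutoff_hz) pair defining a distinct frequency range."""
    features = []
    for low_hz, high_hz in parameter_sets:
        # Band-limit the signal to this parameter set's frequency range.
        sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
        band = sosfilt(sos, audio)
        # Simple per-band descriptors: log energy and zero-crossing rate.
        log_energy = np.log(np.mean(band ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(band)))) / 2.0
        features.append(np.array([log_energy, zcr]))
    return features  # N feature vectors, one per parameter set

def classify(features, classifier):
    # Concatenate the N per-band features and hand them to any trained classifier
    # exposing a scikit-learn-style predict() method.
    return classifier.predict(np.concatenate(features)[None, :])
```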
2. The method of claim 1, wherein obtaining audio data to be identified comprises:
acquiring original audio data;
in response to determining that the time length of the original audio data is greater than a length threshold, truncating the original audio data based on the length threshold to obtain audio data to be identified, wherein the time length of the audio data to be identified is equal to the length threshold.
3. The method of claim 1, wherein obtaining audio data to be identified comprises:
acquiring original audio data;
in response to determining that the time length of the original audio data is less than a length threshold, copying the original audio data until the time length of the copied original audio data is not less than the length threshold; and
truncating the copied original audio data based on the length threshold to obtain audio data to be identified, wherein the time length of the audio data to be identified is equal to the length threshold.
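For illustration only (not part of the claims): a minimal sketch of the fixed-length preparation described in claims 2 and 3, assuming the length threshold is expressed in samples; the function name is hypothetical.

```python
import numpy as np

def to_fixed_length(audio, length_threshold):
    """Return audio whose length (in samples) equals length_threshold:
    long inputs are truncated; short inputs are copied (tiled) and then truncated."""
    if len(audio) >= length_threshold:
        return audio[:length_threshold]
    repeats = int(np.ceil(length_threshold / len(audio)))
    return np.tile(audio, repeats)[:length_threshold]
```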
4. A method of training an audio data recognition model, the audio data recognition model comprising M feature extraction sub-networks and a classification sub-network connected to an output of each of the M feature extraction sub-networks, M being a positive integer greater than 1, the method comprising:
acquiring sample audio data and a real label of the sample audio data;
inputting the sample audio data into each of the M feature extraction sub-networks to obtain M feature data for the sample audio data;
inputting the M feature data into the classification sub-network to obtain a predicted label for the sample audio data;
calculating a loss function based on the real label and the predicted label; and
adjusting parameters of the audio data recognition model based on the loss function.
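For illustration only (not part of the claims): a minimal PyTorch sketch of the training step of claim 4, assuming the M feature extraction sub-networks and the classification sub-network are supplied as ordinary nn.Module instances; the class and function names, the concatenation of the M feature tensors, and the choice of loss function are hypothetical.

```python
import torch
import torch.nn as nn

class AudioDataRecognitionModel(nn.Module):
    def __init__(self, feature_subnetworks, classification_subnetwork):
        super().__init__()
        # M feature extraction sub-networks; the classification sub-network is
        # connected to the output of each of them.
        self.feature_subnetworks = nn.ModuleList(feature_subnetworks)
        self.classification_subnetwork = classification_subnetwork

    def forward(self, sample_audio):
        # Each sub-network yields one feature tensor for the same input.
        feature_list = [net(sample_audio) for net in self.feature_subnetworks]
        return self.classification_subnetwork(torch.cat(feature_list, dim=-1))

def train_step(model, sample_audio, real_label, optimizer, loss_fn):
    optimizer.zero_grad()
    predicted_label = model(sample_audio)
    loss = loss_fn(predicted_label, real_label)  # loss from real vs. predicted label
    loss.backward()
    optimizer.step()  # adjust the model parameters based on the loss
    return loss.item()
```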
5. The method of claim 4, wherein,
each of the M feature extraction sub-networks is initialized based on a respective one of M filter parameter sets, each of the M filter parameter sets including an upper cutoff frequency and a lower cutoff frequency.
6. The method of claim 4, wherein the M filter parameter sets are set by:
acquiring a preset frequency range;
dividing the predetermined frequency range into M consecutive sub-bands; and
setting the lower limit frequency and the upper limit frequency of each of the M consecutive sub-bands as the lower cutoff frequency and the upper cutoff frequency, respectively, of the corresponding filter parameter set.
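For illustration only (not part of the claims): a minimal sketch of the sub-band assignment of claims 5 and 6 under an equal-width split; the function name and the dictionary keys are hypothetical.

```python
import numpy as np

def subband_filter_parameter_sets(f_min_hz, f_max_hz, num_subbands):
    """Divide [f_min_hz, f_max_hz] into consecutive, equal-width sub-bands and use
    each sub-band's lower/upper limit as that filter parameter set's cutoffs."""
    edges = np.linspace(f_min_hz, f_max_hz, num_subbands + 1)
    return [{"lower_cutoff_hz": float(lo), "upper_cutoff_hz": float(hi)}
            for lo, hi in zip(edges[:-1], edges[1:])]

# Example: a 0-8000 Hz predetermined range divided across M = 4 sub-networks.
print(subband_filter_parameter_sets(0.0, 8000.0, 4))
```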
7. The method of claim 5 or 6, wherein each of the M filter parameter sets corresponds to a set of parameters of a rectangular filter in the frequency domain.
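For illustration only (not part of the claims): one plausible reading of a rectangular filter in the frequency domain is an ideal band-pass mask applied to the FFT of the signal, as sketched below; the function name is hypothetical.

```python
import numpy as np

def rectangular_bandpass(audio, sample_rate, lower_cutoff_hz, upper_cutoff_hz):
    """Ideal (rectangular) band-pass filtering in the frequency domain:
    keep only the FFT bins whose frequency lies within the cutoff range."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    mask = (freqs >= lower_cutoff_hz) & (freqs <= upper_cutoff_hz)
    return np.fft.irfft(spectrum * mask, n=len(audio))
```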
8. The method of claim 7, wherein dividing the predetermined frequency range into M consecutive sub-bands comprises:
equally dividing the predetermined frequency range to obtain M sub-bands of the same width.
9. The method of any of claims 5-8, wherein each of the M filter parameter sets corresponds to a set of parameters of a windowed filter.
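For illustration only (not part of the claims): a windowed filter can be read as a windowed-sinc FIR band-pass filter, sketched below with SciPy's firwin; the function name and the Hamming window choice are hypothetical.

```python
from scipy.signal import firwin, lfilter

def windowed_bandpass(audio, sample_rate, lower_cutoff_hz, upper_cutoff_hz, num_taps=101):
    """Band-pass filtering with a windowed-sinc FIR filter: the ideal rectangular
    frequency response is smoothed by a Hamming window to limit side lobes."""
    taps = firwin(num_taps, [lower_cutoff_hz, upper_cutoff_hz],
                  pass_zero=False, fs=sample_rate, window="hamming")
    return lfilter(taps, 1.0, audio)
```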
10. The method of any of claims 4-9, wherein obtaining sample audio data comprises:
acquiring original audio data; and
in response to determining that the time length of the original audio data is greater than a sample length threshold, truncating the original audio data based on the sample length threshold to obtain at least one sample audio data, wherein the time length of each of the at least one sample audio data is equal to the sample length threshold.
11. The method of any of claims 4-9, wherein obtaining sample audio data comprises:
acquiring original audio data;
in response to determining that the time length of the original audio data is less than a sample length threshold, copying the original audio data until the time length of the copied original audio data is not less than the sample length threshold; and
truncating the copied original audio data based on the sample length threshold to obtain sample audio data, wherein the time length of the sample audio data is equal to the sample length threshold.
12. The method of any of claims 4-11, wherein the real label of the sample audio data comprises a label indicating whether the sample audio data is real human voice or machine-generated audio.
13. An audio data recognition apparatus comprising:
an audio data acquisition unit configured to acquire audio data to be identified;
a feature extraction unit configured to respectively perform feature extraction on the audio data to be identified by using N parameter sets, so as to obtain N feature data of the audio data to be identified, wherein each parameter set in the N parameter sets is respectively associated with different frequency ranges, and N is a positive integer greater than 1; and
a classification unit configured to classify the audio data to be identified based on the N feature data.
14. A training apparatus for an audio data recognition model, the audio data recognition model comprising M feature extraction sub-networks and a classification sub-network connected to an output of each of the M feature extraction sub-networks, M being a positive integer greater than 1, the training apparatus comprising:
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring sample audio data and a real label of the sample audio data;
a feature extraction unit configured to input the sample audio data into each of the M feature extraction subnetworks to acquire M feature data for the sample audio data;
a classification unit for inputting the M feature data into the classification sub-network to obtain a prediction label of the sample audio data;
a loss function calculation unit for calculating a loss function based on the true label and the predicted label; and
and the parameter adjusting unit is used for adjusting the parameters of the audio data identification model based on the loss function.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3 or 4-12.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-3 or 4-12.
17. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-3 or 4-12.
CN202111213690.5A 2021-10-19 2021-10-19 Audio recognition method, audio recognition model training method and device and electronic equipment Pending CN113851147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111213690.5A CN113851147A (en) 2021-10-19 2021-10-19 Audio recognition method, audio recognition model training method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111213690.5A CN113851147A (en) 2021-10-19 2021-10-19 Audio recognition method, audio recognition model training method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113851147A true CN113851147A (en) 2021-12-28

Family

ID=78978739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111213690.5A Pending CN113851147A (en) 2021-10-19 2021-10-19 Audio recognition method, audio recognition model training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113851147A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399005A (en) * 2022-03-10 2022-04-26 深圳市声扬科技有限公司 Training method, device, equipment and storage medium of living body detection model

Similar Documents

Publication Publication Date Title
CN112749758B (en) Image processing method, neural network training method, device, equipment and medium
CN114648638A (en) Training method of semantic segmentation model, semantic segmentation method and device
CN114494935B (en) Video information processing method and device, electronic equipment and medium
US20230047628A1 (en) Human-object interaction detection
CN114443989B (en) Ranking method, training method and device of ranking model, electronic equipment and medium
CN113851147A (en) Audio recognition method, audio recognition model training method and device and electronic equipment
CN115600646B (en) Language model training method, device, medium and equipment
CN113596011B (en) Flow identification method and device, computing device and medium
CN112905743B (en) Text object detection method, device, electronic equipment and storage medium
CN114495292A (en) Identity recognition method, device, equipment and readable storage medium
CN114494797A (en) Method and apparatus for training image detection model
CN114998963A (en) Image detection method and method for training image detection model
CN114547252A (en) Text recognition method and device, electronic equipment and medium
CN113936668A (en) Method, apparatus, device and medium for intelligent voice device
CN114120420B (en) Image detection method and device
US20220385583A1 (en) Traffic classification and training of traffic classifier
CN114140851B (en) Image detection method and method for training image detection model
CN114140852B (en) Image detection method and device
CN113284484B (en) Model training method and device, voice recognition method and voice synthesis method
CN114974263A (en) Identity authentication method, device, equipment and storage medium
CN114023349A (en) Voice processing method and device, electronic equipment and storage medium
CN113889120A (en) Voiceprint feature extraction method and device, electronic equipment and storage medium
CN114882571A (en) Method for acquiring image information and method for training image detection model
CN114187924A (en) Data processing method, device, electronic equipment and medium
CN114511757A (en) Method and apparatus for training image detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination