CN113948103A - Audio processing method and device, model training method and device, equipment and medium

Info

Publication number: CN113948103A
Application number: CN202111202648.3A
Authority: CN
Inventors: 赵情恩
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2022-01-18

Abstract

The disclosure provides an audio processing method and device, a model training method and device, electronic equipment and a medium, and relates to the field of artificial intelligence, in particular to the field of voice technology. The implementation scheme is as follows: sequentially determining local feature information of each audio frame in a plurality of audio frames extracted from audio data to be processed, wherein the audio data to be processed comprises audio data from at least two sound sources; and determining any one of the plurality of audio frames as a target audio frame, and performing the following operations for the target audio frame: determining global feature information of a target audio frame based on the local feature information of each of the plurality of audio frames; and determining the sound source classification corresponding to the target audio frame based on the global feature information of the target audio frame.

Description

Audio processing method and device, model training method and device, equipment and medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, in particular to the field of speech technologies, and more particularly to an audio processing method and apparatus, a model training method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it encompasses both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like. Artificial intelligence software technologies mainly include computer vision, audio recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.

Disclosure of Invention

The present disclosure provides a method of audio processing, a model training method, an apparatus, an electronic device, a computer readable storage medium and a computer program product.

According to an aspect of the present disclosure, there is provided an audio processing method including: sequentially determining local feature information of each audio frame in a plurality of audio frames extracted from audio data to be processed, wherein the audio data to be processed comprises audio data from at least two sound sources; and determining any one of the plurality of audio frames as a target audio frame, and performing the following operations for the target audio frame: determining global feature information of a target audio frame based on the local feature information of each of the plurality of audio frames; and determining the sound source classification corresponding to the target audio frame based on the global feature information of the target audio frame.

According to another aspect of the present disclosure, there is provided a method for training an audio processing model, wherein the audio processing model includes a local feature extraction module, a global feature extraction module, and an output module, the method including: sequentially aiming at each sample audio frame in a plurality of sample audio frames extracted from sample audio data, acquiring first local feature information of the sample audio frame by using a local feature extraction module, wherein the sample audio data comprises audio data from a first number of sound sources, and each sample audio frame in the plurality of sample audio frames has a sound source label; acquiring global feature information of each sample audio frame in a plurality of sample audio frames by using a global feature extraction module at least based on the first local feature information of each sample audio frame in the plurality of sample audio frames; sequentially inputting the global feature information of each sample audio frame in a plurality of sample audio frames into an output module so as to obtain a first confidence coefficient of the sample audio frame output by the output module for each predicted sound source classification in a first number of predicted sound source classifications; and adjusting parameters of the audio processing model based on a first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames.

According to another aspect of the present disclosure, there is provided an audio processing apparatus including: a local feature extraction module configured to determine, in turn, local feature information of each of a plurality of audio frames extracted from audio data to be processed, wherein the audio data to be processed includes audio data from at least two sound sources; a global feature extraction module configured to determine any one of the plurality of audio frames as a target audio frame, and to determine global feature information of the target audio frame based on the local feature information of each of the plurality of audio frames; and an output module configured to determine the sound source classification corresponding to the target audio frame based on the global feature information of the target audio frame.

According to another aspect of the present disclosure, there is provided an apparatus for training an audio processing model, wherein the audio processing model includes a local feature extraction module, a global feature extraction module, and an output module, the apparatus including: a first obtaining module configured to obtain, in turn and by using the local feature extraction module, first local feature information of each of a plurality of sample audio frames extracted from sample audio data, wherein the sample audio data includes audio data from a first number of sound sources, and each of the plurality of sample audio frames has a sound source label; a second obtaining module configured to obtain global feature information of each of the plurality of sample audio frames by using the global feature extraction module based on at least the first local feature information of each of the plurality of sample audio frames; a third obtaining module configured to sequentially input, for each of the plurality of sample audio frames, the global feature information of the sample audio frame into the output module, so as to obtain a first confidence of the sample audio frame output by the output module for each of the first number of predicted sound source classifications; and a first adjustment module configured to adjust parameters of the audio processing model based on the first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform any one of the methods described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform any one of the methods described above.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes any of the methods described above when executed by a processor.

According to one or more embodiments of the present disclosure, effective separation of audio data to be processed may be achieved according to differences in sound sources.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;

FIGS. 2A and 2B illustrate a flow diagram of an audio processing method according to an embodiment of the present disclosure;

FIG. 3 shows a schematic diagram of confidence of sound source classification according to an embodiment of the present disclosure;

FIG. 4 shows a flow diagram of a method of training an audio processing model according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of a training method of an audio processing model according to an embodiment of the present disclosure;

FIG. 6 shows a block diagram of an audio processing model according to an embodiment of the present disclosure;

FIG. 7 shows a block diagram of an apparatus for training an audio processing model according to an embodiment of the present disclosure;

FIG. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.

The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.

For a segment of audio data to be processed that is acquired by an audio acquisition device, it is often necessary to first separate the audio data according to the sound source from which each part of it originates, and then to perform targeted processing on the separated audio data of each single sound source. For example, in scenarios such as intelligent customer service, conference discussion, interview conversation, and public security inquiries, multiple speakers share a single channel in the audio data acquired by the audio acquisition device. It is therefore necessary to first perform voice separation on the audio data to be processed, that is, to separate the audio data of the different speakers, and then to analyze the audio data of each speaker individually. However, the related art lacks a simple and effective scheme for separating audio data to be processed.

Based on this, the present disclosure provides an audio processing method that, for a target audio frame in audio data to be processed, determines global feature information of the target audio frame based on local feature information of each of a plurality of audio frames in the audio data to be processed, and determines the sound source classification corresponding to the target audio frame based on the global feature information of the target audio frame. Because information from a plurality of audio frames in the audio data to be processed is used, global feature information that distinguishes the target audio frame from the other audio frames can be mined. Based on this global feature information, the target audio frame and those other audio frames that come from the same sound source can be assigned to the same sound source classification, while the target audio frame and audio frames that come from different sound sources can be assigned to different sound source classifications. The entire audio data to be processed can thus be effectively separated by sound source.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120.

Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the method of audio processing to be performed.

In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

A user may use client devices 101, 102, 103, 104, 105, and/or 106 to obtain audio data to be processed. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middleware applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.

In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.

In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or a smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

Fig. 2A and 2B are flowcharts illustrating an audio processing method according to an exemplary embodiment of the present disclosure, and as shown in fig. 2A and 2B, an audio processing method includes: step S201, sequentially aiming at each audio frame in a plurality of audio frames extracted from audio data to be processed, determining local characteristic information of the audio frame, wherein the audio data to be processed comprises audio data from at least two sound sources; and step S202, determining any one of the audio frames as a target audio frame, and executing the following operations aiming at the target audio frame: step S202-1, determining global feature information of a target audio frame based on local feature information of each audio frame in a plurality of audio frames; and step S202-2, determining the sound source classification corresponding to the target audio frame based on the global characteristic information of the target audio frame.

Because global feature information that distinguishes the target audio frame from the other audio frames can be mined from the information of a plurality of audio frames in the audio data to be processed, the target audio frame and those other audio frames that come from the same sound source can be assigned to the same sound source classification, while the target audio frame and audio frames that come from different sound sources can be assigned to different sound source classifications. The entire audio data to be processed can thus be effectively separated by sound source.

According to some embodiments, the audio data to be processed may be preprocessed before the audio frames are extracted from it. The preprocessing may include, but is not limited to, removing noise (including ambient noise, busy tones, ringback tones, etc.) to obtain clean audio data to be processed.

For step S201, an audio frame may be extracted from the audio data to be processed through a time window with a preset length.

According to some embodiments, the plurality of audio frames may be sequentially adjacent in a time domain of the audio data to be processed. For example, the time window with the preset length may be 40ms, and the time window is moved by a step size of 10ms in the audio data to be processed, so that a plurality of audio frames sequentially adjacent in the time domain can be sequentially extracted from the audio data to be processed.
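The sliding-window extraction described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `extract_frames`, the list-of-samples representation, and the 16 kHz sampling rate used in the example are assumptions; only the 40 ms window and 10 ms step come from the text.

```python
def extract_frames(samples, sample_rate, window_ms=40, step_ms=10):
    """Slice a mono signal into overlapping frames using a sliding time window."""
    window = int(sample_rate * window_ms / 1000)  # samples per frame
    step = int(sample_rate * step_ms / 1000)      # samples between frame starts
    if window <= 0 or step <= 0:
        raise ValueError("window and step must each cover at least one sample")
    frames = []
    start = 0
    # Keep only full-length frames; a trailing partial frame is dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += step
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
frames = extract_frames([0.0] * 16000, sample_rate=16000)
print(len(frames), len(frames[0]))
```

Because the step is shorter than the window, consecutive frames overlap, so each frame shares most of its samples with its neighbors in the time domain.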

According to some embodiments, the local feature information of each audio frame may be one or more of mel-frequency cepstral coefficients (MFCCs), FBank features (Filter Banks), or Perceptual Linear Prediction (PLP) features, etc.

According to some embodiments, the local feature information of each audio frame may also be obtained by inputting one or more of time domain or frequency domain features of each audio frame, for example, the mel-frequency cepstrum coefficients (MFCCs), FBank features (Filter Banks), or Perceptual Linear Prediction (PLP) features, into the trained local feature extraction module, so as to obtain the local feature information of the audio frame output by the local feature extraction module.

According to some embodiments, determining, for each of a plurality of audio frames extracted from the audio data to be processed in turn, the local feature information of the audio frame may include: the local feature information of the audio frame is determined based on the audio frame and one or more of the plurality of audio frames that are close to the audio frame. Therefore, the local feature information of the audio frame can be determined based on the information in the local time range of the audio frame contained in the audio data to be processed, and therefore certain context information can be contained in the local feature information of the audio frame.
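One simple way to let the local feature information of a frame reflect nearby frames is to concatenate its feature vector with those of its neighbors. The sketch below is an assumption made for illustration (the disclosure does not prescribe this particular operation); `add_context`, `left`, and `right` are hypothetical names.

```python
def add_context(features, left=2, right=2):
    """Concatenate each frame's feature vector with its neighbors' vectors."""
    if not features:
        return []
    last = len(features) - 1
    spliced = []
    for i in range(len(features)):
        vector = []
        for j in range(i - left, i + right + 1):
            # Clamp the index so edge frames reuse the first or last frame.
            vector.extend(features[min(max(j, 0), last)])
        spliced.append(vector)
    return spliced


# Three frames with one feature each, one neighbor on either side.
print(add_context([[1.0], [2.0], [3.0]], left=1, right=1))
```

A trained local feature extraction module could achieve a similar effect in other ways, for example with convolution over the time axis.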

According to some embodiments, each of the at least two sound sources is a speaker. Therefore, the voice separation in scenes such as intelligent customer service, conference discussion, interview conversation, public security inquiries and the like can be realized.

Based on the local feature information of each of the plurality of audio frames, step S202 may be further performed to determine the global feature information of the target audio frame. This mines global feature information that distinguishes the target audio frame from the other audio frames, so that the entire audio data to be processed can be effectively separated by sound source.

According to some embodiments, the global feature information of the target audio frame may be determined from information in a plurality of audio frames that are evenly distributed throughout the audio data to be processed.

According to some embodiments, the local feature information of the target audio frame may be modified based on the local feature information of each of the plurality of audio frames, and the global feature information of the target audio frame may be determined based on the modified local feature information.

According to some embodiments, the local feature information of each of the plurality of audio frames may be input to a trained global feature extraction module to obtain the global feature information of the target audio frame output by the global feature extraction module. The extracted global feature information is then used to cluster the plurality of audio frames. In other words, the target audio frame and those other audio frames that come from the same sound source are assigned to the same sound source classification, and the target audio frame and audio frames that come from different sound sources are assigned to different sound source classifications.
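To illustrate how the global feature information of a target frame can depend on the local feature information of every frame, the sketch below uses plain dot-product self-attention. This is an assumption chosen for illustration: the passage above does not state the architecture of the global feature extraction module, and a trained module would include learned parameters that this untrained stand-in omits. The name `global_features` is hypothetical.

```python
import math


def global_features(local_features):
    """Mix all frames' local features into each frame via dot-product attention."""
    dim = len(local_features[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in local_features:
        # Similarity of the target frame to every frame, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in local_features]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of all frames' local features.
        outputs.append([
            sum(w * frame[d] for w, frame in zip(weights, local_features))
            for d in range(dim)
        ])
    return outputs


# Two similar frames and one dissimilar frame.
for vector in global_features([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]):
    print(vector)
```

Frames with similar local features attend mostly to each other, so their outputs move closer together, which is the property that makes the subsequent per-frame classification easier.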

According to some embodiments, based on the global feature information of the target audio frame, a confidence of the target audio frame for each of the sound source classifications may be obtained, and based on the confidence, the sound source classification corresponding to the target audio frame may be predicted.

In particular, a confidence of the target audio frame for each of the plurality of sound source classifications may be derived based on the trained output module.
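As a rough sketch of what such an output module might compute, the example below applies a linear layer followed by a softmax to one frame's global feature vector. The linear-plus-softmax form, the function name `confidences`, and the parameter values are assumptions for illustration; the passage above only states that a trained output module yields one confidence per sound source classification.

```python
import math


def confidences(global_feature, weights, biases):
    """Map one frame's global feature vector to a confidence per classification."""
    # One logit per sound source classification: w . x + b.
    logits = [sum(w * x for w, x in zip(row, global_feature)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters for two sound source classifications.
weights = [[2.0, 0.0], [0.0, 2.0]]
biases = [0.0, 0.0]
print(confidences([1.0, 0.0], weights, biases))
```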

For example, when the number of sound source classifications is 2, fig. 3 is a diagram illustrating the confidence of a plurality of audio frames for each of the 2 sound source classifications according to an exemplary embodiment of the present disclosure.

As shown in fig. 3, the confidence of the 1st to 4th audio frames for the first sound source classification is significantly higher than their confidence for the second sound source classification, so the 1st to 4th audio frames can be predicted to come from the same sound source and are assigned to the first sound source classification.

The confidence of the 8th to 10th audio frames for the first sound source classification is significantly lower than their confidence for the second sound source classification, so the 8th to 10th audio frames can be predicted to come from the same sound source, which differs from the sound source of the 1st to 4th audio frames, and are assigned to the second sound source classification.

The confidence of the 5th to 7th audio frames for the first sound source classification is comparable to their confidence for the second sound source classification, so the 5th to 7th audio frames can be predicted to include audio data from both the first sound source classification and the second sound source classification.

In particular, an audio frame may be classified into the first sound source classification when the confidence of the audio frame for the first sound source classification exceeds its confidence for each of the other sound source classifications by more than a preset threshold. When the difference between the confidences of the audio frame for any two sound source classifications is not greater than the preset threshold, the audio frame is classified into both of those sound source classifications.
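The threshold rule above can be sketched as follows. This is one possible reading of the rule, stated as an assumption: a frame is assigned to the classification with the highest confidence, plus every other classification whose confidence is within the threshold of that highest value. The function name `assign_sources` and the threshold value 0.2 are hypothetical.

```python
def assign_sources(confidences, threshold=0.2):
    """Return the indices of the sound source classifications assigned to a frame."""
    best = max(confidences)
    # Keep a classification unless the best one beats it by more than the threshold.
    return [i for i, c in enumerate(confidences) if best - c <= threshold]


print(assign_sources([0.9, 0.1]))    # clearly the first classification
print(assign_sources([0.55, 0.45]))  # comparable confidences: overlapping sources
print(assign_sources([0.1, 0.9]))    # clearly the second classification
```

With two classifications, this reproduces the three cases discussed for fig. 3: a clear first source, a clear second source, and a frame that contains audio from both.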

It can be understood that, when performing sound source separation on the audio data to be processed, no prior information about the number or characteristics of the sound sources is required. By relying on the differences among the plurality of audio frames, as represented by the global feature information of each audio frame, audio frames from the same sound source can be assigned to the same sound source classification, and audio frames from different sound sources can be assigned to different sound source classifications.

Fig. 4 is a flowchart illustrating a training method of an audio processing model according to an exemplary embodiment of the present disclosure, wherein the audio processing model includes a local feature extraction module, a global feature extraction module, and an output module. As shown in fig. 4, the training method includes: step S401, sequentially obtaining, by using the local feature extraction module, first local feature information of each of a plurality of sample audio frames extracted from sample audio data, where the sample audio data includes audio data from a first number of sound sources, and each of the plurality of sample audio frames has a sound source label; step S402, at least based on the first local feature information of each sample audio frame in the plurality of sample audio frames, acquiring global feature information of each sample audio frame in the plurality of sample audio frames by using the global feature extraction module; step S403, sequentially inputting the global feature information of each sample audio frame in the plurality of sample audio frames into the output module, so as to obtain a first confidence of the sample audio frame output by the output module for each of the first number of predicted sound source classifications; and step S404, adjusting parameters of the audio processing model based on the first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames.
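The parameter adjustment of step S404 can be illustrated with a single gradient-descent step on a cross-entropy loss. The sketch below is a simplification under stated assumptions: it updates only a linear softmax output module, whereas the full model would also adjust the local and global feature extraction modules; the loss function, the learning rate, and the names `softmax` and `train_step` are not taken from the disclosure.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights, features, labels, lr=0.5):
    """Apply one gradient step in place; return the mean loss before the update."""
    num_classes, dim = len(weights), len(weights[0])
    grads = [[0.0] * dim for _ in range(num_classes)]
    loss = 0.0
    for x, label in zip(features, labels):
        # Confidence of this sample frame for each predicted classification.
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
        loss -= math.log(probs[label])
        for c in range(num_classes):
            # Cross-entropy gradient w.r.t. logit c is p_c - 1[c == label].
            error = probs[c] - (1.0 if c == label else 0.0)
            for d in range(dim):
                grads[c][d] += error * x[d]
    n = len(features)
    for c in range(num_classes):
        for d in range(dim):
            weights[c][d] -= lr * grads[c][d] / n
    return loss / n


# Two sample frames with hypothetical global features and sound source labels.
weights = [[0.0, 0.0], [0.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
first_loss = train_step(weights, features, labels)
second_loss = train_step(weights, features, labels)
print(first_loss, second_loss)
```

The loss decreases from the first step to the second, which shows the confidences moving toward the sound source labels.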

When predicting a target audio frame in the audio data to be processed, the audio processing model obtained by training can use the global feature extraction module to mine, based on information from a plurality of audio frames in the audio data to be processed, global feature information that distinguishes the target audio frame from the other audio frames. Based on this global feature information, the target audio frame is classified into the same sound source classification as those of the other audio frames that come from the same sound source, and into a different sound source classification from those that come from different sound sources, so that the whole of the audio data to be processed can be effectively separated by sound source.

According to some embodiments, the local feature extraction module, the global feature extraction module and the output module are connected in sequence in the audio processing model, so that in the training process of the audio processing model, the output information of the local feature extraction module can be used as the input information of the global feature extraction module, and the output information of the global feature extraction module can be used as the input information of the output module.

According to some embodiments, the sample audio data may be pre-processed before extracting the sample audio frames from the sample audio data, and the pre-processing may include, but is not limited to, removing noise (including ambient noise, busy tone, ringing tone, etc.) to obtain clean sample audio data.

According to some embodiments, the sample audio data may also be subjected to data enhancement, such as time-domain warping, frequency-domain masking, etc., prior to extracting the sample audio frame from the sample audio data.
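As a rough illustration of frequency-domain masking, the following Python sketch blanks out a band of frequency bins in a time-frequency representation. The function name `mask_frequency_band`, the `spectrogram[t][f]` layout, and the fixed band position are assumptions made for illustration; the disclosure does not specify how the masking is performed, and in practice the band would typically be chosen at random.

```python
def mask_frequency_band(spectrogram, start_bin, width, fill=0.0):
    """Return a copy of spectrogram[t][f] with bins [start_bin, start_bin + width) set to fill.

    Hypothetical helper: the layout (time-major list of frames) is an assumption.
    """
    masked = [list(frame) for frame in spectrogram]  # copy so the input is untouched
    for frame in masked:
        stop = min(start_bin + width, len(frame))
        for f in range(start_bin, stop):
            frame[f] = fill
    return masked


# Two frames with four frequency bins each; bins 1 and 2 are masked.
original = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
augmented = mask_frequency_band(original, start_bin=1, width=2)
```

The masked copy can be used as an additional training sample that keeps the sound source labels of the original frames.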

For step S401, a sample audio frame may be extracted from sample audio data through a time window with a preset length.

According to some embodiments, the plurality of sample audio frames may be sequentially adjacent in the time domain of the sample audio data. For example, the time window of the preset length may be 40 ms long and may be shifted over the sample audio data in steps of 10 ms, so that a plurality of sample audio frames that are sequentially adjacent in the time domain can be extracted from the sample audio data in sequence.
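The sliding-window extraction with a 40 ms window and a 10 ms step can be sketched in Python as follows. The function name `frame_signal`, the 16 kHz sampling rate, and the choice to drop a trailing partial window are assumptions made for illustration and are not stated in the disclosure.

```python
def frame_signal(samples, sample_rate, window_ms=40, step_ms=10):
    """Split a 1-D sequence of samples into overlapping frames.

    Hypothetical helper: a trailing partial window is discarded.
    """
    window = sample_rate * window_ms // 1000  # samples per frame
    step = sample_rate * step_ms // 1000      # samples between frame starts
    if window <= 0 or step <= 0:
        raise ValueError("window and step must each cover at least one sample")
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += step
    return frames


# One second of placeholder audio at an assumed 16 kHz sampling rate.
audio = [0.0] * 16000
frames = frame_signal(audio, sample_rate=16000)
```

With these assumed values each frame holds 640 samples and consecutive frames start 160 samples apart, so neighbouring frames overlap by 30 ms.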

According to some embodiments, a plurality of sample audio frames may be selected for training according to a preset rule from a first number of sequentially adjacent sample audio frames acquired based on a time window, for example, a frame skipping strategy may be adopted, and one sample audio frame may be selected for training at intervals of a certain number of sample audio frames, thereby effectively reducing subsequent computational complexity.
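A frame skipping strategy of this kind can be sketched in a few lines of Python. The function name `skip_frames` and the interval of 3 are assumptions for illustration; the disclosure leaves the preset rule open.

```python
def skip_frames(frames, keep_every=3):
    """Keep one frame out of every `keep_every` consecutive frames (hypothetical helper)."""
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return frames[::keep_every]


# Ten adjacent frames, represented here only by their indices.
adjacent = list(range(10))
selected = skip_frames(adjacent, keep_every=3)
```

Keeping every third frame reduces the number of frames passed to the later modules to roughly one third, at the cost of coarser time resolution.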

According to some embodiments, each sample audio frame may be input to the local feature extraction module in turn to obtain the local feature information of the sample audio frame output by the local feature extraction module.

According to some embodiments, time domain or frequency domain features of each sample audio frame, for example, one or more of the above-mentioned mel-frequency cepstral coefficients (MFCCs), FBank features (Filter Banks), or Perceptual Linear Prediction (PLP) features, may be input to the local feature extraction module to obtain local feature information of the sample audio frame output by the local feature extraction module.

According to some embodiments, determining, in turn, for each of a plurality of sample audio frames extracted from sample audio data, local feature information of the sample audio frame may include: determining the local feature information of the sample audio frame based on the sample audio frame and one or more sample audio frames near the sample audio frame among the plurality of sample audio frames. Thus, the local feature information of the sample audio frame is determined based on information within the local time range of the sample audio data that contains the sample audio frame.
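One way a feature of a frame can draw on nearby frames is a dilated one-dimensional convolution, which is the building block of the time-dilated convolutional network that the disclosure names as one option for the local feature extraction module. The following Python sketch is illustrative only: the function names, the scalar-per-frame features, the zero padding, and the kernel size of 3 with dilations 1, 2, 4 are all assumptions.

```python
def dilated_conv1d(sequence, kernel, dilation):
    """Same-length 1-D dilated convolution (no kernel flip) with zero padding.

    The kernel is assumed to have odd length so that it is centred on frame t.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t + (j - half) * dilation  # neighbour reached by this tap
            if 0 <= idx < len(sequence):
                acc += weight * sequence[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input frames that influence one output frame after stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


features = [1.0, 2.0, 3.0, 4.0, 5.0]
local = dilated_conv1d(features, kernel=[1.0, 1.0, 1.0], dilation=2)
span = receptive_field(kernel_size=3, dilations=[1, 2, 4])
```

Under these assumptions, three stacked layers let each output frame see 15 neighbouring input frames, which is still a local time range compared with the whole recording.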

For step S402, according to some embodiments, second local feature information of each of the plurality of sample audio frames extracted from the sample audio data may be obtained in sequence by using an additional local feature extraction module; wherein obtaining, by using the global feature extraction module, the global feature information of each of the plurality of sample audio frames based on at least the first local feature information of each of the plurality of sample audio frames may include: obtaining, by using the global feature extraction module, the global feature information of each of the plurality of sample audio frames based on the first local feature information and the second local feature information of each of the plurality of sample audio frames.

The additional local feature extraction module is applied in the training process of the audio processing model and operates independently of the local feature extraction module in the audio processing model. The local feature extraction module and the additional local feature extraction module can extract different first local feature information and second local feature information for each sample audio frame, and having multiple kinds of local feature information helps the subsequent construction of the global feature information and further improves the training effect of the model.

For example, the local feature extraction module is a first multi-layer time-dilated convolutional neural network (TDCN), and the additional local feature extraction module is a second multi-layer time-dilated convolutional neural network, wherein the two modules adopt the same network structure but have different parameters. Therefore, when the same sample audio frame is input into the first and the second multi-layer time-dilated convolutional neural networks, first local feature information and second local feature information are obtained respectively, and the two differ because the parameters of the two networks differ.

According to some embodiments, obtaining, by using the global feature extraction module, the global feature information of each of the plurality of sample audio frames based on the first local feature information and the second local feature information of each of the plurality of sample audio frames may include: for each of the plurality of sample audio frames, fusing the first local feature information and the second local feature information of the sample audio frame to obtain fused local feature information of the sample audio frame; and obtaining, by using the global feature extraction module, the global feature information of each of the plurality of sample audio frames based on the fused local feature information of each of the plurality of sample audio frames.

According to some embodiments, fusing the first local feature information and the second local feature information of the sample audio frame may include performing stitching on the first local feature information and the second local feature information of the sample audio frame.
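Fusion by stitching (concatenation) can be sketched in Python as follows. The function name `fuse_local_features` and the representation of each frame's features as a plain list are assumptions for illustration.

```python
def fuse_local_features(first, second):
    """Concatenate, frame by frame, the first and second local feature vectors.

    first[t] and second[t] are the two feature vectors of frame t (hypothetical layout).
    """
    if len(first) != len(second):
        raise ValueError("both inputs must describe the same number of frames")
    return [list(f1) + list(f2) for f1, f2 in zip(first, second)]


# Two frames: 2-dimensional first features and 1-dimensional second features.
first_local = [[0.1, 0.2], [0.3, 0.4]]
second_local = [[0.5], [0.6]]
fused = fuse_local_features(first_local, second_local)
```

The fused vector of each frame has a dimension equal to the sum of the two input dimensions, so the global feature extraction module would need to accept that combined width.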

According to some embodiments, for each of the plurality of sample audio frames, the following operations may be performed: inputting the second local feature information of the sample audio frame into a first additional output module to obtain a second confidence of the sample audio frame, output by the first additional output module, for each of the first number of predicted sound source classifications; and adjusting parameters of the additional local feature extraction module based on the second confidence of the sample audio frame for each of the first number of predicted sound source classifications and the sound source label of the sample audio frame.

The first additional output module is applied in the training process of the audio processing model. The parameters of the additional local feature extraction module can be adjusted in a targeted manner according to the second confidence of the sample audio frame, output by the first additional output module, for each of the first number of predicted sound source classifications, so that during the joint training the additional local feature extraction module acquires parameters different from those of the local feature extraction module and can therefore extract second local feature information that differs from the first local feature information.

It is to be noted that the parameter adjustment of the additional local feature extraction module performed via the first additional output module may be carried out per sample audio frame. That is, for each of the plurality of sample audio frames sequentially input to the audio processing model, the parameters of the additional local feature extraction module may be adjusted based on that sample audio frame alone, without waiting until the second local feature information has been extracted for every one of the sequentially input sample audio frames. In other words, for the input sample audio data, the parameters of the additional local feature extraction module may be adjusted multiple times, once for each of the plurality of sample audio frames, and as the sample audio frames are input in sequence, the accuracy of the second local feature information extracted by the additional local feature extraction module can be continuously improved.

In contrast, since the global feature information of each of the plurality of sample audio frames depends on at least the first local feature information of each of the plurality of sample audio frames, the adjustment of the parameters of the audio processing model has to wait until the first local feature information has been extracted for every one of the sequentially input sample audio frames. Therefore, during the joint training, different modules can have their parameters adjusted at different frequencies based on different mechanisms, which improves the flexibility of the training and improves training efficiency while preserving the training effect.

For steps S403 and S404, according to some embodiments, adjusting the parameters of the audio processing model based on the first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames may comprise: determining a corresponding relation between a first number of predicted sound source classifications and a first number of sound sources under a target mapping type, wherein in a plurality of corresponding relations respectively corresponding to a plurality of mapping types between the first number of predicted sound source classifications and the first number of sound sources, a loss value between a first confidence coefficient corresponding to each sample audio frame in a plurality of sample audio frames calculated based on the target mapping type and a sound source label of the sample audio frame is minimum; and adjusting parameters of the audio processing model based on the loss values calculated by the corresponding relations under the target mapping type.

Since the audio processing model may be used to cluster the plurality of sample audio frames, it is not necessary to determine which sound source each predicted sound source classification corresponds to. Therefore, in order to adjust the parameters of the audio processing model by back-propagation, a permutation invariant training (PIT) strategy may be adopted; that is, a loss value is calculated in turn under each of the plurality of mapping types between the first number of predicted sound source classifications and the first number of sound sources, and the parameters are adjusted based on the smallest loss value.

According to some embodiments, each of the plurality of correspondences to which the plurality of mapping types respectively correspond between the first number of predicted sound source classifications and the first number of sound sources is a one-to-one correspondence.

For example, in the presence of two sound sources, a sound source A and a sound source B, the first confidence of a sample audio frame for the first of the 2 predicted sound source classifications is 0.1, the first confidence of the sample audio frame for the second of the 2 predicted sound source classifications is 0.9, and the sound source label of the sample audio frame is sound source A.

First, the first predicted sound source classification is taken as sound source A and the second predicted sound source classification as sound source B, and a loss value A is calculated based on the first confidence 0.1 of the sample audio frame for sound source A, the first confidence 0.9 of the sample audio frame for sound source B, and the sound source label of the sample audio frame, that is, sound source A.

Next, the first predicted sound source classification is taken as sound source B and the second predicted sound source classification as sound source A, and a loss value B is calculated based on the first confidence 0.1 of the sample audio frame for sound source B, the first confidence 0.9 of the sample audio frame for sound source A, and the sound source label of the sample audio frame, that is, sound source A.

Obviously, the loss value B is smaller than the loss value A. Therefore, the correspondence between the predicted sound source classifications and the sound sources under the target mapping type is determined to be that the first predicted sound source classification corresponds to sound source B and the second predicted sound source classification corresponds to sound source A, and the parameters of the audio processing model are adjusted based on the loss value B calculated under this correspondence.

According to some embodiments, the adjusting of the parameters of the audio processing model in step S404 may be performed based on the first confidence level corresponding to each of the plurality of sample audio frames.

For example, still taking the case where there are two sound sources, a sound source A and a sound source B, the plurality of sample audio frames are a sample audio frame a, a sample audio frame b, and a sample audio frame c. The first confidences of the sample audio frames a, b, and c for the first of the 2 predicted sound source classifications are 0.1, 0.8, and 0.3 respectively, the first confidences of the sample audio frames a, b, and c for the second of the 2 predicted sound source classifications are 0.9, 0.2, and 0.7 respectively, and the sound source labels of the sample audio frames a, b, and c are sound source A, sound source B, and sound source A respectively.

Obviously, the loss value calculated with the first predicted sound source classification taken as sound source B and the second predicted sound source classification taken as sound source A is the smaller one, and thus the parameters of the audio processing model are adjusted based on this smaller loss value.

According to some embodiments, when the additional local feature extraction module is included in the training process, the parameters of the additional local feature extraction module can also be adjusted at the same time through step S404.

According to some embodiments, the loss value may be calculated by Binary Cross Entropy (BCE).
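The selection of the target mapping type can be sketched in Python using the three-frame example with sound sources A and B and a binary cross entropy loss. The function names `bce` and `pit_loss`, the averaging over frames and classifications, and the clipping constant `eps` are assumptions made for illustration; the disclosure does not fix these details.

```python
import math
from itertools import permutations


def bce(confidence, target, eps=1e-12):
    """Binary cross entropy of one confidence against a 0/1 target."""
    p = min(max(confidence, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def pit_loss(confidences, labels, sources):
    """Return (smallest loss, mapping), where mapping[k] is the source assigned to class k.

    confidences[t][k] is the confidence of frame t for predicted classification k.
    """
    best_loss, best_mapping = None, None
    for mapping in permutations(sources):  # every one-to-one correspondence
        total = 0.0
        for frame_conf, label in zip(confidences, labels):
            for k, conf in enumerate(frame_conf):
                target = 1.0 if mapping[k] == label else 0.0
                total += bce(conf, target)
        loss = total / (len(confidences) * len(sources))
        if best_loss is None or loss < best_loss:
            best_loss, best_mapping = loss, mapping
    return best_loss, best_mapping


# Frames a, b, c with their first confidences for the two predicted classifications.
confidences = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = ["A", "B", "A"]
loss, mapping = pit_loss(confidences, labels, ["A", "B"])
```

Here `mapping` comes out as `("B", "A")`, that is, the first predicted sound source classification is matched to sound source B and the second to sound source A, which agrees with the example. Enumerating all permutations grows factorially with the number of sound sources, so this direct search is only practical for small numbers of sources.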

According to some embodiments, for each of the plurality of sample audio frames in turn, a third confidence of the sample audio frame for each of the first number of predicted sound source classifications may be obtained by using a second additional output module based on the first local feature information of the sample audio frame; and parameters of the local feature extraction module in the audio processing model may be adjusted based on the third confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames.

The second additional output module is applied to the training process of the audio processing model, and the second additional output module can adjust the parameters of the local feature extraction module in the audio processing model based on the first local feature information of each sample audio frame, so that the training effect can be further improved.

According to some embodiments, the adjustment of the parameters of the local feature extraction module performed via the second additional output module may also be performed based on the third confidences corresponding to all of the plurality of sample audio frames. The specific manner is similar to the above-described adjustment of the parameters of the audio processing model based on the first confidences corresponding to all of the plurality of sample audio frames, and is not described herein again.

Fig. 5 is a schematic diagram illustrating a training method of an audio processing model according to an exemplary embodiment of the present disclosure, where, as shown in fig. 5, each of a plurality of sample audio frames in sample audio data is sequentially input to a local feature extraction module to obtain first local feature information of each of the plurality of sample audio frames output by the local feature extraction module.

Each of the plurality of sample audio frames in the sample audio data is also sequentially input into the additional local feature extraction module to obtain second local feature information of each of the plurality of sample audio frames output by the additional local feature extraction module.

The first local feature information and the second local feature information of each sample audio frame are fused by a fusion module to obtain fused local feature information of the sample audio frame.

Once the fused local feature information has been obtained for every one of the sequentially input sample audio frames, the fused local feature information of all the sample audio frames is input into the global feature extraction module, so as to obtain the global feature information of each of the sample audio frames, output in sequence by the global feature extraction module.

For the global feature information of each sample audio frame, the output module is used to obtain a first confidence of the sample audio frame for each of the first number of predicted sound source classifications. A first loss value is calculated based on the first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames, and the parameters of the local feature extraction module, the global feature extraction module, and the output module in the audio processing model, as well as the parameters of the additional local feature extraction module, are adjusted according to the first loss value. This adjustment of the parameters of the audio processing model is performed in units of the entire sample audio data.

For the second local feature information of each sample audio frame, the first additional output module is used to obtain a second confidence of the sample audio frame for each of the first number of predicted sound source classifications. A second loss value is calculated based on the second confidence of the sample audio frame for each of the first number of predicted sound source classifications and the sound source label of the sample audio frame, and the parameters of the additional local feature extraction module are adjusted according to the second loss value. This adjustment is performed in units of individual sample audio frames.

For the first local feature information of each sample audio frame, the second additional output module is used to obtain a third confidence of the sample audio frame for each of the first number of predicted sound source classifications. A third loss value is calculated based on the third confidence of the sample audio frame for each of the first number of predicted sound source classifications and the sound source label of the sample audio frame, and the parameters of the local feature extraction module are adjusted according to the third loss value.

It is noted that the part in the dashed box in fig. 5 is only used for the training process of the audio processing model, and does not participate in the application of the audio processing model.

Fig. 6 is a block diagram illustrating a structure of an audio processing apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the audio processing apparatus 600 includes: a local feature extraction module 601 configured to determine, in turn, local feature information of each of a plurality of audio frames extracted from audio data to be processed, wherein the audio data to be processed includes audio data from at least two sound sources; a global feature extraction module 602 configured to determine any one of the plurality of audio frames as a target audio frame, and determine global feature information of the target audio frame based on the local feature information of each of the plurality of audio frames; and an output module 603 configured to determine a sound source classification corresponding to the target audio frame based on the global feature information of the target audio frame.

According to some embodiments, the local feature extraction module comprises a multi-layer time-dilated convolutional neural network (TDCN).

According to some embodiments, the global feature extraction module may include one or more attention sub-modules, and at least one of the one or more attention sub-modules employs an exponential linear operation. Therefore, calculation can be effectively simplified, and processing efficiency is improved.

In order to improve the calculation efficiency of the attention sub-module, an exponential linear unit (ELU) may be used in the calculation performed by the attention sub-module, and the processing result O of the attention sub-module on the input feature may be represented as:

$$O = \phi(Q)\left(\phi(K)^{T} V\right),$$

where Q represents a query vector, K represents a key vector, and V represents a value vector, and where $\phi(x) = \mathrm{elu}(x) + 1$, with $\mathrm{elu}(x) = x$ when $x > 0$ and $\mathrm{elu}(x) = \alpha\,(e^{x} - 1)$ when $x \le 0$; here $\alpha$ is a hyperparameter, typically taken to be 1.0.
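The attention computation with the elu-based feature map can be sketched in plain Python as follows. The function names, the representation of Q, K, and V as lists of row vectors (one row per audio frame), and the absence of any normalization term are assumptions that follow the formula as written; they are not a statement of the actual implementation.

```python
import math


def phi(x, alpha=1.0):
    """Feature map phi(x) = elu(x) + 1 for a single scalar."""
    elu = x if x > 0 else alpha * (math.exp(x) - 1.0)
    return elu + 1.0


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(column) for column in zip(*m)]


def linear_attention(q, k, v):
    """Compute O = phi(Q) (phi(K)^T V), multiplying phi(K)^T V first."""
    phi_q = [[phi(x) for x in row] for row in q]
    phi_k = [[phi(x) for x in row] for row in k]
    kv = matmul(transpose(phi_k), v)  # size d x d_v, independent of the frame count
    return matmul(phi_q, kv)


# One frame with 2-dimensional query and key and a 1-dimensional value.
out = linear_attention(q=[[1.0, 0.0]], k=[[0.0, 1.0]], v=[[2.0]])
```

Multiplying `phi(K)^T V` first is what simplifies the calculation: for n frames of dimension d, the cost grows linearly with n rather than with n squared, because the n by n matrix of pairwise attention weights is never formed.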

According to some embodiments, the local feature extraction module comprises: a sub-module configured to determine the local feature information of the audio frame based on the audio frame and one or more audio frames near the audio frame among the plurality of audio frames.

Fig. 7 is a block diagram illustrating a structure of an apparatus for training an audio processing model according to an exemplary embodiment of the present disclosure. As shown in fig. 7, the audio processing model includes a local feature extraction module, a global feature extraction module, and an output module, and the apparatus 700 for training the audio processing model includes: a first obtaining module 701 configured to obtain in turn, by using the local feature extraction module, first local feature information of each of a plurality of sample audio frames extracted from sample audio data, wherein the sample audio data includes audio data from a first number of sound sources, and each of the plurality of sample audio frames has a sound source label; a second obtaining module 702 configured to obtain, by using the global feature extraction module, global feature information of each of the plurality of sample audio frames based on at least the first local feature information of each of the plurality of sample audio frames; a third obtaining module 703 configured to, for each of the plurality of sample audio frames in turn, input the global feature information of the sample audio frame into the output module, so as to obtain a first confidence of the sample audio frame, output by the output module, for each of the first number of predicted sound source classifications; and a first adjusting module 704 configured to adjust parameters of the audio processing model based on the first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames.

According to some embodiments, the first adjustment module comprises: a determining submodule configured to determine a correspondence between a first number of predicted sound source classifications and a first number of sound sources in a target mapping type, wherein a loss value between a first confidence corresponding to each of a plurality of sample audio frames calculated based on the target mapping type and a sound source label of the sample audio frame is minimum in a plurality of correspondences respectively corresponding to a plurality of mapping types between the first number of predicted sound source classifications and the first number of sound sources; and an adjusting submodule configured to adjust a parameter of the audio processing model based on the loss value calculated by the correspondence under the target mapping type.

According to some embodiments, the apparatus further comprises: a fourth obtaining module, configured to sequentially obtain, for each sample audio frame of a plurality of sample audio frames extracted from sample audio data, second local feature information of the sample audio frame by using the additional local feature extraction module; wherein, the second acquisition module includes: an obtaining sub-module configured to obtain, by the global feature extraction module, global feature information of each of the plurality of sample audio frames based on the first local feature information and the second local feature information of each of the plurality of sample audio frames.

According to some embodiments, the obtaining sub-module comprises: a unit for fusing, for each of the plurality of sample audio frames, the first local feature information and the second local feature information of the sample audio frame to obtain fused local feature information of the sample audio frame; and a unit for obtaining, by using the global feature extraction module, the global feature information of each of the plurality of sample audio frames based on the fused local feature information of each of the plurality of sample audio frames.

According to some embodiments, the apparatus further comprises: a fifth obtaining module configured to input the second local feature information of the sample audio frame into the first additional output module to obtain a second confidence of the sample audio frame output by the first additional output module for each of the first number of predicted sound source classifications; and a second adjusting module configured to adjust parameters of the additional local feature extraction module based on a second confidence of the sample audio frame for each of the first number of predicted sound source classifications and the sound source label of the sample audio frame.

According to some embodiments, the apparatus further comprises: a sixth obtaining module configured to obtain, for each of the plurality of sample audio frames in turn, by using the second additional output module, a third confidence of the sample audio frame for each of the first number of predicted sound source classifications based on the first local feature information of the sample audio frame; and a third adjusting module configured to adjust parameters of the local feature extraction module in the audio processing model based on the third confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames.

According to an embodiment of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform any one of the methods described above.

There is also provided, in accordance with an embodiment of the present disclosure, a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform any one of the methods described above.

There is also provided, in accordance with an embodiment of the present disclosure, a computer program product, including a computer program, wherein the computer program, when executed by a processor, implements any of the methods described above.

Referring to fig. 8, a block diagram of the structure of an electronic device 800 will now be described. The electronic device 800 may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the device 800. The input unit 806 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above, such as the method of processing audio data or the method of training an audio processing model. For example, in some embodiments, the method of processing audio data or the method of training an audio processing model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the method of processing audio data or the method of training an audio processing model described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured in any other suitable way (e.g., by means of firmware) to perform the method of processing audio data or the method of training an audio processing model.

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, audio, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims

1. An audio processing method, comprising:

sequentially determining local feature information of each audio frame in a plurality of audio frames extracted from audio data to be processed, wherein the audio data to be processed comprises audio data from at least two sound sources; and

determining any one of the plurality of audio frames as a target audio frame, and performing the following operations for the target audio frame:

determining global feature information of the target audio frame based on the local feature information of each of the plurality of audio frames; and

determining a sound source classification corresponding to the target audio frame based on the global feature information of the target audio frame.
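The three operations of claim 1 (local features, global features, per-frame sound source classification) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not specified by the claim: the neighbor-averaging local extractor, the dot-product attention used to aggregate local features into a global feature, the linear classifier, and all function names are hypothetical stand-ins.

```python
import math

def local_features(frames, radius=1):
    """Hypothetical local extractor: average each frame with its neighbors within `radius`."""
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - radius):min(len(frames), t + radius + 1)]
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def global_feature(local, target):
    """Assumed aggregation: attention over the local features of ALL frames, queried by the target frame."""
    query = local[target]
    scores = [sum(a * b for a, b in zip(query, key)) for key in local]
    weights = softmax(scores)
    return [sum(w * feat[d] for w, feat in zip(weights, local)) for d in range(len(query))]

def classify(feature, class_weights):
    """Assumed linear output layer: index of the highest-scoring sound source classification."""
    scores = [sum(a * b for a, b in zip(feature, w)) for w in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)

def process(frames, class_weights):
    """Treat every frame in turn as the target frame and classify it."""
    local = local_features(frames)
    return [classify(global_feature(local, t), class_weights) for t in range(len(frames))]

# Toy input: two frames dominated by one source, then two dominated by another.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labels = process(frames, class_weights=[[1.0, 0.0], [0.0, 1.0]])
```

The point of the sketch is the data flow: the global feature of a target frame depends on the local features of every frame, not only on the target frame itself.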

2. The method of claim 1, wherein the plurality of audio frames are sequentially adjacent in a time domain of the audio data to be processed.

3. The method according to claim 1 or 2, wherein sequentially determining the local feature information of each audio frame in the plurality of audio frames extracted from the audio data to be processed comprises:

determining local feature information of the audio frame based on the audio frame and one or more audio frames near the audio frame in the plurality of audio frames.
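One simple way to make a frame's local feature depend on nearby frames is context splicing, sketched below. This is an illustrative assumption rather than the extractor the disclosure requires: the function name `splice_context`, the window sizes, and the choice to pad by repeating edge frames are all hypothetical.

```python
def splice_context(frames, left=1, right=1):
    """Concatenate each frame with its `left` and `right` neighbors.

    Frames beyond the sequence boundary are replaced by the nearest edge frame
    (an assumed padding policy).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        context = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index into [0, n - 1]
            context.extend(frames[idx])
        spliced.append(context)
    return spliced

spliced = splice_context([[1], [2], [3]])
```

A learned extractor (for example a convolution over time) could consume the same window; the splice only makes the "frame plus nearby frames" dependency explicit.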

4. The method of any one of claims 1 to 3, wherein each of the at least two sound sources is a speaker.

5. A method of training an audio processing model, wherein the audio processing model comprises a local feature extraction module, a global feature extraction module, and an output module, the method comprising:

for each sample audio frame in a plurality of sample audio frames extracted from sample audio data in turn, obtaining first local feature information of the sample audio frame by using the local feature extraction module, wherein the sample audio data comprises audio data from a first number of sound sources, and each sample audio frame in the plurality of sample audio frames has a sound source label;

obtaining global feature information of each of the plurality of sample audio frames by using the global feature extraction module based on at least the first local feature information of each of the plurality of sample audio frames;

sequentially inputting the global feature information of each sample audio frame in the plurality of sample audio frames into the output module, so as to obtain a first confidence of the sample audio frame, output by the output module, for each predicted sound source classification in a first number of predicted sound source classifications; and

adjusting parameters of the audio processing model based on a first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and a sound source label of each of the plurality of sample audio frames.

6. The method of claim 5, wherein said adjusting parameters of the audio processing model based on the first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and the sound source label of each of the plurality of sample audio frames comprises:

determining a correspondence between the first number of predicted sound source classifications and the first number of sound sources under a target mapping type, wherein, among a plurality of correspondences respectively corresponding to a plurality of mapping types between the first number of predicted sound source classifications and the first number of sound sources, the loss value between the first confidence corresponding to each sample audio frame in the plurality of sample audio frames and the sound source label of the sample audio frame is minimum when calculated based on the target mapping type; and

adjusting parameters of the audio processing model based on the loss value calculated using the correspondence under the target mapping type.
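The selection of the target mapping type in claim 6 can be read as a search over one-to-one assignments between sound sources and predicted classifications, keeping the assignment with the smallest loss. The sketch below assumes a cross-entropy loss averaged over frames and an exhaustive search over permutations; the claim does not fix either choice, and the names `cross_entropy` and `min_mapping_loss` are hypothetical.

```python
import math
from itertools import permutations

def cross_entropy(confidences, target_class):
    """Negative log confidence of the target class (clipped to avoid log(0))."""
    return -math.log(max(confidences[target_class], 1e-12))

def min_mapping_loss(confidences, labels, num_sources):
    """Return (loss, mapping) for the source-to-class mapping with minimum loss.

    confidences[t][k]: confidence of frame t for predicted classification k.
    labels[t]: sound source label (0 .. num_sources - 1) of frame t.
    mapping[s]: predicted classification assigned to sound source s.
    """
    best_loss, best_mapping = float("inf"), None
    for mapping in permutations(range(num_sources)):
        loss = sum(
            cross_entropy(conf, mapping[label])
            for conf, label in zip(confidences, labels)
        ) / len(labels)
        if loss < best_loss:
            best_loss, best_mapping = loss, mapping
    return best_loss, best_mapping

# The model has put source 0 into predicted class 1 and source 1 into class 0.
confidences = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
labels = [0, 0, 1]
loss, mapping = min_mapping_loss(confidences, labels, num_sources=2)
```

Only the loss under the selected mapping would then be used for the parameter update, so the model is not penalized for numbering the sources differently from the labels. Exhaustive search grows factorially with the number of sources, which is acceptable for the small counts typical of multi-speaker audio.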

7. The method of claim 5 or 6, further comprising:

for each sample audio frame in the plurality of sample audio frames extracted from the sample audio data in turn, obtaining second local feature information of the sample audio frame by using an additional local feature extraction module;

wherein the obtaining, by the global feature extraction module, the global feature information of each of the plurality of sample audio frames based on at least the first local feature information of each of the plurality of sample audio frames comprises:

obtaining, by the global feature extraction module, global feature information of each of the plurality of sample audio frames based on the first local feature information and the second local feature information of each of the plurality of sample audio frames.

8. The method of claim 7, wherein the obtaining, with the global feature extraction module, global feature information for each of the plurality of sample audio frames based on the first and second local feature information for each of the plurality of sample audio frames comprises:

for each sample audio frame in the plurality of sample audio frames, fusing the first local feature information and the second local feature information of the sample audio frame to obtain fused local feature information of the sample audio frame; and

obtaining global feature information of each sample audio frame in the plurality of sample audio frames by using the global feature extraction module based on the fused local feature information of each sample audio frame in the plurality of sample audio frames.
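The fusing step of claim 8 is not tied to a particular operator in the claim language. As a hedged illustration only, the sketch below shows two common options, per-frame concatenation and element-wise summation; the function name `fuse_local_features` and the `mode` parameter are hypothetical.

```python
def fuse_local_features(first, second, mode="concat"):
    """Fuse two per-frame feature sequences frame by frame.

    first[t], second[t]: feature vectors of frame t from the two local extractors.
    mode: "concat" joins the vectors; "sum" adds them element-wise (equal sizes required).
    """
    if len(first) != len(second):
        raise ValueError("both feature sequences must cover the same frames")
    fused = []
    for a, b in zip(first, second):
        if mode == "concat":
            fused.append(list(a) + list(b))
        elif mode == "sum":
            if len(a) != len(b):
                raise ValueError("element-wise sum needs equal feature sizes")
            fused.append([x + y for x, y in zip(a, b)])
        else:
            raise ValueError(f"unknown fusion mode: {mode}")
    return fused

first = [[1.0, 2.0], [3.0, 4.0]]
second = [[0.5, 0.5], [1.0, 1.0]]
fused_concat = fuse_local_features(first, second)
fused_sum = fuse_local_features(first, second, mode="sum")
```

Concatenation preserves both feature sets unchanged at the cost of a wider input to the global feature extraction module, while summation keeps the width fixed.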

9. The method of claim 7 or 8, further comprising, for each of the plurality of sample audio frames, performing the following:

inputting the second local feature information of the sample audio frame into a first additional output module to obtain a second confidence of the sample audio frame output by the first additional output module for each of the first number of predicted sound source classifications; and

adjusting parameters of the additional local feature extraction module based on a second confidence of the sample audio frame for each of the first number of predicted sound source classifications and the sound source label of the sample audio frame.

10. The method of any of claims 5 to 9, further comprising:

obtaining, by using a second additional output module and for each sample audio frame in the plurality of sample audio frames in turn, a third confidence of the sample audio frame for each of the first number of predicted sound source classifications based on the first local feature information of the sample audio frame; and

adjusting parameters of the local feature extraction module in the audio processing model based on a third confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and a sound source label of each of the plurality of sample audio frames.

11. An audio processing apparatus comprising:

a local feature extraction module configured to determine local feature information of each of a plurality of audio frames extracted from to-be-processed audio data in turn, wherein the to-be-processed audio data includes audio data from at least two sound sources;

a global feature extraction module configured to determine any one of the plurality of audio frames as a target audio frame, and determine global feature information of the target audio frame based on local feature information of each of the plurality of audio frames; and

an output module configured to determine a sound source classification corresponding to the target audio frame based on the global feature information of the target audio frame.

12. The audio processing apparatus according to claim 11, wherein the local feature extraction module comprises:

a determining sub-module configured to determine local feature information of the audio frame based on the audio frame and one or more audio frames near the audio frame in the plurality of audio frames.

13. The audio processing apparatus according to claim 11 or 12, wherein the global feature extraction module comprises one or more attention sub-modules, and at least one of the one or more attention sub-modules employs an exponential linear operation.
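Claim 13 states only that an attention sub-module employs an exponential linear operation. One possible reading, offered purely as an assumption, is a linearized attention in which queries and keys pass through the feature map phi(x) = ELU(x) + 1 so that attention weights stay positive without a softmax. The sketch below implements that reading; the function names are hypothetical.

```python
import math

def elu_plus_one(x):
    """phi(x) = ELU(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(queries, keys, values):
    """Attention with kernel phi(q) . phi(k), computed without an explicit T x T weight matrix."""
    phi_q = [[elu_plus_one(x) for x in q] for q in queries]
    phi_k = [[elu_plus_one(x) for x in k] for k in keys]
    d, dv = len(keys[0]), len(values[0])
    # kv[i][j] = sum over frames of phi_k[t][i] * values[t][j]
    kv = [[sum(pk[i] * v[j] for pk, v in zip(phi_k, values)) for j in range(dv)]
          for i in range(d)]
    # k_sum[i] = sum over frames of phi_k[t][i], used for normalization
    k_sum = [sum(pk[i] for pk in phi_k) for i in range(d)]
    outputs = []
    for q in phi_q:
        norm = sum(q[i] * k_sum[i] for i in range(d))
        outputs.append([sum(q[i] * kv[i][j] for i in range(d)) / norm for j in range(dv)])
    return outputs

queries = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9]]
keys = [[0.1, 0.4], [-0.3, 2.0], [0.8, -1.2]]
values = [[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]]
out = linear_attention(queries, keys, values)
```

Because the key-value summary `kv` is computed once and reused for every query, the cost grows linearly with the number of frames, which is the usual motivation for this formulation on long audio sequences.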

14. An apparatus for training an audio processing model, wherein the audio processing model comprises a local feature extraction module, a global feature extraction module and an output module, the apparatus comprising:

a first obtaining module configured to obtain, in turn and by using the local feature extraction module, first local feature information of each sample audio frame in a plurality of sample audio frames extracted from sample audio data, wherein the sample audio data includes audio data from a first number of sound sources, and each of the plurality of sample audio frames has a sound source label;

a second obtaining module configured to obtain, by the global feature extraction module, global feature information of each of the plurality of sample audio frames based on at least the first local feature information of each of the plurality of sample audio frames;

a third obtaining module, configured to sequentially input, for each of the plurality of sample audio frames, global feature information of the sample audio frame into the output module, so as to obtain a first confidence of the sample audio frame output by the output module for each of the first number of predicted sound source classifications; and

a first adjustment module configured to adjust parameters of the audio processing model based on a first confidence of each of the plurality of sample audio frames for each of the first number of predicted sound source classifications and a sound source label of each of the plurality of sample audio frames.

15. The apparatus of claim 14, wherein the first adjustment module comprises:

a determining submodule configured to determine a correspondence between the first number of predicted sound source classifications and the first number of sound sources under a target mapping type, wherein a loss value between a first confidence corresponding to each of the plurality of sample audio frames and a sound source label of the sample audio frame, which is calculated based on the target mapping type, is minimum in a plurality of correspondences respectively corresponding to a plurality of mapping types between the first number of predicted sound source classifications and the first number of sound sources; and

an adjusting submodule configured to adjust a parameter of the audio processing model based on the loss value calculated by the correspondence under the target mapping type.

16. The apparatus of claim 14 or 15, further comprising:

a fourth obtaining module configured to obtain, for each sample audio frame in the plurality of sample audio frames extracted from the sample audio data in turn, second local feature information of the sample audio frame by using an additional local feature extraction module;

wherein the second obtaining module comprises:

an obtaining sub-module configured to obtain, by the global feature extraction module, global feature information of each of the plurality of sample audio frames based on the first local feature information and the second local feature information of each of the plurality of sample audio frames.

17. The apparatus of claim 16, wherein the obtaining sub-module comprises:

means for fusing, for each of the plurality of sample audio frames, the first local feature information and the second local feature information of the sample audio frame to obtain fused local feature information of the sample audio frame; and

means for obtaining, with the global feature extraction module, global feature information for each of the plurality of sample audio frames based on the fused local feature information for each of the plurality of sample audio frames.

18. The apparatus of claim 16 or 17, further comprising:

a fifth obtaining module configured to input, for each of the plurality of sample audio frames, the second local feature information of the sample audio frame into a first additional output module to obtain a second confidence of the sample audio frame output by the first additional output module for each of the first number of predicted sound source classifications; and

a second adjusting module configured to adjust parameters of the additional local feature extraction module based on a second confidence of the sample audio frame for each of the first number of predicted sound source classifications and a sound source label of the sample audio frame.

19. The apparatus of any of claims 14 to 18, further comprising:

a sixth obtaining module configured to obtain, by using a second additional output module and for each sample audio frame in the plurality of sample audio frames in turn, a third confidence of the sample audio frame for each of the first number of predicted sound source classifications based on the first local feature information of the sample audio frame; and

a third adjustment module configured to adjust parameters of the local feature extraction module in the audio processing model based on a third confidence for each of the first number of predicted sound source classifications for each of the plurality of sample audio frames and a sound source label for each of the plurality of sample audio frames.

20. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.

21. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10.

22. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-10.