WO2021159775A1 - Training method for audio separation network, audio separation method, device, and medium - Google Patents

Training method for audio separation network, audio separation method, device, and medium

Info

Publication number
WO2021159775A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
audio
separated
separation
sample set
Prior art date
Application number
PCT/CN2020/126492
Other languages
English (en)
French (fr)
Inventor
王珺
林永业
苏丹
俞栋
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP20918512.3A (EP4012706A4)
Publication of WO2021159775A1
Priority to US17/682,399 (US20220180882A1)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/06 Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of machine learning, and in particular to a training method for an audio separation network, an audio separation method, a device, and a medium.
  • the embodiments of the present application provide a training method, audio separation method, device, and medium for an audio separation network.
  • The first sample set can be used as samples for training the unsupervised network, which enriches the sample data of the unsupervised network and enhances the generalization ability of the unsupervised network.
  • an embodiment of the present application provides a training method for an audio separation network.
  • the method is applied to a training device for an audio separation network and includes:
  • the loss of the second separation sample is used to adjust the network parameters of the unsupervised network, so that the loss of the separation result output by the adjusted unsupervised network meets the convergence condition.
  • an embodiment of the present application provides an audio separation method, the method is applied to an audio separation device, and the method includes:
  • an embodiment of the present application provides a training device for an audio separation network, and the device includes:
  • the first acquisition module is configured to acquire a first separated sample set, and the first separated sample set includes at least two types of audio with pseudo labels;
  • the first interpolation module is configured to use disturbance data to interpolate the first separated sample set to obtain a first sample set
  • the first separation module is configured to use an unsupervised network to separate the first sample set to obtain a second separated sample set;
  • the first determining module is configured to determine the loss of the second separated sample in the second separated sample set
  • the first adjustment module is configured to use the loss of the second separation sample to adjust the network parameters of the unsupervised network, so that the loss of the separation result output by the adjusted unsupervised network meets the convergence condition.
  • an audio separation device includes:
  • the second acquisition module is configured to acquire the audio to be separated
  • the first input module is configured to use a trained neural network to separate the audio to be separated to obtain a separation result, wherein the neural network is obtained by training based on the audio separation network training method described in the first aspect above;
  • the first output module is configured to output the separation result.
  • a computer storage medium stores executable instructions that, when executed by a processor, implement the audio separation network training method described in the first aspect, or implement the audio separation method described in the second aspect.
  • The embodiments of the present application have the following beneficial effects: first, the first separated sample sets of the two types of audio with pseudo labels are interpolated to obtain the mixed first sample set; then, the unsupervised network is trained based on the first sample set, and its network parameters are adjusted based on the loss of the second separated samples, so that the loss of the separation result output by the adjusted unsupervised network meets the convergence condition.
  • In this way, the first sample set, obtained by interpolating with perturbation data, serves as the set of samples for training the unsupervised network, which enriches the sample data of the unsupervised network and thereby enhances its generalization ability.
  • FIG. 1 is a schematic diagram of an optional architecture of an audio separation network training system provided by an embodiment of the present application
  • FIG. 2A is a schematic diagram of another optional architecture of the audio separation network training system provided by an embodiment of the present application.
  • FIG. 2B is a schematic structural diagram of a training system for an audio separation network provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the implementation process of the audio separation network training method provided by an embodiment of the present application.
  • FIG. 4A is a schematic diagram of another implementation process of the audio separation network training method provided by an embodiment of the present application.
  • FIG. 4B is a schematic diagram of an implementation process of an audio separation method provided by an embodiment of the present application.
  • FIG. 5A is a schematic diagram of the implementation process of the training method of a supervised network according to an embodiment of the present application
  • FIG. 5B is a schematic diagram of an implementation process of an unsupervised network training method according to an embodiment of the present application.
  • The terms "first/second/third" involved herein merely distinguish similar objects and do not represent a specific order for the objects. Understandably, where permitted, the specific order or sequence of "first/second/third" can be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
  • ASR: Automatic Speech Recognition.
  • Permutation Invariant Training (PIT): a training technique that solves the label permutation problem by minimizing the separation error.
  • Permutation invariant training means that a change in the order of the inputs will not affect the output value.
  • PIT determines the correct output arrangement by computing the objective loss function under all possible output permutations and selecting the permutation with the lowest loss. It is a general and effective method, at the cost of complexity that grows as the number of outputs increases; a minimal sketch follows.
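  • To make this concrete, below is a minimal PyTorch sketch of a PIT-style loss (illustrative only: the function name pit_mse_loss and the use of mean squared error are assumptions, not the loss prescribed by this application):

```python
import itertools
import torch

def pit_mse_loss(estimates, targets):
    """Permutation invariant training (PIT) loss: evaluate the loss under
    every pairing of estimated and reference sources and keep the minimum.
    `estimates` and `targets` are (num_sources, num_samples) tensors."""
    num_sources = targets.shape[0]
    losses = []
    for perm in itertools.permutations(range(num_sources)):
        # Mean squared error for this particular output arrangement.
        loss = sum(
            torch.mean((estimates[i] - targets[j]) ** 2)
            for i, j in enumerate(perm)
        )
        losses.append(loss / num_sources)
    return torch.min(torch.stack(losses))
```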
  • SSL: Semi-Supervised Learning.
  • Consistency-based semi-supervised learning: first, sample a minibatch of labeled data and feed it to the network to compute the cross-entropy loss; second, sample two minibatches of unlabeled data and feed them to the network to predict labels; third, use mixup to mix the two unlabeled batches; fourth, compute the squared-error loss of the network's prediction on the mixed data; finally, add the squared-error loss to the labeled loss and iteratively update the network parameters to obtain the final network.
  • Moving average: also known as the moving average method.
  • A moving average is computed by successively adding new data and dropping old data, in order to smooth out accidental fluctuations, reveal the underlying trend, and make predictions accordingly.
  • The moving average method is a trend extrapolation technique: a curve is fitted to a data sequence with a clear trend, and the fitted curve is then used to predict the value at some future point.
  • Generative Adversarial Network (GAN): consists of two parts, a generative network and a discriminative network.
  • The generative network generates data such as text, images, and video from input data, according to the task and through network training.
  • The generative network essentially performs a kind of maximum likelihood estimation, used to produce data from a specified distribution; its function is to capture the distribution of the sample data and, through the parameters estimated by maximum likelihood, transform the original input distribution into samples of the specified distribution.
  • The discriminative network is essentially a binary classifier: it judges data such as images produced by the generative network and decides whether they come from the real training data.
  • Expanded networks based on high-dimensional embedding networks include deep attractor networks, deep extractor networks, and anchored deep attractor networks, as well as methods based on permutation invariant training.
  • PIT determines the correct output arrangement by computing the objective loss function under all possible output permutations and selecting the lowest; it is a general and effective method, at the cost of complexity that grows as the output dimension increases.
  • the embodiments of the present application provide an audio separation network training method, audio separation method, device, and medium.
  • In the process of training the unsupervised network, two types of audio with pseudo labels are interpolated using perturbation data to obtain the first sample set.
  • the first sample set is used as the sample for training the unsupervised network, which enriches the sample data of the unsupervised network, thereby enhancing the generalization ability of the unsupervised network.
  • The device provided in the embodiments of the application can be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a set-top box, or a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), and can also be implemented as a server.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but it is not limited to this.
  • the terminal and the server can be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
  • Fig. 1 is a schematic diagram of an optional architecture of an audio separation network training system provided by an embodiment of the present application.
  • First, the first separated sample set 10 (including at least two parts: clean audio data and interference signals) is interpolated with perturbation data to obtain the mixed first sample set 11; then, the first sample set 11 is input into the unsupervised network 12 to train it; finally, the loss of the second separated samples output by the unsupervised network 12 is fed back to the network to adjust its parameters, so that the loss of the separation result output by the adjusted unsupervised network meets the convergence condition, yielding a trained unsupervised network 13.
  • In this way, in the process of training the unsupervised network, the first sample set, obtained by interpolating two types of audio with pseudo labels using perturbation data, is used as the training samples, which enriches the sample data of the unsupervised network and thereby enhances its generalization ability.
  • FIG. 2A is a schematic diagram of another optional architecture of the audio separation network training system provided by an embodiment of the present application, including a blockchain network 20 (exemplarily showing a server 200 as a native node), a monitoring system 30 (the device 300 belonging to the monitoring system 30 and its graphical interface 301 are shown by way of example), which will be described separately below.
  • the type of the blockchain network 20 is flexible and diverse, for example, it can be any one of a public chain, a private chain, or a consortium chain.
  • Taking a public chain as an example, the electronic equipment of any business entity, such as user equipment and servers, can access the blockchain network 20 without authorization; taking the consortium chain as an example, after being authorized, the electronic devices (for example, devices/servers) under a business entity's jurisdiction can access the blockchain network 20, at which point they become a special type of node in the blockchain network 20, namely client nodes.
  • For example, the client node may only provide functions supporting the business entity in initiating transactions (for example, storing data on the chain or querying data on the chain), while the functions of the blockchain network 20's native nodes, such as the sorting, consensus, and ledger functions described below, can be implemented by default or selectively (for example, depending on the specific business requirements of the business entity). Therefore, the data and business processing logic of the business entity can be migrated to the blockchain network 20 to the greatest extent, and the credibility and traceability of data and business processing can be realized through the blockchain network 20.
  • the blockchain network 20 receives the transaction submitted by the client node (for example, the device 300 belonging to the monitoring system 30 shown in FIG. 2A) of the business entity (for example, the monitoring system 30 shown in FIG. 2A), and executes the transaction to Update the ledger or query the ledger, and display various intermediate or final results of the execution of the transaction on the user interface of the device (for example, the graphical interface 301 of the device 300).
  • the equipment 300 of the monitoring system 30 is connected to the blockchain network 20 and becomes a client node of the blockchain network 20.
  • The device 300 obtains the first separated sample set through sensors, and transmits the trained unsupervised network to the server 200 in the blockchain network 20 or saves it in the device 300.
  • When upload logic has been deployed on the device 300 or the user performs an operation, the device 300 generates a transaction corresponding to the update/query operation according to the request to be processed or the synchronization time query request, and specifies in the transaction the smart contract that needs to be called to implement the operation as well as the parameters passed to the smart contract.
  • The transaction also carries the digital signature of the monitoring system 30 (for example, a digest of the transaction encrypted using the private key in the digital certificate of the monitoring system 30), and the transaction is broadcast to the blockchain network 20.
  • the digital certificate can be obtained by the monitoring system 30 registering with the certification center 31.
  • When a native node in the blockchain network 20, for example the server 200, receives the transaction, it verifies the digital signature carried by the transaction; after the digital signature is verified, it confirms, according to the identity of the monitoring system 30 carried in the transaction, whether the monitoring system 30 has transaction authority. Failure of either the digital signature or the authority verification causes the transaction to fail. After successful verification, the native node appends its own digital signature (for example, by encrypting a digest of the transaction using the native node's private key) and continues broadcasting in the blockchain network 20.
  • After the node with the sorting function in the blockchain network 20 receives the successfully verified transaction, it fills the transaction into a new block and broadcasts it to the nodes providing consensus services in the blockchain network 20.
  • the nodes that provide consensus services in the blockchain network 20 conduct a consensus process on the new block to reach agreement.
  • The node that provides the ledger function appends the new block to the end of the blockchain and executes the transactions in the new block: for a transaction submitting training of the audio separation network, the key-value pair corresponding to the output score result and the evaluation data set is updated; for a transaction querying the synchronization time, the key-value pair corresponding to the synchronization time is queried from the state database and the query result is returned.
  • the obtained synchronization time can be displayed in the graphical interface 301 of the device 300.
  • the native node in the blockchain network 20 can read the first separated sample set from the blockchain, and display the first separated sample set on the monitoring page of the native node.
  • The native node can also interpolate the first separated sample set to obtain the mixed first sample set; then train the unsupervised network based on the first sample set; and finally adjust the network parameters of the unsupervised network through the loss of the second separated samples to obtain a trained neural network. In this way, the first sample set is used as the samples for training the unsupervised network, which enriches the sample data of the unsupervised network and enhances its generalization ability.
  • The server 200 can be configured with both the training function and the accounting function for the audio separation network: it interpolates the first separated sample set to obtain the mixed first sample set; then trains the unsupervised network based on the first sample set; and finally adjusts the network parameters of the unsupervised network through the loss of the second separated samples to obtain a trained unsupervised network.
  • Alternatively, the server 200 may receive the first separated sample set sent by the device 300 and interpolate it to obtain the mixed first sample set; then train the unsupervised network based on the first sample set; and finally adjust the network parameters through the loss of the second separated samples to obtain the trained unsupervised network.
  • FIG. 2B is a schematic structural diagram of an audio separation network training system provided by an embodiment of the present application, which includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430.
  • the various components are coupled together through the bus system 440.
  • the bus system 440 is used to implement connection and communication between these components.
  • the bus system 440 also includes a power bus, a control bus, and a status signal bus.
  • various buses are marked as the bus system 440 in FIG. 2B.
  • The processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor, another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
  • the user interface 430 includes one or more output devices 431 that enable the presentation of media content, including one or more speakers and/or one or more visual display screens.
  • The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, in some examples a keyboard, a mouse, a microphone, a touch-screen display, a camera, and other input buttons and controls.
  • the memory 450 may be removable, non-removable, or a combination thereof.
  • Exemplary hardware devices include solid-state memory, hard disk drives, optical disk drives, and so on.
  • the memory 450 optionally includes one or more storage devices that are physically remote from the processor 410.
  • the memory 450 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read only memory (Read Only Memory, ROM), and the volatile memory may be a random access memory (Random Access Memory, RAM).
  • the memory 450 described in the embodiment of the present application is intended to include any suitable type of memory.
  • the memory 450 can store data to support various operations. Examples of these data include programs, modules, and data structures, or a subset or superset thereof, as illustrated below.
  • Operating system 451 including system programs used to process various basic system services and perform hardware-related tasks, such as framework layer, core library layer, driver layer, etc., configured to implement various basic services and process hardware-based tasks;
  • The network communication module 452 is configured to reach other computing devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include Bluetooth, wireless compatibility certification (WiFi), Universal Serial Bus (USB), etc.
  • The presentation module 453 is configured to enable the presentation of information via one or more output devices 431 (for example, a display screen, speakers, etc.) associated with the user interface 430 (for example, a user interface for operating peripheral devices and displaying content and information).
  • the input processing module 454 is configured to detect one or more user inputs or interactions from one of the one or more input devices 432 and translate the detected inputs or interactions.
  • FIG. 2B shows a training server 455 for the audio separation network stored in the memory 450, which can be software in the form of programs, plug-ins, etc., and which includes the following software modules: a first acquisition module 4551, a first interpolation module 4552, a first separation module 4553, a first determination module 4554, and a first adjustment module 4555. The memory 450 also stores a data repair terminal 456, which can be a program or a plug-in and includes the following software modules: a second acquisition module 4561, a first input module 4562, and a first output module 4563. These modules are logical, and thus can be combined or further divided according to the functions realized; the function of each module is explained below.
  • the device provided in the embodiment of the application may be implemented in hardware.
  • The device provided in the embodiments of the application may be a processor in the form of a hardware decoding processor, programmed to execute the audio separation network training method provided by the embodiments; for example, the processor in the form of a hardware decoding processor may adopt one or more Application-Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Programmable Logic Devices (PLD), Complex Programmable Logic Devices (CPLD), Field-Programmable Gate Arrays (FPGA), or other electronic components.
  • Artificial Intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning. Each direction will be explained separately below.
  • Computer Vision (CV) is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to identify, track, and measure targets, and performs further graphics processing so that the result is more suitable for human eyes to observe or for transmission to instruments for detection.
  • Computer vision studies related theories and technologies trying to establish an artificial intelligence system that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric recognition technologies such as face recognition and fingerprint recognition.
  • Speech Technology includes Automatic Speech Recognition (ASR), Text To Speech (TTS), and voiceprint recognition technology. Enabling computers to be able to listen, see, speak, and feel is the future development direction of human-computer interaction, among which voice has become one of the most promising human-computer interaction methods in the future.
  • Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, the language people use daily, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, question-answering robots, knowledge graphs, and other technologies.
  • Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It studies how computers simulate or realize human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications are in all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and other technologies.
  • Autonomous driving technology usually includes high-precision maps, environment perception, behavior decision-making, path planning, motion control and other technologies.
  • Autonomous driving technology has a wide range of application prospects.
  • With the research and progress of artificial intelligence technology, it has been applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robotics, intelligent medical care, and intelligent customer service; with the development of technology, artificial intelligence will be applied in more fields and deliver increasingly important value.
  • Figure 3 is a schematic diagram of the implementation process of the audio separation network training method provided by an embodiment of the present application.
  • The audio separation network training method can be implemented by the training device of the audio separation network; the steps shown in Figure 3 are explained below.
  • Step S301 Obtain a first separated sample set.
  • The first separated sample set includes at least two types of audio with pseudo labels, for example, clean voice signals and interference signals with pseudo labels.
  • Obtaining the first separated sample set in step S301 may be done by generating the first separated sample set through simulation, or by using a trained network to separate unlabeled audio data to obtain the first separated sample set with pseudo labels.
  • Step S302 Interpolate the first separated sample set by using the disturbance data to obtain the first sample set.
  • In some embodiments, different disturbance data are used to interpolate each of the first separated samples respectively, and the interpolated data are then mixed to obtain the first sample set.
  • For example, the first separated sample set includes three first separated samples; three different pieces of disturbance data (for example, weights) are used to adjust the three first separated samples respectively, and the adjustment results are summed, realizing interpolation mixing of the first separated sample set to obtain the first sample set, as in the sketch below.
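  • As a concrete illustration of this interpolation mixing (a sketch under assumptions: the helper name and the Dirichlet sampling of the three weights are not from this application):

```python
import numpy as np

def interpolate_and_mix(separated_samples, weights):
    """Scale each separated sample by its perturbation weight and sum the
    results, yielding one mixed entry of the first sample set."""
    assert len(separated_samples) == len(weights)
    return sum(w * s for w, s in zip(weights, separated_samples))

# Example: three separated samples adjusted by three different weights.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(3)]  # 1 s of audio at 16 kHz
weights = rng.dirichlet(np.ones(3))                       # three positive weights
mixed = interpolate_and_mix(sources, weights)
```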
  • Step S303 Use an unsupervised network to separate the first sample set to obtain a second separated sample set.
  • the unsupervised network can be any type of student network used to separate audio data.
  • the first sample set is input into the unsupervised network to obtain multiple second separated samples predicted and separated by the unsupervised network.
  • the first sample set in which the speech signal and the interference signal are mixed together is input into the unsupervised network to obtain the predicted and separated speech signal and the interference signal.
  • Step S304 Determine the loss of the second separated sample in the second separated sample set.
  • In some embodiments, the loss between each second separated sample and the true value data of the first separated sample set is determined to obtain the loss of each second separated sample, that is, the gap between the second separated sample and the first separated sample is determined.
  • Step S305 Use the loss of the second separation sample to adjust the network parameters of the unsupervised network, so that the loss of the separation result output by the adjusted unsupervised network meets the convergence condition.
  • In some embodiments, the loss of each second separated sample is determined separately; the smallest loss is then selected from these losses and used to adjust the network parameters of the unsupervised network. After the adjustment, the unsupervised network with the adjusted network parameters continues to be trained until the loss of the separation result it outputs meets the convergence condition, at which point a trained unsupervised network is obtained whose output separation result is more accurate.
  • The convergence condition on the loss of the separation result output by the adjusted unsupervised network can be understood as the loss ultimately remaining unchanged, or the loss being less than a certain threshold, that is, the separation result output by the adjusted unsupervised network being the same as the true value data, or their similarity being greater than 99%, etc.
  • In the embodiments of the present application, first, the first separated sample set of two types of audio with pseudo labels is mixed by interpolation; here, a pseudo label can be understood as the separation result obtained after separation by the teacher network, that is, the result of the teacher network's initial separation of the sample.
  • Then, the unsupervised network is trained using the interpolation-mixed first sample set to obtain the separation result, that is, the second separated sample set; finally, the loss of the second separated samples is used to adjust the network parameters of the unsupervised network, so that the loss of the separation result output by the adjusted unsupervised network satisfies the convergence condition. In this way, in the process of training the unsupervised network, the first sample set, formed by interpolating two types of audio with pseudo labels using perturbation data, is used as the training data set, which enriches the sample data of the unsupervised network and thereby enhances the generalization ability of the trained unsupervised network; a compact sketch of one such training step follows.
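  • Under assumed network interfaces and shapes, one such teacher-student training step could be sketched as follows (illustrative, not the exact procedure of this application; `teacher` and `student` map a mixture of shape (batch, time) to source estimates of shape (batch, 2, time)):

```python
import torch
import torch.nn.functional as F

def training_step(teacher, student, optimizer, unlabeled_mix, alpha=1.0):
    with torch.no_grad():
        # Teacher separates the unlabeled mixture into pseudo-labeled sources.
        pseudo = teacher(unlabeled_mix)                      # (batch, 2, time)
    # Interpolate the pseudo-labeled sources with a perturbation weight.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    remixed = lam * pseudo[:, 0] + (1 - lam) * pseudo[:, 1]
    # Student separates the remixed signal; targets are the scaled sources.
    estimates = student(remixed)                             # (batch, 2, time)
    targets = torch.stack([lam * pseudo[:, 0], (1 - lam) * pseudo[:, 1]], dim=1)
    # Keep the smaller loss of the two output arrangements (PIT, per batch).
    loss = torch.minimum(
        F.mse_loss(estimates, targets),
        F.mse_loss(estimates, targets.flip(dims=[1])),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```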
  • In some embodiments, in order to improve the generalization ability of the unsupervised network, step S301 can be implemented in the following two ways:
  • Method 1: generate a variety of audio with pseudo labels by simulation, that is, obtain the first separated sample set.
  • Method 2: first, obtain sample audio that includes at least unlabeled audio.
  • the acquired unlabeled audio data is used as sample audio.
  • In some embodiments, the sample audio can be obtained in the following ways: collect audio data in any scene to obtain the sample audio, for example, collect audio data in a chat scene; or receive audio data sent by other devices as the sample audio, for example, the audio data of a piece of music sent by another device.
  • the trained supervised network is used to separate the sample audio according to the type of audio data to obtain separated samples of each type to obtain the first separated sample set.
  • The supervised network can be obtained through the following process. First, obtain labeled clean sample audio and noise sample audio; in some embodiments, this can be achieved by manually annotating the clean sample audio and the noise sample audio in the sample audio to obtain labeled clean sample audio and noise sample audio, or by randomly selecting some labeled clean sample audio and noise sample audio from a sample audio library.
  • the clean sample audio and the noise sample audio are mixed to obtain the third sample set.
  • For example, the clean sample audio and the noise sample audio are superimposed to obtain the mixed third sample set. Next, the third sample set is separated by the supervised network to be trained to obtain the fifth separated sample set; for example, the third sample set is input into the supervised network to be trained for predictive separation, and the separation result is the fifth separated sample set.
  • Next, the loss of the fifth separated samples in the fifth separated sample set is determined, that is, the loss between the fifth separated samples and the labeled clean sample audio and noise sample audio. Finally, the loss of the fifth separated samples is used to adjust the network parameters of the supervised network to be trained, so that the loss of the separation result output by the adjusted supervised network to be trained satisfies the convergence condition, and the trained supervised network is obtained. For example, the loss between each fifth separated sample and each true value is determined, the smallest loss is selected, and the smallest loss is used to adjust the network parameters of the supervised network to be trained, obtaining the trained supervised network.
  • the network parameters of the supervised network are updated based on the network parameters of the unsupervised network.
  • the network parameters of the supervised network are obtained by performing a moving average on the network parameters of the unsupervised network.
  • the trained supervised network can be a teacher network.
  • the types of audio data include at least: voice signals, noise signals, music signals, or other interference signals.
  • Sample audio containing multiple audio types is input into the trained supervised network, which separates it and obtains the separation result of each type with pseudo labels, that is, the first separated sample set. In this way, a supervised network is used to predictively separate the unlabeled sample audio, and the prediction results are then used as training samples for the unsupervised network, thereby enriching the sample data and improving the generalization ability of the unsupervised network.
  • In some embodiments, in order to enrich the sample data for training the unsupervised network, step S302 can be implemented through the following steps:
  • Step S321 Multiply each first separated sample by different disturbance data in one-to-one correspondence to obtain an adjusted data set.
  • different first separated samples are multiplied by different disturbance data.
  • the first separated sample set includes two first separated samples, the disturbance data A is used to multiply one of the first separated samples, and the disturbance data B (or 1-A) is used to multiply the other first separated sample.
  • The adjustment is not limited to the amplitude of the first separated sample; the frequency or speech rate of the first separated sample may also be adjusted to obtain the adjusted data set.
  • Step S322 Sum the adjustment data in the adjustment data set to obtain the first sample set.
  • the adjustment data in the adjustment data set is summed to obtain the mixed audio data, that is, the first sample set.
  • the sample data for training the unsupervised network is enriched, so that the generalization ability of the trained unsupervised network is stronger.
  • In some embodiments, step S305 can be implemented by the following steps; see Figure 4A.
  • Figure 4A is a schematic diagram of another implementation process of the audio separation network training method provided by an embodiment of the present application. The method can be implemented by the audio separation network training device. Based on Figure 3, the steps are described as follows:
  • Step S401 Determine the loss between each second separated sample and the true value data of the first separated sample set, and obtain the loss of each second separated sample to obtain a loss set.
  • the loss between each second separated sample and the true value data of the first separated sample set is determined separately to obtain the loss set.
  • Step S402 Determine the minimum loss from the loss set.
  • the minimum loss indicates that the gap between the second separated sample and the true value data is the smallest, indicating that the accuracy of the second separated sample is higher.
  • Step S403 Based on the minimum loss, update the network parameters of the unsupervised network to obtain the updated network parameters.
  • the minimum loss is fed back to the unsupervised network to adjust the network parameters of the unsupervised network, for example, the weight value of the convolution operation of the unsupervised network or the structural parameters of the channel are adjusted to obtain Updated network parameters.
  • the updated network parameters are fed back to the supervised network to update the network parameters of the supervised network, that is, step S404 is entered.
  • Step S404 Feed the updated network parameters back to the supervised network to adjust the network parameters of the supervised network and obtain the updated supervised network.
  • Step S405 Based on the updated supervised network and the sample audio, continue to adjust the network parameters of the updated unsupervised network, so that the loss of the separation result output by the adjusted unsupervised network meets the convergence condition.
  • the moving average of the network parameters of the unsupervised network is used to update the network parameters of the supervised network. That is, the moving average value of the updated network parameters is determined first; then, the moving average value is fed back to the supervised network to adjust the network parameters of the supervised network to obtain the updated supervised network. For example, use the moving average as the network parameter of the supervised network to obtain an updated supervised network.
  • In this way, the network parameters of the unsupervised network are adjusted using the minimum loss, and a moving average is then performed on the updated network parameters of the unsupervised network to obtain the updated supervised network; both the supervised network and the unsupervised network are trained multiple times, so that the finally trained unsupervised network has higher separation accuracy.
  • In some embodiments, step S404 further includes the following steps:
  • Step S441 Use the updated supervised network to separate the sample audio again to obtain a third separated sample set.
  • The sample audio is re-input into the updated supervised network, which separates it again according to audio type to obtain a third separated sample set containing pseudo labels.
  • For example, sample audio including unlabeled clean speech signals and unlabeled interference signals is input into the updated supervised network to obtain clean speech signals with pseudo labels and interference signals with pseudo labels.
  • Step S442 Interpolate the third separated sample set using the disturbance data to obtain a second sample set, and input the second sample set into the updated unsupervised network.
  • The disturbance data is used to perform interpolation mixing on the third separated sample set, thereby mixing it to obtain the second sample set; this second sample set serves as the sample for training the unsupervised network and is input into the updated unsupervised network. For example, the clean speech signal with a pseudo label and the interference signal with a pseudo label are interpolated and mixed to obtain the second sample set.
  • Step S443 Use the updated unsupervised network to perform prediction and separation again on the second sample set to obtain a fourth separated sample set.
  • the updated unsupervised network is used to predict and separate the second sample set again to obtain the predicted separation result, that is, the fourth separated sample set.
  • Step S444 Determine the loss of the fourth separated sample in the fourth separated sample set.
  • the loss between the fourth separated sample and the sample audio is determined, that is, the gap between the fourth separated sample and the sample audio is determined.
  • Step S445 Use the loss of the fourth separated samples to adjust the network parameters of the updated unsupervised network and the network parameters of the updated supervised network, so that the loss of the separation result output by the adjusted updated unsupervised network satisfies the convergence condition.
  • In some embodiments, the loss between each fourth separated sample and the true value data is first determined, and the network parameters of the updated unsupervised network are adjusted again based on the smallest loss, so that the loss of the separation result output by the adjusted updated unsupervised network satisfies the convergence condition; a trained unsupervised network is thus obtained.
  • In the embodiments of the present application, a supervised network (such as a teacher network) estimates separation results, which are weighted and "mixed" to obtain more useful pseudo-labeled input-output sample pairs (i.e., the first sample set); these pairs are used to train an unsupervised network (for example, a student network).
  • In this way, the training of the student network is realized in a semi-supervised manner, so that the separation results output by the trained student network are more accurate.
  • the embodiment of the present application provides an audio separation method, which can be implemented by an audio separation device, which will be described in detail below with reference to FIG. 4B.
  • Step S421 Acquire audio to be separated.
  • the audio to be separated may include any type of audio signal in any scene, for example, voice in an indoor chat scene for a period of time, audio in an outdoor environment for a period of time, or a piece of music played.
  • the audio to be separated may be audio data actively collected by an audio separation device, or may also be received audio data sent by other devices. For example, a segment of voice in an indoor chat scene collected by an audio collection device in an audio separation device, or a segment of audio data of a video in a TV play sent by other devices.
  • Step S422 Use the trained neural network to separate the to-be-separated audio to obtain a separation result.
  • The neural network is trained based on the above-mentioned audio separation network training method: first, a first separated sample set including two types of audio with pseudo labels is interpolated to obtain the first sample set; then, the first sample set is input into the neural network to obtain the predicted separation result, that is, the second separated sample set; finally, the loss of the second separated samples is used to adjust the network parameters of the neural network, so that the loss of the separation result output by the adjusted neural network meets the convergence condition, yielding the trained neural network.
  • When the neural network obtained by such training is used to separate the to-be-separated audio, the audio can be accurately separated into the various types of separation results regardless of whether its scene matches the scene of the training sample data.
  • the audio to be separated is an audio collected indoors for a multi-person chat.
  • the audio includes voice signals and indoor noise signals.
  • The audio to be separated is input into the neural network trained in this way, and a clean speech signal and a noise signal are obtained; the two signals are accurately separated.
  • In some embodiments, first, the supervised network is used to separate the to-be-separated audio according to the type of audio data, obtaining candidate separation results of each type, that is, the separation result set; then, the disturbance data is used to interpolate the separation result set to obtain the interpolated result set; next, the trained unsupervised network is used to separate the interpolated result set to obtain the final separation result; then go to step S423.
  • Step S423 Output the separation result.
  • In the embodiments of the present application, the first separated sample sets of the two types of audio with pseudo labels are interpolated to obtain the mixed first sample set; then, the unsupervised network is trained based on the first sample set, and its network parameters are adjusted based on the loss of the second separated samples, so that the loss of the separation result output by the adjusted unsupervised network meets the convergence condition. In this way, for the two types of audio with pseudo labels, interpolation with disturbance data yields the first sample set.
  • the first sample set is used as the sample for training the unsupervised network, which enriches the sample data of the unsupervised network, thereby enhancing the generalization ability of the unsupervised network.
  • The audio to be separated is input into the neural network trained in this way to obtain a separation result with higher accuracy, as in the inference sketch below.
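  • A minimal inference sketch, assuming a trained student network `trained_student` with the same interface as in the earlier sketches:

```python
import torch

def separate(trained_student, audio_to_separate):
    """Separate one mixture into a speech estimate and an interference estimate."""
    trained_student.eval()
    with torch.no_grad():
        mixture = torch.as_tensor(audio_to_separate, dtype=torch.float32)
        estimates = trained_student(mixture.unsqueeze(0))  # add a batch dimension
        speech, interference = estimates[0, 0], estimates[0, 1]
    return speech, interference  # the separation result to be output
```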
  • the embodiments of the present application propose a new, effective and easy-to-implement consistency-based semi-supervised learning algorithm, namely, Mixup-Breakdown training (MBT), which is used for speech separation tasks.
  • MBT first introduces the Mean Teacher (MT) network to predict the separation results of the input mixed signals.
  • The input mixed signals include labeled data and unlabeled data; these intermediate output results are then randomly interpolated and mixed to obtain the first sample set with pseudo labels; finally, the student network is updated by optimizing the prediction consistency between the teacher network (for example, the supervised network) and the student network (for example, the unsupervised network).
  • The embodiments of the application verify the performance of the MBT network on mixed speech data containing interference not seen during training, and the results show that the separation performance of MBT is remarkable.
  • For labeled data, the label y = (s, e) consists of the clean speech s and the interference e.
  • In addition to labeled data, far more data is unlabeled data, which is easy to obtain and reflects real scenes but has yet to be exploited.
  • FIG. 5A is a schematic diagram of the implementation process of the training method of a supervised network according to an embodiment of the present application. With reference to FIG. 5A, the following description will be made:
  • First, the labeled clean speech signal 501 and the interference signal 502 are mixed to obtain the labeled mixed signal 503 (i.e., the third sample set); then, the mixed signal 503 is used to train the student network 504: the mixed signal 503 is input into the student network 504, the loss of each predicted separation result is determined, and the separation result with the smallest loss is regarded as the most accurate, such that the separation results 505 and 506 correspond to the clean speech signal 501 and the interference signal 502, respectively.
  • the network parameters of the student network 504 are adjusted based on the minimum loss to obtain a trained student network, and the trained student network is used as the teacher network 512 in FIG. 5B.
  • In some embodiments, scale-invariant SNR (SI-SNR) and PIT are used to define the loss function $L(f_\theta(x), y)$ of the trained student network, as shown in formula (1):
  • $L(f_\theta(x), y) = \min_{(u,v)} \left[ -\text{SI-SNR}(\hat{s}_1, u) - \text{SI-SNR}(\hat{s}_2, v) \right] \quad (1)$
  • where $(\hat{s}_1, \hat{s}_2) = f_\theta(x)$ are the two separated outputs, $\text{SI-SNR}(\hat{s}, s) = 10 \log_{10} \left( \lVert P_s(\hat{s}) \rVert^2 / \lVert \hat{s} - P_s(\hat{s}) \rVert^2 \right)$, and $P_a(b) = (\langle a, b \rangle / \lVert a \rVert^2)\, a$ represents the projection of b onto a; u and v each represent one of the clean voice signal and the interference signal, and u and v are different.
  • In some embodiments, the scale-invariant signal-to-noise ratio loss function used in formula (1) can be replaced by other reconstruction-type loss functions, such as the mean square error. A sketch of the SI-SNR computation follows.
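  • For reference, the standard SI-SNR computation can be sketched as follows (the zero-mean convention and epsilon guard are common implementation choices, not specified here):

```python
import torch

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB between two 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: s_target = (<e, t> / ||t||^2) * t.
    s_target = (torch.dot(estimate, target) / (torch.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * torch.log10(
        (torch.sum(s_target ** 2) + eps) / (torch.sum(e_noise ** 2) + eps)
    )
```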
  • Figure 5A shows the process of supervised learning.
  • In supervised learning, the input-output pair conforms to the joint distribution P(x, y) (this distribution is usually unknown), and the goal is to minimize the expected loss function over this distribution (the expected risk); the optimal supervised network parameter $\theta^*$ is shown in formula (2):
  • $\theta^* = \arg\min_\theta \int L(f_\theta(x), y) \, dP_{emp}(x, y; D_L) \quad (2)$
  • where $N_L$ represents the number of labeled samples and $D_L$ represents the labeled sample data; $dP_{emp}(x, y; D_L)$ can be expressed as formula (3):
  • $P_{emp}(x, y; D_L) = \frac{1}{N_L} \sum_{i=1}^{N_L} \delta(x = x_i, y = y_i) \quad (3)$
  • where $\delta(\cdot)$ represents a Dirac $\delta$ function centered on $(x_i, y_i)$; based on this, the expected risk can be estimated using the $N_L$ training samples.
  • The complex neural network trained by the method of formulas (1) to (3) actually "memorizes" the training data rather than "generalizing" from it; in addition, reports show that a network relying solely on this kind of training cannot withstand adversarial attacks, that is, samples deviating only slightly from the training data distribution can induce the system to give completely different, erroneous predictions. Therefore, a network trained in this way cannot generalize to test data that is slightly mismatched with the supervised training data set.
  • an embodiment of this application therefore proposes a training method for an audio separation network that can still separate and identify a clean speech signal within mixed speech even when the clean signal has never been heard, and that remains highly stable and consistent under various perturbations, such as energy level, speaking rate, static or moving sources, and presence or absence of processing distortion.
  • Figure 5B shows the process of unsupervised learning.
  • the perturbation strategy is formed by interpolating and mixing the separated signals to promote consistency learning.
  • the trained student network obtained in Figure 5A is used as the teacher network 512. First, the unlabeled mixed audio data 511 is input to the teacher network 512 to obtain two separation results: the predicted separated interference signal 513 and clean speech signal 514. Second, the preset perturbation data is used to interpolate the interference signal 513 and the clean speech signal 514, yielding the mixed signal 515. Third, the mixed signal 515 serves as the input of the untrained student network 516, the network is trained, and the outputs with the smallest loss are selected, i.e., outputs 517 and 518, which correspond respectively to the teacher network 512's predicted separated interference signal 513 and clean speech signal 514.
  • based on the losses of outputs 517 and 518, the student network 516 is adjusted so that the loss of its separation results satisfies the convergence condition; in this way, in Figure 5B the teacher network 512 is a trained network, and unlabeled data is used to conduct semi-supervised training of the untrained student network 516, which improves the generalization ability of the finally trained student network 516.
  • the Mixup and Breakdown operations of Figure 5B, reconstructed here to the standard MBT form, are $\tilde{x}_\lambda = \mathrm{Mixup}_\lambda(\hat{s}, \hat{e}) = \lambda\hat{s} + (1-\lambda)\hat{e}$ (formula (4)), where $(\hat{s}, \hat{e}) = \mathrm{Breakdown}(x) = f_{\theta_T}(x)$ (formula (5)); the interpolation weight λ is set to follow a Beta distribution, i.e., λ ~ Beta(α, α) with α ∈ (0, ∞); a sketch of the two operations follows.
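As a concrete illustration of the Breakdown-then-Mixup perturbation, here is a minimal Python sketch; the `teacher` callable returning two estimated sources is an assumed interface, and the Beta-distributed weight follows the setting just described.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_breakdown(teacher, x, alpha=1.0):
    """Breakdown: the teacher separates the mixture into pseudo-labeled sources;
    Mixup: interpolate those estimates into a new perturbed mixture."""
    s_hat, e_hat = teacher(x)                    # Breakdown, cf. formula (5)
    lam = rng.beta(alpha, alpha)                 # lambda ~ Beta(alpha, alpha)
    x_tilde = lam * s_hat + (1.0 - lam) * e_hat  # Mixup, cf. formula (4)
    # The scaled estimates serve as pseudo labels for the student network.
    return x_tilde, (lam * s_hat, (1.0 - lam) * e_hat)
```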
  • the MBT strategy trains the student network $f_{\theta_S}$: given an input mixed signal (labeled or unlabeled), it encourages consistency between the student's prediction and the perturbed prediction of the teacher network $f_{\theta_T}$, as in formula (6), reconstructed here as

$$L_{cons} = \mathbb{E}_{x,\lambda}\, d\big(f_{\theta_S}(\tilde{x}_\lambda),\ (\lambda\hat{s},\ (1-\lambda)\hat{e})\big)$$   (6)

  where d(·,·) is a separation distance such as the PIT SI-SNR loss above, and the teacher network parameters θ_T are the exponential moving average of the student network parameters θ_S.
  • using the exponential moving average of the student network parameters over multiple training steps yields a more accurate network, thereby accelerating the feedback loop between the student and teacher networks; a sketch of this update follows.
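The EMA update itself is a one-liner per parameter; the sketch below assumes parameters exposed as plain lists, and the decay value 0.999 is the conservativeness coefficient quoted later in the experimental setup.

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """theta_T <- decay * theta_T + (1 - decay) * theta_S for every parameter."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]
```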
  • the embodiment of this application mixes the predicted separation results after adding perturbation, which constructs more pseudo-labeled input-output sample pairs; since these pairs lie closer to the separation boundary, they are more useful for consistency-based regularization training.
  • given a total data set containing labeled data D_L and unlabeled data D_U, the audio separation network optimized by MBT training accounts for both accuracy and consistency, as in formula (7), reconstructed here as

$$\theta^* = \arg\min_\theta\Big[\sum_{j=1}^{N_L} L\big(f_{\theta_S}(x_j), y_j\big) + r(t)\sum_{j=1}^{N} L_{cons}(x_j)\Big]$$   (7)

  where r(t) represents a ramp function, so that the weight of the consistency term in the overall optimization goal gradually increases as training progresses; a sketch of the ramp and the combined objective follows.
  • Automatic online augmentation of data can be used to improve the generalization performance of a supervised learning network.
  • image samples are augmented by shifting, zooming in and out, rotating, flipping, and so on; similarly, in the field of speech recognition, speech training data is augmented by changing the SNR (signal-to-noise ratio), rhythm, vocal cord length, or speed. However, such augmentation builds only on labeled data, whereas the MBT method implements automatic online augmentation very easily, with negligible extra computation.
  • MBT can mine labeled data (i.e., j ∈ {1, ..., N_L}) as well as unlabeled data (i.e., j ∈ {N_L + 1, ..., N}), generating pseudo-labeled input-output sample pairs and expanding the empirical distribution.
  • the MBT strategy is not limited to this; it extends intuitively to effects similar to other types of automatic online data augmentation, for example speaking rate, moving or static orientation (multi-microphone arrays, i.e., multi-channel scenes), algorithmic distortion, and so on.
  • the network structure can adopt the Conv-TasNet architecture; more advanced semi-supervised mixing methods are also implemented, with MT and ICT as reference systems for comparison. The decay coefficient constraining the conservativeness of the mean-teacher network is set to 0.999 for all of these semi-supervised methods, and the ramp function is set to r(t) = exp(t/T_max − 1) for t ∈ {1, ..., T_max}, where T_max = 100 is the maximum number of training iterations. In addition, the α in the interpolation weight λ ~ Beta(α, α) is set to 1, i.e., λ is uniformly distributed over [0, 1].
  • the network structure and specific parameters may also be set differently. The embodiments of this application do not specifically limit the network type or topology of the deep neural network, which can be replaced with various other effective network structures, for example long short-term memory structures, networks combining CNNs with other structures, or other structures such as time-delay networks and gated convolutional neural networks.
  • the topology of the network can be expanded or simplified according to actual application restrictions on network memory occupation and requirements for detection accuracy.
  • the MBT provided by the embodiments of this application achieves the best scale-invariant SNR improvement (SI-SNRi) performance with the smallest network size (8.8 M), and its SI-SNRi and SDRi are both the highest.
  • the embodiment of this application also tested the performance of the MBT semi-supervised learning method under interference types unseen across multiple domains.
  • the unlabeled data sets WSJ0-Libri, WSJ0-noise, and WSJ0-music are combined into one data set (WSJ0-multi).
  • as Table 2 shows, whichever data set serves for training, Mixup-Breakdown training holds roughly steady when the test speech mismatches the training speech: the SI-SNRi is 13.75 on training set WSJ0-2mix, 13.95 with unlabeled WSJ0-Libri added, and 13.88 with unlabeled WSJ0-multi added. Likewise, under background-noise mismatch, the SI-SNRi is 13.21 with unlabeled WSJ0-noise added and 13.52 with unlabeled WSJ0-multi added.
  • Table 3 The separation performance of different training methods in the case of background noise mismatch
  • Table 4 The separation performance of different training methods in the case of music mismatch
  • ICT is an important extension and improvement of the mean teacher, mainly reflected in the computation of the consistency-based loss function L_ICT, reconstructed here as formula (8):

$$L_{ICT} = d\big(f_{\theta_S}(\lambda x_i + (1-\lambda)x_j),\ \lambda f_{\theta_T}(x_i) + (1-\lambda)f_{\theta_T}(x_j)\big)$$   (8)

  with (x_i, y_i) ~ D_L and (x_j, y_k) ~ D_U, where D_L denotes the labeled and D_U the unlabeled samples.
  • the samples used for "Mixup" in ICT are drawn directly and randomly from the unlabeled data. The embodiment of this application applies ICT to the speech separation task and compares it with MBT as an ablation experiment verifying the significance of the "Breakdown" process.
  • the training apparatus for the audio separation network stored in the memory 450 may include: a first acquisition module 4551 configured to acquire a first separated sample set, the first separated sample set including at least two types of audio with pseudo labels; a first interpolation module 4552 configured to interpolate the first separated sample set with perturbation data to obtain a first sample set; a first separation module 4553 configured to separate the first sample set with an unsupervised network to obtain a second separated sample set; a first determination module 4554 configured to determine the loss of the second separated samples in the second separated sample set; and a first adjustment module 4555 configured to adjust the network parameters of the unsupervised network with the loss of the second separated samples, so that the loss of the separation results output by the adjusted unsupervised network satisfies the convergence condition.
  • the first acquisition module 4551 is further configured to: acquire sample audio that includes at least unlabeled audio; and use a trained supervised network to separate the sample audio by audio-data type, obtaining separated samples of each type so as to obtain the first separated sample set, wherein the network parameters of the supervised network are updated based on the network parameters of the unsupervised network.
  • the first interpolation module 4552 is further configured to: multiply each first separated sample, in one-to-one correspondence, by different perturbation data to obtain an adjustment data set; and sum the adjustment data in the adjustment data set to obtain the first sample set.
  • the first determination module 4554 is further configured to determine the loss between each second separated sample and the ground-truth data of the first separated sample set, obtaining the loss of each second separated sample so as to obtain a loss set; the first adjustment module 4555 is further configured to determine a minimum loss from the loss set and, based on the minimum loss, update the network parameters of the unsupervised network to obtain updated network parameters.
  • the first adjustment module 4555 is further configured to feed the updated network parameters back to the supervised network, so as to adjust the network parameters of the supervised network and obtain an updated supervised network.
  • the first adjustment module 4555 is further configured to: determine a moving average of the updated network parameters; and feed the moving average back to the supervised network to adjust the network parameters of the supervised network, so as to obtain the updated supervised network.
  • the first adjustment module 4555 is further configured to: use the updated supervised network to separate the sample audio again to obtain a third separated sample set; interpolate the third separated sample set with the perturbation data to obtain a second sample set, and input the second sample set into the updated unsupervised network; use the updated unsupervised network to perform prediction and separation again on the second sample set to obtain a fourth separated sample set; determine the loss of the fourth separated samples in the fourth separated sample set; and use the loss of the fourth separated samples to adjust the network parameters of the updated unsupervised network and of the updated supervised network, so that the loss of the separation results output by the adjusted, updated unsupervised network satisfies the convergence condition.
  • the first separation module 4553 is further configured to: obtain labeled clean sample audio and noisy sample audio; mix the clean sample audio and the noisy sample audio to obtain a third sample set; separate the third sample set with the supervised network to be trained to obtain a fifth separated sample set; determine the loss of the fifth separated samples in the fifth separated sample set; and use the loss of the fifth separated samples to adjust the network parameters of the supervised network to be trained, so that the loss of the separation results output by the adjusted supervised network to be trained satisfies the convergence condition, obtaining the trained supervised network.
  • the software modules of the terminal 456 stored in the memory 450 may include:
  • the second acquisition module 4561 is configured to acquire the audio to be separated;
  • the first input module 4562 is configured to separate the audio to be separated with a trained neural network to obtain a separation result, the neural network being trained with the aforementioned training method for the audio separation network;
  • the first output module 4563 is configured to output the separation result.
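The inference-side modules 4561 to 4563 reduce to a thin wrapper around the trained network; the class below is a hypothetical sketch of that path, not code from the application.

```python
class AudioSeparator:
    """Sketch of the acquire / separate / output modules (4561-4563)."""

    def __init__(self, trained_network):
        self.net = trained_network  # network trained with the MBT method

    def run(self, mixture):
        # First input module: separate the audio to be separated.
        speech_est, interference_est = self.net(mixture)
        # First output module: hand back the separation result.
        return speech_est, interference_est
```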
  • an embodiment of this application provides a computer storage medium storing executable instructions; when the executable instructions are executed by a processor, they cause the processor to execute the audio separation method provided by the embodiments of this application, or to execute the training method for the audio separation network provided by the embodiments of this application.
  • the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; it may also be any of various terminals including one or any combination of the foregoing memories.
  • the executable instructions may be in the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as an independent program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • executable instructions may, but need not, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files storing one or more modules, subroutines, or code sections).
  • executable instructions can be deployed to be executed on one vehicle-mounted computing terminal, on multiple computing terminals located at one site, or on multiple computing terminals distributed across multiple sites and interconnected by a communication network.
  • in summary: the first separated sample set of two types of pseudo-labeled audio is interpolated to obtain the mixed first sample set; the unsupervised network is then trained on the first sample set, and its network parameters are adjusted based on the loss of the second separated samples so that the loss of the separation results output by the adjusted unsupervised network satisfies the convergence condition; in this way, during the training of the unsupervised network, two types of pseudo-labeled audio and a first sample set interpolated with perturbation data serve as the training samples, which enriches the sample data of the unsupervised network and thereby enhances its generalization ability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A training method for an audio separation network, an audio separation method, a device, and a medium. The method includes: acquiring a first separated sample set (S301), the first separated sample set including at least two types of audio with pseudo labels; interpolating the first separated sample set with preset perturbation data to obtain a first sample set (S302); separating the first sample set with an unsupervised network to obtain a second separated sample set (S303); determining the loss of the second separated samples in the second separated sample set (S304); and adjusting the network parameters of the unsupervised network with the loss of the second separated samples, so that the loss of the separation results output by the adjusted unsupervised network satisfies a convergence condition (S305).

Description

Training method for audio separation network, audio separation method, apparatus, and medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to Chinese Patent Application No. 202010086752.X, filed on February 11, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of machine learning, and in particular to a training method for an audio separation network, an audio separation method, an apparatus, and a medium.
BACKGROUND
In the related art, deep-learning speech separation networks generalize poorly: when training speech and test speech are mismatched, even a state-of-the-art speech separation network may fail abruptly when evaluated on the test speech. Owing to constraints of time, labor, and cost, collecting large-scale, wide-coverage, and sufficiently diverse labeled training data is usually impractical; the shortage of labeled data causes complex networks with many parameters to overfit and to generalize poorly.
SUMMARY
Embodiments of this application provide a training method for an audio separation network, an audio separation method, an apparatus, and a medium, which can use a first sample set as training samples for an unsupervised network, enriching the sample data of the unsupervised network and enhancing its generalization ability.
The technical solutions of the embodiments of this application are implemented as follows:
In a first aspect, an embodiment of this application provides a training method for an audio separation network, applied to a training device for the audio separation network, including:
acquiring a first separated sample set, the first separated sample set including at least two types of audio with pseudo labels;
interpolating the first separated sample set with perturbation data to obtain a first sample set;
separating the first sample set with an unsupervised network to obtain a second separated sample set;
determining the loss of the second separated samples in the second separated sample set;
adjusting the network parameters of the unsupervised network with the loss of the second separated samples, so that the loss of the separation results output by the adjusted unsupervised network satisfies a convergence condition.
In a second aspect, an embodiment of this application provides an audio separation method, applied to an audio separation device, the method including:
acquiring audio to be separated;
separating the audio to be separated with a trained neural network to obtain a separation result, the neural network being trained by the above training method for the audio separation network;
outputting the separation result.
In a third aspect, an embodiment of this application provides a training apparatus for an audio separation network, the apparatus including:
a first acquisition module configured to acquire a first separated sample set, the first separated sample set including at least two types of audio with pseudo labels;
a first interpolation module configured to interpolate the first separated sample set with perturbation data to obtain a first sample set;
a first separation module configured to separate the first sample set with an unsupervised network to obtain a second separated sample set;
a first determination module configured to determine the loss of the second separated samples in the second separated sample set;
a first adjustment module configured to adjust the network parameters of the unsupervised network with the loss of the second separated samples, so that the loss of the separation results output by the adjusted unsupervised network satisfies a convergence condition.
In a fourth aspect, an embodiment of this application provides an audio separation apparatus, the apparatus including:
a second acquisition module configured to acquire audio to be separated;
a first input module configured to separate the audio to be separated with a trained neural network to obtain a separation result, the neural network being trained by the training method for the audio separation network described in the first aspect;
a first output module configured to output the separation result.
In a fifth aspect, an embodiment of this application provides a computer storage medium storing executable instructions configured to cause a processor, when executing them, to implement the training method for the audio separation network described in the first aspect, or to implement the audio separation method described in the second aspect.
The embodiments of this application have the following beneficial effects: first, the first separated sample set of two types of pseudo-labeled audio is interpolated to obtain the mixed first sample set; then the unsupervised network is trained on the first sample set, and its network parameters are adjusted based on the loss of the second separated samples so that the loss of the separation results output by the adjusted unsupervised network satisfies the convergence condition; in this way, during the training of the unsupervised network, for the two types of pseudo-labeled audio, perturbation data is used for interpolation to obtain the first sample set, and using the first sample set as training samples for the unsupervised network enriches its sample data and thereby enhances its generalization ability.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are incorporated in and constitute a part of this specification; they illustrate embodiments consistent with this application and, together with the description, serve to explain the technical solutions of the embodiments.
FIG. 1 is a schematic diagram of an optional architecture of the training system for an audio separation network according to an embodiment of this application;
FIG. 2A is a schematic diagram of another optional architecture of the training system for an audio separation network according to an embodiment of this application;
FIG. 2B is a schematic structural diagram of the training system for an audio separation network according to an embodiment of this application;
FIG. 3 is a schematic flowchart of an implementation of the training method for an audio separation network according to an embodiment of this application;
FIG. 4A is a schematic flowchart of another implementation of the training method for an audio separation network according to an embodiment of this application;
FIG. 4B is a schematic flowchart of an implementation of the audio separation method according to an embodiment of this application;
FIG. 5A is a schematic flowchart of the training method for a supervised network according to an embodiment of this application;
FIG. 5B is a schematic flowchart of the training method for an unsupervised network according to an embodiment of this application.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting this application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.
In the following description, "some embodiments" describes a subset of all possible embodiments; it may be the same subset or different subsets of all possible embodiments, and the subsets may be combined with each other without conflict.
In the following description, the terms "first/second/third" merely distinguish similar objects and do not imply a particular ordering of the objects; where permitted, the specific order or sequence may be interchanged so that the embodiments described herein can be implemented in orders other than those illustrated or described.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the art to which this application belongs. The terms used herein are only for describing the embodiments of this application and are not intended to limit this application.
Before the embodiments of this application are described in further detail, the nouns and terms involved in the embodiments are explained; they apply to the following interpretations.
1) Automatic Speech Recognition (ASR): a technology that converts human speech into text. Speech recognition is a multidisciplinary field closely linked to acoustics, phonetics, linguistics, digital signal processing theory, information theory, and computer science. Owing to the diversity and complexity of speech signals, a speech recognition system can achieve satisfactory performance only under certain constraints, or only in certain specific applications.
2) Permutation Invariant Training (PIT): a training technique that solves the label-permutation problem by minimizing the separation error; permutation invariance means that changing the order of the inputs does not affect the output values. PIT computes the target loss function under all possible output permutations and selects the permutation corresponding to the lowest target loss to decide the correct output ordering; it is a general and effective method, at the cost of complexity that grows as the output dimension increases.
3) Semi-Supervised Learning (SSL): a key research topic in pattern recognition and machine learning, a learning approach that combines supervised and unsupervised learning. SSL uses large amounts of unlabeled data together with labeled data for pattern recognition; it requires as little human labeling effort as possible while still delivering relatively high accuracy.
4) Consistency-based semi-supervised learning: first, sample a minibatch of labeled data; second, feed it to the network to compute a cross-entropy loss; third, sample two minibatches of unlabeled data; fourth, let the network predict their labels; fifth, mix the two unlabeled batches; sixth, compute the squared-error loss of the predictions on the mixed data; finally, add the labeled loss and the squared-error loss, and iteratively update the network parameters to obtain the final network.
5) Exponential Moving Average (EMA), also known as the moving-average method: on the basis of a simple average, the moving average is computed by successively adding new data and dropping old data period by period, thereby smoothing out incidental fluctuations, revealing the development trend, and making predictions accordingly. It is a trend-extrapolation technique: a curve is fitted to a data sequence with a clear trend and the new curve is then used to forecast the value at a future point.
6) Generative Adversarial Network (GAN): comprises two parts, a generator network and a discriminator network. The generator can, through network training, produce text, images, video, and other data from the input data according to the task; it is essentially a maximum-likelihood estimator that produces data of a specified distribution, capturing the distribution of the sample data and transforming the training bias into samples of the specified distribution via the parameter transformation of maximum-likelihood estimation. The discriminator is essentially a binary classifier that judges whether data such as images generated by the generator belongs to the real training data.
7) Mean Teacher network: comprises two networks of identical structure, a student network and a teacher network; the teacher's network parameters are computed from the student's, and the student's parameters are updated by gradient descent on the loss function. Throughout training, the teacher's parameters are obtained as a moving average of the student's parameters.
8) Deep Clustering (DPCL): the application of deep networks to unsupervised data clustering; the process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects is called clustering.
In the related art, advances in deep-learning-based speech separation have greatly improved the state-of-the-art performance measured on several benchmark data sets. Extensions of high-dimensional embedding networks include deep attractor networks, deep extractor networks, and anchored deep attractor networks, as well as methods based on permutation invariant training. PIT computes the target loss function under all possible output permutations and selects the permutation corresponding to the lowest loss to decide the correct output ordering; it is a general and effective method, at the cost of complexity that grows with the output dimension.
In practical applications, however, even state-of-the-art networks may fail when applied to scenarios whose interference-signal types mismatch those seen during training, because training a complex neural network with a large number of learnable parameters so that it generalizes well requires large-scale, wide-coverage, and sufficiently diverse training data. On one hand, collecting such high-quality labeled data for speech separation and recognition is expensive, burdensome, and sometimes unrealistic; although automatic augmentation of labeled data has been shown in practice to improve a network's generalization, the improvement is limited because such techniques cannot mine information beyond the labeled data, such as the information contained in massive unlabeled data. On the other hand, massive unlabeled data is usually very easy to obtain, yet it cannot be mined effectively and is therefore usually ignored by deep-learning-based speech separation and recognition systems.
On this basis, embodiments of this application provide a training method for an audio separation network, an audio separation method, an apparatus, and a medium. During the training of the unsupervised network, two types of pseudo-labeled audio and a first sample set interpolated with perturbation data are used; taking the first sample set as training samples enriches the sample data of the unsupervised network and thereby enhances its generalization ability.
Exemplary applications of the training device provided by the embodiments of this application are described below. The device may be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a set-top box, or a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), or as a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, or a smart watch. The terminal and the server may be connected directly or indirectly in wired or wireless manners, which is not limited in the embodiments of this application.
Referring to FIG. 1, a schematic diagram of an optional architecture of the training system for an audio separation network according to an embodiment of this application, which supports an exemplary application: first, an acquired first separated sample set 10 containing multiple types of pseudo-labeled audio (including at least two parts, clean audio data and an interference signal) is interpolated with perturbation data to obtain the mixed first sample set 11; then the first sample set 11 is input into an unsupervised network 12 to train it; finally, the loss of the second separated samples output by the unsupervised network 12 is fed back to the network to adjust its parameters until the loss of the separation results output by the adjusted network satisfies the convergence condition, yielding a trained unsupervised network 13. In this way, during the training of the unsupervised network, two types of pseudo-labeled audio and a first sample set interpolated with perturbation data serve as training samples, enriching the sample data and enhancing generalization. When audio 14 needs to be separated, it is input into the trained unsupervised network 13 to obtain an accurate separation result 15, which is then output; separating the audio with an unsupervised network 13 trained on unlabeled sample data improves the accuracy of the separation results.
Referring to FIG. 2A, another optional architecture of the training system for an audio separation network according to an embodiment of this application includes a blockchain network 20 (a server 200 is shown as an example of a native node) and a monitoring system 30 (a device 300 belonging to the monitoring system 30 and its graphical interface 301 are shown as examples), each described below.
The type of the blockchain network 20 is flexible and diverse, for example any one of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, electronic devices of any business entity, such as user devices and servers, can access the blockchain network 20 without authorization; taking a consortium chain as an example, after a business entity is authorized, the electronic devices (for example, devices/servers) under its jurisdiction can access the blockchain network 20, at which point they become a special type of node in the blockchain network 20, namely client nodes.
It should be noted that a client node may provide only the function of supporting the business entity in initiating transactions (for example, for storing data on the chain or querying on-chain data); as for the functions of the native nodes of the blockchain network 20, such as the sorting function, consensus services, and ledger function described below, the client node may implement them by default or selectively (for example, depending on the specific business needs of the business entity). Thus, the data and business processing logic of the business entity can be migrated into the blockchain network 20 to the greatest extent, and the credibility and traceability of the data and business processing are achieved through the blockchain network 20.
The blockchain network 20 receives transactions submitted by client nodes (for example, the device 300 belonging to the monitoring system 30 shown in FIG. 2A) of business entities (for example, the monitoring system 30 shown in FIG. 2A), executes the transactions to update or query the ledger, and displays various intermediate or final results of executing the transactions on the user interface of the device (for example, the graphical interface 301 of the device 300).
An exemplary application of the blockchain network is described below, taking as an example the monitoring system accessing the blockchain network to put the training of the audio separation network on the chain.
The device 300 of the monitoring system 30 accesses the blockchain network 20 and becomes a client node of the blockchain network 20. The device 300 acquires the first separated sample set through sensors, and passes the trained unsupervised network to the server 200 in the blockchain network 20 or stores it in the device 300. When upload logic has been deployed on the device 300 or the user performs operations, the device 300 generates, according to the to-be-processed item or synchronization-time query request, a transaction corresponding to the update/query operation; the transaction specifies the smart contract to be invoked to implement the update/query operation and the parameters passed to the smart contract, and also carries the digital signature of the monitoring system 30 (for example, obtained by encrypting the digest of the transaction with the private key in the digital certificate of the monitoring system 30), and the transaction is broadcast to the blockchain network 20. The digital certificate may be obtained by the monitoring system 30 registering with the certification authority 31.
When a native node in the blockchain network 20, for example the server 200, receives a transaction, it verifies the digital signature carried by the transaction; after the digital signature is verified successfully, it confirms, according to the identity of the monitoring system 30 carried in the transaction, whether the monitoring system 30 has the transaction authority; failure of either the digital-signature or authority verification causes the transaction to fail. After successful verification, the native node signs its own digital signature (for example, obtained by encrypting the digest of the transaction with the native node's private key) and continues broadcasting in the blockchain network 20.
After a node with the sorting function in the blockchain network 20 receives a successfully verified transaction, it fills the transaction into a new block and broadcasts it to the nodes providing consensus services in the blockchain network 20.
The nodes providing consensus services in the blockchain network 20 perform the consensus process on the new block to reach agreement; the nodes providing the ledger function append the new block to the tail of the blockchain and execute the transactions in the new block: for a transaction submitting the training of a new audio separation network, the key-value pairs corresponding to the output scoring results and the evaluation data set are updated; for a transaction querying the synchronization time, the key-value pair corresponding to the synchronization time is queried from the state database and the query result is returned. The obtained synchronization time may be displayed on the graphical interface 301 of the device 300.
A native node in the blockchain network 20 may read the first separated sample set from the blockchain and present it on the native node's monitoring page; the native node may also interpolate the first separated sample set to obtain the mixed first sample set, then train the unsupervised network based on the first sample set, and finally adjust the network parameters of the unsupervised network through the loss of the second separated samples to obtain the two trained neural networks. In this way, using the first sample set as training samples for the unsupervised network enriches its sample data and enhances its generalization ability.
In practical applications, different functions may be set for different native nodes of the blockchain network 20; for example, the server 200 may be given the training function and the bookkeeping function for the audio separation network: the server interpolates the first separated sample set uploaded by the device to obtain the mixed first sample set, then trains the unsupervised network based on the first sample set, and finally adjusts the network parameters of the unsupervised network through the loss of the second separated samples to obtain the trained unsupervised network. In this case, during the transaction, the server 200 receives the first separated sample set sent by the device 300, interpolates it to obtain the mixed first sample set, trains the unsupervised network on the first sample set, and finally adjusts the network parameters through the loss of the second separated samples to obtain the trained unsupervised network.
Referring to FIG. 2B, a schematic structural diagram of the training system for an audio separation network according to an embodiment of this application, including: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The components are coupled together by a bus system 440. It can be understood that the bus system 440 implements connection and communication among these components; in addition to a data bus, the bus system 440 includes a power bus, a control bus, and a status-signal bus, but for clarity all buses are labeled as the bus system 440 in FIG. 2B.
The processor 410 may be an integrated circuit chip with signal processing capability, for example a general-purpose processor, a digital signal processor, another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component; the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 430 includes one or more output devices 431 enabling the presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 430 also includes one or more input devices 432, including user-interface components that facilitate user input, in some examples a keyboard, a mouse, a microphone, a touch-screen display, a camera, and other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memories, hard-disk drives, and optical-disk drives. The memory 450 optionally includes one or more storage devices physically remote from the processor 410.
The memory 450 includes volatile memory or non-volatile memory, and may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), and the volatile memory may be a random-access memory (RAM). The memory 450 described in the embodiments of this application is intended to include any suitable type of memory.
In some embodiments, the memory 450 can store data to support various operations; examples of the data include programs, modules, and data structures, or subsets or supersets thereof, exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, configured to implement various basic services and process hardware-based tasks;
a network communication module 452, configured to reach other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including Bluetooth, wireless compatibility certification (Wi-Fi), and Universal Serial Bus (USB);
a presentation module 453, configured to enable the presentation of information (for example, a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (for example, display screens and speakers) associated with the user interface 430;
an input processing module 454, configured to detect one or more user inputs or interactions from one of the one or more input devices 432 and to translate the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of this application may be implemented in software. FIG. 2B shows a training server 455 for the audio separation network stored in the memory 450, which may be software in the form of programs and plug-ins, including the following software modules: a first acquisition module 4551, a first interpolation module 4552, a first separation module 4553, a first determination module 4554, and a first adjustment module 4555; and a data-repair terminal 456 in the memory 450, which may be software in the form of programs and plug-ins, including the following software modules: a second acquisition module 4561, a first input module 4562, and a first output module 4563. These modules are logical, so they may be arbitrarily combined or further split according to the functions implemented; the functions of each module are described below.
In other embodiments, the apparatus provided by the embodiments of this application may be implemented in hardware. As an example, it may be a processor in the form of a hardware decoding processor programmed to perform the training method for the audio separation network provided by the embodiments of this application, for example one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic elements.
To better understand the method provided by the embodiments of this application, artificial intelligence, its branches, and the application fields involved in the method are first described.
Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, AI is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, endowing machines with the functions of perception, reasoning, and decision-making.
AI technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing technology, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning, each described below.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to recognize, track, and measure targets, and further performs graphics processing so that the computer output becomes an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, CV studies related theories and technologies, attempting to build AI systems that can obtain information from images or multidimensional data. CV technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of speech technology include Automatic Speech Recognition (ASR), Text To Speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and AI. It studies various theories and methods that enable effective communication between humans and computers in natural language. NLP is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, the language people use daily, so it is closely related to the study of linguistics. NLP technology usually includes text processing, semantic understanding, machine translation, robot question answering, and knowledge graphs.
Machine Learning (ML) is a multidisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. ML is the core of AI and the fundamental way to make computers intelligent; its applications span all fields of AI. ML and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Autonomous driving technology usually includes high-precision maps, environment perception, behavior decision-making, path planning, and motion control, and has broad application prospects.
With the research and progress of AI technology, it has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, and smart customer service; it is believed that with the development of technology, AI will be applied in more fields and play an increasingly important role.
The solutions provided by the embodiments of this application involve AI technologies such as natural language processing, specifically described by the following embodiments.
Referring to FIG. 3, a schematic flowchart of an implementation of the training method for the audio separation network according to an embodiment of this application; the method may be implemented by a training device for the audio separation network and is described with reference to the steps shown in FIG. 3.
Step S301: acquire a first separated sample set.
In some embodiments, the first separated sample set includes at least two types of audio with pseudo labels, for example a pseudo-labeled clean speech signal and an interference signal. Acquiring the first separated sample set in step S301 may mean generating it by simulation, or separating unlabeled audio data with a trained network to obtain the pseudo-labeled first separated sample set.
Step S302: interpolate the first separated sample set with perturbation data to obtain a first sample set.
In some embodiments, each first separated sample is interpolated with different perturbation data, and the interpolated data are then mixed to obtain the first sample set. For example, when the first separated sample set contains three first separated samples, three different perturbation data (for example, weights) are used to adjust them respectively, and the adjustment results are summed, realizing interpolation-mixing of the first separated sample set to obtain the first sample set.
Step S303: separate the first sample set with an unsupervised network to obtain a second separated sample set.
In some embodiments, the unsupervised network may be any type of student network for separating audio data; the first sample set is input into the unsupervised network to obtain multiple second separated samples predicted by it. In a specific example, a first sample set in which a speech signal and an interference signal are mixed is input into the unsupervised network to obtain the predicted separated speech signal and interference signal.
Step S304: determine the loss of the second separated samples in the second separated sample set.
In some embodiments, the loss between each second separated sample and the ground-truth data of the first separated sample set is determined, yielding the loss of each second separated sample, i.e., the gap between the second separated sample and the first separated sample.
Step S305: adjust the network parameters of the unsupervised network with the loss of the second separated samples, so that the loss of the separation results output by the adjusted unsupervised network satisfies a convergence condition.
In some embodiments, the loss of each second separated sample is determined, the minimum loss is selected from these losses, and the network parameters of the unsupervised network are adjusted with it; after adjustment, training continues with the adjusted parameters until the loss of the separation results output by the network satisfies the convergence condition, yielding a trained unsupervised network whose separation results are relatively accurate. The convergence condition may be understood as the loss of the separation results eventually remaining unchanged, or being smaller than a specific threshold, i.e., the separation results being identical to the ground-truth data, or having a similarity greater than 99%, and so on.
In the embodiments of this application, first, the first separated sample set of two types of pseudo-labeled audio is interpolated and mixed; the pseudo label can be understood as the separation result obtained by the teacher network, i.e., the result of the teacher network's preliminary separation of the samples. Then, the unsupervised network is trained with the interpolation-mixed first sample set to obtain the separation result, i.e., the second separated sample set; finally, the network parameters of the unsupervised network are adjusted with the loss of the second separated samples so that the loss of the separation results output by the adjusted unsupervised network satisfies the convergence condition. In this way, during the training of the unsupervised network, two types of pseudo-labeled audio and a first sample set interpolated with perturbation data serve as the training data set, enriching the sample data and thereby enhancing the generalization ability of the trained unsupervised network. A minimal end-to-end sketch of steps S301 to S305 follows.
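Below is an illustrative Python sketch of one semi-supervised pass over steps S301 to S305 for an unlabeled batch; `teacher`, `student.forward`, `student.update`, and `loss_fn` (for example, the PIT SI-SNR sketch given earlier alongside formula (1)) are assumed interfaces rather than the application's reference code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mbt_training_step(teacher, student, unlabeled_batch, loss_fn, alpha=1.0):
    """One semi-supervised pass over steps S301-S305 (illustrative sketch)."""
    for x in unlabeled_batch:
        s_hat, e_hat = teacher(x)                    # S301: pseudo-labeled separated set
        lam = rng.beta(alpha, alpha)                 # S302: perturbation weight
        x_tilde = lam * s_hat + (1.0 - lam) * e_hat  # S302: interpolated first sample
        est = student.forward(x_tilde)               # S303: student separates
        loss = loss_fn(est, (lam * s_hat, (1.0 - lam) * e_hat))  # S304: loss
        student.update(loss)                         # S305: adjust parameters
```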
In some embodiments, to improve the generalization ability of the unsupervised network, step S301 may be implemented in the following two ways:
Way 1: generate multiple kinds of pseudo-labeled audio by simulation, i.e., obtain the first separated sample set.
Way 2: first, acquire sample audio including at least unlabeled audio.
For example, acquired unlabeled audio data serves as the sample audio. The sample audio may be acquired as follows: collect audio data in any scene to obtain the sample audio, for example collect audio data in a chat scene; or receive audio data sent by other devices as the sample audio, for example audio data of a piece of music sent by another device.
Then, a trained supervised network is used to separate the sample audio by audio-data type, obtaining separated samples of each type, so as to obtain the first separated sample set.
In some possible implementations, the supervised network may be obtained through the following process: first, acquire labeled clean sample audio and noisy sample audio; in some embodiments, this may be achieved by manually labeling the clean sample audio and the noisy sample audio in the sample audio, or by randomly selecting a portion of labeled clean sample audio and noisy sample audio from a sample audio library.
Second, mix the clean sample audio and the noisy sample audio to obtain a third sample set, for example by superimposing them. Next, separate the third sample set with the supervised network to be trained, i.e., input the third sample set into the network for prediction and separation to obtain the separation result, i.e., the fifth separated sample set. Next, determine the loss of the fifth separated samples in the fifth separated sample set, i.e., the loss between the fifth separated samples and the labeled clean sample audio and noisy sample audio. Finally, adjust the network parameters of the supervised network to be trained with the loss of the fifth separated samples, so that the loss of the separation results output by the adjusted network satisfies the convergence condition, obtaining the trained supervised network. For example, determine the loss between each fifth separated sample and any ground truth, select the minimum, and use it to adjust the network parameters of the supervised network to be trained.
In some embodiments, the network parameters of the supervised network are updated based on those of the unsupervised network, for example by taking a moving average of the unsupervised network's parameters. The trained supervised network may be a teacher network. The types of audio data include at least speech signals, noise signals, music signals, or other interference signals. Sample audio containing multiple audio types is input into the trained supervised network, which separates the sample audio to obtain pseudo-labeled separation results of each type, i.e., the first separated sample set. In this way, the supervised network performs prediction-separation on unlabeled sample audio, and the predicted separation results then serve as sample audio for the unsupervised network to be trained, enriching the sample data and improving the generalization ability of the unsupervised network.
In some embodiments, to enrich the sample data for training the unsupervised network, step S302 may be implemented by the following steps:
Step S321: multiply each first separated sample, in one-to-one correspondence, by different perturbation data to obtain an adjustment data set.
In some embodiments, different first separated samples are multiplied by different perturbation data. For example, when the first separated sample set contains two first separated samples, perturbation data A multiplies one of them and perturbation data B (or 1 − A) multiplies the other. The embodiments of this application are not limited to adjusting the amplitude of the first separated samples; the frequency or speaking rate may also be adjusted to obtain the adjustment data set.
Step S322: sum the adjustment data in the adjustment data set to obtain the first sample set. In some embodiments, summing the adjustment data in the adjustment data set yields the mixed audio data, i.e., the first sample set.
In the embodiments of this application, interpolation-mixing multiple first separated samples enriches the sample data for training the unsupervised network, making the trained unsupervised network generalize better. A sketch of steps S321 and S322 appears below.
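A sketch of steps S321 and S322 for an arbitrary number of separated samples follows; `separated_samples` and `weights` are assumed to be equal-length sequences of NumPy arrays and scalars, respectively.

```python
def interpolate_mix(separated_samples, weights):
    """S321: scale each separated sample by its own perturbation weight;
    S322: sum the adjusted data into one mixed sample."""
    assert len(separated_samples) == len(weights)
    adjusted = [w * s for w, s in zip(weights, separated_samples)]  # S321
    return sum(adjusted)                                            # S322
```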
In some embodiments, the network parameters of the supervised network are updated based on those of the unsupervised network, so that both the supervised and unsupervised networks are trained multiple times and the finally obtained trained unsupervised network separates more accurately. Step S305 may be implemented by the following steps; see FIG. 4A, a schematic flowchart of another implementation of the training method for the audio separation network according to an embodiment of this application, which may be implemented by the training device and is described on the basis of FIG. 3:
Step S401: determine the loss between each second separated sample and the ground-truth data of the first separated sample set, obtaining the loss of each second separated sample so as to obtain a loss set.
In some embodiments, the loss between each second separated sample and the ground truth of the first separated sample set is determined separately, yielding the loss set.
Step S402: determine the minimum loss from the loss set.
In some embodiments, the minimum loss indicates the smallest gap between the second separated sample and the ground-truth data, meaning that second separated sample is the most accurate.
Step S403: update the network parameters of the unsupervised network based on the minimum loss to obtain updated network parameters.
In some embodiments, the minimum loss is fed back to the unsupervised network to adjust its network parameters, for example the weight values of convolution operations or the structural parameters of channels, obtaining the updated network parameters. After step S403, i.e., after the updated network parameters are obtained, they are fed back to the supervised network to update its parameters, i.e., step S404.
Step S404: feed the updated network parameters back to the supervised network to adjust its network parameters, obtaining an updated supervised network.
Step S405: based on the updated supervised network and the sample audio, continue adjusting the network parameters of the updated unsupervised network so that the loss of the separation results output by the adjusted unsupervised network satisfies the convergence condition.
In some possible implementations, the supervised network's parameters are updated with a moving average of the unsupervised network's parameters: first, determine the moving average of the updated network parameters; then feed the moving average back to the supervised network to adjust its parameters, obtaining the updated supervised network. For example, the moving average serves as the network parameters of the supervised network, yielding the updated supervised network.
In the embodiments of this application, the unsupervised network's parameters are adjusted with the minimum loss, and a moving average of the updated parameters then yields the updated supervised network; both networks are thereby trained multiple times, making the finally trained unsupervised network separate more accurately.
In some embodiments, after the network parameters of both the unsupervised and supervised networks are updated, the updated supervised network continues to perform prediction-separation on the sample audio so that the updated unsupervised network can continue training, yielding the trained unsupervised network. After step S404, the following steps are further included:
Step S441: separate the sample audio again with the updated supervised network to obtain a third separated sample set.
In some embodiments, the sample audio is input again into the updated supervised network, which separates it again by audio type to obtain the pseudo-labeled third separated sample set. For example, sample audio containing an unlabeled clean speech signal and an unlabeled interference signal is input into the updated supervised network, yielding a pseudo-labeled clean speech signal and a pseudo-labeled interference signal.
Step S442: interpolate the third separated sample set with the perturbation data to obtain a second sample set, and input the second sample set into the updated unsupervised network.
In some embodiments, the third separated sample set is interpolation-mixed with the perturbation data, so that the third separated sample set is mixed to obtain the second sample set; the second sample set serves as a training sample for the unsupervised network and is input into the updated unsupervised network. For example, the pseudo-labeled clean speech signal and the pseudo-labeled interference signal are interpolation-mixed to obtain the second sample set.
Step S443: perform prediction-separation again on the second sample set with the updated unsupervised network to obtain a fourth separated sample set.
In some embodiments, the updated unsupervised network performs prediction-separation again on the second sample set, obtaining the predicted separation results, i.e., the fourth separated sample set.
Step S444: determine the loss of the fourth separated samples in the fourth separated sample set.
In some embodiments, the loss between the fourth separated samples and the sample audio is determined, i.e., the gap between them.
Step S445: adjust the network parameters of the updated unsupervised network and of the updated supervised network with the loss of the fourth separated samples, so that the loss of the separation results output by the adjusted, updated unsupervised network satisfies the convergence condition.
In some embodiments, first the loss between each fourth separated sample and the ground-truth data is determined; based on the minimum loss, the updated unsupervised network's parameters are adjusted again until the loss of the output separation results satisfies the convergence condition, yielding the trained unsupervised network.
In the embodiments of this application, a supervised network (for example, a teacher network) separates unlabeled sample audio, the estimated separation results are then weighted and "mixed" to obtain more useful pseudo-labeled input-output sample pairs (i.e., the first sample set), and the unsupervised network (for example, a student network) is trained on these pseudo-labeled pairs; the student network is thereby trained in a semi-supervised manner, making the trained student network's separation results more accurate.
An embodiment of this application provides an audio separation method, which may be implemented by an audio separation device, described in detail below with reference to FIG. 4B.
Step S421: acquire audio to be separated.
In some embodiments, the audio to be separated may contain audio signals of any type and any scene, for example speech in an indoor chat scene over a period of time, audio in an outdoor environment over some time, or a piece of music being played. In some possible implementations, the audio to be separated may be audio data actively collected by the audio separation device or audio data received from other devices, for example speech in an indoor chat scene collected by the audio collector in the audio separation device, or audio data of a video from a TV series received from another device.
Step S422: separate the audio to be separated with a trained neural network to obtain a separation result.
In some embodiments, the neural network is trained based on the above training method for the audio separation network; that is, the trained neural network is obtained by interpolating a first separated sample set including two types of pseudo-labeled audio to obtain the first sample set, inputting the first sample set into the neural network to obtain the predicted separation results, i.e., the second separated sample set, and adjusting the neural network's parameters with the loss of the second separated samples so that the loss of the output separation results satisfies the convergence condition. Separating the audio to be separated with a neural network trained in this way can accurately separate it into each type of separation result, whether or not the scene of the audio matches the scene of the training samples. For example, for indoor-collected audio of a multi-person chat, containing speech signals and indoor noise, inputting it into the trained neural network yields two signals, a clean speech signal and a noise signal, i.e., an accurate separation result. During separation by the trained neural network, first the supervised network in the system separates the audio by audio-data type to obtain candidate separation results of each type, forming a separation result set; then perturbation data is used to interpolate the separation result set to obtain an interpolated result set; then the trained unsupervised network separates the interpolated result set to obtain the final separation result; proceed to step S423.
Step S423: output the separation result.
In the embodiments of this application, the first separated sample set of two types of pseudo-labeled audio is interpolated to obtain the mixed first sample set; the unsupervised network is then trained on the first sample set, and its parameters are adjusted based on the loss of the second separated samples so that the loss of the output separation results satisfies the convergence condition. In this way, for the two types of pseudo-labeled audio, perturbation data is used for interpolation to obtain the first sample set; using the first sample set as training samples enriches the sample data and enhances the generalization ability of the unsupervised network. Thus, when audio needs to be separated, inputting it into a neural network trained in this way yields highly accurate separation results.
An exemplary application of the embodiments of this application in a practical scenario is described below, taking the separation of mixed audio as an example.
The embodiments of this application propose a novel, effective, and easy-to-implement consistency-based semi-supervised learning algorithm, Mixup-Breakdown Training (MBT), for the speech separation task. MBT first introduces a Mean Teacher (MT) network to predict the separation results of the input mixed signals, which include both labeled and unlabeled data; these intermediate outputs are then randomly interpolated and mixed to obtain the first sample set with pseudo labels; finally, the student network is updated by optimizing the prediction consistency between the teacher network (for example, the supervised network) and the student network (for example, the unsupervised network). The embodiments verify the performance of the MBT network on mixed speech data with unseen interference, and the results show that MBT's separation performance is remarkable.
In the embodiments of this application, following the standard training setting of the speech separation task, a clean speech signal s and an interference signal e are mixed at a signal-to-noise ratio (SNR) within a given range to obtain the input x = s + e (the SNR weighting of s and e is omitted here), forming a labeled data set of N_L input-output sample pairs, reconstructed from the context as $D_L = \{(x_i, y_i)\}_{i=1}^{N_L}$, where the label is y = (s, e). In some embodiments, besides the labeled data, far more data is unlabeled data $D_U = \{x_j\}_{j=N_L+1}^{N}$ that is easy to obtain and reflects real scenes but remains untapped.
FIG. 5A is a schematic flowchart of the training method for the supervised network according to an embodiment of this application, described below with reference to FIG. 5A:
In the supervised learning architecture shown in FIG. 5A, given a speech separation network f_θ (i.e., the student network 504) with learnable parameters θ, the objective function L(f_θ(x), y) usually reflects the "correctness" of the separation, defined as the difference between the predicted separation result $f_\theta(x) = (\hat{s}, \hat{e})$ and the original clean speech data (i.e., the label) y = (s, e). In FIG. 5A, the labeled clean speech signal 501 and interference signal 502 (i.e., the clean sample audio and the noisy sample audio) are mixed to obtain the labeled mixed signal 503 (i.e., the third sample set); then the mixed signal 503 is used to train the student network 504: the mixed signal 503 is input into the student network 504, the loss of each predicted separation result is determined, and the result with the smallest loss is taken as the most accurate, i.e., separation results 505 and 506 correspond to the clean speech signal 501 and interference signal 502 respectively. The student network 504's parameters are adjusted based on the minimum loss to obtain the trained student network, which serves as the teacher network 512 in FIG. 5B. For example, in one instance, scale-invariant SNR (SI-SNR) and PIT define the loss function L(f_θ(x), y) of the student network under training; reconstructed from the surrounding definitions, formula (1) takes the standard form:

$$L(f_\theta(x), y) = \min_{(u,v)}\big[-\mathrm{SI\text{-}SNR}(\hat{u}, u) - \mathrm{SI\text{-}SNR}(\hat{v}, v)\big],\qquad \mathrm{SI\text{-}SNR}(\hat{u}, u) = 10\log_{10}\frac{\|P_u(\hat{u})\|^2}{\|\hat{u}-P_u(\hat{u})\|^2}$$   (1)

where $P_a(b)=\frac{\langle a,b\rangle}{\|a\|^2}a$ in formula (1) denotes the projection of b onto a, and u and v each denote one of the clean speech signal and the interference signal, with u ≠ v.
In some embodiments, the scale-invariant SNR loss used in formula (1) can be replaced by other reconstruction-type loss functions, such as the mean square error.
FIG. 5A represents the supervised learning process. In FIG. 5A, it is assumed that the input-output pairs follow a joint distribution P(x, y) (this distribution is usually unknown), and the goal is to minimize the expectation of the loss function under this distribution (expected risk), giving the optimal supervised network parameters θ* as in formula (2):

$$\theta^* = \arg\min_\theta \int L(f_\theta(x), y)\, dP_{emp}(x, y; D_L)$$   (2)

where in formula (2) N_L denotes the number of labeled sample data, D_L denotes the labeled sample data, and dP_emp(x, y; D_L) can be expressed as formula (3):

$$dP_{emp}(x, y; D_L) = \frac{1}{N_L}\sum_{i=1}^{N_L}\delta(x = x_i,\, y = y_i)$$   (3)

where δ(·) denotes a Dirac δ function centered on (x_i, y_i); based on this, the above expectation can be estimated with the N_L labeled training samples. A complex neural network trained with the approach of formulas (1) to (3) actually "memorizes" the training data rather than "generalizing" from it; in addition, reports show that network systems trained solely in this way cannot withstand adversarial attacks, i.e., samples that deviate only slightly from the training-data distribution can induce the system to give completely different, failed predictions. A network trained this way therefore cannot generalize to test data even slightly mismatched with the supervised training set.
Based on this, an embodiment of this application proposes a training method for an audio separation network that can still separate and identify a clean speech signal within mixed speech even when the clean signal has never been heard, and that remains highly stable and consistent under various perturbations such as energy level, speaking rate, static or moving sources, and presence or absence of processing distortion.
FIG. 5B shows the unsupervised learning process, in which a perturbation strategy formed by interpolating and mixing the separated signals promotes consistency learning. In FIG. 5B, the trained student network obtained in FIG. 5A serves as the teacher network 512. First, unlabeled mixed audio data 511 is input into the teacher network 512, yielding two separation results: the predicted separated interference signal 513 and clean speech signal 514. Second, preset perturbation data is used to interpolate the interference signal 513 and the clean speech signal 514 respectively, yielding the mixed signal 515. Third, the mixed signal 515 serves as input to the untrained student network 516, the network is trained, and from its outputs the results with the smallest loss are selected, i.e., outputs 517 and 518, corresponding respectively to the teacher's predicted separated interference signal 513 and clean speech signal 514. Finally, based on the losses of outputs 517 and 518, the student network 516 is adjusted so that the loss of its separation results satisfies the convergence condition. In this way, in FIG. 5B the teacher network 512 is a trained network, and unlabeled data is used to conduct semi-supervised training of the untrained student network 516, improving the generalization ability of the finally trained student network 516. First, the Mixup and Breakdown operations in FIG. 5B are defined as formulas (4) and (5), reconstructed here in the standard MBT form:

$$\tilde{x}_\lambda = \mathrm{Mixup}_\lambda(\hat{s}, \hat{e}) = \lambda\hat{s} + (1-\lambda)\hat{e}$$   (4)
$$(\hat{s}, \hat{e}) = \mathrm{Breakdown}(x) = f_{\theta_T}(x)$$   (5)

where the interpolation weight λ is set to follow a Beta distribution, i.e., λ ~ Beta(α, α), α ∈ (0, ∞).
Then, the MBT strategy trains the student network $f_{\theta_S}$: given an input mixed signal (labeled or unlabeled), it encourages consistency between the student's prediction and the perturbed prediction of the teacher network $f_{\theta_T}$, as in formula (6), reconstructed here as

$$L_{cons} = \mathbb{E}_{x,\lambda}\, d\big(f_{\theta_S}(\tilde{x}_\lambda),\ (\lambda\hat{s},\ (1-\lambda)\hat{e})\big)$$   (6)

where the teacher network parameters θ_T are the exponential moving average of the student network parameters θ_S. Taking the exponential moving average of the student's parameters over multiple training steps yields a more accurate network, thereby accelerating the feedback loop between the student and teacher networks.
Meanwhile, the approach adopted in the embodiments of this application, mixing after perturbing the predicted separation results, constructs more pseudo-labeled input-output sample pairs; since such pairs lie closer to the separation boundary, they are more useful for consistency-based regularization training.
Under the semi-supervised learning setting, given a total data set containing labeled data D_L and unlabeled data D_U, the audio separation network optimized by MBT training accounts for both accuracy and consistency, as in formula (7), reconstructed here as

$$\theta^* = \arg\min_\theta\Big[\sum_{j=1}^{N_L} L\big(f_{\theta_S}(x_j), y_j\big) + r(t)\sum_{j=1}^{N} L_{cons}(x_j)\Big]$$   (7)

where r(t) denotes a ramp function, so that the weight of the consistency term in the overall optimization goal gradually increases as training progresses.
Formulas (4) to (7) realize the training process of the audio separation network, i.e., through (4) to (7) a trained audio separation network is obtained under semi-supervised conditions; as formula (7) shows, in the embodiments of this application the network used to separate the audio guarantees both the consistency and the accuracy of the separation results.
Automatic online data augmentation can improve the generalization of supervised learning networks. For example, in image classification, samples are augmented by shifting, zooming in and out, rotating, flipping, and so on; similarly, in speech recognition, training data is augmented by changing the SNR, rhythm, vocal cord length, or speed. However, such augmentation builds only on labeled data. The MBT method of the embodiments of this application implements automatic online augmentation very easily, with negligible extra computation. As formula (7) shows, MBT can mine labeled data (i.e., j ∈ {1, ..., N_L}) as well as unlabeled data (i.e., j ∈ {N_L+1, ..., N}), generating pseudo-labeled input-output sample pairs and expanding the empirical distribution. Although the example given in the embodiments, as in formulas (4) and (5), achieves augmentation akin to varying SNR via amplitude interpolation, it is worth noting that the MBT strategy is not limited to this and extends intuitively to effects similar to other types of automatic online augmentation, for example speaking rate, moving or static orientation (multi-microphone arrays, i.e., multi-channel scenes), and algorithmic distortion. In a specific example, the network structure adopts the Conv-TasNet architecture, and more advanced semi-supervised mixing methods are also implemented, with MT and ICT as reference systems. The decay coefficient constraining the conservativeness of the mean-teacher network is set to 0.999 for all the above semi-supervised methods; the ramp function is set to r(t) = exp(t/T_max − 1) for t ∈ {1, ..., T_max}, where T_max = 100 denotes the maximum number of training iterations. In addition, the α in the interpolation weight λ ~ Beta(α, α) is set to 1, i.e., λ is uniformly distributed over [0, 1].
In other embodiments, the network structure and specific parameters may also be set differently. The embodiments of this application do not specifically limit the network type or topology of the deep neural network, which can be replaced with various other effective network structures, for example long short-term memory structures, networks combining CNNs with other structures, or other structures such as time-delay networks and gated convolutional neural networks. The topology of the network may be expanded or simplified according to practical limits on the network's memory footprint and requirements on detection accuracy.
The experiments extend the standard speech training and test set (WSJ0) and the standard data set for speech separation (WSJ0-2mix). Replacing the interfering speech signals in WSJ0-2mix with other types of interference yields the following mixed-signal data sets:
WSJ0-Libri: speech from another independent speech data set serves as the interference.
WSJ0-music: music clips from a 43-hour music data set serve as the interference, covering rich classical and popular genres.
WSJ0-noise: noise clips from a 4-hour noise data set serve as the interference, covering rich daily-life scenes such as offices, restaurants, supermarkets, and construction sites. Each of these data sets is split into training, development, and test sets in the same proportions as WSJ0-2mix; the training sets serve as unlabeled training sets in the following experiments.
First, corresponding to the automatic online augmentation effect described above, the results of the embodiments of this application on the labeled WSJ0-2mix set are shown in Table 1: the deep attractor network, at a network size of 9.1 M, achieves a scale-invariant SNR improvement (SI-SNRi) of 10.5; the anchored deep attractor network, at 9.1 M, achieves SI-SNRi 10.4 and SDRi 10.8; the bidirectional long short-term memory time-domain audio separation network, at 23.6 M, achieves SI-SNRi 13.2 and SDRi 13.6; the convolutional time-domain audio separation network, at 8.8 M, achieves SI-SNRi 15.3 and SDRi 15.6. The Mixup-Breakdown training network (MBT) of this application, trained on WSJ0-2mix plus online data augmentation, at 8.8 M reaches SI-SNRi 15.5 and SDRi 15.9; trained on WSJ0-2mix plus unlabeled WSJ0-multi, at 8.8 M it reaches SI-SNRi 15.5. MBT thus achieves the best SI-SNRi performance with the smallest network size (8.8 M), and both its SI-SNRi and SDRi are the highest.
Table 1 Performance comparison on the WSJ0-2mix data set [table image not recoverable; the figures are quoted in the text above]
Next, to verify MBT's generalization, Tables 2, 3, and 4 compare the performance of different systems under interference types unseen during supervised learning. In all tested environments MBT consistently surpasses the reference systems; in particular, under music interference MBT achieves a 13.77% relative SI-SNRi improvement over the ICT method.
In addition, the embodiments of this application also test the MBT semi-supervised learning method under unseen interference types across multiple domains. For this purpose, the unlabeled data sets WSJ0-Libri, WSJ0-noise, and WSJ0-music are combined into one data set (WSJ0-multi); WSJ0-multi serves as the multi-domain unlabeled set for MBT's semi-supervised training, and the test sets of each domain are then evaluated, with results given in the last rows of Tables 1, 2, 3, and 4 respectively.
As Table 2 shows, whichever data set serves for training, Mixup-Breakdown training holds roughly steady when the test speech mismatches the speech type in the training data: the SI-SNRi is 13.75 on training set WSJ0-2mix, 13.95 on WSJ0-2mix plus unlabeled WSJ0-Libri, and 13.88 on WSJ0-2mix plus unlabeled WSJ0-multi.
Table 2 Separation performance of different training methods under speech mismatch [table image not recoverable; the figures are quoted in the text above]
As Table 3 shows, whichever data set serves for training, performance again holds roughly steady when the background noise mismatches the noise type in the training set: the SI-SNRi is 13.21 on WSJ0-2mix plus unlabeled WSJ0-noise and 13.52 on WSJ0-2mix plus unlabeled WSJ0-multi.
As Table 4 shows, whichever data set serves for training, performance likewise holds roughly steady when the music mismatches the music type in the training set: the SI-SNRi is 15.95 on WSJ0-2mix plus unlabeled WSJ0-music and 15.67 on WSJ0-2mix plus unlabeled WSJ0-multi. Tables 2 to 4 thus all show that MBT's performance can be roughly maintained; notably, in Tables 1 and 3, MBT's SI-SNRi even improves.
Table 3 Separation performance of different training methods under background-noise mismatch [table image not recoverable; the figures are quoted in the text above]
Table 4 Separation performance of different training methods under music mismatch [table image not recoverable; the figures are quoted in the text above]
In the related art, especially among semi-supervised learning methods, ICT is an important extension and improvement of the mean teacher, mainly reflected in the computation of the consistency-based loss function L_ICT, as in formula (8), reconstructed here as

$$L_{ICT} = d\big(f_{\theta_S}(\lambda x_i + (1-\lambda)x_j),\ \lambda f_{\theta_T}(x_i) + (1-\lambda)f_{\theta_T}(x_j)\big)$$   (8)

where (x_i, y_i) ~ D_L and (x_j, y_k) ~ D_U, D_L being the labeled samples and D_U the unlabeled samples.
In some embodiments, the samples used for "Mixup" in ICT are drawn directly and randomly from the unlabeled data. The embodiments of this application apply ICT to the speech separation task and compare it with MBT as an ablation experiment verifying the significance of the "Breakdown" process; a sketch of this loss follows.
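For comparison with MBT, here is a sketch of the ICT consistency loss of formula (8); `student`, `teacher`, and the distance `dist` are assumed callables, and the inputs are drawn directly from the data rather than from teacher estimates, which is exactly the difference the "Breakdown" ablation probes.

```python
def ict_consistency_loss(student, teacher, x_i, x_j, lam, dist):
    """Formula (8): the student's prediction on a mixed input should match
    the mixture of the teacher's predictions on the unmixed inputs."""
    mixed_input = lam * x_i + (1.0 - lam) * x_j
    target = tuple(lam * a + (1.0 - lam) * b
                   for a, b in zip(teacher(x_i), teacher(x_j)))
    return dist(student(mixed_input), target)
```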
The above experimental comparisons show the performance advantage brought by the MBT of the embodiments of this application. In the application experiments, MBT is tested in scenarios with successively increasing train-test mismatch, including unseen interfering speech, noise, and music, to reflect the method's generalization. Comparing MBT's generalization with state-of-the-art supervised and semi-supervised methods shows that MBT achieves up to a 13.77% relative SI-SNRi improvement over ICT and significantly, consistently surpasses the several compared methods; moreover, the MBT proposed in the embodiments requires very little extra computation beyond the standard training scheme.
The following describes an exemplary structure, implemented as software modules, of the training server 455 for the audio separation network provided by the embodiments of this application. In some embodiments, as shown in FIG. 2, the software modules of the training server 455 stored in the memory 450 may include: a first acquisition module 4551 configured to acquire a first separated sample set, the first separated sample set including at least two types of audio with pseudo labels; a first interpolation module 4552 configured to interpolate the first separated sample set with perturbation data to obtain a first sample set; a first separation module 4553 configured to separate the first sample set with an unsupervised network to obtain a second separated sample set; a first determination module 4554 configured to determine the loss of the second separated samples in the second separated sample set; and a first adjustment module 4555 configured to adjust the network parameters of the unsupervised network with the loss of the second separated samples, so that the loss of the separation results output by the adjusted unsupervised network satisfies the convergence condition.
In some embodiments, the first acquisition module 4551 is further configured to: acquire sample audio including at least unlabeled audio; and separate the sample audio by audio-data type with a trained supervised network, obtaining separated samples of each type so as to obtain the first separated sample set, wherein the network parameters of the supervised network are updated based on the network parameters of the unsupervised network.
In some embodiments, the first interpolation module 4552 is further configured to: multiply each first separated sample, in one-to-one correspondence, by different perturbation data to obtain an adjustment data set; and sum the adjustment data in the adjustment data set to obtain the first sample set.
In some embodiments, the first determination module 4554 is further configured to determine the loss between each second separated sample and the ground-truth data of the first separated sample set, obtaining the loss of each second separated sample so as to obtain a loss set; the first adjustment module 4555 is further configured to determine a minimum loss from the loss set and, based on the minimum loss, update the network parameters of the unsupervised network to obtain updated network parameters.
In some embodiments, the first adjustment module 4555 is further configured to feed the updated network parameters back to the supervised network, so as to adjust the network parameters of the supervised network and obtain an updated supervised network.
In some embodiments, the first adjustment module 4555 is further configured to: determine a moving average of the updated network parameters; and feed the moving average back to the supervised network to adjust the network parameters of the supervised network, so as to obtain the updated supervised network.
In some embodiments, the first adjustment module 4555 is further configured to: use the updated supervised network to separate the sample audio again to obtain a third separated sample set; interpolate the third separated sample set with the perturbation data to obtain a second sample set, and input the second sample set into the updated unsupervised network; use the updated unsupervised network to perform prediction-separation again on the second sample set to obtain a fourth separated sample set; determine the loss of the fourth separated samples in the fourth separated sample set; and adjust the network parameters of the updated unsupervised network and of the updated supervised network with that loss, so that the loss of the separation results output by the adjusted, updated unsupervised network satisfies the convergence condition.
In some embodiments, the first separation module 4553 is further configured to: acquire labeled clean sample audio and noisy sample audio; mix the clean sample audio and the noisy sample audio to obtain a third sample set; separate the third sample set with the supervised network to be trained to obtain a fifth separated sample set; determine the loss of the fifth separated samples in the fifth separated sample set; and adjust the network parameters of the supervised network to be trained with the loss of the fifth separated samples, so that the loss of the separation results output by the adjusted network satisfies the convergence condition, obtaining the trained supervised network.
The following describes an exemplary structure, implemented as software modules, of the audio separation terminal 456 provided by the embodiments of this application. In some embodiments, as shown in FIG. 2B, the software modules of the terminal 456 stored in the memory 450 may include: a second acquisition module 4561 configured to acquire audio to be separated; a first input module 4562 configured to separate the audio to be separated with a trained neural network to obtain a separation result, the neural network being trained with the above training method for the audio separation network; and a first output module 4563 configured to output the separation result. An embodiment of this application provides a computer storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the audio separation method provided by the embodiments of this application, or to execute the training method for the audio separation network provided by the embodiments of this application. In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; it may also be any of various terminals including one or any combination of the foregoing memories. In some embodiments, the executable instructions may be in the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as an independent program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files storing one or more modules, subroutines, or code sections). As an example, they may be deployed to execute on one vehicle-mounted computing terminal, on multiple computing terminals located at one site, or on multiple computing terminals distributed across multiple sites and interconnected by a communication network. In summary, in the training of the network used for audio separation in the embodiments of this application, first the first separated sample set of two types of pseudo-labeled audio is interpolated to obtain the mixed first sample set; then the unsupervised network is trained on the first sample set, and its network parameters are adjusted based on the loss of the second separated samples so that the loss of the separation results output by the adjusted unsupervised network satisfies the convergence condition; in this way, two types of pseudo-labeled audio and a first sample set interpolated with perturbation data serve as training samples, enriching the sample data of the unsupervised network and thereby enhancing its generalization ability. The above are merely embodiments of this application and are not intended to limit its protection scope; any modification, equivalent replacement, or improvement made within the spirit and scope of this application falls within its protection scope.

Claims (12)

  1. A training method for an audio separation network, the method being applied to a training device for the audio separation network, the method comprising:
    acquiring a first separated sample set, the first separated sample set comprising at least two types of audio with pseudo labels;
    interpolating the first separated sample set with perturbation data to obtain a first sample set;
    separating the first sample set with an unsupervised network to obtain a second separated sample set;
    determining a loss of a second separated sample in the second separated sample set;
    adjusting network parameters of the unsupervised network with the loss of the second separated sample, so that a loss of a separation result output by the adjusted unsupervised network satisfies a convergence condition.
  2. The method according to claim 1, wherein the acquiring a first separated sample set comprises:
    acquiring sample audio comprising at least unlabeled audio;
    separating the sample audio by audio-data type with a trained supervised network to obtain separated samples of each type, so as to obtain the first separated sample set, wherein network parameters of the supervised network are updated based on the network parameters of the unsupervised network.
  3. The method according to claim 1, wherein the interpolating the first separated sample set with perturbation data to obtain a first sample set comprises:
    multiplying each first separated sample, in one-to-one correspondence, by different perturbation data to obtain an adjustment data set;
    summing the adjustment data in the adjustment data set to obtain the first sample set.
  4. The method according to claim 2, wherein the determining a loss of a second separated sample in the second separated sample set comprises: determining a loss between each second separated sample and ground-truth data of the first separated sample set, obtaining the loss of each second separated sample so as to obtain a loss set;
    correspondingly, the adjusting network parameters of the unsupervised network with the loss of the second separated sample comprises: determining a minimum loss from the loss set; and updating the network parameters of the unsupervised network based on the minimum loss to obtain updated network parameters.
  5. The method according to claim 4, wherein after the updating the network parameters of the unsupervised network based on the minimum loss to obtain updated network parameters, the method further comprises:
    feeding the updated network parameters back to the supervised network to adjust the network parameters of the supervised network, obtaining an updated supervised network.
  6. The method according to claim 5, wherein the feeding the updated network parameters back to the supervised network to adjust the network parameters of the supervised network, obtaining an updated supervised network, comprises:
    determining a moving average of the updated network parameters;
    feeding the moving average back to the supervised network to adjust the network parameters of the supervised network, so as to obtain the updated supervised network.
  7. The method according to claim 5 or 6, wherein after the feeding the updated network parameters back to the supervised network to adjust the network parameters of the supervised network, obtaining an updated supervised network, the method further comprises:
    separating the sample audio again with the updated supervised network to obtain a third separated sample set;
    interpolating the third separated sample set with the perturbation data to obtain a second sample set, and inputting the second sample set into an updated unsupervised network;
    performing prediction and separation again on the second sample set with the updated unsupervised network to obtain a fourth separated sample set;
    determining a loss of a fourth separated sample in the fourth separated sample set;
    adjusting the network parameters of the updated unsupervised network and of the updated supervised network with the loss of the fourth separated sample, so that a loss of a separation result output by the adjusted, updated unsupervised network satisfies the convergence condition.
  8. The method according to claim 2, wherein before the separating the sample audio by audio-data type with a trained supervised network to obtain separated samples of each type, so as to obtain the first separated sample set, the method further comprises:
    acquiring labeled clean sample audio and noisy sample audio;
    mixing the clean sample audio and the noisy sample audio to obtain a third sample set;
    separating the third sample set with a supervised network to be trained to obtain a fifth separated sample set;
    determining a loss of a fifth separated sample in the fifth separated sample set;
    adjusting network parameters of the supervised network to be trained with the loss of the fifth separated sample, so that a loss of a separation result output by the adjusted supervised network to be trained satisfies a convergence condition, obtaining the trained supervised network.
  9. An audio separation method, the method being applied to an audio separation device, the method comprising:
    acquiring audio to be separated;
    separating the audio to be separated with a trained neural network to obtain a separation result, wherein the neural network is trained by the training method for an audio separation network according to any one of claims 1 to 8;
    outputting the separation result.
  10. A training apparatus for an audio separation network, wherein the apparatus comprises:
    a first acquisition module configured to acquire a first separated sample set, the first separated sample set comprising at least two types of audio with pseudo labels;
    a first interpolation module configured to interpolate the first separated sample set with perturbation data to obtain a first sample set;
    a first separation module configured to separate the first sample set with an unsupervised network to obtain a second separated sample set;
    a first determination module configured to determine a loss of a second separated sample in the second separated sample set;
    a first adjustment module configured to adjust network parameters of the unsupervised network with the loss of the second separated sample, so that a loss of a separation result output by the adjusted unsupervised network satisfies a convergence condition.
  11. An audio separation apparatus, wherein the apparatus comprises:
    a second acquisition module configured to acquire audio to be separated;
    a first input module configured to separate the audio to be separated with a trained neural network to obtain a separation result, wherein the neural network is trained by the training method for an audio separation network according to any one of claims 1 to 8;
    a first output module configured to output the separation result.
  12. A computer storage medium storing executable instructions configured to, when executed by a processor, implement the method according to any one of claims 1 to 8, or configured to, when executed by a processor, implement the method according to claim 9.
PCT/CN2020/126492 2020-02-11 2020-11-04 Training method for audio separation network, audio separation method, apparatus, and medium WO2021159775A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20918512.3A EP4012706A4 (en) 2020-02-11 2020-11-04 METHOD AND DEVICE FOR LEARNING AUDIO SEPARATION NETWORK, METHOD AND DEVICE FOR AUDIO SEPARATION AND MEDIA
US17/682,399 US20220180882A1 (en) 2020-02-11 2022-02-28 Training method and device for audio separation network, audio separation method and device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010086752.X 2020-02-11
CN202010086752.XA CN111341341B (zh) Training method for audio separation network, audio separation method, apparatus, and medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/682,399 Continuation US20220180882A1 (en) 2020-02-11 2022-02-28 Training method and device for audio separation network, audio separation method and device, and medium

Publications (1)

Publication Number Publication Date
WO2021159775A1 true WO2021159775A1 (zh) 2021-08-19

Family

ID=71183362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/126492 WO2021159775A1 (zh) Training method for audio separation network, audio separation method, apparatus, and medium

Country Status (4)

Country Link
US (1) US20220180882A1 (zh)
EP (1) EP4012706A4 (zh)
CN (1) CN111341341B (zh)
WO (1) WO2021159775A1 (zh)


Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11651702B2 (en) * 2018-08-31 2023-05-16 Pearson Education, Inc. Systems and methods for prediction of student outcomes and proactive intervention
JP7564117B2 (ja) * 2019-03-10 2024-10-08 カードーム テクノロジー リミテッド Speech enhancement using clustering of cues
CN111179962B (zh) 2020-01-02 2022-09-27 腾讯科技(深圳)有限公司 Training method for speech separation model, speech separation method, and apparatus
CN111341341B (zh) 2020-02-11 2021-08-17 腾讯科技(深圳)有限公司 Training method for audio separation network, audio separation method, apparatus, and medium
US11676598B2 (en) 2020-05-08 2023-06-13 Nuance Communications, Inc. System and method for data augmentation for multi-microphone signal processing
US20220068287A1 (en) * 2020-08-31 2022-03-03 Avaya Management Lp Systems and methods for moderating noise levels in a communication session
CN112037809A (zh) * 2020-09-09 2020-12-04 南京大学 Residual echo suppression method based on a deep neural network with multi-feature-stream structure
CN112232426B (zh) * 2020-10-21 2024-04-02 深圳赛安特技术服务有限公司 Training method, apparatus, and device for target detection model, and readable storage medium
CN112309375B (zh) * 2020-10-28 2024-02-23 平安科技(深圳)有限公司 Training and testing method, apparatus, device, and storage medium for speech recognition model
CN112257855B (zh) * 2020-11-26 2022-08-16 Oppo(重庆)智能科技有限公司 Neural network training method and apparatus, electronic device, and storage medium
US20220188643A1 (en) * 2020-12-11 2022-06-16 International Business Machines Corporation Mixup data augmentation for knowledge distillation framework
CN112634842B (zh) * 2020-12-14 2024-04-05 湖南工程学院 New song generation method based on dual-mode network walk fusion
CN112509563B (zh) * 2020-12-17 2024-05-17 中国科学技术大学 Model training method and apparatus, and electronic device
US11839815B2 (en) 2020-12-23 2023-12-12 Advanced Micro Devices, Inc. Adaptive audio mixing
US11783847B2 (en) * 2020-12-29 2023-10-10 Lawrence Livermore National Security, Llc Systems and methods for unsupervised audio source separation using generative priors
CN112786003A (zh) * 2020-12-29 2021-05-11 平安科技(深圳)有限公司 Speech synthesis model training method, apparatus, terminal device, and storage medium
CN113393858B (zh) * 2021-05-27 2022-12-02 北京声智科技有限公司 Speech separation method and system, electronic device, and readable storage medium
CN113380268A (zh) * 2021-08-12 2021-09-10 北京世纪好未来教育科技有限公司 Model training method and apparatus, and speech signal processing method and apparatus
CN114936974A (zh) * 2022-05-12 2022-08-23 中山大学中山眼科中心 Semi-supervised OCT image denoising method and apparatus based on attention mechanism
CN115132183B (zh) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Training method, apparatus, device, medium, and program product for audio recognition model
CN115240702B (zh) * 2022-07-15 2024-09-24 西安电子科技大学 Speech separation method based on voiceprint features
CN116034425A (zh) * 2022-11-16 2023-04-28 广州酷狗计算机科技有限公司 Training method for vocal note recognition model, vocal note recognition method, and device
CN117235435B (zh) * 2023-11-15 2024-02-20 世优(北京)科技有限公司 Method and apparatus for determining an audio signal loss function

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10147442B1 (en) * 2015-09-29 2018-12-04 Amazon Technologies, Inc. Robust neural network acoustic model with side task prediction of reference signals
CN109544190A (zh) * 2018-11-28 2019-03-29 北京芯盾时代科技有限公司 Fraud recognition model training method, fraud recognition method, and apparatus
CN110070882A (zh) * 2019-04-12 2019-07-30 腾讯科技(深圳)有限公司 Speech separation method, speech recognition method, and electronic device
CN110120227A (zh) * 2019-04-26 2019-08-13 天津大学 Speech separation method using a deep stacked residual network
CN110600018A (zh) * 2019-09-05 2019-12-20 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, and neural network training method and apparatus
CN110634502A (zh) * 2019-09-06 2019-12-31 南京邮电大学 Single-channel speech separation algorithm based on deep neural networks
CN111341341A (zh) * 2020-02-11 2020-06-26 腾讯科技(深圳)有限公司 Training method for audio separation network, audio separation method, apparatus, and medium

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009151578A2 (en) * 2008-06-09 2009-12-17 The Board Of Trustees Of The University Of Illinois Method and apparatus for blind signal recovery in noisy, reverberant environments
US9536540B2 (en) * 2013-07-19 2017-01-03 Knowles Electronics, Llc Speech signal separation and synthesis based on auditory scene analysis and speech modeling
US9721202B2 (en) * 2014-02-21 2017-08-01 Adobe Systems Incorporated Non-negative matrix factorization regularized by recurrent neural networks for audio processing
US9553681B2 (en) * 2015-02-17 2017-01-24 Adobe Systems Incorporated Source separation using nonnegative matrix factorization with an automatically determined number of bases
US10334390B2 (en) * 2015-05-06 2019-06-25 Idan BAKISH Method and system for acoustic source enhancement using acoustic sensor array
US10347271B2 (en) * 2015-12-04 2019-07-09 Synaptics Incorporated Semi-supervised system for multichannel source enhancement through configurable unsupervised adaptive transformations and supervised deep neural network
US10395644B2 (en) * 2016-02-25 2019-08-27 Panasonic Corporation Speech recognition method, speech recognition apparatus, and non-transitory computer-readable recording medium storing a program
CN105845149B (zh) * 2016-03-18 2019-07-09 云知声(上海)智能科技有限公司 Method and system for acquiring the dominant pitch in a sound signal
CN107203777A (zh) * 2017-04-19 2017-09-26 北京协同创新研究院 Audio scene classification method and apparatus
CN107680611B (zh) * 2017-09-13 2020-06-16 电子科技大学 Single-channel sound separation method based on convolutional neural networks
CN108109619B (zh) * 2017-11-15 2021-07-06 中国科学院自动化研究所 Auditory selection method and apparatus based on memory and attention models
US11462209B2 (en) * 2018-05-18 2022-10-04 Baidu Usa Llc Spectrogram to waveform synthesis using convolutional networks
CN108806668A (zh) * 2018-06-08 2018-11-13 国家计算机网络与信息安全管理中心 Multi-dimensional audio/video annotation and model optimization method
CN108847238B (zh) * 2018-08-06 2022-09-16 东北大学 Speech recognition method for a service robot
KR102018286B1 (ko) * 2018-10-31 2019-10-21 에스케이 텔레콤주식회사 Method and apparatus for removing voice components from a sound source
CN109978034B (zh) * 2019-03-18 2020-12-22 华南理工大学 Acoustic scene recognition method based on data augmentation
CN110085251B (zh) * 2019-04-26 2021-06-25 腾讯音乐娱乐科技(深圳)有限公司 Vocal extraction method, vocal extraction apparatus, and related products
CN110246487B (zh) * 2019-06-13 2021-06-22 思必驰科技股份有限公司 Optimization method and system for a single-channel speech recognition model
CN110289007A (zh) * 2019-06-25 2019-09-27 西安交通大学 Improved local mean decomposition method for speech pitch frequency extraction
CN110503976B (zh) * 2019-08-15 2021-11-23 广州方硅信息技术有限公司 Audio separation method and apparatus, electronic device, and storage medium
US11537901B2 (en) * 2019-12-31 2022-12-27 Robert Bosch Gmbh System and method for unsupervised domain adaptation with mixup training


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023073595A1 (en) * 2021-10-27 2023-05-04 WingNut Films Productions Limited Audio source separation systems and methods
US11763826B2 (en) 2021-10-27 2023-09-19 WingNut Films Productions Limited Audio source separation processing pipeline systems and methods
WO2023106498A1 (ko) * 2021-12-06 2023-06-15 주식회사 스파이스웨어 Method and apparatus for enhancing personal-information detection using multiple filtering
JP7569489B2 (ja) 2021-12-06 2024-10-18 アンラブ クラウドメイト インコーポレイテッド Method and apparatus for enhancing personal-information detection using multiple filtering
CN117782403A (zh) * 2024-02-27 2024-03-29 北京谛声科技有限责任公司 Loose-bolt localization method, apparatus, and medium based on a separation network
CN117782403B (zh) * 2024-02-27 2024-05-10 北京谛声科技有限责任公司 Loose-bolt localization method, apparatus, and medium based on a separation network

Also Published As

Publication number Publication date
CN111341341B (zh) 2021-08-17
CN111341341A (zh) 2020-06-26
EP4012706A4 (en) 2022-11-23
US20220180882A1 (en) 2022-06-09
EP4012706A1 (en) 2022-06-15

Similar Documents

Publication Publication Date Title
WO2021159775A1 (zh) Training method for audio separation network, audio separation method, apparatus, and medium
CN110929098B Video data processing method and apparatus, electronic device, and storage medium
WO2023050650A1 Animation video generation method, apparatus, device, and storage medium
CN114398961A Visual question answering method and model based on multi-modal deep feature fusion
WO2021218028A1 Artificial intelligence-based interview content refinement method, apparatus, device, and medium
WO2022188773A1 Text classification method, apparatus, device, computer-readable storage medium, and computer program product
WO2023035923A1 Video review method and apparatus, and electronic device
CN111666416A Method and apparatus for generating a semantic matching model
WO2021135449A1 Data classification method, apparatus, device, and medium based on deep reinforcement learning
US12002379B2 (en) Generating a virtual reality learning environment
CN110234018A Multimedia content description generation method, training method, apparatus, device, and medium
WO2023040516A1 Event integration method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN110516749A Model training method, video processing method, apparatus, medium, and computing device
CN111382563B Method and apparatus for determining text relevance
KR20220097239A Server for analyzing synopsis text and predicting viewership ratings based on artificial intelligence
CN112861474A Information annotation method, apparatus, device, and computer-readable storage medium
JP7225380B2 Guide method, apparatus, device, program, and computer storage medium for a voice packet recording function
CN111265851B Data processing method and apparatus, electronic device, and storage medium
CN113590772A Anomaly score detection method, apparatus, device, and computer-readable storage medium
US11810476B2 (en) Updating a virtual reality environment based on portrayal evaluation
Szynkiewicz et al. Utilisation of embodied agents in the design of smart human–computer interfaces—A Case Study in Cyberspace Event Visualisation Control
Pan et al. A multimodal framework for automated teaching quality assessment of one-to-many online instruction videos
Zhang The Cognitive Transformation of Japanese Language Education by Artificial Intelligence Technology in the Wireless Network Environment
US20230196935A1 (en) Systems and methods for managing experiential course content
Yan et al. Design and Implementation of Interactive Platform for Operation and Maintenance of Multimedia Information System Based on Artificial Intelligence and Big Data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918512

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020918512

Country of ref document: EP

Effective date: 20220310

NENP Non-entry into the national phase

Ref country code: DE