WO2023093295A1 - Artificial intelligence-based audio processing method and apparatus, electronic device, computer program product, and computer-readable storage medium - Google Patents

Artificial intelligence-based audio processing method and apparatus, electronic device, computer program product, and computer-readable storage medium

Info

Publication number
WO2023093295A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
audio
audio frame
features
feature
Prior art date
Application number
PCT/CN2022/122553
Other languages
English (en)
French (fr)
Inventor
林炳怀
王丽园
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to US 18/203,469 (US20230306959A1)
Publication of WO2023093295A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Definitions

  • The present application relates to artificial intelligence technology, and in particular, to an artificial intelligence-based audio processing method and apparatus, an electronic device, a computer program product, and a computer-readable storage medium.
  • AI: Artificial Intelligence.
  • Voice interaction can be applied to various voice scoring systems, for example, language test systems and oral test systems in language education applications.
  • In order to use a voice interaction function properly, phonemes need to be aligned with the text, and the alignment accuracy should be as high as possible; however, the related art cannot accurately align phonemes with text.
  • Embodiments of the present application provide an artificial intelligence-based audio processing method, device, electronic device, computer program product, and computer-readable storage medium, which can improve the accuracy of phoneme alignment.
  • An embodiment of the present application provides an artificial intelligence-based audio processing method, the method being executed by an electronic device and including:
  • obtaining at least one phoneme of a given text, and determining a phoneme feature of each of the phonemes;
  • obtaining audio data corresponding to the given text, and determining an audio feature of each audio frame included in the audio data;
  • performing the following processing for each of the audio frames: mapping the audio feature of the audio frame to obtain a weight of the phoneme feature of each of the phonemes, and fusing, based on the weight of the phoneme feature of each of the phonemes, the audio feature of the audio frame and the phoneme feature of at least one of the phonemes to obtain a fusion feature of the audio frame; and
  • determining, based on the fusion feature of each audio frame, the phoneme corresponding to each audio frame, and determining, based on the phoneme corresponding to each audio frame, the start and end moments of each phoneme.
  • An embodiment of the present application provides an artificial intelligence-based audio processing apparatus, including:
  • a phoneme module, configured to obtain at least one phoneme of a given text and determine a phoneme feature of each of the phonemes;
  • an audio module, configured to obtain audio data corresponding to the given text and determine an audio feature of each audio frame included in the audio data;
  • a fusion module, configured to perform the following processing for each of the audio frames: mapping the audio feature of the audio frame to obtain a weight of the phoneme feature of each of the phonemes, and fusing, based on the weight of the phoneme feature of each of the phonemes, the audio feature of the audio frame and the phoneme feature of at least one of the phonemes to obtain a fusion feature of the audio frame; and
  • an alignment module, configured to determine, based on the fusion feature of each audio frame, the phoneme corresponding to each audio frame, and determine, based on the phoneme corresponding to each audio frame, the start and end moments of each phoneme.
  • An embodiment of the present application provides an electronic device, including:
  • a memory, configured to store computer-executable instructions; and a processor, configured to implement the artificial intelligence-based audio processing method provided in the embodiments of the present application when executing the computer-executable instructions stored in the memory.
  • An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for implementing the artificial intelligence-based audio processing method provided in the embodiment of the present application when executed by a processor.
  • An embodiment of the present application provides a computer program product, including a computer program or computer-executable instructions. When the computer program or the computer-executable instructions are executed by a processor, the artificial intelligence-based audio processing method provided in the embodiments of the present application is implemented.
  • In the embodiments of the present application, the weight of each phoneme in the text sequence is determined based on the audio features, and then, based on the weight of each phoneme, the phoneme features and the audio features are fused to obtain a fusion feature. The fusion feature can therefore effectively represent the relationship between the audio frame and the phoneme, and classifying each audio frame in the audio based on the fusion feature can effectively improve the classification accuracy, thereby improving the phoneme alignment accuracy.
  • FIG. 1 is a schematic structural diagram of an artificial intelligence-based audio processing system provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
  • FIGS. 3A-3C are schematic flowcharts of an artificial intelligence-based audio processing method provided by an embodiment of the present application;
  • FIGS. 4A-4D are schematic interface diagrams of an artificial intelligence-based audio processing method provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of an artificial intelligence-based audio processing method provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a phoneme alignment model of an artificial intelligence-based audio processing method provided by an embodiment of the present application;
  • FIG. 7 is a schematic data flow diagram of an artificial intelligence-based audio processing method provided by an embodiment of the present application;
  • FIGS. 8A-8C are alignment time matrices of an artificial intelligence-based audio processing method provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of an audio encoder provided by an embodiment of the present application.
  • The terms "first", "second", and "third" are only used to distinguish similar objects and do not represent a specific ordering of objects. Understandably, where permitted, the specific order or sequence of "first", "second", and "third" may be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
  • Automatic Speech Recognition (ASR): a speech recognition technology whose goal is to convert the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
  • HMM: Hidden Markov Model.
  • Maximum Likelihood Estimation (MLE): a method used to estimate the parameters of a probability model.
  • Discriminative model: in the field of machine learning, a discriminative model is a method of modeling the relationship between unknown data y and known data x. It is a method based on probability theory: given a known input variable x, the discriminative model predicts y by constructing a conditional probability distribution P(y|x).
  • FC: Full Connection, i.e., a fully connected layer.
  • Pearson correlation coefficient: in statistics, the Pearson correlation coefficient is used to measure the linear correlation between two variables X and Y; its value is between -1 and 1.
  • Support vector machine: often referred to as a support vector network in machine learning, it is a supervised learning model for analyzing data in classification and regression analysis.
  • Phoneme: the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the articulatory actions within a syllable, and one articulatory action is regarded as one phoneme. Phonemes are divided into two categories: vowels and consonants. In this application, the phonemes also include a silent phoneme; for example, if a certain audio frame is silent, that audio frame corresponds to the silent phoneme.
  • Phoneme alignment: aligning phonemes with audio, that is, determining the start and end time of each phoneme of a given text within the audio.
  • Text-independent methods usually classify phoneme boundaries, that is, judge whether the time of a certain audio frame in the audio data is a phoneme boundary, for example, using the Viterbi algorithm to distinguish voiced segments from unvoiced segments, or using a recurrent neural network to classify phoneme boundaries. Text-dependent approaches usually use an HMM based on maximum likelihood to obtain the most likely sequence, use a discriminative model, or design an alignment function and use a support vector machine for phoneme alignment.
  • The HMM-based alignment method in the related art mainly treats the phoneme boundary judgment as a hidden state and optimizes it by maximum likelihood, without directly and explicitly optimizing the phoneme alignment.
  • Other phoneme alignment methods in the related art require a manually designed alignment function and manual feature engineering.
  • In view of this, the embodiments of the present application propose an artificial intelligence-based audio processing method that can automatically learn the mapping relationship between phoneme sequences and audio data based on a neural network including an attention mechanism, without relying on an artificially designed alignment function. The loss function is optimized explicitly, multiple tasks are combined for training, and constraint learning is performed through a loss function in the attention processing stage, effectively improving the accuracy of phoneme alignment.
  • Embodiments of the present application provide an artificial intelligence-based audio processing method, apparatus, electronic device, computer program product, and computer-readable storage medium, which compute fusion features of the audio features and the text sequence through an attention mechanism, so that the phoneme of each frame in the audio is classified based on the fusion features, effectively improving the classification accuracy and thereby the accuracy of phoneme alignment.
  • FIG. 1 is a schematic structural diagram of an audio processing system based on artificial intelligence provided by an embodiment of the present application.
  • the audio processing system can be used in an oral test scene.
  • The terminal 400 is connected to the server 200 through the network 300, and the network 300 may be a wide area network, a local area network, or a combination of the two.
  • the functions of the audio processing system are implemented based on various modules in the server 200.
  • The terminal 400 receives the user's audio data for the given text and sends the audio data and the given text to the server 200. The server 200 determines the phoneme feature of each phoneme in the given text and the audio feature of each audio frame in the audio data, and performs the following processing for each audio frame: mapping the audio feature of the audio frame to obtain the weight of the phoneme feature of each phoneme, and fusing, based on the weight of the phoneme feature of each phoneme, the audio feature of the audio frame and the phoneme feature of at least one phoneme to obtain the fusion feature of the audio frame. The server 200 then determines the phoneme corresponding to each audio frame, determines, based on the phoneme corresponding to each audio frame, the start and end times of each phoneme, and sends the start and end times of each phoneme to the terminal 400, so that the terminal 400 directly presents the start and end times of each phoneme.
  • the speaking test requires the user to read a given text in English.
  • the terminal 400 receives the audio data corresponding to the given text from the user, and the terminal 400 sends the audio data to the server 200.
  • The server 200 maps the audio feature of each audio frame to obtain the weight of the phoneme feature of each phoneme, fuses, based on the weight of the phoneme feature of each phoneme, the audio feature of the audio frame and the phoneme feature of at least one phoneme to obtain the fusion feature of each audio frame, determines the phoneme corresponding to each audio frame based on the fusion feature of each audio frame, determines the start and end moments of each phoneme based on the phoneme corresponding to each audio frame, and sends them to the terminal 400, so that the terminal 400 directly presents the start and end times of each phoneme.
  • the terminal 400 can display the scoring result for each phoneme.
  • The user participating in the read-aloud and the scoring user can be the same user or different users.
  • The oral practice question requires the user to read the given text aloud in English.
  • The terminal 400 receives the user's audio data corresponding to the given text and sends the audio data to the server 200.
  • The server 200 maps the audio feature of each audio frame to obtain the weight of the phoneme feature of each phoneme, and fuses, based on the weight of the phoneme feature of each phoneme, the audio feature of the audio frame and the phoneme feature of at least one phoneme to obtain the fusion feature of each audio frame.
  • Based on the fusion feature of each audio frame, the server 200 determines the phoneme corresponding to each audio frame, determines the start and end moments of each phoneme based on the phoneme corresponding to each audio frame, and sends them to the terminal 400.
  • The terminal 400 directly presents the start and end moments of each phoneme, so that, in response to the user's playback operation for a phoneme, the terminal 400 can individually play the audio frames of the corresponding phoneme.
  • In some embodiments, the terminal itself may map the audio feature of the audio frame to obtain the weight of the phoneme feature of each phoneme, fuse, based on the weight of the phoneme feature of each phoneme, the audio feature of the audio frame and the phoneme feature of at least one phoneme to obtain the fusion feature of each audio frame, determine the phoneme corresponding to each audio frame based on the fusion feature of each audio frame, determine the start and end moments of each phoneme based on the phoneme corresponding to each audio frame, and directly present the start and end moments of each phoneme.
  • The server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle terminal, etc., but is not limited thereto.
  • the terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this embodiment of the present application.
  • the terminal or the server can implement the audio processing method provided in the embodiment of the present application by running a computer program.
  • The computer program may be a native program or a software module in the operating system; a native application (APP), that is, a program that needs to be installed in the operating system to run, such as an oral test APP or an oral learning APP; a mini program, that is, a program that only needs to be downloaded into the browser environment to run; or a mini program that can be embedded into any APP.
  • the above-mentioned computer program can be any form of application program, module or plug-in.
  • FIG. 2 is a schematic structural diagram of a server 200 provided by an embodiment of the present application.
  • the server 200 shown in FIG. 2 includes: at least one processor 210 , a memory 250 , and at least one network interface 220 .
  • Various components in the server 200 are coupled together through the bus system 240 .
  • the bus system 240 is used to realize connection and communication between these components.
  • the bus system 240 also includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 240 in FIG. 2 .
  • The processor 210 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
  • Memory 250 may be removable, non-removable or a combination thereof.
  • Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like.
  • Memory 250 optionally includes one or more storage devices located physically remote from processor 210 .
  • Memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory.
  • the non-volatile memory can be a read-only memory (ROM, Read Only Memory), and the volatile memory can be a random access memory (RAM, Random Access Memory).
  • the memory 250 described in the embodiment of the present application is intended to include any suitable type of memory.
  • memory 250 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • Operating system 251, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks; and network communication module 252, for reaching other computing devices via one or more (wired or wireless) network interfaces 220.
  • Exemplary network interfaces 220 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like.
  • the audio processing device based on artificial intelligence provided by the embodiment of the present application can be realized by software.
  • FIG. 2 shows an artificial intelligence-based audio processing device 255 stored in the memory 250, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: a phoneme module 2551, an audio module 2552, a fusion module 2553, an alignment module 2554, and a training module 2555. These modules are logical, and thus may be combined arbitrarily or further split according to the functions to be realized. The function of each module is explained below.
  • the artificial intelligence-based audio processing method provided by the embodiment of the present application will be described in conjunction with the exemplary application and implementation of the server 200 provided in the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of the phoneme alignment model of the artificial intelligence-based audio processing method provided by the embodiment of the present application.
  • The phoneme alignment model includes an attention fusion network, a phoneme classification network (corresponding to the first task), and a loudness classification network (corresponding to the second task).
  • The attention fusion network is used to fuse phoneme features and audio features, so that the fusion features output by the attention fusion network are shared by the phoneme classification network corresponding to the first task and the loudness classification network corresponding to the second task. The input of the attention fusion network is the audio features obtained from the audio data and the phoneme features obtained from the given text, and its output is the fusion feature of the audio features and the phoneme features. The loudness classification network and the phoneme classification network then perform fully connected processing on the fusion features respectively, obtaining loudness classification results and phoneme classification results.
  • the loudness classification network can be a fully-connected layer structure, and the phoneme classification network can also be a fully-connected layer structure, but the parameters of the two are different.
  • the first task is to identify the phoneme of a certain audio frame from multiple candidate phonemes, and the second task is to judge whether a certain audio frame is a silent audio frame.
  • The phoneme alignment model includes an attention fusion network, a phoneme classification network (corresponding to the first task), and a loudness classification network (corresponding to the second task); see FIG. 7, which is a data flow diagram of the artificial intelligence-based audio processing method.
  • The input of the audio encoder is the audio data, and its output is the audio feature (in vector form) of each audio frame included in the audio data. The input of the phoneme encoder is the phoneme sequence of the given text, and its output is the phoneme feature of each phoneme (the data form of the phoneme feature is a vector). The input of the attention fusion network is the output of the audio encoder and the output of the phoneme encoder, and the output of the attention fusion network is the fusion feature of the phoneme features and the audio features.
  • The fusion features are classified through two parallel networks: the phoneme classification network and the loudness classification network.
  • The phoneme classification network outputs the probability that each audio frame belongs to each candidate phoneme, and the loudness classification network outputs the probability that each audio frame belongs to each loudness category.
  • The loudness categories include silent and non-silent; for example, the flag of non-silent is 1 and the flag of silent is 0, and the candidate phonemes are W, IH, L, and so on.
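  • As an illustration of the data flow described above, the following is a minimal PyTorch-style sketch of a phoneme alignment model with an audio encoder, a phoneme encoder, an attention fusion network, and two parallel classification heads sharing the fusion feature. The module choices (simple linear/embedding encoders, a single-head attention layer), the dimensions, and the names are assumptions for illustration, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class PhonemeAlignmentModel(nn.Module):
    """Illustrative sketch: audio encoder + phoneme encoder -> attention fusion -> two heads."""
    def __init__(self, num_phoneme_types=40, feat_dim=256, num_loudness_classes=2):
        super().__init__()
        # Placeholder encoders; the patent describes a wav2vec-style audio encoder
        # and a phoneme encoder built from characteristic + position representations.
        self.audio_encoder = nn.Linear(80, feat_dim)          # e.g. per-frame filterbank -> feature
        self.phoneme_encoder = nn.Embedding(num_phoneme_types, feat_dim)
        # Attention fusion: each audio frame attends over the phoneme sequence.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=1, batch_first=True)
        # Two parallel fully connected heads sharing the fusion feature.
        self.phoneme_head = nn.Linear(2 * feat_dim, num_phoneme_types)     # first task
        self.loudness_head = nn.Linear(2 * feat_dim, num_loudness_classes) # second task

    def forward(self, audio_frames, phoneme_ids):
        # audio_frames: (batch, num_frames, 80); phoneme_ids: (batch, num_phonemes)
        h_audio = self.audio_encoder(audio_frames)            # (B, T, D)
        h_phone = self.phoneme_encoder(phoneme_ids)           # (B, P, D)
        # Each audio frame queries the phoneme features; the weights indicate relevance.
        attn_out, attn_weights = self.attn(h_audio, h_phone, h_phone)
        # Fusion = concatenation of the attention result with the frame's own audio feature.
        fusion = torch.cat([attn_out, h_audio], dim=-1)       # (B, T, 2D)
        phoneme_logits = self.phoneme_head(fusion)            # per-frame candidate-phoneme scores
        loudness_logits = self.loudness_head(fusion)          # per-frame silent / non-silent scores
        return phoneme_logits, loudness_logits, attn_weights
```

  • In this sketch the attention weights returned for each frame play the role of the weight matrix that is later constrained by the alignment loss during training.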
  • Taking the server 200 in FIG. 1 as an example of the electronic device that executes the artificial intelligence-based audio processing method provided in the embodiments of the present application, the method is described below.
  • FIG. 3A is a schematic flowchart of an audio processing method based on artificial intelligence provided by an embodiment of the present application, which will be described in conjunction with steps 101-104 shown in FIG. 3A .
  • In step 101, at least one phoneme of a given text is obtained, and the phoneme feature of each phoneme is determined.
  • determining the phoneme feature of each phoneme is achieved by calling a phoneme encoder, which includes a phoneme feature representation network and a phoneme position representation network.
  • In some embodiments, the phoneme feature of each phoneme can be determined by the following technical solution: performing the following processing for each phoneme: determining the feature representation feature of the phoneme through the phoneme feature representation network, where the feature representation feature is used to represent the characteristics of the phoneme; determining the position representation feature of the phoneme through the phoneme position representation network, where the position representation feature is used to represent the position of the phoneme in the corresponding text unit; and adding the position representation feature and the feature representation feature together to obtain the phoneme feature of the phoneme.
  • the phoneme feature representation network and the phoneme position representation network are in a parallel relationship
  • the phoneme feature representation network and the phoneme position representation network are both convolutional neural networks
  • The number of convolutional layers included in the two convolutional neural networks is different, and the parameters of their convolutional layers also differ.
  • The phoneme is convolved through the multiple cascaded convolutional layers in the phoneme feature representation network to obtain the feature representation feature of the phoneme, and the phoneme is convolved through the multiple cascaded convolutional layers in the phoneme position representation network to obtain the position representation feature of the phoneme.
  • For example, the phonemes of the given text include EH1, V, ER, sp, F, R, G, EH, and T, where EH1, V, ER, F, R, G, EH, and T are different phonemes, sp represents a silent phoneme, and silence is also one of the candidate phonemes.
  • the feature representation feature is used to distinguish different phonemes, and the feature representation feature represents the characteristics of the phoneme.
  • Each phoneme has four possible positions in the corresponding text unit, where the text unit is the smallest unit of a sentence; for example, in English, the text unit (How) of the given text (How are) shown in FIG. 6 is a word. When a word contains multiple phonemes, a phoneme can be at the start position (B), a middle position (I), or the end position (E) of the word; when a word contains a single phoneme, S is used to represent the position of the phoneme.
  • The position of each phoneme in the corresponding text unit is encoded through the phoneme position representation network to obtain the position representation feature of each phoneme; the position representation feature represents the position of the phoneme in the corresponding text unit, for example, E(B) shown in FIG. 6. Finally, the unique feature representation feature of each phoneme (a vector representing the phoneme's characteristics) and its position representation feature (a vector representing the phoneme's position) are added to obtain the final phoneme feature.
  • This phoneme encoding method can effectively represent the characteristic difference of each phoneme, and can also effectively represent the difference of the same phoneme at different positions.
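  • A minimal sketch of the phoneme encoding described above, under the assumption that plain embedding tables (rather than the convolutional networks mentioned above) are used: one embedding represents the identity of the phoneme and another represents its B/I/E/S position within its word, and the two are added to give the phoneme feature. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch: phoneme feature = identity (characteristic) representation + position representation."""
    POSITIONS = {"B": 0, "I": 1, "E": 2, "S": 3}  # begin / inside / end of word, or single-phoneme word

    def __init__(self, num_phoneme_types=40, feat_dim=256):
        super().__init__()
        self.identity_embed = nn.Embedding(num_phoneme_types, feat_dim)    # distinguishes phonemes
        self.position_embed = nn.Embedding(len(self.POSITIONS), feat_dim)  # distinguishes B/I/E/S

    def forward(self, phoneme_ids, position_ids):
        # phoneme_ids, position_ids: (batch, num_phonemes) integer tensors
        return self.identity_embed(phoneme_ids) + self.position_embed(position_ids)

# Usage (illustrative): a word pronounced F, ER, G, EH, T would use positions B, I, I, I, E.
```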
  • In step 102, audio data corresponding to the given text is obtained, and the audio feature of each audio frame included in the audio data is determined.
  • FIG. 9 is a schematic structural diagram of an audio encoder provided by an embodiment of the present application.
  • the audio encoder shown in FIG. 9 includes multiple cascaded convolutional networks and a normalized network.
  • In some embodiments, determining the audio feature of each audio frame included in the audio data in step 102 can be achieved through the following technical solution: performing feature extraction processing on at least one audio frame through the multiple cascaded convolutional networks included in the audio encoder to obtain the convolutional feature extraction result of the corresponding audio frame, and normalizing the convolutional feature extraction result of each audio frame through the normalization network included in the audio encoder to obtain the audio feature of each audio frame.
  • As an example, the audio features are obtained based on the audio encoder, and at least one audio frame is processed as a whole through the multiple cascaded convolutional networks. When there are multiple audio frames, the output of the convolutional networks is a low-frequency feature representation; for example, the encoder encodes about 30 milliseconds of 16 kHz audio data and generates one low-frequency feature representation every set time step, resulting in a convolutional feature extraction result for each audio frame. The convolutional feature extraction result of each audio frame is then normalized through the normalization network to obtain the audio feature of each audio frame.
  • The structure of the audio encoder can be the network structure of wav2vec, and the parameters of the audio encoder are obtained by training based on the wav2vec network structure.
  • Wav2vec is a convolutional neural network.
  • the convolutional neural network includes an encoding network.
  • the encoding network is a 5-layer convolutional structure.
  • the convolutional neural network also includes a content network.
  • the content network is a 9-layer convolutional structure.
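  • The following is a minimal sketch of a wav2vec-style audio encoder as described above: cascaded 1-D convolutions turn the raw 16 kHz waveform into frame-level features, followed by a normalization layer. The layer count, kernel sizes, and strides are illustrative assumptions rather than the exact wav2vec configuration.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch: cascaded convolutions + normalization produce one feature vector per audio frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Cascaded 1-D convolutions over the raw waveform; the strides downsample the signal so
        # that each output step covers roughly one audio frame (values are illustrative).
        self.convs = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, waveform):
        # waveform: (batch, num_samples) raw audio
        x = self.convs(waveform.unsqueeze(1))   # (B, D, T) frame-level convolutional features
        x = x.transpose(1, 2)                   # (B, T, D)
        return self.norm(x)                     # normalized audio feature per audio frame
```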
  • In step 103, the following processing is performed for each audio frame: the audio feature of the audio frame is mapped to obtain the weight of the phoneme feature of each phoneme, and based on the weight of the phoneme feature of each phoneme, the audio feature of the audio frame and the phoneme features of at least one phoneme are fused to obtain the fusion feature of the audio frame.
  • In some embodiments, step 103 is implemented through the attention fusion network, which includes an attention layer and a fusion layer. In step 103, mapping the audio feature of the audio frame to obtain the weight of the phoneme feature of each phoneme can be implemented through the following technical solution: performing query vector transformation processing on the audio feature to obtain a query vector; performing key vector transformation processing on the phoneme feature to obtain a key vector; multiplying the query vector by the transpose of the key vector to obtain a multiplication result; obtaining the square root of the dimension of the key vector; determining the ratio of the multiplication result to the square root as the attention feature; and performing maximum likelihood processing on the attention feature to obtain the weight of the corresponding phoneme.
  • the weight corresponding to each phoneme is obtained based on the audio features of the audio frame, and the association information between the phoneme and the audio frame can be obtained, thereby improving the accuracy of subsequent alignment.
  • In some embodiments, the query vector transformation processing can be implemented in the following manner: the first parameter Wq of the attention layer is multiplied by the audio feature to obtain the query vector Q; or the first parameter Wq of the attention layer is multiplied by the audio feature to obtain a first multiplication result, and the first multiplication result is added to the fourth parameter Bq to obtain the query vector Q.
  • The key vector transformation processing can be implemented in the following manner: the second parameter Wk of the attention layer is multiplied by the phoneme feature to obtain the key vector K; or the second parameter Wk of the attention layer is multiplied by the phoneme feature to obtain a second multiplication result, and the second multiplication result is added to the fifth parameter Bk to obtain the key vector K.
  • The first parameter, the second parameter, the fourth parameter, and the fifth parameter of the attention layer are all obtained by training the attention fusion network.
  • an attention mechanism is used to fuse phoneme features and audio features.
  • The attention mechanism is used to model the relationship among the query vector Q, the key vector K, and the value vector V; see formulas (1) and (2).
  • The audio feature of each audio frame is used as the query vector Q, and the phoneme feature H_phone of each phoneme of the given text is used as the key vector K and the value vector V of each phoneme; AttentionScore(Q, K) is the weight of each phoneme, Attention(Q, K, V) is the attention result of each phoneme, and d_k is the dimension of the key vector K.
  • Specifically, query vector transformation processing is performed on the audio feature to obtain the query vector Q, key vector transformation processing is performed on the phoneme feature H_phone of each phoneme of the given text to obtain the key vector K, and value vector transformation processing is performed on the phoneme feature H_phone of each phoneme of the given text to obtain the value vector V; the parameters involved in these transformation processings are obtained through the overall training of the phoneme alignment model. Alternatively, the audio feature of each audio frame can be used directly as the query vector, or the phoneme feature H_phone of each phoneme of the given text can be used directly as the key vector K and the value vector V of each phoneme.
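  • A minimal sketch of the weight computation described above, assuming that the "maximum likelihood processing" of the scaled scores corresponds to a softmax-style normalization; the parameter names Wq, Wk, Bq, and Bk follow the text, and everything else is illustrative.

```python
import math
import torch

def phoneme_weights(h_audio_frame, h_phone, Wq, Wk, Bq=None, Bk=None):
    """Weight of each phoneme's feature for one audio frame (scaled dot-product attention scores).

    h_audio_frame: (d,) audio feature of one frame; h_phone: (num_phonemes, d) phoneme features.
    Wq, Wk: (d, d_k) learned projections; Bq, Bk: optional biases (the fourth/fifth parameters).
    """
    q = h_audio_frame @ Wq + (Bq if Bq is not None else 0)   # query vector from the audio feature
    k = h_phone @ Wk + (Bk if Bk is not None else 0)         # one key vector per phoneme
    scores = (q @ k.T) / math.sqrt(k.shape[-1])              # Q * K^T / sqrt(d_k), the attention feature
    return torch.softmax(scores, dim=-1)                     # normalized weight per phoneme
```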
  • In some embodiments, step 103 is implemented through the attention fusion network, which includes an attention layer and a fusion layer; see FIG. 3B, which is a schematic flowchart of the artificial intelligence-based audio processing method provided by an embodiment of the present application. In step 103, fusing the audio feature of the audio frame and the phoneme feature of at least one phoneme based on the weight of the phoneme feature of each phoneme to obtain the fusion feature of each audio frame can be performed, for each phoneme, through steps 1031-1033 shown in FIG. 3B.
  • In step 1031, value vector transformation processing is performed on the phoneme feature of the phoneme to obtain a value vector.
  • In step 1032, the weight of the corresponding phoneme is multiplied by the value vector to obtain the attention result of the corresponding phoneme.
  • Both step 1031 and step 1032 are implemented by the attention layer in the attention fusion network. The value vector transformation processing can be implemented in the following manner: the third parameter Wv of the attention layer is multiplied by the phoneme feature to obtain the value vector V; or the third parameter Wv of the attention layer is multiplied by the phoneme feature to obtain a third multiplication result, and the third multiplication result is added to the sixth parameter Bv to obtain the value vector V. Both the third parameter and the sixth parameter of the attention layer are obtained by training the attention fusion network.
  • In step 1033, the attention result corresponding to at least one phoneme and the audio feature of the audio frame are fused to obtain the fusion feature of the corresponding audio frame.
  • step 103 is implemented by calling the attention fusion network.
  • the attention fusion network includes an attention layer and a fusion layer.
  • the fusion process is actually a feature splicing process.
  • The attention result obtained for a certain audio frame is spliced with the audio feature of that frame to obtain the fusion feature corresponding to the audio frame; see formula (3).
  • In formula (3), the attention result of audio frame i is a matrix in which each column represents the attention result of one of the phonemes with respect to audio frame i, the other input is the audio feature of audio frame i, H_phone is the phoneme feature of all phonemes of the given text, and the output is the fusion feature corresponding to the audio frame.
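  • Continuing the sketch above, the fusion can be illustrated as follows: the value vectors of the phonemes are combined according to the phoneme weights to give the attention result for the frame, and this result is spliced (concatenated) with the frame's own audio feature. For compactness the attention result is shown here as a weighted sum over the value vectors, whereas the text describes keeping one attention result per phoneme before splicing; phoneme_weights is the hypothetical helper from the previous sketch, and Wv and Bv follow the text.

```python
import torch

def fuse_frame(h_audio_frame, h_phone, weights, Wv, Bv=None):
    """Fusion feature of one audio frame = concat(attention result, audio feature of the frame)."""
    v = h_phone @ Wv + (Bv if Bv is not None else 0)  # value vector per phoneme (third/sixth parameters)
    attn_result = weights @ v                         # weights from phoneme_weights(), combined over phonemes
    return torch.cat([attn_result, h_audio_frame], dim=-1)  # splice with the frame's audio feature
```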
  • the Attention Mechanism stems from the study of human vision.
  • The attention mechanism includes the soft attention mechanism (which can be divided into item-wise soft attention and location-wise soft attention), the hard attention mechanism (which can be divided into item-wise hard attention and location-wise hard attention), and the self-attention mechanism (a variant of the attention mechanism that reduces the dependence on external information and is better at capturing the internal correlations of data or features).
  • The attention mechanism mainly has two aspects: deciding which part of the input needs to be attended to, and allocating limited information processing resources to the important parts.
  • attention can be achieved by means of weights.
  • In the embodiments of the present application, the weight is used to judge the relevance of the audio frame to each phoneme. Different audio frames pay different attention to the same phoneme, so that when the audio feature of an audio frame is fused with the phoneme features of multiple phonemes, the weights of the phoneme features differ.
  • In step 104, based on the fusion feature of each audio frame, the phoneme corresponding to each audio frame is determined, and based on the phoneme corresponding to each audio frame, the start and end moments of each phoneme are determined.
  • In some embodiments, determining the phoneme corresponding to each audio frame is implemented by calling a phoneme classification network; the phoneme classification network shown in FIG. 6 includes at least one cascaded phoneme fully connected layer.
  • In step 104, determining the phoneme corresponding to each audio frame based on the fusion feature of each audio frame can be achieved through the following technical solution: performing the following processing for each audio frame: when the number of phoneme fully connected layers is one, performing first fully connected processing on the fusion feature through the phoneme fully connected layer to obtain the first probability that the audio frame belongs to each candidate phoneme; when the number of phoneme fully connected layers is N (multiple), performing, through the n-th phoneme fully connected layer among the N cascaded phoneme fully connected layers, first fully connected processing on the input of the n-th phoneme fully connected layer, and transmitting the n-th phoneme fully connected result output by the n-th phoneme fully connected layer to the (n+1)-th phoneme fully connected layer to continue the first fully connected processing, so as to obtain the (n+1)-th phoneme fully connected result corresponding to the (n+1)-th phoneme fully connected layer, where the result output by the last phoneme fully connected layer is the first probability that the audio frame belongs to each candidate phoneme.
  • For example, phoneme classification is performed for each audio frame through the phoneme classification network, and the candidate phonemes contain a total of 40 phonemes (39 phonemes in the phoneme dictionary and the silent phoneme). When there is only one phoneme fully connected layer, the first probability that an audio frame belongs to each candidate phoneme is output through the phoneme fully connected layer; that is, 40 first probabilities are output for audio frame A, and the candidate phoneme corresponding to the highest first probability is determined as the phoneme of audio frame A.
  • In some embodiments, in step 104, determining the start and end moments of each phoneme based on the phoneme corresponding to each audio frame can be achieved through the following technical solution: based on the phoneme corresponding to each audio frame, determining at least one audio frame corresponding to each phoneme, and performing the following processing for each phoneme: when the phoneme corresponds to multiple consecutive audio frames, determining the start and end moments of the consecutive audio frames corresponding to the phoneme as the start and end moments of the phoneme; when the phoneme corresponds to a single audio frame, determining the moment of that audio frame as the start and end moments of the phoneme.
  • The start and end moments include the start time and the end time of the phoneme. Taking 10 audio frames as an example, at least one audio frame corresponding to each phoneme is determined based on the phoneme corresponding to each audio frame, and the following processing is performed for each phoneme: when the phoneme corresponds to multiple consecutive audio frames, the start and end moments of the corresponding consecutive audio frames are determined as the start and end moments of the phoneme. For example, if the first audio frame to the third audio frame all correspond to the phoneme W, then the phoneme W corresponds to the first to third audio frames, and the start and end moments of the first to third audio frames are determined as the start and end moments of the phoneme W; that is, the moment of the first audio frame is determined as the start time and the moment of the third audio frame is determined as the end time. When the phoneme corresponds to a single audio frame, the moment of that audio frame is determined as the start and end moments of the phoneme.
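  • The following sketch illustrates how per-frame phoneme predictions can be turned into start and end times by grouping consecutive frames with the same predicted phoneme. The 10-millisecond frame duration and the helper name are illustrative assumptions.

```python
def phoneme_segments(frame_phonemes, frame_duration_s=0.01):
    """Group consecutive identical frame labels into (phoneme, start_time, end_time) segments.

    frame_phonemes: list of predicted phoneme labels, one per audio frame (e.g. the argmax of the
    first probabilities output by the phoneme classification network).
    """
    segments = []
    start = 0
    for i in range(1, len(frame_phonemes) + 1):
        # Close a segment when the label changes or the last frame is reached.
        if i == len(frame_phonemes) or frame_phonemes[i] != frame_phonemes[start]:
            segments.append((frame_phonemes[start],
                             round(start * frame_duration_s, 3),   # start moment of the phoneme
                             round(i * frame_duration_s, 3)))      # end moment of the phoneme
            start = i
    return segments

# Example: frames 1-3 predicted as "W" yield one segment for W covering roughly 0.00 s to 0.03 s.
print(phoneme_segments(["W", "W", "W", "IH", "L", "L"]))
```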
  • FIG. 3C is a schematic flow chart of an audio processing method based on artificial intelligence provided in an embodiment of the present application.
  • Steps 105 to 111 shown in FIG. 3C may be executed before step 101, in which at least one phoneme of the given text is obtained and the phoneme feature of each phoneme is determined, or before step 102, in which the audio data corresponding to the given text is acquired and the audio feature of each audio frame included in the audio data is determined.
  • In step 105, audio data samples are acquired together with a given text sample.
  • A given text sample corresponds to an audio data sample; for example, the audio data sample is obtained by the user reading the given text sample aloud.
  • In step 106, at least one phoneme sample of the given text sample is acquired, and the phoneme feature of each phoneme sample is determined through the phoneme encoder.
  • In step 107, the audio feature of each audio frame sample included in the audio data sample is determined through the audio encoder.
  • The audio encoder and the phoneme encoder participating in the training can be pre-trained network structures.
  • A pre-trained acoustic model is used for audio feature extraction, for example, a speech-to-vector model such as wav2vec.
  • The speech-to-vector model consists of a multi-layer convolutional network structure and is pre-trained on a large amount of unlabeled data based on a contrastive loss.
  • the audio data (audio waveform features) is input to the pre-trained network structure.
  • the phoneme alignment model includes a phoneme classification network, a loudness classification network, a shared attention fusion network, an audio encoder and a phoneme encoder
  • step 103 is implemented by calling the attention fusion network
  • Determining the corresponding phoneme is implemented by calling the phoneme classification network.
  • the phoneme classification network and the loudness classification network share the attention fusion network.
  • the input of the attention fusion network is the output of the audio encoder and the output of the phoneme encoder.
  • In step 108, the following processing is performed for each audio frame sample: the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample are forward-propagated through a network composed of the attention fusion network and the phoneme classification network to obtain a first forward propagation result.
  • In some embodiments, forward-propagating the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample through the network composed of the attention fusion network and the phoneme classification network to obtain the first forward propagation result can be implemented as follows: through the attention layer of the attention fusion network, the following processing is performed for each phoneme sample: based on the audio feature of the audio frame sample and the phoneme feature of the phoneme sample, the weight of the corresponding phoneme sample is determined; the phoneme feature is subjected to value vector transformation processing, and the weight of the corresponding phoneme sample is multiplied by the value vector transformation result to obtain the attention result of the corresponding phoneme sample. Through the fusion layer of the attention fusion network, the attention results corresponding to the phoneme samples and the audio feature of the audio frame sample are fused to obtain the fusion feature of the corresponding audio frame sample, and first fully connected processing is performed on the fusion feature of the audio frame sample through the phoneme classification network to obtain the third probability that the audio frame sample belongs to each candidate phoneme; the third probabilities of the audio frame samples constitute the first forward propagation result.
  • The weights of all audio frame samples form a weight matrix: each row in the weight matrix corresponds to an audio frame, and the entries of that row form the probability distribution of the audio frame over the phonemes (one column per phoneme).
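  • To illustrate the weight matrix just described, the following is a vectorized version of the earlier per-frame weight sketch (softmax again assumed as the normalization; all names are illustrative).

```python
import math
import torch

def weight_matrix(h_audio, h_phone, Wq, Wk):
    """(num_frames, num_phonemes) matrix: row i is frame i's probability distribution over phonemes."""
    q = h_audio @ Wq                            # (T, d_k) one query per audio frame
    k = h_phone @ Wk                            # (P, d_k) one key per phoneme
    scores = q @ k.T / math.sqrt(k.shape[-1])   # (T, P) scaled dot-product scores
    return torch.softmax(scores, dim=-1)        # each row sums to 1
```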
  • In step 109, the following processing is performed for each audio frame sample: the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample are forward-propagated through a network composed of the attention fusion network and the loudness classification network to obtain a second forward propagation result.
  • In some embodiments, this forward propagation can be implemented as follows: the audio feature of the audio frame sample is mapped through the attention fusion network to obtain the weight of the phoneme feature of each phoneme sample; based on the weight of the phoneme feature of each phoneme sample, the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample are fused to obtain the fusion feature of each audio frame sample; second fully connected processing is performed on the fusion feature of each audio frame sample through the loudness classification network to obtain the second probability that each audio frame sample belongs to each loudness category; and the second probabilities of each audio frame sample belonging to each loudness category constitute the second forward propagation result.
  • The input of the loudness classification network is the same as the input of the phoneme classification network.
  • In some embodiments, forward-propagating the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample through the network composed of the attention fusion network and the loudness classification network to obtain the second forward propagation result can also be implemented as follows: through the attention layer of the attention fusion network, the following processing is performed for each phoneme sample: based on the audio feature of the audio frame sample and the phoneme feature of the phoneme sample, the weight of the corresponding phoneme sample is determined; the phoneme feature is subjected to value vector transformation processing, and the weight of the corresponding phoneme sample is multiplied by the value vector transformation result to obtain the attention result of the corresponding phoneme sample. Through the fusion layer of the attention fusion network, the attention results corresponding to the phoneme samples and the audio feature of the audio frame sample are fused to obtain the fusion feature of the corresponding audio frame sample, and second fully connected processing is performed on the fusion feature of the audio frame sample through the loudness classification network to obtain the second probability that the audio frame sample belongs to each loudness category.
  • the phoneme alignment model includes an attention fusion network, a phoneme classification network, and a loudness classification network.
  • the input of the audio encoder is an audio data sample
  • the output of the audio encoder is the audio feature (vector) of each audio frame sample included in the audio data.
  • the input of the phoneme encoder is a phoneme sequence sample (a given text sample)
  • the output of the phoneme encoder is the phoneme feature of each phoneme sample (the data form of the phoneme feature is a vector)
  • The input of the attention fusion network is the output of the audio encoder and the output of the phoneme encoder, and the output of the attention fusion network is the fusion feature of the phoneme features and the audio features.
  • The fusion feature of the audio feature of each audio frame and the phoneme features of all phonemes is calculated through the attention mechanism, and the phoneme corresponding to the audio frame is determined based on the fusion feature.
  • The fusion features are classified through two parallel networks: the phoneme classification network and the loudness classification network.
  • the phoneme classification network outputs the third probability that each audio frame belongs to each candidate phoneme
  • the loudness classification network outputs the second probability that each audio frame belongs to each loudness category.
  • the loudness category includes mute and non-mute.
  • the flag of non-silence is 1, and the flag of silence is 0.
  • The loudness categories can also be more fine-grained, for example, mute, 10 decibels, 20 decibels, 30 decibels, and so on; the candidate phonemes are W, IH, L, and so on.
  • In step 110, a joint loss is determined according to the first forward propagation result and the second forward propagation result.
  • In some embodiments, determining the joint loss based on the first forward propagation result and the second forward propagation result can be achieved through the following technical solution: determining a first phoneme category loss based on the third probability of each audio frame sample corresponding to the multiple candidate phonemes and the pre-labeled candidate phoneme of each audio frame sample; determining a second loudness category loss based on the second probability of each audio frame sample corresponding to the multiple loudness categories and the pre-labeled loudness category of each audio frame sample; determining a third alignment loss based on the weight of each audio frame sample corresponding to each phoneme sample and the pre-labeled alignment identifier of each audio frame sample corresponding to each phoneme sample; and fusing the first phoneme category loss, the second loudness category loss, and the third alignment loss to obtain the joint loss.
  • Composing the joint loss from losses of multiple dimensions and training based on the joint loss can effectively improve the training effect of the phoneme alignment model.
  • The cross-entropy loss is used to calculate the losses of the two classification tasks during the training of the phoneme alignment model; see formula (4) and formula (5), where L_phone is the phoneme classification loss (the first phoneme category loss), L_sil is the loudness classification loss (the second loudness category loss), m is the number of audio frames, and c is the number of candidate phonemes.
  • The weight matrix in the embodiment of the present application is constrained, that is, attention weight constraints are applied: each row in the matrix represents an audio frame, and the entries of that row form the probability distribution of the audio frame over the phonemes. The loss of the attention mechanism is obtained by comparing the phoneme probability distribution of each audio frame with the phoneme actually corresponding to that audio frame; see formula (6).
  • In formula (6), L_align is the attention mechanism loss, m is the number of audio frames, N_p is the number of phonemes in the given text, and the alignment label is 1 or 0, where 1 indicates that the i-th audio frame is aligned with the j-th phoneme and 0 indicates that the i-th audio frame is not aligned with the j-th phoneme.
  • The joint loss of the entire phoneme alignment network consists of three parts: the phoneme classification loss (the first phoneme category loss), the loudness classification loss (the second loudness category loss), and the alignment loss (the third alignment loss).
  • The three losses are weighted and summed with different weights, and the final joint loss is shown in formula (7).
  • In formula (7), the weight of each loss is a preset weight, and the sum of the three weights is equal to 1; L_phone is the phoneme classification loss (the first phoneme category loss), L_sil is the loudness classification loss (the second loudness category loss), L_align is the alignment loss (the third alignment loss), and L_total is the joint loss, i.e., the weighted sum of L_phone, L_sil, and L_align.
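  • A minimal sketch of the joint loss, assuming standard cross-entropy forms for the phoneme classification loss, the loudness classification loss, and the alignment loss; the weight values, tensor names, and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def joint_loss(phoneme_logits, loudness_logits, attn_weights,
               phoneme_targets, loudness_targets, aligned_phoneme_index,
               alpha=0.4, beta=0.3, gamma=0.3):
    """L_total = alpha * L_phone + beta * L_sil + gamma * L_align (preset weights summing to 1)."""
    # Phoneme classification loss over m frames and c candidate phonemes (cf. formula (4)).
    l_phone = F.cross_entropy(phoneme_logits, phoneme_targets)
    # Loudness (silent / non-silent) classification loss (cf. formula (5)).
    l_sil = F.cross_entropy(loudness_logits, loudness_targets)
    # Alignment loss constraining the attention weight matrix: each frame's weight distribution
    # over the N_p phonemes should peak at the phoneme it is actually aligned with (cf. formula (6)).
    l_align = F.nll_loss(torch.log(attn_weights + 1e-8), aligned_phoneme_index)
    return alpha * l_phone + beta * l_sil + gamma * l_align

# Shapes (illustrative): phoneme_logits (m, c); loudness_logits (m, 2); attn_weights (m, N_p);
# phoneme_targets, loudness_targets, aligned_phoneme_index are integer tensors of shape (m,).
```

  • Backpropagating such a joint loss updates the attention fusion network, both classification heads, the phoneme encoder, and the audio encoder together, which is the multi-task training described above.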
  • in step 111, the parameters of the attention fusion network, the phoneme classification network, the loudness classification network, the phoneme encoder and the audio encoder are updated according to the joint loss.
  • as an example, the gradient is determined according to the joint loss, and the parameters of each network are then updated through a gradient descent algorithm, so that the joint loss converges to the lowest value as far as possible.
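  • A minimal sketch of such an update step, assuming the joint_loss helper above, a `model` object that bundles the audio encoder, phoneme encoder, attention fusion network and the two classification heads, and a standard torch optimizer (all illustrative assumptions):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(batch):
    optimizer.zero_grad()
    phone_logits, sil_logits, attn_weights = model(batch["audio"], batch["phonemes"])
    loss = joint_loss(phone_logits, sil_logits, attn_weights,
                      batch["phone_labels"], batch["sil_labels"], batch["align_labels"])
    loss.backward()      # gradients of the joint loss w.r.t. all sub-network parameters
    optimizer.step()     # gradient-descent style update of every sub-network
    return loss.item()
```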
  • in the embodiments of the present application, the audio features and the text sequence are processed by the attention mechanism to obtain the fusion features, so the fusion features can effectively represent the relationship between audio frames and phonemes; phoneme classification is then performed on each audio frame of the audio based on the fusion features, which can effectively improve the classification accuracy and thereby the phoneme alignment accuracy.
  • when the audio processing system is applied to an oral test scenario, for example, an oral test question requires the examinee user to read a given text aloud in English, the examinee terminal receives the user's audio data corresponding to the given text and sends the audio data to the server; the server performs mapping processing on the audio features of each audio frame to obtain the weight of the phoneme feature of each phoneme, performs fusion processing on the audio features of the audio frame and the phoneme features of at least one phoneme based on the weight of the phoneme feature of each phoneme to obtain the fusion feature of the audio frame, determines the phoneme corresponding to each audio frame based on the fusion feature of each audio frame, determines the start and end moments of each phoneme based on the phoneme corresponding to each audio frame, and sends them to the judge terminal, so that the judge terminal directly presents the start and end moments of each phoneme; in response to the scoring operation of the judge user, the judge terminal can display the scoring result for each phoneme.
  • that is, the embodiment of the present application mainly provides an automated tool for phoneme labeling, which marks the corresponding position of each phoneme of the given text in the audio data, and on this basis whether a phoneme or word is read aloud incorrectly can further be marked; this effectively reduces the cost of manual labeling and provides a more convenient scoring environment for the subsequent scoring by judges.
  • when the audio processing system is applied to an oral practice scenario, for example, an oral practice exercise requires the student user to read a given text aloud in English, the student terminal receives the user's audio data corresponding to the given text and sends the audio data to the server; the server performs mapping processing on the audio features of each audio frame to obtain the weight of the phoneme feature of each phoneme, performs fusion processing on the audio features of the audio frame and the phoneme features of at least one phoneme based on the weight of the phoneme feature of each phoneme to obtain the fusion feature of the audio frame, determines the phoneme corresponding to each audio frame based on the fusion feature of each audio frame, determines the start and end moments of each phoneme based on the phoneme corresponding to each audio frame, and sends them to the examinee terminal, so that the examinee terminal directly presents the start and end moments of each phoneme; in response to the examinee user's scoring operation, the examinee terminal can display the scoring result for each phoneme, where the scoring result can be a mark indicating whether the pronunciation of the phoneme is correct.
  • that is, the embodiment of the present application mainly provides an automated tool for phoneme labeling, which marks the corresponding position of each phoneme of the given text in the audio data, and on this basis whether a phoneme or word is read aloud incorrectly can further be marked; this effectively reduces the cost of manual labeling and provides a more convenient self-check environment for the subsequent self-check of candidates.
  • Phoneme mandatory alignment refers to aligning the given phoneme sequence text with the corresponding audio to obtain the time position of each phoneme in the text in the audio.
  • Phoneme alignment has different applications in speech processing, such as speech recognition, speech keyword detection, etc.
  • in the embodiment of the present application, the audio features and the text sequence are processed by the attention mechanism to obtain fused audio and text features, and phoneme classification is performed on each frame of the audio; in order to make the alignment more accurate, an auxiliary task is added, such as judging whether each frame of the audio is silent; at the same time, the resulting attention weight matrix is constrained to achieve a more precise alignment.
  • FIG. 4A is a schematic diagram of an interface of an audio processing method based on artificial intelligence provided in an embodiment of the present application.
  • a read aloud button 402A and an end reading button 403A are displayed in the human-computer interaction interface 401A, and the given text "What are you doing?" is also displayed in the human-computer interaction interface 401A.
  • in response to the trigger operation of the examinee user on the read aloud button 402A, the examinee terminal receives audio data corresponding to the given text, and in response to the trigger operation of the examinee user on the end reading button 403A, the examinee terminal stops receiving audio data corresponding to the given text.
  • FIG. 4B is a schematic interface diagram of an audio processing method based on artificial intelligence provided in an embodiment of the present application.
  • the phoneme labeling function can be embedded in a webpage or in a client.
  • the phoneme-level labeling process for pronunciation is as follows.
  • the given text 403B and a label button 402B are displayed in the human-computer interaction interface 401B; in response to a trigger operation on the label button 402B, the labeling page for the given text 403B is displayed in the human-computer interaction interface 401B.
  • FIG. 4C is a schematic interface diagram of an audio processing method based on artificial intelligence provided in an embodiment of the present application.
  • an annotation page 403C is displayed in the human-computer interaction interface 401C, and the annotation page 403C displays the start and end times of the phoneme 402C in the audio and the start and end times of the word 404C in the audio, where the start and end times of the word 404C in the audio are determined by the start and end times of the phoneme 402C in the audio.
  • FIG. 4D is a schematic diagram of an interface of an audio processing method based on artificial intelligence provided in an embodiment of the present application.
  • an annotation page 403D is displayed in the human-computer interaction interface 401D, and the annotation page 403D displays the start and end times of the phoneme 402D in the audio and the start and end times of the word 404D in the audio, where the start and end times of the word 404D in the audio are determined by the start and end times of the phoneme 402D in the audio; the human-computer interaction interface 401D therefore displays the divided phonemes, and in response to the user's labeling operation on a phoneme, the pronunciation label 405D for that phoneme, for example whether the phoneme is read incorrectly, is displayed on the last layer of the annotation page.
  • FIG. 5 it is a schematic flow chart of an audio processing method based on artificial intelligence provided in an embodiment of the present application.
  • the overall flow of the service based on phoneme forced alignment is shown in FIG. 5, and the steps are as follows: after the webpage of the phoneme labeling tool is opened, the user can select the audio that needs to be labeled and the corresponding follow-up text; in response to the user's selection operation, the audio that needs to be labeled and the corresponding phoneme text sequence (from the follow-up text of the question) are determined and labeling starts; the webpage sends the audio data and the phoneme text sequence (from the follow-up text of the question) to the server; the server sends the audio data and the phoneme text sequence (from the follow-up text of the question) to the phoneme forced alignment module; the phoneme forced alignment module returns the start and end time of each phoneme in the audio data (phoneme boundary information) to the server; the server returns the audio segmented based on the phoneme boundary information to the user; in response to the user's annotation operation, pronunciation labeling at the phoneme level is performed based on each segmented phoneme pronunciation segment.
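  • A minimal sketch of what the data exchanged between the labeling webpage and the server could look like on the client side (the URL, field names and JSON shape are assumptions for illustration; the application only describes which data is exchanged, not its format):

```python
import requests

def request_alignment(server_url, audio_path, follow_up_text):
    """Send the recorded audio and the follow-up text, receive phoneme boundaries."""
    with open(audio_path, "rb") as f:
        response = requests.post(
            f"{server_url}/align",
            files={"audio": f},
            data={"phoneme_text": follow_up_text},
        )
    # Expected response: one entry per phoneme with its start/end time in the audio,
    # i.e. the phoneme boundary information returned by the forced alignment module.
    return response.json()  # e.g. [{"phoneme": "W", "start": 0.0, "end": 0.06}, ...]
```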
  • the phoneme alignment model provided by the embodiment of the present application consists of a phoneme encoder, an audio encoder, an attention fusion network, a phoneme classification network, and a loudness classification network.
  • the phoneme encoder is used to extract phoneme features
  • the audio encoder is used to extract audio features.
  • the audio features of each audio frame are mapped to obtain the weight of the phoneme feature of each phoneme, and based on the weight of the phoneme feature of each phoneme, the audio features of the audio frame and the phoneme features of at least one phoneme are fused to obtain the fusion feature of the audio frame; the fusion feature contains the information of the audio features and the information of the phoneme features.
  • the phoneme classification network (a fully connected layer) and the loudness classification network (a fully connected layer) are connected after the attention fusion network; the phoneme classification network performs phoneme classification for each audio frame over a total of 40 candidate phonemes (the 39 phonemes of the phoneme dictionary plus the silent phoneme), and the loudness classification network classifies whether each audio frame is a silent audio frame (silent or non-silent).
  • the audio feature representation is obtained with an audio encoder; the embodiment of the present application uses a pre-trained acoustic model for audio feature extraction, such as a sound-to-vector (wav2vec-style) model composed of multiple convolutional layers, which is pre-trained on a large amount of unlabeled audio based on a contrastive loss.
  • when training the phoneme alignment model, the audio data (audio waveform features) is input into the pre-trained network structure, and the audio features of each audio frame in the audio data are output.
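  • A minimal sketch of such a convolutional audio encoder is shown below; the layer count, kernel sizes, strides and feature dimension are illustrative assumptions rather than the exact wav2vec configuration:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Stack of 1-D convolutions that turns a raw waveform into per-frame features."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Each conv layer downsamples the waveform; the final time resolution
        # defines what the text calls an "audio frame".
        self.convs = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=4, stride=2), nn.GELU(),
        )
        self.norm = nn.LayerNorm(feat_dim)     # normalization of the conv features

    def forward(self, waveform):               # waveform: (batch, samples)
        x = self.convs(waveform.unsqueeze(1))  # (batch, feat_dim, frames)
        x = x.transpose(1, 2)                  # (batch, frames, feat_dim)
        return self.norm(x)                    # per-frame audio features
```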
  • the phoneme features are obtained with a phoneme encoder; the embodiment of the present application uses phoneme encoding to extract phoneme features, representing the characteristics of each phoneme with a unique vector (the characteristic representation feature), which is randomly initialized for each phoneme.
  • in order to make the representation of the same phoneme differ at different positions within a word, a position vector (position representation feature) of each phoneme is also randomly initialized, covering four positions: when a word contains multiple phonemes, the phoneme positions are the start (B), middle (I) and end (E) of the word, and when a word contains a single phoneme, its position is represented by S.
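  • A minimal sketch of such a phoneme encoder, assuming 40 candidate phonemes and the four B/I/E/S positions; the summation of the two embeddings follows the description above, while the embedding size and the example ids are assumptions:

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sums a per-phoneme characteristic embedding with a B/I/E/S position embedding."""
    def __init__(self, num_phonemes=40, num_positions=4, feat_dim=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, feat_dim)    # characteristic representation
        self.position_emb = nn.Embedding(num_positions, feat_dim)  # B=0, I=1, E=2, S=3

    def forward(self, phoneme_ids, position_ids):  # both: (batch, phonemes_in_text)
        return self.phoneme_emb(phoneme_ids) + self.position_emb(position_ids)

# Example: a two-phoneme word with positions B and E (the ids 11 and 2 are assumptions).
encoder = PhonemeEncoder()
phoneme_feats = encoder(torch.tensor([[11, 2]]), torch.tensor([[0, 2]]))
```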
  • the phoneme features and audio features are fused based on an attention mechanism; the embodiment of the present application uses the attention mechanism to fuse phoneme features and audio features, where the attention mechanism models the relationship between the query vector Q, the key vector K and the value vector V, see formulas (8) and (9):

    AttentionScore(Q, K) = softmax(Q·K^T / √d_k)    (8)
    Attention(Q, K, V) = AttentionScore(Q, K) · V    (9)

  • where the audio feature of each audio frame is used as the query vector Q, the phoneme features H_phone of all phonemes of the given text are used as the key vector K and the value vector V, AttentionScore(Q, K) is the weight, Attention(Q, K, V) is the attention result of each audio frame over all phonemes, and d_k is the dimension of the key vector K.
  • the matrix obtained based on the attention mechanism is concatenated with the audio features to finally obtain the fusion features, see formula (10):

    H_i^fusion = concat(Attention_i, H_i^audio)    (10)

  • where Attention_i is the attention result of audio frame i, which is a matrix in which each column represents the attention result between audio frame i and one of the phonemes, H_i^audio is the audio feature of audio frame i, H_phone is the phoneme feature of all phonemes of the given text, and H_i^fusion is the fusion feature corresponding to each audio frame.
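  • A minimal sketch of this fusion step, using single-head scaled dot-product attention followed by concatenation; the linear projections for Q, K and V are assumptions consistent with the query/key/value transformations described above:

```python
import math
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses per-frame audio features with the phoneme features of the given text."""
    def __init__(self, dim=512):
        super().__init__()
        self.wq = nn.Linear(dim, dim)  # query transform of the audio feature
        self.wk = nn.Linear(dim, dim)  # key transform of the phoneme features
        self.wv = nn.Linear(dim, dim)  # value transform of the phoneme features

    def forward(self, audio_feats, phoneme_feats):
        # audio_feats: (frames, dim), phoneme_feats: (num_phonemes, dim)
        q, k, v = self.wq(audio_feats), self.wk(phoneme_feats), self.wv(phoneme_feats)
        scores = q @ k.T / math.sqrt(k.shape[-1])            # formula (8), before softmax
        weights = scores.softmax(dim=-1)                     # weight of each phoneme per frame
        attended = weights @ v                               # formula (9)
        fusion = torch.cat([attended, audio_feats], dim=-1)  # formula (10): concatenation
        return fusion, weights
```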
  • during the training of the phoneme alignment model, the cross-entropy loss is used to calculate the losses of the two classification tasks, see formula (11) and formula (12), which have the same form as formulas (4) and (5) above:

    L_phone = -(1/m)·Σ_{i=1..m} Σ_{j=1..c} y_ij·log(p_ij)    (11)
    L_sil = -(1/m)·Σ_{i=1..m} [y_i·log(p_i) + (1 - y_i)·log(1 - p_i)]    (12)

  • where L_phone is the phoneme classification loss (first phoneme category loss), L_sil is the loudness classification loss (second loudness category loss), m is the number of audio frames, c is the number of candidate phonemes, and y_ij, p_ij, y_i and p_i are defined as for formulas (4) and (5).
  • in order to better fuse the phoneme features and the audio feature representation, the weight matrix in the embodiment of the present application is constrained, that is, attention weight constraints are applied, wherein each row of the matrix represents an audio frame and each column represents the probability distribution of that audio frame over each phoneme; the loss between the phoneme probability distribution of each audio frame and the phoneme actually corresponding to that audio frame is calculated to obtain the attention mechanism loss, see formula (13), which has the same form as formula (6):

    L_align = -(1/m)·Σ_{i=1..m} Σ_{j=1..N_p} a_ij·log(α_ij)    (13)

  • where L_align is the attention mechanism loss, m is the number of audio frames, N_p is the number of phonemes in the given text, a_ij is 1 or 0 (1 indicates that the i-th audio frame is aligned with the j-th phoneme, 0 indicates that it is not), and α_ij is the attention weight between the i-th audio frame and the j-th phoneme.
  • the joint loss of the entire phoneme alignment network consists of three parts: the phoneme classification loss (first phoneme category loss), the loudness classification loss (second loudness category loss) and the alignment loss (third alignment loss); the three losses are weighted and summed with different weights, and the final joint loss is shown in formula (14):

    L_total = λ·L_phone + β·L_sil + γ·L_align    (14)

  • where the weight of each loss (λ, β and γ) is a preset weight and the three weights sum to 1, L_phone is the phoneme classification loss (first phoneme category loss), L_sil is the loudness classification loss (second loudness category loss), L_align is the alignment loss (third alignment loss), and L_total is the joint loss.
  • FIG. 7 is a schematic data flow diagram of an audio processing method based on artificial intelligence provided in an embodiment of the present application.
  • the phoneme alignment model includes an attention fusion network, a phoneme classification network (corresponding to the first task) and a loudness classification network (corresponding to the second task); the input of the audio encoder is the audio data and its output is the audio feature (in vector form) of each audio frame included in the audio data, the input of the phoneme encoder is the phoneme sequence (from the given text) and its output is the phoneme feature of each phoneme (also in vector form), and the input of the attention fusion network is the output of the audio encoder together with the output of the phoneme encoder, its output being the fusion feature of the phoneme features and the audio features.
  • the audio feature of each audio frame and all phonemes are processed by the attention mechanism to obtain the fusion features, from which the representation of the candidate phoneme corresponding to the audio frame and the representation of whether it is silent are determined.
  • the fusion features are classified by two parallel heads: the phoneme classification network outputs the probability that each audio frame belongs to each candidate phoneme, and the loudness classification network outputs the probability that each audio frame belongs to each loudness category.
  • the loudness categories include silent and non-silent, for example, non-silent is marked as 1 and silent is marked as 0, and the candidate phonemes are W, IH, L and so on.
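  • A minimal sketch of the two parallel classification heads on top of the fusion features, each a single fully connected layer as described, with 40 candidate phonemes; the fusion dimension is an assumption:

```python
import torch.nn as nn

class ClassificationHeads(nn.Module):
    """Two parallel fully connected heads sharing the same fusion features."""
    def __init__(self, fusion_dim=1024, num_phonemes=40):
        super().__init__()
        self.phone_head = nn.Linear(fusion_dim, num_phonemes)  # first task: which phoneme
        self.sil_head = nn.Linear(fusion_dim, 1)               # second task: silent or not

    def forward(self, fusion_feats):                   # fusion_feats: (frames, fusion_dim)
        phone_logits = self.phone_head(fusion_feats)   # per-frame scores over 40 phonemes
        sil_logits = self.sil_head(fusion_feats).squeeze(-1)  # per-frame silence score
        return phone_logits, sil_logits
```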
  • the embodiment of the present application conducts experiments on two public data sets, the TIMIT data set and the Buckeye data set, both of which provide a time mark for each phoneme in the audio; the following indicators are then calculated, including at least one of: the precision P between the phoneme boundaries predicted by the phoneme alignment model and the actual phoneme boundaries, the recall R and the F1 score; in addition, to address the problem that the F1 score can still be relatively high when the recall is high but the precision is very low, the R-value is introduced for evaluation, see formulas (15)-(17):

    r_1 = √((1 - R)^2 + OS^2)    (15)
    r_2 = (-OS + R - 1) / √2    (16)
    R-value = 1 - (|r_1| + |r_2|) / 2    (17)

  • where P is the precision, R is the recall, and OS = R/P - 1.
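  • A minimal sketch of how these boundary metrics could be computed from predicted and reference phoneme boundary times; the 20 ms matching tolerance is an assumption commonly used for this task, not a value stated in the application:

```python
import math

def boundary_metrics(pred, ref, tol=0.02):
    """Precision, recall, F1 and R-value for phoneme boundary detection.

    pred, ref: lists of boundary times in seconds; tol: matching tolerance (assumed 20 ms).
    """
    matched, used = 0, set()
    for p in pred:
        for i, r in enumerate(ref):
            if i not in used and abs(p - r) <= tol:
                matched += 1
                used.add(i)
                break
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    os_ = recall / precision - 1 if precision else 0.0          # OS = R/P - 1
    r1 = math.sqrt((1 - recall) ** 2 + os_ ** 2)                # formula (15)
    r2 = (-os_ + recall - 1) / math.sqrt(2)                     # formula (16)
    r_value = 1 - (abs(r1) + abs(r2)) / 2                       # formula (17)
    return precision, recall, f1, r_value
```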
  • Table 1 The scores of each model in each data set in the embodiment of this application and related technologies
  • Figures 8A-8C are the alignment time matrix of the artificial intelligence-based audio processing method provided by the embodiment of the present application.
  • the phoneme alignment matrix is drawn, where the vertical axis is the audio frame divided by time, and the horizontal axis is each phoneme.
  • Figure 8A shows the alignment time matrix without the attention weight constraint added, Figure 8B shows the alignment time matrix with the constraint added, and Figure 8C shows the real alignment time matrix; it can be seen that the matrix constrained by the attention mechanism better matches the actual alignment time of the phonemes and the audio.
  • the software modules in the device 255 may include: a phoneme module 2551, configured to obtain at least one phoneme of a given text and determine the phoneme feature of each phoneme; an audio module 2552, configured to obtain audio data corresponding to the given text and determine the audio feature of each audio frame included in the audio data; a fusion module 2553, configured to perform the following processing for each audio frame: map the audio feature of the audio frame to obtain the weight of the phoneme feature of each phoneme, and, based on the weight of the phoneme feature of each phoneme, fuse the audio feature of the audio frame and the phoneme feature of at least one phoneme to obtain the fusion feature of each audio frame; and an alignment module 2554, configured to determine the phoneme corresponding to each audio frame based on the fusion feature of each audio frame, and determine the start and end moments of each phoneme based on the phoneme corresponding to each audio frame.
  • the audio module 2552 is further configured to: perform feature extraction processing on at least one audio frame to obtain a convolution feature extraction result corresponding to each audio frame; and perform normalization processing on the convolution feature extraction result of each audio frame to obtain the audio features of each audio frame.
  • the phoneme module 2551 is further configured to: perform the following processing for each phoneme: determine the characteristic representation feature of the phoneme, wherein the characteristic representation feature represents the characteristic of the phoneme; determine the position representation feature of the phoneme, wherein the position representation feature represents the position of the phoneme in the corresponding text unit; and add the position representation feature and the characteristic representation feature together to obtain the phoneme feature of the phoneme.
  • the fusion module 2553 is further configured to: perform the following processing for each phoneme: perform value vector transformation processing on the phoneme feature of the phoneme to obtain a value vector; multiply the weight corresponding to the phoneme by the value vector to obtain the attention result corresponding to the phoneme; and fuse the attention result corresponding to the at least one phoneme and the audio feature of the audio frame to obtain the fusion feature corresponding to the audio frame.
  • the fusion module 2553 is further configured to: perform query vector transformation processing on the audio features to obtain a query vector; perform key vector transformation processing on the phoneme features to obtain a key vector; multiply the query vector by the transpose of the key vector to obtain a multiplication result; obtain the square root of the dimension of the key vector; determine the ratio of the multiplication result to the square root as the attention feature; and perform maximum likelihood processing on the attention feature to obtain the weight of the corresponding phoneme.
  • determining the phoneme corresponding to each audio frame is implemented by calling a phoneme classification network
  • the phoneme classification network includes at least one cascaded phoneme fully connected layer
  • the alignment module 2554 is also configured to perform the following processing for each audio frame: when the number of phoneme fully connected layers is one, perform the first full connection processing on the fusion feature through the phoneme fully connected layer to obtain the first probability that the audio frame belongs to each candidate phoneme; when the number of phoneme fully connected layers is multiple, perform the first full connection processing on the input of the n-th phoneme fully connected layer among the N cascaded phoneme fully connected layers, and transmit the n-th phoneme full connection result output by the n-th phoneme fully connected layer to the (n+1)-th phoneme fully connected layer to continue the first full connection processing, obtaining the (n+1)-th phoneme full connection result corresponding to the (n+1)-th phoneme fully connected layer; wherein N is an integer greater than or equal to 2, n is an integer variable increasing from 1 with a value range of 1≤n<N; when n is 1, the input of the n-th phoneme fully connected layer is the fusion feature; when 2≤n<N, the input of the n-th phoneme fully connected layer is the (n-1)-th phoneme full connection result output by the (n-1)-th phoneme fully connected layer; when n is N-1, the (n+1)-th phoneme full connection result is the first probability that the audio frame belongs to each candidate phoneme; and the candidate phoneme with the largest first probability is determined as the phoneme corresponding to the audio frame.
  • the alignment module 2554 is further configured to: determine at least one audio frame corresponding to each phoneme based on the phoneme corresponding to each audio frame, and perform the following processing for each phoneme: when the phoneme corresponds to multiple consecutive audio frames, the start and end moments of the consecutive audio frames corresponding to the phoneme are determined as the start and end moments of the phoneme; when the phoneme corresponds to a single audio frame, the moment of the audio frame corresponding to the phoneme is determined as the start and end moment of the phoneme, as illustrated by the sketch below.
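  • A minimal sketch of this grouping step, assuming a fixed frame duration and per-frame phoneme predictions; the 20 ms frame step and the label strings are illustrative assumptions:

```python
def phoneme_boundaries(frame_phonemes, frame_step=0.02):
    """Group consecutive frames with the same predicted phoneme into segments.

    frame_phonemes: per-frame phoneme labels, e.g. ["W", "W", "IH", "sil", ...]
    Returns a list of (phoneme, start_time, end_time) tuples in seconds.
    """
    segments = []
    start = 0
    for i in range(1, len(frame_phonemes) + 1):
        # close the current run when the phoneme changes or the sequence ends
        if i == len(frame_phonemes) or frame_phonemes[i] != frame_phonemes[start]:
            segments.append((frame_phonemes[start], start * frame_step, i * frame_step))
            start = i
    return segments

# Example: frames 1-3 predicted as "W" yield the segment ("W", 0.0, 0.06).
print(phoneme_boundaries(["W", "W", "W", "sil", "IH"]))
```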
  • mapping the audio feature of the audio frame to obtain the weight of the phoneme feature of each phoneme, and fusing the audio feature of the audio frame with the phoneme feature of at least one phoneme based on the weight of the phoneme feature of each phoneme to obtain the fusion feature of each audio frame, is implemented by calling the attention fusion network, and determining the phoneme corresponding to each audio frame is implemented by calling the phoneme classification network.
  • the phoneme classification network shares the attention fusion network with the loudness classification network, and the device also includes a training module 2555 configured to: obtain an audio data sample and a given text sample; obtain at least one phoneme sample of the given text sample and determine the phoneme feature of each phoneme sample through the phoneme encoder; determine the audio feature of each audio frame sample included in the audio data sample through the audio encoder; perform the following processing for each audio frame sample: forward-propagate the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample through the network composed of the attention fusion network and the phoneme classification network to obtain the first forward propagation result; perform the following processing for each audio frame sample: forward-propagate the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample through the network composed of the attention fusion network and the loudness classification network to obtain the second forward propagation result; determine the joint loss according to the first forward propagation result and the second forward propagation result; and update the parameters of the attention fusion network, the phoneme classification network, the loudness classification network, the audio encoder and the phoneme encoder according to the joint loss.
  • the training module 2555 is also configured to: perform fusion processing based on the attention mechanism on the audio feature of the audio frame sample and the phoneme feature of at least one phoneme sample through the attention fusion network to obtain the fusion feature corresponding to each audio frame sample; and perform the second full connection processing on the fusion feature of each audio frame sample through the loudness classification network to obtain the second probability that each audio frame sample belongs to each loudness category, the second probabilities that each audio frame sample belongs to each loudness category constituting the second forward propagation result.
  • the training module 2555 is also configured to: perform the following processing for each phoneme sample through the attention layer of the attention fusion network: determine the weight of the corresponding phoneme sample based on the audio feature of the audio frame sample and the phoneme feature of the phoneme sample; perform value vector transformation processing on the phoneme feature of the phoneme sample, and multiply the weight of the corresponding phoneme sample by the value vector transformation result to obtain the attention result corresponding to the phoneme sample; fuse the attention result corresponding to each phoneme sample and the audio feature of the audio frame sample through the fusion layer of the attention fusion network to obtain the fusion feature corresponding to the audio frame sample; perform the first full connection processing on the fusion feature of the audio frame sample through the phoneme classification network to obtain the third probability that the audio frame sample belongs to each candidate phoneme; and compose the first forward propagation result from the third probability and the weight.
  • the training module 2555 is further configured to: determine the first phoneme category loss based on the third probability that each audio frame sample corresponds to a plurality of candidate phonemes and the pre-marked candidate phoneme of each audio frame sample; determine the second loudness category loss based on the second probability that each audio frame sample corresponds to multiple loudness categories and the pre-marked loudness category of each audio frame sample; determine the third alignment loss based on the weight of each audio frame sample corresponding to each phoneme sample and the pre-marked alignment identifier of each audio frame sample corresponding to each phoneme sample; and fuse the first phoneme category loss, the second loudness category loss and the third alignment loss to obtain the joint loss.
  • An embodiment of the present application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned method of the embodiments of the present application.
  • An embodiment of the present application provides a computer-readable storage medium storing executable instructions; when the executable instructions are executed by a processor, they cause the processor to execute the artificial intelligence-based audio processing method provided by the embodiments of the present application, for example, the artificial intelligence-based audio processing method shown in FIGS. 3A-3C.
  • the computer-readable storage medium can be a memory such as an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc or CD-ROM, or can be various devices including one of the above memories or any combination thereof.
  • executable instructions may take the form of programs, software, software modules, scripts or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files that store one or more modules, subroutines or sections of code).
  • executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
  • to sum up, in the embodiments of the present application the audio features and the text sequence are processed by the attention mechanism to obtain the fusion features, so the fusion features can effectively represent the relationship between audio frames and phonemes; performing phoneme classification on each audio frame of the audio based on the fusion features can then effectively improve the classification accuracy, thereby improving the phoneme alignment accuracy.


Abstract

一种基于人工智能的音频处理方法、装置、电子设备、计算机程序产品及计算机可读存储介质,该方法包括:获取给定文本的至少一个音素,并确定每个音素的音素特征(101);获取对应给定文本的音频数据,并确定音频数据包括的每个音频帧的音频特征(102);针对每个音频帧执行以下处理:对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到音频帧的融合特征(103);基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻(104)。

Description

基于人工智能的音频处理方法、装置、电子设备、计算机程序产品及计算机可读存储介质
相关申请的交叉引用
本申请基于申请号为202111421900.X、申请日为2021年11月26日的中国专利申请提出,并要求中国专利申请的优先权,中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及人工智能技术,尤其涉及一种基于人工智能的音频处理方法、装置、电子设备、计算机程序产品及计算机可读存储介质。
背景技术
人工智能(AI,Artificial Intelligence)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法和技术及应用系统。
越来越多的人工智能产品具备语音交互的功能,语音交互可以应用于各种语音评分系统,例如,语言教育应用的语言测试系统,口语考试系统等等,为了正常使用语音交互功能,需要将音素与文本进行对齐,并尽可能地提高对齐准确度,但是相关技术中无法准确将音素与文本进行对齐。
发明内容
本申请实施例提供一种基于人工智能的音频处理方法、装置、电子设备、计算机程序产品及计算机可读存储介质,能够提高音素的对齐的准确度。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种基于人工智能的音频处理方法,包括:
获取给定文本的至少一个音素,并确定每个所述音素的音素特征;
获取对应所述给定文本的音频数据,并确定所述音频数据包括的每个音频帧的音频特征;
针对每个所述音频帧执行以下处理:对所述音频帧的音频特征进行映射处理,得到每个所述音素的音素特征的权重,基于每个所述音素的音素特征的权重,对所述音频帧的音频特征以及至少一个所述音素的音素特征进行融合处理,得到所述音频帧的融合特征;
基于每个所述音频帧的融合特征,确定每个所述音频帧对应的音素,并基于每个所述音频帧对应的音素,确定每个所述音素的起止时刻。
本申请实施例提供一种基于人工智能的音频处理装置,所述方法由电子设备执行,包括:
音素模块,配置为获取给定文本的至少一个音素,并确定每个所述音素的音素特征;
音频模块,配置为获取对应所述给定文本的音频数据,并确定所述音频数据包括的每个音频帧的音频特征;
融合模块,配置为针对每个所述音频帧执行以下处理:对所述音频帧的音频特征进行映射处理,得到每个所述音素的音素特征的权重,基于每个所述音素的音素特征的权重,对所述音频帧的音频特征以及至少一个所述音素的音素特征进行融合处理,得到所述音频帧的融合特征;
对齐模块,配置为基于每个所述音频帧的融合特征,确定每个所述音频帧对应的音素,并基于每个所述音频帧对应的音素,确定每个所述音素的起止时刻。
本申请实施例提供一种电子设备,包括:
存储器,用于存储可执行指令;
处理器,用于执行所述存储器中存储的计算机可执行指令时,实现本申请实施例提供的基于人工智能的音频处理方法。
本申请实施例提供一种计算机可读存储介质,存储有计算机可执行指令,用于被处理器执行时,实现本申请实施例提供的基于人工智能的音频处理方法。
本申请实施例提供一种计算机程序产品,包括计算机程序或计算机可执行指令,所述计算机程序或计算机可执行指令被处理器执行时实现本申请实施例提供的基于人工智能的音频处理方法。
本申请实施例具有以下有益效果:
通过本申请实施例基于音频特征确定出文本序列中每个音素的权重,再基于每个音素的权重将音素特征与音频特征与文本序列进行融合处理,得到融合特征,因此融合特征能够有效表征音频帧与音素之间的关系,再基于融合特征对音频中每个音频帧进行音素分类,可以有效提高分类准确度,从而提高音素对齐准确度。
附图说明
图1是本申请实施例提供的基于人工智能的音频处理系统的结构示意图;
图2是本申请实施例提供的电子设备的结构示意图;
图3A-3C是本申请实施例提供的基于人工智能的音频处理方法的流程示意图;
图4A-4D是本申请实施例提供的基于人工智能的音频处理方法的界面示意图;
图5是本申请实施例提供的基于人工智能的音频处理方法的流程示意图;
图6是本申请实施例提供的基于人工智能的音频处理方法的音素对齐模型的结构示意图;
图7是本申请实施例提供的基于人工智能的音频处理方法的数据流程示意图;
图8A-8C是本申请实施例提供的基于人工智能的音频处理方法的对齐时间矩阵;
图9是本申请实施例提供的音频编码器的结构示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
在以下的描述中,所涉及的术语“第一\第二\第三”仅仅是是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二\第三”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
对本申请实施例进行进一步详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)语音识别技术:自动语音识别(ASR,Automatic Speech Recognition),其目标是将人类的语音中的词汇内容转换为计算机可读的输入,例如按键、二进制编码或者字符序列。
2)隐马尔可夫模型(HMM,Hidden Markov Model):是一种统计模型,用来描述一个含有隐含未知参数的马尔可夫过程。
3)最大似然估计:(MLE,Maximum Likelihood Estimation),也称极大似然估计,是用来估计一个概率模型的参数的一种方法。
4)判别模型:在机器学习领域判别模型是一种对未知数据y与已知数据x之间关系进行建模的方法。判别模型是一种基于概率理论的方法,已知输入变量x,判别模型通过构建条件概率分布P(y|x)预测y。
5)全连接(FC,Full Connection):全连接层中的每个神经元与其前一层的所有神经元进行全连接.全连接层可以整合卷积层或者池化层中具有类别区分性的局部信息。
6)皮尔逊相关系数:在统计学中皮尔逊相关系数用于度量两个变量X和Y之间的线性相关,其值介于-1与1之间。
7)支持向量机(SVM,support vector machine):在机器学习中常简称为支持向量网络,是在分类与回归分析中分析数据的监督式学习模型。
8)音素(phone),是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个发音动作作为一个音素,音素分为元音与辅音两大类,在本申请实施例中音素还包括静音音素,例如,某个音频帧是静音的,即该音频帧对应静音音素。
9)音素对齐,音素对齐指的是将音素与音频进行对齐,即确定出给定文本中每个音素的起止时间。
相关技术中音素对齐方式有两种,一种是不依赖于给定文本,一种是依赖于文本的,不依赖于文本的方式通常对音素边界进行分类,判断音频数据中某一音频帧的时间是否是音素边界,例如,采用维特比算法来区分发音段和非发音段,或者采用循环神经网络对音素边界进行分类,依赖于文本的方式通常 采用HMM基于最大似然来得到最有可能的序列,或者采用判别模型,或者设计对齐函数并利用支持向量机进行音素对齐。
相关技术中基于HMM的对齐方式主要将音素边界判断作为隐藏状态,采用最大似然进行优化,没有直接显式地优化音素对齐,相关技术中其他音素对齐方式需要设计人工设计对齐函数并进行人工特征工程。
本申请实施例提出一种基于人工智能的音频处理方法,能够在不依赖于人工设计对齐函数的前提下基于包括注意力机制的神经网络自动学习音素序列与音频数据的映射关系,并在训练阶段显式优化损失函数,联合多任务进行训练,并在注意力处理阶段通过损失函数进行约束学习,有效提高音素对齐的准确度。
针对相关技术的上述问题,本申请实施例提供一种基于人工智能的音频处理方法、装置、电子设备、计算机程序产品和计算机可读存储介质,能够将音频特征与文本序列进行注意力机制计算得到融合特征,从而基于融合特征对音频中每帧进行音素分类,有效提高分类准确度,从而提高音素对齐准确度。
下面说明本申请实施例提供的电子设备的示例性应用,本申请实施例提供的电子设备可以实施为服务器。下面,将说明电子设备实施为服务器时示例性应用。
参见图1,图1是本申请实施例提供的基于人工智能的音频处理系统的结构示意图,音频处理系统可以用于口语考试场景,在音频处理系统中,终端400通过网络300连接服务器200,网络可以是广域网或者局域网,又或者是二者的组合。
在一些实施例中,音频处理系统的功能是基于服务器200中的各个模块实现的,在用户使用终端400的过程中,终端400接收用户针对给定文本的音频数据,终端400将音频数据以及给定文本发送至服务器200,服务器200确定给定文本中每个音素的音素特征以及音频数据中每个音频帧的音频特征,针对每个音频帧执行以下处理:对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻,将每个音素的起止时刻发送至终端400,以使终端400直接呈现每个音素的起止时刻,从而完成了音素对齐过程。
以音频处理系统应用于口语考试场景为例,口语考试题目要求用户使用英语跟读给定文本,终端400接收到用户对应给定文本的音频数据,终端400将音频数据发送至服务器200,服务器200对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到每个音频帧的融合特征,基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻,并发送至终端400,以使终端400直接呈现每个音素的起止时刻,响应于用户的评分操作,终端400可以显示针对每个音素的评分结果,参与跟读的用户与进行评分的用户可以是相同或者不同用户。
以音频处理系统应用于口语练习场景为例,口语练习题目要求用户使用英语跟读给定文本,终端400接收到用户对应给定文本的音频数据,终端400将音频数据发送至服务器200,服务器200对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到每个音频帧的融合特征,基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻,并发送至终端400,以使终端400直接呈现每个音素的起止时刻,从而想用于用户针对每个音素的播放操作,终端400可以单独播放对应音素的音频帧。
作为上述示例的服务器200进行音素对齐的替代方案,可以由终端对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到每个音频帧的融合特征,基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻,并直接呈现每个音素的起止时刻。
在一些实施例中,服务器200可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、智能语音交互设备、智能家电、车载终端等,但并不局限于此。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例中不做限制。
在一些实施例中,终端或服务器可以通过运行计算机程序来实现本申请实施例提供的音频处理方法。举例来说,计算机程序可以是操作系统中的原生程序或软件模块;可以是本地(Native)应用程序(APP,Application),即需要在操作系统中安装才能运行的程序,如口语考试APP或者口语学习APP;也可以是 小程序,即只需要下载到浏览器环境中就可以运行的程序;还可以是能够嵌入至任意APP中的小程序。总而言之,上述计算机程序可以是任意形式的应用程序、模块或插件。
接下来,说明本申请实施例提供的用于实施基于人工智能的音频处理方法的电子设备的结构,如前,本申请实施例提供的电子设备可以是图1中的服务器200。参见图2,图2是本申请实施例提供的服务器200的结构示意图,图2所示的服务器200包括:至少一个处理器210、存储器250、至少一个网络接口220。服务器200中的各个组件通过总线系统240耦合在一起。可理解,总线系统240用于实现这些组件之间的连接通信。总线系统240除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线系统240。
处理器210可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
存储器250可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器250可选地包括在物理位置上远离处理器210的一个或多个存储设备。
存储器250包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器250旨在包括任意适合类型的存储器。
在一些实施例中,存储器250能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统251,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;网络通信模块252,用于经由一个或多个(有线或无线)网络接口220到达其他计算设备,示例性的网络接口220包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等。
在一些实施例中,本申请实施例提供的基于人工智能的音频处理装置可以采用软件方式实现,图2示出了存储在存储器250中的基于人工智能的音频处理装置255,其可以是程序和插件等形式的软件,包括以下软件模块:音素模块2551、音频模块2552、融合模块2553、对齐模块2554和训练模块2555,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分,将在下文中说明各个模块的功能。
将结合本申请实施例提供的服务器200的示例性应用和实施,说明本申请实施例提供的基于人工智能的音频处理方法。
参见图6,图6是本申请实施例提供的基于人工智能的音频处理方法的音素对齐模型的结构示意图,音素对齐模型包括注意力融合网络、音素分类网络(对应第一任务)以及响度分类网络(对应第二任务),注意力融合网络用于对音素特征以及音频特征进行融合处理,使得注意力融合网络输出的融合特征被对应第一任务的响度分类网络与对应第二任务的音素分类网络共享,注意力融合网络的输入是基于音频数据得到的音频特征以及基于给定文本得到的音素特征,注意力融合网络的输出是音频特征和音素特征的融合特征,再通过响度分类网络和音素分类网络分别对融合特征进行全连接处理,分别得到响度分类结果和音素分类结果,响度分类网络可以是全连接层的结构,音素分类网络也可以是全连接层的结构,但是两者的参数不同,第一任务是从多个候选音素中识别出某个音频帧的音素,第二任务是判断某个音频帧是否为静音音频帧。
参见图6,音素对齐模型包括注意力融合网络、音素分类网络(对应第一任务)以及响度分类网络(对应第二任务),参见图7,图7是本申请实施例提供的基于人工智能的音频处理方法的数据流程示意图,音频编码器的输入是音频数据,音频编码器的输出是音频数据包括的每个音频帧的音频特征(向量形式),音素编码器的输入是音素序列(给定文本),音素编码器的输出是每个音素的音素特征(音素特征的数据形式是向量),注意力融合网络的输入是音频编码器的输出以及音素编码器的输出,注意力融合网络的输出是音素特征与音频特征的融合特征,通过两个并列的音素分类网络以及响度分类网络分别对融合特征进行分类处理,音素分类网络输出每个音频帧属于每个候选音素的概率,响度分类网络输出每个音频帧属于每个响度类别的概率,响度类别包括静音以及非静音,例如,非静音的标识为1,静音的标识为0,候选音素为W、IH、L等等。
以由图1中的服务器200执行本申请实施例提供的基于人工智能的音频处理方法为例,说明本申请实施例提供的基于人工智能的音频处理方法。
参见图3A,图3A是本申请实施例提供的基于人工智能的音频处理方法的流程示意图,将结合图3A示出的步骤101-104进行说明。
在步骤101中,获取给定文本的至少一个音素,并确定每个音素的音素特征。
在一些实施例中,确定每个音素的音素特征是通过调用音素编码器实现的,音素编码器包括音素特性表示网络以及音素位置表示网络,步骤101中确定每个音素的音素特征,可以通过以下技术方案实现:针对每个音素执行以下处理:通过音素特性表示网络确定音素的特性表示特征,其中,特性表示特征用于表征音素的特性;通过音素位置表示网络确定音素的位置表示特征,其中,位置表示特征用于表征音素在对应文本单元中的位置;将位置表示特征与特性表示特征进行相加处理,得到音素的音素特征。
作为示例,音素特性表示网络和音素位置表示网络是并列关系,音素特性表示网络和音素位置表示网络均是卷积神经网络,两个卷积神经网络包括的卷积层的数目不同,且每个卷积层的参数也各不相同。通过音素特性表示网络中级联的多个卷积层对音素进行卷积处理,得到音素的特性表示特征,通过音素位置表示网络中级联的多个卷积层对音频帧进行卷积处理,得到音频帧的位置表示特性。
作为示例,不同语言所包含的音素不同,以英语为例,当给定文本为ever forget时,给定文本的音素包括EH1、V、ER、sp、F、R、G、EH、T,其中,EH1、V、ER、F、R、G、EH、T为各不相同的音素,sp表征静音音素,静音也是候选音素的其中之一。通过音素特性表示网络对每个音素进行编码,得到每个音素的特性表示特征,例如,图6所示的E(HH),不同音素的特性表示特征不同,特性包括发音特性、含义特性等等,特性表示特征用于对不同音素进行区别,特性表示特征表征音素的特性。每个音素在对应的文本单元中具有四种位置可能性,文本单元是语句的最小单位,例如,在英语中,图6所示的给定文本(How are)的文本单元(How)是单词,当某个单词包含多个音素时,单词具有音素的开始位置(B)、中间位置(I)和结束位置(E),当某个单词包含一个音素时,利用S表示该音素的位置,通过音素位置表示网络对音素在对应文本单元中的位置进行编码,得到每个音素的位置表示特征,位置表示特征表征音素在对应文本单元中的位置,例如,图6所示的E(B),最终将每个音素的独特的特性表示特征(用于表征音素特性的向量)与位置表示特征(用于表征音素位置的向量)进行相加,得到最终的音素特征。通过这种音素编码方式能够有效表征每个音素的特性区别,并且还能够有效表征相同音素在不同位置的区别。
在步骤102中,获取对应给定文本的音频数据,并确定音频数据包括的每个音频帧的音频特征。
在一些实施例中,参见图9,图9是本申请实施例提供的音频编码器的结构示意图,图9示出的音频编码器包括多个级联的卷积网络以及归一化网络,步骤102中确定音频数据包括的每个音频帧的音频特征,可以通过以下技术方案实现:通过音频编码器包括的多个级联的卷积网络对至少一个音频帧进行特征提取处理,得到对应每个音频帧的卷积特征提取结果;通过音频编码器包括的归一化网络对每个音频帧的卷积特征提取结果进行归一化处理,得到每个音频帧的音频特征。
作为示例,基于音频编码器获取音频特征,通过多个级联的卷积网络将至少一个音频帧作为一个整体进行特征提取处理,若是存在多个音频帧时,多个卷积网络的输出是低频特征表示,例如,它对大约30毫秒的16千赫兹的音频数据进行编码,并且每隔设定时间的步长就会生成一个低频特征表示,从而得到每个音频帧的卷积特征提取结果,再通过归一化网络对每个音频帧的卷积特征提取结果进行归一化处理,得到每个音频帧的音频特征,音频编码器的结构可以为wav2vec的网络结构,音频编码器的参数是基于wav2vec的网络结构进行训练得到的。
wav2vec是一种卷积神经网络,卷积神经网络包括编码网络,编码网络是5层卷积结构,卷积神经网络还包括内容网络,内容网络是9层卷积结构。
在步骤103中,针对每个音频帧执行以下处理:对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到音频帧的融合特征。
在一些实施例中,步骤103是通过注意力融合网络实现的,注意力融合网络包括注意力层和融合层,步骤103中对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,可以通过以下技术方案实现:对音频特征进行查询向量变换处理,得到查询向量;对音素特征进行关键向量变换处理,得到关键向量;将查询向量以及关键向量的转置进行相乘处理,得到相乘结果;获取关键向量的维度的平方根;将相乘结果与平方根的比值确定为注意力特征;对注意力特征进行最大似然处理,得到对应音素的权重。基于音频帧的音频特征来获取对应每个音素的权重,可以获取音素与音频帧的关联信息,从而提高后续对齐的准确度。
作为示例,查询向量变换处理可以通过以下方式实施:将注意力层的第一参数Wq与音频特征进行相乘处理,可以得到查询向量Q,或者,将注意力层的第一参数Wq与音频特征进行相乘处理,得到第一相乘结果,将第一相乘结果与第四参数Bq相加,可以得到查询向量Q;关键向量变换处理可以通过以下方式实施:将注意力层的第二参数Wk与音频特征进行相乘处理,可以得到关键向量K,或者,将注意力层的第二参数Wk与音素特征进行相乘处理,得到第二相乘结果,将第二相乘结果与第五参数Bk相加,可以得到查询向量K,注意力层的第一参数、第二参数、第四参数以及第五参数均是通过对注意力融合网 络进行训练得到的。
作为示例,采用注意力机制对音素特征与音频特征进行融合,注意力机制用于建模查询向量Q、关键向量K以及值向量V之间的关系,参见公式(1)和(2):
AttentionScore(Q,K)=softmax(Q·K^T/√d_k)    (1);
Attention(Q,K,V)=AttentionScore(Q,K)*V    (2);
其中,基于每个音频帧的音频特征得到查询向量Q,基于给定文本的每个音素的音素特征H_phone得到每个音素的关键向量K和每个音素的值向量V,还可以直接将每个音频帧的音频特征作为查询向量,将给定文本的每个音素的音素特征H_phone作为每个音素的关键向量K和每个音素的值向量V,AttentionScore(Q,K)是每个音素的权重,Attention(Q,K,V)是每个音素的注意力结果,d_k是关键向量K的维度。
作为示例,对每个音频帧的音频特征进行查询向量变换处理得到查询向量Q,对给定文本的每个音素的音素特征H_phone进行关键向量变换处理,得到关键向量K,对给定文本的每个音素的音素特征H_phone进行值向量变换处理,得到值向量V,这些变换处理所涉及的参数可以通过对于音素对齐模型进行整体训练得到,还可以将每个音频帧的音频特征作为查询向量,或者将给定文本的每个音素的音素特征H_phone作为每个音素的关键向量K和每个音素的值向量V。
在一些实施例中,步骤103是通过注意力融合网络实现的,注意力网络包括注意力层和融合层,参见图3B,图3B是本申请实施例提供的基于人工智能的音频处理方法的流程示意图,步骤103中基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到每个音频帧的融合特征,可以通过针对每个音素执行图3B示出的步骤1031-1033进行说明。
在步骤1031中,对音素的音素特征进行值向量变换处理,得到值向量。
在步骤1032中,将对应音素的权重与值向量进行相乘处理,得到对应音素的注意力结果。
步骤1031和步骤1032均是通过注意力融合网络中的注意力层实现的,值向量变换处理可以通过以下方式实施:将注意力层的第三参数Wv与音素特征进行相乘处理,可以得到值向量V,或者,将注意力层的第三参数Wv与音素特征进行相乘处理,得到第三相乘结果,将第三相乘结果与第六参数Bv相加,可以得到值向量V;注意力层的第三参数以及第六参数均是通过对注意力融合网络进行训练得到的。
在步骤1033中,将对应至少一个音素的注意力结果以及音频帧的音频特征进行融合处理,得到对应音频帧的融合特征。
作为示例,步骤103是通过调用注意力融合网络实现的,注意力融合网络包括注意力层以及融合层,融合处理实际上是特征拼接过程,将基于某个音频帧得到的注意力结果与该音频帧的音频特征进行拼接处理,得到对应该音频帧的融合特征,参见公式(3):
H_i^fusion=concat(Attention_i, H_i^audio)    (3);
其中,Attention_i是音频帧i的注意力结果,音频帧i的注意力结果是矩阵,矩阵中每列代表所有音素中每个音素与音频帧i的注意力结果,H_i^audio是音频帧i的音频特征,H_phone是给定文本的所有音素的音素特征,H_i^fusion是对应每个音频帧的融合特征。
作为示例,注意力机制(Attention Mechanism)源于对人类视觉的研究。在认知科学中,由于信息处理的瓶颈,人类会选择性地关注所有信息的一部分,同时忽略其他可见的信息,注意力机制包括软性注意力机制(可分为基于输入项的软注意力(Item-wise Soft Attention)和基于位置的软注意力(Location-wise Soft Attention))、硬性注意力机制(可分为基于输入项的强注意力(Item-wise Hard Attention)和基于位置的强注意力(Location-wise Hard Attention))、自注意力机制(是注意力机制的变体,其减少了对外部信息的依赖,更擅长捕捉数据或特征的内部相关性。自注意力机制在文本中的应用,主要是通过计算单词间的互相影响,来解决长距离依赖问题)等等,注意力机制主要有两个方面:决定需要关注输入的哪部分;分配有限的信息处理资源给重要的部分。在深度学习中,注意力可以借助权重实现,通过权重来判断,音频帧与每个音素的关联性,针对不同的音频帧,音频帧对相同音素的注意力具有差异,从而将音频帧的音频特征与多个音素的音素特征进行融合时,音素特征的权重会存在差异。
在步骤104中,基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应 的音素,确定每个音素的起止时刻。
在一些实施例中,确定每个音频帧对应的音素是通过调用音素分类网络实现的,图6所示的音素分类网络包括至少一个级联的音素全连接层,步骤104中基于每个音频帧的融合特征,确定每个音频帧对应的音素,可以通过以下技术方案实现:针对每个音频帧执行以下处理:当音素全连接层的数目为一个时,通过音素全连接层对融合特征进行第一全连接处理,得到音频帧属于每个候选音素的第一概率;当音素全连接层的数目为多个时,通过N个级联的音素全连接层中的第n音素全连接层,对第n音素全连接层的输入进行第一全连接处理,并将第n音素全连接层输出的第n音素全连接结果传输到第n+1音素全连接层以继续进行第一全连接处理,得到对应第n+1音素全连接层的第n+1音素全连接结果;其中,N为大于或者等于2的整数,n为取值从1开始递增的整数变量,n的取值范围为1≤n<N,当n取值为1时,第n音素全连接层的输入为融合特征,当n取值为2≤n<N时,第n音素全连接层的输入为第n-1音素全连接层输出的第n-1音素全连接结果,当n取值为N-1时,第n+1音素全连接结果为音频帧属于每个候选音素的第一概率;将最大的第一概率的候选音素确定为音频帧对应的音素。
作为示例,参见图6,在注意力融合网络后外接音素分类网络(音素全连接层),通过音素分类网络针对每个音频帧进行音素分类,候选音素总共包含40个音素(包括音素词典中39个音素以及静音音素),当仅存在一个音素全连接层时,通过音素全连接层输出某个音频帧属于每个候选音素的第一概率,即针对音频帧A输出40个第一概率,将最高的第一概率所对应的候选音素确定为该音频帧A的音素,当存在多个音素全连接层时,由于是级联的关系,通过多个级联的全连接层可以学习到更深度的特征,从而有效提高后续音素识别准确性。
在一些实施例中,步骤104中基于每个音频帧对应的音素,确定每个音素的起止时刻,可以通过以下技术方案实现:基于每个音频帧对应的音素,确定每个音素对应的至少一个音频帧;针对每个音素执行以下处理:当音素对应多个连续的音频帧时,将音素对应的连续音频帧的起止时刻确定为音素的起止时刻;当音素对应一个音频帧时,将音素对应的音频帧的时刻确定为音素的起止时刻。
作为示例,起止时刻包括音素的开始时刻以及结束时刻,以存在10个音频帧为例进行说明,基于每个音频帧对应的音素,确定每个音素对应的至少一个音频帧,针对每个音素执行以下处理:当音素对应多个连续的音频帧时,将音素对应的连续音频帧的起止时刻确定为音素的起止时刻,例如,第1个音频帧至第3个音频帧均对应音素W,则音素W对应第1个音频帧至第3个音频帧,将第1个音频帧至第3个音频帧的起止时刻确定为音素W的起止时刻,即将第1个音频帧的时刻确定为起止时刻中的开始时刻,第3个音频帧的时刻确定为起止时刻中的结束时刻,当音素对应一个音频帧时,将音素对应的音频帧的时刻确定为音素的起止时刻,例如,第1个音频帧对应音素W,第2个音频帧对应静音音频帧,则音素W对应第1个音频帧,将第1个音频帧的起止时刻确定为音素W的起止时刻,即将第1个音频帧的时刻确定为起止时刻中的开始时刻,也同时将第1个音频帧的时刻确定为起止时刻中的结束时刻。
在一些实施例中,参见图3C,图3C是本申请实施例提供的基于人工智能的音频处理方法的流程示意图,执行步骤101中获取给定文本的至少一个音素,并确定每个音素的音素特征之前,或者执行步骤102中获取对应给定文本的音频数据,并确定所述音频数据包括的每个音频帧的音频特征之前,可以执行图3C示出的步骤105-步骤111。
在步骤105中,获取音频数据样本以及给定文本样本。
作为示例,给定文本样本与音频数据样本对应,例如,音频数据样本是用户跟读给定文本得到的。
在步骤106中,获取给定文本样本的至少一个音素样本,并通过音素编码器确定每个音素样本的音素特征。
在步骤107中,通过音频编码器确定音频数据样本包括的每个音频帧样本的音频特征。
作为示例,参与训练的音频编码器和音素编码器可以为经过预先训练的网络结构,本申请实施例采用预训练的声学模型进行音频特征提取,如声音转向量模型,声音转向量模型由多层卷积网络构成,利用大量无标签任务基于对比损失进行声音转向量模型的预训练,在训练音素对齐模型时,将音频数据(音频波形特征)输入至预训练的网络结构。
作为示例,参见图6,音素对齐模型包括音素分类网络、响度分类网络、共享注意力融合网络、音频编码器以及音素编码器,步骤103是通过调用注意力融合网络实现的,确定每个音频帧对应的音素是通过调用音素分类网络实现的,音素分类网络与响度分类网络共享注意力融合网络,注意力融合网络的输入是音频编码器的输出以及音素编码器的输出。
在步骤108中,针对每个音频帧样本执行以下处理:将音频帧样本的音频特征以及至少一个音素样本的音素特征,在由注意力融合网络以及音素分类网络构成的网络中进行正向传播,得到第一正向传播结果。
在一些实施例中,上述将音频帧样本的音频特征以及至少一个音素样本的音素特征,在由注意力融合网络以及音素分类网络构成的网络中进行正向传播,得到第一正向传播结果,可以通过以下技术方案 实现:通过注意力融合网络的注意力层针对每个音素样本执行以下处理:基于音频帧样本的音频特征以及音素样本的音素特征,确定对应音素样本的权重;对音素样本的音素特征进行值向量变换处理,将对应音素样本的权重与值向量变换结果进行相乘处理,得到对应音素样本的注意力结果;通过注意力融合网络的融合层将对应每个音素样本的注意力结果以及音频帧样本的音频特征进行融合处理,得到对应音频帧样本的融合特征;通过音素分类网络对音频帧样本的融合特征进行第一全连接处理,得到音频帧样本属于每个候选音素的第三概率;将第三概率以及权重组成第一正向传播结果。
作为示例,为了更好的融合音素特征和音频特征表示,需要对本申请实施例中权重矩阵进行约束,即进行注意力权重约束,其中,权重矩阵中每行代表一个音频帧,每列代表该音频帧对应每个音素的概率分布。
在步骤109中,针对每个音频帧样本执行以下处理:将音频帧样本的音频特征以及至少一个音素样本的音素特征,在由注意力融合网络以及响度分类网络构成的网络中进行正向传播,得到第二正向传播结果。
在一些实施例中,上述将音频帧样本的音频特征以及至少一个音素样本的音素特征,在由注意力融合网络以及响度分类网络构成的网络中进行正向传播,得到第二正向传播结果,可以通过以下技术方案实现:通过注意力融合网络对对音频帧样本的音频特征进行映射处理,得到每个音素样本的音素特征的权重,基于每个音素样本的音素特征的权重,对音频帧样本的音频特征以及至少一个音素样本的音素特征进行融合处理,得到每个音频帧样本的融合特征;通过响度分类网络对每个音频帧样本的融合特征进行第二全连接处理,得到每个音频帧样本属于每个响度类别的第二概率,并将每个音频帧样本属于每个响度类别的第二概率组成第二正向传播结果。
作为示例,在进行数据正向传播的过程中,响度分类网络的输入与音素分类网络的输入相同。
在一些实施例中,上述将音频帧样本的音频特征以及至少一个音素样本的音素特征,在由注意力融合网络以及响度分类网络构成的网络中进行正向传播,得到第二正向传播结果,可以通过以下技术方案实现:通过注意力融合网络的注意力层针对每个音素样本执行以下处理:基于音频帧样本的音频特征以及音素样本的音素特征,确定对应音素样本的权重;对音素样本的音素特征进行值向量变换处理,将对应音素样本的权重与值向量变换结果进行相乘处理,得到对应音素样本的注意力结果;通过注意力融合网络的融合层将对应每个音素样本的注意力结果以及音频帧样本的音频特征进行融合处理,得到对应音频帧样本的融合特征;通过响度分类网络对音频帧样本的融合特征进行第二全连接处理,得到音频帧样本属于每个候选音素的第二概率;将第二概率以及权重组成第二正向传播结果。
作为示例,音素对齐模型包括注意力融合网络、音素分类网络以及响度分类网络,音频编码器的输入是音频数据样本,音频编码器的输出是音频数据包括的每个音频帧样本的音频特征(向量形式),音素编码器的输入是音素序列样本(给定文本样本),音素编码器的输出是每个音素样本的音素特征(音素特征的数据形式是向量),注意力融合网络的输入是音频编码器的输出以及音素编码器的输出,注意力融合网络的输出是音素特征与音频特征的融合特征,每个音频帧的音频特征与所有音素进行注意力机制计算,得到融合特征,确定音频帧对应候选音素的表示和对应静音与否的表示,通过两个并列的音素分类网络以及响度分类网络分别对融合特征进行分类处理,音素分类网络输出每个音频帧属于每个候选音素的第三概率,响度分类网络输出每个音频帧属于每个响度类别的第二概率,响度类别包括静音以及非静音,例如,非静音的标识为1,静音的标识为0,响度类别还可以为更加细粒度的划分,例如静音、10分贝、20分贝、30分贝等等,候选音素为W、IH、L等等。
在步骤110中,根据第一正向传播结果以及第二正向传播结果,确定联合损失。
在一些实施例中,上述根据第一正向传播结果以及第二正向传播结果,确定联合损失,可以通过以下技术方案实现:基于每个音频帧样本对应多个候选音素的第三概率、以及每个音频帧样本的预标记候选音素,确定第一音素类别损失;基于每个音频帧样本对应多个响度类别的第二概率、以及每个音频帧样本的预标记响度类别,确定第二响度类别损失;基于每个音频帧样本对应每个音素样本的权重、以及每个音频帧样本对应每个音素样本的预标记对齐标识,确定第三对齐损失;对第一音素类别损失、第二响度类别损失以及第三对齐损失进行融合处理,得到联合损失。通过多个维度的损失构成联合损失,并基于联合损失进行训练,可以有效提高音素对齐模型的训练效果。
作为示例,在音素对齐模型的训练过程中采用交叉损失对两种分类的损失进行计算,参见公式(4)和公式(5):
L_phone=-(1/m)·Σ_{i=1..m} Σ_{j=1..c} y_ij·log(p_ij)    (4);
L_sil=-(1/m)·Σ_{i=1..m} [y_i·log(p_i)+(1-y_i)·log(1-p_i)]    (5);
其中,L_phone是音素分类损失(第一音素类别损失),L_sil是响度分类损失(第二响度类别损失),m是音频帧的数目,c是候选音素的数目,y_ij是第i个音频帧对应第j个音素的真实标识结果,p_ij是第i个音频帧对应第j个音素的第一概率,y_i是第i个音频帧的预标记对齐标识,非静音为1,静音为0,p_i是第i个音频帧为非静音音频帧的概率。
在一些实施例中为了更好的融合音素特征和音频特征表示,对本申请实施例中权重矩阵进行约束,即进行注意力权重约束,其中,矩阵中每行代表一个音频帧,每列代表该音频帧中每个音素的概率分布,将每个音频帧的音素的概率分布与实际该音频帧对应的音素进行损失计算,得到注意力机制损失,参见公式(6):
L_align=-(1/m)·Σ_{i=1..m} Σ_{j=1..N_p} a_ij·log(α_ij)    (6);
其中,L_align是注意力机制损失,m是音频帧的数目,N_p是给定文本中音素的数目,a_ij是1或者0,1表征第i个音频帧与第j个音素是对齐的,0表征第i个音频帧与第j个音素不是对齐的,α_ij是第i个音频帧与第j个音素的权重。
在一些实施例中,整个音素对齐网络的联合损失由三部分构成,包括音素分类损失(第一音素类别损失),响度分类损失(第二响度类别损失)以及对齐损失(第三对齐损失),三种损失采用不同的权重进行加权求和,最终得到的联合损失如公式(7)所示:
L_total=λL_phone+βL_sil+γL_align    (7);
其中,每个损失的权重(λ、β以及γ)是预先设置的权重,三者求和等于1,L_phone是音素分类损失(第一音素类别损失),L_sil是响度分类损失(第二响度类别损失)以及L_align是对齐损失(第三对齐损失),L_total是联合损失。
在步骤111中,根据联合损失更新注意力融合网络、音素分类网络、响度分类网络、音素编码器以及音频编码器的参数。
作为示例,在根据联合损失更新注意力融合网络、音素分类网络、以及响度分类网络、音素编码器以及音频编码器的参数时,根据联合损失确定出梯度,进而通过下降算法更新各个网络的参数,从而尽量使得联合损失收敛至最低值。
通过本申请实施例将音频特征与文本序列进行注意力机制计算得到融合特征,因此融合特征能够有效表征音频帧与音素之间的关系,再基于融合特征对音频中每个音频帧进行音素分类,可以有效提高分类准确度,从而提高音素对齐准确度。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。
在一些实施例中,当音频处理系统应用于口语考试场景时,例如,口语考试题目要求考生用户使用英语跟读给定文本,考生终端接收到用户对应给定文本的音频数据,考生终端将音频数据发送至服务器,服务器对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到音频帧的融合特征,基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻,并发送至评委终端,以使评委终端直接呈现每个音素的起止时刻,响应于评委用户的评分操作,评委终端可以显示针对每个音素的评分结果。即本申请实施例主要提供一种音素标注的自动化的工具,标注出给定文本的每个音素在音频数据中的对应位置,并在此基础上可以进一步进行音素以及单词朗读是否错误的标注,从而有效减少人工标注成本,为后续评委评分提供了更便捷的评分环境。
在一些实施例中,当音频处理系统应用于口语练习场景时,例如,口语练习题目要求学生用户使用英语跟读给定文本,学生终端接收到用户对应给定文本的音频数据,学生终端将音频数据发送至服务器,服务器对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到音频帧的融合特征,基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻,并发送至考生终端,以使考生终端直接呈现每个音素的起止时刻,响应于考生用户的评分操作,考生终端可以显示针对每个音素的评分结果,评分结果可以为针对音素的发音是否正确的标注, 即本申请实施例主要提供一种音素标注的自动化的工具,标注出给定文本的每个音素在音频数据中的对应位置,并在此基础上可以进一步进行音素以及单词朗读是否错误的标注,从而有效减少人工标注成本,为后续考生评分自检提供了更便捷的自检环境。
音素强制对齐是指将给定的音素序列文本与对应的音频进行对齐,得到文本中的每个音素在音频中的时间位置。音素对齐在语音处理中有不同应用,如语音识别,语音关键词检测等。本申请实施例将音频特征与文本序列进行注意力机制计算,得到融合的音频和文本特征,对音频中每帧进行音素分类,为了让对齐更加准确,增加辅助任务,如音频中每帧是否静音的判断。同时,对得到的权重矩阵进行约束,以达到更精确的对齐。
在一些实施例中,参见图4A,图4A是本申请实施例提供的基于人工智能的音频处理方法的界面示意图,人机交互界面401A中显示朗读按钮402A以及结束朗读按钮403A,人机交互界面401A中还显示给定文本“What are you doing?”,响应于考生用户针对朗读按钮402A的触发操作,考生终端接收对应给定文本的音频数据,响应于考生用户针对结束朗读按钮403A的触发操作,考生终端停止接收对应给定文本的音频数据。
在一些实施例中,参见图4B,图4B是本申请实施例提供的基于人工智能的音频处理方法的界面示意图,音素标注功能可以以嵌入在网页中,还可以嵌入在客户端内,用户对发音进行音素级别的标注流程如下,人机交互界面401B中显示给定文本403B以及标注按钮402B,响应于针对标注按钮402B的触发操作,人机交互界面401B中显示针对给定文本403B的标注页面。
在一些实施例中,参见图4C,图4C是本申请实施例提供的基于人工智能的音频处理方法的界面示意图,人机交互界面401C中显示标注页面403C,标注页面403C中显示音素402C在音频中的起止时间以及单词404C在音频中的起止时间,单词404C在音频中的起止时间由音素402C在音频中的起止时间确定。
在一些实施例中,参见图4D,图4D是本申请实施例提供的基于人工智能的音频处理方法的界面示意图,人机交互界面401D中显示标注页面403D,标注页面403D中显示音素402D在音频中的起止时间以及单词404D在音频中的起止时间,单词404D在音频中的起止时间由音素402D在音频中的起止时间确定,因此人机交互界面401D中显示有经过划分的音素,响应于用户针对音素的标注操作,在标注页面的最后一层中显示针对音素的发音标注405D,例如,如某个音素是否错误。
在一些实施例中,参见图5是本申请实施例提供的基于人工智能的音频处理方法的流程示意图,基于音素强制对齐的业务整体流程图如图5所示,步骤如下:音素标注工具的网页打开后,用户可以选择需要标注的音频和对应的跟读文本;响应于用户的选择操作,确定需要标注的音频和对应的音素文本序列(来源于题目的跟读文本)并且开始标注;网页将音频数据音素文本序列(来源于题目的跟读文本)发送给服务器;服务器将音频数据和音素文本序列(来源于题目的跟读文本)发送给音素强制对齐模块;音素强制对齐模块将每个音素在音频数据中的起止时间(音素边界信息)返回给服务器;服务器将基于音素边界信息切分的音频返回给用户;响应于用户的批注操作,基于每个切分的音素发音段对音素级别进行发音标注。
在一些实施例中,参见图6,本申请实施例提供的音素对齐模型由音素编码器、音频编码器、注意力融合网络、音素分类网络以及响度分类网络。音素编码器用于提取音素特征,音频编码器用于提取音频特征。对音频帧的音频特征进行映射处理,得到每个音素的音素特征的权重,基于每个音素的音素特征的权重,对音频帧的音频特征以及至少一个音素的音素特征进行融合处理,得到音频帧的融合特征,融合特征包含音频特征的信息和音素特征的信息,在注意力融合网络后外接音素分类网络(全连接层)以及响度分类网络(全连接层),通过音素分类网络针对每个音频帧进行音素分类,音素分类总共包含40个音素(包括音素词典中39个音素以及静音音素),通过响度分类网络对每个音频帧进行是否为静音音频帧的分类(包含静音或非静音)。
在一些实施例中,基于音频编码器获取音频特征表示,本申请实施例采用预训练的声学模型进行音频特征提取,如声音转向量模型,声音转向量模型由多层卷积网络构成,利用大量无标签任务基于对比损失进行声音转向量模型的预训练,在训练音素对齐模型时,将音频数据(音频波形特征)输入至预训练的网络结构,输出音频数据中每个音频帧的音频特征,基于音素编码器获取音素特征,本申请实施例采用音素编码的方式进行音素特征的提取,用独特的向量对每个音素的特性进行表示(特性表示特征),采用随机初始化的方式对每个音素的特性向量(特性表示特征)进行初始化,同时为了让音素在单词中不同位置的表示有所区别,随机初始化每个音素的位置向量(位置表示特征),包括四种位置,当单词包含多个音素,则代表单词的开始位置(B)、中间位置(I)、结束为止(E),当单词包含一个音素,则用S表示,对这些位置进行编码,得到每个音素的位置向量,最终将每个音素的独特的编码表示(发音向量)与位置编码表示(位置向量)进行相加,得到最终的音素特征,将给定文本的音素输入到音素编码器后,得到每个音素的深度特征表示(音素特征)。
在一些实施例中,基于注意机制对音素特征与音频特征进行融合,本申请实施例采用注意力机制对音素特征与音频特征进行融合,注意力机制用于建模查询向量Q、关键向量K以及值向量V之间的关系,参见公式(8)和(9):
AttentionScore(Q,K)=softmax(Q·K^T/√d_k)    (8);
Attention(Q,K,V)=AttentionScore(Q,K)*V    (9);
其中,用每个音频帧的音频特征作为查询向量Q,将给定文本的所有音素的音素特征H_phone作为关键向量K和值向量V,AttentionScore(Q,K)是权重,Attention(Q,K,V)是每个音频帧对应所有音素的注意力结果,d_k是关键向量K的维度。
在一些实施例中,将基于注意力机制得到的矩阵与音频特征进行拼接,最终得到融合特征,参见公式(10):
H_i^fusion=concat(Attention_i, H_i^audio)    (10);
其中,Attention_i是基于注意力机制得到的音频帧i的注意力结果,音频帧i的注意力结果是矩阵,矩阵中每列代表所有音素中每个音素与音频帧i的注意力结果,H_i^audio是音频帧i的音频特征,H_phone是给定文本的所有音素的音素特征,H_i^fusion是对应每个音频帧的融合特征。
在一些实施例中,在音素对齐模型的训练过程中采用交叉损失对两种分类的损失进行计算,参见公式(11)和公式(12):
L_phone=-(1/m)·Σ_{i=1..m} Σ_{j=1..c} y_ij·log(p_ij)    (11);
L_sil=-(1/m)·Σ_{i=1..m} [y_i·log(p_i)+(1-y_i)·log(1-p_i)]    (12);
其中,L_phone是音素分类损失(第一音素类别损失),L_sil是响度分类损失(第二响度类别损失),m是音频帧的数目,c是候选音素的数目,y_ij是第i个音频帧对应第j个音素的真实标识结果,p_ij是第i个音频帧对应第j个音素的第一概率,y_i是第i个音频帧的预标记对齐标识,非静音为1,静音为0,p_i是第i个音频帧为非静音音频帧的概率。
在一些实施例中为了更好的融合音素特征和音频特征表示,对本申请实施例中权重矩阵进行约束,即进行注意力权重约束,其中,矩阵中每行代表一个音频帧,每列代表该音频帧中每个音素的概率分布,将每个音频帧的音素的概率分布与实际该音频帧对应的音素进行损失计算,得到注意力机制损失,参见公式(13):
L_align=-(1/m)·Σ_{i=1..m} Σ_{j=1..N_p} a_ij·log(α_ij)    (13);
其中,L_align是注意力机制损失,m是音频帧的数目,N_p是给定文本中音素的数目,a_ij是1或者0,1表征第i个音频帧与第j个音素是对齐的,0表征第i个音频帧与第j个音素不是对齐的,α_ij是第i个音频帧与第j个音素的权重。
在一些实施例中,整个音素对齐网络的联合损失由三部分构成,包括音素分类损失(第一音素类别损失),响度分类损失(第二响度类别损失)以及对齐损失(第三对齐损失),三种损失采用不同的权重进行加权求和,最终得到的联合损失如公式(14)所示:
L_total=λL_phone+βL_sil+γL_align    (14);
其中,每个损失的权重(λ、β以及γ)是预先设置的权重,三者求和等于1,L_phone是音素分类损失(第一音素类别损失),L_sil是响度分类损失(第二响度类别损失)以及L_align是对齐损失(第三对齐损失),L_total是联合损失。
在一些实施例中,参见图7,图7是本申请实施例提供的基于人工智能的音频处理方法的数据流程示意图,音素对齐模型包括注意力融合网络、音素分类网络(对应第一任务)以及响度分类网络(对应第二任务),音频编码器的输入是音频数据,音频编码器的输出是音频数据包括的每个音频帧的音频特征(向量形式),音素编码器的输入是音素序列(给定文本),音素编码器的输出是每个音素的音素特征(音素特征的数据形式是向量),注意力融合网络的输入是音频编码器的输出以及音素编码器的输出,注意力融合网络的输出是音素特征与音频特征的融合特征,每个音频帧的音频特征与所有音素进行注意力机制计算,得到融合特征,确定音频帧对应候选音素的表示和对应静音与否的表示,通过两个并列的音素分类网络以及响度分类网络分别对融合特征进行分类处理,音素分类网络输出每个音频帧属于每个候选音素的概率,响度分类网络输出每个音频帧属于每个响度类别的概率,响度类别包括静音以及非静音,例如,非静音的标识为1,静音的标识为0,候选音素为W、IH、L等等。
在一些实施例中,本申请实施例在两个公开数据集进行实验,包括TIMIT数据集和Buckeye数据集,这两个数据集会在音频中对每个音素进行时间标记,最终进行指标计算,指标包括以下至少之一:音素对齐模型预测得到的音素边界与实际音素边界的精确率P,召回率R和F1分数,另外为了解决当召回率很高且精确率很低的情况下,F1分数值比较高的问题,引入R-value进行评价,参见公式(15)-公式(17):
r_1=√((1-R)^2+OS^2)    (15);
r_2=(-OS+R-1)/√2    (16);
R-value=1-(|r_1|+|r_2|)/2    (17);
其中,P为精确率,R为召回率,OS是R/P-1。
最终结果参见表1,Discrimi、Montreal与SEGFEAT均是相关技术中的模型,从表1可以看出,本申请实施例在不同公开数据集上,音素边界准确率都有较大的提升。
表1本申请实施例以及相关技术中各个模型在各个数据集的评分
Corpora Model P R F1 R-value
TIMIT Ours 93.42 95.96 94.67 95.18
TIMIT Discrimi 90 82.2 85.9 79.51
TIMIT Montreal 83.9 81.6 82.7 85.16
TIMIT SEGFEAT 92.67 93.03 92.85 93.91
Buckeye Ours 88.49 90.33 89.40 90.90
Buckeye SEGFEAT 85.40 89.12 87.23 88.76
参见图8A-8C,图8A-8C是本申请实施例提供的基于人工智能的音频处理方法的对齐时间矩阵,为了验证对注意力机制约束的有效性,绘制了音素对齐矩阵,其中,纵轴为按照时间划分的音频帧,横轴为每个音素,图8A示出了未添加加注意力权重约束的对齐时间矩阵,图8B示出了添加约束的对齐时间矩阵,图8C示出了真实的对齐时间矩阵,可以看出加了注意力机制约束的矩阵整体更符合音素与音频的实际对齐时间。
可以理解的是,在本申请实施例中,涉及到用户信息等相关的数据,当本申请实施例运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
下面继续说明本申请实施例提供的基于人工智能的音频处理装置255的实施为软件模块的示例性结构,在一些实施例中,如图2所示,存储在存储器250的基于人工智能的音频处理装置255中的软件模块可以包括:音素模块2551,配置为获取给定文本的至少一个音素,并确定每个音素的音素特征;音频模块2552,配置为获取对应所述给定文本的音频数据,并确定所述音频数据包括的每个音频帧的音频特征;融合模块2553,配置为针对每个音频帧执行以下处理:对所述音频帧的音频特征进行映射处理,得到每个所述音素的音素特征的权重,基于每个所述音素的音素特征的权重,对所述音频帧的音频特征以及至少一个所述音素的音素特征进行融合处理,得到每个所述音频帧的融合特征;对齐模块2554,配置为基于每个音频帧的融合特征,确定每个音频帧对应的音素,并基于每个音频帧对应的音素,确定每个音素的起止时刻。
在一些实施例中,音频模块2552,还配置为:对至少一个音频帧进行特征提取处理,得到对应每个音频帧的卷积特征提取结果;对每个音频帧的卷积特征提取结果进行归一化处理,得到每个音频帧的音频特征。
在一些实施例中,音素模块2551,还配置为:针对每个音素执行以下处理:确定音素的特性表示特征,其中,特性表示特征表征音素的特性;确定音素的位置表示特征,其中,位置表示特征表征音素在 对应文本单元中的位置;将位置表示特征与特性表示特征进行相加处理,得到音素的音素特征。
在一些实施例中,融合模块2553,还配置为:针对每个所述音素执行以下处理:对所述音素的音素特征进行值向量变换处理,得到值向量;将对应所述音素的权重与所述值向量进行相乘处理,得到对应所述音素的注意力结果;将对应所述至少一个音素的注意力结果以及所述音频帧的音频特征进行融合处理,得到对应所述音频帧的融合特征。
在一些实施例中,融合模块2553,还配置为:对音频特征进行查询向量变换处理,得到查询向量;对音素特征进行关键向量变换处理,得到关键向量;将查询向量以及关键向量的转置进行相乘处理,得到相乘结果;获取所述关键向量的维度的平方根;将相乘结果与平方根的比值确定为注意力特征;对注意力特征进行最大似然处理,得到对应音素的权重。
在一些实施例中,确定每个音频帧对应的音素是通过调用音素分类网络实现的,音素分类网络包括至少一个级联的音素全连接层,对齐模块2554,还配置为:针对每个音频帧执行以下处理:当音素全连接层的数目为一个时,通过音素全连接层对融合特征进行第一全连接处理,得到音频帧属于每个候选音素的第一概率;当音素全连接层的数目为多个时,通过N个级联的音素全连接层中的第n音素全连接层,对第n音素全连接层的输入进行第一全连接处理,并将第n音素全连接层输出的第n音素全连接结果传输到第n+1音素全连接层以继续进行第一全连接处理,得到对应第n+1音素全连接层的第n+1音素全连接结果;其中,N为大于或者等于2的整数,n为取值从1开始递增的整数变量,n的取值范围为1≤n<N,当n取值为1时,第n音素全连接层的输入为融合特征,当n取值为2≤n<N时,第n音素全连接层的输入为第n-1音素全连接层输出的第n-1音素全连接结果,当n取值为N-1时,第n+1音素全连接结果为音频帧属于每个候选音素的第一概率;将最大的第一概率的候选音素确定为音频帧对应的音素。
在一些实施例中,对齐模块2554,还配置为:基于每个音频帧对应的音素,确定每个音素对应的至少一个音频帧;针对每个音素执行以下处理:当音素对应多个连续的音频帧时,将音素对应的连续音频帧的起止时刻确定为音素的起止时刻;当音素对应一个音频帧时,将音素对应的音频帧的时刻确定为音素的起止时刻。
在一些实施例中,对所述音频帧的音频特征进行映射处理,得到每个所述音素的音素特征的权重,基于每个所述音素的音素特征的权重,对所述音频帧的音频特征以及至少一个所述音素的音素特征进行融合处理,得到每个所述音频帧的融合特征是通过调用注意力融合网络实现的,确定每个音频帧对应的音素是通过调用音素分类网络实现的,音素分类网络与响度分类网络共享注意力融合网络,装置还包括:训练模块2555,配置为:获取音频数据样本以及给定文本样本;获取给定文本样本的至少一个音素样本,并通过音素编码器确定每个音素样本的音素特征;通过音频编码器确定音频数据样本包括的每个音频帧样本的音频特征;针对每个音频帧样本执行以下处理:将音频帧样本的音频特征以及至少一个音素样本的音素特征,在由注意力融合网络以及音素分类网络构成的网络中进行正向传播,得到第一正向传播结果;针对每个音频帧样本执行以下处理:将音频帧样本的音频特征以及至少一个音素样本的音素特征,在由注意力融合网络以及响度分类网络构成的网络中进行正向传播,得到第二正向传播结果;根据第一正向传播结果以及第二正向传播结果,确定联合损失;根据联合损失更新注意力融合网络、音素分类网络、响度分类网络、音频编码器以及音素编码器的参数。
在一些实施例中,将音频帧样本的音频特征以及至少一个音素样本的音素特征,训练模块2555,还配置为:通过注意力融合网络对音频帧样本的音频特征以及至少一个音素样本的音素特征进行基于注意力机制的融合处理,得到对应每个音频帧样本的融合特征;通过响度分类网络对每个音频帧样本的融合特征进行第二全连接处理,得到每个音频帧样本属于每个响度类别的第二概率,并将每个音频帧样本属于每个响度类别的第二概率组成第二正向传播结果。
在一些实施例中,训练模块2555,还配置为:通过注意力融合网络的注意力层针对每个音素样本执行以下处理:通过注意力融合网络的注意力层针对每个音素样本执行以下处理:基于音频帧样本的音频特征以及音素样本的音素特征,确定对应音素样本的权重;对音素样本的音素特征进行值向量变换处理,将对应音素样本的权重与值向量变换结果进行相乘处理,得到对应音素样本的注意力结果;通过注意力融合网络的融合层将对应每个音素样本的注意力结果以及音频帧样本的音频特征进行融合处理,得到对应音频帧样本的融合特征;通过音素分类网络对音频帧样本的融合特征进行第一全连接处理,得到音频帧样本属于每个候选音素的第三概率;将第三概率以及权重组成第一正向传播结果。
在一些实施例中,训练模块2555,还配置为:基于每个音频帧样本对应多个候选音素的第三概率、以及每个音频帧样本的预标记候选音素,确定第一音素类别损失;基于每个音频帧样本对应多个响度类别的第二概率、以及每个音频帧样本的预标记响度类别,确定第二响度类别损失;基于每个音频帧样本对应每个音素样本的权重、以及每个音频帧样本对应每个音素样本的预标记对齐标识,确定第三对齐损失;对第一音素类别损失、第二响度类别损失以及第三对齐损失进行融合处理,得到联合损失。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例上述的。。方法。
The embodiments of the present application provide a computer-readable storage medium storing executable instructions. When the executable instructions are executed by a processor, the processor is caused to perform the artificial-intelligence-based audio processing method provided by the embodiments of the present application, for example, the artificial-intelligence-based audio processing method shown in FIG. 3A-3C.
In some embodiments, the computer-readable storage medium may be a memory such as an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM, or may be any device including one of the foregoing memories or any combination thereof.
In some embodiments, the executable instructions may take the form of a program, software, a software module, a script, or code, may be written in any form of programming language (including a compiled or interpreted language, or a declarative or procedural language), and may be deployed in any form, including being deployed as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but do not necessarily, correspond to a file in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files storing one or more modules, subprograms, or code portions).
As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices that are distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiments of the present application, attention-mechanism computation between the audio features and the text sequence yields fused features, so that the fused features effectively characterize the relationship between audio frames and phonemes. Performing phoneme classification on each audio frame based on the fused features can effectively improve classification accuracy and thereby improve phoneme alignment accuracy.
The foregoing descriptions are merely embodiments of the present application and are not intended to limit the protection scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application shall fall within the protection scope of the present application.

Claims (14)

  1. An artificial-intelligence-based audio processing method, the method being performed by an electronic device and the method comprising:
    obtaining at least one phoneme of a given text, and determining a phoneme feature of each of the phonemes;
    obtaining audio data corresponding to the given text, and determining an audio feature of each audio frame comprised in the audio data;
    performing the following processing for each of the audio frames: performing mapping processing on the audio feature of the audio frame to obtain a weight of the phoneme feature of each of the phonemes, and performing fusion processing on the audio feature of the audio frame and the phoneme feature of the at least one phoneme based on the weight of the phoneme feature of each of the phonemes to obtain a fused feature of the audio frame;
    determining, based on the fused feature of each of the audio frames, the phoneme corresponding to each of the audio frames, and determining, based on the phoneme corresponding to each of the audio frames, start and end times of each of the phonemes in the audio data.
  2. The method according to claim 1, wherein the determining a phoneme feature of each of the phonemes comprises:
    performing the following processing for each of the phonemes:
    determining a characteristic representation feature of the phoneme, wherein the characteristic representation feature characterizes characteristics of the phoneme;
    determining a position representation feature of the phoneme, wherein the position representation feature characterizes a position of the phoneme in a corresponding text unit;
    adding the position representation feature and the characteristic representation feature to obtain the phoneme feature of the phoneme.
  3. The method according to claim 1, wherein the performing fusion processing on the audio feature of the audio frame and the phoneme feature of the at least one phoneme based on the weight of the phoneme feature of each of the phonemes to obtain a fused feature of the audio frame comprises:
    performing the following processing for each of the phonemes:
    performing value vector transformation processing on the phoneme feature of the phoneme to obtain a value vector;
    multiplying the weight corresponding to the phoneme by the value vector to obtain an attention result corresponding to the phoneme;
    performing fusion processing on the attention result corresponding to the at least one phoneme and the audio feature of the audio frame to obtain the fused feature corresponding to the audio frame.
  4. The method according to claim 1, wherein the performing mapping processing on the audio feature of the audio frame to obtain a weight of the phoneme feature of each of the phonemes comprises:
    performing query vector transformation processing on the audio feature to obtain a query vector;
    performing key vector transformation processing on the phoneme feature to obtain a key vector;
    multiplying the query vector by a transpose of the key vector to obtain a multiplication result;
    obtaining a square root of a dimension of the key vector;
    determining a ratio of the multiplication result to the square root as an attention feature;
    performing maximum likelihood processing on the attention feature to obtain the weight corresponding to the phoneme.
  5. The method according to claim 1, wherein the determining the phoneme corresponding to each of the audio frames is implemented by invoking a phoneme classification network, the phoneme classification network comprises at least one cascaded phoneme fully connected layer, and the determining, based on the fused feature of each of the audio frames, the phoneme corresponding to each of the audio frames comprises:
    performing the following processing for each of the audio frames:
    when the number of phoneme fully connected layers is one, performing first fully connected processing on the fused feature through the phoneme fully connected layer to obtain a first probability that the audio frame belongs to each candidate phoneme;
    when the number of phoneme fully connected layers is more than one, performing first fully connected processing on an input of an n-th phoneme fully connected layer through the n-th phoneme fully connected layer among N cascaded phoneme fully connected layers, and transmitting an n-th phoneme fully connected result output by the n-th phoneme fully connected layer to an (n+1)-th phoneme fully connected layer to continue the first fully connected processing, so as to obtain an (n+1)-th phoneme fully connected result corresponding to the (n+1)-th phoneme fully connected layer;
    wherein N is an integer greater than or equal to 2, n is an integer variable whose value increases from 1 and satisfies 1 ≤ n < N; when n is 1, the input of the n-th phoneme fully connected layer is the fused feature; when 2 ≤ n < N, the input of the n-th phoneme fully connected layer is an (n-1)-th phoneme fully connected result output by an (n-1)-th phoneme fully connected layer; and when n is N-1, the (n+1)-th phoneme fully connected result is the first probability that the audio frame belongs to each candidate phoneme;
    determining the candidate phoneme with the largest first probability as the phoneme corresponding to the audio frame.
  6. The method according to claim 1, wherein the determining, based on the phoneme corresponding to each of the audio frames, start and end times of each of the phonemes in the audio data comprises:
    determining, based on the phoneme corresponding to each of the audio frames, at least one audio frame corresponding to each of the phonemes;
    performing the following processing for each of the phonemes:
    when the phoneme corresponds to a plurality of consecutive audio frames, determining start and end times of the consecutive audio frames corresponding to the phoneme as the start and end times of the phoneme;
    when the phoneme corresponds to one audio frame, determining a time of the audio frame corresponding to the phoneme as the start and end times of the phoneme in the audio data.
  7. The method according to claim 1, wherein the performing mapping processing on the audio feature of the audio frame to obtain a weight of the phoneme feature of each of the phonemes, and performing fusion processing on the audio feature of the audio frame and the phoneme feature of the at least one phoneme based on the weight of the phoneme feature of each of the phonemes to obtain the fused feature of each of the audio frames, is implemented by invoking an attention fusion network, the determining the phoneme corresponding to each of the audio frames is implemented by invoking a phoneme classification network, the phoneme classification network shares the attention fusion network with a loudness classification network, an input of the attention fusion network is outputs of an audio encoder and a phoneme encoder, and the method further comprises:
    obtaining a given text sample and an audio data sample corresponding to the given text sample;
    obtaining at least one phoneme sample of the given text sample, and determining a phoneme feature of each of the phoneme samples through the phoneme encoder;
    determining, through the audio encoder, an audio feature of each audio frame sample comprised in the audio data sample;
    performing the following processing for each of the audio frame samples: forward-propagating the audio feature of the audio frame sample and the phoneme feature of the at least one phoneme sample through a network formed by the attention fusion network and the phoneme classification network to obtain a first forward propagation result;
    performing the following processing for each of the audio frame samples: forward-propagating the audio feature of the audio frame sample and the phoneme feature of the at least one phoneme sample through a network formed by the attention fusion network and the loudness classification network to obtain a second forward propagation result;
    determining a joint loss based on the first forward propagation result and the second forward propagation result;
    updating parameters of the attention fusion network, the phoneme classification network, the loudness classification network, the audio encoder, and the phoneme encoder based on the joint loss.
  8. The method according to claim 7, wherein the forward-propagating the audio feature of the audio frame sample and the phoneme feature of the at least one phoneme sample through the network formed by the attention fusion network and the loudness classification network to obtain a second forward propagation result comprises:
    performing, through the attention fusion network, attention-based fusion processing on the audio feature of the audio frame sample and the phoneme feature of the at least one phoneme sample to obtain a fused feature corresponding to each of the audio frame samples;
    performing, through the loudness classification network, second fully connected processing on the fused feature of each of the audio frame samples to obtain a second probability that each of the audio frame samples belongs to each loudness category, and composing the second forward propagation result from the second probabilities that each of the audio frame samples belongs to each loudness category.
  9. The method according to claim 7, wherein the forward-propagating the audio feature of the audio frame sample and the phoneme feature of the at least one phoneme sample through the network formed by the attention fusion network and the phoneme classification network to obtain a first forward propagation result comprises:
    performing the following processing for each of the phoneme samples through an attention layer of the attention fusion network:
    determining a weight corresponding to the phoneme sample based on the audio feature of the audio frame sample and the phoneme feature of the phoneme sample;
    performing value vector transformation processing on the phoneme feature of the phoneme sample, and multiplying the weight corresponding to the phoneme sample by a value vector transformation result to obtain an attention result corresponding to the phoneme sample;
    performing, through a fusion layer of the attention fusion network, fusion processing on the attention result corresponding to each of the phoneme samples and the audio feature of the audio frame sample to obtain a fused feature corresponding to the audio frame sample;
    performing, through the phoneme classification network, first fully connected processing on the fused feature of the audio frame sample to obtain a third probability that the audio frame sample belongs to each candidate phoneme;
    composing the first forward propagation result from the third probability and the weight.
  10. The method according to claim 9, wherein the determining a joint loss based on the first forward propagation result and the second forward propagation result comprises:
    determining a first phoneme category loss based on the third probabilities of each of the audio frame samples corresponding to a plurality of candidate phonemes and a pre-labeled candidate phoneme of each of the audio frame samples;
    determining a second loudness category loss based on the second probabilities of each of the audio frame samples corresponding to a plurality of loudness categories and a pre-labeled loudness category of each of the audio frame samples;
    determining a third alignment loss based on the weight of each of the audio frame samples corresponding to each of the phoneme samples and a pre-labeled alignment identifier of each of the audio frame samples corresponding to each of the phoneme samples;
    fusing the first phoneme category loss, the second loudness category loss, and the third alignment loss to obtain the joint loss.
  11. An artificial-intelligence-based audio processing apparatus, the apparatus comprising:
    a phoneme module, configured to obtain at least one phoneme of a given text and determine a phoneme feature of each of the phonemes;
    an audio module, configured to obtain audio data corresponding to the given text and determine an audio feature of each audio frame comprised in the audio data;
    a fusion module, configured to perform the following processing for each of the audio frames: performing mapping processing on the audio feature of the audio frame to obtain a weight of the phoneme feature of each of the phonemes, and performing fusion processing on the audio feature of the audio frame and the phoneme feature of the at least one phoneme based on the weight of the phoneme feature of each of the phonemes to obtain a fused feature of the audio frame;
    an alignment module, configured to determine, based on the fused feature of each of the audio frames, the phoneme corresponding to each of the audio frames, and determine, based on the phoneme corresponding to each of the audio frames, start and end times of each of the phonemes.
  12. An electronic device, comprising:
    a memory, configured to store computer-executable instructions;
    a processor, configured to implement the artificial-intelligence-based audio processing method according to any one of claims 1 to 10 when executing the computer-executable instructions stored in the memory.
  13. A computer-readable storage medium storing computer-executable instructions, the computer-executable instructions, when executed by a processor, implementing the artificial-intelligence-based audio processing method according to any one of claims 1 to 10.
  14. A computer program product, comprising a computer program or computer-executable instructions, the computer program or computer-executable instructions, when executed by a processor, implementing the artificial-intelligence-based audio processing method according to any one of claims 1 to 10.
PCT/CN2022/122553 2021-11-26 2022-09-29 Audio processing method and apparatus based on artificial intelligence, electronic device, computer program product, and computer-readable storage medium WO2023093295A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/203,469 US20230306959A1 (en) 2021-11-26 2023-05-30 Audio processing method and apparatus based on artificial intelligence, electronic device, computer program product, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111421900.X 2021-11-26
CN202111421900.XA CN114360504A (zh) Audio processing method, apparatus, device, program product, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/203,469 Continuation US20230306959A1 (en) 2021-11-26 2023-05-30 Audio processing method and apparatus based on artificial intelligence, electronic device, computer program product, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023093295A1 true WO2023093295A1 (zh) 2023-06-01

Family

ID=81096282

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122553 WO2023093295A1 (zh) Audio processing method and apparatus based on artificial intelligence, electronic device, computer program product, and computer-readable storage medium

Country Status (3)

Country Link
US (1) US20230306959A1 (zh)
CN (1) CN114360504A (zh)
WO (1) WO2023093295A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360504A (zh) * 2021-11-26 2022-04-15 腾讯科技(深圳)有限公司 Audio processing method, apparatus, device, program product, and storage medium
CN116153294B (zh) * 2023-04-14 2023-08-08 京东科技信息技术有限公司 Speech recognition method, apparatus, system, device, and medium
CN117496516B (zh) * 2023-12-25 2024-03-29 北京航空航天大学杭州创新研究院 Brain tumor MRI image segmentation method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004077901A (ja) * 2002-08-20 2004-03-11 Nippon Telegr & Teleph Corp <Ntt> Phoneme determination method, apparatus therefor, and program
CN104756182A (zh) * 2012-11-29 2015-07-01 索尼电脑娱乐公司 Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
CN111105785A (zh) * 2019-12-17 2020-05-05 广州多益网络股份有限公司 Method and apparatus for text prosodic boundary recognition
CN111312231A (zh) * 2020-05-14 2020-06-19 腾讯科技(深圳)有限公司 Audio detection method and apparatus, electronic device, and readable storage medium
CN111754978A (zh) * 2020-06-15 2020-10-09 北京百度网讯科技有限公司 Prosodic hierarchy annotation method, apparatus, device, and storage medium
CN113536029A (zh) * 2021-08-05 2021-10-22 广州酷狗计算机科技有限公司 Method and apparatus for aligning audio and text, electronic device, and storage medium
CN114360504A (zh) * 2021-11-26 2022-04-15 腾讯科技(深圳)有限公司 Audio processing method, apparatus, device, program product, and storage medium

Also Published As

Publication number Publication date
US20230306959A1 (en) 2023-09-28
CN114360504A (zh) 2022-04-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897381

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022897381

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022897381

Country of ref document: EP

Effective date: 20240328