WO2022227212A1 - Federated learning-based speech representation model training method, apparatus, device and medium - Google Patents

Federated learning-based speech representation model training method, apparatus, device and medium

Info

Publication number
WO2022227212A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient value
terminal
model
albert
training
Prior art date
Application number
PCT/CN2021/097258
Other languages
English (en)
French (fr)
Inventor
李雷来
王健宗
瞿晓阳
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022227212A1 publication Critical patent/WO2022227212A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Definitions

  • the embodiments of the present application relate to the technical field of artificial intelligence, and in particular, to a method, apparatus, device, and medium for training a speech representation model based on federated learning.
  • Federated learning technology realizes the training of models through multi-party cooperation, which solves the problem of data silos while protecting user privacy and data security.
  • the inventor found that when the existing federated learning technology trains a model in a relatively complex network environment, the model parameters converge slowly due to network communication delay during training, and the model training efficiency is low.
  • the purpose of the embodiments of the present application is to provide a method, apparatus, computer equipment and computer-readable storage medium for training a speech representation model based on federated learning, so as to solve the problem of low training efficiency of existing models.
  • an embodiment of the present application provides a method for training a speech representation model based on federated learning, including:
  • the first terminal acquires a locally stored training sample data set, where the training sample data set includes a plurality of voice data;
  • the first terminal trains the locally deployed first ALBERT model by using the plurality of voice data, and outputs the first gradient value of the first ALBERT model
  • the first terminal compresses the first gradient value to obtain a compressed gradient value
  • the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by a plurality of second terminals joined to the federated network, and the second gradient value is obtained by the second terminal training the locally deployed second ALBERT model;
  • the first terminal updates the model parameters of the ALBERT model according to the total gradient value, repeats the step of training the first ALBERT model until the model converges, and obtains a trained speech representation model, wherein,
  • the speech representation model is used for extracting mel spectrum and/or linear spectrum representing speech from speech data.
  • before the step of the first terminal training the locally deployed first ALBERT model using the plurality of voice data and outputting the first gradient value of the first ALBERT model, the method further includes:
  • the first terminal preprocesses the plurality of voice data to convert each of the audio data into a corresponding mel spectrum and/or linear spectrum;
  • the steps of the first terminal training the locally deployed first ALBERT model through the plurality of voice data, and outputting the first gradient value of the first ALBERT model include:
  • the first terminal trains the locally deployed first ALBERT model by using the Mel spectrum and/or linear spectrum corresponding to the plurality of speech data, and outputs the first gradient value of the first ALBERT model.
  • the first terminal performs preprocessing on the plurality of voice data to convert each of the audio data into a corresponding Mel spectrum and/or linear spectrum, including:
  • the first terminal performs frame division processing on each of the audio data, so as to divide each of the audio data into multiple frames of audio data;
  • the multi-frame audio data is masked by using a preset masking rule to obtain masked audio data
  • the first terminal performs compression processing on the first gradient value, and obtaining the compressed gradient value includes:
  • the first terminal performs sparse processing on the first gradient value to obtain K values, where K is an integer greater than 1;
  • the first terminal performs quantization processing on the K values to obtain quantized gradient values, and uses the quantized gradient values as the compressed gradient values.
  • the first terminal performs quantization processing on the K values to obtain quantized gradient values, and uses the quantized gradient values as the compressed gradient values including:
  • the first terminal performs quantization processing on the K values to obtain quantized gradient values
  • the first terminal performs row coding or column coding on the quantized gradient value to obtain the coded gradient value, and uses the coded gradient value as the compressed gradient value.
  • updating the model parameters of the ALBERT model by the first terminal according to the total gradient value includes:
  • the first terminal obtains the current model parameters of the ALBERT model
  • the total gradient value is calculated as follows: g_{t+1} = (1/C) Σ_{i=1}^{C} g_i, where:
  • g_{t+1} is the total gradient value;
  • g_i is the second gradient value uploaded by the i-th second terminal;
  • C is the number of second terminals.
  • the embodiment of the present application also provides a federated learning-based speech representation model training device, where the federated learning-based speech representation model training device includes:
  • an acquisition module used for the first terminal to acquire a locally stored training sample data set, where the training sample data set includes a plurality of voice data;
  • a training module used for the first terminal to train the locally deployed first ALBERT model through the plurality of voice data, and output the first gradient value of the first ALBERT model
  • a compression module used for the first terminal to compress the first gradient value to obtain a compressed gradient value
  • an uploading module used by the first terminal to upload the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by the plurality of second terminals joined to the federated network; the second gradient value is obtained by the second terminal training the locally deployed second ALBERT model;
  • a receiving module used for the first terminal to receive the total gradient value returned by the federation server
  • an update module used for the first terminal to update the model parameters of the ALBERT model according to the total gradient value, repeating the step of training the first ALBERT model until the model converges, and obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or a linear spectrum representing speech from the speech data.
  • the embodiments of the present application also provide a computer device, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • the first terminal acquires a locally stored training sample data set, where the training sample data set includes a plurality of voice data;
  • the first terminal trains the locally deployed first ALBERT model by using the plurality of voice data, and outputs the first gradient value of the first ALBERT model
  • the first terminal compresses the first gradient value to obtain a compressed gradient value
  • the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by a plurality of second terminals joined to the federated network, and the second gradient value is obtained by the second terminal training the locally deployed second ALBERT model;
  • the first terminal updates the model parameters of the ALBERT model according to the total gradient value, repeats the step of training the first ALBERT model until the model converges, and obtains a trained speech representation model, wherein,
  • the speech representation model is used for extracting mel spectrum and/or linear spectrum representing speech from speech data.
  • the embodiments of the present application further provide a computer-readable storage medium, where computer-readable instructions are stored in the computer-readable storage medium, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the following steps: the first terminal acquires a locally stored training sample data set, where the training sample data set includes a plurality of speech data;
  • the first terminal trains the first ALBERT model deployed locally using the multiple voice data, and outputs the first gradient value of the first ALBERT model
  • the first terminal compresses the first gradient value to obtain a compressed gradient value
  • the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by a plurality of second terminals joined to the federated network, and the second gradient value is obtained by the second terminal training the locally deployed second ALBERT model;
  • the first terminal updates the model parameters of the ALBERT model according to the total gradient value, repeats the step of training the first ALBERT model until the model converges, and obtains a trained speech representation model, wherein,
  • the speech representation model is used for extracting mel spectrum and/or linear spectrum representing speech from speech data.
  • a locally stored training sample data set is obtained through a first terminal, and the training sample data set includes a plurality of speech data; the first terminal trains the locally deployed first ALBERT model through the multiple voice data, and outputs the first gradient value of the first ALBERT model; the first terminal compresses the first gradient value to obtain a compressed gradient value; the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by the plurality of second terminals joined to the federated network, and the second gradient value is obtained by the second terminal training the locally deployed second ALBERT model; the first terminal receives the total gradient value returned by the federation server; the first terminal updates the model parameters of the ALBERT model according to the total gradient value, continues to train the first ALBERT model until the model converges, and obtains a trained speech representation model, wherein the speech representation model is used to extract the mel spectrum and/or linear spectrum that characterizes speech from speech data.
  • the gradient value is compressed before being uploaded to the federation server, so as to avoid the communication delay caused by excessively large gradient data, reduce the training time of the model, and improve the training efficiency of the model.
  • FIG. 1 is a schematic flowchart of steps of an embodiment of a method for training a speech representation model based on federated learning of the present application.
  • FIG. 2 is a schematic diagram of program modules of an apparatus for training a speech representation model based on federated learning according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present application.
  • although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish the same type of information from each other.
  • first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information, without departing from the scope of the present disclosure.
  • word "if” as used herein can be interpreted as "at the time of” or "when” or "in response to determining.”
  • FIG. 1 a flowchart of a method for training a speech representation model based on federated learning according to an embodiment of the present application is shown. It can be understood that the flowchart in this embodiment of the method does not limit the order of executing steps.
  • the following is an exemplary description of a speech representation model training device based on federated learning (hereinafter referred to as "training device") as the execution subject.
  • the training device can be applied to computer equipment, and the computer equipment can be a mobile phone, tablet personal computer, laptop computer, server, or other equipment with a data transmission function. The details are as follows:
  • Step S10 the first terminal acquires a locally stored training sample data set, where the training sample data set includes a plurality of speech data.
  • the voice data in the training sample data set may be all the voice data stored locally by the first terminal, or may be part of the voice data stored locally by the first terminal, which is not limited in this embodiment.
  • the local storage refers to a storage medium that can store data, such as a memory and/or a hard disk in the first terminal.
  • the first terminal refers to a local terminal that performs training on the training sample data set. For example, if the speech representation model currently to be trained is jointly trained by a federated network composed of a local terminal A and other remote terminals B, the first terminal is the local terminal A.
  • that is, in this embodiment, after acquiring the locally stored training sample data set, the method further includes: the first terminal preprocesses the plurality of voice data to convert each piece of voice data into a corresponding mel spectrum and/or linear spectrum.
  • the length parameter LEN of each voice data can be preset, that is, the length of each frame of voice can be set.
  • according to the length parameter LEN, the voice data can be divided into multiple frames of speech data.
  • the moving window function can be used to process the voice data into multiple frames of voice data of equal length.
  • during frame division processing, voice data shorter than LEN can be padded to the required length, while for voice data longer than LEN, speech frames of length LEN are obtained by uniform sampling with a moving window function.
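The frame-division rule above (pad short signals, slide a window over long ones) can be sketched as follows; the zero-padding scheme and the hop size are illustrative assumptions rather than the patent's exact procedure:

```python
def frame_voice_data(samples, frame_len, hop=None):
    """Split a voice signal into fixed-length frames of frame_len (LEN) samples.

    Signals shorter than LEN are zero-padded; longer ones are covered by a
    moving window with step `hop`. Padding and hop are assumptions.
    """
    hop = hop or frame_len  # non-overlapping windows by default
    if len(samples) < frame_len:
        samples = samples + [0.0] * (frame_len - len(samples))  # pad to LEN
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

frames = frame_voice_data([0.1] * 1000, frame_len=400, hop=200)
print(len(frames), len(frames[0]))  # 4 400
```

A signal of 100 samples with frame_len=400 yields a single zero-padded frame, matching the "supplementary length" case in the text.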
  • each frame of voice data may be converted into corresponding mel spectrum and/or linear spectrum by using a preset tool.
  • each frame of speech data can be converted into a corresponding mel-scale spectrum (mel-scale spectrogram) through the librosa tool.
  • the speech signal is a one-dimensional signal; its time-domain information can be observed directly, but its frequency-domain information cannot. A Fourier transform (FT) can map it to the frequency domain, but the time-domain information is then lost and the time-frequency relationship cannot be seen. Many methods have been developed to solve this problem; the short-time Fourier transform and wavelets are commonly used time-frequency analysis methods.
  • the short-time Fourier transform is to perform Fourier transform on the short-time signal.
  • the principle is as follows: divide a long speech signal into frames, apply a window to each frame, perform a Fourier transform on each frame, and then stack the per-frame results along another dimension to obtain an image (similar to a two-dimensional signal), which is the spectrogram, also called the linear spectrum.
  • since the obtained spectrogram is large, it is usually passed through a mel-scale filter bank to obtain sound features of a suitable size, yielding a mel spectrum.
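The STFT-to-spectrogram step described above (frame, window, Fourier transform, stack) can be sketched with NumPy; the frame length, hop size, and Hann window are illustrative assumptions:

```python
import numpy as np

def stft_spectrogram(signal, frame_len=400, hop=160):
    """Compute a linear spectrogram: frame the signal, apply a Hann window,
    take the FFT magnitude of each frame, and stack the frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft gives frame_len//2 + 1 frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape (freq_bins, n_frames)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz at 16 kHz
spec = stft_spectrogram(sig)
print(spec.shape)  # (201, 98)
```

The 440 Hz tone concentrates its energy near frequency bin 11 (440 Hz / 40 Hz per bin), which is the time-frequency picture the text refers to; a mel filter bank would then be applied to this matrix to obtain the mel spectrum.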
  • the first terminal preprocessing the plurality of voice data to convert each piece of voice data into a corresponding mel spectrum and/or linear spectrum includes: the first terminal performs frame division processing on each piece of voice data to divide it into multiple frames of audio data; mask processing is performed on the multi-frame audio data using a preset masking rule to obtain masked audio data; and the masked audio data is converted into the mel spectrum and/or linear spectrum corresponding to each piece of voice data.
  • the mask rule can be set and adjusted according to the actual situation.
  • the masking rule may be to select a preset number of frames of voice data in the multi-frame voice data to perform mask processing, for example, select 15% of the voice frames to perform mask processing.
  • the mask processing refers to adding noise to the speech data.
  • this masking process can zero out 80% of the selected frames, replace 10% of the selected frames with other frames randomly sampled from the same utterance, and keep the remaining 10% of the selected frames unchanged.
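A minimal sketch of this masking rule (15% of frames selected; of those, 80% zeroed, 10% replaced with a randomly sampled frame from the same utterance, 10% kept) follows; the list-of-frames representation and the fixed seed are illustrative assumptions:

```python
import random

def mask_frames(frames, select_ratio=0.15, seed=0):
    """Apply the masking rule described above to a list of frames."""
    rng = random.Random(seed)
    out = [list(f) for f in frames]  # copy so the input is untouched
    n_select = max(1, int(len(out) * select_ratio))
    selected = rng.sample(range(len(out)), n_select)
    for idx in selected:
        r = rng.random()
        if r < 0.8:                                   # 80%: zero out
            out[idx] = [0.0] * len(out[idx])
        elif r < 0.9:                                 # 10%: replace with a random frame
            out[idx] = list(out[rng.randrange(len(out))])
        # else: 10%: keep the original frame
    return out, selected

frames = [[float(i)] * 4 for i in range(20)]          # 20 toy frames of length 4
masked, selected = mask_frames(frames)
print(len(selected))  # 3 frames selected (15% of 20)
```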
  • Step S11 the first terminal trains the locally deployed first ALBERT model by using the plurality of voice data, and outputs the first gradient value of the first ALBERT model.
  • the first ALBERT model is a variant of the BERT model.
  • compared with BERT, ALBERT greatly reduces the number of model parameters, which makes practical use more convenient; it is one of the classic BERT variants.
  • the first ALBERT model can be downloaded from the federated learning network and then deployed locally on the first terminal, or can be directly deployed locally on the first terminal without first being downloaded from the federated learning network.
  • the examples are not limited.
  • the first ALBERT model may be a model that has undergone certain training, or may be an initial model that has not been trained.
  • the first ALBERT model has a corresponding loss function, and the first gradient value is a vector used for approximating the minimum value of the loss function in the iterative process of training the first ALBERT model through a plurality of speech data.
  • the first ALBERT model can be a Transformer structure with several superimposed layers (for example, 6, 12, or 24 layers).
  • in this embodiment, the step of the first terminal training the locally deployed first ALBERT model using the multiple voice data and outputting the first gradient value of the first ALBERT model includes: the first terminal trains the locally deployed first ALBERT model using the mel spectrum and/or linear spectrum corresponding to the plurality of speech data, and outputs the first gradient value of the first ALBERT model.
  • Step S12 the first terminal compresses the first gradient value to obtain a compressed gradient value.
  • the network environment of each terminal joined to the federated learning network is different. Therefore, so that the first gradient value can be uploaded to the federated learning network more efficiently, the first gradient value is compressed first to obtain a lighter gradient value, and the compressed gradient value is then uploaded to the federated learning network.
  • the first terminal performing compression processing on the first gradient value to obtain the compressed gradient value includes: the first terminal performs sparsification on the first gradient value to obtain K values, where K is an integer greater than 1; the first terminal performs quantization on the K values to obtain quantized gradient values, and uses the quantized gradient values as the compressed gradient value.
  • the first gradient value may be sparsified by means of Top-k sparsification, where Top-k sparsification refers to keeping the largest K values in the gradient matrix (the first gradient value) and setting the rest to 0; K is a hyperparameter that can be set and adjusted according to the actual situation.
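Top-k sparsification as described above can be sketched as follows; representing the gradient as a flat list (rather than a matrix) and selecting by absolute value are illustrative simplifications:

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a gradient vector, zero the rest."""
    if not 1 < k <= len(grad):
        raise ValueError("k must satisfy 1 < k <= len(grad)")
    # Indices of the k entries with the largest absolute value
    keep = set(sorted(range(len(grad)),
                      key=lambda i: abs(grad[i]), reverse=True)[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]

grad = [0.01, -0.5, 0.3, 0.02, -0.04, 0.25]
print(topk_sparsify(grad, 3))  # [0.0, -0.5, 0.3, 0.0, 0.0, 0.25]
```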
  • the K values may be quantized by means of ternary quantization, where ternary quantization refers to quantizing each of the K retained gradient values into one of {-u, 0, u}, so that a floating-point number can be represented with only 2 bits. This greatly compresses the first gradient value, reduces the bandwidth required to transmit it, and reduces the time required for its transmission.
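Ternary quantization of the retained values can be sketched as follows; choosing u as the mean absolute value of the non-zero entries is an assumption, since the text does not specify how u is determined:

```python
def ternary_quantize(values):
    """Quantize gradient values into {-u, 0, u}; dequantize as code * u."""
    nonzero = [abs(v) for v in values if v != 0.0]
    u = sum(nonzero) / len(nonzero) if nonzero else 0.0
    # Each code is one of {-1, 0, 1}, so it fits in 2 bits
    codes = [0 if v == 0.0 else (1 if v > 0 else -1) for v in values]
    return u, codes

u, codes = ternary_quantize([0.0, -0.5, 0.3, 0.0, 0.0, 0.25])
print(u)      # approximately 0.35
print(codes)  # [0, -1, 1, 0, 0, 1]
```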
  • the first terminal performing quantization processing on the K values to obtain a quantized gradient value and using the quantized gradient value as the compressed gradient value includes: the first terminal performs quantization processing on the K values to obtain a quantized gradient value; the first terminal then performs row coding or column coding on the quantized gradient value to obtain a coded gradient value, and uses the coded gradient value as the compressed gradient value.
  • row coding or column coding refers to using a row-based or column-based matrix to store the gradient values after quantization.
  • when the gradient values are stored by row coding or column coding, only two arrays are required: one array stores all the non-zero elements of the gradient values, and the other array stores the indices of those non-zero elements within the matrix row or matrix column. The zero elements of the gradient values need not be stored at all, which further improves the compression ratio of the first gradient value.
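The two-array storage scheme described above can be sketched as follows; the flat per-row representation is an illustrative simplification:

```python
def row_encode(quantized):
    """Encode a quantized gradient row as two arrays: the non-zero
    elements and their indices within the row."""
    values = [v for v in quantized if v != 0]
    indices = [i for i, v in enumerate(quantized) if v != 0]
    return values, indices

def row_decode(values, indices, length):
    """Rebuild the dense row from the two arrays."""
    row = [0] * length
    for v, i in zip(values, indices):
        row[i] = v
    return row

vals, idx = row_encode([0, -1, 1, 0, 0, 1])
print(vals, idx)                 # [-1, 1, 1] [1, 2, 5]
print(row_decode(vals, idx, 6))  # [0, -1, 1, 0, 0, 1]
```

Since the zeros are never stored, a Top-k-sparsified gradient with K non-zero entries costs only 2K array slots.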
  • Step S13 the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by the plurality of second terminals joined to the federated network, and the second gradient value is obtained by the second terminal training the locally deployed second ALBERT model.
  • the federation server may perform an aggregation operation on the second gradient values uploaded by the multiple second terminals joined to the federated network, and take the aggregated gradient value as the total gradient value.
  • the second terminal refers to a terminal joined to the federated network, and the second terminal may be the first terminal or another terminal in the federated network other than the first terminal; this is not limited in this embodiment.
  • the second gradient value is obtained after the second terminal trains the second ALBERT model through the locally stored speech data, and the second ALBERT model and the first ALBERT model are the same model.
  • when the federated server aggregates the total gradient value, it can use the following formula: g_{t+1} = (1/C) Σ_{i=1}^{C} g_i, where:
  • g_{t+1} is the total gradient value;
  • g_i is the second gradient value uploaded by the i-th second terminal;
  • C is the number of second terminals.
  • the total gradient value is the average value of the sum of the second gradient values uploaded by the C second terminals.
  • when the federated server calculates the total gradient value, it may randomly select the second gradient values uploaded by several second terminals from all the second terminals joined to the federated network, or it may use the second gradient values uploaded by all the second terminals in the federated network; this is not limited in this embodiment.
  • for example, the second gradient values uploaded by C second terminals may be selected from all the second terminals joined to the federated network to calculate the total gradient value, where C is a preset value; if C is set to 5, the second gradient values uploaded by 5 second terminals are selected from all the second terminals joined to the federated network to calculate the total gradient value.
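The server-side aggregation (averaging the second gradient values of C selected terminals, g_{t+1} = (1/C) Σ g_i) can be sketched as follows; random client selection is one of the two options the text allows, and the per-client list-of-floats representation is an illustrative assumption:

```python
import random

def aggregate_total_gradient(client_grads, c, seed=0):
    """Average the gradients of c randomly selected clients,
    implementing g_{t+1} = (1/C) * sum of the selected second gradient values."""
    rng = random.Random(seed)
    chosen = rng.sample(list(client_grads), c)
    dim = len(chosen[0])
    return [sum(g[j] for g in chosen) / c for j in range(dim)]

# Five toy second terminals, each with a 2-dimensional gradient
grads = {f"client{i}": [float(i), float(i) * 2] for i in range(1, 6)}
total = aggregate_total_gradient(grads.values(), c=5)
print(total)  # [3.0, 6.0] (average over all five clients)
```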
  • Step S14 the first terminal receives the total gradient value returned by the federation server.
  • the federation server will push the calculated total gradient value to each terminal that joins the federated network, so that each terminal can update its own model parameters according to the pushed total gradient value.
  • Step S15 the first terminal updates the model parameters of the ALBERT model according to the total gradient value, repeats the step of training the first ALBERT model until the model converges, and obtains a trained speech representation model , wherein the speech representation model is used to extract the mel spectrum and/or linear spectrum representing speech from the speech data.
  • after receiving the total gradient value, the first terminal determines whether to continue training the model: if the model has not converged, training continues; if the model has converged, training stops and the trained model is used as the speech representation model, so that a mel spectrum and/or a linear spectrum representing speech can be extracted from speech data through the speech representation model.
  • model convergence in this embodiment means that the loss value of the model is smaller than a preset value, or that the change in the loss value tends to be stable, that is, the difference between the loss values of two or more consecutive training rounds is smaller than a set value Δ, so that the loss value essentially no longer changes. The specific value of Δ can be set and adjusted according to the actual situation.
  • a locally stored training sample data set is obtained through the first terminal, and the training sample data set includes multiple pieces of voice data; the first terminal trains the locally deployed first ALBERT model using the multiple pieces of voice data, and outputs the first gradient value of the first ALBERT model; the first terminal compresses the first gradient value to obtain a compressed gradient value; the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by the multiple second terminals joined to the federated network, and the second gradient value is obtained by the second terminal training the locally deployed second ALBERT model; the first terminal receives the total gradient value returned by the federation server; the first terminal updates the model parameters of the ALBERT model according to the total gradient value and continues to train the first ALBERT model until the model converges, obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or a linear spectrum representing speech from speech data.
  • FIG. 2 shows a schematic diagram of program modules of an apparatus 200 for training a speech representation model based on federated learning (hereinafter referred to as “training apparatus” 200 ) according to an embodiment of the present application.
  • the training device 200 can be applied to computer equipment, and the computer equipment can be a mobile phone, a tablet personal computer, a laptop computer, a server, and other equipment with a data transmission function.
  • the training device 200 may include or be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to complete this application and implement the above-mentioned federated learning-based speech representation model training method.
  • the program modules referred to in the embodiments of the present application refer to a series of computer-readable instruction segments capable of performing specific functions, and are more suitable for describing the execution process of the federated learning-based speech representation model training method in the storage medium than the program itself.
  • the apparatus 200 for training a speech representation model based on federated learning includes an acquisition module 201 , a training module 202 , a compression module 203 , an uploading module 204 , a receiving module 205 and an updating module 206 .
  • the following description will specifically introduce the functions of each program module in this embodiment:
  • the obtaining module 201 is used for the first terminal to obtain a locally stored training sample data set, where the training sample data set includes a plurality of speech data.
  • the voice data in the training sample data set may be all the voice data stored locally by the first terminal, or may be part of the voice data stored locally by the first terminal, which is not limited in this embodiment.
  • the local storage refers to a storage medium that can store data, such as a memory and/or a hard disk in the first terminal.
  • the training apparatus 200 may further include a preprocessing module.
  • the preprocessing module is used for the first terminal to preprocess the plurality of voice data to convert each of the audio data into a corresponding mel spectrum and/or linear spectrum.
  • when preprocessing the voice data, a length parameter LEN can be preset for each piece of voice data, i.e., the length of each speech frame can be set.
  • in this way, the voice data can be divided into multiple frames of voice data.
  • specifically, a moving window function can be used to split the voice data into multiple frames of equal length.
  • during framing, voice data shorter than LEN can be padded to that length, while voice data longer than LEN is evenly sampled by the moving window function to obtain speech frames of length LEN.
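The framing scheme just described (pad clips shorter than LEN, slide a moving window over longer ones) can be sketched as follows. `frame_speech`, `frame_len`, and `hop` are illustrative names chosen here, not from the source:

```python
import numpy as np

def frame_speech(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D speech signal into equal-length frames of frame_len samples.

    Signals shorter than frame_len are zero-padded ("supplementary length"
    processing); longer signals are sampled with a moving window of step hop.
    """
    if len(signal) < frame_len:                      # pad short utterances
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop  # moving-window count
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])

frames = frame_speech(np.arange(1000, dtype=float), frame_len=400, hop=160)
print(frames.shape)  # (4, 400)
```

The hop size controls frame overlap; the patent only fixes the frame length LEN, so the 75% overlap above is an assumed choice.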
  • each frame of voice data may be converted into corresponding mel spectrum and/or linear spectrum by using a preset tool.
  • for example, each frame of speech data can be converted into a corresponding mel-scale spectrogram through the librosa tool.
  • note that a speech signal is one-dimensional: it directly shows time-domain information but not frequency-domain information. A Fourier transform (FT) moves it to the frequency domain, but the time-domain information is then lost and the time-frequency relationship cannot be seen. Many methods address this; the short-time Fourier transform and wavelets are among the most commonly used time-frequency analysis methods.
  • the short-time Fourier transform (STFT) applies the Fourier transform to short segments of the signal.
  • the principle is as follows: divide a long speech signal into frames, apply a window to each, perform a Fourier transform on each frame, and stack the per-frame results along another dimension to obtain an image (similar to a two-dimensional signal). This image is the spectrogram, also called the linear spectrum.
  • because the resulting spectrogram is large, it is usually passed through a mel-scale filter bank to obtain sound features of a suitable size, yielding the mel spectrum.
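As a rough illustration of the pipeline above (frame, window, FFT, stack, then mel filter bank), the following NumPy sketch builds a linear spectrum and a mel spectrum. The text uses the librosa tool for this step, so the hand-rolled triangular filter bank here is only an assumed simplification of librosa's mel basis:

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=128):
    """Frame, apply a Hann window, FFT each frame, stack along time:
    the spectrogram (linear spectrum) described above, shape (freq, time)."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

def mel_filter_bank(n_mels, n_fft, sr):
    """Triangular filters spaced on the mel scale (simplified stand-in
    for a librosa-style mel basis)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))   # linear spectrum
mel_spec = mel_filter_bank(40, 512, sr) @ spec       # mel spectrum
print(spec.shape, mel_spec.shape)
```

In practice `librosa.feature.melspectrogram` performs both steps in one call; the frame length, hop, and filter count above are illustrative.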
  • the preprocessing module is further configured for the first terminal to frame each piece of audio data to divide it into multiple frames of audio data; to mask the multiple frames of audio data with a preset masking rule to obtain masked audio data; and to convert the masked audio data into the mel spectrum and/or linear spectrum corresponding to each piece of audio data.
  • the mask rule can be set and adjusted according to the actual situation.
  • the masking rule may be to select a preset number of frames of voice data in the multi-frame voice data for mask processing, for example, select 15% of the voice frames for mask processing.
  • the mask processing refers to adding noise to the speech data.
  • this masking process may zero out 80% of the selected frames, replace 10% of them with other frames randomly sampled from the same utterance, and keep the remaining 10% unchanged.
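The 15% selection and the 80/10/10 masking rule described above can be sketched as follows; the ratios come from the text, while the function name and the fixed random seed are illustrative assumptions:

```python
import numpy as np

def mask_frames(frames: np.ndarray, select_p=0.15, rng=None):
    """Select ~select_p of the frames; of those, zero 80%, replace 10%
    with a random frame from the same utterance, keep 10% unchanged."""
    rng = rng or np.random.default_rng(0)
    out = frames.copy()
    n = len(frames)
    selected = rng.choice(n, size=max(1, int(n * select_p)), replace=False)
    for idx in selected:
        r = rng.random()
        if r < 0.8:                        # zero out ("add noise")
            out[idx] = 0.0
        elif r < 0.9:                      # swap in a randomly sampled frame
            out[idx] = frames[rng.integers(n)]
        # else: keep the original frame
    return out, selected

frames = np.ones((100, 8))
masked, sel = mask_frames(frames)
print(len(sel))  # 15 of 100 frames selected
```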
  • the training module 202 is used for the first terminal to train the locally deployed first ALBERT model by using the plurality of speech data, and output the first gradient value of the first ALBERT model.
  • the first ALBERT model is a variant of the BERT model that, while maintaining performance, greatly reduces the number of model parameters, making it more convenient to use in practice; it is one of the classic BERT variants.
  • the first ALBERT model can be downloaded from the federated learning network and then deployed locally on the first terminal, or it can be deployed locally on the first terminal directly, without first being downloaded from the federated learning network; this embodiment does not limit this.
  • the first ALBERT model may be a model that has undergone certain training, or may be an initial model that has not been trained.
  • the first ALBERT model has a corresponding loss function, and the first gradient value is a vector used for approximating the minimum value of the loss function in the iterative process of training the first ALBERT model through a plurality of speech data.
  • the first ALBERT model can be a stack of several Transformer layers (for example, 6, 12, or 24 layers), with the model weights denoted w_t.
  • the training module 202 is further configured for the first terminal to train the locally deployed first ALBERT model with the mel spectra and/or linear spectra corresponding to the multiple pieces of voice data, and to output the first gradient value of the first ALBERT model.
  • the compression module 203 is used for the first terminal to compress the first gradient value to obtain a compressed gradient value.
  • because the terminals joined to the federated learning network sit in different network environments, the first gradient value is first compressed into a lighter-weight gradient value so that it can be uploaded to the federated learning network more efficiently; the compressed gradient value is then uploaded to the federated learning network.
  • the compression module 203 is further configured for the first terminal to sparsify the first gradient value to obtain K values, where K is an integer greater than 1, and for the first terminal to quantize the K values to obtain a quantized gradient value that serves as the compressed gradient value.
  • the first gradient value may be sparsified with Top-k sparsification, which takes the largest K values in the gradient matrix (the first gradient value) and sets the rest to 0; K is a hyperparameter that can be set and adjusted according to the actual situation.
  • the K values may be quantized with ternary quantization, which maps each of the K retained gradient values to one of {-u, 0, u}, so that a floating-point number can be represented with only 2 bits. This greatly compresses the first gradient value, reduces the bandwidth required to transmit it, and saves transmission time.
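A minimal sketch of Top-k sparsification followed by ternary quantization as described above. The text does not fix how u is chosen, so taking u as the mean magnitude of the kept values is an assumption:

```python
import numpy as np

def compress_gradient(grad: np.ndarray, k: int):
    """Keep the k largest-magnitude entries of the gradient, zero the rest,
    then map each survivor to one of {-u, 0, u}."""
    flat = grad.ravel().copy()
    keep = np.argsort(np.abs(flat))[-k:]   # indices of the top-k magnitudes
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    u = np.abs(flat[keep]).mean()          # one shared magnitude (assumed rule)
    ternary = np.sign(sparse) * u          # values are now in {-u, 0, u}
    return ternary.reshape(grad.shape), u

g = np.array([[0.9, -0.05, 0.02], [-0.7, 0.01, 0.3]])
q, u = compress_gradient(g, k=3)
print(q)
```

Only the sign of each kept entry (2 bits: negative, zero, positive) plus the single scalar u then needs to be transmitted.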
  • in an exemplary implementation, the compression module 203 is further configured for the first terminal to quantize the K values to obtain a quantized gradient value; the first terminal then applies row coding or column coding to the quantized gradient value to obtain an encoded gradient value, which serves as the compressed gradient value.
  • row coding or column coding refers to storing the quantized gradient values in a row-based or column-based matrix.
  • when gradient values are stored with row coding or column coding, only two arrays are required: one stores all non-zero elements of the gradient values, and the other stores the indices of those non-zero elements within the matrix rows or columns. Zero elements need not be stored at all, which further improves the compression ratio of the first gradient value.
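The row-coding idea above can be sketched as follows. The text mentions only the two arrays (values and indices); the per-row non-zero counts kept here are an added assumption so that rows can be rebuilt, analogous to the row pointer of CSR sparse storage:

```python
import numpy as np

def row_encode(mat: np.ndarray):
    """Store a quantized gradient matrix as: the non-zero elements (row by
    row), their column indices, and how many non-zeros each row holds."""
    values, col_idx = [], []
    for row in mat:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
    counts = [int(np.count_nonzero(r)) for r in mat]
    return np.array(values), np.array(col_idx), counts

q = np.array([[0.6, 0.0, 0.0], [-0.6, 0.0, 0.6]])
vals, cols, counts = row_encode(q)
print(vals, cols, counts)
```

After Top-k sparsification most entries are zero, so this encoding stores only K values and K indices instead of the full matrix.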
  • the uploading module 204 is used for the first terminal to upload the compressed gradient value to the federation server, wherein the federation server calculates a total gradient value from the second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model.
  • specifically, the federation server may aggregate the second gradient values uploaded by the multiple second terminals joined to the federated network and use the aggregated result as the total gradient value.
  • the second gradient value is obtained after the second terminal trains the second ALBERT model through the locally stored speech data, and the second ALBERT model and the first ALBERT model are the same model.
  • when aggregating the total gradient value, the federation server may calculate it with the following formula: g_{t+1} = (1/C) Σ_{c=1}^{C} g_{t+1}^{(c)}, where g_{t+1} is the total gradient value, g_{t+1}^{(c)} is the second gradient value uploaded by the c-th second terminal, and C is the number of second terminals.
  • that is, the total gradient value is the average of the sum of the second gradient values uploaded by the C second terminals.
  • when the federation server calculates the total gradient value, it may randomly select the second gradient values uploaded by several of the second terminals joined to the federated network, or it may use the second gradient values uploaded by all of them; this embodiment does not limit this.
  • as an example, the second gradient values uploaded by C second terminals may be selected from all the second terminals joined to the federated network, where C is a preset value; for instance, if C is set to 5, the second gradient values uploaded by 5 second terminals are selected from all the second terminals joined to the federated network to calculate the total gradient value.
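The server-side aggregation described above, g_{t+1} = (1/C) Σ g_{t+1}^{(c)}, is an element-wise average of the uploaded gradients:

```python
import numpy as np

def aggregate(gradients):
    """Total gradient value: the element-wise mean of the C second-terminal
    gradient uploads."""
    return sum(gradients) / len(gradients)

uploads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 0.0])]
total = aggregate(uploads)
print(total)  # [3. 2.]
```

Whether `gradients` holds uploads from all joined terminals or from a random subset of C of them is a deployment choice, as the text notes.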
  • the receiving module 205 is used for the first terminal to receive the total gradient value returned by the federation server.
  • the federated server will push the calculated total gradient value to each terminal that joins the federated network, so that each terminal can update its own model parameters according to the pushed total gradient value.
  • the updating module 206 is used for the first terminal to update the model parameters of the ALBERT model according to the total gradient value and repeat the steps of training the first ALBERT model until the model converges, obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or a linear spectrum representing speech from speech data.
  • specifically, after updating the model parameters, the first terminal determines whether to continue training: if the model has not converged, training continues; once the model has converged, training stops, and the trained model serves as the speech representation model, through which a mel spectrum and/or linear spectrum representing speech can be extracted from speech data.
  • model convergence in this embodiment means that the loss value of the model is smaller than a preset value, or that the change in the loss value has leveled off, i.e., the difference between the loss values of two or more consecutive training rounds is smaller than a set value, so the loss essentially no longer changes.
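The "loss essentially stops changing" test above can be sketched as follows; the function name, threshold, and patience values are illustrative, not fixed by the text:

```python
def has_converged(losses, threshold=1e-3, patience=2):
    """Return True when the differences between the last patience+1
    consecutive loss values all stay below threshold, i.e. the loss has
    essentially stopped changing."""
    if len(losses) <= patience:
        return False
    recent = losses[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < threshold
               for i in range(patience))

print(has_converged([0.9, 0.5, 0.30, 0.2995, 0.2991]))  # True
```

The alternative criterion in the text, loss below a preset value, is simply `losses[-1] < preset_value`.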
  • in an exemplary implementation, the update module 206 is further configured for the first terminal to acquire the current model parameters of the ALBERT model and to update them according to the current model parameters, the total gradient value, and a preset formula: w_{t+1} = w_t − αg_{t+1}, where w_{t+1} is the updated model parameter, w_t is the current model parameter, g_{t+1} is the total gradient value, and α is a preset constant.
  • the specific value of α can be set and adjusted according to the actual situation.
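The update rule w_{t+1} = w_t − α·g_{t+1} is a plain gradient step with the server's total gradient; the α value below is an arbitrary illustration:

```python
import numpy as np

def update_params(w_t: np.ndarray, g_total: np.ndarray, alpha: float = 0.01):
    """Apply w_{t+1} = w_t - alpha * g_{t+1}, where alpha is a preset
    constant chosen per deployment."""
    return w_t - alpha * g_total

w = np.array([0.5, -0.2])
g = np.array([1.0, -1.0])
print(update_params(w, g, alpha=0.1))  # [ 0.4 -0.1]
```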
  • in this embodiment, a locally stored training sample data set comprising multiple pieces of voice data is acquired by the first terminal; the first terminal trains the locally deployed first ALBERT model with the multiple pieces of voice data and outputs the first gradient value of the first ALBERT model; the first terminal compresses the first gradient value to obtain a compressed gradient value and uploads it to the federation server, wherein the federation server calculates a total gradient value from the second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model; the first terminal receives the total gradient value returned by the federation server and updates the model parameters of the ALBERT model according to it, continuing to train the first ALBERT model until the model converges, thereby obtaining a trained speech representation model used to extract a mel spectrum and/or linear spectrum representing speech from speech data. By compressing the gradient values before uploading them to the federation server, this embodiment avoids communication delays caused by oversized gradient data, shortens training time, and improves training efficiency.
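The whole client-side round summarized above (train locally, send the gradient, receive the total, update) can be sketched end to end with a toy model. The least-squares loss stands in for the ALBERT loss, the single-client "aggregation" stands in for the server average, and all names here are hypothetical:

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of a least-squares loss; a toy stand-in for the first
    ALBERT model's loss gradient."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = np.zeros(4)
for round_ in range(50):             # repeat until convergence in practice
    g1 = local_gradient(w, X, y)     # first gradient value (compression omitted)
    total = g1                       # server would average terminals' gradients
    w = w - 0.05 * total             # w_{t+1} = w_t - alpha * g_{t+1}
print(np.round(w, 2))
```

With several terminals, `total` would instead be the mean of all uploaded (decompressed) gradients for that round.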
  • FIG. 3 is a schematic diagram of the hardware architecture of a computer device 300 according to an embodiment of the present application.
  • the computer device 300 is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions.
  • the computer equipment 300 at least includes, but is not limited to, a memory 301, a processor 302, and a network interface 303 that can communicate with each other through a device bus, wherein:
  • the memory 301 includes at least one type of computer-readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), a random access memory ( RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like.
  • the memory 301 may be an internal storage unit of the computer device 300 , such as a hard disk or a memory of the computer device 300 .
  • the memory 301 may also be an external storage device of the computer device 300, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 300.
  • the memory 301 may also include both the internal storage unit of the computer device 300 and its external storage device.
  • the memory 301 is generally used to store the operating system and various application software installed on the computer device 300, such as the program code of the federated learning-based speech representation model training apparatus 200.
  • the memory 301 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 302 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 302 is generally used to control the overall operation of the computer device 300 .
  • the processor 302 is configured to run the program code or process the data stored in the memory 301, for example, to run the federated learning-based speech representation model training apparatus 200, so as to implement the federated learning-based speech representation model training method of the above embodiments.
  • the network interface 303 may include a wireless network interface or a wired network interface, and the network interface 303 is generally used to establish a communication connection between the computer equipment 300 and other electronic devices.
  • the network interface 303 is used to connect the computer device 300 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 300 and the external terminal.
  • the network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
  • FIG. 3 only shows computer device 300 having components 301-303, but it should be understood that implementation of all of the shown components is not required, and that more or fewer components may be implemented instead.
  • the federated learning-based speech representation model training apparatus 200 stored in the memory 301 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 301 , and executed by one or more processors (the processor 302 in this embodiment) to complete the federated learning-based speech representation model training method of the present application.
  • This embodiment also provides a computer-readable storage medium, which may be non-volatile or volatile, such as flash memory, hard disk, multimedia card, card-type storage (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory , magnetic disks, optical discs, servers, App application malls, etc., on which computer-readable instructions are stored, and when the programs are executed by the processor, corresponding functions are realized.
  • the computer-readable storage medium of this embodiment is used to store the apparatus 200 for training a speech representation model based on federated learning, so as to implement the following steps when executed by a processor:
  • the first terminal acquires a locally stored training sample data set, where the training sample data set includes a plurality of voice data;
  • the first terminal trains the locally deployed first ALBERT model by using the plurality of voice data, and outputs the first gradient value of the first ALBERT model
  • the first terminal compresses the first gradient value to obtain a compressed gradient value
  • the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates the total gradient value according to the second gradient values uploaded by a plurality of second terminals joined to the federated network, and the second The gradient value is obtained by the second terminal training the locally deployed second ALBERT model;
  • the first terminal receives the total gradient value returned by the federation server; the first terminal updates the model parameters of the ALBERT model according to the total gradient value, repeats the step of training the first ALBERT model until the model converges, and obtains a trained speech representation model, wherein,
  • the speech representation model is used for extracting mel spectrum and/or linear spectrum representing speech from speech data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Provided is a federated learning-based method for training a speech representation model, comprising: a first terminal acquires a locally stored training sample data set, the training sample data set comprising multiple pieces of voice data (S10); the first terminal trains a locally deployed first ALBERT model with the multiple pieces of voice data and outputs a first gradient value of the first ALBERT model (S11); the first terminal compresses the first gradient value to obtain a compressed gradient value (S12); the first terminal uploads the compressed gradient value to a federation server; the first terminal receives a total gradient value returned by the federation server (S14); the first terminal updates the model parameters of the ALBERT model according to the total gradient value and continues training the first ALBERT model until the model converges, obtaining a trained speech representation model. This can improve the training efficiency of the model.

Description

Federated learning-based speech representation model training method, apparatus, device, and medium
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on April 25, 2021, with application number 202110448809.0 and invention title "Federated learning-based speech representation model training method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a federated learning-based speech representation model training method, apparatus, device, and medium.
Background
With the development of artificial intelligence, federated learning has gradually become a popular topic in the field. Federated learning trains models through multi-party collaboration, solving the data-silo problem while protecting user privacy and data security.
However, the inventors found that when existing federated learning techniques train a model in a complex network environment, network communication delays slow down the convergence of the model parameters, so training efficiency is low.
Summary
In view of this, embodiments of the present application aim to provide a federated learning-based speech representation model training method, apparatus, computer device, and computer-readable storage medium, to solve the problem that existing model training is inefficient.
To achieve the above objective, an embodiment of the present application provides a federated learning-based speech representation model training method, comprising:
a first terminal acquiring a locally stored training sample data set, the training sample data set comprising multiple pieces of voice data;
the first terminal training a locally deployed first ALBERT model with the multiple pieces of voice data and outputting a first gradient value of the first ALBERT model;
the first terminal compressing the first gradient value to obtain a compressed gradient value;
the first terminal uploading the compressed gradient value to a federation server, wherein the federation server calculates a total gradient value from second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model;
the first terminal receiving the total gradient value returned by the federation server;
the first terminal updating the model parameters of the ALBERT model according to the total gradient value and repeating the step of training the first ALBERT model until the model converges, obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or linear spectrum representing speech from speech data.
Optionally, before the step of the first terminal training the locally deployed first ALBERT model with the multiple pieces of voice data and outputting the first gradient value of the first ALBERT model, the method further comprises:
the first terminal preprocessing the multiple pieces of voice data to convert each piece of audio data into a corresponding mel spectrum and/or linear spectrum;
and the step of the first terminal training the locally deployed first ALBERT model with the multiple pieces of voice data and outputting the first gradient value of the first ALBERT model comprises:
the first terminal training the locally deployed first ALBERT model with the mel spectra and/or linear spectra corresponding to the multiple pieces of voice data and outputting the first gradient value of the first ALBERT model.
Optionally, the first terminal preprocessing the multiple pieces of voice data to convert each piece of audio data into a corresponding mel spectrum and/or linear spectrum comprises:
the first terminal framing each piece of audio data to divide it into multiple frames of audio data;
masking the multiple frames of audio data with a preset masking rule to obtain masked audio data;
converting the masked audio data into the mel spectrum and/or linear spectrum corresponding to each piece of audio data.
Optionally, the first terminal compressing the first gradient value to obtain the compressed gradient value comprises:
the first terminal sparsifying the first gradient value to obtain K values, where K is an integer greater than 1;
the first terminal quantizing the K values to obtain a quantized gradient value and using the quantized gradient value as the compressed gradient value.
Optionally, the first terminal quantizing the K values to obtain the quantized gradient value and using the quantized gradient value as the compressed gradient value comprises:
the first terminal quantizing the K values to obtain a quantized gradient value;
the first terminal applying row coding or column coding to the quantized gradient value to obtain an encoded gradient value and using the encoded gradient value as the compressed gradient value.
Optionally, the first terminal updating the model parameters of the ALBERT model according to the total gradient value comprises:
the first terminal acquiring the current model parameters of the ALBERT model;
the first terminal updating the model parameters of the ALBERT model according to the current model parameters, the total gradient value, and a preset formula, wherein the preset formula is: w_{t+1} = w_t − αg_{t+1}, where w_{t+1} is the updated model parameter, w_t is the current model parameter, g_{t+1} is the total gradient value, and α is a preset constant.
Optionally, the total gradient value is calculated as follows:
g_{t+1} = (1/C) Σ_{c=1}^{C} g_{t+1}^{(c)}
where g_{t+1} is the total gradient value, g_{t+1}^{(c)} is the second gradient value uploaded by the c-th second terminal, and C is the number of second terminals.
To achieve the above objective, an embodiment of the present application further provides a federated learning-based speech representation model training apparatus, comprising:
an acquisition module, used for the first terminal to acquire a locally stored training sample data set, the training sample data set comprising multiple pieces of voice data;
a training module, used for the first terminal to train a locally deployed first ALBERT model with the multiple pieces of voice data and output a first gradient value of the first ALBERT model;
a compression module, used for the first terminal to compress the first gradient value to obtain a compressed gradient value;
an uploading module, used for the first terminal to upload the compressed gradient value to a federation server, wherein the federation server calculates a total gradient value from second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model;
a receiving module, used for the first terminal to receive the total gradient value returned by the federation server;
an updating module, used for the first terminal to update the model parameters of the ALBERT model according to the total gradient value and repeat the step of training the first ALBERT model until the model converges, obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or linear spectrum representing speech from speech data.
To achieve the above objective, an embodiment of the present application further provides a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
a first terminal acquiring a locally stored training sample data set, the training sample data set comprising multiple pieces of voice data;
the first terminal training a locally deployed first ALBERT model with the multiple pieces of voice data and outputting a first gradient value of the first ALBERT model;
the first terminal compressing the first gradient value to obtain a compressed gradient value;
the first terminal uploading the compressed gradient value to a federation server, wherein the federation server calculates a total gradient value from second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model;
the first terminal receiving the total gradient value returned by the federation server;
the first terminal updating the model parameters of the ALBERT model according to the total gradient value and repeating the step of training the first ALBERT model until the model converges, obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or linear spectrum representing speech from speech data.
To achieve the above objective, an embodiment of the present application further provides a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the following steps: a first terminal acquiring a locally stored training sample data set, the training sample data set comprising multiple pieces of voice data;
the first terminal training a locally deployed first ALBERT model with the multiple pieces of voice data and outputting a first gradient value of the first ALBERT model;
the first terminal compressing the first gradient value to obtain a compressed gradient value;
the first terminal uploading the compressed gradient value to a federation server, wherein the federation server calculates a total gradient value from second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model;
the first terminal receiving the total gradient value returned by the federation server;
the first terminal updating the model parameters of the ALBERT model according to the total gradient value and repeating the step of training the first ALBERT model until the model converges, obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or linear spectrum representing speech from speech data.
With the federated learning-based speech representation model training method, apparatus, computer device, and computer-readable storage medium provided by the embodiments of the present application, a first terminal acquires a locally stored training sample data set comprising multiple pieces of voice data; the first terminal trains a locally deployed first ALBERT model with the multiple pieces of voice data and outputs a first gradient value of the first ALBERT model; the first terminal compresses the first gradient value to obtain a compressed gradient value and uploads it to a federation server, wherein the federation server calculates a total gradient value from the second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model; the first terminal receives the total gradient value returned by the federation server and updates the model parameters of the ALBERT model according to it, continuing to train the first ALBERT model until the model converges, thereby obtaining a trained speech representation model used to extract a mel spectrum and/or linear spectrum representing speech from speech data. By compressing the gradient values before uploading them to the federation server, this embodiment avoids communication delays caused by oversized gradient data, shortens model training time, and improves training efficiency.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the steps of an embodiment of the federated learning-based speech representation model training method of the present application.
FIG. 2 is a schematic diagram of the program modules of a federated learning-based speech representation model training apparatus according to an embodiment of the present application.
FIG. 3 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present application.
The realization of the objectives, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
The advantages of the present application are further described below with reference to the accompanying drawings and specific embodiments.
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various kinds of information, such information should not be limited to these terms, which are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
In the description of the present application, it should be understood that the numerals before the steps do not indicate the order in which the steps are executed; they are only used to facilitate the description of the present application and to distinguish each step, and therefore should not be construed as limiting the present application.
Referring to FIG. 1, a flowchart of the federated learning-based speech representation model training method according to Embodiment 1 of the present application is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The following exemplary description takes the federated learning-based speech representation model training apparatus (hereinafter the "training apparatus") as the executing subject. The training apparatus can be applied to a computer device, which may be a mobile phone, a tablet personal computer, a laptop computer, a server, or another device with a data transmission function. The details are as follows:
Step S10: the first terminal acquires a locally stored training sample data set, the training sample data set comprising multiple pieces of voice data.
Specifically, the voice data in the training sample data set may be all of the voice data stored locally on the first terminal or only part of it; this embodiment does not limit this.
It should be noted that local storage refers to a storage medium in the first terminal that can store data, such as a memory and/or a hard disk.
The first terminal refers to the local terminal that trains on the training sample data set. For example, if the speech representation model to be trained is jointly trained by a federated network composed of a local terminal A and other remote terminals B, the first terminal is the local terminal A.
In an exemplary implementation, since different pieces of voice data differ in duration, the voice data also needs to be preprocessed after acquisition to facilitate processing. That is, in this embodiment, after the locally stored training sample data set is acquired, the method further comprises: the first terminal preprocesses the multiple pieces of voice data to convert each piece of audio data into a corresponding mel spectrum and/or linear spectrum.
Specifically, when preprocessing the voice data, a length parameter LEN can be preset for each piece of voice data, i.e., the length of each speech frame, so that the voice data can be divided into multiple frames. In particular, a moving window function can be used to split the voice data into multiple frames of equal length. During framing, voice data shorter than LEN is padded to that length, while voice data longer than LEN is evenly sampled by the moving window function to obtain speech frames of length LEN.
In this embodiment, after the voice data is framed, each frame of voice data can be converted into a corresponding mel spectrum and/or linear spectrum with a preset tool; for example, each frame can be converted into a mel-scale spectrogram with the librosa tool.
It should be noted that a speech signal is one-dimensional: it directly shows time-domain information but not frequency-domain information. A Fourier transform (FT) moves it to the frequency domain, but the time-domain information is then lost and the time-frequency relationship cannot be seen. Many methods address this; the short-time Fourier transform and wavelets are among the most commonly used time-frequency analysis methods.
The short-time Fourier transform (STFT) applies the Fourier transform to short segments of the signal. The principle is as follows: a long speech signal is framed and windowed, a Fourier transform is applied to each frame, and the per-frame results are stacked along another dimension to obtain an image (similar to a two-dimensional signal). This image is the spectrogram, also called the linear spectrum.
Because the resulting spectrogram is large, it is usually passed through a mel-scale filter bank to obtain sound features of a suitable size, yielding the mel spectrum.
In an exemplary implementation, the first terminal preprocessing the multiple pieces of voice data to convert each piece of audio data into a corresponding mel spectrum and/or linear spectrum comprises: the first terminal framing each piece of audio data to divide it into multiple frames of audio data; masking the multiple frames of audio data with a preset masking rule to obtain masked audio data; and converting the masked audio data into the mel spectrum and/or linear spectrum corresponding to each piece of audio data.
Specifically, masking the audio data improves data security. In this embodiment, the masking rule can be set and adjusted according to the actual situation. In this implementation, the masking rule may be to select a preset number of frames among the multiple frames of voice data for masking, for example, selecting 15% of the speech frames.
The masking refers to adding noise to the voice data. In one implementation, the masking may zero out 80% of the selected frames, replace 10% of them with other frames randomly sampled from the same utterance, and keep the remaining 10% unchanged.
Step S11: the first terminal trains the locally deployed first ALBERT model with the multiple pieces of voice data and outputs the first gradient value of the first ALBERT model.
Specifically, the first ALBERT model is a variant of the BERT model that, while maintaining performance, greatly reduces the number of model parameters, making it more convenient to use in practice; it is one of the classic BERT variants.
In this embodiment, the first ALBERT model can be downloaded from the federated learning network and then deployed locally on the first terminal, or it can be deployed locally on the first terminal directly, without first being downloaded from the federated learning network; this embodiment does not limit this.
In this embodiment, the first ALBERT model may be a model that has undergone some training or an initial model that has not been trained. The first ALBERT model has a corresponding loss function, and the first gradient value is the vector used to approach the minimum of that loss function during the iterative training of the first ALBERT model with the multiple pieces of voice data.
The first ALBERT model can be a stack of several Transformer layers (for example, 6, 12, or 24 layers), with the model weights denoted w_t.
In an exemplary implementation, when the voice data takes the form of the mel spectra and/or linear spectra corresponding to the multiple pieces of voice data, the step of the first terminal training the locally deployed first ALBERT model with the multiple pieces of voice data and outputting the first gradient value comprises: the first terminal training the locally deployed first ALBERT model with the mel spectra and/or linear spectra corresponding to the multiple pieces of voice data and outputting the first gradient value of the first ALBERT model.
Step S12: the first terminal compresses the first gradient value to obtain a compressed gradient value.
Specifically, in federated learning training the terminals joined to the federated learning network sit in different network environments. To upload the first gradient value to the federated learning network more efficiently, it is therefore first compressed into a lighter-weight gradient value, and the compressed gradient value is then uploaded to the federated learning network.
In an exemplary implementation, the first terminal compressing the first gradient value to obtain the compressed gradient value comprises: the first terminal sparsifying the first gradient value to obtain K values, where K is an integer greater than 1; and the first terminal quantizing the K values to obtain a quantized gradient value that serves as the compressed gradient value.
Specifically, the first gradient value can be sparsified with Top-k sparsification, which takes the largest K values in the gradient matrix (the first gradient value) and sets the rest to 0; K is a hyperparameter that can be set and adjusted according to the actual situation.
In this embodiment, the K values can be quantized with ternary quantization, which maps each of the K retained gradient values to one of {-u, 0, u}, so that a floating-point number can be represented with only 2 bits. This greatly compresses the first gradient value, reduces the bandwidth required to transmit it, and saves transmission time.
In an exemplary implementation, to further improve the compression ratio of the first gradient value, the first terminal quantizing the K values comprises: the first terminal quantizing the K values to obtain a quantized gradient value; and the first terminal applying row coding or column coding to the quantized gradient value to obtain an encoded gradient value that serves as the compressed gradient value.
Specifically, row coding or column coding stores the quantized gradient values in a row-based or column-based matrix. Stored this way, only two arrays are needed: one stores all non-zero elements of the gradient values, and the other stores the indices of those non-zero elements within the matrix rows or columns. Zero elements need not be stored at all, which further improves the compression ratio of the first gradient value.
Step S13: the first terminal uploads the compressed gradient value to the federation server, wherein the federation server calculates a total gradient value from the second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model.
Specifically, after the first terminal uploads the compressed gradient value to the federation server, the federation server can aggregate the second gradient values uploaded by the multiple second terminals joined to the federated network and use the aggregated result as the total gradient value.
The second terminal refers to a terminal joined to the federated network; it may be the first terminal or a terminal other than the first terminal joined to the federated network, which is not limited in this embodiment.
A second gradient value is obtained after a second terminal trains the second ALBERT model with its locally stored voice data; the second ALBERT model and the first ALBERT model are the same model.
When aggregating the total gradient value, the federation server may use the following formula: g_{t+1} = (1/C) Σ_{c=1}^{C} g_{t+1}^{(c)}, where g_{t+1} is the total gradient value, g_{t+1}^{(c)} is the second gradient value uploaded by the c-th second terminal, and C is the number of second terminals.
That is, the total gradient value is the average of the sum of the second gradient values uploaded by the C second terminals.
It should be noted that when calculating the total gradient value, the federation server may randomly select the second gradient values uploaded by several of the second terminals joined to the federated network, or use the second gradient values uploaded by all of them; this embodiment does not limit this.
As an example, the second gradient values uploaded by C second terminals may be selected from all the second terminals joined to the federated network, where C is a preset value; for instance, if C is set to 5, the second gradient values uploaded by 5 second terminals are selected from all the second terminals joined to the federated network to calculate the total gradient value.
Step S14: the first terminal receives the total gradient value returned by the federation server.
Specifically, after calculating the total gradient value, the federation server pushes it to every terminal joined to the federated network so that each terminal can update its own model parameters accordingly.
Step S15: the first terminal updates the model parameters of the ALBERT model according to the total gradient value and repeats the step of training the first ALBERT model until the model converges, obtaining a trained speech representation model, wherein the speech representation model is used to extract a mel spectrum and/or linear spectrum representing speech from speech data.
Specifically, after updating the model parameters, the first terminal determines whether training should continue: if the model has not converged, training continues; if the model has converged, training stops, and the trained model serves as the speech representation model, through which a mel spectrum and/or linear spectrum representing speech can be extracted from speech data.
It should be noted that model convergence in this embodiment means that the loss value of the model is smaller than a preset value, or that the change in the loss value has leveled off, i.e., the difference between the loss values of two or more consecutive training rounds is smaller than a set value and the loss essentially no longer changes.
In an exemplary implementation, the first terminal updating the model parameters of the ALBERT model according to the total gradient value comprises: the first terminal acquiring the current model parameters of the ALBERT model; and the first terminal updating the model parameters according to the current model parameters, the total gradient value, and the preset formula w_{t+1} = w_t − αg_{t+1}, where w_{t+1} is the updated model parameter, w_t is the current model parameter, g_{t+1} is the total gradient value, and α is a preset constant.
The specific value of α can be set and adjusted according to the actual situation.
In this embodiment, a locally stored training sample data set comprising multiple pieces of voice data is acquired by the first terminal; the first terminal trains the locally deployed first ALBERT model with the multiple pieces of voice data and outputs the first gradient value of the first ALBERT model; the first terminal compresses the first gradient value to obtain a compressed gradient value and uploads it to the federation server, wherein the federation server calculates a total gradient value from the second gradient values uploaded by multiple second terminals joined to the federated network, a second gradient value being obtained by a second terminal training a locally deployed second ALBERT model; the first terminal receives the total gradient value returned by the federation server and updates the model parameters of the ALBERT model according to it, continuing to train the first ALBERT model until the model converges, thereby obtaining a trained speech representation model used to extract a mel spectrum and/or linear spectrum representing speech from speech data. By compressing the gradient values before uploading them to the federation server, this embodiment avoids communication delays caused by oversized gradient data, shortens training time, and improves training efficiency.
Referring to FIG. 2, a schematic diagram of the program modules of the federated learning-based speech representation model training apparatus 200 (hereinafter the "training apparatus" 200) according to an embodiment of the present application is shown. The training apparatus 200 can be applied to a computer device, which may be a mobile phone, a tablet personal computer, a laptop computer, a server, or another device with a data transmission function. In this embodiment, the training apparatus 200 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete this application and implement the above federated learning-based speech representation model training method. The program modules referred to in the embodiments of this application are a series of computer-readable instruction segments capable of performing specific functions; they are better suited than the program itself for describing how the federated learning-based speech representation model training method executes in the storage medium. In this embodiment, the federated learning-based speech representation model training apparatus 200 includes an acquisition module 201, a training module 202, a compression module 203, an uploading module 204, a receiving module 205, and an updating module 206. The functions of each program module of this embodiment are described below:
The acquisition module 201 is used for the first terminal to acquire a locally stored training sample data set comprising multiple pieces of voice data.
Specifically, the voice data in the training sample data set may be all of the voice data stored locally on the first terminal or only part of it; this embodiment does not limit this.
It should be noted that local storage refers to a storage medium in the first terminal that can store data, such as a memory and/or a hard disk.
In an exemplary implementation, since different pieces of voice data differ in duration, the voice data also needs to be preprocessed after acquisition to facilitate processing; that is, in this embodiment the training apparatus 200 may further include a preprocessing module.
The preprocessing module is used for the first terminal to preprocess the multiple pieces of voice data to convert each piece of audio data into a corresponding mel spectrum and/or linear spectrum.
Specifically, when preprocessing the voice data, a length parameter LEN can be preset for each piece of voice data, i.e., the length of each speech frame, so that the voice data can be divided into multiple frames. In particular, a moving window function can be used to split the voice data into multiple frames of equal length. During framing, voice data shorter than LEN is padded to that length, while voice data longer than LEN is evenly sampled by the moving window function to obtain speech frames of length LEN.
In this embodiment, after the voice data is framed, each frame of voice data can be converted into a corresponding mel spectrum and/or linear spectrum with a preset tool; for example, each frame can be converted into a mel-scale spectrogram with the librosa tool.
It should be noted that a speech signal is one-dimensional: it directly shows time-domain information but not frequency-domain information. A Fourier transform (FT) moves it to the frequency domain, but the time-domain information is then lost and the time-frequency relationship cannot be seen. Many methods address this; the short-time Fourier transform and wavelets are among the most commonly used time-frequency analysis methods.
The short-time Fourier transform (STFT) applies the Fourier transform to short segments of the signal. The principle is as follows: a long speech signal is framed and windowed, a Fourier transform is applied to each frame, and the per-frame results are stacked along another dimension to obtain an image (similar to a two-dimensional signal). This image is the spectrogram, also called the linear spectrum.
Because the resulting spectrogram is large, it is usually passed through a mel-scale filter bank to obtain sound features of a suitable size, yielding the mel spectrum.
In an exemplary implementation, the preprocessing module is further used for the first terminal to frame each piece of audio data to divide it into multiple frames of audio data; to mask the multiple frames of audio data with a preset masking rule to obtain masked audio data; and to convert the masked audio data into the mel spectrum and/or linear spectrum corresponding to each piece of audio data.
Specifically, masking the audio data improves data security. In this embodiment, the masking rule can be set and adjusted according to the actual situation; for example, it may select a preset number of frames among the multiple frames of voice data for masking, such as 15% of the speech frames.
The masking refers to adding noise to the voice data. In one implementation, the masking may zero out 80% of the selected frames, replace 10% of them with other frames randomly sampled from the same utterance, and keep the remaining 10% unchanged.
训练模块202,用于所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值。
具体地,第一ALBERT模型为BERT模型的一个变体,在保持性能的基础上,大大减少了模型的参数,使得实用变得更加方便,是经典的BERT变体之一。
在本实施例中,该第一ALBERT模型可以从联邦学习网络中下载,然后部署在第一终端本地,也可以直接在第一终端本地部署,而无需先从联邦学习网络中下载,在本实施例中不作限定。
在本实施例中,所述第一ALBERT模型可以是经过一定训练的模型,也可以是未经过训练的初始模型。所述第一ALBERT模型中具有对应的损失函数,而该第一梯度值是在通过多个语音数据对所述第一ALBERT模型进行训练迭代过程中用于逼近损失函数的最小值的向量。
其中，所述第一ALBERT模型可以为若干层（比如，6层、12层、24层）叠加的Transformer结构，模型的权重记为w_t。
在一示例性的实施方式中,当所述语音数据为多个语音数据对应的梅尔频谱及/或线性频谱时,训练模块202,还用于所述第一终端通过所述多个语音数据对应的梅尔频谱及/或线性频谱对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值。
压缩模块203,用于所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值。
具体地，由于采用联邦学习训练时，加入至联邦学习网络中的各个终端所处的网络环境不一样，因此，为了使得第一梯度值可以更加高效地上传至联邦学习网络，可以先对该第一梯度值进行压缩，得到更加轻量的梯度值，之后，再将压缩后的梯度值上传至联邦学习网络。
在一示例性的实施方式中,压缩模块203,还用于所述第一终端对所述第一梯度值进行稀疏化处理,以得到K个值,其中K为大于1的整数;所述第一终端对所述K个值进行量化处理,得到量化后的梯度值,并将所述量化后的梯度值作为所述压缩后的梯度值。
具体地,可以通过Top-k稀疏化的方式对第一梯度值进行稀疏化处理,其中,Top-k稀疏化指的是取梯度矩阵(第一梯度值)中最大的K个值,其余的置为0,这里的K为一个超参数,可以根据实际情况进行设定与调整。
在本实施例中,可以通过三元量化的方式对K个值进行量化处理,其中,三元量化指的是,将保留的K个梯度值量化为{-u,0,u}中的一个,这样对于一个浮点数,只需要2个bit(比特)就能表示了,从而可以极大地实现对所述第一梯度值的压缩,降低传输所述第一梯度值所需要的带宽,节省所述第一梯度值传输时所需要的时间。
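Top-k稀疏化与三元量化这两步压缩可以用如下片段示意。其中u取非零梯度绝对值的均值仅为一种可能的取法，示例梯度向量亦为假设数据。

```python
import numpy as np

def topk_sparsify(grad, k):
    """Top-k 稀疏化: 仅保留梯度中绝对值最大的 K 个值, 其余置 0。"""
    flat = grad.ravel()
    keep = np.argsort(np.abs(flat))[-k:]   # 绝对值最大的 K 个位置
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(grad.shape)

def ternary_quantize(grad):
    """三元量化: 将保留的梯度量化为 {-u, 0, u} 之一,
    这样每个值只需 2 bit 即可表示; u 取非零值绝对值的均值(假设的取法)。"""
    nonzero = grad[grad != 0]
    if nonzero.size == 0:
        return grad, 0.0
    u = np.abs(nonzero).mean()
    return np.sign(grad) * u, u

g = np.array([0.5, -0.1, 0.02, -0.7, 0.3, 0.0])  # 假设的第一梯度值
sparse = topk_sparsify(g, k=3)        # 仅保留 0.5、-0.7、0.3
quantized, u = ternary_quantize(sparse)
```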
在一示例性的实施方式中，为了进一步提高对所述第一梯度值的压缩率，压缩模块203，还用于所述第一终端对所述K个值进行量化处理，得到量化后的梯度值；所述第一终端将量化后的梯度值进行行编码或者列编码处理，得到编码后的梯度值，并将所述编码后的梯度值作为所述压缩后的梯度值。
具体地,行编码或者列编码指的是对于量化后梯度值采用基于行或者基于列的矩阵进行存储,通过行编码或者列编码的方式对梯度值进行存储时,只需要两个数组,其中,一个数组用于存储所有的梯度值中的非0元素,另一数组用于存储非0元素在矩阵行或者矩阵列中的索引,对于所有的梯度值中的0元素,不需要进行存储,从而可以进一步提高对所述第一梯度值的压缩率。
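用两个数组（非零值数组与索引数组）存储稀疏梯度的编码方式，可以用如下片段示意，其中示例矩阵为假设数据。

```python
import numpy as np

def sparse_row_encode(mat):
    """行编码: 只用两个数组存储稀疏梯度矩阵——
    一个存非零元素的值, 另一个存它们按行展开后的索引; 0 元素不存储。"""
    flat = mat.ravel()            # 按行展开
    idx = np.flatnonzero(flat)    # 非零元素的索引
    return flat[idx], idx, mat.shape

def sparse_row_decode(values, idx, shape):
    """解码: 按索引将非零值写回全零矩阵, 还原原始梯度。"""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)

m = np.array([[0.0, 0.5, 0.0], [-0.5, 0.0, 0.0]])  # 假设的量化后梯度
vals, idx, shape = sparse_row_encode(m)
restored = sparse_row_decode(vals, idx, shape)
```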
上传模块204,用于所述第一终端将所述压缩后的梯度值上传至联邦服务器,其中,所述联邦服务器根据加入至联邦网络的多个第二终端上传的第二梯度值计算总梯度值,所述第二梯度值为所述第二终端对本地部署的第二ALBERT模型进行训练得到的。
具体地,第一终端在将压缩后的梯度值上传至联邦服务器后,联邦服务器可以基于加入至联邦网络的多个第二终端上传的第二梯度值进行梯度值的聚合操作,并将聚合操作得到的梯度值作为总梯度值。
其中,第二梯度值是第二终端通过其本地存储的语音数据对第二ALBERT模型进行训练后得到,所述第二ALBERT模型与所述第一ALBERT模型为相同的模型。
联邦服务器在进行总梯度值的聚合操作时,可以采用以下公式计算得到总梯度值:
g_{t+1} = (1/C)·∑_{c=1}^{C} g_{t+1}^{(c)}
其中，g_{t+1}为所述总梯度值，g_{t+1}^{(c)}为第c个第二终端上传的第二梯度值，C为第二终端的数量。
也就是说,总梯度值为C个第二终端上传的第二梯度值的和值的平均值。
需要说明的是，联邦服务器在计算总梯度值时，可以随机从加入至联邦网络中的所有第二终端中选择若干个第二终端上传的第二梯度值来计算总梯度值，也可以根据加入至联邦网络中的所有第二终端上传的第二梯度值来计算总梯度值，在本实施例中不作限定。
作为示例,可以从加入至联邦网络中的所有第二终端中选择C个第二终端上传的第二梯度值来计算总梯度值,其中C为预先设定的值,比如,设定C为5个,则需要从加入至联邦网络中的所有第二终端中选择5个第二终端上传的第二梯度值来计算总梯度值。
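联邦服务器的“随机选取C个终端、对其梯度求平均”这一聚合操作，可以用如下片段示意，终端数量与梯度数据均为假设值。

```python
import numpy as np

def aggregate_gradients(client_grads):
    """聚合操作: 总梯度值为各第二终端上传梯度的平均值,
    即 g_{t+1} = (1/C) * sum_c g_{t+1}^{(c)}。"""
    return np.mean(np.stack(client_grads), axis=0)

def select_clients(all_grads, c, seed=0):
    """从加入联邦网络的所有第二终端中随机选择 C 个终端的梯度参与聚合。"""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(all_grads), size=c, replace=False)
    return [all_grads[i] for i in chosen]

# 模拟 10 个第二终端各自上传的梯度, 选取 C=5 个参与聚合
grads = [np.full(3, float(i)) for i in range(10)]
total = aggregate_gradients(select_clients(grads, c=5))
```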
接收模块205,用于所述第一终端接收所述联邦服务器返回的总梯度值。
具体地，联邦服务器在计算出总梯度值之后，会将计算出的总梯度值推送至加入至联邦网络的各个终端，以便各个终端可以根据推送的总梯度值对各自的模型参数进行更新。
更新模块206,用于所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新,重复执行对所述第一ALBERT模型进行训练的步骤直至模型收敛为止,得到训练好的语音表征模型,其中,所述语音表征模型用于从语音数据中提取表征语音的梅尔频谱及/或线性频谱。
具体地,第一终端在完成模型参数的更新之后,会判断是否需要对模型进行继续训练,若判断出模型还没有收敛,则会继续对模型进行训练,若判断出模型已经收敛,则会停止对模型的训练,并将训练好的模型作为语音表征模型,以便可以通过该语音表征模型从语音数据中提取表征语音的梅尔频谱及/或线性频谱。
需要说明的是,本实施例中的模型收敛指的是模型的损失值小于预设值,或者是模型的损失值的变化趋近于平稳,即相邻两次或多次训练对应的损失值的差值小于设定值,也就是损失值基本不再变化。
在一示例性的实施方式中，更新模块206，还用于所述第一终端获取所述ALBERT模型的当前模型参数；所述第一终端根据所述当前模型参数、所述总梯度值与预设的公式对所述ALBERT模型的模型参数进行更新，其中，所述预设的公式为：w_{t+1}=w_t-αg_{t+1}，w_{t+1}为更新后的模型参数，w_t为所述当前模型参数，g_{t+1}为所述总梯度值，α为预先设定的常量。
其中,α的具体值可以根据实际情况进行设定与调整。
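按预设公式更新模型参数的步骤可以用如下片段示意，α=0.1与参数向量均为假设值。

```python
import numpy as np

def update_params(w_t, g_total, alpha=0.01):
    """按预设公式 w_{t+1} = w_t - α·g_{t+1} 更新模型参数,
    其中 α 为预先设定的常量, g_{t+1} 为联邦服务器返回的总梯度值。"""
    return w_t - alpha * g_total

w = np.array([1.0, 2.0])       # 假设的当前模型参数 w_t
g = np.array([10.0, -10.0])    # 假设的总梯度值 g_{t+1}
w_next = update_params(w, g, alpha=0.1)
```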
本实施例中,通过第一终端获取本地存储的训练样本数据集,所述训练样本数据集包括多个语音数据;所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值;所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值;所述第一终端将所述压缩后的梯度值上传至联邦服务器,其中,所述联邦服务器根据加入至联邦网络的多个第二终端上传的第二梯度值计算总梯度值,所述第二梯度值为所述第二终端对本地部署的第二ALBERT模型进行训练得到的;所述第一终端接收所述联邦服务器返回的总梯度值;所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新,继续对所述第一ALBERT模型进行训练直至模型收敛为止,得到训练好的语音表征模型,其中,所述语音表征模型用于从语音数据中提取表征语音的梅尔频谱及/或线性频谱。本实施例通过在将梯度值上传至联邦服务器时,对梯度值进行压缩,从而可以避免因梯度数据过大导致通信延迟,降低模型的训练时长,提高模型的训练效率。
参阅图3，是本申请实施例之计算机设备300的硬件架构示意图。在本实施例中，所述计算机设备300是一种能够按照事先设定或者存储的指令，自动进行数值计算和/或信息处理的设备。如图所示，所述计算机设备300至少包括，但不限于，可通过装置总线相互通信连接的存储器301、处理器302、网络接口303。其中：
本实施例中,存储器301至少包括一种类型的计算机可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,存储器301可以是计算机设备300的内部存储单元,例如所述计算机设备300的硬盘或内存。在另一些实施例中,存储器301也可以是计算机设备300的外部存储设备,例如所述计算机设备300上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,存储器301还可以既包括计算机设备300的内部存储单元也包括其外部存储设备。本实施例中,存储器301通常用于存储安装于计算机设备300的操作装置和各类应用软件,例如基于联邦学习的语音表征模型训练装置200的程序代码等。此外,存储器301还可以用于暂时地存储已经输出或者将要输出的各类数据。
处理器302在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。所述处理器302通常用于控制计算机设备300的总体操作。本实施例中,处理器302用于运行存储器301中存储的程序代码或者处理数据,例如运行基于联邦学习的语音表征模型训练装置200,以实现上述各个实施例中的基于联邦学习的语音表征模型训练方法。
所述网络接口303可包括无线网络接口或有线网络接口，所述网络接口303通常用于在所述计算机设备300与其他电子装置之间建立通信连接。例如，所述网络接口303用于通过网络将所述计算机设备300与外部终端相连，在所述计算机设备300与外部终端之间建立数据传输通道和通信连接等。所述网络可以是企业内部网（Intranet）、互联网（Internet）、全球移动通信系统（Global System of Mobile communication，GSM）、宽带码分多址（Wideband Code Division Multiple Access，WCDMA）、4G网络、5G网络、蓝牙（Bluetooth）、Wi-Fi等无线或有线网络。
需要指出的是,图3仅示出了具有部件301-303的计算机设备300,但是应理解的是,并不要求实施所有示出的部件,可以替代的实施更多或者更少的部件。
在本实施例中,存储于存储器301中的所述基于联邦学习的语音表征模型训练装置200还可以被分割为一个或者多个程序模块,所述一个或者多个程序模块被存储于存储器301中,并由一个或多个处理器(本实施例为处理器302)所执行,以完成本申请之基于联邦学习的语音表征模型训练方法。
本实施例还提供一种计算机可读存储介质，所述计算机可读存储介质可以是非易失性，也可以是易失性，如闪存、硬盘、多媒体卡、卡型存储器（例如，SD或DX存储器等）、随机访问存储器（RAM）、静态随机访问存储器（SRAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、可编程只读存储器（PROM）、磁性存储器、磁盘、光盘、服务器、App应用商城等等，其上存储有计算机可读指令，计算机可读指令被处理器执行时实现相应功能。本实施例的计算机可读存储介质用于存储基于联邦学习的语音表征模型训练装置200，以被处理器执行时实现以下步骤：
第一终端获取本地存储的训练样本数据集,所述训练样本数据集包括多个语音数据;
所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值;
所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值;
所述第一终端将所述压缩后的梯度值上传至联邦服务器,其中,所述联邦服务器根据加入至联邦网络的多个第二终端上传的第二梯度值计算总梯度值,所述第二梯度值为所述第二终端对本地部署的第二ALBERT模型进行训练得到的;
所述第一终端接收所述联邦服务器返回的总梯度值;
所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新,重复执行对所述第一ALBERT模型进行训练的步骤直至模型收敛为止,得到训练好的语音表征模型,其中,所述语音表征模型用于从语音数据中提取表征语音的梅尔频谱及/或线性频谱。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种基于联邦学习的语音表征模型训练方法,包括:
    第一终端获取本地存储的训练样本数据集,所述训练样本数据集包括多个语音数据;
    所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值;
    所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值;
    所述第一终端将所述压缩后的梯度值上传至联邦服务器,其中,所述联邦服务器根据加入至联邦网络的多个第二终端上传的第二梯度值计算总梯度值,所述第二梯度值为所述第二终端对本地部署的第二ALBERT模型进行训练得到的;
    所述第一终端接收所述联邦服务器返回的总梯度值;
    所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新,重复执行对所述第一ALBERT模型进行训练的步骤直至模型收敛为止,得到训练好的语音表征模型,其中,所述语音表征模型用于从语音数据中提取表征语音的梅尔频谱及/或线性频谱。
  2. 如权利要求1所述的基于联邦学习的语音表征模型训练方法,所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值的步骤之前,还包括:
    所述第一终端对所述多个语音数据进行预处理,以将每一个所述音频数据转换为对应的梅尔频谱及/或线性频谱;
    所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值的步骤包括:
    所述第一终端通过所述多个语音数据对应的梅尔频谱及/或线性频谱对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值。
  3. 如权利要求2所述的基于联邦学习的语音表征模型训练方法,所述第一终端对所述多个语音数据进行预处理,以将每一个所述音频数据转换为对应的梅尔频谱及/或线性频谱包括:
    所述第一终端对每一个所述音频数据进行分帧处理,以将每一个所述音频数据分为多帧音频数据;
    采用预设的掩码规则对所述多帧音频数据进行掩码处理,得到掩码后的音频数据;
    将所述掩码后的音频数据转换为每一个所述音频数据对应的梅尔频谱及/或线性频谱。
  4. 如权利要求1所述的基于联邦学习的语音表征模型训练方法,所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值包括:
    所述第一终端对所述第一梯度值进行稀疏化处理,以得到K个值,其中K为大于1的整数;
    所述第一终端对所述K个值进行量化处理,得到量化后的梯度值,并将所述量化后的梯度值作为所述压缩后的梯度值。
  5. 如权利要求4所述的基于联邦学习的语音表征模型训练方法,所述第一终端对所述K个值进行量化处理,得到量化后的梯度值,并将所述量化后的梯度值作为所述压缩后的梯度值包括:
    所述第一终端对所述K个值进行量化处理,得到量化后的梯度值;
    所述第一终端将量化后的梯度值进行行编码或者列编码处理,得到编码后的梯度值,并将所述编码后的梯度值作为所述压缩后的梯度值。
  6. 如权利要求1所述的基于联邦学习的语音表征模型训练方法,所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新包括:
    所述第一终端获取所述ALBERT模型的当前模型参数;
    所述第一终端根据所述当前模型参数、所述总梯度值与预设的公式对所述ALBERT模型的模型参数进行更新，其中，所述预设的公式为：w_{t+1}=w_t-αg_{t+1}，w_{t+1}为更新后的模型参数，w_t为所述当前模型参数，g_{t+1}为所述总梯度值，α为预先设定的常量。
  7. 如权利要求1所述的基于联邦学习的语音表征模型训练方法,所述总梯度值通过如下方式计算得到:
    g_{t+1} = (1/C)·∑_{c=1}^{C} g_{t+1}^{(c)}
    其中，g_{t+1}为所述总梯度值，g_{t+1}^{(c)}为第c个第二终端上传的第二梯度值，C为第二终端的数量。
  8. 一种基于联邦学习的语音表征模型训练装置,所述基于联邦学习的语音表征模型训练装置包括:
    获取模块,用于所述第一终端获取本地存储的训练样本数据集,所述训练样本数据集包括多个语音数据;
    训练模块,用于所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值;
    压缩模块,用于所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值;
    上传模块,用于所述第一终端将所述压缩后的梯度值上传至联邦服务器,其中,所述联邦服务器根据加入至联邦网络的多个第二终端上传的第二梯度值计算总梯度值,所述第二梯度值为所述第二终端对本地部署的第二ALBERT模型进行训练得到的;
    接收模块,用于所述第一终端接收所述联邦服务器返回的总梯度值;
    更新模块,用于所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新,重复执行对所述第一ALBERT模型进行训练的步骤直至模型收敛为止,得到训练好的语音表征模型,其中,所述语音表征模型用于从语音数据中提取表征语音的梅尔频谱及/或线性频谱。
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器上并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现以下步骤:
    第一终端获取本地存储的训练样本数据集,所述训练样本数据集包括多个语音数据;
    所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值;
    所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值;
    所述第一终端将所述压缩后的梯度值上传至联邦服务器,其中,所述联邦服务器根据加入至联邦网络的多个第二终端上传的第二梯度值计算总梯度值,所述第二梯度值为所述第二终端对本地部署的第二ALBERT模型进行训练得到的;
    所述第一终端接收所述联邦服务器返回的总梯度值;
    所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新,重复执行对所述第一ALBERT模型进行训练的步骤直至模型收敛为止,得到训练好的语音表征模型,其中,所述语音表征模型用于从语音数据中提取表征语音的梅尔频谱及/或线性频谱。
  10. 如权利要求9所述的计算机设备,所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值的步骤之前,还包括:
    所述第一终端对所述多个语音数据进行预处理,以将每一个所述音频数据转换为对应的梅尔频谱及/或线性频谱;
    所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值的步骤包括:
    所述第一终端通过所述多个语音数据对应的梅尔频谱及/或线性频谱对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值。
  11. 如权利要求10所述的计算机设备,所述第一终端对所述多个语音数据进行预处理,以将每一个所述音频数据转换为对应的梅尔频谱及/或线性频谱包括:
    所述第一终端对每一个所述音频数据进行分帧处理,以将每一个所述音频数据分为多帧音频数据;
    采用预设的掩码规则对所述多帧音频数据进行掩码处理,得到掩码后的音频数据;
    将所述掩码后的音频数据转换为每一个所述音频数据对应的梅尔频谱及/或线性频谱。
  12. 如权利要求9所述的计算机设备,所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值包括:
    所述第一终端对所述第一梯度值进行稀疏化处理,以得到K个值,其中K为大于1的整数;
    所述第一终端对所述K个值进行量化处理,得到量化后的梯度值,并将所述量化后的梯度值作为所述压缩后的梯度值。
  13. 如权利要求12所述的计算机设备,所述第一终端对所述K个值进行量化处理,得到量化后的梯度值,并将所述量化后的梯度值作为所述压缩后的梯度值包括:
    所述第一终端对所述K个值进行量化处理,得到量化后的梯度值;
    所述第一终端将量化后的梯度值进行行编码或者列编码处理,得到编码后的梯度值,并将所述编码后的梯度值作为所述压缩后的梯度值。
  14. 如权利要求9所述的计算机设备,所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新包括:
    所述第一终端获取所述ALBERT模型的当前模型参数;
    所述第一终端根据所述当前模型参数、所述总梯度值与预设的公式对所述ALBERT模型的模型参数进行更新，其中，所述预设的公式为：w_{t+1}=w_t-αg_{t+1}，w_{t+1}为更新后的模型参数，w_t为所述当前模型参数，g_{t+1}为所述总梯度值，α为预先设定的常量。
  15. 一种计算机可读存储介质,所述计算机可读存储介质内存储有计算机可读指令,所述计算机可读指令可被至少一个处理器所执行,以使所述至少一个处理器执行以下步骤:
    第一终端获取本地存储的训练样本数据集,所述训练样本数据集包括多个语音数据;
    所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值;
    所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值;
    所述第一终端将所述压缩后的梯度值上传至联邦服务器,其中,所述联邦服务器根据加入至联邦网络的多个第二终端上传的第二梯度值计算总梯度值,所述第二梯度值为所述第二终端对本地部署的第二ALBERT模型进行训练得到的;
    所述第一终端接收所述联邦服务器返回的总梯度值;
    所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新,重复执行对所述第一ALBERT模型进行训练的步骤直至模型收敛为止,得到训练好的语音表征模型,其中,所述语音表征模型用于从语音数据中提取表征语音的梅尔频谱及/或线性频谱。
  16. 如权利要求15所述的计算机可读存储介质,所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值的步骤之前,还包括:
    所述第一终端对所述多个语音数据进行预处理,以将每一个所述音频数据转换为对应的梅尔频谱及/或线性频谱;
    所述第一终端通过所述多个语音数据对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值的步骤包括:
    所述第一终端通过所述多个语音数据对应的梅尔频谱及/或线性频谱对本地部署的第一ALBERT模型进行训练,并输出所述第一ALBERT模型的第一梯度值。
  17. 如权利要求16所述的计算机可读存储介质，所述第一终端对所述多个语音数据进行预处理，以将每一个所述音频数据转换为对应的梅尔频谱及/或线性频谱包括：
    所述第一终端对每一个所述音频数据进行分帧处理,以将每一个所述音频数据分为多帧音频数据;
    采用预设的掩码规则对所述多帧音频数据进行掩码处理,得到掩码后的音频数据;
    将所述掩码后的音频数据转换为每一个所述音频数据对应的梅尔频谱及/或线性频谱。
  18. 如权利要求15所述的计算机可读存储介质,所述第一终端将所述第一梯度值进行压缩处理,得到压缩后的梯度值包括:
    所述第一终端对所述第一梯度值进行稀疏化处理,以得到K个值,其中K为大于1的整数;
    所述第一终端对所述K个值进行量化处理,得到量化后的梯度值,并将所述量化后的梯度值作为所述压缩后的梯度值。
  19. 如权利要求18所述的计算机可读存储介质,所述第一终端对所述K个值进行量化处理,得到量化后的梯度值,并将所述量化后的梯度值作为所述压缩后的梯度值包括:
    所述第一终端对所述K个值进行量化处理,得到量化后的梯度值;
    所述第一终端将量化后的梯度值进行行编码或者列编码处理,得到编码后的梯度值,并将所述编码后的梯度值作为所述压缩后的梯度值。
  20. 如权利要求15所述的计算机可读存储介质,所述第一终端根据所述总梯度值对所述ALBERT模型的模型参数进行更新包括:
    所述第一终端获取所述ALBERT模型的当前模型参数;
    所述第一终端根据所述当前模型参数、所述总梯度值与预设的公式对所述ALBERT模型的模型参数进行更新，其中，所述预设的公式为：w_{t+1}=w_t-αg_{t+1}，w_{t+1}为更新后的模型参数，w_t为所述当前模型参数，g_{t+1}为所述总梯度值，α为预先设定的常量。
PCT/CN2021/097258 2021-04-25 2021-05-31 基于联邦学习的语音表征模型训练方法、装置、设备及介质 WO2022227212A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110448809.0 2021-04-25
CN202110448809.0A CN113178191A (zh) 2021-04-25 2021-04-25 基于联邦学习的语音表征模型训练方法、装置、设备及介质

Publications (1)

Publication Number Publication Date
WO2022227212A1 true WO2022227212A1 (zh) 2022-11-03

Family

ID=76925572

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097258 WO2022227212A1 (zh) 2021-04-25 2021-05-31 基于联邦学习的语音表征模型训练方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN113178191A (zh)
WO (1) WO2022227212A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725965A (zh) * 2024-02-06 2024-03-19 湘江实验室 一种基于张量掩码语义通信的联邦边缘数据通信方法

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116341689B (zh) * 2023-03-22 2024-02-06 深圳大学 机器学习模型的训练方法、装置、电子设备及存储介质

Citations (7)

Publication number Priority date Publication date Assignee Title
CN110263908A (zh) * 2019-06-20 2019-09-20 深圳前海微众银行股份有限公司 联邦学习模型训练方法、设备、系统及存储介质
CN111046433A (zh) * 2019-12-13 2020-04-21 支付宝(杭州)信息技术有限公司 一种基于联邦学习的模型训练方法
US20200218937A1 (en) * 2019-01-03 2020-07-09 International Business Machines Corporation Generative adversarial network employed for decentralized and confidential ai training
WO2020222386A1 (en) * 2019-05-01 2020-11-05 Samsung Electronics Co., Ltd. Method and apparatus for updating a cluster probability model
CN112101578A (zh) * 2020-11-17 2020-12-18 中国科学院自动化研究所 基于联邦学习的分布式语言关系识别方法、系统和装置
CN112132277A (zh) * 2020-09-21 2020-12-25 平安科技(深圳)有限公司 联邦学习模型训练方法、装置、终端设备及存储介质
CN112288097A (zh) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 联邦学习数据处理方法、装置、计算机设备及存储介质

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
CN110085250B (zh) * 2016-01-14 2023-07-28 深圳市韶音科技有限公司 气导噪声统计模型的建立方法及应用方法
CN107221320A (zh) * 2017-05-19 2017-09-29 百度在线网络技术(北京)有限公司 训练声学特征提取模型的方法、装置、设备和计算机存储介质
CN107180628A (zh) * 2017-05-19 2017-09-19 百度在线网络技术(北京)有限公司 建立声学特征提取模型的方法、提取声学特征的方法、装置
CN109635422B (zh) * 2018-12-07 2023-08-25 深圳前海微众银行股份有限公司 联合建模方法、装置、设备以及计算机可读存储介质
CN111553483B (zh) * 2020-04-30 2024-03-29 同盾控股有限公司 基于梯度压缩的联邦学习的方法、装置及系统
CN111883107B (zh) * 2020-08-03 2022-09-16 北京字节跳动网络技术有限公司 语音合成、特征提取模型训练方法、装置、介质及设备
CN111951780B (zh) * 2020-08-19 2023-06-13 广州华多网络科技有限公司 语音合成的多任务模型训练方法及相关设备
CN112333216B (zh) * 2021-01-07 2021-04-06 深圳索信达数据技术有限公司 一种基于联邦学习的模型训练方法及系统


Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117725965A (zh) * 2024-02-06 2024-03-19 湘江实验室 一种基于张量掩码语义通信的联邦边缘数据通信方法
CN117725965B (zh) * 2024-02-06 2024-05-14 湘江实验室 一种基于张量掩码语义通信的联邦边缘数据通信方法

Also Published As

Publication number Publication date
CN113178191A (zh) 2021-07-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE