CN116343760A - Speech recognition method, system and computer equipment based on federal learning


Info

Publication number: CN116343760A
Application number: CN202310269664.7A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: fusion, voice, feature set, learning, model
Inventors: 李泽远 (Li Zeyuan), 王健宗 (Wang Jianzong)
Assignee (original and current): Ping An Technology Shenzhen Co Ltd
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling

Abstract

The method obtains a local voice sample set and an initial local model at each learning end and uses a self-coding network for feature extraction to obtain an output value; the initial local model is trained according to the output value and the sample labels to obtain a first fusion adaptive parameter; a first global fusion parameter is obtained through a server; the initial local model is then trained based on the first global fusion parameter to obtain a second fusion adaptive parameter, and when a preset end condition is met a target local model is obtained, from which the server determines a target voice recognition model; the application end performs voice recognition on voice data through the target voice recognition model. According to the embodiments of the invention, only the parameters modified by the feature fusion structure of the model are transmitted to the server, so the representation capability of the features is increased while the number of transmitted model parameters is reduced, giving higher voice recognition accuracy.

Description

Speech recognition method, system and computer equipment based on federal learning
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a federal learning-based speech recognition method, system, and computer device.
Background
Federal learning (Federated Learning) is an emerging artificial intelligence framework that comprises a plurality of terminals and a server. When a federal learning-based voice recognition model is trained, the voice samples of each terminal are not shared with the other terminals or with the server; only model parameters are communicated between the server and the terminals. Each terminal therefore benefits from the data of the other terminals while its own voice samples never leave the device, which preserves the accuracy of the model and yields high voice recognition accuracy. As the technology develops, voice samples carry more and more labels, and the voice sequences fed to a model grow longer, so the number of model parameters increases. In the related art, an adapter module is inserted at suitable positions in a large model structure, the original model is frozen, and only the parameters of the adapter module are updated, which greatly reduces the number of modified model parameters. However, the adapter module is a comparatively simple structure with limited correction capability and limited learning capability, which lowers the accuracy of speech recognition.
Disclosure of Invention
The federal learning-based voice recognition method, system and computer device provided herein aim to solve the problems in the prior art at least to some extent; by transmitting to the server only the parameters modified by the feature fusion structure of the model, they increase the representation capability of the features while reducing the number of transmitted model parameters, and thereby achieve higher voice recognition accuracy.
The technical scheme of the embodiment of the application is as follows:
in a first aspect, the present application provides a federal learning-based speech recognition method, which is applied to a speech recognition system, where the speech recognition system includes an application end, a server, and a plurality of learning ends, and the method includes:
the learning terminal acquires a local voice sample set, wherein the voice sample set comprises a plurality of voice samples and sample labels corresponding to the voice samples;
the learning end acquires an initial local model, and a self-coding network of the initial local model performs the following processing: performing first feature extraction processing on each voice sample to obtain a first voice feature set, performing second feature extraction processing on the first voice feature set to obtain a second voice feature set, performing first fusion processing on the first voice feature set and the second voice feature set to obtain a third voice feature set, and performing second fusion processing on the first voice feature set, the second voice feature set and the third voice feature set to obtain an output value;
the learning end trains the initial local model according to the output value and the sample label to obtain a first fusion adaptive parameter;
the learning end sends the first fusion adaptive parameters to the server and receives first global fusion parameters obtained by aggregating the first fusion adaptive parameters sent by the server;
the learning end trains the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter, obtains a target local model corresponding to the second fusion adaptive parameter under the condition that the preset end condition is met, and sends the target local model to the server;
the server determines a target voice recognition model according to the target local model sent by each learning end and sends the target voice recognition model to the application end;
the application end carries out voice recognition on the input voice data through the target voice recognition model.
According to some embodiments of the present application, the performing a second fusion process on the first speech feature set, the second speech feature set, and the third speech feature set to obtain an output value includes:
taking the first voice feature set as a first matrix of an attention mechanism;
taking the second set of speech features as a second matrix of the attention mechanism;
taking the third set of speech features as a third matrix of the attention mechanism;
and the learning end performs first matrix fusion processing on the first matrix, the second matrix and the third matrix according to a preset fusion algorithm to obtain the output value.
According to some embodiments of the present application, the learning end performs a first matrix fusion process on the first matrix, the second matrix, and the third matrix according to a preset fusion algorithm, to obtain the output value, including:
the learning end performs second matrix fusion processing on the first matrix and the second matrix to obtain a first fusion value;
the learning end performs normalization processing on the first fusion value by using a normalization layer to obtain a second fusion value;
and the learning end performs third matrix fusion processing on the second fusion value and the third matrix to obtain the output value.
According to some embodiments of the present application, after the learning end trains the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter, the method further includes:
under the condition that a preset ending condition is not met, the learning end sends the second fusion adaptive parameter to the server;
the learning terminal receives a second global fusion parameter obtained by aggregating the second fusion adaptive parameters, which is sent by the server;
the learning end trains the initial local model based on the second global fusion parameters.
According to some embodiments of the present application, the learning end trains the initial local model according to the output value and the sample tag to obtain a first fusion adaptive parameter, including:
the learning end obtains a value of a loss function according to the output value and the sample label corresponding to the output value;
and training the initial local model by the learning end according to the value of the loss function to obtain the first fusion adaptive parameter.
According to some embodiments of the present application, the performing a second feature extraction process on the first speech feature set to obtain a second speech feature set includes:
the learning end performs downsampling operation on the first voice feature set to obtain sampling features;
and the learning end performs up-sampling operation on the sampling characteristics to obtain the second voice characteristic set.
According to some embodiments of the present application, the learning end obtains a value of a loss function according to the output value and the sample label corresponding to the output value, including:
the learning end computes the KL divergence between the output value and the sample label corresponding to that output value to obtain the value of the loss function.
In a second aspect, the present application provides a speech recognition system based on federal learning, where the speech recognition system includes an application end, a server, and a plurality of learning ends, where the learning ends include:
the data acquisition module is used for acquiring a local voice sample set, wherein the voice sample set comprises a plurality of voice samples and sample labels corresponding to the voice samples;
the model acquisition module is used for acquiring an initial local model, and the self-coding network of the initial local model performs the following processing: performing first feature extraction processing on each voice sample to obtain a first voice feature set, performing second feature extraction processing on the first voice feature set to obtain a second voice feature set, performing first fusion processing on the first voice feature set and the second voice feature set to obtain a third voice feature set, and performing second fusion processing on the first voice feature set, the second voice feature set and the third voice feature set to obtain an output value;
the first processing module is used for training the initial local model according to the output value and the sample label so as to obtain a first fusion adaptive parameter;
the sending module is used for sending the first fusion adaptive parameter to the server, and the receiving module is used for receiving a first global fusion parameter, sent by the server, obtained by aggregating the first fusion adaptive parameters;
the second processing module is used for training the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter and for obtaining a target local model corresponding to the second fusion adaptive parameter under the condition that the preset end condition is met, and the sending module is further used for sending the target local model to the server;
the server determines a target voice recognition model according to the target local model sent by each learning end and sends the target voice recognition model to the application end;
the application end carries out voice recognition on the input voice data through the target voice recognition model.
In a third aspect, the present application provides a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by one or more of the processors, cause the one or more processors to perform the steps of the method as described in any of the first aspects above.
In a fourth aspect, the present application also provides a computer readable storage medium readable and writable by a processor, the storage medium storing computer readable instructions which when executed by one or more processors cause the one or more processors to perform the steps of the method as described in any of the first aspects above.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
The embodiments of the application provide a voice recognition method, system and computer device based on federal learning. The federal learning-based voice recognition method includes: the learning end acquires a local voice sample set, where the voice sample set includes a plurality of voice samples and the sample labels corresponding to each voice sample; the learning end acquires an initial local model whose self-coding network performs the following processing: a first feature extraction processing on each voice sample to obtain a first voice feature set, a second feature extraction processing on the first voice feature set to obtain a second voice feature set, a first fusion processing of the first and second voice feature sets to obtain a third voice feature set, and a second fusion processing of the first, second and third voice feature sets to obtain an output value, the fusion processing increasing the representation capability of the features; the learning end trains the initial local model according to the output value and the sample label to obtain a first fusion adaptive parameter; the learning end sends only the first fusion adaptive parameter, rather than all model parameters, to the server, which reduces parameter transmission and improves communication speed, and receives the first global fusion parameter, sent by the server, obtained by aggregating the first fusion adaptive parameters, which facilitates the subsequent fine-tuning of the initial local model with the first global fusion parameter; the learning end trains the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter, obtains the target local model corresponding to the second fusion adaptive parameter when the preset end condition is met, and sends the target local model to the server; the server determines a target voice recognition model from the target local models sent by the learning ends, obtaining through federal learning a model with increased voice recognition accuracy, and sends the target voice recognition model to the application end; the application end performs voice recognition on input voice data through the target voice recognition model. Because feature fusion is performed in the model and only the parameters modified by the feature fusion structure of the model are transmitted to the server, the representation capability of the features is increased while the number of transmitted model parameters is reduced, giving higher voice recognition accuracy.
Drawings
FIG. 1 is a flow diagram of a federal learning-based speech recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart showing a sub-step of step S200 in FIG. 1;
FIG. 3 is a schematic flow chart showing a sub-step of step S240 in FIG. 2;
FIG. 4 is a flow chart of a federal learning-based speech recognition method provided in another embodiment of the present application;
FIG. 5 is a schematic flow chart showing a sub-step of step S300 in FIG. 1;
FIG. 6 is a schematic flow chart of another substep of step S200 in FIG. 1;
FIG. 7 is a schematic diagram of the architecture of a federal learning-based speech recognition system provided in one embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It is noted that unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
Federal learning (Federated Learning) is an emerging artificial intelligence base technology whose design goal is to carry out efficient machine learning among multiple participants or computing nodes while guaranteeing information security during big-data exchange, protecting the privacy of terminal data and personal data, and remaining legally compliant. The model training of federal learning is distributed over a main server and a plurality of terminals, and the data held by each terminal is not shared with the other terminals or with the main server. Only the model parameter updates (or the model parameters) produced by training are communicated between the main server and the terminals. This approach lets each terminal benefit from the data of the other terminals while keeping its own data secure.
Transformer model: the Transformer model consists of two parts, an encoder and a decoder, each of which is built from stacked blocks used for feature extraction. Each encoder block consists essentially of a multi-head self-attention layer and a feed-forward neural network layer, with a residual connection around each of these sub-layers followed by a layer-normalization step. The decoder differs from the encoder in its first attention layer, called the masked multi-head attention mechanism, which adds a masking operation so that each position is only allowed to attend to earlier positions in the output sequence.
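For orientation, a minimal sketch of such an encoder block follows; the class name, layer sizes and activation are illustrative choices of this sketch, not taken from the application:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal Transformer encoder block: multi-head self-attention and a
    feed-forward network, each wrapped in a residual connection followed by
    layer normalization."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # residual connection + layer norm
        return x
```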
The embodiments of the application provide a voice recognition method, system and computer device based on federal learning. The federal learning-based voice recognition method includes: the learning end acquires a local voice sample set, where the voice sample set includes a plurality of voice samples and the sample labels corresponding to each voice sample; the learning end acquires an initial local model whose self-coding network performs the following processing: a first feature extraction processing on each voice sample to obtain a first voice feature set, a second feature extraction processing on the first voice feature set to obtain a second voice feature set, a first fusion processing of the first and second voice feature sets to obtain a third voice feature set, and a second fusion processing of the first, second and third voice feature sets to obtain an output value, the fusion processing increasing the representation capability of the features; the learning end trains the initial local model according to the output value and the sample label to obtain a first fusion adaptive parameter; the learning end sends only the first fusion adaptive parameter, rather than all model parameters, to the server, which reduces parameter transmission and improves communication speed, and receives the first global fusion parameter, sent by the server, obtained by aggregating the first fusion adaptive parameters, which facilitates the subsequent fine-tuning of the initial local model with the first global fusion parameter; the learning end trains the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter, obtains the target local model corresponding to the second fusion adaptive parameter when the preset end condition is met, and sends the target local model to the server; the server determines a target voice recognition model from the target local models sent by the learning ends, obtaining through federal learning a model with increased voice recognition accuracy, and sends the target voice recognition model to the application end; the application end performs voice recognition on input voice data through the target voice recognition model. According to the embodiments of the invention, only the parameters modified by the feature fusion structure of the model are transmitted to the server, so the representation capability of the features is increased while the number of transmitted model parameters is reduced, giving higher voice recognition accuracy.
It should be noted that the federal learning-based voice recognition method optimizes the federal system as a whole by reducing parameter transmission between the learning ends and the server. The initial local model at each learning end is a Transformer model or a variant of the Transformer model, so it can recognize natural language or pre-processed speech, which broadens the range of application. The Transformer model suits many application scenarios: at each learning end of the federal system, a Transformer model or one of its variants is chosen according to the data set supplied, so that tasks in different scenarios can be carried out.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Embodiments of the present application may be used in a variety of general-purpose or special-purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The following describes a federal learning-based voice recognition method, system and computer device according to embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 shows a schematic flow chart of a federal learning-based speech recognition method according to an embodiment of the present application. The speech recognition method based on federal learning is applied to a speech recognition system, and the speech recognition system comprises an application end, a server and a plurality of learning ends, wherein the speech recognition method based on federal learning comprises, but is not limited to, step S100, step S200, step S300, step S400, step S500, step S600 and step S700.
In step S100, the learning end obtains a local voice sample set, where the voice sample set includes a plurality of voice samples and sample labels corresponding to each voice sample.
In one embodiment, a local voice sample set is obtained at the learning end, which is beneficial to the subsequent training of the network model by using the local voice sample set. The voice sample set comprises a plurality of voice samples and sample labels corresponding to the voice samples, and each voice sample can be natural language or processed voice; the sample label is the correct text information corresponding to the voice. The voice sample set of each learning end can be the same or different.
Step S200, the learning end acquires an initial local model, and a self-coding network of the initial local model performs the following processing: and performing first feature extraction processing on each voice sample to obtain a first voice feature set, performing second feature extraction processing on the first voice feature set to obtain a second voice feature set, performing first fusion processing on the first voice feature set and the second voice feature set to obtain a third voice feature set, and performing second fusion processing on the first voice feature set, the second voice feature set and the third voice feature set to obtain an output value.
In an embodiment, the initial local model obtained may be a Transformer model or a variant of a Transformer model. Where the initial local model is a Transformer model, in its encoder network, either a plain encoder network or an encoder network equipped with an adaptive module may be used to perform the first feature extraction processing on each voice sample, giving a first voice feature set that can be further processed; the adaptive module in the encoder network equipped with an adaptive module is then used to perform the second feature extraction processing on the first voice feature set, giving a second voice feature set ready for the subsequent fusion operations; the first voice feature set and the second voice feature set undergo a first fusion processing to give a third voice feature set; and the first, second and third voice feature sets undergo a second fusion processing to give an output value. Fusing the output features gives the local model better expressive power.
Where the initial local model is a Transformer model that includes a feedforward module, an adaptive module and an adaptive fusion module, the feedforward module performs the first feature extraction processing on each voice sample to obtain the first voice feature set; the adaptive module performs the second feature extraction processing on the first voice feature set to obtain the second voice feature set; the adaptive fusion module performs the first fusion processing on the first and second voice feature sets to obtain the third voice feature set, and performs the second fusion processing on the first, second and third voice feature sets to obtain the output value. Fusing the output features of the feedforward module and the adaptive module through the adaptive fusion module gives the local model better expressive power.
In an embodiment, the first voice feature set includes a plurality of first voice features, which are representations of the features extracted from the voice samples; the second voice feature set includes a plurality of second voice features, which are representations of the features extracted from the first voice feature set; the third voice feature set includes a plurality of third voice features, which are representations of the fused features. The first fusion processing and the second fusion processing may take the dot product of features or may concatenate features.
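As a concrete reading of the first fusion processing, the following sketch implements the two options just named (dot product of features, or concatenation); the linear projection on the concatenation branch and all names are assumptions of this sketch, not taken from the application:

```python
import torch
import torch.nn as nn

class FirstFusion(nn.Module):
    """First fusion processing: combine the first voice feature set (from the
    feedforward module) with the second voice feature set (from the adaptive
    module) to produce the third voice feature set, either by element-wise
    (dot) product or by concatenation; the projection back to the model width
    after concatenation is an assumption of this sketch."""
    def __init__(self, d_model: int = 256, mode: str = "dot"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(2 * d_model, d_model)  # used only for "concat"

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        if self.mode == "dot":
            return f1 * f2                                 # dot product of features
        return self.proj(torch.cat([f1, f2], dim=-1))      # concatenate, then project
```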
As shown in fig. 2, the second fusion processing is performed on the first voice feature set, the second voice feature set and the third voice feature set to obtain output values, which includes but is not limited to the following steps:
step S210, the first voice feature set is used as a first matrix of the attention mechanism.
In an embodiment, the attention mechanism is adopted for processing, and the attention mechanism may be a self-attention mechanism or a multi-head attention mechanism, so that a QKV matrix can be constructed, which is not described herein. In the case where the attention mechanism is a self-attention mechanism, the first set of extracted features is taken as a first matrix of the attention mechanism, wherein the first matrix is a Q matrix of the self-attention mechanism. The subsequent calculation of the output value from the first matrix is facilitated by obtaining the first matrix.
Step S220, using the second speech feature set as a second matrix of the attention mechanism.
In an embodiment, according to step S210, in the case where the attention mechanism is a self-attention mechanism, the second extracted feature set is used as a second matrix of the attention mechanism, where the second matrix is a K matrix of the self-attention mechanism. The subsequent calculation of the output values from the second matrix is facilitated by obtaining the second matrix.
Step S230, taking the third speech feature set as the third matrix of the attention mechanism.
In an embodiment, according to step S210 and step S220, in the case where the attention mechanism is a self-attention mechanism, the third extracted feature set is taken as a third matrix of the attention mechanism, where the third matrix is a V matrix of the self-attention mechanism. The subsequent calculation of the output values from the third matrix is facilitated by obtaining the third matrix.
In step S240, the learning end performs a first matrix fusion process on the first matrix, the second matrix and the third matrix according to a preset fusion algorithm, so as to obtain an output value.
As shown in fig. 3, the learning end performs a first matrix fusion process on the first matrix, the second matrix and the third matrix according to a preset fusion algorithm to obtain an output value, which includes but is not limited to the following steps:
Step S241, the learning end performs a second matrix fusion process on the first matrix and the second matrix to obtain a first fusion value;
in an embodiment, the learning end performs a second matrix fusion process on the first matrix and the second matrix to obtain a first fusion value, and the extracted features can have better expressive ability through the fusion process.
In step S242, the learning end performs normalization processing on the first fusion value by using the normalization layer, so as to obtain a second fusion value.
In an embodiment, the learning end performs normalization processing on the first fusion value by using a normalization layer to obtain a second fusion value, the normalization layer can be a softmax layer or a sigmoid layer, and features can be normalized to a certain range by performing normalization processing, so that the feature value is prevented from being too large or too small, and the accuracy of the model is prevented from being influenced.
Step S243, the learning end performs a third matrix fusion process on the second fusion value and the third matrix to obtain an output value.
In an embodiment, the learning end performs a third matrix fusion processing on the second fusion value and the third matrix to obtain the output value; the fusion processing gives the extracted features better expressive ability. The first matrix fusion processing, the second matrix fusion processing and the third matrix fusion processing each fuse two or more matrices, either by matrix (dot) multiplication or by concatenating the matrices. The first fusion value and the second fusion value are both matrix representations of fused features.
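Taken together, steps S210 to S243 describe an attention-style combination of the three matrices. A minimal sketch follows, assuming matrix multiplication for the matrix fusion steps and a softmax normalization layer (both named as options above); the scaling by the square root of the feature width is a standard attention detail assumed here, not stated in the application:

```python
import torch

def attention_fusion(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Q = first voice feature set, K = second, V = third (steps S210-S230).
    Expected shapes: (batch, seq_len, d)."""
    first_fusion = q @ k.transpose(-2, -1)                  # S241: second matrix fusion of Q and K
    second_fusion = torch.softmax(                          # S242: normalization layer (softmax)
        first_fusion / q.shape[-1] ** 0.5, dim=-1)
    return second_fusion @ v                                # S243: third matrix fusion with V
```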
As shown in fig. 6, the second feature extraction process is performed on the first speech feature set to obtain a second speech feature set, including but not limited to the following steps:
step S250, the learning end performs downsampling operation on the first voice feature set to obtain sampling features.
In an embodiment, the learning end may down-sample by taking the average value over each first voice feature in the first voice feature set (average pooling) to obtain the sampled features, or by taking the maximum value (max pooling) to obtain the sampled features. Down-sampling reduces the amount of computation and helps avoid over-fitting of the model.
In step S260, the learning end performs an up-sampling operation on the sampled feature to obtain a second speech feature set.
In an embodiment, the learning end may up-sample the sampled features using dilated convolution to obtain the second voice feature set, or using deconvolution (transposed convolution) to obtain the second voice feature set. Up-sampling enlarges the receptive field and enhances the representation capability of the features.
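A sketch of this second feature extraction processing follows, assuming average pooling for the down-sampling of step S250 and a transposed convolution (deconvolution) for the up-sampling of step S260; kernel sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class AdaptiveModule(nn.Module):
    """Second feature extraction processing: down-sample the first voice
    feature set along the time axis (S250), then up-sample back (S260)."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.down = nn.AvgPool1d(kernel_size=2, stride=2)    # average pooling (S250)
        self.up = nn.ConvTranspose1d(d_model, d_model,       # deconvolution (S260)
                                     kernel_size=2, stride=2)

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        x = f1.transpose(1, 2)       # (batch, seq, d) -> (batch, d, seq) for 1-D ops
        x = self.up(self.down(x))    # sampled features, then second voice feature set
        return x.transpose(1, 2)     # back to (batch, seq, d)
```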
Step S300, training the initial local model by the learning end according to the output value and the sample label to obtain a first fusion adaptive parameter.
As shown in fig. 5, the learning end trains the initial local model according to the output value and the sample label to obtain a first fusion adaptive parameter, which includes but is not limited to the following steps:
in step S310, the learning end obtains the value of the loss function according to the output value and the sample label corresponding to the output value.
In an embodiment, the learning end may compute the divergence between the output value and the sample label corresponding to the output value using the Kullback-Leibler divergence (KL divergence) to obtain the value of the loss function; the value of the loss function may also be obtained using the Jensen-Shannon divergence (JS divergence). Obtaining the value of the loss function facilitates the subsequent training of the initial local model.
In step S320, the learning end trains the initial local model according to the value of the loss function, so as to obtain a first fusion adaptive parameter.
In one embodiment, the learning end back-propagates through the initial local model according to the value of the loss function and updates the weights and biases of the initial local model with a gradient descent algorithm so that the initial local model converges. The updated weights and biases are those used for the first fusion processing; the parameters of the other feature extraction processing of the initial local model are frozen and not updated, which yields the first fusion adaptive parameter, i.e. the weights and biases for performing the first fusion processing. Obtaining the first fusion adaptive parameter makes it possible to derive the first global fusion parameter from it.
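A sketch of one such training step follows, assuming the KL divergence loss of step S310 and a PyTorch-style model whose fusion parameters can be recognized by name; the "fusion" naming convention, the function name and the label-distribution input are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def local_train_step(model, optimizer, speech_batch, label_probs):
    """One training step: only the fusion weights and biases receive
    gradients; all other parameters of the initial local model stay frozen."""
    for name, p in model.named_parameters():
        p.requires_grad = "fusion" in name    # assumption: fusion params are so named
    output = model(speech_batch)              # output value of the self-coding network
    log_probs = F.log_softmax(output, dim=-1)
    loss = F.kl_div(log_probs, label_probs,   # KL divergence against the sample labels
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()                           # back-propagation
    optimizer.step()                          # gradient-descent update of fusion params
    return loss.item()
```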
Step S400, the learning end sends the first fusion adaptive parameters to the server and receives the first global fusion parameters obtained by aggregating the first fusion adaptive parameters sent by the server.
In an embodiment, the first fusion adaptive parameter is obtained according to steps S100-S300; each learning end sends the first fusion adaptive parameter to the server, and sends only the first fusion adaptive parameter rather than all model parameters, which reduces parameter transmission and improves communication speed. Each learning end then receives the first global fusion parameter, issued by the server, obtained by aggregating the first fusion adaptive parameters. Obtaining the first global fusion parameter facilitates the subsequent fine-tuning of the parameters of the initial local model. The first global fusion parameter is a model parameter obtained through the server.
In an embodiment, the server receives the first fusion adaptive parameters sent by each learning end, and can average the first fusion adaptive parameters to obtain first global fusion parameters; the first fusion adaptive parameters corresponding to the optimal performance of the initial local model can be selected from the first fusion adaptive parameters to obtain the first global fusion parameters, or the first global fusion parameters can be obtained in other calculation modes, which is not described herein.
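The averaging option might look as follows; this sketch assumes the fusion adaptive parameters arrive as mappings from parameter name to tensor:

```python
import torch

def aggregate(client_params: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Average the first fusion adaptive parameters sent by the learning ends
    to obtain the first global fusion parameter (one of the aggregation
    options described above)."""
    return {
        name: torch.stack([params[name] for params in client_params]).mean(dim=0)
        for name in client_params[0]
    }
```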
Step S500, the learning end trains an initial local model based on the first global fusion parameters to obtain second fusion adaptive parameters, obtains a target local model corresponding to the second fusion adaptive parameters under the condition that the preset end conditions are met, and sends the target local model to the server.
In an embodiment, the first global fusion parameter obtained in step S400 replaces the first fusion adaptive parameter, and the initial local model is then trained based on the first global fusion parameter; during back-propagation in this training, only the weights and biases used for the first fusion processing are updated, yielding the second fusion adaptive parameter. When the preset end condition is met, the target local model is obtained and sent to the server, so that the target voice recognition model can subsequently be obtained through the server. The preset end condition may be that the loss function reaches a preset value and no longer fluctuates strongly, i.e. the initial local model has converged; or it may be that the number of training iterations of the initial local model reaches a preset maximum, at which point training ends and the target local model is obtained; the preset maximum number of iterations may, for example, be 100000. The second fusion adaptive parameter is obtained by updating the first global fusion parameter during training.
As shown in fig. 4, after the learning end trains the initial local model based on the first global fusion parameter to obtain the second fusion adaptive parameter, the speech recognition method based on federal learning further includes, but is not limited to, the following steps:
Step S810, the learning end sends the second fusion adaptive parameter to the server if the preset ending condition is not satisfied.
In an embodiment, under the condition that the preset end condition is not met, that is, the initial local model is not converged yet, multiple rounds of training are required, and the learning end sends the second fusion adaptive parameter to the server, so that the second global fusion parameter can be obtained according to the server. The second global fusion parameter is a model parameter obtained through a server.
Step S820, the learning terminal receives a second global fusion parameter obtained by aggregating the second fusion adaptive parameters sent by the server.
In an embodiment, the server receives the second fusion adaptive parameters sent by each learning end, and can average the second fusion adaptive parameters to obtain second global fusion parameters; the second fusion adaptive parameters corresponding to the optimal performance of the initial local model can be selected from the second fusion adaptive parameters to obtain the second global fusion parameters, or the second global fusion parameters can be obtained in other calculation modes, which is not described herein. The server sends the calculated second global fusion parameters to each learning end, and each learning end receives the second global fusion parameters sent by the server, so that the learning end can carry out model training based on the second global fusion parameters.
In step S830, the learning end trains the initial local model based on the second global fusion parameter.
In one embodiment, the second global fusion parameter obtained in step S820 replaces the second fusion adaptive parameter, and the initial local model is then trained based on the second global fusion parameter; during back-propagation in this training, only the weights and biases used for the first fusion processing are updated.
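Putting steps S400 to S830 together, each learning end alternates local training of the fusion parameters with global aggregation until the preset end condition holds. A high-level sketch under the assumptions of the earlier snippets; the server send/receive calls and the convergence threshold stand in for details the application does not specify:

```python
def run_learning_end(model, optimizer, data_loader, server, max_rounds: int = 100):
    """Federated loop at one learning end: train the fusion parameters,
    exchange only those parameters with the server, and stop when the preset
    end condition (convergence or maximum rounds) is met."""
    for round_idx in range(max_rounds):
        for speech_batch, label_probs in data_loader:
            loss = local_train_step(model, optimizer, speech_batch, label_probs)
        fusion_params = {n: p.detach().clone()
                         for n, p in model.named_parameters() if "fusion" in n}
        server.send(fusion_params)               # only fusion params, not the whole model
        global_params = server.receive()         # aggregated global fusion parameters
        model.load_state_dict(global_params,     # replace the local fusion parameters
                              strict=False)
        if loss < 1e-3:                          # illustrative convergence test
            break
    server.send_model(model)                     # target local model to the server
```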
In step S600, the server determines a target speech recognition model according to the target local model sent by each learning end, and sends the target speech recognition model to the application end.
In an embodiment, according to step S500, the server receives the target local models sent by the learning ends, selects the model with the best performance among them as the target voice recognition model, and sends the target voice recognition model to the application end. Obtaining the target voice recognition model through federal learning in this way increases the accuracy of model voice recognition.
In step S700, the application end performs speech recognition on the input speech data through the target speech recognition model.
In an embodiment, according to step S600, the application end receives the target voice recognition model sent by the server, and performs voice recognition on the input voice data through the target voice recognition model, so as to realize intelligent conversion, thereby saving time, manpower, material resources and financial resources.
Referring to fig. 7, an embodiment of the present application provides a federal learning-based voice recognition system 100. The voice recognition system 100 includes a plurality of learning ends 110, a server 120 and an application end 130, and each learning end 110 includes: a data acquisition module 111 for acquiring a local voice sample set, where the voice sample set includes a plurality of voice samples and the sample labels corresponding to each voice sample; a model acquisition module 112 for acquiring an initial local model whose self-coding network performs the following processing: a first feature extraction processing on each voice sample to obtain a first voice feature set, a second feature extraction processing on the first voice feature set to obtain a second voice feature set, a first fusion processing of the first and second voice feature sets to obtain a third voice feature set, and a second fusion processing of the first, second and third voice feature sets to obtain an output value, the fusion processing increasing the representation capability of the features; a first processing module 113 for training the initial local model according to the output value and the sample label to obtain a first fusion adaptive parameter; a sending module 114 for sending the first fusion adaptive parameter, and only the first fusion adaptive parameter rather than all model parameters, to the server 120, which reduces parameter transmission and improves communication speed; and a receiving module 115 for receiving the first global fusion parameter, sent by the server 120, obtained by aggregating the first fusion adaptive parameters, which facilitates the subsequent fine-tuning of the initial local model with the first global fusion parameter. A second processing module 116 trains the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter and obtains, when the preset end condition is met, the target local model corresponding to the second fusion adaptive parameter; the sending module 114 is further used for sending the target local model to the server 120. The server 120 determines a target voice recognition model from the target local models sent by the learning ends 110, obtaining through federal learning a model with increased voice recognition accuracy, and sends the target voice recognition model to the application end 130; the application end 130 performs voice recognition on input voice data through the target voice recognition model.
The data acquisition module 111 is connected to the model acquisition module 112, the model acquisition module 112 to the first processing module 113, the first processing module 113 to the sending module 114, the sending module 114 to the receiving module 115, and the receiving module 115 to the second processing module 116. The federal learning-based voice recognition method described above is applied to the federal learning-based voice recognition system 100; because the system 100 transmits to the server only the parameters modified by the feature fusion structure of the model, the representation capability of the features is increased while the number of transmitted model parameters is reduced, giving higher voice recognition accuracy. The first processing module 113 and the second processing module 116 are both central processing units; a central processing unit generally comprises a logic operation unit, a control unit and a storage unit, and performing the computation on central processing units saves a great deal of human effort.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
Fig. 8 illustrates a computer device 500 provided by an embodiment of the present application. The computer device 500 may be a server or a terminal, and the internal structure of the computer device 500 includes, but is not limited to:
a memory 510 for storing a program;
a processor 520 for executing the program stored in the memory 510; when the processor 520 executes the program stored in the memory 510, the processor 520 performs the federal learning-based speech recognition method described above.
The processor 520 and the memory 510 may be connected by a bus or other means.
Memory 510, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, as well as non-transitory computer-executable programs, such as federal learning-based speech recognition methods described in any of the embodiments of the present application. Processor 520 implements the federal learning-based speech recognition method described above by running non-transitory software programs and instructions stored in memory 510.
Memory 510 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for functionality, and the storage data area may store data produced while performing the federal learning-based speech recognition method described above. In addition, memory 510 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, memory 510 may optionally include memory located remotely from processor 520, which may be connected to processor 520 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the federal learning-based speech recognition method described above are stored in memory 510, which when executed by one or more processors 520, perform the federal learning-based speech recognition method provided by any embodiment of the present application.
The embodiment of the application also provides a computer readable storage medium, which stores computer executable instructions for executing the federal learning-based voice recognition method.
In one embodiment, the storage medium stores computer-executable instructions that are executed by one or more control processors 520, for example, by one of the processors 520 in the computer device 500, such that the one or more processors 520 perform the federal learning-based speech recognition method provided in any embodiment of the present application.
The embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The terms "first," "second," "third," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" the following items or the like means any combination of these items, including any combination of a single item or of plural items. For example, at least one of a, b or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically include computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred embodiments of the present application have been described in detail above, but the present application is not limited to those embodiments. Those skilled in the art can make various equivalent modifications and substitutions without departing from the spirit of the present application, and such equivalent modifications and substitutions are intended to be included within the scope of the present application as defined by the claims.

Claims (10)

1. A federal learning-based speech recognition method, which is applied to a speech recognition system, wherein the speech recognition system comprises an application end, a server and a plurality of learning ends, and the method comprises:
the learning terminal acquires a local voice sample set, wherein the voice sample set comprises a plurality of voice samples and sample labels corresponding to the voice samples;
the learning end acquires an initial local model, and a self-coding network of the initial local model performs the following processing: performing first feature extraction processing on each voice sample to obtain a first voice feature set, performing second feature extraction processing on the first voice feature set to obtain a second voice feature set, performing first fusion processing on the first voice feature set and the second voice feature set to obtain a third voice feature set, and performing second fusion processing on the first voice feature set, the second voice feature set and the third voice feature set to obtain an output value;
the learning end trains the initial local model according to the output value and the sample label to obtain a first fusion adaptive parameter;
the learning end sends the first fusion adaptive parameter to the server and receives a first global fusion parameter sent by the server, the first global fusion parameter being obtained by aggregating the first fusion adaptive parameters sent by the learning ends;
the learning end trains the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter, obtains a target local model corresponding to the second fusion adaptive parameter under the condition that the preset end condition is met, and sends the target local model to the server;
the server determines a target voice recognition model according to the target local model sent by each learning end and sends the target voice recognition model to the application end;
the application end carries out voice recognition on the input voice data through the target voice recognition model.
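As a reading aid, the following is a minimal sketch of the per-round logic that claim 1 describes at one learning end, with the server-side aggregation reduced to a plain average of the shared parameters. The layer shapes, the tanh/linear operations, and the averaging rule are assumptions for illustration only; the claims do not fix the network internals or the aggregation algorithm. Note how only the fusion parameters (Wf here) leave the learning end in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                    # assumed feature dimension

# Per-client weights; only Wf (the feature-fusion structure) is ever shared.
W1 = rng.normal(scale=0.1, size=(D, D))   # first feature extraction
W2 = rng.normal(scale=0.1, size=(D, D))   # second feature extraction
Wf = rng.normal(scale=0.1, size=(D, D))   # fusion adaptive parameters

def self_coding_forward(x: np.ndarray) -> np.ndarray:
    """x: (T, D) frames of one voice sample -> output value of the self-coding network."""
    f1 = np.tanh(x @ W1)                  # first voice feature set
    f2 = np.tanh(f1 @ W2)                 # second voice feature set
    f3 = np.tanh((f1 + f2) @ Wf)          # first fusion -> third voice feature set
    return f1 + f2 + f3                   # second fusion -> output value (simplified here)

def aggregate(fusion_params: list) -> np.ndarray:
    """Server side: global fusion parameter as the mean of the clients' fusion parameters."""
    return np.mean(np.stack(fusion_params), axis=0)

# One federated round: each learning end trains locally (training loop omitted),
# then only its fusion parameters travel to the server.
clients_Wf = [Wf + rng.normal(scale=0.01, size=(D, D)) for _ in range(3)]
global_Wf = aggregate(clients_Wf)         # first global fusion parameter
```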
2. The method of claim 1, wherein performing second fusion processing on the first voice feature set, the second voice feature set and the third voice feature set to obtain the output value comprises:
taking the first voice feature set as a first matrix of an attention mechanism;
taking the second voice feature set as a second matrix of the attention mechanism;
taking the third voice feature set as a third matrix of the attention mechanism;
and the learning end performs first matrix fusion processing on the first matrix, the second matrix and the third matrix according to a preset fusion algorithm to obtain the output value.
3. The method of claim 2, wherein the learning end performing first matrix fusion processing on the first matrix, the second matrix and the third matrix according to a preset fusion algorithm to obtain the output value comprises:
the learning end performs second matrix fusion processing on the first matrix and the second matrix to obtain a first fusion value;
the learning end performs normalization processing on the first fusion value by using a normalization layer to obtain a second fusion value;
and the learning end performs third matrix fusion processing on the second fusion value and the third matrix to obtain the output value.
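Claims 2 and 3 read naturally as a scaled dot-product attention over the three feature sets, with the first and second matrices in query- and key-like roles and the third matrix in a value-like role. The sketch below assumes softmax as the normalization layer and (scaled) dot products as the matrix fusion steps; neither choice is fixed by the claims.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(f1: np.ndarray, f2: np.ndarray, f3: np.ndarray) -> np.ndarray:
    """f1, f2, f3: (T, D) first/second/third voice feature sets -> (T, D) output value."""
    d = f1.shape[-1]
    first_fusion = f1 @ f2.T / np.sqrt(d)     # second matrix fusion -> first fusion value
    second_fusion = softmax(first_fusion)     # normalization layer -> second fusion value
    return second_fusion @ f3                 # third matrix fusion -> output value
```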
4. The method of claim 1, wherein after the learning end trains the initial local model based on the first global fusion parameter to obtain the second fusion adaptive parameter, the method further comprises:
under the condition that the preset end condition is not met, the learning end sends the second fusion adaptive parameter to the server;
the learning end receives a second global fusion parameter sent by the server, the second global fusion parameter being obtained by aggregating the second fusion adaptive parameters;
the learning end trains the initial local model based on the second global fusion parameter.
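Read together with claim 1, claim 4 describes an iterated train-and-aggregate loop. A compact sketch of that loop follows; treating the "preset end condition" as either a fixed round budget or convergence of the global fusion parameter is an assumption, since the claims do not define it.

```python
import numpy as np

def run_federated_rounds(client_train_fns, aggregate_fn,
                         max_rounds: int = 10, tol: float = 1e-4):
    """client_train_fns: callables mapping a global parameter (or None) to a local parameter."""
    global_param = None
    for _ in range(max_rounds):
        local_params = [train(global_param) for train in client_train_fns]
        new_global = aggregate_fn(local_params)
        # Assumed end condition: the global fusion parameter has stopped moving.
        if global_param is not None and np.linalg.norm(new_global - global_param) < tol:
            return new_global
        global_param = new_global
    return global_param
```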
5. The method of claim 1, wherein the learning end training the initial local model according to the output value and the sample label to obtain the first fusion adaptive parameter comprises:
the learning end obtains a value of a loss function according to the output value and the sample label corresponding to the output value;
and the learning end trains the initial local model according to the value of the loss function to obtain the first fusion adaptive parameter.
6. The method of claim 1, wherein performing second feature extraction processing on the first voice feature set to obtain the second voice feature set comprises:
the learning end performs a downsampling operation on the first voice feature set to obtain sampling features;
and the learning end performs an upsampling operation on the sampling features to obtain the second voice feature set.
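Claim 6 obtains the second voice feature set by pushing the first set through a down-then-up sampling bottleneck. Below is a one-dimensional sketch using stride-2 decimation and nearest-neighbor upsampling; the actual sampling operators (e.g. strided convolutions) are not specified by the claim.

```python
import numpy as np

def second_feature_extraction(f1: np.ndarray, stride: int = 2) -> np.ndarray:
    """f1: (T, D) first voice feature set -> (T, D) second voice feature set."""
    sampled = f1[::stride]                             # downsampling -> sampling features
    up = np.repeat(sampled, stride, axis=0)[:len(f1)]  # nearest-neighbor upsampling
    return up                                          # second voice feature set
```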
7. The method of claim 5, wherein the learning end obtaining the value of the loss function according to the output value and the sample label corresponding to the output value comprises:
and the learning end calculates the divergence between the output value and the sample label corresponding to the output value by using the KL divergence, so as to obtain the value of the loss function.
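Claim 7 scores the output value against its sample label with the KL divergence. A sketch follows, under the assumptions that the label is (or is normalized to) a probability distribution and that the output is mapped through a softmax first; the claim itself only names the divergence.

```python
import numpy as np

def kl_loss(output: np.ndarray, label: np.ndarray, eps: float = 1e-12) -> float:
    """KL(label || softmax(output)) for 1-D vectors; the softmax is an assumption."""
    q = np.exp(output - output.max())
    q = q / q.sum()                      # model distribution
    p = label + eps
    p = p / p.sum()                      # target distribution from the sample label
    return float(np.sum(p * np.log(p / (q + eps))))
```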
8. A speech recognition system based on federal learning, wherein the speech recognition system comprises an application end, a server and a plurality of learning ends, and each learning end comprises:
the data acquisition module is used for acquiring a local voice sample set, wherein the voice sample set comprises a plurality of voice samples and sample labels corresponding to the voice samples;
the model acquisition module is used for acquiring an initial local model, and the self-coding network of the initial local model performs the following processing: performing first feature extraction processing on each voice sample to obtain a first voice feature set, performing second feature extraction processing on the first voice feature set to obtain a second voice feature set, performing first fusion processing on the first voice feature set and the second voice feature set to obtain a third voice feature set, and performing second fusion processing on the first voice feature set, the second voice feature set and the third voice feature set to obtain an output value;
the first processing module is used for training the initial local model according to the output value and the sample label to obtain a first fusion adaptive parameter;
the receiving module is used for receiving a first global fusion parameter sent by the server, the first global fusion parameter being obtained by aggregating the first fusion adaptive parameters;
the second processing module is used for training the initial local model based on the first global fusion parameter to obtain a second fusion adaptive parameter, and obtaining a target local model corresponding to the second fusion adaptive parameter under the condition that the preset end condition is met, and the sending module is also used for sending the target local model to the server;
the server determines a target voice recognition model according to the target local model sent by each learning end and sends the target voice recognition model to the application end;
the application end carries out voice recognition on the input voice data through the target voice recognition model.
9. A computer device comprising a memory and one or more processors, the memory having stored therein computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1 to 7.
10. A computer readable storage medium readable and writable by a processor, the storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1 to 7.
Application CN202310269664.7A, priority date 2023-03-15, filing date 2023-03-15: Speech recognition method, system and computer equipment based on federal learning. Status: Pending. Publication: CN116343760A (en).

Priority Applications (1)

Application Number: CN202310269664.7A; Priority Date: 2023-03-15; Filing Date: 2023-03-15; Title: Speech recognition method, system and computer equipment based on federal learning

Applications Claiming Priority (1)

Application Number: CN202310269664.7A; Priority Date: 2023-03-15; Filing Date: 2023-03-15; Title: Speech recognition method, system and computer equipment based on federal learning

Publications (1)

Publication Number: CN116343760A (en); Publication Date: 2023-06-27

Family ID: 86881706

Family Applications (1)

Application Number: CN202310269664.7A; Title: Speech recognition method, system and computer equipment based on federal learning; Status: Pending; Priority Date: 2023-03-15; Filing Date: 2023-03-15

Country Status (1)

Country: CN (1); Link: CN116343760A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination