CN116913247A - Voice recognition method and device and storage medium - Google Patents

Voice recognition method and device and storage medium

Info

Publication number
CN116913247A
Authority
CN
China
Prior art keywords
network
expert
private
dialect
feature
Prior art date
Legal status
Pending
Application number
CN202211466722.7A
Other languages
Chinese (zh)
Inventor
李慧慧
张世磊
侯雷静
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN202211466722.7A priority Critical patent/CN116913247A/en
Publication of CN116913247A publication Critical patent/CN116913247A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application provides a voice recognition method, which comprises the following steps: acquiring voice data to be recognized, wherein the voice data to be recognized is voice data comprising at least one dialect; determining voice acoustic features and dialect embedding features corresponding to the voice data to be recognized; inputting the voice acoustic features and the dialect embedding features into a coding network to obtain a feature sequence corresponding to the voice data to be recognized, wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks; and generating a recognition result corresponding to the voice data to be recognized according to the feature sequence. A corresponding private expert network is dynamically selected through the weight values output by the gating network in the coding network to process the acoustic coding features corresponding to the voice data to be recognized, and common features among different dialects are modeled through the shared expert network, thereby improving the accuracy of voice recognition.

Description

Voice recognition method and device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for speech recognition, and a storage medium.
Background
At present, most research on speech recognition focuses on Mandarin, for which data resources are abundant, and good recognition performance has already been achieved. Related research on dialects mostly concentrates on a single dialect, and dialect speech recognition performance is usually improved by adding dialect speech data from the target region.
However, when recognizing voice data in which several dialects are mixed, common recognition schemes cannot model the acoustic and linguistic differences among the dialects well, which reduces the recognition rate. To improve recognition accuracy, a dedicated speech recognition system would have to be built for each dialect, which is cumbersome to implement and degrades terminal performance.
Disclosure of Invention
The embodiment of the application provides a voice recognition method and device and a storage medium, which can effectively improve the accuracy of voice recognition on the basis of ensuring the performance of a terminal.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a voice recognition method, where the voice recognition method includes:
acquiring voice data to be recognized; wherein the voice data to be recognized comprises at least one dialect;
determining voice acoustic characteristics and dialect embedding characteristics corresponding to the voice data to be recognized;
inputting the voice acoustic characteristics and the dialect embedded characteristics into a coding network, and outputting a characteristic sequence corresponding to the voice data to be recognized; the encoding network comprises at least one layer of encoder, wherein the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks;
and generating a recognition result corresponding to the voice data to be recognized according to the feature sequence.
In a second aspect, an embodiment of the present application provides a voice recognition apparatus, including: the device comprises an acquisition unit, a determination unit, an output unit and a generation unit;
the acquisition unit is used for acquiring voice data to be recognized; wherein the voice data to be recognized comprises at least one dialect;
the determining unit is used for determining voice acoustic characteristics and dialect embedding characteristics corresponding to the voice data to be recognized;
the output unit is used for inputting the voice acoustic characteristics and the dialect embedding characteristics into a coding network and outputting a characteristic sequence corresponding to the voice data to be recognized; the encoding network comprises at least one layer of encoder, wherein the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks;
the generating unit is used for generating a recognition result corresponding to the voice data to be recognized according to the feature sequence.
In a third aspect, an embodiment of the present application provides a voice recognition apparatus, including: a processor and a memory; wherein,
the memory is used for storing a computer program capable of running on the processor;
the processor is configured to perform the speech recognition method as described above when the computer program is run.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the storage medium has stored thereon computer program code which, when executed by a computer, implements a speech recognition method as described above.
The embodiment of the application provides a voice recognition method and device and a storage medium, wherein the method comprises the following steps: the voice recognition device acquires voice data to be recognized, wherein the voice data to be recognized is voice data comprising at least one dialect; determines voice acoustic features and dialect embedding features corresponding to the voice data to be recognized; inputs the voice acoustic features and the dialect embedding features into a coding network, and outputs a feature sequence corresponding to the voice data to be recognized, wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks; and generates a recognition result corresponding to the voice data to be recognized according to the feature sequence. In this way, the voice recognition device can acquire the voice data to be recognized, determine the corresponding voice acoustic features and dialect embedding features, and input them into the coding network to obtain the feature sequence corresponding to the voice data to be recognized. Because voice data of different dialects can be processed by specific private expert networks, the accuracy of voice recognition is improved while terminal performance is preserved; at the same time, the shared expert network processes the acoustic coding features corresponding to the voice acoustic features and models the common features among different dialects, which further improves the accuracy of voice recognition.
Drawings
FIG. 1 is a first schematic diagram of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a first schematic diagram of an encoder according to an embodiment of the present application;
FIG. 3 is a second schematic diagram of an encoder according to an embodiment of the present application;
FIG. 4 is a second schematic diagram of a speech recognition method according to an embodiment of the present application;
FIG. 5 is a first schematic diagram of the composition structure of a voice recognition device according to an embodiment of the present application;
FIG. 6 is a second schematic diagram of the composition structure of a voice recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to be limiting. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
At present, most research on speech recognition focuses on Mandarin, for which data resources are abundant; good recognition performance has been achieved, and such systems are widely used in daily life. However, dialects are spoken across wide areas, and many people are still accustomed to using a dialect in everyday speech, and the recognition rate of a speech recognition system built on standard Mandarin drops noticeably when it faces dialects. Meanwhile, dialects of different regions differ greatly, and the differences appear in pronunciation, vocabulary, and grammar, with pronunciation being especially prominent. According to the common modern classification, modern Chinese dialects can be divided into seven major dialect regions, namely the Northern dialect, the Wu dialect, the Xiang (Hunan) dialect, the Hakka dialect, the Min dialect, the Yue (Cantonese) dialect, and the Gan dialect, with the Jin dialect sometimes treated as an additional region. Moreover, within a complex dialect region, some dialects can be further divided into several dialect clusters (also called sub-dialects). At present, related research on dialects concentrates on a single dialect, and dialect speech recognition performance is usually improved by adding dialect speech data from the target region; as a result, a dedicated speech recognition system has to be built for each dialect, which is cumbersome to implement and degrades terminal performance. In addition, conventional recognition schemes cannot model the acoustic and linguistic differences among dialects well, which reduces the recognition rate.
In order to solve the problem that conventional voice recognition technology cannot achieve both terminal performance and voice recognition accuracy, the embodiment of the application provides a voice recognition method and device and a storage medium, wherein the method comprises the following steps: the voice recognition device acquires voice data to be recognized, wherein the voice data to be recognized is voice data comprising at least one dialect; determines voice acoustic features and dialect embedding features corresponding to the voice data to be recognized; inputs the voice acoustic features and the dialect embedding features into a coding network, and outputs a feature sequence corresponding to the voice data to be recognized, wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks; and generates a recognition result corresponding to the voice data to be recognized according to the feature sequence. In this way, the voice recognition device can acquire the voice data to be recognized, determine the corresponding voice acoustic features and dialect embedding features, and input them into the coding network to obtain the feature sequence corresponding to the voice data to be recognized. Because voice data of different dialects can be processed by specific private expert networks, the accuracy of voice recognition is improved while terminal performance is preserved; at the same time, the shared expert network processes the acoustic coding features corresponding to the voice acoustic features and models the common features among different dialects, which further improves the accuracy of voice recognition.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
Example 1
The embodiment of the application provides a voice recognition method. FIG. 1 is a first schematic diagram of the voice recognition method provided by the embodiment of the application; as shown in FIG. 1, the method for performing voice recognition by a terminal may comprise the following steps:
step 101, obtaining voice data to be recognized; wherein the voice data to be recognized is voice data including at least one dialect.
In the embodiment of the application, the voice recognition device may first acquire the voice data to be recognized. The voice data to be recognized may be voice data including at least one dialect, for example, voice data including the Northern dialect, the Wu dialect, the Xiang (Hunan) dialect, the Hakka dialect, and the Min dialect; the specific dialects included in the voice data to be recognized are not limited in the present application.
It should be noted that, in the embodiment of the present application, the voice recognition device may be any device having a storage function, and the voice recognition device includes, but is not limited to, a mobile phone, a personal computer (Personal Computer, PC), a tablet computer, a wearable smart device, a smart camera, and the like.
It should be noted that, in the embodiment of the present application, the manner of obtaining the voice data to be recognized may be various, for example, the voice data may be obtained in real time through a voice collecting module in the voice recognition device, may be obtained from the voice data stored in advance in the voice recognition device, may also be obtained through an application in the voice recognition device, and the manner of obtaining the voice data to be recognized is not limited in particular.
Step 102, determining voice acoustic features and dialect embedding features corresponding to voice data to be recognized.
In the embodiment of the application, after the voice recognition device acquires the voice data to be recognized, the voice acoustic characteristics and the dialect embedding characteristics corresponding to the voice data to be recognized can be determined.
Further, in the embodiment of the present application, when determining the speech acoustic feature and the dialect embedding feature corresponding to the speech data to be recognized, the speech recognition device may first obtain the speech acoustic feature corresponding to the speech data to be recognized; and then, extracting the characteristics of the voice acoustic characteristics to obtain dialect embedded characteristics corresponding to the voice data to be recognized.
Further, in the embodiment of the present application, the voice recognition device may extract the voice acoustic features corresponding to the voice data to be recognized. The features may be generated directly through operations such as framing, time-frequency conversion, and filtering; commonly used features include Mel-scale Frequency Cepstral Coefficients (MFCC) and filter-bank features (Fbank). The extraction method is not specifically limited in the present application.
Further, in the embodiment of the present application, after the voice recognition device obtains the voice acoustic feature corresponding to the voice data to be recognized, feature extraction may be further performed on the voice acoustic feature to obtain the dialect embedded feature corresponding to the voice data to be recognized.
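As an illustrative aid (not part of the patent), the two features of step 102 can be sketched in a few lines of Python. The sketch below assumes torchaudio for Fbank extraction and a hypothetical pretrained dialect identifier module; the function name, the 80-dimensional Fbank setting, and the embedding size are illustrative assumptions.

```python
import torch
import torchaudio

def extract_features(wav_path: str, dialect_identifier: torch.nn.Module):
    """Return the speech acoustic features and the dialect embedding feature."""
    waveform, sample_rate = torchaudio.load(wav_path)
    # Speech acoustic features: 80-dim Fbank vectors, one per frame -> (T, 80)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate)
    # Dialect embedding feature: the pretrained identifier maps the acoustic
    # features to one high-dimensional utterance-level vector, e.g. (256,)
    with torch.no_grad():
        dialect_emb = dialect_identifier(fbank.unsqueeze(0)).squeeze(0)
    return fbank, dialect_emb
```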
Step 103, inputting the voice acoustic features and the dialect embedding features into a coding network to obtain a feature sequence corresponding to the voice data to be recognized; the coding network comprises at least one layer of encoder, wherein the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks.
It should be noted that, in the embodiment of the present application, the voice recognition device may dynamically select the corresponding private expert network through the weight value output by the gating network to process the acoustic coding feature corresponding to the voice data to be recognized, that is, may process the voice data including different dialects through a specific private expert network.
It should be noted that, in the embodiment of the present application, the speech recognition device may model the common feature between different dialects through the shared expert network, thereby improving the accuracy of speech recognition.
In the embodiment of the application, after determining the voice acoustic features and the dialect embedding features corresponding to the voice data to be recognized, the voice recognition device can input the voice acoustic features and the dialect embedding features into the coding network to obtain the feature sequence corresponding to the voice data to be recognized; the coding network comprises at least one layer of encoder, wherein the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks.
It should be noted that, in the embodiment of the present application, FIG. 2 is a first schematic diagram of an encoder according to the embodiment of the present application. As shown in FIG. 2, the encoder includes a gating network, a shared expert network, and n private expert networks; the number n of private expert networks is not specifically limited in the present application.
It should be noted that, in the embodiment of the present application, the shared expert network and the private expert network may be the same neural network or may be different neural networks, where the neural networks may refer to two-layer feedforward neural networks (feedforward neural network, FNN) or other neural networks, such as deep neural networks (Deep neural network, DNN), and the category of the neural networks is not specifically limited in the present application.
It should be noted that, in the embodiment of the present application, FIG. 3 is a second schematic diagram of an encoder according to the embodiment of the present application. As shown in FIG. 3, the encoder includes a gating network, a shared expert network, a plurality of private expert networks, and a non-expert network layer.
It should be noted that, in the embodiment of the present application, the non-expert network layer may be formed by adopting a self-attention mechanism and normalization with residual error, or may be other neural networks, for example, long short-term memory (LSTM), and the structure of the non-expert network layer is not specifically limited in the present application.
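To make the encoder structure of FIG. 2 and FIG. 3 concrete, the following PyTorch sketch shows one encoder layer with a non-expert network layer (self-attention with a residual connection and normalization), a gating network, n private expert networks, and one shared expert network. It is a minimal sketch under stated assumptions: the two-layer feed-forward experts, all dimensions, and the mixing coefficient are illustrative choices consistent with the text, not the patent's definitive implementation; the comments refer to formulas (1) to (3) reconstructed later in this section.

```python
import torch
import torch.nn as nn

class FeedForwardExpert(nn.Module):
    """Two-layer feed-forward network, used for both private and shared experts."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class MoEEncoderLayer(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_heads=4,
                 n_private=4, d_dialect=256, lam=0.5):
        super().__init__()
        # Non-expert network layer: self-attention with residual and LayerNorm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Gating network: scores the n private experts from the spliced input
        self.gate = nn.Linear(d_model + d_dialect, n_private)
        self.private = nn.ModuleList(
            FeedForwardExpert(d_model, d_ff) for _ in range(n_private))
        self.shared = FeedForwardExpert(d_model, d_ff)
        self.lam = lam  # weighting coefficient for the shared expert output

    def forward(self, x, dialect_emb):
        # Acoustic coding features from the non-expert network layer: (B, T, d_model)
        o = self.norm(x + self.attn(x, x, x)[0])
        # Formula (1): per-frame expert weights from [acoustic; dialect] splicing
        e = dialect_emb.unsqueeze(1).expand(-1, o.size(1), -1)
        g = torch.softmax(self.gate(torch.cat([o, e], dim=-1)), dim=-1)
        # Formula (2): weighted sum over private experts (top-k routing also works)
        y_private = sum(g[..., i:i + 1] * exp(o)
                        for i, exp in enumerate(self.private))
        # Formula (3): mix in the shared expert output
        return self.lam * self.shared(o) + y_private
```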
Further, in the embodiment of the present application, when the speech recognition device inputs the speech acoustic feature and the dialect embedding feature into the coding network and outputs the feature sequence corresponding to the speech data to be recognized, the speech acoustic feature and the dialect embedding feature may first be input into the first layer encoder to determine the feature output sequence of the first layer encoder; the feature output sequence of the (N-1)-th layer encoder and the dialect embedding feature are then input into the N-th layer encoder to determine the feature output sequence of the N-th layer encoder, where N is an integer greater than or equal to 2; and the feature output sequence of the last layer encoder is determined as the feature sequence.
In the embodiment of the present application, the structure of each layer of encoder is the same.
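Since every layer has the same structure, stacking them is straightforward. The sketch below (reusing the MoEEncoderLayer above, with an assumed layer count) shows how each encoder layer consumes the previous layer's feature output sequence together with the same dialect embedding feature:

```python
import torch.nn as nn

class CodingNetwork(nn.Module):
    """Stack of identical MoE encoder layers, as described for step 103."""
    def __init__(self, n_layers: int = 12, **layer_kwargs):
        super().__init__()
        self.layers = nn.ModuleList(
            MoEEncoderLayer(**layer_kwargs) for _ in range(n_layers))

    def forward(self, speech_features, dialect_emb):
        y = speech_features
        for layer in self.layers:       # layer N consumes layer N-1's output
            y = layer(y, dialect_emb)   # the dialect embedding is reused each layer
        return y                        # feature sequence of the last encoder layer
```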
Further, in the embodiment of the present application, the speech recognition device inputs the speech acoustic feature and the dialect embedded feature into the first layer encoder, and when determining the output sequence of the first layer encoder, the non-expert network layer may determine the acoustic coding feature corresponding to the speech acoustic feature; inputting the acoustic coding features and the dialect embedding features into a gating network, thereby determining a plurality of weight values corresponding to a plurality of private expert networks, and generating an output sequence of the private expert networks according to the plurality of weight values and the acoustic coding features; inputting the acoustic coding characteristics into a shared expert network, and determining an output sequence of the shared expert network; and determining a characteristic output sequence according to the output sequence of the private expert network and the output sequence of the shared expert network.
It should be noted that, in the embodiment of the present application, the voice recognition device determines, through the non-expert network layer, the acoustic coding feature corresponding to the voice acoustic feature, which may mean that the non-expert network layer performs up-dimensional coding on the voice acoustic feature to obtain the acoustic coding feature corresponding to the voice acoustic feature.
Further, in the embodiment of the application, the acoustic coding feature and the dialect embedding feature are input to the gating network, and when a plurality of weight values corresponding to a plurality of private expert networks are determined, the acoustic coding feature and the dialect embedding feature can be subjected to splicing treatment to obtain a spliced feature; and then inputting the spliced characteristics into a gating network, so that a plurality of weight values corresponding to a plurality of private expert networks can be obtained.
In an exemplary embodiment of the present application, the input of the gating network is the vector formed by splicing the acoustic coding feature and the dialect embedding feature. When the number of private expert networks in each layer of encoder is n, the output of the gating network is a vector of n probability values, which correspond to the weight values of the private expert networks. The input-output relationship of the gating network can be expressed by the following formula:

$$g^{l} = \mathrm{Softmax}\left(W_{g}^{l}\,[o^{l-1}; e]\right) \tag{1}$$

where $e$ is the dialect embedding feature, $o^{l-1}$ is the acoustic coding feature output by the non-expert network of the upper layer, $l$ denotes the $l$-th layer encoder, $W_{g}^{l}$ is the learnable parameter of the layer-$l$ gating network, and $g^{l}$ is the output of the gating network.
For example, assume that 4 private expert networks are provided in the first layer encoder, namely private expert network 1, private expert network 2, private expert network 3, and private expert network 4. If the output vector of the gating network is {0.6, 0.2, 0.1, 0.1}, each probability value corresponds to the weight value of a different private expert network: the weight value corresponding to private expert network 1 is 0.6, that of private expert network 2 is 0.2, that of private expert network 3 is 0.1, and that of private expert network 4 is 0.1. The number of private expert networks is not specifically limited in the present application.
It should be noted that, in the embodiment of the present application, the speech recognition device uses the vector formed by splicing the acoustic coding feature and the dialect embedding feature as the input of the gating network. This guides the gating network to select the corresponding private expert networks more accurately; the dialect embedding feature provides the gating network with a distinguishing feature representation, so that the weight values it outputs for the private expert networks are more differentiated, and the acoustic coding features corresponding to a specific type of speech data are processed by specific private experts.
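A tiny numerical sketch (with made-up shapes and random stand-in parameters) illustrates this point: because the gate scores come from the spliced [acoustic; dialect] vector, two frames with identical acoustics but different dialect embeddings can be routed to different private experts.

```python
import torch

torch.manual_seed(0)
n_experts, d_model, d_dialect = 4, 256, 64
W = torch.randn(n_experts, d_model + d_dialect)  # stand-in for the gate weights W_g^l
o = torch.randn(d_model)                         # one frame's acoustic coding feature
for e in (torch.zeros(d_dialect), torch.ones(d_dialect)):  # two mock dialect embeddings
    g = torch.softmax(W @ torch.cat([o, e]), dim=0)        # formula (1) for one frame
    print(g)  # four probabilities, one weight per private expert network
```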
It should be noted that, in the embodiment of the present application, after determining a plurality of weight values corresponding to a plurality of private expert networks, the voice recognition device may generate an output sequence of the private expert network according to the plurality of weight values and the acoustic coding feature.
It should be noted that, in the embodiment of the present application, when the output sequence of the private expert network is generated according to the plurality of weight values and the acoustic coding feature, a target private expert network may first be determined from the plurality of private expert networks according to the plurality of weight values; the acoustic coding feature is then input into the target private expert network to obtain an original feature sequence; and the original feature sequence is weighted according to the weight corresponding to the target private expert network to determine the output sequence of the private expert network.
For example, assume that 3 private expert networks are provided in the first layer encoder, namely private expert network 1, private expert network 2, and private expert network 3, and that the output vector of the gating network is {0.6, 0.2, 0.1}, that is, the weight values corresponding to private expert networks 1, 2, and 3 are 0.6, 0.2, and 0.1 respectively. When determining the target private expert network from the plurality of private expert networks according to the weight values, the voice recognition device may select the private expert network with the highest weight value as the target private expert network (here, private expert network 1 with weight value 0.6), may select the two private expert networks with the highest weight values as target private expert networks, or may select all private expert networks as target private expert networks; the number of target private expert networks is not specifically limited in the present application.
Further, in the embodiment of the present application, the voice recognition device may input the acoustic coding feature into the target private expert network, which processes it to obtain the original feature sequence.
Further, in the embodiment of the present application, after the voice recognition device obtains the original feature sequence, the original feature sequence may be weighted according to the weight corresponding to the target private expert network in the first layer encoder, so as to determine the output sequence of the private expert network.
Illustratively, the original feature sequence may be multiplied by the weight corresponding to the target private expert network to determine the output sequence of the private expert network, which can be expressed by the following formula:

$$y_{p}^{l} = \sum_{i} g_{i}^{l}\, E_{i}^{l}(o^{l-1}) \tag{2}$$

where $E_{i}^{l}(o^{l-1})$ is the output (original feature sequence) of the layer-$l$ target private expert network, $i$ is the ID of the target private expert network, $g_{i}^{l}$ is the weight value corresponding to the layer-$l$ target private expert network, and $y_{p}^{l}$ is the output sequence of the layer-$l$ private expert network.
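The selection-and-weighting step can be sketched as follows; whether k is 1, 2, or n is a design choice the text leaves open, and the dense evaluate-then-mask form is chosen here for clarity rather than efficiency.

```python
import torch

def private_expert_output(o, g, experts, k=1):
    """Route each frame to its top-k private experts and weight their outputs.

    o: (B, T, d) acoustic coding features; g: (B, T, n) gate weights;
    experts: list of n private expert modules.
    """
    topk_w, topk_idx = torch.topk(g, k, dim=-1)
    y = torch.zeros_like(o)
    for j in range(k):
        idx = topk_idx[..., j]                  # (B, T) selected expert IDs
        w = topk_w[..., j:j + 1]                # (B, T, 1) matching weights
        for i, expert in enumerate(experts):
            mask = (idx == i).unsqueeze(-1)     # frames routed to expert i
            # every expert runs on all frames; frames not routed to it are masked out
            y = y + torch.where(mask, w * expert(o), torch.zeros_like(o))
    return y
```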
It should be noted that, in the embodiment of the present application, the voice recognition device selects the corresponding private expert networks according to the weight values output by the gating network and then processes the acoustic coding features corresponding to the voice data of different dialects, so that a specific private expert network processes the acoustic coding features corresponding to a specific type of voice data, thereby improving the accuracy of voice recognition.
Further, in an embodiment of the present application, after determining the output sequence of the private expert network, the voice recognition apparatus determines the feature output sequence according to the output sequence of the private expert network and the output sequence of the shared expert network.
Specifically, the voice recognition device determines the feature output sequence according to the output sequence of the private expert network and the output sequence of the shared expert network, which can be expressed by the following formula:

$$y^{l} = \lambda\, y_{s}^{l} + y_{p}^{l} \tag{3}$$

where $y_{s}^{l}$ is the output sequence of the layer-$l$ shared expert network, $y_{p}^{l}$ is the output sequence of the layer-$l$ private expert network, $\lambda$ is the weighting coefficient applied to the shared expert network output (this coefficient may be an experimental parameter), and $y^{l}$ is the mixed output sequence (feature output sequence) of the $l$-th layer.
It should be noted that, in the embodiment of the present application, after obtaining the feature output sequence of the first layer, the speech recognition apparatus may send the feature output sequence of the first layer to the encoder of the next layer.
It should be noted that, in the embodiment of the present application, the voice recognition device determines the feature output sequence from the output sequence of the private expert network and the output sequence of the shared expert network, where the input of the shared expert network is the acoustic coding feature corresponding to the voice data to be recognized. For example, if the voice data to be recognized includes the Guizhou dialect and the Sichuan dialect, which have a certain similarity in pronunciation, the shared expert network can model the common features between the two dialects, further improving the accuracy of voice recognition, especially the recognition accuracy for low-resource dialects.
Further, in the embodiment of the present application, when the speech recognition device inputs the output sequence of the (N-1)-th layer encoder and the dialect embedding feature into the N-th layer encoder to determine the output sequence of the N-th layer encoder, the acoustic coding feature corresponding to the output sequence of the (N-1)-th layer encoder is first determined through the non-expert network layer; the acoustic coding feature and the dialect embedding feature are then input into the gating network to determine the plurality of weight values corresponding to the plurality of private expert networks, and the output sequence of the private expert networks is generated according to the weight values and the acoustic coding feature; at the same time, the acoustic coding feature is input into the shared expert network to determine the output sequence of the shared expert network; finally, the feature output sequence is determined according to the output sequence of the private expert networks and the output sequence of the shared expert network.
For example, when N equals 2, the speech recognition device inputs the output sequence of the first layer encoder and the dialect embedding feature into the second layer encoder and determines the output sequence of the second layer encoder in the same way as in the first layer encoder: the weight values corresponding to the private expert networks are determined as in formula (1), the output sequence of the private expert networks is generated as in formula (2), and the feature output sequence is determined from the output sequences of the private and shared expert networks as in formula (3). The same holds for every N-th layer encoder.
And 104, generating a recognition result corresponding to the voice data to be recognized according to the feature sequence.
In the embodiment of the application, after the voice recognition device obtains the feature output sequence of the last layer of encoder, namely the feature sequence, the voice recognition device can generate a recognition result corresponding to the voice data to be recognized according to the feature sequence.
It should be noted that, in the embodiment of the present application, the voice recognition device may input the feature sequence to the decoder to obtain the recognition result corresponding to the voice data to be recognized; i.e. obtaining text information corresponding to the speech data comprising at least one dialect.
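Wiring the pieces together gives the full recognition path of steps 101 to 104. In the sketch below, which reuses the CodingNetwork above, the input projection and the decoder are placeholders: the patent fixes neither, requiring only that the decoder map the final feature sequence to text.

```python
import torch.nn as nn

class DialectASR(nn.Module):
    """Feature sequence from the coding network, recognition result from the decoder."""
    def __init__(self, encoder: CodingNetwork, decoder: nn.Module,
                 d_in: int = 80, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(d_in, d_model)  # lift Fbank frames to the model width
        self.encoder = encoder                # stack of MoE encoder layers (step 103)
        self.decoder = decoder                # e.g. an attention or CTC decoder (step 104)

    def forward(self, fbank, dialect_emb):
        feats = self.encoder(self.proj(fbank), dialect_emb)  # feature sequence
        return self.decoder(feats)            # text prediction for the dialect speech
```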
Therefore, the voice recognition device can dynamically select specific private expert networks through the weight values output by the gating network in the coding network to process voice data to be recognized that includes several dialects, without building a dedicated voice recognition system for each dialect, which improves the accuracy of voice recognition while preserving terminal performance. At the same time, the shared expert network processes the acoustic coding features corresponding to the voice data to be recognized and models the common features among different dialects, which can further improve the accuracy of voice recognition.
The embodiment of the application provides a voice recognition method, which comprises the following steps: the voice recognition device acquires voice data to be recognized, wherein the voice data to be recognized is voice data comprising at least one dialect; determines voice acoustic features and dialect embedding features corresponding to the voice data to be recognized; inputs the voice acoustic features and the dialect embedding features into a coding network, and outputs a feature sequence corresponding to the voice data to be recognized, wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks; and generates a recognition result corresponding to the voice data to be recognized according to the feature sequence. In this way, the voice recognition device can acquire the voice data to be recognized, determine the corresponding voice acoustic features and dialect embedding features, and input them into the coding network to obtain the feature sequence corresponding to the voice data to be recognized. Because voice data of different dialects can be processed by specific private expert networks, the accuracy of voice recognition is improved while terminal performance is preserved; at the same time, the shared expert network processes the acoustic coding features corresponding to the voice acoustic features and models the common features among different dialects, which further improves the accuracy of voice recognition.
Example two
Based on the above embodiment, still another embodiment of the present application provides a speech recognition method that can be applied to a system (speech recognition model) with multi-dialect speech recognition capability. Based on the idea of the mixture of experts (MoE), it introduces different expert networks (private expert networks), where each expert (private expert network) can be understood as handling a specific task, and different tasks correspond to different dialect categories. The speech recognition method proposed in this embodiment is further described below.
It should be noted that, in the embodiment of the present application, FIG. 4 is a second schematic diagram of a voice recognition method according to the embodiment of the present application. As shown in FIG. 4, the system includes an acoustic feature calculation module, a dialect identifier, a coding network, and a decoder; the coding network may include L layers of encoders, where each encoder includes a non-expert network layer in addition to a gating network, a shared expert network, and a plurality of private expert networks, the number of private experts in each layer being n.
It should be noted that, in the embodiment of the present application, the system (speech recognition model) for multi-dialect speech recognition mainly includes the following modules. (1) An acoustic feature calculation module: this module extracts the acoustic features of the target voice data (the voice data to be recognized), mainly by direct generation through operations such as framing, time-frequency conversion, and filtering; commonly used features include Mel-frequency cepstral coefficients and Fbank features. (2) A dialect identifier module: this module extracts the dialect embedding feature and consists of a pretrained dialect identifier. The voice acoustic features are passed through the dialect identifier to obtain a high-dimensional dialect embedding; this feature (the dialect embedding feature) is spliced with the acoustic coding feature and then fed into the gating network. (3) A speech recognition network: this network outputs the text prediction result for the target voice (the voice data to be recognized). The application adopts a sequence-based end-to-end network structure; a complete speech recognition network generally comprises two parts: an encoding module (the coding network), which models acoustic information by encoding the voice acoustic features of each frame into higher-dimensional features and is composed of multiple encoders with the same structure, and a decoding module, which models language information and decodes the output of the encoding module into the corresponding text.
Further, in the embodiment of the present application, each layer of encoder has the same network structure and mainly includes the following two modules. (1) A non-mixture-of-experts (non-MoE) module (the non-expert network layer). This module encodes the calculated acoustic feature sequence of the voice into higher-dimensional features to model acoustic information in high dimensions; the network structure is not limited by the application, and the non-expert layer may, for example, be formed by a self-attention mechanism and normalization with a residual connection. (2) An improved mixture-of-experts (MoE) module. This module consists of a gating network, a plurality of private experts (private expert networks), and a shared expert (shared expert network). The basic idea of MoE is that, during training and inference, the gating network selects which experts (private expert networks) the current input needs to activate and determines the weight of each expert (private expert network). In general, the expert networks (private expert networks) here may be conventional neural networks, and the present application does not specifically limit the type of private expert network.
It should be noted that, in the embodiment of the present application, the idea of MoE (the mixture-of-experts module) is applied to the multi-dialect recognition task with targeted improvements. First, acoustic and linguistic differences exist between different dialects, and a single model fed with different types of dialects cannot model these differences well, so different expert networks are adopted to model different types of dialects; these are called private expert networks. On the other hand, knowledge among different dialects is related, and a fully independent modeling approach ignores this correlation, making cross-language knowledge transfer difficult; in particular, low-resource dialect data is difficult to model efficiently. The shared expert network is therefore introduced to model the common characteristics of different dialects. For different types of dialect input data, the gating network dynamically selects the corresponding private expert models (private expert networks) according to the input, and to guide the gating network to select the corresponding private experts (private expert networks) more accurately, the dialect embedding feature is fed into the gating network as part of its input.
Further, in the embodiment of the present application, the target voice (the voice data to be recognized) is data in which multiple dialects are mixed. First, acoustic feature calculation is performed on the target voice to obtain the corresponding acoustic features, which are then fed into the non-MoE layer (the non-expert network layer) for high-dimensional encoding to obtain the acoustic coding features of the layer. The acoustic coding features are fed into a series of private expert sub-models (private expert networks) and a shared expert sub-model (shared expert network). The application sets an L-layer encoder, and each layer of encoder is provided with n private experts (private expert networks), one shared expert (shared expert network), and one gating network. The gating network determines the output weights of the corresponding private experts (private expert networks), and the dimension of its output vector equals the number of private expert networks. In the application, the input vector of the gating network is the vector formed by splicing the dialect embedding feature and the acoustic coding feature; the input-output relationship of the gating network can be expressed by formula (1).
Further, in an embodiment of the present application, the output of the gating network is a vector of n probability values, where n equals the number of private expert networks, and each value represents the probability that the corresponding private expert network is selected. Assume that a layer of encoder is provided with 4 private expert networks, namely private expert 1, private expert 2, private expert 3, and private expert 4, and that the output vector of the gating network is {0.6, 0.2, 0.1, 0.1}; each probability value corresponds to the weight of a different selected private expert network. The application does not limit the number of selected private experts: for example, the one expert model (private expert network) with the highest probability may be selected from the 4 private expert models as the target sub-model (target private expert network) for processing the acoustic coding feature, the two expert models (private expert networks) with the highest probabilities may be selected as target sub-models (target private expert networks), or all private expert models (private expert networks) may be selected as target sub-models (target private expert networks).
Further, in the embodiment of the present application, the voice recognition device selects the private expert network corresponding to the highest probability value output by the gating network. After the private expert model (private expert network) is selected, the acoustic coding feature is sent to the selected private expert sub-model (private expert network), and the output sequence of the private expert sub-model (private expert network) is multiplied by the corresponding weight value to obtain the output of the mixed private expert model (the output sequence of the private expert network), as shown in formula (2).
It should be noted that, in the embodiment of the present application, the voice recognition device selects the corresponding private expert networks according to the weight values output by the gating network and then processes the acoustic coding features corresponding to the voice data of different dialects, so that a specific private expert network processes the acoustic coding features corresponding to a specific type of voice data, thereby improving the accuracy of voice recognition.
Further, in an embodiment of the present application, the speech recognition apparatus may introduce a shared expert sub-model (shared expert network) in each layer of encoder to model the commonality among different dialects. Some types of dialect data, such as the Guizhou dialect and the Sichuan dialect, have a certain similarity in pronunciation, and even dialects with larger pronunciation differences, such as Cantonese and Mandarin, have a certain linguistic similarity; therefore, uniformly modeling the input dialects with a shared expert network can effectively improve the modeling capability of the multi-dialect network. In addition, some dialects have scarce data and expensive labeling costs, and a model is difficult to train effectively on a small amount of data, so related knowledge needs to be drawn from resource-rich voice data; introducing a shared expert network can effectively improve the recognition performance for low-resource dialects and thereby the accuracy of voice recognition.
It should be noted that, in the embodiment of the present application, the output of the shared expert sub-model (shared expert network) and the output of the private expert network are weighted to obtain the mixed output sequence (feature output sequence), which is sent to the next layer encoder, as shown in formula (3).
It should be noted that, in the embodiment of the present application, the voice recognition device determines the feature output sequence from the output sequence of the private expert network and the output sequence of the shared expert network, where the input of the shared expert network is the acoustic coding feature corresponding to the voice data to be recognized. For example, if the voice data to be recognized includes the Yunnan dialect and the Sichuan dialect, which have a certain similarity in pronunciation, the shared expert network can model the common features between the two dialects, further improving the accuracy of voice recognition, especially the recognition accuracy for low-resource dialects.
It should be noted that, in the embodiment of the present application, after the voice recognition device obtains the feature output sequence of the last layer encoder, that is, the feature sequence, the voice recognition device may generate the recognition result corresponding to the voice data to be recognized according to the feature sequence.
It should be noted that, in the embodiment of the present application, the voice recognition device may input the feature sequence to the decoder to obtain the recognition result corresponding to the voice data to be recognized; i.e. obtaining text information corresponding to the speech data comprising at least one dialect.
It should be noted that, in the embodiment of the present application, the speech recognition device splices the dialect embedding feature and the acoustic coding feature and sends them to the gating network; the reasons for this design are twofold. (1) The acoustic coding features output by the non-MoE layer (the non-expert network layer) carry no dialect category information, so on their own the output probabilities of the gating network would not be differentiated; moreover, the acoustic coding features differ from layer to layer, so the gating outputs of different layers would differ greatly, that is, different layers of encoder would select different private expert networks for the same input data. (2) For different types of dialect input data, the dialect embedding feature provides a distinguishing representation, guiding the gating networks to output more differentiated weight values; and since the dialect embedding feature fed to the gating network is the same across coding layers, the gating networks of different coding layers output similar weight values. By introducing the dialect embedding feature to guide the gating network, a specific private expert network processes a specific type of input data.
The embodiment of the application provides a voice recognition method, which comprises the following steps: the voice recognition device acquires voice data to be recognized, wherein the voice data to be recognized is voice data comprising at least one dialect; determines voice acoustic features and dialect embedding features corresponding to the voice data to be recognized; inputs the voice acoustic features and the dialect embedding features into a coding network, and outputs a feature sequence corresponding to the voice data to be recognized, wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks; and generates a recognition result corresponding to the voice data to be recognized according to the feature sequence. In this way, the voice recognition device can acquire the voice data to be recognized, determine the corresponding voice acoustic features and dialect embedding features, and input them into the coding network to obtain the feature sequence corresponding to the voice data to be recognized. Because voice data of different dialects can be processed by specific private expert networks, the accuracy of voice recognition is improved while terminal performance is preserved; at the same time, the shared expert network processes the acoustic coding features corresponding to the voice acoustic features and models the common features among different dialects, which further improves the accuracy of voice recognition.
Example III
Based on the above embodiments, the embodiment of the present application provides a voice recognition device, fig. 5 is a schematic diagram of the composition structure of the voice recognition device, and as shown in fig. 5, the voice recognition device 10 includes: an acquisition unit 11, a determination unit 12, a generation unit 13;
the acquiring unit 11 is configured to acquire voice data to be recognized; wherein the voice data to be recognized comprises at least one dialect;
the determining unit 12 is configured to determine a voice acoustic feature and a dialect embedding feature corresponding to the voice data to be recognized;
the obtaining unit 11 is further configured to input the voice acoustic feature and the dialect embedding feature to a coding network, and obtain a feature sequence corresponding to the voice data to be identified; the encoding network comprises at least one layer of encoder, wherein the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks;
the generating unit 13 is configured to generate a recognition result corresponding to the voice data to be recognized according to the feature sequence.
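Purely as an illustration of how the units of fig. 5 could map onto an implementation, the following sketch is offered; the method names and the collaborator objects passed to the constructor are invented for this example and are not specified by the embodiment:

class SpeechRecognitionDevice:
    def __init__(self, feature_extractor, dialect_identifier, encoder, decoder):
        self.feature_extractor = feature_extractor  # backs the determining unit
        self.dialect_identifier = dialect_identifier
        self.encoder = encoder                      # backs the acquiring unit
        self.decoder = decoder                      # backs the generating unit

    def recognize(self, speech_data):
        acoustic = self.feature_extractor(speech_data)     # voice acoustic features
        dialect_emb = self.dialect_identifier(acoustic)    # dialect embedding feature
        feature_seq = self.encoder(acoustic, dialect_emb)  # feature sequence
        return self.decoder(feature_seq)                   # recognition result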
In an embodiment of the present application, further, fig. 6 is a second schematic diagram of the composition structure of the voice recognition device. As shown in fig. 6, the voice recognition device 10 of the embodiment of the present application may further include a processor 14 and a memory 15 storing instructions executable by the processor 14; further, the voice recognition device 10 may also include a communication interface 16, and a bus 17 for connecting the processor 14, the memory 15 and the communication interface 16.
In an embodiment of the present application, the processor 14 may be at least one of an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, and a microprocessor. It will be appreciated that, for different devices, other electronics may be used to implement the above-described processor functions, and the embodiment of the present application is not particularly limited. The voice recognition device 10 may further comprise a memory 15, which may be connected to the processor 14; the memory 15 is used to store executable program code, including computer operating instructions, and may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, e.g. at least two disk memories.
In an embodiment of the application, the bus 17 is used to connect the communication interface 16, the processor 14 and the memory 15, and to support communication among these devices.
In an embodiment of the application, the memory 15 is used for storing instructions and data.
Further, in the embodiment of the present application, the processor 14 is configured to acquire the voice data to be recognized; wherein the voice data to be recognized is voice data comprising at least one dialect;
determining voice acoustic features and dialect embedding features corresponding to the voice data to be recognized;
inputting the voice acoustic features and the dialect embedding features into the coding network to obtain a feature sequence corresponding to the voice data to be recognized; wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks;
and generating a recognition result corresponding to the voice data to be recognized according to the feature sequence.
In practical applications, the memory 15 may be a volatile memory (volatile memory), such as a random-access memory (Random-Access Memory, RAM); or a non-volatile memory (non-volatile memory), such as a read-only memory (Read-Only Memory, ROM), a flash memory (flash memory), a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid State Drive, SSD); or a combination of the above types of memories, and provides instructions and data to the processor 14.
The embodiment of the application provides a voice recognition device, which acquires voice data to be recognized, wherein the voice data to be recognized is voice data comprising at least one dialect; determines voice acoustic features and dialect embedding features corresponding to the voice data to be recognized; inputs the voice acoustic features and the dialect embedding features into the coding network, and outputs the feature sequence corresponding to the voice data to be recognized, wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks; and generates a recognition result corresponding to the voice data to be recognized according to the feature sequence. In this way, the voice recognition device can process voice data of different dialects through specific private expert networks, improving the accuracy of voice recognition while ensuring the performance of the terminal; meanwhile, the shared expert network processes the acoustic coding features corresponding to the voice acoustic features and models the features common to different dialects, further improving the accuracy of voice recognition.
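The layer-wise flow described above, in which the feature output sequence of the N-1 layer encoder and the same dialect embedding feature feed the N layer encoder, can be sketched with the MoEEncoderLayer above; the dimensions and the layer count here are invented for the example:

encoder_layers = nn.ModuleList(
    MoEEncoderLayer(dim=256, dialect_dim=64, num_private_experts=4) for _ in range(6)
)

def encode(acoustic_feats: torch.Tensor, dialect_emb: torch.Tensor) -> torch.Tensor:
    seq = acoustic_feats                  # (batch, time, 256) voice acoustic features
    for layer in encoder_layers:          # feature output of layer N-1 feeds layer N
        seq = layer(seq, dialect_emb)     # the dialect embedding is identical at each layer
    return seq                            # feature sequence handed to the decoder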
An embodiment of the present application provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the speech recognition method as described above.
Specifically, the program instructions corresponding to a voice recognition method in the present embodiment may be stored on a storage medium such as an optical disc, a hard disk, or a USB flash drive. When the program instructions corresponding to the voice recognition method in the storage medium are read or executed by an electronic device, the method includes the following steps:
acquiring voice data to be recognized; wherein the voice data to be recognized is voice data comprising at least one dialect;
determining voice acoustic features and dialect embedding features corresponding to the voice data to be recognized;
inputting the voice acoustic features and the dialect embedding features into the coding network to obtain a feature sequence corresponding to the voice data to be recognized; wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks;
and generating a recognition result corresponding to the voice data to be recognized according to the feature sequence.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block and/or flow of the flowchart illustrations and/or block diagrams, and combinations of blocks and/or flow diagrams in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application.

Claims (13)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized; wherein the voice data to be recognized is voice data comprising at least one dialect;
determining voice acoustic features and dialect embedding features corresponding to the voice data to be recognized;
inputting the voice acoustic features and the dialect embedding features into a coding network to obtain a feature sequence corresponding to the voice data to be recognized; wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks;
and generating a recognition result corresponding to the voice data to be recognized according to the feature sequence.
2. The method of claim 1, wherein each of the at least one layer of encoder comprises a non-expert network layer.
3. The method according to claim 2, wherein the inputting the voice acoustic features and the dialect embedding features into the coding network to obtain the feature sequence corresponding to the voice data to be recognized comprises:
inputting the voice acoustic features and the dialect embedding features into a first layer encoder, and determining a feature output sequence of the first layer encoder;
inputting the feature output sequence of the N-1 layer encoder and the dialect embedding features to the N layer encoder, and determining the feature output sequence of the N layer encoder; wherein N is an integer greater than or equal to 2;
and determining the feature output sequence of the last layer of encoder as the feature sequence.
4. The method according to claim 3, wherein the inputting the voice acoustic features and the dialect embedding features into the first layer encoder, and determining the feature output sequence of the first layer encoder, comprises:
determining acoustic coding features corresponding to the voice acoustic features through the non-expert network layer;
inputting the acoustic coding features and the dialect embedding features into the gating network, determining a plurality of weight values corresponding to the plurality of private expert networks, and generating an output sequence of the private expert network according to the plurality of weight values and the acoustic coding features;
inputting the acoustic coding features into the shared expert network, and determining an output sequence of the shared expert network;
and determining the feature output sequence according to the output sequence of the private expert network and the output sequence of the shared expert network.
5. The method according to claim 3, wherein the inputting the feature output sequence of the N-1 layer encoder and the dialect embedding features to the N layer encoder, and determining the feature output sequence of the N layer encoder, comprises:
determining acoustic coding features corresponding to the feature output sequence of the N-1 layer encoder through the non-expert network layer;
inputting the acoustic coding features and the dialect embedding features into the gating network, determining a plurality of weight values corresponding to the plurality of private expert networks, and generating an output sequence of the private expert network according to the plurality of weight values and the acoustic coding features;
inputting the acoustic coding features into the shared expert network, and determining an output sequence of the shared expert network;
and determining the feature output sequence according to the output sequence of the private expert network and the output sequence of the shared expert network.
6. The method of claim 4 or 5, wherein the inputting the acoustic coding features and the dialect embedding features into the gating network, and determining the plurality of weight values corresponding to the plurality of private expert networks, comprises:
performing splicing processing on the acoustic coding features and the dialect embedding features to obtain a spliced feature;
and inputting the spliced feature into the gating network to obtain the plurality of weight values corresponding to the plurality of private expert networks.
7. The method according to claim 4 or 5, wherein the generating the output sequence of the private expert network according to the plurality of weight values and the acoustic coding features comprises:
determining a target private expert network from the plurality of private expert networks according to the plurality of weight values;
inputting the acoustic coding features into the target private expert network to obtain an original feature sequence;
and performing weighting processing on the original feature sequence according to the weight value corresponding to the target private expert network, so as to determine the output sequence of the private expert network.
8. The method according to claim 4 or 5, wherein the determining the feature output sequence according to the output sequence of the private expert network and the output sequence of the shared expert network comprises:
and determining the feature output sequence according to the output sequence of the private expert network and the output sequence of the shared expert network.
9. The method of claim 1, wherein the determining the voice acoustic features and the dialect embedding features corresponding to the voice data to be recognized comprises:
inputting the voice data to be recognized into an acoustic feature calculation module to obtain the voice acoustic features corresponding to the voice data to be recognized;
and performing feature extraction on the voice acoustic features through a dialect identifier to obtain the dialect embedding features corresponding to the voice data to be recognized.
10. The method according to claim 1, wherein the generating the recognition result corresponding to the voice data to be recognized according to the feature sequence comprises:
and inputting the feature sequence to a decoder to obtain the recognition result corresponding to the voice data to be recognized.
11. A speech recognition device, characterized in that the speech recognition device comprises: an acquisition unit, a determination unit, and a generation unit;
the acquisition unit is used for acquiring voice data to be recognized; wherein the voice data to be recognized comprises at least one dialect;
the determining unit is used for determining voice acoustic features and dialect embedding features corresponding to the voice data to be recognized;
the acquisition unit is further used for inputting the voice acoustic features and the dialect embedding features to a coding network, and obtaining a feature sequence corresponding to the voice data to be recognized; wherein the coding network comprises at least one layer of encoder, and the at least one layer of encoder comprises a gating network, a shared expert network and a plurality of private expert networks;
the generating unit is used for generating a recognition result corresponding to the voice data to be recognized according to the feature sequence.
12. A speech recognition device, characterized in that the speech recognition device comprises: a processor and a memory; wherein,
the memory is used for storing a computer program capable of running on the processor;
the processor is configured to perform the method of any one of claims 1-10 when the computer program is run.
13. A computer readable storage medium, characterized in that the storage medium has stored thereon computer program code which, when executed by a computer, performs the method of any one of claims 1-10.
CN202211466722.7A 2022-11-22 2022-11-22 Voice recognition method and device and storage medium Pending CN116913247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211466722.7A CN116913247A (en) 2022-11-22 2022-11-22 Voice recognition method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211466722.7A CN116913247A (en) 2022-11-22 2022-11-22 Voice recognition method and device and storage medium

Publications (1)

Publication Number Publication Date
CN116913247A true CN116913247A (en) 2023-10-20

Family

ID=88365440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211466722.7A Pending CN116913247A (en) 2022-11-22 2022-11-22 Voice recognition method and device and storage medium

Country Status (1)

Country Link
CN (1) CN116913247A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117457015A * 2023-10-27 2024-01-26 Shenzhen Technology University Single-channel voice enhancement method and system based on heterogeneous multiple experts

Similar Documents

Publication Publication Date Title
CN110444193B (en) Method and device for recognizing voice keywords
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN109785824B (en) Training method and device of voice translation model
CN106683677B (en) Voice recognition method and device
CN112102815B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
KR101056511B1 (en) Speech Segment Detection and Continuous Speech Recognition System in Noisy Environment Using Real-Time Call Command Recognition
CN110189749A (en) Voice keyword automatic identifying method
CN112259089B (en) Speech recognition method and device
WO2022252904A1 (en) Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product
EP4091163B1 (en) Adaptive frame batching to reduce speech recognition latency
Higuchi et al. Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection
CN113781995A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN113488029A (en) Non-autoregressive speech recognition training decoding method and system based on parameter sharing
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN111739506B (en) Response method, terminal and storage medium
CN116324973A (en) Transducer-based automatic speech recognition system including a time reduction layer
CN116913247A (en) Voice recognition method and device and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
Thukroo et al. Spoken language identification system for kashmiri and related languages using mel-spectrograms and deep learning approach
Soltau et al. Reducing the computational complexity for whole word models
CN113823265A (en) Voice recognition method and device and computer equipment
CN111563161A (en) Sentence recognition method, sentence recognition device and intelligent equipment
CN113628608A (en) Voice generation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination