CN117059104A - Speech recognition method, related device and medium - Google Patents

Speech recognition method, related device and medium

Info

Publication number
CN117059104A
Authority
CN
China
Prior art keywords
voice
length
feature
encoder
recognition
Prior art date
Legal status
Pending
Application number
CN202311038160.0A
Other languages
Chinese (zh)
Inventor
井博军
朱紫薇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311038160.0A
Publication of CN117059104A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure provides a voice recognition method, a related device, and a medium. The voice recognition method includes: performing first encoding on a target voice feature of a target voice to obtain a first intermediate feature, and backing up the first intermediate feature to a cloud server; acquiring a first recognition text based on the first intermediate feature; acquiring a first confidence of the first recognition text; and if the first confidence is higher than a first threshold, determining the first recognition text as the voice recognition result; otherwise, sending a voice recognition request to the cloud server so that the cloud server performs second encoding on the backed-up first intermediate feature, acquires a second recognition text based on the result of the second encoding, and determines the second recognition text as the voice recognition result. The embodiments of the present disclosure reduce the processing overhead of the cloud server and improve the security of object information while guaranteeing recognition accuracy. The embodiments can be applied to human-computer interaction voice recognition, voice input, and the like.

Description

Speech recognition method, related device and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a speech recognition method, related apparatus, and medium.
Background
In prior-art cloud voice recognition, object voice is uploaded to a cloud server through a client, the cloud server performs voice recognition, and the voice recognition result is sent back to the client for display. To ensure the final recognition effect, all inference computation is executed in the cloud, which places a heavy load on the cloud server and consumes considerable processing resources. Moreover, because the object voice is transmitted directly to the cloud server, the security of the object information is poor.
Disclosure of Invention
The embodiments of the disclosure provide a voice recognition method, a related device, and a medium, which can reduce the processing overhead of a cloud server and improve the security of object information while guaranteeing recognition accuracy.
According to an aspect of the present disclosure, there is provided a voice recognition method applied to a client, the voice recognition method including:
performing first encoding on a target voice feature of a target voice to obtain a first intermediate feature, and backing up the first intermediate feature to a cloud server;
acquiring a first recognition text based on the first intermediate feature;
acquiring a first confidence of the first recognition text;
and if the first confidence is higher than a first threshold, determining the first recognition text as a voice recognition result; otherwise, sending a voice recognition request to the cloud server so that the cloud server performs second encoding on the backed-up first intermediate feature, acquires a second recognition text based on the result of the second encoding, and determines the second recognition text as the voice recognition result.
According to an aspect of the present disclosure, there is provided a voice recognition method applied to a cloud server, the voice recognition method including:
receiving and storing a first intermediate feature sent by a client, wherein the first intermediate feature is obtained by the client performing first encoding on a target voice feature of a target voice;
receiving a voice recognition request from the client, wherein the voice recognition request is sent by the client when a first confidence is not higher than a first threshold, and the first confidence is the confidence of a first recognition text acquired by the client based on the first intermediate feature;
and performing second encoding on the stored first intermediate feature, acquiring a second recognition text based on the result of the second encoding, and determining the second recognition text as a voice recognition result.
According to an aspect of the present disclosure, there is provided a voice recognition apparatus including:
the first encoding unit is used for performing first encoding on a target voice feature of a target voice to obtain a first intermediate feature, and backing up the first intermediate feature to the cloud server;
a first text obtaining unit, configured to obtain a first recognition text based on the first intermediate feature;
the confidence acquisition unit is used for acquiring a first confidence of the first recognition text;
a first output unit configured to determine the first recognition text as a voice recognition result if the first confidence is higher than a first threshold;
and the voice recognition request unit is used for sending a voice recognition request to the cloud server if the first confidence is equal to or lower than the first threshold, so that the cloud server performs second encoding on the backed-up first intermediate feature, acquires a second recognition text based on the result of the second encoding, and determines the second recognition text as the voice recognition result.
Optionally, the first encoding unit is specifically configured to:
extracting the target voice feature from the target voice;
dividing the target voice feature into blocks;
performing first encoding on each block to obtain a block code;
and cascading the block codes in block order to obtain the first intermediate feature.
Optionally, the first encoding unit is specifically configured to:
dividing the target voice feature into blocks according to a first time window;
the voice recognition request unit is specifically configured to:
send a voice recognition request to the cloud server, so that the cloud server takes a plurality of block codes within a second time window as a group and performs second encoding on the group to obtain a group code, wherein the second time window comprises a plurality of first time windows, and the group codes are cascaded in group order to obtain the result of the second encoding.
Optionally, the first encoding unit is specifically configured to:
for each block code in the first intermediate feature, after the block code is generated, sending the block code to the cloud server for caching;
the taking a plurality of the block codes within a second time window as a group and performing the second encoding on the group comprises: when the second time window ends, taking out the plurality of block codes that correspond to the first time windows within the second time window and are cached in the cloud server, and performing the second encoding on them as the group.
Optionally, the first encoding is performed by a first encoder, and the second encoding is performed by a second encoder; the first encoder and the second encoder are jointly trained by:
Inputting sample speech features of the sample speech into the first encoder to obtain second intermediate features, and calculating a first loss function of the first encoding;
inputting the second intermediate feature into the second encoder to obtain a second encoding result, and calculating a second loss function of the second encoding;
the first encoder and the second encoder are jointly trained based on the first loss function and the second loss function.
Optionally, the first loss function includes a first loss sub-function and a second loss sub-function, and the second loss function includes a third loss sub-function and a fourth loss sub-function;
the calculating a first loss function of the first encoding comprises: inputting the second intermediate feature simultaneously into a first decoder based on connectionist temporal classification and a second decoder based on an attention-based encoder-decoder; calculating a first loss sub-function of the first encoding based on a first output of the first decoder, and calculating a second loss sub-function of the first encoding based on a second output of the second decoder;
the calculating a second loss function of the second encoding comprises: inputting the result of the second encoding simultaneously into a third decoder based on connectionist temporal classification and a fourth decoder based on an attention-based encoder-decoder; calculating a third loss sub-function of the second encoding based on a third output of the third decoder, and calculating a fourth loss sub-function of the second encoding based on a fourth output of the fourth decoder.
Optionally, the jointly training the first encoder and the second encoder based on the first loss function and the second loss function includes:
calculating a total loss function based on the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function;
the first encoder and the second encoder are jointly trained based on the total loss function.
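As an illustration of the joint training described above, the following Python sketch combines the four loss sub-functions into one total loss; the weighted-sum form and the weighting coefficient are assumptions for illustration, not values specified by this disclosure.

```python
import torch

def total_loss(ctc_loss_1: torch.Tensor, aed_loss_1: torch.Tensor,
               ctc_loss_2: torch.Tensor, aed_loss_2: torch.Tensor,
               ctc_weight: float = 0.3) -> torch.Tensor:
    # First loss function: sub-losses computed from the first encoder's output
    # through the CTC-based first decoder and the attention-based second decoder.
    first_loss = ctc_weight * ctc_loss_1 + (1.0 - ctc_weight) * aed_loss_1
    # Second loss function: sub-losses computed from the second encoding result
    # through the CTC-based third decoder and the attention-based fourth decoder.
    second_loss = ctc_weight * ctc_loss_2 + (1.0 - ctc_weight) * aed_loss_2
    # Backpropagating this total updates the first and second encoders jointly.
    return first_loss + second_loss
```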
Optionally, the inputting the sample speech feature of the sample speech into the first encoder to obtain a second intermediate feature includes:
extracting the sample voice features from the sample voice;
replacing a portion of the features in the sample speech features with a mask;
and inputting the mask-replaced sample speech features into the first encoder, predicting the masked features in the sample speech features through the first encoder, and encoding the sample speech features with the predicted masks to obtain the second intermediate feature.
Optionally, the inputting the sample speech feature after mask replacement into the first encoder includes:
dividing the sample speech features after mask replacement into a plurality of batches;
setting a chunk size for each of the batches;
dividing each batch into chunks according to the chunk size corresponding to the batch; and
inputting the chunks into the first encoder in sequence.
Optionally, the sample speech comprises a plurality of sample speech sentences;
the joint training process of the first encoder and the second encoder further includes, prior to replacing a portion of the features in the sample speech features with a mask:
determining the longest sample speech sentence among the plurality of sample speech sentences as an anchor sample speech sentence;
and complementing the sample speech features of the sample speech sentences other than the anchor sample speech sentence to a first length, wherein the first length is the length of the sample speech features of the anchor sample speech sentence.
Optionally, the length of each of the batches is an integer multiple of the first length;
said setting a chunk size for each of said batches, comprising:
setting the chunk sizes of the batches of a first ratio among the plurality of batches to the first length, wherein the first ratio is not less than 50%;
setting the chunk sizes of the remaining batches to a length between a predetermined second length and the first length.
Optionally, the setting the chunk sizes of the remaining batches to a length between a predetermined second length and the first length includes:
determining a fourth length, which is the first length minus one feature length unit;
determining the minimum of a predetermined third length and the fourth length, wherein the third length and the fourth length are both greater than the second length;
setting the chunk sizes of the remaining batches to a length between the second length and the minimum.
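A minimal sketch of this batch-wise chunk-size assignment follows; the first ratio of 50%, the second length of 1, and the third length of 25 feature length units are illustrative assumptions rather than values fixed by this disclosure.

```python
import random

def assign_chunk_sizes(num_batches: int, first_length: int,
                       second_length: int = 1, third_length: int = 25,
                       first_ratio: float = 0.5) -> list:
    # Fourth length: the first length minus one feature length unit; the upper
    # bound for the remaining batches is min(third length, fourth length).
    upper = min(third_length, first_length - 1)
    sizes = []
    for _ in range(num_batches):
        if random.random() < first_ratio:
            sizes.append(first_length)                           # full-length chunk
        else:
            sizes.append(random.randint(second_length, upper))   # shorter chunk
    return sizes
```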
Optionally, said predicting said mask in said sample speech features comprises:
determining the chunk in which the mask in the sample speech features is located;
predicting the mask based on the unmasked sample speech features in that chunk and the sample speech features that precede the chunk in the batch in which the chunk is located.
Optionally, the obtaining a second recognition text based on the result of the second encoding, and determining the second recognition text as the speech recognition result, includes:
inputting the result of the second encoding into a third decoder to obtain a plurality of second candidate texts and the second probabilities of the second candidate texts, wherein the second recognition text is the second candidate text with the largest second probability;
inputting the plurality of second candidate texts and the second probabilities of the second candidate texts into a second confidence prediction model to obtain a predicted second confidence;
and if the second confidence is higher than a second threshold, determining the second recognition text as the voice recognition result.
According to an aspect of the present disclosure, there is provided a voice recognition apparatus including:
a feature receiving unit, configured to receive and store a first intermediate feature sent by a client, wherein the first intermediate feature is obtained by the client performing first encoding on a target voice feature of a target voice;
a request receiving unit, configured to receive a voice recognition request from the client, wherein the voice recognition request is sent by the client when a first confidence is not higher than a first threshold, and the first confidence is the confidence of a first recognition text acquired by the client based on the first intermediate feature;
a second encoding unit, configured to perform second encoding on the stored first intermediate feature;
and a second text acquisition unit configured to acquire a second recognition text based on a result of the second encoding, and determine the second recognition text as the speech recognition result.
According to an aspect of the present disclosure, there is provided an electronic device comprising a memory storing a computer program and a processor implementing a speech recognition method as described above when executing the computer program.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method as described above.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program, which is read and executed by a processor of a computer device, causing the computer device to perform the speech recognition method as described above.
The embodiments of the disclosure divide the speech encoding in voice recognition into a first encoding and a second encoding: the first encoding is executed at the client, and the second encoding is executed in the cloud. When the first confidence of the first encoding is higher than the first threshold, the voice recognition result of the client is accurate and can be relied on entirely. When the first confidence is not higher than the first threshold, the voice recognition result of the client is not accurate enough, and a more accurate voice recognition result obtained through the second encoding is requested from the cloud server. In this way, local and cloud computing resources cooperate, and the processing overhead of the cloud server is reduced while voice recognition accuracy is guaranteed. Meanwhile, in the embodiments of the disclosure, the first intermediate feature obtained by performing the first encoding on the target voice feature is transmitted to the cloud server, instead of the target voice itself, so that the security of the object information is improved.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the disclosed embodiments.
FIG. 1 is an architectural diagram of a speech recognition method of an embodiment of the present disclosure;
fig. 2A-2D are schematic interface diagrams of an object terminal 130 to which embodiments of the present disclosure are applied to a cloud intelligent voice question-answer scenario;
FIG. 3 is a flow chart of a speech recognition method according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an overall framework for speech recognition of an embodiment of the present disclosure;
FIG. 5 is a flow chart of the acquisition of the first intermediate feature in step 310 of FIG. 3;
FIG. 6 is a schematic diagram of a target speech feature being partitioned and encoded according to a first time window to obtain a first intermediate feature;
FIG. 7 is a schematic diagram of a framework of a client of one embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a framework of a client according to another embodiment of the present disclosure;
FIG. 9 is a flowchart of a second encoding of the backed-up first intermediate feature in step 340 of FIG. 3;
FIG. 10 is a schematic diagram of dividing into blocks and performing first encoding based on a first time window, and performing second encoding on the block codes based on a second time window;
FIG. 11 is a flowchart of determining a speech recognition result based on the result of the second encoding in step 340 of FIG. 3;
FIG. 12 is a flow chart of training a first encoder and a second encoder according to one embodiment of the present disclosure;
FIG. 13 is a flow chart of step 1210 of FIG. 12 for obtaining a second intermediate feature;
FIG. 14 is a schematic diagram of the addition of the mask and the resulting second intermediate feature after the mask prediction of FIG. 13;
FIG. 15 is a flow chart of the input of the mask-replaced sample speech features to the first encoder in step 1330 of FIG. 13;
FIG. 16 is a schematic illustration of the masking replaced sample speech features of FIG. 12 being input to a first encoder;
FIG. 17 is a flowchart of complementing the length of the sample speech features according to the first length before step 1210 of FIG. 12;
FIG. 18 is a schematic diagram of the complementing of the length of the sample speech feature according to the first length in FIG. 17;
FIG. 19 is a flowchart of step 1520 of FIG. 15 in which the chunk size is set for each of the batches;
FIG. 20 is a flow chart of setting the chunk sizes of the remaining batches of step 1920 of FIG. 19;
FIG. 21 is a flow chart of predicting a mask in a sample speech feature of step 1330 of FIG. 13;
FIG. 22 is a schematic diagram of the prediction of a mask in the sample speech features of FIG. 13;
FIG. 23A is a flowchart of the calculation of the first loss function of step 1210 of FIG. 12;
FIG. 23B is a flowchart of the calculation of the second loss function of step 1220 in FIG. 12;
FIG. 24 is a schematic diagram of an architecture for joint training of a first encoder and a second encoder based on a first loss sub-function, a second loss sub-function, a third loss sub-function, and a fourth loss sub-function;
FIG. 25 is a flow chart of a method of speech recognition according to another embodiment of the present disclosure;
FIG. 26 is a block diagram of a speech recognition device according to one embodiment of the present disclosure;
FIG. 27 is a block diagram of a speech recognition device according to another embodiment of the present disclosure;
FIG. 28 is a structural diagram of a terminal for the voice recognition method shown in FIG. 3 according to an embodiment of the present disclosure;
FIG. 29 is a structural diagram of a server for the voice recognition method shown in FIG. 3 or FIG. 25 according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
Before the disclosed embodiments are described in further detail, the terms involved in the disclosed embodiments are explained; the explanations below apply to the following description:
the connection timing classification (Connectionist Temporal Classification, CTC) refers to an end-to-end speech recognition system framework, and input speech can directly output corresponding text, which is mainly used for processing alignment of input and output labels in sequence labeling problems. CTC is a complete end-to-end acoustic model training, and data need not be aligned in advance, and can be trained only by one input sequence and one output sequence, so that data alignment and one-to-one labeling are not needed, and the alignment between input and output is not important. CTCs can directly output the probability of sequence prediction without external post-processing.
Attention-based Encoder-Decoder (AED) refers to an end-to-end speech recognition system framework built on an attention-based encoder-decoder. The encoder in an AED extracts information from the input sequence (speech), and the decoder is an autoregressive model over the target sequence (text): given the previously generated units, it predicts the next unit, and during the autoregressive computation it accesses the encoder's output through attention, so the information of the input sequence can be utilized.
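The following sketch illustrates the AED pattern with PyTorch's generic Transformer modules: the encoder produces a memory from the speech features, and the decoder predicts the next token autoregressively while attending to that memory. The module choice, dimensions, and start-token id are illustrative assumptions, not the model used in this disclosure.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 5000                        # assumed sizes
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
embed = nn.Embedding(vocab_size, d_model)
out_proj = nn.Linear(d_model, vocab_size)

speech_frames = torch.randn(1, 100, d_model)           # already-projected speech features
memory = encoder(speech_frames)                         # encoder extracts input information

tokens = torch.tensor([[1]])                            # assumed start-of-sentence id
for _ in range(10):                                     # autoregressive decoding
    dec_out = decoder(embed(tokens), memory)            # cross-attention to the memory
    next_token = out_proj(dec_out[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)     # feed previous units back in
```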
Chunk refers to an attention mechanism streaming scheme that implements information truncation by partitioning a sequence.
Mel-scale Frequency Cepstral Coefficients (MFCC) are one of the speech features used in speech recognition and speaker recognition. Research on the auditory mechanism of the human ear shows that the ear has different sensitivities to sound waves of different frequencies; speech signals from 200 Hz to 5000 Hz have the greatest impact on speech intelligibility. When two sounds of unequal loudness act on the human ear, the frequency components of higher loudness affect the perception of the frequency components of lower loudness and make them less noticeable, a phenomenon known as the masking effect. Because lower-frequency sounds travel farther along the basilar membrane of the cochlea than higher-frequency sounds, bass tends to mask treble, whereas it is harder for treble to mask bass, and the critical bandwidth of sound masking is smaller at low frequencies than at high frequencies. Therefore, a set of band-pass filters is arranged from dense to sparse according to the critical bandwidth, from low to high frequency, to filter the input signal. The energy of the signal output by each band-pass filter is taken as a basic feature of the signal, and after further processing this feature can be used as the input feature of speech. Because these features do not depend on the nature of the signal, make no assumptions or restrictions on the input signal, and exploit research results on the auditory model, they are more robust than linear prediction cepstral coefficients based on the vocal tract model, better match the auditory characteristics of the human ear, and still provide good recognition performance when the signal-to-noise ratio decreases.
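A minimal sketch of extracting MFCC features as recognition input is shown below using librosa; the file name, sample rate, window length, frame shift, and number of coefficients are illustrative assumptions.

```python
import librosa

wav, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical audio file
mfcc = librosa.feature.mfcc(
    y=wav, sr=sr,
    n_mfcc=13,        # cepstral coefficients kept per frame
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame shift
)
print(mfcc.shape)      # (n_mfcc, n_frames): one feature vector per frame
```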
The Softmax layer refers to one of the most common activation functions in neural networks; it converts the network's output into a probability distribution, where Softmax is the normalized exponential function. In deep learning, Softmax is widely used for classification problems and for tasks that require an output probability distribution. During model training, a cross-entropy loss function is typically attached after the Softmax layer to measure the gap between the actual output probability distribution and the target probability distribution.
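A minimal PyTorch sketch of a Softmax output layer paired with a cross-entropy loss during training follows; the batch size and vocabulary size are assumed.

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 5000)            # (batch, vocabulary size) network outputs
probs = logits.softmax(dim=-1)           # normalized exponential: rows sum to 1

# During training, CrossEntropyLoss applies log-softmax internally, so it takes
# the raw logits plus the target token ids and measures the distribution gap.
targets = torch.randint(0, 5000, (8,))
loss = nn.CrossEntropyLoss()(logits, targets)
```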
In the related art, in current cascaded voice recognition systems, after the client receives voice information input by a user, it sends the user voice to the cloud server. The cloud server performs a first decoding with a streaming encoder using short-time blocking and returns the decoding result to the client, so that the client can display the text currently recognized from the user voice on the screen in real time. Meanwhile, the cloud server further enlarges the context information to 3 to 5 s through a cascade encoder and performs a second decoding on the enlarged voice information, thereby ensuring the accuracy of the final voice recognition result. Both the streaming encoder and the cascade encoder are deployed on the cloud server, and all computation and processing of voice recognition are performed there, so the cloud server bears a heavy load and consumes considerable processing resources. The cost of the cloud server is high, especially when handling a high number of queries per second (QPS). In addition, while the client sends the user voice to the cloud server, there is a risk that the information is intercepted and that privacy such as the user's voiceprint is stolen, so the security of the user object information is poor.
Based on the above, the embodiment of the disclosure provides a voice recognition method, a related device and a medium, where the voice recognition method can reduce the processing overhead of a cloud server and improve the security of object information on the premise of guaranteeing the recognition accuracy.
System architecture and scenario description applied to embodiments of the present disclosure
Fig. 1 is a system architecture diagram to which a speech recognition method according to an embodiment of the present disclosure is applied. It includes a server 110, the internet 120, an object terminal 130, and the like.
The object terminal 130 is a device that receives the voice of an object and displays the voice recognition result. It may be a desktop computer, laptop, PDA (personal digital assistant), cell phone, in-vehicle terminal, home theater terminal, dedicated terminal, or the like. It can also be a single device or a set of multiple devices, for example several devices connected through a local area network that share one display device and work cooperatively to form a terminal. The terminal may also communicate with the internet 120, wired or wirelessly, to exchange data. The object terminal 130 includes at least a display 131 for displaying the voice recognition result and a sound pickup device 132 for receiving the user's voice.
The server 110 refers to a computer system capable of providing a voice recognition service to the object terminal 130. The server 110 is required to have higher stability, security, performance, and the like than the object terminal 130. The server 110 may be one high-performance computer in a network platform, a cluster of multiple high-performance computers, a portion of one high-performance computer (e.g., a virtual machine), a combination of portions of multiple high-performance computers (e.g., virtual machines), and the like. The server 110 may also communicate with the internet 120 in a wired or wireless manner to exchange data.
The embodiment of the disclosure can be applied to various scenes, such as the scenes of cloud intelligent voice questions and answers shown in fig. 2A-2D.
As shown in fig. 2A, the client is an object terminal 130, which may be a mobile phone. A cloud intelligent voice question-answering interface is displayed on the object terminal 130, and the cloud intelligent voice question-answering can be activated by key wakeup, by loading the application with a click, by keyword voice wakeup, or the like. After the client wakes up the cloud intelligent voice question-answering, the interface reminds the user that a voice question can be asked by displaying "please input the question … … to be consulted". After the cloud intelligent voice question-answering is activated and the user is reminded that voice input is possible, its interface is as shown in fig. 2B.
In fig. 2B, the cloud intelligent voice question-answering interface prompts the user to speak through a microphone icon, and indicates, by the microphone icon shaking or flashing, that the voice information input by the user is currently being received. After receiving the voice information input by the user, the client extracts features of the voice information locally, performs first encoding on the extracted voice features to obtain a first intermediate feature, decodes the first intermediate feature to obtain a first recognition text, and at the same time caches the first intermediate feature in the cloud server.
As shown in fig. 2C, the client obtains a first recognition text by recognizing the voice information input by the user and performs a confidence judgment on it; when the confidence of the first recognition text meets the threshold condition, the first recognition text is displayed. In fig. 2C, the cloud intelligent voice question-answering interface displays the text "is there a hotel nearby … …". After receiving the voice information input by the user, the client first performs voice recognition with local resources; when the recognition result meets the confidence requirement, that is, it is accurate enough, the result is displayed without a second voice recognition by the cloud server, which saves the processing overhead of the cloud server.
After more voice has been accumulated, if the local recognition result does not meet the confidence threshold requirement, the client initiates a voice recognition request to the cloud server, and the cloud server performs voice recognition according to the cached first intermediate features.
As shown in fig. 2D, the cloud server performs the second encoding on the cached first intermediate features, obtains the second recognition text from the result of the second encoding, determines the second recognition text as the latest voice recognition result, and sends it to the client for display. In fig. 2D, the cloud server recognizes that the latest, accurate voice recognition result is "is there an emergency department nearby? I just got injured … …"; the client receives this latest recognition result sent by the cloud server, replaces "is there a hotel nearby … …" with "is there an emergency department nearby? I just got injured … …", and displays it on the cloud intelligent voice question-answering interface.
When the local recognition result of the client does not meet the confidence requirement, a recognition request is initiated to the cloud server, which performs the secondary voice recognition; the stronger computing capability of the cloud server is used to obtain a more accurate recognition result, so the accuracy of the final voice recognition result is ensured. Meanwhile, the cloud server performs the secondary voice recognition on the first intermediate feature obtained by the client's first encoding, instead of directly acquiring the user object's voice, so the security of the object information is improved.
It should be understood that the foregoing is merely illustrative of some application scenarios of the present disclosure, which may include, but are not limited to, the specific embodiments set forth above.
General description of embodiments of the disclosure
According to one embodiment of the present disclosure, a speech recognition method is provided.
Voice recognition is the process of recognizing the speech of the user object into corresponding text information and displaying that text. The voice recognition method is divided into two stages. In the first stage, the client receives the voice information of the user object and performs recognition locally; when the accuracy of the recognition result meets the confidence requirement, the second-stage voice recognition by the cloud server is not needed, and the text obtained by the first-stage voice recognition is displayed directly. For example, when the user's surroundings are quiet and the user's pronunciation is standard, the client's local voice recognition can usually obtain a sufficiently accurate result, and performing the second-stage voice recognition again would waste processing resources. When the first-stage voice recognition result is not accurate enough to meet the confidence requirement, the second stage is performed: upon receiving the voice recognition request sent by the client, the cloud server feeds the recognized text back to the client for display. For example, when the user's surroundings are noisy and the available context information is insufficient, the accuracy of the client's local recognition result is low, and the cloud server performs the second-stage voice recognition to obtain a more accurate result. The embodiments of the disclosure can thus reduce the processing overhead of the cloud server and improve the security of object information while guaranteeing recognition accuracy.
The voice recognition method of the embodiments of the present disclosure is performed partly at the object terminal 130 and partly at the server 110. In one part of the method, the object terminal 130 responds to the object's trigger and performs voice recognition and display locally; in the other part, the object terminal 130 requests voice recognition from the server 110, receives the voice recognition result from the server 110, and displays the corresponding recognition result to the object.
As shown in fig. 3, according to one embodiment of the present disclosure, the voice recognition method is applied to a client, for example the object terminal 130 provided in fig. 1, and includes at least, but is not limited to, the following steps:
step 310, performing first encoding on a target voice feature of a target voice to obtain a first intermediate feature, and backing up the first intermediate feature to a cloud server;
step 320, acquiring a first recognition text based on the first intermediate feature;
step 330, acquiring a first confidence of the first recognition text;
and step 340, if the first confidence is higher than a first threshold, determining the first recognition text as a voice recognition result; otherwise, sending a voice recognition request to the cloud server so that the cloud server performs second encoding on the backed-up first intermediate feature, acquires a second recognition text based on the result of the second encoding, and determines the second recognition text as the voice recognition result.
Steps 310-340 are briefly described in connection with fig. 4.
The client receives the voice of the object through the sound pickup device to obtain the target voice, and locally performs feature extraction on the target voice to obtain the target voice feature. The client may input the target voice feature into the first encoder for first encoding to obtain a first intermediate feature, and send the first intermediate feature to the cloud server for backup.
Locally at the client, the first intermediate feature can be input into a decoder for decoding to obtain a first recognition text and a first confidence of the first recognition text; it can then be judged whether the first confidence is higher than a first threshold, and if so, the first recognition text is taken as the voice recognition result and displayed by the client.
If the first confidence level is lower than or equal to a first threshold value, the client may send a voice recognition request to the cloud server, and the cloud server performs secondary voice recognition in response to the voice recognition request.
In the cloud server, a plurality of first intermediate features can be acquired from the backup storage space in response to a voice recognition request sent by the client, the plurality of first intermediate features are input into a second encoder to perform second encoding, the result of the second encoding is input into a decoder, and a second recognition text is obtained through the decoder. And then, the second recognition text is used as a voice recognition result and is sent to the client.
The client can receive and display the voice recognition result sent by the cloud server.
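The client-side decision flow above can be summarized by the following Python sketch; the encoder, decoder, backup, and cloud-request callables are stand-in placeholders and the threshold value is an assumption, not the disclosure's implementation.

```python
from typing import Callable, List, Tuple

FIRST_THRESHOLD = 0.9   # assumed value of the first threshold

def recognize(
    target_voice_feature: List[float],
    first_encoder: Callable[[List[float]], List[float]],
    local_decoder: Callable[[List[float]], Tuple[str, float]],
    backup_to_cloud: Callable[[List[float]], None],
    request_cloud_recognition: Callable[[], str],
) -> str:
    first_intermediate = first_encoder(target_voice_feature)  # first encoding on device
    backup_to_cloud(first_intermediate)                        # back up in real time
    first_text, first_confidence = local_decoder(first_intermediate)
    if first_confidence > FIRST_THRESHOLD:
        return first_text                                      # local result is reliable
    # Confidence not high enough: ask the cloud server to run the second encoding
    # on the backed-up first intermediate feature and return the second text.
    return request_cloud_recognition()
```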
In this embodiment, the speech encoding in voice recognition is divided into a first encoding and a second encoding: the first encoding is executed at the client, and the second encoding is executed in the cloud. When the first confidence of the first encoding is higher than the first threshold, the voice recognition result of the client is accurate and can be relied on entirely. When the first confidence is not higher than the first threshold, the voice recognition result of the client is not accurate enough, and a more accurate voice recognition result obtained through the second encoding is requested from the cloud server. In this way, local and cloud computing resources cooperate, and the processing overhead of the cloud server is reduced while voice recognition accuracy is guaranteed. Meanwhile, in this embodiment, the first intermediate feature obtained by performing the first encoding on the target voice feature is transmitted to the cloud server, instead of the target voice itself, so that the security of the object information is improved.
Steps 310-340 are described in detail below.
Detailed description of step 310
In step 310, the target voice refers to the voice information input by the user object. The voice information may be input by the user object speaking, or by the user object playing audio. The client collects the speech of the object through the sound pickup device to obtain the target voice. For example, in a cloud intelligent voice question-answering scenario, the client interacts with the object through the voice question-answering interface to collect the target voice; for another example, in an input-method voice input scenario, the object speaks the content to be entered to the client, and the client receives the spoken content as the target voice and performs subsequent recognition to obtain the text content the object wishes to input. The target voice feature refers to key feature information extracted from the target voice; for example, it may be the mel-frequency cepstral coefficients extracted from the target voice. The first encoding refers to encoding the target voice feature and converting it into a form that the computer can recognize and process; the first encoding may be streaming encoding based on connectionist temporal classification, or streaming encoding based on an attention mechanism.
The first intermediate feature refers to information obtained after the client side encodes the target voice feature once by using the local computing resource. The step of backing up the first intermediate feature to the cloud server means that the client sends the first intermediate feature obtained by the first encoding to the cloud server, and the cloud server receives and stores the first intermediate feature, wherein the client backs up the first intermediate feature to the cloud server in real time.
As shown in fig. 4, the client receives the target voice of the object, performs feature extraction on the target voice locally to obtain the target voice feature, inputs the target voice feature into the first encoder for first encoding, and sends the first intermediate feature obtained through the first encoder to the cloud server for backup storage over the internet, a wireless transmission network, or the like. In one embodiment, the first encoder is a streaming encoder deployed at the client.
In one embodiment, as shown in fig. 5, performing first encoding on the target voice feature of the target voice to obtain the first intermediate feature includes:
step 510, extracting the target voice feature from the target voice;
step 520, dividing the target voice feature into blocks;
step 530, performing first encoding on each block to obtain a block code;
step 540, cascading the block codes in block order to obtain the first intermediate feature.
In this embodiment, the client divides the target voice feature into blocks using a chunk mechanism, where every block has the same size: 8 to 22 frames of features, corresponding to a voice length of 320 ms to 880 ms. After the target voice feature is divided into blocks, the blocks are input into the first encoder for first encoding; the first encoding of the blocks can be performed simultaneously or in batches in time order. The block codes obtained by the first encoding are cascaded in block order; the block order can be determined from the time sequence, or a label bit can be added during blocking to indicate the order of each block.
Illustratively, as shown in fig. 6, after the client obtains the target voice and extracts the target voice feature, the client divides the target voice feature into blocks according to a first time window T1 to obtain a plurality of blocks of equal length, where the block size corresponds to the first time window T1; for example, if the first time window is 320 ms, the block size corresponds to 8 frames of features. First encoding is then performed on each block to obtain block codes, and the block codes are cascaded in time order to obtain the first intermediate feature.
Dividing the target voice feature into blocks before the first encoding improves the encoding and decoding speed and thus the efficiency of voice recognition.
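A minimal sketch of steps 510-540 is shown below: the feature sequence is cut into equal blocks by the first time window, each block is encoded, and the block codes are cascaded in block order. The 8-frame block size (about 320 ms at a 40 ms frame rate) follows the example above, while the encoder itself is a stand-in callable.

```python
import torch

def chunk_and_first_encode(features: torch.Tensor, first_encoder,
                           frames_per_block: int = 8) -> torch.Tensor:
    """features: (n_frames, feat_dim); 8 frames corresponds to roughly 320 ms."""
    block_codes = []
    for start in range(0, features.size(0), frames_per_block):
        block = features[start:start + frames_per_block]   # one first time window
        block_codes.append(first_encoder(block))            # first encoding per block
    # Cascade the block codes in block order to form the first intermediate feature.
    return torch.cat(block_codes, dim=0)
```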
In one embodiment, backing up the first intermediate feature to the cloud server includes: and for each block code in the first intermediate feature, after the block code is generated, sending the block code to a cloud server for caching.
In this embodiment, the client backs up the first intermediate feature to the cloud server in real time; that is, each time the client generates a block code, it sends the block code to the cloud server, and the cloud server stores the block codes in time order, or orders them according to a sequence flag carried when the client sends them and then stores them. Within one voice recognition pass, sending each block code to the cloud server for backup as soon as it is generated lets the cloud server stay synchronized with the client's first-encoding progress; when the confidence of the client's first recognition text is not satisfactory, the cloud server can directly perform the secondary voice recognition on the cached first intermediate feature according to the client's recognition request, which ensures the timeliness of the cloud server's voice recognition. Moreover, what is transmitted to the cloud server for backup is the block code, which is information already converted by the first encoding into a computer-internal representation, so privacy such as the object's voiceprint is not revealed, and the security of user data is improved.
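A minimal sketch of sending each block code to the cloud server as soon as it is produced follows; the endpoint URL, session id, index parameter, and serialization format are hypothetical placeholders rather than an interface defined by this disclosure.

```python
import io
import requests
import torch

BACKUP_URL = "https://cloud.example.com/asr/backup"   # hypothetical endpoint

def backup_block_code(session_id: str, block_index: int, block_code: torch.Tensor) -> None:
    buffer = io.BytesIO()
    torch.save(block_code.cpu(), buffer)                # serialize the block code only
    requests.post(
        BACKUP_URL,
        params={"session": session_id, "index": block_index},  # index preserves block order
        data=buffer.getvalue(),
        timeout=1.0,
    )
```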
In another embodiment, the generation of the first intermediate feature by the first encoder and the sending of the first intermediate feature to the cloud server for backup may not be synchronized. For example, the first intermediate features generated within a time interval may be sent to the cloud server periodically, in batches, at preset time intervals. For another example, the first intermediate features may be sent to the cloud server in a batch for backup once they accumulate to a preset number.
In one embodiment, when a target voice input by the object has been recognized and before recognition of the next new target voice starts, the cloud server clears the first intermediate features cached during the previous target voice recognition, so that the space occupied by redundant data is reduced and resources are saved.
Detailed description of step 320 and step 330
In step 320, the first recognition text is the text information corresponding to the target voice, decoded from the first intermediate feature. After the client performs the first encoding on the target voice feature to obtain the first intermediate feature, the first intermediate feature is decoded locally and the computer-internal representation is converted into natural-language text to obtain the first recognition text.
In step 330, the first confidence characterizes how trustworthy the first recognition text is; a higher first confidence indicates that the first recognition text locally recognized by the client is more reliable. The first confidence may be generated at the same time as the first recognition text is acquired, or it may be acquired through a confidence module.
In one embodiment, obtaining the first recognition text based on the first intermediate feature includes: based on the first intermediate feature, obtaining the first recognition text through a first decoder based on connectionist temporal classification (CTC); and acquiring the first confidence of the first recognition text includes: taking the CTC posterior probability obtained by the first decoder as the first confidence.
As shown in fig. 7, the client locally deploys a first encoder and a CTC-based first decoder. After the client receives the target voice and performs feature extraction and first encoding to obtain the first intermediate feature, it can send the first intermediate feature to the cloud server for backup through the internet or a wireless network, and at the same time send the first intermediate feature to the CTC-based first decoder for decoding to obtain the first recognition text. While decoding, the CTC-based first decoder generates the CTC posterior probability corresponding to the first recognition text, and this posterior probability is used as the first confidence for the subsequent confidence judgment. Because the CTC-based first decoder obtains the corresponding confidence as part of decoding, the framework of the voice recognition model is simplified and the efficiency of voice recognition is improved.
Illustratively, the CTC-based first decoder may employ CTC prefix beam search. The decoding result generated by a CTC model carries a posterior probability, so this posterior probability is used as the confidence of the decoding result; a CTC model combined with a small language model, such as a CTC model combined with a Token Lexicon Grammar (TLG) model, may also be used.
In one embodiment, obtaining the first recognition text based on the first intermediate feature includes: acquiring, based on the first intermediate feature, a plurality of first candidate texts and the first probabilities of the plurality of first candidate texts through the first decoder, wherein the first recognition text is the first candidate text with the largest first probability; and acquiring the first confidence of the first recognition text includes: inputting the plurality of first candidate texts and their first probabilities into a first confidence prediction model to obtain the predicted first confidence.
As shown in fig. 8, the client locally deploys the first encoder, the first decoder, and the first confidence prediction model. The client can perform the first encoding on the target voice feature of the target voice input by the user object through the first encoder, input the encoding result into the first decoder, obtain a plurality of first candidate texts and their first probabilities by decoding, and select the first candidate text with the largest first probability as the first recognition text. Meanwhile, the first decoder can input the decoded first candidate texts and their first probabilities into the first confidence prediction model, so that the first confidence prediction model predicts the first confidence corresponding to the first recognition text. The first decoder may be a decoder based on connectionist temporal classification or a decoder based on an attention mechanism. Predicting the first confidence of the first recognition text with the first confidence prediction model yields a more accurate confidence, so the reliability of the first recognition text can be confirmed more precisely and it can be judged whether the cloud server needs to perform the secondary recognition, which improves the reliability and accuracy of the overall voice recognition.
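The sketch below shows one possible form of such a confidence prediction model: a small network that maps the n-best first probabilities to a confidence in [0, 1]. Using only the probabilities (and not the candidate texts), together with the layer sizes, is a simplifying assumption for illustration.

```python
import torch
import torch.nn as nn

class ConfidencePredictor(nn.Module):
    def __init__(self, n_best: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_best, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),   # confidence in [0, 1]
        )

    def forward(self, candidate_probs: torch.Tensor) -> torch.Tensor:
        # candidate_probs: (batch, n_best) first probabilities of the candidate texts
        return self.net(candidate_probs).squeeze(-1)

predictor = ConfidencePredictor(n_best=10)
first_confidence = predictor(torch.rand(1, 10))   # predicted first confidence
```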
In one embodiment, the first encoding is performed by a first encoder and the second encoding is performed by a second encoder, and the number of parameters of the first encoder is smaller than the number of parameters of the second encoder. The parameters refer to the number of nodes of the encoder model and the number of weights corresponding to the nodes; the first encoder is a streaming encoder deployed locally at the client, and the second encoder is a cascade encoder deployed at the cloud server. Reducing the number of parameters of the first encoder relieves the computing resources of the client; increasing the number of parameters of the second encoder improves the accuracy of the second encoder and thus the reliability of the overall voice recognition result.
Detailed description of step 340
In step 340, the first threshold is a preset confidence value, which may be set in advance according to the reliability requirement of voice recognition: when the first confidence is higher than the first threshold, the confidence of the first recognition text meets the requirement; otherwise, the first recognition text is not reliable. The voice recognition request is the request generated by the client when the first confidence is equal to or lower than the first threshold, instructing the cloud server to perform the secondary voice recognition. The second encoding refers to the cloud server encoding the backed-up first intermediate feature to convert it into an encoded form that the cloud server can use for recognition. The second recognition text is the text content corresponding to the target voice obtained by the cloud server performing the secondary voice recognition on the first intermediate feature.
As shown in fig. 4, the client may receive the target voice input by the object, perform feature extraction on the target voice to obtain the target voice feature, input the target voice feature into the first encoder, which performs the first encoding to obtain the first intermediate feature, and send the first intermediate feature to the decoder; meanwhile, the client may synchronously back up the first intermediate feature generated by the first encoder to the cloud server through the internet or a wireless communication network. The client's decoder can decode the first intermediate feature to obtain the first recognition text and the corresponding first confidence; when the first confidence is higher than the first threshold, the client can output and display the text content of the first recognition text. When the first confidence is equal to or lower than the first threshold, the client may send a voice recognition request to the cloud server. After receiving the voice recognition request from the client, the cloud server can take the backed-up first intermediate features out of the storage space and input them into the second encoder for second encoding, where the second encoder is a cascade encoder that cascades a plurality of first intermediate features before performing the second encoding. The result of the second encoding is input into the decoder of the cloud server to obtain the second recognition text; the cloud server can determine the second recognition text as the voice recognition result and send it back to the client, and the client outputs and displays the text content of the second recognition text.
The embodiment of the disclosure divides the speech encoding in speech recognition into a first encoding and a second encoding: the first encoding is executed at the client, and the second encoding is executed in the cloud. When the first confidence of the first encoding is higher than the first threshold, the speech recognition result of the client is accurate enough to be relied on completely, and the cloud server is not required to perform a second speech recognition, which saves the processing resources of the cloud server. When the first confidence is not higher than the first threshold, the speech recognition result of the client is not accurate enough, and the more accurate result obtained by the second encoding is requested from the cloud server. The local terminal and the cloud computing resources thus cooperate, reducing the processing overhead of the cloud server while guaranteeing speech recognition accuracy. Meanwhile, what is transmitted to the cloud server is the first intermediate feature obtained by the first encoding of the target voice feature rather than the target voice itself, which improves the security of the object information.
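As an illustration of this division of labor, the following Python sketch shows the client-side gating logic under stated assumptions: the encoder, decoder, backup and cloud-request steps are passed in as hypothetical callables, and the 0.8 threshold is an arbitrary placeholder rather than a value taken from the disclosure.

```python
from typing import Callable, List, Tuple

def recognize_on_client(
    features: List[float],
    first_encode: Callable[[List[float]], List[float]],
    backup_to_cloud: Callable[[List[float]], None],
    local_decode: Callable[[List[float]], Tuple[str, float]],
    request_cloud_recognition: Callable[[], str],
    first_threshold: float = 0.8,   # assumed value; set per reliability requirement
) -> str:
    """Client-side gating: trust the local result only above the first threshold."""
    first_intermediate = first_encode(features)       # first encoding (streaming encoder)
    backup_to_cloud(first_intermediate)               # back up the intermediate, not the raw speech
    first_text, first_confidence = local_decode(first_intermediate)
    if first_confidence > first_threshold:
        return first_text                             # local result is reliable
    return request_cloud_recognition()                # fall back to the cloud second recognition
```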
In one embodiment, as shown in FIG. 9, second encoding the backed-up first intermediate feature includes:
step 910, taking a plurality of block codes in a second time window as a group, and performing second coding on the group to obtain a group code, wherein the second time window comprises a plurality of first time windows;
Step 920, cascading the group codes according to the order of the groups to obtain the result of the second coding.
In step 910, a time window is a set length of time; the first time window and the second time window denote time ranges of different lengths. That the second time window includes a plurality of first time windows means that the duration of the second time window equals the sum of the durations of those first time windows. Grouping the plurality of block codes within the second time window means that the block codes whose timings are consecutive within the same second time window are placed in the same group.
In step 920, the order of the groups refers to the timing relationship between each group, and in the two adjacent groups, the last block code in the preceding group is in sequential succession with the first block code in the following group. Concatenation refers to splicing the data segments in a set order.
Steps 910-920 have the advantage that, because the second time window includes a plurality of first time windows, the plurality of block codes within the second time window are encoded together as a group during the second encoding. The block codes from the different first time windows are thus mixed in the second encoding process, and the relationships between block codes of different blocks are reflected in this mixing, so that higher-level semantics across blocks can be captured, improving the semantic understanding capability of the encoder and the speech recognition accuracy.
In one embodiment, taking a plurality of block codes within a second time window as a group, performing second encoding on the group includes: and when the second time window is finished, taking out a plurality of block codes which correspond to the plurality of first time windows in the second time window and are cached in the cloud server as a group, and performing second coding.
The benefit of this embodiment is as follows. If a process were started to perform the second encoding as soon as the server received a block code for a first time window from the client, that process would have to idle until the block codes of all the first time windows contained in the second time window had been received before it could actually perform the second encoding, wasting computing resources. In this embodiment, block codes received from the client are first cached until the second time window ends; only after receiving the voice recognition request that the client sends because the first confidence is determined not to be higher than the first threshold is a process allocated, which takes the plurality of block codes cached in the cloud server for the plurality of first time windows within the second time window as a group and performs the second encoding. This reduces computing resource consumption.
As shown in fig. 10, in one pass of speech recognition, a plurality of block codes divided according to a first time window T1 are stored in the cloud server, and the cloud server groups these block codes according to a second time window T2. For example, if the second time window T2 includes three first time windows T1, every three block codes form one group, giving a plurality of groups. Each group is second-encoded to obtain a group code, and the group codes are cascaded according to the time order of the groups to obtain the result of the second encoding, from which the text content corresponding to the target voice is obtained in the second speech recognition. Illustratively, the first time window T1 may be 800 ms, i.e. the block code corresponding to one first time window T1 has a length of 20 frames of features; the second time window T2 may be 2400 ms, i.e. one second time window T2 includes three first time windows T1, groups each containing three block codes are divided according to the second time window T2, and one group has a length of 60 frames of features.
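A minimal sketch of the grouping and cascading described in steps 910-920, assuming the block codes are already cached on the server; `encode_group` stands in for the second encoder and is an assumption, not part of the disclosure.

```python
from typing import Callable, List, Sequence

def second_encode_blocks(
    block_codes: Sequence[Sequence[float]],   # cached block codes, one per first time window
    blocks_per_group: int,                    # e.g. 3 when T2 = 3 * T1 (2400 ms / 800 ms)
    encode_group: Callable[[List[List[float]]], List[float]],
) -> List[float]:
    """Group block codes by the second time window, second-encode each group,
    then cascade the group codes in temporal order."""
    result: List[float] = []
    for start in range(0, len(block_codes), blocks_per_group):
        group = [list(b) for b in block_codes[start:start + blocks_per_group]]
        group_code = encode_group(group)      # second encoding of one group
        result.extend(group_code)             # cascade group codes in group order
    return result
```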
In one embodiment, as shown in fig. 11, acquiring the second recognition text based on the result of the second encoding, and determining the second recognition text as a speech recognition result, includes:
Step 1110, inputting the result of the second encoding into a third decoder to obtain a plurality of second candidate texts and the second probabilities of the plurality of second candidate texts, wherein the second recognition text is the second candidate text with the largest second probability;
step 1120, inputting the plurality of second candidate texts and the second probabilities of the plurality of second candidate texts into a second confidence prediction model to obtain a predicted second confidence;
if the second confidence level is higher than the second threshold, the second recognition text is determined to be the speech recognition result, step 1130.
In this embodiment, the second encoding refers to the second encoder in the cloud server encoding the first intermediate feature cached by the cloud server. The third decoder is the decoder corresponding to the second encoding and is used to decode the encoding result output by the second encoder to obtain the corresponding text content. The second candidate texts are the plurality of possible decoding results obtained by the third decoder based on the second encoding result output by the second encoder. Each second candidate text corresponds to a second probability, which characterizes the credibility of that second candidate text.
As shown in fig. 4, the cloud server is further provided with a re-scoring module that includes a second confidence prediction model. The result of the second encoding output by the second encoder is input into the decoder and decoded to obtain a plurality of second candidate texts and their second probabilities, and the second candidate text with the largest second probability is confirmed as the second recognition text. The second candidate texts and their second probabilities are meanwhile input into the re-scoring module, and the second confidence corresponding to the second recognition text is predicted by the second confidence prediction model. If the second confidence meets the second threshold requirement, the reliability of the second recognition text is proven to meet the requirement; the cloud server then sends the second recognition text to the client, which receives and displays it.
Steps 1110-1130 have the advantage that the second confidence of the second encoding result is predicted by the second confidence prediction model; if the second confidence is higher than the second threshold, the second recognition text is determined to be the speech recognition result and returned to the client, and otherwise no result is returned. Any speech recognition result that is produced can therefore essentially be trusted, which improves the accuracy and reliability of the speech recognition of the embodiment of the disclosure.
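The re-scoring flow above can be sketched as follows, assuming the third decoder has already produced (candidate text, second probability) pairs and that `predict_confidence` stands in for the second confidence prediction model; the 0.8 threshold is illustrative only.

```python
from typing import Callable, Optional, Sequence, Tuple

def rescore_and_select(
    candidates: Sequence[Tuple[str, float]],              # (second candidate text, second probability)
    predict_confidence: Callable[[Sequence[Tuple[str, float]]], float],
    second_threshold: float = 0.8,                         # assumed value
) -> Optional[str]:
    """Pick the candidate with the largest second probability, predict its second
    confidence with the re-scoring model, and return it only above the threshold."""
    second_text, _ = max(candidates, key=lambda item: item[1])
    second_confidence = predict_confidence(candidates)
    if second_confidence > second_threshold:
        return second_text         # sent back to the client as the recognition result
    return None                    # below the threshold: no result is returned
```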
In one embodiment, as shown in fig. 4, the cloud server is further provided with a punctuation model, where the punctuation model is used to add punctuation marks to the first recognition text or the second recognition text to form complete display content. When the first confidence coefficient of the first identification text locally identified by the client is higher than a first threshold value, the client sends the first identification text to a punctuation model of the cloud server; in the cloud server, after punctuation marks are added to the first identification text by the punctuation model, the first identification text after punctuation is added is sent back to the client for text display. When the first confidence coefficient of the first recognition text locally recognized by the client is lower than or equal to a first threshold value, the client sends a voice recognition request to the cloud server, and the cloud server responds to the voice recognition request and performs second encoding and decoding based on the backup first intermediate feature, namely second voice recognition to obtain a second recognition text; and the cloud server inputs the second identification text into a punctuation model, and after the punctuation model adds punctuation marks to the second identification text, the second identification text after punctuation is added is sent back to the client for text display.
In one embodiment, as shown in fig. 12, the first encoding is performed by a first encoder and the second encoding is performed by a second encoder; the first encoder and the second encoder are jointly trained by:
step 1210, inputting the sample speech feature of the sample speech into a first encoder to obtain a second intermediate feature, and calculating a first loss function of the first encoding;
step 1220, inputting the second intermediate feature into a second encoder to obtain a second encoding result, and calculating a second loss function of the second encoding;
step 1230, jointly training the first encoder and the second encoder based on the first loss function and the second loss function.
In this embodiment, in step 1210, the sample speech refers to a collection of speech information used for model training; it may be historically collected speech information input by objects, randomly generated speech information, or speech information extracted from a sample database. The sample speech feature refers to feature data extracted from the sample speech; in this embodiment, the sample speech feature may be the mel-frequency cepstral coefficients of the sample speech. The first encoder refers to the streaming encoder deployed at the client, and may be an encoder based on connection timing classification or an encoder based on attention codec. The second intermediate feature refers to the intermediate data obtained by the first encoder performing the first encoding on the sample speech feature. In step 1220, the second encoding result refers to the data obtained by the second encoder performing the second encoding based on the second intermediate feature. The second encoder refers to the cascade encoder deployed at the cloud server, which performs the second encoding based on the intermediate feature data encoded by the first encoder.
Based on the architecture provided in fig. 4, the joint training procedure in this embodiment is as follows. The sample speech is input into the client; the client locally extracts features from the sample speech to obtain the sample speech features, and the first encoder performs the first encoding on the sample speech features to obtain the second intermediate feature. The client sends the second intermediate feature to the cloud server for backup, inputs the second intermediate feature into the decoder of the client for decoding, and calculates the first loss function after the output of the client's decoder passes through a Softmax layer. The cloud server encodes the second intermediate feature through the second encoder to obtain the second encoding result, inputs the second encoding result into the decoder of the cloud server for decoding, and calculates the second loss function after the output of the cloud server's decoder passes through the Softmax layer. The cloud server may perform the second encoding upon receiving a voice recognition request from the client, where that request is generated when the confidence of the text result obtained by decoding the second intermediate feature is not higher than the first threshold; alternatively, since this is the joint training phase, the cloud server may perform the second encoding when a certain number of second intermediate features have accumulated, or in response to requests periodically sent by the client. The first encoder and the second encoder are jointly trained based on the first loss function and the second loss function.
The benefit of steps 1210-1230 is that, instead of training the first encoder of the client and the second encoder of the server side separately and in isolation, the respective first and second loss functions are calculated and the two encoders are jointly trained on both. Compared with isolated training, this embodiment takes the association between the first encoding and the second encoding into account and improves the accuracy of both encoders, thereby improving the speech recognition accuracy.
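A schematic of the joint forward pass in steps 1210-1230, with the encoders and loss computations passed in as hypothetical callables; because the second encoder consumes the first encoder's output, combining the two losses lets the second loss also influence the first encoder during joint training.

```python
from typing import Callable, Sequence, Tuple

def joint_forward(
    sample_features: Sequence[float],
    first_encoder: Callable[[Sequence[float]], Sequence[float]],
    second_encoder: Callable[[Sequence[float]], Sequence[float]],
    first_loss_fn: Callable[[Sequence[float]], float],
    second_loss_fn: Callable[[Sequence[float]], float],
) -> Tuple[float, float]:
    """One joint forward pass over both encoders for a training sample."""
    second_intermediate = first_encoder(sample_features)   # client-side first encoding
    first_loss = first_loss_fn(second_intermediate)        # loss of the first encoding branch
    second_result = second_encoder(second_intermediate)    # cloud-side second encoding
    second_loss = second_loss_fn(second_result)            # loss of the second encoding branch
    return first_loss, second_loss                         # combined later for joint training
```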
In one embodiment, as shown in fig. 13, inputting a sample speech feature of a sample speech into a first encoder to obtain a second intermediate feature, comprising:
step 1310, extracting sample voice characteristics from the sample voice;
step 1320, replacing part of the features in the sample speech features with a mask;
step 1330, the sample speech feature with the mask replaced is input to the first encoder, the mask in the sample speech feature is predicted by the first encoder, and the sample speech feature with the mask predicted is encoded to obtain a second intermediate feature.
In this embodiment, masking in step 1320 refers to randomly generating data for replacing the target content; typically, the mask is a binary code of the same shape as the target data, with each element set to 0 or 1, where 0 indicates that the contents of the target data at the corresponding location are masked. In step 1330, predicting the mask in the sample speech feature by the first encoder means that the first encoder predicts the content of the portion of the sample speech feature replaced by the mask by a preset prediction algorithm.
For example, as shown in fig. 14, assuming that the sample speech feature is a data sequence of 16 unit lengths, the contents of the second unit length, the fifth unit length, the ninth unit length, the twelfth unit length, and the fifteenth unit length are replaced by a mask generated randomly, so as to obtain the sample speech feature after the mask replacement. And inputting the sample voice features subjected to mask replacement into a first encoder for mask prediction, generating prediction results of positions of a second unit length, a fifth unit length, a ninth unit length, a twelfth unit length and a fifteenth unit length, and replacing the mask to obtain the sample voice features subjected to mask prediction. And encoding the sample voice features after the prediction mask through a first encoder to obtain second intermediate features.
By adopting a mask-prediction training mode in the joint training process, the first encoder does not merely encode each sample speech feature in isolation but also takes the relations between adjacent features into account during encoding, forming a higher-level understanding of the semantics between features and thus improving the encoding accuracy of the trained first encoder.
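A simple sketch of the mask-replacement step, assuming a one-dimensional feature sequence, a 30% masking ratio and a zero mask value purely for illustration; the disclosure does not fix these choices.

```python
import random
from typing import List, Tuple

def mask_features(
    features: List[float],
    mask_ratio: float = 0.3,     # assumed masking ratio
    mask_value: float = 0.0,     # assumed mask token for a 1-D feature sequence
    seed: int = 0,
) -> Tuple[List[float], List[int]]:
    """Replace a random subset of positions in the sample speech features with a mask;
    the first encoder is then trained to predict the masked content."""
    rng = random.Random(seed)
    n_masked = max(1, int(len(features) * mask_ratio))
    positions = sorted(rng.sample(range(len(features)), n_masked))
    masked = list(features)
    for pos in positions:
        masked[pos] = mask_value
    return masked, positions     # masked sequence plus the positions to predict
```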
In one embodiment, as shown in fig. 15, inputting the mask-replaced sample speech features into a first encoder, comprises:
Step 1510, dividing the sample speech features after mask substitution into a plurality of batches;
step 1520, setting a chunk size for each batch;
step 1530, partitioning each batch according to the partition size corresponding to the batch;
step 1540, inputting the blocks sequentially to the first encoder.
In this embodiment, in step 1510, each batch contains an integer multiple of the sample speech features corresponding to the sample speech sentence. In step 1520, the block sizes set for each batch may be the same or may be different; the size of the block may be randomly selected from the feature quantities of 8 to 22 frames. In step 1540, sequentially inputting the blocks to the first encoder means sequentially inputting the blocks to the first encoder in a time series order.
For example, as shown in fig. 16, assume that the sample speech feature is a data sequence of 16 unit lengths, and the contents at the second, fifth, ninth, twelfth and fifteenth unit lengths are replaced by a randomly generated mask to obtain the mask-replaced sample speech feature. Assume further that the batches have the same length of four unit lengths; the mask-replaced sample speech feature is then divided into four batches of four unit lengths each, where each unit length is assumed to correspond to the sample speech feature of one sample speech sentence, i.e. each batch contains the sample speech features of four sample speech sentences. The block sizes of the first and third batches are set to one unit length, and the block sizes of the second and fourth batches are set to two unit lengths. The first and third batches are accordingly each divided into four blocks, and the second and fourth batches are each divided into two blocks. The resulting 12 blocks are input to the first encoder in time order.
Steps 1510-1540 have the advantage that the mask-replaced sample speech features are divided into a plurality of batches and each batch is divided into blocks of a different block size, so that the trained first encoder acquires better semantic understanding and a certain mask prediction capability for sentences of different lengths, which improves its generality across sentence lengths and its encoding accuracy.
It will be appreciated that in other examples, the number of sample speech features of the sample speech sentences contained in each batch may be different, for example, taking the example of the sample speech features replaced with a mask of 16 unit lengths in fig. 16, the sample speech features are divided into three batches, wherein the first batch contains 8 unit lengths and the other two batches contain 4 unit lengths.
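The batching and chunking of steps 1510-1540 can be sketched as follows, using the fig. 16 example above with 16 feature units, four batches of four units, and chunk sizes of 1 and 2 units; these concrete values come from the example, while the helper names are assumptions.

```python
from typing import List, Sequence

def split_into_batches(features: Sequence[float], batch_length: int) -> List[List[float]]:
    """Divide mask-replaced sample speech features into equal-length batches."""
    return [list(features[i:i + batch_length]) for i in range(0, len(features), batch_length)]

def split_batch_into_chunks(batch: Sequence[float], chunk_size: int) -> List[List[float]]:
    """Divide one batch into blocks of the size set for that batch."""
    return [list(batch[i:i + chunk_size]) for i in range(0, len(batch), chunk_size)]

features = list(range(16))                               # 16 feature units (fig. 16)
batches = split_into_batches(features, batch_length=4)   # four batches of four units
chunk_sizes = [1, 2, 1, 2]                               # per-batch block sizes from the example
chunks: List[List[float]] = []
for batch, size in zip(batches, chunk_sizes):
    chunks.extend(split_batch_into_chunks(batch, size))
# `chunks` (12 in total) would then be fed to the first encoder in time order.
```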
In one embodiment, as shown in FIG. 17, the sample speech includes a plurality of sample speech sentences;
the joint training process of the first encoder and the second encoder further includes, prior to replacing a portion of the features in the sample speech features with the mask:
step 1710, determining the longest anchor sample speech sentence among the plurality of sample speech sentences;
Step 1720, complementing the sample speech features of the sample speech sentences other than the anchor sample speech sentence to a first length, wherein the first length is the length of the sample speech features of the anchor sample speech sentence.
In this embodiment, in step 1710, the anchor sample speech sentence is the sentence with the longest length determined among the plurality of sample speech sentences of the sample speech, used as the reference for length complementation. In step 1720, the length of a sample speech sentence corresponds to the length of the extracted sample speech features: the longer the speech sentence, the longer the sample speech features. The complementation in step 1720 is generally done by appending 0s or 1s to bring the sample speech features up to the required length. A plurality of sample speech sentences are determined from the sample speech, the longest one is determined as the anchor sample speech sentence, and feature extraction is performed on all of them to obtain the corresponding sample speech features. The length of the anchor sample speech sentence's features is taken as the first length, and the sample speech features of the remaining sentences are complemented in length so that the feature lengths of all sentences are equal. Complementing the feature length of each sample speech sentence to the same, longest length facilitates the subsequent training process and improves training efficiency; it also makes batch and block division more convenient and allows more possible ways of dividing batches and blocks.
For example, as shown in fig. 18, assuming that there are three lengths of sample speech sentences, named first speech sentence, second speech sentence, third speech sentence from top to bottom in this order, the third speech sentence of which length is longest is determined as the anchor sample speech sentence. And extracting the characteristics of the three sample voice sentences to obtain a first voice characteristic, a second voice characteristic and a third voice characteristic, wherein the first voice characteristic comprises seven unit lengths, the second voice characteristic comprises five unit lengths, and the third voice characteristic comprises ten unit lengths. The third speech feature length of the anchor sample speech sentence is taken as the first length, i.e. the first length is ten units long. According to the first length, filling data of three unit lengths are added for complement after the first voice feature, and filling data of five unit lengths are added for complement after the second voice feature. It is to be understood that this example is merely illustrative of, and not particularly limiting on, embodiments of the present disclosure.
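A minimal sketch of the length-complement step, padding every feature sequence to the anchor (first) length; zero padding is an assumption consistent with the "0 or 1" complementation mentioned above.

```python
from typing import List, Sequence

def pad_to_anchor(feature_sets: Sequence[Sequence[float]], pad_value: float = 0.0) -> List[List[float]]:
    """Pad every sample speech feature sequence to the length of the longest
    (anchor) sentence, i.e. the first length, so all sequences are equal."""
    first_length = max(len(f) for f in feature_sets)      # length of the anchor sentence features
    return [list(f) + [pad_value] * (first_length - len(f)) for f in feature_sets]

# Example from fig. 18: features of 7, 5 and 10 units are padded to 10 units.
padded = pad_to_anchor([[1.0] * 7, [1.0] * 5, [1.0] * 10])
assert [len(p) for p in padded] == [10, 10, 10]
```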
In one embodiment, the length of each batch is an integer multiple of the first length; as shown in fig. 19, setting a chunk size for each batch includes:
step 1910, setting the block size of a first ratio of the plurality of batches to a first length, wherein the first ratio is not less than 50%;
Step 1920, setting the block sizes of the remaining lots of the plurality of lots to a length between the predetermined second length and the first length.
In this embodiment, in step 1910, the first length is the length of the sample speech features of the anchor sample speech sentence; that is, when the block size is set to the first length, one block contains the complemented sample speech features of one sample speech sentence. The first ratio represents the proportion of the plurality of batches whose block size is set to the first length. In step 1920, the second length is a preset length value smaller than the first length; setting the block size to a length between the predetermined second length and the first length means selecting one length from the length values between the second length and the first length (both inclusive). A length may be selected randomly from this range as the block size, or lengths may be selected in order from small to large or from large to small.
By way of example, assume there are currently 10 batches, the first ratio is 50%, the first length is a feature quantity of 20 frames, and the second length is a feature quantity of 8 frames. 5 of the 10 batches are randomly selected and their block sizes are set to the first length, i.e. these 5 batches are blocked with each block being a feature quantity of 20 frames. For the other 5 batches, five lengths are randomly selected from the feature quantities of 8 to 20 frames for blocking, for example feature quantities of 8, 10, 12, 14 and 16 frames as the respective block sizes. As another example, feature quantities of 8, 9, 10, 11 and 12 frames are selected in ascending order from the feature quantities of 8 to 20 frames as the respective block sizes of the other five batches. As yet another example, feature quantities such as 19, 18, 17 and 16 frames are selected in descending order from the feature quantities of 8 to 20 frames as block sizes of the other batches. The speech length corresponding to each frame is 40 ms.
The benefit of steps 1910-1920 is as follows. Since the first length corresponds to the length of one sample speech sentence (sample speech sentences shorter than the anchor sample speech sentence are all length-complemented to equal the anchor), at least half of the batches train mask prediction from one complete sample speech sentence; predicting a mask based on the semantics of a complete sentence is the most frequent case in practice and therefore also the most important case when training the first encoder. Predicting a mask from only part of a sentence is also encountered in practice, but less often, so the remaining, smaller portion of the batches is used to train the first encoder's ability to predict masks from incomplete sentence information. When a sentence fragment is too short, the mask cannot be predicted correctly, so a second length smaller than the first length is defined, and the block sizes in the remaining batches lie between the second length and the first length. In this way, the first encoder's ability to understand semantics from a whole sentence is trained comprehensively, its ability to understand semantics from incomplete sentences of various lengths is trained as well, and the ratio between the two kinds of batches matches practical needs, improving the encoding adaptability and accuracy of the first encoder.
In one embodiment, as shown in fig. 20, setting the block sizes of the remaining lots of the plurality of lots to a length between a predetermined second length and a first length includes:
step 2010, determining a fourth length of the first length minus one characteristic length unit;
step 2020, determining a predetermined minimum value for a third length and a fourth length, wherein both the third length and the fourth length are greater than the second length;
step 2030, setting the block sizes of the remaining lots in the plurality of lots to a length between the second length and the minimum value.
In this embodiment, the fourth length in step 2010 equals the first length minus one feature length unit; when the length unit is the feature quantity of one frame, one feature length unit is the feature quantity of one frame. Assuming the first length is a feature quantity of 20 frames, the fourth length is a feature quantity of 20 − 1 = 19 frames. In step 2020, the third length and the fourth length are compared and the smaller of the two is taken as the minimum value. For example, assuming the third length is a feature quantity of 22 frames and the first length is a feature quantity of 20 frames, the fourth length is a feature quantity of 19 frames; the minimum of the third and fourth lengths is then the fourth length, i.e. a feature quantity of 19 frames. As another example, assuming the third length is a feature quantity of 22 frames and the first length is a feature quantity of 25 frames, the fourth length is a feature quantity of 25 − 1 = 24 frames; the minimum of the third and fourth lengths is then the third length, i.e. a feature quantity of 22 frames. In step 2030, one or more lengths between the second length and the minimum value are determined as the block sizes of the remaining batches; each batch is assigned one block size, and the sizes of different batches may be the same or different.
The benefit of steps 2010-2030 is as follows. Since the block size of more than half of the batches has already been set to the first length (i.e. the whole-sentence length), the remaining batches are used to train the first encoder's semantic understanding and mask prediction on incomplete sentences. If the block sizes of those batches were too close to the whole-sentence length, their training effect would largely duplicate that of the first group of batches. This embodiment therefore takes the minimum of the third length and the first length minus one as the upper bound of the block sizes in the remaining batches, avoiding block sizes that are too close to the whole-sentence length and thus avoiding redundant training targets, which improves the generality and encoding accuracy of the first encoder for sentences of various lengths.
In one embodiment, setting the chunk size of the remaining ones of the plurality of batches to a length between the second length and the minimum value comprises:
determining a plurality of length values between the second length and the minimum value and being an integer multiple of the characteristic length unit;
the block sizes of the remaining ones of the plurality of batches are uniformly distributed over the plurality of length values.
Assume there are currently 10 batches, the first ratio is 50%, the first length is a feature quantity of 25 frames, the second length is a feature quantity of 8 frames, and the feature length unit is the feature quantity of one frame. In 5 of the 10 batches, every 25 frames of features form one block. The third length is a feature quantity of 20 frames and the fourth length is a feature quantity of 24 frames, so the minimum of the third and fourth lengths is a feature quantity of 20 frames. The block sizes of the remaining 5 batches are then distributed uniformly between the feature quantities of 8 and 20 frames, so feature quantities of 10, 12, 14, 16 and 18 frames can be taken as their block sizes.
The benefit of this embodiment is that the even distribution enables the trained first encoder to have equal coding accuracy for sentences of various lengths, thereby improving the speech recognition effect.
In this embodiment, the feature length unit is the feature quantity of one frame, so a length value that is an integer multiple of the feature length unit is a feature quantity of an integer number of frames. A uniform distribution over the plurality of length values means that the block sizes of the remaining batches form an arithmetic progression.
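One way to read this "uniform distribution" is as evenly spaced values strictly between the second length and the minimum, which reproduces the 10/12/14/16/18 example above; the sketch below encodes that reading and is an assumption rather than the only possible interpretation.

```python
from typing import List

def evenly_spaced_chunk_sizes(second_length: int, minimum: int, n_batches: int) -> List[int]:
    """Spread the block sizes of the remaining batches evenly (an arithmetic
    progression of whole frames) between the second length and the minimum."""
    step = (minimum - second_length) / (n_batches + 1)
    return [round(second_length + step * (i + 1)) for i in range(n_batches)]

# Example from the text: second length 8 frames, minimum 20 frames, 5 remaining batches.
print(evenly_spaced_chunk_sizes(8, 20, 5))   # [10, 12, 14, 16, 18]
```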
In the training process, the same or different block sizes are set for different batches in a dynamic block mode, so that more kinds of training data can be added, the randomness of the data in the training process is increased, and the training effect is improved.
For example, assuming that the second length is a feature quantity of 8 frames and the third length is a feature quantity of 22 frames, the block size may be calculated as follows:

chunksize = l_max, with probability x;
chunksize = l, where l ~ U(8, min(22, l_max − 1)), with probability 1 − x

where chunksize is the block size; l_max, the first length, is the length of the sample speech features of the anchor sample speech sentence; l_max − 1 is the fourth length; min(22, l_max − 1) is the minimum of the third and fourth lengths; U denotes that the length l is uniformly distributed between the second length and that minimum; and x is the first ratio.
By the above formula, in the final training process half of the batches are blocked according to the first length, with each block corresponding to one speech sentence, i.e. one half of the batches are trained on whole sentences, while the block sizes of the other half take random values from 8 to min(22, l_max − 1). This increases the variety of the training data and improves the training effect.
In another example, when l_max is larger than a feature quantity of 23 frames, the block size of the other half of the batches takes values l ~ U(8, 22); that is, the block size of the other half of the batches is taken randomly from the feature quantities of 8 to 22 frames.
In another example, when l_max is smaller than a feature quantity of 22 frames, the block size of the other half of the batches takes values l ~ U(8, l_max − 1); that is, the block size of the other half of the batches is taken randomly from the feature quantities of 8 frames to l_max − 1 frames.
In another example, the calculation formula of the block size may also be set to:

chunksize = l_max, with probability x;
chunksize = l, where l ~ U(8, l_max), with probability 1 − x

That is, half of the batches are trained on whole sentences, and the block sizes of the other half take random values from feature quantities of 8 frames to l_max.
In one embodiment, when predicting the mask in the sample speech features, the block in which the mask is located may be determined first, and the mask may then be predicted based on the unmasked sample speech features within that block. For example, if a mask is located in block 6 of batch 1, and block 6 contains length units A, B, C and D of which B is masked, then mask B is predicted from the remaining length units A, C and D in block 6.
In another embodiment, shown in FIG. 21, predicting a mask in a sample speech feature includes:
step 2110, determining a block in which a mask in the sample speech feature is located;
step 2120, predicting the mask based on the unmasked sample speech features in the chunk and the sample speech features in the lot where the chunk was located, before the chunk.
In step 2120, the mask is predicted based on the unmasked sample speech features in the chunk where the mask is located, and the sample speech features in the lot where the chunk is located, before the chunk.
As shown in fig. 22, assume a segment containing sample speech features of 18 unit lengths, in which the contents at the third, eighth, twelfth and fifteenth unit lengths are masked to obtain the masked sample speech features. The masked sample speech features are divided into two batches of 9 unit lengths each, named the first batch and the second batch from left to right. The first batch is divided into a first block, a second block and a third block according to a block size of 3 unit lengths, and the second batch is divided into a fourth block, a fifth block and a sixth block.
Wherein the first mask is located in the first chunk, and the first mask is predicted based on the first two unmasked unit length sample speech features in the first chunk. The second mask is located in a third block, the third block being preceded by a first block and a second block, the second mask being predicted based on unmasked sample speech features per unit length in the third block, and the sample speech features of the first block and the second block. The third mask is located at the fourth chunk and the fourth chunk is located at the second lot, and since the fourth chunk is the first chunk in the second lot that is not preceded by a chunk, the third mask predicts based on unmasked unit length sample speech features in the fourth chunk. The fourth mask is located in the fifth chunk and the fifth chunk is located in the second lot, the fifth chunk being preceded by a fourth chunk, so the fourth mask can be predicted based on the unmasked sample speech features in the fifth chunk and the fourth chunk.
The benefit of steps 2110-2120 is that mask prediction considers not only the unmasked sample speech features within the block but also the sample speech features that precede the block within its batch. The features before the block form its preceding context, which constrains and hints at the content of the block. Performing mask prediction in combination with this context therefore improves the accuracy of the mask prediction and, in turn, the encoding accuracy of the first encoder.
In one embodiment, predicting the mask based on unmasked sample speech features in the chunk and sample speech features in a lot in which the chunk is located prior to the chunk includes:
determining the number of characteristic length units in the batch and before blocking;
taking a random number from 0 to the number of units with characteristic length;
the mask is predicted based on the unmasked sample speech features in the block and the sample speech features in random number of feature length units prior to the block.
Taking fig. 22 as an example and assuming the feature length unit is the feature quantity of one frame, one cell in the figure is a feature quantity of 22 frames, i.e. each unit length is a feature quantity of 22 frames. For the first block, since it is the first block in the first batch, the number of feature length units before it is 0. The fifth block belongs to the second batch, in which the fourth block precedes it, so the number of feature length units before the fifth block within the second batch is a feature quantity of 22 frames. Taking the fifth block as an example: when the random number taken between 0 and the number of preceding feature length units is 0, mask prediction is based only on the unmasked sample speech features in the fifth block; when the random number is 22, mask prediction is based on the unmasked sample speech features in the fifth block and the fourth block; when the random number is 11, mask prediction is based on the unmasked sample speech features in the fifth block and the second half of the sample speech features in the fourth block.
The benefit of this embodiment is that, because the length of the preceding sample speech features used for mask prediction is random, the first encoder is trained to predict and encode masks given contexts of different lengths. This improves the flexibility of mask prediction and therefore the robustness of the first encoder relative to schemes in which mask prediction can only use a fixed-length context.
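A sketch of the context selection in steps 2110-2120 with the random history length of this embodiment folded in: the masked block's own features plus a randomly sized slice of the preceding features within the same batch; the function shape and names are assumptions.

```python
import random
from typing import List, Sequence

def context_for_mask(
    chunks: Sequence[Sequence[float]],   # blocks of one batch, in time order
    chunk_index: int,                    # index of the block containing the mask
    rng: random.Random,
) -> List[float]:
    """Collect the features a masked position may use: the features of its own
    block plus a randomly sized amount of history before the block in the batch."""
    history: List[float] = [x for c in chunks[:chunk_index] for x in c]
    n_history = rng.randint(0, len(history))         # random amount of left context
    context = history[len(history) - n_history:]     # most recent n_history feature units
    context.extend(chunks[chunk_index])               # the block itself (unmasked positions used)
    return context
```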
In one embodiment, the training process of the first encoder dynamically sets the block size and dynamically allocates the historical sample speech features. Specifically, the block size of the first encoder is calculated with the formula given above, i.e. chunksize = l_max with probability x, and otherwise chunksize = l with l ~ U(8, min(22, l_max − 1)).
For each mask prediction, a random number is taken between 0 and the number of feature length units that precede the block within its batch, so as to dynamically allocate the historical sample speech features.
In one embodiment, the training process of the second encoder is essentially the same as that of the first encoder and also adopts the manner of dynamically setting the block size and dynamically allocating the historical sample speech features provided in the above embodiments. The main difference is that, when dynamically setting the block size for the second encoder, the block size ranges over feature quantities of 50 to 250 frames, corresponding to speech lengths of 2 s to 10 s.
In one embodiment, the first loss function includes a first loss sub-function, and a second loss sub-function; the second loss function includes a third loss sub-function, and a fourth loss sub-function.
As shown in fig. 23A, calculating a first loss function for a first code includes:
step 2310, inputting the second intermediate feature into the first decoder classified based on the connection timing and the second decoder based on the attention codec simultaneously;
step 2320, calculating a first loss sub-function of the first code based on the first output of the first decoder, and calculating a second loss sub-function of the first code based on the second output of the second decoder.
In this embodiment, as shown in fig. 24, the client inputs the sample speech feature into the first encoder to obtain the second intermediate feature by encoding, and inputs the second intermediate feature simultaneously into the first decoder and the second decoder. The CTC loss function loss_ctc-streaming, i.e. the first loss sub-function, is calculated from the first output of the first decoder. The second output of the second decoder is passed through a Softmax layer, after which the cross-entropy (CE) loss function of the AED-based streaming encoding, loss_aed-streaming, i.e. the second loss sub-function, is calculated. At the same time, the client sends the second intermediate feature to the cloud server through the internet or a wireless communication network.
The advantage of the above embodiment is that the first decoder based on the connection timing classification and the second decoder based on the attention codec are simultaneously used for decoding, and in this process, the first loss sub-function and the second loss sub-function are respectively generated, and are jointly used as the first loss function of the first code, so that the losses caused by different decoding modes to the first code can be comprehensively considered, and the considered aspect of the obtained first loss function is more comprehensive, and the result is more accurate.
As shown in fig. 23B, calculating a second loss function for a second code includes:
step 2330, inputting the second encoding result to the third decoder based on the connection timing classification and the fourth decoder based on the attention codec at the same time;
step 2340, calculating a third loss sub-function of the second code based on the third output of the third decoder, and calculating a fourth loss sub-function of the second code based on the fourth output of the fourth decoder.
In this embodiment, as shown in fig. 24, the cloud server receives the second intermediate feature sent by the client, inputs it into the second encoder to obtain the second encoding result, and inputs the second encoding result simultaneously into the third decoder and the fourth decoder. The third output of the third decoder is passed through a Softmax layer to calculate the CTC loss function loss_ctc-cascaded, i.e. the third loss sub-function. The fourth output of the fourth decoder is passed through a Softmax layer to calculate the CE loss function loss_aed-cascaded, i.e. the fourth loss sub-function.
The benefits of this embodiment are similar to those of fig. 23A and are not repeated.
In one embodiment, jointly training the first encoder and the second encoder based on the first loss function and the second loss function comprises:
calculating a total loss function based on the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function;
the first encoder and the second encoder are jointly trained based on the total loss function.
The benefit of this embodiment is that the overall loss function is determined by a combination of factors, thereby improving the coding accuracy of the trained first and second encoders.
In one embodiment, the total loss function is averaged over four loss subfunctions, a first loss subfunction, a second loss subfunction, a third loss subfunction, and a fourth loss subfunction.
In another embodiment, the total loss function may be obtained by a direct sum of the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function.
In another embodiment, the total loss function may be derived from a weighted sum of the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function. Specifically, calculating the total loss function based on the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function includes:
acquiring a first weight of a first loss sub-function, a second weight of a second loss sub-function, a third weight of a third loss sub-function and a fourth weight of a fourth loss sub-function;
a weighted sum of the first, second, third, and fourth loss subfunctions is calculated as a total loss function based on the first, second, third, and fourth weights.
In this embodiment, the first weight, the second weight, the third weight and the fourth weight may be set based on experience; alternatively, during training, the degree of influence of each loss sub-function on the training result may be determined and the corresponding weight set accordingly, with a larger degree of influence giving that loss sub-function a larger weight. The formula of the total loss function is:

loss = λ1·loss_ctc-streaming + λ2·loss_ctc-cascaded + λ3·loss_aed-streaming + λ4·loss_aed-cascaded

where λ1 is the first weight, λ2 is the third weight, λ3 is the second weight, and λ4 is the fourth weight.
The advantage of this embodiment is that the weighted sum of the various loss sub-functions is taken as the total loss function, which improves the flexibility of the determination of the total loss function and improves the accuracy of the determined total loss function by setting different weights for the various loss sub-functions.
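The weighted total loss can be sketched as a plain function of the four sub-losses; the equal default weights are illustrative only, since the disclosure leaves the weights to experience or to the measured influence of each sub-function.

```python
def total_loss(
    loss_ctc_streaming: float,   # first loss sub-function (client CTC branch)
    loss_aed_streaming: float,   # second loss sub-function (client AED branch)
    loss_ctc_cascaded: float,    # third loss sub-function (cloud CTC branch)
    loss_aed_cascaded: float,    # fourth loss sub-function (cloud AED branch)
    lambda1: float = 0.25,       # illustrative weights only
    lambda2: float = 0.25,
    lambda3: float = 0.25,
    lambda4: float = 0.25,
) -> float:
    """Weighted sum matching loss = λ1·loss_ctc-streaming + λ2·loss_ctc-cascaded
    + λ3·loss_aed-streaming + λ4·loss_aed-cascaded; equal weights also reproduce
    the simple-sum variant up to a constant factor."""
    return (lambda1 * loss_ctc_streaming
            + lambda2 * loss_ctc_cascaded
            + lambda3 * loss_aed_streaming
            + lambda4 * loss_aed_cascaded)
```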
As shown in fig. 25, according to one embodiment of the present disclosure, a voice recognition method is provided that may be applied to, but is not limited to, a cloud server or the server 110 provided in fig. 1, and includes:
step 2510, receiving and storing a first intermediate feature sent by the client, where the first intermediate feature is obtained by the client performing a first encoding on a target voice feature of the target voice;
step 2520, receiving a voice recognition request from the client, wherein the voice recognition request is sent by the client when the first confidence is not higher than a first threshold, and the first confidence is a confidence of a first recognition text acquired by the client based on the first intermediate feature;
step 2530, performing second encoding on the stored first intermediate feature, acquiring a second recognition text based on a result of the second encoding, and determining the second recognition text as a voice recognition result.
It can be understood that the specific implementation details and the corresponding beneficial effects of the voice recognition method applied to the cloud server side in the embodiments of the present disclosure correspond to the voice recognition method applied to the client side provided in the foregoing embodiments, and are not described herein again.
Implementation details of the speech recognition method of the embodiments of the present disclosure
Referring now to fig. 4, the implementation of the speech recognition method according to the embodiment of the present disclosure will be described in detail.
The client picks up the voice of the object to obtain the target voice, and locally performs feature extraction on the target voice to obtain the target voice feature. The client inputs the target voice feature into a first encoder for the first encoding, where the first encoder is a streaming encoder. The first encoder divides the target voice feature into blocks according to a first time window and performs the first encoding on each block to obtain a block code; the block codes are cascaded in time order to obtain the first intermediate feature corresponding to the target voice feature. The client sends the first intermediate feature to the server for backup.
And locally at the client, inputting the first intermediate feature into a decoder for decoding to obtain a first identification text and a first confidence of the first identification text. In an example, when the decoder of the client is a first decoder based on the connection timing classification, a posterior probability of the connection timing classification is generated while decoding the first recognition text, and the posterior probability is used as the first confidence. In another example, a plurality of first candidate texts, and a first probability of the plurality of first candidate texts, are obtained by a decoder based on the first intermediate feature, wherein the first identified text is a first candidate text with a first maximum probability; the first candidate texts and the first probabilities of the first candidate texts are input into a first confidence prediction model, and predicted first confidence is obtained.
And judging whether the first confidence coefficient is higher than a first threshold value, if so, sending the first identification text to a punctuation model of the cloud server, adding punctuation marks to the first identification text through the punctuation model, sending the first identification text after punctuation is added back to the client by the cloud server, and displaying the first identification text after punctuation is added as a voice identification result by the client.
If the first confidence coefficient is lower than or equal to a first threshold value, the client sends a voice recognition request to the cloud server, and the cloud server responds to the voice recognition request to conduct secondary voice recognition.
In the cloud server, in response to the voice recognition request sent by the client, a plurality of first intermediate features are acquired from the backup storage space: according to the second time window, the first intermediate features corresponding to the plurality of block codes within one second time window are extracted from the backup and taken as a group, where the second time window includes a plurality of first time windows. The group is input into the second encoder for the second encoding to obtain a group code, and the group codes are cascaded according to the time order of the groups to obtain the result of the second encoding. The result of the second encoding is input into a decoder to obtain the second recognition text. The second confidence of the second recognition text is predicted and judged by the re-scoring module of the cloud server; when the second confidence is higher than the second threshold, the second recognition text is credible. The second recognition text whose second confidence meets the second threshold is input into the punctuation model to add punctuation marks, and the completed text is obtained as the voice recognition result and sent to the client. The client receives and displays the voice recognition result sent by the cloud server.
In this example, the speech encoding in speech recognition is divided into a first encoding and a second encoding; the first encoding is executed at the client and the second encoding is executed in the cloud. When the first confidence of the first encoding is higher than the first threshold, the speech recognition result of the client is accurate enough to be relied on completely. When the first confidence is not higher than the first threshold, the speech recognition result of the client is not accurate enough, and the more accurate result obtained by the second encoding is requested from the cloud server, so that the local terminal and the cloud computing resources cooperate and the processing overhead of the cloud server is reduced while the speech recognition accuracy is guaranteed. Meanwhile, in the embodiment of the disclosure, what is transmitted to the cloud server is the first intermediate feature obtained by the first encoding of the target voice feature rather than the target voice itself, which protects the voiceprint information of the user and improves the security of the object information.
Apparatus and device descriptions of embodiments of the present disclosure
It will be appreciated that, although the steps in the various flowcharts described above are shown in succession in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated in this embodiment, the order of the steps is not strictly limited and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of execution of these steps or stages is not necessarily sequential; they may be performed in turn or alternately with at least some of the steps or stages of other steps.
In the various embodiments of the present disclosure, when related processing is performed according to data related to the task content characteristics, such as the task content attribute information or the attribute information set, permission or consent of the task content is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the present disclosure needs to acquire the attribute information of the task content, the embodiment of the present disclosure may acquire the independent permission or independent consent of the task content by means of a pop-up window or a jump to a confirmation page, and acquire the necessary task content related data for enabling the embodiment of the present disclosure to operate normally after explicitly acquiring the independent permission or independent consent of the task content.
Fig. 26 is a schematic structural diagram of a voice recognition device 2600 according to an embodiment of the disclosure. The voice recognition apparatus 2600 may be applied to a client, or the object terminal 130 provided in fig. 1, and the voice recognition apparatus 2600 includes:
a first encoding unit 2610, configured to perform first encoding on a target voice feature of a target voice to obtain a first intermediate feature, and to back up the first intermediate feature to a cloud server;
a first text obtaining unit 2620, configured to obtain a first recognition text based on the first intermediate feature;
a confidence obtaining unit 2630, configured to obtain a first confidence of the first recognition text;
a first output unit 2640, configured to determine the first recognition text as a speech recognition result if the first confidence is higher than a first threshold;
a voice recognition request unit 2650, configured to send a voice recognition request to the cloud server if the first confidence is equal to or lower than the first threshold, so that the cloud server performs second encoding on the backed-up first intermediate feature, obtains a second recognition text based on a result of the second encoding, and determines the second recognition text as the speech recognition result.
Optionally, the first encoding unit 2610 is specifically configured to:
extracting the target voice feature from the target voice;
dividing the target voice feature into blocks;
performing the first encoding on each block to obtain a block code;
and cascading the block codes in block order to obtain the first intermediate feature.
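A minimal sketch of this block-wise first encoding follows, assuming a toy per-block "encoder" (a frame average) and an arbitrary block size of 16 frames; neither is the actual first encoder of the disclosure.

```python
# Block-wise first encoding sketch: cut the target feature along time,
# encode each block, and cascade (concatenate) the block codes in order.
import numpy as np

rng = np.random.default_rng(0)
target_feature = rng.standard_normal((100, 80))   # (frames, feature dim)
BLOCK_FRAMES = 16                                  # assumed first-time-window size

def encode_block(block):
    # Stand-in for the first encoder applied to one block.
    return block.mean(axis=0, keepdims=True)

blocks = [target_feature[i:i + BLOCK_FRAMES]
          for i in range(0, len(target_feature), BLOCK_FRAMES)]
block_codes = [encode_block(b) for b in blocks]
first_intermediate = np.concatenate(block_codes, axis=0)   # cascaded in block order
print(first_intermediate.shape)
```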
Optionally, the first encoding unit 2610 is further specifically configured to:
dividing the target voice feature into blocks according to a first time window;
the voice recognition request unit 2650 is specifically configured to:
sending a voice recognition request to the cloud server, so that the cloud server takes a plurality of block codes within a second time window as a group and performs the second encoding on the group to obtain a group code, wherein the second time window includes a plurality of first time windows, and the group codes are cascaded in group order to obtain the result of the second encoding.
Optionally, the first encoding unit 2610 is specifically configured to:
for each block code in the first intermediate feature, sending the block code to the cloud server for caching after the block code is generated;
taking a plurality of block codes within a second time window as a group and performing the second encoding on the group includes: when the second time window ends, taking out the plurality of block codes that correspond to the plurality of first time windows within the second time window and are cached in the cloud server, and performing the second encoding on them as the group.
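The cloud-side grouping could look like the following sketch, in which one second time window is assumed to span four first time windows and the "second encoder" is a dummy aggregation; both assumptions are for illustration only.

```python
# Group cached block codes per second time window and "second-encode" each group.
import numpy as np

BLOCKS_PER_GROUP = 4           # assumed: one second window spans four first windows
cached_block_codes = [np.full((1, 8), i, dtype=float) for i in range(8)]

def second_encode(group):
    # Stand-in for the second encoder applied to one group of block codes.
    return np.concatenate(group, axis=0).sum(axis=0, keepdims=True)

group_codes = []
for start in range(0, len(cached_block_codes), BLOCKS_PER_GROUP):
    group = cached_block_codes[start:start + BLOCKS_PER_GROUP]
    group_codes.append(second_encode(group))
second_result = np.concatenate(group_codes, axis=0)   # cascaded in group order
print(second_result.shape)
```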
Optionally, the first text obtaining unit 2620 is specifically configured to:
obtaining the first recognition text based on the first intermediate feature through a first decoder that is based on connectionist temporal classification;
the confidence obtaining unit 2630 is specifically configured to:
taking the connectionist temporal classification posterior probability obtained by the first decoder as the first confidence.
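One common way to read a confidence out of a CTC-style decoder, assumed here purely for illustration, is to greedily decode the per-frame posteriors and average the winning posterior probabilities:

```python
# Greedy CTC-style decoding over random "posteriors"; the mean winning
# posterior is used as the first confidence. Vocabulary size is assumed.
import numpy as np

rng = np.random.default_rng(1)
logits = rng.standard_normal((50, 30))                       # (frames, vocab incl. blank)
posteriors = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

best_ids = posteriors.argmax(axis=-1)                        # greedy path
first_confidence = posteriors.max(axis=-1).mean()            # average winning posterior

BLANK = 0
collapsed = [i for i, prev in zip(best_ids, np.r_[-1, best_ids[:-1]])
             if i != prev and i != BLANK]                    # collapse repeats, drop blanks
print(collapsed[:10], float(first_confidence))
```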
Optionally, the first text obtaining unit 2620 is specifically configured to:
acquiring a plurality of first candidate texts and first probabilities of the plurality of first candidate texts through the first decoder based on the first intermediate feature, wherein the first recognition text is the first candidate text with the largest first probability;
the confidence obtaining unit 2630 is specifically configured to:
inputting the plurality of first candidate texts and the first probabilities of the plurality of first candidate texts into a first confidence prediction model to obtain a predicted first confidence.
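A toy sketch of such a confidence prediction model is given below; the input features (n-best probabilities plus a simple text-length feature) and the small feed-forward architecture are assumptions, not the model described in the disclosure.

```python
# Toy first-confidence predictor over an n-best list.
import torch
import torch.nn as nn

N_BEST = 5

class ConfidencePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * N_BEST, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, probs, text_lens):
        # probs, text_lens: (batch, N_BEST) tensors describing the n-best list.
        return self.net(torch.cat([probs, text_lens], dim=-1)).squeeze(-1)

candidates = ["open the door", "open a door", "opened door", "open adore", "pen the door"]
probs = torch.tensor([[0.40, 0.25, 0.15, 0.12, 0.08]])
lens = torch.tensor([[float(len(c.split())) for c in candidates]])
model = ConfidencePredictor()
print(model(probs, lens))          # predicted first confidence for the top candidate
```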
Optionally, the first encoding is performed by a first encoder and the second encoding is performed by a second encoder; the first encoder and the second encoder are jointly trained by:
inputting sample voice features of a sample voice into the first encoder to obtain a second intermediate feature, and calculating a first loss function of the first encoding;
inputting the second intermediate feature into the second encoder to obtain a second encoding result, and calculating a second loss function of the second encoding;
the first encoder and the second encoder are jointly trained based on the first loss function and the second loss function.
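The following PyTorch sketch shows the shape of such a joint training step, with toy linear encoders and toy reconstruction losses standing in for the real first/second encoders and loss functions; everything concrete here is an assumption for illustration.

```python
# Joint training sketch: both encoders are updated from a combined loss.
import torch
import torch.nn as nn

first_encoder = nn.Linear(80, 64)
second_encoder = nn.Linear(64, 64)
optimizer = torch.optim.Adam(
    list(first_encoder.parameters()) + list(second_encoder.parameters()), lr=1e-3)

sample_features = torch.randn(8, 80)            # one batch of sample speech features
target = torch.zeros(8, 64)                     # toy supervision target

second_intermediate = first_encoder(sample_features)
first_loss = nn.functional.mse_loss(second_intermediate, target)   # loss of first encoding
second_result = second_encoder(second_intermediate)
second_loss = nn.functional.mse_loss(second_result, target)        # loss of second encoding

total = first_loss + second_loss
optimizer.zero_grad()
total.backward()
optimizer.step()
print(float(first_loss), float(second_loss))
```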
Optionally, the first loss function includes a first loss sub-function and a second loss sub-function, and the second loss function includes a third loss sub-function and a fourth loss sub-function;
calculating the first loss function of the first encoding includes: simultaneously inputting the second intermediate feature into a first decoder based on connectionist temporal classification and a second decoder based on attention-based encoding and decoding; calculating the first loss sub-function of the first encoding based on a first output of the first decoder, and calculating the second loss sub-function of the first encoding based on a second output of the second decoder;
calculating the second loss function of the second encoding includes: simultaneously inputting the second encoding result into a third decoder based on connectionist temporal classification and a fourth decoder based on attention-based encoding and decoding; calculating the third loss sub-function of the second encoding based on a third output of the third decoder, and calculating the fourth loss sub-function of the second encoding based on a fourth output of the fourth decoder.
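As an illustration of computing two loss sub-functions from the same intermediate feature, the sketch below pairs a CTC branch with a simplified stand-in for the attention-based branch (reduced to a per-frame cross-entropy); shapes, vocabulary size, and the heads themselves are assumptions, not the decoders of the disclosure.

```python
# Two loss sub-functions computed from one shared intermediate feature.
import torch
import torch.nn as nn

T, B, V, L = 40, 2, 30, 6                       # frames, batch, vocab, label length
second_intermediate = torch.randn(T, B, 64)

ctc_head = nn.Linear(64, V)                     # stand-in for the CTC decoder
attn_head = nn.Linear(64, V)                    # stand-in for the attention decoder

log_probs = ctc_head(second_intermediate).log_softmax(-1)      # (T, B, V)
targets = torch.randint(1, V, (B, L))
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets,
                               torch.full((B,), T, dtype=torch.long),
                               torch.full((B,), L, dtype=torch.long))

# Simplified attention-branch loss: per-frame cross entropy against frame labels.
frame_labels = torch.randint(0, V, (T, B))
attn_loss = nn.CrossEntropyLoss()(attn_head(second_intermediate).reshape(-1, V),
                                  frame_labels.reshape(-1))
print(float(ctc_loss), float(attn_loss))
```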
Optionally, jointly training the first encoder and the second encoder based on the first loss function and the second loss function, comprising:
calculating a total loss function based on the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function;
the first encoder and the second encoder are jointly trained based on the total loss function.
Optionally, calculating the total loss function based on the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function includes:
acquiring a first weight of the first loss sub-function, a second weight of the second loss sub-function, a third weight of the third loss sub-function, and a fourth weight of the fourth loss sub-function;
calculating, based on the first weight, the second weight, the third weight, and the fourth weight, a weighted sum of the first, second, third, and fourth loss sub-functions as the total loss function.
Optionally, inputting the sample speech feature of the sample speech into the first encoder to obtain a second intermediate feature, including:
extracting the sample voice features from the sample voice;
replacing a portion of the features in the sample speech features with a mask;
and inputting the mask-replaced sample voice features into the first encoder, predicting the masked positions in the sample voice features through the first encoder, and encoding the sample voice features completed with the predicted values to obtain the second intermediate feature.
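A minimal sketch of the mask-replacement step follows, assuming a 15% mask ratio, a zero mask vector, and a moving-average stand-in for the encoder's mask prediction; all three are illustrative assumptions.

```python
# Replace a fraction of frames with a mask vector, then "predict" them back.
import numpy as np

rng = np.random.default_rng(2)
sample_feature = rng.standard_normal((120, 80))      # (frames, feature dim)
MASK_RATIO = 0.15
mask_vector = np.zeros(80)

masked = sample_feature.copy()
mask_idx = rng.choice(len(masked), size=int(MASK_RATIO * len(masked)), replace=False)
masked[mask_idx] = mask_vector                        # replace part of the features

# The first encoder would predict the frames at mask_idx and then encode the
# completed feature; here a local moving average stands in for that prediction.
predicted = masked.copy()
for i in sorted(mask_idx):
    lo, hi = max(0, i - 2), min(len(masked), i + 3)
    predicted[i] = np.delete(masked[lo:hi], i - lo, axis=0).mean(axis=0)
print(masked.shape, len(mask_idx))
```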
Optionally, inputting the mask-replaced sample speech feature into a first encoder, comprising:
dividing the sample speech features after mask replacement into a plurality of batches;
setting a block size for each batch;
dividing each batch into blocks according to the block size corresponding to the batch; and
inputting the blocks into the first encoder in sequence.
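The batch/block pipeline could be sketched as follows, with arbitrary batch and block sizes and a dummy per-block encoder; the concrete numbers are assumptions.

```python
# Split features into batches, give each batch its own block size, and feed
# the blocks of each batch to the first encoder in order.
import numpy as np

rng = np.random.default_rng(3)
features = rng.standard_normal((64, 100, 80))         # (utterances, frames, dim)
BATCH_SIZE = 16

def first_encoder(block):
    return block.mean(axis=1)                          # dummy per-block encoding

batches = [features[i:i + BATCH_SIZE] for i in range(0, len(features), BATCH_SIZE)]
block_sizes = [100, 25, 50, 100]                       # one block size per batch (assumed)

for batch, block_size in zip(batches, block_sizes):
    for start in range(0, batch.shape[1], block_size): # cut along the time axis
        block = batch[:, start:start + block_size]
        _ = first_encoder(block)                       # blocks enter the encoder in order
```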
Optionally, the sample speech comprises a plurality of sample speech sentences;
the joint training process of the first encoder and the second encoder further includes, before replacing a portion of the features in the sample speech features with the mask:
determining the longest sample speech sentence among the plurality of sample speech sentences as an anchor sample speech sentence;
and padding the sample voice features of the sample voice sentences other than the anchor sample voice sentence to a first length, wherein the first length is the length of the sample voice features of the anchor sample voice sentence.
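A short sketch of this length completion, assuming zero-padding up to the anchor length (the padding value is not specified in the disclosure):

```python
# Pad every non-anchor sentence feature to the first length (anchor length).
import numpy as np

rng = np.random.default_rng(4)
sentence_feats = [rng.standard_normal((n, 80)) for n in (57, 100, 73)]  # variable lengths

first_length = max(f.shape[0] for f in sentence_feats)   # length of the anchor sentence
padded = [np.pad(f, ((0, first_length - f.shape[0]), (0, 0))) for f in sentence_feats]
batch = np.stack(padded)                                  # (sentences, first_length, 80)
print(batch.shape)
```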
Optionally, the length of each batch is an integer multiple of the first length;
setting the block size for each batch includes:
setting the block sizes of a first ratio of the plurality of batches to the first length, wherein the first ratio is not less than 50%;
setting the block sizes of the remaining batches of the plurality of batches to a length between a predetermined second length and the first length.
Optionally, setting the block sizes of the remaining batches of the plurality of batches to a length between the predetermined second length and the first length includes:
determining a fourth length equal to the first length minus one feature length unit;
determining the minimum value of a predetermined third length and the fourth length, wherein the third length and the fourth length are both greater than the second length;
setting the block sizes of the remaining batches of the plurality of batches to a length between the second length and the minimum value.
Optionally, setting the block sizes of the remaining batches of the plurality of batches to a length between the second length and the minimum value includes:
determining a plurality of length values that lie between the second length and the minimum value and are integer multiples of the feature length unit;
distributing the block sizes of the remaining batches of the plurality of batches uniformly over the plurality of length values.
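Putting the above together, a sketch of the block-size schedule might look like the following, where the feature length unit, the first/second/third lengths, and the first ratio are all assumed values:

```python
# Assign a block size per batch: at least half the batches use the full first
# length; the rest draw uniformly from unit multiples between the second
# length and min(third length, first length - one unit).
import numpy as np

rng = np.random.default_rng(5)
NUM_BATCHES, UNIT = 20, 4            # feature length unit = 4 frames (assumed)
first_length, second_length, third_length = 100, 8, 64
FIRST_RATIO = 0.6

fourth_length = first_length - UNIT                       # first length minus one unit
upper = min(third_length, fourth_length)
candidates = np.arange(second_length, upper + 1, UNIT)    # multiples of the unit

block_sizes = []
for _ in range(NUM_BATCHES):
    if rng.random() < FIRST_RATIO:
        block_sizes.append(first_length)                  # full-length block
    else:
        block_sizes.append(int(rng.choice(candidates)))   # uniformly drawn block size
print(block_sizes)
```

Mixing full-length blocks with shorter randomly sized blocks is what lets one trained encoder serve both full-utterance and streaming-style block inputs.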
Optionally, predicting the mask in the sample speech feature includes:
determining the block in which a mask in the sample speech features is located;
predicting the mask based on the unmasked sample speech features in the block and the sample speech features that precede the block in the batch in which the block is located.
Optionally, predicting the mask based on the unmasked sample speech features in the block and the sample speech features that precede the block in the batch in which the block is located includes:
determining the number of feature length units that precede the block in the batch;
taking a random number between 0 and the number of feature length units;
predicting the mask based on the unmasked sample speech features in the block and the sample speech features in the random number of feature length units that precede the block.
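A sketch of this limited-left-context mask prediction, with an assumed feature length unit of 4 frames, a 16-frame block, and a dummy mean "prediction":

```python
# Predict a masked frame from its block plus a random number of left-context units.
import numpy as np

rng = np.random.default_rng(6)
UNIT, BLOCK = 4, 16
feature = rng.standard_normal((96, 80))          # one batch item (frames, dim)
block_start = 48
mask_pos = 53                                     # masked frame inside the block

units_before = block_start // UNIT                # feature length units before the block
context_units = rng.integers(0, units_before + 1) # random number in [0, units_before]
ctx_start = block_start - context_units * UNIT

context = feature[ctx_start:block_start]                        # limited left context
block = np.delete(feature[block_start:block_start + BLOCK],
                  mask_pos - block_start, axis=0)               # unmasked block frames
prediction = np.concatenate([context, block], axis=0).mean(axis=0)
print(int(context_units), prediction.shape)
```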
Optionally, the first encoding is performed by a first encoder and the second encoding is performed by a second encoder; the number of parameters of the first encoder is smaller than the number of parameters of the second encoder.
Optionally, acquiring the second recognition text based on the result of the second encoding and determining the second recognition text as the speech recognition result includes:
inputting the result of the second encoding into a third decoder to obtain a plurality of second candidate texts and second probabilities of the plurality of second candidate texts, wherein the second recognition text is the second candidate text with the largest second probability;
inputting the plurality of second candidate texts and the second probabilities of the plurality of second candidate texts into a second confidence prediction model to obtain a predicted second confidence;
and if the second confidence is higher than a second threshold, determining the second recognition text as a voice recognition result.
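A minimal sketch of this cloud-side decision, with hypothetical names and an assumed second threshold of 0.8:

```python
# Cloud-side acceptance check on the second recognition text.
SECOND_THRESHOLD = 0.8                      # assumed value of the second threshold

def cloud_decide(second_candidates, second_probs, second_confidence):
    best = second_candidates[second_probs.index(max(second_probs))]
    if second_confidence > SECOND_THRESHOLD:
        return best                         # second recognition text becomes the result
    return None                             # otherwise no result is returned here

print(cloud_decide(["turn on the light", "turn on delight"], [0.7, 0.3], 0.93))
```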
Fig. 27 is a schematic structural diagram of a voice recognition apparatus 2700 according to another embodiment of the present disclosure. The voice recognition apparatus 2700 may be applied to a cloud server, such as the server 110 shown in fig. 1, and includes:
a feature receiving unit 2710, configured to receive and store a first intermediate feature sent by the client, where the first intermediate feature is obtained by performing first encoding on a target speech feature of the target speech by the client;
a request receiving unit 2720, configured to receive a voice recognition request from a client, where the voice recognition request is sent by the client when it is recognized that a first confidence is not higher than a first threshold, and the first confidence is a confidence of a first recognition text acquired by the client based on a first intermediate feature;
a second encoding unit 2730 for performing second encoding on the stored first intermediate features;
a second text acquisition unit 2740 for acquiring a second recognition text based on the result of the second encoding and determining the second recognition text as a speech recognition result.
It is to be understood that implementation details of the voice recognition apparatus 2700 according to the embodiments of the present disclosure can be understood with reference to the voice recognition apparatus 2600 of the above embodiment, and are not repeated here.
Referring to fig. 28, fig. 28 is a block diagram of part of a terminal implementing the voice recognition method according to an embodiment of the present disclosure. The terminal includes: a radio frequency (RF) circuit 2810, a memory 2815, an input unit 2830, a display unit 2840, a sensor 2850, an audio circuit 2860, a wireless fidelity (WiFi) module 2870, a processor 2880, and a power supply 2890. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 28 does not constitute a limitation on the terminal (such as a cell phone or computer), which may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The RF circuit 2810 may be used to receive and transmit signals during information transmission and reception or during a call. In particular, after receiving downlink information from a base station, it delivers the information to the processor 2880 for processing, and it sends uplink data to the base station.
The memory 2815 may be used to store software programs and modules, and the processor 2880 performs various functional applications and data processing of the terminal by running the software programs and modules stored in the memory 2815.
The input unit 2830 may be used to receive input numeric or character information and to generate key signal inputs related to the settings and function control of the terminal. Specifically, the input unit 2830 may include a touch panel 2831 and other input devices 2832.
The display unit 2840 may be used to display input information or provided information and the various menus of the terminal. The display unit 2840 may include a display panel 2841.
Audio circuitry 2860, speaker 2861, and microphone 2862 may provide an audio interface.
In this embodiment, the processor 2880 included in the terminal may perform the voice recognition method of the previous embodiment.
Terminals of the embodiments of the present disclosure include, but are not limited to, cell phones, computers, intelligent voice interaction devices, intelligent home appliances, vehicle-mounted terminals, aircraft, and the like. The embodiments of the present disclosure may be applied to a variety of scenarios, including but not limited to speech processing, natural language learning, and the like.
Fig. 29 is a block diagram of part of a server implementing the voice recognition method of an embodiment of the present disclosure. Servers may vary widely in configuration or performance and may include one or more central processing units (CPU) 2922 (e.g., one or more processors), memory 2932, and one or more storage media 2930 (e.g., one or more mass storage devices) storing applications 2942 or data 2944. The memory 2932 and the storage medium 2930 may be transitory or persistent storage. The program stored on the storage medium 2930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 2922 may be arranged to communicate with the storage medium 2930 and execute, on the server, the series of instruction operations in the storage medium 2930.
The server(s) may also include one or more power supplies 2926, one or more wired or wireless network interfaces 2950, one or more input/output interfaces 2958, and/or one or more operating systems 2941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The central processor 2922 in the server may be used to perform the voice recognition methods of the embodiments of the present disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium storing a program code for executing the speech recognition method of the foregoing embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program. The processor of the computer device reads the computer program and executes it, causing the computer device to execute the speech recognition method as described above.
The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
It should be understood that in the description of the embodiments of the present disclosure, "a plurality of" (or "multiple") means two or more; "greater than", "less than", "exceeding", and the like are understood as excluding the stated number, while "above", "below", "within", and the like are understood as including the stated number.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should also be appreciated that the various implementations provided by the embodiments of the present disclosure may be arbitrarily combined to achieve different technical effects.
The above is a specific description of the embodiments of the present disclosure, but the present disclosure is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present disclosure, and are included in the scope of the present disclosure as defined in the claims.

Claims (20)

1. A voice recognition method, wherein the voice recognition method is applied to a client, and the voice recognition method comprises:
performing first coding on target voice characteristics of target voice to obtain first intermediate characteristics, and backing up the first intermediate characteristics to a cloud server;
acquiring a first identification text based on the first intermediate feature;
acquiring a first confidence coefficient of the first identification text;
and if the first confidence coefficient is higher than a first threshold value, determining the first recognition text as a voice recognition result, otherwise, sending a voice recognition request to the cloud server so that the cloud server carries out second encoding on the backed-up first intermediate feature, acquiring a second recognition text based on the result of the second encoding, and determining the second recognition text as the voice recognition result.
2. The method of claim 1, wherein first encoding the target speech feature of the target speech to obtain a first intermediate feature comprises:
extracting the target voice characteristics from the target voice;
dividing the target voice feature into blocks;
performing first coding on each block to obtain a block code;
and cascading the block codes according to the block sequence to obtain the first intermediate feature.
3. The method of claim 2, wherein the dividing the target speech feature into blocks comprises: dividing the target voice feature into blocks according to a first time window;
said second encoding of said first intermediate feature of the backup comprises:
taking a plurality of block codes in a second time window as a group, and performing second coding on the group to obtain a group code, wherein the second time window comprises a plurality of first time windows;
and cascading the group codes according to the sequence of the groups to obtain the result of the second code.
4. The method of claim 3, wherein backing up the first intermediate feature to a cloud server comprises: for each block code in the first intermediate feature, after the block code is generated, sending the block code to the cloud server for caching;
Said second encoding said group with a plurality of said block codes within a second time window as a group, comprising: and when the second time window is finished, taking out a plurality of block codes which correspond to the first time windows in the second time window and are cached in the cloud server, and taking the block codes as the group to perform the second coding.
5. The speech recognition method of claim 1, wherein the first encoding is performed by a first encoder and the second encoding is performed by a second encoder; the first encoder and the second encoder are jointly trained by:
inputting sample speech features of the sample speech into the first encoder to obtain second intermediate features, and calculating a first loss function of the first encoding;
inputting the second intermediate feature into the second encoder to obtain a second encoding result, and calculating a second loss function of the second encoding;
the first encoder and the second encoder are jointly trained based on the first loss function and the second loss function.
6. The method of claim 5, wherein the first loss function comprises a first loss sub-function and a second loss sub-function; the second loss function includes a third loss sub-function, and a fourth loss sub-function;
Said calculating a first loss function for said first encoding, comprising: simultaneously inputting the second intermediate feature into a first decoder based on connectionist temporal classification and a second decoder based on attention-based encoding and decoding; calculating a first loss sub-function of the first encoding based on a first output of the first decoder, and calculating a second loss sub-function of the first encoding based on a second output of the second decoder;
said calculating a second loss function for said second encoding, comprising: simultaneously inputting the second encoding result into a third decoder based on connectionist temporal classification and a fourth decoder based on attention-based encoding and decoding; calculating a third loss sub-function of the second encoding based on a third output of the third decoder, and calculating a fourth loss sub-function of the second encoding based on a fourth output of the fourth decoder.
7. The method of claim 6, wherein the jointly training the first encoder and the second encoder based on the first loss function and the second loss function comprises:
calculating a total loss function based on the first loss sub-function, the second loss sub-function, the third loss sub-function, and the fourth loss sub-function;
The first encoder and the second encoder are jointly trained based on the total loss function.
8. The method of claim 5, wherein inputting the sample speech feature of the sample speech into the first encoder to obtain a second intermediate feature comprises:
extracting the sample voice features from the sample voice;
replacing a portion of the features in the sample speech features with a mask;
and inputting the sample voice features with the mask replaced into the first encoder, predicting the mask in the sample voice features through the first encoder, and encoding the sample voice features with the predicted mask to obtain the second intermediate features.
9. The method of claim 8, wherein the inputting the mask-replaced sample speech features into the first encoder comprises:
dividing the sample speech features after mask replacement into a plurality of batches;
setting a block size for each of the batches;
dividing each batch into blocks according to the block size corresponding to the batch; and
inputting the blocks into the first encoder in sequence.
10. The method of claim 9, wherein the sample speech comprises a plurality of sample speech sentences;
the joint training process of the first encoder and the second encoder further includes, prior to replacing a portion of the features in the sample speech features with a mask:
determining the longest sample speech sentence among the plurality of sample speech sentences as an anchor sample speech sentence;
and padding the sample voice features of the sample voice sentences other than the anchor sample voice sentence among the plurality of sample voice sentences to a first length, wherein the first length is the length of the sample voice features of the anchor sample voice sentence.
11. The method of claim 10, wherein the length of each of the batches is an integer multiple of the first length;
said setting a block size for each of said batches, comprising:
setting the block sizes of the batches of a first ratio among the plurality of batches to the first length, wherein the first ratio is not less than 50%;
setting the block sizes of the remaining batches of the plurality of batches to a length between a predetermined second length and the first length.
12. The method according to claim 11, wherein the setting the block sizes of the remaining batches of the plurality of batches to a length between a predetermined second length and the first length includes:
determining a fourth length equal to the first length minus one feature length unit;
determining the minimum value of a predetermined third length and the fourth length, wherein the third length and the fourth length are both greater than the second length;
setting the block sizes of the remaining batches of the plurality of batches to a length between the second length and the minimum value.
13. The method of claim 8, wherein predicting the mask in the sample speech features comprises:
determining the block in which the mask in the sample speech feature is located;
predicting the mask based on the unmasked sample speech features in the block and the sample speech features that precede the block in the batch in which the block is located.
14. The method according to claim 1, wherein the acquiring a second recognition text based on the result of the second encoding and determining the second recognition text as the speech recognition result includes:
Inputting a result based on the second encoding into a third decoder to obtain a plurality of second candidate texts and a plurality of second probabilities of the second candidate texts, wherein the second identification text is the second candidate text with the largest second probability;
inputting a plurality of second candidate texts and a plurality of second probabilities of the second candidate texts into a second confidence prediction model to obtain predicted second confidence;
and if the second confidence is higher than a second threshold, determining the second recognition text as the voice recognition result.
15. A voice recognition method, wherein the voice recognition method is applied to a cloud server, and the voice recognition method comprises:
receiving and storing a first intermediate feature sent by a client, wherein the first intermediate feature is obtained by performing first encoding on target voice features of target voice by the client;
receiving a voice recognition request from the client, wherein the voice recognition request is sent by the client when a first confidence degree is not higher than a first threshold value, and the first confidence degree is the confidence degree of a first recognition text acquired by the client based on the first intermediate feature;
And carrying out second coding on the stored first intermediate features, acquiring second recognition text based on the result of the second coding, and determining the second recognition text as the voice recognition result.
16. A speech recognition apparatus, comprising:
the first coding unit is used for carrying out first coding on target voice characteristics of target voice so as to obtain first intermediate characteristics, and backing up the first intermediate characteristics to the cloud server;
a first text obtaining unit, configured to obtain a first recognition text based on the first intermediate feature;
the confidence coefficient acquisition unit is used for acquiring a first confidence coefficient of the first identification text;
a first output unit configured to determine the first recognition text as a speech recognition result if the first confidence level is higher than a first threshold;
and the voice recognition request unit is used for sending a voice recognition request to the cloud server if the first confidence coefficient is equal to or lower than a first threshold value so that the cloud server carries out second encoding on the backed-up first intermediate feature, acquiring a second recognition text based on a result of the second encoding, and determining the second recognition text as the voice recognition result.
17. A speech recognition apparatus, comprising:
the device comprises a feature receiving unit, a feature coding unit and a feature decoding unit, wherein the feature receiving unit is used for receiving and storing a first intermediate feature sent by a client, and the first intermediate feature is obtained by first coding a target voice feature of target voice by the client;
a request receiving unit, configured to receive a voice recognition request from the client, where the voice recognition request is sent by the client when a first confidence level is recognized to be not higher than a first threshold, and the first confidence level is a confidence level of a first recognition text acquired by the client based on the first intermediate feature;
a second encoding unit, configured to perform second encoding on the stored first intermediate feature;
and a second text acquisition unit configured to acquire a second recognition text based on a result of the second encoding, and determine the second recognition text as the speech recognition result.
18. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the speech recognition method according to any one of claims 1 to 15 when executing the computer program.
19. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 15.
20. A computer program product comprising a computer program, which computer program is read and executed by a processor of a computer device, causing the computer device to perform the speech recognition method according to any one of claims 1 to 15.
CN202311038160.0A 2023-08-16 2023-08-16 Speech recognition method, related device and medium Pending CN117059104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311038160.0A CN117059104A (en) 2023-08-16 2023-08-16 Speech recognition method, related device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311038160.0A CN117059104A (en) 2023-08-16 2023-08-16 Speech recognition method, related device and medium

Publications (1)

Publication Number Publication Date
CN117059104A true CN117059104A (en) 2023-11-14

Family

ID=88662235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311038160.0A Pending CN117059104A (en) 2023-08-16 2023-08-16 Speech recognition method, related device and medium

Country Status (1)

Country Link
CN (1) CN117059104A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication