CN111524517B - Speech recognition method, device, equipment and storage medium - Google Patents

Speech recognition method, device, equipment and storage medium

Info

Publication number
CN111524517B
CN111524517B (Application CN202010595832.8A)
Authority
CN
China
Prior art keywords
text
candidate
preset
auxiliary
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010595832.8A
Other languages
Chinese (zh)
Other versions
CN111524517A (en)
Inventor
连荣忠
姜迪
徐倩
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010595832.8A
Publication of CN111524517A
Application granted
Publication of CN111524517B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice recognition method, device, equipment and storage medium. The method includes: acquiring voice data to be recognized, and determining candidate texts of the voice data to be recognized and the above text data (the preceding dialogue context) of the candidate texts; extracting an auxiliary text from the candidate texts; and extracting an output text from the candidate texts based on the above text data and the auxiliary text, and outputting the output text. The application solves the technical problem of low speech recognition accuracy in the prior art.

Description

Speech recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence in financial technology (Fintech), and in particular to a speech recognition method, device, equipment and storage medium.
Background
With the continuous development of financial technology, especially internet finance, more and more technologies are being applied in the finance field, and the finance industry in turn places higher requirements on these technologies; for example, it places higher requirements on speech recognition.
At present, a language model in a traditional ASR (Automatic Speech Recognition) algorithm is generally used to decode and recognize the dialogue content that currently needs to be processed. However, such decoding and recognition of the current dialogue content has strong limitations: for example, the content decoded by the model may deviate from the actual dialogue, which reduces the accuracy of speech recognition.
Disclosure of Invention
The application mainly aims to provide a voice recognition method, a device, equipment and a storage medium, which aim to solve the technical problem of poor voice recognition accuracy in the prior art.
To achieve the above object, the present application provides a voice recognition method including:
acquiring voice data to be recognized, and determining candidate texts of the voice data to be recognized and the above text data of the candidate texts;
extracting auxiliary text from the candidate text;
and extracting output text from the candidate text based on the above text data and the auxiliary text and outputting the output text.
Optionally, the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
and inputting the above text data and the auxiliary text into a preset recognition model to recognize the above text data and the auxiliary text, obtaining and outputting the output text of the candidate text.
The preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
Optionally, the preset recognition model is a target model that reaches preset training conditions after iterative training is performed on the preset basic model based on the first preset attention mechanism based on preset text data with preset labels.
Optionally, the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text;
and extracting output text from the candidate text based on the decoding vector and outputting the output text.
Optionally, the step of extracting and outputting text from the candidate text based on the decoding vector includes:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
Optionally, the step of decoding the above text data and the encoded vector of the auxiliary text by a preset decoding rule to obtain the decoded vector of the candidate text includes:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
Optionally, the step of extracting the auxiliary text from the candidate text includes:
counting the frequency of each word in the candidate text;
and selecting words with the frequency larger than a preset value as the auxiliary text.
The application also provides a voice recognition device, which comprises:
the acquisition module is used for acquiring voice data to be identified and determining candidate texts of the voice data to be identified and the above text data of the candidate texts;
a first extraction module for extracting an auxiliary text from the candidate text;
and the second extraction module is used for extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
Optionally, the second extraction module includes:
the input unit is used for inputting the above text data and the auxiliary text into a preset recognition model so as to recognize the above text data and the auxiliary text, obtain and output the output text of the candidate text;
the preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
Optionally, the preset recognition model is a target model that reaches preset training conditions after iterative training is performed on the preset basic model based on the first preset attention mechanism based on preset text data with preset labels.
Optionally, the second extraction module includes:
the coding unit is used for carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
the decoding unit is used for decoding the above text data and the coding vector of the auxiliary text through a preset decoding rule to obtain the decoding vector of the candidate text;
and the extraction unit is used for extracting output text from the candidate text based on the decoding vector and outputting the output text.
Optionally, the extracting unit is configured to implement:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
Optionally, the decoding unit is configured to implement:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
Optionally, the first extraction module includes:
the statistics unit is used for counting the frequency of each word in the candidate text;
and the selecting unit is used for selecting words with the frequency larger than a preset value as the auxiliary text.
The application also provides a voice recognition device, which is a physical device, comprising: a memory, a processor, and a program of the voice recognition method stored in the memory and capable of running on the processor, wherein the program of the voice recognition method, when executed by the processor, can implement the steps of the voice recognition method.
The present application also provides a storage medium having stored thereon a program for implementing the above-described speech recognition method, which when executed by a processor implements the steps of the above-described speech recognition method.
According to the method of the application, the voice data to be recognized is acquired, and the candidate texts of the voice data to be recognized and the above text data of the candidate texts are determined; an auxiliary text is extracted from the candidate texts; and, based on the above text data and the auxiliary text, the output text is extracted from the candidate texts and output. That is, in the application, after the voice data to be recognized is obtained, the candidate texts of the voice data to be recognized are determined and the above text data of the candidate texts is obtained; the auxiliary text is then extracted from the candidate texts; and the above text data of the candidate texts is combined with the auxiliary text to extract the output text from the candidate texts and output it. In other words, the output text is not obtained by merely decoding and recognizing the candidate texts of the voice data to be recognized; instead, auxiliary words are selected from the candidate sentences by a collective decision, and these collectively decided auxiliary words (beneficial to improving accuracy) are combined with the above text data of the candidate texts (also beneficial to improving accuracy) to extract the output text from the candidate texts and output it. The accuracy of speech recognition is thereby improved, solving the technical problem of low speech recognition accuracy in the prior art.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a first embodiment of a speech recognition method according to the present application;
FIG. 2 is a flowchart illustrating a refinement step for extracting auxiliary text from the candidate text according to a first embodiment of the speech recognition method of the present application;
FIG. 3 is a schematic diagram of a device architecture of a hardware operating environment according to an embodiment of the present application;
fig. 4 is a schematic diagram of a speech recognition method according to the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
An embodiment of the present application provides a voice recognition method, in a first embodiment of the present application, referring to fig. 1, the voice recognition method includes:
step S10, obtaining voice data to be recognized, and determining candidate texts of the voice data to be recognized and text data above the candidate texts;
step S20, extracting auxiliary texts from the candidate texts;
and step S30, extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
The method comprises the following specific steps:
step S10, obtaining voice data to be recognized, and determining candidate texts of the voice data to be recognized and text data above the candidate texts;
in this embodiment, it should be noted that the speech recognition method is applied to a speech recognition system, where the speech recognition system belongs to a speech recognition device, and the speech recognition system is in communication with a speech platform, for example, where there is a daily audio dialogue, speaker a: "you have or not seen CBA yesterday evening, guangdong team wins opponents with a big score; speaker B: "seen, the match is very wonderful, the MVP name is obtained by easy establishment and association", if the corresponding voice data to be identified is: in the prior art, the voice data to be recognized of the speaker B, such as 'seen, very wonderful in match, easily linked to obtain the MVP name, is recognized through the preset trained ASR model', such as 'seen by using an n-gram model', very wonderful in match, easily linked to obtain the MVP name, is modeled, and the speaker A cannot be obtained by using the n-gram model in decoding: "you have or not to see the CBA in yesterday evening, guangdong team wins the opponent" content, and in fact "CBA" and "guangdong team" play a better role in decoding the word "easy to build" and therefore using the traditional n-gram model has strong limitations on ASR in conversational scenarios, which can make the content decoded by the model deviate.
In this embodiment, after receiving the voice data to be recognized, the speech recognition system determines the candidate texts of the voice data to be recognized and the above text data of the candidate texts. The speech recognition system communicates with a speech platform; for example, the speech platform may include a telephone customer-service sub-platform that records the incoming-call content of each caller, sets the recording as the voice data to be recognized, and sends it to the speech recognition system. After receiving the voice data to be recognized, the speech recognition system determines the candidate texts of the voice data to be recognized and the above text data of the candidate texts. Specifically, a pre-trained ASR model (for example, a preset DNN-HMM/CTC model) decodes the voice data to be recognized into N candidate sentences, e.g. D1, D2, ..., DN; the candidate texts may also be candidate sentences obtained by the trained ASR model after incorporating current hot topics. In this embodiment, after the candidate texts are obtained, the above text data of the candidate texts is further obtained; the above text data may be the text data of a preset historical period, or the text data of a preset number of preceding sentences corresponding to the candidate texts. For example, in the dialogue above, if Speaker B's utterance "I did; the match was wonderful, and Yi Jianlian won the MVP title" is the voice data to be recognized, the above text data includes Speaker A's "Did you watch the CBA game last night? The Guangdong team beat their opponent by a big margin."
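For illustration, a minimal sketch of this step is given below: it decodes the voice data into N candidate sentences and gathers the above text data from the dialogue history. The `asr_model.decode_nbest` interface, the turn-joining strategy and the parameter names are assumptions made for the sketch, not an interface defined by the application.

```python
# Sketch of step S10 (assumed interface): obtain N candidate sentences D1..DN
# and the above text data (preceding dialogue turns) for the voice data.
def prepare_recognition_inputs(audio, dialogue_history, asr_model, n=5, context_turns=1):
    # hypothetical n-best decoding call of a pre-trained ASR model (e.g. DNN-HMM/CTC)
    candidates = asr_model.decode_nbest(audio, n=n)            # ["D1", "D2", ..., "DN"]
    # above text data: a preset number of preceding sentences from the dialogue
    above_text = " ".join(dialogue_history[-context_turns:])
    return candidates, above_text
```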
Step S20, extracting auxiliary texts from the candidate texts;
in the embodiment, the current context is combined to extract the output text from the candidate text, so that the accuracy of recognition is improved.
Specifically, the auxiliary text is extracted from the candidate text through a preset extraction strategy, which may be a word weight strategy or the like, that is, the auxiliary text is extracted through weights (which are determined in advance) of the texts in the candidate text.
Wherein, referring to fig. 2, the step of extracting the auxiliary text from the candidate text includes:
step S21, counting the frequency of each word in the candidate text;
and S22, selecting words with the frequency larger than a preset value as the auxiliary text.
In this embodiment, through a preset feature extraction (Feature Extractor) unit, the frequency of each word in the N candidate sentences is counted, and the words whose frequency is greater than the preset value are selected as the auxiliary text C, as shown in fig. 4.
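A minimal sketch of this frequency-based selection is shown below, assuming whitespace tokenization and a frequency threshold of 2; both are illustrative choices rather than values fixed by the application.

```python
from collections import Counter

def extract_auxiliary_text(candidate_sentences, preset_value=2):
    # count the frequency of each word across the N candidate sentences
    counts = Counter(word for sentence in candidate_sentences for word in sentence.split())
    # words whose frequency is greater than the preset value form the auxiliary text C
    return [word for word, freq in counts.items() if freq > preset_value]
```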
And step S30, extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
After the auxiliary text is obtained, based on the text data and the auxiliary text, extracting output text from the candidate text and outputting the output text.
Wherein the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
step S31, inputting the above text data and the auxiliary text into a preset recognition model to recognize the above text data and the auxiliary text, obtaining and outputting the output text of the candidate text;
the preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
In this embodiment, the above text data and the auxiliary text are input into a preset recognition model to perform recognition processing on the above text data and the auxiliary text, so as to obtain and output the output text of the candidate texts. Specifically, the preset recognition model includes a Seq2Seq (sequence-to-sequence, containing an RNN network structure, often used in the dialogue field to model the mapping relationship of a dialogue context) model and a preset end-to-end model; the above text data and the auxiliary text are input into the Seq2Seq model for recognition processing, so as to obtain and output the output text of the candidate texts.
The preset recognition model is a target model that reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels. Specifically, the preset text data with preset labels is obtained, and the preset basic model (including an RNN network structure) is iteratively trained to adjust its model parameters (the adjustment is determined by comparing the prediction results for the preset text data with the preset labels) until a preset condition is reached, such as the number of iterations reaching a preset count, or model convergence, thereby obtaining the target model. It should be noted that the preset basic model may include a preset encoding sub-model (used to represent each text in the text data by a vector of preset dimensions) and a preset decoding sub-model (used to determine a vector for the whole text data). When the preset basic model includes the preset encoding sub-model and the preset decoding sub-model, the two sub-models need to be pre-trained separately to obtain a converged encoding sub-model and a converged decoding sub-model. In this embodiment, the purpose of setting the encoding sub-model and the decoding sub-model is to facilitate introducing other mechanisms, such as attention mechanisms, to process the text data.
In this embodiment, it should be noted that, after the above text data and the auxiliary text are input into the preset recognition model (which, as shown in fig. 4, includes a preset encoding sub-model (encoding module) and a preset decoding sub-model (decoding module)), the above text data and the auxiliary text are encoded separately and finally integrated to obtain the output text. Specifically, the above text data U is input into the encoding sub-model of the preset recognition model to obtain the encoded above text, and the auxiliary text is input into the encoding sub-model of the preset recognition model to obtain the encoded auxiliary text. The encoded above text and the encoded auxiliary text are then jointly input into the preset decoding sub-model (it should be noted that the encoded above text, the encoded auxiliary text and the candidate texts may also be jointly input into an intermediate processing layer to obtain an intermediate processing result, which is then input into the preset decoding sub-model). Specifically, each candidate text can be scored by the preset decoding sub-model (the scores are synthesized from the encoding vectors), yielding the corresponding score1, score2, ..., scoreN, and the candidate with the highest score is selected as the output text.
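The following sketch illustrates this encode-then-score flow with a shared GRU encoder and a linear scoring head; the architecture, dimensions and module names are assumptions made for the example and do not reproduce the exact model of fig. 4.

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Encode the above text U, the auxiliary text C and each candidate, then score the candidates."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)  # plays the role of the encoding sub-model
        self.score_head = nn.Linear(3 * hid_dim, 1)                # decoding sub-model reduced to a scorer

    def encode(self, token_ids):
        # token_ids: (1, seq_len) -> fixed-size sentence vector
        _, hidden = self.encoder(self.embed(token_ids))
        return hidden[-1]                                          # (1, hid_dim)

    def forward(self, above_ids, aux_ids, candidate_ids_list):
        u = self.encode(above_ids)        # encoded above text
        c = self.encode(aux_ids)          # encoded auxiliary text
        scores = [self.score_head(torch.cat([u, c, self.encode(cand)], dim=-1))
                  for cand in candidate_ids_list]
        return torch.cat(scores, dim=0).squeeze(-1)                # score1..scoreN

# the candidate with the highest score is chosen as the output text:
# output_text = candidates[scores.argmax().item()]
```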
The preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on a first preset attention mechanism based on preset text data with preset labels.
In this embodiment, a first preset attention mechanism is introduced: after the preset basic model is iteratively trained based on the first preset attention mechanism, the target model that satisfies the preset training conditions is obtained.
The attention mechanism (Attention Mechanism; the Attention module in fig. 4) selectively attends to a part of all available information while ignoring the rest of the visible information. That is, in order to make reasonable use of limited information-processing resources, a specific portion of the input needs to be selected and concentrated on; for example, when reading, people usually focus on and process only a small number of the words to be read. The attention mechanism therefore has two main aspects: deciding which part of the input needs to be attended to, and allocating the limited information-processing resources to the important parts.
In this embodiment, by introducing the first preset attention mechanism, the model parameters of the preset basic model are adjusted in the attended direction; specifically, the model parameters of the preset decoding sub-model or the preset encoding sub-model are adjusted in the attended direction, where the adjustment direction based on the first attention mechanism may refer to the adjustment direction of the connection weights between the matrices in the neural network structure, and so on. In addition, producing outputs such as candidate texts purely with the "encode-decode" approach has two problems: first, all information of the candidate texts must be stored in the encoding vector for decoding to work effectively; second, the long-distance dependence problem, i.e., information is lost when it is transmitted over long distances during encoding and decoding. By introducing the preset first attention mechanism, relevant information is selected directly from the candidate texts as an aid during decoding, so that not all candidate text information has to be carried by the encoding vector; the information can be passed directly, shortening the information transmission distance.
In this embodiment, based on the preset text data with preset labels, iterative training is performed on the preset basic model based on the first preset attention mechanism until the target model that satisfies the preset training conditions is obtained. This improves both the efficiency and the accuracy of obtaining the required output text.
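A minimal sketch of how such an attention step can pull relevant information directly from candidate-text encodings during decoding is given below; the dot-product scoring and the variable names are assumptions made for the example, not the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, candidate_encodings):
    # decoder_state: (hid_dim,); candidate_encodings: (num_tokens, hid_dim)
    scores = candidate_encodings @ decoder_state                  # relevance of each candidate-text token
    weights = F.softmax(scores / candidate_encodings.size(-1) ** 0.5, dim=0)
    context = (weights.unsqueeze(-1) * candidate_encodings).sum(dim=0)
    return context, weights   # context is passed directly to the decoder, shortening the information path
```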
According to the method of the application, the voice data to be recognized is acquired, and the candidate texts of the voice data to be recognized and the above text data of the candidate texts are determined; an auxiliary text is extracted from the candidate texts; and, based on the above text data and the auxiliary text, the output text is extracted from the candidate texts and output. That is, in the application, after the voice data to be recognized is obtained, the candidate texts of the voice data to be recognized are determined and the above text data of the candidate texts is obtained; the auxiliary text is then extracted from the candidate texts; and the above text data of the candidate texts is combined with the auxiliary text to extract the output text from the candidate texts and output it. In other words, the output text is not obtained by merely decoding and recognizing the candidate texts of the voice data to be recognized; instead, auxiliary words are selected from the candidate sentences by a collective decision, and these collectively decided auxiliary words (beneficial to improving accuracy) are combined with the above text data of the candidate texts (also beneficial to improving accuracy) to extract the output text from the candidate texts and output it. The accuracy of speech recognition is thereby improved, solving the technical problem of low speech recognition accuracy in the prior art.
Further, according to the first embodiment of the present application, in another embodiment of the present application, the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
a1, carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
in this embodiment, another way of obtaining the output text without using a model is provided, specifically, the encoding processing of the preset vector, such as a word vector, a preset euclidean distance vector, and the like, is directly performed on the above text data and the auxiliary text through a preset vector encoding rule, so as to obtain the encoding vector of the above text data and the auxiliary text.
A2, decoding the above text data and the coding vector of the auxiliary text by presetting a decoding rule to obtain the decoding vector of the candidate text;
in this embodiment, after the encoding vectors of the above text data and the auxiliary text are obtained, decoding processing is performed on the encoding vectors of the above text data and the auxiliary text by using a preset decoding rule, so as to obtain decoding vectors of the candidate text, where the preset decoding rule may be a vector addition rule between each encoding vector.
And A3, extracting and outputting text from the candidate text based on the decoding vector.
The output text is then extracted from the candidate texts based on the decoding vector and output; specifically, the output text is extracted from the candidate texts based on the magnitude of the vector value of each candidate text in the decoding vector.
In this embodiment, the encoding process of the preset vector is performed on the above text data and the auxiliary text, so as to obtain encoding vectors of the above text data and the auxiliary text; decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text; and extracting output text from the candidate text based on the decoding vector and outputting the output text. In this embodiment, the output text is accurately obtained.
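The model-free variant can be sketched as follows: the above text data and the auxiliary text are mapped to word vectors, combined by vector addition as the preset decoding rule, and each candidate then receives a decoding-vector value. The word-vector table, the whitespace tokenization and the dot-product scoring are illustrative assumptions, not details fixed by the application.

```python
import numpy as np

def encode_text(text, word_vectors, dim=50):
    # average the word vectors of the tokens (unknown words are skipped)
    vecs = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def candidate_values(above_text, auxiliary_text, candidates, word_vectors):
    # preset decoding rule assumed here: vector addition of the two encoding vectors
    combined = encode_text(above_text, word_vectors) + encode_text(auxiliary_text, word_vectors)
    # one vector value per candidate text: similarity to the combined vector
    return [float(np.dot(encode_text(c, word_vectors), combined)) for c in candidates]
```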
Further, according to the first embodiment and the second embodiment of the present application, the step of decoding the above text data and the encoded vector of the auxiliary text by presetting a decoding rule to obtain the decoded vector of the candidate text includes:
step B1, carrying out directional selection on the coding vectors of the candidate texts through a second preset attention mechanism to obtain the candidate vectors of the candidate texts;
in this embodiment, through a second preset attention mechanism, the encoding vector of the candidate text is selected in a directional manner, for example, only the euclidean distance vector in the encoding vector is obtained, so as to obtain the candidate vector of the candidate text.
And B2, decoding the candidate vectors of the candidate texts by presetting a decoding rule to obtain the decoding vectors of the candidate texts.
And decoding the candidate vectors of the candidate texts by presetting a decoding rule, such as adding all the part-of-speech vectors, so as to obtain the decoding vectors of the candidate texts.
The step of extracting and outputting text from the candidate text based on the decoding vector includes:
e1, obtaining a vector value of each text in the candidate texts based on the decoding vector;
and E2, extracting and outputting text from the candidate text based on the vector value.
In this embodiment, based on the decoding vector, the vector value of each text in the candidate texts is obtained. Specifically, based on the association between the decoding vector and the vector values, the vector value of each text in the candidate texts is calculated; the texts in the candidate texts are then ranked by the magnitude of their vector values, and the top-ranked text is extracted and output as the output text.
According to the embodiment, through a second preset attention mechanism, the coding vector of the candidate text is subjected to directional selection, so that the candidate vector of the candidate text is obtained; and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts. In the embodiment, the decoding vector is accurately acquired, and a foundation is laid for accurately acquiring and outputting the output text.
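A short sketch of steps B1-B2 and E1-E2 together: the second preset attention mechanism is represented here as a component mask that directionally selects parts of each candidate's encoding vector, the masked vector is reduced to a scalar vector value by summation, and the top-ranked candidate is output. The mask and the summation rule are assumptions made for the example.

```python
import numpy as np

def pick_output_text(candidates, encoding_vectors, keep_mask):
    values = []
    for vec in encoding_vectors:
        candidate_vec = vec * keep_mask            # directional selection (second preset attention mechanism)
        values.append(float(candidate_vec.sum()))  # preset decoding rule: add the selected components
    best = int(np.argmax(values))                  # rank by vector value, highest first
    return candidates[best], values
```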
Referring to fig. 3, fig. 3 is a schematic device structure diagram of a hardware running environment according to an embodiment of the present application.
As shown in fig. 3, the voice recognition apparatus may include: a processor 1001, such as a CPU, memory 1005, and a communication bus 1002. Wherein a communication bus 1002 is used to enable connected communication between the processor 1001 and a memory 1005. The memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Optionally, the speech recognition device may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like. The user interface may include a display screen (Display) and an input sub-module such as a keyboard (Keyboard); optionally, the user interface may also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface).
It will be appreciated by those skilled in the art that the speech recognition device structure shown in fig. 3 is not limiting of the speech recognition device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 3, an operating system, a network communication module, and a voice recognition program may be included in the memory 1005 as one type of storage medium. An operating system is a program that manages and controls the hardware and software resources of a speech recognition device, supporting the execution of speech recognition programs and other software and/or programs. The network communication module is used to enable communication between components within the memory 1005 and with other hardware and software in the speech recognition system.
In the speech recognition device shown in fig. 3, the processor 1001 is configured to execute a speech recognition program stored in the memory 1005, and implement the steps of the speech recognition method described in any one of the above.
The specific implementation manner of the voice recognition device of the present application is substantially the same as that of each embodiment of the voice recognition method, and will not be repeated here.
The application also provides a voice recognition device, which comprises:
the acquisition module is used for acquiring voice data to be identified and determining candidate texts of the voice data to be identified and the above text data of the candidate texts;
a first extraction module for extracting an auxiliary text from the candidate text;
and the second extraction module is used for extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
Optionally, the second extraction module includes:
the input unit is used for inputting the above text data and the auxiliary text into a preset recognition model so as to recognize the above text data and the auxiliary text, obtain and output the output text of the candidate text;
the preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
Optionally, the preset recognition model is a target model that reaches preset training conditions after iterative training is performed on the preset basic model based on the first preset attention mechanism based on preset text data with preset labels.
Optionally, the second extraction module includes:
the coding unit is used for carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
the decoding unit is used for decoding the above text data and the coding vector of the auxiliary text through a preset decoding rule to obtain the decoding vector of the candidate text;
and the extraction unit is used for extracting output text from the candidate text based on the decoding vector and outputting the output text.
Optionally, the extracting unit is configured to implement:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
Optionally, the decoding unit is configured to implement:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
Optionally, the first extraction module includes:
the statistics unit is used for counting the frequency of each word in the candidate text;
and the selecting unit is used for selecting words with the frequency larger than a preset value as the auxiliary text.
The specific implementation manner of the voice recognition device of the present application is basically the same as that of each embodiment of the voice recognition method, and will not be repeated here.
Embodiments of the present application provide a storage medium, and the storage medium stores one or more programs, which are further executable by one or more processors for implementing the steps of the speech recognition method described in any one of the above.
The specific implementation manner of the storage medium of the present application is basically the same as that of each embodiment of the voice recognition method, and will not be repeated here.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein, or any application, directly or indirectly, within the scope of the application.

Claims (8)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized, and determining candidate texts of the voice data to be recognized and the above text data of the candidate texts;
extracting auxiliary text from the candidate text;
inputting the above text data and the auxiliary text into a preset recognition model to recognize the above text data and the auxiliary text, obtaining and outputting an output text of the candidate text, wherein the preset recognition model is trained based on a first preset attention mechanism; or
Carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text;
and extracting and outputting the text from the candidate text based on the decoding vector, wherein the decoding vector of the candidate text is obtained by directionally selecting the encoding vector of the candidate text through a second preset attention mechanism.
2. The method of claim 1, wherein the predetermined recognition model is a target model that reaches a predetermined training condition after performing iterative training on a predetermined basic model based on a first predetermined attention mechanism based on predetermined text data having a predetermined label.
3. The method of speech recognition according to claim 1, wherein the step of extracting and outputting text from the candidate texts based on the decoded vector comprises:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
4. The method of claim 1, wherein the step of decoding the above text data and the encoded vector of the auxiliary text by a preset decoding rule to obtain the decoded vector of the candidate text comprises:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
5. The method of claim 1, wherein the step of extracting auxiliary text from the candidate text comprises:
counting the frequency of each word in the candidate text;
and selecting words with the frequency larger than a preset value as the auxiliary text.
6. A speech recognition device, characterized in that the speech recognition device comprises:
the acquisition module is used for acquiring voice data to be identified and determining candidate texts of the voice data to be identified and the above text data of the candidate texts;
a first extraction module for extracting an auxiliary text from the candidate text;
the second extraction module is used for extracting output text from the candidate text based on the above text data and the auxiliary text and outputting the output text, and the second extraction module inputs the above text data and the auxiliary text into a preset recognition model so as to recognize the above text data and the auxiliary text to obtain and output the output text of the candidate text, wherein the preset recognition model is trained based on a first preset attention mechanism; or
Carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text;
and extracting and outputting the text from the candidate text based on the decoding vector, wherein the decoding vector of the candidate text is obtained by directionally selecting the encoding vector of the candidate text through a second preset attention mechanism.
7. A speech recognition device, characterized in that the speech recognition device comprises: a memory, a processor and a program stored on the memory for implementing the speech recognition method,
the memory is used for storing a program for realizing a voice recognition method;
the processor is configured to execute a program implementing the speech recognition method to implement the steps of the speech recognition method according to any one of claims 1 to 5.
8. A storage medium, characterized in that the storage medium has stored thereon a program for realizing a speech recognition method, the program for realizing a speech recognition method being executed by a processor to realize the steps of the speech recognition method according to any one of claims 1 to 5.
CN202010595832.8A 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium Active CN111524517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010595832.8A CN111524517B (en) 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010595832.8A CN111524517B (en) 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111524517A CN111524517A (en) 2020-08-11
CN111524517B true CN111524517B (en) 2023-11-03

Family

ID=71910194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010595832.8A Active CN111524517B (en) 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111524517B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885338B (en) * 2021-01-29 2024-05-14 Shenzhen Qianhai WeBank Co., Ltd. Speech recognition method, device, computer-readable storage medium, and program product

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1282072A (en) * 1999-07-27 2001-01-31 International Business Machines Corporation Error correcting method for voice identification result and voice identification system
JP2006107353A (en) * 2004-10-08 2006-04-20 Sony Corp Information processor, information processing method, recording medium and program
CN106803422A (en) * 2015-11-26 2017-06-06 Institute of Acoustics, Chinese Academy of Sciences A language model re-estimation method based on a long short-term memory network
CN107785018A (en) * 2016-08-31 2018-03-09 iFLYTEK Co., Ltd. Multi-turn interaction semantic understanding method and device
CN110460715A (en) * 2018-05-07 2019-11-15 Apple Inc. Method, equipment and medium for operating a digital assistant
CN109065054A (en) * 2018-08-31 2018-12-21 Mobvoi Information Technology Co., Ltd. Speech recognition error correction method, device, electronic equipment and readable storage medium
CN110069781A (en) * 2019-04-24 2019-07-30 Beijing QIYI Century Science & Technology Co., Ltd. Entity tag recognition method and related device
CN110334347A (en) * 2019-06-27 2019-10-15 Tencent Technology (Shenzhen) Co., Ltd. Information processing method based on natural language recognition, related device and storage medium
CN110473523A (en) * 2019-08-30 2019-11-19 Beijing Dami Technology Co., Ltd. Speech recognition method, device, storage medium and terminal
CN110503945A (en) * 2019-09-06 2019-11-26 Beijing Kingsoft Digital Entertainment Technology Co., Ltd. Training method and device of a speech processing model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Using word burst analysis to rescore keyword search candidates on low-resource languages; Justin Richards et al.; 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); entire document *
N-best rescoring algorithm based on a recurrent neural network language model; Zhang Jian et al.; Journal of Data Acquisition and Processing; entire document *

Also Published As

Publication number Publication date
CN111524517A (en) 2020-08-11

Similar Documents

Publication Publication Date Title
US20180158449A1 (en) Method and device for waking up via speech based on artificial intelligence
CN111190600B (en) Method and system for automatically generating front-end codes based on GRU attention model
CN110930980B (en) Acoustic recognition method and system for Chinese and English mixed voice
CN112100337B (en) Emotion recognition method and device in interactive dialogue
CN110163181B (en) Sign language identification method and device
CN110475129A (en) Method for processing video frequency, medium and server
JP2020004382A (en) Method and device for voice interaction
CN111401259B (en) Model training method, system, computer readable medium and electronic device
CN112579760B (en) Man-machine conversation method, device, computer equipment and readable storage medium
CN111816169A (en) Method and device for training Chinese and English hybrid speech recognition model
CN111524517B (en) Speech recognition method, device, equipment and storage medium
CN115617955A (en) Hierarchical prediction model training method, punctuation symbol recovery method and device
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN113822072A (en) Keyword extraction method and device and electronic equipment
CN115312034A (en) Method, device and equipment for processing voice signal based on automaton and dictionary tree
CN111477212B (en) Content identification, model training and data processing method, system and equipment
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN113643706B (en) Speech recognition method, device, electronic equipment and storage medium
CN116189663A (en) Training method and device of prosody prediction model, and man-machine interaction method and device
CN113656566B (en) Intelligent dialogue processing method, intelligent dialogue processing device, computer equipment and storage medium
CN115273828A (en) Training method and device of voice intention recognition model and electronic equipment
CN111241830B (en) Method for generating word vector and training model for generating word
CN112686059B (en) Text translation method, device, electronic equipment and storage medium
CN113327691A (en) Query method and device based on language model, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant