CN111524517B - Speech recognition method, device, equipment and storage medium - Google Patents

Speech recognition method, device, equipment and storage medium

Info

Publication number
CN111524517B
CN111524517B (Application CN202010595832.8A)
Authority
CN
China
Prior art keywords
text
candidate
preset
auxiliary
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010595832.8A
Other languages
Chinese (zh)
Other versions
CN111524517A (en)
Inventor
连荣忠
姜迪
徐倩
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010595832.8A
Publication of CN111524517A
Application granted
Publication of CN111524517B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice recognition method, device, equipment and storage medium. The method includes: acquiring voice data to be recognized, and determining candidate texts of the voice data to be recognized and the above text data (the preceding dialogue context) of the candidate texts; extracting an auxiliary text from the candidate texts; and extracting an output text from the candidate texts based on the above text data and the auxiliary text, and outputting the output text. The application solves the technical problem of low speech recognition accuracy in the prior art.

Description

Speech recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence in financial technology (Fintech), and in particular to a speech recognition method, device, equipment and storage medium.
Background
With the continuous development of financial technology, especially internet finance, more and more technologies are being applied in the finance field, and the finance industry in turn places higher requirements on these technologies; for example, it places higher requirements on speech recognition.
At present, a language model in a traditional ASR (Automatic Speech Recognition) algorithm is generally used to decode and recognize the dialogue content that currently needs to be processed. However, such decoding and recognition of the current dialogue content has strong limitations: for example, the content decoded by the model may deviate from the actual dialogue, which reduces the accuracy of speech recognition.
Disclosure of Invention
The application mainly aims to provide a voice recognition method, a device, equipment and a storage medium, which aim to solve the technical problem of poor voice recognition accuracy in the prior art.
To achieve the above object, the present application provides a voice recognition method including:
acquiring voice data to be recognized, and determining candidate texts of the voice data to be recognized and the above text data of the candidate texts;
extracting auxiliary text from the candidate text;
and extracting output text from the candidate text based on the above text data and the auxiliary text and outputting the output text.
Optionally, the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
and inputting the above text data and the auxiliary text into a preset recognition model to recognize the above text data and the auxiliary text, obtaining and outputting the output text of the candidate text.
The preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
Optionally, the preset recognition model is a target model that reaches preset training conditions after iterative training is performed on the preset basic model based on the first preset attention mechanism based on preset text data with preset labels.
Optionally, the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text;
and extracting output text from the candidate text based on the decoding vector and outputting the output text.
Optionally, the step of extracting and outputting text from the candidate text based on the decoding vector includes:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
Optionally, the step of decoding the above text data and the encoded vector of the auxiliary text by a preset decoding rule to obtain the decoded vector of the candidate text includes:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
Optionally, the step of extracting the auxiliary text from the candidate text includes:
counting the frequency of each word in the candidate text;
and selecting words with the frequency larger than a preset value as the auxiliary text.
The application also provides a voice recognition device, which comprises:
the acquisition module is used for acquiring voice data to be identified and determining candidate texts of the voice data to be identified and the above text data of the candidate texts;
a first extraction module for extracting an auxiliary text from the candidate text;
and the second extraction module is used for extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
Optionally, the second extraction module includes:
the input unit is used for inputting the above text data and the auxiliary text into a preset recognition model so as to recognize the above text data and the auxiliary text, obtain and output the output text of the candidate text;
the preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
Optionally, the preset recognition model is a target model that reaches preset training conditions after iterative training is performed on the preset basic model based on the first preset attention mechanism based on preset text data with preset labels.
Optionally, the second extraction module includes:
the coding unit is used for carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
the decoding unit is used for decoding the above text data and the coding vector of the auxiliary text through a preset decoding rule to obtain the decoding vector of the candidate text;
and the extraction unit is used for extracting output text from the candidate text based on the decoding vector and outputting the output text.
Optionally, the extracting unit is configured to implement:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
Optionally, the decoding unit is configured to implement:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
Optionally, the first extraction module includes:
the statistics unit is used for counting the frequency of each word in the candidate text;
and the selecting unit is used for selecting words with the frequency larger than a preset value as the auxiliary text.
The application also provides a voice recognition device, which is a physical device, comprising: a memory, a processor, and a program of the voice recognition method stored in the memory and capable of running on the processor, wherein the program of the voice recognition method, when executed by the processor, can implement the steps of the voice recognition method.
The present application also provides a storage medium having stored thereon a program for implementing the above-described speech recognition method, which when executed by a processor implements the steps of the above-described speech recognition method.
According to the method of the application, the voice data to be recognized is acquired, and the candidate texts of the voice data to be recognized and the above text data of the candidate texts are determined; an auxiliary text is extracted from the candidate texts; and, based on the above text data and the auxiliary text, the output text is extracted from the candidate texts and output. That is, in the application, after the voice data to be recognized is obtained, the candidate texts of the voice data to be recognized are determined and the above text data of the candidate texts is obtained; the auxiliary text is then extracted from the candidate texts; and the above text data of the candidate texts is combined with the auxiliary text to extract the output text from the candidate texts and output it. In other words, the output text is not obtained by merely decoding and recognizing the candidate texts of the voice data to be recognized; instead, auxiliary words are selected from the candidate sentences by a collective decision, and these collectively decided auxiliary words (beneficial to improving accuracy) are combined with the above text data of the candidate texts (also beneficial to improving accuracy) to extract the output text from the candidate texts and output it. The accuracy of speech recognition is thereby improved, solving the technical problem of low speech recognition accuracy in the prior art.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a first embodiment of a speech recognition method according to the present application;
FIG. 2 is a flowchart illustrating a refinement step for extracting auxiliary text from the candidate text according to a first embodiment of the speech recognition method of the present application;
FIG. 3 is a schematic diagram of a device architecture of a hardware operating environment according to an embodiment of the present application;
fig. 4 is a schematic diagram of a speech recognition method according to the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
An embodiment of the present application provides a voice recognition method, in a first embodiment of the present application, referring to fig. 1, the voice recognition method includes:
step S10, obtaining voice data to be recognized, and determining candidate texts of the voice data to be recognized and text data above the candidate texts;
step S20, extracting auxiliary texts from the candidate texts;
and step S30, extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
The method comprises the following specific steps:
step S10, obtaining voice data to be recognized, and determining candidate texts of the voice data to be recognized and text data above the candidate texts;
in this embodiment, it should be noted that the speech recognition method is applied to a speech recognition system, where the speech recognition system belongs to a speech recognition device, and the speech recognition system is in communication with a speech platform, for example, where there is a daily audio dialogue, speaker a: "you have or not seen CBA yesterday evening, guangdong team wins opponents with a big score; speaker B: "seen, the match is very wonderful, the MVP name is obtained by easy establishment and association", if the corresponding voice data to be identified is: in the prior art, the voice data to be recognized of the speaker B, such as 'seen, very wonderful in match, easily linked to obtain the MVP name, is recognized through the preset trained ASR model', such as 'seen by using an n-gram model', very wonderful in match, easily linked to obtain the MVP name, is modeled, and the speaker A cannot be obtained by using the n-gram model in decoding: "you have or not to see the CBA in yesterday evening, guangdong team wins the opponent" content, and in fact "CBA" and "guangdong team" play a better role in decoding the word "easy to build" and therefore using the traditional n-gram model has strong limitations on ASR in conversational scenarios, which can make the content decoded by the model deviate.
In this embodiment, after receiving the voice data to be recognized, the speech recognition system determines the candidate texts of the voice data to be recognized and the above text data of the candidate texts. The speech recognition system communicates with a speech platform; for example, the speech platform may include a telephone customer-service sub-platform that records the incoming-call content of each caller, sets the recording as the voice data to be recognized, and sends it to the speech recognition system. After receiving the voice data to be recognized, the speech recognition system determines the candidate texts of the voice data to be recognized and the above text data of the candidate texts. Specifically, a pre-trained ASR model (for example, a preset DNN-HMM/CTC model) decodes the voice data to be recognized into N candidate sentences, e.g. D1, D2, ..., DN; the candidate texts may also be candidate sentences obtained by the trained ASR model after incorporating current hot topics. In this embodiment, after the candidate texts are obtained, the above text data of the candidate texts is further obtained; the above text data may be the text data of a preset historical period, or the text data of a preset number of preceding sentences corresponding to the candidate texts. For example, in the dialogue above, if Speaker B's utterance "I did; the match was wonderful, and Yi Jianlian won the MVP title" is the voice data to be recognized, the above text data includes Speaker A's "Did you watch the CBA game last night? The Guangdong team beat their opponent by a big margin."
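For illustration, a minimal sketch of this step is given below: it decodes the voice data into N candidate sentences and gathers the above text data from the dialogue history. The `asr_model.decode_nbest` interface, the turn-joining strategy and the parameter names are assumptions made for the sketch, not an interface defined by the application.

```python
# Sketch of step S10 (assumed interface): obtain N candidate sentences D1..DN
# and the above text data (preceding dialogue turns) for the voice data.
def prepare_recognition_inputs(audio, dialogue_history, asr_model, n=5, context_turns=1):
    # hypothetical n-best decoding call of a pre-trained ASR model (e.g. DNN-HMM/CTC)
    candidates = asr_model.decode_nbest(audio, n=n)            # ["D1", "D2", ..., "DN"]
    # above text data: a preset number of preceding sentences from the dialogue
    above_text = " ".join(dialogue_history[-context_turns:])
    return candidates, above_text
```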
Step S20, extracting auxiliary texts from the candidate texts;
in the embodiment, the current context is combined to extract the output text from the candidate text, so that the accuracy of recognition is improved.
Specifically, the auxiliary text is extracted from the candidate text through a preset extraction strategy, which may be a word weight strategy or the like, that is, the auxiliary text is extracted through weights (which are determined in advance) of the texts in the candidate text.
Wherein, referring to fig. 2, the step of extracting the auxiliary text from the candidate text includes:
step S21, counting the frequency of each word in the candidate text;
and S22, selecting words with the frequency larger than a preset value as the auxiliary text.
In this embodiment, through a preset feature extraction (Feature Extractor) unit, the frequency of each word in the N candidate sentences is counted, and the words whose frequency is greater than the preset value are selected as the auxiliary text C, as shown in fig. 4.
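A minimal sketch of this frequency-based selection is shown below, assuming whitespace tokenization and a frequency threshold of 2; both are illustrative choices rather than values fixed by the application.

```python
from collections import Counter

def extract_auxiliary_text(candidate_sentences, preset_value=2):
    # count the frequency of each word across the N candidate sentences
    counts = Counter(word for sentence in candidate_sentences for word in sentence.split())
    # words whose frequency is greater than the preset value form the auxiliary text C
    return [word for word, freq in counts.items() if freq > preset_value]
```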
And step S30, extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
After the auxiliary text is obtained, based on the text data and the auxiliary text, extracting output text from the candidate text and outputting the output text.
Wherein the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
step S31, inputting the above text data and the auxiliary text into a preset recognition model to recognize the above text data and the auxiliary text, obtaining and outputting the output text of the candidate text;
the preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
In this embodiment, the above text data and the auxiliary text are input into a preset recognition model to perform recognition processing on the above text data and the auxiliary text, so as to obtain and output the output text of the candidate texts. Specifically, the preset recognition model includes a Seq2Seq (sequence-to-sequence, containing an RNN network structure, often used in the dialogue field to model the mapping relationship of a dialogue context) model and a preset end-to-end model; the above text data and the auxiliary text are input into the Seq2Seq model for recognition processing, so as to obtain and output the output text of the candidate texts.
The preset recognition model is a target model that reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels. Specifically, the preset text data with preset labels is obtained, and the preset basic model (including an RNN network structure) is iteratively trained to adjust its model parameters (the adjustment is determined by comparing the prediction results for the preset text data with the preset labels) until a preset condition is reached, such as the number of iterations reaching a preset count, or model convergence, thereby obtaining the target model. It should be noted that the preset basic model may include a preset encoding sub-model (used to represent each text in the text data by a vector of preset dimensions) and a preset decoding sub-model (used to determine a vector for the whole text data). When the preset basic model includes the preset encoding sub-model and the preset decoding sub-model, the two sub-models need to be pre-trained separately to obtain a converged encoding sub-model and a converged decoding sub-model. In this embodiment, the purpose of setting the encoding sub-model and the decoding sub-model is to facilitate introducing other mechanisms, such as attention mechanisms, to process the text data.
In this embodiment, it should be noted that, after the above text data and the auxiliary text are input into the preset recognition model (which, as shown in fig. 4, includes a preset encoding sub-model (encoding module) and a preset decoding sub-model (decoding module)), the above text data and the auxiliary text are encoded separately and finally integrated to obtain the output text. Specifically, the above text data U is input into the encoding sub-model of the preset recognition model to obtain the encoded above text, and the auxiliary text is input into the encoding sub-model of the preset recognition model to obtain the encoded auxiliary text. The encoded above text and the encoded auxiliary text are then jointly input into the preset decoding sub-model (it should be noted that the encoded above text, the encoded auxiliary text and the candidate texts may also be jointly input into an intermediate processing layer to obtain an intermediate processing result, which is then input into the preset decoding sub-model). Specifically, each candidate text can be scored by the preset decoding sub-model (the scores are synthesized from the encoding vectors), yielding the corresponding score1, score2, ..., scoreN, and the candidate with the highest score is selected as the output text.
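The following sketch illustrates this encode-then-score flow with a shared GRU encoder and a linear scoring head; the architecture, dimensions and module names are assumptions made for the example and do not reproduce the exact model of fig. 4.

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Encode the above text U, the auxiliary text C and each candidate, then score the candidates."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)  # plays the role of the encoding sub-model
        self.score_head = nn.Linear(3 * hid_dim, 1)                # decoding sub-model reduced to a scorer

    def encode(self, token_ids):
        # token_ids: (1, seq_len) -> fixed-size sentence vector
        _, hidden = self.encoder(self.embed(token_ids))
        return hidden[-1]                                          # (1, hid_dim)

    def forward(self, above_ids, aux_ids, candidate_ids_list):
        u = self.encode(above_ids)        # encoded above text
        c = self.encode(aux_ids)          # encoded auxiliary text
        scores = [self.score_head(torch.cat([u, c, self.encode(cand)], dim=-1))
                  for cand in candidate_ids_list]
        return torch.cat(scores, dim=0).squeeze(-1)                # score1..scoreN

# the candidate with the highest score is chosen as the output text:
# output_text = candidates[scores.argmax().item()]
```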
The preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on a first preset attention mechanism based on preset text data with preset labels.
In this embodiment, a first preset attention mechanism is introduced: after the preset basic model is iteratively trained based on the first preset attention mechanism, the target model that satisfies the preset training conditions is obtained.
The attention mechanism (Attention Mechanism; the Attention module in fig. 4) selectively attends to a part of all available information while ignoring the rest of the visible information. That is, in order to make reasonable use of limited information-processing resources, a specific portion of the input needs to be selected and concentrated on; for example, when reading, people usually focus on and process only a small number of the words to be read. The attention mechanism therefore has two main aspects: deciding which part of the input needs to be attended to, and allocating the limited information-processing resources to the important parts.
In this embodiment, by introducing the first preset attention mechanism, the model parameters of the preset basic model are adjusted in the attended direction; specifically, the model parameters of the preset decoding sub-model or the preset encoding sub-model are adjusted in the attended direction, where the adjustment direction based on the first attention mechanism may refer to the adjustment direction of the connection weights between the matrices in the neural network structure, and so on. In addition, producing outputs such as candidate texts purely with the "encode-decode" approach has two problems: first, all information of the candidate texts must be stored in the encoding vector for decoding to work effectively; second, the long-distance dependence problem, i.e., information is lost when it is transmitted over long distances during encoding and decoding. By introducing the preset first attention mechanism, relevant information is selected directly from the candidate texts as an aid during decoding, so that not all candidate text information has to be carried by the encoding vector; the information can be passed directly, shortening the information transmission distance.
In this embodiment, based on the preset text data with preset labels, iterative training is performed on the preset basic model based on the first preset attention mechanism until the target model that satisfies the preset training conditions is obtained. This improves both the efficiency and the accuracy of obtaining the required output text.
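A minimal sketch of how such an attention step can pull relevant information directly from candidate-text encodings during decoding is given below; the dot-product scoring and the variable names are assumptions made for the example, not the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, candidate_encodings):
    # decoder_state: (hid_dim,); candidate_encodings: (num_tokens, hid_dim)
    scores = candidate_encodings @ decoder_state                  # relevance of each candidate-text token
    weights = F.softmax(scores / candidate_encodings.size(-1) ** 0.5, dim=0)
    context = (weights.unsqueeze(-1) * candidate_encodings).sum(dim=0)
    return context, weights   # context is passed directly to the decoder, shortening the information path
```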
According to the method of the application, the voice data to be recognized is acquired, and the candidate texts of the voice data to be recognized and the above text data of the candidate texts are determined; an auxiliary text is extracted from the candidate texts; and, based on the above text data and the auxiliary text, the output text is extracted from the candidate texts and output. That is, in the application, after the voice data to be recognized is obtained, the candidate texts of the voice data to be recognized are determined and the above text data of the candidate texts is obtained; the auxiliary text is then extracted from the candidate texts; and the above text data of the candidate texts is combined with the auxiliary text to extract the output text from the candidate texts and output it. In other words, the output text is not obtained by merely decoding and recognizing the candidate texts of the voice data to be recognized; instead, auxiliary words are selected from the candidate sentences by a collective decision, and these collectively decided auxiliary words (beneficial to improving accuracy) are combined with the above text data of the candidate texts (also beneficial to improving accuracy) to extract the output text from the candidate texts and output it. The accuracy of speech recognition is thereby improved, solving the technical problem of low speech recognition accuracy in the prior art.
Further, according to the first embodiment of the present application, in another embodiment of the present application, the step of extracting and outputting text from the candidate text based on the above text data and the auxiliary text includes:
a1, carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
in this embodiment, another way of obtaining the output text without using a model is provided, specifically, the encoding processing of the preset vector, such as a word vector, a preset euclidean distance vector, and the like, is directly performed on the above text data and the auxiliary text through a preset vector encoding rule, so as to obtain the encoding vector of the above text data and the auxiliary text.
A2, decoding the above text data and the coding vector of the auxiliary text by presetting a decoding rule to obtain the decoding vector of the candidate text;
in this embodiment, after the encoding vectors of the above text data and the auxiliary text are obtained, decoding processing is performed on the encoding vectors of the above text data and the auxiliary text by using a preset decoding rule, so as to obtain decoding vectors of the candidate text, where the preset decoding rule may be a vector addition rule between each encoding vector.
And A3, extracting and outputting text from the candidate text based on the decoding vector.
The output text is then extracted from the candidate texts based on the decoding vector and output; specifically, the output text is extracted from the candidate texts based on the magnitude of the vector value of each candidate text in the decoding vector.
In this embodiment, the encoding process of the preset vector is performed on the above text data and the auxiliary text, so as to obtain encoding vectors of the above text data and the auxiliary text; decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text; and extracting output text from the candidate text based on the decoding vector and outputting the output text. In this embodiment, the output text is accurately obtained.
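The model-free variant can be sketched as follows: the above text data and the auxiliary text are mapped to word vectors, combined by vector addition as the preset decoding rule, and each candidate then receives a decoding-vector value. The word-vector table, the whitespace tokenization and the dot-product scoring are illustrative assumptions, not details fixed by the application.

```python
import numpy as np

def encode_text(text, word_vectors, dim=50):
    # average the word vectors of the tokens (unknown words are skipped)
    vecs = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def candidate_values(above_text, auxiliary_text, candidates, word_vectors):
    # preset decoding rule assumed here: vector addition of the two encoding vectors
    combined = encode_text(above_text, word_vectors) + encode_text(auxiliary_text, word_vectors)
    # one vector value per candidate text: similarity to the combined vector
    return [float(np.dot(encode_text(c, word_vectors), combined)) for c in candidates]
```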
Further, according to the first embodiment and the second embodiment of the present application, the step of decoding the above text data and the encoded vector of the auxiliary text by presetting a decoding rule to obtain the decoded vector of the candidate text includes:
step B1, carrying out directional selection on the coding vectors of the candidate texts through a second preset attention mechanism to obtain the candidate vectors of the candidate texts;
in this embodiment, through a second preset attention mechanism, the encoding vector of the candidate text is selected in a directional manner, for example, only the euclidean distance vector in the encoding vector is obtained, so as to obtain the candidate vector of the candidate text.
And B2, decoding the candidate vectors of the candidate texts by presetting a decoding rule to obtain the decoding vectors of the candidate texts.
And decoding the candidate vectors of the candidate texts by presetting a decoding rule, such as adding all the part-of-speech vectors, so as to obtain the decoding vectors of the candidate texts.
The step of extracting and outputting text from the candidate text based on the decoding vector includes:
e1, obtaining a vector value of each text in the candidate texts based on the decoding vector;
and E2, extracting and outputting text from the candidate text based on the vector value.
In this embodiment, based on the decoding vector, the vector value of each text in the candidate texts is obtained. Specifically, based on the association between the decoding vector and the vector values, the vector value of each text in the candidate texts is calculated; the texts in the candidate texts are then ranked by the magnitude of their vector values, and the top-ranked text is extracted and output as the output text.
According to the embodiment, through a second preset attention mechanism, the coding vector of the candidate text is subjected to directional selection, so that the candidate vector of the candidate text is obtained; and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts. In the embodiment, the decoding vector is accurately acquired, and a foundation is laid for accurately acquiring and outputting the output text.
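A short sketch of steps B1-B2 and E1-E2 together: the second preset attention mechanism is represented here as a component mask that directionally selects parts of each candidate's encoding vector, the masked vector is reduced to a scalar vector value by summation, and the top-ranked candidate is output. The mask and the summation rule are assumptions made for the example.

```python
import numpy as np

def pick_output_text(candidates, encoding_vectors, keep_mask):
    values = []
    for vec in encoding_vectors:
        candidate_vec = vec * keep_mask            # directional selection (second preset attention mechanism)
        values.append(float(candidate_vec.sum()))  # preset decoding rule: add the selected components
    best = int(np.argmax(values))                  # rank by vector value, highest first
    return candidates[best], values
```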
Referring to fig. 3, fig. 3 is a schematic device structure diagram of a hardware running environment according to an embodiment of the present application.
As shown in fig. 3, the voice recognition apparatus may include: a processor 1001, such as a CPU, memory 1005, and a communication bus 1002. Wherein a communication bus 1002 is used to enable connected communication between the processor 1001 and a memory 1005. The memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Optionally, the speech recognition device may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like. The user interface may include a display screen (Display) and an input sub-module such as a keyboard (Keyboard); optionally, the user interface may also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface).
It will be appreciated by those skilled in the art that the speech recognition device structure shown in fig. 3 is not limiting of the speech recognition device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 3, an operating system, a network communication module, and a voice recognition program may be included in the memory 1005 as one type of storage medium. An operating system is a program that manages and controls the hardware and software resources of a speech recognition device, supporting the execution of speech recognition programs and other software and/or programs. The network communication module is used to enable communication between components within the memory 1005 and with other hardware and software in the speech recognition system.
In the speech recognition device shown in fig. 3, the processor 1001 is configured to execute a speech recognition program stored in the memory 1005, and implement the steps of the speech recognition method described in any one of the above.
The specific implementation manner of the voice recognition device of the present application is substantially the same as that of each embodiment of the voice recognition method, and will not be repeated here.
The application also provides a voice recognition device, which comprises:
the acquisition module is used for acquiring voice data to be identified and determining candidate texts of the voice data to be identified and the above text data of the candidate texts;
a first extraction module for extracting an auxiliary text from the candidate text;
and the second extraction module is used for extracting output text from the candidate text and outputting the output text based on the text data and the auxiliary text.
Optionally, the second extraction module includes:
the input unit is used for inputting the above text data and the auxiliary text into a preset recognition model so as to recognize the above text data and the auxiliary text, obtain and output the output text of the candidate text;
the preset recognition model is a target model which reaches preset training conditions after iterative training is performed on a preset basic model based on preset text data with preset labels.
Optionally, the preset recognition model is a target model that reaches preset training conditions after iterative training is performed on the preset basic model based on the first preset attention mechanism based on preset text data with preset labels.
Optionally, the second extraction module includes:
the coding unit is used for carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
the decoding unit is used for decoding the above text data and the coding vector of the auxiliary text through a preset decoding rule to obtain the decoding vector of the candidate text;
and the extraction unit is used for extracting output text from the candidate text based on the decoding vector and outputting the output text.
Optionally, the extracting unit is configured to implement:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
Optionally, the decoding unit is configured to implement:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
Optionally, the first extraction module includes:
the statistics unit is used for counting the frequency of each word in the candidate text;
and the selecting unit is used for selecting words with the frequency larger than a preset value as the auxiliary text.
The specific implementation manner of the voice recognition device of the present application is basically the same as that of each embodiment of the voice recognition method, and will not be repeated here.
Embodiments of the present application provide a storage medium, and the storage medium stores one or more programs, which are further executable by one or more processors for implementing the steps of the speech recognition method described in any one of the above.
The specific implementation manner of the storage medium of the present application is basically the same as that of each embodiment of the voice recognition method, and will not be repeated here.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein, or any application, directly or indirectly, within the scope of the application.

Claims (8)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized, and determining candidate texts of the voice data to be recognized and the above text data of the candidate texts;
extracting auxiliary text from the candidate text;
inputting the above text data and the auxiliary text into a preset recognition model to recognize the above text data and the auxiliary text, obtaining and outputting an output text of the candidate text, wherein the preset recognition model is trained based on a first preset attention mechanism; or
Carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text;
and extracting and outputting the text from the candidate text based on the decoding vector, wherein the decoding vector of the candidate text is obtained by directionally selecting the encoding vector of the candidate text through a second preset attention mechanism.
2. The method of claim 1, wherein the predetermined recognition model is a target model that reaches a predetermined training condition after performing iterative training on a predetermined basic model based on a first predetermined attention mechanism based on predetermined text data having a predetermined label.
3. The method of speech recognition according to claim 1, wherein the step of extracting and outputting text from the candidate texts based on the decoded vector comprises:
based on the decoding vector, obtaining a vector value of each text in the candidate texts;
and extracting output text from the candidate text based on the vector value and outputting the output text.
4. The method of claim 1, wherein the step of decoding the above text data and the encoded vector of the auxiliary text by a preset decoding rule to obtain the decoded vector of the candidate text comprises:
through a second preset attention mechanism, carrying out directional selection on the coding vectors of the candidate texts to obtain candidate vectors of the candidate texts;
and decoding the candidate vectors of the candidate texts through a preset decoding rule to obtain the decoding vectors of the candidate texts.
5. The method of claim 1, wherein the step of extracting auxiliary text from the candidate text comprises:
counting the frequency of each word in the candidate text;
and selecting words with the frequency larger than a preset value as the auxiliary text.
6. A speech recognition device, characterized in that the speech recognition device comprises:
the acquisition module is used for acquiring voice data to be identified and determining candidate texts of the voice data to be identified and the above text data of the candidate texts;
a first extraction module for extracting an auxiliary text from the candidate text;
the second extraction module is used for extracting output text from the candidate text based on the above text data and the auxiliary text and outputting the output text, and the second extraction module inputs the above text data and the auxiliary text into a preset recognition model so as to recognize the above text data and the auxiliary text to obtain and output the output text of the candidate text, wherein the preset recognition model is trained based on a first preset attention mechanism; or
Carrying out coding processing of preset vectors on the above text data and the auxiliary text to obtain coding vectors of the above text data and the auxiliary text;
decoding the above text data and the coding vector of the auxiliary text by a preset decoding rule to obtain the decoding vector of the candidate text;
and extracting and outputting the text from the candidate text based on the decoding vector, wherein the decoding vector of the candidate text is obtained by directionally selecting the encoding vector of the candidate text through a second preset attention mechanism.
7. A speech recognition device, characterized in that the speech recognition device comprises: a memory, a processor and a program stored on the memory for implementing the speech recognition method,
the memory is used for storing a program for realizing a voice recognition method;
the processor is configured to execute a program implementing the speech recognition method to implement the steps of the speech recognition method according to any one of claims 1 to 5.
8. A storage medium, characterized in that the storage medium has stored thereon a program for realizing a speech recognition method, the program for realizing a speech recognition method being executed by a processor to realize the steps of the speech recognition method according to any one of claims 1 to 5.
CN202010595832.8A 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium Active CN111524517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010595832.8A CN111524517B (en) 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010595832.8A CN111524517B (en) 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111524517A CN111524517A (en) 2020-08-11
CN111524517B true CN111524517B (en) 2023-11-03

Family

ID=71910194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010595832.8A Active CN111524517B (en) 2020-06-24 2020-06-24 Speech recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111524517B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885338B (en) * 2021-01-29 2024-05-14 Shenzhen Qianhai WeBank Co., Ltd. Speech recognition method, device, computer-readable storage medium, and program product

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1282072A (en) * 1999-07-27 2001-01-31 International Business Machines Corporation Error correcting method for voice identification result and voice identification system
JP2006107353A (en) * 2004-10-08 2006-04-20 Sony Corp Information processor, information processing method, recording medium and program
CN106803422A (en) * 2015-11-26 2017-06-06 Institute of Acoustics, Chinese Academy of Sciences A language model re-estimation method based on a long short-term memory network
CN107785018A (en) * 2016-08-31 2018-03-09 iFLYTEK Co., Ltd. Multi-turn interaction semantic understanding method and device
CN110460715A (en) * 2018-05-07 2019-11-15 Apple Inc. Method, equipment and medium for operating a digital assistant
CN109065054A (en) * 2018-08-31 2018-12-21 Mobvoi Information Technology Co., Ltd. Speech recognition error correction method, device, electronic equipment and readable storage medium
CN110069781A (en) * 2019-04-24 2019-07-30 Beijing QIYI Century Science & Technology Co., Ltd. Entity tag recognition method and related device
CN110334347A (en) * 2019-06-27 2019-10-15 Tencent Technology (Shenzhen) Co., Ltd. Information processing method based on natural language recognition, related device and storage medium
CN110473523A (en) * 2019-08-30 2019-11-19 Beijing Dami Technology Co., Ltd. Speech recognition method, device, storage medium and terminal
CN110503945A (en) * 2019-09-06 2019-11-26 Beijing Kingsoft Digital Entertainment Technology Co., Ltd. Training method and device of a speech processing model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Using word burst analysis to rescore keyword search candidates on low-resource languages; Justin Richards et al.; 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); entire document *
N-best rescoring algorithm based on a recurrent neural network language model; Zhang Jian et al.; Journal of Data Acquisition and Processing; entire document *

Also Published As

Publication number Publication date
CN111524517A (en) 2020-08-11

Similar Documents

Publication Publication Date Title
US20180158449A1 (en) Method and device for waking up via speech based on artificial intelligence
CN111190600B (en) Method and system for automatically generating front-end codes based on GRU attention model
CN110930980B (en) Acoustic recognition method and system for Chinese and English mixed voice
CN112100337B (en) Emotion recognition method and device in interactive dialogue
CN110163181B (en) Sign language identification method and device
CN110475129A (en) Method for processing video frequency, medium and server
JP2020004382A (en) Method and device for voice interaction
CN111401259B (en) Model training method, system, computer readable medium and electronic device
CN112579760B (en) Man-machine conversation method, device, computer equipment and readable storage medium
CN111816169A (en) Method and device for training Chinese and English hybrid speech recognition model
CN111524517B (en) Speech recognition method, device, equipment and storage medium
CN115617955A (en) Hierarchical prediction model training method, punctuation symbol recovery method and device
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN113822072A (en) Keyword extraction method and device and electronic equipment
CN115312034A (en) Method, device and equipment for processing voice signal based on automaton and dictionary tree
CN111477212B (en) Content identification, model training and data processing method, system and equipment
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN113643706B (en) Speech recognition method, device, electronic equipment and storage medium
CN116189663A (en) Training method and device of prosody prediction model, and man-machine interaction method and device
CN113656566B (en) Intelligent dialogue processing method, intelligent dialogue processing device, computer equipment and storage medium
CN115273828A (en) Training method and device of voice intention recognition model and electronic equipment
CN111241830B (en) Method for generating word vector and training model for generating word
CN112686059B (en) Text translation method, device, electronic equipment and storage medium
CN113327691A (en) Query method and device based on language model, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant