CN112037773B - N-optimal spoken language semantic recognition method and device and electronic equipment - Google Patents

N-optimal spoken language semantic recognition method and device and electronic equipment

Info

Publication number
CN112037773B
CN112037773B (application CN202011220689.0A)
Authority
CN
China
Prior art keywords
model
intention
text data
recognition
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011220689.0A
Other languages
Chinese (zh)
Other versions
CN112037773A (en)
Inventor
张常睿 (Zhang Changrui)
李蒙 (Li Meng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qilu Information Technology Co Ltd
Original Assignee
Beijing Qilu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qilu Information Technology Co Ltd filed Critical Beijing Qilu Information Technology Co Ltd
Priority to CN202011220689.0A priority Critical patent/CN112037773B/en
Publication of CN112037773A publication Critical patent/CN112037773A/en
Application granted granted Critical
Publication of CN112037773B publication Critical patent/CN112037773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/183 Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an N-best spoken language semantic recognition method and device and electronic equipment, wherein the method comprises the following steps: acquiring, as a training set, the text data with the top N probability values output by an automatic speech recognition (ASR) model for historical audio data, together with a label for each piece of text data; training a spoken language understanding (SLU) model on the training set; inputting the text data with the top M probability values output by the ASR model for test audio data into the SLU model to obtain intention recognition probability sequences for the M pieces of text data; and outputting the intention with the highest probability in the intention recognition probability sequences as the intention of the test audio data. The invention considers the top-ranked text hypotheses of the ASR model both when training the SLU model and when applying it, and performs intention recognition over those hypotheses, thereby effectively reducing user intention recognition errors caused by ASR recognition errors and improving intention recognition accuracy.

Description

N-optimal spoken language semantic recognition method and device and electronic equipment
Technical Field
The invention relates to the technical field of voice intelligence, in particular to an N-best spoken language semantic recognition method and device, electronic equipment and a computer readable medium.
Background
With the development of artificial intelligence technology, voice robots are being applied ever more widely. Based on speech recognition, speech synthesis, natural language understanding and related technologies, a voice robot can give an enterprise an intelligent human-machine interaction experience that can listen, speak and understand the user in a variety of practical application scenarios. At present, voice robots are widely used in scenarios such as telephone sales, intelligent question answering, intelligent quality inspection, real-time speech subtitles and interview transcription.
The voice robot first performs natural language understanding on the user's speech to recognize the user's intention, and then generates answering speech for the user through natural language generation technology according to that intention, completing a spoken question-and-answer exchange with the user. In the natural language understanding process, the voice robot converts the user's speech into text through automatic speech recognition (ASR) technology, and then recognizes the user's intention through spoken language understanding (SLU) technology. ASR technology analyzes speech feature parameters from training data in advance, builds speech templates, and stores them in a speech parameter library; at recognition time, it analyzes the speech to be recognized to obtain its speech parameters, compares them with the templates in the speech parameter library one by one, finds the template closest to the speech features by model scoring, and outputs the corresponding text as the recognition result. Because ASR takes the text corresponding to the single closest speech-feature template as the recognized words, that text may differ from what the user actually said, or even mean the opposite. The text conversion of existing ASR therefore contains certain errors, which cause the subsequent SLU to misrecognize the user's intention and degrade the conversation between the voice robot and the user.
Disclosure of Invention
The invention aims to solve the technical problem that ASR recognition errors cause user intention recognition errors.
In order to solve the above technical problem, a first aspect of the present invention provides an N-best spoken language semantic recognition method, where the method includes:
acquiring, as a training set, the text data with the top N probability values output by the automatic speech recognition (ASR) model for historical audio data, together with a label for each piece of text data;
training a spoken language understanding (SLU) model on the training set;
inputting the text data with the top M probability values output by the ASR model for test audio data into the SLU model to obtain intention recognition probability sequences for the M pieces of text data;
and outputting the intention with the highest probability in the intention recognition probability sequences as the intention of the test audio data.
According to a preferred embodiment of the invention, the ASR model comprises an acoustic model and a language model.
According to a preferred embodiment of the present invention, the acoustic model is a long short-term memory (LSTM) neural network or a hidden Markov model (HMM).
According to a preferred embodiment of the present invention, the language model is any one of an n-gram model, a neural network language model NNLM, and a word2vec model.
According to a preferred embodiment of the invention, the SLU model is a multi-task deep neural network (MT-DNN) or a Bidirectional Encoder Representations from Transformers (BERT) model.
According to a preferred embodiment of the invention, the method further comprises:
acquiring the slot value corresponding to the intention of the test audio data through a slot filling model;
and sending the intention of the test audio data and the corresponding slot value to a voice answering system.
In order to solve the above technical problem, a second aspect of the present invention provides an N-best spoken language semantic recognition apparatus, including:
the acquisition module is used for acquiring, as a training set, the text data with the top N probability values output by the automatic speech recognition (ASR) model for historical audio data and the labels of each piece of text data;
a training module for training a spoken language understanding (SLU) model on the training set;
the input module is used for inputting the text data with the top M probability values output by the ASR model for test audio data into the SLU model to obtain intention recognition probability sequences for the M pieces of text data;
and the output module is used for outputting the intention with the highest probability in the intention recognition probability sequences as the intention of the test audio data.
According to a preferred embodiment of the invention, the ASR model comprises an acoustic model and a language model.
According to a preferred embodiment of the present invention, the acoustic model is a long short-term memory (LSTM) neural network or a hidden Markov model (HMM).
According to a preferred embodiment of the present invention, the language model is any one of an n-gram model, a neural network language model NNLM, and a word2vec model.
According to a preferred embodiment of the invention, the SLU model is a multi-task deep neural network (MT-DNN) or a Bidirectional Encoder Representations from Transformers (BERT) model.
According to a preferred embodiment of the invention, the device further comprises:
the sub-acquisition module is used for acquiring the slot value corresponding to the intention of the test audio data through a slot filling model;
and the sending module is used for sending the intention of the test audio data and the corresponding slot value to a voice answering system.
To solve the above technical problem, a third aspect of the present invention provides an electronic device, comprising:
a processor; and
a memory storing computer executable instructions that, when executed, cause the processor to perform the method described above.
In order to solve the above technical problem, a fourth aspect of the present invention proposes a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs that, when executed by a processor, implement the above method.
The invention reduces the problem of user intention recognition errors caused by ASR recognition errors from two directions: SLU model training and SLU model application. On the one hand, during SLU model training, the text data with the top N probability values output by the ASR model for historical audio data, together with the labels of the text data, are acquired to enhance the training set, and the spoken language understanding SLU model is trained on this set, so that the trained SLU model can identify the correct text data from among the top-N hypotheses. On the other hand, during SLU model application, the text data with the top M probability values output by the ASR model for the test audio data are input into the SLU model to obtain intention recognition probability sequences for the M pieces of text data, enhancing intention recognition, and finally the intention with the highest probability in those sequences is output as the intention of the test audio data. In the invention, the top-ranked text hypotheses of the ASR model are considered in both SLU model training and SLU model application, and intention recognition is performed over those hypotheses.
Drawings
In order to make the technical problems solved by the present invention, the technical means adopted and the technical effects obtained more clear, the following will describe in detail the embodiments of the present invention with reference to the accompanying drawings. It should be noted, however, that the drawings described below are only illustrations of exemplary embodiments of the invention, from which other embodiments can be derived by those skilled in the art without inventive step.
FIG. 1 is a flow chart of an N-best spoken language semantic recognition method according to the present invention;
FIG. 2 is a schematic diagram of the step of obtaining the text data with the top N probability values output by the automatic speech recognition ASR model for historical audio data according to the present invention;
FIG. 3 is a schematic diagram of the step of inputting the text data with the top M probability values output by the ASR model for test audio data into the SLU model to obtain the intention recognition probability sequences of the M text data according to the present invention;
FIG. 4 is a schematic diagram of the structural framework of the BERT model of the present invention;
FIG. 5 is a schematic structural framework diagram of an N-best spoken language semantic recognition apparatus according to the present invention;
FIG. 6 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention;
FIG. 7 is a diagrammatic representation of one embodiment of a computer-readable medium of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The invention may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.
The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus, a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these phrases are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
The scheme provided by the embodiments of the invention involves technologies such as artificial intelligence, natural language understanding and deep learning, and is explained through the following embodiments.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Natural Language Understanding (NLU) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable efficient communication between humans and computers using natural language. Natural language understanding builds on linguistics and integrates disciplines such as logic and computer science, obtaining a semantic representation of natural speech through the analysis of semantics, grammar and pragmatics. The main functions of natural language understanding include entity recognition, user intention recognition, user emotion recognition, coreference resolution, ellipsis recovery, reply confirmation and rejection judgment.
Intention recognition means using various machine learning methods to make a machine learn and understand the semantic intention expressed by a text; it involves multiple disciplines such as linguistics, computational linguistics, artificial intelligence and machine learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence.
Deep learning is a core part of machine learning and generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction. Natural language understanding technology based on deep learning produces a reply directly by an end-to-end method after obtaining a vectorized representation of natural speech, the most typical framework being the Encoder-Decoder framework. The method can be applied to the field of chat robots, and also to application scenarios such as machine translation, text summarization and syntactic analysis. Among these, the language model is one of the core technologies that introduced deep learning into natural language understanding.
In the invention, the top-ranked text hypotheses of the ASR model are considered in both SLU model training and SLU model application, and intention recognition is carried out over those hypotheses.
SLU is the abbreviation of Spoken Language Understanding; it is NLU as applied in a dialogue system.
Referring to fig. 1, fig. 1 is a flowchart of an N-best spoken language semantic recognition method provided by the present invention, where N-best refers to performing intention recognition on the text data with the top N probability values output by an ASR model so as to finally obtain an optimal recognition result. As shown in fig. 1, the method includes:
s1, acquiring text data with first N probability values of the automatic speech recognition ASR model for outputting historical audio data and labels of the text data as training sets;
Here, the purpose of the ASR model is to convert speech into text. Specifically, given an input speech signal, the model seeks the text sequence (composed of words or characters) that matches the speech signal to the highest degree, where the degree of match is typically expressed as a probability. Using X to represent the speech signal and W to represent the text sequence, the following formula is to be solved:
$$W^{*} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} \;=\; \arg\max_{W} P(X \mid W)\,P(W)$$
the above formula is the most central formula in speech recognition. P (w) represents the probability of a word sequence itself, i.e. how "words" the string or word itself has; p (x) represents the probability of a speech signal after a given word, i.e. how likely this crosstalk is to occur. Calculating the values of the two terms is the task of the language model and the acoustic model respectively. Accordingly, the ASR model of the present invention includes a language model and an acoustic model.
The language model generally uses the chain rule to decompose the probability of a sentence into the product of the probabilities of its words. The language model may be an n-gram model, in which the probability distribution of each word is assumed to depend only on the preceding n-1 words. Alternatively, the language model may be a neural network language model NNLM or a word2vec model.
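As a concrete illustration (not taken from the patent itself), the sketch below evaluates the chain rule under a bigram (n = 2) assumption with add-k smoothing; the toy counts stand in for statistics gathered from a training corpus:

```python
import math
from collections import Counter

def sentence_logprob(words, unigrams, bigrams, vocab_size, k=1.0):
    """Chain rule with a bigram assumption:
    log P(w1..wn) = sum_i log P(wi | w_{i-1}),
    with add-k smoothing so unseen bigrams keep non-zero probability."""
    padded = ["<s>"] + list(words) + ["</s>"]
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        logp += math.log((bigrams[(prev, cur)] + k) /
                         (unigrams[prev] + k * vocab_size))
    return logp

# Toy counts; in practice these come from a large corpus.
unigrams = Counter({"<s>": 2, "I": 2, "want": 1, "a": 1, "loan": 1})
bigrams = Counter({("<s>", "I"): 2, ("I", "want"): 1,
                   ("want", "a"): 1, ("a", "loan"): 1})
print(sentence_logprob(["I", "want", "a", "loan"], unigrams, bigrams, vocab_size=5))
```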
Among these, the NNLM follows the core view of the n-gram model: the probability of a sentence occurring is the joint probability of each of its words occurring in sequence. The NNLM represents words as vectors and predicts the word Wk most likely to occur at position k given sentence W. Specifically, the NNLM is a three-layer neural network model: the training sample is the word vectors of the context of a word w, these are passed through the hidden layer to the output layer, and the output layer gives the prediction for the word w.
word2vec evolved from the NNLM and makes important improvements to it that raise computational efficiency. There are two main implementations of the word2vec model: the Continuous Bag-of-Words model (CBOW) and the skip-gram model. The CBOW model is a three-layer neural network (input layer, hidden layer and Huffman-tree layer). The word vectors of the context are input into the CBOW model and accumulated by the hidden layer into an intermediate vector, which is fed to the root node of a Huffman tree; the root node routes it into the left or right subtree, and each non-leaf node classifies it in turn until some leaf node is reached; the word corresponding to that leaf node is the prediction of the next word. The skip-gram model is also a three-layer neural network: it takes a word as input and outputs predictions of the word vectors of its context. The core of the skip-gram model is likewise a Huffman tree; each pass from the root to a leaf node predicts one word in the context, iterating N-1 times per word yields predictions for all words in its context, and the word vectors are adjusted on the training data until a sufficiently accurate result is obtained.
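For illustration, a minimal CBOW forward pass is sketched below in numpy. It uses a full softmax over the vocabulary for clarity, whereas, as described above, word2vec itself speeds this step up with a Huffman tree (hierarchical softmax) or negative sampling; the sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5000, 100                     # vocabulary size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))    # input-side (context) embeddings
W_out = rng.normal(0, 0.1, (D, V))   # output projection

def cbow_predict(context_ids):
    """Average the context word embeddings (the hidden layer) and
    return a softmax distribution over the next word."""
    h = W_in[context_ids].mean(axis=0)   # intermediate vector
    scores = h @ W_out
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

probs = cbow_predict([12, 48, 7, 301])   # ids of the context words
print(int(probs.argmax()))               # predicted next-word id
```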
The task of the acoustic model is to compute the probability of the speech signal given the words, i.e. how likely the given word sequence is to be pronounced as the observed speech. The acoustic model may specifically be a long short-term memory (LSTM) neural network or a hidden Markov model (HMM).
long Short Memory Networks (LSTM) is a Recurrent Neural Networks (RNN) structure that is widely used in acoustic models at present. Compared with the common RNN, the LSTM controls the storage, input and output of information through a well-designed gate structure, and meanwhile, the problem of gradient disappearance of the common RNN can be avoided to a certain extent, so that the LSTM can effectively model the long-term correlation of the time sequence signal.
In this step, as shown in fig. 2, the historical audio data W is input into the ASR model to obtain probability values Pi corresponding to the respective text sequences, where each probability value represents the degree of match between a text sequence and the historical audio data. The N text sequences corresponding to the top N probability values are taken as text data, and at the same time a label is set for each piece of text data to identify whether that text sequence is the true text sequence of the historical audio data.
Here N may be set according to the required accuracy of intention recognition. As shown in fig. 2, if N = 3, the 3 text sequences R1, R2 and R3 corresponding to the top-3 probability values P1, P2 and P3 are used as text data, and together with the labels of R1, R2 and R3 serve as the training set.
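A minimal sketch of this training-set construction follows; `asr_model.nbest(audio, n)` is an assumed interface returning `[(text, prob), ...]` sorted by probability, and `reference` is the true transcript of the utterance:

```python
def build_training_set(history, asr_model, n=3):
    """Step S1 sketch: keep the top-n ASR hypotheses per utterance and
    label each one by whether it matches the reference transcript."""
    training_set = []
    for audio, reference in history:
        for text, _prob in asr_model.nbest(audio, n):
            label = 1 if text == reference else 0   # 1 = true transcription
            training_set.append((text, label))
    return training_set
```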
S2, training a spoken language understanding SLU model through the training set;
as shown in fig. 2, the word sequences R1, R2, and R3 and the corresponding labels are input into the spoken language understanding SLU model to train the model, and the training of the SLU model is completed. And the trained SLU model can identify correct text data from the text data with the first N probability values.
S3, inputting the text data with the top M probability values output by the ASR model for the test audio data into the SLU model to obtain the intention recognition probability sequences of the M text data;
Here M may be set according to the required accuracy of intention recognition. M may be the same as or different from N; the invention places no particular limitation on this.
As shown in fig. 3, the test audio data Q is input into the ASR model to obtain probability values Qi indicating the degree of match between each text sequence and the test audio data Q. According to the size of the probability values Qi, the text sequences R1 and R2 corresponding to the top-2 probability values Q1 and Q2 are input into the SLU model as text data, yielding an intention recognition sequence PR1 for the text sequence R1 and an intention recognition sequence PR2 for the text sequence R2.
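The following sketch covers steps S3 and S4 together; `slu_model.intent_probs(text)` is an assumed interface returning a mapping from intention label to probability for one hypothesis, and `asr_model.nbest` is the same assumed interface as above:

```python
def recognize_intent(audio, asr_model, slu_model, m=2):
    """Run the SLU model on the ASR model's top-m hypotheses and return
    the single most probable intention across all m sequences."""
    sequences = [slu_model.intent_probs(text)
                 for text, _prob in asr_model.nbest(audio, m)]
    best_intent, best_p = None, -1.0
    for probs in sequences:                 # compare the sequences one by one
        for intent, p in probs.items():
            if p > best_p:
                best_intent, best_p = intent, p
    return best_intent
```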
The SLU model may be a multi-task deep neural network (MT-DNN) or a Bidirectional Encoder Representations from Transformers (BERT) model.
In the invention, the BERT model comprises N layers of feature encoders, each of which is connected to its own classifier. The classifier may be a decision tree model, a naive Bayes model, a logistic classifier, a support vector machine classifier, or the like; the invention places no limitation on this.
Fig. 4 shows the structure of the BERT model. The BERT model is essentially a language model composed of bidirectional Transformer encoders. The BERT model may include 12 Transformer layers (the BERT-base model) or 24 Transformer layers (the BERT-large model); that is, N can be 12 or 24. In fig. 4, the BERT model includes N identically structured feature encoders Trm stacked in sequence, and each feature-encoder layer is connected to a classifier. Here the feature encoder means the encoder of a Transformer. E denotes the embedding of a word, T denotes the new feature representation of each word after encoding by the BERT model, and F denotes the classifier connected to each layer's feature encoder.
Specifically, after text data is input into the BERT model, it passes in turn through the layer-i feature encoder and the layer-i classifier connected to it, producing a layer-i intention recognition result; it is then judged whether this result meets the intention recognition requirement. Concretely, the information entropy S of the layer-i intention recognition result may be computed, and when S is smaller than a preset value the layer-i result is judged to meet the intention recognition requirement. The preset value can be set according to the precision requirement on the BERT model. If the layer-i result does not meet the requirement, layer i+1 continues the intention recognition, and so on until the current layer's result meets the requirement, at which point that result is output as the intention of the text data and the text data is deleted.
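This entropy-based early exit can be sketched as follows (a simplified illustration in PyTorch, assuming a batch of one text, per-layer encoder modules and classifiers, and a mean-pooled sentence representation; none of these specifics are fixed by the patent):

```python
import torch

def early_exit_intent(embeddings, encoder_layers, classifiers, threshold=0.3):
    """After each encoder layer, classify the pooled representation; if the
    entropy of the predicted intention distribution drops below `threshold`,
    stop and return that layer's prediction."""
    hidden = embeddings                       # (1, seq_len, dim)
    probs = None
    for layer, clf in zip(encoder_layers, classifiers):
        hidden = layer(hidden)
        probs = torch.softmax(clf(hidden.mean(dim=1)), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        if entropy.item() < threshold:        # confident enough: exit early
            break
    return int(probs.argmax(dim=-1))
```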
The BERT model of the invention thus performs intention recognition layer by layer, starting from the bottom feature encoder and the classifier connected to it, and after each layer's recognition judges whether that layer's result meets the intention recognition requirement. If it does, the next layer of intention recognition is not needed: the result of that layer is output directly and intention recognition for the current text ends. This effectively increases the intention recognition speed of the model, avoids slow answers from the voice robot and long user waiting times during interaction, and improves the voice interaction between the voice robot and the user.
In addition, the BERT model uses multiple Transformer layers to learn the text bidirectionally, and the Transformer reads the text in a single pass, so the contextual relationships between the words in the text can be learned more accurately and the context understood more deeply. That is, a bidirectionally trained language model understands context more deeply than a unidirectional one and can process text more accurately; the BERT model therefore handles natural language understanding tasks better than other models.
S4, outputting the intention with the highest probability in the intention recognition probability sequence as the intention of the test audio data.
Specifically, the intention probabilities in the intention recognition sequence PR1 of the text sequence R1 and in the intention recognition sequence PR2 of the text sequence R2 are compared one by one, and the intention category corresponding to the maximum intention recognition probability is output as the intention of the test audio data.
Further, the method further comprises: acquiring the slot value corresponding to the intention of the test audio data through a slot filling model; and sending the intention of the test audio data and the corresponding slot value to a voice answering system, so that the voice answering system makes a voice response according to the intention and the corresponding slot value.
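Putting the pieces together, a sketch of the full pipeline might look as follows; `slot_filler.fill` and `answering_system.respond` are assumed interfaces used purely for illustration, and `recognize_intent` is the sketch given above:

```python
def handle_utterance(audio, asr_model, slu_model, slot_filler, answering_system):
    """End-to-end sketch: recognize the intention, fill its slots, and
    hand both to the voice answering system for the spoken reply."""
    intent = recognize_intent(audio, asr_model, slu_model, m=2)
    slots = slot_filler.fill(audio, intent)      # e.g. {"date": "tomorrow"}
    answering_system.respond(intent, slots)
```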
Fig. 5 is a schematic structural diagram of an N-best spoken language semantic recognition apparatus according to the present invention, as shown in fig. 5, the apparatus includes:
an obtaining module 51, configured to obtain, as a training set, the text data with the top N probability values output by the automatic speech recognition ASR model for historical audio data and the labels of each piece of text data;
a training module 52, configured to train a spoken language understanding SLU model on the training set;
an input module 53, configured to input the text data with the top M probability values output by the ASR model for the test audio data into the SLU model, to obtain the intention recognition probability sequences for the M text data;
an output module 54, configured to output the intention with the highest probability in the intention recognition probability sequences as the intention of the test audio data.
Wherein the ASR model includes an acoustic model and a language model. The acoustic model is a long short-term memory (LSTM) neural network or a hidden Markov model (HMM). The language model is any one of an n-gram model, a neural network language model NNLM and a word2vec model.
The SLU model is a multi-task deep neural network (MT-DNN) or a Bidirectional Encoder Representations from Transformers (BERT) model.
Further, the apparatus further comprises:
the sub-acquisition module is used for acquiring the slot value corresponding to the intention of the test audio data through a slot filling model;
and the sending module is used for sending the intention of the test audio data and the corresponding slot value to a voice answering system.
Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as an implementation in physical form for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.
Fig. 6 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 of the exemplary embodiment is represented in the form of a general-purpose data processing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different electronic device components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
The storage unit 620 stores a computer readable program, which may be source code or executable code. The program may be executed by the processing unit 610 such that the processing unit 610 performs the steps of the various embodiments of the present invention. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203. The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: operating the electronic device, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 300 (e.g., keyboard, display, network device, bluetooth device, etc.), enable a user to interact with the electronic device 600 via the external devices 300, and/or enable the electronic device 600 to communicate with one or more other data processing devices (e.g., router, modem, etc.). Such communication can occur via input/output (I/O) interfaces 650, and can also occur via network adapter 660 with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet). The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in FIG. 6, other hardware and/or software modules may be used in the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID electronics, tape drives, and data backup storage electronics, among others.
FIG. 7 is a schematic diagram of one computer-readable medium embodiment of the present invention. As shown in fig. 7, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor device or apparatus, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, enables the computer-readable medium to implement the above-described method of the invention, namely: acquiring, as a training set, the text data with the top N probability values output by the automatic speech recognition ASR model for historical audio data and the labels of each piece of text data; training a spoken language understanding SLU model on the training set; inputting the text data with the top M probability values output by the ASR model for the test audio data into the SLU model to obtain the intention recognition probability sequences of the M text data; and outputting the intention with the highest probability in the intention recognition probability sequences as the intention of the test audio data.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a data processing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention.
The computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution electronic device, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the present invention can be implemented as a method, an apparatus, an electronic device, or a computer-readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).
While the foregoing embodiments have described the objects, technical solutions and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may also implement the present invention. The invention is not to be considered as limited to the specific embodiments thereof, but covers all modifications and equivalents that come within the spirit and scope of the invention.

Claims (9)

1. An N-best spoken language semantic recognition method, characterized in that the method comprises:
acquiring, as a training set, the text data with the top N probability values output by the automatic speech recognition ASR model for historical audio data and the labels of each piece of text data;
training a spoken language understanding SLU model on the training set;
inputting the text data with the top M probability values output by the ASR model for the test audio data into the SLU model to obtain the intention recognition probability sequences of the M text data;
outputting the intention with the highest probability in the intention recognition probability sequences as the intention of the test audio data;
wherein N-best refers to performing intention recognition on the text data with the top N probability values output by the ASR model so as to finally obtain an optimal recognition result.
2. The method of claim 1, wherein the ASR model comprises an acoustic model and a language model.
3. The method of claim 2, wherein the acoustic model is a long short-term memory (LSTM) neural network or a hidden Markov model (HMM).
4. The method according to claim 2, wherein the language model is any one of an n-gram model, a neural network language model NNLM, and a word2vec model.
5. The method of claim 1, wherein the SLU model is a multi-task deep neural network (MT-DNN) or a Bidirectional Encoder Representations from Transformers (BERT) model.
6. The method of claim 1, further comprising:
acquiring the slot value corresponding to the intention of the test audio data through a slot filling model;
and sending the intention of the test audio data and the corresponding slot value to a voice answering system.
7. An N-best spoken semantic recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring, as a training set, the text data with the top N probability values output by the automatic speech recognition ASR model for historical audio data and the labels of each piece of text data;
a training module for training a spoken language understanding SLU model on the training set;
the input module is used for inputting the text data with the top M probability values output by the ASR model for the test audio data into the SLU model to obtain the intention recognition probability sequences of the M text data;
an output module, configured to output the intention with the highest probability in the intention recognition probability sequences as the intention of the test audio data;
wherein N-best refers to performing intention recognition on the text data with the top N probability values output by the ASR model so as to finally obtain an optimal recognition result.
8. An electronic device, comprising:
a processor; and
a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-6.
9. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.
CN202011220689.0A 2020-11-05 2020-11-05 N-optimal spoken language semantic recognition method and device and electronic equipment Active CN112037773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011220689.0A CN112037773B (en) 2020-11-05 2020-11-05 N-optimal spoken language semantic recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011220689.0A CN112037773B (en) 2020-11-05 2020-11-05 N-optimal spoken language semantic recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112037773A CN112037773A (en) 2020-12-04
CN112037773B true CN112037773B (en) 2021-01-29

Family

ID=73573580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011220689.0A Active CN112037773B (en) 2020-11-05 2020-11-05 N-optimal spoken language semantic recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112037773B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI787755B (en) * 2021-03-11 2022-12-21 碩網資訊股份有限公司 Method for cross-device and cross-language question answering matching based on deep learning
CN113160798B (en) * 2021-04-28 2024-04-16 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113035236B (en) * 2021-05-24 2021-08-27 北京爱数智慧科技有限公司 Quality inspection method and device for voice synthesis data
CN113591463B (en) * 2021-07-30 2023-07-18 中国平安人寿保险股份有限公司 Intention recognition method, device, electronic equipment and storage medium
CN115269809B (en) * 2022-09-19 2022-12-30 支付宝(杭州)信息技术有限公司 Method and device for training intention recognition model and method and device for recognizing intention
CN115273849B (en) * 2022-09-27 2022-12-27 北京宝兰德软件股份有限公司 Intention identification method and device for audio data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105895089A (en) * 2015-12-30 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method and device
CN110858480A (en) * 2018-08-15 2020-03-03 中国科学院声学研究所 Speech recognition method based on N-element grammar neural network language model
CN111429887A (en) * 2020-04-20 2020-07-17 合肥讯飞数码科技有限公司 End-to-end-based speech keyword recognition method, device and equipment
CN111564164A (en) * 2020-04-01 2020-08-21 中国电力科学研究院有限公司 Multi-mode emotion recognition method and device
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3680895B1 (en) * 2018-01-23 2021-08-11 Google LLC Selective adaptation and utilization of noise reduction technique in invocation phrase detection

Also Published As

Publication number Publication date
CN112037773A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
Vashisht et al. Speech recognition using machine learning
CN111862977B (en) Voice conversation processing method and system
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN112101045B (en) Multi-mode semantic integrity recognition method and device and electronic equipment
CN112101044B (en) Intention identification method and device and electronic equipment
CN109754809A (en) Audio recognition method, device, electronic equipment and storage medium
CN113569562A (en) Method and system for reducing cross-modal and cross-language barrier of end-to-end voice translation
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN114330371A (en) Session intention identification method and device based on prompt learning and electronic equipment
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN112307179A (en) Text matching method, device, equipment and storage medium
CN115983287A (en) Acoustic and text joint coding speech translation model modeling method and device
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN115238048A (en) Quick interaction method for joint chart identification and slot filling
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
Sharan et al. ASR for Speech based Search in Hindi using Attention based Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant