CA3166784A1 - Human-machine interactive speech recognizing method and system for intelligent devices - Google Patents

Human-machine interactive speech recognizing method and system for intelligent devices

Info

Publication number
CA3166784A1
CA3166784A1
Authority
CA
Canada
Prior art keywords
slot
vector
intent
term
hidden state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3166784A
Other languages
French (fr)
Inventor
Pengfei Sun
Hongyuan JIA
Chunsheng Li
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd filed Critical 10353744 Canada Ltd
Publication of CA3166784A1 publication Critical patent/CA3166784A1/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A speech recognition method and system for human-machine interaction of a smart apparatus, pertaining to the technical field of speech recognition, which improve the accuracy of speech recognition by means of joint optimization training of intent detection and slot filling. The method comprises: performing word segmentation on speech data of a user's question to obtain an original word sequence, and generating a vector representation of the original word sequence by means of embedding processing; performing weighting processing on a hidden state vector hi and a slot context vector ci^S to obtain a slot label model yi^S; performing weighting processing on a hidden state vector hT and an intent context vector c^I to obtain an intent prediction model y^I; joining the slot context vector ci^S and the intent context vector c^I by means of a slot gate g, and obtaining a transformed representation of the slot label model yi^S through the slot gate g; and constructing an objective function for joint optimization of the intent prediction model y^I and the transformed slot label model yi^S, and performing intent detection on the speech data of the user's question on the basis of the objective function.

Description

HUMAN-MACHINE INTERACTIVE SPEECH RECOGNIZING METHOD AND
SYSTEM FOR INTELLIGENT DEVICES
BACKGROUND OF THE INVENTION
Technical Field

[0001] The present invention relates to the technical field of speech recognition, and more particularly to a human-machine interactive speech recognizing method and system for an intelligent device.
Description of Related Art
[0002] With the development of internet technology, more and more intelligent devices employ speech for human-machine interaction.
Currently available speech interactive systems include Siri, Xiaomi, Cortana, Avatar Framework, and Duer, among others. As compared with traditional human-machine interaction based on manual input, speech-based human-machine interaction is convenient, efficient, and applicable to a broad range of scenarios. During the process of speech recognition, intent recognition and slot filling techniques are key to ensuring the accuracy of speech recognition results.
[0003] Intent recognition can be abstracted as a classification problem, and a classifier, typically a CNN augmented with knowledge, is employed to train an intent recognition model, into which a semantic representation of knowledge is further introduced, in addition to word embedding of users' speech questions, to enhance the generalization capability of the representation layer. It has been found in practical application, however, that such a model suffers from slot-information filling deviation, whereby the accuracy of the intent recognition model is adversely affected. The essence of slot filling is to formalize a sentence sequence into a marked sequence, and there are many frequently used sequence-marking methods, such as the hidden Markov model or the conditional random field model; but these slot filling models cannot satisfy practical application requirements under specific application scenarios, because the lack of contextual information causes slot ambiguities under different semantic intents.
Seen as such, in the state of the art the two models are trained independently, without combined optimization of the intent recognition task and the slot filling task, so that the finally trained models suffer from low recognition accuracy in speech recognition, and user experience is lowered.
SUMMARY OF THE INVENTION
[0004] The objective of the present invention is to provide a human-machine interactive speech recognizing method and system for an intelligent device, to enhance accuracy of speech recognition by jointly optimizing and training intent recognition and slot filling.
[0005] To achieve the above objective, according to one aspect, the present invention provides a human-machine interactive speech recognizing method for an intelligent device, the method comprising:
[0006] subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process;
[0007] calculating a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weighting the hidden state vector hi and the slot context vector ci^S to thereafter obtain a slot label model yi^S;
[0008] calculating a hidden state vector hT and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector hT and the intent context vector c^I to thereafter obtain an intent prediction model y^I;
[0009] employing a slot gate g to join the slot context vector ci^S and the intent context vector c^I, and generating a transformed representation of the slot label model yi^S through the slot gate g; and
[0010] jointly optimizing the intent prediction model y^I and the transformed slot label model yi^S to construct a target function, and performing intent recognition on the speech question of the user based on the target function.
[0011] Preferably, the step of subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process includes:
[0012] receiving the speech question of the user and transforming the speech question to a recognizable text, and employing a tokenizer to term-segment the recognizable text and obtain the original term sequence; and
[0013] subjecting the original term sequence to a word embedding process, and realizing a vector representation of each segmented term in the original term sequence.
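The segmentation and embedding step can be sketched as follows. This is a minimal illustration, not the claimed implementation: the whitespace tokenizer, the embedding dimension of 4, and the randomly initialized embedding table are all assumptions (a deployed system would use a trained tokenizer and a trained embedding layer):

```python
import random

def tokenize(text):
    # Stand-in tokenizer: split the recognized text on whitespace.
    # A production system would use a language-specific tokenizer.
    return text.lower().split()

def embed(terms, table, dim=4, seed=0):
    # Look up (or lazily create) a fixed-dimension vector per segmented term,
    # so repeated terms share the same vector representation.
    rng = random.Random(seed)
    vectors = []
    for term in terms:
        if term not in table:
            table[term] = [rng.uniform(-1, 1) for _ in range(dim)]
        vectors.append(table[term])
    return vectors

table = {}
terms = tokenize("play some jazz music")
vectors = embed(terms, table)
print(len(terms), len(vectors[0]))  # 4 terms, each a 4-dimensional vector
```

The resulting list of vectors is what the bidirectional LSTM encoder of the following steps consumes.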
[0014] Preferably, the step of calculating a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weighting the hidden state vector hi and the slot context vector ci^S to thereafter obtain a slot label model yi^S includes:
[0015] employing a bidirectional LSTM network to encode each term segmentation vector, and outputting the hidden state vector hi corresponding to each term segmentation vector;
[0016] calculating the slot context vector ci^S, to which each term segmentation vector corresponds, through formula ci^S = Σj αi,j · hj, wherein αi,j represents an attention weight of a slot, its calculation formula is αi,j = exp(ei,j) / Σk=1..T exp(ei,k), ei,k = σ(W_he^S · hk), where σ represents a slot activation function, and W_he^S represents a slot weight matrix; and
[0017] constructing a slot label model yi^S = softmax(W_hy^S · (hi + ci^S)) based on the hidden state vector hi and the slot context vector ci^S.
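A minimal numeric sketch of the slot branch follows. The sigmoid standing in for the activation σ, the reduction of the score vector to a scalar by summing, and the toy dimensions and weights are illustrative assumptions; in practice W_he^S and W_hy^S are learned:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def slot_context(h, W_he_s):
    # c_i^S = sum_j alpha_{i,j} h_j, with alpha from e_{i,k} = sigma(W_he^S h_k).
    # As written, the scores depend only on k, so every position shares one context.
    e = [sigmoid(sum(matvec(W_he_s, hk))) for hk in h]
    alpha = softmax(e)
    dim = len(h[0])
    return [sum(alpha[j] * h[j][d] for j in range(len(h))) for d in range(dim)]

def slot_label(h_i, c_s, W_hy_s):
    # y_i^S = softmax(W_hy^S (h_i + c_i^S))
    return softmax(matvec(W_hy_s, [a + b for a, b in zip(h_i, c_s)]))

# Toy hidden states for T = 3 terms of dimension 2, and toy weights.
h = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
W_he_s = [[0.2, -0.1], [0.4, 0.3]]
W_hy_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 slot labels
c_s = slot_context(h, W_he_s)
y_s = slot_label(h[0], c_s, W_hy_s)
print(round(sum(y_s), 6))  # a distribution over 3 slot labels, summing to 1
```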
[0018] Further, the step of calculating a hidden state vector hT and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector hT and the intent context vector c^I to thereafter obtain an intent prediction model y^I includes:
[0019] employing a hidden unit in the bidirectional LSTM network to encode the vectorized original term sequence, and obtaining the hidden state vector hT;
[0020] calculating the intent context vector c^I of the original term sequence through formula c^I = Σj αj · hT, wherein αj represents an attention weight of an intent, its calculation formula is αj = exp(ej) / Σk=1..T exp(ek), ej = σ^I(W_he^I · hT), where σ^I represents an intent activation function, and W_he^I represents an intent weight matrix; and
[0021] constructing an intent prediction model y^I = softmax(W_hy^I · (hT + c^I)) based on the hidden state vector hT and the intent context vector c^I.
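The intent branch can be sketched in the same style. Note that, read literally, c^I = Σj αj · hT collapses to hT because the attention weights sum to one; the sketch computes it literally anyway. The sigmoid for σ^I, the scalar score reduction, and the toy weights are assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def intent_prediction(h_T, W_he_i, W_hy_i, T=3):
    # e_j = sigma^I(W_he^I . h_T), reduced to a scalar score by summing
    # (the patent leaves the exact reduction unspecified).
    e = 1.0 / (1.0 + math.exp(-sum(matvec(W_he_i, h_T))))
    alpha = softmax([e] * T)  # every position scores h_T, so weights are uniform
    # c^I = sum_j alpha_j . h_T; since the weights sum to 1, this equals h_T.
    c_I = [sum(alpha[j] * x for j in range(T)) for x in h_T]
    # y^I = softmax(W_hy^I . (h_T + c^I))
    return softmax(matvec(W_hy_i, [a + b for a, b in zip(h_T, c_I)]))

h_T = [0.2, -0.1]
W_he_i = [[0.3, 0.1], [0.0, 0.2]]
W_hy_i = [[1.0, 0.0], [0.0, 1.0]]  # two intent classes
y_I = intent_prediction(h_T, W_he_i, W_hy_i)
print([round(p, 3) for p in y_I])
```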
[0022] Preferably, the step of employing a slot gate g to join the slot context vector ci^S and the intent context vector c^I, and generating a transformed representation of the slot label model yi^S through the slot gate g includes:
[0023] formally representing the slot gate g as g = v · tanh(ci^S + W · c^I), wherein v represents a weight vector obtained by training, and W represents a weight matrix obtained by training; and
[0024] formally representing the transformation of the slot label model yi^S through the slot gate g as:
[0025] yi^S = softmax(W_hy^S · (hi + ci^S · g)).
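The slot gate reduces to a single scalar that rescales the slot context before the softmax. A sketch, with toy vectors and trained parameters v and W replaced by fixed illustrative values:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def slot_gate(c_s, c_i, v, W):
    # g = v . tanh(c^S + W . c^I): a scalar gate weighting the slot context
    # by how well it agrees with the intent context.
    inner = [math.tanh(a + b) for a, b in zip(c_s, matvec(W, c_i))]
    return sum(vk * x for vk, x in zip(v, inner))

def gated_slot_label(h_i, c_s, g, W_hy_s):
    # y_i^S = softmax(W_hy^S (h_i + c_i^S * g))
    return softmax(matvec(W_hy_s, [a + b * g for a, b in zip(h_i, c_s)]))

c_s = [0.2, 0.4]
c_i = [0.1, -0.3]
v = [0.5, 0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
W_hy_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
g = slot_gate(c_s, c_i, v, W)   # g ~ 0.195 for these toy values
y = gated_slot_label([0.1, 0.3], c_s, g, W_hy_s)
print(round(g, 4), round(sum(y), 6))
```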
[0026] Optionally, the target function constructed by jointly optimizing the intent prediction model y^I and the transformed slot label model yi^S is:
[0027] p(y^S, y^I | X) = p(y^I | X) · Πi=1..T p(yi^S | X), wherein p(y^S, y^I | X) represents a conditional probability for outputting slot filling and intent prediction at a given original term sequence, where X is the vectorized original term sequence.
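The target function is the product of one intent probability with the per-position slot probabilities; joint training maximizes it (equivalently, minimizes its negative log). A sketch with assumed toy probabilities:

```python
import math

def joint_probability(p_intent, p_slots):
    # p(y^S, y^I | X) = p(y^I | X) * prod_i p(y_i^S | X)
    p = p_intent
    for p_i in p_slots:
        p *= p_i
    return p

def joint_nll(p_intent, p_slots):
    # Negative log-likelihood, the quantity actually minimized in joint training.
    return -math.log(joint_probability(p_intent, p_slots))

# Toy probabilities for one utterance of T = 3 terms (illustrative values).
p_intent = 0.9             # probability of the gold intent
p_slots = [0.8, 0.7, 0.9]  # probability of the gold slot label at each position
p = joint_probability(p_intent, p_slots)
print(round(p, 4))  # → 0.4536, i.e. 0.9 * 0.8 * 0.7 * 0.9
```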
[0028] Preferably, the step of performing intent recognition on the speech question of the user based on the target function includes:

Date Regue/Date Received 2022-07-04
[0029] sequentially obtaining intent conditional probabilities, to which the various segmented terms in the original term sequence correspond, through the target function;
and
[0030] screening therefrom a segmented term with the maximum probability value and recognizing the segmented term as the intent of the speech question of the user.
[0031] In comparison with prior-art technology, the human-machine interactive speech recognizing method for an intelligent device provided by the present invention achieves the following advantageous effects.
[0032] In the human-machine interactive speech recognizing method for an intelligent device provided by the present invention, the speech question of the user as obtained is first transformed into a recognizable text, a term-segmenting process is carried out on the recognizable text to generate an original term sequence, which is then subjected to a word embedding process to realize vector representation. Thereafter, a slot label model yi^S and an intent prediction model y^I are respectively constructed on the basis of the vectorized original term sequence: the step of constructing the slot label model yi^S is to calculate a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weight the hidden state vector hi and the slot context vector ci^S to thereafter obtain the slot label model yi^S, while the step of constructing the intent prediction model y^I is to calculate a hidden state vector hT and an intent context vector c^I of the original term sequence, and weight the hidden state vector hT and the intent context vector c^I to thereafter obtain the intent prediction model y^I. In order to fuse the intent prediction model y^I with the slot label model yi^S, a decoder layer is additionally added to the existing encoder-decoder framework to construct the intent prediction model y^I, and the slot context vector ci^S and the intent context vector c^I are joined by introducing a slot gate g. Finally, the intent prediction model y^I and the transformed slot label model yi^S are jointly optimized to obtain a target function, the target function is employed to sequentially obtain the intent conditional probabilities to which the various segmented terms in the original term sequence correspond, and a segmented term with the maximum probability value is screened therefrom and recognized as the intent of the speech question of the user, so as to ensure accuracy of speech recognition.
[0033] According to another aspect, the present invention provides a human-machine interactive speech recognizing system for an intelligent device, wherein the system is applied to the human-machine interactive speech recognizing method for an intelligent device as recited in the foregoing technical solution, and the system comprises:
[0034] a term segmentation processing unit, for subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process;
[0035] a first calculating unit, for calculating a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weighting the hidden state vector hi and the slot context vector ci^S to thereafter obtain a slot label model yi^S;
[0036] a second calculating unit, for calculating a hidden state vector hT and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector hT and the intent context vector c^I to thereafter obtain an intent prediction model y^I;
[0037] a model transforming unit, for employing a slot gate g to join the slot context vector ci^S and the intent context vector c^I, and generating a transformed representation of the slot label model yi^S through the slot gate g; and
[0038] a joint optimization unit, for jointly optimizing the intent prediction model y^I and the transformed slot label model yi^S to construct a target function, and performing intent recognition on the speech question of the user based on the target function.
[0039] Preferably, the term segmentation processing unit includes:
[0040] a term-segmenting module, for receiving the speech question of the user and transforming the speech question to a recognizable text, and employing a tokenizer to term-segment the recognizable text and obtain the original term sequence; and
[0041] an embedding processing module, for subjecting the original term sequence to a word embedding process, and realizing a vector representation of each segmented term in the original term sequence.
[0042] Preferably, the first calculating unit includes:
[0043] a hidden state calculating module, for employing a bidirectional LSTM network to encode each term segmentation vector, and outputting the hidden state vector hi corresponding to each term segmentation vector;
[0044] a slot context calculating module, for calculating the slot context vector ci^S, to which each term segmentation vector corresponds, through formula ci^S = Σj αi,j · hj, wherein αi,j represents an attention weight of a slot, its calculation formula is αi,j = exp(ei,j) / Σk=1..T exp(ei,k), ei,k = σ(W_he^S · hk), where σ represents a slot activation function, and W_he^S represents a slot weight matrix; and
[0045] a slot label model module, for constructing a slot label model yi^S = softmax(W_hy^S · (hi + ci^S)) based on the hidden state vector hi and the slot context vector ci^S.
[0046] As compared with prior-art technology, the advantageous effects achieved by the human-machine interactive speech recognizing system for an intelligent device provided by the present invention are identical with the advantageous effects achievable by the human-machine interactive speech recognizing method for an intelligent device provided by the foregoing technical solution, so these are not redundantly described in this context.
BRIEF DESCRIPTION OF THE DRAWINGS
[0047] The drawings described here are meant to provide further understanding of the present invention, and constitute part of the present invention. The exemplary embodiments of the present invention and the descriptions thereof are meant to explain the present invention, rather than to restrict the present invention. In the drawings:

[0048] Fig. 1 is a flowchart schematically illustrating the human-machine interactive speech recognizing method for an intelligent device in Embodiment 1 of the present invention;
[0049] Fig. 2 is an exemplary view illustrating the encoder-decoder fusion model in Embodiment 1 of the present invention;
[0050] Fig. 3 is an exemplary view illustrating the slot gate g in Fig. 2; and
[0051] Fig. 4 is a block diagram illustrating the structure of the human-machine interactive speech recognizing system for an intelligent device in Embodiment 2 of the present invention.
[0052] Reference numerals:
[0053] 1 - term segmentation processing unit; 2 - first calculating unit;
[0054] 3 - second calculating unit; 4 - model transforming unit;
[0055] 5 - joint optimization unit

DETAILED DESCRIPTION OF THE INVENTION
[0056] To make the objectives, features, and advantages of the present invention more lucid and clear, the technical solutions in the embodiments of the present invention are clearly and comprehensively described below with reference to the accompanying drawings. Apparently, the embodiments as described are merely some, rather than all, of the embodiments of the present invention.
All other embodiments obtainable by persons ordinarily skilled in the art on the basis of the embodiments in the present invention without spending creative effort shall all fall within the protection scope of the present invention.
[0057] Embodiment 1
[0058] Fig. 1 is a flowchart schematically illustrating the human-machine interactive speech recognizing method for an intelligent device in Embodiment 1 of the present invention.
Referring to Fig. 1, the human-machine interactive speech recognizing method for an intelligent device provided by this embodiment comprises:
[0059] subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process; calculating a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weighting the hidden state vector hi and the slot context vector ci^S to thereafter obtain a slot label model yi^S; calculating a hidden state vector hT and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector hT and the intent context vector c^I to thereafter obtain an intent prediction model y^I; employing a slot gate g to join the slot context vector ci^S and the intent context vector c^I, and generating a transformed representation of the slot label model yi^S through the slot gate g; and jointly optimizing the intent prediction model y^I and the transformed slot label model yi^S to construct a target function, and performing intent recognition on the speech question of the user based on the target function.
[0060] In the human-machine interactive speech recognizing method for an intelligent device provided by this embodiment, the speech question of the user as obtained is first transformed into a recognizable text, a term-segmenting process is carried out on the recognizable text to generate an original term sequence, which is then subjected to a word embedding process to realize vector representation. Thereafter, a slot label model yi^S and an intent prediction model y^I are respectively constructed on the basis of the vectorized original term sequence: the step of constructing the slot label model yi^S is to calculate a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weight the hidden state vector hi and the slot context vector ci^S to thereafter obtain the slot label model yi^S, while the step of constructing the intent prediction model y^I is to calculate a hidden state vector hT and an intent context vector c^I of the original term sequence, and weight the hidden state vector hT and the intent context vector c^I to thereafter obtain the intent prediction model y^I. As shown in Fig. 2, in order to fuse the intent prediction model y^I with the slot label model yi^S, a decoder layer is additionally added to the existing encoder-decoder framework to construct the intent prediction model y^I, and the slot context vector ci^S and the intent context vector c^I are joined by introducing a slot gate g. Finally, the intent prediction model y^I and the transformed slot label model yi^S are jointly optimized to obtain a target function, the target function is employed to sequentially obtain the intent conditional probabilities to which the various segmented terms in the original term sequence correspond, and a segmented term with the maximum probability value is subsequently screened therefrom and recognized as the intent of the speech question of the user, so as to ensure accuracy of speech recognition.
[0061] Specifically, the step of subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process in the foregoing embodiment includes:
[0062] receiving the speech question of the user and transforming the speech question to a recognizable text, and employing a tokenizer to term-segment the recognizable text and obtain the original term sequence; and subjecting the original term sequence to a word embedding process, and realizing a vector representation of each segmented term in the original term sequence.
[0063] As should be noted, the step of calculating a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weighting the hidden state vector hi and the slot context vector ci^S to thereafter obtain a slot label model yi^S in the foregoing embodiment includes:
[0064] employing a bidirectional LSTM network to encode each term segmentation vector, and outputting the hidden state vector hi corresponding to each term segmentation vector;
calculating the slot context vector ci^S, to which each term segmentation vector corresponds, through formula ci^S = Σj αi,j · hj, wherein αi,j represents an attention weight of a slot, its calculation formula is αi,j = exp(ei,j) / Σk=1..T exp(ei,k), ei,k = σ(W_he^S · hk), where σ represents a slot activation function, and W_he^S represents a slot weight matrix; and constructing a slot label model yi^S = softmax(W_hy^S · (hi + ci^S)) based on the hidden state vector hi and the slot context vector ci^S.
[0065] During specific implementation, after plural term segmentation vectors have been input to the bidirectional LSTM network, the hidden state vectors hi are correspondingly output one by one. As regards formula ci^S = Σj αi,j · hj of the slot context vector, αi,j represents the attention weight of the slot, i represents the ith term segmentation vector, and j represents the jth element in the ith term segmentation vector. Specifically, the calculation formula of the attention weight of the slot is αi,j = exp(ei,j) / Σk=1..T exp(ei,k), ei,k = σ(W_he^S · hk), where T represents the total number of elements in the term segmentation vector, and k represents the kth element in T. In addition, the slot activation function σ and the slot weight matrix W_he^S can be derived on the basis of vector matrix training of the original term sequence; the specific training processes are conventional technical means frequently employed in this technical field, so these are not redundantly described in this embodiment.
[0066] The step of calculating a hidden state vector hT and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector hT and the intent context vector c^I to thereafter obtain an intent prediction model y^I in the foregoing embodiment includes:
[0067] employing a hidden unit in the bidirectional LSTM network to encode the vectorized original term sequence, and obtaining the hidden state vector hT; calculating the intent context vector c^I of the original term sequence through formula c^I = Σj αj · hT, wherein αj represents an attention weight of an intent, its calculation formula is αj = exp(ej) / Σk=1..T exp(ek), ej = σ^I(W_he^I · hT), where σ^I represents an intent activation function, and W_he^I represents an intent weight matrix; and constructing an intent prediction model y^I = softmax(W_hy^I · (hT + c^I)) based on the hidden state vector hT and the intent context vector c^I.
[0068] During specific implementation, the method of training the intent prediction model y^I is the same as the method of training the slot label model yi^S; the difference rests in the fact that the hidden state vector hT is obtained by means of a hidden unit in the bidirectional LSTM network. After one-dimensional transformation of the vector matrix, formula c^I = Σj αj · hT is subsequently invoked to calculate the intent context vector c^I of the original term sequence, where αj represents the attention weight of the intent, with calculation formula αj = exp(ej) / Σk=1..T exp(ek), ej = σ^I(W_he^I · hT). As regards the intent activation function σ^I and the intent weight matrix W_he^I, these can be derived on the basis of the processed one-dimensional vector training; the specific training processes are conventional technical means frequently employed in this technical field, so these are not redundantly described in this embodiment.
[0069] Moreover, the step of employing a slot gate g to join the slot context vector ci^S and the intent context vector c^I, and generating a transformed representation of the slot label model yi^S through the slot gate g in the foregoing embodiment includes:
[0070] formally representing the slot gate g as g = v · tanh(ci^S + W · c^I), wherein v represents a weight vector obtained by training, and W represents a weight matrix obtained by training; and formally representing the transformation of the slot label model yi^S through the slot gate g as yi^S = softmax(W_hy^S · (hi + ci^S · g)). Fig. 3 shows a structure model of the slot gate g.
[0071] Preferably, the target function constructed by jointly optimizing the intent prediction model y^I and the transformed slot label model yi^S in the foregoing embodiment is:
[0072] p(y^S, y^I | X) = p(y^I | X) · Πi=1..T p(yi^S | X), wherein p(y^S, y^I | X) represents a conditional probability for outputting slot filling and intent prediction at a given original term sequence, where X represents the vectorized original term sequence. After expansion, p(y^S, y^I | X) = p(y^I | x1, …, xT) · Πi=1..T p(yi^S | x1, …, xT), where xi represents the ith term segmentation vector, and T represents the total number of term segmentation vectors. Through calculation of the target function, intent probability values of the various term segmentation vectors can be obtained, and a segmented term with the maximum probability value is screened out of the various term segmentation vectors and recognized as the intent of the speech question of the user.
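Pulling the decoding path of this embodiment together, the following end-to-end sketch chains slot attention, the slot gate, the gated slot labels, and the intent softmax, ending with the argmax over intent probabilities. Every name and value here is illustrative: the fixed hidden states stand in for the trained bidirectional LSTM output, the scalar score reduction inside the attention is an assumption, and the weight matrices would be learned:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(h, W_he_s, W_hy_s, W_hy_i, v, W):
    T, dim = len(h), len(h[0])
    h_T = h[-1]  # final hidden state stands in for the intent encoding
    # Slot attention: alpha_k from e_k = sigma(W_he^S h_k); shared context.
    alpha = softmax([sigmoid(sum(matvec(W_he_s, hk))) for hk in h])
    c_s = [sum(alpha[j] * h[j][d] for j in range(T)) for d in range(dim)]
    c_i = h_T[:]  # c^I = sum_j alpha_j h_T collapses to h_T (weights sum to 1)
    # Slot gate g = v . tanh(c^S + W c^I), then gated slot labels per position.
    g = sum(vk * math.tanh(a + b) for vk, a, b in zip(v, c_s, matvec(W, c_i)))
    y_s = [softmax(matvec(W_hy_s, [a + b * g for a, b in zip(hi, c_s)])) for hi in h]
    # Intent prediction y^I = softmax(W_hy^I (h_T + c^I)), then argmax.
    y_i = softmax(matvec(W_hy_i, [a + b for a, b in zip(h_T, c_i)]))
    intent = max(range(len(y_i)), key=y_i.__getitem__)
    return y_s, y_i, intent

h = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
W_he_s = [[0.2, -0.1], [0.4, 0.3]]
W_hy_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_hy_i = [[0.6, -0.2], [0.1, 0.8]]
v = [0.5, 0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
y_s, y_i, intent = decode(h, W_he_s, W_hy_s, W_hy_i, v, W)
print(intent, [round(p, 3) for p in y_i])
```

The returned `intent` index is the segmented position/class with the maximum probability value, matching the screening step described above.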
[0073] Embodiment 2
[0074] Referring to Fig. 1 and Fig. 4, this embodiment provides a human-machine interactive speech recognizing system for an intelligent device, the system comprising:

[0075] a term segmentation processing unit 1, for subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process;
[0076] a first calculating unit 2, for calculating a hidden state vector hi and a slot context vector ci^S of each term segmentation vector, and weighting the hidden state vector hi and the slot context vector ci^S to thereafter obtain a slot label model yi^S;
[0077] a second calculating unit 3, for calculating a hidden state vector hT and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector hT and the intent context vector c^I to thereafter obtain an intent prediction model y^I;
[0078] a model transforming unit 4, for employing a slot gate g to join the slot context vector ci^S and the intent context vector c^I, and generating a transformed representation of the slot label model yi^S through the slot gate g; and
[0079] a joint optimization unit 5, for jointly optimizing the intent prediction model y^I and the transformed slot label model yi^S to construct a target function, and performing intent recognition on the speech question of the user based on the target function.
Specifically, the term segmentation processing unit includes:
[0080] a term-segmenting module, for receiving the speech question of the user and transforming the speech question to a recognizable text, and employing a tokenizer to term-segment the recognizable text and obtain the original term sequence; and
[0081] an embedding processing module, for subjecting the original term sequence to a word embedding process, and realizing a vector representation of each segmented term in the original term sequence.
[0082] Specifically, the first calculating unit includes:
[0083] a hidden state calculating module, for employing a bidirectional LSTM network to encode each term segmentation vector, and outputting the hidden state vector h_i corresponding to each term segmentation vector;
[0084] a slot context calculating module, for calculating the slot context vector c_i^S, to which each term segmentation vector corresponds, through the formula c_i^S = Σ_j α_{i,j}^S · h_j, wherein α_{i,j}^S represents an attention weight of a slot, its calculation formula is α_{i,j}^S = exp(e_{i,j}) / Σ_k exp(e_{i,k}), with e_{i,j} = σ(W_he^S · h_j), where σ represents a slot activation function, and W_he^S represents a slot weight matrix; and
[0085] a slot label model module, for constructing a slot label model y_i^S = softmax(W_hy^S · (h_i + c_i^S)) based on the hidden state vector h_i and the slot context vector c_i^S.
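A minimal sketch of the slot attention and slot label computation described in paragraphs [0084] and [0085]. As a simplification, a single weight vector stands in for the slot weight matrix W_he^S, so the attention scores here do not vary with the output position i; all numeric values are toy assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def slot_context(hidden, w_he):
    """c^S = sum_j alpha_j * h_j, with alpha = softmax over sigmoid(w_he . h_j).
    w_he is a single weight vector simplifying the slot weight matrix W_he^S."""
    scores = [sigmoid(sum(w * h for w, h in zip(w_he, h_j))) for h_j in hidden]
    alphas = softmax(scores)
    d = len(hidden[0])
    return [sum(a * h_j[k] for a, h_j in zip(alphas, hidden)) for k in range(d)]

def slot_label(h_i, c_s, W_hy):
    """y_i^S = softmax(W_hy^S (h_i + c_i^S))."""
    fused = [hk + ck for hk, ck in zip(h_i, c_s)]
    logits = [sum(row[k] * fused[k] for k in range(len(fused))) for row in W_hy]
    return softmax(logits)

# Toy BiLSTM outputs for a 2-term utterance, 2-dim hidden states, 2 slot labels.
hidden = [[0.1, 0.2], [0.3, -0.1]]
c = slot_context(hidden, w_he=[1.0, 1.0])
y = slot_label(hidden[0], c, W_hy=[[1.0, 0.0], [0.0, 1.0]])
```

Adding the attended context c_i^S to the raw hidden state h_i before the softmax is what lets each slot prediction see the whole utterance rather than only its own position.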
[0086] As compared with prior-art technology, the advantageous effects achieved by the human-machine interactive speech recognizing system for an intelligent device provided by this embodiment of the present invention are identical with the advantageous effects achievable by the human-machine interactive speech recognizing method for an intelligent device provided by the foregoing Embodiment 1, so these are not redundantly described in this context.
[0087] As understandable to persons ordinarily skilled in the art, realization of all or part of the steps in the method of the present invention can be completed via a program that instructs relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, performs the various steps of the method in the foregoing embodiment, wherein the storage medium can be a ROM/RAM, a magnetic disk, an optical disk, or a memory card, etc.
[0088] What the above describes is merely directed to specific modes of execution of the present invention, but the protection scope of the present invention is not restricted thereby. Any change or replacement easily conceivable to persons skilled in the art within the technical range disclosed by the present invention shall be covered by the protection scope of the present invention. Accordingly, the protection scope of the present invention shall be based on the protection scope as claimed in the Claims.
Date Regue/Date Received 2022-07-04

Claims (10)

What is claimed is:
1. A human-machine interactive speech recognizing method for an intelligent device, characterized in comprising:
subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process;
calculating a hidden state vector h_i and a slot context vector c_i^S of each term segmentation vector, and weighting the hidden state vector h_i and the slot context vector c_i^S to thereafter obtain a slot label model y_i^S;
calculating a hidden state vector h_T and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector h_T and the intent context vector c^I to thereafter obtain an intent prediction model y^I;
employing a slot gate g to join the slot context vector c_i^S and the intent context vector c^I, and generating a transformed representation of the slot label model y_i^S through the slot gate g; and jointly optimizing the intent prediction model y^I and the transformed slot label model y_i^S to construct a target function, and performing intent recognition on the speech question of the user based on the target function.
2. The method according to Claim 1, characterized in that the step of subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process includes:
receiving the speech question of the user and transforming the speech question to a recognizable text, and employing a tokenizer to term-segment the recognizable text and obtain the original term sequence; and subjecting the original term sequence to a word embedding process, and realizing a vector representation of each segmented term in the original term sequence.
3. The method according to Claim 1, characterized in that the step of calculating a hidden state vector h_i and a slot context vector c_i^S of each term segmentation vector, and weighting the hidden state vector h_i and the slot context vector c_i^S to thereafter obtain a slot label model y_i^S includes:
employing a bidirectional LSTM network to encode each term segmentation vector, and outputting the hidden state vector h_i corresponding to each term segmentation vector;
calculating the slot context vector c_i^S, to which each term segmentation vector corresponds, through the formula c_i^S = Σ_j α_{i,j}^S · h_j, wherein α_{i,j}^S represents an attention weight of a slot, its calculation formula is α_{i,j}^S = exp(e_{i,j}) / Σ_k exp(e_{i,k}), with e_{i,j} = σ(W_he^S · h_j), where σ represents a slot activation function, and W_he^S represents a slot weight matrix; and constructing a slot label model y_i^S = softmax(W_hy^S · (h_i + c_i^S)) based on the hidden state vector h_i and the slot context vector c_i^S.
4. The method according to Claim 1, characterized in that the step of calculating a hidden state vector h_T and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector h_T and the intent context vector c^I to thereafter obtain an intent prediction model y^I includes:
employing a hidden unit in the bidirectional LSTM network to encode the vectorized original term sequence, and obtaining the hidden state vector h_T;
calculating the intent context vector c^I of the original term sequence through the formula c^I = Σ_j α_j^I · h_j, wherein α_j^I represents an attention weight of an intent, its calculation formula is α_j^I = exp(e_j) / Σ_k exp(e_k), with e_j = σ'(W_he^I · h_j), where σ' represents an intent activation function, and W_he^I represents an intent weight matrix; and constructing an intent prediction model y^I = softmax(W_hy^I · (h_T + c^I)) based on the hidden state vector h_T and the intent context vector c^I.
5. The method according to Claim 1, characterized in that the step of employing a slot gate g to join the slot context vector c_i^S and the intent context vector c^I, and generating a transformed representation of the slot label model y_i^S through the slot gate g includes:
formally representing the slot gate g as g = Σ v · tanh(c_i^S + W · c^I), wherein v represents a weight vector obtained by training, and W represents a weight matrix obtained by training; and formally representing the transformation of the slot label model y_i^S through the slot gate g as y_i^S = softmax(W_hy^S · (h_i + c_i^S · g)).
6. The method according to Claim 1, characterized in that the target function constructed by jointly optimizing the intent prediction model y^I and the transformed slot label model y_i^S is:
p(y^S, y^I | x) = p(y^I | x) · Π_{i=1}^{T} p(y_i^S | x), wherein p(y^S, y^I | x) represents a conditional probability for outputting slot filling and intent prediction at a given original term sequence, where x is the vectorized original term sequence.
7. The method according to Claim 6, characterized in that the step of performing intent recognition on the speech question of the user based on the target function includes:
sequentially obtaining intent conditional probabilities, to which the various segmented terms in the original term sequence correspond, through the target function; and screening therefrom a segmented term with the maximum probability value and recognizing the segmented term as the intent of the speech question of the user.
8. A human-machine interactive speech recognizing system for an intelligent device, characterized in comprising:
a term segmentation processing unit, for subjecting a speech question of a user to a term-segmenting process to obtain an original term sequence, and vectorizing the original term sequence through an embedding process;
a first calculating unit, for calculating a hidden state vector h_i and a slot context vector c_i^S of each term segmentation vector, and weighting the hidden state vector h_i and the slot context vector c_i^S to thereafter obtain a slot label model y_i^S;
a second calculating unit, for calculating a hidden state vector h_T and an intent context vector c^I of the vectorized original term sequence, and weighting the hidden state vector h_T and the intent context vector c^I to thereafter obtain an intent prediction model y^I;
a model transforming unit, for employing a slot gate g to join the slot context vector c_i^S and the intent context vector c^I, and generating a transformed representation of the slot label model y_i^S through the slot gate g; and a joint optimization unit, for jointly optimizing the intent prediction model y^I and the transformed slot label model y_i^S to construct a target function, and performing intent recognition on the speech question of the user based on the target function.
9. The system according to Claim 8, characterized in that the term segmentation processing unit includes:
a term-segmenting module, for receiving the speech question of the user and transforming the speech question to a recognizable text, and employing a tokenizer to term-segment the recognizable text and obtain the original term sequence; and an embedding processing module, for subjecting the original term sequence to a word embedding process, and realizing a vector representation of each segmented term in the original term sequence.
10. The system according to Claim 8, characterized in that the first calculating unit includes:
a hidden state calculating module, for employing a bidirectional LSTM network to encode each term segmentation vector, and outputting the hidden state vector h_i corresponding to each term segmentation vector;
a slot context calculating module, for calculating the slot context vector c_i^S, to which each term segmentation vector corresponds, through the formula c_i^S = Σ_j α_{i,j}^S · h_j, wherein α_{i,j}^S represents an attention weight of a slot, its calculation formula is α_{i,j}^S = exp(e_{i,j}) / Σ_k exp(e_{i,k}), with e_{i,j} = σ(W_he^S · h_j), where σ represents a slot activation function, and W_he^S represents a slot weight matrix; and a slot label model module, for constructing a slot label model y_i^S = softmax(W_hy^S · (h_i + c_i^S)) based on the hidden state vector h_i and the slot context vector c_i^S.
CA3166784A 2019-01-02 2019-09-19 Human-machine interactive speech recognizing method and system for intelligent devices Pending CA3166784A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201910002748.8 2019-01-02
CN201910002748.8A CN109785833A (en) 2019-01-02 2019-01-02 Human-computer interaction audio recognition method and system for smart machine
PCT/CN2019/106778 WO2020140487A1 (en) 2019-01-02 2019-09-19 Speech recognition method for human-machine interaction of smart apparatus, and system

Publications (1)

Publication Number Publication Date
CA3166784A1 true CA3166784A1 (en) 2020-07-09

Family

ID=66499837

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3166784A Pending CA3166784A1 (en) 2019-01-02 2019-09-19 Human-machine interactive speech recognizing method and system for intelligent devices

Country Status (3)

Country Link
CN (1) CN109785833A (en)
CA (1) CA3166784A1 (en)
WO (1) WO2020140487A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785833A (en) * 2019-01-02 2019-05-21 苏宁易购集团股份有限公司 Human-computer interaction audio recognition method and system for smart machine
CN110532355B (en) * 2019-08-27 2022-07-01 华侨大学 Intention and slot position joint identification method based on multitask learning
CN110750628A (en) * 2019-09-09 2020-02-04 深圳壹账通智能科技有限公司 Session information interaction processing method and device, computer equipment and storage medium
CN110795532A (en) * 2019-10-18 2020-02-14 珠海格力电器股份有限公司 Voice information processing method and device, intelligent terminal and storage medium
CN110853626B (en) * 2019-10-21 2021-04-20 成都信息工程大学 Bidirectional attention neural network-based dialogue understanding method, device and equipment
CN110827816A (en) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 Voice instruction recognition method and device, electronic equipment and storage medium
CN111090728B (en) * 2019-12-13 2023-05-26 车智互联(北京)科技有限公司 Dialogue state tracking method and device and computing equipment
CN111062209A (en) * 2019-12-16 2020-04-24 苏州思必驰信息科技有限公司 Natural language processing model training method and natural language processing model
CN111177381A (en) * 2019-12-21 2020-05-19 深圳市傲立科技有限公司 Slot filling and intention detection joint modeling method based on context vector feedback
WO2021140447A1 (en) * 2020-01-06 2021-07-15 7Hugs Labs System and method for controlling a plurality of devices
CN111339770B (en) * 2020-02-18 2023-07-21 百度在线网络技术(北京)有限公司 Method and device for outputting information
CN111833849A (en) * 2020-03-10 2020-10-27 北京嘀嘀无限科技发展有限公司 Method for speech recognition and speech model training, storage medium and electronic device
CN113505591A (en) * 2020-03-23 2021-10-15 华为技术有限公司 Slot position identification method and electronic equipment
CN111597342B (en) * 2020-05-22 2024-01-26 北京慧闻科技(集团)有限公司 Multitasking intention classification method, device, equipment and storage medium
CN113779975B (en) * 2020-06-10 2024-03-01 北京猎户星空科技有限公司 Semantic recognition method, device, equipment and medium
CN112069828B (en) * 2020-07-31 2023-07-04 飞诺门阵(北京)科技有限公司 Text intention recognition method and device
CN112800190B (en) * 2020-11-11 2022-06-10 重庆邮电大学 Intent recognition and slot value filling joint prediction method based on Bert model
CN112765959B (en) * 2020-12-31 2024-05-28 康佳集团股份有限公司 Intention recognition method, device, equipment and computer readable storage medium
CN114969339B (en) * 2022-05-30 2023-05-12 中电金信软件有限公司 Text matching method and device, electronic equipment and readable storage medium
CN115358186B (en) * 2022-08-31 2023-11-14 南京擎盾信息科技有限公司 Generating method and device of slot label and storage medium
CN115273849B (en) * 2022-09-27 2022-12-27 北京宝兰德软件股份有限公司 Intention identification method and device for audio data
CN117151121B (en) * 2023-10-26 2024-01-12 安徽农业大学 Multi-intention spoken language understanding method based on fluctuation threshold and segmentation

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10319375B2 (en) * 2016-12-28 2019-06-11 Amazon Technologies, Inc. Audio message extraction
CN107491541B (en) * 2017-08-24 2021-03-02 北京丁牛科技有限公司 Text classification method and device
CN108415923B (en) * 2017-10-18 2020-12-11 北京邮电大学 Intelligent man-machine conversation system of closed domain
CN108417205B (en) * 2018-01-19 2020-12-18 苏州思必驰信息科技有限公司 Semantic understanding training method and system
CN108876527A (en) * 2018-06-06 2018-11-23 北京京东尚科信息技术有限公司 Method of servicing and service unit, using open platform and storage medium
CN108874782B (en) * 2018-06-29 2019-04-26 北京寻领科技有限公司 A kind of more wheel dialogue management methods of level attention LSTM and knowledge mapping
CN109065053B (en) * 2018-08-20 2020-05-15 百度在线网络技术(北京)有限公司 Method and apparatus for processing information
CN109785833A (en) * 2019-01-02 2019-05-21 苏宁易购集团股份有限公司 Human-computer interaction audio recognition method and system for smart machine

Also Published As

Publication number Publication date
WO2020140487A1 (en) 2020-07-09
CN109785833A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CA3166784A1 (en) Human-machine interactive speech recognizing method and system for intelligent devices
US10373610B2 (en) Systems and methods for automatic unit selection and target decomposition for sequence labelling
TWI530940B (en) Method and apparatus for acoustic model training
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
US20220351487A1 (en) Image Description Method and Apparatus, Computing Device, and Storage Medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111916067A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN106202056B (en) Chinese word segmentation scene library update method and system
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115617955B (en) Hierarchical prediction model training method, punctuation symbol recovery method and device
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN114913590B (en) Data emotion recognition method, device and equipment and readable storage medium
CN115100582B (en) Model training method and device based on multi-mode data
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN116259075A (en) Pedestrian attribute identification method based on prompt fine tuning pre-training large model
CN114387537A (en) Video question-answering method based on description text
CN113113024A (en) Voice recognition method and device, electronic equipment and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN114860938A (en) Statement intention identification method and electronic equipment
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN116522905B (en) Text error correction method, apparatus, device, readable storage medium, and program product
CN115408494A (en) Text matching method integrating multi-head attention alignment
US11321527B1 (en) Effective classification of data based on curated features

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220704
