CN111986673A - Slot value filling method and device for voice recognition and electronic equipment - Google Patents


Info

Publication number
CN111986673A
CN111986673A
Authority
CN
China
Prior art keywords
slot
semantic vector
text
voice
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010721012.9A
Other languages
Chinese (zh)
Inventor
刘志敏
刘宗全
张家兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qibao Xinan Technology Co ltd
Original Assignee
Beijing Qibao Xinan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qibao Xinan Technology Co ltd
Priority to CN202010721012.9A
Publication of CN111986673A
Legal status: Withdrawn

Classifications

    • G10L 15/26 — Speech recognition: speech to text systems
    • G06F 18/2415 — Pattern recognition: classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06F 40/30 — Handling natural language data: semantic analysis
    • G10L 15/063 — Speech recognition: training
    • G10L 15/08 — Speech recognition: speech classification or search
    • G10L 2015/0631 — Creating reference templates; clustering
    • G10L 2015/088 — Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a slot value filling method and device for voice recognition, and an electronic device. The method comprises the following steps: configuring a slot description text and a specific number of example slot values corresponding to each slot; acquiring user voice in a task-based dialog to obtain a voice text corresponding to the user voice; converting the slot description text, the example slot values, and the voice text into semantic vectors to obtain a slot description text semantic vector, example slot value semantic vectors, and a voice text semantic vector, respectively; calculating an attention weight coefficient for each word in the voice text to obtain an attention semantic vector corresponding to each word; splicing these vectors to obtain a combined semantic vector; and labeling and BIO-classifying the combined semantic vector to determine the slots contained in the voice text and the slot values corresponding to those slots. The method improves recognition capability, has cross-domain migration capability, and improves robustness to data misalignment.

Description

Slot value filling method and device for voice recognition and electronic equipment
Technical Field
The invention relates to the field of computer information processing, in particular to a slot value filling method and device for voice recognition and electronic equipment.
Background
With the development of internet technology, dialog systems are widely used in e-commerce, intelligent devices, and other fields, and are attracting more and more attention. Common task-based dialog systems include Siri, Echo, AliMe, etc. In a task-based dialog system, slot value filling is a common technique for understanding user intent: only by accurately identifying the slot value of each slot in the dialog can the user's intent be represented semantically, and subsequent state management and reply generation are performed based on this semantic representation.
For existing slot value filling techniques to achieve good results, a large amount of labeled data is often required, yet many slots have numerous, even unenumerable, slot values; moreover, when the dialog domain changes, data labeling must be performed again in the new domain, which consumes manpower. In addition, existing methods lack robustness to data misalignment, which typically appears as: slots with the same semantics use different literal names, or the slot values of the same slot differ across domains. For example, consider the air ticket reservation systems of two companies, A and B. For the slot denoting the point where a flight lands, company A's system uses the name "destination" while company B's system uses "arrival place"; the two are semantically identical but literally different. A slot recognizer applicable to company A's system therefore cannot be migrated directly to company B's system; in other words, it cannot be reused directly, and costly manual relabeling is required. Furthermore, there is still much room for improvement in recognition performance.
Therefore, it is necessary to provide a slot value filling method for speech recognition that solves the problems caused by data misalignment while improving recognition capability.
Disclosure of Invention
In order to solve the above problems, the present invention provides a slot value filling method for speech recognition, used for recognizing, in a task-based dialog, the slots contained in speech input by a user and the slot values corresponding to those slots, comprising the following steps: configuring a slot description text and a specific number of example slot values corresponding to each slot; acquiring user voice in the task-based dialog, recognizing the user voice, and obtaining a voice text corresponding to the user voice; converting the slot description text, the example slot values, and the voice text into semantic vectors to obtain a slot description text semantic vector, example slot value semantic vectors, and a voice text semantic vector, respectively; calculating an attention weight coefficient for each word in the voice text to obtain an attention semantic vector corresponding to each word; splicing the voice text semantic vector, the attention semantic vector corresponding to each word, and the slot description text semantic vector to obtain a combined semantic vector; and inputting the combined semantic vector into a sequence annotator and a sequence classifier for annotation and BIO classification to obtain the BIO class of each word in the voice text, and determining the slots contained in the voice text and the slot values corresponding to those slots according to the BIO classes.
Preferably, the calculating of the attention weight coefficient of each word in the speech text to obtain the attention semantic vector corresponding to each word includes: performing attention weight calculation between the example slot value semantic vectors and the semantic vector of each word in the voice text to obtain attention weight coefficients, and taking the weighted sum to obtain the attention semantic vector corresponding to each word's semantic vector.
Preferably, the method further comprises the following steps: based on historical dialogue data of a specific task, a slot value database is constructed, wherein the slot value database comprises preset slot positions, a specific number of example slot values corresponding to the preset slot positions, description text information of each preset slot position and a corresponding relation of the slot positions with the same semantics.
Preferably, the number of example slot values is between 2 and 5.
Preferably, the step of converting the slot description text, the example slot value and the voice text into semantic vectors includes: and inputting the slot description text, the example slot value and the voice text into a BERT pre-training model to output a corresponding semantic vector.
Preferably, the sequence annotation model is a Bi-LSTM model.
Preferably, the inputting of the combined semantic vector into a sequence annotator and a sequence classifier for annotation and BIO classification to obtain the BIO class of each word in the speech text includes: inputting the combined semantic vector into a trained Bi-LSTM model, and inputting the hidden vectors output by the LSTM layer into a softmax layer to perform 3-way BIO sequence classification.
In addition, the present invention also provides a slot value filling apparatus for speech recognition, configured to recognize, in a task-based dialog, the slots contained in speech input by a user and the slot values corresponding to those slots, comprising: a configuration module for configuring a slot description text and a specific number of example slot values corresponding to each slot; a data acquisition module for acquiring user voice in the task-based dialog, recognizing the user voice, and obtaining a voice text corresponding to the user voice; a conversion module for converting the slot description text, the example slot values, and the voice text into semantic vectors to obtain a slot description text semantic vector, example slot value semantic vectors, and a voice text semantic vector, respectively; a calculation module for calculating an attention weight coefficient for each word in the voice text to obtain an attention semantic vector corresponding to each word; a processing module for splicing the voice text semantic vector, the attention semantic vector corresponding to each word, and the slot description text semantic vector to obtain a combined semantic vector; and a determining module for inputting the combined semantic vector into the sequence annotator and the sequence classifier for annotation and BIO classification to obtain the BIO class of each word in the voice text, and determining the slots contained in the voice text and the slot values corresponding to those slots according to the BIO classes.
Preferably, the calculation module further performs attention weight calculation between the example slot value semantic vectors and the semantic vector of each word in the voice text to obtain attention weight coefficients, and obtains the attention semantic vector corresponding to each word's semantic vector by the weighted sum.
Preferably, the apparatus further comprises a construction module which constructs a slot value database based on historical dialog data of a specific task, the slot value database including preset slots, a specific number of example slot values corresponding to the preset slots, description text information of each preset slot, and the correspondence between slots with the same semantics.
Preferably, the number of example slot values is between 2 and 5.
Preferably, the step of converting the slot description text, the example slot value and the voice text into semantic vectors includes: and inputting the slot description text, the example slot value and the voice text into a BERT pre-training model to output a corresponding semantic vector.
Preferably, the sequence annotation model is a Bi-LSTM model.
Preferably, the apparatus further comprises a classification module which inputs the combined semantic vector into a trained Bi-LSTM model and inputs the hidden vectors output by the LSTM layer into a softmax layer to perform 3-way BIO sequence classification.
In addition, the present invention also provides an electronic device, wherein the electronic device includes: a processor; and a memory storing computer executable instructions that, when executed, cause the processor to perform the slot-value-filling method for speech recognition of the present invention.
Further, the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs which, when executed by a processor, implement the slot value filling method for speech recognition according to the present invention.
Advantageous effects
Compared with the prior art, the slot value filling method of the invention encodes the slot description text information and the slot values with a pre-training model, so that slots that are semantically similar but absent from the training data can be recognized. The attention weight coefficient of each word in the voice text to be recognized is calculated by an attention mechanism model to obtain each word's attention semantic vector; a combined semantic vector is then obtained by splicing, and labeling and classification are performed to obtain the recognition result. This improves recognition capability, provides cross-domain migration capability, and improves robustness to misalignment. The method achieves an effect approximately equal to that of a slot value filler trained on a large amount of training data while using only a small number of labeled samples, without requiring extensive manual labeling.
Drawings
In order to make the technical problems solved, the technical means adopted, and the technical effects obtained by the present invention clearer, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings described below illustrate only exemplary embodiments of the invention, from which those skilled in the art can derive other embodiments without inventive effort.
FIG. 1 is a flow chart of one example of a slot filling method for speech recognition of the present invention.
FIG. 2 is a flow chart of another example of a slot-value-filling method for speech recognition of the present invention.
Fig. 3 is a block diagram showing an example of an application scenario of the slot filling method for speech recognition according to the present invention.
FIG. 4 is a flow chart of yet another example of a slot-value-filling method for speech recognition of the present invention.
Fig. 5 is a schematic block diagram of an example of the slot value filling apparatus for speech recognition of the present invention.
Fig. 6 is a schematic block diagram of another example of the slot value filling apparatus for speech recognition of the present invention.
Fig. 7 is a schematic block diagram of still another example of the slot value filling apparatus for speech recognition of the present invention.
Fig. 8 is a block diagram of an exemplary embodiment of an electronic device according to the present invention.
Fig. 9 is a block diagram of an exemplary embodiment of a computer-readable medium according to the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art. The same reference numerals denote the same or similar elements, components, or parts in the drawings, and thus their repetitive description will be omitted.
Features, structures, characteristics or other details described in a particular embodiment do not preclude the fact that the features, structures, characteristics or other details may be combined in a suitable manner in one or more other embodiments in accordance with the technical idea of the invention.
In describing particular embodiments, the present invention has been described with reference to features, structures, characteristics or other details that are within the purview of one skilled in the art to provide a thorough understanding of the embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific features, structures, characteristics, or other details.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these terms should not be construed as limiting. These phrases are used to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention.
The term "and/or" includes any and all combinations of one or more of the associated listed items.
In order to solve the problem that slots cannot be recognized across misaligned, cross-domain, or cross-service data, and to achieve good slot value filling with only a small number of labeled samples, the present invention provides a slot value filling method for voice recognition. The method encodes slot description text information and slot values with a pre-training model, so that slots that are semantically similar but absent from the training data can be recognized; the attention weight coefficient of each word in the voice text to be recognized is calculated by an attention mechanism model to obtain each word's attention semantic vector, a combined semantic vector is obtained by splicing, and labeling and classification are performed to obtain the recognition result, thereby improving recognition capability, providing cross-domain migration capability, and improving robustness to misalignment.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Example 1
Hereinafter, an embodiment of a slot filling method for voice recognition of the present invention will be described with reference to fig. 1 to 4.
FIG. 1 is a flow chart of one example of a slot filling method for speech recognition of the present invention.
As shown in fig. 1, a slot filling method for speech recognition includes the following steps.
Step S101 configures a slot description text and a specific number of example slot values corresponding to the slot.
Step S102, obtaining user voice in task type dialogue, recognizing the user voice and obtaining voice text corresponding to the user voice.
Step S103, converting the slot description text, the example slot value and the voice text into semantic vectors, and respectively obtaining a slot description text semantic vector, an example slot value semantic vector and a voice text semantic vector.
Step S104, calculating the attention weight coefficient of each word in the voice text to obtain the attention semantic vector corresponding to each word.
And step S105, splicing the voice text semantic vector, the attention semantic vector corresponding to each word and the slot position description text semantic vector to obtain a combined semantic vector.
And step S106, inputting the combined semantic vector into a sequence annotator and a sequence classifier for annotation and BIO classification to obtain BIO classification of each word in the voice text, and determining slot positions contained in the voice text and slot values corresponding to the slot positions according to the BIO classification.
In this example, the method of the present invention is applied to recognizing a slot included in a voice input by a user and a slot value corresponding to the slot in a task-based dialog, and a specific recognition process will be described in detail below.
First, in step S101, a slot description text and a specific number of example slot values corresponding to the slot are configured.
As shown in fig. 2, a step S201 of constructing a slot value database is further included.
In step S201, a slot value database is constructed for querying or storing slot positions and their corresponding slot value information data.
Preferably, a slot value database is constructed based on historical dialogue data of a specific task, and the slot value database includes preset slot positions, example slot values of a specific number K corresponding to the preset slot positions, description text information of each preset slot position, and a corresponding relation between slot positions with the same semantics.
In this example, preset slots such as place and time are set based on historical dialog data of a dialog task such as ticket booking or meal ordering, and a specific number of example slot values corresponding to each preset slot are set accordingly.
For example, each preset slot has a description text depicting its semantic information; the description corresponding to a location slot such as "arrival place" may be "the destination to go to".
Further, there is corresponding slot value information for the location slot of "arrival place", such as "beijing", "shanghai", "nanjing", and the like.
For example, in service domain A the location slot is named "arrival place", while in service domain B it is named "destination". For location slots with the same semantics, slots whose semantic vectors are identical or similar are given an association relation, so as to solve the problem that slots cannot be recognized across misaligned, cross-domain, or cross-service data, thereby ensuring that a slot and its related data can be recognized and used in different domains or services.
Preferably, the number of example slot values is between 2 and 5; in this example it is 3.
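As an illustration of the slot value database described above, the following is a minimal plain-Python sketch; the field names (`description`, `example_values`, `same_semantics`) and the slot entries are hypothetical, chosen to mirror the "arrival place" / "destination" example from the Background.

```python
# Minimal sketch of a slot value database (hypothetical field names).
# Each preset slot carries a description text, K example slot values
# (K between 2 and 5; here K = 3), and aliases for slots with the
# same semantics in other domains or services.
slot_value_db = {
    "arrival_place": {
        "description": "the destination to go to",
        "example_values": ["Beijing", "Shanghai", "Nanjing"],  # K = 3
        "same_semantics": ["destination"],  # another domain's name for this slot
    },
    "departure_time": {
        "description": "the time the trip starts",
        "example_values": ["8 am", "tonight", "next Monday"],
        "same_semantics": [],
    },
}

def resolve_slot(name):
    """Return (canonical name, entry) for a possibly aliased slot name."""
    if name in slot_value_db:
        return name, slot_value_db[name]
    for canonical, entry in slot_value_db.items():
        if name in entry["same_semantics"]:
            return canonical, entry
    return None, None

canonical, entry = resolve_slot("destination")  # company B's slot name
```

Here `resolve_slot("destination")` maps the other domain's slot name back to the canonical "arrival_place" entry, so its description text and example slot values can be reused across domains without relabeling.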
Next, in step S102, the user speech in the task-based dialog is acquired, the user speech is recognized, and a speech text corresponding to the user speech is obtained.
Specifically, the acquired user voice is subjected to speech-to-text conversion to obtain the voice text of the user voice, for example, "a flight to Beijing".
Next, in step S103, the slot description text, the example slot values, and the voice text are all converted into semantic vectors, obtaining a slot description text semantic vector, example slot value semantic vectors, and a voice text semantic vector, respectively.
In this example, the slot description text, example slot values, and speech text are input into a BERT pre-trained model to output corresponding semantic vectors.
Specifically, the acquired voice text S of the user is split into T words, and the voice text is represented as S = {u_i, 1 ≤ i ≤ T}. S is input into the BERT pre-training model, which outputs a semantic vector representation of each word, H = {h_i ∈ R^d, 1 ≤ i ≤ T}.
Further, the slot description text of a preset slot is input into the BERT pre-training model, which outputs the semantic vector representation d_s ∈ R^d of the description text.
Further, the K example slot values corresponding to each preset slot (in this example K = 3) are each input into the BERT pre-training model to obtain K semantic vector representations {e_k, 1 ≤ k ≤ K}.
In this way, the slot description text information and the slot values are encoded by the pre-training model, and slots that are semantically similar but absent from the training data can be recognized.
It should be noted that the above description of semantic vector conversion is only an example and is not to be construed as limiting the present invention; in other examples, a RoBERTa model or another BERT-variant model may also be used.
Next, in step S104, an attention weight coefficient of each word in the speech text is calculated to obtain an attention semantic vector corresponding to each word.
Preferably, an attention mechanism model is constructed, and the attention mechanism model is used for calculating the attention weight coefficient of each word in the voice text.
Specifically, the method also includes training the attention model using sample label data. The sample label data includes the slots in historical task-based dialogs and the slot values corresponding to those slots, where the amount of slot value labeling information is small; in other words, only a small amount of labeled sample data is used for training.
Specifically, attention weight calculation is performed between the example slot value semantic vectors and the semantic vector of each word in the voice text to obtain attention weight coefficients, and the attention semantic vector corresponding to each word's semantic vector is obtained by the weighted sum.
Further, attention weights are calculated between the 3 example slot values and each word u_i in the voice text to obtain the attention weight coefficients α_ik, and finally the attention semantic vector e_i^α corresponding to each word u_i is obtained by weighted summation; see equations (1) and (2).

α_ik = exp(h_i^T · W_α · e_k) / Σ_{k'=1}^{K} exp(h_i^T · W_α · e_{k'})    (1)

where α_ik is the attention weight coefficient; h_i is the semantic vector representation of the i-th word; W_α is the attention weight matrix of the neural network; e_k is the semantic vector representation of the k-th example slot value; and K is the set number (specific number) of example slot values.

e_i^α = Σ_{k=1}^{K} α_ik · e_k    (2)

where e_i^α is the attention semantic vector corresponding to word u_i; α_ik are the attention weight coefficients corresponding to the K example slot values; e_k is the semantic vector representation of the k-th example slot value; and K is the set number (specific number) of example slot values.
The attention semantic vector e_i^α corresponding to each word of the voice text is calculated by equations (1) and (2).
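The attention calculation of equations (1) and (2) can be sketched in plain Python as follows; the word vector, the weight matrix W_α (identity here), the dimension d = 4, and the three example slot value vectors are toy values for illustration, not trained parameters.

```python
import math

def attention_semantic_vector(h_i, W, examples):
    """Softmax-normalize the scores h_i^T W e_k over the K example slot
    value vectors (eq. 1), then return the weighted sum e_i^alpha (eq. 2)."""
    def matvec(M, v):
        return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(h_i, matvec(W, e_k)) for e_k in examples]  # h_i^T W e_k
    m = max(scores)                                          # numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [v / z for v in exps]                           # eq. (1)
    d = len(examples[0])
    e_alpha = [sum(alphas[k] * examples[k][j] for k in range(len(examples)))
               for j in range(d)]                            # eq. (2)
    return alphas, e_alpha

# Toy word vector, identity W_alpha, and K = 3 example slot value vectors.
h = [1.0, 0.0, 0.0, 0.0]
W = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
E = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
alphas, e_alpha = attention_semantic_vector(h, W, E)
```

The example slot value most aligned with the word vector receives the largest weight, which is the mechanism by which noise and redundancy among the example values are suppressed.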
It should be noted that the attention mechanism model assigns importance weights to the semantic vector representations of these different words; these weights capture the implicit relevance of each word in the voice text to the example slot values and suppress noise and redundancy in the input, thereby improving recognition capability, providing cross-domain migration capability, and improving alignment robustness.
Next, in step S105, the speech text semantic vector, the attention semantic vector corresponding to each word, and the slot description text semantic vector are spliced to obtain a combined semantic vector.
As shown in fig. 3, the obtained semantic vector h_i of each word, the slot description text vector d_s, and the attention semantic vector e_i^α of each word are spliced to obtain a combined semantic vector, which is used as the input feature of the sequence annotator and the sequence classifier.
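The splicing step is plain vector concatenation, sketched below; the dimension d = 4 and the toy vectors are illustrative only.

```python
def combine(h_i, e_i_alpha, d_s):
    """Concatenate the word vector, its attention semantic vector, and the
    slot description text vector into one combined semantic vector of
    length 3 * d: the per-word input feature for the sequence labeler."""
    return list(h_i) + list(e_i_alpha) + list(d_s)

h_i = [0.1, 0.2, 0.3, 0.4]        # word semantic vector (d = 4)
e_i_alpha = [0.5, 0.1, 0.0, 0.2]  # attention semantic vector of the word
d_s = [0.9, 0.8, 0.7, 0.6]        # slot description text vector
x_i = combine(h_i, e_i_alpha, d_s)
```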
Next, in step S106, the combined semantic vector is input into a sequence annotator and a sequence classifier for annotation and BIO classification, so as to obtain BIO classifications of each word in the voice text, and a slot included in the voice text and a slot value corresponding to the slot are determined according to the BIO classifications.
In this example, the sequence annotation model is a Bi-LSTM model (i.e., a bidirectional LSTM model), which is used for this classification.
It should be noted that in this example the attention mechanism model and the Bi-LSTM model are established separately, but the invention is not limited to this; in other examples, only a Bi-LSTM model may be established, with an attention layer added to it, followed by softmax classification. The foregoing is illustrative only and is not to be construed as limiting the invention.
Specifically, the spliced combined semantic vector is input into a trained Bi-LSTM model, the hidden vector x_i output by the LSTM layer is fed into a softmax layer, and each word in the voice text is labeled in 3-way BIO form to perform sequence classification.
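The final labeling step (hidden vector x_i into a softmax layer, 3-way BIO) can be sketched as follows. A toy linear layer stands in for the trained Bi-LSTM output head, and all names are hypothetical:

```python
import numpy as np

BIO_LABELS = ["B", "I", "O"]

def softmax_bio_tags(hidden_vecs, W, b):
    """Project each per-word hidden vector x_i (rows of hidden_vecs,
    shape (T, h)) to 3 logits, softmax them, and take the argmax label."""
    logits = hidden_vecs @ W + b                             # (T, 3)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)         # per-word softmax
    return [BIO_LABELS[i] for i in probs.argmax(axis=1)]
```

In the trained model, W and b would be the learned parameters of the softmax layer on top of the Bi-LSTM outputs.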
For example, after the 3-way BIO sequence classification, the slot positions contained in the voice text and the slot values corresponding to them are determined according to the BIO classification. The identified slot values semantically represent the user's intent and serve as input to subsequent modules of the dialogue system (user state management and reply generation).
It should be noted that a sequence labeler is used to mark or label each word in the voice text. Here, 3-way BIO is the label representation used in sequence annotation: B (Begin) indicates that the current word is the beginning of a slot value to be identified, I (Inside) indicates that the current word is inside a slot value to be identified, and O (Out) indicates that the current word does not belong to any slot value to be identified.
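Recovering slot values from the 3-way BIO tags is then a span-collection pass over the words. A sketch with an illustrative English example (the patent operates on Chinese voice text word by word; the function name is hypothetical):

```python
def extract_slot_values(words, tags):
    """Recover slot-value spans from per-word 3-way BIO tags:
    a span starts at a B tag and extends through the following I tags."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))  # close the previous span
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:  # "O", or a stray "I" with no open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

For words ["fly", "to", "New", "York"] tagged ["O", "O", "B", "I"], the recovered slot value is "New York".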
The above-described procedure of the slot value filling method is only for explaining the present invention; the order and number of steps are not particularly limited. In addition, a step in the method may be further split into two or three steps (for example, step S106 may be split into step S401 and step S106, see fig. 4), or several steps may be combined into one step, adjusted according to the actual example.
Compared with the prior art, the slot value filling method disclosed by the invention encodes the slot description text information and the slot values based on a pre-training model, and can identify slot positions that are semantically similar but do not appear in the training data; the attention weight coefficient of each word in the voice text to be recognized is calculated through an attention mechanism model to obtain the attention semantic vector of each word, the combined semantic vector is further obtained through splicing, and labeling and classification are then performed to obtain the recognition result, thereby improving recognition capability, providing cross-domain migration capability, and improving alignment robustness; the method can achieve an effect approximately equal to that of a slot value filler trained on a large amount of training data using only a small number of labeled samples, without requiring extensive manual labeling.
Those skilled in the art will appreciate that all or part of the steps to implement the above-described embodiments are implemented as programs (computer programs) executed by a computer data processing apparatus. When the computer program is executed, the method provided by the invention can be realized. Furthermore, the computer program may be stored in a computer readable storage medium, which may be a readable storage medium such as a magnetic disk, an optical disk, a ROM, a RAM, or a storage array composed of a plurality of storage media, such as a magnetic disk or a magnetic tape storage array. The storage medium is not limited to centralized storage, but may be distributed storage, such as cloud storage based on cloud computing.
Embodiments of the apparatus of the present invention are described below, which may be used to perform method embodiments of the present invention. The details described in the device embodiments of the invention should be regarded as complementary to the above-described method embodiments; reference is made to the above-described method embodiments for details not disclosed in the apparatus embodiments of the invention.
Example 2
Referring to fig. 5, 6 and 7, the present invention also provides a slot value filling apparatus 500 for voice recognition, which is used for recognizing a slot included in a voice input by a user and a slot value corresponding to the slot in a task-based dialog, including: a configuration module 501, configured to configure a slot description text and a specific number of example slot values corresponding to the slot; a data obtaining module 502, configured to obtain a user voice in a task-based dialog, recognize the user voice, and obtain a voice text corresponding to the user voice; a conversion module 503, configured to convert the slot description text, the example slot value, and the voice text into semantic vectors, and obtain a slot description text semantic vector, an example slot value semantic vector, and a voice text semantic vector, respectively; a calculating module 504, configured to calculate an attention weight coefficient of each word in the speech text to obtain an attention semantic vector corresponding to each word; the processing module 505 is configured to splice the speech text semantic vector, the attention semantic vector corresponding to each word, and the slot description text semantic vector to obtain a combined semantic vector; a determining module 506, configured to input the combined semantic vector into a sequence annotator and a sequence classifier for annotation and BIO classification, to obtain a BIO classification of each word in the voice text, and determine a slot included in the voice text and a slot value corresponding to the slot according to the BIO classification.
Preferably, the calculation module 504 further comprises: and performing attention weight calculation on the example slot value semantic vector and the semantic vector of each word in the voice text to obtain an attention weight coefficient, and adding the weight coefficients to obtain an attention semantic vector corresponding to the semantic vector of each word.
As shown in fig. 6, the system further includes a constructing module 601, where the constructing module 601 constructs a slot value database based on historical dialogue data of a specific task, where the slot value database includes preset slot positions, a specific number of example slot values corresponding to the preset slot positions, description text information of each preset slot position, and a corresponding relationship between slot positions with semantics.
Preferably, the number of example slot values is between 2 and 5.
Preferably, the step of converting the slot description text, the example slot value and the voice text into semantic vectors includes: and inputting the slot description text, the example slot value and the voice text into a BERT pre-training model to output a corresponding semantic vector.
Preferably, the sequence annotation model is a Bi-LSTM model.
As shown in fig. 7, the system further includes a classification module 701, where the classification module 701 inputs the combined semantic vector into a trained Bi-LSTM model, and inputs an implicit vector output by an LSTM layer into a softmax layer to perform 3-way BIO sequence classification.
In embodiment 2, portions that are the same as in embodiment 1 are not described again.
Compared with the prior art, the slot value filling device encodes the slot description text information and the slot values based on a pre-training model, and can identify slot positions that are semantically similar but do not appear in the training data; the attention weight coefficient of each word in the voice text to be recognized is calculated through an attention mechanism model to obtain the attention semantic vector of each word, the combined semantic vector is further obtained through splicing, and labeling and classification are then performed to obtain the recognition result, thereby improving recognition capability, providing cross-domain migration capability, and improving alignment robustness; the device can achieve an effect approximately equal to that of a slot value filler trained on a large amount of training data using only a small number of labeled samples, without requiring extensive manual labeling.
Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Example 3
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as specific physical implementations for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.
Fig. 8 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. An electronic device 200 according to the invention will be described below with reference to fig. 8. The electronic device 200 shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
Wherein the storage unit stores program code executable by the processing unit 210 to cause the processing unit 210 to perform steps according to various exemplary embodiments of the present invention described in the above-mentioned electronic device processing method section of the present specification. For example, the processing unit 210 may perform the steps as shown in fig. 1.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read-only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention. The computer program, when executed by a data processing apparatus, enables the computer readable medium to carry out the above-described methods of the invention.
As shown in fig. 9, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components in embodiments in accordance with the invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP). The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
While the foregoing embodiments have described the objects, aspects, and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine, or electronic device, and various general-purpose machines may be used to implement the present invention. The invention is not to be considered as limited to the specific embodiments thereof, but is to be understood as covering all modifications, changes, and equivalents that come within the spirit and scope of the invention.

Claims (10)

1. A slot value filling method for voice recognition, which is used for recognizing a slot included in a voice input by a user and a slot value corresponding to the slot in a task-based dialog, comprising the steps of:
configuring a slot description text and a specific number of example slot values corresponding to the slot;
acquiring user voice in task type conversation, identifying the user voice and obtaining a voice text corresponding to the user voice;
converting the slot position description text, the example slot value and the voice text into semantic vectors to respectively obtain a slot position description text semantic vector, an example slot value semantic vector and a voice text semantic vector;
calculating the attention weight coefficient of each word in the voice text to obtain an attention semantic vector corresponding to each word;
splicing the voice text semantic vector, the attention semantic vector corresponding to each word and the slot position description text semantic vector to obtain a combined semantic vector;
and inputting the combined semantic vector into a sequence annotator and a sequence classifier for annotation and BIO classification to obtain BIO classification of each word in the voice text, and determining slot positions contained in the voice text and slot values corresponding to the slot positions according to the BIO classification.
2. The method according to claim 1, wherein the calculating the attention weight coefficient of each word in the speech text to obtain the attention semantic vector corresponding to each word comprises:
and performing attention weight calculation on the example slot value semantic vector and the semantic vector of each word in the voice text to obtain an attention weight coefficient, and adding the weight coefficients to obtain an attention semantic vector corresponding to the semantic vector of each word.
3. The slot value filling method for speech recognition according to any one of claims 1 to 2, further comprising:
based on historical dialogue data of a specific task, a slot value database is constructed, wherein the slot value database comprises preset slot positions, a specific number of example slot values corresponding to the preset slot positions, description text information of each preset slot position and a corresponding relation of the slot positions with the same semantics.
4. The slot value filling method for speech recognition according to any one of claims 1 to 3, wherein the number of example slot values is between 2 and 5.
5. The method of any of claims 1 to 4, wherein the step of converting the slot description text, the example slot value, and the speech text into semantic vectors comprises:
and inputting the slot description text, the example slot value and the voice text into a BERT pre-training model to output a corresponding semantic vector.
6. The method of slot filling for speech recognition according to any of claims 1 to 5, wherein the sequence annotation model is a Bi-LSTM model.
7. The method according to any one of claims 1 to 6, wherein the entering of the combined semantic vector into a sequence annotator and a sequence classifier for annotation and BIO classification to obtain BIO classification of each word in the speech text comprises: and inputting the combined semantic vector into a trained Bi-LSTM model, and inputting an implicit vector output by an LSTM layer into a softmax layer to perform 3-way BIO sequence classification.
8. A slot value filling apparatus for speech recognition, which is used for recognizing a slot included in speech input by a user and a slot value corresponding to the slot in a task-based dialog, comprising:
a configuration module for configuring a slot description text and a specific number of example slot values corresponding to the slot;
the data acquisition module is used for acquiring user voice in the task type conversation, identifying the user voice and obtaining a voice text corresponding to the user voice;
the conversion module is used for converting the slot description text, the example slot value and the voice text into semantic vectors to respectively obtain a slot description text semantic vector, an example slot value semantic vector and a voice text semantic vector;
the calculation module is used for calculating the attention weight coefficient of each word in the voice text so as to obtain an attention semantic vector corresponding to each word;
the processing module is used for splicing the voice text semantic vector, the attention semantic vector corresponding to each word and the slot position description text semantic vector to obtain a combined semantic vector;
and the determining module is used for inputting the combined semantic vector into the sequence annotator and the sequence classifier for annotation and BIO classification to obtain BIO classification of each word in the voice text, and determining slot positions contained in the voice text and slot values corresponding to the slot positions according to the BIO classification.
9. An electronic device, wherein the electronic device comprises:
a processor; and the number of the first and second groups,
a memory storing computer executable instructions that, when executed, cause the processor to perform the slot value filling method for speech recognition according to any one of claims 1 to 7.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the slot value filling method for speech recognition of any one of claims 1 to 7.
CN202010721012.9A 2020-07-24 2020-07-24 Slot value filling method and device for voice recognition and electronic equipment Withdrawn CN111986673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010721012.9A CN111986673A (en) 2020-07-24 2020-07-24 Slot value filling method and device for voice recognition and electronic equipment


Publications (1)

Publication Number Publication Date
CN111986673A true CN111986673A (en) 2020-11-24

Family

ID=73438889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010721012.9A Withdrawn CN111986673A (en) 2020-07-24 2020-07-24 Slot value filling method and device for voice recognition and electronic equipment

Country Status (1)

Country Link
CN (1) CN111986673A (en)


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489651A (en) * 2020-11-30 2021-03-12 科大讯飞股份有限公司 Voice recognition method, electronic device and storage device
CN112489651B (en) * 2020-11-30 2023-02-17 科大讯飞股份有限公司 Voice recognition method, electronic device and storage device
CN112581954A (en) * 2020-12-01 2021-03-30 杭州九阳小家电有限公司 High-matching voice interaction method and intelligent equipment
CN112581954B (en) * 2020-12-01 2023-08-04 杭州九阳小家电有限公司 High-matching voice interaction method and intelligent device
CN112466288B (en) * 2020-12-18 2022-05-31 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112466288A (en) * 2020-12-18 2021-03-09 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
WO2022135419A1 (en) * 2020-12-21 2022-06-30 广州橙行智动汽车科技有限公司 Voice interaction method and apparatus
CN112882679B (en) * 2020-12-21 2022-07-01 广州橙行智动汽车科技有限公司 Voice interaction method and device
CN112882679A (en) * 2020-12-21 2021-06-01 广州橙行智动汽车科技有限公司 Voice interaction method and device
CN113192534A (en) * 2021-03-23 2021-07-30 汉海信息技术(上海)有限公司 Address search method and device, electronic equipment and storage medium
CN113380418A (en) * 2021-06-22 2021-09-10 浙江工业大学 System for analyzing and identifying depression through dialog text
CN113611316A (en) * 2021-07-30 2021-11-05 百度在线网络技术(北京)有限公司 Man-machine interaction method, device, equipment and storage medium
CN114022668A (en) * 2021-10-29 2022-02-08 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice
CN114022668B (en) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201124