CN114036272A

CN114036272A - Semantic analysis method and system for dialog system, electronic device and storage medium

Info

Publication number: CN114036272A
Application number: CN202111271655.9A
Authority: CN
Inventors: 江豪; 肖龙源; 李稀敏; 李威
Original assignee: Xiamen Kuaishangtong Technology Co Ltd
Current assignee: Xiamen Kuaishangtong Technology Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-02-11

Abstract

The invention discloses a semantic analysis method and system for a dialog system, an electronic device and a storage medium. The semantic analysis method comprises the following steps: step a, obtaining dialogue data, and preprocessing the dialogue data to obtain corpus information to be trained; step b, training a word2vec model using the corpus information to be trained; step c, constructing a semantic analysis model based on the word2vec model; and step d, inputting the corpus information to be analyzed into the semantic analysis model, wherein the semantic analysis model comprises a word2vec embedding layer, a BiLSTM layer, a CDW layer and a linear classification layer. The invention can simply and efficiently distinguish user semantics, provide accurate semantic information and provide reliable guidance for the next action of an intelligent dialog system.

Description

Semantic analysis method and system for dialog system, electronic device and storage medium

Technical Field

The invention relates to the field of artificial intelligence, in particular to a semantic analysis method and system for a dialog system, an electronic device and a storage medium.

Background

In an intelligent dialogue system, the semantic analysis result determines the next state of the dialogue, so correctly analyzing the semantics of the user's dialogue information is very important. For example, in an intelligent medical dialogue system, if the semantic analysis result is an active inquiry by the user, the next state of the dialogue is to answer the user; if the semantic analysis result is a passive answer by the user, the next state of the dialogue is to summarize symptoms/diseases, or to further provide accurate treatment/examination suggestions and the like.

In general, the dialogue semantics of the user can be distinguished by whether the sentence is a question: a question is treated as an active inquiry and a declarative sentence as a passive answer. However, due to the particularity of Chinese dialogue, it is generally difficult to distinguish the semantics of a user simply by whether the sentence is a question. For example, "I consult for symptoms of XX disease" is a declarative sentence, but it is actually an active inquiry by the user.

In the prior art, sentence pattern matching is performed by a rule template method or a machine learning method to distinguish the semantics of users in a simple way. However, both approaches can only determine whether the user's sentence is a question; for an inquiry phrased as a declarative sentence, they cannot correctly distinguish the user semantics. Their accuracy is therefore low, and they cannot provide reliable semantic guidance for the intelligent dialog system.
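The limitation described above can be illustrated with a minimal sketch of a rule-template baseline. The marker list, the function name and the Chinese example sentences are all hypothetical and chosen only for illustration; they are not taken from any prior-art system.

```python
# Hypothetical rule-template baseline: a sentence counts as an active
# query only if it contains a surface question marker.
QUESTION_MARKERS = ("?", "？", "吗", "什么", "怎么")


def looks_like_question(sentence: str) -> bool:
    """Return True if the sentence contains any surface question marker."""
    return any(marker in sentence for marker in QUESTION_MARKERS)


# "What are the symptoms of XX disease?" (an explicit question)
explicit_question = "XX病有什么症状？"
# "I want to consult about the symptoms of XX disease" (declarative in form,
# but still an active query)
declarative_query = "我想咨询XX病的症状"

print(looks_like_question(explicit_question))
print(looks_like_question(declarative_query))
```

The second sentence carries no question marker, so a template of this kind labels it as a statement even though the user is asking for information.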

Disclosure of Invention

The invention mainly aims to provide a semantic analysis method, a semantic analysis system, an electronic device and a storage medium for a dialog system, which can simply and efficiently distinguish user semantics, provide accurate semantic information and provide reliable guidance for the next step of behavior of an intelligent dialog system.

In order to achieve the above object, the present invention provides a semantic analysis method for a dialog system, which comprises the following steps: step a, obtaining dialogue data, and preprocessing the dialogue data to obtain corpus information to be trained; step b, training a word2vec model using the corpus information to be trained; step c, constructing a semantic analysis model based on the word2vec model; step d, inputting corpus information to be analyzed into the semantic analysis model, wherein the semantic analysis model comprises a word2vec embedding layer, a BiLSTM layer, a CDW layer and a linear classification layer. The specific semantic analysis process comprises the following steps: d1. the word2vec embedding layer extracts word vector information of the corpus information to be analyzed, and the BiLSTM layer acquires context information of the corpus to be analyzed; d2. the CDW layer acquires semantic information of the corpus to be analyzed according to the word vector information and the context information of the corpus to be analyzed; d3. the linear classification layer classifies according to the semantic information to obtain a binary classification result of 1 or 0 as the semantic analysis result, wherein 1 represents an active query and 0 represents a passive answer.

Optionally, the preprocessing includes removing stop words, removing useless characters, and removing emoticons.
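A minimal sketch of such preprocessing follows. The stop-word list, the emoticon pattern and the reading of "useless characters" as anything other than Chinese characters, letters and digits are assumptions made for illustration; the text does not specify them.

```python
import re

# Illustrative stop-word list (assumption); a real system would load a full list.
STOP_WORDS = {"的", "了", "呢", "啊"}

# Assumed emoticon pattern: common emoji code points plus bracketed tags like [微笑].
EMOTICON_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]|\[[^\[\]]{1,8}\]")
# Assumed definition of "useless characters": not CJK, not a letter, not a digit.
USELESS_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def clean_text(text: str) -> str:
    """Remove emoticons first, then any remaining useless characters."""
    text = EMOTICON_RE.sub("", text)
    return USELESS_RE.sub("", text)


def remove_stop_words(tokens):
    """Drop empty tokens and tokens found in the stop-word list."""
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


print(clean_text("头疼😷[微笑]！！ 怎么办~"))
print(remove_stop_words(["我", "的", "头", "疼", "了"]))
```

Emoticons are removed before the general character filter so that the text inside a bracketed tag does not survive as ordinary words.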

Optionally, step b includes the following steps: b1. performing entity recognition on the preprocessed corpus information to be trained using an NER algorithm, and determining the entities contained in the corpus information to be trained; b2. performing word segmentation on the preprocessed corpus information using Jieba, and counting the word frequency T of the word segmentation result; b3. manually merging and retaining the entities that are not identified in the word segmentation result; b4. training and saving the word2vec model using the Gensim package.

Optionally, in the step b, training is performed only on the word segmentation result with the word frequency T being greater than or equal to 5.
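The frequency cut-off can be sketched as below. The helper names and the toy corpus are hypothetical. The counting part uses only the standard library; `train_and_save` shows how the same cut-off could be passed to Gensim through the `min_count` argument of `Word2Vec`, and it is defined but not called here because it needs Gensim installed and a realistic corpus.

```python
from collections import Counter

MIN_FREQ = 5  # word-frequency threshold T >= 5 stated in the text


def count_frequencies(segmented_sentences):
    """Count the word frequency T over already-segmented sentences."""
    return Counter(tok for sent in segmented_sentences for tok in sent)


def frequent_vocabulary(segmented_sentences, min_freq=MIN_FREQ):
    """Return the set of tokens whose frequency T is at least min_freq."""
    freq = count_frequencies(segmented_sentences)
    return {tok for tok, t in freq.items() if t >= min_freq}


def train_and_save(segmented_sentences, path):
    """Train and save word2vec with Gensim (assumes gensim is installed)."""
    from gensim.models import Word2Vec

    # min_count makes Gensim ignore tokens seen fewer than MIN_FREQ times.
    model = Word2Vec(sentences=segmented_sentences, min_count=MIN_FREQ)
    model.save(path)
    return model


# Toy corpus: "头疼" appears 6 times, "发烧" 5 times, "咳嗽" once.
corpus = [["头疼", "发烧"]] * 5 + [["咳嗽", "头疼"]]
print(sorted(frequent_vocabulary(corpus)))
```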

Optionally, the semantic analysis model further includes a Dropout layer and a LayerNorm layer; the corpus information to be analyzed passes sequentially through the word2vec embedding layer, the Dropout layer, the BiLSTM layer, the LayerNorm layer, the CDW layer and the linear classification layer.

Optionally, the step d2 specifically includes the following steps:

d21. calculating a first weight $u_{it}$ for each word,

$$u_{it} = \tanh(W_w h_{it} + b_w);$$

where $i$ denotes the i-th sentence, $t$ denotes the t-th character in the i-th sentence, $h_{it}$ is the output of the t-th character in the i-th sentence after passing through the LayerNorm layer, $W_w$ is the weight corresponding to $h_{it}$, and $b_w$ is the bias corresponding to $h_{it}$;

d22. calculating the distance relationship $SRD_{it}$ between each character and the central word,

where $P_a$ is the position of the central word, the central word is one of the symptom, disease or examination entities contained in the i-th sentence, and $m$ is a threshold value;

d23. obtaining a second weight $u_{it}'$ for each word based on the threshold parameter $\sigma$ and the distance relationship $SRD_{it}$ between each character and the central word,

where $n$ is the sentence length of the i-th sentence;

d24. computing a feature vector $s_i$ for the entire sentence,

where $\theta_{it}$ is the contribution of the t-th character in the i-th sentence to the semantic information;

d25. obtaining the binary classification result from the feature vector $s_i$ of the whole sentence, where 1 represents an active query and 0 represents a passive answer.

Optionally, the threshold m is 10, and the threshold parameter σ is 5.
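The first-weight formula of step d21, u_it = tanh(W_w h_it + b_w), can be sketched in plain Python as follows. The vector sizes and numeric values are arbitrary illustrations, and treating W_w as a matrix applied to the hidden vector h_it is an assumption about the shapes, which the text does not state.

```python
import math


def first_weight(h_it, W_w, b_w):
    """Compute u_it = tanh(W_w h_it + b_w) for one character's hidden vector."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, h_it)) + b)
        for row, b in zip(W_w, b_w)
    ]


# Illustrative values only (assumption): 3-dimensional h_it, 2-dimensional u_it.
h_it = [0.5, -1.0, 0.25]
W_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
b_w = [0.0, 0.5]

u_it = first_weight(h_it, W_w, b_w)
print(u_it)
```

Because tanh is applied element-wise, every component of u_it lies strictly between -1 and 1 regardless of the scale of h_it.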

In addition, corresponding to the dialog system semantic analysis method, the invention provides a semantic analysis system, which comprises: a text acquisition module, used for acquiring dialogue data and preprocessing the dialogue data to obtain corpus information to be trained;

a model training module, used for training a word2vec model using the corpus information to be trained;

a semantic analysis model building module, used for building a semantic analysis model based on the word2vec model;

a semantic analysis module, used for inputting the corpus information to be analyzed into the semantic analysis model, wherein the semantic analysis model comprises a word2vec embedding layer, a BiLSTM layer, a CDW layer and a linear classification layer; the word2vec embedding layer extracts word vector information of the corpus information to be analyzed, and the BiLSTM layer acquires context information of the corpus to be analyzed; the CDW layer acquires semantic information of the corpus to be analyzed according to the word vector information and the context information of the corpus to be analyzed; and the linear classification layer classifies according to the semantic information to obtain a binary classification result of 1 or 0 as the semantic analysis result, wherein 1 represents an active query and 0 represents a passive answer.

Further, corresponding to the dialog system semantic analysis method, the invention provides an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the dialog system semantic analysis method.

The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the dialog system semantic analysis method.

The invention has the beneficial effects that:

(1) through the semantic analysis model, the invention can distinguish user semantics simply and efficiently, correctly distinguish the active queries and passive answers of a user, provide accurate semantic information, and provide reliable guidance for the next action of the intelligent dialogue system according to the semantic analysis result;

(2) after the semantic analysis result of the user is obtained through the semantic analysis model, the intelligent dialogue system can design the dialogue flow according to the result, which improves the fluency and professionalism of the dialogue system;

(3) the invention performs word segmentation by combining the NER algorithm with Jieba, which avoids the situation in which a general-purpose word segmentation tool cannot correctly segment entity content appearing in a specific field (such as the medical field); moreover, combining the NER algorithm with Jieba retains the specialized words of a specific field (such as the medical field) to the greatest extent, which improves the performance of the word2vec model;

(4) the Dropout layer and the LayerNorm layer improve the generalization of the semantic analysis model.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a simplified flow diagram of a semantic analysis method for a dialog system according to the present invention;

FIG. 2 is a schematic diagram of the semantic analysis model of the dialog system according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, the method for analyzing the semantics of the dialog system can simply and efficiently distinguish the semantics of the user, correctly distinguish the active query and the passive answer of the user, provide accurate semantic information, and provide reliable guidance for the next action of the intelligent dialog system according to the semantic analysis result.

The invention discloses a semantic analysis method for a dialog system, wherein the dialog system is preferably a medical dialog system, and the semantic analysis method specifically comprises the following steps:

step a, obtaining dialogue data in the medical field, and preprocessing the dialogue data to obtain corpus information to be trained; preferably, the preprocessing includes, but is not limited to, removing stop words, removing useless characters, and removing emoticons;

step b, training a word2vec model using the corpus information to be trained;

step c, constructing a semantic analysis model based on the word2vec model (see FIG. 2 for the model structure);

step d, inputting corpus information to be analyzed into the semantic analysis model, wherein the semantic analysis model comprises a word2vec embedding layer, a BiLSTM layer, a CDW layer and a linear classification layer; the specific semantic analysis process comprises the following steps:

d1. the word2vec embedding layer extracts word vector information of the corpus information to be analyzed, and the BiLSTM layer acquires context information of the corpus to be analyzed;

d2. the CDW layer obtains semantic information of the corpus to be analyzed according to the word vector information and the context information of the corpus to be analyzed;

d3. the linear classification layer classifies according to the semantic information and outputs a semantic analysis result. Preferably, the classifier used by the linear classification layer is a binary classifier, and the output classification result is 1 (representing an active query) or 0 (representing a passive answer). The output classification result is the semantic analysis result.
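The final binary decision of step d3 can be sketched as a linear map followed by a threshold. The sigmoid activation, the 0.5 threshold and the example weights are assumptions for illustration; the text only states that a linear binary classifier outputs 1 or 0.

```python
import math


def linear_classify(s_i, weights, bias, threshold=0.5):
    """Return 1 (active query) or 0 (passive answer) for a sentence vector s_i."""
    logit = sum(w * x for w, x in zip(weights, s_i)) + bias
    probability = 1.0 / (1.0 + math.exp(-logit))  # assumed sigmoid activation
    return 1 if probability >= threshold else 0


# Illustrative, untrained parameters (assumption).
weights, bias = [2.0, -2.0], 0.0
print(linear_classify([1.0, 0.0], weights, bias))
print(linear_classify([0.0, 1.0], weights, bias))
```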

Preferably, if the semantic analysis result is active query, the next action of the medical dialogue system is to answer the active query; if the semantic analysis result is a passive answer, the next action of the medical dialogue system is to analyze the passive answer or continue to inquire the user.
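The routing described above can be sketched as a small dispatch function. The action names are hypothetical labels for the behaviors the text describes, not identifiers from an actual dialogue system.

```python
ACTIVE_QUERY = 1    # semantic analysis result: active query
PASSIVE_ANSWER = 0  # semantic analysis result: passive answer


def next_action(semantic_result: int) -> str:
    """Map the binary semantic analysis result to the next dialogue action."""
    if semantic_result == ACTIVE_QUERY:
        return "answer_user_query"
    if semantic_result == PASSIVE_ANSWER:
        return "analyze_answer_or_ask_follow_up"
    raise ValueError(f"unexpected semantic analysis result: {semantic_result!r}")


print(next_action(ACTIVE_QUERY))
print(next_action(PASSIVE_ANSWER))
```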

After the semantic analysis result of the user is obtained through the semantic analysis model, the intelligent medical dialogue system can design the dialogue flow according to the result, which improves the fluency and professionalism of the dialogue system.

In addition, because a self-trained word2vec model is used instead of a large-scale pre-trained model, the computational efficiency of the model is improved while the model performance is maintained.

Furthermore, the invention can effectively capture the local semantic information of a sentence through the Context features Dynamic Weighted (CDW) model, thereby improving the accuracy of the model. Specifically, the CDW model obtains semantic information from the word vectors of the central words and their context information, where the central words are entities such as diseases and symptoms contained in the corpus to be analyzed.

In this embodiment, step b includes the following steps:

b1. performing entity recognition on the preprocessed corpus information to be trained using an NER algorithm, and determining the medical field entities contained in the corpus information to be trained;

b2. performing word segmentation on the preprocessed corpus information using Jieba, and counting the word frequency T of the word segmentation result; preferably, training is performed only on word segmentation results with a word frequency T greater than or equal to 5;

b3. manually merging and retaining the medical field entities that are not identified in the word segmentation result;

b4. training the word2vec model using the Gensim package and saving it for subsequent use.

The invention performs word segmentation by combining the NER algorithm with Jieba, which avoids the situation in which a general-purpose word segmentation tool cannot correctly segment content such as symptoms and diseases in the medical field.

For a given piece of corpus, the medical field entities are first found through an existing NER service; the corpus is then segmented with Jieba, and the medical field entities that are not identified in the word segmentation result are manually merged and retained. In this way the specialized words of the medical field are retained to the greatest extent, which improves the performance of the word2vec model.
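The merging step can be sketched as follows. The text describes it as manual; the greedy longest-match procedure below is only one assumed way to automate it, and the function name and example entity are hypothetical. It takes the token list produced by a segmenter and the entity strings returned by an NER service, and joins consecutive tokens whose concatenation is a known entity.

```python
def merge_entities(tokens, entities):
    """Greedily merge consecutive tokens whose concatenation is a known entity."""
    entity_set = set(entities)
    max_len = max((len(e) for e in entity_set), default=0)
    merged, i = [], 0
    while i < len(tokens):
        best_end, text = None, ""
        for j in range(i, len(tokens)):
            text += tokens[j]
            if len(text) > max_len:
                break  # no entity can be this long
            if text in entity_set:
                best_end = j  # remember the longest match so far
        if best_end is not None and best_end > i:
            merged.append("".join(tokens[i:best_end + 1]))
            i = best_end + 1
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# A segmenter has split the entity "过敏性鼻炎" (allergic rhinitis) into three tokens.
tokens = ["我", "有", "过敏", "性", "鼻炎"]
print(merge_entities(tokens, ["过敏性鼻炎"]))
```

An alternative with the same intent is to register each recognized entity in the segmenter's dictionary before segmentation, for example with Jieba's `add_word`, so that the entity is never split in the first place.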

As shown in FIG. 2, the semantic analysis model of the invention mainly comprises a word2vec embedding layer, a Dropout layer, a BiLSTM layer, a LayerNorm layer, a CDW layer and a linear classification layer, and the corpus information to be analyzed passes through these layers in that order.
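The two regularizing layers in this stack can be sketched in plain Python. This shows the standard definitions of inverted dropout and layer normalization (without the learnable gain and bias that framework implementations usually add); the dropout rate, epsilon and input values are assumptions, since the text gives no hyperparameters for these layers.

```python
import math
import random


def dropout(vec, p, rng, training=True):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(vec)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


rng = random.Random(0)
embedding = [1.0, 2.0, 3.0, 4.0]           # stand-in for a word2vec vector
dropped = dropout(embedding, p=0.5, rng=rng)
normalized = layer_norm(embedding)          # stand-in for a BiLSTM output
print(dropped)
print(normalized)
```

Dropout is active only during training, so at inference time the embedding passes through unchanged, while layer normalization is applied identically in both modes.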

In this embodiment, the CDW layer specifically acquiring semantic information of a corpus to be analyzed includes the following steps:

d21. calculating a first weight $u_{it}$ for each word,

$$u_{it} = \tanh(W_w h_{it} + b_w);$$

where $i$ denotes the i-th sentence, $t$ denotes the t-th character in the i-th sentence, $h_{it}$ is the output of the t-th character in the i-th sentence after passing through the LayerNorm layer, $W_w$ is the weight corresponding to $h_{it}$, and $b_w$ is the bias corresponding to $h_{it}$;

d22. calculating the distance relationship $SRD_{it}$ between each character and the central word,

where $P_a$ is the position of the central word, the central word is one of the symptom, disease or examination entities contained in the i-th sentence, and $m$ is a threshold value; preferably, the threshold $m$ is 10;

d23. obtaining a second weight $u_{it}'$ for each word based on the threshold parameter $\sigma$ and the distance relationship $SRD_{it}$ between each character and the central word; updating the second weight $u_{it}'$ through the distance relationship keeps the weight of important characters unchanged and makes the weight of a character smaller the farther it is from the central word, which reduces the influence of extraneous information on the final prediction and improves the accuracy of the model;

where $n$ is the sentence length of the i-th sentence; preferably, the threshold parameter $\sigma$ is 5;

d24. computing a feature vector $s_i$ for the entire sentence,

where $\theta_{it}$ is the contribution of the t-th character in the i-th sentence to the semantic information; the farther a character or word is from the central word, the smaller its contribution;

d25. obtaining the binary classification result from the feature vector $s_i$ of the whole sentence, where 1 represents an active query and 0 represents a passive answer.
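The distance-based weighting of steps d22 to d24 can be illustrated with the sketch below. The formulas for the distance relationship, the second weight and the sentence vector are not reproduced in this text, so every formula in the code is an assumption: the distance is taken as the positional offset from the central word reduced by half of m, characters within σ keep full weight, characters beyond σ are scaled down linearly with the sentence length n, and the contributions θ are obtained by normalizing the scaled scores. It shows the qualitative behavior described above, not the patent's exact computation.

```python
import math


def srd(t, center_pos, m):
    """Assumed distance: offset from the central word, reduced by half of m."""
    return abs(t - center_pos) - m // 2


def decay_factor(srd_value, sigma, n):
    """Assumed decay: full weight within sigma, linear shrink beyond it."""
    if srd_value <= sigma:
        return 1.0
    return max(0.0, (n - (srd_value - sigma)) / n)


def cdw_pool(hidden, scores, center_pos, m=10, sigma=5):
    """Return (theta, s_i): per-character contributions and the sentence vector."""
    n = len(hidden)
    raw = [
        decay_factor(srd(t, center_pos, m), sigma, n) * math.exp(score)
        for t, score in enumerate(scores)
    ]
    total = sum(raw)
    theta = [r / total for r in raw]  # contributions sum to 1
    dim = len(hidden[0])
    s_i = [sum(theta[t] * hidden[t][d] for t in range(n)) for d in range(dim)]
    return theta, s_i


# Small m and sigma so the decay is visible in a 5-character sentence.
hidden = [[1.0, 0.0]] * 5       # stand-in for LayerNorm outputs h_it
scores = [0.0] * 5              # stand-in for scalar attention scores
theta, s_i = cdw_pool(hidden, scores, center_pos=0, m=2, sigma=0)
print(theta)
print(s_i)
```

With the preferred values m = 10 and σ = 5, this assumed form leaves every character within ten positions of the central word at full weight, so the decay only affects long sentences.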

The invention also provides a corresponding semantic analysis system, which comprises: a text acquisition module, which acquires dialogue data and preprocesses it to obtain corpus information to be trained; a model training module, used for training a word2vec model using the corpus information to be trained; a semantic analysis model building module, used for building a semantic analysis model based on the word2vec model; and a semantic analysis module, used for inputting the corpus information to be analyzed into the semantic analysis model. The semantic analysis model comprises a word2vec embedding layer, a BiLSTM layer, a CDW layer and a linear classification layer: the word2vec embedding layer extracts word vector information of the corpus information to be analyzed, and the BiLSTM layer acquires context information of the corpus to be analyzed; the CDW layer acquires semantic information of the corpus to be analyzed according to the word vector information and the context information of the corpus to be analyzed; and the linear classification layer classifies according to the semantic information to obtain a binary classification result of 1 or 0 as the semantic analysis result, wherein 1 represents an active query and 0 represents a passive answer.

The method is mainly applied to an intelligent medical dialogue system: it analyzes complex sentences in the user's medical dialogue, provides accurate semantic information, judges whether the user's input is an active inquiry or a passive answer, and provides reliable guidance for the next action of the intelligent dialogue system. The semantic analysis model of the invention judges the semantics of the user's input sentence, that is, whether the sentence is a query or an answer with respect to the symptom, disease or examination entity it contains.

For example, when the user inputs "I want to consult the symptom of XX disease", the model judges that the input is a query about XX disease; when the user inputs "I do not have XX symptom", the model judges that the input is an answer about XX symptom.

After the semantic information of the user is obtained through the semantic analysis model, the intelligent medical dialogue system can design the dialogue flow according to the semantic information. For example, if the user's input is an inquiry, the dialogue system needs to answer it; if the user's input is a passive answer, the dialogue system can analyze the answer or continue to question the user. This improves the fluency and professionalism of the dialogue system.

The invention also provides an electronic device, which comprises at least one processor and a memory which is in communication connection with the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a dialog system semantic analysis method.

In this embodiment, a computer-readable storage medium is further provided, in which a computer program is stored, where the computer program is executed by a processor to implement the dialog system semantic analysis method.

Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

While the above description shows and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A dialogue system semantic analysis method is characterized by comprising the following steps:

step a, obtaining dialogue data, and preprocessing the dialogue data to obtain corpus information to be trained;

step b, training a word2vec model using the corpus information to be trained;

step c, constructing a semantic analysis model based on the word2vec model;

step d, inputting corpus information to be analyzed into the semantic analysis model, wherein the semantic analysis model comprises a word2vec embedding layer, a BiLSTM layer, a CDW layer and a linear classification layer; the specific semantic analysis process comprises the following steps:

d1. the word2vec embedding layer extracts word vector information of the corpus information to be analyzed, and the BiLSTM layer acquires context information of the corpus to be analyzed;

d2. the CDW layer acquires semantic information of the corpus to be analyzed according to the word vector information and the context information of the corpus to be analyzed;

d3. the linear classification layer classifies according to the semantic information to obtain a binary classification result of 1 or 0 as the semantic analysis result, wherein 1 represents an active query and 0 represents a passive answer.

2. The semantic analysis method of a dialog system according to claim 1, characterized in that: the preprocessing includes removing stop words, removing useless characters, and removing emoticons.

3. The semantic analysis method of a dialog system according to claim 1, characterized in that: the step b comprises the following steps:

b1. carrying out entity recognition on the preprocessed corpus information to be trained by adopting an NER algorithm, and determining an entity contained in the corpus information to be trained;

b2. performing word segmentation on the preprocessed corpus information using Jieba, and counting the word frequency T of the word segmentation result;

b3. manually merging and retaining the entities that are not identified in the word segmentation result;

b4. training and saving the word2vec model using the Gensim package.

4. A dialog system semantic analysis method according to claim 3, characterized in that: in the step b, training is only carried out on the word segmentation result with the word frequency T being more than or equal to 5.

5. The semantic analysis method of a dialog system according to claim 1, characterized in that: the semantic analysis model further comprises a Dropout layer and a LayerNorm layer;

the corpus information to be analyzed passes sequentially through the word2vec embedding layer, the Dropout layer, the BiLSTM layer, the LayerNorm layer, the CDW layer and the linear classification layer.

6. The semantic analysis method of a dialog system according to claim 5, characterized in that: the step d2 specifically comprises the following steps:

d21. calculating a first weight $u_{it}$ for each word,

$$u_{it} = \tanh(W_w h_{it} + b_w);$$

d23. obtaining a second weight $u_{it}'$ for each word based on the threshold parameter $\sigma$ and the distance relationship $SRD_{it}$ between each character and the central word,

where $n$ is the sentence length of the i-th sentence;

d24. computing a feature vector $s_i$ for the entire sentence,

7. The semantic analysis method of a dialog system according to claim 6, characterized in that: the threshold m is 10, and the threshold parameter σ is 5.

8. A semantic analysis system, the system comprising:

the text acquisition module acquires dialogue data and performs preprocessing to obtain corpus information to be trained;

9. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the dialog system semantic analysis method of any of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the dialog system semantic analysis method according to one of claims 1 to 7.