CN111026884B

CN111026884B - Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus

Info

Publication number: CN111026884B
Application number: CN201911271656.6A
Authority: CN
Inventors: 张献涛; 张猛; 暴筱; 林小俊
Original assignee: Shanghai Yishang Network Technology Co ltd
Current assignee: Shanghai Yishang Network Technology Co ltd
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2023-06-02
Anticipated expiration: 2039-12-12
Also published as: CN111026884A

Abstract

The invention discloses a dialogue corpus generation method for improving the quality and diversity of human-machine interaction dialogue corpora. The method comprises the following steps: 1) performing synonymous-sentence expansion on the selected dialogue corpus to form a candidate set; 2) performing anomaly detection on each dialogue corpus in the candidate set to obtain an anomaly value for each dialogue corpus; 3) storing each dialogue corpus whose anomaly value is lower than the set scoring threshold into the improved dialogue corpus; 4) performing semantic analysis on each dialogue corpus whose anomaly value is higher than or equal to the scoring threshold: if the dialogue data is erroneous, it is discarded directly; if the dialogue data is diverse, step 5) is executed; otherwise, the current dialogue corpus is stored in the improved dialogue corpus; 5) taking the dialogue data determined to be diverse as input again and repeating steps 1) to 4), stopping the iteration when a stop condition is reached. The invention thereby achieves quality control and diversity expansion of the original dialogue corpus.

Description

Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus

Technical Field

The invention belongs to the technical fields of information technology and data mining, and relates to a dialog corpus generation method for improving the quality and diversity of man-machine interaction dialog corpus.

Background

With the continuous development of science and technology, artificial intelligence models are increasingly applied in intelligent systems, which places various demands on human-machine interaction. How to perform human-machine interaction more effectively is an urgent problem. At present, most human-machine interaction models are data-driven: a model is trained on a corpus to obtain well-performing parameters, which are then applied in a system. A high-quality corpus therefore plays an increasingly important role.

In human-computer interaction dialogue, the human has various rich language expression modes, and the accuracy requirement for semantic understanding is higher. In order to better train a precise (robust) model, an accurate and high-quality dialogue corpus is required, and the dialogue corpus is also required to be as rich as possible and have various expression modes.

The Chinese patent ZL201510251428.8 discloses a corpus screening method and device, wherein the corpus screening method comprises the following steps: cross checking is carried out based on the first corpus set, and a first checking result is obtained; judging whether the first check result meets a first preset condition or not; when the first verification result meets the first preset condition, performing public verification based on the first corpus set to obtain a second verification result; judging whether the first corpus set needs to be screened according to the second checking result; and when the first corpus is judged to need to be screened, executing first screening processing on the first corpus. The method solves the problem of low quality of the training samples caused by influence of subjective preference when corpus is screened in the related technology, and further achieves the effect of improving the quality of the training samples.

Chinese patent ZL201310344326.1 provides a corpus expansion device, comprising: the screening unit screens out an initial corpus sample according to preset corpus screening conditions; the expansion unit is used for identifying the collected corpus according to the initial corpus sample and the expansion strategy to obtain an expanded corpus sample, and carrying out corpus expansion again based on the expanded corpus sample and the expansion strategy. According to the method, the large-scale training corpus is subjected to machine labeling in an automatic mode, so that the time period and the cost for manufacturing the large-scale training corpus are greatly saved, and the labeling accuracy can be improved.

Currently, most corpus processing consists of simple cleaning, in which abnormal data that is inconsistent with expectations or with the overall distribution is removed according to various criteria. The invention focuses on the abnormal data in human-machine dialogue corpora and divides it into two types: erroneous data and special-case data. Erroneous data needs to be removed. Special-case data is an unusual expression rather than a common one, but it can enhance the diversity of expressions in the corpus, so it needs to be retained and further expanded. In this way the quality of the human-machine interaction dialogue corpus is improved, and the corpus in turn improves the accuracy of subsequent model training.

Disclosure of Invention

In view of the technical problems in the prior art, the invention aims to provide a method for improving the quality and diversity of human-machine interaction dialogue corpora. The invention is a corpus-based statistical or machine learning method for controlling the quality and expanding the diversity of an original dialogue corpus.

The technical scheme adopted by the invention is as follows:

a method for improving the quality and diversity of human-computer interaction dialogue corpus comprises the following steps:

1) Carrying out synonymous sentence expansion on the input selected dialogue corpus to form a candidate set;

2) Performing anomaly detection on the dialogue corpus in the candidate set, and outputting an anomaly value scoring of each dialogue corpus;

3) Sorting the dialogue corpora by score, determining a threshold by the adjacent maximum difference method, and storing the dialogue corpora that score below the threshold in the improved dialogue corpus;

4) Performing further semantic analysis on the outliers, that is, the dialogue corpora that score above the threshold:

4.1) If the dialogue data is erroneous, discarding it directly;

4.2) If the dialogue data is correct and has good diversity, using it as input again and returning to step 1) for another iteration;

4.3) If the dialogue data is correct but of only ordinary diversity, saving it into the improved dialogue corpus;

5) Stopping the iteration when the stop condition is reached.

Further, the synonymous-sentence expansion of step 1) may be performed manually. Multiple annotators can expand the input dialogue corpus, which makes good use of human experience and knowledge. To save labor and improve efficiency, automatic synonymous-sentence expansion can also be adopted: the task is completed by randomly applying word order exchange, stop word deletion, synonym replacement, cross-language translation and the like to the dialogue corpus.

Further, in step 2) the invention vectorizes the questions of all the segmented dialogue corpora, that is, each is represented by a fixed-length vector d_text. The vectors of all the dialogue corpora are then averaged to obtain an average vector d_mean. Using the distance (Dis) formula provided by the invention, the distance between each dialogue corpus and the average vector is calculated and used as its difference score. The higher the score, the larger the difference. A large difference may indicate that the dialogue corpus is an erroneous sentence, which should be discarded to improve the quality of the corpus. It may also indicate that the dialogue corpus is correct but an unusual expression, which helps increase the diversity of the corpus and needs to be retained for further expansion.

Further, in step 3), a dialogue corpus with a relatively low score can be regarded as a common and valid expression and can be stored in the improved corpus.

Further, step 4) classifies the high-scoring dialogue corpora left over from step 3) and handles each case differently. The classification may be made manually: annotators judge the category of each input dialogue corpus from experience, which makes good use of human knowledge and allows flexible handling. To save labor and improve efficiency, an automatic method can also be adopted. Specifically, a combined judgment can be made from an automatic web search index and the results returned by a question-answering model.

Further, in step 4.1), if the dialogue corpus is erroneous, is irrelevant to the intent of the dialogue itself, or cannot be classified for the time being, it is discarded directly, which improves the quality of the corpus.

Further, in step 4.2), a dialogue corpus that is correct but a less common expression helps increase the diversity of the corpus; it needs to be retained and used as a seed dialogue corpus, that is, as the input of the next iteration, and step 1) is repeated.

Further, in step 4.3), a dialogue corpus that is correct and valid but a common expression is stored directly in the improved corpus.

Further, in step 5), the method is an iterative update process, and different stop conditions can be set as required, for example: the number of iterations reaches a fixed value, the input from step 4.2) is empty, or the number of dialogues in the improved corpus reaches a preset number.

The method is innovative in that the method focuses on abnormal data of the dialogue corpus, and in the process iteration processing, not only the corpus quality can be improved, but also the corpus diversity can be increased. The synonymous sentence expansion method in the step 1), the abnormal data detection method in the step 2) and the distinguishing processing method in the step 4) are novel, feasible and effective.

The invention also provides an artificial intelligent model training method which is characterized in that the artificial intelligent model is trained by the corpus in the dialogue corpus obtained by the method.

The invention also provides a man-machine interaction method which is characterized in that man-machine interaction is carried out by adopting the artificial intelligent model obtained through training.

Compared with the prior art, the invention has the following positive effects:

The method can improve the quality and expand the diversity of the human-machine interaction dialogue corpus while reducing manual effort, and an algorithm model trained on the improved corpus gains accuracy and robustness.

Drawings

FIG. 1 is a flow chart of the steps of the present invention;

FIG. 2 is an automatic synonym expansion flow chart;

FIG. 3 is an example diagram of a synonym extension;

FIG. 4 is a flow chart of abnormal dialogue corpus detection.

Detailed Description

The present invention will be further described with reference to the following specific examples and drawings in order to make the above objects, features and advantages of the present invention more comprehensible.

Taking a human-machine interaction corpus in the hotel domain as an example, the following describes step by step how the quality of the original corpus is improved and its diversity expanded. The invention first performs basic word segmentation on the initial corpus. The initial corpus Corpus_init consists of n dialogue corpora (question-answer pairs) and can be expressed as {QApair_1, QApair_2, …, QApair_n}. Each question-answer pair has a question, an answer, and a labeled question-answer intent; the i-th question-answer pair can be expressed as (Sentence_i, Answer_i, Intent_i), for example ("may I ask whether the hotel room has wifi coverage", "our hotel has wireless coverage throughout, and you can connect directly without a password", "query the network"). Generally, the answer is essentially fixed for a given intent. The invention therefore currently focuses on improving the quality and expanding the diversity of the question Sentence under the same intent Intent.

Chinese word segmentation is a basic step of Chinese natural language processing. The invention fuses dictionary-based and statistical segmentation: a dictionary-based maximum matching method is applied first, and a sequence labeling method (conditional random field) is applied to the ambiguous parts of the segmentation.

Thus, a sentence Sentence_i spoken by the user is composed of several segmented words and can be expressed as

$$Sentence_i = (w_i^1, w_i^2, \ldots, w_i^k, \ldots, w_i^{max})$$

where $w_i^k$ denotes a segmented word, i indexes the question of the i-th dialogue corpus, k is the position of the word in the sentence, and max is the maximum number of words allowed in a sentence. The invention takes max as 100; if a sentence is longer, the words beyond this length are truncated.
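The dictionary stage of the segmentation and the truncation to max words can be sketched as follows. This is a minimal illustration only: the function names, the `max_word_len` window and the toy dictionary are assumptions, Latin strings stand in for Chinese text, and the conditional random field stage for ambiguous spans is left out.

```python
def forward_max_match(text, dictionary, max_word_len=5):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest window first and shrink it until a dictionary hit.
        for size in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


def truncate(words, max_words=100):
    """Keep at most max_words words (the description takes max as 100)."""
    return words[:max_words]


toy_dictionary = {"hotel", "room", "wifi"}  # hypothetical entries
print(truncate(forward_max_match("hotelroomwifi", toy_dictionary)))
# ['hotel', 'room', 'wifi']
```

Greedy matching is fast but cannot resolve overlapping candidates, which is where a statistical model such as the conditional random field mentioned above would take over.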

FIG. 1 is a flow chart of the steps of the method of the present invention, which is followed by specific steps:

Step 1: synonymous sentences can be expanded manually from the dialogue corpus, which makes good use of human experience and knowledge. Mature crowdsourcing techniques also exist for expanding a dialogue corpus. To save labor and improve efficiency, automatic synonymous-sentence expansion can also be adopted: word order exchange, stop word deletion, synonym replacement, cross-language translation and the like are applied randomly to the dialogue corpus to complete the expansion task. The expansion flow is shown in FIG. 2, and a simple example is shown in FIG. 3.

The present invention defines four basic operations, each as follows

1. Word order exchange operation: for the input dialogue corpus question, any word $w_i^k$ (the k-th word of the sentence) and the next word $w_i^{k+1}$ are selected and exchanged. In actual human-machine conversation, users often vary word order when expressing themselves; a question such as "how is the hotel room covered by wifi" may equally be spoken with two adjacent words in the opposite order. Word order exchange therefore enhances the diversity of expression and covers the expression habits of some users. This operation is optional: it is skipped with probability $p_1 \in [0,1]$, and the invention takes $p_1 = 0.8$.

2. Stop word deletion operation: for the input dialogue corpus question, a common stop word dictionary, compiled manually, is used to determine whether the sentence contains a stop word; if so, the stop word is deleted. For example, the opener "may I ask" in "may I ask whether the hotel room has wifi coverage" can be deleted without affecting the meaning of the whole sentence. This operation is optional: it is skipped with probability $p_2 \in [0,1]$, and the invention takes $p_2 = 0.4$.

3. Synonym substitution operation: for the input dialogue corpus question, a synonym dictionary (obtained by manually correcting the publicly available HIT Cilin, the Harbin Institute of Technology synonym thesaurus) is used to determine whether the sentence contains words that have synonyms; if so, those words are replaced with words of the same meaning from the synonym dictionary. For example, in "may I ask whether the hotel room has wifi coverage", the word "hotel" has a synonym in the dictionary, so the sentence can be rewritten with that synonym in its place. Synonym substitution maintains semantic consistency and can introduce unseen words (words that do not appear in the original corpus but do appear in the synonym dictionary), enhancing the diversity of expressions. This operation is optional: it is skipped with probability $p_3 \in [0,1]$, and the invention takes $p_3 = 0.1$.

4. Cross-language translation operation: this operation expands synonymous sentences by using current machine translation technology and exploiting the changes in expression that arise when translating between languages. Specifically, the dialogue corpus question is translated into a first intermediate language using an existing machine translation service (such as Google Translate or Baidu Translate), then from the first intermediate language into a second intermediate language, and finally from the second intermediate language back into Chinese. The returned result is compared with the original input and retained if it differs. For example, "may I ask whether the hotel room has wifi coverage" can first be translated into English as "Does the hotel room have WiFi coverage", then from English into French as "La chambre d'hôtel est-elle couverte par un réseau sans fil", and finally from French back into Chinese as "whether hotel rooms are covered by wireless network". The final expression preserves semantic consistency while being richer.

This operation is optional: it is skipped with probability $p_4 \in [0,1]$, and the invention takes $p_4 = 0.3$.

Each of the four operations is randomly skipped or selected, so that they act on the candidate dialogue corpus with probabilities $1-p_1$, $1-p_2$, $1-p_3$ and $1-p_4$ respectively. In the example of FIG. 3, after the first two operations have been performed there are already 4 possible variations; if the remaining operations are also performed, there are 4×2×2=16 possible variations.
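The four operations and their skip probabilities can be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: all function and parameter names are hypothetical, the cross-language step is a caller-supplied `round_trip` callable (no real translation service is called), and each applied operation keeps both the original and the rewritten sentence, which is what yields up to 2×2×2×2 = 16 variations.

```python
import random


def swap_adjacent(words, rng):
    """Word order exchange: swap one randomly chosen word with the next one."""
    if len(words) < 2:
        return list(words)
    k = rng.randrange(len(words) - 1)
    out = list(words)
    out[k], out[k + 1] = out[k + 1], out[k]
    return out


def drop_stop_words(words, stop_words):
    """Stop word deletion using a stop word dictionary."""
    return [w for w in words if w not in stop_words]


def replace_synonyms(words, synonyms):
    """Synonym substitution using a word -> synonym dictionary."""
    return [synonyms.get(w, w) for w in words]


def expand(words, stop_words, synonyms, round_trip=None,
           skip=(0.8, 0.4, 0.1, 0.3), rng=None):
    """Return the original sentence plus the variants produced by the non-skipped operations."""
    rng = rng or random.Random()
    operations = [
        lambda ws: swap_adjacent(ws, rng),
        lambda ws: drop_stop_words(ws, stop_words),
        lambda ws: replace_synonyms(ws, synonyms),
        round_trip or (lambda ws: ws),  # placeholder when no translator is supplied
    ]
    candidates = [list(words)]
    for operation, p_skip in zip(operations, skip):
        if rng.random() < p_skip:
            continue  # this operation is skipped with probability p_skip
        for sentence in list(candidates):
            variant = operation(sentence)
            if variant not in candidates:  # keep only results that differ
                candidates.append(variant)
    return candidates
```

With the default skip values taken from the description, most calls apply only one or two operations, so the candidate set grows gradually over the iterations instead of exploding in one pass.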

Step 2: the invention vectorizes all the segmented dialogue corpora, that is, each is represented by a fixed-length vector d_text, and distances are calculated on this basis. The specific steps are shown in FIG. 4.

The system maps each word to a low-dimensional continuous vector. A deep text representation model (e.g. word2vec, a tool that converts words into vector form) can be used to obtain word vectors for the questions of the dialogue corpus. For a sentence $Sentence_i$, every word $w_i^k$ is mapped to a vector $v_i^k$; the dimension of the vector is taken to be 200.

The word vectors are then summed to obtain the semantic vector of the dialogue corpus:

$$d_{text,i} = \sum_{k} v_i^k$$

Further, the vectors of the n dialogue corpora are averaged to obtain the average vector:

$$d_{mean} = \frac{1}{n} \sum_{i=1}^{n} d_{text,i}$$

The distance between each dialogue corpus and the average vector is calculated and used as the difference score of that dialogue corpus. The invention designs a dedicated distance formula that amplifies the differences between dialogue corpora as far as possible. The distance Dis between an input vector d_text and the average vector d_mean is calculated as follows:

In the formula, threshold is a predefined threshold value for small differences; the invention takes it as 0.01.

The higher the Dis score, the larger the difference. A large difference may indicate that the dialogue corpus is erroneous and should be discarded to improve the quality of the corpus. It may also indicate that the dialogue corpus is correct but an unusual expression, which helps increase the diversity of the corpus and needs to be retained.
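The scoring pipeline of step 2 (sum word vectors, average over the candidate set, score by distance to the average) might look like the sketch below. The invention's own Dis formula is not recoverable from this text, so plain Euclidean distance is swapped in as a stand-in; the function names and the dictionary-of-lists representation of word vectors are likewise assumptions.

```python
import math


def sentence_vector(words, word_vectors, dim):
    """Sum the word vectors of a sentence; words without a vector contribute nothing."""
    total = [0.0] * dim
    for word in words:
        for j, value in enumerate(word_vectors.get(word, ())):
            total[j] += value
    return total


def mean_vector(vectors):
    """Component-wise average of the sentence vectors (d_mean)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Stand-in distance; NOT the Dis formula of the invention."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anomaly_scores(sentences, word_vectors, dim):
    """Score every segmented sentence by its distance to the average vector."""
    vectors = [sentence_vector(s, word_vectors, dim) for s in sentences]
    d_mean = mean_vector(vectors)
    return [euclidean(v, d_mean) for v in vectors]


# Toy 2-dimensional vectors; the description uses 200 dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
scores = anomaly_scores([["a"], ["a"], ["b", "b", "b"]], toy_vectors, dim=2)
print(scores[2] > scores[0])  # the unusual third sentence scores highest
```

In practice the word vectors would come from a trained embedding model, and any distance that grows with dissimilarity could be substituted for the stand-in.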

Step 3: the dialogue corpora are sorted by score, and those with low scores are stored in the improved dialogue corpus Corpus_improved. In step 2, each of the n dialogue corpora has been scored. The dialogue corpora are then ranked by score in ascending order, from low to high. Different threshold selection methods may be chosen as required. In the invention, the threshold Score_threshold is determined by the adjacent maximum difference method, and the first r dialogue corpora, whose scores are below the threshold, are selected. These low-scoring dialogue corpora differ little from the rest, can be regarded as valid and accurate, and are put into Corpus_improved. The remaining n-r dialogue corpora are analyzed and processed in the next step. The adjacent maximum difference method is calculated as follows:

3.1 Sorting: the n dialogue corpora are ranked by distance score from low to high: (Sentence_1, Score_1), (Sentence_2, Score_2), …, (Sentence_n, Score_n).

3.2 Calculating adjacent differences: the difference Delta_k between adjacent entries in the ranking is calculated as:

$$Delta_k = Score_k - Score_{k-1}, \quad k = 2, \ldots, n$$

3.3 Taking the maximum difference and determining the scoring threshold Score_threshold: the maximum of the differences in 3.2 is denoted Delta_q, meaning that the difference between Sentence_{q-1} and Sentence_q is the largest. The scoring threshold is the average of their two scores:

$$Score_{threshold} = \frac{Score_{q-1} + Score_q}{2}$$

3.4 Using the threshold Score_threshold from 3.3, the first r dialogue corpora, whose scores are below the threshold, are selected. These low-scoring dialogue corpora can be regarded as valid and accurate and are put into Corpus_improved. The remaining n-r dialogue corpora are analyzed and processed in the next step.
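Steps 3.1 to 3.4 can be sketched as follows, assuming an ascending ranking and the midpoint of the widest gap as the threshold. The function names and the (sentence, score) pair representation are illustrative assumptions.

```python
def adjacent_max_difference_threshold(scores):
    """Midpoint of the largest gap between neighbouring scores in ascending order."""
    ordered = sorted(scores)
    if len(ordered) < 2:
        raise ValueError("need at least two scores")
    # gaps[k - 1] holds Delta_k = Score_k - Score_{k-1} (0-based list positions).
    gaps = [ordered[k] - ordered[k - 1] for k in range(1, len(ordered))]
    q = max(range(len(gaps)), key=gaps.__getitem__) + 1  # first largest gap wins ties
    return (ordered[q - 1] + ordered[q]) / 2


def split_by_threshold(scored_sentences):
    """Split (sentence, score) pairs into low scorers to keep and outliers to analyse."""
    threshold = adjacent_max_difference_threshold([s for _, s in scored_sentences])
    ranked = sorted(scored_sentences, key=lambda pair: pair[1])
    kept = [pair for pair in ranked if pair[1] < threshold]
    outliers = [pair for pair in ranked if pair[1] >= threshold]
    return threshold, kept, outliers


threshold, kept, outliers = split_by_threshold(
    [("s1", 1.0), ("s2", 2.0), ("s3", 3.0), ("s4", 9.0), ("s5", 10.0)]
)
print(threshold, len(kept), len(outliers))  # 6.0 3 2
```

Placing the threshold in the widest gap needs no tuning parameter, at the cost of always producing at least one outlier even when the scores are evenly spread.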

Step 4: further semantic analysis is performed on the high-scoring outliers. This analysis requires a person to judge the quality and diversity of each dialogue corpus, mainly whether it is consistent with the target intent Intent and whether its expression is clear. The data are finally divided into the following four types: (1) erroneous dialogue data; (2) dialogue data with good diversity; (3) data that is correct but of ordinary diversity; (4) data whose semantics cannot be judged for the time being.

To improve efficiency, classification screening of semantic analysis may also be performed automatically. Specifically, comprehensive judgment can be performed through the automatic network search index and the results returned by the question-answer model. The automatic judging process is as follows:

4.1 calculating the occurrence frequency of the dialogue corpus

For the question Sentence_i in a dialogue corpus, the word string of the sentence is used as the search term, a search is performed using an Internet search index (Baidu, Google and the like), and the number of relevant results found, count_i, is returned. This value characterizes how frequently the dialogue corpus occurs, giving a tuple (Sentence_i, count_i). The method is not limited to search engine data: an index library can also be built in-house, returning a similar measure through known techniques such as inverted indexes.
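For the self-built index option, a very small inverted index might look like the sketch below. It counts documents that contain every query word, which is only one possible proxy for the search-engine result count; the function names and the toy documents are assumptions.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(documents):
        for word in words:
            index[word].add(doc_id)
    return index


def match_count(index, query_words):
    """Number of indexed documents that contain all of the query words."""
    if not query_words:
        return 0
    postings = [index.get(word, set()) for word in query_words]
    return len(set.intersection(*postings))


# Toy segmented documents standing in for an in-house text collection.
documents = [
    ["hotel", "wifi"],
    ["hotel", "pool"],
    ["wifi", "password", "hotel"],
]
index = build_inverted_index(documents)
print(match_count(index, ["hotel", "wifi"]))  # 2
```

A count produced this way is on a different scale from a web search count, so the threshold K used for classification would have to be chosen for the index actually in use.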

4.2 verifying the question-answer Effect of dialog corpus

First, a well-known question-answering model, such as DSSM (Deep Structured Semantic Models), is trained on the initial corpus Corpus_init to obtain an automatic question-answering model QA-Model that answers input questions. Then, for the question Sentence_i of a dialogue corpus, its string is used as input and answered automatically by QA-Model. There are two possible outcomes: either the model can answer with its current capability and returns an answer string, or the model currently cannot answer and the returned result is null. This gives a tuple (Sentence_i, Answer_i-QA-Model).

4.3 Classification attribution

From the tuples obtained in 4.1 and 4.2, a triplet (Sentence_i, count_i, Answer_i-QA-Model) is obtained for each dialogue corpus. According to count_i and Answer_i-QA-Model, Sentence_i is classified as follows:

when count_i is greater than K and Answer_i-QA-Model is null, it is classified as (1) erroneous dialogue data;

when count_i is not greater than K and Answer_i-QA-Model is not null, it is classified as (2) dialogue data with good diversity;

when count_i is greater than K and Answer_i-QA-Model is not null, it is classified as (3) data that is correct but of ordinary diversity;

when count_i is not greater than K and Answer_i-QA-Model is null, it is classified as (4) data whose semantics cannot be judged for the time being.

The threshold K may be specified empirically; the invention selects a value of 100000.
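The four-way rule above reduces to a small decision function. The sketch below assumes that a null answer is represented as `None` or an empty string; the function name and the label strings are illustrative.

```python
def classify(count, answer, k=100000):
    """Classify a sentence from its occurrence count and the QA model's answer."""
    answered = answer is not None and answer != ""
    if count > k:
        # Frequent expression: correct but ordinary if answered, otherwise erroneous.
        return "general" if answered else "error"
    # Rare expression: diverse if answered, otherwise cannot be judged yet.
    return "diverse" if answered else "undetermined"


print(classify(250000, None))                   # error
print(classify(40, "You can connect directly"))  # diverse
print(classify(250000, "You can connect directly"))  # general
print(classify(40, None))                       # undetermined
```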

4.4 According to the classification, the following processing is carried out:

If the data is erroneous dialogue data (1) or data whose semantics cannot be judged (4), it is discarded directly. For example, if the Intent is "query the network", a dialogue corpus containing the question "give me a network cable" does not match the intent and needs to be discarded;

If the data is dialogue data with good diversity (2), it is used as input again and step 1 is entered for another iteration. For example, when the Intent is "query the network", a dialogue corpus containing "can I connect to wifi when I get to the hotel" has good diversity and matches the intent; it is not only put into the improved corpus Corpus_improved but also expanded further on that basis by repeating step 1;

If the data is of the remaining type, correct but of ordinary diversity (3), it is saved into the improved dialogue corpus. For example, when the Intent is "query the network", a dialogue corpus containing "the hotel wifi should have full coverage, right" is of ordinary diversity and is put directly into the improved corpus Corpus_improved.

Step 5: the method is an iterative update process, and different stop conditions can be set as required, for example: the number of iterations reaches a fixed value, the input from step 4 is empty, or the number of dialogues in the improved corpus Corpus_improved reaches a preset number.
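The overall iteration with its three stop conditions can be outlined as below. Every callable passed in (`expand`, `score`, `split`, `classify_sentence`) is a hypothetical stand-in for the corresponding step, sentences are assumed to be hashable (for example strings), and diverse sentences are both kept and re-used as seeds, following the description of step 4.4.

```python
def improve_corpus(seeds, expand, score, split, classify_sentence,
                   max_iterations=5, target_size=None):
    """Iterate steps 1 to 4 until a stop condition of step 5 is met."""
    improved = {}  # dict keys act as an insertion-ordered set
    pending = list(seeds)
    for _ in range(max_iterations):  # stop: fixed number of iterations
        if not pending:  # stop: nothing left to expand
            break
        if target_size is not None and len(improved) >= target_size:
            break  # stop: corpus has reached the preset size
        candidates = [c for seed in pending for c in expand(seed)]  # step 1
        scored = list(zip(candidates, score(candidates)))           # step 2
        kept, outliers = split(scored)                              # step 3
        for sentence, _ in kept:
            improved.setdefault(sentence, None)
        pending = []
        for sentence, _ in outliers:                                # step 4
            label = classify_sentence(sentence)
            if label == "diverse":
                improved.setdefault(sentence, None)
                pending.append(sentence)  # seed for the next iteration
            elif label == "general":
                improved.setdefault(sentence, None)
            # "error" and "undetermined" sentences are discarded
    return list(improved)
```

Passing the steps in as callables keeps the loop independent of whether each step is done manually or automatically, which matches the choice the description leaves open.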

Finally, the improved corpus Corpus_improved obtained by this method is a human-machine dialogue corpus of reliable quality and rich diversity.

The dictionary of the maximum matching method and the training corpus of the supervised conditional random field model come from 100,000 user reviews manually annotated for the invention.

Test results on several groups of dialogue corpora show that the method reduces erroneous dialogue corpora by about 20%, improving the quality of the corpus, and increases the number of corpus entries by about 60%, increasing its diversity. On the improved corpus, the accuracy of the human-machine dialogue model generally improves by 3 to 7 percentage points.

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims

1. A dialog corpus generation method for improving the quality and diversity of man-machine interaction dialog corpus comprises the following steps:

1) Carrying out synonymous sentence expansion on the selected dialogue corpus to form a candidate set; the method for generating the candidate set comprises the following steps: 11) selecting two adjacent words from the dialogue corpus each time and exchanging them to obtain a plurality of expanded sentences; 12) deleting the stop words in each expanded sentence by using a stop word dictionary; 13) judging, by using a synonym dictionary, whether each segmented word in each sentence has a synonym, and if so, replacing the corresponding segmented word with the synonym from the synonym dictionary, thereby expanding each sentence into a plurality of sentences; 14) for each expanded sentence, first translating the sentence into a first intermediate language, then translating the first intermediate language into a second intermediate language, and then translating the second intermediate language back into the original language, or translating the sentence back into the original language after a plurality of language conversions; then comparing the result returned after the multiple translation conversions with the original sentence to determine whether they are consistent; if not, saving the returned result and the original sentence to the candidate set; otherwise, saving the original sentence to the candidate set;

2) Performing anomaly detection on each dialogue corpus in the candidate set to obtain an anomaly value of each dialogue corpus;

3) Storing the dialogue corpus with the abnormal value lower than the set scoring threshold value into a lifted dialogue corpus;

4) Semantic analysis is carried out on dialogue corpora with outliers higher than or equal to the scoring threshold value: if the dialogue data is wrong, the dialogue data is directly discarded; if the dialogue data is diversified, executing step 5); otherwise, the current dialogue corpus is stored in the lifted dialogue corpus;

5) Taking the dialogue data determined to be diversified as input again and executing steps 1) to 4) until a stop condition is reached, whereupon the iteration stops.

2. The method of claim 1, wherein the word order exchange of step 11), the stop word deletion of step 12), the synonym substitution of step 13) and the cross-language translation of step 14) each have a corresponding skip probability, which sets the probability that execution of the corresponding step is skipped.

3. The method as claimed in claim 1, wherein in step 2), the segmented sentences of each dialogue corpus in the candidate set are vectorized to obtain a vector d_text of a set length; the vectors corresponding to all the dialogue corpora in the candidate set are then averaged to obtain an average vector d_mean; and the distance between each vector d_text and the average vector d_mean is then calculated and used as the difference value of the corresponding dialogue corpus.

4. The method of claim 3, wherein the distance is

wherein threshold is a defined threshold, N is the dimension of the vector, x_i is the i-th component of the vector d_text, and d_i is the i-th component of the average vector d_mean.

5. The method of claim 1, wherein in step 3) the scoring threshold is determined based on a neighboring maximum variance method; wherein the adjacent maximum difference method is as follows:

31) Ranking the dialogue corpora by outlier value, the ranking result being recorded as (Sentence_1, Score_1), (Sentence_2, Score_2), …, (Sentence_n, Score_n), where Sentence_n is the sentence corresponding to the n-th dialogue corpus and Score_n is the outlier value of the n-th dialogue corpus;

32) Calculating the difference between adjacent entries in the ranking, $Delta_k = Score_k - Score_{k-1}$;

33) Taking the maximum difference among the results obtained in step 32), recorded as Delta_q, and using the mean of the two adjacent outlier values Score_q and Score_{q-1} corresponding to Delta_q as the scoring threshold Score_threshold.

6. The method of claim 1, wherein the method of semantically analyzing the dialogue corpus is:

41) Using the question Sentence_i in the dialogue corpus as a search term and counting the frequency of occurrence count_i of Sentence_i;

42) Inputting Sentence_i into an automatic question-answering model to obtain a returned result Answer_i-QA-Model;

43) Classifying Sentence_i according to count_i and Answer_i-QA-Model: when count_i is greater than a set threshold K and Answer_i-QA-Model is null, classifying it as erroneous dialogue data; when count_i is not greater than the set threshold K and Answer_i-QA-Model is not null, classifying it as diversified dialogue data; when count_i is greater than K and Answer_i-QA-Model is not null, classifying it as data that is correct but of ordinary diversity; when count_i is not greater than K and Answer_i-QA-Model is null, classifying it as data whose semantics cannot be judged for the time being.

7. The method of claim 1, wherein the selected dialog corpus includes question, answer, and annotated question-answer intents; and in the step 1), carrying out synonymous sentence expansion on question sentences in the selected dialogue corpus.

8. An artificial intelligence model training method, characterized in that the artificial intelligence model is trained by using the corpus in the dialogue corpus obtained by the method as claimed in claim 1.

9. A man-machine interaction method is characterized in that man-machine interaction is carried out by adopting an artificial intelligent model trained by the method according to claim 8.