CN111026884B - Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus - Google Patents

Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus

Info

Publication number
CN111026884B
Authority
CN
China
Prior art keywords
corpus
dialogue
sentence
data
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911271656.6A
Other languages
Chinese (zh)
Other versions
CN111026884A (en
Inventor
张献涛
张猛
暴筱
林小俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yishang Network Technology Co ltd
Original Assignee
Shanghai Yishang Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yishang Network Technology Co ltd
Priority to CN201911271656.6A
Publication of CN111026884A
Application granted
Publication of CN111026884B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a dialogue corpus generation method for improving the quality and diversity of a man-machine interaction dialogue corpus. The method comprises the following steps: 1) performing synonymous-sentence expansion on the selected dialogue corpus to form a candidate set; 2) performing anomaly detection on each dialogue corpus entry in the candidate set to obtain an anomaly score for each entry; 3) storing the entries whose anomaly score is below a set scoring threshold in the improved dialogue corpus; 4) performing semantic analysis on the entries whose anomaly score is greater than or equal to the scoring threshold: if the dialogue data is erroneous, it is discarded directly; if the dialogue data is diverse, step 5) is executed; otherwise, the current entry is stored in the improved dialogue corpus; 5) taking the dialogue data judged to be diverse as new input and repeating steps 1) to 4) until a stop condition is reached, at which point the iteration ends. The invention thereby achieves quality control and diversity expansion of the original dialogue corpus.

Description

Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus
Technical Field
The invention belongs to the technical fields of information technology and data mining, and relates to a dialog corpus generation method for improving the quality and diversity of man-machine interaction dialog corpus.
Background
With the continuous development of science and technology, artificial intelligence models are increasingly applied in intelligent systems, which raises a variety of requirements for human-computer interaction. How to make human-computer interaction more effective is a problem in urgent need of a solution. At present, most human-computer interaction models are data driven: a model is trained on a corpus to obtain well-performing parameters, which are then deployed in a system. A high-quality corpus therefore plays an increasingly important role.
In human-computer dialogue, humans express themselves in rich and varied ways, and the accuracy required of semantic understanding is correspondingly high. Training an accurate and robust model requires a dialogue corpus that is accurate and of high quality, and that is also as rich as possible and covers varied modes of expression.
The Chinese patent ZL201510251428.8 discloses a corpus screening method and device, wherein the corpus screening method comprises the following steps: cross checking is carried out based on the first corpus set, and a first checking result is obtained; judging whether the first check result meets a first preset condition or not; when the first verification result meets the first preset condition, performing public verification based on the first corpus set to obtain a second verification result; judging whether the first corpus set needs to be screened according to the second checking result; and when the first corpus is judged to need to be screened, executing first screening processing on the first corpus. The method solves the problem of low quality of the training samples caused by influence of subjective preference when corpus is screened in the related technology, and further achieves the effect of improving the quality of the training samples.
Chinese patent ZL201310344326.1 provides a corpus expansion device, comprising: the screening unit screens out an initial corpus sample according to preset corpus screening conditions; the expansion unit is used for identifying the collected corpus according to the initial corpus sample and the expansion strategy to obtain an expanded corpus sample, and carrying out corpus expansion again based on the expanded corpus sample and the expansion strategy. According to the method, the large-scale training corpus is subjected to machine labeling in an automatic mode, so that the time period and the cost for manufacturing the large-scale training corpus are greatly saved, and the labeling accuracy can be improved.
At present, most corpus processing performs only simple cleaning: abnormal data that is inconsistent with expectations or with the overall distribution is removed according to various criteria. The invention focuses on the abnormal data in a human-computer dialogue corpus and divides it into two types, error data and special-case data. Error data must be removed. Special-case data is a more specific, less common expression; because it enhances the diversity of expressions in the corpus, it must be retained and further expanded. The result is a higher-quality human-computer interaction dialogue corpus, which in turn improves the accuracy of models subsequently trained on it.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention aims to provide a method for improving the quality and diversity of human-computer interaction dialogue corpora. The invention is a corpus-based statistical or machine learning method for controlling the quality and expanding the diversity of an original dialogue corpus.
The technical scheme adopted by the invention is as follows:
A method for improving the quality and diversity of human-computer interaction dialogue corpus comprises the following steps:
1) Performing synonymous-sentence expansion on the input selected dialogue corpus to form a candidate set;
2) Performing anomaly detection on the dialogue corpus entries in the candidate set, and outputting an anomaly score for each entry;
3) Sorting the entries by score, determining a threshold by the adjacent maximum difference method, and storing the entries whose scores are below the threshold in the improved dialogue corpus;
4) Performing further semantic analysis on the outliers whose scores are at or above the threshold:
4.1) If the dialogue data is erroneous, discarding it directly;
4.2) If the dialogue data has good diversity, using it as input again and returning to step 1) for another iteration;
4.3) If the dialogue data is otherwise correct but of ordinary diversity, saving it in the improved dialogue corpus;
5) Stopping the iteration when a stop condition is reached.
Further, the synonymous-sentence expansion of step 1) may be performed manually: several annotators expand the input dialogue corpus, which makes good use of human experience and knowledge. To save labor and improve efficiency, automatic synonymous-sentence expansion may be used instead. The expansion task can be completed by randomly applying word order exchange, stop word deletion, synonym replacement, cross-language translation and similar operations to the dialogue corpus.
Further, in step 2) the invention vectorizes the questions of all segmented dialogue corpus entries, i.e. represents each one with a fixed-length vector $d_{text}$. The vectors of all entries are then averaged to obtain an average vector $d_{mean}$. Using the distance formula Dis provided by the invention, the distance between each entry and the average vector is calculated and used as its difference score. The higher the score, the larger the difference. A large difference may indicate that the entry is an erroneous sentence, which should be discarded to improve corpus quality. It may also indicate that the entry is correct but an unusual expression, which benefits the diversity of the corpus and must be retained for further expansion.
Further, in step 3), entries with relatively low scores are considered common and valid expressions and can be stored in the improved corpus.
Further, step 4) judges the high-scoring entries produced by step 3) and handles them case by case. The judgment may be made manually: annotators determine the category of each input entry from experience, which makes good use of human knowledge and allows flexible handling. To save labor and improve efficiency, an automatic judgment method may also be used. Specifically, a comprehensive judgment can be made from an automatic web search index and the results returned by a question-answering model.
Further, in step 4.1), if the entry is erroneous, is irrelevant to the intent of the dialogue itself, or is of a type that cannot be judged for the time being, it is discarded directly, which improves corpus quality.
Further, in step 4.2), an entry that is correct but a less common expression benefits the diversity of the corpus. It must be retained and used as a seed entry for the next iteration, which repeats from step 1).
Further, in step 4.3), an entry that is correct and valid but commonly expressed is stored directly in the improved corpus.
Further, regarding step 5), the method is an iterative update process, and different stop conditions can be set as required: for example, the number of iterations reaches a fixed value, the input from step 4.2) is empty, or the number of dialogues in the improved corpus reaches a preset number.
The method is innovative in that it focuses on the abnormal data of the dialogue corpus; through iterative processing, it both improves corpus quality and increases corpus diversity. The synonymous-sentence expansion method of step 1), the abnormal data detection method of step 2) and the case-by-case handling method of step 4) are novel, feasible and effective.
The invention also provides an artificial intelligence model training method, characterized in that the artificial intelligence model is trained on the corpus entries in the dialogue corpus obtained by the above method.
The invention also provides a man-machine interaction method, characterized in that man-machine interaction is carried out using the artificial intelligence model obtained through this training.
Compared with the prior art, the invention has the following positive effects:
The invention improves the quality and expands the diversity of a man-machine interaction dialogue corpus, reduces the manual effort required, and improves the accuracy and robustness of algorithm models trained on the improved corpus.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a flow chart of automatic synonymous-sentence expansion;
FIG. 3 is an example diagram of synonymous-sentence expansion;
FIG. 4 is a flow chart of abnormal dialogue corpus detection.
Detailed Description
The present invention will be further described with reference to the following specific examples and drawings in order to make the above objects, features and advantages of the present invention more comprehensible.
Taking a human-computer interaction corpus in the hotel field as an example, the following describes step by step how the quality of the original corpus is improved and its diversity expanded. The invention first performs basic word segmentation on the initial corpus. The initial corpus $Corpus_{init}$ consists of a series of n dialogue corpus entries (question-answer pairs) and can be expressed as $\{QApair_1, QApair_2, \dots, QApair_n\}$. Each question-answer pair has a question sentence, an answer and a labeled intent; the i-th pair can be expressed as $(Sentence_i, Answer_i, Intent_i)$, for example ("Excuse me, are the hotel rooms covered by wifi?", "Our hotel has wireless coverage throughout; you can connect directly without a password", "query the network"). Generally, the Answer is essentially fixed for a given Intent. The invention therefore currently focuses on improving the quality and expanding the diversity of the question Sentence under the same Intent.
Chinese word segmentation is a basic step of Chinese natural language processing. The segmentation used here fuses dictionary-based and statistical segmentation: a dictionary-based maximum matching method is applied first, and a sequence labeling method (conditional random field) is applied to the parts where segmentation is ambiguous.
Thus, a question $Sentence_i$ spoken by the user is composed of several segmented words and can be expressed as
$$Sentence_i = (w_{i,1}, w_{i,2}, \dots, w_{i,k}, \dots), \quad 1 \le k \le max$$
where i denotes the question of the i-th dialogue corpus entry, k is the word index, and max is the maximum number of words allowed in a sentence. The invention takes max = 100; if a sentence is longer, the words beyond this length are truncated.
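The dictionary stage of the segmentation and the truncation to max words can be sketched as follows. This is a minimal illustration only: it assumes forward (left-to-right) maximum matching, which the text does not specify, it omits the conditional random field stage for ambiguous spans, and the function name, the toy dictionary and the Chinese example string are hypothetical.

```python
MAX_WORDS = 100  # the description truncates questions to at most 100 words


def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy dictionary segmentation: at each position take the longest match.

    Assumption: forward matching with single-character fallback; the CRF
    stage for ambiguous spans is not modelled here.
    """
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words[:MAX_WORDS]  # drop the words beyond the allowed length


toy_dictionary = {"酒店", "房间", "wifi", "覆盖"}  # hypothetical entries
print(forward_max_match("酒店房间wifi覆盖吗", toy_dictionary))
# ['酒店', '房间', 'wifi', '覆盖', '吗']
```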
FIG. 1 is a flow chart of the steps of the method of the present invention. The specific steps are as follows:
Step 1: Synonymous-sentence expansion can be performed manually on the dialogue corpus, which makes good use of human experience and knowledge; mature crowdsourcing techniques are also available for expanding a dialogue corpus. To save labor and improve efficiency, automatic synonymous-sentence expansion may be used instead: word order exchange, stop word deletion, synonym replacement, cross-language translation and similar operations are applied randomly to the dialogue corpus to complete the expansion task. The expansion flow is shown in FIG. 2, and a simple example is shown in FIG. 3.
The present invention defines four basic operations, as follows:
1. Word order exchange operation: for the input dialogue corpus question, any word $w_{i,k}$ (the k-th word of the question) is selected and exchanged with the next word $w_{i,k+1}$. In actual human-machine conversation, users often change word order when expressing themselves. For example, the question "are the hotel rooms covered by wifi" may be spoken with two adjacent words transposed. The word order exchange operation therefore helps to enhance the diversity of expression and covers the expression habits of some users. This operation is optional and is governed by a probability value $p_1 \in [0,1]$; the invention takes $p_1 = 0.8$.
2. Stop word deletion operation: for the input dialogue corpus question, a common stop word dictionary, compiled manually, is used to determine whether the sentence contains stop words; any stop word found is deleted. For example, the words "excuse me" in "Excuse me, are the hotel rooms covered by wifi?" can be deleted without affecting the meaning of the whole sentence. This operation is optional and is governed by a probability value $p_2 \in [0,1]$; the invention takes $p_2 = 0.4$.
3. Synonym substitution operation: for the input dialogue corpus question, a synonym dictionary (obtained by manually correcting the public HIT Tongyici Cilin thesaurus of Harbin Institute of Technology) is used to determine whether any word in the question has a synonym; if so, the word is replaced with a word of the same meaning from the synonym dictionary. For example, in "Excuse me, are the hotel rooms covered by wifi?", the word "hotel" has a synonym in the dictionary, so the sentence can be rewritten with that synonym in place of "hotel". Synonym substitution maintains semantic consistency and can introduce unseen words (words that do not appear in the original corpus but exist in the synonym dictionary), which enhances the diversity of expressions. This operation is optional and is governed by a probability value $p_3 \in [0,1]$; the invention takes $p_3 = 0.1$.
4. Cross-language translation operation: this operation expands synonymous sentences by exploiting the changes in expression that arise when current machine translation technology translates between different languages. Specifically, an existing machine translation service (such as Google Translate or Baidu Translate) translates the dialogue corpus question into a first intermediate language, then from the first intermediate language into a second intermediate language, and finally from the second intermediate language back into Chinese. The returned result is compared with the original input and retained if the two are not identical. For example, "Excuse me, are the hotel rooms covered by wifi?" can first be translated into English as "Does the hotel room have WiFi coverage", then from English into French as "La chambre d'hôtel est-elle couverte par un réseau sans fil", and finally from French back into Chinese as "whether hotel rooms are covered by a wireless network". The final returned expression preserves semantic consistency while offering a richer variety of expression.
This operation is optional and is governed by a probability value $p_4 \in [0,1]$; the invention takes $p_4 = 0.3$.
Each of the four operations is randomly skipped or selected, so that the operations finally act on the candidate dialogue corpus with probabilities $1-p_1$, $1-p_2$, $1-p_3$ and $1-p_4$ respectively. FIG. 3 shows an example: after the first two operations have been performed there are 4 possible variants, and continuing with the remaining two operations gives 4×2×2 = 16 possible variants.
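The random application of the four operations might be sketched as below. The sketch rests on stated assumptions: each $p_i$ is read as the probability of skipping an operation, so that the operation acts with probability $1-p_i$; the round-trip translation is passed in as a caller-supplied function because no specific translation API is named in the text; and all function names and the toy dictionaries are hypothetical.

```python
import random


def swap_adjacent(words, k):
    """Word order exchange: swap the word at index k with the next word."""
    out = list(words)
    out[k], out[k + 1] = out[k + 1], out[k]
    return out


def delete_stop_words(words, stop_words):
    """Stop word deletion: drop every word found in the stop word dictionary."""
    return [w for w in words if w not in stop_words]


def replace_synonyms(words, synonyms):
    """Synonym substitution: replace each word that has a dictionary entry."""
    return [synonyms.get(w, w) for w in words]


def expand(words, stop_words, synonyms, translate_round_trip, rng,
           skip_probs=(0.8, 0.4, 0.1, 0.3)):
    """Return one variant; operation i is applied with probability 1 - p_i."""
    p1, p2, p3, p4 = skip_probs
    out = list(words)
    if len(out) > 1 and rng.random() >= p1:
        out = swap_adjacent(out, rng.randrange(len(out) - 1))
    if rng.random() >= p2:
        out = delete_stop_words(out, stop_words)
    if rng.random() >= p3:
        out = replace_synonyms(out, synonyms)
    if rng.random() >= p4:
        # Assumption: the caller supplies the multi-hop machine translation.
        out = translate_round_trip(out)
    return out


rng = random.Random(0)
question = ["excuse-me", "hotel", "room", "wifi", "covered"]
variant = expand(question, {"excuse-me"}, {"hotel": "inn"}, lambda w: w, rng)
print(variant)
```

Calling `expand` repeatedly on the same question with different random draws yields the set of candidate variants; duplicates and variants identical to the input can be filtered by the caller.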
Step 2: The invention vectorizes the questions of all segmented dialogue corpus entries, i.e. represents each with a fixed-length vector $d_{text}$, and calculates distances on that basis. The specific steps are shown in FIG. 4.
The system maps each word to a low-dimensional continuous vector. A deep text representation model (e.g. Word2Vec, a tool that converts words into vector form) may be used to characterize the segmented questions of the dialogue corpus and obtain word vectors. For a sentence $Sentence_i = (w_{i,1}, w_{i,2}, \dots, w_{i,k}, \dots)$, every word $w_{i,k}$ can be mapped to a vector $v_{i,k}$; the vector dimension is taken to be 200.
The word vectors are then summed to obtain the semantic vector of the dialogue corpus entry:
$$d_{text}^{(i)} = \sum_{k} v_{i,k}$$
Further, the vectors of the n dialogue corpus entries are averaged to obtain the average vector:
$$d_{mean} = \frac{1}{n} \sum_{i=1}^{n} d_{text}^{(i)}$$
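The summation and averaging described above can be illustrated with a short sketch. It assumes the word vectors are already available as a plain dictionary (in practice they would come from a trained Word2Vec model with 200 dimensions), treats out-of-vocabulary words as contributing nothing, and uses hypothetical function names and 2-dimensional toy vectors for readability.

```python
def sentence_vector(words, embeddings, dim):
    """Sum the word vectors of a segmented question.

    Assumption: words missing from the embedding table are skipped.
    """
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w)
        if vec is not None:
            total = [t + x for t, x in zip(total, vec)]
    return total


def mean_vector(vectors):
    """Component-wise average of the n sentence vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy 2-dimensional embeddings (the description uses 200 dimensions).
toy_embeddings = {"hotel": [1.0, 2.0], "wifi": [3.0, 4.0], "room": [0.0, 2.0]}
d_text = [
    sentence_vector(["hotel", "wifi"], toy_embeddings, 2),
    sentence_vector(["room"], toy_embeddings, 2),
]
print(d_text)               # [[4.0, 6.0], [0.0, 2.0]]
print(mean_vector(d_text))  # [2.0, 4.0]
```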
The distance between each dialogue corpus entry and the average vector is calculated and used as the entry's difference score. The invention designs a dedicated distance formula that amplifies the differences between dialogue corpus entries as much as possible. For an input vector $d_{text}$ and the average vector $d_{mean}$, the distance Dis is calculated as follows:
[Formula images BDA0002314365740000065 and BDA0002314365740000066: definition of Dis($d_{text}$, $d_{mean}$), which involves the parameter threshold]
In the formula, threshold is a predefined threshold value that keeps the difference small; the invention sets it to 0.01.
The higher the Dis score, the larger the difference. A large difference may indicate that the dialogue corpus entry is incorrect and should be discarded to improve corpus quality. It may also indicate that the entry is correct but an unusual expression, which benefits the diversity of the corpus and must be retained.
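The scoring step can be sketched as follows. Note that the invention's own Dis formula is not reproduced in the text, so this sketch substitutes plain Euclidean distance as a stand-in; it is not the patented formula and does not use the threshold parameter. The function names are hypothetical, and any other distance function can be passed in.

```python
import math


def euclidean(a, b):
    """Stand-in distance; NOT the invention's Dis formula."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anomaly_scores(vectors, distance=euclidean):
    """Score each sentence vector by its distance from the mean vector."""
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    return [distance(v, mean) for v in vectors]


vectors = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0]]
print(anomaly_scores(vectors))  # [1.0, 1.0, 2.0]
```

The third vector lies farthest from the mean and therefore receives the highest score, which marks it as the candidate for semantic review.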
Step 3: The entries are sorted by score, and the low-scoring ones are stored in the improved dialogue corpus $Corpus_{improved}$. In step 2, each of the n dialogue corpus entries has been given a score. The entries are then ranked by score in ascending order, from low to high. Different threshold selection methods may be used as required. In the invention, the threshold $Score_{threshold}$ is determined by the adjacent maximum difference method, and the first r entries, whose scores are below the threshold, are selected. These low-scoring entries differ little from the average, can be considered valid and accurate, and are put into $Corpus_{improved}$. The remaining n-r entries are analyzed and processed in the next step. The adjacent maximum difference method is calculated as follows:
3.1 Sorting: the n dialogue corpus entries are scored by the distance Score and ranked from low to high: $(Sentence_1, Score_1), (Sentence_2, Score_2), \dots, (Sentence_n, Score_n)$.
3.2 Calculating the adjacent differences: the difference $\Delta_k$ between adjacent entries in the ranking is calculated as
$$\Delta_k = |Score_k - Score_{k-1}|, \quad k = 2, \dots, n$$
3.3 Taking the maximum difference and determining the scoring threshold $Score_{threshold}$: the maximum of the differences in 3.2 is denoted $\Delta_q$, meaning that the difference between $Sentence_{q-1}$ and $Sentence_q$ is the largest. The average of their two scores is taken as the scoring threshold:
$$Score_{threshold} = \frac{Score_{q-1} + Score_q}{2}$$
3.4 Using the threshold $Score_{threshold}$ from 3.3, the first r entries, whose scores are below the threshold, are selected. These low-scoring entries differ little from the average, can be considered valid and accurate, and are put into $Corpus_{improved}$. The remaining n-r entries are analyzed and processed in the next step.
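Steps 3.1 to 3.4 can be illustrated with the following sketch, which assumes an ascending sort and takes the first largest gap when several gaps tie. The function name and the tuple layout are hypothetical.

```python
def split_by_max_gap(scored):
    """Adjacent maximum difference method on (sentence, score) pairs.

    Returns (threshold, kept, to_review): entries scoring below the threshold
    are kept; the rest go on to semantic analysis.
    """
    ranked = sorted(scored, key=lambda pair: pair[1])  # 3.1: ascending order
    if len(ranked) < 2:
        return None, ranked, []  # no adjacent pair, so nothing to split
    # 3.2: differences between neighbours in the ranking
    gaps = [ranked[k][1] - ranked[k - 1][1] for k in range(1, len(ranked))]
    # 3.3: q is the position of the upper neighbour of the largest gap
    q = max(range(len(gaps)), key=gaps.__getitem__) + 1
    threshold = (ranked[q - 1][1] + ranked[q][1]) / 2
    # 3.4: split at the threshold
    kept = [pair for pair in ranked if pair[1] < threshold]
    to_review = [pair for pair in ranked if pair[1] >= threshold]
    return threshold, kept, to_review


scored = [("s1", 1.0), ("s2", 9.0), ("s3", 2.0), ("s4", 3.0)]
threshold, kept, to_review = split_by_max_gap(scored)
print(threshold)   # 6.0
print(kept)        # [('s1', 1.0), ('s3', 2.0), ('s4', 3.0)]
print(to_review)   # [('s2', 9.0)]
```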
Step 4: The high-scoring outliers are further analyzed semantically. This analysis requires a person to judge the quality and diversity of each dialogue corpus entry, mainly whether the entry is consistent with the target intent Intent and whether its expression is clear. The entries are finally divided into the following four types: (1) erroneous dialogue data; (2) dialogue data with good diversity; (3) data that is correct but of ordinary diversity; (4) data for which a semantic judgment cannot be made for the time being.
To improve efficiency, the classification and screening of the semantic analysis may also be performed automatically. Specifically, a comprehensive judgment can be made from an automatic web search index and the results returned by a question-answering model. The automatic judging process is as follows:
4.1 Calculating the occurrence frequency of the dialogue corpus entry
For the question $Sentence_i$ of a dialogue corpus entry, the word string of the sentence is used as the search term, a search is performed with an Internet search index (Baidu, Google, etc.), and the number of relevant results found, $count_i$, is returned. This value characterizes how frequently the expression occurs and yields a tuple $(Sentence_i, count_i)$. The method is not limited to search engine data: an index library can also be built in-house, and a similar measure returned through known techniques such as an inverted index.
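For the self-built index option, a minimal inverted index might look like the sketch below. It assumes that the frequency measure is the number of indexed documents containing every word of the question; the text does not define the measure precisely, so this is only one plausible reading, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of ids of the segmented documents containing it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(documents):
        for w in words:
            index[w].add(doc_id)
    return index


def match_count(index, query_words):
    """Number of documents that contain every query word (assumed measure)."""
    if not query_words:
        return 0
    postings = [index.get(w, set()) for w in query_words]
    return len(set.intersection(*postings))


documents = [["hotel", "wifi"], ["hotel", "pool"], ["wifi", "password"]]
index = build_inverted_index(documents)
print(match_count(index, ["hotel", "wifi"]))  # 1
print(match_count(index, ["hotel"]))          # 2
```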
4.2 Verifying the question-answering effect of the dialogue corpus entry
First, a well-known question-answering model, such as DSSM (Deep Structured Semantic Models), is trained on the initial corpus $Corpus_{init}$ to obtain an automatic question-answering model QA-Model that answers input questions. Then, for the question $Sentence_i$ of a dialogue corpus entry, the word string is used as input and answered automatically by QA-Model. There are two possible outcomes: either the model can answer with its current capabilities and returns an answer string, or the model currently cannot answer and the returned result is null. This yields a tuple $(Sentence_i, Answer_{i\text{-}QA\text{-}Model})$.
4.3 Classification
From the tuples obtained in 4.1 and 4.2, a triplet $(Sentence_i, count_i, Answer_{i\text{-}QA\text{-}Model})$ is obtained for each dialogue corpus entry. According to $count_i$ and $Answer_{i\text{-}QA\text{-}Model}$, $Sentence_i$ is classified as follows:
when $count_i$ is greater than K and $Answer_{i\text{-}QA\text{-}Model}$ is null, it is classified as (1) erroneous dialogue data;
when $count_i$ is not greater than K and $Answer_{i\text{-}QA\text{-}Model}$ is not null, it is classified as (2) dialogue data with good diversity;
when $count_i$ is greater than K and $Answer_{i\text{-}QA\text{-}Model}$ is not null, it is classified as (3) data that is correct but of ordinary diversity;
when $count_i$ is not greater than K and $Answer_{i\text{-}QA\text{-}Model}$ is null, it is classified as (4) data for which a semantic judgment cannot be made for the time being.
The threshold K may be specified empirically; the present invention selects a value of 100000.
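The four-way rule above reduces to a small decision function. In this sketch a null answer is represented by `None` or an empty string, and the label strings are hypothetical names chosen for illustration.

```python
K = 100000  # empirical threshold stated in the description


def classify(count, answer, k=K):
    """Classify a question from its search count and the QA model's answer."""
    frequent = count > k
    answered = answer is not None and answer != ""  # assumed encoding of "null"
    if frequent and not answered:
        return "error"          # type (1): discard
    if not frequent and answered:
        return "diverse"        # type (2): reuse as seed
    if frequent and answered:
        return "general"        # type (3): store directly
    return "undetermined"       # type (4): discard


print(classify(250000, None))          # error
print(classify(1200, "Yes, it is."))   # diverse
print(classify(250000, "Yes, it is.")) # general
print(classify(1200, None))            # undetermined
```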
4.4 According to the classification, the entries are handled as follows:
If the entry is erroneous dialogue data (1) or data for which a semantic judgment cannot be made (4), it is discarded directly. For example, if the Intent is "query the network", a dialogue corpus entry containing the question "give me a network cable" does not match the intent and must be discarded;
If the entry is dialogue data with good diversity (2), it is used as input again and the process returns to step 1 for another iteration. For example, with the Intent "query the network", a dialogue corpus entry containing "can I connect to wifi at the hotel" has good diversity and matches the intent; it is not only put into the improved corpus $Corpus_{improved}$ but also expanded further by repeating the iteration from step 1;
If the entry is otherwise correct dialogue data of ordinary diversity (3), it is saved in the improved dialogue corpus. For example, with the Intent "query the network", a dialogue corpus entry containing "the hotel wifi should have full coverage, right" is of ordinary diversity and is put directly into the improved corpus $Corpus_{improved}$.
Step 5: The method is an iterative update process, and different stop conditions can be set as required: for example, the number of iterations reaches a fixed value, the input from step 4 is empty, or the number of dialogues in the improved corpus $Corpus_{improved}$ reaches a preset number.
Finally, the improved corpus $Corpus_{improved}$ obtained by the method is a human-machine dialogue corpus of reliable quality and rich diversity.
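The overall iteration with its three example stop conditions can be sketched as a loop over caller-supplied functions. Everything here is an illustrative assumption: the four callables (`expand`, `score_all`, `split`, `classify`) stand for steps 1 to 4 and are hypothetical, diverse entries are both stored and reused as seeds as in the hotel example, and duplicates are filtered, which the text does not mention.

```python
def improve_corpus(seeds, expand, score_all, split, classify,
                   max_iterations=5, target_size=None):
    """Iterate steps 1-4 until a stop condition holds; return the improved corpus."""
    improved, seen = [], set()

    def keep(sentence):
        # Assumption: each sentence is stored at most once.
        if sentence not in seen:
            seen.add(sentence)
            improved.append(sentence)

    iterations = 0
    while seeds and iterations < max_iterations:   # stop: no input / fixed count
        iterations += 1
        candidates = expand(seeds)                  # step 1
        kept, review = split(score_all(candidates)) # steps 2 and 3
        for sentence in kept:
            keep(sentence)
        seeds = []
        for sentence in review:                     # step 4
            label = classify(sentence)
            if label == "diverse":
                keep(sentence)
                seeds.append(sentence)              # seed for the next round
            elif label == "general":
                keep(sentence)
            # "error" and "undetermined" entries are discarded
        if target_size is not None and len(improved) >= target_size:
            break                                   # stop: corpus large enough
    return improved


# Toy stand-ins: expansion appends "x", the score is the string length.
result = improve_corpus(
    ["a", "bb"],
    expand=lambda seeds: list(seeds) + [s + "x" for s in seeds],
    score_all=lambda cands: [(s, len(s)) for s in cands],
    split=lambda scored: ([s for s, v in scored if v < 3],
                          [s for s, v in scored if v >= 3]),
    classify=lambda s: "diverse",
    max_iterations=2,
)
print(result)  # ['a', 'bb', 'ax', 'bbx', 'bbxx']
```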
The dictionary for the maximum matching method and the training corpus for the supervised conditional random field model come from 100,000 user reviews manually annotated for the invention.
Test results on several groups of dialogue corpora show that the method reduces erroneous dialogue corpus entries by about 20%, improving corpus quality, and increases the number of corpus entries by about 60%, increasing corpus diversity. On the improved corpus, the precision of man-machine dialogue models generally improves by 3 to 7 percentage points.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (9)

1. A dialog corpus generation method for improving the quality and diversity of man-machine interaction dialog corpus comprises the following steps:
1) Carrying out synonymous sentence expansion on the selected dialogue corpus to form a candidate set; the method for generating the candidate set comprises the following steps:
11) selecting two adjacent words from the dialogue corpus each time [word symbols given as formula images FDA0004055145240000011 and FDA0004055145240000012] and exchanging them to obtain a plurality of expanded sentences;
12) deleting the stop words in each expanded sentence by using the stop word dictionary;
13) judging, by using the synonym dictionary, whether each word segment in each sentence has a synonym and, if so, replacing the corresponding word segment with the synonym from the synonym dictionary, thereby expanding each sentence into a plurality of sentences;
14) for each sentence after expansion, first translating the sentence into a first intermediate language, then translating the first intermediate language into a second intermediate language, and then translating the second intermediate language back into the original language, or translating the sentence back into the original language after a plurality of language conversions; then comparing the result returned after the multiple translation conversions with the original sentence to determine whether they are consistent; if not, storing the returned result and the original sentence in the candidate set; otherwise, storing the original sentence in the candidate set;
2) Performing anomaly detection on each dialogue corpus in the candidate set to obtain an anomaly value of each dialogue corpus;
3) Storing the dialogue corpus with the abnormal value lower than the set scoring threshold value into a lifted dialogue corpus;
4) Semantic analysis is carried out on dialogue corpora with abnormal values higher than or equal to the scoring threshold value: if the dialogue data is wrong, the dialogue data is directly discarded; if the dialogue data is diversified, executing step 5); otherwise, the current dialogue corpus is stored in the lifted dialogue corpus;
5) Taking the dialogue data determined in step 4) to be diversified as input again, and executing steps 1) to 4) until a stopping condition is reached, whereupon the iteration stops.
2. The method of claim 1, wherein the word order exchange of step 11), the stop word deletion of step 12), the synonym substitution of step 13) and the cross-language translation of step 14) each correspond to a skip probability, which sets the probability of skipping execution of the corresponding step.
3. The method as claimed in claim 1, wherein in step 2), all sentences of each dialogue corpus in the candidate set are vectorized after word segmentation to obtain a vector d_text of a set length; the vectors corresponding to all the dialogue corpora in the candidate set are then averaged to obtain an average vector d_mean; and the distance between each vector d_text and the average vector d_mean is then calculated and used as the difference value of the corresponding dialogue corpus.
4. The method of claim 3, wherein the distance is [formula images FDA0004055145240000013 and FDA0004055145240000014], wherein threshold is a defined threshold, N is the dimension of the vector, x_i is the i-th dimension component of the vector d_text, and d_i is the i-th dimension component of the average vector d_mean.
5. The method of claim 1, wherein in step 3) the scoring threshold is determined based on an adjacent maximum difference method; wherein the adjacent maximum difference method is as follows:
31) ranking the dialogue corpora according to their abnormal values, the ranking result being recorded as: (Sentence_1, Score_1), (Sentence_2, Score_2), ..., (Sentence_n, Score_n), where Sentence_n is the sentence corresponding to the n-th dialogue corpus and Score_n is the abnormal value of the n-th dialogue corpus;
32) calculating the difference between adjacent ranks [formula image FDA0004055145240000021];
33) taking the maximum difference among the results obtained in step 32), recorded as Delta_q, and determining the scoring threshold Score_threshold from the two adjacent abnormal values Score_q and Score_{q-1} corresponding to Delta_q.
6. The method of claim 1, wherein the method of semantically analyzing the dialogue corpus is:
41) taking the question Sentence_i in the dialogue corpus as a search term, and counting the frequency of occurrence count_i of Sentence_i;
42) inputting Sentence_i into an automatic question-answering model to obtain a returned result Answer_i-QA-Model;
43) classifying Sentence_i according to count_i and Answer_i-QA-Model: when count_i is greater than a set threshold K and Answer_i-QA-Model is empty, it is classified as wrong dialogue data; when count_i is not greater than the set threshold K and Answer_i-QA-Model is not empty, it is classified as diversified dialogue data; when count_i is greater than K and Answer_i-QA-Model is not empty, it is classified as correct data of average diversity; when count_i is not greater than K and Answer_i-QA-Model is empty, it is classified as data that cannot be semantically judged for the time being.
7. The method of claim 1, wherein the selected dialog corpus includes question, answer, and annotated question-answer intents; and in the step 1), carrying out synonymous sentence expansion on question sentences in the selected dialogue corpus.
8. An artificial intelligence model training method, characterized in that the artificial intelligence model is trained by using the corpus in the dialogue corpus obtained by the method as claimed in claim 1.
9. A man-machine interaction method is characterized in that man-machine interaction is carried out by adopting an artificial intelligent model trained by the method according to claim 8.
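The anomaly scoring of claim 3 and the threshold selection of claim 5 can be illustrated with a short sketch. Several points are assumptions rather than the claimed formulas: the distance is taken to be plain Euclidean distance (the distance of claim 4 involves a threshold term and is available only as an image), the scores are ranked in ascending order, and the threshold is set to the midpoint of the two scores on either side of the largest gap. The function names are hypothetical.

```python
import math


def anomaly_values(vectors):
    """Distance of each sentence vector d_text to the mean vector d_mean.

    Assumption: Euclidean distance is used as a stand-in for the
    distance formula of claim 4.
    """
    n = len(vectors)
    dim = len(vectors[0])
    d_mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return [
        math.sqrt(sum((x - d) ** 2 for x, d in zip(v, d_mean)))
        for v in vectors
    ]


def adjacent_max_difference_threshold(scores):
    """Pick a threshold at the largest gap between adjacent ranked scores.

    Assumptions: ascending ranking, and the midpoint of the two
    neighbouring scores is used as the threshold.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ranked = sorted(scores)
    gaps = [ranked[i] - ranked[i - 1] for i in range(1, len(ranked))]
    q = max(range(len(gaps)), key=gaps.__getitem__) + 1  # index of Score_q
    return (ranked[q] + ranked[q - 1]) / 2
```

With these two helpers, sentences whose anomaly value falls below the returned threshold would be stored directly, while the remaining sentences would go on to semantic analysis.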
CN201911271656.6A 2019-12-12 2019-12-12 Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus Active CN111026884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911271656.6A CN111026884B (en) 2019-12-12 2019-12-12 Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911271656.6A CN111026884B (en) 2019-12-12 2019-12-12 Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus

Publications (2)

Publication Number Publication Date
CN111026884A CN111026884A (en) 2020-04-17
CN111026884B true CN111026884B (en) 2023-06-02

Family

ID=70208856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911271656.6A Active CN111026884B (en) 2019-12-12 2019-12-12 Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus

Country Status (1)

Country Link
CN (1) CN111026884B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231458B (en) * 2020-10-23 2023-03-21 河北省讯飞人工智能研究院 Capacity expansion method, device, equipment and storage medium for dialogue corpus
CN112489628B (en) * 2020-11-23 2024-02-06 平安科技(深圳)有限公司 Voice data selection method and device, electronic equipment and storage medium
CN112597748B (en) * 2020-12-18 2023-08-11 深圳赛安特技术服务有限公司 Corpus generation method, corpus generation device, corpus generation equipment and computer-readable storage medium
CN112836525B (en) * 2021-01-13 2023-08-18 江苏金陵科技集团有限公司 Machine translation system based on man-machine interaction and automatic optimization method thereof
CN113204966B (en) * 2021-06-08 2023-03-28 重庆度小满优扬科技有限公司 Corpus augmentation method, apparatus, device and storage medium
CN115062630B (en) * 2022-07-25 2023-01-06 北京云迹科技股份有限公司 Method and device for confirming nickname of robot

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10283373A (en) * 1997-04-07 1998-10-23 Aptecs Software Inc System and method for generating and retrieving context vector

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104516903A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Keyword extension method and system and classification corpus labeling method and system
WO2015105994A1 (en) * 2014-01-08 2015-07-16 Callminer, Inc. Real-time conversational analytics facility
CN107122346B (en) * 2016-12-28 2018-02-27 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN108197274B (en) * 2018-01-08 2020-10-09 合肥工业大学 Abnormal personality detection method and device based on conversation
CN109189901B (en) * 2018-08-09 2021-05-18 北京中关村科金技术有限公司 Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN109376224B (en) * 2018-10-24 2020-07-21 深圳市壹鸽科技有限公司 Corpus filtering method and apparatus
CN110134952B (en) * 2019-04-29 2020-03-31 华南师范大学 Error text rejection method, device and storage medium
CN110362659A (en) * 2019-07-16 2019-10-22 北京洛必德科技有限公司 The abnormal statement filter method and system of the open corpus of robot
CN110489538B (en) * 2019-08-27 2020-12-25 腾讯科技(深圳)有限公司 Statement response method and device based on artificial intelligence and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
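Shi Jing; Wu Yunfang; Qiu Likun; Lü Xueqiang. A method for computing Chinese word-sense similarity based on a large-scale corpus. Journal of Chinese Information Processing (中文信息学报). 2013, (01), full text. *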

Also Published As

Publication number Publication date
CN111026884A (en) 2020-04-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20210112
Address after: Room b-9014, 8th floor, building 1, 188 Changyi Road, Baoshan District, Shanghai 200441
Applicant after: Shanghai Yishang Network Technology Co.,Ltd.
Address before: Room 1506, 15/F, building 1, yangyangchun investment building, 66 Yangming East Road, Donghu District, Nanchang City, Jiangxi Province, 330000
Applicant before: Nanchang Zhonghui Zhiying Information Technology Co.,Ltd.
GR01 Patent grant