CN111191442B - Similar problem generation method, device, equipment and medium - Google Patents
- Publication number: CN111191442B (application CN201911397007.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses a similar problem generation method in the technical field of natural language processing. The method trains a problem recognition model by labeling part of the interactive corpus data, recognizes problems with that model, and calculates problem similarity to complete similar-problem classification and thereby generate similar problems. The method comprises the following steps: acquiring interactive corpus text data; labeling part of the interactive corpus text data to form problem text corpus data; training an original machine learning model with the preprocessed problem text corpus data to generate a problem recognition model; performing problem recognition on the unlabeled interactive corpus text data with the problem recognition model to form a problem corpus; calculating the similarity of the problem corpus through a text similarity algorithm to generate classes of similar problem corpora; and outputting each class of similar problem corpus. The invention also discloses a similar problem generating device, an electronic device, and a computer storage medium.
Description
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a medium for generating a similar problem.
Background
Current intelligent customer service robots mainly realize intention recognition with machine learning algorithms. Such algorithms need a large amount of similar corpus with known categories for training, so a large amount of labeled data must be prepared at the initial stage of building an intelligent customer service robot. In existing practice this labeled data is usually obtained through manual labeling, which suffers from long cycles and high labeling costs. How to automatically acquire related similar corpus data at the initial stage of construction is therefore particularly critical.
In the prior art, Chinese patent CN109033390A discloses a method and apparatus for automatically generating similar questions. The method comprises obtaining an initial question; generating an expanded question from the initial question; and judging whether the expanded question is a similar question, marking it according to the judgment result. By automatically generating similar questions with a computer, the method saves the labor of manual marking and can reduce the cost of robot customer service. However, in practical application the method adopts a deep learning model, which requires a large amount of labeled training corpus, so training data are difficult to acquire; moreover, the sentences output by the model are limited to the range of the training corpus and may be confused or ungrammatical.
Chinese patent CN109063004A discloses a method and apparatus for automatically generating FAQ similar questions. The method comprises: generating a text based on a selected FAQ; judging whether the generated text is similar to the selected FAQ; and, if it is, taking the text as a similar question of the selected FAQ. The patent reduces the input cost of manual marking by automatically generating FAQ similar questions and can quickly construct corresponding similar questions for newly added FAQs; in addition, combining natural language processing with sentence generation rules more effectively improves the quality of similar-question generation. However, the synonym lexicon is difficult to maintain accurately and the initial question-generation rules are difficult to construct, so the method is hard to maintain, has low applicability, and cannot be widely applied.
Therefore, in order to achieve automatic acquisition of similar corpus data, it is highly desirable to propose a method for generating similar problems that overcomes the above-mentioned drawbacks of the prior art and is generally applicable.
Disclosure of Invention
In order to overcome the defects of the prior art, one object of the invention is to provide a similar problem generating method. The method labels part of the interactive corpus and uses the labeled corpus as the training corpus to train a problem recognition model; the model then recognizes problem sentences in the interactive corpus, and similar problems are generated through a text similarity algorithm. Similar corpus data are thereby acquired automatically, a large amount of labeled training corpus is avoided, and the generated similar problems are not limited by the range of the training corpus.
One of the purposes of the invention is realized by adopting the following technical scheme:
a similar problem generating method comprising the steps of:
acquiring interactive corpus text data;
extracting a plurality of interactive corpus text data for problem labeling to form problem text corpus data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
and outputting corpus of similar problems of each class.
Further, obtaining the interactive corpus text data specifically includes:
acquiring interaction corpus data and judging the data type of the interaction corpus;
if the data type of the interactive corpus data is text data, the interactive corpus data is used as the interactive corpus text data;
if the data type of the interactive corpus data is voice data, voice translation and text correction are carried out on the interactive corpus data, and the interactive corpus data after voice translation and text correction is used as the interactive corpus text data.
Further, question labeling is performed on the extracted interactive corpus text data in combination with an industry customer service operation database.
Further, preprocessing the question text corpus data and the unlabeled interactive corpus text data specifically includes:
and respectively carrying out word segmentation and stop word filtering on the question text corpus data and the unlabeled interactive text corpus data.
Further, the original machine learning model is a supervised machine learning model, and the problem recognition model is used for performing problem recognition on the unlabeled interactive corpus text data to obtain a problem corpus, and the method further comprises the following steps:
labeling the problem corpus, and iteratively training the problem identification model by using the labeled problem corpus.
Further, the text similarity algorithm is a cosine similarity algorithm, the similarity of the problem corpus is calculated through the cosine similarity algorithm, the problem corpus is classified according to a preset threshold value and a similarity calculation result, and various similar problem corpuses are generated, and the method specifically comprises the following steps:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the question sentence vector set to serve as the center vector of a question category;
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
Further, outputting corpus of similar problems of each category, which specifically comprises the following steps:
extracting keywords from the similar problem corpora of each class through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class;
the keyword algorithm is a TextRank algorithm, and the corpus of each category of similar problems is integrated according to the keyword extraction result, and specifically comprises the following steps:
merging the classes of similar problem corpora whose keywords have the same meaning;
and/or,
splitting or deleting classes of similar problem corpora whose keywords have different meanings;
and/or,
splitting or deleting classes of similar problem corpora whose keywords have different meanings, extracting the similar problems they contain, and adding those problems to the classes of similar problem corpora whose keywords have the same meaning.
The second object of the present invention is to provide a similar problem generating device that uses part of the interaction data as a training corpus to train and generate a problem recognition model, recognizes problem sentences in the interactive corpus with that model, and finally calculates problem similarity through a text similarity algorithm to generate similar problems.
The second object of the invention is achieved by the following technical solution:
a similar problem generating device, comprising:
the data acquisition module is used for acquiring interactive corpus text data;
the labeling module is used for extracting a plurality of interactive corpus text data for question labeling to form question text corpus data;
the data preprocessing module is used for preprocessing the question text corpus data and the unlabeled interactive corpus text data;
the model training module is used for training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
the problem recognition module is used for carrying out problem recognition on the unlabeled interactive corpus text data by utilizing the problem recognition model to obtain a problem corpus;
the classification module is used for calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus data according to a preset threshold value and a similarity calculation result, and generating various similar problem corpuses;
and the result output module is used for outputting corpus of similar problems of each class.
A third object of the present invention is to provide an electronic device comprising a processor, a storage medium, and a computer program stored in the storage medium; when the computer program is executed by the processor, the above similar problem generating method is implemented.
A fourth object of the present invention is to provide a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the above similar problem generating method is implemented.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, a small amount of interaction corpus is marked as the training sample, so that the problem identification model is trained, a large amount of marking work of the training sample is reduced, and time and labor are saved. According to the method, the problem sentences are extracted from the interactive corpus by using the problem recognition model, and the similarity of the problem sentences is calculated through a text similarity algorithm, so that similar problems are obtained, and similar corpus data are generated.
According to the method, similar corpus data are obtained from the interactive corpus, a near meaning word library and a related text generation rule are not required to be constructed, so that the generated similar problem is not limited by the scope of the training corpus, and the situation that sentences are not smooth is avoided.
Drawings
FIG. 1 is a flow chart of a similar problem generation method of the present invention;
FIG. 2 is a flowchart of a similar problem generating method of embodiment 2;
fig. 3 is a block diagram of the structure of a similar problem generating apparatus of embodiment 3;
fig. 4 is a block diagram of the electronic device of embodiment 4.
Detailed Description
The invention will now be described in more detail with reference to the accompanying drawings. It should be noted that the description below is given by way of illustration only and not by way of limitation, and the various embodiments may be combined with one another to form further embodiments not explicitly described.
Example 1
This embodiment provides a similar problem generating method. The method trains a machine learning model with interactive-scenario corpus data to obtain a model for problem recognition, recognizes problem sentences in the interactive-scenario corpus with that model, and classifies similar problem sentences by text similarity calculation, thereby generating similar problems and obtaining similar corpus data.
According to the above principle, a method for generating similar problems is described, as shown in fig. 1, and specifically includes the following steps:
acquiring interactive corpus text data;
extracting a plurality of interactive corpus text data for problem labeling to form problem text corpus data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
and outputting corpus of similar problems of each class.
The interactive corpus text data is divided into two parts, one part (namely the extracted interactive corpus text data) is used for model training after problem marking, and the other part (namely the unlabeled interactive corpus text data) is used for subsequent problem recognition, so that a problem recognition model is trained by marking a small amount of interactive corpus text data.
Preferably, the obtaining the interactive corpus text data specifically includes:
acquiring interaction corpus data and judging the data type of the interaction corpus;
if the data type of the interactive corpus data is text data, the interactive corpus data is used as the interactive corpus text data;
if the data type of the interactive corpus data is voice data, voice translation and text correction are carried out on the interactive corpus data, and the interactive corpus data after voice translation and text correction is used as the interactive corpus text data.
The interactive corpus data are interaction-scenario corpus data within the industry; because the similar problem generating method extracts similar problems directly from them, unsmooth generated sentences can be avoided.
Preferably, the extracted plurality of interactive corpus text data are subjected to question labeling in combination with an industry customer service operation database.
The industry customer service operation database is a pre-constructed database storing customer service operation data. If interaction corpus data obtained in this embodiment have no corresponding customer service operation data in the database, problem-statement judgment is performed on them, and the customer service operation associated with the judged problem statements is stored in the database, completing its update.
Preferably, preprocessing the question text corpus data and the unlabeled interactive corpus text data specifically includes:
and respectively carrying out word segmentation and stop word filtering on the question text corpus data and the unlabeled interactive text corpus data.
In another embodiment of the present invention, the question text corpus data and the unlabeled interactive text corpus data are subjected to word segmentation, stop-word filtering, and synonym replacement, respectively. It should be noted that the purpose of performing synonym replacement here is to maintain a library of related synonyms, which may not be performed if maintenance is not required.
The preprocessed problem text corpus data and the unlabeled interaction corpus text data are respectively used for subsequent machine learning model training and problem recognition.
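The preprocessing described above can be sketched as follows; this is a minimal illustration, not the patent's implementation: whitespace tokenization stands in for a real Chinese word segmenter (such as jieba), and the stop-word list is an assumption made only for the example.

```python
# Minimal preprocessing sketch: word segmentation followed by stop-word filtering.
# The stop-word list below is illustrative, not taken from the patent.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "do", "i", "how", "my"}

def preprocess(sentence: str) -> list[str]:
    """Segment a sentence into tokens and drop stop words."""
    tokens = sentence.lower().split()                   # segmentation step
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word filtering

print(preprocess("How do I reset my password"))  # → ['reset', 'password']
```

The same routine would be applied to both the question text corpus data and the unlabeled interactive corpus text data before training and recognition.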
Preferably, the original machine learning model is a supervised machine learning model including, but not limited to, logistic regression models, support vector machines, CNNs, and LSTM; training the original machine learning model by using the problem text corpus data to generate a problem recognition model, and performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain a problem corpus, wherein the method further comprises the following steps:
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
labeling the problem corpus, and iteratively training the problem identification model by using the labeled problem corpus so as to adjust the predicted value of the problem identification model.
For the supervised machine learning model, the model is iteratively trained with pre-labeled training data. The problem recognition model obtained after iterative training then performs problem recognition (prediction) on the unlabeled interactive corpus text data to obtain a problem corpus (the prediction result). This problem corpus is labeled in turn, and the labeled corpus is used to continue iterative training and optimize the weight parameters of the model. When a stopping condition is met (for example, the iteration count reaches its upper limit, or the gap between predicted and true values reaches its target), training is complete: the model has converged to fixed weight parameters, the predicted values of the problem recognition model have been adjusted, and the gap between prediction and ground truth has been reduced. The fully trained model is finally used for prediction.
In this embodiment, a CNN model is selected as the original machine learning model and trained with the problem text corpus data to generate the problem recognition model, which performs problem recognition on the unlabeled interactive corpus text data to obtain the problem corpus. The recognized problem corpus is labeled again to adjust the predicted values of the model: iterative training with the labeled corpus optimizes the weight parameters of the CNN model until they are fixed, improving the recognition rate of the problem recognition model. Finally, the trained model recognizes the problem sentences in the interactive corpus text data, yielding the problem corpus.
Preferably, the text similarity algorithm is a cosine similarity algorithm, the similarity of the problem corpus is calculated through the cosine similarity algorithm, the problem corpus is classified according to a preset threshold and a similarity calculation result, and various similar problem corpuses are generated, and the method specifically comprises the following steps:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the question sentence vector set to serve as the center vector of a question category (assume it is the first question category);
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
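The threshold-based grouping above can be sketched as a single pass over the question sentence vectors: each vector joins the category whose center it is most cosine-similar to when the maximum similarity exceeds the threshold (updating that center by the vector-average method), and otherwise starts a new category. The toy vectors and the 0.8 threshold below are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster(vectors, threshold=0.8):
    centers, groups = [], []  # center vector and member list per question category
    for vec in vectors:
        sims = [cosine(vec, c) for c in centers]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] > threshold:
            groups[best].append(vec)
            # update the category center by the vector-average method
            members = groups[best]
            centers[best] = [sum(d) / len(members) for d in zip(*members)]
        else:  # maximum similarity <= threshold: add a new question category
            centers.append(list(vec))
            groups.append([vec])
    return groups

groups = cluster([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(len(groups))  # → 2
```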
The above text similarity algorithm measures the distance between two vectors, and any algorithm capable of calculating vector distances is suitable for calculating text similarity; in other embodiments of the present invention, the text similarity algorithm may therefore also use the Euclidean distance, Manhattan distance, Minkowski distance, Chebyshev distance, or the like.
The present embodiment uses the TF-IDF algorithm, an N-Gram model, and word vectors to vectorize the problem corpus.
The TF-IDF algorithm is a numerical statistic that reflects how important a word is to a document: if a word's term frequency (TF) within a sentence is high while its document frequency (DF, the number of other sentences it occurs in) is low, the word has good class-distinguishing capability.
When TF-IDF is used in this embodiment, the calculation formula is TF-IDF = TF × IDF, where TF(t, d_i) = n(t, d_i) / |d_i|, with n(t, d_i) the number of occurrences of word t in sentence d_i and |d_i| the total number of words in sentence d_i; and IDF(t) = log(N / (1 + N_t)), where N is the total number of sentences in the question corpus and N_t is the number of sentences containing word t.
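The formula above can be written out directly; this is a sketch of the stated definitions (TF = count in sentence / sentence length, IDF = log(N / (1 + number of sentences containing the term))), with toy tokenized sentences as assumptions.

```python
import math

def tf_idf(term: str, sentence: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF = TF * IDF, per the formula in the text."""
    tf = sentence.count(term) / len(sentence)        # n(t, d_i) / |d_i|
    n_t = sum(1 for s in corpus if term in s)        # sentences containing term t
    idf = math.log(len(corpus) / (1 + n_t))
    return tf * idf

corpus = [["a", "b"], ["a", "c"], ["b", "b"]]
print(tf_idf("c", ["a", "c"], corpus))  # positive: rare across sentences
```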
According to the TF-IDF calculation result, the N-Gram model slides a window of size N over each question sentence in the problem corpus to form Groups. The Groups are then counted, low-frequency Groups are filtered out, and the remaining Groups constitute the feature space used for question-sentence segmentation. The N-Gram model may adopt a Bi-Gram (binary) or Tri-Gram (ternary) model depending on the specific situation; the detailed processing is common knowledge in the field and is not described here.
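The Group-forming and filtering step can be sketched as below; the bigram window and the min_count cutoff of 2 are illustrative assumptions, not values from the patent.

```python
from collections import Counter

def ngram_features(sentences: list[list[str]], n: int = 2, min_count: int = 2):
    """Slide a window of size n over each tokenized sentence, count the
    resulting Groups, and keep only those seen at least min_count times."""
    counts = Counter(
        tuple(s[i:i + n]) for s in sentences for i in range(len(s) - n + 1)
    )
    return {g for g, c in counts.items() if c >= min_count}  # feature space

features = ngram_features([["reset", "my", "password"], ["change", "my", "password"]])
print(features)  # → {('my', 'password')}
```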
The problem corpus after N-Gram word segmentation is vectorized through word vectors.
A word vector is a distributed representation of a word, such that distances exist between words and each dimension of the vector carries a characteristic meaning. In practice, word vectors are usually trained through a neural network (DNN), typically a three-layer structure comprising an input layer, a hidden layer, and an output (softmax) layer, with two model variants: CBOW and Skip-Gram. This embodiment uses the CBOW model to complete the vectorization of the problem corpus. Specifically, each segmented word in the problem corpus is taken in turn as the feature word, and the word vectors of its context words are input into the CBOW model for training; training ends when the softmax probability corresponding to the specific word in a training sample is maximal. The parameters of the CBOW model are then obtained through the back-propagation algorithm of the DNN, and with them the word vectors of all words in the problem corpus, completing the vectorization and forming the question sentence vector set.
In other embodiments of the present invention, the Skip-Gram model may also be used to complete the vectorization of the problem corpus.
And calculating cosine similarity between the question sentence vector and the center vector of each question category by using a cosine similarity algorithm.
In this embodiment, the value of the preset threshold is determined according to the category set and the similarity evaluation index, and the specific determining process is as follows:
presetting a threshold initial value, acquiring a category set (a known problem corpus classification set), and then calculating a similarity evaluation index according to the category set;
the threshold initial value is adjusted for multiple times, and similarity evaluation indexes are recalculated according to the category sets respectively;
finally, the threshold whose recomputed similarity evaluation index for the category set is largest is selected as the value of the preset threshold.
The similarity evaluation indices mainly include, but are not limited to, the silhouette coefficient, the Rand index, and the adjusted Rand index. The silhouette coefficient applies when the actual categories are unknown; for a single sample its calculation formula is s = (b − a) / max(a, b), where a is the average distance from the sample to the other samples of its own class, and b is the average distance from the sample to the samples of the nearest different class.
The Rand index and the adjusted Rand index are based on a known class set, so they are used as the similarity evaluation indices when determining the preset threshold.
This embodiment describes the calculation of the Rand index and the adjusted Rand index.
Assuming the class set C and the clustering result K are known, the Rand index is calculated as RI = (a + b) / C(n, 2), where a is the number of element pairs placed in the same class in both C and K, b is the number of element pairs placed in different classes in both C and K, C(n, 2) is the total number of element pairs in the set, and n is the total number of elements. RI lies in the range [0, 1]; the larger RI is, the more consistent the clustering result is with the true situation.
Since the Rand index RI is not guaranteed to be near zero for a random clustering result, the adjusted Rand index is used so that the evaluation index is near zero when the clustering is randomly generated. Its calculation formula is ARI = (RI − E[RI]) / (max(RI) − E[RI]), where ARI lies in the range [−1, 1]. The larger the ARI value, the more consistent the clustering result is with the true situation; in a generalized sense, the adjusted Rand index ARI measures the degree of agreement between two distributions.
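The Rand index RI = (a + b) / C(n, 2) can be computed by direct pair counting, as in this sketch; the toy label sequences are illustrative assumptions.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """RI = (a + b) / C(n, 2): a counts pairs together in both C and K,
    b counts pairs apart in both C and K."""
    a = b = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_c = labels_true[i] == labels_true[j]
        same_k = labels_pred[i] == labels_pred[j]
        a += same_c and same_k
        b += (not same_c) and (not same_k)
    n_pairs = len(labels_true) * (len(labels_true) - 1) // 2
    return (a + b) / n_pairs

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # → 1.0
```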
Example 2
This embodiment differs from the foregoing embodiment in that, as shown in fig. 2, outputting the similar problem corpus of each category specifically includes the following steps:
extracting keywords from the similar problem corpora of each class through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class;
the keyword algorithm is the TextRank algorithm, and integrating the similar problem corpora of each category according to the keyword extraction result specifically includes:
merging similar problem corpora whose keywords have the same meaning;
and/or,
splitting or deleting similar problem corpora whose keywords have different meanings;
and/or,
splitting or deleting similar problem corpora whose keywords have different meanings, extracting the similar questions in the split or deleted corpora, and adding them to the classes of similar problem corpora whose keywords have the same meaning.
The TextRank algorithm is based on the PageRank principle and is used to generate keywords for a text. It treats the grammatical units in the text as nodes in a graph; if a grammatical relationship exists between two units, they are connected by an edge. After a certain number of iterations, different nodes end up with different weights, and the units with high weights can be taken as keywords.
Therefore, in this embodiment, the TextRank algorithm can be used to extract keywords from each class of similar problem corpus. Keyword extraction helps to sort out and integrate the similar problem corpora of each class; in other embodiments, other methods for extracting the main information of a text, such as topic models, may also be used. In practical applications, the similar problem corpora of each class can also be sorted out and integrated manually, in combination with the keyword extraction result, according to the specific situation.
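A rough TextRank sketch over a word co-occurrence graph is shown below; the window size, damping factor, and iteration count are illustrative assumptions, not values specified by the patent.

```python
# Minimal TextRank keyword extraction using only the standard library.
# Words within `window` positions of each other share an edge, and the
# PageRank update is iterated for a fixed budget.
from collections import defaultdict

def textrank_keywords(words, window=3, damping=0.85, iters=30, top_k=3):
    # Build an undirected co-occurrence graph over the token sequence.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Iterate the PageRank-style weight update.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    # Units with the highest weights serve as keywords.
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])][:top_k]

words = "refund order refund apply order status refund policy".split()
print(textrank_keywords(words))
```

In a production system the tokens would come from the segmented, stop-word-filtered corpus of each similar problem class rather than a raw whitespace split.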
Example 3
The present embodiment discloses a similar problem generating apparatus corresponding to the similar problem generating methods of embodiments 1 and 2. The apparatus is a virtual structural apparatus and, as shown in fig. 3, includes:
a data acquisition module 310, configured to acquire interactive corpus text data;
the labeling module 320 is configured to extract a plurality of the interactive corpus text data for question labeling, so as to form question text corpus data;
a data preprocessing module 330, configured to preprocess the question text corpus data and the unlabeled interactive corpus text data;
the model training module 340 is configured to train an original machine learning model by using the preprocessed problem text corpus data, and generate a problem recognition model;
the problem recognition module 350 is configured to perform problem recognition on the unlabeled interactive corpus text data by using the problem recognition model, so as to obtain a problem corpus;
the classification module 360 is configured to calculate a similarity of the problem corpus through a text similarity algorithm, and classify the problem corpus data according to a preset threshold and a similarity calculation result, so as to generate similar problem corpora of each class;
and the result output module 370 is used for outputting corpus of similar questions of each class.
The original machine learning model used by the model training module 340 is a supervised machine learning model, which includes, but is not limited to, logistic regression models, support vector machines, CNNs, and LSTMs. The original machine learning model is trained with the problem text corpus data to generate the problem recognition model, and the problem recognition model is used to recognize problem sentences in the related corpus data.
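As a sketch only, training one of the listed options (a logistic regression model) over TF-IDF features might look like this; the toy data, feature choice, and pipeline are illustrative assumptions, not the patent's implementation.

```python
# A hedged sketch of the model training module: a supervised classifier
# that labels sentences as question (1) or non-question (0).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled problem text corpus: 1 = question sentence, 0 = otherwise.
texts = [
    "how do I reset my password",
    "what is the refund deadline",
    "thanks for your help",
    "the package arrived today",
]
labels = [1, 1, 0, 0]

# TF-IDF features feed the logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# The trained model then recognizes questions in unlabeled interactive text.
pred = model.predict(["when will my order ship"])
```

The same fit/predict interface would apply if an SVM, CNN, or LSTM were substituted for the logistic regression model.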
The classification module 360 calculates the similarity of the problem corpus by using a cosine similarity algorithm, and the value of the preset threshold is determined according to the category set and the similarity evaluation index, wherein the similarity evaluation index comprises, but is not limited to, a contour coefficient, a Rand coefficient and an adjusted Rand coefficient.
In the result output module 370, outputting the similar problem corpus of each class specifically includes:
extracting keywords from each class of similar problem corpus through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class. The integration operations include merging, splitting, deleting, and adding.
In this embodiment, the TextRank algorithm is used to extract the keywords of each class of similar problem corpus; of course, in other embodiments of the present invention, other models and algorithms for extracting the main information of a text, such as a topic model, may also be used to implement keyword extraction.
Example 4
Fig. 4 is a schematic structural diagram of an electronic device according to embodiment 4 of the present invention. As shown in fig. 4, the electronic device includes a processor 410, a memory 420, an input device 430 and an output device 440. The number of processors 410 in the electronic device may be one or more; one processor 410 is taken as an example in fig. 4. The processor 410, memory 420, input device 430, and output device 440 in the electronic device may be connected by a bus or in another manner; connection by a bus is taken as an example in fig. 4.
The memory 420 is used as a computer readable storage medium, and may be used to store software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the method for generating a similar problem in the embodiment of the present invention (for example, the data acquisition module 310, the labeling module 320, the data preprocessing module 330, the model training module 340, the problem identification module 350, the classification module 360, and the result output module 370 in the similar problem generating apparatus). The processor 410 executes various functional applications of the electronic device and data processing, that is, implements the similar problem generating method of embodiment 1, by running software programs, instructions, and modules stored in the memory 420.
Memory 420 may include primarily a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for functionality; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 420 may further include memory remotely located relative to processor 410, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Input device 430 may be used to receive interaction corpus data, and the like. The output means 440 is used to output the generated similar questions.
Example 5
Embodiment 5 of the present invention also provides a storage medium containing computer-executable instructions for implementing a similar problem generating method when executed by a computer processor, the method comprising:
acquiring interactive corpus text data;
extracting a plurality of interactive corpus text data for problem labeling to form problem text corpus data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
and outputting corpus of similar problems of each class.
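The seven steps above can be sketched end to end as a toy pipeline; the question-mark heuristic and first-word grouping below are stand-ins for the trained recognition model and the cosine-similarity clustering the method actually uses, and are assumptions for illustration only.

```python
# Toy end-to-end sketch of the claimed flow: preprocess, recognize
# questions, group similar ones, and return each class.
from collections import defaultdict

def generate_similar_question_classes(interactive_texts):
    # Preprocessing: lowercase and strip (word segmentation and stop-word
    # filtering would go here for Chinese interactive corpus text).
    cleaned = [t.strip().lower() for t in interactive_texts]
    # Question recognition: placeholder for the trained model's predictions.
    questions = [t for t in cleaned if t.endswith("?")]
    # Similarity grouping: placeholder for cosine-similarity clustering,
    # here simply grouping by the leading word.
    classes = defaultdict(list)
    for q in questions:
        classes[q.split()[0]].append(q)
    return dict(classes)

texts = ["How do I pay?", "How do I get a refund?", "Thanks!", "Where is my order?"]
print(generate_similar_question_classes(texts))
```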
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform the related operations in the similar problem generating method provided in any embodiment of the present invention.
From the above description of the embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software plus necessary general-purpose hardware, or by hardware alone, although in many cases the former is the preferred embodiment. Based on this understanding, the technical solution of the present invention, or the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and includes several instructions for causing an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
It should be noted that, in the embodiment of the method or the apparatus for generating a similar problem, each unit and module included are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding function can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
It will be apparent to those skilled in the art from this disclosure that various other changes and modifications can be made which are within the scope of the invention as defined in the appended claims.
Claims (9)
1. A similar problem generating method, characterized by comprising the steps of:
acquiring interactive corpus text data;
extracting a plurality of pieces of interactive corpus text data for problem labeling to form problem text corpus data, wherein the remaining unextracted data in the interactive corpus text data forms the unlabeled interactive corpus text data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
outputting corpus of similar problems of each class;
the text similarity algorithm is a cosine similarity algorithm, the similarity of the problem corpus is calculated through the cosine similarity algorithm, the problem corpus is classified according to a preset threshold value and a similarity calculation result, and various similar problem corpuses are generated, and the method specifically comprises the following steps:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the question sentence vector set to be used as a center vector of a question category;
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
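The traversal just claimed can be sketched as a single-pass clustering routine; the toy vectors and the threshold value are illustrative assumptions, and the first vector is taken in order rather than at random.

```python
# Single-pass cosine clustering: a sentence vector joins the most similar
# existing class when the best cosine similarity exceeds the preset
# threshold (updating that class's center by vector averaging); otherwise
# it becomes the center vector of a new class.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def single_pass_cluster(vectors, threshold=0.9):
    centers, members = [], []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        sims = [cosine(v, c) for c in centers]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))
            members[k].append(v)
            # Update the class center by the vector average method.
            centers[k] = np.mean(members[k], axis=0)
        else:
            # Best similarity <= threshold: open a new class centered on v.
            centers.append(v)
            members.append([v])
    return centers, members

vecs = [np.array([1.0, 0.0]), np.array([0.99, 0.1]), np.array([0.0, 1.0])]
centers, members = single_pass_cluster(vecs, threshold=0.9)
```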
2. The method for generating similar problems according to claim 1, wherein obtaining the text data of the interaction corpus specifically comprises:
acquiring interaction corpus data and judging the data type of the interaction corpus data;
if the data type of the interactive corpus data is text data, the interactive corpus data is used as the interactive corpus text data;
if the data type of the interactive corpus data is voice data, voice translation and text correction are carried out on the interactive corpus data, and the interactive corpus data after voice translation and text correction is used as the interactive corpus text data.
3. The method for generating similar questions as claimed in claim 1, wherein the extracted plurality of interactive corpus text data are question marked in combination with an industry customer service database.
4. The method for generating similar problems according to claim 1, wherein preprocessing the problem text corpus data and the unlabeled interactive corpus text data specifically comprises:
and respectively performing word segmentation and stop-word filtering on the question text corpus data and the unlabeled interactive corpus text data.
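A rough sketch of this preprocessing step follows. For Chinese interactive corpus text a dedicated segmenter (e.g. the jieba library) would normally be assumed; whitespace tokenization stands in here so the example stays dependency-free, and the stop-word list is illustrative.

```python
# Preprocessing sketch: word segmentation (placeholder) plus stop-word
# filtering, the two operations named in this claim.
STOP_WORDS = {"the", "a", "is", "to", "do", "i"}

def preprocess(text):
    tokens = text.lower().split()  # word segmentation (whitespace stand-in)
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word filtering

print(preprocess("How do I apply for a refund"))
```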
5. The method for generating similar problems according to any one of claims 1 to 4, wherein the original machine learning model is a supervised machine learning model, and the method further comprises, after performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain a problem corpus:
labeling the problem corpus, and iteratively training the problem identification model by using the labeled problem corpus.
6. The method for generating similar problems according to claim 5, wherein the step of outputting the corpus of similar problems of each category comprises the steps of:
extracting keywords from the similar problem corpora of each class through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class;
the keyword algorithm is the TextRank algorithm, and integrating the similar problem corpora of each category according to the keyword extraction result specifically includes:
merging similar problem corpora whose keywords have the same meaning;
and/or,
splitting or deleting similar problem corpora whose keywords have different meanings;
and/or,
splitting or deleting similar problem corpora whose keywords have different meanings, extracting the similar questions in the split or deleted corpora, and adding them to the classes of similar problem corpora whose keywords have the same meaning.
7. A similar problem generating apparatus, comprising:
the data acquisition module is used for acquiring interactive corpus text data;
the labeling module is used for extracting a plurality of pieces of the interactive corpus text data for question labeling to form question text corpus data, wherein the remaining unextracted data in the interactive corpus text data forms the unlabeled interactive corpus text data;
the data preprocessing module is used for preprocessing the question text corpus data and the unlabeled interactive corpus text data;
the model training module is used for training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
the problem recognition module is used for carrying out problem recognition on the unlabeled interactive corpus text data by utilizing the problem recognition model to obtain a problem corpus;
the classification module is used for calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus data according to a preset threshold value and a similarity calculation result, and generating various similar problem corpuses;
the result output module is used for outputting corpus of similar problems of each class;
the text similarity algorithm is a cosine similarity algorithm, and the classification module is specifically configured to: calculating the similarity of the problem corpus through the cosine similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various classes of similar problem corpus, wherein the method specifically comprises the following steps of:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the question sentence vector set to be used as a center vector of a question category;
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
8. An electronic device comprising a processor, a storage medium and a computer program stored in the storage medium, characterized in that the computer program, when executed by the processor, implements the similar problem generating method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the similar problem generating method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911397007.0A CN111191442B (en) | 2019-12-30 | 2019-12-30 | Similar problem generation method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111191442A CN111191442A (en) | 2020-05-22 |
CN111191442B true CN111191442B (en) | 2024-02-02 |
Family
ID=70711076
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111611781B (en) * | 2020-05-27 | 2023-08-18 | 北京妙医佳健康科技集团有限公司 | Data labeling method, question answering device and electronic equipment |
CN112949674A (en) * | 2020-08-22 | 2021-06-11 | 上海昌投网络科技有限公司 | Multi-model fused corpus generation method and device |
CN112101423A (en) * | 2020-08-22 | 2020-12-18 | 上海昌投网络科技有限公司 | Multi-model fused FAQ matching method and device |
CN112131876A (en) * | 2020-09-04 | 2020-12-25 | 交通银行股份有限公司太平洋信用卡中心 | Method and system for determining standard problem based on similarity |
CN113312532B (en) * | 2021-06-01 | 2022-10-21 | 哈尔滨工业大学 | Public opinion grade prediction method based on deep learning and oriented to public inspection field |
CN113434650B (en) * | 2021-06-29 | 2023-11-14 | 平安科技(深圳)有限公司 | Question-answer pair expansion method and device, electronic equipment and readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108287822A (en) * | 2018-01-23 | 2018-07-17 | 北京容联易通信息技术有限公司 | A kind of Chinese Similar Problems generation System and method for |
CN109933779A (en) * | 2017-12-18 | 2019-06-25 | 苏宁云商集团股份有限公司 | User's intension recognizing method and system |
CN110032724A (en) * | 2018-12-19 | 2019-07-19 | 阿里巴巴集团控股有限公司 | The method and device that user is intended to for identification |
US10417350B1 (en) * | 2017-08-28 | 2019-09-17 | Amazon Technologies, Inc. | Artificial intelligence system for automated adaptation of text-based classification models for multiple languages |
CN110619051A (en) * | 2019-08-16 | 2019-12-27 | 科大讯飞(苏州)科技有限公司 | Question and sentence classification method and device, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10387430B2 (en) * | 2015-02-26 | 2019-08-20 | International Business Machines Corporation | Geometry-directed active question selection for question answering systems |
US10726061B2 (en) * | 2017-11-17 | 2020-07-28 | International Business Machines Corporation | Identifying text for labeling utilizing topic modeling-based text clustering |
Non-Patent Citations (3)
Title |
---|
Variational Reasoning for Question Answering with Knowledge Graph;Yuyu Zhang et al;Proceedings of the AAAI Conference on Artificial Intelligence;第32卷(第1期);6070-6076 * |
Research on the application of deep learning algorithms in question intent classification; Yang Zhiming; Wang Laiqi; Wang Yong; Computer Engineering and Applications (No. 10); 159-165 *
Research on similar question discrimination; Yin Qingyu et al.; Intelligent Computer and Applications; Vol. 9 (No. 6); 41-44 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111191442B (en) | Similar problem generation method, device, equipment and medium | |
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN108073568B (en) | Keyword extraction method and device | |
CN109543178B (en) | Method and system for constructing judicial text label system | |
CN107085581B (en) | Short text classification method and device | |
WO2019149200A1 (en) | Text classification method, computer device, and storage medium | |
WO2019153737A1 (en) | Comment assessing method, device, equipment and storage medium | |
CN109165294B (en) | Short text classification method based on Bayesian classification | |
CN110362819B (en) | Text emotion analysis method based on convolutional neural network | |
CN110807084A (en) | Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy | |
CN106897439A (en) | The emotion identification method of text, device, server and storage medium | |
CN110083832B (en) | Article reprint relation identification method, device, equipment and readable storage medium | |
CN111158641B (en) | Automatic recognition method for transaction function points based on semantic analysis and text mining | |
CN112464656A (en) | Keyword extraction method and device, electronic equipment and storage medium | |
CN110309504B (en) | Text processing method, device, equipment and storage medium based on word segmentation | |
CN112732871A (en) | Multi-label classification method for acquiring client intention label by robot | |
CN111930933A (en) | Detection case processing method and device based on artificial intelligence | |
CN113449084A (en) | Relationship extraction method based on graph convolution | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN114936277A (en) | Similarity problem matching method and user similarity problem matching system | |
CN116304020A (en) | Industrial text entity extraction method based on semantic source analysis and span characteristics | |
CN115168580A (en) | Text classification method based on keyword extraction and attention mechanism | |
CN112380346A (en) | Financial news emotion analysis method and device, computer equipment and storage medium | |
CN111563361A (en) | Text label extraction method and device and storage medium | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: Room 23011, Yuejiang Commercial Center, No. 857 Xincheng Road, Puyan Street, Hangzhou City, Zhejiang Province, 310000; Applicant after: Hangzhou Yuanchuan Xinye Technology Co.,Ltd.; Address before: Room 23011, Yuejiang Commercial Center, No. 857 Xincheng Road, Puyan Street, Hangzhou City, Zhejiang Province, 310000; Applicant before: Hangzhou Yuanchuan New Technology Co.,Ltd. |
| GR01 | Patent grant | |