CN111191442B - Similar problem generation method, device, equipment and medium - Google Patents

Similar problem generation method, device, equipment and medium

Info

Publication number
CN111191442B
CN111191442B (application number CN201911397007.0A)
Authority
CN
China
Prior art keywords
corpus
data
text
similar
interactive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911397007.0A
Other languages
Chinese (zh)
Other versions
CN111191442A (en)
Inventor
王伟凯
钱艳
邱霞霞
安毫亿
朱鹏飞
Current Assignee
Hangzhou Yuanchuan Xinye Technology Co ltd
Original Assignee
Hangzhou Yuanchuan Xinye Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Yuanchuan Xinye Technology Co ltd filed Critical Hangzhou Yuanchuan Xinye Technology Co ltd
Priority to CN201911397007.0A
Publication of CN111191442A
Application granted
Publication of CN111191442B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The invention discloses a similar problem generation method in the technical field of natural language processing. The method labels part of the interactive corpus data to train a problem recognition model, identifies problems with the model, and calculates the similarity between the identified problems so as to classify them and thereby generate similar problems. The method comprises the following steps: acquiring interactive corpus text data; labeling part of the interactive corpus text data to form problem text corpus data; training an original machine learning model with the preprocessed problem text corpus data to generate a problem recognition model; performing problem recognition on the unlabeled interactive corpus text data with the problem recognition model to form a problem corpus; and calculating the similarity of the problem corpus through a text similarity algorithm to generate and output each class of similar problem corpus. The invention also discloses a similar problem generating device, an electronic device and a computer storage medium.

Description

Similar problem generation method, device, equipment and medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a medium for generating a similar problem.
Background
The current intelligent customer service robot technology mainly realizes intention recognition based on a machine learning algorithm. The machine learning algorithm needs to provide a large amount of similar corpus with known categories for training, so that a large amount of labeling data needs to be prepared in the initial stage of the construction of the intelligent customer service robot. In the existing intelligent customer service robot construction process, a large amount of labeling data is generally obtained through a manual labeling mode to serve as similar corpus, and the problems of long period, high labeling cost and the like exist. Therefore, how to automatically acquire related similar corpus data at the initial stage of intelligent customer service robot construction is particularly critical.
In the prior art, Chinese patent CN109033390A discloses a method and apparatus for automatically generating similar questions, the method comprising: obtaining an initial question; generating an expanded question according to the initial question; and judging whether the expanded question is a similar question, marking it according to the judging result. By using a computer to generate similar questions automatically, the method saves the labor consumed by manually marking similar questions and can reduce the cost of robot customer service. However, the method relies on a deep learning model in practice, so a large amount of labeled training corpus is needed to train the deep learning model, and such training data is difficult to acquire; moreover, the sentences output by the model are limited to the range of the training corpus and may suffer from problems such as semantic confusion and disfluency.
In another Chinese patent, CN109063004A, a method and apparatus for automatically generating FAQ similar questions is disclosed, the method comprising: generating a text based on a selected FAQ; judging whether the generated text is similar to the selected FAQ; and, if so, taking the text as a similar question of the selected FAQ. By automatically generating FAQ similar questions, the patent reduces the input cost of manual marking, and corresponding similar questions can be quickly constructed for newly added FAQs; in addition, text generation combines natural language processing with sentence generation rules, which can more effectively improve the quality of the generated similar questions. However, the synonym resources the patent depends on are difficult to maintain accurately, and the initial question generation rules are difficult to construct; the method is therefore hard to maintain, has low applicability and cannot be widely applied.
Therefore, in order to achieve automatic acquisition of similar corpus data, it is highly desirable to propose a method for generating similar problems that overcomes the above-mentioned drawbacks of the prior art and is generally applicable.
Disclosure of Invention
In order to overcome the defects of the prior art, one of the purposes of the invention is to provide a similar problem generating method. The method labels part of the interactive corpus and uses the labeled interactive corpus as training corpus to train a problem recognition model; problem sentences in the interactive corpus are then recognized by the problem recognition model, and similar problems are generated through a text similarity algorithm. Similar corpus data are thus acquired automatically, a large amount of labeled training corpus is avoided, and the generated similar problems are not limited by the range of the training corpus.
One of the purposes of the invention is realized by adopting the following technical scheme:
a similar problem generating method comprising the steps of:
acquiring interactive corpus text data;
extracting a plurality of interactive corpus text data for problem labeling to form problem text corpus data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
and outputting corpus of similar problems of each class.
Further, obtaining the interactive corpus text data specifically includes:
acquiring interaction corpus data and judging the data type of the interaction corpus;
if the data type of the interactive corpus data is text data, the interactive corpus data is used as the interactive corpus text data;
if the data type of the interactive corpus data is voice data, voice translation and text correction are carried out on the interactive corpus data, and the interactive corpus data after voice translation and text correction is used as the interactive corpus text data.
Further, the extracted interactive corpus text data are labeled with problems in combination with an industry customer service operation database.
Further, preprocessing the question text corpus data and the unlabeled interactive corpus text data specifically includes:
and respectively carrying out word segmentation and stop word filtering on the question text corpus data and the unlabeled interactive text corpus data.
Further, the original machine learning model is a supervised machine learning model, and the problem recognition model is used for performing problem recognition on the unlabeled interactive corpus text data to obtain a problem corpus, and the method further comprises the following steps:
labeling the problem corpus, and iteratively training the problem identification model by using the labeled problem corpus.
Further, the text similarity algorithm is a cosine similarity algorithm, the similarity of the problem corpus is calculated through the cosine similarity algorithm, the problem corpus is classified according to a preset threshold value and a similarity calculation result, and various similar problem corpuses are generated, and the method specifically comprises the following steps:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the question sentence vector set to be used as the center vector of a question category;
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
Further, outputting corpus of similar problems of each category, which specifically comprises the following steps:
extracting keywords from the similar problem corpora of each class through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class;
the keyword algorithm is a TextRank algorithm, and the corpus of each category of similar problems is integrated according to the keyword extraction result, and specifically comprises the following steps:
merging similar problem corpus with the same meaning of the keywords;
and/or,
splitting or deleting similar problem corpus with different keyword meanings;
and/or,
splitting or deleting similar problem corpora with different keyword meanings, extracting similar problems in the split or deleted similar problem corpora, and adding the similar problems to all classes of similar problem corpora with the same keyword meanings.
The second objective of the present invention is to provide a similar problem generating device, which uses part of the interaction data as training corpus to train a problem recognition model, recognizes problem sentences in the interaction corpus through the problem recognition model, and finally calculates problem similarity through a text similarity algorithm to generate similar problems.
The second purpose of the invention is realized by adopting the following technical scheme:
A similar problem generating device, comprising:
the data acquisition module, used for acquiring interactive corpus text data;
the labeling module is used for extracting a plurality of interactive corpus text data for question labeling to form question text corpus data;
the data preprocessing module is used for preprocessing the question text corpus data and the unlabeled interactive corpus text data;
the model training module is used for training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
the problem recognition module is used for carrying out problem recognition on the unlabeled interactive corpus text data by utilizing the problem recognition model to obtain a problem corpus;
the classification module is used for calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus data according to a preset threshold value and a similarity calculation result, and generating various similar problem corpuses;
and the result output module is used for outputting corpus of similar problems of each class.
It is a further object of the present invention to provide an electronic device for performing one of the objects of the present invention, comprising a processor, a storage medium, and a computer program stored in the storage medium; when executed by the processor, the computer program implements the above-mentioned similar problem generating method.
It is a fourth object of the present invention to provide a computer-readable storage medium corresponding to one of the objects of the present invention, on which a computer program is stored; when executed by a processor, the computer program implements the above-described similar problem generating method.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, a small amount of interaction corpus is marked as the training sample, so that the problem identification model is trained, a large amount of marking work of the training sample is reduced, and time and labor are saved. According to the method, the problem sentences are extracted from the interactive corpus by using the problem recognition model, and the similarity of the problem sentences is calculated through a text similarity algorithm, so that similar problems are obtained, and similar corpus data are generated.
According to the method, similar corpus data are obtained from the interactive corpus, a near meaning word library and a related text generation rule are not required to be constructed, so that the generated similar problem is not limited by the scope of the training corpus, and the situation that sentences are not smooth is avoided.
Drawings
FIG. 1 is a flow chart of a similar problem generation method of the present invention;
FIG. 2 is a flowchart of a similar problem generating method of embodiment 2;
fig. 3 is a block diagram of the structure of a similar problem generating apparatus of embodiment 3;
fig. 4 is a block diagram of the electronic device of embodiment 4.
Detailed Description
The invention will now be described in more detail with reference to the accompanying drawings. It should be noted that the description below is given by way of illustration only and not by way of limitation. Various embodiments may be combined with one another to form further embodiments not described below.
Example 1
The embodiment provides a similar problem generating method. A machine learning model for problem identification is trained with interactive scene corpus data; the trained model identifies problem sentences in the interactive scene corpus, and similar problem sentences are then classified by text similarity calculation, thereby generating similar problems and obtaining similar corpus data.
According to the above principle, a method for generating similar problems is described, as shown in fig. 1, and specifically includes the following steps:
acquiring interactive corpus text data;
extracting a plurality of interactive corpus text data for problem labeling to form problem text corpus data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
and outputting corpus of similar problems of each class.
The interactive corpus text data is divided into two parts, one part (namely the extracted interactive corpus text data) is used for model training after problem marking, and the other part (namely the unlabeled interactive corpus text data) is used for subsequent problem recognition, so that a problem recognition model is trained by marking a small amount of interactive corpus text data.
Preferably, the obtaining the interactive corpus text data specifically includes:
acquiring interaction corpus data and judging the data type of the interaction corpus;
if the data type of the interactive corpus data is text data, the interactive corpus data is used as the interactive corpus text data;
if the data type of the interactive corpus data is voice data, voice translation and text correction are carried out on the interactive corpus data, and the interactive corpus data after voice translation and text correction is used as the interactive corpus text data.
The interactive corpus data are real interactive scene corpus data from the industry; since the similar problem generating method extracts similar problems directly from these data, the problem of generated similar problems containing unsmooth sentences can be avoided.
Preferably, the extracted plurality of interactive corpus text data are subjected to question labeling in combination with an industry customer service operation database.
The industry customer service operation database is a pre-constructed database that stores customer service operation data. If interaction corpus data obtained in this embodiment has no corresponding customer service operation data in the industry customer service operation database, the interaction corpus data is judged for problem statements, and the customer service operations associated with the judged problem statements are stored in the industry customer service operation database, completing an update of the database.
Preferably, preprocessing the question text corpus data and the unlabeled interactive corpus text data specifically includes:
and respectively carrying out word segmentation and stop word filtering on the question text corpus data and the unlabeled interactive text corpus data.
In another embodiment of the present invention, the question text corpus data and the unlabeled interactive text corpus data are subjected to word segmentation, stop-word filtering, and synonym replacement, respectively. It should be noted that the purpose of performing synonym replacement here is to maintain a library of related synonyms, which may not be performed if maintenance is not required.
The preprocessed problem text corpus data and the unlabeled interaction corpus text data are respectively used for subsequent machine learning model training and problem recognition.
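The preprocessing step above can be sketched as follows. This is a minimal illustration: real Chinese interactive corpus text would normally be segmented with a dedicated tool such as jieba, and the stop-word list would be far larger; the whitespace tokenizer and tiny stop-word set here are stand-ins so the example is self-contained.

```python
# Minimal preprocessing sketch: word segmentation plus stop-word filtering.
# The stop-word list and tokenizer are illustrative stand-ins, not the ones
# used by the patent.

STOP_WORDS = {"the", "a", "is", "of", "to", "please", "how"}

def preprocess(sentence: str) -> list:
    """Tokenize a sentence and drop stop words."""
    tokens = sentence.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

corpus = [
    "How to reset the password please",
    "Please tell me how to reset a password",
]
print([preprocess(s) for s in corpus])
```

The two resulting token lists then feed either model training (if labeled) or problem recognition (if unlabeled).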
Preferably, the original machine learning model is a supervised machine learning model including, but not limited to, logistic regression models, support vector machines, CNNs, and LSTM; training the original machine learning model by using the problem text corpus data to generate a problem recognition model, and performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain a problem corpus, wherein the method further comprises the following steps:
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
labeling the problem corpus, and iteratively training the problem identification model by using the labeled problem corpus so as to adjust the predicted value of the problem identification model.
In the supervised machine learning model, the model is first trained iteratively on training data labeled in advance. The iteratively trained problem recognition model then performs problem recognition (prediction) on the unlabeled interactive corpus text data to obtain a problem corpus (the prediction result); the problem corpus is labeled, and the labeled problem corpus is used to continue the iterative training and optimize the weight parameters in the model. When the training stop condition is met (for example, the upper limit of the number of iterations is reached, or the gap between predicted and true values reaches the target value), training is complete: the model has converged to fixed weight parameters, the predicted values of the problem recognition model have been adjusted, and the gap between the model's predicted values and the true values has been reduced. Finally, the fully trained model is used for prediction.
In this embodiment, a CNN model is selected as the original machine learning model and trained on the problem text corpus data to generate a problem recognition model, which performs problem recognition on the unlabeled interaction corpus text data to obtain a problem corpus. The identified problem corpus is labeled again to adjust the predicted values of the model: the labeled problem corpus is used for iterative training of the problem recognition model, optimizing the weight parameters of the CNN model until they are fixed and improving the recognition rate of the problem recognition model. Finally, the trained problem recognition model identifies the problem sentences in the interactive corpus text data, yielding the problem corpus.
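The iterative labeling-and-retraining loop described above can be sketched roughly as follows. The keyword-cue "model" below is purely a hypothetical stand-in for the CNN problem recognition model, and all names are illustrative; in the patent's method, the confirmed predictions would be labeled and fed back as training data for another round of model training.

```python
# Sketch of one round of the iterative (self-training) loop. A trivial
# keyword scorer stands in for the CNN question-recognition model.

def is_question(sentence: str, cue_words: set) -> bool:
    """Stand-in 'model': predicts a question if any cue word appears."""
    return any(w in sentence.lower().split() for w in cue_words)

def self_training_round(unlabeled, cue_words):
    """One round: predict questions, 'label' them, extend the cue set.

    Here labeling is simulated by mining new cue words from predicted
    sentences that end in '?'; in the patent, labeled predictions would
    retrain the model's weight parameters instead.
    """
    predicted = [s for s in unlabeled if is_question(s, cue_words)]
    for s in predicted:
        if s.endswith("?"):
            cue_words.update(s.lower().rstrip("?").split())
    return predicted, cue_words

unlabeled = ["how do i reset my password?", "thanks a lot", "what is the fee?"]
preds, cues = self_training_round(unlabeled, {"how", "what"})
print(preds)
```

Each round enlarges the labeled pool, which is the mechanism by which a small initial labeling effort bootstraps the recognizer.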
Preferably, the text similarity algorithm is a cosine similarity algorithm, the similarity of the problem corpus is calculated through the cosine similarity algorithm, the problem corpus is classified according to a preset threshold and a similarity calculation result, and various similar problem corpuses are generated, and the method specifically comprises the following steps:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the vectorized question sentence set to be used as a center vector of a question category (assumed to be a first question category);
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
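The classification steps above can be sketched as the following single-pass clustering, assuming the question sentence vectors have already been computed. One simplification relative to the text: the sketch seeds the first category with the first vector rather than a randomly extracted one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_questions(vectors, threshold):
    """Single-pass clustering as described: compare each question vector to
    every category center vector; join the best-matching category if the
    maximum similarity exceeds the threshold (updating that center by
    averaging its members), otherwise open a new category with this vector
    as its center."""
    centers, clusters = [], []  # clusters[i] holds member vectors of category i
    for v in vectors:
        if centers:
            sims = [cosine(v, c) for c in centers]
            best = max(range(len(sims)), key=lambda i: sims[i])
            if sims[best] > threshold:
                clusters[best].append(list(v))
                members = clusters[best]
                centers[best] = [sum(col) / len(members) for col in zip(*members)]
                continue
        centers.append(list(v))
        clusters.append([list(v)])
    return clusters

vecs = [[1, 0], [0.9, 0.1], [0, 1]]
groups = cluster_questions(vecs, threshold=0.8)
print(len(groups))
```

The first two vectors are nearly parallel and merge into one category; the third is nearly orthogonal and opens a second category.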
The above-mentioned text similarity algorithm measures the distance between two vectors, and any algorithm capable of calculating the distance between vectors is suitable for calculating text similarity; in other embodiments of the present invention, the text similarity algorithm may therefore also use the Euclidean distance, Manhattan distance, Minkowski distance, Chebyshev distance, or the like.
The present embodiment uses TF-IDF algorithm, N-Gram model and word vector to vectorize the problem corpus.
The TF-IDF algorithm is a numerical statistic used to reflect how characteristic a word is of a document: if TF is high and DF is low, the word occurs frequently in one sentence and rarely in other sentences, so the word has good class-distinguishing capability.
When TF-IDF is used in this embodiment, the calculation formula is TF-IDF = TF × IDF, where TF(t, di) = n(t, di) / N(di), with n(t, di) the number of occurrences of the word t in sentence di and N(di) the total number of words in sentence di; and IDF(t) = log(|D| / |{d : t ∈ d}|), with |D| the total number of sentences in the problem corpus and |{d : t ∈ d}| the number of sentences containing the word t.
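A minimal sketch of the TF-IDF calculation as reconstructed above, operating on pre-tokenized sentences. Note that practical implementations often add smoothing to the IDF term to avoid division by zero; none is used here, and the tiny corpus is purely illustrative.

```python
import math

def tf_idf(word, sentence, corpus):
    """TF-IDF per the formula above. TF = occurrences of `word` in
    `sentence` divided by the sentence length; IDF = log(total number of
    sentences / number of sentences containing `word`). Sentences are
    pre-tokenized lists of words."""
    tf = sentence.count(word) / len(sentence)
    containing = sum(1 for s in corpus if word in s)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf

corpus = [["reset", "password"], ["forgot", "password"], ["query", "balance"]]
score = tf_idf("reset", corpus[0], corpus)
print(score)
```

"reset" appears in only one of the three sentences, so it scores higher than "password", which appears in two; this is exactly the class-distinguishing behavior described above.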
Based on the TF-IDF calculation result, the N-Gram model forms groups from each problem sentence in the problem corpus through a window of size N; the formed groups are then counted, groups with low occurrence frequency are filtered out, and the remaining groups compose the feature space used for problem sentence segmentation. Depending on the specific situation, the N-Gram model can adopt a Bi-Gram (binary) or Tri-Gram (ternary) model; the specific processing is common knowledge in the field and is not described here.
The problem corpus after N-Gram word segmentation is vectorized through word vectors.
A word vector is a distributed representation of a word, such that distances exist between words and each dimension of the vector carries a characteristic meaning. In practice, word vectors are typically trained through a neural network (DNN) with a three-layer structure: an input layer, a hidden layer, and an output (softmax) layer; the two common variants are the CBOW and Skip-Gram models. This embodiment uses the CBOW model of the neural network DNN to complete the vectorization of the problem corpus. Specifically, each segmented word in the problem corpus is taken in turn as the target word, and the word vectors of the words in its context are input into the CBOW model for training; training ends when the softmax probability of the target word in the training sample is maximized. The parameters of the CBOW model are then obtained through the back propagation algorithm of the neural network DNN, and with them the word vectors of all words in the problem corpus, completing the vectorization and forming the problem sentence vector set.
In other embodiments of the present invention, the Skip-Gram model may also be used to complete the vectorization of the problem corpus.
And calculating cosine similarity between the question sentence vector and the center vector of each question category by using a cosine similarity algorithm.
In this embodiment, the value of the preset threshold is determined according to the category set and the similarity evaluation index, and the specific determining process is as follows:
presetting a threshold initial value, acquiring a category set (a known problem corpus classification set), and then calculating a similarity evaluation index according to the category set;
the threshold initial value is adjusted for multiple times, and similarity evaluation indexes are recalculated according to the category sets respectively;
and finally selecting the threshold value corresponding to the maximum similarity of the category set as the value of the preset threshold value.
The similarity evaluation index mainly includes, but is not limited to, the silhouette coefficient, the Rand index, and the adjusted Rand index. The silhouette coefficient assumes the true categories are unknown; for a single sample, it is calculated as s = (b − a) / max(a, b), where a is the average distance from the sample to the other samples in its own class, and b is the average distance from the sample to the samples in the nearest other class.
The Rand index and the adjusted Rand index assume a known class set, so they are used as the similarity evaluation indices in the determination of the preset threshold.
In this embodiment, the calculation of the Rand index and the adjusted Rand index is described.
Assuming the true classes C and the clustering result K are known, the Rand index is calculated as RI = (a + b) / C(n, 2), where a is the number of element pairs assigned to the same class in both C and K, b is the number of element pairs assigned to different classes in both C and K, C(n, 2) is the total number of element pairs that can be formed from the set, and n is the total number of elements in the set. RI lies in the range [0, 1]; the larger the RI value, the more consistent the clustering result is with the real situation.
Since the Rand index RI of a random clustering result is not guaranteed to be near zero, the adjusted Rand index is used so that the similarity evaluation index is near zero when the clustering result is randomly generated. Its calculation formula is ARI = (RI − E[RI]) / (max(RI) − E[RI]), where E[RI] is the expected Rand index of a random clustering. ARI lies in the range [−1, 1]; the larger the ARI value, the more consistent the clustering result is with the real situation. Viewed more generally, the adjusted Rand index ARI measures the degree of coincidence of two data distributions.
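The Rand index and adjusted Rand index used for threshold selection can be computed from pair counts as sketched below. This is a pure-Python illustration of the two formulas above; libraries such as scikit-learn provide `adjusted_rand_score` for the same purpose.

```python
from math import comb
from collections import Counter

def rand_indices(labels_true, labels_pred):
    """Rand index and adjusted Rand index via pair counting.

    `a` counts pairs placed in the same class by both labelings, `b` pairs
    placed in different classes by both; RI = (a + b) / C(n, 2). ARI then
    corrects RI for chance using the contingency-table formulation.
    """
    n = len(labels_true)
    pairs = comb(n, 2)
    cont = Counter(zip(labels_true, labels_pred))        # contingency cells
    sum_ij = sum(comb(c, 2) for c in cont.values())      # same-class pairs in both
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    same_both = sum_ij
    diff_both = pairs - sum_a - sum_b + sum_ij
    ri = (same_both + diff_both) / pairs
    expected = sum_a * sum_b / pairs                     # E[index] under chance
    max_index = (sum_a + sum_b) / 2
    ari = (sum_ij - expected) / (max_index - expected) if max_index != expected else 1.0
    return ri, ari

ri, ari = rand_indices([0, 0, 1, 1], [1, 1, 0, 0])
print(ri, ari)  # identical partitions up to relabeling -> 1.0, 1.0
```

Sweeping the threshold and keeping the value that maximizes ARI against the known category set implements the selection procedure described above.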
Example 2
The difference between the present embodiment and the foregoing embodiment is that, as shown in fig. 2, corpus of similar questions of each category is output, and specifically includes the following steps:
extracting keywords from the similar problem corpora of each class through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class;
the keyword algorithm is a TextRank algorithm, and the corpus of each category of similar problems is integrated according to the keyword extraction result, and specifically comprises the following steps:
merging similar problem corpus with the same meaning of the keywords;
and/or,
splitting or deleting similar problem corpus with different keyword meanings;
and/or,
splitting or deleting similar problem corpora with different keyword meanings, extracting similar problems in the split or deleted similar problem corpora, and adding the similar problems to all classes of similar problem corpora with the same keyword meanings.
The TextRank algorithm, based on the PageRank principle, generates keywords for a text. It treats the grammatical units in the text as nodes in a graph; if a certain grammatical relationship exists between two units, they are connected by an edge. After a number of iterations, the nodes end up with different weights, and the grammatical units with high weights can be taken as keywords.
Therefore, in this embodiment, the keywords of the corpus of similar problems of each category can be extracted by the TextRank algorithm.
Extracting keywords helps comb and integrate the corpus of similar problems of each category. In other embodiments, other methods for extracting the main information of a text, such as topic models, can be adopted. In practical application, the corpus of similar problems of each category can also be manually combed and integrated in combination with the keyword extraction result according to the specific situation.
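The graph construction and iteration described above can be sketched as follows. The tokenizer and the stop-word list here are illustrative; the patent targets Chinese customer-service text, where a segmenter such as jieba would be used instead.

```python
from collections import defaultdict

def textrank_keywords(text, window=2, damping=0.85, iters=30, top_k=3):
    words = [w.strip(".,?!").lower() for w in text.split()]
    stopwords = {"the", "a", "of", "to", "is", "how", "do", "i", "my"}
    words = [w for w in words if w and w not in stopwords]

    # Build an undirected co-occurrence graph over a sliding window:
    # two distinct words within the window are connected by an edge.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    # PageRank-style iteration: each node's weight is fed by its neighbors.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]
```

On a toy customer-service snippet, a highly connected term such as "refund" surfaces among the top-weighted keywords.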
Example 3
The present embodiment discloses a similar problem generating apparatus corresponding to the similar problem generating methods of embodiments 1 and 2, which is a virtual structure apparatus, as shown in fig. 3, including:
a data acquisition module 310, configured to acquire interactive corpus text data;
the labeling module 320 is configured to extract a plurality of the interactive corpus text data for question labeling, so as to form question text corpus data;
a data preprocessing module 330, configured to preprocess the question text corpus data and the unlabeled interactive corpus text data;
the model training module 340 is configured to train an original machine learning model by using the preprocessed problem text corpus data, and generate a problem recognition model;
the problem recognition module 350 is configured to perform problem recognition on the unlabeled interactive corpus text data by using the problem recognition model, so as to obtain a problem corpus;
the classification module 360 is configured to calculate a similarity of the problem corpus through a text similarity algorithm, and classify the problem corpus data according to a preset threshold and a similarity calculation result, so as to generate similar problem corpora of each class;
and the result output module 370 is used for outputting corpus of similar questions of each class.
The original machine learning model mentioned in the model training module 340 is a supervised machine learning model, where the supervised machine learning model includes, but is not limited to, logistic regression models, support vector machines, CNNs, and LSTMs; the original machine learning model is trained with the problem text corpus data to generate the problem recognition model, which is used to recognize problem sentences in the related corpus data.
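A toy sketch of this supervised recognition step: a bag-of-words logistic regression trained by gradient ascent on a handful of labeled sentences. The training data, feature names, and English text are illustrative only; the patent's stronger models (SVM, CNN, LSTM) and real labeled corpora would replace them.

```python
import math
from collections import defaultdict

def featurize(sentence):
    # Bag of lowercased words, plus a marker feature for a trailing "?".
    tokens = sentence.lower().rstrip(".?!").split()
    return tokens + (["<ends_q>"] if sentence.strip().endswith("?") else [])

def train(samples, epochs=200, lr=0.5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for sentence, label in samples:
            feats = featurize(sentence)
            z = sum(weights[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            for f in feats:
                weights[f] += lr * (label - p)  # gradient ascent on log-likelihood
    return weights

def is_question(weights, sentence):
    return sum(weights[f] for f in featurize(sentence)) > 0

# Illustrative labeled corpus: 1 = question sentence, 0 = not a question.
labeled = [
    ("How do I reset my password?", 1),
    ("What is the delivery fee?", 1),
    ("Thanks for your help.", 0),
    ("The package arrived today.", 0),
]
w = train(labeled)
```

After training, the model separates question sentences from statements on this tiny corpus, which is the role the problem recognition model plays over the unlabeled interactive corpus.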
The classification module 360 calculates the similarity of the problem corpus with a cosine similarity algorithm, and the value of the preset threshold is determined according to the set of categories and the similarity evaluation index, where the similarity evaluation index includes, but is not limited to, the silhouette coefficient, the Rand index, and the adjusted Rand index.
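The single-pass, threshold-based clustering this module performs (spelled out in claim 1) can be sketched as follows: each problem sentence vector joins the category whose center it is most cosine-similar to if that similarity exceeds the threshold, and otherwise seeds a new category; centers are updated by vector averaging. Sentence vectors here are plain lists; a real system would first embed the sentences.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def single_pass_cluster(vectors, threshold=0.8):
    centers, clusters = [], []  # parallel lists: category center, member vectors
    for v in vectors:
        sims = [cosine(v, c) for c in centers]
        best = max(range(len(sims)), key=lambda i: sims[i], default=None)
        if best is not None and sims[best] > threshold:
            clusters[best].append(v)
            members = clusters[best]  # recompute center as the member average
            centers[best] = [sum(dim) / len(members) for dim in zip(*members)]
        else:
            centers.append(v)  # below threshold: the vector seeds a new category
            clusters.append([v])
    return clusters
```

Two well-separated directions fall into two categories under the default threshold, illustrating how the preset threshold controls the number of categories generated.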
In the result output module 370, the specific steps of outputting the corpus of each class of similar questions include:
and extracting keywords from the similar problem corpora of each class through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class. Wherein the integration operation comprises merging, splitting, deleting and adding.
In this embodiment, the TextRank algorithm is used to extract keywords of corpus of similar questions, and of course, in other embodiments of the present invention, other models and algorithms for extracting text main information may be used to implement keyword extraction, such as topic model.
Example 4
Fig. 4 is a schematic structural diagram of an electronic device according to embodiment 4 of the present invention. As shown in fig. 4, the electronic device includes a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the device may be one or more; one processor 410 is taken as an example in fig. 4. The processor 410, memory 420, input device 430, and output device 440 in the electronic device may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 4.
The memory 420 is used as a computer readable storage medium, and may be used to store software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the method for generating a similar problem in the embodiment of the present invention (for example, the data acquisition module 310, the labeling module 320, the data preprocessing module 330, the model training module 340, the problem identification module 350, the classification module 360, and the result output module 370 in the similar problem generating apparatus). The processor 410 executes various functional applications of the electronic device and data processing, that is, implements the similar problem generating method of embodiment 1, by running software programs, instructions, and modules stored in the memory 420.
Memory 420 may include primarily a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for functionality; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 420 may further include memory remotely located relative to processor 410, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Input device 430 may be used to receive interaction corpus data, and the like. The output means 440 is used to output the generated similar questions.
Example 5
Embodiment 5 of the present invention also provides a storage medium containing computer-executable instructions for implementing a similar problem generating method when executed by a computer processor, the method comprising:
acquiring interactive corpus text data;
extracting a plurality of interactive corpus text data for problem labeling to form problem text corpus data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
and outputting corpus of similar problems of each class.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform the related operations in the similar problem generating method provided in any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software together with necessary general-purpose hardware, or by hardware alone, although in many cases the former is the preferred embodiment. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and which includes several instructions for causing an electronic device (which may be a mobile phone, a personal computer, a server, or a network device, etc.) to execute the methods of the embodiments of the present invention.
It should be noted that, in the embodiment of the method or the apparatus for generating a similar problem, each unit and module included are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding function can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
It will be apparent to those skilled in the art from this disclosure that various other changes and modifications can be made which are within the scope of the invention as defined in the appended claims.

Claims (9)

1. A similar problem generating method, characterized by comprising the steps of:
acquiring interactive corpus text data;
extracting a plurality of pieces of interactive corpus text data for problem labeling to form problem text corpus data, wherein the other part of the non-extracted data in the interactive corpus text data form non-labeled interactive corpus text data;
preprocessing the question text corpus data and the unlabeled interactive corpus text data;
training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain problem corpus;
calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various similar problem corpora;
outputting corpus of similar problems of each class;
the text similarity algorithm is a cosine similarity algorithm, the similarity of the problem corpus is calculated through the cosine similarity algorithm, the problem corpus is classified according to a preset threshold value and a similarity calculation result, and various similar problem corpuses are generated, and the method specifically comprises the following steps:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the question sentence vector set to be used as a center vector of a question category;
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
2. The method for generating similar problems according to claim 1, wherein obtaining the text data of the interaction corpus specifically comprises:
acquiring interaction corpus data and judging the data type of the interaction corpus data;
if the data type of the interactive corpus data is text data, the interactive corpus data is used as the interactive corpus text data;
if the data type of the interactive corpus data is voice data, voice translation and text correction are carried out on the interactive corpus data, and the interactive corpus data after voice translation and text correction is used as the interactive corpus text data.
3. The method for generating similar questions as claimed in claim 1, wherein the extracted plurality of interactive corpus text data are question marked in combination with an industry customer service database.
4. The method for generating similar problems according to claim 1, wherein preprocessing the problem text corpus data and the unlabeled interactive corpus text data specifically comprises:
and respectively carrying out word segmentation and stop word filtering on the question text corpus data and the unlabeled interactive text corpus data.
5. The method for generating similar problems according to any one of claims 1 to 4, wherein the original machine learning model is a supervised machine learning model, and the method further comprises, after performing problem recognition on the unlabeled interactive corpus text data by using the problem recognition model to obtain a problem corpus:
labeling the problem corpus, and iteratively training the problem identification model by using the labeled problem corpus.
6. The method for generating similar problems according to claim 5, wherein the step of outputting the corpus of similar problems of each category comprises the steps of:
extracting keywords from the similar problem corpora of each class through a keyword algorithm, integrating the similar problem corpora of each class according to the keyword extraction result, and outputting the similar problem corpora of each class;
the keyword algorithm is a TextRank algorithm, and the corpus of each category of similar problems is integrated according to the keyword extraction result, and specifically comprises the following steps:
merging similar problem corpus with the same meaning of the keywords;
and/or,
splitting or deleting similar problem corpus with different keyword meanings;
and/or,
splitting or deleting similar problem corpora with different keyword meanings, extracting similar problems in the split or deleted similar problem corpora, and adding the similar problems to all classes of similar problem corpora with the same keyword meanings.
7. A similar problem generating apparatus, comprising:
the data acquisition module is used for acquiring interactive corpus text data;
the labeling module is used for extracting a plurality of pieces of the interactive corpus text data for question labeling to form question text corpus data, and the other part of the non-extracted data in the interactive corpus text data forms unlabeled interactive corpus text data;
the data preprocessing module is used for preprocessing the question text corpus data and the unlabeled interactive corpus text data;
the model training module is used for training an original machine learning model by utilizing the preprocessed problem text corpus data to generate a problem recognition model;
the problem recognition module is used for carrying out problem recognition on the unlabeled interactive corpus text data by utilizing the problem recognition model to obtain a problem corpus;
the classification module is used for calculating the similarity of the problem corpus through a text similarity algorithm, classifying the problem corpus data according to a preset threshold value and a similarity calculation result, and generating various similar problem corpuses;
the result output module is used for outputting corpus of similar problems of each class;
the text similarity algorithm is a cosine similarity algorithm, and the classification module is specifically configured to: calculating the similarity of the problem corpus through the cosine similarity algorithm, classifying the problem corpus according to a preset threshold value and a similarity calculation result, and generating various classes of similar problem corpus, wherein the method specifically comprises the following steps of:
vectorizing the problem corpus to obtain a problem sentence vector set;
randomly extracting a question sentence vector from the question sentence vector set to be used as a center vector of a question category;
and traversing and calculating cosine similarity between the question sentence vector in the question sentence vector set and the center vector of each question category:
if the maximum value of the cosine similarity is larger than the preset threshold value, classifying the problem sentence vector into a problem category corresponding to the maximum value of the cosine similarity, and updating the center vector of the problem category by a vector average method;
and if the maximum value of the cosine similarity is smaller than or equal to the preset threshold value, adding a problem category, wherein the problem sentence vector is used as a center vector of the added problem category.
8. An electronic device comprising a processor, a storage medium and a computer program stored in the storage medium, characterized in that the computer program, when executed by the processor, implements the similar problem generating method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the similar problem generating method of any one of claims 1 to 6.
CN201911397007.0A 2019-12-30 2019-12-30 Similar problem generation method, device, equipment and medium Active CN111191442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911397007.0A CN111191442B (en) 2019-12-30 2019-12-30 Similar problem generation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911397007.0A CN111191442B (en) 2019-12-30 2019-12-30 Similar problem generation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN111191442A CN111191442A (en) 2020-05-22
CN111191442B true CN111191442B (en) 2024-02-02

Family

ID=70711076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911397007.0A Active CN111191442B (en) 2019-12-30 2019-12-30 Similar problem generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111191442B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611781B (en) * 2020-05-27 2023-08-18 北京妙医佳健康科技集团有限公司 Data labeling method, question answering device and electronic equipment
CN112949674A (en) * 2020-08-22 2021-06-11 上海昌投网络科技有限公司 Multi-model fused corpus generation method and device
CN112101423A (en) * 2020-08-22 2020-12-18 上海昌投网络科技有限公司 Multi-model fused FAQ matching method and device
CN112131876A (en) * 2020-09-04 2020-12-25 交通银行股份有限公司太平洋信用卡中心 Method and system for determining standard problem based on similarity
CN113312532B (en) * 2021-06-01 2022-10-21 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field
CN113434650B (en) * 2021-06-29 2023-11-14 平安科技(深圳)有限公司 Question-answer pair expansion method and device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108287822A (en) * 2018-01-23 2018-07-17 北京容联易通信息技术有限公司 A kind of Chinese Similar Problems generation System and method for
CN109933779A (en) * 2017-12-18 2019-06-25 苏宁云商集团股份有限公司 User's intension recognizing method and system
CN110032724A (en) * 2018-12-19 2019-07-19 阿里巴巴集团控股有限公司 The method and device that user is intended to for identification
US10417350B1 (en) * 2017-08-28 2019-09-17 Amazon Technologies, Inc. Artificial intelligence system for automated adaptation of text-based classification models for multiple languages
CN110619051A (en) * 2019-08-16 2019-12-27 科大讯飞(苏州)科技有限公司 Question and sentence classification method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10387430B2 (en) * 2015-02-26 2019-08-20 International Business Machines Corporation Geometry-directed active question selection for question answering systems
US10726061B2 (en) * 2017-11-17 2020-07-28 International Business Machines Corporation Identifying text for labeling utilizing topic modeling-based text clustering

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10417350B1 (en) * 2017-08-28 2019-09-17 Amazon Technologies, Inc. Artificial intelligence system for automated adaptation of text-based classification models for multiple languages
CN109933779A (en) * 2017-12-18 2019-06-25 苏宁云商集团股份有限公司 User's intension recognizing method and system
CN108287822A (en) * 2018-01-23 2018-07-17 北京容联易通信息技术有限公司 A kind of Chinese Similar Problems generation System and method for
CN110032724A (en) * 2018-12-19 2019-07-19 阿里巴巴集团控股有限公司 The method and device that user is intended to for identification
CN110619051A (en) * 2019-08-16 2019-12-27 科大讯飞(苏州)科技有限公司 Question and sentence classification method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Variational Reasoning for Question Answering with Knowledge Graph;Yuyu Zhang et al;Proceedings of the AAAI Conference on Artificial Intelligence;第32卷(第1期);6070-6076 *
Research on the Application of Deep Learning Algorithms in Question Intent Classification; Yang Zhiming; Wang Laiqi; Wang Yong; Computer Engineering and Applications (No. 10); 159-165 *
Research on Similar Question Discrimination; Yin Qingyu et al; Intelligent Computer and Applications; Vol. 9 (No. 6); 41-44 *

Also Published As

Publication number Publication date
CN111191442A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN111191442B (en) Similar problem generation method, device, equipment and medium
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN108073568B (en) Keyword extraction method and device
CN109543178B (en) Method and system for constructing judicial text label system
CN107085581B (en) Short text classification method and device
WO2019149200A1 (en) Text classification method, computer device, and storage medium
WO2019153737A1 (en) Comment assessing method, device, equipment and storage medium
CN109165294B (en) Short text classification method based on Bayesian classification
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN110807084A (en) Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN106897439A (en) The emotion identification method of text, device, server and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN111158641B (en) Automatic recognition method for transaction function points based on semantic analysis and text mining
CN112464656A (en) Keyword extraction method and device, electronic equipment and storage medium
CN110309504B (en) Text processing method, device, equipment and storage medium based on word segmentation
CN112732871A (en) Multi-label classification method for acquiring client intention label by robot
CN111930933A (en) Detection case processing method and device based on artificial intelligence
CN113449084A (en) Relationship extraction method based on graph convolution
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN115168580A (en) Text classification method based on keyword extraction and attention mechanism
CN112380346A (en) Financial news emotion analysis method and device, computer equipment and storage medium
CN111563361A (en) Text label extraction method and device and storage medium
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 23011, Yuejiang Commercial Center, No. 857 Xincheng Road, Puyan Street, Hangzhou City, Zhejiang Province, 310000

Applicant after: Hangzhou Yuanchuan Xinye Technology Co.,Ltd.

Address before: Room 23011, Yuejiang Commercial Center, No. 857 Xincheng Road, Puyan Street, Hangzhou City, Zhejiang Province, 310000

Applicant before: Hangzhou Yuanchuan New Technology Co.,Ltd.

GR01 Patent grant