CN106599054B - Method and system for classifying and pushing questions - Google Patents


Info

Publication number
CN106599054B
CN106599054B (application CN201611009278.0A)
Authority
CN
China
Prior art keywords
classification
topic
question
preset
knowledge point
Prior art date
Legal status
Active
Application number
CN201611009278.0A
Other languages
Chinese (zh)
Other versions
CN106599054A (en
Inventor
刘德建
章亮
詹博悍
陈霖
吴拥民
陈宏展
Current Assignee
Fujian Tianquan Educational Technology Ltd
Original Assignee
Fujian Tianquan Educational Technology Ltd
Priority date
Filing date
Publication date
Application filed by Fujian Tianquan Educational Technology Ltd filed Critical Fujian Tianquan Educational Technology Ltd
Priority to CN201611009278.0A priority Critical patent/CN106599054B/en
Publication of CN106599054A publication Critical patent/CN106599054A/en
Application granted granted Critical
Publication of CN106599054B publication Critical patent/CN106599054B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification


Abstract

The invention relates to the field of classification, in particular to a method and a system for classifying and pushing questions. A first question is classified according to a preset knowledge point classification model to obtain a first classification set and a first association degree set, whose elements are the association degrees between the first question and each classification in the first classification set. The similarity between the first question and the questions contained in each classification of the first classification set is calculated to obtain a similarity set for each classification. A second association degree set is obtained from the similarity sets and the first association degree set; an approximate question set is obtained from the second association degree set; and the approximate question set is pushed. The accuracy of question classification and the relevance of the pushed approximate questions are thereby improved.

Description

Method and system for classifying and pushing questions
Technical Field
The invention relates to the field of classification, in particular to a method and a system for classifying and pushing questions.
Background
In the big data era, the amount of data produced each day has grown explosively. K12 education, one of the most important forms of education in China, produces a non-negligible amount of data every day. The scale of online education in China is growing at a rate of over 30% per year, and market estimates suggest it will exceed 1600 billion dollars. K12 online education resources have therefore become contested ground for enterprises. If the ever-growing question data can be analyzed, utilized, and reasonably classified into the corresponding knowledge points, then when students encounter difficult problems or weak areas, questions with a high degree of association with those knowledge points can be pushed to them for in-depth practice, improving the user experience of the application.
Patent document No. 201510246727.2 provides a topic recommendation method: receive a search topic; acquire the attribute information of the search topic and obtain a preliminary search result from that attribute information; acquire the user description information of a user and rank the preliminary search results according to it; and select a preset number of results from the ranked results as the recommended topics. The relevance between the recommended topics and the search topic is improved, and the recommendation effect is improved accordingly.
However, the above patent document ranks the preliminary search results according to user description information, so the accuracy of the classification result depends on the accuracy of that user description information.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a system for classifying and pushing questions, so as to improve the accuracy of question classification and the relevance of the pushed questions.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention provides a method for classifying and pushing questions, which comprises the following steps:
s1, classifying the first question according to the preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
s2, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
s3, obtaining a second association set according to the similarity set and the first association set;
s4, obtaining an approximate question set according to the second relevance set;
and S5, pushing the approximate question set.
The invention also provides a system for classifying and pushing questions, which comprises the following modules:
a classification module, used for classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set; the elements in the first association degree set are the association degrees between the first question and each classification in the first classification set;
the calculation module is used for calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
the first processing module is used for obtaining a second association set according to the similarity set and the first association set;
the second processing module is used for obtaining an approximate question set according to the second relevance set;
and the pushing module is used for pushing the similar topic set.
The invention has the beneficial effects that: in contrast to the prior art, which pushes related approximate questions directly from the classification result of a classification model, the invention performs a similarity analysis between the first question and the questions within the knowledge point classifications obtained from the knowledge point classification model, recalculates the association degree between the first question and each knowledge point classification according to that similarity, and then extracts questions highly similar to the first question from the highly associated knowledge point classifications to push to the user as approximate questions. This improves the relevance between the pushed approximate questions and the first question.
Drawings
FIG. 1 is a block diagram of a method for topic classification and push according to the present invention;
FIG. 2 is a block diagram of a topic classification and push system according to the present invention;
description of reference numerals:
1. a classification module; 2. a calculation module; 3. a first processing module; 4. a second processing module; 5. and a pushing module.
Detailed Description
In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.
The most key concept of the invention is as follows: similarity analysis is carried out on the first question and the questions in the knowledge point classification obtained according to the knowledge point classification model, and the association degree of the first question and each knowledge point classification is recalculated, so that the relevance of the pushed approximate question and the first question can be improved.
As shown in FIG. 1, the present invention provides a method for topic classification and push, comprising:
s1, classifying the first question according to the preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
s2, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
s3, obtaining a second association set according to the similarity set and the first association set;
s4, obtaining an approximate question set according to the second relevance set;
and S5, pushing the approximate question set.
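The steps S1-S5 above can be sketched in simplified form. The helper names, the Jaccard similarity measure, and the S3 combination rule (association degree multiplied by maximum similarity) are illustrative assumptions; the patent leaves the exact formulas open.

```python
# Simplified sketch of steps S1-S5. The helper names, the Jaccard similarity,
# and the S3 combination rule (association degree x maximum similarity) are
# illustrative assumptions; the patent leaves the exact formulas open.

def jaccard(a, b):
    """Token-overlap similarity between two questions (stand-in for S2)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def push_similar(question, model, corpus_by_class, top_k=2, sim_threshold=0.5):
    first_assoc = model(question)                                    # S1
    similarity = {c: {t: jaccard(question, t) for t in corpus_by_class[c]}
                  for c in first_assoc}                              # S2
    second_assoc = {c: first_assoc[c] * max(similarity[c].values(), default=0.0)
                    for c in first_assoc}                            # S3
    top_classes = sorted(second_assoc, key=second_assoc.get, reverse=True)[:top_k]
    return [t for c in top_classes
            for t, s in similarity[c].items() if s > sim_threshold]  # S4/S5
```

In this sketch the "model" is any callable returning class-to-relevance mappings; the later sections of the patent describe how such models are trained.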
Further, the S1 specifically includes:
deploying different preset knowledge point classification models to each node in a preset classification cluster;
and sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set.
According to the description, the distributed cluster is beneficial to processing the approximate topic pushing task of the large-scale batch topics, and the pushing efficiency is improved.
Further, still include:
obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
According to the description, the knowledge point classification model is updated regularly according to the classification result, the classification accuracy of the classification model can be improved, and therefore the relevance of the pushing approximation question is improved.
Further, the sending of the first topic to each node in the preset classification cluster obtains the first classification set and the first association set, and specifically includes:
sending the first question to each node in the preset classification cluster to obtain a classification set and an association set corresponding to the node;
obtaining the weight value of the node according to a knowledge point classification model deployed on the node;
and obtaining the first classification set and the first association set according to the weight values of the nodes and the classification sets and association sets corresponding to the nodes.
As can be seen from the above description, a plurality of different classification models are respectively deployed on the nodes in the classification cluster, so that the classification results obtained by the nodes are different, the weight values of the nodes are determined according to the classification models deployed on the nodes, and the weight values and the corresponding classification results are comprehensively analyzed to obtain the knowledge point classification with a high association degree with the first topic. The weighted value of each node is adjusted according to the actual application scene, and the pushing of the approximate question which best meets the user expectation according to different requirements of the user is facilitated.
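The weighted combination of per-node results described above can be sketched as follows. The weighted-sum aggregation rule and all names are assumptions, since the patent does not fix the exact formula.

```python
# Hedged sketch: combining per-node classification results with node weights.
# A normalized weighted sum over the relevance each node assigns to each class
# is one natural reading of the patent's description.

def aggregate(node_results, node_weights):
    """node_results: {node: {class: relevance}}; node_weights: {node: weight}."""
    combined = {}
    total = sum(node_weights.values())
    for node, classes in node_results.items():
        w = node_weights[node] / total
        for cls, rel in classes.items():
            combined[cls] = combined.get(cls, 0.0) + w * rel
    return combined
```

Adjusting a node's weight directly shifts how much its classification model influences the final first classification set, matching the "adjust per application scene" remark above.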
Further, the S1 specifically includes:
converting symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector;
and obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model.
As can be seen from the above description, questions from different sources may be described in different ways; in particular, different formula editors describe the symbols in a formula quite differently. Converting the symbols in a formula through preset escape characters normalizes symbols that are written differently but mean the same thing, so that the information in the question is used accurately and fully. This improves the precision of question classification, the relevance of the pushed questions, and the efficiency of obtaining approximate questions.
For example, topic 1 awaiting a push of approximate questions is: "To make the function y = √(5-x) meaningful, what are the elements of the set consisting of the positive integer values of x?" Topic 2 awaiting a push is: "To make the function y = (5-x)^(1/2) meaningful, what are the elements of the set consisting of the positive integer values of x?" Topics 1 and 2 are in fact essentially the same, but existing methods cannot fully utilize the formula information in the topics: they can only push general questions on finding the value range that makes a function meaningful, and cannot push, more specifically, questions on finding the value range that makes a function containing a root sign meaningful.
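A minimal sketch of this symbol normalization, assuming a small escape table modelled on the patent's own examples ("√" to "\sqrt", full-width "═" to "="); a real system would need a far larger mapping, and the power-to-root rewrite is an assumed convention.

```python
import re

# Illustrative symbol normalization via preset escape characters.
# ESCAPES mirrors the patent's examples; real deployments need a larger table.
ESCAPES = {"√": r"\sqrt", "═": "=", "－": "-"}

def normalize(question):
    for sym, esc in ESCAPES.items():
        question = question.replace(sym, esc)
    # Rewrite "(...)^(1/2)"-style powers into \sqrt form (assumed convention),
    # so differently-edited formulas collapse to one representation.
    question = re.sub(r"\(([^()]+)\)\s*\^?\(?1/2\)?", r"\\sqrt(\1)", question)
    return question
```

With this, the two example topics above would both contain `\sqrt(5-x)` and could be matched against each other.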
Further, according to the preset knowledge point classification model, a first classification set and a first association set corresponding to the feature vector are obtained, specifically:
deploying a knowledge point classification model based on word frequency to nodes in a preset classification cluster;
deploying a semantic-based knowledge point classification model to nodes in a preset classification cluster;
and sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set.
According to the description, the knowledge point classification related to the first topic obtained by the classification cluster comprises a classification result obtained from two dimensions of word frequency and semantics, and the word frequency and the semantics in the topic are comprehensively considered, so that the classification accuracy can be improved, and the relevance of the pushed approximate topic and the first topic is improved.
Further, extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector, and specifically comprises the following steps:
analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
deleting a stop word from the third topic to obtain a fourth topic;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
and constructing a semantic vector corresponding to the fourth topic according to the semantic feature extraction model.
As can be seen from the above description, existing word segmentation algorithms delete the non-Chinese characters in a question and segment only the Chinese characters. The Chinese and non-Chinese characters of the question are therefore placed in separate stacks: word segmentation is performed on the Chinese character stack, while regular expressions match the corresponding formulas in the non-Chinese character stack, separating out the recognizable parts of the formulas as far as possible. The question can thus be segmented while the information in it is retained, and its feature vector can be extracted. In addition, storing the Chinese and non-Chinese characters in stacks keeps the order of the characters unchanged, so the original meaning of the question is not altered during segmentation. Moreover, deleting stop words, i.e., low-content words such as "the", "it", "on", "be", and "inside", allows the feature vector of the question to be extracted more accurately, ignores irrelevant information, and reduces the redundancy of the feature vector.
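The stack-based split can be sketched as follows. jieba is replaced here by per-character tokens, and the formula regular expression is an assumed pattern, purely for illustration.

```python
import re

# Simplified sketch of the stack-based split: Chinese characters go to one
# buffer, everything else (formula symbols) to another, preserving order within
# each. Real segmentation would use jieba; here each Chinese character is its
# own token, purely for illustration.

def split_stacks(question):
    chinese, other = [], []
    for ch in question:
        (chinese if "\u4e00" <= ch <= "\u9fff" else other).append(ch)
    return chinese, other

# Assumed pattern for recognizable formula fragments in the non-Chinese stack.
FORMULA_RE = re.compile(r"[A-Za-z0-9+\-*/^=()\\ ]+")

def tokenize(question):
    chinese, other = split_stacks(question)
    parts = FORMULA_RE.findall("".join(other))
    return chinese + [p.strip() for p in parts if p.strip()]
```

Because each stack preserves the original character order, the formula fragment comes out intact rather than being discarded by the segmenter.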
Further, deleting a stop word from the third topic to obtain a fourth topic, specifically:
calculating the weight of each word in the third topic;
sorting the words in the third topic according to the weight to form a first queue;
and deleting words corresponding to a preset number element in front of the first queue from the third topic to obtain a fourth topic.
As can be seen from the above description, the specific stop words differ between disciplines and school years, whereas the conventional way to obtain stop words, looking them up in a fixed stop word table, offers little flexibility or specificity. Deriving the stop words from the computed word weights adapts them to the discipline at hand.
For example, "acceleration" is a word that appears frequently in physics and is important for understanding a physics question; in biology, however, it may not appear even once in a thousand questions. If "acceleration" does occur in a biology question, it can therefore be treated as a stop word rather than an important word for that discipline, and can be deleted.
Term frequency (TF) refers to the number of times a given word appears in a document. This count is usually normalized (the numerator is generally smaller than the denominator, in contrast to IDF) to prevent a bias towards long documents. The calculation formula is:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of word t_i in document d_j, and the denominator Σ_k n_{k,j} is the total number of occurrences of all words in document d_j.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and taking the logarithm of the resulting quotient:

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| is the total number of documents in the corpus and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i. If the word does not appear in the corpus, the denominator would be 0, so 1 + |{j : t_i ∈ d_j}| is typically used instead. The final TF-IDF formula is obtained as:
tf-idf_{i,j} = tf_{i,j} × idf_i
a high word frequency within a particular document, and a low document frequency for that word across the document collection, may result in a high-weighted TF-IDF. Therefore, TF-IDF tends to filter out common words, preserving important words.
Further, according to the second relevance set, an approximate question set is obtained, specifically:
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
and obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold value in the second classification set to obtain an approximate topic set.
As can be seen from the above description, selecting the topics most similar to the first topic from the knowledge point classifications most associated with the first topic to form the approximate topic set improves the relevance between the pushed approximate topic set and the first topic.
As shown in FIG. 2, the present invention further provides a system for topic classification and pushing, comprising:
the classification module 1 classifies the first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
a calculating module 2, configured to calculate similarity between the first topic and topics included in each classification in the first classification set, so as to obtain a similarity set corresponding to each classification in the first classification set;
the first processing module 3 is configured to obtain a second association set according to the similarity set and the first association set;
the second processing module 4 is used for obtaining an approximate question set according to the second relevance set;
and the pushing module 5 is used for pushing the approximate topic set.
According to the above description, the system for classifying and pushing questions can improve the precision of question classification, thereby further improving the relevance between the pushed approximate questions and the first question.
The embodiment of the invention is as follows:
s1, respectively deploying a knowledge point classification model based on word frequency and a knowledge point classification model based on semantics on nodes of a preset classification cluster;
the knowledge point classification model based on the word frequency specifically comprises the following steps:
(1) input a new question;
(2) convert the new question into LaTeX format;
(3) perform word segmentation on the text, and delete the corresponding stop words according to the stop words obtained during training;
(4) construct a word frequency vector for the new question;
(5) input the word frequency vector into the pre-trained word-frequency-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The process of training the knowledge point classification model based on the word frequency specifically comprises the following steps:
(1) input the training questions;
(2) convert the training questions into LaTeX format;
(3) perform word segmentation on the text;
(4) calculate the weight of each word with a stop word algorithm (TF-IDF), obtain the stop words according to a set threshold, and delete the stop words from the training questions;
(5) convert each training question into a word frequency vector;
(6) set the corresponding parameters of the classification algorithm;
(7) input the word frequency vectors into the classification algorithm for training, obtaining the word-frequency-based knowledge point classification model.
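Steps (5)-(7) can be sketched with a stand-in classifier. The patent names no specific algorithm for the word-frequency model, so a nearest-centroid classifier is used here purely for illustration; all names are assumptions.

```python
# Illustrative stand-in for steps (5)-(7): train a classifier on
# word-frequency vectors. A nearest-centroid model substitutes for whatever
# algorithm a real deployment would choose.

def train_centroids(vectors, labels):
    """Average the vectors of each knowledge-point label into a centroid."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def predict(centroids, vec):
    """Assign a new question's vector to the nearest centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(centroids[lab], vec))
```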
The semantic-based knowledge point classification model specifically comprises the following steps:
(1) input a new question;
(2) convert the new question into LaTeX format;
(3) perform word segmentation on the text, and delete the corresponding stop words according to the stop words obtained during training;
(4) input the new question into a pre-trained semantic feature extraction model to obtain the corresponding semantic vector;
(5) input the semantic vector into the pre-trained semantics-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The process of training the knowledge point classification model based on the semantics specifically comprises the following steps:
(1) input the training questions;
(2) convert the training questions into LaTeX format;
(3) perform word segmentation on the text;
(4) input the segmented training questions into a semantic feature extraction model (such as a word2vec model) and, with the set model parameters, obtain a semantic feature extraction model fitted to the training questions;
(5) input each training question into the semantic feature extraction model to obtain a semantic vector for each training question;
(6) set the corresponding classification algorithm (such as random forest or the xgboost algorithm);
(7) input the semantic vectors into the classification algorithm for training, obtaining the semantics-based knowledge point classification model.
S2, sending the first topic to each node in the preset classification cluster; each node classifies the first topic, and specifically comprises:
s21, converting the symbols in the first question according to preset escape characters to obtain a second question;
Here, the escape character for the symbol "√" is "\sqrt", the escape character for the full-width "═" is the equals sign entered in English input mode, and the escape character for the full-width "－" is the minus sign entered in English input mode. The second topic obtained after escape-character conversion is: "To make the function y = \sqrt(5-x) meaningful, what are the elements of the set consisting of the positive integer values of x?"
S22, analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
calculating the weight of each word in the third topic;
sorting the words in the third topic according to the weight to form a first queue;
deleting words corresponding to a preset number element in front of the first queue from the third title to obtain a fourth title;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
constructing a semantic vector corresponding to the fourth question according to the semantic feature extraction model;
the method comprises the following steps of performing word segmentation processing on characters in the Chinese character stack by using a jieba word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression, wherein the word segmentation processing specifically comprises the following steps:
firstly, the characters in the Chinese character string are cut into words by utilizing a jieba word cutting algorithm to obtain the @ of a third topic which makes the @ function @ meaningful @@ positive integer @ value range @ element @ of @ set @ has @? ", and the symbol @ is a representation delimiter.
Calculating the weight of each word in the third topic by using a TF-IDF algorithm, and obtaining the weight of each word in the third topic as follows in sequence:
"let": 0.05, "function": 0.51, "meaningful": 0.22, "for": 0.02, "y": 0.09, "═ 0.07," \ sqrt ": 0.22," (": 0.01," 5 ": 0.01," - ": 0.07," x ": 0.07,") ": 0.01," positive integer ": 0.49," value range ": 0.44," component ": 0.15," for ": 0.02," set ": 0.38," for ": 0.02," element ": 0.35," for ": 0.05,"? ": 0.01.
The occurrence frequency of each word in the fourth topic is counted, and the word frequency vector of the fourth topic is constructed over the vocabulary of all non-stop words, specifically as follows:
Suppose there are 1000 non-stop words across all training sets; then the word frequency vector of the fourth topic has length 1000, and each element holds the number of times the corresponding word appears in the topic. For a word that does appear in the fourth topic, e.g., "function" appearing once, the corresponding dimension value is 1; if "function" appeared twice, the value would be 2. The dimensions of all words not present in the topic are 0.
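The word frequency vector construction just described can be sketched directly; the function and variable names are illustrative.

```python
# Sketch of the word-frequency vector: one dimension per non-stop word in the
# training vocabulary, value = occurrence count of that word in the question.

def word_freq_vector(tokens, vocab):
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0] * len(vocab)
    for t in tokens:
        if t in index:          # words outside the vocabulary are ignored
            vec[index[t]] += 1
    return vec
```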
Each word appearing in the fourth topic is input into a trained semantic model (such as a word2vec or GloVe model) to obtain a vector for that word. Because the resulting word vectors all have the same length, they can be superimposed, that is, the values in the same dimension are added to obtain a single vector for the whole topic; the semantic model is a representation that preserves semantic context. The semantic vector of the fourth topic is constructed as follows:
The fourth topic is input into the pre-trained semantic model, and the semantic vector of each word is obtained according to the parameter settings of the pre-trained model. In practice the vector of each word is generally set to 100 to 200 dimensions; for ease of illustration, 4 dimensions are used here.
Word               Dim 1   Dim 2   Dim 3   Dim 4
function           0.41    0.12    0.02    0.31
meaningful         0.21    0.01    0.02    0.22
\sqrt              0.02    0.08    0.06    0.05
positive integer   0.35    0.14    0.21    0.33
value range        0.01    0.03    0.05    0.06
composed           0.23    0.41    0.05    0.02
set                0.14    0.02    0.13    0.09
element            0.06    0.04    0.07    0.08
Finally, the values of the same dimension of each word are added to obtain a semantic vector of a fourth topic:
1.43 0.85 0.61 1.16
and dividing the value in each dimension by the total number of words (8) of the fourth topic to obtain:
0.17875 0.10625 0.07625 0.145
the above is the semantic vector of the fourth topic.
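The sum-then-average construction above can be reproduced with a short sketch; the word vectors are the 4-dimensional example values from the table:

```python
def question_semantic_vector(word_vectors):
    """Add the values in the same dimension across all word vectors,
    then divide each dimension by the number of words."""
    dims = len(word_vectors[0])
    totals = [sum(vec[d] for vec in word_vectors) for d in range(dims)]
    return [t / len(word_vectors) for t in totals]

# The 4-dimensional example vectors for the 8 words of the fourth topic.
word_vectors = [
    [0.41, 0.12, 0.02, 0.31],  # function
    [0.21, 0.01, 0.02, 0.22],  # meaningful
    [0.02, 0.08, 0.06, 0.05],  # \sqrt
    [0.35, 0.14, 0.21, 0.33],  # positive integer
    [0.01, 0.03, 0.05, 0.06],  # value range
    [0.23, 0.41, 0.05, 0.02],  # composed of
    [0.14, 0.02, 0.13, 0.09],  # set
    [0.06, 0.04, 0.07, 0.08],  # element
]
sem = question_semantic_vector(word_vectors)
# ≈ [0.17875, 0.10625, 0.07625, 0.145], matching the worked example.
```

The per-dimension sums are 1.43, 0.85, 0.61 and 1.16, and dividing by the 8 words gives the semantic vector of the fourth topic.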
S23, obtaining a first classification set and a first association set corresponding to the word frequency vector and the semantic vector according to the preset knowledge point classification model;
wherein the first knowledge point set is: { value range of set elements, representation method of a function, representation method of a set, maximum value of elements in a set, existence and number of roots, value of a function, root operations }, and the association degrees of the fourth topic with these knowledge points are: {0.85, 0.04, 0.03, 0.02, 0.03, 0.02, 0.01}. The second knowledge point set is: { value range of set elements, root operations, representation method of a set, maximum value of elements in a set, equality of sets, domain of a function and finding its value }, with association degrees: {0.73, 0.08, 0.08, 0.04, 0.04, 0.02}. The knowledge points from the first and second knowledge point sets are combined to form a third knowledge point set. The knowledge points of the third set satisfy the characteristics of both the word-frequency vector and the semantic vector of the fourth topic and have a high association degree with it; the classifications corresponding to the knowledge points in the third knowledge point set form the first classification set.
S24, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
obtaining a second association set according to the similarity set and the first association set;
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold in the second classification set to obtain an approximate topic set;
The similarity is computed with the cosine formula:

cos θ = (x · y) / (‖x‖ ‖y‖)

where x denotes the feature vector of the first topic and y the feature vector of a topic in the classification; the closer cos θ is to 1, the more similar the two topics are.
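The cosine computation can be sketched directly from the formula; this is a minimal illustration, not the patent's implementation:

```python
import math

def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| * |y|); values close to 1 mean the
    two topic vectors point in nearly the same direction."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # guard: an all-zero vector has no direction
    return dot / (norm_x * norm_y)

# A vector compared with itself gives ~1.0; orthogonal vectors give 0.0.
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # ≈ 1.0
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])            # → 0.0
```

The zero-vector guard is an added safety choice for questions whose feature vector is empty; the patent text does not specify this case.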
The first association degrees between the first topic and each classification in the first classification set are:

{ value range of set elements: 1.58, representation method of a function: 0.04, representation method of a set: 0.11, maximum value of elements in a set: 0.06, existence and number of roots: 0.03, value of a function: 0.02, root operations: 0.09, maximum value of elements in a set: 0.04, equality of sets: 0.04, domain of a function and finding its value: 0.02 }

The process of obtaining the second association set from the similarity set and the first association set is specifically as follows:
(1) Obtain the four elements with the largest first association degree in the first classification set, namely: value range of set elements, representation method of a set, root operations, and maximum value of elements in a set; extract the TF-IDF vectors of all questions in the question bank that belong to these knowledge points.
(2) Using the cosine formula, compute the cosine distance between the feature vector of each extracted question and that of the first topic.
(3) Sort all the obtained cosine distances between the questions and the first topic to obtain the second association set. Then select the questions with higher similarity to the first topic from the classifications with higher second association degree to form the approximate question set.
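The three steps above (taking the top classifications by association degree, scoring each of their questions against the first topic, and keeping the most similar) can be sketched as follows. The function signature, example class names, question IDs, and threshold values are illustrative assumptions, not taken from the patent:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def build_approximate_set(ranked_classes, question_bank, query_vec,
                          similarity, top_k=4, threshold=0.5):
    """Keep the top_k classifications by first association degree, score
    every question in them against the query, and return the (id, score)
    pairs above the threshold, best first."""
    top = sorted(ranked_classes, key=lambda c: c[1], reverse=True)[:top_k]
    scored = []
    for name, _assoc in top:
        for qid, vec in question_bank.get(name, []):
            score = similarity(query_vec, vec)
            if score > threshold:
                scored.append((qid, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Hypothetical data; the class names and associations echo the example.
ranked = [("value range of set elements", 1.58),
          ("representation method of a set", 0.11),
          ("root operations", 0.09),
          ("maximum value of elements in a set", 0.06),
          ("value of a function", 0.02)]
bank = {"value range of set elements": [("q1", [1.0, 0.0]),
                                        ("q2", [0.0, 1.0])]}
result = build_approximate_set(ranked, bank, [1.0, 0.0], cosine)
# → [("q1", 1.0)]: q1 matches the query exactly, q2 is orthogonal.
```

Passing the similarity function as a parameter lets the same selection logic run over TF-IDF vectors or semantic vectors without change.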
S25, pushing the approximate question set;
s26, obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
The process of updating the knowledge point classification model specifically comprises the following steps:
(1) First compute the length of the topic after word segmentation. For the fourth topic, "function @ meaningful @ \sqrt @ positive integer @ value range @ composed of @ set @ element", the topic length is 8.
(2) Let updateWeight be the parameter to be determined: when the segmented length of the topic is greater than 5, updateWeight is 0.5; otherwise a different value applies. The fourth topic has length 8, so its updateWeight is 0.5.
(3) Compute incomeWeight, the average similarity of the approximate questions under the knowledge point whose similarity exceeds 0.1. Let A be the set of all questions under the knowledge point, with a ∈ A, and let x be the currently queried question, here the fourth topic.
Define A' = { x | sim(a, x) > 0.1 }, where sim(a, x) is the similarity between question a and x. The incomeWeight of each knowledge point is then the average similarity over A'.
(4) according to the following formula:
newWeight=oldWeight×(1-updateWeight)+incomeWeight×updateWeight
update the weight value of each knowledge point, where newWeight is the updated knowledge point weight and oldWeight is the original weight. For example, the old knowledge point weights for the fourth topic are 1.58, 0.11, 0.09 and 0.06 respectively; the resulting newWeight values are the new knowledge point weights.
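The update rule above can be sketched as follows. The text fixes updateWeight = 0.5 for topics longer than 5 words but does not give the value for shorter topics, so `short_update` is an assumed placeholder, as are the sample similarities fed to incomeWeight:

```python
def income_weight(similarities, threshold=0.1):
    """incomeWeight: mean similarity over the questions whose similarity
    to the queried question exceeds the threshold (the set A' above)."""
    kept = [s for s in similarities if s > threshold]
    return sum(kept) / len(kept) if kept else 0.0

def updated_weight(old_weight, topic_length, income,
                   long_update=0.5, short_update=0.25):
    """newWeight = oldWeight*(1 - updateWeight) + incomeWeight*updateWeight.

    updateWeight is 0.5 for topics longer than 5 segmented words; the
    short-topic value is not specified in the text, so short_update is
    an assumed placeholder.
    """
    update = long_update if topic_length > 5 else short_update
    return old_weight * (1 - update) + income * update

# Fourth topic: 8 segmented words, old weight 1.58, assumed similarities.
inc = income_weight([0.05, 0.2, 0.4])  # ≈ 0.3 (0.05 is filtered out)
new = updated_weight(1.58, 8, inc)     # ≈ 1.58*0.5 + 0.3*0.5 = 0.94
```

The blend keeps part of the old weight while pulling it toward the average similarity of the newly matched questions, which is how the model adapts to classification results over time.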
In summary, in the method and system for question classification and pushing provided by the present invention, similarity analysis is performed between the first topic and the topics in the knowledge point classifications obtained from the knowledge point classification model; the association degree between the first topic and each knowledge point classification is computed from the similarity; and topics highly similar to the first topic are then extracted from the highly associated knowledge point classifications and pushed to the user as approximate questions, which improves the relevance between the pushed approximate questions and the first topic.

Further, as described above, using a distributed cluster helps process large-scale batch question-pushing tasks and improves pushing efficiency. Further, the knowledge point classification model is updated regularly according to the classification results, which improves the accuracy of the classification model and hence the relevance of the pushed approximate questions. Further, the weight value of each node is adjusted according to the actual application scenario, which helps push the approximate questions that best meet user expectations under different user requirements. Further, symbols in formulas are converted through preset escape characters, so that symbols written differently but with the same meaning are normalized; the information in a question is thus used accurately and fully, improving the precision of question classification, the relevance of the pushed questions, and the efficiency of obtaining approximate questions.

Further, considering both the word frequency and the semantics of a question improves classification accuracy and therefore the relevance between the pushed approximate questions and the first topic. Further, the words of a topic can be segmented while preserving the information in the topic, which facilitates the extraction of feature vectors. In addition, storing Chinese and non-Chinese characters in separate stacks keeps the character order unchanged, so the original meaning of the question is not altered during word segmentation. Further, the weight of each word in a topic is computed with a stop-word algorithm and the lower-weight words are deleted from the third topic, so that different stop words can be obtained for different subjects, improving the relevance of the obtained approximate questions. Further, topics with higher similarity to the first topic are selected from the knowledge point classifications most associated with the first topic to form the approximate question set, improving the relevance between the pushed approximate question set and the first topic. The invention also provides a system for question classification and pushing, which improves the accuracy of question classification and thereby further improves the relevance between the pushed approximate questions and the first topic.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims (6)

1. A method for topic classification and pushing is characterized by comprising the following steps:
s1, classifying the first question according to the preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
the first classification set is a set of preset knowledge points;
s2, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
s3, obtaining a second association set according to the similarity set and the first association set;
s4, obtaining an approximate question set according to the second relevance set;
s5, pushing the approximate question set;
the S1 specifically includes:
converting symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector, and specifically comprises the following steps:
analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
deleting a stop word from the third topic to obtain a fourth topic;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
constructing a semantic vector corresponding to the fourth question according to the semantic feature extraction model;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model, specifically:
deploying a knowledge point classification model based on word frequency to nodes in a preset classification cluster;
deploying a semantic-based knowledge point classification model to nodes in a preset classification cluster;
sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set;
the obtaining a second association set according to the similarity set and the first association set includes:
acquiring four elements with larger first relevance in the first classification set, and extracting TF-IDF vectors of all questions belonging to knowledge points related to the four elements in a question bank;
calculating cosine distances of the extracted questions and the feature vectors of the first questions by using a cosine distance formula respectively;
sequencing all the obtained cosine distances between the questions and the first question to obtain a second association degree set;
the obtaining of the approximate question set according to the second relevance set specifically includes:
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold in the second classification set to obtain an approximate topic set;
obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
2. The method for topic classification and push according to claim 1, wherein the S1 specifically is:
deploying different preset knowledge point classification models to each node in a preset classification cluster;
and sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set.
3. The method for topic classification and push according to claim 2,
the updating the preset knowledge point classification model according to the third classification set comprises:
for each knowledge point in the third classification set, sequentially performing:
calculating the subject length after the fourth subject is segmented;
setting updateWeight as the parameter to be determined, wherein when the topic length of the fourth topic is greater than 5, updateWeight is 0.5, and otherwise a different value applies;
calculating incomeWeight, wherein the incomeWeight refers to the average approximation degree of the approximation questions with the similarity exceeding 0.1 under the knowledge point, and assuming that the set of all the questions under the knowledge point is A, a belongs to A, and x is the fourth question queried currently;
defining A' = { x | sim(a, x) > 0.1 }, wherein sim(a, x) is the similarity between topic a and x;
calculating incomeWeight of the knowledge point as:
according to the following formula:
newWeight=oldWeight×(1-updateWeight)+incomeWeight×updateWeight
and updating the weight value of the knowledge point, wherein newWeight is the weight of the updated knowledge point, and oldWeight is the weight of the original knowledge point.
4. The topic classification and pushing method according to claim 2, wherein the step of sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set specifically comprises:
sending the first question to each node in the preset classification cluster to obtain a classification set and an association set corresponding to the node;
obtaining the weight value of the node according to a knowledge point classification model deployed on the node;
and obtaining the first classification set and the first association set according to the weight values of the nodes and the classification sets and association sets corresponding to the nodes.
5. The topic classification and push method according to claim 1, wherein a stop word is deleted from the third topic to obtain a fourth topic, specifically:
calculating the weight of each word in the third topic by using a TF-IDF algorithm;
sorting the words in the third topic according to the weight to form a first queue;
and deleting words corresponding to a preset number element in front of the first queue from the third topic to obtain a fourth topic.
6. A system for topic classification and push, comprising:
the classification module classifies the first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
the first classification set is a set of preset knowledge points;
the calculation module is used for calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
the first processing module is used for obtaining a second association set according to the similarity set and the first association set;
the second processing module is used for obtaining an approximate question set according to the second relevance set;
the pushing module is used for pushing the similar topic set;
the classification module is specifically configured to:
converting symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector, and specifically comprises the following steps:
analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
deleting a stop word from the third topic to obtain a fourth topic;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
constructing a semantic vector corresponding to the fourth question according to the semantic feature extraction model;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model, specifically:
deploying a knowledge point classification model based on word frequency to nodes in a preset classification cluster;
deploying a semantic-based knowledge point classification model to nodes in a preset classification cluster;
sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set;
the obtaining a second association set according to the similarity set and the first association set includes:
acquiring four elements with larger first relevance in the first classification set, and extracting TF-IDF vectors of all questions belonging to knowledge points related to the four elements in a question bank;
calculating cosine distances of the extracted questions and the feature vectors of the first questions by using a cosine distance formula respectively;
sequencing all the obtained cosine distances between the questions and the first question to obtain a second association degree set;
the obtaining of the approximate question set according to the second relevance set specifically includes:
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold in the second classification set to obtain an approximate topic set;
obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
CN201611009278.0A 2016-11-16 2016-11-16 Method and system for classifying and pushing questions Active CN106599054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611009278.0A CN106599054B (en) 2016-11-16 2016-11-16 Method and system for classifying and pushing questions


Publications (2)

Publication Number Publication Date
CN106599054A CN106599054A (en) 2017-04-26
CN106599054B true CN106599054B (en) 2019-12-24

Family

ID=58590375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611009278.0A Active CN106599054B (en) 2016-11-16 2016-11-16 Method and system for classifying and pushing questions

Country Status (1)

Country Link
CN (1) CN106599054B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463553B (en) * 2017-09-12 2021-03-30 复旦大学 Text semantic extraction, representation and modeling method and system for elementary mathematic problems
CN108182275A (en) * 2018-01-24 2018-06-19 上海互教教育科技有限公司 A kind of mathematics variant training topic supplying system and correlating method
CN108376132B (en) * 2018-03-16 2020-08-28 中国科学技术大学 Method and system for judging similar test questions
CN108765221A (en) * 2018-05-15 2018-11-06 广西英腾教育科技股份有限公司 Pumping inscribes method and device
CN109189920A (en) * 2018-08-02 2019-01-11 上海欣方智能系统有限公司 Sweep-black case classification method and system
CN109685137A (en) * 2018-12-24 2019-04-26 上海仁静信息技术有限公司 A kind of topic classification method, device, electronic equipment and storage medium
CN109785691B (en) * 2019-01-18 2021-09-24 广东小天才科技有限公司 Method and system for assisting learning through terminal
CN110136512A (en) * 2019-04-17 2019-08-16 许昌学院 A kind of English grade examzation examination exercise and the automatic clustering system of answer parsing
CN110472044A (en) * 2019-07-11 2019-11-19 平安国际智慧城市科技股份有限公司 Knowledge point classification method, device, readable storage medium storing program for executing and the server of mathematical problem
CN112989760A (en) * 2019-12-17 2021-06-18 北京一起教育信息咨询有限责任公司 Method and device for labeling subjects, storage medium and electronic equipment
CN111723193A (en) * 2020-06-19 2020-09-29 平安科技(深圳)有限公司 Exercise intelligent recommendation method and device, computer equipment and storage medium
CN111881285A (en) * 2020-07-28 2020-11-03 扬州大学 Wrong question collection and important and difficult point knowledge extraction method
CN112257966B (en) * 2020-12-18 2021-04-09 北京世纪好未来教育科技有限公司 Model processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval
CN104834729A (en) * 2015-05-14 2015-08-12 百度在线网络技术(北京)有限公司 Title recommendation method and title recommendation device
CN105095223A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Method for classifying texts and server
CN105589972A (en) * 2016-01-08 2016-05-18 天津车之家科技有限公司 Method and device for training classification model, and method and device for classifying search words
CN106021288A (en) * 2016-04-27 2016-10-12 南京慕测信息科技有限公司 Method for rapid and automatic classification of classroom testing answers based on natural language analysis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544255B (en) * 2013-10-15 2017-01-11 常州大学 Text semantic relativity based network public opinion information analysis method
CN105893362A (en) * 2014-09-26 2016-08-24 北大方正集团有限公司 A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN105930509B (en) * 2016-05-11 2019-05-17 华东师范大学 Field concept based on statistics and template matching extracts refined method and system automatically


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic association of knowledge points and test questions based on the vector space model; Dong Aogen et al.; Computer and Modernization; Oct. 31, 2015 (No. 10); pp. 6-9 *
A text similarity evaluation algorithm for structured data in institutional repositories; Wu Xu et al.; Technology Research; May 31, 2015 (No. 5); pp. 16-20 *

Also Published As

Publication number Publication date
CN106599054A (en) 2017-04-26

Similar Documents

Publication Publication Date Title
CN106599054B (en) Method and system for classifying and pushing questions
Rathi et al. Sentiment analysis of tweets using machine learning approach
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN106156204B (en) Text label extraction method and device
CN106651696B (en) Approximate question pushing method and system
Kadhim et al. Text document preprocessing and dimension reduction techniques for text document clustering
CN102411563B (en) Method, device and system for identifying target words
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN106991127B (en) Knowledge subject short text hierarchical classification method based on topological feature expansion
CN103617157A (en) Text similarity calculation method based on semantics
US20180341686A1 (en) System and method for data search based on top-to-bottom similarity analysis
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN108363694B (en) Keyword extraction method and device
Liliana et al. Indonesian news classification using support vector machine
CN107066555A (en) Towards the online topic detection method of professional domain
CN112559684A (en) Keyword extraction and information retrieval method
CN106708940A (en) Method and device used for processing pictures
Deniz et al. Effects of various preprocessing techniques to Turkish text categorization using n-gram features
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN110866102A (en) Search processing method
CN106844482B (en) Search engine-based retrieval information matching method and device
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN114491062B (en) Short text classification method integrating knowledge graph and topic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant