CN106599054B - Method and system for classifying and pushing questions - Google Patents


Info

Publication number
CN106599054B
CN106599054B (application CN201611009278.0A)
Authority
CN
China
Prior art keywords
classification
topic
question
preset
knowledge point
Prior art date
Legal status
Active
Application number
CN201611009278.0A
Other languages
Chinese (zh)
Other versions
CN106599054A (en
Inventor
刘德建
章亮
詹博悍
陈霖
吴拥民
陈宏展
Current Assignee
Fujian Tianquan Educational Technology Ltd
Original Assignee
Fujian Tianquan Educational Technology Ltd
Priority date
Filing date
Publication date
Application filed by Fujian Tianquan Educational Technology Ltd filed Critical Fujian Tianquan Educational Technology Ltd
Priority to CN201611009278.0A priority Critical patent/CN106599054B/en
Publication of CN106599054A publication Critical patent/CN106599054A/en
Application granted granted Critical
Publication of CN106599054B publication Critical patent/CN106599054B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification


Abstract

The invention relates to the field of classification, in particular to a method and a system for classifying and pushing questions. A first question is classified according to a preset knowledge point classification model to obtain a first classification set and a first association degree set, whose elements are the association degrees between the first question and each classification in the first classification set. The similarity between the first question and the questions contained in each classification of the first classification set is calculated to obtain a similarity set for each classification. A second association degree set is obtained from the similarity sets and the first association degree set; an approximate question set is obtained from the second association degree set; and the approximate question set is pushed. The accuracy of question classification and the relevance of the pushed approximate questions are thereby improved.

Description

Method and system for classifying and pushing questions
Technical Field
The invention relates to the field of classification, in particular to a method and a system for classifying and pushing questions.
Background
In the big data era, the amount of data produced each day has grown explosively. K12 education, one of the most important forms of education in China, produces a non-negligible amount of data every day. The scale of online education in China is growing at a rate of over 30% per year, and market estimates suggest it will exceed 1600 billion dollars. K12 online education resources have therefore become contested ground for enterprises. If the ever-growing question data can be analyzed, utilized, and reasonably classified into the corresponding knowledge points, then when students encounter difficult problems or weak areas, questions with a high degree of association with those knowledge points can be pushed to them for in-depth practice, improving the user experience of the application.
Patent document No. 201510246727.2 provides a topic recommendation method: receive a search topic; acquire the attribute information of the search topic and obtain a preliminary search result from that attribute information; acquire the user description information of a user and rank the preliminary search results according to it; and select a preset number of results from the ranked results as the recommended topics. The relevance between the recommended topics and the search topic is improved, and the recommendation effect is improved accordingly.
However, the above patent document ranks the preliminary search results according to user description information, so the accuracy of the classification result depends on the accuracy of that user description information.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a system for classifying and pushing questions, so as to improve the accuracy of question classification and the relevance of the pushed questions.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention provides a method for classifying and pushing questions, which comprises the following steps:
s1, classifying the first question according to the preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
s2, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
s3, obtaining a second association set according to the similarity set and the first association set;
s4, obtaining an approximate question set according to the second relevance set;
and S5, pushing the approximate question set.
The invention also provides a system for classifying and pushing questions, which comprises the following modules:
a classification module, used for classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set; the elements in the first association degree set are the association degrees between the first question and each classification in the first classification set;
the calculation module is used for calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
the first processing module is used for obtaining a second association set according to the similarity set and the first association set;
the second processing module is used for obtaining an approximate question set according to the second relevance set;
and the pushing module is used for pushing the similar topic set.
The invention has the beneficial effects that: in contrast to the prior art, which pushes related approximate questions directly from the classification result of a classification model, the invention performs a similarity analysis between the first question and the questions within the knowledge point classifications obtained from the knowledge point classification model, recalculates the association degree between the first question and each knowledge point classification according to that similarity, and then extracts questions highly similar to the first question from the highly associated knowledge point classifications to push to the user as approximate questions. This improves the relevance between the pushed approximate questions and the first question.
Drawings
FIG. 1 is a block diagram of a method for topic classification and push according to the present invention;
FIG. 2 is a block diagram of a topic classification and push system according to the present invention;
description of reference numerals:
1. a classification module; 2. a calculation module; 3. a first processing module; 4. a second processing module; 5. and a pushing module.
Detailed Description
In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.
The most key concept of the invention is as follows: similarity analysis is carried out on the first question and the questions in the knowledge point classification obtained according to the knowledge point classification model, and the association degree of the first question and each knowledge point classification is recalculated, so that the relevance of the pushed approximate question and the first question can be improved.
As shown in FIG. 1, the present invention provides a method for topic classification and push, comprising:
s1, classifying the first question according to the preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
s2, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
s3, obtaining a second association set according to the similarity set and the first association set;
s4, obtaining an approximate question set according to the second relevance set;
and S5, pushing the approximate question set.
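The steps S1-S5 above can be sketched in simplified form. The helper names, the Jaccard similarity measure, and the S3 combination rule (association degree multiplied by maximum similarity) are illustrative assumptions; the patent leaves the exact formulas open.

```python
# Simplified sketch of steps S1-S5. The helper names, the Jaccard similarity,
# and the S3 combination rule (association degree x maximum similarity) are
# illustrative assumptions; the patent leaves the exact formulas open.

def jaccard(a, b):
    """Token-overlap similarity between two questions (stand-in for S2)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def push_similar(question, model, corpus_by_class, top_k=2, sim_threshold=0.5):
    first_assoc = model(question)                                    # S1
    similarity = {c: {t: jaccard(question, t) for t in corpus_by_class[c]}
                  for c in first_assoc}                              # S2
    second_assoc = {c: first_assoc[c] * max(similarity[c].values(), default=0.0)
                    for c in first_assoc}                            # S3
    top_classes = sorted(second_assoc, key=second_assoc.get, reverse=True)[:top_k]
    return [t for c in top_classes
            for t, s in similarity[c].items() if s > sim_threshold]  # S4/S5
```

In this sketch the "model" is any callable returning class-to-relevance mappings; the later sections of the patent describe how such models are trained.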
Further, the S1 specifically includes:
deploying different preset knowledge point classification models to each node in a preset classification cluster;
and sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set.
According to the description, the distributed cluster is beneficial to processing the approximate topic pushing task of the large-scale batch topics, and the pushing efficiency is improved.
Further, still include:
obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
According to the description, the knowledge point classification model is updated regularly according to the classification result, the classification accuracy of the classification model can be improved, and therefore the relevance of the pushing approximation question is improved.
Further, the sending of the first topic to each node in the preset classification cluster obtains the first classification set and the first association set, and specifically includes:
sending the first question to each node in the preset classification cluster to obtain a classification set and an association set corresponding to the node;
obtaining the weight value of the node according to a knowledge point classification model deployed on the node;
and obtaining the first classification set and the first association set according to the weight values of the nodes and the classification sets and association sets corresponding to the nodes.
As can be seen from the above description, a plurality of different classification models are respectively deployed on the nodes in the classification cluster, so that the classification results obtained by the nodes are different, the weight values of the nodes are determined according to the classification models deployed on the nodes, and the weight values and the corresponding classification results are comprehensively analyzed to obtain the knowledge point classification with a high association degree with the first topic. The weighted value of each node is adjusted according to the actual application scene, and the pushing of the approximate question which best meets the user expectation according to different requirements of the user is facilitated.
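The weighted combination of per-node results described above can be sketched as follows. The weighted-sum aggregation rule and all names are assumptions, since the patent does not fix the exact formula.

```python
# Hedged sketch: combining per-node classification results with node weights.
# A normalized weighted sum over the relevance each node assigns to each class
# is one natural reading of the patent's description.

def aggregate(node_results, node_weights):
    """node_results: {node: {class: relevance}}; node_weights: {node: weight}."""
    combined = {}
    total = sum(node_weights.values())
    for node, classes in node_results.items():
        w = node_weights[node] / total
        for cls, rel in classes.items():
            combined[cls] = combined.get(cls, 0.0) + w * rel
    return combined
```

Adjusting a node's weight directly shifts how much its classification model influences the final first classification set, matching the "adjust per application scene" remark above.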
Further, the S1 specifically includes:
converting symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector;
and obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model.
As can be seen from the above description, questions from different sources may be described in different ways; in particular, different formula editors describe the symbols in a formula quite differently. Converting the symbols in a formula through preset escape characters normalizes symbols that are written differently but mean the same thing, so that the information in the question is used accurately and fully. This improves the precision of question classification, the relevance of the pushed questions, and the efficiency of obtaining approximate questions.
For example, topic 1 awaiting a push of approximate questions is: "To make the function y = √(5-x) meaningful, what are the elements of the set consisting of the positive integer values of x?" Topic 2 awaiting a push is: "To make the function y = (5-x)^(1/2) meaningful, what are the elements of the set consisting of the positive integer values of x?" Topics 1 and 2 are in fact essentially the same, but existing methods cannot fully utilize the formula information in the topics: they can only push general questions on finding the value range that makes a function meaningful, and cannot push, more specifically, questions on finding the value range that makes a function containing a root sign meaningful.
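A minimal sketch of this symbol normalization, assuming a small escape table modelled on the patent's own examples ("√" to "\sqrt", full-width "═" to "="); a real system would need a far larger mapping, and the power-to-root rewrite is an assumed convention.

```python
import re

# Illustrative symbol normalization via preset escape characters.
# ESCAPES mirrors the patent's examples; real deployments need a larger table.
ESCAPES = {"√": r"\sqrt", "═": "=", "－": "-"}

def normalize(question):
    for sym, esc in ESCAPES.items():
        question = question.replace(sym, esc)
    # Rewrite "(...)^(1/2)"-style powers into \sqrt form (assumed convention),
    # so differently-edited formulas collapse to one representation.
    question = re.sub(r"\(([^()]+)\)\s*\^?\(?1/2\)?", r"\\sqrt(\1)", question)
    return question
```

With this, the two example topics above would both contain `\sqrt(5-x)` and could be matched against each other.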
Further, according to the preset knowledge point classification model, a first classification set and a first association set corresponding to the feature vector are obtained, specifically:
deploying a knowledge point classification model based on word frequency to nodes in a preset classification cluster;
deploying a semantic-based knowledge point classification model to nodes in a preset classification cluster;
and sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set.
According to the description, the knowledge point classification related to the first topic obtained by the classification cluster comprises a classification result obtained from two dimensions of word frequency and semantics, and the word frequency and the semantics in the topic are comprehensively considered, so that the classification accuracy can be improved, and the relevance of the pushed approximate topic and the first topic is improved.
Further, extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector, and specifically comprises the following steps:
analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
deleting a stop word from the third topic to obtain a fourth topic;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
and constructing a semantic vector corresponding to the fourth topic according to the semantic feature extraction model.
As can be seen from the above description, existing word segmentation algorithms delete the non-Chinese characters in a question and segment only the Chinese characters. The Chinese and non-Chinese characters of the question are therefore placed in separate stacks: word segmentation is performed on the Chinese character stack, while regular expressions match the corresponding formulas in the non-Chinese character stack, separating out the recognizable parts of the formulas as far as possible. The question can thus be segmented while the information in it is retained, and its feature vector can be extracted. In addition, storing the Chinese and non-Chinese characters in stacks keeps the order of the characters unchanged, so the original meaning of the question is not altered during segmentation. Moreover, deleting stop words, i.e., low-content words such as "the", "it", "on", "be", and "inside", allows the feature vector of the question to be extracted more accurately, ignores irrelevant information, and reduces the redundancy of the feature vector.
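The stack-based split can be sketched as follows. jieba is replaced here by per-character tokens, and the formula regular expression is an assumed pattern, purely for illustration.

```python
import re

# Simplified sketch of the stack-based split: Chinese characters go to one
# buffer, everything else (formula symbols) to another, preserving order within
# each. Real segmentation would use jieba; here each Chinese character is its
# own token, purely for illustration.

def split_stacks(question):
    chinese, other = [], []
    for ch in question:
        (chinese if "\u4e00" <= ch <= "\u9fff" else other).append(ch)
    return chinese, other

# Assumed pattern for recognizable formula fragments in the non-Chinese stack.
FORMULA_RE = re.compile(r"[A-Za-z0-9+\-*/^=()\\ ]+")

def tokenize(question):
    chinese, other = split_stacks(question)
    parts = FORMULA_RE.findall("".join(other))
    return chinese + [p.strip() for p in parts if p.strip()]
```

Because each stack preserves the original character order, the formula fragment comes out intact rather than being discarded by the segmenter.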
Further, deleting a stop word from the third topic to obtain a fourth topic, specifically:
calculating the weight of each word in the third topic;
sorting the words in the third topic according to the weight to form a first queue;
and deleting words corresponding to a preset number element in front of the first queue from the third topic to obtain a fourth topic.
As can be seen from the above description, the specific stop words differ between disciplines and school years, whereas the conventional way to obtain stop words, looking them up in a fixed stop word table, offers little flexibility or specificity. Deriving the stop words from the computed word weights adapts them to the discipline at hand.
For example, "acceleration" is a word that appears frequently in physics and is important for understanding a physics question; in biology, however, it may not appear even once in a thousand questions. If "acceleration" does occur in a biology question, it can therefore be treated as a stop word rather than an important word for that discipline, and can be deleted.
Term frequency (TF) refers to the number of times a given word appears in a document. This count is usually normalized (the numerator is generally smaller than the denominator, in contrast to IDF) to prevent a bias towards long documents. The calculation formula is:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of word t_i in document d_j, and the denominator Σ_k n_{k,j} is the total number of occurrences of all words in document d_j.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and taking the logarithm of the resulting quotient:

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| is the total number of documents in the corpus and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i. If the word does not appear in the corpus, the denominator would be 0, so 1 + |{j : t_i ∈ d_j}| is typically used instead. The final TF-IDF formula is obtained as:
tf-idf_{i,j} = tf_{i,j} × idf_i
a high word frequency within a particular document, and a low document frequency for that word across the document collection, may result in a high-weighted TF-IDF. Therefore, TF-IDF tends to filter out common words, preserving important words.
Further, according to the second relevance set, an approximate question set is obtained, specifically:
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
and obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold value in the second classification set to obtain an approximate topic set.
As can be seen from the above description, selecting the topics most similar to the first topic from the knowledge point classifications most associated with the first topic to form the approximate topic set improves the relevance between the pushed approximate topic set and the first topic.
As shown in FIG. 2, the present invention further provides a system for topic classification and pushing, comprising:
the classification module 1 classifies the first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
a calculating module 2, configured to calculate similarity between the first topic and topics included in each classification in the first classification set, so as to obtain a similarity set corresponding to each classification in the first classification set;
the first processing module 3 is configured to obtain a second association set according to the similarity set and the first association set;
the second processing module 4 is used for obtaining an approximate question set according to the second relevance set;
and the pushing module 5 is used for pushing the approximate topic set.
According to the above description, the system for classifying and pushing questions can improve the precision of question classification, thereby further improving the relevance between the pushed approximate questions and the first question.
The embodiment of the invention is as follows:
s1, respectively deploying a knowledge point classification model based on word frequency and a knowledge point classification model based on semantics on nodes of a preset classification cluster;
the knowledge point classification model based on the word frequency specifically comprises the following steps:
(1) input a new question;
(2) convert the new question into LaTeX format;
(3) perform word segmentation on the text, and delete the corresponding stop words according to the stop words obtained during training;
(4) construct a word frequency vector for the new question;
(5) input the word frequency vector into the pre-trained word-frequency-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The process of training the knowledge point classification model based on the word frequency specifically comprises the following steps:
(1) input the training questions;
(2) convert the training questions into LaTeX format;
(3) perform word segmentation on the text;
(4) calculate the weight of each word with a stop word algorithm (TF-IDF), obtain the stop words according to a set threshold, and delete the stop words from the training questions;
(5) convert each training question into a word frequency vector;
(6) set the corresponding parameters of the classification algorithm;
(7) input the word frequency vectors into the classification algorithm for training, obtaining the word-frequency-based knowledge point classification model.
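Steps (5)-(7) can be sketched with a stand-in classifier. The patent names no specific algorithm for the word-frequency model, so a nearest-centroid classifier is used here purely for illustration; all names are assumptions.

```python
# Illustrative stand-in for steps (5)-(7): train a classifier on
# word-frequency vectors. A nearest-centroid model substitutes for whatever
# algorithm a real deployment would choose.

def train_centroids(vectors, labels):
    """Average the vectors of each knowledge-point label into a centroid."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def predict(centroids, vec):
    """Assign a new question's vector to the nearest centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(centroids[lab], vec))
```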
The semantic-based knowledge point classification model specifically comprises the following steps:
(1) input a new question;
(2) convert the new question into LaTeX format;
(3) perform word segmentation on the text, and delete the corresponding stop words according to the stop words obtained during training;
(4) input the new question into a pre-trained semantic feature extraction model to obtain the corresponding semantic vector;
(5) input the semantic vector into the pre-trained semantics-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The process of training the knowledge point classification model based on the semantics specifically comprises the following steps:
(1) input the training questions;
(2) convert the training questions into LaTeX format;
(3) perform word segmentation on the text;
(4) input the segmented training questions into a semantic feature extraction model (such as a word2vec model) and, with the set model parameters, obtain a semantic feature extraction model fitted to the training questions;
(5) input each training question into the semantic feature extraction model to obtain a semantic vector for each training question;
(6) set the corresponding classification algorithm (such as random forest or the xgboost algorithm);
(7) input the semantic vectors into the classification algorithm for training, obtaining the semantics-based knowledge point classification model.
S2, sending the first topic to each node in the preset classification cluster; each node classifies the first topic, and specifically comprises:
s21, converting the symbols in the first question according to preset escape characters to obtain a second question;
Here, the escape character for the symbol "√" is "\sqrt", the escape character for the full-width "═" is the equals sign entered in English input mode, and the escape character for the full-width "－" is the minus sign entered in English input mode. The second topic obtained after escape-character conversion is: "To make the function y = \sqrt(5-x) meaningful, what are the elements of the set consisting of the positive integer values of x?"
S22, analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
calculating the weight of each word in the third topic;
sorting the words in the third topic according to the weight to form a first queue;
deleting words corresponding to a preset number element in front of the first queue from the third title to obtain a fourth title;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
constructing a semantic vector corresponding to the fourth question according to the semantic feature extraction model;
the method comprises the following steps of performing word segmentation processing on characters in the Chinese character stack by using a jieba word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression, wherein the word segmentation processing specifically comprises the following steps:
firstly, the characters in the Chinese character string are cut into words by utilizing a jieba word cutting algorithm to obtain the @ of a third topic which makes the @ function @ meaningful @@ positive integer @ value range @ element @ of @ set @ has @? ", and the symbol @ is a representation delimiter.
Calculating the weight of each word in the third topic by using a TF-IDF algorithm, and obtaining the weight of each word in the third topic as follows in sequence:
"let": 0.05, "function": 0.51, "meaningful": 0.22, "for": 0.02, "y": 0.09, "═ 0.07," \ sqrt ": 0.22," (": 0.01," 5 ": 0.01," - ": 0.07," x ": 0.07,") ": 0.01," positive integer ": 0.49," value range ": 0.44," component ": 0.15," for ": 0.02," set ": 0.38," for ": 0.02," element ": 0.35," for ": 0.05,"? ": 0.01.
The occurrence frequency of each word in the fourth topic is counted, and the word frequency vector of the fourth topic is constructed over the vocabulary of all non-stop words, specifically as follows:
Suppose there are 1000 non-stop words across all training sets; then the word frequency vector of the fourth topic has length 1000, and each element holds the number of times the corresponding word appears in the topic. For a word that does appear in the fourth topic, e.g., "function" appearing once, the corresponding dimension value is 1; if "function" appeared twice, the value would be 2. The dimensions of all words not present in the topic are 0.
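The word frequency vector construction just described can be sketched directly; the function and variable names are illustrative.

```python
# Sketch of the word-frequency vector: one dimension per non-stop word in the
# training vocabulary, value = occurrence count of that word in the question.

def word_freq_vector(tokens, vocab):
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0] * len(vocab)
    for t in tokens:
        if t in index:          # words outside the vocabulary are ignored
            vec[index[t]] += 1
    return vec
```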
Each word appearing in the fourth topic is input into a trained semantic model (such as a word2vec or GloVe model) to obtain a vector for that word. Because the resulting word vectors all have the same length, they can be superimposed, that is, the values in the same dimension are added to obtain a single vector for the whole topic; the semantic model is a representation that preserves semantic context. The semantic vector of the fourth topic is constructed as follows:
The fourth topic is input into the pre-trained semantic model, and the semantic vector of each word is obtained according to the parameter settings of the pre-trained model. In practice the vector of each word is generally set to 100 to 200 dimensions; for ease of illustration, 4 dimensions are used here.
Word               Dim 1   Dim 2   Dim 3   Dim 4
function           0.41    0.12    0.02    0.31
meaningful         0.21    0.01    0.02    0.22
\sqrt              0.02    0.08    0.06    0.05
positive integer   0.35    0.14    0.21    0.33
value range        0.01    0.03    0.05    0.06
composed           0.23    0.41    0.05    0.02
set                0.14    0.02    0.13    0.09
element            0.06    0.04    0.07    0.08
Finally, the values of the same dimension of each word are added to obtain a semantic vector of a fourth topic:
1.43 0.85 0.61 1.16
and dividing the value in each dimension by the total number of words (8) of the fourth topic to obtain:
0.17875 0.10625 0.07625 0.145
the above is the semantic vector of the fourth topic.
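The sum-then-average construction above can be reproduced with a short sketch; the word vectors are the 4-dimensional example values from the table:

```python
def question_semantic_vector(word_vectors):
    """Add the values in the same dimension across all word vectors,
    then divide each dimension by the number of words."""
    dims = len(word_vectors[0])
    totals = [sum(vec[d] for vec in word_vectors) for d in range(dims)]
    return [t / len(word_vectors) for t in totals]

# The 4-dimensional example vectors for the 8 words of the fourth topic.
word_vectors = [
    [0.41, 0.12, 0.02, 0.31],  # function
    [0.21, 0.01, 0.02, 0.22],  # meaningful
    [0.02, 0.08, 0.06, 0.05],  # \sqrt
    [0.35, 0.14, 0.21, 0.33],  # positive integer
    [0.01, 0.03, 0.05, 0.06],  # value range
    [0.23, 0.41, 0.05, 0.02],  # composed of
    [0.14, 0.02, 0.13, 0.09],  # set
    [0.06, 0.04, 0.07, 0.08],  # element
]
sem = question_semantic_vector(word_vectors)
# ≈ [0.17875, 0.10625, 0.07625, 0.145], matching the worked example.
```

The per-dimension sums are 1.43, 0.85, 0.61 and 1.16, and dividing by the 8 words gives the semantic vector of the fourth topic.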
S23, obtaining a first classification set and a first association set corresponding to the word frequency vector and the semantic vector according to the preset knowledge point classification model;
wherein the first knowledge point set is: { value range of set elements, representation method of a function, representation method of a set, maximum value of elements in a set, existence and number of roots, value of a function, root operations }, and the association degrees of the fourth topic with these knowledge points are: {0.85, 0.04, 0.03, 0.02, 0.03, 0.02, 0.01}. The second knowledge point set is: { value range of set elements, root operations, representation method of a set, maximum value of elements in a set, equality of sets, domain of a function and finding its value }, with association degrees: {0.73, 0.08, 0.08, 0.04, 0.04, 0.02}. The knowledge points from the first and second knowledge point sets are combined to form a third knowledge point set. The knowledge points of the third set satisfy the characteristics of both the word-frequency vector and the semantic vector of the fourth topic and have a high association degree with it; the classifications corresponding to the knowledge points in the third knowledge point set form the first classification set.
S24, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
obtaining a second association set according to the similarity set and the first association set;
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold in the second classification set to obtain an approximate topic set;
The similarity is computed with the cosine formula:

cos θ = (x · y) / (‖x‖ ‖y‖)

where x denotes the feature vector of the first topic and y the feature vector of a topic in the classification; the closer cos θ is to 1, the more similar the two topics are.
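The cosine computation can be sketched directly from the formula; this is a minimal illustration, not the patent's implementation:

```python
import math

def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| * |y|); values close to 1 mean the
    two topic vectors point in nearly the same direction."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # guard: an all-zero vector has no direction
    return dot / (norm_x * norm_y)

# A vector compared with itself gives ~1.0; orthogonal vectors give 0.0.
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # ≈ 1.0
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])            # → 0.0
```

The zero-vector guard is an added safety choice for questions whose feature vector is empty; the patent text does not specify this case.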
The first association degrees between the first topic and each classification in the first classification set are:

{ value range of set elements: 1.58, representation method of a function: 0.04, representation method of a set: 0.11, maximum value of elements in a set: 0.06, existence and number of roots: 0.03, value of a function: 0.02, root operations: 0.09, maximum value of elements in a set: 0.04, equality of sets: 0.04, domain of a function and finding its value: 0.02 }

The process of obtaining the second association set from the similarity set and the first association set is specifically as follows:
(1) Obtain the four elements with the largest first association degree in the first classification set, namely: value range of set elements, representation method of a set, root operations, and maximum value of elements in a set; extract the TF-IDF vectors of all questions in the question bank that belong to these knowledge points.
(2) Using the cosine formula, compute the cosine distance between the feature vector of each extracted question and that of the first topic.
(3) Sort all the obtained cosine distances between the questions and the first topic to obtain the second association set. Then select the questions with higher similarity to the first topic from the classifications with higher second association degree to form the approximate question set.
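The three steps above (taking the top classifications by association degree, scoring each of their questions against the first topic, and keeping the most similar) can be sketched as follows. The function signature, example class names, question IDs, and threshold values are illustrative assumptions, not taken from the patent:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def build_approximate_set(ranked_classes, question_bank, query_vec,
                          similarity, top_k=4, threshold=0.5):
    """Keep the top_k classifications by first association degree, score
    every question in them against the query, and return the (id, score)
    pairs above the threshold, best first."""
    top = sorted(ranked_classes, key=lambda c: c[1], reverse=True)[:top_k]
    scored = []
    for name, _assoc in top:
        for qid, vec in question_bank.get(name, []):
            score = similarity(query_vec, vec)
            if score > threshold:
                scored.append((qid, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Hypothetical data; the class names and associations echo the example.
ranked = [("value range of set elements", 1.58),
          ("representation method of a set", 0.11),
          ("root operations", 0.09),
          ("maximum value of elements in a set", 0.06),
          ("value of a function", 0.02)]
bank = {"value range of set elements": [("q1", [1.0, 0.0]),
                                        ("q2", [0.0, 1.0])]}
result = build_approximate_set(ranked, bank, [1.0, 0.0], cosine)
# → [("q1", 1.0)]: q1 matches the query exactly, q2 is orthogonal.
```

Passing the similarity function as a parameter lets the same selection logic run over TF-IDF vectors or semantic vectors without change.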
S25, pushing the approximate question set;
s26, obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
The process of updating the knowledge point classification model specifically comprises the following steps:
(1) First compute the length of the topic after word segmentation. For the fourth topic, "function @ meaningful @ \sqrt @ positive integer @ value range @ composed of @ set @ element", the topic length is 8.
(2) Let updateWeight be the parameter to be determined: when the segmented length of the topic is greater than 5, updateWeight is 0.5; otherwise a different value applies. The fourth topic has length 8, so its updateWeight is 0.5.
(3) Compute incomeWeight, the average similarity of the approximate questions under the knowledge point whose similarity exceeds 0.1. Let A be the set of all questions under the knowledge point, with a ∈ A, and let x be the currently queried question, here the fourth topic.
Define A' = { x | sim(a, x) > 0.1 }, where sim(a, x) is the similarity between question a and x. The incomeWeight of each knowledge point is then the average similarity over A'.
(4) according to the following formula:
newWeight=oldWeight×(1-updateWeight)+incomeWeight×updateWeight
update the weight value of each knowledge point, where newWeight is the updated knowledge point weight and oldWeight is the original weight. For example, the old knowledge point weights for the fourth topic are 1.58, 0.11, 0.09 and 0.06 respectively; the resulting newWeight values are the new knowledge point weights.
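The update rule above can be sketched as follows. The text fixes updateWeight = 0.5 for topics longer than 5 words but does not give the value for shorter topics, so `short_update` is an assumed placeholder, as are the sample similarities fed to incomeWeight:

```python
def income_weight(similarities, threshold=0.1):
    """incomeWeight: mean similarity over the questions whose similarity
    to the queried question exceeds the threshold (the set A' above)."""
    kept = [s for s in similarities if s > threshold]
    return sum(kept) / len(kept) if kept else 0.0

def updated_weight(old_weight, topic_length, income,
                   long_update=0.5, short_update=0.25):
    """newWeight = oldWeight*(1 - updateWeight) + incomeWeight*updateWeight.

    updateWeight is 0.5 for topics longer than 5 segmented words; the
    short-topic value is not specified in the text, so short_update is
    an assumed placeholder.
    """
    update = long_update if topic_length > 5 else short_update
    return old_weight * (1 - update) + income * update

# Fourth topic: 8 segmented words, old weight 1.58, assumed similarities.
inc = income_weight([0.05, 0.2, 0.4])  # ≈ 0.3 (0.05 is filtered out)
new = updated_weight(1.58, 8, inc)     # ≈ 1.58*0.5 + 0.3*0.5 = 0.94
```

The blend keeps part of the old weight while pulling it toward the average similarity of the newly matched questions, which is how the model adapts to classification results over time.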
In summary, in the method and system for question classification and pushing provided by the present invention, similarity analysis is performed between the first topic and the topics in the knowledge point classifications obtained from the knowledge point classification model; the association degree between the first topic and each knowledge point classification is computed from the similarity; and topics highly similar to the first topic are then extracted from the highly associated knowledge point classifications and pushed to the user as approximate questions, which improves the relevance between the pushed approximate questions and the first topic.

Further, as described above, using a distributed cluster helps process large-scale batch question-pushing tasks and improves pushing efficiency. Further, the knowledge point classification model is updated regularly according to the classification results, which improves the accuracy of the classification model and hence the relevance of the pushed approximate questions. Further, the weight value of each node is adjusted according to the actual application scenario, which helps push the approximate questions that best meet user expectations under different user requirements. Further, symbols in formulas are converted through preset escape characters, so that symbols written differently but with the same meaning are normalized; the information in a question is thus used accurately and fully, improving the precision of question classification, the relevance of the pushed questions, and the efficiency of obtaining approximate questions.

Further, considering both the word frequency and the semantics of a question improves classification accuracy and therefore the relevance between the pushed approximate questions and the first topic. Further, the words of a topic can be segmented while preserving the information in the topic, which facilitates the extraction of feature vectors. In addition, storing Chinese and non-Chinese characters in separate stacks keeps the character order unchanged, so the original meaning of the question is not altered during word segmentation. Further, the weight of each word in a topic is computed with a stop-word algorithm and the lower-weight words are deleted from the third topic, so that different stop words can be obtained for different subjects, improving the relevance of the obtained approximate questions. Further, topics with higher similarity to the first topic are selected from the knowledge point classifications most associated with the first topic to form the approximate question set, improving the relevance between the pushed approximate question set and the first topic. The invention also provides a system for question classification and pushing, which improves the accuracy of question classification and thereby further improves the relevance between the pushed approximate questions and the first topic.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims (6)

1. A method for topic classification and pushing is characterized by comprising the following steps:
s1, classifying the first question according to the preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
the first classification set is a set of preset knowledge points;
s2, calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
s3, obtaining a second association set according to the similarity set and the first association set;
s4, obtaining an approximate question set according to the second relevance set;
s5, pushing the approximate question set;
the S1 specifically includes:
converting symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector, and specifically comprises the following steps:
analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
deleting a stop word from the third topic to obtain a fourth topic;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
constructing a semantic vector corresponding to the fourth question according to the semantic feature extraction model;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model, specifically:
deploying a knowledge point classification model based on word frequency to nodes in a preset classification cluster;
deploying a semantic-based knowledge point classification model to nodes in a preset classification cluster;
sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set;
the obtaining a second association set according to the similarity set and the first association set includes:
acquiring four elements with larger first relevance in the first classification set, and extracting TF-IDF vectors of all questions belonging to knowledge points related to the four elements in a question bank;
calculating cosine distances of the extracted questions and the feature vectors of the first questions by using a cosine distance formula respectively;
sequencing all the obtained cosine distances between the questions and the first question to obtain a second association degree set;
the obtaining of the approximate question set according to the second relevance set specifically includes:
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold in the second classification set to obtain an approximate topic set;
obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
2. The method for topic classification and push according to claim 1, wherein the S1 specifically is:
deploying different preset knowledge point classification models to each node in a preset classification cluster;
and sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set.
3. The method for topic classification and push according to claim 2,
the updating the preset knowledge point classification model according to the third classification set comprises:
for each knowledge point in the third classification set, sequentially performing:
calculating the subject length after the fourth subject is segmented;
setting updateWeight as the parameter to be determined, wherein when the topic length of the fourth topic is greater than 5, updateWeight is 0.5, and otherwise a different value applies;
calculating incomeWeight, wherein the incomeWeight refers to the average approximation degree of the approximation questions with the similarity exceeding 0.1 under the knowledge point, and assuming that the set of all the questions under the knowledge point is A, a belongs to A, and x is the fourth question queried currently;
defining A' = { x | sim(a, x) > 0.1 }, wherein sim(a, x) is the similarity between topic a and x;
calculating incomeWeight of the knowledge point as:
according to the following formula:
newWeight=oldWeight×(1-updateWeight)+incomeWeight×updateWeight
and updating the weight value of the knowledge point, wherein newWeight is the weight of the updated knowledge point, and oldWeight is the weight of the original knowledge point.
4. The topic classification and pushing method according to claim 2, wherein the step of sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set specifically comprises:
sending the first question to each node in the preset classification cluster to obtain a classification set and an association set corresponding to the node;
obtaining the weight value of the node according to a knowledge point classification model deployed on the node;
and obtaining the first classification set and the first association set according to the weight values of the nodes and the classification sets and association sets corresponding to the nodes.
5. The topic classification and push method according to claim 1, wherein a stop word is deleted from the third topic to obtain a fourth topic, specifically:
calculating the weight of each word in the third topic by using a TF-IDF algorithm;
sorting the words in the third topic according to the weight to form a first queue;
and deleting words corresponding to a preset number element in front of the first queue from the third topic to obtain a fourth topic.
6. A system for topic classification and push, comprising:
the classification module classifies the first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set; elements in the first relevance set are relevance of the first topic and each classification in the first classification set;
the first classification set is a set of preset knowledge points;
the calculation module is used for calculating the similarity between the first topic and the topics contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
the first processing module is used for obtaining a second association set according to the similarity set and the first association set;
the second processing module is used for obtaining an approximate question set according to the second relevance set;
the pushing module is used for pushing the similar topic set;
the classification module is specifically configured to:
converting symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model;
extracting the features of the second question to obtain a feature vector; the feature vector comprises a word frequency vector and a semantic vector, and specifically comprises the following steps:
analyzing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation processing on the characters in the Chinese character stack by using a word segmentation algorithm, and matching a formula stored in the non-Chinese character stack by using a preset regular expression to obtain a third topic;
deleting a stop word from the third topic to obtain a fourth topic;
constructing a word frequency vector according to the fourth topic; the number of elements in the word frequency vector is the number of different words in the fourth topic, and the value of the element in the word frequency vector is the number of times of occurrence of the word corresponding to the element in the fourth topic;
establishing a semantic feature extraction model according to preset dimensions;
constructing a semantic vector corresponding to the fourth question according to the semantic feature extraction model;
obtaining a first classification set and a first association set corresponding to the feature vector according to the preset knowledge point classification model, specifically:
deploying a knowledge point classification model based on word frequency to nodes in a preset classification cluster;
deploying a semantic-based knowledge point classification model to nodes in a preset classification cluster;
sending the first topic to each node in the preset classification cluster to obtain the first classification set and the first association set;
the obtaining a second association set according to the similarity set and the first association set includes:
acquiring four elements with larger first relevance in the first classification set, and extracting TF-IDF vectors of all questions belonging to knowledge points related to the four elements in a question bank;
calculating cosine distances of the extracted questions and the feature vectors of the first questions by using a cosine distance formula respectively;
sequencing all the obtained cosine distances between the questions and the first question to obtain a second association degree set;
the obtaining of the approximate question set according to the second relevance set specifically includes:
sorting the classifications in the first classification set according to the second relevance set to obtain a first classification queue;
obtaining the classification of a preset classification number from the first classification queue to obtain a second classification set;
obtaining the topics with the similarity with the first topic being greater than a preset similarity threshold in the second classification set to obtain an approximate topic set;
obtaining the corresponding classification of each question in the approximate question set to obtain a third classification set;
and updating the preset knowledge point classification model according to the third classification set.
CN201611009278.0A 2016-11-16 2016-11-16 Method and system for classifying and pushing questions Active CN106599054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611009278.0A CN106599054B (en) 2016-11-16 2016-11-16 Method and system for classifying and pushing questions


Publications (2)

Publication Number Publication Date
CN106599054A CN106599054A (en) 2017-04-26
CN106599054B true CN106599054B (en) 2019-12-24

Family

ID=58590375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611009278.0A Active CN106599054B (en) 2016-11-16 2016-11-16 Method and system for classifying and pushing questions

Country Status (1)

Country Link
CN (1) CN106599054B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463553B (en) * 2017-09-12 2021-03-30 复旦大学 Text semantic extraction, representation and modeling method and system for elementary mathematic problems
CN108182275A (en) * 2018-01-24 2018-06-19 上海互教教育科技有限公司 A kind of mathematics variant training topic supplying system and correlating method
CN108376132B (en) * 2018-03-16 2020-08-28 中国科学技术大学 Method and system for judging similar test questions
CN108765221A (en) * 2018-05-15 2018-11-06 广西英腾教育科技股份有限公司 Pumping inscribes method and device
CN109189920A (en) * 2018-08-02 2019-01-11 上海欣方智能系统有限公司 Sweep-black case classification method and system
CN109685137A (en) * 2018-12-24 2019-04-26 上海仁静信息技术有限公司 A kind of topic classification method, device, electronic equipment and storage medium
CN109785691B (en) * 2019-01-18 2021-09-24 广东小天才科技有限公司 Method and system for assisting learning through terminal
CN110136512A (en) * 2019-04-17 2019-08-16 许昌学院 A kind of English grade examzation examination exercise and the automatic clustering system of answer parsing
CN110472044A (en) * 2019-07-11 2019-11-19 平安国际智慧城市科技股份有限公司 Knowledge point classification method, device, readable storage medium storing program for executing and the server of mathematical problem
CN112989760A (en) * 2019-12-17 2021-06-18 北京一起教育信息咨询有限责任公司 Method and device for labeling subjects, storage medium and electronic equipment
CN111723193A (en) * 2020-06-19 2020-09-29 平安科技(深圳)有限公司 Exercise intelligent recommendation method and device, computer equipment and storage medium
CN111881285A (en) * 2020-07-28 2020-11-03 扬州大学 Wrong question collection and important and difficult point knowledge extraction method
CN112257966B (en) * 2020-12-18 2021-04-09 北京世纪好未来教育科技有限公司 Model processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval
CN104834729A (en) * 2015-05-14 2015-08-12 百度在线网络技术(北京)有限公司 Title recommendation method and title recommendation device
CN105095223A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Method for classifying texts and server
CN105589972A (en) * 2016-01-08 2016-05-18 天津车之家科技有限公司 Method and device for training classification model, and method and device for classifying search words
CN106021288A (en) * 2016-04-27 2016-10-12 南京慕测信息科技有限公司 Method for rapid and automatic classification of classroom testing answers based on natural language analysis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544255B (en) * 2013-10-15 2017-01-11 常州大学 Text semantic relativity based network public opinion information analysis method
CN105893362A (en) * 2014-09-26 2016-08-24 北大方正集团有限公司 A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN105930509B (en) * 2016-05-11 2019-05-17 华东师范大学 Field concept based on statistics and template matching extracts refined method and system automatically


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic association of knowledge points and test questions based on the vector space model; Dong Aogen et al.; Computer and Modernization; Oct. 31, 2015 (No. 10); pp. 6-9 *
A text similarity evaluation algorithm for structured data in institutional repositories; Wu Xu et al.; Technology Research; May 31, 2015 (No. 5); pp. 16-20 *

Also Published As

Publication number Publication date
CN106599054A (en) 2017-04-26

Similar Documents

Publication Publication Date Title
CN106599054B (en) Method and system for classifying and pushing questions
Rathi et al. Sentiment analysis of tweets using machine learning approach
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN106156204B (en) Text label extraction method and device
CN106651696B (en) Approximate question pushing method and system
Kadhim et al. Text document preprocessing and dimension reduction techniques for text document clustering
CN102411563B (en) Method, device and system for identifying target words
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN106991127B (en) Knowledge subject short text hierarchical classification method based on topological feature expansion
CN103617157A (en) Text similarity calculation method based on semantics
US20180341686A1 (en) System and method for data search based on top-to-bottom similarity analysis
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN108363694B (en) Keyword extraction method and device
Liliana et al. Indonesian news classification using support vector machine
CN107066555A (en) Towards the online topic detection method of professional domain
CN112559684A (en) Keyword extraction and information retrieval method
CN106708940A (en) Method and device used for processing pictures
Deniz et al. Effects of various preprocessing techniques to Turkish text categorization using n-gram features
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN110866102A (en) Search processing method
CN106844482B (en) Search engine-based retrieval information matching method and device
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN114491062B (en) Short text classification method integrating knowledge graph and topic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant