CN106599054A - Method and system for title classification and push - Google Patents


Info

Publication number
CN106599054A
Authority
CN
China
Prior art keywords
exercise question
classification
degree
word
association
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611009278.0A
Other languages
Chinese (zh)
Other versions
CN106599054B (en
Inventor
刘德建
章亮
詹博悍
陈霖
吴拥民
陈宏展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Tianquan Educational Technology Ltd
Original Assignee
Fujian Tianquan Educational Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Tianquan Educational Technology Ltd
Priority to CN201611009278.0A
Publication of CN106599054A
Application granted
Publication of CN106599054B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/243Natural language query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Abstract

The invention relates to the field of classification, in particular to a method and a system for title classification and push. The method comprises the following steps of classifying a first title according to a preset knowledge point classification model to obtain a first classification set and a first correlation degree set, wherein elements in the first correlation degree set are correlation degrees of the first title and the classifications in the first classification set; computing similarity between the first title and the titles included in the classifications in the first classification set to obtain a similarity set corresponding to the classifications in the first classification set; obtaining a second correlation degree set according to the similarity set and the first correlation degree set; obtaining an approximate title set according to the second correlation degree set; and pushing the approximate title set. According to the method and the system, the accuracy of title classification and the correlation of the pushed approximate titles are improved.

Description

A method and system for exercise question classification and pushing
Technical field
The present invention relates to the field of classification, and in particular to a method and system for classifying and pushing exercise questions.
Background art
In the big data era, the volume of data produced every day is growing explosively. K12 education is one of the most important forms of education in China, and the volume of data it produces every day is considerable. China's online education market is growing by more than 30% a year, and its market valuation is expected to exceed 160 billion yuan. K12 online education resources have become a hotly contested area for enterprises. If the ever-growing body of question data can be analyzed and used, with each question reasonably classified under its corresponding knowledge points, then when a student meets a difficult question or a question in a weak area, questions strongly associated with the same knowledge point can be pushed to the student for further practice, which improves the user experience of the application.
The patent document with application number 201510246727.2 provides a question recommendation method: receiving a search question; obtaining the topic attribute information of the search question and obtaining preliminary search results according to that information; obtaining the user description information of the user and sorting the preliminary search results according to it to obtain sorted results; and selecting a preset number of results from the sorted results as the recommended questions. This improves the correlation between the recommended questions and the search question, and thereby the recommendation effect.
However, the above patent document sorts the preliminary search results according to the user description information, so the accuracy of its classification results depends on the accuracy of the user description information.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for exercise question classification and pushing that improve the accuracy of question classification and the correlation of the pushed questions.
To solve the above technical problem, the present invention adopts the following technical solution:
The present invention provides a method for exercise question classification and pushing, comprising:
S1. classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first degree-of-association set, where each element of the first degree-of-association set is the degree of association between the first question and a classification in the first classification set;
S2. calculating the similarity between the first question and the questions contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
S3. obtaining a second degree-of-association set according to the similarity set and the first degree-of-association set;
S4. obtaining an approximate question set according to the second degree-of-association set;
S5. pushing the approximate question set.
The present invention also provides a system for exercise question classification and pushing, comprising:
a classification module, configured to classify a first question according to a preset knowledge point classification model to obtain a first classification set and a first degree-of-association set, where each element of the first degree-of-association set is the degree of association between the first question and a classification in the first classification set;
a computing module, configured to calculate the similarity between the first question and the questions contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
a first processing module, configured to obtain a second degree-of-association set according to the similarity set and the first degree-of-association set;
a second processing module, configured to obtain an approximate question set according to the second degree-of-association set;
a pushing module, configured to push the approximate question set.
The beneficial effects of the present invention are as follows. Unlike the prior art, which pushes related approximate questions directly from the classification results of a classification model, the present invention performs similarity analysis between the first question and the questions in the knowledge point classifications obtained from the knowledge point classification model, calculates the degree of association between the first question and each knowledge point classification from the similarity, and then extracts, from the knowledge point classifications with a larger degree of association, the questions most similar to the first question and pushes them to the user as approximate questions. This improves the correlation between the pushed approximate questions and the first question.
Description of the drawings
Fig. 1 is a flow chart of the method for exercise question classification and pushing of the present invention;
Fig. 2 is a structural block diagram of the system for exercise question classification and pushing of the present invention;
Reference numerals:
1. classification module; 2. computing module; 3. first processing module; 4. second processing module; 5. pushing module.
Detailed description of the embodiments
To explain the technical content, the objects achieved, and the effects of the present invention in detail, the invention is described below with reference to the embodiments and the accompanying drawings.
The key idea of the present invention is to perform similarity analysis between the first question and the questions in the knowledge point classifications obtained from the knowledge point classification model, and to recalculate the degree of association between the first question and each knowledge point classification, which improves the correlation between the pushed approximate questions and the first question.
As shown in Fig. 1, the present invention provides a method for exercise question classification and pushing, comprising:
S1. classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first degree-of-association set, where each element of the first degree-of-association set is the degree of association between the first question and a classification in the first classification set;
S2. calculating the similarity between the first question and the questions contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
S3. obtaining a second degree-of-association set according to the similarity set and the first degree-of-association set;
S4. obtaining an approximate question set according to the second degree-of-association set;
S5. pushing the approximate question set.
Further, S1 specifically comprises:
deploying a different preset knowledge point classification model on each node of a preset classification cluster;
sending the first question to each node in the preset classification cluster to obtain the first classification set and the first degree-of-association set.
As can be seen from the above, using a distributed cluster makes it easier to handle approximate-question pushing for large batches of questions and improves pushing efficiency.
Further, the method also comprises:
obtaining the classification corresponding to each question in the approximate question set to obtain a second classification set;
updating the preset knowledge point classification model according to the second classification set.
As can be seen from the above, periodically updating the knowledge point classification model according to the classification results improves the classification accuracy of the model, and thereby the correlation of the pushed approximate questions.
Further, sending the first question to each node in the preset classification cluster to obtain the first classification set and the first degree-of-association set specifically comprises:
sending the first question to each node in the preset classification cluster to obtain the classification set and degree-of-association set corresponding to that node;
obtaining the weight of the node according to the knowledge point classification model deployed on the node;
obtaining the first classification set and the first degree-of-association set according to the weight of each node and the classification set and degree-of-association set corresponding to that node.
As can be seen from the above, different classification models are deployed on the nodes of the classification cluster, so the classification results obtained by the nodes differ. The weight of each node is determined by the classification model deployed on it, and the weights and the corresponding classification results are analyzed together to obtain the knowledge point classifications with a large degree of association with the first question. The weight of each node can be adjusted to the actual application scenario, which helps push the approximate questions that best meet the expectations of users with different needs.
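The weighted combination of node results described above can be sketched as follows. The text does not state how the weights and the per-node degrees of association are combined, so the normalized weighted sum used here is only an assumption, and the function name `merge_node_results`, the node names, and the sample values are hypothetical.

```python
from collections import defaultdict

def merge_node_results(node_results, node_weights):
    """Combine per-node classification results into one ranked result.

    node_results: {node_name: {knowledge_point: degree_of_association}}
    node_weights: {node_name: weight}
    Returns a list of (knowledge_point, combined_degree), highest first.
    """
    total_weight = sum(node_weights[name] for name in node_results)
    combined = defaultdict(float)
    for name, result in node_results.items():
        # Assumption: each node contributes in proportion to its weight.
        share = node_weights[name] / total_weight
        for knowledge_point, degree in result.items():
            combined[knowledge_point] += share * degree
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results from two nodes running different models.
results = {
    "word_frequency_node": {"value range of set elements": 0.85, "radical operations": 0.01},
    "semantic_node": {"value range of set elements": 0.73, "radical operations": 0.08},
}
weights = {"word_frequency_node": 0.5, "semantic_node": 0.5}
first_classification = merge_node_results(results, weights)
```

Raising the weight of one node shifts the combined ranking toward that node's model, which is one way the per-scenario adjustment mentioned above could be realized.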
Further, S1 specifically comprises:
converting the symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain feature vectors, the feature vectors including a word frequency vector and a semantic vector;
obtaining, according to the preset knowledge point classification model, the first classification set and the first degree-of-association set corresponding to the feature vectors.
As can be seen from the above, questions from different sources may be described in different ways, and different formula editors in particular describe the symbols in a formula very differently. Converting the symbols in a formula with preset escape characters normalizes symbols that are written differently but mean the same thing, so that the information in a question is used accurately and fully. This improves the accuracy of question classification, and thereby the correlation of the pushed questions and the efficiency of obtaining approximate questions.
For example, question 1, for which approximate questions are to be pushed, is "Which elements make up the set of positive-integer values for which the function $y=\sqrt{5-x}$ is meaningful?" Question 2, for which approximate questions are to be pushed, is "Which elements make up the set of positive-integer values for which the function $y=(5-x)^{1/2}$ is meaningful?" Questions 1 and 2 are in fact essentially the same, but existing methods cannot make full use of the formula information in a question. They can only push, in a general way, questions that ask for the range of a variable that makes a function meaningful, and cannot push more targeted questions that ask for the range of a variable that makes a function containing a radical sign meaningful. Existing methods also cannot recognize that two questions are identical, so the same question has to be parsed repeatedly to obtain its approximate questions, which is inefficient.
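A minimal sketch of the symbol normalization follows. The escape table `ESCAPES`, the function name `normalize_symbols`, and the rule that rewrites a half power as `sqrt(...)` are illustrative assumptions; the text only states that equivalent symbols written in different ways are mapped to one form.

```python
import re

# Hypothetical escape table: each source notation maps to one canonical token.
ESCAPES = {
    "√": "sqrt",
    "＝": "=",   # full-width equals sign
    "－": "-",   # full-width minus sign
    "−": "-",   # Unicode minus sign
    "（": "(",
    "）": ")",
}

def normalize_symbols(question: str) -> str:
    """Return the question with equivalent symbols rewritten to one form."""
    for source, target in ESCAPES.items():
        question = question.replace(source, target)
    # Assumption: a half power "(expr)^(1/2)" is treated as "sqrt(expr)"
    # so that both notations for a square root end up identical.
    question = re.sub(r"\(([^()]*)\)\s*\^\s*\(?1/2\)?", r"sqrt(\1)", question)
    return question

q1 = normalize_symbols("y＝√(5－x)")
q2 = normalize_symbols("y=(5-x)^(1/2)")
```

After normalization both formulas become `y=sqrt(5-x)`, so the two questions can be recognized as the same question.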
Further, obtaining, according to the preset knowledge point classification model, the first classification set and the first degree-of-association set corresponding to the feature vectors specifically comprises:
deploying, in the preset classification cluster, a node running a word-frequency-based knowledge point classification model;
deploying, in the preset classification cluster, a node running a semantics-based knowledge point classification model;
sending the first question to each node in the preset classification cluster to obtain the first classification set and the first degree-of-association set.
As can be seen from the above, the knowledge point classifications related to the first question that are obtained from the classification cluster include classification results from two dimensions, word frequency and semantics. Because both the word frequency and the semantics of the question are considered, classification accuracy improves, and with it the correlation between the pushed approximate questions and the first question.
Further, extracting the features of the second question to obtain feature vectors, the feature vectors including a word frequency vector and a semantic vector, specifically comprises:
parsing the second question to obtain a Chinese character stack and a non-Chinese character stack;
segmenting the characters in the Chinese character stack into words using a word segmentation algorithm, and matching the formulas stored in the non-Chinese character stack using preset regular expressions, to obtain a third question;
deleting the stop words from the third question to obtain a fourth question;
building a word frequency vector according to the fourth question, where the number of elements in the word frequency vector is the number of distinct words in the fourth question and the value of each element is the number of times the corresponding word occurs in the fourth question;
building a semantic feature extraction model according to a preset dimension;
building the semantic vector corresponding to the fourth question according to the semantic feature extraction model.
As can be seen from the above, existing word segmentation algorithms delete the non-Chinese characters in a question and segment only the Chinese characters. The present invention therefore first puts the Chinese characters and the non-Chinese characters of a question into separate stacks, segments the Chinese character stack, and matches formulas in the non-Chinese character stack with regular expressions, separating out the recognizable parts of each formula as far as possible. The question is thus segmented while its information is retained, which helps extract its feature vectors. In addition, storing the Chinese and non-Chinese characters in stacks keeps the character order unchanged, so segmentation does not alter the original meaning of the question. Furthermore, deleting the stop words in a question, that is, words that carry no meaning, such as particles and words like "it", "is", and "inside", allows the feature vectors of the question to be extracted more accurately, ignores irrelevant information, and reduces the redundancy of the feature vectors.
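The separation of Chinese and non-Chinese characters can be sketched as below. This is a simplification under stated assumptions: instead of two stacks, it walks the question once and handles alternating Chinese and non-Chinese runs in order, which preserves the same character order. The pattern `FORMULA_TOKEN` and the `segment_chinese` parameter are hypothetical; in practice a segmenter such as `jieba.lcut` could be passed in, while the default here merely splits a Chinese run into single characters.

```python
import re

CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")
# Assumed formula pattern: identifiers, numbers, or any single non-space symbol.
FORMULA_TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|\S")

def tokenize(question, segment_chinese=list):
    """Split a question into tokens while keeping the original order.

    segment_chinese stands in for a real word segmenter; the default
    splits a Chinese run into single characters.
    """
    tokens = []
    position = 0
    for match in CHINESE_RUN.finditer(question):
        # Non-Chinese text before this run is tokenized with the regex.
        tokens.extend(FORMULA_TOKEN.findall(question[position:match.start()]))
        tokens.extend(segment_chinese(match.group()))
        position = match.end()
    # Trailing non-Chinese text after the last Chinese run.
    tokens.extend(FORMULA_TOKEN.findall(question[position:]))
    return tokens

tokens = tokenize("函数y=sqrt(5-x)有意义")
```

The formula survives tokenization as `y`, `=`, `sqrt`, `(`, `5`, `-`, `x`, `)` instead of being dropped by the segmenter.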
Further, deleting the stop words from the third question to obtain the fourth question specifically comprises:
calculating the weight of each word in the third question;
sorting the words in the third question by weight to form a first queue;
deleting from the third question the words corresponding to the first preset number of elements of the first queue to obtain the fourth question.
As can be seen from the above, the stop words differ between subjects and between age groups. Existing methods obtain stop words by looking them up in a stop word list, which is inflexible and poorly targeted. The present invention uses a stop word algorithm, such as the TF-IDF algorithm, to calculate the weight of each word in a question and deletes the words with smaller weights from the third question. Different stop words can thus be obtained for different subjects, which improves the correlation of the approximate questions obtained.
For example, the common word "acceleration" occurs frequently in physics and is important for understanding a physics question. In biology, however, 1000 questions may not contain this word even once. If "acceleration" is found in a biology question, it can therefore be regarded as a stop word rather than an important word in that subject, and can be deleted.
Here, term frequency (TF) is the number of times a given word occurs in a document. This number is usually normalized (the numerator is generally smaller than the denominator, unlike IDF) to prevent a bias toward long documents. It is calculated as follows:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
In the above formula, $n_{i,j}$ is the number of times word $t_i$ occurs in document $d_j$, and the denominator is the total number of occurrences of all words in document $d_j$.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing word $t_i$. If the word does not occur in the corpus, the divisor is 0, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead. The final TF-IDF formula is:
$$tf\text{-}idf_{i,j} = tf_{i,j} \times idf_i$$
A high term frequency within a particular document, together with a low document frequency of the word across the whole document set, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
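The TF-IDF weighting and the removal of the lowest-weighted words can be sketched as follows, using the $1 + |\{j : t_i \in d_j\}|$ form of the divisor. The function names `tf_idf` and `remove_lowest` and the tiny corpus are hypothetical, and the natural logarithm is an assumption because the text does not fix the base.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total)
            * math.log(len(documents) / (1 + document_frequency[word]))
            for word, count in counts.items()
        })
    return weights

def remove_lowest(tokens, word_weights, how_many):
    """Drop the how_many distinct words with the smallest weights."""
    ranked = sorted(word_weights, key=word_weights.get)  # ascending by weight
    stop_words = set(ranked[:how_many])
    return [token for token in tokens if token not in stop_words]

# Hypothetical corpus of four already-segmented questions.
corpus = [
    ["function", "of", "set"],
    ["of", "element", "set"],
    ["of", "acceleration"],
    ["of", "speed"],
]
weights = tf_idf(corpus)
fourth_question = remove_lowest(corpus[0], weights[0], how_many=1)
```

Because "of" occurs in every document it receives the lowest weight and is the word removed, without any fixed stop word list.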
Further, obtaining the approximate question set according to the second degree-of-association set specifically comprises:
sorting the classifications in the first classification set according to the second degree-of-association set to obtain a first classification queue;
taking a preset number of classifications from the first classification queue to obtain a second classification set;
obtaining the questions in the second classification set whose similarity to the first question is greater than a preset similarity threshold to obtain the approximate question set.
As can be seen from the above, the approximate question set is formed by choosing, from the knowledge point classifications with a high degree of association with the first question, the questions most similar to the first question, which improves the correlation between the pushed approximate question set and the first question.
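The selection steps above can be sketched as follows. The data layout (dictionaries keyed by classification and question identifier), the function name `select_approximate_questions`, and the sample values are assumptions made for illustration.

```python
def select_approximate_questions(second_degrees, similarities, top_k, threshold):
    """Pick approximate questions from the most strongly associated classifications.

    second_degrees: {classification: second degree of association}
    similarities: {classification: {question_id: similarity to the first question}}
    """
    # Sort classifications by degree of association, highest first.
    ranked = sorted(second_degrees, key=second_degrees.get, reverse=True)
    selected = ranked[:top_k]
    best = {}
    for classification in selected:
        for question_id, similarity in similarities.get(classification, {}).items():
            if similarity > threshold:
                # Keep the highest similarity if a question is in several classifications.
                best[question_id] = max(similarity, best.get(question_id, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical inputs.
second_degrees = {"A": 0.9, "B": 0.5, "C": 0.1}
similarities = {
    "A": {"q1": 0.95, "q2": 0.4},
    "B": {"q3": 0.8},
    "C": {"q4": 0.99},
}
approximate = select_approximate_questions(second_degrees, similarities, top_k=2, threshold=0.7)
```

Question `q4` is very similar but belongs to a weakly associated classification, so it is excluded, while `q2` is excluded by the similarity threshold.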
As shown in Fig. 2, the present invention also provides a system for exercise question classification and pushing, comprising:
a classification module 1, configured to classify a first question according to a preset knowledge point classification model to obtain a first classification set and a first degree-of-association set, where each element of the first degree-of-association set is the degree of association between the first question and a classification in the first classification set;
a computing module 2, configured to calculate the similarity between the first question and the questions contained in each classification in the first classification set to obtain a similarity set corresponding to each classification in the first classification set;
a first processing module 3, configured to obtain a second degree-of-association set according to the similarity set and the first degree-of-association set;
a second processing module 4, configured to obtain an approximate question set according to the second degree-of-association set;
a pushing module 5, configured to push the approximate question set.
As can be seen from the above, the system for exercise question classification and pushing improves the accuracy of question classification and thereby further improves the correlation between the pushed approximate questions and the first question.
An embodiment of the present invention is as follows:
S1. Deploy a word-frequency-based knowledge point classification model and a semantics-based knowledge point classification model on separate nodes of the preset classification cluster.
The word-frequency-based knowledge point classification model works as follows:
(1) a new question is input;
(2) the new question is converted to LaTeX format;
(3) the text is segmented into words, and the stop words obtained during training are deleted;
(4) the new question is built into a word frequency vector;
(5) the word frequency vector is input into the pre-trained word-frequency-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The word-frequency-based knowledge point classification model is trained as follows:
(1) the training questions are input;
(2) the training questions are converted to LaTeX format;
(3) the text is segmented into words;
(4) the weight of each word is calculated using a stop word algorithm (TF-IDF), the stop words are obtained according to a set threshold, and the stop words are deleted from the training questions;
(5) each training question is converted into a word frequency vector;
(6) the corresponding parameters are set for the classification algorithm;
(7) all the word frequency vectors are input into the classification algorithm for training, yielding the word-frequency-based knowledge point classification model.
The semantics-based knowledge point classification model works as follows:
(1) a new question is input;
(2) the new question is converted to LaTeX format;
(3) the text is segmented into words, and the stop words obtained during training are deleted;
(4) the new question is input into the pre-trained semantic feature extraction model to obtain the corresponding semantic vector;
(5) the semantic vector is input into the pre-trained semantics-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The semantics-based knowledge point classification model is trained as follows:
(1) the training questions are input;
(2) the training questions are converted to LaTeX format;
(3) the text is segmented into words;
(4) the segmented training questions are input into a semantic feature extraction model (such as a word2vec model), and a semantic feature extraction model for the training questions is obtained according to the set model parameters;
(5) each training question is input into the semantic feature extraction model to obtain a semantic vector for each training question;
(6) a classification algorithm is chosen (such as the random forest or xgboost algorithm);
(7) all the semantic vectors are input into the classification algorithm for training, yielding the semantics-based knowledge point classification model.
S2. Send the first question to each node in the preset classification cluster; each node classifies the first question as follows:
S21. Convert the symbols in the first question according to the preset escape characters to obtain a second question.
Here, the escape character for the symbol "√" is "sqrt", the escape character for the symbol "=" is the equals sign entered in English input mode, and the escape character for the symbol "-" is the minus sign entered in English input mode. The second question obtained after escape-character conversion is "Which elements make up the set of positive-integer values for which the function y=sqrt(5-x) is meaningful?"
S22. Parse the second question to obtain a Chinese character stack and a non-Chinese character stack;
segment the characters in the Chinese character stack into words using a word segmentation algorithm, and match the formulas stored in the non-Chinese character stack using preset regular expressions, to obtain a third question;
calculate the weight of each word in the third question;
sort the words in the third question by weight to form a first queue;
delete from the third question the words corresponding to the first preset number of elements of the first queue to obtain a fourth question;
build a word frequency vector according to the fourth question, where the number of elements in the word frequency vector is the number of distinct words in the fourth question and the value of each element is the number of times the corresponding word occurs in the fourth question;
build a semantic feature extraction model according to a preset dimension;
build the semantic vector corresponding to the fourth question according to the semantic feature extraction model.
Here, segmenting the characters in the Chinese character stack using the jieba word segmentation algorithm and matching the formulas stored in the non-Chinese character stack using preset regular expressions proceeds as follows:
The characters in the Chinese character string are first segmented using the jieba word segmentation algorithm, giving the third question "make@function@meaningful@of@y@=@sqrt@(@5@-@x@)@positive integer@value range@composition@of@set@of@element@have@?", where the symbol @ denotes a separator.
The weight of each word in the third question is calculated using the TF-IDF algorithm. The weights of the words in the third question are, in order:
"make": 0.05, "function": 0.51, "meaningful": 0.22, "of": 0.02, "y": 0.09, "=": 0.07, "sqrt": 0.22, "(": 0.01, "5": 0.01, "-": 0.07, "x": 0.07, ")": 0.01, "positive integer": 0.49, "value range": 0.44, "composition": 0.15, "of": 0.02, "set": 0.38, "of": 0.02, "element": 0.35, "have": 0.05, "?": 0.01. The words with smaller weights are deleted from the third question to obtain the fourth question, which is "function@meaningful@sqrt@positive integer@value range@composition@set@element".
The number of times each word occurs in the fourth question is counted, and the word frequency vector of the fourth question is built over the vocabulary formed by all non-stop words, as follows:
If the training sets contain 1000 distinct non-stop words, the word frequency vector of the fourth question has length 1000, and each element of the vector is the number of times the corresponding word occurs in the question. For a word that occurs in the fourth question, such as "function", which occurs only once, the corresponding dimension of the word frequency vector is 1; if "function" occurred twice in the question, that dimension would be 2. The dimensions of all words that do not occur in the question are 0.
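The construction of the word frequency vector can be sketched as follows. The six-word vocabulary is a hypothetical stand-in for the 1000 non-stop words mentioned above, and the function name `word_frequency_vector` is illustrative.

```python
from collections import Counter

def word_frequency_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the question.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary built from all non-stop words in the training sets.
vocabulary = ["function", "meaningful", "sqrt", "set", "element", "triangle"]
question = ["function", "meaningful", "sqrt", "set", "element", "function"]
vector = word_frequency_vector(question, vocabulary)
```

Here "function" occurs twice, so its dimension is 2, and "triangle" does not occur, so its dimension is 0.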
Each word of the fourth question is fed into a trained semantic model (for example a word2vec or GloVe model) to obtain a vector for that word. Because all word vectors have the same length, they can be superimposed, that is, added dimension by dimension, to give a single vector representing the whole question. A semantic model is a representation that preserves semantic context relations. The semantic vector of the fourth question is built as follows:
The fourth question is fed into the pre-trained semantic model, which yields the semantic vector of each word according to the parameter settings of the model. In practice the word vectors are usually 100 to 200 dimensions long; to keep the illustration simple, each word vector here has 4 dimensions.
Word               Dim 1   Dim 2   Dim 3   Dim 4
function           0.41    0.12    0.02    0.31
meaningful         0.21    0.01    0.02    0.22
sqrt               0.02    0.08    0.06    0.05
positive integer   0.35    0.14    0.21    0.33
value range        0.01    0.03    0.05    0.06
constitute         0.23    0.41    0.05    0.02
set                0.14    0.02    0.13    0.09
element            0.06    0.04    0.07    0.08
Finally, the values of the words above are added dimension by dimension:
1.43 0.85 0.61 1.16
Each dimension is then divided by the total number of words in the fourth question (8):
0.17875 0.10625 0.07625 0.145
This is the semantic vector of the fourth question.
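The averaging of the word vectors can be checked with a short sketch. The 4-dimensional vectors are the illustrative values from the table above; the function name `question_vector` and the English word labels are assumptions of this example, and in practice the vectors would come from a trained word2vec or GloVe model.

```python
# Illustrative 4-dimensional word vectors from the worked example.
WORD_VECTORS = {
    "function":         [0.41, 0.12, 0.02, 0.31],
    "meaningful":       [0.21, 0.01, 0.02, 0.22],
    "sqrt":             [0.02, 0.08, 0.06, 0.05],
    "positive integer": [0.35, 0.14, 0.21, 0.33],
    "value range":      [0.01, 0.03, 0.05, 0.06],
    "constitute":       [0.23, 0.41, 0.05, 0.02],
    "set":              [0.14, 0.02, 0.13, 0.09],
    "element":          [0.06, 0.04, 0.07, 0.08],
}


def question_vector(words, word_vectors):
    """Sum the word vectors dimension by dimension, then divide by the word count.

    Assumes `words` is non-empty and every word has a vector of equal length.
    """
    dims = len(word_vectors[words[0]])
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(word_vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]


semantic_vector = question_vector(list(WORD_VECTORS), WORD_VECTORS)
print([round(v, 5) for v in semantic_vector])
```

Dividing by the word count keeps the vector on a comparable scale for questions of different lengths.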
S23. According to the preset knowledge point classification model, obtain the first classification set and the first association degree set corresponding to the word frequency vector and the semantic vector;
Here, the first knowledge point set is: {value range of set elements; representation of functions; representation of sets; extreme values of set elements; existence of roots and judging the number of roots; value of a function; radical operations}, and the association degrees between the fourth question and these knowledge points are respectively: {0.85, 0.04, 0.03, 0.02, 0.03, 0.02, 0.01}. The second knowledge point set is: {value range of set elements; radical operations; representation of sets; extreme values of set elements; equality of sets; domain of a function, its determination and the value of a function}, and the association degrees between the fourth question and these knowledge points are respectively: {0.73, 0.08, 0.08, 0.04, 0.04, 0.02}. The knowledge points of the first and second knowledge point sets are combined to form the third knowledge point set. The knowledge points of the third knowledge point set match the features of both the word frequency vector and the semantic vector of the fourth question and have a relatively high association degree with the fourth question; the classifications corresponding to the knowledge points of the third knowledge point set form the first classification set.
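One way to combine the two result sets is sketched below. Summing the association degrees of knowledge points that appear in both sets is an assumption, since the text does not state the combination rule; it is used because it reproduces the combined values 1.58, 0.11, 0.09 and 0.06 quoted later in the description. The knowledge point names are illustrative English glosses, and `merge_association` and `top_points` are hypothetical helper names.

```python
# Association degrees from the word-frequency-based model (first set).
FIRST = {
    "value range of set elements": 0.85,
    "representation of functions": 0.04,
    "representation of sets": 0.03,
    "extreme values of set elements": 0.02,
    "existence and number of roots": 0.03,
    "value of a function": 0.02,
    "radical operations": 0.01,
}

# Association degrees from the semantics-based model (second set).
SECOND = {
    "value range of set elements": 0.73,
    "radical operations": 0.08,
    "representation of sets": 0.08,
    "extreme values of set elements": 0.04,
    "equality of sets": 0.04,
    "domain of a function": 0.02,
}


def merge_association(first, second):
    """Union of both sets; degrees of shared knowledge points are summed (assumed rule)."""
    merged = dict(first)
    for point, degree in second.items():
        merged[point] = round(merged.get(point, 0.0) + degree, 2)
    return merged


def top_points(merged, k):
    """Return the k knowledge points with the largest combined degree."""
    return sorted(merged, key=merged.get, reverse=True)[:k]


third_set = merge_association(FIRST, SECOND)
print(top_points(third_set, 4))
```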
S24. Calculate the similarity between the first question and the questions contained in each classification of the first classification set, obtaining a similarity set corresponding to each classification of the first classification set;
Obtain the second association degree set according to the similarity set and the first association degree set;
Sort the classifications of the first classification set according to the second association degree set to obtain the first classification queue;
Take a preset number of classifications from the first classification queue to obtain the second classification set;
Take the questions in the second classification set whose similarity to the first question exceeds a preset similarity threshold to obtain the approximate question set;
The cosine distance formula used to calculate the similarity is:
$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
where x is the feature vector of the first question and y is the feature vector of a question in the classification; the closer the value of cos θ is to 1, the more similar the two questions are.
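The cosine measure can be written as a small function. This is a generic sketch of the standard formula; returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for empty questions (assumption)
    return dot / (norm_x * norm_y)


# Parallel vectors score close to 1, orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```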
The first association degrees between the first question and the classifications of the first classification set are respectively:
{value range of set elements: 1.58, representation of functions: 0.04, representation of sets: 0.11, extreme values of set elements: 0.06, existence of roots and judging the number of roots: 0.03, value of a function: 0.02, radical operations: 0.09, equality of sets: 0.04, domain of a function, its determination and the value of a function: 0.02}. The second association degree set is obtained from the similarity set and the first association degree set as follows:
(1) Take the four elements of the first classification set with the largest first association degrees, namely value range of set elements, representation of sets, radical operations and extreme values of set elements, and extract the TF-IDF vectors of all questions in the question bank that belong to these knowledge points.
(2) Using the cosine distance formula, compute the cosine distance between the feature vector of each extracted question and that of the first question.
(3) Sort the cosine distances between all extracted questions and the first question to obtain the second association degree set. From the classifications with larger second association degrees, choose the questions most similar to the first question to form the approximate question set.
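The selection described in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: similarities are taken as already computed, the question identifiers, the `top_k` value and the threshold are made up for the example, and questions are ranked directly by their similarity instead of through a separately derived second association degree per classification.

```python
def approximate_questions(association, similar_by_class, top_k, threshold):
    """Pick questions above `threshold` from the `top_k` best-associated classes.

    association:      {classification: association degree}
    similar_by_class: {classification: [(question_id, similarity), ...]}
    Returns (question_id, similarity) pairs, most similar first.
    """
    top_classes = sorted(association, key=association.get, reverse=True)[:top_k]
    best = {}
    for cls in top_classes:
        for question_id, similarity in similar_by_class.get(cls, []):
            if similarity > threshold:
                # A question may belong to several classes; keep its best score.
                best[question_id] = max(similarity, best.get(question_id, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: three classifications and four candidate questions.
association = {"A": 1.58, "B": 0.11, "C": 0.02}
similar_by_class = {
    "A": [("q1", 0.92), ("q2", 0.40)],
    "B": [("q3", 0.75), ("q1", 0.60)],
    "C": [("q4", 0.99)],
}
result = approximate_questions(association, similar_by_class, top_k=2, threshold=0.5)
print(result)
```

Question q4 is highly similar but is excluded because its classification has a low association degree, which is the effect the method aims for.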
S25. Push the approximate question set;
S26. Obtain the classification corresponding to each question in the approximate question set to obtain the second classification set;
Update the preset knowledge point classification model according to the second classification set.
The knowledge point classification model is updated as follows:
(1) First compute the length of the question after word segmentation. The fourth question is "function@meaningful@sqrt@positive integer@value range@constitute@set@element", so its length is 8.
(2) updateWeight is a parameter that depends on the question length: when the length of the question is greater than 5, updateWeight = 0.5; otherwise:
For the fourth question, therefore, updateWeight = 0.5.
(3) Compute incomeWeight, which is the average similarity of the approximate questions under the knowledge point whose similarity exceeds 0.1. Let A be the set of all questions under the knowledge point, a ∈ A, and let x be the question currently being queried, such as the fourth question.
Define A' = {a ∈ A | sim(a, x) > 0.1}, where sim(a, x) is the similarity between questions a and x. The incomeWeight of each knowledge point is then:
$$\mathrm{incomeWeight} = \frac{1}{|A'|} \sum_{a \in A'} \mathrm{sim}(a, x)$$
(4) Update the weight of each knowledge point according to the following formula:
newWeight = oldWeight × (1 - updateWeight) + incomeWeight × updateWeight
where newWeight is the knowledge point weight after the update and oldWeight is the original knowledge point weight. For the fourth question, for example, the old knowledge point weights are 1.58, 0.11, 0.09 and 0.06; the resulting newWeight values are the new knowledge point weights.
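The update rule in steps (1) to (4) can be sketched as below. The function names are hypothetical. Because the rule for questions of length 5 or less is not given in this text, the sketch takes that value as a caller-supplied parameter, and returning 0.0 when no question passes the 0.1 cut-off is likewise an assumption. The similarity values in the usage lines are invented for illustration.

```python
def choose_update_weight(question_length, short_question_weight):
    """0.5 for questions longer than 5 words; otherwise a caller-supplied value.

    The rule for short questions is not recoverable from the text, so it is
    left as a parameter (assumption).
    """
    return 0.5 if question_length > 5 else short_question_weight


def income_weight(similarities, min_similarity=0.1):
    """Average similarity of the questions whose similarity exceeds the cut-off."""
    kept = [s for s in similarities if s > min_similarity]
    return sum(kept) / len(kept) if kept else 0.0  # empty case is an assumption


def new_weight(old_weight, update_weight, income):
    """newWeight = oldWeight * (1 - updateWeight) + incomeWeight * updateWeight."""
    return old_weight * (1 - update_weight) + income * update_weight


# Fourth question: length 8, old weight 1.58 for its best knowledge point.
update = choose_update_weight(8, short_question_weight=0.2)
income = income_weight([0.9, 0.5, 0.05])  # hypothetical similarities
updated = new_weight(1.58, update, income)
print(update, income, updated)
```

With updateWeight = 0.5 the new weight is the midpoint between the old weight and the observed average similarity, so a single query moves the weight only halfway towards the new evidence.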
In summary, the present invention provides a method and system for classifying and pushing questions. The first question is compared for similarity with the questions in the knowledge point classifications obtained from the knowledge point classification model; the association degree between the first question and each knowledge point classification is calculated from the similarity; and questions highly similar to the first question are then extracted from the knowledge point classifications with larger association degrees and pushed to the user as approximate questions, which improves the relevance of the pushed approximate questions to the first question. Further, as the foregoing description shows, using a distributed cluster helps handle approximate-question push tasks for large batches of questions and improves push efficiency. Further, periodically updating the knowledge point classification model according to the classification results improves the accuracy of the classification model and thus the relevance of the pushed approximate questions. Further, the weight value of each node can be adjusted according to the actual application scenario, which helps push the approximate questions that best match the user's expectations according to the user's different needs. Further, converting the symbols in formulas with preset escape characters normalizes symbols that are written differently but mean the same thing, so that the information in the question is used accurately and fully; this improves the accuracy of question classification and thereby the relevance of the pushed questions and the efficiency of obtaining approximate questions. Further, jointly considering the word frequency and the semantics of the question improves classification accuracy and thereby the relevance of the pushed approximate questions to the first question. Further, the question can be segmented while its information is retained, which helps extract feature vectors from the question. In addition, storing Chinese and non-Chinese characters in stacks keeps the character order unchanged, so the original meaning of the question is not altered during segmentation. Further, the invention uses a stop-word calculation algorithm to compute the weight of each word in the question and deletes the words with smaller weights from the third question, so that different stop words can be obtained for different subjects, improving the relevance of the retrieved approximate questions. Further, choosing questions highly similar to the first question from the knowledge point classifications associated with the first question to form the approximate question set improves the relevance of the pushed approximate question set to the first question. The invention also provides a system for classifying and pushing questions, which improves the accuracy of question classification and thereby further improves the relevance of the pushed approximate questions to the first question.
The above are only embodiments of the present invention and do not thereby limit its patent scope. Any equivalent transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims (10)

1. A method for classifying and pushing questions, characterized by comprising:
S1. classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set, wherein each element of the first association degree set is the association degree between the first question and a classification in the first classification set;
S2. calculating the similarity between the first question and the questions contained in each classification of the first classification set to obtain a similarity set corresponding to each classification of the first classification set;
S3. obtaining a second association degree set according to the similarity set and the first association degree set;
S4. obtaining an approximate question set according to the second association degree set;
S5. pushing the approximate question set.
2. The method for classifying and pushing questions according to claim 1, characterized in that S1 specifically comprises:
deploying different preset knowledge point classification models on the nodes of a preset classification cluster;
sending the first question to each node in the preset classification cluster to obtain the first classification set and the first association degree set.
3. The method for classifying and pushing questions according to claim 2, characterized by further comprising:
obtaining the classification corresponding to each question in the approximate question set to obtain a second classification set;
updating the preset knowledge point classification model according to the second classification set.
4. The method for classifying and pushing questions according to claim 2, characterized in that sending the first question to each node in the preset classification cluster to obtain the first classification set and the first association degree set specifically comprises:
sending the first question to each node in the preset classification cluster to obtain the classification set and association degree set corresponding to that node;
obtaining the weight value of the node according to the knowledge point classification model deployed on the node;
obtaining the first classification set and the first association degree set according to the weight value of the node and the classification set and association degree set corresponding to the node.
5. The method for classifying and pushing questions according to claim 1, characterized in that S1 specifically comprises:
converting the symbols in the first question according to preset escape characters to obtain a second question;
extracting features of the second question to obtain a feature vector, the feature vector comprising a word frequency vector and a semantic vector;
obtaining, according to the preset knowledge point classification model, the first classification set and the first association degree set corresponding to the feature vector.
6. The method for classifying and pushing questions according to claim 5, characterized in that obtaining, according to the preset knowledge point classification model, the first classification set and the first association degree set corresponding to the feature vector specifically comprises:
deploying, in a preset classification cluster, a node with a word-frequency-based knowledge point classification model;
deploying, in the preset classification cluster, a node with a semantics-based knowledge point classification model;
sending the first question to each node in the preset classification cluster to obtain the first classification set and the first association degree set.
7. The method for classifying and pushing questions according to claim 5, characterized in that extracting features of the second question to obtain a feature vector, the feature vector comprising a word frequency vector and a semantic vector, specifically comprises:
parsing the second question to obtain a Chinese character stack and a non-Chinese character stack;
segmenting the characters in the Chinese character stack with a word segmentation algorithm, and matching the formula stored in the non-Chinese character stack with a preset regular expression, to obtain a third question;
deleting stop words from the third question to obtain a fourth question;
building a word frequency vector according to the fourth question, wherein the number of elements in the word frequency vector is the number of distinct words in the fourth question, and the value of each element is the number of times the word corresponding to that element occurs in the fourth question;
establishing a semantic feature extraction model according to a preset dimension;
building the semantic vector corresponding to the fourth question according to the semantic feature extraction model.
8. The method for classifying and pushing questions according to claim 7, characterized in that deleting stop words from the third question to obtain the fourth question specifically comprises:
calculating the weight of each word in the third question;
sorting the words of the third question according to the weights to form a first queue;
deleting from the third question the words corresponding to the first preset number of elements of the first queue to obtain the fourth question.
9. The method for classifying and pushing questions according to claim 1, characterized in that obtaining the approximate question set according to the second association degree set specifically comprises:
sorting the classifications of the first classification set according to the second association degree set to obtain a first classification queue;
taking a preset number of classifications from the first classification queue to obtain a second classification set;
taking the questions in the second classification set whose similarity to the first question exceeds a preset similarity threshold to obtain the approximate question set.
10. A system for classifying and pushing questions, characterized by comprising:
a classification module, configured to classify a first question according to a preset knowledge point classification model to obtain a first classification set and a first association degree set, wherein each element of the first association degree set is the association degree between the first question and a classification in the first classification set;
a calculation module, configured to calculate the similarity between the first question and the questions contained in each classification of the first classification set to obtain a similarity set corresponding to each classification of the first classification set;
a first processing module, configured to obtain a second association degree set according to the similarity set and the first association degree set;
a second processing module, configured to obtain an approximate question set according to the second association degree set;
a push module, configured to push the approximate question set.
CN201611009278.0A 2016-11-16 2016-11-16 Method and system for classifying and pushing questions Active CN106599054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611009278.0A CN106599054B (en) 2016-11-16 2016-11-16 Method and system for classifying and pushing questions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611009278.0A CN106599054B (en) 2016-11-16 2016-11-16 Method and system for classifying and pushing questions

Publications (2)

Publication Number Publication Date
CN106599054A true CN106599054A (en) 2017-04-26
CN106599054B CN106599054B (en) 2019-12-24

Family

ID=58590375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611009278.0A Active CN106599054B (en) 2016-11-16 2016-11-16 Method and system for classifying and pushing questions

Country Status (1)

Country Link
CN (1) CN106599054B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
CN105095223A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Method for classifying texts and server
CN105893362A (en) * 2014-09-26 2016-08-24 北大方正集团有限公司 A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN104834729A (en) * 2015-05-14 2015-08-12 百度在线网络技术(北京)有限公司 Title recommendation method and title recommendation device
CN105589972A (en) * 2016-01-08 2016-05-18 天津车之家科技有限公司 Method and device for training classification model, and method and device for classifying search words
CN106021288A (en) * 2016-04-27 2016-10-12 南京慕测信息科技有限公司 Method for rapid and automatic classification of classroom testing answers based on natural language analysis
CN105930509A (en) * 2016-05-11 2016-09-07 华东师范大学 Method and system for automatic extraction and refinement of domain concept based on statistics and template matching

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
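Wu Xu et al.: "Text similarity evaluation algorithm for structured data in institutional repositories", Technology Research *
Dong Aogen et al.: "Automatic association method for knowledge points and test questions based on the vector space model", Computer and Modernization *
Xu Xin: "Information Analysis Methods Based on Text Feature Computation", 30 November 2015, Shanghai Scientific and Technological Literature Press *
Mai Hao: "A Practical Guide to Machine Learning: Case Application Analysis", 30 April 2014, China Machine Press *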

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463553A (en) * 2017-09-12 2017-12-12 复旦大学 For the text semantic extraction, expression and modeling method and system of elementary mathematics topic
CN107463553B (en) * 2017-09-12 2021-03-30 复旦大学 Text semantic extraction, representation and modeling method and system for elementary mathematic problems
CN108182275A (en) * 2018-01-24 2018-06-19 上海互教教育科技有限公司 A kind of mathematics variant training topic supplying system and correlating method
CN108376132B (en) * 2018-03-16 2020-08-28 中国科学技术大学 Method and system for judging similar test questions
CN108376132A (en) * 2018-03-16 2018-08-07 中国科学技术大学 The determination method and system of similar examination question
CN108765221A (en) * 2018-05-15 2018-11-06 广西英腾教育科技股份有限公司 Pumping inscribes method and device
CN109189920A (en) * 2018-08-02 2019-01-11 上海欣方智能系统有限公司 Sweep-black case classification method and system
CN109685137A (en) * 2018-12-24 2019-04-26 上海仁静信息技术有限公司 A kind of topic classification method, device, electronic equipment and storage medium
CN109785691A (en) * 2019-01-18 2019-05-21 广东小天才科技有限公司 A kind of method and system by terminal assisted learning
CN109785691B (en) * 2019-01-18 2021-09-24 广东小天才科技有限公司 Method and system for assisting learning through terminal
CN110136512A (en) * 2019-04-17 2019-08-16 许昌学院 A kind of English grade examzation examination exercise and the automatic clustering system of answer parsing
CN110472044A (en) * 2019-07-11 2019-11-19 平安国际智慧城市科技股份有限公司 Knowledge point classification method, device, readable storage medium storing program for executing and the server of mathematical problem
CN112989760A (en) * 2019-12-17 2021-06-18 北京一起教育信息咨询有限责任公司 Method and device for labeling subjects, storage medium and electronic equipment
WO2021253480A1 (en) * 2020-06-19 2021-12-23 平安科技(深圳)有限公司 Intelligent exercise recommendation method and apparatus, computer device and storage medium
CN111881285A (en) * 2020-07-28 2020-11-03 扬州大学 Wrong question collection and important and difficult point knowledge extraction method
CN112257966A (en) * 2020-12-18 2021-01-22 北京世纪好未来教育科技有限公司 Model processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106599054B (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN106599054A (en) Method and system for title classification and push
CN102411563B (en) Method, device and system for identifying target words
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105808526B (en) Commodity short text core word extracting method and device
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN102289522B (en) Method of intelligently classifying texts
CN108763213A (en) Theme feature text key word extracting method
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN106651696B (en) Approximate question pushing method and system
CN107992542A (en) A kind of similar article based on topic model recommends method
CN107122349A (en) A kind of feature word of text extracting method based on word2vec LDA models
CN102033919A (en) Method and system for extracting text key words
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN110134799B (en) BM25 algorithm-based text corpus construction and optimization method
CN107943824A (en) A kind of big data news category method, system and device based on LDA
CN104484380A (en) Personalized search method and personalized search device
CN103020167B (en) A kind of computer Chinese file classification method
CN103593431A (en) Internet public opinion analyzing method and device
CN109815400A (en) Personage's interest extracting method based on long text
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN111221968A (en) Author disambiguation method and device based on subject tree clustering
CN107463715A (en) English social media account number classification method based on information gain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant