CN106599054A - Method and system for title classification and push - Google Patents
- Publication number
- CN106599054A, CN201611009278.0A
- Authority
- CN
- China
- Prior art keywords
- exercise question
- classification
- degree
- word
- association
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/243—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Fuzzy Systems (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the field of classification, in particular to a method and a system for title classification and push. The method comprises the following steps of classifying a first title according to a preset knowledge point classification model to obtain a first classification set and a first correlation degree set, wherein elements in the first correlation degree set are correlation degrees of the first title and the classifications in the first classification set; computing similarity between the first title and the titles included in the classifications in the first classification set to obtain a similarity set corresponding to the classifications in the first classification set; obtaining a second correlation degree set according to the similarity set and the first correlation degree set; obtaining an approximate title set according to the second correlation degree set; and pushing the approximate title set. According to the method and the system, the accuracy of title classification and the correlation of the pushed approximate titles are improved.
Description
Technical field
The present invention relates to the field of classification, and more particularly to a method and system for classifying exercise questions and pushing approximate questions.
Background art
In the era of big data, the volume of data produced every day is growing explosively. K12 education is one of the most important forms of education in China, and the amount of data it produces every day is considerable. The scale of online education in China is growing at more than 30% per year, and its market valuation will exceed 160 billion yuan. K12 online education resources have become hotly contested among enterprises. If the ever-growing body of question data can be analysed and exploited, with each question reasonably classified under its corresponding knowledge point, then when a student meets a difficult question or a question in a weak area, questions strongly correlated with that knowledge point can be pushed for further practice, which improves the user experience of the application.
The patent document with application No. 201510246727.2 provides a question recommendation method: receiving a retrieval question; obtaining topic attribute information of the retrieval question, and obtaining preliminary retrieval results according to the topic attribute information; obtaining user description information of the user, and sorting the preliminary retrieval results according to the user description information to obtain sorted results; and selecting a preset number of results from the sorted results as the recommended questions. This improves the correlation between the recommended questions and the retrieval question, and thereby the recommendation effect.
However, because that patent document sorts the preliminary retrieval results according to the user description information, the accuracy of its classification results depends on the accuracy of the user description information.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for question classification and pushing that improve the accuracy of question classification and the correlation of the pushed questions.
To solve the above technical problem, the present invention adopts the following technical solution.
The present invention provides a method for question classification and pushing, comprising:
S1, classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first correlation degree set, wherein each element of the first correlation degree set is the correlation degree between the first question and a classification in the first classification set;
S2, calculating the similarity between the first question and the questions included in each classification in the first classification set, to obtain a similarity set corresponding to each classification in the first classification set;
S3, obtaining a second correlation degree set according to the similarity set and the first correlation degree set;
S4, obtaining an approximate question set according to the second correlation degree set;
S5, pushing the approximate question set.
The present invention also provides a system for question classification and pushing, comprising:
a classification module, configured to classify a first question according to a preset knowledge point classification model to obtain a first classification set and a first correlation degree set, wherein each element of the first correlation degree set is the correlation degree between the first question and a classification in the first classification set;
a computing module, configured to calculate the similarity between the first question and the questions included in each classification in the first classification set, to obtain a similarity set corresponding to each classification in the first classification set;
a first processing module, configured to obtain a second correlation degree set according to the similarity set and the first correlation degree set;
a second processing module, configured to obtain an approximate question set according to the second correlation degree set;
a pushing module, configured to push the approximate question set.
The beneficial effects of the present invention are as follows. Unlike the prior art, which pushes related approximate questions directly according to the classification results of a classification model, the present invention performs similarity analysis between the first question and the questions in the knowledge point classifications obtained from the knowledge point classification model, calculates the correlation degree between the first question and each knowledge point classification from that similarity, and then extracts, from the knowledge point classifications with larger correlation degrees, the questions most similar to the first question and pushes them to the user as approximate questions. This improves the correlation between the pushed approximate questions and the first question.
Description of the drawings
Fig. 1 is a flow chart of the method for question classification and pushing according to the present invention;
Fig. 2 is a structural block diagram of the system for question classification and pushing according to the present invention;
Reference numerals:
1, classification module; 2, computing module; 3, first processing module; 4, second processing module; 5, pushing module.
Detailed description of the embodiments
To describe the technical content, objectives and effects of the present invention in detail, the following explanation is given with reference to the embodiments and the accompanying drawings.
The key concept of the present invention is as follows: by performing similarity analysis between the first question and the questions in the knowledge point classifications obtained from the knowledge point classification model, and recalculating the correlation degree between the first question and each knowledge point classification, the correlation between the pushed approximate questions and the first question can be improved.
As shown in Fig. 1, the present invention provides a method for question classification and pushing, comprising:
S1, classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first correlation degree set, wherein each element of the first correlation degree set is the correlation degree between the first question and a classification in the first classification set;
S2, calculating the similarity between the first question and the questions included in each classification in the first classification set, to obtain a similarity set corresponding to each classification in the first classification set;
S3, obtaining a second correlation degree set according to the similarity set and the first correlation degree set;
S4, obtaining an approximate question set according to the second correlation degree set;
S5, pushing the approximate question set.
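Steps S2 and S3 can be illustrated with a minimal Python sketch. The text does not state which similarity measure is used or how the similarity set and the first correlation degree set are combined, so both choices here are assumptions made only for illustration: Jaccard similarity over word sets, and a second correlation degree equal to the first correlation degree multiplied by the mean similarity within the classification. The names `jaccard`, `second_correlation` and `bank` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two token lists (assumed measure, not from the source)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def second_correlation(first_question, first_corr, bank):
    """S2 + S3 sketch.

    first_corr: classification -> first correlation degree (output of S1).
    bank: classification -> list of tokenized questions in that classification.
    Returns classification -> second correlation degree.
    """
    result = {}
    for category, corr in first_corr.items():
        # S2: similarity set for this classification
        sims = [jaccard(first_question, q) for q in bank.get(category, [])]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        # S3 (assumed combination rule): scale the first correlation degree
        result[category] = corr * mean_sim
    return result


first_q = ["function", "domain", "sqrt"]
bank = {
    "A": [["function", "domain", "sqrt"], ["function", "domain"]],
    "B": [["set", "element"]],
}
out = second_correlation(first_q, {"A": 0.6, "B": 0.4}, bank)
```

In this toy input, classification "B" shares no words with the first question, so its second correlation degree drops to zero even though the classifier initially assigned it 0.4.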
Further, S1 is specifically:
deploying a different preset knowledge point classification model on each node of a preset classification cluster;
sending the first question to each node of the preset classification cluster to obtain the first classification set and the first correlation degree set.
As can be seen from the above description, using a distributed cluster is conducive to handling approximate-question pushing tasks for large batches of questions and improves pushing efficiency.
Further, the method also comprises:
obtaining the classification corresponding to each question in the approximate question set to obtain a second classification set;
updating the preset knowledge point classification model according to the second classification set.
As can be seen from the above description, periodically updating the knowledge point classification model according to the classification results improves the classification accuracy of the model, and thereby the correlation of the pushed approximate questions.
Further, sending the first question to each node of the preset classification cluster to obtain the first classification set and the first correlation degree set is specifically:
sending the first question to each node of the preset classification cluster to obtain a classification set and a correlation degree set corresponding to each node;
obtaining the weight of each node according to the knowledge point classification model deployed on that node;
obtaining the first classification set and the first correlation degree set according to the weight of each node and the classification set and correlation degree set corresponding to that node.
As can be seen from the above description, different classification models are deployed on the nodes of the classification cluster, so the classification results obtained by the nodes differ. The weight of each node is determined by the classification model deployed on it, and the weights and the corresponding classification results are analysed together to obtain the knowledge point classifications with a large correlation degree to the first question. The weight of each node can be adjusted to the actual application scenario, which helps push the approximate questions that best meet the user's expectations under different user requirements.
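One way to combine per-node results by node weight is a weighted average of the correlation degrees, sketched below in Python. The text only says the weights and the per-node results are analysed together, so the weighted-average rule, the function name `merge_node_results` and the node names are assumptions for illustration.

```python
def merge_node_results(node_results, node_weights):
    """Merge per-node {classification: correlation degree} maps.

    Assumed rule: each node's correlation degrees are scaled by its
    normalized weight and summed per classification.
    """
    total = sum(node_weights.values())
    merged = {}
    for node, result in node_results.items():
        share = node_weights[node] / total
        for category, corr in result.items():
            merged[category] = merged.get(category, 0.0) + share * corr
    return merged


node_results = {
    "word_frequency_node": {"A": 0.8, "B": 0.2},
    "semantic_node": {"A": 0.6, "C": 0.4},
}
node_weights = {"word_frequency_node": 1.0, "semantic_node": 3.0}
merged = merge_node_results(node_results, node_weights)
```

Raising the weight of one node shifts the merged result toward that node's model, which is how the weights could be tuned to a given application scenario.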
Further, S1 is specifically:
converting the symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector, the feature vector including a word frequency vector and a semantic vector;
obtaining, according to the preset knowledge point classification model, the first classification set and the first correlation degree set corresponding to the feature vector.
As can be seen from the above description, questions from different sources may be described in different ways, and different formula editors in particular describe the symbols in a formula very differently. Converting the symbols in a formula with preset escape characters normalizes symbols that are written differently but mean the same thing, so that the information in the question is used accurately and fully. This improves the accuracy of question classification, and thereby the correlation of the pushed questions and the efficiency of obtaining approximate questions.
For example, question 1 awaiting approximate questions is "What are the elements of the set formed by the positive integer values for which the function $y=\sqrt{5-x}$ is meaningful?", and question 2 awaiting approximate questions is "What are the elements of the set formed by the positive integer values for which the function $y=(5-x)^{1/2}$ is meaningful?". The two questions are in fact essentially identical, but existing methods cannot make full use of the formula information in a question: they can only push questions that ask for the value range of a variable for which a function is meaningful, and cannot specifically push questions that ask for the value range of a variable for which a function containing a radical sign is meaningful. Existing methods also cannot recognise and judge identical questions, so the same question has to be parsed repeatedly to obtain its approximate questions, which is inefficient.
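The normalization idea can be sketched in Python as a table of escape characters applied to the question text. The mapping for the radical sign, the equals sign and the minus sign follows the embodiment; the full-width parentheses and the rule that rewrites a power of 1/2 as `sqrt` are additions assumed here so that the two example formulas normalize to the same string. The names `ESCAPES` and `normalize_symbols` are hypothetical.

```python
import re

# Symbol -> escape character. The last three entries are assumptions.
ESCAPES = {
    "√": "sqrt",
    "＝": "=",   # full-width equals sign -> ASCII equals sign
    "－": "-",   # full-width minus sign -> ASCII minus sign
    "−": "-",    # Unicode minus sign (assumed)
    "（": "(",   # full-width parentheses (assumed)
    "）": ")",
}

# Assumed extra rule: "(expr)^(1/2)" is rewritten as "sqrt(expr)".
_HALF_POWER = re.compile(r"\(([^()]*)\)\^\(1/2\)")


def normalize_symbols(text):
    """Replace each known symbol with its escape character."""
    for symbol, escape in ESCAPES.items():
        text = text.replace(symbol, escape)
    return _HALF_POWER.sub(r"sqrt(\1)", text)


q1 = normalize_symbols("y＝√(5－x)")
q2 = normalize_symbols("y=(5-x)^(1/2)")
```

After normalization both formulas are the same string, so downstream feature extraction treats the two questions identically.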
Further, obtaining, according to the preset knowledge point classification model, the first classification set and the first correlation degree set corresponding to the feature vector is specifically:
deploying a word-frequency-based knowledge point classification model on a node of the preset classification cluster;
deploying a semantics-based knowledge point classification model on a node of the preset classification cluster;
sending the first question to each node of the preset classification cluster to obtain the first classification set and the first correlation degree set.
As can be seen from the above description, the knowledge point classifications related to the first question that are obtained from the classification cluster include classification results from two dimensions, word frequency and semantics. Because both the word frequency and the semantics of the question are considered, classification accuracy is improved, and with it the correlation between the pushed approximate questions and the first question.
Further, extracting the features of the second question to obtain a feature vector, the feature vector including a word frequency vector and a semantic vector, is specifically:
parsing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation on the characters in the Chinese character stack using a word segmentation algorithm, and matching the formulas stored in the non-Chinese character stack using a preset regular expression, to obtain a third question;
deleting stop words from the third question to obtain a fourth question;
building a word frequency vector according to the fourth question, wherein the number of elements in the word frequency vector is the number of distinct words in the fourth question, and the value of each element is the number of times the corresponding word occurs in the fourth question;
establishing a semantic feature extraction model according to a preset dimension;
building a semantic vector corresponding to the fourth question according to the semantic feature extraction model.
As can be seen from the above description, existing word segmentation algorithms delete the non-Chinese characters in a question and segment only the Chinese characters. The present invention therefore first places the Chinese characters and the non-Chinese characters of the question into separate stacks, segments the Chinese character stack, and matches the formulas in the non-Chinese character stack with a regular expression, separating out the recognisable parts of each formula as far as possible. The question is thus segmented while its information is retained, which helps extract its feature vector. In addition, storing the Chinese and non-Chinese characters in stacks keeps the character order unchanged, so segmentation does not alter the original meaning of the question. Furthermore, deleting the stop words in the question, that is, words that carry no meaning, such as "it", "is" and "inside", allows the feature vector of the question to be extracted more accurately, ignores irrelevant information and reduces the redundancy of the feature vector.
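A simplified Python sketch of this separation follows. It does not use two explicit stacks; instead it walks the question once and handles alternating runs of Chinese and non-Chinese characters in order, which preserves character order in the same way. The regular expressions, the function name `split_question` and the `segment` parameter are assumptions for illustration. By default `segment` splits a Chinese run into single characters; a real segmenter such as jieba's `lcut` could be passed instead.

```python
import re

# Alternating runs: group 1 = Chinese characters, group 2 = anything else (no spaces).
_RUNS = re.compile(r"([\u4e00-\u9fff]+)|([^\u4e00-\u9fff\s]+)")
# Formula tokens: identifiers, numbers, or any single remaining symbol.
_FORMULA_TOKENS = re.compile(r"[A-Za-z_]+|\d+(?:\.\d+)?|\S")


def split_question(text, segment=list):
    """Tokenize a question, keeping formula symbols instead of dropping them."""
    tokens = []
    for match in _RUNS.finditer(text):
        chinese, other = match.groups()
        if chinese:
            tokens.extend(segment(chinese))                     # word segmentation
        else:
            tokens.extend(_FORMULA_TOKENS.findall(other))       # formula matching
    return tokens


# Here each Chinese run is kept whole, purely to make the output easy to read.
tokens = split_question("使函数y=sqrt(5-x)有意义", segment=lambda run: [run])
```

The formula `y=sqrt(5-x)` is split into its recognisable parts while the Chinese text on either side is handed to the segmenter untouched.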
Further, deleting stop words from the third question to obtain the fourth question is specifically:
calculating the weight of each word in the third question;
sorting the words in the third question according to the weights to form a first queue;
deleting from the third question the words corresponding to the first preset number of elements of the first queue, to obtain the fourth question.
As can be seen from the above description, the specific stop words differ across subjects and school-age groups, while existing methods obtain stop words by looking them up in a stop-word list, which is inflexible and poorly targeted. The present invention uses a stop-word calculation algorithm, such as the TF-IDF algorithm, to calculate the weight of each word in the question and deletes the words with smaller weights from the third question. Different stop words can thus be obtained for different subjects, which improves the correlation of the approximate questions obtained.
For example, the common word "acceleration" appears frequently in physics and is important for understanding a question. In biology, however, it may not appear once in 1000 questions, so if "acceleration" is found in a biology question, it can be regarded as a stop word rather than as an important word for that subject, and can be deleted.
Here, term frequency (TF) refers to the number of times a given word occurs in a file. This number is usually normalized (the numerator is generally smaller than the denominator, unlike IDF) to prevent a bias towards long files. It is calculated as follows:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
In the above formula, $n_{i,j}$ is the number of times the word $t_i$ occurs in file $d_j$, and the denominator is the sum of the numbers of occurrences of all words in file $d_j$.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of files by the number of files that contain the word and taking the logarithm of the quotient. The formula is as follows:
$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of files in the corpus and $|\{j : t_i \in d_j\}|$ is the number of files that contain the word $t_i$. If the word does not occur in the corpus, the divisor would be 0, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead. The TF-IDF formula is then:
$$\mathrm{tf\text{-}idf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$$
A high term frequency within a particular file, combined with a low document frequency of the word across the whole file set, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
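The formulas above translate directly into a short Python sketch. It uses the smoothed divisor with the added 1 and the natural logarithm; the logarithm base is not stated in the text and is an assumption, as are the function name `tf_idf` and the toy corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per tokenized document.

    tf  = occurrences of the word in the document / total words in the document
    idf = log(|D| / (1 + number of documents containing the word))
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))   # count each word once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return weights


docs = [
    ["function", "value", "function"],
    ["value", "set"],
    ["value", "root"],
]
w = tf_idf(docs)
```

In this corpus "value" occurs in every document, so its weight in the first document is lower than that of "function", which occurs only there. With the smoothed divisor, a word present in every document gets a slightly negative IDF, which still ranks it below rarer words.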
Further, obtaining the approximate question set according to the second correlation degree set is specifically:
sorting the classifications in the first classification set according to the second correlation degree set to obtain a first classification queue;
taking a preset number of classifications from the first classification queue to obtain a second classification set;
obtaining the questions in the second classification set whose similarity to the first question is greater than a preset similarity threshold, to obtain the approximate question set.
As can be seen from the above description, forming the approximate question set from questions that are highly similar to the first question, chosen from the knowledge point classifications most correlated with the first question, improves the correlation between the pushed approximate question set and the first question.
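The selection step can be sketched in Python as a sort, a cut-off and a filter. The data layout (each classification mapped to a list of question identifier and similarity pairs), the descending sort order and the names `select_approximate`, `top_k` and `sim_threshold` are assumptions for illustration.

```python
def select_approximate(second_corr, sims_by_category, top_k, sim_threshold):
    """Pick approximate questions from the most correlated classifications.

    second_corr: classification -> second correlation degree.
    sims_by_category: classification -> list of (question_id, similarity).
    """
    # First classification queue: classifications sorted by correlation, highest first.
    queue = sorted(second_corr, key=second_corr.get, reverse=True)
    # Second classification set: the preset number of classifications.
    chosen = queue[:top_k]
    # Keep questions whose similarity exceeds the preset threshold.
    return [
        question_id
        for category in chosen
        for question_id, similarity in sims_by_category.get(category, [])
        if similarity > sim_threshold
    ]


second_corr = {"A": 0.5, "B": 0.1, "C": 0.3}
sims_by_category = {
    "A": [("q1", 0.9), ("q2", 0.2)],
    "B": [("q3", 0.95)],
    "C": [("q4", 0.6)],
}
picked = select_approximate(second_corr, sims_by_category, top_k=2, sim_threshold=0.5)
```

Question "q3" is very similar to the first question but is not selected, because its classification falls outside the top two by second correlation degree.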
As shown in Fig. 2, the present invention also provides a system for question classification and pushing, comprising:
a classification module 1, configured to classify a first question according to a preset knowledge point classification model to obtain a first classification set and a first correlation degree set, wherein each element of the first correlation degree set is the correlation degree between the first question and a classification in the first classification set;
a computing module 2, configured to calculate the similarity between the first question and the questions included in each classification in the first classification set, to obtain a similarity set corresponding to each classification in the first classification set;
a first processing module 3, configured to obtain a second correlation degree set according to the similarity set and the first correlation degree set;
a second processing module 4, configured to obtain an approximate question set according to the second correlation degree set;
a pushing module 5, configured to push the approximate question set.
As can be seen from the above description, the system for question classification and pushing improves the accuracy of question classification, and thereby further improves the correlation between the pushed approximate questions and the first question.
An embodiment of the present invention is as follows:
S1, deploying a word-frequency-based knowledge point classification model and a semantics-based knowledge point classification model on separate nodes of the preset classification cluster;
wherein the word-frequency-based knowledge point classification model operates as follows:
(1) a new question is input;
(2) the new question is converted into LaTeX format;
(3) the text is segmented into words, and the stop words obtained during training are deleted;
(4) the new question is built into a word frequency vector;
(5) the word frequency vector is input into the pre-trained word-frequency-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The process of training the word-frequency-based knowledge point classification model is specifically:
(1) training questions are input;
(2) the training questions are converted into LaTeX format;
(3) the text is segmented into words;
(4) the weight of each word is calculated using a stop-word algorithm (TF-IDF), stop words are obtained according to a set threshold, and the stop words in the training questions are deleted;
(5) each training question is converted into a word frequency vector;
(6) the corresponding parameters are set according to the classification algorithm;
(7) all the word frequency vectors are input into the classification algorithm for training, yielding the word-frequency-based knowledge point classification model.
The semantics-based knowledge point classification model operates as follows:
(1) a new question is input;
(2) the new question is converted into LaTeX format;
(3) the text is segmented into words, and the stop words obtained during training are deleted;
(4) the new question is input into the pre-trained semantic feature extraction model to obtain the corresponding semantic vector;
(5) the semantic vector is input into the pre-trained semantics-based knowledge point classification model to obtain the corresponding knowledge points and their weights.
The process of training the semantics-based knowledge point classification model is specifically:
(1) training questions are input;
(2) the training questions are converted into LaTeX format;
(3) the text is segmented into words;
(4) the segmented training questions are input into a semantic feature extraction model (such as a word2vec model), and a semantic feature extraction model for the training questions is obtained according to the set model parameters;
(5) each training question is input into the semantic feature extraction model to obtain a semantic vector for each training question;
(6) the corresponding classification algorithm (such as random forest or xgboost) is set;
(7) all the semantic vectors are input into the classification algorithm for training, yielding the semantics-based knowledge point classification model.
S2, sending the first question to each node of the preset classification cluster; each node classifies the first question, specifically:
S21, converting the symbols in the first question according to preset escape characters to obtain a second question;
wherein the escape character of the symbol "√" is "sqrt", the escape character of the symbol "=" is the equals sign as typed in English input mode, and the escape character of the symbol "-" is the minus sign as typed in English input mode. The second question obtained after escape character conversion is "What are the elements of the set formed by the positive integer values for which the function y=sqrt(5-x) is meaningful?"
S22, parsing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation on the characters in the Chinese character stack using a word segmentation algorithm, and matching the formulas stored in the non-Chinese character stack using a preset regular expression, to obtain a third question;
calculating the weight of each word in the third question;
sorting the words in the third question according to the weights to form a first queue;
deleting from the third question the words corresponding to the first preset number of elements of the first queue, to obtain a fourth question;
building a word frequency vector according to the fourth question, wherein the number of elements in the word frequency vector is the number of distinct words in the fourth question, and the value of each element is the number of times the corresponding word occurs in the fourth question;
establishing a semantic feature extraction model according to a preset dimension;
building a semantic vector corresponding to the fourth question according to the semantic feature extraction model;
wherein performing word segmentation on the characters in the Chinese character stack using the jieba word segmentation algorithm, and matching the formulas stored in the non-Chinese character stack using a preset regular expression, is specifically:
first segmenting the characters in the Chinese character stack using the jieba word segmentation algorithm, obtaining the third question "make@function@meaningful@of@y@=@sqrt@(@5@-@x@)@positive integer@value range@composition@of@set@of@element@have@", where the symbol @ denotes a separator.
The weight of each word in the third question is calculated using the TF-IDF algorithm. The weights of the words in the third question are, in order:
"make": 0.05, "function": 0.51, "meaningful": 0.22, "of": 0.02, "y": 0.09, "=": 0.07, "sqrt": 0.22, "(": 0.01, "5": 0.01, "-": 0.07, "x": 0.07, ")": 0.01, "positive integer": 0.49, "value range": 0.44, "composition": 0.15, "of": 0.02, "set": 0.38, "of": 0.02, "element": 0.35, "have": 0.05, "": 0.01.
The words with smaller weights are deleted from the third question to obtain the fourth question, which is: "function@meaningful@sqrt@positive integer@value range@composition@set@element".
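The deletion of low-weight words can be reproduced with a short Python sketch that sorts the tokens by weight, drops a preset number from the low end, and keeps the rest in their original order. The weights are those listed above, with English stand-ins for the Chinese tokens and an empty string for the token that is blank in the text. The preset number 13 is not stated in the text; it is the value that reproduces the fourth question of this example. The name `drop_lowest_weight` is hypothetical.

```python
def drop_lowest_weight(weighted_tokens, n_drop):
    """Delete the n_drop lowest-weight tokens, preserving the order of the rest."""
    # First queue: token positions sorted by ascending weight.
    queue = sorted(range(len(weighted_tokens)), key=lambda i: weighted_tokens[i][1])
    dropped = set(queue[:n_drop])
    return [word for i, (word, _) in enumerate(weighted_tokens) if i not in dropped]


third_question = [
    ("make", 0.05), ("function", 0.51), ("meaningful", 0.22), ("of", 0.02),
    ("y", 0.09), ("=", 0.07), ("sqrt", 0.22), ("(", 0.01), ("5", 0.01),
    ("-", 0.07), ("x", 0.07), (")", 0.01), ("positive integer", 0.49),
    ("value range", 0.44), ("composition", 0.15), ("of", 0.02), ("set", 0.38),
    ("of", 0.02), ("element", 0.35), ("have", 0.05), ("", 0.01),
]
fourth_question = drop_lowest_weight(third_question, n_drop=13)
```

Working on positions rather than on the words themselves lets repeated tokens such as "of" be dropped individually.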
The number of times each word occurs in the fourth question is counted, and the word frequency vector of the fourth question is built over the vocabulary formed by all non-stop words, specifically:
if the number of non-stop words in the whole training set is 1000, the word frequency vector of the fourth question has length 1000, and each element of the vector is the number of times the corresponding word occurs in the question. For a word that occurs in the fourth question, for example "function", which occurs only once, the dimension corresponding to "function" in the word frequency vector has the value 1; if "function" occurred twice in the question, that value would be 2. The dimensions of all words that do not occur in the question have the value 0.
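Building the word frequency vector over a fixed vocabulary can be sketched in a few lines of Python. The four-word vocabulary stands in for the 1000 non-stop words of the example, and the name `term_frequency_vector` is hypothetical.

```python
from collections import Counter


def term_frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokenized question."""
    counts = Counter(tokens)
    # One dimension per vocabulary word; words absent from the question get 0.
    return [counts.get(word, 0) for word in vocabulary]


vocabulary = ["function", "set", "element", "root"]   # toy vocabulary
vector = term_frequency_vector(["function", "set", "function"], vocabulary)
```

Because every question is projected onto the same vocabulary, all word frequency vectors have the same length and can be fed to one classifier.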
Each word occurring in the fourth question is input into a trained semantic model (such as a word2vec or GloVe model) to obtain a vector for that word. Because the word vectors all have the same length, they can be superimposed, that is, the values in the same dimension are added, to obtain a single vector representing the whole question. A semantic model is a representation that preserves semantic context relations. The process of building the semantic vector of the fourth question is specifically:
The fourth question is input into the pre-trained semantic model, and the semantic vector of each word is obtained according to the parameter settings of the pre-trained model. In practice the vector length of each word is typically set to 100 to 200 dimensions; for illustration, each word vector is set to 4 dimensions here.
Word | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 |
---|---|---|---|---|
function | 0.41 | 0.12 | 0.02 | 0.31 |
meaningful | 0.21 | 0.01 | 0.02 | 0.22 |
sqrt | 0.02 | 0.08 | 0.06 | 0.05 |
positive integer | 0.35 | 0.14 | 0.21 | 0.33 |
value range | 0.01 | 0.03 | 0.05 | 0.06 |
composition | 0.23 | 0.41 | 0.05 | 0.02 |
set | 0.14 | 0.02 | 0.13 | 0.09 |
element | 0.06 | 0.04 | 0.07 | 0.08 |
Finally, the values of all the words above are added in each dimension to obtain the summed vector of the fourth question:
1.43 | 0.85 | 0.61 | 1.16 |
The value in each dimension is then divided by the total number of words in the fourth question (8):
0.17875 | 0.10625 | 0.07625 | 0.145 |
The result is the semantic vector of the fourth question.
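The summation and averaging can be checked with a small Python sketch. The 4-dimensional vectors are the illustrative values from the table above; in practice they would come from a trained word2vec or GloVe model, and the function name `semantic_vector` is an assumption made for this example.

```python
# Illustrative 4-dimensional word vectors copied from the table above.
WORD_VECTORS = {
    "function":         [0.41, 0.12, 0.02, 0.31],
    "meaningful":       [0.21, 0.01, 0.02, 0.22],
    "sqrt":             [0.02, 0.08, 0.06, 0.05],
    "positive integer": [0.35, 0.14, 0.21, 0.33],
    "value range":      [0.01, 0.03, 0.05, 0.06],
    "composition":      [0.23, 0.41, 0.05, 0.02],
    "set":              [0.14, 0.02, 0.13, 0.09],
    "element":          [0.06, 0.04, 0.07, 0.08],
}


def semantic_vector(tokens, word_vectors):
    """Add the word vectors dimension by dimension, then divide by the word count."""
    dims = len(next(iter(word_vectors.values())))
    summed = [sum(word_vectors[t][d] for t in tokens) for d in range(dims)]
    return [value / len(tokens) for value in summed]


question_vector = semantic_vector(list(WORD_VECTORS), WORD_VECTORS)
print([round(v, 5) for v in question_vector])
```

Dividing by the word count keeps the vectors of long and short questions on a comparable scale.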
S23. According to the preset knowledge point classification model, obtain the first classification set and the first degree-of-association set corresponding to the word frequency vector and the semantic vector.
The first knowledge point set is: {value range of set elements; representation of functions; representation of sets; extreme values of the elements of a set; existence of roots and determination of the number of roots; values of a function; radical operations}, and the degrees of association between the fourth question and these knowledge points are, respectively: {0.85, 0.04, 0.03, 0.02, 0.03, 0.02, 0.01}. The second knowledge point set is: {value range of set elements; radical operations; representation of sets; extreme values of the elements of a set; equality of sets; domain of a function and how to find it, and values of a function}, and the degrees of association between the fourth question and these knowledge points are, respectively: {0.73, 0.08, 0.08, 0.04, 0.04, 0.02}. The knowledge points in the first and second knowledge point sets are taken together to form the third knowledge point set. The knowledge points of the third knowledge point set match the features of both the word frequency vector and the semantic vector of the fourth question and have a comparatively high degree of association with it. The classes corresponding to the knowledge points in the third knowledge point set form the first classification set.
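The first degrees of association that the text later lists for the first question (for example 1.58, 0.11, 0.09 and 0.06) are numerically consistent with adding the degrees a knowledge point receives in the two sets. The text does not state this rule, so the Python sketch below treats summation as an assumption; the English labels and the function name `merge_by_sum` are illustrative.

```python
from collections import defaultdict

# Degrees of association copied from the two knowledge point sets above.
first_set = {
    "value range of set elements": 0.85,
    "representation of functions": 0.04,
    "representation of sets": 0.03,
    "extreme values of set elements": 0.02,
    "existence and number of roots": 0.03,
    "values of a function": 0.02,
    "radical operations": 0.01,
}
second_set = {
    "value range of set elements": 0.73,
    "radical operations": 0.08,
    "representation of sets": 0.08,
    "extreme values of set elements": 0.04,
    "equality of sets": 0.04,
    "domain of a function and function values": 0.02,
}


def merge_by_sum(*degree_maps):
    """Assumed rule: add the degrees of a knowledge point across all sets."""
    merged = defaultdict(float)
    for degree_map in degree_maps:
        for knowledge_point, degree in degree_map.items():
            merged[knowledge_point] += degree
    return {kp: round(degree, 2) for kp, degree in merged.items()}


merged = merge_by_sum(first_set, second_set)
for knowledge_point, degree in sorted(merged.items(), key=lambda kv: -kv[1]):
    print(f"{knowledge_point}: {degree}")
```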
S24. Calculate the similarity between the first question and the questions contained in each class of the first classification set, obtaining a similarity set corresponding to each class in the first classification set;
obtain the second degree-of-association set according to the similarity set and the first degree-of-association set;
sort the classes in the first classification set according to the second degree-of-association set to obtain the first classification queue;
take a preset number of classes from the first classification queue to obtain the second classification set;
take the questions in the second classification set whose similarity to the first question exceeds a preset similarity threshold to obtain the approximate question set.
The cosine distance formula used to calculate the similarity is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
where x is the n-dimensional feature vector of the first question, y is the feature vector of a question in the class, and x_i and y_i are their components. The closer the value of cos θ is to 1, the higher the similarity of the two questions.
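A minimal Python sketch of this cosine measure is given below. Returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero; the text does not discuss that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: a question with an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_x * norm_y)


print(cosine_similarity([1, 1, 0], [1, 0, 0]))
```

Because word frequency vectors contain no negative values, the result for such vectors lies between 0 and 1.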
The first degrees of association between the first question and each class in the first classification set are, respectively:
{value range of set elements: 1.58; representation of functions: 0.04; representation of sets: 0.11; extreme values of the elements of a set: 0.06; existence of roots and determination of the number of roots: 0.03; values of a function: 0.02; radical operations: 0.09; extreme values of the elements of a set: 0.04; equality of sets: 0.04; domain of a function and how to find it, and values of a function: 0.02}.
The process of obtaining the second degree-of-association set from the similarity set and the first degree-of-association set is specifically:
(1) Take the four elements of the first classification set with the largest first degrees of association, namely the value range of set elements, the representation of sets, radical operations and the extreme values of the elements of a set, and extract the TF-IDF vectors of all questions in the question bank that belong to these knowledge points.
(2) Use the cosine distance formula to calculate the cosine distance between the feature vector of each extracted question and that of the first question.
(3) Sort the cosine distances between all the extracted questions and the first question to obtain the second degree-of-association set. Questions with higher similarity to the first question are then chosen from the classes with larger second degrees of association to form the approximate question set.
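Steps (1) to (3) can be sketched in Python as follows. The question bank layout, the toy vectors, the class labels, the threshold value and the function name `select_approximate_questions` are all hypothetical; the text only says that the number of classes and the similarity threshold are preset.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def select_approximate_questions(query_vec, first_degrees, question_bank,
                                 top_classes=4, threshold=0.5):
    # (1) Keep the classes with the largest first degrees of association.
    chosen = sorted(first_degrees, key=first_degrees.get, reverse=True)[:top_classes]
    # (2) Score every question of those classes against the query question.
    scored = []
    for cls in chosen:
        for question_id, vec in question_bank.get(cls, {}).items():
            scored.append((cosine_similarity(query_vec, vec), cls, question_id))
    # (3) Sort by similarity and keep the questions above the threshold.
    scored.sort(reverse=True)
    return [(qid, cls, sim) for sim, cls, qid in scored if sim > threshold]


# Hypothetical data: three classes, feature vectors of length 3.
first_degrees = {"A": 1.58, "B": 0.11, "C": 0.02}
question_bank = {
    "A": {"q1": [1, 1, 0], "q2": [0, 0, 1]},
    "B": {"q3": [1, 0, 0]},
    "C": {"q4": [1, 1, 0]},
}
result = select_approximate_questions([1, 1, 0], first_degrees, question_bank,
                                      top_classes=2)
for question_id, cls, sim in result:
    print(question_id, cls, round(sim, 3))
```

Restricting the search to the highest-ranked classes keeps the number of cosine computations small compared with scanning the whole question bank.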
S25. Push the approximate question set;
S26. Obtain the class corresponding to each question in the approximate question set to obtain the second classification set;
update the preset knowledge point classification model according to the second classification set.
The process of updating the knowledge point classification model is specifically:
(1) First calculate the length of the question after word segmentation. The fourth question is "function@meaningful@sqrt@positive integer@value range@composition@set@element", so the length of the fourth question is 8.
(2) updateWeight is a parameter whose value depends on a condition: when the length of the fourth question is greater than 5, updateWeight = 0.5; otherwise:
That is, updateWeight = 0.5 for the fourth question.
(3) Calculate incomeWeight, which is the average degree of approximation of the approximate questions under the knowledge point whose similarity is greater than 0.1. Let A be the set of all questions under the knowledge point, with a ∈ A, and let x be the question currently being queried, such as the fourth question. Define A' = {a ∈ A | sim(a, x) > 0.1}, where sim(a, x) is the degree of approximation between questions a and x. The incomeWeight of each knowledge point is calculated as:
$$\mathrm{incomeWeight} = \frac{1}{|A'|}\sum_{a \in A'} \mathrm{sim}(a, x)$$
(4) Update the weight value of each knowledge point according to the following formula:
newWeight = oldWeight × (1 − updateWeight) + incomeWeight × updateWeight
where newWeight is the knowledge point weight after the update and oldWeight is the original knowledge point weight. For example, the old knowledge point weights for the fourth question are 1.58, 0.11, 0.09 and 0.06; the newWeight values finally obtained are the new knowledge point weights.
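The update rule can be sketched in Python as follows. The similarity values in the example are hypothetical, the update rate is passed in as a parameter (0.5 matches the value given for the fourth question), and returning 0.0 when no question exceeds the 0.1 similarity bound is an assumption the text does not cover.

```python
def income_weight(similarities, min_similarity=0.1):
    """Average similarity of the questions whose similarity exceeds the bound."""
    selected = [s for s in similarities if s > min_similarity]
    # Assumption: with no sufficiently similar question, contribute 0.0.
    return sum(selected) / len(selected) if selected else 0.0


def update_weight(old_weight, similarities, update_rate=0.5):
    """newWeight = oldWeight * (1 - updateWeight) + incomeWeight * updateWeight."""
    income = income_weight(similarities)
    return old_weight * (1 - update_rate) + income * update_rate


# Old weight 1.58 from the text; the three similarities are made-up inputs.
new_weight = update_weight(1.58, [0.5, 0.3, 0.05])
print(round(new_weight, 4))
```

The rule is a weighted average of the old weight and the newly observed similarity, so a larger update rate lets recent query results change the model faster.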
In summary, the present invention provides a method and system for classifying and pushing questions. Similarity analysis is performed between the first question and the questions in the knowledge point classes obtained from the knowledge point classification model; the degree of association between the first question and each knowledge point class is calculated from the similarity; and questions with high similarity to the first question are extracted from the knowledge point classes with larger degrees of association and pushed to the user as approximate questions, which improves the correlation between the pushed approximate questions and the first question. Further, as the foregoing description shows, using a distributed cluster helps process approximate-question push tasks for large batches of questions and improves push efficiency. Further, periodically updating the knowledge point classification model according to the classification results improves the accuracy of the classification model and thereby the correlation of the pushed approximate questions. Further, adjusting the weight value of each node according to the actual application scenario helps push the approximate questions that best meet the user's expectations under different user demands. Further, converting the symbols in formulas according to preset escape characters normalizes symbols that are written differently but have the same meaning, so that the information in the question is used accurately and fully, the accuracy of question classification is improved, and with it the correlation of the pushed questions and the efficiency of obtaining approximate questions. Further, considering both the word frequency and the semantics of the question improves the accuracy of classification and thus the correlation between the pushed approximate questions and the first question. Further, the question can be segmented into words while the information in it is retained, which helps extract its feature vectors. In addition, storing Chinese characters and non-Chinese characters in stacks keeps the character order unchanged, so the original meaning of the question is not altered during word segmentation. Further, the present invention uses a stop-word algorithm to calculate the weight of each word in the question and deletes the words with smaller weights from the third question, so that different stop words can be obtained for different school stages, improving the correlation of the approximate questions obtained. Further, choosing the questions with higher similarity to the first question from the knowledge point classes associated with the first question to form the approximate question set improves the correlation between the pushed approximate question set and the first question. The present invention also provides a system for classifying and pushing questions, which improves the accuracy of question classification and thereby further improves the correlation between the pushed approximate questions and the first question.
The above are only embodiments of the present invention and do not thereby limit its patent scope. Any equivalent transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A method for classifying and pushing questions, characterized by comprising:
S1. classifying a first question according to a preset knowledge point classification model to obtain a first classification set and a first degree-of-association set, wherein the elements of the first degree-of-association set are the degrees of association between the first question and each class in the first classification set;
S2. calculating the similarity between the first question and the questions contained in each class of the first classification set to obtain a similarity set corresponding to each class in the first classification set;
S3. obtaining a second degree-of-association set according to the similarity set and the first degree-of-association set;
S4. obtaining an approximate question set according to the second degree-of-association set;
S5. pushing the approximate question set.
2. The method for classifying and pushing questions according to claim 1, characterized in that S1 is specifically:
deploying different preset knowledge point classification models on the nodes of a preset classification cluster;
sending the first question to each node in the preset classification cluster to obtain the first classification set and the first degree-of-association set.
3. The method for classifying and pushing questions according to claim 2, characterized by further comprising:
obtaining the class corresponding to each question in the approximate question set to obtain a second classification set;
updating the preset knowledge point classification model according to the second classification set.
4. The method for classifying and pushing questions according to claim 2, characterized in that sending the first question to each node in the preset classification cluster to obtain the first classification set and the first degree-of-association set is specifically:
sending the first question to each node in the preset classification cluster to obtain the classification set and the degree-of-association set corresponding to the node;
obtaining the weight value of the node according to the knowledge point classification model deployed on the node;
obtaining the first classification set and the first degree-of-association set according to the weight value of the node and the classification set and degree-of-association set corresponding to the node.
5. The method for classifying and pushing questions according to claim 1, characterized in that S1 is specifically:
converting the symbols in the first question according to preset escape characters to obtain a second question;
extracting the features of the second question to obtain a feature vector, the feature vector comprising a word frequency vector and a semantic vector;
obtaining, according to the preset knowledge point classification model, the first classification set and the first degree-of-association set corresponding to the feature vector.
6. The method for classifying and pushing questions according to claim 5, characterized in that obtaining, according to the preset knowledge point classification model, the first classification set and the first degree-of-association set corresponding to the feature vector is specifically:
deploying a knowledge point classification model based on word frequency on a node of a preset classification cluster;
deploying a knowledge point classification model based on semantics on a node of the preset classification cluster;
sending the first question to each node in the preset classification cluster to obtain the first classification set and the first degree-of-association set.
7. The method for classifying and pushing questions according to claim 5, characterized in that extracting the features of the second question to obtain a feature vector, the feature vector comprising a word frequency vector and a semantic vector, is specifically:
parsing the second question to obtain a Chinese character stack and a non-Chinese character stack;
performing word segmentation on the characters in the Chinese character stack using a word segmentation algorithm, and matching the formulas stored in the non-Chinese character stack using preset regular expressions, to obtain a third question;
deleting stop words from the third question to obtain a fourth question;
building a word frequency vector according to the fourth question, wherein the number of elements in the word frequency vector is the number of distinct words in the fourth question, and the value of each element is the number of times the word corresponding to that element occurs in the fourth question;
establishing a semantic feature extraction model according to a preset dimension;
building a semantic vector corresponding to the fourth question according to the semantic feature extraction model.
8. The method for classifying and pushing questions according to claim 7, characterized in that deleting stop words from the third question to obtain a fourth question is specifically:
calculating the weight of each word in the third question;
sorting the words in the third question according to the weight to form a first queue;
deleting from the third question the words corresponding to the first preset number of elements of the first queue to obtain the fourth question.
9. The method for classifying and pushing questions according to claim 1, characterized in that obtaining an approximate question set according to the second degree-of-association set is specifically:
sorting the classes in the first classification set according to the second degree-of-association set to obtain a first classification queue;
taking a preset number of classes from the first classification queue to obtain a second classification set;
taking the questions in the second classification set whose similarity to the first question exceeds a preset similarity threshold to obtain the approximate question set.
10. A system for classifying and pushing questions, characterized by comprising:
a classification module, configured to classify a first question according to a preset knowledge point classification model to obtain a first classification set and a first degree-of-association set, wherein the elements of the first degree-of-association set are the degrees of association between the first question and each class in the first classification set;
a computing module, configured to calculate the similarity between the first question and the questions contained in each class of the first classification set to obtain a similarity set corresponding to each class in the first classification set;
a first processing module, configured to obtain a second degree-of-association set according to the similarity set and the first degree-of-association set;
a second processing module, configured to obtain an approximate question set according to the second degree-of-association set;
a pushing module, configured to push the approximate question set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611009278.0A CN106599054B (en) | 2016-11-16 | 2016-11-16 | Method and system for classifying and pushing questions |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599054A true CN106599054A (en) | 2017-04-26 |
CN106599054B CN106599054B (en) | 2019-12-24 |
Family
ID=58590375
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611009278.0A Active CN106599054B (en) | 2016-11-16 | 2016-11-16 | Method and system for classifying and pushing questions |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599054B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | 北京百问百答网络技术有限公司 | Text similarity, acceptation similarity calculating method and system and application system |
CN101685455A (en) * | 2008-09-28 | 2010-03-31 | 华为技术有限公司 | Method and system of data retrieval |
CN103544255A (en) * | 2013-10-15 | 2014-01-29 | 常州大学 | Text semantic relativity based network public opinion information analysis method |
CN105095223A (en) * | 2014-04-25 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Method for classifying texts and server |
CN105893362A (en) * | 2014-09-26 | 2016-08-24 | 北大方正集团有限公司 | A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points |
CN104834729A (en) * | 2015-05-14 | 2015-08-12 | 百度在线网络技术(北京)有限公司 | Title recommendation method and title recommendation device |
CN105589972A (en) * | 2016-01-08 | 2016-05-18 | 天津车之家科技有限公司 | Method and device for training classification model, and method and device for classifying search words |
CN106021288A (en) * | 2016-04-27 | 2016-10-12 | 南京慕测信息科技有限公司 | Method for rapid and automatic classification of classroom testing answers based on natural language analysis |
CN105930509A (en) * | 2016-05-11 | 2016-09-07 | 华东师范大学 | Method and system for automatic extraction and refinement of domain concept based on statistics and template matching |
Non-Patent Citations (4)
Title |
---|
吴旭等: "面向机构知识库结构化数据的文本相似度评价算法", 《技术研究》 * |
董奥根等: "基于向量空间模型的知识点与试题自动关联方法", 《计算机与现代化》 * |
许鑫: "《基于文本特征计算的信息分析方法》", 30 November 2015, 上海科学技术文献出版社 * |
麦好: "《机器学习实践指南 案例应用解析》", 30 April 2014, 机械工业出版社 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463553A (en) * | 2017-09-12 | 2017-12-12 | 复旦大学 | For the text semantic extraction, expression and modeling method and system of elementary mathematics topic |
CN107463553B (en) * | 2017-09-12 | 2021-03-30 | 复旦大学 | Text semantic extraction, representation and modeling method and system for elementary mathematic problems |
CN108182275A (en) * | 2018-01-24 | 2018-06-19 | 上海互教教育科技有限公司 | A kind of mathematics variant training topic supplying system and correlating method |
CN108376132B (en) * | 2018-03-16 | 2020-08-28 | 中国科学技术大学 | Method and system for judging similar test questions |
CN108376132A (en) * | 2018-03-16 | 2018-08-07 | 中国科学技术大学 | The determination method and system of similar examination question |
CN108765221A (en) * | 2018-05-15 | 2018-11-06 | 广西英腾教育科技股份有限公司 | Pumping inscribes method and device |
CN109189920A (en) * | 2018-08-02 | 2019-01-11 | 上海欣方智能系统有限公司 | Sweep-black case classification method and system |
CN109685137A (en) * | 2018-12-24 | 2019-04-26 | 上海仁静信息技术有限公司 | A kind of topic classification method, device, electronic equipment and storage medium |
CN109785691A (en) * | 2019-01-18 | 2019-05-21 | 广东小天才科技有限公司 | Method and system for assisting learning through terminal |
CN109785691B (en) * | 2019-01-18 | 2021-09-24 | 广东小天才科技有限公司 | Method and system for assisting learning through terminal |
CN110136512A (en) * | 2019-04-17 | 2019-08-16 | 许昌学院 | A kind of English grade examzation examination exercise and the automatic clustering system of answer parsing |
CN110472044A (en) * | 2019-07-11 | 2019-11-19 | 平安国际智慧城市科技股份有限公司 | Knowledge point classification method, device, readable storage medium storing program for executing and the server of mathematical problem |
CN112989760A (en) * | 2019-12-17 | 2021-06-18 | 北京一起教育信息咨询有限责任公司 | Method and device for labeling subjects, storage medium and electronic equipment |
WO2021253480A1 (en) * | 2020-06-19 | 2021-12-23 | 平安科技(深圳)有限公司 | Intelligent exercise recommendation method and apparatus, computer device and storage medium |
CN111881285A (en) * | 2020-07-28 | 2020-11-03 | 扬州大学 | Wrong question collection and important and difficult point knowledge extraction method |
CN112257966A (en) * | 2020-12-18 | 2021-01-22 | 北京世纪好未来教育科技有限公司 | Model processing method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106599054B (en) | 2019-12-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||