CN106649259B - Method for automatically extracting learning dependencies between knowledge units from courseware text - Google Patents

Method for automatically extracting learning dependencies between knowledge units from courseware text

Info

Publication number
CN106649259B
CN106649259B CN201610874480.3A
Authority
CN
China
Prior art keywords
knowledge
blocks
term
formula
dependence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610874480.3A
Other languages
Chinese (zh)
Other versions
CN106649259A (en)
Inventor
魏笔凡
王晨晨
刘均
郑庆华
曾宏伟
姚思雨
吴蓓
石磊
郭朝彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201610874480.3A priority Critical patent/CN106649259B/en
Publication of CN106649259A publication Critical patent/CN106649259A/en
Application granted granted Critical
Publication of CN106649259B publication Critical patent/CN106649259B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Abstract

The invention discloses a method for automatically extracting learning dependencies between knowledge units from courseware text. The text corresponding to each knowledge unit in the courseware is processed to obtain a candidate term set; the synonymous terms in the candidate set are then merged, the criticality of each term with respect to each knowledge unit is computed, and an optimization model is constructed. By solving the optimized learning-dependency extraction model, courseware text can be analyzed automatically: the terms in the text are identified, their criticality to each knowledge unit is computed, and a learning-dependency mining model is obtained by optimizing the relationships between terms. Because the process does not rely on the locality of learning dependencies, it can mine learning dependencies between knowledge units whose topics are far apart, providing learners with a more complete knowledge-navigation service.

Description

Method for automatically extracting learning dependencies between knowledge units from courseware text
Technical field
The present invention relates to methods for mining learning dependencies, and in particular to a method for automatically extracting learning dependencies between knowledge units from courseware text.
Background technique
With the rapid development of science and technology, the total amount of human knowledge is growing explosively. According to UNESCO statistics, the knowledge accumulated by mankind in the last 30 years accounts for 90% of the total accumulated since the dawn of civilization, and the doubling period of knowledge keeps shortening, having now fallen to 5-7 years. This rapid growth poses a serious challenge to the effective acquisition and representation of knowledge. The traditional solution is for a search engine to return relevant documents to the user. This approach cannot directly present the knowledge the user is interested in, and the user must spend considerable effort screening a large number of relevant documents. Knowledge-graph technology represents a semantic network with RDF triples and aims to evolve search engines from "mechanical enumeration" to "networked knowledge", providing users with semantic, relational information retrieval and alleviating the above problem to some extent. However, knowledge graphs are not built for topic-oriented cognitive learning and cannot capture the cognitive relationships between topics, which easily causes learners to get lost. Knowledge maps, which organize knowledge into a graph according to the characteristics of human cognitive learning, form an efficient way of expressing knowledge and the organizational structure between pieces of knowledge, and provide an effective means of alleviating the getting-lost problem.
A learning dependency describes the interdependence between knowledge units in the cognitive process. Determining whether two knowledge units are related is a basic but very important task in knowledge-map construction. At present, building a high-quality knowledge map still requires domain experts to annotate the learning dependencies between knowledge units according to domain knowledge, so the construction process is rather slow. Designing an effective learning-dependency mining algorithm will therefore greatly speed up knowledge-map construction, reduce manual effort, and help advance the research and application of knowledge-map-based navigational learning.
For the method for learning dependence excavation between blocks of knowledge, Patent No. ZL201110312882.1, title For a kind of blocks of knowledge incidence relation method for digging of text-oriented, the method for proposition includes the following steps: that (1) textual association is dug Pick: clustering text collection, finds the text pair with similar topic, and the asymmetry being distributed using central term, Excavate the linear correlation relationship between text;(2) candidate blocks of knowledge pair is generated: using the locality of blocks of knowledge incidence relation, Generate candidate blocks of knowledge pair;(3) feature selecting and blocks of knowledge incidence relation excavate: the term word of knowledge based unit pair Frequently, distance and semantic type feature, using SVM classifier by candidate blocks of knowledge to progress two-value classification, Extracting Knowledge list Incidence relation between member.This method can greatly reduce candidate blocks of knowledge number, under the premise of guaranteeing precision, be effectively reduced The time complexity of relation excavation.Since it makes use of the locality of study dependence, the above method is difficult to extract distance Study dependence between farther away blocks of knowledge.
Summary of the invention
To solve the problems in the prior art, the present invention proposes a method for automatically extracting learning dependencies between knowledge units from courseware text. It can automatically analyze courseware text, identify the terms in the text, compute the criticality of each term with respect to each knowledge unit, and obtain a learning-dependency mining model by optimizing the relationships between terms. Because the process does not rely on the locality of learning dependencies, it can mine learning dependencies between knowledge units whose topics are far apart, providing learners with a more complete knowledge-navigation service.
To achieve the above goal, the technical scheme adopted by the invention comprises the following steps:
1) Mutual-information-based candidate term generation: first convert the courseware document into text format and perform word segmentation; then use mutual information to measure how tightly adjacent words combine, and merge tightly combined words, thereby obtaining a candidate term set;
2) Wikipedia-based synonymous-term reduction: crawl the Wikipedia page corresponding to each term, and use the redirect marks and interlanguage links in the Wikipedia pages to reduce synonymous terms;
3) Term criticality measurement: first compute the TF-IDF value of each term, then weight the TF-IDF value using knowledge-unit title features and format features, thereby measuring the criticality of each term with respect to each knowledge unit;
4) Optimization-model construction and solution: establish a quantitative representation of the learning dependencies between knowledge units and the relationships between terms, convert the model-solving problem into an optimization problem, construct the objective function, and solve the model with a gradient descent algorithm, completing the automatic extraction of learning dependencies between knowledge units from courseware text.
Step 1) comprises the following steps:
1.1) Extract the text corresponding to each knowledge unit in the courseware using the POI toolkit, then perform word segmentation and stop-word removal;
1.2) Suppose that after word segmentation an original character string c is split into two words a and b. Denote the co-occurrence frequency of string c in the corpus as f(c) and its co-occurrence probability as p(c). By maximum-likelihood estimation, when the corpus is sufficiently large, p(c) can be estimated by f(c). Treating each word as an event, the mutual information of the string c = ab is: MI(a, b) = log( p(ab) / (p(a) · p(b)) ). Mutual information thus measures the internal cohesion of a string, and a candidate term set is obtained.
Step 2) comprises the following steps:
2.1) Synonym-dictionary expansion: starting from a synonym dictionary, expand it using the redirect marks and interlanguage links in the Wikipedia pages;
2.2) Synonymous-term reduction: using the Wikipedia-expanded synonym dictionary, reduce the synonymous terms in the candidate term set.
The reduction in step 2.2) works as follows: for a term A that has synonyms, find the term B that has the same meaning as A and the highest frequency of occurrence, and replace A with B in the candidate term set.
Step 3) comprises the following steps:
3.1) For each term in the candidate term set CT', compute its basic criticality with respect to each knowledge unit by the TF-IDF index: w_{i,j} = (f_{ij} / df_i) × log(N / n_i), where f_{ij} is the frequency of term i in document d_j; df_i is the total document frequency of term i; N is the total number of documents; and n_i is the number of documents in which term i appears;
3.2) Knowledge-unit title weighting: weight the original TF-IDF value according to whether the term appears in the knowledge-unit title: Name_{i,j} = w_name × b_{i,j}, where w_name is the title-weighting weight and b_{i,j} indicates whether term i appears in the title of knowledge unit j;
3.3) Format-feature weighting: weight the criticality of a term according to the font size of its position: Font_{i,j} = w_font × Σ_{k=1}^{K} f_{i,k} / rank_k, where w_font is the font-size weighting weight; K is the number of distinct font sizes in the courseware of knowledge unit j; f_{i,k} indicates whether term i appears in font size k; and rank_k is the rank of font size k after sorting all font sizes in descending order;
3.4) Combine the knowledge-unit title and courseware font weights with the original TF-IDF value to obtain the term criticality: score_{i,j} = w_{i,j} × (1 + Name_{i,j} + Font_{i,j}), where score_{i,j} is the criticality of term i with respect to knowledge unit j.
Step 4) comprises the following steps:
4.1) Objective-function construction: for knowledge units i and j, the likelihood that a learning dependency exists between them is measured by: f(i, j) = x_i^T A x_j, where x_i is the vector formed by the criticalities of all terms with respect to knowledge unit i, each element being the criticality of the corresponding term, and the matrix A represents the model parameters;
For knowledge unit i, let Ω_i = {(i, j) | y_ij = 1, j = 1, 2, ..., n} be the set of node pairs formed by knowledge unit i and all knowledge units that have a learning dependency with it, and let Ω̄_i = {(i, k) | y_ik = 0, k = 1, 2, ..., n} be the set of node pairs formed by knowledge unit i and all knowledge units that have no learning dependency with it. Let T_i = Ω_i × Ω̄_i and define the following optimization problem:
min_A Σ_i Σ_{((i,j),(i,k)) ∈ T_i} (1 − (x_i^T A x_j − x_i^T A x_k))_+ + λ‖A‖_F², where X is the matrix whose i-th row is x_i^T; (1 − v)_+ denotes the hinge loss; and ‖A‖_F denotes the Frobenius norm of matrix A;
4.2) Model solution: the optimization problem is solved by accelerated gradient descent:
Let L(A) denote the hinge-loss sum; the original objective can then be written as F(A) = L(A) + λ‖A‖_F². Differentiating with respect to A gives the gradient: ∇F(A) = 2λA − Σ_{((i,j),(i,k)) ∈ T, 1 − (x_i^T A x_j − x_i^T A x_k) > 0} X^T e_i (e_j − e_k)^T X, where e_i, e_j, e_k are unit vectors;
4.3) Learning-dependency mining: the optimal parameter matrix A obtained in step 4.2) is used to judge, for any two knowledge units, whether a learning dependency exists between them.
Compared with the prior art, the present invention processes the text corresponding to each knowledge unit in the courseware to obtain a candidate term set, merges the synonymous terms in the candidate set, computes the criticality of each term with respect to each knowledge unit, and constructs an optimization model. By solving the optimized learning-dependency extraction model, it can automatically analyze courseware text, identify the terms in the text, compute their criticality to the knowledge units, and obtain a learning-dependency mining model by optimizing the relationships between terms. Because the process does not rely on the locality of learning dependencies, it can mine learning dependencies between knowledge units whose topics are far apart, providing learners with a more complete knowledge-navigation service. The invention can automatically extract learning dependencies between knowledge units from the text corresponding to each knowledge unit in the courseware, reducing the cost of manual knowledge-map construction and helping to advance the research and application of knowledge-map-based navigational learning.
Detailed description of the invention
Fig. 1 is a flow diagram of the method of the invention;
Fig. 2 is a graphical illustration of formula (6);
Fig. 3 is an example of partial results of learning-dependency mining for the "Java language" course.
Specific embodiment
The present invention is further explained below with reference to specific embodiments and the accompanying drawings.
Referring to Fig. 1, the present invention specifically comprises the following steps:
1) Mutual-information-based candidate term generation, comprising 2 steps:
1.1) Extract the text corresponding to each knowledge unit in the PPT courseware using the POI toolkit, then perform word segmentation and stop-word removal;
1.2) Suppose that after word segmentation an original character string c is split into two words a and b. Denote the co-occurrence frequency of string c in the corpus as f(c) and its co-occurrence probability as p(c). By maximum-likelihood estimation, when the corpus is sufficiently large, p(c) can be estimated by f(c). Treating each word as an event, the mutual information of the string c = ab is:
MI(a, b) = log( p(ab) / (p(a) · p(b)) )  (1)
Mutual information measures the internal cohesion of a string and thus serves to extract candidate terms. The mutual-information-based candidate term generation algorithm proceeds as follows:
Input: word-segmentation result set Tokens = {word_1, word_2, ..., word_n}, the number of words n in Tokens, word-frequency statistics TF, threshold w
Algorithm flow:
Output: candidate term set CT
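The body of the algorithm above is given as a figure in the original patent. A minimal runnable sketch of the merging step it describes, assuming simple relative-frequency probability estimates (function and variable names are illustrative, not from the patent), might look like:

```python
import math
from collections import Counter

def candidate_terms(tokens, w):
    """Merge adjacent word pairs whose mutual information exceeds threshold w.

    tokens: the segmented word sequence; w: the MI threshold from the
    algorithm input. Probabilities are estimated by relative frequency
    (maximum likelihood), as in the patent's step 1.2).
    """
    total = len(tokens)
    tf = Counter(tokens)                        # word-frequency statistics TF
    pair_tf = Counter(zip(tokens, tokens[1:]))  # co-occurrence counts of adjacent pairs

    terms = set()
    for (a, b), f_ab in pair_tf.items():
        p_a, p_b = tf[a] / total, tf[b] / total
        p_ab = f_ab / (total - 1)
        mi = math.log(p_ab / (p_a * p_b))       # formula (1)
        if mi > w:                              # tightly combined: merge into a candidate term
            terms.add(a + b)
    return terms
```

A real implementation would iterate the merge until no pair exceeds the threshold; this sketch shows a single pass only.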
2) Wikipedia-based synonymous-term reduction, comprising 2 steps:
2.1) Synonym-dictionary expansion: starting from the "Chinese Thesaurus", expand the synonym dictionary using the redirect marks and interlanguage links in the Wikipedia pages. The expansion algorithm is as follows:
Input: candidate term set CT = {term_1, term_2, ..., term_n}, the number of terms n in CT, synonym dictionary D = {(term_1, ..., term_i)}, where (term_1, ..., term_i) is a group of i terms with the same meaning
Algorithm flow:
Output: the expanded synonym dictionary D
2.2) Thesaurus-based synonymous-term reduction: using the Wikipedia-expanded synonym dictionary, the synonymous terms in the candidate term set are unified. The basic procedure is: for a term A that has synonyms, find the term B that has the same meaning as A and the highest frequency of occurrence, and replace A with B in the candidate term set. The corresponding algorithm is as follows:
Input: candidate term set CT = {term_1, term_2, ..., term_n}, the number of terms n in CT, synonym dictionary D = {(term_1, ..., term_i)}, where (term_1, ..., term_i) is a group of i terms with the same meaning, word-frequency statistics TF
Algorithm flow:
Output: CT'
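As with the previous algorithm, the body is given as a figure. A sketch of the reduction it describes (the dictionary format and all names are illustrative assumptions) replaces each term by the most frequent member of its synonym group:

```python
def reduce_synonyms(ct, synonym_groups, tf):
    """Replace each term by the most frequent member of its synonym group.

    ct: candidate term set CT; synonym_groups: the expanded dictionary D,
    given here as a list of tuples of terms with identical meaning;
    tf: word-frequency statistics.
    """
    # For every term, remember the whole synonym group it belongs to.
    group_of = {t: g for g in synonym_groups for t in g}
    reduced = set()
    for term in ct:
        group = group_of.get(term, (term,))
        # Term B: the synonym with the highest frequency of occurrence.
        best = max(group, key=lambda t: tf.get(t, 0))
        reduced.add(best)
    return reduced
```

For example, if "laptop" and "notebook" form one group and "laptop" is more frequent, every occurrence of "notebook" in CT is replaced by "laptop", yielding the reduced set CT'.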
3) Term criticality measurement, comprising 4 steps:
3.1) For each term in the candidate term set CT', compute its basic criticality with respect to each knowledge unit by the TF-IDF index:
w_{i,j} = (f_{ij} / df_i) × log(N / n_i)  (2)
where f_{ij} is the frequency of term i in document d_j; df_i is the total document frequency of term i; N is the total number of documents; and n_i is the number of documents in which term i appears;
3.2) Knowledge-unit title weighting: a term that appears in the title of a knowledge unit is likely to be a key term of that unit, so the original TF-IDF value is weighted according to whether the term appears in the knowledge-unit title:
Name_{i,j} = w_name × b_{i,j}  (3)
where w_name is the title-weighting weight and b_{i,j} indicates whether term i appears in the title of knowledge unit j;
3.3) Weighting based on PPT format features: in the hierarchical presentation of a PPT, higher-level content generally uses larger fonts and expresses more important content; the criticality of a term is therefore weighted by the font size of its position:
Font_{i,j} = w_font × Σ_{k=1}^{K} f_{i,k} / rank_k  (4)
where w_font is the font-size weighting weight; K is the number of distinct font sizes in the courseware of knowledge unit j; f_{i,k} indicates whether term i appears in font size k; and rank_k is the rank of font size k after sorting all font sizes in descending order;
3.4) The original TF-IDF value is combined with the knowledge-unit title and courseware font weights:
score_{i,j} = w_{i,j} × (1 + Name_{i,j} + Font_{i,j})  (5)
where score_{i,j} is the criticality of term i with respect to knowledge unit j;
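The chain of formulas (2)-(5) can be sketched as one scoring function. Note that the exact TF-IDF normalization and the form of the font term are reconstructed from the variable definitions (the original formulas are images), and the default weights are illustrative values, not from the patent:

```python
import math

def criticality(f_ij, df_i, N, n_i, in_title, font_ranks,
                w_name=0.5, w_font=0.3):
    """Weighted term criticality score_{i,j} per formulas (2)-(5).

    f_ij: frequency of term i in document j; df_i: total frequency of term i;
    N: number of documents; n_i: documents containing term i;
    in_title: whether term i appears in the title of knowledge unit j;
    font_ranks: descending-sorted ranks (1 = largest) of the font sizes
    in which the term appears.
    """
    w = (f_ij / df_i) * math.log(N / n_i)           # formula (2): TF-IDF
    name = w_name * (1 if in_title else 0)          # formula (3): title weighting
    font = w_font * sum(1 / r for r in font_ranks)  # formula (4): font weighting
    return w * (1 + name + font)                    # formula (5): combined score
```

A term that appears in the unit title and in the largest font thus receives a multiplicative boost over its plain TF-IDF value.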
4) Optimization-model construction and solution, comprising 3 steps:
4.1) Objective-function construction: first, for knowledge units i and j, the likelihood that a learning dependency exists between them is measured by:
f(i, j) = x_i^T A x_j  (6)
where x_i is the vector formed by the criticalities of all terms with respect to knowledge unit i, each element being the criticality of the corresponding term, and the matrix A represents the model parameters. As shown in Fig. 2, formula (6) can be regarded as the sum of the weights of all paths from knowledge unit i to knowledge unit j;
For knowledge unit i, let Ω_i = {(i, j) | y_ij = 1, j = 1, 2, ..., n}, i.e. the set of node pairs formed by knowledge unit i and all knowledge units that have a learning dependency with it, and let Ω̄_i = {(i, k) | y_ik = 0, k = 1, 2, ..., n}, i.e. the set of node pairs formed by knowledge unit i and all knowledge units that have no learning dependency with it. For a reasonable parameter matrix A, the value of formula (6) for each pair in Ω_i should be greater than that for each pair in Ω̄_i. Let T_i = Ω_i × Ω̄_i and define the following optimization problem:
min_A Σ_i Σ_{((i,j),(i,k)) ∈ T_i} (1 − (x_i^T A x_j − x_i^T A x_k))_+ + λ‖A‖_F²  (7)
where X is the matrix whose i-th row is x_i^T; (1 − v)_+ denotes the hinge loss; and ‖A‖_F denotes the Frobenius norm of matrix A;
4.2) Model solution: the optimization problem of formula (7) can be solved by Accelerated Gradient Descent:
Let L(A) = Σ_i Σ_{((i,j),(i,k)) ∈ T_i} (1 − (x_i^T A x_j − x_i^T A x_k))_+ ; the original objective can then be written as:
F(A) = L(A) + λ‖A‖_F²  (8)
Differentiating formula (8) with respect to A gives the gradient:
∇F(A) = 2λA − Σ_{((i,j),(i,k)) ∈ T, 1 − (x_i^T A x_j − x_i^T A x_k) > 0} X^T e_i (e_j − e_k)^T X  (9)
where e_i, e_j, e_k are unit vectors. The specific calculation procedure is as follows:
Input: X, T, λ, η, maximum number of iterations N
Algorithm flow:
Output: A
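The solver body above is given as a figure in the original. A plain (non-accelerated) gradient-descent sketch with the hinge-loss gradient reconstructed from formula (9) is shown below; the momentum step of the actual accelerated algorithm is omitted, and all names and default parameters are illustrative:

```python
import numpy as np

def solve_A(X, pairs, lam=0.1, eta=0.01, n_iter=200):
    """Learn the parameter matrix A by minimising the ranking hinge loss.

    X: matrix whose i-th row is x_i (term criticalities of knowledge unit i);
    pairs: list of triples (i, j, k) where (i, j) has a learning dependency
    and (i, k) does not; lam: regularisation weight lambda; eta: step size.
    """
    d = X.shape[1]
    A = np.zeros((d, d))
    for _ in range(n_iter):
        grad = 2 * lam * A  # gradient of the Frobenius-norm regulariser
        for i, j, k in pairs:
            margin = 1 - (X[i] @ A @ X[j] - X[i] @ A @ X[k])
            if margin > 0:  # pair violates the ranking constraint: hinge is active
                grad -= np.outer(X[i], X[j] - X[k])
        A -= eta * grad
    return A
```

After training, f(i, j) = x_i^T A x_j can be thresholded to decide whether a learning dependency exists, as in step 4.3).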
4.3) Learning-dependency mining: step 4.2) yields the optimal parameter matrix A of the model. For any two knowledge units, the optimized model can then judge whether a learning dependency exists between them. Fig. 3 shows some examples of learning-dependency mining results for the "Java language" course: solid lines represent known learning dependencies, dashed lines represent the knowledge-unit pairs to be judged, and the number on a dashed line represents the likelihood that the pair has a learning dependency.

Claims (4)

1. A method for automatically extracting learning dependencies between knowledge units from courseware text, characterized by comprising the following steps:
1) Mutual-information-based candidate term generation: first convert the courseware document into text format and perform word segmentation; then use mutual information to measure how tightly adjacent words combine, and merge tightly combined words, thereby obtaining a candidate term set;
2) Wikipedia-based synonymous-term reduction: crawl the Wikipedia page corresponding to each term, and use the redirect marks and interlanguage links in the Wikipedia pages to reduce synonymous terms;
3) Term criticality measurement: first compute the TF-IDF value of each term, then weight the TF-IDF value using knowledge-unit title features and format features, thereby measuring the criticality of each term with respect to each knowledge unit;
4) Optimization-model construction and solution: establish a quantitative representation of the learning dependencies between knowledge units and the relationships between terms, convert the model-solving problem into an optimization problem, construct the objective function, and solve the model with a gradient descent algorithm, completing the automatic extraction of learning dependencies between knowledge units from courseware text;
Step 3) comprises the following steps:
3.1) For each term in the candidate term set CT', compute its basic criticality with respect to each knowledge unit by the TF-IDF index: w_{i,j} = (f_{ij} / df_i) × log(N / n_i), where f_{ij} is the frequency of term i in document d_j; df_i is the total document frequency of term i; N is the total number of documents; and n_i is the number of documents in which term i appears;
3.2) Knowledge-unit title weighting: weight the original TF-IDF value according to whether the term appears in the knowledge-unit title: Name_{i,j} = w_name × b_{i,j}, where w_name is the title-weighting weight and b_{i,j} indicates whether term i appears in the title of knowledge unit j;
3.3) Format-feature weighting: weight the criticality of a term according to the font size of its position: Font_{i,j} = w_font × Σ_{k=1}^{K} f_{i,k} / rank_k, where w_font is the font-size weighting weight; K is the number of distinct font sizes in the courseware of knowledge unit j; f_{i,k} indicates whether term i appears in font size k; and rank_k is the rank of font size k after sorting all font sizes in descending order;
3.4) Combine the knowledge-unit title and courseware font weights with the original TF-IDF value to obtain the term criticality: score_{i,j} = w_{i,j} × (1 + Name_{i,j} + Font_{i,j}), where score_{i,j} is the criticality of term i with respect to knowledge unit j;
Step 4) comprises the following steps:
4.1) Objective-function construction: for knowledge units i and j, the likelihood that a learning dependency exists between them is measured by f(i, j) = x_i^T A x_j, where x_i is the vector formed by the criticalities of all terms with respect to knowledge unit i, each element being the criticality of the corresponding term, and the matrix A represents the model parameters;
For knowledge unit i, let Ω_i = {(i, j) | y_ij = 1, j = 1, 2, ..., n} be the set of node pairs formed by knowledge unit i and all knowledge units that have a learning dependency with it, and let Ω̄_i = {(i, k) | y_ik = 0, k = 1, 2, ..., n} be the set of node pairs formed by knowledge unit i and all knowledge units that have no learning dependency with it; let T_i = Ω_i × Ω̄_i and define the following optimization problem:
min_A Σ_i Σ_{((i,j),(i,k)) ∈ T_i} (1 − (x_i^T A x_j − x_i^T A x_k))_+ + λ‖A‖_F², where X is the matrix whose i-th row is x_i^T; (1 − v)_+ denotes the hinge loss; and ‖A‖_F denotes the Frobenius norm of matrix A;
4.2) Model solution: the optimization problem is solved by accelerated gradient descent:
Let L(A) denote the hinge-loss sum; the objective can then be written as F(A) = L(A) + λ‖A‖_F²; differentiating with respect to A gives the gradient: ∇F(A) = 2λA − Σ_{((i,j),(i,k)) ∈ T, 1 − (x_i^T A x_j − x_i^T A x_k) > 0} X^T e_i (e_j − e_k)^T X, where e_i, e_j, e_k are unit vectors;
4.3) Learning-dependency mining: the optimal parameter matrix A obtained in step 4.2) is used to judge, for any two knowledge units, whether a learning dependency exists between them.
2. The method for automatically extracting learning dependencies between knowledge units from courseware text according to claim 1, characterized in that step 1) comprises the following steps:
1.1) Extract the text corresponding to each knowledge unit in the courseware using the POI toolkit, then perform word segmentation and stop-word removal;
1.2) Suppose that after word segmentation an original character string c is split into two words a and b; denote the co-occurrence frequency of string c in the corpus as f(c) and its co-occurrence probability as p(c); by maximum-likelihood estimation, when the corpus is sufficiently large, p(c) can be estimated by f(c); treating each word as an event, the mutual information of the string c = ab is MI(a, b) = log( p(ab) / (p(a) · p(b)) ); mutual information measures the internal cohesion of the string, and a candidate term set is obtained.
3. The method for automatically extracting learning dependencies between knowledge units from courseware text according to claim 1, characterized in that step 2) comprises the following steps:
2.1) Synonym-dictionary expansion: starting from a synonym dictionary, expand it using the redirect marks and interlanguage links in the Wikipedia pages;
2.2) Synonymous-term reduction: using the Wikipedia-expanded synonym dictionary, reduce the synonymous terms in the candidate term set.
4. The method for automatically extracting learning dependencies between knowledge units from courseware text according to claim 3, characterized in that the reduction in step 2.2) works as follows: for a term A that has synonyms, find the term B that has the same meaning as A and the highest frequency of occurrence, and replace A with B in the candidate term set.
CN201610874480.3A 2016-09-30 2016-09-30 Method for automatically extracting learning dependencies between knowledge units from courseware text Active CN106649259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610874480.3A CN106649259B (en) 2016-09-30 2016-09-30 Method for automatically extracting learning dependencies between knowledge units from courseware text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610874480.3A CN106649259B (en) 2016-09-30 2016-09-30 Method for automatically extracting learning dependencies between knowledge units from courseware text

Publications (2)

Publication Number Publication Date
CN106649259A CN106649259A (en) 2017-05-10
CN106649259B true CN106649259B (en) 2019-05-24

Family

ID=58854709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610874480.3A Active CN106649259B (en) 2016-09-30 2016-09-30 Method for automatically extracting learning dependencies between knowledge units from courseware text

Country Status (1)

Country Link
CN (1) CN106649259B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545791B (en) * 2017-08-29 2020-03-06 广州思涵信息科技有限公司 System and method for automatically generating classroom teaching knowledge map by courseware
CN108021682A (en) * 2017-12-11 2018-05-11 西安交通大学 Open information extracts a kind of Entity Semantics method based on wikipedia under background

Citations (2)

Publication number Priority date Publication date Assignee Title
CN102436480A (en) * 2011-10-15 2012-05-02 西安交通大学 Incidence relation excavation method for text-oriented knowledge unit
CN104484454A (en) * 2014-12-27 2015-04-01 西安交通大学 Knowledge map oriented network learning behavior and efficiency analysis method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20160041720A1 (en) * 2014-08-06 2016-02-11 Kaybus, Inc. Knowledge automation system user interface

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN102436480A (en) * 2011-10-15 2012-05-02 西安交通大学 Incidence relation excavation method for text-oriented knowledge unit
CN104484454A (en) * 2014-12-27 2015-04-01 西安交通大学 Knowledge map oriented network learning behavior and efficiency analysis method

Non-Patent Citations (2)

Title
Mining preorder relation between knowledge units from text; J Liu et al.; SAC '10 Proceedings of the 2010 ACM Symposium on Applied Computing; 2010-03-26; pp. 1047-1053 *
A method for automatically extracting knowledge units from term definition sentences; Song Peiyan et al.; Journal of Intelligence; 2014-04-18; Vol. 33, No. 4; pp. 139-143 *

Also Published As

Publication number Publication date
CN106649259A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN107766324B (en) Text consistency analysis method based on deep neural network
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
JP5904559B2 (en) Scenario generation device and computer program therefor
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN113239186B (en) Graph convolution network relation extraction method based on multi-dependency relation representation mechanism
CN106649275A (en) Relation extraction method based on part-of-speech information and convolutional neural network
CN103544242A (en) Microblog-oriented emotion entity searching system
CN105512209A (en) Biomedicine event trigger word identification method based on characteristic automatic learning
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
WO2015093540A1 (en) Phrase pair gathering device and computer program therefor
CN103399901A (en) Keyword extraction method
CN111144119B (en) Entity identification method for improving knowledge migration
CN107798624A (en) A kind of technical label in software Ask-Answer Community recommends method
CN109918649B (en) Suicide risk identification method based on microblog text
CN105975475A (en) Chinese phrase string-based fine-grained thematic information extraction method
CN115860006B (en) Aspect-level emotion prediction method and device based on semantic syntax
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
CN111651983A (en) Causal event extraction method based on self-training and noise model
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN106649259B (en) A method of learning dependence between extracting blocks of knowledge automatically from courseware text
Lin et al. Sensitive information detection based on convolution neural network and bi-directional LSTM
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity
Tianxiong et al. Identifying chinese event factuality with convolutional neural networks
CN111008285B (en) Author disambiguation method based on thesis key attribute network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant