CN103729402A

CN103729402A - Method for establishing mapping knowledge domain based on book catalogue

Info

Publication number: CN103729402A
Application number: CN201310601668.7A
Authority: CN
Inventors: 鲁伟明; 张萌; 魏宝刚; 庄越挺
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2013-11-22
Filing date: 2013-11-22
Publication date: 2014-04-16
Anticipated expiration: 2033-11-22
Also published as: CN103729402B

Abstract

The invention discloses a method for building a knowledge graph (mapping knowledge domain) from book catalogues. The catalogue pages of a digitized book are extracted and the catalogue entries are divided by length into long and short entries. The long entries are tagged for part of speech with a natural language processing tool to obtain part-of-speech arrays, and candidate nodes are extracted according to conjunction, punctuation, and part-of-speech rules. The long and short entries are verified against Baidu Baike and Hudong Baike; hierarchical (parent-child) relations and parallel relations are formed from the catalogue structure and serve as the skeleton of the knowledge graph; strong and weak parallel relations are distinguished and each is used as an increment to supplement the hierarchical relations. Using a suffix-based algorithm for mining noisy data, nodes are selected from the entries that failed encyclopedia verification and added to the knowledge graph. Finally, the weights of the relations in the supplemented knowledge graph are computed and ranked so that noise is screened out. Compared with existing knowledge graphs, the knowledge graph built by this method has more nodes, extends better, and is more accurate.

Description

A method for constructing a knowledge graph based on book catalogues

Technical field

The present invention relates to generating knowledge graphs using artificial intelligence and data mining methods, and in particular to a method for constructing a knowledge graph based on book catalogues.

Background technology

With the rapid development and spread of computers, there is a growing need for a knowledge graph that is rich in content and hierarchy, highly accurate, and easy to extend, so that people can obtain information and learn more easily and clearly, and can analyze and mine the connections between pieces of knowledge and how they evolve. How to build such a knowledge graph has naturally become a focus of current research.

Existing Chinese knowledge graphs include HowNet, the Hudong Baike knowledge tree, and the CNKI classification, but each has its own limitations and problems.

HowNet, developed by Mr. Dong Zhendong of the Chinese Academy of Sciences, is a common-sense knowledge base that takes the concepts represented by Chinese and English words as its objects of description and whose basic content is the relations between concepts and between the attributes of concepts. Specifically, most nodes in HowNet are everyday vocabulary, the hierarchy cannot go deep, the numbers of nodes and relations are comparatively small, and the content must be generated manually.

The Hudong Baike knowledge tree follows the traditional encyclopedia classification: the whole encyclopedia is divided into 12 top-level categories (people, history, culture, art, nature, geography, science, economy, life, society, sports, and technology), and each top-level category is further divided into second-level, third-level, and deeper subcategories. The structure of the Hudong Baike knowledge tree is fixed, its hierarchy is relatively shallow, and it is generated manually, which makes it hard to extend.

CNKI is the China National Knowledge Infrastructure project. It is an information project whose goal is the dissemination, sharing, and value-added use of knowledge resources across society; it was initiated by Tsinghua University and Tsinghua Tongfang and established in June 1999. The CNKI classification is based on academic disciplines: the documents in its database are divided into ten series, each series is divided into several topics, and there are 168 topics in total. Its shortcomings are that it has relatively few levels, relations between nodes are relatively sparse, its fixed structure extends poorly, and it is generated manually.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a method for automatically generating a knowledge graph based on book catalogues.

The method for constructing a knowledge graph based on book catalogues comprises the following steps:

1) Select a book, digitize its catalogue pages by optical character recognition, and, on the digitized catalogue structure, divide the catalogue entries by length into two classes, long entries and short entries;

2) Take the short entries directly as one batch of candidate nodes; at the same time, tag the long entries for part of speech with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and then extract another batch of candidate nodes using conjunction, punctuation, and part-of-speech rules;

3) First filter the two batches of candidate nodes strictly by checking whether each entry exists in Baidu Baike or Hudong Baike; for the entries verified by these encyclopedias, form hierarchical (parent-child) relations from the parent-child structure of the catalogue and parallel relations from its sibling structure, and use these two parts as the skeleton of the knowledge graph;

4) Distinguish strong and weak parallel relations, select nodes from each kind, and add them incrementally to the hierarchical relations, enriching the skeleton of the knowledge graph obtained in the previous step;

5) Using the proposed suffix-based method for mining the useful part of noisy data, select some nodes from the entries that were not verified by Baidu Baike or Hudong Baike and add them to the knowledge graph;

6) For each relation in the supplemented knowledge graph, compute its weight and rank the relations, so that part of the noise is screened out.

Step 2) is as follows:

A sentence is segmented into words and tagged for part of speech with a natural language processing tool. The sentence is then split at enumeration commas (、) and at words tagged as coordinating conjunctions, which turns one sentence into an array of strings A.

First, for each substring A[i] of the string array A, a part-of-speech array of <word, part of speech> pairs is formed.

Next, adjacent elements of each part-of-speech array are merged. Strings at different positions in A are merged in different orders. For the first substring A[0], contiguous adjectives and nouns are merged into one word, scanning from back to front. At the same time, the part of speech of the last word in the part-of-speech array of A[0] is taken as the reference part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the subsequent strings in A.

For the second substring A[1] and each later substring, the part-of-speech array is processed as follows:

2.1) If the reference part of speech is a noun, then in the part-of-speech array of A[1] and of each later string, scan from front to back up to the last element whose part of speech is a noun, and form one word; otherwise return no result;

2.2) Knowledge nodes for films and titles may carry 《 》 or " " symbols. If the reference part of speech is a delimiter, that is, a book-title mark or a quotation mark, then in the part-of-speech array of A[1] and of each later string, scan from front to back up to the last delimiter, and form one word; otherwise return no result;

If the reference part of speech is neither a noun nor a delimiter, then in the part-of-speech array of A[1] and of each later string, form one word when the first part of speech matches the reference part of speech; otherwise return no result;
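The merging rules for A[0] and for rule 2.1) can be sketched as below. This is only an illustrative sketch under stated assumptions: the part-of-speech array is taken to be a list of (word, tag) tuples, the tag names "n" (noun) and "a" (adjective) are hypothetical stand-ins for whatever tag set the tagger emits, and rule 2.1) is read as "keep everything from the start up to the last noun". No FudanNLP call is shown.

```python
from typing import List, Optional, Tuple

TaggedWord = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical tag names; a real tagger will use its own tag set.
NOUN, ADJ = "n", "a"


def merge_first_substring(tagged: List[TaggedWord]) -> Tuple[str, Optional[str]]:
    """Merge the trailing run of adjectives/nouns of A[0], scanning back to front.

    Returns the merged word and the reference part of speech,
    i.e. the tag of the last word in the array.
    """
    if not tagged:
        return "", None
    reference_tag = tagged[-1][1]
    parts = []
    for word, tag in reversed(tagged):
        if tag not in (NOUN, ADJ):
            break  # the contiguous adjective/noun run ends here
        parts.append(word)
    return "".join(reversed(parts)), reference_tag


def merge_later_substring_noun(tagged: List[TaggedWord]) -> Optional[str]:
    """Rule 2.1 (assumed reading): keep the words from the front up to the last noun."""
    last_noun = max((i for i, (_, tag) in enumerate(tagged) if tag == NOUN), default=-1)
    if last_noun < 0:
        return None  # no noun found: return no result
    return "".join(word for word, _ in tagged[: last_noun + 1])
```

For instance, with A[0] tagged as `[("学习", "v"), ("线性", "a"), ("表", "n")]`, the first function merges the trailing adjective and noun into one candidate word and reports a noun as the reference part of speech, so later substrings would be handled by rule 2.1).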

For words not included in Baidu Baike or Hudong Baike, the normalized Google distance (NGD) is used to measure how strongly the two parts cohere. Two nodes whose NGD value lies between 0 and a threshold are considered a reasonable, acceptable pair that can be merged into one word. When a word merged by the part-of-speech rules fails verification, the NGD value computed over its part-of-speech array determines whether it is included:

NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}

where NGD(x, y) is the value computed by the normalized Google distance,

f(x, y) is the number of results Google returns for the keyword "xy",

f(x) is the number of results Google returns for the keyword "x",

f(y) is the number of results Google returns for the keyword "y",

M is the total number of web pages indexed by Google.
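The NGD formula above translates directly into code. The sketch below assumes the hit counts f(x), f(y), f(x, y) and the index size M have already been obtained by the caller; how they are fetched is not shown. The natural logarithm is used (the ratio is the same for any base), and treating a zero hit count as infinite distance is an assumption of this sketch, as is the `threshold` parameter, whose value the text does not state.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float, m: float) -> float:
    """NGD computed from hit counts f(x), f(y), f(x, y) and the index size M."""
    if min(fx, fy, fxy) <= 0:
        # Assumption: a term or pair with no hits is treated as maximally distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))


def can_merge(fx: float, fy: float, fxy: float, m: float, threshold: float) -> bool:
    """Accept the merged word when its NGD lies between 0 and the threshold."""
    return 0 <= normalized_google_distance(fx, fy, fxy, m) <= threshold
```

Two terms that always occur together (f(x) = f(y) = f(x, y)) give a distance of 0, while terms that rarely co-occur give larger values and are rejected once the distance exceeds the threshold.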

Step 3) comprises:

3.1) Structural book entries such as exercises, experiments, and examples in the OCR-processed catalogue pages of each book usually repeat many times at the same catalogue level within the same book. They can therefore be counted and filtered out by building, for each catalogue level, a hash table of <catalogue entry, count> pairs;

3.2) The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken;

An index is built from the entries of Baidu Baike and Hudong Baike. Each node in the catalogue is then checked against this index as it is processed, and nodes verified by Baidu Baike or Hudong Baike are included;

When each catalogue node is processed, lexical analysis is performed with the open-source natural language processing software FudanNLP, and nodes tagged as verbs are not included;

3.3) For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the method of steps 2.1) and 2.2) for extracting parallel relations from a sentence is applied, and each extracted node then forms a hierarchical relation with the parent node of the "noun + conjunction + noun" entry. The numbers of the books in which the hierarchical and parallel relations occur are saved at the same time;

During processing, the catalogue substructure of every book must be kept, and two tables must be saved. The first table stores the knowledge nodes obtained by the method above together with the book numbers and page numbers where they appear, and the book numbers in which the hierarchical and parallel relations between knowledge nodes appear. Because entries that are temporarily discarded may still be valid nodes, a second table stores the book numbers and page numbers of all catalogue nodes, both those that passed and those that did not, together with the book numbers of every hierarchical and parallel relation formed by all entries. Later, by computing the correlation between nodes and by counting over book structures, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as increments. The saved position of each entry can also be used for the subsequent extraction of definitions.
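The short-entry filtering in 3.1) and 3.2) can be sketched as follows. This is a minimal sketch under assumptions: the entries of one book arrive as (level, text) pairs, the encyclopedia index is a plain in-memory set of titles, and `max_repeats` is a hypothetical cut-off (the text does not say how many repetitions mark an entry as structural). The verb filter is omitted because it depends on the tagger.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

MAX_ENTRY_LENGTH = 9  # entries of 9 Chinese characters or fewer are kept


def filter_entries(
    entries: Iterable[Tuple[int, str]],
    encyclopedia_index: Set[str],
    max_repeats: int = 1,
) -> List[Tuple[int, str]]:
    """Filter the (level, text) catalogue entries of one book."""
    entries = list(entries)
    # Hash table of <(level, entry), count>, equivalent to one table per level.
    counts = Counter(entries)
    kept = []
    for level, text in entries:
        if counts[(level, text)] > max_repeats:
            continue  # repeated at the same level: structural entry such as "exercises"
        if len(text) > MAX_ENTRY_LENGTH:
            continue  # long entries are handled separately in 3.3)
        if text not in encyclopedia_index:
            continue  # not verified by the encyclopedia index
        kept.append((level, text))
    return kept
```

In a fuller pipeline the entries rejected by the last check would still be written to the second table described above, so that step 5) can revisit them.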

Step 4) comprises:

4.1) Distinguishing strong and weak parallel relations

The occurrences of each parallel relation are counted and the relations are sorted by absolute frequency. Parallel-relation pairs whose absolute frequency is greater than a threshold are selected as strong parallel relations, and pairs whose absolute frequency is less than the threshold are weak parallel relations;
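The split in 4.1) amounts to counting pairs and comparing against a threshold. The sketch below assumes parallel relations are unordered pairs of node names, so (a, b) and (b, a) are counted together; that is an assumption of this sketch. The text does not say where a pair whose count equals the threshold belongs, so such pairs are left in neither set here.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def split_parallel_relations(
    pairs: Iterable[Tuple[str, str]], threshold: int
) -> Tuple[Set[Pair], Set[Pair]]:
    """Return (strong, weak) parallel relations by absolute frequency."""
    # Assumption: a parallel relation is symmetric, so order inside a pair is ignored.
    counts = Counter(frozenset(pair) for pair in pairs)
    strong = {pair for pair, count in counts.items() if count > threshold}
    weak = {pair for pair, count in counts.items() if count < threshold}
    return strong, weak
```

With the threshold of 4 occurrences used in the example embodiment, a pair seen six times across books is strong and a pair seen twice is weak.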

4.2) Degree of correlation between knowledge nodes

Knowledge nodes are often ambiguous. In a data set formed from hierarchical and parallel relations, the problem of several parent nodes pointing to the same child node is resolved with the help of other nodes related to the knowledge node.

The detailed process is as follows:

4.2.1) For each A → B, use the hierarchical relations to find the set SubA of child nodes of A; for each node EleOfSubA in SubA, use the strong parallel relations to find its set of parallel nodes ParaOfEleOfSubA; merge all ParaOfEleOfSubA into one set Set1;

4.2.2) For B, use the weak parallel relations to find its set of parallel nodes ParaB; for each node C in ParaB, use the strong parallel relations to find its set of parallel nodes ParaOfParaB; each ParaOfParaB serves as a set Set2;

4.2.3) For the sets Set1 and Set2,

Relevancy = SameElementCount + (Weight1 + Weight2) * 10

Weight1 = \frac{SameElementCount}{Set1TotalElementCount},

Weight2 = \frac{SameElementCount}{Set2TotalElementCount},

where SameElementCount is the number of elements the two sets have in common, Weight1 and Weight2 are the proportions of common elements in each set, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2.

The degree of correlation between the child node C and the parent node A is computed; when Relevancy is greater than a threshold, C is regarded as a child node of A and the corresponding A → C is included.

For the problem of several parent nodes pointing to the same child node, the degree of correlation between each parent node and the child node is computed, and the parent node that forms the hierarchical relation with the child node is chosen according to the size of the degree of correlation;
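The Relevancy computation of 4.2.1) to 4.2.3) can be sketched as follows. The dictionary layouts (`children` mapping a parent to its child set, `strong_parallel` mapping a node to its strongly parallel nodes) are hypothetical data structures chosen for illustration, and returning 0 for an empty set is an assumption made to avoid division by zero.

```python
from typing import Dict, Set


def build_set1(a: str, children: Dict[str, Set[str]],
               strong_parallel: Dict[str, Set[str]]) -> Set[str]:
    """Set1: union of the strong parallel nodes of every child of A."""
    set1: Set[str] = set()
    for child in children.get(a, set()):
        set1 |= strong_parallel.get(child, set())
    return set1


def relevancy(set1: Set[str], set2: Set[str]) -> float:
    """Relevancy = SameElementCount + (Weight1 + Weight2) * 10."""
    if not set1 or not set2:
        return 0.0  # assumption: no evidence means no correlation
    same = len(set1 & set2)
    weight1 = same / len(set1)
    weight2 = same / len(set2)
    return same + (weight1 + weight2) * 10
```

When several candidate parents compete for the same child, the same function can be evaluated once per parent and the parent with the largest value kept.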

4.3) Supplementing with strong and weak parallel relations

Strong parallel relations are merged directly into the hierarchical relations; weak parallel relations are merged into the hierarchical relations using the degree of correlation between knowledge nodes introduced above.

Step 5) comprises:

When the data under the same first-level catalogue heading contains entries verified by Baidu Baike or Hudong Baike, the data not verified by the encyclopedias is filled back into the book according to the catalogue substructure saved in step 1) and the book numbers and page numbers of the entries.

Once some data in a sub-catalogue has been verified by Baidu Baike or Hudong Baike, the useful part of the catalogue substructure is mined on the basis of suffixes to supplement the knowledge graph.

The specific method is as follows:

Let SetX be the set of elements in the hierarchical relations verified by Baidu Baike or Hudong Baike. For each A → B in the hierarchical relations, find the list of all book numbers in which A → B occurs;

For each book number, find the set SetY of child nodes of A in that book's catalogue;

For each node Node in the intersection of SetY and SetX (that is, Node is verified by Baidu Baike or Hudong Baike), find the set SetZ of entries in SetY that share a suffix with Node but are not verified by the encyclopedias.

If the condition on the threshold level is satisfied, all remaining unverified entries in SetY other than Node are included; otherwise, only the entries in SetY that share a suffix with Node are included, where level is a preset threshold.
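The suffix-based selection can be sketched as below, with two loudly labelled assumptions. First, the text above does not reproduce the exact condition compared against level, so this sketch substitutes a guess: the share of unverified entries in SetY that have the same suffix as Node. Second, `suffix_len`, the number of trailing characters that must match, is a hypothetical parameter. Neither should be read as the patented condition.

```python
from typing import Set


def shares_suffix(a: str, b: str, suffix_len: int) -> bool:
    """True when two different entries end with the same suffix_len characters."""
    return (
        a != b
        and len(a) >= suffix_len
        and len(b) >= suffix_len
        and a[-suffix_len:] == b[-suffix_len:]
    )


def mine_unverified(set_y: Set[str], set_x: Set[str],
                    level: float, suffix_len: int = 1) -> Set[str]:
    """Select unverified entries of SetY to add to the knowledge graph."""
    unverified = set_y - set_x
    selected: Set[str] = set()
    for node in set_y & set_x:  # verified nodes among the children of A
        set_z = {e for e in unverified if shares_suffix(e, node, suffix_len)}
        if not set_z:
            continue
        # ASSUMPTION: the real condition is missing from the text; a ratio
        # of suffix-sharing entries among unverified entries is used here.
        if len(set_z) / len(unverified) > level:
            selected |= unverified  # include all remaining unverified entries
        else:
            selected |= set_z  # include only the suffix-sharing entries
    return selected
```

The intuition is that if most unverified siblings of a verified node such as "冒泡排序" also end in "排序", the whole sibling group is probably a coherent list of concepts.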

Step 6) comprises:

6.1) Cleaning the hierarchical relations

The hierarchical relations contain three kinds of noise: type 1, redundant relations, i.e. A → A; type 2, short relations A → B chopped out of a long relation A → BC; type 3, meaningless circular relations A → B → A. The supplemented hierarchical relations are therefore read in again, types 1 and 3 are screened out, and the short relation A → B of type 2 is merged into the long relation A → BC;
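The three cleaning rules can be sketched on a set of (parent, child) edges. Two readings here are assumptions of this sketch: for type 3, both edges of a two-node cycle are dropped; for type 2, a short child B is considered "chopped out" of a longer child BC of the same parent when BC begins with B, and the short edge is simply discarded in favour of the long one.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def clean_relations(edges: Set[Edge]) -> Set[Edge]:
    """Remove type 1, 2, and 3 noise from a set of hierarchical relations."""
    # Type 1: drop redundant self-relations A -> A.
    cleaned = {(a, b) for a, b in edges if a != b}
    # Type 3 (assumed reading): drop both edges of a cycle A -> B -> A.
    cleaned = {(a, b) for a, b in cleaned if (b, a) not in cleaned}
    # Type 2 (assumed reading): drop A -> B when A also has a longer child
    # that starts with B, keeping only the long relation A -> BC.
    children: Dict[str, Set[str]] = {}
    for a, b in cleaned:
        children.setdefault(a, set()).add(b)
    return {
        (a, b)
        for a, b in cleaned
        if not any(other != b and other.startswith(b) for other in children[a])
    }
```

A prefix test is deliberately crude; a production version would more likely use the saved catalogue positions to confirm that B and BC come from the same entry.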

6.2) Ranking the hierarchical relations

For the cleaned hierarchical relations, a weight is computed for each relation to represent its confidence;

The weight w'(T → L) of each hierarchical relation produced by steps 1) to 5) is computed by the following formulas, and the relations are then ranked;

IDF(L) = c(T \rightarrow L) * \frac{1 + N}{1 + DF(L)} \quad (1)

where

c(T → L) is the number of times T → L occurs;

DF(L) is the number of times L occurs in parallel relations;

N is the total number of nodes in parallel relations;

IDF(L) is the inverse document frequency of L in parallel relations;

w(T \rightarrow L) = c(T \rightarrow L) * IDF(L) \quad (2)

where

w(T → L) is the weight of the edge T → L after taking into account the number of occurrences of T → L and the inverse document frequency of the child node;

Sim(T, T1) = \log [1 + \frac{N(T, T1)}{\sqrt{IDF(T) * IDF(T1)}}] \quad (3)

where

Sim(T, T1) is the similarity between T and T1;

N(T, T1) is the number of times T and T1 co-occur in parallel relations;

\tilde{w}(T \rightarrow L) = \frac{\log(\sum_{L'} w(T \rightarrow L'))}{\sum_{L'} w(T \rightarrow L')} * w(T \rightarrow L) \quad (4)

where

\tilde{w}(T → L) is the updated weight of T → L after additionally taking into account the different child nodes of T in parallel relations;

w'(T \rightarrow L) = \tilde{w}(T \rightarrow L) + \sum_{T1 \neq T} [\mu * Sim(T, T1) * \tilde{w}(T1 \rightarrow L)] \quad (5)

where

w'(T → L) is the weight of the relation T → L after additionally taking into account the different parent nodes of L in parallel relations and adding the correlation between nodes,

μ is a weighting coefficient, set to 0.5.
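Formulas (3) to (5) can be sketched as below. To stay neutral about formulas (1) and (2), this sketch takes the edge weights w(T → L) and the IDF values as inputs supplied by the caller rather than computing them. The natural logarithm is assumed (the text does not state a base), the similarity table keyed by unordered node pairs is a hypothetical layout, and the sum of a parent's outgoing weights is assumed to be positive.

```python
import math
from typing import Dict, FrozenSet, Tuple

Edge = Tuple[str, str]  # (T, L): parent T, child L


def similarity(n_common: float, idf_t: float, idf_t1: float) -> float:
    """Formula (3): Sim(T, T1) from co-occurrence count and IDF values."""
    return math.log(1 + n_common / math.sqrt(idf_t * idf_t1))


def normalize_weights(w: Dict[Edge, float]) -> Dict[Edge, float]:
    """Formula (4): rescale each edge by log(sum) / sum over T's outgoing edges."""
    totals: Dict[str, float] = {}
    for (t, _), weight in w.items():
        totals[t] = totals.get(t, 0.0) + weight
    return {
        (t, l): math.log(totals[t]) / totals[t] * weight
        for (t, l), weight in w.items()
    }


def final_weights(w_tilde: Dict[Edge, float],
                  sim: Dict[FrozenSet[str], float],
                  mu: float = 0.5) -> Dict[Edge, float]:
    """Formula (5): add similarity-weighted support from other parents of L."""
    result = {}
    for (t, l), weight in w_tilde.items():
        support = sum(
            mu * sim.get(frozenset((t, t1)), 0.0) * other
            for (t1, l1), other in w_tilde.items()
            if l1 == l and t1 != t
        )
        result[(t, l)] = weight + support
    return result
```

Ranking is then a sort of the resulting dictionary by value in descending order, after which low-weight relations can be cut off as noise.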

Compared with the prior art, the invention has the following beneficial effects:

1. The whole process is completed automatically by machine, without manual intervention.

2. The method extends well: to enrich the knowledge graph, one only needs to add new book catalogues.

3. The resulting hierarchy is deep and the relations between nodes are rich; as new book catalogues are continually added, the hierarchy depth, the connections between nodes, and the accuracy improve accordingly.

Brief description of the drawings

Fig. 1 is the overall flow chart of the invention;

Fig. 2 is the flow chart of step 2);

Fig. 3 is the flow chart of step 3);

Fig. 4 is the flow chart of step 4);

Fig. 5 is the flow chart of step 5).

Embodiment

The concrete steps of this example are described in detail below in conjunction with the method of the invention:

1) Optical character recognition (OCR) was performed on 10,000 computer books, and on the digitized catalogue structure the catalogue entries were divided by length, with 9 Chinese characters as the boundary, into long entries and short entries;

2) As shown in Fig. 1 and Fig. 2, the short entries are taken directly as one batch of candidate nodes; at the same time, the long entries are tagged for part of speech with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and another batch of candidate nodes is extracted using conjunction, punctuation, and part-of-speech rules;

The word segmentation, part-of-speech tagging, and splitting and merging of words for long entries are processed as follows:

NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}

M is the total number of web pages indexed by Google.

3) As shown in Fig. 3,

Structural book entries such as exercises, experiments, and examples in the OCR-processed catalogue pages of each book usually repeat many times at the same catalogue level within the same book. They can therefore be counted and filtered out by building, for each catalogue level, a hash table of <catalogue entry, count> pairs;

The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken;

For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the method of Fig. 2 for extracting parallel relations from a sentence is applied, and each extracted node then forms a hierarchical relation with the parent node of the "noun + conjunction + noun" entry. The numbers of the books in which the hierarchical and parallel relations occur are saved at the same time;

4) As shown in Fig. 4, the occurrences of each parallel relation are counted and the relations are sorted by absolute frequency. Parallel-relation pairs that occur more than 4 times are selected as strong parallel relations, and pairs that occur fewer than 4 times are weak parallel relations;

The detailed process is as follows:

For each A → B, use the hierarchical relations to find the set SubA of child nodes of A; for each node EleOfSubA in SubA, use the strong parallel relations to find its set of parallel nodes ParaOfEleOfSubA; merge all ParaOfEleOfSubA into one set Set1;

For B, use the weak parallel relations to find its set of parallel nodes ParaB; for each node C in ParaB, use the strong parallel relations to find its set of parallel nodes ParaOfParaB; each ParaOfParaB serves as a set Set2;

For the sets Set1 and Set2,

Relevancy = SameElementCount + (Weight1 + Weight2) * 10

Weight1 = \frac{SameElementCount}{Set1TotalElementCount},

Weight2 = \frac{SameElementCount}{Set2TotalElementCount},

The degree of correlation between the child node C and the parent node A is computed; when Relevancy is greater than the threshold 0.5, C is regarded as a child node of A and the corresponding A → C is included.

5) As shown in Fig. 5, using the proposed suffix-based method for mining the useful part of noisy data, some nodes are selected from the entries not verified by Baidu Baike or Hudong Baike and added to the knowledge graph. When the data under the same first-level catalogue heading contains entries verified by the encyclopedias, the unverified data is filled back into the book according to the saved catalogue substructure and the book numbers and page numbers of the entries.

The specific method is as follows:

If the condition on the threshold level is satisfied, all remaining unverified entries in SetY other than Node are included; otherwise, only the entries in SetY that share a suffix with Node are included, where the threshold level is set to 0.75.

6) Taking the similarity between nodes into account, the weight of each relation in the supplemented knowledge graph is computed and the relations are ranked, so that part of the noise is screened out.

The hierarchical relations contain three kinds of noise: type 1, redundant relations, i.e. A → A; type 2, short relations A → B chopped out of a long relation A → BC; type 3, meaningless circular relations A → B → A. The supplemented hierarchical relations are therefore read in again, types 1 and 3 are screened out, and the short relation A → B of type 2 is merged into the long relation A → BC. This is the cleaning step;

IDF(L) = c(T \rightarrow L) * \frac{1 + N}{1 + DF(L)} \quad (1)

where

c(T → L) is the number of times T → L occurs;

DF(L) is the number of times L occurs in parallel relations;

N is the total number of nodes in parallel relations;

IDF(L) is the inverse document frequency of L in parallel relations;

w(T \rightarrow L) = c(T \rightarrow L) * IDF(L) \quad (2)

Sim(T, T1) = \log [1 + \frac{N(T, T1)}{\sqrt{IDF(T) * IDF(T1)}}] \quad (3)

where

Sim(T, T1) is the similarity between T and T1;

\tilde{w}(T \rightarrow L) = \frac{\log(\sum_{L'} w(T \rightarrow L'))}{\sum_{L'} w(T \rightarrow L')} * w(T \rightarrow L) \quad (4)

where

\tilde{w}(T → L) is the updated weight of T → L after additionally taking into account the different child nodes of T in parallel relations;

w'(T \rightarrow L) = \tilde{w}(T \rightarrow L) + \sum_{T1 \neq T} [\mu * Sim(T, T1) * \tilde{w}(T1 \rightarrow L)] \quad (5)

where

μ is a weighting coefficient, set to 0.5.

Results of this example: after all four kinds of increments were added, there were 25,426 hierarchical relations in total, producing 741 root nodes; the generated knowledge graph contained 843,998 nodes, with a maximum depth of 85 levels and an average depth of 28.2 levels, and the accuracy was 75.1%.

Because the hierarchy depth of HowNet and of the CNKI knowledge classification is generally in the single digits, and their node counts are far below those of the Hudong Baike classification tree, the Hudong Baike knowledge tree is chosen here for comparison. Statistics over the computer-related subcategories of the Hudong Baike knowledge tree show 21 root nodes and 75,434 nodes in total, with a maximum depth of 48 levels and an average depth of 7.3 levels.

The comparison shows that this method far exceeds the existing classifications in indicators such as node count and hierarchy depth, while maintaining comparatively high accuracy, requiring no manual intervention, and extending well.

Below are 6 examples of depth 5 selected from the knowledge graph produced by this method, with the accuracy statistics of each:

Claims

1. A method for constructing a knowledge graph based on book catalogues, characterized by comprising the following steps:

2. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that step 2) is as follows:

NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}

M is the total number of web pages indexed by Google.

3. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that step 3) comprises:

4. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that step 4) comprises:

4.1) Distinguishing strong and weak parallel relations

4.2) Degree of correlation between knowledge nodes

The detailed process is as follows:

4.2.3) For the sets Set1 and Set2,

Relevancy = SameElementCount + (Weight1 + Weight2) * 10

Weight1 = \frac{SameElementCount}{Set1TotalElementCount},

Weight2 = \frac{SameElementCount}{Set2TotalElementCount},

4.3) Supplementing with strong and weak parallel relations

5. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that step 5) comprises:

The specific method is as follows:

6. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that step 6) comprises:

6.1) Cleaning the hierarchical relations

6.2) Ranking the hierarchical relations

IDF(L) = c(T \rightarrow L) * \frac{1 + N}{1 + DF(L)} \quad (1)

where

c(T → L) is the number of times T → L occurs;

DF(L) is the number of times L occurs in parallel relations;

N is the total number of nodes in parallel relations;

IDF(L) is the inverse document frequency of L in parallel relations;

w(T \rightarrow L) = c(T \rightarrow L) * IDF(L) \quad (2)

Sim(T, T1) = \log [1 + \frac{N(T, T1)}{\sqrt{IDF(T) * IDF(T1)}}] \quad (3)

where

Sim(T, T1) is the similarity between T and T1;

\tilde{w}(T \rightarrow L) = \frac{\log(\sum_{L'} w(T \rightarrow L'))}{\sum_{L'} w(T \rightarrow L')} * w(T \rightarrow L) \quad (4)

where

\tilde{w}(T → L) is the updated weight of T → L after additionally taking into account the different child nodes of T in parallel relations,

w'(T \rightarrow L) = \tilde{w}(T \rightarrow L) + \sum_{T1 \neq T} [\mu * Sim(T, T1) * \tilde{w}(T1 \rightarrow L)] \quad (5)

where

μ is a weighting coefficient, set to 0.5.