CN103729402A - Method for establishing mapping knowledge domain based on book catalogue - Google Patents

Publication number
CN103729402A
Authority
CN
China
Prior art keywords
node
speech
superior
catalogue
coordination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310601668.7A
Other languages
Chinese (zh)
Other versions
CN103729402B (en
Inventor
鲁伟明
张萌
魏宝刚
庄越挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201310601668.7A
Publication of CN103729402A
Application granted
Publication of CN103729402B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90324 Query formulation using system suggestions
    • G06F16/90328 Query formulation using system suggestions using search space presentation or visualization, e.g. category or range presentation and selection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9017 Indexing; Data structures therefor; Storage structures using directory or table look-up
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems

Abstract

The invention discloses a method for building a mapping knowledge domain (knowledge graph) from book catalogues. The catalogue (table of contents) pages of a digitized book are extracted and the catalogue entries are divided into long entries and short entries by length. The long entries are part-of-speech tagged with a natural language processing tool to obtain part-of-speech arrays, and candidate nodes are extracted from them using rules based on conjunctions, punctuation and parts of speech. The long and short entries are verified against Baidu Baike and Hudong Baike. The catalogue structure yields superior-subordinate relations and parallel relations, which form the skeleton of the knowledge graph. Strong and weak parallel relations are then distinguished, and each kind is used to supplement the superior-subordinate relations incrementally. A suffix-based algorithm for mining noisy data selects further nodes from the entries that failed the encyclopedia verification and adds them to the knowledge graph. Finally, the weights of the relations in the supplemented knowledge graph are computed and ranked so that noise can be screened out. Compared with existing knowledge graphs, a knowledge graph built by this method has more nodes, is easier to extend and is more accurate.

Description

A method for constructing a knowledge graph based on book catalogues
Technical field
The present invention relates to generating knowledge graphs by means of artificial intelligence, data mining and similar techniques, and in particular to a method for constructing a knowledge graph based on book catalogues.
Background art
Today, with computers developing rapidly and in widespread use, people want to obtain information and learn knowledge more conveniently and clearly, and to analyse and mine the connections between pieces of knowledge and how they evolve. This increasingly calls for a knowledge graph that is rich in content and levels, highly accurate and easy to extend, and how to build such a knowledge graph has naturally become a focus of current research.
Existing Chinese knowledge graphs include HowNet, the Hudong Baike knowledge tree and the CNKI classification, but each has its own limitations and problems.
HowNet was developed by Mr. Dong Zhendong of the Chinese Academy of Sciences. It is a common-sense knowledge base that takes the concepts represented by Chinese and English words as its objects of description and whose basic content is the relations between concepts and between the attributes of concepts. In HowNet, most nodes are everyday vocabulary, the hierarchy cannot go deep, the numbers of nodes and relations are comparatively small, and the content has to be produced manually.
The Hudong Baike knowledge tree follows the traditional encyclopedia classification and divides the whole encyclopedia into 12 top-level categories: people, history, culture, art, nature, geography, science, economy, life, society, sports and technology. Each top-level category is subdivided step by step into second-level, third-level and further subcategories. The structure of the tree is fixed, its hierarchy is relatively shallow, and it is produced manually, which makes it hard to extend.
CNKI is the China National Knowledge Infrastructure project. It is an information project whose goal is the dissemination, sharing and value-added use of knowledge resources across society; it was initiated by Tsinghua University and Tsinghua Tongfang and founded in June 1999. The CNKI classification is based on academic disciplines: the documents in the database are divided into ten series, each series is divided into several topics, and there are 168 topics in total. Its shortcomings are that it has relatively few levels, the relations between nodes are relatively sparse, the fixed structure extends poorly, and it is produced manually.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a method for automatically generating a knowledge graph from book catalogues.
The method for constructing a knowledge graph based on book catalogues comprises the following steps:
1) Select a book, digitize its catalogue (table of contents) pages by optical character recognition, and on the digitized catalogue structure divide the catalogue entries into two classes, long entries and short entries, according to their length;
2) Take the short entries directly as one batch of candidate nodes; at the same time, tag the long entries with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and then extract a second batch of candidate nodes using rules based on conjunctions, punctuation and parts of speech;
3) Filter both batches of candidate nodes strictly by checking whether each entry exists in Baidu Baike and Hudong Baike. For the entries that pass this verification, form superior-subordinate relations from the parent-child structure of the catalogue and parallel (coordinate) relations from its sibling structure, and use these two parts as the skeleton of the knowledge graph;
4) Distinguish strong and weak parallel relations, select nodes from each kind, and add them incrementally to the superior-subordinate relations, enriching the skeleton of the knowledge graph obtained in the previous step;
5) Using the proposed suffix-based method for mining the useful part of noisy data, select some nodes from the entries that did not pass the Baidu Baike and Hudong Baike verification and add them to the knowledge graph;
6) Compute the weight of every relation in the supplemented knowledge graph and rank the relations, so that part of the noise is screened out by the ranking.
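The split in step 1) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the boundary between short and long entries is the 9-character limit that steps 3.2) and 3.3) mention, and the names `split_entries` and `MAX_SHORT_LEN` are hypothetical.

```python
from typing import List, Tuple

# Assumption: the short/long boundary is the 9-character limit of steps 3.2)/3.3).
MAX_SHORT_LEN = 9


def split_entries(entries: List[str],
                  max_short_len: int = MAX_SHORT_LEN) -> Tuple[List[str], List[str]]:
    """Split OCR-ed catalogue entries into (short, long) lists by character count."""
    short, long_ = [], []
    for entry in entries:
        text = entry.strip()
        if not text:
            continue  # skip blank lines left over from OCR
        if len(text) <= max_short_len:
            short.append(text)
        else:
            long_.append(text)
    return short, long_
```

Short entries would go straight to verification, while long entries would first pass through the part-of-speech rules of step 2).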
Step 2) is as follows:
A natural language processing tool is used to segment each sentence into words and tag their parts of speech. The sentence is then split at enumeration commas (、) and at words tagged as coordinating conjunctions, so that one sentence becomes a string array A.
First, for each substring A[i] of the string array A, a part-of-speech array of <word, part of speech> pairs is built.
Next, adjacent elements of each part-of-speech array are merged, and the merge order depends on the position of the substring in A. For the first substring A[0], the array is scanned from back to front and consecutive adjectives and nouns are merged into one word. At the same time, the part of speech of the last word in the part-of-speech array of A[0] is taken as the reference part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the subsequent substrings in A.
For the second substring A[1] and every later substring, the part-of-speech array is processed as follows:
2.1) If the reference part of speech is a noun, scan the part-of-speech array of A[1] and of each later substring from front to back up to the last element tagged as a noun and merge the matched elements into one word; otherwise return no result;
2.2) Knowledge nodes such as film and book titles may be enclosed in 《》 or quotation marks. If the reference part of speech is a delimiter symbol, that is, a title mark or a quotation mark, scan the part-of-speech array of A[1] and of each later substring from front to back up to the last delimiter symbol and merge the matched elements into one word; otherwise return no result;
If the reference part of speech is neither a noun nor a delimiter symbol, then for A[1] and each later substring a word is formed when the first part of speech in its array matches the reference part of speech; otherwise no result is returned.
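One possible reading of these merge rules is sketched below. It works on tokens that have already been segmented and tagged, so no FudanNLP call is shown. The tag names (`"n"`, `"a"`, `"c"`, `"delim"`) and all function names are hypothetical, and returning only the first token in the "neither noun nor delimiter" case is an assumption about what "form a word" means there.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part of speech)

# Hypothetical tag names; a real tagger uses its own tag set.
NOUN, ADJ, CONJ, DELIM = "n", "a", "c", "delim"


def split_segments(tokens: List[Token]) -> List[List[Token]]:
    """Split a tagged sentence at enumeration commas and coordinating conjunctions."""
    segments: List[List[Token]] = []
    current: List[Token] = []
    for word, pos in tokens:
        if word == "、" or pos == CONJ:
            if current:
                segments.append(current)
            current = []
        else:
            current.append((word, pos))
    if current:
        segments.append(current)
    return segments


def merge_first(segment: List[Token]) -> Tuple[str, str]:
    """A[0]: scan back to front, merging the trailing run of adjectives and nouns.

    Returns the merged word and the reference part of speech (tag of the last token).
    """
    words: List[str] = []
    for word, pos in reversed(segment):
        if pos not in (NOUN, ADJ):
            break
        words.append(word)
    return "".join(reversed(words)), segment[-1][1]


def merge_later(segment: List[Token], reference_pos: str) -> Optional[str]:
    """A[1], A[2], ...: scan front to back, guided by the reference part of speech."""
    if reference_pos in (NOUN, DELIM):
        # Rules 2.1/2.2: take everything up to the last token with the reference tag.
        last = max((i for i, (_, pos) in enumerate(segment) if pos == reference_pos),
                   default=-1)
        if last < 0:
            return None
        return "".join(word for word, _ in segment[: last + 1])
    # Remaining rule: accept only if the first tag matches the reference tag.
    if segment[0][1] == reference_pos:
        return segment[0][0]
    return None


def extract_candidates(tokens: List[Token]) -> List[str]:
    """Apply the rules to one tagged long entry and return the candidate nodes."""
    segments = split_segments(tokens)
    if not segments:
        return []
    first, reference_pos = merge_first(segments[0])
    results = [first] if first else []
    for segment in segments[1:]:
        merged = merge_later(segment, reference_pos)
        if merged:
            results.append(merged)
    return results
```

For a tagged entry such as 线性/a 表/n 、 栈/n 和/c 队列/n, this sketch yields the three candidates 线性表, 栈 and 队列.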
For words that are not included in Baidu Baike or Hudong Baike, the Normalized Google Distance is used to measure how strongly the two parts cohere: two nodes whose distance lies between 0 and a threshold are regarded as a pair that can reasonably be merged into one word and accepted. That is, when a word merged by the part-of-speech rules does not pass the verification, the Normalized Google Distance computed over its part-of-speech array determines whether it is included:
$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$
where
NGD(x, y) is the value computed with the Normalized Google Distance,
f(x, y) is the number of results Google returns for the keyword "xy",
f(x) is the number of results Google returns for the keyword "x",
f(y) is the number of results Google returns for the keyword "y",
M is the total number of web pages indexed by Google.
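The distance above can be computed directly from four counts. The sketch below takes the hit counts as plain numbers rather than querying any search engine; the function names, the handling of zero counts, and the inclusive comparison with the threshold are assumptions made for illustration.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float,
                               total_pages: float) -> float:
    """NGD(x, y) from hit counts f(x), f(y), f(x, y) and the index size M."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # assumption: terms that never co-occur are maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(total_pages) - min(log_fx, log_fy)
    if denominator <= 0:
        return math.inf  # degenerate case: a term occurs on every indexed page
    return (max(log_fx, log_fy) - log_fxy) / denominator


def should_merge(fx: float, fy: float, fxy: float,
                 total_pages: float, threshold: float) -> bool:
    """Accept the merged word when its NGD lies between 0 and the threshold."""
    distance = normalized_google_distance(fx, fy, fxy, total_pages)
    return 0.0 <= distance <= threshold
```

The ratio does not depend on the logarithm base, so natural logarithms are used here purely for convenience.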
Step 3) comprises:
3.1) In the catalogue pages recognized by optical character recognition, structural entries of a book such as exercises, experiments and examples usually repeat many times at the same catalogue level of the same book. They can therefore be counted and filtered out by building, for each catalogue level, a hash table of <catalogue entry, count> pairs;
3.2) The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidu Baike and Hudong Baike. Each node in the catalogue is then checked against this index, and the nodes that pass the Baidu Baike and Hudong Baike verification are included;
Each catalogue node is also analysed lexically with the open-source natural language processing software FudanNLP, and nodes tagged as verbs are not included;
3.3) For catalogue entries longer than 9 Chinese characters, if the entry has the form "noun + conjunction + noun", the parallel relations are extracted with the method that steps 2.1) and 2.2) apply to sentences, and each extracted node then forms a superior-subordinate relation with the superior node of the "noun + conjunction + noun" entry. The numbers of the books in which the superior-subordinate and parallel relations occur are saved at the same time;
During processing, the catalogue substructure of every book must be kept, and two tables are saved. The first table stores the knowledge nodes obtained by the method above together with the book numbers and page numbers where they occur, as well as the book numbers where the superior-subordinate and parallel relations between knowledge nodes occur. Because an entry that is discarded for now may still be a valid node, a second table stores the book numbers and page numbers of all catalogue nodes, both those that passed and those that did not, together with the book numbers of every superior-subordinate and parallel relation formed by all entries. Later, by computing the relevancy between nodes and by using statistics over the book structure, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as increments. The saved position of each entry can also be used for the subsequent extraction of definitions.
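The filters of steps 3.1) and 3.2) can be sketched for a single book as follows. The encyclopedia index is modelled as a plain set of titles, the verb check is left out because it needs a tagger, and `min_repeats` is a hypothetical parameter, since the source does not say how many repetitions mark an entry as structural.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def find_structural_entries(entries_by_level: Dict[int, List[str]],
                            min_repeats: int = 2) -> Set[str]:
    """Entries that repeat within one catalogue level of one book, e.g. exercises."""
    structural: Set[str] = set()
    for level_entries in entries_by_level.values():
        counts = Counter(level_entries)  # the <catalogue entry, count> hash table
        structural.update(entry for entry, n in counts.items() if n >= min_repeats)
    return structural


def filter_candidates(entries_by_level: Dict[int, List[str]],
                      encyclopedia_index: Set[str],
                      max_len: int = 9
                      ) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Return (accepted, rejected) lists of (level, entry) pairs for one book.

    Rejected entries are kept rather than dropped, mirroring the second table
    that the method saves for later supplementation.
    """
    structural = find_structural_entries(entries_by_level)
    accepted: List[Tuple[int, str]] = []
    rejected: List[Tuple[int, str]] = []
    for level, entries in entries_by_level.items():
        for entry in entries:
            if entry in structural:
                continue  # repeated structural entry, filtered out entirely
            if len(entry) <= max_len and entry in encyclopedia_index:
                accepted.append((level, entry))
            else:
                rejected.append((level, entry))
    return accepted, rejected
```

Keeping the rejected list alongside the accepted one is what later allows step 5) to revisit entries that failed the verification.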
Step 4) comprises:
4.1) Distinguishing strong and weak parallel relations
The number of occurrences of each parallel relation is counted and the relations are sorted by absolute frequency. Parallel-relation pairs whose absolute frequency is greater than a threshold are selected as strong parallel relations, and pairs whose absolute frequency is less than the threshold are weak parallel relations;
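A minimal sketch of this split, assuming each group of sibling catalogue entries contributes one occurrence to every unordered pair it contains. The source does not say how a pair whose frequency equals the threshold is classified; treating it as weak is an assumption, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def split_parallel_relations(sibling_groups: Iterable[List[str]],
                             threshold: int) -> Tuple[Set[Pair], Set[Pair]]:
    """Count parallel (coordination) pairs and split them into (strong, weak)."""
    counts: Counter = Counter()
    for group in sibling_groups:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for a, b in combinations(sorted(set(group)), 2):
            counts[(a, b)] += 1
    strong = {pair for pair, n in counts.items() if n > threshold}
    weak = {pair for pair, n in counts.items() if n <= threshold}  # ties: assumption
    return strong, weak
```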
4.2) Relevancy between knowledge nodes
Knowledge nodes are often ambiguous. In a data set made up of superior-subordinate relations and parallel relations, the problem of several superior nodes pointing to the same subordinate node is resolved with the help of other nodes related to the knowledge nodes.
The procedure is as follows:
4.2.1) For each A → B, use the superior-subordinate relations to find the group SubA of subordinate nodes of A. For each node EleOfSubA in this group, use the strong parallel relations to find its group of parallel nodes ParaOfEleOfSubA, and merge all ParaOfEleOfSubA groups into one set Set1;
4.2.2) For B, use the weak parallel relations to find its group of parallel nodes ParaB. For each node C in ParaB, use the strong parallel relations to find its group of parallel nodes ParaOfParaB; each ParaOfParaB serves as one set Set2;
4.2.3) For the sets Set1 and Set2,
$$\mathrm{Relevancy} = \mathrm{SameElementCount} + (\mathrm{Weight1} + \mathrm{Weight2}) \times 10$$
$$\mathrm{Weight1} = \frac{\mathrm{SameElementCount}}{\mathrm{Set1TotalElementCount}},$$
$$\mathrm{Weight2} = \frac{\mathrm{SameElementCount}}{\mathrm{Set2TotalElementCount}},$$
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions of Set1 and Set2 respectively that the common elements make up, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2.
The relevancy between the subordinate node C and the superior node A is computed; when Relevancy is greater than a threshold, C is regarded as a subordinate node of A and the corresponding A → C is included.
For the problem of several superior nodes pointing to the same subordinate node, the relevancy between each superior node and this subordinate node is computed, and the superior node that forms the superior-subordinate relation with this subordinate node is selected according to the magnitude of the relevancy;
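Steps 4.2.1) to 4.2.3) can be sketched with adjacency maps. The three maps (`subordinates`, `strong`, `weak`) and the function names are hypothetical data structures chosen for illustration, and returning 0 for an empty set is an assumption that avoids a division by zero.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # node -> related nodes


def relevancy(set1: Set[str], set2: Set[str]) -> float:
    """Relevancy = SameElementCount + (Weight1 + Weight2) * 10."""
    if not set1 or not set2:
        return 0.0  # assumption: no evidence when either set is empty
    same = len(set1 & set2)
    weight1 = same / len(set1)
    weight2 = same / len(set2)
    return same + (weight1 + weight2) * 10


def build_set1(a: str, subordinates: Graph, strong: Graph) -> Set[str]:
    """4.2.1: union of the strong parallel nodes of every subordinate of A."""
    result: Set[str] = set()
    for child in subordinates.get(a, set()):
        result |= strong.get(child, set())
    return result


def candidate_subordinates(a: str, b: str, subordinates: Graph, strong: Graph,
                           weak: Graph, threshold: float) -> Dict[str, float]:
    """For a relation A -> B, score each weak parallel node C of B against A.

    Returns {C: relevancy} for every C whose relevancy exceeds the threshold,
    that is, every candidate relation A -> C that would be included.
    """
    set1 = build_set1(a, subordinates, strong)
    accepted: Dict[str, float] = {}
    for c in weak.get(b, set()):
        set2 = strong.get(c, set())  # 4.2.2: strong parallel nodes of C
        score = relevancy(set1, set2)
        if score > threshold:
            accepted[c] = score
    return accepted
```

When several superior nodes compete for the same subordinate node, the same `relevancy` score can be computed for each of them and compared.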
4.3) Supplementing with strong and weak parallel relations
Strong parallel relations are merged directly into the superior-subordinate relations; weak parallel relations are merged into the superior-subordinate relations using the relevancy between knowledge nodes introduced above.
Step 5) comprises:
When the data under one first-level catalogue heading contain entries verified by Baidu Baike and Hudong Baike, the catalogue substructure saved in step 1) and the book numbers and page numbers of the entries are used to fill the unverified data back into the book.
Once some data in a sub-catalogue have passed the Baidu Baike and Hudong Baike verification, the useful part of the catalogue substructure is mined on the basis of suffixes in order to supplement the knowledge graph.
The method is as follows:
Let SetX be the set of elements in the superior-subordinate relations verified by Baidu Baike and Hudong Baike. For each superior-subordinate relation A → B, find the list of all book numbers in which A → B occurs.
For each book number, find the set SetY of subordinate nodes of A in the catalogue of that book.
For each node Node in the intersection of SetY and SetX (Node is therefore a verified entry), find the set SetZ of entries in SetY that have the same suffix as Node but did not pass the Baidu Baike and Hudong Baike verification. If
$$\mathrm{percentage} = \frac{|\mathrm{SetX} \cap \mathrm{SetY}| + |\mathrm{SetZ}|}{|\mathrm{SetY}|} > \mathrm{level}$$
then all the remaining unverified entries in SetY, apart from Node, are included; otherwise only the entries in SetY that have the same suffix as Node are included. Here level is a preset threshold.
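The suffix rule above can be sketched for a single book and a single superior node A. The source does not state how long the shared suffix must be, so `suffix_len` is a hypothetical parameter, as is the function name.

```python
from typing import Set


def supplement_by_suffix(set_x: Set[str], set_y: Set[str],
                         level: float, suffix_len: int = 1) -> Set[str]:
    """Select unverified entries of SetY to add to the knowledge graph.

    set_x: entries that passed the encyclopedia verification.
    set_y: subordinate entries of A in one book's catalogue.
    """
    verified = set_x & set_y
    unverified = set_y - set_x
    if not verified:
        return set()  # nothing verified under A in this book, so no anchor exists
    added: Set[str] = set()
    for node in verified:
        suffix = node[-suffix_len:]
        set_z = {entry for entry in unverified if entry.endswith(suffix)}
        percentage = (len(verified) + len(set_z)) / len(set_y)
        if percentage > level:
            added |= unverified  # include every remaining unverified entry
        else:
            added |= set_z       # include only the entries sharing the suffix
    return added
```

With a two-character suffix, the verified node 冒泡排序 would pull in unverified siblings such as 希尔排序, and a sufficiently high percentage would pull in the rest of the siblings as well.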
Step 6) comprises:
6.1) Cleaning the superior-subordinate relations
The superior-subordinate relations contain three kinds of noise: type 1, redundant relations of the form A → A; type 2, short relations A → B that are fragments cut from a long relation A → BC; and type 3, meaningless circular relations A → B → A. The supplemented superior-subordinate relations are therefore read in again, types 1 and 3 are screened out, and each short relation A → B of type 2 is merged into its long relation A → BC;
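A minimal sketch of this cleaning pass over a set of (superior, subordinate) edges. Two points are assumptions rather than statements from the source: a circular pair A → B, B → A is removed in both directions, and a short relation A → B counts as a fragment of A → BC when the longer subordinate name starts with B.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (superior, subordinate)


def clean_relations(edges: Set[Edge]) -> Set[Edge]:
    """Remove self-loops, two-node cycles and fragments of longer relations."""
    # Type 1: redundant relations A -> A.
    cleaned = {(a, b) for a, b in edges if a != b}
    # Type 3: circular relations A -> B -> A (both directions dropped: assumption).
    cleaned = {(a, b) for a, b in cleaned if (b, a) not in cleaned}
    # Type 2: drop A -> B when a longer A -> BC exists (prefix test: assumption).
    result: Set[Edge] = set()
    for a, b in cleaned:
        is_fragment = any(s == a and t != b and t.startswith(b) for s, t in cleaned)
        if not is_fragment:
            result.add((a, b))
    return result
```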
6.2) Ranking the superior-subordinate relations
For the cleaned superior-subordinate relations, a weight is computed for each relation to represent the confidence of that relation;
The weight w'(T → L) of each superior-subordinate relation produced by steps 1) to 5) is computed with the following formulas, and the relations are then ranked;
$$\mathrm{IDF}(L) = c(T \to L) \times \frac{1 + N}{1 + \mathrm{DF}(L)} \tag{1}$$
where
$c(T \to L)$ is the number of times T → L occurs;
$\mathrm{DF}(L)$ is the number of times L occurs in the parallel relations;
$N$ is the total number of nodes in the parallel relations;
$\mathrm{IDF}(L)$ is the inverse document frequency of L in the parallel relations;
$$w(T \to L) = c(T \to L) \times \mathrm{IDF}(L) \tag{2}$$
where
$w(T \to L)$ is the weight of the edge T → L after taking into account the number of occurrences of T → L and the inverse document frequency of the subordinate node;
$$\mathrm{Sim}(T, T_1) = \log\left[1 + \frac{N(T, T_1)}{\mathrm{IDF}(T) \times \mathrm{IDF}(T_1)}\right] \tag{3}$$
where
$\mathrm{Sim}(T, T_1)$ is the similarity between T and T1;
$N(T, T_1)$ is the number of times T and T1 occur together in the parallel relations;
$$\tilde{w}(T \to L) = \frac{\log\left(\sum_{L'} w(T \to L')\right)}{\sum_{L'} w(T \to L')} \times w(T \to L) \tag{4}$$
where
$\tilde{w}(T \to L)$ is the updated weight of T → L after additionally taking into account the different subordinates of T in the parallel relations;
$$w'(T \to L) = \tilde{w}(T \to L) + \sum_{T_1 \neq T} \left[\mu \times \mathrm{Sim}(T, T_1) \times \tilde{w}(T_1 \to L)\right] \tag{5}$$
where
$w'(T \to L)$ is the weight of the relation T → L after additionally taking into account the different superiors of L in the parallel relations and adding the relevancy between nodes to the updated weight,
$\mu$ is a weighting factor, set to 0.5.
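The last two formulas, (4) and (5), can be sketched on their own by taking the edge weights w(T → L) and the similarities Sim(T, T1) as given inputs, which keeps the sketch independent of how those two quantities are computed. The natural logarithm is an assumption, since the source does not name the base, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (superior T, subordinate L)


def normalise_weights(w: Dict[Edge, float]) -> Dict[Edge, float]:
    """Formula (4): rescale each edge by log(total) / total over T's subordinates."""
    totals: Dict[str, float] = defaultdict(float)
    for (t, _), value in w.items():
        totals[t] += value
    return {(t, l): math.log(totals[t]) / totals[t] * value
            for (t, l), value in w.items() if totals[t] > 0}


def final_weights(w: Dict[Edge, float], sim: Dict[FrozenSet[str], float],
                  mu: float = 0.5) -> Dict[Edge, float]:
    """Formula (5): add similarity-weighted support from other superiors of L.

    sim maps an unordered pair frozenset({T, T1}) to Sim(T, T1); missing pairs
    are treated as having similarity 0.
    """
    w_tilde = normalise_weights(w)
    result: Dict[Edge, float] = {}
    for (t, l), base in w_tilde.items():
        support = sum(mu * sim.get(frozenset((t, t1)), 0.0) * value
                      for (t1, l1), value in w_tilde.items()
                      if l1 == l and t1 != t)
        result[(t, l)] = base + support
    return result


def rank_relations(weights: Dict[Edge, float]) -> List[Tuple[Edge, float]]:
    """Sort relations by weight, highest confidence first."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)
```

Relations at the bottom of the ranked list are the candidates for being screened out as noise.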
Compared with the prior art, the present invention has the following beneficial effects:
1. The whole workflow of the method is completed automatically by machine, without manual intervention.
2. The method is highly extensible: to enrich the knowledge graph, it is only necessary to add new book catalogues.
3. The resulting knowledge graph has a deep hierarchy and rich relations between nodes, and as new book catalogues are continually added, the depth of the hierarchy, the connections between nodes and the accuracy all improve.
Brief description of the drawings
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 is the flow chart of step 2);
Fig. 3 is the flow chart of step 3);
Fig. 4 is the flow chart of step 4);
Fig. 5 is the flow chart of step 5).
Detailed description of the embodiments
A kind of construction method of the knowledge collection of illustrative plates based on library catalogue comprises the following steps:
1) select a book, its catalogue page is carried out to optical character identification and realize digitizing, and on digitized bibliographic structure, according to catalogue discal patch object length, distinguish rectangular order and billet order two class entries;
2) to billet order directly as a collection of both candidate nodes, the natural language processing instrument FudanNLP simultaneously rectangular order utilization being increased income carries out part-of-speech tagging and obtains part of speech array, then utilizes conjunction, punctuate and part-of-speech rule to extract other a collection of both candidate nodes;
3) to two batches of both candidate nodes, first strictly filter, go to identify that whether this entry exists in Baidupedia, interactive encyclopaedia, the part of identifying by Baidupedia, interactive encyclopaedia utilizes the superior and the subordinate's structure of catalogue to form relationship between superior and subordinate, utilize the peer-to-peer architecture of catalogue to form coordination, the skeleton using these two parts as knowledge collection of illustrative plates;
4) distinguish strong and weak coordination, from two kinds of coordinations, select respectively node, carry out incremental supplementation and enter relationship between superior and subordinate, enrich the skeleton of knowledge collection of illustrative plates obtained in the previous step;
5), according to the method take suffix useful part in basic excavation noise data proposing, in the entry of never identifying by Baidupedia, interactive encyclopaedia, select a part of node and supplement in knowledge collection of illustrative plates;
6) to each relation in the knowledge collection of illustrative plates having supplemented, calculate its weight and sort again, thereby a part of noise is fallen in screening, realize sequence screening.
Described step 2) as follows:
To sentence, utilize natural language processing instrument remove to cut word and mark part of speech, the conjunction arranged side by side that is conjunction according to pause mark and part of speech is removed distich quantum splitting, and a sentence divides the character string array,
First each the substring A[i in the character string array A each being split into], form each A[i] a < word of such substring, the part of speech array of part of speech >,
Next element adjacent in part of speech array is merged, in the process merging to character string array A in the character string of diverse location adopt different merge orders, to first substring A[0 in character string array A] part of speech array manipulation time adopt back to front continuous adjective, noun are merged into a word, simultaneously, take A[0] part of speech array in the part of speech of last word be benchmark part of speech, utilize benchmark part of speech to improve ensuing character string in character string array A, accuracy rate when merging part of speech array
To second substring A[1] and during the part of speech array manipulation of later each substring, adopt with the following method:
2.1) if benchmark part of speech is noun, to A[1] and later each character string part of speech array separately in by after forward direction, match last part of speech be noun till, form a word, otherwise do not return results;
2.2) film, title knowledge node can add < < > > or " " symbol, if benchmark part of speech is for demarcating symbol, while being punctuation marks used to enclose the title, quotation marks, to A[1] and later each character string part of speech array separately in by till matching last after forward direction and demarcating symbol, form a word, otherwise do not return results;
If benchmark part of speech be not noun or demarcate symbol, to A[1] and later each character string part of speech array separately in, when first part of speech, mate with benchmark part of speech, form a word, otherwise do not return results;
To the word of not including on Baidupedia, interactive encyclopaedia, utilize normalization Google distance (Normalized Google Distance) to calculate lower both degree of condensing together, in value, be that 0 to two node between threshold value is considered to one group of merged rational word receiving of energy, when not identifying by part of speech, when utilizing the word that part-of-speech rule merges out,, to its part of speech array, utilize normalization Google distance value of calculating to determine whether include
NGD ( x , y ) = max { log f ( x ) , log f ( y ) } - log f ( x , y ) log M - min { log f ( x ) , log f ( y ) }
NGD (x, y) represents the value of utilizing normalization Google distance to calculate,
F (x, y) represents the number of results that " xy " searches out in Google for key word xy,
F (x) represents the number of results that " x " searches out in Google for key word x,
F (y) represents the number of results that " y " searches out in Google for key word y,
M is all webpage numbers of including in Google.
Described step 3) comprises:
3.1) the such books structured message of exercise, experiment, example in the library catalogue page going out for the optical character recognition process of each book, conventionally in the catalogue of the same level of meeting in same book, repeat repeatedly, by the catalogue of each level is set up to one, have < catalogue entry like this, the Hash table of counting > just can be added up and be filtered out;
3.2) for the length of catalogue entry, limit, only get the catalogue entry of 9 Chinese characters and following length;
Utilize the entry on Baidupedia, interactive encyclopaedia, set up out index, then, when each node of processing in catalogue, carry out Baidupedia, the evaluation of interactive encyclopaedia, Baidupedia, interactive encyclopaedia are identified being included of passing through;
When the node of processing in each catalogue, utilize the natural language processing software FudanNLP that increases income to carry out lexical analysis, when part of speech is labeled as verb, do not go to include;
3.3) catalogue entry 9 above length of Chinese character for the length of catalogue entry, if the entry type of processing is the type of " noun+conjunction+noun ", adopt step 2.1), step 2.2) in to extracting the way of coordination in sentence, and then form relationship between superior and subordinate with the superior node of " noun+conjunction+noun ", the book number at relationship between superior and subordinate and coordination place is preserved simultaneously;
Simultaneously, during processing, need to keep the catalogue minor structure of every book, need to preserve two tables, a table is preserved by the knowledge node that method obtains above and the book number appearing at thereof, page number numbering, and the book number that between knowledge node, relationship between superior and subordinate and coordination appear at, even if but the entry being temporarily dropped may be also qualified node, need to build another table by the book number at all directory node places that pass through and unsanctioned, page number numbering also preserves, each relationship between superior and subordinate that all entries are formed and the book number at coordination place are also preserved into simultaneously, next after adding up by correlativity between computing node with by books structure, from the entry being dropped, find out rationally useful knowledge node, as incremental supplementation, enter in knowledge collection of illustrative plates, the positional information of each entry of simultaneously preserving can be for the extraction of ensuing definition.
Said step 4) comprises:
4.1) Distinguishing strong and weak coordinate relations
The number of occurrences of each coordinate relation is counted and the relations are sorted by absolute frequency; coordinate pairs whose absolute frequency is greater than a threshold are selected as strong coordinate relations, and coordinate pairs whose absolute frequency is less than the threshold are taken as weak coordinate relations;
4.2) Relevancy between knowledge nodes
Knowledge nodes are often ambiguous. In a data set composed of superior-subordinate relations and coordinate relations, the problem of multiple superior nodes pointing to the same subordinate node is resolved with the help of other nodes related to the knowledge node,
The detailed process is as follows:
4.2.1) For each A → B, use the superior-subordinate relations to find the group SubA of subordinate nodes of A; for each node EleOfSubA in this group, use the strong coordinate relations to find a group of coordinate nodes ParaOfEleOfSubA; merge all ParaOfEleOfSubA into one set Set1;
4.2.2) For B, use the weak coordinate relations to find a group of coordinate nodes ParaB; for each node C in ParaB, use the strong coordinate relations in turn to find a group of coordinate nodes ParaOfParaB; each ParaOfParaB serves as a set Set2;
4.2.3) For the sets Set1 and Set2,
$$\mathrm{Relevancy} = \mathrm{SameElementCount} + (\mathrm{Weight1} + \mathrm{Weight2}) \times 10$$
$$\mathrm{Weight1} = \frac{\mathrm{SameElementCount}}{\mathrm{Set1TotalElementCount}},$$
$$\mathrm{Weight2} = \frac{\mathrm{SameElementCount}}{\mathrm{Set2TotalElementCount}},$$
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions that the common elements account for in each set, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2,
The relevancy between the subordinate node C and the superior node A is computed; when Relevancy is greater than a threshold, C is regarded as a subordinate node of A and the corresponding A → C is included,
Where multiple superior nodes point to the same subordinate node, the relevancy between each superior node and this subordinate node is computed, and the superior node that forms the superior-subordinate relation with this subordinate node is chosen according to the magnitude of the relevancy;
4.3) Supplementing with strong and weak coordinate relations
Strong coordinate relations are merged directly into the superior-subordinate relations; weak coordinate relations are merged into the superior-subordinate relations by means of the relevancy between knowledge nodes introduced above.
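The relevancy computation of step 4.2) can be illustrated with a minimal Python sketch. It assumes that the relations are already available as plain dictionaries and pair lists; the function names (`build_index`, `relevancy`, `candidate_subordinates`) and the data layout are hypothetical and are not part of the described method. One Set2 is built per weak-coordinate neighbour C of B, which is one reading of step 4.2.2).

```python
from collections import defaultdict


def build_index(pairs):
    """Map each node to the set of nodes it is paired with (symmetric)."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index


def relevancy(set1, set2):
    """Relevancy = SameElementCount + (Weight1 + Weight2) * 10."""
    same = len(set1 & set2)
    if same == 0:
        return 0.0  # no common element: both weights are zero
    weight1 = same / len(set1)
    weight2 = same / len(set2)
    return same + (weight1 + weight2) * 10


def candidate_subordinates(a, b, children, strong, weak, threshold):
    """For a relation A -> B, return {C: relevancy} for every weak-coordinate
    neighbour C of B whose relevancy to A exceeds the threshold."""
    # Set1: strong-coordinate neighbours of all subordinates of A, merged.
    set1 = set()
    for ele in children.get(a, ()):
        set1 |= strong.get(ele, set())
    accepted = {}
    for c in weak.get(b, ()):
        set2 = strong.get(c, set())  # Set2 for this particular C
        score = relevancy(set1, set2)
        if score > threshold:
            accepted[c] = score
    return accepted


if __name__ == "__main__":
    children = {"A": {"B", "D"}}
    strong = build_index([("B", "X"), ("D", "Y"), ("C", "X"), ("C", "Y")])
    weak = build_index([("B", "C"), ("B", "E")])
    print(candidate_subordinates("A", "B", children, strong, weak, threshold=0.5))
```

Because SameElementCount is added directly to the score, any non-empty overlap yields a relevancy above 1, so the threshold mainly separates candidates with some overlap from those with none unless it is set higher.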
Said step 5) comprises:
When the data under the same first-level catalogue contain entries verified by Baidu Baike or Hudong Baike, the entries not verified by Baidu Baike or Hudong Baike are filled back into the book according to the catalogue substructure and the book numbers and page numbers of the entries saved in step 1),
Once some data in a sub-catalogue have passed verification by Baidu Baike or Hudong Baike, the useful part of the catalogue substructure is mined on the basis of suffixes to supplement the knowledge graph,
The specific method is as follows:
Let SetX be the set of elements in the superior-subordinate relations verified by Baidu Baike or Hudong Baike. For each A → B in the superior-subordinate relations, find the list of all book numbers in which A → B occurs,
For each book number, find the set SetY of subordinate nodes of A in the catalogue of that book,
For each node Node in the intersection of SetY and SetX (that is, Node is verified by Baidu Baike or Hudong Baike), find the set SetZ of entries in SetY that share a suffix with Node but are not verified by Baidu Baike or Hudong Baike; if
percentage = (|SetX ∩ SetY| + |SetZ|) / |SetY| > level
then all the remaining unverified entries in SetY other than Node are included; otherwise, only the entries in SetY that share a suffix with Node are included, where level is a preset threshold.
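The suffix-based supplementation can be sketched as follows for a single relation and a single book. The text does not say how a shared suffix is determined, so the sketch assumes that two entries share a suffix when their last `suffix_len` characters are equal; `suffix_len` and the function name are hypothetical, and the default `level=0.75` is the value used in the embodiment.

```python
def supplement_by_suffix(set_x, set_y, level=0.75, suffix_len=2):
    """Return the unverified entries of SetY to add to the knowledge graph.

    set_x: entries verified by the encyclopedias (SetX)
    set_y: subordinate nodes of A in one book (SetY)
    """
    verified = set_x & set_y      # SetX ∩ SetY
    unverified = set_y - set_x    # entries of SetY that failed verification
    added = set()
    for node in verified:
        suffix = node[-suffix_len:]
        # SetZ: unverified entries that share the suffix of this verified node.
        set_z = {entry for entry in unverified if entry.endswith(suffix)}
        percentage = (len(verified) + len(set_z)) / len(set_y)
        if percentage > level:
            added |= unverified   # include all remaining unverified entries
        else:
            added |= set_z        # include only the same-suffix entries
    return added


if __name__ == "__main__":
    set_x = {"bubble sort", "binary tree"}
    set_y = {"bubble sort", "quick sort", "heap sort", "hash table"}
    print(supplement_by_suffix(set_x, set_y, level=0.75, suffix_len=4))
```

The comparison is strict, as in the formula, so a percentage exactly equal to level falls into the same-suffix branch.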
Said step 6) comprises:
6.1) Cleaning the superior-subordinate relations
The superior-subordinate relations contain three types of defect: type 1, redundant relations of the form A → A; type 2, short relations A → B produced when a long relation A → BC is fragmented; type 3, meaningless cyclic relations A → B → A. The supplemented superior-subordinate relations are therefore read in again, types 1 and 3 are filtered out, and each short relation A → B of type 2 is merged into its long relation A → BC;
6.2) Ranking the superior-subordinate relations
For the cleaned superior-subordinate relations, a weight is computed for each relation to represent its confidence;
For the superior-subordinate relations produced by steps 1) to 5), the weight w′(T → L) is computed according to the following formulas, and the relations are then ranked;
$$\mathrm{IDF}(L) = c(T \to L) \cdot \frac{1 + N}{1 + \mathrm{DF}(L)} \tag{1}$$
where
c(T → L) is the number of times T → L occurs;
DF(L) is the number of times L occurs in the coordinate relations;
N is the total number of nodes in the coordinate relations;
IDF(L) is the inverse document frequency of L in the coordinate relations;
$$w(T \to L) = c(T \to L) \cdot \mathrm{IDF}(L) \tag{2}$$
where
w(T → L) is the weight of the edge T → L after considering the number of occurrences of T → L and the inverse document frequency of the subordinate node;
$$\mathrm{Sim}(T, T_1) = \log\left[1 + \frac{N(T, T_1)}{\mathrm{IDF}(T) \cdot \mathrm{IDF}(T_1)}\right] \tag{3}$$
where
Sim(T, T1) is the similarity between T and T1;
N(T, T1) is the number of times T and T1 occur together in the coordinate relations;
$$\tilde{w}(T \to L) = \frac{\log\left(\sum_{L'} w(T \to L')\right)}{\sum_{L'} w(T \to L')} \cdot w(T \to L) \tag{4}$$
where
$\tilde{w}(T \to L)$ is the updated weight of T → L after additionally considering the different subordinates of T in the coordinate relations;
$$w'(T \to L) = \tilde{w}(T \to L) + \sum_{T_1 \neq T} \left[\mu \cdot \mathrm{Sim}(T, T_1) \cdot \tilde{w}(T_1 \to L)\right] \tag{5}$$
where
w′(T → L) is the updated weight of the relation T → L after additionally considering the different superior nodes of L and the similarity between those nodes,
μ is a weighting coefficient, set to 0.5.
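The ranking of step 6.2) can be sketched in Python as below. This is an illustration under assumptions rather than the patented implementation: the IDF values and the similarities Sim(T, T1) are taken as precomputed inputs instead of being derived from formulas (1) and (3), all weights are assumed to be positive, and the function names and dictionary layout are hypothetical.

```python
import math
from collections import defaultdict


def edge_weights(counts, idf):
    """Formula (2): w(T -> L) = c(T -> L) * IDF(L)."""
    return {(t, l): c * idf[l] for (t, l), c in counts.items()}


def normalise(w):
    """Formula (4): scale w(T -> L) by log(S_T) / S_T,
    where S_T is the sum of w(T -> L') over all subordinates L' of T."""
    totals = defaultdict(float)
    for (t, _), value in w.items():
        totals[t] += value
    return {(t, l): math.log(totals[t]) / totals[t] * value
            for (t, l), value in w.items()}


def final_weights(w_tilde, sim, mu=0.5):
    """Formula (5): add the similarity-weighted contributions of the
    other superior nodes T1 that also point to L."""
    by_child = defaultdict(list)
    for (t, l), value in w_tilde.items():
        by_child[l].append((t, value))
    result = {}
    for (t, l), value in w_tilde.items():
        extra = sum(mu * sim.get(frozenset((t, t1)), 0.0) * v1
                    for t1, v1 in by_child[l] if t1 != t)
        result[(t, l)] = value + extra
    return result


if __name__ == "__main__":
    counts = {("A", "x"): 2, ("A", "y"): 1, ("B", "x"): 3}
    idf = {"x": 1.0, "y": 2.0}            # assumed precomputed
    sim = {frozenset(("A", "B")): 0.5}    # assumed precomputed
    weights = final_weights(normalise(edge_weights(counts, idf)), sim)
    for edge in sorted(weights, key=weights.get, reverse=True):
        print(edge, round(weights[edge], 4))
```

Superior nodes T1 without an edge to L contribute nothing to the sum in formula (5), so the sketch only iterates over the superiors that actually point to L.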
Embodiment
The concrete steps of this embodiment are described in detail below in conjunction with the method of the present invention:
1) 10,000 computer books were processed by optical character recognition (OCR); on the digitized catalogue structure, the entries were divided by length, with 9 Chinese characters as the boundary, into two classes, long entries and short entries;
2) As shown in Fig. 1 and Fig. 2, the short entries are taken directly as one batch of candidate nodes; the long entries are tagged with parts of speech by the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and another batch of candidate nodes is then extracted using conjunctions, punctuation and part-of-speech rules;
Word segmentation, part-of-speech tagging, splitting and merging of long entries proceed as follows:
The sentence is segmented into words and tagged with parts of speech by the natural language processing tool; it is then split at enumeration commas and at words tagged as coordinating conjunctions, so that one sentence is split into a string array,
First, for each substring A[i] of the string array A obtained from the split, a part-of-speech array of <word, part of speech> pairs is formed for that substring,
Next, adjacent elements in the part-of-speech array are merged. Strings at different positions in A are merged in different orders. For the part-of-speech array of the first substring A[0], consecutive adjectives and nouns are merged into one word, scanning from back to front; the part of speech of the last word in the part-of-speech array of A[0] is taken as the reference part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the subsequent strings in A,
For the part-of-speech arrays of the second substring A[1] and of every later substring, the following method is used:
2.1) If the reference part of speech is a noun, then in the part-of-speech array of A[1] and of each later string, match from front to back up to the last element whose part of speech is a noun and form one word; otherwise return no result;
2.2) Knowledge nodes such as film titles and book titles may be enclosed in 《 》 or “ ”. If the reference part of speech is a delimiter, that is, a book-title mark or a quotation mark, then in the part-of-speech array of A[1] and of each later string, match from front to back up to the last delimiter and form one word; otherwise return no result;
If the reference part of speech is neither a noun nor a delimiter, then in the part-of-speech array of A[1] and of each later string, a word is formed when the first part of speech matches the reference part of speech; otherwise return no result;
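The splitting and merging rules above can be illustrated with a small Python sketch that works on tokens which are already segmented and tagged. It does not call FudanNLP; the tag letters ("n" noun, "a" adjective, "c" conjunction, "u" particle) and the function names are hypothetical, and only the handling of A[0] and rule 2.1) is covered.

```python
def split_on_coordination(tokens):
    """Split a tagged entry at enumeration commas and conjunctions."""
    parts, current = [], []
    for word, pos in tokens:
        if word == "、" or pos == "c":
            if current:
                parts.append(current)
            current = []
        else:
            current.append((word, pos))
    if current:
        parts.append(current)
    return parts


def merge_first(part):
    """A[0]: scan from back to front, merging consecutive adjectives and nouns.
    Returns (merged word or None, reference part of speech)."""
    words = []
    for word, pos in reversed(part):
        if pos not in ("a", "n"):
            break
        words.append(word)
    merged = "".join(reversed(words)) if words else None
    return merged, part[-1][1]


def merge_later(part, reference_pos):
    """A[1] and later, rule 2.1): with a noun as reference, take the words
    from the front up to the last noun; otherwise return no result."""
    if reference_pos != "n":
        return None
    last_noun = max((i for i, (_, pos) in enumerate(part) if pos == "n"), default=-1)
    if last_noun < 0:
        return None
    return "".join(word for word, _ in part[:last_noun + 1])


if __name__ == "__main__":
    # Hand-tagged example entry: 数据的存储和检索
    tokens = [("数据", "n"), ("的", "u"), ("存储", "n"), ("和", "c"), ("检索", "n")]
    parts = split_on_coordination(tokens)
    first, reference = merge_first(parts[0])
    print([first] + [merge_later(p, reference) for p in parts[1:]])
```

The words returned for one entry form a group of coordinate nodes, each of which can then be attached to the superior node of the entry.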
For words not included in Baidu Baike or Hudong Baike, the normalized Google distance (Normalized Google Distance) is used to measure how closely the two parts cohere; two nodes whose value lies between 0 and a threshold are regarded as a pair that can reasonably be merged into one word. When a word merged by the part-of-speech rules has not otherwise been verified, the normalized Google distance computed over its part-of-speech array decides whether it is included,
$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$
NGD(x, y) is the value computed by the normalized Google distance,
f(x, y) is the number of results returned by Google for the keyword "xy",
f(x) is the number of results returned by Google for the keyword "x",
f(y) is the number of results returned by Google for the keyword "y",
M is the total number of web pages indexed by Google.
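A minimal Python sketch of the normalized Google distance follows. The result counts are passed in as plain numbers, since how they are obtained from a search engine is outside the formula; the function names and the threshold value in the example are hypothetical, and M is assumed to be larger than both f(x) and f(y).

```python
import math


def ngd(fx, fy, fxy, m):
    """Normalized Google distance from result counts f(x), f(y), f(x, y)
    and the total number of indexed pages M."""
    if min(fx, fy, fxy) <= 0:
        return math.inf  # the terms never occur (together): treat as unrelated
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(m) - min(lx, ly))


def can_merge(fx, fy, fxy, m, threshold):
    """Two parts are merged into one word when NGD lies between 0 and the threshold."""
    return 0 <= ngd(fx, fy, fxy, m) <= threshold


if __name__ == "__main__":
    print(ngd(1000, 100, 10, 10**6))
    print(can_merge(1000, 1000, 1000, 10**6, threshold=0.3))
```

A value near 0 means the two parts almost always occur together, which is the case in which merging them into one word is considered reasonable.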
3) As shown in Fig. 3,
Structural information such as exercises, experiments and examples in the OCR-processed catalogue pages of each book usually recurs many times at the same catalogue level within the same book; it can therefore be counted and filtered out by building, for each catalogue level, a hash table of <catalogue entry, count> pairs;
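The counting filter can be sketched with one counter per catalogue level. The input layout (a list of `(level, title)` pairs for one book) and the cut-off `min_repeats` are assumptions made for illustration; the text only states that such entries repeat many times.

```python
from collections import Counter, defaultdict


def filter_repeated_entries(entries, min_repeats=2):
    """Drop entries whose title occurs at least min_repeats times
    at the same catalogue level of one book."""
    counts = defaultdict(Counter)  # level -> <catalogue entry, count> table
    for level, title in entries:
        counts[level][title] += 1
    return [(level, title) for level, title in entries
            if counts[level][title] < min_repeats]


if __name__ == "__main__":
    book = [(1, "Arrays"), (2, "Exercises"), (1, "Trees"),
            (2, "Exercises"), (2, "Binary trees")]
    print(filter_repeated_entries(book))
```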
The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidu Baike and Hudong Baike; each catalogue node is then checked against this index as it is processed, and the nodes verified by Baidu Baike or Hudong Baike are included;
When each catalogue node is processed, lexical analysis is performed with the open-source natural language processing software FudanNLP; nodes whose part of speech is tagged as verb are not included;
For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the coordinate relations within the entry are extracted by the method of Fig. 2; each extracted noun then forms a superior-subordinate relation with the superior node of the "noun + conjunction + noun" entry, and the book numbers in which these superior-subordinate and coordinate relations occur are saved as well;
Meanwhile, the catalogue substructure of every book must be kept during processing, and two tables must be saved. The first table stores the knowledge nodes obtained by the above method together with the book numbers and page numbers where they appear, and the book numbers where the superior-subordinate and coordinate relations between knowledge nodes appear. Because an entry that is temporarily discarded may still be a qualified node, a second table stores the book numbers and page numbers of all catalogue nodes, whether or not they passed the filter, together with the book numbers of every superior-subordinate and coordinate relation formed by all entries. Later, by computing the relevancy between nodes and by statistics over the book structure, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as incremental supplements. The saved position of each entry can also be used for the subsequent extraction of definitions.
4) As shown in Fig. 4, the number of occurrences of each coordinate relation is counted and the relations are sorted by absolute frequency; coordinate pairs whose absolute frequency is greater than 4 are selected as strong coordinate relations, and coordinate pairs whose absolute frequency is less than 4 are taken as weak coordinate relations;
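The split into strong and weak coordinate relations can be sketched as follows. The sketch assumes that each observed coordinate relation is an unordered pair of node names; the function name is hypothetical. Following the wording above literally, a pair that occurs exactly 4 times is placed in neither group.

```python
from collections import Counter


def split_strong_weak(pairs, threshold=4):
    """Count unordered coordinate pairs and split them by absolute frequency."""
    counts = Counter(frozenset(pair) for pair in pairs)
    strong = {pair for pair, n in counts.items() if n > threshold}
    weak = {pair for pair, n in counts.items() if n < threshold}
    return strong, weak


if __name__ == "__main__":
    observed = [("stack", "queue")] * 5 + [("queue", "stack")] + [("tree", "graph")] * 2
    strong, weak = split_strong_weak(observed)
    print(strong, weak)
```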
The relevancy between knowledge nodes is computed as follows:
For each A → B, use the superior-subordinate relations to find the group SubA of subordinate nodes of A; for each node EleOfSubA in this group, use the strong coordinate relations to find a group of coordinate nodes ParaOfEleOfSubA; merge all ParaOfEleOfSubA into one set Set1;
For B, use the weak coordinate relations to find a group of coordinate nodes ParaB; for each node C in ParaB, use the strong coordinate relations in turn to find a group of coordinate nodes ParaOfParaB; each ParaOfParaB serves as a set Set2;
For the sets Set1 and Set2,
$$\mathrm{Relevancy} = \mathrm{SameElementCount} + (\mathrm{Weight1} + \mathrm{Weight2}) \times 10$$
$$\mathrm{Weight1} = \frac{\mathrm{SameElementCount}}{\mathrm{Set1TotalElementCount}},$$
$$\mathrm{Weight2} = \frac{\mathrm{SameElementCount}}{\mathrm{Set2TotalElementCount}},$$
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions that the common elements account for in each set, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2,
The relevancy between the subordinate node C and the superior node A is computed; when Relevancy is greater than the threshold 0.5, C is regarded as a subordinate node of A and the corresponding A → C is included,
Where multiple superior nodes point to the same subordinate node, the relevancy between each superior node and this subordinate node is computed, and the superior node that forms the superior-subordinate relation with this subordinate node is chosen according to the magnitude of the relevancy;
Strong coordinate relations are merged directly into the superior-subordinate relations; weak coordinate relations are merged into the superior-subordinate relations by means of the relevancy between knowledge nodes introduced above.
5) As shown in Fig. 5, using the proposed suffix-based method for mining the useful part of noisy data, some nodes are selected from the entries not verified by Baidu Baike or Hudong Baike and added to the knowledge graph. When the data under the same first-level catalogue contain entries verified by Baidu Baike or Hudong Baike, the entries not verified by Baidu Baike or Hudong Baike are filled back into the book according to the saved catalogue substructure and the book numbers and page numbers of the entries,
Once some data in a sub-catalogue have passed verification by Baidu Baike or Hudong Baike, the useful part of the catalogue substructure is mined on the basis of suffixes to supplement the knowledge graph,
The specific method is as follows:
Let SetX be the set of elements in the superior-subordinate relations verified by Baidu Baike or Hudong Baike. For each A → B in the superior-subordinate relations, find the list of all book numbers in which A → B occurs,
For each book number, find the set SetY of subordinate nodes of A in the catalogue of that book,
For each node Node in the intersection of SetY and SetX (that is, Node is verified by Baidu Baike or Hudong Baike), find the set SetZ of entries in SetY that share a suffix with Node but are not verified by Baidu Baike or Hudong Baike; if
percentage = (|SetX ∩ SetY| + |SetZ|) / |SetY| > level
then all the remaining unverified entries in SetY other than Node are included; otherwise, only the entries in SetY that share a suffix with Node are included, where level is a preset threshold, here 0.75.
6) Taking the similarity between nodes into account, a weight is computed for each relation in the supplemented knowledge graph and the relations are ranked, so that part of the noise is screened out,
The superior-subordinate relations contain three types of defect: type 1, redundant relations of the form A → A; type 2, short relations A → B produced when a long relation A → BC is fragmented; type 3, meaningless cyclic relations A → B → A. The supplemented superior-subordinate relations are therefore read in again, types 1 and 3 are filtered out, and each short relation A → B of type 2 is merged into its long relation A → BC. This is the cleaning step;
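The cleaning step can be sketched in Python under two stated assumptions: a relation A → B counts as a fragment of A → BC when B is a proper prefix of another subordinate of the same A, and "merging" the short relation into the long one is modelled simply by dropping the short one. The function name and the edge representation are hypothetical.

```python
def clean_relations(edges):
    """Remove the three defect types from a collection of (superior, subordinate) pairs."""
    # Type 1: redundant self-relations A -> A.
    edges = {(a, b) for a, b in edges if a != b}
    # Type 3: cyclic relations A -> B -> A; both directions are filtered out.
    edges = {(a, b) for a, b in edges if (b, a) not in edges}
    # Type 2: short relation A -> B that is a fragment of a long relation A -> BC.
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    cleaned = set()
    for a, b in edges:
        is_fragment = any(other != b and other.startswith(b) for other in children[a])
        if not is_fragment:
            cleaned.add((a, b))
    return cleaned


if __name__ == "__main__":
    raw = [("A", "A"), ("A", "B"), ("B", "A"),
           ("graph", "shortest"), ("graph", "shortest path"), ("graph", "traversal")]
    print(sorted(clean_relations(raw)))
```

Only two-node cycles are handled, matching the A → B → A pattern named in the text; longer cycles would need a graph traversal.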
For the cleaned superior-subordinate relations, a weight is computed for each relation to represent its confidence;
For the superior-subordinate relations produced by steps 1) to 5), the weight w′(T → L) is computed according to the following formulas, and the relations are then ranked;
$$\mathrm{IDF}(L) = c(T \to L) \cdot \frac{1 + N}{1 + \mathrm{DF}(L)} \tag{1}$$
where
c(T → L) is the number of times T → L occurs;
DF(L) is the number of times L occurs in the coordinate relations;
N is the total number of nodes in the coordinate relations;
IDF(L) is the inverse document frequency of L in the coordinate relations;
$$w(T \to L) = c(T \to L) \cdot \mathrm{IDF}(L) \tag{2}$$
where
w(T → L) is the weight of the edge T → L after considering the number of occurrences of T → L and the inverse document frequency of the subordinate node;
$$\mathrm{Sim}(T, T_1) = \log\left[1 + \frac{N(T, T_1)}{\mathrm{IDF}(T) \cdot \mathrm{IDF}(T_1)}\right] \tag{3}$$
where
Sim(T, T1) is the similarity between T and T1;
N(T, T1) is the number of times T and T1 occur together in the coordinate relations;
$$\tilde{w}(T \to L) = \frac{\log\left(\sum_{L'} w(T \to L')\right)}{\sum_{L'} w(T \to L')} \cdot w(T \to L) \tag{4}$$
where
$\tilde{w}(T \to L)$ is the updated weight of T → L after additionally considering the different subordinates of T in the coordinate relations;
$$w'(T \to L) = \tilde{w}(T \to L) + \sum_{T_1 \neq T} \left[\mu \cdot \mathrm{Sim}(T, T_1) \cdot \tilde{w}(T_1 \to L)\right] \tag{5}$$
where
w′(T → L) is the updated weight of the relation T → L after additionally considering the different superior nodes of L and the similarity between those nodes,
μ is a weighting coefficient, set to 0.5.
Results of this embodiment: after all four kinds of increments were added, there were 25426 superior-subordinate relations in total, producing 741 root nodes; the generated knowledge graph contains 843998 nodes, with a maximum depth of 85 levels and an average depth of 28.2 levels, and the accuracy is 75.1%.
Because the hierarchy depth of HowNet and of the CNKI knowledge classification is generally a single digit, and their node counts are far smaller than that of the Hudong Baike classification tree, the Hudong Baike knowledge tree is chosen here as the object of comparison. Statistics over the computer-related subcategories of the Hudong Baike knowledge tree show 21 root nodes and 75434 nodes in total, with a maximum depth of 48 levels and an average depth of 7.3 levels.
The comparison shows that this method far exceeds existing classification methods in indicators such as node count and hierarchy depth while maintaining relatively high accuracy, requires no manual intervention, and is readily extensible.
Below are 6 examples of depth 5 selected from the knowledge graph produced by this method, together with the accuracy statistics of each:
[Figures BDA0000420480340000181, BDA0000420480340000201, BDA0000420480340000221 and BDA0000420480340000231: the example tables and their accuracy statistics, reproduced as images in the original publication.]

Claims (6)

1. A method for constructing a knowledge graph based on book catalogues, characterized by comprising the following steps:
1) select a book, digitize its catalogue pages by optical character recognition, and, on the digitized catalogue structure, divide the entries by length into two classes, long entries and short entries;
2) take the short entries directly as one batch of candidate nodes; tag the long entries with parts of speech using the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, then extract another batch of candidate nodes using conjunctions, punctuation and part-of-speech rules;
3) filter the two batches of candidate nodes strictly first by checking whether each entry exists in Baidu Baike or Hudong Baike; for the entries verified by Baidu Baike or Hudong Baike, form superior-subordinate relations from the hierarchical structure of the catalogue and coordinate relations from the same-level structure of the catalogue, these two parts serving as the skeleton of the knowledge graph;
4) distinguish strong and weak coordinate relations, select nodes from each of the two kinds and add them incrementally to the superior-subordinate relations, enriching the skeleton of the knowledge graph obtained in the previous step;
5) using the proposed suffix-based method for mining the useful part of noisy data, select some nodes from the entries not verified by Baidu Baike or Hudong Baike and add them to the knowledge graph;
6) compute a weight for each relation in the supplemented knowledge graph and rank the relations, thereby screening out part of the noise.
2. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 2) is as follows:
The sentence is segmented into words and tagged with parts of speech by the natural language processing tool; it is then split at enumeration commas and at words tagged as coordinating conjunctions, so that one sentence is split into a string array,
First, for each substring A[i] of the string array A obtained from the split, a part-of-speech array of <word, part of speech> pairs is formed for that substring,
Next, adjacent elements in the part-of-speech array are merged. Strings at different positions in A are merged in different orders. For the part-of-speech array of the first substring A[0], consecutive adjectives and nouns are merged into one word, scanning from back to front; the part of speech of the last word in the part-of-speech array of A[0] is taken as the reference part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the subsequent strings in A,
For the part-of-speech arrays of the second substring A[1] and of every later substring, the following method is used:
2.1) If the reference part of speech is a noun, then in the part-of-speech array of A[1] and of each later string, match from front to back up to the last element whose part of speech is a noun and form one word; otherwise return no result;
2.2) Knowledge nodes such as film titles and book titles may be enclosed in 《 》 or “ ”. If the reference part of speech is a delimiter, that is, a book-title mark or a quotation mark, then in the part-of-speech array of A[1] and of each later string, match from front to back up to the last delimiter and form one word; otherwise return no result;
If the reference part of speech is neither a noun nor a delimiter, then in the part-of-speech array of A[1] and of each later string, a word is formed when the first part of speech matches the reference part of speech; otherwise return no result;
For words not included in Baidu Baike or Hudong Baike, the normalized Google distance (Normalized Google Distance) is used to measure how closely the two parts cohere; two nodes whose value lies between 0 and a threshold are regarded as a pair that can reasonably be merged into one word. When a word merged by the part-of-speech rules has not otherwise been verified, the normalized Google distance computed over its part-of-speech array decides whether it is included,
$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$
NGD(x, y) is the value computed by the normalized Google distance,
f(x, y) is the number of results returned by Google for the keyword "xy",
f(x) is the number of results returned by Google for the keyword "x",
f(y) is the number of results returned by Google for the keyword "y",
M is the total number of web pages indexed by Google.
3. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 3) comprises:
3.1) structural information such as exercises, experiments and examples in the OCR-processed catalogue pages of each book usually recurs many times at the same catalogue level within the same book; it can therefore be counted and filtered out by building, for each catalogue level, a hash table of <catalogue entry, count> pairs;
3.2) the length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidu Baike and Hudong Baike; each catalogue node is then checked against this index as it is processed, and the nodes verified by Baidu Baike or Hudong Baike are included;
When each catalogue node is processed, lexical analysis is performed with the open-source natural language processing software FudanNLP; nodes whose part of speech is tagged as verb are not included;
3.3) for catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the coordinate relations within the entry are extracted by the method of steps 2.1) and 2.2); each extracted noun then forms a superior-subordinate relation with the superior node of the "noun + conjunction + noun" entry, and the book numbers in which these superior-subordinate and coordinate relations occur are saved as well;
Meanwhile, the catalogue substructure of every book must be kept during processing, and two tables must be saved. The first table stores the knowledge nodes obtained by the above method together with the book numbers and page numbers where they appear, and the book numbers where the superior-subordinate and coordinate relations between knowledge nodes appear. Because an entry that is temporarily discarded may still be a qualified node, a second table stores the book numbers and page numbers of all catalogue nodes, whether or not they passed the filter, together with the book numbers of every superior-subordinate and coordinate relation formed by all entries. Later, by computing the relevancy between nodes and by statistics over the book structure, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as incremental supplements. The saved position of each entry can also be used for the subsequent extraction of definitions.
4. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 4) comprises:
4.1) Distinguishing strong and weak coordinate relations
The number of occurrences of each coordinate relation is counted and the relations are sorted by absolute frequency; coordinate pairs whose absolute frequency is greater than a threshold are selected as strong coordinate relations, and coordinate pairs whose absolute frequency is less than the threshold are taken as weak coordinate relations;
4.2) Relevancy between knowledge nodes
Knowledge nodes are often ambiguous. In a data set composed of superior-subordinate relations and coordinate relations, the problem of multiple superior nodes pointing to the same subordinate node is resolved with the help of other nodes related to the knowledge node,
The detailed process is as follows:
4.2.1) For each A → B, use the superior-subordinate relations to find the group SubA of subordinate nodes of A; for each node EleOfSubA in this group, use the strong coordinate relations to find a group of coordinate nodes ParaOfEleOfSubA; merge all ParaOfEleOfSubA into one set Set1;
4.2.2) For B, use the weak coordinate relations to find a group of coordinate nodes ParaB; for each node C in ParaB, use the strong coordinate relations in turn to find a group of coordinate nodes ParaOfParaB; each ParaOfParaB serves as a set Set2;
4.2.3) For the sets Set1 and Set2,
$$\mathrm{Relevancy} = \mathrm{SameElementCount} + (\mathrm{Weight1} + \mathrm{Weight2}) \times 10$$
$$\mathrm{Weight1} = \frac{\mathrm{SameElementCount}}{\mathrm{Set1TotalElementCount}},$$
$$\mathrm{Weight2} = \frac{\mathrm{SameElementCount}}{\mathrm{Set2TotalElementCount}},$$
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions that the common elements account for in each set, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2,
Calculate the degree of correlation between downstream site C and superior node A, when Relevancy is greater than threshold value, C thought to the downstream site of A, corresponding A → C is included in,
For the multiple superior nodes that exist between node, point to the problem of same downstream site, calculate respectively the degree of correlation of different superior nodes and this downstream site, according to the size of the degree of correlation, select superior node and this downstream site formation relationship between superior and subordinate;
4.3) Supplementing with strong and weak coordination
Strong coordination is merged directly into the superior-subordinate relationships; weak coordination is merged into the superior-subordinate relationships by means of the degree of correlation between knowledge nodes introduced above.
5. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that step 5) comprises:
When the data under the same first-level catalogue contain entries verified by Baidupedia or Interactive Encyclopedia, the catalogue substructure saved in step 1), together with the book number and page number where each entry occurs, is used to fill in the data that were not verified by Baidupedia or Interactive Encyclopedia.
Once some data in a sub-catalogue have been verified by Baidupedia or Interactive Encyclopedia, the useful parts of the catalogue substructure are mined on the basis of suffixes to supplement the knowledge graph.
The specific method is as follows:
Let SetX be the set of elements in the superior-subordinate relationships verified by Baidupedia or Interactive Encyclopedia. For each A → B in these relationships, find the list of all book numbers in which A → B occurs.
For each book number, find the set SetY of subordinate nodes of A in that bibliographic record.
For each node Node in the intersection of SetY and SetX (Node is therefore among the entries verified by Baidupedia or Interactive Encyclopedia), find the set SetZ of entries in SetY that share a suffix with Node but were not verified by Baidupedia or Interactive Encyclopedia. If
percentage = (|SetX ∩ SetY| + |SetZ|) / |SetY| > level
then all remaining entries in SetY, other than Node, that were not verified by Baidupedia or Interactive Encyclopedia are included; otherwise, only the entries in SetY that share a suffix with Node are included, where level is a preset threshold.
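The suffix-based supplementing rule can be sketched as below. The sketch rests on assumptions that the text does not spell out: unverified entries are taken to be the members of SetY that are not in SetX, and the suffix test is passed in as a function because the text does not define how a suffix is determined. The names `supplement` and `same_last_word` are hypothetical.

```python
from typing import Callable, Set


def same_last_word(a: str, b: str) -> bool:
    """Hypothetical suffix test: two entries share their last word."""
    return a.split()[-1] == b.split()[-1]


def supplement(set_x: Set[str], set_y: Set[str], node: str,
               same_suffix: Callable[[str, str], bool],
               level: float) -> Set[str]:
    """Return the unverified entries of SetY to include for one verified Node."""
    if not set_y:
        return set()
    verified = set_x & set_y        # SetX ∩ SetY
    unverified = set_y - set_x      # assumption: not in SetX means not verified
    set_z = {e for e in unverified if same_suffix(e, node)}
    percentage = (len(verified) + len(set_z)) / len(set_y)
    if percentage > level:
        return unverified           # include every remaining unverified entry
    return set_z                    # include only entries sharing Node's suffix
```

For example, with SetY = {heat engine, steam engine, jet engine, boiler} and only "heat engine" verified, the percentage is (1 + 2) / 4 = 0.75, so a level of 0.7 admits "boiler" as well, while a level of 0.8 admits only the two entries ending in "engine".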
6. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that step 6) comprises:
6.1) Cleaning the superior-subordinate relationships
The superior-subordinate relationships contain three types of defect: type 1, redundant relations of the form A → A; type 2, long and short relations, in which a short relation A → B has been segmented out of a long relation A → BC; and type 3, meaningless circular relations A → B → A. The supplemented superior-subordinate relationships are therefore read in again, types 1 and 3 are filtered out, and each short relation A → B of type 2 is merged into its long relation A → BC;
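A possible reading of the cleaning step is sketched below. Several choices are assumptions rather than statements of the claim: both directions of a circular pair are dropped, "merging" a short relation into a long one is modelled as discarding the short edge when a longer edge from the same superior exists, and "segmented out of A → BC" is modelled as B being a proper prefix of BC. The name `clean` is hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (superior, subordinate)


def clean(edges: List[Edge]) -> List[Edge]:
    """Filter type 1 and type 3 relations and absorb type 2 short relations."""
    edge_set: Set[Edge] = set(edges)
    kept: List[Edge] = []
    for a, b in dict.fromkeys(edges):  # de-duplicate while keeping order
        if a == b:
            continue  # type 1: redundant relation A -> A
        if (b, a) in edge_set:
            continue  # type 3: circular relation A -> B -> A
        if any(p == a and c != b and c.startswith(b) for p, c in edge_set):
            continue  # type 2: short A -> B absorbed by a longer A -> BC
        kept.append((a, b))
    return kept
```

The prefix test is only one way to detect a segmented term; a real word segmenter would presumably record which long term each short term came from.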
6.2) Ranking the superior-subordinate relationships
For the cleaned superior-subordinate relationships, a weight is computed for each relation to represent its confidence;
The weight w'(T → L) of each superior-subordinate relationship produced by steps 1) to 5) is computed according to the following formulas, and the relationships are then sorted;
$$IDF(L) = c(T \to L) \cdot \frac{1 + N}{1 + DF(L)} \tag{1}$$
where
c(T → L) is the number of times T → L occurs;
DF(L) is the number of times L occurs in the coordination;
N is the total number of nodes in the coordination;
IDF(L) is the inverse document frequency of L in the coordination;
$$w(T \to L) = c(T \to L) \cdot IDF(L) \tag{2}$$
where
w(T → L) is the weight of the edge T → L after taking into account the number of times T → L occurs and the inverse document frequency of the subordinate node;
$$Sim(T, T_1) = \log\left[1 + \frac{N(T, T_1)}{IDF(T) \cdot IDF(T_1)}\right] \tag{3}$$
where
Sim(T, T1) is the similarity between T and T1;
N(T, T1) is the number of times T and T1 occur together in the coordination;
$$\tilde{w}(T \to L) = \frac{\log\left(\sum_{L'} w(T \to L')\right)}{\sum_{L'} w(T \to L')} \cdot w(T \to L) \tag{4}$$
where
w̃(T → L) is the updated weight of T → L after additionally taking into account the different subordinates of T in the coordination,
$$w'(T \to L) = \tilde{w}(T \to L) + \sum_{T_1 \neq T} \left[\mu \cdot Sim(T, T_1) \cdot \tilde{w}(T_1 \to L)\right] \tag{5}$$
where
w'(T → L) is the weight of the relation T → L after additionally taking into account the different superiors of L in the coordination and, following that update, adding in the correlation between nodes,
μ is a weighting coefficient, set to 0.5.
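The ranking in step 6.2) can be sketched as follows, covering formulas (2) to (5). This is an illustration under stated assumptions: the IDF values are taken as precomputed, non-zero inputs per node rather than derived from formula (1); formula (3) is read as log[1 + N(T, T1) / (IDF(T) · IDF(T1))]; and a superior T1 with no edge to L is assumed to contribute a weight of zero to the sum in formula (5). The name `rank_edges` and its input structures are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (superior T, subordinate L)


def rank_edges(count: Dict[Edge, int], idf: Dict[str, float],
               cooc: Dict[FrozenSet[str], int],
               mu: float = 0.5) -> List[Tuple[Edge, float]]:
    """Return edges sorted by w'(T -> L), highest first."""
    # Formula (2): w(T -> L) = c(T -> L) * IDF(L)
    w = {(t, l): c * idf[l] for (t, l), c in count.items()}

    # Formula (4): rescale by log(sum) / sum over all subordinates of T
    total: Dict[str, float] = defaultdict(float)
    for (t, _), value in w.items():
        total[t] += value
    w_tilde = {}
    for (t, l), value in w.items():
        s = total[t]
        w_tilde[(t, l)] = math.log(s) / s * value if s > 0 else 0.0

    # Formula (3): similarity of two superiors from their co-occurrence count
    def sim(t: str, t1: str) -> float:
        n = cooc.get(frozenset((t, t1)), 0)
        return math.log(1 + n / (idf[t] * idf[t1]))

    # Formula (5): add contributions from other superiors of the same L
    w_prime = {}
    for (t, l), value in w_tilde.items():
        extra = sum(mu * sim(t, t1) * v1
                    for (t1, l1), v1 in w_tilde.items()
                    if l1 == l and t1 != t)
        w_prime[(t, l)] = value + extra
    return sorted(w_prime.items(), key=lambda kv: kv[1], reverse=True)
```

The quadratic loop in the last step is kept for readability; grouping edges by subordinate node first would avoid scanning every edge for each relation.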
CN201310601668.7A 2013-11-22 2013-11-22 Method for establishing mapping knowledge domain based on book catalogue Active CN103729402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310601668.7A CN103729402B (en) 2013-11-22 2013-11-22 Method for establishing mapping knowledge domain based on book catalogue

Publications (2)

Publication Number Publication Date
CN103729402A true CN103729402A (en) 2014-04-16
CN103729402B CN103729402B (en) 2017-01-18

Family

ID=50453477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310601668.7A Active CN103729402B (en) 2013-11-22 2013-11-22 Method for establishing mapping knowledge domain based on book catalogue

Country Status (1)

Country Link
CN (1) CN103729402B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant