CN103729402A - Method for establishing mapping knowledge domain based on book catalogue - Google Patents
- Publication number
- CN103729402A, CN201310601668.7A
- Authority
- CN
- China
- Prior art keywords
- node
- speech
- superior
- catalogue
- coordination
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9032—Query formulation
- G06F16/90324—Query formulation using system suggestions
- G06F16/90328—Query formulation using system suggestions using search space presentation or visualization, e.g. category or range presentation and selection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9017—Indexing; Data structures therefor; Storage structures using directory or table look-up
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9032—Query formulation
- G06F16/90332—Natural language query formulation or dialogue systems
Abstract
The invention discloses a method for building a knowledge graph (mapping knowledge domain) from book catalogues (tables of contents). The catalogue pages of a digitized book are extracted and the catalogue entries are divided by length into long and short entries. The long entries are part-of-speech tagged with a natural language processing tool to obtain part-of-speech arrays, and candidate nodes are extracted from them according to rules based on conjunctions, punctuation and parts of speech. The candidates from the long and short entries are verified against Baidu Baike and Hudong Baike. The catalogue structure then yields superior-subordinate (hierarchical) relations and parallel relations, which serve as the skeleton of the knowledge graph. Strong and weak parallel relations are distinguished, and each kind is used to incrementally supplement the hierarchical relations. A suffix-based algorithm for mining noisy data selects further nodes from the entries that failed encyclopedia verification and adds them to the knowledge graph. Finally, the weight of every relation in the supplemented knowledge graph is calculated and the relations are ranked, so that noise can be screened out. Compared with existing knowledge graphs, the knowledge graph built by this method has more nodes, is easier to extend, and is more accurate.
Description
Technical field
The present invention relates to generating knowledge graphs with methods such as artificial intelligence and data mining, and in particular to a method for constructing a knowledge graph based on book catalogues.
Background art
With the rapid development and spread of computers, there is a growing need for a knowledge graph that is rich in content and hierarchy, highly accurate, and easy to extend, so that people can obtain information and learn more conveniently and clearly and can analyze and mine how pieces of knowledge relate and evolve. How to build such a knowledge graph has naturally become a focus of current research.
Existing Chinese knowledge graphs include HowNet, the Hudong Baike knowledge tree, and the CNKI classification, but each has its own limitations and problems.
HowNet was developed by Mr. Dong Zhendong of the Chinese Academy of Sciences. It is a common-sense knowledge base that describes the concepts represented by Chinese and English words, and its basic content is the relations between concepts and between the attributes of concepts. Most nodes in HowNet are common vocabulary, the hierarchy cannot go deep, the numbers of nodes and relations are comparatively small, and the resource has to be built manually.
The Hudong Baike knowledge tree follows the traditional encyclopedia classification and divides the encyclopedia into 12 top-level categories: people, history, culture, art, nature, geography, science, economy, life, society, sports, and technology. Each top-level category is further divided level by level into second-level, third-level and lower subcategories. The structure of the tree is fixed, its hierarchy is relatively shallow, and it is built manually, which makes it hard to extend.
CNKI is the China National Knowledge Infrastructure project. Its goal is to spread, share and add value to the knowledge resources of society as a whole. It was initiated by Tsinghua University and Tsinghua Tongfang and was founded in June 1999. The CNKI classification is based on academic subjects: the documents in the database are divided into ten series, and each series is divided into several topics, for a total of 168 topics. Its shortcomings are that it has relatively few levels, the relations between nodes are relatively sparse, the fixed structure extends poorly, and it is built manually.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and to provide a method for automatically generating a knowledge graph from book catalogues.
The method for constructing a knowledge graph based on book catalogues comprises the following steps:
1) Select a book, digitize its catalogue pages by optical character recognition (OCR), and, on the digitized catalogue structure, divide the catalogue entries by length into two classes: long entries and short entries.
2) Use the short entries directly as one batch of candidate nodes. Tag the long entries with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, then extract a second batch of candidate nodes using conjunction, punctuation and part-of-speech rules.
3) Filter both batches of candidate nodes strictly by checking whether each entry exists in Baidu Baike or Hudong Baike. For the entries that pass this verification, use the superior-subordinate structure of the catalogue to form hierarchical (superior-subordinate) relations and the same-level structure of the catalogue to form parallel relations. These two parts serve as the skeleton of the knowledge graph.
4) Distinguish strong and weak parallel relations, select nodes from each kind, and add them incrementally to the hierarchical relations, enriching the skeleton obtained in the previous step.
5) Using the proposed suffix-based method for mining the useful part of noisy data, select some nodes from the entries that failed the Baidu Baike and Hudong Baike verification and add them to the knowledge graph.
6) Calculate the weight of each relation in the supplemented knowledge graph and sort the relations by weight, so that part of the noise is screened out.
Step 2) is carried out as follows:
A natural language processing tool segments the sentence into words and tags their parts of speech. The sentence is then split at enumeration commas (、) and at words tagged as coordinating conjunctions, which yields a string array A.
First, for each substring A[i] of the string array A, a part-of-speech array of <word, part of speech> pairs is formed.
Next, adjacent elements of each part-of-speech array are merged. Substrings at different positions in A are merged in different orders. The part-of-speech array of the first substring A[0] is processed from back to front, and consecutive adjectives and nouns are merged into one word. The part of speech of the last word in the part-of-speech array of A[0] is taken as the base part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the following substrings in A.
The part-of-speech arrays of the second substring A[1] and of each later substring are processed as follows:
2.1) If the base part of speech is a noun, scan the part-of-speech array of A[1] and of each later substring from front to back up to the last element tagged as a noun, and merge the scanned elements into one word; otherwise return no result.
2.2) Knowledge nodes such as film and book titles may be enclosed in 《 》 or quotation marks. If the base part of speech is a delimiter, that is, a title mark or a quotation mark, scan the part-of-speech array of A[1] and of each later substring from front to back up to the last delimiter, and merge the scanned elements into one word; otherwise return no result.
If the base part of speech is neither a noun nor a delimiter, a word is formed from the part-of-speech array of A[1] and of each later substring only when its first part of speech matches the base part of speech; otherwise no result is returned.
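The splitting and merging rules of step 2) can be sketched as follows. This is only one possible reading of the rules, under stated assumptions: the input is already segmented and tagged (no FudanNLP call is made), the tag names "n", "a" and "c" for noun, adjective and conjunction are hypothetical, and only the noun case of rule 2.1) is handled.

```python
SPLIT_PUNCT = {"、"}    # enumeration comma (pause mark)
MERGEABLE = {"n", "a"}  # hypothetical tags: noun, adjective


def split_segments(tagged):
    """Split a tagged sentence at enumeration commas and conjunctions (tag "c")."""
    segments, current = [], []
    for word, pos in tagged:
        if word in SPLIT_PUNCT or pos == "c":
            if current:
                segments.append(current)
            current = []
        else:
            current.append((word, pos))
    if current:
        segments.append(current)
    return segments


def merge_first(segment):
    """For A[0]: scan back to front, merging consecutive adjectives and nouns."""
    merged = []
    for word, pos in reversed(segment):
        if pos not in MERGEABLE:
            break
        merged.append(word)
    return "".join(reversed(merged))


def merge_later(segment, base_pos):
    """For A[1:]: rule 2.1 only, take everything up to the last noun."""
    if base_pos != "n":
        return None  # delimiter and other cases are omitted in this sketch
    noun_positions = [i for i, (_, pos) in enumerate(segment) if pos == "n"]
    if not noun_positions:
        return None
    return "".join(word for word, _ in segment[: noun_positions[-1] + 1])


def extract_candidates(tagged):
    segments = split_segments(tagged)
    if not segments:
        return []
    base_pos = segments[0][-1][1]  # part of speech of the last word of A[0]
    first = merge_first(segments[0])
    results = [first] if first else []
    for segment in segments[1:]:
        word = merge_later(segment, base_pos)
        if word:
            results.append(word)
    return results
```

For a long entry tagged as 线性表/n 的/u 顺序/a 存储/n 和/c 链式/a 存储/n, this sketch splits at the conjunction 和 and yields the two candidate nodes 顺序存储 and 链式存储.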
For words not included in Baidu Baike or Hudong Baike, the normalized Google distance (Normalized Google Distance, NGD) is used to measure how strongly two words cohere. Two nodes whose NGD value lies between 0 and a threshold are considered a pair that can reasonably be merged into one word. That is, when a word merged by the part-of-speech rules fails encyclopedia verification, the NGD value computed over its part-of-speech array decides whether the word is included. In its standard form the normalized Google distance is
NGD(x, y) = [max(log f(x), log f(y)) - log f(x, y)] / [log M - min(log f(x), log f(y))]
where
NGD(x, y) is the value calculated with the normalized Google distance,
f(x, y) is the number of results Google returns for the keyword "xy",
f(x) is the number of results Google returns for the keyword "x",
f(y) is the number of results Google returns for the keyword "y",
M is the total number of web pages indexed by Google.
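A minimal sketch of the NGD check, assuming the hit counts f(x), f(y), f(x, y) and the index size M have already been obtained by some means (no search API is called here). The function names and the default threshold of 0.5 are illustrative assumptions, not values from the patent.

```python
import math


def normalized_google_distance(fx, fy, fxy, m):
    """Standard NGD from hit counts; assumes m is larger than min(fx, fy)."""
    if min(fx, fy, fxy) <= 0:
        return math.inf  # terms that never occur (together) are treated as unrelated
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))


def can_merge(fx, fy, fxy, m, threshold=0.5):
    """Accept the merged word when NGD lies between 0 and the (assumed) threshold."""
    return 0 <= normalized_google_distance(fx, fy, fxy, m) <= threshold
```

Two terms that always occur together (f(x) = f(y) = f(x, y)) get distance 0, and the distance grows as co-occurrence becomes rarer relative to the individual counts.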
Step 3) comprises:
3.1) Structural entries such as exercises, experiments and examples in the OCR output of a book's catalogue pages usually repeat many times at the same catalogue level within the same book. Building a hash table of <catalogue entry, count> pairs for each catalogue level makes it possible to count these entries and filter them out.
3.2) The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken.
An index is built from the entries of Baidu Baike and Hudong Baike. Each catalogue node is then checked against this index, and the nodes that pass the Baidu Baike or Hudong Baike verification are included.
While each catalogue node is processed, the open-source natural language processing software FudanNLP performs lexical analysis on it; nodes tagged as verbs are not included.
3.3) For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the method of steps 2.1) and 2.2) for extracting parallel relations from a sentence is applied. The extracted nodes then form hierarchical relations with the superior node of the "noun + conjunction + noun" entry, and the numbers of the books in which the hierarchical and parallel relations occur are saved as well.
During processing, the catalogue substructure of every book must be kept, and two tables must be saved. The first table stores the knowledge nodes obtained by the method above together with the book numbers and page numbers where they occur, as well as the book numbers where the hierarchical and parallel relations between knowledge nodes occur. Because entries that are temporarily discarded may still be valid nodes, a second table stores the book numbers and page numbers of all catalogue nodes, both those that passed and those that did not, together with the book numbers of every hierarchical and parallel relation formed by all entries. Later, by computing the relevance between nodes and by gathering statistics over the book structure, reasonable and useful knowledge nodes are recovered from the discarded entries and added incrementally to the knowledge graph. The saved position of each entry can also be used for the subsequent extraction of definitions.
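The per-level hash table of item 3.1) can be sketched with a counter. The function name, the input layout (a mapping from catalogue level to the entry titles of one book) and the `max_repeats` cut-off are assumptions made for illustration; the patent does not state how many repetitions mark an entry as structural.

```python
from collections import Counter


def filter_repeated_entries(entries_by_level, max_repeats=1):
    """Drop entries that repeat within one catalogue level of a single book.

    entries_by_level maps a catalogue level to the list of entry titles found
    at that level; entries occurring more than max_repeats times are removed.
    """
    kept = {}
    for level, entries in entries_by_level.items():
        counts = Counter(entries)  # the <catalogue entry, count> hash table
        kept[level] = [entry for entry in entries if counts[entry] <= max_repeats]
    return kept
```

With the second-level entries "Exercises", "Stacks", "Exercises", "Queues", only "Stacks" and "Queues" remain, because "Exercises" repeats at the same level.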
Step 4) comprises:
4.1) Distinguishing strong and weak parallel relations
The occurrences of each parallel relation are counted and the relations are sorted by absolute frequency. Parallel pairs whose absolute frequency is greater than a threshold are selected as strong parallel relations; pairs whose absolute frequency is less than the threshold are weak parallel relations.
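A minimal sketch of item 4.1), assuming that parallel pairs are unordered and are collected as one (a, b) tuple per occurrence. The function name is hypothetical, and treating a frequency exactly equal to the threshold as weak is an assumption, since the text leaves that case open.

```python
from collections import Counter


def split_parallel_relations(pair_occurrences, threshold):
    """Return (strong, weak) parallel pairs, each list sorted by descending frequency."""
    counts = Counter(tuple(sorted(pair)) for pair in pair_occurrences)
    strong, weak = [], []
    for pair, freq in counts.most_common():
        # Pairs at exactly the threshold are put in the weak group (assumption).
        (strong if freq > threshold else weak).append(pair)
    return strong, weak
```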
4.2) Relevance between knowledge nodes
Knowledge nodes are often ambiguous. In a data set composed of hierarchical and parallel relations, several superior nodes may point to the same subordinate node; this problem is resolved with the help of other nodes related to the knowledge node.
The detailed procedure is as follows:
4.2.1) For each A → B, use the hierarchical relations to find the set SubA of subordinate nodes of A. For each node EleOfSubA in SubA, use the strong parallel relations to find its set of parallel nodes ParaOfEleOfSubA. Merge all ParaOfEleOfSubA into one set Set1.
4.2.2) For B, use the weak parallel relations to find its set of parallel nodes ParaB. For each node C in ParaB, use the strong parallel relations to find its set of parallel nodes ParaOfParaB; each ParaOfParaB serves as a set Set2.
4.2.3) For the sets Set1 and Set2, compute
Relevancy = SameElementCount + (Weight1 + Weight2) * 10
Weight1 = SameElementCount / Set1TotalElementCount
Weight2 = SameElementCount / Set2TotalElementCount
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions of the common elements in Set1 and Set2 respectively, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2.
This gives the relevance between the subordinate node C and the superior node A. When Relevancy is greater than a threshold, C is regarded as a subordinate node of A and the corresponding relation A → C is included.
When several superior nodes point to the same subordinate node, the relevance between each superior node and that subordinate node is calculated, and the superior node that forms the hierarchical relation with the subordinate node is selected according to the relevance values.
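The relevance computation of item 4.2) can be sketched as below. The data layout is an assumption made for illustration: `children` maps a superior node to its subordinate nodes, while `strong` and `weak` map a node to the set of nodes it is parallel with. Weight1 and Weight2 are taken as plain fractions.

```python
def relevancy(set1, set2):
    """Relevancy = SameElementCount + (Weight1 + Weight2) * 10."""
    if not set1 or not set2:
        return 0.0
    same = len(set1 & set2)
    weight1 = same / len(set1)
    weight2 = same / len(set2)
    return same + (weight1 + weight2) * 10


def build_set1(a, children, strong):
    """Union of the strong-parallel nodes of every subordinate node of A."""
    result = set()
    for child in children.get(a, ()):
        result |= strong.get(child, set())
    return result


def candidate_children(a, b, children, strong, weak, threshold):
    """Nodes C weakly parallel to B that are relevant enough to become A -> C."""
    set1 = build_set1(a, children, strong)
    accepted = []
    for c in sorted(weak.get(b, ())):
        set2 = strong.get(c, set())  # ParaOfParaB for this C
        if relevancy(set1, set2) > threshold:
            accepted.append(c)
    return accepted
```

For example, with Set1 = {a, b, c, d} and Set2 = {b, c}, two elements are shared, Weight1 is 0.5 and Weight2 is 1.0, so Relevancy = 2 + 1.5 * 10 = 17.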
4.3) Supplementing with strong and weak parallel relations
Strong parallel relations are merged directly into the hierarchical relations; weak parallel relations are merged into the hierarchical relations using the relevance between knowledge nodes introduced above.
Step 5) comprises:
When the data under the same first-level catalogue contains entries verified by Baidu Baike or Hudong Baike, the catalogue substructure saved in step 1) and the book numbers and page numbers of the entries are used to put the data that failed the Baidu Baike and Hudong Baike verification back into the book.
When some of the data in a sub-catalogue has passed the Baidu Baike or Hudong Baike verification, the useful part of that catalogue substructure is mined on the basis of suffixes, and the knowledge graph is supplemented with it.
The concrete method is as follows:
Let SetX be the set of elements in the hierarchical relations that passed the Baidu Baike or Hudong Baike verification. For each A → B in the hierarchical relations, find the list of all book numbers in which A → B occurs.
For each book number, find the set SetY of subordinate nodes of A in that book's catalogue.
For each node Node in the intersection of SetY and SetX (Node has therefore passed verification), find the set SetZ of entries in SetY that have the same suffix as Node but did not pass the Baidu Baike or Hudong Baike verification. If
percentage = (|SetX ∩ SetY| + |SetZ|) / |SetY| > level
then all remaining unverified entries in SetY other than Node are included; otherwise only the entries in SetY that have the same suffix as Node are included. Here level is a preset threshold.
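The suffix-based mining of step 5) can be sketched for one superior node A in one book. The patent does not say how a suffix is delimited, so comparing the last `suffix_len` characters is an assumption, as are the function and parameter names.

```python
def mine_by_suffix(set_x, set_y, level, suffix_len=2):
    """Select unverified entries of SetY to add to the knowledge graph.

    set_x: entries that passed encyclopedia verification
    set_y: subordinate entries of A in one book's catalogue
    level: preset threshold on the percentage
    """
    verified = set_x & set_y
    unverified = set_y - set_x
    accepted = set()
    for node in verified:
        suffix = node[-suffix_len:]  # assumed suffix definition
        set_z = {entry for entry in unverified if entry.endswith(suffix)}
        percentage = (len(verified) + len(set_z)) / len(set_y)
        # Above the threshold, take every unverified sibling; otherwise only
        # the siblings that share the suffix with the verified node.
        accepted |= unverified if percentage > level else set_z
    return accepted
```

With SetY = {"bubble sort", "shell sort", "heap sort", "hash table"}, only "bubble sort" verified and a 4-character suffix, the percentage is (1 + 2) / 4 = 0.75: a threshold of 0.8 admits only "shell sort" and "heap sort", while a threshold of 0.5 also admits "hash table".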
Step 6) comprises:
6.1) Cleaning the hierarchical relations
The hierarchical relations contain three types of noise: type 1, redundant relations of the form A → A; type 2, short relations A → B that were cut out of a long relation A → BC; and type 3, meaningless circular relations A → B → A. The supplemented hierarchical relations are therefore read in again. Relations of type 1 and type 3 are screened out, and each short relation A → B of type 2 is merged into its long relation A → BC.
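The three cleaning rules of item 6.1) can be sketched as set filters over (superior, subordinate) pairs. Two choices here are assumptions: a circular relation removes both directions, and a short relation counts as "cut out of" a long one when the same superior has a longer subordinate whose text contains the short subordinate.

```python
def clean_relations(relations):
    """Remove type 1 (A -> A), type 3 (A -> B -> A) and type 2 (fragment) relations."""
    # Type 1: drop redundant self-relations.
    rels = {(a, b) for a, b in relations if a != b}
    # Type 3: drop both directions of a two-node cycle.
    rels = {(a, b) for a, b in rels if (b, a) not in rels}
    # Type 2: drop A -> B when A -> BC exists, keeping only the long relation.
    cleaned = set()
    for a, b in rels:
        has_longer = any(a2 == a and b2 != b and b in b2 for a2, b2 in rels)
        if not has_longer:
            cleaned.add((a, b))
    return cleaned
```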
6.2) Ranking the hierarchical relations
For the cleaned hierarchical relations, a weight must be calculated for each relation to express the confidence of that relation.
The weight w'(T → L) of each hierarchical relation produced by steps 1) to 5) is calculated according to the following formulas, and the relations are then sorted.
[formula (1) not reproduced in the source text]
where
c(T → L) is the number of times T → L occurs;
DF(L) is the number of times L occurs in the parallel relations;
N is the total number of nodes in the parallel relations;
IDF(L) is the inverse document frequency of L in the parallel relations.
w(T → L) = c(T → L) * IDF(L) (2)
where
w(T → L) is the weight of the edge T → L after the number of occurrences of T → L and the inverse document frequency of the subordinate node are taken into account.
[formula not reproduced in the source text]
where
Sim(T, T1) is the similarity between T and T1;
n(T, T1) is the number of times T and T1 occur together in the parallel relations.
[formula not reproduced in the source text]
where
the resulting value (its symbol is not reproduced in the source text) is the updated weight of T → L after the different subordinates of T in the parallel relations are additionally taken into account.
[formula not reproduced in the source text]
where
w'(T → L) is the weight of the relation T → L after the different superiors of L in the parallel relations are additionally taken into account and the updated relevance between nodes is added;
μ is a weighting coefficient, set to 0.5.
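Formula (2) can be illustrated with a short sketch. Only w(T → L) = c(T → L) * IDF(L) comes from the text; the definition IDF(L) = log(N / DF(L)) used here is the common textbook form and is an assumption, as is the fallback of 0 for nodes that never occur in a parallel relation. The similarity-based updates that lead to w'(T → L) are not covered.

```python
import math
from collections import Counter


def rank_relations(hierarchy_occurrences, parallel_pairs):
    """Rank T -> L relations by w = c(T -> L) * IDF(L), highest weight first."""
    c = Counter(hierarchy_occurrences)                             # c(T -> L)
    df = Counter(node for pair in parallel_pairs for node in pair)  # DF(L)
    n = len(df)                                                    # N: nodes in parallel relations

    def idf(node):
        # Assumed form of formula (1): log(N / DF(L)); 0.0 if L has no parallel relation.
        return math.log(n / df[node]) if df[node] else 0.0

    weights = {rel: count * idf(rel[1]) for rel, count in c.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)
```

Relations at the bottom of the resulting list are the candidates for removal as noise; where to cut the list is left to a threshold chosen by the user.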
Compared with the prior art, the present invention has the following beneficial effects:
1. The workflow of the method is completed automatically by machine, without manual intervention.
2. The method extends well: to enrich the knowledge graph, it suffices to add new book catalogues.
3. The method yields a deep hierarchy and rich relations between nodes. As new book catalogues are added, the depth of the hierarchy, the relations between nodes, and the accuracy all improve.
Brief description of the drawings
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 is the flow chart of step 2);
Fig. 3 is the flow chart of step 3);
Fig. 4 is the flow chart of step 4);
Fig. 5 is the flow chart of step 5).
Embodiment
A method for constructing a knowledge graph based on book catalogues comprises the following steps:
1) Select a book, digitize its catalogue pages by optical character recognition (OCR), and, on the digitized catalogue structure, divide the catalogue entries by length into two classes: long entries and short entries.
2) Use the short entries directly as one batch of candidate nodes. Tag the long entries with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, then extract a second batch of candidate nodes using conjunction, punctuation and part-of-speech rules.
3) Filter both batches of candidate nodes strictly by checking whether each entry exists in Baidu Baike or Hudong Baike. For the entries that pass this verification, use the superior-subordinate structure of the catalogue to form hierarchical (superior-subordinate) relations and the same-level structure of the catalogue to form parallel relations. These two parts serve as the skeleton of the knowledge graph.
4) Distinguish strong and weak parallel relations, select nodes from each kind, and add them incrementally to the hierarchical relations, enriching the skeleton obtained in the previous step.
5) Using the proposed suffix-based method for mining the useful part of noisy data, select some nodes from the entries that failed the Baidu Baike and Hudong Baike verification and add them to the knowledge graph.
6) Calculate the weight of each relation in the supplemented knowledge graph and sort the relations by weight, so that part of the noise is screened out.
Step 2) is carried out as follows:
A natural language processing tool segments the sentence into words and tags their parts of speech. The sentence is then split at enumeration commas (、) and at words tagged as coordinating conjunctions, which yields a string array A.
First, for each substring A[i] of the string array A, a part-of-speech array of <word, part of speech> pairs is formed.
Next, adjacent elements of each part-of-speech array are merged. Substrings at different positions in A are merged in different orders. The part-of-speech array of the first substring A[0] is processed from back to front, and consecutive adjectives and nouns are merged into one word. The part of speech of the last word in the part-of-speech array of A[0] is taken as the base part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the following substrings in A.
The part-of-speech arrays of the second substring A[1] and of each later substring are processed as follows:
2.1) If the base part of speech is a noun, scan the part-of-speech array of A[1] and of each later substring from front to back up to the last element tagged as a noun, and merge the scanned elements into one word; otherwise return no result.
2.2) Knowledge nodes such as film and book titles may be enclosed in 《 》 or quotation marks. If the base part of speech is a delimiter, that is, a title mark or a quotation mark, scan the part-of-speech array of A[1] and of each later substring from front to back up to the last delimiter, and merge the scanned elements into one word; otherwise return no result.
If the base part of speech is neither a noun nor a delimiter, a word is formed from the part-of-speech array of A[1] and of each later substring only when its first part of speech matches the base part of speech; otherwise no result is returned.
For words not included in Baidu Baike or Hudong Baike, the normalized Google distance (Normalized Google Distance, NGD) is used to measure how strongly two words cohere. Two nodes whose NGD value lies between 0 and a threshold are considered a pair that can reasonably be merged into one word. That is, when a word merged by the part-of-speech rules fails encyclopedia verification, the NGD value computed over its part-of-speech array decides whether the word is included. In its standard form the normalized Google distance is
NGD(x, y) = [max(log f(x), log f(y)) - log f(x, y)] / [log M - min(log f(x), log f(y))]
where
NGD(x, y) is the value calculated with the normalized Google distance,
f(x, y) is the number of results Google returns for the keyword "xy",
f(x) is the number of results Google returns for the keyword "x",
f(y) is the number of results Google returns for the keyword "y",
M is the total number of web pages indexed by Google.
Step 3) comprises:
3.1) Structural entries such as exercises, experiments and examples in the OCR output of a book's catalogue pages usually repeat many times at the same catalogue level within the same book. Building a hash table of <catalogue entry, count> pairs for each catalogue level makes it possible to count these entries and filter them out.
3.2) The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken.
An index is built from the entries of Baidu Baike and Hudong Baike. Each catalogue node is then checked against this index, and the nodes that pass the Baidu Baike or Hudong Baike verification are included.
While each catalogue node is processed, the open-source natural language processing software FudanNLP performs lexical analysis on it; nodes tagged as verbs are not included.
3.3) For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the method of steps 2.1) and 2.2) for extracting parallel relations from a sentence is applied. The extracted nodes then form hierarchical relations with the superior node of the "noun + conjunction + noun" entry, and the numbers of the books in which the hierarchical and parallel relations occur are saved as well.
During processing, the catalogue substructure of every book must be kept, and two tables must be saved. The first table stores the knowledge nodes obtained by the method above together with the book numbers and page numbers where they occur, as well as the book numbers where the hierarchical and parallel relations between knowledge nodes occur. Because entries that are temporarily discarded may still be valid nodes, a second table stores the book numbers and page numbers of all catalogue nodes, both those that passed and those that did not, together with the book numbers of every hierarchical and parallel relation formed by all entries. Later, by computing the relevance between nodes and by gathering statistics over the book structure, reasonable and useful knowledge nodes are recovered from the discarded entries and added incrementally to the knowledge graph. The saved position of each entry can also be used for the subsequent extraction of definitions.
Step 4) comprises:
4.1) Distinguishing strong and weak parallel relations
The occurrences of each parallel relation are counted and the relations are sorted by absolute frequency. Parallel pairs whose absolute frequency is greater than a threshold are selected as strong parallel relations; pairs whose absolute frequency is less than the threshold are weak parallel relations.
4.2) Relevance between knowledge nodes
Knowledge nodes are often ambiguous. In a data set composed of hierarchical and parallel relations, several superior nodes may point to the same subordinate node; this problem is resolved with the help of other nodes related to the knowledge node.
The detailed procedure is as follows:
4.2.1) For each A → B, use the hierarchical relations to find the set SubA of subordinate nodes of A. For each node EleOfSubA in SubA, use the strong parallel relations to find its set of parallel nodes ParaOfEleOfSubA. Merge all ParaOfEleOfSubA into one set Set1.
4.2.2) For B, use the weak parallel relations to find its set of parallel nodes ParaB. For each node C in ParaB, use the strong parallel relations to find its set of parallel nodes ParaOfParaB; each ParaOfParaB serves as a set Set2.
4.2.3) For the sets Set1 and Set2, compute
Relevancy = SameElementCount + (Weight1 + Weight2) * 10
Weight1 = SameElementCount / Set1TotalElementCount
Weight2 = SameElementCount / Set2TotalElementCount
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions of the common elements in Set1 and Set2 respectively, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2.
This gives the relevance between the subordinate node C and the superior node A. When Relevancy is greater than a threshold, C is regarded as a subordinate node of A and the corresponding relation A → C is included.
When several superior nodes point to the same subordinate node, the relevance between each superior node and that subordinate node is calculated, and the superior node that forms the hierarchical relation with the subordinate node is selected according to the relevance values.
4.3) Supplementing with strong and weak parallel relations
Strong parallel relations are merged directly into the hierarchical relations; weak parallel relations are merged into the hierarchical relations using the relevance between knowledge nodes introduced above.
Said step 5) comprises:
When the data under the same first-level catalogue contain entries verified by Baidu Baike or Hudong Baike, the data not verified by Baidu Baike or Hudong Baike are filled back into the book according to the catalogue substructure saved in step 1) and the book ID and page number of each entry.
When some of the data in a subcatalogue have passed verification by Baidu Baike or Hudong Baike, the useful part of the catalogue substructure is mined on the basis of suffixes and used to supplement the knowledge graph.
The specific method is as follows:
Let SetX be the set of elements in the superior-subordinate relationships verified by Baidu Baike or Hudong Baike. For each A → B among these relationships, find the list of IDs of all books in which A → B occurs.
For each book ID, find the set SetY of subordinate nodes of A in that book's catalogue.
For each node Node in the intersection of SetY and SetX (that is, each Node verified by Baidu Baike or Hudong Baike), find the set SetZ of entries in SetY that share a suffix with Node but are not verified by Baidu Baike or Hudong Baike. If
percentage=[(|SetX∩SetY|)+|SetZ|]/|SetY|>level
then all remaining entries in SetY other than Node that are not verified by Baidu Baike or Hudong Baike are included; otherwise, only the entries in SetY that share a suffix with Node are included, where level is a preset threshold.
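The suffix-based test above can be illustrated with a short Python sketch. It is not the patent's implementation: the text does not say how long a suffix is, so the parameter `suffix_len` is a hypothetical choice, as is the function name. The default `level=0.75` is the value used in the embodiment.

```python
def supplement_by_suffix(set_x, set_y, node, level=0.75, suffix_len=2):
    """Return the unverified entries of SetY to add to the knowledge graph.

    set_x: entries verified by the encyclopaedias (SetX)
    set_y: subordinate nodes of A in one book's catalogue (SetY)
    node:  a verified entry in both sets
    suffix_len: assumed suffix length in characters (not given in the text)
    """
    set_x, set_y = set(set_x), set(set_y)
    if not set_y:
        return set()
    verified = set_x & set_y               # SetX ∩ SetY
    unverified = set_y - set_x
    suffix = node[-suffix_len:]
    set_z = {e for e in unverified if e.endswith(suffix)}   # SetZ
    percentage = (len(verified) + len(set_z)) / len(set_y)
    # Above the threshold, accept every unverified entry; otherwise only SetZ.
    return unverified if percentage > level else set_z
```

With SetY = {quick sort, merge sort, heap sort, exercises}, two verified entries and a four-character suffix "sort", the percentage is 3/4; it does not exceed 0.75, so only "heap sort" is added.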
Said step 6) comprises:
6.1) Cleaning the superior-subordinate relationships
The superior-subordinate relationships contain three kinds of noise: type 1, redundant relationships of the form A → A; type 2, short relationships A → B cut out of a long relationship A → BC; and type 3, meaningless cyclic relationships A → B → A. The supplemented superior-subordinate relationships are therefore read in again, types 1 and 3 are filtered out, and each short relationship A → B of type 2 is merged into the long relationship A → BC;
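The removal of type 1 and type 3 relationships can be sketched in Python as follows. The sketch assumes relationships are stored as (superior, subordinate) pairs and that both directions of a two-node cycle are dropped. Type 2 merging is left out because the text does not specify how a short relationship is matched to its long counterpart. The function name `clean_relations` is hypothetical.

```python
def clean_relations(relations):
    """Drop self-loops (type 1) and two-node cycles (type 3).

    relations: iterable of (superior, subordinate) pairs.
    """
    rels = set(relations)
    cleaned = set()
    for a, b in rels:
        if a == b:            # type 1: A -> A
            continue
        if (b, a) in rels:    # type 3: A -> B together with B -> A
            continue
        cleaned.add((a, b))
    return cleaned
```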
6.2) Ranking the superior-subordinate relationships
For the cleaned superior-subordinate relationships, a weight is calculated for each relationship to represent its confidence;
For the superior-subordinate relationships produced by steps 1) to 5), the weight w'(T→L) is calculated according to the following formulas, and the relationships are then sorted;
where
c(T→L) is the number of times T→L occurs;
DF(L) is the number of times L occurs in the coordinate relationships;
N is the total number of nodes in the coordinate relationships;
IDF(L) is the inverse document frequency of L in the coordinate relationships;
w(T→L)=c(T→L)*IDF(L) (2)
where
w(T→L) is the weight of the edge T→L after the number of occurrences of T→L and the inverse document frequency of the subordinate node are considered;
Sim(T, T1) is the similarity between T and T1;
N(T, T1) is the number of times T and T1 co-occur in the coordinate relationships;
the intermediate weight is the update of the weight of T→L after the different subordinates of T in the coordinate relationships are also considered;
w'(T→L) is the weight of the relationship T→L after the different superiors with which L is in coordinate relationships are also considered and the correlation between nodes is added to the updated weight;
μ is a weighting coefficient, set to 0.5.
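Formula (2) can be illustrated with a short Python sketch. The expression for IDF(L) is not shown in this text, so the sketch assumes the conventional form IDF(L) = log(N / DF(L)); this is an assumption, not the patent's stated formula. The later updates involving Sim(T, T1) and μ are not sketched. The function name `edge_weights` is hypothetical.

```python
import math
from collections import Counter


def edge_weights(relations, df, n):
    """Compute w(T->L) = c(T->L) * IDF(L) for every distinct edge.

    relations: list of (T, L) pairs, one per occurrence
    df:        dict mapping L to its count in the coordinate relationships
    n:         total number of nodes in the coordinate relationships
    IDF(L) = log(n / df[L]) is an assumed, conventional definition.
    """
    counts = Counter(relations)            # c(T->L)
    return {
        (t, l): c * math.log(n / df[l])
        for (t, l), c in counts.items()
    }


def ranked(weights):
    """Sort edges from highest to lowest weight."""
    return sorted(weights, key=weights.get, reverse=True)
```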
Embodiment
The concrete steps of this example are described in detail below in conjunction with the method of the present invention:
1) Optical character recognition (OCR) was performed on 10,000 computer books. On the digitized catalogue structure, entries were divided by length into two classes, long entries and short entries, with 9 Chinese characters as the boundary;
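A minimal Python sketch of the length-based split in step 1). It assumes that each entry has already been stripped of section numbers and whitespace, and that an entry of exactly 9 characters counts as short, in line with the 9-characters-or-fewer limit used in step 3). The function name `split_by_length` is hypothetical.

```python
def split_by_length(entries, limit=9):
    """Split catalogue entries into (short, long) lists by character count."""
    short_entries, long_entries = [], []
    for entry in entries:
        # len() counts Unicode characters, so one Chinese character counts as 1.
        if len(entry) <= limit:
            short_entries.append(entry)
        else:
            long_entries.append(entry)
    return short_entries, long_entries
```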
2) As shown in Fig. 1 and Fig. 2, short entries are used directly as one batch of candidate nodes. Long entries are part-of-speech tagged with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and conjunctions, punctuation and part-of-speech rules are then used to extract another batch of candidate nodes;
Long entries are segmented, part-of-speech tagged, split and merged as follows:
The sentence is segmented and part-of-speech tagged with the natural language processing tool, then split at enumeration commas (、) and at words tagged as coordinating conjunctions, so that one sentence becomes a string array A.
First, for each substring A[i] of the string array A, a part-of-speech array of <word, part of speech> pairs is formed.
Next, adjacent elements of the part-of-speech array are merged. Strings at different positions in A are merged in different orders. For the first substring A[0], merging proceeds from back to front, and consecutive adjectives and nouns are merged into one word. The part of speech of the last word in the part-of-speech array of A[0] is taken as the benchmark part of speech, which improves the accuracy of merging the part-of-speech arrays of the subsequent strings in A.
The part-of-speech arrays of the second substring A[1] and each later substring are processed as follows:
2.1) If the benchmark part of speech is a noun, the part-of-speech array of A[1] and of each later string is matched from front to back up to the last element whose part of speech is a noun, and the matched elements form one word; otherwise no result is returned;
2.2) Knowledge nodes such as film and book titles may carry 《》 or "" symbols. If the benchmark part of speech is a delimiter, that is, a book-title mark or a quotation mark, the part-of-speech array of A[1] and of each later string is matched from front to back up to the last delimiter, and the matched elements form one word; otherwise no result is returned;
If the benchmark part of speech is neither a noun nor a delimiter, then in the part-of-speech array of A[1] and of each later string, a word is formed when the first part of speech matches the benchmark part of speech; otherwise no result is returned;
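The back-to-front merge for the first substring A[0] can be sketched in Python as follows. This is one reading of the rule, not a definitive implementation: it merges only the trailing run of adjectives and nouns, and the tag names "noun", "adj" and "aux" are hypothetical stand-ins for whatever tags the tagger actually emits. The rules for A[1] and later substrings are not sketched.

```python
MERGEABLE = {"noun", "adj"}  # hypothetical tag names, not FudanNLP's own


def merge_first_substring(tagged):
    """Merge the trailing adjectives/nouns of A[0] into one word.

    tagged: list of (word, part_of_speech) pairs for A[0]
    Returns (merged_word, benchmark_pos); benchmark_pos is the tag of
    the last word, as described in the text.
    """
    if not tagged:
        return "", None
    benchmark = tagged[-1][1]
    merged = []
    for word, pos in reversed(tagged):   # scan from back to front
        if pos not in MERGEABLE:
            break                        # the consecutive run has ended
        merged.append(word)
    return "".join(reversed(merged)), benchmark
```

For the tagged substring 线性表/noun 的/aux 顺序/noun 存储/noun, the sketch returns the word 顺序存储 with benchmark part of speech "noun".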
For words not included in Baidu Baike or Hudong Baike, the Normalized Google Distance is used to measure how strongly two words cohere: two nodes whose distance lies between 0 and a threshold are regarded as a pair that can reasonably be merged into one word. When a word produced by the part-of-speech merging rules fails verification, the Normalized Google Distance computed over its part-of-speech array decides whether it is included:
NGD(x, y) = [max(log F(x), log F(y)) - log F(x, y)] / [log M - min(log F(x), log F(y))]
where NGD(x, y) is the value of the Normalized Google Distance,
F(x, y) is the number of results Google returns for the keyword "xy",
F(x) is the number of results Google returns for the keyword "x",
F(y) is the number of results Google returns for the keyword "y",
M is the total number of web pages indexed by Google.
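The Normalized Google Distance can be computed from the four counts with a few lines of Python. The sketch takes the counts as plain numbers and does not query any search engine. Returning infinity when a count is zero is an assumption made here to keep the function total, and the function name `ngd` is illustrative.

```python
import math


def ngd(fx, fy, fxy, m):
    """Normalized Google Distance from hit counts.

    fx, fy: result counts for x and for y
    fxy:    result count for the combined keyword "xy"
    m:      total number of indexed pages (assumed larger than fx and fy)
    """
    if min(fx, fy, fxy) <= 0:
        return math.inf  # assumption: no co-occurrence means maximally distant
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_x, log_y) - log_xy) / (math.log(m) - min(log_x, log_y))
```

A value close to 0 means the two words nearly always appear together; a candidate merge is accepted when the value falls between 0 and the chosen threshold.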
3) As shown in Fig. 3,
Structural entries such as exercises, experiments and examples in the OCR-processed catalogue pages of each book usually repeat many times at the same catalogue level within the same book. By building, for each catalogue level, a hash table of <catalogue entry, count> pairs, such entries can be counted and filtered out;
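The per-level hash table can be sketched with Python's `collections.Counter`. The text does not state how many repetitions mark an entry as structural, so the parameter `max_count` is a hypothetical choice, as is the function name.

```python
from collections import Counter


def filter_repeated(entries_by_level, max_count=1):
    """Remove entries that repeat within one catalogue level of one book.

    entries_by_level: dict mapping a level number to its list of entries
    max_count: largest count still treated as a real entry (assumed value)
    """
    kept = {}
    for level, entries in entries_by_level.items():
        counts = Counter(entries)   # hash table of <catalogue entry, count>
        kept[level] = [e for e in entries if counts[e] <= max_count]
    return kept
```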
The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidu Baike and Hudong Baike. Each node in the catalogue is then checked against this index, and nodes that pass verification by Baidu Baike or Hudong Baike are included;
When each catalogue node is processed, the open-source natural language processing software FudanNLP is used for lexical analysis, and nodes whose part of speech is tagged as a verb are not included;
For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the method of Fig. 2 for extracting coordinate relationships from a sentence is applied, and the extracted nodes then form superior-subordinate relationships with the superior node of the "noun + conjunction + noun" entry. The IDs of the books in which the superior-subordinate and coordinate relationships occur are saved as well;
During processing, the catalogue substructure of every book must be retained, and two tables are kept. One table stores the knowledge nodes obtained by the above method together with the book IDs and page numbers where they occur, and the book IDs where the superior-subordinate and coordinate relationships between knowledge nodes occur. Because entries that are temporarily discarded may still be qualified nodes, a second table stores the book IDs and page numbers of all catalogue nodes, verified or not, together with the book IDs of every superior-subordinate and coordinate relationship formed by all entries. Later, by computing the relevancy between nodes and by book-structure statistics, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as incremental supplements. The saved position of each entry can also be used for the subsequent extraction of definitions.
4) As shown in Fig. 4, the occurrences of each coordinate relationship are counted and the coordinate relationships are sorted by absolute frequency. Coordinate pairs with an absolute frequency greater than 4 are selected as strong coordinate relationships, and coordinate pairs with an absolute frequency less than 4 are treated as weak coordinate relationships;
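The split into strong and weak coordinate relationships can be sketched in Python as follows. The sketch assumes that coordinate pairs are unordered, and it treats a pair occurring exactly 4 times as weak, a case the text leaves open. The function name `split_coordinations` is hypothetical.

```python
from collections import Counter


def split_coordinations(pairs, threshold=4):
    """Split coordinate pairs into (strong, weak) sets by frequency.

    pairs: list of (node_a, node_b) tuples, one per occurrence
    """
    # frozenset makes (a, b) and (b, a) count as the same pair (assumption).
    counts = Counter(frozenset(p) for p in pairs)
    strong = {p for p, c in counts.items() if c > threshold}
    # Assumption: a count equal to the threshold is treated as weak.
    weak = {p for p, c in counts.items() if c <= threshold}
    return strong, weak
```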
The detailed process is as follows:
For each A → B, use the superior-subordinate relationships to find the group of subordinate nodes SubA of A. For each node EleOfSubA in this group, use the strong coordinate relationships to find a group of coordinate nodes ParaOfEleOfSubA. Merge all ParaOfEleOfSubA into one set, Set1;
For B, use the weak coordinate relationships to find a group of coordinate nodes ParaB. For each node C in ParaB, use the strong coordinate relationships in turn to find a group of coordinate nodes ParaOfParaB; each ParaOfParaB serves as one set, Set2;
For the sets Set1 and Set2, compute
Relevancy = SameElementCount + (Weight1 + Weight2) * 10
Weight1 = SameElementCount / Set1TotalElementCount
Weight2 = SameElementCount / Set2TotalElementCount
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions of the common elements in each set, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2.
This gives the relevancy between the subordinate node C and the superior node A. When Relevancy exceeds the threshold 0.5, C is regarded as a subordinate node of A and the corresponding A → C is included.
When several superior nodes point to the same subordinate node, the relevancy between each superior node and that subordinate node is computed, and the superior node that forms the superior-subordinate relationship with this subordinate node is selected according to the magnitude of the relevancy;
Strong coordinate relationships are merged directly into the superior-subordinate relationships; weak coordinate relationships are merged into the superior-subordinate relationships by means of the relevancy between knowledge nodes introduced above.
5) As shown in Fig. 5, according to the proposed method of mining the useful part of noisy data on the basis of suffixes, some nodes are selected from the entries not verified by Baidu Baike or Hudong Baike and added to the knowledge graph. When the data under the same first-level catalogue contain entries verified by Baidu Baike or Hudong Baike, the data not verified by Baidu Baike or Hudong Baike are filled back into the book according to the saved catalogue substructure and the book ID and page number of each entry.
When some of the data in a subcatalogue have passed verification by Baidu Baike or Hudong Baike, the useful part of the catalogue substructure is mined on the basis of suffixes and used to supplement the knowledge graph.
The specific method is as follows:
Let SetX be the set of elements in the superior-subordinate relationships verified by Baidu Baike or Hudong Baike. For each A → B among these relationships, find the list of IDs of all books in which A → B occurs.
For each book ID, find the set SetY of subordinate nodes of A in that book's catalogue.
For each node Node in the intersection of SetY and SetX (that is, each Node verified by Baidu Baike or Hudong Baike), find the set SetZ of entries in SetY that share a suffix with Node but are not verified by Baidu Baike or Hudong Baike. If
percentage=[(|SetX∩SetY|)+|SetZ|]/|SetY|>level
then all remaining entries in SetY other than Node that are not verified by Baidu Baike or Hudong Baike are included; otherwise, only the entries in SetY that share a suffix with Node are included, where the threshold level is set to 0.75.
6) Taking the similarity between nodes into account, a weight is calculated for each relationship in the supplemented knowledge graph and the relationships are sorted, so that part of the noise is filtered out and ranking-based screening is achieved.
The superior-subordinate relationships contain three kinds of noise: type 1, redundant relationships of the form A → A; type 2, short relationships A → B cut out of a long relationship A → BC; and type 3, meaningless cyclic relationships A → B → A. The supplemented superior-subordinate relationships are therefore read in again, types 1 and 3 are filtered out, and each short relationship A → B of type 2 is merged into the long relationship A → BC. This is the cleaning step;
For the cleaned superior-subordinate relationships, a weight is calculated for each relationship to represent its confidence;
For the superior-subordinate relationships produced by steps 1) to 5), the weight w'(T→L) is calculated according to the following formulas, and the relationships are then sorted;
where
c(T→L) is the number of times T→L occurs;
DF(L) is the number of times L occurs in the coordinate relationships;
N is the total number of nodes in the coordinate relationships;
IDF(L) is the inverse document frequency of L in the coordinate relationships;
w(T→L)=c(T→L)*IDF(L) (2)
where
w(T→L) is the weight of the edge T→L after the number of occurrences of T→L and the inverse document frequency of the subordinate node are considered;
Sim(T, T1) is the similarity between T and T1;
N(T, T1) is the number of times T and T1 co-occur in the coordinate relationships;
the intermediate weight is the update of the weight of T→L after the different subordinates of T in the coordinate relationships are also considered;
w'(T→L) is the weight of the relationship T→L after the different superiors with which L is in coordinate relationships are also considered and the correlation between nodes is added to the updated weight;
μ is a weighting coefficient, set to 0.5.
Results of this example: after all four kinds of increments were added, there were 25426 superior-subordinate relationships in total and 741 root nodes. The generated knowledge graph contained 843998 nodes, with a maximum depth of 85 levels and an average depth of 28.2 levels, and the accuracy was 75.1%.
For comparison, the hierarchy depth of the HowNet and CNKI knowledge classifications is generally a single digit, and their node counts fall far short of the Hudong Baike classification tree, so the Hudong Baike knowledge tree is chosen here as the object of comparison. Statistics over the computer-related subcategories of the Hudong Baike knowledge tree give 21 root nodes and 75434 nodes in total, with a maximum depth of 48 levels and an average depth of 7.3 levels.
The comparison shows that this method far exceeds current classification methods in indicators such as number of nodes and hierarchy depth, while maintaining high accuracy, requiring no manual intervention, and offering good extensibility.
Below are 6 examples of depth 5 selected from the knowledge graph produced by this method, together with the accuracy statistics of each:
Claims (6)
1. A method for constructing a knowledge graph based on book catalogues, characterized by comprising the following steps:
1) Select a book, perform optical character recognition on its catalogue pages to digitize them, and, on the digitized catalogue structure, divide the entries by length into two classes, long entries and short entries;
2) Use short entries directly as one batch of candidate nodes; part-of-speech tag long entries with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and then use conjunctions, punctuation and part-of-speech rules to extract another batch of candidate nodes;
3) Strictly filter the two batches of candidate nodes by checking whether each entry exists in Baidu Baike or Hudong Baike. For the part that passes verification, use the hierarchical structure of the catalogue to form superior-subordinate relationships and the same-level structure of the catalogue to form coordinate relationships, and take these two parts as the skeleton of the knowledge graph;
4) Distinguish strong and weak coordinate relationships, select nodes from each of the two kinds, and add them incrementally to the superior-subordinate relationships, enriching the skeleton of the knowledge graph obtained in the previous step;
5) According to the proposed method of mining the useful part of noisy data on the basis of suffixes, select some nodes from the entries not verified by Baidu Baike or Hudong Baike and add them to the knowledge graph;
6) For each relationship in the supplemented knowledge graph, calculate its weight and sort, thereby filtering out part of the noise and achieving ranking-based screening.
2. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 2) is as follows:
The sentence is segmented and part-of-speech tagged with the natural language processing tool, then split at enumeration commas (、) and at words tagged as coordinating conjunctions, so that one sentence becomes a string array A.
First, for each substring A[i] of the string array A, a part-of-speech array of <word, part of speech> pairs is formed.
Next, adjacent elements of the part-of-speech array are merged. Strings at different positions in A are merged in different orders. For the first substring A[0], merging proceeds from back to front, and consecutive adjectives and nouns are merged into one word. The part of speech of the last word in the part-of-speech array of A[0] is taken as the benchmark part of speech, which improves the accuracy of merging the part-of-speech arrays of the subsequent strings in A.
The part-of-speech arrays of the second substring A[1] and each later substring are processed as follows:
2.1) If the benchmark part of speech is a noun, the part-of-speech array of A[1] and of each later string is matched from front to back up to the last element whose part of speech is a noun, and the matched elements form one word; otherwise no result is returned;
2.2) Knowledge nodes such as film and book titles may carry 《》 or "" symbols. If the benchmark part of speech is a delimiter, that is, a book-title mark or a quotation mark, the part-of-speech array of A[1] and of each later string is matched from front to back up to the last delimiter, and the matched elements form one word; otherwise no result is returned;
If the benchmark part of speech is neither a noun nor a delimiter, then in the part-of-speech array of A[1] and of each later string, a word is formed when the first part of speech matches the benchmark part of speech; otherwise no result is returned;
For words not included in Baidu Baike or Hudong Baike, the Normalized Google Distance is used to measure how strongly two words cohere: two nodes whose distance lies between 0 and a threshold are regarded as a pair that can reasonably be merged into one word. When a word produced by the part-of-speech merging rules fails verification, the Normalized Google Distance computed over its part-of-speech array decides whether it is included:
NGD(x, y) = [max(log F(x), log F(y)) - log F(x, y)] / [log M - min(log F(x), log F(y))]
where NGD(x, y) is the value of the Normalized Google Distance,
F(x, y) is the number of results Google returns for the keyword "xy",
F(x) is the number of results Google returns for the keyword "x",
F(y) is the number of results Google returns for the keyword "y",
M is the total number of web pages indexed by Google.
3. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 3) comprises:
3.1) Structural entries such as exercises, experiments and examples in the OCR-processed catalogue pages of each book usually repeat many times at the same catalogue level within the same book. By building, for each catalogue level, a hash table of <catalogue entry, count> pairs, such entries can be counted and filtered out;
3.2) The length of catalogue entries is limited: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidu Baike and Hudong Baike. Each node in the catalogue is then checked against this index, and nodes that pass verification by Baidu Baike or Hudong Baike are included;
When each catalogue node is processed, the open-source natural language processing software FudanNLP is used for lexical analysis, and nodes whose part of speech is tagged as a verb are not included;
3.3) For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the method of step 2.1) and step 2.2) for extracting coordinate relationships from a sentence is applied, and the extracted nodes then form superior-subordinate relationships with the superior node of the "noun + conjunction + noun" entry. The IDs of the books in which the superior-subordinate and coordinate relationships occur are saved as well;
During processing, the catalogue substructure of every book must be retained, and two tables are kept. One table stores the knowledge nodes obtained by the above method together with the book IDs and page numbers where they occur, and the book IDs where the superior-subordinate and coordinate relationships between knowledge nodes occur. Because entries that are temporarily discarded may still be qualified nodes, a second table stores the book IDs and page numbers of all catalogue nodes, verified or not, together with the book IDs of every superior-subordinate and coordinate relationship formed by all entries. Later, by computing the relevancy between nodes and by book-structure statistics, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as incremental supplements. The saved position of each entry can also be used for the subsequent extraction of definitions.
4. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 4) comprises:
4.1) Distinguishing strong and weak coordinate relationships
The occurrences of each coordinate relationship are counted and the coordinate relationships are sorted by absolute frequency. Coordinate pairs with an absolute frequency greater than a threshold are selected as strong coordinate relationships, and coordinate pairs with an absolute frequency less than the threshold are treated as weak coordinate relationships;
4.2) Relevancy between knowledge nodes
Knowledge nodes are often ambiguous. In a data set formed from superior-subordinate relationships and coordinate relationships, several superior nodes may point to the same subordinate node; this conflict is resolved with the help of the other nodes related to the knowledge nodes involved.
The detailed process is as follows:
4.2.1) For each A → B, use the superior-subordinate relationships to find the group of subordinate nodes SubA of A. For each node EleOfSubA in this group, use the strong coordinate relationships to find a group of coordinate nodes ParaOfEleOfSubA. Merge all ParaOfEleOfSubA into one set, Set1;
4.2.2) For B, use the weak coordinate relationships to find a group of coordinate nodes ParaB. For each node C in ParaB, use the strong coordinate relationships in turn to find a group of coordinate nodes ParaOfParaB; each ParaOfParaB serves as one set, Set2;
4.2.3) For the sets Set1 and Set2, compute
Relevancy = SameElementCount + (Weight1 + Weight2) * 10
Weight1 = SameElementCount / Set1TotalElementCount
Weight2 = SameElementCount / Set2TotalElementCount
where SameElementCount is the number of elements common to the two sets, Weight1 and Weight2 are the proportions of the common elements in each set, Set1TotalElementCount is the total number of elements in Set1, and Set2TotalElementCount is the total number of elements in Set2.
This gives the relevancy between the subordinate node C and the superior node A. When Relevancy exceeds the threshold, C is regarded as a subordinate node of A and the corresponding A → C is included.
When several superior nodes point to the same subordinate node, the relevancy between each superior node and that subordinate node is computed, and the superior node that forms the superior-subordinate relationship with this subordinate node is selected according to the magnitude of the relevancy;
4.3) Supplementing with the strong and weak coordinate relationships
Strong coordinate relationships are merged directly into the superior-subordinate relationships; weak coordinate relationships are merged into the superior-subordinate relationships by means of the relevancy between knowledge nodes introduced above.
5. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 5) comprises:
When the data under the same first-level catalogue contain entries verified by Baidu Baike or Hudong Baike, the data not verified by Baidu Baike or Hudong Baike are filled back into the book according to the catalogue substructure saved in step 1) and the book ID and page number of each entry.
When some of the data in a subcatalogue have passed verification by Baidu Baike or Hudong Baike, the useful part of the catalogue substructure is mined on the basis of suffixes and used to supplement the knowledge graph.
The specific method is as follows:
Let SetX be the set of elements in the superior-subordinate relationships verified by Baidu Baike or Hudong Baike. For each A → B among these relationships, find the list of IDs of all books in which A → B occurs.
For each book ID, find the set SetY of subordinate nodes of A in that book's catalogue.
For each node Node in the intersection of SetY and SetX (that is, each Node verified by Baidu Baike or Hudong Baike), find the set SetZ of entries in SetY that share a suffix with Node but are not verified by Baidu Baike or Hudong Baike. If
percentage=[(|SetX∩SetY|)+|SetZ|]/|SetY|>level
then all remaining entries in SetY other than Node that are not verified by Baidu Baike or Hudong Baike are included; otherwise, only the entries in SetY that share a suffix with Node are included, where level is a preset threshold.
6. the construction method of a kind of knowledge collection of illustrative plates based on library catalogue according to claim 1, is characterized in that: described step 6) comprises:
6.1) in relationship between superior and subordinate, clean
In relationship between superior and subordinate, exist: Class1, redundancy relationship, be A → A, type 2, length are related to the short A of relation → B that A → BC is shredded out, insignificant A → B → the A that is related to of type 3, circulation, so the relationship between superior and subordinate after supplementing is read in again, and Class1, type 3 screenings are fallen, the short A of relation in type 2 → B is merged and is included into long the relation in A → BC;
6.2, relationship between superior and subordinate sequence
To the relationship between superior and subordinate of having cleaned, need to calculate weight to each relation, represent the confidence level of this relation;
Step 1) to the relationship between superior and subordinate that step 5) produces is calculated to its weight w ' (T → L) according to following formula, then sort;
where
c(T → L) denotes the number of times T → L occurs;
DF(L) denotes the number of times L occurs in the coordinate relations;
N denotes the total number of nodes in the coordinate relations;
IDF(L) denotes the inverse document frequency of L in the coordinate relations;
w(T → L) = c(T → L) × IDF(L) (2)
where
w(T → L) denotes the weight of the edge T → L after taking into account the number of occurrences of T → L and the inverse document frequency of the subordinate node;
where
Sim(T, T1) denotes the similarity between T and T1;
N(T, T1) denotes the number of times T and T1 occur together in the coordinate relations;
where
the formula's result is the T → L weight updated to additionally account for the different subordinates of T in the coordinate relations;
where
w'(T → L) denotes the weight of the relation T → L after additionally considering the different superiors of L in the coordinate relations and, following that update, adding the correlation between nodes;
μ is a weighting coefficient, set to 0.5.
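The base weight of formula (2) can be sketched as follows. Only formula (2) is implemented, since the similarity-based refinements leading to w'(T → L) are not reproduced in the text above. The expression for IDF(L) is likewise not given here, so the common form log(N / DF(L)) is used as an assumption; the function name `rank_relations` and the representation of coordinate relations as lists of sibling nodes are also hypothetical.

```python
import math
from collections import Counter

def rank_relations(relation_counts, coordinate_groups):
    """Rank relations by w(T -> L) = c(T -> L) * IDF(L), highest weight first.

    relation_counts: dict mapping (T, L) -> c(T -> L).
    coordinate_groups: iterable of lists, each a group of coordinate (sibling) nodes.
    """
    # DF(L): occurrences of L across all coordinate groups.
    df = Counter(node for group in coordinate_groups for node in group)
    n = max(sum(df.values()), 1)  # N: total number of nodes in the coordinate groups
    weights = {}
    for (t, l), c in relation_counts.items():
        # Assumed IDF form: log(N / DF(L)); DF defaults to 1 for nodes never seen.
        idf = math.log(n / max(df[l], 1))
        weights[(t, l)] = c * idf
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```

Under this assumed IDF, a subordinate that appears in many coordinate groups is down-weighted, so a frequent but generic node contributes less confidence than a rarer one with the same occurrence count.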
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310601668.7A CN103729402B (en) | 2013-11-22 | 2013-11-22 | Method for establishing mapping knowledge domain based on book catalogue |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103729402A (en) | 2014-04-16 |
CN103729402B (en) | 2017-01-18 |
Family
ID=50453477
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310601668.7A Active CN103729402B (en) | 2013-11-22 | 2013-11-22 | Method for establishing mapping knowledge domain based on book catalogue |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103729402B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1380620A (en) * | 2001-12-18 | 2002-11-20 | 张弦 | Automatic editing method of book index |
CN1389811A (en) * | 2002-02-06 | 2003-01-08 | 北京造极人工智能技术有限公司 | Intelligent search method of search engine |
KR20120105796A (en) * | 2011-03-16 | 2012-09-26 | 주식회사 유비온 | Method for intelligent tutoring and system therefor |
US20120324346A1 (en) * | 2011-06-15 | 2012-12-20 | Terrence Monroe | Method for relational analysis of parsed input for visual mapping of knowledge information |
CN102332023A (en) * | 2011-09-27 | 2012-01-25 | 北京中科希望软件股份有限公司 | Method and system for fast semantic annotation of e-book |
US20130283138A1 (en) * | 2012-04-24 | 2013-10-24 | Wo Hai Tao | Method for creating knowledge map |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462227A (en) * | 2014-11-13 | 2015-03-25 | 中国测绘科学研究院 | Automatic construction method of graphic knowledge genealogy |
CN107004011A (en) * | 2014-12-23 | 2017-08-01 | 英特尔公司 | For evolution figure distribution overall situation edge ID |
CN106355627A (en) * | 2015-07-16 | 2017-01-25 | 中国石油化工股份有限公司 | Method and system used for generating knowledge graphs |
CN105653706A (en) * | 2015-12-31 | 2016-06-08 | 北京理工大学 | Multilayer quotation recommendation method based on literature content mapping knowledge domain |
CN105893485B (en) * | 2016-03-29 | 2019-02-12 | 浙江大学 | A kind of thematic automatic generation method based on library catalogue |
CN105893485A (en) * | 2016-03-29 | 2016-08-24 | 浙江大学 | Automatic special subject generating method based on book catalogue |
CN108205564B (en) * | 2016-12-19 | 2021-04-09 | 北大方正集团有限公司 | Knowledge system construction method and system |
CN108205564A (en) * | 2016-12-19 | 2018-06-26 | 北大方正集团有限公司 | Knowledge hierarchy construction method and system |
CN107609639A (en) * | 2017-09-18 | 2018-01-19 | 前海梧桐(深圳)数据有限公司 | The business data layering method and its system of imitative neuron |
CN110110089A (en) * | 2018-01-09 | 2019-08-09 | 网智天元科技集团股份有限公司 | Cultural relations drawing generating method and system |
CN110110089B (en) * | 2018-01-09 | 2021-03-30 | 网智天元科技集团股份有限公司 | Cultural relation graph generation method and system |
CN108491469A (en) * | 2018-03-07 | 2018-09-04 | 浙江大学 | Introduce the neural collaborative filtering conceptual description word proposed algorithm of concepts tab |
CN108491469B (en) * | 2018-03-07 | 2021-03-30 | 浙江大学 | Neural collaborative filtering concept descriptor recommendation method introducing concept label |
CN108416024A (en) * | 2018-03-08 | 2018-08-17 | 网易乐得科技有限公司 | Data processing method and device, medium and computing device |
CN108509420A (en) * | 2018-03-29 | 2018-09-07 | 赵维平 | Gu spectrum and ancient culture knowledge mapping natural language processing method |
CN110019948A (en) * | 2018-08-31 | 2019-07-16 | 北京字节跳动网络技术有限公司 | Method and apparatus for output information |
CN109657074A (en) * | 2018-09-28 | 2019-04-19 | 北京信息科技大学 | News knowledge mapping construction method based on number of addresses |
CN109657074B (en) * | 2018-09-28 | 2023-11-10 | 北京信息科技大学 | News knowledge graph construction method based on address tree |
CN109597856A (en) * | 2018-12-05 | 2019-04-09 | 北京知道创宇信息技术有限公司 | A kind of data processing method, device, electronic equipment and storage medium |
CN109597856B (en) * | 2018-12-05 | 2020-12-25 | 北京知道创宇信息技术股份有限公司 | Data processing method and device, electronic equipment and storage medium |
CN110379520A (en) * | 2019-06-18 | 2019-10-25 | 北京百度网讯科技有限公司 | The method for digging and device of medical knowledge map, computer equipment and readable medium |
CN111061884A (en) * | 2019-11-14 | 2020-04-24 | 临沂市拓普网络股份有限公司 | Method for constructing K12 education knowledge graph based on DeepDive technology |
CN111061884B (en) * | 2019-11-14 | 2023-11-21 | 临沂市拓普网络股份有限公司 | Method for constructing K12 education knowledge graph based on deep technology |
CN111090754B (en) * | 2019-11-20 | 2023-04-07 | 新华智云科技有限公司 | Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries |
CN111090754A (en) * | 2019-11-20 | 2020-05-01 | 新华智云科技有限公司 | Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries |
CN112015792A (en) * | 2019-12-11 | 2020-12-01 | 天津泰凡科技有限公司 | Material duplicate code analysis method and device and computer storage medium |
CN112015792B (en) * | 2019-12-11 | 2023-12-01 | 天津泰凡科技有限公司 | Material repeated code analysis method and device and computer storage medium |
CN111177411A (en) * | 2019-12-27 | 2020-05-19 | 赣州市智能产业创新研究院 | Knowledge graph construction method based on NLP |
WO2021190091A1 (en) * | 2020-03-26 | 2021-09-30 | 深圳壹账通智能科技有限公司 | Knowledge map construction method and device based on knowledge node belonging degree |
WO2023246849A1 (en) * | 2022-06-22 | 2023-12-28 | 青岛海尔电冰箱有限公司 | Feedback data graph generation method and refrigerator |
CN115809371A (en) * | 2023-02-01 | 2023-03-17 | 中信联合云科技有限责任公司 | Learning demand determination method and system based on data analysis |
CN115809371B (en) * | 2023-02-01 | 2023-09-01 | 中信联合云科技有限责任公司 | Learning requirement determining method and system based on data analysis |
Also Published As
Publication number | Publication date |
---|---|
CN103729402B (en) | 2017-01-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |