CN106844658A

CN106844658A - Method and system for automatically constructing a knowledge graph from Chinese text

Info

Publication number: CN106844658A
Application number: CN201710050095.1A
Authority: CN
Inventors: 苏晓恒; 万海
Original assignee: National Sun Yat Sen University
Current assignee: National Sun Yat Sen University
Priority date: 2017-01-23
Filing date: 2017-01-23
Publication date: 2017-06-13
Anticipated expiration: 2037-01-23
Also published as: CN106844658B

Abstract

The method provided by the invention can construct a knowledge graph from Chinese text. Moreover, as the method is used more often, the text library, relation library and entity library of each domain are progressively expanded, so the quality of the knowledge graphs it builds improves.

Description

Method and system for automatically constructing a knowledge graph from Chinese text

Technical field

The present invention relates to the field of knowledge graph construction, and more particularly to a method and system for automatically constructing a knowledge graph from Chinese text.

Background art

A knowledge graph is an organizational structure for knowledge, named for its graph-like form. A typical knowledge graph contains a set of concepts, instances and relations. Compared with plain text, a knowledge graph is structured: the nodes of the graph represent concepts or instances, and the edges between nodes represent the relations between them, whereas text is generally regarded as unstructured. Knowledge graphs are widely applied, in fields such as semantic search, intelligent question answering, knowledge engineering, data mining and digital libraries. The construction of knowledge graphs is generally divided into manual, automatic and semi-automatic construction. Manual construction consumes a great deal of manpower and resources and is hard to adjust as knowledge changes; automatic construction relies on knowledge acquisition, machine learning and statistical techniques to obtain a knowledge graph from unstructured data; semi-automatic construction lies between manual and automatic construction, because full automation is hard to achieve.

The main methods for constructing knowledge graphs at present are methods based on lexical patterns, methods based on clustering, and methods based on distributional similarity. Methods based on lexical patterns predefine a number of patterns and then extract the corresponding concepts and relations from text; for example, the pattern "Fruit such as apple" indicates that an apple is a kind of fruit. Methods based on clustering group concepts or instances by their features and usually yield a knowledge graph of hierarchical relations. Methods based on distributional similarity rest on the hypothesis that words with similar contexts have similar meanings; for example, given "Beijing is the capital of China" and "Tokyo is the capital of Japan", Beijing and China share a similar context with Tokyo and Japan. Knowledge graph construction technology started earlier and developed faster abroad, but in China there is currently no complete system that can automatically extract a knowledge graph from Chinese text. The main reason is that English has fixed forms and simple expression and needs no word segmentation, whereas Chinese has a complex structure and varied forms of expression and must be segmented into words.
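As an illustration of the lexical-pattern approach mentioned above, the following minimal sketch matches a single pattern, "X such as Y", with a regular expression and emits an is-a pair. The regular expression, the function name `extract_is_a` and the sample sentence are assumptions made for illustration; they are not taken from the patent.

```python
import re

# Hypothetical pattern: "<concept> such as <instance>" implies instance is-a concept.
SUCH_AS = re.compile(r"(\w+) such as (\w+)", re.IGNORECASE)

def extract_is_a(text):
    """Return (instance, 'is-a', concept) tuples for every match of the pattern."""
    return [(m.group(2).lower(), "is-a", m.group(1).lower())
            for m in SUCH_AS.finditer(text)]

print(extract_is_a("Fruit such as apple is sold here."))
```

A real pattern-based extractor would need many such patterns, which is one reason the approach is hard to carry over to Chinese text directly.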

Summary of the invention

To solve the above problems of the prior art, the present invention provides a method for automatically constructing a knowledge graph from Chinese text, with which the knowledge graph of a Chinese text can be built.

To achieve the above object, the following technical solution is adopted:

A method for automatically constructing a knowledge graph from Chinese text, comprising the following steps:

S1. Crawl documents of each domain from online encyclopedias, extract entities and relations according to the knowledge organization structure of the encyclopedia pages, and store them in the entity library and relation library of the corresponding domain; the crawled documents of each domain are also stored in the text library of the corresponding domain;

S2. If a knowledge graph is to be built for a document j, perform the following processing on it;

S3. Perform word segmentation on document j;

S4. Extract the core words of document j;

S5. Extract the secondary words of document j using TF-IDF;

S6. Determine the domain to which document j belongs:

S61. Find all words of document j, compute the TF-IDF value of each, and arrange the values in word order to obtain the vocabulary vector representation of document j;

S62. Obtain the vocabulary vector representation of the documents of each domain by the method of step S61, then compute the cosine between the vocabulary vector of document j and the vocabulary vector of the documents of each domain; the domain whose documents give the largest cosine is the domain to which document j belongs; document j is then stored in the text library of that domain;

S7. Extract the (entity, relation, entity) triples in document j:

S71. Select from document j the sentences in which domain words occur as transactions, a transaction being the set of all terms in a selected sentence; the domain words are the terms collected in the entity library and relation library of the domain to which document j belongs;

S72. Compute the support of each term in the transactions, and treat the terms whose support is above a threshold as frequent items;

S73. Compute the confidence between any two frequent items; if the confidence between two frequent items is above a threshold, extract the two frequent items as a word pair;

S74. Combine the words of the word pairs, the core words and the secondary words into a term set, locate all sentences in document j that contain terms from this set, then perform coreference resolution on these sentences and delete their secondary constituents, obtaining the nouns and verbs needed to extract the (entity, relation, entity) triples;

S75. First find the verb in a sentence, then form a candidate (noun, verb, noun) triple from the nouns immediately before and after the verb; then use similarity analysis to compute the similarity between the relations in the relation library of the domain of document j and the verb of the candidate triple; if the similarity is above a threshold, put the verb into the relation library of the domain of document j and put the nouns of the candidate triple into the entity library of that domain; the candidate (noun, verb, noun) triple is then a formal (entity, relation, entity) triple extracted from document j;

S76. If step S75 fails to extract an (entity, relation, entity) triple, find another noun in the sentence other than the core word, then use similarity analysis to compute the similarity between the entities in the entity library of the domain of document j and that noun; if the similarity is above a threshold, find the word between the two nouns and use similarity analysis to compute its similarity to the relations in the relation library of the domain of document j; if that similarity is above a threshold, put the word into the relation library of the domain and put the noun extracted in step S75 into the entity library of the domain; an (entity, relation, entity) triple extracted from document j is thereby obtained;

S8. Generate the knowledge graph of document j from the extracted (entity, relation, entity) triples.
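Steps S61 and S62 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each domain's text library is treated as one token list, the IDF formula is a common smoothed variant (the source gives no formula), and the names `tfidf_vector`, `cosine` and `classify` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, doc_freq, n_docs):
    """TF-IDF vector of one tokenized document over a fixed vocabulary order."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    vec = []
    for word in vocab:
        tf = counts[word] / total
        # Smoothed IDF: an assumption, the source does not state a formula.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(word, 0))) + 1
        vec.append(tf * idf)
    return vec

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc_tokens, domain_tokens):
    """Return the domain with the largest cosine, plus all scores."""
    vocab = sorted(set(doc_tokens).union(*domain_tokens.values()))
    corpus = list(domain_tokens.values())
    doc_freq = Counter(w for toks in corpus for w in set(toks))
    n = len(corpus)
    doc_vec = tfidf_vector(doc_tokens, vocab, doc_freq, n)
    scores = {d: cosine(doc_vec, tfidf_vector(t, vocab, doc_freq, n))
              for d, t in domain_tokens.items()}
    return max(scores, key=scores.get), scores

# Toy data (already segmented); real input would come from step S3.
domains = {
    "computing": ["memory", "disk", "cpu", "memory"],
    "sports": ["football", "match", "goal", "team"],
}
best, scores = classify(["memory", "disk", "memory", "module"], domains)
print(best)
```

Sorting the vocabulary fixes the element order, so the document vector and every domain vector are comparable position by position, which is what the cosine in S62 requires.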

Preferably, step S3 uses the jieba tool to segment document j.

Preferably, the similarity analysis uses Word2vec or the synonym thesaurus Tongyici Cilin.
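The candidate-triple check of step S75 can be sketched as below. It is only an illustration: the two-dimensional vectors stand in for vectors from a trained word-embedding model, the part-of-speech tags are supplied by hand, English tokens replace Chinese ones for readability, and the names `candidate_triples` and `accept` and the threshold value are assumptions.

```python
import math

# Toy 2-d "embeddings"; real vectors would come from a trained model (assumption).
VECTORS = {
    "love": (0.9, 0.1),
    "like": (0.8, 0.2),
    "eat": (0.1, 0.9),
}

def cosine(u, v):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0

def candidate_triples(tagged):
    """Yield (noun, verb, noun) using the nearest noun on each side of a verb.
    `tagged` is a list of (word, pos) pairs with pos 'n' or 'v'."""
    for i, (word, pos) in enumerate(tagged):
        if pos != "v":
            continue
        before = [w for w, p in tagged[:i] if p == "n"]
        after = [w for w, p in tagged[i + 1:] if p == "n"]
        if before and after:
            yield before[-1], word, after[0]

def accept(triple, relation_library, threshold=0.9):
    """Keep the triple if its verb is similar enough to a known relation."""
    verb = triple[1]
    if verb not in VECTORS:
        return False
    return any(cosine(VECTORS[verb], VECTORS[r]) > threshold
               for r in relation_library if r in VECTORS)

sentence = [("Chinese", "n"), ("love", "v"), ("Beijing", "n")]
triples = [t for t in candidate_triples(sentence) if accept(t, {"like"})]
print(triples)
```

In the method itself, an accepted verb and its two nouns would then be added to the relation library and entity library of the document's domain.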

The invention also provides a system applying the above method, as follows:

The system comprises a per-domain knowledge database module, a document processing module, an entity and relation extraction module, and a knowledge graph generation module. The per-domain knowledge database module comprises the entity library, relation library and text library of each domain; the document processing module performs steps S3 to S62, the entity and relation extraction module performs steps S7 to S76, and the knowledge graph generation module performs step S8.

Compared with the prior art, the beneficial effects of the invention are:

The method provided by the invention can construct a knowledge graph from Chinese text. Moreover, as the method is used more often, the text library, relation library and entity library of each domain are progressively expanded, so the quality of the knowledge graphs it builds improves.

Brief description of the drawings

Fig. 1 is a schematic diagram of the knowledge database module.

Fig. 2 is a schematic diagram of the document processing module.

Fig. 3 is a schematic diagram of the entity and relation extraction module and the knowledge graph generation module.

Detailed description of the embodiments

The drawings are for illustration only and shall not be construed as limiting this patent.

The invention is further described below with reference to the drawings and embodiments.

Embodiment 1

The system provided by the invention mainly comprises four modules: the document processing module, the entity and relation extraction module, the knowledge graph generation module, and the per-domain knowledge database module. The workflow is as follows: the document processing module first preprocesses the input document; the entity and relation extraction module then extracts the entities and relations in the document; finally, the extracted entities and relations are sent to the knowledge graph generation module, which builds the complete knowledge graph and returns it to the user, and the data in the knowledge graph generation module is updated. Each module is described in detail below.

The per-domain knowledge database module mines the knowledge data of each domain from online encyclopedias and stores it for use in building the knowledge graph of a document. Following the classification used by the encyclopedias, the invention divides the knowledge database into 11 major categories (art, science, nature, culture, geography, life, society, people, economy, sports and history) and subdivides each major category into finer groups; for example, the science category is subdivided into 16 groups, including health and medicine, electronic information, aerospace, automotive engineering and biomedicine. The purpose is to compile a knowledge database for every field from the encyclopedias, so that any given document can be mapped by a classification algorithm to the knowledge database module of a specific domain, and the knowledge database module of that domain can then be used heuristically to build the knowledge graph of the document. As shown in Fig. 1, the knowledge database module is further divided into a system knowledge database and a user knowledge database. The system knowledge database is the knowledge database that the system carries at initialization; it is used to heuristically discover new knowledge data in user documents, and users have no permission to operate on it. The user knowledge database holds the knowledge data obtained while extracting from the user's documents; users have permission to operate on it and can define or add their own dictionaries according to their working needs, improving the overall construction result. The network processing layer is responsible for crawling all documents of each category from online encyclopedias (Baidu Baike, Hudong Baike, Chinese Wikipedia), extracting entities and relations according to the knowledge organization structure of the encyclopedia pages, and putting them into the entity library and relation library of the knowledge database module; the processed encyclopedia text is also retained and put into the text library. Models such as word2vec are then trained and put into the knowledge database module. The knowledge database module thus consists of four parts: the entity library, the relation library, the text library and the models.
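One possible in-memory layout for the knowledge database module described above is sketched below. It is an assumption made for illustration: the class and method names are hypothetical, the trained models (the fourth part) are omitted, and the read-only status of the system part is a convention of the sketch rather than something it enforces.

```python
from dataclasses import dataclass, field

@dataclass
class DomainKnowledgeBase:
    """Entity, relation and text stores for one domain (names are illustrative)."""
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)
    texts: list = field(default_factory=list)

@dataclass
class KnowledgeDatabase:
    """System part is treated as read-only; the user part accepts new knowledge."""
    system: dict = field(default_factory=dict)  # domain name -> DomainKnowledgeBase
    user: dict = field(default_factory=dict)    # domain name -> DomainKnowledgeBase

    def domain_words(self, domain):
        """Union of entity and relation terms from both parts for one domain."""
        words = set()
        for part in (self.system, self.user):
            kb = part.get(domain)
            if kb:
                words |= kb.entities | kb.relations
        return words

    def add_user_triple(self, domain, head, relation, tail):
        """Record a newly extracted triple in the user knowledge database only."""
        kb = self.user.setdefault(domain, DomainKnowledgeBase())
        kb.entities.update((head, tail))
        kb.relations.add(relation)

db = KnowledgeDatabase(
    system={"computing": DomainKnowledgeBase({"memory"}, {"contains"})}
)
db.add_user_triple("computing", "memory", "uses", "memory module")
print(sorted(db.domain_words("computing")))
```

Keeping the two parts separate means that knowledge extracted from user documents never alters the data shipped with the system.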

The document processing module preprocesses a document into smaller processing units. It has four parts: preprocessing, core word extraction, secondary word extraction and document classification. As shown in Fig. 2, in the preprocessing part the invention puts each sentence of the document on its own line and segments the document with the jieba tool. jieba performs well, segmenting efficiently and accurately, and allows a dictionary of terms to be imported, whose entries it gives priority during segmentation; the invention merges the entity library and relation library of the knowledge database module into the jieba dictionary. Besides segmentation, stop words must also be filtered out.

The core word extraction part extracts the core (topic) words of the document. The invention proposes a core word extraction algorithm based on the assumption that every document is written around core (topic) words and that core words are distributed in blocks; for example, paragraphs 1 to 5 describe memory and paragraphs 6 to 12 describe hard disks. In addition, each core word block contains knowledge that further explains its term; for example, the block on memory may mention the concept of a memory module, or the relation between memory and memory modules. This is the basis on which the invention builds the knowledge graph. The core word extraction algorithm has two aspects. First, the term density of each word obtained by segmentation is computed (the total number of times the term occurs in the document divided by the number of lines the term spans), and a threshold is defined to select candidate core words. Second, the uniformity of each candidate word is computed, that is, whether the candidate is evenly distributed within its word block: the block of the candidate word is divided into segments, for example one segment per three lines, and the uniformity of the candidate word equals the proportion of segments in which it occurs.

The document classification part finds the knowledge database module of the appropriate category for the document; because the system library holds initial information, it can heuristically help to discover more new entities and relations under that category. Here the invention classifies text using the cosine, as follows. First, all words (content words) of the document are found and their TF-IDF values are computed; the values are then arranged in vocabulary order to obtain the vocabulary vector representation of the document, in which the value of each element represents the contribution of that word to the document. Because the documents of each category are largely composed of a fixed set of specialized terms, the vector reflects how much the specialized vocabulary of a category contributes to a document of that category. Finally, using the text library in the knowledge database module, the invention computes the cosine between the document vector and each category vector of the text library, and assigns the document to the domain with the largest value.

As noted above, a core word block usually contains secondary words that help to describe the concept of the core word; for example, "memory module" helps to describe the concept of "memory". The main function of the secondary word extraction part is therefore to extract these secondary words. The invention extracts them using TF-IDF, the same technique used in document classification. TF (Term Frequency) is the frequency with which the term occurs in the document; IDF (Inverse Document Frequency) is derived, inversely, from how often the term occurs in the documents of the same category in the knowledge database; TF-IDF is the product of the two values and expresses the overall importance of the term in the document. The invention sets a relatively high threshold and selects the important words. The core words and secondary words are then put into the knowledge database module.
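The term-density measure used to select candidate core words (total occurrences of a term divided by the number of lines it spans) can be sketched as follows. The sketch makes two assumptions that the source does not state: the span is counted from the first to the last line containing the term, inclusive, and a hypothetical `min_count` filter drops words that occur only once. The uniformity check is not covered.

```python
from collections import defaultdict

def term_density(lines):
    """Map each term to (occurrences) / (number of lines it spans).
    `lines` is a list of token lists, one per sentence line."""
    count = defaultdict(int)
    first, last = {}, {}
    for idx, tokens in enumerate(lines):
        for tok in tokens:
            count[tok] += 1
            first.setdefault(tok, idx)  # line of first occurrence
            last[tok] = idx             # line of latest occurrence
    return {t: count[t] / (last[t] - first[t] + 1) for t in count}

def candidate_core_words(lines, threshold, min_count=2):
    """Terms seen at least `min_count` times whose density reaches `threshold`.
    `min_count` is an added assumption to skip one-off words."""
    density = term_density(lines)
    totals = defaultdict(int)
    for tokens in lines:
        for tok in tokens:
            totals[tok] += 1
    return sorted(t for t, d in density.items()
                  if d >= threshold and totals[t] >= min_count)

# Toy document, already segmented, one sentence per line.
lines = [
    ["memory", "stores", "data"],
    ["memory", "module", "holds", "memory"],
    ["disk", "stores", "files"],
]
print(term_density(lines)["memory"])
print(candidate_core_words(lines, threshold=1.0))
```

Here "memory" occurs three times within two lines, so its density is higher than that of "stores", which occurs twice across three lines.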

The entity and relation extraction module is responsible for extracting the (entity, relation, entity) triples in the document. As shown in Fig. 3, the association analysis part mines pairs of words in the document that may be related, that is, pairs between which there may be a relation for the invention to extract. The invention improves on the Apriori association analysis technique. First, the sentences in which domain words occur are selected from the document as transactions (domain words here are the terms collected in the entity libraries and relation libraries of the system knowledge database and the user knowledge database); a transaction is the set of all terms in a selected sentence, so every transaction contains at least one domain word. Then the support of each term in the document is computed, that is, the proportion of all transactions that contain the term; terms whose support is above a certain threshold are treated as frequent items. Next, the confidence of two frequent items is computed, that is, the ratio between the transactions containing one of the frequent items and the transactions containing both frequent items; if it is above a certain threshold, the two items are extracted as a word pair. Most such word pairs include a core word, but word pairs without a core word are included as well.

The sentence reduction part deletes the secondary constituents of the important sentences so as to expose the relation within a word pair. The invention combines the words of the frequent word pairs with the core words and secondary words of the document into a term set and locates all sentences in the document that contain terms from this set; association analysis does not consider the structure and composition of sentences, but it lets the invention determine which words may be related. Coreference resolution is performed first: a pronoun in a clause of a sentence is replaced with the noun it refers to. For example, "Beijing is the capital of China, and Chinese people love it deeply" becomes, after coreference resolution, "Beijing is the capital of China, and Chinese people love Beijing deeply". A sentence is one processing unit, but a long sentence usually consists of several clauses, and clauses also carry information; after coreference resolution the meaning of each clause is complete, so the clause can serve as a unit of processing. The invention directly calls the coreference resolution package of the Stanford natural language processing toolkit. The invention then prunes nouns or verbs from the sentence according to predefined patterns, that is, deletes the secondary constituents of the sentence so as to retain the nouns and verbs that are really needed. An example is the "preposition + noun" form: in "The teacher saw Xiao Ming in the park.", the core of the sentence is "The teacher saw Xiao Ming", while "in the park" is a secondary part, and the noun "park" must be deleted or it would interfere with entity extraction. The invention has summarized several patterns to be pruned: "preposition + adjective + noun", "preposition + verb", and so on.

The entity and relation extraction part extracts nouns or verbs from the reduced sentences as entities or relations. First the verb in the sentence is found, and the noun immediately before the verb and the noun immediately after it are combined with the verb into a candidate (noun, verb, noun) triple. Similarity analysis is then used to compute the similarity between the relations in the relation libraries of the system knowledge database and the relation knowledge database and the verb of the candidate triple; if the similarity is above a certain threshold, the verb is put into the relation library of the user knowledge database and the other noun is put into the entity library of the user knowledge database. The system also takes full account of the case in which a noun serves as the relation: after the first extraction method fails, the invention finds another noun in the sentence other than the core word and uses similarity analysis to compute the similarity between that noun and the entities in the entity libraries of the system knowledge database and the relation knowledge database; if it is above a certain threshold (generally set relatively high), the words (verbs or nouns) between the two nouns are then found and their similarity to the relation library is computed; if it is above a certain threshold, the corresponding noun or verb is put into the relation library and the previously extracted noun is put into the entity library.

The similarity analysis part is concerned with computing similarity: given two words, it computes the semantic similarity between them. Two techniques are mainly used: Word2vec and the synonym thesaurus Tongyici Cilin. word2vec is a tool, open-sourced by Google in 2013, that represents words as real-valued vectors; drawing on ideas from deep learning, it reduces, through training, the processing of text content to vector operations in a K-dimensional vector space, and similarity in the vector space can be used to represent semantic similarity of text. The invention makes full use of the text library in the system knowledge database to train the model, which is then stored in the system knowledge database; during similarity analysis the invention directly calls the model in the knowledge database. The other similarity tool is the extended edition of Tongyici Cilin compiled by the Research Center for Social Computing and Information Retrieval of Harbin Institute of Technology; the thesaurus is organized into 5 levels, and using this hierarchy the similarity of two words can be computed. The invention combines the two techniques, word2vec and Tongyici Cilin, by weighting to compute the similarity of two words.
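The association analysis described above (transactions, support, confidence) can be sketched as follows. This is an illustration under assumptions, not the patent's improved Apriori: confidence is computed with the standard rule definition, conf(a -> b) = |transactions with a and b| / |transactions with a|, which is one reading of the source wording; a pair is kept if either direction exceeds the threshold; and all function names and threshold values are hypothetical.

```python
from itertools import combinations

def transactions(sentences, domain_words):
    """Keep the sentences (token lists) that contain at least one domain word."""
    return [set(s) for s in sentences if domain_words & set(s)]

def support(term, txns):
    """Share of transactions that contain the term."""
    return sum(term in t for t in txns) / len(txns)

def confidence(a, b, txns):
    """Standard rule confidence for a -> b (an assumed reading of the source)."""
    with_a = [t for t in txns if a in t]
    return sum(b in t for t in with_a) / len(with_a) if with_a else 0.0

def word_pairs(sentences, domain_words, min_support, min_confidence):
    """Pairs of frequent terms whose confidence exceeds the threshold."""
    txns = transactions(sentences, domain_words)
    if not txns:
        return []
    terms = sorted(set().union(*txns))
    frequent = [t for t in terms if support(t, txns) > min_support]
    return [(a, b) for a, b in combinations(frequent, 2)
            if max(confidence(a, b, txns), confidence(b, a, txns)) > min_confidence]

# Toy sentences, already segmented; the last one has no domain word and is skipped.
sentences = [
    ["memory", "uses", "module"],
    ["memory", "module", "fast"],
    ["disk", "slow"],
    ["weather", "nice"],
]
pairs = word_pairs(sentences, {"memory", "disk"}, min_support=0.5, min_confidence=0.8)
print(pairs)
```

Only single terms and pairs are considered here, which matches the word-pair output the method needs and avoids generating the larger itemsets of full Apriori.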

After the (entity, relation, entity) triples in the document have been extracted, the knowledge graph generation module generates the knowledge graph of the document from the extracted triples.
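The source does not say how the graph is represented. One simple possibility, shown here as an assumption, is an adjacency map from each head entity to its (relation, tail) edges; the function name `build_graph` is hypothetical.

```python
from collections import defaultdict

def build_graph(triples):
    """Adjacency map: head entity -> list of (relation, tail entity) edges."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        if (relation, tail) not in graph[head]:  # skip duplicate edges
            graph[head].append((relation, tail))
        graph.setdefault(tail, [])  # make sure the tail also appears as a node
    return dict(graph)

triples = [
    ("Beijing", "capital of", "China"),
    ("Chinese", "love", "Beijing"),
    ("Beijing", "capital of", "China"),  # duplicate, ignored
]
graph = build_graph(triples)
print(graph)
```

Entities become nodes and relations become labelled, directed edges, so the same structure could be handed to a graph database or a drawing tool.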

Obviously, the above embodiments of the invention are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A method for automatically constructing a knowledge graph from Chinese text, characterized by comprising the following steps:

S3. Perform word segmentation on document j;

S4. Extract the core words of document j;

S6. Determine the domain to which document j belongs:

S7. Extract the (entity, relation, entity) triples in document j:

2. The method for automatically constructing a knowledge graph from Chinese text according to claim 1, characterized in that step S3 uses the jieba tool to segment document j.

3. The method for automatically constructing a knowledge graph from Chinese text according to claim 1, characterized in that the similarity analysis uses Word2vec or the synonym thesaurus Tongyici Cilin.

4. A system applying the method of any one of claims 1 to 3, characterized by comprising a per-domain knowledge database module, a document processing module, an entity and relation extraction module, and a knowledge graph generation module; the per-domain knowledge database module comprises the entity library, relation library and text library of each domain, the document processing module performs steps S3 to S62, the entity and relation extraction module performs steps S7 to S76, and the knowledge graph generation module performs step S8.