CN109815340A

CN109815340A - A method for constructing a knowledge graph of ethnic cultural information resources

Info

Publication number: CN109815340A
Application number: CN201910042744.2A
Authority: CN
Inventors: 甘健侯; 王俊; 周菊香; 文斌
Original assignee: Yunnan Normal University
Current assignee: Yunnan Normal University
Priority date: 2019-01-17
Filing date: 2019-01-17
Publication date: 2019-05-28

Abstract

The invention relates to a method for constructing a knowledge graph of ethnic cultural information resources and belongs to the technical field of knowledge graphs. First, a Chinese word segmentation system and a user-defined dictionary are used to perform word segmentation and part-of-speech tagging on the entry data in the collected dictionary of ethnic minorities. The segmented and tagged entry data are then checked: if the number of consecutive segments that are all single characters is not less than a set threshold, manual segmentation is performed and its result is added to the user-defined dictionary of the Chinese word segmentation system, until no new words remain. Attributes are then extracted from the correctly segmented entry data to construct a domain knowledge graph. The domain knowledge graph is checked for duplication, duplicate data are deleted, the stored domain knowledge graph is linked to the resources, and the result is finally stored.

Description

A method for constructing a knowledge graph of ethnic cultural information resources

Technical field

The present invention relates to a method for constructing a knowledge graph of ethnic cultural information resources and belongs to the technical field of knowledge graphs.

Background art

Ethnic culture is the spiritual wealth of a people. Preserving it not only allows later generations to inherit an outstanding culture but also lets that culture leave a vivid mark on history. Digital technology converts traditional ethnic culture into a digitally encoded form that can be stored, transmitted, copied, reproduced and even re-created, so that it is presented as a "living culture". This will become an inevitable trend in protecting ethnic cultural heritage and an innovative approach to digital development in ethnic mountainous regions.

Knowledge graph technology has attracted wide attention from scholars in recent years. A knowledge graph can express information on the Internet in a form closer to the human cognitive world and provides a better way to organize, manage and use information.

To promote the gradual development of ethnic cultural information resources, a knowledge graph of those resources needs to be constructed. However, the methods currently used abroad to construct English knowledge graphs are not fully applicable to Chinese knowledge graphs, and few knowledge graphs have been built for ethnic cultural information resources, so a construction method for such knowledge graphs is urgently needed.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for constructing a knowledge graph of ethnic cultural information resources that addresses the above problems.

The technical solution of the invention is a method for constructing a knowledge graph of ethnic cultural information resources. First, a Chinese word segmentation system and a user-defined dictionary are used to perform word segmentation and part-of-speech tagging on the entry data in the collected dictionary of ethnic minorities. The segmented and tagged entry data are then checked: if the number of consecutive segments that are single characters is not less than a set threshold, manual segmentation is performed and its result is added to the user-defined dictionary of the Chinese word segmentation system, until no new words remain. Attributes are then extracted from the correctly segmented entry data to construct a domain knowledge graph. The domain knowledge graph is checked for duplication, duplicate data are deleted, the stored domain knowledge graph is linked to the resources, and the result is finally stored.

Specific steps are as follows:

Step 1: Collect ethnic minority entry data and build an ethnic minority entry database; use a Chinese word segmentation system and a user-defined dictionary to perform word segmentation and part-of-speech tagging on the entry data in the database, and remove punctuation marks;

Step 2: Check the segmented and tagged data; if the number of consecutive segments that are single characters is not less than the set threshold, perform manual segmentation, add the manual segmentation result to the user-defined dictionary of the Chinese word segmentation system, and repeat Step 1 until no new words remain;

Step 3: Extract attributes from the correctly segmented data to construct a domain knowledge graph;

Step 4: Check the domain knowledge graph for duplication, delete duplicate data, and store the graph;

Step 5: Link the stored domain knowledge graph to the resources.

The word segmentation system in Step 1 and Step 2 is the NLPIR Chinese word segmentation system.
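
The punctuation removal at the end of Step 1 can be sketched as follows. This is a minimal illustration, not the NLPIR API: it assumes the tagger output has already been obtained as a string of space-separated `word/tag` items, and that punctuation tags begin with `w` (as in ICTCLAS-style tag sets). The function names `parse_tagged` and `drop_punctuation` and the sample string are hypothetical.

```python
from typing import List, Tuple

def parse_tagged(tagged: str) -> List[Tuple[str, str]]:
    """Split 'word/tag word/tag ...' into (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # rpartition keeps a '/' inside the word itself intact.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

def drop_punctuation(pairs: List[Tuple[str, str]],
                     punct_prefix: str = "w") -> List[Tuple[str, str]]:
    """Remove items whose tag marks punctuation (assumed prefix 'w')."""
    return [(w, t) for w, t in pairs if not t.startswith(punct_prefix)]

# Hypothetical tagged output: brackets, a comma and a full stop are punctuation.
tagged = "[/wkz A/n B/v ]/wky C/n ,/wd D/v 。/wj"
tokens = drop_punctuation(parse_tagged(tagged))
```

Keeping the tag next to each word means the later attribute extraction can still use the part of speech after the punctuation has been dropped.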

In Step 2, the segmented and tagged text data are checked as follows:

1. Define the segmentation result set S (S₁, S₂, ……, S_m);

2. Count the characters in each segmentation result S_i in S to obtain the character-count set C (C₁, C₂, ……, C_m), where C_i = len(S_i) and 1 ≤ i ≤ m;

3. Set a threshold k satisfying 2 ≤ k ≤ m;

4. Select a subset P from S such that P satisfies formula (1) and formula (2):

j - i + 1 ≤ k < m    (2)

This means that positions S_i to S_j of S hold k consecutive segments whose character count is 1. For the chosen value of k, such a run of consecutive single-character segments is regarded as one new word x, x = {S_i, S_i+1, …, S_i+k}, S_i ∈ S;

5. Define the new-word set W = (x₁, x₂, …, x_n) and review W manually; each x confirmed as a new word is added to the user-defined dictionary.
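
The detection above can be sketched as a scan over the character-count set C. Formula (1) is not reproduced in the text, so the sketch takes the reading given in the abstract, "the number of consecutive single-character segments is not less than the threshold", as an assumption and returns every maximal run of single-character segments whose length is at least k. The function name `single_char_runs` is hypothetical; the counts are those listed for Embodiment 1.

```python
from typing import List, Tuple

def single_char_runs(counts: List[int], k: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (0-based, inclusive) of maximal runs of
    consecutive segments with character count 1 and run length >= k."""
    runs = []
    start = None
    for idx, c in enumerate(counts + [0]):  # trailing 0 closes a final run
        if c == 1:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= k:
                runs.append((start, idx - 1))
            start = None
    return runs

# Character counts C from Embodiment 1 (m = 22).
C = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2]
runs = single_char_runs(C, k=3)
```

Under this reading the scan reports one candidate span, the first eight segments, and it is the manual review that decides how that span divides into new words.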

The threshold k is set from large to small: k = m at the first setting, after which k is decremented step by step until k = 1. Step 2 is repeated after each threshold setting until all new words have been added to the user-defined dictionary.

The purpose of the manual segmentation in Step 2 is to judge whether a combination of k consecutive single characters is a new word.
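
The interplay of the decreasing threshold, the manual review and the re-segmentation can be sketched as a loop. Everything here is a stand-in: `segment` is a toy longest-match segmenter rather than NLPIR, `toy_review` imitates a human reviewer with a fixed list of accepted words, and the loop stops at k = 2 following the constraint 2 ≤ k ≤ m, which is an assumption about how the lower bound should be read.

```python
from typing import Callable, Iterable, List, Set

def segment(text: str, user_dict: Set[str]) -> List[str]:
    """Toy segmenter: longest user-dictionary match, else one character."""
    tokens, i = [], 0
    while i < len(text):
        match = max((w for w in user_dict if text.startswith(w, i)),
                    key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens

def runs_of_singles(tokens: List[str], k: int) -> List[str]:
    """Join each maximal run of >= k single-character tokens into a string."""
    out, run = [], []
    for tok in tokens + [""]:  # empty sentinel flushes the last run
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= k:
                out.append("".join(run))
            run = []
    return out

def build_user_dict(text: str,
                    review: Callable[[List[str]], Set[str]],
                    user_dict: Iterable[str] = ()) -> Set[str]:
    """Lower k from m to 2; after each review, re-segment with the new words."""
    words = set(user_dict)
    m = len(segment(text, words))
    for k in range(m, 1, -1):  # k = m, m-1, ..., 2
        while True:
            candidates = runs_of_singles(segment(text, words), k)
            new_words = review(candidates) - words
            if not new_words:
                break
            words |= new_words
    return words

KNOWN = {"abc", "de"}  # hypothetical words a reviewer would accept

def toy_review(candidates: List[str]) -> Set[str]:
    return {w for w in KNOWN if any(w in c for c in candidates)}

result = build_user_dict("abcxde", toy_review)
```

With the toy input, the first pass at k = 6 surfaces the whole string as one run, the reviewer accepts two words from it, and re-segmentation then leaves only an isolated single character, so no lower threshold adds anything.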

In Step 3, attribute extraction classifies the content item by item according to the segmentation results and part-of-speech tags; attributes are extracted from all the content and given attribute names, forming "subject - attribute name - attribute value" triples, which constitute the knowledge graph.
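
One possible in-memory shape for the "subject - attribute name - attribute value" triples is sketched below. The text does not specify a data structure, so the `Triple` type, the `index_by_attribute` helper and the placeholder values are all hypothetical; grouping values by (subject, attribute name) is chosen because the later duplicate detection works on the values of one attribute of one entity.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

class Triple(NamedTuple):
    subject: str
    attribute: str
    value: str

def index_by_attribute(triples: List[Triple]) -> Dict[Tuple[str, str], List[str]]:
    """Group attribute values by (subject, attribute name), keeping order."""
    index = defaultdict(list)
    for t in triples:
        index[(t.subject, t.attribute)].append(t.value)
    return dict(index)

# Placeholder triples, not taken from the dictionary data.
triples = [
    Triple("River A", "location", "County X"),
    Triple("River A", "location", "County Y"),
    Triple("River A", "adjacent river", "River B"),
]
index = index_by_attribute(triples)
```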

The duplicate detection in Step 4 distinguishes the following types:

Type 1: the same attribute of the same entity has multiple attribute values; if one attribute value contains other attribute values, the contained attribute values are eliminated;

Type 2: the same attribute of the same entity has multiple attribute values; if the attribute values are mutually exclusive, the decision is based on how many times each attribute value occurs: the attribute value that occurs more often is retained and submitted for manual review;

Type 3: the same attribute of the same entity has multiple attribute values; if the attribute values are mutually exclusive and occur equally often, all of them are submitted for manual review.
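
The three types can be sketched as one function applied to the values of a single attribute of a single entity. Two points are assumptions made for illustration: "contains" is read as substring containment, and any two values that survive the containment check are treated as mutually exclusive. The function name `resolve_values` is hypothetical; the sample values follow the location example used in Embodiment 1.

```python
from collections import Counter
from typing import List, Tuple

def resolve_values(values: List[str]) -> Tuple[List[str], bool]:
    """Return (values to keep, whether manual review is needed)."""
    distinct = set(values)
    # Type 1: drop every value that is strictly contained in another value.
    remaining = [v for v in values
                 if not any(v != other and v in other for other in distinct)]
    counts = Counter(remaining)
    if len(counts) <= 1:
        return list(counts), False
    ranked = counts.most_common()
    if ranked[0][1] > ranked[1][1]:
        # Type 2: one value occurs most often; keep it, but ask for review.
        return [ranked[0][0]], True
    # Type 3: the leading values tie; keep all and send them to review.
    return [v for v, _ in ranked], True

full = "Lincang City Zhenkang County, adjacent to Nujiang in the north"
zhenkang = "Lincang City Zhenkang County"
cangyuan = "Lincang City Cangyuan County"

type1 = resolve_values([full, zhenkang])
type2 = resolve_values([zhenkang, zhenkang, cangyuan])
type3 = resolve_values([zhenkang, cangyuan])
```

Returning a review flag alongside the kept values keeps the automatic decision and the manual audit separate, which matches the way Types 2 and 3 both end in manual review.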

In Step 4, the domain knowledge graph is stored in a relational database that simulates the way a graph database (for example, Neo4j) stores a knowledge graph.

The relational database structure is designed as follows:

Node table (number, node name, node label)

Entity table (number, owning node number, entity name)

Attribute name table (number, owning node number, owning entity number, attribute name)

Attribute value table (number, owning node number, owning entity number, owning attribute name number, attribute value)

Relation table (number, start node number, target node number, relation)
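
The five tables could be declared as follows. This is a sketch using SQLite from the Python standard library; the text does not name a database product, and the English table and column names, the text-typed keys and the foreign-key references are all assumptions layered on the column lists above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE node (
    id    TEXT PRIMARY KEY,
    name  TEXT NOT NULL,
    label TEXT
);
CREATE TABLE entity (
    id      TEXT PRIMARY KEY,
    node_id TEXT NOT NULL REFERENCES node(id),
    name    TEXT NOT NULL
);
CREATE TABLE attribute_name (
    id        TEXT PRIMARY KEY,
    node_id   TEXT NOT NULL REFERENCES node(id),
    entity_id TEXT NOT NULL REFERENCES entity(id),
    name      TEXT NOT NULL
);
CREATE TABLE attribute_value (
    id           TEXT PRIMARY KEY,
    node_id      TEXT NOT NULL REFERENCES node(id),
    entity_id    TEXT NOT NULL REFERENCES entity(id),
    attribute_id TEXT NOT NULL REFERENCES attribute_name(id),
    value        TEXT NOT NULL
);
CREATE TABLE relation (
    id            TEXT PRIMARY KEY,
    start_node_id TEXT NOT NULL REFERENCES node(id),
    end_node_id   TEXT NOT NULL REFERENCES node(id),
    relation      TEXT NOT NULL
);
"""

def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the five tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_store()
tables = {row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
```

The node and relation tables play the role of a graph database's labelled nodes and edges, while the entity, attribute name and attribute value tables hold what a graph database would store as node properties.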

The beneficial effect of the invention is that, by constructing a knowledge graph of ethnic cultural information resources, ethnic cultural information on the Internet is expressed in a form closer to the human cognitive world, which makes that information easier to manage and use.

Specific embodiment

The invention is further described below with reference to specific embodiments.

A method for constructing a knowledge graph of ethnic cultural information resources comprises Steps 1 to 5 as set out in the Summary of the invention, including the detection procedure of Step 2 and, in Step 4, the use of a relational database to simulate the way a graph database stores a knowledge graph.

Embodiment 1: entry content: [Mengpeng River] A right-bank tributary of the Nanpeng River. Located in Zhenkang County, Lincang City, adjacent to Nujiang in the north and bordering southern Myanmar in the west.

1. Segmentation: the Chinese word segmentation system segments the entry as: "[ / Meng / peng / River / ] / Nan / peng / River / right / bank / tributary / . / located in / Lincang City / Zhenkang County / , / north / with / Nujiang / adjacent / , / west / , / west / with / south / Myanmar / border / ." After part-of-speech tagging: "[ [punctuation] / Meng [noun] / peng [verb] / River [noun] / ] [punctuation] / Nan [locative noun] / peng [verb] / River [noun] / right [locative noun] / bank [noun] / tributary [noun] / . [punctuation] / located in [verb] / Lincang City [noun] / Zhenkang County [noun] / , [punctuation] / north [locative noun] / with [preposition] / Nujiang [noun] / adjacent [verb] / , [punctuation] / west [locative noun] / , [punctuation] / west [locative noun] / with [preposition] / south [locative noun] / Myanmar [noun] / border [verb] / . [punctuation]". After the punctuation marks are removed: "Meng [noun] / peng [verb] / River [noun] / Nan [locative noun] / peng [verb] / River [noun] / right [locative noun] / bank [noun] / tributary [noun] / located in [verb] / Lincang City [noun] / Zhenkang County [noun] / north [locative noun] / with [preposition] / Nujiang [noun] / adjacent [verb] / west [locative noun] / west [locative noun] / with [preposition] / south [locative noun] / Myanmar [noun] / border [verb]".

2. Detection: define the segmentation result set S (Meng, peng, River, Nan, peng, River, right, bank, tributary, located in, Lincang City, Zhenkang County, north, with, Nujiang, adjacent, west, west, with, south, Myanmar, border). Count the characters in each segmentation result in S to obtain the character-count set C (1,1,1,1,1,1,1,1,2,2,3,3,2,1,2,2,2,2,1,2,2,2). Hence m = 22. k is first set to 22 and the operation of Step 2 is carried out. When k has been reduced to 3, formula (1) and 3 - 1 + 1 ≤ 3 < 22 are satisfied simultaneously, and the consecutive segments of character count 1 are regarded as one new word x = {Meng, peng, River}; that is, "Meng / peng / River" is found to be a set of 3 consecutive single characters. All the words found are defined as the set W = (Mengpeng River, Nanpeng River); manual review confirms that they are proper nouns, and they are added to the user-defined dictionary.

3. The result after re-segmentation is: "Mengpeng River / Nanpeng River / right / bank / tributary / located in / Lincang City / Zhenkang County / north / with / Nujiang / adjacent / west / west / with / south / Myanmar / border", for which no set of 2 consecutive single characters yields a new word.

4. Attribute labeling is carried out on the segmentation result. For example, after labeling, the segmentation "Mengpeng River / Nanpeng River / right / bank / tributary / located in / Lincang City / Zhenkang County / north / with / Nujiang / adjacent / west / west / with / south / Myanmar / border" yields a series of triples:

1. Mengpeng River --- parent river --- right-bank tributary of the Nanpeng River;

2. Mengpeng River --- location --- Lincang City Zhenkang County, adjacent to Nujiang in the north, bordering southern Myanmar in the west;

3. Mengpeng River --- adjacent river --- adjacent to Nujiang in the north;

4. Mengpeng River --- bordering territory --- bordering southern Myanmar in the west;

5. Duplicate-attribute detection is carried out.

Type 1: the same attribute of the same entity has multiple attribute values; if one attribute value contains other attribute values, the contained attribute values are eliminated. For example:

1. Mengpeng River --- location --- Lincang City Zhenkang County, adjacent to Nujiang in the north, bordering southern Myanmar in the west;

2. Mengpeng River --- location --- Lincang City Zhenkang County;

Then 2 is eliminated and 1 is retained;

Type 2: the same attribute of the same entity has multiple attribute values; if the attribute values are mutually exclusive, the decision is based on how many times each attribute value occurs: the attribute value that occurs more often is retained and submitted for manual review. For example:

1. Mengpeng River --- location --- Lincang City Zhenkang County;

2. Mengpeng River --- location --- Lincang City Zhenkang County;

3. Mengpeng River --- location --- Lincang City Cangyuan County;

Then 3 is eliminated, 1 and 2 are retained and submitted for manual review, and other data are used for supplementary verification;

Type 3: the same attribute of the same entity has multiple attribute values; if the attribute values are mutually exclusive and occur equally often, all of them are submitted for manual review. For example:

1. Mengpeng River --- location --- Lincang City Zhenkang County;

2. Mengpeng River --- location --- Lincang City Cangyuan County;

Then both are submitted for manual review, and other data are used for supplementary verification.

6. The knowledge graph is linked to the resources. Because every knowledge graph is extracted from a resource when it is constructed, and each resource has its own unique resource address, a hyperlink to the resource is added to the attributes of each knowledge graph so that the attributes can be verified and the resource consulted.
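
The linking step can be sketched by carrying the resource address alongside each triple and rendering the attribute as a hyperlink. The text does not say how the link is represented, so the `LinkedTriple` type, the two helper functions, the HTML rendering and the `example.org` address are all hypothetical.

```python
from html import escape
from typing import List, NamedTuple, Tuple

class LinkedTriple(NamedTuple):
    subject: str
    attribute: str
    value: str
    resource_url: str

def link_triples(triples: List[Tuple[str, str, str]],
                 resource_url: str) -> List[LinkedTriple]:
    """Attach the address of the source resource to every triple from it."""
    return [LinkedTriple(s, a, v, resource_url) for s, a, v in triples]

def attribute_hyperlink(t: LinkedTriple) -> str:
    """Render the attribute name as an HTML link back to the resource."""
    return '<a href="{}">{}</a>'.format(escape(t.resource_url, quote=True),
                                        escape(t.attribute))

linked = link_triples(
    [("Mengpeng River", "location", "Lincang City Zhenkang County")],
    "https://example.org/entries/0001",  # hypothetical resource address
)
link = attribute_hyperlink(linked[0])
```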

7. The knowledge graph is stored in a relational database. For example:

Node table: N001, river, river; N002, land, land

Entity table: E001, N001, Mengpeng River

Attribute name table: P001, N001, E001, location

Attribute value table: V001, N001, E001, P001, Lincang City Zhenkang County

Relation table: R001, N001, N002, irrigates
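
To show how such rows stand in for a graph, the sketch below loads the example rows of Embodiment 1 into SQLite and reads them back as triples and as edges. The table and column names and the English renderings of the row values are assumptions; the two queries are only one way to reassemble the graph from the five tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (id TEXT PRIMARY KEY, name TEXT, label TEXT);
CREATE TABLE entity (id TEXT PRIMARY KEY, node_id TEXT, name TEXT);
CREATE TABLE attribute_name (id TEXT PRIMARY KEY, node_id TEXT,
                             entity_id TEXT, name TEXT);
CREATE TABLE attribute_value (id TEXT PRIMARY KEY, node_id TEXT,
                              entity_id TEXT, attribute_id TEXT, value TEXT);
CREATE TABLE relation (id TEXT PRIMARY KEY, start_node_id TEXT,
                       end_node_id TEXT, relation TEXT);
""")

# Example rows of Embodiment 1.
conn.executemany("INSERT INTO node VALUES (?, ?, ?)",
                 [("N001", "river", "river"), ("N002", "land", "land")])
conn.execute("INSERT INTO entity VALUES (?, ?, ?)",
             ("E001", "N001", "Mengpeng River"))
conn.execute("INSERT INTO attribute_name VALUES (?, ?, ?, ?)",
             ("P001", "N001", "E001", "location"))
conn.execute("INSERT INTO attribute_value VALUES (?, ?, ?, ?, ?)",
             ("V001", "N001", "E001", "P001", "Lincang City Zhenkang County"))
conn.execute("INSERT INTO relation VALUES (?, ?, ?, ?)",
             ("R001", "N001", "N002", "irrigates"))

# Reassemble "subject - attribute name - attribute value" triples.
triples = conn.execute("""
    SELECT e.name, a.name, v.value
    FROM attribute_value AS v
    JOIN attribute_name AS a ON a.id = v.attribute_id
    JOIN entity AS e ON e.id = v.entity_id
    ORDER BY v.id
""").fetchall()

# Read the relation table as labelled edges between nodes.
edges = conn.execute("""
    SELECT s.name, r.relation, t.name
    FROM relation AS r
    JOIN node AS s ON s.id = r.start_node_id
    JOIN node AS t ON t.id = r.end_node_id
    ORDER BY r.id
""").fetchall()
```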

Embodiment 2: entry content: [Kaxie] A Lahu name, equivalent to "reed long" among Wa names; the name of an official post in the Wa area during the Three Buddhist Patriarchs period.

1. Segmentation: the Chinese word segmentation system segments the entry as: "[ / Ka / xie / ] / La / hu / name / , / that is / Wa / name / among / of / " / reed / long / " / , / is / three / Buddhist Patriarch / period / Wa people / area / of / official post / name / ." After part-of-speech tagging: "[ [punctuation] / Ka [noun] / xie [quantifier] / ] [punctuation] / La [verb] / hu [nominal morpheme] / name [nominal morpheme] / , [punctuation] / that is [verb] / Wa [distinguishing word] / name [quantifier] / among [locative noun] / of [auxiliary word] / " [punctuation] / reed [personal name] / long [nominal morpheme] / " [punctuation] / , [punctuation] / is [preposition] / three [numeral] / Buddhist Patriarch [noun] / period [noun] / Wa people [other proper noun] / area [noun] / of [auxiliary word] / official post [noun] / name [quantifier] / . [punctuation]".

2. Detection: define the segmentation result set S (Ka, xie, La, hu, name, that is, Wa, name, among, of, reed, long, is, three, Buddhist Patriarch, period, Wa people, area, of, official post, name). Count the characters in each segmentation result in S to obtain the character-count set C (1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,1,2,1). Hence m = 21. k is first set to 21 and the operation of Step 2 is carried out. When k has been reduced to 2, formula (1) and 2 - 1 + 1 ≤ 2 < 21 are satisfied simultaneously, and the consecutive segments of character count 1 are regarded as one new word x = {Ka, xie}; that is, "Ka / xie" is found to be a set of 2 consecutive single characters. All the words found are defined as the set W = (Kaxie, Lahu, reed long); manual review confirms that they are proper nouns, and they are added to the user-defined dictionary.

3. The result after re-segmentation is: "Kaxie / Lahu / name / that is / Wa / name / among / of / reed long / is / three / Buddhist Patriarch / period / Wa people / area / of / official post / name", for which no set of 2 consecutive single characters yields a new word.

4. Attribute labeling is carried out on the segmentation result. For example, after labeling, the segmentation "Kaxie / Lahu / name / that is / Wa / name / among / of / reed long / is / three / Buddhist Patriarch / period / Wa people / area / of / official post / name" yields a series of triples:

1. Kaxie --- source ethnic group --- Lahu name;

2. Kaxie --- Wa --- reed long;

3. Kaxie --- period --- Three Buddhist Patriarchs period;

4. Kaxie --- area --- Wa area;

5. Kaxie --- explanation --- name of an official post in the Wa area during the Three Buddhist Patriarchs period;

5. Duplicate-attribute detection is carried out.

Type 1, for example:

1. Kaxie --- explanation --- name of an official post in the Wa area during the Three Buddhist Patriarchs period;

2. Kaxie --- explanation --- name of an official post;

Then 2 is eliminated and 1 is retained;

Type 2, for example:

2. Kaxie --- explanation --- name of an official post in the Wa area;

3. Kaxie --- explanation --- name of an official post in the Cangyuan area;

Type 3, for example:

1. Kaxie --- area --- Wa area;

2. Kaxie --- area --- Cangyuan area;

7. The knowledge graph is stored in a relational database. For example:

Node table: N001, official post, river; N002, place

Entity table: E001, N001, Kaxie

Attribute name table: P001, N001, E001, period

Attribute value table: V001, N001, E001, P001, Three Buddhist Patriarchs period

Relation table: R001, N001, N002, belongs to

The embodiments of the invention have been described in detail above, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the purpose of the invention.

Claims

1. A method for constructing a knowledge graph of ethnic cultural information resources, characterized by comprising:

Step 1: Collect ethnic minority entry data and build an ethnic minority entry database; use a Chinese word segmentation system and a user-defined dictionary to perform word segmentation and part-of-speech tagging on the entry data in the database, and remove punctuation marks;

Step 2: Check the segmented and tagged data; if the number of consecutive segments that are single characters is not less than the set threshold, perform manual segmentation, add the manual segmentation result to the user-defined dictionary of the Chinese word segmentation system, and repeat Step 1 until no new words remain;

Step 3: Extract attributes from the correctly segmented data to construct a domain knowledge graph;

Step 4: Check the domain knowledge graph for duplication, delete duplicate data, and store the graph;

Step 5: Link the stored domain knowledge graph to the resources.

2. The method for constructing a knowledge graph of ethnic cultural information resources according to claim 1, characterized in that: the word segmentation system in Step 1 and Step 2 is the NLPIR Chinese word segmentation system.

3. The method for constructing a knowledge graph of ethnic cultural information resources according to claim 1, characterized in that: in Step 2, the segmented and tagged text data are checked as follows:

1. Define the segmentation result set S (S₁, S₂, ……, S_m);

2. Count the characters in each segmentation result S_i in S to obtain the character-count set C (C₁, C₂, ……, C_m), where C_i = len(S_i) and 1 ≤ i ≤ m;

3. Set a threshold k satisfying 2 ≤ k ≤ m;

4. Select a subset P from S such that P satisfies formula (1) and formula (2):

j - i + 1 ≤ k < m    (2)

This means that positions S_i to S_j of S hold k consecutive segments whose character count is 1. For the chosen value of k, such a run of consecutive single-character segments is regarded as one new word x, x = {S_i, S_i+1, …, S_i+k}, S_i ∈ S;

5. Define the new-word set W = (x₁, x₂, …, x_n) and review W manually; each x confirmed as a new word is added to the user-defined dictionary.

4. The method for constructing a knowledge graph of ethnic cultural information resources according to claim 3, characterized in that: the threshold k is set from large to small, with k = m at the first setting, after which k is decremented step by step until k = 1; Step 2 is repeated after each threshold setting until all new words have been added to the user-defined dictionary.

5. The method for constructing a knowledge graph of ethnic cultural information resources according to claim 1, characterized in that: in Step 3, attribute extraction classifies the content item by item according to the segmentation results and part-of-speech tags; attributes are extracted from all the content and given attribute names, forming "subject - attribute name - attribute value" triples, which constitute the knowledge graph.

6. The method for constructing a knowledge graph of ethnic cultural information resources according to claim 1, characterized in that: the duplicate detection in Step 4 distinguishes the following types:

Type 1: the same attribute of the same entity has multiple attribute values; if one attribute value contains other attribute values, the contained attribute values are eliminated;

Type 2: the same attribute of the same entity has multiple attribute values; if the attribute values are mutually exclusive, the decision is based on how many times each attribute value occurs: the attribute value that occurs more often is retained and submitted for manual review;

Type 3: the same attribute of the same entity has multiple attribute values; if the attribute values are mutually exclusive and occur equally often, all of them are submitted for manual review.

7. The method for constructing a knowledge graph of ethnic cultural information resources according to claim 1, characterized in that: in Step 4, the domain knowledge graph is stored in a relational database that simulates the way a graph database stores a knowledge graph.