A method for constructing a knowledge graph of ethnic culture information resources
Technical field
The present invention relates to a method for constructing a knowledge graph of ethnic culture information resources, and belongs to the technical field of knowledge graphs.
Background art
Ethnic culture is the spiritual wealth of a people. Preserving it not only allows later generations to inherit an outstanding culture, but also secures that culture a vivid place in history. Digital technology converts traditional ethnic culture into digitally encoded form, which can then be stored, transmitted, copied, reproduced, and even used for new creation, so that it is presented as a "living culture". This is becoming an inevitable trend in protecting ethnic cultural heritage and an innovative approach to digital development in ethnic mountainous regions.
Knowledge graph technology has attracted wide attention from scholars in recent years. A knowledge graph can express information on the Internet in a form closer to the way humans understand the world, and provides a better way to organize, manage, and use information.
To advance the development of ethnic culture information resources, a knowledge graph of these resources needs to be constructed. However, existing methods developed abroad for constructing English knowledge graphs are not fully applicable to Chinese knowledge graphs, and few knowledge graphs have been built for ethnic culture information resources, so a construction method for such a knowledge graph is urgently needed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for constructing a knowledge graph of ethnic culture information resources, so as to solve the above problems.
The technical solution of the present invention is a method for constructing a knowledge graph of ethnic culture information resources. First, a Chinese word segmentation system and a user-defined dictionary are used to perform word segmentation and part-of-speech tagging on the entry data of the collected ethnic encyclopedic dictionary data. The segmented and tagged entry data are then checked: if the number of consecutive single-character segments is not less than a set threshold, manual segmentation is carried out and its result is added to the user-defined dictionary of the Chinese word segmentation system, and this is repeated until no new words appear. Attribute extraction is then performed on the correctly segmented entry data to construct a domain knowledge graph. The domain knowledge graph is checked for duplicates, duplicate data are deleted, and the graph is stored. Finally, the stored domain knowledge graph is linked to the resources.
The specific steps are as follows:
Step 1: Collect ethnic entry data and build an ethnic entry database; use a Chinese word segmentation system and a user-defined dictionary to perform word segmentation and part-of-speech tagging on the entry data in the database, and remove punctuation marks.
Step 2: Check the segmented and tagged data; if the number of consecutive single-character segments is not less than the set threshold, carry out manual segmentation, add the manual segmentation result to the user-defined dictionary of the Chinese word segmentation system, and repeat Step 1 until no new words appear.
Step 3: Perform attribute extraction on the correctly segmented data to construct the domain knowledge graph.
Step 4: Check the domain knowledge graph for duplicates, delete duplicate data, and store the graph.
Step 5: Link the stored domain knowledge graph to the resources.
The word segmentation system used in Step 1 and Step 2 is the NLPIR Chinese word segmentation system.
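The loop formed by Step 1 and Step 2 (segment, review, extend the dictionary, segment again) can be sketched as follows. This is only an illustration under stated assumptions: a simple forward-maximum-matching segmenter stands in for the NLPIR system, the names `segment`, `segment_until_stable`, and `reviewer` are hypothetical, and Latin letters stand in for single Chinese characters.

```python
from typing import Callable, List, Set, Tuple


def segment(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching (a stand-in for the real segmenter):
    take the longest dictionary word at each position, or a single
    character when nothing in the dictionary matches."""
    segments: List[str] = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in dictionary:
                segments.append(piece)
                i += n
                break
    return segments


def segment_until_stable(
    text: str,
    dictionary: Set[str],
    review: Callable[[List[str]], Set[str]],
) -> Tuple[List[str], Set[str]]:
    """Repeat segmentation until the review step yields no new words."""
    dictionary = set(dictionary)
    while True:
        segments = segment(text, dictionary)
        new_words = review(segments) - dictionary
        if not new_words:
            return segments, dictionary
        dictionary |= new_words  # extend the user-defined dictionary


# Hypothetical data: the reviewer stands in for the manual segmentation step.
reviewer = lambda segs: {"abc", "def"}
segments, dictionary = segment_until_stable("abcdefgh", {"gh"}, reviewer)
print(segments)
```

The first pass yields mostly single characters; once the reviewer's words are in the dictionary, the second pass segments the text into whole words and the loop stops.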
In Step 2, the segmented and tagged text data are checked as follows:
1. Define the set of segmentation results $S = (S_1, S_2, \ldots, S_m)$.
2. Count the characters of each segmentation result $S_i$ in $S$ to obtain the character-count set $C = (C_1, C_2, \ldots, C_m)$, where $C_i = \mathrm{len}(S_i)$ and $1 \le i \le m$.
3. Set a threshold $k$ satisfying $2 \le k \le m$.
4. Select a subset $P = (S_i, \ldots, S_j)$ from $S$ such that $P$ satisfies formula (1) and formula (2):
$C_i = C_{i+1} = \cdots = C_j = 1$ (1)
$j - i + 1 \le k < m$ (2)
That is, positions $S_i$ to $S_j$ of $S$ hold consecutive segments whose character count is 1. For the chosen value of $k$, such a run of consecutive single-character segments is taken to be a new word $x = \{S_i, S_{i+1}, \ldots, S_j\}$, $S_i \in S$.
5. Define the new-word set $W = (x_1, x_2, \ldots, x_n)$ and review $W$ manually; each $x$ confirmed to be a new word is added to the user-defined dictionary.
The threshold k is set from large to small: k = m for the first setting, and k is then decreased step by step until k = 1. Step 2 is repeated after each threshold setting until all new words have been added to the user-defined dictionary.
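The detection procedure and the decreasing threshold can be illustrated with a short sketch. It assumes one possible reading of the procedure: for each k, every window of k consecutive single-character segments is proposed as a candidate, and a review callback (standing in for the manual review) decides which candidates are new words. The loop stops at k = 2, the lower bound given for k. The function names and the sample data are hypothetical; letters stand in for single Chinese characters.

```python
from typing import Callable, List


def candidate_windows(segments: List[str], k: int) -> List[str]:
    """Join every window of k consecutive segments that are all one character long."""
    lengths = [len(s) for s in segments]  # C_i = len(S_i)
    found: List[str] = []
    for i in range(len(segments) - k + 1):
        if all(c == 1 for c in lengths[i:i + k]):
            found.append("".join(segments[i:i + k]))
    return found


def detect_new_words(
    segments: List[str], is_new_word: Callable[[str], bool]
) -> List[str]:
    """Lower k from m down to 2 and keep the candidates the reviewer accepts."""
    accepted: List[str] = []
    for k in range(len(segments), 1, -1):
        for candidate in candidate_windows(segments, k):
            if candidate not in accepted and is_new_word(candidate):
                accepted.append(candidate)
    return accepted


# Hypothetical data: six single-character segments followed by longer ones.
segments = ["a", "b", "c", "d", "e", "f", "gh", "ijk", "l", "mn"]
known_names = {"abc", "def"}  # stands in for the human reviewer's judgement
new_words = detect_new_words(segments, lambda w: w in known_names)
print(new_words)
```

Starting from a large k means longer candidate words are offered to the reviewer before their shorter sub-sequences.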
The purpose of the manual segmentation in Step 2 is to judge whether a combination of k consecutive single characters is a new word.
In Step 3, attribute extraction classifies the content item by item according to the segmentation results and part-of-speech tags. Attributes are extracted from all of the content and labelled with attribute names, forming triples of the form "subject - attribute name - attribute value", which make up the knowledge graph.
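As a minimal sketch of the resulting data structure, the triples can be held as simple records and grouped by subject and attribute name. The `Triple` type, the `build_graph` helper, and the sample values are hypothetical and only illustrate the "subject - attribute name - attribute value" form; they are not part of the described method.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    subject: str
    attribute: str
    value: str


def build_graph(triples: List[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group attribute values by subject and attribute name."""
    graph: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for t in triples:
        graph[t.subject][t.attribute].append(t.value)
    # Convert the nested defaultdicts to plain dicts for predictable lookups.
    return {subject: dict(attrs) for subject, attrs in graph.items()}


# Hypothetical sample triples.
triples = [
    Triple("River X", "address", "County A"),
    Triple("River X", "address", "County A, north of River Y"),
    Triple("River X", "parent river", "River Z"),
]
graph = build_graph(triples)
print(graph["River X"]["address"])
```

Grouping the values per entity and attribute is also the natural starting point for the duplicate detection of Step 4, which compares the values of one attribute of one entity.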
The duplicate detection in Step 4 distinguishes the following types:
Type 1: the same attribute of the same entity has several attribute values; if one attribute value contains other attribute values, the contained attribute values are eliminated.
Type 2: the same attribute of the same entity has several attribute values; if the attribute values are mutually exclusive, they are judged by the number of records holding each value: the value held by more records is retained and is submitted for manual review.
Type 3: the same attribute of the same entity has several attribute values; if the attribute values are mutually exclusive and the number of records holding each value is the same, all of them are submitted for manual review.
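The three types can be sketched for the values of one attribute of one entity. The sketch rests on assumptions that the text does not spell out: containment is tested as plain substring containment, values that do not contain one another are treated as mutually exclusive, and the "number of records holding a value" is the size of each containment group. The function name `resolve_values` and the sample strings are hypothetical.

```python
from typing import List, Tuple


def resolve_values(values: List[str]) -> Tuple[List[str], bool]:
    """Return (retained values, needs_manual_review) for one entity attribute."""
    if not values:
        return [], False
    # Group values by containment; the longest value leads each group.
    groups: List[List[str]] = []
    for v in sorted(values, key=len, reverse=True):
        for group in groups:
            if v in group[0]:  # v is contained in the group's leading value
                group.append(v)
                break
        else:
            groups.append([v])
    if len(groups) == 1:
        # Type 1: drop the contained values, keep the containing one.
        return [groups[0][0]], False
    sizes = [len(g) for g in groups]
    best = max(sizes)
    if sizes.count(best) == 1:
        # Type 2: keep the better-supported group, still flag it for review.
        return list(groups[sizes.index(best)]), True
    # Type 3: equal support, so everything goes to manual review.
    return [g[0] for g in groups], True


# Hypothetical sample values.
print(resolve_values(["County A, north of River Y", "County A"]))
print(resolve_values(["County A, north of River Y", "County A", "County B"]))
print(resolve_values(["County A", "County B"]))
```

In the Type 2 case the whole winning group is returned, mirroring the embodiments, where both the containing and the contained value are retained before manual review.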
In Step 4, the domain knowledge graph is stored in a relational database that emulates the way a graph database (for example, Neo4j) stores a knowledge graph.
The relational database structure is designed as follows:
Node table (number, node name, node label)
Entity table (number, owning node number, entity name)
Attribute name table (number, owning node number, owning entity number, attribute name)
Attribute value table (number, owning node number, owning entity number, owning attribute name number, attribute value)
Relation table (number, start node number, target node number, relationship)
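The five tables can be tried out with SQLite from the Python standard library. This is a sketch under assumptions: the English table and column names, the text-typed keys, and the sample rows are illustrative choices, and any relational database could be used instead.

```python
import sqlite3

SCHEMA = """
CREATE TABLE node (number TEXT PRIMARY KEY, node_name TEXT, node_label TEXT);
CREATE TABLE entity (number TEXT PRIMARY KEY,
    node_number TEXT REFERENCES node(number), entity_name TEXT);
CREATE TABLE attribute_name (number TEXT PRIMARY KEY,
    node_number TEXT REFERENCES node(number),
    entity_number TEXT REFERENCES entity(number), attribute_name TEXT);
CREATE TABLE attribute_value (number TEXT PRIMARY KEY,
    node_number TEXT REFERENCES node(number),
    entity_number TEXT REFERENCES entity(number),
    attribute_name_number TEXT REFERENCES attribute_name(number),
    attribute_value TEXT);
CREATE TABLE relation (number TEXT PRIMARY KEY,
    start_node_number TEXT REFERENCES node(number),
    target_node_number TEXT REFERENCES node(number), relationship TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample rows.
conn.execute("INSERT INTO node VALUES ('N001', 'river', 'river')")
conn.execute("INSERT INTO node VALUES ('N002', 'land', 'land')")
conn.execute("INSERT INTO entity VALUES ('E001', 'N001', 'River X')")
conn.execute("INSERT INTO attribute_name VALUES ('P001', 'N001', 'E001', 'address')")
conn.execute("INSERT INTO attribute_value VALUES ('V001', 'N001', 'E001', 'P001', 'County A')")
conn.execute("INSERT INTO relation VALUES ('R001', 'N001', 'N002', 'irrigates')")

# Rebuild "subject - attribute name - attribute value" triples from the tables.
triples = conn.execute(
    """
    SELECT e.entity_name, a.attribute_name, v.attribute_value
    FROM attribute_value AS v
    JOIN attribute_name AS a ON a.number = v.attribute_name_number
    JOIN entity AS e ON e.number = v.entity_number
    """
).fetchall()

# Rebuild graph edges between node types from the relation table.
edges = conn.execute(
    """
    SELECT s.node_name, r.relationship, t.node_name
    FROM relation AS r
    JOIN node AS s ON s.number = r.start_node_number
    JOIN node AS t ON t.number = r.target_node_number
    """
).fetchall()
print(triples, edges)
```

The node and relation tables play the role of labelled nodes and edges in a graph database, while the entity, attribute name, and attribute value tables carry the properties.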
The beneficial effects of the present invention are as follows: by constructing a knowledge graph of ethnic culture information resources, ethnic culture information on the Internet is expressed in a form closer to the way humans understand the world, which makes that information easier to manage and use.
Specific embodiments
The invention is further described below with reference to specific embodiments.
Embodiment 1: Entry content: [Mengpeng River] A right-bank tributary of the Nanpeng River. Located in Zhenkang County, Lincang City; adjacent to the Nujiang in the north; in the west, bordering southern Myanmar.
1. Segmentation: the Chinese word segmentation system segments the entry as follows, where each slash-separated item stands for one segment of the Chinese entry text: "[ / Meng / peng / River / ] / Nan / peng / River / right / bank / tributary / . / located in / Lincang City / Zhenkang County / , / north / with / Nujiang / adjacent / , / west / , / west / with / south / Myanmar / border / .". After part-of-speech tagging: "[ [punctuation] / Meng [noun] / peng [verb] / River [noun] / ] [punctuation] / Nan [locative noun] / peng [verb] / River [noun] / right [locative noun] / bank [noun] / tributary [noun] / . [punctuation] / located in [verb] / Lincang City [noun] / Zhenkang County [noun] / , [punctuation] / north [locative noun] / with [preposition] / Nujiang [noun] / adjacent [verb] / , [punctuation] / west [locative noun] / , [punctuation] / west [locative noun] / with [preposition] / south [locative noun] / Myanmar [noun] / border [verb] / . [punctuation]". After the punctuation marks are removed: "Meng [noun] / peng [verb] / River [noun] / Nan [locative noun] / peng [verb] / River [noun] / right [locative noun] / bank [noun] / tributary [noun] / located in [verb] / Lincang City [noun] / Zhenkang County [noun] / north [locative noun] / with [preposition] / Nujiang [noun] / adjacent [verb] / west [locative noun] / west [locative noun] / with [preposition] / south [locative noun] / Myanmar [noun] / border [verb]".
2. Detection: define the set of segmentation results as S = (Meng, peng, River, Nan, peng, River, right, bank, tributary, located in, Lincang City, Zhenkang County, north, with, Nujiang, adjacent, west, west, with, south, Myanmar, border). Counting the characters of each segmentation result in S gives the character-count set C = (1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2), so m = 22. The value of k is first set to 22 and the procedure of Step 2 is applied. When k has been reduced to 3, formula (1) and $3 - 1 + 1 \le 3 < 22$ hold simultaneously, so the consecutive single-character segments are taken as a new word x = {Meng, peng, River}; that is, "Meng / peng / River" is found to be a run of 3 consecutive single characters. All words found are collected into the set W = (Mengpeng River, Nanpeng River). Manual review confirms that these are proper nouns, and they are added to the user-defined dictionary.
3. Result after re-segmentation: "Mengpeng River / Nanpeng River / right / bank / tributary / located in / Lincang City / Zhenkang County / north / with / Nujiang / adjacent / west / west / with / south / Myanmar / border". With k = 2, no set of consecutive single characters forms a further new word.
4. Attribute labelling is carried out on the re-segmentation result above. After labelling, a series of triples is formed:
(1) Mengpeng River --- parent river --- right-bank tributary of the Nanpeng River;
(2) Mengpeng River --- address --- Zhenkang County, Lincang City; adjacent to the Nujiang in the north; bordering southern Myanmar in the west;
(3) Mengpeng River --- adjacent river --- adjacent to the Nujiang in the north;
(4) Mengpeng River --- bordering region --- bordering southern Myanmar in the west.
5. Duplicate attribute detection is carried out.
Type 1: the same attribute of the same entity has several attribute values; if one attribute value contains other attribute values, the contained attribute values are eliminated. For example:
(1) Mengpeng River --- address --- Zhenkang County, Lincang City; adjacent to the Nujiang in the north; bordering southern Myanmar in the west;
(2) Mengpeng River --- address --- Zhenkang County, Lincang City;
Then (2) is eliminated and (1) is retained.
Type 2: the same attribute of the same entity has several attribute values; if the attribute values are mutually exclusive, they are judged by the number of records holding each value: the value held by more records is retained and is submitted for manual review. For example:
(1) Mengpeng River --- address --- Zhenkang County, Lincang City; adjacent to the Nujiang in the north; bordering southern Myanmar in the west;
(2) Mengpeng River --- address --- Zhenkang County, Lincang City;
(3) Mengpeng River --- address --- Cangyuan County, Lincang City;
Then (3) is eliminated, (1) and (2) are retained and submitted for manual review, and other data are used for supplementary verification.
Type 3: the same attribute of the same entity has several attribute values; if the attribute values are mutually exclusive and the number of records holding each value is the same, all of them are submitted for manual review. For example:
(1) Mengpeng River --- address --- Zhenkang County, Lincang City;
(2) Mengpeng River --- address --- Cangyuan County, Lincang City;
Then both are submitted for manual review, and other data are used for supplementary verification.
6. The knowledge graph is linked to the resources. Because every knowledge graph is built by extraction from a resource, and each resource has a unique resource address, a hyperlink to the resource is added to the attributes of each knowledge graph, so that attributes can be verified and resources consulted.
7. The knowledge graph is stored in a relational database. For example:
Node table: N001, river, river; N002, land, ground
Entity table: E001, N001, Mengpeng River
Attribute name table: P001, N001, E001, address
Attribute value table: V001, N001, E001, P001, Zhenkang County, Lincang City
Relation table: R001, N001, N002, irrigates
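The detection step of Embodiment 1 can be checked mechanically from the character counts alone. The sketch below takes the set C exactly as listed in the embodiment, locates the runs of single-character segments, and enumerates the windows of length 3 inside them; the variable names are hypothetical, and positions are 1-based to match the indices i and j used in the text.

```python
from itertools import groupby

# Character counts C from Embodiment 1.
C = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2]
m = len(C)

# Maximal runs of single-character segments as 1-based (i, j) pairs.
runs = []
pos = 0
for is_single, group in groupby(C, key=lambda c: c == 1):
    n = len(list(group))
    if is_single:
        runs.append((pos + 1, pos + n))
    pos += n

# Windows of k = 3 consecutive single-character segments inside those runs.
k = 3
windows = [(i, i + k - 1) for (a, b) in runs for i in range(a, b - k + 2)]

print(m, runs, windows)
```

The windows at positions 1 to 3 and 4 to 6 correspond to the two names that the manual review accepts; the other windows are candidates that the review rejects.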
Embodiment 2: Entry content: [Kaxie] A Lahu name, i.e. "Reed-long" in the Wa name; the name of an official post in Wa areas during the Three Buddhist Patriarchs period.
1. Segmentation: the Chinese word segmentation system segments the entry as follows, where each slash-separated item stands for one segment of the Chinese entry text: "[ / Ka / xie / ] / La / hu / name / , / i.e. / Wa / name / in / of / " / reed / long / " / , / is / three / Buddhist patriarch / period / Wa people / area / of / official post / name / .". After part-of-speech tagging: "[ [punctuation] / Ka [noun] / xie [quantifier] / ] [punctuation] / La [verb] / hu [nominal morpheme] / name [nominal morpheme] / , [punctuation] / i.e. [verb] / Wa [distinguishing word] / name [quantifier] / in [locative noun] / of [auxiliary word] / " [punctuation] / reed [personal name] / long [nominal morpheme] / " [punctuation] / , [punctuation] / is [preposition] / three [numeral] / Buddhist patriarch [noun] / period [noun] / Wa people [other proper noun] / area [noun] / of [auxiliary word] / official post [noun] / name [quantifier] / . [punctuation]".
2. Detection: define the set of segmentation results as S = (Ka, xie, La, hu, name, i.e., Wa, name, in, of, reed, long, is, three, Buddhist patriarch, period, Wa people, area, of, official post, name). Counting the characters of each segmentation result in S gives the character-count set C = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1), so m = 21. The value of k is first set to 21 and the procedure of Step 2 is applied. When k has been reduced to 2, formula (1) and $2 - 1 + 1 \le 2 < 21$ hold simultaneously, so the consecutive single-character segments are taken as a new word x = {Ka, xie}; that is, "Ka / xie" is found to be a run of 2 consecutive single characters. All words found are collected into the set W = (Kaxie, Lahu, Reed-long). Manual review confirms that these are proper nouns, and they are added to the user-defined dictionary.
3. Result after re-segmentation: "Kaxie / Lahu / name / i.e. / Wa / name / in / of / Reed-long / is / three / Buddhist patriarch / period / Wa people / area / of / official post / name". With k = 2, no set of consecutive single characters forms a further new word.
4. Attribute labelling is carried out on the re-segmentation result above. After labelling, a series of triples is formed:
(1) Kaxie --- ethnic origin --- Lahu name;
(2) Kaxie --- Wa --- Reed-long;
(3) Kaxie --- period --- Three Buddhist Patriarchs period;
(4) Kaxie --- area --- Wa area;
(5) Kaxie --- explanation --- name of an official post in Wa areas during the Three Buddhist Patriarchs period.
5. Duplicate attribute detection is carried out.
Type 1: the same attribute of the same entity has several attribute values; if one attribute value contains other attribute values, the contained attribute values are eliminated. For example:
(1) Kaxie --- explanation --- name of an official post in Wa areas during the Three Buddhist Patriarchs period;
(2) Kaxie --- explanation --- name of an official post;
Then (2) is eliminated and (1) is retained.
Type 2: the same attribute of the same entity has several attribute values; if the attribute values are mutually exclusive, they are judged by the number of records holding each value: the value held by more records is retained and is submitted for manual review. For example:
(1) Kaxie --- explanation --- name of an official post in Wa areas during the Three Buddhist Patriarchs period;
(2) Kaxie --- explanation --- name of an official post in Wa areas;
(3) Kaxie --- explanation --- name of an official post in the Cangyuan area;
Then (3) is eliminated, (1) and (2) are retained and submitted for manual review, and other data are used for supplementary verification.
Type 3: the same attribute of the same entity has several attribute values; if the attribute values are mutually exclusive and the number of records holding each value is the same, all of them are submitted for manual review. For example:
(1) Kaxie --- area --- Wa area;
(2) Kaxie --- area --- Cangyuan area;
Then both are submitted for manual review, and other data are used for supplementary verification.
6. The knowledge graph is linked to the resources. Because every knowledge graph is built by extraction from a resource, and each resource has a unique resource address, a hyperlink to the resource is added to the attributes of each knowledge graph, so that attributes can be verified and resources consulted.
7. The knowledge graph is stored in a relational database. For example:
Node table: N001, official post, river; N002, place
Entity table: E001, N001, Kaxie
Attribute name table: P001, N001, E001, period
Attribute value table: V001, N001, E001, P001, Three Buddhist Patriarchs period
Relation table: R001, N001, N002, belongs to
The embodiments of the present invention have been described in detail above, but the present invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the purpose of the present invention.