CN107515902A - A heterogeneous data distributed storage method based on semantic annotation - Google Patents
A heterogeneous data distributed storage method based on semantic annotation
- Publication number
- CN107515902A (application CN201710608703.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- semantic
- information
- heterogeneous
- data source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a heterogeneous data distributed storage method based on semantic annotation, comprising the following steps: 1) establish a heterogeneous data source semantic label library and a heterogeneous data information semantic label library; 2) establish the heterogeneous data storage unit content; 3) dynamically store the heterogeneous data source semantic label library and the heterogeneous data information semantic label library in the .META. table; 4) dynamically store the data units in storage blocks; 5) compute semantic similarity over the annotated semantic information of the heterogeneous data; 6) store the heterogeneous data fusion information, writing the computed similarity values to the .INFO. table; 7) build the distributed data retrieval information based on the semantic library; 8) store the data retrieval information in the .INDEX. table. The invention achieves fused storage of data, addressing both the difficulty of fusing data that carry no semantics and the distributed storage of heterogeneous big data.
Description
Technical field
The present invention relates to semantic annotation technology and distributed storage technology, and in particular to a heterogeneous data distributed storage method based on semantic annotation.
Background
With the rapid development of the Internet, data volume grows exponentially every day, and data sources are becoming richer and more complex. Data formats such as text, audio, and video are also multiplying, so the fusion and storage of heterogeneous data is an increasingly prominent problem. Traditional data fusion, however, merely stores heterogeneous data in a unified way, without semantics. A heterogeneous data distributed storage method based on semantic annotation would achieve a high degree of semantic fusion of heterogeneous data and play a key role in the efficient retrieval of heterogeneous data resources.
There are currently many methods of semantic annotation. Annotation is mainly performed according to resource attributes, resource content, resource content features, and domain-specific resource ontologies.
Heterogeneous data based on semantic annotation mainly means that heterogeneous data sources are annotated and described in a user-defined way. The semantic annotation information is logically stored as one big table but is physically stored in a distributed manner. How to combine heterogeneous data with distributed storage is a problem that urgently needs to be solved at this stage.
Summary of the invention
To solve the above technical problem, the present invention proposes a heterogeneous data distributed storage method based on semantic annotation.
To achieve this object, the technical solution of the present invention is as follows:
A heterogeneous data distributed storage method based on semantic annotation, comprising the following steps:
1) establish a heterogeneous data source semantic label library and a heterogeneous data information semantic label library;
2) establish the heterogeneous data storage unit content;
3) dynamically store the heterogeneous data source semantic label library and the heterogeneous data information semantic label library in the .META. table, mapping data source semantics to data information semantics;
4) dynamically store the data units in storage blocks;
5) compute semantic similarity over the annotated semantic information of the heterogeneous data;
6) store the heterogeneous data fusion information: write the computed similarity values to the .INFO. table, mapping each similarity value to the data information of the heterogeneous data;
7) build the distributed data retrieval information based on the semantic library;
8) store the data retrieval information in the .INDEX. table.
The present invention achieves fused storage of data, in particular of heterogeneous data. Existing stored data are mutually independent and heterogeneous, with no semantic association between them. The heterogeneous data distributed storage system based on semantic annotation solves both the problem that data without semantics are hard to fuse and the problem of distributed storage of heterogeneous big data. The invention proposes a method of semantically annotating data sources and a technique for computing the semantic similarity of data sources, filling a gap in this functionality on the market. By storing the semantic label libraries and the heterogeneous data unit content in a distributed manner, it achieves distributed storage of heterogeneous data.
On the basis of the above technical solution, the following improvements can also be made:
As a preferred scheme, step 1) specifically comprises the following steps:
1.1) create the semantic label library;
1.2) input the heterogeneous data source;
1.3) parse the data source semantics and/or annotate the data source name and/or annotate the data source category and/or annotate the data source format and/or annotate the data source time;
1.4) parse the data information semantics and/or annotate the data name and/or annotate the data attribution information and/or annotate the data description and/or annotate the data time.
With this preferred scheme, operation is simple and convenient.
As a preferred scheme, step 4) further includes the following:
When the capacity of a heterogeneous data source exceeds the storage block size, the system automatically splits the data source information into several smaller blocks, each no larger than the storage block size.
With this preferred scheme, dynamic storage is facilitated.
As a preferred scheme, the storage block size is 64M.
With this preferred scheme, storage works well.
As a preferred scheme, step 5) specifically comprises the following steps:
5.1) read two storage units;
5.2) split the stored semantic information of the heterogeneous data into individual characters;
5.3) compute the frequency of each character;
5.4) obtain the frequency sequences;
5.5) compute the cosine of the angle between the two frequency sequences.
With this preferred scheme, operation is simple and convenient.
As a preferred scheme, step 5) specifically comprises the following step:
5.6) judge whether the resulting cosine value is greater than a; if it is, the two are similar; otherwise, they are dissimilar.
With this preferred scheme, the judgment is convenient.
As a preferred scheme, the heterogeneous data source semantic label includes: data source name and/or data source category and/or data source description and/or data source format and/or data source creation time.
With this preferred scheme, annotation is performed according to the specific situation.
As a preferred scheme, the heterogeneous data information semantic label includes: data name and/or data attribution information and/or data description and/or data creation time.
With this preferred scheme, annotation is performed according to the specific situation.
As a preferred scheme, the index information includes data source information and/or data information and/or node information of the distributed storage and/or index time.
With this preferred scheme, annotation is performed according to the specific situation.
Brief description of the drawings
Fig. 1 is a flow chart of a heterogeneous data distributed storage method based on semantic annotation provided by an embodiment of the present invention.
Fig. 2 is a flow chart of establishing the heterogeneous data source semantic label library and the heterogeneous data information semantic label library provided by an embodiment of the present invention.
Fig. 3 is a flow chart of the semantic similarity calculation provided by an embodiment of the present invention.
Embodiments
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
To achieve the object of the present invention, in some embodiments of the heterogeneous data distributed storage method based on semantic annotation, as shown in Fig. 1, the method comprises the following steps:
1) Establish the heterogeneous data source semantic label library and the heterogeneous data information semantic label library.
The heterogeneous data source semantic label includes: data source name, data source category, data source description, data source format, and data source creation time. Its definition format is 【F: "name, type, describe, format, timestamp"】, where F is the Family (column family) of the big data table; this defines the data source semantic label library. The heterogeneous data information semantic label includes: data name, data attribution information, data description, and data creation time. Its definition format is 【C: "name, ftype, describe, timestamp"】, where C is the Column of the big data table; this defines the data information semantic label library.
2) Establish the heterogeneous data storage unit content. The storage format is 【F, C, V】, where F is the data source semantic annotation, C is the data information semantic annotation, and V is the data content, such as text, pictures, audio, video, and files.
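The label formats and the 【F, C, V】 storage unit can be sketched as plain records. This is only an illustration: the field names follow the definition formats quoted above, while the class names and the sample values are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLabel:
    """Data source semantic label F: name, type, describe, format, timestamp."""
    name: str
    type: str
    describe: str
    format: str
    timestamp: int


@dataclass(frozen=True)
class InfoLabel:
    """Data information semantic label C: name, ftype, describe, timestamp."""
    name: str
    ftype: str
    describe: str
    timestamp: int


@dataclass(frozen=True)
class StorageUnit:
    """One storage unit in the [F, C, V] format."""
    f: SourceLabel
    c: InfoLabel
    v: bytes  # raw content: text, picture, audio, video, or file bytes


# Hypothetical sample values, loosely modelled on a traffic data source.
traffic_source = SourceLabel("traffic", "sensor", "city traffic feed", "text", 1404109199344)
traffic_info = InfoLabel("city name", "traffic data source", "current city", 1404109199344)
unit = StorageUnit(traffic_source, traffic_info, b"congestion level: low")
```

Keeping F and C as separate immutable records mirrors the split between the Family and Column label libraries, so the same source label can be shared by many units.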
3) Dynamically store the heterogeneous data source semantic label library and the heterogeneous data information semantic label library in the .META. table, mapping data source semantics to data information semantics in 【KEY:VALUE】 form. For example, the storage format for a traffic data source is 【F, C1】, 【F, C2】, ..., 【F, Cn】, where F is the semantic label of the traffic data source and C1..n are the semantic labels of the traffic data information. The .META. table information is shown in Table 1, where F denotes Family information and C denotes Column information.
Table 1: Stored .META. table information
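A minimal sketch of the 【KEY:VALUE】 mapping described in step 3, assuming the .META. table can be modelled as an in-memory dictionary from one source label F to its information labels C1..Cn. The names `MetaTable` and `add_mapping` and the label strings are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List

# KEY: data source label F, VALUE: the data information labels C1..Cn mapped to it.
MetaTable = DefaultDict[str, List[str]]


def add_mapping(meta: MetaTable, f: str, c: str) -> None:
    """Record the pair [F, C] once; repeated registrations are ignored."""
    if c not in meta[f]:
        meta[f].append(c)


meta_table: MetaTable = defaultdict(list)
for c_label in ("C1", "C2", "C3"):
    add_mapping(meta_table, "F:traffic", c_label)
add_mapping(meta_table, "F:traffic", "C1")  # duplicate pair, no effect
```

A real deployment would hold this mapping in the distributed big table itself; the dictionary only shows the shape of the relation.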
4) Dynamically store the data units in storage blocks. The distributed storage system allocates storage resources dynamically with load balancing, and the default storage block size allocated by a storage resource node is BLOCK=64MB. When the capacity of a heterogeneous data source exceeds the storage block size, the system automatically splits the data source information into several smaller blocks 【N1, N2, ..., Nn-1, Nn】 (note: each of N1, N2, ..., Nn-1 = 64MB, and Nn <= 64MB), so that no block exceeds the storage block size. The storage unit BLOCK table information is shown in Table 2.
Table 2: Storage unit BLOCK table information
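The block-splitting rule in step 4 can be sketched as follows, assuming the data source content is available as a byte string. The function name `split_into_blocks` is hypothetical, and the example uses a 4-byte block only to keep the output readable.

```python
from typing import List

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size stated in the text


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut data into consecutive blocks; all but the last are exactly block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Ten bytes with a 4-byte block size: two full blocks and one shorter final block.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
```

Data no larger than one block comes back as a single block, which matches the rule that splitting happens only when the capacity exceeds the block size.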
5) Compute semantic similarity over the annotated semantic information of the heterogeneous data. The similarity is computed on the semantic information annotated in the storage format 【F, C】, using the cosine (law of cosines) algorithm.
6) Store the heterogeneous data fusion information. The computed similarity values are stored in the .INFO. table, mapping each similarity value to the data information of the heterogeneous data in 【KEY:VALUE】 form, which achieves the fusion of the heterogeneous data. The .INFO. table information is shown in Table 3, where F denotes Family information and C denotes Column information.
Table 3: Stored .INFO. table information
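One way to picture the .INFO. mapping in step 6 is a dictionary keyed by the pair of data information labels that were compared. This is an assumption about the key layout, which the text does not spell out; `store_similarity` and `lookup_similarity` are hypothetical helper names.

```python
from typing import Dict, Optional, Tuple

InfoTable = Dict[Tuple[str, str], float]


def _pair_key(c_a: str, c_b: str) -> Tuple[str, str]:
    """Order the two labels so that (C1, C2) and (C2, C1) share one key."""
    return (c_a, c_b) if c_a <= c_b else (c_b, c_a)


def store_similarity(info: InfoTable, c_a: str, c_b: str, value: float) -> None:
    """Store one similarity value for a pair of data information labels."""
    info[_pair_key(c_a, c_b)] = value


def lookup_similarity(info: InfoTable, c_a: str, c_b: str) -> Optional[float]:
    """Return the stored value for the pair, or None if it was never computed."""
    return info.get(_pair_key(c_a, c_b))


info_table: InfoTable = {}
store_similarity(info_table, "C2", "C1", 0.9)
```

Because cosine similarity is symmetric, an order-independent key avoids storing the same value twice.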
7) Build the distributed data retrieval information based on the semantic library. The index information includes data source information, data information, node information of the distributed storage, and index time; the index storage format is 【F:Name, C:Name, Node:Name, timestamp】.
8) Store the data retrieval information in the .INDEX. table, storing the index information of the heterogeneous data in 【KEY:VALUE】 form.
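The index format 【F:Name, C:Name, Node:Name, timestamp】 can be sketched as a small record plus a lookup keyed by the (F, C) names. The choice of key, the class and function names, and the node names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """One index record in the [F:Name, C:Name, Node:Name, timestamp] format."""
    f_name: str
    c_name: str
    node_name: str
    timestamp: int


IndexTable = Dict[Tuple[str, str], List[IndexEntry]]


def add_index(index: IndexTable, entry: IndexEntry) -> None:
    """File the entry under its (F, C) names; one pair may live on several nodes."""
    index.setdefault((entry.f_name, entry.c_name), []).append(entry)


def nodes_for(index: IndexTable, f_name: str, c_name: str) -> List[str]:
    """Return the storage nodes that hold blocks for the given (F, C) pair."""
    return [e.node_name for e in index.get((f_name, c_name), [])]


index_table: IndexTable = {}
add_index(index_table, IndexEntry("traffic", "city name", "node-1", 1404109199344))
add_index(index_table, IndexEntry("traffic", "city name", "node-2", 1404109199344))
```

Retrieval then reduces to one dictionary lookup followed by fetching the blocks from the listed nodes.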
As shown in Fig. 2, step 1) specifically comprises the following steps:
1.1) create the semantic label library;
1.2) input the heterogeneous data source;
1.3) parse the data source semantics and/or annotate the data source name and/or annotate the data source category and/or annotate the data source format and/or annotate the data source time;
1.4) parse the data information semantics and/or annotate the data name and/or annotate the data attribution information and/or annotate the data description and/or annotate the data time.
As shown in Fig. 3, step 5) specifically comprises the following steps:
5.1) read two storage units;
5.2) split the stored semantic information of the heterogeneous data into individual characters;
5.3) compute the frequency of each character;
5.4) obtain the frequency sequences;
5.5) compute the cosine of the angle between the two frequency sequences;
5.6) judge whether the resulting cosine value is greater than a; if it is, the two are similar; otherwise, they are dissimilar.
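Steps 5.2 to 5.6 can be sketched in a few lines. This is a minimal illustration, assuming the semantic information of each storage unit is available as a plain string and that the threshold a defaults to 0.8; the function names are hypothetical.

```python
import math
from collections import Counter


def char_frequencies(text: str) -> Counter:
    """Steps 5.2-5.3: split the text into single characters and count each one."""
    return Counter(ch for ch in text if not ch.isspace())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Steps 5.4-5.5: build two equal-length frequency vectors and take the cosine."""
    freq_a, freq_b = char_frequencies(text_a), char_frequencies(text_b)
    vocabulary = sorted(set(freq_a) | set(freq_b))
    x = [freq_a[ch] for ch in vocabulary]  # a missing character counts as 0
    y = [freq_b[ch] for ch in vocabulary]
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0


def is_similar(text_a: str, text_b: str, a: float = 0.8) -> bool:
    """Step 5.6: similar only if the cosine value is greater than the threshold a."""
    return cosine_similarity(text_a, text_b) > a
```

Building both vectors over the union of characters is what gives them the same dimension, so a character present in only one text contributes a zero on the other side.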
Suppose there are a traffic data source 【F1】 and a weather data source 【F2】, with traffic data information 【C1】 and weather data information 【C2】. The semantic definition of the stored data information C1 is: 【city name, weather data source, represents current city, 1404109199352】. Note: 1404109199352 is the timestamp describing the current weather state.
The semantic definition of data information C2 is: 【city name, traffic data source, represents current city, 1404109199344】. Note: 1404109199344 is the timestamp describing the current traffic condition.
Splitting the semantic text of C1 character by character gives a semantic dictionary Z1c1, Z1c2, Z1c3, Z1c4, ..., Z1cn: 【city, city, name, claim, day, gas, number, according to, source, table, show, when, preceding, city, city】 (each entry glosses one character of the Chinese text), which is converted to the GB2312 codes 【1730, 3901, 3286, 1729, 4117, 3565, 3946, 2786, 4813, 1580, 3883, 1896, 3587】.
The frequencies of occurrence are Z1n1: 【2,2,1,1,1,1,1,1,1,1,1,1,1】.
Splitting the semantic text of C2 character by character gives a semantic dictionary Z2c1, Z2c2, Z2c3, Z2c4, ..., Z2cn: 【city, city, name, claim, hand over, lead to, number, according to, source, table, show, when, preceding, city, city】, which is converted to the GB2312 codes 【1730, 3901, 3286, 1729, 2658, 4143, 3946, 2786, 4813, 1580, 3883, 1896, 3587】.
The frequencies of occurrence are Z2n1: 【2,2,1,1,1,1,1,1,1,1,1,1,1】.
The dictionaries Z1 and Z2 are then compared. Where a character of one dictionary has no counterpart in the other, it is added to the other and its frequency in the corresponding list (Z1n1 or Z2n1) is counted as 0. This yields two vectors of the same dimension, X and Y:
X: (2,2,1,1,1,1,0,0,1,1,1,1,1,1,1);
Y: (2,2,1,1,0,0,1,1,1,1,1,1,1,1,1);
Calculation formula (the cosine of the angle between X and Y, as in step 5.5):
$$\mathrm{SimilaryValue} = \cos\theta = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2}}$$
Result: a SimilaryValue from 0 to 0.8 indicates dissimilar; a value from 0.8 to 1 indicates similar.
The semantic similarity between Z1 and Z2 is obtained by substituting the vectors X and Y into the formula.
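The numeric result for this example is not reproduced in the text. As a hedged check, the sketch below evaluates the cosine for the vectors X and Y exactly as listed, assuming the 0.8 boundary from the result description is the similarity threshold.

```python
import math

# Frequency vectors copied from the worked example.
X = (2, 2, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1)
Y = (2, 2, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)

dot = sum(x * y for x, y in zip(X, Y))      # 17
norm_x = math.sqrt(sum(x * x for x in X))   # sqrt(19)
norm_y = math.sqrt(sum(y * y for y in Y))   # sqrt(19)
similarity = dot / (norm_x * norm_y)        # 17/19, about 0.895
is_similar = similarity > 0.8               # assumed threshold a = 0.8
```

With these vectors the value is 17/19, roughly 0.895, which falls in the 0.8 to 1 range and so counts as similar under the stated rule.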
The distributed storage method splits heterogeneous data into blocks. Whatever the format of the data source, each block defaults to 64MB, with the last block less than or equal to 64MB, so big data can be distributed across different storage nodes. Synchronization and retrieval between data are managed through the .META. semantic information table stored beforehand, thereby achieving storage of heterogeneous data sources.
The heterogeneous data distributed storage method based on semantic annotation fuses data mainly by semantically annotating heterogeneous data sources. It associates originally independent heterogeneous data resources with one another, fuses information from different data sources, and uses semantics to make data more intelligent. Distributed storage uses data blocking to store heterogeneous big data; the data can be extended and storage space can be added dynamically, so big data are not limited by storage capacity.
The present invention achieves fused storage of data, in particular of heterogeneous data. Existing stored data are mutually independent and heterogeneous, with no semantic association between them. The heterogeneous data distributed storage system based on semantic annotation solves both the problem that data without semantics are hard to fuse and the problem of distributed storage of heterogeneous big data. The invention proposes a method of semantically annotating data sources and a technique for computing the semantic similarity of data sources, filling a gap in this functionality on the market. By storing the semantic label libraries and the heterogeneous data unit content in a distributed manner, it achieves distributed storage of heterogeneous data.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept of the present invention, and these all fall within the protection scope of the present invention.
Claims (9)
1. A heterogeneous data distributed storage method based on semantic annotation, characterized by comprising the following steps:
1) establish a heterogeneous data source semantic label library and a heterogeneous data information semantic label library;
2) establish the heterogeneous data storage unit content;
3) dynamically store the heterogeneous data source semantic label library and the heterogeneous data information semantic label library in the .META. table, mapping data source semantics to data information semantics;
4) dynamically store the data units in storage blocks;
5) compute semantic similarity over the annotated semantic information of the heterogeneous data;
6) store the heterogeneous data fusion information: write the computed similarity values to the .INFO. table, mapping each similarity value to the data information of the heterogeneous data;
7) build the distributed data retrieval information based on the semantic library;
8) store the data retrieval information in the .INDEX. table.
2. The heterogeneous data distributed storage method based on semantic annotation according to claim 1, characterized in that step 1) specifically comprises the following steps:
1.1) create the semantic label library;
1.2) input the heterogeneous data source;
1.3) parse the data source semantics and/or annotate the data source name and/or annotate the data source category and/or annotate the data source format and/or annotate the data source time;
1.4) parse the data information semantics and/or annotate the data name and/or annotate the data attribution information and/or annotate the data description and/or annotate the data time.
3. The heterogeneous data distributed storage method based on semantic annotation according to claim 1, characterized in that step 4) further includes the following:
when the capacity of a heterogeneous data source exceeds the storage block size, the system automatically splits the data source information into several smaller blocks, each no larger than the storage block size.
4. The heterogeneous data distributed storage method based on semantic annotation according to claim 3, characterized in that the storage block size is 64M.
5. The heterogeneous data distributed storage method based on semantic annotation according to claim 1, characterized in that step 5) specifically comprises the following steps:
5.1) read two storage units;
5.2) split the stored semantic information of the heterogeneous data into individual characters;
5.3) compute the frequency of each character;
5.4) obtain the frequency sequences;
5.5) compute the cosine of the angle between the two frequency sequences.
6. The heterogeneous data distributed storage method based on semantic annotation according to claim 5, characterized in that step 5) specifically comprises the following step:
5.6) judge whether the resulting cosine value is greater than a; if it is, the two are similar; otherwise, they are dissimilar.
7. The heterogeneous data distributed storage method based on semantic annotation according to any one of claims 1-6, characterized in that the heterogeneous data source semantic label includes: data source name and/or data source category and/or data source description and/or data source format and/or data source creation time.
8. The heterogeneous data distributed storage method based on semantic annotation according to claim 7, characterized in that the heterogeneous data information semantic label includes: data name and/or data attribution information and/or data description and/or data creation time.
9. The heterogeneous data distributed storage method based on semantic annotation according to claim 8, characterized in that the index information includes data source information and/or data information and/or node information of the distributed storage and/or index time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710608703.6A CN107515902A (en) | 2017-07-26 | 2017-07-26 | A heterogeneous data distributed storage method based on semantic annotation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107515902A (en) | 2017-12-26 |
Family ID: 60722494
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710608703.6A (Pending) CN107515902A (en) | 2017-07-26 | 2017-07-26 | A heterogeneous data distributed storage method based on semantic annotation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107515902A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156726A (en) * | 2011-04-01 | 2011-08-17 | 中国测绘科学研究院 | Geographic element querying and extending method based on semantic similarity |
US20120078595A1 (en) * | 2010-09-24 | 2012-03-29 | Nokia Corporation | Method and apparatus for ontology matching |
CN102609854A (en) * | 2011-01-25 | 2012-07-25 | 青岛理工大学 | Client partitioning method and device based on unified similarity calculation |
CN104679823A (en) * | 2014-12-31 | 2015-06-03 | 智慧城市信息技术有限公司 | Semantic annotation-based association method and system of heterogeneous data |
- 2017-07-26: CN application CN201710608703.6A filed (published as CN107515902A, status: Pending)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076366A (en) * | 2021-04-09 | 2021-07-06 | 南京邮电大学 | Intelligent lamp pole virtualization method |
CN113076366B (en) * | 2021-04-09 | 2023-01-24 | 南京邮电大学 | Intelligent lamp pole virtualization method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171226 |