CN107515902A - A distributed storage method for heterogeneous data based on semantic annotation - Google Patents

A distributed storage method for heterogeneous data based on semantic annotation

Info

Publication number
CN107515902A
Authority
CN
China
Prior art keywords
data
semantic
information
heterogeneous
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710608703.6A
Other languages
Chinese (zh)
Inventor
吴含前
沈鸣飞
顾鹏
陈钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SUZHOU SIGMA TECHNOLOGY Co Ltd
Original Assignee
SUZHOU SIGMA TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SUZHOU SIGMA TECHNOLOGY Co Ltd
Priority to CN201710608703.6A
Publication of CN107515902A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a distributed storage method for heterogeneous data based on semantic annotation, comprising the following steps: 1) establishing a heterogeneous data source semantic label library and a heterogeneous data information semantic label library; 2) establishing the content of heterogeneous data storage units; 3) dynamically storing the heterogeneous data source semantic label library and the heterogeneous data information semantic label library in the .META. table; 4) dynamically storing the data units in storage blocks; 5) performing semantic similarity calculation on the semantic information of the annotated heterogeneous data; 6) storing heterogeneous data fusion information, with the calculated similarity value stored in the .INFO. table; 7) establishing distributed data retrieval information based on the semantic library; 8) storing the data retrieval information in the .INDEX. table. The invention realizes fused data storage, solving both the problem that data are hard to fuse and lack semantics and the problem of distributed storage of heterogeneous big data.

Description

A distributed storage method for heterogeneous data based on semantic annotation
Technical field
The present invention relates to semantic annotation technology and distributed storage technology, and more particularly to a distributed storage method for heterogeneous data based on semantic annotation.
Background art
With the rapid development of the Internet, data volume grows exponentially every day, data sources become richer and more complex, and data formats such as text, audio and video keep multiplying. The problem of fusing and storing heterogeneous data is therefore increasingly prominent, yet traditional data fusion merely stores heterogeneous data in a unified way, without semantics. A distributed storage method for heterogeneous data based on semantic annotation would achieve a high degree of semantic fusion of heterogeneous data and play a key role in the efficient retrieval of heterogeneous data resources.
Many semantic annotation methods exist at present. They mainly annotate according to resource attributes, resource content, resource content features, or the ontology of a specific resource domain.
Semantic annotation of heterogeneous data mainly describes heterogeneous data sources through user-defined annotations. The annotation information is logically stored in one big table but is physically stored in a distributed manner. How to combine heterogeneous data with distributed storage is a problem that urgently needs to be solved at this stage.
Summary of the invention
To solve the above technical problem, the present invention proposes a distributed storage method for heterogeneous data based on semantic annotation.
To achieve the above object, the technical solution of the present invention is as follows:
A distributed storage method for heterogeneous data based on semantic annotation, comprising the following steps:
1) establishing a heterogeneous data source semantic label library and a heterogeneous data information semantic label library;
2) establishing the content of heterogeneous data storage units;
3) dynamically storing the heterogeneous data source semantic label library and the heterogeneous data information semantic label library in the .META. table, and storing the relationship mapping between data source semantics and data information semantics;
4) dynamically storing the data units in storage blocks;
5) performing semantic similarity calculation on the semantic information of the annotated heterogeneous data;
6) storing heterogeneous data fusion information: the calculated similarity value is stored in the .INFO. table, and the relationship mapping between the similarity value and the data information of the heterogeneous data is stored;
7) establishing distributed data retrieval information based on the semantic library;
8) storing the data retrieval information in the .INDEX. table.
The present invention realizes fused data storage, in particular of heterogeneous data. Existing stored data are mutually independent and heterogeneous, with no semantic association. The heterogeneous data distributed storage system based on semantic annotation solves both the problem that data are hard to fuse and lack semantics and the problem of distributed storage of heterogeneous big data. The invention proposes a method for semantically annotating data sources and a technique for computing the semantic similarity of data sources, filling a gap in the market for this function. By storing the semantic label libraries and the heterogeneous data unit contents in a distributed manner, distributed storage of heterogeneous data is achieved.
On the basis of the above technical solution, the following improvements can also be made:
As a preferred scheme, step 1) specifically comprises the following steps:
1.1) creating the semantic label library;
1.2) inputting the heterogeneous data source;
1.3) parsing the data source semantics and/or annotating the data source name and/or annotating the data source category and/or annotating the data source format and/or annotating the data source time;
1.4) parsing the data information semantics and/or annotating the data name and/or annotating the data attribution information and/or annotating the data description and/or annotating the data time.
With the above preferred scheme, operation is simple and convenient.
As a preferred scheme, step 4) further comprises the following:
When the capacity of a heterogeneous data source exceeds the storage block size, the system automatically splits the data source information into several smaller blocks, each no larger than the storage block size.
With the above preferred scheme, dynamic storage is facilitated.
As a preferred scheme, the storage block size is 64M.
With the above preferred scheme, the storage effect is good.
As a preferred scheme, step 5) specifically comprises the following steps:
5.1) reading two storage units;
5.2) splitting the semantic information of the stored heterogeneous data character by character;
5.3) calculating the character frequencies;
5.4) obtaining the frequency sequences;
5.5) calculating the cosine of the angle between the two frequency sequences.
With the above preferred scheme, operation is simple and convenient.
As a preferred scheme, step 5) specifically comprises the following step:
5.6) judging whether the obtained cosine value is greater than a; if so, the two are similar; otherwise, they are dissimilar.
With the above preferred scheme, the judgment is convenient.
As a preferred scheme, the heterogeneous data source semantic label comprises: data source name and/or data source category and/or data source description and/or data source format and/or data source creation time.
With the above preferred scheme, annotation is performed according to the specific situation.
As a preferred scheme, the heterogeneous data information semantic label comprises: data name and/or data attribution information and/or data description and/or data creation time.
With the above preferred scheme, annotation is performed according to the specific situation.
As a preferred scheme, the index information comprises data source information and/or data information and/or node information of the distributed storage and/or index time.
With the above preferred scheme, annotation is performed according to the specific situation.
Brief description of the drawings
Fig. 1 is a flow chart of a distributed storage method for heterogeneous data based on semantic annotation provided by an embodiment of the present invention.
Fig. 2 is a flow chart of establishing the heterogeneous data source semantic label library and the heterogeneous data information semantic label library provided by an embodiment of the present invention.
Fig. 3 is a flow chart of the semantic similarity calculation provided by an embodiment of the present invention.
Detailed description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
To achieve the object of the present invention, in some embodiments of the distributed storage method for heterogeneous data based on semantic annotation, as shown in Fig. 1, the method comprises the following steps:
1) establishing a heterogeneous data source semantic label library and a heterogeneous data information semantic label library;
The heterogeneous data source semantic label comprises: data source name, data source category, data source description, data source format and data source creation time. Its definition format is 【F:"name, type, describe, format, timestamp"】, where F is the column family (Family) of the big data table and defines the data source semantic label library. The heterogeneous data information semantic label comprises: data name, data attribution information, data description and data creation time. Its definition format is 【C:"name, ftype, describe, timestamp"】, where C is the column (Column) of the big data table and defines the data information semantic label library.
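The two label formats can be illustrated with a minimal Python sketch. The field names follow the bracketed formats in the text; the class names, the `encode` helper and all sample values are hypothetical and are not part of the patent.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class SourceLabel:
    """F: data source semantic label (name, type, describe, format, timestamp)."""
    name: str
    type: str
    describe: str
    format: str
    timestamp: int


@dataclass(frozen=True)
class InfoLabel:
    """C: data information semantic label (name, ftype, describe, timestamp)."""
    name: str
    ftype: str
    describe: str
    timestamp: int


def encode(prefix, label):
    # Serialise a label as  PREFIX:"field1, field2, ..."  (assumed textual form).
    fields = ", ".join(str(value) for value in astuple(label))
    return f'{prefix}:"{fields}"'


# Hypothetical sample values, for illustration only.
traffic_source = SourceLabel("traffic", "sensor", "city traffic feed", "json", 1404109199344)
traffic_info = InfoLabel("city name", "traffic data source", "current city", 1404109199344)

f_key = encode("F", traffic_source)
c_key = encode("C", traffic_info)
```

Freezing the dataclasses makes the labels hashable, so they could serve directly as dictionary keys in a table-like structure.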
2) establishing the content of heterogeneous data storage units. The storage format is 【F, C, V】, where F is the data source semantic annotation, C is the data information semantic annotation, and V is the data content, such as text, pictures, audio, video and files.
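A storage unit in the 【F, C, V】 format might be modelled as follows. This is only a sketch: the `StorageUnit` and `make_unit` names and the sample content are assumptions, and the content V is held as raw bytes so that text, pictures, audio and video can be treated uniformly.

```python
from typing import NamedTuple


class StorageUnit(NamedTuple):
    f: str    # data source semantic annotation
    c: str    # data information semantic annotation
    v: bytes  # data content (text, picture, audio, video, file) as raw bytes


def make_unit(f, c, content):
    # Text content is encoded as UTF-8; binary content is stored unchanged.
    if isinstance(content, str):
        content = content.encode("utf-8")
    return StorageUnit(f, c, bytes(content))


# Hypothetical example: a text payload annotated with F and C labels.
unit = make_unit("traffic data source", "city name", "Suzhou")
```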
3) The heterogeneous data source semantic label library and the heterogeneous data information semantic label library are dynamically stored in the .META. table, and the relationship mapping between data source semantics and data information semantics is stored in 【KEY:VALUE】 form. For example, the storage format for a traffic data source is 【F, C1】, 【F, C2】, ..., 【F, Cn】, where F denotes the semantic label of the traffic data source and C1..Cn denote the semantic labels of the traffic data information. See Table 1 (.META. table information), where F denotes Family information and C denotes Column information.
Table 1: Stored .META. table information
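The KEY:VALUE mapping of step 3) can be sketched with an in-memory dictionary standing in for the .META. table. The `MetaTable` class, its methods and the labels `F_traffic`, `C1`..`C3` are hypothetical; the patent does not specify the underlying store.

```python
from collections import defaultdict


class MetaTable:
    """In-memory stand-in for the .META. table: KEY = F label, VALUE = its C labels."""

    def __init__(self):
        self._rows = defaultdict(list)

    def put(self, f, c):
        # Record the mapping [F, C] once; repeated puts are ignored.
        if c not in self._rows[f]:
            self._rows[f].append(c)

    def pairs(self, f):
        # Return the stored mappings as [F, C1], [F, C2], ..., [F, Cn].
        return [(f, c) for c in self._rows.get(f, [])]


meta = MetaTable()
for c_label in ("C1", "C2", "C3"):
    meta.put("F_traffic", c_label)
meta.put("F_traffic", "C2")  # duplicate, has no effect
```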
4) The data units are dynamically stored in storage blocks. The distributed storage system allocates storage resources dynamically with load balancing, and the default storage block size allocated by a storage resource node is BLOCK = 64MB. When the capacity of a heterogeneous data source exceeds the storage block size, the system automatically splits the data source information into several smaller blocks 【N1, N2, ..., Nn-1, Nn】 (note: N1, N2, ..., Nn-1 are 64MB each and Nn <= 64MB), so that no block exceeds the storage block size. See Table 2 (storage unit BLOCK table information).
Table 2: Storage unit BLOCK table information
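The splitting rule in step 4) (all blocks 64MB except a final block of at most 64MB) can be sketched as below. The function names are hypothetical, and 64MB is taken here as 64 × 1024 × 1024 bytes, which is an assumption about the unit.

```python
BLOCK = 64 * 1024 * 1024  # default block size, assuming 64MB means 64 MiB


def split_sizes(total, block=BLOCK):
    """Return the block sizes [N1, ..., Nn] for a source of `total` bytes."""
    if total < 0 or block <= 0:
        raise ValueError("total must be >= 0 and block must be > 0")
    full, rest = divmod(total, block)
    sizes = [block] * full      # N1 .. Nn-1 are full blocks
    if rest:
        sizes.append(rest)      # Nn holds the remainder (<= block)
    return sizes


def split_bytes(data, block=BLOCK):
    """Cut an in-memory byte string into consecutive blocks of at most `block` bytes."""
    return [data[i:i + block] for i in range(0, len(data), block)]
```

For instance, a 150 MiB source yields two full blocks and one 22 MiB block, while a source smaller than one block is kept as a single block.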
5) Semantic similarity calculation is performed on the semantic information of the annotated heterogeneous data. The semantic information annotated in the storage format 【F, C】 is compared, and the similarity algorithm is implemented using the cosine law.
6) Heterogeneous data fusion information is stored: the calculated similarity value is stored in the .INFO. table, and the relationship mapping between the similarity value and the data information of the heterogeneous data is stored in 【KEY:VALUE】 form, achieving the fusion of heterogeneous data. See Table 3 (.INFO. table information), where F denotes Family information and C denotes Column information.
Table 3: Stored .INFO. table information
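One possible shape for the .INFO. mapping of step 6) is sketched below, assuming a plain dictionary keyed by an unordered pair of data information labels. The helper names are hypothetical, and the 0.8 threshold is borrowed from the worked example later in the description.

```python
def put_similarity(table, c_a, c_b, value):
    # KEY: the two data-information labels in sorted order; VALUE: their similarity.
    key = tuple(sorted((c_a, c_b)))
    table[key] = value


def related(table, c, threshold=0.8):
    """List the labels whose stored similarity to `c` exceeds the threshold."""
    matches = []
    for (a, b), value in table.items():
        if value > threshold and c in (a, b):
            matches.append(b if a == c else a)
    return sorted(matches)


info_table = {}
put_similarity(info_table, "C2", "C1", 0.9)  # hypothetical similarity values
put_similarity(info_table, "C1", "C3", 0.3)
```

Sorting the pair before storing it means the lookup does not depend on the order in which the two labels were compared.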
7) Distributed data retrieval information based on the semantic library is established. The index information comprises data source information, data information, node information of the distributed storage and index time; the index storage format is 【F:Name, C:Name, Node:Name, timestamp】.
8) The data retrieval information is stored in the .INDEX. table: the index information of the heterogeneous data is stored in 【KEY:VALUE】 form.
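The index record 【F:Name, C:Name, Node:Name, timestamp】 and its KEY:VALUE storage might look like the following sketch. The choice of `F/C` as the key, the helper names and the node names are all assumptions made for illustration.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    f_name: str      # F:Name, data source information
    c_name: str      # C:Name, data information
    node_name: str   # Node:Name, storage node holding the block
    timestamp: int   # index time


def put_index(table, entry):
    # KEY: "F/C" (assumed key layout); VALUE: all entries for that pair.
    table.setdefault(f"{entry.f_name}/{entry.c_name}", []).append(entry)


def lookup_nodes(table, f_name, c_name):
    """Return the storage nodes that hold blocks for the given F and C labels."""
    return [entry.node_name for entry in table.get(f"{f_name}/{c_name}", [])]


index_table = {}
put_index(index_table, IndexEntry("traffic", "city name", "node-a", 1404109199344))
put_index(index_table, IndexEntry("traffic", "city name", "node-b", 1404109199345))
```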
As shown in Fig. 2 step 1) specifically includes following steps:
1.1) semantic label storehouse creates;
1.2) heterogeneous data source inputs;
1.3) the semantic parsing of data source and/or labeled data source name and/or labeled data source category and/or labeled data Source format and/or labeled data source time;
1.4) the semantic parsing of data message and/or labeled data title and/or labeled data attaching information and/or mark number According to description and/or labeled data time.
As shown in figure 3, step 5) specifically includes following steps:
5.1) two memory cell are read;
5.2) semantic information of the isomeric data of storage is divided by word;
5.3) word frequency is calculated;
5.4) word frequency sequence is obtained;
5.5) the COS angle values of two word frequency sequences are calculated;
5.6) judge whether obtained angle value is more than a, it is similar if being more than;Otherwise, then it is dissimilar.
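Steps 5.2) to 5.6) can be sketched in Python as follows. The function names are hypothetical; dropping punctuation and whitespace before counting is an assumption, and the default threshold of 0.8 for `a` is taken from the worked example later in the description.

```python
import math
from collections import Counter


def char_frequencies(text):
    # Steps 5.2 and 5.3: split by character and count occurrences.
    # Punctuation and whitespace are skipped (an assumption of this sketch).
    return Counter(ch for ch in text if ch.isalnum())


def cosine_similarity(text_a, text_b):
    fa, fb = char_frequencies(text_a), char_frequencies(text_b)
    # Step 5.4: build two frequency sequences over the shared vocabulary;
    # a character missing from one text contributes a frequency of 0.
    vocab = sorted(set(fa) | set(fb))
    x = [fa[ch] for ch in vocab]
    y = [fb[ch] for ch in vocab]
    # Step 5.5: cosine of the angle between the two sequences.
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def is_similar(text_a, text_b, a=0.8):
    # Step 5.6: similar when the cosine value exceeds the threshold a.
    return cosine_similarity(text_a, text_b) > a
```

Returning 0.0 when either text is empty avoids a division by zero; the patent does not say how that case should be handled.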
Suppose there are a traffic data source 【F1】 and a weather data source 【F2】, with traffic data information 【C1】 and weather data information 【C2】. The semantic definition of stored data information C1 is then: 【city name, weather data source, represents the current city, 1404109199352】. Note: 1404109199352 is the timestamp describing the current weather state.
The semantic definition of data information C2 is: 【city name, traffic data source, represents the current city, 1404109199344】. Note: 1404109199344 is the timestamp describing the current traffic state.
The semantic information text of C1 is split character by character, giving the semantic dictionary Z1c1, Z1c2, Z1c3, Z1c4, ..., Z1cn: 【城, 市, 名, 称, 天, 气, 数, 据, 源, 表, 示, 当, 前, 城, 市】 (the characters of "city name", "weather data source" and "represents the current city"), which is converted to GB2312 codes 【1730, 3901, 3286, 1729, 4117, 3565, 3946, 2786, 4813, 1580, 3883, 1896, 3587】.
The frequencies of occurrence are Z1n1: 【2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1】.
The semantic information text of C2 is split character by character, giving the semantic dictionary Z2c1, Z2c2, Z2c3, Z2c4, ..., Z2cn: 【城, 市, 名, 称, 交, 通, 数, 据, 源, 表, 示, 当, 前, 城, 市】 (the characters of "city name", "traffic data source" and "represents the current city"), which is converted to GB2312 codes 【1730, 3901, 3286, 1729, 2658, 4143, 3946, 2786, 4813, 1580, 3883, 1896, 3587】.
The frequencies of occurrence are Z2n1: 【2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1】.
The two dictionaries are compared, and characters that appear in only one of them are added to the other, with the corresponding frequency in Z1n1 or Z2n1 recorded as 0. This yields two vectors X and Y of the same dimension:
X: (2, 2, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1);
Y: (2, 2, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1);
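Calculation formula (cosine of the angle between X and Y):
$$\mathrm{SimilaryValue} = \cos\theta = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^{2}}\;\sqrt{\sum_{i=1}^{n} Y_i^{2}}}$$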
Result: a SimilaryValue from 0 to 0.8 indicates that the two are dissimilar; a value from 0.8 to 1 indicates that they are similar.
The semantic similarity between Z1 and Z2 is calculated by substituting the vectors X and Y into the formula.
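As a check on the arithmetic, the cosine of the two vectors X and Y given above can be evaluated directly. This short sketch uses only the vectors and the 0.8 boundary stated in the text; the variable names are illustrative, and the value it produces is computed here rather than quoted from the patent.

```python
import math

# Vectors X and Y exactly as listed in the worked example.
X = (2, 2, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1)
Y = (2, 2, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)

dot = sum(x * y for x, y in zip(X, Y))   # 17
norm_x_sq = sum(x * x for x in X)        # 19
norm_y_sq = sum(y * y for y in Y)        # 19

similarity = dot / (math.sqrt(norm_x_sq) * math.sqrt(norm_y_sq))  # 17/19, about 0.895
is_similar = similarity > 0.8            # above the 0.8 boundary, so "similar"
```

Under the stated ranges, a value of about 0.895 falls in the 0.8 to 1 band, so the weather and traffic annotations would be classed as similar.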
The distributed storage method splits heterogeneous data into blocks. Whatever the format of the data source, each block defaults to 64MB and the last block is less than or equal to 64MB, so that big data can be distributed across different storage nodes. Synchronization and retrieval of the data are managed through the .META. semantic information table stored beforehand, thereby realizing the storage of heterogeneous data sources.
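The patent does not state how blocks are assigned to storage nodes beyond mentioning load balancing. One simple possibility, shown purely as an assumption, is a greedy policy that always places the next block on the node currently holding the least data; the function and node names are hypothetical.

```python
import heapq


def assign_blocks(block_sizes, nodes):
    """Greedy least-loaded placement: returns (block index, node name) pairs."""
    # Heap of (bytes already placed, node name); ties break by node name.
    heap = [(0, name) for name in nodes]
    heapq.heapify(heap)
    placement = []
    for index, size in enumerate(block_sizes):
        used, name = heapq.heappop(heap)          # least-loaded node so far
        placement.append((index, name))
        heapq.heappush(heap, (used + size, name))
    return placement


# Three blocks of 64, 64 and 22 units spread over two hypothetical nodes.
plan = assign_blocks([64, 64, 22], ["node-a", "node-b"])
```

The resulting (block, node) pairs are the kind of node information that the index records of step 7) would need to capture.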
The distributed storage method for heterogeneous data based on semantic annotation fuses data mainly by semantically annotating heterogeneous data sources. In this way, heterogeneous data resources that were originally independent of one another are associated, information from different data sources is fused, and semantics make the data more intelligent. Distributed storage stores heterogeneous big data by splitting the data into blocks; the data can be extended and storage space can be added dynamically, so that big data is not limited by storage capacity.
The present invention realizes fused data storage, in particular of heterogeneous data. Existing stored data are mutually independent and heterogeneous, with no semantic association. The heterogeneous data distributed storage system based on semantic annotation solves both the problem that data are hard to fuse and lack semantics and the problem of distributed storage of heterogeneous big data. The invention proposes a method for semantically annotating data sources and a technique for computing the semantic similarity of data sources, filling a gap in the market for this function. By storing the semantic label libraries and the heterogeneous data unit contents in a distributed manner, distributed storage of heterogeneous data is achieved.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention.

Claims (9)

1. A distributed storage method for heterogeneous data based on semantic annotation, characterized by comprising the following steps:
1) establishing a heterogeneous data source semantic label library and a heterogeneous data information semantic label library;
2) establishing the content of heterogeneous data storage units;
3) dynamically storing the heterogeneous data source semantic label library and the heterogeneous data information semantic label library in the .META. table, and storing the relationship mapping between data source semantics and data information semantics;
4) dynamically storing the data units in storage blocks;
5) performing semantic similarity calculation on the semantic information of the annotated heterogeneous data;
6) storing heterogeneous data fusion information: the calculated similarity value is stored in the .INFO. table, and the relationship mapping between the similarity value and the data information of the heterogeneous data is stored;
7) establishing distributed data retrieval information based on the semantic library;
8) storing the data retrieval information in the .INDEX. table.
2. The distributed storage method for heterogeneous data based on semantic annotation according to claim 1, characterized in that step 1) specifically comprises the following steps:
1.1) creating the semantic label library;
1.2) inputting the heterogeneous data source;
1.3) parsing the data source semantics and/or annotating the data source name and/or annotating the data source category and/or annotating the data source format and/or annotating the data source time;
1.4) parsing the data information semantics and/or annotating the data name and/or annotating the data attribution information and/or annotating the data description and/or annotating the data time.
3. The distributed storage method for heterogeneous data based on semantic annotation according to claim 1, characterized in that step 4) further comprises the following:
When the capacity of a heterogeneous data source exceeds the storage block size, the system automatically splits the data source information into several smaller blocks, each no larger than the storage block size.
4. The distributed storage method for heterogeneous data based on semantic annotation according to claim 3, characterized in that the storage block size is 64M.
5. The distributed storage method for heterogeneous data based on semantic annotation according to claim 1, characterized in that step 5) specifically comprises the following steps:
5.1) reading two storage units;
5.2) splitting the semantic information of the stored heterogeneous data character by character;
5.3) calculating the character frequencies;
5.4) obtaining the frequency sequences;
5.5) calculating the cosine of the angle between the two frequency sequences.
6. The distributed storage method for heterogeneous data based on semantic annotation according to claim 5, characterized in that step 5) specifically comprises the following step:
5.6) judging whether the obtained cosine value is greater than a; if so, the two are similar; otherwise, they are dissimilar.
7. The distributed storage method for heterogeneous data based on semantic annotation according to any one of claims 1-6, characterized in that the heterogeneous data source semantic label comprises: data source name and/or data source category and/or data source description and/or data source format and/or data source creation time.
8. The distributed storage method for heterogeneous data based on semantic annotation according to claim 7, characterized in that the heterogeneous data information semantic label comprises: data name and/or data attribution information and/or data description and/or data creation time.
9. The distributed storage method for heterogeneous data based on semantic annotation according to claim 8, characterized in that the index information comprises data source information and/or data information and/or node information of the distributed storage and/or index time.
CN201710608703.6A 2017-07-26 2017-07-26 A distributed storage method for heterogeneous data based on semantic annotation Pending CN107515902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710608703.6A CN107515902A (en) 2017-07-26 2017-07-26 A distributed storage method for heterogeneous data based on semantic annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710608703.6A CN107515902A (en) 2017-07-26 2017-07-26 A distributed storage method for heterogeneous data based on semantic annotation

Publications (1)

Publication Number Publication Date
CN107515902A (en) 2017-12-26

Family

ID=60722494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710608703.6A Pending CN107515902A (en) 2017-07-26 2017-07-26 A distributed storage method for heterogeneous data based on semantic annotation

Country Status (1)

Country Link
CN (1) CN107515902A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076366A (en) * 2021-04-09 2021-07-06 南京邮电大学 Intelligent lamp pole virtualization method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156726A (en) * 2011-04-01 2011-08-17 中国测绘科学研究院 Geographic element querying and extending method based on semantic similarity
US20120078595A1 (en) * 2010-09-24 2012-03-29 Nokia Corporation Method and apparatus for ontology matching
CN102609854A (en) * 2011-01-25 2012-07-25 青岛理工大学 Client partitioning method and device based on unified similarity calculation
CN104679823A (en) * 2014-12-31 2015-06-03 智慧城市信息技术有限公司 Semantic annotation-based association method and system of heterogeneous data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076366A (en) * 2021-04-09 2021-07-06 南京邮电大学 Intelligent lamp pole virtualization method
CN113076366B (en) * 2021-04-09 2023-01-24 南京邮电大学 Intelligent lamp pole virtualization method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171226