CN109783484A

CN109783484A - Construction method and system for a knowledge-graph-based data service platform

Info

Publication number: CN109783484A
Application number: CN201811640313.8A
Authority: CN
Inventors: 徐汕; 梁炬; 黄文锋; 张晶亮; 刘强; 单酉; 杨端; 卫未
Original assignee: Beijing Aerospace Cloud Co Ltd
Current assignee: Beijing Aerospace Cloud Co Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-05-21

Abstract

The invention discloses a construction method and system for a knowledge-graph-based data service platform. The method comprises the following steps: cleaning multi-source heterogeneous data; querying the cleaned data and generating resource IDs for the queried data through Redis; and constructing an OWL ontology, managing plug-ins, and storing the data in a column-oriented database. The beneficial effects of the invention are that data is stored flexibly in an object-oriented manner, the knowledge contained in unstructured and semi-structured data is fully mined, and high-quality structured data can be provided for various downstream application fields.

Description

Construction method and system for a knowledge-graph-based data service platform

Technical field

The present invention relates to the field of the industrial Internet of Things, and in particular to a construction method and system for a knowledge-graph-based data service platform.

Background art

A knowledge graph is intended to describe the various entities or concepts that exist in the real world and the relationships between them. First, each entity is identified by a globally unique ID, just as every person has an identity card number. Second, attribute-value pairs describe the intrinsic characteristics of an entity, and relationships connect two entities and describe the association between them.

The rapid development of information technology, especially the Internet, has ushered in the era of big data. Every industry generates an enormous amount of fragmented data every day, and units of data measurement have grown from Byte, KB, MB, GB, and TB to PB, EB, ZB, and YB, and even to BB, NB, and DB. Acquiring big data is no longer a technical problem, but the knowledge it contains resides largely in unstructured text data, in large numbers of semi-structured tables and web pages, and in the structured data of production systems. Traditional data storage uses relational databases, which are complex to design, highly redundant, and inefficient to query, and which cannot directly yield the latent semantic information in the data that requires reasoning and mining.

No effective solution has yet been proposed for these problems in the related art.

Summary of the invention

In view of the above technical problems in the related art, the present invention proposes a construction method and system for a knowledge-graph-based data service platform that can store data flexibly in an object-oriented manner, fully mine the knowledge contained in the data, and help provide high-quality structured data for various downstream application fields.

To achieve the above technical purpose, the technical solution of the present invention is realized as follows:

A construction method for a knowledge-graph-based data service platform, comprising the following steps:

Cleaning multi-source heterogeneous data;

Querying the cleaned data, and generating resource IDs for the queried data through Redis;

Constructing an OWL ontology and managing plug-ins, and storing the data in a column-oriented database.

Further, cleaning the multi-source heterogeneous data comprises:

Loading ETL plug-ins for the different data sources to obtain ETL rules, and obtaining the relationships between entities after the entities are constructed;

Calling the resource service subsystem to obtain resource IDs;

Generating structured data objects from the data that has been converted into resources.

Further, before the multi-source heterogeneous data is cleaned, the method also comprises collecting the multi-source heterogeneous data using a data collection client.

Further, the data collection client comprises a data acquisition program component, an association ID generation component, an association ID sending component, and a passive service response component.

Further, querying the cleaned data comprises:

Obtaining a global ID using a full-text search engine;

In a graph database, retrieving the interrelated entities according to the global ID and returning all associated IDs;

In a distributed data storage system, indexing the structured data according to the associated IDs and returning the corresponding attribute results.

Another aspect of the present invention provides a construction system for a knowledge-graph-based data service platform, comprising:

A data cleaning module, configured to clean multi-source heterogeneous data;

A resource service subsystem module, configured to query the cleaned data and generate resource IDs for the queried data through Redis;

An ontology management module, configured to construct an OWL ontology, manage plug-ins, and store the data in a column-oriented database.

Further, the data cleaning module comprises:

An entity construction module, configured to load ETL plug-ins for the different data sources to obtain ETL rules, and to obtain the relationships between entities after the entities are constructed;

A resource conversion module, configured to call the resource service subsystem to obtain resource IDs;

A structured data object module, configured to generate structured data objects from the data that has been converted into resources.

Further, the system also comprises a data acquisition module, configured to collect multi-source heterogeneous data using a data collection client.

Further, the data collection client in the data acquisition module comprises a data acquisition program component, an association ID generation component, an association ID sending component, and a passive service response component.

Further, the data query module comprises:

A global ID module, configured to obtain a global ID using a full-text search engine;

An association ID module, configured to retrieve the interrelated entities in a graph database according to the global ID and return all associated IDs;

A structured data module, configured to index the structured data in a distributed data storage system according to the associated IDs and return the corresponding attribute results.

Beneficial effects of the present invention: data is stored flexibly in an object-oriented manner, the knowledge contained in the data is fully mined, and high-quality structured data can be provided for various downstream application fields.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the construction method for a knowledge-graph-based data service platform according to an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of the construction system for a knowledge-graph-based data service platform according to an embodiment of the present invention;

Fig. 3 is an overall architecture diagram of the construction system for a knowledge-graph-based data service platform according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.

As shown in Fig. 1, a construction method for a knowledge-graph-based data service platform according to an embodiment of the present invention comprises the following steps:

Cleaning multi-source heterogeneous data;

Specifically, the resource service subsystem generates resource IDs by itself through Redis and provides a resource service interface. After data acquisition and data cleaning, industrial object data applies to the global ID generation module for an ID; once an object has obtained its global ID, the ID is stored synchronously in each storage medium so that associated queries remain possible. The global ID generation module is implemented on the counter function of the Redis database and can generate self-incrementing IDs of type long. Because Redis natively supports thread safety, the IDs requested by entity objects are guaranteed to be unique under multithreaded conditions.
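The counter-based ID generation described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: `InMemoryCounter` is a hypothetical stand-in that mimics only the atomic increment of a Redis counter so that the example runs without a server, and the key name `global:id` is an assumption.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryCounter:
    """Hypothetical stand-in for a Redis client: exposes only an atomic incr(key)."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def incr(self, key):
        # Redis executes INCR atomically; a lock emulates that behaviour here.
        with self._lock:
            self._values[key] = self._values.get(key, 0) + 1
            return self._values[key]


class GlobalIdGenerator:
    """Hands out self-incrementing integer IDs from a shared counter."""

    def __init__(self, client, key="global:id"):  # key name is an assumption
        self._client = client
        self._key = key

    def next_id(self):
        return self._client.incr(self._key)


def allocate_concurrently(generator, count, workers=8):
    """Request `count` IDs from several threads and return all of them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: generator.next_id(), range(count)))
```

Because every ID comes from a single atomic increment, concurrent callers cannot receive the same value. A `redis.Redis` client from the redis-py package also exposes an `incr` method, so it could presumably be passed in place of the stand-in.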

Specifically, ontology management constructs the OWL ontology according to business requirements and implements create, read, update, and delete operations on the ontology as well as ontology reasoning. The specific steps are: converting the designed ontology into an OWL file with a tool and importing it into the system; implementing functions such as modifying, querying, and deleting the ontology; and implementing rule-based reasoning over the ontology.

Plug-in management provides version management, such as online upgrade and hot fixes of plug-ins, and management of the mapping between plug-ins and ontologies.

Data storage includes entity data storage and relationship data storage;

Entity data storage specifically includes:

The storage system stores industrial entities in HBase. HBase is a distributed, column-oriented database. An HBase table can have several column families, and each column family can store multiple key-value pairs. Each row of data is identified by a row key (Rowkey), and the number of key-value pairs in each row can vary flexibly. To balance load across HBase regions, this design uses the reversed string of the global ID as the row key of the HBase table. Each row stores only the non-empty fields of the data to reduce space usage. HBase is used only to look up the detailed data of an industrial entity by global ID, so no further auxiliary design is needed.
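The row-key design above can be illustrated with a short sketch. It only builds the key and the cell dictionary and does not talk to HBase; the column family name `info` and the helper names are assumptions made for the example.

```python
def row_key(global_id):
    """Reverse the decimal string of a global ID to use as the row key."""
    return str(global_id)[::-1]


def to_row(global_id, fields, column_family="info"):  # family name is an assumption
    """Build (row key, cells), keeping only the non-empty fields."""
    cells = {
        f"{column_family}:{name}": str(value)
        for name, value in fields.items()
        if value is not None and value != ""
    }
    return row_key(global_id), cells
```

Consecutive self-incrementing IDs such as 1000, 1001, and 1002 share a long common prefix, so unreversed keys would sort next to each other. After reversal their keys begin with 0, 1, and 2, which spreads them across the key space.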

Relationship data storage specifically includes:

Neo4j is a graph database that is well suited to storing the relationships between different pieces of data. A Neo4j graph contains two kinds of data: nodes and relationships. A node can have multiple attributes in key-value form, and a relationship can be directed or undirected. Neo4j assigns each node a built-in ID. To reduce the space occupied by data in Neo4j, this design uses Neo4j only to store the relationship data between entities and does not store the specific attributes of the entities. Specifically, the nodes in Neo4j are divided into two classes: entity objects and dimension data. Besides the ID value that Neo4j assigns automatically, an entity object is given an additional ID attribute that stores the global ID of the object. Dimension data refers to field values through which different entities are associated, such as industry category, product category, and geographic location.

ETL plug-ins are loaded for the different data sources to obtain ETL rules based on a conditional random field model; entities are then constructed and the relationships between the entities are obtained;

A conditional random field (CRF or CRFs) is a discriminative probabilistic model and a kind of random field, commonly used to label or analyze sequence data such as natural language text or biological sequences. Like a Markov random field, a conditional random field is an undirected graphical model: the vertices of the graph represent random variables, and the edges between vertices represent the dependencies between those random variables. In a conditional random field, the distribution of the random variable Y is a conditional probability, and the given observations are the random variable X. In principle, the graph layout of a conditional random field can be arbitrary; the layout in common use is the chain structure, for which efficient algorithms exist for training, inference, and decoding.

Conditional random fields are used for lexical analysis tasks such as Chinese word segmentation and part-of-speech tagging. General sequence classification usually uses a hidden Markov model (HMM), for example in class-based Chinese word segmentation, but a hidden Markov model rests on two assumptions: the output independence assumption and the Markov assumption. The output independence assumption requires the sequence data to be strictly mutually independent for the derivation to be correct, whereas in practice most sequence data cannot be represented as a series of independent events. A conditional random field instead uses a probabilistic graphical model that can express long-distance dependencies and overlapping features, handles problems such as label (classification) bias better, and normalizes all features globally, so that a globally optimal solution can be found;

A conditional random field over the observed variable X and the random variable Y is defined as follows:

Let $G = (V, E)$ denote a graph such that $Y = (Y_v)_{v \in V}$, that is, the random variable Y is indexed by the vertices of G. When each random variable $Y_v$, conditioned on the observed variable X, obeys the Markov property with respect to the graph structure, (X, Y) is called a conditional random field:

$$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$$

where $w \sim v$ means that w and v are adjacent vertices in G. Several well-known open-source implementations of the CRF algorithm already exist and are widely used in academic research and industrial applications.
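For the chain-structured layout mentioned earlier, the conditional distribution is commonly written in the form below. This is the standard textbook formulation, not a formula taken from the original text; the feature functions $f_k$, the weights $\lambda_k$, the sequence length $T$, and the normalizer $Z(X)$ are notation introduced here for illustration.

```latex
p(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(Y_{t-1}, Y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(Y'_{t-1}, Y'_t, X, t) \Big)
```

The sum in $Z(X)$ runs over all label sequences, which is the global normalization referred to above, and each feature function depends only on adjacent labels, which matches the Markov property with respect to a chain graph.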

CRF++ is a simple, customizable, open-source conditional random field (CRF) tool for segmenting and labeling sequential data. CRF++ is designed for general-purpose use and can be applied to a variety of natural language processing (NLP) tasks, such as named entity recognition, information extraction, and text chunking.

Taking industry data cleaning as an example, the calculation steps are as follows:

1. Obtain irregular industry data from the data source; the data includes the level-1 industry classification Org1 and the level-2 industry classification Org2;

2. Process the raw data as follows, depending on what was read (Standard1 and Standard2 denote the level-1 and level-2 categories of the standard industry classification):

When neither Org1 nor Org2 is empty:

1) If Org1 = Standard1 and Org2 = Standard2, return (Org1, Org2);

2) If Org1 = Standard2 and Org2 = Standard1, return (Org2, Org1);

3) If Org2 = Standard2 and Org1 ≠ Standard1, return (Standard1, Org2);

4) If Org1 = Standard2 and Org2 ≠ Standard1, return (Standard1, Org1);

5) If Org1 = Standard1 and Org2 ≠ Standard2, segment Org2 with the CRF model, remove stop words, and denote the result as set A. Segment each level-2 industry entry in the standard industry classification with the CRF model and remove stop words, so that each level-2 industry yields one set; the collection of sets for all industries is denoted LIST(B). Then calculate the Jaccard distance between A and each set in LIST(B), select the standard level-2 industry classification Min with the smallest Jaccard distance, and return (Org1, Min);

6) If Org2 = Standard1 and Org1 ≠ Standard2, segment Org1 with the CRF model, remove stop words, and denote the result as set A. Segment each level-2 industry entry in the standard industry classification with the CRF model and remove stop words, so that each level-2 industry yields one set; the collection of sets for all industries is denoted LIST(B). Then calculate the Jaccard distance between A and each set in LIST(B), select the standard level-2 industry classification Min with the smallest Jaccard distance, and return (Org2, Min);

7) If Org1 ≠ Standard1 and Org2 ≠ Standard2, concatenate Org1 and Org2, segment the result with the CRF model, remove stop words, and denote the result as set A. For each entry in the standard industry classification, concatenate its level-1 and level-2 industry strings, segment the result with the CRF model, and remove stop words, so that each industry yields one set; the collection of sets for all industries is denoted LIST(B). Then calculate the Jaccard distance between A and each set in LIST(B), select the standard level-1 industry classification Min1 and the standard level-2 industry classification Min2 with the smallest Jaccard distance, and return (Min1, Min2);

When Org1 is not empty and Org2 is empty:

1) If Org1 = Standard2, return (Standard1, Org1);

2) If Org1 = Standard1, return (Org1, Standard2);

3) If Org1 ≠ Standard1 and Org1 ≠ Standard2, segment Org1 with the CRF model, remove stop words, and denote the result as set A. For each entry in the standard industry classification, concatenate its level-1 and level-2 industry strings, segment the result with the CRF model, and remove stop words, so that each industry yields one set; the collection of sets for all industries is denoted LIST(B). The Jaccard distance between A and each set in LIST(B) is then calculated as in case 7) above.

When Org2 is not empty and Org1 is empty:

1) If Org2 = Standard2, return (Standard1, Org2);

2) If Org2 = Standard1, return (Org2, Standard2);

3) If Org2 ≠ Standard1 and Org2 ≠ Standard2, segment Org2 with the CRF model, remove stop words, and denote the result as set A. For each entry in the standard industry classification, concatenate its level-1 and level-2 industry strings, segment the result with the CRF model, and remove stop words, so that each industry yields one set; the collection of sets for all industries is denoted LIST(B). The Jaccard distance between A and each set in LIST(B) is then calculated as in case 7) above.

The steps above describe the processing of a single record. Large volumes of data can be processed in parallel, which significantly improves processing efficiency.
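The Jaccard-based matching used in cases 5) to 7) can be sketched as follows. This is a simplified illustration under stated assumptions: the Jaccard distance is taken in its usual form, 1 - |A ∩ B| / |A ∪ B|; `tokenize` is a whitespace-splitting placeholder standing in for CRF word segmentation; and the stop-word list, the function names, and the English category names in the usage note are all hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


STOP_WORDS = {"and", "of", "the"}  # hypothetical stop-word list


def tokenize(text):
    """Placeholder for CRF word segmentation: lowercase whitespace split."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def closest_standard(raw, standard_names):
    """Return the standard category whose token set is nearest to `raw`."""
    a = tokenize(raw)
    return min(standard_names, key=lambda name: jaccard_distance(a, tokenize(name)))


def clean_case5(org1, org2, standard_level2):
    """Case 5): Org1 is a standard level-1 name, Org2 is not a standard level-2 name."""
    return org1, closest_standard(org2, standard_level2)
```

For example, with the standard level-2 names "Automobile Manufacturing", "Software Development", and "Food Processing", the raw value "manufacturing of automobile parts" shares two of its three tokens with "Automobile Manufacturing" and is mapped to it. Because each record is cleaned independently of the others, a function like `clean_case5` could be applied to many records in parallel.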

The resource service subsystem is called to obtain a resource ID for the processed data, which converts the data into a resource;

Data structuring turns the resource-converted data into structured data objects, that is, row data stored in a database that can be expressed logically with a two-dimensional table structure.

In a specific embodiment of the present invention, before the multi-source heterogeneous data is cleaned, the method further comprises collecting the multi-source heterogeneous data using a data collection client. Data acquisition specifically includes:

Raw data is imported directly, or different data collection clients are provided to collect the multi-source heterogeneous data. A data collection client comprises: a data acquisition program component, for obtaining unstructured data and its corresponding description information; an association ID generation component, for assigning a unique association ID to the description information of the unstructured data; an association ID sending component, for sending the association ID to the service management server of the client so that the unstructured data is associated with its corresponding structured data; and a passive service response component, for sending the unstructured data and its corresponding description information to the data acquisition platform in response to a passive data acquisition service.

A unified acquisition interface obtains structured and time-series data from internal business systems such as EMS, CPS, CRM, and SRM. In addition to acquiring Internet data over the common HTTP protocol, it supports industrial protocols of all types, such as ModBus, OPC, CAN, ControlNet, DeviceNet, Profibus, and Zigbee, and even the various proprietary industrial protocols developed by individual automation equipment manufacturers and integrators, so that data carried over different protocols can be parsed and collected effectively.

In a specific embodiment of the present invention, the data collection client comprises a data acquisition program component, an association ID generation component, an association ID sending component, and a passive service response component.

In a specific embodiment of the present invention, querying the cleaned data comprises:

A full-text search engine is queried by keyword and returns the unique global ID. This involves index creation and retrieval, where 1) index creation specifically includes:

The files to be indexed are the unstructured data, including industrial data, stored in the full-text database. The original data is passed to a tokenizer, which splits it into individual words, removes punctuation, and removes stop words. The resulting tokens are passed to a language processing component, which produces a series of terms. The terms are passed to an indexing component, which uses them to build a dictionary, sorts the dictionary alphabetically, and merges identical terms into inverted document lists (posting lists). The index is then written to disk through index storage. At this point the index has been created and can be used to find the desired data.

2) Retrieval specifically includes:

The user enters a query statement;

A query statement, like ordinary language, has a grammar, and that grammar differs according to the implementation of the full-text retrieval system.

The query statement is processed by syntactic analysis and language analysis to obtain a series of terms;

A query tree is obtained through syntactic analysis;

The index is read into memory through index storage;

The index is searched using the query tree to obtain the document list of each term; the document lists are combined by intersection and difference operations to obtain the result documents;

In the inverted index table, the document list containing each keyword is found. The lists containing the keywords are merged to obtain the list of documents that contain both keyword 1 and keyword 2; a difference operation is then applied across the lists to obtain the final document list, and the query result is returned.
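The index creation and retrieval steps above can be sketched with a small in-memory inverted index. This is a simplified illustration, not the full-text engine used by the invention: the language processing step is omitted, the stop-word list is hypothetical, and the documents are assumed to be keyed by their global IDs.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of", "is"}  # hypothetical stop-word list


def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase, drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(documents):
    """Map each term to the sorted list of global IDs whose text contains it."""
    postings = defaultdict(set)
    for global_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(global_id)
    # Sort the dictionary alphabetically and each posting list by ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search(index, must, must_not=()):
    """Intersect the posting lists of `must`, then subtract those of `must_not`."""
    result = None
    for term in must:
        ids = set(index.get(term.lower(), []))
        result = ids if result is None else result & ids
    result = result or set()
    for term in must_not:
        result -= set(index.get(term.lower(), []))
    return sorted(result)
```

Here the intersection corresponds to the merge of the keyword lists and the set subtraction corresponds to the difference operation; the returned values are the global IDs of the matching documents.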

In the graph database, the interrelated entities are retrieved according to the global ID and all associated IDs are returned. This specifically includes:

1) Graph data structure modeling

By analyzing the data, including industrial data, the entity nodes of each piece of information and the relationships between entities are extracted, and a graph structure model of the data is generated from the entity nodes and their relationships.

2) Data indexing

An index is used to determine where in the graph database to start; the index of a graph database looks up nodes or relationships by specific attribute values;

3) User query statement input

The grammar of the query statement depends on the database in use;

4) Data traversal

Traversal is based on depth-first and breadth-first algorithms; the better algorithm is selected according to how it performs on the graph data model.

5) Return the query result.
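The traversal step can be illustrated with a breadth-first search over an adjacency mapping keyed by global ID. This is a sketch under stated assumptions: the graph is held in a plain dictionary rather than in a graph database, and the function name and the `max_depth` parameter are hypothetical.

```python
from collections import deque


def related_ids(graph, start_id, max_depth=1):
    """Breadth-first search returning the IDs reachable within `max_depth` hops."""
    seen = {start_id}
    queue = deque([(start_id, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, depth + 1))
    return found
```

With `max_depth=1` the function returns only the directly associated IDs; larger values follow longer chains of relationships. A depth-first variant would replace the queue with a stack.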

In the distributed data storage system, the structured data is indexed according to the associated IDs and the corresponding attribute results are returned. This specifically includes:

1) Attribute retrieval

The user enters a query statement; the grammar of the query statement depends on the database in use, which here is a non-relational database;

2) Query the corresponding structured data in the database according to the global ID

Using the information about -ROOT- and .META. held in its internal cache, the client connects directly to the HRegionServer that holds the data matching the request and locates on that server the region corresponding to the request. The request first queries the region's in-memory cache, the memstore;

If the result is found in the memstore, it is returned directly to the client. If no matching data is found in the memstore, the data in the persisted storefile files is read next. A storefile is a tree-structured file sorted by key, and HBase reads disk files in units of its basic I/O unit;

If the requested data is found in the BlockCache, the result is returned. Otherwise, the corresponding storefile is read block by block; if a block that has been read does not contain the requested data, the block is placed in the BlockCache of the HRegionServer and the next block is read. This loop continues until the requested data is found and returned. If none of the data in the region matches, null is returned, indicating that no matching data was found;

3) Parse the structured data and return the data in the expected form.
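The lookup order described in step 2) can be modelled with a toy function. This is a heavily simplified sketch, not HBase code: the memstore and each block are represented as plain dictionaries, the block cache as a list, and all names are hypothetical.

```python
def read_value(row_key, memstore, block_cache, store_file_blocks):
    """Look up a row in the order memstore -> block cache -> store file blocks."""
    if row_key in memstore:
        return memstore[row_key]
    for block in block_cache:
        if row_key in block:
            return block[row_key]
    for block in store_file_blocks:
        if row_key in block:
            return block[row_key]
        # As described above, a block read without a hit is placed in the cache.
        if block not in block_cache:
            block_cache.append(block)
    return None  # no matching data in this region
```

The function returns `None` when no block contains the row, mirroring the null result described above.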

As shown in Fig. 2, another aspect of the present invention provides a construction system for a knowledge-graph-based data service platform, comprising:

A data cleaning module, configured to clean multi-source heterogeneous data;

In a specific embodiment of the present invention, the data cleaning module comprises:

In a specific embodiment of the present invention, the system further comprises a data acquisition module, configured to collect multi-source heterogeneous data using a data collection client, as follows:

In a specific embodiment of the present invention, the client in the data acquisition module comprises a data acquisition program component, an association ID generation component, an association ID sending component, and a passive service response component.

In a specific embodiment of the present invention, the data query module comprises:

A global ID module, configured to obtain a global ID using a full-text search engine, where 1) index creation specifically includes:

2) Retrieval specifically includes:

The user enters a query statement;

A query tree is obtained through syntactic analysis;

The index is read into memory through index storage;

An association ID module, configured to retrieve the interrelated entities in a graph database according to the global ID and return all associated IDs, as follows:

1) Graph data structure modeling

2) Data indexing

3) User query statement input

The grammar of the query statement depends on the database in use;

4) Data traversal

5) Return the query result.

A structured data module, configured to index the structured data in a distributed data storage system according to the associated IDs and return the corresponding attribute results, as follows:

1) Attribute retrieval

3) Parse the structured data and return the data in the expected form.

To facilitate understanding of the above technical solution of the present invention, it is described in detail below in terms of a specific mode of use.

In specific use, the construction system for a knowledge-graph-based data service platform according to the present invention is organized, from the perspective of business logic, as shown in Fig. 3. The bottom layer is the data storage layer: MySQL stores the raw data and supplies it upward to the data collection clients; Redis is responsible for generating resource IDs, supporting the data resources produced by ETL rule processing; Kafka receives the data collected through the data acquisition interface; and all fully processed data is stored in the graph database. Above the data storage layer is the data persistence layer, which provides a mapping solution between objects and relational databases. The service layer covers operations such as modifying, querying, and deleting plug-ins and ontologies, and provides resource ID services, external data query services, and the like for the data. The Web layer provides parameter validation for plug-in and ontology management, while the specific business logic is handled by the service layer. The top layer provides terminal displays such as the plug-in Web page and the ontology management Web page, and open interfaces such as resource ID, acquisition service, and data query.

In summary, by means of the above technical solution of the present invention, data is stored flexibly in an object-oriented manner, the knowledge contained in unstructured and semi-structured data is fully mined, and high-quality structured data can be provided for various downstream application fields.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A construction method for a knowledge-graph-based data service platform, characterized by comprising the following steps:

Cleaning multi-source heterogeneous data;

2. The construction method for a knowledge-graph-based data service platform according to claim 1, characterized in that cleaning the multi-source heterogeneous data comprises:

Calling the resource service subsystem to obtain resource IDs;

Generating structured data objects from the data that has been converted into resources.

3. The construction method for a knowledge-graph-based data service platform according to claim 1, characterized in that, before the multi-source heterogeneous data is cleaned, the method further comprises collecting the multi-source heterogeneous data using a data collection client.

4. The construction method for a knowledge-graph-based data service platform according to claim 3, characterized in that the data collection client comprises a data acquisition program component, an association ID generation component, an association ID sending component, and a passive service response component.

5. The construction method for a knowledge-graph-based data service platform according to any one of claims 1-4, characterized in that querying the cleaned data comprises:

Obtaining a global ID using a full-text search engine;

6. A construction system for a knowledge-graph-based data service platform, characterized by comprising:

A data cleaning module, configured to clean multi-source heterogeneous data;

A resource service subsystem module, configured to query the cleaned data and generate resource IDs for the queried data through Redis;

An ontology management module, configured to construct an OWL ontology, manage plug-ins, and store the data in a column-oriented database.

7. The construction system for a knowledge-graph-based data service platform according to claim 6, characterized in that the data cleaning module comprises:

An entity construction module, configured to load ETL plug-ins for the different data sources to obtain ETL rules, and to obtain the relationships between entities after the entities are constructed;

8. The construction system for a knowledge-graph-based data service platform according to claim 6, characterized in that the system further comprises a data acquisition module, configured to collect multi-source heterogeneous data using a data collection client.

9. The construction system for a knowledge-graph-based data service platform according to claim 8, characterized in that the data collection client in the data acquisition module comprises a data acquisition program component, an association ID generation component, an association ID sending component, and a passive service response component.

10. The construction system for a knowledge-graph-based data service platform according to any one of claims 6-9, characterized in that the data query module comprises:

A global ID module, configured to obtain a global ID using a full-text search engine;

An association ID module, configured to retrieve the interrelated entities in a graph database according to the global ID and return all associated IDs;

A structured data module, configured to index the structured data in a distributed data storage system according to the associated IDs and return the corresponding attribute results.