CN108446368A - Method and device for constructing a packaging industry big data knowledge graph - Google Patents

Info

Publication number
CN108446368A
Authority
CN
China
Prior art keywords
packaging industry
data
entity
knowledge graph
web page
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810211761.XA
Other languages
Chinese (zh)
Inventor
李长云
吴岳忠
丁军
朱俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hai Zhi Zhi Mdt Infotech Ltd
Hunan University of Technology
Original Assignee
Shanghai Hai Zhi Zhi Mdt Infotech Ltd
Hunan University of Technology
Application filed by Shanghai Hai Zhi Zhi Mdt Infotech Ltd and Hunan University of Technology
Priority to CN201810211761.XA
Publication of CN108446368A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Abstract

The present invention provides a method and device for constructing a packaging industry big data knowledge graph. While building the packaging knowledge graph, the invention converts unstructured data into structured form, laying the foundation for further semantic analysis and computation. In addition, because packaging industry data are modeled with a knowledge graph, the data schema can be extended.

Description

Method and device for constructing a packaging industry big data knowledge graph
Technical field
The present invention relates to a method and device for constructing a packaging industry big data knowledge graph.
Background
Packaging industry data are scattered across multiple systems, and data from different sources have different structures. Existing technology struggles to aggregate this information, so there is a need for packaging industry data fusion. At the same time, most data on the internet are unstructured and cannot be understood by computers. Moreover, when new business needs are identified, schemas built on traditional relational databases are hard to evolve: changing the data structure and the business logic is difficult, which leads to poor extensibility, high maintenance cost and similar problems.
Summary of the invention
The object of the present invention is to provide a method and device for constructing a packaging industry big data knowledge graph that convert unstructured data into structured form while building the packaging knowledge graph, lay the foundation for further semantic analysis and computation, and, by modeling packaging industry data with a knowledge graph, allow the data schema to be extended.
To solve the above problems, the present invention provides a method for constructing a packaging industry big data knowledge graph, comprising:
Obtaining structured data of the packaging industry, including: using a number of seed terms that represent the packaging industry, searching through the search interfaces of a search engine and an online encyclopedia; for the web documents returned by the search engine, selecting a preset number of top-ranked results as target web pages and adding them to a target web page list; for the pages returned by the online encyclopedia, first entering the corresponding article page and then finding two kinds of links in it, namely external links and the outbound links of its references, and adding the links found to the target web page list as target web pages; performing a first classification of the target web pages in the target web page list by website; for each website identified by the first classification, collecting web page content within the site, with the maximum collection depth for each site set to 3 levels, that is, starting from the site's home page and using a depth-first collection strategy, collecting 3 levels of page content in total for each site; extracting and saving the page content collected from each site, and deleting from the saved content any page in which the frequency of industry keywords is below a preset threshold;
Obtaining, from the structured data of the packaging industry, the data source for building the packaging industry knowledge graph, including: using a clustering algorithm to perform a second classification of the page content by structure, the purpose of which is to group pages with the same structure together, where the features used for clustering include: (a) the depth of the page URL; (b) the words obtained by splitting on "/" the part of the URL that follows the domain name; (c) the length of the page; (d) the number of tags in the page; (e) the respective counts of the main tags in the page, including <div>, <table> and <a>, and their proportions; for each class produced by the second classification whose number of pages exceeds a preset threshold, filtering the page content in that class; parsing the page content of each filtered class with the preset matching template corresponding to that class, thereby obtaining the data source for building the packaging industry knowledge graph, where the matching template of each class is used to: locate each element of the page content in the filtered class within the page by XPath; and, by means of tags including <synonym> and <attribute>, map the information in the element located by the XPath to an element of the knowledge graph;
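As an illustration of the clustering features (a) to (e) listed above, the following minimal sketch computes them for one page with the Python standard library. The function and key names (`page_features`, `url_depth`, and so on) are hypothetical, and the source does not say how the features are encoded or which clustering algorithm consumes them.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts how often each opening tag appears in a page."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def page_features(url, html):
    """Return the structural features (a)-(e) for one page (illustrative only)."""
    path = urlparse(url).path
    tokens = [t for t in path.split("/") if t]   # (b) words after the domain name
    counter = TagCounter()
    counter.feed(html)
    total = sum(counter.counts.values())
    features = {
        "url_depth": len(tokens),                # (a) depth of the URL
        "path_tokens": tokens,                   # (b)
        "page_length": len(html),                # (c) length of the page
        "tag_count": total,                      # (d) number of tags
    }
    for tag in ("div", "table", "a"):            # (e) main tags: count and share
        n = counter.counts.get(tag, 0)
        features[f"{tag}_count"] = n
        features[f"{tag}_ratio"] = n / total if total else 0.0
    return features
```

The numeric entries of the resulting dictionary could then be fed to any standard clustering routine; that choice is left open here because the source does not name one.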
Obtaining the data schema, defined by human experts, for the packaging industry knowledge graph to be built, the data schema following a top-down knowledge graph construction approach;
Understanding the basic structure of the structured data of the packaging industry, including the meaning of each table in the structured data and the associations between tables, and at the same time understanding the structure of the packaging industry knowledge graph to be built that corresponds to the data schema; using the D2R Server structured data mapping tool and, according to a preset mapping specification D2RML for mapping relational databases to semantic data, associating the tables in the structured data of the packaging industry with concepts or entities in the packaging industry knowledge graph to be built, and filling the data source into the packaging industry knowledge graph, thereby building the packaging industry knowledge graph, where the main keywords of the D2RML mapping specification and their functions are as follows:
(a) dbtype: the type of the source database, including mysql, oracle and sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, which specifies the address and port of the database and the database to use;
(c) dbuser: the database user name;
(d) dbpwd: the database password;
(e) table: the source data table;
(f) concept: the target concept of the import;
(g) colname attribute of name: the source column of the entity name;
(h) colname attribute of synonym: the source column of synonymous entities;
(i) tablename attribute of parent: the table name of the parent concept;
(j) colname attribute of attribute: the source column of the attribute, while attrname specifies the attribute name;
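To make the keywords (a) to (j) concrete, the sketch below renders one mapping entry as a Python dictionary and applies it to a small table. Everything here is an assumption made for illustration: the dictionary layout is not the actual D2RML syntax (which the source does not show), the table and column names are invented, and SQLite stands in for the mysql, oracle and sqlserver types named above so that the example runs without a database server (hence no dbuser or dbpwd).

```python
import sqlite3

# Hypothetical mapping entry that reuses the keyword names listed above.
mapping = {
    "dbtype": "sqlite",            # stand-in; the source lists mysql/oracle/sqlserver
    "dburl": ":memory:",
    "table": "product",            # source data table
    "concept": "PackagingProduct", # target concept of the import
    "name": {"colname": "title"},
    "synonym": {"colname": "alias"},
    "parent": {"tablename": "category"},
    "attribute": [{"colname": "material", "attrname": "material"}],
}


def import_entities(conn, m):
    """Turn each row of m['table'] into an entity dict of concept m['concept']."""
    cols = [m["name"]["colname"], m["synonym"]["colname"]]
    cols += [a["colname"] for a in m["attribute"]]
    # Table and column names come from the trusted mapping, not from user input.
    rows = conn.execute(f"SELECT {', '.join(cols)} FROM {m['table']}")
    entities = []
    for name, synonym, *values in rows:
        entities.append({
            "concept": m["concept"],
            "parent_table": m["parent"]["tablename"],
            "name": name,
            "synonyms": [synonym] if synonym else [],
            "attributes": {a["attrname"]: v
                           for a, v in zip(m["attribute"], values)},
        })
    return entities


conn = sqlite3.connect(mapping["dburl"])
conn.execute("CREATE TABLE product (title TEXT, alias TEXT, material TEXT)")
conn.execute("INSERT INTO product VALUES ('corrugated box', 'carton', 'paperboard')")
entities = import_entities(conn, mapping)
```

Each resulting dictionary corresponds to one entity that could be filled into the knowledge graph under the target concept.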
Merging entities from Linked Open Data and online encyclopedias with the entities in the packaging industry knowledge graph already built, including: matching the names and synonym sets of the entities in Linked Open Data and online encyclopedias against the names and synonym sets of the entities in the built packaging industry knowledge graph, and taking the matched results as candidate entity pairs for merging; for each candidate entity pair, comparing the parent concepts of the two entities and, if the parent concepts are the same, merging the pair into the built packaging industry knowledge graph;
Adding to the built packaging industry knowledge graph those entities that do not exist in it but do exist in Linked Open Data and online encyclopedias.
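The two merging steps above can be sketched as follows. This is a minimal illustration under stated assumptions: entities are plain dictionaries with hypothetical keys `name`, `synonyms`, `parent` and `attributes`, matching is a case-insensitive overlap of the name and synonym sets, and an external entity whose name matches but whose parent concept differs is simply skipped, because the source does not say how that case is handled.

```python
def names_of(entity):
    """Name plus synonym set, lower-cased for matching."""
    return {entity["name"].lower()} | {s.lower() for s in entity.get("synonyms", [])}


def merge_external(graph, external):
    """Merge external entities into graph (a list of entity dicts), in place."""
    for ext in external:
        # Candidate pairs: the name/synonym sets overlap.
        candidates = [e for e in graph if names_of(e) & names_of(ext)]
        # Merge only when the parent concepts are the same.
        match = next((e for e in candidates if e["parent"] == ext["parent"]), None)
        if match is not None:
            merged = set(match.get("synonyms", [])) | set(ext.get("synonyms", []))
            if ext["name"] != match["name"]:
                merged.add(ext["name"])
            match["synonyms"] = sorted(merged)
            attrs = match.setdefault("attributes", {})
            for key, value in ext.get("attributes", {}).items():
                attrs.setdefault(key, value)   # keep existing values on conflict
        elif not candidates:
            # Present only in the external source: add it to the graph.
            graph.append(ext)
        # Name match with a different parent concept: left untouched (assumption).
    return graph
```

Keeping existing attribute values on conflict is a design choice of this sketch; the source does not describe conflict resolution.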
According to another aspect of the present invention, a computer-readable storage medium is provided, on which computer-executable instructions are stored, where the computer-executable instructions, when executed by a processor, cause the processor to:
Perform the steps of the construction method set out above, namely: obtain structured data of the packaging industry by seed-term search in a search engine and an online encyclopedia, first classification of the target web pages by website, in-site collection to a maximum depth of 3 levels with a depth-first strategy, and deletion of saved pages whose industry keyword frequency is below a preset threshold; obtain the data source for building the packaging industry knowledge graph by second classification of the page content by structure with a clustering algorithm, filtering of the classes whose page count exceeds a preset threshold, and parsing with the matching template of each class using XPath and tags including <synonym> and <attribute>; obtain the data schema defined by human experts, which follows a top-down knowledge graph construction approach; associate the tables in the structured data with concepts or entities of the knowledge graph to be built, using the D2R Server structured data mapping tool according to the D2RML mapping specification, and fill the data source into the knowledge graph; merge entities from Linked Open Data and online encyclopedias with the entities of the built knowledge graph by matching names and synonym sets and comparing parent concepts; and add to the built knowledge graph those entities that exist only in Linked Open Data and online encyclopedias. Each step, including the clustering features (a) to (e) and the D2RML keywords (a) to (j), is identical to the corresponding step of the method.
The present invention also provides a computing device, comprising:
A processor; and
A memory arranged to store computer-executable instructions, where the executable instructions, when executed, cause the processor to:
Perform the steps of the construction method set out above, namely: obtain structured data of the packaging industry by seed-term search in a search engine and an online encyclopedia, first classification of the target web pages by website, in-site collection to a maximum depth of 3 levels with a depth-first strategy, and deletion of saved pages whose industry keyword frequency is below a preset threshold; obtain the data source for building the packaging industry knowledge graph by second classification of the page content by structure with a clustering algorithm, filtering of the classes whose page count exceeds a preset threshold, and parsing with the matching template of each class using XPath and tags including <synonym> and <attribute>; obtain the data schema defined by human experts, which follows a top-down knowledge graph construction approach; associate the tables in the structured data with concepts or entities of the knowledge graph to be built, using the D2R Server structured data mapping tool according to the D2RML mapping specification, and fill the data source into the knowledge graph; merge entities from Linked Open Data and online encyclopedias with the entities of the built knowledge graph by matching names and synonym sets and comparing parent concepts; and add to the built knowledge graph those entities that exist only in Linked Open Data and online encyclopedias. Each step, including the clustering features (a) to (e) and the D2RML keywords (a) to (j), is identical to the corresponding step of the method.
Compared with the prior art, the present invention converts unstructured data into structured form while building the packaging knowledge graph, laying the foundation for further semantic analysis and computation. In addition, because packaging industry data are modeled with a knowledge graph, the data schema can be extended.
Description of the drawings
Fig. 1 is a flowchart of the automatic industry data source discovery algorithm according to an embodiment of the invention;
Fig. 2 is an example diagram of an information module according to an embodiment of the invention.
Detailed description
To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Based on ontology technology, the packaging engineering knowledge system is described and stored in a database, so that the knowledge system can be dynamically adjusted and extended as technology develops.
1. Packaging engineering knowledge description language
In view of the characteristics of the packaging engineering knowledge system, the system is described with an ontology. An ontology description suited to packaging engineering knowledge needs to be designed; the description language is based on OWL2.0 and can be opened by mainstream OWL2.0-compatible editors. Describing the packaging engineering knowledge system with RDF(S) and SKOS alone is too simple. OWL, in turn, is currently one of the most popular general-purpose ontology description methods: it contains not only definitions of classes, attributes and individuals but also a logical reasoning mechanism. However, the mechanisms it provides are so numerous that it is overly complex for describing the packaging engineering knowledge system. Therefore, the present invention, with reference to SKOS and based on RDF, RDFS and OWL, designs a simple and practical ontology description method for the packaging engineering knowledge system.
2. Structural elements of the packaging engineering ontology knowledge base
The ontology description needs to support concepts, attributes, instances, relations between instances, and hypernym-hyponym relations. It must also be able to handle concepts with the same name and similar concepts. Specifically, the packaging engineering ontology knowledge base includes the following elements:
(1) Knowledge point: an abstraction of a piece of knowledge.
(2) Instance: represents an individual entity that exists in reality.
(3) Hierarchical relation between knowledge points (taxonomy): domain knowledge is better organized through a hierarchical structure.
(4) Attribute: a feature of a concept or an instance, expressed as a triple <object-attribute-value>; attributes are divided into object properties and datatype properties; each attribute has a corresponding domain and range.
(5) Alias of a knowledge point, instance or attribute: a different name for the same entity.
(6) Foreign-language name (label) of a knowledge point, instance or attribute: the name of the same entity in a different language.
(7) Data type: simple types, including integer, floating-point number and string; collection types; object types.
(8) Introduction: each knowledge point or instance has a corresponding textual introduction.
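One way to picture elements (1) to (8) is as a small set of record types. The sketch below uses Python dataclasses; the class and field names are hypothetical and only mirror the list above, not the OWL-based description language the source intends to design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttributeDef:
    name: str
    kind: str                  # "object" or "datatype" property
    domain: str                # knowledge point the attribute applies to
    range: str                 # value type or target knowledge point
    aliases: list = field(default_factory=list)
    labels: dict = field(default_factory=dict)   # language code -> name


@dataclass
class KnowledgePoint:
    name: str
    parent: Optional[str] = None   # hierarchical (taxonomy) relation
    aliases: list = field(default_factory=list)
    labels: dict = field(default_factory=dict)
    introduction: str = ""


@dataclass
class Instance:
    name: str
    knowledge_point: str
    triples: list = field(default_factory=list)  # (object, attribute, value)
    introduction: str = ""

    def add_value(self, attribute, value):
        """Record one <object-attribute-value> triple for this instance."""
        self.triples.append((self.name, attribute, value))


box = KnowledgePoint("Corrugated box", parent="Paper packaging",
                     aliases=["carton"], introduction="A shipping container.")
material = AttributeDef("material", kind="datatype",
                        domain="Corrugated box", range="string")
sample = Instance("Box A", knowledge_point=box.name)
sample.add_value(material.name, "paperboard")
```

The example values ("Corrugated box", "Box A") are invented purely to show how the pieces relate.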
The main purpose of building the packaging industry big data knowledge graph is to obtain a large amount of packaging knowledge that computers can read. With today's rapid development of the internet, knowledge exists in large quantities in unstructured text data, in large numbers of semi-structured tables and web pages, and in the structured data of production systems. The present invention mainly describes how to obtain knowledge from structured and semi-structured data, and then how to bring together the knowledge obtained from different data sources so as to build associations between the data, ultimately forming the packaging industry big data knowledge graph.
Based on ontology and knowledge graph technologies, the framework of the packaging knowledge system and the packaging industry big data knowledge graph are described and stored in corresponding databases, so that the ontology knowledge can be dynamically adjusted and extended as technology develops.
A knowledge system (body of knowledge) is a knowledge framework defined by experts in a particular professional domain. It states the basic knowledge and skills that qualified practitioners in the domain should master, and it covers the important work activities and key technologies of the related industry. To ensure that packaging knowledge is organized scientifically and completely, this project adopts a resource organization and construction strategy based on the domain knowledge system.
A knowledge graph is a knowledge network that describes real-world entities and the relationships between them. For example, the packaging knowledge graph mainly describes packaging enterprises, packaging people, packaging products and so on, and links people, products, enterprises and other entities together.
The present invention provides a method for constructing a packaging industry big data knowledge graph, comprising:
Step S1, obtaining structured data of the packaging industry, including:
Step S11, using a number of seed terms that represent the packaging industry, searching through the search interfaces of a search engine and an online encyclopedia; for the web documents returned by the search engine, selecting a preset number of top-ranked results as target web pages and adding them to a target web page list; for the pages returned by the online encyclopedia, first entering the corresponding article page and then finding two kinds of links in it, namely external links and the outbound links of its references, and adding the links found to the target web page list as target web pages;
Step S12, performing a first classification of the target web pages in the target web page list by website;
Step S13, for each website identified by the first classification, collecting web page content within the site, with the maximum collection depth for each site set to 3 levels, that is, starting from the site's home page and using a depth-first collection strategy, collecting 3 levels of page content in total for each site;
Step S14, extracting and saving the page content collected from each site, and deleting from the saved content any page in which the frequency of industry keywords is below a preset threshold;
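Steps S13 and S14 can be sketched as a depth-limited, depth-first crawl followed by a keyword-frequency filter. This is a minimal illustration under stated assumptions: `fetch` is a hypothetical callable injected by the caller so the example runs without network access, the home page counts as level 1, and keyword frequency is taken as keyword occurrences divided by the number of whitespace-separated words, a definition the source does not give (Chinese text would need a proper word segmenter instead).

```python
def crawl_site(home, fetch, max_depth=3):
    """Depth-first crawl from the home page, at most max_depth levels deep.

    fetch(url) must return (text, links). A page is expanded at the depth
    at which it was first discovered.
    """
    pages, stack, seen = {}, [(home, 1)], {home}
    while stack:
        url, depth = stack.pop()        # LIFO order gives depth-first traversal
        text, links = fetch(url)
        pages[url] = text
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    stack.append((link, depth + 1))
    return pages


def keyword_frequency(text, keywords):
    """Keyword occurrences per word of text (assumed definition)."""
    words = text.lower().split()
    hits = sum(words.count(k) for k in keywords)
    return hits / len(words) if words else 0.0


def filter_pages(pages, keywords, threshold):
    """Drop saved pages whose keyword frequency is below the threshold."""
    return {url: text for url, text in pages.items()
            if keyword_frequency(text, keywords) >= threshold}


# A tiny in-memory "site" standing in for real HTTP requests.
site = {
    "/": ("packaging news", ["/a", "/b"]),
    "/a": ("carton packaging carton", ["/a/1"]),
    "/b": ("about us", []),
    "/a/1": ("carton spec", ["/a/1/x"]),
    "/a/1/x": ("too deep", []),
}
pages = crawl_site("/", site.__getitem__)
kept = filter_pages(pages, ["carton", "packaging"], threshold=0.3)
```

With the toy site above, the fourth-level page is never fetched, and the page without industry keywords is removed by the filter.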
Here, for building the packaging industry knowledge graph, the industry's internal structured data and open industry knowledge bases or vertical industry websites play a crucial role. Because these industry data sources are closely tied to the profession, they usually have the following advantages:
(1) Good industry coverage and depth: because industry data describe a specific target, their coverage within the industry is usually broad and generally includes most of the information about the industry described. For example, IMDB is the website with the most complete film information among the datasets on the internet.
(2) High reliability: internal structured data normally support the enterprise's own business, so their reliability is very high. As for open industry knowledge bases, some are enterprises' structured data published online after some form of conversion, and others are published online after being edited and reviewed by industry professionals, so their reliability is also assured.
(3) Strong structure: the vast majority of internal structured data are stored in relational databases. Open industry knowledge bases are usually published as web pages generated from the same template, so their structure is essentially uniform and they are easy to parse.
Therefore, when building the packaging industry knowledge graph, priority can be given to the industry's internal structured data and to open industry knowledge bases.
The industry's structured data are certainly the highest-quality data source for building the knowledge graph. In many cases, however, these structured data are not published, so they are a valuable resource only to their owners and are hard for others to obtain. It is therefore also necessary to make full use of the industry knowledge bases and industry websites published on the internet. To use these public industry data, the data sources must first be found, which is a difficult task for ordinary users. The present invention therefore proposes an algorithm, based on search engines and online encyclopedias, for automatically discovering industry knowledge bases and industry websites.
When people look for information on the internet, the most common route is a search engine. With the popularity of open encyclopedias on the internet, online encyclopedias are becoming an important source of information as well.
When retrieving information with a search engine, the user first enters keywords that represent the search intent, the search engine returns result documents, and the user picks suitable ones from them. Search engines use specialized algorithms (such as the HITS and PageRank algorithms) to score all websites and pages, and then rank the search results by this score. In industry information retrieval, industry knowledge bases or professional websites with high click volumes are usually ranked near the top, so starting from a search engine to find industry data and extract it is a workable approach.
However, obtaining industry data through a search engine has two shortcomings. First, ranking relies mainly on page clicks and site influence, but some industry data, especially data about rare entities, receive few clicks and may be ranked low by the search engine. Second, search engines do not index all the data on the internet; some industry data belong to the deep web and may not be indexed.
When editing entries, especially entries on domain knowledge, encyclopedia editors usually need to consult many reference materials. Open industry knowledge bases and professional websites are often their main references, and some high-value specialized data located in the deep web may also be cited. For this reason, many researchers have proposed corpus mining and discovery algorithms based on encyclopedias.
Of the two knowledge source discovery methods above, one ranks the massive volume of internet documents automatically by machine, while the other relies on people selecting reference knowledge for a specific topic. Combining the two can yield a more comprehensive set of industry data. The automatic industry data discovery algorithm of the present invention combines the advantages of both methods; its flow is shown in Fig. 1.
The basic process of the algorithm is as follows:
(1) Using several seed words that can represent the packaging industry, search in the search interfaces of a search engine and of an online encyclopedia. For the web documents returned by the search engine, take a predetermined number of the top-ranked results and add them directly to the target web page list. For the pages returned by the online encyclopedia, first enter the corresponding article page, then find two kinds of links in it, namely ordinary external links and the external links of references, and add both kinds to the target web page list;
(2) Perform a first classification of the target web pages in the target web page list, grouping them by website;
(3) For each website obtained from the first classification, collect data within the site. The maximum collection depth for each site is set to 3 levels: starting from the site's home page and using a depth-first collection strategy, a total of 3 levels of data are collected per site. For a typical industry data website, a depth of 3 levels is enough to traverse the structure of the entire site;
(4) Analyze the content of each website: extract and save the web page content collected from the site. If industry keywords occur in the content with high frequency, the site is indeed relevant to the industry and may subsequently be selected as a target data source; otherwise the content contains only a few instances and the site is discarded;
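Steps (2) and (4) above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the per-1000-characters normalization and the threshold value are assumptions made for illustration, since the text only says that the keyword frequency must be "very high".

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_site(urls):
    """First classification: group target page URLs by host name."""
    sites = defaultdict(list)
    for url in urls:
        sites[urlparse(url).netloc].append(url)
    return dict(sites)


def keyword_frequency(text, keywords):
    """Occurrences of industry keywords per 1000 characters (assumed metric)."""
    if not text:
        return 0.0
    hits = sum(text.count(k) for k in keywords)
    return 1000.0 * hits / len(text)


def select_sites(site_texts, keywords, min_freq):
    """Keep the sites whose collected text mentions the keywords often enough."""
    return [site for site, text in site_texts.items()
            if keyword_frequency(text, keywords) >= min_freq]
```

Given the collected text of each site as a dictionary, `select_sites(site_texts, ["packaging"], 10.0)` would return the hosts that pass the assumed threshold; the remaining sites are discarded as in step (4).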
Step S2: obtain, from the structured data of the packaging industry, the data source for building the packaging industry knowledge graph, including:
Step S21: use a clustering algorithm to perform a second classification of the web page content by structure. The purpose of the second classification is to gather web pages with the same structure together. The features used for clustering include: (a) the depth of the page URL; (b) the words obtained by splitting the part of the URL after the domain name on "/"; (c) the length of the page; (d) the number of tags in the page; (e) the respective counts and proportions of the main tags in the page, including <div>, <table> and <a>;
Step S22: for each category from the second classification whose number of web pages exceeds a preset threshold, filter the web page content in that category;
Step S23: using a preset matching template for each class from the second classification, parse the filtered web page content in each class to obtain the data source for building the packaging industry knowledge graph, where the matching template for each class is used for:
Step S231: locating each element of the filtered web page content in each class by its XPath within the page;
Step S232: mapping the information in the element identified by the XPath to an element of the knowledge graph by means of tags including <synonym> and <attribute>;
In general, the pages of an industry website that describe the same class of target entity all have a similar structure. For example, on China Packaging Network, all packaging product pages have essentially the same structure because they are generated from a unified template. For the websites chosen as target data sources, the next step is to analyze their pages to extract the structured content in them. This process is semi-automatic, and its basic flow is as follows:
(1) First, use a clustering algorithm to perform a second classification of the web page content by structure. The purpose of the second classification is to gather web pages with the same structure together. The features used for clustering include: (a) the depth of the page URL; (b) the words obtained by splitting the part of the URL after the domain name on "/"; (c) the length of the page; (d) the number of tags in the page; (e) the respective counts and proportions of the main tags in the page, including <div>, <table> and <a>;
(2) For categories containing a large number of web pages, have a person inspect selected pages to confirm whether the category is a collection of pages representing one kind of entity information;
(3) For these large categories, to ensure that knowledge extraction is correct, templates are written manually so that the target data source is parsed in a targeted way;
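The clustering features (a) to (e) listed in step (1) can be computed per page roughly as follows. This is a sketch using only the Python standard library; the function and key names are hypothetical, and the clustering algorithm itself is left out because the text does not specify which one is used.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def page_features(url, html):
    """Structural features (a)-(e) of one page, as a plain dictionary."""
    # (a), (b): URL depth and the words after the domain name, split on "/".
    path_words = [w for w in urlparse(url).path.split("/") if w]
    counter = TagCounter()
    counter.feed(html)
    total = sum(counter.tags.values())
    features = {
        "url_depth": len(path_words),
        "url_words": path_words,
        "length": len(html),   # (c) page length in characters
        "tag_count": total,    # (d) number of opening tags
    }
    # (e): count and share of the main tags.
    for tag in ("div", "table", "a"):
        n = counter.tags[tag]
        features[tag + "_count"] = n
        features[tag + "_ratio"] = n / total if total else 0.0
    return features
```

Pages generated from the same site template tend to produce nearly identical feature values, which is what lets a clustering step group them before a person confirms each large category.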
Although many researchers have proposed methods for extracting structured information from web pages automatically or semi-automatically, the accuracy of these methods generally cannot meet the requirement of a knowledge graph (which must be above 90%). The present invention therefore uses manually configured templates to parse target websites specifically, which improves the accuracy of the knowledge. Because an industry usually does not have many target websites, and pages of the same type all use the same template, the workload of manual configuration is acceptable;
To simplify template definition, the present invention defines a rule language for describing templates, named DWPL (Domain Websites Parse Language). The language defines the mechanism for converting information in semi-structured web pages into knowledge in knowledge graph form, mainly including:
(1) locating each element of the filtered web page content in each class by its XPath within the page;
(2) mapping the information in the element identified by the XPath to an element of the knowledge graph by means of tags including <synonym> and <attribute>;
The present invention provides a way to define the matching template for each class, that is, for batch processing, mainly by using wildcards to match the URLs of the web pages in the same set;
One typical template file is as follows:
Once a template is defined, data extraction is very convenient: the target web pages only need to be parsed according to the template. One issue to consider is template failure: if the target website is upgraded, the template stops working. In that case, when extraction fails to yield the target data, the system notifies the user through a built-in alarm mechanism that the template needs to be updated;
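The template mechanism described above (wildcard URL matching, XPath location, mapping to knowledge graph elements, and an alarm when nothing is extracted) can be sketched in Python. The dictionary below is a hypothetical stand-in and does not reproduce DWPL syntax, which is not shown here; the URL, field names and XPath expressions are invented for illustration, and the sketch assumes the page is well-formed XHTML so that `xml.etree.ElementTree` can parse it.

```python
from fnmatch import fnmatchcase
import xml.etree.ElementTree as ET

# Hypothetical template: one URL wildcard plus XPath-to-field mappings.
TEMPLATE = {
    "url_pattern": "http://example.com/product/*.html",
    "fields": {
        "name": ".//h1[@class='title']",
        "attribute:manufacturer": ".//span[@class='maker']",
    },
}


def extract(url, xhtml, template):
    """Return the mapped fields, or None if the URL is outside the template."""
    if not fnmatchcase(url, template["url_pattern"]):
        return None
    root = ET.fromstring(xhtml)
    record = {}
    for field, xpath in template["fields"].items():
        node = root.find(xpath)
        if node is not None and node.text:
            record[field] = node.text.strip()
    if not record:
        # Stand-in for the built-in alarm: the URL matched but nothing was found.
        raise LookupError("no data extracted; the template may need updating")
    return record
```

Raising an exception is only one possible way to model the alarm; a real system might instead log the failure and notify the template's maintainer.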
In addition, for industry websites that are quite complex or that use special techniques (such as Ajax dynamic pages or anti-crawling measures), an interface for freely plugging in complex adapters is also provided (marked with type="customize"). Users can develop a dedicated extraction engine for such a complex website and then plug it into the platform;
Step S3: obtain the data schema defined by human experts for building the packaging industry knowledge graph, the data schema being defined in a top-down knowledge graph manner;
Step S4: understand the basic structure of the structured data of the packaging industry, including the meaning of each table in that data and the associations between tables, and understand the structure of the packaging industry knowledge graph to be built that corresponds to the data schema. Using the D2R Server structured data mapping tool and the preset mapping specification D2RML for mapping a relational database to semantic data, associate the tables in the structured data of the packaging industry with the concepts or entities in the packaging industry knowledge graph to be built, and fill the data source into the packaging industry knowledge graph to build it, where the main keywords of D2RML and their functions are as follows:
(a) dbtype: the type of the source database, including mysql, oracle and sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, specifying the address and port of the database and the database to use;
(c) dbuser: the user name of the database;
(d) dbpwd: the password of the database;
(e) table: the source data table;
(f) concept: the target concept of the import;
(g) the colname attribute of name: the source column of the entity name;
(h) the colname attribute of synonym: the source column of synonymous entities;
(i) the tablename attribute of parent: the table name of the parent concept;
(j) the colname attribute of attribute specifies the source column of the attribute, and attrname specifies the attribute name;
Here, when building the packaging industry knowledge graph, human experts can first define the data schema in a top-down manner. The data schema is the most central part of a knowledge graph, so defining it manually improves the completeness and accuracy of the knowledge graph data. Once the data schema is defined, data-level filling from the various data sources can proceed;
When mapping knowledge from structured data into the packaging industry knowledge graph, it is first necessary to understand the basic structure of the industry's internal structured data, including the meaning of each table and the associations between tables, and to understand the structure of the packaging industry knowledge graph to be built. The D2RML language is then used to associate the tables in the internal structured data with the concepts or entities in the knowledge graph to be built;
Once the mapping configuration file is defined, knowledge can be converted from the source database according to the configuration. The knowledge conversion engine connects to the target database configured in the configuration file (the D2RML mapping specification), reads the data in the corresponding tables, maps the tables and column data of the relational database to the entities of concepts and the attributes of entities respectively, and then stores the knowledge obtained from the mapping into the knowledge graph;
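The conversion step can be sketched in Python with the standard `sqlite3` module. The mapping below borrows the D2RML keyword names (table, concept, name, synonym, attribute with colname and attrname) but is written as a plain dictionary, because the XML form of D2RML is not reproduced here; the table name, column names and concept name are hypothetical.

```python
import sqlite3

# Hypothetical mapping that reuses D2RML keyword names; not actual D2RML XML.
MAPPING = {
    "table": "product",
    "concept": "PackagingProduct",
    "name": {"colname": "title"},
    "synonym": {"colname": "alias"},
    "attributes": [{"colname": "maker", "attrname": "manufacturer"}],
}


def convert(conn, mapping):
    """Read the mapped table and turn each row into an entity of the concept."""
    cols = [mapping["name"]["colname"], mapping["synonym"]["colname"]]
    cols += [a["colname"] for a in mapping["attributes"]]
    # Identifiers come from the trusted mapping file, not from user input.
    sql = "SELECT {} FROM {}".format(", ".join(cols), mapping["table"])
    entities = []
    for name, synonym, *values in conn.execute(sql):
        entities.append({
            "concept": mapping["concept"],
            "name": name,
            "synonyms": [synonym] if synonym else [],
            "attributes": {a["attrname"]: v
                           for a, v in zip(mapping["attributes"], values)
                           if v is not None},
        })
    return entities
```

In this sketch each row becomes one entity of the configured concept and each mapped column becomes one attribute of that entity, which mirrors the table-to-entity and column-to-attribute correspondence described above. Storing the result in a graph store is omitted.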
1. D2R
D2R, in full "relation database to RDF", refers to converting the data in a relational database into semantic data in RDF form and publishing it on the Internet. The pioneers of D2R, Christian Bizer and Andy Seaborne, proposed D2RQ in 2004, a declarative language for describing the mapping between a relational database schema and RDF schemas and OWL. Once a database is described with D2RQ, users can treat non-RDF data (such as the data in a relational database) as virtual RDF data and query it with the RDF Data Query Language (RDQL). In 2006, Christian Bizer and Richard Cyganiak went on to present a tool named "D2R Server" for publishing the data in relational databases to the Semantic Web. The tool first virtualizes the relational data as RDF through a D2RQ mapping file, and then uses D2RQ to query the resulting virtual RDF data. At query time, the RDF query language SPARQL is translated into the relational query language SQL to carry out the query on the relational data;
2. Structured data mapping tool
D2R Server provides a way to convert the data in a relational database into semantic data in RDF form. However, what D2R Server does is virtualize and map the relational data; under normal circumstances no actual RDF data is materialized, so it is difficult to use directly for the knowledge graph conversion of the present invention. In addition, using D2R Server requires understanding the mapping languages it uses, RDQL and D2RQ Mapping, both of which demand some knowledge of RDF and SPARQL, which is a relatively high barrier for ordinary users;
The present invention formulates a mapping specification for mapping from a relational database to semantic data, named D2RML (relation database to RDF mapping language), which is described in XML. Thanks to the ease of use and generality of XML, D2RML can be easily understood and used by ordinary users. Using the language does not require knowledge of RDF or SPARQL, which lowers the barrier to use. The present invention also provides a visual specification configuration tool, on which users can define the mapping rules with just a few simple configuration steps.
The main keywords in D2RML and their functions are as follows:
(a) dbtype: the type of the source database, such as mysql, oracle or sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, specifying the address and port of the database and the database to use;
(c) dbuser: the user name of the database;
(d) dbpwd: the password of the database;
(e) table: the source data table;
(f) concept: the target concept of the import;
(g) the colname attribute of name: the source column of the entity name;
(h) the colname attribute of synonym: the source column of synonymous entities;
(i) the tablename attribute of parent: the table name of the parent concept;
(j) the colname attribute of attribute specifies the source column of the attribute, and attrname specifies the attribute name;
A typical mapping file is as follows; it describes the configuration, used in later sections of the present invention, for mapping a fish knowledge graph from a fish database.
Step S5: merge the entities in open linked data and online encyclopedias with the entities in the packaging industry knowledge graph that has been built, including:
Step S51: match the names and synonym sets of the entities in the open linked data and online encyclopedias against the names and synonym sets of the entities in the built packaging industry knowledge graph, and take the matches as candidate entity pairs for entity merging;
Step S52: for each candidate entity pair, compare the parent concepts of the two entities; if the parent concepts are the same, merge the pair into the built packaging industry knowledge graph;
Step S6: add to the built packaging industry knowledge graph the entities that are not in it but are present in the open linked data and online encyclopedias.
Here, although the construction of a domain knowledge graph uses the industry's internal structured data and domain knowledge bases, open linked data and knowledge bases, encyclopedias and text remain important data sources for it. On the one hand, these data sources can supplement the domain knowledge graph; on the other hand, when an industry lacks internal structured data and open industry knowledge bases or industry websites, learning the domain knowledge graph will use the same data sources as a general knowledge graph.
When learning knowledge from open linked data and encyclopedias, the entities in them must first be merged with the entities in the built packaging industry knowledge graph. Merging again runs into problems such as identical names with different meanings and identical meanings with different names, so a sound entity alignment method is needed. The present invention aligns entities as follows: (1) For open linked data and encyclopedias, alignment among their own entities has already been done, and each uses some mechanism to describe synonymous entities or concepts. Specifically, synonymy is described with "owl:sameAs" in DBpedia and with "means" in YAGO. The synonymy relations they contain can be obtained simply by traversing the whole data set and parsing the corresponding description mechanism wherever it is found. (2) The names and synonym sets of the entities in the open linked data and online encyclopedias are matched against the names and synonym sets of the entities in the built graph, and the matches are taken as candidate entity pairs for merging. (3) For each candidate entity pair, the parent concepts are compared; if they are the same, the two are considered entities that need to be merged.
Open linked data and encyclopedias also contain new entities, which need to be added to the knowledge graph. The main basis for adding them is the concept each entity corresponds to: a new entity is added to the entity set of the concept it belongs to.
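The matching, merging and adding steps can be sketched as follows. The entity representation (a dictionary with `name`, `synonyms` and `parent` keys) and the function names are assumptions made for illustration; the sketch treats "same parent concept" as simple string equality.

```python
def names_of(entity):
    """The name of an entity together with its synonym set."""
    return {entity["name"]} | set(entity["synonyms"])


def align(graph, external):
    """Merge external entities into graph (a list of entity dicts) in place.

    An external entity is merged into the first graph entity that shares a
    name or synonym with it AND has the same parent concept; otherwise it is
    added as a new entity.
    """
    merged, added = [], []
    for ext in external:
        target = next(
            (g for g in graph
             if names_of(g) & names_of(ext) and g["parent"] == ext["parent"]),
            None,
        )
        if target is not None:
            # Union the synonym sets, excluding the surviving entity's own name.
            target["synonyms"] = (names_of(target) | names_of(ext)) - {target["name"]}
            merged.append((ext["name"], target["name"]))
        else:
            graph.append({"name": ext["name"],
                          "synonyms": set(ext["synonyms"]),
                          "parent": ext["parent"]})
            added.append(ext["name"])
    return merged, added
```

The parent-concept check is what separates entities that share a name but differ in meaning: a name match alone only produces a candidate pair, and a candidate with a different parent concept is kept as a separate entity.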
Based on the entity attribute information contained in open linked data and encyclopedias, the detailed knowledge of entities in the knowledge graph can be filled in further. Because the attributes have been defined manually, the first step is to map the attributes;
The infoboxes in encyclopedias contain a large amount of information in the form of "attribute-value" pairs. Note that although infoboxes are defined on the basis of concepts, they are not displayed directly in the encyclopedia article pages of the concepts themselves; they appear in the entities that belong to those concepts. For example, Figure 2 shows the attributes of the article page "China". The earlier learning process has already determined that it is an entity, and it belongs to many concepts, including "country", "BRIC countries", "BRICS countries" and "ancient civilizations", yet none of the article pages for these concepts has an infobox presenting these attributes. Therefore, to determine the attributes of a concept, one must work bottom-up: first determine the attributes of the entities it contains, then reduce them to obtain the attributes the concept should have.
Extracting attributes from an entity's page is simple: it only requires writing an adapter for the format of the infobox in the page and parsing with it. Merging attributes from the pages of different encyclopedias for the same entity may run into two problems:
(a) The same attribute has different names in different encyclopedias. For example, "political system" in the first of two infoboxes and "government form" in the second are in fact the same attribute. This problem is solved mainly in two ways. One is to merge by extracted synonymy: if two different attribute names are synonymous, they are different names of the same attribute and are merged. The other is to decide by attribute value: if, in most of the entities that have both attributes, their values agree, then they should represent the same attribute.
(b) The same attribute has different values in different encyclopedias. For example, the value of the "main cities" attribute is "Beijing, Shanghai, Hong Kong, Shenzhen, etc." in Baidu Baike but "Beijing, Shanghai, Chongqing, Guangzhou, Shenzhen, etc." in Hudong Baike. This is not an attribute conflict; the values should be merged.
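Both cases can be sketched in a few lines of Python. The function names are hypothetical, and the reading of "most" as a strict majority (more than 50% of the entities that carry both attributes) is an assumption, since no figure is given.

```python
def same_attribute(entities, a, b, min_agreement=0.5):
    """Value-based test for case (a): do attribute names a and b agree
    in most of the entities that carry both?"""
    both = [e for e in entities if a in e and b in e]
    if not both:
        return False
    agree = sum(1 for e in both if e[a] == e[b])
    return agree / len(both) > min_agreement


def merge_values(*value_lists):
    """Case (b): order-preserving union of the values from several sources."""
    merged = []
    for values in value_lists:
        for v in values:
            if v not in merged:
                merged.append(v)
    return merged
```

Applied to the "main cities" example above, `merge_values` keeps Beijing, Shanghai, Hong Kong and Shenzhen from the first source and appends Chongqing and Guangzhou from the second.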
None of the attributes and values at the entity level are checked manually. On the one hand, the number of entities is simply too large; on the other hand, even if a small number of entity attributes are unreasonable, the impact is confined to those attributes themselves and does not affect other entities or concepts.
For attribute mapping in a domain knowledge graph, a way to define mappings manually is also provided to further improve accuracy.
Once attribute mapping is complete, the next task is to fill in entity values, mainly supplementing attributes whose values have not yet been learned.
This section validates the various methods of domain knowledge graph construction, taking the construction of a packaging knowledge graph as an example.
Specifically, the website of the Education Committee of the China Packaging Federation has a packaging site database in which packaging industry product information is stored in a table of a relational database. For each packaging product, the table includes fields such as name, manufacturer, place of production and brief introduction.
The data in this table can be imported into the knowledge graph very conveniently through the D2RML specification defined by the present invention. In this step, 51620 product records were successfully imported from the relational database. Since these data were essentially all edited by hand and have been in use for a long time, this knowledge can be considered accurate and reliable.
First a set of seed words is chosen. The higher-level keywords chosen here are "package design", "packaging manufacturing", "packaging materials", "packaging equipment" and so on; in addition, some product names are selected from the product information imported from the relational database and used as a keyword list.
Searches are then run in the search engine and the online encyclopedia. The search result structure in Wikipedia for the keyword "package design" serves as an illustration. Searching Wikipedia leads directly to the article page titled "package design". In this page, look first at the related websites section: the first link is titled "China Packaging Network", a leading domestic portal for the packaging industry. From the "friendly links" section on the home page of China Packaging Network, further links can be obtained, such as "China Packaging Federation", "E-commerce Committee of the China Packaging Federation" and "China Packaging Talent Network". These data sources are high-quality, information-rich packaging industry knowledge bases. When analyzed automatically by the algorithm, these sites also rank at the top of the algorithm's results because they contain rich packaging-related information.
The experiment demonstrates the effectiveness of the algorithm proposed by the present invention for automatically discovering industry knowledge bases and websites.
For the industry data sources obtained, Hudong Baike (http://fenlei.baike.com/Category/treeManage.jsp) is used next as an example to illustrate how knowledge is extracted from these industry websites.
First, the site has a complete classification system for packaging categories, displayed as a tree; for many packaging products it also has detailed information pages. The classification tree is implemented with Ajax, so an adapter has to be developed for it; it is named "BaikePackagingTaxoWrapper" and plugged into the industry website parsing system through the external adapter mechanism of DWPL. The information pages of packaging products, by contrast, are generated from templates, so their parsing configuration is easy to complete with DWPL.
Once the parsing configuration is complete, the system automatically extracts knowledge from the website according to the configuration file.
Knowledge graphs are widely applied in fields such as semantic retrieval, data mining, artificial intelligence, knowledge organization and intelligent question answering. Against the current background of Internet big data, how to organize and use structured, semi-structured and unstructured data effectively, so that these data better serve packaging engineering, has become a new challenge of the big data era. The packaging industry knowledge graph construction technique proposed by the present invention aims to obtain structured knowledge from data sources of various structures, providing a solution for the effective use of all types of data.
In summary, while building the packaging knowledge graph, the present invention can structure unstructured data, laying a foundation for further semantic analysis and computation. In addition, modeling the data of the packaging industry with a knowledge graph allows the data schema to be extended freely.
According to another aspect of the present invention, a computer-readable storage medium is provided on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
obtain the structured data of the packaging industry, including: using several seed words that can represent the packaging industry, searching in the search interfaces of a search engine and of an online encyclopedia; for the web documents returned by the search engine, taking a predetermined number of the top-ranked results as target web pages and adding them to a target web page list; for the pages returned by the online encyclopedia, first entering the corresponding article page, then finding two kinds of links in it, namely external links and the external links of references, and adding them to the target web page list as target web pages; performing a first classification of the target web pages in the list by website; for each website obtained from the first classification, collecting web page content within the site, with the maximum collection depth for each site set to 3 levels, that is, starting from the site's home page and using a depth-first collection strategy, collecting a total of 3 levels of web page content per site; extracting and saving the web page content collected from each site, and deleting from the saved content any web page content in which the frequency of industry keywords is below a preset threshold;
obtain, from the structured data of the packaging industry, the data source for building the packaging industry knowledge graph, including: using a clustering algorithm to perform a second classification of the web page content by structure, the purpose of the second classification being to gather web pages with the same structure together, where the features used for clustering include: (a) the depth of the page URL; (b) the words obtained by splitting the part of the URL after the domain name on "/"; (c) the length of the page; (d) the number of tags in the page; (e) the respective counts and proportions of the main tags in the page, including <div>, <table> and <a>; for each category from the second classification whose number of web pages exceeds a preset threshold, filtering the web page content in that category; using a preset matching template for each class from the second classification, parsing the filtered web page content in each class to obtain the data source for building the packaging industry knowledge graph, where the matching template for each class is used for: locating each element of the filtered web page content in each class by its XPath within the page; and mapping the information in the element identified by the XPath to an element of the knowledge graph by means of tags including <synonym> and <attribute>;
obtain the data schema defined by human experts for building the packaging industry knowledge graph, the data schema being defined in a top-down knowledge graph manner;
understand the basic structure of the structured data of the packaging industry, including the meaning of each table in that data and the associations between tables, and understand the structure of the packaging industry knowledge graph to be built that corresponds to the data schema; using the D2R Server structured data mapping tool and the preset mapping specification D2RML for mapping a relational database to semantic data, associate the tables in the structured data of the packaging industry with the concepts or entities in the packaging industry knowledge graph to be built, and fill the data source into the packaging industry knowledge graph to build it, where the main keywords of D2RML and their functions are as follows:
(a) dbtype: the type of the source database, including mysql, oracle and sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, specifying the address and port of the database and the database to use;
(c) dbuser: the user name of the database;
(d) dbpwd: the password of the database;
(e) table: the source data table;
(f) concept: the target concept of the import;
(g) the colname attribute of name: the source column of the entity name;
(h) the colname attribute of synonym: the source column of synonymous entities;
(i) the tablename attribute of parent: the table name of the parent concept;
(j) the colname attribute of attribute specifies the source column of the attribute, and attrname specifies the attribute name;
merge the entities in open linked data and online encyclopedias with the entities in the packaging industry knowledge graph that has been built, including: matching the names and synonym sets of the entities in the open linked data and online encyclopedias against the names and synonym sets of the entities in the built packaging industry knowledge graph, and taking the matches as candidate entity pairs for entity merging; for each candidate entity pair, comparing the parent concepts of the two entities and, if the parent concepts are the same, merging the pair into the built packaging industry knowledge graph;
add to the built packaging industry knowledge graph the entities that are not in it but are present in the open linked data and online encyclopedias.
The present invention also provides a computing device, including:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
The structural data of packaging industry is obtained, including:The seed vocabulary that packaging industry can be represented using some, is being searched Index is held up to be scanned for in the searching interface of online encyclopaedia, for the web document that described search engine returns, chooses arrangement Predetermined number in front, as target webpage, is added to target webpage list according to result;The online encyclopaedia is returned The page is introduced into corresponding article page, and the link of two classes, including external linkage and bibliography are then found in articles page Exterior chain be added in the target webpage list using the exterior chain of the external linkage found and bibliography as target webpage; First is carried out to target webpage in the target webpage list according to website to sort out;Sort out by described first and target webpage is corresponded to Each website stood in web page contents acquisition, the depth capacity of each website acquisition is set as 3 layers, i.e., from website homepage Start, using depth-first acquisition strategies, acquires 3 layers of web page contents of each website in total;Each website has been collected Web page contents extract preservation, for the frequency comprising industry keyword be less than predetermined threshold value web page contents, by its from It is deleted in the web page contents of preservation;
Obtain, from the structured data of the packaging industry, the data source for building the packaging industry knowledge graph, including: using a clustering algorithm, perform a second classification of the web page content by structure, the purpose of the second classification being to group web pages with the same structure together, wherein the features used for clustering include: (a) the depth of the web page URL; (b) the words obtained by splitting on "/" the part of the URL that remains after the domain name is removed; (c) the length of the web page; (d) the number of tags in the web page; (e) the respective counts of the main tags in the web page, including <div>, <table>, and <a>, and the proportion each accounts for; for each class in which the number of web pages after the second classification exceeds a preset threshold, filter the web page content in that class; parse the web page content in each filtered class with the preset matching template corresponding to that class, to obtain the data source for building the packaging industry knowledge graph, wherein the matching template corresponding to each class is used to: for each element in the web page content of the filtered class, locate the element in the page by its XPath; and, by means of tags including <synonym> and <attribute>, map the information in the element located by the XPath to an element of the knowledge graph;
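As an illustration of clustering features (a) through (e), the sketch below computes them for a single page using only the Python standard library. It is a hedged example, not the method as implemented by the applicant: page length is assumed to mean the character length of the HTML, the tag ratio is assumed to be the tag's count divided by the total number of start tags, and the names `TagCounter` and `page_features` are hypothetical. The resulting feature dictionary could then be fed to any clustering algorithm.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts how many times each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def page_features(url, html):
    """Return structural features (a) to (e) for one web page."""
    # (b) words from the URL path, i.e. the part after the domain name
    path_words = [w for w in urlparse(url).path.split("/") if w]
    counter = TagCounter()
    counter.feed(html)
    total = sum(counter.counts.values())
    features = {
        "url_depth": len(path_words),   # (a) depth of the URL
        "path_words": path_words,       # (b)
        "page_length": len(html),       # (c) assumed: length in characters
        "tag_count": total,             # (d) number of tags
    }
    # (e) counts and ratios of the main tags
    for tag in ("div", "table", "a"):
        n = counter.counts.get(tag, 0)
        features[f"{tag}_count"] = n
        features[f"{tag}_ratio"] = n / total if total else 0.0
    return features
```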
Obtain the data schema, defined by human experts, for building the packaging industry knowledge graph, the data schema following a top-down knowledge graph construction approach;
Analyze the basic structure of the structured data of the packaging industry, including the meaning of each table in the structured data and the associations between tables, and analyze the structure of the packaging industry knowledge graph to be built as defined by the data schema; using the D2R Server structured-data mapping tool and a preset mapping specification, D2RML, for mapping a relational database to semantic data, associate the tables in the structured data of the packaging industry with the concepts or entities in the packaging industry knowledge graph to be built, and populate the packaging industry knowledge graph with the data source, thereby building the packaging industry knowledge graph, wherein the main keywords of the mapping specification D2RML and their functions are as follows:
(a) dbtype: the type of the source database, including mysql, oracle, and sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, specifying the address and port of the database and the name of the database used;
(c) dbuser: the database user name;
(d) dbpwd: the database password;
(e) table: the source data table;
(f) concept: the target concept for import;
(g) the colname attribute of name: the source column of the entity name;
(h) the colname attribute of synonym: the source column of the entity synonyms;
(i) the tablename attribute of parent: the table name of the parent concept;
(j) the colname attribute of attribute specifies the source column of the attribute, and attrname specifies the attribute name;
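To make the roles of keywords (a) through (j) concrete, the sketch below arranges them in a Python dictionary and applies the mapping to rows that have already been read from the source table. The concrete file syntax of D2RML is not given in this passage, so the dictionary layout, the example table and column names, the connection values, and the function `rows_to_entities` are all hypothetical placeholders chosen for illustration.

```python
# Hypothetical mapping that uses the keyword names listed above.
mapping = {
    "dbtype": "mysql",                      # (a) selects the driver
    "dburl": "localhost:3306/packaging",    # (b) placeholder address
    "dbuser": "reader",                     # (c) placeholder user
    "dbpwd": "change-me",                   # (d) placeholder password
    "table": "materials",                   # (e) source data table
    "concept": "PackagingMaterial",         # (f) target concept
    "name": {"colname": "title"},           # (g) entity name column
    "synonym": {"colname": "alias"},        # (h) synonym column
    "parent": {"tablename": "material_categories"},  # (i)
    "attribute": [                          # (j) attribute columns
        {"colname": "density", "attrname": "density"},
    ],
}


def rows_to_entities(rows, mapping):
    """Turn source-table rows (dicts) into knowledge graph entities."""
    entities = []
    for row in rows:
        synonym = row.get(mapping["synonym"]["colname"])
        entities.append({
            "concept": mapping["concept"],
            "name": row[mapping["name"]["colname"]],
            "synonyms": [synonym] if synonym else [],
            "parent_table": mapping["parent"]["tablename"],
            "attributes": {a["attrname"]: row[a["colname"]]
                           for a in mapping["attribute"]},
        })
    return entities
```

The connection keywords are carried in the mapping but not used here, since the sketch starts from rows that were already fetched.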
Merge entities from linked open data and online encyclopedias with entities in the already-built packaging industry knowledge graph, including: match the names and synonym sets of the entities in the linked open data and online encyclopedias against the names and synonym sets of the entities in the built packaging industry knowledge graph, and take the matched results as candidate entity pairs for merging; for each candidate entity pair, compare the parent concepts of the two entities and, if the parent concepts are the same, merge the pair into the built packaging industry knowledge graph;
Add to the built packaging industry knowledge graph those entities that do not exist in it but are present in the linked open data and online encyclopedias.
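The merging step can be sketched as follows. This is a minimal illustration under stated assumptions: matching is taken to mean that the two entities share at least one string among their names and synonyms, merging is taken to mean folding the external labels into the existing entity's synonym set, and an external entity that matches by label but has a different parent concept is simply left out, a case the text does not address. The names `Entity` and `merge_entities` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    parent: str                                  # name of the parent concept
    synonyms: set = field(default_factory=set)

    def labels(self):
        """Name plus synonym set, the strings used for matching."""
        return {self.name} | self.synonyms


def merge_entities(graph, external):
    """Merge external entities into graph, a dict keyed by entity name."""
    for ext in external:
        # Candidate pairs: graph entities that share any label with ext.
        candidates = [g for g in graph.values() if g.labels() & ext.labels()]
        same_parent = [g for g in candidates if g.parent == ext.parent]
        if same_parent:
            # Same parent concept: fold ext's labels into the existing entity.
            target = same_parent[0]
            target.synonyms |= ext.labels() - {target.name}
        elif not candidates:
            # Not present in the graph at all: add it as a new entity.
            graph[ext.name] = ext
        # Label match with a different parent: left unmerged (assumption).
    return graph
```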
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Obviously, those skilled in the art can make various modifications and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to cover them.

Claims (3)

1. A method for constructing a big-data knowledge graph for the packaging industry, characterized by comprising:
Obtaining structured data of the packaging industry, including: using a number of seed terms that represent the packaging industry, searching in a search engine and in the search interface of an online encyclopedia; for the web documents returned by the search engine, selecting a preset number of top-ranked results as target web pages and adding them to a target web page list; for each page returned by the online encyclopedia, first entering the corresponding article page, then finding two kinds of links in the article page, namely external links and the external links of the references, and adding the pages those links point to, as target web pages, to the target web page list; performing a first classification of the target web pages in the target web page list by website; for each website obtained by the first classification, collecting web page content within the site, with the maximum collection depth of each website set to 3 layers, that is, starting from the website home page and using a depth-first collection strategy, 3 layers of web page content are collected from each website in total; extracting and saving the web page content collected from each website, and deleting from the saved web page content any page in which the frequency of industry keywords is below a preset threshold;
Obtaining, from the structured data of the packaging industry, the data source for building the packaging industry knowledge graph, including: using a clustering algorithm, performing a second classification of the web page content by structure, the purpose of the second classification being to group web pages with the same structure together, wherein the features used for clustering include: (a) the depth of the web page URL; (b) the words obtained by splitting on "/" the part of the URL that remains after the domain name is removed; (c) the length of the web page; (d) the number of tags in the web page; (e) the respective counts of the main tags in the web page, including <div>, <table>, and <a>, and the proportion each accounts for; for each class in which the number of web pages after the second classification exceeds a preset threshold, filtering the web page content in that class; parsing the web page content in each filtered class with the preset matching template corresponding to that class, to obtain the data source for building the packaging industry knowledge graph, wherein the matching template corresponding to each class is used to: for each element in the web page content of the filtered class, locate the element in the page by its XPath; and, by means of tags including <synonym> and <attribute>, map the information in the element located by the XPath to an element of the knowledge graph;
Obtaining the data schema, defined by human experts, for building the packaging industry knowledge graph, the data schema following a top-down knowledge graph construction approach;
Analyzing the basic structure of the structured data of the packaging industry, including the meaning of each table in the structured data and the associations between tables, and analyzing the structure of the packaging industry knowledge graph to be built as defined by the data schema; using the D2R Server structured-data mapping tool and a preset mapping specification, D2RML, for mapping a relational database to semantic data, associating the tables in the structured data of the packaging industry with the concepts or entities in the packaging industry knowledge graph to be built, and populating the packaging industry knowledge graph with the data source, thereby building the packaging industry knowledge graph, wherein the main keywords of the mapping specification D2RML and their functions are as follows:
(a) dbtype: the type of the source database, including mysql, oracle, and sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, specifying the address and port of the database and the name of the database used;
(c) dbuser: the database user name;
(d) dbpwd: the database password;
(e) table: the source data table;
(f) concept: the target concept for import;
(g) the colname attribute of name: the source column of the entity name;
(h) the colname attribute of synonym: the source column of the entity synonyms;
(i) the tablename attribute of parent: the table name of the parent concept;
(j) the colname attribute of attribute specifies the source column of the attribute, and attrname specifies the attribute name;
Merging entities from linked open data and online encyclopedias with entities in the already-built packaging industry knowledge graph, including: matching the names and synonym sets of the entities in the linked open data and online encyclopedias against the names and synonym sets of the entities in the built packaging industry knowledge graph, and taking the matched results as candidate entity pairs for merging; for each candidate entity pair, comparing the parent concepts of the two entities and, if the parent concepts are the same, merging the pair into the built packaging industry knowledge graph;
Adding to the built packaging industry knowledge graph those entities that do not exist in it but are present in the linked open data and online encyclopedias.
2. A computer-readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
Obtain structured data of the packaging industry, including: using a number of seed terms that represent the packaging industry, search in a search engine and in the search interface of an online encyclopedia; for the web documents returned by the search engine, select a preset number of top-ranked results as target web pages and add them to a target web page list; for each page returned by the online encyclopedia, first enter the corresponding article page, then find two kinds of links in the article page, namely external links and the external links of the references, and add the pages those links point to, as target web pages, to the target web page list; perform a first classification of the target web pages in the target web page list by website; for each website obtained by the first classification, collect web page content within the site, with the maximum collection depth of each website set to 3 layers, that is, starting from the website home page and using a depth-first collection strategy, 3 layers of web page content are collected from each website in total; extract and save the web page content collected from each website, and delete from the saved web page content any page in which the frequency of industry keywords is below a preset threshold;
Obtain, from the structured data of the packaging industry, the data source for building the packaging industry knowledge graph, including: using a clustering algorithm, perform a second classification of the web page content by structure, the purpose of the second classification being to group web pages with the same structure together, wherein the features used for clustering include: (a) the depth of the web page URL; (b) the words obtained by splitting on "/" the part of the URL that remains after the domain name is removed; (c) the length of the web page; (d) the number of tags in the web page; (e) the respective counts of the main tags in the web page, including <div>, <table>, and <a>, and the proportion each accounts for; for each class in which the number of web pages after the second classification exceeds a preset threshold, filter the web page content in that class; parse the web page content in each filtered class with the preset matching template corresponding to that class, to obtain the data source for building the packaging industry knowledge graph, wherein the matching template corresponding to each class is used to: for each element in the web page content of the filtered class, locate the element in the page by its XPath; and, by means of tags including <synonym> and <attribute>, map the information in the element located by the XPath to an element of the knowledge graph;
Obtain the data schema, defined by human experts, for building the packaging industry knowledge graph, the data schema following a top-down knowledge graph construction approach;
Analyze the basic structure of the structured data of the packaging industry, including the meaning of each table in the structured data and the associations between tables, and analyze the structure of the packaging industry knowledge graph to be built as defined by the data schema; using the D2R Server structured-data mapping tool and a preset mapping specification, D2RML, for mapping a relational database to semantic data, associate the tables in the structured data of the packaging industry with the concepts or entities in the packaging industry knowledge graph to be built, and populate the packaging industry knowledge graph with the data source, thereby building the packaging industry knowledge graph, wherein the main keywords of the mapping specification D2RML and their functions are as follows:
(a) dbtype: the type of the source database, including mysql, oracle, and sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, specifying the address and port of the database and the name of the database used;
(c) dbuser: the database user name;
(d) dbpwd: the database password;
(e) table: the source data table;
(f) concept: the target concept for import;
(g) the colname attribute of name: the source column of the entity name;
(h) the colname attribute of synonym: the source column of the entity synonyms;
(i) the tablename attribute of parent: the table name of the parent concept;
(j) the colname attribute of attribute specifies the source column of the attribute, and attrname specifies the attribute name;
Merge entities from linked open data and online encyclopedias with entities in the already-built packaging industry knowledge graph, including: match the names and synonym sets of the entities in the linked open data and online encyclopedias against the names and synonym sets of the entities in the built packaging industry knowledge graph, and take the matched results as candidate entity pairs for merging; for each candidate entity pair, compare the parent concepts of the two entities and, if the parent concepts are the same, merge the pair into the built packaging industry knowledge graph;
Add to the built packaging industry knowledge graph those entities that do not exist in it but are present in the linked open data and online encyclopedias.
3. A computing device, comprising:
A processor; and
A memory arranged to store computer-executable instructions which, when executed, cause the processor to:
Obtain structured data of the packaging industry, including: using a number of seed terms that represent the packaging industry, search in a search engine and in the search interface of an online encyclopedia; for the web documents returned by the search engine, select a preset number of top-ranked results as target web pages and add them to a target web page list; for each page returned by the online encyclopedia, first enter the corresponding article page, then find two kinds of links in the article page, namely external links and the external links of the references, and add the pages those links point to, as target web pages, to the target web page list; perform a first classification of the target web pages in the target web page list by website; for each website obtained by the first classification, collect web page content within the site, with the maximum collection depth of each website set to 3 layers, that is, starting from the website home page and using a depth-first collection strategy, 3 layers of web page content are collected from each website in total; extract and save the web page content collected from each website, and delete from the saved web page content any page in which the frequency of industry keywords is below a preset threshold;
Obtain, from the structured data of the packaging industry, the data source for building the packaging industry knowledge graph, including: using a clustering algorithm, perform a second classification of the web page content by structure, the purpose of the second classification being to group web pages with the same structure together, wherein the features used for clustering include: (a) the depth of the web page URL; (b) the words obtained by splitting on "/" the part of the URL that remains after the domain name is removed; (c) the length of the web page; (d) the number of tags in the web page; (e) the respective counts of the main tags in the web page, including <div>, <table>, and <a>, and the proportion each accounts for; for each class in which the number of web pages after the second classification exceeds a preset threshold, filter the web page content in that class; parse the web page content in each filtered class with the preset matching template corresponding to that class, to obtain the data source for building the packaging industry knowledge graph, wherein the matching template corresponding to each class is used to: for each element in the web page content of the filtered class, locate the element in the page by its XPath; and, by means of tags including <synonym> and <attribute>, map the information in the element located by the XPath to an element of the knowledge graph;
Obtain the data schema, defined by human experts, for building the packaging industry knowledge graph, the data schema following a top-down knowledge graph construction approach;
Analyze the basic structure of the structured data of the packaging industry, including the meaning of each table in the structured data and the associations between tables, and analyze the structure of the packaging industry knowledge graph to be built as defined by the data schema; using the D2R Server structured-data mapping tool and a preset mapping specification, D2RML, for mapping a relational database to semantic data, associate the tables in the structured data of the packaging industry with the concepts or entities in the packaging industry knowledge graph to be built, and populate the packaging industry knowledge graph with the data source, thereby building the packaging industry knowledge graph, wherein the main keywords of the mapping specification D2RML and their functions are as follows:
(a) dbtype: the type of the source database, including mysql, oracle, and sqlserver; the type determines the driver used when connecting;
(b) dburl: the database connection string, specifying the address and port of the database and the name of the database used;
(c) dbuser: the database user name;
(d) dbpwd: the database password;
(e) table: the source data table;
(f) concept: the target concept for import;
(g) the colname attribute of name: the source column of the entity name;
(h) the colname attribute of synonym: the source column of the entity synonyms;
(i) the tablename attribute of parent: the table name of the parent concept;
(j) the colname attribute of attribute specifies the source column of the attribute, and attrname specifies the attribute name;
Merge entities from linked open data and online encyclopedias with entities in the already-built packaging industry knowledge graph, including: match the names and synonym sets of the entities in the linked open data and online encyclopedias against the names and synonym sets of the entities in the built packaging industry knowledge graph, and take the matched results as candidate entity pairs for merging; for each candidate entity pair, compare the parent concepts of the two entities and, if the parent concepts are the same, merge the pair into the built packaging industry knowledge graph;
Add to the built packaging industry knowledge graph those entities that do not exist in it but are present in the linked open data and online encyclopedias.
CN201810211761.XA 2018-03-15 2018-03-15 A kind of construction method and equipment of Packaging Industry big data knowledge mapping Withdrawn CN108446368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810211761.XA CN108446368A (en) 2018-03-15 2018-03-15 A kind of construction method and equipment of Packaging Industry big data knowledge mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810211761.XA CN108446368A (en) 2018-03-15 2018-03-15 A kind of construction method and equipment of Packaging Industry big data knowledge mapping

Publications (1)

Publication Number Publication Date
CN108446368A true CN108446368A (en) 2018-08-24

Family

ID=63195229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810211761.XA Withdrawn CN108446368A (en) 2018-03-15 2018-03-15 A kind of construction method and equipment of Packaging Industry big data knowledge mapping

Country Status (1)

Country Link
CN (1) CN108446368A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308322A (en) * 2018-12-04 2019-02-05 南京樯图数据科技有限公司 A kind of creation and transaction system of industrial economy knowledge mapping
CN109446341A (en) * 2018-10-23 2019-03-08 国家电网公司 The construction method and device of knowledge mapping
CN109471949A (en) * 2018-11-09 2019-03-15 袁琦 A kind of semi-automatic construction method of pet knowledge mapping
CN110275966A (en) * 2019-07-01 2019-09-24 科大讯飞(苏州)科技有限公司 A kind of Knowledge Extraction Method and device
CN110290116A (en) * 2019-06-04 2019-09-27 中山大学 A kind of malice domain name detection method of knowledge based map
CN110489560A (en) * 2019-06-19 2019-11-22 民生科技有限责任公司 The little Wei enterprise portrait generation method and device of knowledge based graphical spectrum technology
CN110516047A (en) * 2019-09-02 2019-11-29 湖南工业大学 The search method and searching system of knowledge mapping based on packaging field
CN110598003A (en) * 2019-08-15 2019-12-20 上海市大数据中心 Knowledge graph construction system and construction method based on public data resource catalog
CN110970112A (en) * 2018-09-29 2020-04-07 九阳股份有限公司 Method and system for constructing knowledge graph for nutrition and health
CN111078949A (en) * 2019-12-31 2020-04-28 北京明略软件系统有限公司 Product knowledge storage method and device, computer equipment and readable storage medium
CN111091006A (en) * 2019-12-20 2020-05-01 北京百度网讯科技有限公司 Entity intention system establishing method, device, equipment and medium
CN111104524A (en) * 2019-12-25 2020-05-05 航天云网科技发展有限责任公司 Method for identifying television end user set
CN111274327A (en) * 2020-01-09 2020-06-12 浙江工业大学 Entity and relation extraction method for unstructured table document
CN111522927A (en) * 2020-04-15 2020-08-11 北京百度网讯科技有限公司 Entity query method and device based on knowledge graph
CN112463984A (en) * 2020-12-04 2021-03-09 北京明略软件系统有限公司 Database mode expansion method, device, equipment and computer readable medium
CN113052005A (en) * 2021-02-08 2021-06-29 湖南工业大学 Garbage sorting method and garbage sorting device for home service
CN113139022A (en) * 2021-04-29 2021-07-20 同济大学 Enterprise logistics data on-demand fusion method based on mixing rule
CN113377957A (en) * 2021-07-01 2021-09-10 浙江工业大学 National economy industry classification method and system based on knowledge graph
CN113836293A (en) * 2021-09-23 2021-12-24 平安国际智慧城市科技股份有限公司 Data processing method, device and equipment based on knowledge graph and storage medium
CN114168745A (en) * 2021-11-30 2022-03-11 大连理工大学 Knowledge graph construction method for production process of ethylene oxide derivative
CN116702899A (en) * 2023-08-07 2023-09-05 上海银行股份有限公司 Entity fusion method suitable for public and private linkage scene
CN117520567A (en) * 2024-01-03 2024-02-06 卓世科技(海南)有限公司 Knowledge graph-based large language model training method

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970112A (en) * 2018-09-29 2020-04-07 九阳股份有限公司 Method and system for constructing knowledge graph for nutrition and health
CN110970112B (en) * 2018-09-29 2024-03-12 九阳股份有限公司 Knowledge graph construction method and system for nutrition and health
CN109446341A (en) * 2018-10-23 2019-03-08 国家电网公司 The construction method and device of knowledge mapping
CN109471949A (en) * 2018-11-09 2019-03-15 袁琦 A kind of semi-automatic construction method of pet knowledge mapping
CN109308322A (en) * 2018-12-04 2019-02-05 南京樯图数据科技有限公司 A kind of creation and transaction system of industrial economy knowledge mapping
CN110290116B (en) * 2019-06-04 2021-06-22 中山大学 Malicious domain name detection method based on knowledge graph
CN110290116A (en) * 2019-06-04 2019-09-27 中山大学 A kind of malice domain name detection method of knowledge based map
CN110489560A (en) * 2019-06-19 2019-11-22 民生科技有限责任公司 The little Wei enterprise portrait generation method and device of knowledge based graphical spectrum technology
CN110275966A (en) * 2019-07-01 2019-09-24 科大讯飞(苏州)科技有限公司 A kind of Knowledge Extraction Method and device
CN110275966B (en) * 2019-07-01 2021-10-01 科大讯飞(苏州)科技有限公司 Knowledge extraction method and device
CN110598003A (en) * 2019-08-15 2019-12-20 上海市大数据中心 Knowledge graph construction system and construction method based on public data resource catalog
CN110516047A (en) * 2019-09-02 2019-11-29 湖南工业大学 The search method and searching system of knowledge mapping based on packaging field
CN111091006B (en) * 2019-12-20 2023-08-29 北京百度网讯科技有限公司 Method, device, equipment and medium for establishing entity intention system
CN111091006A (en) * 2019-12-20 2020-05-01 北京百度网讯科技有限公司 Entity intention system establishing method, device, equipment and medium
CN111104524A (en) * 2019-12-25 2020-05-05 航天云网科技发展有限责任公司 Method for identifying television end user set
CN111078949A (en) * 2019-12-31 2020-04-28 北京明略软件系统有限公司 Product knowledge storage method and device, computer equipment and readable storage medium
CN111274327A (en) * 2020-01-09 2020-06-12 浙江工业大学 Entity and relation extraction method for unstructured table document
CN111274327B (en) * 2020-01-09 2021-08-03 浙江工业大学 Entity and relation extraction method for unstructured table document
CN111522927A (en) * 2020-04-15 2020-08-11 北京百度网讯科技有限公司 Entity query method and device based on knowledge graph
CN112463984A (en) * 2020-12-04 2021-03-09 北京明略软件系统有限公司 Database mode expansion method, device, equipment and computer readable medium
CN112463984B (en) * 2020-12-04 2024-02-27 北京明略软件系统有限公司 Database schema extension method, device, equipment and computer readable medium
CN113052005A (en) * 2021-02-08 2021-06-29 湖南工业大学 Garbage sorting method and garbage sorting device for home service
CN113052005B (en) * 2021-02-08 2024-02-02 湖南工业大学 Garbage sorting method and garbage sorting device for household service
CN113139022A (en) * 2021-04-29 2021-07-20 同济大学 Enterprise logistics data on-demand fusion method based on mixing rule
CN113377957A (en) * 2021-07-01 2021-09-10 浙江工业大学 National economy industry classification method and system based on knowledge graph
CN113836293B (en) * 2021-09-23 2024-04-16 平安国际智慧城市科技股份有限公司 Knowledge graph-based data processing method, device, equipment and storage medium
CN113836293A (en) * 2021-09-23 2021-12-24 平安国际智慧城市科技股份有限公司 Data processing method, device and equipment based on knowledge graph and storage medium
CN114168745B (en) * 2021-11-30 2022-08-09 大连理工大学 Knowledge graph construction method for production process of ethylene oxide derivative
CN114168745A (en) * 2021-11-30 2022-03-11 大连理工大学 Knowledge graph construction method for production process of ethylene oxide derivative
CN116702899B (en) * 2023-08-07 2023-11-28 上海银行股份有限公司 Entity fusion method suitable for public and private linkage scene
CN116702899A (en) * 2023-08-07 2023-09-05 上海银行股份有限公司 Entity fusion method suitable for public and private linkage scene
CN117520567A (en) * 2024-01-03 2024-02-06 卓世科技(海南)有限公司 Knowledge graph-based large language model training method
CN117520567B (en) * 2024-01-03 2024-04-02 卓世科技(海南)有限公司 Knowledge graph-based large language model training method

Similar Documents

Publication Publication Date Title
CN108446368A (en) A kind of construction method and equipment of Packaging Industry big data knowledge mapping
CN108446367A (en) A kind of the packaging industry data search method and equipment of knowledge based collection of illustrative plates
US7739257B2 (en) Search engine
CN103646032B (en) A kind of based on body with the data base query method of limited natural language processing
US20120215768A1 (en) Method and Apparatus for Creating Binary Attribute Data Relations
CN102968465A (en) Network information service platform and search service method based on network information service platform
Rinaldi et al. A matching framework for multimedia data integration using semantics and ontologies
CN106528648A (en) Distributed keyword approximate search method for RDF in combination with Redis memory database
CN111061828B (en) Digital library knowledge retrieval method and device
Stefanidis et al. A context‐aware preference database system
Ellis et al. Exploring big data with helix: Finding needles in a big haystack
Sarda et al. Mragyati: A system for keyword-based searching in databases
Gultom et al. Implementing web data extraction and making Mashup with Xtractorz
Latif et al. Harvesting Pertinent Resources from Linked Open Data.
Fakhre Alam et al. A comparative study of RDF and topic maps development tools and APIs
Tang et al. Ontology-based semantic retrieval for education management systems
CN111709239A (en) Geoscience data discovery method based on expert logic structure tree
Zhong et al. 3SEPIAS: A semi-structured search engine for personal information in dataspace system
Zhang et al. Ontology database construction for medical knowledge base
Latif et al. Weaving Scholarly Legacy Data into Web of Data.
Chen et al. Robust and Efficient Annotation based on Ontology Evolution for Deep Web Data.
Sánchez-Zamora et al. Visualizing tags as a network of relatedness
Leshcheva et al. Towards a method of ontology population from heterogeneous sources of structured data
Deshmukh et al. An improved approach for deep web data extraction
Hamdy et al. A hybrid framework for applying semantic integration technologies to improve data quality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180824