CN111708893A - Scientific and technological resource integration method and system based on knowledge graph - Google Patents
- Publication number
- CN111708893A CN202010410946.0A
- Authority
- CN
- China
- Prior art keywords
- data
- scientific
- technological
- knowledge
- knowledge graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
Abstract
The invention discloses a scientific and technological resource integration method and system based on a knowledge graph. The method comprises the following steps: collecting raw data of different sources and structures from a network; cleaning the raw data and unifying the data formats to obtain processed data that meets the construction conditions; extracting scientific and technological knowledge from the processed data; converting the extracted scientific and technological knowledge, by batch import, into a knowledge graph stored in graph form; and fusing entities of the same type according to the knowledge graph. The method vectorizes the knowledge in a multi-source knowledge graph of the scientific and technological resource field and fuses it through similarity calculation. Starting from the acquisition of scientific and technological resources from multiple data sources, it builds the domain knowledge graph from the bottom up and uses the relationships between entities in the constructed graph to obtain a better integration effect.
Description
Technical Field
The invention relates to the technical field of scientific and technological services, in particular to a scientific and technological resource integration method and system based on a knowledge graph.
Background
In the field of scientific and technological services, scientific and technological resources often span several categories, such as papers, patents, scientific and technological achievements, experts, and institutions. Existing scientific and technological resource service platforms contain only some of these categories, or only part of the data within them, and the organization format and content of the data often differ between platforms, so users encounter great difficulty when acquiring knowledge across platforms. In recent years, knowledge graphs have become a research hotspot in computer science. Constructing a knowledge graph in the highly specialized field of scientific and technological resources makes it possible to integrate data across platforms and to present the scientific and technological resource data within them well. At present, the main modes of scientific and technological resource management and integration in this field are as follows:
(1) Scientific and technological resource management based on manual arrangement. This approach is represented mainly by encyclopedia websites and the scientific and technological resource platforms of provinces and cities across the country. The data in these platforms is usually uploaded by the holders or managers of the relevant resources, and the resource information is arranged manually before being handed to the platform for management. It is a traditional, manual integration and management scheme.
(2) Scientific and technological resource platforms based on network data. This approach is represented mainly by third-party websites. The data in such a website is usually obtained in several ways, such as purchasing, manual sorting, and network acquisition. It is then filtered with various screening and sorting algorithms, and finally the filtered resources are stored in a database for management and displayed to users over the Internet.
(3) Methods based on a knowledge graph. This approach uses a knowledge graph to abstract and organize the relationships contained in network data in order to discover possible new relationships among scientific and technological resources. It is a newer approach based on entity relevance.
In summary, the current integration methods for scientific and technological resources are mainly: 1) manual arrangement by each platform; 2) scientific and technological resource management based on network data acquisition, which mainly aims to acquire as much resource data as possible by different methods and present it to users as a network service; and 3) scientific and technological resource management based on a knowledge graph, which is mainly aimed at mining potential relationships and making recommendations to users.
However, method 1, although simple and easy to implement, requires a large amount of manual arrangement, and the cost of that work keeps rising as the scientific and technological resource data on the network continues to grow. Method 2 is the current mainstream; its core is data acquisition, in which large amounts of resource data are collected from the network and displayed as a network service, but the relationships among the resources are not fully considered. Method 3 has achieved notable results in potential relationship mining and recommendation, but it is limited by the richness and authority of the content of the knowledge graph and does not fully address the integration of data from multiple data platforms on the network.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a scientific and technological resource integration method based on a knowledge graph, which utilizes the relationship between entities in the knowledge graph in the established scientific and technological resource field to obtain a better scientific and technological resource integration effect.
The invention also aims to provide a scientific and technological resource integration system based on the knowledge graph.
In order to achieve the above object, an embodiment of the present invention provides a scientific and technological resource integration method based on a knowledge graph, including: collecting raw data of different sources and structures in a network; cleaning the raw data and unifying the data formats to obtain processed data meeting the construction conditions; extracting scientific and technological knowledge from the processed data; converting the extracted scientific and technological knowledge, by batch import, into a knowledge graph in graph form; and fusing entities of the same type according to the knowledge graph.
The scientific and technological resource integration method based on the knowledge graph in the embodiment of the invention provides a knowledge-graph-based integration scheme for multiple data sources: knowledge in the multi-source knowledge graph of the scientific and technological resource field is vectorized and fused through similarity calculation. Starting from the acquisition of scientific and technological resources from multiple data sources, the method constructs the domain knowledge graph from the bottom up and uses the relationships between entities in the constructed graph to obtain a better integration effect.
In addition, the scientific and technological resource integration method based on the knowledge graph according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, collecting raw data of different sources and structures in the network further includes: taking an entity A obtained from other sources as the keyword of an initial search and searching a preset website; screening and sorting the semi-structured data obtained, wherein the structured part is stored in a database as attributes together with the entity A, and the remaining unstructured data is stored separately so that more entities and relationships can be obtained during the subsequent construction of the knowledge graph; and setting a retrieval depth M, iteratively retrieving the first M retrieval results in the same way, and stopping once the retrieval depth M is reached.
Further, in an embodiment of the present invention, cleaning the raw data and unifying the data formats to obtain processed data meeting the construction conditions further includes: uniformly filling in the missing basic information of some data according to its data source or setting it to a null value; and/or uniformly handling erroneous data or problematic fields by setting them to a null value or modifying them to a preset value; or uniformly renaming data that is named or organized differently according to a relevant synonym library so as to make the data consistent.
Further, in an embodiment of the present invention, fusing entities of the same type according to the knowledge graph specifically includes:
ranking the different data sources by authority, vectorizing the contents of the scientific and technological resource entities, and fusing similar resources by a similarity calculation method, wherein, for a field shared by the fused entities, the field value whose authority weight satisfies a preset condition is taken as the content of the new entity.
Further, in an embodiment of the present invention, the calculation formula of the similarity is:
C=(A x B)/(|A|*|B|),
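wherein A and B are the vectors representing the two entities, A x B denotes the multiplication of the two vectors, and |A| and |B| denote the lengths of the vectors.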
In order to achieve the above object, an embodiment of another aspect of the present invention provides a scientific and technological resource integration system based on a knowledge graph, including: a scientific and technological resource acquisition module for collecting raw data of different sources and structures in a network; a scientific and technological resource processing module for cleaning the raw data and unifying the data formats to obtain processed data meeting the construction conditions; a scientific and technological knowledge extraction module for extracting scientific and technological knowledge from the processed data; a scientific and technological resource storage module for converting the extracted scientific and technological knowledge, by batch import, into a knowledge graph in graph form; and a scientific and technological resource integration module for fusing entities of the same type according to the knowledge graph.
The scientific and technological resource integration system based on the knowledge graph in the embodiment of the invention provides a knowledge-graph-based integration scheme for multiple data sources: knowledge in the multi-source knowledge graph of the scientific and technological resource field is vectorized and fused through similarity calculation. Starting from the acquisition of scientific and technological resources from multiple data sources, the system constructs the domain knowledge graph from the bottom up and uses the relationships between entities in the constructed graph to obtain a better integration effect.
In addition, the scientific and technological resource integration system based on knowledge graph according to the above embodiment of the present invention may also have the following additional technical features:
Further, in an embodiment of the present invention, the scientific and technological resource acquisition module is further configured to: take an entity A obtained from other sources as the keyword of an initial search and search a preset website; screen and sort the semi-structured data obtained, wherein the structured part is stored in a database as attributes together with the entity A, and the remaining unstructured data is stored separately so that more entities and relationships can be obtained during the subsequent construction of the knowledge graph; and set a retrieval depth M, iteratively retrieve the first M retrieval results in the same way, and stop once the retrieval depth M is reached.
Further, in an embodiment of the present invention, the scientific and technological resource processing module is further configured to uniformly fill in the missing basic information of some data according to its data source or set it to a null value; and/or to uniformly handle erroneous data or problematic fields by setting them to a null value or modifying them to a preset value; or to uniformly rename data that is named or organized differently according to a relevant synonym library so as to make the data consistent.
Further, in an embodiment of the present invention, the scientific and technological resource integration module is specifically configured to rank the different data sources by authority, vectorize the contents of the scientific and technological resource entities, and fuse similar resources by a similarity calculation method, wherein, for a field shared by the fused entities, the field value whose authority weight satisfies a preset condition is taken as the content of the new entity.
Further, in an embodiment of the present invention, the calculation formula of the similarity is:
C=(A x B)/(|A|*|B|),
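wherein A and B are the vectors representing the two entities, A x B denotes the multiplication of the two vectors, and |A| and |B| denote the lengths of the vectors.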
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a method for integrating scientific and technological resources based on knowledge-graph according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for integrating scientific and technological resources based on knowledge-graph according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example entity fusion policy according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a scientific and technological resource integration system based on knowledge-graph according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The scientific and technological resource integration method and system based on the knowledge graph provided by the embodiments of the present invention are described below with reference to the accompanying drawings. The method is described first.
FIG. 1 is a flowchart of a scientific and technological resource integration method based on knowledge-graph according to an embodiment of the present invention.
As shown in fig. 1, the scientific and technological resource integration method based on knowledge graph includes the following steps:
In step S101, raw data of different sources and structures in the network is collected.
In one embodiment of the present invention, collecting raw data of different sources and structures in the network further comprises: taking an entity A obtained from other sources as the keyword of an initial search and searching a preset website; screening and sorting the semi-structured data obtained, wherein the structured part is stored in a database as attributes together with the entity A, and the remaining unstructured data is stored separately so that more entities and relationships can be obtained during the subsequent construction of the knowledge graph; and setting a retrieval depth M, iteratively retrieving the first M retrieval results in the same way, and stopping once the retrieval depth M is reached.
Specifically, as shown in fig. 2, scientific and technological resource data is generally distributed across various platforms on the network, such as vertical-domain websites for the relevant resources, national, provincial, and municipal scientific and technological resource platforms, encyclopedia websites, and news websites. Data of different sources and structures in the network therefore needs to be collected, which requires the support of crawler technology.
In addition to acquiring data with a basic crawler tool, and in order to acquire scientific and technological resource data of higher authority, the embodiment of the invention also designs a method for expanding synonymous or homonymous entities based on encyclopedia websites: iterative retrieval is performed over the data in the more authoritative encyclopedia websites to expand the scientific and technological resource data set. The specific method is as follows. An entity A obtained from other sources is used as the keyword of an initial search in the encyclopedia websites, and the semi-structured data obtained is screened and sorted; the structured part is stored in a database as attributes together with the entity A, and the remaining unstructured data is stored separately so that more entities and relationships can be obtained during the subsequent construction of the knowledge graph. A retrieval depth M is set, the first M retrieval results are iteratively retrieved in the same way, and the retrieval stops once a suitable depth M is reached, which prevents the retrieved information from converging after many iterations.
After the scientific and technological resources are acquired, the authority of each data source is labeled manually for use in the subsequent integration of scientific and technological resources.
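The iterative retrieval described above can be sketched roughly as a bounded breadth-first expansion. This is only an illustration under stated assumptions: the `search` callable, the `Hit` tuple layout, and the function name `expand_entities` are hypothetical, and the source uses a single value M for both the number of results followed and the depth limit, which the sketch exposes as two parameters (`top_m`, `max_depth`) that may be set to the same value.

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical shape of one search hit:
# (entry title, structured attributes, unstructured text).
Hit = Tuple[str, Dict[str, str], str]


def expand_entities(seed: str,
                    search: Callable[[str], List[Hit]],
                    top_m: int,
                    max_depth: int):
    """Expand a seed entity by iterative retrieval, bounded by max_depth."""
    attributes: Dict[str, Dict[str, str]] = {}  # structured part, stored with the searched entity
    raw_text: Dict[str, List[str]] = {}         # unstructured part, stored separately
    seen: Set[str] = {seed}
    queue = deque([(seed, 0)])
    while queue:
        keyword, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached: stop expanding this branch
        # Only the first top_m results are kept and followed.
        for title, attrs, text in search(keyword)[:top_m]:
            attributes.setdefault(keyword, {}).update(attrs)
            raw_text.setdefault(keyword, []).append(text)
            if title not in seen:
                seen.add(title)
                queue.append((title, depth + 1))
    return attributes, raw_text, seen
```

In practice `search` would wrap whatever crawler queries the preset encyclopedia website; the `seen` set keeps an entry from being retrieved twice when several results point to it.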
In step S102, data cleaning is performed on the raw data, and data formats are unified to obtain processed data that satisfies the construction conditions.
In an embodiment of the present invention, cleaning the raw data and unifying the data formats to obtain processed data meeting the construction conditions further includes: uniformly filling in the missing basic information of some data according to its data source or setting it to a null value; and/or uniformly handling erroneous data or problematic fields by setting them to a null value or modifying them to a preset value; or uniformly renaming data that is named or organized differently according to a relevant synonym library so as to make the data consistent.
Specifically, as shown in fig. 2, the raw data obtained in the acquisition step has many problems, such as missing fields, abnormal data, and differences in content. The raw data must therefore be cleaned, and the unified data format that results from cleaning also suits the related data analysis and the construction of the knowledge graph. The cleaning comprises the following work:
Missing-field handling: the missing basic information of some data is filled in uniformly according to its data source, or set to a null value.
Abnormal-data handling: obviously erroneous data, and field problems caused by typesetting or by the crawler, are handled uniformly by setting the values to null or modifying them to other values.
Content-difference handling: data that is named or organized differently is renamed uniformly according to a relevant synonym library to make the data consistent, which reduces the number of homonymous or synonymous entities in the acquired scientific and technological resources.
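As a minimal sketch of these three cleaning steps, assuming a flat record per resource: the field names, the per-source defaults, the plausibility rule for `year`, and the synonym table below are all hypothetical placeholders, not values taken from the source.

```python
from typing import Dict, Optional

# Hypothetical configuration for illustration only.
REQUIRED_FIELDS = ("name", "institution", "country", "year")
SOURCE_DEFAULTS = {"encyclopedia": {"country": "CN"}}
SYNONYMS = {"Example Univ.": "Example University"}


def clean_record(record: Dict[str, object], source: str) -> Dict[str, Optional[object]]:
    """Apply missing-field, abnormal-data, and content-difference handling."""
    defaults = SOURCE_DEFAULTS.get(source, {})
    cleaned: Dict[str, Optional[object]] = {}
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if isinstance(value, str):
            value = value.strip() or None  # blank strings count as missing
        # Missing field: fill from the per-source default, else leave null.
        if value is None:
            value = defaults.get(field)
        cleaned[field] = value

    # Abnormal data: a year that is not a plausible number is set to null.
    year = cleaned["year"]
    if year is not None and not (str(year).isdigit() and 1900 <= int(str(year)) <= 2100):
        cleaned["year"] = None

    # Content difference: map variant names to one canonical name.
    institution = cleaned["institution"]
    if institution is not None:
        cleaned["institution"] = SYNONYMS.get(institution, institution)
    return cleaned
```

Every cleaned record ends up with the same set of fields, which is what the later extraction and import steps rely on.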
In step S103, scientific knowledge is extracted from the processed data.
Specifically, as shown in fig. 2, the scientific and technical knowledge extraction includes:
(1) extraction from open-link data sets
The open linked data on the network contains abundant concept and entity information, and the data is usually organized and presented in a definite structure. Some concept and entity information can therefore be extracted from this kind of data source and parsed directly into a form that can be stored and displayed in the knowledge graph.
(2) Extracting from encyclopedia website
The title of an encyclopedia article, that is, the searched entry, is used as an entity. The structured and semi-structured data in the article is processed using the data in the scientific and technological resource pool and then used as the attribute values of the entity. The article categories and article tags in the encyclopedia can serve as concept candidates and, after manual screening, can be stored as concepts of the ontology layer. The specific steps are as follows:
First, the category column used for classification in the encyclopedia is screened to obtain nouns related to scientific and technological resources, which can be used directly as concepts, because encyclopedia data that has been screened and checked by editors or encyclopedia managers is more reliable. In addition, the tag column in the encyclopedia holds a large number of concept-related words, but these words cover a wide range, so a domain lexicon must be established to store entries related to the scientific and technological resource field. The tags obtained from the encyclopedia are screened against the entries in the domain lexicon, and the resulting tags can be stored as concepts after manual screening.
Secondly, the titles of encyclopedia articles, that is, the search terms, are stored as entities, because these terms were queried from existing entities. Information such as related persons or institutions recommended in the encyclopedia can be queried with the iterative encyclopedia query method used for the scientific and technological resource pool, and each title returned is used as a new entity. Attribute information that may appear in the attribute column when an entity is queried, such as the affiliated organization or related technologies, can also be queried iteratively to form new entities, but which attributes are selected as new entities must be specified manually.
Finally, the attribute values in the encyclopedia website are extracted. The content shown in the attribute column of an encyclopedia article is usually stored as pairs of attribute and attribute value.
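The encyclopedia extraction steps above might be sketched as follows. The entry layout (`title`, `categories`, `tags`, `infobox`) and the names `Extraction` and `extract_entry` are assumptions made for illustration; the manual screening that the text requires is represented only by keeping the lexicon-matched tags in a separate candidate list.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Extraction:
    entity: str                    # article title, stored as the entity
    attributes: Dict[str, str]     # attribute and attribute-value pairs
    concepts: List[str]            # category column, usable directly as concepts
    concept_candidates: List[str]  # tags matched by the lexicon, pending manual review


def extract_entry(entry: dict, domain_lexicon: Set[str]) -> Extraction:
    """Turn one parsed encyclopedia entry (hypothetical layout) into graph material."""
    # Attribute column: keep non-empty (attribute, value) pairs.
    attributes = {
        key.strip(): value.strip()
        for key, value in entry.get("infobox", [])
        if value and value.strip()
    }
    # Tags cover many fields, so keep only those present in the domain lexicon.
    candidates = [tag for tag in entry.get("tags", []) if tag in domain_lexicon]
    return Extraction(
        entity=entry["title"],
        attributes=attributes,
        concepts=list(entry.get("categories", [])),
        concept_candidates=candidates,
    )
```

Keeping `concepts` and `concept_candidates` apart mirrors the distinction in the text between category nouns, which are used directly, and tags, which still need a manual pass.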
(3) Extraction from vertical domain websites and related scientific and technological resource platforms
In vertical-domain websites and related scientific and technological resource platforms, the resource data has usually been sorted by the platform, so it is well structured, and the concept-layer ontology can be obtained from the resource classification displayed by the website platform. However, the resource data displayed on different platforms often has a different emphasis, so the differences between the data of different platforms must be resolved when the entities and attributes of the platforms are extracted.
In step S104, the extracted scientific and technical knowledge is converted into a knowledge graph in a graph mode by means of batch import.
It can be understood that, as shown in fig. 2, the embodiment of the present invention needs to store the acquired knowledge about scientific and technological resources in graph form. The structured entities and relationships obtained by extraction are therefore converted by batch import into a knowledge graph in graph form, so as to integrate the scientific and technological resources.
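One common way to prepare such a batch import is to turn the extracted (head, relation, tail) triples into a node table and an edge table that a graph database's bulk loader can read. The source does not name a storage engine or file format, so the CSV layout and the function names below are assumptions.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def triples_to_tables(triples: Iterable[Triple]):
    """Assign an integer id to each distinct entity and express edges by id."""
    node_ids: Dict[str, int] = {}
    edges: List[Tuple[int, str, int]] = []
    for head, relation, tail in triples:
        for name in (head, tail):
            node_ids.setdefault(name, len(node_ids))  # first occurrence fixes the id
        edges.append((node_ids[head], relation, node_ids[tail]))
    return node_ids, edges


def to_csv(header: List[str], rows: Iterable[tuple]) -> str:
    """Serialize a table to CSV text for a bulk import tool."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing nodes and edges as two separate tables lets the whole extracted set be loaded in one pass rather than inserted entity by entity.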
In step S105, entities of the same type are fused according to the knowledge graph.
In an embodiment of the present invention, fusing entities of the same type according to the knowledge graph specifically includes: ranking the different data sources by authority, vectorizing the contents of the scientific and technological resource entities, and fusing similar resources by a similarity calculation method, wherein, for a field shared by the fused entities, the field value whose authority weight satisfies a preset condition is taken as the content of the new entity.
Specifically, as shown in fig. 2, integrating scientific and technological resources requires fusing entities of the same type, such as experts or institutions, type by type, because two entities with the same name but different types are different entities. Based on the constructed knowledge graph, the embodiment of the invention compares the similarity of entity content, attributes, and entity relationships, and fuses the entities with higher confidence as synonymous entities. The specific method is as follows:
After relationships have been established for the entities in the knowledge graph, each entity is mapped to a node in the knowledge graph network. Entity information such as its name, attributes, and relationships serves as the entity's identifier in the network and distinguishes it from other entities, so the entities can be classified and fused using the information in the domain knowledge graph.
The embodiment of the invention abstracts an entity by its content, the attributes it contains, and its relationships with other entities, and maps this entity information into a vector space of specified dimension. Entities with high similarity, as determined by a similarity calculation, are regarded as the same entity and fused, and the fused entity is stored in the knowledge graph as a new, complete entity.
The embodiment of the invention vectorizes entity information by word-vector conversion. News and other text data acquired together with the scientific and technological resources are integrated and used as pre-training data to pre-train a model. The related information of each entity is then given to the pre-trained model for training, with the vector dimension of each entity trained by the model specified in advance as 50. Next, the pieces of information in an entity are integrated, that is, their corresponding vectors are combined in order, so that each entity forms a new 200-dimensional vector according to a pre-selected field format. Finally, the cosine similarity C between entity A and entity B is calculated, giving a value in the range [-1, 1]. C represents the similarity of the contents of the two entities: the closer C is to 1, the more similar the two vectors, and hence the two entities. The formula is as follows, where |A| and |B| denote the lengths of vectors A and B respectively, and x denotes the multiplication of the two vectors:
C=(A x B)/(|A|*|B|)
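The vectorization and similarity steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the four field names in `FIELD_ORDER` are hypothetical (the text only states that 50-dimensional vectors are combined into a 200-dimensional vector according to a pre-selected field format), a missing field is assumed to become a zero vector, and the per-field vectors are assumed to come from an already trained word vector model.

```python
import math
from typing import Dict, List, Sequence

FIELD_DIM = 50  # per-field vector dimension stated in the description
# Hypothetical field order; four fields of 50 dimensions give the 200 dimensions in the text.
FIELD_ORDER = ("name", "content", "attributes", "relations")


def entity_vector(fields: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the per-field vectors in a fixed order into one entity vector."""
    combined: List[float] = []
    for key in FIELD_ORDER:
        # Assumption: a field with no vector is represented by zeros.
        vec = list(fields.get(key, [0.0] * FIELD_DIM))
        if len(vec) != FIELD_DIM:
            raise ValueError(f"field {key!r} must have {FIELD_DIM} dimensions")
        combined.extend(vec)
    return combined


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return C = (A x B) / (|A| * |B|), where x is the dot product."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)
```

Two entities would then be treated as candidates for fusion when `cosine_similarity(entity_vector(a), entity_vector(b))` exceeds a chosen threshold; the description does not state a threshold value, so any value used here would be a tuning assumption.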
In the process of entity fusion, two entities that are regarded as synonymous are fused. During fusion, three situations may arise: the attributes of the synonymous entities are consistent, the attributes are inconsistent, or an attribute is vacant. The corresponding handling is as follows:
If the attributes of the synonymous entities are consistent, the corresponding attributes and attribute values are stored in the new fused entity. If the attributes are inconsistent, the sources of the entity contents are ranked by authority (encyclopedias have the highest authority, followed by national scientific and technological resource websites, then related vertical-domain websites, and so on), and the attribute from the more authoritative source is stored in the new entity as the new attribute. In the implementation, the database holds a priority field: content from a more authoritative website has a lower priority value (the highest priority is 1), and content from a less authoritative website has a higher value. The priorities are compared during fusion, and the attribute content of the entity with the lower priority value, that is, the higher authority, is taken as the attribute of the new entity. If an attribute of one synonymous entity is vacant, the corresponding attribute value of the other synonymous entity is checked: if it is also vacant, the attribute value is left empty; otherwise, that attribute value is stored in the new entity. The specific entity fusion strategy is shown in FIG. 3.
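The three cases of the fusion strategy can be illustrated with a short sketch. It assumes each entity is a flat mapping from attribute name to value and that each source carries an integer priority in which 1 is the most authoritative, as described above. The names `fuse_entities` and `is_vacant`, and the tie-breaking rule for equal priorities, are hypothetical.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def is_vacant(value: Optional[str]) -> bool:
    """An attribute is vacant if it is missing, None, or blank."""
    return value is None or value.strip() == ""


def fuse_entities(a: Record, priority_a: int, b: Record, priority_b: int) -> Record:
    """Fuse two synonymous entities; a lower priority value means higher authority."""
    fused: Record = {}
    for attr in sorted(set(a) | set(b)):
        va, vb = a.get(attr), b.get(attr)
        if is_vacant(va) and is_vacant(vb):
            fused[attr] = None  # vacant on both sides: leave the value empty
        elif is_vacant(va):
            fused[attr] = vb    # vacant on one side: take the other value
        elif is_vacant(vb):
            fused[attr] = va
        elif va == vb:
            fused[attr] = va    # consistent: keep the shared value
        else:
            # Inconsistent: keep the value from the more authoritative source.
            # Assumption: on equal priority, the first entity wins.
            fused[attr] = va if priority_a <= priority_b else vb
    return fused
```

For example, fusing a record from a priority-1 source with a record from a priority-3 source keeps the priority-1 value wherever the two disagree, while still filling in attributes that only the lower-authority record provides.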
To sum up, the embodiment of the present invention provides a scientific and technological resource integration method based on a knowledge graph, comprising scientific and technological resource acquisition, scientific and technological resource processing, scientific and technological knowledge extraction, scientific and technological resource storage, and scientific and technological resource integration. The specific work includes: 1. for the scientific and technological resource data of each platform in the network, constructing a multi-source scientific and technological resource pool, and ranking the credibility of data from different platforms in the pool by manually set authority values; 2. extracting entities and relationships from the data in the resource pool and constructing a knowledge graph of the scientific and technological resource field; 3. for entities in the knowledge graph, vectorizing the information of similar entities (entity names, entity attributes, entity relationships, and the like), fusing similar entities by calculating vector similarity, comparing conflicting entity attributes and relationships during fusion according to the authority of their data sources, and taking the attributes and relationships with higher authority as those of the new entity, thereby obtaining the integrated knowledge graph and displaying the result.
The scientific and technological resource integration method based on the knowledge graph provided by the embodiment of the invention has the following advantages:
(1) Existing scientific and technological resource integration schemes do not fully consider the correlation among scientific and technological resources from multiple data sources; they generally integrate resources through manually set rules and screening after the data has been acquired from the network. The embodiment of the invention provides a scientific and technological resource data integration method based on multiple data sources: a domain knowledge graph is constructed by integrating scientific and technological resource data from multiple data sources, and the knowledge in the graph is then fused to obtain a knowledge-graph-based integration scheme. Specifically, the method starts from the acquisition of multi-source scientific and technological resources, constructs the domain knowledge graph from the bottom up, and uses the relationships between entities in the constructed graph to obtain a better integration effect.
(2) Existing schemes for constructing scientific and technological resource knowledge graphs generally build the domain knowledge graph top-down by ontology-layer modeling, and manage the resulting graph with related technologies in order to provide recommendation and relationship mining services. The embodiment of the invention designs a knowledge-graph-based integration scheme for multiple data sources by vectorizing the knowledge in the multi-source domain knowledge graph and fusing it through similarity calculation. Specifically: during data acquisition, data in encyclopedia websites is retrieved by iterative search to a set depth in order to expand the scientific and technological resource data set; the different data sources are ranked by authority; after the contents of the entities are vectorized, similar resources are fused by similarity calculation; and for a field that is the same across the fused entities, the value with the higher weight is taken as the content of the new entity.
Next, a scientific and technological resource integration system based on knowledge graph according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 4 is a schematic structural diagram of a scientific and technological resource integration system based on knowledge-graph according to an embodiment of the present invention.
As shown in fig. 4, the system 10 for integrating scientific and technological resources based on knowledge-graph includes: the system comprises a scientific and technological resource acquisition module 100, a scientific and technological resource processing module 200, a scientific and technological knowledge extraction module 300, a scientific and technological resource storage module 400 and a scientific and technological resource integration module 500.
The scientific and technological resource acquisition module 100 is used for collecting raw data of different sources and structures in a network; the scientific and technological resource processing module 200 is used for performing data cleaning on the raw data and unifying data formats to obtain processed data meeting the construction conditions; the scientific and technological knowledge extraction module 300 is used for extracting scientific and technological knowledge from the processed data; the scientific and technological resource storage module 400 is used for converting the extracted scientific and technological knowledge into a graph-form knowledge graph by batch import; the scientific and technological resource integration module 500 is used for fusing entities of the same type according to the knowledge graph. The system 10 of the embodiment of the invention obtains a better scientific and technological resource integration effect by using the relationships between the entities in the constructed knowledge graph of the scientific and technological resource field.
Further, in an embodiment of the present invention, the scientific and technological resource acquisition module 100 is further configured to: take an entity A obtained from other sources as the initial search keyword and search a preset website; screen and sort the obtained semi-structured data, storing the structured part in a database together with entity A as its attributes and storing the remaining unstructured data separately, so that more entities and relationships can be obtained when the knowledge graph is subsequently constructed; and set a retrieval depth M, iteratively retrieve the first M retrieval results in the same way, and stop once the retrieval depth M is reached.
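The iterative retrieval described for the acquisition module can be sketched as a depth-limited traversal. The sketch makes several assumptions that are not in the text: the `search` callable and the page shape it returns (`attributes`, `text`, `results`) are hypothetical stand-ins for querying the preset website, the traversal order is breadth-first, and keywords that were already retrieved are skipped. As in the description, the single value M is used both as the depth limit and as the number of results followed per page.

```python
from collections import deque
from typing import Any, Callable, Dict, Tuple

# Hypothetical page shape: {"attributes": {...}, "text": "...", "results": [...]}
Page = Dict[str, Any]


def iterative_retrieval(
    seed: str, search: Callable[[str], Page], m: int
) -> Tuple[Dict[str, Dict[str, str]], Dict[str, str]]:
    """Retrieve the seed entity, then follow the first m results of each page up to depth m."""
    structured: Dict[str, Dict[str, str]] = {}  # keyword -> structured attributes
    unstructured: Dict[str, str] = {}           # keyword -> leftover free text
    queue = deque([(seed, 0)])
    seen = {seed}
    while queue:
        keyword, depth = queue.popleft()
        page = search(keyword)
        structured[keyword] = dict(page.get("attributes", {}))
        unstructured[keyword] = str(page.get("text", ""))
        if depth >= m:
            continue  # retrieval depth reached: do not expand further
        for nxt in list(page.get("results", []))[:m]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return structured, unstructured
```

The structured attributes would be stored with the entity, while the unstructured text is kept separately for later entity and relationship extraction.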
Further, in an embodiment of the present invention, the scientific and technological resource processing module 200 is further configured to: uniformly fill in missing basic information of some data, or set it to a null value, according to the data source; and/or uniformly process erroneous data or problematic fields by setting them to a null value or modifying them to a preset value; or uniformly rename data whose names or institution names differ, according to a related synonym library, to achieve data consistency.
Further, in an embodiment of the present invention, the scientific and technological resource integration module 500 is specifically configured to rank the different data sources by authority, vectorize the contents of the scientific and technological resource entities and fuse similar resources by a similarity calculation method, and, for a field that is the same across the fused entities, take the value whose weight satisfies a preset condition as the content of the new entity.
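A minimal sketch of these cleaning rules is given below. All concrete values are hypothetical and chosen only for illustration: the required field names, the markers treated as erroneous, and the entries of the synonym library. The description does not specify any of them.

```python
from typing import Dict, Optional

# Hypothetical synonym library mapping variant names to one canonical name.
SYNONYMS = {
    "Example Univ.": "Example University",
    "EU": "Example University",
}
# Hypothetical schema and markers for unusable values.
REQUIRED_FIELDS = ("name", "institution", "field")
INVALID_MARKERS = {"", "n/a", "unknown", "-"}


def clean_record(
    raw: Dict[str, object], defaults: Optional[Dict[str, str]] = None
) -> Dict[str, Optional[str]]:
    """Unify one raw record: fill or null missing fields, fix bad values, unify names."""
    defaults = defaults or {}
    cleaned: Dict[str, Optional[str]] = {}
    for key in REQUIRED_FIELDS:
        value = raw.get(key)
        text = value.strip() if isinstance(value, str) else None
        if text is None or text.lower() in INVALID_MARKERS:
            # Missing or erroneous: use the preset value if one exists, else null.
            cleaned[key] = defaults.get(key)
        else:
            # Differently named data is mapped to its canonical name.
            cleaned[key] = SYNONYMS.get(text, text)
    return cleaned
```

In practice the `defaults` mapping could differ per data source, which is one way to read "according to different data sources" in the text.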
Further, in an embodiment of the present invention, the calculation formula of the similarity is:
C=(A x B)/(|A|*|B|),
wherein A and B represent entities.
It should be noted that the above explanation of the embodiment of the method for integrating scientific and technological resources based on a knowledge graph is also applicable to the system for integrating scientific and technological resources based on a knowledge graph of the embodiment, and is not repeated here.
The scientific and technological resource integration system based on the knowledge graph provided by the embodiment of the invention vectorizes the knowledge in the multi-source knowledge graph of the scientific and technological resource field and fuses it through similarity calculation, thereby providing a knowledge-graph-based integration scheme for multiple data sources. Starting from the acquisition of scientific and technological resources from multiple data sources, the system constructs the domain knowledge graph from the bottom up and obtains a better integration effect by using the relationships between entities in the constructed knowledge graph.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A scientific and technological resource integration method based on knowledge graph is characterized by comprising the following steps:
collecting raw data of different sources and structures in a network;
performing data cleaning on the raw data, and unifying data formats to obtain processed data meeting construction conditions;
extracting scientific and technical knowledge from the processed data;
performing data form conversion on the extracted scientific and technological knowledge in a batch import mode, and converting the scientific and technological knowledge into a knowledge graph in a graph mode; and
fusing entities of the same type according to the knowledge graph.
2. The method of claim 1, wherein collecting raw data from different sources and structures in the network further comprises:
taking an entity A obtained from other sources as the initial search keyword, searching a preset website, and screening and sorting the obtained semi-structured data, wherein the structured part is stored in a database together with entity A as its attributes, and the remaining unstructured data is stored separately so that more entities and relationships can be obtained when the knowledge graph is subsequently constructed;
and setting a retrieval depth M, iteratively retrieving the first M retrieval results in the same way, and stopping once the retrieval depth M is reached.
3. The method of claim 1, wherein the performing data cleaning on the raw data and unifying data formats to obtain the processed data satisfying the construction condition further comprises:
uniformly filling in missing basic information of some data, or setting it to a null value, according to the data source; and/or uniformly processing erroneous data or problematic fields by setting them to a null value or modifying them to a preset value; or uniformly renaming data whose names or institution names differ, according to a related synonym library, to achieve data consistency.
4. The method according to claim 1, wherein fusing entities of the same type according to the knowledge graph specifically comprises:
ranking the different data sources by authority, vectorizing the contents of the scientific and technological resource entities, fusing similar resources by a similarity calculation method, and, for a field that is the same across the fused entities, taking the value whose weight satisfies a preset condition as the content of the new entity.
5. The method according to claim 4, wherein the similarity is calculated by the formula:
C=(A x B)/(|A|*|B|),
wherein A and B represent entities.
6. A scientific and technological resource integration system based on knowledge graph is characterized by comprising:
the scientific and technological resource acquisition module is used for collecting original data of different sources and structures in a network;
the scientific and technological resource processing module is used for performing data cleaning on the raw data and unifying data formats to obtain processed data meeting construction conditions;
the scientific and technological knowledge extraction module is used for extracting scientific and technological knowledge from the processed data;
the scientific and technological resource storage module is used for converting the extracted scientific and technological knowledge into a knowledge graph in a graph mode in a batch import mode; and
the scientific and technological resource integration module is used for fusing entities of the same type according to the knowledge graph.
7. The system of claim 6, wherein the scientific resource acquisition module is further configured to:
taking an entity A obtained from other sources as the initial search keyword, searching a preset website, and screening and sorting the obtained semi-structured data, wherein the structured part is stored in a database together with entity A as its attributes, and the remaining unstructured data is stored separately so that more entities and relationships can be obtained when the knowledge graph is subsequently constructed;
and setting a retrieval depth M, iteratively retrieving the first M retrieval results in the same way, and stopping once the retrieval depth M is reached.
8. The system according to claim 6, wherein the scientific and technological resource processing module is further configured to uniformly fill in missing basic information of some data, or set it to a null value, according to the data source; and/or to uniformly process erroneous data or problematic fields by setting them to a null value or modifying them to a preset value; or to uniformly rename data whose names or institution names differ, according to a related synonym library, to achieve data consistency.
9. The system according to claim 6, wherein the scientific and technological resource integration module is specifically configured to rank the different data sources by authority, vectorize the contents of the scientific and technological resource entities and fuse similar resources by a similarity calculation method, and, for a field that is the same across the fused entities, take the value whose weight satisfies a preset condition as the content of the new entity.
10. The system according to claim 9, wherein the similarity is calculated by the formula:
C=(A x B)/(|A|*|B|),
wherein A and B represent entities.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010410946.0A CN111708893A (en) | 2020-05-15 | 2020-05-15 | Scientific and technological resource integration method and system based on knowledge graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111708893A (en) | 2020-09-25 |
Family
ID=72537816
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010410946.0A Pending CN111708893A (en) | 2020-05-15 | 2020-05-15 | Scientific and technological resource integration method and system based on knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111708893A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200925 |