CN111708773B - Multi-source scientific and creative resource data fusion method - Google Patents


Publication number
CN111708773B
Authority
CN
China
Prior art keywords
data
scientific
source
field
association
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010812168.8A
Other languages
Chinese (zh)
Other versions
CN111708773A (en)
Inventor
刘啸
龚晓阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Baohe Data Co ltd
Original Assignee
Jiangsu Baohe Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Baohe Data Co ltd
Priority to CN202010812168.8A
Publication of CN111708773A
Application granted
Publication of CN111708773B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a multi-source scientific and creative resource data fusion method comprising the following steps: data source feature analysis; rule configuration and collection of scientific and creative resources; parsing and preprocessing; merging and deduplication; association assignment; and topic identification and data fusion. The invention achieves standardized processing and associated fusion of data whose sources, types and information descriptions differ and are not uniformly structured, and solves the problem that different types of scientific and creative resources are difficult to fuse and interconnect.

Description

Multi-source scientific and creative resource data fusion method
Technical Field
The invention relates to the technical field of multi-source data processing, in particular to a multi-source scientific and creative resource data fusion method.
Background
With growing national policy support for scientific and technological development and services, and with the continued development of new-generation information and network technology, efficient collaboration in scientific innovation and the sharing of scientific and creative resources have gradually become a macro-level strategic direction for the converged development of technology industries and the implementation of mass entrepreneurship and innovation ("double-creation") service projects. To better focus on industrial strengths, serve enterprise needs, and boost regional scientific and creative cooperation and research, the effective use of scientific and creative resources has become a key issue. In practice, however, these resources span many types, including scientific and technological literature, patent achievements, global talent, listed enterprises, technical research reports and financial data. The data are varied in type, scattered in source, diversely classified, complex in structure and very different in their underlying features. Cross-platform aggregation, normalized processing and structured fusion of such multi-source heterogeneous data are therefore difficult, which leads to low resource utilization and insufficient discovery of scientific research value.
Against the background of scientific and technological innovation driving economic transformation, scientific and creative resources are important for promoting the deep integration of industry, academia and research, and for promoting the transformation of scientific and technological achievements into innovative regional economic development. However, the distribution of these resources is multi-source and heterogeneous, and problems such as a non-uniform degree of data structuring, low topic fusion and inconsistent spatio-temporal distribution make data exchange, sharing and value mining very difficult. Processing scientific and creative resource data therefore requires solving the accuracy and consistency of information description for unstructured scientific and technological achievements, semi-structured scientific and creative instances and structured scientific and creative objects, so that association analysis, dynamic integration and cross-domain data fusion can be achieved after the heterogeneous data has been structured.
The main prior-art approaches to multi-source heterogeneous data fusion are as follows. On one hand, ETL frameworks and tools (Zhonghong, Zhongheng, Pengyin bridge, LongmingRui. data ETL tool general frame design [J]. computer application, 2003(12): 96-98.) extract, clean, transform and load data that is scattered, disordered and inconsistent in standards. They are usually applied to data conversion among several business systems in enterprise scenarios, lack a collection method for multi-field, multi-source scientific and creative resource data, and have difficulty completing data cleaning and association assignment for complex fields, so they cannot achieve the fusion of content labels and knowledge. On the other hand, there are methods for collecting, caching and standardizing multi-source heterogeneous data (a multi-source heterogeneous data fusion platform and fusion method, publication No. CN107633075A, publication date 2018.01.26). That fusion platform comprises a data acquisition unit, a data storage unit, a data standardization unit, a user portrait construction unit, a knowledge graph construction unit and a visualization unit; the data acquisition unit acquires multi-source heterogeneous data; the data storage unit caches the multi-source heterogeneous data; the data standardization unit performs lexical, grammatical and/or semantic analysis on the multi-source heterogeneous data to obtain standardized text data; the user portrait construction unit builds user portraits of students from quantified student tags; the knowledge graph construction unit builds knowledge graphs of students, teachers and courses and associates them to obtain student-centered course connections, social relations and teacher-student relations; and the visualization unit displays these student-centered course connections, social relations and teacher-student relations. Although the processing direction is similar to that of the present invention, the scope of data collection and the specific standardization procedure differ, and the present invention provides a more detailed method for data structuring and for data association and fusion, improving the accuracy of structured conversion and the degree of fusion of heterogeneous data. The prior art therefore lacks a data fusion processing method based on the features of scientific and creative resource data.
Disclosure of Invention
To address the problems in the prior art, the invention provides a multi-source scientific and creative resource data fusion method that achieves standardized processing and associated fusion of data whose sources, types and information descriptions differ, and solves the problem that different types of scientific and creative resources are difficult to fuse and interconnect.
The purpose of the invention is realized by the following technical scheme.
A multi-source scientific and creative resource data fusion method comprises the following steps:
data source feature analysis: analyzing the data types and their source features, the source features comprising source addresses, data distribution, storage formats, data fields, update mechanisms, information dimensions and data quality;
rule configuration and collection of scientific and creative resources: the configured rules comprise the collection range, collection rules, time settings, alias expansion and object monitoring rules;
parsing and preprocessing: parsing the storage format of the collected multi-source heterogeneous scientific and creative resource data, standardizing the data, and cleaning and normalizing the metadata;
merging and deduplication: further extracting person and institution entities from the parsed and preprocessed data, and using the extracted entities as the carrier for data merging;
association assignment: identifying entities and association relations in the merged and deduplicated standardized data, performing association assignment against institution information with the feature information of the achievement's author as the judgment condition, performing inference verification during the assignment, and performing entity identification and dynamic semantic association processing on the assigned structured and standardized data, so as to aggregate and fuse the data under associations between scientific and creative resource entity types.
Further, topic identification and knowledge fusion are performed after the association assignment. Using the field information of the standardized multi-source scientific and creative resource data, word segmentation, lemmatization, stemming, syntactic analysis and part-of-speech tagging are carried out; the research topic of each resource is inferred with a topic analysis model and an academic dictionary; semantic annotation and label management are completed; and knowledge clusters are formed under each topic, realizing dynamic knowledge fusion of the scientific and creative resources.
Further, the association assignment uses a duplicate-name resolution (name disambiguation) algorithm for inference verification.
Further, in the association assignment, the CERIF (Common European Research Information Format) model is used to perform entity identification and dynamic semantic association processing on the assigned structured and standardized data.
Further, the data source feature analysis yields a data integration model and a core metadata model based on the features of scientific and creative resources, and specifically includes: when field standards differ, forming a unified representation of the varied fields of the multi-source data by comparing them against the core metadata description, using expert knowledge and manual annotation;
when redundant strings exist, filtering the data, reducing parameters and deleting redundant and random string fields;
when the same information is described in multiple ways, forming a unified description of the data through core field selection, data reduction, variable reduction, attribute selection, metadata mapping and field association, and storing it in the database in standardized form;
when author signatures are not standardized and names have Chinese-English mapping problems, using a manually compiled scholar dictionary to recognize, associate and align the signatures, combined with manual intervention to mark, correct and supplement the data.
Further, after the scientific and creative resources are acquired, the parsing and preprocessing specifically includes: forming data sets according to data attribute features and format features and storing them uniformly; keeping the source data format and temporarily storing the data in cloud files; building a distributed parsing module on the Django application framework that downloads data and supports real-time parsing tasks for the PDF, CSV, HTML, TEXT, JSON, XLS and XML formats; semantically interpreting field types and data contents; and applying different processing rules for different source features to denoise and clean the data. The processing flow comprises distributed parsing of the source data format, semantic association and mapping, field splitting or merging, data correction, field supplementation, and identification and checking of the unique data number, to obtain a structured data representation. The unique data number is used to detect and handle redundant data, missing data and corrected data. If a record has no unique number, the data sets are cross-validated with a discrimination algorithm based on the "type + title + year + name" of the source data, and the system extracts or generates a substitute unique number, finally yielding the data association relations and the preprocessing result. The metadata is then uniformly mapped and its fields extended according to the data attributes and the DC (Dublin Core) element standard to form a normalized description of the structured data, and the preprocessed structured data and the relations among the data are stored and updated.
Furthermore, the merging and deduplication takes the parsed and preprocessed data, extracts the information elements around the author field and the institution field in the structured metadata as the judgment conditions for merging achievements, and repeats verification and inference over multiple cycles. With research institutions as the carrier, achievements in the multi-source structured resource data are merged and matched to institutions, producing a data set that records the correspondence between the merged and deduplicated metadata and the data source information.
Further, the screening results after achievement merging are cross-checked against the data quality verification to determine whether merging and deduplication are complete, and records that were handled poorly or incompletely are screened out. The similarity algorithms of the configured rules are then called to compute difference values for sequence similarity, semantic similarity, institution similarity and author-name similarity, and further deduplication and merging are completed by combining the set threshold with the judgment result, where the threshold is set as d(i, j) = 0 or d(i, j) < 3 and d(i, j) is the difference value.
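The threshold check on the difference value can be illustrated with a minimal sketch. The text does not say how d(i, j) is computed, so this example assumes it is a Levenshtein edit distance between normalized titles; `edit_distance` and `is_duplicate` are hypothetical names, not part of the described method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_duplicate(title_i: str, title_j: str, threshold: int = 3) -> bool:
    """Assumed rule: d = 0 is an exact match, d < threshold a near-duplicate."""
    d = edit_distance(title_i.strip().lower(), title_j.strip().lower())
    return d < threshold
```

In a fuller pipeline the same comparison would presumably be repeated for the institution and author-name fields, with the semantic similarity coming from a separate model.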
Further, the association assignment first determines whether the name field and the institution field of the merged and deduplicated data set can be matched. When the name field does not match the institution, the pinyin of the name is split with a Chinese pinyin segmentation algorithm, the full pinyin and the abbreviated name are matched with a dynamic programming algorithm, and the corrected, standardized form of the author name is output. This form is matched again against the signatures in the institution field to determine the author's publishing institution, and the result is reused for cross-identification across multiple achievement records to decide the final attribution of each achievement. Otherwise, the assigned achievement data is matched to talents according to the email address on the achievement, the affiliated institution, the subject-field metadata fields and the matching rules, as a preliminary screen for duplicate-name authors and for talents and achievements that do not match.
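The pinyin splitting and abbreviation matching can be sketched as follows. This is an illustration under stated assumptions: the syllable table is a deliberately tiny hypothetical sample, dynamic programming is used here for the segmentation step, and the abbreviation is compared by syllable initials. The source does not give the actual algorithm details.

```python
from typing import List, Optional

# Hypothetical, deliberately tiny syllable table; a real table has roughly 400 entries.
SYLLABLES = {"zhi", "hua", "zhou", "xiao", "yang", "gong", "liu", "xi", "ao"}


def segment_pinyin(s: str) -> Optional[List[str]]:
    """Split a run of pinyin letters into syllables.

    best[i] holds a segmentation of s[:i] with the fewest syllables,
    or None if s[:i] cannot be segmented with the table.
    """
    s = s.lower()
    best: List[Optional[List[str]]] = [None] * (len(s) + 1)
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - 6), i):  # a pinyin syllable has at most 6 letters
            piece = s[j:i]
            if best[j] is not None and piece in SYLLABLES:
                cand = best[j] + [piece]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[len(s)]


def matches_abbreviation(full_given: str, abbrev: str) -> bool:
    """True if abbrev (e.g. 'zh' or 'Z.H.') equals the initials of the syllables."""
    syllables = segment_pinyin(full_given)
    if not syllables:
        return False
    initials = "".join(syl[0] for syl in syllables)
    return initials == "".join(ch for ch in abbrev.lower() if ch.isalpha())
```

For example, `matches_abbreviation("zhihua", "Z.H.")` holds because "zhihua" segments into "zhi" and "hua"; the surname would be compared separately.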
Further, when the association assignment involves duplicate names, the co-occurrence analysis and bibliographic coupling analysis algorithms of the assignment rules are called to analyze and judge the relations among the research field, publication field, journal field and collaborators associated with the duplicate name. A modularity algorithm is used to infer the achievement field and external relevance of the suspect records, matching assignment is executed again, and the duplicate-name resolution result is output after multiple cycles. Otherwise, semantic analysis and entity recognition are performed on the assigned structured and standardized data to automatically associate multiple entities of the scientific and creative resources, such as institutions, talents, achievement content and projects, so that the multi-source heterogeneous data are fused and interconnected at the level of data sources and information exchange.
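As a rough illustration of the co-occurrence and bibliographic coupling signals, the sketch below scores two papers signed with the same name. The record layout (`references`, `coauthors`, `keywords`), the thresholds and the two-of-three voting rule are all assumptions made for this example; the modularity-based step is not shown.

```python
def coupling_strength(refs_a: set, refs_b: set) -> int:
    """Bibliographic coupling strength: the number of references two papers share."""
    return len(refs_a & refs_b)


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets, used here for collaborator and keyword co-occurrence."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_same_author(paper_a: dict, paper_b: dict,
                       min_coupling: int = 2, min_overlap: float = 0.25) -> bool:
    """Hypothetical rule: merge only if at least two of three signals agree."""
    coupled = coupling_strength(set(paper_a["references"]),
                                set(paper_b["references"])) >= min_coupling
    shared_people = jaccard(set(paper_a["coauthors"]),
                            set(paper_b["coauthors"])) >= min_overlap
    shared_terms = jaccard(set(paper_a["keywords"]),
                           set(paper_b["keywords"])) >= min_overlap
    return sum([coupled, shared_people, shared_terms]) >= 2
```

Requiring more than one signal is a conservative choice: a single shared keyword is weak evidence that two same-name signatures belong to one person.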
Furthermore, the topic identification and knowledge fusion reads the data after association assignment, extracts the field content of the data set with nltk and spaCy, and calls the word segmentation and word vector algorithms of the topic identification rules to segment the content and analyze its syntax. When a user-defined lexicon is available, spaCy gives priority to exact matches of keywords in that lexicon; when no lexicon is defined, nltk identifies keywords automatically from the meaning and part of speech of the words in the text and extracts them.
Further, the topic analysis model uses the LDA algorithm to identify the topic information of the achievement data and obtains content word frequencies and weights to infer the achievement topic. By matching against a preset academic dictionary and applying semantic annotation, the subject labels of the scientific and creative resources are located and knowledge is clustered under research topics, finally achieving the convergence and fusion of cross-domain and cross-disciplinary knowledge.
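The lexicon-first behaviour described above can be sketched without either library. This standard-library example only mirrors the control flow: exact lexicon matches take priority, and a simple word-frequency fallback stands in for the part-of-speech-based extraction. The function name, stopword list and fallback are assumptions for this example, not the nltk or spaCy APIs.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def extract_keywords(text: str, lexicon: Optional[Iterable[str]] = None,
                     top_n: int = 3) -> List[str]:
    lowered = text.lower()
    if lexicon:
        # Exact whole-phrase matches against the user-defined lexicon come first;
        # longer phrases are listed before shorter ones.
        hits = [p for p in sorted(lexicon, key=len, reverse=True)
                if re.search(r"\b" + re.escape(p.lower()) + r"\b", lowered)]
        if hits:
            return hits
    # Fallback stand-in: the most frequent non-stopword tokens.
    tokens = [t for t in re.findall(r"[a-z]+", lowered) if t not in STOPWORDS]
    return [w for w, _ in Counter(tokens).most_common(top_n)]
```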
Compared with the prior art, the invention has the advantages that:
1. Through data source feature analysis of scientific and creative resources, the method mines the field features, spatio-temporal sequence features and distributed storage features of the multi-source data, so that collection, parsing and efficient storage of the original data can be completed more accurately and comprehensively.
2. Semantic inference and algorithmic processing are performed according to the data feature representation, which reduces data redundancy, strengthens metadata normalization and content fusion, and accomplishes efficient data cleaning, automatic entity identification, accurate achievement assignment and dynamic topic discrimination, thereby improving the efficiency of aggregating and fusing scientific and creative resource data.
Drawings
FIG. 1 is a flow chart of the multi-source scientific and creative resource data fusion method of the invention.
FIG. 2 is a flow chart of the parsing and preprocessing method of the present invention.
FIG. 3 is a flow chart of a merge deduplication method of the present invention.
FIG. 4 is a flow chart of an association assignment method of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific examples.
As shown in FIG. 1, this embodiment provides a multi-source scientific and creative resource data fusion method comprising data source feature analysis, rule configuration and collection, parsing and preprocessing, merging and deduplication, association assignment, and topic identification and data fusion. The specific steps are as follows:
Data source feature analysis examines the data types and their source features, including source addresses, data distribution, storage formats, data fields, update mechanisms, information dimensions and data quality, and uses feature mining to uncover latent problems in the source data. When field standards differ, for example: in the field identifiers of the Web of Science Core Collection, AU is the author, TI is the document title, SO is the publication name and DT is the document type, whereas in the fields of the citation database Scopus, Authors is the author, Title is the document title, Source_Title is the publication name and db_type is the resource type. For such differences, a unified representation of the varied fields and a unified field mapping standard can be formed through expert knowledge, manual annotation and comparison against the core metadata description, and rules for automatic matching and alignment can be configured;
when redundant strings exist, for example: when collecting data from the official website of a patent database, the data field of a patent named "a method for jointly visualizing the UKF fiber tracking data" contains a string that cannot be used directly, such as {"jg_level2": "", "px": 1, "jg_type": "\u9662\u6821", "name": "\u5317\u4eac\u7406\u5de5\u5927\u5b66", "jg_level": "\u975e\u4e0a\u5e02"}] title. Whether the content is usable is judged by the selection and mapping of preset fields, after which the data is filtered, parameters are reduced, and redundant and random strings are deleted;
when the same information is described in multiple ways, for example: Web of Science has 73 common fields, Scopus has 77 common fields, and patent resources have 22 common fields. Because different data types are described from different angles and in different ways, a unified description of the scientific and creative data can be formed through core field selection, data reduction, variable reduction, metadata mapping and field association, and stored in the database in standardized form;
when author signatures are not standardized and names have Chinese-English mapping problems, for example: a search of the Web of Science research information repository returns signature forms of scholars such as Zhou, zh, zhu, z zhu, zhihua, z and others. Without data cleaning it is difficult to resolve the Chinese and English names to a single person, so a manually compiled scholar dictionary can be used to recognize, associate and align the signatures, combined with manual intervention to mark, correct and supplement the data. The analysis thus yields a data integration model and a core metadata model based on the features of scientific and creative resources, which support the resource collection and standardization steps.
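The unified field mapping for differing field standards can be sketched as a lookup table. The source field names (AU, TI, SO, DT, Authors, Title, Source_Title, db_type) come from the example above; the core metadata names on the right-hand side and the function name are hypothetical choices for this illustration.

```python
# Source field -> core metadata field. The core names are assumed for this sketch.
FIELD_MAP = {
    "wos": {"AU": "author", "TI": "title", "SO": "publication_name", "DT": "type"},
    "scopus": {"Authors": "author", "Title": "title",
               "Source_Title": "publication_name", "db_type": "type"},
}


def to_core_metadata(record: dict, source: str) -> dict:
    """Rename known fields to the core metadata names and drop unmapped fields."""
    mapping = FIELD_MAP[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}
```

Records from either source then share one schema, for instance `to_core_metadata({"AU": "Zhou, ZH", "TI": "A study"}, "wos")` and `to_core_metadata({"Authors": "Zhou, ZH", "Title": "A study"}, "scopus")` produce the same dictionary.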
Rule configuration and collection of scientific and creative resources: the configured rules comprise the collection range, collection rules, time settings, alias expansion and object monitoring rules, and ensure that the collected resource data is valid and usable.
The configuration covers the collection range (source database, network address, page elements and the like), the collection rules (collection paths; the object types to collect, such as pictures, video, audio and text; the field items to collect; data parsing markers and the like), time settings (the collection period), alias expansion (unified recording of objects that are referred to by several different descriptions) and object monitoring rules (resource object location, object classification and update monitoring);
Based on the Z39.50 and HTTP data collection protocols, collection proceeds in the following steps: first, enter the website of the resource object; second, edit a search expression or keywords to run a search, or select global collection; third, identify the objects to collect (text, picture, audio, video and other content types), edit the collection fields, map them to standard fields, and automatically identify and expand aliases; fourth, set the data time period and adjust the paging settings (such as page turning during data loading and the page drill-down depth); fifth, identify the data type and the data file format (PDF, CSV, HTML, TEXT, JSON, XLS, XML and the like) and select the export path (export locally or publish to a database); sixth, configure a monitoring entity to poll the object for updates and set up scheduled collection; seventh, start collection.
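The configuration collected across these steps could be held in a single object. The sketch below is one possible shape, assuming a Python dataclass; every field name, default and validation rule is hypothetical and chosen only to mirror the steps listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class AcquisitionConfig:
    source_url: str                                         # step 1: resource website
    query: Optional[str] = None                             # step 2: None means global collection
    object_types: List[str] = field(default_factory=lambda: ["text"])   # step 3
    field_map: Dict[str, str] = field(default_factory=dict)             # source field -> standard field
    aliases: Dict[str, List[str]] = field(default_factory=dict)         # alias expansion
    date_range: Optional[Tuple[str, str]] = None            # step 4: data time period
    max_page_depth: int = 1                                 # step 4: page drill-down depth
    export_target: str = "local"                            # step 5: "local" or "database"
    poll_interval_hours: Optional[int] = None               # step 6: None disables monitoring

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the config is usable."""
        problems = []
        if not self.source_url:
            problems.append("source_url is required")
        if self.export_target not in ("local", "database"):
            problems.append("export_target must be 'local' or 'database'")
        if self.max_page_depth < 1:
            problems.append("max_page_depth must be at least 1")
        if self.poll_interval_hours is not None and self.poll_interval_hours <= 0:
            problems.append("poll_interval_hours must be positive")
        return problems
```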
The objects of scientific and creative resource data collection include: scientific research literature databases, patent databases, national science fund project databases, national social fund project databases, financial databases, and government and enterprise databases.
Commercial data is obtained as interface data by purchasing access rights to the corresponding databases, and public data is obtained as original multi-field data by web scraping.
Parsing and preprocessing: the storage format of the collected multi-source heterogeneous scientific and creative resource data is parsed, the data is standardized, and the metadata is cleaned and normalized, so that the multi-source data is structured and its metadata description is standardized.
Merging and deduplication: person and institution entities are further extracted from the parsed and preprocessed data and used as the carrier for data merging. Following the prerequisite steps of identifying the unique number of the data source and, where no unique number exists, confirming the core metadata, achievement data is merged, the result is verified, and duplicate items are eliminated according to the processing rules.
Association assignment: entities and association relations are identified in the merged and deduplicated standardized data, and association assignment against institution information is carried out with the feature information of the achievement's author as the judgment condition. A duplicate-name resolution algorithm is used in the process. First, an author dictionary is used to judge whether duplicate names refer to the same author, and co-reference processing is performed. Next, document field objects are extracted (authors, institutions, keywords, subject terms, citing and cited records); co-occurring keywords, co-occurring collaborators, and the probability that documents, keywords, institutions and authors co-occur across multiple papers are analyzed; bibliographic coupling strength is analyzed; and, combined with expert knowledge and manual intervention, the relations among the research field, publication field, journal field and collaborators associated with the duplicate name are judged and verified by inference, improving the success rate of duplicate-name assignment. Finally, the CERIF (Common European Research Information Format) model is used to perform entity identification and dynamic semantic association processing on the assigned structured and standardized data, aggregating and fusing the data under associations between entity types such as institution, talent, achievement and project. The CERIF model has a neutral architecture, supports data model expression (relational, object and information retrieval), and supports centralized/distributed query and knowledge-based Web/harvest/IR query.
Topic identification and knowledge fusion: using the field information of the standardized multi-source scientific and creative resource data, word segmentation, lemmatization, stemming, syntactic analysis and part-of-speech tagging are carried out, the research topics of the resources are inferred with a topic analysis model and an academic dictionary, semantic annotation and label management are completed, and knowledge clusters are formed under each topic, realizing dynamic knowledge fusion of the scientific and creative resources.
In this embodiment, the analysis and the preprocessing use the analysis result of the data source characteristics and the collected data under the rule configuration as the processing object and the algorithm configuration.
As shown in Fig. 2, after scientific and creative resources with multi-source heterogeneous characteristics are obtained, the source data format is preserved and the data is temporarily stored as cloud files. A distributed parsing module built on the Django application framework downloads the data and supports real-time parsing tasks for formats such as PDF, CSV, HTML, TEXT, JSON, XLS and XML. Field types and data contents are interpreted semantically, and different processing rules are applied to different source characteristics to perform data noise reduction and data cleaning. The process includes distributed parsing of source data formats, semantic association and mapping, field splitting or merging, data correction, field supplementation, and identification and judgment of unique data numbers, yielding a structured data representation.
For example, source database 1 describes the address as Address <NJU, China>, while source database 2 describes the same field as Has Address <Nanjing University, Jiangsu, China>. During parsing, the algorithm automatically identifies the address field, splits it, and normalizes its description to Address <Nanjing University, China>.
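The address example can be illustrated with a minimal Python sketch. It is not the patented parsing module: the alias tables `FIELD_ALIASES` and `INSTITUTION_ALIASES`, the function name `normalize_address`, and the rule "keep the first and last comma-separated parts" are assumptions made only for illustration.

```python
# Hypothetical alias tables; a real system would load these from configuration.
FIELD_ALIASES = {"address": "Address", "has address": "Address"}
INSTITUTION_ALIASES = {"nju": "Nanjing University"}


def normalize_address(field_name, raw_value):
    """Map a source field name and value to a unified (field, value) pair."""
    # Unify the field name across sources ("Has Address" -> "Address").
    key = field_name.strip().lower()
    canonical_field = FIELD_ALIASES.get(key, field_name.strip())

    # Split the value into comma-separated parts and drop empty ones.
    parts = [p.strip() for p in raw_value.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty address value")

    # Assumption: the first part is the institution, the last is the country.
    institution = INSTITUTION_ALIASES.get(parts[0].lower(), parts[0])
    if len(parts) == 1:
        return canonical_field, institution
    return canonical_field, f"{institution}, {parts[-1]}"


if __name__ == "__main__":
    print(normalize_address("Address", "NJU, China"))
    print(normalize_address("Has Address", "Nanjing University, Jiangsu, China"))
```

Both calls yield the same unified pair, which is what allows records from the two source databases to be compared in later steps.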
The parsing and preprocessing step uses the unique data number to identify and handle redundant data, missing (default) data and corrected data. If a record has no unique number, a discrimination algorithm based on the "type + title + year + name" of the source data is used to cross-validate the data set, and a substitute unique number is extracted or generated by the system, finally yielding the data association relations and the preprocessing result.
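One possible reading of the "type + title + year + name" fallback is sketched below in Python. The record layout, the identifier field names in `ID_FIELDS`, the normalization rules and the choice of a truncated SHA-1 digest are all assumptions; the source only states that a substitute unique number is extracted or generated.

```python
import hashlib
import re
import unicodedata

# Hypothetical identifier fields, checked in order of preference.
ID_FIELDS = ("doi", "wos_id", "scopus_id", "publication_number")


def _norm(value):
    """Lower-case, Unicode-normalize and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", str(value)).lower()
    return re.sub(r"[\W_]+", " ", text).strip()


def record_key(record):
    """Return an existing unique number, or a generated substitute key."""
    for field in ID_FIELDS:
        value = record.get(field)
        if value:
            return f"{field}:{str(value).strip().lower()}"
    # Fallback: build the key from type + title + year + name.
    basis = "|".join(_norm(record.get(k, "")) for k in ("type", "title", "year", "name"))
    return "gen:" + hashlib.sha1(basis.encode("utf-8")).hexdigest()[:16]
```

Two records that differ only in letter case or punctuation then receive the same generated key, so they can be treated as candidates for the same resource during cross-validation.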
The unique number of a scientific and creative resource record is determined by the resource type and source. For example, books use ISSN numbers; documents use a WOS number, Scopus number, CSCD number, EI number, DOI or Handle number depending on the source database; patents use the publication number; and fund projects use the project approval number.
The parsing and preprocessing step also combines the data attributes with the DC (Dublin Core) element standard to perform unified metadata mapping and field extension, forming a normative description of the structured data, and stores and updates the preprocessed structured data and the relations among the data.
The merging and deduplication step takes the parsed and preprocessed data and extracts the information elements surrounding the author field and the institution field in the structured metadata as the judgment conditions for merging scientific and creative resource achievements. Through multiple rounds of verification and inference, and with the scientific and creative organization unit as the carrier, achievements are merged and institutions are matched across the multi-source structured resource data, yielding a merged and de-duplicated data set of the correspondence between metadata and data source information.
As shown in Fig. 3, the scientific and creative resource data has been preprocessed into a structured data format; the purpose of merging and deduplication is to convert this structured data into standardized data through a sequence of processing steps.
The preprocessed data is used to extract author-related information, including the person's email address, age, year of joining or leaving the school, organization unit, and discipline department or discipline division, which serve as parameter conditions for the next step. Similarly, all achievements under each resource type are extracted by department organization unit, including scientific literature, patent achievements, technical reports, financial data, monographs, fund projects and the like.
Merging and deduplication begins with an initial merge of the achievement data: the cross-domain scientific and creative data of each research institution is merged according to merge conditions such as the unique number of the data source, the system-generated unique number, the type, the title (Chinese and English), the author's full name (Chinese and English), and the publication or issue date.
Whether merging and deduplication of the achievement data is complete is then cross-checked against the screening result after the merge and the data quality verification conditions. Data that was merged poorly or processed incompletely is screened out, and the similarity algorithm of the configuration rules is invoked: a method combining the Levenshtein algorithm with word-frequency and syntactic analysis judges the difference values of the data in terms of sequence similarity, semantic similarity, institution similarity, author name similarity and so on, and further deduplication and merging is completed by combining the set threshold with the judgment result. The threshold is set to d(i, j) = 0 or d(i, j) < 3, where d(i, j) is the difference value. After multiple rounds of processing, the standardized data finally uses the organization unit as the carrier to complete the merging of scientific and creative resource achievement data and the matching of scientific and creative institutions, yielding a merged and de-duplicated data set of the correspondence between metadata and data source information.
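The sequence-similarity part of this check can be sketched in Python as follows. Only the Levenshtein distance and the d(i, j) < 3 threshold come from the text; the function names, the lower-casing of titles before comparison, and the omission of the word-frequency and syntactic analysis are simplifying assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_duplicate(title_a, title_b, threshold=3):
    """Treat two titles as duplicates when the difference value d is below the threshold."""
    d = levenshtein(title_a.strip().lower(), title_b.strip().lower())
    return d < threshold
```

Under these assumptions, d = 0 corresponds to an exact match after normalization, while 0 < d < 3 covers small spelling or punctuation differences between records of the same achievement.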
As shown in Fig. 4, association assignment achieves data fusion of multi-source scientific and creative resources at the level of data source and information exchange through entity identification of the resources, assignment of achievements, and association of entities.
Association assignment first judges whether the name field and the institution field of the merged and de-duplicated data set can be matched successfully. The aim is to confirm that each achievement can be attributed to an author and each author to an institution, so that scientific and creative resources are automatically identified and intelligently associated at the level of entities such as institutions, talents and achievements.
When the name field does not match the institution, the pinyin of the name is split using a Chinese pinyin segmentation algorithm, the full-pinyin and abbreviated forms of the name are matched using a dynamic programming algorithm, and the corrected and normalized spelling of the author name (Chinese and English) is output. The name is then matched again against the signature in the institution field to determine the author's publishing institution, and this process is repeated to cross-identify multiple achievement records and judge the final attribution of each achievement.
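The matching of abbreviated and full name forms can be illustrated with a small dynamic-programming sketch in Python. It is a simplified stand-in: it assumes the pinyin has already been split into space-separated syllables, it does not perform pinyin segmentation itself, and the function name `abbreviation_matches` and the prefix-matching rule are assumptions rather than the patented algorithm.

```python
def abbreviation_matches(abbreviated, full):
    """Check whether each abbreviated token is a prefix of a full-name token, in order.

    dp[i][j] is True when the first i abbreviated tokens can be aligned,
    in order, with tokens chosen from the first j full-name tokens.
    Full-name tokens may be skipped; abbreviated tokens may not.
    """
    short = abbreviated.lower().replace(".", " ").split()
    long_ = full.lower().replace(".", " ").split()
    if not short:
        return False

    dp = [[False] * (len(long_) + 1) for _ in range(len(short) + 1)]
    for j in range(len(long_) + 1):
        dp[0][j] = True  # zero abbreviated tokens always align

    for i in range(1, len(short) + 1):
        for j in range(1, len(long_) + 1):
            skip = dp[i][j - 1]  # leave full token j unused
            use = dp[i - 1][j - 1] and long_[j - 1].startswith(short[i - 1])
            dp[i][j] = skip or use
    return dp[len(short)][len(long_)]
```

For instance, the abbreviated signature "Gong X Y" aligns with the full form "Gong Xiao Yang", while "Gong Z" does not; a real system would additionally handle reversed name order and syllable segmentation.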
According to the matching processing rules, the assigned achievement data is matched with the discipline's talents on metadata fields such as the achievement email address, the discipline department organization unit and the subject field, so as to preliminarily screen out and handle authors with duplicate names, unmatched talents and unmatched achievements.
To resolve duplicate names, association assignment invokes the co-reference, co-occurrence analysis and document coupling analysis algorithms of the assignment processing rules. First, an author dictionary is used to judge whether duplicate names refer to the same author, and co-reference processing is performed. Next, document field objects are extracted (including authors, institutions, keywords, subject terms, citing references and cited-by records); the keywords and collaborators that co-occur across multiple papers, the co-occurrence probability of documents and keywords, and the relations between institutions and authors are analyzed; the document coupling strength is analyzed; and, combining expert knowledge with manual intervention, the relations among the research fields, publication fields, journal fields and collaborators of the duplicate-name candidates are analyzed and judged. A modularity algorithm then divides the network structure into community modules to form a stable clustering result, from which the achievement field and external relevance of the suspect data are inferred; matching assignment is executed again, and the cycle is repeated to output the achievement data with duplicate names resolved. Duplication or irregular signatures may also occur in the institution field; these can be corrected using a mapping from a pre-established dictionary of scientific and creative institutions.
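The co-occurrence and document coupling signals described above can be sketched in Python as follows. The paper record layout (`references`, `coauthors`, `keywords`), the use of Jaccard overlap, and the function names are assumptions for illustration; the modularity-based community division is not shown.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coupling_strength(refs_a, refs_b):
    """Bibliographic coupling strength: number of references two papers share."""
    return len(set(refs_a) & set(refs_b))


def same_author_evidence(paper_a, paper_b):
    """Collect signals on whether two same-name papers belong to one author."""
    return {
        "coupling": coupling_strength(paper_a["references"], paper_b["references"]),
        "coauthor_overlap": jaccard(paper_a["coauthors"], paper_b["coauthors"]),
        "keyword_overlap": jaccard(paper_a["keywords"], paper_b["keywords"]),
    }
```

Higher values on all three signals suggest that two records with the same author name refer to the same person; how the signals are weighted and thresholded is left to the assignment processing rules and expert review.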
For example, the assignment process encounters problems of non-standard spelling of author signatures, English abbreviations versus full names, ambiguous name references, and duplicate names. The scholar Nees Jan van Eck (full name) may appear in literature signatures as the misspelling Nee Jan van Eck, the abbreviated reference Nees J V Eck, or the ambiguous reference Nees V Eck. In such cases, the Levenshtein algorithm is used to compute a letter-level similarity value, and similar data is screened and classified. If the similarity value exceeds the threshold set by the model, the spelling is corrected automatically, i.e. the misspelled Nee Jan van Eck in the index is replaced with the correctly spelled Nees Jan van Eck. If an ambiguous abbreviation raises a duplicate-name problem, the modularity algorithm is used to infer the achievement field and external relevance of the suspect data, and the results of analyzing the relations among the research field, publication field, journal field and collaborators of the suspect data set are compared and referenced to infer the correct assignment.
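The automatic spelling correction in this example can be sketched in Python as follows. The normalized similarity formula (1 minus edit distance divided by the longer length) and the threshold value 0.9 are assumptions; the source only states that a correction is applied when the similarity value exceeds a model-set threshold.

```python
def levenshtein(a, b):
    """Edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def correct_name(signature, author_dictionary, threshold=0.9):
    """Return the dictionary name to substitute, or None if no name is close enough."""
    best = max(author_dictionary, key=lambda name: similarity(signature, name), default=None)
    if best is not None and similarity(signature, best) > threshold:
        return best
    return None  # leave for abbreviation or duplicate-name handling
```

With the dictionary entry "Nees Jan van Eck", the misspelling "Nee Jan van Eck" differs by one character and is corrected, whereas "Nees V Eck" falls below the assumed threshold and is left for the duplicate-name analysis.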
Association assignment also uses the Common European Research Information Format (CERIF) data model to perform semantic analysis and entity identification on the assigned structured and standardized data, realizing automatic association among multiple entities of scientific and creative resources, such as institutions, talents, achievement contents and projects, and achieving fusion and interoperability of multi-source heterogeneous scientific and creative resource data at the level of data source and information exchange.
The purpose of topic identification and knowledge fusion is to further annotate the standardized data with topics and to fuse the multi-source heterogeneous scientific and creative resource data at the knowledge level.
Topic identification and knowledge fusion reads the data after association assignment, extracts the field contents of the data set using NLTK and spaCy, and invokes the word segmentation and word vector algorithms of the topic identification rules to perform content segmentation and syntactic analysis. When matching against a custom lexicon, spaCy gives priority to exact matches of keywords already in the custom lexicon; when no custom lexicon is defined, NLTK automatically identifies terms from the word meaning and part of speech of the text and then extracts the keywords.
Topic model analysis with the LDA algorithm is used to identify the topic information of the achievement data, obtaining the word frequencies and weight values of the content to infer the achievement topic. By matching against a preset academic dictionary and applying semantic annotation, the topic labels of the scientific and creative resources are positioned and knowledge is clustered under research topics, finally achieving the aggregation and fusion of cross-domain, cross-discipline knowledge of scientific and creative resources.
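The dictionary-matching part of topic labelling can be illustrated with the Python sketch below. It is deliberately not an LDA implementation: it substitutes plain word-frequency scoring against a small, hypothetical `ACADEMIC_DICTIONARY`, only to show how word frequencies and a preset dictionary could yield a topic label.

```python
from collections import Counter
import re

# Hypothetical academic dictionary: topic label -> characteristic terms.
ACADEMIC_DICTIONARY = {
    "bibliometrics": {"citation", "coupling", "journal", "author"},
    "natural language processing": {"segmentation", "tagging", "syntactic", "word"},
}


def label_topic(text, dictionary=ACADEMIC_DICTIONARY):
    """Return (best_topic_or_None, scores) using term-frequency matching."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score each topic by the total frequency of its dictionary terms in the text.
    scores = {topic: sum(counts[t] for t in terms) for topic, terms in dictionary.items()}
    if not scores:
        return None, scores
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores
```

Records that receive the same label can then be grouped as one knowledge cluster; in the described method, the per-topic word weights would come from the LDA topic model rather than from raw counts.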
The invention provides a multi-source scientific and creative resource data fusion method. By acquiring multi-source heterogeneous scientific and creative resource data and performing data noise reduction, standardization, structured association and knowledge fusion on large-scale composite semantic data, the method realizes data governance and data fusion of scientific and creative resources, further improves information sharing and value mining of the resources, enhances their availability and utilization, and provides efficient and lasting data support services for the scientific innovation and research decision-making of governments, enterprises, regions and universities.

Claims (7)

1. A multi-source scientific and creative resource data fusion method, characterized by comprising the following steps:
data source characteristic analysis: analyzing the data types and their source characteristics, the source characteristics comprising source address, data distribution, storage format, data fields, update mechanism, information dimensions and data quality;
acquiring scientific and creative resources under configuration rules, wherein the configuration rules comprise configuring an acquisition range, configuring acquisition rules, time setting, alias extension and object monitoring rules; the acquisition range comprises source databases, network addresses and page elements; the acquisition rules comprise the acquisition path, the acquisition of picture, video, audio and text type objects, the acquisition field items and the data parsing marks; the time setting is the setting of the acquisition period; the alias extension means that, where an object has multiple alternative descriptions, they are recorded and processed in a unified way; and the object monitoring rules comprise resource object positioning, object classification and update monitoring;
collecting multi-source scientific and creative resource data based on the Z39.50 protocol and the HTTP protocol, the specific collection steps being: first, entering the website of the resource object; second, editing a search formula or keywords to run a search, or selecting global collection; third, identifying the collection objects, namely content of the text, picture, audio and video types, editing the collection fields, mapping them to standard fields, and automatically identifying and extending aliases; fourth, setting the data time period and adjusting the page cycle setting; fifth, identifying the data type and the data file format, and selecting an export path; sixth, configuring a monitoring entity to cyclically monitor updates to the object, and setting the scheduled acquisition configuration; seventh, starting the collection;
the acquisition objects of the multi-source scientific and creative resource data comprise: scientific research literature databases, patent databases, the national science fund project database, the national social fund project database, financial databases, and government and enterprise databases; for commercial data, interface data is acquired by purchasing access rights to the corresponding database, and for public data, raw data from multiple fields is acquired by web page crawling;
parsing and preprocessing: parsing the storage format, standardizing the data, and cleaning and normalizing the metadata of the collected multi-source heterogeneous scientific and creative resource data;
merging and deduplication: further extracting person and institution entities from the parsed and preprocessed data, and using them as the carrier for data merging;
association assignment: identifying entities and association relations in the merged and de-duplicated standardized data, performing association assignment using the author characteristic information of each achievement together with institution information as the judgment condition, performing inference verification in the implementation, and performing entity identification and dynamic semantic association on the assigned structured and standardized data, forming data aggregation and fusion under the association of scientific and creative resource entity types;
wherein the parsing and preprocessing specifically comprises: after the scientific and creative resources are obtained, forming data sets according to data attribute characteristics and format characteristics and storing them uniformly; preserving the source data format and temporarily storing the data as cloud files; building a distributed parsing module on the Django application framework to download the data and support real-time parsing tasks for the PDF, CSV, HTML, TEXT, JSON, XLS and XML formats; interpreting field types and data contents semantically; and performing data noise reduction and data cleaning using different processing rules for different source characteristics, specifically including: distributed parsing of source data formats, semantic association and mapping, field splitting or merging, data correction, field supplementation, and identification and judgment of unique data numbers, to obtain a structured data representation; using the unique data number to identify and handle redundant data, missing (default) data and corrected data; if a record has no unique number, cross-validating the data set with a discrimination algorithm based on the "type + title + year + name" of the source data, and extracting or having the system generate a substitute unique number, finally obtaining the data association relations and the preprocessing result; and performing unified metadata mapping and field extension by combining the data attributes with the DC element standard to form a normative description of the structured data, and storing and updating the preprocessed structured data and the relations among the data;
wherein the merging and deduplication takes the parsed and preprocessed data, extracts the information elements surrounding the author field and the institution field in the structured metadata as the judgment conditions for merging scientific and creative resource achievements, and, through multiple rounds of verification and inference and with the scientific and creative organization unit as the carrier, merges achievements and matches institutions across the multi-source structured resource data, to obtain a merged and de-duplicated data set of the correspondence between metadata and data source information;
cross-checking whether merging and deduplication of the achievement data is complete against the screening result after the merge and the data quality verification conditions, screening out data that was merged poorly or processed incompletely, invoking the similarity algorithm of the configuration rules to judge the difference values of the data in sequence similarity, semantic similarity, institution similarity and author name similarity, and completing further deduplication and merging by combining the set threshold with the judgment result, wherein the threshold is set to d(i, j) = 0 or d(i, j) < 3, and d(i, j) is the difference value;
wherein the association assignment first judges whether the name field and the institution field of the merged and de-duplicated data set can be matched successfully; when the name field does not match the institution, the pinyin of the name is split using a Chinese pinyin segmentation algorithm, the full-pinyin and abbreviated forms of the name are matched using a dynamic programming algorithm, the corrected and normalized spelling of the author name is output, the name is matched again against the signature in the institution field to determine the author's publishing institution, and the process is repeated to cross-identify multiple achievement records and judge the final attribution of each achievement; otherwise, the assigned achievement data is matched with the discipline's talents according to the achievement email address, the discipline organization unit, the subject-field metadata fields and the matching processing rules, so as to preliminarily screen out and handle authors with duplicate names, unmatched talents and unmatched achievements;
when the association assignment involves duplicate names, the co-occurrence analysis and document coupling analysis algorithms of the assignment processing rules are invoked, and the relations among the research fields, publication fields, journal fields and collaborators of the duplicate-name candidates are analyzed and judged; a modularity algorithm is used to infer the achievement field and external relevance of the suspect data, matching assignment is executed again, and the cycle is repeated to output the achievement data with duplicate names resolved; otherwise, semantic analysis and entity identification are performed on the assigned structured and standardized data, realizing automatic association among the multiple entities of the scientific and creative resources, namely institutions, talents, achievement contents and projects, and achieving fusion and interoperability of multi-source heterogeneous scientific and creative resource data at the level of data source and information exchange.
2. The multi-source scientific and creative resource data fusion method according to claim 1, characterized in that topic identification and knowledge fusion is performed after the association assignment, wherein the topic identification and knowledge fusion takes the field information of the standardized multi-source scientific and creative resource data, performs word segmentation, lemmatization, stemming, syntactic analysis and part-of-speech tagging, infers the research topics of the resources using a topic analysis model and an academic dictionary, completes semantic annotation and label management, forms knowledge clusters under each topic, and realizes dynamic knowledge fusion of scientific and creative resources.
3. The multi-source scientific and creative resource data fusion method according to claim 1, characterized in that the association assignment performs inference using a duplicate-name resolution algorithm.
4. The multi-source scientific and creative resource data fusion method according to claim 1, characterized in that the association assignment uses the Common European Research Information Format (CERIF) model to perform entity identification and dynamic semantic association on the assigned structured and standardized data.
5. The multi-source scientific and creative resource data fusion method according to claim 1, characterized in that a data integration model and a core metadata model based on the characteristics of scientific and creative resources are formed after the data source characteristic analysis, and the data source characteristic analysis specifically comprises: when field standards differ, forming a unified representation of the multiple fields of the multi-source data by reference to the core metadata description, through expert knowledge and manual annotation;
when redundant character strings exist, filtering the data, reducing parameters, and deleting redundant and random character string fields;
when the information is represented by multiple elements, forming a unified description of the data through the steps of core field selection, data reduction, variable reduction, attribute selection, metadata mapping and field association, and storing it in the database in standardized form;
when author signatures are non-standard or names are mapped to English, using a manually compiled scholar dictionary to identify, associate and align the signatures, combined with manual intervention for data annotation, correction and supplementation.
6. The multi-source scientific and creative resource data fusion method according to claim 2, characterized in that the topic identification and knowledge fusion reads the data after association assignment, extracts the field contents of the data set using NLTK and spaCy, and invokes the word segmentation and word vector algorithms of the topic identification rules to perform content segmentation and syntactic analysis; when matching against a custom lexicon, spaCy gives priority to exact matches of keywords in the custom lexicon; when no custom lexicon is defined, NLTK automatically identifies terms from the word meaning and part of speech of the text and then extracts the keywords.
7. The multi-source scientific and creative resource data fusion method according to claim 2, characterized in that the topic analysis model uses the LDA algorithm to identify the topic information of the achievement data, obtaining the word frequencies and weight values of the content to infer the achievement topic; and, by matching against a preset academic dictionary and applying semantic annotation, the topic labels of the scientific and creative resources are positioned and knowledge is clustered under research topics, finally achieving the aggregation and fusion of cross-domain, cross-discipline knowledge of scientific and creative resources.
CN202010812168.8A 2020-08-13 2020-08-13 Multi-source scientific and creative resource data fusion method Active CN111708773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010812168.8A CN111708773B (en) 2020-08-13 2020-08-13 Multi-source scientific and creative resource data fusion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010812168.8A CN111708773B (en) 2020-08-13 2020-08-13 Multi-source scientific and creative resource data fusion method

Publications (2)

Publication Number Publication Date
CN111708773A CN111708773A (en) 2020-09-25
CN111708773B true CN111708773B (en) 2020-12-08

Family

ID=72547288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010812168.8A Active CN111708773B (en) 2020-08-13 2020-08-13 Multi-source scientific and creative resource data fusion method

Country Status (1)

Country Link
CN (1) CN111708773B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893526A (en) * 2016-03-30 2016-08-24 上海坤士合生信息科技有限公司 Multi-source data fusion system and method
CN107633075A (en) * 2017-09-22 2018-01-26 吉林大学 A kind of multi-source heterogeneous data fusion platform and fusion method
CN110147357A (en) * 2019-05-07 2019-08-20 浙江科技学院 The multi-source data polymerization methods of sampling and system under a kind of environment based on big data

Also Published As

Publication number Publication date
CN111708773A (en) 2020-09-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant