CN113254518A

CN113254518A - Information resource management and analysis method based on particle data

Info

Publication number: CN113254518A
Application number: CN202110563420.0A
Authority: CN
Inventors: 黄德会
Original assignee: Jingruan Weiye Information Technology Beijing Co ltd
Current assignee: Jingruan Weiye Information Technology Beijing Co ltd
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2021-08-13

Abstract

The invention discloses an information resource management and analysis method based on particle data, comprising the following steps: 1. constructing a label ontology for the particle data; 2. preprocessing an input multi-source data set, assigning each unit multi-dimensional labels, and generating a particle data set; 3. constructing a particle data logical storage adapter and mapping the particle data set to physical storage; 4. constructing a particle data loading component; 5. defining a particle data analysis description language and constructing a particle data analysis component; 6. implementing a unified access control decider for sharing particle data within and between systems, and determining the access permission of each particle data item according to the decider's result; 7. providing formatted and visualized output of the resulting particle data set. The invention manages and analyzes information resources flexibly and efficiently without losing generality, and addresses the weak generality and flexibility of existing information resource management and analysis technology.

Description

Information resource management and analysis method based on particle data

Technical Field

The invention belongs to the field of information resource management and analysis, and particularly relates to an information resource analysis method based on particle data.

Background

With the informatization and networking of human activity, massive amounts of data, and the information resources they carry, are now distributed across cyberspace, making effective management and analysis of these multi-source heterogeneous information resources necessary. Growing computing power and new artificial intelligence techniques bring both new opportunities and new challenges to information resource management and analysis.

Current information resource management and analysis technology generally falls into the following three approaches:

1. Data mining technology based on data warehouses

A data warehouse is a subject-oriented data collection. A multidimensional data model is built by extracting, transforming and loading information resources (the ETL process), and data analysis and decision support are provided through Online Analytical Processing (OLAP). A representative system is Oracle Warehouse, which supports subject-oriented complex queries, data snapshots and similar features.

Although this approach offers efficient query and analysis, the extraction and transformation stages require extensive data cleaning, such as standardization and normalization; otherwise data quality is hard to guarantee. In addition, a data warehouse manages and analyzes data only by its subject attributes, a single dimension that cannot be adjusted dynamically, which greatly limits its data mining and analysis capability.

2. Data analysis technology based on knowledge graph

A knowledge graph is a semantic network that describes the relationships between entities. By extracting, representing and fusing the knowledge contained in information, it mines and predicts the internal associations of data, enabling deep analysis and application of the data. A representative system is the Neo4j graph database, which supports knowledge graph construction and powerful queries.

This approach has strong data mining and knowledge reasoning capabilities and, by means of link prediction algorithms, can achieve a degree of self-learning. However, because knowledge extraction relies on a schema built from expert knowledge, it cannot be applied to data sets whose logical relationships are weak or uncertain, and it is suited only to a limited range of applications such as search engine recommendation and intelligent question answering.

3. Machine learning technology based on big data

Whether supervised or unsupervised, machine learning first trains a model on a large-scale data set, then classifies and predicts data with the resulting model and mines potential associations from multi-source mass data. Representative work includes convolutional neural networks (CNN) and recurrent neural networks (RNN), which support learning data features and predicting data sequences.

This approach is notably effective for data sets whose associations are ambiguous or for which expert knowledge is lacking, but it depends excessively on training data, so accuracy and recall cannot be guaranteed. In addition, the analysis results of a machine learning model are poorly interpretable, and the quality of the analysis cannot be quantified.

In summary, the current information resource management and analysis techniques are poor in generality, flexibility and usability.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide an information resource management and analysis method based on particle data that decomposes multi-source mass data into particle data sets, assigns each particle data item a group of labels, and achieves highly general and flexible information resource management and analysis by means of a complex semantic query mechanism.

A method for managing and analyzing information resources based on particle data comprises the following steps:

step S01: constructing a label ontology for the particle data according to national standards and industry best practice;

step S02: preprocessing the input multi-source data set based on the label ontology constructed in step S01, extracting the minimum processing units that carry complete logical meaning, and assigning each unit multi-dimensional labels to generate a particle data set;

step S03: constructing a flexible particle data logical storage adapter for the different physical storage architectures, and mapping the particle data set generated in step S02 to physical storage;

step S04: constructing a particle data loading component that supports conventional retrieval and semantic computation based on particle data labels, and extracting the particle data set stored in step S03 according to application requirements for further analysis and processing;

step S05: defining a particle data analysis description language, constructing a particle data analysis component, analyzing the data set output by step S04, and generating an analysis result particle data set;

step S06: implementing a unified access control decider for sharing particle data within and between systems, and determining, for the particle data sets generated in steps S04 and S05, the permission of each particle data item according to the decider's result;

step S07: providing report data and visual output for the resulting particle data set of step S06.

Further, the "particle data label ontology" described in step S01 refers to the attribute set of the particle data, including indexes, groups, time, space, units of measurement and subjects.

Further, the "preprocessing the input multi-source data set" in step S02 is performed as follows: (1) semi-structured data is converted into structured data by decoupling the multiple layers of nested attributes in its schema; (2) unstructured data is converted into structured data in <Key, Value> form by computing the hash value of the data. All multi-source heterogeneous data is thus finally unified into structured data for further processing.
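The two conversions described above can be sketched as follows. This is a minimal illustration only, assuming that semi-structured data arrives as nested dictionaries and unstructured data as raw bytes; the function names `flatten` and `to_key_value`, the dotted column naming and the choice of SHA-256 are assumptions made for the example and are not specified by the invention.

```python
import hashlib


def flatten(record, prefix=""):
    """Decouple nested attributes into a flat mapping with dotted column names."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the nested layer and merge its flattened columns.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_key_value(blob: bytes):
    """Turn unstructured data into a <Key, Value> pair keyed by its hash (assumed SHA-256)."""
    return hashlib.sha256(blob).hexdigest(), blob


nested = {"region": {"province": "Hebei", "city": "Tangshan"}, "output": 120.5}
flat_row = flatten(nested)
key, value = to_key_value(b"free-text report")
```

Here `flat_row` holds one structured row whose columns are the former nested paths, and `key` is a hexadecimal digest that can serve as the record key in any store.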

Further, the "physical storage architecture" in step S03 includes a relational database, a NoSQL database, and a graph database.

Further, the "logical storage adapter" described in step S03 refers to the middleware that maps the particle data into physical storage according to the physical storage architecture.

Further, the "conventional retrieval of labels" in step S04 refers to querying the value range of a label and statistical indexes including the maximum value, the minimum value and the average value.

Further, the "semantic computation of labels" described in step S04 refers to matching labels having similar logical meanings according to a given subject term.

Further, the "particle data analysis description language" described in step S05 provides complex semantic relationship operation descriptors including arithmetic operations, logical operations, custom complex operation scripts and predicate logic.

Further, the "access control decider" in step S06 determines whether to permit or deny access to the particle data according to an access control policy, and supports coarse-grained access control based on the Traffic Light Protocol (TLP) and fine-grained role-based access control.

Through the steps, the information resource management and analysis method based on the particle data is realized, the universality is not lost while the information resources are managed and analyzed flexibly and efficiently, and the problems of poor universality and poor flexibility of the existing information resource management and analysis technology are solved.

By means of the technical scheme, the information resource management and analysis system based on the particle data is constructed, the data are decomposed into the minimum units representing the logic significance, and the multidimensional labels are given, so that the data are managed and analyzed on finer granularity, the universality and the flexibility are guaranteed, the cost of manually marking the data and training the data is reduced, and the problems of high data management cost and poor universality are effectively solved.

Drawings

Fig. 1 is a schematic diagram illustrating an embodiment of a method for managing and analyzing information resources based on particle data according to the present invention.

Detailed Description

To make the purpose and technical solution of the method of the present invention clearer, specific embodiments are described in detail below.

Step 101: constructing the particle data label ontology according to the national economic industry classification; the ontology comprises the labels and value ranges of numerical indexes, non-numerical groups, units of measurement, administrative divisions, time periods and the like.
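One possible in-memory shape for such an ontology is sketched below: each label dimension is stored with its permitted value range, and a label set can be validated against it. The class name `LabelOntology`, its methods and every dimension and value shown are hypothetical examples, not part of the disclosed method.

```python
from dataclasses import dataclass, field


@dataclass
class LabelOntology:
    """Label dimensions and their permitted values (all entries are hypothetical)."""
    dimensions: dict = field(default_factory=dict)

    def add_dimension(self, name, allowed_values):
        self.dimensions[name] = set(allowed_values)

    def is_valid(self, labels):
        """A label set is valid if every dimension is known and every value is in range."""
        return all(
            dim in self.dimensions and val in self.dimensions[dim]
            for dim, val in labels.items()
        )


ontology = LabelOntology()
ontology.add_dimension("index", {"gross_output", "employment"})
ontology.add_dimension("unit", {"10k_yuan", "person"})
ontology.add_dimension("region", {"Beijing", "Hebei"})
ontology.add_dimension("period", {"2019", "2020"})

ok = ontology.is_valid({"index": "employment", "unit": "person", "period": "2020"})
bad = ontology.is_valid({"index": "employment", "unit": "kilogram"})
```

In this sketch `ok` is true and `bad` is false, because "kilogram" lies outside the value range registered for the `unit` dimension.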

Step 102: taking as input the multi-source heterogeneous data published by the national statistics department and the administrative departments of each industry, automatically decomposing the data based on the label ontology constructed in step 101 to generate a particle data set, and assigning each particle data item multi-dimensional labels such as industry, administrative division, time period and statistical caliber.
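As an illustration of the decomposition in Step 102, the sketch below splits a small statistical table into one particle per cell and attaches labels to each. It assumes the input is already structured as a region-by-period mapping; the `Particle` type, the `decompose` function and the sample figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Particle:
    value: float
    labels: tuple  # sorted (dimension, value) pairs, so particles are hashable


def decompose(table, shared_labels):
    """Split a {region: {period: value}} table into one particle per cell."""
    particles = []
    for region, series in table.items():
        for period, value in series.items():
            # Each cell inherits the table-wide labels plus its own coordinates.
            labels = dict(shared_labels, region=region, period=period)
            particles.append(Particle(value, tuple(sorted(labels.items()))))
    return particles


table = {"Beijing": {"2019": 10.0, "2020": 12.0}, "Hebei": {"2020": 7.5}}
particles = decompose(table, {"industry": "manufacturing", "unit": "10k_yuan"})
```

Each resulting particle is the smallest unit that still carries a complete meaning: one value together with the industry, unit, region and period it refers to.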

Step 103: implementing the particle data logical storage adapter to achieve compatibility with a specific physical storage architecture. For a relational database such as Oracle, the particle data set and its labels output in step 102 are converted into two-dimensional tables for storage; for a Key-Value database such as Elasticsearch, the particle data output in step 102 and all of its labels are converted into multiple <Key, Value> records; for a graph database such as Neo4j, the particle data and labels output in step 102 are stored as graph nodes, and relationships among the labels, such as Create, Inherit and Include, form the edges of the graph.
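The adapter idea in Step 103 can be sketched as one interface with a mapping per storage architecture. The code below produces plain Python structures rather than calling any database driver; the class names, the key naming scheme and the `HAS_LABEL` edge type are assumptions for illustration, and the label-to-label relationships mentioned in the text (Create, Inherit, Include) are not modeled here.

```python
from abc import ABC, abstractmethod


class StorageAdapter(ABC):
    """Maps one particle (value plus labels) onto a physical storage layout."""

    @abstractmethod
    def map(self, pid, value, labels):
        ...


class RelationalAdapter(StorageAdapter):
    def map(self, pid, value, labels):
        # One table row: each label dimension becomes a column.
        return {"id": pid, "value": value, **labels}


class KeyValueAdapter(StorageAdapter):
    def map(self, pid, value, labels):
        # One record for the value, plus one record per label.
        records = [(f"{pid}:value", value)]
        records += [(f"{pid}:label:{dim}", val) for dim, val in labels.items()]
        return records


class GraphAdapter(StorageAdapter):
    def map(self, pid, value, labels):
        # The particle and each label become nodes; placeholder edges connect them.
        nodes = [("particle", pid, value)] + [("label", d, v) for d, v in labels.items()]
        edges = [(pid, "HAS_LABEL", d) for d in labels]
        return nodes, edges


labels = {"region": "Beijing", "period": "2020"}
row = RelationalAdapter().map("p1", 12.0, labels)
kv = KeyValueAdapter().map("p1", 12.0, labels)
nodes, edges = GraphAdapter().map("p1", 12.0, labels)
```

Keeping the mapping behind a common interface means the rest of the pipeline can stay unchanged when the physical store is swapped.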

Step 104: supporting conventional retrieval over the particle data set, computing statistical indexes of numerical data such as the maximum, minimum, average, variance and median, and outputting a query result subset according to the retrieval range.
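A minimal sketch of the conventional retrieval in Step 104 follows, using Python's standard `statistics` module. Particles are represented here as dictionaries with `value` and `labels` keys, and the variance is taken as the population variance; both choices, like the function names, are assumptions for the example.

```python
import statistics


def retrieve(particles, **criteria):
    """Return the particles whose labels match every given criterion."""
    return [p for p in particles
            if all(p["labels"].get(k) == v for k, v in criteria.items())]


def summarize(particles):
    """Compute the statistical indexes named in Step 104 for numeric values."""
    values = [p["value"] for p in particles]
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # population variance (assumed)
        "median": statistics.median(values),
    }


particles = [
    {"value": 10.0, "labels": {"region": "Beijing", "period": "2019"}},
    {"value": 12.0, "labels": {"region": "Beijing", "period": "2020"}},
    {"value": 7.5, "labels": {"region": "Hebei", "period": "2020"}},
]
subset = retrieve(particles, region="Beijing")
stats = summarize(subset)
```

For the two Beijing particles this yields a maximum of 12.0, a minimum of 10.0, a mean and median of 11.0 and a population variance of 1.0.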

Step 105: supporting semantic computation over the particle data set: the input keywords are compared for similarity with the labels of the particle data, the labels whose similarity exceeds the threshold, together with the data they mark, are treated as hits, and a query result subset is output.
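The threshold-based matching in Step 105 can be sketched as below. The text does not say how similarity is computed, so this example substitutes plain string similarity from the standard library's `difflib.SequenceMatcher` as a stand-in; a real system would presumably use a genuine semantic measure. The threshold of 0.6 and the sample labels are likewise assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a real semantic measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_query(particles, keyword, threshold=0.6):
    """Return particles having at least one label value whose similarity exceeds the threshold."""
    hits = []
    for p in particles:
        if any(similarity(keyword, str(v)) > threshold for v in p["labels"].values()):
            hits.append(p)
    return hits


particles = [
    {"value": 3.2, "labels": {"industry": "manufacturing"}},
    {"value": 1.1, "labels": {"industry": "fishery"}},
]
hits = semantic_query(particles, "manufacture")
```

With these inputs only the particle labeled "manufacturing" is returned, since "fishery" shares too few characters with the keyword to pass the threshold.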

Step 106: defining a particle data analysis description language that performs computation on the query result subsets output in steps 104 and 105 and supports set operations, arithmetic operations, logical operations, predicate logic and custom complex scripts.
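The text does not give a grammar for the description language, so the following is only a toy sketch of how set operations, arithmetic and predicates might compose over query result subsets. Expressions are written as nested tuples, particles as `(label, value)` pairs, and the operator names (`union`, `intersect`, `minus`, `where`, `sum`) are invented for the example.

```python
import operator

ARITHMETIC = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
SET_OPS = {"union": operator.or_, "intersect": operator.and_, "minus": operator.sub}


def evaluate(expr, env):
    """Evaluate a nested-tuple expression against named particle sets in env."""
    if isinstance(expr, str):            # name of a query result subset
        return env[expr]
    if isinstance(expr, (int, float)):   # numeric literal
        return expr
    op, *args = expr
    if op in SET_OPS:
        left, right = (evaluate(a, env) for a in args)
        return SET_OPS[op](left, right)
    if op == "where":                    # predicate: keep particles satisfying pred
        subset, pred = evaluate(args[0], env), args[1]
        return frozenset(p for p in subset if pred(p))
    if op == "sum":                      # reduce a particle set to a number
        return sum(value for _, value in evaluate(args[0], env))
    if op in ARITHMETIC:
        left, right = (evaluate(a, env) for a in args)
        return ARITHMETIC[op](left, right)
    raise ValueError(f"unknown operator: {op}")


env = {
    "A": frozenset({("Beijing", 10.0), ("Hebei", 7.5)}),
    "B": frozenset({("Hebei", 7.5), ("Tianjin", 4.0)}),
}
total = evaluate(("sum", ("union", "A", "B")), env)
share = evaluate(("/", ("sum", ("intersect", "A", "B")), ("sum", "A")), env)
large = evaluate(("where", "A", lambda p: p[1] > 8), env)
```

The point of the sketch is that set-valued and number-valued sub-expressions nest freely, so a single expression can select, combine and then aggregate subsets produced by steps 104 and 105.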

Step 107: implementing access control on each particle data item. Data access level labels are implemented based on protocols such as the Traffic Light Protocol (TLP): red, orange, yellow and green; role-based data access control is also implemented, so that data is accessible only to authorized principals.
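A two-stage decision of this kind, a coarse level check followed by a fine role check, might look like the sketch below. The four level names follow the colors listed in Step 107; their ordering, the notion of a subject "clearance" and the field names are assumptions made for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """Access levels named in Step 107; the ordering is an assumption."""
    GREEN = 0
    YELLOW = 1
    ORANGE = 2
    RED = 3


def decide(subject, particle):
    """Return True to permit access, False to deny."""
    # Coarse-grained check: the subject's clearance must cover the particle's level.
    if subject["clearance"] < particle["level"]:
        return False
    # Fine-grained check: the subject must hold at least one authorized role.
    return bool(set(subject["roles"]) & set(particle["allowed_roles"]))


analyst = {"clearance": Level.ORANGE, "roles": ["analyst"]}
secret = {"level": Level.RED, "allowed_roles": ["analyst"]}
internal = {"level": Level.YELLOW, "allowed_roles": ["analyst", "auditor"]}
restricted = {"level": Level.GREEN, "allowed_roles": ["auditor"]}

decisions = [decide(analyst, p) for p in (secret, internal, restricted)]
```

Here the analyst is denied the red particle by the level check, permitted the yellow one, and denied the green one by the role check, since only auditors are authorized for it.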

Step 108: provided that step 107 permits access, formatting and visualizing the data output in steps 104, 105 and 106.
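As a simple illustration of formatted output gated by the access decision, the sketch below renders only the permitted particles as CSV using the standard `csv` module. The output format, the `permitted` callback and the `public` flag on the sample data are assumptions; the text does not prescribe a report format.

```python
import csv
import io


def to_csv(particles, permitted):
    """Render permitted particles as CSV, one row per particle."""
    # Use the union of all label dimensions as columns, in a stable order.
    columns = sorted({dim for p in particles for dim in p["labels"]})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns + ["value"])
    for p in particles:
        if permitted(p):
            writer.writerow([p["labels"].get(c, "") for c in columns] + [p["value"]])
    return buffer.getvalue()


particles = [
    {"value": 12.0, "labels": {"region": "Beijing", "period": "2020"}, "public": True},
    {"value": 7.5, "labels": {"region": "Hebei", "period": "2020"}, "public": False},
]
report = to_csv(particles, permitted=lambda p: p["public"])
```

The resulting report contains a header row and the single Beijing row; the Hebei particle is omitted because the callback denies it.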

Although specific embodiments of the invention have been disclosed for purposes of illustration and to aid in the understanding of the contents of the invention and the manner in which it may be practiced, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims

1. An information resource management and analysis method based on particle data is characterized by comprising the following steps:

step S01: constructing a label knowledge body of the particle data according to the national standard and the best industry practice;

2. The method for information resource management and analysis based on particle data as claimed in claim 1, wherein: the "particle data label ontology" described in step S01 refers to the attribute set of the particle data, including indexes, groups, time, space, units of measurement, and subjects.

3. The method for information resource management and analysis based on particle data as claimed in claim 1, wherein: the "preprocessing the input multi-source data set" in step S02 is performed as follows: (1) for semi-structured data, converting the semi-structured data into structured data by decoupling the multiple layers of nested attributes in its schema; (2) for unstructured data, computing the hash value of the data and converting the unstructured data into structured data in <Key, Value> form; all multi-source heterogeneous data are finally unified into structured data for further processing.

4. The method for information resource management and analysis based on particle data as claimed in claim 1, wherein:

the "physical storage architecture" described in step S03, including a relational database, a NoSQL database, and a graph database;

the "logical storage adapter" described in step S03 refers to the middleware that maps the particle data into the physical storage according to the physical storage architecture.

5. The method for information resource management and analysis based on particle data as claimed in claim 1, wherein:

the "conventional retrieval of the tag" in step S04 refers to querying a value range of the tag and statistical indexes including a maximum value, a minimum value, and an average value;

the "semantic computation of tags" described in step S04 refers to matching tags having similar logical meanings according to a given subject word.

6. The method for information resource management and analysis based on particle data as claimed in claim 1, wherein: the "particle data analysis description language" described in step S05 provides complex semantic relationship operation descriptors including arithmetic operations, logical operations, custom complex operation scripts, and predicate logic.

7. The method for information resource management and analysis based on particle data as claimed in claim 1, wherein: the "access control decider" described in step S06 refers to deciding, according to an access control policy, whether to permit or deny access to the particle data, and supports coarse-grained access control based on the Traffic Light Protocol (TLP) and fine-grained access control based on roles.