CN111752997A

CN111752997A - Basic library data label analysis system and method

Info

Publication number: CN111752997A
Application number: CN202010618687.0A
Authority: CN
Inventors: 刘国梁
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2020-07-01
Filing date: 2020-07-01
Publication date: 2020-10-09

Abstract

The invention relates to a system and a method for analyzing data labels of a basic library. The basic library data label analysis system comprises a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module. The scene induction module determines the business subject, summarizes all data items required for each individual of that subject, and enters the summarized information into the storage module. The data acquisition module connects to each data set, records all data of each data set in the storage module, and integrates it into a large table containing all information items of the business subject. The storage module stores all data entered into the system so that the system can retrieve it at any time. The label classifier labels the data items according to the labeling rules. The system and method add a new way of using the basic library, provide several convenient and easy-to-use data labeling modes for it, lower the threshold for data analysis, and improve the benefit gained from applying the data.

Description

Basic library data label analysis system and method

Technical Field

The invention relates to the technical field of data classification analysis, in particular to a system and a method for analyzing a data label of a basic database.

Background

The basic information resources of the national basic information bases are business information from the relevant departments and are foundational, authoritative as a benchmark, identifying, and stable. The basic information bases launched by the state so far comprise a population basic information base, a legal entity basic information base, a natural resources and spatial geography basic information base, and a macroeconomic database. The basic information bases are shared among government departments in real time and provide basic information support for the business and government services of government departments at all levels. Meanwhile, regional governments also continue to integrate their own basic libraries of resident population, legal entities, spatial geography and the like, strengthening internal sharing and dynamic updating and improving data accuracy.

Building on the construction of the basic information bases, the invention provides a system and a method for analyzing data labels of a basic library. It aims to provide a convenient, easy-to-use analysis and application tool for the data utilization stage of the basic library and to expand the library's application scenarios and application modes.

Disclosure of Invention

In order to make up for the defects of the prior art, the invention provides a simple and efficient system and method for analyzing the data tags of the basic database.

The invention is realized by the following technical scheme:

a basic library data label analysis system, comprising a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module;

the scene induction module is used for determining the business subject, summarizing all data items of the relevant individuals and entering the summarized information into the storage module;

the data acquisition module is used for connecting to each data set, recording all data of each data set in the storage module and integrating a large table comprising all information items of the business subject;

the storage module is used for storing all data entered into the system so that the system can retrieve it at any time;

the label classifier is used for labeling the data items according to the labeling rules and comprises a formula classifier, a sample label classifier and a cluster classifier;

the task execution module is used for executing the labeling task.

The method for analyzing the data tags of the basic library comprises the following steps:

First, business scenario analysis

Summarize and sort out the business scenario that requires data labels, define the business subject and all data items needed for each individual of the business subject, and enter the summarized information into the system for maintenance;

Second, data acquisition and consolidation

Determine the data sets required by all data items and the association relations among those data sets, enter all data of the required data sets into the system through the data acquisition module, and integrate them into a large table comprising all information items of the business subject;

each record of the large table holds the data items of one unique business individual, and the data items of a record are assembled from multiple data sets according to the association relations among them;
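
The integration into a large table described above can be sketched as a keyed merge. This is only an illustration under stated assumptions: the function name `build_big_table`, the field names and the sample values are hypothetical and do not come from the patent, and each data set is assumed to carry the same identifying key.

```python
from collections import defaultdict


def build_big_table(key, data_sets):
    """Merge several data sets into one wide record per business individual.

    `data_sets` is a list of data sets, each a list of dicts. Every dict is
    assumed to carry the shared key field that associates the data sets.
    """
    big_table = defaultdict(dict)
    for data_set in data_sets:
        for row in data_set:
            # Rows with the same key value are folded into one record.
            big_table[row[key]].update(row)
    return dict(big_table)


# Hypothetical data sets for the population scenario.
income = [
    {"id_number": "A001", "avg_monthly_income": 5200.0},
    {"id_number": "A002", "avg_monthly_income": 3100.0},
]
social_security = [
    {"id_number": "A001", "avg_monthly_payment": 640.0},
    {"id_number": "A002", "avg_monthly_payment": 0.0},
]

big_table = build_big_table("id_number", [income, social_security])
```

Each value of `big_table` is one record for one individual, with columns contributed by both data sets.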

Third, sorting out the labeling rules and selecting a label classifier as needed

Sort out all label values needed by the business scenario and the rule for each label value, that is, the conditions the data items must satisfy for a record to be marked with that label value, and select a label classifier suited to the actual situation to label the data;

Fourth, defining the labeling task

Define a labeling task in the system and submit it to the task execution module for execution;

Fifth, applications of the basic library after labeling

The application scenarios of the label function comprise data query, label query, graphical display of the relations between labels and data, and label value prediction for new data.

In the second step, the data acquisition module handles database heterogeneity and supports importing data from various databases into the large table; after the import is completed, a new information item column is added to the large table to record the label value of each record.

To improve the computational efficiency of labeling while also handling heterogeneous databases, the association relations among the data sets are maintained in the second step, before the labeling task runs. The task execution module therefore does not need to process these relations: during labeling it connects only to the large table and does not need to establish connections to multiple databases.
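
As a rough sketch of importing from heterogeneous sources, the snippet below reads one data set from a relational database and another from a CSV export, merges them by key, and adds the label value column. The adapter functions, table names, field names and values are assumptions made for illustration; the patent does not specify how heterogeneity is handled.

```python
import csv
import io
import sqlite3


def rows_from_sqlite(conn, query):
    """Adapter: read rows from a relational source as plain dicts."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]


def rows_from_csv(text):
    """Adapter: read rows from a CSV export as plain dicts (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def import_into_big_table(key, *sources):
    """Merge rows from all sources by key, then add an empty label column."""
    table = {}
    for rows in sources:
        for row in rows:
            table.setdefault(str(row[key]), {}).update(row)
    for record in table.values():
        record["label_value"] = None  # filled in later by a label classifier
    return table


# Hypothetical sources for the legal entity scenario.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tax (credit_code TEXT, tax_paid REAL)")
conn.execute("INSERT INTO tax VALUES ('C1', 120000.0)")
csv_text = "credit_code,donations\nC1,5000\n"

big_table = import_into_big_table(
    "credit_code",
    rows_from_sqlite(conn, "SELECT credit_code, tax_paid FROM tax"),
    rows_from_csv(csv_text),
)
```

Because the merge happens at import time, a later labeling run only needs to read `big_table` and never opens the original sources.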

In the third step, when the calculation rule for a label is well defined, that is, when a label value can be determined whenever the data items of a record satisfy a given formula condition, the formula is maintained in the formula classifier, every record is evaluated, and the label value that matches the formula is stored in the label value field of that record in the large table.
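
A formula classifier of this kind can be pictured as an ordered list of conditions, each tied to a label value. The sketch below is a minimal illustration; the thresholds, field names and label values are invented for the example and are not taken from the patent.

```python
def formula_classifier(rules, default=None):
    """Return a function that labels a record with the first matching rule.

    `rules` is an ordered list of (label_value, predicate) pairs, where each
    predicate encodes one formula condition over the record's data items.
    """
    def classify(record):
        for label_value, predicate in rules:
            if predicate(record):
                return label_value
        return default
    return classify


# Hypothetical rules: coverage is "high" when contributions reach 10% of income.
rules = [
    ("high", lambda r: r["avg_monthly_payment"] >= 0.10 * r["avg_monthly_income"]),
    ("low", lambda r: r["avg_monthly_payment"] > 0),
]
classify = formula_classifier(rules, default="none")

big_table = [
    {"id_number": "A001", "avg_monthly_income": 5000.0, "avg_monthly_payment": 600.0},
    {"id_number": "A002", "avg_monthly_income": 5000.0, "avg_monthly_payment": 100.0},
    {"id_number": "A003", "avg_monthly_income": 3000.0, "avg_monthly_payment": 0.0},
]
for record in big_table:
    # Store the result in the record's label value field.
    record["label_value"] = classify(record)
```

Rule order matters here: the first condition that holds decides the label, so more specific rules should come first.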

In the third step, when the label values of some records are already known and the remaining unlabeled data must be classified by reference to those records, the records with known label values are fed into the sample label classifier for model training, the trained sample label classifier is then used to label the remaining data, and the resulting label value is stored in the label value field of the corresponding record in the large table.

The sample label classifier supports continuous improvement of the accuracy of the classifier through cross validation and/or new sample training.
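
The patent does not name a learning algorithm for the sample label classifier. As a stand-in, the sketch below uses a one-nearest-neighbor rule and leave-one-out cross-validation to estimate accuracy before labeling the remaining data; the feature vectors and label values are made up for illustration.

```python
import math


def nearest_neighbor_label(train, features):
    """Predict the label of `features` from the closest labeled sample."""
    closest = min(train, key=lambda sample: math.dist(sample[0], features))
    return closest[1]


def leave_one_out_accuracy(samples):
    """Cross-validation: hold out each sample once and predict it from the rest."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += nearest_neighbor_label(rest, features) == label
    return hits / len(samples)


# Hypothetical labeled records (feature vector, known label value).
labeled = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"),
]
# Remaining records whose label value is still unknown.
unlabeled = [(0.9, 1.1), (4.8, 5.3)]

accuracy = leave_one_out_accuracy(labeled)
predicted = [nearest_neighbor_label(labeled, f) for f in unlabeled]
```

Adding newly confirmed samples to `labeled` and re-checking `accuracy` corresponds to the continuous improvement the text mentions.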

In the third step, when neither the label values nor the criteria are clear, the data is first grouped using the cluster classifier, each resulting group is then given a suitable label value as its name, and the data belonging to that group is marked with the label, which is stored in the label value field in the large table.

By abstracting the cluster classifier, module decoupling can be realized, and the system can flexibly cope with various service scenes.
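
The clustering path can be illustrated with a small one-dimensional k-means, followed by naming the groups. The choice of k-means, the deterministic initialization, the sample values and the group names are all assumptions for this sketch; the patent does not prescribe a clustering algorithm.

```python
def k_means_1d(values, k, iterations=20):
    """Tiny 1-D k-means (k >= 2); returns the cluster index for each value."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assign every value to its nearest center.
        assignment = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment


# Hypothetical contribution scores with no predefined label values.
scores = [1.0, 1.5, 0.8, 9.0, 10.0, 9.5]
assignment = k_means_1d(scores, k=2)

# An analyst inspects the groups and names them afterwards.
group_names = {0: "low contribution", 1: "high contribution"}
label_values = [group_names[a] for a in assignment]
```

The naming step is manual by design: the algorithm only separates the data, and a person decides what each group means.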

In the fourth step, when the labeling task is defined, the task information comprises the business subject, the metadata of the large table, the selected label classifier, the parameters of the label classifier and the task re-execution parameters;

the task execution module can be deployed as multiple instances, and a scheduler distributes the data to be labeled in the large table to the executors in batches, which improves the execution efficiency of the whole task.
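
The task definition and batch distribution might look roughly as follows. The `LabelingTask` fields mirror the task information listed above, but the field names, the batch size and the use of a thread pool in place of separately deployed executor instances are assumptions for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LabelingTask:
    subject: str                      # business subject
    table_metadata: dict              # metadata of the large table
    classifier: Callable[..., str]    # selected label classifier
    classifier_params: dict = field(default_factory=dict)
    rerun_on_update: bool = True      # re-execution parameter


def batches(records, size):
    """Split the records to be labeled into fixed-size batches."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def label_batch(task, batch):
    """One executor labels one batch and reports how many records it handled."""
    column = task.table_metadata["label_column"]
    for record in batch:
        record[column] = task.classifier(record, **task.classifier_params)
    return len(batch)


def run_task(task, records, executors=3, batch_size=2):
    """Scheduler: hand batches to a pool of executors and sum their counts."""
    with ThreadPoolExecutor(max_workers=executors) as pool:
        return sum(pool.map(lambda b: label_batch(task, b), batches(records, batch_size)))


# Hypothetical task for the legal entity scenario.
task = LabelingTask(
    subject="legal entity",
    table_metadata={"name": "legal_entity_big_table", "label_column": "label_value"},
    classifier=lambda r, threshold: "high" if r["tax_paid"] >= threshold else "low",
    classifier_params={"threshold": 100},
)
records = [{"tax_paid": v} for v in (10, 200, 30, 400, 50)]
labeled_count = run_task(task, records)
```

Each batch touches a disjoint set of records, so the executors do not need to coordinate with one another.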

In the fourth step, the labeling task can be configured with several trigger modes, including automatic triggering after a data update, manual triggering and scheduled triggering;

after the data is updated, the labeling task is re-executed according to the re-execution parameters configured for the task, and the label values of the existing records or of the newly added records are updated.
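
One way to read the trigger modes and the re-execution parameters is sketched below. The `Trigger` names and the `relabel_existing` flag are hypothetical: the flag stands for a re-execution parameter that chooses between labeling only newly added records and recomputing every record.

```python
from enum import Enum


class Trigger(Enum):
    ON_DATA_UPDATE = "on_data_update"
    MANUAL = "manual"
    SCHEDULED = "scheduled"


def should_run(configured_triggers, event):
    """A task starts only for events it was configured to react to."""
    return event in configured_triggers


def relabel(records, classify, relabel_existing):
    """Re-run labeling after an update and return how many labels changed.

    With relabel_existing=False only records without a label value are
    labeled; with relabel_existing=True every record is recomputed.
    """
    changed = 0
    for record in records:
        if relabel_existing or record.get("label_value") is None:
            new_value = classify(record)
            if new_value != record.get("label_value"):
                record["label_value"] = new_value
                changed += 1
    return changed


# Hypothetical state after an update: one stale label, one new record.
records = [
    {"score": 80, "label_value": "low"},
    {"score": 20, "label_value": None},
]
classify = lambda r: "high" if r["score"] >= 50 else "low"
configured = {Trigger.ON_DATA_UPDATE, Trigger.MANUAL}

new_only = relabel(records, classify, relabel_existing=False)
everything = relabel(records, classify, relabel_existing=True)
```

The first call fills in only the new record; the second also corrects the stale label on the existing one.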

In the fifth step, the data query application searches for a given record and displays all of its label information; the label query application searches for the data that carries a specified label value; the graphical display application visually presents multi-dimensional charts of the relations between labels and business subjects and among individual business individuals; the label value prediction application uses a label classifier to quickly classify newly entered data and predict its label value.
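
Against a labeled large table, the query applications reduce to simple lookups. The sketch below assumes the table is a dict keyed by the individual's identifier with a `label_value` field; the function names and sample data are illustrative. The label distribution is included as the kind of aggregate a graphical display could draw from.

```python
from collections import Counter


def query_record(big_table, key):
    """Data query: fetch one record, including its label information."""
    return big_table.get(key)


def query_by_label(big_table, label_value):
    """Label query: keys of all records that carry the given label value."""
    return sorted(k for k, r in big_table.items() if r["label_value"] == label_value)


def label_distribution(big_table):
    """Counts per label value, for example as input to a chart."""
    return Counter(r["label_value"] for r in big_table.values())


# Hypothetical labeled large table.
big_table = {
    "A001": {"avg_monthly_income": 5200.0, "label_value": "high"},
    "A002": {"avg_monthly_income": 3100.0, "label_value": "low"},
    "A003": {"avg_monthly_income": 6100.0, "label_value": "high"},
}

high_ids = query_by_label(big_table, "high")
distribution = label_distribution(big_table)
```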

The invention has the following beneficial effects: the system and method add a new way of using the basic library, provide several convenient and easy-to-use data labeling modes for it, lower the threshold for data analysis, and improve the benefit gained from applying the data.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The basic library data label analysis system comprises a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module;

the task execution module is used for executing the labeling task.

First, business scenario analysis

For example, in the population base, the person is the business subject. One business scenario that needs labels is the degree of personal social security coverage in a certain city over the last year; the required data items include the identity card number, the average monthly income over the last year, the average monthly expenditure over the last year, the average monthly social security contribution over the last year, the average monthly social security benefit received over the last year, and the like.

For example, in the legal entity base, the legal entity is the business subject. One business scenario that needs labels is the social contribution of a legal entity over the last year; the required data items include the unified social credit code, the taxes paid in the last year, the turnover in the last year, the environmental protection investment in the last year, the number of employees hired in the last year, the number of employees laid off in the last year, the donations made in the last year, the average employee salary in the last year, and the like.

Second, data acquisition and consolidation

Fourth, defining the labeling task

Fifth, applications of the basic library after labeling

The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims

1. A base library data tag analysis system, comprising a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module;

the task execution module is used for executing the labeling task.

2. A method for analyzing a basic library data tag is characterized by comprising the following steps:

First, business scenario analysis

Second, data acquisition and consolidation

Fourth, defining the labeling task

Fifth, applications of the basic library after labeling

3. The method for base library data tag analysis of claim 2, wherein: in the second step, the data acquisition module handles database heterogeneity and supports importing data from various databases into the large table; after the import is completed, a new information item column is added to the large table to record the label value of each record.

4. The method for base library data tag analysis of claim 2, wherein: in the third step, when the calculation rule for a label is well defined, that is, when a label value can be determined whenever the data items of a record satisfy a given formula condition, the formula is maintained in the formula classifier, every record is evaluated, and the label value that matches the formula is stored in the label value field of that record in the large table;

when the label values of some records are already known and the remaining unlabeled data must be classified by reference to those records, the records with known label values are fed into the sample label classifier for model training, the trained sample label classifier is then used to label the remaining data, and the resulting label value is stored in the label value field of the corresponding record in the large table;

when neither the label values nor the criteria are clear, the data is first grouped using the cluster classifier, each resulting group is then given a suitable label value as its name, and the data belonging to that group is marked with the label, which is stored in the label value field in the large table.

5. The method for base library data tag analysis of claim 4, wherein: the sample label classifier supports continuous improvement of the accuracy of the classifier through cross validation and/or new sample training.

6. The method for base library data tag analysis of claim 2, wherein: in the fourth step, when the labeling task is defined, the task information comprises the business subject, the metadata of the large table, the selected label classifier, the parameters of the label classifier and the task re-execution parameters.

7. The method for base library data tag analysis of claim 2 or 6, wherein: in the fourth step, the task execution module can be deployed as multiple instances, and a scheduler distributes the data to be labeled in the large table to the executors in batches, which improves the execution efficiency of the whole task.

8. The method for base library data tag analysis of claim 7, wherein: in the fourth step, the labeling task can be configured with several trigger modes, including automatic triggering after a data update, manual triggering and scheduled triggering.

9. The method for base library data tag analysis of claim 2, wherein: in the fifth step, the data query application searches for a given record and displays all of its label information; the label query application searches for the data that carries a specified label value; the graphical display application visually presents multi-dimensional charts of the relations between labels and business subjects and among individual business individuals; the label value prediction application uses a label classifier to quickly classify newly entered data and predict its label value.