CN111752997A - Basic library data label analysis system and method - Google Patents

Publication number
CN111752997A
Authority
CN
China
Prior art keywords
data
label
classifier
task
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010618687.0A
Other languages
Chinese (zh)
Inventor
刘国梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202010618687.0A
Publication of CN111752997A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system and a method for analyzing data labels of a basic database. The system comprises a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module. The scene induction module determines the business subject, summarizes all data items of the relevant individuals and enters the summarized information into the storage module. The data acquisition module connects to each data set, loads all of its data into the storage module and consolidates the data into a large table comprising all information items of the business subject. The storage module stores all data entered into the system so that the system can call it at any time. The label classifier labels the data items according to the labeling rules. The system and method extend the functions of the basic database, provide several convenient and easy-to-use ways of labeling its data, lower the threshold for data analysis and improve the value obtained from the data.

Description

Basic library data label analysis system and method
Technical Field
The invention relates to the technical field of data classification and analysis, and in particular to a system and a method for analyzing data labels of a basic database.
Background
The basic information resources of the national basic information databases are business information from the relevant departments and are characterized by being foundational, authoritative, identifying and stable. The basic information databases launched by the state so far comprise the population basic information database, the legal entity basic information database, the natural resources and spatial geography basic information database and the macroeconomic database. These databases are shared among government departments in real time and provide basic information support for the business and government services of government departments at all levels. Meanwhile, local governments are also continuously consolidating their own basic databases covering actual population, legal entities, spatial geography and the like, strengthening internal sharing and dynamic updating, and improving data accuracy.
Building on the construction of these basic information databases, the invention provides a system and a method for analyzing data labels of a basic database. It aims to provide a convenient and easy-to-use set of analysis and application tools for the data utilization stage of the basic database, and to expand the application scenarios and application modes of the basic database.
Disclosure of Invention
To remedy the shortcomings of the prior art, the invention provides a simple and efficient system and method for analyzing data labels of a basic database.
The invention is realized by the following technical solution:
A basic database data label analysis system comprises a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module;
the scene induction module is used for determining the business subject, summarizing all data items of the relevant individuals and entering the summarized information into the storage module;
the data acquisition module is used for connecting to each data set, loading all data of each data set into the storage module and consolidating it into a large table comprising all information items of the business subject;
the storage module is used for storing all data entered into the system so that the system can call it at any time;
the label classifier is used for labeling the data items according to the labeling rules and comprises a formula classifier, a sample label classifier and a cluster classifier;
the task execution module is used for executing labeling tasks.
The method for analyzing data labels of a basic database comprises the following steps:
first, business scenario analysis
Summarize and sort out the business scenario whose data needs to be labeled, define the business subject and all data items required for each individual of the business subject, and enter the summarized information into the system for maintenance;
second, data acquisition and consolidation
Determine the data sets required by all data items and the association relations among those data sets, load all data of the required data sets into the system through the data acquisition module, and consolidate it into a large table comprising all information items of the business subject;
each record of the large table holds the data items of one unique business individual, and the data items of a record are assembled from several data sets according to the association relations among them;
third, sorting out the labeling rules and selecting a label classifier as required
Sort out all label values required by the business scenario and the rules for assigning them, that is, the conditions that the data items must meet for a record to be marked with a given label value, and select a label classifier suited to the actual situation to label the data;
fourth, defining the labeling task
Define a labeling task in the system and submit the defined task to the task execution module for execution;
fifth, applying the labeled basic database
The application scenarios of the label function comprise data query, label query, graphical display of the relations between labels and data, and label value prediction for new data.
In the second step, the data acquisition module handles database heterogeneity and supports importing data from various kinds of databases into the large table; after the import is completed, a new column is added to the large table to record the label value of each record.
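The import described in this step can be illustrated with a minimal sketch. It assumes two hypothetical sources in different formats (a CSV export and a JSON export); the field names `id_number`, `avg_monthly_income` and `avg_monthly_payment` and the function `import_sources` are illustrative assumptions and are not specified by the invention.

```python
import csv
import io
import json

# Hypothetical exports from two heterogeneous source systems.
CSV_SOURCE = "id_number,avg_monthly_income\nA001,8000\nA002,3000\n"
JSON_SOURCE = '[{"id": "A001", "payment": 900}]'


def import_sources(csv_text, json_text):
    """Normalise both sources into one table keyed by id_number."""
    large_table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        large_table[row["id_number"]] = {
            "id_number": row["id_number"],
            "avg_monthly_income": float(row["avg_monthly_income"]),
            "avg_monthly_payment": None,
        }
    for item in json.loads(json_text):
        record = large_table.get(item["id"])
        if record is not None:
            record["avg_monthly_payment"] = float(item["payment"])
    # After the import, add the extra column that will hold the label value.
    for record in large_table.values():
        record["label_value"] = None
    return large_table


large_table = import_sources(CSV_SOURCE, JSON_SOURCE)
```

Each source keeps its own field names; only the import code knows how they map onto the columns of the large table.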
To improve the computational efficiency of the labeling process while also solving the heterogeneous database problem, the association relations among the data sets are resolved in the second step, before any labeling task runs. The task execution module therefore does not need to process the association relations: during labeling it connects only to the large table and does not need to establish connections to multiple databases.
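A minimal sketch of this separation, using an in-memory SQLite database, is given below. The table names, column names and label values are hypothetical; the point illustrated is only that the join happens once, before labeling, and that the labeling statement touches the large table alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step two: hypothetical source data sets, joined once on their shared key.
cur.execute("CREATE TABLE income (id_number TEXT PRIMARY KEY, avg_monthly_income REAL)")
cur.execute("CREATE TABLE social_security (id_number TEXT PRIMARY KEY, avg_monthly_payment REAL)")
cur.executemany("INSERT INTO income VALUES (?, ?)", [("A001", 8000.0), ("A002", 3000.0)])
cur.executemany("INSERT INTO social_security VALUES (?, ?)", [("A001", 900.0)])
cur.execute(
    """
    CREATE TABLE large_table AS
    SELECT i.id_number, i.avg_monthly_income, s.avg_monthly_payment
    FROM income AS i
    LEFT JOIN social_security AS s ON s.id_number = i.id_number
    """
)
cur.execute("ALTER TABLE large_table ADD COLUMN label_value TEXT")

# Labeling: only large_table is read and written; no join is needed here.
cur.execute(
    """
    UPDATE large_table
    SET label_value = CASE
        WHEN avg_monthly_payment IS NULL THEN 'no contribution record'
        ELSE 'contributing'
    END
    """
)
conn.commit()
labels = cur.execute(
    "SELECT id_number, label_value FROM large_table ORDER BY id_number"
).fetchall()
```

The `LEFT JOIN` keeps individuals that are missing from one data set, so every business individual still has exactly one record in the large table.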
In the third step, when the calculation rule of a label is clearly defined, that is, when a label value can be determined whenever the data items of a record satisfy a given formula condition, the formula is maintained in the formula classifier, every record is evaluated, and the label value that matches the formula is stored in the label value field of that record in the large table.
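The formula classifier can be sketched as a function applied to every record. The ratio formula, its thresholds and the label values below are assumptions chosen for illustration; the invention does not prescribe any particular formula.

```python
def coverage_label(record):
    """Hypothetical formula: share of income paid as social security."""
    income = record["avg_monthly_income"]
    payment = record["avg_monthly_payment"]
    if income is None or payment is None or income <= 0:
        return None  # the formula does not apply; leave the label empty
    ratio = payment / income
    if ratio >= 0.10:
        return "high coverage"
    if ratio >= 0.05:
        return "medium coverage"
    return "low coverage"


def run_formula_classifier(large_table, formula):
    """Evaluate the formula on every record and store the label value."""
    for record in large_table:
        record["label_value"] = formula(record)
    return large_table


table = [
    {"id_number": "A001", "avg_monthly_income": 8000.0, "avg_monthly_payment": 900.0},
    {"id_number": "A002", "avg_monthly_income": 3000.0, "avg_monthly_payment": 60.0},
]
run_formula_classifier(table, coverage_label)
```

Keeping the formula as a separate function means a new labeling rule can be maintained without changing the code that walks the large table.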
In the third step, when the label values of only some records are known and the remaining unlabeled data must be classified by reference to those records, the labeled records are fed into the sample label classifier for model training; the trained sample label classifier then labels the remaining data, and the resulting label value is stored in the label value field of the corresponding record in the large table.
The sample label classifier supports continuously improving its accuracy through cross-validation and/or training on new samples.
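The invention does not name a specific model for the sample label classifier. As a stand-in, the sketch below uses a 1-nearest-neighbour rule and a leave-one-out form of cross-validation; the feature tuples (income, payment) and the label values are hypothetical.

```python
import math


def nearest_label(features, samples):
    """Return the label of the closest labelled sample (1-nearest-neighbour)."""
    return min(samples, key=lambda s: math.dist(features, s[0]))[1]


def leave_one_out_accuracy(samples):
    """Cross-validate by predicting each sample from all the others."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += nearest_label(features, rest) == label
    return hits / len(samples)


# Records whose label value is already known.
labelled = [
    ((8000.0, 900.0), "high coverage"),
    ((7500.0, 850.0), "high coverage"),
    ((3000.0, 60.0), "low coverage"),
    ((2800.0, 50.0), "low coverage"),
]
# Remaining records to be labelled by reference to the samples.
unlabelled = [(7800.0, 880.0), (2900.0, 40.0)]

predicted = [nearest_label(f, labelled) for f in unlabelled]
accuracy = leave_one_out_accuracy(labelled)
```

The features are used unscaled here for brevity; a practical classifier would normally normalise them first. Adding newly confirmed records to `labelled` corresponds to the new-sample training mentioned above.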
In the third step, when neither the label values nor the labeling criteria are clear, the data is grouped using the cluster classifier; each resulting cluster is then given a suitable label value, and that label is applied to the data belonging to the cluster and stored in the label value field of the large table.
Abstracting the cluster classifier decouples the modules, so that the system can flexibly handle a variety of business scenarios.
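The clustering algorithm is likewise left open by the invention. The following sketch assumes plain k-means with caller-supplied initial centroids, followed by the manual naming of each cluster; the data points and cluster names are hypothetical.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


points = [(8000.0, 900.0), (7500.0, 850.0), (3000.0, 60.0), (2800.0, 50.0)]
assignment = kmeans(points, [points[0], points[3]])

# An analyst inspects the clusters and gives each one a label value.
cluster_names = {0: "higher income group", 1: "lower income group"}
labels = [cluster_names[a] for a in assignment]
```

Because the naming step is separate from the grouping step, the same clustering code can serve different business scenarios with different label vocabularies.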
In the fourth step, when a labeling task is defined, the task information comprises the business subject, the metadata of the large table, the selected label classifier, the parameters of that label classifier and the repeated-execution parameters of the task;
the task execution module can be deployed as multiple instances, and a scheduler distributes the data to be labeled in the large table to the executors in batches, which improves the execution efficiency of the whole task.
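A labeling task and its batched execution can be sketched as follows. The `LabelingTask` fields mirror the task information listed above, but their names are assumptions, and a thread pool stands in for the multiple executor instances, which in a real deployment would be separate processes or services.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LabelingTask:
    business_subject: str
    table_name: str                      # metadata of the large table
    classifier: Callable[..., str]       # the selected label classifier
    classifier_params: dict = field(default_factory=dict)
    rerun_on_update: bool = True         # repeated-execution parameter


def income_band(record, threshold):
    """Hypothetical classifier used only for this example."""
    return "high income" if record["avg_monthly_income"] >= threshold else "low income"


def label_batch(task, batch):
    """Work done by one executor instance on one batch."""
    return [(r["id_number"], task.classifier(r, **task.classifier_params)) for r in batch]


def run_task(task, records, batch_size=2, executors=2):
    """Scheduler: split the records into batches and hand them to executors."""
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    with ThreadPoolExecutor(max_workers=executors) as pool:
        results = list(pool.map(lambda batch: label_batch(task, batch), batches))
    return dict(pair for batch in results for pair in batch)


records = [
    {"id_number": f"A00{n}", "avg_monthly_income": income}
    for n, income in enumerate([8000.0, 3000.0, 5200.0, 4100.0, 9000.0], start=1)
]
task = LabelingTask("population", "large_table", income_band, {"threshold": 5000.0})
labels = run_task(task, records)
```

Since each record is labeled independently of the others, the batches can be processed in any order and the results merged afterwards.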
In the fourth step, a labeling task can be configured with several start triggers, including automatic triggering after a data update, manual triggering and timed triggering;
after the data is updated, the labeling task is re-executed according to the repeated-execution parameters configured for the task, and the label values of existing or newly added records are updated.
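The trigger modes and the re-execution after a data update can be sketched as below. The `Trigger` enum, the `only_changed` flag (standing in for a repeated-execution parameter) and the classifier are assumptions made for illustration.

```python
from enum import Enum


class Trigger(Enum):
    DATA_UPDATE = "data_update"
    MANUAL = "manual"
    TIMED = "timed"


def income_band(record):
    """Hypothetical classifier used only for this example."""
    return "high income" if record["avg_monthly_income"] >= 5000.0 else "low income"


def relabel(large_table, changed_ids, classifier, only_changed=True):
    """Re-execute labeling, either for changed records only or for all records."""
    for key, record in large_table.items():
        if not only_changed or key in changed_ids:
            record["label_value"] = classifier(record)


def on_event(trigger, enabled_triggers, large_table, changed_ids, classifier):
    """Run the task if this trigger mode is configured; report whether it ran."""
    if trigger not in enabled_triggers:
        return False
    # A data update relabels only the touched records; other triggers relabel all.
    relabel(large_table, changed_ids, classifier, only_changed=trigger is Trigger.DATA_UPDATE)
    return True


table = {
    "A001": {"avg_monthly_income": 9000.0, "label_value": "low income"},  # updated record
    "A002": {"avg_monthly_income": 3000.0, "label_value": "unchanged"},   # untouched record
    "A003": {"avg_monthly_income": 2000.0, "label_value": None},          # newly added record
}
enabled = {Trigger.DATA_UPDATE, Trigger.MANUAL}
ran_update = on_event(Trigger.DATA_UPDATE, enabled, table, {"A001", "A003"}, income_band)
ran_timed = on_event(Trigger.TIMED, enabled, table, set(), income_band)
```

In this sketch the timed trigger is not configured for the task, so the second call leaves the table untouched.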
In the fifth step, the data query application means searching for a given record and displaying all of its label information; the label query application means searching for the data that carries a specified label value; the graphical display application means visually presenting multi-dimensional charts of the relations between labels and the business subject and among business individuals; the label value prediction application means using a label classifier to quickly classify and predict the label of newly entered data.
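The data query and label query applications, together with the counts that a chart would be drawn from, can be sketched over an in-memory table. The record layout (a `labels` mapping per record) and the function names are hypothetical.

```python
from collections import Counter

large_table = [
    {"id_number": "A001", "labels": {"coverage": "high coverage", "income": "high income"}},
    {"id_number": "A002", "labels": {"coverage": "low coverage", "income": "low income"}},
    {"id_number": "A003", "labels": {"coverage": "high coverage", "income": "low income"}},
]


def query_data(table, id_number):
    """Data query: find one record and return all of its label information."""
    for record in table:
        if record["id_number"] == id_number:
            return record["labels"]
    return None


def query_by_label(table, label_name, label_value):
    """Label query: find the records that carry a given label value."""
    return [r["id_number"] for r in table if r["labels"].get(label_name) == label_value]


def label_distribution(table, label_name):
    """Counts per label value, usable as the input of a chart."""
    return Counter(r["labels"].get(label_name) for r in table)


a002_labels = query_data(large_table, "A002")
high_coverage_ids = query_by_label(large_table, "coverage", "high coverage")
income_counts = label_distribution(large_table, "income")
```

Label value prediction for new data reuses whichever label classifier was selected in the third step, applied to the newly entered record.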
The beneficial effects of the invention are as follows: the system and method extend the functions of the basic database, provide several convenient and easy-to-use ways of labeling its data, lower the threshold for data analysis and improve the value obtained from the data.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The basic database data label analysis system comprises a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module;
the scene induction module is used for determining the business subject, summarizing all data items of the relevant individuals and entering the summarized information into the storage module;
the data acquisition module is used for connecting to each data set, loading all data of each data set into the storage module and consolidating it into a large table comprising all information items of the business subject;
the storage module is used for storing all data entered into the system so that the system can call it at any time;
the label classifier is used for labeling the data items according to the labeling rules and comprises a formula classifier, a sample label classifier and a cluster classifier;
the task execution module is used for executing labeling tasks.
The method for analyzing data labels of a basic database comprises the following steps:
first, business scenario analysis
Summarize and sort out the business scenario whose data needs to be labeled, define the business subject and all data items required for each individual of the business subject, and enter the summarized information into the system for maintenance;
For example, in the population database the business subject is the population. One business scenario that needs labeling is the degree of personal social security coverage in a given city over the last year; the required data items include the identity card number, average monthly income over the last year, average monthly expenditure over the last year, average monthly social security contribution over the last year, average monthly social security benefit received over the last year, and the like.
For example, in the legal entity database the business subject is the legal entity. One business scenario that needs labeling is the social contribution of legal entities over the last year; the required data items include the unified social credit code, tax paid in the last year, turnover in the last year, environmental protection investment in the last year, number of employees hired in the last year, number of employees laid off in the last year, donations made in the last year, average employee salary in the last year, and the like.
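The two example scenarios can be recorded as simple scenario definitions, as in the sketch below. The `Scenario` class and all field identifiers are hypothetical names chosen to mirror the data items listed in the examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    business_subject: str
    name: str
    key_item: str       # data item that identifies one business individual
    data_items: tuple   # remaining data items the scenario requires

    def missing_items(self, record):
        """List the required data items that a record does not yet contain."""
        return [item for item in (self.key_item, *self.data_items) if item not in record]


SOCIAL_SECURITY_COVERAGE = Scenario(
    business_subject="population",
    name="personal social security coverage, last year",
    key_item="id_number",
    data_items=(
        "avg_monthly_income",
        "avg_monthly_expenditure",
        "avg_monthly_contribution",
        "avg_monthly_benefit",
    ),
)

SOCIAL_CONTRIBUTION = Scenario(
    business_subject="legal entity",
    name="social contribution, last year",
    key_item="unified_social_credit_code",
    data_items=(
        "tax_paid",
        "turnover",
        "environmental_investment",
        "employees_hired",
        "employees_laid_off",
        "donations",
        "avg_employee_salary",
    ),
)

missing = SOCIAL_SECURITY_COVERAGE.missing_items(
    {"id_number": "A001", "avg_monthly_income": 8000.0}
)
```

A definition of this kind tells the data acquisition step which data sets it must locate before the large table can be assembled.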
second, data acquisition and consolidation
Determine the data sets required by all data items and the association relations among those data sets, load all data of the required data sets into the system through the data acquisition module, and consolidate it into a large table comprising all information items of the business subject;
each record of the large table holds the data items of one unique business individual, and the data items of a record are assembled from several data sets according to the association relations among them;
third, sorting out the labeling rules and selecting a label classifier as required
Sort out all label values required by the business scenario and the rules for assigning them, that is, the conditions that the data items must meet for a record to be marked with a given label value, and select a label classifier suited to the actual situation to label the data;
fourth, defining the labeling task
Define a labeling task in the system and submit the defined task to the task execution module for execution;
fifth, applying the labeled basic database
The application scenarios of the label function comprise data query, label query, graphical display of the relations between labels and data, and label value prediction for new data.
In the second step, the data acquisition module handles database heterogeneity and supports importing data from various kinds of databases into the large table; after the import is completed, a new column is added to the large table to record the label value of each record.
To improve the computational efficiency of the labeling process while also solving the heterogeneous database problem, the association relations among the data sets are resolved in the second step, before any labeling task runs. The task execution module therefore does not need to process the association relations: during labeling it connects only to the large table and does not need to establish connections to multiple databases.
In the third step, when the calculation rule of a label is clearly defined, that is, when a label value can be determined whenever the data items of a record satisfy a given formula condition, the formula is maintained in the formula classifier, every record is evaluated, and the label value that matches the formula is stored in the label value field of that record in the large table.
In the third step, when the label values of only some records are known and the remaining unlabeled data must be classified by reference to those records, the labeled records are fed into the sample label classifier for model training; the trained sample label classifier then labels the remaining data, and the resulting label value is stored in the label value field of the corresponding record in the large table.
The sample label classifier supports continuously improving its accuracy through cross-validation and/or training on new samples.
In the third step, when neither the label values nor the labeling criteria are clear, the data is grouped using the cluster classifier; each resulting cluster is then given a suitable label value, and that label is applied to the data belonging to the cluster and stored in the label value field of the large table.
Abstracting the cluster classifier decouples the modules, so that the system can flexibly handle a variety of business scenarios.
In the fourth step, when a labeling task is defined, the task information comprises the business subject, the metadata of the large table, the selected label classifier, the parameters of that label classifier and the repeated-execution parameters of the task;
the task execution module can be deployed as multiple instances, and a scheduler distributes the data to be labeled in the large table to the executors in batches, which improves the execution efficiency of the whole task.
In the fourth step, a labeling task can be configured with several start triggers, including automatic triggering after a data update, manual triggering and timed triggering;
after the data is updated, the labeling task is re-executed according to the repeated-execution parameters configured for the task, and the label values of existing or newly added records are updated.
In the fifth step, the data query application means searching for a given record and displaying all of its label information; the label query application means searching for the data that carries a specified label value; the graphical display application means visually presenting multi-dimensional charts of the relations between labels and the business subject and among business individuals; the label value prediction application means using a label classifier to quickly classify and predict the label of newly entered data.
The embodiment described above is only one specific embodiment of the present invention; ordinary changes and substitutions made by those skilled in the art within the technical scope of the present invention fall within its protection scope.

Claims (9)

1. A basic database data label analysis system, characterized by comprising a scene induction module, a data acquisition module, a storage module, a label classifier and a task execution module;
the scene induction module is used for determining the business subject, summarizing all data items of the relevant individuals and entering the summarized information into the storage module;
the data acquisition module is used for connecting to each data set, loading all data of each data set into the storage module and consolidating it into a large table comprising all information items of the business subject;
the storage module is used for storing all data entered into the system so that the system can call it at any time;
the label classifier is used for labeling the data items according to the labeling rules and comprises a formula classifier, a sample label classifier and a cluster classifier;
the task execution module is used for executing labeling tasks.
2. A method for analyzing data labels of a basic database, characterized by comprising the following steps:
first, business scenario analysis
Summarize and sort out the business scenario whose data needs to be labeled, define the business subject and all data items required for each individual of the business subject, and enter the summarized information into the system for maintenance;
second, data acquisition and consolidation
Determine the data sets required by all data items and the association relations among those data sets, load all data of the required data sets into the system through the data acquisition module, and consolidate it into a large table comprising all information items of the business subject;
each record of the large table holds the data items of one unique business individual, and the data items of a record are assembled from several data sets according to the association relations among them;
third, sorting out the labeling rules and selecting a label classifier as required
Sort out all label values required by the business scenario and the rules for assigning them, that is, the conditions that the data items must meet for a record to be marked with a given label value, and select a label classifier suited to the actual situation to label the data;
fourth, defining the labeling task
Define a labeling task in the system and submit the defined task to the task execution module for execution;
fifth, applying the labeled basic database
The application scenarios of the label function comprise data query, label query, graphical display of the relations between labels and data, and label value prediction for new data.
3. The method for analyzing data labels of a basic database according to claim 2, wherein: in the second step, the data acquisition module handles database heterogeneity and supports importing data from various kinds of databases into the large table; after the import is completed, a new column is added to the large table to record the label value of each record.
4. The method for analyzing data labels of a basic database according to claim 2, wherein: in the third step, when the calculation rule of a label is clearly defined, that is, when a label value can be determined whenever the data items of a record satisfy a given formula condition, the formula is maintained in the formula classifier, every record is evaluated, and the label value that matches the formula is stored in the label value field of that record in the large table;
when the label values of only some records are known and the remaining unlabeled data must be classified by reference to those records, the labeled records are fed into the sample label classifier for model training, the trained sample label classifier then labels the remaining data, and the resulting label value is stored in the label value field of the corresponding record in the large table;
when neither the label values nor the labeling criteria are clear, the data is grouped using the cluster classifier, each resulting cluster is then given a suitable label value, and that label is applied to the data belonging to the cluster and stored in the label value field of the large table.
5. The method for analyzing data labels of a basic database according to claim 4, wherein: the sample label classifier supports continuously improving its accuracy through cross-validation and/or training on new samples.
6. The method for analyzing data labels of a basic database according to claim 2, wherein: in the fourth step, when a labeling task is defined, the task information comprises the business subject, the metadata of the large table, the selected label classifier, the parameters of that label classifier and the repeated-execution parameters of the task.
7. The method for analyzing data labels of a basic database according to claim 2 or 6, wherein: in the fourth step, the task execution module can be deployed as multiple instances, and a scheduler distributes the data to be labeled in the large table to the executors in batches, which improves the execution efficiency of the whole task.
8. The method for analyzing data labels of a basic database according to claim 7, wherein: in the fourth step, a labeling task can be configured with several start triggers, including automatic triggering after a data update, manual triggering and timed triggering;
after the data is updated, the labeling task is re-executed according to the repeated-execution parameters configured for the task, and the label values of existing or newly added records are updated.
9. The method for analyzing data labels of a basic database according to claim 2, wherein: in the fifth step, the data query application means searching for a given record and displaying all of its label information; the label query application means searching for the data that carries a specified label value; the graphical display application means visually presenting multi-dimensional charts of the relations between labels and the business subject and among business individuals; the label value prediction application means using a label classifier to quickly classify and predict the label of newly entered data.
CN202010618687.0A 2020-07-01 2020-07-01 Basic library data label analysis system and method Pending CN111752997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010618687.0A CN111752997A (en) 2020-07-01 2020-07-01 Basic library data label analysis system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010618687.0A CN111752997A (en) 2020-07-01 2020-07-01 Basic library data label analysis system and method

Publications (1)

Publication Number Publication Date
CN111752997A true CN111752997A (en) 2020-10-09

Family

ID=72678606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010618687.0A Pending CN111752997A (en) 2020-07-01 2020-07-01 Basic library data label analysis system and method

Country Status (1)

Country Link
CN (1) CN111752997A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114036387A (en) * 2021-11-16 2022-02-11 蛮牛健康管理服务有限公司 Large health field label system and user portrait construction method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120047507A1 (en) * 2010-08-19 2012-02-23 International Business Machines Corporation Selective constant complexity dismissal in task scheduling
CN109213750A (en) * 2017-06-30 2019-01-15 勤智数码科技股份有限公司 A kind of information resources recommended method of knowledge based library label
CN109101652A (en) * 2018-08-27 2018-12-28 宜人恒业科技发展(北京)有限公司 A kind of creation of label and management system
CN110955481A (en) * 2019-11-27 2020-04-03 北京锐安科技有限公司 Label task generation method and device, storage medium and electronic equipment
CN111191125A (en) * 2019-12-24 2020-05-22 长威信息科技发展股份有限公司 Data analysis method based on tagging


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201009