CN113535707B - Method for managing personnel information data based on big data - Google Patents

Info

Publication number
CN113535707B
Authority
CN
China
Prior art keywords
data
personnel information
information data
exploration
definition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110895458.8A
Other languages
Chinese (zh)
Other versions
CN113535707A (en
Inventor
阎星娥
杨昆
刘慰慰
严荣明
张�林
袁勇斌
薛世峰
石旦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Huafei Data Technology Co ltd
Original Assignee
Nanjing Huafei Data Technology Co ltd
Application filed by Nanjing Huafei Data Technology Co ltd
Priority to CN202110895458.8A
Publication of CN113535707A
Application granted
Publication of CN113535707B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method for managing personnel information data based on big data, comprising the following steps: 1) data standard: standardizing the personnel information data; 2) data recording: uploading the basic information and original data of the personnel information data for registration and filing; 3) data exploration: exploring the personnel information data and generating a personnel information data exploration report; 4) data pre-cleaning: obtaining the personnel information data exploration report and performing pre-cleaning operations on the personnel information data; 5) data definition: defining the reading, processing, and governance of the personnel information data, with the registered personnel information data as the dimension; 6) data access processing: reading the data, accessing multi-source heterogeneous personnel information data into a big data processing center, and processing the personnel information data during access; 7) data asset: performing asset management on the personnel information data. The invention can provide complete, timely, and high-quality personnel information data.

Description

Method for managing personnel information data based on big data
Technical Field
The invention relates to a method for managing personnel information data based on big data, and belongs to the field of personnel information data management.
Background
Today, human production and daily life generate enormous and varied data every day, at an ever-increasing rate; as a result, accessing, processing, and managing massive heterogeneous personnel information data receives more and more attention. In enterprises, personnel information data is often scattered and of uneven quality, for the following reasons:
first, the volume of personnel information data is huge, its sources are varied, and its structure is disordered; the data lack a unified standard specification, and disordered data waste storage resources;
second, the incompleteness and inaccuracy of personnel information data are increasingly evident, and low data quality has become the core problem of personnel information data.
Disclosure of Invention
The invention provides a method for managing personnel information data based on big data, and aims to solve the problem of low quality of the personnel information data.
The technical solution of the invention is as follows: a method for managing personnel information data based on big data comprises the following steps:
1) data standard: standardizing the personnel information data and managing it in a unified, standard way, so as to eliminate personnel information data barriers between departments;
2) data recording: uploading the basic information and original data of the personnel information data for registration and filing, so that the personnel information data is registered on record;
3) data exploration: obtaining the original data registered and filed in step 2), exploring it, and generating a personnel information data exploration report;
4) data pre-cleaning: obtaining the personnel information data exploration report from step 3), identifying the quality problems of the personnel information data, performing pre-cleaning operations on the personnel information data, and storing the result in a Hive temporary database;
5) data definition: defining the reading, processing, and governance of the personnel information data, with the registered personnel information data as the dimension, generating the configuration required by step 6), and forming a personnel information data definition result for a big data governance platform to call;
6) data access processing: according to business requirements and based on steps 3) to 5), reading the data, accessing multi-source heterogeneous personnel information data into a big data processing center, processing the personnel information data during access, checking the data against the personnel information data provider, and finally writing the processed personnel information data to a file for storage;
7) data asset: performing asset management on the personnel information data accessed in step 6) and keeping track of the state of the personnel information data assets.
Further, the personnel information data exploration in step 3) comprises two passes of multidimensional exploration analysis: one on the original data, and one on the personnel information data after the data pre-cleaning of step 4).
Further, the multidimensional exploration analysis in step 3) comprises personnel information data volume exploration, personnel information data field and quality exploration, and problem data exploration.
Further, the personnel information data volume exploration determines the total volume of the personnel information data.
Further, the personnel information data field and quality exploration comprises: a) field null rate exploration, b) named entity exploration, c) type and format exploration.
Further, the field null rate exploration specifically counts the proportion of null values in a field by formula (1):
[Formula (1): equation image not reproduced]
in the formula: Rate represents the null rate, f(k) represents the number of null values in the field, k represents the lower bound, n represents the upper bound, m represents the total number of rows, and z represents the number of rows containing special characters.
Further, the named entity exploration specifically comprises: automatically exploring and analyzing the identification of the field content, then applying manual intervention based on the values of the original data, so as to identify named entities in the field content such as person names, place names, certificate numbers, and mobile phone numbers.
Further, the type and format exploration specifically comprises: checking whether the field types and formats of the personnel information data conform to the specification.
Further, the problem data exploration checks the validity of each field and finds the data in the field that does not conform to the specification; the scrambling code rate (the proportion of garbled values) of each column is counted by formula (2):
[Formula (2): equation image not reproduced]
in the formula: Rate2 represents the scrambling code rate, g(b) represents the number of scrambling codes (garbled values), b represents the lower bound, e represents the upper bound, and h represents the total number of rows.
Further, the data pre-cleaning in step 4) comprises condition filtering, field splicing, field splitting, and character string replacement operations; the condition filtering offers three rule choices, namely null, non-null, and range, and the front end passes the selected rule to the back end as a parameter in the form of a query condition; the field splicing, field splitting, and character string replacement splice and split existing fields and replace character strings by means of the CONCAT function and the REPLACE function.
Further, the data definition in step 5) is based on the data standard and comprises a data reading definition, a data format definition, a data processing definition, and a data governance definition; the data reading definition defines how the original data is read from the source platform according to the personnel information data exploration result, and defines the file character set of the personnel information data according to business requirements; the data format definition refers to the data standard and maps the original personnel information fields to the standard personnel information fields; the data processing definition comprises step1, a data cleaning strategy definition, and step2, a data extraction strategy definition; the data governance definition comprises resource catalog registration, in which the personnel information data registered in the data recording step is synchronously registered in the data resource catalog, so that the state of the personnel information data is comprehensively tracked.
Further, the step1 data cleaning strategy definition in step 5) specifically comprises: according to business requirements, defining the strategies for condition filtering, field splicing, field splitting, and character string replacement of the personnel information data; these differ from the condition filtering, field splicing, field splitting, and character string replacement operations of step 4) in the data they target: step 4) performs pre-cleaning operations on sample data, whereas the rules defined here are the cleaning strategy for the real data and provide the basis for the step1 cleaning in step 6);
further, the step2 data extraction strategy definition in step 5) specifically comprises: according to business requirements, defining the extraction mapping from the source data to the target data, and extracting the relations between some fields of the personnel information data and the personnel information data.
Further, in step 6), the data reading extracts data from the Hive temporary database holding the pre-cleaned personnel information data and checks whether the data definitions are consistent; if they are consistent, the data is read, and if not, reading stops; the data checking is performed synchronously during the data reading stage, and checks the integrity and correctness of the personnel information data at a given reconciliation time node.
Further, the personnel information data processing in step 6) comprises step1 cleaning and step2 extraction, as follows:
the personnel information data is accessed in real time according to the data cleaning strategy definition;
step1 cleaning: according to the data cleaning strategy definition, the condition filtering, field splicing, field splitting, and character string replacement operations of step 4) are executed again on the personnel information data accessed in real time; steps 4), 5), and 6) nevertheless differ: step 4) targets sample data, whereas steps 5) and 6) target real data; and step 5) defines the cleaning rules, whereas step 6) processes the real data according to the rules defined in step 5);
step2 extraction: according to the data extraction strategy definition, target-format data is extracted from the source-format data, and the relations between some fields of the personnel information data accessed in real time and the personnel information data are extracted.
Further, the asset management in step 7) of the personnel information data accessed in step 6) specifically comprises: visually displaying a resource overview through the resource catalog; the resource catalog manages data, including resource classification and cataloguing, and organizes both the data stored by the big data management platform and the personnel information data provided to the big data management platform through interfaces.
The invention has the beneficial effects that:
according to the method for managing the personnel information data based on the big data, the scattered and diversified personnel information data are standardized, and the personnel information data are subjected to data filing, data exploration, data precleaning, data definition and data access processing to realize data management and control, so that complete and timely high-quality personnel information data can be provided.
Drawings
FIG. 1 is a general design architecture diagram of the present invention.
FIG. 2 is a schematic diagram of a data standard flow of the present invention.
FIG. 3 is a schematic diagram of the data precleaning process of the present invention.
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
A method for managing personnel information data based on big data comprises the following steps:
1) data standard: standardizing the personnel information data and managing it in a unified, standard way, which eliminates personnel information data barriers between departments and makes the personnel information data easy to share;
2) data recording: uploading the basic information and original data of the personnel information data for unified registration and filing, so that the personnel information data is registered on record;
3) data exploration: obtaining the original data registered and filed in step 2), exploring it, and generating a personnel information data exploration report; data exploration is a key step for personnel information data quality: it controls the data at its source, improves the quality and ensures the correctness of the personnel information data, and provides the basis for the subsequent data pre-cleaning; preferably, the personnel information data is explored by multidimensional exploration analysis;
4) data pre-cleaning: obtaining the personnel information data exploration report from step 3), identifying the quality problems of the personnel information data, performing pre-cleaning operations on the personnel information data, and storing the result in a Hive temporary database, which resolves the quality problems of the personnel information data, ensures its consistency, and improves its use value and quality;
5) data definition: defining the reading, processing, and governance of the personnel information data, with the registered personnel information data as the dimension, generating the configuration required by step 6), and forming a personnel information data definition result for a big data governance platform to call;
6) data access processing: according to business requirements and based on steps 3) to 5), accessing multi-source heterogeneous personnel information data into a big data processing center, processing the personnel information data during access, checking the data against the personnel information data provider, and finally writing the processed personnel information data to a file for storage, which improves the efficiency of personnel information data access processing;
7) data asset: performing asset management on the personnel information data accessed in step 6), keeping track of the state of the personnel information data assets, and visually displaying the accessed personnel information data assets so that they can be comprehensively understood.
The personnel information data exploration in step 3) comprises two passes of multidimensional exploration analysis, one on the original data and one on the personnel information data after the data pre-cleaning of step 4); this gives a two-way check between exploration and data pre-cleaning and ensures that personnel information data problems are found in time.
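The two-pass check can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the method: the exploration report is reduced to a per-field null rate, the pre-cleaning to a single non-null filter on the XM field, and the helper names `explore` and `preclean` and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

def explore(rows: List[Row]) -> Dict[str, float]:
    """Stand-in exploration report: the null rate of every field."""
    fields = sorted({name for row in rows for name in row})
    total = len(rows)
    return {
        name: (sum(1 for row in rows if not row.get(name)) / total) if total else 0.0
        for name in fields
    }

def preclean(rows: List[Row]) -> List[Row]:
    """Stand-in pre-cleaning: drop rows whose XM (name) value is missing."""
    return [row for row in rows if row.get("XM")]

raw = [
    {"XM": "Zhang San", "DZ": "Nanjing"},
    {"XM": None, "DZ": "Nanjing"},
    {"XM": "Li Si", "DZ": None},
]
before = explore(raw)            # first pass, on the original data
after = explore(preclean(raw))   # second pass, on the pre-cleaned data
# Fields that still contain nulls after pre-cleaning need further attention.
remaining = {name: rate for name, rate in after.items() if rate > 0}
print(before, after, remaining)
```

Comparing the two reports shows which problems the pre-cleaning resolved (here XM) and which remain (here DZ).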
The multidimensional exploration analysis in step 3) comprises personnel information data volume exploration, personnel information data field and quality exploration, and problem data exploration.
The personnel information data volume exploration determines the total volume of the personnel information data.
The personnel information data field and quality exploration comprises: a) field null rate exploration, b) named entity exploration, c) type and format exploration.
The field null rate exploration counts the proportion of null values in a field by formula (1):
[Formula (1): equation image not reproduced]
in the formula: Rate represents the null rate, f(k) represents the number of null values in the field, k represents the lower bound, n represents the upper bound, m represents the total number of rows, and z represents the number of rows containing special characters; the field null rate, that is, the proportion of null values in the field, is used to explore how much of the personnel information data is empty, to find the useful data in it, and to assess the value of the personnel information data.
The named entity exploration specifically comprises: automatically exploring and analyzing the identification of the field content, then applying manual intervention based on the values of the original data, so as to identify named entities in the field content such as person names, place names, certificate numbers, and mobile phone numbers; this provides the basis for the later extraction of personnel information data, from which valuable personnel information data relations are extracted.
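The automatic part of this exploration might look like the sketch below, which proposes a label for a column and leaves confirmation to a reviewer. The two patterns (an 18-character identity card number and an 11-digit mobile number starting with 1) and the names `classify` and `guess_entity` are assumptions for illustration; recognizing person names and place names would need dictionaries or a trained recognizer, which is not shown.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical rules: 17 digits plus a digit or X for an identity card number,
# and 11 digits starting with 1 for a mobile phone number.
PATTERNS = {
    "id_number": re.compile(r"\d{17}[\dXx]"),
    "mobile_number": re.compile(r"1[3-9]\d{9}"),
}

def classify(value: str) -> str:
    """Label a single value with the first rule that matches it entirely."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return "unknown"

def guess_entity(values: Iterable[str]) -> str:
    """Propose the most common label for a column; a reviewer confirms it."""
    counts = Counter(classify(v) for v in values)
    return counts.most_common(1)[0][0] if counts else "unknown"

print(guess_entity(["11010119900101123X", "110101199001011234", "abc"]))
```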
The type and format exploration specifically comprises: checking whether the field types and formats of the personnel information data conform to the specification, which provides a data quality basis for the later extraction of personnel information data and ensures its quality. The problem data exploration checks the validity of each field and finds the data in the field that does not conform to the specification; the scrambling code rate (the proportion of garbled values) of each column is counted by formula (2):
[Formula (2): equation image not reproduced]
in the formula: Rate2 represents the scrambling code rate, g(b) represents the number of scrambling codes (garbled values), b represents the lower bound, e represents the upper bound, and h represents the total number of rows.
The personnel information data pre-cleaning in step 4) comprises condition filtering, field splicing, field splitting, and character string replacement operations; the condition filtering offers three rule choices, namely null, non-null, and range, and the front end passes the selected rule to the back end as a parameter in the form of a query condition; the field splicing, field splitting, and character string replacement splice and split existing fields and replace character strings by means of the CONCAT function and the REPLACE function.
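The pre-cleaning operations can be illustrated in plain Python. The text names CONCAT and REPLACE, presumably as SQL functions run against the temporary database; the sketch below only emulates their effect on in-memory rows. The function names, the inclusive string comparison used for the range rule, and the sample rows are assumptions, and field splitting is omitted for brevity.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]

def is_null(value: Optional[str]) -> bool:
    return value is None or value == ""

def filter_rows(rows: List[Row], field: str, rule: str,
                low: str = "", high: str = "") -> List[Row]:
    """Keep the rows that satisfy one of the rules: null, non_null, range."""
    if rule == "null":
        return [r for r in rows if is_null(r.get(field))]
    if rule == "non_null":
        return [r for r in rows if not is_null(r.get(field))]
    if rule == "range":  # inclusive string comparison, an assumption
        return [r for r in rows
                if not is_null(r.get(field)) and low <= r[field] <= high]
    raise ValueError(f"unknown rule: {rule}")

def concat_fields(rows: List[Row], fields: Sequence[str],
                  new_field: str, sep: str = " ") -> List[Row]:
    """Append new_field, built by joining the chosen fields with sep."""
    return [{**r, new_field: sep.join(r.get(f) or "" for f in fields)}
            for r in rows]

def replace_in_field(rows: List[Row], field: str, old: str, new: str) -> List[Row]:
    """Replace a substring in one field, leaving null values untouched."""
    return [{**r, field: r[field].replace(old, new)}
            if not is_null(r.get(field)) else dict(r)
            for r in rows]

rows = [
    {"XM": "Zhang San", "CSRQ": "1996/01/02", "SJ": "08:30"},
    {"XM": None, "CSRQ": "1995/03/04", "SJ": "09:15"},
]
kept = filter_rows(rows, "XM", "non_null")
kept = concat_fields(kept, ["CSRQ", "SJ"], "CSSJ")
kept = replace_in_field(kept, "CSRQ", "/", "-")
print(kept)
```

Each function returns new rows rather than modifying its input, so the original sample stays available for a second exploration pass.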
The data definition in step 5) is based on the data standard and comprises a data reading definition, a data format definition, a data processing definition, and a data governance definition.
The data reading definition defines how the original data is read from the source platform according to the personnel information data exploration result, and defines the file character set of the personnel information data according to business requirements.
The data format definition refers to the data standard and maps the original personnel information fields to the standard personnel information fields.
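A field mapping of this kind can be sketched as a lookup table. The source field names (XM, HM, CSRQ, XB, DZ) are those used in the Examples section; the standard field names on the right, the constant `FIELD_MAP`, and the function `to_standard` are hypothetical, since the text does not list the fields of the data standard.

```python
from typing import Dict, Optional

# Original field -> standard field. The standard names are placeholders for
# whatever the data standard actually defines.
FIELD_MAP: Dict[str, str] = {
    "XM": "name",
    "HM": "id_number",
    "CSRQ": "birth_date",
    "XB": "gender",
    "DZ": "address",
}

def to_standard(row: Dict[str, Optional[str]],
                field_map: Dict[str, str] = FIELD_MAP) -> Dict[str, Optional[str]]:
    """Rename original fields to standard fields; reject unmapped fields."""
    unmapped = sorted(set(row) - set(field_map))
    if unmapped:
        raise KeyError(f"no standard field defined for: {unmapped}")
    return {field_map[name]: value for name, value in row.items()}

print(to_standard({"XM": "Zhang San", "HM": "110101199001011234"}))
```

Raising on an unmapped field is one design choice; passing such fields through unchanged would be an equally plausible policy.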
The data processing definition specifically comprises:
step1 data cleaning strategy definition: according to business requirements, defining the strategies for condition filtering, field splicing, field splitting, and character string replacement of the personnel information data; this differs from step 4) in the data it targets: step 4) performs pre-cleaning operations on sample data, whereas the rules defined here are for cleaning the real data and provide the basis for the step1 cleaning in step 6);
step2 data extraction strategy definition: according to business requirements, defining the extraction mapping from the source data to the target data, and extracting the relations between some fields of the personnel information data and the personnel information data.
The data governance definition comprises resource catalog registration, in which the personnel information data registered in the data recording step is synchronously registered in the data resource catalog, so that the state of the personnel information data is comprehensively tracked.
In step 6), the data reading extracts data from the Hive temporary database holding the pre-cleaned personnel information data and checks whether the data definitions are consistent; if they are consistent, the data is read, and if not, reading stops; the data checking is performed synchronously during the data reading stage, and checks the integrity and correctness of the personnel information data between the data provider and the data reading process at a given reconciliation time node.
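The read-then-reconcile logic might be sketched as below. The text does not say how consistency, integrity, or correctness are measured, so the sketch makes three assumptions: definitions are consistent when the defined field list equals the table's field list, integrity means equal row counts, and correctness means equal content digests. All function names are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Sequence

def definitions_consistent(defined: Sequence[str], actual: Sequence[str]) -> bool:
    """Assumed check: the defined fields match the table's fields, in order."""
    return list(defined) == list(actual)

def digest(lines: Sequence[str]) -> str:
    """Order-independent SHA-256 digest of a batch of serialized rows."""
    h = hashlib.sha256()
    for line in sorted(lines):
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

def reconcile(provider: Sequence[str], received: Sequence[str]) -> Dict[str, bool]:
    """Compare what the provider sent with what was read at one checkpoint."""
    return {
        "complete": len(provider) == len(received),
        "correct": digest(provider) == digest(received),
    }

def read_batch(defined: Sequence[str], actual: Sequence[str],
               provider: Sequence[str],
               received: Sequence[str]) -> Optional[Dict[str, bool]]:
    if not definitions_consistent(defined, actual):
        return None  # stop reading when the definitions disagree
    return reconcile(provider, received)

print(read_batch(["XM", "HM"], ["XM", "HM"], ["a,1", "b,2"], ["b,2", "a,1"]))
```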
The personnel information data processing in step 6) comprises cleaning and extraction, specifically:
step1 cleaning: according to the data cleaning strategy definition, the operations of step 4) are executed again on the personnel information data accessed in real time; steps 4), 5), and 6) nevertheless differ: step 4) targets sample data, whereas steps 5) and 6) target real data; and step 5) defines the cleaning rules, whereas step 6) processes the real data according to the rules defined in step 5);
step2 extraction: according to the data extraction strategy definition, target-format data is extracted from the source-format data, and the relations between some fields of the personnel information data accessed in real time and the personnel information data are extracted.
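One possible reading of the step2 extraction is sketched below, under assumptions: the extraction strategy is taken to be a mapping from source fields to target fields, and a relation is taken to pair a person's certificate number with another non-empty field of the same record. The triple layout, the mapping `EXTRACT_MAP`, and both function names are hypothetical; the text does not specify them.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
Triple = Tuple[str, str, str]

# Hypothetical extraction strategy: source field -> target field.
EXTRACT_MAP: Dict[str, str] = {"HM": "id_number", "XM": "name", "DZ": "address"}

def extract_target(row: Row, mapping: Dict[str, str] = EXTRACT_MAP) -> Row:
    """Project a source-format row onto the target format."""
    return {target: row.get(source) for source, target in mapping.items()}

def extract_relations(record: Row, key: str = "id_number") -> List[Triple]:
    """Emit (key value, field name, field value) for every non-empty field."""
    subject = record.get(key)
    if not subject:
        return []
    return [(subject, name, value)
            for name, value in record.items()
            if name != key and value]

target = extract_target(
    {"HM": "110101199001011234", "XM": "Zhang San", "DZ": None, "AGE": "30"}
)
print(target)
print(extract_relations(target))
```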
The asset management in step 7) of the personnel information data accessed in step 6) specifically comprises: visually displaying a resource overview through the resource catalog; the resource catalog manages data, including resource classification and cataloguing, and organizes both the data stored by the big data management platform and the personnel information data provided to the big data management platform through interfaces.
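A resource catalog with classification and an overview could be modelled as in the following sketch. The `Resource` fields, the category labels, and the shape of the overview are assumptions chosen for illustration; the text describes only that resources are classified, catalogued, and displayed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Resource:
    name: str
    category: str   # hypothetical classification label
    source: str     # "stored" on the platform or provided via "interface"
    row_count: int

class ResourceCatalog:
    def __init__(self) -> None:
        self._by_category: Dict[str, List[Resource]] = defaultdict(list)

    def register(self, resource: Resource) -> None:
        """Catalogue a resource under its category."""
        self._by_category[resource.category].append(resource)

    def overview(self) -> Dict[str, Dict[str, int]]:
        """Per-category resource count and total rows, ready for display."""
        return {
            category: {"resources": len(items),
                       "rows": sum(r.row_count for r in items)}
            for category, items in sorted(self._by_category.items())
        }

catalog = ResourceCatalog()
catalog.register(Resource("staff_table", "personnel", "stored", 7))
catalog.register(Resource("staff_feed", "personnel", "interface", 3))
print(catalog.overview())
```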
Through data standard, data filing, data exploration, data pre-cleaning, data definition, data access processing, and data asset management, the method uses a big data management platform to manage and control personnel information data. Defining personnel information data on the basis of data standardization, with experience-based self-learning and with standard-comparison self-learning and recommendation functions for the defined resources, improves the efficiency of personnel information data access; providing a rich set of real-time processing rules improves the efficiency of personnel information data governance; standardizing the personnel information data improves the validity and compliance of the data and ensures its correctness and quality; the two-way check of diversified exploration and pre-cleaning before access allows data quality problems to be found and solved in time; and reliable personnel information data raises the use value and quality of the data, provides strong support for personnel information data management, and brings economic benefit to enterprises.
Examples
Embodiments of the present invention are described in further detail below with reference to FIGS. 1 to 3:
FIG. 1 shows the complete process of personnel information data filing, exploration, pre-cleaning, definition, and access processing; the specific steps are as follows:
1) data standard: standardizing the personnel information data and managing it in a unified, standard way, which eliminates personnel information data barriers between departments and makes the personnel information data easy to share;
2) data recording: uploading the basic information and original data of the personnel information data for unified registration and filing, so that the personnel information data is registered on record;
3) data exploration: obtaining the registered original data from step 2), performing multidimensional exploration analysis on it, and generating a personnel information data exploration report; data exploration is a key step for personnel information data quality: it controls the data at its source, improves the quality of the personnel information data, and provides the basis for the subsequent cleaning;
the following personnel information table is used as an example:
the personnel information table contains 7 records in total and 7 fields, AGE, HM, XM, CSRQ, SJ, XB, and DZ, where AGE is the age, HM is the certificate number, XM is the name, CSRQ is the date of birth, SJ is the time, XB is the gender, and DZ is the address:
step1: the data volume exploration determines the total volume of the personnel information data and provides the basis for the subsequent null rate calculation; the total number of records found in the personnel information table is m = 7;
step2: if the DZ field is null in all 7 records of the personnel information table, then according to the field null rate formula (1)
[Equation image not reproduced]
the null rate of the DZ field is calculated to be 100%;
step3: the type and format exploration examines the types and formats of the fields in the personnel information table and shows the share of each format in a field; if the HM field holds a correct identity card number in 5 records of the personnel information table and contains special symbols such as # or $ in the other 2 records, the type and format exploration identifies the 5 records as being in identity card number format and the other 2, which do not conform to that format, as unknown data; that is, for the HM field of the personnel information table, the identity card number format accounts for 71.4% and the unknown format for 28.6%;
step4: the named entity exploration automatically explores and analyzes the identification of the field contents of the personnel information table, then applies manual intervention based on the values of the original data, so as to identify named entities in the field contents such as certificate numbers, mobile phone numbers, and addresses, which provides the basis for the data extraction strategy definition in the data access processing; for example, the HM field of the personnel information table is identified as an identity card number, the XM field as a name, and the DZ field as an address, and any entity that matches no existing rule defaults to unknown;
step5 problem data exploration: data that does not conform to the specification in the fields of the personnel information table is explored, and the garbled-character rate of each column is counted, where field values containing special characters such as # and $ are identified as garbled; referring to step3, the HM field of 5 records in the personnel information table holds a correct identity card number, and the HM field of the other 2 records contains the special symbols # and $; according to the formula
Figure BDA0003197703680000131
the garbled-character rate of the certificate number is calculated to be 28.6%; if, among the 7 records in the personnel information table, the CSRQ field value of 3 records falls in 1996 and that of 2 records falls in 1995, then exploring the time distribution of the CSRQ field gives: ranked first, 1996, with a data volume of 3; ranked second, 1995, with a data volume of 2;
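As a non-limiting illustration, the format proportions of step3, the garbled-character rate of step5 and the CSRQ time distribution can be sketched in Python. The sample values, the regular expressions and the function names below are assumptions made for this sketch only; the patent does not specify how a value is matched against the identity card number format, and the rate is computed here simply as the garbled count divided by the row count.

```python
import re
from collections import Counter

# Made-up sample: 7 HM values, 5 well-formed and 2 containing # or $.
HM_VALUES = [
    "110101199601010011",
    "110101199602020022",
    "110101199603030033",
    "110101199504040044",
    "11010119950505005X",
    "1101011990#6060066",
    "11010119900707$077",
]
# Made-up CSRQ values: 3 in 1996, 2 in 1995, 2 in other years.
CSRQ_VALUES = [
    "1996-01-01", "1996-02-02", "1996-03-03",
    "1995-04-04", "1995-05-05",
    "1990-06-06", "1988-07-07",
]

# Assumed format rule: 17 digits followed by a digit or X (no checksum test).
ID_PATTERN = re.compile(r"\d{17}[\dX]")
# Assumed definition of "garbled": the value contains # or $.
GARBLED_PATTERN = re.compile(r"[#$]")


def format_shares(values):
    """Return the percentage of values falling into each detected format."""
    counts = Counter(
        "id_number" if ID_PATTERN.fullmatch(v) else "unknown" for v in values
    )
    return {fmt: round(100 * n / len(values), 1) for fmt, n in counts.items()}


def garbled_rate(values):
    """Return the percentage of values containing a garbled character."""
    garbled = sum(1 for v in values if GARBLED_PATTERN.search(v))
    return round(100 * garbled / len(values), 1)


def year_distribution(dates):
    """Rank years by the number of YYYY-MM-DD strings falling in each."""
    return Counter(d[:4] for d in dates).most_common()


if __name__ == "__main__":
    print(format_shares(HM_VALUES))
    print(garbled_rate(HM_VALUES))
    print(year_distribution(CSRQ_VALUES)[:2])
```

With the sample above, 5 of 7 values match the assumed format and 2 of 7 contain a special symbol, which reproduces the 71.4% and 28.6% figures of the worked example.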
4) Data pre-cleaning: the personnel information data exploration results are obtained from data exploration in order to understand the quality problems of the personnel information data, and the pre-cleaning operations of conditional filtering, field splicing, field splitting and character string replacement are performed on the personnel information data; the specific steps are as follows:
step1 conditional filtering: default null, non-null and range filtering rules are offered for selection, and the chosen rule is passed as a parameter in the form of a query condition; for example, if the XM field value of 1 of the 7 records in the personnel information table is null and the filtering condition is set to XM non-null, the record lacking a value in the XM field is filtered out, and only the 6 records with a value in the XM field remain in the personnel information table after cleaning;
step2 field splicing: the selected fields are joined with a separator and written into a newly added field, which is appended after the original fields; the supported separators are a space, a comma and no separator; for example, to splice the CSRQ and SJ fields of the personnel information table into a new field CSSJ, the two fields CSRQ and SJ are selected, a space is chosen as the separator, and the splice yields the new field CSSJ (birth time);
step3: after the pre-cleaning operations have been performed on the personnel information table, the data exploration of step 3) is performed again on the cleaned data, so that the personnel information data is explored both before and after pre-cleaning; this two-way check between exploration and cleaning ensures the quality of the personnel information data;
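The conditional filtering and field splicing of the pre-cleaning stage can be sketched as follows. This is a minimal illustration in plain Python over a list of dictionaries; the function names and sample rows are hypothetical, and the sketch does not use the CONCAT and REPLACE functions that the claims mention.

```python
def filter_non_null(rows, field):
    """Keep only rows whose value for `field` is neither None nor empty."""
    return [r for r in rows if r.get(field) not in (None, "")]


def concat_fields(rows, fields, new_field, sep=" "):
    """Return copies of the rows with `new_field` appended after the
    existing fields, built by joining the values of `fields` with `sep`."""
    out = []
    for r in rows:
        row = dict(r)  # copy so that the input rows are not mutated
        row[new_field] = sep.join(str(r[f]) for f in fields)
        out.append(row)
    return out


# Made-up sample: 7 records, one of which has no XM value.
ROWS = [
    {"XM": name, "CSRQ": "1996-01-01", "SJ": "08:30:00"}
    for name in ["A", "B", "C", None, "D", "E", "F"]
]

if __name__ == "__main__":
    cleaned = concat_fields(filter_non_null(ROWS, "XM"), ["CSRQ", "SJ"], "CSSJ")
    print(len(cleaned))
    print(cleaned[0])
```

Because Python dictionaries preserve insertion order, the new CSSJ key lands after the original fields, mirroring the rule that the new field is added after the original fields. Re-running the exploration functions on the cleaned rows would correspond to the repeated exploration of step3.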
5) Data definition: with the registered original data as the dimension, the configuration required by the personnel information access processing program is generated; the definition comprises a data reading definition, a data format definition, a data processing definition and a data governance definition; the specific steps are as follows:
step1 data reading definition and data format definition: the data reading definition specifies how original data is read from the source platform according to the personnel information data exploration results, and specifies the file character set of the personnel information data according to business requirements; the data format definition refers to the data standard and, in combination with the personnel information data exploration results, completes the mapping between the original personnel information fields and the standard personnel information fields; data definition is carried out using data standardization, and the defined resources support self-learning of standard mappings and recommendation; for example, when a later data definition encounters the HM, XB and DZ fields that have already been given standardized definitions in the personnel information table, these fields are automatically marked, based on that experience, as the standard data fields ZJHM, SEX and ADDRESS;
step2 data processing definition: this comprises the data cleaning strategy definition and the data extraction strategy definition; the data cleaning strategy definition defines, according to the data format definition result, the strategies for conditional filtering, field splicing and field splitting, specifically as follows:
conditional filtering: default null, non-null and range filtering rules are offered for selection, and the chosen rule is passed as a parameter in the form of a query condition; for example, if the filtering condition defined for the personnel information table accessed in real time is XM non-null, then the subsequent step 6) performs conditional filtering on the personnel information table accordingly;
field splicing: the selected fields are joined with a separator and written into a newly added field, which is appended after the original fields; the supported separators are a space, a comma and no separator; for example, to splice the CSRQ and SJ fields of the personnel information table accessed in real time into a new field CSSJ, the two fields CSRQ and SJ are selected and a space is chosen as the separator;
data extraction strategy definition: the extraction mapping relation from the source data resource to the target data resource is defined, and the relation between part of the fields of the personnel information data and the personnel information data is extracted; for example, the relation name-age-gender-certificate number to be extracted from the personnel information table is defined as required;
step3 data governance definition: this comprises resource catalog registration; the data is registered in the data resource catalog according to the data format definition result, so that the status of the personnel information data is displayed visually and the corresponding personnel information data can be found quickly as required;
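One possible reading of the data definition stage is sketched below: a small registry remembers earlier original-to-standard field mappings and recommends them for later definitions, and a configuration is assembled for the access processing program. The class, the function, the configuration keys and the UTF-8 character set are all assumptions introduced for this illustration; the patent does not disclose the actual configuration format or how the self-learning recommendation is implemented.

```python
class StandardFieldRegistry:
    """Remembers original-to-standard field mappings and suggests them later."""

    def __init__(self):
        self._known = {}

    def define(self, original, standard):
        """Record that `original` has been standardized as `standard`."""
        self._known[original] = standard

    def recommend(self, original_fields):
        """Map each field to its known standard name, or None if unseen."""
        return {f: self._known.get(f) for f in original_fields}


def build_definition(registry, original_fields):
    """Assemble a hypothetical access-processing configuration."""
    return {
        # Data reading definition (the character set is an assumed value).
        "reading": {"charset": "UTF-8"},
        # Data format definition: original field -> standard field.
        "format": registry.recommend(original_fields),
        # Data cleaning strategy definition.
        "cleaning": [
            {"op": "filter", "field": "XM", "rule": "non_null"},
            {"op": "concat", "fields": ["CSRQ", "SJ"],
             "new_field": "CSSJ", "sep": " "},
        ],
        # Data extraction strategy definition: name-age-gender-certificate number.
        "extraction": ["XM", "AGE", "SEX", "ZJHM"],
    }


if __name__ == "__main__":
    registry = StandardFieldRegistry()
    for original, standard in [("HM", "ZJHM"), ("XB", "SEX"), ("DZ", "ADDRESS")]:
        registry.define(original, standard)
    print(build_definition(registry, ["HM", "XB", "DZ", "XM"]))
```

Fields that have no earlier definition are returned as None, which leaves room for the manual mapping that the description implies for fields not yet standardized.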
6) Data access processing: according to the configuration generated in the data definition, the data is processed while the personnel information data is being accessed, the data is reconciled with the data provider, and the processed personnel information data is finally written to a file for storage; the specific steps are as follows:
step1: according to the data format definition of step 5), the field HM of the personnel information table is standardized to ZJHM, XB to SEX and DZ to ADDRESS; a new personnel information table is generated after processing, comprising 6 records and 8 fields, namely AGE, ZJHM, XM, CSRQ, SJ, SEX, ADDRESS and CSSJ, where AGE is the age, ZJHM the certificate number, XM the name, CSRQ the birth date, SJ the time, SEX the gender, ADDRESS the address and CSSJ the birth time (the new field obtained by splicing);
step2: the data is processed according to the data processing definition of step 5), including data cleaning and data extraction; data cleaning performs the conditional filtering, field splicing and field splitting operations on the data according to the data cleaning rules in the data definition;
according to the conditional filtering defined for data cleaning, records lacking a value in the XM field of the personnel information table accessed in real time are filtered out, so that only the 6 records with a value in the XM field remain in the personnel information table after cleaning;
according to the field splicing defined for data cleaning, the CSRQ and SJ fields of the personnel information table accessed in real time are spliced, with a space as the separator, to obtain the new field CSSJ (birth time);
data extraction extracts data in the target format from data in the source format according to the data extraction strategy definition, generating a personnel information table with the relation 'name-age-gender-certificate number';
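The field standardization of step1 and the extraction of step2 can be illustrated with the following Python sketch. The function names and the sample record are hypothetical; only the field names (HM, XB, DZ, ZJHM, SEX, ADDRESS, XM, AGE) come from the description above.

```python
# Mapping taken from the description: HM -> ZJHM, XB -> SEX, DZ -> ADDRESS.
FIELD_MAPPING = {"HM": "ZJHM", "XB": "SEX", "DZ": "ADDRESS"}

# Target relation from the description: name-age-gender-certificate number.
TARGET_FIELDS = ["XM", "AGE", "SEX", "ZJHM"]


def standardize(rows, mapping):
    """Rename fields according to `mapping`; unmapped fields keep their name."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]


def extract(rows, target_fields):
    """Project each row onto `target_fields`, in that order."""
    return [{f: r[f] for f in target_fields} for r in rows]


# Made-up sample record using the original field names.
SAMPLE = [
    {"AGE": 25, "HM": "110101199601010011", "XM": "A",
     "XB": "F", "DZ": "Nanjing"},
]

if __name__ == "__main__":
    standardized = standardize(SAMPLE, FIELD_MAPPING)
    print(standardized)
    print(extract(standardized, TARGET_FIELDS))
```

Standardization runs before extraction in this sketch because the extraction relation is expressed in standard field names such as SEX and ZJHM.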
7) Data asset: the accessed personnel information data is managed in a standardized way, so that the status of the personnel information data assets is understood and the accessed personnel information data can be displayed visually and understood comprehensively.
FIG. 2 shows the data standard flow chart: the personnel information data standards are synchronized to a data standard interface; after receiving a standard request from the data definition, the data standard interface issues the standard to the data definition.
FIG. 3 is the data pre-cleaning flow chart: the personnel information data is pre-cleaned according to the data problems found during personnel information data exploration; task result sets can be formed freely during pre-cleaning and are stored in a hive temporary library; a task result set for which all pre-cleaning work is confirmed complete can be used as new original data, with its record information automatically synchronized from the original data it inherits from, for use in subsequent data exploration and data definition.

Claims (6)

1. A method for managing personnel information data based on big data is characterized by comprising the following steps:
1) data standard: standardizing personnel information data, and carrying out unified and standard management on the personnel information data to eliminate personnel information data barriers among departments;
2) data recording: uploading the basic information and the original data of the personnel information data for registration and recording, so that the personnel information data is registered for the record;
3) data exploration: acquiring the original data registered and recorded in the step 2), and carrying out personnel information data exploration on the original data to generate a personnel information data exploration report;
4) pre-cleaning data: acquiring a personnel information data exploration report from the step 3), mastering the quality problem of the personnel information data, performing pre-cleaning operation on the personnel information data, and storing the result into a hive temporary library;
5) data definition: defining reading, processing and governing of the personnel information data by taking the registered personnel information data as dimensionality, generating the configuration required by the step 6), and forming a personnel information data definition result for a big data governing platform to call;
6) data access processing: according to business requirements, based on steps 3) -5), data reading is carried out, multi-source heterogeneous personnel information data are accessed to a big data processing center, personnel information data processing is carried out in the accessing process, data check is carried out on the personnel information data and a personnel information data provider, and finally the processed personnel information data are written into a file for storage;
7) data asset: performing asset management on the personnel information data accessed in the step 6) to master the condition of the personnel information data assets;
the personnel information data exploration in the step 3) comprises two repeated multi-dimensional exploration analyses, wherein one exploration is carried out on original data, and the other exploration is carried out on the personnel information data subjected to data pre-cleaning in the step 4); the multi-dimensional exploration analysis comprises exploration on personnel information data volume, exploration on fields and quality of personnel information data and exploration on problem data;
the data pre-cleaning in the step 4) comprises the operations of condition filtering, field splicing, field splitting and character string replacement; the condition filtering provides three rule choices, namely null, non-null and range, and the front end transmits the selected rule to the back end as a parameter in the form of a query condition; the field splicing, field splitting and character string replacement are performed on existing fields through a CONCAT function and a REPLACE function;
the data definition in the step 5) is defined based on a data standard, and the definition comprises a data reading definition, a data format definition, a data processing definition and a data governance definition; the data reading definition is to define the reading of original data from a source platform according to the personnel information data exploration result and define a file character set of the personnel information data according to the service requirement; the data format definition refers to a data standard and completes the mapping of the personnel information original field and the personnel information standard field; the data processing definition comprises a step1 data cleaning strategy definition, a step2 data extraction strategy definition; the data management definition comprises resource catalog registration, wherein the resource catalog registration is to synchronously register the registered personnel information data in the data record to a data resource catalog so as to comprehensively master the condition of the personnel information data;
in the step 6), data reading is to extract data from the hive temporary library in which the pre-cleaned personnel information data is stored and check whether the data definitions are consistent, the data being read if the data definitions are consistent and the reading being stopped if they are inconsistent; the data checking is a link performed synchronously with the data reading stage, in which the integrity and the correctness of the personnel information data are checked at a given reconciliation time node;
the personnel information data processing in the step 6) comprises step1 cleaning and step2 extraction, and the steps are as follows:
step1 cleaning: performing condition filtering, field splicing, field splitting and character string replacing operations on real-time accessed personnel information data according to the data cleaning strategy definition;
step2 extraction: extracting target format data from the source format data according to the data extraction strategy definition, and extracting the relation between partial fields of the personnel information data accessed in real time and the personnel information data;
the data pre-cleaning in the step 4), the step1 data cleaning strategy definition in the step 5) and the step1 cleaning in the step 6) are different in the targeted data objects, the step 4) is targeted at sample data, and the step 5) and the step 6) are targeted at real data; the difference between step 5) and step 6) is again that: the purposes of the two are different, step 5) is defined by cleaning rules, and step 6) is to process real data according to step 5).
2. The big data-based personnel information data governance method according to claim 1, wherein the exploration of personnel information data volume is an exploration of the volume of all the personnel information data; the exploration of fields and quality of the personnel information data comprises: a) field null value rate exploration, b) named entity exploration, c) type and format exploration.
3. The method for personnel information data governance based on big data according to claim 2, wherein the field null value rate exploration is carried out by counting the field null value proportion condition through a formula (1):
Figure DEST_PATH_IMAGE002
(1)
in the formula: rate represents the null Rate, f (k) represents the number of null values in the field, k represents the lower bound, n represents the upper bound, m represents the total number of lines, and z represents the number of lines of special characters.
4. The method for personnel information data governance based on big data according to claim 2, wherein the named entity exploration is specifically: automatically exploring and analyzing the identifiers of the field content, and then manually intervening with reference to the values of the original data, so as to identify the named entities of person name, place name, certificate number and mobile phone number in the field content; the type and format exploration is specifically: exploring whether the field types and formats of the personnel information data meet the specification; the problem data exploration is to explore the legality of the fields and the data in the fields that does not conform to the specification, and the garbled-character rate of each column is counted through formula (2):
Figure DEST_PATH_IMAGE004
(2)
in the formula: Rate2 represents the garbled-character rate, g(b) represents the number of garbled values, b represents the lower bound, e represents the upper bound, and h represents the total number of rows.
5. The big data-based personnel information data governance method according to claim 1, wherein step1 data cleaning policy definition in step 5) is specifically: according to the business requirements, defining a condition filtering strategy of personnel information data and a strategy of field splicing, splitting and replacing character strings, and completing the rule definition aiming at a real data cleaning strategy;
step2 data extraction strategy definition in the step 5) is specifically: according to the business requirements, defining the extraction mapping relation from the source data to the target data, and extracting the relation between part of the fields of the personnel information data and the personnel information data.
6. The method for managing personnel information data based on big data according to claim 1, wherein the asset management of the personnel information data accessed in step 6) in step 7) specifically comprises: through the resource catalog, the resource overview is visually displayed; the resource catalog is used for managing data, including resource classification and cataloguing, and combing the data stored by the big data management platform and the personnel information data provided for the big data management platform by an interface mode.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110895458.8A CN113535707B (en) 2021-08-05 2021-08-05 Method for managing personnel information data based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110895458.8A CN113535707B (en) 2021-08-05 2021-08-05 Method for managing personnel information data based on big data

Publications (2)

Publication Number Publication Date
CN113535707A CN113535707A (en) 2021-10-22
CN113535707B true CN113535707B (en) 2022-04-15

Family

ID=78090531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110895458.8A Active CN113535707B (en) 2021-08-05 2021-08-05 Method for managing personnel information data based on big data

Country Status (1)

Country Link
CN (1) CN113535707B (en)

Also Published As

Publication number Publication date
CN113535707A (en) 2021-10-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant