CN113535707B

CN113535707B - Method for managing personnel information data based on big data

Info

Publication number: CN113535707B
Application number: CN202110895458.8A
Authority: CN
Inventors: 阎星娥; 杨昆; 刘慰慰; 严荣明; 张�林; 袁勇斌; 薛世峰; 石旦
Original assignee: Nanjing Huafei Data Technology Co ltd
Current assignee: Nanjing Huafei Data Technology Co ltd
Priority date: 2021-08-05
Filing date: 2021-08-05
Publication date: 2022-04-15
Anticipated expiration: 2041-08-05
Also published as: CN113535707A

Abstract

The invention provides a method for managing personnel information data based on big data, comprising the following steps: 1) data standard: standardizing the personnel information data; 2) data filing: uploading the basic information and original data of the personnel information data for registration and filing; 3) data exploration: exploring the personnel information data to generate a personnel information data exploration report; 4) data pre-cleaning: acquiring the personnel information data exploration report and performing a pre-cleaning operation on the personnel information data; 5) data definition: defining the reading, processing and governance of the personnel information data, with the registered personnel information data as the dimension; 6) data access processing: reading the data, accessing the multi-source heterogeneous personnel information data into a big data processing center, and processing the personnel information data during access; 7) data assets: performing asset management on the personnel information data. The invention can provide complete, timely and high-quality personnel information data.

Description

Method for managing personnel information data based on big data

Technical Field

The invention relates to a method for managing personnel information data based on big data, and belongs to the field of personnel information data management.

Background

Human production and daily life now generate enormous volumes of widely varied data every day, and the rate of generation keeps increasing. The problem of accessing, processing and managing massive heterogeneous personnel information data is therefore receiving more and more attention. In enterprises, personnel information data is often scattered and uneven in quality, for the following reasons:

First, the volume of personnel information data is huge, its sources are varied and its structure is disordered; the data lacks a unified standard specification, and disordered data wastes storage resources.

Second, incompleteness and inaccuracy of personnel information data are increasingly evident, and low data quality has become the core problem of personnel information data.

Disclosure of Invention

The invention provides a method for managing personnel information data based on big data, and aims to solve the problem of low quality of the personnel information data.

The technical solution of the invention is as follows: a method for managing personnel information data based on big data comprises the following steps:

1) data standard: standardizing the personnel information data and managing it in a unified, standard manner, so as to eliminate personnel information data barriers between departments;

2) data filing: uploading the basic information and the original data of the personnel information data for registration and filing, and keeping a registration record of the personnel information data;

3) data exploration: acquiring the original data registered and filed in step 2), and exploring it to generate a personnel information data exploration report;

4) data pre-cleaning: acquiring the personnel information data exploration report from step 3), identifying the quality problems of the personnel information data, performing a pre-cleaning operation on the personnel information data, and storing the result in a Hive temporary database;

5) data definition: defining the reading, processing and governance of the personnel information data, with the registered personnel information data as the dimension, generating the configuration required by step 6), and forming a personnel information data definition result for the big data governance platform to call;

6) data access processing: according to business requirements and based on steps 3) to 5), reading the data, accessing the multi-source heterogeneous personnel information data into a big data processing center, processing the personnel information data during access, checking the data against the personnel information data provider, and finally writing the processed personnel information data to a file for storage;

7) data assets: performing asset management on the personnel information data accessed in step 6), so as to keep track of the state of the personnel information data assets.

Further, the personnel information data exploration in step 3) comprises two rounds of multidimensional exploration analysis: one on the original data, and the other on the personnel information data pre-cleaned in step 4).

Further, the multidimensional exploration analysis in step 3) comprises exploration of the personnel information data volume, exploration of the fields and quality of the personnel information data, and exploration of problem data.

Further, the exploration of the personnel information data volume determines the total volume of personnel information data.

Further, the exploration of the fields and quality of the personnel information data comprises: a) field null rate exploration, b) named entity exploration, c) type and format exploration.

Further, the field null rate exploration specifically counts the proportion of null values in a field through formula (1):

in the formula: Rate represents the null rate, f(k) represents the number of null values in the field, k represents the lower bound, n represents the upper bound, m represents the total number of rows, and z represents the number of rows containing special characters.

Further, the named entity exploration specifically includes: automatically exploring and analyzing the identifiers of the field content, and then, with manual intervention based on the values of the original data, identifying named entities such as person names, place names, certificate numbers and mobile phone numbers in the field content.

Further, the type and format exploration specifically includes: checking whether the field types and formats of the personnel information data meet the specification.

Further, the problem data exploration checks the validity of each field and finds the data in the field that does not meet the specification, and the garbled-character rate of each column is counted through formula (2):

in the formula: Rate2 represents the garbled-character rate, g(b) represents the number of garbled values, b represents the lower bound, e represents the upper bound, and h represents the total number of rows.

Further, the data pre-cleaning in step 4) comprises condition filtering, field splicing, field splitting and character string replacement operations. Condition filtering offers three rule choices (null, non-null and range), and the front end passes the chosen rule to the back end as a parameter in the form of a query condition. Field splicing, splitting and string replacement build new fields by splicing and splitting existing fields and replacing character strings, using the CONCAT function and the REPLACE function.
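
As an illustration of how a back end might turn the pre-cleaning choices into a query, the following minimal Python sketch builds a HiveQL-style statement from a filter rule and a splice definition. It assumes the temporary database is queried with SQL. The function names and the table name `person_info_tmp` are hypothetical, and identifiers are assumed to be trusted (only literal values are escaped).

```python
from typing import List, Optional, Sequence, Union

Literal = Union[int, float, str]


def sql_literal(value: Literal) -> str:
    """Render a Python value as a SQL literal (minimal escaping, sketch only)."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + value.replace("'", "\\'") + "'"


def condition_clause(field: str, rule: str,
                     low: Optional[Literal] = None,
                     high: Optional[Literal] = None) -> str:
    """Translate one of the three filter rules into a WHERE fragment."""
    if rule == "null":
        return f"{field} IS NULL"
    if rule == "non_null":
        return f"{field} IS NOT NULL"
    if rule == "range":
        if low is None or high is None:
            raise ValueError("the range rule needs both bounds")
        return f"{field} BETWEEN {sql_literal(low)} AND {sql_literal(high)}"
    raise ValueError(f"unknown rule: {rule}")


def concat_expr(fields: Sequence[str], separator: str, alias: str) -> str:
    """Splice fields into a new column with CONCAT; an empty separator is omitted."""
    parts: List[str] = []
    for index, name in enumerate(fields):
        if index > 0 and separator:
            parts.append(sql_literal(separator))
        parts.append(name)
    return f"CONCAT({', '.join(parts)}) AS {alias}"


def replace_expr(field: str, old: str, new: str, alias: str) -> str:
    """Replace a substring in a column with REPLACE."""
    return f"REPLACE({field}, {sql_literal(old)}, {sql_literal(new)}) AS {alias}"


def build_preclean_query(table: str, where: str, extra_columns: Sequence[str]) -> str:
    """Keep the original columns and append the derived ones after them."""
    columns = ", ".join(["*", *extra_columns])
    return f"SELECT {columns} FROM {table} WHERE {where}"


query = build_preclean_query(
    "person_info_tmp",  # hypothetical table name
    condition_clause("XM", "non_null"),
    [concat_expr(["CSRQ", "SJ"], " ", "CSSJ")],
)
print(query)
# SELECT *, CONCAT(CSRQ, ' ', SJ) AS CSSJ FROM person_info_tmp WHERE XM IS NOT NULL
```

Field splitting is left out of the sketch because the text does not name the function used for it.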

Further, the data definition in step 5) is based on the data standard and comprises a data reading definition, a data format definition, a data processing definition and a data governance definition. The data reading definition defines how the original data is read from the source platform according to the personnel information data exploration result, and defines the file character set of the personnel information data according to business requirements. The data format definition refers to the data standard and completes the mapping between the original personnel information fields and the standard personnel information fields. The data processing definition comprises a step1 data cleaning strategy definition and a step2 data extraction strategy definition. The data governance definition comprises resource catalog registration, which synchronously registers the personnel information data registered during data filing into the data resource catalog, so that the state of the personnel information data can be fully grasped.

Further, the step1 data cleaning strategy definition in step 5) specifically includes: defining, according to business requirements, the strategies for condition filtering, field splicing, field splitting and character string replacement of the personnel information data. These differ from the condition filtering, field splicing, field splitting and character string replacement operations in step 4) in the data objects they target: step 4) performs a pre-cleaning operation on sample data, whereas the rules for cleaning the real data are defined here, providing a basis for the step1 cleaning in the subsequent step 6).

Further, the step2 data extraction strategy definition in step 5) specifically includes: defining, according to business requirements, the extraction mapping from source data to target data, and extracting the relations among some fields of the personnel information data.

Further, in step 6), data reading extracts data from the Hive temporary database holding the pre-cleaned personnel information data and checks whether the data definitions are consistent; reading proceeds if they are consistent and stops if they are not. Data checking is performed synchronously during the data reading stage and verifies the integrity and correctness of the personnel information data at a given reconciliation time node.

Further, the personnel information data processing in step 6) comprises step1 cleaning and step2 extraction, as follows:

accessing the personnel information data in real time according to the data cleaning strategy definition;

step1 cleaning: according to the data cleaning strategy definition, executing again the condition filtering, field splicing, splitting and character string replacement operations of step 4) on the personnel information data accessed in real time. This differs from step 4) and step 5) as follows: step 4) targets sample data, whereas step 5) and step 6) target real data; and step 5) and step 6) differ in purpose, since step 5) defines the cleaning rules while step 6) processes the real data according to the rules defined in step 5);

step2 extraction: extracting target-format data from source-format data according to the data extraction strategy definition, and extracting the relations among some fields of the personnel information data accessed in real time.

Further, the asset management in step 7) of the personnel information data accessed in step 6) is specifically: displaying a resource overview visually through the resource catalog. The resource catalog is used for managing data, including resource classification and cataloguing, and sorts out both the data stored by the big data governance platform and the personnel information data provided to the big data governance platform through interfaces.

The invention has the beneficial effects that:

according to the method for managing the personnel information data based on the big data, the scattered and diversified personnel information data are standardized, and the personnel information data are subjected to data filing, data exploration, data precleaning, data definition and data access processing to realize data management and control, so that complete and timely high-quality personnel information data can be provided.

Drawings

FIG. 1 is a general design architecture diagram of the present invention.

FIG. 2 is a schematic diagram of a data standard flow of the present invention.

FIG. 3 is a schematic diagram of the data precleaning process of the present invention.

Detailed Description

The present invention will be described in detail with reference to the following embodiments.

A method for managing personnel information data based on big data comprises the following steps:

1) data standard: standardizing the personnel information data and managing it in a unified, standard manner, so as to eliminate personnel information data barriers between departments and make the personnel information data easy to share;

2) data filing: uploading the basic information and the original data of the personnel information data for unified registration and filing, and keeping a registration record of the personnel information data;

3) data exploration: acquiring the original data registered and filed in step 2), and exploring it to generate a personnel information data exploration report; data exploration is a key step for personnel information data quality: it controls the data at its source, improves the quality of the personnel information data, ensures its correctness, and provides a basis for the subsequent data pre-cleaning; preferably, multidimensional exploration analysis is used to explore the personnel information data;

4) data pre-cleaning: acquiring the personnel information data exploration report from step 3), identifying the quality problems of the personnel information data, performing a pre-cleaning operation on the personnel information data, and storing the result in a Hive temporary database, so that the quality problems of the personnel information data are resolved, its consistency is ensured, and its use value and quality are improved;

5) data definition: defining the reading, processing and governance of the personnel information data, with the registered personnel information data as the dimension, generating the configuration required by step 6), and forming a personnel information data definition result for the big data governance platform to call;

6) data access processing: according to business requirements and based on steps 3) to 5), accessing the multi-source heterogeneous personnel information data into a big data processing center, processing the personnel information data during access, checking the data against the personnel information data provider, and finally writing the processed personnel information data to a file for storage, which improves the efficiency of personnel information data access processing;

7) data assets: performing asset management on the personnel information data accessed in step 6), so as to keep track of the state of the personnel information data assets and to display the accessed personnel information data assets visually and comprehensively.

The personnel information data exploration in step 3) comprises two rounds of multidimensional exploration analysis: one on the original data, and the other on the personnel information data pre-cleaned in step 4). This achieves two-way inspection of exploration and data pre-cleaning and ensures that personnel information data problems are discovered in time.

The multidimensional exploration analysis in step 3) comprises exploration of the personnel information data volume, exploration of the fields and quality of the personnel information data, and exploration of problem data.

The exploration of the personnel information data volume determines the total volume of personnel information data.

The exploration of the fields and quality of the personnel information data comprises: a) field null rate exploration, b) named entity exploration, c) type and format exploration.

The field null rate exploration counts the proportion of null values in a field through formula (1):

in the formula: Rate represents the null rate, f(k) represents the number of null values in the field, k represents the lower bound, n represents the upper bound, m represents the total number of rows, and z represents the number of rows containing special characters. Exploring the null-value situation of the personnel information data through the field null rate reveals which of the data is useful and indicates its value; the field null rate is the proportion of null values in the field.

The named entity exploration is specifically: automatically exploring and analyzing the identifiers of the field content, and then, with manual intervention based on the values of the original data, identifying named entities such as person names, place names, certificate numbers and mobile phone numbers in the field content; this provides a basis for the subsequent extraction of personnel information data and for extracting valuable personnel information data relations.

The type and format exploration is specifically: checking whether the field types and formats of the personnel information data meet the specification, which provides a data quality basis for the subsequent extraction of personnel information data and ensures its quality. The problem data exploration checks the validity of each field and finds the data in the field that does not meet the specification, and the garbled-character rate of each column is counted through formula (2):

The personnel information data pre-cleaning in step 4) comprises condition filtering, field splicing, field splitting and character string replacement operations. Condition filtering offers three rule choices (null, non-null and range), and the front end passes the chosen rule to the back end as a parameter in the form of a query condition. Field splicing, splitting and string replacement build new fields by splicing and splitting existing fields and replacing character strings, using the CONCAT function and the REPLACE function.

The data definition in step 5) is based on the data standard and comprises a data reading definition, a data format definition, a data processing definition and a data governance definition.

The data reading definition defines how the original data is read from the source platform according to the personnel information data exploration result, and defines the file character set of the personnel information data according to business requirements.

The data format definition refers to the data standard and completes the mapping between the original personnel information fields and the standard personnel information fields.

The data processing definition is specifically as follows:

step1 data cleaning strategy definition: defining, according to business requirements, the strategies for condition filtering, field splicing, splitting and character string replacement of the personnel information data. This differs from step 4) in the data objects targeted: step 4) performs a pre-cleaning operation on sample data, whereas the rules for cleaning the real data are defined here, providing a basis for the step1 cleaning in the subsequent step 6);

step2 data extraction strategy definition: defining, according to business requirements, the extraction mapping from source data to target data, and extracting the relations among some fields of the personnel information data.

The data governance definition comprises resource catalog registration, which synchronously registers the personnel information data registered during data filing into the data resource catalog, so that the state of the personnel information data can be fully grasped.

In step 6), data reading extracts data from the Hive temporary database holding the pre-cleaned personnel information data and checks whether the data definitions are consistent; reading proceeds if they are consistent and stops if they are not. Data checking is performed synchronously during the data reading stage: at a given reconciliation time node, it verifies the integrity and correctness of the personnel information data between the data provider and the data reading process.
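
The reading gate and the reconciliation check can be sketched as follows. The text does not say how consistency, integrity or correctness are measured, so this sketch assumes that consistency means the defined field list matches the table columns, integrity means matching row counts, and correctness means matching content digests. `ProviderManifest` and all function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class ProviderManifest:
    """What the data provider reports at a reconciliation time node (hypothetical)."""
    row_count: int
    digest: str


def digest_rows(rows: Iterable[Sequence[str]]) -> str:
    """Order-sensitive SHA-256 digest over rows of string values."""
    hasher = hashlib.sha256()
    for row in rows:
        hasher.update("\x1f".join(row).encode("utf-8"))
        hasher.update(b"\n")
    return hasher.hexdigest()


def read_rows(defined_fields: Sequence[str], table_columns: Sequence[str],
              rows: Iterable[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Read only if the data definition matches the table; otherwise stop."""
    if list(defined_fields) != list(table_columns):
        raise ValueError("data definition does not match the table; reading stopped")
    return [tuple(row) for row in rows]


def reconcile(rows: Sequence[Sequence[str]], manifest: ProviderManifest) -> bool:
    """Integrity: same row count. Correctness: same content digest."""
    return len(rows) == manifest.row_count and digest_rows(rows) == manifest.digest


# Hypothetical sample data.
source_rows = [("Zhang", "25"), ("Li", "31")]
manifest = ProviderManifest(row_count=2, digest=digest_rows(source_rows))
received = read_rows(["XM", "AGE"], ["XM", "AGE"], source_rows)
print(reconcile(received, manifest))  # True
```

A digest over ordered rows is only one possible correctness check; a provider that cannot guarantee ordering would need an order-independent comparison instead.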

The personnel information data processing in step 6) comprises cleaning and extraction, specifically as follows:

step1 cleaning: according to the data cleaning strategy definition, executing the operations of step 4) again on the personnel information data accessed in real time. This differs from step 4) and step 5) as follows: step 4) targets sample data, whereas step 5) and step 6) target real data; and step 5) and step 6) differ in purpose, since step 5) defines the cleaning rules while step 6) processes the real data according to the rules defined in step 5);

The asset management in step 7) of the personnel information data accessed in step 6) is specifically: displaying a resource overview visually through the resource catalog. The resource catalog is used for managing data, including resource classification and cataloguing, and sorts out both the data stored by the big data governance platform and the personnel information data provided to the big data governance platform through interfaces.

Through data standard, data filing, data exploration, data pre-cleaning, data definition, data access processing and data assets, the method uses a big data governance platform to manage and control personnel information data. Personnel information data is defined on the basis of data standardization with experience-based self-learning, so that defined resources offer standard-comparison self-learning and recommendation functions, which improves the efficiency of personnel information data access. A rich set of real-time processing rules improves the efficiency of personnel information data governance. Standardizing the personnel information data improves the legality and compliance of the data and ensures its correctness and quality. The two-way inspection of diversified exploration and pre-cleaning before access allows data quality problems to be found and solved in time. Reliable personnel information data raises the use value and quality of the data, provides strong support for personnel information data management, and brings economic benefit to enterprises.

Examples

Embodiments of the present invention are described in further detail below with reference to FIGS. 1 to 3:

as shown in FIG. 1, a complete process of personnel information data filing, exploration, pre-cleaning, definition and access processing is implemented, with the following specific steps:

2) data filing: uploading the basic information and the original data of the personnel information data for unified registration and filing, and keeping a registration record of the personnel information data;

3) data exploration: acquiring the registered original data from step 2), performing multidimensional exploration analysis on it, and generating a personnel information data exploration report; data exploration is a key step for personnel information data quality: it controls the data at its source, improves the quality of the personnel information data, and provides a basis for the subsequent cleaning;

the following personnel information table is used as an example:

suppose the personnel information table holds 7 records in total and contains 7 fields, AGE, HM, XM, CSRQ, SJ, XB and DZ, where AGE is the age, HM is the certificate number, XM is the name, CSRQ is the date of birth, SJ is the time, XB is the gender and DZ is the address:

step1: the data volume exploration determines the total volume of personnel information data and provides a basis for the subsequent null rate calculation; the total number of records found in the personnel information table is m = 7;

step2: if the DZ field is null in all 7 records of the personnel information table, then according to the field null rate formula (1), the null rate of the DZ field is calculated to be 100%;

step3: type and format exploration examines the types and formats of the fields in the personnel information table and shows the share of each format within a field; if the HM field of 5 records in the personnel information table holds a correct identity card number and the HM field of the other 2 records contains special symbols such as # and $, the type and format exploration identifies the 5 records as being in identity card number format and the other 2, which do not conform to that format, as unknown data; that is, for the HM field of the personnel information table, the identity card number format accounts for 71.4% and the unknown data format for 28.6%;

step4: named entity exploration automatically explores and analyzes the identifiers of the field content of the personnel information table and then, with manual intervention based on the values of the original data, identifies named entities such as certificate numbers, mobile phone numbers and addresses in the field content, providing a basis for the data extraction strategy definition used during data access processing; for example, the HM field of the personnel information table is explored as an identity card number, the XM field as a name and the DZ field as an address, and any entity that cannot be matched by the existing rules defaults to unknown;

step5: problem data exploration finds the data in the fields of the personnel information table that does not meet the specification and counts the garbled-character rate of each column, where field values containing special characters such as # and $ can be identified as garbled; as in step3, the HM field of 5 records in the personnel information table holds a correct identity card number and the HM field of the other 2 records contains the special symbols # and $; according to formula (2), the garbled-character rate of the certificate number is calculated to be 2/7, about 28.6%; if, among the 7 records of the personnel information table, the CSRQ field value falls in 1996 for 3 records and in 1995 for 2 records, the time distribution of the CSRQ field is explored as follows: 1996 ranks first with a data volume of 3, and 1995 ranks second with a data volume of 2;
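
The figures in the exploration example above can be reproduced with a short Python sketch. All sample values are hypothetical. The identity card pattern is a simplified assumption (18 characters, no checksum validation), and the handling of the special-character row count z from formula (1) is not modelled because the formula body is not reproduced in this text.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

ID_NUMBER = re.compile(r"\d{17}[\dXx]")  # simplified assumption: 18-character form only
SPECIAL_CHARS = frozenset("#$")          # characters the example treats as garbled


def is_null(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def share(count: int, total: int) -> float:
    return count / total if total else 0.0


def null_rate(values: Sequence[Optional[str]]) -> float:
    """Proportion of null values in one column."""
    return share(sum(1 for v in values if is_null(v)), len(values))


def format_shares(values: Sequence[Optional[str]]) -> Dict[str, float]:
    """Share of each recognised format; anything unmatched is 'unknown'."""
    labels = Counter(
        "id_number" if v is not None and ID_NUMBER.fullmatch(v) else "unknown"
        for v in values
    )
    return {label: share(count, len(values)) for label, count in labels.items()}


def garbled_rate(values: Sequence[Optional[str]]) -> float:
    """Proportion of values containing at least one special character."""
    garbled = sum(
        1 for v in values
        if v is not None and any(ch in SPECIAL_CHARS for ch in v)
    )
    return share(garbled, len(values))


def year_ranking(dates: Sequence[Optional[str]]) -> List[Tuple[str, int]]:
    """Years ranked by number of records, most frequent first."""
    return Counter(d[:4] for d in dates if not is_null(d)).most_common()


# Hypothetical columns shaped like the 7-record example.
dz = [None] * 7
hm = [
    "000000199601010010", "000000199602020021", "000000199603030032",
    "000000199504040043", "000000199505050054",
    "0000#0199601010010", "00000019950101$018",
]
# The example states only 5 of the 7 birth dates; the rest are left empty here.
csrq = ["1996-01-01", "1996-05-12", "1996-11-30", "1995-02-14", "1995-07-07", None, None]

print(f"{null_rate(dz):.1%}")                                   # 100.0%
print({k: f"{v:.1%}" for k, v in format_shares(hm).items()})    # id_number 71.4%, unknown 28.6%
print(f"{garbled_rate(hm):.1%}")                                # 28.6%
print(year_ranking(csrq))                                       # [('1996', 3), ('1995', 2)]
```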

4) data pre-cleaning: acquiring the personnel information data exploration result from the data exploration, identifying the quality problems of the personnel information data, and performing the pre-cleaning operations of condition filtering, field splicing, splitting and character string replacement on the personnel information data, with the following specific steps:

step1 condition filtering: null, non-null and range filtering rules are offered by default, and the chosen rule is passed as a parameter in the form of a query condition; if the XM field is null in 1 of the 7 records of the personnel information table and the filtering condition is set to XM non-null, the record lacking a value in the XM field is filtered out, and only the 6 records with a value in the XM field remain in the personnel information table after cleaning;

step2 field splicing: the selected fields and the separator are spliced and written into a newly added field, which is appended after the original fields; the supported separation modes are space, comma and no separator; to splice the CSRQ and SJ fields of the personnel information table into a new field CSSJ, the two fields CSRQ and SJ are selected and the space separation mode is chosen, and splicing yields the new field CSSJ (birth time);

step3: after the pre-cleaning operation on the personnel information table, step 3) is executed again on the cleaned data, so that the pre-cleaned personnel information data is explored, two-way inspection of exploration and cleaning is achieved, and the quality of the personnel information data is ensured;
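
The two-way inspection in the pre-cleaning steps above amounts to running the same exploration metric before and after pre-cleaning and comparing the results. A minimal sketch, assuming the null rate of the filtered field is the metric of interest and using hypothetical sample records:

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def is_null(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def null_rates(rows: Sequence[Row], fields: Sequence[str]) -> Dict[str, float]:
    """Null rate per field over a list of records."""
    total = len(rows)
    return {
        name: (sum(1 for row in rows if is_null(row.get(name))) / total if total else 0.0)
        for name in fields
    }


def filter_non_null(rows: Sequence[Row], field_name: str) -> List[Row]:
    """Pre-cleaning rule 'non-null': keep only records with a value in the field."""
    return [row for row in rows if not is_null(row.get(field_name))]


def two_way_check(rows: Sequence[Row], field_name: str) -> Dict[str, float]:
    """Explore, pre-clean, then explore again and report both results."""
    before = null_rates(rows, [field_name])[field_name]
    cleaned = filter_non_null(rows, field_name)
    after = null_rates(cleaned, [field_name])[field_name]
    return {
        "rows_before": len(rows),
        "rows_after": len(cleaned),
        "null_rate_before": before,
        "null_rate_after": after,
    }


# Hypothetical table: 7 records, one of them without a name (XM).
table: List[Row] = [{"XM": f"P{i}"} for i in range(1, 7)] + [{"XM": None}]
report = two_way_check(table, "XM")
print(report["rows_before"], report["rows_after"])  # 7 6
```

A second-round null rate above zero for the filtered field would indicate that the pre-cleaning rule did not take effect.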

5) data definition: the method comprises the following steps of generating configuration required by a personnel information access processing program by taking registered original data as dimensions, wherein the definition comprises a data reading definition, a data format definition, a data processing definition and a data governance definition, and the method comprises the following specific steps:

step1 defines data reading that defines reading of original data from a source platform according to personnel information data exploration results, and defines a file character set of the personnel information data according to business requirements; the data format defines a reference data standard, the mapping of personnel information original fields and personnel information standard fields is completed by combining personnel information data exploration results, data definition is carried out by utilizing data standardization, and defined resources have standard contrast self-learning and recommendation functions; when the next data definition meets the HM, XB and DZ fields which are defined in a standardized way in the personnel information table, automatically marking the fields as the standard data fields of ZJHM, SEX and ADDRESS according to experience;

step2 data processing definition comprises data cleaning strategy definition and data extraction strategy definition; the data cleaning strategy definition defines strategies of data condition filtering, field splicing and splitting according to a data format definition result, and specifically comprises the following steps:

and (3) conditional filtration: providing default empty, non-empty and range filtering rule selection, and using a query condition form as a parameter to be transmitted; if the filtering condition set for the personnel information table accessed in real time is defined as XM-non-null, then the subsequent step 6) will carry out condition filtering processing on the personnel information table;

field splicing: splicing the selected fields and separators to input the fields into the newly added fields, wherein the new fields are added after the original fields; the field splicing separation mode supports space separation, comma separation and no separator; if the CSRQ and SJ fields of the real-time accessed personnel information table are spliced into a CSSJ new field, selecting the two fields as CSRQ and SJ, and selecting a space in a separation mode;

data extraction policy definition: defining and defining the extraction mapping relation from the source data resource to the target data resource, and extracting the relation between partial fields of the personnel information data and the personnel information data; defining the relation of name-age-gender-certificate number of the extracted personnel information table according to requirements;

step3 data governance definition comprises resource catalog registration: data is registered to the data resource catalog according to the data format definition result, so that the condition of the personnel information data is displayed visually and the corresponding personnel information data can be searched quickly as required;

6) data access processing: according to the configuration generated in the data definition, data is processed during personnel information data access, the data is checked against the data provider, and finally the processed personnel information data is written into a file for storage; the specific steps are as follows:

step1, according to the data format definition strategy of step 5), the field HM of the personnel information table is standardized into ZJHM, XB into SEX, and DZ into ADDRESS; a new personnel information table is generated after processing, which comprises 6 data records and 8 fields, namely AGE, ZJHM, XM, CSRQ, SJ, SEX, ADDRESS and CSSJ, wherein AGE is the age, ZJHM is the certificate number, XM is the name, CSRQ is the birth date, SJ is the time, SEX is the gender, ADDRESS is the address, and CSSJ is the birth time (a new field after splicing);

step2, processing data according to the data processing definition strategy of step 5), including data cleaning and data extraction; the data cleaning processing performs condition filtering, field splicing and splitting operations on the data according to the data cleaning rules of the data definition;

according to the condition filtering defined by data cleaning, records lacking a value in the XM field of the personnel information table accessed in real time are filtered out, so that after cleaning only the 6 records having a value in the XM field remain;

according to the field splicing defined by data cleaning, the CSRQ and SJ fields of the personnel information table accessed in real time are selected, space is selected as the separation mode, and the two fields are spliced to obtain the new field CSSJ (birth time);

the data extraction is to extract target format data from the source format data according to the data extraction strategy definition to generate a personnel information table of 'name-age-gender-certificate number' relationship;
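The cleaning and extraction of step2 can be sketched in memory as follows. The function names and the three sample records are hypothetical and purely illustrative; they are not data from the invention, and a production system would run the equivalent operations on the big data platform rather than on Python dictionaries.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def filter_non_empty(rows: List[Record], column: str) -> List[Record]:
    # Condition filtering: keep only records with a value in the given field.
    return [r for r in rows if r.get(column) not in (None, "")]


def splice_fields(rows: List[Record], sources: List[str], target: str,
                  separator: str = " ") -> List[Record]:
    # Field splicing: the new field is appended after the original fields.
    result = []
    for r in rows:
        new_row = dict(r)
        new_row[target] = separator.join(str(r.get(s) or "") for s in sources)
        result.append(new_row)
    return result


def extract_fields(rows: List[Record], columns: List[str]) -> List[Record]:
    # Data extraction: project each record onto the target relation.
    return [{c: r.get(c) for c in columns} for r in rows]


# Made-up sample records; two of them lack a value in XM.
rows: List[Record] = [
    {"XM": "Zhang San", "AGE": "30", "SEX": "M", "ZJHM": "ID-0001",
     "CSRQ": "1991-01-01", "SJ": "08:30:00"},
    {"XM": "", "AGE": "25", "SEX": "F", "ZJHM": "ID-0002",
     "CSRQ": "1996-05-05", "SJ": "10:00:00"},
    {"XM": None, "AGE": "41", "SEX": "M", "ZJHM": "ID-0003",
     "CSRQ": "1980-12-12", "SJ": "23:15:00"},
]

cleaned = filter_non_empty(rows, "XM")
spliced = splice_fields(cleaned, ["CSRQ", "SJ"], "CSSJ", " ")
extracted = extract_fields(spliced, ["XM", "AGE", "SEX", "ZJHM"])
print(spliced[0]["CSSJ"])
print(extracted)
```

Each function returns new records instead of modifying its input, so the original data remains available for the data checking with the data provider.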

7) data asset: performing standard management on the accessed personnel information data, mastering the condition of the personnel information data assets, and visually displaying the accessed personnel information data so that it can be comprehensively understood.

FIG. 2 is a data standard flow chart: the personnel information data standard is synchronized to the data standard interface; after receiving a standard application from the data definition, the data standard interface issues the standard to the data definition.

FIG. 3 is a flow chart of data pre-cleaning: personnel information data is pre-cleaned according to the data problems found in personnel information data exploration; task result sets are freely formed during pre-cleaning and stored in a hive temporary library; a task result set for which all pre-cleaning work is confirmed complete can be used as new original data, and its record information is automatically synchronized (inherited) from the original data for subsequent data exploration and data definition.

Claims

1. A method for managing personnel information data based on big data is characterized by comprising the following steps:

7) data asset: performing asset management on the personnel information data accessed in the step 6) to master the condition of the personnel information data assets;

the personnel information data exploration in the step 3) comprises two repeated multi-dimensional exploration analyses, wherein one exploration is carried out on original data, and the other exploration is carried out on the personnel information data subjected to data pre-cleaning in the step 4); the multi-dimensional exploration analysis comprises exploration on personnel information data volume, exploration on fields and quality of personnel information data and exploration on problem data;

the data pre-cleaning in the step 4) comprises the operations of condition filtering, field splicing, splitting and character string replacement; the condition filtering is to provide three choices of null, non-null and range rules, and the foreground takes the form of query conditions as parameters to transmit to the background; the field splicing, splitting and replacing character strings are obtained by splicing and splitting the existing field and replacing character strings through a CONCAT function and a REPLACE function;

the data definition in the step 5) is defined based on a data standard, and the definition comprises a data reading definition, a data format definition, a data processing definition and a data governance definition; the data reading definition is to define the reading of original data from a source platform according to the personnel information data exploration result and define a file character set of the personnel information data according to the service requirement; the data format definition refers to a data standard and completes the mapping of the personnel information original field and the personnel information standard field; the data processing definition comprises a step1 data cleaning strategy definition and a step2 data extraction strategy definition; the data governance definition comprises resource catalog registration, wherein the resource catalog registration is to synchronously register the registered personnel information data in the data record to a data resource catalog so as to comprehensively master the condition of the personnel information data;

in the step 6), data reading is to extract data from the hive temporary library of pre-cleaned personnel information data and check whether the data definitions are consistent, performing data reading if they are consistent and stopping reading if they are inconsistent; the data checking is a link performed synchronously in the data reading stage, in which the integrity and the correctness of the personnel information data are checked at a given reconciliation time node;

the personnel information data processing in the step 6) comprises step1 cleaning and step2 extraction, and the steps are as follows:

step1 cleaning: performing condition filtering, field splicing, field splitting and character string replacing operations on real-time accessed personnel information data according to the data cleaning strategy definition;

step2 extraction: extracting target format data from the source format data according to the data extraction strategy definition, and extracting the relation between partial fields of the personnel information data accessed in real time and the personnel information data;

the data pre-cleaning in the step 4), the step1 data cleaning strategy definition in the step 5) and the step1 cleaning in the step 6) differ in the data objects they target: step 4) targets sample data, while step 5) and step 6) target real data; step 5) and step 6) further differ in purpose: step 5) defines the cleaning rules, and step 6) processes real data according to the rules defined in step 5).

2. The big data-based personnel information data governance method according to claim 1, wherein the exploration on personnel information data volume is an exploration of the data volume condition of all personnel information data; the exploration on fields and quality of personnel information data comprises the following steps: a) field null value rate exploration, b) named entity exploration, c) type and format exploration.

3. The method for personnel information data governance based on big data according to claim 2, wherein the field null value rate exploration is carried out by counting the field null value proportion condition through a formula (1):

（1）

4. The method for personnel information data governance based on big data according to claim 2, wherein the named entity exploration is specifically: automatically detecting and analyzing the identifier of the field content, and then manually intervening by combining the value of the original data to identify the named entities of person name, place name, certificate number and mobile phone number in the field content; the type and format exploration is specifically: exploring whether the field type and the format of the personnel information data meet the specification; the problem data exploration is to explore the legality of the fields and the data in the fields that does not conform to the specification, and the garbled-character rate of each column is counted through a formula (2):

（2）

5. The big data-based personnel information data governance method according to claim 1, wherein step1 data cleaning policy definition in step 5) is specifically: according to the business requirements, defining a condition filtering strategy of personnel information data and a strategy of field splicing, splitting and replacing character strings, and completing the rule definition aiming at a real data cleaning strategy;

step2 data extraction strategy definition in the step 5) specifically comprises the following steps: according to the business requirements, defining the extraction mapping relation from the source data to the target data, and extracting the relation between part of fields of the personnel information data and the personnel information data.

6. The method for managing personnel information data based on big data according to claim 1, wherein the asset management of the personnel information data accessed in step 6) in step 7) specifically comprises: visually displaying the resource overview through the resource catalog; the resource catalog is used for managing data, including resource classification and cataloguing, and for combing the data stored by the big data management platform and the personnel information data provided to the big data management platform through an interface.