CN112214478A - Client data cleaning method and device, electronic equipment and readable storage medium - Google Patents

Info

Publication number
CN112214478A
Authority
CN
China
Prior art keywords
data
data set
customer
original
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011058991.0A
Other languages
Chinese (zh)
Inventor
赵志明
王璠
陈海涛
李福宇
高宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Finance Technology Co Ltd
Original Assignee
China Merchants Finance Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Merchants Finance Technology Co Ltd
Priority to CN202011058991.0A
Publication of CN112214478A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to data processing technology and discloses a client data cleaning method comprising the following steps: obtaining an original customer data set; performing missing value filling and abnormal value removal on the original customer data set to obtain an initial customer data set; generating a contrast matrix according to the initial customer data set, calculating the repetition degree of the contrast matrix, and removing the customer data corresponding to any contrast matrix whose repetition degree is greater than a preset threshold to obtain a standard customer data set; performing data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data; constructing a customer knowledge graph according to the key data; and comparing the customer knowledge graph with preset data to perform validity verification of the standard customer data set. The invention also discloses a client data cleaning device, electronic equipment and a storage medium. The invention can improve the efficiency and accuracy of client data cleaning.

Description

Client data cleaning method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for cleaning client data, an electronic device, and a computer-readable storage medium.
Background
With the development of the social economy, the number of group enterprises has grown, as have the number of clients under each group enterprise and the volume of data related to those clients. To better manage the clients of a group enterprise, the client data needs to be cleaned: it must be determined whether the necessary data is present in the client data, and redundant and repeated data must be eliminated, so as to ensure the integrity and uniqueness of the client data.
In the prior art, customer data is usually cleaned non-automatically: the customer data is checked and identified manually, which is inefficient and not accurate enough.
Disclosure of Invention
The invention provides a client data cleaning method, a client data cleaning device, electronic equipment and a computer readable storage medium, and mainly aims to solve the problems that the traditional client data cleaning method is low in efficiency and inaccurate.
In order to achieve the above object, the present invention provides a client data cleansing method, including:
acquiring an original client data set, and performing missing value filling and abnormal value removing processing on the original client data set to obtain an initial client data set;
generating a contrast matrix according to the initial client data set, calculating the repetition degree of the contrast matrix, and executing duplication removal operation on client data corresponding to the contrast matrix with the repetition degree larger than a preset threshold value to obtain a standard client data set;
performing data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data, and constructing a customer knowledge graph according to the key data;
comparing the customer knowledge graph with preset data to perform validity verification of the standard customer data set;
and sending the validity verification result to a preset monitoring terminal.
Optionally, the performing missing value padding and outlier removal processing on the original client data set to obtain an initial client data set includes:
judging whether the original customer data set has a missing value, and when the original customer data set has the missing value, performing data filling on the original customer data set;
and judging whether the original client data set has abnormal values or not, and deleting the abnormal values contained in the original client data set when the original client data set has the abnormal values.
Optionally, the determining whether an outlier exists in the original customer data set comprises:
calculating a local reachable density ratio of adjacent data in the original customer data set;
and when the local reachable density ratio is smaller than or equal to a preset ratio, determining the original customer data to be an abnormal value.
Optionally, the calculating a local reachable density ratio of neighboring data in the original customer data set includes:
calculating the local reachable density ratio using the following equations:
Figure BDA0002711762360000021
Figure BDA0002711762360000022
where N_k(q) is the original customer data set, ld_k(q) is the q-th customer data in N_k(q), ld(p) is the data adjacent to the q-th customer data, k is the number of data items in the original customer data set N_k(q), and reach-dist_k(p, q) is the distance between p and q.
Optionally, before the data extraction is performed on the standard customer data set by using the pre-constructed key data extraction model to obtain the key data, the method further includes:
generating a training data set and a standard result corresponding to the training data set;
inputting the training data set into the key data extraction model for feature extraction to obtain a training result;
calculating loss values of the training results and the standard results by using a preset loss function to obtain loss values;
when the loss value is larger than or equal to a preset loss threshold value, adjusting parameters of the key data extraction model, and returning to input the training data set to the key data extraction model for feature extraction to obtain a training result;
and when the loss value is smaller than the loss threshold value, obtaining a standard key data extraction model.
Optionally, the performing a loss value calculation on the training result and the standard result by using a preset loss function to obtain a loss value includes: calculating the loss value using the following formula:
Figure BDA0002711762360000031
where
Figure BDA0002711762360000032
is the loss value,
Figure BDA0002711762360000033
is the training result, Y is the standard result, and α represents an error factor.
Optionally, the generating a contrast matrix from the initial customer data set includes:
constructing an empty matrix of corresponding size according to the data lengths of any two pieces of customer data in the initial customer data set;
and sequentially filling the two pieces of customer data into the empty matrix according to a preset rule to obtain the contrast matrix.
In order to solve the above problem, the present invention also provides a customer data cleansing apparatus, comprising:
the preprocessing module is used for acquiring an original client data set, and performing missing value filling and abnormal value removing processing on the original client data set to obtain an initial client data set;
the data deduplication module is used for generating a contrast matrix according to the initial customer data set, calculating the repetition degree of the contrast matrix, and executing deduplication operation on customer data corresponding to the contrast matrix with the repetition degree larger than a preset threshold value to obtain a standard customer data set;
the map construction module is used for extracting data of the standard customer data set by using a pre-constructed key data extraction model to obtain key data and constructing a customer knowledge map according to the key data;
and the validity verification module is used for comparing the client knowledge graph with preset data to execute validity verification of the standard client data set and sending the validity verification result to a preset monitoring terminal.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the customer data cleansing method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above client data cleansing method.
The embodiment of the invention firstly carries out missing value filling and abnormal value removing processing on the original client data set so as to delete abnormal data in the original client data set and fill the missing data, thereby ensuring the completeness and the accuracy of the data. Therefore, the client data cleaning method, the client data cleaning device and the computer readable storage medium can improve the efficiency and the accuracy of the client data cleaning method.
Drawings
FIG. 1 is a flow chart illustrating a client data cleansing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating one step in the customer data cleansing method shown in FIG. 1;
FIG. 3 is a schematic flow chart diagram illustrating one step in the customer data cleansing method shown in FIG. 1;
FIG. 4 is a block diagram of a client data cleansing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the internal structure of an electronic device implementing a client data cleansing method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a client data cleaning method. The execution subject of the client data cleaning method includes, but is not limited to, at least one of electronic devices such as a server and a terminal, which can be configured to execute the method provided by the embodiment of the present application. In other words, the client data cleansing method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Fig. 1 is a schematic flow chart of a client data cleansing method according to an embodiment of the present invention. In this embodiment, the client data cleansing method includes:
S1, acquiring an original client data set, and performing missing value filling and abnormal value removing processing on the original client data set to obtain an initial client data set.
The embodiment of the invention uses a Python statement with a data-grabbing function to fetch the original customer data set from a database storing the original customer data.
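The fetching step can be sketched as follows. This is only an illustration under stated assumptions: the source does not name the database engine, table, or columns, so SQLite, the table `customers`, and the column names in `COLUMNS` are all hypothetical, chosen to mirror the attributes listed for the original customer data set.

```python
import sqlite3

# Hypothetical column names; the source only lists the kinds of attributes.
COLUMNS = ("company_name", "social_credit_code",
           "registration_number", "organization_number")


def fetch_original_customer_data(conn):
    """Return every customer row as a dict keyed by attribute name."""
    cursor = conn.execute(f"SELECT {', '.join(COLUMNS)} FROM customers")
    return [dict(zip(COLUMNS, row)) for row in cursor.fetchall()]


# Demo: an in-memory database stands in for the real data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    f"CREATE TABLE customers ({', '.join(c + ' TEXT' for c in COLUMNS)})"
)
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    ("Company C", "", "440403000032117", ""),
)
original_data = fetch_original_customer_data(conn)
```

Returning each record as an attribute-to-value mapping keeps the later length-based missing value check simple, since empty strings remain visible as values of length 0.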
Wherein the original customer data set includes, but is not limited to, a company name of a company to which the customer belongs, a social credit code of the company, a registration number of the company, an organization number of the company, and the like.
Preferably, in an embodiment of the present invention, the performing missing value padding and outlier removing processing on the original client data set to obtain an initial client data set includes:
judging whether the original customer data set has a missing value, and when the original customer data set has the missing value, performing data filling on the original customer data set;
and judging whether the original client data set has abnormal values or not, and deleting the abnormal values contained in the original client data set when the original client data set has the abnormal values.
In detail, the embodiment of the present invention uses a Java statement with a missing value detection function to determine whether the original customer data set has a missing value. Specifically, the Java statement performs length detection on the attribute data in each piece of original client data in the original client data set: when the length of the attribute data is not 0, the value of the attribute data is determined not to be missing, and when the length is 0, the value is determined to be missing. In the embodiment of the present invention, the original customer data set includes a plurality of attributes and corresponding attribute values; for example, if the attribute "company registration number" and its corresponding attribute value exist in the original customer data set, length detection checks whether the length of each such attribute value is 0.
When the original client data set has a missing value, the embodiment of the present invention may fill the original client data set by using an existing missing value filling method.
In detail, existing missing value filling methods include, but are not limited to, default value filling, mean filling, mode filling, and KNN filling.
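A minimal sketch of the length-based missing value check followed by mode filling with a default fallback is given below. It is written in Python rather than Java for consistency with the other examples here; the function names, the `region` attribute, and the `"UNKNOWN"` default are assumptions made for illustration only.

```python
from collections import Counter


def is_missing(value):
    """Length-based check from the text: a value of length 0 is missing."""
    return value is None or len(str(value)) == 0


def fill_missing(records, default="UNKNOWN"):
    """Fill each missing attribute with the column mode, else a default."""
    attributes = {key for record in records for key in record}
    filled = [dict(record) for record in records]  # leave the input untouched
    for attr in attributes:
        present = [r[attr] for r in records if not is_missing(r.get(attr))]
        # Mode filling when any value is present, default filling otherwise.
        fill_value = Counter(present).most_common(1)[0][0] if present else default
        for record in filled:
            if is_missing(record.get(attr)):
                record[attr] = fill_value
    return filled


records = [
    {"company_name": "Company A", "region": "Shenzhen"},
    {"company_name": "Company B", "region": ""},
    {"company_name": "Company C", "region": "Shenzhen"},
]
filled_records = fill_missing(records)
```

Mode filling suits categorical attributes; identifiers such as registration numbers are unique per company, so in practice they would more plausibly be flagged for review than filled.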
Further, the determining whether the original client data set has an abnormal value according to the embodiment of the present invention includes:
calculating a local reachable density ratio of adjacent data of each original customer data in the original customer data set;
and when the local reachable density ratio is smaller than or equal to a preset ratio, determining the original customer data to be an abnormal value.
Specifically, the embodiment of the present invention calculates the local reachable density ratio LF_k(q) of the neighboring data of each original client data in the original client data set by using the following formulas:
Figure BDA0002711762360000061
Figure BDA0002711762360000062
where N_k(q) is the original customer data set, ld_k(q) is the q-th customer data in N_k(q), ld(p) is the data adjacent to the q-th customer data, k is the number of data items in the original customer data set N_k(q), and reach-dist_k(p, q) is the distance between p and q.
When an abnormal value exists in the original customer data set, the embodiment of the present invention performs a deletion operation on the abnormal value.
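Because the formulas themselves are only available as figures, the following is a sketch under explicit assumptions rather than the patented computation. It assumes each customer record has already been encoded as a numeric vector, defines the local reachability density in the usual way (the inverse of the mean reachability distance to the k nearest neighbours), and takes the ratio as the density of a point divided by the mean density of its neighbours, so that a small ratio marks an abnormal value, matching the "smaller than or equal to a preset ratio" rule. All function names and the default values of `k` and `preset_ratio` are hypothetical.

```python
import math


def k_neighbors(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return math.dist(points[i], points[k_neighbors(points, i, k)[-1]])


def local_reach_density(points, i, k):
    """Inverse of the mean reachability distance from point i to its neighbours."""
    neighbors = k_neighbors(points, i, k)
    reach = [
        max(k_distance(points, j, k), math.dist(points[i], points[j]))
        for j in neighbors
    ]
    return len(neighbors) / sum(reach)  # assumes the points are not all identical


def density_ratio(points, i, k):
    """Assumed ratio: density of point i over the mean density of its neighbours."""
    neighbors = k_neighbors(points, i, k)
    mean_neighbor_density = sum(
        local_reach_density(points, j, k) for j in neighbors
    ) / len(neighbors)
    return local_reach_density(points, i, k) / mean_neighbor_density


def remove_outliers(points, k=2, preset_ratio=0.5):
    """Keep only the points whose ratio exceeds the preset ratio."""
    return [
        p for i, p in enumerate(points)
        if density_ratio(points, i, k) > preset_ratio
    ]


encoded = [(0.0,), (1.0,), (2.0,), (3.0,), (100.0,)]
cleaned = remove_outliers(encoded)
```

In this toy data the isolated point `(100.0,)` has a far lower density than its neighbours and is the only one removed.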
According to the embodiment of the invention, the data integrity in the original client data set can be improved by performing data preprocessing on the original client data set, invalid data and error data are deleted, data redundancy is reduced, and data accuracy is improved.
S2, generating a contrast matrix according to the initial customer data set, calculating the repetition degree of the contrast matrix, and executing deduplication operation on the customer data corresponding to the contrast matrix with the repetition degree larger than a preset threshold value to obtain a standard customer data set.
Referring to fig. 2, in an embodiment of the present invention, the generating a contrast matrix according to the initial customer data set includes:
S201, constructing an empty matrix of corresponding size according to the data lengths of any two pieces of customer data in the initial customer data set;
S202, sequentially filling the two pieces of customer data into the empty matrix according to a preset rule to obtain the contrast matrix.
Specifically, for example, suppose any two different company registration numbers in the initial customer data set are 440403000032117 and 416703000014797. An empty matrix is constructed according to the lengths of the two registration numbers, and the two numbers are sequentially filled into the empty matrix, so that the following contrast matrix is obtained, with the characters of the first number heading the columns and the characters of the second number heading the rows:
      | 4 4 0 4 0 3 0 0 0 0 3 2 1 1 7
    4 | 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
    1 | 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
    6 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    7 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    0 | 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
    3 | 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
    0 | 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
    0 | 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
    0 | 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
    0 | 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
    1 | 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
    4 | 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
    7 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    9 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    7 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
The construction rule of the contrast matrix is as follows:
judging whether the character in the horizontal row and the character in the vertical column of the contrast matrix are the same;
if the two characters are different, setting the matrix entry at the intersection of that row and column to 0, and if the two characters are the same, setting the matrix entry at the intersection to 1.
For example, in the embodiment of the present invention, if the character 4 appears in both the horizontal characters and the vertical characters, the matrix entry at the intersection of the two 4s is 1.
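The construction rule can be sketched in a few lines of Python. The function name `build_contrast_matrix` is an assumption; the header row and header column of the printed example are left implicit, so the result holds only the 0/1 entries.

```python
def build_contrast_matrix(a, b):
    """Rows follow the characters of b, columns follow the characters of a.

    An entry is 1 when the row character equals the column character, else 0.
    """
    return [[1 if ch_b == ch_a else 0 for ch_a in a] for ch_b in b]


matrix = build_contrast_matrix("440403000032117", "416703000014797")
for label, row in zip("416703000014797", matrix):
    print(label, "|", " ".join(str(entry) for entry in row))
```

For the two registration numbers in the example, the first row (row character 4) has ones exactly under the three 4s of the first number.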
Further, the embodiment of the present invention calculates the repetition degree by using the following repetition degree formula:
F = α^T α
where F is the repetition degree, α is the contrast matrix, and α^T is the transpose of the contrast matrix α.
In the embodiment of the invention, the repetition degree of the contrast matrix is calculated, when the repetition degree is greater than a preset threshold value, one piece of client data corresponding to the contrast matrix is removed, and when the repetition degree is less than or equal to the preset threshold value, the client data corresponding to the contrast matrix is reserved to obtain a cleaned standard client data set.
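The repetition degree formula and the threshold rule can be sketched as below. The product of the transposed contrast matrix with the contrast matrix is itself a matrix, and the source does not state how it is compared with a scalar threshold; this sketch assumes, purely for illustration, that the trace of F is used as the scalar. The helper names and the threshold used in the demo are hypothetical, and a real threshold would need calibrating against the data.

```python
def build_contrast_matrix(a, b):
    """0/1 matrix: rows follow the characters of b, columns those of a."""
    return [[1 if ch_b == ch_a else 0 for ch_a in a] for ch_b in b]


def repetition_matrix(alpha):
    """Compute F = alpha^T alpha using plain nested lists."""
    columns = list(zip(*alpha))  # the columns of alpha are the rows of alpha^T
    return [
        [sum(x * y for x, y in zip(col_i, col_j)) for col_j in columns]
        for col_i in columns
    ]


def repetition_score(alpha):
    """Assumed scalar summary of F: its trace (not specified by the source)."""
    f = repetition_matrix(alpha)
    return sum(f[i][i] for i in range(len(f)))


def deduplicate(values, threshold):
    """Keep a value only if its score against every kept value is <= threshold."""
    kept = []
    for value in values:
        if all(
            repetition_score(build_contrast_matrix(value, other)) <= threshold
            for other in kept
        ):
            kept.append(value)
    return kept


alpha = build_contrast_matrix("440403000032117", "416703000014797")
score = repetition_score(alpha)
unique_values = deduplicate(["ab", "ab", "cd"], threshold=1)
```

With 0/1 entries the trace equals the number of matching character pairs, so it grows with repeated characters as well as with genuine duplication.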
S3, performing data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data, and constructing a customer knowledge graph according to the key data.
In one embodiment of the present invention, before the performing data extraction on the standard customer data set by using the pre-constructed key data extraction model to obtain key data, the method further includes training the key data extraction model.
In detail, the training process of the key data extraction model includes:
Step A: generating a training data set and a standard result corresponding to the training data set;
Step B: inputting the training data set into the key data extraction model for feature extraction to obtain a training result;
Step C: calculating the loss between the training result and the standard result by using a preset loss function to obtain a loss value;
Step D: when the loss value is greater than or equal to a preset loss threshold value, the output result of the key data extraction model is not accurate enough; the parameters of the key data extraction model are adjusted, and the process returns to Step B to perform feature extraction again;
Step E: when the loss value is smaller than the loss threshold value, the output result of the key data extraction model is accurate, and a standard key data extraction model is obtained.
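The control flow of Steps B to E can be sketched with a toy stand-in model. Everything specific here is an assumption: the model is a single-weight linear map, the loss is an error factor `alpha` times the mean squared error (the actual loss formula is only available as a figure), and parameters are adjusted by plain gradient descent. Only the loop structure, train, compute loss, compare with threshold, adjust and repeat, is taken from the text.

```python
def train_until_threshold(inputs, targets, loss_threshold=1e-4, alpha=1.0,
                          learning_rate=0.05, max_rounds=10_000):
    """Repeat Steps B to D until the loss drops below the threshold (Step E)."""
    weight = 0.0  # the single parameter of the toy stand-in model
    for _ in range(max_rounds):
        # Step B: run the model on the training data to get a training result.
        predictions = [weight * x for x in inputs]
        errors = [p - y for p, y in zip(predictions, targets)]
        # Step C: assumed loss, alpha times the mean squared error.
        loss = alpha * sum(e * e for e in errors) / len(errors)
        # Step E: the loss is below the threshold, so training is finished.
        if loss < loss_threshold:
            return weight, loss
        # Step D: adjust the parameter with one gradient step, then loop.
        gradient = 2 * alpha * sum(e * x for e, x in zip(errors, inputs)) / len(errors)
        weight -= learning_rate * gradient
    raise RuntimeError("loss did not fall below the threshold")


trained_weight, final_loss = train_until_threshold([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The `max_rounds` guard is an addition for safety; the text itself describes no upper bound on the number of iterations.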
In detail, in the embodiment of the present invention, the loss between the training result and the standard result is calculated by using the following loss function to obtain a loss value:
Figure BDA0002711762360000081
where
Figure BDA0002711762360000082
is the loss value,
Figure BDA0002711762360000083
is the training result, Y is the standard result, and α represents an error factor, which is a preset constant.
According to the embodiment of the invention, the key data extraction is carried out on the standard customer data set by training the key data extraction model, so that the key data in the standard customer data set can be identified, and the data analysis efficiency is improved. The precision of data extraction can be improved through the training model, and errors are avoided when manual data extraction is carried out.
In the embodiment of the present invention, the key data refers to information such as a company name, a social credit code of the company, a registration number of the company, and an organization number of the company.
Referring to fig. 3, further, the building a customer knowledge graph according to the key data includes:
S301, performing structuring processing on the key data to obtain structured data;
S302, performing entity extraction and relationship extraction on the structured data to respectively obtain entity information and an entity relationship;
S303, carrying out information fusion processing on the entity information and the entity relation to obtain the customer knowledge graph.
Specifically, in the embodiment of the present invention, the structuring process assigns a defined type to the key data to obtain structured data.
For example, if the key data includes Hua Technology Limited and Zhongxing Communication Shares Limited, both are defined as enterprises, which realizes the structured processing of the initial client data.
The structured processing can lead the structured data obtained after the processing to be regularly stored and arranged, thereby facilitating the subsequent operation.
Further, the embodiment of the invention can adopt a named entity identification method to perform entity extraction and relationship extraction on the structured data.
Further, the embodiment of the invention fuses the entity information and the entity relationship to obtain a plurality of triples, and obtains the customer knowledge graph according to the triples. A triple is an information representation of the form "entity + relationship = entity". For example, "the client of company A is client B" is represented by the triple "company A + client relationship = client B", and "the registration number of company C is 440403000032117" is represented by the triple "company C + registration number = 440403000032117".
The graph structure of the customer knowledge graph may be used to provide a basic data structure for subsequent data validity checks.
According to the embodiment of the invention, the client knowledge graph is constructed according to the key data, so that the correlation among a plurality of entities in the client knowledge graph can be reflected visually, and the efficiency of further analysis by using the client knowledge graph is improved.
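As an illustration of the fusion step, the triples can be indexed into a simple adjacency mapping that serves as the graph structure. This is a minimal sketch; the function name and the dictionary-of-sets representation are assumptions, and a production system might use a graph database instead.

```python
from collections import defaultdict


def build_knowledge_graph(triples):
    """Index (entity, relationship, entity) triples as a nested mapping.

    The result maps head entity -> relationship -> set of tail entities.
    """
    graph = defaultdict(dict)
    for head, relation, tail in triples:
        graph[head].setdefault(relation, set()).add(tail)
    return dict(graph)


triples = [
    ("company A", "client relationship", "client B"),
    ("company C", "registration number", "440403000032117"),
]
graph = build_knowledge_graph(triples)
```

Looking up `graph["company C"]["registration number"]` then yields the set of registration numbers recorded for company C, which is the kind of direct access the later validity check relies on.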
S4, comparing the customer knowledge graph with preset data to execute validity verification of the standard customer data set.
In the embodiment of the invention, the data needing validity verification can be obtained directly from the client knowledge graph and compared with the preset data, which improves the efficiency and accuracy of data validity verification. For example, based on the triple "company C + registration number = 440403000032117", it is determined whether 440403000032117 matches the registration number of company C in the preset data; if so, that item of data passes the validity verification.
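The comparison can be sketched as follows, assuming the knowledge graph is held as a mapping from entity to relationship to a set of values and the preset data is a mapping from company to its expected registration number. Both representations and the function name `verify_validity` are hypothetical.

```python
def verify_validity(graph, preset_data, relation="registration number"):
    """Return {entity: True/False} for each entity carrying the given relation.

    An entity passes when the graph holds exactly the value in the preset data.
    """
    results = {}
    for entity, relations in graph.items():
        if relation not in relations:
            continue  # nothing to verify for this entity
        expected = preset_data.get(entity)
        results[entity] = expected is not None and relations[relation] == {expected}
    return results


graph = {
    "company A": {"client relationship": {"client B"}},
    "company C": {"registration number": {"440403000032117"}},
    "company D": {"registration number": {"416703000014797"}},
}
preset_data = {"company C": "440403000032117", "company D": "000000000000000"}
verification = verify_validity(graph, preset_data)
```

Here company C passes because its registration number matches the preset data, company D fails, and company A is skipped because it has no registration number in the graph.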
S5, sending the validity verification result to a preset monitoring terminal.
The embodiment of the invention can send the final validity verification result to the preset monitoring terminal, so as to be beneficial for a data administrator to further analyze the data.
The embodiment of the invention firstly carries out missing value filling and abnormal value removing processing on the original client data set so as to delete abnormal data in the original client data set and fill the missing data, thereby ensuring the completeness and the accuracy of the data. Therefore, the embodiment of the invention can improve the efficiency and the accuracy of the client data cleaning method.
FIG. 4 is a block diagram of a client data cleansing apparatus according to the present invention.
The customer data cleansing apparatus 100 according to the present invention may be installed in an electronic device. According to the functions realized, the client data cleansing apparatus 100 can comprise a preprocessing module 101, a data deduplication module 102, a map construction module 103 and a validity verification module 104. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and can perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the preprocessing module 101 is configured to obtain an original client data set, and perform missing value filling and outlier removal processing on the original client data set to obtain an initial client data set;
the data deduplication module 102 is configured to generate a comparison matrix according to the initial customer data set, calculate a repetition degree of the comparison matrix, and perform deduplication operation on customer data corresponding to the comparison matrix with the repetition degree being greater than a preset threshold value to obtain a standard customer data set;
the map construction module 103 is configured to perform data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data, and construct a customer knowledge map according to the key data;
the validity verification module 104 is configured to compare the customer knowledge graph with preset data to perform validity verification of the standard customer data set, and send the validity verification result to a preset monitoring terminal.
In detail, when the modules in the client data cleansing apparatus 100 are applied, a client data cleansing method including the following steps can be implemented:
Step one: the preprocessing module 101 obtains an original client data set, and performs missing value filling and abnormal value removing processing on the original client data set to obtain an initial client data set.
In the embodiment of the present invention, the preprocessing module 101 may fetch the original customer data set from a database storing the original customer data by using a Python statement having a data-grabbing function.
Wherein the original customer data set includes, but is not limited to, a company name of a company to which the customer belongs, a social credit code of the company, a registration number of the company, an organization number of the company, and the like.
Preferably, in an embodiment of the present invention, the preprocessing module 101 performs missing value filling and outlier removing processing on the original client data set to obtain an initial client data set, including:
judging whether the original customer data set has a missing value, and when the original customer data set has the missing value, performing data filling on the original customer data set;
and judging whether the original client data set has abnormal values or not, and deleting the abnormal values contained in the original client data set when the original client data set has the abnormal values.
In detail, in the embodiment of the present invention, the preprocessing module 101 uses a java statement with a missing value detection function to determine whether a missing value exists in the original client data set. Specifically, in the embodiment of the present invention, the preprocessing module 101 performs length detection on the attribute data in each piece of original client data in the original client data set by using the java statement with the missing value detection function, when it is detected that the numerical length of the attribute data is not 0, it is determined that the value of the attribute data is not missing, and when it is detected that the numerical length of the attribute data is 0, it is determined that the value of the attribute data is missing. In the embodiment of the present invention, the original customer data set includes a plurality of attributes and corresponding attribute values, for example, if a company registration number exists in the original customer data set and an attribute value corresponding to the company registration number exists in the original customer data set, it is detected whether each attribute data in the original customer data set is 0 or not during length detection.
When the original client data set has a missing value, the preprocessing module 101 in the embodiment of the present invention may perform data padding on the original client data set by using an existing missing value padding method.
In detail, existing missing value filling methods include, but are not limited to, default value filling, mean filling, mode filling, and KNN (k-nearest neighbor) filling.
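Three of the filling methods listed above (default, mean, mode) can be sketched with the Python standard library. The function name `fill_missing`, the use of `None` to mark a missing value, and the numeric sample column are assumptions made for illustration; KNN filling is omitted because it needs a distance model over several attributes.

```python
from statistics import mean, mode
from typing import List, Optional


def fill_missing(values: List[Optional[float]],
                 strategy: str = "default",
                 default: float = 0.0) -> List[float]:
    """Replace None entries using the chosen filling strategy."""
    present = [v for v in values if v is not None]
    if strategy == "default" or not present:
        # Fall back to the default when nothing is available to average.
        fill = default
    elif strategy == "mean":
        fill = mean(present)
    elif strategy == "mode":
        fill = mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical numeric attribute with one missing entry.
column = [100.0, None, 300.0, 100.0]
print(fill_missing(column, "mode"))
print(fill_missing(column, "mean"))
```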
Further, the determining whether the original client data set has an abnormal value according to the embodiment of the present invention includes:
calculating a local reachable density ratio of adjacent data of each original customer data in the original customer data set;
and when the local reachable density ratio is smaller than or equal to a preset ratio, determining the original customer data to be an abnormal value.
Specifically, the preprocessing module 101 according to the embodiment of the present invention calculates the local reachable density ratio LF_k(q) of the neighboring data of each original client data in the original client data set by using the following algorithm:
Figure BDA0002711762360000121
Figure BDA0002711762360000122
wherein N_k(q) is the original customer data set, ld_k(q) is the q-th customer data in N_k(q), ld(p) is the adjacent data of the q-th customer data, k is the number of data in N_k(q), and reach-dist_k(p, q) is the distance between p and q.
When an abnormal value exists in the original customer data set, the embodiment of the present invention performs a deletion operation on the abnormal value.
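The equations for the local reachable density ratio are not reproduced in the text above, so the following sketch uses the textbook local outlier factor (LOF) as a stand-in rather than the exact formula of this embodiment. The function name `local_outlier_factors`, the sample points, and the preset ratio of 1.5 are hypothetical. In this textbook formulation a score near 1 marks a typical record and a score well above 1 marks an outlier, so the sketch flags scores above the preset ratio; the comparison direction used with the ratio defined in the equations above may differ.

```python
import math
from typing import List, Sequence


def local_outlier_factors(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Textbook LOF score for every point, using its k nearest neighbours."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])                 # exactly k neighbours (ties cut)
        k_distance.append(dist[i][order[k - 1]])    # distance to the k-th neighbour

    def reach_dist(i: int, j: int) -> float:
        # Reachability distance of point i with respect to neighbour j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    # Exact duplicate points (total == 0) are not handled robustly in this sketch.
    lrd = []
    for i in range(n):
        total = sum(reach_dist(i, j) for j in neighbors[i])
        lrd.append(len(neighbors[i]) / total if total > 0 else math.inf)

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] / lrd[i] for j in neighbors[i]) / len(neighbors[i])
            for i in range(n)]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = local_outlier_factors(points, k=2)
preset_ratio = 1.5  # hypothetical threshold
outliers = [i for i, s in enumerate(scores) if s > preset_ratio]
print(outliers)
```

Records whose index appears in `outliers` would then be deleted from the data set.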
According to the embodiment of the invention, the data preprocessing module 101 is used for preprocessing the original client data set, so that the data integrity of the original client data set can be improved, invalid data and error data are deleted, data redundancy is reduced, and the data accuracy is improved.
Step two, the data deduplication module 102 generates a contrast matrix according to the initial customer data set, calculates the repetition degree of the contrast matrix, and performs a deduplication operation on the customer data corresponding to any contrast matrix whose repetition degree is greater than a preset threshold value, to obtain a standard customer data set.
In this embodiment of the present invention, the generating, by the data deduplication module 102, of a contrast matrix according to the initial customer data set includes: constructing an empty matrix of corresponding size according to the data lengths of any two client data in the initial client data set; and sequentially filling the two client data into the empty matrix according to a preset rule to obtain the contrast matrix.
Specifically, for example, suppose any two different company registration numbers in the initial customer data set are 440403000032117 and 416703000014797. An empty matrix whose size corresponds to the lengths of the two registration numbers is constructed, and the two registration numbers are filled into it in sequence, the first along the horizontal row and the second along the vertical column, giving the following contrast matrix:
[     4 4 0 4 0 3 0 0 0 0 3 2 1 1 7
  4   1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
  1   0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
  6   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  7   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0   0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
  3   0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
  0   0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
  0   0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
  0   0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
  0   0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
  1   0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
  4   1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
  7   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  9   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  7   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ]
wherein the construction rule of the contrast matrix is as follows:
judging whether the character in the horizontal row and the character in the vertical column of the contrast matrix are the same;
if the two characters are different, setting the matrix data at the intersection of that row position and column position to 0; and if the two characters are the same, setting the matrix data at the intersection to 1.
For example, in the embodiment of the present invention, if the character 4 appears in both the horizontal row and the vertical column, the matrix data at the intersection of the two 4s is 1.
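The construction rule can be sketched in a few lines of Python. The function name `build_contrast_matrix` is hypothetical; the two registration numbers are the ones used in the example above, with the first laid out along the horizontal row and the second along the vertical column.

```python
from typing import List


def build_contrast_matrix(horizontal: str, vertical: str) -> List[List[int]]:
    """One row per character of `vertical`, one column per character of `horizontal`.

    A cell is 1 when the two characters are the same and 0 otherwise.
    """
    return [[1 if v == h else 0 for h in horizontal] for v in vertical]


alpha = build_contrast_matrix("440403000032117", "416703000014797")
for label, row in zip("416703000014797", alpha):
    print(label, *row)
```

Each printed line starts with the character from the vertical column, followed by the 0/1 entries for that row.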
Further, the data deduplication module 102 performs the duplication degree calculation using the following duplication degree formula:
F = α^T α
wherein F is the repetition degree, α is the contrast matrix, and α^T is the transpose of the contrast matrix α.
In the embodiment of the invention, the repetition degree of the contrast matrix is calculated, when the repetition degree is greater than a preset threshold value, one piece of client data corresponding to the contrast matrix is removed, and when the repetition degree is less than or equal to the preset threshold value, the client data corresponding to the contrast matrix is reserved to obtain a cleaned standard client data set.
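The product α^T α is itself a matrix, and the text does not state how it is reduced to a single number for comparison with the preset threshold. The sketch below computes the product directly and then, purely as an assumption for illustration, uses the trace of the product divided by the number of cells in α as a scalar score. For a 0/1 matrix, the trace of α^T α equals the number of 1s in α. The names `repetition_matrix` and `repetition_score` and the threshold value are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def repetition_matrix(alpha: Matrix) -> Matrix:
    """F = alpha^T * alpha, as in the formula above."""
    return matmul(transpose(alpha), alpha)


def repetition_score(alpha: Matrix) -> float:
    """Assumed scalar summary: trace(F) divided by the number of cells in alpha."""
    f = repetition_matrix(alpha)
    trace = sum(f[i][i] for i in range(len(f)))
    return trace / (len(alpha) * len(alpha[0]))


alpha = [[1, 1], [0, 0]]          # small hand-made contrast matrix
print(repetition_matrix(alpha))
threshold = 0.4                   # hypothetical preset threshold
is_duplicate = repetition_score(alpha) > threshold
print(is_duplicate)
```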
Step three, the graph building module 103 performs data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data, and builds a customer knowledge graph according to the key data.
In one embodiment of the present invention, before the performing data extraction on the standard customer data set by using the pre-constructed key data extraction model to obtain key data, the method further includes: and training the key data extraction model.
In detail, the training process of the key data extraction model includes:
step A: generating a training data set and a standard result corresponding to the training data set;
step B: inputting the training data set into the key data extraction model for feature extraction to obtain a training result;
step C: calculating a loss value between the training result and the standard result by using a preset loss function;
step D: when the loss value is greater than or equal to a preset loss threshold value, the output result of the key data extraction model is not accurate enough, the parameters of the key data extraction model are adjusted, and the step B is returned to extract the key data again;
step E: and when the loss value is smaller than the loss threshold value, the output result of the key data extraction model is accurate, and a standard key data extraction model is obtained.
In detail, in the embodiment of the present invention, the loss value between the training result and the standard result is calculated by using the following loss function:
Figure BDA0002711762360000141
wherein
Figure BDA0002711762360000142
is the loss value,
Figure BDA0002711762360000143
is the training result, Y is the standard result, and α is an error factor, which is a preset constant.
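Steps A to E can be sketched as a loop that stops once the loss falls below the preset loss threshold. Everything concrete here is an assumption for illustration: the model is a single-weight linear function standing in for the key data extraction model, the loss is mean squared error standing in for the loss function shown above, and `max_rounds` is a safety cap that the text does not mention.

```python
from typing import List, Tuple


def train_until_threshold(xs: List[float], ys: List[float],
                          loss_threshold: float = 1e-6,
                          learning_rate: float = 0.01,
                          max_rounds: int = 10_000) -> Tuple[float, float, int]:
    """Train y = w * x until the loss is below the threshold."""
    w = 0.0
    for round_no in range(1, max_rounds + 1):
        # Step B: run the model on the training data.
        preds = [w * x for x in xs]
        # Step C: compare the training result with the standard result.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # Step E: loss below the threshold, so the model is accepted.
        if loss < loss_threshold:
            return w, loss, round_no
        # Step D: adjust the parameter and return to step B.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= learning_rate * grad
    raise RuntimeError("loss threshold not reached within max_rounds")


# Step A: a toy training set and its standard results (y = 2x).
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w, loss, rounds = train_until_threshold(xs, ys)
print(round(w, 3), rounds)
```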
According to the embodiment of the invention, key data extraction is performed on the standard customer data set by a trained key data extraction model, so that the key data in the standard customer data set can be identified and the data analysis efficiency is improved. Using a trained model improves the precision of data extraction and avoids the errors that arise in manual data extraction.
In the embodiment of the present invention, the key data refers to information such as a company name, a social credit code of the company, a registration number of the company, and an organization number of the company.
Further, the building of the customer knowledge graph according to the key data comprises: carrying out structuralization processing on the key data to obtain structuralization data; performing entity extraction and relationship extraction on the structured data to respectively obtain entity information and an entity relationship; and carrying out information fusion processing on the entity information and the entity relation to obtain the customer knowledge graph.
Specifically, in the embodiment of the present invention, the structuring process assigns a definition to each item of key data to obtain structured data.
For example, the key data includes Hua Technology Limited and Zhongxing Communication Co., Ltd.; defining both as enterprises realizes the structured processing of the initial client data.
Structured processing allows the resulting structured data to be stored and arranged in a regular manner, which facilitates subsequent operations.
Further, the embodiment of the invention can adopt a named entity recognition method to perform entity extraction and relationship extraction on the structured data.
Further, the embodiment of the invention fuses the entity information and the entity relationship to obtain a plurality of triples, and obtains the customer knowledge graph according to the triples. A triple is an information representation of the form "entity + relationship = entity". For example, the fact that the client of company A is client B is represented by the triple "company A + client relationship = client B", and the fact that the registration number of company C is 440403000032117 is represented by the triple "company C + registration number = 440403000032117".
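A minimal sketch of turning such triples into a graph structure is given below. The `Triple` type, the `build_graph` function, and the nested-dictionary layout are assumptions chosen for illustration; the text does not prescribe a storage format for the customer knowledge graph.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Triple(NamedTuple):
    head: str       # first entity
    relation: str   # relationship between the two entities
    tail: str       # second entity or attribute value


def build_graph(triples: Iterable[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group triples as head -> relation -> list of tails."""
    graph: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for t in triples:
        graph[t.head][t.relation].append(t.tail)
    # Convert to plain dictionaries so lookups of unknown keys raise KeyError.
    return {head: dict(relations) for head, relations in graph.items()}


triples = [
    Triple("company A", "client relationship", "client B"),
    Triple("company C", "registration number", "440403000032117"),
]
graph = build_graph(triples)
print(graph["company C"]["registration number"])
```

A list of tails is kept per relation because one company may have several clients.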
The graph structure of the customer knowledge graph may be used to provide a basic data structure for subsequent data validity checks.
According to the embodiment of the invention, the client knowledge graph is constructed according to the key data, so that the correlation among a plurality of entities in the client knowledge graph can be reflected visually, and the efficiency of further analysis by using the client knowledge graph is improved.
Step four, the validity verification module 104 compares the customer knowledge graph with preset data to perform validity verification of the standard customer data set.
In the embodiment of the invention, the data requiring validity verification can be obtained directly from the customer knowledge graph and compared with the preset data, which improves the efficiency and accuracy of the data validity verification. For example, according to the triple "company C + registration number = 440403000032117", it is determined whether 440403000032117 matches the registration number of company C in the preset data; if so, that item of data passes the validity verification.
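The comparison with preset data can be sketched as a lookup keyed by entity and relationship. The function name `verify_triples`, the tuple layout, and the shape of the preset data are assumptions for illustration; an item with no preset counterpart is treated here as failing verification, which is also an assumption.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (entity, relationship)


def verify_triples(triples: Iterable[Tuple[str, str, str]],
                   preset: Dict[Key, str]) -> Dict[Key, bool]:
    """Mark each (entity, relationship) pair as valid when its value matches the preset data."""
    results: Dict[Key, bool] = {}
    for head, relation, tail in triples:
        expected = preset.get((head, relation))
        results[(head, relation)] = expected is not None and expected == tail
    return results


# Hypothetical preset data and triples read from the knowledge graph.
preset = {("company C", "registration number"): "440403000032117"}
triples = [
    ("company C", "registration number", "440403000032117"),
    ("company D", "registration number", "416703000014797"),
]
results = verify_triples(triples, preset)
print(results)
```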
Step five, the validity verification module 104 further sends the validity verification result to a preset monitoring terminal.
The embodiment of the invention can send the final validity verification result to the preset monitoring terminal, so that a data administrator can further analyze the data.
Fig. 5 is a schematic structural diagram of an electronic device implementing the client data cleansing method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a client data cleansing program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the client data cleansing program 12, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., executing a client data cleansing program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores a client data cleansing program 12, which is a combination of instructions that, when executed by the processor 10, implement:
acquiring an original client data set, and performing missing value filling and abnormal value removing processing on the original client data set to obtain an initial client data set;
generating a contrast matrix according to the initial client data set, calculating the repetition degree of the contrast matrix, and executing duplication removal operation on client data corresponding to the contrast matrix with the repetition degree larger than a preset threshold value to obtain a standard client data set;
performing data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data, and constructing a customer knowledge graph according to the key data;
comparing the customer knowledge graph with preset data to perform validity verification of the standard customer data set;
and sending the validity verification result to a preset monitoring terminal.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims should not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for cleansing customer data, the method comprising:
acquiring an original client data set, and performing missing value filling and abnormal value removing processing on the original client data set to obtain an initial client data set;
generating a contrast matrix according to the initial client data set, calculating the repetition degree of the contrast matrix, and executing duplication removal operation on client data corresponding to the contrast matrix with the repetition degree larger than a preset threshold value to obtain a standard client data set;
performing data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data, and constructing a customer knowledge graph according to the key data;
comparing the customer knowledge graph with preset data to perform validity verification of the standard customer data set;
and sending the validity verification result to a preset monitoring terminal.
2. The client data cleansing method of claim 1, wherein the missing value filling and outlier removing processing on the original client data set to obtain an initial client data set comprises:
judging whether the original customer data set has a missing value, and when the original customer data set has the missing value, performing data filling on the original customer data set;
and judging whether the original client data set has abnormal values or not, and deleting the abnormal values contained in the original client data set when the original client data set has the abnormal values.
3. The customer data cleansing method of claim 2, wherein the determining whether an outlier exists in the original customer data set comprises:
calculating a local reachable density ratio of adjacent data in the original customer data set;
and when the local reachable density ratio is smaller than or equal to a preset ratio, determining the original customer data to be an abnormal value.
4. The customer data cleansing method of claim 3, wherein said calculating a local reachable density ratio of neighboring data in the original customer data set comprises:
calculating the local reachable density ratio using the following equation:
Figure FDA0002711762350000011
Figure FDA0002711762350000012
wherein N_k(q) is the original customer data set, ld_k(q) is the q-th customer data in N_k(q), ld(p) is the adjacent data of the q-th customer data, k is the number of data in N_k(q), and reach-dist_k(p, q) is the distance between p and q.
5. The customer data cleansing method according to any one of claims 1 to 4, wherein before the data extraction of the standard customer data set using the pre-constructed key data extraction model to obtain the key data, the method further comprises:
generating a training data set and a standard result corresponding to the training data set;
inputting the training data set into the key data extraction model for feature extraction to obtain a training result;
calculating a loss value between the training result and the standard result by using a preset loss function;
when the loss value is larger than or equal to a preset loss threshold value, adjusting parameters of the key data extraction model, and returning to input the training data set to the key data extraction model for feature extraction to obtain a training result;
and when the loss value is smaller than the loss threshold value, obtaining a standard key data extraction model.
6. The customer data cleansing method of claim 5, wherein the calculating a loss value between the training result and the standard result by using a preset loss function comprises: calculating the loss value using the following formula:
Figure FDA0002711762350000021
wherein
Figure FDA0002711762350000022
is the loss value,
Figure FDA0002711762350000023
is the training result, Y is the standard result, and α represents an error factor.
7. The customer data cleansing method of claim 1, wherein the generating a contrast matrix from the initial customer data set comprises:
constructing an empty matrix of corresponding size according to the data lengths of any two client data in the initial client data set;
and sequentially filling the two client data into the empty matrix according to a preset rule to obtain the contrast matrix.
8. A customer data cleansing apparatus, the apparatus comprising:
a preprocessing module, configured to acquire an original client data set, and perform missing value filling and abnormal value removing processing on the original client data set to obtain an initial client data set;
a data deduplication module, configured to generate a contrast matrix according to the initial customer data set, calculate the repetition degree of the contrast matrix, and perform a deduplication operation on customer data corresponding to a contrast matrix whose repetition degree is greater than a preset threshold value, to obtain a standard customer data set;
a graph construction module, configured to perform data extraction on the standard customer data set by using a pre-constructed key data extraction model to obtain key data, and construct a customer knowledge graph according to the key data;
and a validity verification module, configured to compare the customer knowledge graph with preset data to perform validity verification of the standard customer data set, and send the validity verification result to a preset monitoring terminal.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the customer data cleansing method of any one of claims 1 to 7.
10. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the customer data cleansing method according to any one of claims 1 to 7.
CN202011058991.0A 2020-09-30 2020-09-30 Client data cleaning method and device, electronic equipment and readable storage medium Withdrawn CN112214478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011058991.0A CN112214478A (en) 2020-09-30 2020-09-30 Client data cleaning method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011058991.0A CN112214478A (en) 2020-09-30 2020-09-30 Client data cleaning method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112214478A true CN112214478A (en) 2021-01-12

Family

ID=74052413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011058991.0A Withdrawn CN112214478A (en) 2020-09-30 2020-09-30 Client data cleaning method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112214478A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377753A (en) * 2021-06-09 2021-09-10 国网吉林省电力有限公司 Heat accumulating type electric boiler load data cleaning system

Similar Documents

Publication Publication Date Title
CN112148577A (en) Data anomaly detection method and device, electronic equipment and storage medium
CN112541338A (en) Similar text matching method and device, electronic equipment and computer storage medium
CN113946690A (en) Potential customer mining method and device, electronic equipment and storage medium
CN113807553A (en) Method, device, equipment and storage medium for analyzing number of reservation services
CN114491047A (en) Multi-label text classification method and device, electronic equipment and storage medium
CN112579621A (en) Data display method and device, electronic equipment and computer storage medium
CN114612194A (en) Product recommendation method and device, electronic equipment and storage medium
CN112699142A (en) Cold and hot data processing method and device, electronic equipment and storage medium
CN112507230A (en) Webpage recommendation method and device based on browser, electronic equipment and storage medium
CN111339072A (en) User behavior based change value analysis method and device, electronic device and medium
CN114547696A (en) File desensitization method and device, electronic equipment and storage medium
CN114219023A (en) Data clustering method and device, electronic equipment and readable storage medium
CN112214478A (en) Client data cleaning method and device, electronic equipment and readable storage medium
CN112148566A (en) Monitoring method and device of computing engine, electronic equipment and storage medium
CN111460293B (en) Information pushing method and device and computer readable storage medium
CN112783989A (en) Data processing method and device based on block chain
CN112990374A (en) Image classification method, device, electronic equipment and medium
CN112256472A (en) Distributed data calling method and device, electronic equipment and storage medium
CN111932147A (en) Visualization method and device for overall index, electronic equipment and storage medium
CN113850260B (en) Key information extraction method and device, electronic equipment and readable storage medium
CN116304251A (en) Label processing method, device, computer equipment and storage medium
CN114840631A (en) Spatial text query method and device, electronic equipment and storage medium
CN114238233A (en) Automatic file cleaning method, device, equipment and storage medium
CN114518993A (en) System performance monitoring method, device, equipment and medium based on business characteristics
CN114417998A (en) Data feature mapping method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20210112)