CN114168581A - Data cleaning method and device, computer equipment and storage medium - Google Patents

Data cleaning method and device, computer equipment and storage medium

Info

Publication number
CN114168581A
Authority
CN
China
Prior art keywords
data
rule
processed
information
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111520378.0A
Other languages
Chinese (zh)
Inventor
任智慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Pension Insurance Corp
Original Assignee
Ping An Pension Insurance Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Pension Insurance Corp
Priority to CN202111520378.0A
Publication of CN114168581A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries

Abstract

The invention discloses a data cleaning method and device, computer equipment and a storage medium, and belongs to the technical field of computers. The data cleaning method automatically converts acquired rule information for cleaning data into a rule link network, so the rule information does not need to be converted into rule code by hand; this reduces manpower and material costs, improves efficiency, and lets the method adapt to various scenarios. The information to be processed is matched against the rule link network, so that the matched data to be processed and the rule link data it matches can be obtained quickly; the matched data to be processed is then simplified and merged (cleaned) based on the rule link data to obtain the target information corresponding to the information to be processed, so that the cleaned target information can be processed further.

Description

Data cleaning method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a data cleaning method, a data cleaning device, computer equipment and a storage medium.
Background
With the development of information technology, the quantity and variety of information have grown explosively. To avoid overloading a system with the full volume of information during data processing (such as profit prediction), the data can be cleaned beforehand according to the data processing requirements, so that the merged and simplified data can be passed to the corresponding system for processing. For example, in the insurance field, to avoid overload from the full information of insurance policies (hereinafter referred to as policies) during profit prediction, data with similar attributes (such as age, gender, channel, etc.) in the policies needs to be merged (i.e., data cleaning) to reduce the data volume and speed up the profit prediction process.
As the amount and variety of information increase, data cleaning rules become more and more complex. They are mainly implemented by hard coding, which has several drawbacks: numerous branch structures, deep nesting, a large amount of code, difficult maintenance (the code is hard to modify and error-prone), and an inability to meet diversified scenario requirements because the rule code is coupled with the system code. In practice, to adapt to different requirements, developers must write corresponding code for each business requirement so that data can be cleaned based on the cleaning rule code, which is time-consuming and inefficient.
Disclosure of Invention
To address the problem that existing data cleaning rules cannot meet the requirements of diversified scenarios, a data cleaning method, a data cleaning device, computer equipment and a storage medium that can meet such requirements are provided.
In order to achieve the above object, the present invention provides a data cleaning method, including:
acquiring rule information for cleaning data, and converting the rule information into a rule link network;
receiving information to be processed, matching the information to be processed with the rule link network by adopting a Rete algorithm, and acquiring the matched data to be processed in the information to be processed together with the rule link data in the rule link network that matches it;
and cleaning the matched data to be processed based on the rule link data to obtain target information.
Optionally, the obtaining rule information for cleaning data and converting the rule information into a rule link network includes:
acquiring the rule information, wherein the rule information at least comprises a piece of rule data;
and converting all the rule data in the rule information into corresponding rule link data respectively, wherein all the rule link data form the rule link network.
Optionally, the rule data includes at least one matching object and one execution object;
the converting all the rule data in the rule information into corresponding rule link data respectively, where all the rule link data form the rule link network, includes:
acquiring the matching objects and execution objects in the rule data, and identifying the association relationship among the matching objects in the rule data;
converting the matching object into a network node, and converting the execution object into an execution event;
constructing a configuration relationship among the network nodes based on the association relationship among the matching objects;
and generating the rule link data according to the execution event, the network nodes and the configuration relationship among the network nodes, wherein all the rule link data form the rule link network.
Optionally, the information to be processed at least includes one piece of data to be processed;
the receiving information to be processed, matching the information to be processed with the rule link network by adopting a Rete algorithm, and acquiring the matched data to be processed in the information to be processed together with the rule link data in the rule link network that matches it, includes:
receiving the information to be processed;
extracting the feature objects in each piece of data to be processed in the information to be processed;
matching the feature object with the matching objects in each piece of rule link data in the rule link network respectively;
and acquiring the matched data to be processed associated with the feature object based on the feature object that matches the rule link data, and taking the rule link data matched by the feature object as the rule link data of that matched data to be processed.
Optionally, the cleaning the matched data to be processed based on the rule link data to obtain target information includes:
executing, according to the execution event in the rule link data, a cleaning operation on the matched data to be processed that matches the rule link data, to obtain a cleaning result;
and generating the target information according to the cleaning results of all the matched data to be processed in the information to be processed.
Optionally, the target information adopts an rpt format.
In order to achieve the above object, the present invention also provides a data cleaning device, comprising:
a conversion unit, configured to acquire rule information for cleaning data and convert the rule information into a rule link network;
a matching unit, configured to receive information to be processed, match the information to be processed with the rule link network by adopting a Rete algorithm, and acquire the matched data to be processed in the information to be processed together with the rule link data in the rule link network that matches it;
and an execution unit, configured to clean the matched data to be processed based on the rule link data to obtain target information.
Optionally, the conversion unit is configured to obtain the rule information, and convert all rule data in the rule information into corresponding rule link data, where all rule link data form the rule link network;
wherein the rule information includes at least one piece of rule data.
To achieve the above object, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
To achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the above method.
With the data cleaning method, the data cleaning device, the computer equipment and the storage medium provided by the invention, the acquired rule information for cleaning data is automatically converted into the rule link network, so the rule information does not need to be converted into rule code by hand; this reduces manpower and material costs, improves efficiency, and adapts to various scenarios. The information to be processed is matched against the rule link network, so that the matched data to be processed and the rule link data it matches can be obtained quickly; the matched data to be processed is then simplified and merged (cleaned) based on the rule link data to obtain the target information corresponding to the information to be processed, so that the cleaned target information can be processed further.
Drawings
FIG. 1 is a flowchart of an embodiment of the data cleaning method according to the present invention;
FIG. 2 is a flowchart of an embodiment of converting rule data into rule link data according to the present invention;
FIG. 3 is a block diagram of an embodiment of the data cleaning apparatus according to the present invention;
FIG. 4 is a block diagram of an embodiment of the conversion unit according to the present invention;
FIG. 5 is a block diagram of an embodiment of the matching unit according to the present invention;
FIG. 6 is a schematic hardware architecture diagram of an embodiment of the computer device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The data cleaning method, the data cleaning device, the computer equipment and the storage medium are suitable for business fields such as insurance, banking and finance. The method automatically converts the acquired rule information into the rule link network, so the rule information does not need to be converted into rule code by hand; this reduces manpower and material costs, improves efficiency, and adapts to various scenarios. The information to be processed is matched against the rule link network, so that the matched data to be processed and the rule link data it matches can be obtained quickly; the matched data to be processed is then simplified and merged (cleaned) based on the rule link data to obtain the target information corresponding to the information to be processed, so that the cleaned target information can be processed further.
Example one
Referring to FIG. 1, the data cleaning method of this embodiment includes the following steps:
S1. Acquiring rule information for cleaning data, and converting the rule information into a rule link network.
In this embodiment, the rule information may be in table form, such as an Excel spreadsheet file. The rule information may be structured data or unstructured data.
In practical applications, the rule information may be written by business personnel according to business requirements.
Further, step S1 includes the following steps:
S11. Acquiring the rule information.
The rule information includes at least one piece of rule data; each piece of rule data comprises a matching item and an execution item, where the matching item comprises at least one matching object and the execution item comprises an execution object.
Take the example in which the rule information includes three pieces of rule data:
first piece of rule data: when condition A and condition B and condition D then Rule Action 1;
second piece of rule data: when condition A and condition B and condition E then Rule Action 2;
third piece of rule data: when condition A and condition C then Rule Action 3;
where the content between when and then is the matching item and the content after then is the execution item. Condition A, condition B, condition C, condition D and condition E are all matching objects; Rule Action 1, Rule Action 2 and Rule Action 3 are all execution objects.
The data cleaning method of this embodiment can be applied to a server, and the acquired rule information can be stored in memory to improve processing speed.
S12. Converting all the rule data in the rule information into corresponding rule link data respectively, wherein all the rule link data form the rule link network.
Specifically, referring to FIG. 2, step S12 may include the following steps:
S121. Acquiring the matching objects and execution objects in the rule data, and identifying the association relationship among the matching objects in the rule data.
In this embodiment, the matching item may be determined according to a preset matching item identifier or a preset function (e.g., when), and the execution item according to a preset execution item identifier or a preset function (e.g., then); the matching objects in the matching item and the execution object in the execution item are then extracted.
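The extraction in S121 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each piece of rule data is a single line of text in a `when ... then ...` form with conditions joined by `and`, and the names `RuleData` and `parse_rule` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleData:
    matching_objects: tuple[str, ...]  # conditions found between "when" and "then"
    execution_object: str              # action found after "then"


# Assumed rule syntax: "when <cond> and <cond> ... then <action>;"
_RULE_PATTERN = re.compile(r"^\s*when\s+(.+?)\s+then\s+(.+?)\s*;?\s*$", re.IGNORECASE)


def parse_rule(text: str) -> RuleData:
    """Split one rule into its matching objects and its execution object."""
    match = _RULE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a when/then rule: {text!r}")
    matching_item, execution_item = match.groups()
    conditions = tuple(
        part.strip()
        for part in re.split(r"\s+and\s+", matching_item, flags=re.IGNORECASE)
    )
    return RuleData(conditions, execution_item.strip())


rule = parse_rule("when condition A and condition B and condition D then Rule Action 1;")
print(rule.matching_objects, rule.execution_object)
```

Here the `and` keyword doubles as the association relationship among the matching objects; a richer rule syntax (for example with `or` or nesting) would need a real parser rather than a regular expression.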
S122. Converting the matching object into a network node, and converting the execution object into an execution event.
In this embodiment, each matching object corresponds to one network node, and the same network node may be configured in multiple pieces of rule link data.
Take the example of a rule link network comprising three pieces of rule link data:
first piece of rule link data: a → b → d → R1;
second piece of rule link data: a → b → e → R2;
third piece of rule link data: a → c → R3;
where a, b, c, d and e are network nodes; R1, R2 and R3 are execution events; network node a and network node b are each configured in more than one piece of rule link data.
S123. Constructing a configuration relationship among the network nodes based on the association relationship among the matching objects.
In this embodiment, each matching object corresponds to a network node and each execution object corresponds to an execution event; the configuration relationship among the network nodes in a piece of rule link data can be constructed from the association relationship among the matching objects in the same piece of rule data.
S124. Generating the rule link data according to the execution event, the network nodes and the configuration relationship among the network nodes, wherein all the rule link data form the rule link network.
In this embodiment, the rule link network is a Rete network, which may be presented as a tree structure diagram.
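Steps S122 to S124 can be illustrated with a simplified prefix tree in which rules that begin with the same matching objects share network nodes. This is only a sketch under that simplification: a full Rete network also has alpha and beta memories, which are omitted here, and the names `Node`, `build_network` and `count_nodes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: dict[str, "Node"] = field(default_factory=dict)
    events: list[str] = field(default_factory=list)  # execution events fired at this node


def build_network(rules):
    """Build a tree in which rules with a common condition prefix share nodes."""
    root = Node("root")
    for conditions, event in rules:
        node = root
        for name in conditions:
            # Reuse the node if an earlier rule already created it.
            if name not in node.children:
                node.children[name] = Node(name)
            node = node.children[name]
        node.events.append(event)
    return root


def count_nodes(node):
    """Count all nodes in the tree, including the artificial root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


rules = [
    (("a", "b", "d"), "R1"),
    (("a", "b", "e"), "R2"),
    (("a", "c"), "R3"),
]
network = build_network(rules)
print(count_nodes(network))  # root + a, b, c, d, e
```

The three rules mention eight conditions in total, but the tree stores only five condition nodes because a and b are shared.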
Compared with the traditional approach, in which developers write rule code based on the rule information provided by business personnel in order to clean data, this embodiment automatically converts the rule information provided by business personnel into the rule link network without manually written code, and offers fast response, high efficiency, easy maintenance, reduced manpower and material costs, and adaptability to various scenarios.
S2. Receiving the information to be processed, matching the information to be processed with the rule link network by adopting a Rete algorithm, and acquiring the matched data to be processed in the information to be processed together with the rule link data in the rule link network that matches it.
Further, step S2 may include:
S21. Receiving the information to be processed.
The information to be processed includes at least one piece of data to be processed.
In this embodiment, the information to be processed may be structured data or unstructured data.
In practical applications, the information to be processed may be an insurance policy, for example one that includes basic information about the user (age, education level, height, gender, ethnicity, physical condition, etc.).
S22. Extracting the feature objects in each piece of data to be processed in the information to be processed.
A feature object is data that represents a feature of the information to be processed, such as age, education level, height, gender, ethnicity or physical condition.
In this embodiment, the information to be processed may include multiple pieces of data to be processed. A given piece of data to be processed may or may not contain a feature object.
S23. Matching the feature object with the matching objects in each piece of rule link data in the rule link network.
In this embodiment, when a piece of data to be processed includes multiple feature objects, the feature objects are matched against each piece of rule link data in the rule link network to obtain the rule link data matched by all of the feature objects.
By way of example and not limitation, the data to be processed may be matched with the rule link data in the rule link network using depth-first traversal or breadth-first traversal.
Depth-first traversal is a graph algorithm, abbreviated DFS (depth-first search). In brief, each possible branch path is followed as deep as it can go, and each node is visited only once. Depth-first traversal of a binary tree is divided into pre-order, in-order and post-order traversal. Pre-order traversal: for any subtree, visit the root first, then traverse the left subtree, and finally traverse the right subtree. In-order traversal: for any subtree, traverse the left subtree first, then visit the root, and finally traverse the right subtree. Post-order traversal: for any subtree, traverse the left subtree first, then the right subtree, and finally visit the root. Depth-first traversal does not retain all nodes, so it occupies little space; however, because it backtracks (i.e., performs push and pop operations on a stack), it runs more slowly.
Breadth-first traversal, also called level-order traversal, visits each level in turn from top to bottom; within each level the nodes are visited from left to right (or from right to left), and the next level is visited only after the current one is finished, until no nodes remain. Breadth-first traversal retains all nodes, so it occupies more space; however, it needs no backtracking (i.e., no push and pop operations), so it runs faster.
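The difference between the two traversal orders can be seen on the example network a → b → d → R1, a → b → e → R2, a → c → R3. The sketch below assumes the network is a tree stored as a plain adjacency dictionary (a hypothetical representation, so no visited set is needed); the depth-first version uses an explicit stack and the breadth-first version a queue.

```python
from collections import deque

# Example rule link network as an adjacency mapping (assumed representation).
NETWORK = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": ["R3"],
    "d": ["R1"],
    "e": ["R2"],
}


def depth_first(graph, start):
    """Pre-order depth-first traversal using an explicit stack."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(graph.get(node, [])))
    return order


def breadth_first(graph, start):
    """Level-order traversal using a FIFO queue."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


print(depth_first(NETWORK, "a"))
print(breadth_first(NETWORK, "a"))
```

Depth-first reaches the execution event R1 after visiting only a, b and d, whereas breadth-first visits every condition node before it reaches any execution event.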
S24. Acquiring the matched data to be processed associated with the feature object based on the feature object that matches the rule link data, and taking the rule link data matched by the feature object as the rule link data of that matched data to be processed.
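Steps S22 to S24 can be sketched as below. This is a naive matcher for illustration, not the Rete algorithm or the Drools engine: each condition is evaluated at most once per record and the result is reused across rules, which only loosely mirrors node sharing. The predicates, field names and the functions `match_record` and `match_information` are all hypothetical.

```python
# Hypothetical matching objects, expressed as predicates over one record.
CONDITIONS = {
    "a": lambda r: r.get("gender") == "F",
    "b": lambda r: isinstance(r.get("age"), int) and r["age"] < 40,
    "c": lambda r: isinstance(r.get("age"), int) and r["age"] >= 40,
    "d": lambda r: r.get("channel") == "agent",
    "e": lambda r: r.get("channel") == "online",
}

# Rule link data: (network nodes, execution event).
RULE_LINKS = [
    (("a", "b", "d"), "R1"),
    (("a", "b", "e"), "R2"),
    (("a", "c"), "R3"),
]


def match_record(record):
    """Return the execution events of every rule link the record satisfies."""
    cache = {}

    def holds(name):
        # Evaluate each shared condition once per record.
        if name not in cache:
            cache[name] = CONDITIONS[name](record)
        return cache[name]

    return [event for nodes, event in RULE_LINKS if all(holds(n) for n in nodes)]


def match_information(records):
    """Pair each matched record with the rule link data it matched."""
    matched = []
    for record in records:
        events = match_record(record)
        if events:
            matched.append((record, events))
    return matched


records = [
    {"gender": "F", "age": 35, "channel": "agent"},
    {"gender": "M", "age": 50, "channel": "online"},
]
print(match_information(records))
```

Only the first record is returned, paired with R1; the second fails condition a and therefore matches no rule link data.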
In this embodiment, the Drools rule engine may be used to match the information to be processed with the rule link network.
In this embodiment, the Rete algorithm is a fast forward-chaining rule matching algorithm whose matching speed is, in theory, largely independent of the number of rules. Rete is Latin for "net", i.e., network. The Rete algorithm performs pattern matching by building a Rete network and exploits two characteristics of rule systems, temporal redundancy and structural similarity, which gives it high matching efficiency.
S3. Cleaning the matched data to be processed based on the rule link data to obtain target information.
In this embodiment, the target information may adopt the rpt format.
Further, step S3 may include:
S31. According to the execution event in the rule link data, executing a cleaning operation on the matched data to be processed that matches the rule link data, to obtain a cleaning result.
In this embodiment, when one piece of data to be processed matches multiple pieces of rule link data, the matched pieces of rule link data may be stored in a conflict set; the execution events (agenda) corresponding to the rule link data in the conflict set are executed in turn in matching order, and the corresponding cleaning result is obtained.
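The conflict-set handling in S31 can be sketched as a list of execution events applied in matching order. The two cleaning actions and the mapping `EXECUTION_EVENTS` are hypothetical examples, and this sketch does not model the agenda of any real rule engine.

```python
def merge_age_band(record):
    """Replace the exact age with a ten-year band so similar records can be merged."""
    cleaned = dict(record)
    age = cleaned.pop("age")
    low = age // 10 * 10
    cleaned["age_band"] = f"{low}-{low + 9}"
    return cleaned


def normalise_channel(record):
    """Trim and lower-case the sales channel."""
    cleaned = dict(record)
    cleaned["channel"] = cleaned["channel"].strip().lower()
    return cleaned


# Hypothetical mapping from execution event name to cleaning action.
EXECUTION_EVENTS = {"R1": merge_age_band, "R2": normalise_channel}


def run_conflict_set(record, conflict_set):
    """Apply the execution events of all matched rule links in matching order."""
    result = record
    for event_name in conflict_set:
        result = EXECUTION_EVENTS[event_name](result)
    return result


cleaned = run_conflict_set({"age": 35, "channel": " Agent "}, ["R1", "R2"])
print(cleaned)
```

Each action returns a copy, so the original record is left untouched and the output of one event becomes the input of the next.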
S32. Generating the target information according to the cleaning results of all the matched data to be processed in the information to be processed.
In this embodiment, the target information corresponds to the information to be processed. When only part of the data to be processed in the information to be processed matches rule link data in the rule link network, the matched data to be processed is regarded as data requiring a cleaning operation and the unmatched data to be processed as data not requiring one; the data that does not require cleaning and the cleaned data obtained after cleaning are merged and output as the target information.
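The merge described for S32 can be sketched as a single pass in which matched records are cleaned and unmatched records pass through unchanged. The source does not say how the merged output is ordered; preserving input order is an assumption here, and `build_target_information`, `match_senior` and `clean_senior` are hypothetical names.

```python
def build_target_information(records, match, clean):
    """Clean matched records, pass unmatched ones through, and merge both."""
    target = []
    for record in records:
        events = match(record)  # rule link data matched by this record
        if events:
            target.append(clean(record, events))  # requires a cleaning operation
        else:
            target.append(record)                 # requires no cleaning
    return target


# Hypothetical matcher and cleaner, for illustration only.
def match_senior(record):
    return ["R1"] if record["age"] >= 60 else []


def clean_senior(record, events):
    cleaned = {key: value for key, value in record.items() if key != "age"}
    cleaned["age_band"] = "60+"
    return cleaned


policies = [{"id": 1, "age": 35}, {"id": 2, "age": 67}]
target_information = build_target_information(policies, match_senior, clean_senior)
print(target_information)
```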
Taking profit prediction on insurance policy data as an example, the data cleaning method of this embodiment can be used to clean (simplify and merge) insurance policies, obtain cleaned policy information in rpt format, and send it to the Prophet software for profit prediction. Prophet is a comprehensive actuarial system covering life insurance, non-life insurance and other lines; it provides the financial services industry with functions such as profit testing, asset valuation and business model setup, and meets many technical requirements of financial services companies, including insurers.
In this embodiment, the data cleaning method automatically converts the acquired rule information into the rule link network, so the rule information does not need to be converted into rule code by hand; this reduces manpower and material costs, improves efficiency, and adapts to various scenarios. The information to be processed is matched against the rule link network, so that the matched data to be processed and the rule link data it matches can be obtained quickly; the matched data to be processed is then simplified and merged (cleaned) based on the rule link data to obtain the target information corresponding to the information to be processed, so that the cleaned target information can be processed further. The data cleaning method separates the rule information from the system code, allows the rule information to be managed centrally, makes extension, maintenance and quick response easier, and reduces the cost and risk of hard coding.
Example two
Referring to FIG. 3, the data cleaning apparatus 1 of this embodiment includes a conversion unit 11, a matching unit 12 and an execution unit 13.
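One way to picture how the three units cooperate is a small container that wires them together. This is a structural sketch only: the class `DataCleaningApparatus`, its field names and the three stand-in functions are hypothetical, and the stand-ins use a deliberately trivial network (a mapping from execution event to required feature names) rather than a Rete network.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataCleaningApparatus:
    convert: Callable  # conversion unit 11: rule information -> rule link network
    match: Callable    # matching unit 12: (network, record) -> matched execution events
    execute: Callable  # execution unit 13: (record, events) -> cleaned record

    def clean(self, rules, records):
        network = self.convert(rules)
        return [self.execute(r, self.match(network, r)) for r in records]


# Trivial stand-ins for the three units (assumptions, for illustration only).
def convert(rules):
    return {event: set(conditions) for conditions, event in rules}


def match(network, record):
    present = {key for key, value in record.items() if value}
    return [event for event, needed in network.items() if needed <= present]


def execute(record, events):
    return {**record, "applied": events}


apparatus = DataCleaningApparatus(convert, match, execute)
result = apparatus.clean(
    [(("smoker",), "R1")],
    [{"smoker": True}, {"smoker": False}],
)
print(result)
```

Keeping the units as separate callables reflects the separation the embodiment describes: the rule information can change without touching the matching or execution code.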
The conversion unit 11 is configured to acquire rule information for cleaning data and convert the rule information into a rule link network.
In this embodiment, the rule information may be in table form, such as an Excel spreadsheet file. The rule information may be structured data or unstructured data.
In practical applications, the rule information may be written by business personnel according to business requirements.
Further, the conversion unit 11 is configured to acquire the rule information.
The rule information includes at least one piece of rule data; each piece of rule data comprises a matching item and an execution item, where the matching item comprises at least one matching object and the execution item comprises an execution object.
Take the example in which the rule information includes three pieces of rule data:
first piece of rule data: when condition A and condition B and condition D then Rule Action 1;
second piece of rule data: when condition A and condition B and condition E then Rule Action 2;
third piece of rule data: when condition A and condition C then Rule Action 3;
where the content between when and then is the matching item and the content after then is the execution item. Condition A, condition B, condition C, condition D and condition E are all matching objects; Rule Action 1, Rule Action 2 and Rule Action 3 are all execution objects.
The data cleaning apparatus 1 of this embodiment may be applied to a server, and the acquired rule information may be stored in memory to improve processing speed.
The conversion unit 11 is further configured to convert all the rule data in the rule information into corresponding rule link data, where all the rule link data form the rule link network.
Specifically, referring to FIG. 4, the conversion unit 11 may include an identification module 111, a conversion module 112, a construction module 113 and a generation module 114.
The identification module 111 is configured to acquire the matching objects and execution objects in the rule data, and identify the association relationship among the matching objects in the rule data.
In this embodiment, the matching item may be determined according to a preset matching item identifier or a preset function (e.g., when), and the execution item according to a preset execution item identifier or a preset function (e.g., then); the matching objects in the matching item and the execution object in the execution item are then extracted.
The conversion module 112 is configured to convert the matching object into a network node, and convert the execution object into an execution event.
In this embodiment, each matching object corresponds to one network node, and the same network node may be configured in multiple pieces of rule link data.
The construction module 113 is configured to construct a configuration relationship among the network nodes based on the association relationship among the matching objects.
In this embodiment, each matching object corresponds to a network node and each execution object corresponds to an execution event; the configuration relationship among the network nodes in a piece of rule link data can be constructed from the association relationship among the matching objects in the same piece of rule data.
The generation module 114 is configured to generate the rule link data according to the execution event, the network nodes and the configuration relationship among the network nodes, where all the rule link data form the rule link network.
In this embodiment, the rule link network is a Rete network, which may be presented as a tree structure diagram.
Compared with the traditional approach, in which developers write rule code based on the rule information provided by business personnel in order to clean data, this embodiment automatically converts the rule information provided by business personnel into the rule link network without manually written code, and offers fast response, high efficiency, easy maintenance, reduced manpower and material costs, and adaptability to various scenarios.
The matching unit 12 is configured to receive information to be processed, match the information to be processed with the rule link network by adopting a Rete algorithm, and acquire the matched data to be processed in the information to be processed together with the rule link data in the rule link network that matches it.
Further, the matching unit 12 may be configured to receive the information to be processed.
The information to be processed includes at least one piece of data to be processed.
In this embodiment, the information to be processed may be structured data or unstructured data.
In practical applications, the information to be processed may be an insurance policy, for example one that includes basic information about the user (age, education level, height, gender, ethnicity, physical condition, etc.).
Specifically, referring to FIG. 5, the matching unit 12 may include an extraction module 121, a matching module 122 and a processing module 123.
The extraction module 121 is configured to extract the feature objects in each piece of data to be processed in the information to be processed.
A feature object is data that represents a feature of the information to be processed, such as age, education level, height, gender, ethnicity or physical condition.
In this embodiment, the information to be processed may include multiple pieces of data to be processed. A given piece of data to be processed may or may not contain a feature object.
The matching module 122 is configured to match the feature object with the matching objects in each piece of rule link data in the rule link network respectively.
In this embodiment, when a piece of data to be processed includes multiple feature objects, the feature objects are matched against each piece of rule link data in the rule link network to obtain the rule link data matched by all of the feature objects.
By way of example and not limitation, the data to be processed may be matched with the rule link data in the rule link network using depth-first traversal or breadth-first traversal.
Depth-first traversal (depth-first search, DFS) is a graph algorithm. In brief, it follows each possible branch as deep as it can go before backtracking, and each node is visited only once. For a binary tree, depth-first traversal is divided into pre-order, in-order and post-order traversal. Pre-order traversal: for any subtree, first visit the root, then traverse the left subtree, and finally traverse the right subtree. In-order traversal: for any subtree, first traverse the left subtree, then visit the root, and finally traverse the right subtree. Post-order traversal: for any subtree, first traverse the left subtree, then traverse the right subtree, and finally visit the root. Depth-first traversal does not need to retain all nodes, so it occupies little space; however, because of its backtracking (stack push and pop operations), it runs relatively slowly.
Breadth-first traversal, also called level-order traversal, visits the layers in sequence from top to bottom. Within each layer the nodes are visited from left to right (or from right to left), and the next layer is visited once the current layer is finished, until no nodes remain. Breadth-first traversal retains all nodes, so it occupies more space; it involves no backtracking (no stack push or pop operations), so it runs relatively fast.
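The traversal orders described above can be illustrated with a minimal Python sketch on a small binary tree. The `Node` class, the function names and the example tree are assumptions made for illustration only; they are not part of the described device.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary-tree node used only for this illustration."""
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def pre_order(node):
    """Depth-first: root, then left subtree, then right subtree."""
    if node is None:
        return []
    return [node.value] + pre_order(node.left) + pre_order(node.right)


def in_order(node):
    """Depth-first: left subtree, then root, then right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)


def post_order(node):
    """Depth-first: left subtree, then right subtree, then root."""
    if node is None:
        return []
    return post_order(node.left) + post_order(node.right) + [node.value]


def breadth_first(root):
    """Level-order: visit each layer top to bottom, left to right."""
    if root is None:
        return []
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.value)
        if node.left is not None:
            queue.append(node.left)
        if node.right is not None:
            queue.append(node.right)
    return order


# Example tree:      A
#                   / \
#                  B   C
#                 / \
#                D   E
tree = Node("A", Node("B", Node("D"), Node("E")), Node("C"))
```

On this tree the pre-order sequence is A, B, D, E, C, while the breadth-first sequence is A, B, C, D, E. The depth-first functions rely on the call stack, whereas the breadth-first function keeps a queue of pending nodes.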
A processing module 123, configured to obtain the to-be-processed matching data associated with the feature object based on the feature object matched with the rule link data, and use the rule link data matched with the feature object as the rule link data of the to-be-processed matching data associated with the feature object.
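The cooperation of the extraction, matching and processing modules can be sketched roughly as follows. This is a simplified illustration, not the patented implementation: the `RuleLink` class, the `FEATURES` tuple, the predicate-based conditions and the sample records are all hypothetical, and the matching here is a plain loop rather than a Rete network.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical set of attribute names treated as feature objects.
FEATURES = ("age", "gender", "height")


@dataclass
class RuleLink:
    """Hypothetical stand-in for one piece of rule link data."""
    name: str
    # Matching objects: feature name -> predicate over the feature value.
    conditions: Dict[str, Callable[[Any], bool]]


def extract_features(record: Dict[str, Any]) -> Dict[str, Any]:
    """Extraction module: pull the feature objects out of one record."""
    return {key: record[key] for key in FEATURES if key in record}


def matches(rule: RuleLink, features: Dict[str, Any]) -> bool:
    """Matching module: every matching object must be satisfied."""
    return all(
        name in features and test(features[name])
        for name, test in rule.conditions.items()
    )


def match_information(
    information: List[Dict[str, Any]], network: List[RuleLink]
) -> List[Tuple[Dict[str, Any], List[RuleLink]]]:
    """Processing module: pair each matched record with its rule links."""
    matched = []
    for record in information:
        features = extract_features(record)
        rules = [rule for rule in network if matches(rule, features)]
        if rules:  # records without any matching rule are left out
            matched.append((record, rules))
    return matched


network = [
    RuleLink("minor", {"age": lambda a: a < 18}),
    RuleLink("tall_adult", {"age": lambda a: a >= 18,
                            "height": lambda h: h >= 180}),
]
information = [
    {"id": 1, "age": 12, "height": 150},
    {"id": 2, "age": 40, "height": 185},
    {"id": 3, "note": "contains no feature object"},
]
result = match_information(information, network)
```

In this example, record 1 is paired with the `minor` rule, record 2 with `tall_adult`, and record 3 contains no feature object and is therefore not returned as matching data.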
In this embodiment, a Drools rule engine may be used to match the information to be processed with the rule link network.
In this embodiment, the Rete algorithm is a fast forward-chaining rule matching algorithm whose matching speed is largely independent of the number of rules. "Rete" is Latin for "net", i.e. network. The Rete algorithm performs pattern matching by building a Rete network and exploits two characteristics of rule systems, temporal redundancy and structural similarity, to achieve high matching efficiency.
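The two characteristics named above can be illustrated with a heavily simplified Python sketch. It is not a full Rete implementation (there are no join nodes or partial-match memories) and it is not the Drools API; the `ConditionNode` and `Network` classes, the condition keys and the sample fact are assumptions for illustration. Structural similarity is modelled by letting rules share a node for an identical condition, and temporal redundancy by caching each node's result per fact so that an unchanged fact is not re-tested.

```python
class ConditionNode:
    """One shared test on a single attribute (loosely, an alpha-style node)."""

    def __init__(self, attribute, predicate):
        self.attribute = attribute
        self.predicate = predicate
        self.memory = {}       # fact id -> cached result of the test
        self.evaluations = 0   # how many times the predicate actually ran

    def test(self, fact_id, fact):
        # Temporal redundancy: reuse the cached result for a known fact.
        if fact_id not in self.memory:
            self.evaluations += 1
            self.memory[fact_id] = self.predicate(fact.get(self.attribute))
        return self.memory[fact_id]


class Network:
    """Hypothetical, minimal rule network built from shared condition nodes."""

    def __init__(self):
        self.nodes = {}   # condition key -> shared ConditionNode
        self.rules = {}   # rule name -> list of ConditionNode

    def add_rule(self, name, conditions):
        # conditions: list of (key, attribute, predicate) tuples.
        # Structural similarity: an existing node is reused when the key repeats.
        self.rules[name] = [
            self.nodes.setdefault(key, ConditionNode(attribute, predicate))
            for key, attribute, predicate in conditions
        ]

    def match(self, fact_id, fact):
        """Return the names of all rules whose conditions hold for the fact."""
        return [
            name
            for name, nodes in self.rules.items()
            if all(node.test(fact_id, fact) for node in nodes)
        ]


net = Network()
net.add_rule("adult_female", [("age>=18", "age", lambda a: a >= 18),
                              ("gender==F", "gender", lambda g: g == "F")])
net.add_rule("adult_male", [("age>=18", "age", lambda a: a >= 18),
                            ("gender==M", "gender", lambda g: g == "M")])

fact = {"age": 30, "gender": "F"}
first = net.match(1, fact)
second = net.match(1, fact)   # served from the node memories
```

Although two rules test `age >= 18`, the network holds only three condition nodes, and the age predicate runs once even though the fact is matched twice.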
The execution unit 13 is configured to clean the to-be-processed matching data based on the rule link data to obtain target information.
In this embodiment, the target information may adopt an rpt format.
Further, the execution unit 13 may execute a cleaning operation on the to-be-processed matching data matched with the rule link data according to the execution event in the rule link data, so as to obtain a cleaning result.
In this embodiment, when one piece of data to be processed matches a plurality of pieces of rule link data, the matched pieces of rule link data may be stored in one conflict set. The execution events (agenda) corresponding to the rule link data in the conflict set are then executed in sequence, following the matching order, to obtain the corresponding cleaning result.
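A minimal sketch of sequential execution over a conflict set is given below. It assumes that each execution event can be modelled as a function from a record to a cleaned record; the event functions and the sample record are hypothetical and serve only to show the ordering.

```python
def trim_name(record):
    """Hypothetical execution event: strip surrounding whitespace from the name."""
    return {**record, "name": record["name"].strip()}


def upper_gender(record):
    """Hypothetical execution event: normalise the gender code to upper case."""
    return {**record, "gender": record["gender"].upper()}


def run_conflict_set(record, conflict_set):
    """Apply the execution events one after another, in matching order."""
    cleaned = dict(record)        # leave the input record untouched
    for event in conflict_set:
        cleaned = event(cleaned)  # each event sees the previous event's output
    return cleaned


record = {"name": "  Li Lei ", "gender": "m"}
conflict_set = [trim_name, upper_gender]   # order in which the rules matched
cleaning_result = run_conflict_set(record, conflict_set)
```

Here `cleaning_result` holds the trimmed name and the upper-case gender code, while `record` itself is left unchanged.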
The execution unit 13 may further generate the target information according to the cleaning result of all the matching data to be processed in the information to be processed.
In this embodiment, the target information corresponds to the information to be processed. When only part of the data to be processed in the information to be processed matches rule link data in the rule link network, the matched data is treated as data that requires a cleaning operation and the unmatched data as data that does not. The data that does not require cleaning and the cleaned data obtained after cleaning are then merged and output as the target information.
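The merging step can be sketched as follows, assuming records are dictionaries and that the matching result is available as a mapping from record position to a list of execution events. Both the `matched_events` mapping and the choice to preserve the input order in the output are illustrative assumptions, not details stated in the source.

```python
def build_target_information(information, matched_events):
    """Merge cleaned records with records that needed no cleaning.

    information:     list of records (dicts) to be processed
    matched_events:  hypothetical mapping {record index: [event, ...]}
                     for the records that matched rule link data
    """
    target = []
    for index, record in enumerate(information):
        events = matched_events.get(index)
        if events:
            cleaned = dict(record)
            for event in events:
                cleaned = event(cleaned)
            target.append(cleaned)   # matched data: cleaned copy
        else:
            target.append(record)    # unmatched data: passed through as-is
    return target


def drop_empty_fields(record):
    """Hypothetical execution event: remove fields whose value is empty."""
    return {key: value for key, value in record.items() if value not in ("", None)}


information = [
    {"id": 1, "age": 30, "remark": ""},
    {"id": 2, "age": 45, "remark": "ok"},
]
target_information = build_target_information(information, {0: [drop_empty_fields]})
```

The first record matched a rule and loses its empty `remark` field; the second record matched nothing and is carried into the target information unchanged.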
Taking profit prediction on insurance policy data as an example, the data cleaning device 1 of this embodiment can clean (simplify and merge) insurance policy data to obtain cleaned policy information in rpt format, and send that information to the Prophet software for profit prediction. Prophet is a comprehensive actuarial system covering life insurance, non-life insurance and other parts; it provides the financial services industry with functions such as profit testing, asset valuation and business model setup, and thereby meets many of the technical needs of financial services companies, including insurance companies.
In this embodiment, the data cleaning device 1 automatically converts the acquired rule information for cleaning data into a rule link network through the conversion unit 11, without manually converting the rule information into rule code, which reduces manpower and material costs, improves efficiency, and adapts to various scenarios. The matching unit 12 matches the information to be processed against the rule link network so as to quickly obtain the to-be-processed matching data that matches the rule link data, and the execution unit 13 simplifies and merges (cleans) the to-be-processed matching data based on the rule link data to obtain the target information corresponding to the information to be processed, so that the cleaned target information can be processed further. The data cleaning device 1 thus separates the rule information from the system code, allows the rule information to be managed centrally, is easy to extend and maintain, responds quickly, and reduces the cost and risk of hard coding.
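As a rough illustration of the conversion performed by the conversion unit 11, the sketch below turns one piece of declarative rule data into network nodes, a configuration relationship and an execution event. The rule-data layout (`match`, `relation`, `execute`), the class names and the supported operators are assumptions chosen for this example; the source does not specify a concrete rule format.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical operator table for matching objects.
OPERATORS = {"==": operator.eq, "<": operator.lt, ">=": operator.ge}


@dataclass(frozen=True)
class NetworkNode:
    """A matching object converted into a network node."""
    attribute: str
    op: str
    value: Any


@dataclass
class RuleLinkData:
    nodes: List[NetworkNode]
    relation: str                                   # "and" / "or" between nodes
    event: Callable[[Dict[str, Any]], Dict[str, Any]]  # execution event


def to_rule_link(rule_data: Dict[str, Any]) -> RuleLinkData:
    """Convert one piece of rule data into rule link data."""
    nodes = [NetworkNode(m["attribute"], m["op"], m["value"])
             for m in rule_data["match"]]
    action = rule_data["execute"]

    def event(record):
        # Execution object -> execution event: set one field on a copy.
        return {**record, action["field"]: action["value"]}

    return RuleLinkData(nodes, rule_data.get("relation", "and"), event)


def rule_matches(link: RuleLinkData, record: Dict[str, Any]) -> bool:
    """Evaluate the nodes and combine them using the configured relation."""
    results = [
        node.attribute in record
        and OPERATORS[node.op](record[node.attribute], node.value)
        for node in link.nodes
    ]
    return all(results) if link.relation == "and" else any(results)


rule_data = {
    "match": [{"attribute": "age", "op": ">=", "value": 18}],
    "relation": "and",
    "execute": {"field": "age_group", "value": "adult"},
}
link = to_rule_link(rule_data)
record = {"age": 30}
cleaned = link.event(record) if rule_matches(link, record) else record
```

Because the rule is plain data rather than code, business personnel could change the threshold or the action without touching the matching or execution logic, which is the separation of rule information from system code described above.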
Example three
In order to achieve the above object, the present invention further provides a computer device 2. There may be a plurality of computer devices 2, and the components of the data cleaning device 1 according to the second embodiment may be distributed across different computer devices 2. The computer device 2 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, or a tower server (including an independent server or a server cluster formed by a plurality of servers) that executes a program, or the like. The computer device 2 of this embodiment includes at least, but is not limited to: a memory 21, a processor 23, a network interface 22, and the data cleaning device 1 (refer to fig. 6), which are communicatively connected to each other through a system bus. It is noted that fig. 6 only shows the computer device 2 with some components; not all of the shown components are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device 2. Of course, the memory 21 may also comprise both an internal storage unit of the computer device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used for storing an operating system installed in the computer device 2 and various application software, such as the program code of the data cleaning method in the first embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 23 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 23 is typically used to control the overall operation of the computer device 2, for example to perform control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 23 is configured to run the program code stored in the memory 21 or to process data, for example to run the data cleaning device 1.
The network interface 22 may comprise a wireless network interface or a wired network interface, and is typically used to establish a communication connection between the computer device 2 and other computer devices 2. For example, the network interface 22 is used to connect the computer device 2 to an external terminal through a network and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.
It is noted that fig. 6 only shows the computer device 2 with components 21-23, but it is to be understood that not all of the shown components are required, and that more or fewer components may be implemented instead.
In this embodiment, the data cleaning device 1 stored in the memory 21 may be further divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 23) to implement the present invention.
Example four
To achieve the above objects, the present invention also provides a computer-readable storage medium including a plurality of storage media such as a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by the processor 23, implements corresponding functions. The computer readable storage medium of the embodiment is used for storing the data cleaning device 1, and when being executed by the processor 23, the computer readable storage medium implements the data cleaning method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for data cleansing, comprising:
acquiring rule information for cleaning data, and converting the rule information into a rule link network;
receiving information to be processed, matching the information to be processed with the rule link network by adopting a Rete algorithm, and acquiring matching data to be processed in the information to be processed and rule link data in the rule link network matched with the matching data to be processed;
and cleaning the matching data to be processed based on the rule link data to obtain target information.
2. The data cleansing method according to claim 1, wherein the obtaining rule information for cleansing data, and converting the rule information into a rule link network comprises:
acquiring the rule information, wherein the rule information at least comprises a piece of rule data;
and converting all the rule data in the rule information into corresponding rule link data respectively, wherein all the rule link data form the rule link network.
3. The data cleansing method according to claim 2, wherein the rule data includes at least one matching object and one execution object;
the converting all the rule data in the rule information into corresponding rule link data respectively, where all the rule link data form the rule link network, and the converting includes:
acquiring matching objects and execution objects in the rule data, and identifying the incidence relation among the matching objects in the rule data;
converting the matching object into a network node, and converting the execution object into an execution event;
constructing a configuration relationship among the network nodes based on the association relationship among the matched objects;
and generating the regular link data according to the execution event, the network nodes and the configuration relationship among the network nodes, wherein all the regular link data form the regular link network.
4. The data cleansing method according to claim 3, wherein the information to be processed includes at least one piece of data to be processed;
the receiving information to be processed, matching the information to be processed with the rule link network by adopting a Rete algorithm, and acquiring the matching data to be processed in the information to be processed and the rule link data in the rule link network matched with the matching data to be processed, includes:
receiving information to be processed;
extracting a feature object from each piece of data to be processed in the information to be processed;
matching the feature object with a matching object in each piece of rule link data in the rule link network respectively;
and acquiring the matching data to be processed associated with the feature object based on the feature object matched with the rule link data, and taking the rule link data matched with the feature object as the rule link data of the matching data to be processed associated with the feature object.
5. The data cleansing method according to claim 3, wherein the cleaning the matching data to be processed based on the rule link data to obtain target information comprises:
according to the execution event in the rule link data, executing a cleaning operation on the matching data to be processed that is matched with the rule link data to obtain a cleaning result;
and generating the target information according to the cleaning results of all the matched data to be processed in the information to be processed.
6. The data cleansing method according to claim 1 or 5, wherein the target information is in rpt format.
7. A data cleansing apparatus, comprising:
the conversion unit is used for acquiring rule information used for cleaning data and converting the rule information into a rule link network;
the matching unit is used for receiving information to be processed, matching the information to be processed with the rule link network by adopting a Rete algorithm, and acquiring matching data to be processed in the information to be processed and rule link data in the rule link network matched with the matching data to be processed;
and the execution unit is used for cleaning the matching data to be processed based on the rule link data so as to obtain target information.
8. The data cleaning apparatus according to claim 7, wherein the converting unit is configured to obtain the rule information, and convert all rule data in the rule information into the corresponding rule link data, respectively, where all the rule link data form the rule link network;
wherein the rule information includes at least one piece of rule data.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202111520378.0A 2021-12-13 2021-12-13 Data cleaning method and device, computer equipment and storage medium Pending CN114168581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111520378.0A CN114168581A (en) 2021-12-13 2021-12-13 Data cleaning method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111520378.0A CN114168581A (en) 2021-12-13 2021-12-13 Data cleaning method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114168581A true CN114168581A (en) 2022-03-11

Family

ID=80486346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111520378.0A Pending CN114168581A (en) 2021-12-13 2021-12-13 Data cleaning method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114168581A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023185972A1 (en) * 2022-03-31 2023-10-05 阿里巴巴达摩院(杭州)科技有限公司 Data processing method and apparatus, and electronic device

Similar Documents

Publication Publication Date Title
CN111339073A (en) Real-time data processing method and device, electronic equipment and readable storage medium
CN110990403A (en) Business data storage method, system, computer equipment and storage medium
CN112667860A (en) Sub-graph matching method, device, equipment and storage medium
CN110275889B (en) Feature processing method and device suitable for machine learning
CN114168581A (en) Data cleaning method and device, computer equipment and storage medium
CN113434506A (en) Data management and retrieval method and device, computer equipment and readable storage medium
CN111124883B (en) Test case library introduction method, system and equipment based on tree form
CN112506869A (en) File processing method, device and system
CN112598289A (en) Index configuration method, system, computer device and computer readable storage medium
CN111522840A (en) Label configuration method, device, equipment and computer readable storage medium
CN112860412B (en) Service data processing method and device, electronic equipment and storage medium
CN115408546A (en) Time sequence data management method, device, equipment and storage medium
CN114511314A (en) Payment account management method and device, computer equipment and storage medium
CN114519071A (en) Generation method, matching method, system, device and medium of rule matching model
CN113868138A (en) Method, system, equipment and storage medium for acquiring test data
CN113672771A (en) Data entry processing method and device, medium and electronic equipment
CN114218261A (en) Data query method and device, storage medium and electronic equipment
CN113342647A (en) Test data generation method and device
CN112632266A (en) Data writing method and device, computer equipment and readable storage medium
US11916807B2 (en) Evaluation framework for cloud resource optimization
CN116383454B (en) Data query method of graph database, electronic equipment and storage medium
CN112402955B (en) Game log recording method and system
CN113987785B (en) Management method and device for complete information of algorithm block of nuclear power station DCS system
CN117539946A (en) Service implementation method, device, computer equipment and storage medium
CN117555487A (en) Data splitting method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination