CN113342788A - Big data based data cleaning method and cloud server - Google Patents

Info

Publication number
CN113342788A
Authority
CN
China
Prior art keywords
data
service data
cleaned
service
information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110554388.XA
Other languages
Chinese (zh)
Inventor
李孔雀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Application filed by Individual
Priority to CN202110554388.XA
Publication of CN113342788A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

According to the big data based data cleaning method and the cloud server, the correspondence between the data cleaning reference information and the data cleaning index is taken into account in order to determine the service data block information. Information can then be recombined on the basis of the association priority and the service requirement information, and the data is sorted according to the determined service data sorting mode. In this way, the service data list can be determined completely.

Description

Big data based data cleaning method and cloud server
This application is a divisional application of application No. 202011385263.0, filed on "12/01/2020" and entitled "data cleaning method and cloud server applied to big data and deep learning".
Technical Field
The application relates to the technical field of big data and artificial intelligence, in particular to a data cleaning method based on big data and a cloud server.
Background
With the rapid development of big data, virtually every sector of modern society has come to depend on it. The continuous transformation and upgrading of the digital society frees up labor, which accelerates economic development and improves people's quality of life and production efficiency.
In many business fields, business handling and business interaction are mostly performed on the basis of business data. As the data scale keeps expanding, however, the quantity of business data also grows, which may delay normal business handling and business interaction. To mitigate this problem, the business data usually needs to undergo some degree of data cleaning. However, after the business data has been cleaned with a common data cleaning technique, it may no longer be usable normally.
Disclosure of Invention
A first aspect of the present application discloses a data cleaning method based on big data, where service data to be processed includes multiple sets of service data to be cleaned and multiple sets of cleaned service data in a cleaning intermediate process, the method including:
comparing the service data difference information of the current service data to be cleaned with the service data of the previous group, and processing the service data difference information based on a plurality of groups of cleaned service data before the current service data to be cleaned so as to determine the data cleaning reference information of the current service data to be cleaned;
acquiring the service data characteristics of the current service data to be cleaned; determining a data cleaning index of the current service data to be cleaned according to the service data characteristics; processing the current business data to be cleaned according to the data cleaning reference information and the data cleaning index to obtain a business data list corresponding to the current business data to be cleaned;
and cleaning the abnormal service data of the current service data to be cleaned based on the service data list.
In a preferred embodiment, the comparing the service data difference information between the current service data to be cleaned and the previous service data group, and processing the service data difference information based on a plurality of groups of cleaned service data before the current service data to be cleaned to determine the data cleaning reference information of the current service data to be cleaned includes:
comparing the business behavior difference information of the current business data to be cleaned and the previous group of business data through a business behavior analysis model;
determining whether the current service data to be cleaned is a candidate service data mining group or not based on the service behavior difference information;
if yes, comparing the business data difference information of the current business data to be cleaned and the previous group of business data based on data classification records;
and performing difference information analysis on the service data difference information based on a plurality of groups of data classification records of the cleaned service data before the current service data to be cleaned so as to determine data cleaning reference information of the current service data to be cleaned.
In a preferred embodiment, the performing difference information analysis on the difference information of the service data based on the data classification records of the plurality of groups of cleaned service data before the current service data to be cleaned to determine the data cleaning reference information of the current service data to be cleaned includes:
performing difference information analysis on the difference information of the service data based on a plurality of groups of data classification records of the cleaned service data before the current service data to be cleaned to obtain a first difference information description value of the current service data to be cleaned;
comparing the first difference information description value to a first description value threshold and a second description value threshold, the first description value threshold being less than the second description value threshold;
if the first difference information description value is smaller than the first description value threshold value, the data cleaning reference information of the current to-be-cleaned service data indicates that the current to-be-cleaned service data is not a dynamic service data group;
if the first difference information description value is larger than the second description value threshold value, the data cleaning reference information of the current to-be-cleaned service data represents that the current to-be-cleaned service data is a dynamic service data group;
if the first difference information description value is larger than the first description value threshold value and smaller than the second description value threshold value, the data cleaning reference information of the current to-be-cleaned service data represents that the current to-be-cleaned service data is an interactive service data group;
wherein the method further comprises:
detecting a comparison result of a first difference information description value of continuous groups of service data to be cleaned and the first description value threshold value and the second description value threshold value;
if the first difference information description values of a preset number of continuous groups of service data to be cleaned are all larger than a first description value threshold value, taking the first group of service data to be cleaned larger than the first description value threshold value as an initial dynamic service data group;
determining the record difference between accumulated data classification records that are a set number of groups apart, and performing difference information analysis on the record difference to obtain second difference information description values of the service data to be cleaned that are the set number of groups apart;
comparing a second difference information description value with the second description value threshold value, and comparing a first difference information description value of a preset number of continuous groups of service data to be cleaned with the first description value threshold value;
and if the second difference information description value of the current service data to be cleaned is greater than a second description value threshold value and the first difference information description values of a preset number of continuous groups of service data to be cleaned are less than a first description value threshold value, taking the current service data to be cleaned as a system service data group.
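By way of illustration only, the two-threshold comparison and the detection of the initial dynamic service data group described above can be sketched as follows. The names `GroupKind`, `classify_group` and `find_initial_dynamic_group` are hypothetical and do not appear in the application, and the description values are assumed to be plain numbers. The text does not say what happens when a value equals a threshold, so the sketch returns `None` in that case rather than guessing.

```python
from enum import Enum
from typing import Optional, Sequence


class GroupKind(Enum):
    NOT_DYNAMIC = "not_dynamic"
    INTERACTIVE = "interactive"
    DYNAMIC = "dynamic"


def classify_group(value: float, low: float, high: float) -> Optional[GroupKind]:
    """Map a first difference information description value to a group kind.

    `low` is the first description value threshold and `high` the second;
    the first must be smaller than the second.
    """
    if not low < high:
        raise ValueError("the first threshold must be smaller than the second")
    if value < low:
        return GroupKind.NOT_DYNAMIC
    if value > high:
        return GroupKind.DYNAMIC
    if low < value < high:
        return GroupKind.INTERACTIVE
    # value equals one of the thresholds: not covered by the text
    return None


def find_initial_dynamic_group(values: Sequence[float], low: float,
                               run_length: int) -> Optional[int]:
    """Return the index of the first group of the first run of `run_length`
    consecutive values that all exceed `low`, or None if there is no such run."""
    run_start = None
    for i, v in enumerate(values):
        if v > low:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= run_length:
                return run_start
        else:
            run_start = None
    return None
```

For example, with a first threshold of 0.3 and a preset number of 3, the sequence 0.1, 0.4, 0.2, 0.5, 0.6, 0.8 yields index 3, because the run 0.5, 0.6, 0.8 is the first run of three consecutive values above the first threshold.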
In a preferred embodiment, the determining a data cleansing index of the current service data to be cleansed according to the service data feature includes:
and determining a data cleaning index of the current service data to be cleaned according to the service data calling frequency, the proportion of shared service data, the service data fault tolerance rate and the service data correlation matrix as service data characteristics.
In a preferred embodiment, the determining a data cleaning index of the current service data to be cleaned according to the service data calling frequency, the proportion of shared service data, the service data fault tolerance, and the service data correlation matrix as service data features includes:
comparing the service data calling frequency with a calling frequency description value threshold, and comparing the proportion of shared service data with a set proportion;
if the calling frequency of the service data is less than or equal to the calling frequency description value threshold value and the proportion of the shared service data is less than or equal to the set proportion, determining that the data cleaning index of the current service data to be cleaned represents an incomplete data cleaning index;
if the calling frequency of the business data is greater than the calling frequency description value threshold value or the proportion of the shared business data is greater than the set proportion, determining that the data cleaning index of the current business data to be cleaned represents a repeated data cleaning index;
and comparing the service data fault tolerance rate and the service data correlation matrix with a preset value, and comparing the average calling frequency with the calling frequency description value threshold value to determine that the data cleaning index of the current service data to be cleaned represents an error data cleaning index.
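By way of illustration only, the first two determinations above can be sketched as a single function. The names `CleaningIndex` and `cleaning_index` are hypothetical, and the calling frequency and the proportion of shared service data are assumed to be plain numbers. The determination of the error data cleaning index is not described in enough detail to be sketched and is therefore omitted.

```python
from enum import Enum


class CleaningIndex(Enum):
    INCOMPLETE = "incomplete_data"
    REPEATED = "repeated_data"


def cleaning_index(call_frequency: float, shared_ratio: float,
                   frequency_threshold: float, ratio_limit: float) -> CleaningIndex:
    """Choose between the incomplete and repeated data cleaning indexes.

    Incomplete: frequency <= threshold AND shared ratio <= set proportion.
    Repeated:   frequency >  threshold OR  shared ratio >  set proportion.
    """
    if call_frequency <= frequency_threshold and shared_ratio <= ratio_limit:
        return CleaningIndex.INCOMPLETE
    return CleaningIndex.REPEATED
```

The two conditions are logical complements of each other, so a single test is enough to separate them and every input falls into exactly one of the two cases.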
In a preferred embodiment, the processing the current service data to be cleaned according to the data cleaning reference information and the data cleaning index to obtain a service data list corresponding to the current service data to be cleaned includes:
determining a corresponding relation between the data cleaning reference information and the data cleaning index, and acquiring service data block information of the current service data to be cleaned based on the corresponding relation, wherein the service data block information comprises association priority of data block sequence information and data block association information;
determining a business data arrangement mode;
determining whether the service requirement information corresponding to the minimum service data block in the current service data to be cleaned needs to be analyzed according to the association priority of the data block association information and the data block sequence information;
if the analysis is needed, performing information screening on at least part of service demand information of at least one group of current service data to be cleaned to obtain service demand information corresponding to the minimum service data block;
determining whether business demand information corresponding to the minimum business data block needs to be recombined again by utilizing the business data block information; if the business requirement information needs to be recombined again, generating new business requirement information, and performing data sorting based on the business data sorting mode to obtain the business data list;
wherein, the data block sequence information includes the service duration of the current service data to be cleaned, the number of service events of the current service data to be cleaned, and the service priority, the method further includes:
judging whether the sequence priority of the data block sequence information, the service priority and the associated priority of the data block associated information are the same;
if the sequence priority of the data block sequence information, the service priority and the associated priority of the data block associated information are the same, judging whether a time sequence characteristic label is set according to the service duration of the multiple groups of current service data to be cleaned when the service data sorting mode is a time sequence sorting mode; if the time sequence feature tag is set, performing data sorting on the multiple groups of current service data to be cleaned based on the time sequence feature tag; if the time sequence feature tag is not set, data sorting is carried out on the multiple groups of current service data to be cleaned;
when the business data sorting mode is the event sorting mode, judging whether a business event label is set according to the number of the business events of the multiple groups of current business data to be cleaned; if the business event label is set, data processing and sorting are carried out on the multiple groups of current business data to be cleaned based on the business event label; if the service event label is not set, data sorting is carried out on the multiple groups of current service data to be cleaned;
if the sequence priority of the data block sequence information, the service priority or the associated priority of the data block associated information are different, selecting one group of current service data to be cleaned from the multiple groups of current service data to be cleaned as reference time sequence service data when the service data sorting mode is the time sequence sorting mode, and judging whether to set the time sequence feature tag according to the service duration of the multiple groups of current service data to be cleaned; if the time sequence characteristic label is set, screening other current service data to be cleaned by using the service data block information of the reference time sequence service data and the time sequence characteristic label, and performing data sorting; if the time sequence characteristic label is not set, screening the other current service data to be cleaned by using the service data block information of the reference time sequence service data, and sorting the data; when the business data sorting mode is the event sorting mode, selecting one group of current business data to be cleaned from the multiple groups of current business data to be cleaned as reference event business data, and judging whether the business event label is set according to the number of business events of the multiple groups of current business data to be cleaned; if the service event label is set, screening the other current service data to be cleaned by using the service data block information of the reference event service data and the service event label, and performing data sorting; and if the service event label is not set, screening the other current service data to be cleaned by using the service data block information of the reference event service data, and sorting the data.
In a preferred embodiment, the service data block information further includes a data block configuration record, and the method further includes:
when the service data arrangement mode is the time sequence arrangement mode, judging whether the service duration of the multiple groups of service data to be cleaned currently are the same according to the data block sequence information; if so, performing data sorting on the multiple groups of current service data to be cleaned according to the data block configuration record; if not, setting the label weight of the time sequence feature label according to the service duration of the multiple groups of current service data to be cleaned, and performing data sorting on the multiple groups of current service data to be cleaned according to the time sequence feature label;
when the business data sorting mode is the event sorting mode, judging whether the number of the business events of the multiple groups of current business data to be cleaned is the same according to the data block sequence information; if so, performing data sorting on the multiple groups of current service data to be cleaned according to the service data formats and service data storage paths of the multiple groups of current service data to be cleaned; if not, setting the label weight of the service event label according to the number of the service events of the plurality of groups of current service data to be cleaned and the service event label to perform data sorting on the plurality of groups of current service data to be cleaned; the service data format comprises a modifiable format and a non-modifiable format, when the service data format is the modifiable format, the service requirement information comprises real-time service requirement information and delay service requirement information, and the service data storage path comprises a service data authority access path and a service data calling path.
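By way of illustration only, the choice of sort key in the two sorting modes of this embodiment can be sketched as follows. Everything named here (`Group`, `sort_groups`, the mode strings `"time"` and `"event"`) is hypothetical. The sketch further assumes that the data block configuration record can be reduced to an orderable integer, that the label weight is simply the service duration or the number of service events, and that sorting is ascending; the application does not specify any of these.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Group:
    name: str
    service_duration: float
    event_count: int
    config_record: int   # assumed orderable key from the data block configuration record
    data_format: str     # "modifiable" or "non-modifiable"
    storage_path: str


def sort_groups(groups: List[Group], mode: str) -> List[Group]:
    """Sort groups of service data to be cleaned in the given sorting mode."""
    if mode == "time":
        if len({g.service_duration for g in groups}) <= 1:
            # identical durations: fall back to the configuration record
            return sorted(groups, key=lambda g: g.config_record)
        # assumption: the time sequence label weight equals the duration
        return sorted(groups, key=lambda g: g.service_duration)
    if mode == "event":
        if len({g.event_count for g in groups}) <= 1:
            # identical event counts: use data format, then storage path
            return sorted(groups, key=lambda g: (g.data_format, g.storage_path))
        # assumption: the event label weight equals the event count
        return sorted(groups, key=lambda g: g.event_count)
    raise ValueError(f"unknown sorting mode: {mode!r}")
```

The fallback keys only come into play when the primary quantity (duration or event count) cannot distinguish the groups, which mirrors the "if so / if not" branches of the embodiment.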
In a preferred embodiment, performing abnormal service data cleaning on the current service data to be cleaned based on the service data list includes:
determining list structure characteristic information, list area characteristic information and list grouping characteristic information of a service data list;
determining first service data distribution information corresponding to the service data list based on the list grouping feature information of the service data list and the list grouping feature information of a reference service data list, wherein the reference service data list is a service data list which comprises three list grouping features with different feature dimensions and the total number of the included list grouping features is greater than a first set number, and the list generation time of the reference service data list is before the list generation time of the service data list;
determining an abnormal service data distribution result corresponding to the service data list based on the list structure characteristic information and the list region characteristic information of the service data list, the abnormal service data marking information and the abnormal service data clustering information corresponding to the previous service data list, and the first service data distribution information, wherein the abnormal service data distribution result at least comprises second service data distribution information, and the abnormal service data distribution result corresponding to the service data list refers to an abnormal service data distribution result of a service data processing terminal when the service data list is generated;
if the distribution information error between the first service data distribution information and the second service data distribution information is larger than a set error threshold value, determining that the service data list is a key service data list, and determining third service data distribution information, corresponding abnormal service data marks and abnormal service data clusters of all key service data lists in a service data processing environment based on the first service data distribution information and the abnormal service data distribution result;
performing abnormal service data cleaning on the current service data to be cleaned through third service data distribution information of all key service data lists in the service data processing environment, corresponding abnormal service data marks and abnormal service data clusters to obtain target service data;
the determining of the first service data distribution information corresponding to the service data list based on the list grouping feature information of the service data list and the list grouping feature information of the reference service data list includes:
determining different feature dimension description information of the list grouping feature of each different feature dimension in the service data list under a service data interaction environment corresponding to the service data list based on the list grouping feature information of the service data list to obtain different feature dimension description information of the list grouping feature of three different feature dimensions in the service data list;
acquiring different feature dimension description information of list grouping features of three different feature dimensions in the reference service data list based on the list grouping feature information of the reference service data list;
acquiring fourth service data distribution information corresponding to the reference service data list;
determining first service data distribution information corresponding to the service data list based on different feature dimension description information of list grouping features in the service data list, different feature dimension description information of list grouping features in the reference service data list and the fourth service data distribution information;
wherein, the determining the distribution result of the abnormal service data corresponding to the service data list based on the list structure characteristic information and the list region characteristic information of the service data list, the abnormal service data marking information and the abnormal service data clustering information corresponding to the previous service data list, and the first service data distribution information includes:
determining an initial abnormal service data distribution result of the service data list based on the abnormal service data distribution result of the last service data list; determining list structure characteristics indicated by the list structure characteristic information and list region characteristics indicated by the list region characteristic information of the service data list, and determining abnormal service data marks indicated by abnormal service data mark information corresponding to the previous service data list and abnormal service data clusters indicated by abnormal service data cluster information;
determining a first mapping list of abnormal service data marks corresponding to the last service data list in the service data list and a first mapping data cluster of abnormal service data clusters corresponding to the last service data list in the service data list based on an initial abnormal service data distribution result of the service data list;
determining a target list structure feature matched with the first mapping list in the list structure features of the service data list, and determining a target list area feature matched with the first mapping data cluster in the list area features of the service data list; determining an abnormal service data distribution result corresponding to the service data list based on the initial abnormal service data distribution result, the first mapping list, the target list structure feature, the first mapping data cluster, the target list region feature and the first service data distribution information.
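By way of illustration only, the test that marks a service data list as a key service data list can be sketched as follows. The application does not define how service data distribution information is represented or how the distribution information error is computed. The sketch assumes, purely for illustration, that each distribution is a mapping from a category name to a proportion and that the error is the sum of absolute differences; `distribution_error` and `is_key_list` are hypothetical names.

```python
from typing import Dict


def distribution_error(first: Dict[str, float], second: Dict[str, float]) -> float:
    """Sum of absolute differences over all categories (an assumed metric).

    A category missing from one distribution counts as proportion 0.0.
    """
    keys = set(first) | set(second)
    return sum(abs(first.get(k, 0.0) - second.get(k, 0.0)) for k in keys)


def is_key_list(first: Dict[str, float], second: Dict[str, float],
                error_threshold: float) -> bool:
    """A list is a key service data list when the error between the first and
    second service data distribution information exceeds the set threshold."""
    return distribution_error(first, second) > error_threshold
```

The comparison is strict, following the wording "larger than a set error threshold", so an error exactly equal to the threshold does not mark the list as a key service data list.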
A second aspect of the present application discloses a cloud server, comprising a processing engine, a network module, and a memory; the processing engine and the memory communicate via the network module, and the processing engine reads the computer program from the memory and runs it to perform the method of the first aspect.
A third aspect of the present application discloses a computer-readable signal medium having stored thereon a computer program which, when executed, implements the method of the first aspect.
Compared with the prior art, the big data based data cleaning method and the cloud server provided by the embodiments of the present application have the following technical effects. For each set of service data to be cleaned, the service data difference information between that data and the previous group of service data is taken into account, and the corresponding data cleaning reference information is determined. Furthermore, by determining the service data characteristics and the data cleaning index of the current service data to be cleaned, the service data list of the current service data to be cleaned can be determined. The service data list allows the current service data to be cleaned to be organized and broken down comprehensively, which supports the accuracy and reliability of the subsequent data cleaning, allows data cleaning to be combined dynamically with the actual service, and avoids the problem that service data cannot be used normally after mechanical data cleaning. Therefore, when abnormal service data is cleaned based on the service data list, the dynamic combination of the abnormal service data and the actual service can be taken into account, so that the different states of the abnormal service data in different time periods are considered. This helps ensure the accuracy of data cleaning and avoids affecting the normal use of subsequent service data.
Additional features will be set forth in part in the description that follows. These features will in part become apparent to those skilled in the art upon examination of the following description and the accompanying drawings, or may be learned through production or use. The features of the present application may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations particularly pointed out in the detailed examples that follow.
Drawings
As noted in the foregoing background, the inventor analyzed the above problem and found that common data cleaning techniques do not consider the dynamic combination of service data and actual services. Data cleaning is therefore usually performed mechanically against a fixed cleaning standard. As a result, if some abnormal service data is useless for one part of the service data but useful for another part, the subsequent service data may not be usable normally once that abnormal service data has been cleaned.
To solve this problem, the inventor provides a big data based data cleaning method and a cloud server that take into account the dynamic combination of abnormal service data and actual services, so that the different states of the abnormal service data in different time periods are considered. This helps ensure the accuracy of data cleaning and avoids affecting the normal use of subsequent service data.
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
The methods, systems, and/or processes of the figures are further described in terms of exemplary embodiments, which are described in detail with reference to the drawings. These exemplary embodiments are non-limiting, and like reference numerals denote similar structures throughout the several views of the drawings.
FIG. 1 is a block diagram of an exemplary big data based data cleansing system, shown in accordance with some embodiments of the present application.
FIG. 2 is a schematic diagram illustrating hardware and software components in an exemplary cloud server according to some embodiments of the present application.
FIG. 3 is a flow diagram of an exemplary big data based data cleansing method and/or process, shown in accordance with some embodiments of the present application.
FIG. 4 is a block diagram of an exemplary big data based data cleansing apparatus, according to some embodiments of the present application.
Detailed Description
In order to better understand the technical solutions, the technical solutions of the present application are described in detail below with reference to the drawings and specific embodiments, and it should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, and are not limitations of the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant guidance. It will be apparent, however, to one skilled in the art that the present application may be practiced without these specific details. In other instances, well-known methods, procedures, systems, compositions, and/or circuits have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present application.
These and other features, functions, methods of execution, and combination of functions and elements of related elements in the structure and economies of manufacture disclosed in the present application may become more apparent upon consideration of the following description with reference to the accompanying drawings, all of which form a part of this application. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the application. It should be understood that the drawings are not to scale. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the application. It should be understood that the drawings are not to scale.
Flowcharts are used herein to illustrate the implementations performed by systems according to embodiments of the present application. It should be expressly understood that the implementations shown in the flowcharts need not be performed in the order shown. Rather, these implementations may be performed in reverse order or simultaneously. In addition, at least one other implementation may be added to a flowchart, and one or more implementations may be deleted from a flowchart.
Fig. 1 is a block diagram illustrating an exemplary big data based data cleansing system 300 according to some embodiments of the present application, where the big data based data cleansing system 300 may include a cloud server 100 and a business data processing terminal 200.
In some embodiments, as shown in fig. 2, the cloud server 100 may include a processing engine 110, a network module 120, and a memory 130, the processing engine 110 and the memory 130 communicating through the network module 120.
Processing engine 110 may process the relevant information and/or data to perform one or more of the functions described herein. For example, in some embodiments, processing engine 110 may include at least one processor (e.g., a single-core processor or a multi-core processor). By way of example only, the processing engine 110 may include a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), an Application-Specific Instruction Set Processor (ASIP), a Graphics Processing Unit (GPU), a Physics Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, or the like, or any combination thereof.
Network module 120 may facilitate the exchange of information and/or data. In some embodiments, the network module 120 may be any type of wired or wireless network or combination thereof. Merely by way of example, the Network module 120 may include a cable Network, a wired Network, a fiber optic Network, a telecommunications Network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a bluetooth Network, a Wireless personal Area Network, a Near Field Communication (NFC) Network, and the like, or any combination thereof. In some embodiments, the network module 120 may include at least one network access point. For example, the network module 120 may include wired or wireless network access points, such as base stations and/or network access points.
The memory 130 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 130 is used for storing a program, and the processing engine 110 executes the program after receiving an execution instruction.
It is to be understood that the configuration shown in fig. 2 is merely illustrative, and that cloud server 100 may include more or fewer components than shown in fig. 2, or have a different configuration than shown in fig. 2. The components shown in fig. 2 may be implemented in hardware, software, or a combination thereof.
Fig. 3 is a flowchart illustrating an exemplary big data based data cleansing method and/or process according to some embodiments of the present application, where the big data based data cleansing method is applied to the cloud server 100 in fig. 1, and may specifically include the following steps S11-S13.
It can be understood that the following method can be used to clean service data to be processed. The intermediate process of cleaning may involve multiple groups of service data to be cleaned and multiple groups of cleaned service data; that is, the following method can be applied to a dynamic data cleaning process. The service data referred to below may relate to multiple fields, such as blockchain payment, internet finance, cloud game interaction, smart industrial control, smart manufacturing control, digital economic upgrade, smart city monitoring, smart medical treatment, internet of things interaction, cloud-side computing, online e-commerce platforms, and various big data service fields, which are not limited herein.
Step S11, comparing the current service data to be cleaned with the previous group of service data to obtain service data difference information, and processing the service data difference information based on multiple groups of cleaned service data that precede the current service data to be cleaned, so as to determine data cleaning reference information of the current service data to be cleaned.
For example, the service data difference information may represent the differences between different service data at the level of service type, service object, service event, and service environment. The data cleaning reference information is used to provide different data cleaning modes for different service data. For example, data that is abnormal for some service data may be useful for other service data; if data cleaning is performed directly without considering the association between different service data, some service data may not be usable at a later stage. The data cleaning reference information therefore provides a reference for data cleaning.
Step S12, acquiring the service data characteristics of the current service data to be cleaned; determining a data cleaning index of the current service data to be cleaned according to the service data characteristics; and processing the current service data to be cleaned according to the data cleaning reference information and the data cleaning index to obtain a service data list corresponding to the current service data to be cleaned.
For example, the service data features describe the current service data to be cleaned from different dimensions. The features can be represented in vector form, and the service data features of different service data to be cleaned are different. In some cases, the number of feature dimensions of the service data features may be made as large as possible. The data cleaning index is used for indicating which abnormal service data need to be cleaned, and can also be used for indicating how the abnormal service data change across different service periods or different service requirements. The service data list summarizes various service information corresponding to the current service data to be cleaned, such as service demand information, service priority information, and the like. The current service data to be cleaned can be managed and disassembled comprehensively through the service data list, so that the accuracy and reliability of later-stage data cleaning are ensured, the data cleaning and the actual service can be dynamically combined, and the problem of subsequent service data becoming difficult to use normally because of mechanical data cleaning is avoided.
Step S13, performing abnormal service data cleaning on the current service data to be cleaned based on the service data list.
For example, abnormal service data cleaning is performed based on the service data list, and dynamic combination of the abnormal service data and actual services can be considered, so that different states of the abnormal service data at different time intervals are taken into consideration, and therefore when the service data cleaning is realized, not only can the accuracy of the data cleaning be ensured, but also the influence of the data cleaning on the normal use of subsequent service data can be avoided.
It can be understood that by implementing the above steps S11-S13, for different service data to be cleaned, the service data difference information between the service data to be cleaned and the previous group of service data can be considered, and the corresponding data cleaning reference information can be determined. Furthermore, by determining the service data characteristics and the data cleaning indexes of the current service data to be cleaned, the service data list of the current service data to be cleaned can be determined. The current service data to be cleaned can be managed and disassembled comprehensively through the service data list, so that the accuracy and reliability of later-stage data cleaning are ensured, the data cleaning and the actual service can be dynamically combined, and the problem that subsequent service data caused by mechanical data cleaning are difficult to normally use is avoided. Therefore, when abnormal business data cleaning is carried out based on the business data list, dynamic combination of the abnormal business data and actual business can be considered, so that different states of the abnormal business data in different time periods are considered, the accuracy of data cleaning can be ensured, and the influence of the data cleaning on normal use of subsequent business data can be avoided.
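The ordering of steps S11-S13 over a stream of service data groups can be sketched as follows. This is only an illustrative skeleton: the function names `clean_group` and `clean_stream` and the four callables passed in are hypothetical placeholders, since the present application does not prescribe concrete data structures for the reference information, the cleaning index, or the service data list.

```python
def clean_group(current, previous, cleaned_history, *,
                determine_reference, determine_index, build_list, clean_abnormal):
    """Run one pass of S11-S13 on a single group (all callables are placeholders)."""
    # S11: reference information from the difference to the previous group
    # and from the groups that have already been cleaned.
    reference_info = determine_reference(current, previous, cleaned_history)
    # S12: cleaning index from the features of the current group, then the
    # service data list from the reference information and the index.
    cleaning_index = determine_index(current)
    service_list = build_list(current, reference_info, cleaning_index)
    # S13: abnormal service data cleaning driven by the service data list.
    return clean_abnormal(current, service_list)


def clean_stream(groups, **steps):
    """Apply clean_group to each group in order, accumulating cleaned groups."""
    cleaned_history = []
    previous = None
    for current in groups:
        cleaned = clean_group(current, previous, cleaned_history, **steps)
        cleaned_history.append(cleaned)
        previous = current
    return cleaned_history
```

The point of the sketch is the data flow: each group is cleaned with knowledge of its predecessor and of the earlier cleaned groups, which is what makes the process dynamic rather than a fixed filter.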
In the following, some alternative embodiments will be described, which should be understood as examples and not as technical features essential for implementing the present solution.
In some examples, the comparison of the current service data to be cleaned with the previous group of service data to obtain service data difference information, and the processing of the service data difference information based on the multiple groups of cleaned service data before the current service data to be cleaned to determine the data cleaning reference information of the current service data to be cleaned, as described in step S11, may include the following steps S111-S114.
Step S111, comparing the current service data to be cleaned with the previous group of service data through a business behavior analysis model to obtain business behavior difference information.
For example, the business behavior analysis model may be a convolutional neural network model, the training sample of the model may be previous business data, and the training process of the model is prior art and will not be described here. The service behavior difference information is used for representing difference information corresponding to interaction behaviors made by the user aiming at different service data.
Step S112, determining whether the current service data to be cleaned is a candidate service data mining group or not based on the service behavior difference information.
For example, the candidate business data mining group is used for representing that the current business data to be cleaned has data mining value.
Step S113, if so, comparing the current service data to be cleaned with the previous group of service data based on data classification records to obtain the service data difference information.
Step S114, performing difference information analysis on the service data difference information based on the data classification records of the plurality of groups of cleaned service data before the current service data to be cleaned, so as to determine data cleaning reference information of the current service data to be cleaned.
For example, the data classification record describes different data classes of the cleaned business data.
It can be understood that by applying the above steps S111 to S114, the difference of the business behavior can be taken into account, so as to ensure that the data cleansing reference information matches with the actual business behavior of the user, thereby avoiding affecting the normal business behavior of the user when data cleansing is performed later.
Further, performing difference information analysis on the service data difference information based on the data classification records of the plurality of groups of cleaned service data before the current service data to be cleaned, so as to determine the data cleaning reference information of the current service data to be cleaned, as described in step S114, may include the following steps S1141 to S1145.
Step S1141, performing difference information analysis on the service data difference information based on a plurality of sets of data classification records of the cleaned service data before the current service data to be cleaned, and obtaining a first difference information description value of the current service data to be cleaned.
For example, the difference information description value may be understood as representing the difference information by a certain value or a certain sequence value, which facilitates data processing and analysis at a later stage and reduces data processing pressure of the cloud server.
Step S1142, comparing the first difference information description value with a first description value threshold and a second description value threshold, where the first description value threshold is smaller than the second description value threshold.
For example, the description value threshold may be preset, and will not be described herein.
Step S1143, if the first difference information description value is smaller than the first description value threshold, the data cleaning reference information of the current service data to be cleaned indicates that the current service data to be cleaned is not a dynamic service data group.
For example, the dynamic service data group is used to characterize the service data as being updatable and adjustable.
Step S1144, if the first difference information description value is greater than the second description value threshold, the data cleaning reference information of the current service data to be cleaned indicates that the current service data to be cleaned is a dynamic service data group.
Step S1145, if the first difference information description value is greater than the first description value threshold and smaller than the second description value threshold, the data cleaning reference information of the current service data to be cleaned indicates that the current service data to be cleaned is an interactive service data group.
For example, the interactive service data group is used to characterize the presence of interactive behavior in the service data.
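The two-threshold decision in steps S1142 to S1145 can be sketched as below. The names `GroupKind` and `classify_group` are hypothetical, and the description value is assumed to be a single number. The source does not state what happens when the value equals a threshold exactly; this sketch assumes such boundary values fall into the interactive category.

```python
from enum import Enum


class GroupKind(Enum):
    NOT_DYNAMIC = "not a dynamic service data group"
    INTERACTIVE = "interactive service data group"
    DYNAMIC = "dynamic service data group"


def classify_group(description_value, first_threshold, second_threshold):
    """Map a first difference information description value to a group kind."""
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be smaller than second threshold")
    if description_value < first_threshold:
        return GroupKind.NOT_DYNAMIC      # S1143
    if description_value > second_threshold:
        return GroupKind.DYNAMIC          # S1144
    # S1145 covers values strictly between the thresholds; values equal to a
    # threshold are treated as interactive here, which is an assumption.
    return GroupKind.INTERACTIVE
```

For instance, with thresholds 0.3 and 0.7, a description value of 0.5 would be classified as an interactive service data group.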
In addition, on the basis of the above, the method further includes the following steps S11461 to S11465.
Step S11461, detecting a comparison result between a first difference information description value of consecutive groups of service data to be cleaned and the first description value threshold and the second description value threshold.
Step S11462, if the first difference information description values of a preset number of consecutive groups of service data to be cleaned are all greater than the first description value threshold, the first group of service data to be cleaned whose first difference information description value is greater than the first description value threshold is used as the initial dynamic service data group.
Step S11463, determining the record difference between accumulated data classification records that are a set number of groups apart, and performing difference information analysis on the record difference to obtain a second difference information description value of the service data to be cleaned that are the set number of groups apart.
Step S11464, comparing the second difference information description value with the second description value threshold, and comparing the first difference information description values of a preset number of consecutive groups of service data to be cleaned with the first description value threshold.
Step S11465, if the second difference information description value of the current service data to be cleaned is greater than the second description value threshold, and the first difference information description values of a preset number of consecutive groups of service data to be cleaned are less than the first description value threshold, using the current service data to be cleaned as a system service data group.
It can be understood that based on the above steps S1141 to S1145 and steps S11461 to S11465, different representation contents of the data cleaning reference information can be determined, so as to provide a comprehensive and reliable cleaning basis for subsequent data cleaning.
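The run detection in step S11462 can be sketched as follows. The function name `find_initial_dynamic_group` is hypothetical, the description values are assumed to be numbers listed in group order, and `run_length` stands for the preset number of consecutive groups.

```python
def find_initial_dynamic_group(description_values, first_threshold, run_length):
    """Return the index of the first group of the earliest run of `run_length`
    consecutive groups whose description values all exceed `first_threshold`,
    or None if no such run exists."""
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    run_start = None
    run_count = 0
    for index, value in enumerate(description_values):
        if value > first_threshold:
            if run_count == 0:
                run_start = index      # candidate initial dynamic group
            run_count += 1
            if run_count >= run_length:
                return run_start
        else:
            run_count = 0              # run broken, start over
    return None
```

A single pass suffices because a run can only be extended or reset by each new group, which suits the streaming nature of the cleaning process.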
In some examples, determining the data cleaning index of the current service data to be cleaned according to the service data features, as described in step S12, may include the following step S120: determining the data cleaning index of the current service data to be cleaned by taking the service data calling frequency, the proportion of shared service data, the service data fault tolerance rate, and the service data correlation matrix as the service data features. With this design, the data cleaning index of the current service data to be cleaned can be determined based on indexes of the service data in different dimensions, so that the data cleaning index covers as many dimensions of the current service data to be cleaned as possible.
For the technical features of the service data calling frequency, the proportion of the shared service data, the fault tolerance rate of the service data, and the correlation matrix of the service data, those skilled in the art can make an unambiguous derivation based on the content described in the present application, and implement the present solution completely and clearly based on the technical features.
Based on step S120, determining the data cleaning index of the current service data to be cleaned according to the service data calling frequency, the proportion of shared service data, the service data fault tolerance rate, and the service data correlation matrix as the service data features may further include steps S121 to S124.
Step S121, comparing the calling frequency of the service data with the calling frequency description value threshold, and comparing the ratio of the shared service data with a set ratio.
Step S122, if the calling frequency of the service data is less than or equal to the calling frequency description value threshold, and the ratio of the shared service data is less than or equal to the set ratio, determining that the data cleaning index of the current service data to be cleaned represents an incomplete data cleaning index.
Step S123, if the calling frequency of the service data is greater than the calling frequency description value threshold, or the ratio of the shared service data is greater than the set ratio, determining that the data cleansing index of the current service data to be cleansed represents a repeated data cleansing index.
Step S124, comparing the service data fault tolerance rate and the service data correlation matrix with a preset value, and comparing the average calling frequency with the calling frequency description value threshold value, so as to determine that the data cleaning index of the current service data to be cleaned represents an error data cleaning index.
In this way, by implementing the above steps S121 to S124, different cleaning indexes indicated by the data cleaning index can be determined based on the comparison result of the calling frequency of the service data and the calling frequency description value threshold and the comparison result of the proportion of the shared service data and the set proportion, which is convenient for subsequently cleaning data by using different data cleaning methods, thereby ensuring accurate and reliable data cleaning.
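The branch logic of steps S122 and S123 can be sketched as below. The names `CleaningIndex` and `determine_cleaning_index` are hypothetical. The error data cleaning index of step S124 is left out because the source does not state the concrete condition under which it applies.

```python
from enum import Enum


class CleaningIndex(Enum):
    INCOMPLETE = "incomplete data cleaning index"
    REPEATED = "repeated data cleaning index"


def determine_cleaning_index(call_frequency, shared_ratio,
                             frequency_threshold, set_ratio):
    """Choose between the incomplete and repeated data cleaning indexes."""
    # S122: both quantities are at or below their limits.
    if call_frequency <= frequency_threshold and shared_ratio <= set_ratio:
        return CleaningIndex.INCOMPLETE
    # S123: at least one quantity exceeds its limit.
    return CleaningIndex.REPEATED
```

The two conditions are logical complements of each other, so every pair of inputs maps to exactly one of the two indexes.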
In one possible embodiment, the inventors found that, in order to ensure the integrity of the service data list, the correspondence between the data cleaning reference information and the data cleaning index needs to be considered. For this purpose, processing the current service data to be cleaned according to the data cleaning reference information and the data cleaning index to obtain a service data list corresponding to the current service data to be cleaned, as described in step S12, may include steps S12a to S12e.
Step S12a, determining a correspondence between the data cleaning reference information and the data cleaning index, and obtaining service data block information of the current service data to be cleaned based on the correspondence, where the service data block information includes data block sequence information and an association priority of data block association information.
Step S12b, determining a service data arrangement mode.
Step S12c, determining whether it is necessary to analyze the service requirement information corresponding to the smallest service data block in the current service data to be cleaned according to the associated priority of the data block associated information and the data block sequence information.
Step S12d, if the analysis is needed, performing information screening on at least part of the service demand information of at least one group of the current service data to be cleaned, so as to obtain the service demand information corresponding to the minimum service data block.
Step S12e, determining whether business requirement information corresponding to the minimum business data block needs to be recombined again by using the business data block information; and if the business requirement information needs to be recombined again, generating new business requirement information, and performing data sorting based on the business data sorting mode to obtain the business data list.
It can be understood that, by applying the above steps S12a to S12e, the service data block information can be determined by considering the correspondence between the data cleaning reference information and the data cleaning index, so that information reorganization can be realized at the level of the association priority and the service requirement information, and data sorting can be performed based on the determined service data arrangement mode. In this way, the service data list can be completely determined.
On the basis of the above example, the data block sequence information includes the service duration of the current service data to be cleaned, the number of service events of the current service data to be cleaned, and the service priority. Based on this, the method may further include the following steps S12f to S12i.
Step S12f, determining whether the sequence priority of the data block sequence information, the service priority, and the association priority of the data block association information are the same.
Step S12g, if the sequence priority of the data block sequence information, the service priority and the associated priority of the data block associated information are the same, judging whether to set a time sequence feature tag according to the service duration of the multiple groups of current service data to be cleaned when the service data sorting mode is a time sequence sorting mode; if the time sequence feature tag is set, performing data sorting on the multiple groups of current service data to be cleaned based on the time sequence feature tag; and if the time sequence feature tag is not set, performing data sorting on the multiple groups of current service data to be cleaned.
Step S12h, when the business data arrangement mode is the event arrangement mode, judging whether to set a business event label according to the business event number of the multiple groups of current business data to be cleaned; if the business event label is set, data processing and sorting are carried out on the multiple groups of current business data to be cleaned based on the business event label; and if the service event label is not set, performing data sorting on the multiple groups of current service data to be cleaned.
Step S12i, if the sequence priority of the data block sequence information, the service priority, or the association priority of the data block association information are different, when the service data arrangement mode is the time sequence arrangement mode, selecting one group of current service data to be cleaned from the multiple groups of current service data to be cleaned as reference time sequence service data, and determining whether to set the time sequence feature tag according to the service duration of the multiple groups of current service data to be cleaned; if the time sequence characteristic label is set, screening other current service data to be cleaned by using the service data block information of the reference time sequence service data and the time sequence characteristic label, and performing data sorting; if the time sequence characteristic label is not set, screening the other current service data to be cleaned by using the service data block information of the reference time sequence service data, and sorting the data; when the business data sorting mode is the event sorting mode, selecting one group of current business data to be cleaned from the multiple groups of current business data to be cleaned as reference event business data, and judging whether the business event label is set according to the number of business events of the multiple groups of current business data to be cleaned; if the service event label is set, screening the other current service data to be cleaned by using the service data block information of the reference event service data and the service event label, and performing data sorting; and if the service event label is not set, screening the other current service data to be cleaned by using the service data block information of the reference event service data, and sorting the data.
In this way, based on the above steps S12f to S12i, different service data arrangement modes can be adopted to process the current service data to be cleaned, so as to obtain a service data list corresponding to the current service data to be cleaned. Therefore, the method can be flexibly used for determining the service data list in different service scenarios.
In addition, on this basis, the service data block information further includes a data block configuration record. Based on this, the method further includes step S12j and step S12k.
Step S12j, when the service data arrangement mode is the time sequence arrangement mode, judging whether the service durations of the multiple groups of current service data to be cleaned are the same according to the data block sequence information; if so, performing data sorting on the multiple groups of current service data to be cleaned according to the data block configuration record; if not, setting the label weight of the time sequence feature label according to the service durations of the multiple groups of current service data to be cleaned, and performing data sorting on the multiple groups of current service data to be cleaned according to the time sequence feature label.
Step S12k, when the service data arrangement mode is the event arrangement mode, judging whether the service event quantities of the multiple groups of current service data to be cleaned are the same according to the data block sequence information; if so, performing data sorting on the multiple groups of current service data to be cleaned according to the service data formats and service data storage paths of the multiple groups of current service data to be cleaned; if not, setting the label weight of the service event label according to the number of service events of the multiple groups of current service data to be cleaned, and performing data sorting on the multiple groups of current service data to be cleaned according to the service event label; the service data format comprises a modifiable format and a non-modifiable format, when the service data format is the modifiable format, the service requirement information comprises real-time service requirement information and delay service requirement information, and the service data storage path comprises a service data authority access path and a service data calling path.
Therefore, through the steps S12a to S12k, the service data list can be generated based on different service situations, so that the integrity of the service data list can be ensured in a global level, and the usability of the whole scheme can be improved to adapt to different service scenarios.
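The time sequence branch of step S12j can be sketched as follows. Everything concrete here is an assumption made for illustration: each group is represented as a dictionary with hypothetical keys `service_duration` (a positive number) and `config_record` (a sortable value), the label weight is taken as the share of the group's duration in the total duration, and groups are ordered by descending weight. The source does not specify the weight formula or the sort direction.

```python
def sort_by_time_sequence(groups):
    """Sort groups as in S12j (weight formula and sort order are assumptions)."""
    durations = [g["service_duration"] for g in groups]
    # Same duration everywhere: fall back to the data block configuration record.
    if len(set(durations)) <= 1:
        return sorted(groups, key=lambda g: g["config_record"])
    # Durations differ: derive a label weight from the duration, then sort by it.
    total = sum(durations)
    weighted = [
        dict(g, timing_label_weight=g["service_duration"] / total)
        for g in groups
    ]
    return sorted(weighted, key=lambda g: g["timing_label_weight"], reverse=True)
```

The event branch of step S12k has the same shape, with the number of service events in place of the service duration and the service data format and storage path in place of the configuration record.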
In some examples, in order to ensure the efficiency of data cleansing and ensure the normal use of subsequent service data, the abnormal service data cleansing of the current service data to be cleansed based on the service data list described in step S13 may include steps S131 to S135.
Step S131, determining list structure feature information, list region feature information, and list grouping feature information of the service data list.
Step S132, determining first service data distribution information corresponding to the service data list based on the list grouping feature information of the service data list and the list grouping feature information of a reference service data list, where the reference service data list is a service data list including three list grouping features with different feature dimensions and including a total number of the list grouping features greater than a first set number, and the list generation time of the reference service data list is before the list generation time of the service data list.
Step S133, determining an abnormal service data distribution result corresponding to the service data list based on the list structure feature information and the list region feature information of the service data list, the abnormal service data flag information and the abnormal service data clustering information corresponding to the previous service data list, and the first service data distribution information, where the abnormal service data distribution result at least includes the second service data distribution information, and the abnormal service data distribution result corresponding to the service data list refers to an abnormal service data distribution result of the service data processing terminal when the service data list is generated.
Step S134, if the distribution information error between the first service data distribution information and the second service data distribution information is greater than a set error threshold, determining that the service data list is a key service data list, and determining, based on the first service data distribution information and the abnormal service data distribution result, third service data distribution information, corresponding abnormal service data labels, and abnormal service data clusters of all key service data lists in a service data processing environment.
Step S135, performing abnormal service data cleaning on the current service data to be cleaned through the third service data distribution information of all key service data lists in the service data processing environment, the corresponding abnormal service data labels and the abnormal service data clusters, so as to obtain target service data.
It can be understood that, when the above steps S131 to S135 are applied, the list structure feature information, the list region feature information, and the list grouping feature information of the service data list can be taken into consideration, so as to determine different service data distribution information and, in turn, different abnormal service data distribution results, from which the third service data distribution information, the abnormal service data labels, and the abnormal service data clusters can be further determined. Therefore, when data cleaning is carried out based on the third service data distribution information, the abnormal service data labels, and the abnormal service data clusters, not only can the efficiency of data cleaning be ensured, but the normal use of subsequent service data can also be ensured.
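The key-list test in step S134 can be sketched as below. The source does not define how distribution information is represented or how the distribution information error is computed; this sketch assumes both distributions are equal-length numeric sequences and uses the mean absolute difference as the error. The function names are hypothetical.

```python
def distribution_error(first_distribution, second_distribution):
    """Mean absolute difference between two distributions (an assumed metric)."""
    if len(first_distribution) != len(second_distribution):
        raise ValueError("distributions must have the same length")
    if not first_distribution:
        raise ValueError("distributions must not be empty")
    total = sum(abs(a - b) for a, b in zip(first_distribution, second_distribution))
    return total / len(first_distribution)


def is_key_service_data_list(first_distribution, second_distribution, error_threshold):
    """S134: a list is a key service data list when the error exceeds the threshold."""
    return distribution_error(first_distribution, second_distribution) > error_threshold
```

Only lists that pass this test would contribute their distribution information, labels, and clusters to the cleaning performed in step S135.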
On this basis, the list grouping feature of the service data list comprises list grouping features of three different feature dimensions. Accordingly, the determining of the first service data distribution information corresponding to the service data list based on the list grouping feature information of the service data list and the list grouping feature information of the reference service data list, as described in step S132, may include steps S1321 to S1324.
Step S1321, determining different feature dimension description information of the list grouping feature of each different feature dimension in the service data list under a service data interaction environment corresponding to the service data list based on the list grouping feature information of the service data list, and obtaining different feature dimension description information of the list grouping feature of three different feature dimensions in the service data list.
Step S1322 is to obtain different feature dimension description information of the list grouping feature of three different feature dimensions in the reference service data list based on the list grouping feature information of the reference service data list.
Step S1323, obtaining fourth service data distribution information corresponding to the reference service data list.
Step S1324, determining first service data distribution information corresponding to the service data list based on different feature dimension description information of the list grouping feature in the service data list, different feature dimension description information of the list grouping feature in the reference service data list, and the fourth service data distribution information.
In this way, the first service data distribution information can be determined completely and in real time based on steps S1321 to S1324.
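By way of a non-limiting illustration, steps S1321 to S1324 could be sketched as follows. The disclosure does not state how the feature dimension description information is combined with the fourth service data distribution information, so the sketch makes a purely illustrative assumption: each description is a single number per feature dimension, and the reference distribution is rescaled by the ratio between the current list and the reference list and then renormalized. The function name `first_distribution` and all keys are hypothetical.

```python
from typing import Dict


def first_distribution(list_desc: Dict[str, float],
                       ref_desc: Dict[str, float],
                       fourth_distribution: Dict[str, float]) -> Dict[str, float]:
    """Derive the first distribution from the reference (fourth) distribution.

    Assumption: each feature dimension's share in the reference list is scaled
    by how strongly that dimension appears in the current list relative to the
    reference list, and the result is renormalized to sum to 1.
    """
    scaled = {}
    for dim, ref_share in fourth_distribution.items():
        ratio = list_desc[dim] / ref_desc[dim] if ref_desc[dim] else 0.0
        scaled[dim] = ref_share * ratio
    total = sum(scaled.values())
    if total == 0:
        # Nothing to rescale with; fall back to the reference distribution.
        return dict(fourth_distribution)
    return {dim: value / total for dim, value in scaled.items()}


if __name__ == "__main__":
    current = {"dim_a": 2.0, "dim_b": 1.0, "dim_c": 1.0}     # S1321 output (assumed numeric)
    reference = {"dim_a": 1.0, "dim_b": 1.0, "dim_c": 1.0}   # S1322 output (assumed numeric)
    fourth = {"dim_a": 0.5, "dim_b": 0.25, "dim_c": 0.25}    # S1323 output
    print(first_distribution(current, reference, fourth))    # S1324
```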
Further, the determining, as described in step S133, of an abnormal service data distribution result corresponding to the service data list based on the list structure feature information and the list region feature information of the service data list, the abnormal service data flag information and the abnormal service data clustering information corresponding to the previous service data list, and the first service data distribution information may include steps S1331 to S1333.
Step S1331, determining an initial abnormal service data distribution result of the service data list based on the abnormal service data distribution result of the previous service data list; determining list structure characteristics indicated by the list structure characteristic information and list region characteristics indicated by the list region characteristic information of the service data list, and determining abnormal service data marks indicated by abnormal service data mark information corresponding to the previous service data list and abnormal service data clusters indicated by abnormal service data cluster information.
Step S1332, determining, based on the initial abnormal service data distribution result of the service data list, a first mapping list in which the abnormal service data corresponding to the previous service data list is marked in the service data list, and a first mapping data cluster in which the abnormal service data corresponding to the previous service data list is clustered in the service data list.
Step S1333, determining a target list structure feature matched with the first mapping list in the list structure features of the service data list, and determining a target list region feature clustered and matched with the first mapping data in the list region features of the service data list; determining an abnormal service data distribution result corresponding to the service data list based on the initial abnormal service data distribution result, the first mapping list, the target list structure feature, the first mapping data cluster, the target list region feature and the first service data distribution information.
In this way, by implementing steps S1331 to S1333, the initial abnormal service data distribution result, the first mapping list, the target list structure feature, the first mapping data cluster, the target list region feature, and the first service data distribution information are all considered together, so that the abnormal service data distribution result corresponds to the actual service state as closely as possible and remains consistent with it in time sequence, thereby ensuring that the normal use of service data is not affected after data cleaning.
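By way of a non-limiting illustration, the mapping in step S1332 could be sketched as follows. The sketch assumes that records carry identifiers, that the abnormal service data marks of the previous list are a sequence of such identifiers, and that the abnormal service data clusters are named sets of identifiers. These representations and the names `map_marks` and `map_clusters` are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Iterable, List, Set


def map_marks(previous_marked: Iterable[str], current_ids: Set[str]) -> List[str]:
    """First mapping list: previously marked records that still appear in the current list."""
    return [record_id for record_id in previous_marked if record_id in current_ids]


def map_clusters(previous_clusters: Dict[str, Set[str]],
                 current_ids: Set[str]) -> Dict[str, Set[str]]:
    """First mapping data cluster: previous clusters restricted to the current list."""
    mapped = {}
    for name, members in previous_clusters.items():
        kept = members & current_ids
        if kept:  # drop clusters with no member left in the current list
            mapped[name] = kept
    return mapped


if __name__ == "__main__":
    current = {"r1", "r2", "r3"}
    print(map_marks(["r2", "r9"], current))
    print(map_clusters({"duplicates": {"r1", "r9"}, "stale": {"r8"}}, current))
```

The mapped marks and clusters would then be matched against the list structure features and list region features in step S1333, which the sketch does not attempt to model.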
In an alternative embodiment, in order to ensure the efficiency and accuracy of subsequent service data cleansing, after the abnormal service data cleansing is performed on the current service data to be cleansed based on the service data list as described in step S13, the method may further include step S14: acquiring a service data cleansing record for the current service data to be cleansed, and adjusting the thread parameters of a preset data cleansing thread according to the service data cleansing record.
For example, the service data cleansing record is used to record a complete process of performing various types of abnormal service data cleansing on the current service data to be cleansed, including but not limited to data deletion, data reassembly, and the like. The data cleansing thread may be a preconfigured cleansing algorithm, which may be, for example, an artificial intelligence model.
It can be understood that, by implementing step S14, the thread parameters can be adjusted dynamically, so that the data cleansing thread is iteratively updated based on successive data cleansing operations and is thus continuously trained, thereby improving the efficiency and accuracy of subsequent service data cleansing.
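By way of a non-limiting illustration, the feedback loop of step S14 could be sketched as follows. The sketch assumes a single thread parameter (an anomaly threshold between 0 and 1) and a cleansing record whose entries carry a feedback flag indicating whether each cleansing action was later judged correct. The flag, the update rule, and the names `CleaningRecordEntry` and `adjust_threshold` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CleaningRecordEntry:
    action: str        # e.g. "delete" or "reassemble"
    was_correct: bool  # hypothetical feedback flag, not part of the disclosure


def adjust_threshold(threshold: float, record: List[CleaningRecordEntry],
                     step: float = 0.5, tolerated_error: float = 0.25) -> float:
    """Nudge the threshold up when too many actions were wrong, down otherwise."""
    if not record:
        return threshold  # no evidence, keep the parameter unchanged
    error_rate = sum(not entry.was_correct for entry in record) / len(record)
    adjusted = threshold + step * (error_rate - tolerated_error)
    return min(1.0, max(0.0, adjusted))  # keep the parameter in [0, 1]


if __name__ == "__main__":
    record = [
        CleaningRecordEntry("delete", True),
        CleaningRecordEntry("delete", False),
        CleaningRecordEntry("reassemble", True),
        CleaningRecordEntry("reassemble", False),
    ]
    print(adjust_threshold(0.5, record))
```

Where the data cleansing thread is an artificial intelligence model, the same record could instead serve as additional training data; the scalar update above is only the simplest stand-in.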
Further, on the basis of step S14, adjusting the thread parameters of the preset data cleansing thread according to the service data cleansing record may include the following steps S141 to S145.
Step S141, determining data cleaning logic information corresponding to the service data cleaning record, and extracting a first abnormal data identification index and a second abnormal data identification index aiming at the abnormal service data from the data cleaning logic information; the first abnormal data identification index is used for representing a data integrity index of the abnormal business data, and the second abnormal data identification index is used for representing a data correctness index of the abnormal business data.
Step S142, constructing a first identification index feature matrix corresponding to a first abnormal data identification index, and constructing a second identification index feature matrix corresponding to a second abnormal data identification index, wherein the first identification index feature matrix and the second identification index feature matrix respectively comprise a plurality of index identification units with different identification accuracies; extracting initial cleaning index data of the first abnormal data identification index in any index identification unit of the first identification index characteristic matrix, and determining an index identification unit with the minimum identification accuracy in the second identification index characteristic matrix as a target index identification unit.
Step S143, mapping the initial cleaning index data to the target index identification unit according to the cleaning record timing information of the service data cleaning record, obtaining initial cleaning index mapping data in the target index identification unit, and generating an identification index association list between the first abnormal data identification index and the second abnormal data identification index according to the initial cleaning index data and the initial cleaning index mapping data.
Step S144, the initial cleaning index mapping data is taken as cleaning reference data to obtain thread data to be processed in the target index identification unit, the thread data to be processed is mapped to the index identification unit where the initial cleaning index data is located according to the relevance list characteristics corresponding to the identification index relevance list, real-time thread data corresponding to the thread data to be processed is obtained in the index identification unit where the initial cleaning index data is located, and the cleaning reference data of the real-time thread data is determined to be target cleaning index data.
Step S145, acquiring a data mapping path for mapping the initial cleaning index data to the target index identification unit; according to the relevance between the real-time thread data and path node parameters corresponding to a plurality of mapping path nodes on the data mapping path, traversing a target thread parameter matrix corresponding to the target cleaning index data in the second identification index characteristic matrix until the obtained index identification weight of an index identification unit where the target thread parameter matrix is located is consistent with the index identification weight of the target cleaning index data in the first identification index characteristic matrix, stopping obtaining the target thread parameter matrix in the next index identification unit, establishing a thread parameter updating matrix between the target cleaning index data and the target thread parameter matrix obtained at the last time, and adjusting the thread parameters of the preset data cleaning thread based on the thread parameter updating matrix.
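By way of a non-limiting illustration, two control-flow elements of steps S142 and S145 could be sketched as follows: selecting the index identification unit with the minimum identification accuracy, and traversing units until the index identification weight is consistent with a target weight. The sketch assumes that each unit exposes a numeric accuracy, a numeric weight, and a parameter list, and that "consistent" means equal within a tolerance. The class `IndexUnit` and both function names are hypothetical, and the mapping path and association list of steps S143 and S144 are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class IndexUnit:
    accuracy: float      # identification accuracy of the unit
    weight: float        # index identification weight of the unit
    params: List[float]  # stand-in for the target thread parameter matrix


def target_unit(units: Sequence[IndexUnit]) -> IndexUnit:
    """Step S142: pick the unit with the minimum identification accuracy."""
    return min(units, key=lambda unit: unit.accuracy)


def traverse_until_consistent(units: Sequence[IndexUnit], target_weight: float,
                              tolerance: float = 1e-9) -> Optional[List[float]]:
    """Step S145: visit units in order and stop once the weights are consistent.

    Returns the parameters obtained last, or None if there are no units.
    """
    last_params = None
    for unit in units:
        last_params = unit.params
        if abs(unit.weight - target_weight) <= tolerance:
            break  # do not fetch parameters from the next unit
    return last_params


if __name__ == "__main__":
    second_matrix = [
        IndexUnit(accuracy=0.9, weight=0.2, params=[1.0]),
        IndexUnit(accuracy=0.5, weight=0.4, params=[2.0]),
        IndexUnit(accuracy=0.7, weight=0.6, params=[3.0]),
    ]
    print(target_unit(second_matrix).accuracy)
    print(traverse_until_consistent(second_matrix, target_weight=0.4))
```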
FIG. 4 is a block diagram illustrating an exemplary big data based data cleaning apparatus 140 according to some embodiments of the present application. The big data based data cleaning apparatus 140 may include the following functional modules.
The reference information determining module 141 is configured to compare service data difference information between current service data to be cleaned and a previous group of service data, and process the service data difference information based on a plurality of groups of cleaned service data before the current service data to be cleaned, so as to determine data cleaning reference information of the current service data to be cleaned.
A data list obtaining module 142, configured to obtain service data characteristics of the current service data to be cleaned; determining a data cleaning index of the current service data to be cleaned according to the service data characteristics; and processing the current service data to be cleaned according to the data cleaning reference information and the data cleaning index to obtain a service data list corresponding to the current service data to be cleaned.
A service data cleaning module 143, configured to perform abnormal service data cleaning on the current service data to be cleaned based on the service data list.
It will be appreciated that, for the description of the above apparatus embodiment, reference may be made to the description of the method embodiment shown in FIG. 3.
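By way of a non-limiting illustration, the cooperation of the three functional modules could be sketched as a simple pipeline. The sketch assumes that each module is an injected callable and that the data cleaning index is computed inside the data list obtaining module. The class name `DataCleaningApparatus`, the method `run`, and the toy callables in the usage example are hypothetical.

```python
from typing import Any, Callable, Sequence


class DataCleaningApparatus:
    """Chains the three functional modules 141, 142, and 143 in order."""

    def __init__(self,
                 reference_info_module: Callable[[Any, Any, Sequence[Any]], Any],
                 data_list_module: Callable[[Any, Any], Any],
                 cleaning_module: Callable[[Any, Any], Any]) -> None:
        self.reference_info_module = reference_info_module  # module 141
        self.data_list_module = data_list_module            # module 142
        self.cleaning_module = cleaning_module              # module 143

    def run(self, current: Any, previous: Any, cleaned_history: Sequence[Any]) -> Any:
        # 141: compare with the previous group and use earlier cleaned groups.
        reference_info = self.reference_info_module(current, previous, cleaned_history)
        # 142: build the service data list from the reference information.
        data_list = self.data_list_module(current, reference_info)
        # 143: clean abnormal service data based on the service data list.
        return self.cleaning_module(current, data_list)


if __name__ == "__main__":
    apparatus = DataCleaningApparatus(
        reference_info_module=lambda cur, prev, hist: len(cur) - len(prev),
        data_list_module=lambda cur, ref: sorted(cur),
        cleaning_module=lambda cur, data_list: [x for x in data_list if x >= 0],
    )
    print(apparatus.run([3, -1, 2], [1], []))
```

Injecting the modules as callables keeps the sketch independent of how each module is actually realized, whether in software, hardware, or a combination of both.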
Based on the same inventive concept, a corresponding system embodiment is also provided, and the description about the system embodiment is as follows.
A1. A big data based data cleaning system, comprising a cloud server and a service data processing terminal that communicate with each other, wherein the service data to be processed comprises a plurality of groups of service data to be cleaned and a plurality of groups of cleaned service data produced during the cleaning process, and the cloud server is configured to:
comparing the service data difference information of the current service data to be cleaned with the service data of the previous group, and processing the service data difference information based on a plurality of groups of cleaned service data before the current service data to be cleaned so as to determine the data cleaning reference information of the current service data to be cleaned;
acquiring the service data characteristics of the current service data to be cleaned; determining a data cleaning index of the current service data to be cleaned according to the service data characteristics; processing the current business data to be cleaned according to the data cleaning reference information and the data cleaning index to obtain a business data list corresponding to the current business data to be cleaned;
and cleaning the abnormal service data of the current service data to be cleaned based on the service data list.
It can be understood that the service data to be processed may be service data corresponding to the service data processing terminal.
It will be appreciated that the above description of the system embodiment may refer to the description of the method embodiment shown in figure 3.
It should be understood that, for technical terms not expressly defined above, a person skilled in the art can derive and unambiguously determine their meaning from the above disclosure. For example, for terms such as values, coefficients, and weights, a person skilled in the art can derive and determine them from the logical relationship of the surrounding context, and the value ranges may be selected according to the actual situation, for example, 0 to 1, 1 to 10, or 50 to 100, without being limited thereto. Likewise, a person skilled in the art can unambiguously determine technical features or terms qualified as preset, reference, predetermined, set, or target according to the above disclosure. For technical terms that are not explained, those skilled in the art can implement the technical solution clearly and completely by reasonable and unambiguous derivation from the logical relationships in the preceding and following paragraphs. The foregoing is therefore clear and complete to those skilled in the art. It should also be understood that the process by which those skilled in the art derive and analyze unexplained technical terms is based on the contents described in the present application, and thus the above is not an inventive judgment on the overall scheme.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is merely illustrative and does not limit the present application. Although not explicitly stated herein, various modifications, improvements, and adaptations to the present application may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested by the present application and thus fall within the spirit and scope of its exemplary embodiments.
Also, this application uses specific terminology to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of at least one embodiment of the present application may be combined as appropriate.
In addition, those skilled in the art will recognize that the various aspects of the application may be illustrated and described in terms of several patentable species or contexts, including any new and useful combination of procedures, machines, articles, or materials, or any new and useful modifications thereof. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software. The above hardware or software may be referred to as a "unit", "component", or "system". Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in at least one computer readable medium.
A computer readable signal medium may comprise a propagated data signal with computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable signal medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the execution of aspects of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, conventional procedural programming languages such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or offered as a service, such as software as a service (SaaS).
Additionally, the order of the process elements and sequences described herein, the use of numbers and letters, or the use of other designations is not intended to limit the order of the processes and methods unless otherwise indicated in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it should be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware means, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
It should also be appreciated that, in the foregoing description of embodiments of the present application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of at least one embodiment of the invention. However, this method of disclosure does not mean that the claimed subject matter requires more features than are expressly recited in the claims. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.

Claims (7)

1. A data cleaning method based on big data is characterized in that service data to be processed comprises a plurality of groups of service data to be cleaned and a plurality of groups of cleaned service data in the middle of cleaning, and the method comprises the following steps:
determining a corresponding relation between data cleaning reference information and a data cleaning index, and acquiring service data block information of current service data to be cleaned based on the corresponding relation, wherein the service data block information comprises association priority of data block sequence information and data block association information;
determining a business data arrangement mode;
determining whether the service requirement information corresponding to the minimum service data block in the current service data to be cleaned needs to be analyzed according to the association priority of the data block association information and the data block sequence information;
if the analysis is needed, performing information screening on at least part of service demand information of at least one group of current service data to be cleaned to obtain service demand information corresponding to the minimum service data block;
determining whether business demand information corresponding to the minimum business data block needs to be recombined again by utilizing the business data block information; and if the business requirement information needs to be recombined again, generating new business requirement information, and performing data sorting based on the business data sorting mode to obtain a business data list.
2. The method of claim 1, wherein the data block sequence information includes a service duration of the current service data to be cleaned, a number of service events of the current service data to be cleaned, and a service priority, and the method further comprises:
judging whether the sequence priority of the data block sequence information, the service priority and the associated priority of the data block associated information are the same;
if the sequence priority of the data block sequence information, the service priority and the associated priority of the data block associated information are the same, judging whether a time sequence characteristic label is set according to the service duration of the multiple groups of current service data to be cleaned when the service data sorting mode is a time sequence sorting mode; if the time sequence feature tag is set, performing data sorting on the multiple groups of current service data to be cleaned based on the time sequence feature tag; if the time sequence feature tag is not set, data sorting is carried out on the multiple groups of current service data to be cleaned;
when the business data sorting mode is an event sorting mode, judging whether a business event label is set according to the number of the business events of the multiple groups of current business data to be cleaned; if the business event label is set, data processing and sorting are carried out on the multiple groups of current business data to be cleaned based on the business event label; if the service event label is not set, data sorting is carried out on the multiple groups of current service data to be cleaned;
if the sequence priority of the data block sequence information, the service priority or the associated priority of the data block associated information are different, selecting one group of current service data to be cleaned from the multiple groups of current service data to be cleaned as reference time sequence service data when the service data sorting mode is the time sequence sorting mode, and judging whether to set the time sequence feature tag according to the service duration of the multiple groups of current service data to be cleaned; if the time sequence characteristic label is set, screening other current service data to be cleaned by using the service data block information of the reference time sequence service data and the time sequence characteristic label, and performing data sorting; if the time sequence characteristic label is not set, screening the other current service data to be cleaned by using the service data block information of the reference time sequence service data, and sorting the data; when the business data sorting mode is an event sorting mode, selecting one group of current business data to be cleaned from the multiple groups of current business data to be cleaned as reference event business data, and judging whether the business event label is set according to the number of business events of the multiple groups of current business data to be cleaned; if the service event label is set, screening the other current service data to be cleaned by using the service data block information of the reference event service data and the service event label, and performing data sorting; and if the service event label is not set, screening the other current service data to be cleaned by using the service data block information of the reference event service data, and sorting the data.
3. The method of claim 2, wherein the service data block information further comprises a data block configuration record, the method further comprising:
when the service data arrangement mode is the time sequence arrangement mode, judging whether the service duration of the multiple groups of service data to be cleaned currently are the same according to the data block sequence information; if so, performing data sorting on the multiple groups of current service data to be cleaned according to the data block configuration record; if not, setting the label weight of the time sequence feature label according to the service duration of the multiple groups of current service data to be cleaned, and performing data sorting on the multiple groups of current service data to be cleaned according to the time sequence feature label;
when the business data sorting mode is an event sorting mode, judging whether the number of the business events of the multiple groups of current business data to be cleaned is the same according to the data block sequence information; if so, performing data sorting on the multiple groups of current service data to be cleaned according to the service data formats and service data storage paths of the multiple groups of current service data to be cleaned; if not, setting the label weight of the service event label according to the number of the service events of the plurality of groups of current service data to be cleaned and the service event label to perform data sorting on the plurality of groups of current service data to be cleaned; the service data format comprises a modifiable format and a non-modifiable format, when the service data format is the modifiable format, the service requirement information comprises real-time service requirement information and delay service requirement information, and the service data storage path comprises a service data authority access path and a service data calling path.
4. The method of claim 1,
before the step of determining a correspondence between data cleansing reference information and data cleansing indicators, the method further comprises: comparing the service data difference information of the current service data to be cleaned with the service data of the previous group, and processing the service data difference information based on a plurality of groups of cleaned service data before the current service data to be cleaned so as to determine the data cleaning reference information of the current service data to be cleaned; acquiring the service data characteristics of the current service data to be cleaned; determining a data cleaning index of the current service data to be cleaned according to the service data characteristics;
after the step of performing data sorting based on the service data sorting manner to obtain a service data list, the method further includes: and cleaning the abnormal service data of the current service data to be cleaned based on the service data list.
5. The method according to claim 4, wherein after the step of performing abnormal service data cleansing on the current service data to be cleansed based on the service data list, the method further comprises:
and acquiring a service data cleaning record aiming at the current service data to be cleaned, and adjusting the thread parameters of a preset data cleaning thread according to the service data cleaning record.
6. The method of claim 5, wherein adjusting the thread parameters of the preset data cleansing thread according to the service data cleansing record comprises:
determining data cleaning logic information corresponding to the service data cleaning record, and extracting a first abnormal data identification index and a second abnormal data identification index aiming at the abnormal service data from the data cleaning logic information; the first abnormal data identification index is used for representing a data integrity index of the abnormal business data, and the second abnormal data identification index is used for representing a data correctness index of the abnormal business data;
constructing a first identification index feature matrix corresponding to a first abnormal data identification index, and constructing a second identification index feature matrix corresponding to a second abnormal data identification index, wherein the first identification index feature matrix and the second identification index feature matrix respectively comprise a plurality of index identification units with different identification accuracies; extracting initial cleaning index data of the first abnormal data identification index in any index identification unit of the first identification index characteristic matrix, and determining an index identification unit with the minimum identification accuracy in the second identification index characteristic matrix as a target index identification unit;
mapping the initial cleaning index data to the target index identification unit according to cleaning record time sequence information of the service data cleaning record, obtaining initial cleaning index mapping data in the target index identification unit, and generating an identification index correlation list between the first abnormal data identification index and the second abnormal data identification index according to the initial cleaning index data and the initial cleaning index mapping data;
acquiring to-be-processed thread data in the target index identification unit by taking the initial cleaning index mapping data as cleaning reference data, mapping the to-be-processed thread data to the index identification unit where the initial cleaning index data is located according to the relevance list characteristics corresponding to the identification index relevance list, obtaining real-time thread data corresponding to the to-be-processed thread data in the index identification unit where the initial cleaning index data is located, and determining the cleaning reference data of the real-time thread data as target cleaning index data;
acquiring a data mapping path for mapping the initial cleaning index data to the target index identification unit; according to the relevance between the real-time thread data and path node parameters corresponding to a plurality of mapping path nodes on the data mapping path, traversing a target thread parameter matrix corresponding to the target cleaning index data in the second identification index characteristic matrix until the obtained index identification weight of an index identification unit where the target thread parameter matrix is located is consistent with the index identification weight of the target cleaning index data in the first identification index characteristic matrix, stopping obtaining the target thread parameter matrix in the next index identification unit, establishing a thread parameter updating matrix between the target cleaning index data and the target thread parameter matrix obtained at the last time, and adjusting the thread parameters of the preset data cleaning thread based on the thread parameter updating matrix.
7. A cloud server comprising a processing engine, a network module, and a memory, wherein the processing engine and the memory communicate through the network module, and the processing engine reads a computer program from the memory and runs it to perform the method of any one of claims 1-6.
CN202110554388.XA 2020-12-01 2020-12-01 Big data based data cleaning method and cloud server Withdrawn CN113342788A (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN202110554388.XA CN113342788A (en) 2020-12-01 2020-12-01 Big data based data cleaning method and cloud server

Applications Claiming Priority (2)

Application Number Publication Number Priority Date Filing Date Title
CN202110554388.XA CN113342788A (en) 2020-12-01 2020-12-01 Big data based data cleaning method and cloud server
CN202011385263.0A CN112486969B (en) 2020-12-01 2020-12-01 Data cleaning method applied to big data and deep learning and cloud server

Related Parent Applications (1)

Application Number Relation Publication Number Priority Date Filing Date Title
CN202011385263.0A Division CN112486969B (en) 2020-12-01 2020-12-01 Data cleaning method applied to big data and deep learning and cloud server

Publications (1)

Publication Number Publication Date
CN113342788A (en) 2021-09-03

Family

ID=74938588

Family Applications (3)

Application Number Status Publication Number Priority Date Filing Date Title
CN202011385263.0A Active CN112486969B (en) 2020-12-01 2020-12-01 Data cleaning method applied to big data and deep learning and cloud server
CN202110554391.1A Withdrawn CN113342789A (en) 2020-12-01 2020-12-01 Data cleaning method based on big data and deep learning and cloud server
CN202110554388.XA Withdrawn CN113342788A (en) 2020-12-01 2020-12-01 Big data based data cleaning method and cloud server

Family Applications Before (2)

Application Number Status Publication Number Priority Date Filing Date Title
CN202011385263.0A Active CN112486969B (en) 2020-12-01 2020-12-01 Data cleaning method applied to big data and deep learning and cloud server
CN202110554391.1A Withdrawn CN113342789A (en) 2020-12-01 2020-12-01 Data cleaning method based on big data and deep learning and cloud server

Country Status (1)

Country Link
CN (3) CN112486969B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114203312A (en) * 2021-11-12 2022-03-18 姜德秋 Digital medical service analysis method and server combined with big data intelligent medical treatment

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US10366078B2 (en) * 2013-11-27 2019-07-30 The Regents Of The University Of California Data reduction methods, systems, and devices
CN106294492A (en) * 2015-06-08 2017-01-04 深圳中兴网信科技有限公司 Data cleaning method and cleaning engine
CN104966172A (en) * 2015-07-21 2015-10-07 上海融甸信息科技有限公司 Large data visualization analysis and processing system for enterprise operation data analysis
CN106874290B (en) * 2015-12-11 2020-08-04 阿里巴巴集团控股有限公司 Data cleaning method and equipment
US10558627B2 (en) * 2016-04-21 2020-02-11 Leantaas, Inc. Method and system for cleansing and de-duplicating data
CN106202569A (en) * 2016-08-09 2016-12-07 北京北信源软件股份有限公司 A kind of cleaning method based on big data quantity
CN109947746B (en) * 2017-10-26 2023-12-26 亿阳信通股份有限公司 Data quality control method and system based on ETL flow
CN107943973A (en) * 2017-11-28 2018-04-20 上海云信留客信息科技有限公司 A kind of big data system for washing intelligently and cloud intelligent robot clean service platform
CN110275878B (en) * 2019-06-25 2021-08-17 北京达佳互联信息技术有限公司 Service data detection method and device, computer equipment and storage medium
CN111061732A (en) * 2019-12-05 2020-04-24 深圳迅策科技有限公司 Report generation method based on big data processing

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN114203312A (en) * 2021-11-12 2022-03-18 姜德秋 Digital medical service analysis method and server combined with big data intelligent medical treatment
CN114203312B (en) * 2021-11-12 2022-12-16 蓝气球(北京)医学研究有限公司 Digital medical service analysis method and server combined with big data intelligent medical treatment

Also Published As

Publication number Publication date
CN113342789A (en) 2021-09-03
CN112486969A (en) 2021-03-12
CN112486969B (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN108228469B (en) Test case selection method and device
CN113347632A (en) Hot spot sharing method applied to artificial intelligence and big data cloud platform
CN115174231B (en) Network fraud analysis method and server based on AI Knowledge Base
CN112487495B (en) Data processing method based on big data and cloud computing and big data server
CN112214496B (en) Cosmetic production line safety monitoring method based on big data analysis and cloud server
CN115048370B (en) Artificial intelligence processing method for big data cleaning and big data cleaning system
CN111917789B (en) Data processing method based on big data and Internet of things communication and cloud computing platform
CN112255983A (en) Big data processing method and production data processing center based on cosmetic production
CN112486969B (en) Data cleaning method applied to big data and deep learning and cloud server
CN115238828A (en) Chromatograph fault monitoring method and device
CN112702422A (en) Big data cooperative processing method based on cloud computing and edge computing and cloud server
CN115128438A (en) Chip internal fault monitoring method and device
CN112486955B (en) Data maintenance method based on big data and artificial intelligence and big data server
CN112528306A (en) Data access method based on big data and artificial intelligence and cloud computing server
CN110177006B (en) Node testing method and device based on interface prediction model
CN113098884A (en) Network security monitoring method based on big data, cloud platform system and medium
CN112215518B (en) Cloud computing-combined cosmetic production chain scheduling method and artificial intelligence cloud platform
EP3748549B1 (en) Learning device and learning method
CN112579756A (en) Service response method based on cloud computing and block chain and artificial intelligence interaction platform
CN113032236B (en) Business behavior processing method and server applied to artificial intelligence and cloud computing
CN117291436A (en) Fault diagnosis method, device and medium for power grid equipment acquisition system
CN115525331A (en) Reverse analysis method for intelligent terminal firmware of power grid sensing layer
CN112613878A (en) Information detection method based on big data and block chain payment and big data server
CN117615359A (en) Bluetooth data transmission method and system based on multiple rule engines
CN115205056A (en) Sample information analysis method and system applied to business wind control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210903
