CN110555019A - Data cleaning method based on service end

Info

Publication number: CN110555019A
Application number: CN201910863837.1A
Authority: CN
Inventors: 周道华; 杨陈; 曾俊; 洪江; 彭容; 黄维; 李武鸿; 刘瑞东; 张明娟; 许江泽; 吴婷婷; 付志华; 刘杰; 詹飞; 程武彬; 杨眉
Original assignee: CHENGDU ZHONGKE DAQI SOFTWARE Co Ltd
Current assignee: CHENGDU ZHONGKE DAQI SOFTWARE Co Ltd
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2019-12-10
Anticipated expiration: 2039-09-12
Also published as: CN110555019B

Abstract

The invention discloses a data cleaning method based on a service end, which comprises the following steps: the service end converts local data into corresponding label fields and forms a label field group; the label field group is uploaded to a server; the server stores the association relationships between a plurality of standard label fields and corresponding data cleaning algorithms; the server matches the uploaded label field group against the standard label fields to obtain the standard label field with the highest matching degree; the server issues the data cleaning algorithm associated with the standard label field with the highest matching degree to the service end; and the service end cleans the local data using the obtained data cleaning algorithm. The invention moves data cleaning from the server to the service end and uses the idle resources of the service end for the cleaning, thereby reducing the cost of the server; meanwhile, by matching the label fields corresponding to the data, the data cleaning algorithm most suitable for the service end is obtained.

Description

Data cleaning method based on a service end

Technical Field

The invention relates to a data cleaning method based on a service end.

Background

Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a certain time range; it is a massive, fast-growing and diversified information asset that requires new processing modes in order to provide stronger decision-making power, insight discovery and process optimization capability.

The strategic significance of big data technology lies not in holding huge amounts of data, but in the specialized processing of the meaningful data. In other words, if big data is compared to an industry, the key to profitability in that industry is to improve the "processing ability" applied to the data and to realize the "added value" of the data through that processing. Technically, big data and cloud computing are as inseparable as the two sides of a coin. Big data necessarily cannot be processed by a single computer, so a distributed architecture must be adopted; its defining characteristic is distributed data mining over massive data.

However, the existing data cleaning and data processing of big data are usually performed on the server. As the data volume increases, the costs of building and operating the server become higher and higher; if one of the steps (e.g. data cleaning) can be moved down to the service end, the cost of the server can be reduced to some extent. A data cleaning program then needs to be installed at the service end, but if the same program is used for data from different fields, the cleaning effect is poor.

The information disclosed in this background section is only for enhancement of understanding of the general background of the disclosure and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a data cleaning method based on a service end.

The purpose of the invention is achieved by the following technical scheme:

The invention provides a data cleaning method based on a service end, which comprises the following steps:

The service end converts the local data into corresponding label fields and forms a label field group;

The service end uploads the label field group to a server;

The server stores the association relationships between a plurality of standard label fields and corresponding data cleaning algorithms;

The server matches the uploaded label field group against the standard label fields to obtain the standard label field with the highest matching degree;

The server issues the data cleaning algorithm associated with the standard label field with the highest matching degree to the service end;

The service end cleans the local data using the obtained data cleaning algorithm.

Further, when the service end uploads the label field group to the server, the service end type is uploaded to the server together with it;

In this case, the step in which the server stores the association relationships between a plurality of standard label fields and corresponding data cleaning algorithms is replaced by the following:

The server stores the association relationships among a plurality of service end types, standard label fields and corresponding data cleaning algorithms.

Further, the association relationship between a standard label field and its corresponding data cleaning algorithm is obtained as follows:

Acquiring data from a plurality of service ends, wherein the service end data comprises label fields;

Selecting a plurality of label fields according to actual requirements to form a data dictionary, thereby forming a standard label field;

Cleaning the data of the standard label field with a plurality of data cleaning algorithms;

Associating the data cleaning algorithm with the best cleaning effect with the corresponding standard label field.

Further, the method further comprises:

The server retrieves the cleaned source data.

Further, the method further comprises:

When the server detects that the service end accesses a predetermined interface, the server establishes a connection with the service end and creates a service thread to acquire and parse requests from the service end; the requests comprise a data cleaning algorithm acquisition request and a source data retrieval request;

When the service thread of the server parses the request from the service end as a data cleaning algorithm acquisition request, the data is issued through the predetermined interface;

When the service thread of the server parses the request from the service end as a source data retrieval request, the connection between the service end and a source data interface is established.

Further, after the connection between the service end and the source data interface is established, a first monitoring thread is established; simultaneously:

The first monitoring thread monitors the source data retrieval process, judges whether the source data interface can be successfully accessed and successfully returns the requested data, and judges whether the returned data received by the source data interface is valid data; if any item is not satisfied, early warning information is generated and sent;

After judging that the source data interface can be successfully accessed, creating a second monitoring thread corresponding to the source data interface;

The second monitoring thread monitors the database log records of the service end in a polling mode and verifies whether current data validly exists; if not, early warning information is generated and sent.

Further, the judging whether the source data interface can be successfully accessed specifically includes:

Accessing the source data interface with a request over the HTTP protocol, and verifying whether the interface can be called normally;

The judging whether the requested data is successfully returned specifically includes:

Judging the interface request state over the HTTP protocol, and verifying according to the returned protocol status code;

The judging whether the returned data received by the source data interface is valid data specifically includes:

Verifying the data structure: whether the returned data structure meets the service requirements after the interface request succeeds;

Verifying whether the format of each data item is correct, including the character types (characters, digits, Chinese characters) and the length;

Verifying whether all returned data items are valid data;

The verifying whether current data validly exists specifically includes:

Polling the data table of log records in the database of the service end to check whether new data has been generated; specifically, a task scheduler runs the check in a polling mode using a T-SQL script statement.

Furthermore, the source data interface is scheduled and allocated by a task scheduling module of the server, and the early warning information is sent to the task scheduling module; the early warning information comprises a data error condition and a source data interface ID;

When early warning information carrying the same source data interface ID is received multiple times within a period of time, the corresponding source data interface is suspended, the connection with the service end is automatically disconnected, and a new source data interface is allocated and connected to the service end; the suspended source data interface is then released.

Further, the data error condition comprises a data error type and the data packet name of the corresponding source data; the data packet name is sent to the service end so that the service end resends the data packet that was not sent completely;

The server merges all data received from a service end whose interface has been reallocated; for data with the same data packet name, every copy whose data size is not the largest is automatically discarded.

Further, the method further comprises:

The server issues the visualization tool corresponding to the standard label field with the highest matching degree to the service end;

The service end visually displays the cleaned local data through the visualization tool, according to the data standard of the corresponding label field group.

The invention has the beneficial effects that:

(1) In an exemplary embodiment of the invention, data cleaning is moved from the server to the service end, and the idle resources (time and configuration) of the service end are used for the cleaning, so that the cost of the server is reduced; meanwhile, by matching the label fields corresponding to the data, the data cleaning algorithm most suitable for the service end is obtained.

(2) In an exemplary embodiment of the invention, the service end types are uploaded together, and then the corresponding types are selected for association when matching is performed, so that the data cleaning algorithm is more accurately selected.

(3) In an exemplary embodiment of the present invention, the standard label field is selected according to actual requirements (industry requirements or specific requirements); that is, the standard labels are combined at the server in advance, the data corresponding to the standard labels is then cleaned with a plurality of data cleaning algorithms, the cleaning results are compared, and the algorithm with the best cleaning effect is selected for association.

(4) In an exemplary embodiment of the invention, the cleaned source data is uploaded to the server in response to the server's retrieval call, so that the server obtains the cleaned data to form big data, which facilitates post-processing.

(5) In an exemplary embodiment of the present invention, a service thread determines the request of the service end and confirms the data transmission interface: the data volume of a data cleaning algorithm is small, so only the predetermined interface is used; when source data is retrieved, the data volume is large, so a source data interface is used for the connection.

(6) In an exemplary embodiment of the present invention, interface early warning is provided. Its core idea is to guarantee the early warning of the data interface without omission through two layers, from the server to the service end: the first layer monitors the source data interface of the server, and the second layer monitors the database log records of the service end. Moreover, the second layer of monitoring is created only once the first layer has succeeded, which avoids wasting resources by still creating the second layer of monitoring when the server has a problem.

(7) In an exemplary embodiment of the present invention, specific implementations are disclosed for judging whether the source data interface can be successfully accessed, judging whether the requested data is successfully returned, judging whether the returned data received by the source data interface is valid data, and verifying whether current data validly exists.

This has two effects: A. invalid data can be discarded to the greatest extent at the source data receiving end, ensuring that all received and stored data is valid data; B. whether the interface is normal is verified at the first checkpoint, so that real-time data failures are captured, discovered and reported through early warning as soon as they occur.

(8) In an exemplary embodiment of the invention, a task scheduling module schedules and allocates the source data interfaces; when early warning information carrying the same source data interface ID is received multiple times within a period of time, the corresponding source data interface is suspended, the connection with the service end is automatically disconnected, and a new source data interface is allocated and connected to the service end; the suspended source data interface is then released. That is, post-processing after an early warning is realized by reallocating the interface.

(9) In an exemplary embodiment of the present invention, transmission can be resumed by means of the data packet name. In addition, since the exact point at which transmission should resume is difficult to determine, the problem is solved by resending the whole data packet.

(10) In an exemplary embodiment of the present invention, the cleaned source data is visually displayed; the software module of the visualization tool is issued by the server to the service end, and in particular the software module is configured according to the data selected by the corresponding standard label field.

Drawings

FIG. 1 is a flow chart of a method in an exemplary embodiment of the invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.

In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Referring to fig. 1, fig. 1 is a flowchart illustrating a data cleansing method based on a service end according to an exemplary embodiment, which specifically includes:

S1: The service end converts the local data into corresponding label fields and forms a label field group.

A label field is the actual concept that an item of local data represents, and the label field group is defined as the collection of the label fields of all local data of the service end.
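As an illustration of step S1, the following minimal sketch maps the column names of local records to label fields through a lookup table and collects them into a label field group. The patent does not specify how local data is converted into label fields; the alias table `FIELD_ALIASES`, the function `to_label_field_group` and the sample records are all hypothetical.

```python
# Hypothetical alias table: local column name -> label field (concept).
FIELD_ALIASES = {
    "cust_nm": "customer_name",
    "tel": "phone_number",
    "amt": "order_amount",
}


def to_label_field_group(records):
    """Return the set of label fields covering all local records.

    Columns without a known alias keep their own name as the label field.
    """
    group = set()
    for record in records:
        for column in record:
            group.add(FIELD_ALIASES.get(column, column))
    return group


# Illustrative local data held by a service end.
local_data = [
    {"cust_nm": "Li", "tel": "13800000000"},
    {"cust_nm": "Wang", "amt": 12.5, "remark": ""},
]
label_field_group = to_label_field_group(local_data)
```

Only the label field group, not the records themselves, would then be uploaded in step S2.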

S2: The service end uploads the label field group to the server.

Specifically, in this step, the tag field set is uploaded to the server for matching comparison.

In yet another exemplary embodiment, when the service end uploads the label field group to the server, the service end type is uploaded to the server together with it.

Different service end types (a type may be an industry type or a classification proposed by the service end) may have the same or similar label field groups; processing them all in the same way would give poor results, so the service end type is uploaded as well and the corresponding type is selected for association in the subsequent matching.

S3: The server stores the association relationships between a plurality of standard label fields and corresponding data cleaning algorithms; the server matches the uploaded label field group against the standard label fields to obtain the standard label field with the highest matching degree.

Specifically, in this step, the server stores the association relationships between a plurality of standard label fields and corresponding data cleaning algorithms. Each standard label field corresponds to exactly one data cleaning algorithm, while one data cleaning algorithm may correspond to a plurality of standard label fields.

When matching, a variety of matching measures may be employed, for example the percentage of the label fields in the uploaded label field group that match a standard label field, or the number of matched label fields relative to the total number of label fields in the standard label field.
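The two matching measures mentioned above can be sketched as follows. This is only an illustration under stated assumptions: each standard label field is treated as a set of label fields (as suggested by the data dictionary of step S32), and the function names, the standard names and the tie-breaking rule are hypothetical.

```python
def coverage_of_group(group, standard):
    """Share of the uploaded group's fields that appear in the standard set."""
    return len(group & standard) / len(group) if group else 0.0


def coverage_of_standard(group, standard):
    """Share of the standard set's fields that the uploaded group matches."""
    return len(group & standard) / len(standard) if standard else 0.0


def best_standard(group, standards, measure=coverage_of_group):
    """Return the name of the standard field set with the highest score.

    `standards` maps a (hypothetical) name to a set of label fields.
    Ties are resolved in favour of the first entry encountered.
    """
    return max(standards, key=lambda name: measure(group, standards[name]))


# Illustrative standard label fields stored on the server.
standards = {
    "retail_orders": {"customer_name", "phone_number", "order_amount"},
    "clinic_visits": {"patient_name", "phone_number", "diagnosis"},
}
uploaded = {"customer_name", "phone_number", "order_amount", "remark"}
winner = best_standard(uploaded, standards)
```

Which measure fits better depends on whether the uploaded group or the standard set is expected to be the larger of the two; the source leaves this choice open.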

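Corresponding to the exemplary embodiment of step S2 in which the service end type is uploaded, the step in which the server stores the association relationships between a plurality of standard label fields and corresponding data cleaning algorithms is replaced by: the server stores the association relationships among a plurality of service end types, standard label fields and corresponding data cleaning algorithms.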

In yet another exemplary embodiment, the association relationship between a standard label field and its corresponding data cleaning algorithm is obtained as follows:

S31: Acquiring data from a plurality of service ends, wherein the service end data comprises label fields;

S32: Selecting a plurality of label fields according to actual requirements to form a data dictionary, thereby forming a standard label field;

S33: Cleaning the data of the standard label field with a plurality of data cleaning algorithms;

S34: Associating the data cleaning algorithm with the best cleaning effect with the corresponding standard label field.

Specifically, in this exemplary embodiment, the standard label field is selected according to actual requirements (industry requirements or specific requirements); that is, the standard labels are combined at the server in advance, the data corresponding to the standard labels is then cleaned with a plurality of data cleaning algorithms, the cleaning results are compared, and the algorithm with the best cleaning effect is selected for association.
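Steps S33 and S34 can be sketched as a small selection loop. The source does not define how the "best cleaning effect" is measured, so the sketch assumes a hand-checked reference result and scores each candidate by the fraction of values it reproduces; the two candidate algorithms, the scoring rule and all names are hypothetical.

```python
def strip_whitespace(values):
    """Candidate algorithm 1: trim surrounding whitespace."""
    return [v.strip() for v in values]


def strip_and_lower(values):
    """Candidate algorithm 2: trim whitespace and lowercase."""
    return [v.strip().lower() for v in values]


def score(cleaned, reference):
    """Fraction of cleaned values that equal the reference values (assumed metric)."""
    hits = sum(1 for c, r in zip(cleaned, reference) if c == r)
    return hits / len(reference)


def associate_best(standard_field, sample, reference, algorithms):
    """Run every candidate on the sample and keep the best-scoring one."""
    best_name = max(
        algorithms, key=lambda name: score(algorithms[name](sample), reference)
    )
    return {standard_field: best_name}


algorithms = {
    "strip_whitespace": strip_whitespace,
    "strip_and_lower": strip_and_lower,
}
sample = [" Alice@Example.com ", "BOB@example.com"]
reference = ["alice@example.com", "bob@example.com"]
association = associate_best("email", sample, reference, algorithms)
```

The resulting dictionary plays the role of the stored association relationship: one algorithm per standard label field, while the same algorithm may be associated with several fields.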

S4: The server issues the data cleaning algorithm associated with the standard label field with the highest matching degree to the service end.

Specifically, in this exemplary embodiment, the data cleaning algorithm that is issued may in substance be a program installation package.

S5: and the service end cleans the local data by using the obtained data cleaning algorithm.

At this point, the service end has obtained the most suitable data cleaning algorithm, and the data it produces after cleaning is the most accurate.

In this way, data cleaning is moved from the server to the service end, and the idle resources (time and configuration) of the service end are used for the cleaning, so that the cost of the server is reduced; meanwhile, by matching the label fields corresponding to the data, the data cleaning algorithm most suitable for the service end is obtained.

In yet another exemplary embodiment, the method further comprises:

S611: The server retrieves the cleaned source data.

That is, in this exemplary embodiment, the cleaned source data is uploaded to the server in response to the server's retrieval call, so that the server obtains the cleaned data to form big data, which facilitates post-processing.

Based on the implementation of the above exemplary embodiment, in a further exemplary embodiment, the method further includes:

S01: When the server detects that the service end accesses a predetermined interface, the server establishes a connection with the service end and creates a service thread to acquire and parse requests from the service end; the requests comprise a data cleaning algorithm acquisition request and a source data retrieval request.

In an exemplary embodiment, the predetermined interface may be a hardware interface, such as a serial port or a USB interface, in which case the corresponding service end may be a physical data device for collecting data; or it may be a software interface, such as an application programming interface (API), in which case the corresponding service end may be a storage device holding software data.

Since the service end may issue several kinds of request (one of which is a source data retrieval request), when the service end accesses the server through the predetermined interface, the server establishes a connection with the service end and at the same time creates a service thread, which acquires and parses the requests from the service end and in particular identifies data cleaning algorithm acquisition requests and source data retrieval requests.

Specifically, when the service thread of the server resolves that the request from the service end is a data cleansing algorithm acquisition request (i.e., step S3), the data is issued by using the predetermined interface.

When the service thread of the server parses the request from the service end as a source data retrieval request (i.e., step S611), step S612 is performed: the connection between the service end and the source data interface is established. That is, the service thread identifies the request of the service end and at the same time establishes the connection between the service end and the source data interface.

The data volume of a data cleaning algorithm is small, so only the predetermined interface is used; when source data is retrieved, the data volume is large, so the source data interface is used for the connection.
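The routing performed by the service thread can be sketched as follows. The request representation (a dictionary with a `type` key), the request type names and the returned interface labels are hypothetical; the source only states that the two request kinds are served over different interfaces.

```python
import threading

# Hypothetical request type names.
ALGORITHM_REQUEST = "get_cleaning_algorithm"
SOURCE_DATA_REQUEST = "retrieve_source_data"


def route_request(request):
    """Decide which interface serves a parsed request."""
    kind = request.get("type")
    if kind == ALGORITHM_REQUEST:
        # Small payload: answer over the predetermined interface itself.
        return "predetermined_interface"
    if kind == SOURCE_DATA_REQUEST:
        # Large payload: connect the service end to a source data interface.
        return "source_data_interface"
    raise ValueError(f"unknown request type: {kind!r}")


def service_thread(request, results):
    """Body of the per-connection service thread."""
    results.append(route_request(request))


# One service thread is created per connection from a service end.
results = []
worker = threading.Thread(
    target=service_thread, args=({"type": SOURCE_DATA_REQUEST}, results)
)
worker.start()
worker.join()
```

Keeping the routing decision in a plain function makes it easy to check separately from the threading code.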

An exemplary embodiment further comprises the following step: S613: after the connection between the service end and the source data interface is established, a first monitoring thread is created.

That is, when the service thread recognizes that the service end requests data retrieval, it not only establishes the connection between the service end and the source data interface but also creates the first monitoring thread, which starts monitoring at the server layer.

S614: The first monitoring thread monitors the source data retrieval process, judges whether the source data interface can be successfully accessed and successfully returns the requested data, and judges whether the returned data received by the source data interface is valid data; if any item is not satisfied, early warning information is generated and sent.

Each time the source data interface is called, a dual verification is carried out: (1) judging whether the source data interface can be successfully accessed and successfully returns the requested data; (2) judging whether the returned data received by the source data interface is valid data. If both are satisfied, the next step is carried out; otherwise, early warning information is generated and sent.

In an exemplary embodiment, the determining whether the source data interface can be successfully accessed specifically includes:

Accessing the source data interface with a request over the HTTP protocol to verify whether the interface can be called normally.

In another exemplary embodiment, the judging whether the requested data is successfully returned specifically includes:

Judging the interface request state over the HTTP protocol, and verifying according to the returned protocol status code.

In yet another exemplary embodiment, the determining whether the returned data received by the source data interface is valid data specifically includes:

Verifying whether all returned data items are valid data.
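The checks of step S614 can be sketched without any network access by working on a status code and an already-decoded payload. The response schema (`packet_name`, `items`), the 32-character limit, the accepted character classes and the warning labels are assumptions made for illustration; the source names the kinds of check but not their concrete rules.

```python
import re

# Hypothetical response schema and per-item format rule.
REQUIRED_KEYS = {"packet_name", "items"}
ITEM_PATTERN = re.compile(r"[A-Za-z0-9\u4e00-\u9fff]{1,32}")


def request_succeeded(status_code):
    """Treat any 2xx HTTP status code as a successful return."""
    return 200 <= status_code < 300


def is_valid_payload(payload):
    """Check structure, per-item format, and that no item is empty."""
    if not isinstance(payload, dict) or not REQUIRED_KEYS.issubset(payload):
        return False
    items = payload["items"]
    if not isinstance(items, list) or not items:
        return False
    return all(
        isinstance(item, str) and ITEM_PATTERN.fullmatch(item) is not None
        for item in items
    )


def check_response(status_code, payload):
    """Return None when all checks pass, otherwise an early-warning reason."""
    if not request_succeeded(status_code):
        return "request_failed"
    if not is_valid_payload(payload):
        return "invalid_data"
    return None
```

A monitoring thread could call `check_response` after each retrieval and forward any non-`None` result as the data error type of the early warning information.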

S615: After judging that the source data interface can be successfully accessed, a second monitoring thread corresponding to the source data interface is created.

That is, once the source data interface can be accessed, a second monitoring thread is created, and monitoring at the service end layer starts through it.

The second monitoring thread is created only on the premise that the source data interface has been successfully accessed, which avoids wasting resources by still creating the second monitoring thread when the server has a problem.

S616: The second monitoring thread monitors the database log records of the service end in a polling mode and verifies whether current data validly exists; if not, early warning information is generated and sent.

In an exemplary embodiment, the verifying whether current data validly exists specifically includes: polling the data table of log records in the database of the service end to check whether new data has been generated; specifically, a task scheduler runs the check in a polling mode using a T-SQL script statement.
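A polling check for new rows in a log table can be sketched as below. The source describes a T-SQL script run by a task scheduler; this sketch substitutes Python's built-in `sqlite3` module and an in-memory database so that it is self-contained. The table name `data_log`, its columns and the warning label are hypothetical.

```python
import sqlite3


def has_new_rows(conn, last_seen_id):
    """Return (found, newest_id) for rows added after last_seen_id."""
    row = conn.execute(
        "SELECT MAX(id) FROM data_log WHERE id > ?", (last_seen_id,)
    ).fetchone()
    newest = row[0]
    if newest is None:
        return False, last_seen_id
    return True, newest


def poll_once(conn, last_seen_id, alerts):
    """One polling round; records an early warning when nothing new arrived."""
    found, newest = has_new_rows(conn, last_seen_id)
    if not found:
        alerts.append("no_new_data")
    return newest


# In-memory stand-in for the log table on the service end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_log (id INTEGER PRIMARY KEY, packet_name TEXT)")
conn.execute("INSERT INTO data_log (packet_name) VALUES ('p1')")

alerts = []
cursor_id = poll_once(conn, 0, alerts)          # a row was added since id 0
cursor_id = poll_once(conn, cursor_id, alerts)  # nothing new in this round
```

In a deployment, a scheduler would invoke `poll_once` at a fixed interval and keep `cursor_id` between rounds.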

In an exemplary embodiment, the source data interface is scheduled and allocated by a task scheduling module of a server, and the early warning message is sent to the task scheduling module; the early warning information comprises a data error condition and a source data interface ID.

That is, there are a plurality of source data interfaces, and their scheduling and allocation must be handled by a unified mechanism (i.e., the task scheduling module). When early warning information is generated, the task scheduling module adjusts the source data interface according to the actual situation. In an exemplary embodiment, for the same source data interface: (1) if early warning information appears only once (or up to a preset number of times) within a period of time, the source data interface does not need to be processed; (2) when early warning information carrying the same source data interface ID is received multiple times within a period of time, the source data interface evidently has a problem, and the task scheduling module needs to adjust the interface. Specifically, the corresponding source data interface is suspended, the connection with the service end is automatically disconnected, and a new source data interface is allocated and connected to the service end; the suspended source data interface is then released.
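The suspend-and-reallocate rule can be sketched with a sliding time window per source data interface ID. The class name, the threshold and window values, and the use of caller-supplied timestamps are assumptions for illustration; the source only says "multiple times in a period of time".

```python
from collections import defaultdict, deque


class InterfaceScheduler:
    """Suspends an interface after repeated warnings inside a time window."""

    def __init__(self, interface_ids, threshold=3, window=60.0):
        self.free = deque(interface_ids)      # interfaces available for allocation
        self.assigned = {}                    # service end -> interface ID
        self.warnings = defaultdict(deque)    # interface ID -> warning timestamps
        self.threshold = threshold            # assumed value
        self.window = window                  # assumed value, in seconds

    def allocate(self, service_end):
        """Connect a service end to the next free source data interface."""
        self.assigned[service_end] = self.free.popleft()
        return self.assigned[service_end]

    def report_warning(self, service_end, now):
        """Record a warning; reallocate when the threshold is reached."""
        interface_id = self.assigned[service_end]
        stamps = self.warnings[interface_id]
        stamps.append(now)
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()                  # forget warnings outside the window
        if len(stamps) < self.threshold:
            return interface_id               # tolerated: keep the interface
        # Suspend and disconnect, connect a new interface, then release the old one.
        del self.assigned[service_end]
        new_id = self.allocate(service_end)
        stamps.clear()
        self.free.append(interface_id)
        return new_id


scheduler = InterfaceScheduler(["if-1", "if-2"], threshold=2, window=60.0)
first = scheduler.allocate("service-end-A")
kept = scheduler.report_warning("service-end-A", now=0.0)
moved = scheduler.report_warning("service-end-A", now=10.0)
```

Releasing the old interface only after the new one is connected mirrors the order given in the text; the sketch does not handle the case where no free interface is left.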

Namely, post-processing after early warning is realized by reallocating the interfaces.

However, if the source data interface is suspended when early warning information is generated, the suspension may occur while a data packet is only partly sent, so a solution to this problem is needed. Specifically:

In an exemplary embodiment, the data error condition includes a data error type and the data packet name of the corresponding source data; the data packet name is sent to the service end so that the service end resends the data packet that was not sent completely.

The data error type covers the three judgments above (two at the server, and one on the database log of the service end), and transmission can be resumed by means of the data packet name. In addition, since the exact point at which transmission should resume is difficult to determine, the problem is solved by resending the whole data packet.

To avoid part of the data being acquired repeatedly because of the resent data packet (which would cause a large data acquisition error), in an exemplary embodiment the server merges all data received from a service end whose interface has been reallocated; for data with the same data packet name, every copy whose data size is not the largest is automatically discarded.
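The merge rule for resent packets can be sketched as a single pass that keeps the largest copy per data packet name. Representing a packet as a `(name, payload)` pair and measuring size as `len(payload)` are assumptions made for the example.

```python
def merge_packets(packets):
    """Keep, for each packet name, only the copy with the largest size.

    Each packet is a (name, payload) pair; size is taken as len(payload).
    """
    largest = {}
    for name, payload in packets:
        if name not in largest or len(payload) > len(largest[name]):
            largest[name] = payload
    return largest


received = [
    ("packet-001", b"abc"),          # partial copy sent before suspension
    ("packet-002", b"xyz123"),
    ("packet-001", b"abcdefgh"),     # complete copy sent after reallocation
]
merged = merge_packets(received)
```

Because a partially sent packet is always shorter than its complete resend, keeping the largest copy discards the truncated one without needing to know where the transfer was interrupted.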

In an exemplary embodiment, the early warning information is further sent to a display device, so that an administrator or a decision maker can learn the fault condition of each source data interface and handle it accordingly.

The display device may be a mobile terminal, a fixed terminal or the like of an administrator or a decision maker, and the notification may be delivered by mail, short message or APP. On receiving early warning information comprising the source data interface ID and the data error condition, the administrator or decision maker can follow the fault condition of each source data interface in real time and handle it quickly.

To avoid a source data interface being used before it has recovered, an exemplary embodiment adopts the following:

A source data interface that has been suspended and released has, for a preset time, a lower priority than source data interfaces that have not been suspended; its priority is restored after the preset time or after it has been handled by an administrator or decision maker.

For a source data interface that has been handled by an administrator or decision maker, the priority is restored immediately, i.e. the source data interface is considered to have returned to normal; for a source data interface that has not been handled by an administrator or decision maker, the priority is restored automatically only after the preset time, which avoids the interface being called again immediately after it is suspended and released and causing problems again.
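The priority rule for released interfaces can be sketched as a function of the release time. The numeric priorities, the fixed penalty of one level, the 300-second preset time and the record layout are all hypothetical; the source only states that the priority is lower for a preset time and restored afterwards or after manual handling.

```python
def effective_priority(base, released_at, now, cooldown, handled=False):
    """Priority of a source data interface when the scheduler picks one.

    released_at is None for an interface that was never suspended.
    """
    if released_at is None or handled:
        return base                   # never suspended, or handled manually
    if now - released_at >= cooldown:
        return base                   # preset time elapsed: restore automatically
    return base - 1                   # still cooling down: rank below the others


def pick_interface(interfaces, now, cooldown=300.0):
    """Choose the interface with the highest effective priority."""
    return max(
        interfaces,
        key=lambda i: effective_priority(
            i["priority"], i["released_at"], now, cooldown, i["handled"]
        ),
    )["id"]


interfaces = [
    {"id": "if-1", "priority": 5, "released_at": 100.0, "handled": False},
    {"id": "if-2", "priority": 5, "released_at": None, "handled": False},
]
choice_during_cooldown = pick_interface(interfaces, now=200.0)
choice_after_cooldown = pick_interface(interfaces, now=500.0)
```

Setting `handled` to `True` for an interface models the administrator or decision maker confirming that it has recovered, which restores its priority without waiting for the preset time.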

In yet another exemplary embodiment, the method further comprises:

S4': The server issues the visualization tool corresponding to the standard label field with the highest matching degree to the service end (i.e., after step S4).

S62: The service end visually displays the cleaned local data through the visualization tool, according to the data standard of the corresponding label field group.

Specifically, the software module of the visualization tool is issued by the server to the service end, and in particular the software module is configured according to the data selected by the corresponding standard label field.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device), or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It is to be understood that the above-described embodiments are illustrative only and not restrictive of the broad invention, and that various other modifications and changes will be suggested to persons skilled in the art based on the above teachings. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims

1. A data cleaning method based on a service end, characterized in that the method comprises the following steps:

Uploading the label field group to a server;

2. The data cleaning method based on the service end according to claim 1, characterized in that: when the service end uploads the label field group to the server, the service end type is also uploaded to the server;

3. The data cleaning method based on the service end according to claim 1, characterized in that: the obtaining mode of the incidence relation between the standard label field and the corresponding data cleaning algorithm specifically comprises the following steps:

4. The data cleaning method based on the service end according to claim 1, characterized in that: the method further comprises the following steps:

And the server side calls the cleaned source data.

5. The service-side-based data cleaning method according to claim 4, wherein: the method further comprises the following steps:

6. The service-side-based data cleaning method according to claim 5, wherein: after the connection between the service end and the source data interface is established, a first monitoring thread is established; simultaneously:

7. The data cleaning method based on the service end according to claim 6, wherein: the judging whether the source data interface can be successfully accessed specifically includes:

judging the request data successfully returned, specifically comprising:

verifying whether all returned data items are valid and valid data;

8. The service-side-based data cleaning method according to claim 4, wherein: the source data interface is dispatched and distributed by a task dispatching module of the server side, and the early warning information is sent to the task dispatching module; the early warning information comprises a data error condition and a source data interface ID;

9. The service-side-based data cleaning method according to claim 8, wherein: the data error condition comprises a data error type and a data packet name corresponding to source data; sending the data packet name to a service end so that the service end sends the data packet which is not sent completely;

10. The data cleaning method based on the service end according to claim 1, characterized in that: the method further comprises the following steps: