CN113138982B - Big data cleaning method


Info

Publication number
CN113138982B
CN113138982B (application number CN202110571367.9A)
Authority
CN
China
Prior art keywords
data
service data
importance degree
original
target
Legal status
Active
Application number
CN202110571367.9A
Other languages
Chinese (zh)
Other versions
CN113138982A (en)
Inventor
黄柱挺
申海平
沈陕威
苏军武
Current Assignee
Shenzhen Yuanuniverse Technology Co., Ltd.
Original Assignee
Shenzhen Yuanuniverse Technology Co., Ltd.
Priority date
Application filed by Shenzhen Yuanuniverse Technology Co., Ltd.
Priority to CN202110571367.9A
Publication of CN113138982A
Application granted
Publication of CN113138982B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data


Abstract

The application provides a big data cleaning method and relates to the technical field of data processing. In the application, original service data to be processed is first obtained, where the original service data is service data whose data volume is larger than a preset volume and which is obtained by performing data acquisition on a target service object. The original service data is then cleaned to screen out invalid data and obtain target service data, where the invalid data is service data whose importance degree in the original service data is lower than a preset degree, and the target service data is part or all of the original service data. Based on the method, the problem of poor data cleaning effect in the prior art can be solved.

Description

Big data cleaning method
Technical Field
The application relates to the technical field of data processing, in particular to a big data cleaning method.
Background
In the field of big data technology, the data to be processed is massive, and not all of it can be utilized; the acquired data therefore needs to be cleaned. However, the inventors have found that conventional data cleaning techniques suffer from a poor cleaning effect.
Disclosure of Invention
In view of the above, an object of the present application is to provide a big data cleansing method to solve the problem of poor data cleansing effect in the prior art.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
a big data cleaning method is applied to big data cleaning equipment and comprises the following steps:
acquiring original service data to be processed, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained based on data acquisition of a target service object;
and cleaning the original service data to screen out invalid data in the original service data to obtain target service data, wherein the invalid data is service data of which the importance degree is lower than a preset degree in the original service data, and the target service data is part or all of the original service data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the original service data to screen out invalid data in the original service data to obtain target service data includes:
denoising the original service data to screen distortion data in the original service data to obtain first service data, wherein the distortion data is error data in the original service data, and the first service data is part or all of data in the original service data;
and cleaning the first service data to screen out invalid data in the first service data to obtain target service data, wherein the invalid data is the service data of which the importance degree in the first service data is lower than a preset degree, and the target service data is part or all of the original service data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the first service data to screen out invalid data in the first service data to obtain target service data includes:
performing content identification processing on the first service data to obtain a corresponding content identification result;
determining importance degree of each data part of the first service data based on the content identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data part which does not belong to the invalid data as the target business data.
In a possible embodiment, in the big data cleansing method, the step of performing importance determination processing on each data portion of the first service data based on the content identification result to obtain importance information corresponding to each data portion includes:
obtaining a pre-constructed content-importance degree corresponding relation, wherein the content-importance degree corresponding relation is generated on the basis of a first configuration operation of the big data cleaning equipment responding to a user;
and determining importance degree information of each data part of the first service data based on the content identification result and the content-importance degree corresponding relation.
In a possible embodiment, in the big data cleansing method, the step of performing importance determination processing on each data portion of the first service data based on the content identification result to obtain importance information corresponding to each data portion includes:
for each data part in the first service data, judging whether the data part has preset mark information or not, wherein the preset mark information is generated based on response user operation;
and determining the importance degree information of each data part with the preset mark information as having first importance degree information, and determining the importance degree information of each data part without the preset mark information as having second importance degree information, wherein the first importance degree information is used for representing that the corresponding data part does not belong to invalid data, and the second importance degree information is used for representing that the corresponding data part belongs to invalid data.
In a possible embodiment, in the big data cleansing method, the step of determining whether each of the data portions belongs to invalid data based on the importance information corresponding to each of the data portions includes:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated based on a second configuration operation of the big data cleaning equipment responding to a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information or not;
and determining the data part corresponding to each importance degree information smaller than the importance degree threshold information as invalid data, and determining the data part corresponding to each importance degree information larger than or equal to the importance degree threshold information as valid data.
In a possible embodiment, in the big data cleansing method, the step of determining whether each of the data portions belongs to invalid data based on the importance information corresponding to each of the data portions includes:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated based on a second configuration operation of the big data cleaning equipment responding to a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information or not;
counting the data amount of the data part corresponding to each importance degree information which is greater than or equal to the importance degree threshold information;
if the data volume is greater than or equal to a predetermined target data volume, determining the data portion corresponding to each importance degree information smaller than the importance degree threshold information as invalid data, and determining the data portion corresponding to each importance degree information greater than or equal to the importance degree threshold information as valid data;
and if the data quantity is smaller than the target data quantity, determining each data part as valid data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the first service data to screen out invalid data in the first service data to obtain target service data includes:
responding to the identification processing of the first service data by the user to obtain a corresponding identification result;
determining importance degree of each data part of the first service data based on the identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data part which does not belong to the invalid data as the target business data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the first service data to screen out invalid data in the first service data to obtain target service data includes:
responding to importance degree identification processing of each data part of the first service data by a user to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data part which does not belong to invalid data as target business data.
In a possible embodiment, in the big data cleansing method, the step of denoising the original service data to screen distortion data in the original service data to obtain the first service data includes:
performing data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained based on data acquisition of a target service object;
analyzing the original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the original service data fragments, wherein the distorted data is error data in the original service data;
if the target service data fragment exists in the plurality of original service data fragments, taking each original service data fragment except the target service data fragment in the plurality of original service data fragments as the denoised first service data.
According to the big data cleaning method, invalid data with the importance degree lower than the preset degree in the obtained original business data are screened out, so that target business data with higher importance degree can be obtained. Therefore, by screening out the relatively unimportant invalid data and retaining the relatively important valid data, the data cleaning effect is better, and the problem of poor data cleaning effect in the prior art is solved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a block diagram of a big data cleaning device provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart of a big data cleaning method provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in Fig. 1, an embodiment of the application provides a big data cleaning device. The big data cleaning device may include a memory and a processor.
In detail, the memory and the processor are electrically connected directly or indirectly to realize data transmission or interaction. For example, they may be electrically connected to each other via one or more communication buses or signal lines. The memory can have stored therein at least one software function (computer program) which can be present in the form of software or firmware. The processor may be configured to execute the executable computer program stored in the memory, so as to implement the big data cleansing method provided by the embodiments (described later) of the present application.
Alternatively, the memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a System on Chip (SoC), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Also, the structure shown in Fig. 1 is only an illustration, and the big data cleaning device may include more or fewer components than those shown in Fig. 1, or have a configuration different from that shown in Fig. 1; for example, it may include a communication unit for information interaction with other devices.
In an alternative example, the big data cleaning device may be a server with data processing capability.
With reference to Fig. 2, an embodiment of the present application further provides a big data cleaning method, which can be applied to the big data cleaning device described above; the method steps in the corresponding flow can be implemented by the big data cleaning device.
The specific process shown in Fig. 2 will be described in detail below.
Step S110, obtaining original service data to be processed.
In this embodiment, the big data cleaning device may first obtain the raw service data to be processed; for example, the stored raw service data may be obtained from one or more databases.
The original business data is business data of which the data volume is larger than the preset volume and which is obtained by carrying out data acquisition on the target business object. For example, the target business object may be an internet transaction behavior formed based on the internet, and the raw business data may be internet transaction record data formed based on the internet transaction behavior.
Step S120, cleaning the original service data to screen out invalid data in the original service data to obtain target service data.
In this embodiment, after obtaining the original service data based on step S110, the big data cleansing device may perform cleansing processing on the original service data, so as to filter invalid data in the original service data, thereby obtaining valid target service data.
The invalid data is service data with the importance degree lower than a preset degree in the original service data, and the target service data is part or all of the original service data.
Based on the method, the invalid data with the importance degree lower than the preset degree in the obtained original business data is screened out, so that the target business data with higher importance degree can be obtained. Therefore, by screening out the relatively unimportant invalid data and retaining the relatively important valid data, the data cleaning effect is better, and the problem of poor data cleaning effect in the prior art is solved.
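As a concrete illustration of steps S110 and S120 (not part of the claimed method), the following Python sketch filters out records whose importance falls below a preset degree; the scoring function, the threshold value, and the sample records are all assumptions made for the example.

```python
from typing import Callable, Iterable, List

def clean_big_data(
    raw_records: Iterable[str],
    importance_of: Callable[[str], float],
    preset_degree: float,
) -> List[str]:
    """Keep only records whose importance degree reaches the preset degree."""
    target_records = []
    for record in raw_records:
        # Records scored below the preset degree are treated as invalid data.
        if importance_of(record) >= preset_degree:
            target_records.append(record)
    return target_records

# Hypothetical usage: internet transaction records are scored by whether they
# mention a transaction amount.
raw = ["order 1001 amount=25.0", "page view only", "order 1002 amount=7.5"]
print(clean_big_data(raw, lambda r: 1.0 if "amount=" in r else 0.0, 0.5))
# ['order 1001 amount=25.0', 'order 1002 amount=7.5']
```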
In the above example, it should be noted that, in step S120, a specific manner of performing the cleaning process on the original service data is not limited, and may be selected according to an actual application requirement.
For example, in an alternative example, the original business data may be subjected to a cleansing process based on the following steps, so as to obtain the target business data:
firstly, denoising the original service data to screen out distortion data in the original service data to obtain first service data, wherein the distortion data is error data in the original service data, and the first service data is partial or all data in the original service data; secondly, the first service data may be cleaned to screen out invalid data in the first service data to obtain target service data, where the invalid data is service data of which the importance degree is lower than a preset degree in the first service data, and the target service data is part or all of the original service data.
It is understood that, in an alternative example, the specific manner of denoising the raw traffic data may include the following three steps, which are described in detail below.
Firstly, carrying out data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments. In this embodiment, the big data denoising processing device may perform data segmentation processing on the obtained original service data, so that a plurality of original service data segments may be obtained. The original business data is business data of which the data volume is larger than the preset volume and which is obtained by carrying out data acquisition on the target business object. For example, the target business object may be an internet transaction behavior formed based on the internet, and the raw business data may be internet transaction record data formed based on the internet transaction behavior.
And secondly, analyzing and processing the original service data fragments to determine whether a target service data fragment belonging to the distorted data exists in the original service data fragments. In this embodiment, after obtaining the multiple original service data segments, the big data denoising processing device may perform parsing processing on the multiple original service data segments to determine whether a target service data segment belonging to distorted data exists in the multiple original service data segments. And the distortion data is error data in the original service data. For example, the distortion data may represent that the above internet transaction activities are cancelled after the internet transaction activities are completed, such as canceling or returning goods after goods are purchased, or the distortion data may represent that errors occur in data transmission due to downtime, tampering, and the like in the data storage process. And, if the target service data fragment exists in the plurality of original service data fragments, the third step may be performed.
And thirdly, taking each original service data segment except the target service data segment in the plurality of original service data segments as the denoised first service data. In this embodiment, after determining that the target service data segment exists in the plurality of original service data segments, the big data denoising processing device may use each original service data segment other than the target service data segment in the plurality of original service data segments as denoised first service data.
Based on the steps, original service data is divided into a plurality of original service data fragments, whether a target service data fragment belonging to distorted data exists or not is determined, and then each original service data fragment except the target service data fragment is used as the first service data after denoising. Therefore, error data in the original service data can be effectively eliminated, the authenticity of the data in the obtained first service data is high, and the good denoising effect is ensured.
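A minimal sketch of this denoising idea, assuming the raw data has already been split into fragments and that a predicate for distorted data is available (both are placeholders, not the segmentation or analysis rules described later):

```python
from typing import Callable, List

def denoise_fragments(
    raw_fragments: List[str],
    is_distorted: Callable[[str], bool],
) -> List[str]:
    """Drop fragments judged to be distorted (error) data and keep the rest,
    in their original order, as the first service data."""
    return [frag for frag in raw_fragments if not is_distorted(frag)]

# Hypothetical usage: a refunded transaction fragment is treated as distorted.
fragments = ["buy A 10.0", "buy B 12.0", "buy C 11.0 REFUNDED"]
print(denoise_fragments(fragments, lambda f: "REFUNDED" in f))
# ['buy A 10.0', 'buy B 12.0']
```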
It is understood that, in an alternative example, the data segmentation process may be performed on the original service data based on the following steps:
firstly, obtaining original service data from a target database (it can be understood that the target database may not belong to other servers communicatively connected to the big data denoising processing device) communicatively connected to the big data denoising processing device, wherein the original service data is sent to the target database for storage through a corresponding data acquisition device after being obtained based on data acquisition of the target service object;
secondly, acquiring a predetermined target data segmentation rule, wherein the target data segmentation rule is generated based on configuration operation of the big data denoising processing equipment responding to a user;
and then, based on the target data segmentation rule, segmenting the original service data to obtain a plurality of original service data fragments, wherein the original service data fragments are combined according to a certain sequence to form the original service data.
It will be appreciated that in an alternative example, the target data segmentation rule may be obtained based on the following steps:
firstly, performing content recognition processing on the original service data (for example, the content recognition processing may be performed based on some existing text recognition models or a neural network model obtained through pre-training), so as to obtain a content recognition result corresponding to the original service data, where the content recognition result is used to represent type information to which data content of the original service data belongs (for example, the type information may include information related to a transaction amount, information not related to the transaction amount, and the like);
secondly, determining a segmentation rule in a plurality of pre-constructed data segmentation rules based on the content recognition result, wherein the segmentation rule is used as a target data segmentation rule corresponding to the content recognition result, each data segmentation rule is generated based on a configuration operation performed by the big data denoising processing device in response to a user, each data segmentation rule is used for segmenting the original service data into a plurality of original service data segments with different numbers, and the target data segmentation rule is used for segmenting the original service data into a corresponding number of original service data segments.
It will be appreciated that in an alternative example, the target data segmentation rule may be determined among the plurality of data segmentation rules based on the following steps:
firstly, determining target importance information corresponding to type information to which data content of the original service data belongs based on the content identification result and a pre-constructed content-importance corresponding relation, wherein the content-importance corresponding relation is generated based on configuration operation performed by the big data denoising processing device in response to a user (for example, the importance degree corresponding to information related to transaction amount may be higher than the importance degree corresponding to information not related to transaction amount);
secondly, based on the target importance information and a pre-constructed importance-segmentation rule corresponding relation, determining, from the plurality of pre-constructed data segmentation rules, a target data segmentation rule for segmenting the original service data corresponding to the target importance information. The higher the importance corresponding to the target importance information, the larger the number of original service data fragments obtained by segmenting the original service data based on the target data segmentation rule; the lower the importance, the smaller the number of fragments. The target data segmentation rule may be used to divide the original service data into a corresponding number of original service data fragments of equal data quantity, or to divide the original service data into a corresponding number of fragments according to the time sequence in which the data was generated, or to divide it into a corresponding number of fragments according to the continuity of the data contents.
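To make the idea concrete, here is an illustrative sketch of choosing a fragment count from an importance level and splitting the data into fragments of equal data quantity; the two correspondence tables are invented placeholders for the user-configured content-importance and importance-segmentation-rule relations.

```python
from typing import Dict, List, Tuple

# Placeholder tables; in the described method both would be generated from a
# user's configuration operations.
CONTENT_IMPORTANCE: Dict[str, float] = {
    "transaction_amount_related": 0.9,
    "not_amount_related": 0.3,
}
# Higher importance maps to a larger number of fragments.
IMPORTANCE_TO_FRAGMENT_COUNT: List[Tuple[float, int]] = [(0.8, 8), (0.5, 4), (0.0, 2)]

def choose_fragment_count(content_type: str) -> int:
    importance = CONTENT_IMPORTANCE.get(content_type, 0.0)
    for lower_bound, count in IMPORTANCE_TO_FRAGMENT_COUNT:
        if importance >= lower_bound:
            return count
    return 1

def split_equally(raw_data: str, n: int) -> List[str]:
    """Split raw data into n fragments of nearly equal data quantity, in order."""
    base, extra = divmod(len(raw_data), n)
    fragments, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        fragments.append(raw_data[start:start + size])
        start += size
    return fragments

print(len(split_equally("x" * 20, choose_fragment_count("transaction_amount_related"))))  # 8
```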
It will be appreciated that in an alternative example, the parsing process for the plurality of raw traffic data segments may be based on the following steps:
firstly, for each original service data fragment in the plurality of original service data fragments, performing content identification processing (as described above) on the original service data fragment to obtain content representation information corresponding to the original service data fragment, where the content representation information is used to represent data content of the corresponding original service data fragment (e.g., extract keywords therein);
secondly, clustering the plurality of original service data fragments based on the content characterization information corresponding to each original service data fragment to obtain at least one service data fragment set corresponding to the plurality of original service data fragments, wherein each service data fragment set comprises at least one original service data fragment, the content characterization information of any two original service data fragments belonging to the same service data fragment set is the same (for example, the content characterization information may represent identity information of a corresponding user, identity information of a corresponding device, amount information of a corresponding transaction, or time information of a corresponding transaction), and the content characterization information of any two original service data fragments belonging to different service data fragment sets is different;
thirdly, each service data fragment set with the number of original service data fragments which are contained in the at least one service data fragment set being more than or equal to 2 is used as a target service data fragment set;
and fourthly, for each target service data fragment set, carrying out comparative analysis on each original service data fragment in the target service data fragment set so as to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
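The clustering step can be pictured as a simple grouping by a characterization key; the key-extraction function below is an invented stand-in for the content identification processing.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def cluster_fragments(
    fragments: List[str],
    characterize: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group fragments whose content characterization information is identical."""
    sets: Dict[str, List[str]] = defaultdict(list)
    for frag in fragments:
        sets[characterize(frag)].append(frag)
    return dict(sets)

def target_sets(sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only sets containing two or more fragments for comparative analysis."""
    return {key: members for key, members in sets.items() if len(members) >= 2}

# Hypothetical usage: characterize each fragment by the user identity it mentions.
frags = ["user1 pays 10", "user1 pays 12", "user2 pays 5"]
print(target_sets(cluster_fragments(frags, lambda f: f.split()[0])))
# {'user1': ['user1 pays 10', 'user1 pays 12']}
```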
It will be appreciated that in an alternative example, each of the original traffic data segments may be comparatively analyzed based on the following steps:
firstly, for each target service data fragment set, determining the data type of each original service data fragment in the target service data fragment set based on the result of the content identification processing performed on the original service data fragment, wherein the data types include quantized data and non-quantized data, and the non-quantized data includes data with emotional connotation (sentiment);
secondly, for each target business data fragment set, determining a comparative analysis rule corresponding to the target business data fragment set based on the data type corresponding to the target business data fragment set, wherein the comparative analysis rules corresponding to the target business data fragment sets of different data types are different;
then, for each target service data fragment set, performing comparative analysis on each original service data fragment included in the target service data fragment set based on a comparative analysis rule corresponding to the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
It is understood that, in an alternative example, each of the original business data segments may be comparatively analyzed by a corresponding comparative analysis rule based on the following steps:
first, for each target service data fragment set whose corresponding data type is quantized data, performing a mean calculation based on the quantized data values of the original service data fragments in the target service data fragment set to obtain a corresponding quantized mean value, and performing a dispersion calculation based on the quantized mean value and the quantized data values of each original service data fragment to obtain a corresponding quantized data discrete degree value (it may be understood that, for non-quantized data such as data with emotional connotation, the change trend of the sentiment corresponding to each original service data fragment may be analyzed first, for example, whether the sentiment remains unchanged throughout or gradually changes from negative to positive, and the data that do not satisfy the change trend are then screened out as distorted data);
secondly, judging whether the quantized data discrete degree value is larger than a predetermined quantized data discrete degree threshold, wherein the quantized data discrete degree threshold is generated based on a configuration operation performed by the big data denoising processing device in response to a user;
thirdly, if the quantized data discrete degree value is larger than the quantized data discrete degree threshold, determining that the target service data fragment set corresponding to the quantized data discrete degree value does not include a target service data fragment belonging to distorted data;
fourthly, if the quantized data discrete degree value is less than or equal to the quantized data discrete degree threshold, calculating the difference between the quantized data value of each original service data fragment in the target service data fragment set corresponding to the quantized data discrete degree value and the quantized mean value;
fifthly, judging the magnitude relationship between the difference and a predetermined comparison threshold (it is understood that the comparison threshold may be generated based on a configuration performed by a user according to the actual application scenario);
sixthly, if the difference is larger than the comparison threshold, determining the original service data fragment corresponding to the difference as a target service data fragment belonging to distorted data;
and seventhly, if the difference is smaller than or equal to the comparison threshold, not determining the original service data fragment corresponding to the difference as a target service data fragment belonging to distorted data.
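Read literally, these seven steps amount to the following check; the sketch uses population standard deviation as an assumed dispersion measure, and the threshold values and sample data are invented.

```python
from statistics import mean, pstdev
from typing import List

def distorted_indices_in_quantized_set(
    values: List[float],
    dispersion_threshold: float,
    comparison_threshold: float,
) -> List[int]:
    """Return indices of fragments treated as distorted, following the steps above."""
    quantized_mean = mean(values)
    dispersion = pstdev(values)  # assumed dispersion measure
    # Per the described steps, a set whose dispersion exceeds the threshold is
    # treated as containing no distorted fragments.
    if dispersion > dispersion_threshold:
        return []
    return [
        i for i, value in enumerate(values)
        if abs(value - quantized_mean) > comparison_threshold
    ]

# Nine consistent amounts plus one outlier; only the outlier is flagged.
print(distorted_indices_in_quantized_set([10.0] * 9 + [30.0], 10.0, 5.0))  # [9]
```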
It is understood that, in another alternative example, each of the original service data fragments may also be comparatively analyzed based on the following steps:
the method comprises the steps that firstly, for each target service data fragment set, based on content representation information corresponding to each original service data fragment in the target service data fragment set, a historical service data fragment set corresponding to the target service data fragment set is determined, wherein the historical service data fragment set is a service data fragment set which is determined by analyzing other original service data in history and comprises distortion data, and the content representation information of the historical service data fragment set is the same as that of the corresponding target service data fragment set;
secondly, aiming at each historical service data fragment set, sequencing all service data fragments included in the historical service data fragment set based on the relative position relation of all service data fragments included in the historical service data fragment set in other original service data to obtain a historical service data fragment sequence corresponding to the historical service data fragment set;
thirdly, aiming at each target business data fragment set, acquiring the fragment number of original business data fragments included in the target business data fragment set;
fourthly, aiming at each target service data fragment set, determining at least one historical service data fragment subsequence in the historical service data fragment sequence corresponding to the target service data fragment set, wherein the number of service data fragments included in each historical service data fragment subsequence is the number of fragments corresponding to the target service data fragment set, and each historical service data fragment subsequence includes service data fragments belonging to distorted data;
fifthly, calculating a sequence similarity between the ordered set corresponding to each target service data fragment set (that is, the original service data fragments in the target service data fragment set are ordered by their relative positions in the original service data to obtain the ordered set) and the corresponding at least one historical service data fragment subsequence (the sequence similarity can be calculated with an existing sequence similarity calculation method, which is not described in detail herein);
sixthly, determining each target service data fragment set with the sequence similarity meeting a preset similarity threshold (it can be understood that the similarity threshold can be generated based on configuration operation performed by a user according to an actual application scene) as a target service data fragment set with target service data fragments belonging to distorted data, wherein the target service data fragments are determined based on position information of the service data fragments belonging to the distorted data in the corresponding historical service data fragment set.
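A toy version of this history-based comparison follows, using difflib.SequenceMatcher as a stand-in similarity measure (the method above leaves the similarity calculation open) and an invented historical sequence in which the distorted fragment's position is already known.

```python
from difflib import SequenceMatcher
from typing import List

def flag_by_history(
    ordered_target: List[str],
    historical_sequence: List[str],
    historical_distorted_index: int,
    similarity_threshold: float,
) -> List[int]:
    """Flag positions in the ordered target set that align with a known distorted
    fragment inside a sufficiently similar historical subsequence."""
    n = len(ordered_target)
    flagged = set()
    for start in range(len(historical_sequence) - n + 1):
        window = historical_sequence[start:start + n]
        similarity = SequenceMatcher(None, " ".join(ordered_target), " ".join(window)).ratio()
        if similarity >= similarity_threshold and start <= historical_distorted_index < start + n:
            flagged.add(historical_distorted_index - start)
    return sorted(flagged)

# Hypothetical history: the "pay 0.00" fragment was previously found distorted.
history = ["login", "add to cart", "pay 0.00", "logout"]
target = ["login", "add to cart", "pay 0.00"]
print(flag_by_history(target, history, 2, 0.8))  # [2]
```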
It is understood that, in an alternative example, the denoised first traffic data may be obtained based on the following steps:
firstly, if the target service data fragment exists in the plurality of original service data fragments, determining the relative position relationship of each original service data fragment except the target service data fragment in the plurality of original service data fragments in the original service data;
secondly, combining each original service data segment except the target service data segment in the plurality of original service data segments based on the relative position relationship to obtain the first service data of which the original service data is denoised.
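A small sketch of this recombination, assuming each kept fragment carries its relative position in the original service data (the position index is an assumption made for illustration):

```python
from typing import List, Tuple

def recombine_fragments(
    kept_fragments: List[Tuple[int, str]],  # (relative position in raw data, fragment)
) -> str:
    """Reassemble the first service data by sorting the remaining fragments
    according to their relative positions in the original service data."""
    return "".join(frag for _, frag in sorted(kept_fragments))

# Hypothetical usage: the fragment at position 1 was distorted and removed.
print(recombine_fragments([(2, "CC"), (0, "AA"), (3, "DD")]))  # AACCDD
```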
It can be understood that, in an alternative example, if it is determined that the target service data segment does not exist in the multiple original service data segments, all of the multiple original service data segments may be used as the denoised first service data, that is, the original service data may be directly used as the denoised first service data.
It is understood that, in an alternative example, a specific manner of performing the cleansing process on the first service data may include the following steps:
firstly, performing content recognition processing on the first service data (for example, the recognition processing may be performed based on some text recognition models in the prior art, where the text recognition models may be neural network models trained in advance based on sample data), and obtaining corresponding content recognition results; secondly, determining importance degree of each data part of the first service data based on the content identification result to obtain importance degree information corresponding to each data part; thirdly, determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part; and fourthly, determining the data part which does not belong to the invalid data as the target service data.
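For illustration, the four steps can be strung together as below; the regular-expression "content recognition" and the content-importance table are stand-ins for the recognition model and the user-configured correspondence, not part of the claimed method.

```python
import re
from typing import Dict, List

# Placeholder content-importance correspondence (in the described method it is
# generated from the user's first configuration operation).
CONTENT_IMPORTANCE: Dict[str, float] = {"amount_related": 0.9, "other": 0.2}

def recognize_content(part: str) -> str:
    """Tiny stand-in for a trained content recognition model."""
    return "amount_related" if re.search(r"amount=\d", part) else "other"

def clean_first_service_data(parts: List[str], preset_degree: float) -> List[str]:
    target = []
    for part in parts:
        importance = CONTENT_IMPORTANCE[recognize_content(part)]
        # Parts whose importance is below the preset degree are invalid data.
        if importance >= preset_degree:
            target.append(part)
    return target

parts = ["order 7 amount=19.9", "banner clicked", "order 8 amount=3.5"]
print(clean_first_service_data(parts, 0.5))
# ['order 7 amount=19.9', 'order 8 amount=3.5']
```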
It is to be understood that, in another alternative example, a specific manner of performing the cleansing process on the first service data may also include the following steps:
first, in response to the identification processing performed by the user on the first service data, a corresponding identification result is obtained (that is, each data portion of the first service data may be identified by the corresponding user to obtain a corresponding identification result, where the identification result may refer to data content represented by the corresponding data portion, such as a main body of a transaction behavior); secondly, determining importance degree of each data part of the first service data based on the identification result to obtain importance degree information corresponding to each data part; thirdly, determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part; and fourthly, determining the data part which does not belong to the invalid data as the target service data.
It is understood that, in other alternative examples, the specific manner of performing the cleansing process on the first service data may also include the following steps:
first, in response to the importance level identification processing performed by the user on each data portion of the first service data, obtaining importance level information corresponding to each data portion (that is, the importance level information of each data portion can be obtained by identifying each data portion of the first service data by the corresponding user); secondly, determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part; and thirdly, determining the data part which does not belong to the invalid data as the target service data.
It will be appreciated that in the first alternative example described above, in one possible example, the importance level determination process may be performed based on the following steps:
firstly, obtaining a content-importance degree corresponding relation which is constructed in advance, wherein the content-importance degree corresponding relation is generated on the basis of a first configuration operation which is carried out by the big data cleaning equipment in response to a user; and secondly, determining importance degree information of each data part of the first service data based on the content identification result and the content-importance degree corresponding relation.
It is to be understood that, in the first alternative example described above, in another possible example, the importance level determination process may be performed based on the following steps:
firstly, judging whether preset mark information exists in each data part in the first service data, wherein the preset mark information is generated based on response user operation; secondly, determining the importance degree information of each data part with the preset mark information as having first importance degree information, and determining the importance degree information of each data part without the preset mark information as having second importance degree information, wherein the first importance degree information is used for representing that the corresponding data part does not belong to invalid data, and the second importance degree information is used for representing that the corresponding data part belongs to invalid data.
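In code, this marking-based variant reduces to a simple lookup; the DataPart structure and the "keep"/"discard" labels are invented for the example.

```python
from typing import Dict, List, NamedTuple

class DataPart(NamedTuple):
    content: str
    has_preset_mark: bool  # mark information generated in response to a user operation

FIRST_IMPORTANCE = "keep"      # the part does not belong to invalid data
SECOND_IMPORTANCE = "discard"  # the part belongs to invalid data

def importance_by_mark(parts: List[DataPart]) -> Dict[str, str]:
    """Assign first/second importance information purely from the preset mark."""
    return {
        part.content: FIRST_IMPORTANCE if part.has_preset_mark else SECOND_IMPORTANCE
        for part in parts
    }

parts = [DataPart("order 9 amount=4.0", True), DataPart("heartbeat ping", False)]
print(importance_by_mark(parts))
# {'order 9 amount=4.0': 'keep', 'heartbeat ping': 'discard'}
```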
It will be appreciated that in the first alternative example described above, in one possible example, it may be determined whether each of the data portions belongs to invalid data based on:
the method comprises the steps of firstly, acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated on the basis of a second configuration operation of the big data cleaning equipment responding to a user; secondly, judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information; and thirdly, determining the data part corresponding to each piece of importance degree information smaller than the importance degree threshold information as invalid data, and determining the data part corresponding to each piece of importance degree information larger than or equal to the importance degree threshold information as valid data.
It will be appreciated that in the first alternative example described above, in another possible example, it may be determined whether each of the data portions belongs to invalid data based on the following steps:
firstly, acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated based on a second configuration operation performed by the big data cleaning device in response to a user; secondly, judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information; thirdly, counting the data amount of the data parts whose importance degree information is greater than or equal to the importance degree threshold information; fourthly, if the data amount is greater than or equal to a predetermined target data amount (the target data amount may be generated based on a third configuration operation performed by the big data cleaning device in response to a user), determining the data parts whose importance degree information is smaller than the importance degree threshold information as invalid data, and determining the data parts whose importance degree information is greater than or equal to the importance degree threshold information as valid data; and fifthly, if the data amount is smaller than the target data amount, determining each data part as valid data.
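The fallback variant can be sketched as follows; counting the "data amount" as the number of surviving parts is an assumption (it could equally be a byte count), and the thresholds and sample data are invented.

```python
from typing import List, Tuple

def split_valid_invalid(
    scored_parts: List[Tuple[str, float]],
    importance_threshold: float,
    target_data_amount: int,
) -> Tuple[List[str], List[str]]:
    """Apply the importance threshold, but keep everything when too little data
    would survive; returns (valid, invalid)."""
    valid = [part for part, score in scored_parts if score >= importance_threshold]
    invalid = [part for part, score in scored_parts if score < importance_threshold]
    if len(valid) < target_data_amount:
        # Not enough valid data: treat every data part as valid.
        return [part for part, _ in scored_parts], []
    return valid, invalid

parts = [("order amount=12", 0.9), ("page view", 0.2), ("order amount=3", 0.8)]
print(split_valid_invalid(parts, 0.5, 2))
# (['order amount=12', 'order amount=3'], ['page view'])
print(split_valid_invalid(parts, 0.5, 3))
# (['order amount=12', 'page view', 'order amount=3'], [])
```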
In summary, according to the big data cleaning method provided by the application, invalid data with an importance degree lower than a preset degree in the obtained original business data is screened out, so that target business data with a higher importance degree can be obtained. Therefore, by screening out the relatively unimportant invalid data and retaining the relatively important valid data, the data cleaning effect is better, and the problem of poor data cleaning effect in the prior art is solved.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A big data cleaning method is applied to big data cleaning equipment and comprises the following steps:
acquiring original service data to be processed, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained on the basis of data acquisition of a target service object;
cleaning the original service data to screen out invalid data in the original service data to obtain target service data, wherein the invalid data is service data of which the importance degree is lower than a preset degree in the original service data, and the target service data is part or all of the original service data;
the step of cleaning the original service data to screen out invalid data in the original service data to obtain target service data includes:
denoising the original service data to screen distortion data in the original service data to obtain first service data, wherein the distortion data is error data in the original service data, and the first service data is part or all of data in the original service data;
cleaning the first service data to screen out invalid data in the first service data to obtain target service data, wherein the invalid data is the service data of which the importance degree in the first service data is lower than a preset degree, and the target service data is part or all of the original service data;
the step of denoising the original service data to screen distortion data in the original service data to obtain first service data includes:
performing data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments, wherein the original service data are service data of which the data volume is larger than a preset volume and obtained on the basis of data acquisition on a target service object;
analyzing the original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the original service data fragments, wherein the distorted data is error data in the original service data;
if the target service data fragment exists in the plurality of original service data fragments, taking each original service data fragment except the target service data fragment in the plurality of original service data fragments as denoised first service data;
the step of parsing the multiple original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the multiple original service data fragments includes:
for each original service data fragment in the plurality of original service data fragments, performing content identification processing on the original service data fragment to obtain content representation information corresponding to the original service data fragment, wherein the content representation information is used for representing the data content of the corresponding original service data fragment;
based on content characterization information corresponding to each original service data fragment, performing clustering processing on the plurality of original service data fragments to obtain at least one service data fragment set corresponding to the plurality of original service data fragments, wherein each service data fragment set comprises at least one original service data fragment, the content characterization information of any two original service data fragments belonging to the same service data fragment set is the same, and the content characterization information of any two original service data fragments belonging to different service data fragment sets is different;
taking each service data fragment set, in which the number of original service data fragments included in the at least one service data fragment set is greater than or equal to 2, as a target service data fragment set;
for each target service data fragment set, performing comparative analysis on each original service data fragment in the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set;
the step of comparing and analyzing each original service data fragment in the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set for each target service data fragment set includes:
for each target service data fragment set, determining a historical service data fragment set corresponding to the target service data fragment set based on content characterization information corresponding to each original service data fragment in the target service data fragment set, wherein the historical service data fragment set is a service data fragment set which is determined by analyzing other original service data historically and comprises distorted data, and the content characterization information of the historical service data fragment set is the same as that of the corresponding target service data fragment set;
for each historical service data fragment set, based on the relative position relationship of each service data fragment included in the historical service data fragment set in the other original service data, sequencing each service data fragment included in the historical service data fragment set to obtain a historical service data fragment sequence corresponding to the historical service data fragment set;
aiming at each target service data fragment set, acquiring the fragment number of original service data fragments included in the target service data fragment set;
for each target service data fragment set, determining at least one historical service data fragment subsequence in a historical service data fragment sequence corresponding to the target service data fragment set, wherein the number of service data fragments included in each historical service data fragment subsequence is the number of fragments corresponding to the target service data fragment set, and each historical service data fragment subsequence includes service data fragments belonging to distorted data;
aiming at each target business data fragment set, calculating the sequence similarity between the ordered set corresponding to the target business data fragment set and the corresponding at least one historical business data fragment subsequence;
determining each target service data fragment set whose sequence similarity meets a preset similarity threshold as a target service data fragment set in which target service data fragments belonging to distorted data exist, wherein the target service data fragments are determined based on the position information of the service data fragments belonging to distorted data in the corresponding historical service data fragment set.
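
The comparative-analysis flow set out in claim 1 above (clustering fragments by content characterization, keeping only fragment sets with at least two members, ordering the matching historical fragments, and comparing equal-length historical subsequences against the target set by sequence similarity) can be illustrated with a minimal Python sketch. The characterize() helper, the shape of the history mapping, and the use of difflib.SequenceMatcher as the similarity measure are illustrative assumptions, not part of the claimed method.

from collections import defaultdict
from difflib import SequenceMatcher

def characterize(fragment):
    # Stand-in content identification: a normalized form of the fragment
    # serves as its content characterization information.
    return fragment.strip().lower()

def find_distorted_sets(fragments, history, similarity_threshold=0.8):
    # fragments: original service data fragments, in their original order.
    # history: content characterization -> ordered list of
    #          (fragment, is_distorted) pairs from previously analyzed data.
    clusters = defaultdict(list)
    for position, fragment in enumerate(fragments):
        clusters[characterize(fragment)].append((position, fragment))

    flagged = []
    for key, members in clusters.items():
        if len(members) < 2 or key not in history:
            continue  # only sets with two or more fragments and a historical match
        ordered_target = [frag for _, frag in sorted(members)]
        hist_sequence = history[key]
        n = len(ordered_target)
        # Slide a window of length n over the historical sequence, keeping
        # only subsequences that contain at least one distorted fragment.
        for start in range(len(hist_sequence) - n + 1):
            window = hist_sequence[start:start + n]
            if not any(is_distorted for _, is_distorted in window):
                continue
            similarity = SequenceMatcher(
                None, ordered_target, [frag for frag, _ in window]).ratio()
            if similarity >= similarity_threshold:
                flagged.append(key)
                break
    return flagged
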
2. The big data cleaning method according to claim 1, wherein the step of cleaning the first service data to screen out invalid data in the first service data to obtain target service data comprises:
performing content identification processing on the first service data to obtain a corresponding content identification result;
determining the importance degree of each data part of the first service data based on the content identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data parts which do not belong to the invalid data as the target service data.
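
A minimal sketch of the cleaning step of claim 2, assuming the data parts are strings and that identify_content() and the numeric importance scores stand in for the content identification and importance degree determination of the claim; none of these names come from the patent itself.

def identify_content(part):
    # Stand-in content identification: label a data part by a simple keyword test.
    return "order_record" if "order" in part else "misc"

def clean_first_service_data(data_parts, importance_of, threshold=0.5):
    # importance_of: callable mapping a content label to an importance degree score.
    target_service_data = []
    for part in data_parts:
        label = identify_content(part)       # content identification result
        score = importance_of(label)         # importance degree information
        if score >= threshold:               # parts below the threshold are invalid data
            target_service_data.append(part)
    return target_service_data
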
3. The big data cleaning method according to claim 2, wherein the step of determining the importance degree of each data part of the first service data based on the content identification result to obtain the importance degree information corresponding to each data part comprises:
obtaining a pre-constructed content-importance degree corresponding relation, wherein the content-importance degree corresponding relation is generated by the big data cleaning device in response to a first configuration operation of a user;
and determining importance degree information of each data part of the first service data based on the content identification result and the content-importance degree corresponding relation.
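
One way the pre-constructed content-importance degree corresponding relation of claim 3 could be held, sketched as a plain lookup table assumed to have been populated from the user's first configuration operation; the labels and scores are purely illustrative.

CONTENT_IMPORTANCE = {       # content label -> importance degree
    "order_record": 0.9,
    "user_profile": 0.7,
    "debug_log": 0.1,
}

def importance_from_correspondence(content_label, default=0.0):
    # Content missing from the correspondence falls back to a low default importance degree.
    return CONTENT_IMPORTANCE.get(content_label, default)

Passing importance_from_correspondence as the importance_of argument of the sketch under claim 2 reproduces the combined flow of claims 2 and 3.
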
4. The big data cleaning method according to claim 2, wherein the step of determining the importance degree of each data part of the first service data based on the content identification result to obtain the importance degree information corresponding to each data part comprises:
for each data part in the first service data, judging whether the data part carries preset mark information, wherein the preset mark information is generated in response to a user operation;
and determining the importance degree information of each data part carrying the preset mark information as first importance degree information, and determining the importance degree information of each data part not carrying the preset mark information as second importance degree information, wherein the first importance degree information indicates that the corresponding data part does not belong to invalid data, and the second importance degree information indicates that the corresponding data part belongs to invalid data.
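
The mark-based variant of claim 4 reduces to a membership test. A sketch under the assumption that the preset mark information is a set of user-marked part identifiers; the two label constants are illustrative, not prescribed.

FIRST_IMPORTANCE = "not_invalid"    # part carries the preset mark information
SECOND_IMPORTANCE = "invalid"       # part carries no preset mark information

def importance_from_marks(part_ids, marked_ids):
    # marked_ids: identifiers of the data parts the user marked in advance.
    return {
        part_id: FIRST_IMPORTANCE if part_id in marked_ids else SECOND_IMPORTANCE
        for part_id in part_ids
    }
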
5. The big data cleaning method according to claim 2, wherein the step of determining whether each data part belongs to invalid data based on the importance degree information corresponding to each data part comprises:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated by the big data cleaning device in response to a second configuration operation of a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information;
and determining each data part whose importance degree information is smaller than the importance degree threshold information as invalid data, and determining each data part whose importance degree information is greater than or equal to the importance degree threshold information as valid data.
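
Claim 5 is a straight threshold comparison. A sketch assuming the importance degree information is a numeric score per data part and the threshold information is a single number.

def split_by_threshold(importance, threshold):
    # importance: dict of data part identifier -> importance degree score.
    valid = {pid for pid, score in importance.items() if score >= threshold}
    invalid = set(importance) - valid
    return valid, invalid
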
6. The big data cleaning method according to claim 2, wherein the step of determining whether each data part belongs to invalid data based on the importance degree information corresponding to each data part comprises:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated by the big data cleaning device in response to a second configuration operation of a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information;
counting the data quantity of the data parts whose importance degree information is greater than or equal to the importance degree threshold information;
if the data quantity is greater than or equal to a predetermined target data quantity, determining each data part whose importance degree information is smaller than the importance degree threshold information as invalid data, and determining each data part whose importance degree information is greater than or equal to the importance degree threshold information as valid data;
and if the data quantity is smaller than the target data quantity, determining each data part as valid data.
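
Claim 6 adds a quantity guard to the threshold test of claim 5: if too few data parts would survive, every part is kept as valid data. The numeric scores and the target_data_quantity parameter are illustrative assumptions.

def split_with_quantity_guard(importance, threshold, target_data_quantity):
    # importance: dict of data part identifier -> importance degree score.
    valid = {pid for pid, score in importance.items() if score >= threshold}
    if len(valid) >= target_data_quantity:
        invalid = set(importance) - valid
    else:
        # Too few parts survive the threshold; treat every data part as valid.
        valid, invalid = set(importance), set()
    return valid, invalid
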
7. The big data cleaning method according to claim 1, wherein the step of cleaning the first service data to screen out invalid data in the first service data to obtain target service data comprises:
in response to an identification operation performed by a user on the first service data, obtaining a corresponding identification result;
determining the importance degree of each data part of the first service data based on the identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data parts which do not belong to the invalid data as the target service data.
8. The big data cleaning method according to claim 1, wherein the step of cleaning the first service data to screen out invalid data in the first service data to obtain target service data comprises:
in response to an importance degree identification operation performed by a user on each data part of the first service data, obtaining importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data parts which do not belong to the invalid data as the target service data.
CN202110571367.9A 2021-05-25 2021-05-25 Big data cleaning method Active CN113138982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110571367.9A CN113138982B (en) 2021-05-25 2021-05-25 Big data cleaning method

Publications (2)

Publication Number Publication Date
CN113138982A CN113138982A (en) 2021-07-20
CN113138982B (en) 2022-09-27

Family

ID=76817463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110571367.9A Active CN113138982B (en) 2021-05-25 2021-05-25 Big data cleaning method

Country Status (1)

Country Link
CN (1) CN113138982B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363466B (en) * 2022-03-22 2022-06-10 长沙居美网络科技有限公司 Intelligent cloud calling system based on AI
CN116204769B (en) * 2023-03-06 2023-12-05 深圳市乐易网络股份有限公司 Data cleaning method, system and storage medium based on data classification and identification
CN116303398A (en) * 2023-03-21 2023-06-23 华联世纪工程咨询股份有限公司 Historical engineering cost data cleaning method
CN116243833B (en) * 2023-05-08 2023-07-14 北京国信新网通讯技术有限公司 Cloud data-based electronic government platform communication management method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297919A (en) * 2019-05-23 2019-10-01 深圳壹账通智能科技有限公司 A kind of data cleaning method, device, equipment and storage medium
CN110427453A (en) * 2019-05-31 2019-11-08 平安科技(深圳)有限公司 Similarity calculating method, device, computer equipment and the storage medium of data
CN111488429A (en) * 2020-03-19 2020-08-04 杭州叙简科技股份有限公司 Short text clustering system based on search engine and short text clustering method thereof
CN111949647A (en) * 2020-09-03 2020-11-17 深圳市安亿通科技发展有限公司 Emergency management service data cleaning method, system, terminal and readable storage medium
CN112699646A (en) * 2020-12-23 2021-04-23 平安信托有限责任公司 Data processing method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045807A (en) * 2015-06-04 2015-11-11 浙江力石科技股份有限公司 Data cleaning algorithm based on Internet trading information
CN111241258A (en) * 2020-01-08 2020-06-05 泰康保险集团股份有限公司 Data cleaning method and device, computer equipment and readable storage medium
CN111563071A (en) * 2020-04-03 2020-08-21 深圳价值在线信息科技股份有限公司 Data cleaning method and device, terminal equipment and computer readable storage medium
CN112667617A (en) * 2020-12-30 2021-04-16 南京诚勤教育科技有限公司 Visual data cleaning system and method based on natural language
CN112750029A (en) * 2020-12-30 2021-05-04 北京知因智慧科技有限公司 Credit risk prediction method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113138982A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113138982B (en) Big data cleaning method
CN109191226B (en) Risk control method and device
WO2021164232A1 (en) User identification method and apparatus, and device and storage medium
CN112990386A (en) User value clustering method and device, computer equipment and storage medium
CN111144941A (en) Merchant score generation method, device, equipment and readable storage medium
CN116562991B (en) Commodity big data information identification method and system for meta-space electronic commerce platform
CN108197795B (en) Malicious group account identification method, device, terminal and storage medium
CN114581207A (en) Commodity image big data accurate pushing method and system for E-commerce platform
CN106997350B (en) Data processing method and device
CN112016756A (en) Data prediction method and device
CN113313217B (en) Method and system for accurately identifying dip angle characters based on robust template
CN114780606A (en) Big data mining method and system
CN113239031A (en) Big data denoising processing method
CN112884480A (en) Method and device for constructing abnormal transaction identification model, computer equipment and medium
CN112819476A (en) Risk identification method and device, nonvolatile storage medium and processor
CN113239381A (en) Data security encryption method
CN116610821A (en) Knowledge graph-based enterprise risk analysis method, system and storage medium
CN110659981A (en) Enterprise dependency relationship identification method and device and electronic equipment
CN113065892B (en) Information pushing method, device, equipment and storage medium
CN112269879B (en) Method and equipment for analyzing middle station log based on k-means algorithm
CN113256402A (en) Risk control rule determination method and device and electronic equipment
CN115439079A (en) Item classification method and device
CN113688206A (en) Text recognition-based trend analysis method, device, equipment and medium
CN114064872A (en) Intelligent storage method, device, equipment and medium for dialogue data information
CN112907306B (en) Customer satisfaction judging method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220908

Address after: J343, 3rd Floor, Port Building, Maritime Center, No. 59, Linhai Avenue, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000

Applicant after: Shenzhen yuanuniverse Technology Co.,Ltd.

Address before: 4 / F, complex building, 17 Danli Road, longhuan, Maling community, South District, Zhongshan City, Guangdong Province, 528455

Applicant before: Huang Zhuting

GR01 Patent grant