CN113138982B - Big data cleaning method


Info

Publication number
CN113138982B
CN113138982B (application number CN202110571367.9A)
Authority
CN
China
Prior art keywords
data
service data
importance degree
original
target
Legal status
Active
Application number
CN202110571367.9A
Other languages
Chinese (zh)
Other versions
CN113138982A (en)
Inventor
黄柱挺
申海平
沈陕威
苏军武
Current Assignee
Shenzhen Yuanuniverse Technology Co., Ltd.
Original Assignee
Shenzhen Yuanuniverse Technology Co., Ltd.
Priority date
Application filed by Shenzhen Yuanuniverse Technology Co., Ltd.
Priority to CN202110571367.9A
Publication of CN113138982A
Application granted
Publication of CN113138982B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data


Abstract

The application provides a big data cleaning method and relates to the technical field of data processing. In the application, original service data to be processed is first obtained, where the original service data is service data whose data volume is larger than a preset volume and which is obtained by performing data acquisition on a target service object. The original service data is then cleaned to screen out invalid data and obtain target service data, where the invalid data is service data whose importance degree in the original service data is lower than a preset degree, and the target service data is part or all of the original service data. Based on the method, the problem of poor data cleaning effect in the prior art can be solved.

Description

Big data cleaning method
Technical Field
The application relates to the technical field of data processing, in particular to a big data cleaning method.
Background
In the field of big data technology, the data to be processed is massive, and not all of it can be utilized; the acquired data therefore needs to be cleaned. However, the inventors have found that conventional data cleaning techniques suffer from a poor cleaning effect.
Disclosure of Invention
In view of the above, an object of the present application is to provide a big data cleansing method to solve the problem of poor data cleansing effect in the prior art.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
a big data cleaning method is applied to big data cleaning equipment and comprises the following steps:
acquiring original service data to be processed, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained based on data acquisition of a target service object;
and cleaning the original service data to screen out invalid data in the original service data to obtain target service data, wherein the invalid data is service data of which the importance degree is lower than a preset degree in the original service data, and the target service data is part or all of the original service data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the original service data to screen out invalid data in the original service data to obtain target service data includes:
denoising the original service data to screen distortion data in the original service data to obtain first service data, wherein the distortion data is error data in the original service data, and the first service data is part or all of data in the original service data;
and cleaning the first service data to screen out invalid data in the first service data to obtain target service data, wherein the invalid data is the service data of which the importance degree in the first service data is lower than a preset degree, and the target service data is part or all of the original service data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the first service data to screen out invalid data in the first service data to obtain target service data includes:
performing content identification processing on the first service data to obtain a corresponding content identification result;
determining importance degree of each data part of the first service data based on the content identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data part which does not belong to the invalid data as the target business data.
In a possible embodiment, in the big data cleansing method, the step of performing importance determination processing on each data portion of the first service data based on the content identification result to obtain importance information corresponding to each data portion includes:
obtaining a pre-constructed content-importance degree corresponding relation, wherein the content-importance degree corresponding relation is generated on the basis of a first configuration operation of the big data cleaning equipment responding to a user;
and determining importance degree information of each data part of the first service data based on the content identification result and the content-importance degree corresponding relation.
In a possible embodiment, in the big data cleansing method, the step of performing importance determination processing on each data portion of the first service data based on the content identification result to obtain importance information corresponding to each data portion includes:
for each data part in the first service data, judging whether the data part has preset mark information or not, wherein the preset mark information is generated based on response user operation;
and determining the importance degree information of each data part with the preset mark information as having first importance degree information, and determining the importance degree information of each data part without the preset mark information as having second importance degree information, wherein the first importance degree information is used for representing that the corresponding data part does not belong to invalid data, and the second importance degree information is used for representing that the corresponding data part belongs to invalid data.
In a possible embodiment, in the big data cleansing method, the step of determining whether each of the data portions belongs to invalid data based on the importance information corresponding to each of the data portions includes:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated based on a second configuration operation of the big data cleaning equipment responding to a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information or not;
and determining the data part corresponding to each importance degree information smaller than the importance degree threshold information as invalid data, and determining the data part corresponding to each importance degree information larger than or equal to the importance degree threshold information as valid data.
In a possible embodiment, in the big data cleansing method, the step of determining whether each of the data portions belongs to invalid data based on the importance information corresponding to each of the data portions includes:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated based on a second configuration operation of the big data cleaning equipment responding to a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information or not;
counting the data amount of the data part corresponding to each importance degree information which is greater than or equal to the importance degree threshold information;
if the data volume is greater than or equal to a predetermined target data volume, determining the data portion corresponding to each importance degree information smaller than the importance degree threshold information as invalid data, and determining the data portion corresponding to each importance degree information greater than or equal to the importance degree threshold information as valid data;
and if the data quantity is smaller than the target data quantity, determining each data part as valid data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the first service data to screen out invalid data in the first service data to obtain target service data includes:
responding to the identification processing of the first service data by the user to obtain a corresponding identification result;
determining importance degree of each data part of the first service data based on the identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data part which does not belong to the invalid data as the target business data.
In a possible embodiment, in the big data cleansing method, the step of cleansing the first service data to screen out invalid data in the first service data to obtain target service data includes:
responding to importance degree identification processing of each data part of the first service data by a user to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data part which does not belong to invalid data as target business data.
In a possible embodiment, in the big data cleansing method, the step of denoising the original service data to screen distortion data in the original service data to obtain the first service data includes:
performing data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained based on data acquisition of a target service object;
analyzing the original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the original service data fragments, wherein the distorted data is error data in the original service data;
if the target service data fragment exists in the plurality of original service data fragments, taking each original service data fragment except the target service data fragment in the plurality of original service data fragments as the denoised first service data.
According to the big data cleaning method, invalid data with the importance degree lower than the preset degree in the obtained original business data are screened out, so that target business data with higher importance degree can be obtained. Therefore, by screening out the relatively unimportant invalid data and retaining the relatively important valid data, the data cleaning effect is better, and the problem of poor data cleaning effect in the prior art is solved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a block diagram of a big data cleaning device provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart of a big data cleaning method provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in Fig. 1, an embodiment of the application provides a big data cleaning device. The big data cleaning device may include a memory and a processor.
In detail, the memory and the processor are electrically connected directly or indirectly to realize data transmission or interaction. For example, they may be electrically connected to each other via one or more communication buses or signal lines. The memory can have stored therein at least one software function (computer program) which can be present in the form of software or firmware. The processor may be configured to execute the executable computer program stored in the memory, so as to implement the big data cleansing method provided by the embodiments (described later) of the present application.
Alternatively, the memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a System on Chip (SoC), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Also, the structure shown in Fig. 1 is only an illustration, and the big data cleaning device may include more or fewer components than those shown in Fig. 1, or have a configuration different from that shown in Fig. 1; for example, it may include a communication unit for information interaction with other devices.
In an alternative example, the big data cleaning device may be a server with data processing capability.
With reference to Fig. 2, an embodiment of the present application further provides a big data cleaning method, which can be applied to the big data cleaning device described above; the method steps in the corresponding flow can be implemented by the big data cleaning device.
The specific process shown in Fig. 2 will be described in detail below.
Step S110, obtaining original service data to be processed.
In this embodiment, the big data cleaning device may first obtain the raw service data to be processed; for example, the stored raw service data may be obtained from one or more databases.
The original business data is business data of which the data volume is larger than the preset volume and which is obtained by carrying out data acquisition on the target business object. For example, the target business object may be an internet transaction behavior formed based on the internet, and the raw business data may be internet transaction record data formed based on the internet transaction behavior.
Step S120, cleaning the original service data to screen out invalid data in the original service data to obtain target service data.
In this embodiment, after obtaining the original service data based on step S110, the big data cleansing device may perform cleansing processing on the original service data, so as to filter invalid data in the original service data, thereby obtaining valid target service data.
The invalid data is service data with the importance degree lower than a preset degree in the original service data, and the target service data is part or all of the original service data.
Based on the method, the invalid data with the importance degree lower than the preset degree in the obtained original business data is screened out, so that the target business data with higher importance degree can be obtained. Therefore, by screening out the relatively unimportant invalid data and retaining the relatively important valid data, the data cleaning effect is better, and the problem of poor data cleaning effect in the prior art is solved.
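As a concrete illustration of steps S110 and S120 (not part of the claimed method), the following Python sketch filters out records whose importance falls below a preset degree; the scoring function, the threshold value, and the sample records are all assumptions made for the example.

```python
from typing import Callable, Iterable, List

def clean_big_data(
    raw_records: Iterable[str],
    importance_of: Callable[[str], float],
    preset_degree: float,
) -> List[str]:
    """Keep only records whose importance degree reaches the preset degree."""
    target_records = []
    for record in raw_records:
        # Records scored below the preset degree are treated as invalid data.
        if importance_of(record) >= preset_degree:
            target_records.append(record)
    return target_records

# Hypothetical usage: internet transaction records are scored by whether they
# mention a transaction amount.
raw = ["order 1001 amount=25.0", "page view only", "order 1002 amount=7.5"]
print(clean_big_data(raw, lambda r: 1.0 if "amount=" in r else 0.0, 0.5))
# ['order 1001 amount=25.0', 'order 1002 amount=7.5']
```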
In the above example, it should be noted that, in step S120, a specific manner of performing the cleaning process on the original service data is not limited, and may be selected according to an actual application requirement.
For example, in an alternative example, the original business data may be subjected to a cleansing process based on the following steps, so as to obtain the target business data:
firstly, denoising the original service data to screen out distortion data in the original service data to obtain first service data, wherein the distortion data is error data in the original service data, and the first service data is partial or all data in the original service data; secondly, the first service data may be cleaned to screen out invalid data in the first service data to obtain target service data, where the invalid data is service data of which the importance degree is lower than a preset degree in the first service data, and the target service data is part or all of the original service data.
It is understood that, in an alternative example, the specific manner of denoising the raw traffic data may include the following three steps, which are described in detail below.
Firstly, carrying out data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments. In this embodiment, the big data denoising processing device may perform data segmentation processing on the obtained original service data, so that a plurality of original service data segments may be obtained. The original business data is business data of which the data volume is larger than the preset volume and which is obtained by carrying out data acquisition on the target business object. For example, the target business object may be an internet transaction behavior formed based on the internet, and the raw business data may be internet transaction record data formed based on the internet transaction behavior.
And secondly, analyzing and processing the original service data fragments to determine whether a target service data fragment belonging to the distorted data exists in the original service data fragments. In this embodiment, after obtaining the multiple original service data segments, the big data denoising processing device may perform parsing processing on the multiple original service data segments to determine whether a target service data segment belonging to distorted data exists in the multiple original service data segments. And the distortion data is error data in the original service data. For example, the distortion data may represent that the above internet transaction activities are cancelled after the internet transaction activities are completed, such as canceling or returning goods after goods are purchased, or the distortion data may represent that errors occur in data transmission due to downtime, tampering, and the like in the data storage process. And, if the target service data fragment exists in the plurality of original service data fragments, the third step may be performed.
And thirdly, taking each original service data segment except the target service data segment in the plurality of original service data segments as the denoised first service data. In this embodiment, after determining that the target service data segment exists in the plurality of original service data segments, the big data denoising processing device may use each original service data segment other than the target service data segment in the plurality of original service data segments as denoised first service data.
Based on the steps, original service data is divided into a plurality of original service data fragments, whether a target service data fragment belonging to distorted data exists or not is determined, and then each original service data fragment except the target service data fragment is used as the first service data after denoising. Therefore, error data in the original service data can be effectively eliminated, the authenticity of the data in the obtained first service data is high, and the good denoising effect is ensured.
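A minimal sketch of this denoising idea, assuming the raw data has already been split into fragments and that a predicate for distorted data is available (both are placeholders, not the segmentation or analysis rules described later):

```python
from typing import Callable, List

def denoise_fragments(
    raw_fragments: List[str],
    is_distorted: Callable[[str], bool],
) -> List[str]:
    """Drop fragments judged to be distorted (error) data and keep the rest,
    in their original order, as the first service data."""
    return [frag for frag in raw_fragments if not is_distorted(frag)]

# Hypothetical usage: a refunded transaction fragment is treated as distorted.
fragments = ["buy A 10.0", "buy B 12.0", "buy C 11.0 REFUNDED"]
print(denoise_fragments(fragments, lambda f: "REFUNDED" in f))
# ['buy A 10.0', 'buy B 12.0']
```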
It is understood that, in an alternative example, the data segmentation process may be performed on the original service data based on the following steps:
firstly, obtaining original service data from a target database (it can be understood that the target database may not belong to other servers communicatively connected to the big data denoising processing device) communicatively connected to the big data denoising processing device, wherein the original service data is sent to the target database for storage through a corresponding data acquisition device after being obtained based on data acquisition of the target service object;
secondly, acquiring a predetermined target data segmentation rule, wherein the target data segmentation rule is generated based on configuration operation of the big data denoising processing equipment responding to a user;
and then, based on the target data segmentation rule, segmenting the original service data to obtain a plurality of original service data fragments, wherein the original service data fragments are combined according to a certain sequence to form the original service data.
It will be appreciated that in an alternative example, the target data segmentation rule may be obtained based on the following steps:
firstly, performing content recognition processing on the original service data (for example, the content recognition processing may be performed based on some existing text recognition models or a neural network model obtained through pre-training), so as to obtain a content recognition result corresponding to the original service data, where the content recognition result is used to represent type information to which data content of the original service data belongs (for example, the type information may include information related to a transaction amount, information not related to the transaction amount, and the like);
secondly, determining a segmentation rule in a plurality of pre-constructed data segmentation rules based on the content recognition result, wherein the segmentation rule is used as a target data segmentation rule corresponding to the content recognition result, each data segmentation rule is generated based on a configuration operation performed by the big data denoising processing device in response to a user, each data segmentation rule is used for segmenting the original service data into a plurality of original service data segments with different numbers, and the target data segmentation rule is used for segmenting the original service data into a corresponding number of original service data segments.
It will be appreciated that in an alternative example, the target data segmentation rule may be determined among the plurality of data segmentation rules based on the following steps:
firstly, determining target importance information corresponding to type information to which data content of the original service data belongs based on the content identification result and a pre-constructed content-importance corresponding relation, wherein the content-importance corresponding relation is generated based on configuration operation performed by the big data denoising processing device in response to a user (for example, the importance degree corresponding to information related to transaction amount may be higher than the importance degree corresponding to information not related to transaction amount);
secondly, based on the target importance information and a pre-constructed importance-segmentation rule corresponding relation, determining, from the plurality of pre-constructed data segmentation rules, a target data segmentation rule for segmenting the original service data corresponding to the target importance information. The higher the importance corresponding to the target importance information, the larger the number of original service data fragments obtained by segmenting the original service data based on the target data segmentation rule; the lower the importance, the smaller the number of fragments. The target data segmentation rule may be used to divide the original service data into a corresponding number of original service data fragments of equal data quantity, or to divide the original service data into a corresponding number of fragments according to the time sequence in which the data was generated, or to divide it into a corresponding number of fragments according to the continuity of the data contents.
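To make the idea concrete, here is an illustrative sketch of choosing a fragment count from an importance level and splitting the data into fragments of equal data quantity; the two correspondence tables are invented placeholders for the user-configured content-importance and importance-segmentation-rule relations.

```python
from typing import Dict, List, Tuple

# Placeholder tables; in the described method both would be generated from a
# user's configuration operations.
CONTENT_IMPORTANCE: Dict[str, float] = {
    "transaction_amount_related": 0.9,
    "not_amount_related": 0.3,
}
# Higher importance maps to a larger number of fragments.
IMPORTANCE_TO_FRAGMENT_COUNT: List[Tuple[float, int]] = [(0.8, 8), (0.5, 4), (0.0, 2)]

def choose_fragment_count(content_type: str) -> int:
    importance = CONTENT_IMPORTANCE.get(content_type, 0.0)
    for lower_bound, count in IMPORTANCE_TO_FRAGMENT_COUNT:
        if importance >= lower_bound:
            return count
    return 1

def split_equally(raw_data: str, n: int) -> List[str]:
    """Split raw data into n fragments of nearly equal data quantity, in order."""
    base, extra = divmod(len(raw_data), n)
    fragments, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        fragments.append(raw_data[start:start + size])
        start += size
    return fragments

print(len(split_equally("x" * 20, choose_fragment_count("transaction_amount_related"))))  # 8
```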
It will be appreciated that in an alternative example, the parsing process for the plurality of raw traffic data segments may be based on the following steps:
firstly, for each original service data fragment in the plurality of original service data fragments, performing content identification processing (as described above) on the original service data fragment to obtain content representation information corresponding to the original service data fragment, where the content representation information is used to represent data content of the corresponding original service data fragment (e.g., extract keywords therein);
secondly, clustering the plurality of original service data fragments based on the content characterization information corresponding to each original service data fragment to obtain at least one service data fragment set corresponding to the plurality of original service data fragments, wherein each service data fragment set comprises at least one original service data fragment, the content characterization information of any two original service data fragments belonging to the same service data fragment set is the same (for example, the content characterization information may represent identity information of a corresponding user, identity information of a corresponding device, amount information of a corresponding transaction, or time information of a corresponding transaction), and the content characterization information of any two original service data fragments belonging to different service data fragment sets is different;
thirdly, each service data fragment set with the number of original service data fragments which are contained in the at least one service data fragment set being more than or equal to 2 is used as a target service data fragment set;
and fourthly, for each target service data fragment set, carrying out comparative analysis on each original service data fragment in the target service data fragment set so as to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
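The clustering step can be pictured as a simple grouping by a characterization key; the key-extraction function below is an invented stand-in for the content identification processing.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def cluster_fragments(
    fragments: List[str],
    characterize: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group fragments whose content characterization information is identical."""
    sets: Dict[str, List[str]] = defaultdict(list)
    for frag in fragments:
        sets[characterize(frag)].append(frag)
    return dict(sets)

def target_sets(sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only sets containing two or more fragments for comparative analysis."""
    return {key: members for key, members in sets.items() if len(members) >= 2}

# Hypothetical usage: characterize each fragment by the user identity it mentions.
frags = ["user1 pays 10", "user1 pays 12", "user2 pays 5"]
print(target_sets(cluster_fragments(frags, lambda f: f.split()[0])))
# {'user1': ['user1 pays 10', 'user1 pays 12']}
```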
It will be appreciated that in an alternative example, each of the original traffic data segments may be comparatively analyzed based on the following steps:
firstly, for each target service data fragment set, determining the data type of each original service data fragment in the target service data fragment set based on the result of the content identification processing performed on the original service data fragment, wherein the data types include quantized data and non-quantized data, and the non-quantized data includes data with emotional connotation (sentiment);
secondly, for each target business data fragment set, determining a comparative analysis rule corresponding to the target business data fragment set based on the data type corresponding to the target business data fragment set, wherein the comparative analysis rules corresponding to the target business data fragment sets of different data types are different;
then, for each target service data fragment set, performing comparative analysis on each original service data fragment included in the target service data fragment set based on a comparative analysis rule corresponding to the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
It is understood that, in an alternative example, each of the original business data segments may be comparatively analyzed by a corresponding comparative analysis rule based on the following steps:
first, for each target service data fragment set whose corresponding data type is quantized data, performing a mean calculation based on the quantized data values of the original service data fragments in the target service data fragment set to obtain a corresponding quantized mean value, and performing a dispersion calculation based on the quantized mean value and the quantized data values of each original service data fragment to obtain a corresponding quantized data discrete degree value (it may be understood that, for non-quantized data such as data with emotional connotation, the change trend of the sentiment corresponding to each original service data fragment may be analyzed first, for example, whether the sentiment remains unchanged throughout or gradually changes from negative to positive, and the data that do not satisfy the change trend are then screened out as distorted data);
secondly, judging whether the quantized data discrete degree value is larger than a predetermined quantized data discrete degree threshold, wherein the quantized data discrete degree threshold is generated based on a configuration operation performed by the big data denoising processing device in response to a user;
thirdly, if the quantized data discrete degree value is larger than the quantized data discrete degree threshold, determining that the target service data fragment set corresponding to the quantized data discrete degree value does not include a target service data fragment belonging to distorted data;
fourthly, if the quantized data discrete degree value is less than or equal to the quantized data discrete degree threshold, calculating the difference between the quantized data value of each original service data fragment in the target service data fragment set corresponding to the quantized data discrete degree value and the quantized mean value;
fifthly, judging the magnitude relationship between the difference and a predetermined comparison threshold (it is understood that the comparison threshold may be generated based on a configuration performed by a user according to the actual application scenario);
sixthly, if the difference is larger than the comparison threshold, determining the original service data fragment corresponding to the difference as a target service data fragment belonging to distorted data;
and seventhly, if the difference is smaller than or equal to the comparison threshold, not determining the original service data fragment corresponding to the difference as a target service data fragment belonging to distorted data.
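Read literally, these seven steps amount to the following check; the sketch uses population standard deviation as an assumed dispersion measure, and the threshold values and sample data are invented.

```python
from statistics import mean, pstdev
from typing import List

def distorted_indices_in_quantized_set(
    values: List[float],
    dispersion_threshold: float,
    comparison_threshold: float,
) -> List[int]:
    """Return indices of fragments treated as distorted, following the steps above."""
    quantized_mean = mean(values)
    dispersion = pstdev(values)  # assumed dispersion measure
    # Per the described steps, a set whose dispersion exceeds the threshold is
    # treated as containing no distorted fragments.
    if dispersion > dispersion_threshold:
        return []
    return [
        i for i, value in enumerate(values)
        if abs(value - quantized_mean) > comparison_threshold
    ]

# Nine consistent amounts plus one outlier; only the outlier is flagged.
print(distorted_indices_in_quantized_set([10.0] * 9 + [30.0], 10.0, 5.0))  # [9]
```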
It is understood that, in another alternative example, each of the original service data fragments may also be comparatively analyzed based on the following steps:
the method comprises the steps that firstly, for each target service data fragment set, based on content representation information corresponding to each original service data fragment in the target service data fragment set, a historical service data fragment set corresponding to the target service data fragment set is determined, wherein the historical service data fragment set is a service data fragment set which is determined by analyzing other original service data in history and comprises distortion data, and the content representation information of the historical service data fragment set is the same as that of the corresponding target service data fragment set;
secondly, aiming at each historical service data fragment set, sequencing all service data fragments included in the historical service data fragment set based on the relative position relation of all service data fragments included in the historical service data fragment set in other original service data to obtain a historical service data fragment sequence corresponding to the historical service data fragment set;
thirdly, aiming at each target business data fragment set, acquiring the fragment number of original business data fragments included in the target business data fragment set;
fourthly, aiming at each target service data fragment set, determining at least one historical service data fragment subsequence in the historical service data fragment sequence corresponding to the target service data fragment set, wherein the number of service data fragments included in each historical service data fragment subsequence is the number of fragments corresponding to the target service data fragment set, and each historical service data fragment subsequence includes service data fragments belonging to distorted data;
fifthly, calculating a sequence similarity between the ordered set corresponding to each target service data fragment set (that is, the original service data fragments in the target service data fragment set are ordered by their relative positions in the original service data to obtain the ordered set) and the corresponding at least one historical service data fragment subsequence (the sequence similarity can be calculated with an existing sequence similarity calculation method, which is not described in detail herein);
sixthly, determining each target service data fragment set with the sequence similarity meeting a preset similarity threshold (it can be understood that the similarity threshold can be generated based on configuration operation performed by a user according to an actual application scene) as a target service data fragment set with target service data fragments belonging to distorted data, wherein the target service data fragments are determined based on position information of the service data fragments belonging to the distorted data in the corresponding historical service data fragment set.
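A toy version of this history-based comparison follows, using difflib.SequenceMatcher as a stand-in similarity measure (the method above leaves the similarity calculation open) and an invented historical sequence in which the distorted fragment's position is already known.

```python
from difflib import SequenceMatcher
from typing import List

def flag_by_history(
    ordered_target: List[str],
    historical_sequence: List[str],
    historical_distorted_index: int,
    similarity_threshold: float,
) -> List[int]:
    """Flag positions in the ordered target set that align with a known distorted
    fragment inside a sufficiently similar historical subsequence."""
    n = len(ordered_target)
    flagged = set()
    for start in range(len(historical_sequence) - n + 1):
        window = historical_sequence[start:start + n]
        similarity = SequenceMatcher(None, " ".join(ordered_target), " ".join(window)).ratio()
        if similarity >= similarity_threshold and start <= historical_distorted_index < start + n:
            flagged.add(historical_distorted_index - start)
    return sorted(flagged)

# Hypothetical history: the "pay 0.00" fragment was previously found distorted.
history = ["login", "add to cart", "pay 0.00", "logout"]
target = ["login", "add to cart", "pay 0.00"]
print(flag_by_history(target, history, 2, 0.8))  # [2]
```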
It is understood that, in an alternative example, the denoised first traffic data may be obtained based on the following steps:
firstly, if the target service data fragment exists in the plurality of original service data fragments, determining the relative position relationship of each original service data fragment except the target service data fragment in the plurality of original service data fragments in the original service data;
secondly, combining each original service data segment except the target service data segment in the plurality of original service data segments based on the relative position relationship to obtain the first service data of which the original service data is denoised.
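A small sketch of this recombination, assuming each kept fragment carries its relative position in the original service data (the position index is an assumption made for illustration):

```python
from typing import List, Tuple

def recombine_fragments(
    kept_fragments: List[Tuple[int, str]],  # (relative position in raw data, fragment)
) -> str:
    """Reassemble the first service data by sorting the remaining fragments
    according to their relative positions in the original service data."""
    return "".join(frag for _, frag in sorted(kept_fragments))

# Hypothetical usage: the fragment at position 1 was distorted and removed.
print(recombine_fragments([(2, "CC"), (0, "AA"), (3, "DD")]))  # AACCDD
```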
It can be understood that, in an alternative example, if it is determined that the target service data segment does not exist in the multiple original service data segments, all of the multiple original service data segments may be used as the denoised first service data, that is, the original service data may be directly used as the denoised first service data.
It is understood that, in an alternative example, a specific manner of performing the cleansing process on the first service data may include the following steps:
firstly, performing content recognition processing on the first service data (for example, the recognition processing may be performed based on some text recognition models in the prior art, where the text recognition models may be neural network models trained in advance based on sample data), and obtaining corresponding content recognition results; secondly, determining importance degree of each data part of the first service data based on the content identification result to obtain importance degree information corresponding to each data part; thirdly, determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part; and fourthly, determining the data part which does not belong to the invalid data as the target service data.
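For illustration, the four steps can be strung together as below; the regular-expression "content recognition" and the content-importance table are stand-ins for the recognition model and the user-configured correspondence, not part of the claimed method.

```python
import re
from typing import Dict, List

# Placeholder content-importance correspondence (in the described method it is
# generated from the user's first configuration operation).
CONTENT_IMPORTANCE: Dict[str, float] = {"amount_related": 0.9, "other": 0.2}

def recognize_content(part: str) -> str:
    """Tiny stand-in for a trained content recognition model."""
    return "amount_related" if re.search(r"amount=\d", part) else "other"

def clean_first_service_data(parts: List[str], preset_degree: float) -> List[str]:
    target = []
    for part in parts:
        importance = CONTENT_IMPORTANCE[recognize_content(part)]
        # Parts whose importance is below the preset degree are invalid data.
        if importance >= preset_degree:
            target.append(part)
    return target

parts = ["order 7 amount=19.9", "banner clicked", "order 8 amount=3.5"]
print(clean_first_service_data(parts, 0.5))
# ['order 7 amount=19.9', 'order 8 amount=3.5']
```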
It is to be understood that, in another alternative example, a specific manner of performing the cleansing process on the first service data may also include the following steps:
first, in response to the identification processing performed by the user on the first service data, a corresponding identification result is obtained (that is, each data portion of the first service data may be identified by the corresponding user to obtain a corresponding identification result, where the identification result may refer to data content represented by the corresponding data portion, such as a main body of a transaction behavior); secondly, determining importance degree of each data part of the first service data based on the identification result to obtain importance degree information corresponding to each data part; thirdly, determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part; and fourthly, determining the data part which does not belong to the invalid data as the target service data.
It is understood that, in other alternative examples, the specific manner of performing the cleansing process on the first service data may also include the following steps:
first, in response to the importance level identification processing performed by the user on each data portion of the first service data, obtaining importance level information corresponding to each data portion (that is, the importance level information of each data portion can be obtained by identifying each data portion of the first service data by the corresponding user); secondly, determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part; and thirdly, determining the data part which does not belong to the invalid data as the target service data.
It will be appreciated that in the first alternative example described above, in one possible example, the importance level determination process may be performed based on the following steps:
firstly, obtaining a content-importance degree corresponding relation which is constructed in advance, wherein the content-importance degree corresponding relation is generated on the basis of a first configuration operation which is carried out by the big data cleaning equipment in response to a user; and secondly, determining importance degree information of each data part of the first service data based on the content identification result and the content-importance degree corresponding relation.
It is to be understood that, in the first alternative example described above, in another possible example, the importance level determination process may be performed based on the following steps:
firstly, judging whether preset mark information exists in each data part in the first service data, wherein the preset mark information is generated based on response user operation; secondly, determining the importance degree information of each data part with the preset mark information as having first importance degree information, and determining the importance degree information of each data part without the preset mark information as having second importance degree information, wherein the first importance degree information is used for representing that the corresponding data part does not belong to invalid data, and the second importance degree information is used for representing that the corresponding data part belongs to invalid data.
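In code, this marking-based variant reduces to a simple lookup; the DataPart structure and the "keep"/"discard" labels are invented for the example.

```python
from typing import Dict, List, NamedTuple

class DataPart(NamedTuple):
    content: str
    has_preset_mark: bool  # mark information generated in response to a user operation

FIRST_IMPORTANCE = "keep"      # the part does not belong to invalid data
SECOND_IMPORTANCE = "discard"  # the part belongs to invalid data

def importance_by_mark(parts: List[DataPart]) -> Dict[str, str]:
    """Assign first/second importance information purely from the preset mark."""
    return {
        part.content: FIRST_IMPORTANCE if part.has_preset_mark else SECOND_IMPORTANCE
        for part in parts
    }

parts = [DataPart("order 9 amount=4.0", True), DataPart("heartbeat ping", False)]
print(importance_by_mark(parts))
# {'order 9 amount=4.0': 'keep', 'heartbeat ping': 'discard'}
```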
It will be appreciated that in the first alternative example described above, in one possible example, it may be determined whether each of the data portions belongs to invalid data based on:
the method comprises the steps of firstly, acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated on the basis of a second configuration operation of the big data cleaning equipment responding to a user; secondly, judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information; and thirdly, determining the data part corresponding to each piece of importance degree information smaller than the importance degree threshold information as invalid data, and determining the data part corresponding to each piece of importance degree information larger than or equal to the importance degree threshold information as valid data.
It will be appreciated that in the first alternative example described above, in another possible example, it may be determined whether each of the data portions belongs to invalid data based on the following steps:
firstly, acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated based on a second configuration operation performed by the big data cleaning device in response to a user; secondly, judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information; thirdly, counting the data amount of the data parts whose importance degree information is greater than or equal to the importance degree threshold information; fourthly, if the data amount is greater than or equal to a predetermined target data amount (the target data amount may be generated based on a third configuration operation performed by the big data cleaning device in response to a user), determining the data parts whose importance degree information is smaller than the importance degree threshold information as invalid data, and determining the data parts whose importance degree information is greater than or equal to the importance degree threshold information as valid data; and fifthly, if the data amount is smaller than the target data amount, determining each data part as valid data.
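The fallback variant can be sketched as follows; counting the "data amount" as the number of surviving parts is an assumption (it could equally be a byte count), and the thresholds and sample data are invented.

```python
from typing import List, Tuple

def split_valid_invalid(
    scored_parts: List[Tuple[str, float]],
    importance_threshold: float,
    target_data_amount: int,
) -> Tuple[List[str], List[str]]:
    """Apply the importance threshold, but keep everything when too little data
    would survive; returns (valid, invalid)."""
    valid = [part for part, score in scored_parts if score >= importance_threshold]
    invalid = [part for part, score in scored_parts if score < importance_threshold]
    if len(valid) < target_data_amount:
        # Not enough valid data: treat every data part as valid.
        return [part for part, _ in scored_parts], []
    return valid, invalid

parts = [("order amount=12", 0.9), ("page view", 0.2), ("order amount=3", 0.8)]
print(split_valid_invalid(parts, 0.5, 2))
# (['order amount=12', 'order amount=3'], ['page view'])
print(split_valid_invalid(parts, 0.5, 3))
# (['order amount=12', 'page view', 'order amount=3'], [])
```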
In summary, according to the big data cleaning method provided by the application, invalid data with an importance degree lower than a preset degree in the obtained original business data is screened out, so that target business data with a higher importance degree can be obtained. Therefore, by screening out the relatively unimportant invalid data and retaining the relatively important valid data, the data cleaning effect is better, and the problem of poor data cleaning effect in the prior art is solved.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A big data cleaning method is applied to big data cleaning equipment and comprises the following steps:
acquiring original service data to be processed, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained on the basis of data acquisition of a target service object;
cleaning the original service data to screen out invalid data in the original service data to obtain target service data, wherein the invalid data is service data of which the importance degree is lower than a preset degree in the original service data, and the target service data is part or all of the original service data;
the step of cleaning the original service data to screen out invalid data in the original service data to obtain target service data includes:
denoising the original service data to screen distortion data in the original service data to obtain first service data, wherein the distortion data is error data in the original service data, and the first service data is part or all of data in the original service data;
cleaning the first service data to screen out invalid data in the first service data to obtain target service data, wherein the invalid data is the service data of which the importance degree in the first service data is lower than a preset degree, and the target service data is part or all of the original service data;
the step of denoising the original service data to screen distortion data in the original service data to obtain first service data includes:
performing data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments, wherein the original service data are service data of which the data volume is larger than a preset volume and obtained on the basis of data acquisition on a target service object;
analyzing the original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the original service data fragments, wherein the distorted data is error data in the original service data;
if the target service data fragment exists in the plurality of original service data fragments, taking each original service data fragment except the target service data fragment in the plurality of original service data fragments as denoised first service data;
the step of parsing the multiple original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the multiple original service data fragments includes:
for each original service data fragment in the plurality of original service data fragments, performing content identification processing on the original service data fragment to obtain content representation information corresponding to the original service data fragment, wherein the content representation information is used for representing the data content of the corresponding original service data fragment;
based on content characterization information corresponding to each original service data fragment, performing clustering processing on the plurality of original service data fragments to obtain at least one service data fragment set corresponding to the plurality of original service data fragments, wherein each service data fragment set comprises at least one original service data fragment, the content characterization information of any two original service data fragments belonging to the same service data fragment set is the same, and the content characterization information of any two original service data fragments belonging to different service data fragment sets is different;
taking each service data fragment set, in which the number of original service data fragments included in the at least one service data fragment set is greater than or equal to 2, as a target service data fragment set;
for each target service data fragment set, performing comparative analysis on each original service data fragment in the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set;
the step of comparing and analyzing each original service data fragment in the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set for each target service data fragment set includes:
for each target service data fragment set, determining a historical service data fragment set corresponding to the target service data fragment set based on content characterization information corresponding to each original service data fragment in the target service data fragment set, wherein the historical service data fragment set is a service data fragment set which is determined by analyzing other original service data historically and comprises distorted data, and the content characterization information of the historical service data fragment set is the same as that of the corresponding target service data fragment set;
for each historical service data fragment set, based on the relative position relationship of each service data fragment included in the historical service data fragment set in the other original service data, sequencing each service data fragment included in the historical service data fragment set to obtain a historical service data fragment sequence corresponding to the historical service data fragment set;
aiming at each target service data fragment set, acquiring the fragment number of original service data fragments included in the target service data fragment set;
for each target service data fragment set, determining at least one historical service data fragment subsequence in a historical service data fragment sequence corresponding to the target service data fragment set, wherein the number of service data fragments included in each historical service data fragment subsequence is the number of fragments corresponding to the target service data fragment set, and each historical service data fragment subsequence includes service data fragments belonging to distorted data;
aiming at each target business data fragment set, calculating the sequence similarity between the ordered set corresponding to the target business data fragment set and the corresponding at least one historical business data fragment subsequence;
determining each target service data fragment set whose sequence similarity meets a preset similarity threshold as a target service data fragment set in which target service data fragments belonging to distorted data exist, wherein the target service data fragments are determined based on the position information of the service data fragments belonging to distorted data in the corresponding historical service data fragment set.
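
The comparative-analysis flow set out in claim 1 above (clustering fragments by content characterization, keeping only fragment sets with at least two members, ordering the matching historical fragments, and comparing equal-length historical subsequences against the target set by sequence similarity) can be illustrated with a minimal Python sketch. The characterize() helper, the shape of the history mapping, and the use of difflib.SequenceMatcher as the similarity measure are illustrative assumptions, not part of the claimed method.

from collections import defaultdict
from difflib import SequenceMatcher

def characterize(fragment):
    # Stand-in content identification: a normalized form of the fragment
    # serves as its content characterization information.
    return fragment.strip().lower()

def find_distorted_sets(fragments, history, similarity_threshold=0.8):
    # fragments: original service data fragments, in their original order.
    # history: content characterization -> ordered list of
    #          (fragment, is_distorted) pairs from previously analyzed data.
    clusters = defaultdict(list)
    for position, fragment in enumerate(fragments):
        clusters[characterize(fragment)].append((position, fragment))

    flagged = []
    for key, members in clusters.items():
        if len(members) < 2 or key not in history:
            continue  # only sets with two or more fragments and a historical match
        ordered_target = [frag for _, frag in sorted(members)]
        hist_sequence = history[key]
        n = len(ordered_target)
        # Slide a window of length n over the historical sequence, keeping
        # only subsequences that contain at least one distorted fragment.
        for start in range(len(hist_sequence) - n + 1):
            window = hist_sequence[start:start + n]
            if not any(is_distorted for _, is_distorted in window):
                continue
            similarity = SequenceMatcher(
                None, ordered_target, [frag for frag, _ in window]).ratio()
            if similarity >= similarity_threshold:
                flagged.append(key)
                break
    return flagged
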
2. The big data cleaning method according to claim 1, wherein the step of cleaning the first service data to screen out invalid data in the first service data to obtain target service data comprises:
performing content identification processing on the first service data to obtain a corresponding content identification result;
determining the importance degree of each data part of the first service data based on the content identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data parts which do not belong to the invalid data as the target service data.
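
A minimal sketch of the cleaning step of claim 2, assuming the data parts are strings and that identify_content() and the numeric importance scores stand in for the content identification and importance degree determination of the claim; none of these names come from the patent itself.

def identify_content(part):
    # Stand-in content identification: label a data part by a simple keyword test.
    return "order_record" if "order" in part else "misc"

def clean_first_service_data(data_parts, importance_of, threshold=0.5):
    # importance_of: callable mapping a content label to an importance degree score.
    target_service_data = []
    for part in data_parts:
        label = identify_content(part)       # content identification result
        score = importance_of(label)         # importance degree information
        if score >= threshold:               # parts below the threshold are invalid data
            target_service_data.append(part)
    return target_service_data
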
3. The big data cleaning method according to claim 2, wherein the step of determining the importance degree of each data part of the first service data based on the content identification result to obtain the importance degree information corresponding to each data part comprises:
obtaining a pre-constructed content-importance degree corresponding relation, wherein the content-importance degree corresponding relation is generated by the big data cleaning device in response to a first configuration operation of a user;
and determining importance degree information of each data part of the first service data based on the content identification result and the content-importance degree corresponding relation.
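
One way the pre-constructed content-importance degree corresponding relation of claim 3 could be held, sketched as a plain lookup table assumed to have been populated from the user's first configuration operation; the labels and scores are purely illustrative.

CONTENT_IMPORTANCE = {       # content label -> importance degree
    "order_record": 0.9,
    "user_profile": 0.7,
    "debug_log": 0.1,
}

def importance_from_correspondence(content_label, default=0.0):
    # Content missing from the correspondence falls back to a low default importance degree.
    return CONTENT_IMPORTANCE.get(content_label, default)

Passing importance_from_correspondence as the importance_of argument of the sketch under claim 2 reproduces the combined flow of claims 2 and 3.
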
4. The big data cleaning method according to claim 2, wherein the step of determining the importance degree of each data part of the first service data based on the content identification result to obtain the importance degree information corresponding to each data part comprises:
for each data part in the first service data, judging whether the data part carries preset mark information, wherein the preset mark information is generated in response to a user operation;
and determining the importance degree information of each data part carrying the preset mark information as first importance degree information, and determining the importance degree information of each data part not carrying the preset mark information as second importance degree information, wherein the first importance degree information indicates that the corresponding data part does not belong to invalid data, and the second importance degree information indicates that the corresponding data part belongs to invalid data.
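
The mark-based variant of claim 4 reduces to a membership test. A sketch under the assumption that the preset mark information is a set of user-marked part identifiers; the two label constants are illustrative, not prescribed.

FIRST_IMPORTANCE = "not_invalid"    # part carries the preset mark information
SECOND_IMPORTANCE = "invalid"       # part carries no preset mark information

def importance_from_marks(part_ids, marked_ids):
    # marked_ids: identifiers of the data parts the user marked in advance.
    return {
        part_id: FIRST_IMPORTANCE if part_id in marked_ids else SECOND_IMPORTANCE
        for part_id in part_ids
    }
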
5. The big data cleaning method according to claim 2, wherein the step of determining whether each data part belongs to invalid data based on the importance degree information corresponding to each data part comprises:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated by the big data cleaning device in response to a second configuration operation of a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information;
and determining each data part whose importance degree information is smaller than the importance degree threshold information as invalid data, and determining each data part whose importance degree information is greater than or equal to the importance degree threshold information as valid data.
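
Claim 5 is a straight threshold comparison. A sketch assuming the importance degree information is a numeric score per data part and the threshold information is a single number.

def split_by_threshold(importance, threshold):
    # importance: dict of data part identifier -> importance degree score.
    valid = {pid for pid, score in importance.items() if score >= threshold}
    invalid = set(importance) - valid
    return valid, invalid
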
6. The big data cleaning method according to claim 2, wherein the step of determining whether each data part belongs to invalid data based on the importance degree information corresponding to each data part comprises:
acquiring preset importance degree threshold information, wherein the importance degree threshold information is generated by the big data cleaning device in response to a second configuration operation of a user;
judging whether the importance degree information corresponding to each data part is smaller than the importance degree threshold information;
counting the data quantity of the data parts whose importance degree information is greater than or equal to the importance degree threshold information;
if the data quantity is greater than or equal to a predetermined target data quantity, determining each data part whose importance degree information is smaller than the importance degree threshold information as invalid data, and determining each data part whose importance degree information is greater than or equal to the importance degree threshold information as valid data;
and if the data quantity is smaller than the target data quantity, determining each data part as valid data.
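
Claim 6 adds a quantity guard to the threshold test of claim 5: if too few data parts would survive, every part is kept as valid data. The numeric scores and the target_data_quantity parameter are illustrative assumptions.

def split_with_quantity_guard(importance, threshold, target_data_quantity):
    # importance: dict of data part identifier -> importance degree score.
    valid = {pid for pid, score in importance.items() if score >= threshold}
    if len(valid) >= target_data_quantity:
        invalid = set(importance) - valid
    else:
        # Too few parts survive the threshold; treat every data part as valid.
        valid, invalid = set(importance), set()
    return valid, invalid
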
7. The big data cleaning method according to claim 1, wherein the step of cleaning the first service data to screen out invalid data in the first service data to obtain target service data comprises:
in response to an identification operation performed by a user on the first service data, obtaining a corresponding identification result;
determining the importance degree of each data part of the first service data based on the identification result to obtain importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data parts which do not belong to the invalid data as the target service data.
8. The big data cleaning method according to claim 1, wherein the step of cleaning the first service data to screen out invalid data in the first service data to obtain target service data comprises:
in response to an importance degree identification operation performed by a user on each data part of the first service data, obtaining importance degree information corresponding to each data part;
determining whether each data part belongs to invalid data or not based on the corresponding importance degree information of each data part;
and determining the data parts which do not belong to the invalid data as the target service data.
CN202110571367.9A 2021-05-25 2021-05-25 Big data cleaning method Active CN113138982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110571367.9A CN113138982B (en) 2021-05-25 2021-05-25 Big data cleaning method

Publications (2)

Publication Number Publication Date
CN113138982A CN113138982A (en) 2021-07-20
CN113138982B (en) 2022-09-27

Family

ID=76817463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110571367.9A Active CN113138982B (en) 2021-05-25 2021-05-25 Big data cleaning method

Country Status (1)

Country Link
CN (1) CN113138982B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363466B (en) * 2022-03-22 2022-06-10 长沙居美网络科技有限公司 Intelligent cloud calling system based on AI
CN116204769B (en) * 2023-03-06 2023-12-05 深圳市乐易网络股份有限公司 Data cleaning method, system and storage medium based on data classification and identification
CN116303398A (en) * 2023-03-21 2023-06-23 华联世纪工程咨询股份有限公司 Historical engineering cost data cleaning method
CN116243833B (en) * 2023-05-08 2023-07-14 北京国信新网通讯技术有限公司 Cloud data-based electronic government platform communication management method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297919A (en) * 2019-05-23 2019-10-01 深圳壹账通智能科技有限公司 A kind of data cleaning method, device, equipment and storage medium
CN110427453A (en) * 2019-05-31 2019-11-08 平安科技(深圳)有限公司 Similarity calculating method, device, computer equipment and the storage medium of data
CN111488429A (en) * 2020-03-19 2020-08-04 杭州叙简科技股份有限公司 Short text clustering system based on search engine and short text clustering method thereof
CN111949647A (en) * 2020-09-03 2020-11-17 深圳市安亿通科技发展有限公司 Emergency management service data cleaning method, system, terminal and readable storage medium
CN112699646A (en) * 2020-12-23 2021-04-23 平安信托有限责任公司 Data processing method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045807A (en) * 2015-06-04 2015-11-11 浙江力石科技股份有限公司 Data cleaning algorithm based on Internet trading information
CN111241258A (en) * 2020-01-08 2020-06-05 泰康保险集团股份有限公司 Data cleaning method and device, computer equipment and readable storage medium
CN111563071A (en) * 2020-04-03 2020-08-21 深圳价值在线信息科技股份有限公司 Data cleaning method and device, terminal equipment and computer readable storage medium
CN112667617A (en) * 2020-12-30 2021-04-16 南京诚勤教育科技有限公司 Visual data cleaning system and method based on natural language
CN112750029A (en) * 2020-12-30 2021-05-04 北京知因智慧科技有限公司 Credit risk prediction method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113138982A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113138982B (en) Big data cleaning method
CN109191226B (en) Risk control method and device
WO2021164232A1 (en) User identification method and apparatus, and device and storage medium
CN112990386A (en) User value clustering method and device, computer equipment and storage medium
CN111144941A (en) Merchant score generation method, device, equipment and readable storage medium
CN116562991B (en) Commodity big data information identification method and system for meta-space electronic commerce platform
CN108197795B (en) Malicious group account identification method, device, terminal and storage medium
CN114581207A (en) Commodity image big data accurate pushing method and system for E-commerce platform
CN106997350B (en) Data processing method and device
CN112016756A (en) Data prediction method and device
CN113313217B (en) Method and system for accurately identifying dip angle characters based on robust template
CN114780606A (en) Big data mining method and system
CN113239031A (en) Big data denoising processing method
CN112884480A (en) Method and device for constructing abnormal transaction identification model, computer equipment and medium
CN112819476A (en) Risk identification method and device, nonvolatile storage medium and processor
CN113239381A (en) Data security encryption method
CN116610821A (en) Knowledge graph-based enterprise risk analysis method, system and storage medium
CN110659981A (en) Enterprise dependency relationship identification method and device and electronic equipment
CN113065892B (en) Information pushing method, device, equipment and storage medium
CN112269879B (en) Method and equipment for analyzing middle station log based on k-means algorithm
CN113256402A (en) Risk control rule determination method and device and electronic equipment
CN115439079A (en) Item classification method and device
CN113688206A (en) Text recognition-based trend analysis method, device, equipment and medium
CN114064872A (en) Intelligent storage method, device, equipment and medium for dialogue data information
CN112907306B (en) Customer satisfaction judging method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220908

Address after: J343, 3rd Floor, Port Building, Maritime Center, No. 59, Linhai Avenue, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000

Applicant after: Shenzhen yuanuniverse Technology Co.,Ltd.

Address before: 4 / F, complex building, 17 Danli Road, longhuan, Maling community, South District, Zhongshan City, Guangdong Province, 528455

Applicant before: Huang Zhuting

GR01 Patent grant