CN113239031A - Big data denoising processing method - Google Patents

Info

Publication number
CN113239031A
Authority
CN
China
Prior art keywords
service data
data
target
original
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110571342.9A
Other languages
Chinese (zh)
Inventor
黄柱挺
申海平
沈陕威
苏军武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202110571342.9A
Publication of CN113239031A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a big data denoising processing method, and relates to the technical field of big data processing. In the application, firstly, data segmentation processing is carried out on the obtained original service data to obtain a plurality of original service data fragments; secondly, analyzing and processing the plurality of original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the plurality of original service data fragments, wherein the distorted data is error data in the original service data; then, if a target service data segment exists in the plurality of original service data segments, each original service data segment except the target service data segment in the plurality of original service data segments is used as the first service data after denoising. Based on the method, the problem of poor data denoising effect in the prior art can be solved.

Description

Big data denoising processing method
Technical Field
The application relates to the technical field of big data processing, in particular to a big data denoising processing method.
Background
With the continuous development of internet technology, its range of application keeps expanding, and a large amount of data is generated as a result. In the prior art, this data can be used for information prediction, orientation processing, and the like. However, the inventors found through research that, although data is generally denoised before it is used in order to ensure its accuracy, existing denoising techniques have a poor denoising effect.
Disclosure of Invention
In view of this, an object of the present application is to provide a big data denoising processing method, so as to solve the problem in the prior art that the effect of denoising data is poor.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
A big data denoising processing method, applied to a big data denoising processing device, comprising the following steps:
performing data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained based on data acquisition of a target service object;
analyzing the original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the original service data fragments, wherein the distorted data is error data in the original service data;
if the target service data fragment exists in the plurality of original service data fragments, taking each original service data fragment except the target service data fragment in the plurality of original service data fragments as the denoised first service data.
In a possible embodiment, in the big data denoising processing method, the step of performing data segmentation processing on the obtained original service data to obtain a plurality of original service data segments includes:
original business data are obtained from a target database in communication connection with the big data denoising processing equipment, wherein the original business data are sent to the target database for storage through corresponding data acquisition equipment after being obtained based on data acquisition of the target business object;
acquiring a predetermined target data segmentation rule, wherein the target data segmentation rule is generated based on configuration operation of the big data denoising processing equipment responding to a user;
and segmenting the original service data based on the target data segmentation rule to obtain a plurality of original service data fragments, wherein the original service data fragments are combined according to a certain sequence to form the original service data.
In a possible embodiment, in the above big data denoising processing method, the step of obtaining a predetermined target data segmentation rule includes:
performing content identification processing on the original service data to obtain a content identification result corresponding to the original service data, wherein the content identification result is used for representing type information to which data content of the original service data belongs;
determining a segmentation rule from a plurality of pre-constructed data segmentation rules based on the content identification result, wherein the segmentation rule is used as a target data segmentation rule corresponding to the content identification result, each data segmentation rule is generated based on a configuration operation performed by the big data denoising processing device in response to a user, each data segmentation rule is used for segmenting the original service data into a plurality of original service data segments with different numbers, and the target data segmentation rule is used for segmenting the original service data into a corresponding number of original service data segments.
In a possible embodiment, in the big data denoising processing method, the step of determining, based on the content recognition result, one segmentation rule among a plurality of pre-constructed data segmentation rules as a target data segmentation rule corresponding to the content recognition result includes:
determining target importance information corresponding to type information to which the data content of the original service data belongs based on the content identification result and a pre-constructed content-importance corresponding relation, wherein the content-importance corresponding relation is generated based on configuration operation of the big data denoising processing equipment responding to a user;
determining, among a plurality of pre-constructed data segmentation rules, a target data segmentation rule based on the target importance information and a pre-constructed importance-segmentation rule corresponding relation, wherein the higher the importance corresponding to the target importance information is, the larger the number of original business data segments obtained by segmenting the original business data based on the target data segmentation rule is, and the lower the importance corresponding to the target importance information is, the smaller that number is; and the target data segmentation rule is used for segmenting the original business data into a corresponding number of original business data segments of equal data volume, or for segmenting the original business data into a corresponding number of original business data segments according to the time sequence in which the data was generated, or for segmenting the original business data into a corresponding number of original business data segments according to the continuity of the data content.
In a possible embodiment, in the big data denoising processing method, the step of parsing the plurality of original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the plurality of original service data fragments includes:
for each original service data fragment in the plurality of original service data fragments, performing content identification processing on the original service data fragment to obtain content representation information corresponding to the original service data fragment, wherein the content representation information is used for representing the data content of the corresponding original service data fragment;
based on content characterization information corresponding to each original service data fragment, performing clustering processing on the plurality of original service data fragments to obtain at least one service data fragment set corresponding to the plurality of original service data fragments, wherein each service data fragment set comprises at least one original service data fragment, the content characterization information of any two original service data fragments belonging to the same service data fragment set is the same, and the content characterization information of any two original service data fragments belonging to different service data fragment sets is different;
taking each service data fragment set, in which the number of original service data fragments included in the at least one service data fragment set is greater than or equal to 2, as a target service data fragment set;
and for each target service data fragment set, carrying out comparative analysis on each original service data fragment in the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
In a possible embodiment, in the above method for denoising big data, the step of comparing and analyzing each original service data fragment in the target service data fragment set for each target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set includes:
for each target business data fragment set, determining the data type of each original business data fragment in the target business data fragment set based on the result of content identification processing on the original business data fragment, wherein the data type comprises quantized data and non-quantized data, and the non-quantized data comprises data with emotional colors;
for each target business data fragment set, determining a comparative analysis rule corresponding to the target business data fragment set based on the data type corresponding to the target business data fragment set, wherein the comparative analysis rules corresponding to the target business data fragment sets of different data types are different;
and for each target business data fragment set, carrying out comparative analysis on each original business data fragment included in the target business data fragment set based on a comparative analysis rule corresponding to the target business data fragment set so as to determine whether a target business data fragment belonging to distorted data exists in the target business data fragment set.
In a possible embodiment, in the above big data denoising processing method, the step of, for each target service data fragment set, performing comparative analysis on each original service data fragment included in the target service data fragment set based on a comparative analysis rule corresponding to the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set includes:
for each target service data segment set of which the corresponding data type is the quantized data, performing mean value calculation based on quantized data values of all the original service data segments in the target service data segment set to obtain a corresponding quantized mean value, and performing discrete calculation based on the quantized mean value and the quantized data values of all the original service data segments to obtain a corresponding quantized data discrete degree value;
judging whether the quantized data discrete degree value is larger than a predetermined quantized data discrete degree threshold value or not, wherein the quantized data discrete degree threshold value is generated on the basis of the configuration operation of the big data denoising processing equipment responding to a user;
if the quantized data discrete degree value is larger than the quantized data discrete degree threshold value, determining that the target service data fragment set corresponding to the quantized data discrete degree value does not include a target service data fragment belonging to distorted data;
if the quantized data discrete degree value is smaller than or equal to the quantized data discrete degree threshold, calculating a difference value between the quantized data value of each original service data segment in a target service data segment set corresponding to the quantized data discrete degree value and the quantized mean value;
judging the magnitude relation between the difference value and a predetermined comparison threshold value;
if the difference is larger than the comparison threshold, determining the original service data fragment corresponding to the difference as a target service data fragment belonging to distorted data;
and if the difference is smaller than or equal to the comparison threshold, not determining the original service data fragment corresponding to the difference as a target service data fragment belonging to the distorted data.
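The quantized-data branch described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the source does not say how the "discrete degree value" is calculated, so the population standard deviation is assumed here, the "difference value" is assumed to be an absolute difference, and the function name `find_distorted_fragments` and its parameters are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def find_distorted_fragments(
    values: Sequence[float],
    dispersion_threshold: float,
    comparison_threshold: float,
) -> List[int]:
    """Return indices of fragments treated as distorted data in one target set.

    values: quantized data value of each fragment in the target set.
    dispersion_threshold / comparison_threshold: user-configured thresholds.
    """
    mean = fmean(values)
    # Assumption: the "discrete degree value" is the population standard deviation.
    dispersion = pstdev(values, mu=mean)

    # Per the described steps, a set whose dispersion exceeds the threshold
    # is treated as containing no distorted fragment.
    if dispersion > dispersion_threshold:
        return []

    # Otherwise, flag each fragment whose (absolute) difference from the mean
    # is larger than the comparison threshold.
    return [i for i, v in enumerate(values) if abs(v - mean) > comparison_threshold]


if __name__ == "__main__":
    amounts = [10.0, 10.0, 10.0, 10.0, 20.0]
    print(find_distorted_fragments(amounts, dispersion_threshold=5.0, comparison_threshold=5.0))
```

In this sketch, both thresholds play the role of the user-configured values mentioned in the steps; how they should be chosen for a given data type is not specified in the source.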
In a possible embodiment, in the above method for denoising big data, the step of comparing and analyzing each original service data fragment in the target service data fragment set for each target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set includes:
for each target service data fragment set, determining a historical service data fragment set corresponding to the target service data fragment set based on content characterization information corresponding to each original service data fragment in the target service data fragment set, wherein the historical service data fragment set is a service data fragment set which is determined by analyzing other original service data historically and comprises distorted data, and the content characterization information of the historical service data fragment set is the same as that of the corresponding target service data fragment set;
for each historical service data fragment set, based on the relative position relationship of each service data fragment included in the historical service data fragment set in the other original service data, sequencing each service data fragment included in the historical service data fragment set to obtain a historical service data fragment sequence corresponding to the historical service data fragment set;
aiming at each target service data fragment set, acquiring the fragment number of original service data fragments included in the target service data fragment set;
for each target service data fragment set, determining at least one historical service data fragment subsequence in a historical service data fragment sequence corresponding to the target service data fragment set, wherein the number of service data fragments included in each historical service data fragment subsequence is the number of fragments corresponding to the target service data fragment set, and each historical service data fragment subsequence includes service data fragments belonging to distorted data;
aiming at each target business data fragment set, calculating the sequence similarity between the ordered set corresponding to the target business data fragment set and the corresponding at least one historical business data fragment subsequence;
determining each target service data fragment set with sequence similarity meeting a preset similarity threshold as a target service data fragment set with target service data fragments belonging to distorted data, wherein the target service data fragments are determined based on the position information of the service data fragments belonging to the distorted data in the corresponding historical service data fragment set.
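The history-based comparison above can be illustrated with a short sketch. The source does not define how the sequence similarity is calculated, so this example assumes the ratio from Python's `difflib.SequenceMatcher`; each fragment is reduced to a simple label, and the names `match_against_history`, `target_labels`, and `history` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def match_against_history(
    target_labels: Sequence[str],
    history: Sequence[Tuple[str, bool]],
    similarity_threshold: float,
) -> List[int]:
    """Return positions in the target set that are treated as distorted data.

    target_labels: ordered fragments of one target set (reduced to labels).
    history: ordered historical fragments as (label, is_distorted) pairs.
    """
    n = len(target_labels)
    best_ratio, best_positions = 0.0, []

    # Slide a window of n fragments over the historical sequence; only
    # windows that contain at least one distorted fragment are candidates.
    for start in range(len(history) - n + 1):
        window = history[start:start + n]
        positions = [i for i, (_, distorted) in enumerate(window) if distorted]
        if not positions:
            continue
        # Assumption: similarity is SequenceMatcher's ratio over the labels.
        ratio = SequenceMatcher(
            None, list(target_labels), [label for label, _ in window]
        ).ratio()
        if ratio > best_ratio:
            best_ratio, best_positions = ratio, positions

    # The distorted positions of the best-matching window are mapped onto
    # the target set only if the similarity reaches the threshold.
    return best_positions if best_ratio >= similarity_threshold else []


if __name__ == "__main__":
    past = [("pay", False), ("ship", False), ("refund", True), ("pay", False)]
    print(match_against_history(["pay", "ship", "refund"], past, 0.9))
```

Taking the best-matching subsequence is one possible reading; the steps only require that the similarity satisfy a preset threshold and that the target fragments be located from the positions of the distorted fragments in the historical set.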
In a possible embodiment, in the above method for denoising big data, if the target service data segment exists in the plurality of original service data segments, the step of using each original service data segment other than the target service data segment in the plurality of original service data segments as denoised first service data includes:
if the target service data fragment exists in the plurality of original service data fragments, determining the relative position relationship of each original service data fragment except the target service data fragment in the plurality of original service data fragments in the original service data;
and combining each original service data segment except the target service data segment in the plurality of original service data segments based on the relative position relationship to obtain the first service data of which the original service data is denoised.
In a possible embodiment, in the big data denoising processing method, the big data denoising processing method further includes:
and if the target service data fragment does not exist in the plurality of original service data fragments, taking the plurality of original service data fragments as the de-noised first service data.
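The removal and recombination steps above can be sketched as follows, assuming that each fragment is a string keyed by its relative position in the original service data. The function name `build_first_service_data` and the data layout are assumptions made for illustration only.

```python
from typing import Dict, Set


def build_first_service_data(
    fragments: Dict[int, str], distorted_positions: Set[int]
) -> str:
    """Combine the non-distorted fragments in their original relative order.

    fragments: fragment content keyed by relative position in the original data.
    distorted_positions: positions of the target (distorted) fragments; an
    empty set means every fragment is kept.
    """
    kept = [
        fragments[position]
        for position in sorted(fragments)
        if position not in distorted_positions
    ]
    return "".join(kept)


if __name__ == "__main__":
    parts = {2: "C", 0: "A", 1: "B"}
    print(build_first_service_data(parts, {1}))    # fragment at position 1 removed
    print(build_first_service_data(parts, set()))  # no distorted fragment found
```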
According to the big data denoising processing method, original service data are divided into a plurality of original service data fragments, whether a target service data fragment belonging to distorted data exists is determined, and then each original service data fragment except the target service data fragment is used as denoised first service data. Therefore, error data in the original service data can be effectively eliminated, the authenticity of the data in the obtained first service data is higher, the good denoising effect is ensured, and the problem of poor denoising effect of the data in the prior art is solved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a block diagram of a structure of a big data denoising processing device according to an embodiment of the present disclosure.
Fig. 2 is a schematic flow chart of steps included in the big data denoising processing method according to the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, an embodiment of the present application provides a big data denoising processing apparatus. The big data denoising processing device can comprise a memory and a processor.
In detail, the memory and the processor are electrically connected directly or indirectly to realize data transmission or interaction. For example, they may be electrically connected to each other via one or more communication buses or signal lines. The memory can have stored therein at least one software function (computer program) which can be present in the form of software or firmware. The processor may be configured to execute the executable computer program stored in the memory, so as to implement the big data denoising processing method provided in the embodiments (described later) of the present application.
Optionally, the memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a System on Chip (SoC), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Moreover, the structure shown in fig. 1 is only an illustration, and the big data denoising processing device may further include more or fewer components than those shown in fig. 1, or have a different configuration from that shown in fig. 1, for example, may include a communication unit for information interaction with other devices.
In an alternative example, the big data denoising processing device may be a server with data processing capability.
With reference to fig. 2, an embodiment of the present application further provides a big data denoising processing method, which can be applied to the big data denoising processing device. The method steps defined by the flow related to the big data denoising processing method can be realized by the big data denoising processing device.
The specific process shown in FIG. 2 will be described in detail below.
Step S110, performing data segmentation processing on the obtained original service data to obtain a plurality of original service data segments.
In this embodiment, the big data denoising processing device may perform data segmentation processing on the obtained original service data, so that a plurality of original service data segments may be obtained.
The original business data is business data of which the data volume is larger than the preset volume and which is obtained by carrying out data acquisition on the target business object. For example, the target business object may be an internet transaction behavior formed based on the internet, and the raw business data may be internet transaction record data formed based on the internet transaction behavior.
Step S120, parsing the plurality of original service data fragments to determine whether a target service data fragment belonging to the distorted data exists in the plurality of original service data fragments.
In this embodiment, after obtaining the plurality of original service data fragments based on step S110, the big data denoising processing device may perform parsing processing on the plurality of original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the plurality of original service data fragments. The distorted data is erroneous data in the original service data. For example, the distorted data may represent an internet transaction behavior that was cancelled after completion, such as an order cancelled or goods returned after purchase, or the distorted data may represent data affected by transmission errors caused by downtime, tampering, and the like during data storage.
And, if the target service data fragment exists in the plurality of original service data fragments, step S130 may be executed.
Step S130, using each original service data segment other than the target service data segment in the plurality of original service data segments as the denoised first service data.
In this embodiment, after determining that the target service data segment exists in the plurality of original service data segments based on step S120, the big data denoising processing device may use each original service data segment other than the target service data segment in the plurality of original service data segments as denoised first service data.
Based on the method, original service data is divided into a plurality of original service data fragments, whether a target service data fragment belonging to distorted data exists or not is determined, and then each original service data fragment except the target service data fragment is used as the first service data after denoising. Therefore, error data in the original service data can be effectively eliminated, the authenticity of the data in the obtained first service data is higher, the good denoising effect is ensured, and the problem of poor denoising effect of the data in the prior art is solved.
It is understood that, in an alternative example, the data segmentation process may be performed on the original service data based on the following steps:
firstly, obtaining original service data from a target database communicatively connected to the big data denoising processing device (it can be understood that the target database need not belong to another server communicatively connected to the big data denoising processing device), wherein the original service data is collected from the target service object and then sent by the corresponding data acquisition device to the target database for storage;
secondly, obtaining a predetermined target data segmentation rule, wherein the target data segmentation rule is generated based on configuration operation of the big data denoising processing equipment responding to a user;
and then, based on the target data segmentation rule, segmenting the original service data to obtain a plurality of original service data fragments, wherein the original service data fragments are combined according to a certain sequence to form the original service data.
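The segmentation steps above can be sketched as follows. This is a minimal illustration that assumes the target data segmentation rule simply cuts a list of records into a configured number of contiguous fragments of near-equal size; the function names `split_into_fragments` and `recombine` are hypothetical.

```python
from typing import List, Sequence


def split_into_fragments(records: Sequence[str], fragment_count: int) -> List[List[str]]:
    """Cut records into fragment_count contiguous fragments of near-equal size."""
    if fragment_count < 1:
        raise ValueError("fragment_count must be at least 1")
    base, extra = divmod(len(records), fragment_count)
    fragments: List[List[str]] = []
    start = 0
    for index in range(fragment_count):
        # The first `extra` fragments take one additional record each.
        size = base + (1 if index < extra else 0)
        fragments.append(list(records[start:start + size]))
        start += size
    return fragments


def recombine(fragments: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate fragments in order, which restores the original records."""
    return [record for fragment in fragments for record in fragment]


if __name__ == "__main__":
    data = list("abcdefg")
    parts = split_into_fragments(data, 3)
    print(parts)
    print(recombine(parts) == data)
```

Because the fragments are contiguous and kept in order, combining them in sequence reproduces the original service data, which matches the property stated in the last step.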
It will be appreciated that in an alternative example, the target data segmentation rule may be obtained based on the following steps:
firstly, performing content recognition processing on the original service data (for example, the content recognition processing may be performed based on some existing text recognition models or a neural network model obtained through pre-training), so as to obtain a content recognition result corresponding to the original service data, where the content recognition result is used to represent type information to which data content of the original service data belongs (for example, the type information may include information related to a transaction amount, information not related to the transaction amount, and the like);
secondly, determining a segmentation rule in a plurality of pre-constructed data segmentation rules based on the content identification result, wherein the segmentation rule is used as a target data segmentation rule corresponding to the content identification result, each data segmentation rule is generated based on a configuration operation performed by the big data denoising processing device in response to a user, each data segmentation rule is used for segmenting the original service data into a plurality of original service data segments with different quantities, and the target data segmentation rule is used for segmenting the original service data into a corresponding number of original service data segments.
It will be appreciated that in an alternative example, the target data segmentation rule may be determined among the plurality of data segmentation rules based on the following steps:
firstly, determining target importance information corresponding to type information to which data content of the original service data belongs based on the content identification result and a pre-constructed content-importance corresponding relation, wherein the content-importance corresponding relation is generated based on configuration operation performed by the big data denoising processing device in response to a user (for example, the importance degree corresponding to information related to transaction amount may be higher than the importance degree corresponding to information not related to transaction amount);
secondly, based on the target importance information and a pre-constructed importance-segmentation rule corresponding relation, determining, from a plurality of pre-constructed data segmentation rules, a target data segmentation rule for segmenting the original service data, wherein the higher the importance corresponding to the target importance information is, the larger the number of original service data segments obtained by segmenting the original service data based on the target data segmentation rule is, and the lower the importance corresponding to the target importance information is, the smaller that number is; and the target data segmentation rule is used for segmenting the original service data into a corresponding number of original service data segments of equal data volume, or for segmenting the original service data into a corresponding number of original service data fragments according to the time sequence in which the data was generated, or for segmenting the original service data into a corresponding number of original service data fragments according to the continuity of the data content.
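The two lookups described above (content type to importance, then importance to segmentation rule) can be sketched as follows. The table contents, the type names `transaction_amount` and `other`, and the fragment counts are hypothetical configuration values chosen only to illustrate that higher importance maps to a larger number of fragments.

```python
from typing import Dict

# Hypothetical user-configured content-importance corresponding relation.
CONTENT_IMPORTANCE: Dict[str, int] = {
    "transaction_amount": 2,  # information related to a transaction amount
    "other": 1,               # information not related to a transaction amount
}

# Hypothetical importance-segmentation rule corresponding relation; each rule
# is reduced here to the number of fragments it produces.
IMPORTANCE_TO_FRAGMENT_COUNT: Dict[int, int] = {1: 4, 2: 16}


def select_fragment_count(content_type: str) -> int:
    """Resolve the fragment count of the target segmentation rule."""
    importance = CONTENT_IMPORTANCE[content_type]
    return IMPORTANCE_TO_FRAGMENT_COUNT[importance]


def is_monotonic(mapping: Dict[int, int]) -> bool:
    """Check that a higher importance never yields fewer fragments."""
    counts = [mapping[level] for level in sorted(mapping)]
    return all(a <= b for a, b in zip(counts, counts[1:]))


if __name__ == "__main__":
    print(select_fragment_count("transaction_amount"))
    print(is_monotonic(IMPORTANCE_TO_FRAGMENT_COUNT))
```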
It will be appreciated that in an alternative example, the plurality of original service data fragments may be analyzed based on the following steps:
firstly, for each original service data fragment in the plurality of original service data fragments, performing content identification processing (as described above) on the original service data fragment to obtain content representation information corresponding to the original service data fragment, where the content representation information is used to represent data content of the corresponding original service data fragment (e.g., extract keywords therein);
secondly, clustering the plurality of original service data fragments based on the content characterization information corresponding to each original service data fragment to obtain at least one service data fragment set corresponding to the plurality of original service data fragments, wherein each service data fragment set comprises at least one original service data fragment, the content characterization information of any two original service data fragments belonging to the same service data fragment set is the same (for example, the content characterization information may represent identity information of the corresponding user, identity information of the corresponding device, amount information of the corresponding transaction, or time information of the corresponding transaction), and the content characterization information of any two original service data fragments belonging to different service data fragment sets is different;
thirdly, taking each service data fragment set, among the at least one service data fragment set, that includes two or more original service data fragments as a target service data fragment set;
and fourthly, for each target service data fragment set, carrying out comparative analysis on each original service data fragment in the target service data fragment set so as to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
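As an illustrative sketch of the grouping and filtering steps above, the fragments can be bucketed by a characterization key and only buckets with at least two fragments kept. The dictionary fragment representation and the `characterize` callback are assumptions made for this example; the application does not specify a data model or a clustering algorithm.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def group_fragments(
    fragments: List[dict], characterize: Callable[[dict], Hashable]
) -> Dict[Hashable, List[dict]]:
    """Group fragments so that all members of a set share the same characterization."""
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for fragment in fragments:
        groups[characterize(fragment)].append(fragment)
    return dict(groups)


def select_target_sets(groups: Dict[Hashable, List[dict]]) -> Dict[Hashable, List[dict]]:
    """Keep only sets with two or more fragments, since a single fragment cannot be compared."""
    return {key: members for key, members in groups.items() if len(members) >= 2}
```

For instance, calling `group_fragments(fragments, lambda f: f["user"])` would group by a hypothetical user identity field.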
It will be appreciated that in an alternative example, each of the original service data fragments may be comparatively analyzed based on the following steps:
firstly, for each target service data fragment set, determining the data type of each original service data fragment in the target service data fragment set based on the result of content identification processing on the original service data fragment, wherein the data type comprises quantized data and non-quantized data, and the non-quantized data comprises data with emotional colors;
secondly, for each target service data fragment set, determining a comparative analysis rule corresponding to the target service data fragment set based on the data type corresponding to the target service data fragment set, wherein target service data fragment sets of different data types correspond to different comparative analysis rules;
then, for each target service data fragment set, performing comparative analysis on each original service data fragment included in the target service data fragment set based on the comparative analysis rule corresponding to the target service data fragment set, to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
It is understood that, in an alternative example, each of the original service data fragments may be comparatively analyzed through the corresponding comparative analysis rule based on the following steps:
first, for each target service data fragment set whose corresponding data type is quantized data, performing a mean calculation based on the quantized data values of the original service data fragments in the target service data fragment set to obtain a corresponding quantized mean value, and performing a dispersion calculation based on the quantized mean value and the quantized data values of the original service data fragments to obtain a corresponding quantized data discrete degree value (it may be understood that, for non-quantized data such as data with emotional colors, the change trend of the emotional color across the original service data fragments may be analyzed first, for example, the emotional color remains unchanged throughout or gradually changes from negative to positive, and data that does not conform to the change trend is then screened out as distorted data);
secondly, judging whether the quantized data discrete degree value is larger than a predetermined quantized data discrete degree threshold, wherein the quantized data discrete degree threshold is generated based on a configuration operation performed by the big data denoising processing device in response to a user;
thirdly, if the quantized data discrete degree value is larger than the quantized data discrete degree threshold, determining that the target service data fragment set corresponding to the quantized data discrete degree value does not include a target service data fragment belonging to distorted data;
fourthly, if the quantized data discrete degree value is less than or equal to the quantized data discrete degree threshold, calculating a difference value between the quantized data value of each original service data fragment in the target service data fragment set corresponding to the quantized data discrete degree value and the quantized mean value;
fifthly, judging the magnitude relationship between the difference value and a predetermined comparison threshold (it is understood that the comparison threshold may be generated based on a configuration performed by a user according to the actual application scenario);
sixthly, if the difference value is larger than the comparison threshold value, determining the original service data fragment corresponding to the difference value as a target service data fragment belonging to distorted data;
and seventhly, if the difference is smaller than or equal to the comparison threshold, not determining the original service data fragment corresponding to the difference as a target service data fragment belonging to the distorted data.
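The seven steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the application does not name the dispersion measure, so the population standard deviation is used here, and the "difference value" is taken as an absolute difference from the mean. The function and parameter names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List


def find_distorted(
    values: List[float], dispersion_threshold: float, comparison_threshold: float
) -> List[int]:
    """Return indices of fragments treated as distorted data within one target set."""
    mean = fmean(values)                   # quantized mean value
    dispersion = pstdev(values, mu=mean)   # assumed measure of the discrete degree
    if dispersion > dispersion_threshold:
        # As described in the text: a set whose dispersion exceeds the threshold
        # is determined not to include distorted data.
        return []
    # Otherwise flag each fragment whose distance from the mean exceeds the comparison threshold.
    return [i for i, v in enumerate(values) if abs(v - mean) > comparison_threshold]
```

Both thresholds correspond to the user-configured values described in the text; suitable magnitudes depend on the application scenario.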
It is understood that, in another alternative example, each of the original service data fragments may also be comparatively analyzed based on the following steps:
firstly, for each target service data fragment set, determining a historical service data fragment set corresponding to the target service data fragment set based on the content characterization information corresponding to each original service data fragment in the target service data fragment set, wherein the historical service data fragment set is a service data fragment set that was determined to include distorted data when other original service data was analyzed historically, and the content characterization information of the historical service data fragment set is the same as that of the corresponding target service data fragment set;
secondly, aiming at each historical service data fragment set, sequencing all service data fragments included in the historical service data fragment set based on the relative position relation of all service data fragments included in the historical service data fragment set in other original service data to obtain a historical service data fragment sequence corresponding to the historical service data fragment set;
thirdly, aiming at each target service data fragment set, acquiring the fragment number of original service data fragments included in the target service data fragment set;
fourthly, aiming at each target service data fragment set, determining at least one historical service data fragment subsequence in the historical service data fragment sequence corresponding to the target service data fragment set, wherein the number of service data fragments included in each historical service data fragment subsequence is the number of fragments corresponding to the target service data fragment set, and each historical service data fragment subsequence includes service data fragments belonging to distorted data;
fifthly, for each target service data fragment set, calculating a sequence similarity between the ordered set corresponding to the target service data fragment set (that is, the set obtained by ordering the original service data fragments in the target service data fragment set according to their relative positions in the original service data) and the corresponding at least one historical service data fragment subsequence (the sequence similarity may be calculated using an existing sequence similarity calculation method, which is not described in detail herein);
sixthly, determining each target service data fragment set with the sequence similarity meeting a preset similarity threshold (it can be understood that the similarity threshold can be generated based on configuration operation performed by a user according to an actual application scene) as a target service data fragment set with target service data fragments belonging to distorted data, wherein the target service data fragments are determined based on position information of the service data fragments belonging to the distorted data in the corresponding historical service data fragment set.
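A rough Python sketch of this history-based comparison is given below. Several points are assumptions made for illustration: fragments are represented as comparable string tokens, the subsequences are taken as contiguous windows of the historical sequence, the "existing sequence similarity calculation method" is stood in for by `difflib.SequenceMatcher.ratio()`, and meeting the threshold is interpreted as being greater than or equal to it.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Set


def flag_by_history(
    target: Sequence[str],
    history: Sequence[str],
    distorted_positions: Set[int],
    threshold: float,
) -> List[int]:
    """Return positions in `target` that align with known distorted historical fragments."""
    n = len(target)  # number of fragments in the target set
    flagged: Set[int] = set()
    for start in range(len(history) - n + 1):
        # Offsets, within this window, of fragments known to be distorted.
        offsets = [p - start for p in distorted_positions if start <= p < start + n]
        if not offsets:
            continue  # only windows containing distorted data are considered
        window = list(history[start:start + n])
        similarity = SequenceMatcher(None, list(target), window, autojunk=False).ratio()
        if similarity >= threshold:
            flagged.update(offsets)
    return sorted(flagged)
```

The returned positions play the role of the position information of distorted fragments in the historical set, mapped onto the ordered target set.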
It is understood that, in an alternative example, the denoised first traffic data may be obtained based on the following steps:
firstly, if the target service data fragment exists in the plurality of original service data fragments, determining the relative position relationship of each original service data fragment except the target service data fragment in the plurality of original service data fragments in the original service data;
secondly, combining the original service data fragments other than the target service data fragment, among the plurality of original service data fragments, based on the relative position relationship, to obtain the denoised first service data corresponding to the original service data.
It is understood that, in an alternative example, if it is determined based on step S120 that the target service data segment does not exist in the original service data segments, all of the original service data segments may be used as the denoised first service data, that is, the original service data may be directly used as the denoised first service data.
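The recombination step, including the case in which no distorted fragment is found, can be sketched as follows. Representing each fragment as a `(position, data)` pair with string data is an assumption made for this example.

```python
from typing import Iterable, List, Tuple


def denoise(fragments: List[Tuple[int, str]], distorted_positions: Iterable[int]) -> str:
    """Drop distorted fragments and join the rest in their original relative order."""
    drop = set(distorted_positions)
    kept = sorted((pos, data) for pos, data in fragments if pos not in drop)
    return "".join(data for _, data in kept)
```

When `distorted_positions` is empty, the result is simply the original service data reassembled from all fragments.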
In summary, the big data denoising processing method provided by the present application divides original service data into a plurality of original service data fragments, determines whether a target service data fragment belonging to distorted data exists, and then takes the original service data fragments other than the target service data fragment as the denoised first service data. In this way, error data in the original service data can be effectively removed, the data in the resulting first service data is more authentic, a good denoising effect is ensured, and the problem of poor data denoising in the prior art is solved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A big data denoising processing method, applied to a big data denoising processing device, the method comprising:
performing data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments, wherein the original service data is service data of which the data volume is larger than a preset volume and is obtained based on data acquisition of a target service object;
analyzing the original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the original service data fragments, wherein the distorted data is error data in the original service data;
if the target service data fragment exists in the plurality of original service data fragments, taking each original service data fragment except the target service data fragment in the plurality of original service data fragments as the denoised first service data.
2. The big data denoising processing method according to claim 1, wherein the step of performing data segmentation processing on the obtained original service data to obtain a plurality of original service data fragments comprises:
obtaining original service data from a target database communicatively connected to the big data denoising processing device, wherein the original service data is acquired through data acquisition on the target service object and then sent by a corresponding data acquisition device to the target database for storage;
acquiring a predetermined target data segmentation rule, wherein the target data segmentation rule is generated based on a configuration operation performed by the big data denoising processing device in response to a user;
and segmenting the original service data based on the target data segmentation rule to obtain a plurality of original service data fragments, wherein the original service data fragments are combined according to a certain sequence to form the original service data.
3. The big data denoising processing method according to claim 2, wherein the step of obtaining a predetermined target data segmentation rule comprises:
performing content identification processing on the original service data to obtain a content identification result corresponding to the original service data, wherein the content identification result is used for representing type information to which data content of the original service data belongs;
determining a segmentation rule from a plurality of pre-constructed data segmentation rules based on the content identification result, wherein the segmentation rule is used as a target data segmentation rule corresponding to the content identification result, each data segmentation rule is generated based on a configuration operation performed by the big data denoising processing device in response to a user, each data segmentation rule is used for segmenting the original service data into a plurality of original service data segments with different numbers, and the target data segmentation rule is used for segmenting the original service data into a corresponding number of original service data segments.
4. The big data denoising processing method according to claim 3, wherein the step of determining one segmentation rule among a plurality of pre-constructed data segmentation rules based on the content recognition result as a target data segmentation rule corresponding to the content recognition result includes:
determining target importance information corresponding to type information to which the data content of the original service data belongs based on the content identification result and a pre-constructed content-importance corresponding relation, wherein the content-importance corresponding relation is generated based on configuration operation of the big data denoising processing equipment responding to a user;
determining, from a plurality of pre-constructed data segmentation rules and based on the target importance information and a pre-constructed importance-segmentation rule correspondence, a target data segmentation rule corresponding to the target importance information, wherein the higher the importance corresponding to the target importance information, the larger the number of original service data fragments obtained by segmenting the original service data based on the target data segmentation rule, and the lower the importance, the smaller that number; and the target data segmentation rule is used for segmenting the original service data into a corresponding number of original service data fragments of equal data volume, or in order of generation time, or according to the continuity of the data content.
5. The big data denoising processing method according to claim 1, wherein the step of parsing the plurality of original service data fragments to determine whether a target service data fragment belonging to distorted data exists in the plurality of original service data fragments comprises:
for each original service data fragment in the plurality of original service data fragments, performing content identification processing on the original service data fragment to obtain content representation information corresponding to the original service data fragment, wherein the content representation information is used for representing the data content of the corresponding original service data fragment;
based on content characterization information corresponding to each original service data fragment, performing clustering processing on the plurality of original service data fragments to obtain at least one service data fragment set corresponding to the plurality of original service data fragments, wherein each service data fragment set comprises at least one original service data fragment, the content characterization information of any two original service data fragments belonging to the same service data fragment set is the same, and the content characterization information of any two original service data fragments belonging to different service data fragment sets is different;
taking each service data fragment set, in which the number of original service data fragments included in the at least one service data fragment set is greater than or equal to 2, as a target service data fragment set;
and for each target service data fragment set, carrying out comparative analysis on each original service data fragment in the target service data fragment set to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
6. The big data denoising processing method according to claim 5, wherein the step of performing comparative analysis on each original service data segment in the target service data segment set to determine whether a target service data segment belonging to distorted data exists in the target service data segment set for each target service data segment set comprises:
for each target service data fragment set, determining the data type of each original service data fragment in the target service data fragment set based on the result of content identification processing on the original service data fragment, wherein the data type comprises quantized data and non-quantized data, and the non-quantized data comprises data with emotional colors;
for each target service data fragment set, determining a comparative analysis rule corresponding to the target service data fragment set based on the data type corresponding to the target service data fragment set, wherein target service data fragment sets of different data types correspond to different comparative analysis rules;
and for each target service data fragment set, performing comparative analysis on each original service data fragment included in the target service data fragment set based on the comparative analysis rule corresponding to the target service data fragment set, to determine whether a target service data fragment belonging to distorted data exists in the target service data fragment set.
7. The big data denoising processing method according to claim 6, wherein the step of performing a comparative analysis on each original service data segment included in the target service data segment set based on a comparative analysis rule corresponding to the target service data segment set for each target service data segment set to determine whether a target service data segment belonging to distorted data exists in the target service data segment set comprises:
for each target service data segment set of which the corresponding data type is the quantized data, performing mean value calculation based on quantized data values of all the original service data segments in the target service data segment set to obtain a corresponding quantized mean value, and performing discrete calculation based on the quantized mean value and the quantized data values of all the original service data segments to obtain a corresponding quantized data discrete degree value;
judging whether the quantized data discrete degree value is larger than a predetermined quantized data discrete degree threshold, wherein the quantized data discrete degree threshold is generated based on a configuration operation performed by the big data denoising processing device in response to a user;
if the quantized data discrete degree value is larger than the quantized data discrete degree threshold, determining that the target service data fragment set corresponding to the quantized data discrete degree value does not include a target service data fragment belonging to distorted data;
if the quantized data discrete degree value is smaller than or equal to the quantized data discrete degree threshold, calculating a difference value between the quantized data value of each original service data segment in a target service data segment set corresponding to the quantized data discrete degree value and the quantized mean value;
judging the magnitude relation between the difference value and a predetermined comparison threshold value;
if the difference is larger than the comparison threshold, determining the original service data fragment corresponding to the difference as a target service data fragment belonging to distorted data;
and if the difference is smaller than or equal to the comparison threshold, not determining the original service data fragment corresponding to the difference as a target service data fragment belonging to the distorted data.
8. The big data denoising processing method according to claim 5, wherein the step of performing comparative analysis on each original service data segment in the target service data segment set to determine whether a target service data segment belonging to distorted data exists in the target service data segment set for each target service data segment set comprises:
for each target service data fragment set, determining a historical service data fragment set corresponding to the target service data fragment set based on content characterization information corresponding to each original service data fragment in the target service data fragment set, wherein the historical service data fragment set is a service data fragment set which is determined by analyzing other original service data historically and comprises distorted data, and the content characterization information of the historical service data fragment set is the same as that of the corresponding target service data fragment set;
for each historical service data fragment set, based on the relative position relationship of each service data fragment included in the historical service data fragment set in the other original service data, sequencing each service data fragment included in the historical service data fragment set to obtain a historical service data fragment sequence corresponding to the historical service data fragment set;
aiming at each target service data fragment set, acquiring the fragment number of original service data fragments included in the target service data fragment set;
for each target service data fragment set, determining at least one historical service data fragment subsequence in a historical service data fragment sequence corresponding to the target service data fragment set, wherein the number of service data fragments included in each historical service data fragment subsequence is the number of fragments corresponding to the target service data fragment set, and each historical service data fragment subsequence includes service data fragments belonging to distorted data;
for each target service data fragment set, calculating the sequence similarity between the ordered set corresponding to the target service data fragment set and the corresponding at least one historical service data fragment subsequence;
determining each target service data fragment set with sequence similarity meeting a preset similarity threshold as a target service data fragment set with target service data fragments belonging to distorted data, wherein the target service data fragments are determined based on the position information of the service data fragments belonging to the distorted data in the corresponding historical service data fragment set.
9. The big data denoising method according to any one of claims 1 to 8, wherein if the target service data segment exists in the plurality of original service data segments, the step of taking each original service data segment other than the target service data segment in the plurality of original service data segments as denoised first service data comprises:
if the target service data fragment exists in the plurality of original service data fragments, determining the relative position relationship of each original service data fragment except the target service data fragment in the plurality of original service data fragments in the original service data;
and combining the original service data fragments other than the target service data fragment, among the plurality of original service data fragments, based on the relative position relationship, to obtain the denoised first service data corresponding to the original service data.
10. The big data denoising processing method according to any one of claims 1 to 8, further comprising:
and if the target service data fragment does not exist in the plurality of original service data fragments, taking the plurality of original service data fragments as the de-noised first service data.
CN202110571342.9A 2021-05-25 2021-05-25 Big data denoising processing method Withdrawn CN113239031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110571342.9A CN113239031A (en) 2021-05-25 2021-05-25 Big data denoising processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110571342.9A CN113239031A (en) 2021-05-25 2021-05-25 Big data denoising processing method

Publications (1)

Publication Number Publication Date
CN113239031A true CN113239031A (en) 2021-08-10

Family

ID=77138727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110571342.9A Withdrawn CN113239031A (en) 2021-05-25 2021-05-25 Big data denoising processing method

Country Status (1)

Country Link
CN (1) CN113239031A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115657971A (en) * 2022-12-27 2023-01-31 扬州博士创新技术转移有限公司 Cloud storage allocation method and system for enterprise digital service and cloud server

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115657971A (en) * 2022-12-27 2023-01-31 扬州博士创新技术转移有限公司 Cloud storage allocation method and system for enterprise digital service and cloud server
CN115657971B (en) * 2022-12-27 2023-03-10 扬州博士创新技术转移有限公司 Cloud storage allocation method and system for enterprise digital service and cloud server

Similar Documents

Publication Publication Date Title
CN113138982B (en) Big data cleaning method
CN114119137A (en) Risk control method and device
CN114764768A (en) Defect detection and classification method and device, electronic equipment and storage medium
CN111652315A (en) Model training method, object classification method, model training device, object classification device, electronic equipment and storage medium
CN111639607A (en) Model training method, image recognition method, model training device, image recognition device, electronic equipment and storage medium
CN112016756A (en) Data prediction method and device
CN115392937A (en) User fraud risk identification method and device, electronic equipment and storage medium
CN113793332B (en) Experimental instrument defect identification and classification method and system
CN113313217B (en) Method and system for accurately identifying dip angle characters based on robust template
CN114511037A (en) Automatic feature screening method and device, electronic equipment and storage medium
CN113239031A (en) Big data denoising processing method
CN115862638B (en) Big data safe storage method and system based on block chain
CN113032524A (en) Trademark infringement identification method, terminal device and storage medium
CN113239381A (en) Data security encryption method
CN112329810A (en) Image recognition model training method and device based on saliency detection
CN112364603A (en) Index code generation method, device, equipment and storage medium
CN116610821A (en) Knowledge graph-based enterprise risk analysis method, system and storage medium
CN114723536B (en) E-commerce platform cheap commodity selection method and system based on image big data comparison
CN111340139A (en) Method and device for judging complexity of image content
CN115562934A (en) Service flow switching method based on artificial intelligence and related equipment
CN112269879B (en) Method and equipment for analyzing middle station log based on k-means algorithm
CN113256402A (en) Risk control rule determination method and device and electronic equipment
CN113177603A (en) Training method of classification model, video classification method and related equipment
CN113328988A (en) Network security verification method and system based on big data and cloud computing
CN112861874A (en) Expert field denoising method and system based on multi-filter denoising result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210810
