WO2022048422A1 - Data processing method, apparatus, device, and storage medium - Google Patents
- Publication number
- WO2022048422A1 (application PCT/CN2021/112248)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- deduplication
- attribute
- real
- content
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the embodiments of the present application relate to data processing technologies, for example, to a data processing method, apparatus, device, and storage medium.
- The processing method is to manually analyze the data to be processed.
- the present application provides a data processing method, apparatus, device and storage medium, so as to realize massive data processing and complete the extraction operation of valid data.
- An embodiment of the present application provides a data processing method, including: receiving real-time stream data; performing deduplication processing on the real-time stream data according to a data deduplication rule to obtain deduplicated data; and performing correctness detection on the deduplicated data according to a correctness detection rule to obtain valid data, and storing the valid data.
- An embodiment of the present application further provides a data processing device. The device includes: a data acquisition module configured to receive real-time stream data; a data deduplication module configured to perform deduplication processing on the real-time stream data according to a data deduplication rule to obtain deduplicated data; a correctness verification module configured to perform correctness detection on the deduplicated data according to a correctness detection rule to obtain valid data; and a data storage module configured to store the valid data.
- Embodiments of the present application further provide an electronic device, including: one or more processors; and a storage device configured to store one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method provided by any embodiment of the present application.
- the embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the program is executed by a processor to perform the data processing method provided by any embodiment of the present application.
- FIG. 1 is a flowchart of a data processing method in Embodiment 1 of the present application.
- FIG. 2 is a flowchart of a data processing method in Embodiment 2 of the present application.
- FIG. 3 is a schematic diagram of functional modules of a data processing apparatus in Embodiment 3 of the present application.
- FIG. 4 is a schematic structural diagram of an electronic device provided in Embodiment 4 of the present application.
- FIG. 1 is a flowchart of a data processing method provided in Embodiment 1 of the present application. This embodiment can be applied to obtaining valid data from massive data. The method can be executed by a data processing device, which can be implemented in software and/or hardware and can be integrated in electronic equipment such as a computer or a server. The method includes the following steps.
- Streams are made up of a series of immutable messages of a similar type.
- a stream could be all click events on a website, all updates to a particular database, all logs generated by a service, or any other type of stream.
- Streaming data is a sequence of data that arrives in order, in large volume, rapidly, and continuously.
- streaming data can be viewed as a dynamic collection of data that grows infinitely over time.
- Real-time streaming data is streaming data that has a time attribute. From the perspective of timestamps, each piece of data in real-time streaming data is generated at a certain moment; the value of this moment can be the time at which the data source generated the data, or the time at which the data flowed in.
- Receiving real-time streaming data can mean receiving, through a high-throughput, low-latency Kafka stream processing platform, all action stream data on the Internet, such as web browsing, searching, and other user actions.
- receiving real-time streaming data may be receiving the real-time streaming data based on the Flink streaming framework.
- the advantage of this setting is that the Flink streaming framework has high performance and fast data processing speed, and the Flink streaming framework is also fault-tolerant.
- the fault-tolerant mechanism of the Flink streaming framework will reduce the performance and throughput of the streaming framework.
- S120 Perform deduplication processing on the real-time stream data according to the data deduplication rule to obtain deduplication data.
- the data deduplication rules can be configured manually, and the received real-time stream data is deduplicated by configuring the data deduplication rules.
- The deduplication operation may be to compare multiple pieces of data in the real-time streaming data with one another, determine at least two pieces of duplicate data, retain one of them, and delete the other duplicates to obtain the deduplicated data.
- the data contents in any two pieces of data may be matched one by one, and it is determined that two pieces of data with identical data contents are duplicate data.
- The deduplication operation can also compare any two pieces of data by data type: the data attributes of data of the same type are compared, and multiple pieces of data in the real-time streaming data whose data type and data attributes are consistent (that is, multiple pieces of data with the same data type and the same data content) are regarded as duplicate data.
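The one-by-one content matching described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tuple layout of a record and the function name `deduplicate_exact` are assumptions made for the example, and the in-memory `seen` set would need to be bounded (for instance by a time window) on a truly unbounded stream.

```python
def deduplicate_exact(records):
    """Keep the first occurrence of each record and drop later identical ones."""
    seen = set()
    kept = []
    for record in records:
        # Each record is a tuple: (type code, (attribute, content), ...).
        # Equal tuples therefore mean equal type, attributes, and content.
        if record not in seen:
            seen.add(record)
            kept.append(record)
    return kept


# Hypothetical sample stream: the second record repeats the first.
stream = [
    ("01", ("username", "A"), ("mobile", "B")),
    ("01", ("username", "A"), ("mobile", "B")),
    ("01", ("username", "A"), ("mobile", "C")),
]
unique = deduplicate_exact(stream)
```

Using a set of hashable tuples keeps each membership test constant-time on average, so the pass over the stream stays linear instead of comparing every pair.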
- the correctness detection rule can be pre-configured, for example, it can be formed by inputting the correctness detection code into the correctness detection rule template.
- The correctness detection rule can be a rule for a data attribute, and different data attributes correspond to different correctness detection rules. By configuring the rules corresponding to different data attributes, the data content of the corresponding data attributes of the deduplicated data is tested for correctness, and the deduplicated data that conforms to the correctness detection rules is selected as valid data.
- the correctness detection rules corresponding to each data attribute may be stored independently.
- the correctness detection rules corresponding to each data attribute may be stored in the correctness detection rule database.
- The corresponding correctness detection rule is invoked according to the data attributes included in the real-time streaming data.
- If no corresponding correctness detection rule exists, prompt information is generated to prompt the configuration of a new correctness detection rule.
- Invalid data, that is, data whose content is wrong, is deleted so that it does not occupy storage space.
- prompt information of invalid data is generated based on the invalid data, and the prompt information of invalid data is displayed or sent to the associated terminal, so that the associated terminal or the operating user can correct the invalid data.
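A per-attribute rule lookup of the kind described above might look like the sketch below. Everything named here (`check_record`, the `rules` dictionary, the sample predicates) is hypothetical, and treating an attribute without a rule as "prompt only, do not reject" is an assumption, since the source does not say how such data is classified.

```python
def check_record(record, rules):
    """Return (is_valid, messages) for one deduplicated record.

    record: dict mapping attribute name -> data content
    rules:  dict mapping attribute name -> predicate(content) -> bool
    """
    messages = []
    valid = True
    for attribute, content in record.items():
        rule = rules.get(attribute)
        if rule is None:
            # Assumption: a missing rule only triggers a prompt to configure one.
            messages.append(f"no correctness rule configured for '{attribute}'")
            continue
        if not rule(content):
            valid = False
            messages.append(f"invalid content for '{attribute}': {content!r}")
    return valid, messages


# Hypothetical rules and record for illustration.
rules = {"username": lambda s: len(s) > 0, "age": lambda s: s.isdigit()}
ok, notes = check_record({"username": "A", "age": "x1", "gender": "F"}, rules)
```

The returned messages play the role of the prompt information: they can be displayed or forwarded to an associated terminal so that the invalid data can be corrected.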
- the data deduplication rule and/or the correctness detection rule may be in an XML file format.
- the configuration rules are in XML file format. The advantage of this setting is that XML is a file format described in text form, which is readable and object-oriented.
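As an illustration of rules kept in an XML file, the sketch below parses a small rule document with Python's standard `xml.etree.ElementTree`. The source does not give a schema, so the element names (`rules`, `dedup`, `key`, `correctness`) and attribute names (`type`, `attribute`, `pattern`) are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file; the schema is an assumption, not taken from the patent.
RULES_XML = """
<rules>
  <dedup type="04">
    <key>username</key>
    <key>mobile</key>
  </dedup>
  <correctness attribute="mobile" pattern="[0-9]{11}"/>
</rules>
"""

root = ET.fromstring(RULES_XML)

# Map data type code -> list of deduplication key attributes.
dedup_keys = {
    node.get("type"): [key.text for key in node.findall("key")]
    for node in root.findall("dedup")
}

# Map attribute name -> compiled regular expression used for correctness detection.
patterns = {
    node.get("attribute"): re.compile(node.get("pattern"))
    for node in root.findall("correctness")
}
```

Keeping both rule kinds in one text file means they can be edited without touching the processing code, which matches the readability advantage claimed for XML.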
- The working principle of the data processing method is as follows: real-time stream data is received; data deduplication rules are configured and a preliminary deduplication operation filters the real-time stream data to obtain deduplicated data; correctness detection rules are then configured and the deduplicated data is checked for correctness.
- The invalid data is filtered out to obtain valid data.
- The valid data is stored in the local database and the cloud.
- The received real-time streaming data is processed twice in succession, through the data deduplication rule and the correctness detection rule, so as to remove duplicate data and erroneous data from the real-time streaming data and prevent invalid data from occupying storage space. This solves the problems of high data storage pressure and invalid data, and achieves the effect of reducing data storage pressure and improving data validity.
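The two consecutive passes can be sketched as a single loop over the incoming records. This is a simplified, single-process illustration rather than a Flink job; `process_stream`, `dedup_key`, `is_valid`, and `store` are hypothetical names, and a plain list stands in for the database.

```python
def process_stream(records, dedup_key, is_valid, store):
    """Deduplicate, then validate, then store each record of an iterable."""
    seen = set()
    for record in records:
        key = dedup_key(record)
        if key in seen:
            continue  # first pass: drop duplicate data
        seen.add(key)
        if not is_valid(record):
            continue  # second pass: drop erroneous data
        store(record)


# Hypothetical usage: a list plays the role of the data store.
database = []
process_stream(
    records=[{"id": "1", "n": "5"}, {"id": "1", "n": "5"}, {"id": "2", "n": "x"}],
    dedup_key=lambda r: tuple(sorted(r.items())),
    is_valid=lambda r: r["n"].isdigit(),
    store=database.append,
)
```

Running deduplication first means the correctness check is applied only once per distinct record, which is the ordering the description gives.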
- FIG. 2 is a flowchart of a data processing method in Embodiment 2 of the present application. Based on the above embodiments, the method includes the following steps.
- S210 Receive real-time streaming data.
- the real-time streaming data includes a plurality of data, and each data includes a data type identifier, at least one data attribute, and data content of each data attribute.
- The format of the data can be defined: the first field of the received data is the data type code, and the following fields are the data attributes in sequence. For example, the data format can be defined as [Data Type Code], [Attribute 1], [Attribute 2], and so on.
- The received real-time streaming data can be preprocessed; for example, the data type of the received real-time streaming data can be identified, and the code of the identified data type can be placed in the first field of the real-time streaming data.
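A record in the bracketed layout above could be parsed as in this sketch. The bracket-and-comma syntax is taken literally from the example format; the `Record` class and `parse_record` function are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    type_code: str   # first field: the data type code
    fields: tuple    # remaining fields, in order


def parse_record(line):
    """Split '[01], [a], [b]' into a type code and the fields that follow it."""
    parts = re.findall(r"\[([^\]]*)\]", line)
    if not parts:
        raise ValueError(f"no bracketed fields in {line!r}")
    return Record(
        type_code=parts[0].strip(),
        fields=tuple(p.strip() for p in parts[1:]),
    )


record = parse_record("[01], [username], [mobile phone number]")
```

Putting the type code first lets a consumer decide which deduplication and correctness rules apply before it looks at any other field.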
- S220 Perform deduplication processing on the real-time stream data according to the data deduplication rule to obtain deduplication data.
- Performing deduplication processing on the real-time stream data according to the data deduplication rule to obtain deduplicated data includes: when the data type identifiers of two pieces of data in the real-time stream data are the same, comparing at least one data attribute of one of the two pieces of data with at least one data attribute of the other; when at least one data attribute of the one piece of data is different from at least one data attribute of the other, determining that the two pieces of data are not duplicate data and retaining both; when all data attributes of the one piece of data are the same as all data attributes of the other, comparing the data content of at least one data attribute of the one piece of data with the data content of at least one data attribute of the other; when the data content of all data attributes of the one piece of data is the same as the data content of all data attributes of the other, determining that the two pieces of data are duplicate data and deduplicating them; and when the data content of at least one data attribute of the one piece of data is different from the data content of at least one data attribute of the other, determining that the two pieces of data are not duplicate data and retaining both.
- According to the data deduplication rule, the data attributes of any two pieces of data in the real-time streaming data that have the same data type code are compared. When the data attributes and their data content are all the same, the two compared pieces of data are determined to be duplicate data and are deduplicated, that is, either one of the two can be kept.
- When the data attributes of any two pieces of data with the same data type code are different, the two pieces of data are determined not to be duplicate data, and both are retained.
- For example, data 1 is represented as [01], [username], [mobile phone number], and data 2 is represented as [01], [username], [gender]. When data 1 is compared with data 2, the [mobile phone number] attribute of data 1 is different from the [gender] attribute of data 2, so data 1 and data 2 are different stream data.
- When the data attributes are the same, the data content of any two pieces of data with the same data type code is compared: two pieces of data with the same data content are determined to be duplicate data, and two pieces of data with different data content are determined to be different stream data.
- For example, data 3 is represented as [01], [user name A], [mobile phone number B], and data 4 is represented as [01], [user name A], [mobile phone number C]. When data 3 and data 4 are compared, the data content of the mobile phone number attribute is different, so data 3 and data 4 are determined to be different stream data.
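The three-stage comparison (type identifier, then attributes, then content) can be sketched as a pairwise predicate. The record layout and the function name `is_duplicate` are assumptions, and the content values assigned to data 1 and data 2 are placeholders, since the source lists only their attribute names.

```python
def is_duplicate(a, b):
    """a and b are (type_code, {attribute: content}) pairs."""
    type_a, attrs_a = a
    type_b, attrs_b = b
    if type_a != type_b:
        return False  # different data types are not duplicates
    if attrs_a.keys() != attrs_b.keys():
        return False  # attribute sets differ: retain both
    return attrs_a == attrs_b  # same attributes: duplicate only if all content matches


# Placeholder contents for data 1 and data 2; data 3 and data 4 follow the example.
data1 = ("01", {"username": "A", "mobile": "B"})
data2 = ("01", {"username": "A", "gender": "F"})
data3 = ("01", {"username": "A", "mobile": "B"})
data4 = ("01", {"username": "A", "mobile": "C"})
```

Here `is_duplicate(data1, data2)` is false because the attribute sets differ, and `is_duplicate(data3, data4)` is false because the mobile phone number content differs.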
- Performing deduplication processing on the real-time stream data according to the data deduplication rule to obtain deduplicated data includes: determining at least one deduplication key attribute of each piece of data in the real-time stream data; when the data type identifiers of two pieces of data in the real-time stream data are the same, comparing at least one deduplication key attribute of one of the two pieces of data with at least one deduplication key attribute of the other; when all the deduplication key attributes of the one piece of data are the same as all the deduplication key attributes of the other, comparing the data content of the at least one deduplication key attribute of the one piece of data with the data content of the at least one deduplication key attribute of the other; when the data content of at least one deduplication key attribute differs, determining that the two pieces of data are not duplicate data and retaining both; and when the data content of all the deduplication key attributes is the same, determining that the two pieces of data are duplicate data and deduplicating them.
- When deduplication processing is performed on the real-time stream data according to the data deduplication rule to obtain deduplicated data, the deduplication key attributes of the multiple pieces of data are determined. A deduplication key attribute can be one or more of the data attributes.
- the key attribute of deduplication can be set and updated according to user requirements, which is not limited.
- For example, data 4 is represented as [04], [username], [mobile phone number], [gender], [password], [ID number], and data 5 is represented as [04], [username], [mobile phone number], [age], [ID number]; [username], [mobile phone number], and [ID number] can be chosen as key attributes.
- Data deduplication is performed by selecting at least one key attribute and comparing the two pieces of data on those attributes one by one.
- When the deduplication key attributes are the same, the attribute contents of the deduplication key attributes of the two pieces of data are compared; if the attribute contents of all the deduplication key attributes are the same, the two pieces of data are determined to be duplicate data and are deduplicated.
- The deduplication process may be to keep either one of the two duplicate pieces of data.
- For example, data 4 is represented as [04], [username], [mobile phone number], [gender], [password], [ID number], and data 5 is represented as [04], [username], [mobile phone number], [age], [ID number]. When [username], [mobile phone number], and [ID number] are selected as key attributes, the data content of the key attributes of data 4 and data 5 is compared; when the data content of the key attributes is the same, data 4 and data 5 are determined to be the same stream data, and either data 4 or data 5 may be kept.
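Key-attribute deduplication can be sketched by building a lookup key from the type code and the contents of the chosen attributes. The function name, record layout, and sample contents below are hypothetical; retaining a record that lacks one of the key attributes is also an assumption, made because such a record cannot be compared.

```python
def dedup_by_key_attributes(records, key_attributes):
    """Keep the first record for each (type code, key attribute contents)."""
    seen = set()
    kept = []
    for type_code, attrs in records:
        if not all(name in attrs for name in key_attributes):
            kept.append((type_code, attrs))  # assumption: cannot compare, so retain
            continue
        key = (type_code, tuple(attrs[name] for name in key_attributes))
        if key not in seen:
            seen.add(key)
            kept.append((type_code, attrs))
    return kept


# Sample contents are placeholders; the attribute names follow data 4 and data 5.
data4 = ("04", {"username": "A", "mobile": "B", "gender": "F",
                "password": "p", "id_number": "X"})
data5 = ("04", {"username": "A", "mobile": "B", "age": "30", "id_number": "X"})
result = dedup_by_key_attributes([data4, data5], ["username", "mobile", "id_number"])
```

Because only the key attributes enter the comparison, data 4 and data 5 count as duplicates even though their non-key attributes (gender, password, age) differ, and only the first one is kept.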
- the correctness of the deduplicated data is detected according to the standard of the configured correctness detection rule to determine the validity of the data.
- The correctness detection rule can be the correctness detection rule (correctness detection standard) of at least one data attribute corresponding to the data type. Different correctness detection rules can be configured for different data attributes, and the rule for a data attribute can be set using regular expressions.
- the correctness of the deduplicated data is detected by the correctness detection rules of the data attributes to obtain valid data.
- the correctness detection rules are configured for different data attributes.
- For example, the mobile phone number attribute of a piece of data is checked for correctness with the selected correctness detection rule, and data that does not meet the conditions is excluded. When the mobile phone number is 1352, correctness detection finds that it does not meet the conditions of a correct mobile phone number, so the mobile phone number is not retained.
- When correctness detection finds that the mobile phone number meets the conditions of a correct mobile phone number, the mobile phone number is stored in the database.
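A regular-expression rule for the mobile phone number attribute might be sketched as below. The pattern (11 digits beginning with 1) is an assumed simplification chosen for the example; the source does not state which conditions define a correct mobile phone number.

```python
import re

# Assumed rule: exactly 11 digits, starting with 1. Not taken from the patent.
MOBILE_RULE = re.compile(r"1[0-9]{10}")


def valid_mobile(number):
    """Return True when the whole string satisfies the mobile number rule."""
    return MOBILE_RULE.fullmatch(number) is not None


# A list stands in for the database; only numbers that pass the rule are stored.
database = [n for n in ["1352", "13512345678"] if valid_mobile(n)]
```

`fullmatch` is used instead of `search` so that a short value such as 1352 is rejected rather than accepted as a partial match.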
- The data format is defined and the real-time stream data is received; the received real-time stream data is deduplicated through attribute comparison based on the data deduplication rule to obtain deduplicated data; correctness detection rules corresponding to the different data attributes are configured; the attributes of the deduplicated data are checked for correctness through the corresponding rules; and the real-time streaming data with correct data attributes is saved in the database.
- FIG. 3 is a schematic diagram of functional modules of a data processing apparatus in Embodiment 3 of the present application.
- The present application provides a data processing device, comprising: a data acquisition module 310 configured to receive real-time stream data; a data deduplication module 320 configured to perform deduplication processing on the real-time stream data according to data deduplication rules to obtain deduplicated data; a correctness verification module 330 configured to perform correctness detection on the deduplicated data according to the correctness detection rule to obtain valid data; and a data storage module 340 configured to store the valid data.
- the data collection module 310 is configured to receive the real-time streaming data based on the Flink streaming framework.
- the real-time streaming data includes a plurality of data, and each data includes a data type identifier, at least one data attribute, and data content of each data attribute.
- The data deduplication module 320 is configured to: when the data type identifiers of two pieces of data in the real-time streaming data are the same, compare at least one data attribute of one of the two pieces of data with at least one data attribute of the other; when at least one data attribute of the one piece of data is different from at least one data attribute of the other, determine that the two pieces of data are not duplicate data and retain both; when all data attributes of the one piece of data are the same as all data attributes of the other, compare the data content of at least one data attribute of the one piece of data with the data content of at least one data attribute of the other; when the data content of all data attributes of the one piece of data is the same as the data content of all data attributes of the other, determine that the two pieces of data are duplicate data and deduplicate them; and when the data content of at least one data attribute of the one piece of data is different from the data content of at least one data attribute of the other, determine that the two pieces of data are not duplicate data and retain both.
- The data deduplication module 320 is configured to: determine at least one deduplication key attribute of each piece of data in the real-time streaming data; when the data type identifiers of two pieces of data in the real-time streaming data are the same, compare at least one deduplication key attribute of one of the two pieces of data with at least one deduplication key attribute of the other; when all the deduplication key attributes of the one piece of data are the same as all the deduplication key attributes of the other, compare the data content of the at least one deduplication key attribute of the one piece of data with the data content of the at least one deduplication key attribute of the other; when the data content of at least one deduplication key attribute of the one piece of data is different from the data content of at least one deduplication key attribute of the other, determine that the two pieces of data are not duplicate data and retain both; and when the data content of all the deduplication key attributes of the one piece of data is the same as the data content of all the deduplication key attributes of the other, determine that the two pieces of data are duplicate data and deduplicate them.
- The correctness verification module 330 is configured to invoke, according to the data type of the deduplicated data, the correctness detection rule corresponding to that data type to determine the valid data in the deduplicated data, wherein the correctness detection rule includes the correctness detection standard of each data attribute in the at least one data attribute corresponding to the data type.
- the data deduplication rule and/or the correctness detection rule are in an XML file format.
- The data collection module receives the real-time stream data. The data deduplication module first deduplicates the received real-time stream data according to the deduplication rules to obtain deduplicated data.
- The correctness detection module then performs correctness detection on the deduplicated data according to the configured correctness detection rules to obtain valid data, and finally the data storage module stores the valid data to complete the data storage.
- the above product can execute the method provided by any embodiment of the present application, and has functional modules corresponding to the execution method.
- FIG. 4 is a schematic structural diagram of an electronic device provided in Embodiment 4 of the present application.
- the device includes a processor 410 , a memory 420 , an input device 430 and an output device 440 ; the number of processors 410 in the device may be one or more, and one processor 410 is taken as an example in FIG. 4 .
- the processor 410 , the memory 420 , the input device 430 and the output device 440 in the device may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 4 .
- The memory 420 may be configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to data processing in the embodiments of the present application (for example, the data acquisition module 310, data deduplication module 320, correctness verification module 330, and data storage module 340 in the data processing device).
- the processor 410 executes various functional applications and data processing of the device by running the software programs, instructions and modules stored in the memory 420 , that is, implements the above-mentioned data processing method.
- the memory 420 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Additionally, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some instances, memory 420 may also include memory located remotely from processor 410, which may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
- the input device 430 may be configured to receive incoming streaming data and to generate data input related to user settings and functional control of the device.
- the output device 440 may include a display device such as a display screen.
- Embodiment 5 provides a computer-readable storage medium on which a computer program is stored.
- When the program is executed by a processor, the data processing method provided by any embodiment of the present application is implemented. The method includes: receiving real-time streaming data; performing deduplication processing on the real-time stream data according to the data deduplication rule to obtain deduplicated data; and performing correctness detection on the deduplicated data according to the correctness detection rule to obtain valid data, and storing the valid data.
- the computer storage medium of the embodiments of the present application may adopt any combination of one or more computer-readable media.
- the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
- the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above.
- Examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, portable compact disc read-only memory (CD-ROM), an optical memory device, a magnetic memory device, or any suitable combination of the foregoing.
- a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a propagated data signal in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
- the program code embodied on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
- Computer program code for carrying out the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or Wide Area Network (WAN), or may be connected to an external computer (eg, use an internet service provider to connect via the internet).
- The above-mentioned modules or steps of the present application can be implemented by a general-purpose computing device, and they can be centralized on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module, or several of the modules or steps can be implemented as a single integrated circuit module.
- the present application is not limited to any particular combination of hardware and software.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (10)
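- A data processing method, comprising: receiving real-time stream data; performing deduplication on the real-time stream data according to a data deduplication rule to obtain deduplicated data; and performing correctness detection on the deduplicated data according to a correctness detection rule to obtain valid data, and storing the valid data.
- The method according to claim 1, wherein the real-time stream data comprises a plurality of data items, each data item comprising a data type identifier, at least one data attribute, and the data content of each data attribute.
- The method according to claim 2, wherein performing deduplication on the real-time stream data according to the data deduplication rule to obtain the deduplicated data comprises: where two data items in the real-time stream data have the same data type identifier, comparing the at least one data attribute of one of the two data items with the at least one data attribute of the other data item; where at least one data attribute of the one data item differs from at least one data attribute of the other data item, determining that the two data items are not duplicate data and retaining both; where all data attributes of the one data item are the same as all data attributes of the other data item, comparing the data content of the at least one data attribute of the one data item with the data content of the at least one data attribute of the other data item; where the data content of all data attributes of the one data item is the same as the data content of all data attributes of the other data item, determining that the two data items are duplicate data and deduplicating the duplicate data; and where the data content of at least one data attribute of the one data item differs from the data content of at least one data attribute of the other data item, determining that the two data items are not duplicate data and retaining both.
- The method according to claim 2, wherein performing deduplication on the real-time stream data according to the data deduplication rule to obtain the deduplicated data comprises: determining at least one deduplication key attribute of each data item in the real-time stream data, wherein the at least one deduplication key attribute of each data item is at least one of all the data attributes of that data item; where two data items in the real-time stream data have the same data type identifier, comparing the at least one deduplication key attribute of one of the two data items with the at least one deduplication key attribute of the other data item; where all deduplication key attributes of the one data item are the same as all deduplication key attributes of the other data item, comparing the data content of the at least one deduplication key attribute of the one data item with the data content of the at least one deduplication key attribute of the other data item; where the data content of at least one deduplication key attribute of the one data item differs from the data content of at least one deduplication key attribute of the other data item, determining that the two data items are not duplicate data and retaining both; and where the data content of all deduplication key attributes of the one data item is the same as the data content of all deduplication key attributes of the other data item, determining that the two data items are duplicate data and deduplicating the duplicate data.
- The method according to claim 2, wherein performing correctness detection on the deduplicated data according to the correctness detection rule to obtain the valid data comprises: according to the data type of the deduplicated data, invoking the correctness detection rule corresponding to that data type to determine the valid data in the deduplicated data, wherein the correctness detection rule includes a correctness detection criterion for each of the at least one data attribute corresponding to the data type.
- The method according to claim 1, wherein receiving the real-time stream data comprises: receiving the real-time stream data based on the Flink streaming framework.
- The method according to claim 1, wherein at least one of the data deduplication rule and the correctness detection rule is in XML file format.
- A data processing apparatus, comprising: a data collection module configured to receive real-time stream data; a data deduplication module configured to perform deduplication on the real-time stream data according to a data deduplication rule to obtain deduplicated data; a correctness verification module configured to perform correctness detection on the deduplicated data according to a correctness detection rule to obtain valid data; and a data storage module configured to store the valid data.
- An electronic device, comprising: at least one processor; and a storage apparatus configured to store at least one program which, when executed by the at least one processor, causes the at least one processor to implement the data processing method according to any one of claims 1-7.
- A computer-readable storage medium storing a computer program which, when executed by a processor, implements the data processing method according to any one of claims 1-7.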
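The deduplication and correctness-detection steps recited in the claims can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `Record` class, the function names, and the in-memory rule dictionaries are all hypothetical, standing in for the rule files (which the claims say may be XML) and for the Flink-based receiver. The sketch assumes attribute contents are strings, that a type with no key attributes configured is compared on all attributes, and that a type with no correctness rules is rejected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Set, Tuple


@dataclass
class Record:
    """One item of stream data: a type identifier plus attribute -> content."""
    type_id: str
    attributes: Dict[str, str]


# Hypothetical rule tables keyed by data type identifier.
DedupRules = Dict[str, List[str]]                          # type -> key attribute names
CheckRules = Dict[str, Dict[str, Callable[[str], bool]]]   # type -> attribute -> test


def dedup_key(record: Record, key_attrs: Optional[List[str]]) -> Tuple:
    """Build a hashable identity for a record.

    With no key attributes configured, every attribute name and content is
    compared; otherwise only the configured key attributes are compared.
    """
    names = sorted(record.attributes) if key_attrs is None else key_attrs
    return (record.type_id, tuple((n, record.attributes.get(n)) for n in names))


def deduplicate(stream: Iterable[Record], rules: DedupRules) -> Iterator[Record]:
    """Yield the first record for each identity and drop later duplicates."""
    seen: Set[Tuple] = set()
    for record in stream:
        key = dedup_key(record, rules.get(record.type_id))
        if key not in seen:
            seen.add(key)
            yield record


def is_valid(record: Record, rules: CheckRules) -> bool:
    """Apply the per-attribute checks registered for the record's type."""
    checks = rules.get(record.type_id)
    if checks is None:
        return False  # assumption: a type without rules is treated as invalid
    return all(
        name in record.attributes and check(record.attributes[name])
        for name, check in checks.items()
    )


def process(stream: Iterable[Record], dedup_rules: DedupRules,
            check_rules: CheckRules) -> List[Record]:
    """Deduplicate, then keep only records that pass the correctness checks."""
    return [r for r in deduplicate(stream, dedup_rules) if is_valid(r, check_rules)]
```

Records with different type identifiers never collide, because the type identifier is part of the identity tuple. The `seen` set grows without bound here; a real streaming job would need to bound or expire that state, which this sketch does not attempt.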
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010910153.5 | 2020-09-02 | | |
CN202010910153.5A CN112084179B (zh) | 2020-09-02 | 2020-09-02 | Data processing method, apparatus, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022048422A1 (zh) | 2022-03-10 |
Family
ID=73731836
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/112248 WO2022048422A1 (zh) | 2020-09-02 | 2021-08-12 | Data processing method, apparatus, device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112084179B (zh) |
WO (1) | WO2022048422A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112084179B (zh) * | 2020-09-02 | 2023-11-07 | 北京锐安科技有限公司 | Data processing method, apparatus, device and storage medium |
CN113064888B (zh) * | 2021-03-25 | 2021-12-07 | 珠海格力电器股份有限公司 | Data proofreading method, apparatus and system, server, and device |
CN113084388B (zh) * | 2021-03-29 | 2023-05-09 | 广州明珞装备股份有限公司 | Welding quality detection method, system, apparatus and storage medium |
CN117093416A (zh) * | 2023-08-23 | 2023-11-21 | 南方电网数字电网集团有限公司广东分公司 | Cloud-platform-based data recovery system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109451006A (zh) * | 2018-10-30 | 2019-03-08 | 北京锐安科技有限公司 | Data transmission method, apparatus, server and computer storage medium |
CN109857728A (zh) * | 2017-11-30 | 2019-06-07 | 广州明领基因科技有限公司 | Big data cleaning system for libraries |
CN111367989A (zh) * | 2020-06-01 | 2020-07-03 | 北京江融信科技有限公司 | Real-time data metric calculation system and method |
CN112084179A (zh) * | 2020-09-02 | 2020-12-15 | 北京锐安科技有限公司 | Data processing method, apparatus, device and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649676B (zh) * | 2016-12-15 | 2020-06-19 | 北京锐安科技有限公司 | Deduplication method and apparatus for files stored on HDFS |
CN106599234A (zh) * | 2016-12-20 | 2017-04-26 | 深圳飓风传媒科技有限公司 | Data visualization processing method and system based on multi-dimensional identifiers |
CN107577769A (zh) * | 2017-09-06 | 2018-01-12 | 河南腾龙信息工程有限公司 | Method and system for mining metrology data |
CN108628931B (zh) * | 2018-03-15 | 2022-08-30 | 创新先进技术有限公司 | Data-driven service method, apparatus and device |
- 2020-09-02: CN application CN202010910153.5A, patent CN112084179B (zh), status: Active
- 2021-08-12: WO application PCT/CN2021/112248, publication WO2022048422A1 (zh), status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN112084179B (zh) | 2023-11-07 |
CN112084179A (zh) | 2020-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
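WO2022048422A1 (zh) | | Data processing method, apparatus, device and storage medium |
US11755630B2 (en) | | Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes |
US20210248143A1 (en) | | Automatically executing GraphQL queries on databases |
CN111311326B (zh) | | Real-time multi-dimensional user behavior analysis method, apparatus and storage medium |
US20160042015A1 (en) | | Activity information schema discovery and schema change detection and notification |
EP3987408A1 (en) | | Regular expression generation using span highlighting alignment |
WO2022068348A1 (zh) | | Relationship graph construction method, apparatus, electronic device and storage medium |
WO2020000742A1 (zh) | | Deduplicated traffic recording method, apparatus, server and storage medium |
US9195763B2 (en) | | Identifying unknown parameter and name value pairs |
CN113761565B (zh) | | Data masking method and apparatus |
US20240195860A1 (en) | | Sample message processing method and apparatus |
CN116562255A (zh) | | Form information generation method, apparatus, electronic device and computer-readable medium |
WO2021129849A1 (zh) | | Log processing method, apparatus, device and storage medium |
WO2020263674A1 (en) | | User interface commands for regular expression generation |
WO2019134277A1 (zh) | | Data filtering method, apparatus, server and readable storage medium |
CN118585569A (zh) | | Data import method and apparatus |
CN112486967A (zh) | | Data collection method, terminal device and storage medium |
CN115510091A (zh) | | Call detail record data processing method, apparatus, electronic device and storage medium |
CN113110873A (zh) | | Method and apparatus for unifying system coding standards |
CN116955420A (zh) | | Data access method and apparatus, storage medium, and program product |
CN115203228A (zh) | | Data processing method, apparatus, medium and electronic device |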
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21863489; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 21863489; Country of ref document: EP; Kind code of ref document: A1 |
32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 22/09/2023) |
122 | EP: PCT application non-entry in European phase | Ref document number: 21863489; Country of ref document: EP; Kind code of ref document: A1 |